UTF-8: Difference between revisions

From Hydrogenaudio Knowledgebase

Revision as of 15:59, 27 April 2005

UTF-8

UTF-8 stands for UCS Transformation Format, 8-bit. It is an upward-compatible and portable way to encode the characters of all languages on this planet.

Of the control characters (range 0x00...0x1F, plus 0x7F), the following are allowed:

  • 0x0A: Line feed (Unix Way)
  • 0x0C: Form feed (with intrinsic line feed)


UTF-8 has the following properties:

  • UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility).
  • This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
  • All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00 to 0x7F) can appear as part of any other character.
  • The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
  • UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are at most three bytes long.
  • The sorting order of big-endian UCS-4 byte strings is preserved.
  • The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
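
The byte ranges listed above are enough to classify any single byte without looking at its neighbours. The following is a minimal sketch in C under that assumption; the function names utf8_sequence_length and utf8_resync are hypothetical and chosen only for illustration, and the code follows the six-byte scheme described on this page rather than any particular library.

```c
#include <stddef.h>

/* Hypothetical helper: length in bytes of the sequence that starts with
   `lead`, according to the six-byte scheme described above.
   Returns 0 if `lead` cannot start a sequence, i.e. it is a continuation
   byte (0x80..0xBF) or one of the unused bytes 0xFE and 0xFF. */
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;  /* 0xxxxxxx: plain ASCII             */
    if (lead < 0xC0) return 0;  /* 10xxxxxx: continuation byte       */
    if (lead < 0xE0) return 2;  /* 110xxxxx                          */
    if (lead < 0xF0) return 3;  /* 1110xxxx                          */
    if (lead < 0xF8) return 4;  /* 11110xxx                          */
    if (lead < 0xFC) return 5;  /* 111110xx                          */
    if (lead < 0xFE) return 6;  /* 1111110x                          */
    return 0;                   /* 0xFE, 0xFF: never used            */
}

/* Hypothetical helper: resynchronize after landing in the middle of a
   sequence by skipping continuation bytes until the next byte that can
   start a character (or until `end`). */
const unsigned char *utf8_resync(const unsigned char *p,
                                 const unsigned char *end)
{
    while (p < end && (*p & 0xC0) == 0x80)
        p++;
    return p;
}
```

Because every continuation byte matches the bit pattern 10xxxxxx, a reader that starts at an arbitrary offset only has to skip such bytes to find the next character boundary.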

Used in APE Tag Items


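Weblinks:

  • UTF-8 sampler (Web browser test): http://www.columbia.edu/kermit/utf8.html
  • Codepages used by OS/2 and Windows: http://www.microsoft.com/typography/unicode/cscp.htm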


Windows API

MultiByteToWideChar - convert a multibyte string to a wide character string

   int
   MultiByteToWideChar ( UINT    CodePage,        // code page
                         DWORD   dwFlags,         // character-type options
                         LPCSTR  lpMultiByteStr,  // address of string to map
                         int     cchMultiByte,    // number of bytes in string
                         LPWSTR  lpWideCharStr,   // address of wide-character buffer
                         int     cchWideChar );   // size of buffer

Convert the string from the current locale's multibyte encoding to Unicode (WideChar), then encode the result to UTF-8 using the simple generic scheme below. The behaviour of the function depends on the locale settings of LOCAL_MACHINE/CURRENT_USER.


ISO API

mbstowcs - convert a multibyte string to a wide character string

mbsrtowcs - convert a multibyte string to a wide character string (restartable)

   #include <stdlib.h>
   size_t  mbstowcs ( wchar_t* dst, const char* src, size_t maxlen );
   #include <wchar.h>
   size_t  mbsrtowcs  ( wchar_t* dst, const char** src,             size_t maxlen, mbstate_t* ps );
   size_t  mbsnrtowcs ( wchar_t* dst, const char** src, size_t nms, size_t maxlen, mbstate_t* ps );


The interface is very similar to the Windows API, but the names are more cryptic and therefore more difficult to understand.

Convert the string from the current locale's multibyte encoding to Unicode (wide character), then encode the result to UTF-8 using the simple generic scheme below. The behaviour of the function depends on the locale settings of the environment variable $LC_CTYPE.
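
As an illustration of the first step (locale multibyte string to wide characters), here is a minimal sketch built on mbstowcs. The wrapper name locale_to_wide is hypothetical. The sketch assumes the caller has already selected the desired locale, for example with setlocale(LC_CTYPE, ""), and that wchar_t holds Unicode values on the platform, which ISO C does not guarantee.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper around mbstowcs.
   Converts the multibyte string `src` (interpreted in the current
   LC_CTYPE locale) into `dst`, which has room for `dst_len` wide
   characters including the terminator.
   Returns the number of wide characters written, not counting the
   terminator, or (size_t)-1 if `src` contains an invalid multibyte
   sequence or `dst_len` is 0. Input that does not fit is truncated. */
size_t locale_to_wide(const char *src, wchar_t *dst, size_t dst_len)
{
    size_t n;

    if (dst_len == 0)
        return (size_t)-1;

    /* Leave one slot free so the result can always be terminated. */
    n = mbstowcs(dst, src, dst_len - 1);
    if (n == (size_t)-1)
        return n;

    /* mbstowcs does not terminate the output when it fills the buffer. */
    dst[n] = L'\0';
    return n;
}
```

A program starts in the "C" locale, so without a prior setlocale call only plain ASCII input is portable here.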


Conversion scheme

Unicode Glyph             Binary Representation of Glyph in Unicode  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
U-00000000... U-0000007F  00000000 00000000 00000000 0xxxxxxx        0xxxxxxx
U-00000080... U-000007FF  00000000 00000000 00000xxx xxyyyyyy        110xxxxx  10yyyyyy
U-00000800... U-0000FFFF  00000000 00000000 xxxxyyyy yyzzzzzz        1110xxxx  10yyyyyy  10zzzzzz
U-00010000... U-001FFFFF  00000000 000xxxyy yyyyzzzz zzuuuuuu        11110xxx  10yyyyyy  10zzzzzz  10uuuuuu
U-00200000... U-03FFFFFF  000000xx yyyyyyzz zzzzuuuu uuvvvvvv        111110xx  10yyyyyy  10zzzzzz  10uuuuuu  10vvvvvv
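
The rows of the conversion scheme translate directly into shifts and masks. The following is a minimal sketch in C, not a reference implementation: the function name utf8_encode is hypothetical, and it covers only the five ranges listed above, returning 0 for anything larger. Note that the current UTF-8 standard (RFC 3629) is stricter and allows only values up to U+10FFFF, i.e. at most four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder following the conversion scheme above.
   Writes the UTF-8 form of `cp` into `out` (which needs room for at
   least 5 bytes) and returns the number of bytes written, or 0 if
   `cp` lies outside the ranges listed in the table. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10yyyyyy */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10yyyyyy 10zzzzzz */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x1FFFFF) {                   /* 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    if (cp <= 0x3FFFFFF) {                  /* 111110xx + 4 continuation bytes */
        out[0] = (unsigned char)(0xF8 | (cp >> 24));
        out[1] = (unsigned char)(0x80 | ((cp >> 18) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[3] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[4] = (unsigned char)(0x80 | (cp & 0x3F));
        return 5;
    }
    return 0;                               /* not covered by the table */
}
```

For example, U+00E9 falls into the second row and comes out as the two bytes 0xC3 0xA9, while U+20AC falls into the third row and comes out as 0xE2 0x82 0xAC.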