==== UTF-8 ====

http://www.cl.cam.ac.uk/~mgk25/unicode.html

==== Thumbnail sketch of UTF-8 ====

from: http://www.pemberley.com/janeinfo/latin1.html

In UTF-8, each 16-bit Unicode character is encoded as a sequence of one, two, or three 8-bit bytes, depending on the value of the character. The following table shows the format of these byte sequences. The "free bits", shown as x's, are concatenated in the order shown and read from most significant to least significant.

Binary format of bytes in sequence      Number of       Maximum expressible
1st byte   2nd byte   3rd byte          free bits       Unicode value
0xxxxxxx                                7               007F hex (127)
110xxxxx   10xxxxxx                     (5+6)=11        07FF hex (2047)
1110xxxx   10xxxxxx   10xxxxxx          (4+6+6)=16      FFFF hex (65535)

The value of each individual byte indicates its UTF-8 function, as follows:

  00 to 7F hex (0 to 127):    first and only byte of a sequence.
  80 to BF hex (128 to 191):  continuing byte in a multi-byte sequence.
  C2 to DF hex (194 to 223):  first byte of a two-byte sequence.
  E0 to EF hex (224 to 239):  first byte of a three-byte sequence.

Other byte values are either not used when encoding 16-bit Unicode characters (i.e. F0 to F4 hex, which begin four-byte sequences for values above FFFF hex), or are not part of any well-formed Unicode UTF-8 sequence (i.e. C0, C1, and F5 to FF hex); see the links to UTF-8 standards documents below for further details.
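
The table and byte ranges above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names utf8_encode_bmp and classify_byte are hypothetical (they are not part of any library), and the encoder deliberately handles only 16-bit values, as in the table. In real code, Python's built-in str.encode("utf-8") does this job.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode one 16-bit value (0..0xFFFF) using the three formats in the table.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles 16-bit values")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogate values are not allowed in well-formed UTF-8.
        raise ValueError("surrogate values cannot be encoded")

    if code_point <= 0x7F:
        # 0xxxxxxx: 7 free bits
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx: 5 + 6 free bits
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 free bits
    return bytes([
        0xE0 | (code_point >> 12),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


def classify_byte(b: int) -> str:
    """Return the UTF-8 role of a single byte value, per the ranges above."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "single"        # first and only byte of a sequence
    if b <= 0xBF:
        return "continuation"  # continuing byte in a multi-byte sequence
    if 0xC2 <= b <= 0xDF:
        return "lead2"         # first byte of a two-byte sequence
    if 0xE0 <= b <= 0xEF:
        return "lead3"         # first byte of a three-byte sequence
    if 0xF0 <= b <= 0xF4:
        return "lead4"         # not used for 16-bit characters
    return "invalid"           # C0, C1, and F5 to FF


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC):
        encoded = utf8_encode_bmp(cp)
        roles = [classify_byte(b) for b in encoded]
        print(f"U+{cp:04X} -> {encoded.hex(' ').upper()} {roles}")
```

For example, 00E9 hex needs more than 7 bits, so it takes the two-byte format and becomes C3 A9; 20AC hex needs more than 11 bits, so it takes the three-byte format and becomes E2 82 AC.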