Some falsehoods programmers believe (via Thias)
Text processing is an old problem in computer science: most programming tutorials do some form of text manipulation – if only to send a “Hello World” back to the user. This does not mean that it is an easy problem, and there are many misconceptions about text and its representation. Here is a list – it is certainly not exhaustive and probably not original, just stuff I noticed (and that I need to get off my chest).
* Full-width characters only represent Asian characters
* There is a full-width version of the ASCII range U+0021 to U+007E, starting at U+FF01.
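A minimal sketch of that mapping in Python (the `to_fullwidth` helper is mine, not part of any library): printable ASCII moves into the Fullwidth Forms block at a fixed offset of 0xFEE0, with the space handled separately.

```python
def to_fullwidth(text: str) -> str:
    """Map printable ASCII to the Fullwidth Forms block (U+FF01-U+FF5E)."""
    out = []
    for c in text:
        if "!" <= c <= "~":
            out.append(chr(ord(c) + 0xFEE0))  # U+0021..U+007E -> U+FF01..U+FF5E
        elif c == " ":
            out.append("\u3000")              # space maps to the ideographic space
        else:
            out.append(c)
    return "".join(out)

print(to_fullwidth("Hello, World!"))  # Ｈｅｌｌｏ，　Ｗｏｒｌｄ！
```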
* Non-control characters are either half-width or full-width
* The character ﷽ (Arabic Ligature Bismillah ar-Rahman ar-Raheem) (U+FDFD) is very wide.
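One way to see that width is not a binary property: Unicode's East_Asian_Width property (exposed by Python's unicodedata module) has six values, and a visually very wide character like U+FDFD is not even classified as wide.

```python
import unicodedata

# Six width classes: F (fullwidth), H (halfwidth), W (wide),
# Na (narrow), A (ambiguous), N (neutral)
for ch in ("a", "ａ", "ｱ", "日", "Ω", "\ufdfd"):
    print(f"U+{ord(ch):04X} {ch} -> {unicodedata.east_asian_width(ch)}")
# 'a' is Na, 'ａ' is F, 'ｱ' is H, '日' is W, 'Ω' is A, and U+FDFD is N
```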
* A terminal can display Unicode text
* Terminals are typically character oriented, so even if a terminal is set up in UTF-8 mode and all fonts are installed, scripts that rely on ligatures (like Arabic) will not display properly.
* UTF-8 can be safely manipulated at the byte level
* Any substring operation that cuts a multi-byte character in the middle will yield invalid UTF-8.
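A two-line demonstration in Python: slicing the UTF-8 bytes of a string in the middle of a multi-byte character produces data that no longer decodes.

```python
data = "naïve".encode("utf-8")  # b'na\xc3\xafve' -- the 'ï' takes two bytes
chunk = data[:3]                # cuts between the two bytes of 'ï'

try:
    chunk.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0xc3 in position 2: unexpected end of data
```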
* Languages like Java have Unicode support
* Java and JavaScript were designed around the now-abandoned UCS-2 encoding, which assumed that all Unicode characters would fit in 16 bits, i.e. the U+0000 – U+FFFF range. They don’t. Java and JavaScript now use the UTF-16 encoding, which means their character type might represent a fraction of a character (one half of a surrogate pair).
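Python strings count code points, which makes the mismatch easy to show: a character outside the Basic Multilingual Plane is one code point but two UTF-16 code units, and the latter is what Java's String.length() reports.

```python
s = "\U0001F600"          # 😀, U+1F600, outside the BMP
units = s.encode("utf-16-be")
print(len(s))             # 1 code point
print(len(units) // 2)    # 2 UTF-16 code units
print(units.hex(" ", 2))  # d83d de00 -- the surrogate pair
```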
* Splitting at code-point boundary is safe
* In general, cutting between a combining character and the character it combines with (say, an accent) will not yield the expected result. In UTF-16, you might cut a surrogate pair. Certain ranges, like the emoji flag range, are only meaningful as character pairs.
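Both failure modes in a few lines of Python:

```python
# Two code points, one user-perceived character: 'e' plus a combining acute accent
cafe = "cafe\u0301"             # café
print(len(cafe), cafe[:4])      # 5 cafe -- the accent is silently dropped

# Flag emoji are pairs of regional indicator symbols
flag = "\U0001F1EB\U0001F1F7"   # 🇫🇷 (regional indicators F + R)
print(flag[:1])                 # a lone indicator, no longer a flag
```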
* A Unicode code-point can only be represented in one way in UTF-8
* In theory this is true, but some systems wrongly convert UTF-16 into UTF-8 by encoding the surrogate characters directly, instead of decoding the pair and encoding the resulting code-point as UTF-8 – a variant known as CESU-8. Some UTF-8 decoders accept this mistake. They might also decode UTF-8 encoded Windows-1252 code-points into their respective characters.
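Python makes the mistake reproducible, since the surrogatepass error handler will happily encode lone surrogates the way a broken converter would:

```python
# Proper UTF-8 encodes U+1D11E (𝄞) as four bytes...
print("\U0001D11E".encode("utf-8").hex(" "))  # f0 9d 84 9e

# ...but encoding the UTF-16 surrogates directly (the CESU-8 mistake)
# yields six bytes that a strict decoder must reject.
print("\ud834\udd1e".encode("utf-8", "surrogatepass").hex(" "))  # ed a0 b4 ed b4 9e
```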
* Unicode does not handle formatting
* Even if you ignore ANSI escape sequences (which allow underlining and bolding of text) and variation selectors (which select the colour or black-and-white version of emojis), there are ranges that duplicate ASCII with various style attributes (italic, bold). I wrote markless, a tool that renders Markdown data using just such Unicode tricks.
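A sketch of the kind of trick involved (the `to_bold` helper is hypothetical, not how markless is actually implemented): the Mathematical Alphanumeric Symbols block duplicates the Latin letters in several styles, with bold capitals starting at U+1D400.

```python
BOLD_A, BOLD_a = 0x1D400, 0x1D41A  # MATHEMATICAL BOLD CAPITAL/SMALL A

def to_bold(text: str) -> str:
    out = []
    for c in text:
        if "A" <= c <= "Z":
            out.append(chr(BOLD_A + ord(c) - ord("A")))
        elif "a" <= c <= "z":
            out.append(chr(BOLD_a + ord(c) - ord("a")))
        else:
            out.append(c)
    return "".join(out)

print(to_bold("bold text"))  # 𝐛𝐨𝐥𝐝 𝐭𝐞𝐱𝐭
```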
* There is a standard to encode Unicode characters in ASCII
* There are many: HTML has three (named entities, hex-encoded, decimal-encoded), the C programming language has one (based on UTF-8), and C99 has another. URLs and CSS use different schemes, and so do e-mails.
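The same character rendered in a few of these schemes, using only what the Python standard library provides:

```python
from html.entities import codepoint2name
from urllib.parse import quote

c = "é"                               # U+00E9
print(f"&{codepoint2name[ord(c)]};")  # HTML named entity: &eacute;
print(f"&#x{ord(c):X};")              # HTML hexadecimal entity: &#xE9;
print(f"&#{ord(c)};")                 # HTML decimal entity: &#233;
print(quote(c))                       # URL encoding of the UTF-8 bytes: %C3%A9
print(f"\\u{ord(c):04X}")             # C99-style universal character name: \u00E9
```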
* ASCII and its variants can be mapped into Unicode
* Some obscure variants, like PETSCII, contain graphical characters that are not mapped (yet).
* Unicode data is meant to be represented in black and white
* Emoji characters are typically in colour.
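The choice between the two renderings can even be requested in the text itself, via variation selectors; whether the request is honoured depends on the font and terminal.

```python
umbrella = "\u2602"         # ☂ UMBRELLA
print(umbrella + "\ufe0e")  # U+FE0E requests text (monochrome) presentation
print(umbrella + "\ufe0f")  # U+FE0F requests emoji (colour) presentation
```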
* Unicode data is meant to be represented visually