UTF-8 is one of the most momentous and underappreciated / relatively unknown achievements in software.
A sketch on a diner placemat has led to every person in the world being able to communicate written language digitally using a common software stack. Thanks to Ken Thompson and Rob Pike we have avoided the deeply siloed and incompatible world that code pages, wide chars and other insufficient encoding schemes were guiding us towards.
UTF-8 is a great system, but all those dreadful code pages existed because they were under different technical constraints.
Windows machines in the 1990s had several megabytes of main memory, and people could barely get them to support one East Asian language at a time, never mind several. No sane person would propose using three bytes per Korean character when two would do - that would mean your word processor would die after 50 pages of a document, while your competitor's could handle 75.
And even if you did have UTF-8, you wouldn't see those Thai characters anyway, because who would even have those fonts when your OS had to fit on a handful of stacked floppies.
It took years before UTF-8 made technical sense for most users.
Why would you keep 50 pages of a document in main memory at once? It’s not like 75 is some magic limit that’s enough and 50 isn’t. No, if you stood any chance of getting anywhere near such a limit, you would certainly design your data structures so you don’t need all the content in memory at once, and then the difference is not so ferociously significant.
It wasn’t free and limitless, but it wasn’t scarce either—you probably had 100–1000× more disk space than RAM, which is close enough to unlimited for most text purposes. (https://en.wikipedia.org/wiki/History_of_hard_disk_drives suggests 1GB was typical in the mid-1990s.)
Consider also that at the very time we’re talking about (the early 1990s) the industry was shifting away from largely 8-bit code pages to 16-bit UCS-2, which is an even more extreme cost compared to UTF-8: it doubles space requirements for most people, rather than the mere 50% increase yongjik speaks of for certain languages. Yet this change was made anyway (more’s the pity).
Concerning the scarcity of bytes, yongjik’s point would certainly be valid if it referred to the 1970s, was probably valid of the 1980s, but is not valid of the 1990s. (But the point about keeping the full document in RAM is an unrealistic strawman.)
No need to exaggerate the scarcity either. Documents were often produced in the Word 97 format, which can easily be an order of magnitude larger than the underlying text. If the number of bytes really was that important, any one of a number of more efficient formats could have been chosen.
It's great as a global character set and really enabled the world to move ahead at just the right time when the web started to connect us all together.
But the whole emoji modifier (e.g. guy + heart + lips + girl = one kissing couple character) thing is a disaster. Too many rules made up on the fly that make building an accurate parser a nightmare. It should have either been specified strictly and consistently as part of the standard, or left out for a future standard to implement, with separate code points used for the combinations that were really necessary.
This complexity is also something that has led to multiple vulnerabilities especially on mobiles.
Emoji are not any more complex than some natural languages that people need to encode in Unicode. It's a weird, more modern "language" than most, but it's still a human language with human quirks. It's a lovely privilege that English speakers generally don't have any language encoding problems to deal with beyond emoji, but it's a brilliant bit of Unicode today that emoji has become a common denominator demanded by users, one that stress tests the incorrect assumptions English speakers want to make about language encoding.
I have nothing against emoji. Just the combination mechanism.
And yeah this problem only came later after UTF-8 was already invented. It's a smart solution also because at the time a lot of detractors of Unicode were opposed to the extra data requirements of the long form representations.
This post is a really good illustration of UTF-8. Very clear! The key brilliance of the design is not only the embedding of ASCII in UTF-8, but the fact that nothing in ASCII can appear anywhere else in UTF-8, and more generally that no UTF-8 character can appear as a substring of another character’s encoding. That means that all the byte-oriented libc string functions just work. I wrote this up recently in a StackOverflow answer with some examples: https://stackoverflow.com/a/69756619/659248
> nothing in ASCII can appear anywhere else in UTF-8, and more generally that no UTF-8 character can appear as a substring of another character’s encoding
How is that defined and enforced? Very narrowly, it seems to me:
* ASCII Hyphen-minus (U+002D) has similar functions and appearance to Small Hyphen-minus (U+FE63), Fullwidth Hyphen-minus (U+FF0D), Hyphen (U+2010), Minus Sign (U+2212), Heavy Minus Sign (U+2796), En dash (U+2013), Em Dash (U+2014), Small Em Dash (U+FE58), Horizontal Bar (U+2015), Figure dash (U+2012). (I'm probably missing a few!)
* There are separate delta symbols for Greek and for mathematics (sorry, no more time for looking up code points).
* Very many other characters have appearances so similar that nobody could tell them apart.
So many characters have, to users and to almost anyone not looking at the actual code points, identical functions and appearances.
OP (and the article) is talking about the encoding of UTF-8, not Unicode in general.
ASCII is itself valid UTF-8, because ASCII is a subset of UTF-8. But a multi-byte encoded code point in UTF-8 cannot be confused with ASCII, because the highest bit is set in all of its octets.
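For instance, a quick check of both properties in Python (a small sketch, standard library only):

```python
text = "naïve café"
data = text.encode("utf-8")

# Every byte of a multi-byte UTF-8 sequence has its high bit set,
# so none of them can ever be mistaken for an ASCII byte.
for ch in "ïé":
    assert all(b >= 0x80 for b in ch.encode("utf-8"))

# Byte-oriented search therefore just works: 'é' is found exactly once
# and never as a false match inside another character's encoding.
assert data.count("é".encode("utf-8")) == 1
assert data.find(b"cafe") == -1  # plain ASCII 'e' does not match 'é'
```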
It really is wonderful. I was forced to wrap my head around it in the past year while writing a tree-sitter grammar for a language that supports Unicode. Calculating column position gets a whole lot trickier when the preceding codepoints are of variable byte-width!
It's one of those rabbit holes where you can see people whose entire career is wrapped up in incredibly tiny details like what number maps to what symbol - and it can get real political!
What do you mean by that? Unicode does have double-wide characters and, I discovered recently, some other characters called 'wide' that at least are wider than one column but smaller than two in at least some monospaced fonts. Try:
* Small Hyphen-minus (U+FE63) "﹣": Seems to be >1 and <2 columns in at least some monospaced fonts.
The code unit of UTF-8 is the byte, so you only deal with bytes and byte sequences. Thus there are no issues of byte order and such, which you have to deal with in UTF-16 and UTF-32 (as their code units are respectively 2 and 4 bytes).
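A quick illustration in Python, for example:

```python
# The same character has two possible byte sequences in UTF-16/UTF-32
# depending on endianness, but exactly one in UTF-8.
ch = "é"
print(ch.encode("utf-8"))      # b'\xc3\xa9'            no BOM, no endianness
print(ch.encode("utf-16-le"))  # b'\xe9\x00'
print(ch.encode("utf-16-be"))  # b'\x00\xe9'
print(ch.encode("utf-32-le"))  # b'\xe9\x00\x00\x00'
print(ch.encode("utf-32-be"))  # b'\x00\x00\x00\xe9'
```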
UTF-8 is one of the most brilliant things I've ever seen. I only wish it had been invented and caught on before so many influential bodies started using UCS-2 instead.
Like anything new, people had a hard time with it at the beginning.
I remember that I got a home assignment in an interview for a PHP job. The person evaluating my code said I should not have used UTF-8, which causes "compatibility problems". At the time, I didn't know better, and I answered that no, it was explicitly created to solve compatibility problems, and that they just didn't understand how to deal with encoding properly.
Needless to say, I didn't get the job :)
Same with Python 2 code. So many people, when migrating to Python 3, suddenly thought Python 3's encoding management was broken, since it was raising so many UnicodeDecodeErrors.
Only much later did people realize the huge number of programs that couldn't deal with non-ASCII characters in file paths, HTML attributes or user names, because they just implicitly assumed ASCII. "My code used to work fine," they said. But it worked fine on their machine, set to an English locale, tested only using ASCII plain-text files in their ASCII-named directories with their ASCII last name.
That's in general a problem with dynamic languages with weak type systems: "your code runs without crashing" is really, really not the same as "your code works". How do people even manage production Python? A bug could be lurking anywhere, undetected until it's actually run. Whereas in a compiled language with a strong type system, "your code compiles" is much closer to "your code is correct".
A type system can refuse to turn a `bytes` into a `utf8str` until it's been appropriately parsed.
(It doesn't even need to be a very good or strongly-enforced type system - Go makes it dangerously easy to convert between `[]byte` and `string` by other-type-system standards, and yet everything works pretty well. It's enough to hitch your thinking and make you realize you need another step.)
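A minimal Python sketch of that idea (the `Utf8Str` name and `parse_utf8` helper are just illustrative, not an existing API):

```python
from typing import NewType

Utf8Str = NewType("Utf8Str", str)

def parse_utf8(raw: bytes) -> Utf8Str:
    # Raises UnicodeDecodeError here, at the boundary,
    # instead of deep inside the program where bad data is harder to trace.
    return Utf8Str(raw.decode("utf-8"))

greeting = parse_utf8(b"hyv\xc3\xa4\xc3\xa4 y\xc3\xb6t\xc3\xa4")  # ok: "hyvää yötä"
# parse_utf8(b"\xff\xfe broken")  # would raise UnicodeDecodeError immediately
```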
But this is not a matter of UTF-8 in the code, rather a matter of UTF-8 in the input or output. How does compiling a program ensure that it is robust on a range of inputs?
> How does compiling a program ensure that it is robust on a range of inputs?
This is quite literally the job of a type system: to impose a semantic interpretation on sequences of "raw bits" and let you specify legal (and only legal) operations in terms of the semantic interpretation rather than the bits.
There are a number of mitigations, so those kinds of bugs are quite rare. In our large code base, about 98% of the bugs we find are of the "we need to handle another case" variety. Pyflakes quickly finds typos, which eliminates most of the rest.
This is the difference between people who embrace static typing and everyone else. A static type lover hears that 98% of your bugs are of the "we need to handle another case" variety and says, "well, that means you could have gotten rid of 98% of your bugs with better typing".
No, what I mean is that an additional key comes in (with the JSON or similar hash) and we now need to do something with it, or something different than we thought we were supposed to. Typing is not going to fix that, because the full set of cases was unknown at development time.
How is it anything but the truth? The express purpose of static analysis, like a type system, is to catch bugs before running your code. That pretty clearly means that code that successfully compiles is closer to being correct than code that doesn't.
The parser assures your code is grammatically correct; the type system assures your code is semantically consistent, which is usually a much stronger guarantee, and by most practical measures will be closer - often much closer, and for total functions on total types, sometimes all the way - to "logically correct".
Python 3 encoding management was broken, because it tried to impose Unicode semantics on things that were actually byte streams. For anyone actually correctly handling encodings in Python 2 it was awful because suddenly the language runtime was hiding half the data you needed.
Nowadays, passing bytes to any os function returns bytes objects, not str. You'll get str if you pass str objects though, and they will be using UTF-8 surrogate escaping.
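For example, on CPython on a POSIX system with a UTF-8 locale (a sketch; the exact filesystem encoding and error handler depend on the platform):

```python
import os

entries_b = os.listdir(b".")   # list of bytes objects
entries_s = os.listdir(".")    # list of str objects

# A file name that isn't valid UTF-8 still round-trips losslessly through str,
# thanks to the surrogateescape error handler:
raw = b"caf\xe9.txt"               # latin-1 'é', invalid as UTF-8
name = os.fsdecode(raw)            # 'caf\udce9.txt'
assert os.fsencode(name) == raw    # the original bytes come back exactly
```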
And nowadays, lots of people left the Python ecosystem completely because the 3 upgrade was broken and raising so many UnicodeDecodeErrors for so long. I'm glad it's fixed but it cost too much.
(And it had nothing to do with UTF-8. Actually, it's at least partially caused by the CPython developers avoiding UTF-8 for poor reasons.)
I'd argue that the people correctly handling encodings in Python 2 were vastly outnumbered by the people that weren't, but were getting away with it because the code didn't outright crash. Now in Python 3 it crashes, which is a pain in the short term but better in the long run.
I don't really think it's better - certainly it was not better until many years after release when they started admitting their big mistakes.
But even today, codepoint indexing is too easy and doesn't crash so lots of code is still subtly wrong. Memory usage of a string grows 3x if you add a single emoji. Most libraries are still agnostic about whether they take bytes or str so you still get exceptions thrown with no easy solution. (The growing popularity of type hints is fixing this, but that's not really related to Python 3.)
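To make the "subtly wrong" part concrete, a small Python example (the memory figures are CPython-specific, per PEP 393):

```python
import sys

# Code-point indexing happily splits things the user sees as one character:
flag = "\U0001F1FA\U0001F1F8"      # 🇺🇸 as two regional-indicator code points
print(len(flag))                    # 2
print(flag[0])                      # 🇺 -- "half" a flag, no error raised

# And a single astral character widens per-character storage for the whole string:
print(sys.getsizeof("a" * 1000))          # roughly 1 byte per character
print(sys.getsizeof("a" * 1000 + "😀"))   # roughly 4 bytes per character
```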
My Slack name at work is "Τĥιs ñåmè įß ą váĺîδ POSIX paτĥ". My hope is that it serves as an amusing reminder to consider things like spaces and non-ASCII characters.
One of my friends is these days a colleague, with an utterly ordinary English name but his identity management data is full of spurious accents to check APIs do the Right Thing™.
I was delighted recently to stumble on a history of modern women course HIST1158 named Liberté Egalité Beyoncé and I immediately thought two things: 1. Why are our Computer Science courses given unimaginative names? and 2. What a useful test input, I bet some of our systems don't work correctly for this input even though an acute accent is hardly a bleeding edge feature.
I haven't been able to interest any Computer Science professors in fun names for their courses, but I was able in my test environment to name a COMP series course "Untitled Course Name" with a description explaining that "It is a lovely day in the village and there are only two hard problems in Computer Science".
Absolutely. At least it’s well supported now in very old languages (like C) and very new languages (like Rust). But Java, Javascript, C# and others will probably be stuck using UCS-2 forever.
There's actually a proposal with a decent amount of support to add utf-8 strings to C#. Probably won't be added to the language for another 3 or 4 years (if ever) but it's not outside the realm of possibility.
>What is stopping [...] Java, JS, and C# files in UTF-8?
The output of files on disk can be UTF-8. The continued use of UCS-2 (later revised to UTF-16) is happening in the runtime because things like the Win32 API, which C# uses, are UCS-2. The internal raw memory layout of strings in Win32 is UCS-2.
Code page 65001 has existed for a long time now, but it was discouraged because there were a lot of corner cases that didn't work. Did they finally get all the kinks out of it?
When Windows adopted Unicode, I think the only encoding available was UCS-2. They converted pretty quickly to UTF-16 though, and I think the same is true of everybody else who started with UCS-2. Unfortunately UTF-16 has its own set of hassles.
Yeah, there are sometimes a lot more hacks like WTF-8 and WTF-16 in practice on originally-UCS-2 systems (including Windows and JS) than is healthy: https://simonsapin.github.io/wtf-8/
Nothing at all, and in fact there's a site set up specifically to advocate for this: https://utf8everywhere.org/
The biggest problem is when you're working in an ecosystem that uses a different encoding and you're forced to convert back and forth constantly.
I like the way Python 3 does it - every string is Unicode, and you don't know or care what encoding it is using internally in memory. It's only when you read or write to a file that you need to care about encoding, and the default has slowly been converging on UTF-8.
The problem with "every string is Unicode" is if you want to represent things that look like Unicode but aren't really guaranteed to be Unicode. This includes filenames on Windows (WTF-16, aka arbitrary WCHAR sequences) and Linux (arbitrary byte sequences) that are interpreted as UTF-16 / UTF-8 for display purposes, but limiting yourself to valid UTF-16 / UTF-8 means that you cannot represent all paths that you might come across.
Yep. In JavaScript (and Java and C#, from memory) the String.length property is based on the encoded length in UTF-16. It’s essentially useless. I don’t know if I’ve ever seen a valid use for the JavaScript String.length field in a program which handles Unicode correctly.
There are three valid (and useful) ways to measure a string depending on context:
- Number of Unicode code points (useful in collaborative editing)
- Byte length when encoded (these days usually in UTF-8)
- and the number of rendered grapheme clusters
All of these measures are identical in ASCII text - which is an endless source of bugs.
Sadly these languages give you a deceptively useless .length property and make you go fishing when you want to make your code correct.
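For example, all three measures for one string in Python (grapheme clusters counted with the third-party `regex` module, since the stdlib doesn't segment them):

```python
import regex  # pip install regex

s = "ne\u0301\U0001F44D\U0001F3FD"    # "né👍🏽": e + combining acute, thumbs-up + skin tone

print(len(s))                          # 5  code points
print(len(s.encode("utf-8")))          # 12 bytes when encoded as UTF-8
print(len(regex.findall(r"\X", s)))    # 3  grapheme clusters: n, é, 👍🏽
```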
This is also rarely useful unless you are working with a monospace font where all grapheme clusters have the same width, which is probably none if you support double-width characters. More likely what you are interested in is the display length with a particular font or column count with a monospace font.
Java's char is a strong competitor for most stupid "char" type award.
I would give it to Java outright if not for the fact that C's char type doesn't define how big it is at all, nor whether it is signed. In practice it's probably a byte, but you aren't actually promised that, and even if it is a byte you aren't promised whether this byte is treated as signed or unsigned, that's implementation dependant. Completely useless.
For years I thought char was just pointless, and even today I would still say that a high level language like Java (or Javascript) should not offer a "char" type because the problems you're solving with these languages are so unlikely to make effective use of such a type as to make it far from essential. Just have a string type, and provide methods acting on strings, forget "char". But Rust did show me that a strongly typed systems language might actually have some use for a distinct type here (Rust's char really does only hold the 21-bit Unicode Scalar Values, you can't put arbitrary 32-bit values in it, nor UTF-16's surrogate code points) so I'll give it that.
And POSIX does guarantee that CHAR_BIT == 8, so in practice this is only a concern on embedded platforms where you are only dealing with "C-ish" anyway.
How many non-embedded non-POSIX systems do you know? Windows also guarantees CHAR_BIT == 8 and since most software is first written for Windows or POSIX there is plenty of software that assumes that CHAR_BIT == 8. That means that anything that will want to run general software needs to also ensure CHAR_BIT == 8 - not to mention all the algorithms and data formats designed around you being able to efficiently access octets. The only platforms that can get away with CHAR_BIT != 8 are precisely those that have software specially written for them, i.e. embedded systems.
You’re being far too harsh. The Java char type isn’t “stupid”; really, it’s just unfortunate in hindsight. There are plenty of decisions that were stupid even at the time they were made, and this isn’t one of them: people actually thought that 2 bytes would be enough for all characters, and that Han unification was going to work. Looking backward this is “obviously” futile, but it certainly wasn’t then.
C’s character type, FWIW, has a use: it more or less indicates the granularity that is efficiently addressable by the host architecture. Trying to use it for more than that is generally not that fruitful, but it definitely has a purpose and it’s pretty good at that.
Finally, speaking of unfortunate decisions, Rust happens to make one that I don’t particularly like: it lets you misalign characters (and panics), which is…not great. It would be much nicer if the view just didn’t let you do this unless you specifically asked for bytes or something.
When Java was first conceived UTF-16 didn't exist, but we shouldn't rewrite history entirely here, Java 1.0 and Unicode 2.0 (with UTF-16) are from the same year. It would have been wiser (albeit drastic) to pull char in Java 1.0, reserve the word char and the literal syntax and spend a year or two deciding what you actually wanted here in light of the fact Unicode is not going to be a 16-bit encoding.
And again, I don't think Java probably needed 'char' at all, it's the sort of low-level implementation detail Java has been trying to escape from so this is a needless self-inflicted wound. I think there's a char in Java for the reason it has both increment operators - C does it and Java wants to look like C so as not to scare the C programmers.
C's unsigned char could just be named "byte", and if signed char must exist, call that "signed byte". The old C standard actually pretends these are characters, which of course today they clearly aren't, which is why this is a thread about UTF-8. I don't have any objection to a byte type, especially in a low-level language.
Presumably your Rust annoyance is related to things like String::insert? But I don't understand how this problem arises, if you are inserting characters at random positions in a String, that's just going to be nonsense. I can't conceive of a situation where I want to insert characters (or sub-strings) unless I know where they're supposed to go exactly relative to what is in the string already, whereupon it won't panic.
I don’t get your argument at all. People want characters from their strings, and around that time Java decided on UTF-16 because it seemed like the “right” way to do Unicode. What would you suggest they have adopted back then? Similarly, C’s char type is named “char” because people dealt with ASCII back then and characters used to be a byte. It turns out that sucks, but being able to do byte arithmetic is cool, so it’s still around for that purpose (and C++ actually has added std::byte for exactly this; perhaps C will get it as well at some point). For Rust, this is just a thing about holding it wrong: the operation is generally not relevant, so why even expose it? It doesn’t make sense to allow random indexing if you’re just going to crash on misalignment. It would be better to just have an API that doesn’t allow misalignment at all: see Swift’s implementation for example.
People should stop wanting "characters from their strings" especially in the sort of high level software you'd attempt in Java - and Java was in a good position to do that the way we've successfully done it for similar things, by not providing the misleading API shape. Reserve char but don't implement it is what I'm saying, like goto.
Compare for example decryption, where we learned not to provide decrypt(someBytes) and checkIntegrity(someBytes) even though that's what People often want, it's a bad idea. Instead we provide decrypt(wholeBlock) and you can't call it until you've got a whole block we can do integrity checks on, it fails without releasing bogus plaintext if the block was tampered with. An entire class of stupid bugs becomes impossible.
Java should have provided APIs that work on Strings, and said if you think you care about the things Strings are made up of, either you need a suitable third party API (e.g. text rendering, spelling) or you want bytes because that's how Strings are encoded for transmission over the network or storage on disk. You don't want to treat the string as a series of "characters" because they aren't.
The idea that a String is just a vector of characters is wrong, that's not what it is at all. A very low level language like C, C++ or Rust can be excused for exposing something like that, because it's necessary to the low-level machinery, but almost nobody should be programming that layer.
Imagine if Java insisted on acting as though your Java references were numbers and that it could make sense to add them together. Sure in fact they are pointers, and the pointer is an integral type and so you could mechanically add them together, but that's nonsense, you would never write code that needs to do this in Java.
K&R C claimed that char isn't just for representing "ASCII" (which wasn't at that time set in stone as the encoding you'll be using) but for representing the characters on the system you're programming regardless of whether they're ASCII. 'A' wasn't defined as 65 but as whatever the code happens to be for A on your computer. Presumably the current ISO C doesn't make the same foolish claim.
I think you're being too harsh on the C char. It is guaranteed that sizeof(char) == 1, and it is guaranteed to be at least 8 bits long, i.e. long enough for any ASCII character.
These requirements are perfectly good for the needs of a CHARacter type. If you need to control signed / unsigned because you want to use the char as a small integer, you can specify yourself whether it is signed or not.
In reality, where chars are used to store ASCII, the signedness of the datatype is meaningless because the highest bit is never set.
The really tragic thing is that UTF-8 was invented before UTF-16. But a few big companies had put a couple of years of heavy investment into UCS-2, and weren’t willing to let that and the attendant pain they were beginning to foist on developers and users alike go to waste, and so ruined Unicode with UTF-16 and the disaster called surrogates that is the cause of the significant majority of programming languages handling strings incorrectly (e.g. JavaScript uses potentially ill-formed Unicode, indexed by UTF-16 code unit; and Python strings are sequences of code points rather than scalar values). If only they had reversed course and said “sorry for all the pain we were just starting to put you through, that fixed-width 16-bit encoding thing didn’t pan out, we’re going back to 8-bit encodings with this UTF-8 thing that is conveniently also backwards-compatible with ASCII”. If only.
Constant time subscripting is a myth. There's nothing(*) useful to be obtained by adding a fixed offset to the base of your string, in any unicode encoding, including UTF-32.
If you're hoping that a fixed offset gives you a user-percieved character boundary, then you're not handling composed characters or zero-width-joiners or any number of other things that may cause a grapheme cluster to be composed of multiple UTF code points.
The "fixed" size of code points in encodings like UTF-32 are just that: code points. Whether a code point corresponds with anything useful, like the boundary of a visible character, will always require linear-time indexing of the string, in any encoding.
(*) Approximately nothing. If you're in a position where you've somehow already vetted that the text is of a subset of human languages where you're guaranteed to never have grapheme clusters that occupy more than a single code point, then you maybe have a use case for this, but I'd argue you really just have a bunch of bugs waiting to happen.
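For example, in Python, whose strings already give you exactly the O(1) code-point indexing that UTF-32 promises:

```python
# A fixed offset can land in the middle of what the user sees as one character.
family = "👨\u200d👩\u200d👧\u200d👦"   # family emoji: 7 code points, 1 grapheme cluster
print(len(family))        # 7
print(family[2])          # a lone 👩 -- not a boundary any user cares about
```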
Getting tired of people calling things "useless". Clearly I have a use case for fixed-width text encodings.
Source code manipulation is frequently Unicode aware but doesn't care about combinations or things outside of a strict subset of Unicode to modify lexing control flow.
Being able to store (and later refer to) character offsets in the source code is a plus because they'll only ever occur in places where the strict subset is enforced.
This is especially true of languages with line-only comments, etc, where different writing systems being used won't affect the error message information.
Like I said, there are a few useful cases where having a fixed width encoding is beneficial. It's less helpful to the discussion to assert you know better for every case, ever.
> Constant time subscripting is a myth. There's nothing(*) useful to be obtained by adding a fixed offset to the base of your string, in any unicode encoding, including UTF-32.
What about UTF-256? Maybe not today, maybe not tomorrow, but someday...
I know you're kidding, but I want to note that UTF-256 isn't enough. There's an Arabic ligature that decomposes into 20 codepoints. That was already in Unicode 20 years ago. You can probably do something even crazier with the family emoji. These make "single characters" that do not have precomposed forms.
Also, if you want O(1) indexing by grapheme cluster you can get that with less memory overhead by precomputing a lookup table of the location in the string where you can find every k-th grapheme cluster, for some constant k >= 1. (This requires a single O(n) pass through the string to build the index, but you were always going to have to make at least one such pass through the string for other reasons.)
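A rough Python sketch of that index (grapheme segmentation via the third-party `regex` module; the function names are illustrative only):

```python
import regex  # pip install regex

def build_index(s: str, k: int = 16) -> list[int]:
    index = []
    for i, m in enumerate(regex.finditer(r"\X", s)):
        if i % k == 0:
            index.append(m.start())   # code-point offset of every k-th cluster
    return index

def nth_grapheme(s: str, n: int, index: list[int], k: int = 16) -> str:
    start = index[n // k]                               # jump close to the target...
    m = regex.match(r"(?:\X){%d}(\X)" % (n % k), s[start:])
    return m.group(1)                                   # ...then walk at most k-1 clusters
```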
I see this mentioned periodically in discussions about UTF-8 and it just doesn't seem to match reality. Very often you can be certain you're not operating with multi-codepoint grapheme clusters. Whether through string literals, conversion from other types (e.g., numeric to string), restrictions on identifiers, specification for file formats, company-policy on language for source files, conversion from strings with an ASCII charset, etc., you very often can be certain about the contents of that string. And optimizing around that information is considerably faster than a naive linear scan for the string, constantly rediscovering properties about that string.
E.g., Ruby runtimes scan the bytes in a string and then cache data about them in value called a code range. Knowing the code range, you can optimize many operations to not require additional linear scans of the string. Knowing a UTF-8 string consists only of ASCII characters can allow operations to be just as fast as if the string truly were ASCII-only (Ruby supports 100+ string encodings). And that fact is used throughout the core library to provide fast implementations of many operations (upcase, downcase, capitalize, gsub, substring, and so on). Moreover, a JIT can generate extremely tight code in those situations. Having to take a linear pass through the string to discover codepoint boundaries incurs a huge performance cost. While all strings could be treated uniformly and use Unicode tables for case mapping and such, the extra overhead is brutal. It has a measurable impact on string-heavy applications, such as template rendering and text processing.
In the most general case, yes, you know nothing about the string and can't make any assumptions. You can't even be sure the byte sequence is valid UTF-8. But, very often you do know properties of those strings. And you can manage boundaries where strings with known properties are joined with strings with unknown properties (e.g., variable interpolation in a template file).
> Whether through string literals, conversion from other types (e.g., numeric to string), restrictions on identifiers, specification for file formats, company-policy on language for source files, conversion from strings with an ASCII charset, etc., you very often can be certain about the contents of that string.
With the exception of conversion from numbers (which has its own optimizations that are likely equally applicable in UTF8 since Arabic numbers are just ASCII anyway), I’d say all of your examples sound like bugs waiting to happen.
Why shouldn’t string literals be allowed to contain complex emoji? Why should identifiers disallow them? Why should there be a company policy around putting complex emoji places?
Just saying “let’s just declare things such that strings aren’t allowed to have multi-code point grapheme clusters” sounds great until you accidentally let that assumption leak into a place where a user wants to use an emoji and can’t make it match their skin tone.
I’d also say that such restrictions are putting the cart before the horse; the typical reasons for restricting the allowed character set, are precisely because you want to make lazy assumptions about things like string offsets. Saying that such assumptions are a good thing because you have these restrictions in place, seems like circular logic to me.
> Why shouldn’t string literals be allowed to contain complex emoji?
I didn't say they shouldn't, just that many do not and you know that at parse time.
> Why should identifiers disallow them?
I don't write the language specs. Many languages don't allow classes, methods, variables, etc. to have complex grapheme clusters in them.
> Why should there be a company policy around putting complex emoji places?
Performance. Code sanity. Indexing. Ease of typing. Again, I'm not the one writing the policies. But, they exist.
> Just saying “let’s just declare things such that strings aren’t allowed to have multi-code point grapheme clusters” sounds great until you accidentally let that assumption leak into a place where a user wants to use an emoji and can’t make it match their skin tone.
I'm making a clear distinction between situations where you have user-supplied data and data under control of the language runtime, developer-created files, or those just adhering well-defined file formats. These are all strings and commonly consist of simple codepoints; indeed, many times they're just ASCII characters.
I addressed user-supplied values when I wrote "And you can manage boundaries where strings with known properties are joined with strings with unknown properties (e.g., variable interpolation in a template file)." TruffleRuby, for example, uses ropes as its underlying structure, so if you have a template written using all ASCII characters (rather common) and interpolate a user-supplied value, you can put the user string in one rope, the template in others and link them all together into a tree with ConcatRopes. The template ropes still know they only have simple codepoints and operations on those parts can be fast. The user variable only knows it's a generic UTF-8 string and operations on that string, if any, can go down the slower path. Oftentimes, there are no operations to perform on that user string other than to display it. Its mere presence doesn't need to adversely affect the rest of the template.
> I’d also say that such restrictions are putting the cart before the horse; the typical reasons for restricting the allowed character set, are precisely because you want to make lazy assumptions about things like string offsets. Saying that such assumptions are a good thing because you have these restrictions in place, seems like circular logic to me.
I'm not making any assumptions. I've spent an awful lot of time optimizing string performance in the context of a Ruby runtime, and your initial claim of constant-time subscripting being a myth doesn't match my experience. Ruby allows as complex a string as you want, but the reality is there are many situations where strings, either by restriction or de facto, will not have multi-codepoint grapheme clusters. In many situations you'll have strings with all the codepoints in the ASCII range. If the only information the runtime records when parsing a string is that "this is a UTF-8 string" and then operates on all UTF-8 strings uniformly, you leave a lot of performance on the table. The best performing situation is when you don't have to deal with variable-width codepoints in a UTF-8 string. UTF-16 and UTF-32 aren't terribly common in Ruby, but they exist as valid encodings (well, UTF-16BE/UTF-16LE and UTF-32BE/UTF-32LE) and have simpler execution paths than UTF-8 for many use cases.
> For example, constant time subscripting, or improved length calculations, are made possible by encodings other than utf-8.
Assuming you mean different encoding forms of Unicode (rather than entirely different and far less comprehensive character sets, such as ASCII or Latin-1), there are very few use cases where "subscripting" or "length calculations" would benefit significantly from using a different encoding form, because it is rare that individual Unicode code points are the most appropriate units to work with.
(If you're happy to sacrifice support for most of the world's writing systems in favour of raw performance for a limited subset of scripts and text operations, that's different.)
Fun fact: Ken Thompson and Rob Pike of Unix, Plan 9, Go, and other fame had a heavy influence on the standard while working on Plan 9. To quote Wikipedia:
> Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike.
If that isn't a classic story of an international standard's creation/impactful update, then I don't know what is.
For whatever it's worth Rob Pike seems to credit Ken Thompson for the invention, though they both worked together to make it the encoding used by Plan 9 and to advocate for its use more widely.
Recently I learned about UTF-16 when doing some stuff with PowerShell on Windows.
Parallel with my annoyance with Microsoft, I realized how long it’s been since I encountered any kind of text encoding drama. As a regular typer of åäö, many hours of my youth were spent on configuring shells, terminal emulators, and IRC clients to use compatible encodings.
The wide adoption of UTF-8 has been truly awesome. Let’s just hope it’s another 15-20 years until I have to deal with UTF-16 again…
There are many reasons why UTF-8 is a better encoding but UTF-16 does at least have the benefit of being simpler. Every scalar value is either encoded as a single unit or a pair of units (leading surrogate + trailing surrogate).
However, Powershell (or more often the host console) has a lot of issues with handling Unicode. This has been improving in recent years but it's still a work in progress.
UTF-16 only makes sense if you were sure UCS-2 would be fine, and then oops, Unicode is going to be more than 16-bits and so UCS-2 won't work and you need to somehow cope anyway. It makes zero sense to adopt this in greenfield projects today, whereas Java and Windows, which had bought into UCS-2 back in the early-mid 1990s, needed UTF-16 or else they would need to throw all their 16-bit text APIs away and start over.
UTF-32 / UCS-4 is fine but feels very bloated, especially if a lot of your text data is more or less ASCII, which, if it's not literally human text, it usually will be; and it feels a bit bloated even on a good day (it's always wasting 11 bits per character!)
UTF-8 is a little more complicated to handle than UTF-16 and certainly than UTF-32 but it's nice and compact, it's pretty ASCII compatible (lots of tools that work with ASCII also work fine with UTF-8 unless you insist on adding a spurious UTF-8 "byte order mark" to the front of text) and so it was a huge success once it was designed.
So I guess if you're archiving pure CJK text, maybe you could get a 10% benefit, though I suspect non-Unicode encodings of that text would be more compact anyway.
From what I can remember, UTF-8 consumes more CPU as it's more complex to process, has space savings for mostly-ASCII and European code pages, but can significantly bloat storage sizes for character sets that consistently require 3 or 4 bytes per character.
My team managed a system that did a read from user data, doing input validation. One day we got a smart quote character that happened to be > U+10000. But because the data validation happened in chunks, we only got half of it. Which was an invalid character, so input validation failed.
In UTF-8, partial characters happen so often, they're likely to get tested. In UTF-16, they are more rarely seen, so things work until someone pastes in emoji and then it falls apart.
> There are many reasons why UTF-8 is a better encoding but UTF-16 does at least have the benefit of being simpler. Every scalar value is either encoded as a single unit or a pair of units (leading surrogate + trailing surrogate).
UTF-16 is really not noticeably simpler. Decoding UTF-8 is really rather straightforward in any language which has even minimal bit-twiddling abilities.
And that’s assuming you need to write your own encoder or decoder, which seems unlikely.
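To back up the "straightforward bit-twiddling" claim, here's a bare-bones decoder sketch in Python (it deliberately skips the checks a real decoder must do for overlong forms, surrogates, values above U+10FFFF, and truncated sequences):

```python
def decode_utf8(data: bytes) -> str:
    i, out = 0, []
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                  # 0xxxxxxx: ASCII, 1 byte
            cp, extra = lead, 0
        elif lead >> 5 == 0b110:         # 110xxxxx: 2-byte sequence
            cp, extra = lead & 0x1F, 1
        elif lead >> 4 == 0b1110:        # 1110xxxx: 3-byte sequence
            cp, extra = lead & 0x0F, 2
        elif lead >> 3 == 0b11110:       # 11110xxx: 4-byte sequence
            cp, extra = lead & 0x07, 3
        else:
            raise ValueError("invalid leading byte")
        for cont in data[i + 1 : i + 1 + extra]:
            if cont >> 6 != 0b10:        # continuation bytes are 10xxxxxx
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (cont & 0x3F)
        i += 1 + extra
        out.append(chr(cp))
    return "".join(out)

assert decode_utf8("héllo 🙂".encode("utf-8")) == "héllo 🙂"
```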
I never understood why UTF-8 did not use the much simpler encoding of:
- 0xxxxxxx -> 7 bits, ASCII compatible (same as UTF-8)
- 10xxxxxx -> 6 bits, more bits to come
- 11xxxxxx -> final 6 bits.
It has multiple benefits:
- It encodes more bits per octet: 7, 12, 18, 24 vs 7, 11, 16, 21 for UTF-8
- It is easily extensible for more bits.
- Such extra bits extension is backward compatible for reasonable implementations.
The last point is key: UTF-8 would need to invent a new prefix to go beyond 21 bits. Old software would not know the new prefix and what to do with it. With the simpler scheme, they could potentially work out of the box up to at least 30 bits (that's a billion code points, much more than the mere million of 21 bits).
The problem is that UTF-8 has the ability to detect and reject partial characters at the start of the string; this encoding would silently produce an incorrect character. Also, UTF-8 is easily extensible already: the bit patterns 111110xx, 1111110x, and 11111110 are only disallowed for compatibility with UTF-16's limits.
How often are streams truncated at the start? In my career, I've seen plenty of end truncation, but start truncation never happens. Or, to be more precise, it only happens if previous decoding is already borked. If a previous decoding read too much data, then even UTF-8 is borked. You could be decoding UTF-8 from the bits of any follow-up data.
Even for pure text data, if a previous field was over-read (the only plausible way to have start-truncation), then you probably are decoding incorrect data from then on.
IOW, this upside is both ludicrously improbable and much more damning to the decoding than simply being able to skip a character.
Imagine you plug a dumb terminal into an RS232 port. The computer may have already sent some data. Or it may be in the middle of sending log messages to the console as you plug in.
With a code like this, your dumb terminal can be built so it just automatically works. Without it, the terminal may be off by some number of bytes, and you have to have some sort of other (probably manual) synchronization procedure.
Similarly, imagine you've dialed in over modem to a remote system. You get some errors or lost text due to noise on the phone line. Or some bytes are lost due to bad RS232 flow control or no flow control. Your terminal emulator would be able to recover.
There might also be uses where text is broadcast, such as in TV closed captions, although maybe those have their own framing so you know where strings begin.
Prefix code: The first byte indicates the number of bytes in the sequence. Reading from a stream can instantaneously decode each individual fully received sequence, without first having to wait for either the first byte of a next sequence or an end-of-stream indication. The length of multi-byte sequences is easily determined by humans as it is simply the number of high-order 1s in the leading byte. An incorrect character will not be decoded if a stream ends mid-sequence.
"instantaneously" in the sense of first having to read the first byte to know how many bytes to read. So it's a two-step process. Given the current maximum length and SIMD, detecting the end-byte of my scheme is easily parallelizable for up to 4 bytes, which conveniently goes to 24 bits, enough for all current unicode code points, so there is no waiting for termination. Furthermore, to decode a UTF-8 characters needs bits extraction and shifting of all bytes, so there is no practical gain of not looking at every byte. It actually makes the decoding loop more complex.
Also, the human readability sounds fishy. Humans are really bad at decoding high-order bits. For example, can you tell the length of a UTF-8 sequence that begins with 0xEC at a glance? With my scheme, either the high bit is not set (0x7F or less), which is easy to see: you only need to compare the first digit to 7. Or the high bit is set and the high nibble is less than 0xC, meaning there is another byte; also easy to see, you compare the first digit to C.
The quote also implicitly mis-characterizes my scheme: an incorrect character would likewise not be decoded if the stream is interrupted, since it would lack the terminating flag (no byte >= 0xC0).
"Instantly" as in you get a stream of characters and with UTF-8 you always know as soon as you have received a full character. With your encoding it is always possible that you have not received the full character yet and need to wait until the start of the next character (or a timeout).
UTF-8 as defined (or restricted) is a prefix code, it gets all relevant information on the first read, and the rest on the (optional) second. Your scheme requires an unbounded number of reads.
> - It is easily extensible for more bits.
UTF8 already is easily extensible to more bits, either 7 continuation bytes (and 42 bits), or infinite. Neither of which is actually useful to its purposes.
> The last point is key: UTF-8 would need to invent a new prefix to go beyond 21 bits
UTF8 was defined as encoding 31 bits over 6 bytes. It was restricted to 21 bits (over 4 bytes) when unicode itself was restricted to 21 bits.
> UTF8 already is easily extensible to more bits, either 7 continuation bytes (and 42 bits), or infinite.
Extending UTF-8 to 7 continuation bytes (or more) loses the useful property that the all-ones byte (0xFF) never happens in a valid UTF-8 string. Limiting it to 36 bits (6 continuation bytes) would be better.
You can use FF as a sentinel byte internally (I think utf8proc actually does that?); given that FE never occurs, either, if you see the byte sequence corresponding to U+FEFF BYTE ORDER MARK in one of the other UTFs you can pretty much immediately tell it can’t possibly be UTF-8. (In general UTF-8, because of all the self-synchronization redundancy, has a very distinctive pattern that allows it to be detected with almost perfect reliability, and that is a frequent point of UTF-8 advocacy, which lends some irony to the fact that UTF-8 is the one encoding that Web browsers support but refuse to detect[1].) I don’t think there is any other advantage to excluding FF specifically, it’s not like we’re using punched paper tape.
No software decodes data by reading a stream byte-by-byte. Like I said in a previous comment, decoding 4 bytes using SIMD is possible and probably the best way to go. Furthermore, to actually decode, you need bit twiddling anyway, so you do need to do byte-processing. Finally, the inner loop of detecting character boundaries is simpler: the UTF-8 scheme, due to the variable-length prefixes, requires detecting the first non-1 bit. It is probably written with a switch/case in C, vs two bit tests in my scheme. I'm not convinced UTF-8 ends up with a faster loop.
> No software decodes data by reading a stream byte-by-byte.
Maybe not in your bubble, but in my bubble this is the common case and highly optimized components using SIMD parsing is the exception.
By byte-by-byte, I presume you mean logically, not invoking a syscall for every byte. A parsing library is often handed a buffer to parse. If every layer had its own buffering, you'd end up with too much data copying, which compounds and can spill your CPU caches in ways that microbenchmarks won't reflect, particularly in a streaming pipeline. So you stick to simpler, straightforward code and data paths, unless and until you know a particular component is a bottleneck. I've had much better results optimizing globally first before optimizing locally. You can usually go back and optimize locally whenever you want, but optimizing globally is more often a one-shot deal as it typically requires non-local (i.e. cross-component) analysis and refactors, which nobody has time for.
The current scheme is extensible to 7x6=42 bits (which will probably never be needed). The advantage of the current scheme is that when you read the first byte you know how long the code point is in memory and you have less branching dependencies, i.e. better performance.
EDIT: another huge advantage is that lexicographical comparison/sorting is trivial (usually the ascii version of the code can be reused without modification).
Unicode 13 uses 143859 code points. 21 bits can encode more than 10x of that. And there are not that many languages left to add. We are already debating elvish. I am not sure what would warrant extending the coding scheme if we are already on the level of fictional languages today. Also note that many things like flags or emoji skin colors, etc. are now done via grapheme clusters, which conserves code points.
UTF-8 is self-resynchronizing. You can scan forwards and/or backwards and all you have to do is look for bytes that start a UTF-8 codepoint encoding to find the boundaries between codepoints. It's genius.
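For example, at the byte level in Python (a tiny sketch):

```python
def codepoint_start(data: bytes, i: int) -> int:
    # Scan backwards past 10xxxxxx continuation bytes to find the start
    # of the code point containing offset i.
    while i > 0 and data[i] & 0xC0 == 0x80:
        i -= 1
    return i

data = "a😀b".encode("utf-8")         # b'a\xf0\x9f\x98\x80b'
assert codepoint_start(data, 3) == 1  # offset 3 is mid-emoji; the emoji starts at 1
assert codepoint_start(data, 5) == 5  # 'b' is already a boundary
```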
Excellent presentation! One improvement to consider is that many usages of "code point" should be "Unicode scalar value" instead. Basically, you don't want to use UTF-8 to encode UTF-16 surrogate code points (which are not scalar values).
> Fun fact, UTF-8's prefix scheme can cover up to 31 payload bits.
It’d probably be more correct to say that it was originally defined to cover 31 payload bits: you can easily complete the first byte to get 7 and 8 byte sequences (35 and 41 bits payloads).
Alternatively, you could save the 11111111 leading byte to flag the following bytes as counts (5 bits each since you’d need a flag bit to indicate whether this was the last), then add the actual payload afterwards, this would give you an infinite-size payload, though it would make the payload size dynamic and streamed (where currently you can get the entire USV in two fetches, as the first byte tells you exactly how many continuation bytes you need).
No. Let UTF-8* denote UTF-8, except that surrogate code points are allowed and have no special meaning. To encode a string in CESU-8, each supplementary character is converted to a surrogate pair, leaving existing surrogates alone, then each codepoint is encoded as UTF-8*. To encode a string in WTF-8, each surrogate pair is converted to a supplementary character, leaving unpaired surrogates alone, then each codepoint is encoded as UTF-8*. So really, they have the opposite effect; CESU-8 always uses surrogate characters, whereas WTF-8 removes them if possible. Both are similar in that they directly UTF-8*-encode unpaired surrogates.
The obvious advantage being that WTF-16 -> WTF-8 conversion maps all valid UTF-16 to the corresponding valid UTF-8 and only unpaired surrogates will produce invalid UTF-8 - but only invalid because UTF-8 explicitly disallows encoding those surrogates and not because the actual encoding differs, so you can almost always treat WTF-8 as UTF-8.
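You can poke at the difference from Python, which tolerates lone surrogates in str and has a 'surrogatepass' handler (the exact behaviour here is CPython-specific):

```python
supplementary = "\U0001F600"                       # U+1F600, outside the BMP

# WTF-8 encodes a well-formed surrogate pair as the supplementary character
# itself, i.e. byte-for-byte identical to plain UTF-8:
print(supplementary.encode("utf-8"))               # b'\xf0\x9f\x98\x80'

# CESU-8 instead encodes the two UTF-16 surrogates separately:
hi, lo = "\ud83d", "\ude00"                        # surrogate pair for U+1F600
print((hi + lo).encode("utf-8", "surrogatepass"))  # b'\xed\xa0\xbd\xed\xb8\x80'
```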
I spent 2 hours last Friday trying to wrap my head around what UTF-8 was (https://www.joelonsoftware.com/2003/10/08/the-absolute-minim is great, but doesn't explain the inner workings like this does) and completely failed, could not understand it. This made it super easy to grok, thank you!
>NOTE: You can always find a character boundary from an arbitrary point in a stream of octets by moving left an octet each time the current octet starts with the bit prefix 10 which indicates a tail octet. At most you'll have to move left 3 octets to find the nearest header octet.
This is incorrect. You can only find boundaries between code points this way.
Until you learn that not all "user-perceived characters" (grapheme clusters) can be expressed as a single code point, Unicode seems cool. These UTF-8 explanations explain the encoding but leave out this unfortunate detail. The author might not even know it, because they deal with a subset of Unicode in their life.
If you want to split text between two user-perceived characters, not within them, this tutorial does not help.
Unicode encodings are great if you want to handle a subset of languages and characters; if you want to be complete, it's a mess.
You're right, that should read "codepoint boundary" not "character boundary". I can fix that.
I do briefly mention grapheme clusters near the end, didn't want to introduce them as this article was more about the encoding mechanism itself. Maybe a future article after more research :)
Not sure if the issue is with Chrome or my local config generally (bog-standard Windows, nothing fancy), but the US-flag example doesn't render as intended. It shows as "US", with the components in the next step being "U" and "S" (not the ASCII characters U & S; the encoding is as intended, but those characters are shown in place of the intended flag).
Displays as I assume intended in Firefox on the same machine: American flag emoji then when broken down in the next step U-in-a-box & S-in-a-box. The other examples seem fine in Chrome.
Take care when using relatively new additions to the Unicode emoji set; test to make sure your intentions are correctly displayed in all the browsers you might expect your audience to be using.
They aren't new (2010) - this is a Windows thing - speculation is it's a policy decision to avoid awkward conversations with various governments (presumably large customers) about TW , PS and others -- see long discussion here for instance https://answers.microsoft.com/en-us/windows/forum/all/flag-e...
Yeah, there's not much I can do there unfortunately (since I'm using SVG with the actual U and S emojis to show the flag). I can't comment on whether it's your config or not, but I've tested the SVGs on iOS and Firefox/Chrome on desktop to make sure they rendered nicely for most people. Sorry you aren't getting a great experience there.
And Chrome on Android, though with a rendering difference on the not-plain-ol'-U and not-plain-ol'-S (in both cases, blue letters rather than white with a blue background).
Great explanation. The only part that tripped me up was in determining the number of octets to represent the codepoint. From the post:
>From the previous diagram the value 0x1F602 falls in the range for a 4 octets header (between 0x10000 and 0x10FFFF)
Using the diagram in the post would be a crutch to rely on. It seems easier to remember the maximum number of "data" bits that each octet layout can support (7, 11, 16, 21). Then by knowing that 0x1F602 maps to 11111011000000010, which is 17 bits, you know it must fit into the 4-octet layout, which can hold 21 bits.
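For example, that rule of thumb in Python:

```python
# Pick the layout by payload bits (7, 11, 16, 21) instead of by codepoint range.
cp = 0x1F602
bits = cp.bit_length()                 # 17
octets = 1 if bits <= 7 else 2 if bits <= 11 else 3 if bits <= 16 else 4
print(bits, octets)                    # 17 4
print(len(chr(cp).encode("utf-8")))    # 4 -- matches Python's own encoder
```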
As the continuation bytes always bear the payload in the low 6 bits, Connor Lane Smith suggests writing them out in octal[1]. Though that 3 octets of UTF-8 precisely cover the BMP is also quite convenient and easy to remember (but perhaps don’t use that like MySQL did[2]?..).
If you’re more into watching a presentation, I recorded “A Brief History of Unicode” last year, and there’s a YouTube recording of it as well as the slides:
Great post and intuitive visuals! I recently had to rack my brain around UTF-8 encoding and decoding when building the Unicode ETH Project (https://github.com/devstein/unicode-eth) and this post would have been very useful
BTW here’s a surprise I had to learn at some point: strings in JS are UTF-16. Keep that in mind if you want to use the console to follow this great article, you’ll get the surrogate pair for the emoji instead.
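For example, you can reproduce what the JS console shows from Python by looking at the UTF-16 code units:

```python
import struct

# U+1F602 (😂) encodes to two UTF-16 code units: the surrogate pair.
units = struct.unpack("<2H", "😂".encode("utf-16-le"))
print([hex(u) for u in units])   # ['0xd83d', '0xde02']
```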
OpenType makes this impossible. A glyph index is a UINT16, so you can't fit all of the ~143k Unicode characters.
There are some attempts at font families to cover the majority of characters. Like Noto ( https://fonts.google.com/noto/fonts ), broken out into different fonts for different regions.
Or, Unifont's ( http://www.unifoundry.com/ ) goal of gathering the first 65536 code points in one font, though it leaves a lot to be desired if you actually use it as a font.
Take care using recently added Unicode entries, unless you have some control of your user-base and when they update or are providing a custom font that you know has those items represented. You could be giving out broken-looking UI to many if their setup does not interpret the newly assigned codes correctly.
Is there any standard system where each byte/word maps to one character/grapheme? I feel there is a general sentiment that not being able to jump to the Nth character is programmatically… irritating and disappointing. I'm sure such a system wouldn't support some languages - but in the words of Lord Farquaad, "That's a sacrifice I'm willing to make". Most of the world's languages would do just fine, and it'd make sense to exclude right-to-left Arabic ligatured text in, for instance, your monospaced computer code.
I'm guessing you could extract a subset of UTF-8 - but has anyone done anything like that?
You’re basically describing the various character encodings which preceded adoption of Unicode. No, UTF-8 doesn’t support that. That’s the sacrifice it was willing to make, to eliminate the need to distinguish between character encodings: a string isn’t a sequence of bytes.
Even if you don’t appreciate that convenience (and the highly standardized, easily abstracted rules of combined bytes), maybe you’ll appreciate that the complexity of supporting multilingual users would mean multipart data, arbitrarily fractured. As in a single text message using multiple character sets could be dozens of payload boundaries. Does that really sound easier or more painless than UTF-8’s variable-length code points?
From what I understood of the link, there is the "grapheme", which is an intermediary representation between being UTF-8 and being drawn on the screen. Like all possible glyphs have been prerendered; it hasn't got some simple graphics engine to draw those on the fly.
Couldn't you directly provide an index in the grapheme "cache" ?
Okay, but ASCII only covers English. It doesn't even cover all European languages. There is probably something between ASCII and UTF-8 where you could cover roughly 90% of people's needs with English, Chinese, Japanese, Korean, Greek, Russian. Weird languages that need ligatures and combining symbols are the exception (and in some cases, like Korean, there is a very finite number of combinations).
Glad you enjoyed! Unicode and how it interacts with other aspects of computers (IDNA, NFKC, grapheme clusters, etc.) is one of the spaces I want to explore more.