
The real insights here are that strings in C++ suck and UTF-16 is extremely unintuitive.


Strings in the C++ standard library do suck (and C++ is my favorite language).

As for UTF-16, well, I don't know that UTF-8 is a whole lot more intuitive:

> And for UTF-8 data, you have the same issues discussed before: Multibyte characters will not be converted properly, and it breaks for case mappings that alter string lengths.
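
To make that concrete, here's a minimal C++ sketch of the naive per-byte conversion the quoted passage is warning about (the é example is mine, not the article's):

    #include <cctype>
    #include <iostream>
    #include <string>

    int main() {
        // "héllo" in UTF-8: the é is the two-byte sequence 0xC3 0xA9.
        std::string s = "h\xC3\xA9llo";

        // Naive per-byte uppercasing, as the quoted passage describes.
        for (char& c : s)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));

        // In the default "C" locale only a-z are mapped, so the bytes of
        // the é pass through untouched: prints "HéLLO", not "HÉLLO"
        // (uppercase É is the sequence 0xC3 0x89).
        std::cout << s << "\n";
    }

The second half of the quote is about mappings like German ß → SS, which change the string's length and therefore can't be done in place at all.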


UTF-16 has all the complexity of UTF-8 plus surrogate pairs.
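
Concretely, here's roughly what the surrogate-pair branch costs you; a sketch with all validation (unpaired surrogates, truncated input) omitted:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Decode one code point from UTF-16, advancing i past one or two
    // 16-bit units.
    std::uint32_t decode_utf16(const std::uint16_t* s, std::size_t& i) {
        std::uint16_t u = s[i++];
        if (u >= 0xD800 && u <= 0xDBFF) {      // high surrogate
            std::uint16_t lo = s[i++];         // assumed to be 0xDC00..0xDFFF
            return 0x10000 + ((std::uint32_t(u - 0xD800) << 10)
                              | std::uint32_t(lo - 0xDC00));
        }
        return u;                              // BMP code point, one unit
    }

    int main() {
        const std::uint16_t emoji[] = { 0xD83D, 0xDE00 }; // U+1F600 as a pair
        std::size_t i = 0;
        std::printf("U+%X\n", (unsigned)decode_utf16(emoji, i)); // U+1F600
    }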


Surrogate pairs aren't more complex than UTF-8's scheme for determining the number of bytes used to represent a code point (arguably the logic is slightly simpler). The important point is that UTF-16 pretends to be a fixed-length encoding while actually having the surrogate-pair loophole. That's because it's a hack on top of UCS-2, which originally worked well enough for Microsoft to get married to it, until the BMP turned out not to have enough code points. UTF-8, by contrast, was clearly designed from scratch as a multi-byte encoding, and the scheme could support much higher code points than the current standard allows: the now-illegal 6-byte sequences with lead bytes FC or FD map neatly up to 2^31, and extending the logic all the way reaches 2^42.
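
For comparison, the length-determination logic in question is just a few branches on the lead byte (a sketch; continuation-byte validation omitted):

    #include <cstdint>
    #include <cstdio>

    // UTF-8 sequence length from the lead byte's high bits.
    int utf8_seq_len(std::uint8_t lead) {
        if (lead < 0x80)           return 1; // 0xxxxxxx: ASCII
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
        // The pre-2003 spec continued the pattern: 111110xx for 5 bytes,
        // 1111110x (0xFC/0xFD) for 6 bytes, covering code points to 2^31.
        return -1;                           // continuation byte or invalid
    }

    int main() {
        std::printf("%d\n", utf8_seq_len(0xE2)); // 3: 0xE2 leads "€" (E2 82 AC)
    }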



