
The real insights here are that strings in C++ suck and UTF-16 is extremely unintuitive.


Strings in the C++ standard library do suck (and C++ is my favorite language).

As for UTF-16, well, I don't know that UTF-8 is a whole lot more intuitive:

> And for UTF-8 data, you have the same issues discussed before: Multibyte characters will not be converted properly, and it breaks for case mappings that alter string lengths.
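
To make that concrete, here's a minimal C++ sketch of the naive per-byte conversion the quoted passage is warning about (the é example is mine, not the article's):

    #include <cctype>
    #include <iostream>
    #include <string>

    int main() {
        // "héllo" in UTF-8: the é is the two-byte sequence 0xC3 0xA9.
        std::string s = "h\xC3\xA9llo";

        // Naive per-byte uppercasing, as the quoted passage describes.
        for (char& c : s)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));

        // In the default "C" locale only a-z are mapped, so the bytes of
        // the é pass through untouched: prints "HéLLO", not "HÉLLO"
        // (uppercase É is the sequence 0xC3 0x89).
        std::cout << s << "\n";
    }

The second half of the quote is about mappings like German ß → SS, which change the string's length and therefore can't be done in place at all.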


UTF-16 has all the complexity of UTF-8 plus surrogate pairs.
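
Concretely, here's roughly what the surrogate-pair branch costs you; a sketch with all validation (unpaired surrogates, truncated input) omitted:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Decode one code point from UTF-16, advancing i past one or two
    // 16-bit units.
    std::uint32_t decode_utf16(const std::uint16_t* s, std::size_t& i) {
        std::uint16_t u = s[i++];
        if (u >= 0xD800 && u <= 0xDBFF) {      // high surrogate
            std::uint16_t lo = s[i++];         // assumed to be 0xDC00..0xDFFF
            return 0x10000 + ((std::uint32_t(u - 0xD800) << 10)
                              | std::uint32_t(lo - 0xDC00));
        }
        return u;                              // BMP code point, one unit
    }

    int main() {
        const std::uint16_t emoji[] = { 0xD83D, 0xDE00 }; // U+1F600 as a pair
        std::size_t i = 0;
        std::printf("U+%X\n", (unsigned)decode_utf16(emoji, i)); // U+1F600
    }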


Surrogate pairs aren't more complex than UTF-8's scheme for determining the number of bytes used to represent a code point (arguably the logic is slightly simpler). The important point is that UTF-16 pretends to be a fixed-length encoding while actually having the surrogate-pair loophole. That's because it's a hack on top of UCS-2, which originally worked well enough for Microsoft to get married to it, until the BMP turned out not to have enough code points. UTF-8, by contrast, was clearly designed from scratch as a multi-byte encoding, and the scheme could support much higher code points than the current standard allows: the now-illegal 6-byte sequences with lead bytes FC or FD map neatly up to 2^31, and extending the logic all the way reaches 2^42.
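
For comparison, the length-determination logic in question is just a few branches on the lead byte (a sketch; continuation-byte validation omitted):

    #include <cstdint>
    #include <cstdio>

    // UTF-8 sequence length from the lead byte's high bits.
    int utf8_seq_len(std::uint8_t lead) {
        if (lead < 0x80)           return 1; // 0xxxxxxx: ASCII
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
        // The pre-2003 spec continued the pattern: 111110xx for 5 bytes,
        // 1111110x (0xFC/0xFD) for 6 bytes, covering code points to 2^31.
        return -1;                           // continuation byte or invalid
    }

    int main() {
        std::printf("%d\n", utf8_seq_len(0xE2)); // 3: 0xE2 leads "€" (E2 82 AC)
    }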



