But we don't have to make everything Unicode aware. Backward compatibility is in...

bayindirh · 2024-10-08T10:09:50 1728382190

> Converting one Unicode string to another is a purely in-memory, in-CPU operation.

...but it's a complex operation. This is mostly what libICU is for. You can't just look up a single table and convert one string to another the way you can with the ASCII table or any other simple encoding.

Germans have their ß to S (or capital ß depending on the year), Turkish has ı/I and i/İ pairs, and tons of other languages have other rules.

Especially these I/ı and İ/i pairs break tons of applications in very unexpected ways. I don't remember how many bugs I have reported, or how many workarounds I have implemented in my systems.
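To make the I/ı and İ/i problem concrete, here is a minimal Python sketch. Python's built-in `str.upper()` and `str.lower()` use Unicode's default, locale-independent mappings, so they give the "wrong" answer for Turkish text. The helpers `turkish_upper` and `turkish_lower` are hypothetical names made up for this illustration; real code would more likely go through ICU.

```python
# Default (locale-independent) case mapping, as applied by Python's str methods.
print("i".upper())             # "I", but Turkish expects "İ" (U+0130)
print("I".lower())             # "i", but Turkish expects "ı" (U+0131)

# Lowercasing İ with the default rules gives "i" + COMBINING DOT ABOVE (U+0307),
# so the string silently grows from one code point to two.
dotted_lower = "\u0130".lower()
print([hex(ord(ch)) for ch in dotted_lower])


def turkish_upper(text: str) -> str:
    """Hypothetical helper: uppercase using the Turkish i/İ and ı/I pairs."""
    # Handle the two special pairs first, then let the default mapping do the rest.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase using the Turkish İ/i and I/ı pairs."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


print(turkish_upper("istanbul"))     # İSTANBUL
print(turkish_lower("DİYARBAKIR"))   # diyarbakır
```

The sketch only covers the four letters in question; it is not a full locale-aware case conversion.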

Adding a type is nice, but the surrounding machinery is so big that it brings tons of work with it. Unicode is such a complicated system, that I read that even you need two UTF-16 characters (4 bytes in total) to encode a single character. This is insane (as in complexity, I guess they have their reasons).

SAI_Peregrinus · 2024-10-08T13:56:34 1728395794

> Unicode is such a complicated system, that I read that even you need two UTF-16 characters (4 bytes in total) to encode a single character. This is insane (as in complexity, I guess they have their reasons).

Because there are more than 65,535 characters. That's just writing systems, not Unicode's fault. Most of the unnecessary complexity of Unicode is legacy compatibility: UTF-16 & UTF-32 are bad ideas that increase complexity, but they predate UTF-8 which actually works decently well so they get kept around for backwards compatibility. Likewise with the need for multiple normalization forms.
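As a small illustration of why normalization forms exist at all, here is a Python sketch using the standard `unicodedata` module. The sample strings and the helper name `same_text` are chosen only for this example.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings render identically but compare unequal code point by code point.
print(composed == decomposed)   # False

# NFC composes, NFD decomposes; either one gives a common form to compare in.
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True

# The "K" (compatibility) forms additionally fold legacy presentation forms,
# such as the "fi" ligature U+FB01, into their plain equivalents.
ligature = "\ufb01"
print(unicodedata.normalize("NFC", ligature) == ligature)   # True (left alone)
print(unicodedata.normalize("NFKC", ligature))              # fi


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Which form to compare in depends on the application; NFC is used here only because it is a common default.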

numpad0 · 2024-10-09T00:07:25 1728432445

It's because Unicode doesn't allow for language switching.

It takes up to eight bytes per character in Unicode if you want to support both Chinese and Japanese in a single font using IVS (and I don't think there's any font that actually supports this).

AFAICS (as far as I can search), the Simplified (PRC) and Traditional (Taiwan) Chinese encodings are called GB2312 and Big5 respectively, and both are two-byte encodings with good practical coverage. The same applies to Japanese Shift_JIS. If e.g. :flag_cc: were allowed to be used as a start-of-language marker, one could theoretically cut that back down to two bytes per character without losing much, while actually improving language support.
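The byte counts above can be checked with a rough Python sketch using the codecs that ship with the standard library. The sample characters are picked only for illustration, and the second sequence is there purely for the arithmetic; whether it is a registered variation sequence is not checked here.

```python
han = "\u4e2d"  # 中, a common ideograph present in all three legacy sets

# Encoded size of the same character in each encoding.
for codec in ("gb2312", "big5", "shift_jis", "utf-8", "utf-16-le"):
    print(codec, len(han.encode(codec)))

# An ideographic variation sequence (IVS) is a base ideograph followed by a
# variation selector from U+E0100..U+E01EF, which lies outside the BMP.
bmp_ivs = "\u845b\U000e0100"             # BMP ideograph + VS17
supplementary_ivs = "\U00020000\U000e0100"  # non-BMP ideograph + VS17

print(len(bmp_ivs.encode("utf-8")))              # 3 + 4 bytes
print(len(supplementary_ivs.encode("utf-8")))    # 4 + 4 bytes
print(len(supplementary_ivs.encode("utf-16-le")))  # 2 surrogate pairs
```

So the eight-byte figure corresponds to the worst case, where both the base ideograph and the selector sit outside the Basic Multilingual Plane.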

account42 · 2024-10-09T13:55:06 1728482106

The number of characters is not the problem, the mess due to legacy compatibility is - case folding and normalization could be much simpler if the codepoints were laid out with that in mind. Also the fact that Unicode can't make up its mind whether it wants to encode glyphs (Turkish I and i, Han unification) or semantic characters (e.g. Cyrillic vs. Latin letters) or just "ideas" (emojis).

bayindirh · 2024-10-08T14:06:49 1728396409

I mean, I already know some Unicode internals and linguistics (since I developed a language-specific compression algorithm back in the day), but I have never seen a single character requiring four bytes (and I know Emoji chaining for skin color, etc.).

So, seeing this just moved the complexity of Unicode one notch up in my head, and I respect the guys who designed and made it work. It was not whining or complaining of any sort. :)

fluoridation · 2024-10-08T19:06:42 1728414402

Cuneiform codepoints are 17 bits long. If you're using UTF-16 you'll need two code units to represent a character.
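For the curious, the split into two code units is simple arithmetic. Below is a minimal Python sketch of the UTF-16 surrogate pair calculation; the function name `utf16_code_units` is made up for this example, and it does no validation of its input.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units for one code point (no validation)."""
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single 16-bit code unit is enough.
        return [code_point]
    # Subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # low (trail) surrogate
    return [high, low]


cuneiform_a = 0x12000  # CUNEIFORM SIGN A, first code point of the Cuneiform block
print(cuneiform_a.bit_length())                         # 17
print([hex(u) for u in utf16_code_units(cuneiform_a)])  # ['0xd808', '0xdc00']

# Cross-check against Python's own encoder.
print(chr(cuneiform_a).encode("utf-16-be").hex())       # d808dc00
```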

gpderetta · 2024-10-09T12:25:20 1728476720

You also need two UTF-16 code units for plain emojis.

TorKlingberg · 2024-10-09T13:53:27 1728482007

Lots of emoji are outside the Basic Multilingual Plane and need 4 bytes in UTF-8 and UTF-16. That's without going into skin color and other modifiers and combinations.
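A quick way to see this for yourself, as a Python sketch. The helper `encoded_sizes` and the sample emoji are chosen only for this illustration.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Hypothetical helper: code point count and encoded byte lengths."""
    return {
        "code_points": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le avoids counting a BOM
    }


heart = "\u2764"                        # HEAVY BLACK HEART, inside the BMP
thumbs_up = "\U0001F44D"                # THUMBS UP SIGN, outside the BMP
thumbs_up_tone = thumbs_up + "\U0001F3FD"  # plus a medium skin tone modifier

print(encoded_sizes(heart))           # 1 code point, 3 bytes UTF-8, 2 bytes UTF-16
print(encoded_sizes(thumbs_up))       # 1 code point, 4 bytes in both
print(encoded_sizes(thumbs_up_tone))  # 2 code points, 8 bytes in both
```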

account42 · 2024-10-09T13:45:29 1728481529

> Germans have their ß to S (or capital ß depending on the year)

FYI, it's never S. If there is no better option then SS and ss are the proper capital and lowercase substitutions.
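Python's built-in string methods follow the Unicode default mappings here, so they make a handy sketch of the SS/ss behaviour. The helper name `caseless_equal` is made up for this example.

```python
# The default uppercase mapping of ß is the two-letter sequence "SS",
# so uppercasing changes the length of the string.
print("straße".upper())           # STRASSE
print(len("straße"), len("straße".upper()))

# The round trip is lossy: lowercasing "SS" cannot know it used to be ß.
print("STRASSE".lower())          # strasse

# The capital sharp s, U+1E9E, does lowercase to ß.
print("\u1e9e".lower() == "ß")    # True


def caseless_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare using Unicode case folding."""
    # casefold() maps ß to "ss", so both spellings fold to the same string.
    return a.casefold() == b.casefold()


print(caseless_equal("Straße", "STRASSE"))   # True
```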

blenderob · 2024-10-08T10:17:08 1728382628

Thanks for the reply! Really appreciate the time you have taken to write down a thoughtful reply.

bayindirh · 2024-10-08T10:34:46 1728383686

No problems! If you want a slightly longer write-up, here's a classic I constantly share with people:

https://blog.codinghorror.com/whats-wrong-with-turkey/