This is a cool article about Unicode encoding however I still feel like it shoul...

happytoexplain · on Jan 27, 2022

Swift, for example, does what you're saying. I thought that the reason many languages don't do it that way is that part of the definition of an array (or at least expected-by-convention) is constant-time operations. If you treat a string as an array, then having to deal with variable-length units breaks that rule. That's why, when there is an API for dealing with grapheme clusters, it is usually a special case that duplicates an array-like API, instead of literally using an array.

I actually don't know how/why Python is apparently using code points, since they are variable length. That seems like a compromise between using code units and using grapheme clusters that gets you the worst of both worlds.

Edit: Maybe it uses UTF-32 under the hood when it's doing array operations on code points?

nitely · on Jan 28, 2022

CPython 3 does use UTF-32 under the hood for strings (there is bytes for plain sequence of bytes). As you say, it's the worst of both worlds. High memory usage, and not really useful if you are dealing with unicode characters (grapheme clusters).

My impression is most modern languages that bother with unicode (swift, rust, nim) are using utf-8, and doing linear time operations to handle unicode. I think that's the right approach, as I don't recall ever needing random access on a unicode string.

astrange · on Jan 28, 2022

Swift originally had UTF-32 strings (an upgrade from ObjC which uses UTF-16), but they redid String to be UTF-8. It's typically the best choice even for CJK text, because it has ASCII mixed in often enough to still be smallest.

I don't think reversing a string is a meaningful operation though; there's no reason to think of a string as a "list of characters" when it's also a "list of words" and several other things. Swift provides more than one kind of iteration for that reason.

nitely · on Jan 28, 2022

Of course it's possible, the Unicode standard even has a table[0] you can use to build a DFA (Deterministic Finite Automata) to break up a string into grapheme clusters. You can reverse the DFA to match and yield the graphemes backwards as well, which will give you the reversed unicode string.

[0]: http://www.unicode.org/reports/tr29/#Table_Combining_Char_Se...