
This is a cool article about Unicode encoding; however, I still feel like it should be possible to reverse strings with flag emojis. I don't see why computers can't handle multi-rune symbols the same way they handle multi-byte runes. We could group all the runes that form a single symbol and preserve their order within the reversed string. Of course, that means naive string reversing doesn't work anymore, but naive string reversing wouldn't work in the world of UTF-8 either if we just went byte by byte.
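To make the two failure modes concrete, here is a minimal Python sketch (my own illustration, not taken from the article). It assumes the flag is the usual pair of regional indicator code points and shows what naive reversal does at the code point level and at the byte level.

```python
# The US flag is two code points: REGIONAL INDICATOR U followed by S.
flag = "\U0001F1FA\U0001F1F8"

# Reversing code points swaps the indicators, so "US" turns into "SU".
reversed_code_points = flag[::-1]

# Reversing the raw UTF-8 bytes puts a continuation byte first,
# which is not valid UTF-8 at all.
reversed_bytes = flag.encode("utf-8")[::-1]
try:
    reversed_bytes.decode("utf-8")
    bytes_still_valid = True
except UnicodeDecodeError:
    bytes_still_valid = False
```

So going rune by rune already fixes the byte-level breakage; the flag case is the same problem one level up.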



Swift, for example, does what you're saying. I thought the reason many languages don't do it that way is that constant-time indexing is part of the definition of an array (or at least expected by convention). If you treat a string as an array, then having to deal with variable-length units breaks that rule. That's why, when there is an API for dealing with grapheme clusters, it is usually a special case that duplicates an array-like API instead of literally using an array.

I actually don't know how/why Python is apparently indexing by code points, since code points are variable-length in UTF-8 and UTF-16. That seems like a compromise between using code units and using grapheme clusters that gets you the worst of both worlds.

Edit: Maybe it uses UTF-32 under the hood when it's doing array operations on code points?


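CPython 3 does use a fixed-width representation under the hood for strings (there is bytes for a plain sequence of bytes). Since 3.3 (PEP 393) it picks 1, 2, or 4 bytes per code point based on the widest code point in the string, so a single emoji pushes the whole string to UTF-32. As you say, it's the worst of both worlds: high memory usage for such strings, and not really useful if you are dealing with Unicode characters (grapheme clusters).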
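A rough way to see this from inside Python, as a sketch: `len()` counts code points regardless of encoding, and comparing `sys.getsizeof` for two string lengths gives an estimate of storage per code point. The helper name `bytes_per_code_point` is made up for this example, and the size measurement is an implementation detail that only means something on CPython.

```python
import sys

flag = "\U0001F1FA\U0001F1F8"  # two regional indicators, drawn as one flag

code_points = len(flag)                            # what len() and indexing count
utf8_bytes = len(flag.encode("utf-8"))             # UTF-8 code units
utf16_units = len(flag.encode("utf-16-le")) // 2   # UTF-16 code units
utf32_units = len(flag.encode("utf-32-le")) // 4   # UTF-32 code units


def bytes_per_code_point(ch: str, n: int = 1000) -> float:
    """Estimate storage per code point by differencing two string sizes.

    Implementation-specific: assumes CPython's fixed-width string storage.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n


# ASCII letter, euro sign (BMP), emoji (outside the BMP).
widths = [bytes_per_code_point(c) for c in ("a", "\u20ac", "\U0001F600")]
```

The flag is one symbol on screen, but none of the counts above is 1, which is the "not really useful for grapheme clusters" part.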

My impression is that most modern languages that bother with Unicode (Swift, Rust, Nim) use UTF-8 and do linear-time operations to handle Unicode. I think that's the right approach, as I don't recall ever needing random access on a Unicode string.
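For what "linear time" means in practice, here is a small sketch of finding the n-th code point in a UTF-8 buffer. The function name `code_point_offset` is hypothetical, and it assumes the input is already valid UTF-8; it is not how any particular language implements this.

```python
def code_point_offset(data: bytes, index: int) -> int:
    """Return the byte offset at which code point number `index` starts.

    Assumes `data` is valid UTF-8. Runs in O(len(data)) because the
    only way to find code point boundaries is to scan for them.
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


data = "a\u20ac\U0001F600b".encode("utf-8")  # 1 + 3 + 4 + 1 bytes
offsets = [code_point_offset(data, i) for i in range(4)]
```

Sequential iteration never pays this cost, since it just keeps walking forward from the previous boundary; only random access does.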


Swift's native String originally used UTF-16 storage (matching ObjC's NSString), but they redid String to be UTF-8 in Swift 5. UTF-8 is typically the best choice even for CJK text, because such text has ASCII mixed in often enough that UTF-8 is still the smallest encoding.

I don't think reversing a string is a meaningful operation though; there's no reason to think of a string as a "list of characters" when it's also a "list of words" and several other things. Swift provides more than one kind of iteration for that reason.
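The "more than one kind of list" point can be sketched in Python too (this is an illustration of the idea, not Swift's API). The same string has a different length, and so a different "reverse", depending on which units you pick.

```python
# "café" written with a combining accent, a space, then the US flag.
text = "cafe\u0301 \U0001F1FA\U0001F1F8"

as_bytes = list(text.encode("utf-8"))   # UTF-8 code units
as_code_points = list(text)             # what Python's str iterates over
as_words = text.split()                 # whitespace-separated words

# Each view gives a different answer to "reverse the string".
reversed_by_word = " ".join(reversed(as_words))
reversed_by_code_point = "".join(reversed(as_code_points))
```

Reversing by word keeps the accent and the flag intact; reversing by code point detaches the accent from the "e" and swaps the flag's indicators.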


Of course it's possible. The Unicode standard even has a table[0] you can use to build a DFA (Deterministic Finite Automaton) that breaks a string into grapheme clusters. You can reverse the DFA to match and yield the graphemes backwards as well, which will give you the reversed Unicode string.

[0]: http://www.unicode.org/reports/tr29/#Table_Combining_Char_Se...
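A minimal sketch of the idea in Python, with heavy caveats: this is a small hand-written state machine, not a DFA generated from the TR29 table, and it only covers a subset of the rules (combining marks, variation selectors, emoji skin-tone modifiers, ZWJ sequences, and regional indicator pairs). The ZWJ handling is an approximation, and all function names are made up for the example. It segments forwards and then reverses the list of clusters rather than running a reversed automaton.

```python
import unicodedata

ZWJ = "\u200d"


def is_regional_indicator(ch: str) -> bool:
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def is_extender(ch: str) -> bool:
    """Rough stand-in for 'attaches to the previous symbol'."""
    return (
        unicodedata.combining(ch) != 0          # combining marks
        or ch == ZWJ                            # zero-width joiner
        or "\ufe00" <= ch <= "\ufe0f"           # variation selectors
        or "\U0001F3FB" <= ch <= "\U0001F3FF"   # emoji skin-tone modifiers
    )


def clusters(text: str) -> list[str]:
    """Split text into approximate grapheme clusters (subset of TR29)."""
    result: list[str] = []
    open_flag = False  # True while the last cluster is a lone regional indicator
    for ch in text:
        if result and is_extender(ch):
            result[-1] += ch
            open_flag = False
        elif result and result[-1][-1] == ZWJ:
            # Approximation: glue whatever follows a ZWJ onto the cluster.
            result[-1] += ch
            open_flag = False
        elif result and open_flag and is_regional_indicator(ch):
            result[-1] += ch  # second half of a flag
            open_flag = False
        else:
            result.append(ch)
            open_flag = is_regional_indicator(ch)
    return result


def reverse_graphemes(text: str) -> str:
    """Reverse cluster order while keeping each cluster's internal order."""
    return "".join(reversed(clusters(text)))
```

For example, `reverse_graphemes("a\U0001F1FA\U0001F1F8e\u0301")` keeps the flag and the accented "e" intact while moving them to the front. For real use you would want a complete TR29 implementation rather than this subset.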



