
>NOTE: You can always find a character boundary from an arbitrary point in a stream of octets by moving left an octet each time the current octet starts with the bit prefix 10 which indicates a tail octet. At most you'll have to move left 3 octets to find the nearest header octet.

This is incorrect. You can only find boundaries between code points this way.
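
To make that concrete, here is a minimal Python sketch of the back-scan described in the quoted note. The function name `codepoint_start` is my own, and the sketch assumes the input is valid UTF-8; it is an illustration, not the article's code.

```python
def codepoint_start(data: bytes, i: int) -> int:
    """Return the index of the lead octet of the code point containing data[i].

    Assumes data is valid UTF-8 and 0 <= i < len(data).
    """
    # Tail (continuation) octets have the bit pattern 10xxxxxx.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


# "a" is 1 octet (index 0), "€" is 3 octets (1-3), "😀" is 4 octets (4-7).
text = "a€😀".encode("utf-8")
print(codepoint_start(text, 2))  # 1: start of "€"
print(codepoint_start(text, 7))  # 4: start of "😀", reached after 3 moves left
```

What comes back is the start of a code point, nothing more. Whether that position is also a boundary between user-perceived characters is a separate question.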

Unicode seems cool until you learn that not all "user-perceived characters" (grapheme clusters) can be expressed as a single code point. These UTF-8 explanations explain the encoding but leave out this unfortunate detail. The author might not even know this, because they deal with only a subset of Unicode in their life.

If you want to split text between two user-perceived characters, not in the middle of one, this tutorial does not help.
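
A small Python sketch of the problem, using "e" followed by U+0301 (combining acute accent) as the example string. The variable names are mine, and checking the combining class is only a rough illustration; real grapheme segmentation follows the rules in Unicode Standard Annex #29.

```python
import unicodedata

# One user-perceived character ("é") made of two code points.
s = "e\u0301"
data = s.encode("utf-8")  # b'e\xcc\x81', three octets

# Index 1 holds a lead octet (0xCC), so it is a valid code point boundary.
assert (data[1] & 0xC0) != 0x80

# Splitting there decodes cleanly on both sides...
left = data[:1].decode("utf-8")
right = data[1:].decode("utf-8")

# ...but the right half is a lone combining mark, cut off from its base letter.
splits_grapheme = unicodedata.combining(right) != 0

print(len(s))           # 2 code points
print(repr(left))       # 'e'
print(splits_grapheme)  # True
```

Both halves are valid UTF-8, yet the accent has been separated from the letter it belongs to.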

Unicode encodings are great if you only need to handle a subset of languages and characters; if you want to be complete, it's a mess.




You're right, that should read "codepoint boundary" not "character boundary". I can fix that.

I do briefly mention grapheme clusters near the end, but I didn't want to go into them, as this article was more about the encoding mechanism itself. Maybe a future article after more research :)


Please do. You have the best visualizations of UTF-8 I have seen so far.

Usually people write up just the UTF-8 encoding part and then don't mention the rest of Unicode, because it's clearly not as good and simple.



