This misses the real problem with flag emoji: they are composed of regional indicator codepoints that can pair up in any combination. With other emoji you get a base codepoint with potential combining characters. Using a table of combining character ranges you can skip over them and isolate the logical glyph sequences. You don't need surrounding context to parse them out the way flags do.
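A minimal sketch of that forward scan in Python, using unicodedata.combining() as the "table" of combining characters (the function name simple_clusters is made up here, and this is a simplification of UAX #29; flags, ZWJ sequences and the like are exactly what it does not handle):

    import unicodedata

    def simple_clusters(s):
        # Naive clusters: one base codepoint plus any following combining
        # marks (canonical combining class != 0).
        clusters = []
        i = 0
        while i < len(s):
            j = i + 1
            while j < len(s) and unicodedata.combining(s[j]) != 0:
                j += 1                   # forward scan, no context needed
            clusters.append(s[i:j])
            i = j
        return clusters

    print(simple_clusters("e\u0301a\u0300"))  # two clusters: e+acute, a+grave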
I think that somewhere in this answer lies a reason why Windows still doesn't support flag emoji. I don't count Microsoft Edge as "Windows" in this case, but as Chromium. Windows doesn't render flag emoji in its native text boxes, even though it otherwise supports color emoji.
But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.
Flags are political. Microsoft has removed country borders from its products for political reasons, and a post above says flag rendering was excluded for the same reason.
> But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.
Flags are not that hard; they're a very specific block that combines in a very predictable way. They're little more than ligatures. Family emoji are much harder.
Consider having to split a string of 20 flags in sequence at a given offset. That's 40 regional indicator codepoints with no readily discernible boundaries. To parse that you have to scan backwards to the first non-flag codepoint; otherwise you could split the middle of a flag pair. You also have to handle rendering invalid combinations as two glyphs, and unpaired codepoints. For normal codepoints with combining characters you only scan forwards until you reach a non-combining character.
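Roughly what that backwards scan looks like in Python, handling only regional indicators (U+1F1E6..U+1F1FF); the safe_split helper is invented for illustration, and real code would also deal with ZWJ and tag sequences:

    RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF     # regional indicator symbols A..Z

    def is_ri(ch):
        return RI_FIRST <= ord(ch) <= RI_LAST

    def safe_split(s, offset):
        # Regional indicators pair up from the start of each run, so the only
        # way to know whether `offset` lands between the two halves of a flag
        # is to scan backwards and count the regional indicators before it.
        run = 0
        i = offset
        while i > 0 and is_ri(s[i - 1]):
            run += 1
            i -= 1
        # Odd count and another regional indicator ahead: we're mid-pair.
        if run % 2 == 1 and offset < len(s) and is_ri(s[offset]):
            offset -= 1                      # back up to the pair boundary
        return s[:offset], s[offset:]

    flags = "\U0001F1FA\U0001F1F8" * 3       # three US flags, six codepoints
    left, right = safe_split(flags, 3)       # offset 3 falls inside the 2nd flag
    print(len(left), len(right))             # 2 4 -- moved back to a flag boundary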
> Flags are not that hard; they're a very specific block that combines in a very predictable way.
But before their introduction, you could decide whether there's a grapheme cluster break between two codepoints just by looking at those two codepoints. Now you may need to parse a whole sequence of codepoints to see how the flags pair up.
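A small sketch (Python, helper names invented) of just the regional-indicator part of the UAX #29 rules (GB12/GB13), showing that the same two adjacent codepoints can be a break or not depending on what precedes them:

    RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF     # regional indicator symbols A..Z

    def is_ri(ch):
        return RI_FIRST <= ord(ch) <= RI_LAST

    def ri_break_between(s, i):
        # Is there a grapheme break between s[i] and s[i+1]?  Two adjacent
        # regional indicators stay joined only if an even number of regional
        # indicators precede s[i] in the current run.
        if not (is_ri(s[i]) and is_ri(s[i + 1])):
            return True                      # ignoring all the other rules
        count = 0
        j = i
        while j > 0 and is_ri(s[j - 1]):
            count += 1
            j -= 1
        return count % 2 == 1                # odd: s[i] already closes a flag

    US = "\U0001F1FA\U0001F1F8"              # regional indicators U, S
    DE = "\U0001F1E9\U0001F1EA"              # regional indicators D, E
    print(ri_break_between(US + DE, 1))      # True: S closes the US flag, D starts DE
    print(ri_break_between(US[1] + DE, 0))   # False: here the same S pairs with D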
Nope, because the repurposing is independent of how the Unicode is represented. There's absolutely no advantage to having a string in UTF-32 over UTF-8: you still need to examine every character, and the added overhead of decoding UTF-8 bytes into 32-bit code points is far outweighed by the huge memory increase needed to store UTF-32.
What's more, it's really not that difficult to start at the end of a valid UTF-8 string and read the code points in reverse order. UTF-8 is well designed in that respect: there's never any ambiguity about whether you're looking at the first byte of a code point.
> UTF-8 is well designed in that respect: there's never any ambiguity about whether you're looking at the first byte of a code point.
To expand: if the most significant bit of a byte is 0, it's an ASCII codepoint. If the top two bits are '10', it's a continuation byte, and if they're '11', it's the lead byte of a multibyte codepoint (the number of leading 1 bits tells you how many bytes the sequence has, which makes skipping and counting codepoints easy).
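A hedged sketch of that classification in Python (the helper name is made up):

    def classify_utf8_byte(b):
        # Classify a single byte of UTF-8 by its leading bits.
        if b & 0b10000000 == 0:
            return "ascii", 1                # 0xxxxxxx: a whole codepoint
        if b & 0b11000000 == 0b10000000:
            return "continuation", 0         # 10xxxxxx: never a start byte
        if b & 0b11100000 == 0b11000000:
            return "lead", 2                 # 110xxxxx: 2-byte sequence
        if b & 0b11110000 == 0b11100000:
            return "lead", 3                 # 1110xxxx: 3-byte sequence
        if b & 0b11111000 == 0b11110000:
            return "lead", 4                 # 11110xxx: 4-byte sequence
        return "invalid", 0

    data = "flag: \U0001F1FA\U0001F1F8".encode("utf-8")
    # Counting codepoints is just counting the non-continuation bytes.
    print(sum(1 for b in data if b & 0b11000000 != 0b10000000))   # 8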
So a naive codepoint reversal algorithm starts at the end and moves backwards until it sees either an ASCII byte or the lead byte of a multibyte codepoint. Upon reaching one, it appends those 1-4 bytes, in their original order, to a new buffer, then continues until it reaches the start of the input.
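Something like this in Python, working on the raw bytes (function name invented; note it reverses codepoints, not grapheme clusters, so combining marks and flag pairs still come out scrambled, which is the thread's point):

    def reverse_codepoints(data: bytes) -> bytes:
        # Naive reversal of a valid UTF-8 byte string, codepoint by codepoint.
        out = bytearray()
        end = len(data)
        while end > 0:
            start = end - 1
            while data[start] & 0b11000000 == 0b10000000:   # continuation byte
                start -= 1                                  # keep walking back
            out += data[start:end]      # append this codepoint, bytes in order
            end = start
        return bytes(out)

    print(reverse_codepoints("héllo".encode("utf-8")).decode())   # olléh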