This misses the real problem with flag emoji: they are composed of regional indicator codepoints that can pair up in any combination. With other emoji you get a base codepoint with potential combining characters. Using a table of combining character ranges you can skip over them and isolate the logical glyph sequences. You don't need surrounding context to parse them out the way flags do.
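A minimal sketch of that forward scan in Python, using unicodedata.combining() as the "table" of combining characters (the function name simple_clusters is made up here, and this is a simplification of UAX #29; flags, ZWJ sequences and the like are exactly what it does not handle):

    import unicodedata

    def simple_clusters(s):
        # Naive clusters: one base codepoint plus any following combining
        # marks (canonical combining class != 0).
        clusters = []
        i = 0
        while i < len(s):
            j = i + 1
            while j < len(s) and unicodedata.combining(s[j]) != 0:
                j += 1                   # forward scan, no context needed
            clusters.append(s[i:j])
            i = j
        return clusters

    print(simple_clusters("e\u0301a\u0300"))  # two clusters: e+acute, a+grave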
I think that somewhere in this answer lies a reason why Windows still doesn't support flag emoji. I don't count Microsoft Edge as "Windows" in this case, but as Chromium. Windows doesn't render flag emoji in its native text boxes, even though it otherwise supports color emoji.
But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.
Flags are political. Microsoft has removed country borders from its products for political reasons, and a post above says flag rendering was excluded for the same reason.
> But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.
Flags are not that hard; they're a very specific block that combines in a very predictable way. They're little more than ligatures. Family emoji are much harder.
Consider having to split a string of 20 flags in sequence at a given offset. That's 40 regional indicator codepoints with no readily discernible boundaries. To parse that you have to scan backwards to the first non-flag codepoint; otherwise you could split the middle of a flag pair. You also have to handle rendering invalid combinations as two glyphs, and unpaired codepoints. For normal codepoints with combining characters you only scan forwards until you reach a non-combining character.
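Roughly what that backwards scan looks like in Python, handling only regional indicators (U+1F1E6..U+1F1FF); the safe_split helper is invented for illustration, and real code would also deal with ZWJ and tag sequences:

    RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF     # regional indicator symbols A..Z

    def is_ri(ch):
        return RI_FIRST <= ord(ch) <= RI_LAST

    def safe_split(s, offset):
        # Regional indicators pair up from the start of each run, so the only
        # way to know whether `offset` lands between the two halves of a flag
        # is to scan backwards and count the regional indicators before it.
        run = 0
        i = offset
        while i > 0 and is_ri(s[i - 1]):
            run += 1
            i -= 1
        # Odd count and another regional indicator ahead: we're mid-pair.
        if run % 2 == 1 and offset < len(s) and is_ri(s[offset]):
            offset -= 1                      # back up to the pair boundary
        return s[:offset], s[offset:]

    flags = "\U0001F1FA\U0001F1F8" * 3       # three US flags, six codepoints
    left, right = safe_split(flags, 3)       # offset 3 falls inside the 2nd flag
    print(len(left), len(right))             # 2 4 -- moved back to a flag boundary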
> Flags are not that hard; they're a very specific block that combines in a very predictable way.
But before their introduction, you could decide whether there's a grapheme cluster break between two codepoints just by looking at those two codepoints. Now you may need to parse a whole sequence of codepoints to see how the flags pair up.
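A small sketch (Python, helper names invented) of just the regional-indicator part of the UAX #29 rules (GB12/GB13), showing that the same two adjacent codepoints can be a break or not depending on what precedes them:

    RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF     # regional indicator symbols A..Z

    def is_ri(ch):
        return RI_FIRST <= ord(ch) <= RI_LAST

    def ri_break_between(s, i):
        # Is there a grapheme break between s[i] and s[i+1]?  Two adjacent
        # regional indicators stay joined only if an even number of regional
        # indicators precede s[i] in the current run.
        if not (is_ri(s[i]) and is_ri(s[i + 1])):
            return True                      # ignoring all the other rules
        count = 0
        j = i
        while j > 0 and is_ri(s[j - 1]):
            count += 1
            j -= 1
        return count % 2 == 1                # odd: s[i] already closes a flag

    US = "\U0001F1FA\U0001F1F8"              # regional indicators U, S
    DE = "\U0001F1E9\U0001F1EA"              # regional indicators D, E
    print(ri_break_between(US + DE, 1))      # True: S closes the US flag, D starts DE
    print(ri_break_between(US[1] + DE, 0))   # False: here the same S pairs with D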
Nope, because the repurposing is independent of how the Unicode is represented. There's absolutely no advantage to having a string in UTF-32 over UTF-8: you still need to examine every character, and the added overhead of decoding UTF-8 bytes into 32-bit code points is far outweighed by the huge memory increase needed to store UTF-32.
What's more, it's really not that difficult to start at the end of a valid UTF-8 string and read the code points in reverse order. UTF-8 is well designed in that respect: there's never any ambiguity about whether you're looking at the first byte of a code point.
> UTF-8 is well designed in that respect: there's never any ambiguity about whether you're looking at the first byte of a code point.
To expand: if the most significant bit of a byte is 0, it's an ASCII codepoint. If the top two bits are '10', it's a continuation byte, and if they're '11', it's the lead byte of a multibyte codepoint (the number of leading 1 bits tells you how many bytes the sequence has, which makes skipping and counting codepoints easy).
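A hedged sketch of that classification in Python (the helper name is made up):

    def classify_utf8_byte(b):
        # Classify a single byte of UTF-8 by its leading bits.
        if b & 0b10000000 == 0:
            return "ascii", 1                # 0xxxxxxx: a whole codepoint
        if b & 0b11000000 == 0b10000000:
            return "continuation", 0         # 10xxxxxx: never a start byte
        if b & 0b11100000 == 0b11000000:
            return "lead", 2                 # 110xxxxx: 2-byte sequence
        if b & 0b11110000 == 0b11100000:
            return "lead", 3                 # 1110xxxx: 3-byte sequence
        if b & 0b11111000 == 0b11110000:
            return "lead", 4                 # 11110xxx: 4-byte sequence
        return "invalid", 0

    data = "flag: \U0001F1FA\U0001F1F8".encode("utf-8")
    # Counting codepoints is just counting the non-continuation bytes.
    print(sum(1 for b in data if b & 0b11000000 != 0b10000000))   # 8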
So a naive codepoint reversal algorithm starts at the end and moves backwards until it sees either an ASCII byte or the lead byte of a multibyte codepoint. Upon reaching one, it appends those 1-4 bytes, in their original order, to a new buffer, then continues until it reaches the start of the input.
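Something like this in Python, working on the raw bytes (function name invented; note it reverses codepoints, not grapheme clusters, so combining marks and flag pairs still come out scrambled, which is the thread's point):

    def reverse_codepoints(data: bytes) -> bytes:
        # Naive reversal of a valid UTF-8 byte string, codepoint by codepoint.
        out = bytearray()
        end = len(data)
        while end > 0:
            start = end - 1
            while data[start] & 0b11000000 == 0b10000000:   # continuation byte
                start -= 1                                  # keep walking back
            out += data[start:end]      # append this codepoint, bytes in order
            end = start
        return bytes(out)

    print(reverse_codepoints("héllo".encode("utf-8")).decode())   # olléh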