I'm in the process of designing a scripting language and implementing it in C++. I plan to put together a YouTube series about it. (Doesn't everyone want to see Bison and Flex mixed with proper unit tests and C++20 code?)
Due to my future intended use case, I needed good support for Unicode. I thought that I could write it myself, and I was wrong. I wasted two weeks (in my spare time, mostly evenings) trying to cobble together things that should work, identifying patterns, figuring out how to update it as Unicode itself is updated, thinking about edge cases, i18n, zalgo text, etc. And then I finally reached the point where I knew enough to know that I was making the wrong choice.
I'm now using ICU. (https://icu.unicode.org/) It's huge, it was hard to get it working in my environment, and there are very few examples of its usage online, but after the initial setup dues are paid, it WORKS.
Aside: Yes, I know I'm crazy for implementing a programming language that I intend for serious usage. Yes, I have good reasons for doing it, and yes I have considered alternatives. But it's fun, so I'm doing it anyways.
Moral of the story: Dealing with Unicode is hard, and if you think it shouldn't be that hard, then you probably don't know enough about the problem!
Handling unicode can be fine, depending on what you're doing. The hard parts are:
- Counting, rendering and collapsing grapheme clusters (like the flag emoji)
- Converting between legacy encodings (Shift JIS, KOI8, etc.) and UTF-8 / UTF-16
- Canonicalization
If all you need is to deal with utf8 byte buffers, you don't need all that stuff. And your code can stay simple, small and fast.
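For a rough feel of what those hard parts look like in practice, here is a small Python sketch (the grapheme counting assumes the third-party `regex` module; the rest is stdlib):

```
import unicodedata
import regex  # third-party; supports \X for extended grapheme clusters

# Grapheme clusters: a flag is two code points but one user-perceived character.
flag = "\U0001F1FA\U0001F1F8"  # 🇺🇸 regional indicators U + S
print(len(flag), len(regex.findall(r"\X", flag)))  # 2 1 (recent regex versions pair RIs per UAX #29)

# Legacy encodings: conversion is a decode/encode pair once you know the source charset.
sjis_bytes = "こんにちは".encode("shift_jis")
utf8_bytes = sjis_bytes.decode("shift_jis").encode("utf-8")

# Canonicalization: pick a normalization form before comparing or hashing.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")  # True
```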
IIRC the rust standard library doesn't bother supporting any of the hard parts in unicode. The only real unicode support in std is utf8 validation for strings. All the complex aspects of unicode are delegated to 3rd party crates.
By contrast, nodejs (and web browsers) do all of this. But they implement it in the same way you're suggesting - they simply call out to libicu.
> The only real unicode support in std is utf8 validation for strings.
Rust's core library gives char methods such as is_numeric, which asks whether this Unicode codepoint is in one of Unicode's numeric categories, such as the letter-like numerics and the various digit sets. (Rust does provide char with is_ascii_digit and is_ascii_hexdigit if that's all you actually cared about.)
So yes, the Rust standard library is carrying around the entire Unicode character-class table, among other things. Of course, Rust's library isn't all built into a vast binary; if you never use these features, your binary doesn't get that code.
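For a feel of what those numeric classes cover, here is the analogous picture in Python (just the standard str methods and unicodedata, nothing Rust-specific):

```
import unicodedata

# ASCII digit, Arabic-Indic digit, Roman numeral, vulgar fraction
for ch in ["7", "\u0663", "\u2167", "\u00bd"]:
    print(ch,
          ch.isdecimal(),            # usable in base-10 numbers
          ch.isdigit(),              # digits, including e.g. superscripts
          ch.isnumeric(),            # anything with a numeric value
          unicodedata.numeric(ch))   # that value, from the Unicode database
```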
It always feels like the most work goes to the least-used emoji. So many revisions and additions to the family emoji, and yet it’s one of the ones I don’t recall anyone ever using.
I think the trap Unicode got into is that technically they can have infinite emoji, so they just don’t ever have a way to say no to new proposals.
> It always feels like the most work goes to the least-used emoji.
I always feel like those emoji were added on purpose in order to force implementations to fix their unicode support. Before emoji were added, most software had completely broken support for anything beyond the BMP (case study: MySQL's so-called "UTF8" encoding). The introduction of emoji, and their immediate popularity, forced many systems to better support astral planes (that is officially acknowledged: https://unicode.org/faq/emoji_dingbats.html#EO1)
Progressively, emoji using more advanced features got introduced, which forced systems (and developers) to fix their unicode handling, or at least improve it somewhat: e.g. skin tones as combining code points, etc.
> I think the trap Unicode got into is that technically they can have infinite emoji, so they just don’t ever have a way to say no to new proposals.
You should try to follow a new character through the process, because that's absolutely not what happens and shepherding a new emoji through to standardisation is not an easy task. The unicode consortium absolutely does say no, and has many reasons to do so. There's an entire page on just proposal guidelines (https://unicode.org/emoji/proposals.html), and following it does not in any way ensure it'll be accepted.
WTF business do emojis have in Unicode? The BMP is all there ever should have been. Standardize the actual writing systems of the world, so everyone can write in their language. And once that is done, the standard doesn't need to change for a hundred years.
What we need now is a standardized, sane subset of Unicode that implementations can support while rejecting the insane scope creep that got added on top of that. I guess the BMP is a good start, even though it already contains superfluous crap like "dingbats" and boxes.
Unicode didn't invent emoji; it incorporated them because they were already popular in Japan, and not incorporating them would have greatly reduced Japanese adoption.
Keep in mind that Unicode was intended to unify all the disparate encodings that had been brewed up to support different languages and which made exchanging documents between non-English speaking countries a nightmare. The term "mojibake" comes to mind [0] - Japan alone had so many encodings that a slang term for text encoded with something different than what your device expected (and subsequently got rendered as nonsensical/garbled text) came about. And they weren't alone, of course [1].
> What we need now is a standardized, sane subset of Unicode that implementations can support while rejecting the insane scope creep that got added on top of that.
Unicode wasn't intended to be pretty. It was intended to be the one system that everyone used, and a way to increase adoption was to do some less than ideal things, like duplicate characters (so it would be easier to convert to Unicode).
You may never need anything outside the BMP, but that doesn't make the rest of the planes worthless. Ignoring the value of including dead and nearing-extinct languages for preservation purposes (not being able to type a language will basically guarantee its extinction, with inventing a new encoding and storing text as jpgs being the only real alternatives), there are a lot of people speaking languages found in the SMP [2][3] ([2] has 83 million native speakers, for example).
> The term "mojibake" comes to mind [0] - Japan alone had so many encodings that a slang term for text encoded with something different than what your device expected (and subsequently got rendered as nonsensical/garbled text) came about.
Mojibake was not a "Japan has too many encodings" problem. It was a "western developers assume everyone is using CP1252" problem.
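That failure mode is trivial to reproduce; a quick Python sketch:

```
# Shift JIS bytes decoded as CP1252: classic mojibake.
sjis = "こんにちは".encode("shift_jis")
print(sjis.decode("cp1252", errors="replace"))  # renders as something like ‚±‚ñ‚É‚¿‚Í
```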
> Unicode wasn't intended to be pretty. It was intended to be the one system that everyone used, and a way to increase adoption was to do some less than ideal things, like duplicate characters (so it would be easier to convert to Unicode).
Unfortunately they undermined all that with Han Unification, with the result that it's never going to be adopted in Japan.
Mojibake is a universal problem whenever multiple charsets are in use and no charset is specified in metadata. Software guesses the charset, but it's just a guess. Japanese-locale software occasionally confuses Latin-1 with SJIS, but more often confuses SJIS with EUC-JP or UTF-8.
Unicode/UTF-8 is widely adopted and recommended in Japan, and there is no widely used alternative. Japanese companies tend to still use SJIS, but that's just laziness. Han unification isn't a problem if you only handle Japanese text: just use a Japanese font everywhere. Handling multi-language text is a pain, but there are no alternatives anyway.
> Mojibake is a universal problem whenever multiple charsets are in use and no charset is specified in metadata. Software guesses the charset, but it's just a guess. Japanese-locale software occasionally confuses Latin-1 with SJIS, but more often confuses SJIS with EUC-JP or UTF-8.
In theory it can happen with any combination of character sets, sure, but in practice every example of mojibake I've seen has been SJIS (or UTF-8) encoded text being decoded as CP1252 ("Latin-1" but that's an ambiguous term) by software that assumed the whole world used that particular western codepage. If you've got examples of SJIS vs EUC-JP confusion in the wild I'd be vaguely interested to see them (is there even anywhere that still uses EUC-JP?)
> Japanese companies tend to still use SJIS, but that's just laziness.
It's not just laziness; switching to unicode is a downgrade, because in practice it means you're going to get your characters rendered wrongly (Chinese style) for a certain percentage of your customers, for little clear benefit.
> Handling multi-language text is a pain, but there are no alternatives anyway.
Given that you have to build a structure that looks like a sequence of spans with language metadata attached to each one, there's not much benefit to using unicode versus letting each span specify its own encoding.
Maybe the guessing order reasonably depends on locale. My GP comment is based on my experience mainly with old ja-JP-locale Windows software. IIRC Unix software tends not to be good at guessing, so maybe that's what you're referring to.
Nowadays I rarely see new EUC-JP content (or maybe I just don't recognize it), but I still sometimes hit mojibake in Chrome when visiting old homepages (roughly once a month). For web pages it mostly doesn't come up anyway: most modern pages (including SJIS ones) don't rely on guessing but have a <meta charset> tag, so mojibake very rarely happens. For plain-text files, I still see UTF-8 files shown as SJIS in Windows Chrome.
Viewing Japanese-only UTF-8 text is totally fine on Japanese-locale Windows/Mac (and Linux, but YMMV). So your case is viewing the text on a non-Japanese locale. That can indeed be a problem, but how did SJIS solve it? What software switches fonts when it opens an SJIS file? Does the app/format not support specifying font/language, like HTML or Word do?
I believe no developer wants to deal with a foreign charset like GBK/Big5/whatever; there is very little information about them. If a developer can switch the charset used to read a file, they can also switch the font.
> Viewing Japanese-only UTF-8 text is totally fine on Japanese-locale Windows/Mac (and Linux, but YMMV). So your case is viewing the text on a non-Japanese locale.
The issue is that realistically a certain proportion of customers are going to have the wrong locale setting or wrong default font set.
> That can indeed be a problem, but how did SJIS solve it? What software switches fonts when it opens an SJIS file? Does the app/format not support specifying font/language, like HTML or Word do?
Certainly Firefox will use a Japanese font by default for SJIS whereas it will use a generic (i.e. Chinese) font by default for UTF-8. I would expect most encoding-aware programs would do the same?
> If a developer can switch the charset used to read a file, they can also switch the font.
Sure, but it works both ways. And it's actually much easier for a lazy developer to ignore the font case because it's essentially only an issue for Japan. Whereas if you make a completely encoding-unaware program it will cause issues in much of Europe and all of Asia (well, it did pre-UTF8 anyway).
I think by far the largest contributor to coining "mojibake" was e-mail MTAs. Some e-mail implementations assumed 7-bit ASCII for all text and dropped the MSB of 8-bit SJIS/Unicode/etc., ending up as corrupt text at the receiving end. Next up were texts written in EUC (Extended UNIX Code)-JP, probably by someone running either a real Unix (likely a Solaris) or early GNU/Linux, and floppies from a classic MacOS computer. Those cases must have defined the term, and various edge cases on the web, like header-encoding mismatches, popularized it.
"Zhonghua fonts" issue is not necessarily linked to encoding, it's an issue about assuming or guessing locales - that has to be solved by adding a language identifier or by ending han unification.
> Unfortunately they undermined all that with Han Unification, with the result that it's never going to be adopted in Japan.
This is an absolute shame, and there is no excuse for not fixing it, so that variations of unified characters can be encoded, before adding unimportant things like skin tones.
> So rather than treat the issue as a rich text problem of glyph alternates, Unicode added the concept of variation selectors, first introduced in version 3.2 and supplemented in version 4.0.[10] While variation selectors are treated as combining characters, they have no associated diacritic or mark. Instead, by combining with a base character, they signal the two character sequence selects a variation (typically in terms of grapheme, but also in terms of underlying meaning as in the case of a location name or other proper noun) of the base character. This then is not a selection of an alternate glyph, but the selection of a grapheme variation or a variation of the base abstract character. Such a two-character sequence however can be easily mapped to a separate single glyph in modern fonts. Since Unicode has assigned 256 separate variation selectors, it is capable of assigning 256 variations for any Han ideograph. Such variations can be specific to one language or another and enable the encoding of plain text that includes such grapheme variations. - https://en.m.wikipedia.org/wiki/Han_unification
This is what you’re asking for, right? Control characters that designate which version of a unified character is to be displayed.
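As a minimal Python sketch of what such a sequence looks like at the code-point level (whether this particular base + selector pair is registered in the Ideographic Variation Database is beside the point; treat it as illustrative):

```
base = "\u8fbb"        # 辻, a Han ideograph with regional one-dot/two-dot glyph variants
vs17 = "\U000E0100"    # VARIATION SELECTOR-17, the first ideographic variation selector

seq = base + vs17
print(len(seq))        # 2 -- still two code points in the underlying text
# A font that knows the registered sequence can render it as a single glyph in the
# requested variant form; a font that doesn't simply shows the base glyph.
```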
Have emoji not become part of our writing structure though? A decent percentage of online chats and comments, especially on social networks, includes at least one emoji that couldn't be easily or accurately represented in the regular written language.
Recently implementers of unicode have censored the gun emoji in a way that changes the meaning of many existing online chats and comments. So you can't easily or accurately represent things even with unicode.
Emoji have never been effective or consistent at conveying meaning; at best they convey something to someone in the same subculture and time period, and often not even that. Given that unicode implementers are ok with erasing the meaning of some of them, it should be ok to eliminate more of them.
> Emoji have never been effective or consistent at conveying meaning; at best they convey something to someone in the same subculture and time period
Isn't that the same with all words though? Think how much English usage changes in a generation. For instance, my girlfriend will use the term "I'm dead!" in a similar context to where I would say "LOL" and where my father would have said "What the fuck is loll?"
There's a spectrum. Subculture-specific slang changes quickly, but most words have a longer lifetime; reading Chaucer today is difficult but doable. Given that we don't encode words but only letters, for English you have to go back to the disappearance of þ to get a change that's relevant to text encoding. Emoji shift faster and are less effective at conveying meaning than any "real" language.
This argument was lost the moment Unicode was created. Japanese carriers had created their own standard for emoji encoding for sms. And they would not switch to Unicode unless the emoji were ported over.
It’s a tricky situation. Maybe allowing an arbitrary bitmap char to represent any emoji would have been better but then we could have ended up in a situation where normal text or meaningful punctuation or perhaps even fonts would get encoded as bitmaps.
For something like a face or hand gesture, a bitmap likely would have been better since it would at least look the same on all platforms.
I don't think that argument holds water. Emoji could just as well have been encoded as markup. There were for instance long-established conventions of using strings starting with : and ; . Bulletin boards extended that to a convention using letters delimited by : for example :rolleyes: . Not to mention that those codes can be typed more efficiently than browsing in an Emoji Picker box.
Because emoji became characters, text rendering and font formats had to be extended to support them.
There are four different ways to encode emoji in OpenType 1.8:
* Apple uses embedded PNG
* Google uses embedded colour bitmaps
* Microsoft uses flat glyphs in different colours layered on top of one another
* Adobe and Mozilla use embedded SVG
> Emoji could just as well have been encoded as markup.
They could have, but they were already being encoded as character codepoints in existing charactersets. So any character encoding scheme that wanted to replace all use cases for existing charactersets needed to match that. If switching charactersets meant you lost the ability to use emoji until you upgraded all your applications to support some markup format, people would just not switch.
> If switching charactersets meant you lost the ability to use emoji until you upgraded all your applications to support some markup format, people would just not switch.
You need to upgrade those applications to support Unicode too.
Not necessarily, most applications already supported multiple encodings, having the OS implement one of the unicode encodings was often all that was needed.
I'd say the important part was that Japanese carriers were weaponizing flip-phone culture to gatekeep "PCs" and open-standard smartphones out of their microtransaction ecosystem. Emoji was one of the keys to disproving the FUD that the iPhone couldn't be the equal of flip phones, and to establishing first-class-citizen status.
You are underestimating how much language evolves. In fact, you are proposing brakes to stop it evolving. If nothing else, new currency symbols need to be incorporated every few years. The initial emoji were part of the actual writing systems of the world, even if they were relatively new and only being used by foreigners. Or maybe they have been part of world culture since the 1950s :-) ? https://en.wikipedia.org/wiki/Smiley
Exactly this. Humans have incredibly complicated writing systems, and all Unicode wants to do is encode them all. Keep in mind that the trivial toy system we're more familiar with, ASCII, already has some pretty strange features because even to half-arse one human writing system they needed those features.
Case is totally wild, it only applies to like 5% of the symbols in ASCII, but in the process it means they each need two codepoints and you're expected to carry around tech for switching back and forth between cases.
And then there are several distinct types of white space, each gets a codepoint, some of them try to mess with your text's "position" which may not make any sense in the context where you wanted to use it. What does it mean to have a "horizontal tab" between two parts of the text I wanted to draw on this mug? I found a document which says it is the same as "eight spaces" which seems wrong because surely if you wanted eight spaces you'd just write eight spaces.
And after all that ASCII doesn't have working quotation marks, it doesn't understand how to spell a bunch of common English words like naïve or café, pretty disappointing.
This work wasn't done for emoji. They use the same zero-width joiner character [1] that exists to support Indic scripts like Devanagari, and any system that properly handles these languages will also properly handle the emoji.
Yes, this adds a lot of complexity, but it's really a question of whether that complexity is justified in order to support all of the world's languages. And I think many would argue that it is.
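For illustration, here is what one of those ZWJ sequences looks like at the code-point level (a quick Python sketch):

```
import unicodedata

# man + ZWJ + woman + ZWJ + girl: five code points, one rendered family emoji
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(family))  # 5
print([unicodedata.name(c) for c in family])
# ['MAN', 'ZERO WIDTH JOINER', 'WOMAN', 'ZERO WIDTH JOINER', 'GIRL']
```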
I know how that feels. I wrote a little C++ program to fetch Unicode data from a DB and then normalize it to ASCII for analytics purposes. It's a lot faster to do it on ASCII than trying to handle all the fun cases of how many ways an e (etc.) can be input. ICU to the rescue! It took a couple of weeks of getting up to speed, though ICU itself wasn't too bad to figure out. But you find out very quickly that to use it, you need a good understanding of a number of the Unicode technical reports to actually understand how to make use of it. Fun times indeed.
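For anyone curious, a rough Python analogue of that idea (not the poster's actual ICU/C++ code) is to decompose and strip the combining marks:

```
import unicodedata

def to_ascii(text: str) -> str:
    # NFKD splits 'é' into 'e' + COMBINING ACUTE ACCENT; the ASCII encode drops the mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Café, naïve, résumé"))  # Cafe, naive, resume
```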
Do you have a YouTube for people to subscribe to in anticipation of you releasing your YouTube series about your work? The development processes of new languages is so intriguing.
It would actually be pretty interesting to see how you use Bison and Flex with utf-8.
Most resources say to not bother due to lack of support for Unicode, but they're so ubiquitous
Do they need special support for UTF-8? One of the nice things about UTF-8 is that you can treat it as an 8-bit encoding in many cases if you only care about substrings and don't need to decode individual non-ASCII characters.
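A quick sketch of why that works: every byte below 0x80 in UTF-8 really is that ASCII character (lead and continuation bytes of multi-byte sequences are always >= 0x80), so byte-level splitting and searching on ASCII delimiters is safe without decoding:

```
line = "café,naïve,東京".encode("utf-8")
print(line.split(b","))             # three fields, multi-byte sequences untouched
print(b"," in "é".encode("utf-8"))  # False: no multi-byte sequence contains an ASCII byte
```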
poor man gave me and many others something like half of our introduction to computer science, but has gotten far more fame as the "emoji guy" for his repeated bouts with this particular part of unicode :)
This reminds me of an interesting bug I saw where I was seeing a strange flag in some Arabic text. However when I copied the string and pasted it into a text editor, the flag of Saudi Arabia appeared instead (which made much more sense). After some vexillologic research on Wikipedia I identified the original flag as American Samoa and it suddenly all made sense. Turns out some broken RTL support was flipping the SA into AS at presentation.
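For the curious, that flip is easy to see at the code-point level (Python sketch):

```
S, A = "\U0001F1F8", "\U0001F1E6"   # REGIONAL INDICATOR SYMBOL LETTER S / A

saudi_arabia = S + A     # 🇸🇦 (region code "SA")
american_samoa = A + S   # 🇦🇸 (region code "AS")

# Reversing the two code points turns one flag into the other:
print(saudi_arabia[::-1] == american_samoa)  # True
```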
Maybe I'm missing some prerequisite knowledge here, but why would I assume `flag="us"` is an emoji? Looking at that first block of code, there is no reason for me to think "us" is a single character.
Edit: Turns out my browser wasn't rendering the flags.
In Windows Chrome, it doesn't render the emoji for me. In Android Chrome, it renders a flag emoji - not the raw region indicators (which look like the letters "u" and "s").
In my browser (Firefox on Windows), the thing between the quotes in the first block of code looks like a picture of the US flag cropped to a circle, not like the characters "us".
Ah I see, I just opened it in firefox. It looks like some JS library is not getting loaded in Edge. The author was talking about "us", "so", etc. looking like one character and I thought I was going crazy, lol.
I don't think that's about a JS library. Firefox bundles an emoji font that supports some things -- such as the flags -- that aren't supported by Segoe UI Emoji on Windows, so it has additional coverage for such character sequences.
If it's Windows, it doesn't actually use flags for those emojis, it renders a country code instead. If it wasn't supported you would just see the glyph for an unknown character.
The reason was because they didn't want to be caught up in any arguments about what flag to render for a country during any dispute, as with, e.g. the flag for Afghanistan after the Taliban took control.
Do you have a citation for that? I suspected it was because of the political issues, so I tried hunting down the reason one day and came up blank.
[Microsoft had this same issue with the timezone map in Windows. The early versions were cool and had country borders, but then I think it was India/Pakistan threw a fit and it was simplified to take the borders out]
What I'd like to know is, given the explosion of the character set for emoji, does the rationale for Han unification still make sense? The case for not allowing national variants seems less and less compelling with every emoji they add.
This is a bit of a hobby horse, but imagine if every time you read an article in English on your phone some of the letters were replaced with "equivalent" Greek or Cyrillic one and you can get an idea of the annoyance. Yeah, you can still read it with a bit of thought, but who wants to read that way?
I agree that Han unification was an unfortunate design decision, but I'd argue that the consortium is following a consistent approach to the Han unification with emoji. For example, they treat "regional" vendor variations in emoji as a font issue. If you get a message with the gun emoji, unless you have out-of-band information regarding which vendor variant is intended, there's no way in software to know if it should be displayed as a water gun (Apple "regional" variant) or a weapon (other vendor variants). Which is not that different from a common problem stemming from Han unification.
I don't disagree, but my point is more that their concern was about having "too many characters" in Unicode, which no longer seems to be a real concern, so what would be the harm of adding national variants?
Having skin tone variants (which is something Unicode chose to add, rather than added because of existing use) is consistent with not having distinct variants for glyphs from different languages?
Han unification was an attempt to fit CJK characters into the 16-bit BMP. In the end the BMP failed anyway, so the unification is pointless, but reverting it would also produce huge compatibility issues.
The new characters would have the same glyphs as the old characters. That's the nightmare. For example, I can't find the old one by searching for the new one, and it's hard for ordinary users to understand why. Should all software support searching by both characters? I don't expect every Western developer to take care of that. Equality comparison also fails without special support.
That is a bad excuse, since it would preclude adding any new characters for existing languages. Would you have made the same objection to U+1E9E "ẞ", which was added in 2008?
Also, equality comparison already requires special support, e.g. normalization before comparison.
Sure, there would be a period where software support is incomplete, but that is a bad reason to keep things broken forever.
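For example, a quick Python sketch of the normalization point:

```
import unicodedata

composed = "\u00e9"       # 'é' as one code point
decomposed = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT

print(composed == decomposed)                      # False
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))    # True
```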
> were replaced with "equivalent" Greek or Cyrillic one
The subset of equivalent letters, or different ones? If they looked the same, it wouldn't bother me if the letters in the center were a single codepoint between European languages:
The problem is they don't look the same. So imagine, for instance, Я instead of "R" or И instead of "N" (I don't think the sounds are actually equivalent but let's run with it for the sake of example). Not insurmountable. One could still read a text with these substitutions. But it'd be distracting, and extra detrimental for people who don't speak English as their first language.
To an extent that's true, but introducing national variant characters in addition to the unified ones would at least allow careful writers to avoid the problem.
Exactly, this is not rocket science: introduce variants of the affected characters in Unicode (either variation selectors or distinct codepoints; it doesn't matter too much, but variation selectors would allow falling back to the old context-based detection). Then wait for software to be updated to use the variants based on the input language. This lets the writer verify the variant used, which will then be the same in all contexts.
But you can, and did, reverse a string. It seems you would need more details, such as a request to reverse the meaning or interpretation of the string, which is what the author is getting at.
If someone challenges you to reverse an image, what do you do? Do you invert the colors? Mirror horizontally? Mirror vertically? Just reverse the byte order?
There's a specification problem here. I like to say that a "string" isn't a data structure, it's the absence of one. Discussing "strings" is pointless. It follows that comparing programming languages by their "string" handling is likewise pointless.
Case in point: a "struct" in languages like C and Rust is literally a specification of how to treat segments of a "string" of contiguous bytes.
In languages like C, “string” isn’t a proper data structure; it’s a `char` array, which itself is little more than an `int` array or `byte` array.
But these languages don’t provide true “string” support. They just have a vaguely useful type alias that renames a byte array to a char array, and a bunch of byte array functions that have been renamed to sound like string functions. In reality all the language supports are byte arrays, with some syntactical sugar so you can pretend they’re strings.
Newer languages, like Go and Python 3, that were created in a world of Unicode, provide true string types, where the type primitives properly deal with the idea of variable-length characters and provide tools to make it easy to manipulate strings and characters as independent concepts. If you want to ignore Unicode, because your specific application doesn’t need to understand it, then you need to cast your strings into byte arrays, and all pretence of true string manipulation vanishes at the same time.
This is not to say that C can’t handle Unicode; it’s just that the language doesn’t provide true primitives for manipulating strings and instead relies on libraries to provide that functionality, which is a perfectly valid approach. Just as baking more complex string primitives into your language is also a perfectly valid approach. It’s just a question of trade-offs and use cases, i.e. the problem at the heart of all good engineering.
Having your strings be conceptually made up of UTF-8 code units makes them no less strings than those made up of Unicode code points. As this article shows, working with code points is often not the right abstraction anyway, and you need to go all the way up to grapheme clusters to have anything close to what someone would intuitively call a character. Calling a code point a character is no more correct or useful than calling a code unit a char.
All you gain by having Unicode code point strings is the illusion of Unicode support until you test anything that uses combining characters or variation selectors. In essence, languages opting for such strings are making the same mistake as Windows/Java/etc. did when adopting UTF-16.
You qualified "string" with "ASCII", and also tacitly admitted you still need more information than the octets themselves--the length.
Of course, various programming languages have primitives and concepts which they may label "string". But you still need to specify that context, drawing in the additional specification those languages provide. Plus, traditionally and in practice, such concepts often serve the function of importing or exporting unstructured data. So even in the context of a specific programming language, the label "string" is often used to elide details necessary to understanding the content and semantics of some particular chunk of data.
We would all be better off if this were actually true.
Tragically, in C, a string is just barely a data structure, because it must have \0 at the end.
If it were the complete absence of a data structure, we would need some way to get at the length of it, and could treat a slice of it as the same sort of thing as the thing itself.
C doesn’t really have strings at all. It has char pointers, and some standard functions that take a char pointer and act on all the chars starting from that pointer up to the first \0.
When you’re handling any kind of C pointer you need to know how big the buffer is around that pointer where pointer-arithmetic accesses make sense - but for a string, you also want to know ‘how much of the buffer is full of meaningful character data?’ - or else you’re stuck with fixed width text fields like some kind of a COBOL caveman.
But because C was designed by clever people for clever people they figured the standard string functions can just be handed a char pointer without any buffer bounds info because you can be trusted to always make sure that the pointer you give them is below a \0 within a single contiguous char buffer.
You can work with pointer+length or begin+end pairs in C just fine - it's just annoying. But you can always upgrade to C++ and use std::string_view to abstract that for you if you want.
My intent here is "same subject, but now you're standing on the other side of it", not "same viewer location but turned 180º".
For instance, if you started with an image of the Washington Monument with the Lincoln Memorial in the background, the "reverse" would be an image of the Washington Monument with assorted Smithsonian museum buildings behind it. Or whatever you theorize would be on the east of the Washington Monument if no reference is available.
So, in terms of acing interviews, increasingly one of the best answers to the question "Write some code that reverses a string" is that in a world of unicode, "reversing a string" is no longer possible or meaningful.
You'll probably be told "oh, assume US ASCII" or something, but in the meantime, if you can back that up when they dig into it, you'll look really smart.
I'd go further and argue that in general reversing a string isn't possible or meaningful.
It's just not a thing people do, so it's just... not very interesting to argue about what the 'correct' way to do it is.
Similarly, any argument over whether a string has n characters or n+1 characters in it is almost entirely meaningless and uninteresting for real world string processing problems. Allow me to let you into a secret:
there's never really such a thing as a 'character limit'
There might be a 'printable character width' limit; or there might be a 'number of bytes of storage' limit. Which means interesting questions about a string include things like 'how wide is it when displayed in this font?' or 'how many bytes does it take to store or transmit it?'... But there's rarely any point where, for a general string, it is really interesting to know 'how many characters does the string contain?'
Processing direct user text input is the only situation where you really need a rich notion of 'character', because you need to have a clear sense of what will happen if the user moves a cursor using a left or right arrow, and for exactly what will be deleted when a user hits backspace, or copied/cut and pasted when they operate on a selection. The ij ligature might be a single glyph, but is it a single character? When does it matter? Probably not at all unless you're trying to decide whether to let a user put a cursor in the middle of it or not.
And next to that, arguing that there is such a thing as a 'correct' way to reverse "Rijndæl" according to a strict reading of Unicode glyph composability rules just seems like a supremely silly thing to try to do.
I'd much rather, when asked to reverse a string, more developers simply said 'that doesn't make sense, you can't arbitrarily chunk up a string and reassemble it in a different order and expect any good to come of it'.
Boy, that's implicitly a good question... when's the last time I "reversed" a string, on purpose, for something useful?
It took me a bit, but I think I have an answer. It's about 15 years ago. I didn't actually do the original design, but I perpetuated it and didn't remove it. We reversed domain name strings (which, given that they are a subset of ASCII, actually is a well-defined operation) so that the DB we're using, which supported efficient prefix lookups but not suffix lookups, could be used to efficiently query for all subdomains of a given domain, by reversing the domain and using that as the prefix.
I mean this as strong support for your point, not a contradictory "gotcha". I'm a big believer in not doing lots of work to save effort or make correct something you do less than once a decade, e.g., http://www.jerf.org/iri/post/2954 . And it's not even a gotcha anyhow, because we aren't reversing a general string; we were reversing a string very tightly constrained to a subset of ASCII where the operation was fully well-defined. I can't think of when I ever reversed a general string.
Right - any case where you are reversing a string as part of some other operation you will have some goal in mind that is not simply 'produce the reverse of any arbitrary string'. Even if your goal is doing something like printing the crossword puzzle answers backwards at the bottom of the page, you have a tightly constrained set of possible characters so you can literally just throw an error if someone asks you to reverse a string containing a flag.
I actually should admit, for all my protesting above that you never need to do this, I did once actually implement something that "required", as part of the process, reversing a string. It should be apparent once I share what it was why I put scare-quotes around "required" though.
We wanted to test and demonstrate the localization and unicode-readiness capabilities of our software, and to verify that every UI string was actually coming from the resource file for the selected locale, and handled in a unicode-safe way.
So I implemented a program that took in the en-GB resource file, and outputted an en-AU one that contained all the original strings, just flipped upside down. This being, of course, the canonical way to localize a product for Australia.
And to turn a string upside down, you need to reverse the order of the characters, before mapping them to their unicode upside-down equivalent.
Unfortunately, the Unicode consortium do not make available a comprehensive database of which glyphs are 180º reversals of other glyphs, so my solution ended up not having comprehensive coverage of all unicode codepoints, but since my source data was en-US text that wasn't that important; what was more important was that some of the resource strings used a 'safe subset' of HTML so I needed to not turn <strong> into <ƃuoɹʇs>.
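Something like this toy Python sketch, with a deliberately tiny and incomplete flip table (nothing like the real tool):

```
# Map each character to a rough 180°-rotated lookalike; unknown characters pass through.
FLIP = str.maketrans("abcdefghnrstuwy", "ɐqɔpǝɟƃɥuɹsʇnʍʎ")

def upside_down(s: str) -> str:
    return s[::-1].translate(FLIP)

print(upside_down("strong"))  # ƃuoɹʇs -- hence the need to protect markup tags
```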
More than anything, it was probably that experience that gave me a true appreciation for what nonsense it is to try to break a string into characters and manipulate them.
(Also, while I do love the ingenuity of string reversal for suffix-based indexing, reversing a domain name for efficient prefix-based lookup can of course also be done by breaking the name up into subcomponents (thus not requiring you to care about character composition at all between dots), reversing the sequence of those parts and reassembling the string from the components in reverse order - which has the added benefit of preserving human readability of the domain name, and a natural sort order...)
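A quick sketch of both approaches (hypothetical helper names, not the original code):

```
def reverse_chars(domain: str) -> str:
    return domain[::-1]              # safe only because domain names are plain ASCII

def reverse_labels(domain: str) -> str:
    return ".".join(reversed(domain.split(".")))

print(reverse_chars("www.google.com"))   # moc.elgoog.www -> prefix-query on "moc.elgoog."
print(reverse_labels("www.google.com"))  # com.google.www -> prefix-query on "com.google."
```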
"reversing the sequence of those parts and reassembling the string from the components in reverse order"
Given that this was Perl and that's a small chunk of code, it's probably what I would have done in the same circumstance, but given that it already existed it wasn't worth shipping a migration out to the field with a new version. Generally humans didn't consult this table anyhow.
But it was good for a couple of good "wtf is that" faces from other developers the first time they look at the DB, if nothing else. They get it pretty quickly; the preponderance of "moc." and "ude." gets to be a dead giveaway pretty quickly, especially combined with some popular names ("moc.elgoog" almost sounds like a real domain Google might register someday). But still fun if you catch their face at the right moment.
Reversing a string is still meaningful. Take a step back outside the implementation and imagine handing a Unicode string to a human. They could without any knowledge look at the characters they see and produce the correct string reversal.
There is a solution to this which is to compute the list of grapheme clusters, and reverse that.
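For example, a sketch in Python, assuming the third-party `regex` module's \X pattern (extended grapheme clusters per UAX #29, including regional-indicator pairing in recent versions):

```
import regex  # third-party

def reverse_graphemes(s: str) -> str:
    # Split into user-perceived characters, then reverse those.
    return "".join(reversed(regex.findall(r"\X", s)))

print(reverse_graphemes("abc\U0001F1FA\U0001F1F8"))  # 🇺🇸cba -- the flag survives intact
```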
> imagine handing a Unicode string to a human. They could without any knowledge look at the characters they see and produce the correct string reversal.
I really highly doubt it.
How do you reverse this?: مرحبًا ، هذه سلسلة.
Can you do it without any knowledge about whether what looks like one character is actually a special case joiner between two adjacent codepoints that only happens in one direction? Can you do it without knowing that this string appears wrongly in the HN textbbox due to an apparent RTL issue?
It's just not well-defined to reverse a string, and the reason we say it's not meaningful is that no User Story ever starts "as a visitor to this website I want to be able to see this string in opposite order, no not just that all the bytes are reversed, but you know what I mean."
You can even demonstrate a similar concept with English and Latin characters. There is no single thing called a "grapheme" linguistically. There are actually two different types of graphemes. The character sequence "sh" in English is a single referential grapheme but two analogical graphemes. Depending on what the specification means, "short" could be reversed as either "trosh" or "trohs". That's without getting into transliteration. The word for Cherokee in the Cherokee language is "Tsalagi" but the "ts" is a Latin transliteration of a single Cherokee character. Should we count that as one grapheme or two?
Of course, if an interviewer is really asking you how to do this, they're probably either 1) working in bioinformatics, in which case there are exactly four ASCII characters they really care about and the problem is well-defined, or 2) it's implementing something like rev | cut -d '-' -f1 | rev to get rid of the last field and it doesn't matter how you implement "rev" just so long as it works exactly the same in reverse and you can always recover the original string.
The fact that how to reverse a piece of text is locale dependent doesn't mean it's impossible. Basically any transformation on text will be locale dependent. Hell, length is locale dependent.
>what looks like one character is actually a special case joiner between two adjacent codepoints
Are you referring to a grouping not covered by the definition of grapheme clusters (which I am only passingly familiar with)? If so, then I don't think it's any more non-meaningful to reverse it than to reverse an English string. The result is gibberish to humans either way - it sounds more like you're saying that there is no universally "meaningful to humans" way to reverse some text in potentially any language, which is true regardless of what encoding or written language you're using. I was thinking of it more from the programmer side - i.e. that Unicode provides ways to reverse strings that are more "meaningful" (as opposed to arbitrary) than e.g. just reversing code points.
I mean no but only because I don’t understand the characters. Someone who reads Arabic (I assume based on the shape) would have no trouble. You’re nitpicking cases where for some readers visual characters might be hard to distinguish but it doesn’t change the fact that there exists a correct answer for every piece of text that will be obvious to readers of that text which is the definition of a grapheme cluster.
> the fact that there exists a correct answer for every piece of text that will be obvious to readers of that text which is the definition of a grapheme cluster.
No, I insist there is not a single "correct answer," even if a reader has perfect knowledge of the language(s) involved. Now remember, this is already moving the goalposts, since it was claimed that a human needed "no knowledge" to get to this allegedly "correct answer."
You already admit that people who don't speak Arabic will have trouble finding the "grapheme clusters," but even two people who speak Arabic may do your clustering or not, depending on some implicit feeling of "the right way to do it" vs taking the question literally and pasting the smallest highlight-able selection of the string in reverse at a time.
Anyway, take a string like this: "here is some Arabic text: <RLM> <Arabic codepoints> <LRM> And back to English"
Whether you discard the ordering mark[0], keep them, or inverse them is an implementation decision that already produces three completely different strings. Unless we want to write a rulebook for the right way to reverse a string, it remains an impossibility to declare anything the correct answer, and because there is no reason to reverse such a string outside of contrived interview questions and ivory tower debates, it is also meaningless.
You added the requirement that it be a single correct answer. I just asserted that there existed a correct answer. You're being woefully pedantic -- a human who can read the text presented to them but no knowledge of unicode was my intended meaning. Grapheme clusters are language dependent and chosen for readers of languages that use the characters involved. There's no implicit feeling, this is what the standards body has decided is the "right way to do it." If you want to use different grapheme clusters because you think the Unicode people are wrong then fine, use those. You can still reverse the string.
Like what are you even arguing? You declared that something was impossible and then ended with that it's not only possible but it's so possible that there are many reasonable correct answers. Pick one and call it a day.
It is impossible to "correctly reverse a string" because "reverse a string" is not well defined. We explored many different potential definitions of it, to show that there is no meaningful singular answer.
> You added the requirement that it be a single correct answer.
Your original post says "they could produce the correct string reversal"?
UAX #29 is insufficient: at the very least, you must depend on collation too.
In Norwegian, “æ” is a letter, so I believe (as a non-speaker) that they would reverse “blåbærene” to “eneræbålb”; but in English, it’s a ligature representing the diphthong “ae”, and if asked to reverse “æsthetic” I would certainly write “citehtsea” and consider “citehtsæ” to be wrong. (And I enjoy writing the ligature; I fairly consistently write and type æsthetic rather than aesthetic, though I only write encyclopædia instead of encyclopaedia when I’m in a particular sort of mood.)
In Dutch, the digraph “ij” is sometimes considered a ligature and sometimes a letter; as a non-speaker, I don’t know whether natives would say that it should be treated as an atom in reversing or not.
And not all languages will have the concept of reversing even letters, let alone other things. Face it: in English we have the concept of reversing things, but it just doesn’t work the same way in other languages. Sure, UAX #29 defines something that happens to be a good heuristic for reversing, but it doesn’t define reversing, and in the grand scheme of things reversing grapheme-wise is still Wrong. “Reversing a string” is just not a globally meaningful concept.
Another person here has cited Cherokee transliteration, where one extended grapheme cluster turns into multiple English letters. You can apply this to translation in general, but also even keep it inside English and ask: what are we reversing? Letters? Phonemes? Syllables? Words? There are plenty of possibilities which are used in different contexts (and it’s mostly in puzzles, frankly, not general day-to-day life).
The concept of grapheme clusters is acknowledged as approximate. Collations are acknowledged as approximate. Reversing would be even more approximate.
Interesting addendum on the matter of reversing “æsthetic”: I asked my parents what they reckoned, and they both went the other way, reckoning that if you wrote æ in the initial word it should stay æ; my dad said he wouldn’t write it that way in the first place, but that if you’ve written it that way you were treating æ as a letter more than just a way of drawing a certain pair of letters. In declaring otherwise, I was using the linguistic approach, which acknowledges æ as a ligature of the ae diphthong from Latin, being purely stylistic and not semantic. And so we see still more how these things are approximations and subjective.
Keep it first? Like that’s not a gotcha. Your input is a string and the output is that string visually reversed. What it looks like in memory is irrelevant.
The use case: Make an animation, where the text appears starting from the end - for example if you stylize it as vertical text falling from the top.
I think the example shows quite clearly the problem really comes down to dividing the string to logical parts (atoms, grapheme clusters, however you want to call it).
Reversing UTF-8 strings has been a thing for a long time in most/all programming languages. It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible.
"It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible."
It depends on your point of view. From a strict point of view, it does exactly mean it is no longer possible. By contrast, we all 100% knew what reversing an ASCII string meant, with no ambiguity.
It also depends on the version of Unicode you are using, and oh by the way, unicode strings do not come annotated with the version they are in. Since it's supposed to be backwards compatible hopefully the latest works, but I'd be unsurprised if someone can name something whose correct reversal depends on the version of Unicode. And, if not now, then in some later not-yet-existing pair of Unicode standards.
I always thought it was interesting that ASCII is transparently just a bunch of control codes for a typewriter (where "strike an 'a'" is a mechanical instruction no different from "reset the carriage position"), but when we wanted to represent symbolic data we copied it and included all of the nonsensical mechanical instructions.
> It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible.
I don't understand why in maths finding one single counter-example is enough to disprove a theorem yet in programming people seem to be happy with 99.x % of success rate. To me, "It may not work perfectly in 100% of the cases" exactly means "no longer possible" as "possible" used to imply that it would work consistently, 100% of the time.
> "reversing a string" is no longer possible or meaningful.
If you really wanted to, you could write a string reversal algorithm that treated two-character emojis as an indivisible element of the string and preserved its order (just as you'd need to preserve the order of the bytes in a single multi-byte UTF-8 character). You'd just need to carefully specify what you mean by the terms "string", "character" and "reverse" in a way that includes ordered, multi-character sequences like flag emojis.
I would argue that it is possible and meaningful. AFAIK extended grapheme clusters are well defined by the standard, and are very well suited to the default meaning of when somebody says "character", so, given no other information, it's reasonable to reverse a string based on them. I guess the issue is "reverse a string" lacks details, but I think that's different from "not meaningful".
On the challenge front, there are things like á, which might be a single code point or two code points (a + ´). Then there are the really challenging things like ᾷ, where if the components are individual characters, the order of ͺ and ῀ is not guaranteed to be consistent.
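A quick Python sketch of that first case:

```
import unicodedata

s = "cafe\u0301"   # 'café' with a decomposed é (e + COMBINING ACUTE ACCENT)
print(s[::-1])     # the combining accent now dangles at the front, detached from the 'e'
print(unicodedata.normalize("NFC", s)[::-1])  # 'éfac' -- fine here, but NFC won't help
                                              # with flags or ZWJ sequences
```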
No, that only shows that the ISO 3166-1 alpha-2 registry is a bad basis for Unicode flags, since having things lose meaning over time should not be acceptable for a text encoding.
Flags have another issue here in that they can change even when the country stays the same - a recent example here being Afghanistan, but also France who recently changed the official shades of the colors in their flag. Ideally you'd want a new Unicode representation for any changed flags in order to not retroactively change the meaning in old documents.
It's up to the installed fonts really. I don't know if the combination of S + U is standardized as a Soviet Union flag emoji, but even if it is, your locally installed fonts may not contain every single flag emoji, so the browser would still fall back to rendering the two letters instead.
UTF-8 does not represent Unicode code points, but rather Unicode scalar values. The difference between the two is surrogates, the way that UTF-16 ruined Unicode: code points are 0₁₆ to 10FFFF₁₆, scalar values are 0₁₆ to D7FF₁₆ and E000₁₆ to 10FFFF₁₆. Yes, the author quoted Wikipedia, but Wikipedia is wrong on this point; surprisingly comprehensively wrong: the UTF-8 page completely ignores the distinction, and even the page on code points doesn’t mention scalar values! This error propagates to other places, too: for example, “and there are a total of 1,112,064 possible code points”: no, that’s how many scalar values there are; code points also include the 2,048 surrogates, so there are 1,114,112 code points.
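The arithmetic, plus Python's refusal to encode a lone surrogate, as a quick check:

```
code_points = 0x110000          # U+0000 .. U+10FFFF
surrogates = 0xE000 - 0xD800    # U+D800 .. U+DFFF
print(code_points)              # 1114112 code points
print(code_points - surrogates) # 1112064 scalar values, i.e. what UTF-8 can represent

try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError as e:
    print(e)                    # lone surrogates are not encodable in UTF-8
```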
This is a nice dive into the limitations of Python's unicode handling and, at the end, how to work around some problems. But you could use languages with proper unicode support, like Swift or Elixir (weirdly, HN is fighting flags in comment code, which makes examples harder to demonstrate).
As I understand it, there is no two-letter ISO code for the USSR because when they update the standard they remove countries that no longer exist. In at least one case they have reused a code: CS has been both "Czechoslovakia" and "Serbia and Montenegro", neither of which currently exists.
As a result, two-letter ISO codes are useless for many potential applications, such as, for example, recording which country a book was published in, unless you supplement them with a reference to a particular version of the standard.
Is there a way of getting the Czechoslovakian flag as an emoji? And did Serbia and Montenegro get round to making a flag?
Ah, I didn't realize they reused codes from ISO 3166-3. I figured, because they keep these regions around in their own set, that was some implication that the codes would not be reused.
True, but Unicode explicitly defines "SU" as a deprecated combination, regardless of flags. Seems like they omit everything from the list of "no longer used" country codes, with some exceptions. I would think they would have no reason not to allow historical regions.
As for this article & Python - as usual it is biasing towards convenience and implicit behavior rather than properly handling all edge cases.
Compare with Rust where you can't "reverse" a string - that is not a defined operation. But you can either break it into a sequence of characters or graphemes and then reverse that, with expected results: https://play.rust-lang.org/?version=stable&mode=debug&editio...
(Sadly the grapheme segmentation is not part of standard library, at least yet)
Interesting article. Written for beginners, conversationally. Has excessive amounts of whitespace, for "readability" I guess. But at the same time it dives quite deep, and I don't think this "style" of presentation matches up with the amount of time a more novice reader is going to devote to a single long-form article.
As to the content, for all the deep dive, a simple link to https://unicode.org/reports/tr51/#Flags and what an emoji is, would have saved so much exposition. I also wish he'd touched on normalization. With the amount of time he's demanding from readers he could have mentioned this important subject. Because then he could discuss why (starting from his emoji example) a-grave (à) might or might not be reversible, depending how the character is composed.
Also wish he'd pointed to some libraries that can do such reversals.
Upper and lower codepoints are really way too obscure and can create issues you didn't even know you had.
I once had the very unpleasant experience of debugging a case where data saved with R on windows and loaded on macOS ended up with individually double-encoded codepoints.
Something I haven't seen mentioned yet is one of the most annoying things about regional indicator symbols, which is that interpreting them correctly requires arbitrary backtracking, and handling this correctly is very annoying for things like text fields.
Basically: A single, unpaired RIS counts as a single grapheme. Similarly, a pair of RIS count as a single grapheme. Now imagine that your cursor position is after an RIS, and you arrow backwards (assuming LTR text, imagine your cursor is to the right of an RIS, and you press the left arrow). Your textbox should now move the cursor to the left by one grapheme. How do you figure out where this is, in code units? You basically have to scan backwards until you find the first non-RIS codepoint, and then you have to match them up into pairs to figure out if your left-arrow movement should correspond to a movement of one or two codepoints.
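A minimal sketch of that backward scan (hypothetical helper, ignoring combining marks and the rest of the grapheme rules):

```
RIS_START, RIS_END = 0x1F1E6, 0x1F1FF   # regional indicator symbols A..Z

def is_ris(cp: int) -> bool:
    return RIS_START <= cp <= RIS_END

def left_arrow_step(codepoints: list[int], cursor: int) -> int:
    """How many code points to step over when pressing 'left' at `cursor`."""
    if cursor == 0:
        return 0
    if not is_ris(codepoints[cursor - 1]):
        return 1
    # Count the whole run of RIS code points ending at the cursor.
    run, i = 0, cursor - 1
    while i >= 0 and is_ris(codepoints[i]):
        run += 1
        i -= 1
    # RIS pair up from the start of the run: an even-length run means the cursor
    # sits on a pair boundary (step over a whole flag), odd means a lone trailing RIS.
    return 2 if run % 2 == 0 else 1

flags = [ord(c) for c in "\U0001F1FA\U0001F1F8\U0001F1EB\U0001F1F7"]  # 🇺🇸🇫🇷
print(left_arrow_step(flags, len(flags)))  # 2 -- step over the second flag as one unit
```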
This is a longstanding source of bugs, and if you're bored you can play around with pasting a huge sequence of flags into a textfield and then trying to navigate around it with the arrow keys/mouse. There are some broken implementations out there.
edit: while I'm thinking about this I will point out that an alternative design, which would have solved this problem (and which was first pointed out to me by @raphlinus) would have been to have two separate sets of RI symbols, one for 'first position' and one for 'second position'; then you could always determine the appropriate cursor position without needing context. Isn't hindsight a wonderful thing?
Julia docs do a (surprisingly) good job of being clear and explicit about this: the docstring for `reverse(AbstractString)` says:
> Reverses a string. Technically, this function reverses the codepoints in a string and its main utility is for reversed-order string processing [...]. See also [...] `graphemes` from module Unicode to operate on user-visible "characters" (graphemes) rather than codepoints.
Properly reversing a string of flags (or any other grapheme clusters) is just a `using Unicode: graphemes` away.
I do feel like these are all 'gotcha' questions - I haven't seen any real-world requirement to reverse a string and then have it be displayed in a useful way.
> Challenge: How would you go about writing a function that reverses a string while leaving symbols encoded as sequences of code points intact? Can you do it from scratch? Is there a package available in your language that can do it for you? How did that package solve the problem?
So are there any good libraries that can deal with code points that are merged together into a single pictograph and reverse them "as expected"?
This misses the real problem with flag emoji in that they are composed of codepoints that can be in any order. With other emoji you get a base codepoint with potential combining characters. Using a table of combining character ranges you can skip over them and isolate the logical glyph sequences. You don't need surrounding context to parse them out like flags need.
I think that somewhere in this answer lies a reason why Windows still doesn't support flag emoji. I don't count Microsoft Edge as "Windows" in this case, but as Chromium. Windows doesn't support flag emoji in its native text boxes, even though it otherwise supports color emoji.
But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.
Flags are political. Microsoft has removed country borders from its products for political reasons; a post above says flag rendering was excluded for the same reason.
> But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.
Flags are not that hard, they're a very specific block combining in a very predictable way. They're little more than ligatures. Family emoji are much harder.
Consider you have to split a string with 20 flags in sequence at a given offset. That's 40 codepoints with no readily discernible boundaries. To parse that you have to scan backwards to find the first non-flag codepoint. Otherwise you could split the middle of a flag pair. You also have to handle rendering invalid combinations as two glyphs and unpaired codes. For normal codepoints with combining characters you can scan forwards until you reach a non-combining character.
> Flags are not that hard, they're a very specific block combining in a very predictable way.
But before their introduction, you could decide if there's a grapheme cluster break between codepoints just by looking at the two codepoints in question. Now, you may need to parse a whole sequence of codepoints to see how flags pair up.
Nope, because the repurposing is independent of how the Unicode is represented. There's absolutely no advantage to having a string in UTF-32 over UTF-8: you'll still need to examine every character, and the overhead of decoding UTF-8 bytes into 32-bit code points is far outweighed by the huge memory increase needed to store UTF-32.
What's more, it's really not that difficult to start at the end of a valid UTF-8 string and get the characters in reverse order. UTF-8 is well-designed that way in that there's never ambiguity about whether you're looking at the beginning byte of a code point.
> UTF-8 is well-designed that way in that there's never ambiguity about whether you're looking at the beginning byte of a code point.
To expand: if the most-significant bit is 0, it's an ASCII codepoint. If the top two bits are '10', it's a continuation byte, and if they're '11', it's the start of a multibyte codepoint (the number of leading 1 bits in that lead byte tells you how many bytes the codepoint occupies, which makes codepoint counting easy).
So a naive codepoint reversal algorithm would start at the end, and move backwards until it sees either an ASCII codepoint or the start of a multibyte one. Upon reaching it, copy those 1-4 bytes to the start of a new buffer. Continue until you reach the start.
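A sketch of that naive codepoint reversal in Rust, working directly on the bytes (it reverses codepoints only, so combining marks and flag pairs still come out scrambled):

```rust
// Walk backwards, find each lead byte (anything that is not a 0b10xxxxxx
// continuation byte), and append that 1-4 byte codepoint to the output.
fn reverse_codepoints(s: &str) -> String {
    let bytes = s.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut end = bytes.len();
    while end > 0 {
        let mut start = end - 1;
        // Continuation bytes have the form 0b10xxxxxx.
        while bytes[start] & 0xC0 == 0x80 {
            start -= 1;
        }
        out.extend_from_slice(&bytes[start..end]);
        end = start;
    }
    // The input &str is guaranteed valid UTF-8, so this cannot fail.
    String::from_utf8(out).expect("input was valid UTF-8")
}
```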
I wish languages did a far better job clearly distinguishing between their operations:
1. You are acting in byte space and it’s pretty unambiguous what should happen. We are not acknowledging the semantics of language and alphabets.
2. You're acting in language space, and these operations will behave the way you think they should (depending, probably, on your cultural expectations)
Thankfully, SU isn't currently an assigned ISO 3166-1 two-letter country code; if it were, people might have interesting reactions to what reversing a US flag emoji produces.
This is a cool article about Unicode encoding; however, I still feel like it should be possible to reverse strings with flag emojis. I don't see why computers can't handle multi-rune symbols the same way they handle multi-byte runes: combine all the runes that form a single symbol and preserve their order within the reversed string. Of course, that means naive string reversal doesn't work anymore, but naive string reversal wouldn't work with UTF-8 either if we just went byte by byte.
Swift, for example, does what you're saying. I thought that the reason many languages don't do it that way is that part of the definition of an array (or at least expected-by-convention) is constant-time operations. If you treat a string as an array, then having to deal with variable-length units breaks that rule. That's why, when there is an API for dealing with grapheme clusters, it is usually a special case that duplicates an array-like API, instead of literally using an array.
I actually don't know how/why Python is apparently using code points, since they are variable length. That seems like a compromise between using code units and using grapheme clusters that gets you the worst of both worlds.
Edit: Maybe it uses UTF-32 under the hood when it's doing array operations on code points?
CPython 3 indexes strings by code point (since PEP 393 it stores each string as Latin-1, UCS-2, or UCS-4 depending on the widest code point it contains, so the worst case is effectively UTF-32; there is bytes for a plain sequence of bytes). As you say, it's the worst of both worlds: potentially high memory usage, and not really useful if you are dealing with user-perceived characters (grapheme clusters).
My impression is most modern languages that bother with unicode (swift, rust, nim) are using utf-8, and doing linear time operations to handle unicode. I think that's the right approach, as I don't recall ever needing random access on a unicode string.
Swift originally had UTF-32 strings (an upgrade from ObjC which uses UTF-16), but they redid String to be UTF-8. It's typically the best choice even for CJK text, because it has ASCII mixed in often enough to still be smallest.
I don't think reversing a string is a meaningful operation though; there's no reason to think of a string as a "list of characters" when it's also a "list of words" and several other things. Swift provides more than one kind of iteration for that reason.
Of course it's possible: the Unicode standard even has a table[0] you can use to build a DFA (deterministic finite automaton) that breaks a string up into grapheme clusters. You can reverse the DFA to match and yield the graphemes backwards as well, which gives you the reversed Unicode string.
Personally, I found the Unicode spec very messy. Critical information is all over the place. You can see the direct effect of this when you compare various packages across different languages and discover that every library disagrees in multiple places. Even JS String.normalize() isn't consistent in the latest version of most browsers: https://adraffy.github.io/ens-normalize.js/test/report-nf.ht... (fails in Chrome, Safari)
The major difference between ENS and DNS is that emoji are front and center. ENS resolves by computing a hash of a name in a canonicalized form. Since resolution must happen in a decentralized way, simply punting to punycode and relying on custom logic for Unicode handling isn't possible. On-chain records are 1:1, so there's no fuzzy matching either. Additionally, ENS is actively registering names, so any improvement to the system must preserve as many existing names as possible.
At the moment, I'm attempting to improve upon the confusables in the Common/Greek/Latin/Cyrillic scripts, and will combine these new groupings with mixed-script limitations similar to the IDN handling in Chromium.
With all the criticism I normally have for Rust, I must say that its type-safe handling of UTF-8 and its unambiguous distinction between byte strings and UTF-8 strings are extremely helpful in handling the situations mentioned in the article correctly (and also efficiently).
Yes, it's a pain, but the way the standard library designed its types forces you to handle conversions correctly, for example when byte arrays that may contain invalid UTF-8 sequences are converted to UTF-8 strings.
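For example (this is just the standard library's `String::from_utf8` / `from_utf8_lossy` pair):

```rust
fn main() {
    let bytes = vec![0x66, 0x6f, 0x6f, 0xff]; // "foo" followed by an invalid byte

    // The Result forces the caller to decide what invalid UTF-8 means for them:
    match String::from_utf8(bytes.clone()) {
        Ok(s) => println!("valid: {s}"),
        Err(e) => println!("invalid UTF-8 after {} good bytes", e.utf8_error().valid_up_to()),
    }

    // Or opt into U+FFFD replacement characters explicitly:
    let lossy = String::from_utf8_lossy(&bytes);
    println!("lossy: {lossy}");
}
```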
It's an emoji... Are there any emojis with only one character? My assumption going in would be that any emoji is > 1 character. Admittedly, despite lots of string processing, I never have to deal with emojis so I guess I'm not sure.
An interesting exercise would be emoji detection during string reversal to preserve the original emoji. I thought something like that would be the crux of the article.
Why reverse them if one can barely implement, display, and edit them correctly? I never could make them work perfectly in Vim. Also, I had to open a bug in Firefox recently:
> The answer is: it depends. There isn't a canonical way to reverse a string, at least that I'm aware of.
Unicode defines grapheme clusters[1] that represent "user-perceived characters"; separating a string into those and reversing them seems like a pretty good way to go about it.
I'm not surprised the flag had two components, but I _was_ surprised the US flag was made by literally U and S, haha!
I definitely thought it'd be something like [I am a Flag] and [The flag ID between 0 and 65535]. And reversing it would be [Flag ID] + [I am a Flag] which would not be a defined "component" and instead rendered as the individual two nonsense characters.
You might also have noticed this is partly a very well-thought-out hack to make Unicode less sensitive to disagreements and changes in consensus on which flags are encoded, or even the names of the countries concerned!
Related — I did a deep dive a couple years ago on emoji codepoints and how they're encoded in the Apple emoji font file, with the end goal of extracting the embedded images — https://github.com/alfredxing/emoji
> A separate mechanism (emoji tag sequences) is used for regional flags, such as England, Scotland, Wales, Texas or California. It uses U+1F3F4 WAVING BLACK FLAG and formatting tag characters instead of regional indicator symbols. It is based on ISO 3166-2 regions with hyphen removed and lowercase, e.g. GB-ENG → gbeng, terminating with U+E007F CANCEL TAG. Flag of England is therefore represented by a sequence U+1F3F4, U+E0067, U+E0062, U+E0065, U+E006E, U+E0067, U+E007F.
This was the only part that was surprising to me, and as it turns out my surprise mostly stems from still not really understanding how the United Kingdom works.
I was born and raised in the UK and I still can't explain to people how it works. Are England, Scotland, Wales and Northern Ireland separate countries? Is the UK a country? It's countries all the way down.
Don't worry, "How the United Kingdom works" is a political question and so subject to change.
For example, Wales was essentially just straight up conquered, and so for long periods Wales did not have any distinct legal identity from England. You'll see that today there's a bunch of laws which are for England and Wales but notably not Scotland, including criminal laws. In living memory Wales got some measure of independent control over its own affairs, via an elected "Assembly" but what powers are "devolved" to this assembly are in effect the gift of the Parliament, in Westminster, which is sovereign. Whether taking away those powers would go well is a good question.
On the other hand, Northern Ireland is what's left of English/ British dominion over the entire island of Ireland, most of which today is the Republic of Ireland, a sovereign entity with its own everything. It's only existed for about a century, and is a result of the agreed "partition" when the Irish rebelled because most of the Irish wanted independence but those in the North not so much. Feel free to read about euphemistically named "Troubles". In the modern era, Northern Ireland, like Wales, gets a devolved government in Stormont. Unlike Wales, the Northern Ireland government is a total mess, and e.g. they have abortion (like the rest of the UK, and like the rest of Ireland) only because Stormont was so broken that Westminster imposed abortion legalisation on them since they weren't actually governing. If you think the US Congress is dysfunctional, check out Stormont...
Finally, Scotland was for a very long time an independent but closely related sovereign nation. It agreed to join this United Kingdom about three hundred years ago in the Acts of Union, after about a century with the same Monarch ruling both countries. However, it too got a devolved government in the 20th century, a Parliament, probably the most powerful of the three, in Holyrood, Edinburgh, and it has a relatively powerful pro-independence politics: the Scottish National Party is the dominant power in Scottish politics, although how many of its voters actually support independence per se is tricky to judge.
Brexit changed all this again, because as part of the EU a bunch of the powers you could reasonably localise, and so were "devolved" to Wales, Scotland and Northern Ireland, had been controlled by EU law. So Westminster could say they were devolved, knowing that the constituent entities couldn't actually do much with this supposed power. Having left the EU, those powers were among the things Brexiteers seemed to have imagined now lay with Westminster, but of course the devolved countries said no, these are our powers: we get to decide e.g. how agricultural subsidies are distributed to suit our farmers.
That's even more fun in Northern Ireland, because they share a border with the Republic, an EU member, and so they're not allowed to have certain rules that would obviously result in a physical border with guards and so on. Their Unionists (the people who are why it isn't just part of the Republic of Ireland because they want to be in the United Kingdom) feel like they were sold out by Westminster politicians, while the Republicans (those who'd rather be part of the Republic) see this as potentially a further argument in favour of that. All of which isn't helping at all to keep the peace between these rivals, that peace being the whole reason we don't want to put up a border...
Most flags use the ISO 2-character country code to access their values. However, some flags don't map to 2-character country codes (Scotland being one example). In that case the sequence is black flag, GBSCT (for Great Britain-Scotland, represented using the tag Latin small letter codes for the letters), then cancel tag to end the sequence. Changing the middle five to GBENG gives the English flag, and GBWLS gives the Welsh flag.
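A small sketch of that construction (my own helper, not a standard API), mapping each letter of the lowercased region code into the tag block and terminating with CANCEL TAG:

```rust
// Builds an emoji tag sequence flag: U+1F3F4 WAVING BLACK FLAG, then each
// ASCII letter of the region code mapped into the tag block (U+E0000 + byte),
// then U+E007F CANCEL TAG.
fn subdivision_flag(code: &str) -> String {
    let mut s = String::from('\u{1F3F4}');
    for b in code.to_ascii_lowercase().bytes() {
        s.push(char::from_u32(0xE0000 + b as u32).unwrap());
    }
    s.push('\u{E007F}');
    s
}

fn main() {
    println!("{}", subdivision_flag("gbsct")); // Scotland
    println!("{}", subdivision_flag("gbeng")); // England
}
```

Whether anything actually renders for a given code is then entirely up to the font.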
In normal conditions you can check for a ZWJ, but with regional indicator chars, you would have to consider the whole regional-indicator run as a single unit in the reversal. Given that this isn't necessarily locale-dependent but presentation-layer-dependent, there might not be enough info to decide how to act.
URL Encoding works on bytes and does not concern itself with the character encoding of those bytes (except assuming that it is an ASCII superset) so this is only a limitation of the JS implementation.
Of course you can reverse a string with a flag emoji. You just need to treat a "string" as a collection of Extended Grapheme Clusters, and then you reverse the order of the EGCs. So if the string is `a<flag unicode bytes>b`, the output should be `b<flag unicode bytes>a`.
There are plenty of country codes that when reversed become a different, valid country code: e.g. Israel (IL) when reversed is Liechtenstein (LI); Australia (AU) becomes Ukraine (UA).
Whether "reversing flag emojis" causes such transformations will depend on what is meant by "reversing", which is kind of the whole point here: there are a number of possible interpretations of "reverse".
It's sad that Unicode doesn't include flags for dissolved countries. If it did, reversing a US flag would make a Soviet Union flag (code SU). That would make the text much more fun.
The whole reason for handling the flag emojis that way was so that the Unicode Consortium wouldn't have to decide which countries should or should not be recognized. It is totally valid for you to configure your computer to display SU as a Soviet flag.
Let me cheat a bit and say Unicode comes in three flavors: UTF-8, UCS-2 aka UTF-16, and UTF-32. UTF-8 is byte-oriented, UTF-16 is double-byte oriented, and UTF-32 nobody uses because you waste half the word almost all of the time.
You can't reverse the bytes in UTF-8 or UTF-16, because you'll scramble the encoding. But you could parse the string codepoint-at-a-time, handling the specifics of UTF-8, or of UTF-16 with its surrogate pairs, and reverse those. This is equivalent to reversing UTF-32, and I believe it's what the original poster was imagining.
Except you can't do that, because Unicode has composing characters. Now, I'm American and too stupid to type anything other than ASCII, but I know about n+~ = ñ. If you have the pre-composed version of ñ, you can reverse the codepoint (it's one codepoint). If you don't have it, and you have n+dead ~, you can't reverse it, or in the word "año" you might put the ~ on the "o". (Even crazier things happen when you get to the ligatures in Arabic; IIRC one of those is about 20 codepoints.)
So we can't just reverse codepoints, even in ancient versions of Unicode. Other posters have talked about even more exotic stuff like emoji + skin tone. It's necessary to be very careful.
Now, the old fart in me says that ASCII never had this problem. But the old fart in me knows about CRLF in text protocols, and that's never LFCR; and that if you want to make a ñ in ASCII you must send n ^H ~. I guess you can reverse that, but if you want to do more exotic things it becomes less obvious.
(IIRC UCS-2 is the deadname; strictly, UCS-2 was the fixed-width, BMP-only predecessor, and we call it UTF-16 now to remind us to always handle surrogate pairs correctly, which we don't.)
When I first realized that the skin tone emojis were a code-point + a color code-point modifier, I tried to see what other colors there were and if I could apply those to other emojis. The immature child in me looked to see if there was a red color code point and if so, could I use it to make a "blood poop" emoji. Turns out.... no.
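For the curious: the skin tones are the five modifier codepoints U+1F3FB..U+1F3FF appended after a base emoji, and they only fuse into one glyph when the base has the Emoji_Modifier_Base property, which U+1F4A9 does not. A tiny illustration:

```rust
fn main() {
    let waving = "\u{1F44B}\u{1F3FD}"; // waving hand + medium skin tone: renders as one glyph
    let poop   = "\u{1F4A9}\u{1F3FB}"; // pile of poo + light skin tone: typically two glyphs
    println!("{waving} {poop}");
}
```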
The author tries to define "character" when there isn't actually any definition of what that even means. "Character" is a term limited to languages that actually use them, and not all text is made up of characters.