I'm in the process of designing a scripting language and implementing it in C++. I plan to put together a YouTube series about it. (Doesn't everyone want to see Bison and Flex mixed with proper unit tests and C++20 code?)
Due to my future intended use case, I needed good support for Unicode. I thought that I could write it myself, and I was wrong. I wasted two weeks (in my spare time, mostly evenings) trying to cobble together things that should work, identifying patterns, figuring out how to update it as Unicode itself is updated, thinking about edge cases, i18n, zalgo text, etc. And then I finally reached the point where I knew enough to know that I was making the wrong choice.
I'm now using ICU. (https://icu.unicode.org/) It's huge, it was hard to get it working in my environment, and there are very few examples of its usage online, but after the initial setup dues are paid, it WORKS.
Aside: Yes, I know I'm crazy for implementing a programming language that I intend for serious usage. Yes, I have good reasons for doing it, and yes I have considered alternatives. But it's fun, so I'm doing it anyways.
Moral of the story: Dealing with Unicode is hard, and if you think it shouldn't be that hard, then you probably don't know enough about the problem!
Handling unicode can be fine, depending on what you're doing. The hard parts are:
- Counting, rendering and collapsing grapheme clusters (like the flag emoji)
- Converting between legacy encodings (Shift JIS, KOI8, etc.) and UTF-8 / UTF-16
- Canonicalization
If all you need is to deal with utf8 byte buffers, you don't need all that stuff. And your code can stay simple, small and fast.
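For a rough feel of what those hard parts look like in practice, here is a small Python sketch (the grapheme counting assumes the third-party `regex` module; the rest is stdlib):

```
import unicodedata
import regex  # third-party; supports \X for extended grapheme clusters

# Grapheme clusters: a flag is two code points but one user-perceived character.
flag = "\U0001F1FA\U0001F1F8"  # 🇺🇸 regional indicators U + S
print(len(flag), len(regex.findall(r"\X", flag)))  # 2 1 (recent regex versions pair RIs per UAX #29)

# Legacy encodings: conversion is a decode/encode pair once you know the source charset.
sjis_bytes = "こんにちは".encode("shift_jis")
utf8_bytes = sjis_bytes.decode("shift_jis").encode("utf-8")

# Canonicalization: pick a normalization form before comparing or hashing.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")  # True
```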
IIRC the rust standard library doesn't bother supporting any of the hard parts in unicode. The only real unicode support in std is utf8 validation for strings. All the complex aspects of unicode are delegated to 3rd party crates.
By contrast, nodejs (and web browsers) do all of this. But they implement it in the same way you're suggesting - they simply call out to libicu.
> The only real unicode support in std is utf8 validation for strings.
Rust's core library gives char methods such as is_numeric, which asks whether this Unicode codepoint is in one of Unicode's numeric categories, such as the letter-like numerics and the various digit sets. (Rust does provide char with is_ascii_digit and is_ascii_hexdigit if that's all you actually cared about.)
So yes, the Rust standard library is carrying around the entire Unicode character-class table, among other things. Of course, Rust's library isn't all built into a vast binary; if you never use these features, your binary doesn't get that code.
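For a feel of what those numeric classes cover, here is the analogous picture in Python (just the standard str methods and unicodedata, nothing Rust-specific):

```
import unicodedata

# ASCII digit, Arabic-Indic digit, Roman numeral, vulgar fraction
for ch in ["7", "\u0663", "\u2167", "\u00bd"]:
    print(ch,
          ch.isdecimal(),            # usable in base-10 numbers
          ch.isdigit(),              # digits, including e.g. superscripts
          ch.isnumeric(),            # anything with a numeric value
          unicodedata.numeric(ch))   # that value, from the Unicode database
```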
It always feels like the most work goes to the least-used emoji. So many revisions and additions to the family emoji, and yet it’s one of the ones I don’t recall anyone ever using.
I think the trap Unicode got into is that technically they can have infinite emoji, so they just don’t ever have a way to say no to new proposals.
> It always feels like the most work goes to the least-used emoji.
I always feel like those emoji were added on purpose in order to force implementations to fix their unicode support. Before emoji were added, most software had completely broken support for anything beyond the BMP (case study: MySQL's so-called "UTF8" encoding). The introduction of emoji, and their immediate popularity, forced many systems to better support astral planes (that is officially acknowledged: https://unicode.org/faq/emoji_dingbats.html#EO1)
Progressively, emoji using more advanced features got introduced, which forced systems (and developers) to fix their unicode handling, or at least improve it somewhat: e.g. skin tones as combining code points, etc.
> I think the trap Unicode got into is that technically they can have infinite emoji, so they just don’t ever have a way to say no to new proposals.
You should try to follow a new character through the process, because that's absolutely not what happens and shepherding a new emoji through to standardisation is not an easy task. The unicode consortium absolutely does say no, and has many reasons to do so. There's an entire page on just proposal guidelines (https://unicode.org/emoji/proposals.html), and following it does not in any way ensure it'll be accepted.
WTF business do emojis have in Unicode? The BMP is all there ever should have been. Standardize the actual writing systems of the world, so everyone can write in their language. And once that is done, the standard doesn't need to change for a hundred years.
What we need now is a standardized, sane subset of Unicode that implementations can support while rejecting the insane scope creep that got added on top of that. I guess the BMP is a good start, even though it already contains superfluous crap like "dingbats" and boxes.
Unicode didn't invent emoji; it incorporated them because they were already popular in Japan, and not incorporating them would have greatly reduced Japanese adoption.
Keep in mind that Unicode was intended to unify all the disparate encodings that had been brewed up to support different languages and which made exchanging documents between non-English speaking countries a nightmare. The term "mojibake" comes to mind [0] - Japan alone had so many encodings that a slang term for text encoded with something different than what your device expected (and subsequently got rendered as nonsensical/garbled text) came about. And they weren't alone, of course [1].
> What we need now is a standardized, sane subset of Unicode that implementations can support while rejecting the insane scope creep that got added on top of that.
Unicode wasn't intended to be pretty. It was intended to be the one system that everyone used, and a way to increase adoption was to do some less than ideal things, like duplicate characters (so it would be easier to convert to Unicode).
You may never need anything outside the BMP, but that doesn't make the rest of the planes worthless. Ignoring the value of including dead and nearing-extinct languages for preservation purposes (not being able to type a language will basically guarantee its extinction, with inventing a new encoding and storing text as jpgs being the only real alternatives), there are a lot of people speaking languages found in the SMP [2][3] ([2] has 83 million native speakers, for example).
> The term "mojibake" comes to mind [0] - Japan alone had so many encodings that a slang term for text encoded with something different than what your device expected (and subsequently got rendered as nonsensical/garbled text) came about.
Mojibake was not a "Japan has too many encodings" problem. It was a "western developers assume everyone is using CP1252" problem.
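That failure mode is trivial to reproduce; a quick Python sketch:

```
# Shift JIS bytes decoded as CP1252: classic mojibake.
sjis = "こんにちは".encode("shift_jis")
print(sjis.decode("cp1252", errors="replace"))  # renders as something like ‚±‚ñ‚É‚¿‚Í
```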
> Unicode wasn't intended to be pretty. It was intended to be the one system that everyone used, and a way to increase adoption was to do some less than ideal things, like duplicate characters (so it would be easier to convert to Unicode).
Unfortunately they undermined all that with Han Unification, with the result that it's never going to be adopted in Japan.
Mojibake is a universal problem whenever multiple charsets are in use and no charset is specified in metadata. Software guesses the charset, but it's just a guess. Japanese-locale software occasionally confuses Latin-1 with SJIS, but more often confuses SJIS with EUC-JP or UTF-8.
Unicode/UTF-8 is widely adopted and recommended in Japan, and there is no widely used alternative. Japanese companies tend to still use SJIS, but that's just laziness. Han unification isn't a problem if you only handle Japanese text: just use a Japanese font everywhere. Handling multi-language text is a pain, but there are no alternatives anyway.
> Mojibake is a universal problem whenever multiple charsets are in use and no charset is specified in metadata. Software guesses the charset, but it's just a guess. Japanese-locale software occasionally confuses Latin-1 with SJIS, but more often confuses SJIS with EUC-JP or UTF-8.
In theory it can happen with any combination of character sets, sure, but in practice every example of mojibake I've seen has been SJIS (or UTF-8) encoded text being decoded as CP1252 ("Latin-1" but that's an ambiguous term) by software that assumed the whole world used that particular western codepage. If you've got examples of SJIS vs EUC-JP confusion in the wild I'd be vaguely interested to see them (is there even anywhere that still uses EUC-JP?)
> Japanese companies tend to still use SJIS, but that's just laziness.
It's not just laziness; switching to unicode is a downgrade, because in practice it means you're going to get your characters rendered wrongly (Chinese style) for a certain percentage of your customers, for little clear benefit.
> Handling multi-language text is a pain, but there are no alternatives anyway.
Given that you have to build a structure that looks like a sequence of spans with language metadata attached to each one, there's not much benefit to using unicode versus letting each span specify its own encoding.
Maybe the guessing order reasonably depends on locale. My GP comment is based on my experience mainly with old ja-JP-locale Windows software. IIRC Unix software tends not to be good at guessing, so maybe that's what you're referring to.
Nowadays I rarely see new EUC-JP content (or maybe I just don't recognize it), but I still sometimes hit mojibake in Chrome when visiting old homepages (roughly once a month). For web pages it mostly doesn't come up anyway: most modern pages (including SJIS ones) don't rely on guessing but have a <meta charset> tag, so mojibake very rarely happens. For plain-text files, I still see UTF-8 files shown as SJIS in Windows Chrome.
Viewing Japanese-only UTF-8 text is totally fine on Japanese-locale Windows/Mac (and Linux, but YMMV). So your case is viewing the text on a non-Japanese locale. That can indeed be a problem, but how did SJIS solve it? What software switches fonts when it opens an SJIS file? Does the app/format not support specifying font/language, like HTML or Word do?
I believe no developer wants to deal with a foreign charset like GBK/Big5/whatever; there is very little information about them. If a developer can switch the charset used to read a file, they can also switch the font.
> Viewing Japanese-only UTF-8 text is totally fine on Japanese-locale Windows/Mac (and Linux, but YMMV). So your case is viewing the text on a non-Japanese locale.
The issue is that realistically a certain proportion of customers are going to have the wrong locale setting or wrong default font set.
> That can indeed be a problem, but how did SJIS solve it? What software switches fonts when it opens an SJIS file? Does the app/format not support specifying font/language, like HTML or Word do?
Certainly Firefox will use a Japanese font by default for SJIS whereas it will use a generic (i.e. Chinese) font by default for UTF-8. I would expect most encoding-aware programs would do the same?
> If a developer can switch the charset used to read a file, they can also switch the font.
Sure, but it works both ways. And it's actually much easier for a lazy developer to ignore the font case because it's essentially only an issue for Japan. Whereas if you make a completely encoding-unaware program it will cause issues in much of Europe and all of Asia (well, it did pre-UTF8 anyway).
I think by far the largest contributor to coining "mojibake" was e-mail MTAs. Some e-mail implementations assumed 7-bit ASCII for all text and dropped the MSB of 8-bit SJIS/Unicode/etc., ending up as corrupt text at the receiving end. Next up were texts written in EUC (Extended UNIX Code)-JP, probably by someone running either a real Unix (likely a Solaris) or early GNU/Linux, and floppies from a classic MacOS computer. Those cases must have defined the term, and various edge cases on the web, like header-encoding mismatches, popularized it.
"Zhonghua fonts" issue is not necessarily linked to encoding, it's an issue about assuming or guessing locales - that has to be solved by adding a language identifier or by ending han unification.
> Unfortunately they undermined all that with Han Unification, with the result that it's never going to be adopted in Japan.
This is an absolute shame, and there is no excuse for not fixing it, so that variations of unified characters can be encoded, before adding unimportant things like skin tones.
> So rather than treat the issue as a rich text problem of glyph alternates, Unicode added the concept of variation selectors, first introduced in version 3.2 and supplemented in version 4.0.[10] While variation selectors are treated as combining characters, they have no associated diacritic or mark. Instead, by combining with a base character, they signal the two character sequence selects a variation (typically in terms of grapheme, but also in terms of underlying meaning as in the case of a location name or other proper noun) of the base character. This then is not a selection of an alternate glyph, but the selection of a grapheme variation or a variation of the base abstract character. Such a two-character sequence however can be easily mapped to a separate single glyph in modern fonts. Since Unicode has assigned 256 separate variation selectors, it is capable of assigning 256 variations for any Han ideograph. Such variations can be specific to one language or another and enable the encoding of plain text that includes such grapheme variations. - https://en.m.wikipedia.org/wiki/Han_unification
This is what you’re asking for, right? Control characters that designate which version of a unified character is to be displayed.
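As a minimal Python sketch of what such a sequence looks like at the code-point level (whether this particular base + selector pair is registered in the Ideographic Variation Database is beside the point; treat it as illustrative):

```
base = "\u8fbb"        # 辻, a Han ideograph with regional one-dot/two-dot glyph variants
vs17 = "\U000E0100"    # VARIATION SELECTOR-17, the first ideographic variation selector

seq = base + vs17
print(len(seq))        # 2 -- still two code points in the underlying text
# A font that knows the registered sequence can render it as a single glyph in the
# requested variant form; a font that doesn't simply shows the base glyph.
```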
Have emoji not become part of our writing structure though? A decent percentage of online chats and comments, especially on social networks, includes at least one emoji that couldn't be easily or accurately represented in the regular written language.
Recently implementers of unicode have censored the gun emoji in a way that changes the meaning of many existing online chats and comments. So you can't easily or accurately represent things even with unicode.
Emoji have never been effective or consistent at conveying meaning; at best they convey something to someone in the same subculture and time period, and often not even that. Given that unicode implementers are ok with erasing the meaning of some of them, it should be ok to eliminate more of them.
> Emoji have never been effective or consistent at conveying meaning; at best they convey something to someone in the same subculture and time period
Isn't that the same with all words though? Think how much English usage changes in a generation. For instance, my girlfriend will use the term "I'm dead!" in a similar context to where I would say "LOL" and where my father would have said "What the fuck is loll?"
There's a spectrum. Subculture-specific slang changes quickly, but most words have a longer lifetime; reading Chaucer today is difficult but doable. Given that we don't encode words but only letters, for English you have to go back to the disappearance of þ to get a change that's relevant to text encoding. Emoji shift faster and are less effective at conveying meaning than any "real" language.
This argument was lost the moment Unicode was created. Japanese carriers had created their own standard for emoji encoding for sms. And they would not switch to Unicode unless the emoji were ported over.
It’s a tricky situation. Maybe allowing an arbitrary bitmap char to represent any emoji would have been better but then we could have ended up in a situation where normal text or meaningful punctuation or perhaps even fonts would get encoded as bitmaps.
For something like a face or hand gesture, a bitmap likely would have been better since it would at least look the same on all platforms.
I don't think that argument holds water. Emoji could just as well have been encoded as markup. There were for instance long-established conventions of using strings starting with : and ; . Bulletin boards extended that to a convention using letters delimited by : for example :rolleyes: . Not to mention that those codes can be typed more efficiently than browsing in an Emoji Picker box.
Because emoji became characters, text rendering and font formats had to be extended to support them.
There are four different ways to encode emoji in OpenType 1.8:
* Apple uses embedded PNG
* Google uses embedded colour bitmaps
* Microsoft uses flat glyphs in different colours layered on top of one another
* Adobe and Mozilla use embedded SVG
> Emoji could just as well have been encoded as markup.
They could have, but they were already being encoded as character codepoints in existing charactersets. So any character encoding scheme that wanted to replace all use cases for existing charactersets needed to match that. If switching charactersets meant you lost the ability to use emoji until you upgraded all your applications to support some markup format, people would just not switch.
> If switching charactersets meant you lost the ability to use emoji until you upgraded all your applications to support some markup format, people would just not switch.
You need to upgrade those applications to support Unicode too.
Not necessarily, most applications already supported multiple encodings, having the OS implement one of the unicode encodings was often all that was needed.
I'd say the important part was that Japanese carriers were weaponizing flip-phone culture to gatekeep "PCs" and open-standard smartphones out of their microtransaction ecosystem. Emoji was one of the keys to disproving the FUD that the iPhone couldn't be the equal of flip phones, and to establishing first-class-citizen status.
You are underestimating how much language evolves. In fact, you are proposing brakes to stop it evolving. If nothing else, new currency symbols need to be incorporated every few years. The initial emoji were part of the actual writing systems of the world, even if they were relatively new and only being used by foreigners. Or maybe they have been part of world culture since the 1950s :-) ? https://en.wikipedia.org/wiki/Smiley
Exactly this. Humans have incredibly complicated writing systems, and all Unicode wants to do is encode them all. Keep in mind that the trivial toy system we're more familiar with, ASCII, already has some pretty strange features because even to half-arse one human writing system they needed those features.
Case is totally wild, it only applies to like 5% of the symbols in ASCII, but in the process it means they each need two codepoints and you're expected to carry around tech for switching back and forth between cases.
And then there are several distinct types of white space, each gets a codepoint, some of them try to mess with your text's "position" which may not make any sense in the context where you wanted to use it. What does it mean to have a "horizontal tab" between two parts of the text I wanted to draw on this mug? I found a document which says it is the same as "eight spaces" which seems wrong because surely if you wanted eight spaces you'd just write eight spaces.
And after all that ASCII doesn't have working quotation marks, it doesn't understand how to spell a bunch of common English words like naïve or café, pretty disappointing.
This work wasn't done for emoji. They use the same zero-width joiner character [1] that exists to support Indic scripts like Devanagari, and any system that properly handles these languages will also properly handle the emoji.
Yes, this adds a lot of complexity, but it's really a question of whether that complexity is justified in order to support all of the world's languages. And I think many would argue that it is.
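For illustration, here is what one of those ZWJ sequences looks like at the code-point level (a quick Python sketch):

```
import unicodedata

# man + ZWJ + woman + ZWJ + girl: five code points, one rendered family emoji
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(family))  # 5
print([unicodedata.name(c) for c in family])
# ['MAN', 'ZERO WIDTH JOINER', 'WOMAN', 'ZERO WIDTH JOINER', 'GIRL']
```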
I know how that feels. I wrote a little C++ program to fetch Unicode data from a DB and then normalize it to ASCII for analytics purposes. It's a lot faster to do it on ASCII than trying to handle all the fun cases of how many ways an e (etc.) can be input. ICU to the rescue! It took a couple of weeks of getting up to speed, though ICU itself wasn't too bad to figure out. But you find out very quickly that to use it, you need a good understanding of a number of the Unicode technical reports to actually understand how to make use of it. Fun times indeed.
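For anyone curious, a rough Python analogue of that idea (not the poster's actual ICU/C++ code) is to decompose and strip the combining marks:

```
import unicodedata

def to_ascii(text: str) -> str:
    # NFKD splits 'é' into 'e' + COMBINING ACUTE ACCENT; the ASCII encode drops the mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Café, naïve, résumé"))  # Cafe, naive, resume
```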
Do you have a YouTube for people to subscribe to in anticipation of you releasing your YouTube series about your work? The development processes of new languages is so intriguing.
It would actually be pretty interesting to see how you use Bison and Flex with utf-8.
Most resources say to not bother due to lack of support for Unicode, but they're so ubiquitous
Do they need special support for UTF-8? One of the nice things about UTF-8 is that you can treat it as an 8-bit encoding in many cases if you only care about substrings and don't need to decode individual non-ASCII characters.
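A quick sketch of why that works: every byte below 0x80 in UTF-8 really is that ASCII character (lead and continuation bytes of multi-byte sequences are always >= 0x80), so byte-level splitting and searching on ASCII delimiters is safe without decoding:

```
line = "café,naïve,東京".encode("utf-8")
print(line.split(b","))             # three fields, multi-byte sequences untouched
print(b"," in "é".encode("utf-8"))  # False: no multi-byte sequence contains an ASCII byte
```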
poor man gave me and many others something like half of our introduction to computer science, but has gotten far more fame as the "emoji guy" for his repeated bouts with this particular part of unicode :)
This reminds me of an interesting bug I saw where I was seeing a strange flag in some Arabic text. However when I copied the string and pasted it into a text editor, the flag of Saudi Arabia appeared instead (which made much more sense). After some vexillologic research on Wikipedia I identified the original flag as American Samoa and it suddenly all made sense. Turns out some broken RTL support was flipping the SA into AS at presentation.
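For the curious, that flip is easy to see at the code-point level (Python sketch):

```
S, A = "\U0001F1F8", "\U0001F1E6"   # REGIONAL INDICATOR SYMBOL LETTER S / A

saudi_arabia = S + A     # 🇸🇦 (region code "SA")
american_samoa = A + S   # 🇦🇸 (region code "AS")

# Reversing the two code points turns one flag into the other:
print(saudi_arabia[::-1] == american_samoa)  # True
```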
Maybe I'm missing some prerequisite knowledge here, but why would I assume `flag="us"` is an emoji? Looking at that first block of code, there is no reason for me to think "us" is a single character.
Edit: Turns out my browser wasn't rendering the flags.
In Windows Chrome, it doesn't render the emoji for me. In Android Chrome, it renders a flag emoji - not the raw region indicators (which look like the letters "u" and "s").
In my browser (Firefox on Windows), the thing between the quotes in the first block of code looks like a picture of the US flag cropped to a circle, not like the characters "us".
Ah I see, I just opened it in firefox. It looks like some JS library is not getting loaded in Edge. The author was talking about "us", "so", etc. looking like one character and I thought I was going crazy, lol.
I don't think that's about a JS library. Firefox bundles an emoji font that supports some things -- such as the flags -- that aren't supported by Segoe UI Emoji on Windows, so it has additional coverage for such character sequences.
If it's Windows, it doesn't actually use flags for those emojis, it renders a country code instead. If it wasn't supported you would just see the glyph for an unknown character.
The reason was because they didn't want to be caught up in any arguments about what flag to render for a country during any dispute, as with, e.g. the flag for Afghanistan after the Taliban took control.
Do you have a citation for that? I suspected it was because of the political issues, so I tried hunting down the reason one day and came up blank.
[Microsoft had this same issue with the timezone map in Windows. The early versions were cool and had country borders, but then I think it was India/Pakistan threw a fit and it was simplified to take the borders out]
What I'd like to know is, given the explosion of the character set for emoji, does the rationale for Han unification still make sense? The case for not allowing national variants seems less and less compelling with every emoji they add.
This is a bit of a hobby horse, but imagine if every time you read an article in English on your phone some of the letters were replaced with "equivalent" Greek or Cyrillic one and you can get an idea of the annoyance. Yeah, you can still read it with a bit of thought, but who wants to read that way?
I agree that Han unification was an unfortunate design decision, but I'd argue that the consortium is following a consistent approach to the Han unification with emoji. For example, they treat "regional" vendor variations in emoji as a font issue. If you get a message with the gun emoji, unless you have out-of-band information regarding which vendor variant is intended, there's no way in software to know if it should be displayed as a water gun (Apple "regional" variant) or a weapon (other vendor variants). Which is not that different from a common problem stemming from Han unification.
I don't disagree, but my point is more that their concern was about having "too many characters" in Unicode, which no longer seems to be a real concern, so what would be the harm of adding national variants?
Having skin tone variants (which is something Unicode chose to add, rather than added because of existing use) is consistent with not having distinct variants for glyphs from different languages?
Han unification was an attempt to fit CJK characters into the 16-bit BMP. In the end the BMP failed anyway, so the unification is pointless, but reverting it would also produce huge compatibility issues.
The new characters would have the same glyphs as the old characters. That's the nightmare. For example, I can't find the old one by searching for the new one, and it's hard for ordinary users to understand why. Should all software support searching by both characters? I don't expect every Western developer to take care of that. Equality comparison also fails without special support.
That is a bad excuse, since it would preclude adding any new characters for existing languages. Would you have made the same objection to U+1E9E "ẞ", which was added in 2008?
Also, equality comparison already requires special support, e.g. normalization before comparison.
Sure, there would be a period where software support is incomplete, but that is a bad reason to keep things broken forever.
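For example, a quick Python sketch of the normalization point:

```
import unicodedata

composed = "\u00e9"       # 'é' as one code point
decomposed = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT

print(composed == decomposed)                      # False
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))    # True
```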
> were replaced with "equivalent" Greek or Cyrillic one
The subset of equivalent letters, or different ones? If they looked the same, it wouldn't bother me if the letters in the center were a single codepoint between European languages:
The problem is they don't look the same. So imagine, for instance, Я instead of "R" or И instead of "N" (I don't think the sounds are actually equivalent but let's run with it for the sake of example). Not insurmountable. One could still read a text with these substitutions. But it'd be distracting, and extra detrimental for people who don't speak English as their first language.
To an extent that's true, but introducing national variant characters in addition to the unified ones would at least allow careful writers to avoid the problem.
Exactly, this is not rocket science: introduce variants of the affected characters in Unicode (either variation selectors or distinct codepoints; it doesn't matter too much, but variation selectors would allow falling back to the old context-based detection). Then wait for software to be updated to use the variants based on the input language. This lets the writer verify the variant used, which will then be the same in all contexts.
But you can, and did, reverse a string. It seems you would need more details, such as a request to reverse the meaning or interpretation of the string, which is what the author is getting at.
If someone challenges you to reverse an image, what do you do? Do you invert the colors? Mirror horizontally? Mirror vertically? Just reverse the byte order?
There's a specification problem here. I like to say that a "string" isn't a data structure, it's the absence of one. Discussing "strings" is pointless. It follows that comparing programming languages by their "string" handling is likewise pointless.
Case in point: a "struct" in languages like C and Rust is literally a specification of how to treat segments of a "string" of contiguous bytes.
In languages like C, “string” isn’t a proper data structure; it’s a `char` array, which itself is little more than an `int` array or `byte` array.
But these languages don’t provide true “string” support. They just have a vaguely useful type alias that renames a byte array to a char array, and a bunch of byte array functions that have been renamed to sound like string functions. In reality all the language supports are byte arrays, with some syntactical sugar so you can pretend they’re strings.
Newer languages, like Go and Python 3, that were created in a world of Unicode, provide true string types, where the type primitives properly deal with the idea of variable-length characters and provide tools to make it easy to manipulate strings and characters as independent concepts. If you want to ignore Unicode, because your specific application doesn’t need to understand it, then you need to cast your strings into byte arrays, and all pretence of true string manipulation vanishes at the same time.
This is not to say that C can’t handle Unicode; it’s just that the language doesn’t provide true primitives for manipulating strings and instead relies on libraries to provide that functionality, which is a perfectly valid approach. Just as baking more complex string primitives into your language is also a perfectly valid approach. It’s just a question of trade-offs and use cases, i.e. the problem at the heart of all good engineering.
Having your strings be conceptually made up of UTF-8 code units makes them no less strings than those made up of Unicode code points. As this article shows, working with code points is often not the right abstraction anyway, and you need to go all the way up to grapheme clusters to have anything close to what someone would intuitively call a character. Calling a code point a character is no more correct or useful than calling a code unit a char.
All you gain by having Unicode code point strings is the illusion of Unicode support until you test anything that uses combining characters or variation selectors. In essence, languages opting for such strings are making the same mistake as Windows/Java/etc. did when adopting UTF-16.
You qualified "string" with "ASCII", and also tacitly admitted you still need more information than the octets themselves--the length.
Of course, various programming languages have primitives and concepts which they may label "string". But you still need to specify that context, drawing in the additional specification those languages provide. Plus, traditionally and in practice, such concepts often serve the function of importing or exporting unstructured data. So even in the context of a specific programming language, the label "string" is often used to elide details necessary to understanding the content and semantics of some particular chunk of data.
We would all be better off if this were actually true.
Tragically, in C, a string is just barely a data structure, because it must have \0 at the end.
If it were the complete absence of a data structure, we would need some way to get at the length of it, and could treat a slice of it as the same sort of thing as the thing itself.
C doesn’t really have strings at all. It has char pointers, and some standard functions that take a char pointer and act on all the chars starting from that pointer up to the first \0.
When you’re handling any kind of C pointer you need to know how big the buffer is around that pointer where pointer-arithmetic accesses make sense - but for a string, you also want to know ‘how much of the buffer is full of meaningful character data?’ - or else you’re stuck with fixed width text fields like some kind of a COBOL caveman.
But because C was designed by clever people for clever people they figured the standard string functions can just be handed a char pointer without any buffer bounds info because you can be trusted to always make sure that the pointer you give them is below a \0 within a single contiguous char buffer.
You can work with pointer+length or begin+end pairs in C just fine - it's just annoying. But you can always upgrade to C++ and use std::string_view to abstract that for you if you want.
My intent here is "same subject, but now you're standing on the other side of it", not "same viewer location but turned 180º".
For instance, if you started with an image of the Washington Monument with the Lincoln Memorial in the background, the "reverse" would be an image of the Washington Monument with assorted Smithsonian museum buildings behind it. Or whatever you theorize would be on the east of the Washington Monument if no reference is available.
So, in terms of acing interviews, increasingly one of the best answers to the question "Write some code that reverses a string" is that in a world of unicode, "reversing a string" is no longer possible or meaningful.
You'll probably be told "oh, assume US ASCII" or something, but in the meantime, if you can back that up when they dig into it, you'll look really smart.
I'd go further and argue that in general reversing a string isn't possible or meaningful.
It's just not a thing people do, so it's just... not very interesting to argue about what the 'correct' way to do it is.
Similarly, any argument over whether a string has n characters or n+1 characters in it is almost entirely meaningless and uninteresting for real world string processing problems. Allow me to let you into a secret:
there's never really such a thing as a 'character limit'
There might be a 'printable character width' limit; or there might be a 'number of bytes of storage' limit. Which means interesting questions about a string include things like 'how wide is it when displayed in this font?' or 'how many bytes does it take to store or transmit it?'... But there's rarely any point where, for a general string, it is really interesting to know 'how many characters does the string contain?'
Processing direct user text input is the only situation where you really need a rich notion of 'character', because you need to have a clear sense of what will happen if the user moves a cursor using a left or right arrow, and for exactly what will be deleted when a user hits backspace, or copied/cut and pasted when they operate on a selection. The ij ligature might be a single glyph, but is it a single character? When does it matter? Probably not at all unless you're trying to decide whether to let a user put a cursor in the middle of it or not.
And next to that, arguing that there is such a thing as a 'correct' way to reverse "Rijndæl" according to a strict reading of Unicode glyph composability rules just seems like a supremely silly thing to try to do.
I'd much rather, when asked to reverse a string, more developers simply said 'that doesn't make sense, you can't arbitrarily chunk up a string and reassemble it in a different order and expect any good to come of it'.
Boy, that's implicitly a good question... when's the last time I "reversed" a string, on purpose, for something useful?
It took me a bit, but I think I have an answer. It's about 15 years ago. I didn't actually do the original design, but I perpetuated it and didn't remove it. We reversed domain name strings (which, given that they are a subset of ASCII, actually is a well-defined operation) so that the DB we're using, which supported efficient prefix lookups but not suffix lookups, could be used to efficiently query for all subdomains of a given domain, by reversing the domain and using that as the prefix.
I mean this as strong support for your point, not a contradictory "gotcha". I'm a big believer in not doing lots of work to save effort or make correct something you do less than once a decade, e.g., http://www.jerf.org/iri/post/2954 . And it's not even a gotcha anyhow, because we aren't reversing a general string; we were reversing a string very tightly constrained to a subset of ASCII where the operation was fully well-defined. I can't think of when I ever reversed a general string.
Right - any case where you are reversing a string as part of some other operation you will have some goal in mind that is not simply 'produce the reverse of any arbitrary string'. Even if your goal is doing something like printing the crossword puzzle answers backwards at the bottom of the page, you have a tightly constrained set of possible characters so you can literally just throw an error if someone asks you to reverse a string containing a flag.
I actually should admit, for all my protesting above that you never need to do this, I did once actually implement something that "required", as part of the process, reversing a string. It should be apparent once I share what it was why I put scare-quotes around "required" though.
We wanted to test and demonstrate the localization and unicode-readiness capabilities of our software, and to verify that every UI string was actually coming from the resource file for the selected locale, and handled in a unicode-safe way.
So I implemented a program that took in the en-GB resource file, and outputted an en-AU one that contained all the original strings, just flipped upside down. This being, of course, the canonical way to localize a product for Australia.
And to turn a string upside down, you need to reverse the order of the characters, before mapping them to their unicode upside-down equivalent.
Unfortunately, the Unicode consortium do not make available a comprehensive database of which glyphs are 180º reversals of other glyphs, so my solution ended up not having comprehensive coverage of all unicode codepoints, but since my source data was en-US text that wasn't that important; what was more important was that some of the resource strings used a 'safe subset' of HTML so I needed to not turn <strong> into <ƃuoɹʇs>.
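Something like this toy Python sketch, with a deliberately tiny and incomplete flip table (nothing like the real tool):

```
# Map each character to a rough 180°-rotated lookalike; unknown characters pass through.
FLIP = str.maketrans("abcdefghnrstuwy", "ɐqɔpǝɟƃɥuɹsʇnʍʎ")

def upside_down(s: str) -> str:
    return s[::-1].translate(FLIP)

print(upside_down("strong"))  # ƃuoɹʇs -- hence the need to protect markup tags
```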
More than anything, it was probably that experience that gave me a true appreciation for what nonsense it is to try to break a string into characters and manipulate them.
(Also, while I do love the ingenuity of string reversal for suffix-based indexing, reversing a domain name for efficient prefix-based lookup can of course also be done by breaking the name up into subcomponents (thus not requiring you to care about character composition at all between dots), reversing the sequence of those parts and reassembling the string from the components in reverse order - which has the added benefit of preserving human readability of the domain name, and a natural sort order...)
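A quick sketch of both approaches (hypothetical helper names, not the original code):

```
def reverse_chars(domain: str) -> str:
    return domain[::-1]              # safe only because domain names are plain ASCII

def reverse_labels(domain: str) -> str:
    return ".".join(reversed(domain.split(".")))

print(reverse_chars("www.google.com"))   # moc.elgoog.www -> prefix-query on "moc.elgoog."
print(reverse_labels("www.google.com"))  # com.google.www -> prefix-query on "com.google."
```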
"reversing the sequence of those parts and reassembling the string from the components in reverse order"
Given that this was Perl and that's a small chunk of code, it's probably what I would have done in the same circumstance, but given that it already existed it wasn't worth shipping a migration out to the field with a new version. Generally humans didn't consult this table anyhow.
But it was good for a couple of good "wtf is that" faces from other developers the first time they look at the DB, if nothing else. They get it pretty quickly; the preponderance of "moc." and "ude." gets to be a dead giveaway pretty quickly, especially combined with some popular names ("moc.elgoog" almost sounds like a real domain Google might register someday). But still fun if you catch their face at the right moment.
Reversing a string is still meaningful. Take a step back outside the implementation and imagine handing a Unicode string to a human. They could without any knowledge look at the characters they see and produce the correct string reversal.
There is a solution to this which is to compute the list of grapheme clusters, and reverse that.
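For example, a sketch in Python, assuming the third-party `regex` module's \X pattern (extended grapheme clusters per UAX #29, including regional-indicator pairing in recent versions):

```
import regex  # third-party

def reverse_graphemes(s: str) -> str:
    # Split into user-perceived characters, then reverse those.
    return "".join(reversed(regex.findall(r"\X", s)))

print(reverse_graphemes("abc\U0001F1FA\U0001F1F8"))  # 🇺🇸cba -- the flag survives intact
```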
> imagine handing a Unicode string to a human. They could without any knowledge look at the characters they see and produce the correct string reversal.
I really highly doubt it.
How do you reverse this?: مرحبًا ، هذه سلسلة.
Can you do it without any knowledge about whether what looks like one character is actually a special case joiner between two adjacent codepoints that only happens in one direction? Can you do it without knowing that this string appears wrongly in the HN textbbox due to an apparent RTL issue?
It's just not well-defined to reverse a string, and the reason we say it's not meaningful is that no User Story ever starts "as a visitor to this website I want to be able to see this string in opposite order, no not just that all the bytes are reversed, but you know what I mean."
You can even demonstrate a similar concept with English and Latin characters. There is no single thing called a "grapheme" linguistically. There are actually two different types of graphemes. The character sequence "sh" in English is a single referential grapheme but two analogical graphemes. Depending on what the specification means, "short" could be reversed as either "trosh" or "trohs". That's without getting into transliteration. The word for Cherokee in the Cherokee language is "Tsalagi" but the "ts" is a Latin transliteration of a single Cherokee character. Should we count that as one grapheme or two?
Of course, if an interviewer is really asking you how to do this, they're probably either 1) working in bioinformatics, in which case there are exactly four ASCII characters they really care about and the problem is well-defined, or 2) it's implementing something like rev | cut -d '-' -f1 | rev to get rid of the last field and it doesn't matter how you implement "rev" just so long as it works exactly the same in reverse and you can always recover the original string.
The fact that how to reverse a piece of text is locale dependent doesn't mean it's impossible. Basically any transformation on text will be locale dependent. Hell, length is locale dependent.
>what looks like one character is actually a special case joiner between two adjacent codepoints
Are you referring to a grouping not covered by the definition of grapheme clusters (which I am only passingly familiar with)? If so, then I don't think it's any more non-meaningful to reverse it than to reverse an English string. The result is gibberish to humans either way - it sounds more like you're saying that there is no universally "meaningful to humans" way to reverse some text in potentially any language, which is true regardless of what encoding or written language you're using. I was thinking of it more from the programmer side - i.e. that Unicode provides ways to reverse strings that are more "meaningful" (as opposed to arbitrary) than e.g. just reversing code points.
I mean no but only because I don’t understand the characters. Someone who reads Arabic (I assume based on the shape) would have no trouble. You’re nitpicking cases where for some readers visual characters might be hard to distinguish but it doesn’t change the fact that there exists a correct answer for every piece of text that will be obvious to readers of that text which is the definition of a grapheme cluster.
> the fact that there exists a correct answer for every piece of text that will be obvious to readers of that text which is the definition of a grapheme cluster.
No, I insist there is not a single "correct answer," even if a reader has perfect knowledge of the language(s) involved. Now remember, this is already moving the goalposts, since it was claimed that a human needed "no knowledge" to get to this allegedly "correct answer."
You already admit that people who don't speak Arabic will have trouble finding the "grapheme clusters," but even two people who speak Arabic may do your clustering or not, depending on some implicit feeling of "the right way to do it" vs taking the question literally and pasting the smallest highlight-able selection of the string in reverse at a time.
Anyway, take a string like this: "here is some Arabic text: <RLM> <Arabic codepoints> <LRM> And back to English"
Whether you discard the ordering mark[0], keep them, or inverse them is an implementation decision that already produces three completely different strings. Unless we want to write a rulebook for the right way to reverse a string, it remains an impossibility to declare anything the correct answer, and because there is no reason to reverse such a string outside of contrived interview questions and ivory tower debates, it is also meaningless.
You added the requirement that it be a single correct answer. I just asserted that there existed a correct answer. You're being woefully pedantic -- a human who can read the text presented to them but no knowledge of unicode was my intended meaning. Grapheme clusters are language dependent and chosen for readers of languages that use the characters involved. There's no implicit feeling, this is what the standards body has decided is the "right way to do it." If you want to use different grapheme clusters because you think the Unicode people are wrong then fine, use those. You can still reverse the string.
Like what are you even arguing? You declared that something was impossible and then ended with that it's not only possible but it's so possible that there are many reasonable correct answers. Pick one and call it a day.
It is impossible to "correctly reverse a string" because "reverse a string" is not well defined. We explored many different potential definitions of it, to show that there is no meaningful singular answer.
> You added the requirement that it be a single correct answer.
Your original post says "they could produce the correct string reversal"?
UAX #29 is insufficient: at the very least, you must depend on collation too.
In Norwegian, “æ” is a letter, so I believe (as a non-speaker) that they would reverse “blåbærene” to “eneræbålb”; but in English, it’s a ligature representing the diphthong “ae”, and if asked to reverse “æsthetic” I would certainly write “citehtsea” and consider “citehtsæ” to be wrong. (And I enjoy writing the ligature; I fairly consistently write and type æsthetic rather than aesthetic, though I only write encyclopædia instead of encyclopaedia when I’m in a particular sort of mood.)
In Dutch, the digraph “ij” is sometimes considered a ligature and sometimes a letter; as a non-speaker, I don’t know whether natives would say that it should be treated as an atom in reversing or not.
And not all languages will have the concept of reversing even letters, let alone other things. Face it: in English we have the concept of reversing things, but it just doesn’t work the same way in other languages. Sure, UAX #29 defines something that happens to be a good heuristic for reversing, but it doesn’t define reversing, and in the grand scheme of things reversing grapheme-wise is still Wrong. “Reversing a string” is just not a globally meaningful concept.
Another person here has cited Cherokee transliteration, where one extended grapheme cluster turns into multiple English letters. You can apply this to translation in general, but also even keep it inside English and ask: what are we reversing? Letters? Phonemes? Syllables? Words? There are plenty of possibilities which are used in different contexts (and it’s mostly in puzzles, frankly, not general day-to-day life).
The concept of grapheme clusters is acknowledged as approximate. Collations are acknowledged as approximate. Reversing would be even more approximate.
Interesting addendum on the matter of reversing “æsthetic”: I asked my parents what they reckoned, and they both went the other way, reckoning that if you wrote æ in the initial word it should stay æ; my dad said he wouldn’t write it that way in the first place, but that if you’ve written it that way you were treating æ as a letter more than just a way of drawing a certain pair of letters. In declaring otherwise, I was using the linguistic approach, which acknowledges æ as a ligature of the ae diphthong from Latin, being purely stylistic and not semantic. And so we see still more how these things are approximations and subjective.
Keep it first? Like that’s not a gotcha. Your input is a string and the output is that string visually reversed. What it looks like in memory is irrelevant.
The use case: Make an animation, where the text appears starting from the end - for example if you stylize it as vertical text falling from the top.
I think the example shows quite clearly the problem really comes down to dividing the string to logical parts (atoms, grapheme clusters, however you want to call it).
Reversing UTF-8 strings has been a thing for a long time in most/all programming languages. It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible.
"It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible."
It depends on your point of view. From a strict point of view, it does exactly mean it is no longer possible. By contrast, we all 100% knew what reversing an ASCII string meant, with no ambiguity.
It also depends on the version of Unicode you are using, and oh by the way, unicode strings do not come annotated with the version they are in. Since it's supposed to be backwards compatible hopefully the latest works, but I'd be unsurprised if someone can name something whose correct reversal depends on the version of Unicode. And, if not now, then in some later not-yet-existing pair of Unicode standards.
I always thought it was interesting that ASCII is transparently just a bunch of control codes for a typewriter (where "strike an 'a'" is a mechanical instruction no different from "reset the carriage position"), but when we wanted to represent symbolic data we copied it and included all of the nonsensical mechanical instructions.
> It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible.
I don't understand why in maths finding one single counter-example is enough to disprove a theorem yet in programming people seem to be happy with 99.x % of success rate. To me, "It may not work perfectly in 100% of the cases" exactly means "no longer possible" as "possible" used to imply that it would work consistently, 100% of the time.
> "reversing a string" is no longer possible or meaningful.
If you really wanted to, you could write a string reversal algorithm that treated two-character emojis as an indivisible element of the string and preserved its order (just as you'd need to preserve the order of the bytes in a single multi-byte UTF-8 character). You'd just need to carefully specify what you mean by the terms "string", "character" and "reverse" in a way that includes ordered, multi-character sequences like flag emojis.
I would argue that it is possible and meaningful. AFAIK extended grapheme clusters are well defined by the standard, and are very well suited to the default meaning of when somebody says "character", so, given no other information, it's reasonable to reverse a string based on them. I guess the issue is "reverse a string" lacks details, but I think that's different from "not meaningful".
On the challenge front, there are things like á, which might be a single code point or two code points (a + ´). Then there are the really challenging things like ᾷ, where if the components are individual characters, the order of ͺ and ῀ is not guaranteed to be consistent.
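A quick Python sketch of that first case:

```
import unicodedata

s = "cafe\u0301"   # 'café' with a decomposed é (e + COMBINING ACUTE ACCENT)
print(s[::-1])     # the combining accent now dangles at the front, detached from the 'e'
print(unicodedata.normalize("NFC", s)[::-1])  # 'éfac' -- fine here, but NFC won't help
                                              # with flags or ZWJ sequences
```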
No, that only shows that the ISO 3166-1 alpha-2 registry is a bad basis for Unicode flags, since having things lose meaning over time should not be acceptable for a text encoding.
Flags have another issue here in that they can change even when the country stays the same - a recent example here being Afghanistan, but also France who recently changed the official shades of the colors in their flag. Ideally you'd want a new Unicode representation for any changed flags in order to not retroactively change the meaning in old documents.
It's up to the installed fonts really. I don't know if the combination of S + U is standardized as a Soviet Union flag emoji, but even if it is, your locally installed fonts may not contain every single flag emoji, so the browser would still fall back to rendering the two letters instead.
UTF-8 does not represent Unicode code points, but rather Unicode scalar values. The difference between the two is surrogates, the way that UTF-16 ruined Unicode: code points are 0₁₆ to 10FFFF₁₆, scalar values are 0₁₆ to D7FF₁₆ and E000₁₆ to 10FFFF₁₆. Yes, the author quoted Wikipedia, but Wikipedia is wrong on this point; surprisingly comprehensively wrong: the UTF-8 page completely ignores the distinction, and even the page on code points doesn’t mention scalar values! This error propagates to other places, too: for example, “and there are a total of 1,112,064 possible code points”: no, that’s how many scalar values there are; code points also include the 2,048 surrogates, so there are 1,114,112 code points.
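The arithmetic, plus Python's refusal to encode a lone surrogate, as a quick check:

```
code_points = 0x110000          # U+0000 .. U+10FFFF
surrogates = 0xE000 - 0xD800    # U+D800 .. U+DFFF
print(code_points)              # 1114112 code points
print(code_points - surrogates) # 1112064 scalar values, i.e. what UTF-8 can represent

try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError as e:
    print(e)                    # lone surrogates are not encodable in UTF-8
```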
This is a nice dive into the limitations of Python's unicode handling and, at the end, how to work around some problems. But you could use languages with proper unicode support, like Swift or Elixir (weirdly, HN is fighting flags in comment code, which makes examples harder to demonstrate).
As I understand it, there is no two-letter ISO code for the USSR because when they update the standard they remove countries that no longer exist. In at least one case they have reused a code: CS has been both "Czechoslovakia" and "Serbia and Montenegro", neither of which currently exists.
As a result, two-letter ISO codes are useless for many potential applications, such as, for example, recording which country a book was published in, unless you supplement them with a reference to a particular version of the standard.
Is there a way of getting the Czechoslovakian flag as an emoji? And did Serbia and Montenegro get round to making a flag?
Ah, I didn't realize they reused codes from ISO 3166-3. I figured, because they keep these regions around in their own set, that was some implication that the codes would not be reused.
True, but Unicode explicitly defines "SU" as a deprecated combination, regardless of flags. Seems like they omit everything from the list of "no longer used" country codes, with some exceptions. I would think they would have no reason not to allow historical regions.
As for this article & Python - as usual it is biasing towards convenience and implicit behavior rather than properly handling all edge cases.
Compare with Rust where you can't "reverse" a string - that is not a defined operation. But you can either break it into a sequence of characters or graphemes and then reverse that, with expected results: https://play.rust-lang.org/?version=stable&mode=debug&editio...
(Sadly the grapheme segmentation is not part of standard library, at least yet)
Interesting article. Written for beginners, conversationally. Has excessive amounts of whitespace, for "readability" I guess. But at the same time it dives quite deep, and I don't think this "style" of presentation matches up with the amount of time a more novice reader is going to devote to a single long-form article.
As to the content, for all the deep dive, a simple link to https://unicode.org/reports/tr51/#Flags and what an emoji is, would have saved so much exposition. I also wish he'd touched on normalization. With the amount of time he's demanding from readers he could have mentioned this important subject. Because then he could discuss why (starting from his emoji example) a-grave (à) might or might not be reversible, depending how the character is composed.
Also wish he'd pointed to some libraries that can do such reversals.
Upper and lower codepoints are really way too obscure and can create issues you didn't even know you had.
I once had the very unpleasant experience of debugging a case where data saved with R on windows and loaded on macOS ended up with individually double-encoded codepoints.
Something I haven't seen mentioned yet is one of the most annoying things about regional indicator symbols, which is that interpreting them correctly requires arbitrary backtracking, and handling this correctly is very annoying for things like text fields.
Basically: A single, unpaired RIS counts as a single grapheme. Similarly, a pair of RIS count as a single grapheme. Now imagine that your cursor position is after an RIS, and you arrow backwards (assuming LTR text, imagine your cursor is to the right of an RIS, and you press the left arrow). Your textbox should now move the cursor to the left by one grapheme. How do you figure out where this is, in code units? You basically have to scan backwards until you find the first non-RIS codepoint, and then you have to match them up into pairs to figure out if your left-arrow movement should correspond to a movement of one or two codepoints.
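A minimal sketch of that backward scan (hypothetical helper, ignoring combining marks and the rest of the grapheme rules):

```
RIS_START, RIS_END = 0x1F1E6, 0x1F1FF   # regional indicator symbols A..Z

def is_ris(cp: int) -> bool:
    return RIS_START <= cp <= RIS_END

def left_arrow_step(codepoints: list[int], cursor: int) -> int:
    """How many code points to step over when pressing 'left' at `cursor`."""
    if cursor == 0:
        return 0
    if not is_ris(codepoints[cursor - 1]):
        return 1
    # Count the whole run of RIS code points ending at the cursor.
    run, i = 0, cursor - 1
    while i >= 0 and is_ris(codepoints[i]):
        run += 1
        i -= 1
    # RIS pair up from the start of the run: an even-length run means the cursor
    # sits on a pair boundary (step over a whole flag), odd means a lone trailing RIS.
    return 2 if run % 2 == 0 else 1

flags = [ord(c) for c in "\U0001F1FA\U0001F1F8\U0001F1EB\U0001F1F7"]  # 🇺🇸🇫🇷
print(left_arrow_step(flags, len(flags)))  # 2 -- step over the second flag as one unit
```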
This is a longstanding source of bugs, and if you're bored you can play around with pasting a huge sequence of flags into a textfield and then trying to navigate around it with the arrow keys/mouse. There are some broken implementations out there.
edit: while I'm thinking about this I will point out that an alternative design, which would have solved this problem (and which was first pointed out to me by @raphlinus) would have been to have two separate sets of RI symbols, one for 'first position' and one for 'second position'; then you could always determine the appropriate cursor position without needing context. Isn't hindsight a wonderful thing?
Julia docs do a (surprisingly) good job of being clear and explicit about this: the docstring for `reverse(AbstractString)` says:
> Reverses a string. Technically, this function reverses the codepoints in a string and its main utility is for reversed-order string processing [...]. See also [...] `graphemes` from module Unicode to operate on user-visible "characters" (graphemes) rather than codepoints.
Properly reversing a string of flags (or any other grapheme clusters) is just a `using Unicode: graphemes` away.
I do feel like these are all 'gotcha' questions - I haven't seen any real-world requirement to reverse a string and then have it be displayed in a useful way.
> Challenge: How would you go about writing a function that reverses a string while leaving symbols encoded as sequences of code points intact? Can you do it from scratch? Is there a package available in your language that can do it for you? How did that package solve the problem?
So are there any good libraries that can deal with code points that are merged together into a single pictograph and reverse them "as expected"?
This misses the real problem with flag emoji in that they are composed of codepoints that can be in any order. With other emoji you get a base codepoint with potential combining characters. Using a table of combining character ranges you can skip over them and isolate the logical glyph sequences. You don't need surrounding context to parse them out like flags need.
I think that somewhere in this answer lies a reason why Windows still doesn't support flag emoji. I don't count Microsoft Edge as "Windows" in this case, but as Chromium. Windows doesn't support flag emoji in its native text boxes, even though it otherwise supports color emoji.
But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.
Flags are political. Microsoft has removed country borders from its products for political reasons; a post above says flag rendering was excluded for the same reason.
> But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.
Flags are not that hard, they're a very specific block combining in a very predictable way. They're little more than ligatures. Family emoji are much harder.
Consider you have to split a string with 20 flags in sequence at a given offset. That's 40 codepoints with no readily discernible boundaries. To parse that you have to scan backwards to find the first non-flag codepoint. Otherwise you could split the middle of a flag pair. You also have to handle rendering invalid combinations as two glyphs and unpaired codes. For normal codepoints with combining characters you can scan forwards until you reach a non-combining character.
> Flags are not that hard, they're a very specific block combining in a very predictable way.
But before their introduction, you could decide if there's a grapheme cluster break between codepoints just by looking at the two codepoints in question. Now, you may need to parse a whole sequence of codepoints to see how flags pair up.
Nope, because the repurposing is independent of how the Unicode is represented. There's absolutely no advantage to having a string in UTF-32 over UTF-8: you'll still need to examine every character, and the overhead of decoding UTF-8 bytes into 32-bit code points is far outweighed by the huge memory increase needed to store UTF-32.
What's more, it's really not that difficult to start at the end of a valid UTF-8 string and get the characters in reverse order. UTF-8 is well-designed that way in that there's never ambiguity about whether you're looking at the beginning byte of a code point.
> UTF-8 is well-designed that way in that there's never ambiguity about whether you're looking at the beginning byte of a code point.
To expand: if the most-significant bit is 0, it's an ASCII codepoint. If the top two bits are '10', it's a continuation byte, and if they're '11', it's the start of a multibyte codepoint (the number of leading 1 bits in that lead byte tells you how many bytes the codepoint occupies, which makes codepoint counting easy).
So a naive codepoint reversal algorithm would start at the end, and move backwards until it sees either an ASCII codepoint or the start of a multibyte one. Upon reaching it, copy those 1-4 bytes to the start of a new buffer. Continue until you reach the start.
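A sketch of that naive codepoint reversal in Rust, working directly on the bytes (it reverses codepoints only, so combining marks and flag pairs still come out scrambled):

```rust
// Walk backwards, find each lead byte (anything that is not a 0b10xxxxxx
// continuation byte), and append that 1-4 byte codepoint to the output.
fn reverse_codepoints(s: &str) -> String {
    let bytes = s.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut end = bytes.len();
    while end > 0 {
        let mut start = end - 1;
        // Continuation bytes have the form 0b10xxxxxx.
        while bytes[start] & 0xC0 == 0x80 {
            start -= 1;
        }
        out.extend_from_slice(&bytes[start..end]);
        end = start;
    }
    // The input &str is guaranteed valid UTF-8, so this cannot fail.
    String::from_utf8(out).expect("input was valid UTF-8")
}
```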
I wish languages did a far better job clearly distinguishing between their operations:
1. You are acting in byte space and it’s pretty unambiguous what should happen. We are not acknowledging the semantics of language and alphabets.
2. You're acting in language space, and these operations will behave the way you think they should (depending, probably, on your cultural expectations)
Thankfully, SU isn't currently an assigned ISO 3166-1 two-letter country code; if it were, people might have interesting reactions to what reversing a US flag emoji produces.
This is a cool article about Unicode encoding; however, I still feel like it should be possible to reverse strings with flag emojis. I don't see why computers can't handle multi-rune symbols the same way they handle multi-byte runes: combine all the runes that form a single symbol and preserve their order within the reversed string. Of course, that means naive string reversal doesn't work anymore, but naive string reversal wouldn't work with UTF-8 either if we just went byte by byte.
Swift, for example, does what you're saying. I thought that the reason many languages don't do it that way is that part of the definition of an array (or at least expected-by-convention) is constant-time operations. If you treat a string as an array, then having to deal with variable-length units breaks that rule. That's why, when there is an API for dealing with grapheme clusters, it is usually a special case that duplicates an array-like API, instead of literally using an array.
I actually don't know how/why Python is apparently using code points, since they are variable length. That seems like a compromise between using code units and using grapheme clusters that gets you the worst of both worlds.
Edit: Maybe it uses UTF-32 under the hood when it's doing array operations on code points?
CPython 3 indexes strings by code point (since PEP 393 it stores each string as Latin-1, UCS-2, or UCS-4 depending on the widest code point it contains, so the worst case is effectively UTF-32; there is bytes for a plain sequence of bytes). As you say, it's the worst of both worlds: potentially high memory usage, and not really useful if you are dealing with user-perceived characters (grapheme clusters).
My impression is most modern languages that bother with unicode (swift, rust, nim) are using utf-8, and doing linear time operations to handle unicode. I think that's the right approach, as I don't recall ever needing random access on a unicode string.
Swift originally had UTF-32 strings (an upgrade from ObjC which uses UTF-16), but they redid String to be UTF-8. It's typically the best choice even for CJK text, because it has ASCII mixed in often enough to still be smallest.
I don't think reversing a string is a meaningful operation though; there's no reason to think of a string as a "list of characters" when it's also a "list of words" and several other things. Swift provides more than one kind of iteration for that reason.
Of course it's possible: the Unicode standard even has a table[0] you can use to build a DFA (deterministic finite automaton) that breaks a string up into grapheme clusters. You can reverse the DFA to match and yield the graphemes backwards as well, which gives you the reversed Unicode string.
Personally, I found the Unicode spec very messy. Critical information is all over the place. You can see the direct effect of this when you compare various packages across different languages and discover that every library disagrees in multiple places. Even JS String.normalize() isn't consistent in the latest version of most browsers: https://adraffy.github.io/ens-normalize.js/test/report-nf.ht... (fails in Chrome, Safari)
The major difference between ENS and DNS is that emoji are front and center. ENS resolves by computing a hash of a name in a canonicalized form. Since resolution must happen in a decentralized way, simply punting to punycode and relying on custom logic for Unicode handling isn't possible. On-chain records are 1:1, so there's no fuzzy matching either. Additionally, ENS is actively registering names, so any improvement to the system must preserve as many existing names as possible.
At the moment, I'm attempting to improve upon the confusables in the Common/Greek/Latin/Cyrillic scripts, and will combine these new groupings with mixed-script limitations similar to the IDN handling in Chromium.
With all the criticism I normally have for Rust, I must say that its type-safe handling of UTF-8 and its unambiguous distinction between byte strings and UTF-8 strings are extremely helpful in handling the situations mentioned in the article correctly (and also efficiently).
Yes, it's a pain, but the way the standard library designed its types forces you to handle conversions correctly, for example when byte arrays that may contain invalid UTF-8 sequences are converted to UTF-8 strings.
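For example (this is just the standard library's `String::from_utf8` / `from_utf8_lossy` pair):

```rust
fn main() {
    let bytes = vec![0x66, 0x6f, 0x6f, 0xff]; // "foo" followed by an invalid byte

    // The Result forces the caller to decide what invalid UTF-8 means for them:
    match String::from_utf8(bytes.clone()) {
        Ok(s) => println!("valid: {s}"),
        Err(e) => println!("invalid UTF-8 after {} good bytes", e.utf8_error().valid_up_to()),
    }

    // Or opt into U+FFFD replacement characters explicitly:
    let lossy = String::from_utf8_lossy(&bytes);
    println!("lossy: {lossy}");
}
```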
It's an emoji... Are there any emojis with only one character? My assumption going in would be that any emoji is > 1 character. Admittedly, despite lots of string processing, I never have to deal with emojis so I guess I'm not sure.
An interesting exercise would be emoji detection during string reversal to preserve the original emoji. I thought something like that would be the crux of the article.
Why reverse them if one can barely implement, display, and edit them correctly? I never could make them work perfectly in Vim. Also, I had to open a bug in Firefox recently:
> The answer is: it depends. There isn't a canonical way to reverse a string, at least that I'm aware of.
Unicode defines grapheme clusters[1] that represent "user-perceived characters"; separating a string into those and reversing them seems like a pretty good way to go about it.
I'm not surprised the flag had two components, but I _was_ surprised the US flag was made by literally U and S, haha!
I definitely thought it'd be something like [I am a Flag] and [The flag ID between 0 and 65535]. And reversing it would be [Flag ID] + [I am a Flag] which would not be a defined "component" and instead rendered as the individual two nonsense characters.
You might also have noticed this is partly a very well-thought-out hack to make Unicode less sensitive to disagreements and changes in consensus on which flags are encoded, or even the names of the countries concerned!
Related — I did a deep dive a couple years ago on emoji codepoints and how they're encoded in the Apple emoji font file, with the end goal of extracting the embedded images — https://github.com/alfredxing/emoji
> A separate mechanism (emoji tag sequences) is used for regional flags, such as England, Scotland, Wales, Texas or California. It uses U+1F3F4 WAVING BLACK FLAG and formatting tag characters instead of regional indicator symbols. It is based on ISO 3166-2 regions with hyphen removed and lowercase, e.g. GB-ENG → gbeng, terminating with U+E007F CANCEL TAG. Flag of England is therefore represented by a sequence U+1F3F4, U+E0067, U+E0062, U+E0065, U+E006E, U+E0067, U+E007F.
This was the only part that was surprising to me, and as it turns out my surprise mostly stems from still not really understanding how the United Kingdom works.
I was born and raised in the UK and I still can't explain to people how it works. Are England, Scotland, Wales and Northern Ireland separate countries? Is the UK a country? It's countries all the way down.
Don't worry, "How the United Kingdom works" is a political question and so subject to change.
For example, Wales was essentially just straight up conquered, and so for long periods Wales did not have any distinct legal identity from England. You'll see that today there's a bunch of laws which are for England and Wales but notably not Scotland, including criminal laws. In living memory Wales got some measure of independent control over its own affairs, via an elected "Assembly" but what powers are "devolved" to this assembly are in effect the gift of the Parliament, in Westminster, which is sovereign. Whether taking away those powers would go well is a good question.
On the other hand, Northern Ireland is what's left of English/ British dominion over the entire island of Ireland, most of which today is the Republic of Ireland, a sovereign entity with its own everything. It's only existed for about a century, and is a result of the agreed "partition" when the Irish rebelled because most of the Irish wanted independence but those in the North not so much. Feel free to read about euphemistically named "Troubles". In the modern era, Northern Ireland, like Wales, gets a devolved government in Stormont. Unlike Wales, the Northern Ireland government is a total mess, and e.g. they have abortion (like the rest of the UK, and like the rest of Ireland) only because Stormont was so broken that Westminster imposed abortion legalisation on them since they weren't actually governing. If you think the US Congress is dysfunctional, check out Stormont...
Finally, Scotland was for a very long time an independent but closely related sovereign nation. It agreed to join this United Kingdom about three hundred years ago in the Acts of Union, after about a century with the same Monarch ruling both countries. However, it too got a devolved government in the 20th century, a Parliament, probably the most powerful of the three, in Holyrood, Edinburgh, and it has a relatively powerful pro-independence politics: the Scottish National Party is the dominant power in Scottish politics, although how many of its voters actually support independence per se is tricky to judge.
Brexit changed all this again, because as part of the EU a bunch of the powers you could reasonably localise, and so were "devolved" to Wales, Scotland and Northern Ireland, had been controlled by EU law. So Westminster could say they were devolved, knowing that the constituent entities couldn't actually do much with this supposed power. Having left the EU, those powers were among the things Brexiteers seemed to have imagined now lay with Westminster, but of course the devolved countries said no, these are our powers: we get to decide e.g. how agricultural subsidies are distributed to suit our farmers.
That's even more fun in Northern Ireland, because they share a border with the Republic, an EU member, and so they're not allowed to have certain rules that would obviously result in a physical border with guards and so on. Their Unionists (the people who are why it isn't just part of the Republic of Ireland because they want to be in the United Kingdom) feel like they were sold out by Westminster politicians, while the Republicans (those who'd rather be part of the Republic) see this as potentially a further argument in favour of that. All of which isn't helping at all to keep the peace between these rivals, that peace being the whole reason we don't want to put up a border...
Most flags use the ISO 2-character country code to access their values. However, some flags don't map to 2-character country codes (Scotland being one example). In that case the sequence is black flag, GBSCT (for Great Britain-Scotland, represented using the tag Latin small letter codes for the letters), then cancel tag to end the sequence. Changing the middle five to GBENG gives the English flag, and GBWLS gives the Welsh flag.
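A small sketch of that construction (my own helper, not a standard API), mapping each letter of the lowercased region code into the tag block and terminating with CANCEL TAG:

```rust
// Builds an emoji tag sequence flag: U+1F3F4 WAVING BLACK FLAG, then each
// ASCII letter of the region code mapped into the tag block (U+E0000 + byte),
// then U+E007F CANCEL TAG.
fn subdivision_flag(code: &str) -> String {
    let mut s = String::from('\u{1F3F4}');
    for b in code.to_ascii_lowercase().bytes() {
        s.push(char::from_u32(0xE0000 + b as u32).unwrap());
    }
    s.push('\u{E007F}');
    s
}

fn main() {
    println!("{}", subdivision_flag("gbsct")); // Scotland
    println!("{}", subdivision_flag("gbeng")); // England
}
```

Whether anything actually renders for a given code is then entirely up to the font.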
In normal conditions you can check for a ZWJ, but with regional indicator chars, you would have to consider the whole regional-indicator run as a single unit in the reversal. Given that this isn't necessarily locale-dependent but presentation-layer-dependent, there might not be enough info to decide how to act.
URL Encoding works on bytes and does not concern itself with the character encoding of those bytes (except assuming that it is an ASCII superset) so this is only a limitation of the JS implementation.
Of course you can reverse a string with a flag emoji. You just need to treat a "string" as a collection of Extended Grapheme Clusters, and then you reverse the order of the EGCs. So if the string is `a<flag unicode bytes>b`, the output should be `b<flag unicode bytes>a`.
There are plenty of country codes that when reversed become a different, valid country code: e.g. Israel (IL) when reversed is Liechtenstein (LI); Australia (AU) becomes Ukraine (UA).
Whether "reversing flag emojis" causes such transformations will depend on what is meant by "reversing", which is kind of the whole point here: there are a number of possible interpretations of "reverse".
It's sad that Unicode doesn't include flags for dissolved countries. If it did, reversing a US flag would make a Soviet Union flag (code SU). That would make the text much more fun.
The whole reason for handling the flag emojis that way was so that the Unicode Consortium wouldn't have to decide which countries should or should not be recognized. It is totally valid for you to configure your computer to display SU as a Soviet flag.
Let me cheat a bit and say Unicode comes in three flavors: UTF-8, UCS-2 aka UTF-16, and UTF-32. UTF-8 is byte-oriented, UTF-16 is double-byte oriented, and UTF-32 nobody uses because you waste half the word almost all of the time.
You can't reverse the bytes in UTF-8 or UTF-16, because you'll scramble the encoding. But you could parse the string codepoint-at-a-time, handling the specifics of UTF-8, or of UTF-16 with its surrogate pairs, and reverse those. This is equivalent to reversing UTF-32, and I believe it's what the original poster was imagining.
Except you can't do that, because Unicode has composing characters. Now, I'm American and too stupid to type anything other than ASCII, but I know about n+~ = ñ. If you have the pre-composed version of ñ, you can reverse the codepoint (it's one codepoint). If you don't have it, and you have n+dead ~, you can't reverse it, or in the word "año" you might put the ~ on the "o". (Even crazier things happen when you get to the ligatures in Arabic; IIRC one of those is about 20 codepoints.)
So we can't just reverse codepoints, even in ancient versions of Unicode. Other posters have talked about even more exotic stuff like emoji + skin tone. It's necessary to be very careful.
Now, the old fart in me says that ASCII never had this problem. But the old fart in me knows about CRLF in text protocols, and that's never LFCR; and that if you want to make a ñ in ASCII you must send n ^H ~. I guess you can reverse that, but if you want to do more exotic things it becomes less obvious.
(IIRC UCS-2 is the deadname; strictly, UCS-2 was the fixed-width, BMP-only predecessor, and we call it UTF-16 now to remind us to always handle surrogate pairs correctly, which we don't.)
When I first realized that the skin tone emojis were a code-point + a color code-point modifier, I tried to see what other colors there were and if I could apply those to other emojis. The immature child in me looked to see if there was a red color code point and if so, could I use it to make a "blood poop" emoji. Turns out.... no.
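For the curious: the skin tones are the five modifier codepoints U+1F3FB..U+1F3FF appended after a base emoji, and they only fuse into one glyph when the base has the Emoji_Modifier_Base property, which U+1F4A9 does not. A tiny illustration:

```rust
fn main() {
    let waving = "\u{1F44B}\u{1F3FD}"; // waving hand + medium skin tone: renders as one glyph
    let poop   = "\u{1F4A9}\u{1F3FB}"; // pile of poo + light skin tone: typically two glyphs
    println!("{waving} {poop}");
}
```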
The author tries to define "character" when there isn't actually any definition of what that even means. "Character" is a term limited to languages that actually use them, and not all text is made up of characters.