
Reversing a string is still meaningful. Take a step back outside the implementation and imagine handing a Unicode string to a human. They could without any knowledge look at the characters they see and produce the correct string reversal.

There is a solution to this which is to compute the list of grapheme clusters, and reverse that.

https://unicode.org/reports/tr29/
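A minimal sketch of "compute the clusters, then reverse the list", using only the Python standard library. It is an approximation and not a UAX #29 implementation: it only attaches combining marks and zero-width-joiner sequences to the preceding character, and ignores the rest of the rules (Hangul syllables, regional indicators, emoji modifiers, and so on). The names `approx_clusters` and `reverse_clusters` are made up for illustration.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner


def approx_clusters(text):
    """Split text into rough clusters: base character + trailing marks/ZWJ parts.

    Approximation only; real segmentation follows the UAX #29 rules.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")  # Mn, Mc, Me
        joins_previous = bool(clusters) and (
            is_mark or ch == ZWJ or clusters[-1].endswith(ZWJ)
        )
        if joins_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def reverse_clusters(text):
    """Reverse the order of the clusters, leaving each cluster intact."""
    return "".join(reversed(approx_clusters(text)))


word = "ae\u0301b"  # 'a', 'e' + COMBINING ACUTE ACCENT, 'b'
by_cluster = reverse_clusters(word)   # accent stays on the 'e'
by_code_point = word[::-1]            # accent ends up after the 'b'
```

Reversing code points moves the accent onto a different letter, which is the failure the cluster-based approach is meant to avoid.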




> imagine handing a Unicode string to a human. They could without any knowledge look at the characters they see and produce the correct string reversal.

I really highly doubt it.

How do you reverse this?: مرحبًا ، هذه سلسلة.

Can you do it without any knowledge about whether what looks like one character is actually a special case joiner between two adjacent codepoints that only happens in one direction? Can you do it without knowing that this string appears wrongly in the HN textbox due to an apparent RTL issue?

It's just not well-defined to reverse a string, and the reason we say it's not meaningful is that no User Story ever starts "as a visitor to this website I want to be able to see this string in opposite order, no not just that all the bytes are reversed, but you know what I mean."


You can even demonstrate a similar concept with English and Latin characters. There is no single thing called a "grapheme" linguistically. There are actually two different types of graphemes. The character sequence "sh" in English is a single referential grapheme but two analogical graphemes. Depending on what the specification means, "short" could be reversed as either "trosh" or "trohs". That's without getting into transliteration. The word for Cherokee in the Cherokee language is "Tsalagi" but the "ts" is a Latin transliteration of a single Cherokee character. Should we count that as one grapheme or two?
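To make the "trosh" vs "trohs" ambiguity concrete, here is a small sketch in which the caller decides which letter sequences count as one unit. The function name `reverse_units` and its `units` parameter are hypothetical, and matching is case-sensitive for simplicity.

```python
def reverse_units(word, units=()):
    """Reverse a word, treating each sequence in `units` as a single atom."""
    ordered = sorted(units, key=len, reverse=True)  # try longer units first
    pieces = []
    i = 0
    while i < len(word):
        for unit in ordered:
            if unit and word.startswith(unit, i):
                pieces.append(unit)
                i += len(unit)
                break
        else:
            # No multi-letter unit starts here, so take a single character.
            pieces.append(word[i])
            i += 1
    return "".join(reversed(pieces))


print(reverse_units("short"))            # trohs
print(reverse_units("short", ["sh"]))    # trosh
print(reverse_units("tsalagi", ["ts"]))  # igalats
```

Both outputs for "short" are valid results; which one is "right" depends entirely on the unit list you pass in.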

Of course, if an interviewer is really asking you how to do this, they're probably either 1) working in bioinformatics, in which case there are exactly four ASCII characters they really care about and the problem is well-defined, or 2) implementing something like rev | cut -d '-' -f1 | rev to get rid of the last field, where it doesn't matter how you implement "rev" just so long as it works exactly the same in reverse and you can always recover the original string.
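A rough Python rendering of that pipeline, assuming plain code point reversal for "rev". One detail worth noting: as written, `cut -f1` keeps the first field of the reversed line, which is the last field of the original; dropping the last field would be `-f2-`. Both variants are sketched, and the function names are made up.

```python
def rev(s):
    """Stand-in for rev(1): any reversal works here as long as rev(rev(s)) == s."""
    return s[::-1]


def last_field(line, delim="-"):
    """rev | cut -d '-' -f1 | rev : keeps only the last field."""
    head, _sep, _tail = rev(line).partition(delim)
    return rev(head)


def drop_last_field(line, delim="-"):
    """rev | cut -d '-' -f2- | rev : removes the last field."""
    _head, sep, tail = rev(line).partition(delim)
    if not sep:
        return line  # cut prints lines without a delimiter unchanged
    return rev(tail)


print(last_field("foo-bar-baz"))       # baz
print(drop_last_field("foo-bar-baz"))  # foo-bar
```

The only property the trick relies on is that applying `rev` twice gives back the original string, so it does not matter which notion of reversal is used.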


The fact that how to reverse a piece of text is locale dependent doesn't mean it's impossible. Basically any transformation on text will be locale dependent. Hell, length is locale dependent.
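A small sketch of a narrower version of the length point: even before locale comes into it, "length" depends on which unit you count and on normalization. This does not demonstrate locale dependence as such. The helper name `lengths` is made up.

```python
import unicodedata


def lengths(s):
    """Return several different 'lengths' of the same string."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "nfc_code_points": len(unicodedata.normalize("NFC", s)),
    }


# 'e' followed by COMBINING ACUTE ACCENT: one visible character.
print(lengths("e\u0301"))
# {'code_points': 2, 'utf8_bytes': 3, 'utf16_units': 2, 'nfc_code_points': 1}
```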


>what looks like one character is actually a special case joiner between two adjacent codepoints

Are you referring to a grouping not covered by the definition of grapheme clusters (which I am only passingly familiar with)? If so, then I don't think it's any more non-meaningful to reverse it than to reverse an English string. The result is gibberish to humans either way - it sounds more like you're saying that there is no universally "meaningful to humans" way to reverse some text in potentially any language, which is true regardless of what encoding or written language you're using. I was thinking of it more from the programmer side - i.e. that Unicode provides ways to reverse strings that are more "meaningful" (as opposed to arbitrary) than e.g. just reversing code points.


I mean no, but only because I don’t understand the characters. Someone who reads Arabic (I assume based on the shape) would have no trouble. You’re nitpicking cases where, for some readers, visual characters might be hard to distinguish, but that doesn’t change the fact that there exists a correct answer for every piece of text that will be obvious to readers of that text, which is the definition of a grapheme cluster.


> the fact that there exists a correct answer for every piece of text that will be obvious to readers of that text which is the definition of a grapheme cluster.

No, I insist there is not a single "correct answer," even if a reader has perfect knowledge of the language(s) involved. Now remember, this is already moving the goalposts, since it was claimed that a human needed "no knowledge" to get to this allegedly "correct answer."

You already admit that people who don't speak Arabic will have trouble finding the "grapheme clusters," but even two people who speak Arabic may do your clustering or not, depending on some implicit feeling of "the right way to do it" vs taking the question literally and pasting the smallest highlight-able selection of the string in reverse at a time.

Anyway, take a string like this: "here is some Arabic text: <RLM> <Arabic codepoints> <LRM> And back to English"

Whether you discard the ordering marks[0], keep them, or invert them is an implementation decision that already produces three completely different strings. Unless we want to write a rulebook for the right way to reverse a string, it remains an impossibility to declare anything the correct answer, and because there is no reason to reverse such a string outside of contrived interview questions and ivory tower debates, it is also meaningless.

[0]: https://en.m.wikipedia.org/wiki/Right-to-left_mark https://en.m.wikipedia.org/wiki/Left-to-right_mark
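The three choices for the direction marks can be sketched like this. Plain code point reversal is assumed as the underlying "reverse" purely to keep the example short, and the function names are made up.

```python
RLM = "\u200f"  # RIGHT-TO-LEFT MARK
LRM = "\u200e"  # LEFT-TO-RIGHT MARK

SWAP_MARKS = str.maketrans({RLM: LRM, LRM: RLM})


def reverse_discard_marks(s):
    """Reverse and drop every RLM/LRM."""
    return "".join(ch for ch in reversed(s) if ch not in (RLM, LRM))


def reverse_keep_marks(s):
    """Reverse and leave the marks as they are."""
    return s[::-1]


def reverse_invert_marks(s):
    """Reverse and swap each RLM for an LRM and vice versa."""
    return s[::-1].translate(SWAP_MARKS)


sample = "ab" + RLM + "\u0645\u0631" + LRM + "cd"
results = {
    reverse_discard_marks(sample),
    reverse_keep_marks(sample),
    reverse_invert_marks(sample),
}
print(len(results))  # 3: each policy yields a different string
```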


You added the requirement that it be a single correct answer. I just asserted that there existed a correct answer. You're being woefully pedantic -- a human who can read the text presented to them but has no knowledge of Unicode was my intended meaning. Grapheme clusters are language dependent and chosen for readers of languages that use the characters involved. There's no implicit feeling, this is what the standards body has decided is the "right way to do it." If you want to use different grapheme clusters because you think the Unicode people are wrong then fine, use those. You can still reverse the string.

Like what are you even arguing? You declared that something was impossible and then ended with that it's not only possible but it's so possible that there are many reasonable correct answers. Pick one and call it a day.


> Like what are you even arguing?

It is impossible to "correctly reverse a string" because "reverse a string" is not well defined. We explored many different potential definitions of it, to show that there is no meaningful singular answer.

> You added the requirement that it be a single correct answer.

Your original post says "they could produce the correct string reversal"?


Is a RTL character string already "reversed" from a LTR POV?

Is an absolute value signed as positive?


UAX #29 is insufficient: at the very least, you must depend on collation too.

In Norwegian, “æ” is a letter, so I believe (as a non-speaker) that they would reverse “blåbærene” to “eneræbålb”; but in English, it’s a ligature representing the diphthong “ae”, and if asked to reverse “æsthetic” I would certainly write “citehtsea” and consider “citehtsæ” to be wrong. (And I enjoy writing the ligature; I fairly consistently write and type æsthetic rather than aesthetic, though I only write encyclopædia instead of encyclopaedia when I’m in a particular sort of mood.)
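A sketch of what "depends on the language" might look like in code: a per-language table says whether æ is expanded to its component letters before reversing. The table contents and the name `reverse_for_language` are hypothetical and encode one opinion about English, not an established rule.

```python
import unicodedata

# Hypothetical table: languages in which a ligature is "really" two letters.
EXPANSIONS = {
    "en": {"\u00e6": "ae", "\u00c6": "AE"},  # æ / Æ treated as a ligature
    # "nb" has no entry: in Norwegian, æ is treated as a letter of its own
}


def reverse_for_language(text, lang):
    """Reverse text by code point after applying language-specific expansions."""
    text = unicodedata.normalize("NFC", text)  # keep å etc. as single code points
    for ligature, letters in EXPANSIONS.get(lang, {}).items():
        text = text.replace(ligature, letters)
    return text[::-1]


print(reverse_for_language("\u00e6sthetic", "en"))        # citehtsea
print(reverse_for_language("bl\u00e5b\u00e6rene", "nb"))  # eneræbålb
```

The same input character gives different results purely because of the language argument, and the table itself is a matter of opinion.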

In Dutch, the digraph “ij” is sometimes considered a ligature and sometimes a letter; as a non-speaker, I don’t know whether natives would say that it should be treated as an atom in reversing or not.

And not all languages will have the concept of reversing even letters, let alone other things. Face it: in English we have the concept of reversing things, but it just doesn’t work the same way in other languages. Sure, UAX #29 defines something that happens to be a good heuristic for reversing, but it doesn’t define reversing, and in the grand scheme of things reversing grapheme-wise is still Wrong. “Reversing a string” is just not a globally meaningful concept.

Another person here has cited Cherokee transliteration, where one extended grapheme cluster turns into multiple English letters. You can apply this to translation in general, but also even keep it inside English and ask: what are we reversing? Letters? Phonemes? Syllables? Words? There are plenty of possibilities which are used in different contexts (and it’s mostly in puzzles, frankly, not general day-to-day life).

The concept of grapheme clusters is acknowledged as approximate. Collations are acknowledged as approximate. Reversing would be even more approximate.


Interesting addendum on the matter of reversing “æsthetic”: I asked my parents what they reckoned, and they both went the other way, reckoning that if you wrote æ in the initial word it should stay æ; my dad said he wouldn’t write it that way in the first place, but that if you’ve written it that way you were treating æ as a letter more than just a way of drawing a certain pair of letters. In declaring otherwise, I was using the linguistic approach, which acknowledges æ as a ligature of the ae diphthong from Latin, being purely stylistic and not semantic. And so we see still more how these things are approximations and subjective.


Should it reverse a BOM as well or keep it first?


Remove it since the BOM is a hack to deal with shitty transfer encodings (i.e. UTF-16LE vs. UTF-16BE) and useless for UTF-8.


Keep it first? Like that’s not a gotcha. Your input is a string and the output is that string visually reversed. What it looks like in memory is irrelevant.
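The two answers given here (strip the BOM, or keep it first) can be sketched as a single function with a policy switch. Code point reversal is assumed for the body, and the name `reverse_text` and its `bom` parameter are made up.

```python
BOM = "\ufeff"  # U+FEFF at the start of a decoded string


def reverse_text(s, bom="keep"):
    """Reverse s; a leading BOM is either kept in front or stripped."""
    has_bom = s.startswith(BOM)
    body = s[1:] if has_bom else s
    reversed_body = body[::-1]
    if has_bom and bom == "keep":
        return BOM + reversed_body
    return reversed_body


print(repr(reverse_text("\ufeffabc")))               # '\ufeffcba'
print(repr(reverse_text("\ufeffabc", bom="strip")))  # 'cba'
```

Either way the BOM is never moved to the end, which is what a naive reversal of the whole string would do.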



