I wonder how many of the solutions would still work for arbitrary unicode strings. It looks like the author assumes only ascii exists. Perhaps the interns would be better off if they were taught that "reverse a string" is a fundamentally dangerous (and pointless) operation in almost any real-life context. Strings are way more complex than people give them credit for.
I'm working on a programming language sort of like K or APL. Among other things it has "good" Unicode support (I'm not fully convinced such a thing is even possible). It introduces a truly frustrating amount of complexity, especially in comparison to how simple ASCII strings are in K or APL (just an array of characters, so you get all the powerful array operations for free as shown in this post).
The approach I've more or less settled on is that strings are not arrays at all, but it's possible to get arrays of codepoints (composed or decomposed) or grapheme clusters from them. Really this just pushes complexity to the caller, but it means you get more control over what exactly you intend to work with. Pretty much all the solutions here could be straightforwardly translated to my programming language, except you would need to choose what definition of "character" you're going with.
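To make that concrete (in Python rather than my language, and just as a rough sketch), the three views of the same string might look like this, using unicodedata for composed/decomposed codepoints and the third-party regex module for grapheme clusters:

    import unicodedata
    import regex  # third-party: pip install regex (re itself doesn't support \X)

    s = "e\u0301a"  # "éa" written as e + combining acute accent + a

    composed = list(unicodedata.normalize("NFC", s))    # ['é', 'a']
    decomposed = list(unicodedata.normalize("NFD", s))  # ['e', '\u0301', 'a']
    graphemes = regex.findall(r"\X", s)                 # ['é', 'a'] as clusters, however the input was normalized

    print(composed, decomposed, graphemes)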
I basically like this solution, but the deeper I get the more I become convinced that manipulating an arbitrary user-provided or even user-facing string in any way is a recipe for disaster; there's often simply no sensible way to handle things like directional formatting control codes (or bidirectional text in general).
In other words, I almost think there should be separate types for "computer-facing" strings (like filenames) that one commonly has to manipulate, and "human-facing" strings that one just wants to either display or store somewhere and ideally never touch because you will screw up somewhere.
> In other words, I almost think there should be separate types for "computer-facing" strings (like filenames) that one commonly has to manipulate, and "human-facing" strings that one just wants to either display or store somewhere and ideally never touch because you will screw up somewhere.
That's a fascinating idea, thanks. I imagine it's impractical due to numerous places where machine and human strings overlap, but I'm going to have to ruminate on it for a while.
The GFile API in GTK+'s underlying GLib family of libraries kind of has this, at least for your example of filenames. It separates the "actual" name of a file from its display name, and has APIs to support that division.
From the documentation [1]:
All GFiles have a basename (get with g_file_get_basename()). These names are byte strings that are used to identify the file on the filesystem (relative to its parent directory) and there is no guarantees that they have any particular charset encoding or even make any sense at all. If you want to use filenames in a user interface you should use the display name that you can get by requesting the G_FILE_ATTRIBUTE_STANDARD_DISPLAY_NAME attribute with g_file_query_info(). This is guaranteed to be in UTF-8 and can be used in a user interface. But always store the real basename or the GFile to use to actually access the file, because there is no way to go from a display name to the actual name.
This makes implementing things that deal with filenames a lot (like, cough, a file manager) quite interesting.
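If you want to see what that division looks like in code, here's a rough sketch using the Python (PyGObject) bindings for Gio; the path is just a placeholder, and the file has to exist for query_info to succeed:

    import gi
    gi.require_version("Gio", "2.0")
    from gi.repository import Gio

    f = Gio.File.new_for_path("/tmp/example-file")  # hypothetical path

    # The on-disk basename: no guaranteed encoding, only useful for accessing the file
    print(f.get_basename())

    # The display name (G_FILE_ATTRIBUTE_STANDARD_DISPLAY_NAME): guaranteed UTF-8, for UIs only
    info = f.query_info("standard::display-name", Gio.FileQueryInfoFlags.NONE, None)
    print(info.get_display_name())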
It's definitely pretty leaky (just prompting the user for a filename already breaks it; you either leak the rules for machine strings to the user or you pray to god they don't enter something too weird).
However, the problem with interaction is really only one-way, since machine strings can be safely promoted to human strings. Unfortunately, even concatenating Unicode strings is not necessarily straightforward; for example, in A+B where A has a right-to-left embedding in it but you want B to display left-to-right, you need to surround B with a left-to-right embedding and a pop-directional-formatting code (this actually solves any issues I can think of off the top of my head, but no one does this).
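A quick sketch of that wrapping in Python (the example strings are made up; U+202A is LEFT-TO-RIGHT EMBEDDING and U+202C is POP DIRECTIONAL FORMATTING):

    LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
    PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

    a = "\u05e9\u05dc\u05d5\u05dd "  # Hebrew text, so A contains right-to-left runs
    b = "world"

    # Wrap B in an explicit left-to-right embedding so A's direction can't leak into it
    result = a + LRE + b + PDF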
The really bad problems involve parsing user-provided non-programmer-oriented text (e.g. markdown). I really don't know if there's a robust way to do that.
Hmm. Surely in order to "display" you need to "manipulate" (someone has to write the text rendering library, although ideally only once). And the simple action of "enter filename" crosses both those domains.
Then there's the whole classic domain of parsers and lexers as the front end to programming languages. I appreciate this gets upsettingly difficult if it's also a security boundary, where things like invisible spaces are a threat, but it remains important.
Maybe what we need is to go the other way from asking interns to reverse strings, and ask library writers to provide some slightly higher-level functions that don't rely on regex. Perhaps LINQ for strings? Most languages give you "split", which is the very beginning of a tokeniser, but we need something a bit more powerful.
A good test case might be writing the notoriously difficult "do these two URLs refer to the same resource?" program.
No; symbols are opaque; to manipulate them at all you have to convert them to strings. (For example you can't concatenate or reverse :foo and :bar or iterate over their characters.)
Ah, I see what you mean now. I was taking "computer facing" for strings you don't have to manipulate often.
Though, I am confused, how often are you manipulating a filename? Even strings, in general. Parse? Sure. Manipulate? Seems uncommon.
Formatting, I can grant. But that is different from manipulating a string; it's more building a new string up from others. And, outside of madlibs-style templates, there's not much more you can hope for. You can get a surprising amount of distance out of madlibs, I suppose.
(I consider "formatting" under "manipulating". Parsing too, actually. Really anything where you have to consider the parts of the string individually and not just the whole.)
By and large I think you're right; as the root post said you pretty much never actually have to reverse a string.
Specifically though I think manipulating filenames is not really unusual; for instance adding or reading a file extension, filename<number>.<ext>, splitting into directory paths, etc. Filenames are actually an interesting case because they're a very good candidate for having their own type (since they have some internal structure to them and operations on that structure).
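For what it's worth, those operations map pretty cleanly onto a dedicated path type; here's a quick Python sketch with pathlib (the filenames are made up):

    from pathlib import Path

    p = Path("reports/summary.txt")

    p.suffix                              # '.txt'                (read the extension)
    p.with_suffix(".bak")                 # reports/summary.bak   (change the extension)
    p.with_name(f"{p.stem}2{p.suffix}")   # reports/summary2.txt  (filename<number>.<ext>)
    p.parts                               # ('reports', 'summary.txt')  (split into components)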
Which leads me to another hypothesis, which is that the number of times you have to do something to truly unstructured data other than compare it is really low, and we actually don't need a generic "string" type at all. Unfortunately this is not really feasible in a world where the Unix principle of "everything is a stream of bytes" is ubiquitous.
I was reading that as a distinction between recognising parts of a string and changing parts, such that I can't remember the last time I modified a string. I can think of plenty of times I took one apart and used a piece of it. Usually it was well structured to avoid the corner cases. Or, again, forced into a madlibs-style structure to present to the user.
But yeah, looking for palindromes is a thing I can't recall ever having done. A prefix tree for searches? Done, but that didn't need well formed text/strings for it to work.
And searches don't even need Unicode to get difficult. Consider: to, two, too, 2, and II. Should those all find each other? Highly dependent on context. And likely you will be reimplementing NLP before you realize it.
Computers should be seen as machines for manipulating symbols, not as machines for manipulating a small number of fundamental data types into which symbols are squeezed, usually with associated wreckage.
A file path, a shell command, a time/date, a URL, on-screen text in a browser, on-screen text in a text editor, and code in an editing window are all completely different data types. They can be implemented as char arrays - usually poorly - but that doesn't mean they're smoothly interchangeable with clear interfaces.
So in reality they're neither abstracted nor standardised nor designed properly, and the result is a lot of pain and confusion, because developers default to "So this is a string..." instead of thinking of them as separate types implementing distinct abstractions with hugely different requirements.
It's very much a "this is productive for me and fits how I think, maybe to the exclusion of making sense to other people" project.
It's closest to J or Dyalog APL but with a much different syntax that aims to make it more natural to write entirely pointfree code. (Personally, I feel like long trains can become kind of hard to read and refactor in J. The fork and hook syntax is really nice for short trains but IMO does not scale up very well.) I've been fiddling with the syntax on and off for about 5 years and use it as a sort of general notation for algorithms.
There's actually a fair bit of Erlang in there as well, which is honestly kind of an odd combination, but actors+arrays hits a sort of local optimum for me. It might be the first APL-like with really good I/O.
I wouldn't call it dangerous, just meaningless. A lot of the discussion in this thread dives deeply into the technicalities of Unicode. All of that is correct, but a bit beside the point IMO.
I think the real issue is that “reversing text” is fundamentally an operation that only makes sense in Western alphabets — the complexities of Unicode just reflect that. This is a bit hard to fathom for people who grew up with such an alphabet (this includes me), where there’s a rich tradition of games based on shuffling letters. It’s not just reversing; my understanding is that you can’t meaningfully define the concept of an anagram in many writing systems.
It's especially ironic in the context of an interview, when the question is something like "write a function to reverse the order of all words in the given string".
So I'm thinking to myself: "Do you not even comprehend how hard this is to actually do, or do you know it and are checking whether I know it?" And then I need to ask a lot of probing questions just to learn that the string is ASCII and the separator is a space.
"lol :Man facepalming: :medium light skintone:" becomes the skintone applying to nothing (which might crash?) and the wrong coloured man. (e+accent a) making éa becomes (a+accent e) incorrectly making áe - or possibly invalidly making an error combination. Right-to-left markers[1] and left-to-right markers will change which sections of the text are reversed unless you swap them over.
Codepoints can combine more than once, to the point where, if you're being really nitpicky, you can't validly take a substring either; you can only read a string from the first codepoint onwards. They could become invalid sequences if reversed, possibly?
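If you want to see both failure modes for yourself, a quick Python sketch (naive codepoint reversal, with illustrative strings):

    s = "e\u0301a"           # "éa": e + combining acute accent + a
    print(s[::-1])           # "a\u0301e" renders as "áe": the accent now lands on the a

    emoji = "lol \U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # "lol 🤦🏼‍♂️"
    print(emoji[::-1])       # variation selector, ♂, ZWJ, skin tone, 🤦 ... then "lol" backwards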
Agree. I think reversing in non-ascii should always be thought of as "per-token", where English is character-as-token. So the reverse of what you gave would be:
":medium light skintone: :Man facepalming: lol"
(with the lol reversed). Framed this way, it is a much harder problem than, say, Python's mystring[::-1]. In other words, "reverse a string" is a different problem from "reverse an array".
Accented characters would be kept as is in my scenario.
The "tokens" you're thinking of are "grapheme clusters" in Unicode.
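If you only care about the grapheme-cluster part (not the directional formatting codes, more on those below), something like this sketch works in Python with the third-party regex module:

    import regex  # third-party module; the standard re module doesn't support \X

    def reverse_graphemes(s):
        # Split into extended grapheme clusters, then reverse their order
        return "".join(reversed(regex.findall(r"\X", s)))

    reverse_graphemes("lol \U0001F926\U0001F3FC\u200D\u2642\uFE0F")
    # the emoji should stay intact (with a reasonably recent regex/Unicode version)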
Unfortunately just reversing by grapheme clusters doesn't solve the problem because of directional formatting codes; if you have e.g. a right-to-left embedding followed by a pop directional formatting you can't naively reverse them.
1/ Treating a string as an array of bytes will give an invalid result unless the string is simple ASCII (or an equivalent encoding where each byte has a clearly defined standalone meaning); in particular, just reversing a UTF-8 string in this way will give an invalid answer - i.e. a string that isn't even valid UTF-8.
2/ The fix for (1) is to convert your string into an array of Unicode code points and reverse that … except that is also broken, because combining characters will now not associate correctly, as per other answers in this thread.
Coding your way out of problem (2) in a robust and sensible way is, I suggest, a significant challenge.
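A minimal demonstration of (1) in Python (the string itself is arbitrary):

    s = "héllo"
    rev = s.encode("utf-8")[::-1]   # reverse the raw UTF-8 bytes
    rev.decode("utf-8")             # raises UnicodeDecodeError: the two-byte sequence for é is now backwards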
Of all things, VB.Net actually has a string reverse that handles unicode cases. I'm not quite sure how to view it in referencesource, but it is on Github. [1] The result is less than 100 lines of VB so I don't think it's -that- hard. There's certainly some clever index manipulation going on but nothing that looks too crazy.
An array of characters reversed loses the original meaning of the characters. Other than looking for palindromes, there's almost nothing in the way of practical uses of a reversed string that can't be accomplished by subscripting the string and iterating from the end to the beginning.
Iterating from back to front doesn't work either; you probably (depending on what you're doing) still have to segment into grapheme clusters, which is stateful as of Unicode 9, so you have to start from the beginning of the string. And then god forbid you get U+202C POP DIRECTIONAL FORMATTING....
Interesting, but questionably useful. If you're converting integer bases or obfuscating email addresses, there are much better approaches that don't involve a bad hack.
I was thinking I really need to research how to efficiently deal with strings in a certain language that doesn't allow them to be accessed as arrays of characters. Because then I could get started on porting some C code to it.
TLDR: you can straightforwardly access them as bytes, or iterate over "Unicode scalar values" as stored in the 4-byte "char" type.
(But a "Unicode scalar value" is not exactly equal to the colloquial meaning of "character".)
However: If you care about valid UTF-8 inputs not breaking the function or intent of your code, it's probably time to understand the problems Unicode is actually trying to address, and apply those considerations to what you're trying to do. :)
I hate string processing. Is there any widely used programming language that does not support strings? I mean, you can do all of math, scientific computing, and graphics without strings. Forcing your language to support strings will inevitably impose compromises that make the language worse when you do not need them. I want such a language, unencumbered by the intrinsic ugliness of strings.
Futhark has only the most basic string support -- string literal syntax, which is just sugar for a UTF-8 byte array. It's a functional array language that generates fast GPU code, and as a GPU program is essentially a pure function (no input/output), string handling is very much a secondary concern. Futhark is used exactly for math, scientific computing, and graphics, and not for strings.
Well, C++ doesn't really have too much support for strings, except the "" syntax for constant ASCII strings. Everything else is library based, and lots of people apparently don't use std::string and friends. However, I can't really think of many programming languages where strings and string processing affect the syntax or semantics of any other aspect of the language (there are Tcl, Perl and most shell languages, but definitely not C, C++, Java, C#, any flavor of LISP, Haskell, *ML and most others).
That said, I don't really think you can use a programming system that doesn't support strings: at some point you have to communicate something to a human, and at that point you need text.
And traditionally, math is one of the places where people invented all sorts of creative writing systems. Also, you have symbolic math, where even the core of the solver needs at least some basic string support (e.g. solve for x in "x + c = 0" => solution is -c).
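As a tiny illustration of that last point, using SymPy as just one possible symbolic back end, even stating the problem means parsing text:

    from sympy import Symbol, solve, sympify

    expr = sympify("x + c")     # the "equation" arrives as a string and has to be parsed
    solve(expr, Symbol("x"))    # [-c]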