I wonder how many of the solutions would still work for arbitrary unicode strings. It looks like the author assumes only ascii exists. Perhaps the interns would be better off if they were taught that "reverse a string" is a fundamentally dangerous (and pointless) operation in almost any real-life context. Strings are way more complex than people give them credit for.
I'm working on a programming language sort of like K or APL. Among other things it has "good" Unicode support (I'm not fully convinced such a thing is even possible). It introduces a truly frustrating amount of complexity, especially in comparison to how simple ASCII strings are in K or APL (just an array of characters, so you get all the powerful array operations for free as shown in this post).
The approach I've more or less settled on is that strings are not arrays at all, but it's possible to get arrays of codepoints (composed or decomposed) or grapheme clusters from them. Really this just pushes complexity to the caller, but it means you get more control over what exactly you intend to work with. Pretty much all the solutions here could be straightforwardly translated to my programming language, except you would need to choose what definition of "character" you're going with.
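To make that concrete (in Python rather than my language, and just as a rough sketch), the three views of the same string might look like this, using unicodedata for composed/decomposed codepoints and the third-party regex module for grapheme clusters:

    import unicodedata
    import regex  # third-party: pip install regex (re itself doesn't support \X)

    s = "e\u0301a"  # "éa" written as e + combining acute accent + a

    composed = list(unicodedata.normalize("NFC", s))    # ['é', 'a']
    decomposed = list(unicodedata.normalize("NFD", s))  # ['e', '\u0301', 'a']
    graphemes = regex.findall(r"\X", s)                 # ['é', 'a'] as clusters, however the input was normalized

    print(composed, decomposed, graphemes)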
I basically like this solution, but the deeper I get the more I become convinced that manipulating an arbitrary user-provided or even user-facing string in any way is a recipe for disaster; there's often simply no sensible way to handle things like directional formatting control codes (or bidirectional text in general).
In other words, I almost think there should be separate types for "computer-facing" strings (like filenames) that one commonly has to manipulate, and "human-facing" strings that one just wants to either display or store somewhere and ideally never touch because you will screw up somewhere.
> In other words, I almost think there should be separate types for "computer-facing" strings (like filenames) that one commonly has to manipulate, and "human-facing" strings that one just wants to either display or store somewhere and ideally never touch because you will screw up somewhere.
That's a fascinating idea, thanks. I imagine it's impractical due to numerous places where machine and human strings overlap, but I'm going to have to ruminate on it for a while.
The GFile API in GTK+'s underlying GLib family of libraries kind of has this, at least for your example of filenames. It separates the "actual" name of a file from its display name, and has APIs to support that division.
From the documentation [1]:
All GFiles have a basename (get with g_file_get_basename()). These names are byte strings that are used to identify the file on the filesystem (relative to its parent directory) and there is no guarantees that they have any particular charset encoding or even make any sense at all. If you want to use filenames in a user interface you should use the display name that you can get by requesting the G_FILE_ATTRIBUTE_STANDARD_DISPLAY_NAME attribute with g_file_query_info(). This is guaranteed to be in UTF-8 and can be used in a user interface. But always store the real basename or the GFile to use to actually access the file, because there is no way to go from a display name to the actual name.
This makes implementing things that deal with filenames a lot (like, cough, a file manager) quite interesting.
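If you want to see what that division looks like in code, here's a rough sketch using the Python (PyGObject) bindings for Gio; the path is just a placeholder, and the file has to exist for query_info to succeed:

    import gi
    gi.require_version("Gio", "2.0")
    from gi.repository import Gio

    f = Gio.File.new_for_path("/tmp/example-file")  # hypothetical path

    # The on-disk basename: no guaranteed encoding, only useful for accessing the file
    print(f.get_basename())

    # The display name (G_FILE_ATTRIBUTE_STANDARD_DISPLAY_NAME): guaranteed UTF-8, for UIs only
    info = f.query_info("standard::display-name", Gio.FileQueryInfoFlags.NONE, None)
    print(info.get_display_name())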
It's definitely pretty leaky (just prompting the user for a filename already breaks it; you either leak the rules for machine strings to the user or you pray to god they don't enter something too weird).
However, the problem with interaction is really only one-way, since machine strings can be safely promoted to human strings. Unfortunately, even concatenating Unicode strings is not necessarily straightforward; for example, in A+B where A has a right-to-left embedding in it but you want B to display left-to-right, you need to surround B with a left-to-right embedding and a pop-directional-formatting code (this actually solves any issues I can think of off the top of my head, but no one does this).
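A quick sketch of that wrapping in Python (the example strings are made up; U+202A is LEFT-TO-RIGHT EMBEDDING and U+202C is POP DIRECTIONAL FORMATTING):

    LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
    PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

    a = "\u05e9\u05dc\u05d5\u05dd "  # Hebrew text, so A contains right-to-left runs
    b = "world"

    # Wrap B in an explicit left-to-right embedding so A's direction can't leak into it
    result = a + LRE + b + PDF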
The really bad problems involve parsing user-provided non-programmer-oriented text (e.g. markdown). I really don't know if there's a robust way to do that.
Hmm. Surely in order to "display" you need to "manipulate" (someone has to write the text rendering library, although ideally only once). And the simple action of "enter filename" crosses both those domains.
Then there's the whole classic domain of parsers and lexers as the front end to programming languages. I appreciate this gets upsettingly difficult if it's also a security boundary, where things like invisible spaces are a threat, but it remains important.
Maybe what we need is to go the other way from asking interns to reverse strings, and ask library writers to provide some slightly higher-level functions that don't rely on regex. Perhaps LINQ for strings? Most languages give you "split", which is the very beginning of a tokeniser, but we need something a bit more powerful.
A good test case might be writing the notoriously difficult "do these two URLs refer to the same resource?" program.
No; symbols are opaque; to manipulate them at all you have to convert them to strings. (For example you can't concatenate or reverse :foo and :bar or iterate over their characters.)
Ah, I see what you mean now. I was taking "computer facing" for strings you don't have to manipulate often.
Though, I am confused, how often are you manipulating a filename? Even strings, in general. Parse? Sure. Manipulate? Seems uncommon.
Formatting, I can grant. But that is different from manipulating a string; it's more building a new string up from others. And, outside of madlibs-style templates, there's not much more you can hope for. You can get a surprising amount of distance out of madlibs, I suppose.
(I consider "formatting" under "manipulating". Parsing too, actually. Really anything where you have to consider the parts of the string individually and not just the whole.)
By and large I think you're right; as the root post said you pretty much never actually have to reverse a string.
Specifically though I think manipulating filenames is not really unusual; for instance adding or reading a file extension, filename<number>.<ext>, splitting into directory paths, etc. Filenames are actually an interesting case because they're a very good candidate for having their own type (since they have some internal structure to them and operations on that structure).
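For what it's worth, those operations map pretty cleanly onto a dedicated path type; here's a quick Python sketch with pathlib (the filenames are made up):

    from pathlib import Path

    p = Path("reports/summary.txt")

    p.suffix                              # '.txt'                (read the extension)
    p.with_suffix(".bak")                 # reports/summary.bak   (change the extension)
    p.with_name(f"{p.stem}2{p.suffix}")   # reports/summary2.txt  (filename<number>.<ext>)
    p.parts                               # ('reports', 'summary.txt')  (split into components)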
Which leads me to another hypothesis, which is that the number of times you have to do something to truly unstructured data other than compare it is really low, and we actually don't need a generic "string" type at all. Unfortunately this is not really feasible in a world where the Unix principle of "everything is a stream of bytes" is ubiquitous.
I was reading that as a distinction between recognising parts of a string and changing parts, such that I can't remember the last time I modified a string. I can think of plenty of times I took one apart and used a piece of it. Usually it was well structured to avoid the corner cases. Or, again, forced into a madlibs-style structure to present to the user.
But yeah, looking for palindromes is a thing I can't recall ever having done. A prefix tree for searches? Done, but that didn't need well formed text/strings for it to work.
And searches don't even need Unicode to get difficult. Consider: to, two, too, 2, and II. Should those all find each other? Highly dependent on context. And likely you will be reimplementing NLP before you realize it.
Computers should be seen as machines for manipulating symbols, not as machines for manipulating a small number of fundamental data types into which symbols are squeezed, usually with associated wreckage.
A file path, a shell command, a time/date, a URL, on-screen text in a browser, on-screen text in a text editor, and code in an editing window are all completely different data types. They can be implemented as char arrays - usually poorly - but that doesn't mean they're smoothly interchangeable with clear interfaces.
So in reality they're neither abstracted nor standardised nor designed properly, and the result is a lot of pain and confusion, because developers default to "So this is a string..." instead of thinking of them as separate types implementing distinct abstractions with hugely different requirements.
It's very much a "this is productive for me and fits how I think, maybe to the exclusion of making sense to other people" project.
It's closest to J or Dyalog APL but with a much different syntax that aims to make it more natural to write entirely pointfree code. (Personally, I feel like long trains can become kind of hard to read and refactor in J. The fork and hook syntax is really nice for short trains but IMO does not scale up very well.) I've been fiddling with the syntax on and off for about 5 years and use it as a sort of general notation for algorithms.
There's actually a fair bit of Erlang in there as well, which is honestly kind of an odd combination, but actors+arrays hits a sort of local optimum for me. It might be the first APL-like with really good I/O.
I wouldn't call it dangerous, just meaningless. A lot of the discussion in this thread dives deeply into the technicalities of Unicode. All of that is correct, but a bit beside the point IMO.
I think the real issue is that “reversing text” is fundamentally an operation that only makes sense in Western alphabets — the complexities of Unicode just reflect that. This is a bit hard to fathom for people who grew up with such an alphabet (this includes me), where there’s a rich tradition of games based on shuffling letters. It’s not just reversing; my understanding is that you can’t meaningfully define the concept of an anagram in many writing systems.
It's especially ironic in the context of an interview, when the question is something like "write a function to reverse the order of all words in the given string".
So I'm thinking to myself: "Do you not even comprehend how hard this is to actually do, or do you know it and are checking whether I know it?" And then I need to ask a lot of probing questions just to learn that the string is ASCII and the separator is a space.
"lol :Man facepalming: :medium light skintone:" becomes the skintone applying to nothing (which might crash?) and the wrong coloured man. (e+accent a) making éa becomes (a+accent e) incorrectly making áe - or possibly invalidly making an error combination. Right-to-left markers[1] and left-to-right markers will change which sections of the text are reversed unless you swap them over.
Codepoints can combine more than once, to the point where, if you're being really nitpicky, you can't validly take a substring either; you can only read a string from the first codepoint onwards. They could become invalid sequences if reversed, possibly?
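If you want to see both failure modes for yourself, a quick Python sketch (naive codepoint reversal, with illustrative strings):

    s = "e\u0301a"           # "éa": e + combining acute accent + a
    print(s[::-1])           # "a\u0301e" renders as "áe": the accent now lands on the a

    emoji = "lol \U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # "lol 🤦🏼‍♂️"
    print(emoji[::-1])       # variation selector, ♂, ZWJ, skin tone, 🤦 ... then "lol" backwards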
Agree. I think reversing in non-ascii should always be thought of as "per-token", where English is character-as-token. So the reverse of what you gave would be:
":medium light skintone: :Man facepalming: lol"
(with the lol reversed). Framed this way, it is a much harder problem than, say, Python's mystring[::-1]. In other words, "reverse a string" is a different problem from "reverse an array".
Accented characters would be kept as is in my scenario.
The "tokens" you're thinking of are "grapheme clusters" in Unicode.
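If you only care about the grapheme-cluster part (not the directional formatting codes, more on those below), something like this sketch works in Python with the third-party regex module:

    import regex  # third-party module; the standard re module doesn't support \X

    def reverse_graphemes(s):
        # Split into extended grapheme clusters, then reverse their order
        return "".join(reversed(regex.findall(r"\X", s)))

    reverse_graphemes("lol \U0001F926\U0001F3FC\u200D\u2642\uFE0F")
    # the emoji should stay intact (with a reasonably recent regex/Unicode version)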
Unfortunately just reversing by grapheme clusters doesn't solve the problem because of directional formatting codes; if you have e.g. a right-to-left embedding followed by a pop directional formatting you can't naively reverse them.
1/ Treating a string as an array of bytes will give an invalid result unless the string is simple ASCII (or an equivalent encoding where each byte has a clearly defined standalone meaning); in particular, just reversing a UTF-8 string in this way will give an invalid answer - i.e. a string that isn't even valid UTF-8.
2/ The fix for (1) is to convert your string into an array of Unicode code points and reverse that … except that is also broken, because combining characters will now not associate correctly, as per other answers in this thread.
Coding your way out of problem (2) in a robust and sensible way is, I suggest, a significant challenge.
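A minimal demonstration of (1) in Python (the string itself is arbitrary):

    s = "héllo"
    rev = s.encode("utf-8")[::-1]   # reverse the raw UTF-8 bytes
    rev.decode("utf-8")             # raises UnicodeDecodeError: the two-byte sequence for é is now backwards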
Of all things, VB.Net actually has a string reverse that handles unicode cases. I'm not quite sure how to view it in referencesource, but it is on Github. [1] The result is less than 100 lines of VB so I don't think it's -that- hard. There's certainly some clever index manipulation going on but nothing that looks too crazy.
An array of characters reversed loses the original meaning of the characters. Other than looking for palindromes, there's almost nothing in the way of practical uses of a reversed string that can't be accomplished by subscripting the string and iterating from the end to the beginning.
Iterating from back to front doesn't work either; you probably (depending on what you're doing) still have to segment into grapheme clusters, which is stateful as of Unicode 9, so you have to start from the beginning of the string. And then god forbid you get U+202C POP DIRECTIONAL FORMATTING....
Interesting, but questionably useful. If you're converting integer bases or obfuscating email addresses, there are much better approaches that don't involve a bad hack.
I was thinking I really need to research how to efficiently deal with strings in a certain language that doesn't allow them to be accessed as arrays of characters. Because then I could get started on porting some C code to it.
TLDR: you can straightforwardly access them as bytes, or iterate over "Unicode scalar values" as stored in the 4-byte "char" type.
(But a "Unicode scalar value" is not exactly equal to the colloquial meaning of "character".)
However: If you care about valid UTF-8 inputs not breaking the function or intent of your code, it's probably time to understand the problems Unicode is actually trying to address, and apply those considerations to what you're trying to do. :)
I hate string processing. Is there any widely used programming language that does not support strings? I mean, you can do all of math, scientific computing, and graphics without strings. Forcing your language to support strings will inevitably impose compromises that make the language worse when you do not need them. I want such a language, unencumbered by the intrinsic ugliness of strings.
Futhark has only the most basic string support -- string literal syntax, which is just sugar for a UTF-8 byte array. It's a functional array language that generates fast GPU code, and as a GPU program is essentially a pure function (no input/output), string handling is very much a secondary concern. Futhark is used exactly for math, scientific computing, and graphics, and not for strings.
Well, C++ doesn't really have too much support for strings, except the "" syntax for constant ASCII strings. Everything else is library based, and lots of people apparently don't use std::string and friends. However, I can't really think of many programming languages where strings and string processing affect the syntax or semantics of any other aspect of the language (there are Tcl, Perl and most shell languages, but definitely not C, C++, Java, C#, any flavor of LISP, Haskell, *ML and most others).
That said, I don't really think you can use a programming system that doesn't support strings: at some point you have to communicate something to a human, and at that point you need text.
And traditionally, math is one of the places where people invented all sorts of creative writing systems. Also, you have symbolic math, where even the core of the solver needs at least some basic string support (e.g. solve for x in "x + c = 0" => solution is -c).
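As a tiny illustration of that last point, using SymPy as just one possible symbolic back end, even stating the problem means parsing text:

    from sympy import Symbol, solve, sympify

    expr = sympify("x + c")     # the "equation" arrives as a string and has to be parsed
    solve(expr, Symbol("x"))    # [-c]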