Hacker News




Why is this an issue, out of interest? :)


It's not a huge issue, since there are other datatypes for dealing with strings efficiently. E.g., Haskell has Text and ByteString (both in lazy and strict versions) for exactly this reason. But it is widely considered a design mistake that the least efficient way of handling strings was made the default.
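To illustrate the difference, here is a minimal sketch using the `text` package that ships with GHC (the names `greeting`/`greetingT` are just for this example):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- The default String is just a linked list of boxed Chars:
--   type String = [Char]
greeting :: String
greeting = "Hello World!"

-- Data.Text stores the same characters in a packed array.
greetingT :: T.Text
greetingT = T.pack greeting

main :: IO ()
main = print (T.length greetingT)  -- prints 12
```

Both values print the same way; only the in-memory representation differs.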


"Hello World!" now takes 192 bytes in memory, instead of 12. Traversing (for printing, etc) is at least tens of times slower. Changing it to "Hello World?" is now 12 pointer chases rather than 1 assembly instruction. Etc.


It doesn't have to be an issue, but users should be aware that the default string type should be avoided in hot paths, which can be cumbersome. The reason it'll be slow is that iterating a linked list involves a lot of pointer chasing, which means bad behavior on x86 and worse behavior on CPUs that don't have a data prefetcher.


Random access is pretty common with strings, so using a datastructure that makes it expensive is problematic.


Exactly the opposite: sequential access is extremely common. If you're reading byte X, the odds that you're going to read X+1 are very good. So it makes most sense to pack the bytes tightly in memory in order, not splash them all over RAM and tie them together with pointers. The latter makes certain things easier, but never by enough on modern hardware to beat the array representation.


I’m not really talking about access as the computer sees it: most of the code I write that processes strings ends up doing things like splitting at new lines and stripping off the trailing space of each line. Or getting the length of the string or similar things that require efficient access to a semi-arbitrary offset into the string.

For the type of string processing code I find myself writing, linked lists are the wrong data structure because they tend to turn O(n) algorithms into O(n^2) ones: this can be mitigated by things like skip-lists, and is complicated by things like Unicode. But, personally, I tend to think that the biggest issue is not the performance impact of non-local storage of the string’s contents (which can be mitigated by techniques like cdr-coding). The problem is that string processing code, in my experience, tends to assume getting a predetermined offset into the string is relatively low cost.
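The line-splitting and trailing-space-stripping pattern described above maps directly onto the packed representation; a minimal sketch with `Data.Text` (the helper name `cleanLines` is mine, not from the thread):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Split into lines, then drop trailing whitespace from each.
-- Both operations slice contiguous buffers rather than
-- walking cons cells one character at a time.
cleanLines :: T.Text -> [T.Text]
cleanLines = map T.stripEnd . T.lines

main :: IO ()
main = mapM_ TIO.putStrLn (cleanLines "first line  \nsecond line\t\n")
```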


Is random access that common? I usually only iterate over strings (and random access is hard to do with utf-8 and friends).

I think the two biggest reasons to not use a list of characters to represent a string are memory efficiency (not storing a linked list pointer per character) and memory locality (keeping the characters of the string near each other in memory since they are frequently used as a unit).


Not exactly random access, but things like finding the last character, trimming trailing whitespace, etc. are all less efficient with linked lists because you have to copy the head of the list to change the tail.

As far as UTF-8 goes, it does add a whole layer of complexity that I had overlooked: although, I suspect intelligent data structure choices could mitigate the random access cost (something like an array of bytes plus an array of grapheme cluster offsets, or similar).

I personally have never found the memory locality concerns relevant to the vast majority of programs I write: when using a language like Python or Ruby or Haskell, you already have a lot of pointer chasing going on because of your non-string data structures such that the memory locality of strings is nearly irrelevant.
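The "array of bytes plus an array of offsets" idea above can be sketched for codepoints (rather than full grapheme clusters, for simplicity); the function name `codepointOffsets` is hypothetical, not an existing API:

```haskell
import Data.Bits ((.&.))
import qualified Data.ByteString as B

-- In UTF-8, every continuation byte has the bit pattern
-- 10xxxxxx, so any byte NOT of that form starts a codepoint.
-- Precomputing these offsets makes "codepoint at index i"
-- an O(1) lookup instead of an O(n) scan.
codepointOffsets :: B.ByteString -> [Int]
codepointOffsets bs =
  [ i | i <- [0 .. B.length bs - 1]
      , B.index bs i .&. 0xC0 /= 0x80 ]

main :: IO ()
main = print (codepointOffsets (B.pack [0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F]))
  -- "héllo" in UTF-8: codepoints start at bytes 0, 1, 3, 4, 5
```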


I think if your program is string heavy you would definitely notice. There is quite a difference between chasing one pointer per structure and one per character.


Unicode grapheme clusters (letters/ideograms plus accent marks) are inherently variable length, but most devs haven't read about how that works. We shouldn't be working with single codepoints or bytes for the same reason we don't work with US-ASCII one bit at a time: most slices of a bit string will be nonsense.


Treating strings as arrays of bytes is very reasonable. It is the assumption that each byte is exactly one character that causes trouble.
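The byte-versus-character mismatch is easy to demonstrate; a small sketch using the `text` and `bytestring` packages that ship with GHC:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  let s = T.pack "naïve"
  -- Five characters, but six bytes:
  -- 'ï' encodes as two bytes in UTF-8.
  print (T.length s)                  -- prints 5
  print (B.length (TE.encodeUtf8 s)) -- prints 6
```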


You should treat them as strings, the backing data store is an implementation detail. What do you want to do with individual bytes if you access them? Whatever you do is likely to be incorrect in some language, even English with emoji mixed in or Zalgo text.

This also lets you implement compressed small strings (e.g. in tagged pointers), which is a good memory optimization.


Linked lists are slower to iterate through than contiguous arrays, and are much slower at random access. Both are common operations for strings.



