So in your one, specific, performance-limited situation, Python 3's implementati...

fauigerzigerk · on Nov 27, 2013

>I don't see how this equates to a general purpose language failing at strings

And I don't see where I said it did.

I used to favor a dual Python/C++ strategy, but Python's multithreading limitations and the decisions around unicode have convinced me to move on. It's not like anything has gotten worse in Python 3, it's just that there has been a major change and the opportunity to do the right thing was missed.

I happen to think that UTF-8 everywhere is the right way to go, not just for my particular requirements, but for all applications, because it reduces overall complexity.

berdario · on Nov 27, 2013

I strongly disagree

and I'd like to know what you think the "right thing" would be

I agree that only using UTF-8 would be the right thing, but only if you don't want to have "array of codepoints"... the problem is: every language, and every developer expects to be able to have random access to codepoints in their strings...

there're some weird exceptions, like Haskell's Data.Text (I think that's due to Haskell's laziness)

would you prefer to have O(n) indexing and slicing of strings... or would you prefer to get rid of these operations altogether?

if the latter, what would you prefer to do? force the developers to use .find() and handle such things manually... or create some compatibility string type restricted to non-composable codepoints?

Getting an implementation out to see it used in the wild might be an interesting endeavor... probably it'd be easier to do in a language that allows you to customize its reader/parser... like some Lisp... Clojure

fauigerzigerk · on Nov 27, 2013

>I agree that only using UTF-8 would be the right thing, but only if you don't want to have "array of codepoints"

Then we agree entirely. I want all strings to be UTF-8. Period. What I said about an array of codepoints was that I would create one separately from the string if I ever had a requirement to access individual code point positions repeatedly in a tight loop.

>the problem is: every language, and every developer expects to be able to have random access to codepoints in their strings

If by random access you mean constant time access then those developers would be very disappointed to learn that they cannot do that in Java, C#, C++, JavaScript or Python, unless they happen to know that their string cannot possibly contain any characters outside the ASCII or BMP range.
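As a rough illustration of the point about the BMP, the sketch below counts UTF-16 code units, the unit that Java, C# and JavaScript strings index by. The helper name `utf16_units` is made up for this example; it only shows that code points and code units stop lining up once a character outside the BMP appears.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp = "h\u00e9llo"          # every code point is inside the BMP
astral = "h\U0001F600llo"   # U+1F600 lies outside the BMP

print(len(bmp), utf16_units(bmp))        # 5 5
print(len(astral), utf16_units(astral))  # 5 6
```

With the second string, index 2 in UTF-16 code units lands in the middle of a surrogate pair rather than on the first "l".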

>would you prefer to have O(n) indexing and slicing of strings

I would leave indexing/slicing operators in place and make sure everyone knows that it works with bytes not codepoints. In addition to that I would provide an O(n) function to access the nth codepoint as part of the standard library.
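A minimal sketch of what that proposal could look like, using Python `bytes` as a stand-in for a UTF-8 string type. The function name `nth_codepoint` is hypothetical and not part of any standard library; it assumes the input is valid UTF-8 and scans from the start, so it is O(n).

```python
def nth_codepoint(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of valid UTF-8 `data`."""
    if n < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1
    start = None
    for i, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts a code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")

text = "a\u20ac\U0001F600".encode("utf-8")  # 1-byte, 3-byte and 4-byte code points
print(len(text))               # 8 (bytes, which is what plain indexing would see)
print(nth_codepoint(text, 1))  # the euro sign, found by linear scan
```

Plain `text[1:4]` still slices bytes in constant time; it is up to the caller to pick offsets that fall on code point boundaries.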

berdario · on Nov 27, 2013

> If by random access you mean constant time access then those developers would be very disappointed to learn that they cannot do that in Java, C#, C++, JavaScript or Python, unless they happen to know that their string cannot possibly contain any characters outside the ASCII or BMP range.

Actually, you can in Python... and obviously most developers ignore such issues [citation needed]

My point is that most developers don't know these details, a lot of idioms are ingrained... getting them to work with string types properly won't be easy (but a good stdlib would obviously help immensely in this regard)

> I would leave indexing/slicing operators in place and make sure everyone knows that it works with bytes not codepoints. In addition to that I would provide an O(n) function to access the nth codepoint as part of the standard library.

Ok, so with your proposal a hypothetical slicing method on a String class in a Java-like language would have this signature?

byte[] slice(int start, int end);

I've been fancying the idea of writing a custom String type/protocol for Clojure that deals with the shortcomings of Java's strings... I'll probably have a try with your idea as well :)

masklinn · on Nov 27, 2013

> Actually, you can in Python...

No, you can only get random access on codepoints, which will break text as soon as combining characters are involved. That holds even if you normalize everything beforehand (which most people don't do), as not all possible combinations have precomposed forms.

Unicode makes random access useless at anything other than destroying text.
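A small sketch of the combining-character problem, using only the standard `unicodedata` module. The two sample strings are chosen for illustration: "e" plus a combining acute accent has a precomposed form, while "q" plus a combining dot above does not.

```python
import unicodedata

e_acute = "e\u0301"  # "e" + COMBINING ACUTE ACCENT, precomposed form is U+00E9
q_dot = "q\u0307"    # "q" + COMBINING DOT ABOVE, no precomposed form exists

# NFC normalization folds the first pair into one code point but not the second.
print(len(unicodedata.normalize("NFC", e_acute)))  # 1
print(len(unicodedata.normalize("NFC", q_dot)))    # 2

# Indexing by code point separates the base letter from its mark.
base, mark = q_dot[0], q_dot[1]
print(base == "q")                        # True
print(unicodedata.combining(mark) != 0)   # True, the mark is a combining character
```

So even on normalized input, `s[i]` can hand back half of what a reader would call one character.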

> but a good stdlib would obviously help immensely in this regard

Which is extremely rare, and which Python does not have.

fauigerzigerk · on Nov 27, 2013

>Actually, you can in Python

You are right (apart from combining characters as masklinn explained), but as I said, that's only possible if an array of 32-bit ints is used to hold string data or if it can be guaranteed that there are no characters from outside ASCII or BMP. If I understand PEP 393 correctly, what Python 3.3 does is to use 32-bit ints to hold the entire string if even one such code point occurs. So if you load a (possibly large) text file into a string and one such code point exists then the file's size is going to quadruple in memory. All of that is done just to implement one very rare operation efficiently. http://www.python.org/dev/peps/pep-0393/#new-api
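One way to observe this, assuming CPython 3.3 or later (the numbers are an implementation detail, not a language guarantee): measure how much `sys.getsizeof` grows per extra ASCII character when a single wider code point is also present. The helper `bytes_per_char` is made up for this sketch.

```python
import sys

def bytes_per_char(suffix: str) -> float:
    """Per-character storage of an ASCII body when `suffix` is part of the string."""
    small = "a" * 1000 + suffix
    large = "a" * 2000 + suffix
    # The fixed header cancels out, leaving the cost of 1000 extra characters.
    return (sys.getsizeof(large) - sys.getsizeof(small)) / 1000

print(bytes_per_char(""))            # 1.0  pure ASCII
print(bytes_per_char("\u20ac"))      # 2.0  one BMP code point above Latin-1
print(bytes_per_char("\U0001F600"))  # 4.0  one code point outside the BMP
```

A single non-BMP code point is enough to move every character in the string to four bytes.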

judk · on Nov 28, 2013

Sounds like you want to use Go. It feels like Python, but with technically correct implementations of concepts.