So in your one, specific, performance-limited situation, Python 3's implementation of unicode doesn't work for you. Mostly because you are trying to optimize based on implementation details.
I don't see how this equates to a general purpose language failing at strings, especially when the language isn't particularly focused on performance and optimization. And if memory usage is of concern, I would certainly think anything like Python and Ruby would be out of the running?
>I don't see how this equates to a general purpose language failing at strings
And I don't see where I said it did.
I used to favor a dual Python/C++ strategy, but Python's multithreading limitations and the decisions around unicode have convinced me to move on. It's not like anything has gotten worse in Python 3, it's just that there has been a major change and the opportunity to do the right thing was missed.
I happen to think that UTF-8 everywhere is the right way to go, not just for my particular requirements, but for all applications, because it reduces overall complexity.
And I'd like to know what you think the "right thing" would be.
I agree that only using UTF-8 would be the right thing, but only if you don't want to have "array of codepoints"... the problem is: every language, and every developer expect to be able to have random access to codepoints in their strings...
There are some odd exceptions, like Haskell's Data.Text (I think that's due to Haskell's laziness).
Would you prefer to have O(n) indexing and slicing of strings... or would you prefer to get rid of these operations altogether?
If the latter, what would you prefer to do? Force developers to use .find() and handle such things manually... or create some compatibility string type restricted to non-composable codepoints?
Getting an implementation out to see it used in the wild might be an interesting endeavor... it would probably be easier to do in a language that allows you to customize its reader/parser... like some Lisp... Clojure
>I agree that only using UTF-8 would be the right thing, but only if you don't want to have "array of codepoints"
Then we agree entirely. I want all strings to be UTF-8. Period. What I said about an array of codepoints was that I would create one separately from the string if I ever had a requirement to access individual code point positions repeatedly in a tight loop.
>the problem is: every language, and every developer expect to be able to have random access to codepoints in their strings
If by random access you mean constant time access then those developers would be very disappointed to learn that they cannot do that in Java, C#, C++, JavaScript or Python, unless they happen to know that their string cannot possibly contain any characters outside the ASCII or BMP range.
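A small illustration of why that is, sketched in Python (the helper name `utf16_units` is made up for this example). It counts the 16-bit code units a string needs in UTF-16, which is the unit that Java, C# and JavaScript strings index by:

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    # "utf-16-le" writes no byte order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u00e9"         # LATIN SMALL LETTER E WITH ACUTE, inside the BMP
astral_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

units_bmp = utf16_units(bmp_char)        # one code unit
units_astral = utf16_units(astral_char)  # a surrogate pair, two code units
```

As soon as one such character appears, the index of a code unit and the index of a code point drift apart, so finding the nth code point means scanning from the start.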
>would you prefer to have O(n) indexing and slicing of strings
I would leave indexing/slicing operators in place and make sure everyone knows that it works with bytes not codepoints. In addition to that I would provide an O(n) function to access the nth codepoint as part of the standard library.
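Roughly what I have in mind, as a sketch rather than a real library design. Python's `bytes` type stands in for the UTF-8 string here, and `nth_codepoint` is a hypothetical name, not an existing standard library function:

```python
def nth_codepoint(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of valid UTF-8 bytes by linear scan."""
    if n < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    count = -1
    start = None
    for i, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a code point.
        if (byte & 0xC0) != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


text = "a\u00e9\U0001D11Ez".encode("utf-8")  # 1 + 2 + 4 + 1 = 8 bytes
third = nth_codepoint(text, 2)               # the G clef, found by scanning
byte_slice = text[1:3]                       # O(1) slice, but in byte offsets
```

The slice `text[1:3]` happens to cover exactly the two bytes of "é"; `text[1:2]` would cut it in half and no longer decode, which is the part everyone would have to be told about.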
> If by random access you mean constant time access then those developers would be very disappointed to learn that they cannot do that in Java, C#, C++, JavaScript or Python, unless they happen to know that their string cannot possibly contain any characters outside the ASCII or BMP range.
Actually, you can in Python... and obviously most developers ignore such issues [citation needed]
My point is that most developers don't know these details, and a lot of idioms are ingrained... getting them to work with string types properly won't be easy (but a good stdlib would obviously help immensely in this regard)
> I would leave indexing/slicing operators in place and make sure everyone knows that it works with bytes not codepoints. In addition to that I would provide an O(n) function to access the nth codepoint as part of the standard library.
Ok, so with your proposal a hypothetical slicing method on a String class in a Java-like language would have this signature?
byte[] slice(int start, int end);
I've been fancying the idea of writing a custom String type/protocol for Clojure that deals with the shortcomings of Java's strings... I'll probably have a try with your idea as well :)
No, you can only get random access on codepoints, which will break text as soon as combining characters are involved. That holds even if you normalize everything beforehand (which most people don't do), as not all possible combinations have precomposed forms.
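A quick Python sketch of the problem, using only the standard `unicodedata` module. The choice of "q" plus a combining acute accent is just an example of a pair that, as far as I know, has no precomposed form:

```python
import unicodedata

# "q" followed by U+0301 COMBINING ACUTE ACCENT renders as one character
# but is stored as two code points.
s = "q\u0301yz"
length = len(s)    # counts code points, not what the reader sees
first = s[0]       # the bare "q"; the accent is left behind
rest = s[1:]       # now starts with a dangling combining accent

# NFC normalization has nothing to compose this pair into.
normalized = unicodedata.normalize("NFC", s)

# By contrast, "e" + U+0301 does compose into U+00E9.
composed_e = unicodedata.normalize("NFC", "e\u0301")
```

So code point indexing splits the accented letter apart, and normalizing first only helps for the combinations that happen to have a precomposed code point.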
Unicode makes random access useless at anything other than destroying text.
> but a good stdlib would obviously help immensely in this regard
Which is extremely rare, and which Python does not have.
You are right (apart from combining characters as masklinn explained), but as I said, that's only possible if an array of 32 bit ints is used to hold string data or if it can be guaranteed that there are no characters from outside ASCII or BMP. If I understand PEP 393 correctly, what Python 3.3 does is to use 32 bit ints to hold the entire string if even one such code point occurs. So if you load a (possibly large) text file into a string and one such code point exists then the file's size is going to quadruple in memory. All of that is done just to implement one very rare operation efficiently.
http://www.python.org/dev/peps/pep-0393/#new-api
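This is easy enough to check, assuming CPython 3.3 or later (other implementations may store strings differently, and I am deliberately not quoting exact byte counts since they vary by build):

```python
import sys

ascii_text = "a" * 100000                # every code point fits in 1 byte
wide_text = ascii_text + "\U0001F600"    # one code point outside the BMP

size_ascii = sys.getsizeof(ascii_text)
size_wide = sys.getsizeof(wide_text)

# Under PEP 393 the second string is stored with 4 bytes per code point.
ratio = size_wide / size_ascii
print(size_ascii, size_wide, round(ratio, 2))
```

One extra character at the end and the whole string is stored in the widest representation.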