
> print("你好世界".length); # => 4

NO NO NO NO.

'length' is ... not right, and I don't want to know why it returns 4 for the above, because that's not right. If you want to provide a 'length' function for a Unicode string, you need to know what you are measuring: graphemes, codepoints, or bytes? Whichever you decide to use is inappropriate for 'length'.




Perl 6 has $x.bytes, $x.codes and $x.graphs, but no $x.length for this exact reason.


Interesting objection.

To me it is rather obvious that a text-oriented language would treat any "string" as a) an atomic string and b) a sequence of the "next-lower" logical unit. I do just now realize that "An English sentence.".length could by this reasoning return 3 or 4 (3 words, one punctuation mark...).


I'm glad it's so obvious to you. Here is a quick test:

What is "ö".length?

It's one grapheme, an o with a diaeresis. It's two codepoints, an o (0x006F) with a combining diaeresis (0x0308). It's several bytes, depending on encoding.
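For anyone who wants to check those three numbers, here is a rough Python sketch. It assumes the decomposed form described above (an o followed by U+0308). The standard library has no grapheme segmenter, so the grapheme count is only approximated by skipping code points whose general category is a Mark; that is not full Unicode text segmentation.

```python
import unicodedata

s = "o\u0308"  # 'o' followed by COMBINING DIAERESIS (assumed decomposed form)

codepoints = len(s)                       # Python str length counts code points
utf8_bytes = len(s.encode("utf-8"))       # byte count depends on the encoding
utf16_bytes = len(s.encode("utf-16-le"))  # little-endian, no byte order mark

# Crude grapheme estimate: count code points that are not combining marks.
# This is an approximation, not a real grapheme cluster algorithm.
graphemes = sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))

print(codepoints, utf8_bytes, utf16_bytes, graphemes)  # 2 3 4 1
```

Four different numbers for one visible letter, which is the point being made.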

How about if you reverse it first, so that the diaeresis doesn't have anything to combine with, and you have a bare letter 'o'? What's the length now? If you answered _one_ to the above, you've got a string whose length doubles when you reverse it. Is that what you want?
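The reversal problem can be sketched in Python too. This assumes "length" is defined as "code points after NFC normalization", which is just one plausible definition, and it uses a naive code-point reversal.

```python
import unicodedata

s = "o\u0308"          # 'o' + COMBINING DIAERESIS
reversed_s = s[::-1]   # naive reversal: the mark now comes first

# NFC can compose the original pair into the single code point U+00F6,
# but the reversed string has no base letter before the mark to compose with.
len_original = len(unicodedata.normalize("NFC", s))
len_reversed = len(unicodedata.normalize("NFC", reversed_s))

print(len_original, len_reversed)  # 1 2
```

Under that definition the string does double in length when reversed.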

Too easy for you?

Let's take the Thai consonant "ก", which is a sort of a g, sort of k sound. One grapheme, one codepoint. Sorted. We'll add a vowel to it: "กอ". Two codepoints, but how many graphemes? One or two? Let's say two, but then let's point out that there is no logical difference there between that and a different vowel: "กี". This is a little more complicated? What's the length now? Is that one or two graphemes? It's clear as day that that's a single consonant + a single vowel, but how long is the string? How about: "เกียะ"? That's still a single consonant + a combining single vowel, only this time it's a compound vowel. One consonant, one vowel, how many graphemes? Are you using vertical slicing to determine what is and isn't a grapheme? Is that right?
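To see why vertical slicing is a shaky rule, it can help to look at how Unicode classifies each code point in those strings. A small Python sketch, where `describe` is a hypothetical helper, not a library function; it reports the code point count and how many of those code points are combining marks:

```python
import unicodedata

def describe(text):
    """Return (code point count, number of combining marks) for text."""
    categories = [unicodedata.category(ch) for ch in text]
    marks = sum(1 for cat in categories if cat.startswith("M"))
    return len(text), marks

for sample in ["ก", "กอ", "กี", "เกียะ"]:
    total, marks = describe(sample)
    print(sample, total, marks)
```

The vowel in "กอ" is classified as a letter while the one in "กี" is a nonspacing mark, so a "base plus marks" rule gives two units for the first and one for the second, even though both are one consonant plus one vowel.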

To see this taken to its logical end by The Masters of Unicode: http://www.unicode.org/faq/char_combmark.html - "How are characters counted when measuring the length or position of a character in a string?"


TL;DR -- I generally fall in the category of counting graphemes, as per the second FAQ you linked -- when talking about user-facing text processing. I don't think it makes sense to try to have one API that appeases both (low-level) programmers and end users.

Perhaps I wasn't entirely clear - I certainly see that there are complications. I think you're overcomplicating your examples within the domain of text - I'd say a composed character counts as one, and reversing a string with a composed character shouldn't reverse/destroy the composition. The reverse of "õ" isn't "o~", but simply "õ" -- and the length of "o" and "õ" should both be one -- even if they aren't coded similarly.
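A minimal Python sketch of that behaviour, assuming a "character" is a base code point plus any combining marks that follow it. `clusters` and `reverse_clusters` are made-up names, and this simple rule is much cruder than real grapheme segmentation.

```python
import unicodedata

def clusters(text):
    """Group each base code point with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch      # attach the mark to the preceding base
        else:
            out.append(ch)     # start a new cluster
    return out

def reverse_clusters(text):
    """Reverse by cluster so marks stay attached to their base letter."""
    return "".join(reversed(clusters(text)))

decomposed = "o\u0303"  # 'o' + COMBINING TILDE
composed = "\u00f5"     # precomposed õ

print(len(clusters(decomposed)), len(clusters(composed)))  # 1 1
print(reverse_clusters("n" + decomposed) == decomposed + "n")  # True
```

With this rule both encodings of "õ" have length one, and reversing keeps the tilde on the o.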

Now, this won't work for lower level work on "computer language" strings -- so for your unicode-library or whatever you'd have to count differently. Obviously you have to do some magic when converting a multicode-encoded string from big-endian to little-endian and vice-versa -- but that's hardly the same operation as reversing a string.

I'm not familiar with Thai, but to me it looks like your "กอ" and "กี" are equivalent to the Norwegian vowel "æ", which used to be written/typeset as "ae" (and can still be considered a composition in some input locales). So the length of "ae" is 2, and the length of "æ" is 1, as is the length of "a". That would mess up "ae" if reversed -- but I would consider that a "special/archaic" use case. I'm not sure if that would be similar in Thai -- I don't know, for example, if typewriters and computers have been widely used for a comparable time in Norway and Thailand (I'm guessing Thailand has a few thousand more years of printing/literacy).

As mentioned in my comment above, I also find it interesting that if we're taking length to mean "number of things in a sequence", the length of a sentence would be the number of words, the length of a word would be the number of graphemes, and the length of a grapheme might either be the number of bits/bytes, or there might be a level in between of composites.

So we might have:

   "This is an example.".length => 4 (or 5 or 8 depending
      on how we define spaces and punctuation)
   "This".length => 4
   "T".length => 1 byte,7 or 8 bits, or maybe even 2 in a
     prefix-based encoding (capital-transform t).

The logic would be that the full sentence is treated as a sequence of words that's treated as a sequence of graphemes that are treated as a sequence of codepoints that's treated as a stream of bits...
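Roughly, in Python, under the simplest possible assumptions: words are whitespace-delimited with punctuation left attached, code points stand in for graphemes (fine for this ASCII example, not in general), and the byte level is UTF-8.

```python
sentence = "This is an example."

words = sentence.split()           # sentence as a sequence of words
letters = list(words[0])           # word as a sequence of code points
octets = letters[0].encode("utf-8")  # one letter as a sequence of bytes

print(len(words), len(letters), len(octets))  # 4 4 1
```

Each level gives its own "length", and which one you get depends entirely on which sequence you asked about.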


Aren't strings sequences of characters?


What's a character? A codepoint, or a grapheme?

As the correct answer is "that depends", neither answer gets to qualify as "length", especially given that traditionally, a string is a sequence of bytes, which gives a third thing that 'length' could mean.


See Ruby 1.9.



