We're pretty much on the same page here. When you want to slice a string (becaus...

dietrichepp · on June 20, 2016

I'm not convinced that state machines will operate at the byte level. First of all, not all tokenizers are written using state machines. Even if that is the mathematical language we use to talk about parsers, it's still relatively common to make hand-written parsers. Secondly, if you take a Unicode-specified language and convert it to a state machine that operates on UTF-8, you can easily end up with an explosion in the number of possible states. Remember, this trick doesn't really change the size of the transition table, it just spreads it out among more states. On the other hand, you can get a lot more mileage out of using equivalency classes, as long as you're using something sensible like code points to begin with.

If you're curious, here's the V8 tokenizer header file:

https://github.com/v8/v8/blob/master/src/parsing/scanner.h

You can see that it works on an underlying UTF-16 code unit stream which is then composed into code points before tokenization. This extra step with UTF-16 is a quirk of JavaScript.

If you think that V8 shouldn't be processing by code point, feel free to explain that to them.

Avernar · on June 20, 2016

State machines would have to operate on the byte level. Otherwise each state would have to have have 65536 entries per state. The trick to handle UTF-8 would be to have 0-127 run as a state machine and > 127 break out to functions to handle the various unicode ranges that are valid for identifiers.

For languages that only allow non ascii in string literals a pure state machine would suffice.

Not sure why you're mentioning parsers. At that point you you're dealing with tokens.

As for UTF-16 it's an ugly hack that never should have existed in the first place. Unfortunately the unicode people had to fix their UCS-2 mistake.

Since Javascript is standardised to be either UCS-2 or UTF-16 it probably made sense to make the scanner use UTF-16.

dietrichepp · on June 22, 2016

State machines don't have to operate on the byte level because the tables can use equivalency classes. This will often result in smaller and faster state machines than byte-level state machines, if your language uses Unicode character classes here and there.

Avernar · on June 20, 2016

Looks like Javascript source code is required to be processed as UTF-16:

ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. [...] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. [...] If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16.

dietrichepp · on June 22, 2016

Right, but the UTF-16 is read code point by code point, not code unit by code unit. At that point, it might as well be UTF-8 or UTF-32.