For any other state machine nuts like me: don't miss their paper on tagged DFAs: http://re2c.org/2017_trofimovich_tagged_deterministic_finite... It's a fairly recent follow-on to Laurikari's 2000 paper and Kuklewicz's 2007 work on the same topic. Specifically, tagged DFAs let you do submatch extraction in the DFA itself. Other finite-state regex libraries either don't do this at all or, like RE2, fall back to slower engines in the general case. I think the downside of tagged DFAs is that they can get quite large!
It's interesting that it's using Laurikari's algorithm: I looked at the 2000 paper years ago and couldn't get a working algorithm out of it. Eventually I found another way to do it, which I use for JOE's regular expression library. So now I'm curious which one is faster...
"However, Thompson-style algorithms can be adapted to track submatch boundaries without giving up efficient performance. The Eighth Edition Unix regexp(3) library implemented such an algorithm as early as 1985, though as explained below, it was not very widely used or even noticed. "
So in JOE's matcher, when simulating the NFA with the multithreaded machine, each succeeding thread will correspond to one of the sub-match possibilities. Augment this machine to keep the ones you want (the ones with the longest matching strings starting from the left usually).
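For anyone who hasn't seen this technique before, here's a minimal sketch in C of that kind of thread-per-submatch-possibility NFA simulation (this is not JOE's actual code; the instruction encoding, `pike_run`, and the hardcoded pattern `(a+)(b+)` are my own illustration). Each thread carries its own set of submatch positions, and when a higher-priority thread reaches a match, the lower-priority ones are discarded:

```c
#include <string.h>

/* A tiny thread-based NFA VM for the hardcoded pattern (a+)(b+).
   Capture slots: 0/1 bound group 1, 2/3 bound group 2. */

enum { CHAR, SPLIT, SAVE, MATCH };

typedef struct { int op, c, x, y; } Inst;  /* c: char to match; x/y: branch targets or slot index */

static Inst prog[] = {
    { SAVE,  0,  0, 0 },   /* slot 0: group 1 starts */
    { CHAR, 'a', 0, 0 },
    { SPLIT, 0,  1, 3 },   /* greedily loop on 'a', else fall through */
    { SAVE,  0,  1, 0 },   /* slot 1: group 1 ends */
    { SAVE,  0,  2, 0 },   /* slot 2: group 2 starts */
    { CHAR, 'b', 0, 0 },
    { SPLIT, 0,  5, 7 },   /* greedily loop on 'b', else fall through */
    { SAVE,  0,  3, 0 },   /* slot 3: group 2 ends */
    { MATCH, 0,  0, 0 },
};

#define NPROG ((int)(sizeof prog / sizeof prog[0]))
#define NSLOT 4

typedef struct { int pc; int slots[NSLOT]; } Thread;
typedef struct { Thread t[NPROG]; int n; } List;

static int seen[NPROG];  /* dedup threads by pc within one input position */

static void addthread(List *l, int pc, const int *slots, int pos)
{
    if (seen[pc]) return;
    seen[pc] = 1;
    if (prog[pc].op == SPLIT) {          /* follow both branches, x first (higher priority) */
        addthread(l, prog[pc].x, slots, pos);
        addthread(l, prog[pc].y, slots, pos);
    } else if (prog[pc].op == SAVE) {    /* record the current position, keep going */
        int s[NSLOT];
        memcpy(s, slots, sizeof s);
        s[prog[pc].x] = pos;
        addthread(l, pc + 1, s, pos);
    } else {                             /* CHAR or MATCH: park the thread in the list */
        l->t[l->n].pc = pc;
        memcpy(l->t[l->n].slots, slots, NSLOT * sizeof(int));
        l->n++;
    }
}

/* Match `input` against (a+)(b+); on success fill out[0..3] and return 1. */
int pike_run(const char *input, int out[NSLOT])
{
    List a, b, *cl = &a, *nl = &b;
    int init[NSLOT] = { -1, -1, -1, -1 };
    int len = (int)strlen(input), matched = 0;

    cl->n = 0;
    memset(seen, 0, sizeof seen);
    addthread(cl, 0, init, 0);

    for (int pos = 0; pos <= len && cl->n > 0; pos++) {
        nl->n = 0;
        memset(seen, 0, sizeof seen);
        for (int k = 0; k < cl->n; k++) {
            Thread *t = &cl->t[k];
            if (prog[t->pc].op == CHAR) {
                if (pos < len && input[pos] == prog[t->pc].c)
                    addthread(nl, t->pc + 1, t->slots, pos + 1);
            } else {  /* MATCH: record it and cut the lower-priority threads */
                memcpy(out, t->slots, NSLOT * sizeof(int));
                matched = 1;
                break;
            }
        }
        List *tmp = cl; cl = nl; nl = tmp;
    }
    return matched;
}
```

Matching "aaabb" fills the slots with {0, 3, 3, 5}: group 1 spans [0,3) ("aaa"), group 2 spans [3,5) ("bb"). The "keep the ones you want" part is the inner-loop break: the first thread to reach MATCH in priority order wins over everything after it.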
> So in JOE's matcher, when simulating the NFA with the multithreaded machine, each succeeding thread will correspond to one of the sub-match possibilities. Augment this machine to keep the ones you want (the ones with the longest matching strings starting from the left usually).
Yeah, this is the standard way to do it. It's what I meant by "fall back to a slower engine," since simulating the NFA is much slower than a DFA.
The nice thing about it is that it doesn't impose any constraints on the structure of your code.
lex/flex probably have options to not generate piles of messy C code with global variables at this point, but IMO there's no reason not to use re2c instead, so it's moot.
update: re2c also gives you "semantic actions" like Ragel or yacc, which IMO is better. Not sure whether lex/flex do that; I think they have a hard-coded notion of tokens.
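For readers who haven't used re2c, a rule block looks roughly like this (a sketch based on the re2c manual; the `lex` function and the token names are made up for illustration, and the file has to be run through `re2c` before compiling):

```c
enum token { TOK_NUMBER, TOK_IDENT, TOK_ERROR };

/* `s` points at a NUL-terminated buffer. re2c replaces the special
   comment below with a DFA that advances YYCURSOR and then runs the
   matching rule's "semantic action" (the C code in braces). */
enum token lex(const char *s)
{
    const char *YYCURSOR = s, *YYMARKER;
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable  = 0;

        [0-9]+                  { return TOK_NUMBER; }
        [a-zA-Z_][a-zA-Z_0-9]*  { return TOK_IDENT; }
        *                       { return TOK_ERROR; }
    */
}
```

The actions are arbitrary C, so you aren't locked into a fixed token-stream interface; the generated code is just inlined into your function.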
I have ported the PHP re2c lexer logic to elisp in my in-progress attempt at a Semantic PHP mode for Emacs, using the lexer tokens as the foundation for syntax coloring and indentation.
I usually reach for Ragel [1] for this kind of work. It has a very steep learning curve and is a bit more complex than a straight-up translation of a conventional regular expression, but the state machine combination algebra is incredibly powerful and flexible. I've used the tool in numerous projects over the years.
I use re2c a lot. It's a great tool for safely working with text in C programs, without any runtime dependencies. It's a secret sauce in many of my programs.
Hyperscan was recently posted.[1] It's a fast multi-pattern regex matcher, as in thousands of patterns. Though it's optimized for deep packet inspection, I'm curious to try it for tokenizing.
I replaced a handwritten lexer of a proprietary PDF library with a re2c generated lexer. Besides fixing a bunch of dormant bugs, the resulting lexer was much faster.
Also, the state of C++ std::regex implementations is pretty sad (at least it was the last time I checked), and if you just need to match against a few regexes known at compile time, re2c solves the problem.
Safari used to use re2c to identify Unicode Top Level Domains in order to avoid cross-site-scripting attacks. The Unicode support has no parallel that I’m aware of.
Lexer generators seem like a useful algorithm wrapped up in heaps of framework bullshit. Wouldn't it be better to have a simple program that inputs regex and outputs a DFA than one that handles the entire lexing process?
Lexers run multiple regexes in tandem to determine the next token type. To do that with your method you'd have to run multiple DFAs over the same string. A lexer can do it with one DFA by having a distinct accepting state per token type (non-accepting plus one accepting state per token, versus just non-accepting and accepting).
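To make that concrete, here's a toy hand-rolled combined DFA in C (my own illustration, not output from any lexer generator): one machine recognizes both [0-9]+ and [a-z]+, and the accepting state reached tells you which token was found.

```c
enum token { TOK_ERROR, TOK_NUMBER, TOK_IDENT };

/* One DFA for [0-9]+ and [a-z]+ combined.
   States: 0 = start, 1 = in number (accepts TOK_NUMBER),
           2 = in identifier (accepts TOK_IDENT), 3 = dead. */
enum token next_token(const char **pp)
{
    const char *p = *pp, *end = p;
    int state = 0;
    enum token last = TOK_ERROR;

    for (;; p++) {
        int c = (unsigned char)*p, next;
        if (state == 0)
            next = (c >= '0' && c <= '9') ? 1
                 : (c >= 'a' && c <= 'z') ? 2 : 3;
        else if (state == 1)
            next = (c >= '0' && c <= '9') ? 1 : 3;
        else
            next = (c >= 'a' && c <= 'z') ? 2 : 3;
        if (next == 3) break;
        state = next;
        /* remember the longest accepting prefix seen so far,
           and which token its accepting state corresponds to */
        last = (state == 1) ? TOK_NUMBER : TOK_IDENT;
        end = p + 1;
    }
    *pp = end;  /* advance past the recognized token */
    return last;
}
```

Calling it twice on "abc123" yields TOK_IDENT (consuming "abc") and then TOK_NUMBER (consuming "123"): one pass over the input, one machine, and the final state tells you which of the two patterns matched.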