For any other state machine nuts like me: don't miss their paper on tagged DFAs: http://re2c.org/2017_trofimovich_tagged_deterministic_finite... It's a fairly recent follow-on to Laurikari's 2000 paper and Kuklewicz's 2007 work on the same topic. Specifically, tagged DFAs let you do submatch extraction in the DFA itself. Other finite-state regex libraries either don't do this at all or, like RE2, fall back to slower engines in the general case. I think the downside of tagged DFAs is that they can get quite large!
It's interesting that it's using Laurikari's algorithm: I looked at the 2000 paper years ago and couldn't get a working algorithm out of it. Eventually I found another way to do it, which I use for JOE's regular expression library. So now I'm curious which one is faster...
"However, Thompson-style algorithms can be adapted to track submatch boundaries without giving up efficient performance. The Eighth Edition Unix regexp(3) library implemented such an algorithm as early as 1985, though as explained below, it was not very widely used or even noticed. "
So in JOE's matcher, when simulating the NFA with the multithreaded machine, each succeeding thread will correspond to one of the sub-match possibilities. Augment this machine to keep the ones you want (the ones with the longest matching strings starting from the left usually).
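For anyone who hasn't seen this technique before, here's a minimal sketch in C of that kind of thread-per-submatch-possibility NFA simulation (this is not JOE's actual code; the instruction encoding, `pike_run`, and the hardcoded pattern `(a+)(b+)` are my own illustration). Each thread carries its own set of submatch positions, and when a higher-priority thread reaches a match, the lower-priority ones are discarded:

```c
#include <string.h>

/* A tiny thread-based NFA VM for the hardcoded pattern (a+)(b+).
   Capture slots: 0/1 bound group 1, 2/3 bound group 2. */

enum { CHAR, SPLIT, SAVE, MATCH };

typedef struct { int op, c, x, y; } Inst;  /* c: char to match; x/y: branch targets or slot index */

static Inst prog[] = {
    { SAVE,  0,  0, 0 },   /* slot 0: group 1 starts */
    { CHAR, 'a', 0, 0 },
    { SPLIT, 0,  1, 3 },   /* greedily loop on 'a', else fall through */
    { SAVE,  0,  1, 0 },   /* slot 1: group 1 ends */
    { SAVE,  0,  2, 0 },   /* slot 2: group 2 starts */
    { CHAR, 'b', 0, 0 },
    { SPLIT, 0,  5, 7 },   /* greedily loop on 'b', else fall through */
    { SAVE,  0,  3, 0 },   /* slot 3: group 2 ends */
    { MATCH, 0,  0, 0 },
};

#define NPROG ((int)(sizeof prog / sizeof prog[0]))
#define NSLOT 4

typedef struct { int pc; int slots[NSLOT]; } Thread;
typedef struct { Thread t[NPROG]; int n; } List;

static int seen[NPROG];  /* dedup threads by pc within one input position */

static void addthread(List *l, int pc, const int *slots, int pos)
{
    if (seen[pc]) return;
    seen[pc] = 1;
    if (prog[pc].op == SPLIT) {          /* follow both branches, x first (higher priority) */
        addthread(l, prog[pc].x, slots, pos);
        addthread(l, prog[pc].y, slots, pos);
    } else if (prog[pc].op == SAVE) {    /* record the current position, keep going */
        int s[NSLOT];
        memcpy(s, slots, sizeof s);
        s[prog[pc].x] = pos;
        addthread(l, pc + 1, s, pos);
    } else {                             /* CHAR or MATCH: park the thread in the list */
        l->t[l->n].pc = pc;
        memcpy(l->t[l->n].slots, slots, NSLOT * sizeof(int));
        l->n++;
    }
}

/* Match `input` against (a+)(b+); on success fill out[0..3] and return 1. */
int pike_run(const char *input, int out[NSLOT])
{
    List a, b, *cl = &a, *nl = &b;
    int init[NSLOT] = { -1, -1, -1, -1 };
    int len = (int)strlen(input), matched = 0;

    cl->n = 0;
    memset(seen, 0, sizeof seen);
    addthread(cl, 0, init, 0);

    for (int pos = 0; pos <= len && cl->n > 0; pos++) {
        nl->n = 0;
        memset(seen, 0, sizeof seen);
        for (int k = 0; k < cl->n; k++) {
            Thread *t = &cl->t[k];
            if (prog[t->pc].op == CHAR) {
                if (pos < len && input[pos] == prog[t->pc].c)
                    addthread(nl, t->pc + 1, t->slots, pos + 1);
            } else {  /* MATCH: record it and cut the lower-priority threads */
                memcpy(out, t->slots, NSLOT * sizeof(int));
                matched = 1;
                break;
            }
        }
        List *tmp = cl; cl = nl; nl = tmp;
    }
    return matched;
}
```

Matching "aaabb" fills the slots with {0, 3, 3, 5}: group 1 spans [0,3) ("aaa"), group 2 spans [3,5) ("bb"). The "keep the ones you want" part is the inner-loop break: the first thread to reach MATCH in priority order wins over everything after it.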
> So in JOE's matcher, when simulating the NFA with the multithreaded machine, each succeeding thread will correspond to one of the sub-match possibilities. Augment this machine to keep the ones you want (the ones with the longest matching strings starting from the left usually).
Yeah, this is the standard way to do it. It's what I meant by "fall back to a slower engine," since simulating the NFA is much slower than a DFA.
The nice thing about it is that it doesn't impose any constraints on the structure of your code.
lex/flex probably have options to not generate piles of messy C code with global variables at this point, but IMO there's no reason not to use re2c instead, so it's moot.
update: re2c also gives you "semantic actions" like Ragel or yacc, which IMO is better. Not sure whether lex/flex do that; I think they have a hard-coded notion of tokens.
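For readers who haven't used re2c, a rule block looks roughly like this (a sketch based on the re2c manual; the `lex` function and the token names are made up for illustration, and the file has to be run through `re2c` before compiling):

```c
enum token { TOK_NUMBER, TOK_IDENT, TOK_ERROR };

/* `s` points at a NUL-terminated buffer. re2c replaces the special
   comment below with a DFA that advances YYCURSOR and then runs the
   matching rule's "semantic action" (the C code in braces). */
enum token lex(const char *s)
{
    const char *YYCURSOR = s, *YYMARKER;
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable  = 0;

        [0-9]+                  { return TOK_NUMBER; }
        [a-zA-Z_][a-zA-Z_0-9]*  { return TOK_IDENT; }
        *                       { return TOK_ERROR; }
    */
}
```

The actions are arbitrary C, so you aren't locked into a fixed token-stream interface; the generated code is just inlined into your function.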
I have ported the PHP re2c lexer logic to elisp in my in-progress attempt at a Semantic PHP mode for Emacs, using the lexer tokens as the foundation for syntax coloring and indentation.
I usually reach for Ragel [1] for this kind of work. It has a very steep learning curve and is a bit more complex than a straight-up translation of a conventional regular expression, but the state machine combination algebra is incredibly powerful and flexible. I've used the tool in numerous projects over the years.
I use re2c a lot. It's a great tool for safely working with text in C programs, without any runtime dependencies. It's a secret sauce in many of my programs.
Hyperscan was recently posted.[1] It's a fast multi-pattern regex matcher, as in thousands of patterns. Though it's optimized for deep packet inspection, I'm curious to try it for tokenizing.
I replaced a handwritten lexer of a proprietary PDF library with a re2c generated lexer. Besides fixing a bunch of dormant bugs, the resulting lexer was much faster.
Also, the state of C++ std::regex implementations is pretty sad (at least it was the last time I checked), and if you just need to match against a few regexes known at compile time, re2c solves the problem.
Safari used to use re2c to identify Unicode Top Level Domains in order to avoid cross-site-scripting attacks. The Unicode support has no parallel that I’m aware of.
Lexer generators seem like a useful algorithm wrapped up in heaps of framework bullshit. Wouldn't it be better to have a simple program that inputs regex and outputs a DFA than one that handles the entire lexing process?
Lexers run multiple regexes in tandem to determine the next token type. To do that with your method you'd have to run multiple DFAs over the same string. A lexer can do it with one DFA by having a distinct accepting state per token type (non-accepting plus one accepting state per token, versus just non-accepting and accepting).
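To make that concrete, here's a toy hand-rolled combined DFA in C (my own illustration, not output from any lexer generator): one machine recognizes both [0-9]+ and [a-z]+, and the accepting state reached tells you which token was found.

```c
enum token { TOK_ERROR, TOK_NUMBER, TOK_IDENT };

/* One DFA for [0-9]+ and [a-z]+ combined.
   States: 0 = start, 1 = in number (accepts TOK_NUMBER),
           2 = in identifier (accepts TOK_IDENT), 3 = dead. */
enum token next_token(const char **pp)
{
    const char *p = *pp, *end = p;
    int state = 0;
    enum token last = TOK_ERROR;

    for (;; p++) {
        int c = (unsigned char)*p, next;
        if (state == 0)
            next = (c >= '0' && c <= '9') ? 1
                 : (c >= 'a' && c <= 'z') ? 2 : 3;
        else if (state == 1)
            next = (c >= '0' && c <= '9') ? 1 : 3;
        else
            next = (c >= 'a' && c <= 'z') ? 2 : 3;
        if (next == 3) break;
        state = next;
        /* remember the longest accepting prefix seen so far,
           and which token its accepting state corresponds to */
        last = (state == 1) ? TOK_NUMBER : TOK_IDENT;
        end = p + 1;
    }
    *pp = end;  /* advance past the recognized token */
    return last;
}
```

Calling it twice on "abc123" yields TOK_IDENT (consuming "abc") and then TOK_NUMBER (consuming "123"): one pass over the input, one machine, and the final state tells you which of the two patterns matched.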