That sounds very much like what Rust's RegEx engine does. I understand it extrac...

glangdale · on Feb 8, 2019

This is not a new technique with the Rust regex engine. We had a considerably more comprehensive literal 'factoring' approach in Hyperscan about a decade earlier (which also satisfied a lifelong ambition of mine; specifically misusing the netflow algorithm in a graph for something).

The multiple literal implementation in that matcher is also a partial lift from Hyperscan's "Teddy" small group literal matcher (not salty about that as "burntsushi" has been very clear about inspiration). I do wish they'd pull in the rest of the algorithm at some point - the bits that allow merging of Teddy literals into buckets to reduce false positive rates...

burntsushi · on Feb 9, 2019

While we're wishing for things, I'd love to see you write up how some of the algorithms (like Teddy) work. Hyperscan is clearly a treasure trove of them, but the effort required to extract its secrets is quite large! For example, I've spent quite a bit of time perusing its code (and your excellent mailing list posts), and I still don't think I could describe its execution flow at a high level.

As far as literals go, I was doing that well before I heard about Hyperscan. I got the idea from hints in the literature. (On mobile or else I'd provide a link.)

glangdale · on Feb 9, 2019

We have a paper accepted to NSDI, so this will hopefully shed some light on things. A conference paper is, sadly, too small to fit more than a few subsystems so a writeup of Teddy isn't included.

As I don't work for Intel any more, it's unlikely that I will put in the effort for anything significantly more comprehensive. I am considering another building another regex matcher but it wouldn't be the encyclopedic list of optimizations approach. Paul Terra refers to Hyperscan as the "Dwarf Fortress of regular expression matchers". I have some new ideas I'd like to try out, in any case.

I wasn't suggesting that your use of literal factoring comes from Hyperscan, only the Teddy matcher. Literal factoring seems an old technique and I don't know the genuinely first cite of that.

burntsushi · on Feb 9, 2019

Gotya. Thanks for the clarification! I look forward to reading your paper! I'll hold out hope for the folks working on Hyperscan at Intel to do some more in depth write ups.

Cu3PO42 · on Feb 9, 2019

I didn't mean to insinuate that it was. The Rust RegEx engine was simply the first that came to mind for me. What I did intend to say that if this is indeed the optimisation that Nevod claims it has, it is not the first to do it.

However I did take the time to read up on Hyperscan and would like to thank you for contributing to what looks like an excellent tool.