Hacker News new | past | comments | ask | show | jobs | submit login

That sounds very much like what Rust's RegEx engine does. I understand it extracts longer literals (when available) to find a starting point for where a match might be according to [1].

[1] https://blog.burntsushi.net/ripgrep/#literal-optimizations




This is not a new technique with the Rust regex engine. We had a considerably more comprehensive literal 'factoring' approach in Hyperscan about a decade earlier (which also satisfied a lifelong ambition of mine; specifically misusing the netflow algorithm in a graph for something).

The multiple literal implementation in that matcher is also a partial lift from Hyperscan's "Teddy" small group literal matcher (not salty about that as "burntsushi" has been very clear about inspiration). I do wish they'd pull in the rest of the algorithm at some point - the bits that allow merging of Teddy literals into buckets to reduce false positive rates...


While we're wishing for things, I'd love to see you write up how some of the algorithms (like Teddy) work. Hyperscan is clearly a treasure trove of them, but the effort required to extract its secrets is quite large! For example, I've spent quite a bit of time perusing its code (and your excellent mailing list posts), and I still don't think I could describe its execution flow at a high level.

As far as literals go, I was doing that well before I heard about Hyperscan. I got the idea from hints in the literature. (On mobile or else I'd provide a link.)


We have a paper accepted to NSDI, so this will hopefully shed some light on things. A conference paper is, sadly, too small to fit more than a few subsystems so a writeup of Teddy isn't included.

As I don't work for Intel any more, it's unlikely that I will put in the effort for anything significantly more comprehensive. I am considering another building another regex matcher but it wouldn't be the encyclopedic list of optimizations approach. Paul Terra refers to Hyperscan as the "Dwarf Fortress of regular expression matchers". I have some new ideas I'd like to try out, in any case.

I wasn't suggesting that your use of literal factoring comes from Hyperscan, only the Teddy matcher. Literal factoring seems an old technique and I don't know the genuinely first cite of that.


Gotya. Thanks for the clarification! I look forward to reading your paper! I'll hold out hope for the folks working on Hyperscan at Intel to do some more in depth write ups.


I didn't mean to insinuate that it was. The Rust RegEx engine was simply the first that came to mind for me. What I did intend to say that if this is indeed the optimisation that Nevod claims it has, it is not the first to do it.

However I did take the time to read up on Hyperscan and would like to thank you for contributing to what looks like an excellent tool.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: