
> 2. Nevod matches MULTIPLE patterns against a document in ONE PASS. Patterns/expressions are indexed by a state machine and filtered effectively during matching.

This is a good idea. It's such a good idea that we did it in 2006, sold a company to Intel based around it in 2013, and open-sourced the resulting library (https://github.com/intel/hyperscan) in 2015.

All multi-pattern systems get a huge improvement over one-at-a-time regex matching. It would be embarrassing if they did not. Our rule of thumb for Hyperscan was that it should be about 5-10x faster than libpcre on single patterns, with the gap getting progressively bigger as the number of patterns grows.

I don't doubt that there is quite a bit of interesting work (entirely outside of regex - the word-based focus looks interesting) going on in this project, but the fact that you couldn't be bothered to Google around for a sensible baseline is not encouraging.

Also, which bit of your system do you think is novel enough to patent? You may find considerable prior art out there, which you don't really seem to be aware of based on how much emphasis you're placing on the not-entirely-novel idea of using a state machine.




Geoff, for me it's a real honor that you, the well-known creator of Hyperscan, paid attention to what we are doing. We truly respect all the prior work in this field (Hyperscan, RE2, Rust, etc.), so comparative benchmarks will be made available.

However, we all need to keep in mind that Nevod operates on words, not characters, and that it targets large rule decks (1,000 to 100,000 patterns). So we don't compete in individual-pattern matching or streaming matching (to some extent, the goals of Hyperscan and Nevod are different). In other words, a lot depends on the target use case, which is usually reflected in the kind of patterns and in their number.

Regarding the patent: sorry, for legal reasons I can't disclose details at the moment.

Thank you for your valuable comments.


I don't think I'm all that well-known. :-)

Staying away from streaming is a good idea for a lot of designs. We all lost a lot of our hair making streaming work in Hyperscan, and it was often the thing suppressing easier optimizations. Or at least we would have to have a separate approach for streaming and block mode; not exactly great s/w engineering.

Word-based approaches are interesting. So is large scale. Hyperscan does embarrassingly badly on natural-language stuff at scale, due to some baked-in assumptions about the relative frequencies of literal factors, and its literal matchers tend to fall apart on really huge literal sets in this domain. So there's definitely an opportunity here.

I would strongly encourage you to produce some repeatable benchmarks (at least, things where we can see what the task is for something like RE2 or Hyperscan). We spent many years in stealth mode with a lot of superlatives and NDA-only PowerPoint decks, and I don't think it did us any favors. It means that you wind up being the sole guardian of your own honesty, which is ... tricky ("What the hell is a medium-complexity pattern, anyhow?").


Based on what you are saying, I conclude that we are on the right path. Again, thanks a lot for all the valuable feedback and advice!


Isn't this basically what regexp-based lexer generators (e.g. lex, 1975) do? Stdlib regexps have to run multiple passes because the library itself has no idea whether it's being used with one pattern or many, but a lexer generator, which by definition has all the patterns at once, can merge the generated DFAs and match the input in a single pass.

Hell, I've been known to do quick & dirty versions of this technique by concatenating all the patterns together with "|" before passing them to Pattern.compile. It takes all of one line of code. Backtracking regexp engines don't really benefit from this, but DFA-based regexp engines like RE2 can get massive speedups.
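A minimal sketch of that one-liner (plus plumbing) in Java; the pattern list, class name, and input here are made up for illustration:

    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class UnionMatch {
        public static void main(String[] args) {
            // Hypothetical pattern set; a real deck might hold thousands.
            List<String> patterns = List.of("foo\\d+", "ba[rz]", "qu+x");

            // Wrap each alternative in a non-capturing group so that
            // operator precedence inside each pattern is preserved,
            // then join everything with "|".
            Pattern union = Pattern.compile(patterns.stream()
                    .map(p -> "(?:" + p + ")")
                    .collect(Collectors.joining("|")));

            Matcher m = union.matcher("foo42, then baz, then quux");
            while (m.find()) {
                System.out.println("match: " + m.group());
            }
        }
    }

Note that java.util.regex is itself a backtracking engine, so this buys convenience rather than speed; feeding the same joined pattern to a DFA-based engine (RE2, or RE2J on the JVM) is where the single-pass win shows up.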


The idea of compiling everything together into a single DFA is very old. However, it doesn't always work: in many cases it produces a combinatorial explosion (informally, when a pattern contains lots of states that overlap with other states you might simultaneously be in, whether from other patterns or from elsewhere in the same pattern). So DFA merging is a dark art, and many people spend whole careers taping things onto the side of the basic DFA algorithm to cope with this.
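The textbook blowup case is "there is an 'a' exactly n characters before the end", i.e. (a|b)*a(a|b){n}: the minimal DFA must remember which of the last n+1 characters were 'a', so it needs roughly 2^(n+1) states. A quick sketch that makes the growth visible, assuming the dk.brics.automaton library is on the classpath (the class name here is made up):

    import dk.brics.automaton.Automaton;
    import dk.brics.automaton.RegExp;

    public class DfaBlowup {
        public static void main(String[] args) {
            for (int n : new int[] {4, 8, 12, 16}) {
                // "an 'a' exactly n characters before the end"
                Automaton a = new RegExp("(a|b)*a(a|b){" + n + "}").toAutomaton();
                a.determinize();
                a.minimize();
                System.out.println("n=" + n + ": " + a.getNumberOfStates() + " DFA states");
            }
        }
    }

Merging k such patterns compounds the problem: the product construction on DFAs of sizes s1..sk can need up to s1 * ... * sk states in the worst case.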

This doesn't seem to come up with most lexers because heavily overlapping tokens don't make for a sane token grammar - i.e. you're probably not going to want to allow your tokens to overlap that much. So it's not a problem for lex, but it might be a problem if you're scanning 10,000 text-analytics patterns simultaneously.

Concatenating patterns is fine for RE2 - and RE2 can survive it - up to a point. However, when determinization fails, RE2 just falls back to a 'survivable' Thompson NFA simulation and has no other strategies.

For backtracking engines, concatenating patterns with "|" is equivalent to just trying one pattern, then the next, etc.



