First of all, thanks everyone for the valuable feedback, criticism, suggestions, etc. I truly appreciate it!
And thanks for your patience to everyone who is playing with Nevod right now and getting errors. Interest has been unexpectedly high, and the number of visitors to our playground is very large. Please keep in mind that this is a preview of the technology.
Let me clarify a few things. The speed of Nevod is based on two things:
1. Nevod operates on words, not individual characters. From my 30 years of coding experience, I would say that roughly 95% of text processing/rules are word-based, so this covers most tasks.
2. Nevod matches MULTIPLE patterns against a document in ONE PASS. Patterns/expressions are indexed by a state machine and filtered effectively during matching (a rough sketch of the idea follows below).
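To give a rough idea of what word-level, single-pass, multi-pattern matching means in practice, here is a tiny illustrative sketch in Java. It is NOT Nevod's actual engine or API; the patterns, names, and the naive first-word index are made up purely for illustration:

    import java.util.*;

    class WordPatternSketch {
        public static void main(String[] args) {
            // index each pattern (a sequence of words) by its first word
            Map<String, List<String[]>> index = new HashMap<>();
            String[][] patterns = {
                {"new", "york"},
                {"credit", "card", "number"}
            };
            for (String[] p : patterns) {
                index.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p);
            }

            String[] words = "please send my credit card number to new york".split("\\s+");
            // single left-to-right pass over the words; only patterns whose
            // first word matches the current word are checked further
            for (int i = 0; i < words.length; i++) {
                for (String[] p : index.getOrDefault(words[i], Collections.emptyList())) {
                    boolean matched = i + p.length <= words.length;
                    for (int j = 1; matched && j < p.length; j++) {
                        matched = words[i + j].equals(p[j]);
                    }
                    if (matched) {
                        System.out.println("matched \"" + String.join(" ", p) + "\" at word " + i);
                    }
                }
            }
        }
    }

The real engine uses a state machine rather than a per-word candidate list, but the point is the same: all patterns are consulted during one walk over the words.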
So, with a RULE DECK of 1000 patterns, we got a 300x improvement compared to RegExp. This is the kind of task where you need, say, to highlight syntax, to verify and route incoming requests in the cloud at scale, to perform massive text validations, to track certain phrases and relations in large volumes of human communication, etc.
As the number of patterns grows, the difference between Nevod and RegExp gets higher and higher and goes far beyond 300x. So I somewhat disagree with the moderators removing the word "hundreds" from the headline. :)
We will publish benchmark results, and we will also release "negrep" (Nevod GREP), a command-line tool for Windows, Linux, and Mac, so everyone will be able to run "negrep", play with it, and verify the benchmarks themselves. Generally, the Nevod technology will be available as a library, as a command-line tool, and as a service.
> 2. Nevod matches MULTIPLE patterns against a document in ONE PASS. Patterns/expressions are indexed by a state machine and filtered effectively during matching.
This is a good idea. It's such a good idea that we did it in 2006, sold a company to Intel based around it in 2013, and open-sourced the resulting library (https://github.com/intel/hyperscan) in 2015.
All multi-pattern systems get a huge improvement over one-at-a-time regex. It would be embarrassing if they did not. Our rule of thumb for Hyperscan was that it should be about 5-10x faster than libpcre on single patterns and the gap should get progressively bigger.
I don't doubt that there is quite a bit of interesting work (entirely outside of regex - the word-based focus looks interesting) going on in this project, but the fact that you couldn't be bothered to Google around for a sensible baseline is not encouraging.
Also, which bit of your system do you think is novel enough to patent? You may find considerable prior art out there, which you don't really seem to be aware of based on how much emphasis you're placing on the not-entirely-novel idea of using a state machine.
Geoff, it's a real honor for me that you, the well-known creator of Hyperscan, paid attention to what we are doing. We truly respect all the prior work in this field (Hyperscan, RE2, Rust, etc.), so different benchmarks will be available.
However, we all need to keep in mind that Nevod operates on words, not characters, and targets large rule decks (1,000-100,000 patterns). So we don't compete in the field of individual pattern matching or streaming matching (to some extent, the goals of Hyperscan and Nevod are different). In other words, a lot depends on the target goals, which are usually reflected in the kind of patterns and in their number.
Regarding the patent: sorry, for legal reasons I can't expose details at the moment.
Staying away from streaming is a good idea for a lot of designs. We all lost a lot of our hair making streaming work in Hyperscan, and it was often the thing suppressing easier optimizations. Or at least we would have to have a separate approach for streaming and block mode; not exactly great s/w engineering.
Word-based approaches are interesting. So is large scale. Hyperscan does embarrassingly badly on natural language stuff at scale due to some baked in assumptions about relative frequencies of literal factors and the literal matchers tend to fall apart on really huge literal sets in this domain. So there's definitely an opportunity here.
I would strongly encourage you to produce some repeatable benchmarks (at least, things where we can see what the task is for something like RE2 or Hyperscan). We spent many years in stealth mode with a lot of superlatives and NDA-only Powerpoint decks and I don't think it did us any favors. It means that you wind up being the only guardian of your own honesty, which is ... tricky ("What the hell is a medium-complexity pattern, anyhow?").
Isn't this basically what regexp-based lexer generators (e.g. lex, 1975) do? Stdlib regexps need to run multiple passes because the library itself has no idea whether it's being used with one pattern or multiple patterns, but any lexer generator that has all the patterns at once (i.e. all of them) can merge the DFAs generated and match the input in a single pass.
Hell, I've been known to do quick & dirty versions of this technique by concatenating all the patterns together with "|" before passing them to Pattern.compile. It takes all of one line of code. Backtracking regexp engines don't really benefit from this, but DFA-based regexp engines like RE2 can get massive speedups.
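For example, in Java (the rule regexes here are just placeholders):

    import java.util.regex.*;

    class CombinedPatterns {
        public static void main(String[] args) {
            // three illustrative rules: card-like number, email-ish, URL-ish
            String[] rules = { "\\d{4}-\\d{4}-\\d{4}-\\d{4}", "[\\w.]+@[\\w.]+", "https?://\\S+" };

            // the one-liner: join the rules with "|" and compile once
            Pattern combined = Pattern.compile(String.join("|", rules));

            Matcher m = combined.matcher("mail me at foo@example.com or see https://example.org");
            while (m.find()) {
                System.out.println("match: " + m.group());
            }
        }
    }

With java.util.regex this is still effectively "try rule 1, then rule 2, ..." at each position; handing the same combined pattern to a DFA-based engine like RE2 (or RE2/J on the JVM) is where the single-pass speedup shows up.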
The idea of compiling everything together into a single DFA is very old. However, this doesn't always work - it will result in a combinatorial explosion in many cases (informally, if the patterns contain a lot of states that overlap with other potential states you might be in from other patterns or elsewhere in that pattern). So DFA merging is a dark art and many people spend whole careers taping things to the side of the basic DFA algorithm to handle this.
This doesn't seem to come up with most lexers because the idea of using some crazy overlapping tokens doesn't sound great - i.e. you're probably not going to want to allow your tokens to overlap that much. So it's not a problem for lex, but it might be a problem if you're scanning 10,000 text analytics patterns simultaneously.
Concatenating patterns is fine for RE2 - and RE2 can survive this - to a point. However, RE2 just falls back to a 'survivable' Thompson NFA regex when determinization fails and has no other strategies.
For backtracking engines, concatenating patterns with "|" is equivalent to just trying one pattern, then the next, etc.
> 2. Nevod matches MULTIPLE patterns against a document in ONE PASS. Patterns/expressions are indexed by a state machine and filtered effectively during matching.
Regexps do that too. Specifically, there is a whole class of "lexers"/"tokenizers" which are regexp-based and which specialize in matching thousands of patterns against the text in one pass.
(You can kinda approximate it with regular regexps by doing multiple rules at once, each in its own group, and seeing which groups are non-empty, as in "(rule1)|(rule2)|(rule3)" and so on. If your regexp library is good, it will match this in time independent of the number of rules. But real lexers will have nicer syntax.)
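A quick Java sketch of that approximation (the three "rules" are just placeholders for a number, an identifier, and an operator):

    import java.util.regex.*;

    class WhichRuleMatched {
        public static void main(String[] args) {
            // rule1 = number, rule2 = identifier, rule3 = operator
            Pattern p = Pattern.compile("(\\d+)|([A-Za-z]+)|([+*/=-])");

            Matcher m = p.matcher("x = 42 + y");
            while (m.find()) {
                // the group that is non-null tells you which rule fired
                for (int g = 1; g <= m.groupCount(); g++) {
                    if (m.group(g) != null) {
                        System.out.println("rule" + g + ": " + m.group(g));
                    }
                }
            }
        }
    }

This snippet uses java.util.regex (a backtracking engine) just for brevity; the rule-count-independent matching time only comes with a DFA-based library.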
I see a lot of criticism but not a lot of praise. For me, creating something so comprehensive and also easy to use is an achievement. And in my opinion this is much better than Regex for most cases because of the ease of use and readability. I think the main pushback you are seeing is just because people are invested in Regex.
Um... no? He is seeing pushback because one of his main claims (100x faster) is very likely factually wrong, and also because he is not even acknowledging all the other prior art in this space.
> Also, a patent sends a marketing message that we are serious about the technology we created.
That's the message it sends to business people maybe, but your target audience here is hackers and tinkerers and it sends a completely different message (namely: "keep this technology out of your projects or you'll get into legal trouble down the road").