They ignored a key detail in their benchmark: they did not include the time it takes to transfer the input file from host memory to the GPU (the GPU cores can only operate on graphics memory). That is easily going to be the most time-consuming part. They also didn't include any IO operations in their timing, so instead of a 10x speedup it is more likely to be a couple of percent faster, since grep is going to be IO bound. This is a neat project, but there is a reason we are not using the GPU for regex matching.
Yes, you are right: you should not replace grep for one-off regexes, due to the latency of the IO operations. As far as I recall, however, memory throughput is much higher on GPUs than on CPUs (or at least this was true in 2012). Our intended use cases for this sort of grep were cases where you are finding many needles in enormous haystacks, such as looking for malicious machine code in executables you download off of the internet (virus scanning), or searching for certain genetic sequences in DNA. I believe we discussed this in our final presentation, but perhaps we left it out of the written report.
Yes we probably should have included the numbers to show why you wouldn't want to use this instead of grep for simple tasks. I am sorry we missed it.
Well, if you're looking for certain motifs or k-mers (https://en.wikipedia.org/wiki/K-mer), it's mostly scanning through pre-processed FASTA files.
The last k-mer counter I wrote was CPU/memory bound (big enough strings mean big hash tables to store all of this in, or something like a Bloom filter, which is Jellyfish's approach), not IO bound.
Anyway, for anyone who wants to write something fun, FASTA parsers / k-mer counters are a good weekend project.
Here was my work in the area (shameless plug, though I'm not in the field anymore).
Just tested again against a ~400 MB genome FASTA file from NCBI, and it's still clearly a CPU/memory issue with huge hash tables. See below (4^k is the number of potential k-mer combinations). K=9 is done with a 4^9 array (sane to hold in memory); K=12 uses a sparse array (stored as a std::unordered_map, because it's too large to allocate that much space in memory). One is obviously way slower.
calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 9 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null
real 0m4.235s
user 0m4.116s
sys 0m0.108s
calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 12 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null
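For anyone curious, the core loop of a counter like this is small. Here is a rough sketch (illustrative only, not the actual dna-utils code) of the dense-array vs. hash-map trade-off described above:

#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

// Map a base to 2 bits; returns -1 for anything that isn't A/C/G/T.
static int base2bits(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Count k-mers in one sequence. For small k, a dense vector of 4^k counters is
// cheap and cache-friendly; for larger k, a sparse hash map avoids allocating
// 4^k slots (the std::unordered_map case mentioned above).
void count_kmers(const std::string& seq, int k, bool use_dense,
                 std::vector<uint64_t>& dense,
                 std::unordered_map<uint64_t, uint64_t>& sparse) {
    const uint64_t mask = (1ULL << (2 * k)) - 1;  // assumes k <= 31
    uint64_t code = 0;
    int valid = 0;  // consecutive valid bases seen so far
    for (char c : seq) {
        int b = base2bits(c);
        if (b < 0) { valid = 0; code = 0; continue; }  // reset on N, headers, etc.
        code = ((code << 2) | (uint64_t)b) & mask;
        if (valid < k) ++valid;
        if (valid == k) {
            if (use_dense) dense[code]++;
            else           sparse[code]++;
        }
    }
}

int main() {
    const int k = 9;                 // a dense array is sane up to roughly k = 12..13
    const bool use_dense = true;
    std::vector<uint64_t> dense(use_dense ? (size_t{1} << (2 * k)) : 0, 0);
    std::unordered_map<uint64_t, uint64_t> sparse;

    // A real tool would stream sequences out of a FASTA file here.
    count_kmers("ACGTACGTAACGTTTTACGT", k, use_dense, dense, sparse);

    uint64_t total = 0;
    for (uint64_t v : dense) total += v;
    std::printf("total %d-mers counted: %llu\n", k, (unsigned long long)total);
    return 0;
}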
I'm quite certain Windows Defender uses (or used, at some point) GPUs in the machine to accelerate malware pattern matching. I remember a coworker talking about a funny bug as a consequence when the graphics driver executable itself was infected.
That is their enterprise "Microsoft Defender Advanced Threat Protection" endpoint security product, which scans live memory for hidden malware that never writes to disk. That would otherwise consume far more CPU than the typical default Windows security software, which is why they needed the GPU in the first place. It's an interesting solution nonetheless.
I’ve run into this for simple tasks on a GPU, like merging sorted lists. The massive parallelism can’t make up for the transfer costs unless the operations performed are expensive enough. The “roofline” model is usually the perspective used for this kind of accelerator application.
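For reference, the roofline bound (my paraphrase, with generic symbols, not taken from the article) is attainable throughput = min(P_peak, I × B), where I is the arithmetic intensity (operations per byte moved) and B is the bandwidth of the limiting link, in this case PCIe rather than device memory. A regex scan performs only a handful of operations per input byte, so I is tiny and the bandwidth term dominates unless the data is already resident on the GPU.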
In nearly all the domains I've worked in, discrete packets of data get run through a regexp directly.
If we are using something more 'grep'-like, it's for large quantities of data, and in many cases we could treat it as a stream. So as long as the bandwidth to the GPU is sufficient, a little stall at the beginning wouldn't represent an undue hardship if the results come back faster.
I have not been able to inspect the code (dead links), but from the description of their approach, they do not depend on anything particularly fancy from CUDA. An OpenCL implementation seems just as feasible.
> They did not include the time it took to transfer the input file from memory to the GPU
Fair point; it's more support for this HN post[2] making the case for memory-mapping files to the GPU.
> This is a neat project, but there is a reason we are not using the GPU for regex matching.
I think it's more interesting that this is a general facility for running automata on a GPU, as it opens up the universe of things that can be accelerated by a GPU. For instance, Numba[1] is a library for running arbitrary math code, and it has a feature for generating code to run on the GPU.
CUDA has had this built in for many years via UVM (unified virtual memory). That won't take away the transfer bottleneck, but it does make it transparent to the user by handling all the paging/copies.
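A minimal sketch of that approach (my own toy example, not the project's code): allocate with cudaMallocManaged and let the driver migrate pages on demand, instead of issuing explicit copies.

#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: count how many bytes equal a given character.
__global__ void count_char(const char* buf, size_t n, char target,
                           unsigned long long* hits) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n && buf[i] == target) atomicAdd(hits, 1ULL);
}

int main() {
    const size_t n = 1 << 20;   // 1 MiB of demo data
    char* buf = nullptr;
    unsigned long long* hits = nullptr;

    // Managed allocations are visible to both CPU and GPU; the driver pages
    // the data across PCIe on demand rather than requiring explicit cudaMemcpy.
    cudaMallocManaged(&buf, n);
    cudaMallocManaged(&hits, sizeof(*hits));
    for (size_t i = 0; i < n; ++i) buf[i] = (i % 7 == 0) ? 'a' : 'b';
    *hits = 0;

    count_char<<<(unsigned)((n + 255) / 256), 256>>>(buf, n, 'a', hits);
    cudaDeviceSynchronize();    // waits for the kernel and makes *hits valid on the host

    std::printf("found %llu matches\n", *hits);
    cudaFree(buf);
    cudaFree(hits);
    return 0;
}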
Totally disagree. Moving data to the GPU can be overlapped with the computation. For larger files that can fully saturate the GPU's bandwidth, it's completely negligible. "Easily the most time-consuming part"? No. Not even close.
Also, IO bound? Again, no. My consumer MBP can read at 3.2 GB/s. Good luck doing regex on that in real time on a CPU. This is a CPU-bound operation for a typical high-end configuration.
Does this make sense for a 1 MB file? Probably not. But for anything well beyond that it will be way faster.
> They did not include the time it took to transfer the input file from memory to the GPU
To give some context: PCIe 3.0 x16 achieves about 11-12 GB/s, so transfers for the benchmarked 53 MB file would add ~4.8 ms (per direction, assuming pinned host memory).
Now, they could also pursue a streaming approach, processing in "micro-batches", i.e., pipelining input transfers (HOST->GPU), GPU-based processing, and transfer of results (GPU->HOST), as pursued by [1] (disclaimer: I'm a co-author). This lowers processing latency. Since the interconnect is full-duplex, it would just limit the processing rate to about ~11 GB/s (i.e., 5 ms for the mentioned 53 MB). To be fair, though, the input has to be sufficiently large, such that (a) there are enough micro-batches to overlap transfer and compute and (b) the micro-batches can still be chosen large enough to exploit the full parallelism of the GPU.
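Roughly, the host side of such a pipeline looks like this (a simplified sketch of the general pattern, not the actual code from [1]): two CUDA streams double-buffer chunks, so the copy of the next micro-batch overlaps the kernel running on the current one.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void scan_chunk(const char* data, size_t n) {
    // placeholder for the actual matching kernel
}

int main() {
    const size_t total = 64ull << 20;    // stand-in for the input file
    const size_t chunk = 8ull << 20;     // micro-batch size
    std::vector<char> host(total, 'x');  // real code would use pinned memory (cudaHostAlloc)
                                         // so the async copies can actually overlap

    char* dev[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&dev[i], chunk);
        cudaStreamCreate(&stream[i]);
    }

    // Double-buffered loop: while one stream runs the kernel on chunk i,
    // the other stream can already be copying chunk i+1 to the device.
    for (size_t off = 0, i = 0; off < total; off += chunk, ++i) {
        int s = (int)(i % 2);
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        cudaMemcpyAsync(dev[s], host.data() + off, n, cudaMemcpyHostToDevice, stream[s]);
        scan_chunk<<<(unsigned)((n + 255) / 256), 256, 0, stream[s]>>>(dev[s], n);
        // match results would go back with another cudaMemcpyAsync on the same stream
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) { cudaFree(dev[i]); cudaStreamDestroy(stream[i]); }
    std::printf("done\n");
    return 0;
}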
Or you can flip your assumptions and run 10X faster... everywhere. It was a tough investment, but we went GPU-first via Apache Arrow & Nvidia RAPIDS. (In this case, I'd look at RAPIDS / NVStrings before this.)
One of the authors here: Happy to answer any questions. Keep in mind this was a school project from 2012 (Kayvon's 15-418 at CMU, a wonderful class), so it's been a while.
I remember another story where someone claimed to be faster than grep. It turned out the comparison was unfair because of this distinction. I believe LANG=C is enough to switch grep into ASCII mode.
Sure, but it must be relatively simple to identify, pre-compilation, whether the regex is regular or not; there are only certain constructions that break regularity. Why couldn't Perl's engine fall back to a more optimized regular-language-only implementation for those cases, if it has such a big impact on performance?
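To illustrate the idea, a naive pre-scan (purely hypothetical, and certainly not how Perl's regcomp is structured) only has to spot the constructs that break regularity:

#include <cctype>
#include <cstdio>
#include <string>

// Rough pre-scan: returns true if the pattern appears to use only constructs a
// DFA/NFA engine can handle. It deliberately errs on the side of "not regular"
// for anything it doesn't recognize.
bool looks_regular(const std::string& pat) {
    for (size_t i = 0; i < pat.size(); ++i) {
        if (pat[i] == '\\') {
            if (i + 1 < pat.size() && std::isdigit((unsigned char)pat[i + 1]))
                return false;              // backreference like \1
            ++i;                           // other escapes are fine; skip the escaped char
        } else if (pat[i] == '(' && i + 1 < pat.size() && pat[i + 1] == '?') {
            char c = (i + 2 < pat.size()) ? pat[i + 2] : '\0';
            if (c == '=' || c == '!' || c == '<' || c == 'R' || c == '{')
                return false;              // lookaround, recursion, embedded code
        }
    }
    return true;
}

int main() {
    std::printf("%d\n", looks_regular("a*(b|c)+d"));   // 1: plain regular syntax
    std::printf("%d\n", looks_regular("(a+)b\\1"));    // 0: backreference
    std::printf("%d\n", looks_regular("foo(?=bar)"));  // 0: lookahead (rejected conservatively)
    return 0;
}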
It can, and it's one of my explicit goals to do so. But the biggest problem is perl's horrible code in regcomp and regexec (based on Henry Spencer's old code). It mostly does state transitions via goto or setjmp/longjmp. The initial regex scanner did two passes, so it would have been trivial to detect the backtracking constructs an NFA can't handle, at zero cost. With a cleanly written codebase it would be much easier.
My re::engine::PCRE2 replacement is much faster, can do everything, and even has fewer bugs than perl5's regex engine. But it's still not production ready, and a full replacement would still be huge work. For now it works via hooks, which are slow because most of the work is done twice.
Theoretically, a regular expression can be matched in constant space, and that's profoundly limiting, so anything that requires so much as a stack is not a regex.
That includes basic things like capture groups. That doesn't mean you can't use automata in a regex-plus implementation, though.
I can't find the paper, but I think there is a hybrid engine that tries to get the best of both worlds; constant space where possible and using additional space to provide more features.
It's also possible that they might be including perl's startup time in that measurement? Their comments about 'parallelising' 'by using ampersand' make me concerned about the methodology involved.
Deterministic Finite Automaton. It's a concept from automata theory, which is part of the theory of computation. DFAs are how libraries like Google's RE2 or Go's regexp are implemented. They're not as powerful as regex libraries built on a non-regular language model, but that's by design.
Fun trivia: DFAs and NFAs are computationally equivalent. Any NFA can be converted into an equivalent DFA.
parsing an arbitrary context-free grammar[1] is provably impossible using pure regular expressions. but add backtracking support and you end up with one of my favorite stackoverflow answers[2] of all time. (bonus: there's another legendary answer at the top of that question.)
Actually, it's been proven for ... longer than Perl has existed?
The various features in Perl allow you to emulate context-free languages. There is a proof somewhere, but it's trivial to see that you can use recursive subpatterns to match languages with well-nested parens.
Languages of well-nested parens are known to be non-regular, and thus impossible to express with regular expressions.
Also note that determining whether an arbitrary context-free language is regular is undecidable. That doesn't mean you can't try to optimize some Perl patterns down to a regex, but your optimization will never be complete.
>Also note that determining whether an arbitrary context-free language is regular is undecidable. That doesn't mean you can't try to optimize some Perl patterns down to a regex, but your optimization will never be complete.
Are you referring to the programming language or PCRE? I would expect PCRE to be a simple superset of the regex grammar, and thus the optimization is complete and trivial: the appearance of any PCRE-only constructs rules out the DFA engine; otherwise, use the DFA.
I don’t see how you could ever optimize out e.g. a backreference (converting to some equivalent regex) without already having parsed the subject text and determined it to be unnecessary. In which case, I don’t see how the inability to determine regularity of a context-free grammar is relevant (if you’re referring to the context-free nature of PCRE’s grammar, it’s not clear to me why you should care whether the regex input grammar itself is regular/context-free, and why you’d want to convert PCRE to a regular grammar; optimizing a regex search shouldn’t care about the grammar defining the search).
Backrefs are nonregular. I believe lookaround by itself is not a problem -- it's essentially intersection, which is regular (though often omitted from regex engines). Yeah, here's a reference: https://cs.stackexchange.com/questions/2557/how-to-simulate-...
Replying to my own comment to clarify: it is true that you can't express a non-regular language with regular expressions. What I meant is, it is likely we can combine automata and backtracking based techniques to get the best of both worlds. In other words, an automata based engine doesn't necessarily mean you can't also have backreferences and still have efficiency gains.
The proof that finite automata can’t implement backreferences is actually extremely simple: you need to remember all the text matched as part of any given capture group in order to match it again as a backreference. This requires unbounded memory.
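As a concrete example (my own, not from the thread): the pattern ^(a*)b\1$ matches exactly the strings of the form a^n b a^n. To accept those, the matcher has to remember n exactly, and no fixed, finite set of states can do that for arbitrary n; that is the standard pumping-lemma argument showing the language is non-regular.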
"Nondeterministic" in automata is a very specialized word that doesn't refer to anything like a random process in a stochastic universe.
It's a representation for state machines in which state transitions are apparently ambiguous: a given state in the NFA can, for exactly the same input, go into multiple other states. This is a very convenient representation for pattern matching.
The way the ambiguity resolves itself is not through randomness, but by the understanding that the machine is in all of the possible states at the same time. Then, in a concrete implementation in a programming language, we represent that situation by using a set of states as the run-time state.
E.g. when the input symbol b is received, some NFA machine goes from state { 0, 3, 5 } to { 1, 3, 7, 8 }; i.e. it is in NFA states 0, 3 and 5, which transition to 1, 3, 7, 8 when b is seen.
NFA graphs can be executed directly by an NFA simulator which calculates these transitions on the fly.
We can also statically figure out all the state sets there can be and their transitions. Then we re-label these sets as the simple states of a deterministic automaton. E.g. { 0, 3, 5 } is dubbed S0, { 1, 3, 7, 8 } is dubbed S1. There is a transition from S0 to S1 on b. We can work through all the possible transitions, ferret out all the subsets and their transitions, and build a simple machine. That is the "subset construction".
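In code the subset construction is short. Here's a generic sketch (not from any particular engine, and skipping epsilon transitions) using the same example states:

#include <cstdio>
#include <map>
#include <set>
#include <vector>

// NFA transitions: state -> (symbol -> set of next states).
using NfaState = int;
using StateSet = std::set<NfaState>;
using Nfa = std::map<NfaState, std::map<char, StateSet>>;

// Subset construction: every *set* of NFA states reachable from the start set
// becomes a single state of the DFA.
std::map<StateSet, std::map<char, StateSet>>
subset_construction(const Nfa& nfa, const StateSet& start) {
    std::map<StateSet, std::map<char, StateSet>> dfa;
    std::vector<StateSet> work{start};
    while (!work.empty()) {
        StateSet cur = work.back();
        work.pop_back();
        if (dfa.count(cur)) continue;               // already expanded
        auto& row = dfa[cur];
        for (NfaState s : cur) {
            auto it = nfa.find(s);
            if (it == nfa.end()) continue;          // state with no outgoing edges
            for (const auto& [sym, nexts] : it->second)
                row[sym].insert(nexts.begin(), nexts.end());
        }
        for (const auto& [sym, nexts] : row) work.push_back(nexts);
    }
    return dfa;
}

int main() {
    // On 'b': state 0 -> {1, 7}, state 3 -> {3, 8}, state 5 -> {1},
    // so the start set {0, 3, 5} goes to {1, 3, 7, 8}, as in the example above.
    Nfa nfa = {{0, {{'b', {1, 7}}}}, {3, {{'b', {3, 8}}}}, {5, {{'b', {1}}}}};
    auto dfa = subset_construction(nfa, {0, 3, 5});
    std::printf("DFA has %zu states\n", dfa.size());  // {0,3,5}, {1,3,7,8}, {3,8}
    return 0;
}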
Yes for regular languages, but not for higher-level ones. For example, the deterministic context-free languages are a proper subset of the context-free languages. For Turing machines, the question is less about the ability to compute and more about the speed at which something can be computed; whether nondeterminism buys you speed there is the unsolved problem known as P vs. NP.
Said another way: a state machine. True regular languages can be recognized by a simple state machine, which is much faster than the extended regular-expression-like language that Perl exposes.
Would be awesome if they used a proper open-source software stack, like OpenCL, not that NVIDIA proprietary crap. Using anything from NVIDIA harms open source, because until they go bankrupt there won't be FOSS drivers and software for their GPUs.
CUDA is not crap. At the time this paper was written it had been out for a number of years, had good multi-platform support, and had good support for NVIDIA GPUs.
If they had used OpenCL they might have been using an immature product, one that may really have been Apple-only at the time (or Apple and IBM?). I don't know the history, but 8-9 years ago (when the research work may have started) NVIDIA was pushing the CUDA idea almost entirely alone, with AMD still playing catch-up.
There are open-source graphics drivers for NVIDIA hardware, like 'nouveau', which in my experience are much better than the proprietary NVIDIA graphics drivers on Linux. Otherwise, I believe CUDA exists and remains proprietary because the ISA changes with each GPU generation, and because NVIDIA has a habit of patching graphics bugs outside its API for gaming customers, both of which lead to tight coupling between client libraries and the underlying hardware.
I know about nouveau and respect its developers so much. Trying to make it work despite active resistance from NVIDIA itself requires some patience and rigor. Their stance on providing firmware to the project is ridiculous. I believe NVIDIA invented the whole firmware-signing feature just to undermine nouveau's efforts, using security merely as a pretext (as many vendors do).
If you hate NVIDIA so much, why do you want open-source drivers for their crap products? You might be an OSS purist, but most people are trying to solve problems, and CUDA is apparently more useful than OpenCL.
Nothing is stopping you from enhancing OpenCL as a language, building useful libraries, writing documentation, holding conferences to build an ecosystem around OpenCL, and evangelizing the technology, just like NVIDIA did for CUDA.
If anyone is slightly interested in a different approach:
https://github.com/vqd8a/iNFAnt2. I think it's designed around executing incredibly large regex sets.
I remember that several years ago I saw this IOCCC-winning entry on regex and thought, "maybe we should write a parallel version." It's good to see someone has implemented it now!
https://www.ioccc.org/2012/hou/hint.html