CUDA Grep (cmu.edu)
160 points by usgroup on June 27, 2019 | 77 comments



They ignored a key detail in their benchmark: they did not include the time it takes to transfer the input file from host memory to the GPU (the GPU cores can only operate on graphics memory). This is easily going to be the most time-consuming part. They also didn't include any I/O operations in their timing, so instead of a 10x speedup, it is more likely to be a couple of percent faster, as grep is going to be I/O bound. This is a neat project, but there is a reason we are not using the GPU for regex matching.


One of the authors here:

Yes, you are right, you should not replace grep for one-off regexes due to the latency of the I/O operations. As far as I recall, memory throughput, however, is much higher on GPUs than on CPUs (or at least this was true in 2012). Our intended use cases for this sort of a grep were cases where you are finding many needles in enormous haystacks, such as looking for malicious machine code in executables that you download off of the internet (virus scanning), or searching for certain genetic sequences in DNA. I believe we discussed this in our final presentation, but perhaps we left it out of the written report.

Yes we probably should have included the numbers to show why you wouldn't want to use this instead of grep for simple tasks. I am sorry we missed it.


But getting any of that data off disk will typically be the bottleneck.

(Also, you'd want BLAST for DNA instead of grep, but perhaps this was an intentional simplification)


well if you're looking for certain motifs or k-mers (https://en.wikipedia.org/wiki/K-mer), it's mostly scanning through pre-processed FASTA files.

The last k-mer counter I wrote was CPU/memory bound (big enough strings mean big hash tables to store everything in, or something like a Bloom filter, which is Jellyfish's approach), not IO bound.

Anyway, for anyone who wants to write something fun, fasta parsers / k-mer counters are a good weekend project.

here was my work in the area (shameless plug, though i'm not in the field anymore)

https://github.com/mutantturkey/dna-utils.

Just tested again against a ~400MB genome FASTA file from NCBI and it's still clearly a CPU/memory issue with huge hash tables. See below (4^k is the number of potential k-mers). K=9 is done with a 4^9 array (sane in memory); K=12 is a sparse array (stored as a std::unordered_map, because it's too large to allocate that much space in memory). One is obviously way slower:

    calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 9 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null
    real    0m4.235s
    user    0m4.116s
    sys     0m0.108s

    calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 12 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null
    real    0m25.524s
    user    0m25.292s
    sys     0m0.152s
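To make the dense-array approach above concrete, here is a rough host-side sketch (not the dna-utils code; the function and variable names are invented). Each base is packed into 2 bits, so every k-mer indexes directly into a table of 4^k counters; that table is tiny for k=9 but grows quickly with k, which is where sparse structures such as hash maps or Bloom filters come in.

    // Rough sketch of a dense-array k-mer counter (illustrative only, not dna-utils).
    // Each base packs into 2 bits, so a k-mer is an index into a table of 4^k counters.
    #include <cstdint>
    #include <string>
    #include <vector>

    static inline int base2bits(char c) {
        switch (c) {
            case 'A': case 'a': return 0;
            case 'C': case 'c': return 1;
            case 'G': case 'g': return 2;
            case 'T': case 't': return 3;
            default:            return -1;  // N or other ambiguity code
        }
    }

    std::vector<uint64_t> count_kmers(const std::string &seq, int k) {
        std::vector<uint64_t> counts(1ULL << (2 * k), 0);  // 4^k slots
        const uint64_t mask = (1ULL << (2 * k)) - 1;
        uint64_t kmer = 0;
        int valid = 0;                                      // bases since last ambiguity
        for (char c : seq) {
            int b = base2bits(c);
            if (b < 0) { valid = 0; kmer = 0; continue; }   // restart after an 'N'
            kmer = ((kmer << 2) | (uint64_t)b) & mask;      // rolling 2-bit window
            if (++valid >= k) ++counts[kmer];
        }
        return counts;
    }

Swapping the dense vector for a hash map once 4^k gets too large to allocate is the usual move, at a noticeable per-lookup cost, which matches the k=9 vs. k=12 timings above.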


I'm quite certain Windows Defender uses (or used, at some point) GPUs in the machine to accelerate malware pattern matching. I remember a coworker talking about a funny bug as a consequence when the graphics driver executable itself was infected.


There's an article on the announcement last year: https://arstechnica.com/gadgets/2018/04/intel-microsoft-to-u...


That is for their enterprise "Microsoft Defender Advanced Threat Protection" endpoint security product, which scans live memory for hidden malware that never writes to disk. That would otherwise consume far more CPU than the typical default Windows security software, which is why they needed the GPU in the first place. But it's an interesting solution nonetheless.

https://www.microsoft.com/en-us/microsoft-365/windows/micros...


I’ve run into this for simple tasks on a GPU, like merging sorted lists. The massive parallelism can’t make up for the transfer costs unless the operations performed are expensive enough. The “roofline” model is usually the perspective used for this kind of accelerator application.


Reminded me of a 12 year old paper on file carving by Marziale, Richard and Roussev[0].

[0] https://www.dfrws.org/sites/default/files/session-files/pres...


In nearly all the domains I've worked in, discrete packets of data will use regexp directly.

If we are using something more 'grep'-like, it's for large quantities of data, and in many cases we could treat these as a stream. So as long as the bandwidth to the GPU is sufficient, a little stall at the beginning wouldn't represent an undue hardship if the results come back faster.


Some GPUs have memory integrated with the CPU and don't need to do the transfer; it could still speed things up there, couldn't it?

Also, they didn't ignore the key detail, they mention it: "This is without taking into account the overhead of transferring data to the GPU."


Yeah, but integrated GPUs wouldn't be usable with CUDA.


I have not been able to inspect the code (dead links), but from the description of their approach, they do not depend on anything particularly fancy from CUDA. An OpenCL implementation seems just as feasible.


Nvidia makes several with integrated memory. Nintendo Switch is based on one etc.


Perhaps on PCs, but plenty of mobile and consoles use a shared memory architecture.


The tegra products are all integrated.


Oh good, I'm glad we did mention this!


> They did not include the time it took to transfer the input file from memory to the GPU

Fair point, it's more support for this HN post[2] making the case for memory-mapping files to GPU.

> This is a neat project, but there is a reason we are not using the GPU for regex matching.

I think it's more interesting that it's a general facility for running automata on a GPU, as it opens up the universe of things that can be accelerated by a GPU. For instance, Numba[1] is a library for running arbitrary math code, and it has a feature for generating code to run on the GPU.

[1]: http://numba.pydata.org/numba-doc/latest/cuda/index.html

[2]: https://news.ycombinator.com/item?id=11892030


CUDA already has this built in (for many years) using UVM. That won't take away the bottleneck of the transfer, but it does make it transparent to the user by handling all the paging/copies.
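For illustration, a minimal sketch of what that looks like with managed (unified) memory; the kernel here is a made-up placeholder, and whether pages actually migrate on demand or get bulk-copied depends on the GPU generation and platform:

    // Sketch of CUDA unified (managed) memory: one allocation visible to both
    // host and device; the driver handles the paging/copies behind the scenes.
    #include <cuda_runtime.h>

    __global__ void count_matches(const char *buf, int n, int *hits) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && buf[i] == 'a') atomicAdd(hits, 1);  // placeholder "match"
    }

    int main() {
        const int n = 1 << 20;
        char *buf;
        int *hits;
        cudaMallocManaged(&buf, n);            // accessible from CPU and GPU
        cudaMallocManaged(&hits, sizeof(int));
        // ... fill buf on the host, e.g. by reading a file into it ...
        *hits = 0;
        count_matches<<<(n + 255) / 256, 256>>>(buf, n, hits);
        cudaDeviceSynchronize();               // host can read *hits directly
        cudaFree(buf);
        cudaFree(hits);
        return 0;
    }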


I thought that just made the GPU's memory map match the CPU's, and you still need the usual VM machinery (like paging).


Totally disagree. Moving memory to the GPU can be overlapped with the computation. For larger files that can fully saturate the GPU's bandwidth, it's completely negligible. "Easily the most time consuming part"? No. Not even close.

Also, IO bound? Again, no. My consumer MBP can read at 3.2GB/s. Good luck doing regex on that in real time on a CPU. This is a CPU-bound operation for a typical high-end configuration.

Does this make sense for a 1MB file? Probably not. But for anything well beyond that, it will be way faster.

Here's a great talk about how you should think about GPU performance: https://www.nvidia.com/content/GTC-2010/pdfs/2238_GTC2010.pd...


> They did not include the time it took to transfer the input file from memory to the GPU

To give some context: PCIe 3.0 x16 achieves about 11-12 GB/s. So transfers for the benchmarked 53 MB file would add ~4.8 ms (per direction, assuming pinned host memory).

Now, they could also pursue a streaming approach, processing in "micro-batches", i.e., pipelining input transfers (HOST->GPU), GPU-based processing, and transfer of results (GPU->HOST), as pursued by [1] (disclaimer: I'm a co-author). This lowers processing latency. Since the interconnect is full-duplex, it would just limit the processing rate to about ~11 GB/s (i.e., 5 ms for the mentioned 53 MB). To be fair, though, the input has to be sufficiently large that (a) there are enough micro-batches to overlap transfer and compute and (b) the micro-batches can still be chosen large enough to exploit the full parallelism of the GPU.

[1] https://arxiv.org/pdf/1905.13415.pdf
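For anyone curious what that micro-batched pipelining looks like in practice, here's a minimal sketch against the CUDA runtime. Everything is illustrative: match_kernel is a stand-in for a real matching kernel, the chunk size is arbitrary, error checking is omitted, the host buffers would need to be pinned (cudaHostAlloc) for the async copies to truly overlap, and a real matcher would also have to handle patterns straddling chunk boundaries (e.g. by overlapping chunks by the maximum match length).

    #include <cuda_runtime.h>
    #include <algorithm>
    #include <cstddef>

    constexpr int    NSTREAMS = 4;
    constexpr size_t CHUNK    = 8 << 20;   // 8 MiB micro-batches (arbitrary choice)

    // Placeholder "matcher": marks positions where the byte equals 'a'.
    // A real engine would run the NFA/DFA transition table here instead.
    __global__ void match_kernel(const char *text, int n, unsigned char *hits) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) hits[i] = (text[i] == 'a');
    }

    void scan_stream(const char *h_text, unsigned char *h_hits, size_t total) {
        cudaStream_t streams[NSTREAMS];
        char *d_text[NSTREAMS];
        unsigned char *d_hits[NSTREAMS];
        for (int i = 0; i < NSTREAMS; ++i) {
            cudaStreamCreate(&streams[i]);
            cudaMalloc(&d_text[i], CHUNK);
            cudaMalloc(&d_hits[i], CHUNK);
        }
        // Round-robin chunks over the streams so copy-in, compute and copy-out
        // of different chunks overlap on the full-duplex PCIe link.
        for (size_t off = 0, batch = 0; off < total; off += CHUNK, ++batch) {
            int s = batch % NSTREAMS;
            int n = (int)std::min(CHUNK, total - off);
            cudaMemcpyAsync(d_text[s], h_text + off, n,
                            cudaMemcpyHostToDevice, streams[s]);
            match_kernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_text[s], n, d_hits[s]);
            cudaMemcpyAsync(h_hits + off, d_hits[s], n,
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();
        for (int i = 0; i < NSTREAMS; ++i) {
            cudaFree(d_text[i]);
            cudaFree(d_hits[i]);
            cudaStreamDestroy(streams[i]);
        }
    }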


Or you can flip your assumptions and run 10X faster... everywhere. It was a tough investment, but we went GPU-first via Apache Arrow & Nvidia RAPIDS. (In this case, I'd look at RAPIDS / NVStrings before this.)


One of the authors here: Happy to answer any questions. Keep in mind this was a school project from 2012 (Kayvon's 15-418 at CMU, a wonderful class), so it's been a while.


Is the code available? The links are dead.


Hmm I'll take a look at that. Yes code is here: https://github.com/bkase/CUDA-grep


Can you run your benchmarks against ripgrep and report back?


Unfortunately, I don't actually have access to NVidia hardware anymore -- but I would be happy to accept a PR from someone who can get this to run!


Thanks for the reply. I'm also without nVidia hardware so I can't do it either.


Did grep treat the input file as ASCII or UTF8?

I remember another story where someone claimed to be faster than grep. Turned out the comparison was unfair because of this distinction. I believe LANG=C is enough to put grep into ASCII mode.


The references on the project site lead to 404s. Are they CMU-internal links?


Which links specifically are broken?


Both links in the "REFERENCES" section.


I'm surprised to see no comparison with ripgrep[0] or The Silver Searcher[1], both of which claim to be faster alternatives to grep.

[0] https://github.com/BurntSushi/ripgrep [1] https://github.com/ggreer/the_silver_searcher


Perhaps more appropriate would be Hyperscan: https://www.hyperscan.io/

Even though it's now an Intel project targeting CPUs, it grew out of a startup trying to run regex matching on parallel hardware.


FWIW, I ported Hyperscan to ARM using SIMDe.


"this was a school project from 2012" as one of the authors here already commented before your comment.

The first commit to ag - README.md - was in Nov. 2011; https://github.com/ggreer/the_silver_searcher/commit/0121e6b... . The earliest versioned commit tag was 0.3 in March 2012 - https://github.com/ggreer/the_silver_searcher/tree/0.3 . I think that lack of comparison is okay.

The first commit to ripgrep was in 2016. https://github.com/BurntSushi/ripgrep/commit/9d1e619ff359b6e...


I missed that comment. Of course then it's unfair for me to expect a comparison. That said, I would still be interested to see one now :)


I too am curious. Neither of these tools makes use of the GPU, unless I'm mistaken, so this would be an interesting comparison.


One reason that Perl regex is relatively slow is that it isn't actually regular: it can't be implemented using DFAs or NFAs.


see also, Perl Regular Expression Matching is NP-Hard: https://perl.plover.com/NPC/


Sure, but it must be relatively simple to identify, pre-compilation, whether the regex is regular or not; there are only certain constructions that break regularity. Why couldn't Perl's engine fall back to a more optimized regular-language-only implementation for those cases, if it has such a big impact on performance?


It can, and it's one of my explicit goals to do so. But the biggest problem is perl's horrible code in regcomp and regexec (based on Henry Spencer's old code). It mostly does state transitions via goto or setjmp/longjmp. The initial regex scanner did 2 passes, so it would have been trivial to detect NFA-illegal backtracking constructs with zero cost. With a cleanly written codebase it would be much easier. My re::engine::PCRE2 replacement is much faster, can do everything, and even has fewer bugs than perl5's regex updates. But it's still not production ready, and a full replacement would still be huge work. For now it's just via hooks, which are slow, because most of the work is done twice.
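As a toy illustration of that kind of fallback (this is nothing like perl's actual regcomp; the checks are deliberately crude and a robust version would need a real pattern parser), the dispatch itself can be a cheap pre-compilation scan over the pattern:

    // Illustrative-only dispatcher: if a pattern uses constructs that make it
    // non-regular (backreferences) or that many DFA engines don't support
    // (look-around), fall back to a backtracking engine; otherwise use a DFA.
    #include <string>

    enum class Engine { Automaton, Backtracking };

    Engine choose_engine(const std::string &pattern) {
        for (size_t i = 0; i + 1 < pattern.size(); ++i) {
            if (pattern[i] == '\\' && pattern[i + 1] >= '1' && pattern[i + 1] <= '9')
                return Engine::Backtracking;          // backreference: \1 .. \9
            if (pattern[i] == '(' && pattern[i + 1] == '?') {
                char c = (i + 2 < pattern.size()) ? pattern[i + 2] : '\0';
                if (c == '=' || c == '!' || c == '<')
                    return Engine::Backtracking;      // look-ahead / look-behind
            }
        }
        return Engine::Automaton;                     // plain regular pattern
    }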


Theoretically, a regular expression can be expressed in constant space, and that's profoundly limiting, so anything that requires so much as a stack is not a regex.

That includes basic things like capture groups. That doesn't mean you can't use automata in a regex-plus implementation, though.

I can't find the paper, but I think there is a hybrid engine that tries to get the best of both worlds; constant space where possible and using additional space to provide more features.


You can implement capture groups with no problem using automata (see re2).

You might have meant to say that you can't use backreferences.


Here is some further reading on the topic that I found useful.

https://swtch.com/~rsc/regexp/regexp1.html


It's also possible that they might be including perl's startup time in that measurement? Their comments about 'parallelising' 'by using ampersand' make me concerned about the methodology involved.


DFA/NFAs?


Deterministic Finite Automaton. It's a concept from automata theory, which is a branch of the theory of computation. Libraries like Google's re2 and Go's regexp package are implemented as DFAs. They're not as powerful as regex libs built with a non-regular language model, but that's by design.

Fun trivia: DFAs and NFAs are computationally equivalent. Any NFA can be converted into an equivalent DFA.

https://en.wikipedia.org/wiki/Deterministic_finite_automaton


    > ...not as powerful as regex libs built with a non-regular language...
I've wondered about that, but I can't think of a realistic example where that would really matter.

Is there something that perl regex's can do, in non-crazy everyday use-cases, that can't be done by something like re2?


parsing any context-free grammar[1] is provably impossible using pure regular expressions. but add backtracking support and you end up with one of my favorite stackoverflow answers[2] of all time. (bonus: there's another legendary answer at the top of that question.)

1: https://en.wikipedia.org/wiki/Chomsky_hierarchy

2: https://stackoverflow.com/a/5233151/3961271


Backreferences and look around.

I don't think it's proven that they can't be added on to an automata based regular expression engine. Nobody has figured it out yet though.


Actually, it's been proved for... longer than Perl has existed?

The various features in Perl allow you to emulate context-free languages. There is a proof somewhere, but it's trivial to see you can use backreferences to parse languages with well-nested parens.

Languages of well-nested parens are known to be non-regular, and thus impossible to express with regular expressions.

Also note that knowing if an arbitrary context-free language is regular is undecidable. That doesn't mean you can't try to optimize down some PERL grammar to a regex, but your optimization will never be complete.


>Also note that knowing if an arbitrary context-free language is regular is undecidable. That doesn't mean you can't try to optimize down some PERL grammar to a regex, but your optimization will never be complete.

Are you referring to the programming language or PCRE? I would expect PCRE to be a simple superset of the regex grammar, and thus the optimization is complete and trivial: the appearance of any PCRE-only constructs denies the use of the DFA engine; otherwise, use the DFA.

I don't see how you could ever optimize out e.g. a backreference (converting to some equivalent regex) without already having parsed the subject text and determined it to be unnecessary. In which case, I don't see how the inability to determine regularity of a context-free grammar is relevant (if you're referring to the context-free nature of PCRE's grammar, it's not clear to me why you should care whether the regex input grammar itself is regular/context-free, and why you'd want to convert PCRE to a regular grammar; optimizing a regex search shouldn't care about the grammar defining the search).


Backrefs are nonregular. I believe lookaround by itself is not a problem -- it's essentially intersection, which is regular (though often omitted from regex engines). Yeah, here's a reference: https://cs.stackexchange.com/questions/2557/how-to-simulate-...


Replying to my own comment to clarify: it is true that you can't express a non-regular language with regular expressions. What I meant is, it is likely we can combine automata and backtracking based techniques to get the best of both worlds. In other words, an automata based engine doesn't necessarily mean you can't also have backreferences and still have efficiency gains.


The proof that finite automata can't implement backreferences is actually extremely simple: you need to remember all the text matched by a given capture group in order to match it again as a backreference. This requires unbounded memory.


They are powerful in the sense that they are specialized to a task and perform it much better.


I always found the algorithm to convert a NFA to an equivalent DFA to be pretty cool.


Hold up. I've seen this conversation play out probably half a dozen times over the last few years and it just occurred to me...

If you can convert a nondeterministic automata to a deterministic one, doesn't that mean they are all deterministic?


"Nondeterministic" in automata is a very specialized word that doesn't refer to anything like a random process in a stochastic universe.

It's a representation for state machines in which state transitions are apparently ambiguous: a given state in the NFA can, for exactly the same input, go into multiple other states. This is a very convenient representation for pattern matching.

The way the ambiguity resolves itself is not through randomness, but by the understanding that the machine is in all of the possible states at the same time. Then, in a concrete implementation in a programming language, we represent that situation by using a set of states as the run-time state.

E.g. when the input symbol b is received, some NFA machine goes from state { 0, 3, 5 } to { 1, 3, 7, 8 }; i.e. it is in NFA states 0, 3 and 5, which transition to 1, 3, 7, 8 when b is seen.

NFA graphs can be executed directly by an NFA simulator which calculates these transitions on the fly.

We can also statically figure out all the state sets there can be and their transitions. Then we re-label these sets as the simple states of a deterministic automaton. E.g. { 0, 3, 5 } is dubbed S0, { 1, 3, 7, 8 } is dubbed S1. There is a transition from S0 to S1 on b. We can work through all the possible transitions, ferret out all the subsets and their transitions, and build a simple machine. That is the "subset construction".
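In code, that on-the-fly simulation is just set arithmetic over states. A minimal sketch (the state numbering and transition-table representation are made up for illustration, and epsilon transitions are left out for brevity):

    // Minimal NFA simulator over byte input: the machine's "current state" is a
    // set of NFA states, and each input symbol maps that set to a new set.
    // (A real simulator would also take the epsilon-closure after every step.)
    #include <map>
    #include <set>
    #include <string>
    #include <utility>

    using State = int;
    using StateSet = std::set<State>;
    // transition[{state, symbol}] -> set of possible next states
    using Nfa = std::map<std::pair<State, char>, StateSet>;

    StateSet step(const Nfa &nfa, const StateSet &current, char symbol) {
        StateSet next;
        for (State s : current) {
            auto it = nfa.find({s, symbol});
            if (it != nfa.end())
                next.insert(it->second.begin(), it->second.end());
        }
        return next;
    }

    bool accepts(const Nfa &nfa, const StateSet &start, const StateSet &accepting,
                 const std::string &input) {
        StateSet current = start;
        for (char c : input)
            current = step(nfa, current, c);
        for (State s : current)
            if (accepting.count(s)) return true;
        return false;
    }

The subset construction simply runs this "step" ahead of time for every reachable state set and every input symbol, names each distinct set as one DFA state, and thereby pays construction time and memory up front in exchange for constant work per input symbol at match time.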


Yes for regular languages, but not for all higher levels: for example, deterministic context-free languages are a strict subset of the context-free languages. For Turing machines, nondeterminism doesn't change what can be computed, only (possibly) how fast; whether nondeterministic polynomial time buys anything over deterministic polynomial time is the unsolved P vs. NP problem.

https://en.wikipedia.org/wiki/P_versus_NP_problem


https://en.wikipedia.org/wiki/Deterministic_finite_automaton https://en.wikipedia.org/wiki/Nondeterministic_finite_automa...

Said another way: a state machine. True regular languages can be recognized by a simple state machine, which is much faster to run than the extended regular-expression-like language that Perl exposes.


Would be awesome if they used a proper open-source software stack, like OpenCL, not that NVIDIA proprietary crap. Using anything from NVIDIA harms open source, because until they go bankrupt there won't be FOSS drivers and software for their GPUs.


CUDA is not crap. At the time this paper was written it had been out for a number of years, had good multi-platform support, and had good support for NVIDIA GPUs.

If they had used OpenCL they might have been using an immature product that might really have been Apple-only at the time (or Apple and IBM?). I don't know the history, but 8-9 years ago (when the research work may have started) NVIDIA was pushing the CUDA idea almost entirely alone, with AMD still playing catch-up.


OCL was pretty mature then. CUDA is not crap but it is lock-in and IMO its best features were captured by OCL.


There are open-source graphics drivers for NVIDIA hardware, like nouveau, which in my experience are much better than the proprietary NVIDIA graphics drivers on Linux. Otherwise, I believe CUDA exists and remains proprietary because ISAs change from one GPU generation to the next, and because NVIDIA has a habit of patching graphics bugs outside its API for gaming customers, both of which lead to tight coupling between client libraries and the underlying hardware.


I know about nouveau and respect its developers so much. Trying to make it work despite active resistance from NVIDIA itself requires some patience and rigor. The stance on providing firmware to the project is ridiculous. I believe NVIDIA invented the whole signing feature just to undermine nouveau's efforts, using security merely as a pretext (as many vendors do).


If you hate nvidia so much, why do you want open source drivers for their crap products? You might be an OSS purist, but most people are trying to solve problems, and cuda is apparently more useful than opencl.

Nothing is stopping you from enhancing opencl as a language, building useful libraries, writing documentation, holding conferences to build an ecosystem around opencl, and evangelizing the technology, just like nvidia did for cuda.


I already work on open source a lot, just in other areas.


Interesting... It's like time is a limited resource. You just answered your question as to why people use CUDA.

All the popular frameworks run on CUDA, and from what I hear openCL sucks to use.


If anyone is interested in a different approach: https://github.com/vqd8a/iNFAnt2. I think it's designed around executing incredibly large regex sets.



I remember several years ago I saw this IOCCC-winning entry on regex and thought "maybe we should write a parallel version". It's good to see someone has implemented it now! https://www.ioccc.org/2012/hou/hint.html


Github link on this page is broken, otherwise this is looking awesome. I'm excited to try it out.


The correct link is https://github.com/bkase/CUDA-grep I think he's working on fixing the broken link.


None of the links work. Any source and log on say github?



