For another real-world example, see "Details of the Cloudflare outage on July 2, 2019" [1]
Python's `re` module now supports possessive quantifiers and atomic grouping (I wrote a blog post about them [2]). So, for example, `(a+|\w+)*+:` instead of `(a+|\w+)*:` avoids this kind of catastrophic behavior.
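A minimal sketch of the difference (needs Python 3.11+; the input string here is made up):

```python
import re

# Requires Python 3.11+, which added possessive quantifiers.
text = "a" * 40   # hypothetical input: all a's, no ':' to match

# The possessive *+ never gives back what it consumed, so the failed
# match is reported immediately.
print(re.match(r'(a+|\w+)*+:', text))   # -> None, effectively instant

# The plain greedy version backtracks through exponentially many ways
# to split the a's between a+ and \w+ before giving up:
# re.match(r'(a+|\w+)*:', text)         # -> None too, but can take ages
```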
While atomic grouping can prevent this from happening again once you've found the problem, the real solution is making sure there's a timeout on your web process. That way, even if you hit catastrophic backtracking, your servers won't become unavailable unless someone is deliberately ReDoS-ing them.
The article starts with the famous quote by Jamie Zawinski:
> Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.
It then goes on to discuss the perils of regular expressions when they encounter unanticipated data and run into catastrophic backtracking because of greedy expressions.
>... there's a bit more to the story than that, as evidenced by Jeffrey Friedl's exhaustive research on the Zawinski quote. Zawinski himself commented on it. Analyzing the full text of Jamie's posts in the original 1997 thread, we find the following:
>Perl's nature encourages the use of regular expressions almost to the exclusion of all other techniques; they are far and away the most "obvious" (at least, to people who don't know any better) way to get from point A to point B.
>The first quote is too glib to be taken seriously. But this, I completely agree with. Here's the point Jamie was trying to make: not that regular expressions are evil, per se, but that overuse of regular expressions is evil.
>Perl's nature encourages the use of regular expressions almost to the exclusion of all other techniques
I like Perl, but that is true. For example, despite its pretty rich set of built-ins, there's no top-level string-replace function; you have to roll your own with `index` and `substr`.
- Releasing/acquiring the GIL isn't cheap. So you'd need to know in advance whether it's worth it or not.
- You need the GIL acquired to instantiate the resulting "match data", captures, etc.
Stuff like re.sub frequently calls into Python code as well, so it would seem to be rather hairy to get rid of the GIL for this. It could of course add the usual workarounds of cooperative multitasking, like yielding (releasing the GIL) from time to time, just like the main interpreter loop does (sys.setswitchinterval).
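For instance, a quick illustration (example strings made up) of why re.sub is hairy: a callable replacement forces the engine back into Python for every match:

```python
import re

# The replacement here is a Python callable, so the C matching loop
# must re-enter the interpreter once per match; the GIL has to be
# held at each of those points.
text = "spam 1, spam 2, spam 3"
print(re.sub(r"\d+", lambda m: str(int(m.group()) * 10), text))
# -> spam 10, spam 20, spam 30
```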
assuming it's legal to do so, GIL release seems like a reasonable flag to add to possibly-expensive functions like `re.*()`.
... or even better, have `re.*` track its own performance and release the GIL if it forecasts that a call might statistically use a lot more computation... a trivial strawman algorithm: if len(input) > MAX, release the GIL first.
Ah, the /"([^"]*)"/ trick. It was one of those small, useful things that are obvious in retrospect but not that easy to invent on your own. I was taught it at my first job, along with others such as "if in a unit test you need a future date, use year 3001".
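For anyone unfamiliar, a small illustration of the trick (example line made up):

```python
import re

# The negated class [^"]* can only consume non-quote characters, so
# each quoted field is matched in a single linear pass, with no way
# to backtrack across a closing quote.
line = 'name="alice" role="admin"'
print(re.findall(r'"([^"]*)"', line))   # -> ['alice', 'admin']
```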
Try `a="(.*?)\b"` next time, where `\b` means a word boundary such as the space in your example. Or use a more precise match for the capture group rather than a wildcard that matches ANYTHING.
The simple example is fixable by anchoring the inner greedy expression, i.e.
regex = re.compile(r'(^a+)+b')
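For reference, a runnable version of that fix:

```python
import re

regex = re.compile(r'(^a+)+b')
# The ^ pins the group to the start of the string, so the outer +
# can iterate at most once and only the single a+ run is explored.
print(regex.search('a' * 64))   # -> None, effectively instant
```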
Perhaps the more general problem was that "The regular expression that was causing the issue here was rather long and convoluted".
A good rule seems to be: never write long and convoluted regexes; instead, use a pipeline of simple-to-understand regexes that can be independently debugged (I don't know whether this causes performance slowdowns, however).
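A toy sketch of the pipeline idea (the stages and input are invented for illustration):

```python
import re

# Instead of one convoluted pattern, do it in stages that can each
# be tested and debugged on their own.
line = 'GET /index.html?user=alice&lang=en HTTP/1.1'

# Stage 1: isolate the query string.
m = re.search(r'\?(\S+)\s', line)
query = m.group(1) if m else ''

# Stage 2: parse key=value pairs out of stage 1's output.
print(re.findall(r'(\w+)=(\w+)', query))
# -> [('user', 'alice'), ('lang', 'en')]
```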
Why would you leave the second + intact, when the ^ prevents it from matching 2 or more times? And what if you want to match the substring in "Faaaaabulous"?
Did you see the even simpler fix in TFA that doesn't break mid-string matches?
Need I point out that one can avoid both the issues highlighted here by just using Tcl?
Regexps in Tcl use a non-backtracking implementation that handles such cases with minimal slowdown, e.g. on the example shown, with both 16- and 32-character inputs.
The real problem is using multithreading in CPython for any set of tasks that has a nonzero chance to block one of the tasks due to a CPU-bound issue. The rule of thumb here - and I learned that the hard way - is to use async/await for anything I/O bound, and multiprocessing for anything CPU-bound.
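A minimal sketch of that rule of thumb (pattern and input invented):

```python
import re
from concurrent.futures import ProcessPoolExecutor

# The regex runs in a separate process, so a pathological input can't
# stall the main interpreter's threads, and the result call gets a
# deadline instead of hanging forever.
def matches(text):
    return bool(re.search(r'(a+)+b', text))

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        future = pool.submit(matches, 'aaab')
        print(future.result(timeout=5))   # True; raises TimeoutError
                                          # instead of hanging on bad input
```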
Not really, the real problem is the exponential runtime of a badly-written regex. Multithreading in CPython is also a problem, but even if this was written in a performant language with robust multithreading support, this bug could still be amplified into a denial of service.
Why is the run time exponential in the length of the string of a's? It seems like it should be at most O(n^3).
2^n is the number of subsets of the string of a's with no contiguity requirements. Nothing like this number of possible matches should need to be checked. What am I missing?
>On my laptop it takes about 20ms to do the search above with 16 characters - but by doubling the input string size to 32 characters, it takes over 12 minutes to finish. Exponential running time really does suck.
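The same growth is easy to reproduce locally with the simplified pathological pattern from upthread (absolute times are machine-dependent; each extra 'a' roughly doubles the running time):

```python
import re
import time

pat = re.compile(r'(a+)+b')
for n in (16, 20, 24):
    s = 'a' * n            # all a's, no 'b': the search must fail
    t0 = time.perf_counter()
    pat.search(s)
    print(n, round(time.perf_counter() - t0, 4))
```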
I checked with 32 characters on regex101, and it says "Match 1 halted after 119986 step(s)" when you click the debug icon (click the message above the regex input box).
I tried with PHP 7.4.6 on my PC and the results were nearly instantaneous, even with the input string more than doubled to >32 chars. What's up with that?
It really feels like they buried the lede here, as almost all discussions about catastrophic backtracking seem to do. Here's the second-to-last paragraph:
> Alternatively you could switch to a regular expression engine that doesn’t exhibit this kind of behaviour. RE2 is an excellent library that uses a finite automata approach to avoid this type of catastrophic backtracking. Using this library on the same 32 character input reduces the running time from 12 minutes to basically instant. The downside being that RE2 doesn’t support all the functionality of the builtin re package.
Perl Compatible Regular Expressions (the flavor used by most engines, including Python's) _require_ this exponential worst-case behavior (they are implemented with backtracking). But the theory behind regexes does not require this, and if you eliminate just a couple of (infrequently used) PCRE features, you can have a regex engine that runs in linear time (with respect to the size of the input string). See [2], which is by the authors of RE2. The features that are incompatible with this sort of implementation (both shown in the sketch after this list) are:
1. Lookahead assertions.
2. Backreferences to matched groups.
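A quick illustration of both features (examples invented):

```python
import re

# 1. Lookahead: match "foo" only when "bar" follows, without
#    consuming it.
print(re.search(r'foo(?=bar)', 'foobar'))      # matches 'foo'

# 2. Backreference: \1 must repeat exactly what group 1 captured,
#    which is what pushes the language beyond finite automata.
print(re.search(r'(\w+) \1', 'hello hello'))   # matches 'hello hello'
```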
If you don't know what these are, or you rarely ever use them, then you really have no business using this kind of regex engine. (And if you are using them, then I'd argue you're shoving a bit too much logic into a regex, but that's beside the point). Nonetheless, every popular programming language (except Go[1]) has included an exponential backtracking regex implementation in their standard library, and exposed the entire industry to this stupidity, all for a few backreferences and lookahead assertions.
What's especially crazy is this: it's easy for a regex engine to detect the use of these constructs, because it has to parse the regex anyway! So it's feasible for the engine to optimistically try to use an efficient implementation, and only fall back to using a backtracking implementation when you're actually using these features. This is what the Python re2 module[3] does, it uses RE2 by default and supports falling back to the backtracking implementation if necessary.
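A sketch of what that looks like, assuming the pyre2 package (`pip install pyre2`), which exposes an `re`-compatible module backed by RE2 (fallback behavior depends on the package and version):

```python
import re2

# No backreferences or lookarounds here, so RE2 handles it in linear
# time; the same search that explodes under backtracking is instant.
pat = re2.compile(r'(a+)+b')
print(pat.search('a' * 64))   # -> None, effectively instant
```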
Instead, we're stuck reading the same postmortem every few years describing how "catastrophic backtracking" ruined another company/person's day, when the problem has been solved for decades, and language/library creators have just failed to include that solution.
> Nonetheless, every popular programming language (except Go[1])
... and Rust. Although Rust doesn't have a regex library in std, the official regex crate (I am its author) descends from RE2[1].
> So it's feasible for the engine to optimistically try to use an efficient implementation
Feasible, sure. But not easy or simple. If Python's re2 module uses RE2 by default and Python's regex engine when necessary, then it's very likely that the match offsets it reports are not coherent between the two engines. RE2 aims to be mostly consistent with things like PCRE, but there are cases where it isn't.
The problem is, building a production-grade FSM regex engine and building a production-grade backtracking regex engine are both huge engineering projects. And there are few, if any, people who know how to do both. Each has its own little specialties, and there isn't a ton of crossover among implementors of regex engines.
It's one of those ideas that sounds stupidly obvious on paper, but when you actually sit down to engineer the thing, it's quite a bit harder than you might imagine.
> (And if you are using them, then I'd argue you're shoving a bit too much logic into a regex, but that's beside the point).
This kind of ignores the fact that regexes are often the interface to something else. If you have a full blown programming language at your fingertips, it's usually not hard to work around the absence of look-around or backreferences by, say, using a second regex or something. But when all you have available to you is, say, a blacklist and you get to provide the regexes to that blacklist, then a more expressive regex syntax becomes a lot more useful. See for example people already running into problems with Discord's new feature: https://github.com/rust-lang/regex/issues/127#issuecomment-1...
Of course, as I noted, it would be very inappropriate for Discord to permit non-regular regexes here. It's borderline already by permitting any regexes at all, because even FSMs have their pathological cases, albeit not exponential ones. But if it's a tool you run on your local system? Sure, look-arounds become much more useful.
To be clear, I am pro-FSM and agree with you that they should really be the default regex engine everywhere. But instead, reality is indeed inverted, with backtracking being the default (almost) everywhere. But I wanted to pipe up about a few points that I think you might not have quite captured precisely.
Thanks! All good clarifications. You're definitely right on all counts, and it was more than a bit naive of me to suggest that language/library authors "simply" have two production grade regex implementations :)
>> (And if you are using them, then I'd argue you're shoving a bit too much logic into a regex, but that's beside the point).
> This kind of ignores the fact that regexes are often the interface to something else.
I will definitely admit to using the occasional assertion or backreference in grep or sed, so you make a good point about local tools too. I suppose I was thinking more in terms of programming APIs when I said this, so keeping the standard library FSM-based would still apply there. I'd imagine that special-purpose text-munging tools which care that much will likely have their own engines anyway.
Thanks also for your work on the regex crate. I'm still dipping my toes into Rust but I'm glad to know there's an FSM-based default regex library!
Indeed. It doesn't take long before look-around inside of a grep command ends up being quite convenient, sometimes even saving yourself from a much longer shell pipeline. It's why I ultimately relented and added -P/--pcre2 to ripgrep. (Its default regex engine is the Rust regex crate.)
I'm not a Tcl expert, but now that I think about it, I'm not sure you're correct. My recollection is that Tcl is one of the few to have a true hybrid engine. So if you use look-around or backreferences, then Tcl could provide an exponential runtime. If that's true, then Tcl would be included in "Nonetheless, every popular programming language (except Go[1]) has included an exponential backtracking regex implementation in their standard library," with perhaps a caveat or special mention. But it's still a categorically different guarantee than what, say, RE2 provides.
[1] https://blog.cloudflare.com/details-of-the-cloudflare-outage...
[2] https://learnbyexample.github.io/python-regex-possessive-qua...