Python, Catastrophic Regular Expressions and the GIL (2013)

asicsp · on Nov 14, 2022

For another real-world example, see "Details of the Cloudflare outage on July 2, 2019" [1]

Python `re` module now supports possessive quantifiers and atomic grouping (I wrote a blog post here [2]). So, for example, `(a+|\w+)*+:` instead of `(a+|\w+)*:` will avoid such catastrophic behavior.

[1] https://blog.cloudflare.com/details-of-the-cloudflare-outage...

[2] https://learnbyexample.github.io/python-regex-possessive-qua...

tyingq · on Nov 14, 2022

This paper goes over the Cloudflare outage, and other runaway regex CVEs with MediaWiki, Stackoverflow, and the Atom editor:

https://vtechworks.lib.vt.edu/bitstream/handle/10919/98593/D...

Alex3917 · on Nov 14, 2022

While atomic grouping can prevent this from happening again once you've already found the problem, the real solution is just making sure there is a timeout on your web process. That way even if you hit catastrophic backtracking, your servers aren't going to be unavailable unless someone is purposely reDOSing them.

execveat · on Nov 14, 2022

There's a cool tool to detect regexes like these in your code: https://github.com/doyensec/regexploit

(disclosure: I work for Doyensec)

kh1 · on Nov 14, 2022

The gif showing the backtracking is cool. I wonder if there's similar tool for python.

xrayarx · on Nov 14, 2022

The article starts with the famous quote by Jamie Zawinski:

  Some people, when confronted with a problem, think ‘I know, I’ll use regular
  expressions.’ Now they have two problems.

It then goes on to discuss the perils of regular expressions, when they encounter unanticipated data and run into catastrophic backtracking because of greedy expressions.

The article is short and to the point.

asicsp · on Nov 14, 2022

More on that quote: https://blog.codinghorror.com/regular-expressions-now-you-ha...

>... there's a bit more to the story than that, as evidenced by Jeffrey Friedl's exhaustive research on the Zawinski quote. Zawinski himself commented on it. Analyzing the full text of Jamie's posts in the original 1997 thread, we find the following:

>Perl's nature encourages the use of regular expressions almost to the exclusion of all other techniques; they are far and away the most "obvious" (at least, to people who don't know any better) way to get from point A to point B.

>The first quote is too glib to be taken seriously. But this, I completely agree with. Here's the point Jamie was trying to make: not that regular expressions are evil, per se, but that overuse of regular expressions is evil.

tyingq · on Nov 14, 2022

>Perl's nature encourages the use of regular expressions almost to the exclusion of all other techniques

I like Perl, but that is true. Despite the pretty rich set of built-ins, for example, there's not a top-level string replace function. You have to roll your own with index and substr.

Narann · on Nov 14, 2022

I've been in this situation many times.

> However, the regular expression module is a C module that doesn’t release the GIL before running.

I would love to know why.

byroot · on Nov 14, 2022

Mostly a guess, but:

  - Releasing/acquiring the GIL isn't cheap. So you'd need to know in advance whether it's worth it or not.
  - You need the GIL acquired to instantiate the resulting "match data", captures, etc.

formerly_proven · on Nov 14, 2022

Stuff like re.sub frequently calls into Python code as well, so it would seem to be rather hairy to get rid of the GIL for this. It could of course add the usual workarounds of cooperative multitasking, like yielding (releasing the GIL) from time to time, just like the main interpreter loop does (sys.setswitchinterval)

asah · on Nov 14, 2022

assuming it's legal to do so, GIL release seems like a reasonable flag to add to possibly-expensive functions like re.()

... or even better, have re. track its own performance and release the GIL if it forecasts that it statistically-might use a lot more computation... a trivial strawman algorithm: if len(input) > MAX then release GIL first.

oweiler · on Nov 14, 2022

I sometimes think regular expressions being greedy is a bad default.

matsemann · on Nov 14, 2022

Yeah, almost never what I want. Like if I have some "unstructured" data like

  a="xxx" b="yyy"

And I do a regex like /a="(.*)"/ I'd like it to match xxx, not until the end " after yyy. Always have to do weird tricks like /a="([^"]*)"/

Joker_vD · on Nov 14, 2022

Ah, the /"([^"]*)"/ trick. It was one of those small, obvious in retrospect but not that easy to invent on your own, useful things I was taught at my first job among others such as "if in a unit test you need a future date, use year 3001".

matsemann · on Nov 14, 2022

Yeah, in my toy example here I could probably go with a non-greedy modifier like .*?, but sometimes one needs to bring out these weird tricks, heh.

PenguinCoder · on Nov 14, 2022

Try `a="(.*?)\b"` next time. Where `\b` means a boundary such as the space in your example. Or use a more precise match for the capture group rather than the wildcard meaning ANYTHING.

klyrs · on Nov 14, 2022

You would have the same performance issue if the default was the opposite of greedy.

adql · on Nov 14, 2022

(sometimes) Bad for performance, probably what is expected for usability.

photochemsyn · on Nov 14, 2022

The simple example is fixable by anchoring the inner greedy expression, i.e.

regex = re.compile(r'(^a+)+b')

Perhaps the more general problem was that "The regular expression that was causing the issue here was rather long and convoluted".

A good rule seems to be never have long and convoluted regexs, instead have a pipeline of simple-to-understand regexes that can be independently debugged (I don't know if this could cause performance slowdowns however).

klyrs · on Nov 14, 2022

Why would you leave the second + intact, when the ^ prevents it from matching 2 or more times? And what if you want to match the substring in "Faaaaabulous"?

Did you see the even simpler fix in TFA that doesn't break mid-string matches?

photochemsyn · on Nov 14, 2022

Ah right I do now.

cmacleod4 · on Nov 14, 2022

Need I point out that one can avoid both the issues highlighted here by just using Tcl?

Regexps in Tcl use a non-backtracking implementation which handles such cases with minimal slowdown, e.g. the example shown with 16 and 32-character inputs:

    (dep) 7 % timerate {regexp (a+)+b aaaaaaaaaaaaaaaa}
    0.353737 µs/# 2826961 # 2826961 #/sec 1000.000 net-ms
    (dep) 8 % timerate {regexp (a+)+b aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
    0.412612 µs/# 2423587 # 2423587 #/sec 1000.000 net-ms

One discussion of this can be found at https://comp.lang.tcl.narkive.com/XPVME8zS/tcl-regexp-perfor... .

Tcl's threading implementation has no GIL, which not only avoids the type of lock-up described here, but also allows performance to scale to use all cores of your machine, see: https://www.hammerdb.com/blog/uncategorized/why-tcl-is-700-f... .

BerislavLopac · on Nov 14, 2022

The real problem is using multithreading in CPython for any set of tasks that has a nonzero chance to block one of the tasks due to a CPU-bound issue. The rule of thumb here - and I learned that the hard way - is to use async/await for anything I/O bound, and multiprocessing for anything CPU-bound.

klyrs · on Nov 14, 2022

Not really, the real problem is the exponential runtime of a badly-written regex. Multithreading in CPython is also a problem, but even if this was written in a performant language with robust multithreading support, this bug could still be amplified into a denial of service.

wikfwikf · on Nov 15, 2022

Why is the run time exponential in the length of the string of a's? It seems like it should be at most O(n^3).

2^n is the number of subsets of the string of a's with no contiguity requirements. Nothing like this number of possible matches should need to be checked. What am I missing?

adql · on Nov 14, 2022

Why Perl version of regexps returns near-instantly for same regexp ?

I tested few other on regex101.com and none of them ran longer than a millisecond or two

asicsp · on Nov 14, 2022

The article explains that:

>On my laptop it takes about 20ms to do the search above with 16 characters - but by doubling the input string size to 32 characters, it takes over 12 minutes to finish. Exponential running time really does suck.

I checked with 32 characters on regex101, and it says "Match 1 halted after 119986 step(s)" when you click the debug icon (click the message above the regex input box).

kmoser · on Nov 15, 2022

I tried with PHP 7.4.6 on my PC and the results were nearly instantaneous, even with the input string more than doubled to >32 chars. What's up with that?

brenns10 · on Nov 14, 2022

It really feels like they buried the lede here, as almost all discussions about catastrophic backtracking seem to do. Here's the second-to-last paragraph:

> Alternatively you could switch to a regular expression engine that doesn’t exhibit this kind of behaviour. RE2 is an excellent library that uses a finite automata approach to avoid this type of catastrophic backtracking. Using this library on the same 32 character input reduces the running time from 12 minutes to basically instant. The downside being that RE2 doesn’t support all the functionality of the builtin re package.

Perl Compatible Regular Expressions (the flavor used by most engines, including Python's) _require_ this exponential worst-case behavior (using an implementation called backtracking). But the theory behind regex does not require this, and if you eliminate just a couple (infrequently used) features of PCRE, you can have a regex engine that runs in linear time (with respect to the size of the input string). See [2], this is by the authors of RE2. The features which are incompatible with this sort of implementation are:

1. Lookahead assertions. 2. Backreferences to matched groups.

If you don't know what these are, or you rarely ever use them, then you really have no business using this kind of regex engine. (And if you are using them, then I'd argue you're shoving a bit too much logic into a regex, but that's beside the point). Nonetheless, every popular programming language (except Go[1]) has included an exponential backtracking regex implementation in their standard library, and exposed the entire industry to this stupidity, all for a few backreferences and lookahead assertions.

What's especially crazy is this: it's easy for a regex engine to detect the use of these constructs, because it has to parse the regex anyway! So it's feasible for the engine to optimistically try to use an efficient implementation, and only fall back to using a backtracking implementation when you're actually using these features. This is what the Python re2 module[3] does, it uses RE2 by default and supports falling back to the backtracking implementation if necessary.

Instead, we're stuck reading the same postmortem every few years describing how "catastrophic backtracking" ruined another company/person's day, when the problem has been solved for decades, and language/library creators have just failed to include that solution.

[1]: Rob Pike invented, or at least popularized, the algorithm used by RE2. He was also involved in the creation of Go, as was Russ Cox[2]. [2]: https://swtch.com/~rsc/regexp/ [3]: https://pypi.org/project/re2/

burntsushi · on Nov 14, 2022

> Nonetheless, every popular programming language (except Go[1])

... and Rust. Although Rust doesn't have a regex library in std, the official regex crate (I am its author) descends from RE2[1].

> So it's feasible for the engine to optimistically try to use an efficient implementation

Feasible, sure. But not easy or simple. If Python's re2 module uses RE2 by default and Python's regex engine when necessary, then it's very likely that the match offsets it reports are not coherent. RE2 aims to be mostly consistent with things like PCRE, but there are cases where it isn't.

The problem is, building a production grade FSM regex engine or a production grade backtracking regex engine are both huge engineering projects. And there are few, if any, people that know how to do both. They both have their own little specialties and there isn't a ton of crossover among implementors of regex engines.

It's one of those ideas that sounds so stupidly obvious in practice, but when you actually sit down to engineer the thing, it's quite a bit harder than you might imagine.

> (And if you are using them, then I'd argue you're shoving a bit too much logic into a regex, but that's beside the point).

This kind of ignores the fact that regexes are often the interface to something else. If you have a full blown programming language at your fingertips, it's usually not hard to work around the absence of look-around or backreferences by, say, using a second regex or something. But when all you have available to you is, say, a blacklist and you get to provide the regexes to that blacklist, then a more expressive regex syntax becomes a lot more useful. See for example people already running into problems with Discord's new feature: https://github.com/rust-lang/regex/issues/127#issuecomment-1...

Of course, as I noted, it would be very inappropriate for Discord to permit non-regular regexes here. It's borderline already by permitting any regexes at all, because even FSMs have their pathological cases, albeit, not exponential. But if it's a tool you run on your local system? Sure, look-arounds become much more useful.

To be clear, I am pro-FSM and agree with you that they should really be the default regex engine everywhere. But instead, reality is indeed inverted, with backtracking being the default (almost) everywhere. But I wanted to pipe up about a few points that I think you might have not quite captured precisely.

[1]: https://github.com/rust-lang/regex/

brenns10 · on Nov 15, 2022

Thanks! All good clarifications. You're definitely right on all counts, and it was more than a bit naive of me to suggest that language/library authors "simply" have two production grade regex implementations :)

>> (And if you are using them, then I'd argue you're shoving a bit too much logic into a regex, but that's beside the point).

> This kind of ignores the fact that regexes are often the interface to something else.

I will definitely admit to using the occasional assertion or backreference in grep or sed from time to time, so you make a good point about local tools too. I suppose I was thinking more in terms of programming APIs when I said this, and so keeping the standard library FSM-based would still apply here. I'd imagine that special-purpose text munging tools which care that much, will likely have their own engines anyway.

Thanks also for your work on the regex crate. I'm still dipping my toes into Rust but I'm glad to know there's an FSM-based default regex library!

burntsushi · on Nov 15, 2022

Indeed. It doesn't take long before look-around inside of a grep command ends up being quite convenient, sometimes even saving yourself from a much longer shell pipeline. It's why I ultimately relented and added -P/--pcre2 to ripgrep. (Its default regex engine is the Rust regex crate.)

cmacleod4 · on Nov 14, 2022

> > Nonetheless, every popular programming language (except Go[1])

> ... and Rust. Although Rust doesn't have a regex library in std, the official regex crate (I am its author) descends from RE2[1].

... and Tcl (see my earlier post). You might argue that Tcl is not popular but it still has a significant following.

burntsushi · on Nov 15, 2022

I'm not a Tcl expert, but now that I think about it, I'm not sure you're correct. My recollection is that Tcl is one of the few to have a true hybrid engine. So if you use look-around or backreferences, then Tcl could provide an exponential runtime. If that's true, then Tcl would be included in "Nonetheless, every popular programming language (except Go[1]) has included an exponential backtracking regex implementation in their standard library," with perhaps a caveat or special mention. But it's still a categorically different guarantee than what, say, RE2 provides.

burntsushi · on Nov 14, 2022

... sure. I don't personally consider Tcl to be popular at present though.

cutler · on Nov 14, 2022

Just use Perl.

martopix · on Nov 14, 2022

Now you have three problems.