Hacker News
Why Using .* in Regular Expressions Is Almost Never What You Actually Want (mariusschulz.com)
184 points by mariusschulz on June 4, 2014 | 66 comments



Basically the same advice, from the year 2000: http://www.perlmonks.org/?node_id=24640

There is one legitimate use case of .* though: advancing to the last match of something. If you want to find the last digit in a string, /.*(\d)/ will readily find it for you.
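
A quick sketch in Python's re (any backtracking engine behaves the same way):

    >>> import re
    >>> # .* gobbles the whole string, then backtracks just far
    >>> # enough for (\d) to match, so the last digit is captured
    >>> re.search(r'.*(\d)', 'abc123 def456').group(1)
    '6'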


Also, if you want to capture everything after some prefix. For instance, in a log: /(WARN|ERR) (.*)/
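
e.g. in Python, with a made-up log line:

    >>> import re
    >>> m = re.search(r'(WARN|ERR) (.*)', 'ERR disk quota exceeded')
    >>> m.group(1), m.group(2)
    ('ERR', 'disk quota exceeded')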


A somewhat clearer way (imho) is /(\d)\D*$/ since it anchors to the end of the string.


According to the debugger at http://regex101.com/:

  /.*(\d)/
- Effectively searches right-to-left: .* consumes to the end of the string, then backtracks one character at a time until (\d) matches.

  /(\d)\D*$/
- Searches left-to-right, repeatedly stepping forward and backtracking until it finds a match.

If you're looking for a match toward the end of a string, the .* version will be faster.


Wow, regex101.com is very nice, but I never would have noticed that debugger pane if I hadn't gone looking for it after your comment. Incredibly useful tool.


This is not the same as finding the last match though. The parent's example will match '2' in '1 of 2 steps.'


On the contrary, it does give the same result.

$ anchors to the end of the string, \D clears the non-digits from the end to allow \d to match the digit '2'.
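
Quick check in Python:

    >>> import re
    >>> re.search(r'.*(\d)', '1 of 2 steps.').group(1)
    '2'
    >>> re.search(r'(\d)\D*$', '1 of 2 steps.').group(1)
    '2'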


Thanks, I see where I was wrong now.

In this case, when finding the last match from the end, would the lazy quantifier reduce backtracking? e.g. /(\d)\D*?$/


No, that would work very similarly to the greedy version. The backtracking happens because the \d gets matched to the '1' and the whole thing has to be rolled back when the $ attempts to match and instead finds '2' (this would happen again if there were more digits for \d to speculatively match on). So the backtracking is not caused by the laziness or greediness of the \D*; we really do want to gobble up all of the non-digits.

On the two options generally:

    /(\d)\D*$/
is problematic if you have a lot of digits, while

    /.*(\d)/ 
is problematic if you have a lot of text after the last digit. Both could potentially be optimized by the engine to run right-to-left (the former because it's anchored to the end, the latter because its greedy .* guarantees the last digit wins), and then both would do well. I'm not sure if that happens in practice.

Overall, I prefer the latter, both because I think it's clearer and because its perf characteristics hold up under a wider variety of inputs.
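
A rough, unscientific way to see both failure modes in Python (the numbers will vary a lot by engine and input, so treat this as a sketch):

    import re, timeit

    many_digits = '5' * 100000 + '!'   # hard case for /(\d)\D*$/
    long_tail = '5' + 'x' * 100000     # hard case for /.*(\d)/

    for pat in (r'(\d)\D*$', r'.*(\d)'):
        for s in (many_digits, long_tail):
            t = timeit.timeit(lambda: re.search(pat, s), number=100)
            print(pat, s[:3], round(t, 3))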

Edit: how do you make literal asterisks on HN without having a space after them?


In addition to the other explanation, the lazy quantifier is redundant here anyway, since there should only be one $ in any given expression.


As others before me have said, this pattern works as expected. Putting it through regexper, you can visually see this.

http://www.regexper.com/#%2F(%5Cd)%5CD*%24%2F


Technically, while they both capture the same digit, the match itself is different, including either everything before that digit or everything after it. But I tend to liberally use lookaround to keep the actual match clean myself; maybe others go more often for a capturing group. (Well, and not being able to use arbitrary-length lookaround in most engines might be a reason too.)


Actually, both will match the '2' in '1 of 2 steps.'


Yes, but that's harder to do for more complicated regexes, because you need to negate a regex (here \d => \D) for this trick.

If you have a complicated regex $r, you can only negate it with (?:(?!$r).)*, and in that case, .*$r is much easier to read :-)
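
For example, in Python, with a made-up two-word pattern (the tempered-token expression on the second search is the general negation trick):

    >>> import re
    >>> s = 'cat, dog, cat, bird'
    >>> re.search(r'.*(cat|dog)', s).group(1)
    'cat'
    >>> re.search(r'(cat|dog)(?:(?!cat|dog).)*$', s).group(1)
    'cat'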


Be aware though that \d will also find arabic or roman numerals. :)


\d finds arabic numerals? As in 0-9? Scandalous :)

http://en.wikipedia.org/wiki/Arabic_numerals


Perhaps they mean Eastern Arabic numerals. http://en.wikipedia.org/wiki/Eastern_Arabic_numerals


Indeed - those are in use by millions of people. I wonder if any regexp implementation has started matching all of the other number symbols in Unicode:

http://en.wikipedia.org/wiki/Numerals_in_Unicode


Perl does, by default.

There's a modifier if you want to only match ASCII digits.


Roman numerals? I have never encountered a regex implementation where '\d' matches 'X'. It's certainly not the case for Javascript: http://regexpal.com/?flags=g&regex=\d%2B&input=10%0Ax%0AX%0A...


There's a whole lot of these numeral characters in unicode, for example:

http://www.charbase.com/2169-unicode-roman-numeral-ten

(even more in http://www.charbase.com/block/number-forms)

I'm not sure whether JavaScript matches these with its \d pattern, but I think most regexp engines default to ASCII [0-9] unless you use \p{Number}.
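
Python 3 is one engine where \d is Unicode-aware by default; it matches decimal digits (category Nd), so Eastern Arabic digits count but Roman numerals still don't:

    >>> import re
    >>> re.findall(r'\d', '7 \u0663 \u2169')   # 7, ARABIC-INDIC THREE, ROMAN TEN
    ['7', '٣']
    >>> re.findall(r'(?a)\d', '7 \u0663')      # (?a) restricts \d to ASCII
    ['7']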


There's also the character classes in Unicode:

http://www.fileformat.info/info/unicode/category/index.htm


> /\d/.test("\u2169")

false


Ruby also does not match roman numerals with \d

http://rubular.com/r/bOCmLIfKdZ


About eight years ago I finally read Friedl's Mastering Regular Expressions [1]. I know, right? A 500-page book about regular expressions, a tool I already knew (or thought I did). But it's actually a great book-- easy to read and full of genuinely good information on the how and why of regex, and it totally changed my understanding of them. If absolutely anything in this article surprised you, I highly recommend you read the book.

[1] http://regex.info/book.html


I've gotten into the habit of using the "not" operation instead of .* a lot. If I'm looking for bracketed text, I use not-bracket to match the contents.
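
e.g. in Python:

    >>> import re
    >>> # greedy .* runs across brackets; the negated class stops at the first ]
    >>> re.findall(r'\[.*\]', '[a] and [b]')
    ['[a] and [b]']
    >>> re.findall(r'\[([^\]]*)\]', '[a] and [b]')
    ['a', 'b']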

I tend to avoid the non-greedy operator just because it often fails in terrible half-assed regex implementations (e.g. Visual Studio 2010).


I wish the not operator allowed for sub-expressions instead of just character classes. It'll probably make it slower, but it would remove lots of unreadable convolutions people have to go through.


There are some regex implementations that allow it, but it's a very confusing feature. Remember that '' (the empty string) is not 'a'.

Arbitrary expressions can have arbitrary length, so "excluding" an expression really just means the engine will attempt to match it, fail the match, and backtrack to the next option.


Me too. I find that if you work in a bunch of different languages it seems more portable (and one less thing to remember). It also seems easier to debug.


Also read Russ Cox's writeup on implementing regular expressions [0]. Backtracking can be done efficiently; it's just that most regular expression engines have suboptimal implementations for it.

[0] http://swtch.com/~rsc/regexp/regexp1.html


I've cursed over Python's backtracking, at least a few years back. (Why can't they just use PCRE? :-( Any advantage at all?)


PCRE has the same problems with backtracking.


As bad? I might have had smarter coworkers at different times... :-)


Python beat PCRE to Unicode support by several years.

(So there at least was an advantage)


I thought the Unicode support was still spotty (< 3.X)? Or you mean the support is better than in PCRE?


It depends on what you mean by Unicode support - this turns out to be a surprisingly painful area if you need something like case folding:

strasse = straße

or treating combining characters the same as their single character equivalents:

ñ = ñ

(That's LATIN SMALL LETTER N WITH TILDE and LATIN SMALL LETTER N followed by COMBINING TILDE)

A surprising number of languages (mostly everything but Perl) won't handle advanced uses like this.

The good news is that the next version of the stdlib regex module is being developed independently:

https://pypi.python.org/pypi/regex

Simply "pip install regex" and:

    >>> regex.match(r"(?iV1)strasse", "stra\N{LATIN SMALL LETTER SHARP S}e").span()
    (0, 6)
    >>> regex.match(r"(?iV1)stra\N{LATIN SMALL LETTER SHARP S}e", "STRASSE").span()
    (0, 7)


In 2000, PCRE simply didn't support Unicode. Python 1.6 and 2.0 did (at least, based on some quick searching, PCRE added support for Unicode in 2004).

"spotty" probably isn't the right word either, the change in 3.0 was to default to treating text as always being Unicode, the 'unicode' type in 2.x is reasonably complete (as these things go), just not the default treatment for text.


As late as 2004? I didn't know that.

'acdha' (in the sibling comment) wrote what I meant by "spotty" better and more pedagogically than I ever could. :-)


I think the advent of automatic regex match highlighting in text editors is changing the regex use case for a lot of people. It certainly did for me. I no longer see regexes as just "something you use in code to test input". I now use them as general purpose text editing tools. In a way, it's like templated text output, with input specified in the same buffer.

I know this has been done forever, but usually only by extreme greybeards in Vi or Emacs world. The auto-highlighting now makes it possible for everyone to do it.

So that said, with the ability to restrict regexes to just a selection of text, it's more about regex golf--the fewest characters, the most productive--than it is about semantic correctness. If it works for my input, that's all that matters, because the regex is getting discarded thereafter.


Yeah, I do all my data-imports from flat files using regex - easy to export from spreadsheet programs as flat files, then regex them into a bunch of insert/update statements.


Yes, I do plenty of similar things.

Format a load of data using regexes first, then use it hard-coded as a string in a quick one-off script to update the database. It beats trying to parse Excel directly, as you never know what data type a cell will return.
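
A minimal sketch of that workflow in Python (the table name and column layout are made up, and there's no escaping, which is fine for a throwaway one-off):

    import re

    # hypothetical tab-separated export: id<TAB>name<TAB>price
    line = '42\tWidget\t9.99'
    sql = re.sub(r'^([^\t]*)\t([^\t]*)\t([^\t]*)$',
                 r"INSERT INTO products VALUES (\1, '\2', \3);",
                 line)
    print(sql)  # INSERT INTO products VALUES (42, 'Widget', 9.99);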


To be picky, it's always what I want, but with a lot of other stuff I don't.


Very interesting. I've had the greedy .* "overmatch", like probably almost everyone who's used regular expressions. Had no idea they are a performance drain even when giving the right answer though.

I like posts about details of software craftsmanship like this.


A bit of a problem with lazy quantifiers is that they are not so widely supported outside the Perl world, so I often need to find tricks to get similar behavior (e.g. "[^,]*," if comma is the separator).
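
In an engine that has both (Python, for instance), the workaround matches the same field as the lazy version:

    >>> import re
    >>> re.match(r'(.*?),', 'a,b,c').group(1)
    'a'
    >>> re.match(r'([^,]*),', 'a,b,c').group(1)
    'a'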


Except they are! Java, Python, JavaScript, etc. all support lazy quantifiers. In fact I can't think of a single language that doesn't.


Except bash, awk, sed and vim. So, everywhere I use regular expressions.


vim has \{-} and variants.


Sed doesn't AFAIK.


Huh, PCRE is universal (almost, not in e.g. Emacs :-( ) and supports (almost) everything?

Edit: Cough, after rechecking... PCRE is not as universal as I thought. It seems I've been lucky. :-) http://en.wikipedia.org/wiki/Comparison_of_regular_expressio...

Edit 2: "Atomic groups" on that wikipedia link is when you can write a full grammar in a large regexp, right? Answer myself: No, it is the name for stopping backtracking. I've seen it named "possessive" (perldoc perlre).


I thought it was becoming more universal, but now I'm not so sure.

grep on the Mac used to use PCRE regexes if you used the -P option (`grep -P ....`), but beginning with OS X 10.8 the -P option was removed, so an important place that used to offer PCRE (default grep on a default Mac) actually removed support for it. They didn't replace it with something better; they just took it away. Maybe a Unicode issue?

Not only is PCRE not universal, overall support might even be waning.


MySQL doesn't have them, it has some POSIX version, which is way less intuitive. Plenty of people use MySQL.


Isn't that part of the SQL standard (SQL:1999)?


I've run into this problem so many times. Everywhere I think I want .* , I actually want .*? (non-greedy matching). Make a mental note of this. It'll save you lots of headaches.
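
The classic demonstration (Python shown, but it's the same in any engine with lazy quantifiers):

    >>> import re
    >>> html = '<b>bold</b> and <i>italic</i>'
    >>> re.findall(r'<.*>', html)    # greedy: one huge match
    ['<b>bold</b> and <i>italic</i>']
    >>> re.findall(r'<.*?>', html)   # lazy: each tag separately
    ['<b>', '</b>', '<i>', '</i>']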


Actually, it only needs to be \[([^,]+),([^\]]+)\] because you're only going up to a comma in the first capture group and a square bracket in the second.


That would also match all of "[a] more [b,c]" though.


And the other regex would match all of "[a more [b,c]". Your regex must be designed around your expected input.


The regex fiddle is really useful: http://regex101.com/r/qQ2dE4


Is the early example in the document correct?

Using an input string of abc123 he claims [a-z]+\d+ will match the entire string (which I agree with). He then says that [a-z]+?\d+? will only match abc1. Wouldn't it fail since the non-greedy match on [a-z] would just match 'a' causing the non-greedy match on \d to fail trying to match 'b'?


No, it'll still match, but because both are non-greedy it could match on just 'c1' instead of 'abc123'.


It could match on c1, but I believe since most (all?) regex parsers parse left-to-right, it will match the a, look for another a-z character or a digit, find b, repeat, find c, then find 1 which completes the pattern.


http://regex101.com/r/aR5xM2

I used this tester posted elsewhere in the thread; since the lazy components expand "as needed" to achieve a match, it seems it will succeed on "abc1".
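
A quick confirmation in Python (the leftmost possible starting position wins before laziness gets any say):

    >>> import re
    >>> re.search(r'[a-z]+?\d+?', 'abc123').group()
    'abc1'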

EDIT: I wrapped it in a group for clarity.


I once used .* in a crawler. Came back the next day to find much rogue html amongst the content of my site. I find something like [^{{delimiting character}}]* to be better


.*? is the solution!


For the "greedy" behaviour, PHP has the "U" flag... dunno about other implementations though.


This is why I like Lua's '-' for non-greedy matching in its pattern facilities.



