Hacker News
Why Using .* in Regular Expressions Is Almost Never What You Actually Want (mariusschulz.com)
184 points by mariusschulz on June 4, 2014 | 66 comments



Basically the same advice, from the year 2000: http://www.perlmonks.org/?node_id=24640

There is one legitimate use case of .* though: advancing to the last match of something. If you want to find the last digit in a string, /.*(\d)/ will readily find it for you.
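
A quick sketch in Python's re (any backtracking engine behaves the same way):

    >>> import re
    >>> # .* gobbles the whole string, then backtracks just far
    >>> # enough for (\d) to match, so the last digit is captured
    >>> re.search(r'.*(\d)', 'abc123 def456').group(1)
    '6'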


Also, if you want to capture everything after some prefix. For instance, in a log: /(WARN|ERR) (.*)/
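
e.g. in Python, with a made-up log line:

    >>> import re
    >>> m = re.search(r'(WARN|ERR) (.*)', 'ERR disk quota exceeded')
    >>> m.group(1), m.group(2)
    ('ERR', 'disk quota exceeded')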


A somewhat clearer way (imho) is /(\d)\D*$/ since it anchors to the end of the string.


According to the debugger at http://regex101.com/:

  /.*(\d)/
- Effectively searches right-to-left: .* consumes to the end of the string, then backtracks one character at a time until (\d) matches.

  /(\d)\D*$/
- Searches left-to-right, repeatedly stepping forward and backtracking until it finds a match.

If you're looking for a match toward the end of a string, the .* version will be faster.


Wow, regex101.com is very nice, but I never would have noticed that debugger pane if I hadn't gone looking for it after your comment. Incredibly useful tool.


This is not the same as finding the last match though. The parent's example will match '2' in '1 of 2 steps.'


On the contrary, it does give the same result.

$ anchors to the end of the string, \D clears the non-digits from the end to allow \d to match the digit '2'.
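
Quick check in Python:

    >>> import re
    >>> re.search(r'.*(\d)', '1 of 2 steps.').group(1)
    '2'
    >>> re.search(r'(\d)\D*$', '1 of 2 steps.').group(1)
    '2'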


Thanks, I see where I was wrong now.

In this case, when finding the last match from the end, would the lazy quantifier reduce backtracking? e.g. /(\d)\D*?$/


No, that would work very similarly to the greedy version. The backtracking happens because the \d gets matched to the '1' and the whole thing has to be rolled back when the $ attempts to match and instead finds '2' (this would happen again if there were more digits for \d to speculatively match on). So the backtracking is not caused by the laziness or greediness of the \D*; we really do want to gobble up all of the non-digits.

On the two options generally:

    /(\d)\D*$/
is problematic if you have a lot of digits, while

    /.*(\d)/ 
is problematic if you have a lot of text after the last digit. Both could potentially be optimized by the engine to run right-to-left (the former because it's anchored to the end, the latter because its greedy .* guarantees the last digit wins), and then both would do well. I'm not sure if that happens in practice.

Overall, I prefer the latter, both because I think it's clearer and because its perf characteristics hold up under a wider variety of inputs.
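
A rough, unscientific way to see both failure modes in Python (the numbers will vary a lot by engine and input, so treat this as a sketch):

    import re, timeit

    many_digits = '5' * 100000 + '!'   # hard case for /(\d)\D*$/
    long_tail = '5' + 'x' * 100000     # hard case for /.*(\d)/

    for pat in (r'(\d)\D*$', r'.*(\d)'):
        for s in (many_digits, long_tail):
            t = timeit.timeit(lambda: re.search(pat, s), number=100)
            print(pat, s[:3], round(t, 3))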

Edit: how do you make literal asterisks on HN without having a space after them?


In addition to the other explanation, the lazy quantifier is redundant here anyway, since there should only be one $ in any given expression.


As others before me have said, this pattern works as expected. Putting it through regexper, you can visually see this.

http://www.regexper.com/#%2F(%5Cd)%5CD*%24%2F


Technically, while they both capture the same digit, the match itself is different, including either everything before that digit or everything after it. But I tend to liberally use lookaround to keep the actual match clean myself; maybe others go more often for a capturing group. (Well, and not being able to use arbitrary-length lookaround in most engines might be a reason too.)


Actually, both will match the '2' in '1 of 2 steps.'


Yes, but that's harder to do for more complicated regexes, because you need to negate a regex (here \d => \D) for this trick.

If you have a complicated regex $r, you can only negate it with (?:(?!$r).)*, and in that case, .*$r is much easier to read :-)
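
For example, in Python, with a made-up two-word pattern (the tempered-token expression on the second search is the general negation trick):

    >>> import re
    >>> s = 'cat, dog, cat, bird'
    >>> re.search(r'.*(cat|dog)', s).group(1)
    'cat'
    >>> re.search(r'(cat|dog)(?:(?!cat|dog).)*$', s).group(1)
    'cat'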


Be aware though that \d will also find arabic or roman numerals. :)


\d finds arabic numerals? As in 0-9? Scandalous :)

http://en.wikipedia.org/wiki/Arabic_numerals


Perhaps they mean Eastern Arabic numerals. http://en.wikipedia.org/wiki/Eastern_Arabic_numerals


Indeed - those are in use by millions of people. I wonder if any regexp implementation has started matching all of the other number symbols in Unicode:

http://en.wikipedia.org/wiki/Numerals_in_Unicode


Perl does, by default.

There's a modifier if you want to only match ASCII digits.


Roman numerals? I have never encountered a regex implementation where '\d' matches 'X'. It's certainly not the case for Javascript: http://regexpal.com/?flags=g&regex=\d%2B&input=10%0Ax%0AX%0A...


There's a whole lot of these numeral characters in unicode, for example:

http://www.charbase.com/2169-unicode-roman-numeral-ten

(even more in http://www.charbase.com/block/number-forms)

I'm not sure whether JavaScript matches these with its \d pattern, but I think most regexp engines default to ASCII [0-9] unless you use \p{Number}.
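
Python 3 is one engine where \d is Unicode-aware by default; it matches decimal digits (category Nd), so Eastern Arabic digits count but Roman numerals still don't:

    >>> import re
    >>> re.findall(r'\d', '7 \u0663 \u2169')   # 7, ARABIC-INDIC THREE, ROMAN TEN
    ['7', '٣']
    >>> re.findall(r'(?a)\d', '7 \u0663')      # (?a) restricts \d to ASCII
    ['7']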


There's also the character classes in Unicode:

http://www.fileformat.info/info/unicode/category/index.htm


> /\d/.test("\u2169")

false


Ruby also does not match roman numerals with \d

http://rubular.com/r/bOCmLIfKdZ


About eight years ago I finally read Friedl's Mastering Regular Expressions [1]. I know, right? A 500-page book about regular expressions, a tool I already knew (or thought I did). But it's actually a great book-- easy to read and full of genuinely good information on the how and why of regex, and it totally changed my understanding of them. If absolutely anything in this article surprised you, I highly recommend you read the book.

[1] http://regex.info/book.html


I've gotten into the habit of using the "not" operation instead of .* a lot. If I'm looking for bracketed text, I use not-bracket to match the contents.
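
e.g. in Python:

    >>> import re
    >>> # greedy .* runs across brackets; the negated class stops at the first ]
    >>> re.findall(r'\[.*\]', '[a] and [b]')
    ['[a] and [b]']
    >>> re.findall(r'\[([^\]]*)\]', '[a] and [b]')
    ['a', 'b']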

I tend to avoid the non-greedy operator just because it often fails in terrible half-assed regex implementations (e.g. Visual Studio 2010).


I wish the not operator allowed for sub-expressions instead of just character classes. It'll probably make it slower, but it would remove lots of unreadable convolutions people have to go through.


There are some regex implementations that allow it, but it's a very confusing feature. Remember that '' (the empty string) is not 'a'.

Arbitrary expressions can have arbitrary length, so "excluding" an expression really just means the engine will attempt to match it, fail the match, and backtrack to the next option.


Me too. I find that if you work in a bunch of different languages it seems more portable (and one less thing to remember). It also seems easier to debug.


Also read Russ Cox's writeup on implementing regular expressions [0]. Backtracking can be done efficiently; it's just that most regular expression engines have suboptimal implementations for it.

[0] http://swtch.com/~rsc/regexp/regexp1.html


I've cursed over Python's backtracking, at least a few years back. (Why can't they just use PCRE? :-( Any advantage at all?)


PCRE has the same problems with backtracking.


As bad? I might have had smarter coworkers at different times... :-)


Python beat PCRE to Unicode support by several years.

(So there at least was an advantage)


I thought the Unicode support was still spotty (< 3.X)? Or you mean the support is better than in PCRE?


It depends on what you mean by Unicode support - this turns out to be a surprisingly painful area if you need something like case folding:

strasse = straße

or treating combining characters the same as their single character equivalents:

ñ = ñ

(That's LATIN SMALL LETTER N WITH TILDE and LATIN SMALL LETTER N followed by COMBINING TILDE)

A surprising number of languages (mostly everything but Perl) won't handle advanced uses like this.

The good news is that the next version of the stdlib regex module is being developed independently:

https://pypi.python.org/pypi/regex

Simply "pip install regex" and:

    >>> regex.match(r"(?iV1)strasse", "stra\N{LATIN SMALL LETTER SHARP S}e").span()
    (0, 6)
    >>> regex.match(r"(?iV1)stra\N{LATIN SMALL LETTER SHARP S}e", "STRASSE").span()
    (0, 7)


In 2000, PCRE simply didn't support Unicode. Python 1.6 and 2.0 did (at least, based on some quick searching, PCRE added support for Unicode in 2004).

"spotty" probably isn't the right word either, the change in 3.0 was to default to treating text as always being Unicode, the 'unicode' type in 2.x is reasonably complete (as these things go), just not the default treatment for text.


As late as 2004? I didn't know that.

'acdha' (in the sibling comment) wrote what I meant by "spotty" better and more pedagogically than I ever could. :-)


I think the advent of automatic regex match highlighting in text editors is changing the regex use case for a lot of people. It certainly did for me. I no longer see regexes as just "something you use in code to test input". I now use them as general purpose text editing tools. In a way, it's like templated text output, with input specified in the same buffer.

I know this has been done forever, but usually only by extreme greybeards in Vi or Emacs world. The auto-highlighting now makes it possible for everyone to do it.

So that said, with the ability to restrict regexes to just a selection of text, it's more about regex golf--the fewest characters, the most productive--than it is about semantic correctness. If it works for my input, that's all that matters, because the regex is getting discarded thereafter.


Yeah, I do all my data-imports from flat files using regex - easy to export from spreadsheet programs as flat files, then regex them into a bunch of insert/update statements.


Yes, I do plenty of similar things.

Format a load of data using regexes first, then use it hard-coded as a string in a quick one-off script to update the database. It beats trying to parse Excel directly, as you never know what data type a cell will return.
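
A minimal sketch of that workflow in Python (the table name and column layout are made up, and there's no escaping, which is fine for a throwaway one-off):

    import re

    # hypothetical tab-separated export: id<TAB>name<TAB>price
    line = '42\tWidget\t9.99'
    sql = re.sub(r'^([^\t]*)\t([^\t]*)\t([^\t]*)$',
                 r"INSERT INTO products VALUES (\1, '\2', \3);",
                 line)
    print(sql)  # INSERT INTO products VALUES (42, 'Widget', 9.99);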


To be picky, it's always what I want, but with a lot of other stuff I don't.


Very interesting. I've had the greedy .* "overmatch", like probably almost everyone who's used regular expressions. Had no idea they are a performance drain even when giving the right answer though.

I like posts about details of software craftsmanship like this.


A bit of a problem with lazy quantifiers is that they are not so widely supported outside the Perl world, so I often need to find tricks to get similar behavior (e.g. "[^,]*," if comma is the separator).
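
In an engine that has both (Python, for instance), the workaround matches the same field as the lazy version:

    >>> import re
    >>> re.match(r'(.*?),', 'a,b,c').group(1)
    'a'
    >>> re.match(r'([^,]*),', 'a,b,c').group(1)
    'a'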


Except they are! Java, Python, JavaScript, etc. all support lazy quantifiers. In fact I can't think of a single language that doesn't.


Except bash, awk, sed and vim. So, everywhere I use regular expressions.


vim has \{-} and variants.


Sed doesn't AFAIK.


Huh, PCRE is universal (almost, not in e.g. Emacs :-( ) and supports (almost) everything?

Edit: Cough, after rechecking... PCRE is not as universal as I thought. It seems I've been lucky. :-) http://en.wikipedia.org/wiki/Comparison_of_regular_expressio...

Edit 2: "Atomic groups" on that wikipedia link is when you can write a full grammar in a large regexp, right? Answer myself: No, it is the name for stopping backtracking. I've seen it named "possessive" (perldoc perlre).


I thought it was becoming more universal, but now I'm not so sure.

grep on the Mac used to use PCRE regexes if you used the -P option (`grep -P ....`), but beginning with OS X 10.8 the -P option was removed, so an important place that used to offer PCRE (default grep on a default Mac) actually removed support for it. They didn't replace it with something better; they just took it away. Maybe a Unicode issue?

Not only is PCRE not universal, overall support might even be waning.


MySQL doesn't have them, it has some POSIX version, which is way less intuitive. Plenty of people use MySQL.


Isn't that part of the SQL standard (SQL:1999)?


I've run into this problem so many times. Everywhere I think I want .* , I actually want .*? (non-greedy matching). Make a mental note of this. It'll save you lots of headaches.
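
The classic demonstration (Python shown, but it's the same in any engine with lazy quantifiers):

    >>> import re
    >>> html = '<b>bold</b> and <i>italic</i>'
    >>> re.findall(r'<.*>', html)    # greedy: one huge match
    ['<b>bold</b> and <i>italic</i>']
    >>> re.findall(r'<.*?>', html)   # lazy: each tag separately
    ['<b>', '</b>', '<i>', '</i>']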


Actually, it only needs to be \[([^,]+),([^\]]+)\] because you're only going up to a comma in the first capture group and a square bracket in the second.


That would also match all of "[a] more [b,c]" though.


And the other regex would match all of "[a more [b,c]". Your regex must be designed around your expected input.


The regex fiddle is really useful: http://regex101.com/r/qQ2dE4


Is the early example in the document correct?

Using an input string of abc123 he claims [a-z]+\d+ will match the entire string (which I agree with). He then says that [a-z]+?\d+? will only match abc1. Wouldn't it fail since the non-greedy match on [a-z] would just match 'a' causing the non-greedy match on \d to fail trying to match 'b'?


No, it'll still match, but because both are non-greedy it could match on just 'c1' instead of 'abc123'.


It could match on c1, but I believe since most (all?) regex parsers parse left-to-right, it will match the a, look for another a-z character or a digit, find b, repeat, find c, then find 1 which completes the pattern.


http://regex101.com/r/aR5xM2

I used this tester posted elsewhere in the thread; since the lazy components expand "as needed" to achieve a match, it seems it will succeed on "abc1".
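
A quick confirmation in Python (the leftmost possible starting position wins before laziness gets any say):

    >>> import re
    >>> re.search(r'[a-z]+?\d+?', 'abc123').group()
    'abc1'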

EDIT: I wrapped it in a group for clarity.


I once used .* in a crawler. Came back the next day to find much rogue html amongst the content of my site. I find something like [^{{delimiting character}}]* to be better


.*? is the solution!


For the "greedy" behaviour, PHP has the "U" flag... dunno about other implementations though.


This is why I like Lua's '-' for non-greedy matching in its pattern facilities.



