There is one legitimate use case of .* though: advancing to the last match of something. If you want to find the last digit in a string, /.*(\d)/ will readily find it for you.
Wow, regex101.com is very nice, but I never would have noticed that debugger pane if I hadn't gone looking for it after your comment. Incredibly useful tool.
No, that would work very similarly to the greedy version. The backtracking happens because the \d gets matched to the '1' and the whole thing has to be rolled back when the $ attempts to match and instead finds '2' (this would happen again if there were more digits for \d to speculatively match on). So the backtracking is not caused by the laziness or greediness of the \D* ; we really do want to gobble up all of the non-digits.
On the two options generally:
/(\d)\D*$/
is problematic if you have a lot of digits, while
/.*(\d)/
is problematic if you have a lot of text after the last digit. Both could potentially be optimized by the engine to run right-to-left (the former because it's anchored to the end and the latter because it greedily matches to the beginning), and then both would do well. I'm not sure if that happens in practice.
Overall, I prefer the latter, both because I think it's clearer and because its perf characteristics hold up under a wider variety of inputs.
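To check that the two approaches agree, here's a quick sketch with Python's re (the input string is made up):

```python
import re

s = "abc123def456ghi"
m1 = re.search(r'(\d)\D*$', s)   # anchored at the end
m2 = re.search(r'.*(\d)', s)     # greedy scan, then backtrack to the last digit
print(m1.group(1), m2.group(1))  # both capture '6'
```

Both patterns capture the same last digit; they differ in what the overall match covers and in which inputs make them slow.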
Edit: how do you make literal asterisks on HN without having a space after them?
Technically, while they both capture the same digit, the match itself is different, including either everything before that digit or everything after it. But I tend to liberally use lookaround to keep the actual match clean myself; maybe others go more often for a capturing group. (Well, and not being able to use arbitrary-length lookaround in most engines might be a reason too.)
Indeed - those are in use by millions of people. I wonder if any regexp implementation has started matching all of the other number symbols in Unicode:
I'm not sure whether JavaScript matches these in its \d pattern, but I think most regexp engines default to the ASCII [0-9] unless you use \p{Number}.
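Python 3's re, at least, does treat \d as any Unicode decimal digit unless you ask for ASCII-only matching. A quick check (the digits here were picked arbitrarily):

```python
import re

print(bool(re.match(r'\d', '٣')))      # ARABIC-INDIC DIGIT THREE -> True
print(bool(re.match(r'\d', '३')))      # DEVANAGARI DIGIT THREE -> True
print(bool(re.match(r'(?a)\d', '٣')))  # ASCII flag restricts \d to [0-9] -> False
```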
About eight years ago I finally read Friedl's Mastering Regular Expressions [1]. I know, right? A 500-page book about regular expressions, a tool I already knew (or thought I did). But it's actually a great book-- easy to read and full of genuinely good information on the how and why of regex, and it totally changed my understanding of them. If absolutely anything in this article surprised you, I highly recommend you read the book.
I've gotten into the habit of using the "not" operation instead of .* a lot. If I'm looking for bracketed text, I use not-bracket to match the contents.
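For example (a sketch in Python's re, with an invented input), the negated class stops at the first closing bracket where .* overshoots to the last one:

```python
import re

text = "see [foo] and [bar] here"
print(re.findall(r'\[([^\]]*)\]', text))  # ['foo', 'bar']
print(re.findall(r'\[(.*)\]', text))      # ['foo] and [bar'] -- greedy overmatch
```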
I tend to avoid the non-greedy operator just because it often fails in terrible half-assed regex implementations (eg. visual studio 2010)
I wish the not operator allowed for sub-expressions instead of just character classes. It'll probably make it slower, but it would remove lots of unreadable convolutions people have to go through.
Me too. I find that if you work in a bunch of different languages it seems more portable (and one less thing to remember). It also seems easier to debug.
Also read Russ Cox's writeup on implementing regular expressions [0]. Backtracking can be done efficiently; it's just that most regular expression engines have suboptimal implementations for it.
>>> regex.match(r"(?iV1)strasse", "stra\N{LATIN SMALL LETTER SHARP S}e").span()
(0, 6)
>>> regex.match(r"(?iV1)stra\N{LATIN SMALL LETTER SHARP S}e", "STRASSE").span()
(0, 7)
In 2000, PCRE simply didn't support Unicode, while Python 1.6 and 2.0 did (at least, based on some quick searching, PCRE added Unicode support in 2004).
"spotty" probably isn't the right word either, the change in 3.0 was to default to treating text as always being Unicode, the 'unicode' type in 2.x is reasonably complete (as these things go), just not the default treatment for text.
I think the advent of automatic regex match highlighting in text editors is changing the regex use-case for a lot of people. It certainly did for me. I no longer see regexs as just "something you use in code to test input". I now use them as general purpose text editing tools. In a way, it's like templated text output, with input specified in the same buffer.
I know this has been done forever, but usually only by extreme greybeards in Vi or Emacs world. The auto-highlighting now makes it possible for everyone to do it.
So that said, with the ability to restrict regexes to just a selection of text, it's more about regex golf--the fewest characters, the most productive--than it is about semantic correctness. If it works for my input, that's all that matters, because the regex is getting discarded thereafter.
Yeah, I do all my data-imports from flat files using regex - easy to export from spreadsheet programs as flat files, then regex them into a bunch of insert/update statements.
Format a load of data using regexes first, then use it hard-coded as a string in a quick one-off script to update the database. It beats trying to parse Excel directly, since you never know what data type a cell will return.
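A minimal sketch of that workflow in Python; the rows, column layout, and table name are all made up for illustration:

```python
import re

# hypothetical flat-file export: id,name per line
rows = "1,Alice\n2,Bob"
sql = re.sub(r'^(\d+),(\w+)$',
             r"INSERT INTO users (id, name) VALUES (\1, '\2');",
             rows, flags=re.M)
print(sql)
```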
Very interesting. I've had the greedy .* "overmatch", like probably almost everyone who's used regular expressions. Had no idea they are a performance drain even when giving the right answer though.
I like posts about details of software craftsmanship like this.
A bit of a problem with lazy quantifiers is that they are not so widely supported outside the Perl world. Therefore I often need to find some tricks to get similar behavior (eg. "[^,]*," if comma is the separator).
Edit 2: "Atomic groups" on that wikipedia link is when you can write a full grammar in a large regexp, right? Answer myself: No, it is the name for stopping backtracking. I've seen it as named "possessive" (perldoc perlre).
I thought it was becoming more universal, but now I'm not so sure.
grep on the Mac used to use PCRE regexes if you used the -P option (`grep -P ....`), but beginning with OS X 10.8 the -P option was removed, so an important place that used to offer PCRE (default grep on a default Mac) actually removed support for it. They didn't replace it with something better; they just took it away. Maybe a Unicode issue?
Not only is PCRE not universal, overall support might even be waning.
I've run into this problem so many times. Everywhere I think I want .* , I actually want .*? (non-greedy matching). Make a mental note of this. It'll save you lots of headaches.
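A quick illustration of the difference in Python's re, using a made-up snippet of HTML:

```python
import re

html = "<b>bold</b> and <i>italic</i>"
print(re.findall(r'<.*>', html))   # ['<b>bold</b> and <i>italic</i>'] -- greedy
print(re.findall(r'<.*?>', html))  # ['<b>', '</b>', '<i>', '</i>'] -- lazy
```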
Actually, it only needs to be \[([^,]+),([^\]]+)\] because you're only going up to a comma in the first capture group and a square bracket in the second.
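A quick check of the tightened pattern in Python's re (input string invented):

```python
import re

m = re.search(r'\[([^,]+),([^\]]+)\]', "coords [12,34] end")
print(m.group(1), m.group(2))  # 12 34
```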
Using an input string of abc123 he claims [a-z]+\d+ will match the entire string (which I agree with). He then says that [a-z]+?\d+? will only match abc1. Wouldn't it fail since the non-greedy match on [a-z] would just match 'a' causing the non-greedy match on \d to fail trying to match 'b'?
It could match on c1, but I believe since most (all?) regex parsers parse left-to-right, it will match the a, look for another a-z character or a digit, find b, repeat, find c, then find 1 which completes the pattern.
I used the tester posted elsewhere in the thread; since the lazy components expand "as needed" to achieve a match, it succeeds on "abc1".
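That's also what Python's re gives: the lazy [a-z]+? starts at 'a' and is forced to expand until \d+? can match.

```python
import re

m = re.match(r'[a-z]+?\d+?', 'abc123')
print(m.group())  # 'abc1' -- lazy quantifiers expand only as far as needed
```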
I once used .* in a crawler. Came back the next day to find much rogue html amongst the content of my site. I find something like [^{{delimiting character}}]* to be better