This is not really a 2000x performance "win"; this is GNU grep(1) being so bad at dealing with UTF-8 that it is basically unusable.
The POSIX locale system is a nightmare, but the implementation in GNU coreutils is even worse: when it is not insanely slow, it is plain broken.
I wasted days tracking down a bug when we moved a script from FreeBSD to a Linux system that had a UTF-8 locale: even for plain ASCII input, GNU awk would match lines erratically unless the locale was set back to C! I'm sure this bug has been fixed by now, but it is far from the only time I have found serious flaws with UTF-8 support in GNU tools. It seems these days the only way to be safe is to set the locale to C at the top of all your scripts, which is quite sad.
This is also required because, due to POSIX locale insanity, the actual behavior of tools as basic as ls(1), and more fundamentally of regexp matching, changes based on the locale, making scripts that don't set their locale at the top unreliable and non-portable.
Or use a toolkit with decent UTF-8 support, like the tools from Plan 9, which one can use on *nix systems via Plan 9 from User Space ( http://plan9.us ) or 9base ( http://tools.suckless.org/9base ). If you use those, your scripts will also be much more portable, as there are all kinds of incompatibilities among, for example, awk implementations.
Minor correction: the "bug" in grep was fixed in GNU grep 2.7, not 2.6. I recall that because I wrote an article about it on my blog. See the release notes for GNU grep 2.7 here: http://savannah.gnu.org/forum/forum.php?forum_id=6521
When the author shows the output of "which grep", it shows the path "/opt/local/bin/grep"; "/opt" is typically used for additional software installations. I haven't used SmartOS myself, but my best guess is that someone installed GNU grep manually and the PATH just included "/opt/local/bin" before "/usr/bin".
> This is not really a 2000x performance "win"; this is GNU grep(1) being so bad at dealing with UTF-8 that it is basically unusable.
>
> The POSIX locale system is a nightmare, but the implementation in GNU coreutils is even worse: when it is not insanely slow, it is plain broken.
Actually, GNU grep as of last year was much faster than BSD grep, as long as the C locale is used:
The reason using UTF-8 makes it much slower is that dealing with Unicode correctly is a very expensive business. But pretending that languages other than English do not exist is not an answer.
The Plan9 grep deals with Unicode correctly and is very, very fast. Claiming that
> dealing with Unicode correctly is a very expensive business
is plain FUD.
UTF-8 has zero overhead in this case (not near zero, but actual zero overhead). Grep doesn't even need to be aware of UTF-8; it doesn't even need to know where a rune starts and ends in the byte stream. UTF-8 is designed in such a way that matching a regexp only requires byte comparisons, exactly as for ASCII.
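A quick way to convince yourself (Python used purely as an illustration; the strings are made up):

    # Pattern and text are both plain UTF-8 byte strings.  Because UTF-8 is
    # self-synchronizing, the encoded pattern can only line up at a real
    # character boundary, so an ordinary byte search is enough.
    pattern = "héllo".encode("utf-8")                # b'h\xc3\xa9llo'
    text = "say héllo to the wörld".encode("utf-8")

    print(text.find(pattern))                        # 4 -- a byte offset, on a rune boundary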
Unfortunately, it's not as simple as that. There is more than one way of representing many text sequences using Unicode; for example, combining diacritics vs accented characters. To do the Right Thing (tm), you need to normalize, and that adds at least another pass over the whole string.
If you just deal with it byte by byte, you're not actually Unicode compatible. It might be fast, but it's not Correct.
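For instance (Python purely as an illustration):

    import unicodedata

    precomposed = "caf\u00e9"        # "café" with precomposed é (U+00E9)
    combining   = "cafe\u0301"       # "cafe" + U+0301 COMBINING ACUTE ACCENT

    # Byte for byte these are different strings...
    print(precomposed.encode("utf-8"))   # b'caf\xc3\xa9'
    print(combining.encode("utf-8"))     # b'cafe\xcc\x81'
    print(precomposed == combining)      # False

    # ...but they are canonically equivalent, so a correct match has to
    # normalize both sides first (NFC here).
    nfc = lambda s: unicodedata.normalize("NFC", s)
    print(nfc(precomposed) == nfc(combining))   # True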
In addition, conforming Unicode implementations have to ensure that their input is well-formed Unicode and reject non-conforming sequences (for security-related reasons). So the original goal of UTF-8, that it could be usable for straight byte-level search (the UTF-8 RFC mentions the goal of allowing Boyer-Moore to be applied directly), is no longer really attainable.
It's not necessarily grep's responsibility to normalize strings. They could very well stipulate that it is the responsibility of the user to make sure that both the input and the search expression are already normalized. That way, the normalization could be done by a separate program, resulting in the reusability and separation of concerns that is The Unix Way.
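Such a normalizer could be tiny; a rough Python sketch (a hypothetical script, not an existing tool; it assumes UTF-8 in and out and picks NFC as the target form), meant to sit in front of grep in a pipeline:

    #!/usr/bin/env python3
    # normalize.py -- hypothetical filter (illustration only): read UTF-8 on
    # stdin, write NFC-normalized UTF-8 on stdout.
    # Usage in a pipeline:  normalize.py < input.txt | grep pattern
    import sys
    import unicodedata

    for line in sys.stdin:
        sys.stdout.write(unicodedata.normalize("NFC", line))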
I fully agree, if you redefine grep as a byte-search program rather than a text-search one, and rule it out for most languages apart from English.
To have any usable grep with applicability beyond programming languages, English-language log files, etc. (i.e. massive Anglocentrism), the normalizing would need to be built in. Sketched:
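Roughly something like this (my own Python sketch of the idea, hypothetical names; NFC chosen arbitrarily as the normalization form, and a real implementation would do this in C with a streaming normalizer):

    #!/usr/bin/env python3
    # ngrep.py -- sketch of a grep with normalization built in (hypothetical
    # tool, illustration only): bring the pattern and every input line to the
    # same normal form before matching, so "café" also matches the
    # e-plus-combining-accent spelling.
    import re
    import sys
    import unicodedata

    def nfc(s):
        return unicodedata.normalize("NFC", s)

    pattern = re.compile(nfc(sys.argv[1]))
    for line in sys.stdin:
        if pattern.search(nfc(line)):
            sys.stdout.write(line)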
> The Plan9 grep deals with Unicode correctly and is very, very fast. Claiming that
>> dealing with Unicode correctly is a very expensive business
> is plain FUD.
If you think this is true, you don't know Unicode well enough. I don't know how Plan 9 grep is implemented, but if it does this with near-zero overhead then it's not actually supporting Unicode, only a small subset of Unicode that they happen to like.
I'd probably also agree that that's the best part of Unicode and the rest of it could burn in a fire, but it's still not Unicode.
OP needs to upgrade grep to the latest version. He is using 2.5.3, and the latest version is 2.10.
Release notes for 2.6:
"This release fixes an unexpectedly large number of flaws, from outright bugs (surprisingly many, considering this is "grep") to some occasionally debilitating performance problems."
Adding to what you said, I also find Plan 9 regular expressions significantly easier to grok in your head than POSIX or PCRE regular expressions, which push a lot of complexity onto the user (PCREs are not even true regexps, as they are not regular).
Plan 9 regexps are very easy to reason about, more explicit, and don't look as much like line noise.
REs with backreferences are strictly more powerful than REs without them. Implying that PCRE is somehow inferior because it supports backreferences is quite FUDdy.
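To be concrete about that extra power: a backreference lets a PCRE-style pattern match the non-regular language of doubled strings, which no true regular expression can (a quick check in Python's PCRE-like engine; the flip side is that such patterns also bring backtracking cost with them):

    import re

    # (\w+)\1 matches a string followed by an exact copy of itself -- a
    # classic non-regular language, so it cannot be handled by a plain
    # DFA/NFA the way true regular expressions can.
    doubled = re.compile(r"^(\w+)\1$")

    print(bool(doubled.match("abcabc")))   # True
    print(bool(doubled.match("abcabd")))   # False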
UTF-8 regular expression matching shouldn't be different from ASCII at all, as far as I can tell. In UTF-8, every byte by itself can be identified as a start byte or trail byte, so if you want to match a regular expression, you don't even have to care about UTF-8 in any way. Any legal character in the regexp you want to look for can only match at a legal character start in the haystack.
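That property is easy to see from the bit patterns (a small illustration; the helper is mine):

    def classify(b):
        # Every UTF-8 byte declares its own role:
        #   0xxxxxxx  ASCII / single-byte character
        #   10xxxxxx  continuation ("trail") byte
        #   11xxxxxx  start of a multi-byte character
        if b < 0x80:
            return "ascii"
        if b & 0xC0 == 0x80:
            return "trail"
        return "start"

    for b in "aé€".encode("utf-8"):
        print(hex(b), classify(b))
    # 0x61 ascii, 0xc3 start, 0xa9 trail, 0xe2 start, 0x82 trail, 0xac trail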
> UTF-8 regular expression matching shouldn't be different from ASCII at all, as far as I can tell.
Not really. For a start, the '.' operator needs to work differently (see the sketch after this list). That can still be done fast, but if you actually want to claim Unicode support you should also consider:
- Unicode collation orders
- Unicode string equivalence algorithms
- Unicode normalization algorithms
This will also take care of all the Unicode weirdness that no one actually uses or cares about but that you still must implement to claim compatibility: presentation forms, combining diacritics, ligatures, double-width characters and other odd stuff.
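The '.' difference mentioned above is the easy part: matched over raw bytes it consumes one byte, matched over decoded text it consumes one character (Python purely as an illustration):

    import re

    s = "é"        # one character, two UTF-8 bytes

    # Over decoded text, '.' matches one character...
    print(re.match(".", s).group())                    # 'é'

    # ...over raw bytes, '.' matches a single byte, i.e. half the character.
    print(re.match(b".", s.encode("utf-8")).group())   # b'\xc3'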
You're both wrong. If you take an arbitrary byte in an ASCII stream you can be sure it is a whole char, but if you take an arbitrary byte in a UTF-8 stream you can get the start of a char, the middle or end of a char, or a whole char.
Obviously choosing an arbitrary offset in the byte stream might not land on a rune boundary, but UTF-8 is great precisely because you can tell. You can find the start/end boundaries in any slice of the byte stream, unlike with most other encodings.
That issue is completely unrelated anyway: matching regexps in UTF-8 text is the same as matching regexps in ASCII text. The regular expression tool doesn't even need to care that the text is UTF-8; it's just byte comparisons, and the tool doesn't even need to be aware of rune boundaries.
Don't some of the common cases of searching require the ability to quickly skip forward N characters for high efficiency? That's simple pointer arithmetic in the ASCII case, but requires reading each byte skipped in UTF-8, right?
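For example, skipping N characters in UTF-8 means looking at every skipped byte instead of just bumping a pointer; something like this sketch (the helper name is mine, not anything from grep):

    def skip_chars(buf, start, n):
        # Advance n characters from byte offset `start` in UTF-8 bytes `buf`.
        # Illustrative only: unlike ASCII, where this is just start + n,
        # every skipped byte has to be inspected.
        i = start
        while n > 0 and i < len(buf):
            i += 1
            # step over the continuation bytes (10xxxxxx) of the current character
            while i < len(buf) and buf[i] & 0xC0 == 0x80:
                i += 1
            n -= 1
        return i

    data = "naïve café".encode("utf-8")
    print(skip_chars(data, 0, 3))    # 4 -- byte offset of the 4th character, 'v'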
If it's just doing byte comparisons, then the tool isn't properly implementing http://en.wikipedia.org/wiki/Unicode_equivalence, unless it can assume that all input text was already normalized.
Just look into grep's sources: it has loads of low-level optimizations to make string search faster. But all these optimizations only work with fixed-width encodings; that's why it is related.
In case anybody else was wondering what LANG=C means,
> The 'C' locale is defined as the "default" locale
> for applications, meaning that their strings are
> displayed as written in the initial code (without
> passing through a translation lookup).
>
> There is nothing special about the C locale, except
> that it will always exist and applies no string
> replacements.
-Malcolm Tredinnick
As far as I know, grep uses Boyer-Moore for string matching when possible. Given a variable-width encoding such as UTF-8, plain Boyer-Moore isn't possible, even though it is the asymptotically fastest algorithm known.
Hence even a perfect grep will be slower by arbitrarily large factors, depending on the input.
So while there may be problems, expecting no difference, or no significant difference, between encodings is not correct either.
> Given a variable-width encoding such as UTF-8, plain Boyer-Moore isn't possible (...)
That you get UTF-8 input and produce UTF-8 output doesn't imply you are better off using UTF-8 for processing. Translating UTF-8 to fixed-width UTF-32 and back is of linear complexity and takes a small, fixed amount of memory. The only trade-off is when processing /very/ long lines -- up to four times more memory would be used for the buffer.
As mentioned in other posts, Unicode requires normalization of certain character combinations into other characters, so you'll be processing all input characters anyway; just prefix the decoding as an extra step, not even a separate loop.
And so you can do Boyer-Moore with Unicode at very little extra cost :-)
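A sketch of that approach (a Python 3 str is already a fixed-width code point sequence, so it stands in here for UTF-32 or Plan 9 Runes; a real grep would run Boyer-Moore over the decoded array):

    import unicodedata

    def prepare(utf8_bytes):
        # decode once to fixed-width code points and normalize on the way in
        return unicodedata.normalize("NFC", utf8_bytes.decode("utf-8"))

    haystack = prepare("Motörhead played in Düsseldorf".encode("utf-8"))
    needle   = prepare("Düsseldorf".encode("utf-8"))

    # From here on any exact-width algorithm applies; str.find stands in
    # for Boyer-Moore in this sketch.
    print(haystack.find(needle))     # index in characters, not bytes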
Some text-intensive Plan 9 programs, including grep, internally use a fixed-width format called `Rune' for Unicode, exactly for performance reasons. UTF-8 input is translated into strings of Runes for processing and translated back for output.
For getting dtrace-like traces under Linux I strongly suggest getting oprofile (http://oprofile.sourceforge.net/news/) working on your target machine. I've used it on a PPC embedded board and it worked wonderfully. Do not optimize that which you have not measured.
Is this really a bug in grep, rather than a bug in Solaris's libc? I've never seen grep so slow, and I've been using UTF-8 locales for years.
I'm not denying that grep was buggy (there's a link to a bug in grep's bug tracker that was closed more than a year ago), but I'm surprised at the magnitude of the slowdown.
The official GNU grep used to be absurdly slow at UTF-8. Linux distributors very quickly noticed this and fixed it when they switched to UTF-8 by default. But GNU grep maintenance was essentially dormant for years and these patches were only integrated in 2010.
For an old, unpatched GNU grep a 2000x slowdown is quite believable.
The poster is using grep 2.5.3. Release notes for 2.6 talk about UTF-8 performance improvements:
"This release fixes an unexpectedly large number of flaws, from outright bugs (surprisingly many, considering this is "grep") to some occasionally debilitating performance problems."