This is not really a 2000x performance "win"; this is GNU grep(1) being so bad at dealing with UTF-8 that it is basically unusable.
The POSIX locale system is a nightmare, but the implementation in GNU coreutils is even worse: when it is not insanely slow, it is plain broken.
I wasted days tracking down a bug when we moved a script from FreeBSD to a Linux system that had a UTF-8 locale: even for plain ASCII input, GNU awk would match lines erratically unless the locale was set back to C! I'm sure this bug has been fixed by now, but it is far from the only time I have found serious flaws with UTF-8 support in GNU tools. It seems these days the only way to be safe is to set the locale to C at the top of all your scripts, which is quite sad.
This is also required because, due to POSIX locale insanity, the actual behavior of tools as basic as ls(1), and more fundamentally of regexp matching, changes based on the locale, making scripts that don't set their locale at the top unreliable and non-portable.
Or use a toolkit with decent UTF-8 support, like the tools from Plan 9, which one can use on *nix systems via Plan 9 from User Space ( http://plan9.us ) or 9base ( http://tools.suckless.org/9base ). If you use those, your scripts will also be much more portable, as there are all kinds of incompatibilities among, for example, awk implementations.
Minor correction: the "bug" in grep was fixed in GNU grep 2.7, not 2.6. I recall that because I wrote an article about it on my blog. See the release notes for GNU grep 2.7 here: http://savannah.gnu.org/forum/forum.php?forum_id=6521
When the author shows the output of "which grep", it shows the path "/opt/local/bin/grep"; "/opt" is typically used for additional software installations. I haven't used SmartOS myself, but my best guess is that someone installed GNU grep manually and the PATH just included "/opt/local/bin" before "/usr/bin".
> This is not really a 2000x performance "win"; this is GNU grep(1) being so bad at dealing with UTF-8 that it is basically unusable.
>
> The POSIX locale system is a nightmare, but the implementation in GNU coreutils is even worse: when it is not insanely slow, it is plain broken.
Actually, GNU grep as of last year was much faster than BSD grep, as long as the C locale is used:
The reason using UTF-8 makes it much slower is that dealing with Unicode correctly is a very expensive business. But pretending that languages other than English do not exist is not an answer.
The Plan9 grep deals with Unicode correctly and is very, very fast. Claiming that
> dealing with Unicode correctly is a very expensive business
is plain FUD.
UTF-8 has zero overhead in this case (not near zero, but actual zero overhead). Grep doesn't even need to be aware of UTF-8; it doesn't even need to know where a rune starts and ends in the byte stream. UTF-8 is designed in such a way that matching a regexp only requires byte comparisons, exactly as for ASCII.
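A quick way to convince yourself (Python used purely as an illustration; the strings are made up):

    # Pattern and text are both plain UTF-8 byte strings.  Because UTF-8 is
    # self-synchronizing, the encoded pattern can only line up at a real
    # character boundary, so an ordinary byte search is enough.
    pattern = "héllo".encode("utf-8")                # b'h\xc3\xa9llo'
    text = "say héllo to the wörld".encode("utf-8")

    print(text.find(pattern))                        # 4 -- a byte offset, on a rune boundary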
Unfortunately, it's not as simple as that. There is more than one way of representing many text sequences using Unicode; for example, combining diacritics vs accented characters. To do the Right Thing (tm), you need to normalize, and that adds at least another pass over the whole string.
If you just deal with it byte by byte, you're not actually Unicode compatible. It might be fast, but it's not Correct.
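For instance (Python purely as an illustration):

    import unicodedata

    precomposed = "caf\u00e9"        # "café" with precomposed é (U+00E9)
    combining   = "cafe\u0301"       # "cafe" + U+0301 COMBINING ACUTE ACCENT

    # Byte for byte these are different strings...
    print(precomposed.encode("utf-8"))   # b'caf\xc3\xa9'
    print(combining.encode("utf-8"))     # b'cafe\xcc\x81'
    print(precomposed == combining)      # False

    # ...but they are canonically equivalent, so a correct match has to
    # normalize both sides first (NFC here).
    nfc = lambda s: unicodedata.normalize("NFC", s)
    print(nfc(precomposed) == nfc(combining))   # True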
In addition, conforming Unicode implementations have to ensure that their input is well-formed Unicode and reject non-conforming sequences (for security-related reasons). So the original goal of UTF-8, that it could be usable for straight byte-level search (the UTF-8 RFC mentions the goal of allowing Boyer-Moore to be applied directly), is no longer really attainable.
It's not necessarily grep's responsibility to normalize strings. They could very well stipulate that it is the responsibility of the user to make sure that both the input and the search expression are already normalized. That way, the normalization could be done by a separate program, resulting in the reusability and separation of concerns that is The Unix Way.
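Such a normalizer could be tiny; a rough Python sketch (a hypothetical script, not an existing tool; it assumes UTF-8 in and out and picks NFC as the target form), meant to sit in front of grep in a pipeline:

    #!/usr/bin/env python3
    # normalize.py -- hypothetical filter (illustration only): read UTF-8 on
    # stdin, write NFC-normalized UTF-8 on stdout.
    # Usage in a pipeline:  normalize.py < input.txt | grep pattern
    import sys
    import unicodedata

    for line in sys.stdin:
        sys.stdout.write(unicodedata.normalize("NFC", line))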
I fully agree, if you redefine grep as a byte-search program rather than a text-search one, and rule it out for most languages apart from English.
To have any usable grep with applicability beyond programming languages, English-language log files, etc. (i.e. massive Anglocentrism), the normalizing would need to be built in. Sketched:
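Roughly something like this (my own Python sketch of the idea, hypothetical names; NFC chosen arbitrarily as the normalization form, and a real implementation would do this in C with a streaming normalizer):

    #!/usr/bin/env python3
    # ngrep.py -- sketch of a grep with normalization built in (hypothetical
    # tool, illustration only): bring the pattern and every input line to the
    # same normal form before matching, so "café" also matches the
    # e-plus-combining-accent spelling.
    import re
    import sys
    import unicodedata

    def nfc(s):
        return unicodedata.normalize("NFC", s)

    pattern = re.compile(nfc(sys.argv[1]))
    for line in sys.stdin:
        if pattern.search(nfc(line)):
            sys.stdout.write(line)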
> The Plan9 grep deals with Unicode correctly and is very, very fast. Claiming that
>> dealing with Unicode correctly is a very expensive business
> is plain FUD.
If you think this is true, you don't know Unicode well enough. I don't know how Plan 9 grep is implemented, but if it does this with near-zero overhead then it's not actually supporting Unicode, only a small subset of Unicode that they happen to like.
I'd probably also agree that that's the best part of Unicode and the rest of it could burn in a fire, but it's still not Unicode.
OP needs to upgrade grep to the latest version. He is using 2.5.3, and the latest version is 2.10.
Release notes for 2.6:
"This release fixes an unexpectedly large number of flaws, from outright bugs (surprisingly many, considering this is "grep") to some occasionally debilitating performance problems."
Adding to what you said, I also find Plan 9 regular expressions significantly easier to grok in your head than POSIX or PCRE regular expressions, which push a lot of complexity onto the user (PCREs are not even true regexps, as they are not regular).
Plan 9 regexps are very easy to reason about, more explicit, and don't look as much like line noise.
REs with backreferences are strictly more powerful than REs without them. Implying that PCRE is somehow inferior because it supports backreferences is quite FUDdy.
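To be concrete about that extra power: a backreference lets a PCRE-style pattern match the non-regular language of doubled strings, which no true regular expression can (a quick check in Python's PCRE-like engine; the flip side is that such patterns also bring backtracking cost with them):

    import re

    # (\w+)\1 matches a string followed by an exact copy of itself -- a
    # classic non-regular language, so it cannot be handled by a plain
    # DFA/NFA the way true regular expressions can.
    doubled = re.compile(r"^(\w+)\1$")

    print(bool(doubled.match("abcabc")))   # True
    print(bool(doubled.match("abcabd")))   # False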
UTF-8 regular expression matching shouldn't be different from ASCII at all, as far as I can tell. In UTF-8, every byte by itself can be identified as a start byte or trail byte, so if you want to match a regular expression, you don't even have to care about UTF-8 in any way. Any legal character in the regexp you want to look for can only match at a legal character start in the haystack.
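That property is easy to see from the bit patterns (a small illustration; the helper is mine):

    def classify(b):
        # Every UTF-8 byte declares its own role:
        #   0xxxxxxx  ASCII / single-byte character
        #   10xxxxxx  continuation ("trail") byte
        #   11xxxxxx  start of a multi-byte character
        if b < 0x80:
            return "ascii"
        if b & 0xC0 == 0x80:
            return "trail"
        return "start"

    for b in "aé€".encode("utf-8"):
        print(hex(b), classify(b))
    # 0x61 ascii, 0xc3 start, 0xa9 trail, 0xe2 start, 0x82 trail, 0xac trail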
> UTF-8 regular expression matching shouldn't be different from ASCII at all, as far as I can tell.
Not really. For a start, the '.' operator needs to work differently (see the sketch after this list). That can still be done fast, but if you actually want to claim Unicode support you should also consider:
- Unicode collation orders
- Unicode string equivalence algorithms
- Unicode normalization algorithms
This will also take care of all the Unicode weirdness that no one actually uses or cares about but that you still must implement to claim compatibility: presentation forms, combining diacritics, ligatures, double-width characters and other odd stuff.
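The '.' difference mentioned above is the easy part: matched over raw bytes it consumes one byte, matched over decoded text it consumes one character (Python purely as an illustration):

    import re

    s = "é"        # one character, two UTF-8 bytes

    # Over decoded text, '.' matches one character...
    print(re.match(".", s).group())                    # 'é'

    # ...over raw bytes, '.' matches a single byte, i.e. half the character.
    print(re.match(b".", s.encode("utf-8")).group())   # b'\xc3'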
You're both wrong. If you take an arbitrary byte in an ASCII stream you can be sure it is a whole char, but if you take an arbitrary byte in a UTF-8 stream you can get the start of a char, the middle or end of a char, or a whole char.
Obviously choosing an arbitrary offset in the byte stream might not land on a rune boundary, but UTF-8 is great precisely because you can tell. You can find the start/end boundaries in any slice of the byte stream, unlike with most other encodings.
That issue is completely unrelated anyway: matching regexps in UTF-8 text is the same as matching regexps in ASCII text. The regular expression tool doesn't even need to care that the text is UTF-8; it's just byte comparisons, and the tool doesn't even need to be aware of rune boundaries.
Don't some of the common cases of searching require the ability to quickly skip forward N characters for high efficiency? That's simple pointer arithmetic in the ASCII case, but requires reading each byte skipped in UTF-8, right?
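For example, skipping N characters in UTF-8 means looking at every skipped byte instead of just bumping a pointer; something like this sketch (the helper name is mine, not anything from grep):

    def skip_chars(buf, start, n):
        # Advance n characters from byte offset `start` in UTF-8 bytes `buf`.
        # Illustrative only: unlike ASCII, where this is just start + n,
        # every skipped byte has to be inspected.
        i = start
        while n > 0 and i < len(buf):
            i += 1
            # step over the continuation bytes (10xxxxxx) of the current character
            while i < len(buf) and buf[i] & 0xC0 == 0x80:
                i += 1
            n -= 1
        return i

    data = "naïve café".encode("utf-8")
    print(skip_chars(data, 0, 3))    # 4 -- byte offset of the 4th character, 'v'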
If it's just doing byte comparisons, then the tool isn't properly implementing http://en.wikipedia.org/wiki/Unicode_equivalence, unless it can assume that all input text was already normalized.
Just look into grep's sources: it has loads of low-level optimizations to make string search faster. But all these optimizations only work with fixed-width encodings; that's why it is related.
In case anybody else was wondering what LANG=C means,
> The 'C' locale is defined as the "default" locale
> for applications, meaning that their strings are
> displayed as written in the initial code (without
> passing through a translation lookup).
>
> There is nothing special about the C locale, except
> that it will always exist and applies no string
> replacements.
-Malcolm Tredinnick
As far as I know, grep uses Boyer-Moore for string matching when possible. Given a variable-width encoding such as UTF-8, plain Boyer-Moore isn't possible, even though it is the asymptotically fastest algorithm known.
Hence even a perfect grep will be slower by arbitrarily large factors, depending on the input.
So while there may be problems, expecting no difference, or no significant difference, between encodings is not correct either.
> Given a variable-width encoding such as UTF-8, plain Boyer-Moore isn't possible (...)
That you get UTF-8 input and produce UTF-8 output doesn't imply you are better off using UTF-8 for processing. Translating UTF-8 to fixed-width UTF-32 and back is of linear complexity and takes a small, fixed amount of memory. The only trade-off is when processing /very/ long lines -- up to four times more memory would be used for the buffer.
As mentioned in other posts, Unicode requires normalization of certain character combinations into other characters, so you'll be processing all input characters anyway; just prefix the decoding as an extra step, not even a separate loop.
And so you can do Boyer-Moore with Unicode at very little extra cost :-)
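A sketch of that approach (a Python 3 str is already a fixed-width code point sequence, so it stands in here for UTF-32 or Plan 9 Runes; a real grep would run Boyer-Moore over the decoded array):

    import unicodedata

    def prepare(utf8_bytes):
        # decode once to fixed-width code points and normalize on the way in
        return unicodedata.normalize("NFC", utf8_bytes.decode("utf-8"))

    haystack = prepare("Motörhead played in Düsseldorf".encode("utf-8"))
    needle   = prepare("Düsseldorf".encode("utf-8"))

    # From here on any exact-width algorithm applies; str.find stands in
    # for Boyer-Moore in this sketch.
    print(haystack.find(needle))     # index in characters, not bytes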
Some text-intensive Plan 9 programs, including grep, internally use a fixed-width format called `Rune' for Unicode, exactly for performance reasons. UTF-8 input is translated into strings of Runes for processing and translated back for output.
For getting dtrace-like traces under Linux I strongly suggest getting oprofile (http://oprofile.sourceforge.net/news/) working on your target machine. I've used it on a PPC embedded board and it worked wonderfully. Do not optimize that which you have not measured.
Is this really a bug in grep, rather than a bug in Solaris's libc? I've never seen grep so slow, and I've been using UTF-8 locales for years.
I'm not denying that grep was buggy (there's a link to a bug in grep's bug tracker that was closed more than a year ago), but I'm surprised at the magnitude of the slowdown.
The official GNU grep used to be absurdly slow at UTF-8. Linux distributors very quickly noticed this and fixed it when they switched to UTF-8 by default. But GNU grep maintenance was essentially dormant for years and these patches were only integrated in 2010.
For an old, unpatched GNU grep a 2000x slowdown is quite believable.
The poster is using grep 2.5.3. Release notes for 2.6 talk about UTF-8 performance improvements:
"This release fixes an unexpectedly large number of flaws, from outright bugs (surprisingly many, considering this is "grep") to some occasionally debilitating performance problems."