Any C programmer's spidey sense should tingle when they see a ternary comparison function implemented like this. Let's look at the canonical example:
int intcmp(int x, int y) { return x - y; }
It seems perfectly fine. After all, cmp functions are usually allowed to return any signed number, not only -1, 0 or +1. But what happens if x is INT_MAX and y is INT_MIN? You get signed integer overflow. That's bad. In theory, it's really, really bad because signed integer overflow in C is undefined, so it could launch nuclear missiles, segfault, or give a compiler like GCC total license to fuck with your code in all kinds of ways that are hard to imagine. In practice, on 2's complement machines, it's just really bad, since INT_MAX - INT_MIN wraps around to -1, which doesn't have the right sign.
Returning to Cliff's code, since he's only using the cmp function for equality and not ordering, the overflow problem doesn't manifest itself. If he were using it for ordering in a sorted array or search tree, you might think that this wouldn't cause a bug as long as all he needed was a consistent ordering rather than any particular one. Well, you'd be wrong, because intcmp violates transitivity. Mathematically speaking, INT_MIN+1 < 0 < INT_MAX. According to intcmp, INT_MIN+1 < 0 and 0 < INT_MAX but INT_MIN+1 > INT_MAX. This breaks the invariants of your sorting and search tree code, so all bets are off.
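To make the intransitivity concrete, here's a tiny test program (my own illustration, not Cliff's code). Note that the subtraction overflows, which is undefined behavior; the results in the comments are just what a typical 2's complement machine gives you, not anything the standard promises.
#include <limits.h>
#include <stdio.h>
static int intcmp(int x, int y) { return x - y; }
int main(void)
{
    printf("%d\n", intcmp(INT_MIN + 1, 0) < 0);       /* 1: says INT_MIN+1 < 0 */
    printf("%d\n", intcmp(0, INT_MAX) < 0);           /* 1: says 0 < INT_MAX */
    printf("%d\n", intcmp(INT_MIN + 1, INT_MAX) < 0); /* 0: says INT_MIN+1 > INT_MAX */
    return 0;
}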
The solution? Do the simple thing:
int intcmp(int x, int y) { return x < y ? -1 : x > y ? +1 : 0; }
Your compiler can turn this into a few branchless instructions with a single CMP.
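An equivalent trick, if you prefer it: the two comparisons each evaluate to 0 or 1, so their difference is exactly -1, 0, or +1, with no possibility of overflow. Whether this or the ternary version compiles to better code depends on your compiler and target.
int intcmp(int x, int y) { return (x > y) - (x < y); }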
Signed arithmetic is a minefield in C. Here is a case that also explains why, for simplicity, my intransitivity example used INT_MIN+1 rather than INT_MIN:
// buffer size must be >= ceil(log10(INT_MAX)) + 2
void itoa(int x, char *buf)
{
if (x == 0) { *buf++ = '0'; *buf = 0; return; }
if (x < 0) { *buf++ = '-'; x = -x; }
// handle the x > 0 case.
}
This seems like a solid algorithmic strategy for this problem: reduce to the positive case. Unfortunately things go very badly if x is INT_MIN. You get another signed integer overflow which in theory is wholly undefined by C. But in practice, with 2's complement arithmetic, you get x == -x. (As a corollary, the antisymmetry property intcmp(x, y) = -intcmp(y, x) fails when e.g. x = INT_MIN and y = 0, so we lose both transitivity _and_ antisymmetry.)
Anyway, the itoa strategy is great on an algorithmic level, but the implementation fails due to messy reality. The proper implementation uses a cast to an unsigned variable without any explicit negation (a minimal sketch follows the quote below). This obviously works on 2's complement machines, and Harbison and Steele's C: A Reference Manual (p. 190) suggests it is well-defined and portable:
"Except for the type _Bool, the general rule for converting from one integer type to another is that the mathematical value of the result should equal the original mathematical value if that is possible. For example, if an unsigned integer has the value 15 and this value is to be converted to a signed type, the resulting signed value should be 15 also.
If it is not possible to represent the original value of an object of the new type, then there are two cases.
If the result type is a signed type, then the conversion is considered to have overflowed and the result value is technically not defined. If the result type is an unsigned type, then the result must be that unique value of the result type that is equal (congruent) mod 2^n to the original value, where n is equal to the number of bits used in the representation of the result type. If signed integers are represented using twos-complement notation, then no change of representation is necessary when converting between signed and unsigned integers of the same size. However, if signed integers are represented in some other way, such as with ones-complement or sign-magnitude representation, then a change of representation will be necessary."
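For the record, here's a minimal sketch of the cast-to-unsigned approach (my own code, so the details may differ from what the parent has in mind; in particular I do negate, but only in unsigned arithmetic, where wrap-around is well-defined):
#include <limits.h>
void itoa_u(int x, char *buf)
{
    char tmp[sizeof(int) * CHAR_BIT]; /* more than enough room for the digits */
    int i = 0;
    unsigned int ux = (unsigned int)x; /* conversion is defined: reduction mod 2^n */

    if (x < 0) {
        *buf++ = '-';
        ux = 0u - ux; /* unsigned negation wraps, giving the magnitude even for INT_MIN */
    }
    do {
        tmp[i++] = (char)('0' + ux % 10u);
        ux /= 10u;
    } while (ux != 0u);
    while (i > 0)
        *buf++ = tmp[--i];
    *buf = '\0';
}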
Unsigned arithmetic has well-defined semantics in C (wrap-around on overflow, etc). Unfortunately it has other legacy issues with things like mixed signed/unsigned comparison: if x is an unsigned int then x > -1 is always false rather than being always true as you'd expect. In practice I use signed integers for everything, even variables subject to non-negativity invariants, except when I need to rely on wrap-around or bitwise operations.
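A two-line demonstration of that trap, in case it sounds implausible (most compilers will at least warn here, e.g. gcc with -Wsign-compare):
unsigned int x = 5;
if (x > -1) /* the -1 is converted to unsigned, i.e. UINT_MAX, so this is never true */
    puts("never reached");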
In summary, all this stuff is hard to get right. It's silly that I have to carry around all this esoterica in my head to write correct code. It doesn't help that compilers like GCC are increasingly taking advantage of theoretically undefined behavior in people's code to create (often very marginal) opportunities for optimization; if you want to be scared shitless, read this presentation: https://www.securecoding.cert.org/confluence/download/attach...
Can you imagine that "20 years ago" (too early for most of us to have been programming at this level) you'd have missed this, thinking it couldn't possibly be anything but 32 bits for the purpose it was written for?
That said, never assume, because "dict.cpp" stuff happens all the time; its modern web incarnation is usually called "copy-pasta."
I love cperciva's solution. OK, not a solution, but at least paranoia. Back when I worked at a place called Morada in the early 90s, we had to port a C runtime library for the product to everything from 16-bit x86 "real mode" (DOS) to the 64-bit DEC Alpha, so word sizes were all over the map.
Trying to write portable size definitions in C can make you crazy (e.g., trying to save a tokenized binary file from source code such that the binary ports between platforms).
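These days C99's <stdint.h> takes most of the pain out of that particular problem; back then you rolled your own equivalent by hand with a thicket of #ifdefs. A sketch of the modern version (the typedef names are just illustrative):
#include <stdint.h>
typedef int32_t  token_offset_t; /* exactly 32 bits on every conforming platform */
typedef uint16_t token_flags_t;  /* exactly 16 bits, so the on-disk format stays stable */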
There had been the shift from 16->32 bit not long before that, so it's not that much of a stretch, and 8->16 before that (well after I got my first computer).
We found an almost identical bit-mismatch bug in a programming language; it led a bitmap of TCP connections to appear to go crazy on a customer install of a core piece of software. These bugs cost.
Yeah, in that time period you couldn't really get away without writing really defensively. It was even worse earlier. When I wrote C code for x86 there were all sorts of different memory models you could choose from at compile time, and we were producing both 16-bit and 32-bit (for the daring) versions of our product. You couldn't assume anything.
I write Java now, and this is not something I miss.
64-bit architectures were around 20 years ago. It was pure luck (whether good or bad I'll leave up to the reader) that it didn't get exercised enough to show up until now. The code invoked undefined behavior which just happened to work on the architecture it was targeting at the time. That's still a bug, even if the result works as expected, precisely because it could stop working as expected later.
20 years ago computers were a lot slower, so some tradeoffs into "undefined behavior that happens to work quickly" were fairly common practice. Personally I prefer to call such things landmines rather than bugs, because they were often left there deliberately.
I guess that's practically true, but the original code still violated the constraints of the abstract machine for which it was written (it assumed pointer sizes valid only for particular implementations of C++, etc.).
There have been significant 36-bit and 60-bit machines since at least the 1960s and 64-bit machines since the 1970s. Notably, Digital's PDP-10 line that ran the early internet was 36-bit, and Crays (from 1977) were 64-bit.
However, none of those machines had 64-bit addresses. I think the Alpha may have been the first mainstream machine with 64-bit addresses, and it didn't come out until 1992, only 19 years ago. (I don't remember if the i860 had 64-bit addresses or not, but it came out a few years earlier.)
Typically, at the assembly level, doing a "compare" will change the contents of some CPU flags, but not registers. Normally, those CPU flags would then influence future instructions - most often a "GOTO if this flag is set" instruction.
In this case, you want to return the result of the comparison from a function. Since functions return values in registers, you'd need to do some extra work to copy the result into a register from the flag.
Between compiler optimisations, and the cost of your function call in the grand scheme of things, the difference is likely negligible. This may, or may not, have been the case in 1991. Also, casting to an integer pointer rather than a char pointer means the code has to divide by 4 at least once. (He did write this "optimisation" as an undergraduate!)
EDIT: Actual assembly comparison, from gcc -O3 (Apple's 4.2.x). subl (subtract) and cmpl (compare) take the same amount of time. But then _compare has to do a setne to store the "not equal" flag into a register, then a movzbl to expand that result from 8 bits (which is what setne writes) to the full size of an int.
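For reference, the two C functions being compared look something like this (my reconstruction; the exact code and instruction sequences depend on the compiler and target):
int subtract(int x, int y) { return x - y; }  /* one subl, result already in a register */
int compare(int x, int y)  { return x != y; } /* cmpl, then setne + movzbl to materialize the flag */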
Different calling conventions. The calling convention for my target (32-bit x86 Mac OS X) says that all arguments have to be passed in on the stack. For your 64-bit target, the arguments go into registers.
I think the point is that he needed a comparison function, not an equals function, thus the subtraction. Eric Lippert had a nice series on subtle bugs in compareTo implementations; in part 3 he talks about a similar version of this bug. http://blogs.msdn.com/b/ericlippert/archive/2011/01/27/spot-...
tl;dr: if you aren't doing a comparison in your compareTo implementation, you probably have a bug.
You'd think the first thing someone would do is grep for all typecasts. hmm
One thing you can do to make your typecasts easier to find, assuming you're using C++, is to use static_cast. I know it's ugly, but searching for static_cast is much easier than searching for all the possible ways you can write a C-style cast.