Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Here's my worst:

We had a function that looked like this:

  void f() {
    bool flag = true;
    while (flag)
      g();
  }
This function exited sometimes, which should not be possible.

You say, "Duh, Mr. AnimalMuppet, clearly g is smashing the stack!" But the return address wasn't getting trashed - just the value of flag.

This bug also would disappear if you did something like, for example, print out the address of flag so you could watch it in a debugger.

I chased this bug for a month, off and on. Finally I got desperate enough to print out the assembly produced by the compiler, and things got clearer.

flag was a register variable. (This was gcc compiling for an ARM CPU, by the way.) It lived in R11 (or maybe R12, it's been a long time). When f called g, it just pushed the return address on the stack. But g was going to have its own value to put in the R11, so it pushed R11 onto the stack just before allocating space for its own variables. So f's local variable wound up in g's stack frame...

... and g was smashing the stack. Duh.

In particular, g was using queues (msgsnd and msgrcv) to exchange messages with another CPU. The API here is misleading. These functions expect to send a packet of the form

  struct msgbuf {
    long mtype;
    char mtext[1];
  };
and they take a size parameter, among others, because you actually pass in a different structure, with an mtext array big enough to carry your actual information. But the size you pass to the function is expected to be the size of the actual mtext array, not the size of the actual structure (that is, the size doesn't include the size of the mtype field).

The contractors who wrote the code didn't know that. They assumed that they could just pass sizeof(type_similar_to_msgbuf) into msgsnd/msgrcv, and it would work. In fact, they were sending and receiving 4 bytes too many. Which kind of worked, in that the sender and receiver never got out of sync. But when the receiver got four bytes too many, it trashed what was just past it on the stack, which turned out to be the stored value of R11.

The net result was that the flag would get cleared if, on a different CPU, four bytes of unrelated memory were zero.



Well you didn't mention valgrind, but this is the kind of thing it should have caught. In fact if your programming in C/C++ running the application with various valgrind, glibc MALLOC_CHECK_, boundschecker, purify, etc tools should be done on a regular basis as part of your integration testing.

For sure boundschecker saved myself and various coworkers days of debugging a couple companies back. Its one of those tools that pays for itself the first time it finds a truly evil bug.

Also, POSIX is full of gocha's (personally I think its worse than W32) just waiting to catch the unwary. FD_SETSIZE with select() bit me hard a few years back by causing a bug that only happened when the file descriptors eventually got > FD_SETSIZE, but that was hardly the worst bug I've found.


valgrind wasn't our first choice, because we were in embedded systems, and valgrind slows things down so much that our timing gets destroyed and the behavior of the system changes. Major parts of the system behavior would not work right under valgrind. So we didn't tend to run it very often.


Psst…I'd mark that flag volatile or atomic if I were you; a smart C11 compiler might mark your loop as terminating ;)


It lives on the stack, it definitely isn't volatile.


Wait, why would it do that?


From a recent (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf) draft, section 6.8.5, Iteration statements:

> An iteration statement whose controlling expression is not a constant expression, that performs no input/output operations, does not access volatile objects, and performs no synchronization or atomic operations in its body, controlling expression, or (in the case of a for statement) its expression-3, may be assumed by the implementation to terminate.


For that condition to be met, the compiler would have to prove that g() doesn't do IO, etc - but the OP said it calls msgrcv(2), so it's not an issue.


Ah yes, our old friends of impossible bugs: compiler optimizations, disappearing when observed, and triggered far, far away. Yikes, man.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: