Hacker News
Programmer Gore (jgc.org)
51 points by jgrahamc on July 15, 2010 | hide | past | favorite | 15 comments



Sometimes, usually when I'm in a hurry, I skip the 'getting to the root cause' step, and this has bitten me badly on more than one occasion.

So now, if I can afford it, I really want to know where things went wrong. That usually means a longer time-to-fix, but what's fixed that way usually stays fixed. The 'band-aid' type fixes tend to lead to subtler problems that are harder to fix later on.

If it breaks, please let it break now, and in as spectacular a fashion as possible, without any band-aids; that way we stay away from the kind of bugs that only happen during a full moon and an Eastern wind.

Which reminds me, I really should have a look at the guts of some software that currently runs in a wrapper script because it crashes once every 3 months or so without any apparent cause.


There's some good advice in chapter 8 of Code Complete on defensive programming. One example given there is the C function realloc, which resizes a block of memory and can sometimes move the whole block to a new, larger one. Since intermittent bugs are indeed the worst kind, Steve suggests making the debug build's memory allocator always move the block, so that the 'block moved' code path is exercised everywhere during testing.

Edit: Wrong Steve and wrong book -- it's a running example in "Writing Solid Code" by Steve Maguire.
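A minimal sketch of that trick, assuming allocations are routed through your own wrapper (the names here are made up, not from either book; unlike realloc, this sketch makes the caller pass the old size, where a real wrapper would record it in a block header):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical debug-build replacement for realloc: it *always* moves
 * the block, so any code that keeps using the old pointer fails on
 * every test run, not just on the rare runs where realloc happens to
 * move the block. */
void *debug_realloc(void *old, size_t old_size, size_t new_size)
{
    void *fresh = malloc(new_size);
    if (fresh == NULL)
        return NULL;
    if (old != NULL) {
        size_t n = old_size < new_size ? old_size : new_size;
        memcpy(fresh, old, n);
        /* Scribble over the old block before freeing it, so stale
         * reads return obvious garbage instead of the old contents. */
        memset(old, 0xDD, old_size);
        free(old);
    }
    return fresh;
}
```

In a release build the wrapper would just forward to realloc; the whole point is that the debug build turns a sometimes-bug into an every-time bug.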


That's a good trick, regardless of which book it came from.

In the software I wrote about above I suspect a very subtle resource leak.

A nice example of such a leak is forgetting to close an open file descriptor when some other, rare error condition occurs elsewhere in the code. (Not that that's necessarily it, but that's how you get to the point where something runs for months on end without crashing and then suddenly does.)

File descriptor leaks can be traced relatively easily using lsof, by the way (one of the step-child utilities that really should be in every coder's toolbox, right next to gprof and make).
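A minimal sketch of that kind of check, assuming a hypothetical server running as pid 1234 (the `awk 'NR > 1'` just drops lsof's header line):

```shell
# Count open descriptors; a count that climbs steadily between samples
# is the classic signature of a descriptor leak.
lsof -p 1234 | awk 'NR > 1' | wc -l

# Group by descriptor type and name to see *what* is leaking:
lsof -p 1234 | awk 'NR > 1 { print $5, $9 }' | sort | uniq -c | sort -rn
```

Run the first command a few times, minutes apart; if the number only ever goes up, the second command usually points straight at the culprit.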


Good tip; I believe Guard Malloc does this for you.


The ITIL framework separates the duties of finding an immediate workaround and diagnosing the root cause of one or more incidents. I can see some value in not letting those two tasks flow into each other without thinking about it.


Finding the root cause of failure is essential, especially when you are working with large codebases.

For my biggest running project, I customize about 1GB of source code not written by me. Every bug needs to be chased until you actually understand why it happened; otherwise it's too risky to just apply a patch that "seems" to fix it.

Plus, in the process you usually learn a new and interesting thing about a previously unknown part of the codebase.

Of course, very few customers actually understand the importance of this and have the budget to allow you this "luxury".

Fixing the bug on most of the servers and leaving a small fraction running the old code to investigate the bug further sounds interesting, but I couldn't do that.


A gigabyte of source code!? That's the entire depot, not just one tag, right?


Actually, it's just about 800MB of source code with 150MB of external dependencies. I never counted the lines of code until today but apparently just for the main language there are over 9 MLOC.

Of course, this doesn't actually matter as I don't touch most of the code and the "recently active" part seems to measure only about 275 KLOC. But when a bug arrives it might take me anywhere; even so, I don't think I've visited more than 35% of the codebase.


The Linux kernel currently is around 450MB of source code, and I've worked on individual projects of 1-2 MLOC, which I suspect is about 100MB of code each. A couple of those together don't make these figures seem all that implausible.


Hmm

    yulia:/usr/src# bzcat linux-source-2.6.32.tar.bz2 |wc -c
    382382080

382M. Though is it really one "thing"? A lot of that is device drivers, loadable modules, etc. that no single install will ever have.


I guess my 450MB are from 2.6.34 and are the disk usage of the actual untarred sources that I'm working from, so my figure includes cluster off-cuts (the bytes needed to round each file up to the nearest 4096).

It's certainly one thing in that it all uses the same build script; and any of the pieces are pretty worthless on their own.
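The gap between the two figures can be seen directly with GNU du, assuming a Linux system and a hypothetical untarred linux-2.6.34/ tree:

```shell
# Sum of the files' byte lengths -- roughly what `bzcat ... | wc -c`
# measures on the tar stream (minus tar's own per-file headers):
du -sh --apparent-size linux-2.6.34/

# Blocks actually allocated on disk; each file is rounded up to a whole
# filesystem block (typically 4096 bytes), so this comes out larger.
du -sh linux-2.6.34/
```

With hundreds of thousands of small files, those per-file round-ups alone can plausibly account for tens of megabytes of the difference.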


Is there a bash.org-style site for programmer horror stories? They pop up once in a while on HN, but it would be nice to have a single-service site for these kinds of stories.



Great article, but for some reason when I saw the headline I thought it was going to be about Al Gore.


I thought this was going to be about how Al Gore programs his own climate change models.



