Ask HN: What was the worst bug you've ever solved?
76 points by jacquesm on Oct 9, 2009 | 125 comments
What is the worst bug (the software kind) that you've encountered in your hacking career?



Back when text messaging capability was a rarity on mobile phones, which were themselves rare, I was testing an SMS-based weather forecast service that I had written on behalf of one of the mobile network operators.

The testing worked well on the emulator so I decided to test it over the public network to an actual handset. Only I forgot to advance a recordset through which I was looping, so the code never hit the end of recordset condition. It took me some time to notice there was a problem...

The fact that I crippled a national SMS network for a few hours was bad.

The fact that my company had to pay for each SMS, wiping out our profit for that month, was worse.

The fact that the handset was mine, and that on my first date later that evening with a girl (whom I later married) it kept beeping with incoming text messages (about 96,000 of them, if I remember), was the ultimate.

The handset didn't have a silent, no-vibrate mode (it either beeped, or vibrated, or did both), and the SMS inbox filled up after 200 or so messages, so for days the inbox would fill up, I would clear it message by message, and then it would fill up again, ad nauseam.

Still, I laugh about it now...


You couldn't just turn it off for the duration of the date?


One of my colleagues did this with our automated notification system. After his phone received about $60 worth of text messages, he panicked and shut down the server!

Then again $60 is only about 1200 messages.


> Then again $60 is only about 1200 messages.

You must not be in the US..


Yeah, that's like $10,000 to $20,000 in text messages.


What? Are you saying it costs nearly $10 per message or more? How can 1200 messages cost $10K - $20k?


That's what we call "speaking in hyperbole"


I was using the OP's count of 96000 messages.


I am; he had a package where each message cost him 5c.


I inherited a giant hideous stock-management system. It did a certain amount of automated ordering without manual intervention.

Long story short: a nasty race condition meant that it was over-ordering duplicate products from the suppliers to the tune of tens of thousands of dollars per day.

On the general theme, my most frightening software experience was when I met a guy who was the star programmer for a company doing controllers for elevators. I got talking to him, and he showed me some code. It took me about 15 seconds to identify an edge case that would engage the motor while the elevator was at the top floor (thereby attempting to pull the elevator into the roof). It took me a further hour to explain to him what the edge case was. The bug wasn't that scary, as I'm sure there are hardware failsafes, but the general dimness of the guy writing software to control lifts was scary.

I started taking the stairs after that.


In high school I got a job for my local Department of Public Works, in the Power division. I lived in a small New England town that did its own power, much like most towns do their own water and sewer.

My job was as an assistant to the inventory guy, a feisty 70-year-old man with one hand, named Al.

I was often bored when Al would tell me to 'go hide somewhere' so I wrote some software to help him manage the inventory system. The power engineers in charge saw this and after a few small programming assignments had me work on updating the newly installed SCADA control system. This was a specialized programming environment that controlled all the power in the town.

We were setting it up to buy power from the local college during the yearly 'peaks' in August, thus potentially reducing our annual electrical bill by millions of dollars.

After a month of working on it and incrementally adding my changes I screwed up. I knew this when I submitted a change and all the alarms went off at the substation.

Half the town's power was out. I got it back on after an hour, and nobody called with ventilator issues, so I think there was no real harm done.

The engineer in charge of the department laughed it off when he saw my apprehension about the situation. He said in the grand scheme of things they have made far bigger mistakes than that, probably referring to the blown transformer a couple months earlier.


That must have been one scary hour.


A batch run of only a few thousand items was running all night long, rarely finishing and causing all kinds of problems when people logged in in the morning. The users had been complaining about this for years.

I was given the ticket and found a "SLEEP 10" (causing a 10 second pause for each item) in the 10 year old BASIC code, put in by the original programmer for debugging purposes, and never removed before it was promoted.

I removed the "SLEEP 10" and run time went from 12 hours to 23 seconds.

The users loved me, but my boss was not pleased. He said, "You should have changed it to a SLEEP 5 so we had something else to give them the next time they complained."


Perhaps it was intentional on the original programmer's part... but he never got round to reducing the sleep time - http://thedailywtf.com/Articles/The-Speedup-Loop.aspx


Reminds me of back in the old days, writing integer assembly-language (and later C) apps for microcomputers. As FPUs started to become available, users complained that they had spent all this money on an FPU and got no speed improvement.

I half-seriously suggested adding FPU detection and intentional delay loops if no FPU was present.


Not so much a bug, but this incident cut my lifespan by about 5 years or so:

I was fixing an account balance on a customer's master database. Any change that happens on the master gets replicated to 30 branches, usually at 10 minute intervals.

I wrote the UPDATE statement, highlighted it, and pressed "Execute". Unfortunately, I didn't select the WHERE clause, so I basically gave all of their customers (85,000 of them) a 50c credit balance. The IDE I was using also had a bug that caused it to ignore the auto-commit setting (which was turned off), so it committed the transaction. First I tried a ROLLBACK, which obviously failed. Realizing I had to act quickly, I disconnected the network cable (I was working on the DBMS server) to stop replication. I extracted the transaction log from the current database into a text file (a few hundred MB) and restored the database from the most recent backup. Then I ran the extracted transaction log as SQL scripts against the database, hoping it wouldn't fall over. I didn't want to do a normal restore because I was afraid of it showing up in the logs.

Within a very stressful 60 minutes I had everything back to normal. I never told a soul IRL about it.


This is one of my pet gripes with SQL: the default is to affect all records on UPDATE. It should have been:

  UPDATE tablename SET xxx='newvalue' WHERE ALLRECORDS;

Or something to that effect.


MySQL has --safe-updates (aka --i-am-a-dummy), which will refuse to execute UPDATE and DELETE statements without a key in the WHERE clause.
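
It can also be enabled per session; a minimal sketch (table and column names are made up):

  -- refuse UPDATE/DELETE without a keyed WHERE clause from here on
  SET SQL_SAFE_UPDATES = 1;

  UPDATE customers SET balance = 0.50;  -- now rejected with an error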


Would you feel the same if in Unix the default for 'rm' was 'rm -rf *'?


I had one of these too. I had to UPDATE a few customers' order logs on a production server. The database was huge (several million rows).

After a few seconds I thought it was taking too much time only to realise I forgot one AND in the WHERE... I struggled with the backup but managed to get the data back just after the first call.


I got into the habit a long time ago of writing my WHERE clause first, and then generally writing the rest as a SELECT. Then I convert the SELECT into an UPDATE or DELETE. I don't always do the SELECT thing; it depends on what part of the DB I'm working on. If it's a pretty complex statement, I tend to write the SELECT just to make sure. I have a guy who is better than me at SQL, but he still hasn't taken my advice, and every now and then I hear cussing coming from down the hall. He is good at backing things up, but his process takes a lot of time when he messes up. I prefer the ounce-of-prevention path.


I do almost the same thing:

I first write out a SELECT that will show me the records I'm about to modify, then replace the SELECT + field list with the UPDATE. And I try to keep modifying live databases via interactive SQL to an absolute minimum.

Check three times, then hit enter.
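
For illustration, that workflow in SQL (hypothetical table and column names):

  -- step 1: verify the WHERE clause hits exactly the rows you mean
  SELECT account_id, balance FROM accounts WHERE account_id = 12345;

  -- step 2: only then swap the SELECT list for the UPDATE
  UPDATE accounts SET balance = 0.50 WHERE account_id = 12345;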


I've done almost this exact thing. The "run only the SQL that's highlighted" feature is handy, but it can cause serious bad juju on a production box.

Unfortunately, in my case, I ended up having to rebuild all the data from week-old backups and transaction logs.


For a different definition of worst:

I started a job recently at an ecommerce company. There was a long-standing bug with the cart display in the upper right of the page always saying that the cart was empty. People would report it all the time, and the quite smart lead programmer said it was something really complicated that he hadn't had time to investigate.

But eventually, after I sort of knew my way around the code, and when I finished up all the tasks on my to-do list, he handed me that as a why-not-investigate sort of thing. He didn't really have any idea, just that the previous long-gone coder had said it was complicated and in the depths of the way the front end code interacted with the order system.

So, I reproduce it, and look at the template code. These two lines, right next to each other:

  [% cart_summ = ourdb.cart_summary %]
  [% IF cart_cumm.qty > 0 %]

Note that the two variable names don't match. And this was broken on the site for _FOUR YEARS_. And nobody looked, because someone had said he had and that it was hard, and nobody had time for a hard problem.

facepalm


hehe. reasons to develop YOUR_LANGUAGE-lint?


Some users of a (shipped, fairly heavily used) web app we had deployed were getting kicked back to the login screen at random - sometimes very frequently.

Looking in the logs we could see that these users were somehow losing their authentication cookie and the application was correctly bouncing them to login. So how were they losing their cookie? Assuming it was a bug in the code we searched and searched to no avail.

Finally I discovered that the hardware load balancer our CTO/'IT guy' had insisted on was the culprit. The load balancer would buffer fragmented requests and re-assemble them before sending them on to the server. Unfortunately, the load balancer had a huge bug in its firmware.

If a user was using Firefox on Windows, and their request was fragmented such that the first packet contained data up to the end of a header line including the \r but not the \n (so the next packet would start with a \n and then a header name), the load balancer would insert a \n\r between the two packets, thus effectively truncating the HTTP headers, usually before the cookie lines.

When I found this bug I couldn't believe it was actually happening; I thought I was taking crazy pills. But you could run a sniff on the front and back sides of the load balancer and see the request go in well-formed and come out busted. We ditched the hardware load balancer and all was well.


I was converting an avionics subsystem from Ada to C. It was a client application that had to talk to an Ada server, sending and receiving rather huge chunks of data: large, deeply nested, intricate structure types. The C structure type had to match the Ada type exactly, or else it wouldn't work.

I got it working fine on our desktop simulation, but running on the actual hardware it was consistently off. After extensive testing, I realized that it was a bug in the compiler for the target hardware, such that a very particular type of structure (something like, {int, char, float}) was being packed incorrectly, resulting in a 2-byte pad that shouldn't be there. If I reordered the structure elements, it was fine, but that particular grouping and order refused to work correctly.

It was GCC, so we could fix the compiler ourselves, right? Not really, as, for avionics systems the compiler has to be thoroughly qualified for avionics use, and changes equal requalification. I "fixed" it by storing the float as an array of characters, converting it to and from a real float type as we needed to use the data value.

Trivial, perhaps, but I was very excited to resolve the problem, after spending days barking up wrong trees. One usually expects that the problem is not in the compiler... :-)


Were you using GCC's __attribute__((__packed__))?

Anyway, the standard way (to handle protocols) is to parse the thing without making any assumptions about the struct layout.


Yes. The other structures all packed correctly.

If I am understanding what you are saying, we really couldn't do that, as the server sent and expected to receive binary blobs of data; the only way to know what was what was to have a map of where the data elements were.


I like to use macros (in C) to spit out the structure size and offsets of each structure member over a serial port, in a format I can then cut-n-paste back into the source. This output consists of a bunch of (compile-time, when possible) assertions so that any changes to a structure break the build. These assertions go on both the embedded side and the PC server side, so any weird packing issues show up at compile time.
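
A minimal sketch of that technique in C (the struct is hypothetical, the asserted offsets assume typical 4-byte alignment, and since this predates C11's _Static_assert it uses the negative-array-size trick):

  #include <stddef.h>

  #define CAT_(a, b) a##b
  #define CAT(a, b)  CAT_(a, b)
  /* compilation fails if cond is false: char array of negative size */
  #define BUILD_ASSERT(cond) \
      typedef char CAT(build_assert_, __LINE__)[(cond) ? 1 : -1]

  struct Msg {          /* hypothetical wire structure */
      int   id;
      char  flag;
      float value;
  };

  /* both sides assert the exact layout they expect on the wire */
  BUILD_ASSERT(offsetof(struct Msg, flag) == 4);
  BUILD_ASSERT(offsetof(struct Msg, value) == 8);
  BUILD_ASSERT(sizeof(struct Msg) == 12);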


Bug with the most spectacular results:

As a (former) hardware engineer, I've worked on many projects where bugs have physical effects. This can range anywhere from amusing to seriously dangerous.

One such bug involved a mistake in the assembly diagram and silkscreen for a circuit board. The result was that a tantalum capacitor was installed backwards on a 12V supply rail.

Tantalum capacitors are polarized, and they fail in a spectacular way when reverse-biased. In this case, the supply rail could source upwards of 20A, so the fireworks were loud and impressive. Luckily the cap was easily replaced and the only permanent damage was cosmetic.

Hardest-to-troubleshoot bug:

In my subsequent return to the world of software, I worked on device drivers for network interfaces (among other things).

NICs frequently operate through a circularly-linked list of packet descriptors, which contain pointers to buffers in RAM where the NIC can DMA packet data. The hardware fills the DMA buffers and marks the descriptor as "used," and the driver chases the NIC around the ring, processing the packet data and marking the descriptors as free.
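
A minimal sketch in C of what such a descriptor might look like (field names are made up; every NIC has its own layout, and real rings use physical addresses):

  #include <stdint.h>

  #define DESC_DONE 0x1            /* set by hardware after filling the buffer */

  struct rx_desc {
      uint32_t        buf_addr;    /* address the NIC DMAs packet data into */
      uint16_t        buf_len;     /* size of that buffer */
      uint16_t        flags;       /* DESC_DONE, error bits, etc. */
      struct rx_desc *next;        /* next descriptor in the ring */
  };

  /* driver side: chase the NIC around the ring */
  void poll_ring(struct rx_desc **cur) {
      while ((*cur)->flags & DESC_DONE) {
          /* ... process the packet at (*cur)->buf_addr ... */
          (*cur)->flags = 0;       /* hand the descriptor back to the NIC */
          *cur = (*cur)->next;     /* a corrupted next pointer here is the bug */
      }
  }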

In testing, we discovered that under long periods (hours, usually) of heavy load, the system would occasionally freak out and stop processing packets. Sometime later, various software modules would crash.

Working backwards through the post-mortem data, I saw that the NIC would get "lost" and dump packet data all over system memory. I dumped the descriptor ring (tens-of-thousands of entries) and wrote some scripts to check it for consistency.

To make a very long story short, when the NIC was stormed with lots of 64-byte packets with no gaps, it would eventually screw up a DMA transfer and corrupt the "next" pointer in the descriptor ring. On the subsequent trip through the ring, the NIC would chase the errant pointer off into system memory and corrupt other system data structures.

Since hardware can DMA anywhere in RAM, the OS is powerless to stop it. The resulting errors can be ridiculously hard to track down and fix.


Had an obscure picking ID wrap around because a table wasn't being cleared (for debugging purposes), which resulted in excessive amounts of beer being delivered to unsuspecting customers at an automated gas station.

Here's a video of part of the result: http://www.youtube.com/watch?v=RUhLDtPnSuQ


A bug that gives out free beer. Talk about a bug with a silver lining :)


If you're a customer, that's a feature not a bug!


Runaway robots at Anybots have caused:

  - 2 holes in drywall
  - 1 bent bookshelf
  - 1 dent in concrete floor
  - 1 frightened Jessica
  - http://www.youtube.com/watch?v=qkenIInV9rI
The last one was fun because I have logs showing packets from the PC/104 computer stack (running FreeBSD) connected to the robot while it was in midair.


How do you dent a concrete floor with a robot?


Monty weighs 160 lbs, and at the time had an over-designed metal piece for its neck. A sensor failed and it faceplanted, taking a 1 cm deep chunk out of the concrete. It's still there, under some carpet.


Sure doesn't look that heavy; maybe I'm misjudging the scale here. I'll have another look at that video.

Nice balancing, by the way - smooth damping.

Edit: Ah yes, now I see it; that's a small scope and a lab power supply in the background. I estimated the size off by a lot.

I figured the whole thing was about a foot tall or so, sorry!


The most memorable bugs are the ones that cause physical damage. This was mine:

http://www.youtube.com/watch?v=b7i2KkYYulI

Damage: Blown tire, dented rim, looking like fools in front of our peers.


Anytime the effects of code escape into the real world, the result is far more interesting.


Um.. what was that? Some autonomous vehicle navigation system?


Darpa Urban Challenge, top right of the video.


What was the bug?


Depends on your point of view. :-) Either:

(a) The CAN bus (which reports wheel velocity and position of the steering wheel among other things) micro-controller hardware stopped sending interrupts which caused the main computer to think we were stopped. As far as I can tell this is just a straight up hardware bug with the Philips ARM chip we were using. This caused the accelerator controller to floor it because it was only seeing the last CAN message we ever got which happened to be zero velocity. Same thing with the steering (hence the big swerve).

or

(b) I failed to consider the contingency of not getting any CAN interrupts (either because the of the [very intermittent] hardware bug or because the connector got disconnected--turns out the symptoms are the same) and didn't have any code written to deal with it. Guess what I wrote that night. :-) Luckily I had all day to think about it while a new tire was fetched and by the time I figured out what was going on it took about 10 minutes to code the fix: a watchdog timer that shuts the world down if there are no inputs from the sensors for any reason.

It seems obvious in retrospect but when things are working you sometimes forget about weird failure cases.
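
A minimal sketch in C of that kind of watchdog (names and the timeout are made up; the real one obviously has to integrate with the control loop):

  #include <time.h>

  #define SENSOR_TIMEOUT_S 1       /* hypothetical: max silence we tolerate */

  void emergency_stop(void);       /* hypothetical: cut throttle, apply brakes */

  static time_t last_can_msg;      /* updated by the CAN interrupt handler */

  /* called every iteration of the main control loop */
  void check_watchdog(void) {
      if (time(NULL) - last_can_msg > SENSOR_TIMEOUT_S)
          emergency_stop();        /* no sensor input: shut the world down */
  }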


Any chance of getting 'second opinion' style sensors in there that can provide you with sanity checks? Such as 'GPS reports movement, but wheel sensors do not, we have a problem'?

That way you can avoid a paralysis of the control software until the vehicle has really come to a halt.


I read somewhere about realtime applications that do something like this, but with redundant sensors and a consensus-polling algorithm: have three sensors reporting the same thing, and if they are not all in agreement within some kind of delta, then go into some kind of limp mode, or have the two sensors that are in agreement be the ones the system uses for its algorithms. I cannot recall where I read it, though.
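
That scheme is usually called triple modular redundancy; a minimal sketch of the 2-of-3 vote in C (the threshold is made up):

  #include <math.h>
  #include <stdbool.h>

  #define DELTA 0.5   /* hypothetical agreement threshold */

  /* Returns true and writes the agreed value if at least two
     of the three sensors agree within DELTA. */
  bool vote(double a, double b, double c, double *out) {
      if (fabs(a - b) < DELTA) { *out = (a + b) / 2; return true; }
      if (fabs(a - c) < DELTA) { *out = (a + c) / 2; return true; }
      if (fabs(b - c) < DELTA) { *out = (b + c) / 2; return true; }
      return false;   /* no quorum: caller should enter limp mode */
  }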


Me too. The back of my brain is telling me it was for some sort of plane control software? Maybe? An interesting tidbit that I recall was that they used different manufacturers to hedge their bets against bugs.


That sounds about right. I seem to recall that it was for aerospace too. Maybe NASA? Something about zero defect software. I cannot for the life of me find the article right now though. I also vaguely remember that it was a HN submission too.


I believe you're thinking of Kalman Filters. They're commonly used in sensor fusion and noise reduction, anyhow.


Thanks for the tip. As a side note, the Wikipedia entry is probably one of the most in-depth entries I have seen there:

http://en.wikipedia.org/wiki/Kalman_filter


Two immediately jump to mind: one that had a massively bad impact on the company, another that might have...

First: using Perl, a (later-fired) co-worker added a hardcoded check like the following:

  if ($client_id = "specific_id") { #email reports }

Needless to say, we emailed reports for all of our clients to that one specific client. It didn't go over too well, considering that many of them were competitors. It was particularly bad because he had previously been talked to about flipping the constants to avoid the = vs. == bug.
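
For reference, the constant-flipping convention he'd been told about (a "Yoda condition"), sketched in Perl; note that for strings the comparison operator is eq anyway:

  my $client_id = "some_other_id";

  # constant on the left: an accidental '=' is now a compile-time error
  if ("specific_id" eq $client_id) {
      print "emailing reports\n";
  }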

The second, possibly exploited but never confirmed, was found a few years after it was initially put out. Our webapp created a session ID for each user: an MD5 hash.

Except it started like:

  StringBuffer md5HashedBuffer = new StringBuffer(userId);

Because the userId was an int, that simply creates a StringBuffer with an initial capacity of userId, not a StringBuffer initially populated with the userId.

The rest of the hash input was appended afterwards and the hash created in one go, with the result that everybody's session ID was the same. Changing your user ID in the GET or POST would allow you to be logged in as a different user.
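
A minimal sketch of the difference (names are made up):

  public class SessionIdDemo {
      public static void main(String[] args) {
          int userId = 4242;
          // the int constructor only sets initial capacity; contents stay empty
          StringBuffer bad = new StringBuffer(userId);
          // what was intended: seed the buffer with the id's digits
          StringBuffer good = new StringBuffer(String.valueOf(userId));
          System.out.println(bad.length() + " vs " + good.length()); // 0 vs 4
      }
  }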


Why weren't both of these bugs caught during code reviews?


A while back, I developed a program to generate invoices for about a dozen busy warehouses. During testing, for convenience's sake, I hard-coded in my local printer.

Unfortunately, I forgot to change the printer name back to a variable when promoting into production. Hilarity ensued.


Hehe, that one had me laughing here. Ouch. Hope you put enough paper in it ;)


Better one.

A guy I knew - awesomely good - hex-edited a DOS boot sector on an in-house machine to use FUCK.SYS instead of whatever .sys it normally is (I forget). He renamed that file to fuck.sys, rebooted, and the machine ran. Cool!

We laughed and reinstalled DOS, and two days later the boss came charging in yelling 'I have a client on the phone who says his new machine can't find FUCK.SYS!'

The awesome guy goes 'uh oh'. I laughed.


config.sys


After having launched our product, I was spending some time reviewing commits together with the senior tech lead. To this day I can recall the commit number and the filename, and can write down from memory the code responsible for what turned out to be the source of a bug that completely wiped out our users' computers. Someone had mixed uncommenting a piece of code together with fixing a bug, which hid the fact that some horrible code was now active in the product. It took us 5 minutes to produce a fix and push it out to the update servers. Did we end up wiping someone's computer? Yup, about a dozen known cases, including a couple in-house. I don't even want to think about how many actual cases there were, considering that we had about 2 million downloads of our product before the bug was fixed.


How do you define worst?

How about most widespread? Once while trying to debug a CPAN module I figured out that if $condition was false then Perl had a bug causing

  my $foo = $bar if $condition;

to leave $foo with whatever value it had on the previous function call. (The exact behavior is more complex than that, but that's a good first approximation.) I then made the mistake of reporting this in a way that made it clear that

  my $foo if 0;

was a static variable. Cue years of people like me trying to get the bug fixed and other people trying to keep it around. In the meantime EVERY large Perl codebase that I've looked in has had this idiom somewhere and it has caused hard to notice and reproduce bugs.
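
For reference, the safe spelling of that idiom: declare unconditionally, assign conditionally.

  my $foo;
  $foo = $bar if $condition;   # instead of: my $foo = $bar if $condition;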

How about worst damage to a system? Due to a typo, I once caused my employer's system to send Bloomberg's FTP system every large file that it had ever sent. Since it sent a large file every day, this crashed their FTP server, meaning that a number of feed providers for Bloomberg didn't have their feeds update that day. I implemented no less than 3 fixes that day, any one of which would keep the same mistake from causing a problem in the future.

How about most (initially) bizarre? Debugging a webpage that a co-worker produced where, depending on the browser window size, you couldn't type into the form. The bug turned out to be a div that had been made invisible but left on top of the page. At the wrong window size it was in front of the form, so you couldn't click on the form elements behind it. (I figured this out by first recreating it with static HTML, then using binary search to figure out what parts of the page I could whack and still reproduce it until I had a minimal example. Then it was trivial to solve.)


That second one reminds me of this: an ISP called Planet Internet changed their homepage, only to find that it reliably crashed Internet Explorer (version 3, at the time).

It took a while before the phone rang, asking if I wanted to have a look.

It turned out they had a little animated GIF in there with the inter-frame interval set to 0, causing a divide-by-zero in Explorer.

That gif was pretty much the last suspect on the list.

Divide & conquer until you are simply staring at the solution and still you don't see it...


The store locator function on a national pizza chain's web site would completely hang their web server whenever an international search was done. Many, many hours and days of testing and debugging led us to conclude, and build a proof, that it was a reproducible bug within IBM's Domino platform, only on AIX boxes, and only when: 1) a script using LotusScript (their proprietary language) was kicked off, and... 2) a Java agent was then kicked off before the original script completed.

At the time, Java was a new feature within that platform, so there weren't many apps that mixed both languages.

After getting to this point, IBM joined in the fix effort, and we had daily conference calls, on which we always had IBM execs lurking because their 6.0 release of the platform was imminent, and this bug had the potential to wreak major havoc if not fixed before launch.

So I cannot personally claim to have done the actual bugfix - the IBM programmer did that. But it was a great learning experience to work together with IBM to find and fix it.


Dominos Pizza :)


The worst bug I encountered was due to IBM MVS (or COBOL--I was never sure which was at fault) losing addressability of part of a variable length record. Now you see it, now you don't. The solution at our shop was to move the whole record to itself before attempting to look into the record. I was a newbie. If the old guard hadn't told me that workaround, I NEVER would have thought of it. This problem eventually went away, but 25 years later we still occasionally ask each other "did you try moving it to itself" when dealing with new problems. We chuckle, while today's newbies shake their heads at our Old Fart humor.

The worst one I ever caused was when Visa started carrying two amount fields in their credit card records. One was the amount in original currency, the other was converted to the receiving system's local currency. I used the original amount. Our hand-made test data used the same currency for both amounts, so no problem in test. Imagine my surprise when we went live and our system started posting original currency amounts to cardholder accounts, which at the time only supported US dollars. Luckily, we caught it early and senior management and cardholders were all good sports about it. I think those credit card statements with massive amounts became collector's items.


Since you mention the COBOL records were variable-length, I'd guess that they probably contained ODOs (Occurs-Depending-On: variable-length arrays, for those of you who aren't COBOL-literate). In order for a group or record MOVE to work properly, you had to move the subordinate ODO values first; otherwise the runtime system would miscalculate the target record length, possibly truncating the MOVE.

This also meant you couldn't use READ INTO for variable length records (which is equivalent to a READ followed by a MOVE) without taking some care.

As you say: "newbies shake their heads..."


I worked on a taxi booking and dispatch system, written in C and running on DOS, with custom networking via RS-232. This system was installed at around 300 locations around the UK, and on one fateful day, every installation crashed.

It came down to me to find and fix the problem, and it was subtle. The clue lay in the fact that all of the sites that crashed did so within about a minute of one another.

Turns out that some of the old, old sections of the software had been written by the MD, who, despite referring to himself as 'the emperor of c', was in fact an atrocious programmer.

The actual trigger was the comms system looking at a byte that determined whether a message had been received. This byte was set to the character 'A' if a message was received. It just so happened that the first byte of the current value of the number of seconds since 1970 evaluated to 'A', and it had been written into that memory location via a negative index into an array that hadn't been initialised.

This negative index into an array that shouldn't have been empty caused a section of memory to be overwritten, which made the comms system think it had received a packet. This snowballed quickly, and took down the system within about five seconds of boot.

It took the best part of two days to track down, and, of course, it was everyone else's fault but the emperor's.


Let me guess: The crash occurred slightly before 6 PM on July 22, 2004?


I was working on a C++ daemon process that communicated over a TCP socket. At the time, we were using the Poco library's facilities to do the standard daemon startup stuff (get rid of the controlling pty, point standard fds to /dev/null, etc). Anyway, one of our field installations wasn't working, so I took a look. It turned out that the communications over the TCP socket weren't working -- where the client process expected a few header bytes containing the message length, it was getting wacky values. I tried a bunch of stuff, and in the end, I displayed the header as ASCII, and it showed up as "SQL: INS". This blew me away; this looked like some debugging output that normally goes to standard output when the process wasn't running in daemon mode.

As it turns out, the Poco library authors didn't read Stevens' UNIX book all that closely: they closed all of the file descriptors when turning a process into a daemon, instead of reopening them to point to /dev/null. So standard output was closed, and its file descriptor was reused for the TCP socket. Of course, things like "cout" always assume that standard output is at a particular descriptor, so all the standard output from the program was getting written to the TCP socket.

Boy, that was confusing.


I worked on a piece of arcade equipment (manufacturer and model shall remain nameless) that used a bunch of solenoids to control the works under glass.

A little race condition in the code allowed one of the smaller solenoids to stay in a duty-cycled state, effectively turning the coil into a small space heater. Given the right play conditions and length of play, the coil could catch fire, and a couple of times it did. Lots of wood and plastic under glass made for a fun little display.

I heard one story about a unit in Paris being dragged out of a cafe and into the street, then put out with axes and buckets of water. Wish I had been there to see that.


I worked on a program that ran large batch jobs, sometimes taking more than twenty-four hours. This was actually spectacular performance, since we used custom hardware to do most of the computation. I wrote the code that interfaced with the hardware. When the code timed out trying to talk to the hardware, the only sensible thing to do was report the error and abort the program.

Unfortunately, this seemed to happen quite often. Jobs would abort randomly, after about eight hours, sometimes much less and sometimes not at all. Overheating was the obvious first suspect, so that was investigated and ruled out. The hardware was running cool and was in perfect working order. The customer started splitting jobs into shorter batches and combining the results by hand. We wrote code to help them automate this workaround. But batches were still randomly aborting. And I couldn't replicate the bug, despite having identical hardware to the customer. Something in the customer's environment was essential.

Eventually, somebody at the customer figured out the problem. My boss called me up and said, "The customer suspects your code is not time-travel compliant." It was true. My code assumed that time always goes forwards. If time ever went backwards while my code was waiting for hardware, it would immediately time out and abort the batch. And our customer encountered a bug where time did appear to go backwards occasionally. I was too stressed out over other tasks to ask for details of the bug. I just sent them a fix and breathed a sigh of relief when they accepted it.

A Google search now reveals that there was an issue with time going backwards under Xen on dual-core Opterons, which is what the target platform was. They never told us they were using Xen. Maybe that's why they were much nicer to us after the problem was diagnosed!
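
(For the curious: the usual defense, at least on POSIX systems, is to time out against a clock that's guaranteed never to jump backwards. A minimal sketch in C - not what we actually shipped:)

  #include <time.h>

  /* seconds elapsed on a clock that never goes backwards */
  double elapsed_s(struct timespec start) {
      struct timespec now;
      clock_gettime(CLOCK_MONOTONIC, &now);
      return (now.tv_sec - start.tv_sec) + (now.tv_nsec - start.tv_nsec) / 1e9;
  }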


Ok, here's one of mine, it's only fair.

Jasper L. systems administrator of an early web hosting company calls up one evening, there is a problem with the paging system.

A certain host is being paged as 'offline', but when checked, the machine works fine.

So I go there, and bit by bit we check out the software. Everything works fine, but sure enough, every 10 minutes or so the machine (called 'chopper' - I'll remember that for the rest of my life) gets reported as 'down' again.

But there is absolutely nothing wrong with it.

After ruling out all the software bits we figure it must be hardware somehow. The way the supervisor works is it sends a ping packet to the machine, and if the machine responds it is deemed to be up. But chopper misses one ping out of every 5 or so, and sometimes several in a row.

We swap out the computer, move the SCSI drive to another box and boot it. Chopper registers as off-line, but works just fine.

More confusion, finally, out of desperation we start messing with the network. This is all 10Base T, coaxial cables with little T connectors on the machines and a terminator at the end of the line on the last 'T' to make sure the impedance is right.

The terminator was on, so that wasn't it.

We went for a break at that point, we'd been at it for hours. Finally, by elimination we figure the only things we haven't changed yet are the drive and the 'T' connector, but surely that can't be it.

And it was... somehow that T connector was acting as a pretty good filter for the incoming or outgoing ping packet, changing one of the bits and causing the IP checksum to fail. No returned ping... we swapped it back and forth 3 more times just to make sure we weren't seeing things.


That's not 10BaseT, it's 10Base2 (or thinnet): http://en.wikipedia.org/wiki/10BASE2

And even though I've strung cables for both kinds - and of course 10Base2 takes far less cable - the maintenance issues and troubleshooting headaches of the bus topology make me so glad we've progressed to more robust network topologies.


Ah, yes, of course, you're right. If I could vote I would vote you up but my voting seems to be on the blink.


Define "irony" :)


Worst software bugs, in increasing order of severity:

An ASP.NET app that didn't consume its record sets before closing its database connections, thereby breaking IIS's connection pooling and causing hundreds of SQL Server logins/logouts per second. Lots of interesting side effects from that bug.

An obscure page-latching problem on a high-volume SQL Server (hundreds of transactions & thousands of queries per second). The SQL Server would spend all of its time waiting on page latches, and response time would grow exponentially. An eleven-hour phone call with MS support finally identified the problem.

Oracle 10.2.0.[1234]


You are responsible for Oracle 10?

We need to have a word.


I don't know anything about Oracle 10, can you elaborate?


That first one is actually two bugs: one in the application, another in IIS for expecting applications to behave nicely.


Non-strict comparator with the STL. Nearly impossible to identify when it's happening. It's happened once to me and once or twice to a coworker over the last couple years, and takes 3-5 days to debug every time.

Example:

  struct FooSort {
    bool operator()(Foo const& a, Foo const& b)  const {
  -    return not a < b ;
  +    return b < a;
     }
  };


Try a worse version of this: comparing the results of a floating-point function call where the left operand gets moved into the 80-bit FP unit during computation, and the right operand stays in a register.

Obviously there is a precision difference between the two numbers, enough to make a < b and b < a return true in a surprising number of cases. The way I ended up fixing it was by putting the result of the function call in a member variable in every struct, pre-computing all of the results, and comparing based on that value.
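
A minimal sketch of that fix in C++ (names are made up):

  struct Foo {
      double key;                      // some_float_func() precomputed once
  };

  struct FooSort {
      bool operator()(Foo const& a, Foo const& b) const {
          return a.key < b.key;        // compares stored 64-bit values only
      }
  };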


I hit this, too. Regression tests were failing when I changed code that obviously shouldn't change the output of the program at all. This happened on a regular basis when we changed numeric code, because of the normal limitations of floating-point arithmetic; we just made sure the numerical results were accurate and updated the regression tests. (The regression tests were quite handy for finding logic errors; they weren't really used to test numerical accuracy.)

But in this case I was just adding some error checks, which weren't even being triggered. Clearly this shouldn't affect the results of our numerical calculations. Since my code shouldn't affect the calculations, I was convinced that our existing numerical code had a subtle memory or timing bug. (I knew that floating-point code was tricky, but clearly I was doing exactly the same operations on exactly the same values.) I spent days staring at code, and then my boss told me to stop working on it since the results were clearly correct in both versions, even if they weren't identical.

A few weeks later I read about how values change when they're copied out of the x87 stack into registers. And I thought, naw, we couldn't possibly be using x87 arithmetic. But we were. Which was horrifying, since floating-point calculations could be a bottleneck under some workloads. But we had been running that way since before I started working on it, so at least it wasn't my fault. I added a compiler option to request sse2 floating point instead of x87 floating point. Voila, predictable floating-point results, plus measurably faster performance on a few tests.


This has happened to me several times. Especially in combination with the fact that C++ does not guarantee floating point operations happen the same every time... Argh. So it is unwise to do things like

  bool operator()(Foo const& a, Foo const& b) const {
    return a.some_float_func() < b.some_float_func();
  }
if some_float_func does floating point operations rather than just returning a stored value.


> C++ does not guarantee floating point operations happen the same every time...

Can you elaborate more on this? Are you implying that two identical calculations will return different results?


For example, a function might be inlined at some call sites and not inlined at others. Inlined versions might carry 80 bits through the result, but non-inlined version are truncated to 64. So in:

  #include <cassert>

  double foo(double x) { return x*3.3; }
  double bar1(double x) { return foo(x); }
  double bar2(double x) { return foo(x); }
  void t_bar(double x) { assert(bar1(x) == bar2(x)); }
t_bar can fail depending on how bar1 and bar2 are compiled.


I'm not sure how the example could make the assertion fail; I think that the point is that if 3.3 is treated as a long double by the compiler, then the result of x*3.3 will be a long double, which is more precise than the double returned by foo; but, regardless of how bar1 and bar2 are compiled, they both make a call to foo() which returns a double. Is it the case then that the behavior depends on how foo is compiled? I don't see how else the 80-bit result could propagate to the test for equality.


returning a double can be optimized into returning an 80-bit result in a register. If this happens with bar1 but not bar2, then you get different results.


There are situations where the order of evaluation is not guaranteed. The big one that I've seen is evaluation of expressions as arguments in a function call, but there are others. It is easy to create floating point expressions where because of rounding, the results will differ depending on the order of evaluation. I'm not aware of situations where the order of operation differs from one execution to the next. However, I could see a compiler generating different code for the same expression when it appears in more than one place in the source code because of optimization.


I'm glad I'm not the only one who has trouble when I hit this.

I really, really wish the STL had a mode you could compile it in that boiled down to, "Double-check everything, no matter how slow it makes things." Maybe it has one and I don't know it. But that was my top irritation with the little bit of C++ I've done.

I have no idea how a non-strict comparator can lead to memory corruption rather than just indeterminate sort order, but it is no fun tracking that down.
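
For what it's worth, GCC's libstdc++ has something close: its debug mode, which among other things checks predicates passed to the algorithms. A sketch of enabling it, assuming g++:

  g++ -D_GLIBCXX_DEBUG -o myprog myprog.cpp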


This bug was solved after the title went gold and shipped in the US :/ We did not find out about it until a few users emailed us. The bug got through Sony's testing and our internal testing...

- It was a first-gen PS2 game that had to get pushed out, since it was considered a launch title. The dungeons in the game were seeded and randomly generated. In some instances, a lever needed to proceed to the next dungeon was made inaccessible by a random seed that enclosed the lever with 'wall' tiles - the player either had to start the game over and hope the dungeon was seeded correctly, OR play through the game without finishing that particular dungeon. It was not a total killer, in the sense that the game could still be completed, but it was tough to know that such a glaring bug got through...

- A fix was made for the Korean version :) Basically a check to make sure that wall polys did not enclose any levers prior to generating the dungeon; if the lever was bounded, re-seed, generate and check again...


The nastiest was an interaction between two libraries. If you simultaneously imported xml.dom and matplotlib (both Python libraries) and then called functions in one of them (I think matplotlib), the program would segfault.

My incomplete set of unit tests didn't catch it because the two libraries were imported by separate submodules of mine.

I only managed to find it by writing a unit test, going back in version control, searching for the first revision where the problem occurred, and going line by line through the changeset (luckily I commit early and often).


While I was working at Lucent, they had a built-in FTP client which was used to transfer images to the router from a server.

Right after I checked in my code (a completely unrelated feature), this FTP facility broke such that you couldn't FTP an image to any router on any platform. So I was assigned this bug, which basically stalled the release on every platform that we supported. (The CTO himself called me.) Now you have to realise this BigCo supplies every cell phone carrier in the nation - Sprint, Verizon, etc.

What I found was that the FTP system had a bug such that if the image being FTP-ed was an exact multiple of 8K (or something like that), it would fail. My checkin had made the image file an exact multiple of 8K. (Story of my life!)

I found the bug, emailed the CTO and he assigned someone from the core team to fix it. That guy calls me up and in the end I fixed it myself using vi and kibitz.


A nameless BigCo came in for PCI compliance on Struts 1.something. They had coded the credit card number as a member variable of a Struts action (Struts 1 used a singleton pattern), so the last person to submit got to pay for all of the simultaneous transactions going on in the system. Fortunately, my team caught it in acceptance review.


I'm still a little new to the game, so I've only had one hacking job thus far. It spanned 4 years and taught me a lot about what to do and what not to do. I could type a small book about all the crazy things they did in that company - the nightmare of spaghetti code, the needlessly complex 3000+ table database - but I'll try to focus on just a few things that went wrong over the years or that needed fixing.

Two of the previous employees got into a debate on whether or not you could directly connect to the database through JavaScript and fetch recordsets, etc. To prove that you could, one of the developers did just that... and left it in place... in production... with the full connection string and user credentials in plain sight in the JavaScript code on the page.

Countless sql injection holes were plugged over the years, but not before we got hit with an attack that plugged javascript ad code into 50 or more tables and a few hundred thousand rows.

The initial developers used ".inc" as the file extension for include files like headers, footers and database access. These files sometimes had HTML, JavaScript, ASP or all 3. You never knew. The problem was that IIS treated these like text files and would gleefully serve them up if you accessed them directly - revealing server-side source code, more connection strings, etc, etc.

Our accounting "system" was woefully inadequate. Our sales people always said "yes", and we eventually ran into some clients that wanted their invoices formatted and calculated in a way our system couldn't handle. We were always far too busy to take the time needed to properly engineer the new solution, so this one client's bill was done manually - through SQL Management Studio - every month, for close to two years before I left. Reconciliation of the bill was also done manually. The kicker is, billing was done based on the dates that assignments were completed. These dates were not only not locked down once billing had begun, but new assignments could be injected into the billing period long after the invoice was generated, because of multiple factors. Needless to say, it took me half a day, once a month, to reconcile their invoices, and on several occasions, due to the very manual nature, payments were applied to the wrong assignments, transactions, etc. Once to the tune of $200,000. That was fun to fix. :)

I could go on and on and on and on....


At Viaweb, my careless use of $_ in Perl led to the name of every credit card shown in the merchant UI being replaced by a secret auth key. I didn't see it because that particular auth check was bypassed for admin users like me. It wasn't a serious security hole, and we changed the auth key afterwards. But as it happened, I was in a cranky mood when I created the auth key and it had some bad words in it, so it was a little embarrassing.


Probably the worst bug I ever encountered was a documentation bug.

Back in 1970, I was at my first programming job, working for a very small consulting company. The contract was to convert a large system from 1401 Autocoder to System/360 Cobol. However, instead of the source code for the current system, we were given system specs, which were almost flawless.

However, in final acceptance testing, one program, for one particular accounting line, kept producing results that differed from the current system. The customer would not accept this, even though my code implemented the specified calculation precisely.

Eventually, after a few rounds of back and forth, I demanded to see the source code of the current system, and after much more back and forth, the source listing was produced. Looking at the source, it seemed to be doing exactly what my code was doing.

There was one fact not taken into account - the present system, rather than running on the 1410 for which it was written, was running in emulation mode on their new 360. And they hadn't bought a license for the compiler. So the small accounting change they had made a few months prior had been made to the object code - the card deck itself. Luckily the programmer who had made the change had written a note to that effect on the listing - in the object code part, where he eventually sheepishly showed it to me. We then got paid.

(I also once had a bug that almost got national distribution via Time magazine. Luckily they had only printed a few thousand copies before someone noticed it. That one would have gotten me fired, but for the fact that the same release also included a halftone compression routine that enabled them to push back their photo deadline by a day. The reason the bug escaped my testing was that it only occurred in pictures where the line count was of the form 4n+3 - I had tested with both even and odd line counts, but never managed to use one that produced the off by one error.)


Most embarrassing:

" #define INTERVAL 10 * 86400 "

Read the above C statement. Yeah, there are no brackets surrounding the '10 * 86400', and INTERVAL was being used in some calculations. I wasted a lot of time debugging this crap.
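
For anyone who hasn't been bitten by this: the macro body is substituted textually, so surrounding operators of higher precedence bind into it. A sketch:

  #define INTERVAL    10 * 86400     /* the bug */
  #define INTERVAL_OK (10 * 86400)   /* the fix: parenthesize the body */

  int a = 86400 / INTERVAL;     /* expands to 86400 / 10 * 86400 == 746496000 */
  int b = 86400 / INTERVAL_OK;  /* 0, as intended */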


I worked on the communication middleware for an early Windows tablet. My part sat between a VB6 GUI app (that we customized for each customer) and a RS-232 device and handled all the communications.

One customer had intermittent communication failures that we couldn't identify for weeks. I finally went on-site and hooked up a serial port logger between the tablet and the device. After an hour of testing, I finally captured what caused the communication failure. It was some debug message from the GUI app, somehow coming across the serial port. I called back to the office and had a guy from our group go over to the VB app team and ask which genius was logging stuff to the serial port. Sure enough, some guy had been using VB's remote debugging feature with a null modem cable and left it turned on in the shipped app.


Every time there is memory corruption on a games console, it takes hours and hours of tedious detective work to find... shudder. Glad I don't do that any more.


Yeah, I had 'fun' with an embedded bootloader that would corrupt a single byte in the image (at offset 2^20, IIRC).


We set up a new network segment with multiple VLANs trunked over a fiber line. The entire thing worked flawlessly except for one VLAN. It turned out to be a conflict between a very strict media converter and bad ARP-packet-generating code in the device acting as the DHCP server on that VLAN. The ARP packets it generated were too short for the media converter, so it dropped them, thinking they were damaged, and the computers on the VLAN were unable to find the gateway. The end solution was to go with a dumber media converter that didn't work at the packet level. The DHCP device to this day still has the bad code.


The bug itself wasn't very interesting, just a brainfart. But it wasn't doing what it should, so I added debugging output - one line per pixel. (I was generating a png from a custom image format.)

Because there was so much output, I did a `| head` to keep it manageable. Saw what I was doing wrong, fixed it, reran the command, it seemed fine - but the output image looked exactly the same as it had before.

It took me about an hour to realise that once head had exited, the next write to the pipe sent a signal (SIGPIPE) to my program, killing it. The image wasn't getting written until after all the pixels had been processed, so it never got to that stage.


I once worked an issue with a crashing Exchange server.

It had to do with a connection being reset in the space of time between the server checking if the connection was good, and actually using it.

The client could reproduce it at will, and after a little bit of code at home, I could, too. Back at the office, no repro.

It took a lot of back and forth (to put it mildly), but the problem at work was that the network was too slow - the server got the reset with plenty of time to "notice" it, and the crash didn't occur.

Once we set up the repro on its own switch at work, everything failed in the expected manner.


Peter Seibel asks this question in all his interviews in Coders at Work: http://www.codersatwork.com It seemed like most were concurrency-related.


Embedded OS for a consumer product; the units were freezing (very occasionally) in the field. Usually the units were in store kiosks (= disappointed and unimpressed could-have-been-customers).

Turned out to be a race condition in an interrupt handler, where the OS would say "wake me up when something interesting happens" but something else would sneak in, clobber the wake-up trigger, which meant "... never."

Two weeks to find it, fixed by swapping two instructions in some assembly glue.

The harder a bug is to find, the simpler the fix is.


"Simpler" depends if you count the troubleshooting time. To rip off an old anecdote:

  Change one line of code: $3.00
  Find which line of code to change: $3,436.88


I once did a crawler that searched the content of a couple of large websites for phone numbers.

I did this daemon in Perl, using fork() to search multiple websites in parallel. When a new search was initiated, the daemon broke the search into multiple packages (of 5000 items each). After initiating the 5000 searches in a package (keeping the maximum number of active processes at 30), I did a wait for all the children to finish, so that I could mark the package as "done".

The trouble with fork is that open sockets are sometimes not forked very well. And I had forked my database connection (a DBD::Pg). Forking a DBD::Pg connection went well in my tests, but once in a few thousand forked processes, a process would stall. Sometimes it recovered after a few hours; sometimes it didn't.

I tried setting an ALARM that would make the process auto-kill itself after some time. Didn't work. The final workaround was to monitor the child processes from the parent and kill -9 the ones that hung. But the whole processing became too slow.

Finally I gave up on the idea of packages and on marking the search as "done". So I just marked the search as "done" after the last child was forked.

It was a difficult bug because the client had no idea it was a bug, and waited patiently for the results (sometimes as much as 2 days). It was also difficult because at first I had no idea why my processes hung.
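
For what it's worth, the usual rule with DBI is that a handle must not be shared across fork(): the child should set InactiveDestroy (so its exit doesn't tear down the parent's socket) and open its own connection. A minimal sketch (the DSN is made up):

  use DBI;

  my $dbh = DBI->connect("dbi:Pg:dbname=crawl", "user", "pass");

  my $pid = fork();
  if ($pid == 0) {                      # child
      $dbh->{InactiveDestroy} = 1;      # leave the parent's socket alone
      my $child_dbh = DBI->connect("dbi:Pg:dbname=crawl", "user", "pass");
      # ... search with $child_dbh ...
      exit 0;
  }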


Worst bug monetarily:

I was consulting for a multinational on a commodities trading app. I broke the stuff sending trade data to the Risk Management system that does automatic hedge trades. My manager came down the hall and told me I'd caused $4 million in exposure in the past two hours. In the end, her comment was: "A million here, a million there, business as usual!"

I told that story to a former coworker, and he pointed out, "Well, your boss was calm about it. I bet those traders were livid!"


so it's your fault...


Yes, it was my fault. And we fixed it and rolled the fix out to production in minutes. And those trades probably got their hedge trades done manually. But at the moment, it seemed pretty bad.


A neat piece of software calculating prices for customers had a hierarchical pricing model for different contracts with large corporations. But the hierarchies were not separated: you started in one hierarchy and ended up in another one. While trying to read the tree from the bottom up, you could end up reading the whole database, calculating a price with conditions from every customer.

I need not state the accuracy of that price...


We were doing some early testing on this distributed system, and process A kept backing up well under its nominal load. A had no allowance for shedding excess load (it was broadcasting high-frequency safety-critical data), and the network buffer backed up under the shitty messaging layer. It turned out that processes B[0..n] couldn't pull the messages off the wire quickly enough because process C was blasting some other data to B at about 1000 times the nominal load, filling up B's VM and kicking off the (improperly-tuned) garbage collector for 2-second intervals -- it ate the processor time needed to handle the load. Total death spiral.

Needless to say we ended up with more robust load management code and tuned the output of some processes.


One bug in Daylife's API tester's javascript. It just never worked in IE for some stupid reason. The solution was surprisingly simple.

After some analysis with Firebug, I figured out that the variable's value wasn't being preserved after a particular point. So I just had to take its value and assign it back to itself.

It's in this file: http://demos.daylife.com/api_tester/js/daylife.tester.js (lines prefixed with comments "IE Fix")

[Full disclosure: I was not and I'm not a part of Daylife. I only participated in a contest they once conducted and was asked to try solving their problem by one of their staff.]


Is this related to the following, by any chance?

http://karma.nucleuscms.org/item/101


Exactly. Thanks for the blog post link.


This isn't a particularly bad one, but it's the first one that popped into my head, and is somewhat unexpected unless you're very familiar with how JavaScript RegExp's "exec" method works:

    > r = /^x$/g
    /^x$/g
    > r.exec("x")
    x
    > r.exec("x")
    null
    > r.exec("x")
    x
    > r.exec("x")
    null
(When exec is used on a regex with the global flag, it will remember the position of the end of the last match and perform the next match beginning at that offset. Obviously this will cause very bizarre behavior if you expect it to be idempotent...)
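
The usual workaround, if you want repeatable calls, is to rewind the position by hand (or drop the g flag):

    r.lastIndex = 0;  // reset before the next exec()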


Worst: C/C++ pointer memory errors, duh. Especially when there are thousands of pointers and you don't know which one is overwriting memory that it shouldn't be overwriting.

Second worst: linking C programs when some of the symbols are duplicated in more than one library yet are defined differently. Nobody mentioning this to the developer made it all the more exciting.


Thanks to the Boost.org developers, the C++ pointer issues are mostly a thing of the past. Using shared_ptr et al. for all but the most time-critical code has made memory-smashing, memory leaks, etc. so easy to avoid!
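
A minimal sketch of the idiom (Widget is a made-up class):

  #include <boost/shared_ptr.hpp>

  struct Widget { /* ... */ };

  void demo() {
      boost::shared_ptr<Widget> w(new Widget);
      boost::shared_ptr<Widget> w2 = w;  // shared ownership, refcount is now 2
  }  // last owner going out of scope deletes the Widget: no leak, no double free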


Calling the least-mean-squares fit function in the LINPACK library used to go into an infinite loop occasionally. I tracked it down to some very dusty FORTRAN code in the LINPACK kernel, which gfortran42 compiles incorrectly. Adding -ffloat-store to the compile flags for that library fixed it.


ExternalInterface hell (Flash-JS bridge)


Like how the moment you move a Flash object ExternalInterface breaks completely in IE?


Define "worst". The most costly? The hardest to diagnose? The one with the most stupid cause?


Take your pick. Whatever works for you. For me the definition would probably be the hardest to diagnose, but I'm sure there are better ones that lead to more interesting stories and lessons to be learned.

Bugs are a great learning experience.


The most expensive bug that I can talk about publicly that I've encountered:

I used to work on software for a very expensive (started at $70k) DNA/protein analysis hardware solution. This was in the late 90s, and our GUI was a couple million lines of MFC code. I was responsible for an analysis package written in cross-platform (Linux and Windows) C++, but the main UI and all the complex interface code was MFC.

My then-boss was a Unix idiot who hated Windows. Which is fine and all, but a Windows product was paying our salaries. So one day he decided to rewrite my (well tested, though not unit tested) file handling code, written using the Win32 API, and port it to POSIX. Not to accomplish anything different, mind you, and not to add it to the cross-platform bits, since it was useless in isolation and, oh yeah, we had multiple millions of lines of MFC, so nothing was ever getting ported.

In any case, during this "upgrade", he found a function called DirTouch, which was intended to make sure certain directories existed during the normal course of saving data. Well, this gentleman subtly changed the semantics from "create the directory if it doesn't exist" to "create the directory if it doesn't exist, but if it does already, then silently delete whatever is there". This wasn't the whole bug, but this change of semantics to a destructive function was the root cause.

This change got shipped. One of our customers killed literally more than $1.3MM worth of data, since each machine run might cost $100K given lab time, reagents, prep, etc.


What happened to him? And did he try to blame you? In his mind I'm sure all he did was port your code, which must after all have been buggy.


Well... it was a shitshow. We were just told by a livid CEO that a major university (the bio world is surprisingly small, and it's pretty easy to get a terrible rep) had had a major data loss, etc. We were on the phone with them while simultaneously overnighting the computer to data recovery, trying to figure out what happened. It took two frantic days before I figured out what the problem was, patched it, and started the emergency release process for all the other customers.

But yeah, I initially took responsibility for the mess, since it was most likely my code that did it. After figuring out precisely what happened, and having to demonstrate to my boss that the old code did not delete stuff out of directories, that responsibility got walked back a bit. Still... I got "laid off" within 5 months of that.

Basically, there was blame to go around: the university should have had a comprehensive backup solution in place, particularly given the expense involved in creating the data; I should have been more careful not to needlessly call this directory-touch operation while creating cached bits in the data files (the "data file" that the user thought of was actually a directory, because we were running into FAT16 and FAT32 file size limits); but at the end of the day, my boss took code that used to work and turned it into code that not only didn't work but was broken in the worst way possible, for no other reason than to fuck about with POSIX idiocy. In a giant Win32/MFC app.


I've had a lot of frustrating experiences, but the one that made me slap my face the hardest, due to the sheer simplicity of it, was the time I was working on a Windows service. I was opening a socket and listening, but for whatever reason I could not get the client to connect! After a half hour or so of playing around, I finally minimized the million windows I had open to see Norton sitting there, asking me to unblock the application. facepalm


A multi-threading system and a queue: it would be hard to explain the whole algorithm here, but sometimes you think one thread can't affect the others, and that is not true. I had to find an exception that killed not just the thread but also prevented the queue consumption from continuing. I remember going home at something like 5am.





