It was a long time ago, but I remember we were working on an embedded system controlling some industrial equipment, and it randomly crashed. The time between crashes was long enough that it'd take several days before it happened, so even getting a trace of the crash was an exercise in patience. Eventually we did get a trace, and it turned out the CPU would suddenly start fetching and executing instructions from a completely unexpected address, despite no interrupts or anything else that might cause it. We collected several more traces (it took around a month, because of the rarity), and the addresses at which it occurred, and the addresses it jumped to, were different every time. Replacing the CPU with a new one didn't fix it, and looking at the bus signals with an oscilloscope showed nothing unusual - everything was within spec. We asked the manufacturer and they were just as mystified and said it couldn't happen, so we resorted to a workaround that involved resetting the system daily. Around a year after that, the CPU manufacturer released a new revision, and one of the errata it fixed was something like "certain sequences of instructions may rarely result in sudden arbitrary control transfer" - so we replaced the CPU with the new revision, and the problem disappeared. We never did find out what exactly was wrong with the first revision, other than that it was a silicon bug.
I know that pain. I was working on the NDIS driver (WinNT 3.51, later a 4.0 beta) for our HIPPI[1] PCI cards. The hardware was based on our very reliable SBus cards, so when the PCI device started crashing, we assumed it had to be a software error.
I probably spent 2+ months trying to find the cause of the crash. Trying to decide whether your change had any effect at all when you have to wait anywhere from 5 minutes to >10 hours for the crash to happen will drive you insane. You have to fight the urge to "read the tea leaves"; you will see what you want to see if you aren't careful.
While I never did find the problem, I did discover that MSVC was dropping "while (1) {...}" loops when they were in a macro, but compiled them correctly when the macros were changed to "for (;;) {...}".
Later, a hardware engineer took the time to try to randomly capture the entire PCI interface in the logic analyzer, hoping to randomly capture what happened before the crash. After another month+ of testing, it worked. He discovered that the PCI interface chip we were using (AMCC) had a bug. If the PCI bus master deasserted the GNT# pin in exactly the same clock cycle that the card asserted the REQ# pin, the chip wouldn't notice that it lost the bus. The card would continue to drive the bus instead of switching to tri-state, and everything crashed.
Every read or write to the card was rolling 33MHz dice. Any single collision was unlikely, but with enough tries the crash was inevitable.
Ok, that one should get the prize. The chances of spotting that are insanely small - kudos on making progress at all, and more kudos for eventually tracing it down to the root cause. I really hope I'll never have anything that nasty in my path.
Most of the credit goes to the hardware guys that were able to finally isolate the problem.
We found the bug, but the months of delay (and a few other problems like losing a big contract[1]) killed the startup a few months later. While I'm annoyed the SCSI3-over-{HIPPI,ATM,FDDI} switch I got to work on was never finished, the next job doing embedded Z80 stuff was a lot more fun... and a LOT easier to debug.
Incidentally, I found a picture[2] of the NIC. Note that HIPPI is simplex - you needed two cards and two 50-pin cables. This made the drivers extra "fun".
[1] "no, I'm not going to smuggle a bag of EEPROMs with the latest firmware through Malaysian customs in my carry-on" (still hadn't found the bug at the time)
That's a beautiful board. I remember the FAMP made by Herzberger & Co at NIKHEF-H, I used to hang around their hardware labs when those were built. Similar hairy debug sessions in progress. Those worked well in the end iirc and ended up at CERN.