Ah, bummer. I was hoping it would end up being something cooler.
Like in the 80's, writing games for the Commodore 64... Where they kept the memory for the program at a location shortly after where they kept the screen memory... So if you forgot to bounds-check the coordinates of the little guy you had running around on the screen, you could run him right off the bottom.
Where he'd start overwriting random bits of memory. Until eventually he walked right over your source code. Which was being interpreted. Until, of course, the guy caught up with the current execution point.
At which point, you hoped you had spent 10 minutes saving a copy of your code recently. Because listing your source you'd find it speckled with random special characters and changed values. Tire tracks from that little guy.
And you learned that the first thing you do when implementing that joystick routine is to also implement bounds checking.
Ah, the wonders of low-level assembly programming. I was a teaching assistant for an assembly course this semester at school, and one of the students was running a program in the debugger. (Specifically, Microsoft CodeView running in DOSBox.) F10, F10, F10, until she got to the load instruction that would fill the cx register. Press F10 again, and suddenly, the debugger shows "cx = DEAD."
She freaks out a bit, I freak out a lot. We're both worried that something really bad happened; then I realize that DEAD is actually a number in hex. I look through the data segment, and I see an AD DE in there (yay Intel byte reversal!).
I check through the code a bit, and even though it looks like the data segment is being initialized properly, it's still reading from the Program Segment Prefix. So, I find the Wikipedia article on the PSP and see that address 05h in the PSP has a cross-segment jump to CP/M compatibility code.
But, there's nothing at address DEAD (at least, nothing sensible). So I go searching through the DOSBox code for that string, and I find these lines:
// far call to interrupt 0x21 - faked for bill & ted
// lets hope nobody really uses this address
sSave(sPSP,cpm_entry,RealMake(0xDEAD,0xFFFF));
As it turns out, because DOSBox doesn't implement CP/M compatibility, they simply made it jump to DEAD instead, just to make sure that anyone who tried running a CP/M program would get the picture. And the assembler just happened to place the variable so that the load pulled in the DEAD instead of random nonsense.
I can only imagine the problems that can manifest when you're dealing directly with hardware, and don't have access to commented source code.
Allison asks: “What linker debugging strategies do you have?”
Change the linker script randomly (actual thing that has worked)
Change variable attributes from ‘private’ to ‘public’ at random (actual other thing that has worked)...
Changing stuff at random (without having any idea why you're doing it and without keeping track of what you've already done) isn't a very productive debugging strategy - it's more like an act of desperation. At least use the scientific method:
- Make a hypothesis about what might happen if you change X.
- Observe and record what actually happens.
- Think about why this could be, and then think about what experiment you could do next to get another step closer to the answer.
If you have a hard time coming up with a reasonable hypothesis, it might help to learn more about the system. In this case, reading the linker documentation to find out what the various options do could be more productive than just making random changes to the linker script.
If you don't have a causal model yet, random perturbation can get you moving in the right direction (genetic algorithms). Hopefully, through observation of enough effects, you will start to see enough patterns to construct a causal model.
If you haven't been faced with a mystifying bug where randomly futzing with things, in the hope of finding a change with a reliable effect, was your best shot at solving it, then you probably haven't been programming long enough.
Haven't been programming long enough? If you're the average age on HN, I may have been programming before you were born. :)
And I've dealt with lots of mystifying bugs. But my approach to debugging isn't trying random things, it's gathering more information in deliberate ways: reading the code, tracing code with the debugger, adding logging information if the problem doesn't reproduce under the debugger, running test cases that attempt to narrow down the scope of the bug, etc.
I've even debugged customer problems that I couldn't reproduce locally by looking at the code and deducing what execution path had to have occurred to account for the reported behavior. That's one situation in which you absolutely can't try random things to see if they work.
If I were truly making random changes, it would suggest to me that I had no understanding of how the system worked. If you do understand the system and its potential failure modes, then you can do better than random.
In this particular case the kernel binary loader was dropping half the binary on the floor, so code would sometimes work and sometimes not in a way which depended on both the environment and the layout of the compiled binary. I'm not surprised the author ended up randomly futzing with things to see if they could find a reliable cause! (This error could have come straight out of the logout; essay that was linked on HN a week or two ago: "I have no tools because I broke my tools with my tools!") Personally I think the reason she was emphasising that changing these things made a difference was because it was so surprising.
Now in your infinite and sagacious wisdom you've told us all how she should obviously have been able to debug this problem in a trice using your superior methods. Me, I'm happy to cut her some slack, and recognise that systems development can be a complete nightmare!
Yep, two other more 'deterministic' strategies should have caught this: interactive debugging would have shown that the memory is set to zero immediately after loading, not while running; and inspecting the symbol table could have revealed that only data after a fixed offset is corrupt.
No, you don't need hindsight to figure this out. It should be easy to figure out in advance which of these strategies is likely to be more effective in solving a problem:
(1) Go into the debugger and trace how the system got to the bad state it was in.
(2) Randomly change stuff and see if it fixes the problem.
After doing (1), you understand exactly what your code is doing and why. After doing (2), you may not even understand what the problem really was, or why it got fixed, or whether it's really fixed or just doesn't happen any more in this particular situation.
I've seen people debug using strategy (2), and when they give up and ask for help, stopping to think for a while and then employing strategy (1) will usually find the problem.
Time travelling debuggers are awesome, but they still can't compete with already knowing the cause of the bug when trying to find the shortest path from bug to cause!
There was a bug in the (Windows, 1.3.something) JVM around 2003 where, if you had more than 1024 files open (e.g., because you initialized the Velocity templating engine on each request, to pick a random example), it would delete any subsequent files it opened.
Needless to say that was a bugger to debug. We didn't believe what was happening until we literally saw a jar disappear before our eyes.
I once made a somewhat similar error. Many years ago, when I was just learning PHP, I attempted to write a function to delete all the contents of a directory, including subdirectories.
I don't remember the exact code, but it was something like:
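// Rough sketch from memory - the function name and exact calls are guesses.
function empty_dir($dir) {
    foreach (scandir($dir) as $entry) {
        $path = $dir . '/' . $entry;
        if (is_dir($path)) {
            // recurse into subdirectories, then remove them once empty
            empty_dir($path);
            rmdir($path);
        } else {
            // plain file: just delete it
            unlink($path);
        }
    }
}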
The function names may not all be correct; it's been a while since I PHP'd.
Anyway, the code looks like it should work. The problem is that the scandir function's result includes the . and .. entries.
So my little function would see .., recognise it as a directory, and recursively call itself on the parent of the directory I wanted to empty. I lost a fair few files before I realised what was happening. Luckily I had backups, as this was before I started using SCMs.
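These days I'd put a guard at the top of the loop so the dot entries never get followed; the same sketch with that one fix:
function empty_dir($dir) {
    foreach (scandir($dir) as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue; // never follow the current- or parent-directory entries
        }
        $path = $dir . '/' . $entry;
        if (is_dir($path)) {
            empty_dir($path);
            rmdir($path);
        } else {
            unlink($path);
        }
    }
}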