
> can't possibly be interacting with your current problem

The only thing about programming that isn't logical is a programmer's ability to fully simulate the circumstances under which all bugs may occur. That ability will vary greatly from programmer to programmer.

So yes, people think like this because it's easier to fix the bug than it is to figure out the correlation.




Maybe my previous comment was unclear. If so, sorry about that.

The point was that, in programming, there is never any "figuring out correlation." You can rule out whether a bug is being caused by a given line of code by examining the flow of data between what you're seeing on screen and the lines of code responsible for what is shown on that screen. A bug is never "correlated" with any given line of code. The line of code is either logically related to the bug, or not related at all.

I'd be interested to hear more about how programming could be made into a correlation game, though. It sounds like a new mental tool that I've never learned, which means I should learn it.


To make programming into a correlation game, build a distributed system and work on performance. You suddenly have flow of data, together with lots of external factors that are difficult to measure (e.g., a spike in latency in US-East but not US-West for 0.5% of packets, or 2% of your shared instances having a noisy neighbor).

In such contexts, you usually also have a LOT of code.

Correlation analysis becomes very important in figuring out which piece of code to even look at. Bugs do become "correlated" with a line of code, because bugs take the form "noisy neighbor + blocking disk read (code) + high latency to master DB => slow response".
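As a minimal sketch of what that looks like (every field name and number below is invented): assuming each request is logged with its latency and a few candidate factors, you can rank which factors co-occur with the slow tail before deciding which code path to read first.

    SLOW_MS = 500  # arbitrary threshold for "slow" in this toy example

    # Hypothetical per-request records; in practice these come from request
    # logs or a tracing system. Every field name here is made up.
    requests = [
        {"latency_ms": 40,  "region": "us-west", "noisy_neighbor": False, "blocking_read": False},
        {"latency_ms": 900, "region": "us-east", "noisy_neighbor": True,  "blocking_read": True},
        {"latency_ms": 55,  "region": "us-east", "noisy_neighbor": False, "blocking_read": False},
        {"latency_ms": 870, "region": "us-east", "noisy_neighbor": True,  "blocking_read": True},
        {"latency_ms": 60,  "region": "us-west", "noisy_neighbor": True,  "blocking_read": False},
    ]

    def slow_rate(records):
        if not records:
            return 0.0
        return sum(r["latency_ms"] >= SLOW_MS for r in records) / len(records)

    # Compare the slow rate with and without each candidate factor. A large
    # gap is a hint about which code path to read first.
    for factor in ("noisy_neighbor", "blocking_read"):
        present = [r for r in requests if r[factor]]
        absent = [r for r in requests if not r[factor]]
        print(f"{factor}: slow {slow_rate(present):.0%} when present, "
              f"{slow_rate(absent):.0%} when absent")

In the toy data above, the blocking read splits slow and fast requests far more cleanly than the noisy neighbor does, which tells you where to start reading.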


Thank you. That's a very interesting way to think about it, and I hadn't considered that before.

I apologize for making so many comments in this submission. I feel pretty terrible about it, because the number of comments is higher than the number of upvotes, which has pushed your submission off the front page. I didn't realize it was happening until it was too late. But more than that, in retrospect, I should have behaved differently altogether, which would've resulted in fewer comments. Sorry.


Interesting that you mention a distributed system. I've worked on a few in the past and am working on a fairly complicated one right now. That's exactly what I was thinking of when I made my comment! :)


You can rarely "rule out" anything based on reasoning about data flow, simply because your reasoning always has the possibility of being flawed. If you had an oracle you could consult while debugging, you could do this. But we are rarely in that situation. Instead, you can only increase your confidence.

In my experience, it is best to fix problems as you find them, even if you are confident that they are not the cause of your current bug. Each thing "wrong" I find in the code leads me to question my current mental model of what is going on. A bug means that, somewhere, there is a difference between my mental model, and what the code actually does. A seemingly unrelated error gives me evidence that my mental model may be even further from reality than I realized. That's bad. At that point, I know of at least two deviations from reality. And just like trying to establish causation in a scientific experiment, you want as few variables as possible.


I am not arguing against fixing problems. I am arguing against voodoo thinking.

I'm getting the feeling that the way I learned to debug is somehow very different from what everyone else here is doing. I've never found a bug I haven't been able to fix or understand. I'm not bragging. It's simply the truth, and it's why I'm mystified about the negative reactions I'm getting here. I've spent my whole life debugging things, and I've never once run across "maybe the bug is caused by this new problem, even though it's obviously completely unrelated, so I'm going to fix it and cross my fingers." It just seems ludicrous, the same way that it was ludicrous to think the alignment of the stars could determine whether you'll have a happy life.

The oracle you consult while debugging is the program itself. If you don't understand the behavior, then you add logging statements until you do. You log everything, absolutely everything, and then reason about what the program is doing when the bug is occurring.
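As a sketch of the level of logging I mean (the function and field names here are made up): every input, every branch taken, and every output on the suspect path goes into the trace, and then you read the trace from the run where the bug occurs.

    import logging

    logging.basicConfig(level=logging.DEBUG,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("trace")

    # Made-up example of a suspect path. Log every input, every branch taken,
    # and every output, then read the trace from the failing run and compare
    # it against what you expected to happen.
    def apply_discount(price_cents, customer_tier):
        log.debug("apply_discount: price_cents=%r tier=%r", price_cents, customer_tier)
        if customer_tier == "gold":
            discounted = price_cents * 90 // 100
            log.debug("gold branch taken, discounted=%r", discounted)
        else:
            discounted = price_cents
            log.debug("default branch taken, discounted=%r", discounted)
        log.debug("apply_discount returning %r", discounted)
        return discounted

    apply_discount(1099, "gold")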

There was exactly one bug I was never able to fix, and it was because the bug resided in closed source code (D3D9). Multithreading was triggering the bug even though their code claimed to be MT-safe. At that point, there was nothing I could do except revert to unmultithreaded shader building. It was a terrible race condition that took a full day to track down, but it didn't require any kind of correlation tricks. It simply required that I rule out large swaths of the codebase until it was logically impossible for the bug to be anywhere except in the D3D9 library.

Yes, of course, fix problems as you run across them. But that's a separate discussion altogether.


I am the same way, and I am similarly proud that I can always fix a bug. It's all just code, and I can always pinpoint the error.

My impression is that you and I would go through very much the same process when debugging code. This sub-thread, however, is filled with my points on mental models and confidence, so I won't repeat them here. It's not voodoo thinking, but keeping in mind that mental models may be wrong. I am careful to avoid thinking certain things are "impossible" because I know my reasoning is fallible. I instead say "I find it extremely unlikely, and here's my evidence why."

My bug-that-got-away was a race condition in a lock-free, multithreaded allocator I was porting to the Itanium instruction set. My understanding of the memory model on Itanium wasn't very good, and even the original author of the algorithms was unsure whether it could be ported at all. I eventually decided it was not a good use of my time, and moved on. (The algorithm was designed on the Power instruction set, and was easy to port to x86.)


"maybe the bug is caused by this new problem, even though it's obviously completely unrelated, so I'm going to fix it and cross my fingers."

You're misstating what I said. I said "you can't reason how it could be causing the issue you're seeing". That's very different from "it's obviously completely unrelated".

Look at the example in the article. We're getting network delays. Here's a bug that would cause network delays. It's not clear why we'd only be seeing delays on the west coast server and not the east coast.

Do we fix the bug and see what happens? Or do we keep researching?

What happens when we've been researching for a couple hours, and the network delays are keeping people from doing their work? What happens after two days of research, and we're losing a few thousand dollars an hour?

Huh, maybe we should fix the bug we can find right now (too many redis connections), and see what happens to our network delay bug.

Maybe fixing it turns the bug we originally wanted to fix from something intermittent into something consistent.

Now if you're having an issue with network latency, and you find a bug in the code that converts floating point numbers into dollars and cents, then obviously that's not what I'm talking about.


Ok, we're clearly talking past each other. Let's reset.

1) You have a bug. Your ovals are coming out as boxes, and you hate anything without rounded corners. So you look into your drawing code to find out what's going on.

2) Along the way, you stumble across a routine that has a bug: Your network code isn't properly handling a failure condition. You're absolutely certain that your oval drawing code isn't tied to network data in any way: it's always supposed to draw ovals, but it's drawing boxes.

Do you fix #2, and cross your fingers that #1 is also fixed?

I would write down #2 (for example, by adding a TODO comment) and then keep trying to fix #1. I wouldn't stop what I'm doing, fix #2, then see if #1 is fixed.

Now, here is your original comment:

In programming, in my experience, the nastiest bugs to fix are actually two or three separate bugs interacting in weird ways. If you find a bug like he did, and it's easy to fix and unlikely to break something else, but you can't reason how it could be causing the issue you're seeing, FIX IT ANYWAY. It's quite possible it's interacting in some subtle way with another bug, and fixing it may make the other issue start behaving more consistently, and easier to fix.

It may feel wrong, because you feel like you should set that theoretically unrelated bugfix aside until you can work out the bug you're trying to focus on. In my experience, that's often not the right approach.

The situation I described above isn't uncommon. You have a bug you're trying to fix, and then you run across a different problem, but it's obviously completely unrelated. You're saying, "Drop what you're doing and fix it." I'm saying, "Focus, and think logically."

I apologize if I'm somehow misrepresenting you. Please correct me if that's the case. Also, I think I'm having an off day, and my comments are coming across as self-centered and snobbish. My apologies.


Race conditions. This bug shows up sometimes while frobbing the dingbat, but only sometimes. The crash is correlated with frobbing the dingbat.


Very much yes. If someone tells me about a bug, and they tell me "It always happens when I do this", my immediate reaction is "Oh, good, that will be easy to fix."

If instead they say, "It sometimes happens when we do this", then that means it's going to require serious investigation because we will need to start correlating different events to see under what circumstances the bug does and does not pop up. That is very much like the scientific method.
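A rough sketch of that kind of correlation pass, with an invented reproduction log: record the circumstances you can observe for each attempt, and tally which combinations the failures share.

    from collections import Counter

    # Invented reproduction log: for each attempt, the circumstances we
    # could observe and whether the crash happened.
    attempts = [
        ("frob dingbat, second thread busy", True),
        ("frob dingbat, second thread idle", False),
        ("frob dingbat, second thread busy", True),
        ("no frobbing, second thread busy", False),
        ("frob dingbat, second thread idle", False),
    ]

    # Tally crashes per observed circumstance; the combination that always
    # crashes (or never does) narrows down where the race can live.
    tally = Counter()
    for circumstances, crashed in attempts:
        tally[(circumstances, crashed)] += 1

    for (circumstances, crashed), count in sorted(tally.items()):
        print(f"{circumstances}: {'crash' if crashed else 'ok'} x{count}")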


The Udacity course on debugging speaks directly to looking at correlations between bugs and executions of various portions of code (and the same across bugs).

There's some interesting stuff: https://www.udacity.com/course/cs259
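Something like this, as a rough sketch of the correlation idea it teaches (the coverage data below is invented for illustration): compare how often each line executes in failing runs versus passing runs, and look hardest at the lines that show up mostly in failures.

    # Invented coverage data: which source lines executed in each run, and
    # whether that run failed.
    runs = [
        {"covered_lines": {10, 11, 12}, "failed": False},
        {"covered_lines": {10, 11, 13, 14}, "failed": True},
        {"covered_lines": {10, 12}, "failed": False},
        {"covered_lines": {10, 13, 14}, "failed": True},
    ]

    failing = [r for r in runs if r["failed"]]
    passing = [r for r in runs if not r["failed"]]
    all_lines = set().union(*(r["covered_lines"] for r in runs))

    def coverage_rate(line, group):
        return sum(line in r["covered_lines"] for r in group) / len(group)

    # Lines that execute mostly in failing runs are the most suspicious
    # places to start reading.
    for line in sorted(all_lines):
        print(f"line {line}: failing {coverage_rate(line, failing):.0%}, "
              f"passing {coverage_rate(line, passing):.0%}")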


"So yes, people think like this because it's easier to fix the bug than it is trying to figure out correlation."

And many times, there's no clear advantage to figuring out the correlation first instead of just fixing the bug and seeing what happens.



