Something Rotten in the Core

jitl · on Oct 26, 2017

It seems like commenters here on HN are caught on the point that GDB does provide a reasonable machine interface. That's not the author's point; responding specifically to that fact is missing the forest for one mis-labeled non-tree.

Here are two paragraphs from later on that sum it up much better:

> We've seen it hundreds of times in all kinds of software. Functions that return bool instead of an error code. Where did the precise error vanish to? Poof, it's gone! What used to be a useful error message became false, and if you're lucky you'll get a generic "Unexpected error" appearing on screen. And that's if your program is using a library. If it's calling out to a command-line worker, the most likely case is it won't get checked at all and will just get printed out into a log file you'll never find, and then never seen again.

> (image) Homer's typing bird: http://www.codersnotes.com/notes/something-rotten-in-the-cor...

> We're making systems that are fragile, because they're just glued on rather than bolted together. We're wrapping complex things up in wrappers that don't take the same responsibilities as the things they rely on. Like Homer's pecking bird in the Simpsons, they work just fine when everything is as expected, but when the slightest change in situation happens then everything breaks.

I think the article is worth reading in full.

zug_zug · on Oct 26, 2017

I agree that GDB isn't substantive to the author's point; it's merely an example, and the validity of the argument as a whole isn't contingent on that one example.

However, I guess I'm not clear about the argument as a whole. The author is essentially claiming, "We write wrappers that are shoddy in that they don't consider failure cases."

That is true. What the author doesn't address is: what's the solution?

If the answer is "Code better," or "Think more," I feel like that's a straw-man.

I think the real question is: how can we make a wrapper (e.g. a complex setup script) provide visibility into its problems (permissions issue, hard drive space, network issue, etc.) without doubling the work involved?

AstralStorm · on Oct 26, 2017

Calling the systems glued on is generous. It presumes they are actually holding together. It is often not even duct tape level, but the connectors are machined with micrometric precision.

Likewise, most errors are either not checked or deemed impossible instead of proven to be impossible.

jepler · on Oct 26, 2017

I couldn't tell whether the author was making a good point, because their facts were badly askew.

Historically, there were multiple UNIX debuggers. It's true, gdb largely exterminated them (and anyway, when people say UNIX they often mean linux+freebsd+macos, where gdb was the de facto standard). However, there is now a second common UNIX debugger from the llvm/clang alternate universe (lldb).

In the very bad old days, debugger wrappers did drive gdb via the exact same interface that a human would, with all the terribleness that entails. But gdb has gone through several iterations of "GDB/MI", alternate APIs for "machine interfaces"; the current iteration seems to be "mi2".

Besides this, GDB has long since standardized the "debugger stub", which does low level operations like reading memory from the debugged process, starting/interrupting, etc.

As you can see, there are a plurality of interfaces to the debugger, several of them oriented specifically to use by other programs, rather than being repurposed human interfaces.

Besides this, gdb has also become internally extensible in Python, which is pretty great.

In short, invoking gdb as a subprocess is A LOT more like using an API than it is like trying to parse the output of "ls -l" as the only way to get a listing of files and their properties. It just happens to involve a second address space (a third, when you count the debugged program). It's just stream oriented, and doesn't look like JSON (or name your preferred hotness) because it was developed independently of, and quite possibly before, that preferred hotness.
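
To illustrate why GDB/MI reads more like a protocol than like scraped console text, here is a minimal Go sketch that classifies MI result records, i.e. lines of the form `[token]^class[,results]`. The names `miResult` and `parseMIResult` are hypothetical, the sample lines are made up, and the sketch neither launches gdb nor parses the nested results.

```go
package main

import (
	"fmt"
	"strings"
)

// miResult is a hypothetical holder for one parsed GDB/MI result record.
type miResult struct {
	Token string // optional numeric token echoed back by gdb
	Class string // e.g. "done", "running", "connected", "error", "exit"
	Rest  string // unparsed results after the first comma, if any
}

// parseMIResult splits a line such as `12^error,msg="..."`.
// It returns ok=false for lines that are not result records
// (console stream output, async records, the prompt, and so on).
func parseMIResult(line string) (miResult, bool) {
	line = strings.TrimRight(line, "\r\n")
	i := 0
	for i < len(line) && line[i] >= '0' && line[i] <= '9' {
		i++ // skip the optional leading token
	}
	if i >= len(line) || line[i] != '^' {
		return miResult{}, false
	}
	r := miResult{Token: line[:i]}
	body := line[i+1:]
	if j := strings.IndexByte(body, ','); j >= 0 {
		r.Class, r.Rest = body[:j], body[j+1:]
	} else {
		r.Class = body
	}
	return r, true
}

func main() {
	lines := []string{
		`7^done,value="42"`,
		`^error,msg="something went wrong"`,
		`~"plain console text\n"`,
	}
	for _, l := range lines {
		if r, ok := parseMIResult(l); ok {
			fmt.Printf("token=%q class=%s rest=%s\n", r.Token, r.Class, r.Rest)
		}
	}
}
```

A front end built this way can branch on the `error` class and surface the `msg` field, instead of guessing from free-form text.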

(I'm pretty sure lldb's interface is also designed to be driven by other programs from the start, and it may also support a same-address-space embedding mode, but I don't have any actual experience using lldb, I just know it's out there)

jepler · on Oct 26, 2017

"The LLDB debugger APIs are exposed as a C++ object oriented interface in a shared library. The lldb command line tool links to, and uses this public API… The entire API is also then exposed through Python script bindings which allow the API to be used within the LLDB embedded script interpreter, and also in any python script that loads the lldb.py module in standard python script files. See the Python Reference page for more details on how and where Python can be used with the LLDB API."

AstralStorm · on Oct 26, 2017

And therefore there is no stable ABI because C++ lacks one. Maybe in the future...

Animats · on Oct 26, 2017

This is a classic problem with open-source GUIs - they're a wrapper around a command-line program. Such programs typically have no clue what happened at the command line level - they just present whatever the command line program prints to the user.

A few days ago, there was a UI designer on here who was looking for an open source program to work on. I suggested "git gui". Git's default GUI is a Tk wrapper around the command line program. Lots of buttons, corresponding to command line options. No understanding at the GUI level of what's safe, what's useful right now, and what the state of the project is.

The original Macintosh deliberately lacked a command line, so programmers had to figure out a usable GUI for everything. They had the right idea.

One of the original design misfeatures of UNIX is that programs take in command line parameters and environment variables, but all they give back is a numeric error code. If they gave back a list of strings and a set of name/value pairs, and there was some convention about what should come back, scripts and GUI front ends would be less dumb.

hossbeast · on Oct 26, 2017

They also write a descriptive message (and nothing else) on stderr as a convention.

Animats · on Oct 26, 2017

Which is usually useless to a GUI or a script.

hossbeast · on Oct 29, 2017

Other than for informing the user what went wrong (which is specifically one of the things you called out as a problem)

klodolph · on Oct 26, 2017

> We've seen it hundreds of times in all kinds of software. Functions that return bool instead of an error code. Where did the precise error vanish to? Poof, it's gone!

A thousand times yes.

Earlier this year, two of us spent a full day debugging a problem with some of our automation. Our team has pretty good automation, for the most part, but this particular problem was in kind of a dark corner. A shell script in the automation would start up a process in the background, and then send commands to that process. The background process could be slow to start up at times, so to deal with this, the commands running in the foreground had long timeouts if any. Guess what happens if the background process dies?

Well, the shell script doesn't care, that's for sure, it wasn't watching for error codes in background processes. It was a bit of an adventure following the path from the foreground commands to the missing background process, and finding the log files for the background process.
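
One way to avoid that failure mode, sketched in Go under some assumptions: wait on the background process in a goroutine and abort the foreground work as soon as it exits. The names `runWithWatch` and `demoCrash` are hypothetical, `sh -c "exit 3"` stands in for a daemon that crashes, and the sleep stands in for commands with long timeouts.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// runWithWatch starts bg in the background, then runs work in the foreground.
// If bg exits before work finishes, it returns right away with bg's error
// instead of letting work sit in a long timeout. In that case the work
// goroutine is abandoned; a fuller version would cancel it via a context.
func runWithWatch(bg *exec.Cmd, work func() error) error {
	if err := bg.Start(); err != nil {
		return fmt.Errorf("starting background process: %w", err)
	}
	bgDone := make(chan error, 1)
	go func() { bgDone <- bg.Wait() }()

	workDone := make(chan error, 1)
	go func() { workDone <- work() }()

	select {
	case err := <-bgDone:
		if err == nil {
			err = errors.New("exited with status 0")
		}
		return fmt.Errorf("background process died early: %w", err)
	case err := <-workDone:
		_ = bg.Process.Kill() // foreground is finished; stop the helper
		<-bgDone              // reap it
		return err
	}
}

// demoCrash simulates a background process that dies at startup.
func demoCrash() error {
	bg := exec.Command("sh", "-c", "exit 3")
	slowWork := func() error { time.Sleep(5 * time.Second); return nil }
	return runWithWatch(bg, slowWork)
}

func main() {
	fmt.Println(demoCrash())
}
```

The point of the design is that the death of the helper becomes an error value in the main flow, so it ends up in the same log as everything else.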

It's one of the things I like about writing this quick-and-dirty automation in Go--the error handling is so explicit that you'll usually end up with good logs explaining what went wrong and what the program was trying to do at the time. Much better than dealing with shell scripts. Shell scripts are quick to write but you're often left in a bad position when they fail in unexpected ways, or even in expected ways.

(The actual bug we hunted down was traced to one missing line in a configuration file, but the problems with that piece of automation are far larger.)

candiodari · on Oct 26, 2017

So what you mean to say is that the problem is not so much that there isn't any error reporting, but that, in C, it's being ignored ?

Never do

  printf("here's a number: %d", 11);

Always do:

  int attempts = 0;
  int ret = printf("here's a number: %d", 11);
  while (attempts++ < max_attempts && ret < 0) {
    switch (errno) {  // printf() returns a negative value on failure; the cause is in errno
      case EINTR:
      case EAGAIN:
        ret = printf("here's a number: %d", 11);
      default:
        // At this point you, as a programmer, should STOP AND THINK.
        // What would be a reasonable reaction here ? How will it affect
        // everything else the program does ? What is the correct way to
        // proceed ?
        //
        // P.S. Anyone doing "return -1;" at this point should be taken out and shot.
        // and yes, that's the C equivalent of what every Go programmer always does.

        panic("printf error", ret); // for example, crash the program.
    }
  }

Needless to say, you should do this on EVERY printf statement.

There. Isn't explicit erroring great ? NO IT ISN'T.

Needless to say, this has an almost direct translation to Go. Does anyone do this ? Of course not. In Go, like in C, like in shell scripting, in the vast majority of programs nearly all errors are ignored.

That's why exceptions are so very superior to explicit error handling : they accomplish many things :

1) They alert the user that an error occurred. "Explicit error handling" as practiced in C, Go, most C++, ... will simply silently attempt to proceed, likely turning a small error or a typo into a disaster or catastrophe. Silent database corruption, here we come !

2) They provide information about where the error occurred. Stop me if this sounds familiar: "when an error is printed, and the program crashes, I download the source and grep it for what I think is a unique word in the error message. When it turns out it isn't I get cranky. When it turns out there isn't a unique word in the error I just sit down in a quiet corner and softly cry".

3) It allows for "layered" error management strategies. I'm not saying it gets it up to OCaml levels, but it is far superior to C or Go error management. In the main function, you catch any Exception for the various parts of the program you start, log it in a reasonable manner, alert if necessary, and restart the relevant portion of the program. Inside the parts of the program you catch finer grained exceptions with more explicit management.

4) it's far more concise.

So "explicit" error management ? Let's just be truthful here (just look at Github examples of C and Go code): it's really just ignoring errors.

You can find coding errors involving ignored errors in the Go standard library in minutes. Examples:

1) https://github.com/golang/go/blob/master/src/bufio/bufio.go#...

2) https://github.com/golang/go/blob/master/src/bufio/bufio.go#...

3) https://github.com/golang/go/blob/master/src/flag/flag.go#L5...

So even the Go core developers themselves can't be trusted to not ignore errors.

klodolph · on Oct 26, 2017

> So what you mean to say is that the problem is not so much that there isn't any error reporting, but that, in C, it's being ignored ?

...huh? No, I'm not saying that. I'm saying that if you write a shell script there's a risk of not detecting errors that you care about. I also said that I like to rewrite these overgrown shell scripts in Go, which is apparently a Wrong Opinion and some bad C code will somehow convince me of this.

First, the nitpicks: EAGAIN should not be handled here. EAGAIN shouldn't be retried in a loop, that will just spin the CPU for no good reason. If printf() returns EAGAIN it means that you made stdout non-blocking and hopefully you would know if you did that, but that's unusual except in language runtimes. There's also a missing break; in the switch.

Beyond that, I don't really care about error handling for printf() when I'm logging output or running interactive programs.

Compare this with the behavior for C++:

    #include <fcntl.h>
    #include <iostream>
    #include <unistd.h>
    int main() {
        close(STDOUT_FILENO);
        std::cout << "Hello, world!\n";
        return 0;
    }

Try it yourself.

As an example of the errors we see in our logs, they often look like this:

    some_file.go:399: could not realign warp core coupling b502:
      plasmaManifold.PhaseInterplex(): host not found: m19d.eng.ncc1701d
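
Messages of that shape typically come from adding context at each level as an error is passed up. A minimal Go sketch, with every name in it (`ErrHostNotFound`, `phaseInterplex`, `realign`, `realignOK`) hypothetical and the failure hard-coded for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrHostNotFound is a hypothetical sentinel standing in for a resolver failure.
var ErrHostNotFound = errors.New("host not found")

// phaseInterplex pretends to contact a host and always fails, for illustration.
func phaseInterplex(host string) error {
	return fmt.Errorf("%w: %s", ErrHostNotFound, host)
}

// realign records what the caller was trying to do before passing the error up.
func realign(coupling, host string) error {
	if err := phaseInterplex(host); err != nil {
		return fmt.Errorf("could not realign warp core coupling %s: phaseInterplex: %w", coupling, err)
	}
	return nil
}

// realignOK is the bool-returning variant the article complains about:
// the cause is discarded at this boundary.
func realignOK(coupling, host string) bool {
	return realign(coupling, host) == nil
}

func main() {
	err := realign("b502", "m19d.eng.ncc1701d")
	fmt.Println(err)                             // full chain of context
	fmt.Println(errors.Is(err, ErrHostNotFound)) // the cause is still inspectable
	fmt.Println(realignOK("b502", "m19d.eng.ncc1701d"))
}
```

The wrapped error keeps both the operation and the root cause; the bool version can only say that something failed.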

The "ignored errors in the Go standard library" aren't really ignored errors. Look at the bufio code a little bit more closely, you'll see that those errors are properly returned.

candiodari · on Oct 27, 2017

> First, the nitpicks: EAGAIN should not be handled here. EAGAIN shouldn't be retried in a loop, that will just spin the CPU for no good reason. If printf() returns EAGAIN

Manpage seems to imply that's not the only reason:

http://man7.org/linux/man-pages/man3/errno.3.html

And I'm pretty sure that the manpage is right : with creative redirects you can make that happen for other reasons too. You can redirect stdout to a file on NFS, or to a tcp socket that may have a full buffer, lots of evil ideas come to mind.

I'll take another good look at the bufio error. Thing is, I'm also pretty sure that I'd want bufio to correctly handle EINTR and EAGAIN and it seems to me very unlikely that this golang runtime code is correct for those cases. But I'll spend some time trying to make it fail. Maybe I'll learn something.

AstralStorm · on Oct 26, 2017

By "most of C++" you mean stuff written as C with classes right? Or perhaps the fact that exceptions are not more specific which is a general problem everywhere?

ajross · on Oct 26, 2017

Lost interest right here:

> We're not talking about calling out to a library here. We're talking about actually launching an instance of GDB, passing it commands, and parsing the results it prints out. And this is where we get led down a dangerous path.

That's just wrong. GDB has a reasonably well specified control protocol, which is what everything uses. Yes, it's ASCII and readable. No, it's not just "parsing gdb output".

Come on.

citrin_ru · on Oct 27, 2017

> So much user-facing network software is built on top of other programs, like ssh or rsync, and when those things fail they just don't know what to do. And so much of the problem is precisely because they're not using them as libraries, they're using them as command-line utilities.

Unix CLI utilities have a well-defined way to return an error: a non-zero exit code. It is even possible to return different errors as different exit codes, though it is rarely done.
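
For what it's worth, a wrapper can pick up both the exit code and whatever the tool wrote to stderr. A minimal Go sketch, assuming a POSIX `sh` is available; `runTool` and `demoFailure` are hypothetical names, and the "disk full" message and exit code 3 are invented for the example.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// runTool runs a command and returns its exit code and trimmed stderr.
// err is non-nil only when the command could not be run at all
// (for example, the binary does not exist); code is -1 in that case.
func runTool(name string, args ...string) (code int, stderr string, err error) {
	var buf bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stderr = &buf
	runErr := cmd.Run()
	stderr = strings.TrimSpace(buf.String())

	var exitErr *exec.ExitError
	switch {
	case runErr == nil:
		return 0, stderr, nil
	case errors.As(runErr, &exitErr):
		// The tool ran and reported failure through its exit status.
		return exitErr.ExitCode(), stderr, nil
	default:
		return -1, stderr, fmt.Errorf("running %s: %w", name, runErr)
	}
}

// demoFailure simulates a tool that prints a message to stderr and exits with 3.
func demoFailure() (int, string, error) {
	return runTool("sh", "-c", "echo 'disk full' >&2; exit 3")
}

func main() {
	code, msg, err := demoFailure()
	fmt.Println(code, msg, err)
}
```

What the wrapper does with a given code is still a per-tool convention, which is the gap the thread keeps circling back to.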

clhodapp · on Oct 26, 2017

I know it was just the specific example of a commonly wrapped program that the author happened to use, but it does bear noting that gdb actually ships with a perfectly usable raw mode interface built in. It even supports split panes.

gfody · on Oct 26, 2017

reminiscent of Spolsky's law of leaky abstractions, definitely another side of the same problem