Scientists Explain Why Computers Crash But We Don't (physorg.com)
58 points by jaybol on May 4, 2010 | 27 comments



I'm underwhelmed. The reason we build our programs that way is that it actually makes them less likely to crash. We have access to programs structured more like the E. coli graph; they crash a lot. The fact that one error takes down the whole system has less to do with the organizational structure of the code, and more to do with the fact that we've deliberately built the system so that one error will crash the entire system... because the alternative is worse. There's no recovering from a segfault not because there aren't enough different implementations of core functionality, but because there is no sensible way to recover from a segfault.

Cells have a type of redundancy programs do not have, and may never have, and it actually hasn't got anything to do with the source code at all. They have numerous independent copies of things that mostly work, most of the time. They perform tasks that transform lots of things into lots of other things, most of the time. The whole system is built so that everything mostly works, most of the time, usually recovers, and it doesn't matter much whether this cell dies or that mitochondrion malfunctions. It's an entirely different set of primitives brought on by massive parallelism. (And not the usual biological parallelism you hear about in the brain, but physically, everywhere.)

It's the difference in primitives that brings about the difference in result. Difference in optimal layout is an effect, not a cause. Cells don't have the option to work like programs, and programs really don't have the option to work like cells. (At least not without a lot more computational resources.)


>The whole system is built so that everything mostly works, most of the time, usually recovers, and it doesn't matter much whether this cell dies or that mitochondria malfunctions.

What makes you think we can't design reliable software systems that way? In fact, I think it has already proven to be a remarkably good idea: http://erlang.org/
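
To make that concrete, here's a minimal sketch of the "let it crash" style in Python rather than Erlang (the names and limits here are made up, not Erlang's actual API): a supervisor restarts a crashing worker, and gives up only if the crashes come too fast.

    import time

    def supervise(worker, max_restarts=5, window=60.0):
        # Restart `worker` whenever it crashes, like a simple
        # one_for_one supervisor. Give up if it crashes more than
        # `max_restarts` times within `window` seconds (a crude
        # version of Erlang's restart intensity).
        crashes = []
        while True:
            try:
                worker()  # run until normal return or crash
                return
            except Exception as exc:
                now = time.monotonic()
                crashes = [t for t in crashes if now - t < window]
                crashes.append(now)
                if len(crashes) > max_restarts:
                    raise RuntimeError("restart limit exceeded") from exc
                # otherwise loop around and restart the worker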


Heh, I actually cut out a segment where I started describing a system I could build that would work like that, starting with Erlang. (I actually program professionally in Erlang, though not exclusively.) It got too long and parenthetical.

It's still not the same. There's just no biological equivalent to a bit of code that is dividing by zero or referencing a file that doesn't exist, or any of several other errors I've made that have brought enormous swathes of the supervision tree down because they're restarting crappy code. Erlang adds this sort of biology-style robustness at the top of the stack; biology itself works with it at the bottom of the stack. That changes everything. Programming with massive unreliable parallelism may indeed someday happen, but it's a long road between here and there.


Well, we can find or create equivalences.

Someone I knew did a master's thesis on the male reproductive system; he told me that males have 10-15 largely independent biological pathways to produce sperm. From nature's perspective the point is presumably that if you can still breathe, you should be able to reproduce :)

If we were to put the same level of effort into a file reading module, we would have the files replicated over 10-15 different systems with file reading code written by a dozen different people in a dozen different languages, all using different heuristics to locate a similar file (or a backup) if the original wasn't found. Add some sort of selection mechanism to pick the best result from the 10-15 return values, and you would have a very resilient file reader :)
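
As a sketch of what that might look like in Python (every name here is hypothetical; assume each reader was written independently and each replica stored separately):

    import collections

    def redundant_read(replicas, readers):
        # N-version file read: run every independently written reader
        # against every stored copy, then pick the most common answer.
        results = []
        for read in readers:          # e.g. a dozen implementations
            for replica in replicas:  # e.g. 10-15 stored copies
                try:
                    results.append(read(replica))
                except Exception:
                    continue          # any single failure is tolerated
        if not results:
            raise IOError("every reader failed on every replica")
        # crude selection mechanism: majority vote on the results
        # (assumes the results are hashable, e.g. bytes)
        winner, _ = collections.Counter(results).most_common(1)[0]
        return winner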

I can't imagine we will ever want to write code like that by hand - drawbacks include development cost and maintenance hassle, and the system becomes very hard to understand and debug.

But it's still an interesting approach. In computer systems I suppose we could bring it about by some variant of evolutionary programming. ( http://en.wikipedia.org/wiki/Evolutionary_programming )


Yeah, that's pretty much how I was thinking of it. Enormous effort, even if you do get to use evolution. And consider the whole cycle of building a web page: organically retrieve a file, organically open a connection to a database (with an organic protocol, of course) to organically retrieve some organic data, organically convert it to some sort of organic representation (HTML is too rigid, we'd need some sort of probabilistic representation or something) and organically render it in a browser. The complication is simply enormous, and the payoff? For enormously more computational power, you have a net decrease in the reliability of the whole process.

I could see how AI could use such a thing, especially since the best intelligence we know works that way. But in general? It seems less than awesome.


Not on evolutionary time scales :-)


Precisely. Another example is Dynamo-like clustered database systems. They are designed to have redundant servers so that if one crashes, the system as a whole isn't affected.

There are also examples of both (Erlang and Dynamo) combined: Riak and Cloudant. These systems run multiple Erlang processes on multiple redundant nodes. Processes or nodes can die without ill effects, and often the system knows how to heal itself.
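
The read path looks roughly like this (a generic sketch of a Dynamo-style quorum read in Python, not Riak's or Cloudant's actual API; `node.get` returning a (version, value) pair is an assumption):

    def quorum_read(nodes, key, r=2):
        # Ask the replicas one by one; succeed once `r` of them
        # answer, and return the value with the newest version.
        answers = []
        for node in nodes:
            try:
                answers.append(node.get(key))  # -> (version, value)
            except Exception:
                continue                       # dead node: skip it
            if len(answers) >= r:
                break
        if len(answers) < r:
            raise RuntimeError("quorum not reached")
        version, value = max(answers, key=lambda a: a[0])
        return value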


If you graphed those systems as they were graphed in the article, they would look like the Linux kernel, not a cell. Those things don't work like cells; they add the duplication at the highest layer possible. It's not something pervasively shot through the entire architecture, all the way down to the simplest primitives, like it is in biology. Those systems run the same code on what may very well be the same basic hardware, code that still has the same basic structure of core primitives and higher layers and can still crash like a program can. They are slightly more robust against some types of errors, but are still not even remotely like a cell.


Google is built that way.


Note that the original paper doesn't say anything about crashing. Their point is that the E. coli regulatory network has a different topology than the Linux call graph, in particular a much lower overlap between modules and little reuse of low-level workhorses. They postulate that higher-level eukaryotic networks would have more reuse and be more similar to the Linux call graph. http://www.pnas.org/content/early/2010/04/28/0914771107.full...


I'm trying to understand the lesson behind this result.

I don't think it's the obvious, well-understood fact that biological systems have massive, redundant parallelism, whereas our software systems do not.

I believe it in fact says something very specific and fairly non-intuitive: that biological systems have many slightly different copies of key routines, whereas our software systems as they are designed today do not.

"That’s why E. coli cannot afford generic components and has preserved an organization with highly specialized modules, said Gerstein, adding that over billions of years of evolution, such an organization has proven robust, protecting the organism from random damaging mutations."

For example, imagine instead of having one 'sort' function, you had different sort functions dispersed throughout every area of your code that performs sorting, and each one was very slightly optimized (through design or some unspecified evolutionary process) for the particular characteristics of the data being sorted at that program location.

Thus, 'sort' is no longer a single point of failure. If one of your sort routines has an exploitable buffer overflow, then it's probably the only one that does, which limits the potential damage to the system as a whole -- especially if you've designed your entire system this way.
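
As a toy illustration in Python (the call sites are hypothetical): two hand-specialized variants instead of one shared 'sort', so a bug in either one is confined to its own corner of the program.

    def sort_small_ints(xs):
        # call site A: short lists of small non-negative ints,
        # so a counting sort is simple and fast here
        if not xs:
            return []
        counts = [0] * (max(xs) + 1)
        for x in xs:
            counts[x] += 1
        return [v for v, c in enumerate(counts) for _ in range(c)]

    def sort_nearly_sorted(xs):
        # call site B: data arrives almost sorted, so insertion
        # sort runs in near-linear time on this input
        out = list(xs)
        for i in range(1, len(out)):
            j, key = i, out[i]
            while j > 0 and out[j - 1] > key:
                out[j] = out[j - 1]
                j -= 1
            out[j] = key
        return out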

Could it be better in some cases to copy and slightly modify a software component, than to simply reuse it?


It absolutely makes sense to make your fallbacks independent from one another, but implementing fallbacks at all is expensive.

In evolution, you're pretty much guaranteed a breakage at some point that can't be fixed. That's not quite as true with software - you're still guaranteed breakages, but you get to fix them, and fixing dependent systems is a whole lot more economical.

I think the best analogy is interfaces. You code to an interface with multiple implementations, and if a problem occurs, you switch implementations. Next, if you have vulnerable and complex components, you make sure they each have a custom implementation, so the inevitable bugs can't be widely exploited.
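
A minimal sketch of that pattern in Python (the `Store` interface and the idea of stacking implementations are mine, not from any particular library):

    from abc import ABC, abstractmethod

    class Store(ABC):
        # callers depend on this interface, never on one implementation
        @abstractmethod
        def get(self, key): ...

    class FallbackStore(Store):
        # try each independently written implementation until one works
        def __init__(self, implementations):
            self.implementations = list(implementations)

        def get(self, key):
            for impl in self.implementations:
                try:
                    return impl.get(key)
                except Exception:
                    continue  # a bug in one implementation isn't fatal
            raise LookupError("all implementations failed for %r" % (key,))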


Couldn't getting sick be considered crashing? We obviously have a lot of redundancy (10^12 cells), so we don't "crash" unless a lot of those cells get sick. In E. coli, on the other hand, there are plenty of genes in which a single mutation kills the whole organism. (302 of them, according to Japan's National Institute of Genetics: http://www.shigen.nig.ac.jp/ecoli/pec/index.jsp)


Counterexamples include SIDS, stroke, epilepsy, psychosis, and suicidal depression.


Ever since I learned about it, I've always thought epilepsy was an astounding example of the brain's ability to recover from cascading system failures. After an epileptic episode, not only can the brain resume its ordinary functioning, but it hasn't even rebooted: non-volatile storage (memories and learned skills) is not corrupted, and even higher-level state like personality is unscathed.

It's unfair to criticise a system for having a failure state, because all systems have failure states. Different systems have different ways of handling them, though, and the brain's ability to cope with failures is nothing short of amazing.


Just to be pedantic, non-volatile storage is what survives a reboot on a computer. For the brain to have "not even rebooted" you'd have to come out of the seizure with your short-term working memory intact. I'm not an expert, but I strongly suspect it's not possible to start dialing a phone number, enter a petit mal seizure for five seconds, and then finish dialing without having lost your place.


I can't weigh in on your statement with any factual additions but...

If you start dialing a number, and somebody calls your name and you turn around and talk to them for 10 seconds, you probably can't resume where you left off, either. :)


And death. Biological systems do crash, all the time.


Uhh? Computers are houses of cards...everything in them is interdependent. Bodies, on the other hand, are like cities...a few major parts that, if they fail, kill the entire system, but lots of redundancy, and lots of things that aren't completely necessary.


It reminds me of the way that aircraft avionics are sometimes made more reliable - they build N complete controllers from scratch, to the same requirements, then use a voting system to discard the output from a controller that is malfunctioning (and therefore producing different output from the N-1 others).
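
Roughly like this (a toy sketch of N-version voting in Python; real avionics voters are considerably more involved):

    import collections

    def vote(controllers, sensor_input):
        # Run N independently built controllers; a crashed controller
        # simply gets no vote, and the majority output wins.
        outputs = []
        for control in controllers:
            try:
                outputs.append(control(sensor_input))
            except Exception:
                continue
        if not outputs:
            raise RuntimeError("no controller produced an output")
        winner, count = collections.Counter(outputs).most_common(1)[0]
        if count <= len(controllers) // 2:
            raise RuntimeError("no majority among controllers")
        return winner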

It seems to me though that there is a big difference in the problem domains. Biological systems can produce a range of outputs that are on a sliding scale of more or less desirable - but software systems are generally specified to produce a single correct output, and any deviation from that is regarded as a complete failure.


"Gerstein said that this organization arises because software engineers tend to save money and time by building upon existing routines rather than starting systems from scratch."

Bull. There are two reasons why the shapes are different.

1. The bottom of the Linux graph is smaller because there are not very many primitive operations on a computer. Data is homogeneous. There are 256 interchangeable values for a byte. By comparison, a bacterium has to use separate pieces of machinery to handle chemicals made up of the dozens of non-interchangeable elements that it works with. On a computer, you might use one single function for searching binary search trees all over the place in different systems, regardless of the data. In a bacterium, when code gets copied, the copy is modified. Every new application is a fork of some other application; welcome to a developer's nightmare (a good argument against ID... a designer would never duplicate so much code). Developers reuse code because they can; bacteria don't reuse code because they can't.

2. The top of the Linux graph is larger because computers have to do more. Computers get selected for features. Bacteria get selected for survival. The bacteria have to "just work", whereas people expect to be able to configure computers. I can plug 100 different network cards into my computer, but don't expect to plug a different flagellum into a bacterium. Maybe it's just a matter of level: if you picked a higher point in the call graph, eventually you'd get to "main", no?


That headline is terrible.

My takeaway: if there is an intelligent designer, they'd be an awful programmer.


Agree with your first statement.


Downvoted? Wasn't expecting that from hacker news! :)


Normally in software there are two approaches: a single software solution, or best of breed, i.e. Office vs. Lotus 1-2-3, WordPerfect, etc. Or, in the world of ERP, having HR, accounting, and planning software from different vendors. The interesting bit from this article is that it implies that cobbling bits together is a more reliable approach.

I have always been amazed at how the Unix approach of cobbling bits together is often more reliable than trying to write one large program.


TL;DR Redundancy increases stability.

True, but it also makes it nigh impossible to make any changes to the system, and changeability is generally a desirable trait in software.


That reminds me of what my lecturer said: "Do not start writing functions before knowing what they're going to do, hoping you can massage them into doing something useful." It was `duh!` advice, but now the converse seems appropriate if you're going to design a living organism.



