Principles of Software Engineering, Part 1 (nathanmarz.com)
107 points by ananthrk on April 2, 2013 | hide | past | favorite | 18 comments



Great stuff, and I love the concrete example of the ZK failure due to error logging -- a classic cascading failure mode. While it's true that I'm an inveterate disaster porn addict[1] and would therefore love this regardless, I think that Nathan's piece serves as a model in that it speaks to learning from failure rather than gloating about nascent success -- we collectively need much more of this! I also like that Nathan doesn't romanticize other engineering domains, as naive software engineers are wont to do; other engineering domains also struggle with failure -- it's just that their failures are so much more public (and so much more likely to involve loss of property and/or life) that they cannot evade collective introspection the way software engineering so frequently seems to. Very much looking forward to Part 2!

[1] http://www.infoq.com/presentations/Debugging-Production-Syst...


I've enjoyed your talk, thanks for posting. One thing I'd like to know though: as someone who optimizes his debugging skills and environment as thoroughly as you do, it surprised me that you love JavaScript. Don't get me wrong, obviously it has some of the best tooling thanks to its abundance, but doesn't it bug you that it tends to fail silently? I feel that there are quite a few error classes that need to be caught by unit tests in the case of JS, whereas in languages with stricter runtime type checking (such as Python) they get caught as an exception right on the first run. Or is it that this uneasy feeling about everything you do in JS is what has spawned a culture of more thorough unit testing, such that in the end you're better off?
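The silent-failure contrast can be made concrete with a tiny illustrative sketch (`total` is a hypothetical function, not from the talk): in JavaScript, an `undefined` in arithmetic quietly yields `NaN` and flows onward, while the equivalent mistake in Python raises a `TypeError` on the first run.

```python
def total(prices):
    # Bug: a None sneaks in (e.g. a missing field). In JS the analogous
    # undefined would coerce to NaN and propagate silently; Python
    # raises a TypeError at the exact call site instead.
    return sum(prices)

try:
    total([10, None, 20])
    caught = None
except TypeError as exc:
    caught = str(exc)

print(caught)  # the error names the offending NoneType immediately
```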


The things that you need to unit test even when you have static typing typically overlap with the tests that would detect type errors as well. The fact that there is no static typing also puts a bit more fire under your butt to test things.

In the end, it's a wash.


I think this quote is magic, "Software engineering is a constant battle against uncertainty – uncertainty about your specs, uncertainty about your implementation, uncertainty about your dependencies, and uncertainty about your inputs."

Engineering is about handling what goes wrong, not what goes right. It's about handling the errors, changes, misuse, etc. It isn't so much about the techniques per se as about the mindset of living in an imperfect world.

[Edit: Fixed a typo.]


Indeed, which is why I like to think of engineering as a game of hyper-dimensional whack-a-mole [1].

There is a certain series of things you have to hit in a fairly hyper-dimensional world, dodging constraints, hurdling uncertainty and taking risk in your stride as you struggle to make products that work, delight consumers and make bank.

It's like a complex and exquisite ballet really, with suppliers, manufacturers, producers and designers all coming together to make extraordinary products that astonish the world.

Ah, I love engineering.

[1] https://news.ycombinator.com/item?id=4238984

> designing a rocket engine is a massive game of high dimensional parameter whack-a-mole, it's very difficult to get a passable configuration without a lot of iteration and forwards-backwards passes


Super interesting post! I'd have mentioned unit tests as another measure to tackle uncertainty. Simple, boring unit tests (reminds me of this post[1]). Maybe he just assumes those will exist when professional engineers write code. [2]

[1] http://robertheaton.com/2013/04/01/check-youre-wearing-trous... [2] http://www.amazon.com/Clean-Coder-Conduct-Professional-Progr...


Unit tests are of course important, but they don't test for higher level failures like network issues, high latency, increased load, etc. Your components must be designed to be isolated from incidents as much as possible, possibly using the techniques implemented in Hystrix [1], an open source library from Netflix.

[1] https://github.com/Netflix/Hystrix
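A minimal sketch of the circuit-breaker pattern that Hystrix implements (names and thresholds here are illustrative, not Hystrix's actual API): after a run of consecutive failures the circuit "opens" and calls fail fast to a fallback, isolating callers from a struggling dependency; after a cooldown, one trial call is allowed through.

```python
import time

class CircuitBreaker:
    """Illustrative circuit breaker, not Hystrix's real interface."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds before a trial call is allowed
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                return fallback()      # open: fail fast, don't touch the dependency
            self.opened_at = None      # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            return fallback()
        self.failures = 0              # success resets the failure count
        return result
```

The key property is that once the dependency is known to be sick, callers stop piling load onto it, which is exactly the cascading-failure scenario the article's ZooKeeper story describes.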


Yes, that's absolutely true. Unit tests are super important in order to guard against future mistakes. Or put another way, they guard against the uncertainty of the code maintaining its current functionality in the future.
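That "guard against future mistakes" point is just a pinned-down expectation. A trivial sketch (`slugify` is a hypothetical function standing in for "current functionality"):

```python
def slugify(title):
    # Current behaviour: lowercase, whitespace becomes hyphens.
    return "-".join(title.lower().split())

def test_slugify_preserves_current_behaviour():
    # Simple and boring -- a future refactor that silently changes
    # the output will fail this test instead of failing in production.
    assert slugify("Principles of Software Engineering") == \
        "principles-of-software-engineering"

test_slugify_preserves_current_behaviour()
```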


One question this raised (and I don't mean this as a gotcha): why could a flood to the error-reporting servers take down all of the applications? I expected the primary fix to be to decouple the work so it could continue with no error reporting server. (But I'm not familiar with Zookeeper or any of the other work the author's doing, beyond reading some post on Storm.)


It's a convenience thing so that users can quickly see if there are any errors happening in their applications. While you could provide hooks to integrate the error stream with some external error reporting system, you also want something that just works out of the box. Zookeeper is the only place that Storm can store state, Zookeeper is good at storing small amounts of data, and the recent errors are a small amount of data (as long as things are properly throttled). Hence, the design.
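The "as long as things are properly throttled" caveat is doing real work there. A sketch of what keeping the error stream small could look like (illustrative only, not Storm's actual implementation): cap both the number of retained errors and the rate at which they are recorded.

```python
import time
from collections import deque

class RecentErrors:
    """Illustrative bounded, throttled error store (not Storm's code)."""

    def __init__(self, max_errors=10, min_interval=1.0, clock=time.time):
        self.errors = deque(maxlen=max_errors)  # bounded: old errors fall off
        self.min_interval = min_interval        # at most one write per interval
        self.last_write = float("-inf")
        self.clock = clock                      # injectable for testing

    def report(self, message):
        now = self.clock()
        if now - self.last_write < self.min_interval:
            return False                        # dropped: too soon after last write
        self.last_write = now
        self.errors.append((now, message))
        return True
```

With both caps in place, even an exception storm translates into a small, constant amount of data, which is the precondition for ZooKeeper being a safe place to put it.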


Thanks -- it's interesting to hear how people do this kind of work that I'm not involved in these days.


Zookeeper is a distributed coordination service. Think of it as an extremely robust, reliable datastore for handling small amounts of data. It provides that robustness by using an expensive synchronization protocol. When you try to slam it with large volumes of data, Zookeeper falls over. And Storm relies on Zookeeper for basic functioning, so without a running Zookeeper ensemble, the associated Storm cluster will die too.


I wish it were a bit more robust than it is. The ZooKeeper version we run (3.3.4, admittedly not the newest) reports the wrong version number (3.3.3) and has a major bug in the way it does snapshots. We found that it doesn't serialize the tree of nodes to disk correctly, so there is a race condition where it writes a node even though the parent of that node has been deleted. Then ZK tries to reload from the flawed snapshot, cannot, and crashes, which results in endless leader elections that never resolve.

All software has bugs and these specific problems have been fixed in newer versions, but they are super scary issues to run into with your distributed coordination service.
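The race described above leaves a detectable fingerprint: a node whose parent is missing from the snapshot. An invariant check over the snapshot's paths could catch it (illustrative sketch only, not ZooKeeper's real snapshot format):

```python
def find_orphans(snapshot_paths):
    """Every node in a consistent snapshot must have its parent present;
    a child serialized after its parent's deletion shows up as an orphan."""
    present = set(snapshot_paths)
    orphans = []
    for path in snapshot_paths:
        parent = path.rsplit("/", 1)[0] or "/"
        if parent != path and parent not in present:
            orphans.append(path)
    return orphans
```

Running a check like this before declaring a snapshot durable would turn the "crash on reload, then endless leader elections" failure into a loud error at write time.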


That makes sense. It's not clear to me though why error logging should belong to it.


Well, a Storm "program" operates concurrently on many nodes at once. If an exception is thrown, you may want to log it and the stack trace, but where? If you write to a local log file, that data will be useless unless you run some sort of log shipping or log centralization (like with scribe or kafka or syslogng). But that's usually a pain in the neck to set up, and you can't run Storm without already running a Zookeeper cluster, so if you're lazy, you just log to Zookeeper.

Everything is fine as long as exceptions are infrequent.


Indeed.

All our error logging is kept entirely away from the live kit to prevent shit like this from happening.


There is a fine line between industry (cost centers vs revenue generators) and startups when it comes to discussing the term software engineering.

I see a large amount of legacy maintenance in cost-center-based programming. Revenue-generating industry channels seem to favor the enterprise aspect of software engineering. Startups attempt to just build, and fix as necessary (cowboy). Yet each has its own facet of software engineering.

I am still trying to draw the line between too-enterprisy, too-maintenancy, and too-cowboy. At my current job, we assume everything is certain. The uncertainties are not coded for, because everything is internal. This bothers me to a large extent. I love coding for the uncertain. Giving more control to the user and automating a whole department is right up my alley. Sadly, it is hard to convert people. Only the 'RU' in CRUD is in the user's hands most of the time. It is pure legacy fear.

The part about removing cascading failures needs more emphasis. Remove portions from your cycle/automation/jobs. What happens? I also agree with the measure-and-monitor portion. Waiting to create analyzers and looking at metrics once the program starts breaking in production is too late.

Looking forward to the next posts.


This may be semantics, but I think of software engineering as the slightly larger scope of building real-world solutions with software and hardware. Civil engineering is not (just) about mixing the right cement and letting it cure at the right temperature for the right length of time, nor is it strictly about building a bridge, it's about building a bridge for the right price in the right amount of time that will last a given number of years, all parameters which were determined through a careful process and making decisions with stakeholders, while applying scientific principles (geology, materials science, etc.) and good people management skills. Oh, and the successful bridge project leaves behind the documentation of the bridge as built and a structure to assure its proper maintenance.

However, I do agree that handling the huge and complex range of inputs, not only the expected ones, is a great beginning to the process, one that is often overlooked. And same goes for internal monitoring, to make sure your system is still functioning as designed.



