Failsafe – failure handling with retries, circuit breakers and fallbacks

dredmorbius · on July 23, 2016

A note on the name: "fail-safe" in engineering doesn't mean that a system cannot fail, but rather, that when it does, it does so in the safest manner possible.

The term originated with (or is strongly associated with) the Westinghouse railroad brake system. These are the pressurised air brakes on trains, in which air pressure holds the brake shoes open against spring pressure. Should integrity of the brakeline be lost, the brakes will fail in the activated position, slowing and stopping the train (or keeping a stopped train stopped).

https://en.m.wikipedia.org/wiki/Railway_air_brake

Fail-safe designs and practices can lead to some counterintuitive concepts. Aircraft landing on carrier decks, in which they are arrested by cables, apply full engine power and afterburner on landing. The idea is that should the arresting cable or hook fail, the aircraft can safely take off again.

https://en.m.wikipedia.org/wiki/Fail-safe

Upshot: "fail safe" doesn't mean "test all your failure conditions exhaustively". It may well mean to abort on any failure mode (see djb's software for examples). The most important criterion is that whatever the failure mode be, it be as safe as possible, and almost always, based on a very simple and robust design, mechanism, logic, or system.

From the description of this project, it strikes me that it may well be failing (unsafely?) to implement these concepts. Charles Perrow, scholar of accidents and risks, notes that it's often safety and monitoring systems themselves which play a key role in accidents and failures.

Animats · on July 24, 2016

"These are the pressurized air brakes on trains, in which air pressure holds the brake shoes open against spring pressure." Air brakes don't really work that way.[1] There's an air tank on each car to provide the pressure to apply the brakes if the brake line loses pressure.

Fail-safe design comes from railroad signaling. It is a principle of classic railroad signaling that any broken wire or relay that fails to pull in must result in an indication not less safe than the correct one. "Vital" Relays in classic signaling systems fall open by gravity, and use silver-to-silver contacts so as to avoid welding together on overloads. (Lightning strikes on rails and on signal lines are considered a normal part of railroad operation.)

[1] https://en.wikipedia.org/wiki/Railway_air_brake#Straight_air...

dredmorbius · on July 24, 2016

From your linked source:

"Under the Westinghouse system, therefore, brakes are applied by reducing train line pressure and released by increasing train line pressure. The Westinghouse system is thus fail safe—any failure in the train line, including a separation ("break-in-two") of the train, will cause a loss of train line pressure, causing the brakes to be applied and bringing the train to a stop, thus preventing a runaway train."

Without air pressure -- from line or cannister, the brakes fail in the activated mode.

I'm trying to find a source, but my understanding is that red/green for lit signals as "stop/go" came about after an earlier mode, in which a steady white light meant "go" proved problematic: the red disks fronting stop lamps could fall out (or perhaps be broken), leaving ambiguity as to what "white" meant.

Switching to red and green lamps meant that the failed-disk mode now clearly indicated a signalling problem, where the signal could not be trusted.

Animats · on July 24, 2016

No, train brakes need pressure from the car tank to be applied. This is what the famous "triple valve" is for. High train line pressure releases the brakes and charges up the car tank. Low pressure applies the brakes. This has the annoying property that you can't leave a train parked on a grade for too long without applying the manual brakes on the cars. US freight air brakes were standardized in 1893, and haven't changed much since.[1]

Semitrailer parking brakes really are spring-loaded and released by air pressure.

[1] http://www.railway-technical.com/air-brakes.shtml

dredmorbius · on July 24, 2016

OK, Today I Learned.

This diagram in particular (from your URL) shows that though there is a spring in the brake-shoe application mechanism, its action is to release the brake.

I hadn't know this (and had never found a good diagram of railroad brake design). This isn't what my understanding had been.

NB: this isn't my area of expertise, and my understanding had been the incorrect idea that spring-pressure held brake shoes in place.

Which makes me wonder why this design was chosen over a spring-driven shoe.

Thanks for sharing that. And brickbats to the hive-minders who've (at this point) downvoted your earlier comment in this thread.

Animats · on July 24, 2016

"Which makes me wonder why this design was chosen over a spring-driven shoe."

The real answer is that the Westinghouse air brake system won the 1887 Burlington brake trials. Other entries included vacuum brakes, buffer brakes (bumping into the car ahead applied the brakes), a competing air brake system, and electropneumatic brakes (by Herman Hollerith, the punch-card guy). Nobody entered a spring-loaded system.

dredmorbius · on July 24, 2016

And you've got standardisation, across an entire rail ecosystm (that's rail, not Rails), in which locomotives, rolling stock, couplings, etc., etc., etc., all need to work together.

An advantage of standardisation is you get, well, standardisation. Such as US President Herbert Hoover implemented by setting up the National Institute for Standards and Technology (NIST), which specified standards for screws and nuts and bolts. I'm not sure if Bendix transmissions were included, but come WWII, it was possible for the US War Department to order something like five million Jeep transmissions from several dozen suppliers, any of which could (at least in theory) be interchanged or have parts swapped between them.

The disadvantage is that you may find yourself very effectively stuck at a local optimum that's far from a global optimum, with murderous path dependencies.

I've been grousing over a set of TV propaganda videos created by the Mont Pelerin Society / Cato Institution through Johan Norberg and his "Free to Choose Media" production company (at least the propaganda slant is fairly obvious). The 2nd installement of his series on Adam Smith spends much of its time aboard a supersized cargo carrier, waxing rhapsodic about the wonders of the market in coming up with such a marvelously efficient system.

Except that it took the US Navy to standardise container sizes. After some 20 years of dickering over container sizes, materiel transport needs of the Vietnam War finally forced standarisation.

(Another US regulatory body, the Interstate Commerce Commission, meanwhile, had been happily impeding progress thanks to its regulatory capture by the railroad industry, and I won't even begin to mention the Texas Railroad Commission, which has little to do with railroads and was exceptionally significant well beyond Texas, at least for a time).

mcpherrinm · on July 24, 2016

The Lac-Megantic crude oil train derailment/fire disaster is rather horrific example of that "annoying" property, where an insufficient number of manual brakes were applied, and an engine fire caused the engine proving air pressure to be shut down.

https://en.wikipedia.org/wiki/Lac-M%C3%A9gantic_rail_disaste...

dredmorbius · on July 24, 2016

I was aware of brake failure as a factor in the Lac-Megantic disaster, but not that this was the specific cause.

dredmorbius · on July 24, 2016

NB: it's really annoying to see HNers downvoting factually accurate and well-intentioned comments.

Particularly when they're correcting errors or omissions in other comments. Such as those in mine above to which Animats is replying.

superzamp · on July 24, 2016

Interesting comment. I've been looking for a term for "system that cannot fail", but have not been able to find any.

An example of such system could be a ball check valve, which can inherently only work.

https://en.wikipedia.org/wiki/Check_valve

Can you think of a word to describe such systems?

dredmorbius · on July 24, 2016

There are two terms that come to mind.

The first is "impossible".

The second is "pre-failed".

As the drunk has observed, you can't fall off the floor.

If you're looking for a term for a system which is highly immune to failure, "resiliant" comes to mind.

Take Tesla's solid-state, no-moving-parts one-way fluid valve. It has no moving parts to break (though it could conceivably be fouled by dust, dirt, sediment, or debris).

http://makezine.com/2012/01/05/the-tesla-valve-one-way-flow-...

"Overengineered" is another possibility.

im4w1l · on July 24, 2016

A titanic system.

pm90 · on July 23, 2016

This is a great comment. Why do you think the project fails to implement the concepts that you mention?

daenney · on July 23, 2016

I think that what they're trying to get at is that having libraries that (for example) wrap failures in retry modes isn't necessarily failing safely. It can very well obscure problems in your implementation or other parts of the systems you're talking to. Having it fail safely can just as well be "abort execution" and visibly log it so as to raise the problems with those that might be able to solve the root cause.

There's certainly something to be said for retry strategies in places that involve a lot of network chatter but please don't also forget to add some kind of back off to it so you don't end up retry-overloading a system that's trying to recover.

dredmorbius · on July 24, 2016

Pretty much this, yes.

If you hit an error condition in your code that you aren't explicitly handling, break that mofo.

The faster and more explicitly you break, the better, as this gives you the signal to fix the problem.

Wrapping and retries attempts to heal the damage, meaning, effectively, your code is walking wounded -- it's encountered an untrapped error, has ignored it, and is attempting to continue.

The faster and more definitively an error breaks, the better the likelihood of fixing it, and the more obvious the error and fix are.

jodah · on July 24, 2016

Author here: fully agree. Blindly retrying an operation any failed operation could lead to cascading failures or system overload, which is what circuit breakers are intended to avoid. Generally, it's just good to think about which failures can or should be retried or recovered from and what recovery should look like. A tool like Failsafe just makes it easier, hopefully, to do what you think is appropriate for the situation.

dredmorbius · on July 24, 2016

Thanks for the reply.

I haven't looked in detail at the library, and probalby don't have the chops to identify good or bad features. But the mechanisms described and my understanding of the origins of the concept of "fail safe" seemed at odds, and I wanted to raise the point.

bhntr3 · on July 24, 2016

They have a CircuitBreaker. They don't seem to have exponential backoff. But that seems close to correct. For a networked application, what do you think is safer than retrying with exponential backoff and circuit breaking?

jodah · on July 24, 2016

Author here, Failsafe does support exponential backoffs [1]:

  retryPolicy.withBackoff(1, 30, TimeUnit.SECONDS);

and if you want to specify the exponent [2]:

  retryPolicy.withBackoff(1, 30, TimeUnit.SECONDS, 1.5);

As for which failure handling strategy is safer or what it means to fail safely, in my experience it not only depends on the use case but the type of failure. Certain exceptions, even in a networked application, can and should be retried or recovered from while others cannot. Sometimes retrying is good, sometimes preventing subsequent executions (via circuit breakers), sometimes falling back to an alternative resource. It's all based on the scenario.

[1]: http://jodah.net/failsafe/javadoc/net/jodah/failsafe/RetryPo...

[2]: http://jodah.net/failsafe/javadoc/net/jodah/failsafe/RetryPo...

bhntr3 · on July 24, 2016

Cool! Didn't see that on my quick scan. I agree and will probably use this library. It's needed. Thanks.

nitrogen · on July 23, 2016

Very cool. Consistent and clear retry, backoff, and failure behaviors are an important part of designing robust systems, so it's disappointing how uncommon they are. If I were starting a new Java project today I would almost certainly want to use this library instead of the various threads and timers I had to hack together years ago.

heisenbit · on July 24, 2016

Indeed this is conceptually hard stuff. The reason for that I believe is that the problems one is solving are system level problems and not local ones. Another way to look at this: It is the other guys problem. A lot of naive retry strategies sort of work until one has a larger number of clients to deal with. I still remember the time trying to get through to a base-station designer who refused to acknowledge the need to do exponential back-off and other mitigation steps. We ran into interesting times shortly later in the field on the management system side. Personally I would also put in a bit of randomness to spread out requests when all clients were initially impacted at the same time and were thus synchronized.

jodah · on July 24, 2016

Good example of where random retry delays would be valuable. I filed this as a feature to add for the next release:

https://github.com/jhalterman/failsafe/issues/39

SwellJoe · on July 24, 2016

This title would be 100% better with "for Java" on the end.

_Codemonkeyism · on July 24, 2016

... for JVM languages.

ckugblenu · on July 23, 2016

Quite interesting. It shows potential to be used in numerous use cases. Anyone know of similar projects in other languages like Python and Javascript?

rdli · on July 23, 2016

(Full disclosure: co-founder of Datawire)

We released a microservices development kit (MDK) last week that includes similar semantics (e.g., circuit breakers, failover) that implements these semantics in Python, JavaScript, Java, and Ruby. The implementation is actually written in a DSL which we transpile into language native impls. We do this to insure interop between different languages. We're working on updating our compiler to support Go and C#, adding richer semantics, and making the service discovery piece pluggable (currently there's a dependency on our own service discovery).

https://github.com/datawire/mdk

sync · on July 23, 2016

We use re for Javascript, it works well: https://www.npmjs.com/package/re

rekwah · on July 23, 2016

Although, not feature parity with this project, Pybreaker[0] for the circuit breaker patterns in Python.

[0] - https://github.com/danielfm/pybreaker

Rapzid · on July 23, 2016

.Net has Polly https://github.com/App-vNext/Polly

Rauchg · on July 23, 2016

We use `async-retry` which implements `node-retry` in a way that's friendly to usage with `Promise` and `async/await`.

https://github.com/zeit/async-retry

garthk · on July 23, 2016

See also: Twitter's Finagle [1] for the JVM, and Bouyant [2] providing Finagle-as-a-microservice on localhost for language independence.

1: https://twitter.github.io/finagle/ 2: https://buoyant.io

cpitman · on July 24, 2016

How is this distinct from Hystrix (https://github.com/Netflix/Hystrix)? Why should I use one over the other?

jodah · on July 24, 2016

Good question. Someone asked that recently on Github - here's a quick comparison:

https://github.com/jhalterman/failsafe/wiki/Comparisons#fail...

vikiomega9 · on July 24, 2016

Is there a more detailed comparison?

For example,

>Executable logic can be passed through Failsafe as simple lambda expressions or method references. In Hystrix, your executable logic needs to be placed in a HystrixCommand implementation

It's not apparent to me what the advantage of either interface is. In both situations I have to define a "lambda" and hold state somewhere(either as an object field or passed into the lambda). Unless I'm something here, either seems acceptable.

jodah · on July 24, 2016

> Is there a more detailed comparison?

There's nothing more detailed that I know of. Is there a particular feature area/comparison you're curious about? I can add a bit more detail.

> It's not apparent to me what the advantage of either interface is. In both situations I have to define a "lambda"

What I meant by this bit is that the user experience is different. Failsafe can be used with method references or lambda expressions [1], which are a nice, concise way of wrapping executable logic with some failure handling strategy. You cannot do this with Hystrix since all logic must be wrapped in a HystrixCommand impl, which cannot be implemented as a lambda.

> either seems acceptable.

Like anything, it just depends on what you want. If retries and general purpose failure handling, consider Failsafe. If request collapsing, thread pool management and monitoring, consider Hystrix.

[1]: https://github.com/jhalterman/failsafe#synchronous-retries

vikiomega9 · on July 24, 2016

I'm curious about design in general. Circuit breaking and timeouts should be a well defined semantic of the caller and so my thoughts were more, how one could compose their code and bolt on failsafe, for example, but also quickly switch out to some other library.

ap22213 · on July 23, 2016

It seems like a well-thought, fluent interface to what lots of Java developers (especially Java 8 ones) inevitably have to write themselves.

mandeepj · on July 24, 2016

Please find some of these patterns for .net\azure\c# stack here - https://msdn.microsoft.com/en-us/library/dn568099.aspx

fdsaaf · on July 23, 2016

Beware of runaway retries: https://blogs.msdn.microsoft.com/oldnewthing/20051107-20/?p=...

Personally, I'd rather systems fail quickly, with retries only at the highest (application) and lowest (TCP) levels.