Notes on Distributed Systems for Young Bloods (2013) (somethingsimilar.com)
386 points by kiyanwang on Aug 8, 2016 | 133 comments



It's a step backwards that distribution is being dealt with at the application level. Tandem Computers [1] had distribution at the OS and database level working well in the 1980s. The technology is still available from HP.

Tandem's NonStop OS is fundamentally different from UNIX-like OSs. UNIX is a time-sharing system with a file system. NonStop is a transaction system with a database system. There's very little state outside the database. Programs are started, do one transaction, and exit. If there's a problem during a transaction, the entire transaction aborts and does not commit.

Tandem was running an average system uptime of about 10 years across their installed base, with the main cause of failure being human errors during hardware maintenance. One of their people wrote a paper on how to get the uptime up to 50 years. Tandem's own hardware had heavy self-checking. The goal was to detect errors and abort, not fix them. Failing a transaction would result in a failover to the paired backup machine, which would result in a retry and recovery.

Tandem's main problem was high hardware cost; you needed several of everything. Also, because of a series of acquisitions, the system had to be ported from Tandem to MIPS to Alpha to PA-RISC to Itanium to, finally, x86. x86 lacks the checking hardware, so it's not as good for this.

[1] https://en.wikipedia.org/wiki/Tandem_Computers


The tricky thing about that is that most highly-reliable distributed systems these days evolve from startups that are not-at-all-reliable.

Tandem's technology is incredible, even by today's standards and certainly by the standards of 1974. But they were designed for a world where you sold computers (hardware + software) to big established companies for millions of dollars. Once you saturate that market (which Tandem, Stratus, DEC, IBM, etc. did by the late 80s), what's next?

Since the mid-90s, most tech innovation has been driven by creating new companies, new networks, and new value chains to replace the old incumbents. This process is inherently experimental, and requires that you start out with a small system that grows into a large system. Buying a Tandem computer for the sharing-economy company you're starting out of your apartment is a non-starter; even if you could afford it, it couples you too tightly to a set of applications that probably isn't relevant for what you want to do.

That's why most of the interest in distributed systems today is in taking klutzy single-box programs that used to be Rails or Node apps and then turning them into robust, fault-tolerant systems that serve billions of users and never go down. That's where the money is. The market of existing big companies that will pay a lot for software is largely tapped-out; the upcoming growth is in working for small companies that are rapidly becoming big by replacing those incumbents.


By your own evidence, s/most tech innovation/most business success/


Well, it's innovation along a different dimension. There's been a massive amount of progress in terms of bringing computing to the consumer since 1995 - just compare the user interface of an ATM to that of a mobile app. There's been a concurrent slide backward in attributes like performance, reliability, and especially security. The market seems to reward ubiquity more than any other dimension right now, and you get products to match.

Much of the complexity in often-derided web & mobile app development comes from supporting features - like the back button, direct linking, social interactions, real-time response, and no-install usage - that directly benefit ubiquity, but come with penalties on performance/reliability/security/etc.


Bringing computing to the consumer in a way that lets you control their experience and extract rents, perhaps. We haven't been doing so well when it comes to the original liberatory purpose of personal computing.


"Bringing computing to the consumer in a way that lets you control their experience and extract rents, perhaps."

Desktops are more open and capable than ever. A number of those capabilities started in rent-seeker platforms, too. So, parent's point about increase in capabilities for consumers stands both for proprietary and FOSS.


That's not moving it to an OS level, that's rethinking everything to deal with the problem you think you have, but don't.

Making every subprocess transactional is a good idea, in theory. It makes it simple to clear and retry in case of failure. But by the time you've gotten to that point, you're already halfway through the problem of distributed systems, dividing your program into small, transactional slices.

Once you're there, though, the Tandem architecture paints you into a corner. These database-oriented architectures do not scale to having large numbers of useful-sized transactions running at once. The bottlenecks of the OS and Database layer become insurmountable. Your architecture is a classic case of a local optimum.


I think there is a bit of misunderstanding of what Tandem really is. It looks like it relies on communication between processors never failing and being fast, which makes a lot of things easier, but at the same time this also means it isn't what we'd consider a distributed system. It's more like a single big computer that can fail.

Btw, Erlang OTP shares many ideas with Tandem. And Joe Armstrong actually discussed them in his thesis [1] and talks and articles [2].

[1] http://erlang.org/download/armstrong_thesis_2003.pdf [2] http://cacm.acm.org/magazines/2010/9/98014-erlang/fulltext


The original Tandem systems involved all processors on a redundant local bus. The system was later extended to allow processors to be networked remotely. The two machines acting as primary and backup for a transaction no longer needed to be in the same room. They did require a low-latency connection, though. This was all pre-Internet, so Tandem had their own protocols.

[1] http://www.hpl.hp.com/techreports/tandem/TR-86.2.pdf


Agreed. Anyone who wants to read some intelligent thoughts about the advantages of the Tandem system should read what Joe Armstrong wrote about it, in his own work developing Erlang:

http://erlang.org/download/armstrong_thesis_2003.pdf


Are there Tandem programming manuals or tutorials online for the instruction set, for TAL, and for T/TOS/Guardian? I'd like to learn enough about the programming model to be able to get a feel for what it was like to write and debug programs on it.


Hit Wikipedia and Bitsavers. I recall the first has many links to detailed papers; the latter often has programming, admin, etc. guides. I'm on mobile right now away from my papers, but those were my sources.


It turns out HP has the Tandem manuals online (or at least 790 of them) and is still somewhat actively developing the platform. http://h20565.www2.hpe.com/portal/site/hpsc/template.PAGE/pu... has the manuals.


Good you found some. And "somewhat developing?" It's their main, mission-critical platform:

http://www8.hp.com/us/en/products/servers/integrity/nonstop/...

It's why they ditched OpenVMS. Both advertised similar things for similar markets and hence were competing products. Now, HP doesn't necessarily update NonStop quickly; as with any legacy acquisition, they milk it for cash. They are updating and promoting it, though.

Let me look at the manuals. I recall the OS was called Guardian. Found it! Here you go buddy:

http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c02543407...


I think HP's main platform these days is Microsoft Windows on amd64. But yeah, NonStop seems a lot more alive than Itanic, HP-3000, Alpha, VMS, Domain/OS, the HP 9000, PA-RISC, PDP-11, the PalmPilot, the VAX, the Convex, webOS, maybe even HP-UX. In that context, maybe you can see how NonStop's remaining vitality surprised me?

And yeah, the Guardian Programmer's Guide you link there was the thing I read the most of last night. I don't suppose you know of any material covering the details of passive failover in process pairs? Because the current version of the Guardian Programmer's Guide pretty much only covers active failover, saying that passive failover isn't compatible with heap allocation.


" Because the current version of the Guardian Programmer's Guide pretty much only covers active failover, saying that passive failover isn't compatible with heap allocation."

Just did a quick search to find you an answer. The materials kind of conflate general stuff about backups/redundancy with NonStop-specific stuff. Makes it confusing. Might have hit paydirt here, though:

https://h20566.www2.hpe.com/hpsc/doc/public/display?sp4ts.oi...

On page 185 about process pairs, it says this:

"Passive backup, in which the backup process passively waits to be activated in the event the primary process fails."

"Active backup, in which the backup process actively updates its state while listening for system messages."

I could be reading too much into my skimming. However, the initial description looks to me like the passive model doesn't run at all unless a failure has already occurred, whereas the active one constantly gets relevant state from the primary. That means passive is inherently stateless in operation. Applications using heap are stateful. So, they need a stateful/active backup component.

I see this playing out in practice where (example) HTTP servers would use passives and databases would use actives. Availability of passive for such needs is probably a cost/performance optimization.
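
If it helps to see the distinction in code, here's a rough sketch in Erlang terms (Erlang/OTP borrowed a lot from Tandem, per the other subthread). Every name in it is invented and it's an illustration of the idea, not NonStop code: an active backup gets a state checkpoint on every update so it can resume from the last one, while a passive backup would wake up with no state at all, which is why it can't cover heap-allocating programs.

    -module(pair_sketch).
    -export([start/0, primary/2, backup/1]).

    %% Toy "active backup" process pair: the primary checkpoints state to the
    %% backup after every update; the backup monitors the primary and takes
    %% over from the last checkpoint if the primary dies.
    start() ->
        Backup  = spawn(?MODULE, backup, [#{}]),
        Primary = spawn(?MODULE, primary, [Backup, #{}]),
        Backup ! {watch, Primary},
        Primary.

    primary(Backup, State) ->
        receive
            {update, From, Key, Val} ->
                NewState = maps:put(Key, Val, State),
                Backup ! {checkpoint, NewState},   % active: ship state now
                From ! {ok, Key},
                primary(Backup, NewState)
        end.

    backup(LastCheckpoint) ->
        receive
            {watch, Primary} ->
                erlang:monitor(process, Primary),
                backup(LastCheckpoint);
            {checkpoint, State} ->
                backup(State);
            {'DOWN', _Ref, process, _Primary, _Reason} ->
                %% Takeover: continue as primary from the last checkpoint,
                %% with a fresh backup of its own. (Clients finding the new
                %% primary is the part real process pairs solve with naming.)
                NewBackup = spawn(?MODULE, backup, [LastCheckpoint]),
                NewBackup ! {watch, self()},
                primary(NewBackup, LastCheckpoint)
        end.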

Note: Thanks to "spitfire" in another thread for giving me the link below that led to the availability guide. Great they kept the reports up.

http://www.hpl.hp.com/techreports/tandem/

Note 2: I'm keeping that availability guide as it might have nice tips on the topic usable for OSS solutions. I've reposted the OpenVMS Availability Guide a few times because it's still relevant if you swap the proprietary for OSS stuff.


Lol. Yeah, their track record is only slightly worse than Betamax at this point. The HP 9000 or 3000, whichever was the high-level-language machine, was actually kind of neat. Destined to die at that point, though.

"I don't suppose you know of any material covering the details of passive failover in process pairs?"

Not on-hand. "Not compatible with heap allocation" makes me think it's for reactive stuff like real-time control. I'm working soon, but I might try to look it up tomorrow in whatever I downloaded before.


This is the first time I'm hearing about Tandem. Are there any other noteworthy, older computer technologies that are in some ways superior to current technologies?


In some ways? That's easy. Aside from the ones nostrademons points out:

The Transputer: a CPU architecture centered on communication.

Maybe the Forth machines like the Novix and the MuP21, currently in the form of the GreenArrays chips: very simple, performance-competitive with much more complex concurrent alternatives, but harder to program.

The F-83 Forth system, which also had an IDE you could modify and customize while it was running, but ran comfortably on an IBM PC without a graphics card.

The Alto that Interlisp and Smalltalk ran on, which ran user-defined microcode in order to allow you to customize the machine for your language.

The AS/400.

Pick.

APL.

The CM-4.

The Cytocomputer, although I think modern technologies caught up with it a few years back.

HP programmable calculators, starting from the 9100, had their CPU instruction set as their user interface, making them relatively easy to program. Unfortunately, as with TECO, this involved compromises both as the UI and the instruction set.

The IBM SP2 high-speed switch with its buffered cut-through wormhole routing for super-low latency.

Oberon: a full-stack computing system including its own compiler and novel, customizable GUI, within a single book's worth of code.

Pad++, the canonical zooming user interface prototype.

Lotus Improv, reportedly.

Sketchpad, which was the first graphical user interface and the first constraint-programming system — arguably an enormous improvement not only on its predecessors but also most of its successors.

Augment/NLS; seen the Mother of All Demos?

Usenet, though it couldn't survive spam — censorship-resistant, peer-to-peer, super-low-bandwidth, fault-tolerant, and with superior threading mechanisms and UI to a lot of what we use now. I saw a talk a few years back by someone who'd set up a private Usenet server to sync sensor data from an intermittently connected polar weather sensing platform.


I have a list here with at least one example from numerous styles:

https://news.ycombinator.com/item?id=10957020


Some examples I can think of:

- Burroughs B5000. Designed from the start to be programmed solely with high-level languages. Supported type-tagging directly in the hardware. All code was automatically re-entrant; problems like PC-lusering, interrupt handling, concurrent execution, and unsafe accesses to globals simply didn't exist. All of this in 1961.

- Smalltalk, Interlisp, and the Lisp Machine development environment (late 70s). All of these had the property that you could modify & customize your IDE as it was running - it's sorta like Eclipse or Emacs on steroids, but rather than a plugin architecture where you have to wade through megabytes of docs to figure out how to write some code, the IDE was just written in itself and open for re-use and modification. Smalltalk in particular was very good about letting you inspect & pull up documentation on any object inside the running system, seeing the runtime values of its fields.

- Smalltalk also let you search by example, e.g. you could type in the inputs and the outputs of a function and it would suggest which function you were thinking about.

- The Common Lisp condition system (mid 80s) is way more thorough than any error-handling system used today. A key feature was that error-handlers could run in the stack frame that caused the error; thus, they could inspect the code and data that caused it, and attempt to correct the problem, letting the algorithm run uninterrupted. If the handler couldn't do this, you had the option of dropping into an interactive debugger, fixing the code, and continuing where you left off.

- Multics (1964) did away with the distinction between files and memory. It was as if every file was implicitly mmapped, or alternatively, the processes's memory was just another file.

- The capability security model (late 90s) fixes many of the security problems that most systems today have. The key idea is that instead of letting each subsystem (remote node, process, library, function call, etc.) know about every other node in the system and then forcing them all to check whether each request is authorized and from an intended sender, each subsystem can only act on things for which it has explicitly been delegated a capability. For example, instead of every process having access to the open() system call and the syscall checking the process's uid against the ACLs for the file, each process would be passed a list of file descriptors on which it could operate, and no mechanism would exist for accessing something outside of that set. (There's a small sketch of this idea after this list.)

- Exokernels (late 1990s) were a reformulation of the role of the operating system such that the OS is responsible only for securely multiplexing the hardware, and any abstraction mechanisms are delegated to userspace libraries. They've been undergoing a bit of a comeback under other names lately, with virtualization hypervisors like Xen being basically an exokernel, and unikernels being the libOS.
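
Since the capability point is the easiest of these to see in code, here's a rough sketch of the shape of it in Erlang (names invented; Erlang doesn't actually enforce this, since the worker could still call file:open/2 itself, so treat it as an illustration of the interface rather than a real capability system): the parent opens the file and hands the worker only the handle; the worker is given no path and no ambient open() to reach for.

    -module(cap_sketch).
    -export([run/1, worker/1]).

    run(Path) ->
        {ok, Fd} = file:open(Path, [read, binary]),
        %% Delegating the capability = passing the open handle, nothing more.
        spawn(?MODULE, worker, [Fd]).

    worker(Fd) ->
        %% All this code path can do is read from the descriptor it was given;
        %% it never sees a filename or a uid check.
        case file:read(Fd, 1024) of
            {ok, Data} -> io:format("read ~p bytes~n", [byte_size(Data)]);
            eof        -> io:format("empty file~n");
            {error, R} -> io:format("read failed: ~p~n", [R])
        end.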


Oh, and some examples of things that were largely considered dead & unknown relics when I first started programming in 2000, but have since undergone renaissances and are now quite mainstream:

- Lexical closures were largely unknown to working desktop programmers. They've been around since Algol in 1960, but Assembly/C/C++/Java don't support them and those languages had "won" in 2000, so they were quite unknown to programmers first encountering them in Perl, Python, Ruby, or JS (the first three of which have strange variable-capture behaviors). Now basically everyone knows about them.

- Erlang-style distributed programming was also a lost art; I remember coming across some mailing list posts by Joe Armstrong c. 2003 where he described Erlang's strategy as "wait for everyone else to realize this is important." Now that future is here, and we get articles like this one.

- Type inference has been around since the 80s, but it was missing from virtually all mainstream languages, which either had manifest static typing or dynamic typing. Almost all languages released since 2009 have been type-inferenced.

- map/filter/fold functions were also something commonly known to Lisp/ML/Haskell programmers, but unknown to working C/C++/Java software engineers. Now they're fundamental to many programming languages, and the basis for MapReduce (which itself was revolutionary in bringing functional programming ideas to a mainstream audience).


Proper lexical closures were a prominent feature of Perl 5 in the mid 1990s - hardly unknown!

Type inference dates from 1969 (Hindley) and 1978 (Milner)


> - Multics (1964) did away with the distinction between files and memory. It was as if every file was implicitly mmapped, or alternatively, the processes's memory was just another file.

Don't forget IBM System i (AS/400), which has a single persistent address space for everything. Frank Soltis' Inside the AS/400 is an interesting read.


The capability security model dates from the 1970s or before - see for example https://en.wikipedia.org/wiki/CAP_computer


Thanks for the info but I think that I was looking for something more obscure :-). Smalltalk, Lisp, and Erlang are all pretty much household names on this site.


FWIW the search-by-example thing got added to Squeak sometime in the 2000s IIRC.


OpenVMS and its file system were amazing for clustering. Amazing how HP owns all the cool always-on technology.


Retrograde degression is a thing in computers. Those who fail to read the code, re-invent the code.


I've spent the night reading Tandem tech reports and manuals. I think the system you have described is very interesting, and it's maybe the system Jim Gray would have liked to build, but it doesn't seem to be the system they actually did build. The system they did build is a lot like Unix, except that it lacks the uniform hierarchical filesystem namespace that is the heart of Unix. (And, maybe more fundamentally, Unix is first of all tasteful; Guardian/NSK is completely lacking in taste, and everything about it is fractally ugly.)

The programs running under the Tandem operating system/Guardian/NonStop Kernel (they couldn't make up their mind about the name) did not, in general, start, do one transaction, and exit. Instead, they were long-running processes very much in the Unix mold. They even had suid (which makes no sense), a superuser, and zombie processes.

The filesystem has what looks like an ISAM library, called Enscribe structured files, built into it, and pretty decent locking. I guess that's what you're calling "a database system". Also, there are SQL files as an alternative to Enscribe files.

There is some really interesting stuff going on in interprocess communication for fault-tolerance, but it's not transactional; instead, it has to do with process naming, sequence numbers, and copying around memory space images between process pairs running on separate "CPUs", so that if one of them fails, the other one can pick up from the last checkpoint. The IPC mechanisms get involved in order to replay any responses to requests that the failed process made between its last checkpoint and the crash, which is what the sequence numbers are for. Every time you open a file, you have to declare how many messages that file might need to replay in case of failover (the "sync depth"). The IPC is kind of vaguely Symbianish, really.

I put "CPUs" in quotes because each CPU had its own memory space, communicating with the other CPUs by sending them messages, not by mutating shared memory. Shared memory is death to fault tolerance. CPUs within a "node" or "system" were numbered, like cattle; nodes were named, like pets.

There was also distribution between nodes — you could access files or devices or do IPC to processes on any other node very nearly as simply as on your own node.

They did have better IPC and locking than Unix, and additionally there's this transaction-management facility called Transaction Management Facility that "audits" selected files by snapshotting data that's about to be overwritten, and allowing you to roll back in case of a transaction abort. I don't fully understand the TMF programming interface and semantics yet, but it definitely wasn't a systemwide universal thing applied to all files, and it definitely didn't need your program to exit in order to commit.

One thing that makes this stuff super hairy is that it has to handle transactions that include activity in multiple separate processes because of how Tandem encouraged you to structure your applications in a requester-server way, with one requester possibly talking to several servers; the server needs to associate its updates with the requester's transaction so they can be rolled back if need be. And at any moment any of the computers ("CPUs") involved can fail. But Tandem didn't have to deal with the really hairy part of that hairy problem, which is that when a computer fails, you don't know if it failed or is just slow — they had a 300ms heartbeat from the beginning.

About the hardware architecture, you say, "the system had to be ported from Tandem to MIPS to Alpha to PA-RISC to Itanium to, finally, x86." As far as I can tell, the path was their own custom HP-3000-ish hardware ("TNS") → MIPS ("TNS/R") → Itanic ("TNS/E") → amd64 ("TNS/X"). I can find no trace of Alpha, PA-RISC, or i386. Also, I don't think the issue there was acquisitions; they were still independent when they switched to MIPS, but then MIPS got bought by SGI, a direct competitor and maybe not the stablest company, and anyway Itanic seemed like a good bet before everyone saw the iceberg. And certainly deserting Itanic for the amd64 lifeboats can't be blamed on acquisitions.

They claimed from the beginning that their hardware cost was comparable to competitors, but maybe they were lying. And surely their hardware cost didn't fall as fast as Google's did in the 2000s. So maybe even if they weren't lying, they might have become wrong.

Still, I'd propose as an alternative that their main problem was that they weren't IBM and, unlike e.g. Amdahl, they didn't run S/370 assembly.

What do you think? Am I missing the forest for the trees?


"I've spent the night reading Tandem tech reports and manuals."

Wow. Great, detailed write-up from a night. I've been waiting for one like this. You should look at the links to the hardware in the Tandem or NonStop wikipedia articles, though, as the tradeoffs made and techniques used were sometimes interesting.

" instead, it has to do with process naming, sequence numbers, and copying around memory space images between process pairs running on separate "CPUs", so that if one of them fails, the other one can pick up from the last checkpoint."

A bit reminiscent of the cache-coherence & other memory protocols in NUMA machines. Just for availability as much as for scaling.

" they had a 300ms heartbeat from the beginning."

Costs money but works. :)

" I can find no trace of Alpha, PA-RISC, or i386. Also, I don't think the issue there was acquisitions;"

It's in the Wikipedia article along with some of those hardware details I referenced above:

https://en.wikipedia.org/wiki/Tandem_Computers

They got acquired by Compaq. Half-way through the Itanium port, idiots in management forced them to switch it to Alpha. Then Compaq killed Alpha itself. Later, Intel started killing Itanium. NonStop ran into bad management and bad luck. I see no PA-RISC or vanilla x86. Actually, Itanium was a joint venture between Intel and HP to replace x86 and PA-RISC with something having the benefits of both plus experimental stuff. Porting it to PA-RISC instead of Itanium would have been a downgrade.

"They claimed from the beginning that their hardware cost was comparable to competitors, but maybe they were lying."

IBM's prices are so inflated that many mainframes competed on price. I read before that Fujitsu's were 50-75% of the price. As far as NonStop, it's around $120,000-$2.7 million. Quotes I heard on IBM mainframes were much higher, especially on the high end.

http://www.fundinguniverse.com/company-histories/tandem-comp...


Glad you enjoyed it!

Heartbeats, to a first approximation, don't cost money. You could do an entirely reasonable heartbeat using parallel ports (like Princeton PAPERS, but for failover rather than HPC), sound cards, or a couple of GPIO pins. What heartbeats cost is the ability to use off-the-shelf networking hardware and software, because it's all best-effort — nothing based on Ethernet tries to guarantee delivery, much less latency.

I guess it makes sense that if they never shipped an Alpha product, their tech reports and manuals would never mention it. I guess John was right about that!

I don't think Tandem was comparing their prices to IBM prices in the early years, though I could be wrong; I think they were comparing them to the prices of the Seven Dwarves.


"Heartbeats, to a first approximation, don't cost money. "

Over a WAN? The low-latency connections were an important part of their WAN high-availability per Animats. Leased lines here cost more than a house payment.

"I think they were comparing them to the prices of the Seven Dwarves."

Well, they were called the Seven Dwarves because IBM dominated the market. So, Tandem was taking on IBM and all of the rest. They had to be price-competitive with Seven Dwarves while looking way better than IBM, esp price/performance. Them getting SQL to scale linearly in NonStop SQL was one of their differentiators IIRC.


Exactly! VMS showed how to do clustered setups well. Then Tandem did it for higher uptime at the HW/SW level in like 4 or 5 different ways. I plan to clone one of the older, patent-free ones at some point in the future with secure CPUs like SAFE, CHERI, etc. Show Dan Geer you can get high security and availability in the same package.

Btw, it certainly doesn't have to be so expensive in a FOSS or nonprofit model. Definitely cost a few times more than regular servers but not six digits or anything.


"One of their people wrote a paper on how to get the uptime up to 50 years."

I forgot to ask you if you have a link to that paper or its title/author?


I once witnessed a demo of Tandem's capabilities, a kind of cowboy show, where the Tandem rep wheeled out a running box, terminal running, app running .. took out a .38 special, and shot the box. The app kept running. The contract got signed.

The 80's were a fantastic, fantastic time.


You may think you understand your specification but unless you used a model checker or proof assistant there will be bugs in your implementation. Unit tests are not a specification (or are, at best, incomplete). Get used to predicate calculus, temporal logic, and writing proofs. If you cannot express a design with a model or proof that can be verified you're purposefully leaving bugs in your system.

They may be buried > 40 steps deep in some code path where the stars have to align just right to trigger them. At the scale where Gustafson's law makes the trade-offs acceptable, you will find that such errors are no longer "highly improbable."

Don't accept, "battle-tested," as the hallmark of a reliable system or algorithm -- all that means is that people willing to tolerate the risk did so and paid for it for a few years. There are still bugs in that system and they will affect you or your users (and often both).


> Get used to predicate calculus, temporal logic, and writing proofs.

But at that point, you need to start considering bugs in proofs. You're saying "Don't accept, "battle-tested," as the hallmark of a reliable system or algorithm" - and that would be awesome - I'd like to use a fully verified system. The reality however is that so far we've verified some specific implementations of specific algorithms and put massive effort into verifying something like seL4... and that's it (in publicly available systems, anyway). Unless you're into spending billions, nobody will prove their systems correct.


> But at that point, you need to start considering bugs in proofs.

If there's a flaw in your specification it is much more obvious than in your code. A high-level specification is usually built up from formal mathematics to model the execution of a system. If you use a model-checker it will take the specification of the model of your system and check that the invariants and properties you specify will hold across all possible executions of that model. If there is a problem in your specification it will be immediately obvious which cannot be said of computer programs.

In other words if you cannot see the error in your specification what makes you think you should be implementing it?

Now there is a long-held myth that we cannot verify the entire software stack... so what's the point?

You do not have to verify the entire software stack. A specification of a critical component is often sufficient. Take Paxos as an example: the author, Leslie Lamport, wrote the specification in TLA+. You can follow the mathematics to convince yourself of its efficacy or you can trust that the model checker is implemented correctly and will be sufficient for your purposes. The same is implied for a proof assistant. The point is that you do not need to assume these things are true: you can verify them by hand if you have to... the math will check out.
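
To make "hold across all possible executions" concrete, the general shape of the claim a TLA+-style checker verifies looks like this (loosely, in temporal-logic notation; Init, Next, Inv, and Done are placeholders, not any particular spec):

    Spec \;\triangleq\; Init \,\land\, \Box[Next]_{vars}
    \text{Safety:}\quad Spec \Rightarrow \Box Inv
    \text{Liveness (given a fairness assumption):}\quad Spec \Rightarrow \Diamond Done

Init constrains the starting states, Next the allowed steps, Inv is the invariant you claim holds in every reachable state, and Done is whatever must eventually happen; the checker exhaustively explores the states Spec allows and hands back a counterexample trace if either implication fails.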

The same is true for other critical components of distributed systems. A single and even multi-threaded application on a single computer is difficult enough to reason about (what with multiple cores executing tasks out-of-order and sharing different levels of cache... it's almost like a tiny distributed system itself). Humans make terrible predictions of performance and correctness that fail time and again when they are benchmarked and profiled. It gets magnitudes worse in large, distributed systems and specifications are only a tool to manage thinking in those terms.

It's actually a tool for simplifying the development of such complex systems. I don't trust a consensus protocol or a lock-free algorithm if it is not accompanied by some sort of verifiable specification... not some hand-waving English prose, but a hard formal mathematical specification.

Do we still use these under-specified systems? Yes, sometimes. The trade-off being that the risk is acceptable and we design ways to tolerate failures as best as we can. However I will prefer a system that provides specifications for its key components.

On the myth of cost... it's true, thinking hard about your system up-front has a cost to it. Where and what you choose to specify will have some correlation to how much time you choose to spend. However the other component of the equation to consider is how critical is it to get this system correct? Another way to think about it is to consider what is the worst thing that could happen if your design contains a flaw.

Introducing such a flaw is obscenely easy to do in distributed systems. Our brains are not well enough equipped to imagine a model of our system and trace its execution and all of its state 49 steps deep. When confronted with such information many of us are not even capable of seeing the forest for the trees! We think, "Bah, that could only happen once in 250,000 executions.. it's not worth worrying about!" We don't realize that when you make that bet a couple of million times a minute, you will see that error quite often. Finding those kinds of traces postmortem is incredibly difficult... and why waste the time searching for that error when spending the extra time up-front will save you the headache down the road?

Yes it requires more work and it costs more at the beginning of a project... but if you cannot afford to spend it then know what you're getting yourself into and act accordingly.

I honestly believe that these mathematical tools can be taught to intermediate-to-advanced programmers who could use them to check their designs and help them write better software. It may seem intimidating but it's just booleans: there are only two possible values for any expression in predicate logic! In all seriousness the tools that are available today to do this work are becoming use-able by mere mortals and are not the eldritch scrawls of snooty ivory-tower academics.

You don't need to spend billions. You can learn TLA+ in a couple of weeks and start writing specifications in a month. There's a fairly slim tome called Programming in the 1990s that demonstrates a practical technique for proofs that is digestible by programmers.

I say, "don't accept 'battle-tested' as the hallmark of a reliable system," because it says nothing about the underlying, undiscovered errors in the implementation (I made the mistake of using the word, bug, earlier... a habit I am trying to correct). All that buzz-word signifies is that the developers have attempted to implement a white-paper as best they could and wrote unit tests for the code they care about and used their customers to find the errors and edge-cases they didn't initially consider. While that is valuable to you after the fact you must be wary that you'll probably find some of those errors yourself. Hopefully the paper they based their critical sections on published a specification or proof. That's always a bonus... but you need to check these things.

Worse, someone might try to sell you on their distributed mutex based on their well-reasoned essay describing said algorithm... don't trust it unless you can check the model or understand the proof. It may sound reasonable but engineers don't build sky-scrapers without blueprints.


> In other words if you cannot see the error in your specification what makes you think you should be implementing it?

Since we're talking about proving correctness, nobody can prove they didn't miss an error - either in the specification or implementation. Even wikipedia provides a trivial case of that in the article on TLA+ https://en.wikipedia.org/wiki/TLA%2B#Liveness

Would you never miss the fact that you didn't specify that the clock must tick? Would you never miss an error like that in a system orders of magnitude more complicated?

So by extension - if you cannot ensure you see an error in your specifications, what makes you think you should write them? I'm not saying the specification will be incorrect - you can check that of course. I'm saying it can be incomplete.

It's the same as a test. You write more of them, you catch more bugs. You specify more, you verify more. In either one you can't prove you covered all the cases you implicitly meant to.

Honestly, at this point, I'd rather take an implementation that's battle-tested by a large company for months than a verified implementation which hasn't been used by anyone apart from the authors in their lab. (and yeah, that may not be a popular view with some people :) )


"So by extension - if you cannot ensure you see an error in your specifications, what makes you think you should write them?"

Your standard here is that it must achieve perfection or we don't use it at all. A realistic approach is to use anything proven to increase correctness, reliability, security, whatever as much as one can justify. Decades of work in formal specification show such specs catch all kinds of errors. The most recent, which I think agentultra was referring to, was Amazon's use of TLA+ catching problems that would've required 30 or more steps in testing. Easily. They caught so many problems & improved their understanding enough that the team is totally sold on it. Same kinds of results as safety-critical industry and high-assurance security. Nobody doubted it was helping them.

So, such empirical evidence showing all kinds of problems caught by the method plus benefits means it's a good method to use if one can. The complexity of distributed systems increasing potential problems means one should be increasing their tooling rather than decreasing it. For another good tool there, check out the Verdi system for verifying distributed systems against failures. As usual, applying it already caught problems in real-world stuff.


> Your standard here is that it must achieve perfection or we don't use it at all.

Quite the opposite actually. It was just a hyperbolic response to "why implement if you don't have proven specification" - which you explained why it's silly as well.

I completely agree - if you have time, resources and proper setting for it - go for proofs and verification. Most software as we use it today cannot afford and doesn't need this. The bugs are either acceptable or not, but they're never "purposefully [left] in your system" as the OP claimed. (which was the phrase that really annoyed me / caused the response)

I'm quite excited about the progress in software verification. But outside of very special cases today, I don't think it's worth it.


> The bugs are either acceptable or not, but they're never "purposefully [left] in your system" as the OP claimed. (which was the phrase that really annoyed me / caused the response)

It seems you are really annoyed that I give responsibility for "bugs," or as I like to call them: errors, to programmers. If you do not use a specification for your lock-free mutex or consensus protocol then I posit that you are willfully choosing not to use a tool that can help you avoid errors in your design and that you should conduct yourself accordingly. The original article provides many good suggestions for coping with this situation.

And while I'm on the subject it is worth pointing out that while the specification might check out the implementation may still contain errors. This is normal. The specification gives you a compass that helps you to track down errors in your implementation. Once found it gives you another invariant to add to your specifications and another assertion or contract to add to your code.

These are good tools to have and as I and others have suggested one does not need to "verify" the entire system to make use of specifications. They are a great design tool in the programmer's tool belt that help us to write better software. Distributed systems are hard enough and we need all the help we can get.


"If you do not use a specification for your lock-free mutex or consensus protocol then I posit that you are willfully choosing not to use a tool that can help you avoid errors in your design and that you should conduct yourself accordingly."

Sort of. Remember that, if they work for companies, programmers' job is not to write bug-free software. It's to translate requirements into software within constraints imposed by management. They should put in what quality they can. They often don't have time to spec out stuff because management won't allow it. "Ship, ship, ship!" It's why I typically recommend them re-use previously-proven solutions for stuff like you mentioned.

So, they certainly are leaving errors in intentionally if they avoid tools that would catch them. Yet, it's not always their fault. It's also not always a bad thing if we're talking hobbyists doing stuff for fun where they and/or their community accept some occasional bugs. On the opposite end, people would get fired at Altran/Praxis for not doing what they could for correctness. To each their own based on their needs or situation.


Hence the caveat: conduct yourself accordingly.

Just because you don't use specifications doesn't make you a bad programmer but you must be aware of your own short-comings and plan for them. If you're not allowed to use specifications you're being told you're not allowed to think. If that's the case you know there are going to be errors produced by the team either in the design itself or in the implementation.


> Implement backpressure throughout your system.

This is a subtle one and gets lost among CAP theorem talk, but it is important.

Just recently had to struggle with adding backpressure. We did some things as just sending a message (a "gen_server:cast" in Erlang). The sender shoved all 300K+ messages into the mailbox of the receiver, which could not process them in time because of a bottleneck and a bug. The receiver's memory usage went to 120GB before it eventually restarted.

The simple solution was just to switch a gen_server:cast to a gen_server:call, i.e. make the receiver acknowledge, and the sender wait for the acknowledgement before sending another message. (As an aside, that's the nice benefit of Erlang/Elixir there; it was really a 2-line change. If it were some kind of serialization library + RPC socket layer, it would have been a lot more code to write.)
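
For anyone not fluent in OTP, the shape of that change is roughly the following. This is a sketch with made-up module and message names, assuming the receiver is a gen_server whose handle_call/3 replies ok; it's not the actual code from above.

    -module(ingest_sender).
    -export([send_all_async/2, send_all_sync/2]).

    %% Before: fire-and-forget. The sender can outrun the receiver, and the
    %% receiver's mailbox (and memory) grows without bound.
    send_all_async(Receiver, Msgs) ->
        lists:foreach(fun(M) -> gen_server:cast(Receiver, {ingest, M}) end, Msgs).

    %% After: each send blocks until the receiver replies, so at most one
    %% message per sender is ever queued; it's the crudest form of backpressure.
    send_all_sync(Receiver, Msgs) ->
        lists:foreach(fun(M) -> ok = gen_server:call(Receiver, {ingest, M}, 5000) end, Msgs).

gen_server:call/3 also gives you a per-send timeout (5000 ms here), which ties into the deadline discussion elsewhere in the thread.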

An obvious corollary of this is "always test on production workload and conditions, not just your laptop." What had happened locally is that with smaller test data it would fit in memory, and for 10GB or so it wasn't noticeable. So it seemed like everything was working fine.

But speaking of CAP theorem this is always a great fun:

http://ferd.ca/beating-the-cap-theorem-checklist.html

---

Beating the CAP Theorem Checklist

Your ( ) tweet ( ) blog post ( ) marketing material ( ) online comment advocates a way to beat the CAP theorem. Your idea will not work. Here is why it won't work: ...

---


Thanks for sharing this. A few questions from someone interested in learning how to use BEAM-based systems in production:

* How did you debug the issue?

I know Erlang in Anger [1] is kind of written just for this but it feels pretty intense to me as a newbie. I don't know enough to tell whether the issues there are only things I'd need to worry about at larger scale, but I found it pretty intimidating, to the point where I have delayed actually designing/deploying an Erlang solution. Designing for Scalability with Erlang/OTP [2] has a chapter on monitoring that I'm looking forward to reading. Wondering if there are some resources you could recommend to get started running a prod Erlang system re: debugging, monitoring, etc.

* How do you decide when to use a message queue (like sqs, RabbitMQ) vs regular processes? Do you have any guidelines you could share or is it more just, "use processes when you can; more formal MQ if interfacing with non-beam systems"? I struggle since conceptually each sender/receiver has its own queue via its mailbox.

1. https://www.erlang-in-anger.com/

2. http://shop.oreilly.com/product/0636920024149.do


Good question. So first I noticed in the metrics dashboard (so it is important to have metrics) that the receiver never seemed to have gotten the expected number of messages.

Then noticed the count of messages was being reset. Suspected something had restarted the node. Browsed through metrics, noticed both that memory usage was spiking too high and that the node was indeed restarting (uptime kept going up and down).

Focused on memory usage. We have a small function which returns the top N memory-hungry processes. Noticed a particular one. Noticed its mailbox was tens of gigabytes.

Then used recon_trace to trace that process to see what it was doing. recon_trace is written by author of Erlang In Anger. So did something like:

    recon_trace:calls({my_module, '_', '_'}, 100, [{scope, local}]).
The dbg module is built-in and you can use that instead, but it doesn't have rate limiting, so it can kill a busy node if you trace the wrong thing (it floods you with messages).

So I saw what it was doing and noticed that a particular operation was taking way too long. It was because it was doing an O(n) operation instead of O(1) on each message. On a smaller scale it wasn't noticeable, but when it got to 1M+ instances it was bringing everything down.

After solving the problem, to test it I compiled a .beam file locally, scp-ed it to the test machine, hot-patched it (code:load_abs("/tmp/my_module")), and saw that everything was working fine.

On whether to pick a message queue vs regular processes: it depends. They are very different. Regular processes are easier, simpler, and cheap. But they are not persistent. So perhaps if your messages are like "add $1M to my account" and the sender wants to just send the message and not worry about acknowledging it, then you'd want something with very good persistence guarantees.


This isn't backpressure, this is forcing synchronicity by using a queue depth of 1. For low throughput applications this is fine, but the latency will kill you in the long run and vastly underutilize available resources.

Backpressure should be proportional to the amount of work you want to prevent. And that backpressure should propagate up the event stream to limit the whole capacity of the subgraph. The slowest chain governs the speed of the whole system so that it can retain functionality instead of falling over. This relates to other properties of the system:

  * work stealing, processes handle their own dynamic capacity
  * prioritization, we can automatically gauge what to NOT do
  * feedback


You can think of synchronicity as the most basic form of backpressure. The goal is to avoid flooding the target when the source sends at too high a rate.

But that was a quick fix. Eventually I implemented something else. Note that in my case there was also a bug, so it wasn't just due to normal hardware limits.

One other simplistic hack I found is to force the source to send every Nth message synchronously. Say it sends 10k messages as casts; then, if the target is too slow and has queued them up, the source sends the 10,001st as a call, so it is forced to wait for the target to process them before sending the next. (Sorry, reverting to Erlang/OTP terminology here: cast means send without acknowledgement, call means wait for a response.)
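
In code, the hack is basically a counter (made-up names again). It leans on the fact that messages between two Erlang processes are delivered in order, so when the call returns, every cast sent before it has already been handled by the receiver.

    -module(chunked_sender).
    -export([send_all/3]).

    %% Mostly-async send with a periodic synchronous checkpoint: every Nth
    %% message goes out as a call, forcing the sender to wait until the
    %% receiver has drained everything queued ahead of it.
    send_all(Receiver, Msgs, N) ->
        lists:foldl(
          fun(M, Count) when Count rem N =:= 0 ->
                  ok = gen_server:call(Receiver, {ingest, M}, infinity),
                  Count + 1;
             (M, Count) ->
                  gen_server:cast(Receiver, {ingest, M}),
                  Count + 1
          end, 1, Msgs),
        ok.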


Agree, but that synchronous kludge needs to be temporary. I too have used a similar hack, launching waves of tasks/messages and waiting with a timeout, possibly terminating the slow ones and then starting again. The ones that are slow multiple times can be put into a longer, bounded queue.


I would add to the list:

* If you have transactions, keep your transactions to a single machine/instance/db as much as possible. Multi-machine or software transactions are the LAST solution you should try.

* Pay attention to payload sizes. Make sure you don't come close to saturating the network, which leads to weird "app is slow" problems.

* Design for testability and diagnosability of production systems. If this is java, use JMX extensions EVERYWHERE for EVERYTHING.

* Time (and timing) is your enemy. Make it your friend.

EDIT: Side note. IMO, JMX extensions are one of the most under-appreciated things that java and jvm devs have but keep forgetting about but its so powerful.


  IMO, JMX extensions are one of the most
  under-appreciated things that java and
  jvm devs have but keep forgetting about
  but its so powerful.
This is an excellent point which I cannot agree with more. When doing distributed systems using JVMs, I almost always reach for the excellent Metrics[0] library. It provides substantial functionality "out of the box" (gauges, timers, histograms, etc.) as well as exposing each metric to JMX. It also integrates with external measuring servers, such as Graphite[1], though that's not germane to this post.

0 - http://metrics.dropwizard.io/3.1.0/

1 - http://graphiteapp.org/


I'd hesitate to call Metrics 'excellent'. It does quite a bit of processing on numbers before it emits them - averaging over timespans and so on - so if you're feeding them into something like Graphite which does further processing, you're averaging twice, and getting the wrong numbers.

If you're just going to print the numbers out and look at them, it's very handy. But if you want to handle them further, which you really do, I'd avoid it. Put individual measurements in your logs, then scrape them out with Logstash or something and send them to Graphite for aggregation.


Use of Metrics functionality is going to vary based on need. So if a system is using something such as Graphite, then the measurements sent should reflect that. This is independent of a particular library IMHO.

So when you say:

  ... if you're feeding them into something
  like Graphite which does further processing,
  you're averaging twice, and getting the
  wrong numbers.
This would be an issue with sending Histogram, Meter, and/or Timer values, but I'd categorize that as a design defect. Using Metrics to send Gauge and Counter data to a Graphite-esque system shouldn't be a problem at all. For systems which do not have a metrics aggregator or when some need to be exposed to external agents (such as client systems), using the types provided by Metrics can be very useful.

As for using log files to "scrape out" system metrics, I realize this is a common practice yet is one which I am fundamentally against. What I've observed when taking the "put it in the logs and we'll parse it out later" tactic is that a system ends up with sustained performance degradation due to high log output, as well as ops having to incorporate yet another system just to deal with profuse logging.

Treating system metrics as a first-class concern has shown itself to be very beneficial in my experience. Persisting them is a desirable thing, agreed, and a feature point fairly easily met when metrics are addressed separately from logging.


I agree. I'm a fan of the pull model for metrics, where the app logs very granular numbers and whoever wants metrics at whatever granularity pulls them from these logs into Splunk/TSDB etc. Let the app be simple in emitting metrics.


Beware of memory leaks in histograms!


Are you referencing the ThreadLocal[0] issue? If so, it has been committed to master (as mentioned in the comments).

0 - https://github.com/dropwizard/metrics/issues/742


I feel like tracing is the biggest thing missing from the article, especially since the author takes a hard stance on logging. Tracing is one of those things that will eventually save the day for you (ops released a bad config on some load balancer you don't know about -- good luck) and if you don't approach your design with tracing in mind, it will be very hard to add it later.


The author doesn't explicitly say "tracing" but he specifically calls out Dapper and Zipkin in the section, "“It’s slow” is the hardest problem you’ll ever debug."

Dapper - http://research.google.com/pubs/pub36356.html

Zipkin - http://engineering.twitter.com/2012/06/distributed-systems-t...


For tracing, one may want to have a look at Zipkin [0] from Twitter.

Also see, Tracing Request IDs [1].

[0] https://github.com/openzipkin/zipkin

[1] https://brandur.org/request-ids


Those are really good points. Particularly the one about how there are so few open source systems built to scale.

I think this is starting to change (slowly). I run such a project called SocketCluster and we recently released some Docker images and Kubernetes config files to make running SC on a distributed Kubernetes cluster super easy.

See https://github.com/socketcluster/socketcluster#introducing-s...

I think there will come a point when most open source stacks will be designed to run on one or more orchestration systems like Kubernetes, Swarm and/or Mesos.

The main problem at the moment is lack of skills in that area - There is currently a big divide between DevOps engineers and software engineers.

As a long-time open source developer, I felt somewhat out of my comfort zone learning Kubernetes but it was well worth it.


Based on your DevOps experience, what's your opinion of Nix[1] and/or its OS? I keep hearing it solves a lot of pain for deploying systems, but I don't have any DevOps experience to evaluate it.

[1] https://nixos.org/nix/


I haven't used Nix. Kubernetes runs pretty well on all the Linux distros that I've come across - I guess this is because K8s itself runs most of its components inside Docker containers so K8s isn't generally prone to configuration issues.

That said, you still have to deal with certain host-level configs like file limits and such - So I guess for more advanced use cases something like NixOS could come in handy - I just haven't needed it yet - I'm running most of my stuff on Amazon EC2 - So I just make sure that my base AMI has the correct host configs for the kinds of use cases I deal with.


I would recommend people interested in learning about distributed systems to take this course on EdX starting next month: https://www.edx.org/course/reliable-distributed-algorithms-p...


KTH (Stockholm) and they're not using Erlang. Blasphemy!


And it's Joe Armstrong's supervisor giving the course! It's using kompics, which is also actor-based but runs on Scala and Java. The course will use Scala.


> One last note: Out of C, A, and P, you can’t choose CA.

I have my own, which is not obvious I think:

"CAP doesn't mean your system is automatically CP or AP, it also means, it could be neither"


Yeah, these things are often like that. "Fast, Cheap, Good: pick two" - some people manage to only get one.


Some good previous discussion here: https://news.ycombinator.com/item?id=5055371


Regarding using timeouts:

I recommend using deadlines established at the initiation of a workflow instead of timeouts. This allows all collaborators to be constrained by one budget instead of having the additive effect of multiple discrete timeouts.

For example, specifying:

  Customer login must complete within 10 seconds of
  initiation.
Is preferable to:

  Read customer record within 5 seconds.
  Verify passphrase within 1 second.
  Update account access within 5 seconds.
  Generate response within 2 seconds.
Note that the cumulative seconds in the second example exceed the 10 second threshold of the first. This is intentional, as it exemplifies the very common situation of "this should never take longer than X" timeout specifications resulting in potentially violating the system response constraints. It is also indicative of developers having to pick what are considered reasonable maximums within a workflow due to not having context. By passing a deadline downstream, those components need only be concerned with the question "do I have any time left to do what I need to do?"
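
A rough sketch of the mechanical difference, in Erlang terms to match the rest of the thread (every name here is invented, including the customer_db and auth servers): the deadline is computed once at initiation, and each downstream call just asks how much budget remains.

    -module(deadline_sketch).
    -export([login/1]).

    %% Deadline established once, at workflow initiation.
    login(Credentials) ->
        Deadline = erlang:monotonic_time(millisecond) + 10000,  % whole login: 10s
        Customer = gen_server:call(customer_db, {read, Credentials}, time_left(Deadline)),
        ok = gen_server:call(auth, {verify, Customer, Credentials}, time_left(Deadline)),
        ok = gen_server:call(customer_db, {touch_access, Customer}, time_left(Deadline)),
        {ok, Customer}.

    %% Each step gets whatever budget remains, not its own discrete timeout.
    time_left(Deadline) ->
        max(0, Deadline - erlang:monotonic_time(millisecond)).

If the budget hits zero, time_left/1 returns 0 and the next call fails immediately instead of silently stretching the workflow past its constraint.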

For an example of both deadline and timeout types, see:

http://www.scala-lang.org/api/current/#scala.concurrent.dura...

http://www.scala-lang.org/api/current/#scala.concurrent.dura...


I have general and abstract problems for large systems that I haven't really figured out easy solutions for other than using metrics and logs:

* Downstream dependency failures and isolation (circuit breaker pattern sort of helps but what do you set the thresholds to?)

* When to split resources across different machines (i.e. resource contention); if you do it, you then have more coordination issues.

* Alerting intelligently without causing fatigue.

* Reducing latency at scale (this I haven't done well, but some hit multiple duplicate services at the same time and pick the fastest one, a la Google).

* The index problem... if we add the index we will improve read speed but at the cost of write speed. Some systems have a careful balance or is dynamic depending on environment (time of day).

* Queues are easy to understand and implement but can fail in epic catastrophe. The alternatives are often hard to understand and implement (ie the backpressure problem).


I am by no means an expert, just a guy who has written these sort of systems a bit. These are my answers, take them with a massive grain of "pragmatism" salt.

(EDIT: realized how long this was after writing it. Christ; apologies in advance for this perhaps literal response to a partially rhetorical question? Hope this wall of text doesn't fall on anyone.)

- How to set thresholds: "Find out!" Break your systems. Introduce failures. Track the circuit breaker threshold metric and find what its distribution is, and what std. dev. of outliers correlates with (or which you consider to be) an error. (There's a toy breaker sketch after this list showing which knobs those experiments are actually tuning.)

- How big is the system going to be? Is perf going to be a big consideration? If you find these to be "yes" then you probably _should_ prematurely optimize a LITTLE, at least in the sense of "my system has these natural coupling boundaries, if I split them there I'll get a very intuitive way to scale each component individually" (obviously things like nosql on one platform, sql on another but then you can dig deeper with tables built in such a way that you could split them out if they became problematic, have an understanding of what you can use as partition keys, etc) Coordination issues are certainly one of the doozies of distributed systems, and I can't give a fix-all, you'll need to think about coordination deeply, however, for most resource contention I've seen its more an issue of just having a good story ahead of time for how a given component will scale, and the actual coordination usually happens at a meta-level to that.

- See #1. It's a learning experience for every system. Obviously there are KPIs that are easy to alert on ("half my nodes just went down") but those aren't the interesting ones. Learn your system; learn the telemetry behavior; make alerts that let you make an ACTIONABLE response to the occurrence. (Emphasis on actionable, since non-actionable alerting is, in my experience, the leading cause of ignoring alerts, whereas flappy/loud alerting that IS actionable indicates that the system may need to be made more robust/fixed.)

- There are a lot of solutions for this; the one you cited is a thing, but mostly for issues like geolocality/really unfortunate scheduling/contention. At the risk of falling back onto a generic answer, if latency starts being a real issue, caching and locality are the two things I'd look into first.

- Going to keep coming back to "know your workload". You can certainly optimize it to be read- or write-friendly. The over-time dynamism is an interesting point, since candidly I haven't run into that sort of situation in my day to day. I know that within certain systems (MSSQL) you can hint to adjust the query plan in certain ways, or have it not use cached plans if you realize your profile has shifted, and this gives you SOME power, but it's starting to get into things I can only hand-wave about at best. To end this less-than-satisfactory answer, I tend to err on "careful balance with semi-regular reassessment based on movement in core KPIs".

- EVERYTHING can fail in an epic catastrophe. Depending on the semantics of how you build your queues, however (and tables and constraints and so on), you can handle that a little better. Think pathologically from day one: "if this DB gets corrupted, if this node entirely starts returning garbage, how do we recover?", and find solutions that build recovery into normal function. This is another super hand-wavy answer, but more because it goes into the entire space of distributed-system design, and I would likely make a fool of myself if I tried to move past merely "what has worked for me." However, I will warn: "the backpressure problem" isn't going to go away just because you make a more robust pipeline. In the same way that you've found benefits in circuit breakers as a design pattern, backpressure, even if it's passively observed backpressure, is a broadly useful tool for the general case of "I need my components to be contextually aware enough not to shoot each other in the head." (That being said, if you can design WITHOUT needing more moving parts/backpressure/etc., _do it_.)

My one big takeaway, if I could offer one, is that the more of these general and abstract problems you keep out of your implementation, the more pain you save in having to answer and maintain them; at the risk of stating the obvious. (I'm almost sure I said some things here where dist. systems engineers more senior than me will point out terrible design choices, but I hope they do, since then I'll get to learn from your questions as well :) )
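
As promised above, a minimal count-based circuit-breaker sketch in Scala. The only point is that maxFailures and resetAfter are numbers you derive from breaking your system and measuring, not guesses; all names here are hypothetical, and the synchronization is deliberately simplistic:

  import scala.concurrent.duration._

  // maxFailures and resetAfter are the thresholds you tune from the
  // measured failure distribution, not numbers you guess up front.
  class CircuitBreaker(maxFailures: Int, resetAfter: FiniteDuration) {
    private var failures = 0
    private var openUntil: Option[Deadline] = None

    def call[A](op: => A): A = synchronized {
      if (openUntil.exists(_.hasTimeLeft()))
        throw new RuntimeException("circuit open: failing fast")
      try {
        val result = op
        failures = 0; openUntil = None   // success closes the breaker
        result
      } catch {
        case e: Throwable =>
          failures += 1
          if (failures >= maxFailures) openUntil = Some(resetAfter.fromNow)
          throw e
      }
    }
  }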


I appreciate your answer. In large part what you have said is what I have done in the past. The question/issue is: can this ever be automated or made easier without reliance on expensive PaaS services?

Right now it seems the easy answer in general is... just monitor and deal with shit, but that gets old fast. Particularly if you find out you were wrong in some architecture assumption, it can become very expensive to change. These things should be foreseen, but due to time, resources and economic dynamics they are not.

In large part this is why there has been so much success with PaaS... let someone else figure out how to scale. But again, these services lock you in and are generally expensive at scale.

You would think that such a closed system where you can tinker and adjust a few settings (e.g. thread pools, timers, etc.) would be an easy problem for machine learning, but I have yet to see many companies employ techniques like these for scaling.


It's funny you mention PaaS; in my last two jobs I've been, respectively, devving for a large on-prem datacenter and on an entirely PaaS/IaaS team. Both had their own pitfalls, and both can be supplemented with automation in some areas; it really is to some extent an apples-to-oranges comparison, and automation is a boon in both environments.

Re: the arch assumptions, man, you've touched on a can of worms there (things should be foreseen but aren't, due to time). I have so many things to say about this, but in brief: the longer you do this "sort of stuff", the faster you get at seeing common patterns/pitfalls and avoiding them, and to that end is why I recommend SO STRONGLY working with someone who is a seasoned expert at this, since a lot of that intuition is very domain-specific/voodoo-esque. Now, this will help you in the future, but it won't help you now. In the absence of proper expertise/time/resources to determine the architecture, I take two approaches: "thinking about it really hard", and doing so alongside my team. If I'm not an expert and my peer isn't an expert, the two of us together may still manage to catch each other's mistakes in logic better than either of us alone. Pair architecting can really facilitate working through a crunch, especially if you do so with the philosophy of "assuming any of these decisions are CRAP, do we have a path to remediating it without pain?", so that even if/when you make the wrong choices, you can see a direction to rectify them. I realize this is nebulous, but I hope it conveys at least the shape of an approach.

Re: the success of PaaS, absolutely; however, it incurs pitfalls commensurate with its benefits. Take Azure SQL for instance: a 1 TB DB limit. It's a FANTASTIC managed service, but for many truly big-data scenarios that's an unworkable cap. Additionally, if something really deep breaks, you may have a whole other support chain to go through rather than in-house expertise. (I do not mean to direct that latter statement explicitly at Azure; I've seen it across PaaS/IaaS providers.) WRT lock-in, I've taken to aggressively avoiding *aaS systems that I don't have a good migration story for upfront (paramount to my statement re: if this is a terrible choice, have a way out). Use these tools and a knowledge of the costs/benefits to choose the components that solve the problems you have; we have a tool chest and can often pick and choose the most appropriate ones.

Re: ML, AML does this to some extent: endpoint scalability and custom ML modules internally that you can parameterize to dynamically tune some functionality. It certainly is not a perfect solution, but it does some neat things. I've found some success using it, but I by no means intend to paint it as a silver bullet; however, I share your sentiment that in the long run these capabilities are very useful, and more plug-and-play/flexible hosted ML seems a natural next step given the direction we've been going in to "commoditize" some aspects of data analytics.

Disclaimer since I mention some MS products: I am an MS eng, but all of this is just my own rambling as a dev. Any advocacy/caveats I offer are only my own experiences, and I am not even a true expert on all of our in-house platforms so much as I am a consumer.


This is a superb list, but I have a nitpick:

> Distributed systems are different because they fail often

And:

> design for failure

While this is important, I think it's better to come up with fail-fast solutions, and some type of backup recovery.

Example: when designing a push notification system, what do you do when the push fails? I tend to then rely on pull, but then you have two competing operations... So maybe it's better to always initiate push operations with a pull, simplifying the code and handling both cases plus the failure mode.

Also, most networks do not fail often; if they did, we would have serious problems. See the update to the CAP theorem that discusses this: failure will happen, but for most programming not nearly as often as we write code to deal with it.

Computers don't crash that often, local networks are amazingly stable, and power outages are rare. So don't waste your time over-engineering for these things; spend your time making simple design decisions that handle these cases and then move on.

Example: split brain. Focus on detecting that you are in a split-brain situation, then fail fast and notify until the network recovers. This is far simpler to deal with than, say, merging vector-clock-based records after the fact. (Obviously some systems can't tolerate any downtime, but my experience has shown that for a local network this is generally a safe call.)

My basic point is: keep your code simple; trying to handle every error condition will create brittle, rarely tested code. You want as much code as possible to be consistent across all executions. Also, make your code testable without needing to spin up multiple processes; this means writing protocol utilities that accept generic readers and writers as opposed to IO channels.

Edit: one more thing. Make sure you focus on releasing as quickly and dependably as possible to fix bugs/errors. People often leave this until last, and then realize that they have a system that is impossible to manage.


Given sufficiently many operations, an operation that fails rarely will fail frequently. So if something fails 'one in a million' times, and you do ten million things, it'll fail ten times. Large distributed systems press on this.


I don't disagree with this, but I think making this the primary issue that you solve for is going to significantly slow down development.


You can also find a handful of useful blogs about DS in this Quora thread: https://www.quora.com/What-are-some-good-blogs-about-distrib...


Can someone explain to me what this means:

"Suppose you’re going from a single database to a service that hides the details of a new storage solution. Have the service wrap around the legacy storage, and ramp up writes to it slowly. With backfilling, comparison checks on read (another feature flag), and then slow ramp up of reads (yet another flag), you will have much more confidence and fewer disasters."

Does the author mean to say:

"Have the service wrap around the legacy storage, and ramp up writes to the new system slowly"? It reads like he's saying ramping up writes to the legacy system.

Also, I am confused by his use of the word "backfilling": what would be backfilled? IOs to the new system after IOs to the legacy system? Is that the idea?

This was a good read. Kudos to the author for putting it out there.


If you want to migrate from system A to system B you don't start writing all data to B immediately. You don't know if B can handle the load, if your code is right, if double writing will cause an unacceptable slowdown etc. Instead you mirror a % of all writes from A to B and increase it over time.

Now you have a problem because B only has a % of the data. To fix this you need a separate task that periodically identifies records in A missing from B and syncs them. This is the "backfill" - it isn't part of your running application and is only needed for the duration of the migration.

Once you think you have sufficient data in B you can start checking your work, run a % of your reads from both A AND B and make sure the results match.

Once that matches for a while you can start making a % of reads from B without ever contacting A. When this reaches 100% you stop writing to A.

Each % here is a separate feature flag.
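
A rough sketch of that shape in Scala (all names here are hypothetical, and the separate backfill job isn't shown):

  import scala.util.Random

  trait Store {
    def write(key: String, value: String): Unit
    def read(key: String): Option[String]
  }

  // writePct, comparePct and readPct are the feature flags you ramp
  // from 0 to 100 over the course of the migration.
  class MigratingStore(legacy: Store, next: Store,
                       writePct: Int, comparePct: Int, readPct: Int) extends Store {

    def write(key: String, value: String): Unit = {
      legacy.write(key, value)                // A stays the source of truth
      if (Random.nextInt(100) < writePct)     // mirror a % of writes to B
        next.write(key, value)
    }

    def read(key: String): Option[String] =
      if (Random.nextInt(100) < readPct)      // serve a % of reads from B alone
        next.read(key)
      else {
        val a = legacy.read(key)
        if (Random.nextInt(100) < comparePct && a != next.read(key))
          println(s"read mismatch for $key")  // really: increment a metric
        a
      }
  }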


Yep, makes perfect sense, I've handled this in quite a similar fashion. Thanks for clarifying.


Interesting that he considers logs to not be worth the investment. I can appreciate his argument and all, and logging seems pretty hard to get right, but still.

Thankfully, my current job is a monolith database and webserver on a box, so I'm still in the realm of pointers for now.


You misunderstood what he's saying here. He's not saying logs aren't worth the investment; he's saying that logging is rarely done usefully/correctly, so don't have an over-reliance on log output. Instead, you should track and rely on system-wide metrics.

Of course you should put the time in to do logging, but at every point where you control the contents of the logs, you should be ensuring that what you log is information about the state of the system, not picking a class of errors you think is possible at the time and logging only that. If you've worked with any "enterprise"-type Java software that ends up in a distributed system, you'll know exactly what he means: your logs end up filled with huge, mostly irrelevant XML dumps about one particular error that doesn't really affect system state overall, while the things that plague you in a global way are almost impossible to hunt down without building a test cluster and running captured requests through it.

Logging request/response, logging process steps by request ID, etc. are much more useful. Having the IDs in your logs correlate to the IDs in your data is also helpful because it allows your Ops people to do queries across both datasets to find common factors for where requests fail and what that type of data looks like vs requests that succeed, as an example.

As he mentions though, most people don't think it's important to log successes, but in fact what you care most about is correlating request types/request parameters to metrics. Whether a request failed or succeeded is just a metric itself, but the more data you have the better understanding of the system you gain. The real challenge is that distributed systems are designed to be opaque in their complexity to the users external to the system and this often leads to them being opaque in their complexity to the developers internal to the system as well.
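
As a toy illustration (Scala; the field names are made up), one structured record per request, success or failure, keyed by the same IDs that live in your data, is what makes those cross-dataset queries possible:

  // One record per request, carrying the same customerId that is stored
  // with the data itself, so logs and data can be joined later.
  case class RequestLog(requestId: String, customerId: String,
                        step: String, durationMs: Long, succeeded: Boolean)

  // In real life this goes to your structured/JSON logger;
  // println stands in for that here.
  def log(entry: RequestLog): Unit = println(entry)

  log(RequestLog("req-42", "cust-7", "verify-passphrase", 87, succeeded = true))
  log(RequestLog("req-43", "cust-9", "verify-passphrase", 3021, succeeded = false))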


Thanks for the clarification on all that. You have some good "simple" (but not obvious) ideas there too, like the correlated IDs in logs.


You can find a lot of material about DS in this GitHub project [0]. Although the list is written in Chinese, the linked materials are mostly in English.

[0] https://github.com/ty4z2008/Qix/blob/master/ds.md


Great line: "Learn to estimate your capacity. You’ll learn how many seconds are in a day because of this."
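
For instance, a quick back-of-the-envelope (the traffic numbers here are made up):

  val requestsPerDay = 100000000L                        // say, 100M requests/day
  val secondsPerDay  = 24L * 60 * 60                     // 86,400
  val averageRps     = requestsPerDay / secondsPerDay    // ~1,157 req/s
  val peakRps        = averageRps * 4                    // assume peak is ~4x average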


It's funny: I think anyone who has ever had to manage their own DNS infrastructure knows the number 86400. And DNS is one of the oldest distributed systems out there.


On backpressure: "Common versions include dropping new messages on the floor (and incrementing a metric) if the system’s resources are already over-scheduled."

What I would add here is that "incrementing a metric" must use fewer resources than would have been spent serving the request. I once worked on a system that would latch up into an overload state because, in overload, it would log errors at a severity high enough to force the log file to be flushed on every line. This was counterproductive, to say the least. From that I learned that a system's flow-control/pushback mechanism must be dirt cheap.
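
A minimal sketch of that in Scala (a plain bounded queue plus a counter; the structure is assumed, not from that system): the drop path is just a bounds check and a counter bump, far cheaper than serving the request or synchronously flushing a log line.

  import java.util.concurrent.ArrayBlockingQueue
  import java.util.concurrent.atomic.LongAdder

  val work    = new ArrayBlockingQueue[String](10000)  // bounded work queue
  val dropped = new LongAdder()                        // cheap, low-contention counter

  def submit(msg: String): Boolean = {
    val accepted = work.offer(msg)     // non-blocking; false when the queue is full
    if (!accepted) dropped.increment() // shed load: count it, don't log it on the hot path
    accepted
  }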


On dealing with backpressure, I would recommend what Zach Tellman writes about in his video "Everything will flow":

https://www.youtube.com/watch?v=1bNOO3xxMc0

Tellman also recounts his own adventure with a system something like this:

"a system that would latch up into an overload state because in overload it would log errors at a severity high enough to force the log file to be flushed on every line"

He talks about some problems he faced when writing to Amazon S3, when S3 stopped accepting requests. It's a fascinating talk.


Just lean on the shoulders of giants who have thought all these points out before, and use Erlang (or Elixir). That includes backpressure.

http://elixir-lang.org/blog/2016/07/14/announcing-genstage/?...


Erlang doesn't have great inter-node backpressure. Messages sent from one node to another in a cluster will suspend the sender if the output queue is too big, which is a global value independent of the queue length of the receiver. Also, the inter-node protocol is based on TCP, which has the usual TCP problems (head of line blocking, etc.).

Erlang's distribution story is great when you're on a single machine; pretty good if you're on a very high reliability, high bandwidth LAN; and not great if you're going across any network not under your direct control (e.g. AWS, the public internet).


OK, but a) you get backpressure as per my original post with Elixir (or Apache Storm, or Flink, amongst many others), and b) if you need latency-aware, cross-timezone distribution, you should probably be letting another piece of distributed infrastructure, explicitly designed for this, handle the issue; for example, local/cluster/rack/data-center-aware Cassandra. There is never any reason to re-invent something which big multi-national corpos have worried about for you for a long time already. It doesn't strike me that anything in the OP's post hasn't already been solved for 99% of us, using the right, freely available and well-documented tools. Distributed computing is hardly a new technology.


+1. Just use Elixir.


It's not quite that easy. While Elixir/Erlang provide a lot of great tools for distribution, it still takes a lot of knowledge and effort. It's not quite a case of "Just use Elixir".


> garbage collection pauses make masters “disappear”

This happens only with piles of Java crap (gigabytes of jars). With [mostly] functional Erlang, Common Lisp, Haskell everything is fine.


All of the languages you mentioned have garbage collectors.


Yup, and all of them have demonstrable edge cases where the GC will throw a rod. The random slagging of the most advanced virtual machine technology out there that doesn't cost four-plus digits is pretty funny, though.


The issue is not really the VM, it's the language. GC performance is largely tied to the number of objects it has to deal with.

Java code tends to be object-oriented and create a lot of objects. Functional code can create significantly fewer objects, so even though the Java world has the best GCs (like Azul C4) those other languages typically give their GCs less work to do. That also means that if you write code that does give a lot of work to the GC in a functional language it can certainly perform very poorly as well.

Erlang is probably different, afaik creating lots of (lightweight) processes is frequent in Erlang. I don't know how the Erlang GC works so I can't really comment on that.


> I don't know how the Erlang GC works so I can't really comment on that

If you're interested, here are a few helpful blogposts on the subject:

- This is a pretty short and readable post, though it's a bit old (2008): "Garbage Collection in Erlang" [0]

- A more detailed (and recent -- 2015) blogpost: "Erlang Garbage Collection Details and Why It Matters" [1].

- And this is the most up-to-date description, from April 2016 concerning Erlang R19, by Lukas Larsson, who works on the Erlang VM -- it is long, but well-written and has nice illustrations [2].

[0] http://prog21.dadgum.com/16.html

[1] https://hamidreza-s.github.io/erlang%20garbage%20collection%...

[2] https://www.erlang-solutions.com/blog/erlang-19-0-garbage-co...


Or, if you don't want to chase those blogposts down, here's the nutshell version, from reference [1]:

In order to explain current default Erlang’s GC mechanism concisely we can say; it is a Generational Copying garbage collection that runs inside each Erlang process private heap independently, and also a Reference Counting garbage collection occurs for global shared heap.

(To expand on that a little: each process has its own copy of everything that it needs -- if it needs something from another process, it'll get a copy in a message sent to its mailbox, so: no shared memory, links, etc.

This encapsulation means that when a process terminates, all its memory is instantly reclaimed -- no GC is required, since no other process can possibly be referencing any data in the process that's going away.

And, while a process is alive, there's per-process generational-copying GC, as mentioned.

However, for pragmatic reasons, large binaries (> 64 bytes) are stored in a global heap, and references to them are passed around instead (i.e. to avoid copying large things between processes) -- that's where the ref-counting GC comes in.)


Yes, this strong process isolation is one of the design decisions which led to the ultimate success of Erlang/OTP as a telecom platform. Joe Armstrong's thesis has an explicit explanation of why the JVM is not suitable for telecom-grade reliability (no matter what salespeople would tell you).

Another principal design decision, which complements the share-nothing architecture, is a pure functional language with single assignment. This greatly simplifies the runtime and the VM itself, which leads to more controlled and predictable behavior, suitable for soft-realtime systems.

The concept of drivers is another one, but it is irrelevant to this discussion.


Thank you for the explanation. I will read the links as well. Even if I don't use it, the design of Erlang is an interesting topic to study.


> Functional code can create significantly fewer objects

Functional programs, if anything, tend to create more objects than object-oriented ones. However:

(0) Objects tend to be shorter lived, because new versions of data structures will be written to new objects, rather than overwritten to old objects.

(1) Because functional programming deemphasizes object identities, the GC can duplicate or deduplicate objects with equal contents. It can even merge several nodes of a linked data structure into a single fat node, improving locality of reference. If object identities matter, these “optimizations” actually break your program.

(2) The invariant that old immutable objects can't point to newer ones (at least in a strict functional language, not a lazy one like Haskell) can be used to optimize the GC algorithm as well.


How does functional programming lead to generally fewer objects allocated? I would guess that it leads to more, because functional data structures require allocation to make changes where imperative programming can make destructive updates without allocating.

Many functional languages allow imperative parts of course, but even in that case how is it less than the object oriented approach rather than just the same?


It is not only about the absolute number of objects created, of course. It is also about how long they are kept and how often they are changing.

Immutability simplifies everything, including incremental and parallel GC, and hence reduces GC pauses.

This is the big idea behind Clojure and Datomic: immutability makes everything simple (less complex runtimes) and even more scalable (share-nothing means no locks, mutexes or busy waiting) and, in the case of Erlang, fault tolerant (strong process isolation: no threads with a shared stack or any shared writable data).

What is funny is that everything was well researched and the right principles formulated by the Erlang team decades ago, and the resulting product just works for telecoms.

The two fundamental principles they described are: 1) POSIX threads and fault tolerance do not mix. 2) Imperative code and GC do not mix. While Golang, Lua and PyPy are trying really hard to disprove the second principle, Java is an existence proof of it.


That's kind of a digression, but the Lua GC is pretty simple (which is good) and not so efficient. The LuaJIT GC has some interesting features but sadly Mike Pall did not complete all of his plans for it.

What makes LuaJIT so efficient regarding the GC is that you can allocate memory that is not managed by it (using the FFI). You have to if you need to use a lot of memory anyway, since depending on the architecture GCd memory can be limited to as little as 1 GB.


Gut feeling, not sure how much this holds.

The majority of functional languages support value types (in the stack/global sense), which many OO languages that follow Smalltalk don't. Also escape analysis tends to be relatively weak in most JVM implementations (maybe Graal is the exception here).

Immutable data makes it easier to collect garbage when moving live data between nurseries, thus leading to shorter pauses than in imperative languages.

EDIT: Updated the sentence about escape analysis.


I think as you say immutability is probably the real benefit of functional for GCs as it simplifies a lot. But that doesn't reduce the number of objects allocated, which was the claim made.


The directed graph of immutable objects and links between them is by necessity a DAG. Only mutation can introduce cycles. If 90-95% of your objects are immutable, your GC algorithm can be specifically optimized for them. For instance, the mark and compact phases could run concurrently (interleaved) and in-place.

(Yes, this means that, from a GC implementor's POV, a lazy language like Haskell mutates like crazy.)


It seems to me that objects are being copied no matter what. In one world, mutation is common and the copying occurs within the GC subsystem. In the other world, mutation is rare and the programmer has to manage the copies manually. Pretending that the copies are wholly new objects which owe nothing to their predecessors strikes me as disingenuous. For all practical purposes, creating a new object that differs from its predecessor in one field is equivalent to mutating that field (except that it's far worse for performance).

Is moving a burden from the runtime to the programmer a good idea? Generally, no. There might be something about this particular case that makes it an exception to the general rule, but the point seems very far from proven.
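
To make the terminology concrete, a small Scala sketch: the "new object that differs from its predecessor in one field" is exactly this kind of copy, with both versions remaining visible and nothing overwritten in place.

  case class Account(id: Long, owner: String, balance: BigDecimal)

  val before = Account(1L, "alice", BigDecimal(100))

  // "Mutating" the balance allocates a new Account; the unchanged
  // fields are shared by reference with the old version.
  val after = before.copy(balance = before.balance - 25)

  println(before.balance)  // 100 -- the old version is untouched
  println(after.balance)   // 75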


As I've been teaching myself Rust, this point has been hammered home hard. It's really eye-opening to see all the copies plainly represented in code, and the lifetime checker is really good about highlighting where you should be making a copy vs. sharing a reference.

I haven't reached the stage where I can write performant Rust yet, since I've only recently started to truly grok the borrow checker, but I can see how, longer term, it will be easier to figure out how to make my code performant since all the information required is front and center. It will be interesting to see whether that promise holds for Rust code as the community matures.


> It's really eye opening to see all the copies plainly represented in code and the lifetime checker is really good about highlighting where you should be making a copy vs sharing a reference.

From a Rust POV (a useful one, even if you're using a higher-level language), when you use a garbage collector, technically everything is owned by the garbage collector, and merely borrowed by you. That's why in most languages you can get away with making shallow copies everywhere, even when Rust would tell you that a `.clone()` is in order.


> For all practical purposes, creating a new object that differs from its predecessor in one field is equivalent to mutating that field (except that it's far worse for performance).

(0) If the object has, say, just two or three fields (which isn't uncommon in functional languages, since they really tend to decompose everything in a piecemeal fashion), this is no big deal.

(1) If a functional language has substructural types (which admittedly most don't), the compiler can even tell, purely from type information, when an object won't be used anymore and can be overwritten anyway, as though you had used imperative assignment all along.


So, best case, with proper type annotation and a really good optimizer, you get the performance you would have had with plain old mutability? Pardon me if I remain unconvinced that immutability is solving the problem adequately.


> you get the performance you would have had with plain old mutability

And muuuch better type safety. You can only mutate data in-place if either no one else has access to it, or it's wrapped in a mutex and you've taken a lock.


To be honest, when I said that I was thinking about how less boxing occurs in (statically typed) functional languages. But like you and others said in the thread, immutability can actually result in the creation of more objects; they just have different life cycles and properties that make the work of the GC easier.


> Functional code can create significantly fewer objects

Immutability means the number of individual objects explodes.

It also means those are easier to garbage collect, so you can have several different algorithms acting on different time-frames, alleviating the problem for the slower ones.


However, languages that have both GC and value types do happen to behave better than Java, at least until Java 10 eventually makes value types a reality on the JVM.

Taking something like Eiffel or Modula-3 as examples (many others are possible), architectures just like in C++ are perfectly approachable.

Not doing so is just a consequence of developers who are trigger-happy with new and don't think about designing for performance.


I think it's worth noting that it isn't only Java-based distributed systems that have masters disappear. This also happens during high IO wait, when heartbeats aren't received in time. Example: MongoDB, which doesn't use Java at all.


Care to point me to a robust, in-production distributed system written in one of those other languages?



Both are distributed databases based on Erlang (most of which I guess run on Erjang anyway), not distributed systems such as BOINC.

The OP is talking about the system part,

i.e. not merely using a distributed system such as Erlang or Couch/MongoDB and pretending you can do distributed systems.


It's funny that you would pick BOINC (a rotten mashup of C and PHP with no formal spec) as your exemplar of a distributed system.


There's Riak Core, which is a framework for distributed systems in Erlang.


For Common Lisp, ITA Software[0] is a good answer, I would think.

[0] https://en.wikipedia.org/wiki/ITA_Software


I don't know how distributed ITA Software's QPX is, but there is some information about how they worked around the GC in some cases here (from 2001): http://paulgraham.com/carl.html


Does that mean that ITA's services don't work in production at Google, or that Google does not use any projects written in Common Lisp?


I didn't find any architectural details on QPX after they were acquired by Google.

For Haskell, I remember reading about Cloud Haskell ( https://haskell-distributed.github.io/ ), which allows you to write Erlang-style systems, and Sparkle ( https://github.com/tweag/sparkle ) by Tweag ( http://www.tweag.io/ ), which allows you to use Haskell with Apache Spark. But those are tools and libraries, not complete production systems.

I'd love to read about Haskell or Common Lisp distributed system implementations if you have any pointers!



Well, for Erlang it's pretty easy. Also including WhatsApp servers, Riak, a lot of router code...

The question is still interesting for the other two languages though.


A system, not a database; something like BOINC, for example.

i.e. something doing decent math rather than just reading and writing chunks of data.

Since Erlang is up to 100 times slower than Java for calculation tasks,

that's 100 times more expensive in terms of hardware,

so I doubt it's used much for that kind of distributed system.

But I can see the confusion.

The others I've simply never even considered would be chosen.


There is a much better read for Young Bloods:

https://pragprog.com/book/jaerlang2/programming-erlang



