While it is a bit dated in terms of the types of applications it covers, it is still quite relevant for understanding "web 2.0"[1] distributed system reliability versus that of other systems.
One of the coolest things Greg Lindahl built at Blekko was a system that processed the error logs of every single machine in the cluster at the same time. It then reduced, by string matching, error messages that varied only in their variables into a single error (so 'error reading sector 10 on disk sda3' and 'error reading sector 20 on disk sde2' would reduce to 'error reading sector X on disk Y', x={...}, y={...}). Really fun stuff. From that you could easily pull out errors in the software (happens on many nodes), errors in hardware (happens on unrelated nodes), and errors in infrastructure (happens on nodes correlated by a switch, PDU, or rack); a rough sketch of the normalization idea follows the footnote.
[1] Distributed systems made of many, many identical nodes, both from a software and a hardware perspective, managed by an orchestration service, with software that assumes the underlying platform is unreliable.
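A minimal sketch of that normalize-and-cluster idea, assuming hypothetical masking patterns and names (this is not Blekko's actual code, just an illustration of the technique):

```python
import re
from collections import defaultdict

# Hypothetical normalization rules: mask the variable fields so that
# messages differing only in their variables collapse to one template.
PATTERNS = [
    (re.compile(r"\bsd[a-z]\d*\b"), "DISK"),  # device names, e.g. sda3, sde2
    (re.compile(r"\b\d+\b"), "N"),            # bare integers, e.g. sector numbers
]

def normalize(message: str) -> str:
    """Reduce a log line to its template by masking variable fields."""
    for pattern, placeholder in PATTERNS:
        message = pattern.sub(placeholder, message)
    return message

def cluster(log_lines):
    """Group (node, message) pairs by template; track which nodes emitted each."""
    clusters = defaultdict(set)
    for node, message in log_lines:
        clusters[normalize(message)].add(node)
    return clusters

if __name__ == "__main__":
    logs = [
        ("node01", "error reading sector 10 on disk sda3"),
        ("node07", "error reading sector 20 on disk sde2"),
        ("node42", "connection refused on port 8080"),
    ]
    for template, nodes in cluster(logs).items():
        # A real system would correlate `nodes` against cluster topology:
        # many unrelated nodes -> software bug; a single node -> hardware;
        # nodes sharing a switch/PDU/rack -> infrastructure.
        print(f"{template!r} seen on {sorted(nodes)}")
```

Here the two disk errors collapse to the single template 'error reading sector N on disk DISK', and the set of emitting nodes is what you would feed into the software/hardware/infrastructure classification described above.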
While this post is from 2012, much of it still applies, especially when building your own systems/(micro)services:
>The actual reliability of your system depends largely on how bug free it is, how good you are at monitoring it, and how well you have protected against the myriad issues and problems it has. This isn’t any different from traditional systems, except that the new software is far less mature. I don’t mean this disparagingly, I work in this area, it is just a fact. Maturity comes with time and usage and effort.
Working at Uber on larger systems, it has surprised me how much more effort we put into operating the system reliably than into the upfront planning/design (and we spent a lot of time on planning/design). I wrote in depth about those practices; here's the relevant HN discussion: https://news.ycombinator.com/item?id=20462349