First, I think our general uptime metrics are trending upwards. Recovery times tend to be much shorter as well.
Big services are bigger, so more mission-critical parts can fail.
Continuous deployment culture is designed with failure as part of the process: we don't spend time hunting for obscure issues when they're easier to find by watching metrics. This is fine when a staggered deployment catches an issue while it affects only a small number of users. It's bad when that staggered deployment creates a side effect that isn't fixed by rolling back. Corrupted metadata, etc., is much harder to repair.
Automated systems can propagate/cascade/snowball mistakes far more quickly than changes applied by hand.
We notice errors more now. Mistakes are instantly news.
> We notice errors more now. Mistakes are instantly news.
Heck, just look at Twitter itself from its original "Fail Whale" days where there was so much downtime, to now where even this relatively small amount of downtime is the top story on HN for hours.