
First, I think our general uptime metrics are trending upwards. Recovery times tend to be much shorter as well.

Big services are bigger, so there are more mission-critical parts that can fail.

Continuous development culture is designed with failure as part of the process. We don't spend time looking for obscure issues when they'll be easier to find by looking at metrics. This is fine when a staggered deployment can catch an issue with a small number of users. It's bad when that staggered deployment creates a side-effect that isn't fixed by rolling it back. Corrupted metadata and the like are much harder to fix.
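To make the staggered-deployment point concrete, here's a rough Python sketch of the control loop I have in mind. Everything in it is made up for illustration: `staged_rollout`, `error_rate_at`, the stage fractions, and the 1% threshold are assumptions, not any real deploy tool's API.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float = 0.01,
) -> Tuple[str, float]:
    """Walk through rollout stages, stopping at the first unhealthy one.

    `stages` are fractions of users exposed (hypothetical values).
    `error_rate_at` is a stand-in for whatever metrics query you'd really use.
    Returns ("completed", last_fraction) or ("rolled_back", bad_fraction).
    """
    exposed = 0.0
    for fraction in stages:
        exposed = fraction
        if error_rate_at(fraction) > max_error_rate:
            # Rolling back reverts the code for these users. It does NOT undo
            # side-effects the new build already wrote, e.g. corrupted metadata.
            return ("rolled_back", exposed)
    return ("completed", exposed)


# Hypothetical metric: errors spike once 10% of users get the new build.
result = staged_rollout(
    [0.01, 0.10, 0.50, 1.0],
    lambda f: 0.002 if f < 0.10 else 0.05,
)
print(result)  # ('rolled_back', 0.1)
```

The loop only ever limits how many users see the bad build. Anything that build already wrote to shared state is outside what the rollback branch can fix, which is the failure mode described above.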

Automated systems can propagate/cascade/snowball mistakes far more quickly than manually applied changes ever could.

We notice errors more now. Mistakes are instantly news.




> We notice errors more now. Mistakes are instantly news.

Heck, just look at Twitter itself from its original "Fail Whale" days where there was so much downtime, to now where even this relatively small amount of downtime is the top story on HN for hours.


So, was a Fail Whale displayed during this most recent incident?


I think they retired the fail whale some time ago.

I looked it up: they retired it in 2013, because they didn't want to be associated w/ outages.



