Yes, everything has limits. Where Google says "quota system", for normal people that means "buy another computer"; you have hit your quota when you're out of memory / cpu cycles / disk. At Google, they have some extra computers sitting around, but it's still not infinite. Quota is a way of hitting some sort of limit before every atom in the Universe becomes a computer on which to run your program.
I don't think there is any way to avoid it. It sounds bad when it's software that's telling your service it can't write to disk, rather than the disk not having any more free sectors on which to write, but it's exactly the same thing. Everyone has a quota, and left unchecked, your software will run into it.
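To make that concrete (a minimal sketch, not anything from the postmortem): from the application's point of view, a filesystem quota and a genuinely full disk surface as the same kind of write failure, just with a different errno.

    import errno

    def append_record(path, data):
        # Either a quota (EDQUOT) or a full disk (ENOSPC) shows up here as
        # an OSError on write; the application only learns which limit it
        # hit from the errno value, the failure mode is otherwise identical.
        try:
            with open(path, "ab") as f:
                f.write(data)
        except OSError as e:
            if e.errno == errno.EDQUOT:
                raise RuntimeError("hit a quota imposed by software") from e
            if e.errno == errno.ENOSPC:
                raise RuntimeError("disk has no free sectors left") from e
            raise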
(In the case of this postmortem, there was a bug in the software, which makes it all feel self-inflicted. But if it wasn't self-inflicted, the same problem would have manifested in some other way.)
There is a comment in this thread where the author says they take fewer risks when the safety systems are turned off. That is fine and nice, but it is not really a good argument against safety systems. I have definitely had outages where something hit a quota I set, but I've had more confusing outages from something exhausting all physical resources, and an unrelated system failing because it happened to be nearby. I think you should wear a helmet AND ride safely.
> I think you should wear a helmet AND ride safely.
There's a difference here; helmets are personal safety equipment, which is the proper approach: monitor and manage yourself, don't rely on external barriers. But did you know that a statistically significant proportion of drivers change their behaviour around riders wearing helmets? [1] (That's not a reason to not wear helmets, everyone should ATGATT; it's a reason to change driver behaviour through other incentives.)
We cannot deny the existence of moral hazards. If you want to nullify a well-understood, thoroughly documented, and strongly correlated statistical behaviour, something has to replace it. Google would, apparently, prefer to cover a hard barrier with soft padding. That might help ... until the padding catches fire.
To your example, writing to disk until the OS reports "there are no more sectors to allocate" just means no-one was monitoring disk consumption, which would be embarrassing, since that is systems administration 101. Or that no-one was projecting the demand rate for more storage, which is covered in 201, plus an elective in haggling with vendors and teaching developers about log rotation, sharding, and tiered archive storage.
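And the 101-level monitoring really is not much code. A minimal sketch (the mount point, 80% threshold, and alert hook are placeholders, not anything from this thread):

    import shutil

    ALERT_THRESHOLD = 0.80  # placeholder; set per your projected demand rate

    def alert(message):
        # stand-in for whatever paging/alerting system you already run
        print("ALERT:", message)

    def check_disk(mount_point="/var/data"):
        # Warn a human well before the OS runs out of sectors to allocate.
        usage = shutil.disk_usage(mount_point)
        fraction_used = usage.used / usage.total
        if fraction_used >= ALERT_THRESHOLD:
            alert(f"{mount_point} is {fraction_used:.0%} full")
        return fraction_used

Run it from cron or your scheduler of choice and trend the numbers; the projection problem from 201 is just watching how fast that fraction climbs.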
Actionable monitoring and active management of infrastructure beats automatic limits, every time, and I've always seen it as a sign of organisational maturity. It's the corporate equivalent of taking personal responsibility for your own safety.