> a change was made in October to register the User ID Service with the new quota system, but parts of the previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0. An existing grace period on enforcing quota restrictions delayed the impact, which eventually expired, triggering automated quota systems to decrease the quota allowed for the User ID service and triggering this incident.
A grace period on enforcing a major policy change is an excellent practice... but it also means months can go by between when a problem is introduced and when it actually surfaces. That can increase time-to-resolution, because many engineers won't have that months-old change at the front of their minds while debugging.
You need gradual rollouts. In particular, you need rollouts where the behavior of your system changes gradually as you apply your change to more of your instances/zones/whatever-rollout-unit. The right speed is whatever speed gives you enough time to detect a problem and stop the rollout while the damage is still small enough to be "acceptable", where "acceptable" is determined by the needs of your service (but if you say "no damage is ever acceptable" then I have some bad news for you).
Grace periods don't give you gradual rollouts like this; that's not their purpose. And I agree, grace periods can be a double-edged sword for the reason you mention.
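To make the gradual-rollout idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: `gradual_rollout`, `apply_change`, `error_rate`, `max_error_rate`, and `stage_sizes` are names I made up, not anything from the incident report or a real deployment tool. It assumes you can apply the change to one rollout unit at a time and read back some damage signal afterwards.

```python
from typing import Callable, Sequence


def gradual_rollout(
    units: Sequence[str],
    apply_change: Callable[[str], None],
    error_rate: Callable[[], float],
    max_error_rate: float,
    stage_sizes: Sequence[int] = (1, 2, 4),
) -> list[str]:
    """Apply a change in growing stages and halt once damage exceeds the budget.

    Returns the units that were changed before the rollout finished or halted.
    All parameters are hypothetical stand-ins for whatever your system provides.
    """
    changed: list[str] = []
    remaining = list(units)
    stage = 0
    while remaining:
        # Use the configured size for this stage; reuse the last size afterwards.
        size = stage_sizes[min(stage, len(stage_sizes) - 1)]
        batch, remaining = remaining[:size], remaining[size:]
        for unit in batch:
            apply_change(unit)
            changed.append(unit)
        # A real rollout would wait (soak) here long enough for problems to show.
        if error_rate() > max_error_rate:
            break  # stop while the damage is limited to the units changed so far
        stage += 1
    return changed
```

The point of the stages is that the blast radius at the moment of detection is bounded by how many units have been changed so far, which is exactly what a grace period does not give you: when it expires, enforcement applies everywhere at once.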