Basically true, but not the terms I'd use; these aren't really A/B tests but staged rollouts, though the process and tooling required are similar. Back when I worked on Google Search, we did staged rollouts of _everything_ that wasn't a trivial bug fix. We'd move a change to 1% for a day, check metrics, increase to 10%, hold a couple more days and check metrics again, then go to 100%. Very sensitive or risky launches might hold at a full 50% for some time. UI changes were "dark launched" behind a flag that we incrementally flipped on.

The reason is that no test suite captures reality, and this discipline forces you to account for easy rollbacks (just turn the flag off) and to handle "skew": the case where a user starts a session on a machine where the flag is off but then starts talking to a machine where it's on, or vice versa.

This was in addition to the binary itself, which was released multiple times a week and rolled out slowly over the course of the day, often after multiple versions had been tested in experiments with statistically significant samples.
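A minimal sketch of how a percentage-based rollout gate like this can work (the names and hashing scheme here are my own illustration, not Google's actual system): hash a stable user ID into a bucket, and enable the flag for buckets below the current rollout percentage, so ramping 1% → 10% → 100% only ever adds users and rolling back is just setting the percentage to 0.

```python
import hashlib

def bucket(user_id: str, flag_name: str) -> int:
    # Salt with the flag name so different flags ramp on independent cohorts.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def flag_enabled(user_id: str, flag_name: str, rollout_pct: int) -> bool:
    # Bucketing by user (not per-request) keeps each user's experience
    # consistent as the percentage ramps up, which limits skew to the
    # moment the percentage actually changes.
    return bucket(user_id, flag_name) < rollout_pct

print(flag_enabled("user-123", "new_ui", 0))    # always False at 0%
print(flag_enabled("user-123", "new_ui", 100))  # always True at 100%
```

Because the bucket is deterministic, a user enabled at 10% stays enabled at 50% and 100%; the rollout is monotonic rather than reshuffling who sees the feature at each step.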