A year is far longer than they would need just to achieve statistical significance. They've got over 160 million unique visitors per month [1]. Even showing the variants to only 1% of traffic, you're working with over 50,000 visitors every day, enough to run large multivariate tests.
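As a rough check on that arithmetic, here is a minimal sketch. It assumes the monthly uniques are spread evenly over a 30-day month, and it uses the textbook normal-approximation sample size for comparing two conversion rates. The 5% baseline and 5.25% variant rates are hypothetical illustration values, not Amazon figures.

```python
from statistics import NormalDist

MONTHLY_UNIQUES = 160_000_000  # figure quoted in the comment
TEST_FRACTION = 0.01           # 1% of traffic sees the variants
DAYS_PER_MONTH = 30            # assumption: traffic spread evenly


def daily_test_visitors(monthly=MONTHLY_UNIQUES,
                        fraction=TEST_FRACTION,
                        days=DAYS_PER_MONTH):
    """Approximate number of visitors per day who land in the test."""
    return monthly * fraction / days


def sample_size_per_arm(p_base, p_variant, alpha=0.05, power=0.8):
    """Normal-approximation sample size per arm for a two-sided
    two-proportion test at the given significance level and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return (z_alpha + z_beta) ** 2 * variance / (p_base - p_variant) ** 2


if __name__ == "__main__":
    per_day = daily_test_visitors()
    # Hypothetical rates: 5% baseline vs. 5.25% variant.
    needed = sample_size_per_arm(0.05, 0.0525)
    print(f"visitors in test per day: {per_day:,.0f}")
    print(f"needed per arm:           {needed:,.0f}")
    print(f"days for a two-arm test:  {2 * needed / per_day:.1f}")
```

Under those assumptions a simple two-arm test fills up in days, not months, which is the point being made: raw significance alone would not explain a year-long rollout.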
Amazon probably also has a vested interest in how a change affects LTV, which is way harder to understand than whether there was a purchase regression in a short time window.
Maybe for some things, but not necessarily for everything Amazon cares about, such as performance and reliability. For both of those Amazon doesn't really care about averages; they care about the three-nines (99.9th percentile) experience. If there's a 0.5% chance that a page load will take longer than some small number of milliseconds, then that's not good enough for Amazon and they'll go back to the drawing board. Factor in bugs in the new layout being fixed during the A/B test (necessitating resetting the statistics) and it's easy to see how it could take a year to fully roll out a big change, given Amazon's cautiousness.
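To illustrate why averages hide this, here is a small sketch using made-up latency numbers (995 fast requests and 5 slow ones, i.e. a 0.5% slow tail) and a nearest-rank percentile. The data and the `percentile` helper are illustrative assumptions, not anything Amazon is known to use.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    a fraction q of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical page-load times in milliseconds: 0.5% of requests are slow.
latencies_ms = [100] * 995 + [2000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 0.5)
p999_ms = percentile(latencies_ms, 0.999)

if __name__ == "__main__":
    print(f"mean:  {mean_ms} ms")
    print(f"p50:   {p50_ms} ms")
    print(f"p99.9: {p999_ms} ms")
```

The mean and median barely move, while the 99.9th percentile lands squarely on the slow requests. Estimating that tail reliably also needs far more samples than estimating a mean, since only about one request in a thousand sits beyond it.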
Statistical significance is not the same as significance in business terms. I can't imagine anyone would use such an approach when deciding how much data to collect.
>Statistical significance is not the same as significance in business terms
Depending on the level of significance, why not? If it's expensive or simply not possible to gather the full data set, sampling is absolutely a valid basis for business decisions. Why wouldn't it be?
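One way to reconcile the two positions is to look at the confidence interval from the sample and compare it with the smallest lift the business would care about. The sketch below does that with a normal-approximation interval; the conversion counts and the `min_worthwhile_lift` threshold are hypothetical, and the `decide` rule is just one possible policy.

```python
from math import sqrt
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation confidence interval for the difference in
    conversion rate (variant B minus control A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    std_err = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * std_err, diff + z * std_err


def decide(interval, min_worthwhile_lift):
    """Hypothetical decision rule comparing the interval with the
    smallest absolute lift that would matter to the business."""
    low, high = interval
    if low >= min_worthwhile_lift:
        return "ship"
    if high < min_worthwhile_lift:
        return "not worth it"
    return "collect more data"


# Made-up sample: 1,000,000 visitors per arm, 5.00% vs. 5.07% conversion.
interval = lift_confidence_interval(50_000, 1_000_000, 50_700, 1_000_000)
verdict = decide(interval, min_worthwhile_lift=0.002)

if __name__ == "__main__":
    print(f"95% CI for the lift: {interval[0]:.5f} to {interval[1]:.5f}")
    print(f"verdict: {verdict}")
```

In this made-up case the interval excludes zero, so the lift is statistically significant, yet it sits entirely below the threshold, so it is not significant in business terms. Sampling is still what drives the decision; the threshold is what changes how much data is enough.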