This was really interesting both in exploring the architecture of a retail system and looking at how systems fail. Better to read about it and learn than to live it.
I'd call it a 4-hour outage because the initial "recovery" was a result of cashiers manually typing in prices for items. Then, once load decreased and they discovered that scanning worked again, registers went back to scanning and the problem came right back.
Maybe returning 404 for both a cache miss and a "there's no endpoint at this path" error is an issue too. Other status codes draw a distinction between temporary and permanent, e.g. 302 versus 301 for redirects. It would've been good to use HTTP 400 Bad Request for the misconfigured URL and reserve 404 for a cache miss.
In the 10% of stores with the early rollout of the config change, the cache hit rate went to zero right away, and that started 12 days before the outage. Alerts on cache hit rates, broken out per store, would've caught that.
Then there were 4 days where traffic to the main inventory micro-service in the data center jumped 3x, which took it to what appears to be 80% of capacity. Load testing to know your capacity limits, plus alerts when you near those limits, would've called out the danger.
Then during the outage, services that slowed down under too many requests were taken out of rotation for failing health checks. Applying back pressure/load shedding could have kept those servers in active use so that the system could keep up.
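Rough sketches of both mitigations in plain Java, with made-up names and thresholds (this is not Target's actual stack). First, a per-store cache hit rate check that would have flagged the 0% rate on day one of the partial rollout:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    /** Tracks cache hit rate per store and flags any store that falls below a floor. */
    public final class CacheHitRateMonitor {
        private static final double MIN_HIT_RATE = 0.50; // placeholder; tune to normal behavior

        private static final class Counts {
            final LongAdder hits = new LongAdder();
            final LongAdder lookups = new LongAdder();
        }

        private final Map<String, Counts> byStore = new ConcurrentHashMap<>();

        public void record(String storeId, boolean hit) {
            Counts c = byStore.computeIfAbsent(storeId, id -> new Counts());
            c.lookups.increment();
            if (hit) {
                c.hits.increment();
            }
        }

        /** Run periodically from whatever scheduler/alerting pipeline already exists. */
        public void checkAndAlert() {
            byStore.forEach((storeId, c) -> {
                long lookups = c.lookups.sum();
                if (lookups == 0) {
                    return;
                }
                double hitRate = c.hits.sum() / (double) lookups;
                if (hitRate < MIN_HIT_RATE) {
                    // In real life: page someone / emit an alert metric instead of printing
                    System.err.printf("ALERT: store %s cache hit rate %.1f%%%n", storeId, hitRate * 100);
                }
            });
        }
    }

And a minimal load-shedding wrapper: bound the number of in-flight requests (the bound should come from load testing) and reject the excess immediately, so servers stay responsive enough to keep passing their health checks:

    import java.util.concurrent.Semaphore;
    import java.util.function.Supplier;

    /** Bounds in-flight work and rejects the excess fast, instead of letting everything queue until it times out. */
    public final class LoadShedder {
        private final Semaphore permits;

        public LoadShedder(int maxInFlight) { // the limit should come from load testing
            this.permits = new Semaphore(maxInFlight);
        }

        public <T> T call(Supplier<T> work, Supplier<T> onRejected) {
            if (!permits.tryAcquire()) {
                // Shed immediately (e.g. return HTTP 429 and let clients back off and retry),
                // so the server stays responsive and keeps passing its health checks.
                return onRejected.get();
            }
            try {
                return work.get();
            } finally {
                permits.release();
            }
        }
    }

A request handler would wrap its work in something like shedder.call(() -> lookUpItem(upc), () -> tooManyRequests()), where tooManyRequests() maps to HTTP 429.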
204 No Content is an underused HTTP status. 404 should be monitored as an error; 204 as, well, no content available. If a status code has two responsibilities, that's a monitoring issue waiting to happen.
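A sketch of splitting those responsibilities, with a hypothetical price-cache lookup (not the actual service):

    import java.util.Map;

    /** One meaning per status code, so monitoring can tell "broken" from "nothing cached". */
    public final class PriceCacheEndpoint {
        private final Map<String, String> cache; // UPC -> price, the local cache

        public PriceCacheEndpoint(Map<String, String> cache) {
            this.cache = cache;
        }

        public int statusFor(String path, String upc) {
            if (!"/v1/price".equals(path)) {
                return 404; // genuinely wrong URL: monitor this as an error
            }
            String price = cache.get(upc);
            return (price != null)
                    ? 200   // cache hit
                    : 204;  // valid endpoint, nothing cached: expected, not an error
        }
    }

With that split, a misconfigured URL shows up as a spike of 404s in monitoring, while routine cache misses stay in their own expected bucket.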
It seems absolutely insane to me that a system was designed and developed that allows taking down all registers in the country at once. I would have thought it would be designed to be much more "batch" oriented, where the worst that could happen is you lose price updates and sales info until the batches can get through again.
If the local system doesn't have the item data (or thinks it doesn't have the item data, because it's looking in the wrong place) where exactly is it supposed to get the item data from if not the central system?
The total size of all item data for a store like Target can't be much more than what, a few gigabytes? Or at least the "UPC -> Price" dataset. So download the whole dataset each night, if you can't get delta changes to work.
And if that had failed somehow, it would have been noticed immediately upon the new code roll-out.
The internet was designed to be extremely resilient to host/route losses; we've made it so reliable that we now assume all machines are reachable at all times.
(To be fair, apparently they "do" have this but the dataset is printed on the items and the cashiers had to enter it by hand)
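For concreteness, a minimal sketch of the nightly full-snapshot idea (hypothetical file format and path):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    /** Loads the complete UPC -> price snapshot dropped off overnight. */
    public final class NightlyPriceSnapshot {

        public static Map<String, String> load(Path snapshotFile) throws IOException {
            Map<String, String> prices = new HashMap<>();
            for (String line : Files.readAllLines(snapshotFile)) {
                String[] parts = line.split(",", 2); // assumed "upc,price" per line
                if (parts.length == 2) {
                    prices.put(parts[0], parts[1]);
                }
            }
            return prices;
        }

        public static void main(String[] args) throws IOException {
            // Even tens of millions of items fit comfortably in memory as a flat map.
            Map<String, String> prices = load(Paths.get("/data/prices-latest.csv"));
            System.out.println("Loaded " + prices.size() + " items");
        }
    }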
Their system was basically designed the way you're saying, with a fallback to grab the data from the central location if it's missing locally. What you're asking for is the same system without a fallback, which doesn't make any sense.
The fallback was the problem: design the system without it, or with a manual window that pops up saying "ITEM NOT FOUND, QUERY TARGET ORACLE" or something, and the fault wouldn't have taken down the whole company.
If suddenly every cashier were forced to hit OK on every item, people would hear about it immediately during the test rollout instead of when it hit everything (assuming, of course, that you have good methods for detecting things like this and don't just completely ignore associates' complaints).
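To make the contrast concrete, a sketch of the two lookup shapes being argued about (all names made up):

    import java.util.Map;
    import java.util.OptionalDouble;

    public final class ItemLookup {

        /** Hypothetical central service interface. */
        public interface CentralPriceService {
            double price(String upc);
        }

        private final Map<String, Double> localPrices;

        public ItemLookup(Map<String, Double> localPrices) {
            this.localPrices = localPrices;
        }

        // The design that failed: silently fall back to the central service on a local miss,
        // so one bad cache config turns every register in the country into load on one service.
        public double priceWithRemoteFallback(String upc, CentralPriceService central) {
            Double local = localPrices.get(upc);
            return (local != null) ? local : central.price(upc);
        }

        // The design argued for here: a local miss stays local and loud. The register shows
        // "ITEM NOT FOUND" and the cashier keys the printed price, which a test rollout
        // would surface immediately.
        public OptionalDouble priceLocalOnly(String upc) {
            Double local = localPrices.get(upc);
            return (local != null) ? OptionalDouble.of(local) : OptionalDouble.empty();
        }
    }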
One reason for tests to hit a real external API is if you're using a "record and replay" test framework to capture the interactions so that you can run the tests against the recorded data quickly later. But because the API calls you make change (and the external implementation changes) you need to re-record from time to time.
This strikes a balance where 99% of the time you are making calls that never leave the process for fast testing, but can validate against the real implementation as needed.
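One way to get that balance without naming any particular framework (a sketch with made-up types; real record/replay tools handle this more thoroughly): hide the external call behind an interface, serve recorded responses by default, and flip a flag when it's time to re-record against the real API.

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    /** Hypothetical external dependency. */
    interface PartnerApi {
        String fetchQuote(String symbol);
    }

    /** Serves recorded responses in-process; re-records against the real client on demand. */
    final class RecordReplayApi implements PartnerApi {
        private final PartnerApi real;      // only needed when recording
        private final Path recordingDir;
        private final boolean record;       // e.g. driven by -Dapi.record=true

        RecordReplayApi(PartnerApi real, Path recordingDir, boolean record) {
            this.real = real;
            this.recordingDir = recordingDir;
            this.record = record;
        }

        @Override
        public String fetchQuote(String symbol) {
            Path file = recordingDir.resolve(symbol + ".txt");
            try {
                if (record) {
                    String fresh = real.fetchQuote(symbol); // the one code path that leaves the process
                    Files.createDirectories(recordingDir);
                    Files.write(file, fresh.getBytes(StandardCharsets.UTF_8));
                    return fresh;
                }
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8); // fast replay
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }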
There is no picture of the entry-level model; it says "coming soon". Since upgrading a Ford F-150 to 4 full-size doors and a navigation computer usually costs a lot, I'm wondering if that will be the case here too. Near the bottom of the page it looks like you have to upgrade two levels, to the "Lariat" configuration, to get that 15.5 inch touchscreen.
The claims I saw in all of the regurgitated Press Releases seemed to indicate they are going to do 4 doors as standard on the Lightning, at least for now.
I'm not sure how much of that is streamlining production versus design (i.e. fitting all the batteries in place may more or less necessitate that specific body style).
In the spirit of the article, I think that the approach would be to ask your partners to provide a sandbox/test environment that your sandbox/test environment can interact with, or test accounts in production at least.
He is not modeling the threshold below which wealth is not taxed. The wealth tax being proposed in California is 0.4% on amounts over $30MM, so the lifetime percentage taken by the government on a $30MM stock cash-out via the proposed wealth tax would be 0%.
So the proposal boils down to $30MM tax free, and then a lifetime take of 21.4% on the amount over that. But you're also likely to invest that money, making, say, 7%. So your return each year is 7.0% on the first $30MM and about 6.6% on the rest after paying this tax.
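For concreteness (and assuming the 21.4% figure is just the 0.4% annual tax compounded over roughly a 60-year horizon, which isn't spelled out here):

    \underbrace{1 - (1 - 0.004)^{60}}_{\text{lifetime take on wealth above \$30MM}} \approx 21.4\%,
    \qquad
    \underbrace{(1 + 0.07)(1 - 0.004) - 1}_{\text{net annual return above the threshold}} \approx 6.6\%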
Functional programming is not just for Haskell; modern Java has lots of pretty decent options, so I feel like their Java version of the code is a bit of a straw man. Here is how a more modern code style in Java 8 (which is years old now) might look. Notice that the logic for this problem is 7 lines of code, with one line being the closing brace.
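The snippet being referred to isn't reproduced here, so purely as an illustration of the style (streams and lambdas in place of explicit loops), a Java 8 pipeline over a made-up toy problem looks something like this:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class StreamStyleExample {

        /** Made-up item type, just to have something to stream over. */
        static final class Item {
            final String category;
            final double price;
            final boolean inStock;

            Item(String category, double price, boolean inStock) {
                this.category = category;
                this.price = price;
                this.inStock = inStock;
            }
        }

        public static void main(String[] args) {
            List<Item> items = Arrays.asList(
                    new Item("grocery", 2.99, true),
                    new Item("grocery", 4.50, false),
                    new Item("toys", 19.99, true));

            // The actual logic is one declarative pipeline rather than nested loops and temporaries.
            Map<String, Double> inStockTotalByCategory = items.stream()
                    .filter(i -> i.inStock)
                    .collect(Collectors.groupingBy(
                            i -> i.category,
                            Collectors.summingDouble(i -> i.price)));

            System.out.println(inStockTotalByCategory); // per-category totals of in-stock items
        }
    }

The point isn't this particular toy problem, just that the collection-pipeline style has been idiomatic Java for years now.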