This was really interesting both in exploring the architecture of a retail system and looking at how systems fail. Better to read about it and learn than to live it.
I'd call it a 4-hour outage because the initial "recovery" was a result of cashiers manually typing in prices for items. Then, once load decreased and they discovered that scanning worked again, registers went back to scanning and the problem came right back.
Maybe returning 404 for both a cache miss and a "there's no endpoint at this path" error is an issue too. Other status codes draw a distinction between temporary and permanent, e.g. 302 versus 301 for redirects. It would've been good to use HTTP 400 Bad Request for the misconfigured URL and reserve 404 for a cache miss.
In the 10% of stores with the early rollout of the config change, the cache hit rate went to zero right away, and that started 12 days before the outage. Alerts on cache hit rates, broken out per store, would've caught that.
Then there were 4 days where traffic to the main inventory micro-service in the data center jumped 3x, which took it to what appears to be 80% of capacity. Load testing to know your capacity limits, plus alerts when you near those limits, would've called out the danger.
Then during the outage, services that slowed down under too many requests were taken out of rotation for failing health checks. Applying back pressure/load shedding could have kept those servers in active use so that the system could keep up.
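Rough sketches of both mitigations in plain Java, with made-up names and thresholds (this is not Target's actual stack). First, a per-store cache hit rate check that would have flagged the 0% rate on day one of the partial rollout:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    /** Tracks cache hit rate per store and flags any store that falls below a floor. */
    public final class CacheHitRateMonitor {
        private static final double MIN_HIT_RATE = 0.50; // placeholder; tune to normal behavior

        private static final class Counts {
            final LongAdder hits = new LongAdder();
            final LongAdder lookups = new LongAdder();
        }

        private final Map<String, Counts> byStore = new ConcurrentHashMap<>();

        public void record(String storeId, boolean hit) {
            Counts c = byStore.computeIfAbsent(storeId, id -> new Counts());
            c.lookups.increment();
            if (hit) {
                c.hits.increment();
            }
        }

        /** Run periodically from whatever scheduler/alerting pipeline already exists. */
        public void checkAndAlert() {
            byStore.forEach((storeId, c) -> {
                long lookups = c.lookups.sum();
                if (lookups == 0) {
                    return;
                }
                double hitRate = c.hits.sum() / (double) lookups;
                if (hitRate < MIN_HIT_RATE) {
                    // In real life: page someone / emit an alert metric instead of printing
                    System.err.printf("ALERT: store %s cache hit rate %.1f%%%n", storeId, hitRate * 100);
                }
            });
        }
    }

And a minimal load-shedding wrapper: bound the number of in-flight requests (the bound should come from load testing) and reject the excess immediately, so servers stay responsive enough to keep passing their health checks:

    import java.util.concurrent.Semaphore;
    import java.util.function.Supplier;

    /** Bounds in-flight work and rejects the excess fast, instead of letting everything queue until it times out. */
    public final class LoadShedder {
        private final Semaphore permits;

        public LoadShedder(int maxInFlight) { // the limit should come from load testing
            this.permits = new Semaphore(maxInFlight);
        }

        public <T> T call(Supplier<T> work, Supplier<T> onRejected) {
            if (!permits.tryAcquire()) {
                // Shed immediately (e.g. return HTTP 429 and let clients back off and retry),
                // so the server stays responsive and keeps passing its health checks.
                return onRejected.get();
            }
            try {
                return work.get();
            } finally {
                permits.release();
            }
        }
    }

A request handler would wrap its work in something like shedder.call(() -> lookUpItem(upc), () -> tooManyRequests()), where tooManyRequests() maps to HTTP 429.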
204 No Content is an underused HTTP status. 404 should be monitored as an error; 204 as, well, no content available. If a status code has two responsibilities, that's a monitoring issue waiting to happen.
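A sketch of splitting those responsibilities, with a hypothetical price-cache lookup (not the actual service):

    import java.util.Map;

    /** One meaning per status code, so monitoring can tell "broken" from "nothing cached". */
    public final class PriceCacheEndpoint {
        private final Map<String, String> cache; // UPC -> price, the local cache

        public PriceCacheEndpoint(Map<String, String> cache) {
            this.cache = cache;
        }

        public int statusFor(String path, String upc) {
            if (!"/v1/price".equals(path)) {
                return 404; // genuinely wrong URL: monitor this as an error
            }
            String price = cache.get(upc);
            return (price != null)
                    ? 200   // cache hit
                    : 204;  // valid endpoint, nothing cached: expected, not an error
        }
    }

With that split, a misconfigured URL shows up as a spike of 404s in monitoring, while routine cache misses stay in their own expected bucket.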
It seems absolutely insane to me that a system was designed and developed that allows taking down all registers in the country at once. I would have thought it would be designed to be much more "batch" oriented, where the worst that could happen is you lose price updates and sales info until the batches can get through again.
If the local system doesn't have the item data (or thinks it doesn't have the item data, because it's looking in the wrong place) where exactly is it supposed to get the item data from if not the central system?
The total size of all item data for a store like Target can't be much more than what, a few gigabytes? Or at least the "UPC -> Price" dataset. So download the whole dataset each night, if you can't get delta changes to work.
And if that had failed somehow, it would have been noticed immediately upon the new code roll-out.
The internet was designed to be extremely resilient to host/route losses; we've made it so reliable that we now assume all machines are reachable at all times.
(To be fair, apparently they "do" have this but the dataset is printed on the items and the cashiers had to enter it by hand)
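For concreteness, a minimal sketch of the nightly full-snapshot idea (hypothetical file format and path):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    /** Loads the complete UPC -> price snapshot dropped off overnight. */
    public final class NightlyPriceSnapshot {

        public static Map<String, String> load(Path snapshotFile) throws IOException {
            Map<String, String> prices = new HashMap<>();
            for (String line : Files.readAllLines(snapshotFile)) {
                String[] parts = line.split(",", 2); // assumed "upc,price" per line
                if (parts.length == 2) {
                    prices.put(parts[0], parts[1]);
                }
            }
            return prices;
        }

        public static void main(String[] args) throws IOException {
            // Even tens of millions of items fit comfortably in memory as a flat map.
            Map<String, String> prices = load(Paths.get("/data/prices-latest.csv"));
            System.out.println("Loaded " + prices.size() + " items");
        }
    }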
Their system was basically designed the way you're saying, with a fallback to grab the data from the central location if it's missing locally. What you're asking for is the same system without a fallback, which doesn't make any sense.
The fallback was the problem: design the system without it, or with a manual window that pops up saying "ITEM NOT FOUND, QUERY TARGET ORACLE" or something, and the fault wouldn't have taken down the whole company.
If suddenly every cashier were forced to hit OK on every item, people would hear about it immediately during the test rollout instead of when it hit everything (assuming, of course, that you have good methods for detecting things like this and don't just completely ignore associates' complaints).
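To make the contrast concrete, a sketch of the two lookup shapes being argued about (all names made up):

    import java.util.Map;
    import java.util.OptionalDouble;

    public final class ItemLookup {

        /** Hypothetical central service interface. */
        public interface CentralPriceService {
            double price(String upc);
        }

        private final Map<String, Double> localPrices;

        public ItemLookup(Map<String, Double> localPrices) {
            this.localPrices = localPrices;
        }

        // The design that failed: silently fall back to the central service on a local miss,
        // so one bad cache config turns every register in the country into load on one service.
        public double priceWithRemoteFallback(String upc, CentralPriceService central) {
            Double local = localPrices.get(upc);
            return (local != null) ? local : central.price(upc);
        }

        // The design argued for here: a local miss stays local and loud. The register shows
        // "ITEM NOT FOUND" and the cashier keys the printed price, which a test rollout
        // would surface immediately.
        public OptionalDouble priceLocalOnly(String upc) {
            Double local = localPrices.get(upc);
            return (local != null) ? OptionalDouble.of(local) : OptionalDouble.empty();
        }
    }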
One reason for tests to hit a real external API is if you're using a "record and replay" test framework to capture the interactions so that you can run the tests against the recorded data quickly later. But because the API calls you make change (and the external implementation changes) you need to re-record from time to time.
This strikes a balance where 99% of the time you are making calls that never leave the process for fast testing, but can validate against the real implementation as needed.
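One way to get that balance without naming any particular framework (a sketch with made-up types; real record/replay tools handle this more thoroughly): hide the external call behind an interface, serve recorded responses by default, and flip a flag when it's time to re-record against the real API.

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    /** Hypothetical external dependency. */
    interface PartnerApi {
        String fetchQuote(String symbol);
    }

    /** Serves recorded responses in-process; re-records against the real client on demand. */
    final class RecordReplayApi implements PartnerApi {
        private final PartnerApi real;      // only needed when recording
        private final Path recordingDir;
        private final boolean record;       // e.g. driven by -Dapi.record=true

        RecordReplayApi(PartnerApi real, Path recordingDir, boolean record) {
            this.real = real;
            this.recordingDir = recordingDir;
            this.record = record;
        }

        @Override
        public String fetchQuote(String symbol) {
            Path file = recordingDir.resolve(symbol + ".txt");
            try {
                if (record) {
                    String fresh = real.fetchQuote(symbol); // the one code path that leaves the process
                    Files.createDirectories(recordingDir);
                    Files.write(file, fresh.getBytes(StandardCharsets.UTF_8));
                    return fresh;
                }
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8); // fast replay
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }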
There is no picture of the entry-level model; it says "coming soon". Since upgrading a Ford F-150 to 4 full-size doors and a navigation computer usually costs a lot, I'm wondering if that will be the case here too. Near the bottom of the page it looks like you have to upgrade two levels, to the "Lariat" configuration, to get that 15.5 inch touchscreen.
The claims I saw in all of the regurgitated Press Releases seemed to indicate they are going to do 4 doors as standard on the Lightning, at least for now.
I'm not sure how much of that is streamlining production versus design (i.e. fitting all the batteries in place may more or less necessitate that specific body style).
In the spirit of the article, I think that the approach would be to ask your partners to provide a sandbox/test environment that your sandbox/test environment can interact with, or test accounts in production at least.
He is not modeling the threshold below which wealth is not taxed. The wealth tax being proposed in California is 0.4% on amounts over $30MM, so the lifetime percentage taken by the government on a $30MM stock cash-out via the proposed wealth tax would be 0%.
So the proposal boils down to $30MM tax free, and then a lifetime take of 21.4% on the amount over that. But you're also likely to invest that money, making, say, 7%. So your return each year is 7.0% on the first $30MM and about 6.6% on the rest after paying this tax.
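For concreteness (and assuming the 21.4% figure is just the 0.4% annual tax compounded over roughly a 60-year horizon, which isn't spelled out here):

    \underbrace{1 - (1 - 0.004)^{60}}_{\text{lifetime take on wealth above \$30MM}} \approx 21.4\%,
    \qquad
    \underbrace{(1 + 0.07)(1 - 0.004) - 1}_{\text{net annual return above the threshold}} \approx 6.6\%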
Functional programming is not just for Haskell; modern Java has lots of pretty decent options, so I feel like their Java version of the code is a bit of a straw man. Here is how a more modern code style in Java 8 (which is years old now) might look. Notice that the logic for this problem is 7 lines of code, with one line being the closing brace.
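The snippet being referred to isn't reproduced here, so purely as an illustration of the style (streams and lambdas in place of explicit loops), a Java 8 pipeline over a made-up toy problem looks something like this:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class StreamStyleExample {

        /** Made-up item type, just to have something to stream over. */
        static final class Item {
            final String category;
            final double price;
            final boolean inStock;

            Item(String category, double price, boolean inStock) {
                this.category = category;
                this.price = price;
                this.inStock = inStock;
            }
        }

        public static void main(String[] args) {
            List<Item> items = Arrays.asList(
                    new Item("grocery", 2.99, true),
                    new Item("grocery", 4.50, false),
                    new Item("toys", 19.99, true));

            // The actual logic is one declarative pipeline rather than nested loops and temporaries.
            Map<String, Double> inStockTotalByCategory = items.stream()
                    .filter(i -> i.inStock)
                    .collect(Collectors.groupingBy(
                            i -> i.category,
                            Collectors.summingDouble(i -> i.price)));

            System.out.println(inStockTotalByCategory); // per-category totals of in-stock items
        }
    }

The point isn't this particular toy problem, just that the collection-pipeline style has been idiomatic Java for years now.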