
Contestants get 4.5 hours for each of the two days of competition. They have to solve three problems in that time, so on average you can spend 1.5 hours per problem (if you're aiming to finish all three).

That said, the gap from "can't do it at all" to "can do it in 60 hours" is probably quite a bit larger than the gap from 60 hours to 1.5 hours.


Timing something that can be run faster by throwing better hardware at it honestly feels conceptually irrelevant, as long as the complexity is actually tractable.


When tackling IMO problems, the hard part is coming up with a good approach to the proof. Verifying your proof (and rejecting your false attempts) is much easier. You'll know which one to submit.

(Source: I am a two-time IMO silver medalist.)


You are a human, not an AI. You know whether your idea seems related to the solution. The AI has thousands of ideas and doesn't know which are better. Graders shouldn't accept a thousand guesses grubbing for 1 point.

If verifying a good idea is easy, then the evidence shows that the AI didn't have good ideas for the other 2 problems.


We are talking about Lean proofs. Given a formal statement and a proof, Lean can verify whether it's correct or not. It's like generating computer programs to solve a problem - the problem lies in generating useful solutions/sub-solutions so that the search is effective. They achieve this by using Gemini as a Lean proof generator, i.e. a world-model LLM fine-tuned to generate Lean proofs more effectively.

Humans are even better at this, as you mention - but effectively the approach is similar: come up with a lot of ideas and see which ones prove it.


I don't see why we should take your word for it, as opposed to just asking AlphaProof to comment instead.


Well, he does have twice the number of silver medals... And can speak the English language... Although, an AI attempting to speak with the human race entirely through an esoteric math-proofing language would be an interesting take on the whole "humans make ET contact, interact through universal language... pause for half of the movie, military hothead wanting to blow it out of the sky, until pretty lady mathematician runs into the Oval Office waving a sheet of paper... OF MATH!" trope... But now, it's a race between you and me, to see who can write the screenplay first!


In many ways it doesn't cancel out. On Mars, things weigh less but their mass is the same. So, for example, if you are walking on Mars at a normal Earth walking speed, and you bump your shin on something, the inertia of your leg as it hits the obstacle is just as great as the inertia of your leg on Earth. In general, a lot of the little physical interactions that you rely on bone strength to get through scale according to mass / inertia (and muscle strength), not weight.
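To make that concrete, a back-of-envelope sketch (the leg mass and walking speed here are rough assumptions, not measurements):

```python
# Weight scales with local gravity; the momentum an obstacle must absorb
# when you bump your shin depends only on mass and speed.
EARTH_G = 9.81  # m/s^2
MARS_G = 3.71   # m/s^2

leg_mass = 10.0   # kg, rough mass of a lower leg (assumed)
walk_speed = 1.4  # m/s, a normal Earth walking pace (assumed)

def weight(mass_kg, g):
    # Force your muscles and bones support when standing still.
    return mass_kg * g

def impact_momentum(mass_kg, speed):
    # Momentum transferred when the moving leg hits an obstacle.
    return mass_kg * speed

earth_weight = weight(leg_mass, EARTH_G)  # ~98 N
mars_weight = weight(leg_mass, MARS_G)    # ~37 N, about 62% less

# But the impact is identical on both planets:
momentum_earth = impact_momentum(leg_mass, walk_speed)
momentum_mars = impact_momentum(leg_mass, walk_speed)
```

So standing is easier on Mars, but stubbing your shin hurts just as much.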


> What Boeing really needs is a complete change in management culture, as that was the real root cause for the MAX disasters, but that is impossible to enforce, you can't even really verify that it has happened.

Serious question: why not? Why couldn't we require Boeing to change its culture, and then verify and enforce that change? Verification might involve periodically interviewing employees across all levels of the organization, performing spot checks and audits to make sure that procedures are being followed (no more failing to enter a work item into the system), and so forth.

(I don't mean to imply that this would be feasible under current regulatory law – I have no idea whether that is the case or not. I'm just saying that you could imagine a world where this could be done.)


Presumably Anthropic / Claude, OpenAI / GPT, Google DeepMind / Gemini.


thank you. yes I meant that.


This seems clear and authoritative. The one thing I can't wrap my head around: why didn't he say this when the story of Sam getting fired first circulated? Why didn't Sam ask him to clarify the record sooner? It would have been easy to tweet this out at any time.

Edit to add: obviously pg isn't going to respond to every rumor, but this one has had significant attention and he was particularly well placed to address it.


Edit to add: feel free to email me (address is in my profile) if you'd like to discuss this with someone in private. I've co-founded multiple startups, but don't have any experience with a bad situation like this, so could give general perspective only, but if you just need someone to talk to I can be that person.

---

The first question to ask is, what would you like to see happen?

Considering the short runway and the apparent problems with the CEO, do you see value in the company and would you like to keep it alive? Are you ready to write it off, and you just want to manage the endgame to ensure that yourselves, other employees (if any), customers (if any), investors, or other parties are treated fairly? Is there some other outcome that you'd like to optimize for?

Separately: who are the members of your board? I've never dealt with a situation like you're describing, but at a high level I imagine you have basically three channels for action:

1. Call a board meeting, fire the CEO.

2. Legal action. I'm not sure whether this sort of action is something you could sue over, something you could ask the police to press criminal charges for, both, or something else. Consider, however, that filing a lawsuit is expensive, and it's not obvious to me that either of these actions would benefit you in any practical way (aside from revenge, if you're looking for that).

3. "Soft power" – threaten to do (1) or (2) above, or to quit, or do other things, in order to coerce the CEO into leaving / behaving better / whatever else you'd like to see.

One other note: while I'm off the edge of my experience and expertise here, I can imagine possible scenarios where you could wind up in legal trouble if you're not careful. (For instance, if you're on the board and fail to take appropriate action now that you're aware of the problem?) You might want to consult a lawyer for this reason alone.


Given the level of impact that this incident caused, I am surprised that the remediations did not go deeper. They ensured that the same problem could not happen again in the same way, but that's all. So some equivalent glitch somewhere down the road could lead to a similar result (or worse; not all customers might have the same "robust and resilient architectural approach to managing risk of outage or failure").

Examples of things they could have done to systematically guard against inappropriate service termination / deletion in the future:

1. When terminating a service, temporarily place it in a state where the service is unavailable but all data is retained and can be restored at the push of a button. Discard the data after a few days. This provides an opportunity for the customer to report the problem.

2. Audit all deletion workflows for all services (they only mention having reviewed GCVE). Ensure that customers are notified in advance whenever any service is terminated, even if "the deletion was triggered as a result of a parameter being left blank by Google operators using the internal tool".

3. Add manual review for any termination of a service that is in active use, above a certain size.
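The staged termination in point 1 can be sketched as a small state machine (names and the grace-period length are made up for illustration, not anything Google actually runs): termination suspends the service and retains all data, and a separate purge step is allowed to discard data only after the grace period elapses.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(days=7)  # assumed window for the customer to notice

@dataclass
class Service:
    name: str
    state: str = "ACTIVE"  # ACTIVE -> SUSPENDED -> DELETED
    suspended_at: datetime | None = None
    data: dict = field(default_factory=lambda: {"vms": 42})

def terminate(svc: Service, now: datetime) -> None:
    """Make the service unavailable, but keep all data."""
    svc.state = "SUSPENDED"
    svc.suspended_at = now

def restore(svc: Service) -> None:
    """The push-of-a-button undo, available during the grace period."""
    assert svc.state == "SUSPENDED"
    svc.state = "ACTIVE"
    svc.suspended_at = None

def purge_expired(svc: Service, now: datetime) -> None:
    """Actually discard data, but only after the grace period."""
    if svc.state == "SUSPENDED" and now - svc.suspended_at >= GRACE_PERIOD:
        svc.data = {}
        svc.state = "DELETED"
```

The key property: no single code path goes straight from ACTIVE to DELETED, so even a buggy termination trigger leaves a recovery window.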

Absent these broader measures, I don't find this postmortem to be in the slightest bit reassuring. Given the are-you-f*ing-kidding-me nature of the incident, I would have expected any sensible provider who takes the slightest pride in their service, or even is merely interested in protecting their reputation, to visibly go over the top in ensuring nothing like this could happen again. Instead, they've done the bare minimum. That says something bad about the culture at Google Cloud.


>> 1. When terminating a service, temporarily place it in a state where the service is unavailable but all data is retained and can be restored at the push of a button. Discard the data after a few days. This provides an opportunity for the customer to report the problem.

This is so obviously "enterprise software 101" that it is telling that Google is operating in 2024 without it.

Ever since my new-grad days, the idea of immediately deleting data that is no longer needed has been out of the question.

Soft deletes in databases, with a column you mark deleted. Move/rename data on disk until you're super duper sure you need to delete it (and maybe still let the backup remain). Etc.
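The "column you mark deleted" pattern, in a toy form (table and column names are illustrative): rows are never removed by the application; they get a `deleted_at` timestamp and reads filter on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT)"
)
conn.execute("INSERT INTO accounts (name) VALUES ('unisuper'), ('acme')")

# "Delete" = set the flag, never an actual DELETE statement.
conn.execute(
    "UPDATE accounts SET deleted_at = datetime('now') WHERE name = 'acme'"
)

# Normal reads see only live rows...
live = conn.execute(
    "SELECT name FROM accounts WHERE deleted_at IS NULL"
).fetchall()

# ...but the data is still there for recovery until a later purge job runs.
recoverable = conn.execute(
    "SELECT name FROM accounts WHERE deleted_at IS NOT NULL"
).fetchall()
```

A periodic purge job can hard-delete rows whose `deleted_at` is older than some retention window, which keeps the pattern compatible with real deletion requirements.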


It sounds like the problem is that the deletion was configured with an internal tool that bypassed all those kinds of protections -- that went straight to the actual delete. Including warnings to the customer, etc.

Which is bizarre. Even internal tools used by reps shouldn't be performing hard deletes.

And then I'd also love to know how the heck a default value to expire in a year ever made it past code review. I think that's the biggest howler of all. How did one person ever think there should be a default like that, and how did someone else see it and say yeah that sounds good?


> This is so obviously "enterprise software 101" that it is telling Google is operating in 2024 without it.

My impression of GCP generally is that they've got some very smart people working on some very impressive advanced features and all the standard boring stuff nobody wants to do is done to the absolute bare minimum required to check the spec sheet. For all its bizarre modern enterprise-ness, I don't think Google ever really grew out of its early academic lab habits.


I know a bunch of way-too-smart PhD types who worked at GOOG exclusively in R&D roles and earnestly bragged to me that their work was not revenue-generating.


It makes sense for companies to have 1-10% of resources allocated for that. At Google scale that is thousands of people.


What if it's a lot more than 1-10% at GOOG?

Why haven't I met anyone who proudly works on a revenue-generating product at GOOG, compared to the several R&Ders I know from different social circles?


Sure it might well be. Ads and Cloud are the main things making money at Google. Those have very high profit margins. So it might be 90% at Google are just spending the money earned by the 10% that are directly bringing in the dough :)


There are many voices in the industry arguing against soft deletes. Mostly coming from a very Chesterton's Fence perspective.

For some examples https://www.metabase.com/learn/analytics/data-model-mistakes...

https://www.cultured.systems/2024/04/24/Soft-delete/

https://brandur.org/soft-deletion

Many more can easily be found.


For the use case we're discussing here, of terminating an entire service, the soft delete would typically be needed only at some high level, such as on the access list for the service. The impact on performance, etc. should be minimal.


Precisely: before you delete a customer account, you disable its access to the system. This is a scream test.

Once you've gone through some time and due diligence you can contemplate actually deleting the customer data and account.


I think the reason someone wouldn't want to do this is that it would cost Google money to keep it active at any level.


OK, but those examples you gave all boil down to the following:

1. you might accidentally access soft-deleted data and/or the data model is more complicated

2. data protection

3. you'll never need it

to which I say

1. you'll make all kinds of mistakes if you don't understand the data model, and, it's really not that hard to tuck those details away inside data access code/SPs/etc that the rest of your app doesn't need to care about

2. you can still delete the data later on, and indeed that may be preferable as deleting under load can cause performance (e.g. locking) issues

3. at least one of those links says they never used it, then gives an example of when soft-deleted data was used to help recover an account (albeit by creating a new record as a copy, but only because they'd never tried an undelete before and were worried about breaking something; sensible, but not exactly making the point they wanted to make)

So I'm gonna say I don't get it; sure it's not a panacea, yes there are alternatives, but in my opinion neither is it an anti-pattern. It's just one of dozens of trade-offs made when designing a system.


GDPR compliance precludes such an approach.


Hard agree. They were clearly more interested in insisting that there's no systemic problem in how GCP's operators manage the platform, which reads, strongly and alarmingly, as if there is a systemic problem in how GCP's operators manage the platform. The lack of the common-sense measures you outline in their postmortem just tells me that they aren't doing anything to fix it.


“There’s no systemic problem.”

Meanwhile, the operators were allowed to leave a parameter blank and the default was to set a deletion time bomb.

Not systemic my butt! That’s a process failure, and every process failure like this is a systemic problem because the system shouldn’t allow a stupid error like this.


If you're arguing that that was the systemic problem, then it's been fully fixed, as the manual operation was removed and so validation can no longer be bypassed.


I think you glossed over the importance of the term process failure.

The idea is that this one particular form missing the appropriate care is indicative of a wider lack of discipline amongst the engineers building it.

Definitionally, you cannot solve a process problem by fixing a specific bug.


"we removed the system that can enable a process failure" fixes the process failure. I didn't misunderstand anything.


joshua nails it. companies don't do root cause analysis anymore. it's just the great enshittification, google edition.


> When terminating a service, temporarily place it in a state where the service is unavailable but all data is retained and can be restored at the push of a button. Discard the data after a few days. This provides an opportunity for the customer to report the problem

Replacing actual deletion with deletion flags may lead to other fun bugs like "Google Cloud fails to delete customer data, running afoul of EU rules". I suspect Google would err on the side of accidental deletions rather than accidental non-deletions: at least in the EU.


> I suspect Google would err on the side of accidental deletions rather than accidental non-deletions: at least in the EU.

I certainly hope not, because that would be incredibly stupid. Customers understand the significance of different kinds of risk. This story got an incredible amount of attention among the community of people who choose between different cloud services. A story about how Google had failed to delete data on time would not have gotten nearly as much attention.

But let us suppose for a moment that Google has no concern for their reputation, only for their legal liability. Under EU privacy rules, there might be some liability for failing to delete data on schedule -- although I strongly suspect that the kind of "this was an unavoidable one-off mistake" justifications that we see in this article would convince a court to reduce that liability.

But what liability would they face for the deletion? This was a hedge fund managing billions of dollars. Fortunately, they had off-site backups to restore their data. If they hadn't, and it had been impossible to restore the data, how much liability could Google have faced?

Surely, even the lawyers in charge of minimizing liability would agree: it is better to fail by keeping customers' accounts than to fail by deleting them.


> Customers understand the significance of different kinds of risk.

Customers do; the law does not. The GDPR introduces unintended consequences.

> Surely, even the lawyers in charge of minimizing liability would agree: it is better to fail by keeping customers accounts then to fail by deleting them.

Not at all. Those in charge of enforcing the GDPR are heavily incentivized to assume the opposite of that. Google accidentally losing customer data is a win for privacy as far as the law's intent.


A deletion flag is acceptable under EU rules. For example, they are acceptable as a means of dealing with deletion requests for data that also exists in backups. Provided that the restore process also honors such flags.


I highly doubt this was the reason. Google has similar deletion protection for other resources eg GCP projects are soft deleted for 30 days before being nuked.


Not really how it works: the GDPR protects individuals and allows them to request deletion from the data controller, who then needs to respond to any request within 60(?) days. Google has nothing to do with that beyond having to make sure their infra is secure. There are even provisions for dealing with personal data in backups.

EU law has nothing to do with this.


It's a joke that they're not doing these things. How can you be a giant cloud provider and not think of putting safeguards around data deletion? I guess that realistically they thought of it many times but never implemented it because it costs money.


It’s probably because implementing such safeguards wouldn't help anyone's promo packet.

I really dislike that most of our major cloud infrastructure is provided by big tech rather than, e.g., infrastructure vendors. I trust Equinix a lot more than Google because that's all they do.


I work in GCP and have seen a lot of OKRs about improving reliability. So implementing something like this would help someone's promo packet.


This is exactly the kind of work that would get SREs promoted.


It is funny Google has internal memegen but not ideagen. Ideate away your problems, guys.


Understandable; however, public clouds are a huge mix of both hardware and software, and it takes deep proficiency at both to pull it off. Equinix is definitely in the hardware and routing business... it may be tough to move upstream.

Hardware always gets commoditized to the max (sad but true).


As a customer of Equinix Cloud... No thank you. Infrastructure vendors are terrible software engineers.


I’m completely baffled by Google’s “postmortem” myself. Not only is it obviously insufficient to anyone that has operated online services as you point out, but the conclusions are full of hubris. I.e. this was a one time incident, it won’t happen again, we’re very sorry, but we’re awesome and continue to be awesome. This doesn’t seem to help Google Cloud’s face-in-palm moment.


Sounds like they could stand to read the SRE book by Google. BTW, it's available for free at https://sre.google/sre-book/table-of-contents/

A bit chaotic (a mix of short essays) and simplistic (assuming one kind of approach or design), but definitely still worth a read. No exaggeration to state it was category defining.


FWIW, you're solving the bug by fiat, and that doesn't work. Surely analogs to all those protections are already in place. But a firm and obvious requirement of a software system that is capable of deleting data is the ability to delete data. And if it can do that, you can write a bug that short-circuits any architectural protection you put in place. Which is the definition of a bug.

Basically I don't see this as helpful. This is just a form of the "I would never have written this bug" postmortem response. And yeah, you would. We all would. And do.


Can you imagine if there was no backup? Would Google be on the hook to cover the +/- 200 billion in losses?

This is why the smart people at Berkshire Hathaway don't offer Cyber Insurance: https://youtu.be/INztpkzUaDw?t=5418


I’d be very surprised if there wasn’t legalese in the contract/ToS about liability limitations etc. Would maybe expect it to be more than infrastructure costs for a big company custom contract, but probably not unlimited/as high as that, because it seems like such a blatant legal risk…

Disclaimer: Am Googler who knows nothing real about this. This is rampant speculation on my part.


Could it have been a VMware expiration setting somewhere, and thus VMware itself deleted the customer's tenant? If so, then Google wouldn't have a way to prove it won't happen again, except by always setting the expiration flag to "never" instead of leaving it blank.


I would add one more -

4. Add an option to auto-backup all the data from the account to an outside backup service of the user's choice.

This would help not just with these kinds of accidents, but also with any kind of data corruption or availability issue.

I would pay for this even for my personal gmail account.


That sounds reasonable. Perhaps they felt that a larger change to process would be riskier overall.


No it would probably be even worse from Google’s perspective: more expensive.


Most of this complaint is explicitly answered in the article. Must have been TL...


I wouldn’t be surprised if VMware support is getting deprecated in GCP so they just don’t care - waiting for all customers to move off of it


My point is that if they had this problem in their VMware support, they might have a similar problem in one of their other services. But they didn't check (or at least they didn't claim credit for having checked, which likely means they didn't check).


A company with a high P/E is by definition not making much money for its shareholders, relative to the size of their investment.

Unless you mean that the share price may appreciate. That's absolutely a thing, but it's a dangerous game. Of course plenty of people have made fortunes this way; people have also lost fortunes; I think the advice to steer away from such companies is basically a statement about risk.
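To put numbers on it (purely illustrative): the earnings yield is just the inverse of the P/E, i.e. how much profit the company currently generates per dollar of share price.

```python
def earnings_yield(pe_ratio):
    # Fraction of your investment returned as earnings each year,
    # assuming earnings stay flat.
    return 1.0 / pe_ratio

low_pe = earnings_yield(10)    # 10% of the price comes back as earnings
high_pe = earnings_yield(100)  # only 1%; growth has to make up the rest
```

At a P/E of 100, you're getting 1% of your money back per year in current earnings; the other 99% of the valuation is a bet on future growth.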


> Unless you mean that the share price may appreciate.

This has already happened because the P/E is high! Betting that it will continue to grow in price (i.e., reach an even higher P/E) is risky.


Or it means that earnings are lagging the price increase. They just announced a 629% increase in earnings from a year ago (461% non-gaap) and it seems to be accelerating.


You could merge these data structures as well. If the two instances to be merged are not at the same "round", take the one that's at an earlier round and advance it (by discarding half the entries at random) by the difference in rounds. Then just insert the values from one list to the other, ignoring duplicates; if the result is too large, discard half at random and increment the round number.

I implemented exactly this algorithm at my previous employer, except that alongside each value, we stored an estimate of the number of times that value appeared. This allowed us to generate an approximate list of the most common values and estimated count for each value.


Merging like that doesn't work -- it will tend to overestimate the number of distinct elements.

This is fairly easy to see, if you consider a stream with some N distinct elements, with the same elements in both the first and second halves of the stream. Then, supposing that p is 0.5, the first instance will result in a set with about N/2 of the elements, and the second instance will also. But they won't be the same set; on average their overlap will be about N/4. So when you combine them, you will have about 3N/4 elements in the resulting set, but with p still 0.5, so you will estimate 3N/2 instead of N for the final answer.
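The overestimate is easy to check empirically. A quick simulation (my own sketch, not anyone's production code) of two p = 0.5 instances run over identical halves of a stream, naively merged by unioning the kept sets:

```python
import random

random.seed(0)
N = 100_000  # true number of distinct elements
p = 0.5

def sample_half(elements, p):
    # One sketch instance at "round" 1: each distinct element is kept
    # independently with probability p.
    return {x for x in elements if random.random() < p}

elements = range(N)
first = sample_half(elements, p)   # sketch over the first half of the stream
second = sample_half(elements, p)  # same distinct elements in the second half

# Naive merge: union the kept sets and keep p unchanged.
naive_estimate = len(first | second) / p

# Each element survives into the union with probability 1-(1-p)^2 = 0.75,
# so the estimate concentrates around 0.75 * N / 0.5 = 1.5 * N.
ratio = naive_estimate / N
```

Running this, `ratio` lands very close to 1.5 rather than 1.0, matching the 3N/2 argument above.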

I have a thought about how to fix this, but the error bounds end up very large, so I don't know that it's viable.


Was there, reviewed the PR, can confirm. Hi Steve!

Since then we've also tuned it up in a couple ways, in particular adding "skip" logic similar to fast reservoir sampling to trade some accuracy for the ability to not even look at the next N {M,G,T}B if you've already seen many many many matches. For non-selective searches over PB of data it's a good tradeoff, despite introducing some search-order bias.

