
At first I found myself nodding along, but then a troublesome question popped into my mind: what about TOC/TOU (time-of-check/time-of-use) race conditions?



Indeed, that's the thing I rather swept under the rug. Where it's at all easy, I'll definitely take precautions against assuming that just because my `--dry-run` stage said something was safe, it will still be safe by the time execution passes to the "do-it" stage. However, in general you are indeed doomed the instant you decide to split the tool up this way.

I have never actually encountered such a race condition while using any tool I've written this way. But I do remember one place where I specifically didn't add a feature to the tool because it was so vulnerable to a TOC/TOU race condition. Instead, I turned that feature into a "did I just manage to do the thing successfully, and if not, why not" check at the end of the "do-it" stage.
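Roughly, a minimal Go sketch of that kind of end-of-run check (doTheThing and verifyOutcome are hypothetical stand-ins, not the real tool's API):

    package main

    import (
        "errors"
        "fmt"
        "os"
    )

    // Hypothetical stand-in for the tool's real action.
    func doTheThing() error { return nil }

    // Re-inspect external state after the fact: "did I just manage to
    // do the thing successfully, and if not, why not?"
    func verifyOutcome() error {
        return errors.New("resource still present after delete")
    }

    func main() {
        if err := doTheThing(); err != nil {
            fmt.Fprintln(os.Stderr, "do-it stage failed:", err)
            os.Exit(1)
        }
        if err := verifyOutcome(); err != nil {
            fmt.Fprintln(os.Stderr, "post-check failed:", err)
            os.Exit(1)
        }
    }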

So yes, that concern is extremely valid, and if you're doing an operation whose validity is liable to change after you've done the check, then you do just need to follow a different approach for that operation. (But then there's probably no possibility even in principle of a `--dry-run` for such an operation anyway!)


Something I've found can work is "check, output plan, prompt for confirmation, then if 'yes', execute the plan in a way that maximises the odds of bailing out if anything changed while you were displaying the prompt" - for bonus points, recalculate after each action that had to be simulated, and do something policy-defined if anything differs from your prior expectations.

Obviously the effort of doing this becomes a trade-off but it's worth considering.
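A minimal Go sketch of that shape (computePlan and the action fields are hypothetical placeholders):

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // Each planned action carries the precondition that made it valid
    // when the plan was computed, so it can be rechecked just before
    // execution.
    type action struct {
        describe string
        stillOK  func() bool
        execute  func() error
    }

    // Hypothetical stand-in that would inspect external state and
    // build the list of actions.
    func computePlan() []action { return nil }

    func main() {
        plan := computePlan()
        for _, a := range plan {
            fmt.Println("would:", a.describe)
        }
        fmt.Print("proceed? [y/N] ")
        line, _ := bufio.NewReader(os.Stdin).ReadString('\n')
        if strings.TrimSpace(line) != "y" {
            return
        }
        for _, a := range plan {
            // Bail out if the world changed while the prompt was up.
            if !a.stillOK() {
                fmt.Fprintln(os.Stderr, "state changed, aborting before:", a.describe)
                os.Exit(1)
            }
            if err := a.execute(); err != nil {
                fmt.Fprintln(os.Stderr, "failed:", err)
                os.Exit(1)
            }
        }
    }

The per-action recheck is where the policy hook would go; abort-on-any-change is just the simplest policy.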


I think this can be mitigated with more abstract descriptions, like:

"Would have gotten all servers tagged foo (currently 1,341) and deleted them"


That's all very well for just `--dry-run`, but it's still a problem if you use the method in the OP to architect the entire tool (and in particular the non-dry-run execution) around the existence of a `--dry-run` phase.


I don't think so? The planning stage generates something like DeleteOp(ServerSpec { tag: "foo" }).


I think the problem is that describing it as "would have deleted 1378 servers", and then going off and deleting those 1378 servers, doesn't solve the race condition: you're still left with some servers undeleted because they popped into existence after the check.

With server deletion, of course, the problem is less visible, because deletion is naturally idempotent (unless you're referring to servers by some non-unique key). In the idempotent case, you can safely just pretend the leftover servers came into existence after the entire run finished.

Of course, if you're referring to servers by name or IP or something, then you hit exactly this problem. Concretely: I run a command that deletes all servers older than one day. The 'dry-run' stage determines which servers need to be deleted, and it turns out that server "foo" is among them. Elsewhere, someone spots that "foo" exists, deletes it, and creates a new server with the name "foo". Then the 'execute' stage deletes the new "foo", which is not at all the server we wanted to delete.
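A toy Go sketch of that race (the inventory map is a stand-in for a real API; capturing the immutable ID at check time is one way out, per the non-unique-key caveat above):

    package main

    import "fmt"

    // Hypothetical inventory where names can be reused but IDs are
    // immutable.
    type server struct {
        id   int
        name string
    }

    var inventory = map[int]server{1: {id: 1, name: "foo"}}

    func main() {
        // Check stage: resolve the doomed server "foo" to its ID.
        target := 1

        // Meanwhile, someone deletes "foo" and recreates it under the
        // same name.
        delete(inventory, 1)
        inventory[2] = server{id: 2, name: "foo"}

        // Execute stage: deleting by the captured ID is a harmless
        // no-op, whereas deleting by name here would have killed the
        // wrong server.
        delete(inventory, target)
        fmt.Println(inventory) // the new "foo" (id 2) survives
    }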


Your example inaccurately describes what would happen with GP's approach.

Planning stage would generate `Delete(Server{CreatedBefore: 1621887357})`.

It would output something like "Deleting servers created before $time$, currently $servers$."

The point is the planning stage would encode the operation being done, not a list of servers to delete.
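A minimal Go sketch of that approach (listServers and deleteServer are hypothetical inventory calls):

    package main

    import "fmt"

    // The plan encodes the operation as a predicate, not a snapshot of
    // the servers that happened to match at planning time.
    type deleteOp struct {
        createdBefore int64 // unix timestamp
    }

    type server struct {
        id      string
        created int64
    }

    func (op deleteOp) matches(s server) bool {
        return s.created < op.createdBefore
    }

    // Hypothetical inventory calls.
    func listServers() []server { return nil }
    func deleteServer(s server) {}

    func main() {
        op := deleteOp{createdBefore: 1621887357}

        // Dry-run: describe the operation plus its *current* match count.
        n := 0
        for _, s := range listServers() {
            if op.matches(s) {
                n++
            }
        }
        fmt.Printf("Deleting servers created before %d, currently %d.\n",
            op.createdBefore, n)

        // Execute: re-evaluate the predicate against fresh state, so the
        // set acted on is whatever matches *now*, not whatever matched
        // at planning time.
        for _, s := range listServers() {
            if op.matches(s) {
                deleteServer(s)
            }
        }
    }

The count in the dry-run output is advisory only; nothing in the plan pins down a concrete server list.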


Ohh, sorry - I got the wrong end of the stick entirely. Yes, in that case I'm satisfied!



