
> Faulty script. Second, the script we used provided both the "mark for deletion" capability used in normal day-to-day operations (where recoverability is desirable), and the "permanently delete" capability that is required to permanently remove data when required for compliance reasons. The script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted.

Ouch. I hope no one person got the blame. This is a systemic failure. Regardless, my sympathies to the engineers involved.



I don't want to assume too much, since the details are sparse. But I know for a fact that few of my current coworkers know a thing about writing tooling code. It's becoming a bit of a lost art.

Here's how such a script should be done. You have a dry-run flag. Or, better yet, make the script dry-run only: it checks the database, gathers actions, and then sends those actions to stdout. You dump this to a file. These commands are executable. They can be SQL or additional shell scripts (e.g. "delete-recoverable <customer-id>" vs. "delete-permanent <customer-id>").

The idea is you now have something to verify. You can scan it for errors. You can even put it up on GitHub for review by stakeholders. You double- and triple-check the output, and then you execute it.
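A minimal sketch of that pattern, assuming hypothetical "delete-recoverable" and "delete-permanent" commands and IDs arriving on stdin (in a real tool they'd come from a database query). The key property is that the script only *prints* commands; it never executes anything itself.

```python
# Dry-run-only pattern: gather actions, emit reviewable commands to stdout.
# The command names and ID source here are illustrative, not a real API.
import sys

def gather_actions(customer_ids, permanent=False):
    """Turn a list of customer IDs into reviewable shell commands."""
    cmd = "delete-permanent" if permanent else "delete-recoverable"
    return [f"{cmd} {cid}" for cid in customer_ids]

def main():
    # In a real tool, these IDs would come from querying the database.
    customer_ids = [line.strip() for line in sys.stdin if line.strip()]
    for action in gather_actions(customer_ids, permanent=False):
        print(action)  # redirect to a file, review it, then run the file

if __name__ == "__main__":
    main()
```

You'd pipe the output to a file, diff and review it, and only then feed the file to a shell, so the destructive step is always separated from the decision step.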

Tooling that enhances visibility by breaking down changes into verifiable commands is incredibly powerful. Making these tools idempotent is also an art form, and important.


That’s how I did one of my more impactful deduplication/deletion scripts. It had to reach across environments to do its work, but it accepted no flags at all; the environment names were hard-coded (e.g. dev-uw2 reaching out to stg-ue1). It would output a dry-run result by default, so you could see exactly what was going to get deleted and from which environment.

Because the names were hard-coded, I had to get changes approved in GitHub. Then the script would run on Jenkins.

That script was also for that one purpose and nothing else. It made a mess, because I needed a ton of functionality around creation and querying too, so I just copied the script into folders and modified each copy as needed; a better solution would’ve been to make a Python module. But I liked the code itself being highly specific to what the script was doing, to help reduce mistakes: if I’m running a script to delete repos, I have to go to the delete-repos directory.


If coding is theatrical then ops is operatic. You have to telegraph stuff so over the top that the people in the cheap seats know what’s going on.

I think what we’ve lost in the post-XP world is that just because you build something incrementally doesn’t mean it’s designed incrementally (read: myopically).

My idiot coworkers are “fixing” redundancy issues by adding caching, which recreates the same problem they’re (un?)knowingly trying to avoid, which is having to iterate over things twice to accomplish anything. They’ve just moved the conditional branches to the cache and added more.

Most of the time, and especially on a concurrent system, you are better off building a plan of action first and then executing it second. You can dedupe while assembling the plan (dynamic programming) and you don’t have to worry about weird eviction issues dropping you into a logic problem like an infinite loop.

More importantly, you can build the plan and then explain the plan. You can explain the plan without running it. You can abort the plan in the middle when you realize you’ve clicked the wrong button. And you can clean up on abort because the plan is not twelve levels deep in a recursive call, where trying to clean up will have bugs you don’t see in a Dev sandbox.
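A minimal sketch of the plan-then-execute idea, with a hypothetical "delete" action. The plan is assembled (and deduplicated) without side effects, can be printed and reviewed before anything runs, and execution has a clean abort point between steps instead of being buried in recursion.

```python
# Build-the-plan-first pattern: no side effects until execute_plan.
# The ("delete", id) action shape is an illustrative choice.
def build_plan(requested_ids):
    """Assemble a deduplicated, ordered plan without touching anything."""
    seen = set()
    plan = []
    for cid in requested_ids:
        if cid not in seen:          # dedupe while assembling the plan
            seen.add(cid)
            plan.append(("delete", cid))
    return plan

def execute_plan(plan, do_delete, should_abort=lambda: False):
    """Run the plan step by step; return the steps actually performed."""
    done = []
    for action, cid in plan:
        if should_abort():           # clean abort point between steps
            break
        do_delete(cid)
        done.append((action, cid))
    return done
```

Because `build_plan` is pure, "explain the plan" is just printing its return value, and the record of completed steps makes cleanup after an abort tractable.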

    Deleting 500 users…

Versus

    Permanently deleting 500 users…

Maybe with a nice ten-second pause (what’s an extra ten seconds for a task that takes five minutes?)
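A hedged sketch of that loud, slow confirmation for the destructive mode. The ten-second pause and the exact typed phrase are illustrative choices, not anyone's actual tooling; the prompt and sleep are injectable so the function stays testable.

```python
# Telegraph a permanent deletion loudly and pause before allowing it.
import time

def confirm_permanent_delete(count, input_fn=input, sleep_fn=time.sleep):
    """Return True only on explicit, typed consent after a pause."""
    print(f"Permanently deleting {count} users...")
    print("This CANNOT be undone. Pausing 10 seconds; Ctrl-C to abort.")
    sleep_fn(10)
    expected = f"delete {count} users"
    return input_fn(f"Type '{expected}' to continue: ") == expected
```

Making the operator retype the count is a cheap way to catch "wrong list of IDs" mistakes: if they expected 4 deletions and the prompt says 400, the mismatch is staring at them.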


I suppose that’s why you don’t combine a Taser and a gun into one device with two triggers.


If you add a third trigger where the gun turns on the user, it would be fairly safe.


The problem is that sometimes that gun looks like a Taser.


Instead, you make it with one trigger and a PRNG that decides which gets activated. Just hope you've chosen the right PRNG!!


I will then write a script that calls your script with the PRNG of my choice: PRNG1 always returns "trigger 2", and PRNG2 always returns "trigger 1". This detail will be documented in Confluence.


Considering American police can't even seem to get it right when they have two distinct weapons, and are trained to holster them on specific sides so they know what they are grabbing - and still manage to f*ck it up - this might be an improvement.


This speaks to a lack of operational excellence - when you develop a platform like JIRA, Confluence, etc, the operational tools required to manage the systems are just as important as the features themselves. If all you do is pump out features, you're a feature factory and will suffer these kinds of issues. There's no reasonable explanation for needing a script to do what was described when the necessary tooling to generalize such an operation should have been in existence.


Right? The way this reads it seems like one person set a flag incorrectly, something I'm sure we've all done numerous times. And there were no checks down the line to catch it.


Hi, this is Mike from Atlassian Engineering. You are right that the checks need to improve to reduce human error, but that's only half of it. I don't see this as human error, though; it's a system error. We will be doing some work to make these kinds of hard deletes impossible in our system.


The bigger story here is that they're restoring data that should have been permanently deleted for compliance reasons.



