S3 glitch causes Tarsnap outage (daemonology.net)
71 points by cperciva on Sept 17, 2010 | 20 comments



Nicely written... I don't think I've read a more reassuring outage message before. Especially since it doesn't particularly sound like it's meant to be placating; it's just informing.


Thanks, but I wasn't trying to be reassuring, only to provide an explanation of what happened.


A good explanation is reassuring.


Still though, well-written and clear, as always. Kudos.


That's a big part of why it's reassuring. No dodging blame, no useless half-answers. That, and the fact that your safety-first decisions existed at all and worked, is excellent news to any of your customers.


Interesting how a 404 basically now translates into "purge DNS cache and try again".

Did I understand it right that the inability to locate a single object for a single customer affected all your customers?


Did I understand it right that the inability to locate a single object for a single customer affected all your customers?

Almost right. The Tarsnap server aggregates together data from multiple machines into a single S3 object; I don't know how many users had data stored in that object, but it's probably more than 1 and less than 10.

But the problem wasn't really caused by the object going missing; rather, it was caused by S3 doing Something Which S3 Should Never Do, plus the Tarsnap server code not being designed with that possibility in mind. I've since adjusted the code so that an error like this will be handled less severely.
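
To illustrate what "handled less severely" might mean (purely a hypothetical sketch in Python, not the actual Tarsnap server code), the difference is roughly between halting the whole service on an impossible response and flagging just the affected object:

    # Hypothetical sketch only -- not the real Tarsnap server code.
    # Shows "fail everything" vs. "flag the affected object" when S3
    # returns a 404 that should be impossible.

    class UnexpectedS3Error(Exception):
        """S3 returned something it should never return for this key."""

    def fetch_object(s3_get, bucket, key):
        blob = s3_get(bucket, key)        # assumed: returns bytes, or None on 404
        if blob is None:
            raise UnexpectedS3Error(key)  # we know this object was written
        return blob

    def handle_read(s3_get, bucket, key, quarantine, alert):
        try:
            return fetch_object(s3_get, bucket, key)
        except UnexpectedS3Error:
            # Old (safety-first) behaviour: shut the whole service down.
            # Adjusted behaviour: quarantine just this object, alert a human,
            # and keep serving everyone whose data isn't affected.
            quarantine(key)
            alert("impossible 404 for %s" % key)
            raise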

(That said, I doubt I'll ever see this S3 glitch again -- I got a phone call from Amazon providing some additional details about what caused this and it was clear that they were taking it very seriously.)


> That said, I doubt I'll ever see this S3 glitch again -- I got a phone call from Amazon providing some additional details about what caused this and it was clear that they were taking it very seriously.

They'd better.

I'm as surprised as you are (well, probably not, since I'm not using S3, but I know a bit about how it is put together). I can't believe the Amazon people are happy about having this happen to them; it's the exact opposite of what should happen in a 'cloud' storage situation.

I think there will be some pretty high-level meetings about this glitch; the one thing you don't want is customer data going absent without leave, even if only on a holiday rather than a permanent departure.

Isn't it against your instinct to have data from different customers live in what amounts to the same file? I understand you've got it encrypted to the hilt, but that seems 'un-Colinish' ;)


As far as I understand, he is using S3 as a dumb block-level store. He has implemented a file system on top of S3.

All customer data is stored in different files, it's just that those files don't map 1:1 to S3 objects.
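
Roughly something like this, as a sketch (the real on-S3 layout isn't described here, so the pack-file scheme below is purely hypothetical):

    # Hypothetical sketch of packing many small per-customer blocks into
    # fewer, larger S3 objects -- not Tarsnap's actual format.

    import io

    class PackWriter:
        def __init__(self, s3_put, bucket, target_size=4 * 1024 * 1024):
            self.s3_put = s3_put          # assumed: s3_put(bucket, key, bytes)
            self.bucket = bucket
            self.target_size = target_size
            self.buf = io.BytesIO()
            self.index = {}               # (customer, block_id) -> (pack, offset, length)
            self.pack_seq = 0

        def append(self, customer, block_id, data):
            offset = self.buf.tell()
            self.buf.write(data)
            self.index[(customer, block_id)] = (self.pack_seq, offset, len(data))
            if self.buf.tell() >= self.target_size:
                self.flush()

        def flush(self):
            if self.buf.tell() == 0:
                return
            key = "pack-%08d" % self.pack_seq   # one S3 object holds many blocks
            self.s3_put(self.bucket, key, self.buf.getvalue())
            self.pack_seq += 1
            self.buf = io.BytesIO()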


The tarsnap client does the encryption. The server never sees plaintext. I think it's reasonable to consider the tarsnap protocol as adequate separation, since an error on the server wouldn't be able to leak customer data (except for stats about size and access patterns, perhaps, but that could happen regardless of how the data is chunked together).
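
In outline, the separation looks roughly like this; a minimal sketch using a generic AEAD from the Python `cryptography` package as a stand-in, not tarsnap's actual key format or cipher choices:

    # Sketch of client-side encryption: only ciphertext reaches the server/S3.
    # ChaCha20-Poly1305 is a stand-in here, not what the tarsnap client uses.

    import os
    from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

    def encrypt_block(key: bytes, plaintext: bytes, block_id: bytes) -> bytes:
        nonce = os.urandom(12)
        # block_id as associated data binds the ciphertext to its logical position
        return nonce + ChaCha20Poly1305(key).encrypt(nonce, plaintext, block_id)

    def decrypt_block(key: bytes, blob: bytes, block_id: bytes) -> bytes:
        nonce, ct = blob[:12], blob[12:]
        return ChaCha20Poly1305(key).decrypt(nonce, ct, block_id)

    # key = ChaCha20Poly1305.generate_key()   # generated and kept on the client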


Ouch, I could imagine that S3 glitch causing a serious problem; not in Colin's code, I'm sure, but maybe for someone a little less diligent. A 404 could cause a system to assume something isn't there and maybe, shock horror, write something else in its place. I can imagine that being very, very bad... Still, it sounds like it won't be happening again.


The kill-switch-on-error idea is very interesting and I can see why it might be necessary for something like Tarsnap.

One question though: could this feature be used in a denial-of-service attack? i.e., induce errors in the Tarsnap server or its supporting environment (such as DNS) so that it shuts down for everyone? Admittedly, there doesn't seem to be much point in this, but I'm curious if it's an angle you've considered.


could this feature be used in a denial-of-service attack? i.e., induce errors in the Tarsnap server or its supporting environment (such as DNS) so that it shuts down for everyone?

If you can impersonate an S3 server, yes.

But if you're impersonating an S3 server, I want Tarsnap to shut down pending investigation.


Not after his code change.


Incidentally, I have no idea why everyone treats S3 as so reliable. All these services offer backups to S3. Well folks, I think S3 is bound to fail some day, and some good peace of mind could be manufactured by mirroring S3 files to a few different locations.


They sound like they have a pretty good system, and if you can afford five petabytes and don't do much moving in and out, they are cheap, too.

On the other hand, their system is non-standard. It's not used by or tested by anyone else. And it sounds like a fairly complex system. The fact that they haven't lost a whole lot of data yet means that they must be pretty good... but the more complex (and unusual) a system, the more I would fear a failure caused by a software bug.


[S3] sounds like a fairly complex system

That depends on your definition of "complex", of course. In terms of the number of lines of code, I'd guess that S3 is significantly simpler than a typical filesystem -- key-blob CRUD is a much simpler thing to implement than directory trees, file metadata, memory-mapped files, and all the other horrible messes filesystems need to handle.
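
To make the comparison concrete, the entire external surface of a key-blob store fits in a handful of operations; a toy in-memory version is below. The hard part of a real S3 is doing this across many machines, not the interface itself.

    # Toy in-memory key-blob store: the whole external contract in ~4 operations.
    # A real S3 distributes this across nodes, but the interface stays this small.

    from typing import Dict, Iterator, Optional

    class KeyBlobStore:
        def __init__(self) -> None:
            self._data: Dict[str, bytes] = {}

        def put(self, key: str, blob: bytes) -> None:
            self._data[key] = blob

        def get(self, key: str) -> Optional[bytes]:
            return self._data.get(key)      # None plays the role of a 404

        def delete(self, key: str) -> None:
            self._data.pop(key, None)

        def list(self, prefix: str = "") -> Iterator[str]:
            return (k for k in sorted(self._data) if k.startswith(prefix))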

As far as data loss specifically goes: S3 is in a good place. Most of the complexity of a system like S3 is in finding data -- that is, routing requests to the right node -- not in merely not losing data. I don't think it's a coincidence at all that despite several outages over the past few years which have made S3 unavailable, they've always been able to bring the service back online with no loss of data.


You'd know much more about these sorts of things than I would, but most filesystems largely ignore disk corruption. If Amazon did that, we'd be seeing a lot more trouble than we have. This is where I see the complexity: detecting and recovering from subtle errors. I've gone a short way down that path, and from where I stand it looks quite hairy.

Besides, I've seen data loss caused by ext3 on my own systems that was not caused by disk corruption, so really, at S3's scale, it's got to function (and seems to have been functioning) much better than a regular filesystem.


S3 has an advantage over regular filesystems there too. S3 doesn't need to provide single-node durability as long as data loss events on different nodes are uncorrelated; so they can protect data with cryptographic checksums and throw it away (from an individual node) at the first sign of corruption, knowing that they'll still have it on all the other nodes.
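
A sketch of that read path, assuming a hypothetical per-node get/delete API:

    # Sketch of "checksum on read, discard a copy at the first sign of
    # corruption, rely on the other replicas". The per-node API is hypothetical.

    import hashlib

    def read_with_replicas(key, replicas, expected_sha256):
        for node in replicas:
            blob = node.get(key)
            if blob is None:
                continue                    # node never had it, or already dropped it
            if hashlib.sha256(blob).hexdigest() == expected_sha256:
                return blob
            node.delete(key)                # corrupt copy: drop it, re-replicate later
        raise IOError("no intact replica found for %r" % key)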


Such is the danger of sharecropping.



