S3 is files, but not a filesystem (calpaterson.com)
571 points by todsacerdoti on March 10, 2024 | 430 comments


> I haven't heard of people having problems [with S3's Durability] but equally: I've never seen these claims tested. I am at least a bit curious about these claims.

Believe the hype. S3's durability is industry leading and traditional file systems don't compare. It's not just the software - it's the physical infrastructure and safety culture.

AWS' availability zone isolation is better than the other cloud providers. When I worked at S3, customers would beat us up over pricing compared to GCP blob storage, but the comparison was unfair because Google would store your data in the same building (or maybe different rooms of the same building) - not with the separation AWS did.

The entire organization was unbelievably paranoid about data integrity (checksum all the things) and bigger events like natural disasters. S3 even operates at a scale where we could detect "bitrot" - random bit flips caused by gamma rays hitting a hard drive platter (roughly one per second across trillions of objects iirc). We even measured failure rates by hard drive vendor/vintage to minimize the chance of data loss if a batch of disks went bad.

I wouldn't store critical data anywhere else.

Source: I wrote the S3 placement system.


What’s your experience like at other storage outfits?

I only ask because your post is a bit like singing Cinnabon's praises for making their own dough.

The things that you mentioned are standard storage company activities.

Checksum-all-the-things is a basic feature of a lot of file systems. If you can already set up your home computer to detect bitrot and alert you, you can bet big storage vendors do it.

Keeping track of hard drive failure rates by vendor is normal. Storage companies publicly publish their own reports. The tiny 6-person IT operation I was in had a spreadsheet. Hell, I toured a friend’s friend’s major data center last year and he managed to find time to talk hard drive vendors. Now you. I get it — y’all make spreadsheets.

There are a lot of smart people working on storage outside AWS and long before AWS existed.


When I worked at Google in storage, we had our own figures of merit that showed that we were the best and Amazon's durability was trash in comparison to us.

As far as I can tell, every cloud provider's object store is too durable to actually measure ("14 9's"), and it's not a problem.


9's are overblown. When cloud providers report that, they're really saying "Assuming random hard drive failure at the rates we've historically measured, and how quickly we detect and fix those failures, what's the mean time to data loss?".

But that's burying the lede. By far the greatest risks to a file's durability are: 1. Bugs (which aren't captured by a durability model). This is mitigated by deploying slowly and having good isolation between regions. 2. An act of God that wipes out a facility.

The point of my comment was that it's not just about checksums. That's table stakes. The main driver of data loss for storage organizations with competent software is safety culture and physical infrastructure.

My experience was that S3's safety culture is outstanding. In terms of physical separation and how "solid" the AZs are, AWS is overbuilt compared to the other players.


That was not how we treated the 9's at Google. Those had been tested through natural experiments (disasters).

I was not at Google for the Clichy fire, but it wasn't the first datacenter fire Google experienced. I think your information about Google's data placement may be incorrect, or you may be mapping AWS concepts onto Google internal infrastructure in the wrong way.


Do you mean Google included "acts of God" when computing 9's? That's definitely not right.

11 9's of durability means mean time to data loss of 100 billion years. Nothing on earth is 11 9's durable in the face of natural (or man-made) disasters. The earth is only 4.5 billion years old.
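To make the arithmetic explicit, here is a back-of-the-envelope sketch (Python) that treats the 11-nines figure as an annual loss probability for a single object, which is the simplification being described:

    # 11 nines of durability: annual probability of losing a given object
    annual_loss_probability = 1 - 0.99999999999      # ~1e-11

    # Naive mean time to data loss for that one object, in years
    mttdl_years = 1 / annual_loss_probability
    print(f"{mttdl_years:.0e} years")                # ~1e+11, i.e. 100 billion years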


Normally, companies store more than 1 byte of data, and the 9's (not just for data loss, for everything) are ensemble averages.

By the way, I don't doubt that AWS has plenty of 9's by that metric - perhaps more than GCP.


I would not lose sleep over storing data on GCS, but have heard from several Google Cloud folks that their concept of zones is a mirage at best.


Yeah, that's definitely true. Google sort of mapped an AWS concept onto its own cluster splits. However, there are enough regional-scale outages at all the major clouds that I don't personally place much stock in the idea of zones to begin with. The only way to get close to true 24/7 five-9's uptime with clouds is to be multi-region (and preferably multi-cloud).


I have experienced many outages that were contained to a specific availability zone in AWS, from power failures to flooding to cable cuts. You are correct that 5 9’s still requires multi-region though.


I think Google as a whole also has pretty good diversity. But Cloud customers demanded regions in big population centers and smaller countries that Google had traditionally avoided for cost reasons. This led to less redundant sites that were often owned and/or operated by third parties. So in the US and Europe you can probably trust GCP zones quite literally. But in other regions (I have heard lots of rumours about APAC) they may not be as diverse as they appear.


> big population centers and smaller countries

Can we stop this dance on HN? Can you just name them, please?


I think most Googlers actually don't know the specifics (I certainly don't), and if they did, they probably couldn't tell you. It's sort of common knowledge that some of them are like this, but not exactly which ones.


See: that fire in France that took down a whole region

But on the other hand, GCP supports multi-region so that's not nearly as big of a deal as it would be if AWS zones were not sufficiently isolated


If I were to upload a 50kb object to S3 (standard tier), about how many unique physical copies would exist?


At least 3.


At least 3, in at least 3 separate datacenters.

According to https://nuclearsecrecy.com/nukemap/ - it'd take at least a 1 megaton warhead to take out two of the ap-southeast-2 datacenters, and over 10MT to take out 3.

I suspect you'd need a lot less than that though, the 1MT warhead would probably take out enough outside-the-datacenter infrastructure to take the entire AZ offline. I don't care too much though, if someone's dropped a warhead that close to home I have other things to worry about than whether all the cat pictures and audit logs survive.


> if someone's dropped a warhead that close to home I have other things to worry about than whether all the cat pictures and audit logs survive.

Speak for yourself. Many of us love our audit logs and show them to strangers whenever we can.


I’m picturing you having a slideshow of audit logs that you make guests to your home sit down and watch with you, like the vacation pictures slideshow of old.


Yes but my audit logs are special and everyone just loves them, although they playfully act bored


scrolling through them like it's the matrix can't be all that boring!


Awwwww! Check out this cute little IAM audit log! Look at its funny little fizzy privilege escalation! I just want to scratch its belly until it p0wns the whole prod deployment.


Oh, like you don’t?

We’re among friends here.


A 1MT warhead will take out the infrastructure and make the data not Available. However, the data still exists in the 3rd datacenter, so Durability isn't compromised - but yes, we don't need the cat pictures when one lands that close to home :)


I'm guessing that even though the 3rd datacenter is about 35km away from the other 2, and so the building isn't in the expected destruction zone of a 1MT warhead, the damage to the city's electricity/water/network infrastructure would take the 3rd datacenter offline as well - so while your cat pictures are probably still in existence on the no-longer-spinning rust there, they'd be inaccessible for quite some time.


I think less. Why would AWS store 3 full copies when they could use Reed-Solomon codes (or similar) to bring that number down to 2x or 1.5x and save a lot of storage space?


Reed-Solomon would allow you to lose any one of the three and recover. Losing two would be catastrophic.

AWS's guarantee is that you can lose two of the three copies and still have all the data. You can't do that without three complete copies.
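A minimal sketch of the storage-overhead arithmetic being debated here, assuming a generic k-data/m-parity erasure-coding scheme (this is not anything S3 documents about its actual layout):

    def storage_overhead(data_shards, parity_shards):
        # Bytes stored per byte of user data; the object survives the
        # loss of up to `parity_shards` shards.
        return (data_shards + parity_shards) / data_shards

    # 3x replication: 1 data "shard" plus 2 full copies, tolerates 2 losses
    print(storage_overhead(1, 2))    # 3.0

    # Reed-Solomon-style coding, e.g. 10 data + 4 parity shards:
    # tolerates the loss of any 4 shards at only 1.4x overhead
    print(storage_overhead(10, 4))   # 1.4

Which is why, as the sibling comment suggests, large object stores tend to erasure-code rather than keep literal full copies for long.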


Probably for just a short period of time before it’s erasure coded


9's are useful when they're backed by an actual SLA - like GCP Cloud Storage and AWS S3 availability SLAs. Neither one commits to any durability SLAs whatsoever so I wouldn't put any stock into the 'eleventy nine nines' durability claims.


0. user error (deletion or overwriting a file they regret later, possibly much later)

-1. government, the historical cause of most data loss

1½. google canceling the product and deleting all the data, as they did with google+


> As far as I can tell, every cloud provider's object store is too durable to actually measure ("14 9's"), and it's not a problem.

How can you tell that if it's not measurable?

As far as I can tell the '11/14 9s' durability numbers are more or less completely made up. That's why AWS doesn't offer any actual durability SLA for S3, only a 99.9% availability SLA[0].

[0] https://aws.amazon.com/s3/sla/


buddy you are mixing availability and durability


Sorry I am not buying the personal anecdote as the public numbers from both orgs tell a different story. When reliability and long term support come up in conversation, Google is not a name to reach for.


Note that I said "durability," not any other reliability metric. GCP is pretty well-known for its outages and abysmal support. It's a reputation they want to change, but they did earn it.

However, Google is very good at not losing data.


Yup still not buying it. People who want their production stable, secure, and durable do not choose Google.


As a point of order: Not all Cinnabon locations make their own dough. The one I worked at the summer of 2001 made their own dough and frosting that summer, but switched to premade rolls partially that holiday season, and fully to premade rolls and frosting by the 2002 holiday season.

Also: you had to be eighteen or older to operate the mixer. It was something like a 60 quart machine, and all the recipes were pre-programmed for time and power with pauses to change from the paddle to the whip if needed.


The beauty of proprietary systems is that the only information we can ever expect to get about how they’re built is the biased information we get from the builders of those systems.


This was a few years ago, but blob storage on GCP had a global outage due to an outage in a single zone. That, among numerous other issues with GCP, lost my confidence entirely. Maybe it’s better now.


"Checksum-all-the-things is a basic feature of a lot of file systems"

"A lot"? Does anything but ZFS and maybe btrfs do this? Ext4 anf XFS — two very common filesystems — still don't have data checksums.


Bcachefs does, and LVM also has a way to do it.

Unfortunately I’m not aware of any filesystem that does it while maintaining the full bandwidth of a modern NVMe. Not even with the extra reads factored in; on ZFS I get 800 MB/s max.


ZFS should absolutely be able to go faster. Even with lz4 compression I get writes above 5 GB/s on an older 32-core EPYC CPU, and that is with mostly random, already-compressed data. That write speed is limited by RAIDZ2 on top of not-the-fastest drives (6 PCIe 3.0 Intel U.2 drives from 2 years ago).


Maybe I've configured something wrong, then. I'll do some tests.


it's well known and not debatable that Cinnabon is fire


"AWS' availability zone isolation is better than the other cloud providers."

Not better than all of them.

A geo-redundant rsync.net account exists in two different states (or countries) - for instance, primary in Fremont[1] and secondary in Denver.

"S3 even operates at a scale where we could detect "bitrot""

That is not a function of scale. My personal server running ZFS detects bitrot just fine - and the scale involved is tiny.

[1] he.net headquarters


Backing up across two different regions is possible for any provider with two "regions" but requires either doubling your storage footprint or accepting a latency hit because you have to make a roundtrip from Fremont to Denver.

The neat thing about AWS' AZ architecture is that it's a sweet spot in the middle. They're far enough apart for good isolation, which provides durability and availability, but close enough that the network round trip time is negligible compared to the disk seek.

Re: bit rot, I mean the frequency of events. If you've got a few disks, you may see one flip every couple of years. They happen frequently enough in S3 that you can form expectations about the arrival rate and alarm when it deviates from those expectations.


> The neat thing about AWS' AZ architecture is that it's a sweet spot in the middle

What may be less of a sweet spot is AWS' pricing.


Sending the data to /dev/null is the cheapest option if that’s all you care about.


Seems the snark detector just went off :)

Back on topic, I'd hope all of us would expect value for money for any and all services we recommend or purchase. Search for "site:news.ycombinator.com Away From AWS" to find dozens of discussions on how to save money by leaving AWS.

EDIT: just one article of the many I've read recently:

"What I’ve always found surprising about egress is just how expensive it is. On AWS, downloading a file from S3 to your computer once costs 4 times more than storing it for an entire month"

https://robaboukhalil.medium.com/youre-paying-too-much-for-e...


And that is egress which works as expected, unlike the AWS S3 denial of wallet attack...: https://news.ycombinator.com/item?id=39625029


> They're far enough apart for good isolation, which provides durability and availability

It can't possibly be enough for critical data though, right? I'm guessing a fire in 1 is unlikely to spread to another, but could it affect the availability of another? What about a deliberate attack on the DCs or the utilities supplying the DCs?


> but could it affect the availability of another

Availability is a different beast than durability. I think people are paranoid here about durability instead of availability.

S3 advertises four nines of availability and eleven nines of durability.


Yes, if a terrorist blows up all of the several Amazon DCs holding your data, your data will be lost. This is true no matter how many DCs are holding your data, who owns them, or where they are. You can improve your chances, of course.

There have been region-wide availability outages before. They're pretty rare and make worldwide news media due to how much of the internet they take out. I don't think there's been S3 data loss since they got serious about preventing S3 data loss.


> the network round trip time is negligible compared to the disk seek

Only for spinning rust, right?


Yes, which is what all the hyperscalers use for object storage. HDD seek time is ~10ms. Inter-AZ network latency is a few hundred microseconds.



Yes, but S3 has single-region redundancy that is better than GCP's. Your data in two AZs in one region is in two physically separate buildings, so multi-region matters less for durability.


Agree.

> S3 even operates at a scale where we could detect "bitrot" - random bit flips caused by gamma rays hitting a hard drive platter (roughly one per second across trillions of objects iirc).

I would expect any cloud provider to be able to detect bitrot these days.


I think the point the OP was trying to make is that they regularly detected bitrot due to their scale, not that they were merely capable of doing so.


Ah, thank you. This makes more sense. And I think I remember reading about it once. Apologies for the misinterpretation!


Everyone with significant scale and decent software regularly detects bitrot.


How does the latest ZFS bug impact your bitrot statement?

I mean, technically it’s not bitrot if zeros were accidentally written out instead of data.


Probably none because they didn't update to the exact version that had the bug


Checksumming the data is not born out of paranoia; it's simply a result of having to detect which blocks are unusable in order to run the Reed-Solomon algorithm.

I'd also assume that a sufficient number of these corruption events are used as a signal to "heal" the system by migrating the individual data blocks onto different machines.

Overall, I'd say the things that you mentioned are pretty typical of a storage system, and are not at all specific to S3 :)


The S3 checksum feature applies to the objects, so that’s entirely orthogonal to erasure codes. Unless you know something I don’t and SHA256 has commutative properties. You’d still need to compute the object hash independent of any blocks.

Source: https://docs.aws.amazon.com/AmazonS3/latest/userguide/checki...


It's not entirely orthogonal; RAID5 plus stripe-level CRC (or better) can reliably correct bitrot at any single position in a stripe whereas RAID5 alone can only report an error. My guess is that S3 and other large object stores have the equivalent of stripe-level checksums for this purpose.


I’m positive something like this is the case. Yet it’s entirely orthogonal to the object hash in the user facing feature, which would need to be computed separately.


For append-only or write-once objects or for BLAKE-3 and other fully parallelizable hashes it's possible to store the intermediate hash function state with each chunk or stripe so that the final bytes of the data, once the hash is finished, yield the user-facing checksum as well.
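A toy illustration of that idea for a sequential hash (Python's hashlib objects can be snapshotted with .copy(); the chunk layout here is hypothetical):

    import hashlib

    chunks = [b"chunk-0", b"chunk-1", b"chunk-2"]   # hypothetical object split into chunks

    running = hashlib.sha256()
    saved_states = []          # intermediate hash state stored alongside each chunk
    for chunk in chunks:
        running.update(chunk)
        saved_states.append(running.copy())

    # The state saved with the last chunk already yields the user-facing checksum,
    # with no need to re-read the earlier chunks.
    assert saved_states[-1].hexdigest() == hashlib.sha256(b"".join(chunks)).hexdigest()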


> customers would beat us up over pricing compared to GCP blob storage, but the comparison was unfair because Google would store your data in the same building

I don’t think this is true. Per the Google Cloud Storage docs, data is replicated across multiple zones, and each zone maps to a different cluster. https://cloud.google.com/compute/docs/regions-zones/zone-vir...


Zones are about correlated power and networking failures. Regions are about disasters. If you want multiple regions, Google can of course do that too:

https://cloud.google.com/storage/docs/locations#consideratio...


Google puts multiple clusters in a single building.


Seems you’re right. They say each zone is a separate failure domain but you kind of have to trust their word on that.


Flashback to that Clichy datacenter fire near Paris...


> Believe the hype.

I'd rather believe the test results.

Is there a neutral third-party that has validated S3's durability/integrity/consistency? Something as rigorous as Jepsen?

It'd be really neat if someone compared all the S3 compatible cloud storage systems in a really rigorous way. I'm sure we'd discover that there are huge scary problems. Or maybe someone already has?


But they asked if the claims were audited by an unbiased third party. Are there such audits?

Alternatively, AWS does publicly provide legally binding availability guarantees, but I have never seen any prominently displayed legally binding durability guarantees. Are these published somewhere less prominently?


> Alternatively, AWS does publicly provide legally binding availability guarantees, but I have never seen any prominently displayed legally binding durability guarantees. Are these published somewhere less prominently?

It's listed prominently in the public docs: https://aws.amazon.com/s3/storage-classes/


I read that page and it does not provide any contractual durability guarantees as far as I can see. It provides "designed for availability" and then contractual availability SLA guarantees. It provides "designed for durability", but presents no contractual durability guarantee as far as I can see.

Given that their lawyers clearly indicate that "designed for availability" is not what they are contractually obligated to provide, only the letter of the SLA does that; "designed for durability" is similarly a marketing statement that does not incur any contractual obligations. Is there some specific statement in that document that I am missing which indicates that data durability is not fully at their convenience?


SLAs are more of a financial construct than anything else. Once the payback cost of missing an SLA is built into the contract then it just becomes a conversation about money. I've been at plenty of shops that obviously tried to hit the SLA but if it was missed it just became a financial issue which helped smooth over what otherwise might have been a trust buster.

I would never ever think of an SLA as anything more than a financial commitment - if you think more of it you'll eventually be in a world of hurt.


Obviously, as I made no mention of specific performance in the event of breach of contract the remedy for failure to meet contractual obligations would be damages.

The question is: What level of durability does AWS contractually guarantee where a failure to provide that level results in a breach of contract that may incur damages and where, specifically, in the documentation do they specify that?


My first job was at a startup in 2012 where I was expected to build things at a scale way over what I really had the experience to do. Anyways the best choice I ever made was using RDS and S3 (and django).


Not a public cloud, but storage at Facebook is similar in terms of physical infrastructure, safety culture, and scale.


I also worked at AWS, but not on the S3 team. However, I was a Tech Evangelist and met with literally thousands of customers over my 6-year tenure. S3 was one of the hottest topics, and I got a sense of how good and robust it was directly from these customers.

What you say resonates really well with me, and what I've heard during these years.


> and bigger events like natural disasters

Outdated anecdata: I've worked for a company that lost some parts of buckets after the lightning strike incident in 2011, which bumped up the paranoia quite a bit. AFAIK the same thing couldn't have happened for more than a decade now.


Google discovered random bit flips caused by gamma rays.


Correct me if I'm wrong but bitrot only affects spinning rust since NAND uses ECC?

If you see this I wonder if S3 is planning on adding hardlinks?


Pretty much any modern storage medium depends on a healthy amount of error correcting code.


NAND is constantly moving your data around to prevent it from bit-rotting. If you leave data too long without moving it, you may not be able to read it back from the NAND.


> And listing files is slow. While the joy of Amazon S3 is that you can read and write at extremely, extremely, high bandwidths, listing out what is there is much much slower. Slower than a slow local filesystem

This misses something critical. Yes, s3 has fast reading and writing, but that’s not really what makes it useful.

What makes it useful is listing. In an unversioned bucket (or one with no delete markers), listing any given prefix is essentially constant time: I can take any given string, in a bucket with 100 billion objects, and say “give me the next 1000 keys alphabetically that come after this random string”.

What’s more, using “/“ as a delimiter is just the default - you can use any character you want and get a set of common prefixes. There are no “directories”, ”directories” are created out of thin air on demand.

This is super powerful, and it’s the thing that lets you partition your data in various ways, using whatever identifiers you need, without worrying about performance.

If listing were just "slow", couldn't list on key prefixes, and got slower in proportion to the number of keys (i.e. a traditional Unix filesystem), then it wouldn't be useful at all.
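A sketch of what that kind of call looks like with boto3 (the bucket name and key are made up):

    import boto3

    s3 = boto3.client("s3")

    # "Give me the next 1000 keys alphabetically after this arbitrary string" --
    # no directory has to exist for this to work.
    resp = s3.list_objects_v2(
        Bucket="example-bucket",
        StartAfter="logs/2024-03-10T12:34:56Z",
        MaxKeys=1000,
    )
    for obj in resp.get("Contents", []):
        print(obj["Key"])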


I have to say that I'm not hugely convinced. I don't really think that being able to pull out the keys before or after a prefix is particularly impressive. That is the basis for database indices going back to the 1970s after all.

Perhaps the use-cases you're talking about are very different from mine. That's possible of course.

But for me, often the slow speed of listing the bucket gets in the way. Your bucket doesn't have to get very big before listing the keys takes longer than reading them. I seem to remember that listing operations ran at sub-1 Mbps, but admittedly I don't have a big bucket handy right now to test that.


It depends on a few factors. The list objects call hides deleted and noncurrent versions, but it has to skip over them. Grouping prefixes also takes time, if they contain a lot of noncurrent or deleted keys.

A pathological case would be a prefix with 100 million deleted keys, and 1 actual key at the end. Listing the parent prefix takes a long time in this case - I’ve seen it take several minutes.

If your bucket is pretty "normal" and doesn't have this, or isn't versioned, then you can do 4-5 thousand list requests a second, at any given key/prefix, in constant time. Or you can explicitly list object versions (and not skip deleted keys), also in constant time.

It all depends on your data: if you need to list all objects then yeah it’s gonna be slow because you need to paginate through all the objects. But the point is that you don’t have to do that if you don’t want to, unlike a traditional filesystem with a directory hierarchy.

And this enables parallelisation: why list everything sequentially, when you can group the prefixes by some character (e.g. "-") and then process each of those prefixes in parallel?

The world is your oyster.


We and our customers use S3 as a POSIX filesystem, and we generally find it faster than a local filesystem for many benchmarks. For listing directories we find it faster than Lustre (a real high-performance filesystem). Our approach is to first try listing directories with a single ListObjectsV2 (which on AWS S3 returns keys in lexicographic order) and, if it hasn't made much progress, to start listing with parallel ListObjectsV2 calls. Once you start parallelising the ListObjectsV2 (rather than sequentially "continuing") you get massive speedups.


> find it faster than a local filesystem for many benchmarks.

What did you measure? How did you compare? This claim seems very contrary to my experience and understanding of how things work...

Let me refine the question: did you measure metadata or data operations? What kind of storage medium is used by the filesystem you use? How much memory (and subsequently the filesystem cache) does your system have?

----

The thing is: you should expect something like 5 ms latency on network calls over the Internet in an ideal case. Within the datacenter, maybe you can achieve sub-ms latency, but that's hard. AWS within a region but across different zones tends to be around 1 ms latency.

Meanwhile NVMe latency, even on consumer products, is 10-20 microseconds, i.e. roughly 100 times faster than anything going over the network can offer.


For AWS, we're comparing against filesystems in the datacenter - so EBS, EFS and FSx Lustre. Compared to these, you can see in the graphs where S3 is much faster for workloads with big files and small files: https://cuno.io/technology/

and in even more detail of different types of EBS/EFS/FSx Lustre here: https://cuno.io/blog/making-the-right-choice-comparing-the-c...


The tests are very weird...

Normally, from someone working in storage, you'd expect tests to be in IOPS, and the go-to tool for reproducible tests is FIO. I mean, of course "reproducibility" is a very broad subject, but people are so used to this tool that they develop a certain intuition and interpretation for it and its results.

On the other hand, seeing throughput figures is kinda... it tells you very little about how the system performs. Just to give you some reasons: a system can be configured to do compression or deduplication on the client or the server, and this will significantly impact your throughput depending on what you actually measure: the amount of useful information presented to the user, or the amount of information transferred. Also, throughput at the expense of higher latency may or may not be a good thing. Really, if you ask anyone who has ever worked on a storage product how they could crank up throughput numbers, they'd tell you: "write bigger blocks asynchronously". That's the basic recipe, if that's what you want. Whether this makes a good all-around system or not... I'd say probably not.

Of course, there are many other concerns. Data consistency is a big one, and it's a typical tradeoff when choosing between an object store and a filesystem: the filesystem offers more consistency guarantees, whereas the object store can do certain things faster by relaxing them.

BTW, I don't think most readers would understand Lustre and similar to be a "local filesystem", since it operates over the network, and network performance will have a significant impact - which also puts it in the same ballpark as other networked systems.

I'd also say that Ceph is kinda missing from this benchmark... Again, if we are talking about filesystem on top of object store, it's the prime example...


IOPS is a really lazy benchmark that we believe can greatly diverge from most real life workloads, except for truly random I/O in applications such as databases. For example, in Machine Learning, training usually consists of taking large datasets (sometimes many PBs in scale), randomly shuffling them each epoch, and feeding them into the engine as fast as possible. Because of this, we see storage vendors for ML workloads concentrate on IOPS numbers. The GPUs, however, only really care about throughput. Indeed, we find a great many applications only really care about throughput, and IOPS is only relevant if it helps to accomplish that throughput.

For ML, we realised that the shuffling isn't actually random - there's no real reason for it to be random versus pseudo-random. And if it's pseudo-random then it is predictable, and if it's predictable then we can exploit that to great effect - yielding a 60x boost in throughput on S3, beating out a bunch of other solutions. S3 is not going to do great for truly random I/O; however, we find that most scientific, media and finance workloads are actually deterministic or semi-deterministic, and this is where cunoFS, by peering inside each process, can better predict intra-file and inter-file access patterns, so that we can hide the latencies present in S3.

At the end of the day, the right benchmark is the one that reflects real world usage of applications, but that's a lot of effort to document one by one.

I agree that things like dedupe and compression can affect things, so in our large file benchmarks each file is actually random. The small file benchmarks aren't affected by "write bigger blocks" because there's nothing bigger than the file itself. Yes, data consistency can be an issue, and we've had to do all sorts of things to ensure POSIX consistency guarantees beyond what S3 (or compatible) can provide. These come with restrictions (such as on concurrent writes to the same file on multiple nodes), but so does NFS. In practice, we introduced a cunoFS Fusion mode that relies on a traditional high-IOPS filesystem for such workloads and consistency (automatically migrating data to that tier), and high throughput object for other workloads that don't need it.


> And if it's pseudo-random then it is predictable, and if it's predictable then we can exploit that to great effect

This is an interesting hack. However, an IOP is an IOP: no matter how well you predicted and prefetched it to hide the latency, it's still going to be translated into a GetObject.

I think what you really exploited here is that even though S3 is built on HDDs (and have very low IOPS per TiB) their scale is so large that even if you milk 1M+ IOPS out of it AWS still doesn't care and is happy to serve you. But if my back-of-envelope calculation is correct this isn't going to work well if everyone starts to do it.

How do you get around S3's 5.5k GET per second per prefix limit? If I only have ~200 20GiB files can you still get decent IOPS out of it?

and...

> IOPS is a really lazy benchmark that we believe can greatly diverge from most real life workloads

No, it's not. I have a workload training a DL model on time series data which demands 600k 8KiB IOPS per compute instance. None of the things I tested worked well. I had to build a custom solution with bare-metal NVMes.


Sorry for the late response - I didn't see your comment until now.

Our aim is to unleash all the potential that S3/Object has to offer for file system workloads. Yes, the scale of AWS S3 helps, as does erasure coding (which enhances flexibility for better load balancing of reads).

Is it suitable for every possible workload? No, which is why we have a mode called cunoFS Fusion where we let people combine a regular high-performance filesystem for IOPS, and Object for throughput, with data automatically migrated between the two according to workload behaviour. What we find is that most data/workloads need high throughput rather than high IOPS, and this tends to be the bulk of data. So rather than paying for PBs of ultra-high IOPS storage, they only need to pay for TBs of it instead. Your particular workload might well need high IOPS, but a great many workloads do not. We do have organisations doing large scale workloads on time-series (market) data using cunoFS with S3 for performance reasons.


EFS is ridiculously slow though. Almost to the point where I fail to see how it’s actually useful for any of the traditional use cases for NFS.


> EFS is ridiculously slow though. Almost to the point where I fail to see how it’s actually useful for any of the traditional use cases for NFS.

Would you care to elaborate on your experience or use case a bit more? We've made a lot of improvements over the last few years (and are actively working on more), and we have many happy customers. I'd be happy to give a perspective of how well your use case would work with EFS.

Source: PMT turned engineer on EFS, with the team for over 6 years


Unfortunately I can’t say too much publicly on HN. But one of the big shortcomings is dealing with hundreds of files. It doesn’t even matter if those are big or small files (I’ve had experience with both).

Services like DataSync show that the underlying infra can be performant. But it feels almost impossible to replicate that on EFS via standard POSIX APIs. And unfortunately one of our use cases depend upon that.

It feels, to me at least, like EFS isn't where AWS's priorities lie. At least if you compare EFS to FSx Lustre and recent developments in S3 - both of which are the direction our AWS SAs have pushed us in.


if you turn all the EFS performance knobs up (at a high cost), it's quite fast.


Faster, sure. But I wouldn't go so far as to say it is fast.


Have you tried it recently? Because we've made it a lot faster over the years.


More recently and for more use cases and varied workflows than most people. But that’s as much as I can say without getting people to sign an NDA.

Our AWS spend is high enough to warrant a very close working relationship with AWS so this is something we have worked with you guys on already.


S3 is really high latency though. I store parquet files on S3 and querying them through DuckDB is much slower than a local filesystem because of the random access patterns. I can see S3 being decent for bulk access but definitely not for random access.

This is why there’s a new S3 Express offering that is low latency (but costs more).


It can't be a POSIX filesystem if it doesn't meet POSIX filesystem guarantees. I worked on an S3-compatible object store at a large storage company and we also had distributed filesystem products. Those are completely different animals due to the different semantics and requirements. We've also built compliant filesystems over object stores and the other way around. Certain operations, like write-append, are tricky to simulate over object stores (S3 didn't use to support append; I haven't really stayed up to date, does it now?). At least when I worked on this it wasn't possible to simulate POSIX semantics over S3 at all without adding additional object store primitives.


> Once you start parallelising the ListObjectsV2 (rather than sequentially "continuing")

How are you "parallelizing" the ListObjectsV2? The continuation token can only be fed in once the previous ListObjectsV2 response has completed, unless you know the names or structure of the keys ahead of time, in which case listing objects isn't necessary.


For example, you can do separate parallel ListObjectsV2 calls for keys starting with a-f, g-k, etc., covering the whole key space. You can parallelize recursively based on what is found in the first 1000 entries so that it matches the statistics of the keys. Yes, there may be pathological cases, but in practice we find this works very well.
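A rough sketch of that keyspace-splitting approach with boto3 (bucket and split points are hypothetical; real code would recurse and rebalance the splits):

    import boto3
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.client("s3")
    BUCKET = "example-bucket"

    def list_range(start_after, stop_before=None):
        # Sequentially paginate keys sorting after `start_after`,
        # stopping once keys pass `stop_before` (if given).
        keys, token = [], None
        while True:
            kwargs = {"Bucket": BUCKET}
            if token:
                kwargs["ContinuationToken"] = token
            elif start_after:
                kwargs["StartAfter"] = start_after
            resp = s3.list_objects_v2(**kwargs)
            for obj in resp.get("Contents", []):
                if stop_before is not None and obj["Key"] > stop_before:
                    return keys
                keys.append(obj["Key"])
            if not resp.get("IsTruncated"):
                return keys
            token = resp["NextContinuationToken"]

    # Carve the keyspace into independent ranges and list them in parallel.
    boundaries = ["", "g", "n", "t", None]
    ranges = list(zip(boundaries, boundaries[1:]))
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        chunks = list(pool.map(lambda r: list_range(*r), ranges))
    all_keys = sorted(key for chunk in chunks for key in chunk)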


You're right that it won't work for all use cases, but starting two threads with prefixes A and M, for example, is one way you might achieve this.


If you think s3 is fast, you should try FTP. It’s at least a hundred times faster. And combined with rsync, dozens of times more reliable.


Neither of those are true though? Not sure if this is sarcastic or not, if so make it more clear in the future


The key difference between lexicographically keyed flat hierarchies, and directory-nested filesystem hierarchies, becomes clear based on this example:

    dir1/a/000000
    dir1/a/...
    dir1/a/999999
    dir1/b
On a proper hierarchical file system with directories as tree interior nodes, `ls dir1/` needs to traverse and return only 2 entries ("a" and "b").

A flat string-indexed KV store that only supports lexicographic order, without special handling of delimiters, needs to traverse 1 million dirents ("a/000000" through "a/999999") before arriving at "b".

Thus, simple flat hierarchies are much slower at listing the contents of a single dir: O(all recursive children), vs. O(immediate children) on a "proper" filesystem.

Lexicographic strings cannot model multi-level tree structures with the same complexity; this may give it the reputation of "listing files is slow".

UNLESS you tell the listing algorithm what the delimiter character is (e.g. `/`). Then a lexicographical prefix tree can efficiently skip over all subtrees at the next `/`.

Amazon S3 supports that, with the docs explicitly mentioning "skipping over and summarizing the (possibly millions of) keys nested at deeper levels" in the `CommonPrefixes` field: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-...

I have not tested whether Amazon's implementation actually saves the traversal (or whether it traverses and just returns fewer results), but I'd hope so.
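For reference, the delimiter-aware call looks like this in boto3 (bucket and prefix are hypothetical); nested "directories" come back summarised in CommonPrefixes rather than being enumerated:

    import boto3

    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(
        Bucket="example-bucket",
        Prefix="dir1/",
        Delimiter="/",
    )
    # Keys directly under dir1/ (e.g. "dir1/b")
    for obj in resp.get("Contents", []):
        print(obj["Key"])
    # Subtrees summarised without enumerating their (possibly millions of) children
    for cp in resp.get("CommonPrefixes", []):
        print(cp["Prefix"])      # e.g. "dir1/a/"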


For completeness: the original post says:

    S3 has no rename or move operation.
    Renaming is CopyObject and then DeleteObject.
    CopyObject takes linear time to the size of the file(s).
    This comes up fairly often when someone has written a lot of files
    to the wrong place - moving the files back is very slow.
This is right:

In a normal file system, renaming a directory is fast O(1), in S3 it's slow O(all recursive children).

And Amazon S3 has not added a delimiter-based function to reduce its complexity, even though that would be easily possible in a lexicographic prefix tree (re-rooting the subtree).

So here the original post has indeed found a case where S3 is much slower than a normal file system.
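For concreteness, "moving" a prefix ends up looking roughly like this (a boto3 sketch; bucket and prefixes are made up), which is why the cost is linear in the number of objects and bytes under the prefix:

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-bucket"

    def rename_prefix(old_prefix, new_prefix):
        # "Move" every object under old_prefix by copying it and then deleting it.
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=old_prefix):
            for obj in page.get("Contents", []):
                old_key = obj["Key"]
                new_key = new_prefix + old_key[len(old_prefix):]
                s3.copy_object(
                    Bucket=BUCKET,
                    Key=new_key,
                    CopySource={"Bucket": BUCKET, "Key": old_key},
                )
                # (objects over 5 GB would need a multipart copy instead)
                s3.delete_object(Bucket=BUCKET, Key=old_key)

    rename_prefix("wrong/place/", "right/place/")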


For 30 years now (starting with XFS in 1993, which was inspired by HPFS), all the good UNIX file systems have implemented directories as some kind of B-tree.

Therefore they do not get slower in proportion to the number of entries, and listing based on file prefixes is extremely fast.


> listing based on file prefixes is extremely fast

This functionality does not exist to my knowledge.

ext4 and XFS return directory entries in pseudo-random order (due to hashing), not lexicographically.

For an example, see e.g. https://righteousit.wordpress.com/2022/01/13/xfs-part-6-btre...

If you know a way to return lexicographical order directly from the file system, without the need to sort, please link it.


Resolving random file system paths still gets slower proportional to their depth, which is not the case for S3, where the prefix is on the entire object key and not just the "basename" part of it, like in a filesystem.


Yes they do. What APIs does Linux offer that allow you to list a directory's contents alphabetically, starting at a specific filename, in constant time? You have to iterate the directory contents.

You can maybe use “d_off” with readdir in some way, but that’s specific to the filesystem. There’s no portable way to do this with POSIX.

Regardless of whether you can do it with a single directory, you can't do it for all files recursively under a given prefix. You can't just ignore directories, or say that "for this list request, '-' is my directory separator".

The use of b-trees in file systems is completely beside the point.


The POSIX API is indeed even older, so it is not helpful.

But as you say, there are filesystem-specific methods or operating-system specific methods to reach the true performance of the filesystem.

It is likely that for maximum performance one would have to write custom directory search functions using directly the Linux syscalls, instead of using the standard libc functions, but I would rather do that instead of paying for S3 or something like it.


Yes. You could also just use a SQLite table with two columns (path, contents), then just query that. Or do any number of other things.

The question isn’t if it’s possible, because of course it is, the question is if it’s portable and well supported with the POSIX interface. Because if it’s not, then…


> The question isn’t if it’s possible, because of course it is, the question is if it’s portable and well supported with the POSIX interface. Because if it’s not, then…

Where did this goalpost come from? S3 is not portable or POSIX compliant.


From the article we're commenting on, which is comparing the interface of S3 to the POSIX interface. Not any given filesystem + platform specific interface.


The article does not mention POSIX, or anything about listing files, at all.


It mistakenly mentions UNIX whilst referencing the POSIX filesystem API, and I literally quoted where it talks about listing in my original comment.


The article starts out by making a comparison between the POSIX filesystem API calls and S3's API. The context is very much a comparison between those two API surface areas.


There are no specific syscalls that you can use for this. The libc functions and the syscalls are extremely similar.


>What makes it useful is listing.

I think 99% of S3 usage just consists of retrieving objects with known keys. It seems odd to me to consider prefix listing as a key feature.


When you embed the relevant (not necessarily that of object creation) timestamp as a prefix, it sure becomes one. Whether that prefix is part of the "path" (object/path/prefix/with/<4-digit year>/) or directly part of the basename (object/path/prefix/to/app-specific/files/<4-digit year>-<2-digit month>-....), being able to limit the search space server-side becomes incredibly useful.

You can try it yourself: list objects in a bucket prefix with lots of files, and measure the time it takes to list all of them vs. the time it takes to list only a subset of them that share a common prefix.


> ...listing any given prefix is essentially constant time: I can take any given string, in a bucket with 100 billion objects, and say “give me the next 1000 keys alphabetically that come after this random string”.

I'm not sure we agree on the definition of "constant time" here. Just because you get 1000 keys in one network call doesn't imply anything about the complexity of the backend!


Constant time regardless of the number of objects in the bucket and regardless of the initial starting position of your list request.


The technical implementation is indeed impressive in that it operates in more-or-less constant time, but probably very few use cases actually fit that narrow window, so this technical strength is moot when it comes to actual usage.

Since each request depends on the position returned by the last request, 1000 arbitrary keys on your 3rd or 1000th attempt don't really help unless you found your needle in the haystack in that request (and in that case the rest of that 1000-key listing was wasted).


You’re assuming you’re paginating through all objects from start to finish.

A request to list objects under "foo/" is a request to list all objects starting with "foo/", which is constant time regardless of the number of keys that sort before it. The same applies to "foo/bar-", or any other list request for any given prefix. There are no directories on S3.


And if for some reason you need a complete listing along with object sizes and other attributes you can get one every 24 hours with S3 inventory report.

That has always been good enough for me.


When you have billions of objects this is really the only way to go - and build workflows around inventory (including athena, spark, etc)


Is listing really such a key feature that people use it as a database to find objects?

Have not used S3, but that is not how I imagined using it.


Sure. It's kind of an index - limited to prefix-only searching, but useful.

Say you store uploads associated with a company and a user. You'd maybe naively store them as `[company-uuid]/[user-id].[timestamp]`.

If you need to list a given user's (123) uploads after a given date, you'd list keys after `[company-uuid]/123.[date]`. If you need to list all of that user's uploads, you'd list `[company-uuid]/123.`. If you need to get the set of all users who have photos, you'd list `[company-uuid]/` with a Delimiter set to `.`

The point is that it's flexible, and with a bit of thought it allows you to "remove all a user's uploads between two dates", "remove all a company's uploads" or "remove all a user's uploads" with a single call. Or whatever specific stuff is important to your use-case, that might otherwise need a separate DB.

It's not perfect - you can't reverse the listing (i.e. you can't get the latest photo for a given user by sorting descending, for example), and it needs some thought about your key structure.
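A sketch of those queries against that key layout (boto3; the bucket, company id and date format are hypothetical):

    import boto3

    s3 = boto3.client("s3")
    company = "company-uuid"          # hypothetical placeholder
    user_prefix = f"{company}/123."   # keys look like "<company-uuid>/<user-id>.<timestamp>"

    # User 123's uploads on or after a given date
    resp = s3.list_objects_v2(
        Bucket="example-bucket",
        Prefix=user_prefix,
        StartAfter=f"{user_prefix}2024-01-01",
    )
    uploads = [o["Key"] for o in resp.get("Contents", [])]

    # The set of users with uploads, without enumerating every object
    resp = s3.list_objects_v2(
        Bucket="example-bucket",
        Prefix=f"{company}/",
        Delimiter=".",
    )
    users = [cp["Prefix"] for cp in resp.get("CommonPrefixes", [])]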


But surely you need to track that elsewhere anyway?

That some niche edge-case runs efficiently doesn't sound like a defining feature of S3. On the contrary, many common operations map terribly to S3, so you kind of need the logic to be elsewhere.


My overall point can be summarised as this:

- Listing things is a very common operation to do.

- The POSIX api and the directory/file hierarchy it provides is a restrictive one.

- S3 does not suffer from this, you can recursively list and group keys into directories at “list time”.

- If you find yourself needing to list gigantic numbers of keys in one go, you can do better by only listing a subset. S3 isn’t a filesystem, you shouldn’t need to list 1k+ keys sequentially apart from during maintenance tasks.

- This is actually quite fast, compared to alternatives.

Whether or not you see a use case for this is sort of irrelevant: they exist. It's what allows you to easily put data into S3 and flexibly group/scan it by specific attributes.


Listing things is very common, so why would you outsource that to S3 when all your bookkeeping is elsewhere? It's not like you would ever rely on the POSIX API for that anyway, even for when your files actually are on a POSIX filesystem.

For sure, for maintenance tasks etc. it sounds quite useful. And good hygiene with prefixes sounds like a sane idea. But listing being a critical part of what "makes S3 useful"? That seems like a huge stretch that your points don't seem to address.


> It's not like you would ever rely on the POSIX API for that anyway, even for when your files actually are on a POSIX filesystem.

Because there is no POSIX API for this. Depending on your requirements and query patterns, you may not need a completely separate database that you have to keep in sync.


You may not need other bookkeeping. The prefix listing properties can be enough, removing the need to have two distinct systems kept in sync.


> But surely you need to track that elsewhere anyway?

Why? If the S3 structure and listing is sufficient, I don't need to store anything else anywhere else.

Many use cases may involve other requirements that S3 can't meet, such as being able to find the same object via different keys, or being able to search through the metadata fields. However, if the requirements match up with S3's structure, then additional services are unnecessary and keeping them in sync with S3 is more hassle than it's worth.


I agree, but something as simple (in functionality) as that ought to be an edge-case. Not a defining feature of S3.


It’s fundamental to how S3 works and its ability to scale, so it is a defining feature of S3.

If you think wider, a bucket itself is just a prefix.


From amazons perspective, sure!

But that's not what we are discussing.


It's a property of the system that I, as an architect, would seriously consider as part of my system's design. I've worked with many systems where iterating over items in order starting from a prefix is extremely cheap (SSTables).


No. The standard practice is to use a DynamoDB table as the index for your objects in S3.

This article misunderstood S3 and could as well have the title: "An Airplane is not a Car" :-)


So in reality S3 takes about 2 seconds to retrieve a single file, under ideal conditions. 1 second round trip for the request to DynamoDB to get the object key of the file and 1 second round trip to S3 to get the file contents (assuming no CPU cost on the search because you’re getting the key by ID from the DynamoDB in a flat single table store. And that the file has no network IO because it is a trivial number of bytes, so the HTTP header overwhelmed the content.)

I know what you’re thinking — 2 seconds, that’s faster than I can type the 300 character file key with its pseudo prefixes)!

Ah, but what if you wanted to get 2 files from S3?


… S3's response time is nowhere near 2 seconds. (Or even 1 second.) Like a sibling poster says, 50ms is a much more realistic ballpark for TTFB.


2 seconds is a nuts response time, but I guess it depends entirely on your file size. TTFB is usually 50ms.


I don't know that you can characterize that as a "standard practice".

Maybe it's widespread, but I've not encountered it.


"Building and Maintaining an Amazon S3 Metadata Index without Servers" - https://aws.amazon.com/pt/blogs/big-data/building-and-mainta...

Here is the architecture of Amazon Drive and the storage of metadata.

"AWS re:Invent 2014 | (ARC309) Building and Scaling Amazon Cloud Drive to Millions of Users" - https://youtu.be/R2pKtmhyNoA

And you can see the use here at correct time: https://youtu.be/R2pKtmhyNoA?t=546


That article is old. DynamoDB was used because of the old, weak consistency model of S3. Writes were atomic, but lists could return stale results, so you needed a consistent list of objects elsewhere.

But in 2020, S3 changed to strong consistency model. There is no need to use DynamoDB now.


The problem was not the eventual consistency model, it was the speed of the object listing.

"...Finding objects based on other attributes, however, requires doing a linear search using the LIST operation. Because each listing can return at most 1000 keys, it may require many requests before finding the object. Because of these additional requests, implementing attribute-based queries in S3 alone can be challenging..."


It was both actually, but more for the listing issue. Netflix built a lot of tooling around this.

But yeah: things like filtering on tags or created-at dates require another approach.


Hive stores metadata in a relational database. So does Snowflake.


What is it about S3 that enables this speed, and why can’t traditional Unix file systems do the same?


S3 doesn't have directories; it can be thought of as a flat, sorted list of keys.

UNIX (and all operating systems) differentiate between a file and a directory. To list the contents of a directory, you need to make an explicit call. That call might return files or directories.

So to list all files recursively, you need to list, sort, check whether each entry is a directory, and recurse. This isn't great.


Interesting - isn't this just a matter of indexing/caching the file names, though? Surely S3 must store the files somewhere and index them. There's a Unix command called `locate` that does the same thing by maintaining a local database of keys and lets you search with prefixes.[1]

Anyway, I guess this is beside the point of the original commenter above. I would disagree that listing files efficiently is the most useful part of S3. The main value prop is the fact that you can easily upload and download files from a distributed store. Most use cases involve uploading and downloading known files, not efficiently listing millions of files.

[1] https://jvns.ca/blog/2015/03/05/how-the-locate-command-works...


Code written against S3 is not portable either. It doesn't support Azure or GCP, much less some random proprietary cloud.


Actually we've found it's often much worse than that. Code written against AWS S3 using the AWS SDK often doesn't work on a great many "S3-compatible" vendors (including on-prem versions). Although there's documentation on S3, it's vague in many ways, and the AWS SDKs rely on actual AWS behaviour. We've had to deal with a lot of commercial and cloud vendors that subtly break things. This includes giant public cloud companies. In one case a giant vendor only failed at high loads, making it appear to "work" until it didn't, because its backoff response was not what the AWS SDK expected. It's been a headache that we've had to deal with for cunoFS, as well as making it work with GCP and Azure. At the big HPC conference Supercomputing 2023, when we mentioned supporting "S3-compatible" systems, we would often be told stories about applications not working with a supposedly "S3-compatible" system (from a mix of vendors).


Back in 2011 when I was working on making Ceph's RadosGW more S3-compatible, it was pretty common that AWS S3 behavior differed from their documentation too. I wrote a test suite to run against AWS and Ceph, just to figure out the differences. That lives on at https://github.com/ceph/s3-tests


What differences in behaviour from the AWS docs did you find, out of interest?


What I can dig up today is that back in 2011, they documented that bucket names cannot look like IPv4 addresses and the character set was a-z0-9.-, but they failed to prevent 192.168.5.123 or _foo.

I recall there were more edge cases around HTTP headers, but they don't seem to have been recorded as test cases -- it's been too long for me to remember the details. I may have simply run out of time / real world interop got good enough to prioritize something else.

2011 state, search for fails_on_aws: https://github.com/tv42/s3-tests/blob/master/s3tests/functio...

Current state, I can't speak to the exact semantics of the annotations today, they could simply be annotating non-AWS features: https://github.com/ceph/s3-tests/blob/master/s3tests/functio...


I've seen several S3-compatible APIs and there are open-source clients. If anything it's the de-facto standard.


GCP storage buckets implement the S3 API. You can treat them as if they were an S3 bucket. Something I do all the time.


Isn't that a limitation imposed by the POSIX APIs, though, as a direct consequence of the interface's representation of hierarchical filesystems as trees? As you've illustrated, that necessitates walking the tree. Many tools, I suppose, walk the tree via a single thread, further serializing the process. In an admittedly haphazard test, I ran `find(1)` on ext4, xfs, and zfs filesystems and saw only one thread.

I imagine there's at least one POSIX-compatible file system out there that supports another, more performant method of dumping its internal metadata via some system call or another. But then we would no longer be comparing the S3 and POSIX APIs.


You can set up CloudWatch events to trigger a Lambda function that stores metadata about the S3 object in a regular database. That way you can index it however you expect to list it.

Very effective for our use case.
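
For illustration, a minimal sketch of such a handler with the Go Lambda runtime and SDK v2. It assumes the bucket is configured to send event notifications to the function; the table name and attribute names are made up, and error handling is trimmed:

  package main

  import (
    "context"
    "strconv"

    "github.com/aws/aws-lambda-go/events"
    "github.com/aws/aws-lambda-go/lambda"
    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    ddbtypes "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
  )

  var db *dynamodb.Client

  func init() {
    cfg, err := config.LoadDefaultConfig(context.Background())
    if err != nil {
      panic(err)
    }
    db = dynamodb.NewFromConfig(cfg)
  }

  // handler writes one row per created object so listings can later be
  // served from the database instead of paging through ListObjectsV2.
  func handler(ctx context.Context, evt events.S3Event) error {
    for _, rec := range evt.Records {
      _, err := db.PutItem(ctx, &dynamodb.PutItemInput{
        TableName: aws.String("s3-index"), // hypothetical table with partition key "key"
        Item: map[string]ddbtypes.AttributeValue{
          "key":    &ddbtypes.AttributeValueMemberS{Value: rec.S3.Object.Key},
          "bucket": &ddbtypes.AttributeValueMemberS{Value: rec.S3.Bucket.Name},
          "size":   &ddbtypes.AttributeValueMemberN{Value: strconv.FormatInt(rec.S3.Object.Size, 10)},
        },
      })
      if err != nil {
        return err
      }
    }
    return nil
  }

  func main() { lambda.Start(handler) }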


> And listing files is slow. While the joy of Amazon S3 is that you can read and write at extremely, extremely, high bandwidths, listing out what is there is much much slower. Slower than a slow local filesystem.

I was taken aback by this recently. At my coworkers request, I was putting some work into a script we have to manage assets in S3. It has a cache for the file listing, and my coworker who wrote it sent me his pre-populated cache. My initial thought was “this can’t really be necessary” and started poking.

We have ~100,000 root level directories for our individual assets. Each of those has five or six directories with a handful of files. Probably less than a million files total, maybe 3 levels deep at its deepest.

Recursively listing these files takes literally fifteen minutes. I poked and prodded at suggestions from Stack Overflow and ChatGPT for potential ways to speed up the process and got nothing notable. That's absurdly slow. Why on earth is it so slow?

Why is this something Amazon has not fixed? From the outside really seems like they could slap some B-trees on the individual buckets and call it a day.

If it is a difficult problem, I’m sure it would be for fascinating reasons I’d love to hear about.


S3 is fundamentally a key value store. The fact that you can view objects in “directories” is nothing more than a prefix filter. It is not a file system and has no concept of directories.


Directories are what make a filesystem hierarchical, but they're not a necessary condition for being a filesystem. A filesystem at its core is just a way of organizing files. If you're storing and organizing files in S3 then it's a filesystem for you. Saying it's "fundamentally a key value store" as if that were something different is confusing, because a filesystem is just a key value store mapping paths to file contents.

Indeed there's every reason to believe that a modern file system would perform significantly faster if the hierarchy were implemented as a prefix filter rather than by actually maintaining hierarchical data structures (at least for most operations). You can guess this might be the case from the fact that file creation is extremely slow on modern file systems (on the order of hundreds or maybe thousands per second on a modern NVMe disk that can otherwise do millions of IOPS), and listing the contents of an extremely large directory is exceedingly slow.


In context of the comment I was addressing, it’s clear that filesystem means more than just a key value store. I’d argue that this is generally true in common vernacular.


This is a technical website discussing the nuances of filesystems. Common vernacular is how you choose to define it but even the Wikipedia definition says that directories and hierarchy are just one property of some filesystems. That they became the dominant model on local machines doesn’t take away from the more general definition that can describe distributed filesystems.


I'm kind of chuckling at this thread because you're working so hard to not understand.

I think the previous poster could/should have said, "It is not a hierarchical file system and has no concept of directories." where I added the word "hierarchical".

But it's also pretty obvious that was the point.


I disagree with that characterization because the contrast drawn by the OP was that S3 is "just a KV store", implying it doesn't meet the criteria for being considered a filesystem.

For example, you could implement POSIX directory semantics on top of S3. About the only POSIX filesystem API you couldn't implement is append / overwrite (well, you could, but it might be prohibitively expensive).


A real hierarchy makes global constraints easier to scale, e.g. globally unique names or hierarchical access controls. These policies only need to scale to a single node rather than to the whole namespace (via some sort of global index).


no - a filesystem implementation on an ordinary OS has more than what you mention, including interfaces to disk device drivers


If I wanted to use S3 as a filesystem in the manner people are describing I would probably start looking at storing filesystem metadata in a sidecar database so you can get directory listings, permissions bits, xattrs and only have to round-trip to S3 when you need the content.


Isn't this essentially what systems like Minio and SeaweedFS do with their S3 integrations/mirroring/caching? What you describe sounds a lot like SeaweedFS Filer when backed by S3


The way that you said "recursively" and spent a lot of time describing "directories" and "levels" worries me. The fastest way to list objects in S3 wouldn't involve recursion at all; you just list all objects under a prefix. If you're using the path delimiter to pretend that S3 keys are a folder structure (they're not) and go "folder by folder", it's going to be way slower. When calling ListObjectsV2, make sure you are NOT passing "delimiter". The "directories" and "levels" have no impact on performance when you're not using the delimiter functionality. Split the one list operation into multiple parallel lists on separate prefixes to attain any total time goal you'd like.
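
For reference, a flat listing along those lines with the Go SDK v2 paginator might look like this (bucket and prefix are placeholders; note there is no Delimiter):

  package main

  import (
    "context"
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
  )

  func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
      log.Fatal(err)
    }
    client := s3.NewFromConfig(cfg)

    // One flat, paginated listing of everything under the prefix.
    // No Delimiter means no "folder by folder" round trips.
    p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
      Bucket: aws.String("my-bucket"), // placeholder
      Prefix: aws.String("assets/"),   // placeholder
    })
    for p.HasMorePages() {
      page, err := p.NextPage(ctx)
      if err != nil {
        log.Fatal(err)
      }
      for _, obj := range page.Contents {
        fmt.Println(aws.ToString(obj.Key))
      }
    }
  }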


All these comments saying merely "S3 has no concept of directories" without an explanation (or at least a link to an explanation) are pretty unhelpful, IMO. I dismissed your comment, but then I came upon this later one explaining why: https://news.ycombinator.com/item?id=39660445

After reading that, I now understand your comment.


I appreciate you sharing that point of view. There's a "curse of knowledge" effect with AWS where its card-carrying proponents (myself included) lose perspective on how complex it actually is.


Yes, this is very good advice and will likely solve their problem


A fun corollary of this issue:

Deleting an S3 bucket is nontrivial!

You can't delete a bucket with objects in it. And you can't just tell S3 to delete all the objects in one call. You have to send delete requests yourself, at most 1,000 keys per DeleteObjects request, which means also sending list requests, 1,000 keys at a time. Which takes time. And those list calls cost money to execute.

This is a good summary of the situation: https://cloudcasts.io/article/deleting-an-s3-bucket-costs-mo...

The fastest way to quickly dispose of an S3 bucket turns out to be to delete the AWS account it belongs to.


No, don't do that. Set up a lifecycle rule that expires all of the objects and wait 24 hours. You won't pay for API calls and even the cost of storing the objects themselves is waived once they are marked for expiration.

The article has a mistake about this too: expirations do NOT count as lifecycle transitions and you don't get charged as such. You will, of course, get charged if you prematurely delete objects that are in a storage class with a minimum storage duration that they haven't reached yet. This is what they're actually talking about when they mention Infrequent Access and other lower tiers.
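
For illustration, the expire-everything rule is just a small lifecycle configuration along these lines (apply it from the console, `aws s3api put-bucket-lifecycle-configuration`, or any SDK); the empty Prefix matches every key, and 1 is the minimum number of days:

  {
    "Rules": [
      {
        "ID": "expire-everything",
        "Status": "Enabled",
        "Filter": { "Prefix": "" },
        "Expiration": { "Days": 1 }
      }
    ]
  }

For versioned buckets you would also need rules covering noncurrent versions and delete markers before the bucket itself can be deleted.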


Still counts as nontrivial.


This is really easy; much easier than trying to delete them by hand. AWS does all the work for you. It takes longer to log into the AWS Management Console than it does to set up this lifecycle rule.


Literally 1 API call.


Two. The one to set up the lifecycle rule. Then the one to delete the bucket, some number of hours later.


Incorrect. One call to trigger a step function that sets up the lifecycle rule, sleeps for 24 hours and then deletes the bucket.

Stop being silly, as if 1 vs 2 API calls matters. You should empty large buckets with lifecycle policies. It's trivial.


Imagine for a second you’re a Unix user, familiar with the rm command.

Imagine you are using Windows for the first time and you want to delete a directory, so you find an answer on Serverfault that explains that to do so you need to spin up a COM object that marks the directory for deletion and then comes back the next day to delete it.

You might be inclined to say ‘that seems overly complicated’.

The original answerer is confused though. ‘It’s trivial, stop being silly. Can you think of a simpler way to delete a directory?’

Do you see now why I thought the ‘non triviality’ of deleting an S3 bucket was perhaps relevant in a discussion on an article about why S3 is both simpler and more complex than a file system?

And why your approach might not actually be making the case for it being as simple as you think?


Right click, move to recycle bin, wait for the progress bar to finish. Except the progress bar takes a day or so.

This is only needed if you have a huge (100 million+) bucket, at which point you should be experienced with s3, otherwise you can just click the big, clear and obvious “empty bucket” button on the console.


I think it's far more mundane a reason. You can list 1,000 objects per request, and getting the next 1,000 requires the result of the previous request, so it's all serial. That means to list 1M files, you're looking at 1,000 back-to-back requests. Assuming a ping time of 50ms, that's easily 50s of just going back and forth, not including the cost of doing the listing itself on a flat iteration. The cost of a 1,000-item list is about the cost of a write, which is kinda slow. Additionally, I suspect each listing is a strongly consistent snapshot, which adds to the cost of the operation (it can be hard to offer a cheaper, inconsistent view).

I don't think B-trees would help unless you're doing directory traversals, and even then I suspect that's not that beneficial, as your bottleneck is going to be the network round trips and the operations the API exposes. Ultimately, file listing isn't that critical a use case, and typically most use cases are accomplished through things like object lifecycles, where you tell S3 what you want done and it does it efficiently at the FS layer for you.


That's 50s of a 15m duration. I don't think it matters in the least.


Depends how you're iterating. If you're iterating by hierarchy level, then you could easily see this being several orders of magnitude more requests.


It's not a good model to think of S3 as having directories in a bucket. It's all objects. The web interface has a visual way of representing prefixes separated by slashes. But that's just a nice way to present the objects. Each object has a key, and that key can contain slashes, and you can think of each segment as a directory for your ease of mind.

But that illusion breaks when you try to do operations you usually do with/on directories.


Are you performing list calls sequentially? If you have O(100k) directories and are doing O(100k) requests sequentially, 15 minutes works out at O(10ms) per request which doesn’t seem that bad? (assuming my math is correct…)


At risk of being pedantic, you seem to be using big O to mean “approximately” or “in the order of”, but that’s not what it means at all. Big O is an expression of the growth rate of a function. Any constant value has a growth rate of 0, so O(100k) isn’t meaningful: It’s exactly the same as O(1).


You're right technically, it's an abuse of notation that isn't uncommon. My physics profs would do it in college.


Fair point, I guess the notation ~100k, ~10ms would be better.


I implemented a solution by threading the listing: get the files in the root, then spin up a separate process to do the recursion for each directory.
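
Something like this sketch, with goroutines standing in for processes: one delimiter listing at the root to discover the top-level prefixes, then a flat paginated listing per prefix in parallel (bucket name is a placeholder, and the root listing and concurrency handling are simplified):

  package main

  import (
    "context"
    "fmt"
    "log"
    "sync"
    "sync/atomic"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
  )

  func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
      log.Fatal(err)
    }
    client := s3.NewFromConfig(cfg)
    bucket := aws.String("my-bucket") // placeholder

    // One delimiter listing at the root to discover top-level "directories".
    // (Simplified: a real version paginates this too and caps concurrency.)
    root, err := client.ListObjectsV2(ctx, &s3.ListObjectsV2Input{
      Bucket:    bucket,
      Delimiter: aws.String("/"),
    })
    if err != nil {
      log.Fatal(err)
    }

    var total int64
    var wg sync.WaitGroup
    for _, cp := range root.CommonPrefixes {
      prefix := aws.ToString(cp.Prefix)
      wg.Add(1)
      go func() {
        defer wg.Done()
        // Flat paginated listing of everything under this prefix.
        p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
          Bucket: bucket,
          Prefix: aws.String(prefix),
        })
        for p.HasMorePages() {
          page, err := p.NextPage(ctx)
          if err != nil {
            log.Println(err)
            return
          }
          atomic.AddInt64(&total, int64(len(page.Contents)))
        }
      }()
    }
    wg.Wait()
    fmt.Println("objects:", total)
  }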


> Why is this something Amazon has not fixed?

It's common to store metadata on DynamoDB where it can be queried, and just have whatever arbitrary links to the values in the buckets.


> Why is this something Amazon has not fixed? From the outside really seems like they could slap some B-trees on the individual buckets and call it a day.

They fixed it already, it's called DynamoDB. With some SQS and Lambda glue you can index your S3 content in any way you want for later retrieval.


Take this opportunity to read the docs and discard assumptions. Enumerating buckets as though they’re directories will seem peculiar when you understand it is designed for billions of items and up. Index your objects separately, in whatever form makes sense to your application.


It's not "fixed" because it's not a problem. You're just using it wrong.


> Recursively listing these files

There's no "recursive" nature to S3 buckets. "Listing a directory" is simply listing keys by a prefix.

So list by the upper-most prefix that you want. If you have 1,000,000 files, it will take 1,000 API calls to list everything.

If each call takes 1s (I have no idea what your latency to the S3 bucket region is), then it will indeed take 15 min.

https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObje...


> Amazon S3 is the original cloud technology: it came out in 2006. "Objects" were popular at the time and S3 was labelled an "object store", but everyone really knows that S3 is for files. S3

Alternative theory: everyone who worked on this knew that it was not a filesystem and "object store" is a description intended to describe everything else pointed out in this post.

"Objects were really popular" is about objects as software component that combines executable code with local state. None of the original S3 examples were about "hey you can serialize live objects to this store and then deserialize them into another live process!" It was all like "hey you know how you have all those static assets for your website..." "Objects" was used in this sense in databases at the time in the phrase "binary large object" or "blob". S3 was like "hey, stuff that doesn't fit in your database, you know...objects...this is a store for them."

This is meant to describe precisely things like "listing is slow" because when S3 was designed, the launch use cases assumed an index of contents existed _somewhere else_, because, yeah, it's not a filesystem. It's an object store.


Yes, the author doesn't seem to realize that "object storage" is a term of art in storage systems that has nothing to do with OOP.

https://en.wikipedia.org/wiki/Object_storage


Yeah, I'm really worried the author is confusing OOP with an object store.

To quote GCP:

> Object storage is a data storage architecture for storing unstructured data, which sections data into units—objects—and stores them in a structurally flat data environment

> https://cloud.google.com/learn/what-is-object-storage

That is (1) unstructured (2) flat organization (3) whole-item operations (read, write)


S3 is not even files, and definitely not a filesystem.

The thing I would expect from a file abstraction is mutability. I should be able to edit pieces of a file, grow it, shrink it, read and write at random offsets. I shouldn't have to go back up to the root, or a higher level concept once I have the file in hand. S3 provides a mutable listing of immutable objects, if I want to do any of the mutability business, I need to make a copy and re-upload. As originally conceived, the file abstraction finds some sectors on disk, and presents them to the client as a contiguous buffer. S3 solves a different problem.

Many people misinterpret the Good Idea from UNIX "everything is a file" to mean that everything should look like a contiguous virtual buffer. That's not what the real Good Idea is. Really: everything can be listed in a directory, including directories. There will be base leaves, which could be files, or any object the system wants to present to a process, and there will be recursive trees (which are directories). The directories are what make the filesystem, not the type of a particular leaf. Adding a new type of leaf, like a socket or a frame buffer, or whatever, is almost boring, and doesn't erode the integrity of the real good idea. Adding a different kind of container like a list, would make the structure of the filesystem more complex, and that would erode the conceptual integrity.

S3 doesn't do any of these things, and that's fine. I just want a place to put things that won't fit in the database, and know they won't bitrot when I'm not looking. The desire to make S3 look more like a filesystem comes from client misunderstanding of what it's good at/for, and poor product management indulging that misunderstanding instead of guarding the system from it.


> S3 is not even files, and definitely not a filesystem.

I agree. To me the correct analog for S3 is a block storage device (a very weird one where blocks can be any size and can have a key associated with them) and not a filesystem. A filesystem is an abstraction that sits on top of a block storage device and so an "S3 filesystem" would have to be an abstraction that sits on top of S3 as the underlying block storage.


> a very weird one where blocks can be any size and can have a key associated with them

That is a very weird one


How do read-only filesystems align with your definition?


All of the read stuff still applies (list, open, read, seek).


You can read individual blocks on a read only file system. With S3 you’re stuck with range requests which are much larger.


You can't create new things on a read-only filesystem, you can in S3; not a good analogy.


I wasn’t making an analogy. I was asking how read-only filesystem works given the parent commenters description of what makes something a filesystem.


It's a filesystem where many operations return an error (historically, EROFS). There are many things you can't do with one. Is that interesting somehow?

I don't agree with defining a filesystem as something that has to be backed by a block device, but the shape of a filesystem API is historically very different from the shape of the S3 API.


A filesystem is an abstraction built on a block device. A block device just gives you a massive array of bytes and lets you read/write from them in blocks (e.g. write these 300 bytes at position 273041).

A block device itself is an abstraction built on real hardware. "Write these 300 bytes" really means something like "move needle on platter 2 to position 6... etc"

S3 is just a different abstraction that is also built on raw storage somehow. It's a strictly flat key-object store. That's it. I don't know why people have a problem with this. If you need "filesystem stuff" then implement it in your app, or use a filesystem. You only need to append? Use a database to keep track of the chain of appends and store the chunks in S3. Doesn't work for you? Use something else. Need to "copy"? Make a new reference to the same object in your db. Doesn't work for you? Use something else.
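
As a toy sketch of that append-as-chunks idea (an in-memory slice stands in for the database holding the manifest, and the bucket name is a placeholder):

  package main

  import (
    "bytes"
    "context"
    "fmt"
    "io"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
  )

  const bucket = "my-bucket" // placeholder

  // manifest is the ordered list of chunk keys making up one logical "file".
  // In a real system this would live in a database, not in memory.
  var manifest []string

  // appendChunk writes each append as a brand new immutable object and
  // records its key in the manifest.
  func appendChunk(ctx context.Context, client *s3.Client, data []byte) error {
    key := fmt.Sprintf("chunks/%08d", len(manifest))
    _, err := client.PutObject(ctx, &s3.PutObjectInput{
      Bucket: aws.String(bucket),
      Key:    aws.String(key),
      Body:   bytes.NewReader(data),
    })
    if err == nil {
      manifest = append(manifest, key)
    }
    return err
  }

  // readAll stitches the chunks back together in manifest order.
  func readAll(ctx context.Context, client *s3.Client, w io.Writer) error {
    for _, key := range manifest {
      out, err := client.GetObject(ctx, &s3.GetObjectInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(key),
      })
      if err != nil {
        return err
      }
      _, err = io.Copy(w, out.Body)
      out.Body.Close()
      if err != nil {
        return err
      }
    }
    return nil
  }

  func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
      log.Fatal(err)
    }
    client := s3.NewFromConfig(cfg)
    _ = appendChunk(ctx, client, []byte("hello ")) // errors ignored for brevity
    _ = appendChunk(ctx, client, []byte("world\n"))
    var buf bytes.Buffer
    _ = readAll(ctx, client, &buf)
    fmt.Print(buf.String())
  }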

S3 works for a lot of people. Stop trying to make it something else.

And stop trying to change the meaning of super well-established names in your field. A filesystem is described in text books everywhere. S3 is not a filesystem and never claimed to be one.

Oh and please study a bit of operating system design. Just a little bit. It really helps and is great fun too.


It's even discussed in https://github.com/apache/arrow-rs/issues/3888, comparing object_store in Apache Arrow to the APIs provided by Apache OpenDAL.

Briefly, Apache OpenDAL is a library providing FS-like APIs over multiple storage backends, including S3 and many other cloud storage services.

A few database systems, such as GreptimeDB and Databend, use OpenDAL as a better S3 SDK to access data on cloud storage.

Other solutions exist to manage filesystem-like interfaces over S3, including Alluxio and JuiceFS. Unlike Apache OpenDAL, Alluxio and JuiceFS need to be deployed standalone and have a dedicated internal metadata service.


I'm not sure if Alluxio could be substituted by OpenDAL as a local cache layer for TrinoDB.


If I understand "local cache layer" correctly, it's possible. And it's even desirable if you want to reduce the deployment burden.

Here is some related code showing how we implement such a layer in GreptimeDB:

* https://github.com/GreptimeTeam/greptimedb/blob/v0.7.0/src/o...

* https://github.com/GreptimeTeam/greptimedb/blob/v0.7.0/src/m...


Backblaze B2 is worth mentioning while we are speaking of S3. I'm absolutely in love with their prices (3 times lower than S3's). (I'm not their representative.)


Backblaze's B2 is cheap - but if you're using them in production you must include these costs:

* their weekly 2 hour maintenance window 11:30-13:30 PST (which usually has no downtime, but sometimes is a full outage in the middle of the US day)

* having to file support tickets when your error rates increase above a usable threshold (for us about once a year for the last few years)

* support which does not look into the issue, just asks tons of questions as if they do not have error logs or any visibility on their end

* false success on uploads where B2 says it successfully saved your file but it is 0 bytes on their system (ALWAYS verify the upload despite B2's success code)

* extended outages if there's a high severity CVE (ex: they shut down for 10 hours for the Log4j2 CVE)

They have the best price - but when comparing options, it is simply not a directly comparable product to more mature cloud storage services.

(edit: formatting)


With every alternative, the prevailing issue is the fact that your data is as safe as the company your data is with. But I think this can be remedied by doubly external backups.


B2 having an S3-compatible API available makes this particularly easy :)


Backblaze is like if Amazon spun AWS S3 out as its own business (and it added some backup helper tooling as a result) though, I wouldn't really worry any more about it. You could write a second copy to S3 Glacier Deep Archive (using B2 for instant access when you wanted to restore or on a new device) and still be much cheaper.


We liked B2, but not enough to pay for IPv4 addresses. It's insane that they advertise as a multi-cloud solution but basically kill any chance at adoption when NAT gateways and IPv4 charges are everywhere. We would literally save money paying B2 bandwidth fees (high read, low write), but not when being pushed through a NAT64 gateway, or paying an hourly charge just to be able to access B2.


How could they launch a cloud service like this and not have IPv6 in 2015? What other basic things did they cheap out on?


Most major cloud vendors are still not fully dual-stack capable, so it's not that surprising. And plenty of ISPs have barely started rollout, or have even said they just won't.


I understand that AWS has 200 services, some of which are 20 years old, and making them all IPv6-ready would be hard and costly. Backblaze has one cloud service, and the public interface is a boring REST API over boring HTTPS.


AWS enabled dualstack S3 almost 10 years ago because object storage is pretty much the use case for IPv6.

I’m pretty sure the only other large object storage provider that is v4 only is Azure, and even then they offer a compatibility layer. Backblaze just flat out won’t work unless you pay extra to connect to them.

Honestly the only cloud provider I think you’re talking about is Azure, I don’t know of any other that are IPv4 only because it’s just cost prohibitive.


I also migrated, after asking for IPv6 for more than 3 years on reddit.

They don't seem to understand the users of the B2 product. It's almost as if B2 is just a supplementary service to their backup product.

https://www.reddit.com/r/backblaze/comments/ij9y9s/b2_s3_not...


they've started internal v6 rollout with external coming afterwards. no timelines though, and I've waited for years

https://old.reddit.com/r/backblaze/comments/1av4r3g/b2_ipv6_...


Great article - would have been useful to read before starting out on the journey of making rclone mount (mount your cloud storage via fuse)!

After a lot of iterating we eventually came up with the VFS layer in rclone which adapts S3 (or any other similar storage system like Google Cloud Storage, Azure Blob, Openstack Swift, Oracle Object Storage, etc) into a POSIX-ish file system layer in rclone. The actual rclone mount code is quite a thin layer on top of this.

The VFS layer has various levels of compatibility. The lowest level, "off", just does directory caching. In this mode, like the article states, you can't read and write to a file simultaneously, you can't write to the middle of a file, and you can only write files sequentially. Surprisingly quite a lot of things work OK with these limitations. The next level up is "writes" - this supports nearly all the POSIX features that applications want, like being able to read and write to the same file at the same time, write to the middle of the file, etc. The cost for that though is a local copy of the file, which is uploaded asynchronously when it is closed.

Here are some docs for the VFS caching modes - these mirror the limitations in the article nicely!

https://rclone.org/commands/rclone_mount/#vfs-file-caching

By default S3 doesn't have real directories either. This means you can't have a directory with no files in, and directories don't have valid metadata (like modification time). You can create zero length files ending in / which are known as directory markers and a lot of tools (including rclone) support these. Not being able to have empty directories isn't too much of a problem normally as the VFS layer fakes them and most apps then write something into their empty directories pretty quickly.

So it is really quite a lot of work trying to convert something which looks like S3 into something which looks like a POSIX file system. There is a whole lot of smoke and mirrors behind the scene when things like renaming an open file happens and other nasty corner cases like that.

Rclone's lower level move/sync/copy commands don't bother though and use the S3 API pretty much as-is.

If I could change one thing about S3's API I would like an option to read the metadata with the listings. Rclone stores modification times of files as metadata on the object and there isn't a bulk way of reading these, you have to HEAD the object. Or alternatively a way of setting the Last-Modified on an object when you upload it would do too.
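
Concretely, the round trip today looks something like this sketch with the Go SDK v2 (bucket, key and the mtime value are placeholders): the time goes up as user metadata on PutObject, but getting it back means a HEAD per object because listings don't include user metadata.

  package main

  import (
    "context"
    "fmt"
    "log"
    "strings"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
  )

  func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
      log.Fatal(err)
    }
    client := s3.NewFromConfig(cfg)
    bucket, key := aws.String("my-bucket"), aws.String("dir/file.txt") // placeholders

    // Upload: the modification time rides along as user metadata.
    _, err = client.PutObject(ctx, &s3.PutObjectInput{
      Bucket:   bucket,
      Key:      key,
      Body:     strings.NewReader("hello"),
      Metadata: map[string]string{"mtime": "1710066090"},
    })
    if err != nil {
      log.Fatal(err)
    }

    // Reading it back: listings don't return user metadata, so it is one
    // HeadObject round trip per object.
    head, err := client.HeadObject(ctx, &s3.HeadObjectInput{Bucket: bucket, Key: key})
    if err != nil {
      log.Fatal(err)
    }
    fmt.Println(head.Metadata["mtime"])
  }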


> If I could change one thing about S3's API I would like an option to read the metadata with the listings. Rclone stores modification times of files as metadata on the object and there isn't a bulk way of reading these, you have to HEAD the object. Or alternatively a way of setting the Last-Modified on an object when you upload it would do too.

I wonder if you couldn't hack this in by storing the metadata in the key name itself? Obviously with the key length limit of 1024 you would be limited in how much metadata you could store, but it's still quite a lot of space, even taking into account the file path. You could use a delimiter that would be invalid in a normalized path, like '//', for example: /path/to/file.txt//mtime=1710066090

You would still be able to fetch "directories" via prefixes and direct files by using '<filename>//' as the prefix.

This kind of formatting would probably make it pretty incompatible with other software though.
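
A toy version of that encoding/decoding might look like this (the '//' delimiter and the mtime= tag are just the convention suggested above, not anything S3 knows about):

  package main

  import (
    "fmt"
    "strconv"
    "strings"
  )

  // encodeKey appends the metadata to the object key itself, e.g.
  // "/path/to/file.txt//mtime=1710066090".
  func encodeKey(path string, mtime int64) string {
    return path + "//mtime=" + strconv.FormatInt(mtime, 10)
  }

  // decodeKey splits a stored key back into the path and the mtime.
  func decodeKey(key string) (string, int64) {
    parts := strings.SplitN(key, "//", 2)
    if len(parts) < 2 {
      return parts[0], 0
    }
    mtime, _ := strconv.ParseInt(strings.TrimPrefix(parts[1], "mtime="), 10, 64)
    return parts[0], mtime
  }

  func main() {
    key := encodeKey("/path/to/file.txt", 1710066090)
    fmt.Println(key) // listings now carry the metadata "for free"
    fmt.Println(decodeKey(key))
  }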


I think that is a nice idea - maybe something we could implement in an overlay backend. However people really like the fact that the objects they upload with rclone arrive on s3 with the filenames they had originally, so I think the incompatible-with-other-software downside would make it unattractive for most users.


> If I could change one thing about S3's API I would like an option to read the metadata with the listings.

Agree. In MinIO (disclaimer: I work there) we added a "secret" parameter (metadata=true) to include metadata and tags in listings if the user has the appropriate permissions. Of course it being an extension it is not really something that you can reliably use. But rclone can of course always try it and use it if available :)

> You can create zero length files ending in /

Yeah. Though you could also consider "shared prefixes" in listings as directories by itself. That of course makes directories "stateless" and unable to exist if there are no objects in there - which has pros and cons.

> Or alternatively a way of setting the Last-Modified on an object when you upload it would do too.

Yes, that gives severe limitations to clients. However it does make the "server" time the reference. But we have to deal with the same limitation for client side replication/mirroring.

My personal biggest complaint is that there isn't a `HeadObjectVersions` that returns version information for a single object. `ListObjectVersions` is always going to be a "cluster-wide" operation, since you cannot know if the given prefix is actually a prefix or an object key. AWS recently added "GetObjectAttributes" - but it doesn't add version information, which would have fit in nicely there.


> Agree. In MinIO (disclaimer: I work there) we added a "secret" parameter (metadata=true) to include metadata and tags in listings if the user has the appropriate permissions. Of course it being an extension it is not really something that you can reliably use. But rclone can of course always try it and use it if available :)

Is this "secret" parameter documented somewhere? Sounds very useful :-) Rclone knows when it is talking to Minio so we could easily wedge that in.

> My personal biggest complaint is that there isn't a `HeadObjectVersions` that returns version information for a single object. `ListObjectVersions` is always going to be a "cluster-wide" operation, since you cannot know if the given prefix is actually a prefix or an object key

Yes that is annoying having to do a List just to figure out which object Version is being referred to. (Rclone has this problem when using --s3-list-version).


Hey Nick :wave:


> The "simple" in S3 is a misnomer. S3 is not actually simple. It's deep.

Simple doesn't mean "not deep". It means having the fewest parts needed in order to accomplish your requirements.

If you require a distributed, centralized, replicated, high-availability, high-durability, high-bandwidth, low-latency, strongly-consistent, synchronous, scalable object store with HTTP REST API, you can't get much simpler than S3. Lots of features have been added to AWS S3 over the years, but the basic operation has remained the same.


> It means having the fewest parts needed in order to accomplish your requirements.

That is exactly what "deep" means, in the terminology of this post (from Ousterhout's book A Philosophy of Software Design). Simple means "not complex" (see also Rich Hickey's talk Simple Made Easy: https://www.infoq.com/presentations/Simple-Made-Easy/), while "deep" means providing/having a lot of internally-complex functionality via a small interface. The latter is a better description of S3 (which is what you seem to be saying too) than "simple" which would mean there isn't much to it.


Hickey's definition of simple is wrong. It's not the opposite of complex at all. They are not opposites, nor mutually exclusive.

  - Easy is when something does not require much effort.
  - Simple means the least complex it can be and still work.
  - Complex means there are lots of components.
These are all quite different concepts:

  - Easy is a concept that distinguishes the amount of work needed to use a solution
  - Simple is a concept that distinguishes whether or not there is an excess number of interacting properties in a system
  - Complex is a concept describing the quality of having a number of interacting properties in a system
Hickey's talk is useful in terms of thinking about software, but it also contains many over-generalizations which are incorrect and lead to incorrect thinking about things that aren't software. (Even some of his declarations about software are wrong)

"Deep", in the context of software complexity, probably only makes sense in terms of describing the number of layers involved in a piece of technology. You could make something have many layers, and it could still be simple, or be complex, or easy.


In terms the article puts forth, I would almost argue that simple implies deep (and the associated “narrow” interface).


S3 is tagged, versioned object storage with file-like semantics implemented in the AWS SDK (via the AWS S3 APIs). The S3 object key is the tag.

Files and folders are used to make S3 buckets more approachable to those who either don't know or don't want to know what it actually is, and one day they get a surprise.


> S3 is a cloud filesystem, not an object-whatever. [...]I think the idea that S3 is really "Amazon Cloud Filesystem" is a bit of a load bearing fiction.

Does anyone actually think this? I have never encountered anyone who has described S3 in these terms.


Not sure if the author is aware of EFS


Tools like LucidLink and Weka go some way toward making S3 even more of a "file system". They break files into smaller chunks (S3 objects), which helps with partial writes, reads and performance, alongside tiering of data from S3 to disk when needed for performance.


Someone contributed an nbdkit S3 plugin which basically works the way you described. It uses numbered S3 chunks using the pattern "key/%16x", allowing the virtual disk to be updated. (https://libguestfs.org/nbdkit-S3-plugin.1.html https://gitlab.com/nbdkit/nbdkit/-/tree/master/plugins/S3)
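
The offset-to-object mapping for that kind of layout is just arithmetic, roughly along these lines (the chunk size and exact key format here are illustrative):

  package main

  import "fmt"

  const chunkSize = 1 << 22 // e.g. 4 MiB per S3 object; tune to taste

  // chunkKey maps a byte offset on the virtual disk to the S3 object that
  // holds it, loosely following the plugin's "key/%16x" naming convention.
  func chunkKey(prefix string, offset int64) (key string, offsetInChunk int64) {
    return fmt.Sprintf("%s/%016x", prefix, offset/chunkSize), offset % chunkSize
  }

  func main() {
    key, off := chunkKey("disk1", 123456789)
    fmt.Println(key, off) // a write becomes a read-modify-write of one small object
  }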


The problem with these approaches is that the data is scrambled on the backend, so you can't access the files directly from S3 anymore. Instead you need an S3 gateway to convert from scrambled S3 to unscrambled S3. They rely on a separate database to reassemble the pieces back together again.


I don’t know a whole lot about LucidLink but Weka basically uses S3 as a dataplane for their own file system.


It's nice to see Ousterhout's idea of module depth (the main idea from his A Philosophy of Software Design) getting more mainstream — mentioned in this article with attribution only in "Other notes", which suggests the author found it natural enough not to require elaboration. Being obvious-in-hindsight like this is a sign of a good idea. :-)

> The concept of deep vs shallow modules comes from John Ousterhout's excellent book. The book is [effectively] a list of ideas on software design. Some are real hits with me, others not, but well worth reading overall. Praise for making it succinct.


I was tempted to define and cite the term more carefully before I used it but that leadened the article a lot right in the middle and so I cut it and just hoped.

It is a great concept and also a great book. I really enjoyed it but I've never found a convincing way to persuade people to read it. I read it on personal recommendation but that only works if it comes from someone you respect (as in my case).


I feel like I understand the lasting popularity of the humble FTP fileserver a bit better now. Thank you.


oh but amazon offers SFTP on top of S3 so you don't have to miss out.


If it's offered on top of S3, though, doesn't it still have all the same issues of needing to totally overwrite files?


Like Gmail is emails but not IMAP. It's fine. We have seen that these kinds of wrappers work pretty well most of the time considering the performance and simplicity they bring in building and managing these systems.


A bit off topic but also related: I use Minio as a local "S3" to store datasets and model checkpoints for my garage compute. Minio, however, has a bunch of features that I simply don't need. I just want to be able to copy to/from, list prefixes, and delete every now and then. I could use NFS I suppose, but that'd be a bit inconvenient since I also use Minio to store build deps (which Bazel then downloads), and I'd like to be able to comfortably build stuff on my laptop. In particular, one feature I do not need is the constant disk access that Minio does to "protect against bit rot" and whatever. That protection is already provided by periodic scrubs on my raidz6.

So what's the current best (preferably statically linked) self-hosted, single-node option for minimal S3 like "thing" that just lets me CRUD the files and list them?


FYI, Minio used to have a "File System" mode that did exactly this.

But they deprecated it.

(You can still use it, but it's not getting updates.)


A web server?


> Filesystem software, especially databases, can't be ported to Amazon S3

Hudi, Delta, iceberg bridge that gap now. Databricks built a company around it.

Don't try to do relational on object storage on your own. Use one of those libraries. It seems simple but it's not. Late arriving data, deletes, updates, primary key column values changing, etc.


There are dedicated block storage services (EBS, including flavors like EBS multi-attach) and file services like EFS that can be used if there is a need to port software/databases to the cloud with low-level filesystem support.

Why would we need to do it on object storage, which addresses a different type of storage need?

Nevertheless there are projects like EMRFS and S3 file system mount points that try to provide file system interfaces to workloads that need to see S3 as a filesystem.


S3 is better for large datasets. It's cheaper and handles large file sizes with ease.

It has become a de-facto standard for distributed, data-intensive workloads like those common with spark.

A key benefit is decoupling the data from the compute so that they can scale independently. EBS is tightly coupled to iops and you pay extra for that.

(Source: a long time working in data engineering)


Yes and I also believe:

Experienced Spark / Data Engineering teams would not assume S3 is readily useable as a filesystem.

This [1] seems like a good guide on how to configure spark for working with Cloud object stores, while recognizing the limitations and pitfalls.

[1]: https://spark.apache.org/docs/latest/cloud-integration.html

---

Amazon EMR offers a managed way to run hadoop or spark clusters and it implements an "EMR FS" [2] system to interface with S3 as storage.

[2]: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-fs.h...

AWS Glue is another option, which is "serverless" ETL. Source and destination can be S3 data lakes read through a data catalog (Hive or Glue Data Catalog). During processing, AWS Glue can optionally use S3 [3,4,5] for shuffle partitions.

[3]: https://aws.amazon.com/blogs/big-data/introducing-amazon-s3-...

[4]: https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-shu...

[5]: https://aws.amazon.com/blogs/big-data/introducing-the-cloud-...


I think we're talking about two different things. I was addressing a section in the article about running databases backed by s3. It's less about s3 needing to act as a filesystem, and more about all of the rdbms features that come along with the various types of DB transactions. It's a solved problem with the libraries I mentioned. Not something I'd ever recommend to build on your own. Been there done that when those solutions were still nascent. Wasn't worth the effort vs just using an rdbms.

The problem that emrfs is trying to solve doesn't cover the rdbms scenarios like row-level updates and deletes.


I still don't understand why you'd want to do it in the first place. Just buy some contiguous storage.


This article was an epiphany for me because I realized I've been thinking of the Unix filesystem as if it has two functions: read_file and write_file. (And then getting frustrated with the filesystem APIs in programming languages.)


So you came from an S3 or other put-get world, and found actual filesystems odd?

I suppose that's not so different from a WMP user's epiphany when they discover processes, shells, etc.


Well I’m used to an application-level view of the file system.

A document editor or text editor opens files and saves files, but these are whole-document operations. I can’t open a document in Sublime Text without reading it, and I can’t save part of a file without saving all of it. So it’s not obvious that these would be different at an OS level.

As the post points out, there are uses for Unix’s sub-file-level read-and-write commands, but I’ve never needed them.


The article is well written, but I am annoyed at the attempt to gatekeep the definition of a filesystem.

Like literally any abstraction out there, filesystems are associated with a multitude of possible approaches with conceptually different semantics. It's a bit sophistic to say that Postgres cannot be run on S3 because S3 is not a filesystem; a better choice would have been to explore the underlying assumptions; (I suspect latency would kill the hypothetical use case of Postgres over S3 even if S3 had incorporated the necessary API semantics - could somebody more knowledgeable chime in?).

A more interesting avenue to pursue would be: what other additions could be made to the S3 API to make it more usable in its own right - for example, why doesn't S3 offer more than one filename per blob? (e.g., similar to what links do in POSIX)


The notion of postgres not being able to run on s3 has more to do with the characteristics of how it works than with it not being a filesystem. After all, people have developed fuse drivers for s3 so they can actually pretend it's a filesystem. But using that to store a database is going to end in tears for the same reasons that using e.g. NFS for this is also likely to end in tears. You might get it to work but it won't be fast or even reliable. And since NFS actually stands for networked file system, it's hard to argue that NFS isn't a filesystem.

Whether something is or isn't a filesystem requires defining what that actually is. A system that stores files would be a simple explanation. Which is clearly something S3 is capable of. This probably upsets the definition gatekeepers for whatever more specific definitions they are guarding. But it has a nice simple logic to it.

It's worth considering that file systems have had a long history, weren't always the way they are now, and predate the invention of relational databases (like Postgres). Technically, before hard disks were invented in the fifties, we had no file systems. Just tapes and punch cards. A tape would consist of a single blob of bits, which you'd load into memory. Or it would have multiple such blobs at known offsets. I had cassettes full of games for my Commodore 64. But no disk drive. These blobs were called files but there was no file system. Sometime after the invention of disks, file systems were invented, in the early sixties.

Hierarchical databases were common before relational databases, and filesystems with directories are basically a hierarchical database. S3, lacking hierarchy as a simpler key value store, clearly isn't a hierarchical database. But of course it's easy to mimic one simply by using / characters in the keys. Which is how the fuse driver probably fakes directories. And S3 even has APIs to list files with a common prefix. A bigger deal is the inability to modify files. You can only replace them with other files (delete and add). That kind of is a show stopper for a database. Replacing the entire database on every write isn't very practical.


Neon.tech runs PostgreSQL on S3. They persist the WAL to S3 so that they can replicate the data and bring it to local SSDs, I assume.


Well, RocksDB never overwrites files except the manifest which is small. And you can write DB features on top of that. So that's an example of a database that can work with the S3 limitations.


ClickHouse can work with S3 as a main storage. This is possible because a table is a set of immutable data parts. Data parts can be written once and deleted, possibly as a result of a background merge operation. S3 API is almost enough, except for cases of concurrent database updates. In this case, it is not possible to rely on S3 only because it does not support an atomic "write if not exists" operation. That's why external, strongly consistent metadata storage is needed, which is handled by ClickHouse Keeper.


Google Cloud Storage supports create-if-not-exist and compare-and-swap on generation counter. S3 is much harder to use as a building block without tying your code into a second system like DynamoDB etc.

https://pkg.go.dev/cloud.google.com/go/storage#Conditions
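
For comparison, a create-if-not-exists write with the Go client looks roughly like this; the precondition failure surfaces as an error when the writer is closed:

  package main

  import (
    "context"
    "fmt"
    "log"

    "cloud.google.com/go/storage"
  )

  func main() {
    ctx := context.Background()
    client, err := storage.NewClient(ctx)
    if err != nil {
      log.Fatal(err)
    }
    defer client.Close()

    // DoesNotExist makes the write conditional: it only succeeds if no
    // object exists under this name yet. Bucket/object names are placeholders.
    w := client.Bucket("my-bucket").Object("table/_commit_0001").
      If(storage.Conditions{DoesNotExist: true}).
      NewWriter(ctx)
    if _, err := w.Write([]byte("commit metadata")); err != nil {
      log.Fatal(err)
    }
    if err := w.Close(); err != nil {
      // A precondition-failed (412) error here means someone else won the race.
      fmt.Println("lost the race:", err)
      return
    }
    fmt.Println("created")
  }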


Conditional PUT would be a great addition to S3, indeed.


That would probably require them to rewrite a non-trivial part of S3 from scratch.


Is a "write if not exists" atomic operation enough as a concurrency primitive for database locks?


Yes, it's not necessarily the most efficient mechanism (could be a lot of retries) but it's sufficient. See the Delta Lake paper for example [0]

[0] https://people.eecs.berkeley.edu/~matei/papers/2020/vldb_del...


When talking about analytical databases for "big data", yeah. They generally just want a "atomically replace the list of Parquet files that make up this table", with one writer succeeding at a time.

That would not be a great base to build a transactional database on.


This might be of interest to you: https://neon.tech/blog/bring-your-own-s3-to-neon.

There's also the OG Aurora whitepaper: https://www.amazon.science/publications/amazon-aurora-design...


I’ve wondered this also because it can be handy to have multiple ways of accessing the same file. For example to obfuscate database uuids if they are used in the key. In theory you could implement soft links in AWS by just storing a file with the path to the linked file. But it would be a lot of manual work.


I talked to people at AWS who work in RDS Aurora and they hinted they use S3 internally as a backend for MySQL and PostgreSQL.


Maybe for snapshots, but certainly not for live data.


EBS not S3


Big if true. That was definitely not in the AWS cert I took lol.


Separating compute and storage is one of the core ideas behind Aurora. They talked about it in several places, for instance:

* https://www.amazon.science/publications/amazon-aurora-design...

* https://d1.awsstatic.com/events/reinvent/2019/REPEAT_Amazon_...


I am currently pondering this exact problem. I want to run a file-sharing web application (think: NextCloud) but I don't want to use expensive block storage or the dedicated server's disk space for the files, as some of them will be accessed infrequently.

I am wondering if s3fs/rclone-mount is sufficient, or if I should use something like JuiceFS that adds random-access, renaming, etc on top of it. Are those really necessary APIs for my use case? Is there only one way to find out?

(The app doesn't have native S3 support)


It depends on if you want to expose filesystem semantics or metadata to applications using it. For example random access writes are done by ffmpeg, which is a workhorse of the media industry, but most things can't handle that or are too slow. We had to build our own solution cunoFS to make it work properly at high speeds.


S3 is obviously not a filesystem in the sense of a POSIX filesystem. And I would argue it is not a filesystem, even if we were to relax POSIX filesystem semantics (do not implement the full spec). But what is certainly possible is to span a filesystem on top of S3. It is basically possible to span a filesystem on anything that can store data. You can even go crazy for demonstration purposes and put a filesystem on top of YouTube (there are some tech demos for that on GitHub).

I think a better question is whether there are any good filesystem implementations on top of S3. There are many attempts like s3fs-fuse[^1] or seaweedfs[^2], but I have not heard many stories about their use at scale from big companies. Just recently there was a post here about cunoFS[^3]. It is a startup that implements a POSIX-compliant (supports symlinks, hard links (emulated), UIDs & GIDs, permissions, random writes, etc.) filesystem on top of S3/AZ/GCP storage and claims to have really good performance. I think only time will tell if it works out in practice for companies to use S3 as a filesystem through fs implementations on top of S3.

[^1]: https://github.com/s3fs-fuse/s3fs-fuse

[^2]: https://github.com/seaweedfs/seaweedfs

[^3]: https://news.ycombinator.com/item?id=39640307


Are filesystems the correct abstraction to build databases on? Isn’t a filesystem a database in a way? Is there a reason to build a database on top of a filesystem abstraction rather than a block abstraction?

To say you can’t build an efficient database on top of S3 makes sense to me. S3 is already a certain kind of data-storing abstraction optimized for certain usages. If you try and build another data-storing abstraction optimized for incompatible usages on top of that, you are going to have a difficult time.


The traditional POSIX filesystem is the wrong abstraction for a database, but not filesystems per se. All databases that care about performance and scalability implement their own filesystems, either directly against raw block devices or as an overlay on top of a POSIX filesystem that bypasses some of its limitations. The performance and scalability gains by doing so are not small.

The issue with POSIX filesystems is that they are required to make a set of tradeoffs to support features a database engine doesn't need, to the significant detriment of scalability and performance in areas that databases care about a lot. For example, one such database filesystem I've used occasionally over the years, while a bit dated at this point, is designed such that you can have tens of millions of files in a single directory where you are creating and destroying tens of thousands of files every second, on upwards of a petabyte of storage. Very far from being POSIX compatible but you don't get anything like that type of scalability on POSIX.

Object storage is far from ideal as database storage. The biggest issue, though, is the terrible storage bandwidth available in the cloud. It is a small fraction of what is available in a normal server and modern database engines are capable of fully exploiting a large JBOD of NVMe.


> Is there a reason to build a database on top of a filesystem abstraction rather than a block abstraction?

Oracle DB for a long time supported running on raw partitions which I think suggests that the answer is "not really". Snowflake (and I hear Clickhouse) can run on S3 which I think is more evidence against running on a filesystem. Not to mention the torrid time Postgres has had with fsync on linux.


In my $dayjob as cloud architect I sometimes suggest S3 as an alternative to pulling massive JSON blobs from RDS Postgres/Redis etc. As long as their latency requirements are relaxed enough, there's no reason you can't.


At Hopsworks we built HopsFS-S3 to improve things like listing (it becomes a partition-pruned scan of an in-memory DB) and atomic renames, and we added a block/object caching layer using NVMe drives.

You can read the research paper here if you are curious: https://www.hopsworks.ai/research-papers/hopsfs-s3-extending...


Very mild take: S3 is a very reliable, relatively cheap, high latency, key-value store.

The reason I don't think about files on UNIX-derived systems as a key-value store (filename -> file content) is that in such systems, we have many things that aren't really files but expose a file system interface regardless.


Is there a generic name for these distributed cloud file storages?

AWS is S3, google is buckets, Azure is blob storage, the open source version is … ?


Object Storage


I tend to go by Binary Large OBject (BLOB) storage to distinguish between this kind of object storage and "object" as in OOP. BLOB is also what databases call files stored in columns.


When would that be confusing? As in what would an AWS service offering OOP object storage be/mean?


Google buckets is a bit off - the product is called Google Cloud Storage. Buckets are also a term used by S3 and are equivalent to Azure blob storage containers. They are an intermediary layer that determines attributes for the objects stored within them, such as ACLs and storage class (and therefore cost and performance).

As to your question, object storage[1] seems to be the generic term for the technology. Internally they all rely on naming files based on the hash of their contents for quick lookup, deduplication, and avoiding name clashes.

1: https://en.wikipedia.org/wiki/Object_storage


"blob storage" is the usual generic term, even though Azure uses it explicitly. It's like calling adhesive bandages, "bandaids" even though that is a specific company's term.


Underneath the software, there’s still a filesystem with files.

If you stand up an S3 instance with Ceph, you still have a filesystem on spinning rust or fancy SSDs. There's just a bunch of stuff on top of that. It's cool, but saying there's no filesystem describes only what the customer or middle person sees, not what is actually happening.


S3 actually uses a completely custom system[1] for writing bytes to disk. I haven't seen much in the way of details on the on-disk format but I certainly wouldn't assume it resembles a normal filesystem.

[1]: https://aws.amazon.com/blogs/storage/how-automated-reasoning...


I seriously doubt this is correct. It is common for database engines to install directly on raw block devices, bypassing the Linux kernel and effectively becoming the filesystem for those storage devices. Why would S3 work any differently? There are no advantages to building on top of a filesystem and many disadvantages for this kind of thing.

It would be a poor engineering choice to build something like S3 on top of some other filesystem. There are often ways to do it by using an overlay that converts a filesystem into a pseudo block device, but that is usually considered a compatibility shim for environments that lack dedicated storage, at the cost of robustness and performance.


No there isn't. AWS does not use the traditional filesystem layer to store data; that would be a massive mistake from a performance and reliability POV. The POSIX filesystem specification is notoriously vague about things like fsync consistency under particular scenarios (e.g. "do I need to fsync the parent directory before or after fsyncing the contents") and has many bizarre performance cliffs if you aren't careful. At the scale AWS is at, even a 10% performance cliff or performance delta would be worth clawing back if it meant removing the POSIX filesystem.

Filesystems are not free; they incur "complexity" (that favorite bugbear everyone on HN loves to complain about) just as much as any other component in the stack does.

> If you stand up an S3 instance with Ceph,

Okay, but AWS does not run on Ceph. Even then, Ceph is an example that recommends the opposite. Nowadays they recommend solutions like the Bluestore OSD backend to store actual data directly on raw block devices, completely bypassing the filesystem layer -- for the exact same reasons I outlined above and many, many others (the actual metadata does use "BlueFS" which is a small FS shim, but this is mostly so that RocksDB can write directly to the block device too, next to the data segments, and BlueFS is in no way a real POSIX filesystem, it's just a shim for existing software).

See "File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution" written by the Ceph authors[1] about why they finally gave in and wrote Bluestore. The spoiler alert is they got rid of the filesystem precisely because "a filesystem with files" underneath, as you describe, was problematic and worked poorly in comparison (see the conclusion in Section 9.)

Many places do use POSIX filesystems for various reasons, even at large scale, of course.

[1] https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf


Ceph's BlueStore has talked direct to block devices, no filesystem in between, since 2017.

https://ceph.com/community/new-luminous-bluestore/

[Disclaimer: ex-Ceph employee, from before BlueStore]


I dunno, are features like partial file overwrites necessary to make something a filesystem? This reminds me of how there are lots of internal systems at Google whose maintainers keep asserting are not filesystems, but everyone considers them so, to the point where "_____ is not a filesystem" has become an inside joke.


They are necessary because as soon as someone decides that S3 is a filesystem, they will look at the other cloud "filesystems," notice that S3 is cheaper than most of them, and then for some reason they will decide to run giant Hadoop fs stuff on it or mount a relational database on it or all manner of other stupidity. I guarantee you S3's customer-facing engineers are fielding multiple calls per week from customers who are angry that S3 isn't as fast as some real filesystem solution that the customer migrated from because S3 was cheaper.

When people decide that X is a filesystem, they try to use it like it's a local, POSIX filesystem, and that's terrible because it won't be immediately obvious why it's a stupid plan.


If a customer makes an IT decision as big as running Hadoop or an RDBMS with S3 as storage ... but does not consult at least an Associate-level AWS Certified architect (who are doke a dozen) for at least one day's worth of advice, which is probably a couple of hundred dollars at most ...

Can we really blame AWS?

I am sure none of the official AWS documentation or examples show such an architecture.

----

Amazon EMR can run Hadoop and use Amazon S3 as storage via EMR FS.

"S3 mountpoints" are a feature specifically for workloads that need to see S3 as a file system.

For block and file storage workloads there are EBS, EFS, and FSx, which AWS heavily advertises.


*dime a dozen

(Apologies for typos. The "noprocrast" setting sometimes locks us out of HN right after submitting a comment, and by the time it lets us back in the comment is no longer editable.)


Yeah, it’s sort of funny how “POSIXish semantics” has become our definition of these things, when it’s just one kind of thing that’s been called a filesystem historically.


Fun experiment I made with my mum: building a storage-independent, Dropbox-like UI [1] for anything that implements this interface:

  type IBackend interface {
    Ls(path string) ([]os.FileInfo, error)
    Cat(path string) (io.ReadCloser, error)
    Mkdir(path string) error
    Rm(path string) error
    Mv(from string, to string) error
    Save(path string, file io.Reader) error
    Touch(path string) error
  }
My mum really couldn't care less about the POSIX semantics as long as she can see the pictures of my kid, which happen to be on S3.

[1] https://github.com/mickael-kerjean/filestash
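
For the curious, here is a rough sketch of what the S3 flavour of two of those methods can look like (assuming the v1 aws-sdk-go; the struct, bucket wiring, and error handling are simplified, and the other methods are left out):

  package s3backend

  import (
      "io"

      "github.com/aws/aws-sdk-go/aws"
      "github.com/aws/aws-sdk-go/aws/session"
      "github.com/aws/aws-sdk-go/service/s3"
      "github.com/aws/aws-sdk-go/service/s3/s3manager"
  )

  // S3Backend is a hypothetical partial implementation of the interface above.
  type S3Backend struct {
      bucket   string
      client   *s3.S3
      uploader *s3manager.Uploader
  }

  func New(bucket string) *S3Backend {
      sess := session.Must(session.NewSession())
      return &S3Backend{
          bucket:   bucket,
          client:   s3.New(sess),
          uploader: s3manager.NewUploader(sess),
      }
  }

  // Cat maps straight onto GetObject; the response body is already an io.ReadCloser.
  func (b *S3Backend) Cat(path string) (io.ReadCloser, error) {
      out, err := b.client.GetObject(&s3.GetObjectInput{
          Bucket: aws.String(b.bucket),
          Key:    aws.String(path),
      })
      if err != nil {
          return nil, err
      }
      return out.Body, nil
  }

  // Save maps onto an upload. Note there is no partial overwrite: the whole
  // object is replaced every time.
  func (b *S3Backend) Save(path string, file io.Reader) error {
      _, err := b.uploader.Upload(&s3manager.UploadInput{
          Bucket: aws.String(b.bucket),
          Key:    aws.String(path),
          Body:   file,
      })
      return err
  }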


Reducing things to basically the interface you laid out is the point of 9p [1], and is what Plan 9’s UNIX-but-distributed design was built on top of. Same inventor as Go! If you haven’t dived down the Plan 9 rabbit hole yet, it’s a beautiful and haunting vision of how simple cloud computing could have been.

[1] https://9fans.github.io/plan9port/man/man9/intro.html


I think this interface is less interesting than the semantics behind it, particularly when it comes to concurrency: what happens when you delete a folder, and then try and create a file in that folder at the same time? What happens when you move a folder to a new location, and during that move, delete the new or old folders?

Like yes, for your mum's use case, with a single user, it's probably not all that important that you cover those edge cases, but every time I've built pseudo-filesystems on top of non-filesystem storage APIs, those sorts of semantic questions have been where all the problems have hidden. It's not particularly hard to implement the interface you've described, but it's very hard to do it in such a way that, for example, you never have dangling files that exist but aren't contained in any folder, or that you never have multiple files with the same path, and so on.


All those considerations are important when implementing the interface but the interface itself isn't invalidated by those concerns and can cope with those constraints fine.


Can S3 murder your wife like ReiserFS and Reiser4?

https://en.wikipedia.org/w/index.php?title=Comparison_of_fil...


The problem is once you let go of those semantics, a lot of software stops working if run against such a "filesystem". If you dilute the meaning of "filesystem" too much, it becomes less useful as a term.

https://en.wikipedia.org/wiki/Andrew_File_System was interesting, and I'd actually love to see something similar re-implemented with modern ideas, but it's more of a direct-access archival system than a general-purpose filesystem[1]; you can't just put files written by arbitrary software on it. It's a bit like NFS without locks & leases, but even less like a normal filesystem; only really good for files created once that "settle down" into effectively being read-only.

[1]: I wrote https://github.com/bazil/plop that is (unfortunately undocumented) content-addressed immutable file storage over object storage, used in conjunction with a git repo with symlinks to it to manage the "naming layer". See https://bazil.org/doc/ for background, plop is basically a simplification of the ideas to get to working code easier. Site hasn't been updated in almost a decade, wow. It's in everyday use though!


Exactly, especially since the concept of a filesystem was defined before internet scale was even a thing.

Maybe S3 isn't a filesystem according to this definition, but does it really matter to make it one? I doubt it. Elastic File System is also an AWS product, but you can't really work with it the way you would with a local filesystem: an ls on any folder with over 20k files will basically time out. Does that make EFS a filesystem or not?


S3 doesn't implement the VFS API, but you can treat it as a software-defined storage filesystem, just like Ceph.

Many applications depend on file storage, such as MySQL, but horizontal scaling for those apps is still difficult in many cases. Replacing the VFS API with S3 storage is, in my experience, becoming a trend.


> Filesystem software, especially databases, can't be ported to Amazon S3

Except they can be. You don't need to overwrite the whole DB file on every INSERT/UPDATE/DELETE; those can be (and often are) stored in memory and periodically checkpointed. You might lose some writes if the process goes down between checkpoints, but for a lot of applications that's entirely acceptable.

Indeed, for SQLite in particular there are tools like Litestream that support replication to and restoration from S3.

Alternately, you could split the DB across multiple files, and then an INSERT/UPDATE/DELETE would only need to overwrite the files actually affected. This is already how server-based RDBMSs usually work.
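
As a toy illustration of that checkpoint pattern (not how Litestream itself works; the interval, names, and snapshot function here are made up, using the v1 aws-sdk-go):

  package main

  import (
      "bytes"
      "log"
      "time"

      "github.com/aws/aws-sdk-go/aws"
      "github.com/aws/aws-sdk-go/aws/session"
      "github.com/aws/aws-sdk-go/service/s3/s3manager"
  )

  // checkpointLoop re-uploads the whole serialized state as one object on a
  // timer. Writes made between checkpoints are lost if the process crashes,
  // which is the trade-off described above.
  func checkpointLoop(up *s3manager.Uploader, bucket, key string, snapshot func() ([]byte, error)) {
      for range time.Tick(10 * time.Second) {
          data, err := snapshot() // serialize the current in-memory state
          if err != nil {
              log.Printf("snapshot failed: %v", err)
              continue
          }
          if _, err := up.Upload(&s3manager.UploadInput{
              Bucket: aws.String(bucket),
              Key:    aws.String(key),
              Body:   bytes.NewReader(data),
          }); err != nil {
              log.Printf("checkpoint upload failed: %v", err)
          }
      }
  }

  func main() {
      up := s3manager.NewUploader(session.Must(session.NewSession()))
      checkpointLoop(up, "my-backups", "db/checkpoint.bin", func() ([]byte, error) {
          return []byte("serialized database state"), nil // stand-in snapshot
      })
  }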


Random note: Has anyone noticed how fast the author's webpage is? I know it's static, but I mean it's fast even for the DNS lookup. I would love to know what they're hosting it on.


The response headers include

server: cloudflare

You said it though - the reason is that it's static, without any JS/framework/SPA round-trip requests.


Could be using Cloudflare Pages hosted on an R2 bucket: https://pages.cloudflare.com/


Full stack Cloudflare is really fast


There is an nginx server running on debian stable somewhere in the dark heart of Germany. But I do have numerous tricks (too many, probably) to keep things quick.

But there are still ways to be quicker. For example, the header photo is smaller than the vector diagrams on the page, by about tenfold.


I feel like a lot of applications could use S3, but due to latency needs they typically build a layer that sits in front, which basically writes logs out to SSDs and then tiers to S3. If S3 offered a fast, reasonably priced Append() API, that would probably go a long way toward capturing those use cases.


It can be a file system.

I've written my own FUSE filesystem that uses Rabin chunking and stores the data (and metadata) in S3. The C++/AWS SDK FUSE is connected to a Go SMB server that runs locally on my Mac and works with (local) Time Machine.

I use Wasabi for cost and speed reasons.


Check out kopia.io, backup software that uses S3 to store files as blocks or pages.

You can browse, search and sort the files and directories of the different snapshots or versions of a file.

I love it !

For me it's a file system in S3.

Bonus: you must use a key to encrypt the files.


It can be misused as a filesystem. S3 wants objects? Here are some 512 or 4096 byte objects called clusters…


It seems like they're moving away from this with S3 directory buckets and Express One Zone.


I absolutely loved this article. Super well written with interesting insights.


That is very kind - glad to hear that you enjoyed it.


This screwdriver makes for a horrible hammer.


> Filesystem software, especially databases, can't be ported to Amazon S3

This seems mistaken. Porting databases that run on local disk to S3 seems like a good way to get a lashing from https://aphyr.com/

Can any databases do it correctly?

If so, I doubt they work with the model of partial overwrites. They probably have to do something very custom, and either sacrifice a lot of tail latency, or their uptime is capped by the uptime of a single AWS availability zone. Doesn't seem like a great design.

(copy of lobste.rs comment)


My employer (Neon) offers Postgres databases that run on top of a couple of caching layers at the end of which there is S3: https://neon.tech/docs/introduction/architecture-overview

Directly exposing every write to S3 would give you the partial overwrite issues as described, but one can collect a bunch of traffic and push state to S3 once it reaches a threshold. In the meantime, recent writes in the Postgres WAL are held outside of S3 in a replicated on-disk cache.


Thanks for the link.

But I searched the docs for "durability" and got zero results. Before I use anything like this, I'd like to see what durability settings are used:

https://www.postgresql.org/docs/current/non-durability.html

Litestream documents its data loss window; it seems like Neon should too:

https://litestream.io/tips/

By default, Litestream will replicate new changes to an S3 replica every second. During this time where data has not yet been replicated, a catastrophic crash on your server will result in the loss of data in that time window.

I also searched for "data loss" and got zero results -- this is important because Neon is almost certainly sacrificing durability for performance.


Neon handles that by staging the WAL segments on 3x replicated Safekeeper nodes. Durability relies on not having all of those blow up at the same time. I'd expect it to be much safer than traditional Postgres replication mechanisms (with the trade-off having a comparatively large minimum node count; Neon really is built for multitenancy where that cost can be amortized across lots of databases).


> I searched the docs for "durability" and got zero results.

The link I gave above explains it, right in the sentences that mention "durability":

> Safekeepers are responsible for durability of recent updates. Postgres streams Write-Ahead Log (WAL) to the Safekeepers, and the Safekeepers store the WAL durably until it has been processed by the Pageservers and uploaded to cloud storage.

> Safekeepers can be thought of as an ultra reliable write buffer that holds the latest data until it is processed and uploaded to cloud storage. Safekeepers implement the Paxos protocol for reliability.


OK thanks, I glossed over that part. But really what I look for is (1) explicit claims about durability, and (2) a third party (e.g. Aphyr) having actually tested the claims.

If there's no claim, then it's impossible to test :)

In particular, there are no numbers in the description you quoted.

Litestream gives a relatively weak claim, but it could be tested, which actually gives me more confidence in it.

If you look at what aphyr writes, a lot of it is claims from vendors that turned out to be false - https://aphyr.com/tags/jepsen


It’s been a while, but I really like the way google handles its file system internally. No confusion.


Holy crap I hate Hacker News! Why tf did this get a downvote? Colossus is a good fs.


JFC the people on this thread missing the difference between object storage and a blocks-and-inodes filesystem is alarming


S3 is a key value store. Just happens to be able to store really large values.


The limitations of S3 (and all the cloud "file systems") are quite astonishing when you consider you're paying for it as a premium service.

Try to imagine your astonishment if a traditional storage vendor showed up and told you that the very expensive premium file system they had just sold you:

    - can't store log files, because it can't append
      anything to an existing file
    - can't copy files larger than 5GB
    - can't rename or move a file
 
When challenged on how you are supposed to make all your applications work with limitations like that, they glibly told you "oh you're supposed to rewrite them all".


They're not filesystems though; they're object storage, or key/value storage if you will. It's intended to store the log files for the long term once they're full.

You can rename / move a file, but it involves copying and deleting the original; I don't understand why they don't have a shortcut for that, but it probably makes sense that the user of the service is aware of the process instead of hiding it.

I'm not sure about the 5GB limit; it's probably documented somewhere why that is. Possibly, like tweets, having an upper limit helps them optimize things. Anyway, there too there are tools: you can do multipart copies, and there's this official blog post on the subject: https://aws.amazon.com/blogs/storage/copying-objects-greater...

Interesting to note, maybe, in the context of the post: copy, rename, moving large files, all of that could be abstracted away, but that would hide the underlying logic (which might lead to inefficient usage of the service) and, worse, make users think it's just a filesystem and use it accordingly, when it's not intended or designed for that use case.
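
To make the rename point concrete, the "shortcut" everyone ends up writing themselves looks roughly like this (my own sketch with the v1 aws-sdk-go; bucket and key names are made up, and multipart copy for objects over 5GB is left out):

  package main

  import (
      "log"

      "github.com/aws/aws-sdk-go/aws"
      "github.com/aws/aws-sdk-go/aws/session"
      "github.com/aws/aws-sdk-go/service/s3"
  )

  // rename fakes a rename by copying and then deleting. It is not atomic
  // (both keys exist for a while), it gets slower as the object gets bigger,
  // and above 5GB it needs multipart copy instead of a single CopyObject.
  func rename(svc *s3.S3, bucket, oldKey, newKey string) error {
      _, err := svc.CopyObject(&s3.CopyObjectInput{
          Bucket: aws.String(bucket),
          // CopySource should be URL-encoded for keys with special characters.
          CopySource: aws.String(bucket + "/" + oldKey),
          Key:        aws.String(newKey),
      })
      if err != nil {
          return err
      }
      _, err = svc.DeleteObject(&s3.DeleteObjectInput{
          Bucket: aws.String(bucket),
          Key:    aws.String(oldKey),
      })
      return err
  }

  func main() {
      svc := s3.New(session.Must(session.NewSession()))
      if err := rename(svc, "my-bucket", "dir/file.jpg", "other-dir/file.jpg"); err != nil {
          log.Fatal(err)
      }
  }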


The current limit is 5TB. The 5GB limit is for a single upload; you can however do a multipart upload to get up to the maximum object size of 5TB.

https://aws.amazon.com/s3/faqs/


Amazon doesn't market S3 as a replacement for file systems; that's why EBS exists.

Also, is S3 really “very expensive”? Relative to what?


Object storage like S3 is usually the cheapest storage, not only on Amazon but on other clouds too. I don't understand why.


This is not true in my experience: https://www.backblaze.com/cloud-storage/pricing


That Backblaze page (not surprisingly) compares their prices to a fairly expensive S3 pricing tier and makes other assumptions in Backblaze's favour. For some use cases B2 is more expensive, e.g. one copy of my backups goes to AWS Glacier Deep Archive, which is really cheap.


It's for building things on top of. If you want to rename/move/copy data, implement a layer that maps objects to "filenames" or any metadata you like (or use some lib). If you want to write logs, implement append and rotation. But I, for example, don't and won't need any of that, and if it helps keep the API simpler and more reliable then I benefit.

Being a conventional filesystem would, for S3, be either a very leaky abstraction or a completely different product.


S3 is object storage, not a file system. The file system offering in AWS is called EFS. S3 is not positioned as a substitute for file systems, either.


It's not a filesystem, but it has better semantics for distributed operation because of it. Nobody talks about the locking semantics of S3 because it's at the blob level; that rules out whole categories of problems.

And that's also why you can't append. If you had multiple readers while appending, and appending to multiple replicas, guaranteeing that each reader would see a consistent only-forwards read of the append is extremely hard. So simply ban people from doing that and force them to use a different system designed for the purpose of logging.

Microservices. S3 is for blobs. If you want something that isn't a blob, use a different microservice.


These “file systems” are not file systems and I don’t understand why people expect them to be.

Some people are creating tools that make those services easier to sync with file systems, but that is not the intended use anyway.


My big pet peeve is AWS adding buttons in the UI to make "folders".

It is also a fiction! There are no folders in S3.

> When you create a folder in Amazon S3, S3 creates a 0-byte object with a key that's set to the folder name that you provided. For example, if you create a folder named photos in your bucket, the Amazon S3 console creates a 0-byte object with the key photos/. The console creates this object to support the idea of folders.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-...
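
In other words, "create folder" boils down to a single put of an empty object (a sketch using the v1 aws-sdk-go; the bucket and key are made up):

  package main

  import (
      "bytes"
      "log"

      "github.com/aws/aws-sdk-go/aws"
      "github.com/aws/aws-sdk-go/aws/session"
      "github.com/aws/aws-sdk-go/service/s3"
  )

  func main() {
      svc := s3.New(session.Must(session.NewSession()))

      // The console's "create folder" is just a zero-byte object whose key
      // ends in a slash. Nothing hierarchical exists on the server side.
      _, err := svc.PutObject(&s3.PutObjectInput{
          Bucket: aws.String("my-bucket"),
          Key:    aws.String("photos/"),
          Body:   bytes.NewReader(nil),
      })
      if err != nil {
          log.Fatal(err)
      }
  }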


Hmm, well, there are no folders, but if you interact with the object the URL does become nested. So in a sense it does behave exactly like a folder for all intents and purposes when dealing with it that way. It depends what API you use, I guess.

I use S3 just as a web bucket of files (I know it's not the best way to do that, but it's what I could easily obtain through our company's processes). In this case it makes a lot of sense, though I try to avoid making folders. Other people using the same hosting do use them.


Except stuff like the s3 CLI has all these weird names for normal filesystem items, and you have to bang your head to figure out what it all means

(also don't get me started on the whole s3api thing)


Is that really so different from how folders work on other systems? A directory inode is just an inode.


Yes. It is, in practice, incredibly different.

Imagine you have a file named /some/dir/file.jpg.

In a filesystem, there’s an inode for /some. It contains an entry for /some/dir, which is also an inode, and then in the very deepest level, there is an inode for /some/dir/file.jpg. You can rename /some to /something_else if you want. Think of it kind of like a table:

  +-------+--------+----------+-------+
  | inode | parent |     name |  data |
  +-------+--------+----------+-------+
  |     1 | (null) |     some | (dir) |
  |     2 |      1 |      dir | (dir) |
  |     3 |      2 | file.jpg |  jpeg |
  +-------+--------+----------+-------+
In S3 (and other object stores), the table is like this:

  +-------------------+------+
  | key               | data |
  +-------------------+------+
  | some/dir/file.jpg | jpeg |
  +-------------------+------+
The kind of queries you can do is completely different. There are no inodes in S3. There is just a mapping from keys to objects. There’s an index on these keys, so you can do queries—but the / character is NOT SPECIAL and does not actually have any significance to the S3 storage system and API. The / character only has significance in the UI.

You can, if you want, use a completely different character to separate “components” in S3, rather than using /, because / is not special. If you want something like “some:dir:file.jpg” or “some.dir.file.jpg” you can do that. Again, because / is not special.
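
You can see this in the API: the console's folder view is just a prefix query with a delimiter the client chooses (a sketch using the v1 aws-sdk-go; bucket and prefix are made up, and you could pass ":" as the delimiter to get ":"-separated "folders" instead):

  package main

  import (
      "fmt"
      "log"

      "github.com/aws/aws-sdk-go/aws"
      "github.com/aws/aws-sdk-go/aws/session"
      "github.com/aws/aws-sdk-go/service/s3"
  )

  func main() {
      svc := s3.New(session.Must(session.NewSession()))

      // "List the folder some/dir/": just a prefix query. The delimiter is
      // whatever the client says it is; S3 itself attaches no meaning to "/".
      out, err := svc.ListObjectsV2(&s3.ListObjectsV2Input{
          Bucket:    aws.String("my-bucket"),
          Prefix:    aws.String("some/dir/"),
          Delimiter: aws.String("/"),
      })
      if err != nil {
          log.Fatal(err)
      }
      for _, p := range out.CommonPrefixes {
          fmt.Println("pseudo-folder:", *p.Prefix) // e.g. some/dir/sub/
      }
      for _, obj := range out.Contents {
          fmt.Println("object:", *obj.Key) // e.g. some/dir/file.jpg
      }
  }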


Except, S3 does let you query by prefix and so the keys have more structure than the second diagram implies: they’re not just random keys, the API implies that common prefixes indicate related objects.


That’s kind of stretching the idea of “more structure” to the breaking point, I think. The key is just a string. There is no entry for directories.

> the API implies that common prefixes indicate related objects.

That’s something users do. The API doesn’t imply anything is related.

And prefixes can be anything, not just directories. If you have /some/dir/file.jpg, then you can query using /some/dir/ as a prefix (like a directory!) or you can query using /so as a prefix, or /some/dir/fil as a prefix. It’s just a string. It only looks like a directory when you, the user, decide to interpret the / in the file key as a directory separator. You could just as easily use any other character.


One operation where this difference is significant is renaming a "folder". In UNIX (and even UNIX-y distributed filesystems like HDFS) a rename operation at "folder" level is O(1) as it only involves metadata changes. In S3, renaming a "folder" is O(number of files).


> In S3, renaming a "folder" is O(number of files).

More like O(max(number of files, total file size)). You can’t rename objects in S3. To simulate a rename, you have to copy an object and then delete the old one.

Unlike renames in typical file systems, that isn’t atomic (there will be a time period in which both the old and the new object exist), and it becomes slower the larger the file.


From reading the above, if you have a folder 'dir' and a file 'dir/file', after renaming 'dir' to 'folder', you would just have 'folder' and 'dir/file'.


There is really no such thing as a folder in S3.

If you have something which is dir/file, then NORMALLY “dir” does not exist at all. Only dir/file exists. There is nothing to rename.

If you happen to have something which is named “dir”, then it’s just another file (a.k.a. object). In that scenario, you have two files (objects) named “dir” and “dir/file”. Weird, but nothing stopping you from doing that. You can also have another object named “dir///../file” or something, although that can be inconvenient, for various reasons.


Imho, renaming "folders" on S3 results in copying and deleting O(number of files)


Exactly.


> That’s something users do. The API doesn’t imply anything is related.

Querying ids by prefix doesn’t make any sense for a normal ID type. Just making this operation available and part of your public API indicates that prefixes are semantically relevant to your API’s ID type.


“Prefix” is not the same thing as “directory”.

I can look up names with the prefix “B” and get Bart, Bella, Brooke, Blake, etc. That doesn’t imply that there’s some kind of semantics associated with prefixes. It’s just a feature of your system that you may find useful. The fact that these names have a common prefix, “B”, is not a particularly interesting thing to me. Just like if I had a list of files, 1.jpg, 10.jpg, 100.jpg, it’s probably not significant that they’re being returned sequentially (because I probably want 2.jpg after 1.jpg).


By this logic the file "foo/bar/" corresponds to the filename "f:o:o:/:b:a:r:/" (using a different character as separator)


Exactly


"filesystem" is not a name reserved for Unix-style file systems. There are many types of file system which is not built on according to your description. When I was a kid, I used systems which didn't support directories, but it was still file systems.

It's an incorrect take that a system to manage files must follow a set of patterns like the ones you mentioned to be called "file system".


Terms evolve, and now "filesystem" and "system of files" mean different things.

I would argue that not supporting folders or many other file operations make something not a filesystem today.


You're free to argue whatever you want, but claiming that a file system should have folders as the parent commenter did, or support specific operations, seems a bit meaningless.

I could create a system not supporting folders because it relies on tags or something else. Or I could create a system which is write-only and doesn't support rename or delete.

These systems would be file systems according to how the term has been used for 40 (?) years at least. Just don't see any point in restricting the term to exclude random variants.


Yeah, "hacker" used to not mean someone hacking into a computer and breaking a password; then it did; and now it means both that and a tech tinkerer.


Thank you, now I understand what the special 0-byte object refers to. It represents an empty folder.

Fair enough, basing folders on object names split by / is pretty inefficient. I wonder why they didn't go with a solution like git's trees.


> Fair enough, basing folders on object names split by / is pretty inefficient. I wonder why they didn't go with a solution like git's trees.

What, exactly, is inefficient about it?

Think for a moment about the data structures you would use to represent a directory structure in a filesystem, and the data structures you would use to represent a key/value store.

With a filesystem, if you split a string /some/dir/file.jpg into three parts, “some”, “dir”, “file.jpg”, then you are actually making a decision about the tree structure. And here’s a question—is that a balanced tree you got there? Maybe it’s completely unbalanced! That’s actually inefficient.

Let’s suppose, instead, you treat the key as a plain string and stick it in a tree. You have a lot of freedom now, in how you balance the tree, since you are not forced to stick nodes in the tree at every / character.

It's just a different efficiency tradeoff. Certain operations are now much less efficient (like "rename a directory", which, on S3, is actually "copy a zillion objects"). Some operations are more efficient, like "store a file" or "retrieve a file".


I think it is fair to say that S3 (as named files) is not a filesystem and it is inefficient to use it directly as such for common filesystem use cases; the same way that you could say it for a tarball[0].

This does not make S3 a bad storage, just a bad filesystem, not everything needs to be a filesystem.

Arguably it is good that S3 is not a filesystem, as that can be a leaky abstraction, e.g. in git you cannot have two tags named "v2" and "v2/feature-1", because you cannot have both a file and a folder with the same name.

For something more closely related to URLs than filenames, forcing a filesystem abstraction is a limitation, as "/some/url", "/some/url/", and "/some/url/some-default-name-decided-by-the-webserver" can be different.[1]

[0] where a different tradeoff is that searching a file by name is slower but reading many small files can be faster.

[1] maybe they should be the same, but enforcing it is a bad idea


I think what you’re describing is simply not a hierarchical file system. It’s a different thing that supports different operations and, indeed, is better or worse at different operations.


Renaming a folder is inefficient.


> […] what the special 0-byte object refers to. It represents an empty folder.

Alas, no. It represents a tag, e.g. «folder/», that points to a zero-byte object.

You can then upload two files, e.g. «folder/file1.txt» and «folder/file2.txt», delete the «folder/» tag, and still have «folder/file1.txt» and «folder/file2.txt» intact in the S3 bucket.

Deleting «folder/» in a traditional file system, on the other hand, will also delete «file1.txt» and «file2.txt» in it.


It's a matter of client UI implementation. You can't delete a non-empty folder with the POSIX API on common filesystems, or over FTP, either.

However, there are file managers, FTP clients, and S3 clients that will do that for you by deleting individual files.


But if the S3 semantics are not helping you, e.g. with multiple clients doing copy/move/delete operations in the hierarchy, you could still end up with files that are not in "directories".

So essentially an S3 file manager must be able to handle the situation where there are files without a "directory", and that, I assume, is also the most common case for S3. You might as well not have the "directories" in the first place.


I have personally never seen the 0-byte files people keep talking about here. In every S3 bucket I’ve ever looked at, the “directories” don’t exist at all. If you have a dir/file1.txt and dir/file2.txt, there is NO such object as dir. Not even a placeholder.


Yeah, this post was the first one I had even heard of them.


Deleting folder/ in a traditional file system will _fail_ if the folder is not empty. Userspace needs to recurse over the directory structure to unlink everything in it before unlinking the actual folder.


"folders" do not exist in S3 -- why do you keep insisting that they do?

They appear to exist because the key is split on the slash character for navigation in the web front-end. This gives the familiar appearance of a filesystem, but the implementation is at a much higher level.


Let’s start with the fact that you’re talking to an HTTP api… Even if S3 had web3.0 inodes, the querying semantics would not make sense. It’s a higher level API, because you don’t deal with blocks of magnetic storage and binary buffers. Of course s3 is not a filesystem, that is part of its definition, and reason to be…


I think if you focus too narrowly on the details of the wire protocol, you’ll lose sight of the big picture and the semantics.

S3 is not a filesystem because the semantics are different from the kind of semantics we expect from filesystems. You can’t take the high-level API provided by a filesystem, use S3 as the backing storage, and expect to get good performance out of it unless you use a ton of translation.

Stuff like NFS or CIFS are filesystems. They behave like filesystems, in practice. You can rename files. You can modify files. You can create directories.


Right, NFS/CIFS support writing blocks, but S3 basically does HTTP GET and PUT verbs. I would say that these concepts are the defining difference. To call S3 a filesystem is not wrong in the abstract, but it's no different than calling Wordpress a filesystem, or DNS, or anything that stores something for you. Of course it will be inefficient to implement a block write on top of any of these; that's because you have to literally do it yourself, as in: download the file, edit it, upload it again.


I think the blocks are one part of it, and the other part is that S3 doesn’t support renaming or moving objects, and doesn’t have directories (just prefixes). Whenever I’ve seen something with filesystem-like semantics on top of S3, it’s done by using S3 as a storage layer, and building some other kind of view of the storage on top using a separate index.

For example, maybe you have a database mapping file paths to S3 objects. This gives you a separate metadata layer, with S3 as the storage layer for large blocks of data.


Even youngsters are yelling at clouds now. Just a different kind of cloud.


Another challenge is directory flattening. On a file system "a/b" and "a//b" are usually considered the same path. But on S3 the slash isn't a directory separator, so the paths are distinct. You need to be extra careful when building paths not to include double slashes.

Many tools end up handling this by showing a folder named "a" containing a folder named "" (empty string). This confuses users quite a bit. It's more than the inodes, it's how the tooling handles the abstraction.
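
A small defensive habit that avoids the double-slash problem (a sketch; the helper name is made up):

  package main

  import (
      "fmt"
      "strings"
  )

  // joinKey builds an S3 key without accidentally producing double slashes,
  // since "a/b" and "a//b" are distinct keys as far as S3 is concerned.
  // (path.Join would also collapse the slashes, but it strips trailing
  // slashes too, which matters for "folder" placeholder keys.)
  func joinKey(base, name string) string {
      return strings.TrimSuffix(base, "/") + "/" + strings.TrimPrefix(name, "/")
  }

  func main() {
      fmt.Println(joinKey("a/", "/b")) // prints "a/b", not "a//b"
  }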


Coincidentally I ran into an issue just like this a week ago. A customer facing application failed because there was an object named “/foo/bar” (emphasis on the leading slash).

This created a prefix named “/“ which confused the hell out of the application.


In S3 each file is identified with a full path.

Not only can you not rename a single file, you also cannot rename a "folder" (because that would imply a bulk rename of a large number of children of that "folder").

This is the fundamental difference between a first class folder and just a convention on prefixes of full path names.

If you don't allow renames, it doesn't really make sense to have each "folder" store the list of its children.

You can instead have one giant ordered map (some kind of B-tree) that allows for efficient lookup and scanning of neighbouring nodes.


The UMich LDAP server, upon which many were based, stored entries' hierarchical (distinguished) names with each entry, which I always found a bit weird. AD, eDirectory, and the OpenLDAP HDB backend don't have this problem.


You can create a simulated directory, and write a bunch of files in it, but you can't atomically rename it--behind the scenes each file needs to be copied from old name to new.


The payload still contains a list of other inodes though


This!

I'm fine with it, and I actually appreciate the logic and simplicity behind it, but the number of times I've tried to explain why "folders" on S3 keep disappearing while people stare at me like I'm an idiot is really frustrating.

(When you remove the last file in a "folder" on S3, the "folder" disappears, because that prefix no longer appears in the bucket's k/v dictionary, so there's no reason to show it; it never existed in the first place.)


Weird that it says folders now. I remember it being very strictly called a prefix when I was at AWS.


I think it's just the web console; it's still "prefix" in the APIs and CLI.

https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObje...


The web console even collapses them like folders on slashes, further obfuscating how it actually works. I remember having to explain to coworkers why it was so slow to load a large bucket.


What exactly do you think a folder is? It’s just an abstraction for organising data.


S3 doesn’t have that abstraction.

The console UI shows folders but they don’t actually exist in S3. They’re made up by the UI.


It sounds like they have that abstraction in the UI. But if the CLI and API don't have it too, that's weird.


Yeah, the UI and CLI show you “folders”. It’s a client-side thing that doesn’t exist in the actual service. Behind the scenes, the clients are making specific types of queries on the object keys.

You can’t examine when a folder was created (it doesn’t exist in the first place), you can’t rename a folder (it doesn’t exist), you can’t delete a folder (again, it doesn’t exist).


That's just an implementation detail of well known filesystems.


Yes, which is why it's not ideal to reuse the folder metaphor here. Users have an idea how directories work on well-known filesystems and get confused when these fake folders don't behave the same way.


Are all your S3 keys opaque strings (like UUIDs)? Do you use / (slash) in your keys?

If you truly believe S3 has absolutely no connection to folders, you would answer Yes and No.


It sounds to me like you’re arguing about what the definition of “folders” is.

“Any hierarchical path structure is a folder” is maybe your definition of “folder”, from what I can tell. I would say that S3 lets you treat paths as hierarchical, but that S3 does not have folders—obviously I have a different definition of “folder” than you do.

We’ve discovered that we have different definitions of “folder”, and therefore, we are not going to agree about whether it is true that “S3 does not have folders” unless we have an argument about what the correct definition of “folder” is. I’m not really interested in that discussion—it’s enough to understand what somebody means when they say “S3 does not have folders” even if you think their definitions are wrong.


I don’t think that’s a defensible standpoint.

Folders are an important part of the way most people use filesystems.


If you can't rename or delete a folder, yeah, I would say folders don't really exist.


Similarly, the UI in Linux is making up the notion of folders and the files in them, but we don't say they don't exist.


No, they're not made up. A folder (or directory) is a specific type of inode, just as a file is.

S3 doesn't have folders. The UI fakes them by creating a 0-byte object (or file, if you will). It's a kludge.


The UI will fake them without even creating the 0-byte object.


Directories actually exist on the filesystem, which is why you have to create them before use and they can exist and be empty. They don't exist in S3 and neither of those properties do, either. Similarly, common filesystem operations on directories (like efficiently renaming them, and thus the files under them) are not possible in S3.

Of course it can still be useful to group objects in the S3 UI, but it would probably be better to use some kind of prefix-centric UI rather than reusing the folder metaphor when it doesn't match the paradigm people are used to.


Speaking of user interfaces with optical illusions about directory separators:

On the Mac, the Finder lets you have files with slashes in their names, even though it's a Unix file system underneath. Don't believe me? Go try to use the Finder to make a directory whose name is "Reports from 2024/03/10". See?

But as everyone knows, slash is the ONLY character you're not allowed to have in a file or directory name under Unix. It's enforced in the kernel at the system call interface. There is absolutely no way to make a file with a slash in it. Yet there it is!

The original MacOS operating system used the ":" character to delimit directory names, instead of "/", so you could have files and directories with slashes in their names, just not with colons in their names.

When Apple transitioned from MacOS to Unix, they did not want to freak out their users by renaming all their files.

So now try to use the Finder (or any app that uses the standard file dialog) to make a folder or file with a ":" in its name on a modern Mac. You still can't!

So now go into the shell and list out the parent directory containing the directory you made with a slash in its name. It's actually called "Reports from 2024:03:10"!

The Mac Finder and system file dialog user interfaces actually switch "/" and ":" when they show paths on the screen!

Try making a file in the shell with colons in it, then look at it in the finder to see the slashes.

However, back in the days of the old MacOS that permitted slashes in file names, there was a handy network gateway box called the "Gatorbox" that was a Localtalk-to-Ethernet AFP/NFS bridge, which took a subtly different approach.

https://en.wikipedia.org/wiki/GatorBox

It took advantage of the fact (or rather it triggered the bug) that the Unix NFS implementation boldly made an end-run around the kernel's safe system call interface that disallowed slashes in file names. So any NFS client could actually trick Unix into putting slashes into file names via the NFS protocol!

It appeared to work just fine, but then down the line the Unix "restore" command would totally shit itself! Of course "dump" worked just fine, never raising an error that it was writing corrupted dumps that you would not be able to read back in your time of need, so you'd only learn that you'd been screwed by the bug and lost all your files months or years later!

So not only does NFS stand for "No File Security", it also stands for "Nasty Forbidden Slashes"!

https://news.ycombinator.com/item?id=31820504

>NFS originally stood for "No File Security".

>The NFS protocol wasn't just stateless, but also securityless!

>Stewart, remember the open secret that almost everybody at Sun knew about, in which you could tftp a host's /etc/exports (because tftp was set up by default in a way that left it wide open to anyone from anywhere reading files in /etc) to learn the name of all the servers a host allowed to mount its file system, and then in a root shell simply go "hostname foo ; mount remote:/dir /mnt ; hostname `hostname`" to temporarily change the CLIENT's hostname to the name of a host that the SERVER allowed to mount the directory, then mount it (claiming to be an allowed client), then switch it back?

>That's right, the server didn't bother checking the client's IP address against the host name it claimed to be in the NFS mountd request. That's right: the protocol itself let the client tell the server what its host name was, and the server implementation didn't check that against the client's ip address. Nice professional protocol design and implementation, huh?

>Yes, that actually worked, because the NFS protocol laughably trusted the CLIENT to identify its host name for security purposes. That level of "trust" was built into the original NFS protocol and implementation from day one, by the geniuses at Sun who originally designed it. The network is the computer is insecure, indeed.

[...]

From the Unix-Haters Handbook:

https://archive.org/stream/TheUnixHatersHandbook/ugh_djvu.tx...

Don't Touch That Slash!

UFS allows any character in a filename except for the slash (/) and the ASCII NUL character. (Some versions of Unix allow ASCII characters with the high-bit, bit 8, set. Others don't.)

This feature is great — especially in versions of Unix based on Berkeley's Fast File System, which allows filenames longer than 14 characters. It means that you are free to construct informative, easy-to-understand filenames like these:

1992 Sales Report

Personnel File: Verne, Jules

rt005mfkbgkw0.cp

Unfortunately, the rest of Unix isn't as tolerant. Of the filenames shown above, only rt005mfkbgkw0.cp will work with the majority of Unix utilities (which generally can't tolerate spaces in filenames).

However, don't fret: Unix will let you construct filenames that have control characters or graphics symbols in them. (Some versions will even let you build files that have no name at all.) This can be a great security feature — especially if you have control keys on your keyboard that other people don't have on theirs. That's right: you can literally create files with names that other people can't access. It sort of makes up for the lack of serious security access controls in the rest of Unix.

Recall that Unix does place one hard-and-fast restriction on filenames: they may never, ever contain the magic slash character (/), since the Unix kernel uses the slash to denote subdirectories. To enforce this requirement, the Unix kernel simply will never let you create a filename that has a slash in it. (However, you can have a filename with the 0200 bit set, which does list on some versions of Unix as a slash character.)

Never? Well, hardly ever.

    Date: Mon, 8 Jan 90 18:41:57 PST 
    From: sun!wrs!yuba!steve@decwrl.dec.com (Steve Sekiguchi) 
    Subject: Info-Mac Digest V8 #3 5 

    I've got a rather difficult problem here. We've got a Gator Box run- 
    ning the NFS/AFP conversion. We use this to hook up Macs and 
    Suns. With the Sun as a AppleShare File server. All of this works 
    great! 

    Now here is the problem, Macs are allowed to create files on the Sun/ 
    Unix fileserver with a "/" in the filename. This is great until you try 
    to restore one of these files from your "dump" tapes, "restore" core 
    dumps when it runs into a file with a "/" in the filename. As far as I 
    can tell the "dump" tape is fine. 

    Does anyone have a suggestion for getting the files off the backup 
    tape? 

    Thanks in Advance, 

    Steven Sekiguchi Wind River Systems 

    sun!wrs!steve, steve@wrs.com Emeryville CA, 94608
Apparently Sun's circa 1990 NFS server (which runs inside the kernel) assumed that an NFS client would never, ever send a filename that had a slash inside it and thus didn't bother to check for the illegal character. We're surprised that the files got written to the dump tape at all. (Then again, perhaps they didn't. There's really no way to tell for sure, is there now?)


I'm having a lot of fun imagining this being said to a kid who's trying to buy some folders for school.


Is it an abstraction for requesting the data you want, or an abstraction for storing the data in a retrievable manner?


I don't know why you are being downvoted, what you said is true and confuses many newcomers.


I see you getting downvotes, but you’re speaking the honest truth, here.


[flagged]


Their writing seems okay to me; which specific parts do you find atrocious?


Author here, I can't reply to the GP because that comment is "dead" but yes, please, be specific!

Then I can fix the sentences that are bad and perhaps also improve in the future.


"Even though the file API handles all those concerns, but it doesn't expose them to you. A narrow interface handling a large number of concerns - that makes the unix file API a "deep" module."

Both sentences here are incomplete, incoherent. I did not read past this point.


Thanks. I take your point on the first one and have corrected it (maybe you need to shift+F5 to bust your cache in order to see it).

For the second one, what's your objection? That it's fragmentary?


I don't know what it's trying to say. Making it a complete sentence would be a good first step. Don't try fancy stuff like this unless showing off style is more important to you than communicating coherently. In technical writing, little is more important than clarity.


A narrow interface handling a large number of concerns... makes the UNIX file API a "deep" module.

Seems self-explanatory to me. "Deep module" [0][1] is a well-defined term.

[0]: https://dev.to/gosukiwi/software-design-deep-modules-2on9

[1]: https://www.amazon.com/Philosophy-Software-Design-John-Ouste...


Honestly the only minor criticism I can see of the OP's writing is to remove things that make it seem more disjointed than it is - dashes, unqualified pronouns (what is "it"? AWS? UNIX file system API? a particular module? all modules?). That's all.


Thanks for your advice. I agree and will try to improve on this for the future


[flagged]


sure seems like it



