S3 is files, but not a filesystem (calpaterson.com)
571 points by todsacerdoti on March 10, 2024 | 430 comments


> I haven't heard of people having problems [with S3's Durability] but equally: I've never seen these claims tested. I am at least a bit curious about these claims.

Believe the hype. S3's durability is industry leading and traditional file systems don't compare. It's not just the software - it's the physical infrastructure and safety culture.

AWS' availability zone isolation is better than the other cloud providers. When I worked at S3, customers would beat us up over pricing compared to GCP blob storage, but the comparison was unfair because Google would store your data in the same building (or maybe different rooms of the same building) - not with the separation AWS did.

The entire organization was unbelievably paranoid about data integrity (checksum all the things) and bigger events like natural disasters. S3 even operates at a scale where we could detect "bitrot" - random bit flips caused by gamma rays hitting a hard drive platter (roughly one per second across trillions of objects iirc). We even measured failure rates by hard drive vendor/vintage to minimize the chance of data loss if a batch of disks went bad.

I wouldn't store critical data anywhere else.

Source: I wrote the S3 placement system.


What’s your experience like at other storage outfits?

I only ask because your post is a bit like singing Cinnabon's praises for making their own dough.

The things that you mentioned are standard storage company activities.

Checksum-all-the-things is a basic feature of a lot of file systems. If you can already set up your home computer to detect bitrot and alert you, you can bet big storage vendors do it.

Keeping track of hard drive failure rates by vendor is normal. Storage companies publicly publish their own reports. The tiny 6-person IT operation I was in had a spreadsheet. Hell, I toured a friend’s friend’s major data center last year and he managed to find time to talk hard drive vendors. Now you. I get it — y’all make spreadsheets.

There are a lot of smart people working on storage outside AWS and long before AWS existed.


When I worked at Google in storage, we had our own figures of merit that showed that we were the best and Amazon's durability was trash in comparison to us.

As far as I can tell, every cloud provider's object store is too durable to actually measure ("14 9's"), and it's not a problem.


9's are overblown. When cloud providers report that, they're really saying "Assuming random hard drive failure at the rates we've historically measured, and how quickly we detect and fix those failures, what's the mean time to data loss?".

But that's burying the lede. By far the greatest risks to a file's durability are: 1. Bugs (which aren't captured by a durability model). This is mitigated by deploying slowly and having good isolation between regions. 2. An act of God that wipes out a facility.

The point of my comment was that it's not just about checksums. That's table stakes. The main driver of data loss for storage organizations with competent software is safety culture and physical infrastructure.

My experience was that S3's safety culture is outstanding. In terms of physical separation and how "solid" the AZs are, AWS is overbuilt compared to the other players.


That was not how we treated the 9's at Google. Those had been tested through natural experiments (disasters).

I was not at Google for the Clichy fire, but it wasn't the first datacenter fire Google experienced. I think your information about Google's data placement may be incorrect, or you may be mapping AWS concepts onto Google internal infrastructure in the wrong way.


Do you mean Google included "acts of God" when computing 9's? That's definitely not right.

11 9's of durability means mean time to data loss of 100 billion years. Nothing on earth is 11 9's durable in the face of natural (or man-made) disasters. The earth is only 4.5 billion years old.
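To make the arithmetic explicit, here is a back-of-the-envelope sketch (Python) that treats the 11-nines figure as an annual loss probability for a single object, which is the simplification being described:

    # 11 nines of durability: annual probability of losing a given object
    annual_loss_probability = 1 - 0.99999999999      # ~1e-11

    # Naive mean time to data loss for that one object, in years
    mttdl_years = 1 / annual_loss_probability
    print(f"{mttdl_years:.0e} years")                # ~1e+11, i.e. 100 billion years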


Normally, companies store more than 1 byte of data, and the 9's (not just for data loss, for everything) are ensemble averages.

By the way, I don't doubt that AWS has plenty of 9's by that metric - perhaps more than GCP.


I would not lose sleep over storing data on GCS, but have heard from several Google Cloud folks that their concept of zones is a mirage at best.


Yeah, that's definitely true. Google sort of mapped an AWS concept onto its own cluster splits. However, there are enough regional-scale outages at all the major clouds that I don't personally place much stock in the idea of zones to begin with. The only way to get close to true 24/7 five-9's uptime with clouds is to be multi-region (and preferably multi-cloud).


I have experienced many outages that were contained to a specific availability zone in AWS, from power failures to flooding to cable cuts. You are correct that 5 9’s still requires multi-region though.


I think Google as a whole also has pretty good diversity. But Cloud customers demanded regions in big population centers and smaller countries that Google had traditionally avoided for cost reasons. This led to less redundant sites that were often owned and/or operated by third parties. So in the US and Europe you can probably trust GCP zones quite literally. But in other regions (I have heard lots of rumours about APAC) they may not be as diverse as they appear.


> big population centers and smaller countries

Can we stop this dance on HN? Can you just name them, please?


I think most Googlers actually don't know the specifics (I certainly don't), and if they did, they probably couldn't tell you. It's sort of common knowledge that some of them are like this, but not exactly which ones.


See: that fire in France that took down a whole region

But on the other hand, GCP supports multi-region so that's not nearly as big of a deal as it would be if AWS zones were not sufficiently isolated


If I were to upload a 50kb object to S3 (standard tier), about how many unique physical copies would exist?


At least 3.


At least 3, in at least 3 separate datacenters.

According to https://nuclearsecrecy.com/nukemap/ - it'd take at least a 1 megaton warhead to take out two of the ap-southeast-2 datacenters, and over 10MT to take out 3.

I suspect you'd need a lot less than that though, the 1MT warhead would probably take out enough outside-the-datacenter infrastructure to take the entire AZ offline. I don't care too much though, if someone's dropped a warhead that close to home I have other things to worry about than whether all the cat pictures and audit logs survive.


> if someone's dropped a warhead that close to home I have other things to worry about than whether all the cat pictures and audit logs survive.

Speak for yourself. Many of us love our audit logs and show them to strangers whenever we can.


I’m picturing you having a slideshow of audit logs that you make guests to your home sit down and watch with you, like the vacation pictures slideshow of old.


Yes but my audit logs are special and everyone just loves them, although they playfully act bored


scrolling through them like it's the matrix can't be all that boring!


Awwwww! Check out this cute little IAM audit log! Look at its funny little fizzy privilege escalation! I just want to scratch its belly until it p0wns the whole prod deployment.


Oh, like you don’t?

We’re among friends here.


A 1MT warhead will take out the infrastructure and make the data not Available. However, the data still exists in the 3rd datacenter, so Durability isn't compromised - but yes, we don't need the cat pictures when one lands that close to home :)


I'm guessing that even though the 3rd datacenter is about 35km away from the other 2, and so the building isn't in the expected destruction zone of a 1MT warhead, the damage to the city's electricity/water/network infrastructure would take the 3rd datacenter offline as well - so while your cat pictures are probably still in existence on the no-longer-spinning rust there, they'd be inaccessible for quite some time.


I think less. Why would AWS store 3 full copies when they could use Reed-Solomon codes (or similar) to bring that number down to 2x or 1.5x and save a lot of storage space?


Reed-Solomon would allow you to lose any one of the three and recover. Losing two would be catastrophic.

AWS's guarantee is that you can lose two of the three copies and still have all the data. You can't do that without three complete copies.
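A minimal sketch of the storage-overhead arithmetic being debated here, assuming a generic k-data/m-parity erasure-coding scheme (this is not anything S3 documents about its actual layout):

    def storage_overhead(data_shards, parity_shards):
        # Bytes stored per byte of user data; the object survives the
        # loss of up to `parity_shards` shards.
        return (data_shards + parity_shards) / data_shards

    # 3x replication: 1 data "shard" plus 2 full copies, tolerates 2 losses
    print(storage_overhead(1, 2))    # 3.0

    # Reed-Solomon-style coding, e.g. 10 data + 4 parity shards:
    # tolerates the loss of any 4 shards at only 1.4x overhead
    print(storage_overhead(10, 4))   # 1.4

Which is why, as the sibling comment suggests, large object stores tend to erasure-code rather than keep literal full copies for long.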


Probably for just a short period of time before it’s erasure coded


9's are useful when they're backed by an actual SLA - like GCP Cloud Storage and AWS S3 availability SLAs. Neither one commits to any durability SLAs whatsoever so I wouldn't put any stock into the 'eleventy nine nines' durability claims.


0. user error (deletion or overwriting a file they regret later, possibly much later)

-1. government, the historical cause of most data loss

1½. google canceling the product and deleting all the data, as they did with google+


> As far as I can tell, every cloud provider's object store is too durable to actually measure ("14 9's"), and it's not a problem.

How can you tell that if it's not measurable?

As far as I can tell the '11/14 9s' durability numbers are more or less completely made up. That's why AWS doesn't offer any actual durability SLA for S3, only a 99.9% availability SLA[0].

[0] https://aws.amazon.com/s3/sla/


buddy you are mixing availability and durability


Sorry I am not buying the personal anecdote as the public numbers from both orgs tell a different story. When reliability and long term support come up in conversation, Google is not a name to reach for.


Note that I said "durability," not any other reliability metric. GCP is pretty well-known for its outages and abysmal support. It's a reputation they want to change, but they did earn it.

However, Google is very good at not losing data.


Yup still not buying it. People who want their production stable, secure, and durable do not choose Google.


As a point of order: Not all Cinnabon locations make their own dough. The one I worked at the summer of 2001 made their own dough and frosting that summer, but switched to premade rolls partially that holiday season, and fully to premade rolls and frosting by the 2002 holiday season.

Also: you had to be eighteen or older to operate the mixer. It was something like a 60 quart machine, and all the recipes were pre-programmed for time and power with pauses to change from the paddle to the whip if needed.


The beauty of proprietary systems is that the only information we can ever expect to get about how they’re built is the biased information we get from the builders of those systems.


This was a few years ago, but blob storage on GCP had a global outage due to an outage in a single zone. That, among numerous other issues with GCP, lost my confidence entirely. Maybe it’s better now.


"Checksum-all-the-things is a basic feature of a lot of file systems"

"A lot"? Does anything but ZFS and maybe btrfs do this? Ext4 anf XFS — two very common filesystems — still don't have data checksums.


Bcachefs does, and LVM also has a way to do it.

Unfortunately I’m not aware of any filesystem that does it while maintaining the full bandwidth of a modern NVMe. Not even with the extra reads factored in; on ZFS I get 800 MB/s max.


ZFS should absolutely be able to go faster. Even with lz4 compression I get writes above 5 GB/s on an older 32-core EPYC CPU, and that is with mostly random, already-compressed data. That write speed is limited by RAIDZ2 on top of not-the-fastest drives (6 PCIe 3.0 Intel U.2 drives from 2 years ago).


Maybe I've configured something wrong, then. I'll do some tests.


it's well known and not debatable that Cinnabon is fire


"AWS' availability zone isolation is better than the other cloud providers."

Not better than all of them.

A geo-redundant rsync.net account exists in two different states (or countries) - for instance, primary in Fremont[1] and secondary in Denver.

"S3 even operates at a scale where we could detect "bitrot""

That is not a function of scale. My personal server running ZFS detects bitrot just fine - and the scale involved is tiny.

[1] he.net headquarters


Backing up across two different regions is possible for any provider with two "regions" but requires either doubling your storage footprint or accepting a latency hit because you have to make a roundtrip from Fremont to Denver.

The neat thing about AWS' AZ architecture is that it's a sweet spot in the middle. They're far enough apart for good isolation, which provides durability and availability, but close enough that the network round trip time is negligible compared to the disk seek.

Re: bit rot, I mean the frequency of events. If you've got a few disks, you may see one flip every couple of years. They happen frequently enough in S3 that you can form expectations about the arrival rate and alarm when it deviates from those expectations.


> The neat thing about AWS' AZ architecture is that it's a sweet spot in the middle

What may be less of a sweet spot is AWS' pricing.


Sending the data to /dev/null is the cheapest option if that’s all you care about.


Seems the snark detector just went off :)

Back on topic, I'd hope all of us would expect value for money for any and all services we recommend or purchase. Search for "site:news.ycombinator.com Away From AWS" to find dozens of discussions on how to save money by leaving AWS.

EDIT: just one article of the many I've read recently:

"What I’ve always found surprising about egress is just how expensive it is. On AWS, downloading a file from S3 to your computer once costs 4 times more than storing it for an entire month"

https://robaboukhalil.medium.com/youre-paying-too-much-for-e...


And that is egress which works as expected, unlike the AWS S3 denial of wallet attack...: https://news.ycombinator.com/item?id=39625029


> They're far enough apart for good isolation, which provides durability and availability

It can't possibly be enough for critical data though, right? I'm guessing a fire in 1 is unlikely to spread to another, but could it affect the availability of another? What about a deliberate attack on the DCs or the utilities supplying the DCs?


> but could it affect the availability of another

Availability is a different beast than durability. I think people are paranoid here about durability instead of availability.

S3 advertises four nines of availability and eleven nines of durability.


Yes, if a terrorist blows up all of the several Amazon DCs holding your data, your data will be lost. This is true no matter how many DCs are holding your data, who owns them, or where they are. You can improve your chances, of course.

There have been region-wide availability outages before. They're pretty rare and make worldwide news media due to how much of the internet they take out. I don't think there's been S3 data loss since they got serious about preventing S3 data loss.


> the network round trip time is negligible compared to the disk seek

Only for spinning rust, right?


Yes, which is what all the hyperscalers use for object storage. HDD seek time is ~10ms. Inter-AZ network latency is a few hundred microseconds.



Yes, but S3 has single-region redundancy that is better than GCP's. Your data in two AZs in one region is in two physically separate buildings, so multi-region matters less for durability.


Agree.

> S3 even operates at a scale where we could detect "bitrot" - random bit flips caused by gamma rays hitting a hard drive platter (roughly one per second across trillions of objects iirc).

I would expect any cloud provider to be able to detect bitrot these days.


I think the point the OP was trying to make is that they regularly detected bitrot due to their scale, not that they were merely capable of doing so.


Ah, thank you. This makes more sense. And I think I remember reading about it once. Apologies for the misinterpretation!


Everyone with significant scale and decent software regularly detects bitrot.


How does the latest ZFS bug impact your bitrot statement?

I mean, technically it’s not bitrot if zeros were accidentally written out instead of data.


Probably none because they didn't update to the exact version that had the bug


Checksumming the data is not born out of paranoia; it's simply a result of having to detect which blocks are unusable in order to run the Reed-Solomon algorithm.

I'd also assume that a sufficient number of these corruption events are used as a signal to "heal" the system by migrating the individual data blocks onto different machines.

Overall, I'd say the things that you mentioned are pretty typical of a storage system, and are not at all specific to S3 :)


The S3 checksum feature applies to the objects, so that’s entirely orthogonal to erasure codes. Unless you know something I don’t and SHA256 has commutative properties. You’d still need to compute the object hash independent of any blocks.

Source: https://docs.aws.amazon.com/AmazonS3/latest/userguide/checki...


It's not entirely orthogonal; RAID5 plus stripe-level CRC (or better) can reliably correct bitrot at any single position in a stripe whereas RAID5 alone can only report an error. My guess is that S3 and other large object stores have the equivalent of stripe-level checksums for this purpose.


I’m positive something like this is the case. Yet it’s entirely orthogonal to the object hash in the user facing feature, which would need to be computed separately.


For append-only or write-once objects or for BLAKE-3 and other fully parallelizable hashes it's possible to store the intermediate hash function state with each chunk or stripe so that the final bytes of the data, once the hash is finished, yield the user-facing checksum as well.
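A toy illustration of that idea for a sequential hash (Python's hashlib objects can be snapshotted with .copy(); the chunk layout here is hypothetical):

    import hashlib

    chunks = [b"chunk-0", b"chunk-1", b"chunk-2"]   # hypothetical object split into chunks

    running = hashlib.sha256()
    saved_states = []          # intermediate hash state stored alongside each chunk
    for chunk in chunks:
        running.update(chunk)
        saved_states.append(running.copy())

    # The state saved with the last chunk already yields the user-facing checksum,
    # with no need to re-read the earlier chunks.
    assert saved_states[-1].hexdigest() == hashlib.sha256(b"".join(chunks)).hexdigest()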


> customers would beat us up over pricing compared to GCP blob storage, but the comparison was unfair because Google would store your data in the same building

I don’t think this is true. Per the Google Cloud Storage docs, data is replicated across multiple zones, and each zone maps to a different cluster. https://cloud.google.com/compute/docs/regions-zones/zone-vir...


Zones are about correlated power and networking failures. Regions are about disasters. If you want multiple regions, Google can of course do that too:

https://cloud.google.com/storage/docs/locations#consideratio...


Google puts multiple clusters in a single building.


Seems you’re right. They say each zone is a separate failure domain but you kind of have to trust their word on that.


Flashback to that Clichy datacenter fire near Paris...


> Believe the hype.

I'd rather believe the test results.

Is there a neutral third-party that has validated S3's durability/integrity/consistency? Something as rigorous as Jepsen?

It'd be really neat if someone compared all the S3 compatible cloud storage systems in a really rigorous way. I'm sure we'd discover that there are huge scary problems. Or maybe someone already has?


But they asked if the claims were audited by an unbiased third party. Are there such audits?

Alternatively, AWS does publicly provide legally binding availability guarantees, but I have never seen any prominently displayed legally binding durability guarantees. Are these published somewhere less prominently?


> Alternatively, AWS does publicly provide legally binding availability guarantees, but I have never seen any prominently displayed legally binding durability guarantees. Are these published somewhere less prominently?

It's listed prominently in the public docs: https://aws.amazon.com/s3/storage-classes/


I read that page and it does not provide any contractual durability guarantees as far as I can see. It provides "designed for availability" and then contractual availability SLA guarantees. It provides "designed for durability", but presents no contractual durability guarantee as far as I can see.

Given that their lawyers clearly indicate that "designed for availability" is not what they are contractually obligated to provide, only the letter of the SLA does that; "designed for durability" is similarly a marketing statement that does not incur any contractual obligations. Is there some specific statement in that document that I am missing which indicates that data durability is not fully at their convenience?


SLAs are more of a financial construct than anything else. Once the payback cost of missing an SLA is built into the contract then it just becomes a conversation about money. I've been at plenty of shops that obviously tried to hit the SLA but if it was missed it just became a financial issue which helped smooth over what otherwise might have been a trust buster.

I would never ever think of an SLA as anything more than a financial commitment - if you think more of it you'll eventually be in a world of hurt.


Obviously, as I made no mention of specific performance in the event of breach of contract the remedy for failure to meet contractual obligations would be damages.

The question is: What level of durability does AWS contractually guarantee where a failure to provide that level results in a breach of contract that may incur damages and where, specifically, in the documentation do they specify that?


My first job was at a startup in 2012 where I was expected to build things at a scale way over what I really had the experience to do. Anyways the best choice I ever made was using RDS and S3 (and django).


Not a public cloud, but storage at Facebook is similar in terms of physical infrastructure, safety culture, and scale.


I also worked at AWS, but not on the S3 team. However, I was a Tech Evangelist and met with literally thousands of customers over my 6-year tenure. S3 was one of the hottest topics, and I got a sense of how good and robust it was directly from these customers.

What you say resonates really well with me, and what I've heard during these years.


> and bigger events like natural disasters

Outdated anecdata: I've worked for a company that lost some parts of buckets after the lightning strike incident in 2011, which bumped up the paranoia quite a bit. AFAIK the same thing couldn't have happened for more than a decade now.


Google discovered random bit flips caused by gamma rays.


Correct me if I'm wrong but bitrot only affects spinning rust since NAND uses ECC?

If you see this I wonder if S3 is planning on adding hardlinks?


Pretty much any modern storage medium depends on a healthy amount of error correcting code.


NAND is constantly moving your data around to prevent it from bit-rotting. If you leave data too long without moving it, you may not be able to read it back from the NAND.


> And listing files is slow. While the joy of Amazon S3 is that you can read and write at extremely, extremely, high bandwidths, listing out what is there is much much slower. Slower than a slow local filesystem

This misses something critical. Yes, s3 has fast reading and writing, but that’s not really what makes it useful.

What makes it useful is listing. In an unversioned bucket (or one with no delete markers), listing any given prefix is essentially constant time: I can take any given string, in a bucket with 100 billion objects, and say “give me the next 1000 keys alphabetically that come after this random string”.

What’s more, using “/“ as a delimiter is just the default - you can use any character you want and get a set of common prefixes. There are no “directories”, ”directories” are created out of thin air on demand.

This is super powerful, and it’s the thing that lets you partition your data in various ways, using whatever identifiers you need, without worrying about performance.

If listing were just "slow", couldn't list on key prefixes, and got slower in proportion to the number of keys (i.e. a traditional Unix filesystem), then it wouldn't be useful at all.
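A sketch of what that kind of call looks like with boto3 (the bucket name and key are made up):

    import boto3

    s3 = boto3.client("s3")

    # "Give me the next 1000 keys alphabetically after this arbitrary string" --
    # no directory has to exist for this to work.
    resp = s3.list_objects_v2(
        Bucket="example-bucket",
        StartAfter="logs/2024-03-10T12:34:56Z",
        MaxKeys=1000,
    )
    for obj in resp.get("Contents", []):
        print(obj["Key"])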


I have to say that I'm not hugely convinced. I don't really think that being able to pull out the keys before or after a prefix is particularly impressive. That is the basis for database indices going back to the 1970s after all.

Perhaps the use-cases you're talking about are very different from mine. That's possible of course.

But for me, often the slow speed of listing the bucket gets in the way. Your bucket doesn't have to get very big before listing the keys takes longer than reading them. I seem to remember that listing operations ran at sub-1 Mbps, but admittedly I don't have a big bucket handy right now to test that.


It depends on a few factors. The list objects call hides deleted and noncurrent versions, but it has to skip over them. Grouping prefixes also takes time, if they contain a lot of noncurrent or deleted keys.

A pathological case would be a prefix with 100 million deleted keys, and 1 actual key at the end. Listing the parent prefix takes a long time in this case - I’ve seen it take several minutes.

If your bucket is pretty "normal" and doesn't have this, or isn't versioned, then you can do 4-5 thousand list requests a second, at any given key/prefix, in constant time. Or you can explicitly list object versions (and not skip deleted keys), also in constant time.

It all depends on your data: if you need to list all objects then yeah it’s gonna be slow because you need to paginate through all the objects. But the point is that you don’t have to do that if you don’t want to, unlike a traditional filesystem with a directory hierarchy.

And this enables parallelisation: why list everything sequentially, when you can group the prefixes by some character (e.g. "-") and then process each of those prefixes in parallel?

The world is your oyster.


We and our customers use S3 as a POSIX filesystem, and we generally find it faster than a local filesystem for many benchmarks. For listing directories we find it faster than Lustre (a real high-performance filesystem). Our approach is to first try listing directories with a single ListObjectsV2 (which on AWS S3 returns keys in lexicographic order) and, if it hasn't made much progress, to start listing with parallel ListObjectsV2 calls. Once you start parallelising the ListObjectsV2 (rather than sequentially "continuing") you get massive speedups.


> find it faster than a local filesystem for many benchmarks.

What did you measure? How did you compare? This claim seems very contrary to my experience and understanding of how things work...

Let me refine the question: did you measure metadata or data operations? What kind of storage medium is used by the filesystem you use? How much memory (and subsequently the filesystem cache) does your system have?

----

The thing is: you should expect something like 5 ms latency on network calls over the Internet in an ideal case. Within the datacenter, maybe you can achieve sub-ms latency, but that's hard. AWS within a region but across different zones tends to be around 1 ms latency.

Meanwhile NVMe latency, even on consumer products, is 10-20 microseconds, i.e. roughly 100 times faster than anything going over the network can offer.


For AWS, we're comparing against filesystems in the datacenter - so EBS, EFS and FSx Lustre. Compared to these, you can see in the graphs where S3 is much faster for workloads with big files and small files: https://cuno.io/technology/

and in even more detail of different types of EBS/EFS/FSx Lustre here: https://cuno.io/blog/making-the-right-choice-comparing-the-c...


The tests are very weird...

Normally, from someone working in storage, you'd expect tests to be in IOPS, and the go-to tool for reproducible tests is FIO. I mean, of course "reproducibility" is a very broad subject, but people are so used to this tool that they develop a certain intuition and interpretation for it and its results.

On the other hand, seeing throughput figures is kinda... it tells you very little about how the system performs. Just to give you some reasons: a system can be configured to do compression or deduplication on the client or the server, and this will significantly impact your throughput depending on what you actually measure: the amount of useful information presented to the user, or the amount of information transferred. Also, throughput at the expense of higher latency may or may not be a good thing. Really, if you ask anyone who has ever worked on a storage product how they could crank up throughput numbers, they'd tell you: "write bigger blocks asynchronously". That's the basic recipe, if that's what you want. Whether this makes a good all-around system or not... I'd say probably not.

Of course, there are many other concerns. Data consistency is a big one, and it's a typical tradeoff when choosing between an object store and a filesystem: the filesystem offers more consistency guarantees, whereas the object store can do certain things faster by relaxing them.

BTW, I don't think most readers would understand Lustre and similar to be a "local filesystem", since it operates over the network, and network performance will have a significant impact - which also puts it in the same ballpark as other networked systems.

I'd also say that Ceph is kinda missing from this benchmark... Again, if we are talking about filesystem on top of object store, it's the prime example...


IOPS is a really lazy benchmark that we believe can greatly diverge from most real life workloads, except for truly random I/O in applications such as databases. For example, in Machine Learning, training usually consists of taking large datasets (sometimes many PBs in scale), randomly shuffling them each epoch, and feeding them into the engine as fast as possible. Because of this, we see storage vendors for ML workloads concentrate on IOPS numbers. The GPUs, however, only really care about throughput. Indeed, we find a great many applications only really care about throughput, and IOPS is only relevant if it helps to accomplish that throughput.

For ML, we realised that the shuffling isn't actually random - there's no real reason for it to be random versus pseudo-random. And if it's pseudo-random then it is predictable, and if it's predictable then we can exploit that to great effect - yielding a 60x boost in throughput on S3, beating out a bunch of other solutions. S3 is not going to do great for truly random I/O; however, we find that most scientific, media and finance workloads are actually deterministic or semi-deterministic, and this is where cunoFS, by peering inside each process, can better predict intra-file and inter-file access patterns, so that we can hide the latencies present in S3.

At the end of the day, the right benchmark is the one that reflects real world usage of applications, but that's a lot of effort to document one by one.

I agree that things like dedupe and compression can affect things, so in our large file benchmarks each file is actually random. The small file benchmarks aren't affected by "write bigger blocks" because there's nothing bigger than the file itself. Yes, data consistency can be an issue, and we've had to do all sorts of things to ensure POSIX consistency guarantees beyond what S3 (or compatible) can provide. These come with restrictions (such as on concurrent writes to the same file on multiple nodes), but so does NFS. In practice, we introduced a cunoFS Fusion mode that relies on a traditional high-IOPS filesystem for such workloads and consistency (automatically migrating data to that tier), and high throughput object for other workloads that don't need it.


> And if it's pseudo-random then it is predictable, and if it's predictable then we can exploit that to great effect

This is an interesting hack. However, an IOP is an IOP: no matter how well you predicted and prefetched it to hide the latency, it's still going to be translated into a GetObject.

I think what you really exploited here is that even though S3 is built on HDDs (and have very low IOPS per TiB) their scale is so large that even if you milk 1M+ IOPS out of it AWS still doesn't care and is happy to serve you. But if my back-of-envelope calculation is correct this isn't going to work well if everyone starts to do it.

How do you get around S3's 5.5k GET per second per prefix limit? If I only have ~200 20GiB files can you still get decent IOPS out of it?

and...

> IOPS is a really lazy benchmark that we believe can greatly diverge from most real life workloads

No, it's not. I have a workload training a DL model on time series data which demands 600k 8KiB IOPS per compute instance. None of the things I tested worked well. I had to build a custom solution with bare-metal NVMes.


Sorry for the late response - I didn't see your comment until now.

Our aim is to unleash all the potential that S3/Object has to offer for file system workloads. Yes, the scale of AWS S3 helps, as does erasure coding (which enhances flexibility for better load balancing of reads).

Is it suitable for every possible workload? No, which is why we have a mode called cunoFS Fusion where we let people combine a regular high-performance filesystem for IOPS, and Object for throughput, with data automatically migrated between the two according to workload behaviour. What we find is that most data/workloads need high throughput rather than high IOPS, and this tends to be the bulk of data. So rather than paying for PBs of ultra-high IOPS storage, they only need to pay for TBs of it instead. Your particular workload might well need high IOPS, but a great many workloads do not. We do have organisations doing large scale workloads on time-series (market) data using cunoFS with S3 for performance reasons.


EFS is ridiculously slow though. Almost to the point where I fail to see how it’s actually useful for any of the traditional use cases for NFS.


> EFS is ridiculously slow though. Almost to the point where I fail to see how it’s actually useful for any of the traditional use cases for NFS.

Would you care to elaborate on your experience or use case a bit more? We've made a lot of improvements over the last few years (and are actively working on more), and we have many happy customers. I'd be happy to give a perspective of how well your use case would work with EFS.

Source: PMT turned engineer on EFS, with the team for over 6 years


Unfortunately I can’t say too much publicly on HN. But one of the big shortcomings is dealing with hundreds of files. It doesn’t even matter if those are big or small files (I’ve had experience with both).

Services like DataSync show that the underlying infra can be performant. But it feels almost impossible to replicate that on EFS via standard POSIX APIs. And unfortunately one of our use cases depend upon that.

It feels, to me at least, like EFS isn't where AWS's priorities lie. At least if you compare EFS to FSx Lustre and recent developments in S3 - both of which are the direction our AWS SAs have pushed us in.


if you turn all the EFS performance knobs up (at a high cost), it's quite fast.


Faster, sure. But I wouldn't go so far as to say it is fast.


Have you tried it recently? Because we've made it a lot faster over the years.


More recently and for more use cases and varied workflows than most people. But that’s as much as I can say without getting people to sign an NDA.

Our AWS spend is high enough to warrant a very close working relationship with AWS so this is something we have worked with you guys on already.


S3 is really high latency though. I store parquet files on S3 and querying them through DuckDB is much slower than a local filesystem because of the random access patterns. I can see S3 being decent for bulk access but definitely not for random access.

This is why there’s a new S3 Express offering that is low latency (but costs more).


It can't be a POSIX filesystem if it doesn't meet POSIX filesystem guarantees. I worked on an S3-compatible object store at a large storage company and we also had distributed filesystem products. Those are completely different animals due to the different semantics and requirements. We've also built compliant filesystems over object stores and the other way around. Certain operations, like write-append, are tricky to simulate over object stores (S3 didn't use to support append; I haven't really stayed up to date, does it now?). At least when I worked on this it wasn't possible to simulate POSIX semantics over S3 at all without adding additional object store primitives.


> Once you start parallelising the ListObjectsV2 (rather than sequentially "continuing")

How are you "parallelizing" the ListObjectsV2? The continuation token can only be fed in once the previous ListObjectsV2 response has completed, unless you know the names or structure of the keys ahead of time, in which case listing objects isn't necessary.


For example, you can do separate parallel ListObjectsV2 calls for keys starting with a-f, g-k, etc., covering the whole key space. You can parallelize recursively based on what is found in the first 1000 entries so that it matches the statistics of the keys. Yes, there may be pathological cases, but in practice we find this works very well.
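A rough sketch of that keyspace-splitting approach with boto3 (bucket and split points are hypothetical; real code would recurse and rebalance the splits):

    import boto3
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.client("s3")
    BUCKET = "example-bucket"

    def list_range(start_after, stop_before=None):
        # Sequentially paginate keys sorting after `start_after`,
        # stopping once keys pass `stop_before` (if given).
        keys, token = [], None
        while True:
            kwargs = {"Bucket": BUCKET}
            if token:
                kwargs["ContinuationToken"] = token
            elif start_after:
                kwargs["StartAfter"] = start_after
            resp = s3.list_objects_v2(**kwargs)
            for obj in resp.get("Contents", []):
                if stop_before is not None and obj["Key"] > stop_before:
                    return keys
                keys.append(obj["Key"])
            if not resp.get("IsTruncated"):
                return keys
            token = resp["NextContinuationToken"]

    # Carve the keyspace into independent ranges and list them in parallel.
    boundaries = ["", "g", "n", "t", None]
    ranges = list(zip(boundaries, boundaries[1:]))
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        chunks = list(pool.map(lambda r: list_range(*r), ranges))
    all_keys = sorted(key for chunk in chunks for key in chunk)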


You're right that it won't work for all use cases, but starting two threads with prefixes A and M, for example, is one way you might achieve this.


If you think s3 is fast, you should try FTP. It’s at least a hundred times faster. And combined with rsync, dozens of times more reliable.


Neither of those are true though? Not sure if this is sarcastic or not, if so make it more clear in the future


The key difference between lexicographically keyed flat hierarchies, and directory-nested filesystem hierarchies, becomes clear based on this example:

    dir1/a/000000
    dir1/a/...
    dir1/a/999999
    dir1/b
On a proper hierarchical file system with directories as tree interior nodes, `ls dir1/` needs to traverse and return only 2 entries ("a" and "b").

A flat string-indexed KV store that only supports lexicographic order, without special handling of delimiters, needs to traverse 1 million dirents ("a/000000" through "a/999999") before arriving at "b".

Thus, simple flat hierarchies are much slower at listing the contents of a single dir: O(all recursive children), vs. O(immediate children) on a "proper" filesystem.

Lexicographic strings cannot model multi-level tree structures with the same complexity; this may give it the reputation of "listing files is slow".

UNLESS you tell the listing algorithm what the delimiter character is (e.g. `/`). Then a lexicographical prefix tree can efficiently skip over all subtrees at the next `/`.

Amazon S3 supports that, with the docs explicitly mentioning "skipping over and summarizing the (possibly millions of) keys nested at deeper levels" in the `CommonPrefixes` field: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-...

I have not tested whether Amazon's implementation actually saves the traversal (or whether it traverses and just returns fewer results), but I'd hope so.
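For reference, the delimiter-aware call looks like this in boto3 (bucket and prefix are hypothetical); nested "directories" come back summarised in CommonPrefixes rather than being enumerated:

    import boto3

    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(
        Bucket="example-bucket",
        Prefix="dir1/",
        Delimiter="/",
    )
    # Keys directly under dir1/ (e.g. "dir1/b")
    for obj in resp.get("Contents", []):
        print(obj["Key"])
    # Subtrees summarised without enumerating their (possibly millions of) children
    for cp in resp.get("CommonPrefixes", []):
        print(cp["Prefix"])      # e.g. "dir1/a/"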


For completeness: the original post says:

    S3 has no rename or move operation.
    Renaming is CopyObject and then DeleteObject.
    CopyObject takes linear time to the size of the file(s).
    This comes up fairly often when someone has written a lot of files
    to the wrong place - moving the files back is very slow.
This is right:

In a normal file system, renaming a directory is fast O(1), in S3 it's slow O(all recursive children).

And Amazon S3 has not added a delimiter-based function to reduce its complexity, even though that would be easily possible in a lexicographic prefix tree (re-rooting the subtree).

So here the original post has indeed found a case where S3 is much slower than a normal file system.
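For concreteness, "moving" a prefix ends up looking roughly like this (a boto3 sketch; bucket and prefixes are made up), which is why the cost is linear in the number of objects and bytes under the prefix:

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-bucket"

    def rename_prefix(old_prefix, new_prefix):
        # "Move" every object under old_prefix by copying it and then deleting it.
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=old_prefix):
            for obj in page.get("Contents", []):
                old_key = obj["Key"]
                new_key = new_prefix + old_key[len(old_prefix):]
                s3.copy_object(
                    Bucket=BUCKET,
                    Key=new_key,
                    CopySource={"Bucket": BUCKET, "Key": old_key},
                )
                # (objects over 5 GB would need a multipart copy instead)
                s3.delete_object(Bucket=BUCKET, Key=old_key)

    rename_prefix("wrong/place/", "right/place/")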


For 30 years now (starting with XFS in 1993, which was inspired by HPFS), all the good UNIX file systems have implemented directories as some kind of B-tree.

Therefore they do not get slower in proportion to the number of entries, and listing based on file prefixes is extremely fast.


> listing based on file prefixes is extremely fast

This functionality does not exist to my knowledge.

ext4 and XFS return directory entries in pseudo-random order (due to hashing), not lexicographically.

For an example, see e.g. https://righteousit.wordpress.com/2022/01/13/xfs-part-6-btre...

If you know a way to return lexicographical order directly from the file system, without the need to sort, please link it.


Resolving random file system paths still gets slower proportional to their depth, which is not the case for S3, where the prefix is on the entire object key and not just the "basename" part of it, like in a filesystem.


Yes they do. What APIs does Linux offer that allow you to list a directory's contents alphabetically, starting at a specific filename, in constant time? You have to iterate the directory contents.

You can maybe use “d_off” with readdir in some way, but that’s specific to the filesystem. There’s no portable way to do this with POSIX.

Regardless of whether you can do it with a single directory, you can't do it for all files recursively under a given prefix. You can't just ignore directories, or say that "for this list request, '-' is my directory separator".

The use of b-trees in file systems is completely beside the point.


The POSIX API is indeed even older, so it is not helpful.

But as you say, there are filesystem-specific methods or operating-system specific methods to reach the true performance of the filesystem.

It is likely that for maximum performance one would have to write custom directory search functions using directly the Linux syscalls, instead of using the standard libc functions, but I would rather do that instead of paying for S3 or something like it.


Yes. You could also just use a SQLite table with two columns (path, contents), then just query that. Or do any number of other things.

The question isn’t if it’s possible, because of course it is, the question is if it’s portable and well supported with the POSIX interface. Because if it’s not, then…


> The question isn’t if it’s possible, because of course it is, the question is if it’s portable and well supported with the POSIX interface. Because if it’s not, then…

Where did this goalpost come from? S3 is not portable or POSIX compliant.


From the article we're commenting on, which is comparing the interface of S3 to the POSIX interface. Not any given filesystem + platform specific interface.


The article does not mention POSIX, or anything about listing files, at all.


It mistakenly mentions UNIX whilst referencing the POSIX filesystem API, and I literally quoted where it talks about listing in my original comment.


The article starts out by making a comparison between the POSIX filesystem API calls and S3's API. The context is very much a comparison between those two API surface areas.


There are no specific syscalls that you can use for this. The libc functions and the syscalls are extremely similar.


>What makes it useful is listing.

I think 99% of S3 usage just consists of retrieving objects with known keys. It seems odd to me to consider prefix listing as a key feature.


When you embed the relevant (not necessarily that of object creation) timestamp as a prefix, it sure becomes one. Whether that prefix is part of the "path" (object/path/prefix/with/<4-digit year>/) or directly part of the basename (object/path/prefix/to/app-specific/files/<4-digit year>-<2-digit month>-....), being able to limit the search space server-side becomes incredibly useful.

You can try it yourself: list objects in a bucket prefix with lots of files, and measure the time it takes to list all of them vs. the time it takes to list only a subset of them that share a common prefix.


> ...listing any given prefix is essentially constant time: I can take any given string, in a bucket with 100 billion objects, and say “give me the next 1000 keys alphabetically that come after this random string”.

I'm not sure we agree on the definition of "constant time" here. Just because you get 1000 keys in one network call doesn't imply anything about the complexity of the backend!


Constant time regardless of the number of objects in the bucket and regardless of the initial starting position of your list request.


The technical implementation is indeed impressive in that it operates in more-or-less constant time, but probably very few use cases actually fit that narrow window, so this technical strength is moot when it comes to actual usage.

Since each request depends on the position returned by the last request, 1000 arbitrary keys on your 3rd or 1000th attempt don't really help unless you found your needle in the haystack in that request (and in that case the rest of that 1000-key listing was wasted).


You’re assuming you’re paginating through all objects from start to finish.

A request to list objects under "foo/" is a request to list all objects starting with "foo/", which is constant time regardless of the number of keys that sort before it. The same applies to "foo/bar-", or any other list request for any given prefix. There are no directories on S3.


And if for some reason you need a complete listing along with object sizes and other attributes you can get one every 24 hours with S3 inventory report.

That has always been good enough for me.


When you have billions of objects this is really the only way to go - and build workflows around inventory (including athena, spark, etc)


Is listing really such a key feature that people use it as a database to find objects?

Have not used S3, but that is not how I imagined using it.


Sure. It's kind of an index - limited to prefix-only searching, but useful.

Say you store uploads associated with a company and a user. You'd maybe naively store them as `[company-uuid]/[user-id].[timestamp]`.

If you need to list a given user's (123) uploads after a given date, you'd list keys after `[company-uuid]/123.[date]`. If you need to list all of that user's uploads, you'd list `[company-uuid]/123.`. If you need to get the set of all users who have photos, you'd list `[company-uuid]/` with a Delimiter set to `.`

The point is that it's flexible, and with a bit of thought it allows you to "remove all a user's uploads between two dates", "remove all a company's uploads" or "remove all a user's uploads" with a single call. Or whatever specific stuff is important to your use-case, that might otherwise need a separate DB.

It's not perfect - you can't reverse the listing (i.e. you can't get the latest photo for a given user by sorting descending, for example), and it needs some thought about your key structure.
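A sketch of those queries against that key layout (boto3; the bucket, company id and date format are hypothetical):

    import boto3

    s3 = boto3.client("s3")
    company = "company-uuid"          # hypothetical placeholder
    user_prefix = f"{company}/123."   # keys look like "<company-uuid>/<user-id>.<timestamp>"

    # User 123's uploads on or after a given date
    resp = s3.list_objects_v2(
        Bucket="example-bucket",
        Prefix=user_prefix,
        StartAfter=f"{user_prefix}2024-01-01",
    )
    uploads = [o["Key"] for o in resp.get("Contents", [])]

    # The set of users with uploads, without enumerating every object
    resp = s3.list_objects_v2(
        Bucket="example-bucket",
        Prefix=f"{company}/",
        Delimiter=".",
    )
    users = [cp["Prefix"] for cp in resp.get("CommonPrefixes", [])]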


But surely you need to track that elsewhere anyway?

That some niche edge-case runs efficiently doesn't sound like a defining feature of S3. On the contrary, many common operations map terribly to S3, so you kind of need the logic to be elsewhere.


My overall point can be summarised as this:

- Listing things is a very common operation to do.

- The POSIX api and the directory/file hierarchy it provides is a restrictive one.

- S3 does not suffer from this, you can recursively list and group keys into directories at “list time”.

- If you find yourself needing to list gigantic numbers of keys in one go, you can do better by only listing a subset. S3 isn’t a filesystem, you shouldn’t need to list 1k+ keys sequentially apart from during maintenance tasks.

- This is actually quite fast, compared to alternatives.

Whether or not you see a use case for this is sort of irrelevant: they exist. It's what allows you to easily put data into S3 and flexibly group/scan it by specific attributes.


Listing things is very common, so why would you outsource that to S3 when all your bookkeeping is elsewhere? It's not like you would ever rely on the POSIX API for that anyway, even for when your files actually are on a POSIX filesystem.

For sure, for maintenance tasks etc. it sounds quite useful. And good hygiene with prefixes sounds like a sane idea. But listing being a critical part of what "makes S3 useful"? That seems like a huge stretch that your points don't seem to address.


> It's not like you would ever rely on the POSIX API for that anyway, even for when your files actually are on a POSIX filesystem.

Because there is no POSIX API for this. Depending on your requirements and query patterns, you may not need a completely separate database that you have to keep in sync.


You may not need other bookkeeping. The prefix listing properties can be enough, removing the need to have two distinct systems kept in sync.


> But surely you need to track that elsewhere anyway?

Why? If the S3 structure and listing is sufficient, I don't need to store anything else anywhere else.

Many use cases may involve other requirements that S3 can't meet, such as being able to find the same object via different keys, or being able to search through the metadata fields. However, if the requirements match up with S3's structure, then additional services are unnecessary and keeping them in sync with S3 is more hassle than it's worth.


I agree, but something as simple (in functionality) as that ought to be an edge-case. Not a defining feature of S3.


It’s fundamental to how S3 works and its ability to scale, so it is a defining feature of S3.

If you think wider, a bucket itself is just a prefix.


From amazons perspective, sure!

But that's not what we are discussing.


It's a property of the system that I, as an architect, would seriously consider as part of my system's design. I've worked with many systems where iterating over items in order starting from a prefix is extremely cheap (SSTables).


No. The standard practice is to use a DynamoDB table as the index for your objects in S3.

This article misunderstood S3 and could as well have the title: "An Airplane is not a Car" :-)


So in reality S3 takes about 2 seconds to retrieve a single file, under ideal conditions. 1 second round trip for the request to DynamoDB to get the object key of the file and 1 second round trip to S3 to get the file contents (assuming no CPU cost on the search because you’re getting the key by ID from the DynamoDB in a flat single table store. And that the file has no network IO because it is a trivial number of bytes, so the HTTP header overwhelmed the content.)

I know what you’re thinking — 2 seconds, that’s faster than I can type the 300 character file key with its pseudo prefixes)!

Ah, but what if you wanted to get 2 files from S3?


… S3's response time is nowhere near 2 seconds. (Or even 1 second.) Like a sibling poster says, 50ms is a much more realistic ballpark for TTFB.


2 seconds is a nuts response time, but I guess it depends entirely on your file size. TTFB is usually 50ms.


I don't know that you can characterize that as a "standard practice".

Maybe it's widespread, but I've not encountered it.


"Building and Maintaining an Amazon S3 Metadata Index without Servers" - https://aws.amazon.com/pt/blogs/big-data/building-and-mainta...

Here is the architecture of Amazon Drive and the storage of metadata.

"AWS re:Invent 2014 | (ARC309) Building and Scaling Amazon Cloud Drive to Millions of Users" - https://youtu.be/R2pKtmhyNoA

And you can see the use here at correct time: https://youtu.be/R2pKtmhyNoA?t=546


That article is old. DynamoDB was used because of the old, weak consistency model of S3. Writes were atomic, but lists could return stale results, so you needed a consistent list of objects elsewhere.

But in 2020, S3 changed to strong consistency model. There is no need to use DynamoDB now.


The problem was not the eventual consistency model, it was the speed of the object listing.

"...Finding objects based on other attributes, however, requires doing a linear search using the LIST operation. Because each listing can return at most 1000 keys, it may require many requests before finding the object. Because of these additional requests, implementing attribute-based queries in S3 alone can be challenging..."


It was both actually, but more for the listing issue. Netflix built a lot of tooling around this.

But yeah: things like filtering on tags or created-at dates require another approach.


Hive stores metadata in a relational database. So does Snowflake.


What is it about S3 that enables this speed, and why can’t traditional Unix file systems do the same?


S3 doesn't have directories; it can be thought of as a flat, sorted list of keys.

UNIX (and all operating systems) differentiate between a file and a directory. To list the contents of a directory, you need to make an explicit call. That call might return files or directories.

So to list all files recursively, you need to list, sort, check whether each entry is a directory, and recurse. This isn't great.


Interesting - isn't this just a matter of indexing/caching the file names, though? Surely S3 must store the files somewhere and index them. There's a Unix command called `locate` that does the same thing by maintaining a local database of keys and lets you search with prefixes.[1]

Anyway, I guess this is beside the point of the original commenter above. I would disagree that listing files efficiently is the most useful part of S3. The main value prop is the fact that you can easily upload and download files from a distributed store. Most use cases involve uploading and downloading known files, not efficiently listing millions of files.

[1] https://jvns.ca/blog/2015/03/05/how-the-locate-command-works...


Code written against S3 is not portable either. It doesn't support Azure or GCP, much less some random proprietary cloud.


Actually we've found it's often much worse than that. Code written against AWS S3 using the AWS SDK often doesn't work on a great many "S3-compatible" vendors (including on-prem versions). Although there's documentation on S3, it's vague in many ways, and the AWS SDKs rely on actual AWS behaviour. We've had to deal with a lot of commercial and cloud vendors that subtly break things. This includes giant public cloud companies. In one case a giant vendor only failed at high loads, making it appear to "work" until it didn't, because its backoff response was not what the AWS SDK expected. It's been a headache that we've had to deal with for cunoFS, as well as making it work with GCP and Azure. At the big HPC conference Supercomputing 2023, when we mentioned supporting "S3-compatible" systems, we would often be told stories about applications not working with a supposedly "S3-compatible" system (from a mix of vendors).


Back in 2011 when I was working on making Ceph's RadosGW more S3-compatible, it was pretty common that AWS S3 behavior differed from their documentation too. I wrote a test suite to run against AWS and Ceph, just to figure out the differences. That lives on at https://github.com/ceph/s3-tests


What differences in behaviour from the AWS docs did you find, out of interest?


What I can dig up today is that back in 2011, they documented that bucket names cannot look like IPv4 addresses and the character set was a-z0-9.-, but they failed to prevent 192.168.5.123 or _foo.

I recall there were more edge cases around HTTP headers, but they don't seem to have been recorded as test cases -- it's been too long for me to remember the details. I may have simply run out of time / real world interop got good enough to prioritize something else.

2011 state, search for fails_on_aws: https://github.com/tv42/s3-tests/blob/master/s3tests/functio...

Current state, I can't speak to the exact semantics of the annotations today, they could simply be annotating non-AWS features: https://github.com/ceph/s3-tests/blob/master/s3tests/functio...


I've seen several S3-compatible APIs and there are open-source clients. If anything it's the de-facto standard.


GCP storage buckets implement the S3 API. You can treat them as if they were an S3 bucket. Something I do all the time.


Isn't that a limitation imposed by the POSIX APIs, though, as a direct consequence of the interface's representation of hierarchical filesystems as trees? As you've illustrated, that necessitates walking the tree. Many tools, I suppose, walk the tree via a single thread, further serializing the process. In an admittedly haphazard test, I ran `find(1)` on ext4, xfs, and zfs filesystems and saw only one thread.

I imagine there's at least one POSIX-compatible file system out there that supports another, more performant method of dumping its internal metadata via some system call or another. But then we would no longer be comparing the S3 and POSIX APIs.


You can set up CloudWatch events to trigger a Lambda function that stores metadata about the S3 object in a regular database. That way you can index it however you expect to list it.

Very effective for our use case.
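
For illustration, a minimal sketch of such a handler with the Go Lambda runtime and SDK v2. It assumes the bucket is configured to send event notifications to the function; the table name and attribute names are made up, and error handling is trimmed:

  package main

  import (
    "context"
    "strconv"

    "github.com/aws/aws-lambda-go/events"
    "github.com/aws/aws-lambda-go/lambda"
    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    ddbtypes "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
  )

  var db *dynamodb.Client

  func init() {
    cfg, err := config.LoadDefaultConfig(context.Background())
    if err != nil {
      panic(err)
    }
    db = dynamodb.NewFromConfig(cfg)
  }

  // handler writes one row per created object so listings can later be
  // served from the database instead of paging through ListObjectsV2.
  func handler(ctx context.Context, evt events.S3Event) error {
    for _, rec := range evt.Records {
      _, err := db.PutItem(ctx, &dynamodb.PutItemInput{
        TableName: aws.String("s3-index"), // hypothetical table with partition key "key"
        Item: map[string]ddbtypes.AttributeValue{
          "key":    &ddbtypes.AttributeValueMemberS{Value: rec.S3.Object.Key},
          "bucket": &ddbtypes.AttributeValueMemberS{Value: rec.S3.Bucket.Name},
          "size":   &ddbtypes.AttributeValueMemberN{Value: strconv.FormatInt(rec.S3.Object.Size, 10)},
        },
      })
      if err != nil {
        return err
      }
    }
    return nil
  }

  func main() { lambda.Start(handler) }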


> And listing files is slow. While the joy of Amazon S3 is that you can read and write at extremely, extremely, high bandwidths, listing out what is there is much much slower. Slower than a slow local filesystem.

I was taken aback by this recently. At my coworkers request, I was putting some work into a script we have to manage assets in S3. It has a cache for the file listing, and my coworker who wrote it sent me his pre-populated cache. My initial thought was “this can’t really be necessary” and started poking.

We have ~100,000 root level directories for our individual assets. Each of those has five or six directories with a handful of files. Probably less than a million files total, maybe 3 levels deep at its deepest.

Recursively listing these files takes literally fifteen minutes. I poked and prodded at suggestions from Stack Overflow and ChatGPT for potential ways to speed up the process and got nothing notable. That's absurdly slow. Why on earth is it so slow?

Why is this something Amazon has not fixed? From the outside really seems like they could slap some B-trees on the individual buckets and call it a day.

If it is a difficult problem, I’m sure it would be for fascinating reasons I’d love to hear about.


S3 is fundamentally a key value store. The fact that you can view objects in “directories” is nothing more than a prefix filter. It is not a file system and has no concept of directories.


Directories are what make a filesystem hierarchical, but they're not a necessary condition for being a filesystem. A filesystem at its core is just a way of organizing files. If you're storing and organizing files in S3 then it's a filesystem for you. Saying it's "fundamentally a key value store" as if that were something different is confusing, because a filesystem is just a key value store mapping paths to file contents.

Indeed there's every reason to believe that a modern file system would perform significantly faster if the hierarchy were implemented as a prefix filter rather than by actually maintaining hierarchical data structures (at least for most operations). You can guess this might be the case from the fact that file creation is extremely slow on modern file systems (on the order of hundreds or maybe thousands per second on a modern NVMe disk that can otherwise do millions of IOPS), and listing the contents of an extremely large directory is exceedingly slow.


In context of the comment I was addressing, it’s clear that filesystem means more than just a key value store. I’d argue that this is generally true in common vernacular.


This is a technical website discussing the nuances of filesystems. Common vernacular is how you choose to define it but even the Wikipedia definition says that directories and hierarchy are just one property of some filesystems. That they became the dominant model on local machines doesn’t take away from the more general definition that can describe distributed filesystems.


I'm kind of chuckling at this thread because you're working so hard to not understand.

I think the previous poster could/should have said, "It is not a hierarchical file system and has no concept of directories." where I added the word "hierarchical".

But it's also pretty obvious that was the point.


I disagree with that characterization because the contrast drawn by the OP was that S3 is "just a KV store", implying it doesn't meet the criteria for being considered a filesystem.

For example, you could implement POSIX directory semantics on top of S3. About the only POSIX filesystem API you couldn't implement is append / overwrite (well, you could, but it might be prohibitively expensive).


A real hierarchy makes global constraints easier to scale, e.g. globally unique names or hierarchical access controls. These policies only need to scale to a single node rather than to the whole namespace (via some sort of global index).


no - a filesystem implementation on an ordinary OS has more than what you mention, including interfaces to disk device drivers


If I wanted to use S3 as a filesystem in the manner people are describing I would probably start looking at storing filesystem metadata in a sidecar database so you can get directory listings, permissions bits, xattrs and only have to round-trip to S3 when you need the content.


Isn't this essentially what systems like Minio and SeaweedFS do with their S3 integrations/mirroring/caching? What you describe sounds a lot like SeaweedFS Filer when backed by S3


The way that you said "recursively" and spent a lot of time describing "directories" and "levels" worries me. The fastest way to list objects in S3 wouldn't involve recursion at all; you just list all objects under a prefix. If you're using the path delimiter to pretend that S3 keys are a folder structure (they're not) and go "folder by folder", it's going to be way slower. When calling ListObjectsV2, make sure you are NOT passing "delimiter". The "directories" and "levels" have no impact on performance when you're not using the delimiter functionality. Split the one list operation into multiple parallel lists on separate prefixes to attain any total time goal you'd like.
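
For reference, a flat listing along those lines with the Go SDK v2 paginator might look like this (bucket and prefix are placeholders; note there is no Delimiter):

  package main

  import (
    "context"
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
  )

  func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
      log.Fatal(err)
    }
    client := s3.NewFromConfig(cfg)

    // One flat, paginated listing of everything under the prefix.
    // No Delimiter means no "folder by folder" round trips.
    p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
      Bucket: aws.String("my-bucket"), // placeholder
      Prefix: aws.String("assets/"),   // placeholder
    })
    for p.HasMorePages() {
      page, err := p.NextPage(ctx)
      if err != nil {
        log.Fatal(err)
      }
      for _, obj := range page.Contents {
        fmt.Println(aws.ToString(obj.Key))
      }
    }
  }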


All these comments saying merely "S3 has no concept of directories" without an explanation (or at least a link to an explanation) are pretty unhelpful, IMO. I dismissed your comment, but then I came upon this later one explaining why: https://news.ycombinator.com/item?id=39660445

After reading that, I now understand your comment.


I appreciate you sharing that point of view. There's a "curse of knowledge" effect with AWS where its card-carrying proponents (myself included) lose perspective on how complex it actually is.


Yes, this is very good advice and will likely solve their problem


A fun corollary of this issue:

Deleting an S3 bucket is nontrivial!

You can't delete a bucket with objects in it. And you can't just tell S3 to delete all the objects in one call. You have to send delete requests yourself, at most 1,000 keys per DeleteObjects request, which means also sending list requests, 1,000 keys at a time. Which takes time. And those list calls cost money to execute.

This is a good summary of the situation: https://cloudcasts.io/article/deleting-an-s3-bucket-costs-mo...

The fastest way to quickly dispose of an S3 bucket turns out to be to delete the AWS account it belongs to.


No, don't do that. Set up a lifecycle rule that expires all of the objects and wait 24 hours. You won't pay for API calls and even the cost of storing the objects themselves is waived once they are marked for expiration.

The article has a mistake about this too: expirations do NOT count as lifecycle transitions and you don't get charged as such. You will, of course, get charged if you prematurely delete objects that are in a storage class with a minimum storage duration that they haven't reached yet. This is what they're actually talking about when they mention Infrequent Access and other lower tiers.
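
For illustration, the expire-everything rule is just a small lifecycle configuration along these lines (apply it from the console, `aws s3api put-bucket-lifecycle-configuration`, or any SDK); the empty Prefix matches every key, and 1 is the minimum number of days:

  {
    "Rules": [
      {
        "ID": "expire-everything",
        "Status": "Enabled",
        "Filter": { "Prefix": "" },
        "Expiration": { "Days": 1 }
      }
    ]
  }

For versioned buckets you would also need rules covering noncurrent versions and delete markers before the bucket itself can be deleted.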


Still counts as nontrivial.


This is really easy; much easier than trying to delete them by hand. AWS does all the work for you. It takes longer to log into the AWS Management Console than it does to set up this lifecycle rule.


Literally 1 API call.


Two. The one to set up the lifecycle rule. Then the one to delete the bucket, some number of hours later.


Incorrect. One call to trigger a step function that sets up the lifecycle rule, sleeps for 24 hours and then deletes the bucket.

Stop being silly, as if 1 vs 2 API calls matters. You should empty large buckets with lifecycle policies. It's trivial.


Imagine for a second you’re a Unix user, familiar with the rm command.

Imagine you are using Windows for the first time and you want to delete a directory, so you find an answer on Serverfault that explains that to do so you need to spin up a COM object that marks the directory for deletion and then comes back the next day to delete it.

You might be inclined to say ‘that seems overly complicated’.

The original answerer is confused though. ‘It’s trivial, stop being silly. Can you think of a simpler way to delete a directory?’

Do you see now why I thought the ‘non triviality’ of deleting an S3 bucket was perhaps relevant in a discussion on an article about why S3 is both simpler and more complex than a file system?

And why your approach might not actually be making the case for it being as simple as you think?


Right click, move to recycle bin, wait for the progress bar to finish. Except the progress bar takes a day or so.

This is only needed if you have a huge (100 million+) bucket, at which point you should be experienced with s3, otherwise you can just click the big, clear and obvious “empty bucket” button on the console.


I think it's far more mundane a reason. You can list 1,000 objects per request, and getting the next 1,000 requires the result of the previous request, so it's all serial. That means to list 1M files, you're looking at 1,000 back-to-back requests. Assuming a ping time of 50ms, that's easily 50s of just going back and forth, not including the cost of doing the listing itself on a flat iteration. The cost of a 1,000-item list is about the cost of a write, which is kinda slow. Additionally, I suspect each listing is a strongly consistent snapshot, which adds to the cost of the operation (it can be hard to offer a cheaper, inconsistent view).

I don't think B-trees would help unless you're doing directory traversals, and even then I suspect that's not that beneficial, as your bottleneck is going to be the network round trips and the operations the API exposes. Ultimately, file listing isn't that critical a use case, and typically most use cases are accomplished through things like object lifecycles, where you tell S3 what you want done and it does it efficiently at the FS layer for you.


That's 50s of a 15m duration. I don't think it matters in the least.


Depends how you're iterating. If you're iterating by hierarchy level, then you could easily see this being several orders of magnitude more requests.


It's not a good model to think of S3 as having directories in a bucket. It's all objects. The web interface has a visual way of representing prefixes separated by slashes. But that's just a nice way to present the objects. Each object has a key, and that key can contain slashes, and you can think of each segment as a directory for your ease of mind.

But that illusion breaks when you try to do operations you usually do with/on directories.


Are you performing list calls sequentially? If you have O(100k) directories and are doing O(100k) requests sequentially, 15 minutes works out at O(10ms) per request which doesn’t seem that bad? (assuming my math is correct…)


At risk of being pedantic, you seem to be using big O to mean “approximately” or “in the order of”, but that’s not what it means at all. Big O is an expression of the growth rate of a function. Any constant value has a growth rate of 0, so O(100k) isn’t meaningful: It’s exactly the same as O(1).


You're right technically, it's an abuse of notation that isn't uncommon. My physics profs would do it in college.


Fair point, I guess the notation ~100k, ~10ms would be better.


I implemented a solution by threading the listing: get the files in the root, then spin up a separate process to do the recursion for each directory.
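
Something like this sketch, with goroutines standing in for processes: one delimiter listing at the root to discover the top-level prefixes, then a flat paginated listing per prefix in parallel (bucket name is a placeholder, and the root listing and concurrency handling are simplified):

  package main

  import (
    "context"
    "fmt"
    "log"
    "sync"
    "sync/atomic"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
  )

  func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
      log.Fatal(err)
    }
    client := s3.NewFromConfig(cfg)
    bucket := aws.String("my-bucket") // placeholder

    // One delimiter listing at the root to discover top-level "directories".
    // (Simplified: a real version paginates this too and caps concurrency.)
    root, err := client.ListObjectsV2(ctx, &s3.ListObjectsV2Input{
      Bucket:    bucket,
      Delimiter: aws.String("/"),
    })
    if err != nil {
      log.Fatal(err)
    }

    var total int64
    var wg sync.WaitGroup
    for _, cp := range root.CommonPrefixes {
      prefix := aws.ToString(cp.Prefix)
      wg.Add(1)
      go func() {
        defer wg.Done()
        // Flat paginated listing of everything under this prefix.
        p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
          Bucket: bucket,
          Prefix: aws.String(prefix),
        })
        for p.HasMorePages() {
          page, err := p.NextPage(ctx)
          if err != nil {
            log.Println(err)
            return
          }
          atomic.AddInt64(&total, int64(len(page.Contents)))
        }
      }()
    }
    wg.Wait()
    fmt.Println("objects:", total)
  }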


> Why is this something Amazon has not fixed?

It's common to store metadata on DynamoDB where it can be queried, and just have whatever arbitrary links to the values in the buckets.


> Why is this something Amazon has not fixed? From the outside really seems like they could slap some B-trees on the individual buckets and call it a day.

They fixed it already, it's called DynamoDB. With some SQS and Lambda glue you can index your S3 content in any way you want for later retrieval.


Take this opportunity to read the docs and discard assumptions. Enumerating buckets as though they’re directories will seem peculiar when you understand it is designed for billions of items and up. Index your objects separately, in whatever form makes sense to your application.


It's not "fixed" because it's not a problem. You're just using it wrong.


> Recursively listing these files

There's no "recursive" nature to S3 buckets. "Listing a directory" is simply listing keys by a prefix.

So list by the upper-most prefix that you want. If you have 1,000,000 files, it will take 1,000 API calls to list everything.

If each call takes 1s (I have no idea what your latency to the S3 bucket region is), then it will indeed take 15 min.

https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObje...


> Amazon S3 is the original cloud technology: it came out in 2006. "Objects" were popular at the time and S3 was labelled an "object store", but everyone really knows that S3 is for files. S3

Alternative theory: everyone who worked on this knew that it was not a filesystem and "object store" is a description intended to describe everything else pointed out in this post.

"Objects were really popular" is about objects as software component that combines executable code with local state. None of the original S3 examples were about "hey you can serialize live objects to this store and then deserialize them into another live process!" It was all like "hey you know how you have all those static assets for your website..." "Objects" was used in this sense in databases at the time in the phrase "binary large object" or "blob". S3 was like "hey, stuff that doesn't fit in your database, you know...objects...this is a store for them."

This is meant to describe precisely things like "listing is slow" because when S3 was designed, the launch use cases assumed an index of contents existed _somewhere else_, because, yeah, it's not a filesystem. It's an object store.


Yes, the author doesn't seem to realize that "object storage" is a term of art in storage systems that has nothing to do with OOP.

https://en.wikipedia.org/wiki/Object_storage


Yeah, I'm really worried the author is confusing OOP with an object store.

To quote GCP:

> Object storage is a data storage architecture for storing unstructured data, which sections data into units—objects—and stores them in a structurally flat data environment

> https://cloud.google.com/learn/what-is-object-storage

That is (1) unstructured (2) flat organization (3) whole-item operations (read, write)


S3 is not even files, and definitely not a filesystem.

The thing I would expect from a file abstraction is mutability. I should be able to edit pieces of a file, grow it, shrink it, read and write at random offsets. I shouldn't have to go back up to the root, or a higher level concept once I have the file in hand. S3 provides a mutable listing of immutable objects, if I want to do any of the mutability business, I need to make a copy and re-upload. As originally conceived, the file abstraction finds some sectors on disk, and presents them to the client as a contiguous buffer. S3 solves a different problem.

Many people misinterpret the Good Idea from UNIX "everything is a file" to mean that everything should look like a contiguous virtual buffer. That's not what the real Good Idea is. Really: everything can be listed in a directory, including directories. There will be base leaves, which could be files, or any object the system wants to present to a process, and there will be recursive trees (which are directories). The directories are what make the filesystem, not the type of a particular leaf. Adding a new type of leaf, like a socket or a frame buffer, or whatever, is almost boring, and doesn't erode the integrity of the real good idea. Adding a different kind of container like a list, would make the structure of the filesystem more complex, and that would erode the conceptual integrity.

S3 doesn't do any of these things, and that's fine. I just want a place to put things that won't fit in the database, and know they won't bitrot when I'm not looking. The desire to make S3 look more like a filesystem comes from client misunderstanding of what it's good at/for, and poor product management indulging that misunderstanding instead of guarding the system from it.


> S3 is not even files, and definitely not a filesystem.

I agree. To me the correct analog for S3 is a block storage device (a very weird one where blocks can be any size and can have a key associated with them) and not a filesystem. A filesystem is an abstraction that sits on top of a block storage device and so an "S3 filesystem" would have to be an abstraction that sits on top of S3 as the underlying block storage.


> a very weird one where blocks can be any size and can have a key associated with them

That is a very weird one


How do read-only filesystems align with your definition?


All of the read stuff still applies (list, open, read, seek).


You can read individual blocks on a read only file system. With S3 you’re stuck with range requests which are much larger.


You can't create new things on a read-only filesystem, you can in S3; not a good analogy.


I wasn’t making an analogy. I was asking how read-only filesystem works given the parent commenters description of what makes something a filesystem.


It's a filesystem where many operations return an error (historically, EROFS). There are many things you can't do with one. Is that interesting somehow?

I don't agree with defining a filesystem as something that has to be backed by a block device, but the shape of a filesystem API is historically very different from the shape of the S3 API.


A filesystem is an abstraction built on a block device. A block device just gives you a massive array of bytes and lets you read/write from them in blocks (e.g. write these 300 bytes at position 273041).

A block device itself is an abstraction built on real hardware. "Write these 300 bytes" really means something like "move needle on platter 2 to position 6... etc"

S3 is just a different abstraction that is also built on raw storage somehow. It's a strictly flat key-object store. That's it. I don't know why people have a problem with this. If you need "filesystem stuff" then implement it in your app, or use a filesystem. You only need to append? Use a database to keep track of the chain of appends and store the chunks in S3. Doesn't work for you? Use something else. Need to "copy"? Make a new reference to the same object in your db. Doesn't work for you? Use something else.
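
As a toy sketch of that append-as-chunks idea (an in-memory slice stands in for the database holding the manifest, and the bucket name is a placeholder):

  package main

  import (
    "bytes"
    "context"
    "fmt"
    "io"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
  )

  const bucket = "my-bucket" // placeholder

  // manifest is the ordered list of chunk keys making up one logical "file".
  // In a real system this would live in a database, not in memory.
  var manifest []string

  // appendChunk writes each append as a brand new immutable object and
  // records its key in the manifest.
  func appendChunk(ctx context.Context, client *s3.Client, data []byte) error {
    key := fmt.Sprintf("chunks/%08d", len(manifest))
    _, err := client.PutObject(ctx, &s3.PutObjectInput{
      Bucket: aws.String(bucket),
      Key:    aws.String(key),
      Body:   bytes.NewReader(data),
    })
    if err == nil {
      manifest = append(manifest, key)
    }
    return err
  }

  // readAll stitches the chunks back together in manifest order.
  func readAll(ctx context.Context, client *s3.Client, w io.Writer) error {
    for _, key := range manifest {
      out, err := client.GetObject(ctx, &s3.GetObjectInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(key),
      })
      if err != nil {
        return err
      }
      _, err = io.Copy(w, out.Body)
      out.Body.Close()
      if err != nil {
        return err
      }
    }
    return nil
  }

  func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
      log.Fatal(err)
    }
    client := s3.NewFromConfig(cfg)
    _ = appendChunk(ctx, client, []byte("hello ")) // errors ignored for brevity
    _ = appendChunk(ctx, client, []byte("world\n"))
    var buf bytes.Buffer
    _ = readAll(ctx, client, &buf)
    fmt.Print(buf.String())
  }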

S3 works for a lot of people. Stop trying to make it something else.

And stop trying to change the meaning of super well-established names in your field. A filesystem is described in text books everywhere. S3 is not a filesystem and never claimed to be one.

Oh and please study a bit of operating system design. Just a little bit. It really helps and is great fun too.


It's even discussed in https://github.com/apache/arrow-rs/issues/3888, comparing object_store in Apache Arrow to the APIs provided by Apache OpenDAL.

Briefly, Apache OpenDAL is a library providing FS-like APIs over multiple storage backends, including S3 and many other cloud storage services.

A few database systems, such as GreptimeDB and Databend, use OpenDAL as a better S3 SDK to access data on cloud storage.

Other solutions exist to manage filesystem-like interfaces over S3, including Alluxio and JuiceFS. Unlike Apache OpenDAL, Alluxio and JuiceFS need to be deployed standalone and have a dedicated internal metadata service.


I'm not sure if Alluxio could be substituted by OpenDAL as a local cache layer for TrinoDB.


If I understand "local cache layer" correctly, it's possible. And it's even desirable if you want to reduce the deployment burden.

Here is some related code showing how we implement such a layer in GreptimeDB:

* https://github.com/GreptimeTeam/greptimedb/blob/v0.7.0/src/o...

* https://github.com/GreptimeTeam/greptimedb/blob/v0.7.0/src/m...


Backblaze B2 is worth mentioning while we are speaking of S3. I'm absolutely in love with their prices (3 times lower than S3's). (I'm not their representative.)


Backblaze's B2 is cheap - but if you're using them in production you must include these costs:

* their weekly 2 hour maintenance window 11:30-13:30 PST (which usually has no downtime, but sometimes is a full outage in the middle of the US day)

* having to file support tickets when your error rates increase above a usable threshold (for us about once a year for the last few years)

* support which does not look into the issue, just asks tons of questions as if they do not have error logs or any visibility on their end

* false success on uploads where B2 says it successfully saved your file but it is 0 bytes on their system (ALWAYS verify the upload despite B2's success code)

* extended outages if there's a high severity CVE (ex: they shut down for 10 hours for the Log4j2 CVE)

They have the best price - but when comparing options, it is simply not a directly comparable product to more mature cloud storage services.

(edit: formatting)


With every alternative, the prevailing issue is the fact that your data is as safe as the company your data is with. But I think this can be remedied by doubly external backups.


B2 having an S3-compatible API available makes this particularly easy :)


Backblaze is like if Amazon spun AWS S3 out as its own business (and it added some backup helper tooling as a result) though, I wouldn't really worry any more about it. You could write a second copy to S3 Glacier Deep Archive (using B2 for instant access when you wanted to restore or on a new device) and still be much cheaper.


We liked B2, but not enough to pay for IPv4 addresses. It's insane that they advertise as a multi-cloud solution but basically kill any chance at adoption when NAT gateways and IPv4 charges are everywhere. We would literally save money paying B2 bandwidth fees (high read, low write), but not when being pushed through a NAT64 gateway, or paying an hourly charge just to be able to access B2.


How could they launch a cloud service like this and not have IPv6 in 2015? What other basic things did they cheap out on?


Most major cloud vendors are still not fully dual-stack capable, so it's not that surprising. And plenty of ISPs have barely started rollout, or have even said they just won't.


I understand that AWS has 200 services, some of which are 20 years old, and making them all IPv6-ready would be hard and costly. Backblaze has one cloud service, and the public interface is a boring REST API over boring HTTPS.


AWS enabled dualstack S3 almost 10 years ago because object storage is pretty much the use case for IPv6.

I’m pretty sure the only other large object storage provider that is v4 only is Azure, and even then they offer a compatibility layer. Backblaze just flat out won’t work unless you pay extra to connect to them.

Honestly the only cloud provider I think you’re talking about is Azure, I don’t know of any other that are IPv4 only because it’s just cost prohibitive.


I also migrated, after asking for IPv6 for more than 3 years on reddit.

They don't seem to understand the users of the B2 product. It's almost as if B2 is just a supplementary service to their backup product.

https://www.reddit.com/r/backblaze/comments/ij9y9s/b2_s3_not...


they've started internal v6 rollout with external coming afterwards. no timelines though, and I've waited for years

https://old.reddit.com/r/backblaze/comments/1av4r3g/b2_ipv6_...


Great article - would have been useful to read before starting out on the journey of making rclone mount (mount your cloud storage via fuse)!

After a lot of iterating we eventually came up with the VFS layer in rclone which adapts S3 (or any other similar storage system like Google Cloud Storage, Azure Blob, Openstack Swift, Oracle Object Storage, etc) into a POSIX-ish file system layer in rclone. The actual rclone mount code is quite a thin layer on top of this.

The VFS layer has various levels of compatibility. The lowest level, "off", just does directory caching. In this mode, like the article states, you can't read and write to a file simultaneously, you can't write to the middle of a file, and you can only write files sequentially. Surprisingly quite a lot of things work OK with these limitations. The next level up is "writes" - this supports nearly all the POSIX features that applications want, like being able to read and write to the same file at the same time, write to the middle of the file, etc. The cost for that though is a local copy of the file, which is uploaded asynchronously when it is closed.

Here are some docs for the VFS caching modes - these mirror the limitations in the article nicely!

https://rclone.org/commands/rclone_mount/#vfs-file-caching

By default S3 doesn't have real directories either. This means you can't have a directory with no files in, and directories don't have valid metadata (like modification time). You can create zero length files ending in / which are known as directory markers and a lot of tools (including rclone) support these. Not being able to have empty directories isn't too much of a problem normally as the VFS layer fakes them and most apps then write something into their empty directories pretty quickly.

So it is really quite a lot of work trying to convert something which looks like S3 into something which looks like a POSIX file system. There is a whole lot of smoke and mirrors behind the scene when things like renaming an open file happens and other nasty corner cases like that.

Rclone's lower level move/sync/copy commands don't bother though and use the S3 API pretty much as-is.

If I could change one thing about S3's API I would like an option to read the metadata with the listings. Rclone stores modification times of files as metadata on the object and there isn't a bulk way of reading these, you have to HEAD the object. Or alternatively a way of setting the Last-Modified on an object when you upload it would do too.
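
Concretely, the round trip today looks something like this sketch with the Go SDK v2 (bucket, key and the mtime value are placeholders): the time goes up as user metadata on PutObject, but getting it back means a HEAD per object because listings don't include user metadata.

  package main

  import (
    "context"
    "fmt"
    "log"
    "strings"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
  )

  func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
      log.Fatal(err)
    }
    client := s3.NewFromConfig(cfg)
    bucket, key := aws.String("my-bucket"), aws.String("dir/file.txt") // placeholders

    // Upload: the modification time rides along as user metadata.
    _, err = client.PutObject(ctx, &s3.PutObjectInput{
      Bucket:   bucket,
      Key:      key,
      Body:     strings.NewReader("hello"),
      Metadata: map[string]string{"mtime": "1710066090"},
    })
    if err != nil {
      log.Fatal(err)
    }

    // Reading it back: listings don't return user metadata, so it is one
    // HeadObject round trip per object.
    head, err := client.HeadObject(ctx, &s3.HeadObjectInput{Bucket: bucket, Key: key})
    if err != nil {
      log.Fatal(err)
    }
    fmt.Println(head.Metadata["mtime"])
  }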


> If I could change one thing about S3's API I would like an option to read the metadata with the listings. Rclone stores modification times of files as metadata on the object and there isn't a bulk way of reading these, you have to HEAD the object. Or alternatively a way of setting the Last-Modified on an object when you upload it would do too.

I wonder if you couldn't hack this in by storing the metadata in the key name itself? Obviously with the key length limit of 1024 you would be limited in how much metadata you could store, but it's still quite a lot of space, even taking into account the file path. You could use a delimiter that would be invalid in a normalized path, like '//', for example: /path/to/file.txt//mtime=1710066090

You would still be able to fetch "directories" via prefixes and direct files by using '<filename>//' as the prefix.

This kind of formatting would probably make it pretty incompatible with other software though.
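
A toy version of that encoding/decoding might look like this (the '//' delimiter and the mtime= tag are just the convention suggested above, not anything S3 knows about):

  package main

  import (
    "fmt"
    "strconv"
    "strings"
  )

  // encodeKey appends the metadata to the object key itself, e.g.
  // "/path/to/file.txt//mtime=1710066090".
  func encodeKey(path string, mtime int64) string {
    return path + "//mtime=" + strconv.FormatInt(mtime, 10)
  }

  // decodeKey splits a stored key back into the path and the mtime.
  func decodeKey(key string) (string, int64) {
    parts := strings.SplitN(key, "//", 2)
    if len(parts) < 2 {
      return parts[0], 0
    }
    mtime, _ := strconv.ParseInt(strings.TrimPrefix(parts[1], "mtime="), 10, 64)
    return parts[0], mtime
  }

  func main() {
    key := encodeKey("/path/to/file.txt", 1710066090)
    fmt.Println(key) // listings now carry the metadata "for free"
    fmt.Println(decodeKey(key))
  }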


I think that is a nice idea - maybe something we could implement in an overlay backend. However people really like the fact that the objects they upload with rclone arrive on s3 with the filenames they had originally, so I think the incompatible-with-other-software downside would make it unattractive for most users.


> If I could change one thing about S3's API I would like an option to read the metadata with the listings.

Agree. In MinIO (disclaimer: I work there) we added a "secret" parameter (metadata=true) to include metadata and tags in listings if the user has the appropriate permissions. Of course it being an extension it is not really something that you can reliably use. But rclone can of course always try it and use it if available :)

> You can create zero length files ending in /

Yeah. Though you could also consider "shared prefixes" in listings as directories by itself. That of course makes directories "stateless" and unable to exist if there are no objects in there - which has pros and cons.

> Or alternatively a way of setting the Last-Modified on an object when you upload it would do too.

Yes, that gives severe limitations to clients. However it does make the "server" time the reference. But we have to deal with the same limitation for client side replication/mirroring.

My personal biggest complaint is that there isn't a `HeadObjectVersions` that returns version information for a single object. `ListObjectVersions` is always going to be a "cluster-wide" operation, since you cannot know if the given prefix is actually a prefix or an object key. AWS recently added "GetObjectAttributes" - but it doesn't add version information, which would have fit in nicely there.


> Agree. In MinIO (disclaimer: I work there) we added a "secret" parameter (metadata=true) to include metadata and tags in listings if the user has the appropriate permissions. Of course it being an extension it is not really something that you can reliably use. But rclone can of course always try it and use it if available :)

Is this "secret" parameter documented somewhere? Sounds very useful :-) Rclone knows when it is talking to Minio so we could easily wedge that in.

> My personal biggest complaint is that there isn't a `HeadObjectVersions` that returns version information for a single object. `ListObjectVersions` is always going to be a "cluster-wide" operation, since you cannot know if the given prefix is actually a prefix or an object key

Yes that is annoying having to do a List just to figure out which object Version is being referred to. (Rclone has this problem when using --s3-list-version).


Hey Nick :wave:


> The "simple" in S3 is a misnomer. S3 is not actually simple. It's deep.

Simple doesn't mean "not deep". It means having the fewest parts needed in order to accomplish your requirements.

If you require a distributed, centralized, replicated, high-availability, high-durability, high-bandwidth, low-latency, strongly-consistent, synchronous, scalable object store with HTTP REST API, you can't get much simpler than S3. Lots of features have been added to AWS S3 over the years, but the basic operation has remained the same.


> It means having the fewest parts needed in order to accomplish your requirements.

That is exactly what "deep" means, in the terminology of this post (from Ousterhout's book A Philosophy of Software Design). Simple means "not complex" (see also Rich Hickey's talk Simple Made Easy: https://www.infoq.com/presentations/Simple-Made-Easy/), while "deep" means providing/having a lot of internally-complex functionality via a small interface. The latter is a better description of S3 (which is what you seem to be saying too) than "simple" which would mean there isn't much to it.


Hickey's definition of simple is wrong. It's not the opposite of complex at all. They are not opposites, nor mutually exclusive.

  - Easy is when something does not require much effort.
  - Simple means the least complex it can be and still work.
  - Complex means there are lots of components.
These are all quite different concepts:

  - Easy is a concept that distinguishes the amount of work needed to use a solution
  - Simple is a concept that distinguishes whether or not there is an excess number of interacting properties in a system
  - Complex is a concept describing the quality of having a number of interacting properties in a system
Hickey's talk is useful in terms of thinking about software, but it also contains many over-generalizations which are incorrect and lead to incorrect thinking about things that aren't software. (Even some of his declarations about software are wrong)

"Deep", in the context of software complexity, probably only makes sense in terms of describing the number of layers involved in a piece of technology. You could make something have many layers, and it could still be simple, or be complex, or easy.


In terms the article puts forth, I would almost argue that simple implies deep (and the associated “narrow” interface).


S3 is tagged, versioned object storage with file-like semantics implemented in the AWS SDK (via the AWS S3 APIs). The S3 object key is the tag.

Files and folders are used to make S3 buckets more approachable to those who either don't know or don't want to know what it actually is, and one day they get a surprise.


> S3 is a cloud filesystem, not an object-whatever. [...]I think the idea that S3 is really "Amazon Cloud Filesystem" is a bit of a load bearing fiction.

Does anyone actually think this? I have never encountered anyone who has described S3 in these terms.


Not sure if the author is aware of EFS


Tools like LucidLink and Weka go some way toward making S3 even more of a "file system". They break files into smaller chunks (S3 objects), which helps with partial writes, reads and performance, alongside tiering of data from S3 to disk when needed for performance.


Someone contributed an nbdkit S3 plugin which basically works the way you described. It uses numbered S3 chunks using the pattern "key/%16x", allowing the virtual disk to be updated. (https://libguestfs.org/nbdkit-S3-plugin.1.html https://gitlab.com/nbdkit/nbdkit/-/tree/master/plugins/S3)
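
The offset-to-object mapping for that kind of layout is just arithmetic, roughly along these lines (the chunk size and exact key format here are illustrative):

  package main

  import "fmt"

  const chunkSize = 1 << 22 // e.g. 4 MiB per S3 object; tune to taste

  // chunkKey maps a byte offset on the virtual disk to the S3 object that
  // holds it, loosely following the plugin's "key/%16x" naming convention.
  func chunkKey(prefix string, offset int64) (key string, offsetInChunk int64) {
    return fmt.Sprintf("%s/%016x", prefix, offset/chunkSize), offset % chunkSize
  }

  func main() {
    key, off := chunkKey("disk1", 123456789)
    fmt.Println(key, off) // a write becomes a read-modify-write of one small object
  }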


The problem with these approaches is that the data is scrambled on the backend, so you can't access the files directly from S3 anymore. Instead you need an S3 gateway to convert from scrambled S3 to unscrambled S3. They rely on a separate database to reassemble the pieces back together again.


I don’t know a whole lot about LucidLink but Weka basically uses S3 as a dataplane for their own file system.


It's nice to see Ousterhout's idea of module depth (the main idea from his A Philosophy of Software Design) getting more mainstream — mentioned in this article with attribution only in "Other notes", which suggests the author found it natural enough not to require elaboration. Being obvious-in-hindsight like this is a sign of a good idea. :-)

> The concept of deep vs shallow modules comes from John Ousterhout's excellent book. The book is [effectively] a list of ideas on software design. Some are real hits with me, others not, but well worth reading overall. Praise for making it succinct.


I was tempted to define and cite the term more carefully before I used it but that leadened the article a lot right in the middle and so I cut it and just hoped.

It is a great concept and also a great book. I really enjoyed it but I've never found a convincing way to persuade people to read it. I read it on personal recommendation but that only works if it comes from someone you respect (as in my case).


I feel like I understand the lasting popularity of the humble FTP fileserver a bit better now. Thank you.


oh but amazon offers SFTP on top of S3 so you don't have to miss out.


If it's offered on top of S3, though, doesn't it still have all the same issues of needing to totally overwrite files?


Like Gmail is emails but not IMAP. It's fine. We have seen that these kinds of wrappers work pretty well most of the time considering the performance and simplicity they bring in building and managing these systems.


A bit off topic but also related: I use Minio as a local "S3" to store datasets and model checkpoints for my garage compute. Minio, however, has a bunch of features that I simply don't need. I just want to be able to copy to/from, list prefixes, and delete every now and then. I could use NFS I suppose, but that'd be a bit inconvenient since I also use Minio to store build deps (which Bazel then downloads), and I'd like to be able to comfortably build stuff on my laptop. In particular, one feature I do not need is the constant disk access that Minio does to "protect against bit rot" and whatever. That protection is already provided by periodic scrubs on my raidz6.

So what's the current best (preferably statically linked) self-hosted, single-node option for minimal S3 like "thing" that just lets me CRUD the files and list them?


FYI, Minio used to have a "File System" mode that did exactly this.

But they deprecated it.

(You can still use it, but it's not getting updates.)


A web server?


> Filesystem software, especially databases, can't be ported to Amazon S3

Hudi, Delta, iceberg bridge that gap now. Databricks built a company around it.

Don't try to do relational on object storage on your own. Use one of those libraries. It seems simple but it's not. Late arriving data, deletes, updates, primary key column values changing, etc.


There are dedicated block storage services (EBS, including flavors like EBS multi-attach) and file services like EFS that can be used if there is a need to port software/databases to the cloud with low-level filesystem support.

Why would we need to do it on object storage, which addresses a different type of storage need?

Nevertheless there are projects like EMRFS and S3 file system mount points that try to provide file system interfaces to workloads that need to see S3 as a filesystem.


S3 is better for large datasets. It's cheaper and handles large file sizes with ease.

It has become a de-facto standard for distributed, data-intensive workloads like those common with spark.

A key benefit is decoupling the data from the compute so that they can scale independently. EBS is tightly coupled to iops and you pay extra for that.

(Source: a long time working in data engineering)


Yes and I also believe:

Experienced Spark / Data Engineering teams would not assume S3 is readily useable as a filesystem.

This [1] seems like a good guide on how to configure spark for working with Cloud object stores, while recognizing the limitations and pitfalls.

[1]: https://spark.apache.org/docs/latest/cloud-integration.html

---

Amazon EMR offers a managed way to run hadoop or spark clusters and it implements an "EMR FS" [2] system to interface with S3 as storage.

[2]: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-fs.h...

AWS Glue is another option, which is "serverless" ETL. Source and destination can be S3 data lakes read through a data catalog (Hive or Glue Data Catalog). During processing, AWS Glue can optionally use S3 [3,4,5] for shuffle partitions.

[3]: https://aws.amazon.com/blogs/big-data/introducing-amazon-s3-...

[4]: https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-shu...

[5]: https://aws.amazon.com/blogs/big-data/introducing-the-cloud-...


I think we're talking about two different things. I was addressing a section in the article about running databases backed by s3. It's less about s3 needing to act as a filesystem, and more about all of the rdbms features that come along with the various types of DB transactions. It's a solved problem with the libraries I mentioned. Not something I'd ever recommend to build on your own. Been there done that when those solutions were still nascent. Wasn't worth the effort vs just using an rdbms.

The problem that emrfs is trying to solve doesn't cover the rdbms scenarios like row-level updates and deletes.


I still don't understand why you'd want to do it in the first place. Just buy some contiguous storage.


This article was an epiphany for me because I realized I've been thinking of the Unix filesystem as if it has two functions: read_file and write_file. (And then getting frustrated with the filesystem APIs in programming languages.)


So you came from an S3 or other put-get world, and found actual filesystems odd?

I suppose that's not so different from a WMP user's epiphany when they discover processes, shells, etc.


Well I’m used to an application-level view of the file system.

A document editor or text editor opens files and saves files, but these are whole-document operations. I can’t open a document in Sublime Text without reading it, and I can’t save part of a file without saving all of it. So it’s not obvious that these would be different at an OS level.

As the post points out, there are uses for Unix’s sub-file-level read-and-write commands, but I’ve never needed them.


The article is well written, but I am annoyed at the attempt to gatekeep the definition of a filesystem.

Like literally any abstraction out there, filesystems are associated with a multitude of possible approaches with conceptually different semantics. It's a bit sophistic to say that Postgres cannot be run on S3 because S3 is not a filesystem; a better choice would have been to explore the underlying assumptions; (I suspect latency would kill the hypothetical use case of Postgres over S3 even if S3 had incorporated the necessary API semantics - could somebody more knowledgeable chime in?).

A more interesting avenue to pursue would be: what other additions could be made to the S3 API to make it more usable in its own right - for example, why doesn't S3 offer more than one filename per blob? (e.g., similar to what links do in POSIX)


The notion of postgres not being able to run on s3 has more to do with the characteristics of how it works than with it not being a filesystem. After all, people have developed fuse drivers for s3 so they can actually pretend it's a filesystem. But using that to store a database is going to end in tears for the same reasons that using e.g. NFS for this is also likely to end in tears. You might get it to work but it won't be fast or even reliable. And since NFS actually stands for networked file system, it's hard to argue that NFS isn't a filesystem.

Whether something is or isn't a filesystem requires defining what that actually is. A system that stores files would be a simple explanation. Which is clearly something S3 is capable of. This probably upsets the definition gatekeepers for whatever more specific definitions they are guarding. But it has a nice simple logic to it.

It's worth considering that file systems have had a long history, weren't always the way they are now, and predate the invention of relational databases (like Postgres). Technically, before hard disks were invented in the fifties, we had no file systems. Just tapes and punch cards. A tape would consist of a single blob of bits, which you'd load into memory. Or it would have multiple such blobs at known offsets. I had cassettes full of games for my Commodore 64. But no disk drive. These blobs were called files but there was no file system. Sometime after the invention of disks, file systems were invented, in the early sixties.

Hierarchical databases were common before relational databases, and filesystems with directories are basically a hierarchical database. S3, lacking hierarchy as a simpler key value store, clearly isn't a hierarchical database. But of course it's easy to mimic one simply by using / characters in the keys. Which is how the fuse driver probably fakes directories. And S3 even has APIs to list files with a common prefix. A bigger deal is the inability to modify files. You can only replace them with other files (delete and add). That kind of is a show stopper for a database. Replacing the entire database on every write isn't very practical.


Neon.tech runs PostgreSQL on S3. They persist the WAL to S3 so that they can replicate the data and bring it to local SSDs, I assume.


Well, RocksDB never overwrites files except the manifest which is small. And you can write DB features on top of that. So that's an example of a database that can work with the S3 limitations.


ClickHouse can work with S3 as a main storage. This is possible because a table is a set of immutable data parts. Data parts can be written once and deleted, possibly as a result of a background merge operation. S3 API is almost enough, except for cases of concurrent database updates. In this case, it is not possible to rely on S3 only because it does not support an atomic "write if not exists" operation. That's why external, strongly consistent metadata storage is needed, which is handled by ClickHouse Keeper.


Google Cloud Storage supports create-if-not-exist and compare-and-swap on generation counter. S3 is much harder to use as a building block without tying your code into a second system like DynamoDB etc.

https://pkg.go.dev/cloud.google.com/go/storage#Conditions
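
For comparison, a create-if-not-exists write with the Go client looks roughly like this; the precondition failure surfaces as an error when the writer is closed:

  package main

  import (
    "context"
    "fmt"
    "log"

    "cloud.google.com/go/storage"
  )

  func main() {
    ctx := context.Background()
    client, err := storage.NewClient(ctx)
    if err != nil {
      log.Fatal(err)
    }
    defer client.Close()

    // DoesNotExist makes the write conditional: it only succeeds if no
    // object exists under this name yet. Bucket/object names are placeholders.
    w := client.Bucket("my-bucket").Object("table/_commit_0001").
      If(storage.Conditions{DoesNotExist: true}).
      NewWriter(ctx)
    if _, err := w.Write([]byte("commit metadata")); err != nil {
      log.Fatal(err)
    }
    if err := w.Close(); err != nil {
      // A precondition-failed (412) error here means someone else won the race.
      fmt.Println("lost the race:", err)
      return
    }
    fmt.Println("created")
  }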


Conditional PUT would be a great addition to S3, indeed.


That would probably require them to rewrite a non-trivial part of S3 from scratch.


Is a "write if not exists" atomic operation enough as a concurrency primitive for database locks?


Yes, it's not necessarily the most efficient mechanism (could be a lot of retries) but it's sufficient. See the Delta Lake paper for example [0]

[0] https://people.eecs.berkeley.edu/~matei/papers/2020/vldb_del...


When talking about analytical databases for "big data", yeah. They generally just want a "atomically replace the list of Parquet files that make up this table", with one writer succeeding at a time.

That would not be a great base to build a transactional database on.


This might be of interest to you: https://neon.tech/blog/bring-your-own-s3-to-neon.

There's also the OG Aurora whitepaper: https://www.amazon.science/publications/amazon-aurora-design...


I’ve wondered this also because it can be handy to have multiple ways of accessing the same file. For example to obfuscate database uuids if they are used in the key. In theory you could implement soft links in AWS by just storing a file with the path to the linked file. But it would be a lot of manual work.


I talked to people at AWS who work in RDS Aurora and they hinted they use S3 internally as a backend for MySQL and PostgreSQL.


Maybe for snapshots, but certainly not for live data.


EBS not S3


Big if true. That was definitely not in the AWS cert I took lol.


Separating compute and storage is one of the core ideas behind Aurora. They talked about it in several places, for instance:

* https://www.amazon.science/publications/amazon-aurora-design...

* https://d1.awsstatic.com/events/reinvent/2019/REPEAT_Amazon_...


I am currently pondering this exact problem. I want to run a file-sharing web application (think: NextCloud) but I don't want to use expensive block storage or the dedicated server's disk space for the files, as some of them will be accessed infrequently.

I am wondering if s3fs/rclone-mount is sufficient, or if I should use something like JuiceFS that adds random-access, renaming, etc on top of it. Are those really necessary APIs for my use case? Is there only one way to find out?

(The app doesn't have native S3 support)


It depends on if you want to expose filesystem semantics or metadata to applications using it. For example random access writes are done by ffmpeg, which is a workhorse of the media industry, but most things can't handle that or are too slow. We had to build our own solution cunoFS to make it work properly at high speeds.


S3 is obviously not a filesystem in the sense of a POSIX filesystem. And I would argue it is not a filesystem, even if we were to relax POSIX filesystem semantics (do not implement the full spec). But what is certainly possible is to span a filesystem on top of S3. It is basically possible to span a filesystem on anything that can store data. You can even go crazy for demonstration purposes and put a filesystem on top of YouTube (there are some tech demos for that on GitHub).

I think a better question is whether there are any good filesystem implementations on top of S3. There are many attempts like s3fs-fuse[^1] or seaweedfs[^2], but I have not heard many stories about their use at scale from big companies. Just recently there was a post here about cunoFS[^3]. It is a startup that implements a POSIX-compliant (supports symlinks, hard links (emulated), UIDs & GIDs, permissions, random writes, etc.) filesystem on top of S3/AZ/GCP storage and claims to have really good performance. I think only time will tell if it works out in practice for companies to use S3 as a filesystem through fs implementations on top of S3.

[^1]: https://github.com/s3fs-fuse/s3fs-fuse

[^2]: https://github.com/seaweedfs/seaweedfs

[^3]: https://news.ycombinator.com/item?id=39640307


Are filesystems the correct abstraction to build databases on? Isn’t a filesystem a database in a way? Is there a reason to build a database on top of a filesystem abstraction rather than a block abstraction?

To say you can’t build an efficient database on top of S3 makes sense to me. S3 is already a certain kind of data-storing abstraction optimized for certain usages. If you try and build another data-storing abstraction optimized for incompatible usages on top of that, you are going to have a difficult time.


The traditional POSIX filesystem is the wrong abstraction for a database, but not filesystems per se. All databases that care about performance and scalability implement their own filesystems, either directly against raw block devices or as an overlay on top of a POSIX filesystem that bypasses some of its limitations. The performance and scalability gains by doing so are not small.

The issue with POSIX filesystems is that they are required to make a set of tradeoffs to support features a database engine doesn't need, to the significant detriment of scalability and performance in areas that databases care about a lot. For example, one such database filesystem I've used occasionally over the years, while a bit dated at this point, is designed such that you can have tens of millions of files in a single directory where you are creating and destroying tens of thousands of files every second, on upwards of a petabyte of storage. Very far from being POSIX compatible but you don't get anything like that type of scalability on POSIX.

Object storage is far from ideal as database storage. The biggest issue, though, is the terrible storage bandwidth available in the cloud. It is a small fraction of what is available in a normal server and modern database engines are capable of fully exploiting a large JBOD of NVMe.


> Is there a reason to build a database on top of a filesystem abstraction rather than a block abstraction?

Oracle DB for a long time supported running on raw partitions which I think suggests that the answer is "not really". Snowflake (and I hear Clickhouse) can run on S3 which I think is more evidence against running on a filesystem. Not to mention the torrid time Postgres has had with fsync on linux.


In my $dayjob as cloud architect I sometimes suggest S3 as an alternative to pulling massive JSON blobs from RDS Postgres/Redis etc. As long as their latency requirements are relaxed enough, there's no reason you can't.


At Hopsworks we built HopsFS-S3 to improve things like listing (it becomes a partition-pruned scan of an in-memory DB) and atomic renames, and we added a block/object caching layer using NVMe drives.

You can read the research paper here if you are curious: https://www.hopsworks.ai/research-papers/hopsfs-s3-extending...


Very mild take: S3 is a very reliable, relatively cheap, high latency, key-value store.

The reason I don't think about files on UNIX-derived systems as a key-value store (filename -> file content) is that in such systems, we have many things that aren't really files but expose a file system interface regardless.


Is there a generic name for these distributed cloud file storages?

AWS is S3, google is buckets, Azure is blob storage, the open source version is … ?


Object Storage


I tend to go by Binary Large OBject (BLOB) storage to distinguish between this kind of object storage and "object" as in OOP. BLOB is also what databases call files stored in columns.


When would that be confusing? As in what would an AWS service offering OOP object storage be/mean?


Google buckets is a bit off - the product is called Google Cloud Storage. Buckets are also a term used by S3 and are equivalent to Azure blob storage containers. They are an intermediary layer that determines attributes for the objects stored within them, such as ACLs and storage class (and therefore cost and performance).

As to your question, object storage[1] seems to be the generic term for the technology. Internally they all rely on naming files based on the hash of their contents for quick lookup, deduplication, and avoiding name clashes.

1: https://en.wikipedia.org/wiki/Object_storage


"blob storage" is the usual generic term, even though Azure uses it explicitly. It's like calling adhesive bandages, "bandaids" even though that is a specific company's term.


Underneath the software, there’s still a filesystem with files.

If you stand up an S3 instance with Ceph, you still have a filesystem on spinning rust or fancy SSDs. There's just a bunch of stuff on top of that. It's cool, but saying there's no filesystem describes only what the customer or middle person sees, not what is actually happening.


S3 actually uses a completely custom system[1] for writing bytes to disk. I haven't seen much in the way of details on the on-disk format but I certainly wouldn't assume it resembles a normal filesystem.

[1]: https://aws.amazon.com/blogs/storage/how-automated-reasoning...


I seriously doubt this is correct. It is common for database engines to install directly on raw block devices, bypassing the Linux kernel and effectively becoming the filesystem for those storage devices. Why would S3 work any differently? There are no advantages to building on top of a filesystem and many disadvantages for this kind of thing.

It would be a poor engineering choice to build something like S3 on top of some other filesystem. There are often ways to do it by using an overlay that converts a filesystem into a pseudo block device, but that is usually considered a compatibility shim for environments that lack dedicated storage, at the cost of robustness and performance.


No there isn't. AWS does not use the traditional filesystem layer to store data; that would be a massive mistake from a performance and reliability POV. The POSIX filesystem specification is notoriously vague about things like fsync consistency under particular scenarios (e.g. "do I need to fsync the parent directory before or after fsyncing the contents") and has many bizarre performance cliffs if you aren't careful. At the scale AWS is at, even a 10% performance cliff or performance delta would be worth clawing back if it meant removing the POSIX filesystem.

Filesystems are not free; they incur "complexity" (that favorite bugbear everyone on HN loves to complain about) just as much as any other component in the stack does.

> If you stand up an S3 instance with Ceph,

Okay, but AWS does not run on Ceph. Even then, Ceph is an example that recommends the opposite. Nowadays they recommend solutions like the Bluestore OSD backend to store actual data directly on raw block devices, completely bypassing the filesystem layer -- for the exact same reasons I outlined above and many, many others (the actual metadata does use "BlueFS" which is a small FS shim, but this is mostly so that RocksDB can write directly to the block device too, next to the data segments, and BlueFS is in no way a real POSIX filesystem, it's just a shim for existing software).

See "File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution" written by the Ceph authors[1] about why they finally gave in and wrote Bluestore. The spoiler alert is they got rid of the filesystem precisely because "a filesystem with files" underneath, as you describe, was problematic and worked poorly in comparison (see the conclusion in Section 9.)

Many places do use POSIX filesystems for various reasons, even at large scale, of course.

[1] https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf


Ceph's BlueStore has talked direct to block devices, no filesystem in between, since 2017.

https://ceph.com/community/new-luminous-bluestore/

[Disclaimer: ex-Ceph employee, from before BlueStore]


I dunno, are features like partial file overwrites necessary to make something a filesystem? This reminds me of how there are lots of internal systems at Google whose maintainers keep asserting are not filesystems, but everyone considers them so, to the point where "_____ is not a filesystem" has become an inside joke.


They are necessary because as soon as someone decides that S3 is a filesystem, they will look at the other cloud "filesystems," notice that S3 is cheaper than most of them, and then for some reason they will decide to run giant Hadoop fs stuff on it or mount a relational database on it or all manner of other stupidity. I guarantee you S3's customer-facing engineers are fielding multiple calls per week from customers who are angry that S3 isn't as fast as some real filesystem solution that the customer migrated from because S3 was cheaper.

When people decide that X is a filesystem, they try to use it like it's a local, POSIX filesystem, and that's terrible because it won't be immediately obvious why it's a stupid plan.


If a customer makes an IT decision as big as running Hadoop or an RDBMS with S3 as storage ... but does not consult at least an Associate-level AWS Certified architect (who are doke a dozen) for at least one day's worth of advice, which is probably a couple of hundred dollars at most ...

Can we really blame AWS?

I am sure none of the official AWS documentation or examples show such an architecture.

----

Amazon EMR can run Hadoop and use Amazon S3 as storage via EMR FS.

"S3 mountpoints" are a feature specifically for workloads that need to see S3 as a file system.

For block and file storage workloads there are EBS, EFS, and FSx, which AWS heavily advertises.


*dime a dozen

(Apologies for typos. The "noprocrast" setting sometimes locks us out of HN right after submitting a comment, and by the time it lets us back in the comment is no longer editable.)


Yeah, it’s sort of funny how “POSIXish semantics” has become our definition of these things, when it’s just one kind of thing that’s been called a filesystem historically.


Fun experiment I made with my mum: building a storage-independent, Dropbox-like UI [1] for anything that implements this interface:

  type IBackend interface {
    Ls(path string) ([]os.FileInfo, error)
    Cat(path string) (io.ReadCloser, error)
    Mkdir(path string) error
    Rm(path string) error
    Mv(from string, to string) error
    Save(path string, file io.Reader) error
    Touch(path string) error
  }
My mum really couldn't care less about the POSIX semantics as long as she can see the pictures of my kid, which happen to be on S3.

[1] https://github.com/mickael-kerjean/filestash
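
For the curious, here is a rough sketch of what the S3 flavour of two of those methods can look like (assuming the v1 aws-sdk-go; the struct, bucket wiring, and error handling are simplified, and the other methods are left out):

  package s3backend

  import (
      "io"

      "github.com/aws/aws-sdk-go/aws"
      "github.com/aws/aws-sdk-go/aws/session"
      "github.com/aws/aws-sdk-go/service/s3"
      "github.com/aws/aws-sdk-go/service/s3/s3manager"
  )

  // S3Backend is a hypothetical partial implementation of the interface above.
  type S3Backend struct {
      bucket   string
      client   *s3.S3
      uploader *s3manager.Uploader
  }

  func New(bucket string) *S3Backend {
      sess := session.Must(session.NewSession())
      return &S3Backend{
          bucket:   bucket,
          client:   s3.New(sess),
          uploader: s3manager.NewUploader(sess),
      }
  }

  // Cat maps straight onto GetObject; the response body is already an io.ReadCloser.
  func (b *S3Backend) Cat(path string) (io.ReadCloser, error) {
      out, err := b.client.GetObject(&s3.GetObjectInput{
          Bucket: aws.String(b.bucket),
          Key:    aws.String(path),
      })
      if err != nil {
          return nil, err
      }
      return out.Body, nil
  }

  // Save maps onto an upload. Note there is no partial overwrite: the whole
  // object is replaced every time.
  func (b *S3Backend) Save(path string, file io.Reader) error {
      _, err := b.uploader.Upload(&s3manager.UploadInput{
          Bucket: aws.String(b.bucket),
          Key:    aws.String(path),
          Body:   file,
      })
      return err
  }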


Reducing things to basically the interface you laid out is the point of 9p [1], and is what Plan 9’s UNIX-but-distributed design was built on top of. Same inventor as Go! If you haven’t dived down the Plan 9 rabbit hole yet, it’s a beautiful and haunting vision of how simple cloud computing could have been.

[1] https://9fans.github.io/plan9port/man/man9/intro.html


I think this interface is less interesting than the semantics behind it, particularly when it comes to concurrency: what happens when you delete a folder, and then try and create a file in that folder at the same time? What happens when you move a folder to a new location, and during that move, delete the new or old folders?

Like yes, for your mum's use case, with a single user, it's probably not all that important that you cover those edge cases, but every time I've built pseudo-filesystems on top of non-filesystem storage APIs, those sorts of semantic questions have been where all the problems have hidden. It's not particularly hard to implement the interface you've described, but it's very hard to do it in such a way that, for example, you never have dangling files that exist but aren't contained in any folder, or that you never have multiple files with the same path, and so on.


All those considerations are important when implementing the interface but the interface itself isn't invalidated by those concerns and can cope with those constraints fine.


Can S3 murder your wife like ReiserFS and Reiser4?

https://en.wikipedia.org/w/index.php?title=Comparison_of_fil...


The problem is once you let go of those semantics, a lot of software stops working if run against such a "filesystem". If you dilute the meaning of "filesystem" too much, it becomes less useful as a term.

https://en.wikipedia.org/wiki/Andrew_File_System was interesting, and I'd actually love to see something similar re-implemented with modern ideas, but it's more of a direct-access archival system than a general-purpose filesystem[1]; you can't just put files written by arbitrary software on it. It's a bit like NFS without locks & leases, but even less like a normal filesystem; only really good for files created once that "settle down" into effectively being read-only.

[1]: I wrote https://github.com/bazil/plop that is (unfortunately undocumented) content-addressed immutable file storage over object storage, used in conjunction with a git repo with symlinks to it to manage the "naming layer". See https://bazil.org/doc/ for background, plop is basically a simplification of the ideas to get to working code easier. Site hasn't been updated in almost a decade, wow. It's in everyday use though!


Exactly, especially since the concept of a filesystem was defined before internet scale was even a thing.

Maybe S3 isn't a filesystem according to this definition, but does it really matter to make it one? I doubt it. Elastic File System is also an AWS product, but you can't really work with it the way you would with a local filesystem: an ls on any folder with over 20k files will basically time out. Does that make EFS a filesystem or not?


S3 doesn't implement the VFS API, but you can treat it as a software-defined storage filesystem, just like Ceph.

Many applications depend on file storage, such as MySQL, but horizontal scaling for those apps is still difficult in many cases. Replacing the VFS API with S3 storage is, in my experience, becoming a trend.


> Filesystem software, especially databases, can't be ported to Amazon S3

Except they can be. You don't need to overwrite the whole DB file on every INSERT/UPDATE/DELETE; those can be (and often are) stored in memory and periodically checkpointed. You might lose some writes if the process goes down between checkpoints, but for a lot of applications that's entirely acceptable.

Indeed, for SQLite in particular there are tools like Litestream that support replication to and restoration from S3.

Alternately, you could split the DB across multiple files, and then an INSERT/UPDATE/DELETE would only need to overwrite the files actually affected. This is already how server-based RDBMSs usually work.
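
As a toy illustration of that checkpoint pattern (not how Litestream itself works; the interval, names, and snapshot function here are made up, using the v1 aws-sdk-go):

  package main

  import (
      "bytes"
      "log"
      "time"

      "github.com/aws/aws-sdk-go/aws"
      "github.com/aws/aws-sdk-go/aws/session"
      "github.com/aws/aws-sdk-go/service/s3/s3manager"
  )

  // checkpointLoop re-uploads the whole serialized state as one object on a
  // timer. Writes made between checkpoints are lost if the process crashes,
  // which is the trade-off described above.
  func checkpointLoop(up *s3manager.Uploader, bucket, key string, snapshot func() ([]byte, error)) {
      for range time.Tick(10 * time.Second) {
          data, err := snapshot() // serialize the current in-memory state
          if err != nil {
              log.Printf("snapshot failed: %v", err)
              continue
          }
          if _, err := up.Upload(&s3manager.UploadInput{
              Bucket: aws.String(bucket),
              Key:    aws.String(key),
              Body:   bytes.NewReader(data),
          }); err != nil {
              log.Printf("checkpoint upload failed: %v", err)
          }
      }
  }

  func main() {
      up := s3manager.NewUploader(session.Must(session.NewSession()))
      checkpointLoop(up, "my-backups", "db/checkpoint.bin", func() ([]byte, error) {
          return []byte("serialized database state"), nil // stand-in snapshot
      })
  }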


Random note: Has anyone noticed how fast the author's webpage is? I know it's static, but I mean it's fast even for the DNS lookup. I would love to know what they're hosting it on.


The response headers include

server: cloudflare

You said it though - the reason is that it's static, without any JS/framework/SPA round-trip requests.


Could be using Cloudflare Pages hosted on an R2 bucket: https://pages.cloudflare.com/


Full stack Cloudflare is really fast


There is an nginx server running on debian stable somewhere in the dark heart of Germany. But I do have numerous tricks (too many, probably) to keep things quick.

But there are still ways to be quicker. For example, the header photo is smaller than the vector diagrams on the page, by about tenfold.


I feel like a lot of applications could use S3, but due to latency needs they typically build a layer that sits in front, which basically writes logs out to SSDs and then tiers to S3. If S3 offered a fast, reasonably priced Append() API, that would probably go a long way toward capturing those use cases.


It can be a file system.

I've written my own FUSE filesystem that uses Rabin chunking and stores the data (and metadata) in S3. The C++/AWS SDK FUSE is connected to a Go SMB server that runs locally on my Mac and works with (local) Time Machine.

I use Wasabi for cost and speed reasons.


Check out kopia.io, backup software that uses S3 to store files as blocks or pages.

You can browse, search and sort the files and directories of the different snapshots or versions of a file.

I love it !

For me it's a file system in S3.

Bonus: you must use a key to encrypt the files.


It can be misused as a filesystem. S3 wants objects? Here are some 512 or 4096 byte objects called clusters…


It seems like they're moving away from this with S3 directory buckets and Express One Zone.


I absolutely loved this article. Super well written with interesting insights.


That is very kind - glad to hear that you enjoyed it.


This screwdriver makes for a horrible hammer.


> Filesystem software, especially databases, can't be ported to Amazon S3

This seems mistaken. Porting databases that run on local disk to S3 seems like a good way to get a lashing from https://aphyr.com/

Can any databases do it correctly?

If so, I doubt they work with the model of partial overwrites. They probably have to do something very custom, and either sacrifice a lot of tail latency, or their uptime is capped by the uptime of a single AWS availability zone. Doesn't seem like a great design.

(copy of lobste.rs comment)


My employer (Neon) offers Postgres databases that run on top of a couple of caching layers at the end of which there is S3: https://neon.tech/docs/introduction/architecture-overview

Directly exposing every write to S3 would give you the partial overwrite issues as described, but one can collect a bunch of traffic and push state to S3 once it reaches a threshold. In the meantime, recent writes in the Postgres WAL are held outside of S3 in a replicated on-disk cache.


Thanks for the link.

But I searched the docs for "durability" and got zero results. Before I use anything like this, I'd like to see what durability settings are used:

https://www.postgresql.org/docs/current/non-durability.html

Litestream documents its data loss window; it seems like Neon should too:

https://litestream.io/tips/

By default, Litestream will replicate new changes to an S3 replica every second. During this time where data has not yet been replicated, a catastrophic crash on your server will result in the loss of data in that time window.

I also searched for "data loss" and got zero results -- this is important because Neon is almost certainly sacrificing durability for performance.


Neon handles that by staging the WAL segments on 3x replicated Safekeeper nodes. Durability relies on not having all of those blow up at the same time. I'd expect it to be much safer than traditional Postgres replication mechanisms (with the trade-off having a comparatively large minimum node count; Neon really is built for multitenancy where that cost can be amortized across lots of databases).


> I searched the docs for "durability" and got zero results.

The link I gave above explains it, right in the sentences that mention "durability":

> Safekeepers are responsible for durability of recent updates. Postgres streams Write-Ahead Log (WAL) to the Safekeepers, and the Safekeepers store the WAL durably until it has been processed by the Pageservers and uploaded to cloud storage.

> Safekeepers can be thought of as an ultra reliable write buffer that holds the latest data until it is processed and uploaded to cloud storage. Safekeepers implement the Paxos protocol for reliability.


OK thanks, I glossed over that part. But really what I look for is (1) explicit claims about durability, and (2) a third party (e.g. Aphyr) having actually tested the claims.

If there's no claim, then it's impossible to test :)

In particular, there are no numbers in the description you quoted.

Litestream gives a relatively weak claim, but it could be tested, which actually gives me more confidence in it.

If you look at what aphyr writes, a lot of it is claims from vendors that turned out to be false - https://aphyr.com/tags/jepsen


It’s been a while, but I really like the way google handles its file system internally. No confusion.


Holy crap I hate Hacker News! Why tf did this get a downvote? Colossus is a good fs.


JFC the people on this thread missing the difference between object storage and a blocks-and-inodes filesystem is alarming


S3 is a key value store. Just happens to be able to store really large values.


The limitations of S3 (and all the cloud "file systems") are quite astonishing when you consider you're paying for it as a premium service.

Try to imagine your astonishment if a traditional storage vendor showed up and told you that the very expensive premium file system they had just sold you:

    - can't store log files, because it can't append
      anything to an existing file
    - can't copy files larger than 5GB
    - can't rename or move a file
 
When challenged on how you are supposed to make all your applications work with limitations like that, they glibly told you "oh you're supposed to rewrite them all".


They're not filesystems though; they're object storage, or key/value storage if you will. It's intended to store the log files for the long term once they're full.

You can rename / move a file, but it involves copying and deleting the original; I don't understand why they don't have a shortcut for that, but it probably makes sense that the user of the service is aware of the process instead of hiding it.

I'm not sure about the 5GB limit; it's probably documented somewhere why that is. Possibly, like tweets, having an upper limit helps them optimize things. Anyway, there too there are tools: you can do multipart copies, and there's this official blog post on the subject: https://aws.amazon.com/blogs/storage/copying-objects-greater...

Interesting to note, maybe, in the context of the post: copy, rename, moving large files, all of that could be abstracted away, but that would hide the underlying logic (which might lead to inefficient usage of the service) and, worse, make users think it's just a filesystem and use it accordingly, when it's not intended or designed for that use case.
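
To make the rename point concrete, the "shortcut" everyone ends up writing themselves looks roughly like this (my own sketch with the v1 aws-sdk-go; bucket and key names are made up, and multipart copy for objects over 5GB is left out):

  package main

  import (
      "log"

      "github.com/aws/aws-sdk-go/aws"
      "github.com/aws/aws-sdk-go/aws/session"
      "github.com/aws/aws-sdk-go/service/s3"
  )

  // rename fakes a rename by copying and then deleting. It is not atomic
  // (both keys exist for a while), it gets slower as the object gets bigger,
  // and above 5GB it needs multipart copy instead of a single CopyObject.
  func rename(svc *s3.S3, bucket, oldKey, newKey string) error {
      _, err := svc.CopyObject(&s3.CopyObjectInput{
          Bucket: aws.String(bucket),
          // CopySource should be URL-encoded for keys with special characters.
          CopySource: aws.String(bucket + "/" + oldKey),
          Key:        aws.String(newKey),
      })
      if err != nil {
          return err
      }
      _, err = svc.DeleteObject(&s3.DeleteObjectInput{
          Bucket: aws.String(bucket),
          Key:    aws.String(oldKey),
      })
      return err
  }

  func main() {
      svc := s3.New(session.Must(session.NewSession()))
      if err := rename(svc, "my-bucket", "dir/file.jpg", "other-dir/file.jpg"); err != nil {
          log.Fatal(err)
      }
  }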


The current limit is 5TB. The 5GB limit is for a single upload; you can however do a multipart upload to get up to the maximum object size of 5TB.

https://aws.amazon.com/s3/faqs/


Amazon doesn't market S3 as a replacement for file systems; that's why EBS exists.

Also, is S3 really “very expensive”? Relative to what?


Object storage like S3 is usually the cheapest storage, not only on Amazon but on other clouds too. I don't understand why.


This is not true in my experience: https://www.backblaze.com/cloud-storage/pricing


That Backblaze page (not surprisingly) compares their prices to a fairly expensive S3 pricing tier and makes other assumptions in Backblaze's favour. For some use cases B2 is more expensive, e.g. one copy of my backups goes to AWS Glacier Deep Archive, which is really cheap.


It's for building things on top of. If you want to rename/move/copy data, implement a layer that maps objects to "filenames" or any metadata you like (or use some lib). If you want to write logs, implement append and rotation. But I, for example, don't and won't need any of that, and if it helps keep the API simpler and more reliable then I benefit.

Being a conventional filesystem would, for S3, be either a very leaky abstraction or a completely different product.


S3 is object storage, not a file system. The file system offering in AWS is called EFS. S3 is not positioned as a substitute for file systems, either.


It's not a filesystem, but it has better semantics for distributed operation because of it. Nobody talks about the locking semantics of S3 because it's at the blob level; that rules out whole categories of problems.

And that's also why you can't append. If you had multiple readers while appending, and appending to multiple replicas, guaranteeing that each reader would see a consistent only-forwards read of the append is extremely hard. So simply ban people from doing that and force them to use a different system designed for the purpose of logging.

Microservices. S3 is for blobs. If you want something that isn't a blob, use a different microservice.


These “file systems” are not file systems and I don’t understand why people expect them to be.

Some people are creating tools that make those services easier to sync with file systems, but that is not the intended use anyway.


My big pet peeve is AWS adding buttons in the UI to make "folders".

It is also a fiction! There are no folders in S3.

> When you create a folder in Amazon S3, S3 creates a 0-byte object with a key that's set to the folder name that you provided. For example, if you create a folder named photos in your bucket, the Amazon S3 console creates a 0-byte object with the key photos/. The console creates this object to support the idea of folders.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-...
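
In other words, "create folder" boils down to a single put of an empty object (a sketch using the v1 aws-sdk-go; the bucket and key are made up):

  package main

  import (
      "bytes"
      "log"

      "github.com/aws/aws-sdk-go/aws"
      "github.com/aws/aws-sdk-go/aws/session"
      "github.com/aws/aws-sdk-go/service/s3"
  )

  func main() {
      svc := s3.New(session.Must(session.NewSession()))

      // The console's "create folder" is just a zero-byte object whose key
      // ends in a slash. Nothing hierarchical exists on the server side.
      _, err := svc.PutObject(&s3.PutObjectInput{
          Bucket: aws.String("my-bucket"),
          Key:    aws.String("photos/"),
          Body:   bytes.NewReader(nil),
      })
      if err != nil {
          log.Fatal(err)
      }
  }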


Hmm, well, there are no folders, but if you interact with the object the URL does become nested. So in a sense it does behave exactly like a folder for all intents and purposes when dealing with it that way. It depends what API you use, I guess.

I use S3 just as a web bucket of files (I know it's not the best way to do that, but it's what I could easily obtain through our company's processes). In this case it makes a lot of sense, though I try to avoid making folders. Other people using the same hosting do use them.


Except stuff like the s3 CLI has all these weird names for normal filesystem items, and you have to bang your head to figure out what it all means

(also don't get me started on the whole s3api thing)


Is that really so different from how folders work on other systems? A directory inode is just an inode.


Yes. It is, in practice, incredibly different.

Imagine you have a file named /some/dir/file.jpg.

In a filesystem, there’s an inode for /some. It contains an entry for /some/dir, which is also an inode, and then in the very deepest level, there is an inode for /some/dir/file.jpg. You can rename /some to /something_else if you want. Think of it kind of like a table:

  +-------+--------+----------+-------+
  | inode | parent |     name |  data |
  +-------+--------+----------+-------+
  |     1 | (null) |     some | (dir) |
  |     2 |      1 |      dir | (dir) |
  |     3 |      2 | file.jpg |  jpeg |
  +-------+--------+----------+-------+
In S3 (and other object stores), the table is like this:

  +-------------------+------+
  | key               | data |
  +-------------------+------+
  | some/dir/file.jpg | jpeg |
  +-------------------+------+
The kind of queries you can do is completely different. There are no inodes in S3. There is just a mapping from keys to objects. There’s an index on these keys, so you can do queries—but the / character is NOT SPECIAL and does not actually have any significance to the S3 storage system and API. The / character only has significance in the UI.

You can, if you want, use a completely different character to separate “components” in S3, rather than using /, because / is not special. If you want something like “some:dir:file.jpg” or “some.dir.file.jpg” you can do that. Again, because / is not special.
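
You can see this in the API: the console's folder view is just a prefix query with a delimiter the client chooses (a sketch using the v1 aws-sdk-go; bucket and prefix are made up, and you could pass ":" as the delimiter to get ":"-separated "folders" instead):

  package main

  import (
      "fmt"
      "log"

      "github.com/aws/aws-sdk-go/aws"
      "github.com/aws/aws-sdk-go/aws/session"
      "github.com/aws/aws-sdk-go/service/s3"
  )

  func main() {
      svc := s3.New(session.Must(session.NewSession()))

      // "List the folder some/dir/": just a prefix query. The delimiter is
      // whatever the client says it is; S3 itself attaches no meaning to "/".
      out, err := svc.ListObjectsV2(&s3.ListObjectsV2Input{
          Bucket:    aws.String("my-bucket"),
          Prefix:    aws.String("some/dir/"),
          Delimiter: aws.String("/"),
      })
      if err != nil {
          log.Fatal(err)
      }
      for _, p := range out.CommonPrefixes {
          fmt.Println("pseudo-folder:", *p.Prefix) // e.g. some/dir/sub/
      }
      for _, obj := range out.Contents {
          fmt.Println("object:", *obj.Key) // e.g. some/dir/file.jpg
      }
  }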


Except, S3 does let you query by prefix and so the keys have more structure than the second diagram implies: they’re not just random keys, the API implies that common prefixes indicate related objects.


That’s kind of stretching the idea of “more structure” to the breaking point, I think. The key is just a string. There is no entry for directories.

> the API implies that common prefixes indicate related objects.

That’s something users do. The API doesn’t imply anything is related.

And prefixes can be anything, not just directories. If you have /some/dir/file.jpg, then you can query using /some/dir/ as a prefix (like a directory!) or you can query using /so as a prefix, or /some/dir/fil as a prefix. It’s just a string. It only looks like a directory when you, the user, decide to interpret the / in the file key as a directory separator. You could just as easily use any other character.


One operation where this difference is significant is renaming a "folder". In UNIX (and even UNIX-y distributed filesystems like HDFS) a rename operation at "folder" level is O(1) as it only involves metadata changes. In S3, renaming a "folder" is O(number of files).


> In S3, renaming a "folder" is O(number of files).

More like O(max(number of files, total file size)). You can’t rename objects in S3. To simulate a rename, you have to copy an object and then delete the old one.

Unlike renames in typical file systems, that isn’t atomic (there will be a time period in which both the old and the new object exist), and it becomes slower the larger the file.


From reading the above, if you have a folder 'dir' and a file 'dir/file', after renaming 'dir' to 'folder', you would just have 'folder' and 'dir/file'.


There is really no such thing as a folder in S3.

If you have something which is dir/file, then NORMALLY “dir” does not exist at all. Only dir/file exists. There is nothing to rename.

If you happen to have something which is named “dir”, then it’s just another file (a.k.a. object). In that scenario, you have two files (objects) named “dir” and “dir/file”. Weird, but nothing stopping you from doing that. You can also have another object named “dir///../file” or something, although that can be inconvenient, for various reasons.


Imho, renaming "folders" on S3 results in copying and deleting O(number of files)


Exactly.


> That’s something users do. The API doesn’t imply anything is related.

Querying ids by prefix doesn’t make any sense for a normal ID type. Just making this operation available and part of your public API indicates that prefixes are semantically relevant to your API’s ID type.


“Prefix” is not the same thing as “directory”.

I can look up names with the prefix “B” and get Bart, Bella, Brooke, Blake, etc. That doesn’t imply that there’s some kind of semantics associated with prefixes. It’s just a feature of your system that you may find useful. The fact that these names have a common prefix, “B”, is not a particularly interesting thing to me. Just like if I had a list of files, 1.jpg, 10.jpg, 100.jpg, it’s probably not significant that they’re being returned sequentially (because I probably want 2.jpg after 1.jpg).


By this logic the file "foo/bar/" corresponds to the filename "f:o:o:/:b:a:r:/" (using a different character as separator)


Exactly


"filesystem" is not a name reserved for Unix-style file systems. There are many types of file system which is not built on according to your description. When I was a kid, I used systems which didn't support directories, but it was still file systems.

It's an incorrect take that a system to manage files must follow a set of patterns like the ones you mentioned to be called "file system".


Terms evolve, and now "filesystem" and "system of files" mean different things.

I would argue that not supporting folders or many other file operations make something not a filesystem today.


You're free to argue whatever you want, but claiming that a file system should have folders as the parent commenter did, or support specific operations, seems a bit meaningless.

I could create a system not supporting folders because it relies on tags or something else. Or I could create a system which is write-only and doesn't support rename or delete.

These systems would be file systems according to how the term has been used for 40 (?) years at least. Just don't see any point in restricting the term to exclude random variants.


Yeah, "hacker" used to not mean someone hacking into a computer and breaking a password; then it did; and now it means both that and a tech tinkerer.


Thank you, now I understand what the special 0-byte object refers to. It represents an empty folder.

Fair enough, basing folders on object names split by / is pretty inefficient. I wonder why they didn't go with a solution like git's trees.


> Fair enough, basing folders on object names split by / is pretty inefficient. I wonder why they didn't go with a solution like git's trees.

What, exactly, is inefficient about it?

Think for a moment about the data structures you would use to represent a directory structure in a filesystem, and the data structures you would use to represent a key/value store.

With a filesystem, if you split a string /some/dir/file.jpg into three parts, “some”, “dir”, “file.jpg”, then you are actually making a decision about the tree structure. And here’s a question—is that a balanced tree you got there? Maybe it’s completely unbalanced! That’s actually inefficient.

Let’s suppose, instead, you treat the key as a plain string and stick it in a tree. You have a lot of freedom now, in how you balance the tree, since you are not forced to stick nodes in the tree at every / character.

It's just a different efficiency tradeoff. Certain operations are now much less efficient (like "rename a directory", which, on S3, is actually "copy a zillion objects"). Some operations are more efficient, like "store a file" or "retrieve a file".


I think it is fair to say that S3 (as named files) is not a filesystem and it is inefficient to use it directly as such for common filesystem use cases; the same way that you could say it for a tarball[0].

This does not make S3 a bad storage, just a bad filesystem, not everything needs to be a filesystem.

Arguably it is good that S3 is not a filesystem, as that can be a leaky abstraction, e.g. in git you cannot have two tags named "v2" and "v2/feature-1", because you cannot have both a file and a folder with the same name.

For something more closely related to URLs than filenames, forcing a filesystem abstraction is a limitation, as "/some/url", "/some/url/", and "/some/url/some-default-name-decided-by-the-webserver" can be different.[1]

[0] where a different tradeoff is that searching a file by name is slower but reading many small files can be faster.

[1] maybe they should be the same, but enforcing it is a bad idea


I think what you’re describing is simply not a hierarchical file system. It’s a different thing that supports different operations and, indeed, is better or worse at different operations.


Renaming a folder is inefficient.


> […] what the special 0-byte object refers to. It represents an empty folder.

Alas, no. It represents a tag, e.g. «folder/», that points to a zero-byte object.

You can then upload two files, e.g. «folder/file1.txt» and «folder/file2.txt», delete the «folder/» tag, and still have «folder/file1.txt» and «folder/file2.txt» intact in the S3 bucket.

Deleting «folder/» in a traditional file system, on the other hand, will also delete «file1.txt» and «file2.txt» in it.


It's a matter of client UI implementation. You can't delete a non-empty folder with the POSIX API on common filesystems, or over FTP, either.

However, there are file managers, FTP clients, and S3 clients that will do that for you by deleting individual files.


But if the S3 semantics are not helping you, e.g. with multiple clients doing copy/move/delete operations in the hierarchy, you could still end up with files that are not in "directories".

So essentially an S3 file manager must be able to handle the situation where there are files without a "directory", and that, I assume, is also the most common case for S3. You might as well not have the "directories" in the first place.


I have personally never seen the 0-byte files people keep talking about here. In every S3 bucket I’ve ever looked at, the “directories” don’t exist at all. If you have a dir/file1.txt and dir/file2.txt, there is NO such object as dir. Not even a placeholder.


Yeah, this post was the first one I had even heard of them.


Deleting folder/ in a traditional file system will _fail_ if the folder is not empty. Userspace needs to recurse over the directory structure to unlink everything in it before unlinking the actual folder.


"folders" do not exist in S3 -- why do you keep insisting that they do?

They appear to exist because the key is split on the slash character for navigation in the web front-end. This gives the familiar appearance of a filesystem, but the implementation is at a much higher level.


Let’s start with the fact that you’re talking to an HTTP api… Even if S3 had web3.0 inodes, the querying semantics would not make sense. It’s a higher level API, because you don’t deal with blocks of magnetic storage and binary buffers. Of course s3 is not a filesystem, that is part of its definition, and reason to be…


I think if you focus too narrowly on the details of the wire protocol, you’ll lose sight of the big picture and the semantics.

S3 is not a filesystem because the semantics are different from the kind of semantics we expect from filesystems. You can’t take the high-level API provided by a filesystem, use S3 as the backing storage, and expect to get good performance out of it unless you use a ton of translation.

Stuff like NFS or CIFS are filesystems. They behave like filesystems, in practice. You can rename files. You can modify files. You can create directories.


Right, NFS/CIFS support writing blocks, but S3 basically does HTTP GET and PUT verbs. I would say that these concepts are the defining difference. To call S3 a filesystem is not wrong in the abstract, but it's no different than calling Wordpress a filesystem, or DNS, or anything that stores something for you. Of course it will be inefficient to implement a block write on top of any of these; that's because you have to literally do it yourself, as in: download the file, edit it, upload it again.


I think the blocks are one part of it, and the other part is that S3 doesn’t support renaming or moving objects, and doesn’t have directories (just prefixes). Whenever I’ve seen something with filesystem-like semantics on top of S3, it’s done by using S3 as a storage layer, and building some other kind of view of the storage on top using a separate index.

For example, maybe you have a database mapping file paths to S3 objects. This gives you a separate metadata layer, with S3 as the storage layer for large blocks of data.


Even youngsters are yelling at clouds now. Just a different kind of cloud.


Another challenge is directory flattening. On a file system "a/b" and "a//b" are usually considered the same path. But on S3 the slash isn't a directory separator, so the paths are distinct. You need to be extra careful when building paths not to include double slashes.

Many tools end up handling this by showing a folder named "a" containing a folder named "" (empty string). This confuses users quite a bit. It's more than the inodes, it's how the tooling handles the abstraction.
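
A small defensive habit that avoids the double-slash problem (a sketch; the helper name is made up):

  package main

  import (
      "fmt"
      "strings"
  )

  // joinKey builds an S3 key without accidentally producing double slashes,
  // since "a/b" and "a//b" are distinct keys as far as S3 is concerned.
  // (path.Join would also collapse the slashes, but it strips trailing
  // slashes too, which matters for "folder" placeholder keys.)
  func joinKey(base, name string) string {
      return strings.TrimSuffix(base, "/") + "/" + strings.TrimPrefix(name, "/")
  }

  func main() {
      fmt.Println(joinKey("a/", "/b")) // prints "a/b", not "a//b"
  }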


Coincidentally I ran into an issue just like this a week ago. A customer facing application failed because there was an object named “/foo/bar” (emphasis on the leading slash).

This created a prefix named “/“ which confused the hell out of the application.


In S3 each file is identified with a full path.

Not only can you not rename a single file, you also cannot rename a "folder" (because that would imply a bulk rename of a large number of children of that "folder").

This is the fundamental difference between a first class folder and just a convention on prefixes of full path names.

If you don't allow renames, it doesn't really make sense to have each "folder" store the list of its children.

You can instead have one giant ordered map (some kind of B-tree) that allows for efficient lookup and scanning of neighbouring nodes.


The UMich LDAP server, upon which many were based, stored entries' hierarchical (distinguished) names with each entry, which I always found a bit weird. AD, eDirectory, and the OpenLDAP HDB backend don't have this problem.


You can create a simulated directory, and write a bunch of files in it, but you can't atomically rename it--behind the scenes each file needs to be copied from old name to new.


The payload still contains a list of other inodes though


This!

I'm fine with it, and I actually appreciate the logic and simplicity behind it, but the number of times I've tried to explain why "folders" on S3 keep disappearing while people stare at me like I'm an idiot is really frustrating.

(When you remove the last file in a "folder" on S3, the "folder" disappears, because that prefix no longer appears in the bucket's k/v dictionary, so there's no reason to show it; it never existed in the first place.)


Weird that it says folders now. I remember it being very strictly called a prefix when I was at AWS.


I think it's just the web console; it's still "prefix" in the APIs and CLI.

https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObje...


The web console even collapses them like folders on slashes, further obfuscating how it actually works. I remember having to explain to coworkers why it was so slow to load a large bucket.


What exactly do you think a folder is? It’s just an abstraction for organising data.


S3 doesn’t have that abstraction.

The console UI shows folders but they don’t actually exist in S3. They’re made up by the UI.


It sounds like they have that abstraction in the UI. But if the CLI and API don't have it too, that's weird.


Yeah, the UI and CLI show you “folders”. It’s a client-side thing that doesn’t exist in the actual service. Behind the scenes, the clients are making specific types of queries on the object keys.

You can’t examine when a folder was created (it doesn’t exist in the first place), you can’t rename a folder (it doesn’t exist), you can’t delete a folder (again, it doesn’t exist).


That's just an implementation detail of well known filesystems.


Yes, which is why it's not ideal to reuse the folder metaphor here. Users have an idea how directories work on well-known filesystems and get confused when these fake folders don't behave the same way.


Are all your S3 keys opaque strings (like UUIDs)? Do you use / (slash) in your keys?

If you truly believe S3 has absolutely no connection to folders, you would answer Yes and No.


It sounds to me like you’re arguing about what the definition of “folders” is.

“Any hierarchical path structure is a folder” is maybe your definition of “folder”, from what I can tell. I would say that S3 lets you treat paths as hierarchical, but that S3 does not have folders—obviously I have a different definition of “folder” than you do.

We’ve discovered that we have different definitions of “folder”, and therefore, we are not going to agree about whether it is true that “S3 does not have folders” unless we have an argument about what the correct definition of “folder” is. I’m not really interested in that discussion—it’s enough to understand what somebody means when they say “S3 does not have folders” even if you think their definitions are wrong.


I don’t think that’s a defensible standpoint.

Folders are an important part of the way most people use filesystems.


If you can't rename or delete a folder, yeah, I would say folders don't really exist.


Similarly, the UI in Linux is making up the notion of folders and the files in them, but we don't say they don't exist.


No, they're not made up. A folder (or directory) is a specific type of inode, just as a file is.

S3 doesn't have folders. The UI fakes them by creating a 0-byte object (or file, if you will). It's a kludge.


The UI will fake them without even creating the 0-byte object.


Directories actually exist on the filesystem, which is why you have to create them before use and they can exist and be empty. They don't exist in S3 and neither of those properties do, either. Similarly, common filesystem operations on directories (like efficiently renaming them, and thus the files under them) are not possible in S3.

Of course it can still be useful to group objects in the S3 UI, but it would probably be better to use some kind of prefix-centric UI rather than reusing the folder metaphor when it doesn't match the paradigm people are used to.


Speaking of user interfaces with optical illusions about directory separators:

On the Mac, the Finder lets you have files with slashes in their names, even though it's a Unix file system underneath. Don't believe me? Go try to use the Finder to make a directory whose name is "Reports from 2024/03/10". See?

But as everyone knows, slash is the ONLY character you're not allowed to have in a file or directory name under Unix. It's enforced in the kernel at the system call interface. There is absolutely no way to make a file with a slash in it. Yet there it is!

The original MacOS operating system used the ":" character to delimit directory names, instead of "/", so you could have files and directories with slashes in their names, just not with colons in their names.

When Apple transitioned from MacOS to Unix, they did not want to freak out their users by renaming all their files.

So now try to use the Finder (or any app that uses the standard file dialog) to make a folder or file with a ":" in its name on a modern Mac. You still can't!

So now go into the shell and list out the parent directory containing the directory you made with a slash in its name. It's actually called "Reports from 2024:03:10"!

The Mac Finder and system file dialog user interfaces actually switch "/" and ":" when they show paths on the screen!

Try making a file in the shell with colons in it, then look at it in the finder to see the slashes.

However, back in the days of the old MacOS that permitted slashes in file names, there was a handy network gateway box called the "Gatorbox" that was a Localtalk-to-Ethernet AFP/NFS bridge, which took a subtly different approach.

https://en.wikipedia.org/wiki/GatorBox

It took advantage of the fact (or rather it triggered the bug) that the Unix NFS implementation boldly made an end-run around the kernel's safe system call interface that disallowed slashes in file names. So any NFS client could actually trick Unix into putting slashes into file names via the NFS protocol!

It appeared to work just fine, but then down the line the Unix "restore" command would totally shit itself! Of course "dump" worked just fine, never raising an error that it was writing corrupted dumps that you would not be able to read back in your time of need, so you'd only learn that you'd been screwed by the bug and lost all your files months or years later!

So not only does NFS stand for "No File Security", it also stands for "Nasty Forbidden Slashes"!

https://news.ycombinator.com/item?id=31820504

>NFS originally stood for "No File Security".

>The NFS protocol wasn't just stateless, but also securityless!

>Stewart, remember the open secret that almost everybody at Sun knew about, in which you could tftp a host's /etc/exports (because tftp was set up by default in a way that left it wide open to anyone from anywhere reading files in /etc) to learn the name of all the servers a host allowed to mount its file system, and then in a root shell simply go "hostname foo ; mount remote:/dir /mnt ; hostname `hostname`" to temporarily change the CLIENT's hostname to the name of a host that the SERVER allowed to mount the directory, then mount it (claiming to be an allowed client), then switch it back?

>That's right, the server didn't bother checking the client's IP address against the host name it claimed to be in the NFS mountd request. That's right: the protocol itself let the client tell the server what its host name was, and the server implementation didn't check that against the client's ip address. Nice professional protocol design and implementation, huh?

>Yes, that actually worked, because the NFS protocol laughably trusted the CLIENT to identify its host name for security purposes. That level of "trust" was built into the original NFS protocol and implementation from day one, by the geniuses at Sun who originally designed it. The network is the computer is insecure, indeed.

[...]

From the Unix-Haters Handbook:

https://archive.org/stream/TheUnixHatersHandbook/ugh_djvu.tx...

Don't Touch That Slash!

UFS allows any character in a filename except for the slash (/) and the ASCII NUL character. (Some versions of Unix allow ASCII characters with the high-bit, bit 8, set. Others don't.)

This feature is great — especially in versions of Unix based on Berkeley's Fast File System, which allows filenames longer than 14 characters. It means that you are free to construct informative, easy-to-understand filenames like these:

1992 Sales Report

Personnel File: Verne, Jules

rt005mfkbgkw0.cp

Unfortunately, the rest of Unix isn't as tolerant. Of the filenames shown above, only rt005mfkbgkw0.cp will work with the majority of Unix utilities (which generally can't tolerate spaces in filenames).

However, don't fret: Unix will let you construct filenames that have control characters or graphics symbols in them. (Some versions will even let you build files that have no name at all.) This can be a great security feature — especially if you have control keys on your keyboard that other people don't have on theirs. That's right: you can literally create files with names that other people can't access. It sort of makes up for the lack of serious security access controls in the rest of Unix.

Recall that Unix does place one hard-and-fast restriction on filenames: they may never, ever contain the magic slash character (/), since the Unix kernel uses the slash to denote subdirectories. To enforce this requirement, the Unix kernel simply will never let you create a filename that has a slash in it. (However, you can have a filename with the 0200 bit set, which does list on some versions of Unix as a slash character.)

Never? Well, hardly ever.

    Date: Mon, 8 Jan 90 18:41:57 PST 
    From: sun!wrs!yuba!steve@decwrl.dec.com (Steve Sekiguchi) 
    Subject: Info-Mac Digest V8 #3 5 

    I've got a rather difficult problem here. We've got a Gator Box run- 
    ning the NFS/AFP conversion. We use this to hook up Macs and 
    Suns. With the Sun as a AppleShare File server. All of this works 
    great! 

    Now here is the problem, Macs are allowed to create files on the Sun/ 
    Unix fileserver with a "/" in the filename. This is great until you try 
    to restore one of these files from your "dump" tapes, "restore" core 
    dumps when it runs into a file with a "/" in the filename. As far as I 
    can tell the "dump" tape is fine. 

    Does anyone have a suggestion for getting the files off the backup 
    tape? 

    Thanks in Advance, 

    Steven Sekiguchi Wind River Systems 

    sun!wrs!steve, steve@wrs.com Emeryville CA, 94608
Apparently Sun's circa 1990 NFS server (which runs inside the kernel) assumed that an NFS client would never, ever send a filename that had a slash inside it and thus didn't bother to check for the illegal character. We're surprised that the files got written to the dump tape at all. (Then again, perhaps they didn't. There's really no way to tell for sure, is there now?)


I'm having a lot of fun imagining this being said to a kid who's trying to buy some folders for school.


Is it an abstraction for requesting the data you want, or an abstraction for storing the data in a retrievable manner?


I don't know why you are being downvoted, what you said is true and confuses many newcomers.


I see you getting downvotes, but you’re speaking the honest truth, here.


[flagged]


Their writing seems okay to me; which specific parts do you find atrocious?


Author here, I can't reply to the GP because that comment is "dead" but yes, please, be specific!

Then I can fix the sentences that are bad and perhaps also improve in the future.


"Even though the file API handles all those concerns, but it doesn't expose them to you. A narrow interface handling a large number of concerns - that makes the unix file API a "deep" module."

Both sentences here are incomplete, incoherent. I did not read past this point.


Thanks. I take your point on the first one and have corrected it (maybe you need to shift+F5 to bust your cache in order to see it).

For the second one, what's your objection? That it's fragmentary?


I don't know what it's trying to say. Making it a complete sentence would be a good first step. Don't try fancy stuff like this unless showing off style is more important to you than communicating coherently. In technical writing, little is more important than clarity.


A narrow interface handling a large number of concerns... makes the UNIX file API a "deep" module.

Seems self-explanatory to me. "Deep module" [0][1] is a well-defined term.

[0]: https://dev.to/gosukiwi/software-design-deep-modules-2on9

[1]: https://www.amazon.com/Philosophy-Software-Design-John-Ouste...


Honestly the only minor criticism I can see of the OP's writing is to remove things that make it seem more disjointed than it is - dashes, unqualified pronouns (what is "it"? AWS? UNIX file system API? a particular module? all modules?). That's all.


Thanks for your advice. I agree and will try to improve on this for the future


[flagged]


sure seems like it



