I kind of agree with garnaat's reply: if they just suddenly change the default from True to False, they're going to break backwards compatibility for everyone using the library, and worse still, in a really subtle way!
All they can really do at this stage is add a warning to the documentation and hope that new people using the library figure out the significance.
Until they change, and your code breaks, and you have to go figure out why. The point of not (silently) using defaults is that as long as you're explicit with your arguments, you won't get surprised by changing defaults. Of course, you might get surprised by any number of other changes, but you can at least reduce your failure surface area a little bit :-)
I found this exact same issue a few years ago after determining I had a reasonably-long-existing $75/day "leak" in my S3 expenses :/. (Seriously: $75/day, a number I am not exaggerating at all.)
> 09:00:04 < saurik> dls: I am performing millions of ListBucket request to this one amazon bucket every day
> 09:00:25 < dls> d'oh
> 09:00:28 < saurik> adding up to almost 1.5 gigabytes of data traffic in/out on just those requests
> 09:00:41 < saurik> I have NO CLUE what could POSSIBLY doing even a SINGLE ListBucket request on that bucket
Also see MimicDB [0]. Runs a transparent cache that responds to most S3 API calls locally using Redis to store metadata. Besides the cost savings, it's extremely fast. Listing an entire bucket with millions of objects is close to instantaneous.
While the caching would be more effective with one large centralized instance, I think the intended use case is to have one cache per server. So then it's not really extra infrastructure.
Actually, you can do it either way since it's backed by Redis. You can set all servers to connect to the same Redis instance, or run them all individually.
No, and it wouldn't help much if there was. The biggest benefits come from running the metadata cache locally, or at least on the same local network. The cost/time savings come from preventing external API calls.
That's only true if it's local. A 2-million-key response from an externally hosted MimicDB instance would probably have to be split into 2,000 requests and would take about the same amount of time as hitting S3 directly.
Send me an email nathan@nathancahill.com and I can show you how to set up a local cache. It's quite simple.
I'm the article author, and yes, we're using or evaluating both services. I find myself swapping the two all the time in conversation. Pretty silly typo on my part :)
I really hated working with boto for S3. The abstractions they chose are really ambiguous, and it feels like the library is fighting against the underlying API. If you're building anything on top of S3, it might be easier to write a thin S3 REST client and use it directly instead of going through boto. So many fewer surprises, and no more digging around in the boto source trying to figure out what some function actually does.
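For the kind of thin client being suggested, request signing is the only nontrivial piece. Here's a minimal sketch of the header-based auth scheme of that era (signature v2); the function name and shape are illustrative, not any library's API:

```python
import base64
import hashlib
import hmac
from email.utils import formatdate

def sign_get(access_key, secret_key, bucket, key):
    # Signature v2 header auth, sketched: HMAC-SHA1 over a canonical
    # string, base64-encoded into the Authorization header.
    date = formatdate(usegmt=True)
    string_to_sign = "GET\n\n\n%s\n/%s/%s" % (date, bucket, key)
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    return {
        "Date": date,
        "Authorization": "AWS %s:%s" % (access_key, signature),
    }
```

Pair these headers with a plain GET to `https://<bucket>.s3.amazonaws.com/<key>` using any HTTP client, and there's nothing hidden between you and the API.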
Agreed, I doubt Amazon had any evil intentions with this change.
I wonder more if this was a product of Amazon developers using S3 (i.e. dogfooding) and not noticing the cost side effect because I'm assuming they don't get billed?
The change is the opposite of what you're implying: it adds the option to turn validation off. Before this commit it was always on. My guess is that when the option was added, it defaulted to validate=True to maintain backward compatibility.
You may want to check your LIST request statistics over the next few weeks. Between this thread, the boto issue, etc., I'd be curious whether you see a noticeable decline in LIST requests from the attention this has brought, purely from a data standpoint.
> interestingly enough, here's the original commit that defaulted `validate=True`:
It's the commit that specifically added validate to allow skipping validation. If you read the diff, the call originally performed the validation unconditionally.
This has been in the boto library for years, across multiple services. I noticed it when first using SimpleDB: switching it on in a production environment was much more expensive than we had originally calculated. I spotted the bizarre "Domain Validate" calls after poring through logs of all boto activity.
I still don't agree with their reasoning for leaving it on by default. Once you try to do something on a non-existent domain/bucket, it will throw an error anyway, and I would argue the "extra work" of handling that is much cheaper than leaving these defaults on, which I expect are completely redundant for most users.
To further add: the PyPI version hasn't been updated since March 2013, but the last time the relevant lines were changed was January 2013, so you should still be fine with the PyPI version (assuming changes from then went out in the last release).
The setting in question is: AWS_AUTO_CREATE_BUCKET.
As far as I can see, that means this code is broken: if no validation is done it won't raise S3ResponseError (because there's no S3 query involved), and will never raise `ImproperlyConfigured`. And if you ask for your buckets to be auto-created, it will silently do one more request per (existing) bucket, on top of the auto-creation one.
And the conditional inside the except is useless, of course, since you can only be in the except if `auto_create_bucket` is set.
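Here's a self-contained sketch of the dead-code path being described, with stand-in stubs rather than the real boto/django-storages code:

```python
class S3ResponseError(Exception):
    """Stand-in for boto's exception; for illustration only."""

def get_bucket(name, validate=True):
    # Models boto's behavior: with validate=False the bucket object is
    # returned without any S3 round-trip, so a missing bucket raises
    # nothing at this point.
    if validate:
        raise S3ResponseError("404 NoSuchBucket")
    return {"name": name}

def broken_lookup(name, auto_create_bucket=False):
    # The pattern described above, paraphrased: once validation is
    # skipped, the except branch can never fire, so neither the
    # auto-create nor the ImproperlyConfigured path is reachable.
    try:
        return get_bucket(name, validate=False)
    except S3ResponseError:
        if auto_create_bucket:
            return {"name": name, "created": True}
        raise RuntimeError("ImproperlyConfigured: bucket %s missing" % name)
```

Calling broken_lookup on a bucket that doesn't exist "succeeds" silently, which is exactly the failure mode the comment points out.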
I discovered the 'validate' argument just recently.
My concern was not costs, it was latency and response times.
My server is generating an HTML page with signed URLs to resources on S3. The 'get_bucket' call adds a bit of latency as it contacts S3, and I was thinking: does it really need anything from S3 to generate signed URLs if I already know the exact key names and am pretty sure the bucket exists? Well, it does not, and adding "validate=False" sped things up noticeably.
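That matches how signed URLs work: the signature is computed entirely from local inputs, so no round-trip is needed. A rough sketch of the query-string scheme of that era (signature v2; names here are hypothetical, not boto's internals):

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import quote

def signed_url(access_key, secret_key, bucket, key, expires_in=3600):
    # Everything needed is local -- bucket name, key name, credentials,
    # expiry -- which is why validate=False costs nothing here.
    expires = int(time.time()) + expires_in
    string_to_sign = "GET\n\n\n%d\n/%s/%s" % (expires, bucket, key)
    sig = base64.b64encode(hmac.new(secret_key.encode(),
                                    string_to_sign.encode(),
                                    hashlib.sha1).digest()).decode()
    return ("https://%s.s3.amazonaws.com/%s?AWSAccessKeyId=%s"
            "&Expires=%d&Signature=%s"
            % (bucket, quote(key), access_key, expires, quote(sig, safe="")))
```

No network I/O anywhere in that function, so generating a page full of signed URLs is pure CPU work.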
Because of the lack of fast and reliable S3 clients out there, I built https://github.com/rlmcpherson/s3gof3r, which can do over 1 Gbps with ease, with both multipart uploads and parallelized downloads. The killer feature, though, is streaming, which enables things like gof3r get -b <bucket> -k <key> | tar -x to extract tarred directories, or any other streaming application. It also provides end-to-end md5 integrity checking. For objects over a few GB, I haven't found anything matching it in speed or reliability.
From when I was looking into doing something similar, I recall s3 needing to know the content-length of upload parts up front. How do you handle that for streaming? Do you buffer in memory up to the max part size so that you can give the correct content-length header for the last part? I ask because my uses would include low-memory VMs so I'm curious about the memory overhead.
You don't need the content length of the whole upload up front for multipart uploads; see (http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadIniti...). For each part of a multipart upload that is sent, however, you do need to include the Content-Length header, so that may be what you are referring to. Parts have a minimum size, set by Amazon, of 5 MB. With s3gof3r (https://github.com/rlmcpherson/s3gof3r), the memory overhead of a streaming upload is approximately part size * concurrent uploads (plus a couple of buffers in the pool). So, for example, if you configure the part size to 20 MB and set concurrent uploads to 10, that would be about 220 MB of memory usage.
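The part-at-a-time buffering can be sketched like this (illustrative Python, not s3gof3r's actual Go code):

```python
import io

MIN_PART = 5 * 1024 * 1024  # S3's minimum part size (last part may be smaller)

def iter_parts(stream, part_size=20 * 1024 * 1024):
    # Buffer one part at a time: each part's Content-Length is just
    # len(chunk), so the total length of the stream is never needed.
    while True:
        chunk = stream.read(part_size)
        if not chunk:
            return
        yield chunk

# With N parts uploading concurrently, peak memory is roughly
# part_size * N, matching the ~220 MB figure above for 20 MB x 10.
```

Only the final part is allowed to be under the 5 MB minimum, which is why the short tail chunk at end-of-stream is fine.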
Yeah, having to buffer the stream in order to set the correct value for the content-length header on the last part of the multipart upload is what I was referring to. Your example using a 20MB part size is quite feasible. Thanks! Is it safe to use the master branch? Do you have build instructions? I haven't used go before.
We use the version in the master branch at CodeGuard to transfer many terabytes into and out of S3 daily. While I hesitate to call it "production-ready", as that means different things to different people, we do use it in production with no issues. If you want to try it without having to install Go, there are statically linked binaries for OS X and Linux amd64 linked on the GitHub page that should run on any distro without installing any dependencies. If you have any issues or questions, feel free to contact me at the email in my HN profile. I hope it works for your use case! :)
I missed your question about build instructions, sorry. I'll add them to the readme, but it's just "go get github.com/rlmcpherson/s3gof3r/gof3r" for the CLI once you have installed go. Usage instructions are in the readme already.
This is one of the reasons why Rackspace simplified pricing of Cloud Files from the start. No fees for PUT, POST, LIST, HEAD, GET, DELETE...no extra fees for Akamai CDN requests. Very simple with no hidden fees that surprise you at the end of the month.
Those fees for "operations" are there for a good reason. Otherwise, us smart techies would hack it.
I heard a talk by someone at a mega tech company that has its own internal cloud for its teams, and they "charge" each team based on usage. One team stored lots of files with 12,000-character filenames and zero contents. Since the company only "charged" for file size, that team had a tiny charge!
The problem was people implemented a file system on top of s3. From what I understand, Amazon added the charges, which are pretty nominal, to prevent people from hammering s3 as a block storage system.
If you read the source code, it calls a method called get_all_keys. Please realize this does NOT get all the keys in the bucket: it's passed the maxkeys=0 argument, which means no keys are returned and a single LIST call is made.
Yes, it is still a waste of money, but just make sure you understand that it's not actually listing your entire bucket.
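Assuming the behavior described here, the validation round-trip amounts to a single zero-key ListBucket call; sketched with a hypothetical helper:

```python
def validation_request(bucket):
    # boto's validate step boils down to one ListBucket call -- a GET
    # on the bucket with max-keys=0. Zero keys come back in the XML
    # response, but it is still billed as one LIST request.
    return "GET https://%s.s3.amazonaws.com/?max-keys=0" % bucket
```

So the waste is one billable LIST per get_bucket call, not a full enumeration of the bucket's contents.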
That's a crap default and a crap name. It should be prefetch_all_keys=False. (edit: and some documented reason WHY you would want to do such a thing)
I ran into this recently when making my own s3 sync tool, because the commonly used tool is completely broken (requires something called a 'config file' to function). But I didn't pay it too much mind, because I forgot the price discrepancy for ListBucket calls.
PS: if you want to see what boto is actually doing, turn on debug logging:
import logging
logging.basicConfig(filename="boto.log", level=logging.DEBUG)
It does not prefetch any keys (maxkeys is set to 0); it performs a query on the bucket to validate that the bucket exists, and blows up if it does not. With validate=False, you can call get_bucket and get back a bucket object even when no remote bucket exists.
It sounds like an annoying limitation of the API that (apparently?) you can't cheaply validate whether a bucket exists.
Two manual work-arounds that come to mind:
- store a list of created buckets as keys in another bucket.
- store a dummy file in each bucket you create.
Either method allows you to check the existence of the bucket with a GET request rather than a more expensive LIST request, but both are hackish. It seems like this is functionality S3 should already provide cheaply.
Thank you. For long-term, medium-scale storage it's a good choice. But beyond the hardware/disk-space part, there's also the hosting-service part, which must provide at least a stable Linux instance and static IPs, and it must be up and running 24/7.
The interesting thing is that the Boto S3 tutorial (http://docs.pythonboto.org/en/latest/s3_tut.html) uses s3.get_bucket() with impunity and never mentions you're paying 13.5x more for your usage if you use that bucket object for just one GET and don't pass validate=False. (Perhaps someone from AWS wrote that tutorial? :P) Probably deserves a mention there, though.
Well, if the consumer is the server itself, rather than the server just passing the S3 data along to a client, then I could see the benefit in the server caching retrievals locally.
There are a lot of use cases where S3 makes sense. For a small operation (one or two people), the admin time cost alone makes S3 pretty appealing. Also, the cost of a colo + server + redundant drives is a big initial spend when I have no way of knowing whether my idea/app is going to get traction. Self-hosted may be cheaper at larger scales, but for a couple of GBs, S3 is going to win every time.
True. Backblaze wouldn't be enough in a natural disaster. Interestingly, even S3 reduced redundancy claims 400 times the durability of a typical disk drive. [0]
Yeah, and the op-ex of a HDD just sitting there is like, zero, man. Totes cheaper than S3.
(You aren't just paying for raw HDD capacity when you give Amazon money to use S3. You're also paying for many of the things between that capacity and you, as well.)
I noticed that no such Issue exists, so I opened one. https://github.com/boto/boto/issues/2078