How to Save 90% on your S3 Bill (appneta.com)
368 points by trjordan on Feb 5, 2014 | 98 comments



Yikes! That's a horrible thing for a library to do to its users. This definitely should be changed in the library.

I noticed that no such Issue exists, so I opened one. https://github.com/boto/boto/issues/2078


James,

It looks like a HEAD request can fulfill the same purpose at a much lower cost (this is mentioned in a comment by kislyuk on your Issue):

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketHEA...

Shall I make the change and pull request it? I didn't want to duplicate effort.
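
For reference, here's roughly what a HEAD-based existence check could look like using boto's low-level make_request; this is just a sketch of the approach, not the API that would land in boto, and the bucket name is made up:

    import boto

    conn = boto.connect_s3()

    def bucket_exists(name):
        # HEAD on the bucket is billed at the cheap GET/"other request" rate,
        # unlike the LIST request that validate=True currently triggers.
        # (A 403 would mean the bucket exists but you lack access to it.)
        response = conn.make_request('HEAD', name)
        return response.status == 200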


Aye, this does look like the right way to handle this functionality. I'm not planning on doing a PR so have at it.


Working on it now.


Don't you have too much to do?


I took the nickname because I always take on more than I have time for, yet find the time to complete everything that I take on.


No offense intended, I just enjoyed the relationship between your nick and your jumping on a new task. I certainly didn't mean to discourage you!


Not discouraged at all! :)


We should all be happy that amongst those things he considers "too much to do" is contributing to open source projects.


Agreed!


I kind of agree with garnaat's reply: if they just suddenly change the default from True to False, they're going to break backwards compatibility for anyone using the library, and worse still, in a really subtle way!

All they can really do at this stage is add a warning to the documentation and hope that new people using the library figure out the significance.


This is what major version numbers are for.


Or issue an actual runtime warning: http://docs.python.org/2/library/warnings.html
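
For example, something like this inside get_bucket (a sketch only; the warning text and placement are hypothetical):

    import warnings

    def get_bucket(self, bucket_name, validate=True, headers=None):
        if validate:
            # Hypothetical warning text; not something boto actually emits.
            warnings.warn(
                "get_bucket() with validate=True issues a billable LIST "
                "request; pass validate=False to skip it",
                UserWarning, stacklevel=2)
        # ... existing lookup logic unchanged ...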


That's why you should avoid using default arguments in library functions.


Great point! I have never heard that, but it makes perfect sense. Hopefully I will remember in the future.


I don't agree, you should just know what they are. Defaults are helpful.


Until they change, and your code breaks, and you have to go figure out why. The point of not (silently) using defaults is that as long as you're explicit with your arguments, you won't get surprised by changing defaults. Of course, you might get surprised by any number of other changes, but you can at least reduce your failure surface area a little bit :-)
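
Concretely, assuming boto's current signature (the bucket name is made up):

    import boto

    conn = boto.connect_s3()

    # Explicit: behavior stays the same even if the library's default flips.
    bucket = conn.get_bucket('my-bucket', validate=False)

    # Implicit: today this means validate=True, and the call is billed as a
    # LIST request.
    bucket = conn.get_bucket('my-bucket')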


We have been using this in production at scale for over 2 years and just cross-checked that we have this enabled.

Totally agree that this should default to False!


I found this exact same issue a few years ago after determining I had a reasonably-long-existing $75/day "leak" in my S3 expenses :/. (Seriously: $75/day, a number I am not exaggerating at all.)

> 09:00:04 < saurik> dls: I am performing millions of ListBucket request to this one amazon bucket every day

> 09:00:25 < dls> d'oh

> 09:00:28 < saurik> adding up to almost 1.5 gigabytes of data traffic in/out on just those requests

> 09:00:41 < saurik> I have NO CLUE what could POSSIBLY doing even a SINGLE ListBucket request on that bucket

> 09:00:49 < dls> LOL


Even the infamous saurik got hit by this. *closes jaw*


Also see MimicDB [0]. Runs a transparent cache that responds to most S3 API calls locally using Redis to store metadata. Besides the cost savings, it's extremely fast. Listing an entire bucket with millions of objects is close to instantaneous.

[0] http://mimicdb.com/
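
The general idea, in a very stripped-down form (this is not MimicDB's actual API, just an illustration of caching bucket metadata in Redis in front of boto):

    import boto
    import redis

    r = redis.StrictRedis()
    conn = boto.connect_s3()

    def list_keys(bucket_name):
        # Serve the listing from Redis when we have it; otherwise do one
        # real (paginated) LIST pass and cache the key names.
        cache_key = 'bucket:%s:keys' % bucket_name
        cached = r.smembers(cache_key)
        if cached:
            return sorted(k.decode('utf-8') for k in cached)
        bucket = conn.get_bucket(bucket_name, validate=False)
        names = [k.name for k in bucket.list()]
        if names:
            r.sadd(cache_key, *names)
        return names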


It also adds another piece of infrastructure that needs to be maintained and can go down. Not necessarily the best option for everyone.


While the caching would be more effective with one large centralized instance, I think the intended use case is to have one cache per server. So then it's not really extra infrastructure.


Actually, you can do it either way since it's backed by Redis. You can set all servers to connect to the same Redis instance, or run them all individually.


Do you use Redis clustering in this case (otherwise how would the cache stay consistent)?


What if Redis goes down?


"What if X goes down?" is my new favorite straw man on HN.

Regardless, it's just a caching layer, and requests are passed through to the API in that case.


There's always a single point of failure.


Is there a SaaS version of MimicDB? I'm ready to sign up right now.


No, and it wouldn't help much if there was. The biggest benefits come from running the metadata cache locally, or at least on the same local network. The cost/time savings come from preventing external API calls.


If I have a bucket with 2 million keys, it takes a long time and 2000 requests to get the directory listing. With MimicDB it would only be one fast request.


That's true if it were local. A 2-million-key response from an externally hosted MimicDB instance would probably have to be split into 2000 requests and take the same amount of time.

Send me an email nathan@nathancahill.com and I can show you how to set up a local cache. It's quite simple.


> One utility in general that’s provided us with an easy way to slice up and investigate our AWS spending is the awesome Asgard.

The tool they're actually referring to is Ice (also by Netflix) [1]. Asgard is another Netflix tool, for managing AWS deployments and auto-scaling.

[1] https://github.com/Netflix/ice


Now mentioned in the article.


I'm the article author, and yes, we're using or evaluating both services. I find myself swapping the two all the time in conversation. Pretty silly typo on my part :)


I really hated working with boto for s3. The abstractions they chose are really ambiguous and it feels like it's fighting against the underlying api. If you're building anything on top of s3, it might be easier to write a thin S3 REST client and use it directly instead of going through Boto. So many fewer surprises, and no more digging around in the boto source trying to figure out what so and so function actually does.


I would look at tinys3: https://github.com/smore-inc/tinys3 It was motivated by exactly the same reasons.

(disclaimer: I used to work at Smore, and I'm friends with the author, but I've been burned by Boto myself)


Interestingly enough, here's the original commit that defaulted `validate=True`:

https://github.com/boto/boto/commit/95939debc3813468264159d5...

EDIT: Looks like the original committer is an Amazon employee.


The commit added another parameter so it is possible to skip the expensive get_all_keys call.

It defaults to True so that old code that called get_bucket() won't break. (Old code would have called get_bucket with only one parameter.)

It was always true before that commit.

Therefore, in essence, an Amazon employee made it possible to save 90% of your S3 bills.


Before you jump on the conspiracy train, most Boto contributors are Amazon employees.


Agreed, I doubt Amazon had any evil intentions with this change.

I wonder more if this was a product of Amazon developers using S3 (i.e. dogfooding) and not noticing the cost side effect because I'm assuming they don't get billed?


The change is the opposite of what you're implying, it adds the option to turn it off. Before this commit it was on by default. My guess is, when the option was added, it defaulted to validate=True to maintain backward compatibility.


We actually do get billed, but we don't have to pay. I always check my bill to make sure that I am not using any resources that I don't need.

I also pay for my own personal EC2 instance and about 350 GB of S3 storage. Being a genuine user and customer of AWS helps me to be a better employee.


Jeff,

You may want to check LIST request statistics over the next few weeks. Between this thread, the Issue for boto, etc., I wonder whether you'll see a noticeable decline in LIST requests from the attention this has brought. I'm just curious from a data standpoint.


Ahh cool! I completely agree that using your own products from a customer's point of view makes you a better employee. Thanks for the input =)


Mitch just started working for AWS ~1.5 years ago or so I think? That code was written long before he worked for Amazon.


Yeah, don't hate on 'em; the Boto folks are pretty helpful.


So the committer added an option to avoid the expensive behavior and that's a sign of evil?


> interestingly enough, here's the original commit that defaulted `validate=True`:

It's a commit which specifically added validate to allow skipping validation. If you read the diff, the call originally unconditionally performed the validation call.


That's a change by the original author of boto. I don't think he was an Amazon employee that far back...


This is a bit easier to read if you ignore whitespace (add ?ws=1 to the end):

https://github.com/boto/boto/commit/95939debc3813468264159d5...


This has been in the boto library for years, across multiple services. I noticed it when first using SimpleDB: switching it on in a production environment was much more expensive than we had originally calculated. I noticed the bizarre "Domain Validate" calls after poring through logs of all boto activity.

The boto guys have justified it in the past: https://groups.google.com/forum/#!topic/boto-users/1DVfbo4CD...

I still don't agree with their reasoning for leaving it on by default: once you try to do something on a non-existent domain/bucket it will throw an error anyway, and I would argue the "extra work" is much cheaper than leaving these defaults on, which I expect are completely redundant for most users.


Good news for django-storages users – this is off by default:

https://bitbucket.org/david/django-storages/src/cb7366693ce1...


To further add: the PyPI version hasn't been updated since March 2013, but the last time the relevant lines were changed was January 2013, so you should still be good with the PyPI version (assuming changes from then went out in the last release).

The setting in question is: AWS_AUTO_CREATE_BUCKET.
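
For anyone checking their settings, the relevant bits look roughly like this (values are illustrative, not a complete configuration):

    # settings.py
    DEFAULT_FILE_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
    AWS_STORAGE_BUCKET_NAME = 'my-bucket'
    AWS_AUTO_CREATE_BUCKET = False  # the default; get_bucket() is then
                                    # called with validate=False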


As far as I can see, that means this code is broken: if no validation is done it won't raise S3ResponseError (because there's no S3 query involved), and will never raise `ImproperlyConfigured`. And if you ask for your buckets to be auto-created, it will silently do one more request per (existing) bucket, on top of the auto-creation one.

And the conditional inside the except is useless, of course, since you can only be in the except if `auto_create_bucket` is set.


I discovered the 'validate' argument just recently. My concern was not cost, it was latency and response times. My server generates an HTML page with signed URLs to resources on S3. The 'get_bucket' call adds a bit of latency as it contacts S3, and I was wondering: does it really need anything from S3 to generate signed URLs if I already know the exact key names and am pretty sure the bucket exists? Well, it does not, and adding "validate=False" sped things up noticeably.
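
For anyone wanting to do the same, a minimal sketch (bucket and key names are made up):

    import boto

    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-bucket', validate=False)  # no S3 round trip
    key = bucket.new_key('path/to/object')                 # no request either
    url = key.generate_url(expires_in=3600)                # signed locally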


Because of the lack of fast and reliable S3 clients out there, I built https://github.com/rlmcpherson/s3gof3r, which can do over 1 Gbps with ease for both multipart uploads and parallelized downloads. The killer feature, though, is streaming, which enables things like gof3r get -b <bucket> -k <key> | tar -x to extract tarred directories, or any other streaming application. It also provides end-to-end md5 integrity checking. For objects over a few GB, I haven't found anything matching it in speed or reliability.


From when I was looking into doing something similar, I recall s3 needing to know the content-length of upload parts up front. How do you handle that for streaming? Do you buffer in memory up to the max part size so that you can give the correct content-length header for the last part? I ask because my uses would include low-memory VMs so I'm curious about the memory overhead.


You don't need the content length of what you are uploading for multipart uploads (see http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadIniti...). For each part of a multipart upload that is sent, however, you do need to include the content-length header, so that may be what you are referring to. Parts have a minimum size set by Amazon of 5 MB. With https://github.com/rlmcpherson/s3gof3r, the memory overhead of a streaming upload is approximately part size * concurrent uploads (plus a couple of buffers in the pool). So, for example, if you configure the part size to 20 MB and set concurrent uploads to 10, that would be about 220 MB of memory usage.
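
A stripped-down boto sketch of the same idea, reading from stdin so the total size is never known up front (bucket/key names and part size are illustrative):

    import sys
    from io import BytesIO
    import boto

    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-bucket', validate=False)
    source = getattr(sys.stdin, 'buffer', sys.stdin)  # byte stream from a pipe

    PART_SIZE = 5 * 1024 * 1024  # S3's minimum part size
    mp = bucket.initiate_multipart_upload('streamed-object')
    part_num = 0
    while True:
        chunk = source.read(PART_SIZE)  # buffer one part at a time
        if not chunk:
            break
        part_num += 1
        # Each part carries its own Content-Length; the total object size
        # never has to be known in advance.
        mp.upload_part_from_file(BytesIO(chunk), part_num)
    mp.complete_upload()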


Yeah, having to buffer the stream in order to set the correct value for the content-length header on the last part of the multipart upload is what I was referring to. Your example using a 20MB part size is quite feasible. Thanks! Is it safe to use the master branch? Do you have build instructions? I haven't used go before.


We use the version in the master branch at CodeGuard to transfer many terabytes into and out of S3 daily. While I hesitate to call it "production-ready", as that means different things to different people, we do use it in production with no issues. If you want to try it without having to install Go, there are statically linked binaries for OS X and Linux amd64 linked on the GitHub page that should run on any distro without having to install any dependencies. If you have any issues or questions, feel free to contact me at the email in my HN profile. I hope it works for your use case! :)


I missed your question about build instructions, sorry. I'll add them to the readme, but it's just "go get github.com/rlmcpherson/s3gof3r/gof3r" for the CLI once you have installed go. Usage instructions are in the readme already.


This is one of the reasons why Rackspace simplified pricing of Cloud Files from the start. No fees for PUT, POST, LIST, HEAD, GET, DELETE...no extra fees for Akamai CDN requests. Very simple with no hidden fees that surprise you at the end of the month.


Those fees for "operations" are there for a good reason. Otherwise, us smart techies would hack it.

I heard a talk by someone at a mega tech company that has its own internal cloud for its teams, and they "charge" each team based on usage. One team stored lots of files with 12,000-character filenames and zero content. Since the company only "charged" for file size, that team had a tiny charge!


>> No fees for PUT, POST, LIST, HEAD, GET, DELETE...no extra fees for Akamai CDN requests

or you can say the fee is bundled, i.e.

(Rackspace vs. S3 us-east-1)

Storage: $0.105/GB/mo vs. $0.085/GB/mo

Bandwidth out: $0.20/GB vs. $0.12/GB


The problem was people implemented a file system on top of s3. From what I understand, Amazon added the charges, which are pretty nominal, to prevent people from hammering s3 as a block storage system.


What are the hidden fees in the S3 use case?


If you read the source code, it calls a method called get_all_keys. Please realize this does NOT get all the keys in the bucket: it is passed maxkeys=0, which means no keys are returned and a single LIST call is made.

Yes, it is still a waste of money, but just make sure you understand that it's not actually listing your entire bucket.
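
In other words (a sketch; the bucket name is made up):

    import boto

    conn = boto.connect_s3()

    # With the default validate=True, get_bucket() internally does roughly
    #     bucket.get_all_keys(maxkeys=0)   # "GET /?max-keys=0", billed as LIST
    # which returns zero keys but still costs a LIST request.
    bucket = conn.get_bucket('my-bucket')

    # Skipping validation avoids that request entirely:
    bucket = conn.get_bucket('my-bucket', validate=False)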


That's a crap default and a crap name. It should be prefetch_all_keys=False. (edit: and some documented reason WHY you would want to do such a thing)

I ran into this recently when making my own s3 sync tool, because the commonly used tool is completely broken (requires something called a 'config file' to function). But I didn't pay it too much mind, because I forgot the price discrepancy for ListBucket calls.

PS: if you want to see what boto is doing, do this: logging.basicConfig(filename="boto.log", level=logging.DEBUG)
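
Slightly expanded (the log file name is arbitrary):

    import logging
    import boto

    # Log every request boto issues; the hidden validation call shows up in
    # boto.log as a GET on the bucket with max-keys=0.
    logging.basicConfig(filename="boto.log", level=logging.DEBUG)

    conn = boto.connect_s3()
    conn.get_bucket('my-bucket')  # check the log for the extra LIST request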


> It should be prefetch_all_keys=False.

It does not prefetch any keys (maxkeys is set to 0); it performs a query on the bucket to validate that the bucket exists and blows up if it does not. With validate=False, you can call get_bucket and get back a bucket object even when no remote bucket exists.


It sounds like an annoying limitation of the API that (apparently?) you can't cheaply validate whether a bucket exists.

Two manual work-arounds that come to mind:

- store a list of created buckets as keys in another bucket.

- store a dummy file in each bucket you create.

Either method allows you to check the existence of the bucket with a GET request rather than a more expensive LIST request, but both are hackish. It seems like this is functionality S3 should already provide cheaply.
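
A sketch of the second workaround (the marker object name is made up):

    import boto

    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-bucket', validate=False)

    # HEAD on a single key is billed at the cheap GET/"other request" rate,
    # not as a LIST. We can't tell a missing bucket from a missing marker,
    # which is exactly why this is hackish.
    if bucket.get_key('.bucket-marker') is None:
        raise RuntimeError('bucket (or its marker object) is missing')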


> It sounds like an annoying limitation of the API that (apparently?) you can't cheaply validate whether a bucket exists.

It looks like there now is a way: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketHEA...

I'm guessing (hoping?) that didn't exist back when the feature was added to Boto, 7 years ago: https://github.com/boto/boto/commit/8410c365ee0120e073bf00bd...


How to save 95%: don't use AWS long-term, buy dedicated hardware. http://blog.backblaze.com/2013/02/20/180tb-of-good-vibration...


Thank you. For a long-term, medium-sized data center, it's a good choice. But beyond the hardware disk space, there's the hosting service part, which has to provide at least a stable Linux instance and static IPs, and it must be up and running 24/7.


Idea: a tool which gives you improvement advice on your code to save money on your AWS bills.


The interesting thing is that the Boto S3 tutorial (http://docs.pythonboto.org/en/latest/s3_tut.html) uses s3.get_bucket() with impunity and never mentions you're paying 13.5x more for your usage if you use that bucket object for just one GET and don't pass validate=False. (Perhaps someone from AWS wrote that tutorial? :P) Probably deserves a mention there, though.


We ended up writing an S3 library out of our frustration with Boto's slowness, and it has a nicer, requests-inspired interface too:

https://github.com/smore-inc/tinys3


Is there something similar to this that exists that also caches s3 objects locally?


Check out MimicDB [0]

[0] http://mimicdb.com/


It stores everything but the object


Yes. If you're storing the objects on your server, why would you use S3?


Caching. We request the same objects quite frequently and bandwidth is killing us.


Well, if the consumer is the server itself, rather than the server then serving the S3 data on to a client, then I could see the benefit of the server caching retrievals locally.


S3 for persistence, the server for caching? Perhaps you have multiple servers that carry no state?


Nginx can be used as a caching proxy:

https://coderwall.com/p/rlguog


Or you can move from S3 to some cheap servers with a lot of data bandwidth.


There are a lot of use cases where s3 makes sense. For a small operation (one or two people) the admin time cost alone makes s3 pretty appealing. Also, the cost of a colo + server + redundant drives is a big initial spend when I have no way to know if my idea/app is going to get traction. Self-hosted may be cheaper for larger scales, but for a couple GBs s3 is going to win every time.


[deleted]


Why are you comparing S3 costs to "just" HDDs? HDDs aren't web-enabled and require other hardware to run and keep running.


A more accurate comparison would be a BackBlaze Storage Pod [0]

[0] http://blog.backblaze.com/2013/02/20/180tb-of-good-vibration...


CapEx + OpEx together average about 3x the retail cost of the hardware for a medium-sized web shop, with labor and electricity being the biggest costs in each, respectively.


That's not enough. Don't forget that Amazon automatically duplicates your data in multiple physical locations.


True. Backblaze wouldn't be enough in a natural disaster. Interestingly, even S3's reduced redundancy storage claims 400 times the durability of a typical disk drive. [0]

[0] http://aws.amazon.com/s3/faqs/#rrs_anchor


Yeah, and the op-ex of a HDD just sitting there is like, zero, man. Totes cheaper than S3.

(You aren't just paying for raw HDD capacity when you give Amazon money to use S3. You're also paying for many of the things between that capacity and you, as well.)


That's a good catch! Thanks, man :)


You only need to worry about this if you're doing 100k+ LIST requests. For me, turning off validate would save 0.2%.


I wonder how much savings we're talking about. I always worry when engineers start talking about saving money.


Depends on your scale. Another commenter on this thread was in the +$75/day range from this issue.

I don't worry about cultivating a connection between engineering and business needs.


Are you sure this counts as a LIST request? Technically the method of the list bucket API call is GET.



