What Powers Instagram: Hundreds of Instances, Dozens of Technologies

jphackworth · on Dec 3, 2011

I'm a little surprised so many machines are used to run Instagram. TechCrunch mentioned their peak has been 50 photo uploads per second (which they say go directly to S3, so Instagram's servers only need to pass a token). Of course there are other forms of requests, but just back of the envelope it seems like it should not require anywhere near "hundreds" of machines.

Not to be too harsh - it's just three engineers, so it makes sense if the setup is still evolving.

rdouble · on Dec 3, 2011

I was surprised they had so few... I once worked on a site with 1/6th the users and 3.5 times the number of instances.

They could do better but they'd have to manage their own datacenter and write portions of the app in C++. It's probably not worth it at this point unless they hire someone with that specific expertise.

jphackworth · on Dec 3, 2011

I once worked on a site with 1/6th the users and only one machine. ;-) Counting users often doesn't match across sites, especially when an Instagram user is someone who has downloaded the app, and might never come back. That's why the 50 photo uploads per second peak is a useful benchmark.

mikeyk · on Dec 3, 2011

Hey, author here. To clarify, our uploads go to our servers first (where we resize for thumbnails, etc) then go to S3.

WALoeIII · on Dec 3, 2011

Do you use GraphicsMagick or ImageMagick? Shell out or python bindings?

What settings do you use for: MAGICK_MEMORY_LIMIT MAGICK_MAP_LIMIT MAGICK_DISK_LIMIT

Do you tune them dynamically or just set and forget?

scottostler · on Dec 3, 2011

I'm curious about this as well.

palish · on Dec 3, 2011

Offtopic: Would you consider letting a hacker work for you remotely?

armandososa · on Dec 3, 2011

I like posts like this a lot. I'm just a web designer, but I find scaling web sites fascinating, like some kind of dark art or secret craft.

Where do you learn this stuff? Do you need a CS degree from Stanford or something? I like the black magic aura, it's romantic, but I'd really like to understand how to scale websites doing stuff like the OP describes.

_juof · on Dec 3, 2011

http://highscalability.com/

jules · on Dec 3, 2011

While scaling web sites is fascinating, for most people it is also unnecessary. The vast majority of web sites run just fine on 1 computer. For example, Hacker News runs not just on one computer, but actually on 1 core. So with a single 8-core box with ~100GB of RAM you can get quite far and save yourself a big hassle.

tkahn6 · on Dec 3, 2011

I don't get the impression that one would need a degree to devise the scaling strategies they're employing. This would seem more the product of battle-hardened experience rather than a formal education.

d_r · on Dec 3, 2011

I know that this is probably a recruiting-inspired post, but detailed posts like this genuinely benefit the community. Thanks for specifically mentioning the reasons for choosing particular technologies (i.e. why you switched to Gunicorn from mod_wsgi) -- this makes the already excellent post even more helpful for someone trying to build things.

latchkey · on Dec 3, 2011

I guess my question is, how do they make money? I really like Instagram images. I've used the site myself, but it certainly isn't something I'd feel the need to pay money for.

gallerytungsten · on Dec 3, 2011

Funding Total: $7.5M (per TechCrunch)

Server bill: $35k/month, $420k/year, per estimates in other comments.

Personnel, overhead, other expenses: $1.5M/year (guess).

Runway: 3.9 years to figure it out.
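The runway figure above can be checked with a few lines of arithmetic. This is only a sketch that reuses the numbers quoted in the comment (the funding total, the estimated server bill, and the guessed overhead); the variable names are made up for illustration, and none of the inputs are confirmed figures.

```python
# Back-of-the-envelope runway, using only the figures quoted in the comment.
funding_total = 7_500_000          # USD, funding total as quoted (per TechCrunch)
server_cost_per_year = 35_000 * 12  # USD, $35k/month estimate from other comments
other_costs_per_year = 1_500_000   # USD, personnel and overhead (the commenter's guess)

annual_burn = server_cost_per_year + other_costs_per_year
runway_years = funding_total / annual_burn

print(f"Annual burn: ${annual_burn:,}")
print(f"Runway: {runway_years:.1f} years")
```

With these inputs the annual burn is $1.92M, which is where the roughly 3.9 years comes from. Any revenue, or a different overhead guess, changes the answer directly.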

Ecio78 · on Dec 3, 2011

Dumb question: is there a way to pay AWS fees other than by credit card? I was wondering how you can build big infrastructures on Amazon if you can't pay by wire transfer or some other kind of link to a bank account (like phone and gas bills).

camwest · on Dec 3, 2011

American Express has no limit as long as you pay it back ASAP.

hboon · on Dec 4, 2011

Generally yes, but a country's central bank may prohibit that. Singapore's, for example, does.

Ecio78 · on Dec 3, 2011

I don't know if I'd be comfortable having or using a credit card with no limit... :)

ell · on Dec 3, 2011

It won't be difficult to make money if they don't try to be clever. They can display ads on their website, just like Twitpic. They can have a storage limit.

latchkey · on Dec 3, 2011

Those Quadruple Extra Large instances are $2/hr. The 24 of them used for Postgres would be about $35k/month for that part alone. I'm guessing they are spending >$100k/month on just hosting 100+ instances. Not to mention disk, bandwidth, DNS, S3, public IPs, etc.
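For anyone who wants to redo this estimate, here is a minimal sketch of the calculation. It assumes a 30-day month and the $2/hr on-demand rate quoted above; `monthly_cost` is a hypothetical helper, not anything from AWS tooling, and it ignores storage, bandwidth, and any negotiated discount.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month


def monthly_cost(instances: int, hourly_rate: float, hours: int = HOURS_PER_MONTH) -> float:
    """Cost of running `instances` machines around the clock for one month."""
    return instances * hourly_rate * hours


# 24 Quadruple Extra Large instances at the quoted $2/hr on-demand rate.
postgres_monthly = monthly_cost(instances=24, hourly_rate=2.00)
print(f"Postgres tier: ${postgres_monthly:,.0f}/month")
```

That works out to $34,560, which rounds to the $35k/month mentioned in the comment.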

tptacek · on Dec 3, 2011

At ~35k/mo (they may have a deal here, though), that's the fully loaded headcount of 2-3 FTE devops people. In return, they get EC2's turnaround time on new instances. Not to mention that they're constantly pushing images to S3.

I would agree that EC2 isn't a no-brainer decision here, but it seems like a reasonable one.

foobarbazetc · on Dec 3, 2011

Every time I see numbers like this, I wonder why everyone seems to think you have to use AWS or else you've failed at scaling.

They could run their operation for 10-20% of their AWS costs at a dedicated server host. And everything would be much, much faster.

notJim · on Dec 3, 2011

I noticed this, too. This statement stood out to me in particular:

> Our main shard cluster involves 12 Quadruple Extra-Large memory instances… We’ve found that Amazon’s network disk system (EBS) doesn’t support enough disk seeks per second, so having all of our working set in memory is extremely important.

ww520 · on Dec 3, 2011

Using AWS is not just for its instances. S3 is a big factor. It's hard to replicate the S3 functionality in your own hosting without much more effort and cost. Granted that the AWS instances can be used more efficiently.

Ecio78 · on Dec 3, 2011

Can't you just upload to S3 from your own dedicated machines? Or does it add too much delay to operations? The author posted that images are first uploaded to their system, resized and so on, and then uploaded to S3, so at least for image upload it shouldn't be such a great problem.

Disclaimer: I have no smartphone and have never used their app :)

ConstantineXVI · on Dec 3, 2011

Besides latency, you don't pay for internal data transfer within AWS services. If you did the image processing on your own machines, you'd be paying for bandwidth on every operation, whereas if you do it in EC2, your only outbound transfer is viewing the images.

jaequery · on Dec 3, 2011

I use AWS/cloud like I use spare tires: only in emergencies. Why? I don't care much about the price. It's that the performance gain going from AWS network I/O to directly attached SSD/SAS I/O is almost night and day.

rkalla · on Dec 3, 2011

Instagram isn't paying on-demand prices, 3yr reserved is 48% cheaper than on-demand.

apu · on Dec 2, 2011

Is there a collection of this kind of blog post somewhere? I.e., for comparing the stacks of different sites?

cadr · on Dec 2, 2011

The site highscalability.com has some good descriptions (look under the 'REAL LIFE ARCHITECTURES' topic).

SkyMarshal · on Dec 3, 2011

Not quite what you asked for, but related: StackParts.

http://stackparts.com/

http://news.ycombinator.com/item?id=2993371

geuis · on Dec 3, 2011

One thing about Instagram's load balancing that I don't like is that they rate-limit their proxies on image requests. In my recent testing, it's roughly 5-6 requests every 3 seconds or so. Any requests more frequent than that return 503 status codes. I don't entirely understand why they do this, since their load balancer simply does 302 redirects to the S3-hosted image resource.

I can guess at some of the reasons, such as they didn't foresee a user loading more than a few images at once. Perhaps they perceive rate limiting as a protective measure.

However, I've done testing on Twitpic, imgur, and yfrog and haven't run into the same issues. Twitpic, for example, generates a lot more traffic than Instagram and they don't have the same rate-limiting.
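To make the behavior described above concrete, here is a rough sketch of a per-client sliding-window limiter that answers with a 302 while the client is under the limit and a 503 once it is over. How Instagram actually implements this is not known from the thread; the class name, the 6-requests-per-3-seconds defaults (taken from the observation above), and the injectable clock are all assumptions made for illustration.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` per client (hypothetical)."""

    def __init__(self, max_requests=6, window_seconds=3.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the example can use a fake clock
        self.hits = {}      # client id -> deque of accepted request timestamps

    def status_for(self, client_id):
        """Return 302 (redirect to the image) or 503 (over the limit)."""
        now = self.clock()
        window = self.hits.setdefault(client_id, deque())
        # Forget requests that have aged out of the window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            return 503
        window.append(now)
        return 302


# Usage with a fake clock: eight requests at t=0, then one more at t=3.
fake_now = [0.0]
limiter = SlidingWindowLimiter(clock=lambda: fake_now[0])
statuses = [limiter.status_for("client-a") for _ in range(8)]
print(statuses)  # six 302s followed by two 503s

fake_now[0] = 3.0
after_wait = limiter.status_for("client-a")
print(after_wait)  # 302 again once the window has passed
```

A limiter like this is cheap to run in front of the redirect, which would fit the guess that it is meant as a protective measure rather than a capacity limit.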

ceejayoz · on Dec 3, 2011

> I don't entirely understand why they do this, since their load balancer simply does 302 redirects to the S3-hosted image resource.

S3 accesses cost money, so it makes sense that they'd rate limit access to them. A botnet hitting an S3 URL could incur large fees for the owner of the file very rapidly.

mkjones · on Dec 3, 2011

Glad to see other people using vmtouch. It's also great for keeping large codebases in the filesystem cache on [shared] dev machines.

cagenut · on Dec 3, 2011

With that big a monthly AWS bill, I could pretty easily justify my salary and the costs of building out a 4 - 10 rack colo setup. With room leftover for a dba consultant on retainer and a pro-serv budget for ad-hoc stuff.

sant0sk1 · on Dec 3, 2011

That's a lot of instances! It'd be interesting to run the numbers and get an idea of what their monthly AWS bill looks like.

clarkni5 · on Dec 3, 2011

By my math, the bill for their app and database servers would be approaching $30,000 per month. That doesn't include storage costs, bandwidth, or any of the other aspects of their infrastructure.

That's crazy, if you ask me.

simonw · on Dec 3, 2011

Is that calculation taking reserved instances into account?

rkalla · on Dec 3, 2011

No, I don't think so. Latchkey did the same calculation, using on-demand prices and came up with $35k[1]

3yr reserved is roughly 48% cheaper than on-demand, so real hosting cost would be around $18,200 for those servers.

[1] http://news.ycombinator.com/item?id=3306394
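The adjustment in this comment is a single multiplication. A small sketch, assuming the $35k/month on-demand estimate and the roughly 48% reserved discount quoted above (both are commenters' figures, not confirmed pricing, and upfront reservation fees are treated as already folded into the discount):

```python
on_demand_monthly = 35_000  # USD/month, on-demand estimate from the linked comment
reserved_discount_pct = 48  # percent cheaper than on-demand, as quoted

# Integer percentages keep the arithmetic exact.
reserved_monthly = on_demand_monthly * (100 - reserved_discount_pct) / 100
print(f"Reserved estimate: ${reserved_monthly:,.0f}/month")
```

This reproduces the $18,200 figure for those servers.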

mcginleyr1 · on Dec 3, 2011

For their load balancers, why aren't they assigning Elastic IPs? Then they wouldn't have to wait for DNS, they could just reassign the IP...

vidar · on Dec 3, 2011

What was your take on Gunicorn over uWSGI?