Stack Overflow running short on space (nickcraver.com)
96 points by gavreh on Feb 7, 2012 | 81 comments


The key takeaway for me was this:

"...the server goes to 288GB (16GB x 18 DIMMs)…so why not? For less than $3,000 we can take this server from 6x16GB to 18x16GB and just not worry about memory for the life of the server. This also has the advantage of balancing all 3 memory channels on both processors, but that’s secondary. Do we feel silly putting that much memory in a single server? Yes, we do…but it’s so cheap compared to say a single SQL Server license that it seems silly not to do it."

Clearly, if you are paying tens of thousands of dollars for a database server license, it makes sense to fully utilize the ones you've purchased.

Also, in my experience, databases these days pretty much have to stay in memory in order to perform at all. I think the rule I heard from a Facebook DevOps manager who was interviewing with us was "If you touch it once a day, it goes into SSDs; if you touch it once an hour, it goes into memory." Of course, at a certain size, you _have_ to scale horizontally with those in-memory databases as well.


> Clearly, if you are paying tens of thousands of dollars for a database server license, it makes sense to fully utilize the ones you've purchased.

I suspect this is the reason they're using a single server. It would cost a few more tens of thousands of dollars to install another one, versus the zero overhead of just upgrading the existing one.

As an OSS user, I often take for granted how easily I can go "Oh, the database reads are too slow? Let's throw another Postgres slave at it and see how that goes." After all, that expense is usually minimal.


Why don't they just change to an open-source RDBMS?


There is nothing 'just' about moving a non-trivial database from one RDBMS to another. First, there is no doubt a huge amount of incompatibility between the dialects of SQL the two speak. Then there are all the platform-specific features that the other database either lacks or implements completely differently. On top of that, once you get everything up and running, you'll notice that the performance characteristics of the two are very different, which means you're going to have to redo much of your optimization work just to get back close to the level of performance you had before the switch.


For what it's worth, the word "just", as used above, almost always indicates that the speaker doesn't understand the problem.


I can't count the number of times I've heard, "I just need a programmer!"


That was the point, to illustrate lock-in effects of proprietary software.


Probably because Joel Spolsky is a Microsoft guy. He used to work there and was apparently quite influential before he left.


They're fairly tightly integrated into the MS stack and it'd be an incredible undertaking to migrate to anything else.


You would think MS would cut them a super deal on licensing, given that S.O. is such a major site using and promoting their platform (and its demographics match who they'd want to promote to). If S.O. goes down or has problems due to a SQL Server issue like this (or because they can't afford another license), that looks pretty bad.


I believe Joel said that software licenses were a rounding error in the grand scheme of things.


I'm inclined to agree. If you have to spend any time at all figuring out how to fit the cost of a SQL Server license into your business plan, that's the entrepreneurial equivalent of a code smell.


Not sure MS cares enough about SQL Server. They would probably throw money at them to move to Azure, though.


The scale of the problem does not seem all that difficult to overcome (as indicated by the author). What I am interested / pleased to read is that a fairly popular, high traffic site is backed by a "plain old" RDBMS. MSSQL even.


SQL Server is actually fairly well regarded as a platform these days. It's a "grown up" database; not quite as trusted for very large sites as Oracle, DB2, Sybase, Teradata etc; but definitely capable of solid "midrange" performance with the usual Microsoft abundance of features and tool support.

I think that what's happening here, really, is the emergence of SSDs as a server platform standard. SSDs drastically revise the algo-economics of data storage.

From the mid-90s until now, demand for structured data storage has surged, apparently exponentially (though realistically, such a function would eventually be sigmoidal). So dramatic has this surge been that it overwhelmed the somewhat linear rate of I/O performance improvement for disk drives. Meanwhile, Moore's Law has more or less held and so in-memory architectures have become much more popular, starting with memcache.

But SSDs change the equation, because they too can track Moore's Law. So they can take mature disk-bound technologies like RDBMSes and breathe new life into them. And that's exactly what's happening.

At the turn of the millennium, if you were given StackOverflow levels of traffic and everything from 2011 technology except for SSDs, you'd have to spend dozens of times as much to approach the same level of performance.


Do you know of any high traffic sites that aren't?


Stack Overflow does use Redis, though last time I checked they said it was only for caching.


After all of their evangelism about Windows servers, it is rich to see this: "but it’s so cheap compared to say a single SQL Server license that it seems silly not to do it."

Yes, that is the problem isn't it?


I don't see what's "rich" about that at all.

The Win stack is a trivial cost in their business compared to the revenue potential of StackExchange.

The biggest cost in their business is labor, not licenses. That will always be the case. License costs will continue to shrink toward irrelevant as a percentage of their sales over time.

Given their wild success, there's no great argument to be made against what they've done: quite simply, it has worked, and worked very well. That is all that will ever matter, regardless of what argument is thrown around.


If license costs are so big that they become this important a factor in deciding how you will scale, they have become an operational problem of some size. Probably not a size which is imminently threatening to Stack Exchange. But it is not nothing, or it would not figure so prominently in this shop talk about how to upgrade.

It's certainly a factor worth weighing when you consider what technology you'll build on top of.

In no place have I "argued against what they've done" or implied that the company is a failure. Yet money IS a part of scaling, and managing expenses IS a part of business. I'm happy to take your word that Stack Exchange makes so much money that it cannot ever matter how much their licenses cost. But in the context of evangelism for others to make the same decision, I have to observe that the kind of reasoning casually mentioned in the article implies something that would be negative for many other businesses.

If you are offended by that sort of discussion of reality, then you have too thin a skin (or a conflict of interests).


I was surprised at how small their database is!

We (Defensio) store about 300GB of data, and most of it is purged after 60 days. We're quite far from being the 274th largest website in the world as well, I assume.

It's just very interesting to see how such a huge website can use so little storage.


What do they have in the database other than text fields and profile information? They don't store many pictures or anything else, do they? That's probably why they don't need lots of space.


I'm really surprised by this as well.


How are you storing your data? And is it compressed?

... I suppose for spam, the problem is that as a side-effect of trying to fool Bayesian filters, there's a lot of incompressible gibberish.


It's mostly meta data and hashes of stuff. Uncompressed. Most of it in MongoDB, which is probably one of the worst technical decisions we made, for all the reasons that have been discussed at length.


If you had a chance to do it over again, which store would you have picked instead of Mongo?


I haven't researched this much, but my gut tells me Cassandra and/or Postgres.

One of the main reasons why we escaped MySQL was that we were stuck with our schema. Adding an index or a column basically copies the table to a temporary table, applies the change, and copies it back into place. That's my understanding, which might be a little bit wrong. All I know is that making one very simple change took hours and hours because our tables were so big. It was quite ridiculous. We couldn't afford the downtime, so we were stuck with what we had, which was no longer sufficient for our needs.
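
To make that concrete, here is a rough sketch of the kind of change being described, against a made-up table (not Defensio's real schema). On MySQL/InnoDB of that era, statements like these were typically executed by building a complete copy of the table with the change applied and swapping it into place, which is what made them so slow on large tables:

    -- Hypothetical table and column names, purely for illustration.
    -- Pre-5.6 MySQL generally rebuilt the entire table to apply either of
    -- these, blocking writes for the duration of the copy.
    ALTER TABLE api_requests ADD COLUMN is_spam TINYINT NOT NULL DEFAULT 0;
    ALTER TABLE api_requests ADD INDEX idx_created_at (created_at);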

I understand that Postgres doesn't have that limitation, but since our schema used to change so much and we had so many joined tables, the MongoDB data structure seemed like a great fit. Mongo was also amazingly fast in our tests, but those tests didn't properly take into consideration the global write lock.

The global write lock problem is a very well known issue now but we started using MongoDB before 1.0; way back when nobody really knew anything about it. At least people are now more informed, although it doesn't seem to prevent many from making the bad decision of using it.


Do you have a link on why MongoDB is bad?


This has been discussed so many times here, especially recently. Search HN and you'll find tons of info.

MongoDB is great at small scale. Basically, if everything fits in memory, it's awesome. If not, it slowly becomes a major pain in the butt, especially because of the global write lock.

I love all the concepts behind Mongo. Replication is so easy, and the data structure is extremely flexible. I just wish the implementation was as good. Hopefully one day...


At a guess, then, your dataset is larger because of duplicated data.

One of the nice things about normalisation is that, by design, it removes such duplications. The data is reduced to the smallest usable form reasonably possible without compression.

In your case, and I'm just guessing here, there'll be many many cases of duplicate data. Whereas SO would store a user email once, you might be storing it hundreds or thousands of times. And so on.
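
A minimal sketch of the difference, with made-up table names rather than anyone's real schema: the denormalised shape repeats the email on every row, while the normalised shape stores it exactly once and points at it by key.

    -- Denormalised: the user's email is copied onto every row they touch.
    CREATE TABLE events_denormalised (
        event_id   BIGINT PRIMARY KEY,
        user_email VARCHAR(255) NOT NULL,  -- duplicated potentially thousands of times
        payload    TEXT
    );

    -- Normalised: the email lives in exactly one row, referenced by a small key.
    CREATE TABLE users (
        user_id BIGINT PRIMARY KEY,
        email   VARCHAR(255) NOT NULL UNIQUE
    );

    CREATE TABLE events (
        event_id BIGINT PRIMARY KEY,
        user_id  BIGINT NOT NULL REFERENCES users (user_id),
        payload  TEXT
    );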

Compression would help a lot, if it was available (some brief Googling suggests it isn't).

(edit) My guess is that you already know all of the above. I'm not really helping, am I? :/


The problem with de-duplication is that it actually takes resources to do that. At the rate we're serving requests, everything has to go as fast as possible. Normalizing can actually slow things down.

Our app is not quite the typical web app. We can't cache much, we serve an 8-9 digit number of API requests every day, and we try to do it as quickly as possible. Every request is different and unique.

And don't worry, I'm not insulted :) Hopefully this conversation will give some insight to people about how we do things.


> The problem with de-duplication is that it actually takes resources to do that.

Normalisation doesn't deduplicate by performing comparisons; it works by ensuring that there is one and only one logical place for the data to go. On OLTP tasks highly normalised databases as a rule perform better.

Where normalised data extracts its pound of flesh is in 1. join times and 2. the indices created to support reads and joins.
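
Using the same hypothetical users/events tables sketched above, that cost shows up as a join on every read that needs the shared value, plus the extra index that keeps the join cheap:

    -- Every read that needs the email now pays for a join...
    SELECT e.event_id, u.email, e.payload
    FROM events AS e
    JOIN users  AS u ON u.user_id = e.user_id
    WHERE e.event_id = 12345;

    -- ...and for the index that keeps lookups by user fast.
    CREATE INDEX idx_events_user_id ON events (user_id);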

Yours is a case where, given the pattern of traffic, a highly denormalised data store makes sense. I just thought I'd explain to listeners-in why relational data stores can often do something with less total disk usage.


Yea 100GB is tiny! Our database is several times that big and we're just some random B2B company with a few dozen employees. And we don't use sharding or whatever, it's all just one machine with lots of RAM. We've not even moved it to SSDs yet.


I have 1.7TB of FusionIO PCI SSD at work and it's not even doing that much right now.

These guys should open their wallets and get more than a little bit of SSD, prob PCIe, plus max out on RAM. 96GB is what most would start at now in a larger HP server.


Those drives still cost about $20k or so to purchase, for only about a 25% increase for reads and 10% for writes (http://blog.serverfault.com/2011/02/09/our-storage-decision/). Why would we pay that extra for so little gain?

Also, with FusionIO you are putting all of your eggs in one basket - drive wise. If that card dies you are done. In short we don't _need_ the FusionIO so why would we put out that kind of money for it?

Also, we went to 96GB about a year or so ago when we set up those boxes; it was a reasonable amount of memory to put in at the time. We are maxing the thing out at 288GB now.


Being of the sysadmin sort, my immediate thought is whether the memory is going to downclock because of the number of DIMMs populated. Of course, this only happens if it's an Intel box. And you also may not care. With only 288GB you can definitely buy some new hardware before having to scale out the number of boxes. They make machines with 1TB of RAM now...

While you mentioned SAN storage being really expensive (and the price changes depending on who you are), you really can't beat the caching modules. I've got a NetApp with 512GB of read cache, and you can buy them in smaller sizes.

Have you guys noticed a degradation of SSD performance? I've never used SSDs in a machine before, and I know Anandtech always writes about the performance degrading after the drive gets mostly full. I just wondered if you've run across the problem in production.


I don't think they cost $20k now, but I don't actually know. That said, I have 3 in a big IBM server, and the whole server, including the FIO cards, was just under $60k a year ago. Maybe IBM threw in the server and external storage enclosure half filled with drives for free when we bought 3 cards...?

Also you can mirror them.


Did you guys do better than about $13/usable GB of storage? When I was shopping, 6x Intel 710s cost about that, and that's about what Fusion IO storage seems to go for.


We'll end up paying about that for the 710s, but you have to look at what we're getting for the same price. A Fusion IO card dies: we're dead in the water. It takes 2-3 of the Intel 710s (in a RAID 10 like they'll be) dying to do the same. Given that the IO of our storage isn't close to being a factor, we'll always stick with the much more robust route for the same price.


Ah. Well if your RAID controller dies, you're dead in the water no? My understanding of the Fusion cards is that they're roughly as resilient as a RAIDed SSD setup.


That doesn't really happen; it's an extraordinary event for a RAID controller to die. But can it happen? Sure, anything can happen...that's why we have an identical hot backup server only minutes behind on the database for just such an occasion.


It seems truly bizarre to me that in 2012, there is a major operation that is still trying to vertically scale such a trivially shardable DB.

Scaling a static Q/A site with a few widgets that require user info/counters should be table stakes.


The situation is the opposite of what you seem to think. Scaling up instead of out is easier and cheaper than it has ever been. Just 5-6 years ago it would have been nearly unthinkable to have half a TB of memory, 16 CPU cores, and a TB of SSD for a few thousand dollars.

Most people's traffic/data hasn't grown nearly as fast as Moore's law. Scaling up makes more sense than ever. It may not be hip but it's the right choice for 99% of cases.


Just be ready for the day when you max out your IO bandwidth, or hit some other hard-to-foresee bottleneck.


There is always a bottleneck. Moving it from "disk I/O" to "cache coherency network traffic" doesn't magically make that go away.

Bottlenecks are not magically confined to only affect scale-up designs. And if you look at the design of scale-out systems, they usually surrender some desirable quality such as easy joins, isolation, atomicity etc.


It depends on how fast you're growing, since technology keeps chugging along too.

Case in point, here's a 16TB SSD with 560 MB/s performance: http://www.engadget.com/2012/01/09/ocz-goes-ssd-crazy-at-ces...


You might be able to keep cramming in more RAM until memristor storage makes RAM obsolete.


I know it's a cliche, but memristors are going to cause an earthquake in the algo-economics landscape.


I have a hard time seeing whether memristors are really going to be the future of all storage or whether they're just a fad. Can somebody who knows something about the topic explain this?


Here's a useful, but lengthy, talk on the subject: http://www.youtube.com/watch?v=bKGhvKyjgLY

Memristors look extremely likely to be able to provide non-volatile storage with latency, access speeds, and density higher than existing DRAM and flash. They also look like they can be used to make FPGA-like devices which can approach the speed, power efficiency, and logic density of custom designed ASICs but with extremely fast reconfiguration speed. And those seem to be the less fascinating aspects of memristor technology.


The thing about memristors is that they could potentially have transistor-like speed at the microlevel, but they can hold data at much, much higher density without constant input of power.


Potentially memristors could lead to smartphones and tablets with the power of a server rack of today crammed into them, but also with improved battery life.

And that's one of the least revolutionary implications of memristor technology.


Unforeseen problems also happen with horizontal scaling.


They've proudly blogged about it a few times: http://blog.stackoverflow.com/category/server/

I think they've successfully shown how far you can take it. Nevertheless, I agree with you that scaling this horizontally makes a lot of sense.


They ARE scaling horizontally. Remember where they mentioned that they are putting StackOverflow, one of several sites that they run, on its own box? That is horizontal scaling. They now have two shards, one with SO and one with all the other sites.


No matter how you slice it, sharding introduces substantial complication. You need to modify your application to shard or to use an auto-sharding store, and then maintain that overhead forever.

The thing is, they don't need to do that. Moore's Law, here represented as SSDs and RAM, is keeping ahead of their traffic growth.


It is very intriguing as a front-end guy to see what server admins have to do. While I have to think about performance, I never have to worry about running out of space or random disk I/O. What do front-end founders do when they unexpectedly run into traffic spikes? I'm so glad services like Heroku handle that for me.


And as a backend guy, I'm glad I don't have to worry about things like why doesn't this look right in IE, and if I make this button this size vs. this size what will the sales impact be ;)


Also, why not just store everything in memory? 512GB of RAM is also not overly expensive… even with the extra MS licensing required for it.


Does anyone know if there's any specific reason why they have multiple databases on the same box? From what I see, one could trivially install the full text search engine on another server and reduce much of the space requirements and contention.


If you read closely, you'll see that StackOverflow's database has been moved onto its own hardware away from other StackExchange sites.

I think they're using SQL Server's inbuilt text search engine (edit: no, see below).


Actually, they switched to Lucene.NET a year ago: http://blog.stackoverflow.com/2011/01/stack-overflow-search-... It sounds as though the search indexing isn't done by the database server anymore.


Thanks for the correction.


Oh, no, I see that, I meant why don't they put the full text search engine (Lucene.NET, as per the comment below) on its own server. Wouldn't that make more sense, if they're running into this trouble?


We have done just that: Lucene.Net runs on the web tier, which is otherwise severely under-utilized hardware (all servers sat at under 10% CPU before we moved search to it). But it will be moving again soon (we're actually discussing that today). I'll have another post coming up about this architecture shift for search and a few other things.


Ah, thanks, I figured that was a good move.


Is it not worrisome that they only have one DB server with no hot-backup or fail-over DB machine? I suppose many components on that one machine are redundant - disks, CPUs... but there's got to be many points of failure in there as well, right?


Both NY-DB01 (runs everything but Stack Overflow) and NY-DB03 (runs Stack Overflow) have identical backup counterparts: NY-DB02 and NY-DB04. NY-DB04 is on a mirrored config and is always a few minutes behind, while NY-DB02 is restoring scheduled backups. With SQL Server 2012, the backup/mirror is greatly improved and both boxes will have fully-hot spares in a replica configuration. In 2008 R2, SQL Server just doesn't handle mirroring 100+ databases well.
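
For context, a per-database mirroring pairing in 2008 R2 looks roughly like this in T-SQL (the endpoint name, port, hostnames, and database name are illustrative, not the actual Stack Exchange configuration). Because it has to be repeated for every database, doing it for 100+ databases gets unwieldy:

    -- One-time setup on each partner: a mirroring endpoint (illustrative name/port).
    CREATE ENDPOINT MirroringEndpoint
        STATE = STARTED
        AS TCP (LISTENER_PORT = 5022)
        FOR DATABASE_MIRRORING (ROLE = PARTNER);

    -- Then, for EACH database to be mirrored (assuming it has already been
    -- backed up and restored on the mirror WITH NORECOVERY):

    -- on the mirror server, point at the principal...
    ALTER DATABASE ExampleDB SET PARTNER = 'TCP://principal-host:5022';

    -- ...and on the principal server, point at the mirror.
    ALTER DATABASE ExampleDB SET PARTNER = 'TCP://mirror-host:5022';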


And those are all in the same data center, right? How much downtime do you guys have a year?


The hot backup is in the same New York datacenter, yes. We also have daily backups across the country in our original Oregon datacenter. The whole OR setup is getting love to be a much more resilient failover location as we speak (that'll be the topic of my next post). As for downtimes, pingdom says we were down 7h 6m 23s last year, so 99.92% uptime (note: not nearly all of that was DB related).


noob q: any reason not to use RAID5 or 6?


As the others mention, read performance is bad while in a degraded state. However, seeing as they're running their log on the same drives, they have quite a lot of write ops as well.

RAID5 is notoriously bad for write performance, just as RAID6 is. This is a classic writeup on why you should avoid RAID5 at all costs for a scenario like this: http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt


Good question. I just googled this: http://en.wikipedia.org/wiki/RAID#RAID_10_versus_RAID_5_in_R...

Summary: Slight benefit to read performance and greater resiliency in the case of drive failure with Raid 10.


RAID-5 and RAID-6 suffer tremendously in their degraded state - instead of being able to read individual blocks from a single disk, you need to read all the disks (or all but 1 in RAID-6) for any blocks stored on the down drive.


If you run RAID 10 on a busy system and a disk fails, everything keeps running at full speed until you replace the missing disk and rebuild your redundancy.

If you run RAID 5 or 6 and a disk fails, suddenly every read operation becomes several times slower because the missing data must be computed from the parity and remaining data channels. If your normal day-to-day load on the storage is too high you are screwed until you rebuild your missing disk.


Write throughput/latency on RAID 10 remains consistent with previous numbers (latency perhaps milliseconds faster), but read latency and overall throughput (as opposed to individual requests) will suffer on the mirror with the bad disk.

Recovery will drop read/write performance, depending on how fast you do it, but nowhere near as bad as RAID-5/6.


Wait, SO is all on one single node? Or are there reverse proxies?

I guess the static content is CDN but all dynamic is coming from one machine?

Oh wait, nevermind (10 Dell R610 IIS web servers)

http://highscalability.com/blog/2011/3/3/stack-overflow-arch...


Our web tier has 10 servers that all run at 10-15% CPU. And we have two database servers: one for StackOverflow and one for everything else (SE Network, Careers and Area51)


If you're the 27th largest site in the world, you shouldn't have any trouble getting large fast SSDs. Or just get a bunch of

    Crucial M4 256 GB (4KB Random Write: Up to 50,000 IOPS)

    or

    Plextor M3 Series PX-256M3 256GB (4KB Random Write: Up to 65,000 IOPS)
Plus your whole site is perfect for sharding. Questions are pretty much independent.


The article says they're the 274th largest site, not 27th. And that's in the U.S., not the world. And as the article says, they are getting 200GB SSDs.


I did like that they're getting 200GB vs 300GB though. $18m in funding and still being frugal! :-)



