I doubt this HN post could have a more boring title. The actual article is very interesting. Basically Facebook's DB developers talk about what it takes to perform 13 million queries per second, and a lot of the useful tips they've learned along the way to make scaling simpler.
How is the title boring? What could be more telling? Having "Facebook" in the title makes it clear we're going to be talking about astronomical amounts of data. "[insert any db] at Facebook" would be an interesting title in my opinion.
The fact that it's MySQL makes it even more interesting, given the shift of scalability interest to MongoDB, etc.
Maybe boring isn't the right word. Vague is probably better. "MySQL at Facebook" is a facebook page, a launchpad user, a twitter user, a O'Reilly MySQLConf talk, etc.
not a conspiracy theorist by any stretch, but i found it a bit humorous this came out just when oracle offered a pricing plan for mysql and, especially, innoDB.
"They figure out why the response time for the worst query is bad and then fix it."
This can reap huge benefits and doesn't need to be difficult. Just enable the slow query log in MySQL, use the EXPLAIN command to analyze the results, then add indexes where appropriate. I was able to fix poorly indexed tables in a vendor's application with dramatic results. In one case, a twenty-minute(!) query was reduced to less than a second.
While your premise is correct it's not always that simple. More indexes can slow down updates and inserts. A tables overall usage pattern needs to be examined before just adding more indexes. You might have fixed the 20 minute report query while slowing down some other more critical query that loaded or updated data.
Yes, this is a good point and I actually noted it in my issue tracker while troubleshooting. I made sure that inserts & updates weren't negatively affected to an unacceptable degree after adding the indexes. The overall win was huge, because it eliminated wait times for a sizable staff and client base.
"It is OK if a query is slow as long as it is always slow"
I find this enlightening.
I think it is just one instance of difference in emphasis between Google and Facebook. Google, a technology oriented company, would minimize the average response time, while Facebook, a people oriented company would minimize the unpredictability (by minimizing the variance)
For what it's worth, Jeff Dean of Google has famously emphasized 95th and 99th percentile performance in preference to average or median performance over the years. The realization that edge cases more powerfully determine user perception of performance than average cases is a deep one, and is not original to Facebook or Google; Dean and colleagues were making some of the same points when they were at DEC's CRC pre-Google.
There is some social psychology and operations research around this -- not necessarily as it relates to human/machine interactions, but certainly as it relates to such quasi-analogous situations as call centers, lines, speed vs. variability of service at McDonald's, and so forth. The premise being practiced by Facebook takes a lot of cues from behavioral science data that is generalizable to many different domains.
McDonald's, to give a famous example, proved analytically a few years back that -- beyond a certain threshold of commonly expected service times -- their customers would rather get semi-slow service on a consistent basis than highly variable service. In a perfect world, of course, average wait times are as short as possible in addition to a minimization of variance. But when you're at the level of acceptability, there are diminishing returns on speed increases and increasing returns on reduced variance.
Alas, no, it was from an analytics consulting engagement a buddy of mine did a few years ago. But I am reasonably sure the outcome of the data is public knowledge and available somewhere. I wish I knew where.
Alas, my understanding of this point is entirely folkloric. Self-experimentation, and working on improving real interactive systems, agrees with it. Please update if you find anything more comprehensive.
I think the more compelling reason for "queries must be fast or slow, never 'both, depending'" is that it prevents engineers from accidentally using code originally built for reporting purposes (which might well have no latency requirements, or very relaxed ones) and putting it in a widget which gets slapped on people's home pages. You'd notice on your local, test, or staging instance "Hmm, my home page is taking 10 seconds to load -- that's bad" prior to pushing it live and, whoopsie, there's now 50 million people hitting a very scarily expensive query all at once.
I'm sorry, but your interpretation here is completely wrong.
Facebook (and Google!) cares about performance variance because it has more of an impact on overall site performance than average performance does. Variable performance has huge impact on downstream systems, and you can quickly end up with cascading performance problems.
I think that quote is slightly misleading without more context. They prioritize optimizing variable-performing queries higher than others. They aren't going to be using slow queries on the Facebook home page.
>> "It is OK if a query is slow as long as it is always slow"
I'm having trouble understanding the motivation. If a slow query is always slow, then I'm always going to be kept waiting for that page/data. It seems logical to worry about the queries that 100% of the time keeps users waiting rather than the queries that keep users waiting <100% of the time.
Does anyone care to explain why this is a good idea (for Facebook at least)?
I'm sure the rationale here is that if a given query takes 100ms, they don't focus on getting it down to 50ms even if that's several times what the average is because they know they can. Certainly someone should focus on making that query faster, but it's more straightforward.
The harder problem is figuring out why that 20ms query suddenly balloons to 200ms. You can say, "no big deal, it only happens 1% of the time," but if you don't know why, you could make changes to the system that cause it to happen much more frequently and eventually bring the whole system down.
Also, there's a bit of UX here. People are much more frustrated by things they don't understand and/or aren't used. There are parts of GMail that are always slow (archiving a lot of messages). I know this so I know I have to wait 5-10 seconds. What if sometimes it took 1 second and sometimes it took 20 seconds? What if it took 20 seconds 5% of the time. I'd probably always click again and think something was broken. If it's always slow, I want it to be faster, but at least I know what to expect.
It's all about user experience. If a feature loads for you in 300ms, then it will feel unbearably slow if it later take 700ms. If the feature always loaded at 700ms, your expectation of 700ms would never be disappointed. The theory is that a consistent 700ms is more appealing than an inconsistent 300-700ms, even though the median is better in terms of raw performance.
The parent hints at this by talking about minimizing unpredictability and describing Facebook as "people oriented" - it ultimately boils down to user experience.
Think of it this way: if you know a certain function will reliably take a little while to complete, you can justify the effort of adding progress indicators and other feedback to let the user know.
But if query performance is unpredictable, even planning the UI design becomes difficult - not to mention the end user's experience.
The concept of variance reduction is not even new with computing - In manufacturing they have six sigma. I believe the famous GE CEO's quote was "Our Customers Feel the Variance, Not the Mean" (Google it).
I enjoyed the section on creating quality, not quantity, and its emphasis on minimizing variance. I can see how these heuristics could be applied to most startups.
The section on diagnosing should be taken with a grain of salt, though. If your company ever gets to the point where you need to monitor everything at subsecond level to catch problems or analyze and understand every layer of your stack to see how it performs, you've already won. That amount of attention to scalability means your company has a huge base of users. Not only that, it means you have the large and impressive engineering resources to devote to that problem.
That's definitely not my startup, and so the tools described, while definitely useful (and probably fun to build!), aren't anything approaching a priority for me. In the words of the stereotypical Yiddish grandmother, you should be so lucky to have those sorts of problems!
It'd be interesting to know some info on the hardware back end, such as number of servers, storage system, etc. Also, how many servers does a typical query touch?
Does anyone know the througput of the largest MSSQL installation? I'm searching the web to show off a little information at work, but I can't find anything that compares.
Everyone is using MySQL. But I attended a talk by one of their DBAs where he said that they large OLTP server, the one that processes the payments is Oracle. Single-instance, because Oracle RAC can't give them the low latency they needed.
They switch the the biggest machine IBM can give them every few month.
>They switch the the biggest machine IBM can give them every few month.
Yea, almost all this "web scale" scalability can achieve those numbers because they can partition, relax transactional requirements, and scale out. For system-wide true transactions, scaling up is still the only option, it seems.
Although, H-Store/VoltDB claims some impressive numbers if your OLTP app fits the requirements.
Just curious, do you have any good examples of very large scale systems that need system-wide transactions? I might miss something obvious (I'm not actually in the database area), but I couldn't come up with a good example.
It depends on your definition of 'very large scale'. Certainly any large financial/insurance company is going to be churning through great gobs of data that really does have to be universally consistent. This is basically the niche that keeps IBM's mainframe division in business.
I'd suspect that if those don't qualify as 'very large scale' to your definition (which would be entirely justifiable) that such systems don't exist because they're not currently possible. For example I don't think it's possible to implement Facebook with SERIALIZABLE no matter how much money you throw at the problem.
I'd guess that in Amazon and other online retailers, the part that completes the order of the item from the warehouse has to be transactional so two customers won't buy the last item.
Normally, whenever physical objects or money is involved, transactions should be considered.
I can't understand why anyone who is in the know would sign up for this. I work at a fortune-5 corp where Oracle was once king and is being replaced with Microsoft SQL simply due to the outrageous price gouging. It's as if Oracle is trying to squeeze every last penny out of it's aging database as OSS solutions chip away at its profits. PostgreSQL is, in my opinion, poised to do this. MySQL just pales in comparison to Postgres and Oracle is kitchen sink and then some -- even at a fortune 5 we barely use all the "features" in Oracle.
I'm just astonished that a company like Oracle being around as long as they are could be so dumb. At the fortune-5, Oracle has a similar practice of gouging us on Peoplesoft licenses due to, in my opinion, lost DB sales.
Charging a customer a license by CPU core is just unethical.
There's a full video of the talk available at http://www.livestream.com/facebookevents/video?clipId=flv_cc...