This article is mis-titled: it should have been called "How to get poor performance by using a completely inappropriate tool".
Quoth the article: "Several million records a day need to be written to a table. These records are then read out once at the end of the day, summarised and then very rarely touched again."
This isn't a job for a database. This is a job for a log file. You might want to use a database to store the summarized data, of course; but bulk data which is written, never updated, and quite likely only ever read once? That's practically the definition of a log file.
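To make that concrete, here's a minimal sketch of the log-file approach (my own illustration, not anything from the article; it assumes each record is a flat dict, and the directory and file naming are placeholders):

    import json
    import datetime

    def append_record(record, log_dir="/var/log/myapp"):
        """Append one record as a line of JSON to today's log file.

        An append is a single buffered write: no index maintenance,
        no transaction overhead, no primary-key checks.
        """
        today = datetime.date.today().isoformat()
        with open(f"{log_dir}/records-{today}.log", "a") as f:
            f.write(json.dumps(record) + "\n")

In practice you'd keep the file handle open rather than reopening it per record, and rotate by date, but the point stands: the write path is about as cheap as it gets.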
Agreed. They did stipulate, however, that the data needed replication between multiple datacenters for redundancy, which would make me think about putting a message queue in there that replicates across DCs, and having logging daemons at each DC pull data out of the queue.
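Roughly like this on the consuming end, say (a rough sketch assuming a RabbitMQ-style broker via the pika library; the queue name, host, and log path are placeholders, not anything from the article):

    import pika

    def on_record(channel, method, properties, body):
        # Append the raw record to this datacenter's local log, then ack
        # so the broker can discard its copy.
        with open("/var/log/myapp/records.log", "ab") as f:
            f.write(body + b"\n")
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="mq.dc1.example.com"))
    channel = connection.channel()
    channel.queue_declare(queue="records", durable=True)
    channel.basic_consume(queue="records", on_message_callback=on_record)
    channel.start_consuming()

Each DC runs one of these against its local replicated queue, so a lost link between DCs just means the log catches up once the queue drains.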
You're right, I guess I was thinking both data centers needed to be as tightly in sync as possible, but looking back, that does not seem to be the case.
Well... that's fun and all, but surely it's better to use the optimal solution and see how far you can take things? I personally find building meaningful products that my customers love hard enough without handicapping myself for my own amusement!
"How long will people continue believing in the myth of the 15-year-old hacker genius while simultaneously decrying the unreliability of software before the cognitive dissonance finally cracks?" - Dave Herman (http://calculist.blogspot.com/2005/12/12-weeks-with-geeks.ht...)
Have you ever noticed how when this happens, it's practically always MySQL? (And when people complain about how badly relational databases perform and it turns out they don't know about indexes, transactions, or stored procedures, it's always MySQL...)
The non-MySQL RDBMS people that I know, know when their RDBMS of choice isn't the right tool for the job. Most of the MySQL people I know seem to think that you just shove everything into the DB and let Eris sort it out.
Because no other member of the pantheon is crazy enough to even look at that mess.
I suspect a network effect is in place where the vast majority of people whose entire understanding of databases came from "PHP + MySQL For Dummies"-caliber books, blog posts, etc. use MySQL by default. A critical mass of them end up perpetuating bad advice, griping about how since MySQL hasn't worked well for them, all RDBMSs are junk, etc. Then people who are just getting started with databases and looking for advice read their tutorials. To some degree, MySQL probably brought this upon itself[1], but now there's a feedback loop in place[2][3][4].
There seems to be a similar effect between Linux and BSD. I'm not going to claim that BSD is objectively superior in every regard, but on average, the BSD community seems to be quite a bit better informed than the Linux community. It may just be because Linux is so much more visible. People using BSD, Postgres, etc. probably already knew enough to evaluate their options, while the path of least resistance stays overrun with clueless newbies.
MySQL is a popular RDBMS starting point. Therefore, almost by definition, if you run into someone using an RDBMS who doesn't really know WTF they're doing, they're quite often using MySQL.
If it wasn't MySQL, the same people would flock to, and poorly use, some other tool.
> If it wasn't MySQL, the same people would flock to, and poorly use, some other tool.
That's what I said, deeper in the thread. I don't think it's because MySQL is terrible, but almost all the people who don't really know databases start with MySQL by default, so the community surrounding it consistently has more people who are grossly misinformed than (for example) Postgres's.
I wonder why they used a database at all:
"These records are then read out once at the end of the day, summarised and then very rarely touched again."
And later on they end up writing them to a text file and bulk inserting them in 10K batches. Wouldn't it be easier to just write them all to a file and then summarize at the end of the day? (It also says they do a DROP TABLE on each day's data.)
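The "skip the database" version of the end-of-day step could be as simple as one pass over the file (a sketch only; the field names, grouping key, and path are invented for illustration):

    import json
    from collections import Counter

    def summarize(log_path):
        # One sequential read of the day's log; group by whatever
        # dimension the daily summary actually needs.
        totals = Counter()
        with open(log_path) as f:
            for line in f:
                record = json.loads(line)
                totals[record["event_type"]] += record.get("count", 1)
        return totals

    daily = summarize("/var/log/myapp/records-2011-06-01.log")

The summary is the part that's actually worth putting in a database; the raw file can then be compressed and archived, which is the moral equivalent of their DROP TABLE.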
A couple of points for those thinking about doing something similar:
"Since InnoDB stores the table in the primary key, I decided that rather than use an auto_increment column, I'd cover several columns with the primary key to guarantee uniqueness. This had the added advantage that if the same record was inserted more than once, it would not result in duplicates."
The 'correct' way to keep auto-generated keys unique across masters in MySQL is to use auto_increment_increment (together with auto_increment_offset; sketched below).
Of course, the real difficulty with multi-master setups split across data centres isn't ensuring uniqueness of primary keys, it's ensuring data integrity under a split-brain scenario, i.e. where one server can't reach the other, but users can reach one or the other. UPDATEs and DELETEs to rows can then become extremely difficult to merge back together.
This wasn't a problem for this application, but as others have commented, this use case probably wasn't best suited for an RDBMS anyway.
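For reference, the interleaving trick is just two server variables per master. Here's a sketch of setting them for a two-master pair (applied via mysql-connector purely for illustration; host names and credentials are placeholders, and in practice these settings usually live in each server's my.cnf so they survive restarts):

    import mysql.connector

    # Same increment on both masters, different offsets, so one server
    # generates 1, 3, 5, ... and the other 2, 4, 6, ..., and their
    # auto_increment values can never collide.
    for host, offset in (("db1.example.com", 1), ("db2.example.com", 2)):
        conn = mysql.connector.connect(host=host, user="admin",
                                       password="secret")
        cur = conn.cursor()
        cur.execute("SET GLOBAL auto_increment_increment = 2")
        cur.execute("SET GLOBAL auto_increment_offset = %d" % offset)
        conn.close()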
With an autoincrement id, duplicate rows may get inserted. Having a primary key derived from the data results in duplicate rows getting discarded (this is important).
Secondly, the autoincrement id adds 4 bytes to each row, which are never used for anything. Only use an id if you need to reference a row from another table.
Make that duplicate rows WILL get inserted -- if only due to network glitches that kill the connection after the database adds the row but before the client receives the acknowledgement, causing the client to retry the request. Unless you don't retry failures, in which case you lose rows instead, of course.
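To illustrate why the data-derived key makes a blind retry safe (a sketch only; the table, columns, and credentials are invented, and INSERT IGNORE is the MySQL-specific way of silently dropping the duplicate):

    import mysql.connector

    conn = mysql.connector.connect(host="db1.example.com", user="app",
                                   password="secret", database="metrics")
    cur = conn.cursor()

    # The primary key is derived from the data itself, so replaying the
    # same record after a dropped connection cannot create a second row.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            source_id  INT NOT NULL,
            event_time DATETIME NOT NULL,
            event_type VARCHAR(32) NOT NULL,
            payload    VARCHAR(255),
            PRIMARY KEY (source_id, event_time, event_type)
        ) ENGINE=InnoDB
    """)

    record = (42, "2011-06-01 12:00:00", "click", "example payload")
    sql = "INSERT IGNORE INTO events VALUES (%s, %s, %s, %s)"
    cur.execute(sql, record)   # first attempt
    cur.execute(sql, record)   # retry after a lost ack: no duplicate row
    conn.commit()
    conn.close()

With an auto_increment surrogate key instead, that second execute would happily insert a second, identical row.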
That's a good warning sign that the data isn't relational in the first place. An RDBMS would probably be a good fit for the analyzed data (which is likely to have explicit relations), though.
No it's not. There are plenty of tables that use composite keys that also contain foreign keys.
The use of ORMs like active record, many of which choke on natural keys, has turned a lot of devs into automatons for artificial key creation. Natural keys are often superior.
Why is there no discussion of the disk drives supporting this database? The write performance dropped to 150 once the disks started getting pounded; up to that point, everything was in a memory buffer. Does the temporal partition spread the file over multiple drives? What's the capacity and theoretical throughput of the drives?
Interesting use of tables, as previously commented, but we at room.ly certainly appreciate your post, as we are currently on a PHP/MySQL diet and looking to reach critical mass soon after a few updates.