This article is mis-titled: it should have been called "How to get poor performance by using a completely inappropriate tool".
Quoth the article: "Several million records a day need to be written to a table. These records are then read out once at the end of the day, summarised and then very rarely touched again."
This isn't a job for a database. This is a job for a log file. You might want to use a database to store the summarized data, of course; but bulk data which is written, never updated, and quite likely only ever read once? That's practically the definition of a log file.
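To make that concrete, here's a minimal sketch of the log-file approach (my own illustration, not anything from the article; it assumes each record is a flat dict, and the directory and file naming are placeholders):

    import json
    import datetime

    def append_record(record, log_dir="/var/log/myapp"):
        """Append one record as a line of JSON to today's log file.

        An append is a single buffered write: no index maintenance,
        no transaction overhead, no primary-key checks.
        """
        today = datetime.date.today().isoformat()
        with open(f"{log_dir}/records-{today}.log", "a") as f:
            f.write(json.dumps(record) + "\n")

In practice you'd keep the file handle open rather than reopening it per record, and rotate by date, but the point stands: the write path is about as cheap as it gets.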
Agreed. They did stipulate, however, that the data needed replication between multiple datacenters for redundancy, which would make me think about putting a message queue in there that replicates across DCs, and having logging daemons at each DC pull data out of the queue.
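Roughly like this on the consuming end, say (a rough sketch assuming a RabbitMQ-style broker via the pika library; the queue name, host, and log path are placeholders, not anything from the article):

    import pika

    def on_record(channel, method, properties, body):
        # Append the raw record to this datacenter's local log, then ack
        # so the broker can discard its copy.
        with open("/var/log/myapp/records.log", "ab") as f:
            f.write(body + b"\n")
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="mq.dc1.example.com"))
    channel = connection.channel()
    channel.queue_declare(queue="records", durable=True)
    channel.basic_consume(queue="records", on_message_callback=on_record)
    channel.start_consuming()

Each DC runs one of these against its local replicated queue, so a lost link between DCs just means the log catches up once the queue drains.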
You're right, I guess I was thinking both data centers needed to be as tightly in sync as possible, but looking back, that does not seem to be the case.
Well... that's fun and all, but surely it's better to use the optimal solution and see how far you can take things? I personally find building meaningful products that my customers love hard enough without handicapping myself for my own amusement!
"How long will people continue believing in the myth of the 15-year-old hacker genius while simultaneously decrying the unreliability of software before the cognitive dissonance finally cracks?" - Dave Herman (http://calculist.blogspot.com/2005/12/12-weeks-with-geeks.ht...)
Have you ever noticed how when this happens, it's practically always MySQL? (And when people complain about how badly relational databases perform and it turns out they don't know about indexes, transactions, or stored procedures, it's always MySQL...)
The non-MySQL RDBMS people that I know, know when their RDBMS of choice isn't the right tool for the job. Most of the MySQL people I know seem to think that you just shove everything into the DB and let Eris sort it out.
Because no other member of the pantheon is crazy enough to even look at that mess.
I suspect a network effect is in place where the vast majority of people whose entire understanding of databases came from "PHP + MySQL For Dummies"-caliber books, blog posts, etc. use MySQL by default. A critical mass of them end up perpetuating bad advice, griping about how since MySQL hasn't worked well for them, all RDBMSs are junk, etc. Then people who are just getting started with databases and looking for advice read their tutorials. To some degree, MySQL probably brought this upon itself[1], but now there's a feedback loop in place[2][3][4].
There seems to be a similar effect between Linux and BSD. I'm not going to claim that BSD is objectively superior in every regard, but on average, the BSD community seems to be quite a bit better informed than the Linux community. It may just be because Linux is so much more visible. People using BSD, Postgres, etc. probably already knew enough to evaluate their options, while the path of least resistance stays overrun with clueless newbies.
MySQL is a popular RDBMS starting point. Therefore, almost by definition, if you run into someone using an RDBMS who doesn't really know WTF they're doing, they're quite often using MySQL.
If it wasn't MySQL, the same people would flock to, and poorly use, some other tool.
> If it wasn't MySQL, the same people would flock to, and poorly use, some other tool.
That's what I said, deeper in the thread. I don't think it's because MySQL is terrible, but almost all the people who don't really know databases start with MySQL by default, so the community surrounding it consistently has more people who are grossly misinformed than (for example) Postgres's.
I wonder why they used a database at all:
"These records are then read out once at the end of the day, summarised and then very rarely touched again."
And later on they end up writing them to a text file and bulk inserting them in 10K batches. Wouldn't it be easier to just write them all to a file and then summarize at the end of the day? (It also says they do a DROP TABLE on each day's data.)
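The "skip the database" version of the end-of-day step could be as simple as one pass over the file (a sketch only; the field names, grouping key, and path are invented for illustration):

    import json
    from collections import Counter

    def summarize(log_path):
        # One sequential read of the day's log; group by whatever
        # dimension the daily summary actually needs.
        totals = Counter()
        with open(log_path) as f:
            for line in f:
                record = json.loads(line)
                totals[record["event_type"]] += record.get("count", 1)
        return totals

    daily = summarize("/var/log/myapp/records-2011-06-01.log")

The summary is the part that's actually worth putting in a database; the raw file can then be compressed and archived, which is the moral equivalent of their DROP TABLE.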
A couple of points for those thinking about doing something similar:
"Since InnoDB stores the table in the primary key, I decided that rather than use an auto_increment column, I'd cover several columns with the primary key to guarantee uniqueness. This had the added advantage that if the same record was inserted more than once, it would not result in duplicates."
The 'correct' way to keep auto-generated keys unique across masters in MySQL is to use auto_increment_increment (together with auto_increment_offset; sketched below).
Of course, the real difficulty with multi-master setups split across data centres isn't ensuring uniqueness of primary keys, it's ensuring data integrity under a split-brain scenario, i.e. where one server can't reach the other, but users can reach one or the other. UPDATEs and DELETEs to rows can then become extremely difficult to merge back together.
This wasn't a problem for this application, but as others have commented, this use case probably wasn't best suited for an RDBMS anyway.
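For reference, the interleaving trick is just two server variables per master. Here's a sketch of setting them for a two-master pair (applied via mysql-connector purely for illustration; host names and credentials are placeholders, and in practice these settings usually live in each server's my.cnf so they survive restarts):

    import mysql.connector

    # Same increment on both masters, different offsets, so one server
    # generates 1, 3, 5, ... and the other 2, 4, 6, ..., and their
    # auto_increment values can never collide.
    for host, offset in (("db1.example.com", 1), ("db2.example.com", 2)):
        conn = mysql.connector.connect(host=host, user="admin",
                                       password="secret")
        cur = conn.cursor()
        cur.execute("SET GLOBAL auto_increment_increment = 2")
        cur.execute("SET GLOBAL auto_increment_offset = %d" % offset)
        conn.close()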
With an autoincrement id, duplicate rows may get inserted. Having a primary key derived from the data results in duplicate rows getting discarded (this is important).
Secondly, the autoincrement id adds 4 bytes to each row, which are never used for anything. Only use an id if you need to reference a row from another table.
Make that duplicate rows WILL get inserted -- if only due to network glitches that kill the connection after the database adds the row but before the client receives the acknowledgement, causing the client to retry the request. Unless you don't retry failures, in which case you lose rows instead, of course.
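To illustrate why the data-derived key makes a blind retry safe (a sketch only; the table, columns, and credentials are invented, and INSERT IGNORE is the MySQL-specific way of silently dropping the duplicate):

    import mysql.connector

    conn = mysql.connector.connect(host="db1.example.com", user="app",
                                   password="secret", database="metrics")
    cur = conn.cursor()

    # The primary key is derived from the data itself, so replaying the
    # same record after a dropped connection cannot create a second row.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            source_id  INT NOT NULL,
            event_time DATETIME NOT NULL,
            event_type VARCHAR(32) NOT NULL,
            payload    VARCHAR(255),
            PRIMARY KEY (source_id, event_time, event_type)
        ) ENGINE=InnoDB
    """)

    record = (42, "2011-06-01 12:00:00", "click", "example payload")
    sql = "INSERT IGNORE INTO events VALUES (%s, %s, %s, %s)"
    cur.execute(sql, record)   # first attempt
    cur.execute(sql, record)   # retry after a lost ack: no duplicate row
    conn.commit()
    conn.close()

With an auto_increment surrogate key instead, that second execute would happily insert a second, identical row.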
That's a good warning sign that the data isn't relational in the first place. An RDBMS would probably be a good fit for the analyzed data (which is likely to have explicit relations), though.
No it's not. There are plenty of tables that use composite keys that also contain foreign keys.
The use of ORMs like active record, many of which choke on natural keys, has turned a lot of devs into automatons for artificial key creation. Natural keys are often superior.
Why is there no discussion of the disk drives supporting this database? The write performance dropped to 150 once the disks started getting pounded; up to that point, everything was in a memory buffer. Does the temporal partition spread the file over multiple drives? What's the capacity and theoretical throughput of the drives?
Interesting use of tables, as previously commented, but we at room.ly certainly appreciate your post, as we are currently on a PHP/MySQL diet and looking to reach critical mass soon after a few updates.