I'm curious: What approach, if any, would you take?
I'm sure you have an interesting perspective based on your experience working at MySQL-friendly shops, in addition to being an engineer at Tumblr once upon a time.
Thank you for asking, but I probably shouldn't get into that :) I'm inherently conflicted / biased, having designed a solid chunk of Tumblr's backend and having previously worked for another Automattic competitor (Six Apart, the defunct company behind Movable Type).
Edit: to avoid anyone misconstruing this, I'm not trying to imply one thing or another, just that I can't approach this impartially. And in any case, I wish everyone well on both sides of this acquisition. I'm just genuinely curious how they plan to proceed from a technical standpoint, as it's a really interesting challenge.
Ehhh, parts of that article were never accurate, especially the stuff about having an HBase-powered dashboard feed.
The product backend is primarily monolithic PHP (custom framework) + services in various languages + sharded MySQL + Memcached + Gearman. Lots of other technologies are in use too, but I'll defer to current employees if they want to answer.
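For anyone unfamiliar with that pattern, here's a minimal sketch of what a sharded-MySQL-plus-Memcached read path tends to look like. Everything here (shard count, key scheme, the db/memcached client interfaces) is invented for illustration; this is emphatically not Tumblr's actual code:

    NUM_SHARDS = 64  # hypothetical fixed shard count

    def shard_for(user_id):
        # Route each user to a shard. Real deployments often use a
        # directory/lookup service instead of pure modulo hashing, so
        # shards can be rebalanced without rehashing everything.
        return user_id % NUM_SHARDS

    def get_user(user_id, memcached, shard_pools):
        key = "user:%d" % user_id
        cached = memcached.get(key)           # read-through cache
        if cached is not None:
            return cached
        db = shard_pools[shard_for(user_id)]  # pick the right MySQL shard
        row = db.query("SELECT * FROM users WHERE id = %s", (user_id,))
        memcached.set(key, row, 300)          # repopulate on miss, 5 min TTL
        return row

The shape is the important part: cache in front, and every query routed to exactly one shard by its owning key.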
Not exactly. Tumblr has a pretty huge Hadoop fleet and decently large Kafka setup too. It's just a question of OLTP vs OLAP use-cases being powered by different tech stacks.
My answer above was limited to the product backend, i.e. the technologies used to serve user requests in real-time. And even then I missed a bunch of major technologies in use there, especially around search and algorithmic ranking.
It really depends on the task at hand. I'm one of the most vocally pro-MySQL commenters on HN, and have literally built my career around scaling and automating MySQL, but I still wouldn't recommend it for OLAP-heavy workloads. The query planner just isn't great at monstrous analytics queries, and ditto for the feature set (especially pre-8.0).
For high-volume OLTP, though, MySQL is an excellent choice.
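To make the pre-8.0 point concrete: window functions and CTEs only landed in MySQL 8.0, so even a pretty ordinary analytics query like this one (schema invented just for illustration) had to be emulated with self-joins or session variables on 5.x:

    -- Running total of posts per user per day; requires MySQL 8.0+
    WITH daily AS (
      SELECT user_id, DATE(created_at) AS day, COUNT(*) AS posts
      FROM posts
      GROUP BY user_id, DATE(created_at)
    )
    SELECT user_id, day,
           SUM(posts) OVER (PARTITION BY user_id ORDER BY day) AS running_total
    FROM daily;

And once queries like that grow to dozens of joins over big fact tables, the planner is where the pain really starts.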
Regarding Kafka: in many situations I agree. Personally I prefer Facebook's approach of just using the MySQL replication stream as the canonical sharded multi-region ordered event stream. But it depends a lot on the situation, i.e. a company's specific use-case, existing infrastructure and ecosystem in general.
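If you want to play with that idea, the open-source python-mysql-replication package lets you tail a MySQL binlog as if you were a replica. A minimal sketch, with connection details invented and a placeholder consumer (this is obviously not Facebook's actual tooling, just the cheapest way to demo the concept):

    from pymysqlreplication import BinLogStreamReader
    from pymysqlreplication.row_event import WriteRowsEvent

    def handle(schema, table, values):
        # Placeholder consumer; replace with your own logic.
        print(schema, table, values)

    stream = BinLogStreamReader(
        connection_settings={"host": "127.0.0.1", "port": 3306,
                             "user": "repl", "passwd": "secret"},
        server_id=100,                 # must be unique among replicas
        blocking=True,                 # keep tailing like a real replica
        only_events=[WriteRowsEvent],  # just committed INSERTs here
        resume_stream=True,
    )

    for event in stream:
        for row in event.rows:
            # Rows arrive in commit order and are already durable in
            # MySQL, so there's no separate broker to operate.
            handle(event.schema, event.table, row["values"])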
In other words, Kafka is not going to replace MySQL, precisely because it depends on the task at hand.
If you can't replace MySQL with Kafka, then why not just stick with whatever queue/jobs/stream infra you had before Kafka? At least those solutions are quite limited in scope and easily replaceable.
At this point Kafka is a solution looking for a problem.
My feeling about Kafka is that it's a useful tool to solve the "we MUST get this data to reliable storage IMMEDIATELY" problem, and to greatly mitigate the "each item must be processed and shown to be processed, exactly once" problem.
But there are relatively few situations where that's absolutely vital. And you can solve it with good ol' SQL.
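For what it's worth, the SQL version of that usually looks like an idempotency/dedup table: record each processed item under a unique key in the same transaction as the work itself, so a redelivered item turns into a harmless constraint violation. A rough sketch with invented table names:

    CREATE TABLE processed_events (
      event_id     BIGINT PRIMARY KEY,   -- dedup key from the producer
      processed_at DATETIME NOT NULL
    );

    START TRANSACTION;
    INSERT INTO processed_events VALUES (12345, NOW());
      -- a duplicate delivery fails right here, rolling everything back
    UPDATE accounts SET balance = balance + 10 WHERE id = 7;
      -- ...the actual side effects, atomically tied to the dedup row
    COMMIT;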