I read: "We evaluated a lot of things: a custom mysql impl, voldemort, hbase, mongodb, memcachedb, hypertable, and others"
I wonder if Redis was among the others. If so (and somebody at Twitter is listening) I'd love to know what you think are the biggest weaknesses in Redis, and whether in the short term it's more important to have a fault-tolerant cluster implementation (the redis-cluster project) or virtual memory to support datasets bigger than RAM in a single instance. Thanks in advance for any reply.
As I read Antirez's comment, he was asking about precisely this. (Specifically, how high a priority removing this limitation should have, relative to other important stuff he's presumably working on.)
Even if you overestimate ridiculously, it seems perfectly reasonable to keep the entire dataset in RAM: 1 TRILLION tweets at 500 bytes each is a dataset of half a petabyte. A quick search shows I can get 4 GB of RAM for $66 USD. Assuming no redundancy, no bulk discount, and that all other hardware is free, that is a cost of about $8M (roughly half of their last round of funding, iirc).
Consider now that you don't need to keep non-recent tweets in RAM, that bulk buyers pay significantly less than individuals, and that the real dataset is far smaller than that, and throwing hardware at the problem seems far less impossible. I'd imagine that they could keep the last month in RAM trivially.
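The back-of-the-envelope numbers above are easy to sanity-check, using the comment's own assumptions (10^12 tweets, 500 bytes each, $66 per 4 GB of RAM):

```python
# Check of the estimate above. All inputs are the comment's own
# assumptions, not real Twitter figures.
TWEETS = 10**12
BYTES_PER_TWEET = 500
USD_PER_4GB = 66

total_bytes = TWEETS * BYTES_PER_TWEET        # 500 TB = 0.5 PB
total_gib = total_bytes / 2**30               # ~465,661 GiB
cost_usd = total_gib / 4 * USD_PER_4GB        # ~$7.7M

print(f"dataset: {total_bytes / 10**15:.2f} PB, RAM cost: ${cost_usd / 10**6:.1f}M")
```

So the "$8M or so" figure holds up even before the bulk-discount and recency arguments.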
Does Redis really require the entire dataset to be in RAM, or does it just require enough virtual memory to hold the dataset? In other words, couldn't you just let it swap?
Do you plan to implement Virtual Memory in Redis? Why not just let the operating system handle it for you?
Yes, in order to support datasets bigger than RAM there is a plan to implement transparent Virtual Memory in Redis, that is, the ability to transfer large values associated with rarely used keys to disk, and to reload them transparently into memory when these values are requested in some way.
So you may ask, why not let the operating system's VM do the work for us? There are two main reasons. First, in Redis even a large value stored at a given key, for instance a one-million-element list, is not allocated in a contiguous piece of memory. It's actually very fragmented, since Redis uses quite aggressive object sharing and reuses allocated Redis Object structures.
So you can imagine a memory layout composed of 4096-byte pages that actually contain different parts of different large values. Not only that, but many values that are large enough for us to swap out to disk, like a 1024-byte value, are just one quarter the size of a memory page, and the same page likely also contains other values that are not rarely used. So such a value will never be swapped out by the operating system. This is the first reason for implementing application-level virtual memory in Redis.
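A toy simulation (hypothetical numbers throughout: 4 KiB pages, 1 KiB values, 75% of values "cold") illustrates why page-granularity OS swapping helps so little with this layout:

```python
import random

PAGE = 4096
VALUE = 1024
VALUES_PER_PAGE = PAGE // VALUE   # 4 values share each page

random.seed(0)
# Mark 75% of values "cold" (rarely used); the allocator interleaves
# hot and cold values freely across pages, as described above.
pages = [[random.random() < 0.75 for _ in range(VALUES_PER_PAGE)]
         for _ in range(10_000)]

# The OS can only evict whole pages, so a page is swappable only
# when *every* value living on it happens to be cold.
swappable = sum(all(page) for page in pages) / len(pages)

print(f"cold values: 75%, swappable pages: {swappable:.0%}")
```

Even with three quarters of the data cold, only about 0.75^4 ≈ 32% of pages are fully cold and thus evictable; application-level VM can instead move each cold value individually.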
There is another reason, as important as the first. A complex object in memory, like a list or a set, is something like 10 times bigger than the same object serialized on disk. You have probably already noticed how much smaller Redis snapshots on disk are compared to Redis's memory usage for the same objects. This happens because data in memory is full of pointers, reference counters, and other metadata. Add malloc fragmentation and the need to return word-aligned chunks of memory, and you have a clear picture of what happens. Letting the OS swap would therefore mean roughly 10 times the I/O between memory and disk than otherwise needed.
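The overhead argument is visible in any pointer-heavy runtime. This Python sketch is an analogy, not Redis internals: it compares the in-memory footprint of a list (object headers, pointers) against the same data serialized to a disk format:

```python
import pickle
import sys

# A list of 10,000 integers: each element is a full heap object with
# a header and refcount, plus a pointer slot in the list itself.
values = list(range(10_000))

in_memory = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
on_disk = len(pickle.dumps(values))   # compact serialized form

print(f"in memory: {in_memory} bytes, serialized: {on_disk} bytes, "
      f"ratio: {in_memory / on_disk:.1f}x")
```

The exact ratio differs between CPython and Redis, but the mechanism (headers, refcounts, pointers, alignment) is the same one antirez describes.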
Maybe, but I did some math and the Twitter dataset should not be so big... The estimates I posted some time ago on Twitter were completely wrong. I hope to be able to post some more data later today.
I think a few big boxes are enough to keep the whole Twitter dataset in memory, and if you ask me, given the continuous scalability problems Twitter has experienced throughout its history, maybe this is something they should consider seriously (Redis apart).
If Google has multiple copies of the entire web in RAM (as has been reported in many places), Twitter should definitely be able to hold 140 character tweets in RAM.
We'd love to be able to store an infinite amount of data in Redis with large datasets -- that would let us use it at Posterous in ways that are beyond how we use it now, which is mainly for analytics and job queueing (via resque)
thanks for the comment rantfoil. The more I talk to people about this issue, the more I think virtual memory is the first thing to do. I'll start working on VM this Xmas; I really hope to have it pretty solid in three months.
From that post: "Don't use EC2 or friends except for testing; the virtualized I/O is too slow". That kind of restricts Cassandra usage for an average Web 2.0 startup.
Added: just to clarify: when starting small, the last thing you want to do is either buy expensive infrastructure or hack together something cheap instead of doing it properly. If Cassandra doesn't perform well in a virtualized environment, it will be necessary to first use some other means of storing data and "go Cassandra" later if needed.
But everything resembling a DB is slow on EC2 since I/O is slow and memory accesses are slow, so I think the problem is EC2 and not Cassandra, MySQL, ...
to clarify, Cassandra is still faster on a per-node basis than relational databases on EC2, but if you compare EC2 numbers with real hardware and wonder "why is it so slow?" that is why. :)
that said, there was a thread yesterday about Joe Stump's benchmarks showing EC2 I/O not being as bad as most people think, so possibly they have made some improvements since. As j3fft mentions, Joe Stump is also in TFA talking about using Cassandra.
A highly-scalable, multicast, real-time, distributed message passing system with full-text search, archiving, deletions, dynamic subscription changes, access restrictions, open APIs with quotas, abuse detection mechanisms, spam fighting algorithms, different authentication protocols, etc.
Everything I take from our set of OTS components will have live, verifiable examples of running at scale.
> multicast, real-time, distributed message passing system
For this the choice really comes down to RabbitMQ or ejabberd. An XMPP solution is appealing for the obvious benefits of having a "presence" concept, but an AMQP solution keeps us closer to the current reality of Twitter.
So we start with a large distributed rabbit setup. A few clusters scattered across the world, connected via shovel pipes and using some nested queue/exchange plumbing to wire it all together.
> full-text search
Some queues dump to a FTS setup that keeps recent messages in RAM and migrates them to disk as they get older. SOLR is probably a better solution here, but I know the Sphinx delta/full-index model better and would reach for that first.
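As a sketch of the recent-in-RAM idea (class and field names are made up; a real deployment would use Sphinx or SOLR as mentioned above), a minimal in-memory inverted index with age-based eviction might look like:

```python
import time
from collections import defaultdict

class RecentIndex:
    """Toy full-text index: recent messages stay in RAM, old ones
    are evicted (a real system would hand them to the on-disk index)."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self.docs = {}                  # doc_id -> (timestamp, text)
        self.index = defaultdict(set)   # term -> {doc_id, ...}

    def add(self, doc_id, text, now=None):
        now = time.time() if now is None else now
        self.docs[doc_id] = (now, text)
        for term in text.lower().split():
            self.index[term].add(doc_id)

    def search(self, term):
        return sorted(self.index.get(term.lower(), ()))

    def evict_old(self, now=None):
        now = time.time() if now is None else now
        expired = [d for d, (ts, _) in self.docs.items()
                   if now - ts > self.max_age]
        for doc_id in expired:
            _, text = self.docs.pop(doc_id)
            for term in text.lower().split():
                self.index[term].discard(doc_id)
        return expired  # would be re-indexed on disk in a real system

idx = RecentIndex(max_age_seconds=3600)
idx.add(1, "redis is fast", now=0)
idx.add(2, "cassandra on ec2", now=5000)
print(idx.search("redis"))      # doc 1 found while still in RAM
idx.evict_old(now=5000)         # doc 1 is now older than an hour
print(idx.search("redis"))      # gone from the RAM index
```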
> archiving
Dump a firehose queue to disk and send copies/updates out to various services that want it. Pushing it to HDFS would probably be the first choice I would look at, just because it is easy to go from there to various Hadoop analytics.
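A minimal sketch of the dump-to-disk step (hypothetical message shape; in practice the sink would be HDFS rather than a local file, as noted above):

```python
import json
import os
import tempfile

def archive(messages, path):
    # Append each firehose message as one JSON line -- the same shape
    # you would later push to HDFS for Hadoop analytics.
    with open(path, "a") as f:
        for msg in messages:
            f.write(json.dumps(msg) + "\n")

def replay(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

fd, path = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
archive([{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}], path)
print(replay(path))
```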
> deletions
Nothing is ever deleted; it just becomes invisible. This is just a flag added to the FTS indexes and archives.
> open APIs with quotas, abuse detection mechanisms, spam fighting algorithms, different authentication protocols, etc.
None of these is particularly difficult once you have the general framework set up, and the structure of the system actually makes it pretty easy to wire in things like realtime abuse and spam prevention once you get rolling. There is only one hard thing about building a better Twitter: getting the userbase to make it worthwhile. If Twitter were in a different market where the network effect was not so strongly self-reinforcing, it would have been cloned and re-implemented better back when the fail whale was our constant companion, but that is not the case, so making a "better" Twitter is of little value.