At Twitter we're working on using Cassandra to store tweets (nabble.com)
60 points by whalesalad on Dec 10, 2009 | 28 comments



I read: "We evaluated a lot of things: a custom mysql impl, voldemort, hbase, mongodb, memcachdb, hypertable, and others"

I wonder if Redis was among the others. If so (and somebody at Twitter is listening), I would love to know what you think are the biggest weaknesses in Redis, and whether in the short term it's more important to have a fault-tolerant cluster implementation (the redis-cluster project) or virtual memory to support datasets bigger than RAM in a single instance. Thanks in advance for any reply.


Redis requires that you have enough RAM to hold your entire dataset in memory... which seems impossible for something as large as Twitter.


As I read Antirez's comment, he was asking about precisely this. (Specifically, how high a priority removing this limitation should have, relative to other important stuff he's presumably working on.)


Even if you overestimate ridiculously, it seems perfectly reasonable to keep the entire dataset in RAM: 1 TRILLION tweets, at 500 bytes each, is a dataset of less than half a petabyte. A quick search shows I can get 4 GB of RAM for $66 USD. Assuming no redundancy, no bulk discount, and that all other hardware is free, that is a cost of $8M or so (about half of their last round of funding, IIRC).

Consider now that you don't need to keep non-recent tweets in RAM, that bulk buyers can get it significantly cheaper than individuals, and that the dataset is far smaller than that, and throwing hardware at it seems far less impossible. I'd imagine that they could keep the last month in RAM trivially.
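
A quick sketch of that arithmetic in Python (these are the deliberately pessimistic numbers from above, not real Twitter figures):

    # Back-of-the-envelope cost of keeping every tweet in RAM,
    # using the overestimates from the comment above.
    tweets = 10**12                     # 1 trillion tweets
    bytes_per_tweet = 500
    dataset_bytes = tweets * bytes_per_tweet        # 5e14 bytes, ~0.5 PB

    module_bytes = 4 * 10**9            # one 4 GB module
    module_price_usd = 66               # retail price, no bulk discount

    modules = (dataset_bytes + module_bytes - 1) // module_bytes
    print("modules needed:", modules)                              # 125,000
    print("RAM cost: $%.2fM" % (modules * module_price_usd / 1e6)) # $8.25M, i.e. "$8M or so"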


Does Redis really require the entire dataset to be in RAM, or does it just require enough virtual memory to hold the dataset? In other words, couldn't you just let it swap?


From their FAQ:

Do you plan to implement Virtual Memory in Redis? Why not just let the operating system handle it for you?

Yes, in order to support datasets bigger than RAM there is a plan to implement transparent Virtual Memory in Redis, that is, the ability to transfer large values associated with rarely used keys to disk, and reload them transparently into memory when these values are requested in some way.

So you may ask why we don't just let the operating system's VM do the work for us. There are two main reasons: in Redis even a large value stored at a given key, for instance a list of 1 million elements, is not allocated in a contiguous piece of memory. It's actually very fragmented, since Redis uses quite aggressive object sharing and reuse of allocated Redis Object structures.

So you can imagine the memory layout as composed of 4096-byte pages that actually contain different parts of different large values. Moreover, many values that are large enough for us to swap out to disk, like a 1024-byte value, are just one quarter the size of a memory page, and likely the same page also holds other values that are not rarely used. So this value will never be swapped out by the operating system. This is the first reason for implementing application-level virtual memory in Redis.

There is another reason, as important as the first. A complex object in memory, like a list or a set, is something like 10 times bigger than the same object serialized on disk. You have probably already noticed how Redis snapshots on disk are much smaller than Redis's memory usage for the same objects. This happens because data in memory is full of pointers, reference counters and other metadata. Add to this malloc fragmentation and the need to return word-aligned chunks of memory, and you have a clear picture of what happens. So this means roughly 10 times the I/O between memory and disk than would otherwise be needed.
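
As a rough illustration of the application-level approach the FAQ describes (swap rarely used values out in their compact serialized form, reload them transparently on access), here is a toy Python sketch. It shows the general idea only, not how Redis actually implements its VM:

    import os
    import pickle
    import tempfile
    import time

    class SwappingStore:
        """Toy key-value store that swaps rarely used values out to disk.

        Unlike OS paging, each value is written in its compact serialized
        form and swapped as a single unit, no matter how fragmented it
        was in memory.
        """

        def __init__(self, max_resident=1000):
            self.max_resident = max_resident
            self.in_memory = {}              # key -> live value
            self.on_disk = {}                # key -> path of swapped value
            self.last_access = {}            # key -> last access time
            self.swap_dir = tempfile.mkdtemp(prefix="toy-vm-")

        def set(self, key, value):
            self.in_memory[key] = value
            self.last_access[key] = time.time()
            self._maybe_swap_out()

        def get(self, key):
            if key not in self.in_memory:
                # Transparently reload a swapped-out value from disk.
                path = self.on_disk.pop(key)   # KeyError if the key was never stored
                with open(path, "rb") as f:
                    self.in_memory[key] = pickle.load(f)
                os.remove(path)
            self.last_access[key] = time.time()
            self._maybe_swap_out()
            return self.in_memory[key]

        def _maybe_swap_out(self):
            # Evict least recently used values once too many are resident.
            while len(self.in_memory) > self.max_resident:
                victim = min(self.in_memory, key=self.last_access.get)
                path = os.path.join(self.swap_dir, "%d.val" % abs(hash(victim)))
                with open(path, "wb") as f:
                    pickle.dump(self.in_memory.pop(victim), f)
                self.on_disk[victim] = path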


Maybe, but I did some math and the Twitter dataset should not be that big... The estimates I posted some time ago on Twitter were completely wrong. I hope to be able to post some more data later today.

I think a few big boxes are enough to hold the whole Twitter dataset in memory, and if you ask me, given the continuous scalability problems Twitter has experienced throughout its history, maybe this is something they should consider seriously (Redis apart).


Cassandra runs on a cluster, so "the dataset" may be more than 10 TB, i.e. more than the disk space of a single machine.


If Google has multiple copies of the entire web in RAM (as has been reported in many places), Twitter should definitely be able to hold 140 character tweets in RAM.


I would probably say the large memory footprint of Redis, especially on 64-bit vs 32-bit.


We'd love to be able to store arbitrarily large datasets in Redis -- that would let us use it at Posterous in ways beyond how we use it now, which is mainly for analytics and job queueing (via Resque).


Thanks for the comment, rantfoil. The more I talk to people about this issue, the more I think virtual memory is the first thing to do. I'll start working on VM this Xmas; I really hope to have it pretty solid in three months.


I'm not at Twitter, but what's the tuning parameter to specify that every x seconds data goes to disk?


Hello, it's documented in the comments of the default redis.conf; the parameter is called "save": http://github.com/antirez/redis/raw/master/redis.conf
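
For reference, the relevant lines in the default redis.conf look like this (the format is "save <seconds> <changes>"):

    save 900 1
    save 300 10
    save 60 10000

That is, by default Redis snapshots to disk after 900 seconds if at least 1 key changed, after 300 seconds if at least 10 keys changed, and after 60 seconds if at least 10000 keys changed.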


Digg was considering Cassandra too: http://blog.digg.com/?p=966


Digg is in the process of converting all of its data to Cassandra (mentioned in their NoSQL East presentation).


Evan Weaver blogged about Cassandra (http://blog.evanweaver.com/articles/2009/07/06/up-and-runnin...) awhile ago and used Twitter as a sample schema to show how it worked.


From that post: "Don't use EC2 or friends except for testing; the virtualized I/O is too slow". That kind of restricts Cassandra usage for the average Web 2.0 startup.

Added: just to clarify: when starting small, the last thing you want to do is either buy expensive infrastructure or hack together something cheap instead of doing it properly. If Cassandra won't perform well in a virtualized environment, you'll have to use some other means of storing data first and "go Cassandra" later if needed.


But everything resembling a DB is slow on EC2, since I/O is slow and memory access is slow, so I think the problem is EC2 and not Cassandra, MySQL, ...


Hence the hope that it won't be that much slower.


If you look at the OP's thread, Joe Stump talks about using Cassandra for SimpleGeo, which I believe is hosted entirely on EC2.


To clarify, Cassandra is still faster on a per-node basis than relational databases on EC2, but if you compare EC2 numbers with real hardware and wonder "why is it so slow?", that is why. :)

That said, there was a thread yesterday about Joe Stump's benchmarks showing that EC2 I/O is not as bad as most people think, so possibly they have made some improvements since. As j3fft mentions, Joe Stump is also in TFA talking about using Cassandra.


Glad to hear they decided against writing their own in Scala.


I love how Twitter is constantly trying to reinvent the wheel for a simple message passing service.


A highly-scalable, multicast, real-time, distributed message passing system with full-text search, archiving, deletions, dynamic subscription changes, access restrictions, open APIs with quotas, abuse detection mechanisms, spam fighting algorithms, different authentication protocols, etc.

Still think it's so simple?


Okay, I'll take the bait...

> highly-scalable

Everything I take from our set of OTS components will have live, verifiable examples of running at scale.

> multicast, real-time, distributed message passing system

For this the choice really comes down to RabbitMQ or ejabberd. An XMPP solution is appealing for the obvious benefits of having a "presence" concept, but an AMQP solution keeps us closer to the current reality of Twitter.

So we start with a large distributed rabbit setup. A few clusters scattered across the world, connected via shovel pipes and using some nested queue/exchange plumbing to wire it all together.
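
For illustration, a minimal single-node sketch of that kind of topology with the pika Python client might look like this; the exchange, queue, and routing-key names are made up, and the real setup would of course be sharded across clusters as described above:

    import pika

    # Connect to a local RabbitMQ broker (one node of one cluster).
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()

    # One topic exchange carries the tweet firehose; the routing key is the author.
    ch.exchange_declare(exchange="tweets", exchange_type="topic", durable=True)

    # Each follower gets a queue, bound once per author they follow.
    ch.queue_declare(queue="follower.alice", durable=True)
    for author in ("bob", "carol"):
        ch.queue_bind(queue="follower.alice", exchange="tweets",
                      routing_key="tweet.%s" % author)

    # Publishing one tweet fans it out to every queue bound to that author.
    ch.basic_publish(exchange="tweets", routing_key="tweet.bob",
                     body="just setting up my msg queue")

    conn.close()

Subscription changes then come down to adding or removing bindings, which is part of why the queue layer can handle them directly.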

> full-text search

Some queues dump to an FTS setup that keeps recent messages in RAM and migrates them to disk as they get older. Solr is probably a better solution here, but I know the Sphinx delta/full-index model better and would reach for that first.

> archiving

Dump a firehose queue to disk and send copies/updates out to various services that want it. Pushing it to HDFS would probably be the first choice I would look at, just because it is easy to go from there to various Hadoop analytics.
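
A rough sketch of that archiving consumer, again with pika and with made-up queue and file names: it drains a firehose queue into hourly files that a separate job could then push into HDFS (e.g. with "hadoop fs -put"):

    import json
    import time
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue="firehose.archive", durable=True)

    def archive(channel, method, properties, body):
        # Append each message to the current hour's file.
        path = time.strftime("tweets-%Y%m%d%H.jsonl")
        with open(path, "a") as f:
            f.write(json.dumps({"ts": time.time(), "msg": body.decode()}) + "\n")
        channel.basic_ack(delivery_tag=method.delivery_tag)

    ch.basic_consume(queue="firehose.archive", on_message_callback=archive)
    ch.start_consuming()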

> deletions

Nothing is ever deleted, it just becomes invisible. This is just a flag to add to the FTS indexes and archives.

> dynamic subscription changes, access restriction

Handled completely by the message queues.

> open APIs with quotas, abuse detection mechanisms, spam fighting algorithms, different authentication protocols, etc.

None of these is particularly difficult once you have the general framework set up, and the structure of the system actually makes it pretty easy to wire in things like real-time abuse and spam prevention once you get rolling. There is only one hard thing about building a better Twitter: getting the userbase to make it worthwhile. If Twitter were in a different market, where the network effect was not so strongly self-reinforcing, it would have been cloned and re-implemented better back when the fail whale was our constant companion. But that is not the case, so making a "better" Twitter is of little value.


highly-scalable - now: maybe; a year ago: no

multicast - are there still stupid limits on the max number of subscriptions?

real-time - sort of. Have you measured the latency? Depends on what you mean by real-time.

distributed message passing system - distributed: yes; message passing: depends what you mean. Is polling considered messaging?

archiving - what?

deletions - don't remove the items from search. I consider that not working.

dynamic subscription changes - yes.

access restrictions - sort of.

open API - yes

...


Surely a very large message passing service has different requirements than your typical system.



