At the time Facebook was building messages, Cassandra was still pretty immature/buggy (it's very solid in my experience nowadays).
Moreover, FB needed a lot of people who knew the tools. They acqui-hired a team of HBase experts. In 2010 I would have chosen HBase over Cassandra for what they were doing. In 2014, I'd choose C*. But as always, it depends on your application and what you know (or can hire for).
Cassandra has tunable consistency. A lot of people just think "eventually consistent", but it really lets you make the performance/consistency tradeoffs yourself. Plus, the way they do read repair, etc., even at CL=ONE, makes it a lot better for messaging systems than you'd guess. I'm using it successfully with most stuff at CL=ONE.
Because Facebook built a whole new Messages feature that was supposed to combine email and internal messaging. Because Cassandra is designed to be eventually consistent, it really wasn't suited for this use case.
Actually, it is designed to have tunable consistency. You can have immediate (full) consistency if you choose. This is a common misconception. In practice, CAP is about trade-offs, as in any design of scale.
Tuning consistency levels for your requests (R/W/N values) can only guarantee read-after-write consistency at best. It doesn't guarantee strong consistency like what HBase provides.
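A quick way to convince yourself of the read-after-write overlap (a toy sketch, not real Cassandra code): with R + W > N, every possible read quorum must share at least one replica with the last write quorum, so at least one node returns the newest value.

```python
# Toy model: enumerate all read/write quorums over a hypothetical
# 3-node replica set and check that they always intersect.
from itertools import combinations

N = 3            # replicas
W, R = 2, 2      # write and read consistency levels (QUORUM), so R + W > N

nodes = {"n1", "n2", "n3"}
overlap_always = all(
    set(w) & set(r)                     # do the quorums share a replica?
    for w in combinations(nodes, W)
    for r in combinations(nodes, R)
)
print(overlap_always)  # True: every read quorum overlaps every write quorum
```

With W = 1 and R = 1 the same check fails, which is why CL=ONE on both sides gives you only eventual consistency.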
Huh? Can you describe a scenario that you couldn't implement with tunable consistency levels?
Mostly the limitations people come up with for Cassandra are around cross-row ACID. If you really need it, you can get it with LWT, though in practice the better solution tends to be to denormalize into one row and/or NOT actually require cross-row ACID. It generally isn't what you really need anyway. Most businesses don't really shut down just because they can't achieve consistency.
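For what it's worth, the LWT semantics in question are compare-and-set rather than blind last-write-wins writes. A toy single-process model (the lock stands in for the Paxos round; real LWT is a distributed protocol across replicas):

```python
import threading

class LWTRegister:
    """Toy model of lightweight-transaction semantics: an atomic
    compare-and-set instead of a blind last-write-wins write."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()   # stands in for the Paxos round

    def cas(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True             # the "[applied]" case in CQL terms
            return False

reg = LWTRegister(0)
print(reg.cas(0, 1))  # True  - condition held, write applied
print(reg.cas(0, 2))  # False - the register has moved on, write rejected
```

The second CAS failing (instead of silently overwriting) is exactly what plain tunable-consistency writes can't give you.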
Simple use-case: you want to maintain a counter. In Cassandra, even if you tune the consistency levels (set R+W>N), there's no way to correctly implement a conditional update. To elaborate: you could successfully increment the counter on some nodes but fail on others, without reaching quorum (a partial write). The next time read repair runs, it will do LWW and propagate that partial update everywhere else. Meanwhile you think the increment failed, so you reissue it and end up incrementing twice. Alternatively, you could have two competing increments racing each other; one wins and the other loses due to LWW (and you've missed a count). This is trivial to implement correctly in HBase, since it offers strong consistency.
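The lost-increment race can be sketched in a few lines (a toy single-replica model of last-write-wins reconciliation, not actual Cassandra behavior): both clients read the same snapshot, both writes "succeed", and one increment silently disappears.

```python
# Toy model: a counter stored as a plain cell, reconciled by
# last-write-wins (highest timestamp wins).
replica = {"value": 0, "ts": 0}

def lww_write(cell, value, ts):
    # LWW reconciliation: the write with the highest timestamp wins.
    if ts > cell["ts"]:
        cell["value"], cell["ts"] = value, ts

# Both clients read the same snapshot ...
a = replica["value"]   # client A sees 0
b = replica["value"]   # client B sees 0

# ... then race their read-modify-write increments.
lww_write(replica, a + 1, ts=101)   # A writes 1
lww_write(replica, b + 1, ts=102)   # B writes 1, silently overwriting A

print(replica["value"])  # 1, not 2: one increment was lost
```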
In practice, Cassandra gets around this issue by implementing a custom CRDT for counters. But CRDTs don't exist for everything and in general it's not trivial to get things right. I'm not arguing Cassandra is bad, but it's a misconception that you can tune the consistency levels and somehow get Cassandra to behave like a strongly consistent store.
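For the curious, the core idea behind such a counter CRDT is a grow-only counter (sketched here from the textbook G-counter design, not Cassandra's actual implementation): each replica increments only its own slot, and merges take per-slot maxima, so concurrent increments commute instead of clobbering each other via LWW.

```python
class GCounter:
    """Grow-only counter CRDT: per-replica slots, merge by max."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.slots = {}                 # replica_id -> local count

    def increment(self, n=1):
        # A replica only ever bumps its own slot.
        self.slots[self.replica_id] = self.slots.get(self.replica_id, 0) + n

    def merge(self, other):
        # Taking the per-slot max is commutative, associative, idempotent.
        for rid, count in other.slots.items():
            self.slots[rid] = max(self.slots.get(rid, 0), count)

    def value(self):
        return sum(self.slots.values())

a, b = GCounter("a"), GCounter("b")
a.increment(); b.increment(); b.increment()
a.merge(b); b.merge(a)                  # anti-entropy / read repair
print(a.value(), b.value())  # 3 3 - both replicas converge, nothing lost
```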
So, I'm an application developer, for the most part. I write stuff and I don't know a whole lot about databases other than how to create some tables/collections, do some basic indexing, and read the slow query log. Even running the explain commands, I have to spend half a day reading the docs and then I've forgotten everything after a few weeks.
I want to know: Can someone tell me how/where to focus my efforts to learn more? In my day job, I use mostly MongoDB these days, but also some MySQL. I have a few resources that I'm aware of:
I don't know when to use a relational database and when to use NoSQL. I don't know the differences between the various NoSQL databases. I don't know the ins-and-outs of any of the systems and I don't know how to determine what path to go for high availability, geo-distribution, replication, etc.
I find it all very overwhelming and would like to learn more. I'd like to be able to pull plugs in things and see my systems stay up (which, I'm sure would require intimate knowledge of at least one system). Does anybody have any suggestions? How valuable do you think this stuff is?
The book "Seven Databases in Seven Weeks" may be a good start. It may not answer all of your questions, but it does a good job summarizing various types of databases. Honestly, you can read through it in a day or two - I don't think there's much need to install the DBs and do the exercises unless you're looking to explore a certain DB in more detail.
I originally skipped over the first chapter on Postgres but went back and read it and learned a few things (too bad there's no mention of the recently added json and jsonb data types).
Hey! So I'm not an expert, but I'm learning more every day.
It sounds like you're conflating three (or more) problems -- knowing if your data is relational in its nature, understanding how to efficiently retrieve the data you're looking for and understanding system administration/architecture tasks for a particular system.
The reason why I bring this up: my first job out of college was as a (very) junior Oracle DBA. At that particular job, that meant knowing some stuff about data retrieval but knowing much more about performance tuning and the architectural idiosyncrasies of Oracle and lots of Linux stuff.
Later I moved into startupland and necessarily learned quite a bit more about the data storage and retrieval patterns popular in web applications.
My point is: you might want to pick a particular aspect of databases that you want to learn about and really focus on that -- just like you would for a new programming language or framework. The way I usually do it is by finding a project that seems just a little too hard to be easily in reach and learning everything I can to make it happen.
i don't have much experience with NoSQL, but the typical use is document storage, where structures can have varying levels of depth and differing schemas or schema versions. personally, i think a lot of the NoSQL benefits are hype for most use cases. relational databases will get you very far when used correctly.
a good method of learning is to use some really large dataset and come up with some questions you want answered about the set, then write sql to get that information out. if your queries take a long time, begin indexing and looking at how the queries can be rewritten to return results faster.
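one way to set that loop up without even running a server (a minimal sketch using SQLite's built-in EXPLAIN QUERY PLAN; the table and data are made up): query first, then add an index and watch the plan change from a full table scan to an index search.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE zips (zip TEXT, city TEXT, state TEXT)")
db.executemany("INSERT INTO zips VALUES (?, ?, ?)",
               [(f"{i:05d}", f"city{i}", "CA") for i in range(10_000)])

query = "SELECT * FROM zips WHERE zip = '04242'"

# Before indexing: the detail column of the plan shows a full scan.
plan1 = db.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan1[0][3])   # e.g. "SCAN zips"

# After indexing: the same query becomes an index search.
db.execute("CREATE INDEX idx_zip ON zips (zip)")
plan2 = db.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan2[0][3])   # e.g. "SEARCH zips USING INDEX idx_zip (zip=?)"
```

the same habit (time the query, read the plan, add or rewrite) carries over directly to MySQL's `EXPLAIN` and Mongo's `explain()`.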
i cannot recommend a good, large interconnected/relational data source, but something like zipcode databases [1] can get you started. you can also play with geo-spatial indexing to find zipcodes within x miles of each other. [2]
It's less focused on the technology and more on the business and the technical trends, but I didn't find it poorly written. It seems informative about what it intends to be informative about.
Not sure why this was downvoted. This was the story out of Basho last I heard. (Well, lots of hints like "Some fruit company you may have heard of uses it like this").
I was involved in a migration from MongoDB to Cassandra at my last job.
We switched to Cassandra because of the control it offers over how your data is laid out on disk, and its ability to not totally fall apart under load (in fact it performs really well). It's also a lot easier to deal with from an ops perspective.
The main selling points for Cassandra over Mongo are that it scales out to more machines more easily for handling larger workloads, and its write performance is better.
Yes. I have switched back and forth between them before. The main advantage of Cassandra for me is that it is multi-master. This makes it perfect for the cloud, where you can trivially scale horizontally within the same AZ, across multiple AZs, or even across entire DCs with just a config change. It really is a dream to deploy and manage.
MongoDB is a document store, so if your data is structured that way, it is unrivalled. But if it isn't (more than likely), then it will suffer greatly as you start to increase the number of joins.