The Abandoned Facebook Tech That Now Helps Power Apple (wired.com)
106 points by ycombi42 on Aug 4, 2014 | 33 comments



Not a word about Apache. That's a pretty bad article (with a click-bait title).


You can't blame publishers for going with click-bait titles, particularly if the article actually delivers what the title promises.

As for not mentioning Apache.... That is bad. I'm not sure what they could put in that article that would speak to their audience though.


For the ignorant and lazy (I, if nobody else), could someone explain why Facebook ditched Cassandra?


At the time Facebook was building messages, Cassandra was still pretty immature/buggy (it's very solid in my experience nowadays).

Moreover, FB needed a lot of people who knew the tools. They acqui-hired a team of HBase experts. In 2010 I would have chosen HBase over Cassandra for what they were doing. In 2014, I'd choose C*. But as always, it depends on your application and what you know (or can hire for).

Cassandra has tunable consistency. A lot of people just think "eventually consistent" but it really lets you make tradeoffs yourself about performance, consistency, etc. Plus, the way they do read repair, etc., even at CL==ONE makes it a lot better for messaging systems than you'd guess. I'm using it successfully with most stuff at ONE.
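To make the tradeoff concrete, here is a minimal sketch of the quorum arithmetic behind tunable consistency. It is a toy model, not Cassandra code: it assumes N replicas, a write acknowledged by W of them, and a read that contacts R of them, and it brute-forces whether every possible read set shares at least one replica with every possible write set. The function name `quorums_overlap` is made up for illustration.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """Return True when every read set of size r shares at least one
    replica with every write set of size w, out of n replicas."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# With N=3 replicas:
print(quorums_overlap(3, 1, 1))  # False: a read at ONE can miss a write at ONE
print(quorums_overlap(3, 2, 2))  # True: QUORUM reads always meet QUORUM writes
print(quorums_overlap(3, 1, 3))  # True: write to ALL, read from ONE
```

The brute-force result matches the usual rule of thumb, R + W > N, which is the knob you turn when trading latency for consistency.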


C*?


C* is short for Cassandra



Because Facebook built a whole new Messages feature that was supposed to be email and internal messaging. Because Cassandra is designed to be eventually consistent it really wasn't suited for this use case.


Actually, it is designed to have tunable consistency. You can have immediate (full) consistency if you choose. This is a common misconception. In practice, CAP is about trade-offs, as in any design of scale.


Tuning consistency levels for your requests (R/W/N values) can only guarantee read-after-write consistency at best. It doesn't guarantee strong consistency like what HBase provides.


Huh? Can you describe a scenario that you couldn't implement with tunable consistency levels?

Mostly the limitations people come up with for Cassandra are around cross-row ACID, which if you really need it you can do with LWT (lightweight transactions), though in practice the better solution tends to be to just denormalize into one row and/or NOT actually require cross-row ACID. It generally isn't what you really need anyway. Most businesses don't really shut down just because they can't achieve consistency.


Simple use-case: you want to maintain a counter. In Cassandra even if you tune the consistency levels (set R+W>N), there's no way to correctly implement a conditional update. To elaborate further, you could successfully increment the counter on some nodes but fail on others and not reach quorum (a partial write). The next time you do read-repair, it's going to do LWW (last-write-wins) and propagate that update everywhere else. Meanwhile you think it has failed, and reissue the increment, so you've incremented twice. Alternatively you could have 2 competing increments racing each other, where one of them wins and the other loses due to LWW (and you've missed 1 count). This is trivial to implement correctly in HBase since it offers strong consistency.
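Both failure modes can be reproduced in a few lines. The following is a toy simulation, not Cassandra code: `LWWRegister` is a hypothetical stand-in for one replica's cell that keeps whichever write carries the highest timestamp, and the timestamps are hand-picked integers.

```python
from dataclasses import dataclass


@dataclass
class LWWRegister:
    """Toy replica cell: keeps the write with the highest timestamp."""
    value: int = 0
    timestamp: int = 0

    def write(self, value, timestamp):
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp


def racing_increments(start):
    """Two clients read the same value, then both write value + 1."""
    reg = LWWRegister(start, 1)
    seen_by_a = reg.value
    seen_by_b = reg.value  # B reads before A's write lands
    reg.write(seen_by_a + 1, timestamp=2)
    reg.write(seen_by_b + 1, timestamp=3)  # overwrites A's write with the same value
    return reg.value


def retry_after_partial_write(start):
    """An increment reaches one replica, is reported as failed, then retried."""
    replicas = [LWWRegister(start, 1) for _ in range(3)]
    # The write lands on one replica only, so the client sees a failed quorum.
    replicas[0].write(start + 1, timestamp=2)
    # Read repair later copies the newest value to every replica.
    newest = max(replicas, key=lambda rep: rep.timestamp)
    for rep in replicas:
        rep.write(newest.value, newest.timestamp)
    # The client, believing the write failed, re-reads and increments again.
    current = replicas[1].value
    for rep in replicas:
        rep.write(current + 1, timestamp=3)
    return replicas[0].value


print(racing_increments(5))          # 6: two increments, only one counted
print(retry_after_partial_write(5))  # 7: one intended increment, counted twice
```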

In practice, Cassandra gets around this issue by implementing a custom CRDT for counters. But CRDTs don't exist for everything and in general it's not trivial to get things right. I'm not arguing Cassandra is bad, but it's a misconception that you can tune the consistency levels and somehow get Cassandra to behave like a strongly consistent store.
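For readers who haven't seen a counter CRDT, here is a minimal sketch of the textbook grow-only counter (G-Counter). This is the generic construction, not Cassandra's actual counter implementation, and the class and node names are made up for illustration. Each node only ever bumps its own slot, and merging takes the per-node maximum, so replaying or reordering merges can't double-count or lose an increment.

```python
class GCounter:
    """Grow-only counter CRDT: one monotonically increasing slot per node."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        # A node only ever writes to its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Per-node maximum: commutative, associative, and idempotent.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment()  # concurrent increments on two different nodes
b.increment()
a.merge(b)
b.merge(a)
a.merge(b)  # merging the same state again changes nothing
print(a.value(), b.value())  # 2 2
```

Note what this does not give you: there is no decrement and no conditional update, which is the point about CRDTs not existing for everything.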


So, I'm an application developer, for the most part. I write stuff and I don't know a whole lot about databases other than how to create some tables/collections, do some basic indexing, and read the slow query log. Even running the explain commands, I have to spend half a day reading the docs and then I've forgotten everything after a few weeks.

I want to know: Can someone tell me how/where to focus my efforts to learn more? In my day job, I use mostly MongoDB these days, but also some MySQL. I have a few resources that I'm aware of:

- http://aphyr.com/tags/Jepsen - Have browsed this and found it interesting.

- MySQL reference manual http://dev.mysql.com/doc/#manual

- MongoDB Manual docs.mongodb.org/manual/

- High Performance MySQL - I purchased this book but never got past the first chapter. Not to say that I can't...

- use-the-index-luke.com/ - Rarely read this, but have given it a half-hearted start from time to time.

- http://www.mysqlperformanceblog.com/ - ditto

I don't know when to use a relational database and when to use NoSQL. I don't know the differences between the various NoSQL databases. I don't know the ins-and-outs of any of the systems and I don't know how to determine what path to go for high availability, geo-distribution, replication, etc.

I find it all very overwhelming and would like to learn more. I'd like to be able to pull plugs in things and see my systems stay up (which, I'm sure would require intimate knowledge of at least one system). Does anybody have any suggestions? How valuable do you think this stuff is?


The book "Seven Databases in Seven Weeks" may be a good start. It may not answer all of your questions, but it does a good job summarizing various types of databases. Honestly you can read through it in a day or two - I don't think there's much need to install the DBs and do the exercises unless you're looking to explore a certain DB in more detail.

I originally skipped over the first chapter on Postgres but went back and read it and learned a few things (too bad there's no mention of the recently added json and jsonb data types).


Hey! So I'm not an expert, but I'm learning more every day.

It sounds like you're conflating three (or more) problems -- knowing if your data is relational in its nature, understanding how to efficiently retrieve the data you're looking for and understanding system administration/architecture tasks for a particular system.

The reason why I bring this up: my first job out of college was as a (very) junior Oracle DBA. At that particular job, that meant knowing some stuff about data retrieval but knowing much more about performance tuning and the architectural idiosyncrasies of Oracle and lots of Linux stuff.

Later I moved into startupland and necessarily learned quite a bit more about the data storage and retrieval patterns popular in web applications.

My point is: you might want to pick a particular aspect of databases that you want to learn about and really focus on that -- just like you would for a new programming language or framework. The way I usually do it is by finding a project that seems just a little too hard to be easily in reach and learning everything I can to make it happen.

Anyway, good luck and happy learning :-).


i dont have much experience with NoSQL, but the typical use is for document storage where structures can have varying levels of depth and differing schemas or different schema versions. personally, i think a lot of the NoSQL benefits are hype for most use-cases. relational databases will get you very far when used correctly.

a good method of learning is to use some really large dataset and come up with some questions you want answered about the set, then write sql to get that information out. if your queries take a long time, begin indexing and looking at how the queries can be rewritten to return results faster.
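a rough sketch of that loop, using python's built-in sqlite3 module instead of mysql so it runs without a server. the table, column and index names are made up, and the rows are synthetic. it prints the query plan before and after adding an index; the exact wording of the plan depends on the sqlite version, so no expected output is shown.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as one string (detail is the last column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (zip TEXT, city TEXT, state TEXT)")
# Synthetic rows, only there to give the planner a table to work with.
conn.executemany(
    "INSERT INTO places VALUES (?, ?, ?)",
    [(f"{i:05d}", f"city{i}", f"S{i % 50}") for i in range(10_000)],
)

sql = "SELECT city FROM places WHERE state = ?"

plan_before = query_plan(conn, sql, ("S7",))  # no index yet: full table scan
conn.execute("CREATE INDEX idx_places_state ON places (state)")
plan_after = query_plan(conn, sql, ("S7",))   # should now search via the index

print(plan_before)
print(plan_after)
```

in mysql the equivalent is putting EXPLAIN in front of the query and checking whether it reports a full scan or an index lookup.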

i cannot recommend a good, large interconnected/relational data source, but something like zipcode databases [1] can get you started. you can also play with geo-spatial indexing to find zipcodes within x miles of each other. [2]

[1] http://download.geonames.org/export/zip/

[2] http://www.mysqlperformanceblog.com/2013/10/21/using-the-new...
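for the "within x miles" question, here is a minimal brute-force sketch using the haversine great-circle formula. it is not the geo-spatial indexing approach from [2]; it just checks every row, which is fine for learning but slow on big tables. the place labels and coordinates below are made up, and the earth radius is the usual mean value of roughly 3958.8 miles.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # approximate mean radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def places_within(origin, radius_miles, places):
    """Labels of all places within radius_miles of origin, excluding origin."""
    lat0, lon0 = places[origin]
    return sorted(
        label
        for label, (lat, lon) in places.items()
        if label != origin
        and haversine_miles(lat0, lon0, lat, lon) <= radius_miles
    )


# Hypothetical points: two close together, one far away.
places = {
    "zip_a": (40.75, -73.99),
    "zip_b": (40.74, -74.03),
    "zip_c": (34.09, -118.41),
}
print(places_within("zip_a", 5, places))  # ['zip_b']
```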


This was a pretty poorly written article. Lots of name dropping with very little homework done.


It's less focused on the technology and more on the business and the technical trends, but I didn't find it poorly written. Seems informative about what it intends to be informative about.


Does anyone know what Apple uses Cassandra for?


IIRC everything in iCloud is stored in Cassandra, and it is one of the bigger clusters around.


iPhone notifications, as far as I know.


Apple uses Riak for this.


Not sure why this was downvoted. This was the story out of Basho last I heard. (Well, lots of hints like "Some fruit company you may have heard of uses it like this").

Perhaps this is now out of date?


Yup. Out of date.


Anyone using Cassandra as a primary store, as opposed to something for optimized lookup or similar?


We use it pretty extensively at Spotify. Google will turn up a handful of public talks we've given about it.


Lived and died by it for over a year. Contemplating shifting over to it again.


Netflix uses Cassandra as their primary store.


What are Cassandra's advantages over MongoDB? Has anyone switched between the two?


I was involved in a migration from MongoDB to Cassandra at my last job.

We switched to Cassandra because of the control it offers over how your data is laid out on disk, and its ability to not totally fall apart under load (in fact performing really well). It's also a lot easier to deal with from an ops perspective.

I did a post about the migration here: http://blakeeggleston.com/migrating-databases-with-zero-down...


Here are two blog posts I was able to find from companies who switched Mongo -> Cassandra:

http://www.fullcontact.com/blog/mongo-to-cassandra-migration...

http://relistan.com/cassandra-vs-mongo/

The main selling points for Cassandra over Mongo are that it scales to more machines more easily for handling larger workloads, and the write performance is better.


Yes. I have switched back and forth between them before. The main advantage of Cassandra for me is that it is multi master. This makes it perfect for the cloud where you can trivially scale horizontally whether in the same AZ, across multiple or even across entire DCs with just a config change. It really is a dream to deploy and manage.

MongoDB is a document store and so if your data is structured that way it is unrivalled. But if it isn't (more than likely) then it will suffer greatly as you start to increase the number of joins.


> Has anyone switched between the two?

Just anyone who has run into scaling problems with MongoDB. ;-)



