The described approach just feels wrong to me. I understand that the problems Netflix faces are challenging, but usually when I run into messy systems like that, where you do one thing to address one problem, and then have to do four or five other things to fix the problems that one workaround caused, there's a fundamental misengineering in the mix somewhere. Systems should be simple and consistent as much as possible -- doing something like this across the entire service is crazy.
I understand if they do it a little bit to test things, or whatever, but I think if they sat back and were willing to carefully consider this problem, they could come up with a much simpler workflow. The ability to do that is a key part of quality engineering. A big cascading infrastructure of workarounds that eventually only kind of gets things in a correct state is usually a sign of a problem.
> Systems should be simple and consistent as much as possible
CAP theorem says otherwise.
Distributed systems are complicated, almost like reality. You can feel the constraints of physics and math everywhere. You simply cannot get the best of all worlds (e.g., consistency, availability, and partition tolerance). If you have the money and need the scale, which often go hand in hand, you're simply better off avoiding hard consistency models, because the other two are more important to your business.
I love their approach. It's exactly what I think people should do to scale up their system. Avoid two-phase commits and other elegant availability minefields and go for pragmatism. Note that this is only relevant for a few dozen companies.
It is not 100% clear from the examples cited that scaling should necessitate the consistency issues they have. Surely it is quite straightforward to scale private user data (i.e., 'sharding' by user), such as position in play and the queue.
For the data which is cross-user, such as that which drives their recommendation system (are there other parts? I don't know, I'm not a user), consistency would be compromised -- but given its soft/heuristic nature, presumably also more or less irrelevant. Here eventual consistency would be good enough (and ideally should not require extra processes)?
Like the halting problem, the implications of the CAP theorem are only unavoidable in pathological cases. Until this is identified as a pathological case, I believe it is premature to invoke such theorems.
Scaling does not imply availability. Sharding user data certainly is a good strategy to achieve scalability (i.e., support a large number of users, distributed across a fleet of commodity hardware), but it doesn't a priori make that data highly available.
Let's imagine my personal netflix data (queue, ratings, etc.) lives on a lightly loaded (since things are well sharded) DB instance. What happens if the hardware hosting that DB fails? Ok, let's replicate to a slave, and put into place automated failover mechanisms to promote the slave if the master goes down (since with many shards it's really hard to effect a failover manually in a timely fashion). Should the master and slave be in the same datacenter? That, after all, is the failure scenario Netflix originally wanted to address -- what happens if your single datacenter goes down (which eventually will happen, no matter how much redundancy you think it may have)?
Fine then, put the master and slave in different DCs (with fat and fast pipes, so 2PC or synchronous replication can keep things consistent without impacting latency too much). Now where does the auto-promotion logic live -- the DC with the slave or the master? What happens if my client can reach both DCs, but there are connectivity issues on the links connecting the DCs? Network hardware can fail like any other kind, backhoes accidentally sever buried optics, and routing protocols do take time to converge.
P is now rearing its ugly and inevitable head, and you're faced with the C/A decision -- can I see and alter my queue (leading to inconsistency) or is it unavailable? Is this scenario pathological?
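To make that concrete, here's a toy sketch (purely my own illustration, nothing to do with Netflix's actual setup) of what a naive "promote yourself if you can't see the master" rule does during an inter-DC partition:

```python
class Replica:
    def __init__(self, name, is_master=False):
        self.name = name
        self.is_master = is_master
        self.queue = []               # this user's movie queue, as this replica sees it

    def can_see_master(self, partitioned):
        # During an inter-DC partition, the slave's heartbeats to the master fail.
        return not partitioned

    def maybe_promote(self, partitioned):
        # Naive rule: "if I can't reach the master, promote myself."
        if not self.is_master and not self.can_see_master(partitioned):
            self.is_master = True

    def write(self, item):
        # Only a (self-declared) master accepts writes.
        if self.is_master:
            self.queue.append(item)

master = Replica("dc-east", is_master=True)
slave = Replica("dc-west")

# The link between the DCs fails, but clients can still reach both sides.
slave.maybe_promote(partitioned=True)

master.write("The Godfather")   # client routed to dc-east
slave.write("Airplane!")        # client routed to dc-west

print(master.queue, slave.queue)   # ['The Godfather'] ['Airplane!'] -- split brain
```

Both sides happily take writes, the two copies of the queue diverge, and someone has to reconcile them after the fact -- the availability-over-consistency choice gets made whether you meant to make it or not.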
Sorry, but this argument is the worst kind of wrong -- it is half true. 100.00000% consistency/availability is of course impossible under any circumstances.
'It is only relevant for a few dozen companies'
... because of size/volume. Your argument would make it relevant to all companies, unless Netflix and other large companies have to live up to a higher standard -- which I do not think is being argued.
The CAP theorem really is about bottlenecks: high volumes of possibly conflicting data arriving at different nodes cannot be synchronised with guarantees.
It is possible to have distributed transactions on lightly loaded servers. These could complete in a timely fashion (<10 secs) 99.9% of the time, with, theoretically, a geometric drop-off in the probability of longer delays.
The answer to the disconnected data centre is to have 3 with majority rule. Then the loss of a single link can no longer break consistency.
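Roughly what I mean by majority rule, as a toy sketch (my own illustration, not anyone's production system): a write only counts as committed once a majority of the three DCs acknowledge it, so a single severed link or dead DC cannot yield two diverging "committed" histories -- at worst the isolated minority refuses writes.

```python
REPLICAS = ("dc-1", "dc-2", "dc-3")
QUORUM = len(REPLICAS) // 2 + 1          # 2 of 3

def write(value, reachable):
    """Commit only if a majority of DCs acknowledge; 'reachable' is the
    set of DCs this coordinator can currently talk to."""
    acks = [dc for dc in REPLICAS if dc in reachable]
    if len(acks) >= QUORUM:
        return "committed '%s' on %s" % (value, acks)
    # Fewer than a majority reachable: refuse the write (consistency over availability).
    return "rejected '%s': only %d of %d DCs reachable" % (value, len(acks), len(REPLICAS))

print(write("add movie", {"dc-1", "dc-2", "dc-3"}))   # all up: commits
print(write("add movie", {"dc-1", "dc-2"}))           # one DC/link down: still commits
print(write("add movie", {"dc-3"}))                   # isolated minority: rejected
```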
What is more, the article talks about the data being internally inconsistent (dangling references, etc.); this seems to be totally unnecessary for something like private user data. For the rest (recommendations), as discussed, other strategies can be employed that at least guarantee internal consistency of a node.
I've long felt that NoSQL does not jibe with a certain type of mindset many computer professionals have.
There is this sterile, perfect-world mindset that many computer professionals idealize.
Transactions and traditional RDBMSes make this sort of person really happy. They love to see everything always working out, and that when it doesn't, everything is taken care of for them, etc. I do not know if you're this type of person, but your comment sounds like the sentiment that type often expresses when they see NoSQL in real-world apps.
This is not a workaround for a messy system. This is adding more consistency to a system which is designed to be fast and scalable more than consistent.
Fast and scalable is more important for something like Netflix. It's going to bother someone a hell of a lot more if they get fits and starts with their streaming, or if it takes forever to load a page, than if something occasionally gets misordered in a queue.
If they went with a system that demanded consistency over speed/scalability, then what they MIGHT get is a backend database that never went out of sync, but the UI of the website would show that queue items had moved while the write conflicts caused the transaction behind that UI action to roll back, and thus the user would still experience inconsistency. Or the website could wait for confirmation at every step and offer a horrendously slow user experience during peak times.
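For contrast, the eventually consistent route might resolve concurrent queue edits with something as dumb as last-writer-wins -- a toy sketch of my own, not anything the article says Netflix actually does:

```python
def merge_queue(replica_a, replica_b):
    """Each replica maps movie -> (position, timestamp_of_last_update).
    Keep the most recently written position for each movie."""
    merged = dict(replica_a)
    for movie, (pos, ts) in replica_b.items():
        if movie not in merged or ts > merged[movie][1]:
            merged[movie] = (pos, ts)
    return merged

# Two devices reorder the same queue while out of sync with each other:
a = {"Memento": (1, 100), "Alien": (2, 100)}
b = {"Memento": (2, 105), "Alien": (1, 105)}   # a later edit swaps them

print(merge_queue(a, b))
# {'Memento': (2, 105), 'Alien': (1, 105)} -- the later edit wins; no write
# ever blocked, at the cost of occasionally surprising the slower device.
```

No write ever waits on another datacenter, and the price is exactly the occasional misordered queue item described above.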
I think this sort of NoSQL data garden, with automated gardeners running through it all the time, will be a fairly typical path from now on for many types of systems. I think financial apps may even adopt it eventually!
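At toy scale, a "gardener" is just a background sweep that repairs what transactions would otherwise have prevented -- e.g. pruning queue entries whose movie reference no longer resolves. A sketch of my own, not the article's actual checker:

```python
def garden(user_queues, movie_catalog):
    """Remove queue entries whose movie ID is no longer in the catalog;
    return how many dangling references were repaired."""
    repairs = 0
    for user, queue in user_queues.items():
        cleaned = [movie_id for movie_id in queue if movie_id in movie_catalog]
        repairs += len(queue) - len(cleaned)
        user_queues[user] = cleaned
    return repairs

catalog = {"m1", "m2", "m3"}
queues = {"alice": ["m1", "m9"], "bob": ["m2", "m3", "m7"]}

print(garden(queues, catalog), queues)
# 2 {'alice': ['m1'], 'bob': ['m2', 'm3']} -- two dangling references cleaned up
```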
I've actually used NoSQL multiple times in the past. I don't have a problem with that. Maybe it was the way the article was written, but it just sounded messy and disorganized, and they really should be able to do better on the end result (not jump around so much on queue/position in movie).