Rebuilding Netflix's video processing pipeline with microservices (netflixtechblog.com)
197 points by samaysharma on Feb 9, 2024 | 195 comments


Amazing to read about the continued re-invention of this part of Netflix's internal systems.

I worked with this team (and its predecessors) during my time at Netflix. They achieved several "holy grails" of video encoding: a perceptual quality metric (VMAF), optimal bitrate selection per 2 second video chunk, and then optimal video chunking to be scene based rather than a fixed 2 seconds. Doing any of that in a research lab would be a challenge, but pulling it off at Netflix's scale is epic.

You might need some background on how adaptive video streaming works to fully grok this article.

But this is also just a story about a massive refactoring of a large, critical system. How many companies have you worked for that aggressively pursued refactoring/re-engineering their central systems? At most other places, I've seen risk aversion, fear, and mismanagement conspire to kill innovation. Not so at Netflix.
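
For anyone who wants to poke at VMAF themselves: it's open source, and ffmpeg builds with libvmaf expose it as a filter. A minimal sketch, assuming an ffmpeg build compiled with --enable-libvmaf and hypothetical file names:

    import subprocess

    # Score an encoded clip against its pristine reference. Both inputs
    # must match in resolution and frame rate, so scale the distorted
    # clip to the reference's dimensions first.
    subprocess.run([
        "ffmpeg", "-i", "distorted.mp4", "-i", "reference.mp4",
        "-lavfi",
        "[0:v]scale=1920:1080:flags=bicubic[d];"
        "[d][1:v]libvmaf=log_fmt=json:log_path=vmaf.json",
        "-f", "null", "-",
    ], check=True)
    # vmaf.json now holds per-frame scores plus a pooled mean; scores in
    # the low-to-mid 90s are commonly treated as visually transparent.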


Pornhub serves more diverse data across the world with a much simpler pipeline. They are the ones who should be emulated. Netflix needs a bit of housecleaning to get rid of this faction in the company.


Adult website stacks are underrated. They manage to be scrappy and yet deliver a lot.

I've seen streaming services serving half a million HD video users a day running on 7 servers, 2 of them being cheap VPSes.


Yeah, funny what a requirement for sustainable business models + no glut of resume-driven development will do. Almost like the good old days.


Also an open mind in hiring. Nobody will reject you because you look weird or are awkward.


You are so right. I have no idea who is running the other streaming services, but it's pretty obvious they have absolutely no idea what they're doing when they do bullshit like pushing blocky H.264 streams to iPhone 15 Pros that support AV1, and other crap that costs a shit ton of money for a worse outcome.

Paramount looks like YouTube yet uses twice as much bandwidth; that is almost impressively bad.


Where can one read about Pornhub's infra? Does MindGeek have an engineering blog?


Honestly, many business could learn how to beat-off the competition by emulating the adult industry.

They're often first to come to market, aren't intimidated to try new things, resist being bound by incumbents or gagged by censors and are generally wide open to grasping the full thro ... that's enough now.


I see what you did there. Like you had that one ready to pop.


Always locked and loaded me


Sex.


My $0.02 on the matter: I can commend them for building a reliable system, but the deliverables are not quite comparable IMHO.

Not all "HD" is made equal; bitrate is a major component of perceived video quality. Some experienced this first hand when YouTube Premium started pushing a higher-bitrate version of 1080p (https://news.ycombinator.com/item?id=34918698).

(Disney+) Hotstar is an Indian streaming service which recently handled more than 50M concurrent viewers (https://news.ycombinator.com/item?id=38344265) for a live sports event. But if you actually watch the stream, it is of very poor quality.

Long story short: the hub can push much lower bitrate/quality than Netflix needs to in order to satisfy its paying customers, especially when Netflix differentiates based on how much one pays.

But I wonder how much this helps in non-CPU scenarios, when we need to do transcoding for upcoming codecs that might benefit from hardware accelerators.


As a customer I can't really say that I can tell any difference between all the streaming services, so presumably this type of optimization helps Netflix's bottom line somehow?


That's interesting. I've noticed that Amazon always takes longer to load and spends more time at a poor bitrate, while Netflix just works.


Netflix beats others when it comes to low bitrate high quality streams. At some remote areas with poor cellular coverage, Netflix is the only service that provides a working video stream.


So it only solves a tail problem. I would assume the percentage of subscribers with poor internet is low.


Why would you assume that?


It also helps Netflix. Lower bitrate means less server load and lower internet bills.


I'm curious about grain encoding - did you work on that at all? My friend heard they were doing a grain extraction layer that was re-added client side. Feel free to contact devin@techcrunch if you have any interesting insight.


Yes, that's part of the AV1 spec, actually. See https://norkin.org/pdf/DCC_2018_AV1_film_grain.pdf

Andrey (who works for Netflix) drove the effort. Chatted with him about it.


I used to do a poor man's version of that with low-bitrate MPEG-4 (ASF or DivX) movie downloads back in the day; the resolution would be low, and the MS MP4 codec tended to blur as well (plus noise killed the quality so much that there was often a denoiser in the encode pipeline), so adding any noise at playback significantly improved the look of the movie.
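
The modern version of the same trick, as a toy numpy sketch. The real AV1 mechanism fits an autoregressive grain model and signals only its parameters in the bitstream, but the shape is the same: denoise, encode the clean frames cheaply, re-synthesize grain on the client.

    import numpy as np

    rng = np.random.default_rng(0)
    frame = rng.uniform(0, 255, (1080, 1920)).astype(np.float32)

    # Crude denoiser (horizontal box blur); real pipelines use far better.
    k = np.ones(5) / 5
    denoised = np.apply_along_axis(lambda r: np.convolve(r, k, "same"), 1, frame)

    grain = frame - denoised        # the residual noise layer
    grain_std = grain.std()         # ship statistics, not the noise itself

    # Client side: synthesize statistically similar grain and re-add it.
    synthetic = rng.normal(0, grain_std, frame.shape)
    reconstructed = np.clip(denoised + synthetic, 0, 255)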


Removing grain and then re-adding it after encoding isn't specific to Netflix, or even to AV1 as some have suggested. I believe a recent part of H.265 HEVC, H.266 VVC, or MPEG-5 has something similar.


I'm not impressed by the quality Netflix gives me on my phone, all while I'm paying for a 4K plan. Streaming a locally hosted 4K file just looks way better. Apple TV+ is approaching that level of quality, so it's definitely doable.


That has more to do with bitrate. Given enough bitrate, your video will look great even with an older codec such as baseline H.264. Apple TV+ uses anywhere from double to quadruple the bitrate. So it is much more a cost-of-business question than a technology barrier.


And it's acceptable to bill me €16 every month and then only stream full resolution when I happen to be on a big-screen device?


>They achieved several "holy grails" of video encoding: a perceptual quality metric (VMAF), optimal bitrate selection per 2 second video chunk, and then optimal video chunking to be scene based rather than a fixed 2 seconds.

Both are extremely hard problems, but I wouldn't say they are at "holy grail" level. The latter part is currently being offered by other encoding services. And the first one, VMAF, while much better than all the metrics we were previously using, still has many shortcomings. I was actually hoping that with all the AI / LLM models we would have something better than VMAF by now.


Are there other examples of LLMs discovering novel high-performing algorithms, models or metrics? I didn't think they were close to that yet.


That sounds like an epic series of projects to have gotten to work on.


'Long release cycles: The joint deployment meant that there was increased fear of unintended production outages as debugging and rollback can be difficult for a deployment of this size. This drove the approach of the “release train”. Every two weeks, a “snapshot” of all modules was taken, and promoted to be a “release candidate”. This release candidate then went through exhaustive testing which attempted to cover as large a surface area as possible. This testing stage took about two weeks. Thus, depending on when the code change was merged, it could take anywhere between two and four weeks to reach production.'

I guess I'm just old, but I prefer the delay with a couple of weeks of testing versus pushing to prod and having the customer test the code.


Netflix is a company that works with media people. You either understand what that implies or you don't.

And remember, this is the backend for video encoding. Issues in this aren't necessarily user visible.

The big benefit to their velocity is responsiveness. Apparently the backend team understands their customer and the timelines that the customer wants, and adjusted appropriately.

Just dealing with ads would have been problematic, because those tend to be straight 1080p or 4K with stereo audio. Nothing fancy, but I'll bet they didn't fit inside the chunk size the system was expecting, since ads are usually 30 seconds or less. And they don't need the dynamic encoding, etc. that normal titles do.

I wonder how much benefit dynamic encoding brings in space reduction?


> Netflix is a company that works with media people. You either understand what that implies or you don't.

It implies everyone is hopped up on drugs?


Err .. the media people take visual quality and aesthetics very, very seriously. The Director has a vision and the tech goes to amazing lengths to support it. It is a different world as the original post said.


One of the things in media is that shit happens and you get what you get on an extreme timeline and you deal with it. One month is about 29.9 days too long. Someone accidentally encoded it with a codec you don't support? Too bad, you get it done.


What's interesting to me about this is that some companies seem to struggle to get even one reliable deployment process in place. In this case they were able to actively select the right process for the right job, even if it isn't the one they're normally geared toward using.

It's not necessarily anything Earth shattering, but it may be an issue at some smaller places with fewer resources.


Besides the debate about microservices being good or bad, it's clear that Netflix developers are very passionate about what they do. To me this seems to play a big role in the success of a software product.


They are good for Netflix, bad for 99% of the other companies/projects, at least initially until they clearly know what they will build.


It’s not full-circle, as much as industry maturity.

More stories lately are about “why we went back to monoliths and building with Borland C++”.

Not long ago it was more likely “how microservices solved everything at our company, and why only morons disagree.”

So are we moving towards or away from microservices? Both. We’re maturing to use the right tools for the system.


Using the ... right ... tool for the job? In software development?

Surely not! I want my dogmatic clickbait and LinkedIn-style grandstanding thank you very much!

How else will I stay on the hedonic treadmill of staying up-to-date with a new framework or architecture every 3 months?!


True, and yet people are inclined to think the other grass is greener.


I can't understand the article.

If you want to encode video, use ffmpeg. Netflix serves static movies, so encoding is going to be relatively rare and can probably be done on whatever computing resources are already available. Quality-wise, the ffmpeg/x264/x265 people probably are doing a good job already.

If you want to serve video, serve it with HLS or similar from static files stored on a CDN with a bunch of bitrate profiles. Here the problem is more about creating or finding the CDN than anything to do with video.
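
For reference, that static-ladder part really is commodity tooling; a sketch with stock ffmpeg (the rung numbers are illustrative, and a real setup would also write a master playlist referencing the variants):

    import subprocess

    # (height, video bitrate) rungs of an illustrative ladder
    ladder = [(1080, "5000k"), (720, "2800k"), (480, "1200k"), (360, "600k")]

    for height, vbr in ladder:
        subprocess.run([
            "ffmpeg", "-i", "mezzanine.mov",
            "-vf", f"scale=-2:{height}",       # keep aspect, force even width
            "-c:v", "libx264", "-b:v", vbr, "-preset", "slow",
            "-c:a", "aac", "-b:a", "128k",
            "-f", "hls", "-hls_time", "4",     # ~4 s segments
            "-hls_playlist_type", "vod",
            f"out_{height}p.m3u8",
        ], check=True)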

Can't quite figure out what the purpose of all the stuff in the article could be (maybe to justify the jobs of the people doing the work?)


Disclaimer: I don't work for Netflix, but I work in this area.

Ultimately it is still likely to be ffmpeg, or their internal fork of it. But they are talking about per-title/per-shot optimisation here, not applying a blanket quality profile to every single video.

Netflix also produces its own series, which allows them to optimise their encoding further.

They also have Video Validation Services and Video Quality Services that do automatic quality checks of the encoded video, as the post indicated.

Is it complex? Yes. But is it overly complex? Maybe not.


That's what they do, but better. For example, they have their own CDN and put the nodes in your ISP's facility.

Per-title encoding lets you have fewer renditions for a given video, which means better cache hits.

Another example is pre-fetching data to the CDN nodes (when a new episode or season of a popular show comes out).

It's just extra optimizations that come up and are worth it at scale.
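
To give a feel for how per-title encoding prunes renditions: encode probe points, score them (e.g. with VMAF), and keep only the rate-quality frontier, roughly the convex-hull idea Netflix has written about. A sketch with made-up numbers:

    # (bitrate kbps, VMAF) probes for one title; values invented
    probes = [(600, 72), (1200, 83), (2800, 91), (3000, 85),
              (5000, 94), (8000, 94.5)]

    def cross(o, a, b):
        return (a[0]-o[0]) * (b[1]-o[1]) - (a[1]-o[1]) * (b[0]-o[0])

    def rate_quality_hull(points):
        # Upper convex hull: drop probes dominated by cheaper or
        # better-scoring encodes; the survivors become the ladder.
        hull = []
        for p in sorted(points):
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
                hull.pop()
            hull.append(p)
        return hull

    print(rate_quality_hull(probes))   # the (3000, 85) probe is pruned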


>Can't quite figure out what the purpose of all the stuff in the article could be

Netflix encoding is actually much more complex. Per-scene and per-title encoding, optimal bitrate selection, etc. are all a lot more than what ffmpeg offers if you drive it manually.

Arguably you are right, though: Netflix could ignore all of that and just brute-force the problem at a much higher cost in encoding, storage, and bandwidth. But I guess at their scale it makes sense to do all these optimisations.


You know, I read articles like this all the time, but the user experience of all these apps gets worse. The time to first video frame on Netflix is not great, to say the least. The rich metadata also seems to be used only internally...


My pet peeve with Netflix is that it occasionally forgets where I last paused my video. I don't often watch Netflix, so when I do, it's annoying that I can't resume from where I left off


This is Disney+ for me, nearly every time. I moved from the living room to the bedroom and had to seek around to get back to where we were just last night.


I bet it saves them some server $. Probably keeping more data client-side and batching updates to the server.


This is Paramount+ as well pretty much every time, and, and!

After every ad break, the video rewinds to the start of the block before the ad break. So I have to fast forward to the ad break marker.

And people wonder why we pirate :-)


I kinda wonder if it's a tracking endpoint my Pihole and/or Firefox privacy blockers are messing with.


I've found actively pausing a video before exiting seems to improve the accuracy of the resume point.

I'm presuming this is because it fires a beacon on pause, but only every X (10?, 30?) seconds on playback.

I've not had many issues with it completely forgetting where I am, though (it does sometimes get confused as to which episode I watched last).
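
If that's the mechanism, the client logic would look something like this sketch (the interval and names are hypothetical):

    import time

    BEACON_INTERVAL = 30.0   # seconds; hypothetical playback cadence

    class ProgressReporter:
        def __init__(self, send):
            self.send = send             # callable that POSTs the position
            self.last = float("-inf")

        def on_tick(self, position):
            # During playback, throttle reports: a crash loses at most
            # BEACON_INTERVAL seconds of progress.
            if time.monotonic() - self.last >= BEACON_INTERVAL:
                self.send(position)
                self.last = time.monotonic()

        def on_pause(self, position):
            # An explicit pause fires immediately, which would explain why
            # pausing before exit pins the resume point so reliably.
            self.send(position)
            self.last = time.monotonic()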


I watch Netflix regularly and have never had this issue. I wonder if it is your client? I use an Nvidia Shield (Android TV).


Possibly. I use my iPad for netflix.


It might not be, but who are we comparing them to? Secondly, who out there is doing this at scale as "cheaply"?

Disney is already finding out that a streaming platform isn't cheap to run and is hard to do efficiently.


There you go - 100% right you are.


Wait, who am I supposed to believe here?!? Prime Video tore down their micro services in favor of a monolith just last year! Which trillion dollar globocorp is my tiny, insignificant company supposed to emulate?

https://thenewstack.io/return-of-the-monolith-amazon-dumps-m...


If you read the Prime Video blog post, the takeaway is definitely not "always use a monolith". I haven't used Step Functions, but they specifically mention step functions with a lot of state transitions (and the pricing model is per state transition), plus storing things in S3 and having to access them all the time (which shocked me a bit, since having used S3 it seemed obvious that was going to be really expensive). The takeaway for me was that it's important to actually understand the tools you're using.

As an aside, the Prime Video article is a bit funny; at one point they have the line (which I hope is sarcastic, but I fear it isn't) "We experimented and took a bold decision: we decided to rearchitect our infrastructure", when their original design just obviously chose tools that didn't fit their workflow.
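
For a sense of scale, a back-of-envelope using AWS's published rate of roughly $25 per million state transitions for Standard workflows (check current pricing; the per-video numbers here are invented):

    PRICE_PER_TRANSITION = 25 / 1_000_000   # ~$25 per 1M, Standard workflows

    transitions_per_video = 40      # invented: steps + retries + fan-out
    videos_per_day = 500_000        # invented volume

    daily = transitions_per_video * videos_per_day * PRICE_PER_TRANSITION
    print(f"${daily:,.0f}/day, ~${daily * 30:,.0f}/month")  # $500/day, ~$15,000/month
    # ...and that's orchestration alone, before any compute or S3 I/O.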


And just to add: that decision (to start with serverless for POCs) came from above, at CTO level.

Needless to say, microservices are not serverless/step functions.


> The takeaway for me was it's important to actually understand the tools that you're using.

no no no, that takes time, the hype train doesn't wait for anyone. the sacred monolith it is. all hail the Monolith! crush the microgerms, destroy the filthy tiny services.


They're pretty clear here on what the benefits are. Microservices allow you to scale up and down individual system components without having to carry along the rest of a vertically scaled monolith for the ride. This makes for more efficient utilization of compute resources. For a company renting compute resources from the cloud, like Netflix, this can save a lot of money.

What Amazon did, according to this needlessly snarky article that is not Amazon's tech blog, does not conflict with this. It's all theory. In reality, you should not be dogmatic and religious about your architecture choices, but empirical wherever possible. They measured utilization and cost and found they could do better in some cases with monolithic sub-systems. This doesn't mean all of Prime Video abandoned SOA.


> This makes for more efficient utilization of compute resources.

No it doesn't. The rest of the monolith is just a chunk in your compiled binary sitting on disk, which is trivial in terms of resource cost. If that code is not running, it is not using any runtime resources.

Microservices will, however, greatly increase resource requirements if they lead to additional serialization/deserialization, which is relatively expensive. If you're doing video encoding, this isn't such a big deal. For web services, it is likely to be the bulk of the resource cost. This is only exacerbated in modern infrastructures where services are more and more expected to use TLS to talk to each other.
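
This is easy to measure for yourself; a rough micro-benchmark of a plain function call versus just the JSON round-trip an RPC boundary adds (no sockets, no TLS, so it understates the real gap):

    import json, timeit

    payload = {"user": 42, "items": list(range(200)), "flags": {"a": True}}

    def direct(p):
        return len(p["items"])                      # in-process call

    def via_serde(p):
        wire = json.dumps(p)                        # serialize at the boundary
        return len(json.loads(wire)["items"])       # deserialize on the far side

    print(timeit.timeit(lambda: direct(payload), number=100_000))
    print(timeit.timeit(lambda: via_serde(payload), number=100_000))
    # On a typical machine the serde version is one to two orders of
    # magnitude slower per call, and a real hop pays it twice per request.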


Swapping out silicon calls for network calls is the definition of complicated. And starting out complicated because you want some premature optimization is insane. Monoliths are preferable for 90% of the use cases you see in public or private industry.


Honestly, I get that it somehow became cool to ignore the grammatically correct "computation" and "computational resources" in favor of just grunting "compute" — why not go all the way?

"efficient utility of compute resources", etc. Just shorthand everything.

Does it take a famous developer to do it first for everyone to feel comfortable doing it?


IIRC the Prime Video guys were processing videos using AWS Step Functions, which probably makes sense if you are processing a few videos every so often. If you are processing videos continuously, then it's much more cost-effective to just have some big boxes running 24x7, crunching through a queue of jobs.


The meta takeaway is that you shouldn't be afraid to resist trends (in either direction: monolith or microservices). If you're 'wrong' today, you may be 'right' in 5 years.


My problem with these two articles is neither provide sufficient detail to learn anything meaningful.

Are there any good resources from trillion dollar globocorps that get down into the weeds?


AirBnB comes to mind. https://airbnb.io


Yeah, I've always been impressed at AirBnB's open source stuff, even if I don't use their actual product.


I watch ThePrimeagen (coding content creator on YouTube/Twitch) from time to time, and he works as an engineer at Netflix.

My impression is that he doesn't like the container infrastructure, which reflects my own opinion, though he never explicitly calls out the infrastructure at Netflix as something bad. But every time he talks about work at Netflix, it sounds about as complex as I'd imagine if I'd given the job to a CV-driven engineer.


Netflix is huge, has wide breadth of activities, needs to move terabytes of video around, and so has a lot of essential complexity.


"Micro service" and "Monolith" don't have precise definitions anyway. Ideally, there's only one right architecture: the one that's sufficient for the problem at hand, all things considered (latency, availability, cost, provider, maintenance, conceptual integrity, ...).


I feel like the term "microservice" is open to misinterpretation (people tend to focus on the micro part a bit much; I remember someone waxing poetic about something like a CSV-parsing service they had). But surely monolith is quite unambiguous: a system which is deployed as a whole. That is, you cannot deploy some part of the system on its own. You want to get changes in part X to prod? Better be prepared to deploy A through W, plus Y and Z, as well.

That feels fairly precise. But maybe some folks would disagree with this definition of a monolith.


Be thankful. That madness is what's feeding a lot of families. Yours may be one of them.


Mine is definitely one of those.


They moved one of their use-cases back to a monolith, not their whole suite of applications/services.


IIRC, the rebuild-to-monolith framing doesn't tell the whole story. They aren't ditching all of their microservices in favor of a monolith. There is a person from Prime giving a clear picture of this on Twitter (sadly I didn't save the tweet source).


Well... which one is better for your chances of getting more money in the performance review?


I was highly commended for bringing up 6 new microservices a year ago during my performance review. Late in development I noticed that at least 2 of them were useless (they are essentially message-routing gateways; I planned to "enrich" messages inside them, but ultimately never did). It was already done, and I did not want to waste time integrating those two services into the others, so I left it as-is and, well, my boss loved it.


> Which trillion dollar globocorp is my tiny, insignificant company supposed to emulate?

None. They have a different set of problems than you.


I mean, you could just read a variety of different sources, look at the different trade-offs, and make informed decisions based on the compromises of the different architectural designs and your own product's needs?

Jesus, the quality of conversation here is not good today.


[flagged]


Ridiculed by whom? We've seen many competitors try to make a streaming service, and beyond Apple, they all provide a laggy experience even in the menus.

If you're going to emulate someone, it's not a bad idea to emulate whoever has the best results.


Even Apple TV is pretty sluggish on my LG TV. And it sometimes makes my TV crash so I have to reboot it. It's maybe not as bad as the steaming pile of garbage that is Sky-Showtime, but it's got a ways to go before it's comparable to Netflix on my TV. Amazon Prime is pretty terrible on my TV too.


That’s your TV having a shitty processor and WebOS not being the best. Even expensive smart TVs don’t ship with good silicon.

Get something like an Apple TV or Fire TV Cube and you’ll have a better experience. The Apple TV 4K in particular ships with a very powerful processor, it’s far snappier than any other streaming box I’ve tried.


It's been a few years since I let go of my TV, a not-new-at-the-time but high-end LG, and I loved webOS on it. I considered it the best, even better than Apple TV, especially for Netflix. The new owner runs it without complaints.


Yeah, no shit. Still, Netflix, Plex and YouTube work just fine, so Apple TV should be able to work at least as well. I'm not buying an extra device just to compensate for shitty software, that's silly. I prefer to unsubscribe.


These services aren't run by a company that primarily sells hardware/devices. See iTunes on Windows.


> I'm not buying an extra device just to compensate for shitty software, that's silly

Serious question: Then why did you buy an Apple TV? This is Apple's entire MO. You are expected to replace all your devices every year or two.


> This is Apple's entire MO. You are expected to replace all your devices every year or two.

As someone who previously was an "anti-fan" of Apple's (we're talking 2000s, early 2010s) for their ridiculous prices (and that still stands for things like the Vision Pro), I've now seen the light (or gone to the dark side if you prefer) and now believe Apple provides better value for money than most of their competition due to the longevity of their devices. I know this is anecdotal and a sample size of one but I'd be curious to see data backing up your claim above.


Apple was a rip off luxury brand back in the day if you had a Samsung Fascinate or something. MacBooks were horrible and macOS was annoying to deal with. Now they're the default price/performance choice if you want a decent reliable machine, and iPhones are obviously very good value if you just want a phone that works for as long as possible.


Can attest to this; typing this on a 6-year-old iPhone 8 Plus.


Yes, that must be why they abandon older hardware and don't support it long term, unlike Android-based boxes and built-in hardware…


I'm talking about the Apple TV streaming service, not the device.


Maybe for their other product lines, but not the Apple TV. It’s infrequently updated and they easily last 5 years or more.


Laggy menus typically have little or nothing to do with microservice architectures.


Incorrect. Frequently, UI lag on components that hit server-side backing services is made significantly worse by naive microservices, especially in the face of organic growth.

Specifically, every API call that traverses a machine boundary necessarily imparts additional latency, and uncontrolled microservice fan-out can have a multiplicative effect.


I agree that a bad implementation may lead to poor performance. However, this is irrespective of the architecture. The effects of an architecture are more noticeable in the context of maintainability, scalability, and extensibility.

Or perhaps I am misunderstanding your comment?


It's not actually irrespective of architecture; some architectures are significantly more prone to certain kinds of problems than others. For example, monoliths can become so large as to make development, especially many-team development, inconvenient or impossible.

In the specific case of microservices, the key benefit (multiple teams can develop and deliver in parallel without stepping on each other, separating concerns into their own ownership areas) comes with the tradeoff of distributed-systems overhead. That overhead can range from high latency (when a number of microservices sit in a serialized hot path and the complexity is not being effectively managed) to lowered availability or consistency (when data radiates through a network of microservices asynchronously, and different services 'see' different data and make conflicting, individually correct decisions). Monoliths see this set of performance problems much, much later in their lifecycle, because they have much better data locality and local-function-call characteristics.
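
The serialized-hot-path effect is easy to put numbers on; a Monte Carlo sketch with an invented per-service latency profile:

    import random

    def hop_ms():
        # Invented profile: ~10 ms typical, 1% chance of a 200 ms stall.
        return random.gauss(10, 2) if random.random() < 0.99 else 200.0

    for hops in (1, 5, 10):
        samples = sorted(sum(hop_ms() for _ in range(hops))
                         for _ in range(100_000))
        print(hops, "hops -> p99:", round(samples[int(0.99 * len(samples))]), "ms")
    # With a 1% stall per hop, 1 - 0.99**10 ~= 10% of 10-hop requests hit
    # at least one stall, so the chain's p99 degrades far faster than any
    # single service's.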


Anything fronting a UI or API typically has a response cache.

More so for something like the AppleTV web app, where the content is largely static.

There should be zero difference in performance between microservices and monolith in this scenario.


Incorrect. Many calls, such as to auth, identity, ad service, metrics, etc., cannot be cached.


Ad serving and metrics are asynchronous, so they won't block any UI. And authentication/identity behaves the same with a monolith or microservices: it's ultimately just looking up a user in some database.

It's the serving of the content that requires coordination across multiple services, and most of that should be cached at the serving layer.


Incorrect. In most apps, nontrivial content is highly personalized and dynamically served; auth in microservices is frequently two or more hops rather than one; and ad serving and metrics frequently involve synchronous steps.


And yet if you can't even implement a smooth scroll of 2D images over a backdrop in 2024, why would I listen to your opinion of microservices?


Disney-owned streaming services and HBO Max are far from laggy, thanks to BAMTech.

But as far as the menus being laggy: when you are trying to keep the bill of materials for a streaming box under $20, as in the case of Roku, what do you expect?

The Apple TV box is $140, and the difference in quality shows.


It's incredible how poor of an experience Apple TV Plus is on web browsers, even on Safari.


The circumstances when microservices make sense are pretty well documented, but of course that's not widely known, especially because it doesn't fit the 'microservices are the future' slogan.

https://www.infoq.com/presentations/microservices-netflix-in...


Pretty sure OP was sarcastic, lots of people blindly copy big corporation tech even if they don't have big corporation problems and scale.


Bullseye


These are conversations just like talking about code editors and so on. No one is right or wrong. But if I had to choose: microservices are harder to control and secure. And vi? Not for me ;)


Story time: I worked at Facebook and had to work with someone who came from Netflix. He was one of those people who, when he went to a new company, simply tried to reinvent everything he came from with no care or consideration given to what's already there.

FB very much does not use microservices. The closest is in infra but the www layer is very much a massive monolith, probably too massive but that's another story. They've done some excellent engineering to make the developer experience pretty good, like you can commit to www and have it just push to prod within a few hours automatically (unless someone breaks trunk, which happens).

Anyway, this person tried to reinvent everything as microservices, and it pretty much just confirmed every preconceived notion (and hatred) of microservices that I already had.

You create a whole bunch of issues with orchestration, versioning and deployment that you otherwise don't have. That's fine if you gain a huge benefit but often you just don't get any benefit at all. You simply get way more headaches in trying to debug why things aren't working.

One of the key assumptions built into FB code that was broken is RYW (read your writes). FB uses an in-memory write-through graph database. On any given www request, any writes you make will be consistent when you read them within that request. Every part of FB assumes this is true.

This isn't true as soon as you cross an RPC boundary... much like you will with any microservices. So this caused no end of problems, and the person just wouldn't hear it when it was identified as an issue before anything was done. So the net effect was 2 years spent on a migration that was ultimately cancelled.

Don't be that guy. When you go into a code base, realize that things are the way they are for a reason. It might not be a good reason. But there'll be a reason. Breaking things for the sake of reinventing the world how you think it should've been done were you starting from zero is just going to be a giant waste of everybody's time.

As for Netflix video processing, they're basically encoding several thousand videos and deploying those segments to a CDN. This is nothing compared to, say, the video encoding needed for FB (let alone YouTube). Also, Netflix video processing is offline. This... isn't a hard problem. Netflix does do some cool stuff like AI scene detection to optimize encoding. But microservices feel like complete overkill.


> He was one of those people who, when he went to a new company, simply tried to reinvent everything he came from with no care or consideration given to what's already there.

You have to have a very bad case of god complex if you look at a codebase that serves >3B users and experiences very little downtime thinking "oh yeah I could completely rearchitect that thing to be better"...


Maybe, but that's also the mindset of serious innovation. The question is whether or not the idea is a good one or not.


In software engineering terms, the FB codebase would definitely be considered >50% maintenance. While we don't have good fundamentals for software as we do for electrical engineering, we do have some basic studies showing that refactors of products in the maintenance stage usually lead to more bugs, not fewer.

> but that's also the mindset of serious innovation

And I'd completely agree, if the project were to build FB from scratch. However, in this case a software engineer who shows up in a mature codebase and wants to redo it in a different architecture is simply immature, reckless and ignorant.


Or looking to play the promo game: build a bunch of unnecessary stuff and get a pat on the back and better comp.


It's either god complex, or pure lack of foresight and an inability to learn, or plain lack of experience.


Chesterton's fence. A simple rule of thumb that suggests that you should never destroy a fence, change a rule, or do away with a tradition until you understand why it's there in the first place.


Very cool! I think the “until you understand..” part is most important there. Like you still should be free to, but you absolutely need to understand why it’s like that in the first place.


The fact that something a) works and b) serves very high volume traffic should always be a powerful counterargument against any suggestion to reinvent it.


I would be interested to know how they handle per-title audio as well. With stereo -> 5.1 -> 7.1 and the side and wide layout variants, how do they think about this during the inspection and encoding process? Being completely naive about Netflix's source media, and assuming it comes in a variety of formats and containers, it seems like there are decisions to make there. Though audio obviously has a much lower bandwidth burden, one would think there could still be QoE gains (and bandwidth savings) by applying the kinds of per-scene techniques AV1 enables to something like Opus.


Okay, now that that's done, can someone at Netflix please figure out how to use their multiple data centres worth of distributed clusters to serve more than 4 or 5 subtitle languages please?

If anyone at Netflix would like some assistance, I've previously consulted in the areas of large-scale compression optimisation, and I'm sure we can get those 100KB text files down to under 20KB!

I'll help build distributed Kubernetes buzzword-compliant architectures, if that helps anyone get internal promotions as a part of this pan-cultural effort of inclusivity.


Does anyone have any reading material on the reliability of systems that use microservices? I've had a bit of basic probability rattling around in the back of my head that makes me suspect microservices are in general less reliable. I'd be interested in seeing a real-world analysis.

My thinking goes like this, with some simplifying assumptions. Let's say you have a monolith with 99% uptime that you rearchitect into 5 microservices, each with 99% uptime, and if any one of those services goes down your whole system is down. Let's also assume for the sake of simplicity that these microservices are completely independent, although they are almost assuredly not.

From basic probability, 99% uptime means there is some chunk of time t for which P(monolith goes down) = 1%. But

P(microservice system goes down) = P(service A down or service B down or ...) ≤ P(service A down) + P(service B down) + ... = 5%

Strictly, that sum is the union bound; with full independence the exact figure is 1 - 0.99^5 ≈ 4.9%. And in reality the failure windows overlap because the services aren't independent, which pushes it lower still. But the point stands: the whole system's chance of going down is bounded above by something several times worse than the monolith's.

But microservices are pretty popular, and I'm sure someone has thought along these lines before. One potential rebuttal is that each microservice is in fact more reliable than the monolith, although from what I've seen in my career I am skeptical that's truly the case.

Where's the hole in my reasoning? (Or maybe I'm right. That would be fine too.)
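
For concreteness, here's my arithmetic in code, including the counter-scenario I imagine a defender would raise:

    n, per_service_up = 5, 0.99

    # Serial chain: every service must be up for the system to be up.
    print(1 - per_service_up ** n)      # ~0.049, vs 0.01 for the monolith

    # The standard rebuttal assumes a failure usually degrades only one
    # feature, and only a small slice of failures take everything down.
    truly_fatal = 0.001                 # assumed: 0.1% full-outage rate each
    print(1 - (1 - truly_fatal) ** n)   # ~0.005 full downtime, better than 1%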


In general, one of the goals of microservices should be that if one of the five services goes down, the other four should be able to operate in some capacity still.

In practice, this can make the math quite a bit messier, but I don't think it necessarily has been worse overall from my perspective.

So instead of having your system be up or down 99% of the time in a monolith, you'll have it fully up 95% of the time (using your numbers), but of that 5% of downtime, 20% of the time one of your products will be running slowly, or 10% of the time some new feature you launched won't work for specific customers in some specific region, etc.

At my company it makes things like SLA/SLO guarantees for "our services" pretty complicated, in that it's hard to define what uptime truly means. But overall I think the five-microservice approach, when done well, should have less than 1% complete downtime, at the cost of more partial downtime.


> In general, one of the goals of microservices should be that if one of the five services goes down, the other four should be able to operate in some capacity still.

This is an excellent point, but what brought this to my mind was that the microservices in the Netflix article I don't think have this property. It looks to me if any of the VIS, CAS, LGS, or VES go down, then the whole service is effectively down.

Indeed, in my own career what I've seen is that if one microservice goes down, the user won't be seeing 500 errors and friends, but the service will be completely useless to them. You've just gone from a hard error to a spinning load icon, which might in fact be an even worse user experience.

It could be argued that this is just "you're doing microservices wrong", but then we start getting into no true Scotsman territory.


> Indeed, in my own career what I've seen is that if one microservice goes down the user won't be seeing 500 errors and friends

Exactly. What happens is that the first few hours of the triage call go to people claiming "well, my service is up, the issue is somewhere else". So just finding which service failed takes crucial hours, instead of fixing the failing service.

But in a world where Micro Service Incident Commanders can pinpoint a failing service among 1000 microservices within seconds on their vast 80-inch monitoring consoles and direct resolution admirals to fix it in the next 15 minutes, it might just all work fine.


The problem comes when it's a distributed system, and it's the interaction between multiple systems that's causing the problem, not a specific microservice being down. Something got upgraded, and the message size changed in an unexpected and incompatible way that worked fine in testing.


> It looks to me if any of the VIS, CAS, LGS, or VES go down,

But the whole point is that by splitting it into microservices you can efficiently and optimally scale each component individually. So it's extremely rare that VIS, for example, would go down entirely. And because Netflix has tools like Hystrix, if one instance is unavailable it will seamlessly route to another one.

And even if you push bad code, there are techniques like blue/green and canary releases which can be used.


Several reasons this doesn't pan out in practice.

1) Retries. When one replica of a microservice is down, the calling service can retry, get service from an up replica, and the outage is routed around (see the sketch after this list).

2) Queues. Microservices lend themselves to queue-and-worker patterns, where downtime in individual services has less effect on overall service availability.

3) Outages have narrower impact. One microservice losing access to its database breaks the functionality that relies on that microservice; other functionality runs fine.

4) Changes have a smaller blast radius. Most outages are caused by changes; changes in monoliths that cause outages are more likely to take the whole system offline (e.g. stack overflows and infinite loops crash processes). Changes that cause outages in microservices can't knock other services offline.
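
Point 1 in miniature, a sketch with hypothetical endpoints and a stubbed RPC:

    import random

    REPLICAS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # hypothetical

    def send_rpc(host, request):
        # Stub standing in for a real network client.
        if host == "10.0.0.2":
            raise ConnectionError(f"{host} is down")
        return f"{host} handled {request!r}"

    def call_with_failover(request, tries=3):
        # One dead replica costs a retry, not an outage.
        last_err = None
        for host in random.sample(REPLICAS, min(tries, len(REPLICAS))):
            try:
                return send_rpc(host, request)
            except ConnectionError as err:
                last_err = err          # route around the down replica
        raise last_err

    print(call_with_failover("GET /title/42"))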


I did roughly the same math. If there are 5 services that can each fail independently (unlikely, but let's assume), each with 99% uptime, then P(all up at the same time) = 0.99^5, which means the probability that at least one is down at any moment is 1 - 0.99^5 ≈ 5%. So this increases the failure rate to five times the original 1% downtime. And with 100s of microservices in the overall architecture, with indirect connections between many of them, I think this number could go much higher.

Further, at least where I work, it is clear that the failure rate is higher than 5%. But with the cottage industry of observability tools, cloud-native solutions, blah blah, explaining basic maths to people in responsible positions is a sure-fire way to get fired. I am already marked as someone opposed to progress, so I can basically take my statistics and shove them. There is a million times more data about the reliability of microservices, and it can't all be wrong.


I don’t know if our architecture truly qualifies as microservices but in my experience one of the advantages is that the system is able to limp along in a degraded state much more effectively when one service goes down whereas it’s a lot easier to bring the whole system to its knees with a single change in a monolith.

This suggests an addition to your model which is that not all outages are equally costly.


Your assumptions elide the benefits:

- Microservices don't always block the pipeline; often the failing one can catch up later.

- Scaling can happen for each microservice independently.

- Removing faulty components from the main path means the key services are less likely to crash.

- You haven't explained why feature X is more likely to crash in a microservice than in a monolith; e.g., you're assuming components A and B each have a 0.5% crash rate in a monolith but 1% when run independently.

Your model ignores that most of your crashes come from the same code paths in both models; only a small part of the crash rate comes from hosting.


> My thinking goes like this, with some simplifying assumptions. Let's say you have a monolith with 99% uptime that you rearchitect into 5 microservices, each with 99% uptime, and if any one of those services goes down your whole system is down. Let's also assume for the sake of simplicity that these microservices are completely independent, although they are almost assuredly not.

I've worked on microservices at HBO. IIRC, over 2 years my team's multiple services had only 1 complete outage and 2 or 3 impactful incidents.

Also a nice benefit of Microservices is that you can shove queues and retry logic between services, and replay messages later on when a service is back up. Obviously not appropriate for anything that needs to give real time results, but there are a surprising number of features that don't mind a 30 second or even 5 minute delay so long as success is guaranteed eventually.
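
A toy version of that queue-and-replay pattern (the delays and limits are invented, and a real system would use broker redelivery rather than sleeping in the worker):

    import queue, time

    jobs = queue.Queue()
    jobs.put({"task": "encode", "title": 42, "attempt": 0})
    MAX_ATTEMPTS = 5

    def handle(job):
        # Stand-in for calling a downstream service that is currently down.
        raise ConnectionError("downstream unavailable")

    while not jobs.empty():
        job = jobs.get()
        try:
            handle(job)
        except ConnectionError:
            job["attempt"] += 1
            if job["attempt"] < MAX_ATTEMPTS:
                time.sleep(0.1 * 2 ** job["attempt"])   # exponential backoff
                jobs.put(job)          # replay when the dependency recovers
            else:
                print("dead-letter for manual replay:", job)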


I'm not an expert in this discussion by any means, but my two cents is that collaboration on larger teams is easier on the microservice path than the monolith path.

I would expect that over the course of years, the uptime and feature velocity of the monolith will decline at a rate faster than that of microservices; if a monolith has 99% reliability, then that inflection point might take a while to occur, per the calculation you provided.

However, as a company you might be willing to go from 99% to 95% reliability (or 3 9's to 2 9's) to double the speed of feature development.

To my understanding, microservices are primarily implemented as an architecture for collaboration due to the inherent inefficiencies and communication difficulties of large teams. This is why they are often not recommended unless you could be categorized as "Big Tech".


Anecdata incoming. This was when I was working in "Big Tech".

When I was in web dev, my experience was that there was actually no good separation, and collaboration with microservices was in fact more difficult. Every single feature I worked on required changes across several microservices, and having Team A run service A and Team B run service B etc. just meant that I had to get buy in from every team. My team was easy enough to work with, but then for the work needed on service B I would have to learn and start using their processes, attend their standups and meetings in addition to mine, and so on.

Frankly it was a nightmare. But in a monolith, the same work is usually just a few quick arguments during code reviews. Maybe inviting a few extra people to design reviews.

In my experience, microservices just make teams more territorial.


The flaw in your logic/math is assuming that the resulting 5 microservices each have the same 99% uptime as the original monolith. In practice some microservices are much simpler and therefore more reliable than others, especially when broken out.


Well, even that depends; the overhead to do microservices well is substantial. Versioning of which services are deployed, and guarantees around compatibility between deployments, can be a huge amount of work. I've seen systems where 2-month-old containers are just floating around, still being sent incompatible messages. Then there's teaching developers CAP theory properly: at Netflix, great, everyone is sharp enough to get it, but at AverageCompanyX, well, my experience is that 30% of developers actually can't reason about eventual consistency.

Therefore I would question the assumption that things are simpler: the code, certainly; the infrastructure, debugging, and deployment, certainly not.

I would say breaking things down into services, slowly, as it makes sense is enough. They don’t need to be micro.


Yea, totally agreed with your points; a bunch of new problems arise that most orgs aren't equipped to solve, and they should probably stick with a monolith. I was merely pointing out the probability part and how nobody would use microservices if it meant (.99 ^ 5) reliability, although to your point that can definitely happen at some places!


>In practice some microservices are much simpler and therefore more reliable than others, especially when broken out.

Much simpler in what way?

You mean less code = less crash/segfault potential?

If so, then oh c'mon, modern stacks are incredibly reliable; they almost never crash.

More microservices = more infra-level stuff needed = WAY more potential problems.


I am much more looking forward to "Serving Netflix Video Traffic at 1600Gb/s and Beyond". Hoping drewg123 will share something soon :)


> Processing Ad creatives posed some new challenges: media formats of Ads are quite different from movie and TV mezzanines that the team was familiar with, and there was a new set of media processing requirements related to the business needs of Ads.

Nice to see them rearchitect their service around enshittification.


Did any of this:

Make my Netflix better? Make it cheaper? Deliver better content? Is this the work product of 2,000 engineers focused on delivering me the worst content in the best way possible? What exactly am I getting for my 12, 20... wait, what the hell is Netflix charging now for their garbage content...

1,000? 2,000 engineers at Netflix, and this is the article we get? This is their flex?

I am underwhelmed.


And then I always think: Netflix only has ~3,600 movies... My friend's Plex server has 4x that (in movies alone). I'm also often underwhelmed by Netflix's engineering posts.


But your mate doesn't have to stream them on demand to millions of people around the world.

They not only have to store the movies, but also access and simultaneously stream them thousands of times from anywhere on the globe.

Imagine how many people are watching something like a Stranger Things series premiere at the same time.


Let me have a copy locally, then you don't have to stream it to me every time.

This is basically a scaling issue streaming platforms and publishers created themselves because of copyright.

It makes their product expensive, bloated and clunky, and makes pirated content much more convenient.

There are no excuses. Games are exactly like that. They just have to do better.


"let me have a copy locally" is a very naive view of the world. 1. netflix works also in places with horrible connections/terrible bandwidth 2. what would you do? preemptively download all the series, movies etc that they think you'll download? (that may work in places with good internet, but even there, how much disk space would you need like on a mobile device?) 3. most behaviour on netflix is "let's try to click on these random series/movies and see if i like it" so you'd be downloading things that you'll never come back to see.

I am not sure i understand your point...plus netflix has the HUGE problem of optimizing the stream as much as possible to give people fluidity in their experience (exactly for the - low bandwidth - connections)


My point is that many of the engineering challenges come from self-inflicted wounds that could be overcome with more flexibility.

Giving the user a choice to pay less and run their own binary of Netflix would solve most of those issues:

1. Shitty connection: no problem, just wait for the movie to load.

2. Preemptive download: on demand pre-load

3. Stream optimization: would be solved by local caching and P2P offload

4. Mobile devices: either caching there (already a feature) or NAT punchthrough, accessing movies you already preloaded in your home infrastructure

5. Naive view of the world: I think many things we do nowadays would once have been considered naive. What, a computer the size of a chocolate bar in everyone's hand? Calling an idea "naive" just shuts down discussion of new ways of thinking. My idea could be technically bad, but "naive" is just a way of saying "outside the common discourse".


"Giving the user a choice to pay less and run their own binary of netflix would solve most of those issues" why? netflix works well for what they want and how they want it to work. you can download things to watch offline already and the rest would benefit just a small percentage ot the world, which already has money to pay for the service. I agree that p2p would be a great solution to networking issues, but am not sure about the legal/technical consequences (again) worldwide.


Doesn't change the required infrastructure. On Stranger Things premiere day you still have to have the download infrastructure to handle hundreds of millions of downloads simultaneously.


Not necessarily; you need one download and then, theoretically, torrents will do the rest. People don't download directly from Netflix.


Just to add to it: torrenting isn't piracy technology, it is just a sharing protocol. Netflix could very well leverage it to lighten the distribution load. Hadn't thought of that, nice catch.


It's actually pretty wasteful that it doesn't work that way.


I'm pretty sure that would be illegal, but even if not, there would be such pushback against using customers' own bandwidth to distribute their own content.


World of Warcraft was distributing updates as torrents 20 years ago. It was fine.


Netflix, for instance, running their own tracker, then clearly advertising an advanced tier with lower pricing, with good documentation on how to set it up, would be enough.


You can't simply start using your customers' bandwidth to distribute your own content.


Who said Netflix was offering the torrent?


just throw it in the TOS >:^)


It shifts the transcoding load to the user, and it loads the movie asynchronously so you don't need the full bandwidth; of course it changes the required infrastructure.


If I understand correctly, streaming services don't transcode on the fly. They hold different versions of the same media and directly serve the supported version to the client.


> They hold different versions of the same media and directly serve the supported version to the client

Those different versions are transcoded from a mezzanine source, by a massive system, which is the subject of the OP. You can't just write off the main task from the discussion.


The 4K mezzanine source will typically be very high bitrate, perhaps 1 TB or more. So big that it cannot be repeatedly “distributed to clients for transcoding”.


So I’m going to download everything I might want to watch like it’s 2005 and I just got the first video iPod?

Do I download everything to my computer, iPhone, iPad, and multiple AppleTVs in case I want to start watching a TV series one place and finish watching some place else?

BTW, of course you can download video to mobile devices


Saying 2005 like it was a bad thing to own the stuff you had. I would much rather have it "the old way" than have my series delisted from the streaming service mid-season.

Just make a paid local tier: cheaper, but you have to download it and transcode it locally. Give the user a choice.


Unlike music, which was DRM-free and could legally be ripped from a CD using iTunes, there was no easy legal way for Apple to ship software that could rip DVDs. When the iPod with video came out, iTunes also started selling movies. The next year Apple introduced movie rentals.

Are you really suggesting that it would be a good mainstream product to offer video downloads, in 2024, where people use a computer to download movies and then upload them to all of their devices? How then do they watch on their TV with the built-in apps?

They set up their own Plex server? When I want to binge 30 seasons of South Park (some on my TV, some on my phone, some on my iPad), do I copy it to all three places?

What keeps track of what I watch and don’t watch?


In the Apple-ecosystem analogy, you would just need a Time Capsule-like device (or any server, for that matter). A single copy would suffice. Locally, you stream as-is, since WiFi is more than enough for 4K content. On the go, after connecting to your server, if the upload isn't enough the server could transcode the movie, scaling it down. Perfectly doable without having a copy on each device.

> Are you really suggesting that it would be a good mainstream product to offer video downloads - in 2024 - where people used a computer to download movies to a computer and then upload them to all of their devices?

If it was the sole offer, I think it would be too restrictive, but as a lower or cheaper/advanced tier, why not?


I had a Plex server setup that could do that. It was an old spare computer. iTunes can (could?) do that over a LAN. That's how the very first hard-drive AppleTV worked.

It was a pain to maintain, and I had symmetrical gigabit internet. Most people have cable internet with very low upstream bandwidth.

But now you’re asking them to buy another device and what’s the benefit for them?

In today’s world, I would buy an Nvidia Shield that can do hardware transcoding.

But you can already download video to mobile devices ahead of time.


> Let me have a copy locally, then you don't have to stream it to me every time.

Venting? A local copy would not give you anywhere-access (mobile/PC/tablet, etc.) to their content. Imagine having to carry 20 discs (just 20, not even saying 100+ yet) with you everywhere. In case you plan to say "I don't need mobile access or 20+ discs": they aren't going to customize their offering just for you or a few users. A USB drive will not work with mobile devices.


I have Plex with transcoded DVDs locally that I can access anywhere with Tailscale.

A streaming service would just need a relay bastion to punch through NAT for the initial handshake. You just need a server running some kind of client in your own infrastructure.


You can already download to your mobile device.


That would raise the price for storage


Put a cache in front of it?

Serving a shitton of files is sort of a solved problem. For huge bursts of a single piece of content you just need request coalescing and a few layers of fanout. If you know what content is coming, you can even pre-warm the top layer of the cache.

Sure, you need a lot of infrastructure to serve a lot of traffic. But it isn't complex infrastructure.

The hardest part of Netflix's setup is probably the player: making it request the right quality for the network and device conditions. And I don't know much about DRM, but I'm sure decryption keys add some complexity too. Recommendations and other features are probably also much more complex than serving a small number of fairly large files.
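
Request coalescing in a nutshell: concurrent cache misses for one key share a single origin fetch (Go ships this as "singleflight"). An asyncio sketch:

    import asyncio

    inflight: dict[str, asyncio.Task] = {}
    origin_hits = 0

    async def fetch_origin(key):
        global origin_hits
        origin_hits += 1
        await asyncio.sleep(0.1)                  # pretend origin round trip
        return f"bytes-for-{key}"

    async def get(key):
        # First caller starts the fetch; the rest of the burst awaits
        # the same task, so a stampede becomes one origin request.
        if key not in inflight:
            inflight[key] = asyncio.create_task(fetch_origin(key))
        try:
            return await inflight[key]
        finally:
            inflight.pop(key, None)

    async def main():
        await asyncio.gather(*(get("ep1-chunk9") for _ in range(1000)))
        print(origin_hits, "origin fetch(es) for 1000 requests")   # 1

    asyncio.run(main())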


Alright, to release a video on a streaming platform here is what you need to do:

1. Encode the video in multiple different formats and resolutions for different devices

2. Encode the sound track in multiple different formats for different devices, and package those up alongside the video file

3. Encode the subtitles in various formats and languages

The number of combinations of the above is, by itself, super complicated, and if you pay close enough attention to the different streaming platforms you can see that they all get it wrong sometimes.

And remember that content is being ingested from multiple different sources, from internal studios to purchase agreements with small international indie studios.

Alright, so you got that taken care of, now you need to get the files out to CDNs. You have your ISP based CDNs, e.g. Comcast really wants to cut costs, you may possibly be running your own backhaul between your own CDNs, and then there are the large CDNs everyone knows of as well.

And video playback isn't just a static thing. People want to be able to pause a video on their TV and resume it on their phone, so every few seconds you are sending completion info on where the video is at. Except some playback platforms are so locked down that they don't allow you to initiate sending data back over the network (!!!), so you have to find a way to estimate how much of the video the user has played back so far. Spend some time thinking about how to do that, and you can imagine it gets horribly ugly.
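One crude way to do that estimation, sketched from server-side segment logs (the segment length and buffer depth are assumptions, not how any particular service actually does it):

    SEGMENT_SECONDS = 4          # assumed chunk duration
    ASSUMED_BUFFER_SECONDS = 30  # assumed forward buffer on the device

    def estimate_position(highest_segment_requested):
        """Guess the playhead from the furthest segment the CDN has served."""
        buffered_to = highest_segment_requested * SEGMENT_SECONDS
        return max(0, buffered_to - ASSUMED_BUFFER_SECONDS)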


He should write a masturbatory blog post about how effective his setup is... Hell, he could do a cost comparison of running his service vs. Netflix per month.

I think this is the year where I go back to stealing content, none of these services are worth it.


If paying for content is not owning it, then copying it is not stealing.


I am assuming you got the statistic from here

https://www.comparitech.com/blog/vpn-privacy/netflix-statist...

It also has 1,800 TV series, and how many episodes does the average TV series have?


How much does that cost to put together, just in storage?


> Haha used to be 1PB, stored completely on Google Drive (they had unlimited), but then they cut me off so I cut down to 300TB and switched to self hosting at a data warehouse.

> Now me and like 100 people share a ceph storage that we all pay $100/mo for. I think the current size is like 1.5PB


Netflix has rotations though; does that paint a better picture?

It also has TV shows; I wonder how shows and movies compare in terms of bandwidth, storage, and offerings.


This is a bit like complaining about AWS because you don't like the product selection on Amazon.


Except that for any aspect of the post to actually justify the engineering effort, it would have to either improve efficiency or save money. All while Netflix has increased prices, stopped account sharing, and introduced ads.

A net negative for users, while there were presumably some net savings or gains for Netflix.

They've gone from offering a competitive product to offering a compelling investment, shedding any guise of caring about their users along the way.


Yes, the job of Netflix engineers is to help Netflix gain money.


Is Netflix offering their services as a platform now?


Not sure why that would matter; complaining about content in response to an article about infrastructure doesn't make any more or less sense if the infrastructure is available as a service.

But I do consider some of what Netflix does to be a platform. Most of it is a platform in the sense that some of their open-source offerings, such as Spinnaker, are commonly adopted. And if you look at the adoption of microservices at Netflix, part of that includes Conductor, an open-source microservice orchestration engine.

The Netflix developers that created Conductor left Netflix and formed a new company named Orkes to offer Conductor as a platform. So while it's not operated by Netflix, the microservice efforts they've made have been turned into a service offering.


The way I see it, Netflix is mostly a sprawling implementation of a mediocre Java stack, typical of a large IT department at an F500 company. But their relentless tech marketing, and extremely high pay for work that is a wrapper around enterprise IT, have created an aura of sophistication and cutting edge in many people's minds.


I don’t like Netflix’s content either but I’m pretty sure that’s not up to the engineering department.


I doubt the backend engineers have any say over the content side of the company, so blaming them for that isn't really reasonable. And while it may not make it cheaper for you or improve the user-facing interface (again, another team), it probably made the system easier for them to maintain, debug, and administer, which is something all sysadmins and engineers should respect.


Don't be underwhelmed. It definitely makes your Netflix better: whatever you watch can be encoded better, which enhances quality and lowers the chance of a rebuffer interrupting your experience. And the improved encoding efficiency frees up money that can be spent on content production.

But you can also just enjoy the story of developer achievement!


> so we can maintain our rapid pace of innovation

so it will make things better, pinky-promise!

I'm happy to pay them for the occasional good content, which I'll then torrent (because fuck smart TVs), but their app/client/website and their system seem to just work. I'm sure there are many things to optimize, etc., but they could probably reduce their development (and ops) budget by 70-80% if they stopped fucking with the system.

though, of course, that'd require a drastically different mindset, different people, etc.


Yes.

> While it is still early days, we have already seen the benefits of the new platform, specifically the ease of feature delivery.


What features though? Oh maybe a sleep timer for the kids section so it turns off after 1 or 2 episodes? No, that would be actually useful but wouldn’t warrant a blog post.


Ways to make it better:

1) Implement Apple TV menu integration, like several other services do.

2) Bring back manual rating. My suggestions were way better then.

3) Like literally every other service that has parental controls: I wish you’d just give me a goddamn allow-list. That would be more useful than almost all other effort that goes into this stuff, and relatively easy. But almost nobody does it. It’s very frustrating.

Ways to save a lot of bandwidth:

1) Stop being the only major service that auto-plays video with audio behind the menu when folks are just trying to browse, or are just idling on the menu and talking, and in either case actively do not want that (I know there’s a setting now, finally, that sometimes kinda works for a while—how about flipping that default around?)


They have it in the article itself: it's why they were able to roll out a totally new plan tier quickly.


If you hate Netflix, why don't you cancel your subscription?


I mean, it depends? Obviously Netflix has extremely different priorities than 99.99% of the software in the world. Scale of operations is much different as well.

It’s available in most of the countries in the world, on a wide variety of devices, requiring a ton of different video processing pipelines, content delivery networks, infrastructure, etc. Even very “straightforward” things like downloading for offline viewing can take significant effort to implement. Now think of audio sync, post-processing, subtitle delivery, localization, partnerships, etc., and you can see how you would need a ton of engineering effort to achieve it. The scale alone makes implementation a much more nuanced problem.

You and I can dislike whatever content they’re delivering, but it’s very obvious that there are millions and millions of people who still enjoy it.


>>> It’s available in most of the countries in the world, in a...

You get that your post right here is a better pitch for Netflix engineering than their own blog post? Blog about some of those problems, the things you're doing that make your domain hard and interesting...


That's exactly what this post is about. It requires some background info to see the scale of their achievement, sure, but their choice is to put some of that burden on you, the reader.


Yes probably all of those things.

I’m not being snide.


It's pretty easy, just stream the results of ffmpeg. I could create a Netflix platform in one day.

/s if anyone took this seriously



