It would be cool if you could share when you worked at FB and what general area of the infrastructure you worked on. Your statements don’t match the Facebook infrastructure I know. I recently left after spending 8.5+ years on infra, so I saw it both in the really early days and as it is today.
Facebook actually built many major pieces of its infrastructure itself after finding no suitable solution in open source or commercial vendors. Efforts like this were happening even back in the early days of FB infra. I could list out many publicly known Facebook infrastructure projects that solved Hard Problems. And not just in recent years either. :-)
Hard problems like massive scale with strong consistency, which three other firms had already solved? Or "hard" problems like managing COTS components or accelerating PHP, which just require time + labor to pull off? Feel free to name projects solving Hard problems whose capabilities had little to no precedent outside Facebook. I mean that straight up rather than sarcastically, in case it comes off that way. I'd like to know the best things Facebook has pulled off.
I’ll talk about one abstract thing and two concrete things that I think show Facebook has been able to solve Hard problems that required developing new capabilities rarely or never seen before. I find these three things very impressive and notable myself, although I am obviously very biased. I should note that this is not an exhaustive list; it’s just the three examples I am most familiar with and wanted to take the time to write up.
First, the abstract thing. I don’t know if most engineers outside Facebook appreciate how much activity there is on the site each day. Hundreds of billions of likes and comments. Billions of photos uploaded. Trillions of photos consumed. And growing each day.
And this is for an “online application” in the sense that the data is live and constantly changing. We’re not talking about crawling the web, storing it, doing offline processing, and then building a bunch of indices (which is a different but still legit kind of hard). This is an immense amount of live data producing an even more immense stream of live events; trillions and trillions and trillions that need processing, live and on the fly, every day. It is hard to overstate how difficult it was to build an application and backend that could drive this kind of social platform. In terms of liveness and scale, there really is nothing out there that can touch Facebook, by orders of magnitude. That is a solved Hard problem in my opinion, if not a meta one. And, to my mind, the largest one.
Here are two concrete things that I think are good examples of Hard problems that Facebook has solved:
1. A global media storage platform that each day is capable of ingesting billions of new photos and videos and delivering trillions as well. This includes Haystack, F4, and Cold Storage, which store hot, warm, and cold media objects for Facebook (I'll sketch the basic tiering idea after this list). Each storage layer has specialized, custom software running on distinct storage hardware designed around the requirements of that layer. Facebook is the largest photo sharing site in the world by orders of magnitude, and they had to build a very custom photos backend to handle the immense load. I’m not even mentioning their terabit-class Edge platform, which has a global constellation of POPs that accelerate application traffic and cache popular content close to users. Facebook’s global media storage and delivery platform is truly a unique asset.
2. A datacenter architecture focused on high power efficiency and flexible workloads, comprising custom-built datacenters, racks, servers, storage nodes, network fabric, rack switches, and aggregation switches, plus some other things that haven’t been made public yet. This architecture let application developers stop working around various physical performance bottlenecks in compute, storage, cache, and network, and instead just focus on optimizing what was best for the application. At the same time, this new architecture also greatly reduced infrastructure costs.
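To make the hot/warm/cold split in point 1 a bit more concrete, here's a toy sketch of a tiering policy. The tier names map loosely onto Haystack (hot), F4 (warm), and Cold Storage (cold), but the age thresholds and the routing rule here are made-up illustrations, not the real placement logic:

    from datetime import datetime, timedelta

    # Hypothetical windows, purely for illustration.
    HOT_WINDOW = timedelta(days=30)
    WARM_WINDOW = timedelta(days=365)

    def pick_tier(uploaded_at: datetime, now: datetime) -> str:
        """Route a media object to a storage tier based on its age."""
        age = now - uploaded_at
        if age < HOT_WINDOW:
            return "hot"    # iops-optimized stores (Haystack-like)
        if age < WARM_WINDOW:
            return "warm"   # cheaper-per-byte stores (F4-like)
        return "cold"       # mostly-idle archival storage (Cold Storage-like)

    # A photo uploaded two years ago lands in the cold tier.
    print(pick_tier(datetime(2013, 6, 1), datetime(2015, 6, 1)))  # -> cold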
There are other solved Hard problems I could talk about that are public, and others that I wish I could talk about that aren’t yet. However, I’m not trying to exhaustively defend what Facebook has built; I just wanted to respond with a few examples of the very special scale Facebook operates at and some of the extremely hard problems they’ve had to solve.
re live data. Processing that much live data without changes breaking the whole thing is a Hard problem. Very impressive work on their part. I'll give them that.
re storage platform. I'd have to look into Haystack and F4 to see if they were truly unique or incremental improvements on existing stuff. Cold Storage, though, I *did* read up on. The work and decisions that went into that were brilliant: they nailed every little detail and optimization they could. Plus, although minor precedents existed, I don't think I saw anyone else thinking or working in the direction they worked. Truly innovative.
re datacenter. I count most of that as more incremental. I've looked at their publications on their architecture. They might strip unnecessary stuff out of a blade, put in a RAM sled for a cache, use a simpler protocol than TCP/IP for internal comms, and so on. That's the kind of stuff everyone in the whole datacenter (and supercomputer) industry does. Most of it is obvious & has plenty of precedent. Now, if you know of a few specific technologies they used that were extremely clever (little to no precedent), please share. An example would be whoever started installing servers in shipping containers for mass production and then easy installation... that was brilliant stuff. It was either Facebook or Microsoft.
So, you've given two good examples of Hard problems Facebook solved. Brilliant thinking went into solving them. No doubt a company working some miracles on a regular basis.
I built Haystack with two other engineers and then founded the Everstore team which works on Haystack/F4/Cold Storage. I moved off of managing Everstore to focus on Traffic in fall of 2011, so F4 and Cold Storage happened after my time, but I worked on Haystack and the storage management layer above it for five years.
There were a few things that were interesting or novel about Haystack. CDNs were only giving us an 80% hit rate on photos, and even back in 2007 when we first started working on Haystack, that was an immense number of misses to serve from the origin. That heavily informed the design goals of Haystack, which were to have a very I/O-efficient userland filesystem that used direct I/O to squeeze as many iops from the underlying drives as possible. We also wanted zero indirect I/O, so we used an efficient in-memory index so that every seek on disk serviced a production read rather than indirect index blocks or other metadata. Lastly, Haystack is very, very simple. We made it as simple as we could and eschewed anything that looked too clever. I actually think its simplicity is one of the things that made it Hard. It would have been much easier to design something more complex.
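If it helps make that concrete, here's a drastically simplified toy of the core read/write path. The real thing has needle headers, checksums, flags, recovery, and direct I/O, and this is Python rather than what we actually ran, but the shape is the point: one in-memory lookup, then exactly one disk read, with deletes handled later by compaction.

    import os

    class ToyHaystackStore:
        """Toy illustration only: one big append-only file plus a compact
        in-memory index, so a read needs a single pread and no filesystem
        metadata I/O. Single-threaded; not the real on-disk format."""

        def __init__(self, path):
            # The real store uses direct I/O to control iops precisely;
            # O_DIRECT is omitted here to keep the toy portable.
            self.fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_APPEND)
            self.index = {}  # photo_id -> (offset, size), kept entirely in RAM

        def write(self, photo_id, data):
            offset = os.lseek(self.fd, 0, os.SEEK_END)
            os.write(self.fd, data)                  # append-only, no in-place updates
            self.index[photo_id] = (offset, len(data))

        def read(self, photo_id):
            offset, size = self.index[photo_id]      # in-memory lookup, no disk I/O
            return os.pread(self.fd, size, offset)   # exactly one disk read

        def delete(self, photo_id):
            # Deletes just drop the index entry; space is reclaimed by compaction.
            self.index.pop(photo_id, None)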
If you have any questions about the Haystack paper after reading it, I'd be happy to answer what I can. Just send me an email.
Regarding the datacenters, you're kind of simplifying it, but that's okay. There's a lot more orchestration, automation, and integration, and I wish I could give you a tour of one of our OCP datacenters so you can see it firsthand. There's plenty of things I wish I could talk about on the datacenter front that I feel are truly amazing and without precedent, but I can't until they are public. There's an excellent chance that they will become public and likely even part of OCP in the future though. I'm really proud of how much of Facebook's datacenter architecture has been shared with the world for free. It gives me warm fuzzies all the time.
Anyways, thanks for the fun thread, and let me know if you ever want to talk storage sometime.
The paper was straightforward on what mattered. I liked that your team did their homework on prior efforts, used object-based storage (I pushed it in the late '90s), used the most proven filesystem (one I'm using right now), kept data structures fairly simple for modern stuff, and wisely used compaction. All good stuff.
I actually do have some fairly general ideas on where to go from there. I'd probably charge Facebook for the info, though, given the increase in performance and savings they might generate. I might contact you in the future about that.
If true, that would be an epic burn of my comment. Yet the only name that kept popping up belongs to someone whose LinkedIn doesn't mention Google. He worked at Microsoft, Amazon, etc. He might not be the main inventor, though. Who are you referring to?