Update about the October 4th outage (fb.com)
274 points by ve55 on Oct 5, 2021 | 224 comments


It just occurred to me to wonder if Facebook has a Twitter account and if they used it to update people about the outage. It turns out they do, and they did, which makes sense. Boy, it must have been galling to have to use a competing communication network to tell people that your network is down.

It looks like Zuckerberg doesn't have a personal Twitter though, nor does Jack Dorsey have a public Facebook page (or they're set not to show up in search).


> It looks like Zuckerberg doesn't have a personal Twitter though

He does: https://twitter.com/finkd


His LinkedIn photo used to be this really awkward laptop camera photo of roughly this face: (-_-)

It was amazing. I’m sad he removed it.


Ah, I saw that one, but it wasn't verified so I figured it was an imposter. It has only a handful of tweets from 2009 and 1 from 2012, but it could really be him, I suppose.


Yeah, that's kinda sus.


They have an official account. https://twitter.com/Facebook/status/1445061804636479493

hint: "some people".


I’m not sure they are competing though. They serve different purposes and co-exist pretty well together.


Gotta love how painfully vague this is. Sounds like a PR piece for investors, not an engineering blog piece.


I think you need to re-adjust your expectations, it's not reasonable to have a fully fleshed out RCA blog post available within hours of incident resolution. Most other cloud providers take a few days for theirs.


I mean, not an RCA per se, but info more akin to cloudflare's blog post would be v welcome IMHO: https://blog.cloudflare.com/october-2021-facebook-outage/


Both posts have essentially the same info - the fb one just didn't include an explainer on how the internet works.


The Cloudflare post includes graphs depicting how the issue looked downstream and a rough initial timeline for the incident. The Facebook post says basically nothing more than that there was a networking mistake.

I hope we get something more substantial and informative than that over the next couple of weeks, but it doesn't seem (at least from my searching) that Facebook is in the business of publicly posting in-depth post-mortems for their outages, which I personally find unfortunate.


Cloudflare can scratch the surface of the issue and it wouldn't matter; it is a content marketing piece after all. Facebook, otoh, needs to be thorough.


How tech savvy are people who pay Cloudflare money?

vs

How tech savvy are people that Facebook profits from?

Gotta target your audience, every communication is PR...


It’s not reasonable to demand any details at all, it’s nice of them to notify people of what went wrong but it really is none of our business.


On the off-chance this isn't sarcasm, Facebook's routing shenanigans slowed down the entire internet. Not to mention that they're a publicly traded company, and one which has gone out of its way to assume an infrastructure role. They don't have a right to privacy here, and we are all owed an explanation.


I don't want an explanation nor do I care, Facebook could disappear tomorrow like all the other networks before it and it wouldn't make a dent in my day.


Do you honestly believe that Facebook and its subsidiaries don't have a major impact on the world?


Whether we like it or not, all three platforms are relied upon by hundreds of millions of people and businesses every day for communication.

I'm sure the world would quickly adapt by re-adopting these things called "websites" and "email" but in the meantime, it's highly self-centered to think this "didn't matter".


Okay? Why are you commenting here then?


Or else? You will angrily stamp your foot? Start an e-petition?


This comment thread is about whether we ought to receive an explanation, not the practical likelihood of getting one.


> Facebook's routing shenanigans slowed down the entire internet

This is Hacker News, so the distinction between network performance, server performance and application performance should matter.

"The Internet" did not slow down. "The Internet" infact probably had more available capacity as a result of Facebook's outage, as all those bits of outrage and cats ceased to be transferred for the duration.

Some applications may have seen performance hits, as a result of poorly thought out dependencies on an external service without graceful failure.

Some applications may have seen increased load and suffered due to server resourcing constraints, caused by applications like the above failing to fail gracefully, and instead polling more aggressively.

> They don't have a right to privacy here, and we are all owed an explanation

Morally / ethically, you're right. The fact that Facebook exists in its current form tells me that morals and ethics aren't particularly important to the real world.


Perhaps you're unaware that billions of devices attempting to resolve Facebook's unresolvable domains effectively DDoS-ed the DNS system? It most certainly did slow down big chunks of the internet which otherwise had nothing to do with Facebook.

https://www.theverge.com/2021/10/4/22709123/facebook-outage-...


> Some applications may have seen increased load and suffered due to server resourcing constraints, caused by applications like the above failing to fail gracefully, and instead polling more aggressively.

I had Cloudflare's woes in mind when I wrote that.


So your point is that it didn't slow down the entire Internet, only the parts of the Internet that use DNS (damn near all of them)?


Based on cloudflare's own blog[0], 1% of queries went unanswered for about 10 mins, and were significantly delayed for a few hours. 5% of queries were delayed for a couple of hours. 95% of queries continued to be resolved as normal.

This is referring explicitly to users of 1.1.1.1, which is likely not the same infrastructure as domains hosted by cloudflare dns.

[0] https://blog.cloudflare.com/october-2021-facebook-outage/


Disagree -- it's here to establish something that a lot of people have been speculating about, which is whether it's hacking-related. It doesn't say much because its purpose is to deliver a single bit of information: { hacking: boolean }


It's less vague than you realize. It points out that the problem was within Facebook's network between its datacenters. This not only suggests it's related to express backbone, but also suggests that the DNS BGP withdrawal which Cloudflare observed was not the primary issue.

It's not a full root cause analysis, to be sure, and leaves many open questions, but I definitely wouldn't describe it as painfully vague.


A point of distinction - there is no "DNS BGP withdrawal".

DNS is related to BGP only in that, without the right BGP routes in the routers, no packets can get to the Facebook networks and thus the Facebook DNS servers.

That their DNS servers were taken out was a side effect of the root issue - they withdrew all the routes to their networks from the rest of the Internet.

Not picking on you - but there has been a lot of confusion around DNS that is mostly a red herring and people should just drop it from the conversation. Everything on facebook networks disappeared, not just DNS. The main issue is they effectively took a pair of scissors to every one of their internet connections - d'oh!


This is not what we observed. Less specific routes were still present in routing tables, but the more specifics that routed DNS traffic disappeared.


FB did not totally disconnect from the internet as far as I observed. They broke their backbone (per the blog post, as well as their edge returning 503s during the incident), and what I saw was FB withdrawing the specific BGP announcements which covered their DNS servers. I think this is consistent with what Cloudflare actually said, although not what has been often repeated since then.


RCAs take time. It's best to issue vague statements during and right after an incident rather than make guesses.


> It's best to issue vague statements during and right after an incident rather than make guesses.

Why? Why couldn't you just post that the RCA is still ongoing and that proper updates will follow? Otherwise all you get is meaningless fluff.


Well, because your initial assumptions can turn out to be wrong, and you may spread officially false information.

This is not meaningless fluff. It may not provide info to technical people, but it is valid info to other people, as others noted (was hacked: yes/no).

Getting down to the root cause takes time. It is usually multiple things at once that caused X to happen. And then they must also decide how to prevent X from happening again. All that must be written into the RCA. It takes days, not hours.

An example from my life: a service has intermittent disruptions. Antivirus activity correlated 100% with the disruptions. Upon further investigation it turns out that the AV was just doing its job when there was less load. (And before anyone points out why on earth there is AV on such a service: well, because it deals with user-uploaded files.)

So what should I have called out - AV is the guilty one? And then say: oh, no, false info.


Not worth clicking really, everything is in the url.


It's important for many stakeholders to understand it wasn’t a hack/exploit or malicious third party or malicious insider.

It's much better for some random committee in Congress to debate antitrust forever, instead of bigger committees and agencies debating national security threats.


It clearly is a PR piece for investors and customers. And that's ok, not everything is an eng blog.


Pointing out that this is published under _engineering_.fb.com.


Even though the angle grinder story wasn’t accurate, it’d still be interesting to know what percentage of the time to fix the outage was spent on regaining physical access:

https://mobile.twitter.com/mikeisaac/status/1445196576956162...


I worked with a network engineer who misconfigured a router that was connecting a bank to its DR site. The engineer had to drive across town to manually patch into the router to fix it.

DR downtime was about an hour, but the bank fired him anyway.

Given that Zuck lost a substantial amount of money, I wonder if the engineer faced any ramifications.

Sidenote: I asked the bank infrastructure team why the DR site was in the same earthquake zone, and they thought I was crazy. They said if there's an earthquake we'll have bigger problems to deal with.


"I worked with a network engineer who misconfigured a router that was connecting a bank to it's DR site. The engineer had to drive across town to manually patch into the router to fix it. DR downtime was about an hour, but the bank fired him anyway." so prod wasn't down and he fixed it in a hour and they fired the guy who knew how to fix such things so quickly. Idiot manager at the bank.


Had a DBA once who was playing around with database projects in visual studio and he managed to hose the production database in the course of it. This caused our entire system to go down.

Prostrate, he came before the COO expecting to be canned with much malice. The COO just asked if he learned his lesson and said all is forgiven.


I agree it was very heavy handed, but I suspect there was more at play (not the first mistake, and some regulatory reporting that may have looked bad for higher ups)


Facebook has a very healthy approach to incident response (one of the reasons it's so rare for the site to go down at all despite the enormous traffic and daily code pushes).

Unless there was some kind of nefarious intent, it's very unlikely anyone will be 'punished'. The likely ramifications will be around changes to processes, tests, automations, and fallbacks to 1) prevent the root sequence of events from happening again and 2) make it easier to recover from similar classes of problems in the future.


I've never understood companies that fire individuals when policies were followed and an incident happened. Or, when no policies existed. Or, when policies are routinely bypassed.

Organizational failures require organizational solutions. That seems pretty obvious.


100%! Unfortunately even with 60+ years of us formally knowing this-- at least since WWII-- it's very difficult to fight the urge to blame and punish people for incidents.

In many ways we're wired to do it and it FEELS GOOD to do it. Even the industries that are championed for focusing on organizational process over human blame (ex: airlines) are often lulled into initially falling back on the emotional knee-jerk of "pilot error" (see: the early days of the 737 max debacle).

Companies have to be very intentional, usually top-down, about focusing on the context that allowed humans to fail instead of the human themself. That's often easier said than done.


Harsh. Unless there is more to the story being fired for a mistake like that is ridiculous. Everyone fucks up occasionally, and on the scale of fuck ups I've certainly done worse than a 1hr DR site outage, as I'm sure pretty much anyone who's ever run infrastructure has. A consistent pattern of fucking up is grounds for termination, but not any one off instance unless there was an extreme level of negligence on display.

> Sidenote: I asked the bank infrastructure team why the DR site was in the same earthquake zone, and they thought I was crazy. They said if there's an earthquake we'll have bigger problems to deal with.

I bring this sort of thing up all the time in disaster planning. There are scales of disaster so big that business continuity is simply not going to be a priority.


>DR downtime was about an hour, but the bank fired him anyway

The US, not even once.

The guy should have had "reload in 10", an outage window and config review. There must be more to this story than it being a firable offence for causing a P2 outage for an hour.


This is a funny post to have suggested at the bottom of the article: https://engineering.fb.com/2021/08/09/connectivity/backbone-...


Looks like the 'Failure Generator' was brought online.


So this is pure conspiracy theory, but to me this could be a security issue. What if something deep in the core of your infrastructure is compromised? Everything at risk? I'd ask my best engineer, he'd suggest shutting it down, and the best way to do that is to literally pull the plug on what makes you public. Tell everyone we accidentally messed up BGP and that's it.

But yeah, likely not.


Speaking of conspiracies, one that is floating around is that this was done to cover up spread of information around the Pandora Leak.


BGP is public routing information and multiple external sources are able to confirm that aspect of the story. It makes for a good conspiracy theory but the BGP withdrawal is as real as the Moon landing.


> It makes for a good conspiracy theory but the BGP withdrawal is as real as the Moon landing.

I wasn't aware that Stanley Kubrick was now in NetOps. /s


If Facebook had been under actual attack, and defended by taking itself off the internet… that would be the most hands-on approach to security.


Even though it's probably not that, I must admit that I absolutely love reading theories like this.


Many have pointed out that a couple of weeks ago Facebook had a paper out on how they had implemented a fancy new automated system to manage their BGP routes.

Whoops! Never attribute to malice that which can more easily be explained by stupidity and all that.


It was interesting to visit the subreddits of random countries (eg /r/Mongolia) and see the top posts all asking if fb/Insta/WhatsApp being down was local or global. I got the impression this morning that it was only affecting NA and Europe, but it looks like it was totally global. The number of people trying to log in must be staggering.


This more or less confirms what we’ve heard, and I appreciate the speed, but it’s incredibly lame from a details point of view.

Will a real postmortem follow? Or is this the best we are gonna get?


Having been on the team that issued postmortems before, I can tell you that we said as little as possible in as vague a way as possible while meeting our minimum legal requirements. Actual Facebook customers (i.e. those who pay money to Facebook) will get a slightly more detailed release. But the whole goal is to give as little information as possible while appearing to be open. As an engineer that makes me growl, but that's how it is in this litigious world -- don't want to give someone a reason to sue.


Sue for what, that they couldn't do with zero information? I don't buy that excuse. (Not that I blame you for the excuse.)


How would you explain that AWS, GCE, Cloudflare, GitLab publish very detailed post-mortems?


These companies' average users are highly technical developers, while Facebook's users are from a much wider demographic.

It’s not really surprising to me that Facebook is writing comms that most users will understand right now, rather than publishing detailed post-mortems straight away. You have to speak the same language as your users initially in these comms.

Although I wouldn’t be surprised if we see a post-mortem in the days ahead, Facebook probably will want to say why it happened (not just what happened, but why did it not get detected during testing, was the configuration change correct but there is an underlying bug on the routers, etc.) and what new mitigations will be put in place to stop it happening again, and these might not be known yet.


It's a marketing strategy. Their target customer segment is technical. Facebook's and Twitter's, for the most part, aren't.


Yeah, also a chance some eng from AWS/GCP/Azure leaks actual details if they lie or if public statements are inadequate.


Facebook just had an actual whistleblower go on 60-minutes last night.


Are you under the impression that Facebook is a SaaS provider?

Facebook sells ad space, retail. The impact on their customers of the outage is ‘sorry, you couldn’t buy ads for a few hours.’

Demanding a public RCA for this is like demanding an RCA from Costco because they’re out of stock of tinned beans.


WhatsApp for Business and Instagram Shops for example are SaaS offerings.


I don't know but perhaps they excluded damages in their ToS?


Sounds like they could do with some updates to their risk-driven backbone management strategy!

https://engineering.fb.com/2021/08/09/connectivity/backbone-...


The badge story only shows how people are looking for "efficiency" where it doesn't matter, with predictable results.

The badge system should be local to the building. There are few actual reasons (sure, besides "efficiency") why badge control should be centralized. Even fewer reasons for it to be a subdomain of fb. Another option would be to keep the system but make it failsafe (but it seems the newer generation doesn't know what that means). If the network goes down, keep it at the last config. Badge validation should be offline-first, and added/removed badges should be broadcast periodically.

This is the same issue with smartlocks times the number of employees. Do you really want to add another point of failure between yourself and your home?


Having the badge system work from a single point has a lot of advantages for a company like FB: HR can update info from everywhere (they might not be in the same office), you can immediately deny or block a card everywhere, you have an audit log, etc. They're not doing this for fun.

Also, it's likely not on an fb subdomain, but something like office.security.fb-infra.com (example). It just happens that fb-infra.com is using the Facebook DNS servers.


It's still possible to design a badge system on an independent network (think just switched within a building) which syncs a local copy of the authoritative ldap from the corp domain, so your badge readers stay working if the link to the corp domain goes away.

It's just more expensive and another thing to maintain, and still doesn't account for _all_ failure modes (what if you sync really frequently and a bad change was made deleting all accounts?)
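
Roughly, the reader-side logic could look something like this - a toy sketch, not any real vendor's system, with a made-up sync URL and schema:

    # Offline-first badge check: keep a local copy of the badge list,
    # refresh it on a best-effort basis, and keep honoring the last
    # known-good copy if the corporate directory is unreachable.
    import json, sqlite3, urllib.request

    DB = sqlite3.connect("badges.db")
    DB.execute("CREATE TABLE IF NOT EXISTS badges (id TEXT PRIMARY KEY, active INTEGER)")

    def sync_from_directory(url="https://corp.example/badges.json"):  # placeholder URL
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                records = json.load(resp)
            with DB:  # transaction: replace the cache atomically
                DB.execute("DELETE FROM badges")
                DB.executemany("INSERT INTO badges VALUES (?, ?)",
                               [(r["id"], int(r["active"])) for r in records])
        except Exception:
            pass  # network/DNS down: keep the last known-good data

    def badge_allowed(badge_id):
        row = DB.execute("SELECT active FROM badges WHERE id = ?", (badge_id,)).fetchone()
        return bool(row and row[0])

The door decision never blocks on the network; the sync just runs on a timer in the background.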


Sure, this fits under the "few actual reasons" but think for a moment: does it make sense that access to a building is controlled only (keyword here) through a centralized location somewhere? Some DB who knows where? With no fallback?

You might need a break-glass account/badge somewhere. Sure, the angle-grinder works, but probably cost you 2h maybe?

> it's likely not on an fb subdomain, but something like office.security.fb-infra.com

Thanks, yeah, makes sense


The issue I would imagine is that during this outage some people needed badge access that had previously been revoked due to covid. All of the caching doesn't help if your source of truth is offline.


"To all the people and businesses around the world who depend on us, " ... yesterday was another example of why you shouldn't depend on us to such an extent.


On a side note: when I browse to that page in Firefox (92.0.1) from HN I can't go back to HN - the back arrow is disabled. What gives?


Do you have the facebook container extension? That closes the current tab, opens a new tab with a container, then goes to the facebook link. Reopening the last closed tab works for me, although I haven't noticed this before since I always open links in a new tab.


Tried Edge, it works as expected. Tried turning off Facebook Container, it also works as expected. So you are right good Sir!

Still, a bit unexpected behaviour though.


Well that doesn't say a whole lot... I know it is early but they could use a little more detail. Even if it is just a timeline.


It was quite ironic that while every Facebook property was offline there was an immense amount of misinformation about the incident perpetuated across the internet (including right here on HN) which everyone just believed as fact.


I didn’t see any disinformation, just initial reports that it was DNS which were later explained to be caused by BGP.


- It was government intervention

- Facebook was hacked

- They did it on purpose to bury the whistleblower story

- No one could access Facebook offices

- They had to cut open servers with angle grinders

- Disgruntled employees changed DNS records

- Lots of made up numbers for how much money Facebook/the rest of the economy was losing (or gaining)

They probably rushed out this blog post just to dispel some of these rumors.


I entirely believed that the badge readers on doors were broken, I have set systems like that up and they’re network connected. It is not the least bit surprising that BGP chaos would break that. That doesn’t mean people literally couldn’t get in the building, more like security was really annoyed and confused all day.


A New York Times reporter said employees had trouble getting in, and I don't see a retraction. So I believe that is true:

https://twitter.com/sheeraf/status/1445099150316503057


Though I assume "had trouble getting in" means something boring like "was delayed by a few minutes until someone already inside pressed the exit button".

And they probably "fixed it" by putting someone near the door to let people in.


Do we hold journalists posting on Twitter to the same standards as writing in newspapers and their websites? Genuine question


No, but I hold journalists posting on Twitter to a higher standard than someone unknown posting on Twitter. I don't see why she would lie.


Ha - journalists don't seem to have any standards for their websites so why would twitter be any different?


I briefly worked on access control a long time ago - aren’t they supposed to cache credentials locally so that individual doors still function autonomously during a fire / loss of power and comms?


Eh I would assume that would only be necessary if the lock mechanism kept people in, the only place I’ve been that did that was a defense contractor, and even then there were alarmed emergency doors. Everywhere else had either no exit control or trivially bypassed exit control.


Wasn’t defence, but somewhere with highly competitive, commercially sensitive engineered components. You needed a card to open a door in either direction, but exit routes all had “Break to Exit” bypass switches - which lazy engineers often used, particularly if the door wasn’t covered by CCTV.

My curiously downvoted point was, I’m surprised an internet problem would stop a high-security access control system from functioning - they’re supposed to be designed to cope with that and continue to work autonomously in emergencies.


Most of these started out as speculation and jokes. Maybe we have a stupidity (i.e. critical thinking) problem rather than a misinformation problem. Only two of these (angle grinders and access to the offices) can really be considered lies or mistakes by an otherwise reputable source (the NYT) and one of those doesn't seem to have been refuted or retracted yet.


It looks like they retracted/corrected the angle grinder thing, but they haven't issued a retraction for the "access to office" info yet; maybe that means it was correct? (for a section of the FB workforce, of course).


> - It was government intervention

There was an Indian opposition Member of Parliament blaming the current government that it blocked FB due to some protests being held in the Capital.


Not to defend politicians looking to gain political mileage from everything, but this is not a far-fetched claim. The current Indian government has blocked websites and shut down internet access wholesale to towns and even whole states citing many reasons, multiple times in the recent past. The latest was an internet shutdown in multiple Rajasthan towns, apparently to prevent exam takers from cheating on a test.


You say "believed as fact" where I say "speculated because we had nothing to go on".

Sure, my friends and I wondered if it was a malicious insider. It takes surprisingly few people in an organization to cause chaos.

Knowing IoT, it isn't unbelievable that badge readers could be offline.

Knowing division of duties, it isn't hard to believe that the network engineers, domain admins and datacenter ops people may have hustled to a DC to get things back online.

Never did I see a large number of people take anything as fact that didn't seem to be substantiated.


The angle grinders stuff was confirmed by NYTimes. Everything else was rank speculation by people who admitted as such. No one was passing off a hostile attack as fact.


I think you mean repeated uncritically by NYT. It doesn't even make sense, as reported. Yes, maybe to get into the cage in a shared DC, but not individual servers, and it turns out neither was necessary. That's a rehash of an old Google story.

Another bit of misreporting (e.g. The Verge) is that this was "one of its main US data centers in California" when there's no such thing. The seriously big heaps of hardware are elsewhere and they're not shared, so there are no cages to force entry into. I've been in one; no cages in sight. I know which facility they're talking about, and its only distinguishing characteristic is that it's close to where relevant people live.


And then retracted by NYT:

> Correction: Oct. 4, 2021. An earlier version of this article misstated a Facebook team’s means of getting access to server computers at a data center in Santa Clara, Calif. The team did not have to cut through a cage using an industrial angle grinder.


Even if it wasn't true, drilling out locks or cutting them with tools is a pretty standard way of dealing with a lock you can't open. In an emergency where lots of things were broken it would be a perfectly reasonable thing to do. Also a news source with a nontechnical reporter might say the wrong thing that would indicate somebody was opening a server chassis with a powertool.


It could still be any/some of these. Unlikely, but possible. Taking down BGP would just be a cover-up of something bigger. If this was a more honest company I'd call myself a tinfoiler, but since we're dealing with Facebook... ¯\_(ツ)_/¯


Just for the sake of completeness, I saw a tweet from some conspiracy website stating that it was some kind of menace from outer space.


"They had to cut open servers with angle grinders"

I want to believe.


You do know the meaning of jokes and/or ironic comments, right?


Seriously. I also saw lots of posts about how quiet it would be with Facebook down, but I don't think I've ever been exposed to so many stories and so much chatter about Facebook in a single day.


Do you think people on Facebook just talk about facebook??


I work at a different social media company that has had some visible outages. It's always hilarious to see how wrong people are with their confident speculation. It's a good reminder that people online are often full of shit.


I work in video games.

It's amazing how wrong people can be and how confident they are about being right.

Even sometimes fighting _me_ about things _I_ designed and built.

It’s quite sobering; taught me not to believe all the speculation I read.


Just out of curiosity, how often do people claim that your random number generator is broken? And then when you ask why they think that, it's an anecdote about some time when they had incredibly bad luck?


it's more common for people to tell me the tickrate/framerate of my server or talk about how "[I] moved away from amazon to cheaper bare metal servers for cost savings" and such.


Lol I had the exact same lesson too. Saw people spewing falsehoods as facts and being misled by others. I’ve started to take everything I see online with a pinch of salt.


Dang I missed the misinformation.


The problem is not misinformation per se but the largest social media data processor running algorithms boosting such information for profit.


Like what?


Lots of people blaming dns


I think mainstream news sources either used "DNS" as a term for routing internet traffic because most people haven't heard of BGP or come across autonomous systems (AS), or because they genuinely thought reports of routing issues meant it was DNS.


BGP took out their DNS. It isn't entirely wrong.


We almost went down the ‘this is a subterfuge to delete whistleblower evidence’ rabbit hole.


I saw a couple of people clearly guessing something along these lines but none of them seemed to be claiming that it was actually happening, more like “isn’t it convenient that…”


The timing was uncanny. I still don't see a reason why it couldn't have been an intrusion/rogue employee? Like someone had access to a system to push router firmware updates or something?


A rogue employee would have been very easy to detect and that employee would have known this. The core network infrastructure involved is extremely sensitive and is quite unlikely to be accessible in a break in.

Also IIRC an employee was posting on Reddit saying the incident started shortly after a network update was posted this morning.

When you know more about the tech and systems involved a mistake seems infinitely more likely than sabotage.


Many have pointed out that a couple of weeks ago Facebook released a paper talking about a new system to automate the management of their BGP routing.

Seems like the new system having an unanticipated flaw is a far more likely scenario than a malicious actor.

More boring - but usually the boring stuff is the far more reasonable explanation.


I could have believed it given Facebook’s record for unpleasantness, but I think the fact it must have cost them tens of millions in advertising revenue is evidence against this. Even the closest thing we have to a cyberpunk dystopia isn’t going to throw that much money down the crapper to bury bad press.


Did we, really?


There are a lot of comments promoting the idea here: https://news.ycombinator.com/item?id=28751224

Some of them seem to actually say it's likely, not just possible.


>We also have no evidence that user data was compromised as a result of this downtime.

No, that just happens during uptime.


> configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues

This could be anything, potentially.

I'm not very knowledgeable in computer networking, but this could be as trivial as an incorrect update to a DNS record, right?


Backbone routers don't usually deal with hostnames or DNS. This is pretty much saying they done broke BGP. And it sounds like they're saying that they broke it in a way that prevented accessing their data centers from the PoPs, and we know from the long downtime that it prevented accessing the BGP configuration system from darn near anywhere.

It happened to also kill the announcements for anycast DNS.


There seems to be nothing uncertain about the immediate cause of the issue - Facebook revoked all of their BGP routes, and all of their IP addresses couldn't receive packets until they were restored.


Understood.

I had a question: this is only what we can perceive from the outside, through internet routing table entries, right?

Internal to FB, we don't know what had caused issues that led to the BGP UPDATE.

That's kind of what has been confusing me - there's a lot of speculation around FB's data center design and what actually happened, but we actually don't know for sure until they post an RCA - please correct me if I'm wrong here.


You are on the right path. It's entirely possible the BGP routing updates that we witnessed 'out here' on the Internet were simply a side effect of the actual network problem that impacted the company. As toast0 mentioned above it is less likely to be DNS-related just because most of the related services don't tend to build dependencies on DNS (in part for this reason).


They didn't revoke all their routes, FWIW, just a lot of them (including the anycast DNS routes).


What I don't understand is why, when a route is revoked and there is no other route announced, the routing table gets updated at all. It seems like either it's a black hole, or it still works and there was a BGP error (or the route works but the resources aren't present, so traffic would be dropped). What's the reason for designing the system to revoke routes when no new route is announced?

It strikes me that it's like DNS when you get a SERVFAIL: why not try the prior IP address? The similarity in the design here suggests there may be common reasoning?


BGP operates on the principle of using the 'most specific prefix', basically if all else is equal, a route covering fewer IPs is more specific and should be used.

When the announcement is revoked, you fall back to a less specific prefix if present, or your default route.

If you've got a full BGP table, then you tend not to have a useful default route (you should have specific routes for everything) and it might be useful to fall back to the last known value. But many participants have an intentional default route and then get announcements for special traffic --- dropping the announcement would mean sending it on the default route instead. It's hard to know what the right thing to do is, so better to do what you were told by the authority.

DNS is a bit different, but again, the authority told you to use some data and how long to keep it (ttl), if they're not there to tell you a value again later, what else can you do but report an error? Some DNS servers have configurable behavior to continue using old data while fetching new data or when new data is unavailable.

But the expectation is if you can't keep your BGP up and your DNS up, your server probably isn't up either. Note that in this case, bypassing DNS and going to the FB Edge PoPs that were still network available (because of different BGP announcements, that weren't withdrawn) resulted in errors, because they weren't able to connect to the upstream data centers. (Or so it seems)
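
If it helps, here's a toy longest-prefix-match table in Python (documentation prefixes, not Facebook's real announcements) showing that fallback behavior:

    import ipaddress

    table = {
        ipaddress.ip_network("0.0.0.0/0"):       "default via transit",
        ipaddress.ip_network("203.0.113.0/24"):  "coarse route via peer A",
        ipaddress.ip_network("203.0.113.64/26"): "specific route via peer B (think anycast DNS)",
    }

    def lookup(addr):
        addr = ipaddress.ip_address(addr)
        candidates = [net for net in table if addr in net]
        return table[max(candidates, key=lambda net: net.prefixlen)]  # most specific wins

    print(lookup("203.0.113.70"))                       # specific route via peer B
    del table[ipaddress.ip_network("203.0.113.64/26")]  # the /26 is withdrawn
    print(lookup("203.0.113.70"))                       # falls back to the /24
    del table[ipaddress.ip_network("203.0.113.0/24")]   # the /24 goes too
    print(lookup("203.0.113.70"))                       # only the default route is left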


Withdrawing the last route covering an IP range is a legitimate change, indicating that those IP addresses are no longer in use by the ASN. This needs to be supported so that one ASN can withdraw an IP block and transfer it to another ASN.


Basically the same problem as tag soup. As soon as mistakes don't stop things working you get many more mistakes.


It would be interesting to estimate what dollar value can be ascribed to a x-hour FB outage, both in terms of lost ad revenue for FB itself as well missed conversions/revenue for businesses running ads on FB/IG.


Does anyone know if FB's advertisement contracts even have an SLA?

I can completely picture a world in which many people bought some ads yesterday morning (say, to promote an event that occurred yesterday evening), the ads were never displayed to anyone, and FB will keep the money, thank you.


Don’t forget WhatsApp users switching to Signal and possibly never returning


Knowing almost nothing about networking, isn't the way Facebook handles networking somewhat of a monolithic anti-pattern? Why is a single update responsible for taking out multiple services and why wouldn't each product or even each region within each product have their own routes, for resiliency which can then be used to rollout changes slower?

By having a large centralized and monolithic system, aren't they guaranteeing that mistakes cause huge splash damage and don't separate concerns?


BGP has to converge to a single routing table.

You are effectively asking why there is a single routing table for the internet.

To put it in simple terms, having a single routing table is what makes it the internet we can share; otherwise it would just be a bunch of independent networks.


> BGP has to converge to a single routing table.

It certainly does not. If I peer with you, neither of us (generally) announce that route to our other peers, but often announce to our customers. There are many routes that are not visible to everyone, and there is no single routing table for the internet. Each BGP speaker ends up with their own routing table, although there are a lot of similarities.
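
The export side of that, very simplified (ignoring prepending, communities, etc.), is basically:

    # Routes learned from peers or providers are exported only to customers;
    # routes learned from customers are exported to everyone. Relationship
    # labels are from this AS's point of view.
    def export_targets(learned_from, neighbors):
        if learned_from == "customer":
            return list(neighbors)
        return [n for n, rel in neighbors.items() if rel == "customer"]

    neighbors = {"AS_cust1": "customer", "AS_peerX": "peer", "AS_transit": "provider"}
    print(export_targets("peer", neighbors))      # ['AS_cust1'] - not re-announced to other peers
    print(export_targets("customer", neighbors))  # announced to everyone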


I was trying to give a simple explanation for someone who said they don't know networking.

BGP in this context implicitly meant external BGP. Yes, no single router necessarily sees all the routes, but my point was that all routers combined generally see the internet as a single network of networks.

Convergence in this context is how most ASNs will resolve where/how to route a specific ASN's traffic.

It is hard to peer with someone and not trust the routing table they publish; that is why Pakistan could by mistake block YouTube for everyone a few years back.

In this case, if you peered with Facebook and they published incorrect routing for their AS, you would accept it.

This doesn't mean FB couldn't have used multiple ASNs, done some rolling updates, etc.; however, without knowing what exactly FB screwed up for five hours it is hard to say what they could have done differently.


> You are effectively asking is why is there a single routing table for the internet.

No, they're asking why a single set of routers is in charge of announcing BGP routes for all of facebook. If you have multiple ASes, with independent configuration sources and independent routers broadcasting them, it's a lot harder to break everything at once.


Facebook definitely has multiple ASes and likely rolling BGP update strategies already. This isn't the first high profile BGP screwup after all.

However, if even one set of routers were misconfigured and it was announcing incorrect routes for all their ASes as a result of the issue, then their peers will not typically drop that set alone automatically.

BGP doesn't have Paxos/Raft-style smart consensus algorithms; it runs on trust. Either their peers trust what FB publishes or they wouldn't be peering with FB's ASes in the first place.

That's what I meant when I said it comes down to one network web of trust, and if there is a breakdown of that, redundancy cannot typically help.


The idea is that there would be multiple layers preventing one set of routers from announcing routes for "all their ASes", correct or not.

The config database would reject routes for the wrong ASes. The router would reject it. The peers would be told "add filters so you only accept these ASes from these routers".

Maybe they have all that and it somehow broke anyway? But what it looks like from the outside is that all the ASes are controlled by the same system.

> their peers will not typically drop that set alone automatically.

I'm not sure what this sentence means.


> config database would reject

> router would reject it.

Any of this hardware could have bugs; if one of them announces wrongly, it will be propagated wrongly by all other ASes peering with them, and on to the next level, and so on. That is the point: at this level it is possible to fuck up globally.

> peers would be told "add filters so you only accept these ASes from these routers".

There are hundreds of ISPs, and all of them cannot peer directly with each other. Routes are propagated downstream and upstream; it is a web of networks running on trust.

While filtering is built into most implementations (sadly not the protocol itself), in practice ISPs have no easy way to determine which AS can actually originate which other AS's traffic, so they don't actually implement a lot of filtering. Remember traffic can have more than 2 hops. Effectively that means you would be routing traffic for an AS 3/4 hops away. Neither you nor your peer would know anything about it or whether you can trust it, etc.

Even if some ISPs do drop/block the announcements, unless every single AS also implements the block there won't be an impact. Traffic would route through ASes which don't have filtering and announce the routes incorrectly. For example, say AT&T blocks an incorrectly announced FB route but British Telecom does not; BGP is designed to assume that FB has lost peering with AT&T and route all traffic for FB via British Telecom.

If filtering were robustly possible we wouldn't have periodic BGP hijacking incidents, whether accidental or malicious: the famous Pakistan Telecom YouTube hijacking [2], a major route leak as recently as April 2021 [3], and other incidents over the last few years, usually involving authoritarian governments (such as China/Russia etc.) but with impact well beyond their own networks.

[1] http://www.bgpexpert.com/article.php?article=145

[2] https://www.ripe.net/publications/news/industry-developments...

[3] https://blog.apnic.net/2021/04/26/a-major-bgp-route-leak-by-...


Without in depth technical details of what exactly happened and how, it’s hard to say. With what we have heard and seen the answer is probably that there are some opportunities to build in more resiliency, but ultimately the kind of failure that causes domino effects is impossible to eliminate completely.

This is partially due to design limitations of BGP, and partially due to it being nearly impossible to eliminate all sources of large scale failures in any highly complex system, and increasing the uptime of a system that already has a few nines costs an additional order of magnitude for each new nine. At some point you set your risk tolerance and have catastrophic failures now and then.


You are correct, but this problem isn’t Facebook’s doing. This is just how BGP works. Even big players like Verizon can screw it up and break the Internet.

For all its flaws, BGP is the piece of the Internet that truly makes it decentralized. Without it, there would be a centralized routing table of some sort.


Quite the opposite. Back in the day you would've had to log in to dozens if not hundreds of routers individually to push the change, and it likely would've been caught after screwing up the first one. This is the result of SDN (software defined networking) and being able to push a change globally from one command.

I recall major ISPs screwing up their routing tables in the past, but never globally on this level.



https://www.itproportal.com/news/misconfigured-centurylink-d...

https://www.bleepingcomputer.com/news/technology/ibm-cloud-g... (this one isn't clear, maybe BGP hijacking, and if so, not sure who the responsible party was)

https://www.catchpoint.com/blog/vodafone-idea-bgp-leak (not sure how major this one was)

You can practically search ISP bgp outage and get news about the last couple times they screwed up BGP and caused a big problem. Or service BGP and get a 50/50 chance of the service screwing up BGP or an ISP/country hijacking their routes and causing a big problem.

BGP is one of the best ways to break things at scale.


Ultimately I think you are right, but I don't think this is the right way to think about the question. It's not as though they thought they were creating a monolith on purpose by following a pattern for creating monoliths. They thought they were following best practices and building a distributed system, without single-points of failure. I think a more productive way to think about it might be what's the psychology and practices that led them to be wrong? My best guess is that at some points in the stack DNS is opaque, and at other points, particularly for trained network people, it becomes transparent (i.e. invisible) and disappears (like a mirage), so then they make both the network and physical locks dependent on it (and BGP) and... lock themselves out when it fails.


If you want one facebook.com entity (that in turn controls Instagram, WhatsApp, etc.), then I get why a single AS change would take it all out globally.

The anti-anti-pattern would be to have country-specific ASes, kind of like a franchised restaurant pattern, where each country's version of fb owns its own AS that loosely forms back into the Facebook mothership org, but from an infrastructure point of view points to its own set of AS netspace.


Don't AS "things" only affect the IP addresses? Surely you could have DNS records for facebook.com pointing to IP addresses in multiple ASes?


One might argue that the fact that you care about things like this is why you don’t run Facebook.


Not an antipattern. Route announcements/withdrawals are, well, serious things and should likely flow through a single source. The idea of a 'microservice'-based route update makes any network person uneasy.


I mean if you own 3 independent businesses, each of which would be worth over $100 billion, and you break all of them simultaneously for an entire day, including your internal email and your badge entry systems, yes, that is definitionally an anti-pattern.


Surely “anti-pattern” doesn’t just mean “anything with negative outcomes.” Couldn’t this just be a really big mistake that isn’t indicative of an anti-pattern?


Single point of failure is the oldest anti-pattern in the book. Goes back 5000 years.


Is there evidence to suggest that redundancy could have solved this problem? If this was simply a bad configuration that was propagated (as intended) to a large number of systems, I would hardly call that a single point of failure.


Yeah, I don't think redundancy would have prevented this. What could prevent this is stuff like canaries. Or automatic rollbacks for this kind of change, with the change having to be submitted in two steps, one where it is applied, and one where it is manually confirmed that the change worked (or automatically by a system outside the network). If no confirmation is given after a certain time, the system should revert to the old configuration.

Of course, we can't really know for sure until/if they release exactly what caused this issue.
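
The confirm-or-revert idea is the same shape as the "commit confirmed" feature some routers have; a rough sketch of the pattern (placeholder functions, nothing FB-specific):

    import threading

    class ConfirmOrRevert:
        """Apply a change, then auto-revert unless it is confirmed in time."""
        def __init__(self, apply_config, timeout_s=300):
            self.apply_config = apply_config
            self.timeout_s = timeout_s
            self._timer = None

        def apply(self, new_cfg, old_cfg):
            self.apply_config(new_cfg)
            # If nobody (human or an external health checker) confirms in time,
            # roll back to the old configuration automatically.
            self._timer = threading.Timer(self.timeout_s, self.apply_config, args=[old_cfg])
            self._timer.start()

        def confirm(self):
            if self._timer:
                self._timer.cancel()  # change was verified from outside the network

    # guard = ConfirmOrRevert(push_to_routers, timeout_s=600)   # placeholders
    # guard.apply(candidate_cfg, running_cfg)
    # ... run health checks from a vantage point outside the backbone ...
    # guard.confirm()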


If it was a single change, deployed once that caused this, then it’s a single point of failure.


One of the usual justifications for acquisitions is to save money using common infrastructure. Instagram and WhatsApp haven't been independent for a while.


Just out of curiosity, does Facebook have a status page? Like http://status.twitter.com?


https://status.fb.com/

Although it only covers their API and business apps, not the site itself.


It was also down during the outage.


That’s a bit sad


For a status page to be actually independent, it needs to have all its requirements hosted on other infrastructure. fb.com authoritative DNS is the same as facebook.com, so it goes down when (FB) DNS goes down (and DNS goes down when BGP is broken, apparently).

It looks like the status page is hosted on CloudFront though, so it got part of the way. (Of course, the other question is if it was updatable / updated during the outage)
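
If you want to sanity-check that kind of dependency from the outside, a quick look at the NS records is enough (assumes the dnspython package; this only inspects the delegations, nothing more):

    import dns.resolver  # pip install dnspython

    def nameservers(domain):
        return {r.target.to_text().lower() for r in dns.resolver.resolve(domain, "NS")}

    shared = nameservers("facebook.com") & nameservers("fb.com")
    print("shared authoritative NS:", shared)
    # a non-empty overlap means the status domain's DNS fails together
    # with the main domain's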


Pardon me if it's a stupid question, but out of curiosity:

Is there any way to keep DNS up in case BGP goes down for any reason? Like a fallback nameserver hosted elsewhere/not affected by Facebook's ASs?

Is it technically impossible or did Facebook just assume something like yesterday would never happen and kept things simple instead of complicating things?


BGP didn't "go down" - they erroneously removed all routes between the Internet and several facebook internal networks via BGP. BGP was the instrument of their destruction, but not the source. Someone or something told BGP to do that; whatever that was is the cause of the issue.

At least one of those networks they accidentally removed also happened to contain the DNS servers; DNS being unavailable was a symptom - but not part of the root problem. Any focus on DNS at this point is a red herring.

Think of routes as street directions - they tell routers where to ship packets. If you erase all your addresses and the directions to them from the outside world at large, then there literally is no way for network packets to get from the global Internet to Facebook's networks (where I imagine the DNS servers were up and probably twiddling their thumbs wondering where everyone went).

An easier way to think of it - they essentially took a pair of scissors and cut the cable connections to the Internet - which is why it was so catastrophic.

The only way to mitigate that is to have an identical infrastructure managed by different tooling, so a bad configuration setting from one environment wouldn't pollute the second in the same way. Not exactly an easy thing to do, and it might cause more problems than it's worth. And you would have to do that for all services, not just DNS. Let's say Facebook used Cloudflare for their DNS. Great - DNS can resolve your request for fb.com to the IP address of the Facebook datacenter - there still is no path for your packets to get to that Facebook datacenter, because they accidentally purged the routes to their networks.

It's easier to just not cut your connection to the Internet :) I'm sure there are all kinds of internal discussions picking this incident apart and formulating ways to either prevent it, or more realistically - have improved procedures to speed recovery when it inevitably happens again. BGP is not known for its inherent robustness or security. But since it's at the core of the Internet, any changes to it would have to be done on a massive internet-wide scale in perfect unison or the "cure" would be a lot worse than the current problems with it.

Murphy was indeed an optimist! (search "Murphy's Law" for those unfamiliar with the idiom)


Yes, but...

If it's a FB managed server, run on someone else's network, you still have a lot of the FB software risk (FB's software stack and development mantra make it easy to push changes, some of which break everything, including the ability to push further changes); even if not FB, there's a similar risk.

If it's not a FB managed server, like a 3rd party DNS provider, it's difficult to get that synchronized considering all the fun geographic loadbalancing FB is doing at the DNS level. That's generally hard once you start doing this; and it's why you don't see many dual-provider DNS setups.

Really, the status page should be not on a core domain, so that the DNS can just be external.

FB DNS breaking yesterday almost doesn't matter in the scheme of things, because the BGP breakage broke everything anyway. Would it have been a bit nicer to get http error messages instead of DNS not found messages, sure; but mostly nothing was working anyway.


It’s definitely technically possible to have secondaries on a separate network that do zone AXFRs from the primary. That’s not to imply it’s trivial/easy at FB’s scale (query volume) or topology complexity (as in GSLB).


Looks like it! I remember trying to access this site during the down time and it turns out it’s not accessible at the moment


The first thing people here thought of was that it was the government denying access to these websites, as it usually does for a number of reasons.


It was pretty quickly deemed a global phenomenon, so no comments on posts about it said that. Also, enough people here on HN know how to investigate dns and bgp to have found the problem within the first 30 minutes, first with DNS then the revelation that every BGP route associated with them was withdrawn.


I have been unclear. By "here", I meant the country I am in.


Sorry, thought you meant HN.


No, I am sorry as my reply was ambiguous and unclear.


Around the turn of the century, in a network the size of Europe, we had OOB comms to the core routers via ISDN/POTS. We experimented with mobile phones in the racks as well, much to the chagrin of the old telco guys running the PoPs.


The mobile WhatsApp app should notify you that the WhatsApp servers are down and not let you just send messages that won't arrive for six hours.


The app is designed under the assumption that Facebook servers are never down. If you can't reach the servers, the problem is assumed to be client-side, in which case they have decided the best UI is to keep retrying (not unreasonably in a mobile context). The only way to disambiguate "no internet service" (extremely common) from "Facebook dropped off the internet" (black-swan rare) is to ping some other, third-party service. Unless that third-party service has infrastructure as good as Facebook's, it will drown in pings the moment Facebook genuinely goes offline. I can see why Facebook wouldn't want to open that can of worms, if they even envisaged this failure mode (unlikely considering the chaos it caused).
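
A sketch of that disambiguation, with placeholder hostnames (a real client would also have to worry about captive portals and about rate-limiting the probe):

    import socket

    def tcp_reachable(host, port=443, timeout=3):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def diagnose(service_host="chat.example.net", probe_host="example.com"):
        if tcp_reachable(service_host):
            return "service reachable"
        if tcp_reachable(probe_host):
            return "service appears down (your network is otherwise fine)"
        return "no internet connectivity"

    print(diagnose())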


>The app is designed under the assumption that Facebook servers are never down.

Which was and is a lame assumption. Stuff happens. SMTP wouldn't even be fazed by this; it would just pick up where it left off.

I've seen far too many applications fail in bizarre ways because people make unrealistic assumptions like "X will ALWAYS be there". Sure, it's highly unlikely, but when you have multiple things making the same dumb assumption, then on the inevitable day when X is suddenly no longer there you start to get cascading effects: Y relied on something that relied on X always being there, so now Y fails, and then something dependent in the same way on Y unexpectedly fails, and so on.

One should never assume that anything will "always" be available. That's an incredibly unrealistic assumption; and the more interconnected things become, the chances of these really nasty dependency chains/cascade failures skyrocket - leading to far worse outages and longer recovery times.


How would you address the issue?


Good points. Then it should just tell the user that they appear to have no internet service.


Yup, they should. I couldn't send messages either and thought my mobile had some issues, so I tried rebooting it.


Any FB throwaway know if someone got fired for this?


I think it would be extremely unusual and counterproductive for someone in the trenches to get fired about this, as it is clearly a failure of procedure that this was even possible and so hard to recover from. Large companies I am aware of have a no-blame postmortem culture around this stuff. There may be people suffering consequences at a higher level in the SRE division, though I doubt this will happen in a timeframe of hours after the outage.


The Bootcamp training at FB explicitly mentions that such things are not a fire-able offense - the attitude is around learning - if you managed to bring everything down, let’s learn together how you managed to do this… :)


I don't think this is the case. Wasn't TechLead fired for SEV events?


Where did you hear that from? He doesn’t even say that in his video, he says he was fired for having side income on YouTube.


Sorry, I was thinking about the engineer who got PIPd and committed suicide.


If it's the intern, talk about a learning moment


An intern bringing Facebook.com properties down would be a learning moment for the company and for the world


I heard rumors it was triggered by a PR auto-merge bot.


That bot is definitely getting fired.


Why is this non-post on the front page? It's PR only.


Move fast and

NO CARRIER


So their actual deployment process is quite rigorous and should have a tight blast radius. After lots of emulated and canary testing, their deployments are rolled out in phases over weeks. I don't see how a bad push could have done what happened yesterday.

I found a paper that describes the process in detail. See pages 10-11:

https://web.archive.org/web/20211005034928/https://research....

    Phase  Specification
    P1     Small number of RSWs in a random DC
    P2     Small number of RSWs (> P1) in another random DC
    P3     Small fraction of switches in all tiers in DC serving web traffic
    P4     10% of switches across DCs (to account for site differences)
    P5     20% of switches across DCs
    P6     Global push to all switches

We classify upgrades in two classes: disruptive and non-disruptive, depending on if the upgrade affects existing forwarding state on the switch. Most upgrades in the data center are non-disruptive (performance optimizations, integration with other systems, etc.). To minimize routing instabilities during non-disruptive upgrades, we use BGP graceful restart (GR) [8]. When a switch is being upgraded, GR ensures that its peers do not delete existing routes for a period of time during which the switch’s BGP agent/config is upgraded. The switch then comes up, re-establishes the sessions with its peers and re-advertises routes. Since the upgrade is non-disruptive, the peers’ forwarding state are unchanged.

Without GR, the peers would think the switch is down, and withdraw routes through that switch, only to re-advertise them when the switch comes back up after the upgrade. Disruptive upgrades (e.g., changes in policy affecting existing switch forwarding state) would trigger new advertisements/withdrawals to switches, and BGP re-convergence would occur subsequently. During this period, production traffic could be dropped or take longer paths causing increased latencies. Thus, if the binary or configuration change is disruptive, we drain (§3) and upgrade the device without impacting production traffic. Draining a device entails moving production traffic away from the device and reducing effective capacity in the network. Thus, we pool disruptive changes and upgrade the drained device at once instead of draining the device for each individual upgrade.

Push Phases. Our push plan comprises six phases P1-P6 performed sequentially to apply the upgrades to agent/config in production gradually.

We describe the specification of the 6 phases in Table 4. In each phase, the push engine randomly selects a certain number of switches based on the phase’s specification. After selection, the push engine upgrades these switches and restarts BGP on these switches. Our 6 push phases are to progressively increase scope of deployment with the last phase being the global push to all switches. P1-P5 can be construed as extensive testing phases: P1 and P2 modify a small number of rack switches to start the push. P3 is our first major deployment phase to all tiers in the topology.

We choose a single data center which serves web traffic because our web applications have provisions such as load balancing to mitigate failures. Thus, failures in P3 have less impact to our services. To assess if our upgrade is safe in more diverse settings, P4 and P5 upgrade a significant fraction of our switches across different data center regions which serve different kinds of traffic workloads. Even if catastrophic outages occur during P4 or P5, we would still be able to achieve high performance connectivity due to the in-built redundancy in the network topology and our backup path policies—switches running the stable BGP agent/config would re-converge quickly to reduce impact of the outage. Finally, in P6, we upgrade the rest of the switches in all data centers.

Figure 7 shows the timeline of push releases over a 12 month period. We achieved 9 successful pushes of our BGP agent to production. On average, each push takes 2-3 weeks.
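
If I'm reading Table 4 right, the phase selection boils down to roughly the following. This is only a sketch of the idea; the inventory format, counts, and fractions are my own placeholders, not the actual push engine.

    import random

    def select_targets(phase: str, switches: list[dict]) -> list[dict]:
        """switches: [{"id": ..., "dc": ..., "tier": ..., "serves_web": bool}, ...]"""
        dcs = sorted({s["dc"] for s in switches})

        def rsws_in(dc):
            return [s for s in switches if s["dc"] == dc and s["tier"] == "RSW"]

        if phase in ("P1", "P2"):                          # small number of RSWs in a random DC
            pool = rsws_in(random.choice(dcs))
            k = 5 if phase == "P1" else 20                 # placeholder counts; per the table P2 also uses a different DC
            return random.sample(pool, min(k, len(pool)))
        if phase == "P3":                                  # small fraction, all tiers, one web-serving DC
            web_dc = random.choice([s["dc"] for s in switches if s["serves_web"]])
            pool = [s for s in switches if s["dc"] == web_dc]
            return random.sample(pool, max(1, len(pool) // 20))
        if phase == "P4":                                  # 10% of switches across DCs
            return random.sample(switches, max(1, len(switches) // 10))
        if phase == "P5":                                  # 20% of switches across DCs
            return random.sample(switches, max(1, len(switches) // 5))
        return list(switches)                              # P6: global push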


If they have such a rigorous release process, what could have caused all of the DNS records to get wiped?


"Are you sure you want to remove ALL routes to AS32934? Type YES to confirm."

Hey what is our internal BGP called again? AS32934?

"Yeah"

"OOK."


"Post hoc ergo propter hoc"


Remember, remember, the 4th of October.


Yes. It is true. If you enter Facebook into Facebook. It will break the internet.


Could you please be more clear about "no evidence that user data was compromised"?


Do you think DLT/blockchain could make this less likely to happen again in the future?


This is uh, no offense, but.. you are a robot, aren't you?


nope


Reading this statement all I can think of is this scene https://www.youtube.com/watch?v=15HTd4Um1m4


One of the things they restored was annoying sounds in the app every time I tap anything. Who knew that was DNS related!


I’ll take my downvotes but I’d be happy for anyone to explain why.


TL;DR

We YOLO'd our BGP experiment to prod. It failed.

https://web.archive.org/web/20210626191032/https://engineeri...


I thought DARPA designed the internet to survive nuclear war - no single point of failure - clearly Facebook's network breaks that rule. They need a DNS of last resort that doesn't update fast.
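
Part of what seems to have hurt here is fate sharing: facebook.com's authoritative nameservers (a.ns through d.ns.facebook.com) sit inside Facebook's own network, so when the routes went, the DNS went with them. A quick, hedged way to eyeball that for any zone, using the third-party dnspython package (domain is just an example):

    import dns.resolver   # third-party: pip install dnspython

    def ns_fate_sharing(zone: str) -> None:
        answer = dns.resolver.resolve(zone, "NS")
        print(f"{zone}: NS record TTL = {answer.rrset.ttl}s")
        for record in answer:
            host = str(record.target).rstrip(".")
            note = "inside the zone it serves" if host.endswith(zone) else "external"
            print(f"  {host} ({note})")

    ns_fate_sharing("facebook.com")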


Far easier to make a system resilient to bombing than to a bad configuration update.


On your "no single point of failure" point: that might have been true once, but it certainly isn't these days.

Networks have grown so large and complex that the only reasonable way of managing them is through SDN, and a small mistake in configuration can have a cascading effect on the whole infrastructure.

That's also true for the entire (western) internet. We've ended up with a centralized market where a few key players, e.g. cloud providers/CDNs/DNS (Amazon/Google/Microsoft/Akamai/Fastly/Cloudflare), can easily break large parts of the internet. See the Akamai outage in July.


Such BS. FB imagines they are their own internet, yet they failed in the most miserable way because they need the actual internet to communicate.


The best course of action is to split FB into separate companies. It is already neatly divided between Instagram, WhatsApp, and legacy Facebook. That would be the best way for the government to avoid disruptions like this.


Of all the reasons to break up big companies, protecting consumers from Instagram downtime is not one of them.


> We also have no evidence that user data was compromised as a result of this downtime.

I am not sure why they had to mention this specifically. This makes it sound like an external attack.


There were rumors early in the downtime that it was the responsibility of various outside groups. Saying, "no, your data was not impacted" is pretty standard in light of those rumors, even if they weren't the main ones spreading around after the initial reports.


It doesn't make it sound like an attack; it's standard boilerplate to dispel any worries. It's natural for anyone to wonder whether downtime means data loss, so it's natural to reassure people that that wasn't the case.


It has been painfully admitted by the Facebook mafia that they know they are the internet and are farming the data of an entire civilisation; further evidence that this deep integration of their services needs to be broken up.

After all the scandals, leaks, whistleblowers, etc., it would take more than a DNS record wipe to take down the Facebook mafia.



