WAN router IP address change blamed for global Microsoft 365 outage

kloch · on Jan 30, 2023

> As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them. "The command that caused the issue has different behaviors on different network devices, and the command had not been vetted using our full qualification process on the router on which it was executed."

From this it sounds like they might have changed the primary loopback IP, which by default is the "router-id" for various routing protocols, causing the entire network to have to reconverge. You can override the default router-id with an explicit address that does not depend on lo0 but lots of networks don't do that.

It's extremely uncommon to change the primary loopback address. It's less uncommon to add an additional one but as the article says that syntax varies by vendor: Juniper will add as additional by default, Cisco and Arista will replace the existing primary one (IPv4) unless you include the "secondary" keyword...

kemals · on Jan 30, 2023

This was a rather interesting event. In general, changing the IP address (even the loopback address) shouldn't have caused it from the BGP perspective. For example, if you were to change the IP address of BGP enabled router that has multiple BGP sessions, all other routers tore down the sessions to it, and withdrew the prefixes. BGP reconverge events take time. However, less than this took (90+ minutes and then a few more hours until __full__ recovery).

This seems like one of the events in which they changed IP on Route Reflector routers that were pretty busy, which would cause reconvergence and CPU spikes for all routers that it had sessions with. Also, there was a lot of volatility, as part of which re-advertisements were happening continuously. They also attempted rollback, which caused reverse operation, which triggered reconvergence. The other scenario is doing this change on the SDN controller, which affected all other routers.

More details: https://www.thousandeyes.com/blog/microsoft-outage-analysis-... https://www.thousandeyes.com/resources/na-microsoft-outage-a...

AdamJacobMuller · on Jan 30, 2023

I feel like they intended to /ADD/ a new loopback IP and in the process accidentally removed the existing one and replaced because I think anyone intentionally changing the loopback IP knows it's going to reset all bgp sessions. I think more modern cisco/arista platforms now "secondary" by default and perhaps that is what bit them?

jimmyl02 · on Jan 30, 2023

It seems that in modern large scale systems networking continues to be one of the few things were a a seemingly small and inconsequential change can cause entire cloud providers and highly redundant systems to go down. It makes sense as networking is the fabric connecting all systems together but each time an incident like this occurs I'm reminded of just how important networking is.

Network engineers and the people handling network ops always amaze me.

iso1631 · on Jan 30, 2023

IME Network engineers put too much faith in vendors. They think "the vendor says this is a resilient virtual chassis so it can't break", rather than thinking "ok, if this breaks what happens"

A crash affecting both sides of a "resilient" virtual chassis I had to work with took off a major broadcast last year (it was a last minute favour I was doing, and I rerouted to a tertiary route in a couple of minutes).

Meanwhile I ran a rather large event going out to some hundred million listeners via two crappy £300 switches which were completely independent of each other, into two independent routers, running via two separate systems (one on a UPS, one on mains). If one of them broke the other one was completely independent and the broadcast would have continued just fine.

As far as I am concerned, that is far better than a virtual chassis.

ccakes · on Jan 30, 2023

This may be true of enterprise network engineers but I’ve worked across a lot of very large networks (telco, not cloud) and we never ever trust the vendor.

The kind of bugs that I’ve read about in errata notes over the years is wild and truly unpredictable.

Spooky23 · on Jan 30, 2023

Enterprise is definitely different - network guys need multiple customers to develop the vendor skepticism. I used to get into brutal internal fights with network directors over whatever bullshit the Cisco salesman said offhand that was treated as though it was delivered by Moses off the mountain. One guy tried to get me fired because I offended an SE. lol.

I worked on systems and platforms at the time, and we were more cynical even about vendors we liked.

jacquesm · on Jan 30, 2023

It wouldn't be the first time that your redundant vendors end up sharing a conduit for a bunch of fiber somewhere. Guess where that backhoe will start digging?

oarsinsync · on Jan 30, 2023

Redundant vendors in the GP’s context referred to using multiple router vendors, eg Cisco and Juniper.

Using multiple connectivity vendors doesn’t guarantee path diversity. Demanding fibre maps and ensuring that your connectivity has separate points of entry into the building, doesn’t cross outside the building, and validating with your DC provider that your cross connects aren’t crossing either, guaranteed path diversity / redundancy.

iso1631 · on Jan 31, 2023

Its a bit of both. Internationally I find I can't trust the network maps of the connectivity vendors and I'm better going for two separate companies (ones which are part of different subsea cables -- e.g. Wiocc on Eassy and Safaricom on TEAMS).

Of course I had one failure in Delhi which the provider blamed on 5 separate fibre cuts. Long distance circuits can run via areas where they can sustain multiple cuts across large amounts of area (regional flooding is a good one), and fixing isn't instant. This can be mittigated a little, but you still end up with circuit issues -- I had two fibre runs into Shetland the other month. Frist one was cut, c'est la vie. Second one was cut, had to use a very limited RF link. There's only so much you can do.

On the other hand I've just been given a BT Openreach plan which lists any pinch points of a new RO2 EAD install, I can see the closest the two get during transport is about 400m (aside from the end point of course, and experience has taught me I can trust it.

jacquesm · on Jan 30, 2023

The GP was clearly talking about whole networks, not just the hardware vendors, if I read that different than the GP intended I'll wait for their correction.

One of the problems that I've seen in practice that with the degree of virtualization at play that it has at the same time become much more easy to in principle be guaranteed 100% independence and in practice it has become much harder to verify that this is the case because of all of the abstraction layers underneath the topology. One of my customers specializes in software that allows one to make such guarantees and this is a non-trivial problem, to put it mildly, especially when the situation becomes more dynamic due to outages from various causes.

iso1631 · on Jan 31, 2023

In London I can literally follow the map from manhole to manhole, exchange to exchange. It's dark fibre so I can flash a light down it and a colleague can see it emerge at the other end. Now it's possible they don't follow the map and still make it to the other end, but it's pretty unlikely.

Sometimes of course you have to make judgement calls. From one location near Slough I have a BT EAD2 back to my building a few miles away. I know the route into my building, I can see the cables with my own eyes going in different directions. BT tell me which exchanges those cables goto, and provide me with a map into the field at a 1000:1 scale showing the cables coming in down a shared path. Sure it's possible BT are lying, but it's unlikely. Only use that location sporadically, and when I do it's a managed location, so I can accept the risk of a digger on the ground.

Another location in Norfolk, two BTNet lines, going to two different exchanges. They meet at the edge of the farm and go up the same trunk. That's fine, I can physically control the single point of failure there too, although if peering between BT and my network fails then I'm screwed, but I have a separate pinnacom circuit in a crunch.

Now obviously some failure become far harder to mitigate. A failure of the Thames Barrier would cause a hell of a lot of problems in Docklands, I'm not sure if any circuits in/out of places like telehouse, sovhouse, etc will remain. Cross that bridge etc. Whether my electricity provider will remain with a loss of the internet is another matter, so then it comes down to how much oil there in in the generators, and the generators of any repeaters on the routes of my network.

However the much easier to avoid is the problem of some shitty stacked switch the salesman says will always work.

oarsinsync · on Feb 1, 2023

> One of the problems that I've seen in practice that with the degree of virtualization at play

If you’re buying SDN WAN solutions, you get what you get.

If you’re buying specific paths, you get what you pay for.

MichaelZuo · on Jan 31, 2023

Sounds like a great place for a specialized insurance company to be the middle man

iso1631 · on Jan 31, 2023

I have to trust the dark fibre map provided, but I know exactly which way it ran, manhole to manhole. I had three cores, they shared the first 20 metres to the manhole, it's unlikely there would be a backhoe digging underneath the police van and pile of scaffolding that was parked in the shared conduit.

After that it went on different paths to three different buildings, which from each of those was then routed independently.

We take physical resilience seriously, as it isn't network engineers that do that part of the infrastructure. Enterprise network engineers then throw it all away by stacking their switches into a single point of logical failure.

(Still had a non-IP backup, but sometimes that breaks too -- just in different ways than the IP)

jacquesm · on Jan 30, 2023

The network is a single point of failure, even if the network itself is redundant!

wmf · on Jan 30, 2023

One possible way to fix that is to replace the network with multiple independent networks. It's really expensive though.

jacquesm · on Jan 30, 2023

Yes, exactly. Most really mission critical places do exactly that.

The first time I saw something like that put into practice was when an experiment in the oil and gas industry that was scheduled to run for years delivered their network design. On the runtime cost of the experiment the extra network wasn't a big deal, but a service interruption would have been and would have caused them to have to restart the whole thing from scratch. It's more than a decade ago and I forgot what the exact context was but the whole thing was fascinating from a redundancy perspective as well as the degree of thinking that had gone into the risk assessment. Those guys really knew their business. Also the amount of data that experiment was expected to generated was off the scale. Multiple petabytes, which at the time (a decade ago or so) was a non trivial amount of data.

noorkersz · on Jan 31, 2023

yes, instead of one network, many independent networks which then can get connected together, forming a network of networks, some kind of inter-network!

..oh wait. see what I did? ahhAHAHA

bogomipz · on Jan 30, 2023

This doesn't really make sense. The modern WAN operates on multiple independent networks - SD-WANs, multiple transit providers, fiber-ring MPLS, EVPN etc. If you propagate a bad network change throughout your autonomous system or backbone you can still have an outage on your hands.

wmf · on Jan 31, 2023

My point is that you could apply the same principle internally; have two backbones managed by separate teams instead of one.

bogomipz · on Jan 31, 2023

That still doesn't make sense though. In the context of a WAN, a backbone is an external network. It routes between your POPs. At any rate, the margin of error and complexity in having two separate backbones networks managed by two separate teams would likely result in more network issues not less. The whole point in having an AS is having a coherent routing policy.

MichaelZuo · on Jan 31, 2023

The parent never said multiple networks was easier to implement.

In fact it could easily 2x the cost for the same level of quality, which is why it's almost unheard of for cloud.

bogomipz · on Jan 31, 2023

The parent was stating that two networks would be better but its none done because of costs. And that's complete nonsense.

The fact that it's more difficult and complex to have two separate teams manage two separate networks means it's more prone to error and misconfiguration. The reason it's not done has nothing do with financial costs but rather because it makes no sense, for the very fact I just mentioned.

admax88qqq · on Jan 31, 2023

Two end to end networks would be more reliable.

Like two independent internets spanning from your server to my laptop.

Two completely isolated end to end transports.

That's what OP meant when they said you could make it more reliable by having a redundant network. It's just prohibitively expensive.

Then if one internet goes down in any way I talk to you over the other. That's a fairly straightforward fallback algorithm to implement.

allarm · on Jan 31, 2023

Actually I have seen a setup that was quite close to this. Two separate networks, one of them was completely isolated from another, didn’t have Internet access and used a separate set of network equipment. On top of that, the building itself had two entrances - one for the boss and another one for the personnel. You physically couldn’t get from one part of the building to another. It didn’t help the boss though - he was blown up in his car one day. Fun times.

wmf · on Jan 31, 2023

You're talking like multihoming doesn't work. Sure there are cases where bugs or bad configs can propagate across ASes but most of the time you can survive if one provider goes down.

bogomipz · on Jan 31, 2023

And that's exactly where the whole "have two backbones managed by separate teams instead of one" stops. If someone pushes out an incorrect network config to the end box then all that "let's have two of everything" becomes completely worthless. And as far as multihoming everything and having every single box on the network act as router, unless you are running a CDN of some sort, really makes zero sense. You seem to be arguing that adding more complexity will automatically result in better reliability.

jgrahamc · on Jan 30, 2023

Having, uh, had bad things happen with router configuration I feel for them.

https://blog.cloudflare.com/cloudflare-outage-on-july-17-202...

alex-mohr · on Jan 30, 2023

It does seem like network configuration remains rather manual compared to other large scale systems that include more automation.

In Microsoft's case, the remediation is not to put in place higher level systems to safely accomplish the goal of the command. Instead:

- "We have blocked highly impactful commands from getting executed on the devices (Completed)"

- "We will require all command execution on the devices to follow safe change guidelines (Estimated completion: February 2023)"

Requiring commands to follow guidelines sounds suspiciously like they're requiring network ops not to break things.

iso1631 · on Jan 30, 2023

That's the norm in network ops. Automated testing is pretty much impossible, easy rollback may be possible depending on exactly what was screwed, but not always.

Take this for example, looks like the problem was an unplanned recalculation of routing tables. That's not going to be the case on a small scale test network, and rolling back won't help, indeed in this case it likely would cause more problems.

jsz0 · on Jan 30, 2023

One of the reasons I got out of network engineering was how frequently the work I was required to do would cause unintended consequences. You can do all your due diligence, get your work blessed by vendor support, and still get blown up by a bug or undocumented behaviors on a regular basis. The conspiratorial part of my brain says these network device makers intentionally provide unreliable software and terrible documentation to bolster their support contract profits. I was just the guy typing in the commands and getting all the blame.

spookthesunset · on Jan 30, 2023

I remember the first time I got access to an employers production Cisco router. It’s pretty scary how easy it is to majorly fuck something up.

There isn’t a concept of a transaction or a rollback. You just enter a command, press enter and it’s live.

To counter this we’d write all the commands we planned on executing and peer review it. Nothing was to be done “on the fly” (at least in theory)

In short, coming from a developer perspective with ample version controls and gated releases… networking is a very wild ride.

simoncion · on Jan 30, 2023

> There isn’t a concept of a transaction or a rollback.

Yeah, Cisco gear is bonkers.

Mikrotik has "Safe Mode", which undoes all commands since you entered "Safe Mode" if the connection that created the shell gets interrupted. It has saved my bacon on several occasions, but there are several obvious situations in which you can get yourself locked out.

Juniper gear has "commit confirmed $NUMBER_OF_MINUTES", which will roll back everything since your last commit if you don't do a "commit" within $NUMBER_OF_MINUTES. It will also, apply all of the changes you've staged all at once (and do configuration sanity checking before it performs the commit).

I do have no idea how Juniper's rollback works when multiple users are doing simultaneous config editing... maybe don't do that?

iso1631 · on Jan 31, 2023

> I do have no idea how Juniper's rollback works when multiple users are doing simultaneous config editing... maybe don't do that?

You get a warning

    Users currently editing the configuration:
      bob termainal p0...."

But the failure here is actually sshing to a network switch in the first place.

Some cisco kit has restconf which is better for automation, but it's buggy.

ccakes · on Jan 30, 2023

Modern router operating systems have this.

It’s been a long time since I’ve touched IOS-XE (Cisco enterprise gear) but Cisco IOS-XR, Junos, Arista EOS and the Nokia SRs all support some combination of configuration transactions with rollback and commit confirm on a timer

This definitely doesn’t stop you shooting yourself in the foot, similar to how you can still push broken config to a k8s controller, but it’s some level of protection for certain types of changes.

bogomipz · on Jan 30, 2023

>"There isn’t a concept of a transaction or a rollback. You just enter a command, press enter and it’s live."

This hasn't been true for a very long time. Juniper router's have rollbacks, commits and revisions:

https://www.juniper.net/documentation/us/en/software/junos/c...

and

https://www.juniper.net/documentation/us/en/software/junos/c...

Cisco has similar:

https://www.cisco.com/c/en/us/td/docs/ios/ios_xe/fundamental...

allarm · on Jan 31, 2023

Except Cisco doesn’t have a commit feature in any of their OS and the rollback feature is not implemented everywhere as well - NXOs doesn’t have it for example. Still, it’s better than ‘reload in 5’ that we had to use back then.

atxbcp · on Jan 30, 2023

That's not entirely true, you can rollback a change on modern switches/routers, either via a rollback command, or with a revert timer (configure terminal revert timer X) (because the new configuration might have made the router unreachable, so you're never sure you'll be able to rollback manually if you're working remotely).

meltyness · on Jan 30, 2023

Interesting. There's also some stuff in Cisco that can't be done both atomically and remotely, so you may have to push a change as a file to the router and then source the file into the running config with some permutation of `copy`.

meltyness · on Jan 30, 2023

Hadn't thought about it from the perspective of support contract profits, but they also have their friendship stick firmly planted in technicians via the semi-required training since as you indicate the manuals are deficient.

At some point network vendors switched manuals from engineers documenting features whitebox to educated techs documenting features blackbox.

There's a clear transition for docs produced after 2008, prior to which more care went into tech notes and interpreting technologies -- after you're lucky to even get a complete set of steps and caveats without having to cross-reference bugs, release notes, old-manuals, new-manuals, draft manuals, reference manuals, licensing manuals, the inevitable errors that appear in the logs, and of course the configuration guide where this should all be in the first place.

In short, yes, this.

Cyph0n · on Jan 30, 2023

> The conspiratorial part of my brain says these network device makers intentionally provide unreliable software and terrible documentation to bolster their support contract profits.

As a dev who has worked at one of the major networking vendors, I can assure you that is the not the case. You’d be surprised by how major bugs are handled internally, especially if the bug affects “important” customers.

candiddevmike · on Jan 30, 2023

Networking and storage changes are always butt clenching affairs. Way more stressful than anything else in IT due to their blast radius if something shits the bed.

throw0101a · on Jan 30, 2023

> That's the norm in network ops. Automated testing is pretty much impossible, easy rollback may be possible depending on exactly what was screwed, but not always.

Ansible/Napalm is a thing in NetOps in some places. Some folks use Eve-ng / GNS3 to spin up virtual networks to test config changes, and it may be possible to do CI/CD changes if you track things in a repo.

Juniper JunOS has auto-rollback if you don't confirm the change after "x" minutes:

* https://www.juniper.net/documentation/us/en/software/junos/c...

So if you did something that causes breakage and disconnection from the router, you (ideally) don't have to do anything but wait it out.

iso1631 · on Jan 31, 2023

Emulating even a mid-sized network in GNS3 requires massive resources, and my cisco account manager doesn't seem to even get why I'd want to deploy a test system of 50 different multi-vendor switches (and key supporting services like syslog and tacacs) with terraform, run some tests, apply a configuration change, and run more tests.

And virtual switches aren't the same as physical switches in any case, they have different bugs, different features, different responsiveness.

AdamJacobMuller · on Jan 30, 2023

commit confirmed is such a life-saver. I ran a production network which spanned multiple continents and even though I probably only ever actually needed commit confirmed a single digit number of times, the fact that it was there made every change I did 99% less stressful. I knew that even if I made a mistake, all I had to do was wait 5-10 minutes and it would all revert.

Compare this to my cisco/foundry/other experience where I would delay changes until I was in the office (physically colocated with main routers) or calling people to be onsite for what was 99% of the time an innocuous change. The stress of it led to me deferring changes or just skipping them entirely which led to more issues/stress/etc.

I'm really not sure there is a single software feature which improved my life as much as "commit confirmed"

iso1631 · on Jan 31, 2023

So instead of one ripple across your BGP network, you have two as it rollsback the change?

The problem is that the state in routing tables isn't stored in a single location, it's dynamically built over time. Breaking a single router in the wrong way can break the state, and there's no rollback of that state

AdamJacobMuller · on Feb 5, 2023

> So instead of one ripple across your BGP network, you have two as it rollsback the change?

It's possible, it depends on what the nature of the change is. If you use super short commit confirmed intervals (commit confirmed 1) then yes you can cause a situation where you revert a "good" commit and cause a second disturbance. You need to intelligently reason about commit confirmed times to consider this when you're making such changes.

latchkey · on Jan 30, 2023

How about describing how you implement systems that prevent this? You kind of talk about what was 'fixed', but not how. CI/CD is pretty hard to do for global networking changes. I'm sure whatever CF has done in this area is a lot of magic sauce and it would be super interesting to learn more about it, even at a high level.

bogomipz · on Jan 30, 2023

Your CEO sure doesn't seem to have much empathy when it's someone else though:

https://twitter.com/eastdakota/status/1143182575680143361

newah1 · on Jan 30, 2023

I remember this happening. The 20 some sites we ran went down as they were supported by cloudflare. I spent a panicked 30 minutes trying to figure out what I had done wrong, to eventually find out it was on CF's end.

I remember voicing at our team meeting "boy, they must be panicking at CloudFlare."

Cloudflare works so spectacularly we just wrote it off as a one time thing.

jgrahamc · on Jan 30, 2023

There was no panic but there was a lot of VUF (Very Urgent Focus)!

libraryatnight · on Jan 30, 2023

Holy shit I have been there and it sucks. I wasn't the guy who made the change, but I was on the long call that followed.

int0x2e · on Jan 30, 2023

Time to share one of my favorite talks (and speakers) ever -

"Debugging Under Fire: Keep your Head when Systems have Lost their Mind" (Bryan Cantrill, GOTO 2017)

https://www.youtube.com/watch?v=30jNsCVLpAE

libraryatnight · on Jan 30, 2023

This was an awesome lunch listen, thank you for sharing!

rexarex · on Jan 30, 2023

The curse of network engineering. You’re invisible and insignificant when everything is running well, and public enemy number one if you make a mistake!

xxpor · on Jan 31, 2023

this is the general case with all critical systems. Everything from networking to sewers (... not actually that different now that I mention it) to pandemic planning. No one gets credit for the pandemic prevented because the BSL regulations did their job.

chriscjcj · on Jan 31, 2023

Live television production as well.

asmor · on Jan 30, 2023

Security too.

raffraffraff · on Jan 30, 2023

Token-ring network. Someone configured their printer to use the gateway address in the ip address field. Idiot. "Turn off all devices on the internet, then turn them all on again one by one until we find the bastard who did this"

nzgrover · on Jan 30, 2023

It’s not DNS

There’s no way it’s DNS

It was DNS

Credit: https://www.cyberciti.biz/humour/a-haiku-about-dns/

feyman_r · on Jan 31, 2023

This reminded me of a talk at SREcon this year https://www.usenix.org/conference/srecon23americas/presentat...

eurticket · on Jan 31, 2023

Dang, I think I've encountered the same issue with my systems recently.

Or at least I think so, trying to figure out a packet loss issue on a virtual machine, for windows xp image.

noorkersz · on Jan 31, 2023

my own WAN IP got changed a few months after my ISP was eating(err, bought) by another larger ISP... now it's a private IPv4 address. I'm pretty sure my 'symmetrical bandwidth' is now only really true when testing it, a technique first invented by Herr Volkswer aus Deutsch-Wagen.

septune · on Jan 30, 2023

I miss the old days of IOS :

switchport trunk allowed vlan (add) xxx

Can’t imagine how many outages where caused by the missing « add » command.

candiddevmike · on Jan 30, 2023

Too many Cisco commands would truncate the syntax if you didnt know better:

no access-list 101 permit something

so long access-list 101!

acd · on Jan 30, 2023

Was it an BGP border gateway protocol WAN update?