
Rogers, of course, blamed their vendor (Ericsson I believe it was). Rogers can do no wrong!

Of course, was fun to see yet another huge org have no back-out/failure plan for their potential enterprise-breaking changes. No/limited IT 101 stuff here.

The only positive thing we learned was that the big 3 (really 2) telcos thought it would be a good idea to give each other emergency backup SIMs on the other networks for key employees, in case their own network went down. They did that in 2015, but better late than never.

Fun that Rogers used the same core for wireless and wired connections, so many of us were in total blackout, even if we used a 3rd party internet provider that ran over Rogers. Everything was down, including their website and corp circuits, with non-existent comms from Rogers.

Thankfully my org was multi-homed and switched over its circuits at 6am so on-site mostly continued without issue.

Also fun where the towers remained just powered on enough for phones to stick to them but not be able to do anything, so 9-1-1 calls would just fail, instead of failing-over to other networks. Seems like a deficiency in the GSM spec (or Rogers SIM programming?) that I don’t think was actioned on.

https://en.m.wikipedia.org/wiki/2022_Rogers_Communications_o...



> Also fun where the towers remained just powered on enough for phones to stick to them but not be able to do anything, so 9-1-1 calls would just fail, instead of failing-over to other networks. Seems like a deficiency in the GSM spec (or Rogers SIM programming?) that I don’t think was actioned on.

Actually, I think this is going to change after the Rogers outage, it's just slowly happening behind the scenes so it's not getting much attention these days. The government has mandated a lot of industry response to failover between providers... we'll see where they land after all the lobbying happens. I do think implementations are changing a bit around this, mostly in the phones so that they give up and go into a network scan if the emergency call is failing.

I worked mostly on core network stuff, so I was a layer removed from the towers, but if they hadn't lost management access they would've been able to tell the tower to stop advertising the network and 911 service. I do understand the question, from a vendor implementation perspective, of how automatic this should be, though, because automation in this regard has some of its own risks and could complicate some types of outages, or trigger inadvertently and confuse recovery from problems.

I'm with you, though: there should be an automatic mechanism to fail over to other network operators. I just haven't thought through all the risks with it, and I hope the industry is taking its time to think through the implications.


> I do think implementations are changing a bit around this, mostly in the phones so that they give up and go into a network scan if the emergency call is failing

It seems like this is a global problem, since all Rogers-subscribed devices in a Rogers reception area couldn’t make 9-1-1 calls. But could be a SIM coding issue and not afflict other providers elsewhere.

I just always imagined the GSM spec was so resilient that you could always make a 9-1-1 call if a working network was available but this outage proved that wrong. Surprising to learn in 2022.

Of course it’s Canada, so I agree with them that letting users fail over to a partner for everything would thrash the partner’s network. Even though Canadian subscriber plans are laughably low in monthly data and population density is low (per the telecoms’ usual excuse for our high prices), it turns out the telecoms still underbuilt their networks, with less capacity than what networks internationally built out to support the plans available on the international market (e.g. close to truly unlimited data/free long distance calls).


> I just always imagined the GSM spec was so resilient that you could always make a 9-1-1 call if a working network was available but this outage proved that wrong. Surprising to learn in 2022.

The "X is broken but claims it isn't, which stops failover" pattern shows up all over networking. It's not unusual to see it in telco root cause analyses.


> I just always imagined the GSM spec was so resilient that you could always make a 9-1-1 call if a working network was available but this outage proved that wrong.

As I recall it is slightly more nuanced than this and was particular to the failure mode, and has a couple of different things aligning to create the failure mode.

If your phone is just blank, with no SIM card, then to make an emergency call it has to start scanning all the supported frequencies. This is very slow: tune the radio, wait for the scheduled information block that describes the network on the radio protocol, and see if it has the emergency services bit enabled. If not, tune to the next frequency and try again. I used to remember all the timers, but almost a decade later I can't remember all the network timers for the information blocks.
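A rough Python sketch of the SIM-less scan described above. Everything in it is hypothetical (the `Cell` record, the `emergency_supported` flag, the `read_info_block` callback); a real baseband reads this from broadcast system information, and the timers are left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Cell:
    """Hypothetical summary of what a broadcast information block tells the phone."""
    frequency: int
    network: str
    emergency_supported: bool


def scan_for_emergency_cell(
    frequencies: Iterable[int],
    read_info_block: Callable[[int], Optional[Cell]],
) -> Optional[Cell]:
    """Tune to each frequency in turn; return the first cell advertising emergency support."""
    for freq in frequencies:
        # Stands in for "tune radio, wait for the scheduled information block".
        cell = read_info_block(freq)
        if cell is not None and cell.emergency_supported:
            return cell
    return None  # nothing on any supported frequency offers emergency service
```

The scan is linear in the number of frequencies, which is why a hint about where to start matters so much.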

The SIM card interaction is this: say you're at home and you boot up your phone with a 100% clean state. You don't want to wait for this scan to complete, so the SIM card gives the phone hints about which frequencies the carrier uses, i.e. start on frequency x to find the network. But if you roam internationally, it can take a lot longer to find a partner network. There are some other techs around steering to preferred partners, but I don't know that those come into play here. I don't know for sure, but I would be surprised if there is a SIM option to try and pin emergency calls to a network; I think it's more likely the interaction is this hint on where to start the scan.

The way the Rogers network failed, it appears to me the towers stayed in a state where their radio information block advertised that the network was there and that the 911 bit was enabled, so the network could be used for emergency calls. This is where I don't really have the details, since they haven't been public about how much of their network was still available internally. Maybe the cell towers could all see each other, that network layer was OK, and the signalling equipment was all talking to each other as well. That's the part I don't really know and have to speculate on, along with the tower side, since I was a core person. So because the towers had enough service to never wilt themselves, they kept advertising the network, along with the 911 support. But then when you try to place an emergency call, the outage had knocked something out somewhere in the signalling path, which runs from the tower to the signalling system, to the VoIP equipment, to the circuits to the emergency center. Oh, and for all these pieces of 911 equipment there are two of everything for redundancy: two network paths, two pieces of equipment, etc.

And because they lost admin access to their management network, no one could go in manually and tell the towers to wilt themselves either.

If the towers had just stopped advertising 911 services, the phone would fall back into the network search mode I described for when you have no SIM card. It just starts scanning the frequencies until it sees an information block for a network it can talk to with emergency support advertised, and then does an emergency attach, which the carriers will all accept (an unauthenticated attach for the sole purpose of contacting an emergency center).

So my suspicion is that it comes down to carriers being so used to having two of everything, and to all emergency calls being marked for priority handling at all layers of the equipment (they get high-priority bits on all the network packets and priority CPU scheduling in all the equipment). In this particular case there was a fault somewhere down the line, and they lost control of the towers so they couldn't tell them to stop advertising 911 services, and it all sort of played together to create the failure mode.
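To make that failure mode concrete, here is a toy Python model. The names (`Tower`, `advertises_911`, `call_path_ok`, `place_emergency_call`) are made up for illustration and don't correspond to any real baseband or 3GPP interface. It only contrasts a phone that trusts the advertisement with one that moves on when call setup fails.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tower:
    network: str
    advertises_911: bool  # what the broadcast information block claims
    call_path_ok: bool    # whether a call can actually reach the emergency center


def place_emergency_call(towers: List[Tower], retry_other_networks: bool) -> Optional[str]:
    """Return the network that completed the call, or None if the call failed.

    `towers` is assumed to be ordered by phone preference (home network first).
    """
    for tower in towers:
        if not tower.advertises_911:
            continue  # phone skips cells that do not offer emergency service
        if tower.call_path_ok:
            return tower.network
        if not retry_other_networks:
            return None  # phone stays stuck on the tower that claims to work
    return None


zombie_home = [Tower("home", True, False), Tower("other", True, True)]
wilted_home = [Tower("home", False, False), Tower("other", True, True)]
```

With `zombie_home`, the call only completes when `retry_other_networks` is true. With `wilted_home`, the phone reaches the other network either way, which is the point about the towers not being able to stop advertising.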


Multi-faceted failure mode.

0) At the network terminal level (mobile phone): at least for emergency calls if a given network fails to connect, fail over and try other networks. Even if the preferred networks claim to provide service.

1) At the network level: failure thresholds should be present. If those thresholds are crossed enter a fail-safe state. This should include entering a soft offline / overloaded response state.

2) Where possible, critical data paths should cross-route: Infra Command and Control, and Emergency calls in this case. Though if Rogers' issue was expired certs or something, the plans for handling that get complicated.
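Point 1 could look something like the Python sketch below. This is purely an illustration: the class name, the threshold of 3, and the idea of counting consecutive failed emergency call setups are all assumptions, not anything Rogers or a vendor is known to implement.

```python
class EmergencyServiceMonitor:
    """Hypothetical tower-side watchdog: stop advertising 911 after repeated failures."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold  # arbitrary value for illustration
        self.consecutive_failures = 0
        self.advertise_911 = True

    def record_call_result(self, succeeded: bool) -> None:
        """Update the failure count after each emergency call setup attempt."""
        if succeeded:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Fail-safe: withdraw the advertisement so phones scan for another network.
            self.advertise_911 = False

    def reset(self) -> None:
        """Manual recovery step; re-advertising is deliberately not automatic here."""
        self.consecutive_failures = 0
        self.advertise_911 = True
```

Keeping recovery manual is one way to limit the risk raised upthread of automation triggering inadvertently and confusing recovery, at the cost of needing working management access to turn service back on.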


It’s that “0” level that surprised me the most here.

Days later, Rogers said you might be able to pull out/disable your SIM card to call 9-1-1, but then it depends: if Rogers is the strongest network, you might end up in the same predicament anyway.


Agreed, 'zombie state' is a valid failure mode. Partners / Infra can think they're alive and respond but be non-functional. As can agents spoofing valid infra but always failing operations.


Say there is an outage at the 911 call center. Now you try to call, don't get through, and your phone writes off that tower. Who were you planning to call after 911? Too bad, should have placed that call first.


Your phone would try other towers from other providers. If 911 is experiencing an outage that’s a separate issue that needs to be mitigated at a different layer. Even still, 100% uptime is difficult and expensive.


> Fun that Rogers used the same core for wireless and wired connections, so many of us were in total blackout, even if we used a 3rd party internet provider that ran over Rogers.

If it ran over Rogers circuits then why wouldn't it go down too? Isn't that the case everywhere?


I just know that a part of Rogers’ response was to separate their cores between wireless and wireline so that the risk of both going down simultaneously would be reduced.

The 3rd party providers aren’t white-label resellers, but there’s obviously some overlapping susceptibilities to going down when Rogers breaks something. Depends what they break, and in this case, it took them down too.



