Delta had a similar outage due to a datacenter fire, grounding all domestic flights. Southwest was uniquely slow in taking days to start up again. And if the way my American Airlines ticket switched my birthdate to January 1st, 2000 is any indication, many airlines still need to modernize.
Most of the travel industry runs on old software that would horrify a lot of people here, especially those who've never worked for a large, 30+ year old company. When I used to interview a lot of people I made it a point to mention some of the more "interesting" aspects so they'd know what they were getting into.
One example: ever tried to book a flight a year in advance? On a lot (almost all?) of systems you can't, because the underlying date format is "DEC27".
Edit to address a couple comments: logistics are hard and there are plenty of reasons why airlines wouldn't want to support booking that far out. However, the reason you can book a flight 330 days from now but not 360 days from now is almost certainly due to the date format. (I believe the windows used are less than 365 days because it's helpful to be able to have dates in the recent past. I remember seeing documentation for 360, but AA and United seem to be in the 330-340 range on their websites).
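To make the window argument concrete, here is a minimal sketch of how a yearless date could be resolved. It assumes the "DEC27" month-then-day layout mentioned above and a made-up window of 330 days ahead and 30 days back; the function name, the parameters, and the window sizes are hypothetical and not taken from any real reservation system.

```python
from datetime import date, timedelta

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def resolve_yearless(token, today, days_ahead=330, days_back=30):
    """Map a yearless token such as 'DEC27' to a full date, or None.

    Assumption: a token is only valid inside the window
    [today - days_back, today + days_ahead]. As long as the window is
    shorter than a year, at most one calendar year can match.
    """
    if days_ahead + days_back >= 365:
        raise ValueError("window must be shorter than a year to stay unambiguous")
    month = MONTHS.index(token[:3].upper()) + 1
    day = int(token[3:])
    earliest = today - timedelta(days=days_back)
    latest = today + timedelta(days=days_ahead)
    for year in (today.year - 1, today.year, today.year + 1):
        try:
            candidate = date(year, month, day)
        except ValueError:
            # For example FEB29 in a non-leap year.
            continue
        if earliest <= candidate <= latest:
            return candidate
    return None  # The token falls outside the bookable window.
```

With today set to January 3rd, 2023, "DEC27" resolves to last week (2022-12-27) rather than to a flight 358 days out, which is the trade-off described above: a few weeks of recent past are bought at the cost of the far end of the calendar.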
As a fun side thing, I am also a travel agent with access to some of these internal systems on the booking side. The technology is incredibly antiquated. Most of the US runs on a system called SABRE, which is basically a MS-DOS system with a text command line interface and its own language. It's all ASCII text based (and all in uppercase). It's straight out of the 80s. Travel agents need to buy special "errors" insurance to cover any losses caused by fucking it up (a typo could accidentally cancel a ticket and cause the client thousands in losses rebooking it).
They actually have a GUI interface over it now with the ability for power/legacy users to drop into the raw shell, if they wish. From feedback, many of the older agents actually prefer the command line, because it’s muscle memory and an experienced agent can perform routine tasks that would take multiple screens in the UI with one hand in the way we’re comfortable with our text editors.
Granted, the rollout across airlines is probably glacial.
I don’t blame them. Modern UX has a huge problem with something as simple as date pickers, preferring that you scroll through 90+ items when a simple textbox would suffice.
That is a tremendously fun fact! The little background things that keep society running. May I never be cursed enough to have to work directly with such a system.
It'll probably never go away, but just be layered over like civilizations. Eventually our software is probably going to get so complicated that we just build new software to interact with old software to avoid ever fully shutting it down. Like building a fresh highway on the Oregon Trail.
I remember using some version of SABRE through CompuServe back in the day. All command line stuff over a dial-up modem, but it was novel and cool to be able to book your own flights with it. It would be very annoying to still be stuck on that interface, though.
1) Replacing any software for an airline carries huge risk. They are barely operating ok with the software they have and holding it together with duct tape. Even something you might regard as ancillary, like a baggage handling software system, or flight catering software system, if it goes down, has the potential to disrupt thousands of planes and hundreds of thousands of passengers for days.
It is so significant an issue (to try to change some software, and just one out of many systems that have to talk to each other) that if an airline ever considers doing this, they may actually stop operations for some number of days while they do it rather than risk having operations go wrong. There are some rare examples of airlines doing this to try to change their systems.
2) Related to the above, airline management hates to be embarrassed by something that might work but has the potential to go badly wrong. So they are very conservative when it comes to replacing systems that are working, even if it's painful / much less functional than what they might achieve by a change.
Combine these factors (and many others) and it means that sometimes starting a new airline is simpler than trying to fix an old one...
Sabre has a 10-year deal with GCP to do just that. It's going to take a lot more than 10 years to get the thing off System/360 running in a bunker under Omaha airport though.
The typical answer for old behemoths: it was built because it was necessary to build it, and it won't change until a change is necessary too. Wanting that change is not enough, it has to become an almost mechanical constraint, and usually the constraint gets noticed when it far outweighs the costs (and not just a little). Or is a noticeable threat to the system's existence.
GDS - there are really only 3 centralized stores of real-time flight/hotel/booking information in the entire world (Sabre/SABRE, Amadeus, and Travelport). Almost every American airline uses Sabre. (American Airlines is an interesting case: in a legal sense it technically does not, but it spun off and sold Sabre in 2000, so a lot of their core systems are forks of each other.)
Complexity - Fundamentally you’re looking at logistics software, except that unlike packages you’re dealing with people who, aside from expected destinations, have travel lengths and time-in-air calculated down to the minute. Also unlike a package, a surprise multi-day trip, an unexpected multi-leg journey, or a one-day delay is not something passengers (and crew members) will accept or be at all OK with. And if any one thing goes wrong there are going to be cascading failures down the line, so much so that it may break your company’s entire operating workflow (e.g. Southwest), and no software can overcome that kind of organizational gap.
Airlines - There’s not many commercial passenger airlines left in the US, especially that fly nationwide. Good luck trying to convince one of these giant behemoths to move to a non-battle-tested system for core operations, especially when decades-old industry software and practices around that software exist.
Entrenched - Sabre is entrenched in airlines around the world. They don’t just provide the booking services, they do the flight tracking, the ticket handling, the upgrading, the in-flight upgrades, missed connection handling, the flight scheduling algorithms, the pricing algorithms, the pilot and flight attendant time tracking, ground crew management, even the terminal software at each gate. To replace SABRE, you would physically need to rip out and then replace software around the world. And because agents don’t work from an office usually, but at the airport, you’re going to need to conduct trainings and provide support around the entire service area, which for the largest airlines is the entire world
Scale - A lot of Sabre’s revenue comes from passengers boarded. It depends on the airline, but I believe the average is that each airline pays 10 cents per customer boarded with their software (though with increases in passenger volume each year, it may be less now). Because Sabre is so prevalent, and so many flights use them, they can afford such a price. A company servicing just one regional passenger airline would absolutely not be able to compete on price, at least starting out.
Also: Sabre’s software itself is actually reliable! As a corporation it is slow, clunky, and bureaucratic, but the actual functionality it provides is stable, battle-tested, able to handle any travel edge case you can think of, and fast and efficient for those who know how to use it, while also good enough at day-to-day operations that it doesn’t take too much time to train new agents on how to use it for routine tasks.
SABRE is ancient technology, but very reliable and at the same time extremely inflexible. Last time I saw it up close, in the early 2000s, much of the core was still coded in IBM assembler, although over the decades more pieces were slowly being modernized, so I have no idea where it is now. Sabre is a horrifically unimaginative company where projects are measured in years and not much ever changes.
I think though Southwest's issues are more on their side.
Yeah building a new GDS today is an exercise in insanity, it's a huge complexity nightmare and switching probably impossible. I always wondered if AI could eventually improve things, but the existing GDSs are unlikely to care much to try. It's basically a (tri)monopoly you can never break.
Great write-up. This applies to other industries such as dealerships with parts and service and various purchase plans. The software sucks for everyone except finance/accounting and that office is beside the Pres's office and therefore it won't change.
"What are the reasons preventing flight booking software modernization?"
Let's ask the opposite question:
Why do you feel that software necessarily needs to be modernized?
I'm not sure how well SABRE works but I do know how fast and efficient keystroking through a non-GUI interface can be and I don't know why expert mode interfaces should ever be replaced by unsophisticated mousey-mouse-mouse ones.
In context, it seems like modernization simply means that the software knows whether a flight was canceled and therefore can automate the status updates for the airplane crew. Nothing to do directly with UX.
The existing flight booking software that they use (Sabre) can and does do that, but Southwest’s issue is likely their insistence on using a homegrown management system on top of that. Southwest only switched to Sabre in 2021, so it’s likely still being implemented, and their homegrown approach has likely evolved with them from their founding, so it’s not something the company itself is likely organizationally prepared for.
At least, I feel confident in that analysis, since this exact same issue happened to Southwest in 2016, before they were using Sabre. Which would point to a chronic organizational failure
airline margins are paper thin these days. they weren't back when booking engines and reservation systems were being built. (SABRE was built by and spun off from American Airlines. Galileo was built by United before they merged with Continental.) this plus the absolutely insane business logic that goes into booking engines has made the effort extremely risky.
To be fair, I think allowing flights a year in advance is probably far more complicated than just updating the underlying date format. Even if they were able to solve that problem, airlines probably can't easily operationally plan that far ahead due to so many moving parts, i.e. committing to routes and schedules, planning for staffing that far ahead of time, ever changing government restrictions, fuel price fluctuations, inflation, geopolitical realities, staffing, etc. I mean, imagine if they did, and something like COVID comes along again, it would cause far, far more disruption if they had booked out the next few years in advance (we had no idea how long COVID restrictions would last while we were in the heat of it, it's only clearer now in retrospect).
Also speaking as a software engineer myself, it's almost never just a software fix that will magically solve everybody else's problems, that always ends up being just wishful thinking
While this is humorous in that there are limiting assumptions like this baked into the system, I also have to wonder, who needs or even wants to book a flight a year in advance? I dread planning out a flight 4 months in advance and dealing with the almost inevitable cascade of conflicts this introduces of juggling and rescheduling things to make things align correctly. One year makes me cringe.
There are events that I can see purchasing flights well in advance for. I used to go to a conference that was held every other year at the same time; it would have been easy to buy tickets more than a year in advance for that without much concern. Eclipses, certain sporting events, or reservations for activities with a wait list of more than a year could qualify as well.
Despite that I am like you and rarely have tickets far in advance of a trip.
- Annual conferences or conferences that occur every-other year
- Planning family reunions because you need that kind of cat-herding lead time when you have 9 uncles/aunts on just ONE side of the family
- Periods where I have some spare cash I'd like to lock in a getaway with before I spend it or something unexpected like the invasion of Ukraine drives up fuel costs and overall prices... or a global pandemic hits - would be sweet if I could have rebooked some of my trips for 1-2 years out when the pandemic hit
- Travel for future medical stuff; at one point for 2-3 years I was taking my mom to the Cleveland Clinic every 4 months for periodic checks and it would have been super nice to be able to just book that stuff way in advance and have it all taken care of
Etc
etc
etc
I'd bet quite a few people would appreciate that ability
My family plans the yearly family get-together at the yearly family get-together. A year in advance. Except sometimes due to scheduling deconfliction, it's actually 10 or 11 months in advance.. or 13 or 14 months in advance. The exact date floats and sometimes we are planning trips more than a year in advance.
I could see it for major holidays. I spent too much money to fly home this year because I am bad at scheduling. I would consider booking next year's flight during this year's trip just so I know it's knocked out and I don't have to worry about it.
> Most of the travel industry runs on old software that would horrify a lot of people here
If you can see how it works, it horrifies me even more as a traveller, as from outside it just doesn't work a lot of the time.
Also if you just look at the video, we all know how bad these systems are, but are not able to do anything (starting anything new in the airline industry has too much cost).
Likely cannot book that far not because of the underlying date format, but because of jet-a fuel prices which fluctuate. Airlines typically hedge their near term purchases with longer-term futures
> "What's unique is the partial failure, it's never happened," he said. "This isn't a drill you can run."
As someone who writes some very thorough unit tests... and who has also had to sit through mandatory training... I find "this isn't a drill you can run" to be _very_ wrong.
Southwest is a "discount" airline. They do many things to economize, i.e. no assigned seats, they only fly 737s so they don't need to certify pilots or mechanics on any other types, you can only book with them and not with Expedia etc.
It would not surprise me that their back-office operations are likewise economized and some things are just not done because "they can never happen."
They're also the "friendly airline", they easily have the most personable and friendly staff. I don't know what they do different, but Southwest employees treat me human and all the rest generally treat me like human trash. It's got to be a company culture thing, maybe connected to Southwest not having a first-class section.
Usually I fly with Southwest whenever possible without thinking twice about it, but this outage and the outage last year are forcing me to reconsider. Better to deal with rude people than to have my flight delayed..
Yep, the other airlines are in the business of selling "class" and "status," and it's part of their product differentiation strategy to treat you according to how much you pay.
> I don't know what they do different, but Southwest employees treat me human and all the rest generally treat me like human trash. It's got to be a company culture thing
Definitely. Normal American customer service is to treat you like human trash, so obviously Southwest has decided to do this differently and so probably does things like only hiring friendly people, training them to be friendly and positive even in difficult situations, and checking on them somehow to make sure they're doing this and not just faking it for the interview and training. Other airlines obviously don't, but it's not just airlines, it's everywhere in American customer-facing business these days. Rudeness is just a normal part of American culture now.
> Rudeness is just a normal part of American culture now.
I have a backpack that has a phone holder on the shoulder strap. When you click your phone in it, the camera is visible. I've had many a situation where someone was about to go full New York on me, but noticed the camera staring at them and toned it down. I've never once been recording.
It's an Orben with a cheap iPhone belt holder clipped into the cargo loop on the strap. The belt holder came with the clear protective case I bought at Wal-Mart. I hung the belt holder on the strap and it's been great for recording short hikes and downtown excursions. But it also seems to have the effect of making people more polite at customer service counters.
> training them to be friendly and positive even in difficult situations
Without having any insight into how Southwest runs things I'd venture to guess that their gate agents and cabin crew have a lot more discretion than their peers at, for example, United. That kind of leeway makes the job much, much easier. United essentially handcuffed their staff leaving few options for dealing with an overbooked flight but to escalate to police involvement.
That's what happened with Dr. Dao in any case. Not long after that shit show, I booked a transcon on American. Turns out it was overbooked, and they were desperate to get people off the plane. The gate agents basically asked how much money it would take to get people to volunteer to take another flight. They got their volunteers, everyone went home happy because American empowered their staff to resolve problems.
A lot of people lay responsibility for Southwest's reputation for customer service at Herb's feet. Personally I found their front line staff to be casual to the point that it bordered on unprofessional (which is something coming from someone allergic to formality), but I'd absolutely believe that much of their positive reputation came from the top. Southwest pilots though, yikes.
With the exception of United I'd say most airlines I've dealt with have been pleasant whether or not I'm flying up front or have status. Yes, even RyanAir. We're in something of a golden age since (with the notable exception of Southwest) American airlines have mostly put their mergers behind them and have mostly shed themselves of CEOs who view employees as adversaries. Take a look at the bad old days of the late 80s through mid 2000s. Smisek. Lorenzo. Parker. Bastards and crooks, all of them. It's difficult to stress just how toxic airline leadership was and how that trickled down for a long time.
Southwest also has pretty good size seats, afaik, relative to other airlines' economy seat size. Plus the whole "you're all the same so we treat you all the same" thing is nice. I do wish there was an international equivalent.
Delta has been more than fine with me. Granted, I'm a Diamond class member now, but even on the way up they were reasonable and provided mostly good service.
> It would not surprise me that their back-office operations are likewise economized and some things are just not done because "they can never happen."
Meh, it doesn't even have to be "never". It just has to be cost multiplied by frequency is less than the cost to prepare.
If they lose $100m every five years due to a system failure, and it would cost $30m/year to plan for those failures, then it's just cheaper to let it happen.
And I don't mean this in a judgmental, Fight Club-car recall speech kind of way. It's just business reality. At some point every business has to decide that the cost of planning for something is higher than the cost of letting it happen.
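Spelling out that arithmetic as a rough sketch: the figures are the hypothetical ones from the comment above, the function names are made up, and the model deliberately ignores anything that is not a direct dollar loss.

```python
def expected_annual_loss(loss_per_event, years_between_events):
    """Average yearly cost of simply letting the failure happen."""
    return loss_per_event / years_between_events


def cheaper_to_prepare(loss_per_event, years_between_events, annual_prep_cost):
    """True when paying for preparedness every year beats eating the loss."""
    return annual_prep_cost < expected_annual_loss(loss_per_event, years_between_events)


# Hypothetical numbers from the comment: $100m lost every five years
# versus $30m per year spent on planning for it.
loss = expected_annual_loss(100_000_000, 5)            # 20,000,000 per year
prepare = cheaper_to_prepare(100_000_000, 5, 30_000_000)  # False
```

So on these numbers letting it happen costs about $20m a year against $30m a year to prepare. The comparison flips as soon as the failure gets more frequent or more expensive than assumed.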
What's the value of the reputation risk of a major, very high profile failure?
Sometimes businesses end up on the wrong side of that bet. They see only the costs but not the benefits of preparedness (by the time it fails, there will probably be a different CEO in charge) and make a bad call.
Of course, no argument there. Ideally when you make that kind of decision you take reputational risk into account, as well as, like, is this an existential risk?
The airline industry feels like one where each year it's a different carrier who has some catastrophic scheduling failure. Today, everyone says they're never flying Southwest again. But if you fly semi-regularly then it won't take very long before you don't have any airlines left to fly on.
For people who weren't affected, I doubt very many are even going to remember this. Personally, I remember that this kind of thing has happened recently with other carriers but I couldn't even tell you who.
And people who were affected can mostly be bought off if you need to. Some vouchers & hotel reimbursement and it's just the cost of doing business.
Plus, the airline industry has proven over and over that people are willing to put up with a lot when you have the cheapest prices.
It's different from an industry that's built on reputation and trust. Like, a password manager, the only real thing you're selling is your reputation. Losing trust is a real existential threat. Security costs need to be in the bucket of either "yes, we will do it" or "it's so expensive that if we do it then we don't have a business anyway, so we'll skip it and pray."
Scheduling won't ruin an airline's reputation. Crashing the planes is what ruins an airline. Southwest has only ever had two passenger deaths and one of those was an attempted hijacker beaten to death by other passengers.
FAA has an important role, but market forces won't tolerate an airline that experiences crashes, even rare ones. Airlines are highly incented to be safe, both by regulation and by the market itself.
How badly did SWA's past high profile failures affect them? I'd say not so very much. Yes, some short term damage, but we're talking about an industry where no one stands out from a quality perspective. Everyone is seen as some degree of bad, much like big ISPs. People are used to rotating between airlines and, while those affected this round may shift away, others will get frustrated with their preferred carrier and rotate to Southwest.
I think reputation impacts in this industry from anything other than crashes don't hold much staying power.
All airlines economize. An airline that doesn't is a bankrupt airline because typical industry margins on flight are razor thin.
Southwest isn't a particularly budget airline compared to modern budget carriers like Spirit and Ryanair that haven't copied the open boarding policy. I suspect the opportunity to upsell seats / luggage and have distinct classes outweighs the turnaround time costs of assigned seating.
Just as a note: they are about to issue a $458M dividend.
They plan to spend $4-4.5B in 2023 on planes.
How much are they spending on system modernization?
It’s funny that you use unit tests as an example of it being possible to run drills for this kind of thing. Unit tests are by their definition not the kind of thing that simulates this kind of failure. Perhaps you have a false sense of security about what you’ve really been testing?
> Unit tests are by their definition not the kind of thing that simulates this kind of failure.
Thorough unit and integration tests include failure modes. Mine include things like "what happens if the OS reports that the storage was unmounted during read/write" (because that was a failure often seen in production with some flaky SAN devices) and "what happens if the server stops responding" (because networks are generally unreliable) and "what happens if invalid (random) data is given" because data corruption is a frequent occurrence for similar reasons.
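As a rough illustration of what tests like that can look like, here is a small sketch using Python's unittest and unittest.mock. The function under test, its status values, and the fallback strings are all made up for the example; only the three failure modes (storage disappearing, a server that stops responding, random data) come from the comment above.

```python
import random
import unittest
from unittest import mock

VALID = {"ON_TIME", "DELAYED", "CANCELED"}


def load_status(read_bytes):
    """Return a parsed status, or a fallback when the source misbehaves.

    `read_bytes` is a hypothetical zero-argument callable returning raw bytes.
    """
    try:
        raw = read_bytes()
    except OSError:
        # Covers unmounted storage and timeouts (TimeoutError is an OSError).
        return "UNAVAILABLE"
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        return "CORRUPT"
    return text if text in VALID else "CORRUPT"


class LoadStatusFailureModes(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(load_status(lambda: b"DELAYED"), "DELAYED")

    def test_storage_unmounted_during_read(self):
        reader = mock.Mock(side_effect=OSError("device not mounted"))
        self.assertEqual(load_status(reader), "UNAVAILABLE")

    def test_server_stops_responding(self):
        reader = mock.Mock(side_effect=TimeoutError())
        self.assertEqual(load_status(reader), "UNAVAILABLE")

    def test_random_bytes_are_rejected(self):
        rng = random.Random(0)  # Fixed seed keeps the test repeatable.
        for _ in range(100):
            junk = bytes(rng.randrange(256) for _ in range(16))
            self.assertEqual(load_status(lambda: junk), "CORRUPT")
```

Saved to a file, this runs with python -m unittest. Only the first test is the happy path; the other three each force one of the failures that showed up in production.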
> Perhaps you have a false sense of security about what you’ve really been testing?
Perhaps. On the other hand I've seen a _lot_ of other developers test only the happy path and call it a day then spend days/weeks/months debugging failures.
Then you are senior enough to know what “then-CEO now-chairman Gary Kelly” really meant was “I haven’t funded our technology team well enough to have resources to test a scenario like this”.
Or "we decided that the cost to plan for this is so high that it's not even worth testing. If it happens then we're fucked anyway and we'll just eat it."
You can drill the initial failure, but not really the cascading events. In something as large as a global airline you are dependent on thousands of third parties' actions and the weather. No simulated drill is going to be sufficient or realistic. The only way to really mitigate or plan for something like this is multiple layers of segregation so that events in one area have less or no impact on others. Then you could drill total failure in various segments.
Testing reveals the presence of bugs, never their absence. With hindsight you can always feel smugly superior in saying “you should have tested for this”, but there’s an infinitude of things you might need to test, and if you haven’t encountered a failure you didn’t test for, you’re probably just lucky.
Google runs Disaster Recovery Training annually (DiRT) where security teams are tasked with simulating these “black swan” events. Seems like this practice needs to expand to more industries.
At Facebook we would simulate an entire datacenter disappearing.
When we first started doing it the datacenter would be chosen months in advance so that teams would have plenty of time to ensure their services can run without that specific datacenter.
When I left this year, the datacenter would be randomly chosen on the same day it would be cut off.
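A toy version of that kind of drill, purely as a sketch: the datacenter names, capacities, and load figure are invented, and a real drill cuts off real traffic rather than summing numbers. It only shows the shape of the check, which is to pick a datacenter at random, remove it, and see whether what is left still covers peak load.

```python
import random


def surviving_capacity(capacity_by_dc, removed):
    """Total capacity left once one datacenter is taken away."""
    return sum(cap for dc, cap in capacity_by_dc.items() if dc != removed)


def run_drill(capacity_by_dc, peak_load, rng=random):
    """Pick a datacenter at random and report whether the rest can cope."""
    removed = rng.choice(sorted(capacity_by_dc))
    survived = surviving_capacity(capacity_by_dc, removed) >= peak_load
    return removed, survived


def weakest_links(capacity_by_dc, peak_load):
    """Datacenters whose loss alone would leave too little capacity."""
    return [dc for dc in sorted(capacity_by_dc)
            if surviving_capacity(capacity_by_dc, dc) < peak_load]


# Hypothetical fleet: losing dc-a leaves 60 units against a peak of 70.
fleet = {"dc-a": 50, "dc-b": 30, "dc-c": 30}
fragile = weakest_links(fleet, peak_load=70)  # ["dc-a"]
```

Announcing the target months ahead is like only ever calling surviving_capacity with a known name. The same-day random pick in run_drill is what forces every datacenter to be droppable.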
That's pretty cool and ideal practice for a software firm but in one of the reddit threads they're talking about mass quits/refusal to work of ground crew at Denver because of the weather. I wonder how you could ever prepare for that? Keep a backup, airport scoped, ground crew in the waiting room??
You can't really do hot spares for people without time to gear/train up, and the weather event is so widespread I doubt there's enough spare SWA human capacity across the whole nation even if you had C130s on standby everywhere ready to take workers where they're needed most. From a national security perspective, situations like this are why the Marines exist, right? Ensure a rapid response while the rest of the machine gets moving. I feel bad for everyone involved, those affected and those trying to figure out a solution.
Do you really think the MBA wizards can't figure out some basic pay issues?
It seems you're discounting just how complex HR can be, especially in the face of exigent circumstances. No amount of bonuses will immediately staff up an entire terminal in the face of a massive snowstorm.
> No amount of bonuses will immediately staff up an entire terminal
That is right and thus I would engage line management to figure out business continuity.
> in the face of a massive snowstorm.
It is winter. The storm was tough but not exceptional for the season of winter. Denver did not report tremendous amounts of snow.
Few if any MBAs can get on the ramp and look a line employee in the eye and lend a hand. The wfh keyboard warriors don't know blue collar and therefore are unable to figure this out. MBAs can figure out ways to game their pay. HR can recommend team building 'fun' and non-revenue standby seats, which have minimal value to those employees flying with school age children. Shareholders should demand senior management unemployment applications.
> The storm was tough but not exceptional for the season of winter.
Sure seems like this was an exceptional storm. Widespread, deep cold reaching into Mexico, snow falling across much of the U.S., Buffalo hit with the most snow in 20 years, records set in multiple locations.
A couple decades of weather variation is not exceptional, though some will disagree with me when their bonus is yearly. I recall it was cold in Texas last winter. Weather is seasonal and varies.
I wonder if board of directors will compare performance with other airlines.
While they may be able to figure it out, optimizing pay to quality of life at work ratios to ensure long term employee retention and loyalty has certainly not been a priority.
Pay might not really be enough. Maybe management could try to find folks to babysit kids/take care of parents trapped at home, freeing up workers to come fly. They'll certainly fail, but at least they'll understand the plight of their workers.
You can certainly handle it better than SWA is handling it. Maybe if they’d actually run a simulation where flights out of one or more cities had to be wholesale canceled, they would be. You can’t fix the situation for everyone, but you can avoid fucking up this badly.
It doesn't seem that farfetched for an airline to run a drill where a given airport is assumed inoperable to see how the system reacts. The expectation shouldn't be the same as the data center failure but you can learn what you aren't doing well enough.
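For what such a drill might look like on paper, here is a deliberately tiny sketch. The tail numbers and the schedule are invented, crews are ignored entirely, and the only rule modelled is that an aircraft cannot fly a leg from an airport it never reached.

```python
def drill_airport_closure(rotations, closed):
    """Split one day's legs into direct cancellations and knock-on ones.

    `rotations` maps a tail number to its ordered (origin, destination) legs.
    A leg touching the closed airport is a direct cancellation; a later leg
    is a knock-on cancellation if the aircraft is not at its origin.
    """
    direct, knock_on = [], []
    for tail, legs in rotations.items():
        position = legs[0][0] if legs else None
        for origin, dest in legs:
            if closed in (origin, dest):
                direct.append((tail, origin, dest))
            elif origin != position:
                knock_on.append((tail, origin, dest))
            else:
                position = dest  # The leg operates, so the aircraft moves on.
    return direct, knock_on


# Hypothetical schedule for two aircraft.
schedule = {
    "N101": [("DAL", "DEN"), ("DEN", "PHX"), ("PHX", "LAS")],
    "N202": [("MDW", "BWI"), ("BWI", "MCO")],
}
direct, knock_on = drill_airport_closure(schedule, closed="DEN")
```

Closing DEN cancels two of N101's legs directly and strands its PHX to LAS leg, which never touches Denver at all. Even this toy shows the knock-on effect, and a real drill would also have to track where the crews end up.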
Google is also a trillion dollar company; as others have pointed out, Southwest is a low-cost carrier which most probably doesn't have the luxury of hiring FAANG-level engineers on 500k yearly comp in order to best simulate "black swan" events.
This is probably the best argument for AWS/GCP/Azure even though it is becoming more and more obvious you don't really save that much money.
If you have a black swan event like this and you listened to your solutions architect you will have a disaster recovery plan or even better a multi region setup. Worst case you have highly paid support engineers at the cloud providers who will do everything they can to get you back online.
This does not seem like a hardware failure scenario where the cloud has anything to offer. More like their intricate software/database systems became out of sync with reality and disentangling the mess is a highly manual process.
That's not the article I hoped to find however. I seem to remember there was another article where they hired an investigator/consultant to figure out the price to migrate to the cloud and ensure "this never happens again."
My recollection of that was: their scheduling/ops team is also in the same city (Atlanta GA) as this datacenter, and that team's work was brought to a halt by the datacenter outage. The investigator concluded that Delta would need redundant copies of the ops team, or the whole effort of moving the software to the cloud would still be at risk from something happening to the human team all in the same city. That would obviously cost too much money, so Delta decided to skip it.
Regarding the employees, keep in mind that neither SWA nor any other airline, for that matter, has a big software engineering department. It's all outsourced to either generalist bodyshops for custom/peripheral systems (IBM, Accenture) or specialist shops for core (Amadeus, SABRE).
20 years ago I used to work for an airline. Back then Sabre was nothing more than the mainframe in Tulsa. All the APIs did nothing more than perform automated green screen commands. Has anything changed with Amadeus or Sabre? Or is it still mainframes behind the curtain?
There are two parts to the IT: the airline backend and the distribution backend. For the airline backends, Sabre used to build custom single-tenant ones per airline, on mainframes. The SWA one was very old, hence their inability to charge for bags. The Sabre distribution system is common of course, again on mainframes
Amadeus are eating them up, because their airline backend system is a shared multi-tenant setup, built on commodity hardware. Their distribution system used to be mainframes, but they managed to migrate away in the 2010s.
Sabre is still alive, but only in North America, and Amadeus is slowly chipping away (WN, AC..)
Training for these scenarios may help with responding to true black swan events, like Rick Rescorla's WTC evacuation drills ahead of 9/11. But, nitpicking, if you've predicted something will happen, by way of simulating it, it's not a black swan when it does.
There are cost considerations. Business continuity costs money. Finance firms have enough capital and income to keep empty but built-out buildings around airports for business continuity. Which doesn’t even make sense, since they can work from home, as proven with covid. Airlines can’t work from home.
Personally one of the basic tenets of my adulthood is realizing how many companies are a hair away from a similar scenario (differing in magnitude from an airline ofc).
Google is made of money, and the reason they are is not because of DiRT. Other industries can't afford the same things that Google can while continuing to be a business.
After canceling 5,400 flights, I can't see how Southwest can afford not to test. Even if they only made $1000 off each flight, that's still $5.4 million they just lost.
That... and Chaos Engineering in general, and also just generally a forward looking team that identifies threats, vulnerabilities, and risks in the future and works backwards to identify potential mitigating controls.
What's most annoying is that there are plenty of employees on the front line who not only care about testing for this sort of thing, but are actually interested in it and motivated by it, and they understand the dire reality of what happens – to them, primarily – if the company isn't prepared to handle it.
And you can guess what their managers' response typically is: "We need to focus on OKRs and QBRs and KPIs right now... maybe next quarter"
I'm fully convinced that achieving 'manager status' is directly correlated to cowardice. Companies need top-down decision-making, but those decision-makers need to spend more time on the front line.
> Companies need top-down decision-making, but those decision-makers need to spend more time on the front line.
This is not rewarded so it doesn't happen. Managers are rewarded for line goes up so they only focus on line goes up. If line ever doesn't go up it costs them money (advancement, compensation) even if there's little they could have done to make line go up.
Airlines don't have quarterly gains. They regularly go bankrupt and get picked up again, because the country needs airlines and because they have large union contracts.
My wife and I spent a few hours this morning dealing with the cancellation of our return flight. Southwest has long been preferable for me in many cases, including my most flown route. Between the headache of this outage and the apparently dismal state of their operations there's a strong chance I never fly with them again.
Our family was in San Jose last week when our Southwest flight was cancelled.
It used to be that my first priority would be to go into the terminal and try to talk to somebody. I figured they were the experts. From what I've read, the staff use an antiquated system that takes you from one airport to another, then they can try to get you from that city to where you want to go. That's why there's so much tapping of keys and why it takes so long.
It's better to present them with a route that you've found on Google Flights or similar. The Southwest first flight out was supposed to be yesterday evening, the day after Christmas. In our case, the only thing we could find before Christmas was getting us from SJC to Seattle via Phoenix on Alaska. We ended up renting a car and driving home to Portland. Things got bad around Eugene - I stopped counting after 40 wrecked cars and semis - and got worse as you got closer to Portland.
Asking the government to step in for additional regulation is rarely helpful. For this type of failure, the free market will determine whether processes and tools improve, or whether the status quo is good enough.
Cloud doesn't solve badly designed processes or poorly written software, which seem to be at play with Southwest. Yes, it can help provide more stable infrastructure and there are some (but by no means all) black swan events that can be mitigated simply by throwing more kit at the problem during a surge but it's no silver bullet.
>"What's unique is the partial failure, it's never happened," he said. "This isn't a drill you can run."
The unspoken part you have to hear there is "... within the economic model of the airline business".
Business continuity gets exponentially more expensive as you chase the blackest of swans: the sheer volume of plan development and maintenance, developing exercises, table-top vs. walkthrough vs. simulation, assumptions about how many different uncorrelated failures you're prepared to deal with at once, etc.
I've no doubt you could run an airline to be as resilient as (say) USAF Air Mobility Command, but no-one could afford the tickets.
What's ironic here is that groups like USAF are constantly pressured to adopt private industry models to be more "economically efficient", completely ignoring that resiliency is a requirement baked into the high cost. I understand why both take the approaches they do, but it seems everyone holds up private industry, barely running with no resiliency, as the model above all else, which doesn't make sense in all contexts. Corner cutting is fine in many contexts, especially when you know the side effects of the resulting failures, which may be quite insignificant.
On our database systems, we have some date fields for which the default value should never, ever be used and if it is, there is a big problem. All of those dates are set to the dates of well-known natural disasters that happened in the 1800s or earlier.
The thought was that it needs to be something that isn’t believable to a non-technical user seeing it on their computer screen. It turns out that this is not necessarily useful. I listened to a guy talking about some issues with a record; he says “1871? What’s up with that?” And then just moved on as if “well it came out of the computer, must be right” or something.
I think that databases need to have the concept of NaN for dates and time stamps, except that this should be configurable to something like a poop emoji or something like ⁉🆘. It has to be something where your grandma would look at it and confidently say “your computer is broken”
> I think that databases need to have the concept of NaN for dates and time stamps, except that this should be configurable to something like a poop emoji or something like ⁉
So have another column adjacent to the date that stores an enum with an error code with a reason for why the date is missing. My point is, this is business logic, doesn't have a universally applicable definition, and should be handled on a case-by-case basis rather than trying to hard-code it into the date type in the DB.
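A minimal sketch of that adjacent-column idea, in Python rather than SQL. The names here (DateStatus, BirthDate, the specific reason codes) are made up for illustration and don't come from any real airline system; the point is only that the date stays nullable and the reason it's missing lives next to it, so nothing ever has to fall back to a believable-looking default.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class DateStatus(Enum):
    # Hypothetical reason codes; a real system would define its own.
    OK = "ok"
    NOT_PROVIDED = "not_provided"
    UNPARSEABLE = "unparseable"
    MIGRATION_DEFAULT = "migration_default"


@dataclass(frozen=True)
class BirthDate:
    value: Optional[date]   # the nullable date column
    status: DateStatus      # the adjacent enum column

    def __post_init__(self):
        # A date must be present exactly when the status is OK.
        if (self.value is None) == (self.status is DateStatus.OK):
            raise ValueError("value must be set exactly when status is OK")

    def display(self) -> str:
        # Never show a made-up date: show the reason instead.
        if self.value is not None:
            return self.value.isoformat()
        return f"DATE UNAVAILABLE ({self.status.value})"


good = BirthDate(date(1985, 3, 14), DateStatus.OK)
bad = BirthDate(None, DateStatus.UNPARSEABLE)
print(good.display())
print(bad.display())
```

What the user sees for a missing date is then a display decision (poop emoji included, if you like), not something baked into the date type.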
Default date types, with some columns being nullable, work just fine for my project. I wouldn't want them to be more complicated and force me to consider additional cases, especially if those cases aren't language compatible.
Particularly on mainframe systems, like the ones airline reservation systems tend to run on, where the Y2K fix for COBOL was in a lot of cases to simply know from context that certain fields couldn't have been created before 2000, so a '00' BCD year is simply the year 2000.
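To make that concrete, here's a rough sketch of the "contextual" two-digit year expansion, in Python instead of COBOL. The function names and the `earliest` parameter are my own invention for illustration, not anything from a real reservation system. It assumes one packed BCD byte per year (two decimal digits, one per nibble).

```python
def bcd_to_int(byte: int) -> int:
    """Decode one packed BCD byte (two decimal digits) to an int 0..99."""
    high, low = byte >> 4, byte & 0x0F
    if high > 9 or low > 9:
        raise ValueError(f"not a valid packed BCD byte: {byte:#04x}")
    return high * 10 + low


def expand_year(bcd_year: int, earliest: int = 2000) -> int:
    """Expand a two-digit BCD year, assuming the field can't predate `earliest`."""
    yy = bcd_to_int(bcd_year)
    # Start in the century that contains `earliest`...
    candidate = earliest - (earliest % 100) + yy
    # ...and roll forward a century if that lands before the earliest allowed year.
    if candidate < earliest:
        candidate += 100
    return candidate


print(expand_year(0x00))                 # 2000
print(expand_year(0x23))                 # 2023
print(expand_year(0x85, earliest=1970))  # 1985
print(expand_year(0x85))                 # 2085
```

The last line is the failure mode: the assumption is fine for a ticket issue date, but apply the same "can't be before 2000" rule to a birthdate field and a person born in 1985 ends up in 2085.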
> "this isn't a drill you can run". And yet, Netflix has Chaos Kong do it with regularity.
Netflix and airlines are so different as to make this comparison laughable. The cost of setup, plus the consequences of actually finding problems that are not addressable (i.e. Federal Regulations; it's not like SWA didn't know about some of the eventualities), easily outweighs the need for testing every combination of situations. Kong doesn't run anything that has to do with weather turning jet fuel into sludge, or 12x pre-staffing in case of massive computer failures, along with assessing the possible legal consequences from each locale. The hubris of pretending that physical services on a national scale are as deterministic as a complex automated system is unsurprising from a certain crowd, I guess.
Modernization of equipment, hiring more pilots and other employees, investing in updating the code base - how can that be done? It's far more important to keep the stock price high by whatever means necessary, such as using government bailouts to buy back shares.
Investment capitalism is really a garbage system when it comes to building and maintaining basic infrastructure like transportation, electricity grids, roads and so on. China has demonstrated that convincingly over the past two decades, hasn't it?
When characterized as something that can't be done instead of something they don't know how to do, you know exactly where they are on the Dunning–Kruger curve.
https://www.dallasnews.com/business/local-companies/2016/08/...
at the time, then-CEO now-chairman Gary Kelly said:
> "What's unique is the partial failure, it's never happened," he said. "This isn't a drill you can run."
https://web.archive.org/web/20161112192103/http://www.dallas...