
Overnight, planes tend to be plugged into ground power: to ventilate, to keep the batteries charged, for the cleaning crews, etc. Most get rebooted once in a while, but it's always possible one won't be, hence the directive to be certain.

This particular problem has been known for years (the article is from 2020).




Unfortunately, an aircraft has no “reboot”. It is just a violent power cut. A lot of headache is introduced into non-critical aircraft software because there is no “graceful shutdown” and no guarantee of how long power will last. In fact, certain hardware has an upper limit on uptime (much lower than a week) before it needs one power cut (sometimes called a power cycle), or it suffers from buffer overflows, counter overflows, and starts acting mysteriously.
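
To make the counter-overflow failure mode concrete, here is a toy C sketch. It is illustrative only, not from any avionics codebase: a 32-bit millisecond uptime counter wraps after ~49.7 days, and timestamp comparisons written naively go wrong at the wrap.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdbool.h>

    /* Toy illustration: a 32-bit millisecond uptime counter wraps to
       zero after ~49.7 days; absolute comparisons break at the wrap. */
    static bool deadline_passed_buggy(uint32_t now, uint32_t deadline) {
        return now >= deadline;                /* wrong once `now` wraps   */
    }

    static bool deadline_passed_safe(uint32_t now, uint32_t deadline) {
        return (int32_t)(now - deadline) >= 0; /* correct across one wrap  */
    }

    int main(void) {
        uint32_t deadline = 0xFFFFFF00u;  /* shortly before the wrap */
        uint32_t now      = 0x00000100u;  /* shortly after it        */
        printf("buggy: %d, safe: %d\n",
               deadline_passed_buggy(now, deadline),  /* 0: thinks time ran backward */
               deadline_passed_safe(now, deadline));  /* 1: deadline really did pass */
        return 0;
    }

A periodic power cycle sidesteps auditing every such comparison in every consumer, which is presumably why a maintenance directive is the chosen fix.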


It's amazing that's legal. Like, why do we accept software that does this? It can be done in such a way that these things don't happen. Put another way, why aren't the companies involved being fined and sued out of business? Why aren't their managers facing criminal negligence charges? It's outrageous.


Because there has never been a single commercial jetliner fatality caused by software, within its intended operational domain, failing to operate according to specification. That makes the commercial jetliner software development and deployment process by far the safest and most reliable ever conceived, by multiple orders of magnitude. We are talking in the 10-12 9s range.

And just to get ahead of: “Well what about the 737 MAX”, that was a system specification error, not due to “buggy” software failing to conform to its specification. The software did what it was supposed to do, but it should not have been designed to do that given the characteristics of the plane and the safety process around its usage.


>“Well what about the 737 MAX”, that was a system specification error, not due to “buggy” software failing to conform to its specification. The software did what it was supposed to do

Exactly: the system was designed to fly the plane into the ground if a single sensor was iced up, and that's exactly what the software did. Boeing really thought this system specification was a good idea.


That is a massive over-simplification, one that invites patently false characterizations: that it was a "stupid mistake" that would have been avoided had they not been stupid (i.e., had they adopted an average development process). That is absolutely not the case. They were really capable, but aerospace problems are really, really hard, and their safety capability had regressed from being really, really capable.

They modified the flight characteristics of the system. They tuned the control scheme to provide the "same" outputs as the old system. However, the tuning relied on a sensor that was not previously safety-critical. As the sensor was not previously safety-critical, it was not subject to safety-critical requirements like having at least two redundant copies as would normally be required. They failed to identify that the sensor became safety critical and should thus be subject to such requirements. They sold configurations with redundant copies, which were purchased by most high-end airlines, but they failed to make it mandatory due to their oversight and purchasers decided to cheap out on sensors since they were characterized as non-safety-critical even if they were useful and valuable. The manual, which pilots actually read, has instructions on how to disable the automatic tuning and enable redundant control systems and such procedures were correctly deployed at least once if not multiple times to avert crashes in premier airlines. Only a combination of all of those failures simultaneously caused fatalities to occur at a rate nearly comparable to driving the same distance, how horrifying!

An error in UX tuning, dependent on a sensor that was not made properly redundant, was the "cause". That is not a "stupid mistake". That is a really hard mistake, and downplaying it as a stupid mistake underestimates the challenges involved in designing these systems. That does not excuse their mistake: they used to do better, much better, like 1,000x better; we know how to do better; and the better way is empirically economical. But it does the entire debacle a disservice to claim it was just "being stupid". It was not. It was only qualifying for the Olympics when they needed to take the gold medal.


I really don't think it takes a mastermind of software design to go "okay, I've built a system that takes control of the plane's maneuverability; let's make sure we have redundant sensors on this". Furthermore, descriptions of MCAS and its role were dangerously underplayed so that they didn't have to tell their customers to retrain their pilots. An egregious breach of public trust in a company we put a whole lot of faith in.


>They failed to identify that the sensor became safety critical and should thus be subject to such requirements.

Whistleblower testimony indicated it wasn't a failure to identify it as safety-critical, but a conscious decision not to flag it as such to the regulator, and not to implement it as a dual-sensor system, as doing so would have caused the design to require Class D simulator training, the absence of which Boeing was relying on as a selling point to keep existing airlines from defecting to Airbus.

>They sold configurations with redundant copies, which were purchased by most high-end airlines, but they failed to make it mandatory due to their oversight and purchasers decided to cheap out on sensors since they were characterized as non-safety-critical even if they were useful and valuable.

Incorrect. All MAXes have two AoA vanes, each paired to a single flight computer. The plane has two flight computers, one on each side of the cockpit, and the computer in command typically alternates between flights. The in-command computer is henceforth the Main FC (MFC); the other operates as the auxiliary FC (AFC).

The configuration you're thinking of is an AoA disagree light, implemented by enabling a codepath in software running on the MFC whereby a cross-check against the value from the AoA vane networked to the AFC would light up a warning lamp to inform pilots that system automation would be impacted, because the AoA values seen by the MFC and AFC differed. A pilot would be expected to recognize this, adapt behavior accordingly, and take measures to troubleshoot their instruments. Importantly, however, this feature had zero influence on MCAS. MCAS only took into account inputs from the vane directly wired to the MFC. While a cross-check happened elsewhere for the sole purpose of illuminating a diagnostic lamp, there was no cross-check functionality implemented within the scope of the MCAS subsystem.

The MCAS system was also not thoroughly documented in any documentation delivered to pilots; the program test pilot got specific dispensation to leave it out of the flight manual. See the Congressional investigation, the final NTSB report, and the FAA report.

>The manual, which pilots actually read, has instructions on how to disable the automatic tuning and enable redundant control systems and such procedures were correctly deployed at least once if not multiple times to avert crashes in premier airlines.

The documentation, which included an Airworthiness Directive and a NOTAM, informed pilots that any malfunction should be treated in the same manner as a stabilizer trim runaway. That problem is characterized in aviation parlance as a continual uncommanded actuation of the trim motors. MCAS, notably, is not that: it is periodic, and in point of fact it ramps up in intensity over time until over 2° of travel are commanded by the computer per actuation event, with the timer between actuations being reset to 5 seconds by use of the on-yoke stab trim switches. None of this was communicated to pilots.

Furthermore, there were design changes to the stab-trim cutout switches between the 737NG (the MAX's predecessor) and the MAX. In the NG, the cutout switches could isolate the FC alone, or both the FC and the yoke switches, from the stab trim motor. In the MAX, however, the switches were changed to never isolate the FC from the stab trim motors, because an operational MCAS was required to check the box on FAR compliance for occupant-carrying aircraft. So when that cutout was used, all electrically assisted actuation of the horizontal stabilizer became unavailable. The manual trim wheel would be the only trim input, and in out-of-trim attitudes the loading on the control surface becomes so excessive that physical actuation without electrical assistance was not feasible on the timescales required to recover the plane. There was a maneuver known to assist in these conditions (when they occurred at high altitude) called "roller coastering", in which you dive further in the undesired direction to unload the control surface and render it actuable. This technique has not been in official documentation since the pre-NG "Dino" 737.

The events you're referring to, in which uncommanded actuations were recovered on other flights, happened at high altitude. They were recovered by countering with the electric stab-trim switches and then hitting the stab-trim cutout within the 5-second watchdog window, before MCAS could reactivate following a yoke trim-switch actuation. This procedure, and the implementation details needed to fully understand its significance, were undocumented prior to the two crashes. Furthermore, cutting MCAS/the MFC out of the stab trim circuit and finishing the flight on fully manual trim technically meant flying the aircraft in a configuration that could not be certified to carry passengers, taking the FARs prescriptively and uncompromisingly rules-as-written, with zero slack offered for convenience: MCAS was necessary for grandfathering the MAX under the old type certificate, and without MCAS functional it is technically a new beast, one that is non-compliant on control-stick force-feedback curves when approaching a stall. A compliant curve, just to make it clear, has been a characteristic of every civil transport in all jurisdictions worldwide for well over 50 years. This was not documented and only became apparent after investigation. Again, see the House findings, the FAA report, and the NTSB.
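
To make the "reset, not disable" behavior concrete, here is a deliberately toy C sketch of the actuation pattern described above. Every name, constant, and threshold is invented for illustration; this is an assumed shape, not Boeing's implementation:

    #include <stdio.h>
    #include <stdbool.h>

    #define AOA_THRESHOLD_DEG 15.0f  /* illustrative, not the real value        */
    #define RESET_DELAY_S      5.0f  /* pilot trim input resets, never disables */
    #define RAMP_STEP_DEG      0.6f  /* commanded travel grows per actuation    */

    static float next_activation_s = 0.0f;
    static float command_deg       = RAMP_STEP_DEG;

    static void command_nose_down_trim(float deg) {  /* stub actuator */
        printf("MCAS: commanding %.1f deg nose-down trim\n", deg);
    }

    /* Called periodically. Note the single input: only the AoA vane wired
       to the in-command flight computer is read; the other vane feeds the
       disagree lamp and nothing else. */
    static void mcas_step(float t_s, float aoa_main_deg, bool pilot_trim_active) {
        if (pilot_trim_active) {
            next_activation_s = t_s + RESET_DELAY_S;  /* merely postpones */
            return;
        }
        if (aoa_main_deg > AOA_THRESHOLD_DEG && t_s >= next_activation_s) {
            command_nose_down_trim(command_deg);
            command_deg      += RAMP_STEP_DEG;        /* ramps past 2 deg */
            next_activation_s = t_s + RESET_DELAY_S;
        }
    }

    int main(void) {
        /* Faulty vane stuck at a high AoA reading, no pilot trim input:
           escalating nose-down commands fire every 5 seconds. */
        for (float t = 0.0f; t < 30.0f; t += 1.0f)
            mcas_step(t, 22.5f, false);
        return 0;
    }

Framed this way, the recovery used on the surviving flights (counter with electric trim, then hit the cutout inside the 5-second window) is obvious in the code and was nearly invisible in the documentation crews were given.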

>Only a combination of all of those failures simultaneously caused fatalities to occur at a rate nearly comparable to driving the same distance, how horrifying!

Oh, the multi-billion-dollar aircraft maker built a machine that crashes itself; gaslit its regulators, pilots, airlines, and the flying public to juice the stock price so executives could meet their quarterly incentives; and diverted funds away from its QA and R&D functions to do stock buybacks, move HQ away from the factory floor, and try to union-bust. With over 300 direct, measurable deaths within a couple of months, multiple years' worth of groundings and mandated redesigns to fix all the other cut corners we've been unearthing, and veritable billions of dollars lost to delays. Heavens, it could happen to anybody. How could you possibly see this as something to get upset about? /s


Thank you for providing a more thorough and complete technical explanation.

As you can see from my final statement, I made no argument that it was not a travesty. It was ABSOLUTELY UNACCEPTABLE. This is not a defense of their inadequacy.

I was pointing out how it is absolutely incorrect to claim that it was a "stupid mistake". That argument is used by people implicitly arguing that "if only Boeing used modern software development practices like Microsoft/Google/CrowdStrike/[insert big software company here], then they would never have introduced such problems". That is asinine. As can be seen from your explanation, the problem is multi-faceted, requiring numerous design failures across implementation, integration, and incentives. In fact, the problems are even more subtle and pernicious than in my original explanation, which was derived from high-level summaries rather than the investigation reports themselves.

I do not know if this has changed in the last few years, but at Microsoft you were required to have one randomly selected person, with no required domain expertise, say they gave your code, in isolation, a spot check before it could be added. The same process applies regardless of code criticality, as they do not even have a process to classify code by criticality. This is viewed as an extraordinary level of process and quality control that most could only dream of achieving. Truly, if only Boeing threw out whatever they were doing and adopted such heavyweight process from "best-in-class" software development houses, they would have discovered and fixed the 737 MAX problems.

Boeing does not need to adopt modern software development "best practices" and whatever crap they use at Microsoft/[insert big software company here] that introduces bugs faster than ant queens. The processes that created the 737 MAX already make Microsoft and its peers look like children eating glue, yet they are inadequate for the job of making safe aerospace software and systems. What Boeing needs to do is re-adopt their old practices, the ones that make the 737 MAX development process look like a child eating glue. The 737 MAX was not stupid; it was inadequate. BOTH ARE UNACCEPTABLE, but the fix is different.


This is a totally bizarre strawman argument. Safety-critical software has almost nothing in common with Microsoft crapware, or indeed most typical desktop software. Even within the desktop software industry, MS has never been held up as "best-in-class", but rather as the butt of jokes.

As the other poster said, it doesn't take a genius to figure out that a new safety-critical system needs its sensors to be redundant. It wasn't stupid, though, it was malicious: Boeing wanted to hide the existence of MCAS so that pilot retraining wouldn't be required.


So what should we make of the issues described in the article? When (not if) this kind of thing kills people, will it be a specification error? Will we blame it on maintenance? Surely it can't be the software's fault!


First of all, who got blamed for the 737 MAX? Boeing did. This is one of the few industries where the responsibility does not get easily sloughed off.

Second, 787s have been flying for ~13 years and ~4.5 million flights [1]. Assuming they were unaware of the problem for the majority of that time, their unknowing maintenance and usage processes avoided critical failures from the stated problems across a tremendous number of flights. Given they now know about it and are issuing a directive to enhance their processes to explicitly handle the problem, we can assume it is even less likely to occur than before, which was already experimentally determined to be ludicrously unlikely. Suing someone into oblivion for an error that has never manifested as a serious failure and that is exceedingly unlikely to manifest is a little excessive.

Third, they should be remediating problems as they arise, balanced against the risks introduced by specification changes and against the alternative of other process modifications. Given Boeing's other recent failings, they should face strict scrutiny that they are faithfully following the traditional, highly effective remediation processes. It should only be worrisome if they are seeing disproportionately more problems than would be expected in an aircraft design of its age and are not remediating them robustly and promptly.

[1] https://www.boeing.com/commercial/787#overview


> Suing someone into oblivion for an error that has never manifested as a serious failure and that is exceedingly unlikely to manifest is a little excessive.

I appreciate your point of view. The air travel industry is undeniably safe, more so than any transportation system ever, by a large margin. On the other hand, it is possible to make software systems that do not have the defects described in the article. So how do we get to the place where we choose to build systems that behave correctly? I don't think we get there without severe penalties for failure.


>The air travel industry is undeniably safe, moreso than any transportation system ever.

I disagree: the Japanese Shinkansen bullet train system has never had a fatal accident, except for a single incident ~30 years ago when someone was caught in a door and dragged 100 meters. No fatalities from collisions, derailments, etc., ever, since the 1960s. That's far safer than air travel could ever claim to be.

Even other train systems have better records than commercial aviation, in general. Plane crashes are rare these days, but they still happen once in a while, and the results are usually catastrophic.

Are planes safer than cars? Well of course, but that's a really, really low bar: cars are driven by all kinds of morons who frequently (esp. in the US) have little to no training or testing, are frequently distracted, don't have a copilot who can take over at any time, and are frequently operating in a very, very chaotic environment (like city streets). It's truly a wonder there aren't more fatal crashes. But safer than trains in general? I seriously doubt it.


Actually, the Shinkansen seems to average ~100 billion passenger-km per year [1], or ~60 billion passenger-miles per year. Using that as an overestimate for the last 60 years, that is a grand total of ~3.6 trillion passenger-miles.

US commercial aviation averages ~1 trillion passenger-miles per year [2]. So if we compare the last 4 years of US aviation that is a comparable number of passenger-miles.

Over the last 4 years recorded on this dataset (2019-2022)[3] it looks like there were 5 fatalities total. Over the last 4 years recorded on this dataset (2018-2021)[4] it looks like there were 2 fatalities total.

So, while it does not appear to be safer, it is within a few factors on a passenger-mile basis. Furthermore, there are multiple periods of 4 trillion consecutive passenger-miles where there were 0 recorded accidents. It is nowhere near obvious that it is “far safer than air travel could ever claim to be”, and it is certainly a much closer race than you believed given your other assertions. (A quick check of the arithmetic follows the links below.)

[1] https://www.statista.com/statistics/1262752/japan-jr-high-sp...

[2] https://www.transtats.bts.gov/traffic/

[3] https://www.bts.gov/content/us-air-carrier-safety-data

[4] https://www.airlines.org/dataset/safety-record-of-u-s-air-ca...
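
For concreteness, the rate arithmetic above as a tiny C program; the inputs are this thread's estimates, not authoritative statistics:

    #include <stdio.h>

    int main(void) {
        /* Fatalities per trillion passenger-miles, using the figures above */
        double shinkansen = 1.0 / 3.6;  /* 1 fatality / ~3.6T pax-miles */
        double us_air_3   = 5.0 / 4.0;  /* dataset [3]: 5 / ~4T         */
        double us_air_4   = 2.0 / 4.0;  /* dataset [4]: 2 / ~4T         */
        printf("Shinkansen: %.2f\nUS air [3]: %.2f\nUS air [4]: %.2f\n",
               shinkansen, us_air_3, us_air_4);
        return 0;
    }

Roughly 0.28 versus 0.5-1.25 fatalities per trillion passenger-miles: the same ballpark, within a few factors.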


That's not exactly a fair comparison, because you're comparing distances traveled, rather than trips taken. Of course planes are going to look good, since they travel much longer distances than cars or trains, and because planes are more likely to have trouble when taking off or landing than any time in-between. It's not like you can just take a commercial airliner flight to go to your local grocery store, even though statistically you're more likely to get killed on that trip than on a cross-continent flight.


First of all, passenger-distance per event (or its inverse) is the standard metric used when comparing transportation safety. You would be hard-pressed to find any broad, rigorous comparison that does not compare on that metric. It encodes the risk of a trip to a location of a certain distance. It is absolutely a fair comparison.

Second of all, even if we do use your metric which only cares about passenger-trips per event it still does not matter. The Shinkansen has transported ~6.4 billion people since inception. As seen in the second link I provided above, US commercial aviation serves ~900 million passengers per year. So, that is 7 years of US commercial aviation to transport the same number of people the Shinkansen has ever transported. As seen on the third link the last 7 years (2016-2022) had ~6 fatalities and as seen on the fourth link the last 7 years (2015-2021) had 2 fatalities compared to the 1 fatality on the Shinkansen.

Third of all, given that the Shinkansen has transported ~6.4 billion people, but averages 150 million people per year and ~60 billion passenger-miles per year, we can reasonably conclude that I overestimated at ~3.6 trillion passenger-miles and it would likely actually be ~2.4 trillion passenger miles or just 2.5 years of US aviation. From the third link that would be a mere 1 fatality and from the fourth link 0-1 fatalities.

If we extend our analysis to the last decade the third link indicates 15 fatalities over ~10 trillion passenger miles, ~2x the Shinkansen rate, and the fourth link indicates 2 fatalities over ~10 trillion passenger miles, ~50% the Shinkansen rate. Again, broadly comparable, but it is hard to truly tell which one is "safer" than the other. And again, they are clearly in the same ballpark and not dramatically different as you implied.



Their deaths-per-passenger-mile stats are worse, though.

US airlines haven't had a single fatal crash in 15 years.

https://nypost.com/2019/08/22/video-shows-moment-man-crushed...


> So how do we get to the place where we choose to build systems that behave correctly? I don't think we get there without severe penalties for failure.

What failure? The planes work. This is puritanism.


> First of all, who got blamed for the 737 MAX? Boeing did. This is one of the few industries where the responsibility does not get easily sloughed off.

The whistleblowers dying is coincidental and convenient.

https://www.theguardian.com/business/article/2024/may/02/sec...


1. For at least one of the whistleblowers, it was certainly not "convenient", because he had already managed to go public with the accusation, the lawsuit was filed, and his deposition had already been made.

2. I'm not sure how a few whistleblowers dying disproves "responsibility does not get easily sloughed off". If anything, Boeing gets more responsibility than is warranted. Every time there's something wrong with a Boeing product, people almost reflexively start posting about how it must be caused by corner-cutting at Boeing, or how it's yet more evidence that Boeing is circling the drain. This happens even for incidents involving planes that are decades old and have a solid service history, and that by all accounts were probably caused by pilot error or improper maintenance.


Because it works fine. A maintenance tech gets one extra line item on the weekly or monthly inspection checklist.


It works fine until it doesn't and people die. At which point the blame falls on the maintenance crew? That's wrong. And where there's smoke there's fire. If the software has this horrible bug, likely the broken culture that created it has written worse, more subtle bugs.


Commercial air travel in the US is incredibly safe. The last fatal crash was in 2009.


I agree completely with the first part. But SWA-1380 was a commercial operating fatality in 2018. Not a crash into terrain, but the engine definitely crashed into the fuselage.


[flagged]


Probably not much comfort for the passenger who was ejected from the plane and died...


Because changes to that software go through an enormous amount of testing, validation, and documentation before a new baseline becomes a flashable item. Meanwhile, an always-working workaround is needed now.


Have you even found the documentation around things like ACPI? It's kinda coupled with UEFI these days, I think, and hell, I'm not even sure which hardware boards/revisions aircraft makers are using these days... Are they still on BIOS? Or old-as-sin Linux/RTOS kernels/microcontrollers?

Point being: when you start talking about high-QA systems, where the quality is non-negotiable (you will have everything documented and tested), then, barring exec/managerial malfeasance preventing that work from being done, you reach for the same simple things over and over again, since it takes a hell of a lot of work to actually characterize and certify a thing to the requisite level of reliability and operating conditions.

Testing ain't free, ya know.


> Unfortunately, an aircraft has no “reboot”. It is just a violent power cut.

That’s a reboot.


There’s nothing about a reboot that precludes a graceful shutdown.


There's also no reason why a "reboot" can't be a "violent power cut", especially if the equipment in question doesn't hold any state. For instance, there's no reason why you'd need to go through a shutdown sequence for a printer.


Please tell my printer that. It becomes _very_ grumpy if it loses power instead of being shut down via its off button.

And then it turns itself off if it's not used for a while. I hate printers.


A reboot is just a boot after a shutdown. It doesn’t matter what kind of shutdown that is.


This has to be a joke, right?

You're telling me aerospace's "real engineering" level is worse than something a sophomore can cook up?


The testing for aerospace is extremely rigorous ... For DO-178C Level A (catastrophic failure that can cause a crash or many fatal injuries), we're estimating 2 years to reach the MC/DC test coverage metric on a fairly basic software system that has two mechanical backups. And that's above and beyond the extensive unit tests.
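
For those unfamiliar with MC/DC (modified condition/decision coverage): for every decision, each condition must be shown to independently flip the outcome while the others are held fixed. A minimal sketch in C, with an invented decision:

    #include <assert.h>
    #include <stdbool.h>

    /* Invented decision, for illustration only */
    static bool decision(bool a, bool b, bool c) {
        return a && (b || c);
    }

    int main(void) {
        /* Minimal MC/DC set: n+1 = 4 vectors for 3 conditions.
           a b c -> D
           1 1 0 -> 1   toggling a flips D (pair with 0 1 0)
           0 1 0 -> 0
           1 0 0 -> 0   toggling b flips D (pair with 1 1 0)
           1 0 1 -> 1   toggling c flips D (pair with 1 0 0) */
        assert(decision(true,  true,  false) == true);
        assert(decision(false, true,  false) == false);
        assert(decision(true,  false, false) == false);
        assert(decision(true,  false, true)  == true);
        return 0;
    }

Demonstrating that independence for every condition of every decision in a real system, with review evidence, is where the years go.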

The main thing that gets checked is the worst-case timing analysis for every branch condition. And there are stack monitors to watch whether the stack is growing in size.
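
The stack monitors are typically simple; here's a generic sketch of the common "paint and watermark" technique (a standard embedded idiom, not a specific avionics API):

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define STACK_WORDS 1024
    #define PAINT 0xDEADBEEFu

    static uint32_t task_stack[STACK_WORDS];

    /* At boot, fill the task stack with a known pattern. */
    static void stack_paint(void) {
        for (size_t i = 0; i < STACK_WORDS; i++)
            task_stack[i] = PAINT;
    }

    /* Periodically scan from the unused end for the first overwritten
       word; alarm if usage creeps toward the budget. Assumes a
       downward-growing stack. */
    static size_t stack_high_water_words(void) {
        size_t untouched = 0;
        while (untouched < STACK_WORDS && task_stack[untouched] == PAINT)
            untouched++;
        return STACK_WORDS - untouched;
    }

    int main(void) {
        stack_paint();
        /* Simulate a task having used 200 words at the "top" end. */
        for (size_t i = STACK_WORDS - 200; i < STACK_WORDS; i++)
            task_stack[i] = 0;
        printf("high water: %zu words\n", stack_high_water_words());
        return 0;
    }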

Look at Rapita Systems' website for more info ... we don't use them, but they explain it well.


Wait till you hear about Boeing in space.


>an aircraft has no “reboot”. It is just a violent power cut

Guess how I typically reboot things :)


By traveling to Mexico and laying out bait along the migratory path of the butterflies?



