This is not a big story, but I find it interesting. Last week American Airlines had its reservations computer system — called SABRE — go offline for most of a day leading to the cancellation of more than 700 flights. Details are still sketchy (here’s American’s video apology) but this is beginning to look like a classic example of a system that became too integrated and a company that was too dependent on a single technology.
To be clear, according to American the SABRE system did not itself fail; what failed was the airline’s access to its own system, which is a networking problem. And for further clarification, American no longer owns SABRE, which was spun off several years ago as Sabre Holdings, but the airline is still the system’s largest customer.
It’s interesting that Sabre Holdings has yet to say anything about this incident.
American built the first computerized airline reservation system back in the 1950s. It was so far ahead of its time that the airline not only had to write the software, they built the hardware it ran on, too. Over the years competing systems were developed at other airlines but some of those — TWA and United included — were splintered versions of SABRE. American has modernized and extended the same code base for over 50 years, which is long even by mainframe standards.
Today SABRE is probably the most intricate and complex system of its type on earth and Sabre Holdings sells SABRE technology to other industries like railroads and trucking companies. In many ways it is hard to dissociate the airline and the computer system, and that seems to be the problem they had last week.
The American SABRE system includes both a passenger reservation system and a flight operations system. Last week the passenger reservation system became inaccessible because of a networking issue. In addition to reservations, passenger check-in, and baggage tracking, the system also passes weight and location information over to the flight operations system which calculates flap settings and V speeds (target takeoff speeds based on aircraft weight and local weather) for each departure runway and flight combination. The lack of either system will cause flight delays or cancellations, not just because the calculations have to be done by hand, but because the company had become totally dependent on SABRE for running its business.
Without SABRE American literally didn’t know where its airplanes were.
Here’s an example. SABRE has backup computer systems, but all systems are dependent on a microswitch on the nose gear of every American airliner to tell them when the plane has left the ground. That microswitch is the dreaded single point of failure. And while it may not be that switch that failed in this instance, it is still a second-order failure, because if you can’t communicate with the microswitch it may as well be busted.
That’s what happens with such inbred systems that no one person fully understands. But it’s easy to get complacent and American was used to having its systems up and running 24/7. The last significant computer outage at American, in fact, happened back in the 1980s.
That one was caused by a squirrel.
SABRE did make a statement, affirming that their system was not at fault: https://www.marketwatch.com/story/sabre-statement-regarding-american-airlines-outage-2013-04-16
AA initially said that SABRE’s system was to blame, then retracted that and replaced it with its current story about its own software.
Not only isn’t this a big story, it’s a time waster to your readers. The entire story could be paraphrased in a few sentences, one of which would be “We really don’t know what happened.” Why not wait until you have some actual, um, news before writing this. Clearly not up to your usual high journalistic standards, Bob.
for systems programmer types, this is a very cautionary tale. every company that has been in business more than 10 years has a boatload of proprietary code done in languages for which the schools no longer turn out fresh, cheap meat that can maintain it. enough of it is strangulated spaghetti code that violates type restrictions and ignores basic libraries (ready-to-run, robust, and proven), as any reader of thedailywtf.com could tell you. assuming they never admitted to writing with their left foot themselves.
the pivot of the story is “single point failure.” this is a recurrent design flaw that exists in every complex system, not just where nerds are grinding out case and if/then statements. we had one in a bridge up here in Minnesota, ignored by the highway department, that killed 13 folks in rush hour just a few short years ago. every car recall you find in the newspaper is a single-point failure that affects safety and is in millions of cars because, hey, Joe, they looked OK to me….
so this is relevant to almost everybody in the readership. the precise cause of the semaphore fail or network fail or out of bounds condition is something that everybody doing just-in-time stuff needs to know about, so they can think about whether they have one of those in their house as well.
you can rest assured that The New American Airlines is not going to post the code and the conditions that caused the fail.
but if the root cause is isolated, rest assured that if anybody else hears about it, NostrilDrippus Predicts! ™ there is an 80% chance that the info will be considered Cringeworthy, and you will then have the full story in front of you, here.
The day that this happened the word on the street was that while AA initially thought that the problem was with SABRE, the actual cause was a single point of failure between the AA domain and the SABRE domain. Specifically, it was a switch / communications link that was outsourced and maintained by HP that went down, disrupting the ability of the AA users to reach the SABRE data.
If this is true, then this is a huge single point of failure. It is astonishing that a risk this large and this likely (outsourced IT stuff goes belly-up all the time) existed without any fall-back at a company of AA’s size and sophistication.
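To put rough numbers on why one unduplicated link matters, here is a minimal back-of-the-envelope sketch. Every availability figure in it is a made-up assumption for illustration, not a measurement from AA, Sabre, or HP, and the redundancy formula assumes the two links fail independently, which real links sharing a building or a vendor often do not.

```python
HOURS_PER_YEAR = 24 * 365


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(*paths: float) -> float:
    """Availability of parallel paths where any single working path is enough.

    Assumes the paths fail independently of one another.
    """
    all_down = 1.0
    for availability in paths:
        all_down *= 1.0 - availability
    return 1.0 - all_down


def downtime_hours(availability: float) -> float:
    """Expected hours per year during which the chain is unreachable."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical figures, chosen only to make the arithmetic visible.
AIRLINE_SIDE = 0.9999   # assumed availability of the airline's own network
LINK = 0.999            # assumed availability of one outsourced link
SABRE_SIDE = 0.9999     # assumed availability of the reservation host

single_link = serial_availability(AIRLINE_SIDE, LINK, SABRE_SIDE)
dual_link = serial_availability(
    AIRLINE_SIDE, redundant_availability(LINK, LINK), SABRE_SIDE
)

print(f"one link:  {downtime_hours(single_link):.1f} hours/year down")
print(f"two links: {downtime_hours(dual_link):.1f} hours/year down")
```

With these assumed inputs the single link dominates the downtime of the whole chain, and a second independent link removes most of it. The point is the shape of the arithmetic, not the specific numbers.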
How many times must it be said – you cannot just outsource and “throw it over the wall”. You have to manage the process, the risks, and the outcomes even though you outsource an activity, or the activity will fail.
Kind of a shame your shift key is broke. You know you can get a new keyboard pretty cheap.
…so your reaction to feeling your time was wasted was to spend more time complaining?
It always amuses me when people complain about the subject of a particular blog post. It’s Bob’s blog, so he can write about whatever he wants to! Hey, if that post isn’t of interest to you, just don’t read it!
Did the IQ suddenly drop in this forum? LOL – I think it’s an interesting story and an example of the sort of planning issues that most people generally learn the hard way …
One interesting question that comes to mind is, “Should all software have a mandatory ‘retirement date’?” – we write the stuff, then hack it for years to “fix” bugs (or move them) and every time a new programmer comes along they warp the code base slightly … eventually it gets to the point where even updating a comment can introduce a fatal problem in the code. So what was SABRE written in originally, FORTRAN? Is the code base still FORTRAN … probably ….
Wikipedia: “Programs were originally written in assembly language, later in SabreTalk, a proprietary dialect of PL/I, and now in C.”
SABRE runs on IBM’s TPF (Transaction Processing Facility) environment.
AA’s flight operations system (FOS) is separate from the reservation system. FOS grounded the fleet back in 2004 when someone entered a date in the incorrect format in the AA Cargo module, which caused the program to eat up all the system resources and necessitated a reset.
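The usual defense against that kind of input is to reject anything that does not match one strict format before it reaches the rest of the program. The sketch below is only an illustration in Python; the function name `parse_ship_date` and the ISO-style format are assumptions, and nothing here reflects how FOS or the AA Cargo module actually handles dates.

```python
from datetime import date, datetime
from typing import Optional

# Assumed input format for this illustration: four-digit year, month, day.
DATE_FORMAT = "%Y-%m-%d"


def parse_ship_date(raw: str) -> Optional[date]:
    """Return a date if `raw` matches DATE_FORMAT, otherwise None.

    Rejecting bad input at the boundary keeps a malformed value from
    being handed to downstream code that might loop or allocate on it.
    """
    try:
        return datetime.strptime(raw.strip(), DATE_FORMAT).date()
    except ValueError:
        return None


for candidate in ("2004-08-01", "08/01/2004", "2004-13-01", ""):
    parsed = parse_ship_date(candidate)
    status = parsed.isoformat() if parsed else "rejected"
    print(f"{candidate!r}: {status}")
```

The caller has to handle the `None` case explicitly, which is the point: a bad date becomes a rejected entry rather than a system-wide reset.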
AA has over 600 planes, so I am not sure 600 switches are a single point of failure by any means. If there is any single point of failure it is the VAX preprocessors that handle communications to the mainframes.
Planes can obviously land if the Flight Operations System is down or communication is lost with the plane. FOS can also be manually updated with arrival/departure time.
I wager my last dime the day will come when some critical system will fail due to a random computer error and civilization will cease to exist. The problem is that complex computer systems aren’t designed and/or maintained to account for the effects of change and complexity.
Airlines use very robust systems (multi redundant), this is their life blood. I think someone ordered the plug pulled. Like the stopping of all trains from Boston to New York. Just to be safe. It was an American flight that struck the World Trade Center.
I’m liking the software retirement-date idea. It forces the company to (ideally) always have someone on hand who understands each piece of the puzzle, and to modernize consistently. Since the retirement aspect would affect *all* lines of code, it would be easy to plan for as it would be consistent, as opposed to Y2K which was a sudden-panic-mad-dash. It would also give an incentive to code efficiently.
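For what it’s worth, the bookkeeping side of a retirement-date policy is simple to sketch. The Python below assumes a hypothetical registry in which every module carries a retirement date; the module names and dates are invented, and a real policy would need far more than a date check.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Module:
    """One entry in a hypothetical code-retirement registry."""
    name: str
    retire_on: date


def overdue(modules: List[Module], today: date) -> List[str]:
    """Names of modules whose retirement date has already arrived."""
    return [m.name for m in modules if m.retire_on <= today]


def due_soon(modules: List[Module], today: date, warn_days: int = 180) -> List[str]:
    """Names of modules retiring within the next `warn_days` days."""
    return [
        m.name
        for m in modules
        if 0 < (m.retire_on - today).days <= warn_days
    ]


# Invented example data.
REGISTRY = [
    Module("fare_rules", date(2012, 1, 1)),
    Module("crew_scheduling", date(2013, 9, 1)),
    Module("baggage_tracking", date(2020, 1, 1)),
]

TODAY = date(2013, 4, 29)
print("overdue: ", overdue(REGISTRY, TODAY))
print("due soon:", due_soon(REGISTRY, TODAY))
```

A check like this could run in a nightly build so that an approaching date shows up as a warning months ahead instead of as a surprise.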
I’ve long had a similar idea for laws. Every single law ought to have an expiration date. Even the basics, like theft and murder … everything. This would keep politicians busy, so they’d have less time to spend on making new laws to remove our freedom / cash, and it would force every law to be reconsidered regularly to see whether it was really a good idea after all or whether it still is.
Mandatory retirement of code would give you a “drop dead date” but management would (as it always does) push the project back with other priorities until 2 weeks before the due date.
In 1967 the whole SABRE system went down because an IBM Customer Engineer (hardware maintenance person) changed a burnt out console light bulb. There was a short in the bulb socket, and inserting the new bulb tripped the emergency power off relay for the device which was a switching unit for placing various mainframes in the computer room on and off line. When the switch EPO’d, all the attached computers also EPO’d. A very noisy computer room became dead silent for several seconds prior to the onset of panic:)
I thought EPO was a performance enhancing drug.
I suspect Cringely posted this because part 2 is that the problem was caused by IBM failure caused by having H1Bs instead of quality tech.
Sounds like there may be another squirrel story in the works!! 😀
Look here Bob, we all KNOW why this happened – the Indians got their hands on it. They did the same thing in 2005 with ComAir when the ALL INDIAN IT DEPARTMENT screwed up and used a short int instead of a long int. When the crew scheduling system exceeded that limit on Christmas Day 2005, the ENTIRE AIRPORT SYSTEM IN THE US CAME CRASHING DOWN. Not just one airline, but THE WHOLE thing. I checked ComAir’s website on that day and sure enough EVERY last name in their IT section listed was Indian.
The IQ81 geniuses from India who cant build enough toilets for their people did this.
What about the American genius who hired them?