09 August 2016

Southwest Outage Blamed on Single Router Failure

Screenshot of Computerworld story on Southwest Airlines Outage in July 2016
Computerworld Story on July 2016 Outage at Southwest Airlines

Single point of failure. You know the concept. Those single points are easier to see in the rearview mirror. Can a single router cause an entire data center to go down?

In a Dallas Morning News story, CEO Gary Kelly said it could -- and did. There was a "backup system" in place to address a single router failure. 

But, Kelly insisted, because of the router's unusual "partial failure," the backup procedures weren't triggered, and the problem became a massive one.

No, this isn't a very good technical explanation.
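One plausible reading of a "partial failure" that never triggers the backup is a device that still answers health checks while no longer passing traffic properly. The sketch below is only an illustration of that idea, not a description of Southwest's systems: the `RouterSample` fields, the function names, and the 99% delivery threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RouterSample:
    """One monitoring sample for a router (all fields are hypothetical)."""
    responds_to_ping: bool   # liveness: the device answers at all
    packets_sent: int        # probe packets sent through the device
    packets_delivered: int   # probe packets that arrived on the far side


def liveness_only_failover(sample: RouterSample) -> bool:
    """Naive rule: fail over only when the router stops answering."""
    return not sample.responds_to_ping


def end_to_end_failover(sample: RouterSample,
                        min_delivery_ratio: float = 0.99) -> bool:
    """Stricter rule: also fail over when traffic is not getting through."""
    if not sample.responds_to_ping:
        return True
    if sample.packets_sent == 0:
        # No evidence that the data path works, so treat it as unhealthy.
        return True
    ratio = sample.packets_delivered / sample.packets_sent
    return ratio < min_delivery_ratio


# A router that is "up" but drops 60% of the probe traffic.
partial_failure = RouterSample(responds_to_ping=True,
                               packets_sent=1000,
                               packets_delivered=400)

print(liveness_only_failover(partial_failure))  # False: backup never triggers
print(end_to_end_failover(partial_failure))     # True: backup triggers
```

Under these assumptions, a liveness-only check keeps the ailing router in service, while a check that measures whether traffic actually arrives would hand over to the backup.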

Kelly took pains to insist that the legacy technology in their data center wasn't involved in the failure. Nor was the cause a hack of some sort (probably true, though at the time of this report it might be premature to completely rule that out).

Whether they're justified in saying so or not, Southwest Airlines unions want a change of leadership, arguing that the CEO and others are delaying technology improvements in order to maintain profits for investors. (True, Southwest employees hold plenty of Southwest stock, too.) What do the IT folks at Southwest say? I didn't see any comment on that.

CNBC listed the publicly reported airline technology failures since 2015.

08 August 2016

Delta Airlines' Turn for a Resilience and DR Systems Failure

Computerworld Story on Delta Airlines August System-wide Outage
Power failure is one of the most planned-for disaster recovery modes, yet DR plans to address it appear to fail on a regular basis. As reported in this Computerworld story, a power outage struck Delta's headquarters around 2:30 a.m.

The result was not only a system-wide outage, but also side effects seemingly designed to fray already strained traveler nerves. According to the report, "airport screens and other flight status systems were incorrectly showing flights as being on time."

Testing for power failure is one of the most obvious elements of the DR canon, and has been for decades. Such testing is anything but trivial, but it appears from this and other system-wide outages that resilience planning is not given a high enough priority.

Announcements about the outage and related customer service support are key, as shown in the recent Southwest Airlines outage. That airport monitors are showing incorrect status suggests that secondary systems designed to degrade gracefully are not working properly, either.
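What graceful degradation could look like for a status display can be sketched in a few lines. This is an assumption-laden illustration, not a description of Delta's systems: the function name, the five-minute freshness limit, and the fallback message are all hypothetical. The idea is that a display which has lost its data feed should say so rather than repeat the last status it received.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical freshness limit: older data is treated as unreliable.
MAX_AGE = timedelta(minutes=5)
FALLBACK = "STATUS UNAVAILABLE - CHECK WITH AGENT"


def display_status(status: str,
                   last_updated: Optional[datetime],
                   now: datetime) -> str:
    """Return the text to show for a flight, degrading when data is stale."""
    if last_updated is None or now - last_updated > MAX_AGE:
        # Surface the fault instead of hiding it behind the last known value.
        return FALLBACK
    return status


now = datetime(2016, 8, 8, 6, 0, tzinfo=timezone.utc)
fresh = now - timedelta(minutes=1)
stale = now - timedelta(hours=3)

print(display_status("ON TIME", fresh, now))  # ON TIME
print(display_status("ON TIME", stale, now))  # STATUS UNAVAILABLE - CHECK WITH AGENT
```

The design choice is simply to attach a timestamp to every status and to refuse to display one that is too old.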

Engineers have suggested a number of remedies. Software-defined networking (SDN) has been suggested as one way to make communication systems more resilient. In their study "The Datacenter as a Computer," Barroso, Clidaras and Hölzle suggest an approach that involves "tolerating faults, not hiding them."

UPDATE (via Reuters): An industry analyst suggested this possible explanation: "The carrier was probably running a routine test of its backup power supplies when the switch gear failed and locked Delta out of its reserve generators as well as from Georgia Power, industry analyst and former airline executive Robert Mann said. That would result in a shutdown of Delta's data center, which controls bookings, flight operations and other critical systems, he said."