29 May 2017

British Airways Glitch Echoes 2016 Southwest Outage

GlitchReporter.com: Guardian story on British Airways Outage 2017
While the postmortem is still ongoing, a May 2017 British Airways computer system glitch appears to have had much in common with a 2016 Southwest Airlines outage. Both involved seemingly well-understood hardware failures that produced a cascade of problems that delayed an orderly recovery. 

Planning for such outages is nontrivial. As the scale of data and network connectivity increases, post-failure recovery processing becomes harder to model. That said, the public perception that airlines should be doing more to mitigate these outages is understandable. Second-guessing has already begun. Was it a lapse in cybersecurity? Massive outsourcing? Loss of talent? Cost-cutting?

It may have been a failure to properly model British Airways systems and the processes required to recover from a hardware outage.

H. Herodotou, B. Ding, S. Balakrishnan, G. Outhred, and P. Fitter, "Scalable near real-time failure localization of data center networks," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '14. New York, NY, USA: ACM, 2014, pp. 1689-1698. [Online]. Available: http://doi.acm.org/10.1145/2623330.2623365

09 August 2016

Southwest Outage Blamed on Single Router Failure

Screenshot of Computerworld story on the July 2016 outage at Southwest Airlines

Single point of failure. You know the concept. Those single points are easier to see in the rearview mirror. Can a single router cause an entire data center to go down?

In a Dallas Morning News story, CEO Gary Kelly said it could -- and did. There was a "backup system" in place to address a single router failure. 

But, Kelly insisted, because of the router's unusual "partial failure," the backup procedures weren't triggered, and the problem became a massive one.

No, this isn't a very good technical explanation.
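Purely as an illustration of how a "partial failure" could slip past a backup trigger, here is a minimal Python sketch. Nothing here reflects Southwest's actual systems, which were not described publicly in any detail; the `RouterHealth` fields, both function names, and the 5% loss threshold are hypothetical. It contrasts a failover rule that fires only when a router is completely unresponsive with one that also watches for degraded forwarding.

```python
from dataclasses import dataclass


@dataclass
class RouterHealth:
    """Hypothetical health snapshot for one router (illustrative fields only)."""
    responds_to_ping: bool  # does the device answer at all?
    packet_loss: float      # fraction of traffic dropped, 0.0 to 1.0


def liveness_only_failover(health: RouterHealth) -> bool:
    """Trigger failover only when the router is completely unresponsive."""
    return not health.responds_to_ping


def degradation_aware_failover(health: RouterHealth, max_loss: float = 0.05) -> bool:
    """Trigger failover when the router is down OR dropping too much traffic.

    The 5% default threshold is an arbitrary assumption for this sketch.
    """
    return (not health.responds_to_ping) or health.packet_loss > max_loss


# A router that still answers pings but drops most of its traffic.
partial_failure = RouterHealth(responds_to_ping=True, packet_loss=0.60)

print(liveness_only_failover(partial_failure))      # False: backup never kicks in
print(degradation_aware_failover(partial_failure))  # True: backup is triggered
```

Under these assumptions, the liveness-only rule treats the half-working router as healthy, which is one plausible reading of backup procedures that "weren't triggered."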

Kelly took pains to insist that the legacy technology in their data center wasn't involved in the failure. Nor was the cause a hack of some sort (probably true, though at the time of this report it might be premature to completely rule that out).

Whether they're justified in saying so or not, Southwest Airlines unions want a change of leadership, arguing that the CEO and others are delaying technology improvements in order to maintain profits for investors. (True, Southwest employees hold plenty of Southwest stock, too.) What do the IT folks at Southwest say? I didn't see any comment on that.

CNBC listed the publicly reported airline technology failures since 2015.