29 May 2017

British Airways Glitch Echoes 2016 Southwest Outage

GlitchReporter.com: Guardian story on British Airways Outage 2017
While the postmortem is still ongoing, a May 2017 British Airways computer system glitch appears to have had much in common with a 2016 Southwest Airlines outage. Both involved seemingly well-understood hardware failures that produced a cascade of problems that delayed an orderly recovery. 

Planning for such outages is nontrivial. As the scale of data and network connectivity increases, post-failure recovery processing becomes harder to model. That said, the public perception that airlines should be doing more to mitigate these outages is understandable. Second-guessing has already begun. Was it a lapse in cybersecurity? Massive outsourcing? Loss of talent? Cost-cutting?

It may have been a failure to properly model British Airways' systems and the processes required to recover from a hardware outage.

H. Herodotou, B. Ding, S. Balakrishnan, G. Outhred, and P. Fitter, "Scalable near real-time failure localization of data center networks," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '14. New York, NY, USA: ACM, 2014, pp. 1689-1698. [Online]. Available: http://doi.acm.org/10.1145/2623330.2623365
