08 August 2016

Delta Airlines' Turn for a Resilience and DR Systems Failure

ComputerWorld Story on Delta Airlines August System-wide Outage
Power failure is one of the most planned-for disaster recovery scenarios, yet the DR plans meant to address it appear to fail on a regular basis. As reported in this Computerworld story, a power outage struck Delta's headquarters around 2:30 a.m.

The result was not only a system-wide outage, but also side effects seemingly designed to strain already frayed traveler nerves. According to the report, "airport screens and other flight status systems were incorrectly showing flights as being on time."

Testing for power failure is one of the most obvious elements of the DR canon, and has been for decades. Such testing is anything but trivial, but this and other system-wide outages suggest that resilience planning is not given a high enough priority.

Announcements about the outage and related customer service support are key, as shown in the recent Southwest Airlines outage. That airport monitors were showing incorrect status suggests that secondary systems designed to degrade gracefully were not working properly, either.

Engineers have suggested a number of remedies. Software-defined networking (SDN) has been proposed as one way to make communication systems more resilient. In their study "The Datacenter as a Computer," Barroso, Clidaras and Hölzle suggest an approach that involves "tolerating faults, not hiding them."
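
To make the idea of graceful degradation concrete, here is a minimal sketch of how a status display might surface a fault rather than hide it. It is an illustration only, not a description of Delta's systems: the FlightStatus record, the display_text function, the five-minute MAX_AGE threshold and the flight number are all hypothetical. The assumption is that each status record carries the time it was last confirmed upstream, so a monitor can refuse to show "ON TIME" when the data behind it has gone stale.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical threshold: how old a status may be before the display stops trusting it.
MAX_AGE = timedelta(minutes=5)


@dataclass
class FlightStatus:
    flight: str           # hypothetical flight identifier, e.g. "DL100"
    status: str           # e.g. "ON TIME", "DELAYED"
    updated_at: datetime  # when the upstream system last confirmed this status


def display_text(record: Optional[FlightStatus], now: datetime) -> str:
    """Return the text a monitor should show, surfacing faults instead of hiding them."""
    if record is None:
        # No data at all: say so rather than showing a blank or a default value.
        return "STATUS UNAVAILABLE - SEE AGENT"
    if now - record.updated_at > MAX_AGE:
        # Stale data: drop the last known status and show when it was last confirmed.
        return f"{record.flight}: STATUS UNAVAILABLE (last update {record.updated_at:%H:%M})"
    return f"{record.flight}: {record.status}"


if __name__ == "__main__":
    now = datetime(2016, 8, 8, 6, 0)
    stale = FlightStatus("DL100", "ON TIME", datetime(2016, 8, 8, 2, 25))
    print(display_text(stale, now))
```

The design choice is that the display, not the back end, decides whether data is trustworthy, so the screen degrades to an honest "unavailable" even when the system feeding it has stopped responding entirely.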

UPDATE (Via Reuters) A former industry worker suggested this possible explanation: "The carrier was probably running a routine test of its backup power supplies when the switch gear failed and locked Delta out of its reserve generators as well as from Georgia Power, industry analyst and former airline executive Robert Mann said. That would result in a shutdown of Delta's data center, which controls bookings, flight operations and other critical systems, he said."
