12 October 2009

Dependency Maze May Underlie Sidekick Data Loss

It's being called a black eye on the entire cloud computing initiative. Danger, now a Microsoft subsidiary and the original developer of the innovative Sidekick mobile device, is reporting a major server failure (via ZDNet's Adrian Kingsley-Hughes and numerous other sources over the weekend of 10 October) with what is feared to be a heavy loss of data, apparently affecting thousands of users. Whether it deserves the corruptus in extremis plaque is too early to tell, but the prognosis is not good: T-Mobile is advising users not to remove device batteries or allow Sidekicks to become fully discharged. T-Mobile may also have halted online sales of the device.

As an early adopter (but not, thankfully, a current user) of the cleverly designed Sidekick and the very usable back-end web resources provided by Danger, I can understand why customers would panic. The Danger back end meant that even with the early Sidekicks it was possible to kick free of the device's keyboard -- and the Sidekick has always had an excellent keyboard -- and use a full-size computer with a full-size browser to manage the address book, notes and calendar. Because these useful features appeared years ago, Sidekick users may be more likely than other mobile device users to rely upon "cloud" infrastructure.

I've since reluctantly migrated to Windows Mobile, with considerable loss of functionality on the back end. Recently -- in fact, just last week -- some of that functionality has slowly begun to arrive via the Microsoft MyPhone initiative. One wonders whether the same operations practices are in effect for the infrastructure behind MyPhone, or whether Danger was left to fend for itself -- organizationally speaking.

What happened?  Reuters offered one cryptic clue from Microsoft:

Microsoft said in an emailed statement that the recovery process has been "incredibly complex" because it suffered a confluence of errors from a server failure that hurt its main and backup databases supporting Sidekick users.

"Confluence of errors" may be taken to mean a chain of interdependencies. The cause may be organizational -- at least one blogger has suggested that Microsoft under-resourced Danger after the acquisition, and that founders are less motivated to ensure high service quality once they disappear into the org chart of a larger firm. The data center's SAN contractor, Hitachi Data Systems, is rumored to be involved.

But it seems clear that technology approaches should also have been explored. To avoid, or at least anticipate, major catastrophes in the face of the complexity involved in architecting major data centers and complex real-time messaging and synchronization applications, the responsible team had several options that go beyond "just keep a backup." Simulations and disaster recovery (DR) walkthroughs are part of the answer.

Another approach is being considered by the U.S. military. The military has an urgent need to understand the connection between a particular mission and the IT assets it depends upon. IT assets could go offline due to communications issues, break, or become compromised. It is becoming less and less obvious which assets are critical for a particular mission. This is the subject of research being conducted by Applied Visions for the Air Force.* A better understanding of dependencies was also needed for SCADA systems; that was the goal of the Idaho National Lab CIMS project.
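To make the idea of a dependency map concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the service and asset names are invented, and it reflects neither Danger's actual architecture nor the Applied Visions or CIMS work. It simply records which items depend on which assets, then walks the map to list everything affected when one asset fails.

```python
from collections import defaultdict, deque

# Hypothetical dependency map (invented names): each key depends on
# every item in its list.
DEPENDS_ON = {
    "contact_sync": ["app_server", "primary_db"],
    "web_address_book": ["app_server", "primary_db"],
    "restore_from_backup": ["backup_db"],
    "app_server": ["san"],
    "primary_db": ["san"],
    "backup_db": ["san"],
}


def impacted_by(failed, depends_on):
    """Return everything that depends, directly or transitively, on `failed`."""
    # Invert the edges so we can walk from an asset to its dependents.
    dependents = defaultdict(set)
    for item, needs in depends_on.items():
        for need in needs:
            dependents[need].add(item)

    # Breadth-first walk outward from the failed asset.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for item in dependents[current]:
            if item not in seen:
                seen.add(item)
                queue.append(item)
    return seen


if __name__ == "__main__":
    for asset in ("backup_db", "san"):
        print(asset, "->", sorted(impacted_by(asset, DEPENDS_ON)))
```

In this toy map, losing the backup database affects only the restore path, while losing the shared storage layer takes out the primary database, the backup database and every service above them. That is the kind of hidden common dependency a map like this is meant to surface before a failure rather than after.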

Update (13 OCT 09): T-Mobile is reporting some progress in restoring customer data.
Update (16 OCT 09): Microsoft is reporting more progress, but no explanation to speak of.

* I am an employee of Applied Visions, but not currently working on the project. Opinions expressed are mine alone and not the official views of the Company.
