26 March 2011

Mere Outage or Outrage? Ask Alaska/Horizon Air Ticketing Agents

Via AP - ABC News / Twitter @BreakingNews: Alaska / Horizon Airlines reported that an outage of its "central computers," initially described as being of unknown origin, has delayed a number of flights. Later in the day, the same ABC News / AP report offered these additional details:
Alaska Airlines and its Horizon Air affiliate canceled 95 flights Saturday because a computer system used for flight planning failed. The outage lasted intermittently for about seven hours and resulted in the two airlines scrapping about 12 percent of their combined schedule before technicians fixed the system, which returned at 10 a.m. Pacific time.

20 March 2011

Glitch or Sabotage? China had Google Guessing

Another reason to favor glitch reduction emerges from a recent investigation Google made into Gmail service disruptions in China. According to a NY Times story, Google concluded that:
“There is no issue on our side; we have checked extensively. This is a government blockage, carefully designed to look like the problem is with Gmail.”
China neither confirmed nor denied expanding its immodest wholesale censorship of Facebook, Twitter and other sites. Politics aside, the point is that lapses in software quality can become a convenient cover for mischief -- covert or otherwise.

02 March 2011

Glitch vs. Glitch: Microsoft Live vs. Google Gmail

Since no one has figured out how to write error-free software, cloud glitches are likely, sooner or later, to affect more and more people. As Woody Leonhard notes, it's instructive to see how cloud providers react. Leonhard suggests a comparison between the recent Gmail incident and Microsoft's 30 December 2010 outage at Hotmail. In the latter case, Leonhard's research suggests that Microsoft didn't provide an official response until four days after the seemingly similar scripting problem that caused the Hotmail outage. Not that Google has given much of an explanation so far, either.

As anyone whose flight has been cancelled will attest, providing timely information eases the pain a little -- even if it's only to say "We don't have a new departure time yet, but here's what we know so far . . ." IT must make a similar commitment, or be content with user-unfriendly, unsympathetic software and services that take their cue from indifference writ large. The challenges are both technical and organizational. Support staff need to be empowered to escalate and route problems to the appropriate teams, since existing problem-resolution workflows are likely to be unsuitable for major outages.

Support services must scale along with cloud migration, and the indications so far are unimpressive.

01 March 2011

Cloud Glitch: Gmail Outage Highlights Big Benefits / Big Risk Proposition

Automation. It simultaneously extends benefits and risks. On the risk side of the equation, small problems can be dramatically magnified. A steady migration of some applications to the cloud has exposed many to this long-understood computing gamble. The latest instance is yesterday's Google Gmail outage, caused by a yet-to-be-specified software failure in Google's storage infrastructure.

Commentary about this outage has focused on the reliance on tape technology to restore Google's backups. I'm guessing that's because consumer backups may have drifted toward disk instead of tape (out of economic desperation, rather than a well-reasoned decision). The bigger story isn't about tape. It's not about the flaw itself, though that would be an interesting sidebar. It's about how one of the world's largest technology firms, with an essentially unlimited budget for its software development life cycle (SDLC), allowed flawed software to enter production systems.