05 May 2011

VMWare Cloud's Stormy Start and Superspecialization in SM

It's still in Beta, so cut them some slack, but two failures at VMWare's new Cloud Foundry infrastructure may give some prospective customers pause. Cloud Foundry has only been available since 12 April 2011, and the two failures on 25 and 26 April may be related, but the failures are noteworthy nonetheless. One reason is that prospective customers are keen to learn how VMWare coordinates its service interruptions with partners and customers.

As reported by Network World, the first outage was caused by a PSU failure in a storage cabinet. This is the sort of failure to be expected at any facility of this kind. The outage lasted 10 hours. The next day, while staff were working out remediation plans to address future failures, a second outage struck. As VMWare put it, "Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry" (via Network World).

It's possible that in an era of superspecialization yet another specialization has been added to the mix: service availability management for mega-datacenters. Not likely to be in the job description for my next job or yours, but it will be part of someone's highly compensated sweat.

Let's hope there's still room in mega-datacenter budgets for some decent CRM. Based on recent history, and the steady, though reluctant, migration to cloud services, they will need it.


23 April 2011

Cloudy Computing: Amazon EC2 2011 Outage Added to 2008, 2009 Events

Cakewalk.com ECommerce Site Explains EC2 Outage to Customers
The net's collective memory can offset even the short memories of consumers and investors. Search for Amazon web service outages, and one is reminded that it was around the time that rumblings of the 2008 Big Recession began to make themselves felt, in July to be exact, that Amazon's S3 storage service suffered an outage. The July outage affected Twitter, which used S3 to store images. Keep reading through the search results, and one is further reminded that it was the second outage that year; earlier, in February, Amazon explained the outage as an overload of authentication requests. That turned out to be a problem with server-to-server system health reporting in Amazon's farm. In June of 2009, a lightning strike on one of Amazon's data centers took down some resources for more than four hours. Amazon's health dashboard described the latest problem as "connectivity and latency issues with RDS database instances in the US-East-1 region." In other words, for some customers, the Elastic Compute Cloud (EC2) stretched and snapped again in April 2011.

The screenshot above is taken from Cakewalk, a music technology firm based in the Greater Boston metro area. Cakewalk is an Amazon customer, and its ecommerce site was directly affected by the outage. As can be seen from their message, the outage interrupted an existing web campaign, and obviously no ecommerce revenues were possible during the approximately two-day outage. In addition to the lost revenue, as web staff scramble to compensate for the outage, errors can be made in reconfiguring sites with workarounds. There's no way for the GlitchReporter to know the cause for sure, but Cakewalk's ecommerce store is now getting a 404 on the store's home page (see below).

These failures are infrequent, and failures will happen in the cloud just as they happen to internal data centers, but there are important differences, too. As the GlitchReporter noted not long ago with a Gmail cloud outage, the scale and complexity of cloud resources may reduce the frequency, but perhaps not the cascading effects of failures across enterprises. Perhaps more importantly, because technical communications from big company cloud firms (Microsoft, Google, Amazon) tend to be sketchy during such outages, customers could not be blamed for thinking (even if erroneously) that they might be better able to work around and plan for failures that occur inside their own data centers.




26 March 2011

Mere Outage or Outrage? Ask Alaska/Horizon Air Ticketing Agents

Via AP - ABC News / Twitter @BreakingNews: Alaska / Horizon Airlines reported that an outage of its "central computers," initially stated to be of unknown origin, has delayed a number of flights. Later in the day, the same ABC News / AP report offered these additional details:
Alaska Airlines and its Horizon Air affiliate canceled 95 flights Saturday because a computer system used for flight planning failed. The outage lasted intermittently for about seven hours and resulted in the two airlines scrapping about 12 percent of their combined schedule before technicians fixed the system, which returned at 10 a.m. Pacific time.


20 March 2011

Glitch or Sabotage? China had Google Guessing

Another reason to favor glitch reduction emerges from a recent investigation Google made into Gmail service disruptions in China. According to a NY Times story, Google concluded that:
“There is no issue on our side; we have checked extensively. This is a government blockage, carefully designed to look like the problem is with Gmail.”
China did not confirm or deny adding to its immodest wholesale censorship of Facebook, Twitter and other sites. Politics aside, the point is that lapses in software quality can become a convenient cover for mischief -- covert or otherwise. 


02 March 2011

Glitch vs. Glitch: Microsoft Live vs. Google Gmail

Since no one has figured out how to write error-free software, cloud glitches, sooner or later, are likely to affect more and more people. As Woody Leonhard notes, it's instructive to see how cloud providers react. Leonhard suggests a comparison between the recent Gmail incident and Microsoft's 30 December 2010 outage at Hotmail. In the latter case, Leonhard's research suggests that Microsoft didn't provide an official response until four days after the seemingly similar scripting problem that caused the Hotmail outage. Not that Google has given much of an explanation so far, either.

As anyone whose flight has been cancelled will attest, providing timely information eases the pain a little -- even if it's only to say "We don't have a new departure time yet, but here's what we know so far . . ." IT must make a similar commitment, or be content with user-unfriendly, unsympathetic software and services that take a cue from indifference writ large. The challenges are both technical and organizational. Support staff need to be empowered to escalate and route problems to the appropriate teams, since existing problem-resolution workflows are likely to be unsuitable for major outages.

Support services must scale along with cloud migration, and the indications so far are unimpressive.


01 March 2011

Cloud Glitch: Gmail Outage Highlights Big Benefits / Big Risk Proposition

Automation. It simultaneously extends benefits and risks. On the risk side of the equation, small problems can be dramatically magnified. A steady migration of some applications to the cloud has exposed many to this long-understood computing gamble. The latest instance is yesterday's Google Gmail outage, caused by a yet-to-be-specified software failure in Google's storage infrastructure.

Commentary about this outage has focused on the reliance on tape technology to restore from Google's backups. I'm guessing that's because consumer backups may have drifted toward disk backups instead of tape (out of economic desperation, rather than a well-reasoned decision). The bigger story isn't about tape. It's not about the flaw itself, though that would be an interesting sidebar. It's about how one of the world's largest technology firms, with an essentially unlimited budget for its software development life cycle (SDLC), allowed flawed software to enter production systems.


03 January 2011

Sprint Takes Down Long Island Railroad Payment Systems


At 11:05 on 2 January 2011, subscribers to the New York Metropolitan Transportation Authority's Long Island Railroad alerts system received an email stating that credit and debit cards could not be used for ticket purchases. The full text of the message follows:
Ticket Machines at all LIRR stations are not accepting credit or debit for ticket purchases due to a problem with Sprint communication lines.  Please purchase your tickets using cash, while the LIRR works to find a solution.
The problem was reported as corrected at 14:39, according to a later MTA-LIRR message.

Needless to say, for many travelers highly dependent upon rail services in the area, this was not a minor inconvenience. (For curious readers unfamiliar with this area, a round trip off-peak ticket from SUNY Stony Brook to Manhattan costs $23.50. That's 106 miles by car, or $108.12 at January 2011 GSA mileage reimbursement rates.)
