23 April 2011

Cloudy Computing: Amazon EC2 2011 Outage Added to 2008, 2009 Events

Cakewalk.com ECommerce Site Explains EC2 Outage to Customers
The net's collective memory can offset even the short memories of consumers and investors. Search for Amazon web service outages, and one is reminded that Amazon's S3 storage service suffered an outage in July 2008, around the time that rumblings of the Great Recession began to make themselves felt. The July outage affected Twitter, which used S3 to store images. Keep reading through the search results, and one is further reminded that it was the second outage that year; in February, Amazon had attributed an earlier outage to an overload of authentication requests. The July problem turned out to be a fault in server-to-server system health reporting in Amazon's farm. In June of 2009, a lightning strike on one of Amazon's data centers took down some resources for more than four hours. Amazon's health dashboard described the latest problem as "connectivity and latency issues with RDS database instances in the US-East-1 region." In other words, for some customers, the Elastic Compute Cloud (EC2) stretched and snapped again in April 2011.

The screenshot above is taken from Cakewalk, a music technology firm based in the Greater Boston area. Cakewalk is an Amazon customer, and its ecommerce site was directly affected by the outage. As can be seen from the company's message, the outage interrupted an existing web campaign, and no ecommerce revenue was possible during the approximately two-day outage. Beyond the lost revenue, web staff scrambling to compensate for an outage can make errors when reconfiguring sites with workarounds. There's no way for the GlitchReporter to know the cause for sure, but Cakewalk's ecommerce store is now returning a 404 on its home page (see below).

These failures are infrequent, and failures will happen in the cloud just as they happen in internal data centers, but there are important differences, too. As the GlitchReporter noted not long ago after a Gmail cloud outage, the scale and complexity of cloud resources may reduce the frequency of failures, but perhaps not their cascading effects across enterprises. Perhaps more importantly, because technical communications from the big cloud providers (Microsoft, Google, Amazon) tend to be sketchy during such outages, customers could not be blamed for thinking (even if erroneously) that they might be better able to plan for and work around failures that occur inside their own data centers.

