01 March 2010

It's Not A Game: PS3 Firmware Bug Affects Product Worldwide

"No offline game play" appears to be the result of a firmware bug, according to CNET's coverage of a worldwide glitch affecting almost all Playstation 3 (PS3) game consoles.  The bug, reported by the console software as "error 8001050F," (this site offers a humorous "fix" video) is variously reported as caused by a calendar problem or by issues with "trophy support."  A calendar issue, either direct or indirect, could be suspected due to longstanding issues with weak software testing of leap year conditions.  A recent example appeared in a preview version of Microsoft SQL Server 2008.


Photo courtesy of Wikimedia Commons.


26 December 2009

Software Test Failure Apparent Cause of Latest BlackBerry Outage


In the second outage this month for Research In Motion (RIM), maker of the popular BlackBerry smartphones, messaging services including email were affected for all North American customers.  Phone service was unaffected.  While no announcements are archived on RIM's Press Releases page, it's tempting to recall the April 2007 outage, which was blamed on an update intended to improve cache performance. This time it appears the cause was an error in a new release of Messenger, the client application for RIM's devices.


18 December 2009

BlackBerry Email Outage at RIM Affected All Carriers

See http://bit.ly/7dxqfP.  Yesterday's outage affected all carriers.  Details when and if provided by RIM.

30 November 2009

Black Friday for eCommerce Hyperlinks at hp.com?

It was to be a busy weekend for HP.com, but perhaps it was too busy for some of its web content managers.  Over the weekend I received several HP small business promotions by email.  When I clicked through the links, selected "laptops" and then "ultra-portables," I got the dreaded "404."  So remarkable was this that I assumed it had to be a problem with me, not the website.  I tried again several times between Friday and Sunday evening -- to no avail.  None of the category groupings offered on the landing page linked from HP's push email led to anything but a 404.


If I'd known this was going to last an entire weekend and become "a story," I would have grabbed a screenshot and featured it on errorprocessing.com. But I had no such prescience, and as a result I have no hard evidence to back up this finding.

Instead, the only proven result is a statistically insignificant dip in sales converted from an HP small business push email -- one fewer ultra-portable purchased online.


03 November 2009

T-Mobile Outage Widespread


U.S. customers of T-Mobile reported problems with service nationwide, though not every customer was affected; according to the Company, only 5% of its (active?) users were.  Problems were reported from San Francisco to Long Island and Tennessee, according to CNET's Ina Fried. T-Mobile's troubles last month are on every analyst's mind, though it's unlikely there is any connection, unless it's another lapse in supplier service level.  T-Mobile's forum landing page posted the announcement below.  The screenshot was captured at 21:23 Eastern.




12 October 2009

Dependency Maze May Underlie Sidekick Data Loss


It's being called a black eye on the entire cloud computing initiative. Danger, now a Microsoft subsidiary and the original developer of the innovative Sidekick mobile device, is reporting a major server failure (via ZDNet's Adrian Kingsley-Hughes and numerous other sources over the weekend of 10 October) with what is feared to be a heavy loss of data, apparently affecting thousands of users. Whether it deserves the corruptus in extremis plaque is too early to tell, but the prognosis is not good, with T-Mobile advising users not to remove device batteries or allow Sidekicks to become fully discharged.  T-Mobile may have halted online sales of the device.


As an early adopter (though, thankfully, not a current user) of the cleverly designed Sidekick and the very usable back-end web resources provided by Danger, I can understand why customers would panic. The Danger back end meant that even with the early Sidekicks it was possible to kick free of the device's keyboard -- and the Sidekick has always had an excellent keyboard -- to use a full size computer with a full size browser to manage the address book, notes and calendar. Because these useful features appeared years ago, Sidekick users may be more likely than other mobile device users to rely upon "cloud" infrastructure.

I've since reluctantly migrated to Windows Mobile, with considerable loss of functionality on the back end. Recently -- in fact, just last week -- some of that functionality has slowly begun to return via the Microsoft MyPhone initiative. One wonders whether the same operations practices are in effect for the infrastructure behind MyPhone, or if Danger was left to fend for itself -- organizationally speaking.

What happened?  Reuters offered one cryptic clue from Microsoft:


Microsoft said in an emailed statement that the recovery process has been "incredibly complex" because it suffered a confluence of errors from a server failure that hurt its main and backup databases supporting Sidekick users.


"Confluence of errors" may be taken to mean a chain of interdependencies.  The cause may be organizational -- at least one blogger has suggested that Microsoft under-resourced Danger after the acquisition, and that founders are less motivated to ensure high service quality once they disappear into the org chart of a larger firm.  The data center's SAN contractor, Hitachi Data Systems is rumored be involved.


But it seems clear that technology approaches should also have been explored. To avoid or at least anticipate major catastrophes in the face of the complexity involved in architecting major data centers and complex real time messaging / synchronization applications, the responsible team had several possibilities that go beyond "just keep a backup." Simulations and DR emergency walkthroughs are part of the answer.



Another approach is being considered by the U.S. military.  The military has an urgent need to understand the connection between a particular mission and the IT assets it depends upon.  IT assets could go offline due to communications issues, break, or become compromised.  It's becoming less obvious which assets are critical for a particular mission. This is the subject of research being conducted by Applied Visions for the Air Force.*  A better understanding of dependencies was also needed for SCADA systems; that was the goal of the Idaho National Lab CIMS project.
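To make the dependency idea concrete, here is a minimal sketch in Python of a dependency map and a search for assets that two supposedly independent services both rely on. Every name in it (`DEPENDS_ON`, `sync-service`, `primary-db`, `backup-db`, `san`, `web-portal`) is hypothetical; it does not describe the actual architecture at Danger, nor the Air Force or CIMS research.

```python
from collections import deque

# Hypothetical dependency map: each key depends directly on the listed assets.
DEPENDS_ON = {
    "web-portal": ["sync-service"],
    "sync-service": ["primary-db"],
    "primary-db": ["san"],
    "backup-db": ["san"],
}


def transitive_dependencies(node, graph):
    """Return every asset that node depends on, directly or indirectly."""
    seen = set()
    queue = deque(graph.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph.get(current, []))
    return seen


def shared_dependencies(a, b, graph):
    """Assets that both a and b rely on: candidate single points of failure."""
    return transitive_dependencies(a, graph) & transitive_dependencies(b, graph)


# A backup is only as independent as the assets beneath it.
print(shared_dependencies("primary-db", "backup-db", DEPENDS_ON))
```

In this made-up map the main and backup databases share one storage asset, so a single failure there would hurt both at once. A real inventory would be far larger, which is exactly why it helps to compute the answer rather than assume it.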


Update (13 OCT 09): T-Mobile is reporting some progress in restoring customer data.
Update (16 OCT 09): Microsoft is reporting more progress, but no explanation to speak of.



* I am an employee of Applied Visions, but not currently working on the project. Opinions expressed are mine alone and not the official views of the Company.




24 September 2009

Google Outage: The "Fail Whale" Fails to Amuse

Many IT managers are anxiously monitoring the landscape to read whether the time is right to move applications to The Cloud.

This May 2009 report of a short Google outage by ZDNet's Larry Dignan is not the sort of message that Google, Microsoft's Azure team, or Salesforce.com would like to distribute. As others have pointed out, it's not that internal data centers are immune to outages. On-premises outages, despite claims of 99.999% uptime, can be just as difficult to correct and just as pervasive in their effects on the enterprise.
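For a sense of scale, an uptime claim translates directly into a yearly downtime budget. The short Python sketch below does the arithmetic; the function name `allowed_downtime_minutes` is my own, and the calculation assumes a 365-day year with no allowance for scheduled maintenance.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes per year a service may be down and still meet the uptime figure."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


for pct in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At "five nines" the budget is only a little over five minutes a year, so even a short outage can consume it.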

This is not an issue of cloud vs. on-premises IT infrastructure. Rather the issue is one of perception and how outsourced cloud outages are handled.

Google apparently put up its "fail whale" (see the ZDNet post for a screenshot), but didn't post anything to Twitter until after the problem, from its point of view, was "solved." Cloud outages can take down every application, which arguably is less likely to happen with some types of in-house outages. The loss of control and lack of information deepen the shroud of mystery that accompanies failures in these normally high reliability systems. One sees this at the airport all the time. When passengers are kept regularly informed, they may not be happy, but they are less unhappy than when they are kept in the dark and left to trade rumors.
