16 June 2010

Twitter Outages: Network Simulation, Anyone?

Whether because of increased use during the World Cup or for other reasons, Twitter is having service level management problems, and possibly others. In a June 11 post acknowledging the problems and blaming them on a "perfect storm" of issues, Twitter's engineers attributed the outages to capacity planning and errors in configuration management. While some, such as Forrester's Gualtieri, quoted by Computerworld, accept this level of transparency, it's minimal at best. My recent TechRepublic post on capacity planning for backup has some suggestions that Twitter might want to consider; one guesses that the methods currently in use at the firm haven't involved careful simulation or other prudent measures.
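For readers wondering what even a crude simulation looks like, here is a minimal sketch in Python. It is not Twitter's method or anyone's production model; the function name, the arrival model (a binomial draw standing in for random request arrivals), and every number in it are assumptions made up for illustration. Each tick, a random batch of requests arrives, the service clears at most `capacity` of them, and the remainder carries over as backlog.

```python
import random


def simulate_queue(arrival_rate, capacity, ticks, seed=0):
    """Toy discrete-time load simulation (hypothetical model).

    arrival_rate: average number of requests arriving per tick (int)
    capacity:     maximum requests the service can clear per tick
    ticks:        number of time steps to simulate
    Returns (final_backlog, peak_backlog).
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    backlog = 0
    peak_backlog = 0
    for _ in range(ticks):
        # Assumed arrival model: 2 * arrival_rate coin flips, so the
        # mean is arrival_rate and the maximum is 2 * arrival_rate.
        arrivals = sum(1 for _ in range(2 * arrival_rate) if rng.random() < 0.5)
        # Serve what we can; anything left over waits for the next tick.
        backlog = max(0, backlog + arrivals - capacity)
        peak_backlog = max(peak_backlog, backlog)
    return backlog, peak_backlog


if __name__ == "__main__":
    # Hypothetical numbers: an ordinary day versus a World Cup-style surge.
    print("normal load:", simulate_queue(arrival_rate=80, capacity=100, ticks=1000))
    print("surge load: ", simulate_queue(arrival_rate=130, capacity=100, ticks=1000))
```

Even a toy like this makes the basic point: once average arrivals exceed capacity, the backlog does not level off, it keeps growing for as long as the surge lasts.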

07 June 2010

iPhone Contract Perk May Have Sunk AT&T Servers

In the days or months before today's iPhone 4 announcement, Apple and AT&T very likely held a number of planning sessions. The result was an agreement by AT&T to offer a number of iPhone customers currently under contract an option to upgrade to iPhone 4 this month. The unsurprising result was a peak in traffic to the AT&T iPhone account management page. The surprising result was that, despite the advance planning, the AT&T site may have been unprepared for the traffic.

According to a Computerworld post by Gregg Keizer, the site was either taken offline intentionally or, probably more likely, could not handle the volume.

01 June 2010

Configuration Mgmt Failure Caused Military GPS Outage

It's been said more times than this glitch reporter could count, but a net-centric military must make certain assumptions about which services are "always-on." GPS is one of those. But apparently, according to an AP report, "as many as 10,000 U.S. military GPS receivers were rendered useless for days. . ."

The problem was blamed on "incompatible software." According to the report, an Air Force defense contractor installed software in certain Trimble Navigation receivers that was incompatible with other elements of the system -- a ground control system that received an update in January 2010. The update was part of a new generation of GPS satellites ("Block IIF").
A more IT-savvy writer might have referred to this as a configuration management failure, but at least the AP kept after the Air Force to provide a narrative for the problem. 

The AP story concludes with a discussion of cybersecurity risks to the GPS software. While the discussion covers jamming and straightforward outages, the risk of insider threats is not fully explored. (Note: The Trimble "recon" handheld shown is illustrative of the company's products -- not necessarily the one involved in this glitch report.)