Monday, April 6, 2009

A History of Failure

A History of Failure

Ancient Greece

  • More than 2000 years ago
  • Device - position of the stars, sun, planets, and moon
  • First computer, but also first software collaboration
  • Modification of device after created
    • Bugfixes
    • Feature Creep
  • Plundered by Romans
  • Sank, recovered in 1901.
  • X-ray tomography, 2000 greek characters on the outside
  • (Funny EULA)

Modern Times

  • First bug: 1947. A Real Insect
  • 1983: Therac-25 Radiation Treatment Machine
    • PDP-11
    • Errors are caused by alpha particles and EM noise
    • Picks the wrong mode 1 in 250M times, massive radiation overdose
    • No hardware interlocks, software controlled
    • Picked wrong mode 6 times in 3 years.
  • Overcorrection killed a rocket because of absolute velocity vs. smoothed velocity
  • Self-destruct buttons

"It’s possible to make mistakes so large they invalidate your entire worth as a human being"

  • Australian = $40,000/year, over lifespan of 80 years, $3.17M
  • Metric = lifetime effort lost

Bug 1: AT&T 1990

  • Switches fail, tell its neighbors, they remove it from the routing table, bad switch spends 6 seconds trying to fix itself.
  • Coming back up, it would 3way handshake with peers to add them back.
  • Changed, still send fault, still self-fix, then just makes an outgoing call to the other switches.
  • Bug: 1st switch made the call, 2nd switch updating routing table, crashes everyone!
  • 75M calls were lost
  • Lost revenue = $60M, 2300 years of productivity lost.

1996: Tiwai Point

  • Aluminum smelter, computer controlled
  • Comalco Australia programmed them
  • 2 hours behind AUS
  • Leap year, computers couldn’t take day 366.
  • All computers crash @ midnight.
  • 2 hours pass, same problem happens in AUS
  • Cells melted, had to be replaced.
  • Unknown cost.


http://www.roundtripsolutions.com/blog/wp-content/uploads/2007/07/OutlookErrorMessageManagedMAPIServiceCat_72E7/outlookmapierror.jpg

Space vehicles

1996: Ariane 5

  • Developed bug 37 seconds after launch
  • Veered off course dramatically
  • 64-bit FP to measure launch position
  • Casting to 16-bit int
  • No Exception Handling!
  • Overflow, negative! Rocket turned around!
  • Reused code from Ariane 4, could only move 1/2 the horizontal speed
  • Testing? The bug showed up perfectly!!!
  • The bug showed up afterwards in simulation
  • $370M lost!
  • 150 lifetimes, 12,000 years

1998: Mars Climate Orbiter

  • Plummeted through the atmosphere
  • Part of the code in imperial, some in metric
  • Pound force, newtons :P
  • Testing budget was cut before launch
  • Mars Lander failed as well
    • Thrusters stopped working
    • Landing gear started vibrating, thought it was on the ground
  • 8300 years of time lost


Deeps Space 2: Hit Mars

  • 644+ KM/h
  • Sat in storage
  • Launched it, and it hit mars
  • Battery was dead!
  • $30M, 10 lifetimes

http://www.leonardo-energy.org/drupal-5.6/modules/tinymce/tinymce/jscripts/tiny_mce/plugins/imagemanager/images/blackout_USA.jpghttp://www.nasa.gov/images/content/318247main_march_89_powergrid_500.jpg

2003: North American Blackouts

  • 50M people
  • 2.38 x AUS, 1/6 of USA
  • Who’s to blame?
    • El Nino
    • Canada blames New York, but was a sunny day
    • Canada blames a nuclear power plant in Pennsylvania
    • New York blames Canada
  • Europe was saying USA had 3rd world electric grid
  • 6 weeks later, there was a big blackout
  • First Energy in Ohio
    • 14:14 Alarm system fails SILENTLY
    • Display said everything was fine
    • Remained in that state for 27 minutes, crashed
    • Hot spare failed silent after 13 minutes
    • 345kV line goes down, alarm system isn’t working
    • Automatic re-route, other lines pick up the load
    • 2 more lines went down, no one knows
  • 11 more lines go down says MISO
  • MISO calls First Energy to notify, then their own power went out

http://www.concurringopinions.com/archives/images/facebook-cartoon.jpg

Take away

  • Race conditions
  • Test
  • Deploy in New Zealand First

The image “http://tbn0.google.com/images?q=tbn:bNUwqyIyebsG9M:http://news.bbc.co.uk/olmedia/65000/images/_67614_Auckland_power_300.jpg” cannot be displayed, because it contains errors.http://media.nzherald.co.nz/webcontent/image/jpg/cut_purcell_body.jpghttp://images.tvnz.co.nz/tvnz_images/news2009/energy/power_cut_closed_2.jpg

1998: Auckland blackouts

  • LOTR: Where the orcs come from
  • 5 weeks without power
  • 150MWatts of load, 110MW of rated cable :P
  • 4 cables, 1 failed.
  • Bad press recently, so no announcement
  • 150MW of power on 85MW of cable
  • Cable 2 fails -> 150MW of power on 50MW of cable
  • Management willpower vs. physics. Classic.
  • Blamed it on El Nino
  • Actually a lack of sysadmins, engineers knew cables were overloaded
  • 1980: "We should replace the cables, guys"
  • Cost: $150M to Mercury power, unknown to business
  • Economic gain to Wellington: Priceless

http://twm.co.nz/elecmap1.gif



http://www.bcsolution.com/recovery/images/driveMechanical.jpg


Sysadmins

  • Hard to get people to listen to you, doomsayers
  • Disk failure, we need raid
  • Power? UPS
  • Listen to the sysadmins

FROM github.com/hank/life/blob/master/oscon/2008/sessions/History.of.Failure.rdoc

http://palestinethinktank.com/wp-content/uploads/2009/01/coll-damage-cartoon.jpg
http://www.seattlepi.com/dayart/20090130/cartoon20090130.jpg
StumbleUpon PLEASE give it a thumbs up Stumble It!

0 Comments:

Post a Comment

<< Home