A History of Failure
Ancient Greece
- More than 2000 years ago
- Device - position of the stars, sun, planets, and moon
- First computer, but also first software collaboration
- Modification of device after creation
  - Bugfixes
  - Feature Creep
- Plundered by Romans
- Sank, recovered in 1901.
- X-ray tomography, 2000 Greek characters on the outside
- (Funny EULA)
Modern Times
- First bug: 1947. A Real Insect
- 1983: Therac-25 Radiation Treatment Machine
  - PDP-11
  - Errors are caused by alpha particles and EM noise
  - Picks the wrong mode 1 in 250M times, massive radiation overdose
  - No hardware interlocks, software controlled
  - Picked wrong mode 6 times in 3 years.
- Overcorrection killed a rocket because of absolute velocity vs. smoothed velocity
- Self-destruct buttons
"It’s possible to make mistakes so large they invalidate your entire worth as a human being"
- Australian = $40,000/year, over lifespan of 80 years, $3.17M
- Metric = lifetime effort lost
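The metric is plain division. A minimal sketch, taking the $3.17M lifetime figure and the 80-year lifespan from the notes as given. The function names are made up for illustration, and the talk's own totals may include rounding or currency conversion not recorded here, so they will not necessarily match.

```python
# Figures taken from the notes above; everything else is illustrative.
LIFETIME_EARNINGS = 3_170_000  # dollars earned over one lifetime
LIFESPAN_YEARS = 80


def lifetimes_lost(cost_dollars: float) -> float:
    """Express a dollar cost as a number of whole-life earnings."""
    return cost_dollars / LIFETIME_EARNINGS


def years_lost(cost_dollars: float) -> float:
    """Express a dollar cost as years of human effort."""
    return lifetimes_lost(cost_dollars) * LIFESPAN_YEARS


if __name__ == "__main__":
    cost = 30_000_000
    print(f"${cost:,} = {lifetimes_lost(cost):.1f} lifetimes, "
          f"{years_lost(cost):.0f} years")
```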
Bug 1: AT&T 1990
- A switch fails and tells its neighbors; they remove it from the routing table while the bad switch spends 6 seconds trying to fix itself.
- Coming back up, it would do a 3-way handshake with its peers to add them back.
- After the change: still sends the fault, still self-fixes, then just makes an outgoing call to the other switches.
- Bug: 1st switch makes the call while the 2nd switch is updating its routing table, crashes everyone!
- 75M calls were lost
- Lost revenue = $60M, 2300 years of productivity lost.
1996: Tiwai Point
- Aluminum smelter, computer controlled
- Comalco Australia programmed them
- 2 hours ahead of AUS
- Leap year, computers couldn’t take day 366.
- All computers crash @ midnight.
- 2 hours pass, same problem happens in AUS
- Cells melted, had to be replaced.
- Unknown cost.
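The notes only say the computers couldn't take day 366; the smelter code itself isn't shown. A minimal sketch of how that kind of bug can look, with hypothetical function names: one check hard-codes 365 days, the other asks the calendar how long the year is.

```python
import calendar
from datetime import date


def day_of_year_naive(d: date) -> int:
    """Hypothetical buggy check: assumes every year has 365 days."""
    day = d.timetuple().tm_yday
    if not 1 <= day <= 365:
        raise ValueError(f"invalid day of year: {day}")
    return day


def day_of_year_checked(d: date) -> int:
    """Same check, but the upper bound depends on the year."""
    days_in_year = 366 if calendar.isleap(d.year) else 365
    day = d.timetuple().tm_yday
    if not 1 <= day <= days_in_year:
        raise ValueError(f"invalid day of year: {day}")
    return day


if __name__ == "__main__":
    last_day = date(1996, 12, 31)  # 1996 was a leap year
    print(day_of_year_checked(last_day))
    try:
        day_of_year_naive(last_day)
    except ValueError as err:
        print("naive check failed:", err)
```

A test that runs the code against 31 December of a leap year would catch this long before midnight does.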
Space vehicles
1996: Ariane 5
- Developed bug 37 seconds after launch
- Veered off course dramatically
- 64-bit FP to measure launch position
- Casting to 16-bit int
- No Exception Handling!
- Overflow, negative! Rocket turned around!
- Reused code from Ariane 4, which could only reach 1/2 the horizontal speed
- Testing? The bug showed up perfectly!!!
- The bug showed up afterwards in simulation
- $370M lost!
- 150 lifetimes, 12,000 years
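The float-to-int problem is easy to reproduce. The sketch below is not the flight software; it only illustrates, in Python with made-up function names, what happens when a 64-bit float is narrowed to 16 bits without a range check, and how a checked conversion turns the overflow into an error that can be handled.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(value: float) -> int:
    """Truncate and keep only the low 16 bits, like an unchecked narrowing."""
    raw = int(value) & 0xFFFF
    # Reinterpret the 16-bit pattern as a signed two's-complement number.
    return raw - 0x10000 if raw >= 0x8000 else raw


def to_int16_checked(value: float) -> int:
    """Refuse values that do not fit instead of silently wrapping."""
    n = int(value)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit integer")
    return n


if __name__ == "__main__":
    reading = 40000.0  # fits in a float, too big for 16 bits
    print(to_int16_unchecked(reading))  # wraps around to a negative number
    try:
        to_int16_checked(reading)
    except OverflowError as err:
        print("caught:", err)
```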
1998: Mars Climate Orbiter
- Plummeted through the atmosphere
- Part of the code in imperial, some in metric
- Pound force, newtons :P
- Testing budget was cut before launch
- Mars Lander failed as well
  - Thrusters stopped working
  - Landing gear started vibrating, thought it was on the ground
- 8300 years of time lost
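One way to keep pound-force and newtons from mixing is to never pass bare numbers around. A minimal sketch, assuming a hypothetical `Force` type that stores newtons internally and forces the caller to name the unit on the way in; this is an illustration, not how the orbiter's software was structured.

```python
from dataclasses import dataclass

NEWTONS_PER_POUND_FORCE = 4.4482216152605  # exact by definition of the lbf


@dataclass(frozen=True)
class Force:
    """A force stored in newtons; construct it through a named unit."""
    newtons: float

    @classmethod
    def from_newtons(cls, value: float) -> "Force":
        return cls(value)

    @classmethod
    def from_pound_force(cls, value: float) -> "Force":
        return cls(value * NEWTONS_PER_POUND_FORCE)


if __name__ == "__main__":
    thrust = Force.from_pound_force(10.0)
    print(f"{thrust.newtons:.2f} N")
    # Treating 10 lbf as 10 N would understate the force by a factor of ~4.45.
```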
Deep Space 2: Hit Mars
- 644+ KM/h
- Sat in storage
- Launched it, and it hit Mars
- Battery was dead!
- $30M, 10 lifetimes
2003: North American Blackouts
- 50M people
- 2.38 x AUS, 1/6 of USA
- Who’s to blame?
  - El Nino
  - Canada blames New York, but it was a sunny day
  - Canada blames a nuclear power plant in Pennsylvania
  - New York blames Canada
- Europe was saying USA had 3rd world electric grid
- 6 weeks later, Europe had a big blackout of its own
- First Energy in Ohio
  - 14:14 Alarm system fails SILENTLY
  - Display said everything was fine
  - Remained in that state for 27 minutes, crashed
  - Hot spare failed silent after 13 minutes
  - 345kV line goes down, alarm system isn’t working
  - Automatic re-route, other lines pick up the load
  - 2 more lines went down, no one knows
- 11 more lines go down says MISO
- MISO calls First Energy to notify, then their own power went out
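The silent alarm failure is the classic case for a heartbeat: no news should count as bad news. A minimal sketch with a hypothetical `HeartbeatMonitor` class, not anything from the actual control room software. The alarm process calls `beat()` regularly, and the display treats a stale or missing heartbeat as a fault instead of showing "everything is fine".

```python
import time
from typing import Optional


class HeartbeatMonitor:
    """Reports unhealthy when no heartbeat has arrived within timeout_s."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._last_beat: Optional[float] = None

    def beat(self, now: Optional[float] = None) -> None:
        """Record that the monitored process is still alive."""
        self._last_beat = time.monotonic() if now is None else now

    def is_healthy(self, now: Optional[float] = None) -> bool:
        """Silence counts as failure: never heard from means unhealthy."""
        if self._last_beat is None:
            return False
        current = time.monotonic() if now is None else now
        return (current - self._last_beat) <= self.timeout_s


if __name__ == "__main__":
    alarms = HeartbeatMonitor(timeout_s=30.0)
    alarms.beat(now=0.0)
    print(alarms.is_healthy(now=10.0))   # recent heartbeat
    print(alarms.is_healthy(now=100.0))  # heartbeat is stale
```

The `now` parameter exists only so the check can be tested without waiting on a real clock.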
Take away
- Race conditions
- Test
- Deploy in New Zealand First
1998: Auckland blackouts
- LOTR: Where the orcs come from
- 5 weeks without power
- 150MW of load, 110MW of rated cable :P
- 4 cables, 1 failed.
- Bad press recently, so no announcement
- 150MW of power on 85MW of cable
- Cable 2 fails -> 150MW of power on 50MW of cable
- Management willpower vs. physics. Classic.
- Blamed it on El Nino
- Actually a lack of sysadmins, engineers knew cables were overloaded
- 1980: "We should replace the cables, guys"
- Cost: $150M to Mercury power, unknown to business
- Economic gain to Wellington: Priceless
Sysadmins
- Hard to get people to listen to you, doomsayers
- Disk failure? We need RAID
- Power? UPS
- Listen to the sysadmins
FROM github.com/hank/life/blob/master/oscon/2008/sessions/History.of.Failure.rdoc

 