RISKS@KL.SRI.COM (RISKS FORUM, Peter G. Neumann -- Coordinator) (07/06/88)
RISKS-LIST: RISKS-FORUM Digest  Tuesday 5 July 1988  Volume 7 : Issue 15

FORUM ON RISKS TO THE PUBLIC IN COMPUTERS AND RELATED SYSTEMS
ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Contents:
  "The target is destroyed." (Iranian Airbus) (Hugh Miller)
  Clarifications on the A320 Design (Nancy Leveson)
  Virus aimed at EDS gets NASA instead (Dave Curry)

The RISKS Forum is moderated. Contributions should be relevant, sound, in good taste, objective, coherent, concise, and nonrepetitious. Diversity is welcome. CONTRIBUTIONS to RISKS@CSL.SRI.COM, with relevant, substantive "Subject:" line (otherwise they may be ignored). REQUESTS to RISKS-Request@CSL.SRI.COM. For Vol i issue j / ftp kl.sri.com / get stripe:<risks>risks-i.j ... . Volume summaries in (i, max j) = (1,46),(2,57),(3,92),(4,97),(5,85),(6,95).

----------------------------------------------------------------------

Date: Mon, 04 Jul 88 11:15:52 EDT
From: Hugh Miller <HUGH%UTORONTO.BITNET@CORNELLC.CCS.CORNELL.EDU>
Subject: "The target is destroyed."

There will be a board of inquiry, and congressional committees, and endless hearings, and in the end "operator mistakes" or "human error" will be found to have been the cause of the downing of the Iranian passenger airliner. A system as big and expensive as the Aegis cannot be allowed to fail, as the scandal about its testing revealed. The "failures" will be explained away.

The irony of this situation is, of course, that "human error" gets blamed for the disasters, to take the heat off the (more-infallible-than-the-pope) technology/-ists; but, in a skewed and entirely unfunny way the blame is well-placed. I imagine that every commander sitting in one of those high-tech bathtubs in the Gulf took one look at poor Captain Brindel of the Stark -- who took the fall and the forced retirement and the reduced pension when all along it was his buggy EW gear at fault, not he or his crew -- and said to himself, "OK, the writing's on the wall.
If anything hits me and holes the ship & kills American sailors the pols will throw me to the sharks rather than admit that some whiz-bang from Rockwell or Unisys or whoever didn't do its zillion dollar job. So from now on it's hair-trigger 24 hours a day, and since I can't be sure my BOZO QZ999 Battlesys can knock down a missile once it's fired my only recourse is to knock the launchers down before they fire. They're bigger & slower & better targets anyway. Shoot first and ask questions later. The hell if I'm gonna be the next one to lose his Florida retirement condo to keep Marconi's rep clean."

I can't find it in my heart to blame the man, either. Who wants to be the fall guy for a gigabuck defense contractor and a desperate, freebooting White House in an election year?

So along comes a jumbo jet, 25,000 feet, 430 mph, radar cross-section size of a football field. Software library in the EW battle computers says it's an F-14, kind that dinged the Stark. Hell, we ought to know our own aircrafts' profiles, right? You may fire when ready, Gridley.

So, in a way, it _was_ a "human" "error". A human, all-too-human error.

(Lest I be accused of imputing less-than-noble motives to valiant, dedicated career military men from my safe armchair half a world away, I advance for your consideration a hypothesis similar to the one Henry Spencer proposed, here or over in ARMS-D, I forget, last year. Never attribute to malice what can be accounted for by plain stupidity, he urged; I concur. Similarly, I would propose that one should not impute noble motives where base ones will do the trick, human nature being what it is. Remember, we live in a society where the most revered moral maxim is "Look out for Number One," followed closely by "Cover your ass" and "Nice guys finish last." Naivete, especially willful naivete, is not a necessary condition of lucidity in thinking about techno-political matters. And, today, all technological matters are political and vice versa.)

Which brings me, Mr. Speaker, to my questions. They are not original, but since they have never been answered and still seem crucial I will put them again.

1: Surely one of the greatest risks at which we are held by technology is the result of its completely self-contained, hermetic world-view? In that view, all nature, human and otherwise, is a machine, science the search for the control levers, and technology the pulling thereof. We are guided in how we want to pull them by our "values," whatever they are. (What are they, by the way, in the Persian Gulf? I suppose the usual "national interest," whatever that is.) Just as we don't stop using our PC because the disk drive breaks -- Quick! Get the drive repaired! -- technological thinking will not countenance any stint of its advance. Greenhouse effect? Engineer hardier crops. (I heard _that_ one at the Toronto atmospheric conference the other day.) 290 dead civilian airline passengers, including 66 children? Patch the software library, piece of cake. Six months from now, when we shoot down a couple of our own fighters returning from a sortie because of an unforeseen bug in the Flight 655 patch, we patch the patch, no problem. Nothing fazes the true believer, except the suggestion that we deep-six his favorite toy.

2: Technology creates risks, vigorously. Since politics and technology are today joined in unholy matrimony, technology creates political risks. Techno-freaks don't like to think about this, preferring to pretend to be "apolitical" so no one will disturb their tidy world, but it is no less true for all that. Since World War II especially the military has been the hot spot for the interpenetration of politics and technology. Only the can-do technoptimism of the postwar armed forces can explain such patently politically imprudent and hazardous ventures as the Persian Gulf filibuster or the "600 Ship Navy"'s aggressive forward basing strategy, to say nothing of CBW or, haha, the Strategic Defense Initiative.
The bigger and more optimistic the technology, the bigger the risks. Soviet SSBN's off Virginia and Pershing II's in Europe? An 8-minute LUA decision window. This is not, _pace_ Charles Perrow, a matter of "normal accidents" arising out of the increasing complexity of systems. It is a _positive drive_ coeval with technology itself, to which the complexification problem is only an adjunct. As Oppenheimer once told us, "If the experiment is sweet, you have to run it." If there's no going back, are we prepared for the increasingly wild lurches politics will take in the nearer and nearer future?

Hugh Miller, University of Toronto, (416)536-4441

------------------------------

Date: Sun, 03 Jul 88 19:00:12 -0700
Subject: Clarifications on the A320 Design
From: Nancy Leveson <nancy@commerce.ICS.UCI.EDU>

There has been a great deal of speculation about the design of the A320 computing system in Risks. I would like to clear up a few things. First, the A320 is not designed like the F-16 or Boeing 757/767.

The information below is taken from a paper by J.C. Rouquet and P.J. Traverse entitled "Safe and Reliable Computing on Board the Airbus and ATR Aircraft," which was published in the Proceedings of Safecomp '86 (Safety of Computer Control Systems 1986), edited by W.J. Quirk, published by Pergamon Press, and copyrighted in 1986 by IFAC (International Federation of Automatic Control). The authors work for Aerospatiale, the firm responsible for designing most of the computing systems on board Airbus aircraft.

I quote from some of the relevant parts; sorry I could not include the figures. Instead of trying to correct the English in the paper (which was obviously not edited before printing), I have left it as is for accuracy; I have tried to proofread this typewritten message carefully to make sure that I have not introduced errors. I apologize if I have.

...

"The A320 is the first civil aircraft designed as fly-by-wire.
It is also the only aircraft with such an ultra high reliability requirement (failure rate of 10^-9/H) for a computing system."

...

"Safety-only requirements can be specified as not computerized back-up exits. Let us take the roll control of the AIRBUS 300-600 and 310 as an example. The aircraft is controlled on the roll axis using a pair of ailerons and 5 pairs of spoilers. The aircraft is controlled either in manual mode, or in an automatic one. The automatic flight control system is composed of two "Flight Control Computer". Only one of them is active at a time, the other one being a spare. If both computers are lost, the aircraft is manually controlled. Therefore, the loss of the automatic flight control system is not dangerous for the aircraft (except during a short period of an automatic landing in bad weather conditions)."

"Computers are involved in the manual control mode. Two "Electrical Flight Control Unit" are used to control the spoilers. If both computers are lost, the pilot can still control the aircraft using the ailerons, with a reduced authority, as the spoilers are no more available."

"The basic building block is a duplex computer. Each high safety computer is composed of two computation channels. Each channel is monitoring the other. If one channel fails, the other one shuts the whole computer down in a safe way. This scheme can be impaired by a latent error of the monitoring. Therefore, a self-test program is run each time the computer is powered up."

"Other precautions are that each channel contains watchdog, that exception testing and acceptance testing on the channels output are done, and that if a computing channel contains two processors, they partially cross-check themselves."

"Input to the computer are also tested prior to use. It has to be noted that safety is not affected, even if a Byzantine general strikes.
Indeed, if a sensor sends different information to the two computation channels the consequence is only the shut-down of the computer."

"As a basic precaution, computers are shielded and loosely synchronised. Most of the transient are coming through power supply. Therefore, the power supply is filtered, but also monitored. A power loss is thus detected, a few data stored in a protected memory. If the power loss is sufficiently short, a "hot start" is possible. Else, the computer can anyway reset itself, and restart."

"Equipment on board are divided into two subsets. These two subsets are often referred the 2 sides of the aircraft. Typically, one side is controlled by the pilot, the other by the copilot."

"The main characteristics are as follow:
  -- only one side is needed to flight [sic] without almost any limitation
  -- an error cannot propagate from one side to the other
  -- a fault cannot be common to both sides
The main task is to verify that the two sides are sufficiently segregated to limit error propagation and common point failure."

"High quality software is obtained using a quality plan, as defined in (DO178A). [I discuss this standard, and my qualms about it and the n-version programming used to justify the high reliability numbers, in a previous Risks. It was reprinted in SEN. NGL] This document is agreed as a basis for certification. Its recommendations are used during the software design phase, and each time software must be recertified because of modifications."

"This (and any) rigorous design and testing methodology is not acknowledged as a fault free software warrant."

"Therefore, each computers uses two different programs, one in each channel. The dissymetry (diversity) between the two programs is obtained using:
  -- different software design teams,
  -- different languages,
and, depending on the computer, different algorithms, functions, etc.."

...

"A major concern is about maintenance faults.
Their effects are limited, thanks to the power-up self tests, and the computer aided maintenance system. The pilots are generating input to the computers, and it is recognized that in most of the accidents, at one time, a pilot takes a wrong option. This error may not be an important one, and it has to be noted that in these cases, the pilot is generally in very bad working conditions. Anyway, pilot errors are a major concern. Two ways to cope with these are:
  -- a computer aided decision (in case of emergency) system (Airbus 300-600, 310, 320)
  -- a limitation of the authority of the pilot (A320)."

"The first system called Electronic Centralized Aircraft Monitoring displays adequate procedures on a screen.... More, on the A320, the pilot cannot drive the aircraft outside the flight envelope. [Wasn't this what the manufacturers claimed happened in the recent accident? NGL] Ziegler and Durandeau (1984) have discussed further this point." [Citation to a paper entitled "Flight control system on model civil Aircraft," Proc. of the International council of the Aeronautical Sciences (ICAS '84), Toulouse, France, Sept. 1984.]

"The flight control of the A320 is ensured by a computing system, with a limited mechanical backup. The design objective is for the computing system to be sufficiently reliable in order not to use the mechanical back-up. This back-up is on board to ease the certification of the aircraft and to help people to be confident in the aircraft."

"The safety requirement still exists for the computers. Therefore, they are built following the rules defined above (two channels computer, different programs, high segregation ...)."

"Reliability is gained using five computers. All of them participate in the control of the aircraft on the roll axis, four of them on the pitch axis. At each time, each surface is controlled by one computer, the other being hot back-up. Two types of computers are used.
One is called 'ELAC' (Elevator and Aileron Computer) is manufactured by THOMSON-CSF, around microprocessors of the 68000 type. The other is built by SFENA and AEROSPATIALE with 80186 type processors, and is called 'SEC' (Spoiler and Elevator Computer)."

"Each computer has two different programs. Therefore, two types of computer have been designed and four programs. The repartition of the computers is shown on Table 1."

                PITCH    ROLL    'SIDE'
    ELAC 1        x        x        1
    ELAC 2        x        x        2
    SEC 1         x        x        1
    SEC 2         x        x        2
    SEC 3         -        x        2

    TABLE 1 - Repartition of the computers

"With this architecture, the aircraft can tolerate:
  -- multiple hardware failures
  -- a complete loss of one 'side' of the aircraft
  -- at least a software error, or a hardware design error, event if it shuts down one type of computer
  -- combinations of the above."

"More details about the Electronic Flight Control System of the A320 can be found in a paper by Ziegler and Durandeau (1984)." [Cited above]

"When an equipment fails, two problems appear. First, to find it, and second to replace it. On the A320, the localization is done at two levels: each equipment records failure indication, and at the aircraft level, a computer collects all the information and correlates them. It is thus possible to have at a terminal the list of all the failed computers, through the 'Centralized Fault Display System.'"

"Spare parts are not available in all the airports. In order not to ground an aircraft, it is needed to have spares on board, or for the aircraft to be able to take off with failed equipment. The second way is taken and computing systems are generally designed to reach their the [sic] reliability requirements, even if one of the computer is down. For example electrical flight control system of the A320 is composed of five computers, but, provided some limitations of the flight envelope, it will be allowed, and safe, to take-off with only four computers up."

...

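As a rough illustration of the tolerance claims quoted with Table 1 above, the repartition can be written down as data and checked mechanically. This sketch is not from the paper: the Computer record, the axes_covered and survives helpers, and above all the criterion that an axis counts as controllable while at least one computer listed for it is still up are assumptions made for the example, and say nothing about the actual certification argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Computer:
    """One row of Table 1 (record layout is hypothetical)."""
    name: str
    kind: str    # "ELAC" or "SEC"
    pitch: bool  # participates in pitch control
    roll: bool   # participates in roll control
    side: int    # which 'side' of the aircraft it belongs to


# Transcribed from Table 1 of the quoted paper.
COMPUTERS = [
    Computer("ELAC 1", "ELAC", True, True, 1),
    Computer("ELAC 2", "ELAC", True, True, 2),
    Computer("SEC 1", "SEC", True, True, 1),
    Computer("SEC 2", "SEC", True, True, 2),
    Computer("SEC 3", "SEC", False, True, 2),
]


def axes_covered(working):
    """Report, per axis, whether any working computer serves it."""
    return {
        "pitch": any(c.pitch for c in working),
        "roll": any(c.roll for c in working),
    }


def survives(lost):
    """True if both axes stay covered after removing computers matching `lost`.

    Assumption: one remaining computer per axis is enough.
    """
    working = [c for c in COMPUTERS if not lost(c)]
    covered = axes_covered(working)
    return covered["pitch"] and covered["roll"]


# Failure scenarios named in the quoted list of tolerated faults.
SCENARIOS = {
    "loss of side 1": lambda c: c.side == 1,
    "loss of side 2": lambda c: c.side == 2,
    "all ELACs shut down": lambda c: c.kind == "ELAC",
    "all SECs shut down": lambda c: c.kind == "SEC",
}

for label, lost in SCENARIOS.items():
    print(f"{label}: both axes still covered = {survives(lost)}")
```

Under that simplified criterion, each single-side loss and each single-type shutdown leaves both axes with at least one computer, which is consistent with the quoted claims; the check is only as good as the transcription of Table 1 and does not address the common-mode or n-version concerns raised elsewhere in this message.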
"The demonstration [of the assigned reliability] is based on ground or flight test (to measure the effect of a failure), on probability numbers, and on a software design quality plan in accordance with (DO 178A). No number is assigned to a software, even if measures have been done by Troy and Baluteau (1985)." [Reference to a paper in the proceedings of FTCS-15, June 1985, pp. 438-443].

"The 'Zonal Safety Assessment Document' analyzses the effect of such things as an engine burst, waste liquids, ..."

"We rely also on ground and flight tests. For example, the equipment of the A320 are tested on ground in an 'iron bird' for one year prior to the first flight. During the year between the first flight and the certification, both ground and flight tests will be performed. The ground tests include power supply transients, electrical environment hazard."

[There is a section on accrued experience]

...

"Our record is satisfactory. No aircraft crashed, and even came close to this situation. Design errors have been found in operation, both in computer specification, and in programs. We plan to examine all of them, but at first glance, none of them is dangerous. The use of design diversity is successful, as no error has been found in both versions of a software."

NGL: The argument in Britain about the A320, led by Mike Hennell and Bev Littlewood, has focused on the lack of proof of the claims by the manufacturers of the Airbus A320 that they have the safest plane flying because of the ultra-high reliability of the computer systems. Despite the claim of 10^-9/H, there was an incident where the A320 computers all failed in test. The manufacturers explain this as a "teething problem" that will disappear after test. They also stress that the test pilot was able to safely land the plane on the back-up system.

I have been concerned about claims that the use of n-version programming (aka "design diversity") will provide ultra-high reliability.
John Knight and I have written several papers describing experiments with this technique. Tim Shimeall and I have also just completed another experiment that compares n-version programming and more traditional reliability techniques. I will send anyone copies of these papers upon request.

Let me add to the voices suggesting that we wait for the final data before judging the latest accident. In almost all accidents, the manufacturers of the equipment involved immediately claim that it was a result of operator error (for very good reasons which are usually of a liability and monetary nature). It did not seem like the pieces had all stopped smoking before the cause of the A320 accident was announced. Unfortunately, with the immense amount of money involved, it is not clear that the truth will ever be known. If it truly is a technical problem, it may require multiple accidents and deaths before this is admitted.

This, by the way, is what happened with the Therac 25. It now appears that some early accidents involving the Therac 25 were the result of software error even though hardware was blamed in the official accident reports. After several such accidents, it was no longer possible to continue to blame the operators and the mechanical systems, and the software errors responsible were finally found. The Three Mile Island accident is often attributed to operator error although four separate hardware failures occurred before the operators even got into the act.

Most accidents are not attributable to a single cause -- they are a result of the interaction of several factors. It is always possible to blame the operator for not taking the correct steps to "save the day" after the accident scenario has already been started; accidents in complex systems are usually not a matter of operator error alone. At the least, why was the system involved designed to be unsafe on a single point failure like an operator error?
If it is, then the limiting reliability is that of operator error, which is usually counted as 10^-5 in risk assessments.

Note that according to the paper cited above, the A320 contains a computer-aided decision system and a limitation of the authority of the pilot. The pilot also cannot drive the aircraft outside the flight envelope. If this is true, it seems odd to blame the recent accident solely on the pilot driving the aircraft outside the flight envelope, as reported by the press.

------------------------------

Date: Mon, 04 Jul 88 11:18:37 EST
From: davy@intrepid.ecn.purdue.edu (Dave Curry)
Subject: Virus aimed at EDS gets NASA instead

Taken from The Lafayette Journal & Courier, 7/4/88, Page A2.

Destructive computer program sabotages government data

NEW YORK (AP) - A computer program designed to sabotage a Texas computer company destroyed information stored on personal computers at NASA and other government agencies, according to _The_New_York_Times_.

It was not known whether the program had been deliberately introduced at the agencies or brought in accidentally, but NASA officials have asked the FBI to enter the case.

Damage to government data was limited, but files were destroyed, projects delayed and hundreds of hours spent tracking the electronic culprit.

The rogue program destroyed files over a five-month period beginning in January at the National Aeronautics and Space Administration, the Environmental Protection Agency, the National Oceanic and Atmospheric Administration and the U.S. Sentencing Commission, the _Times_ reported.

The program, or virus, infected close to 100 computers at NASA facilities in Washington, Maryland and Florida.

The virus was designed to sabotage computer programs at Electronic Data Systems, a private company in Dallas, Bill Wright, a company spokesman, said. The program did little damage, he said.

--Dave Curry
Purdue University

------------------------------

End of RISKS-FORUM Digest 7.15
************************
-------