[comp.risks] RISKS DIGEST 7.15

RISKS@KL.SRI.COM (RISKS FORUM, Peter G. Neumann -- Coordinator) (07/06/88)

RISKS-LIST: RISKS-FORUM Digest   Tuesday 5 July 1988   Volume 7 : Issue 15

        FORUM ON RISKS TO THE PUBLIC IN COMPUTERS AND RELATED SYSTEMS 
   ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Contents:
  "The target is destroyed." (Iranian Airbus) (Hugh Miller)
  Clarifications on the A320 Design (Nancy Leveson)
  Virus aimed at EDS gets NASA instead (Dave Curry)

The RISKS Forum is moderated.  Contributions should be relevant, sound, in good
taste, objective, coherent, concise, and nonrepetitious.  Diversity is welcome.
CONTRIBUTIONS to RISKS@CSL.SRI.COM, with relevant, substantive "Subject:" line
(otherwise they may be ignored).  REQUESTS to RISKS-Request@CSL.SRI.COM.
  For Vol i issue j  /  ftp kl.sri.com  /  get stripe:<risks>risks-i.j ... .
  Volume summaries in (i, max j) = (1,46),(2,57),(3,92),(4,97),(5,85),(6,95).

----------------------------------------------------------------------

Date:         Mon, 04 Jul 88 11:15:52 EDT
From: Hugh Miller <HUGH%UTORONTO.BITNET@CORNELLC.CCS.CORNELL.EDU>
Subject:      "The target is destroyed."

        There will be a board of inquiry, and congressional committees, and
endless hearings, and in the end "operator mistakes" or "human error" will be
found to have been the cause of the downing of the Iranian passenger airliner.
A system as big and expensive as the Aegis cannot be allowed to fail, as the
scandal about its testing revealed.  The "failures" will be explained away.
        The irony of this situation is, of course, that "human error" gets
blamed for the disasters, to take the heat off the (more-infallible-than-the-
pope) technology/-ists; but, in a skewed and entirely unfunny way the blame is
well-placed.  I imagine that every commander sitting in one of those high-tech
bathtubs in the Gulf took one look at poor Captain Brindel of the Stark -- who
took the fall and the forced retirement and the reduced pension when all along
it was his buggy EW gear at fault, not he or his crew -- and said to himself,
"OK, the writing's on the wall.  If anything hits me and holes the ship &
kills American sailors the pols will throw me to the sharks rather than admit
that some whiz-bang from Rockwell or Unisys or whoever didn't do its zillion
dollar job.  So from now on it's hair-trigger 24 hours a day, and since I
can't be sure my BOZO QZ999 Battlesys can knock down a missile once it's fired
my only recourse is to knock the launchers down before they fire.  They're
bigger & slower & better targets anyway.  Shoot first and ask questions later.
The hell if I'm gonna be the next one to lose his Florida retirement condo to
keep Marconi's rep clean."  I can't find it in my heart to blame the man,
either.  Who wants to be the fall guy for a gigabuck defense contractor and a
desperate, freebooting White House in an election year?  So along comes a
jumbo jet, 25,000 feet, 430 mph, radar cross-section size of a football field.
Software library in the EW battle computers says it's an F-14, kind that
dinged the Stark.  Hell, we ought to know our own aircrafts' profiles, right?
You may fire when ready, Gridley.
        So, in a way, it _was_ a "human" "error".  A human, all-too-human
error.

        (Lest I be accused of imputing less-than-noble motives to valiant,
dedicated career military men from my safe armchair half a world away, I
advance for your consideration a hypothesis similar to the one Henry Spencer
proposed, here or over in ARMS-D, I forget, last year.  Never attribute to
malice what can be accounted for by plain stupidity, he urged; I concur.
Similarly, I would propose that one should not impute noble motives where base
ones will do the trick, human nature being what it is.  Remember, we live in a
society where the most revered moral maxim is "Look out for Number One,"
followed closely by "Cover your ass" and "Nice guys finish last."  Naivete,
especially willful naivete, is not a necessary condition of lucidity in
thinking about techno-political matters.  And, today, all technological
matters are political and vice versa.)

        Which brings me, Mr. Speaker, to my questions.  They are not original,
but since they have never been answered and still seem crucial I will put them
again.

1:      Surely one of the greatest risks at which we are held by technology is
    the result of its completely self-contained, hermetic world-view?  In that
    view, all nature, human and otherwise, is a machine, science the search
    for the control levers, and technology the pulling thereof.  We are guided
    in how we want to pull them by our "values," whatever they are.  (What are
    they, by the way, in the Persian Gulf?  I suppose the usual "national
    interest," whatever that is.)  Just as we don't stop using our PC because
    the disk drive breaks -- Quick! Get the drive repaired! -- technological
    thinking will not countenance any stint of its advance.  Greenhouse
    effect?  Engineer hardier crops.  (I heard _that_ one at the Toronto
    atmospheric conference the other day.)  290 dead civilian airline
    passengers, including 66 children?  Patch the software library, piece of
    cake.  Six months from now, when we shoot down a couple of our own
    fighters returning from a sortie because of an unforeseen bug in the
    Flight 655 patch, we patch the patch, no problem.  Nothing fazes the true
    believer, except the suggestion that we deep-six his favorite toy.

2:      Technology creates risks, vigorously.  Since politics and technology
    are today joined in unholy matrimony, technology creates political risks.
    Techno-freaks don't like to think about this, preferring to pretend to be
    "apolitical" so no one will disturb their tidy world, but it is no less
    true for all that.  Since World War II especially the military has been the
    hot spot for the interpenetration of politics and technology.  Only the
    can-do technoptimism of the postwar armed forces can explain such patently
    politically imprudent and hazardous ventures as the Persian Gulf
    filibuster or the "600 Ship Navy"'s aggressive forward basing strategy, to
    say nothing of CBW or, haha, the Strategic Defense Initiative.  The bigger
    and more optimistic the technology, the bigger the risks.  Soviet SSBN's
    off Virginia and Pershing II's in Europe?  An 8-minute LUA decision
    window.  This is not, _pace_ Charles Perrow, a matter of "normal
    accidents" arising out of the increasing complexity of systems.  It is a
    _positive drive_ coeval with technology itself, to which the
    complexification problem is only an adjunct.  As Oppenheimer once told us,
    "If the experiment is sweet, you have to run it." If there's no going
    back, are we prepared for the increasingly wild lurches politics will take
    in the nearer and nearer future?


Hugh Miller, University of Toronto, (416) 536-4441

------------------------------

Date: Sun, 03 Jul 88 19:00:12 -0700
Subject: Clarifications on the A320 Design
From: Nancy Leveson <nancy@commerce.ICS.UCI.EDU>

There has been a great deal of speculation about the design of the A320 
computing system in Risks.  I would like to clear up a few things.  First, the 
A320 is not designed like the F-16 or Boeing 757/767.  The information below
is taken from a paper by J.C. Rouquet and P.J. Traverse entitled "Safe and
Reliable Computing on Board the Airbus and ATR Aircraft," which was
published in the Proceedings of Safecomp '86 (Safety of Computer Control
Systems 1986), edited by W.J. Quirk, published by Pergamon Press, and
copyrighted in 1986 by IFAC (International Federation of Automatic Control).
The authors work for Aerospatiale, the firm responsible for designing most
of the computing systems on board Airbus aircraft.  I quote from some of
the relevant parts; sorry I could not include the figures. Instead of trying 
to correct the English in the paper (which was obviously not edited before
printing), I have left it as is for accuracy; I have tried to proofread this
typewritten message carefully to make sure that I have not introduced errors.  
I apologize if I have.

... "The A320 is the first civil aircraft designed as fly-by-wire.  It is also
the only aircraft with such an ultra high reliability requirement (failure
rate of 10^-9/H) for a computing system." ... 

"Safety-only requirements can be specified as not computerized back-up exits.
Let us take the roll control of the AIRBUS 300-600 and 310 as an example.
The aircraft is controlled on the roll axis using a pair of ailerons and
5 pairs of spoilers.  The aircraft is controlled either in manual mode, or
in an automatic one.  The automatic flight control system is composed of
two "Flight Control Computer".  Only one of them is active at a time, the
other one being a spare.  If both computers are lost, the aircraft is 
manually controlled.  Therefore, the loss of the automatic flight control
system is not dangerous for the aircraft (except during a short period of
an automatic landing in bad weather conditions)."

"Computers are involved in the manual control mode.  Two "Electrical Flight
Control Unit" are used to control the spoilers.  If both computers are lost,
the pilot can still control the aircraft using the ailerons, with a reduced
authority, as the spoilers are no more available."

"The basic building block is a duplex computer.  Each high safety computer
is composed of two computation channels.  Each channel is monitoring the
other.  If one channel fails, the other one shuts the whole computer down
in a safe way.  This scheme can be impaired by a latent error of the 
monitoring.  Therefore, a self-test program is run each time the computer
is powered up."

"Other precautions are that each channel contains watchdog, that exception 
testing and acceptance testing on the channels output are done, and that
if a computing channel contains two processors, they partially cross-check
themselves."

"Input to the computer are also tested prior to use.  It has to be noted
that safety is not affected, even if a Byzantine general strikes.  Indeed,
if a sensor sends different information to the two computation channels
the consequence is only the shut-down of the computer."
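
[Illustrative sketch, not from the paper: the duplex scheme and the input
check quoted above can be pictured as two dissimilar channels whose outputs
are compared, with any disagreement shutting the whole computer down.  The
class name, the comparison tolerance, and the stand-in control laws are all
assumptions made for illustration; the paper gives none of these details.]

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DuplexComputer:
    """Hypothetical model of one 'high safety computer' with two channels."""

    channel_a: Callable[[float], float]  # program written by team A (assumed)
    channel_b: Callable[[float], float]  # dissimilar program by team B (assumed)
    tolerance: float = 1e-6              # assumed comparison threshold
    active: bool = True

    def step(self, sensor_for_a: float, sensor_for_b: float) -> Optional[float]:
        """Return a command, or None once the computer has shut itself down."""
        if not self.active:
            return None
        out_a = self.channel_a(sensor_for_a)
        out_b = self.channel_b(sensor_for_b)
        # Each channel monitors the other: a mismatch, whether caused by a
        # channel fault or by a sensor telling the two channels different
        # things, leads to the same fail-safe reaction.
        if abs(out_a - out_b) > self.tolerance:
            self.active = False
            return None
        return out_a


# Two stand-in control laws that should agree on identical inputs.
computer = DuplexComputer(lambda x: 2 * x, lambda x: x + x)
agreed = computer.step(1.0, 1.0)     # both channels see the same value
disagreed = computer.step(1.0, 1.5)  # sensor sends different data to each
```

In this sketch a disagreement costs availability rather than safety: the
computer produces no output and stays off, and continued control is left to
the remaining computers.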

"As a basic precaution, computers are shielded and loosely synchronised.
Most of the transient are coming through power supply.  Therefore, the power
supply is filtered, but also monitored.  A power loss is thus detected, a
few data stored in a protected memory.  If the power loss is sufficiently
short, a "hot start" is possible.  Else, the computer can anyway reset
itself, and restart."
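
[Illustrative sketch, not from the paper: the power-loss handling described
above amounts to choosing between a "hot start" from protected memory and a
full reset.  The threshold value and the function and field names below are
hypothetical; the paper does not say how short "sufficiently short" is.]

```python
from typing import Optional, Tuple

HOT_START_LIMIT_S = 0.05  # assumed threshold, not a figure from the paper


def recover_from_power_loss(
    outage_s: float, protected_memory: Optional[dict]
) -> Tuple[str, dict]:
    """Pick a restart mode after a detected power loss."""
    if outage_s <= HOT_START_LIMIT_S and protected_memory is not None:
        # Short interruption: resume from the few data items saved in time.
        return ("hot_start", dict(protected_memory))
    # Longer interruption (or nothing saved): reset and restart from scratch.
    return ("reset", {})


short = recover_from_power_loss(0.01, {"mode": "manual"})
long_ = recover_from_power_loss(2.0, {"mode": "manual"})
```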

"Equipment on board are divided into two subsets.  These two subsets are often
referred the 2 sides of the aircraft.  Typically, one side is controlled by
the pilot, the other by the copilot."

"The main characteristics are as follow:
  -- only one side is needed to flight [sic] without almost any limitation
  -- an error cannot propagate from one side to the other
  -- a fault cannot be common to both sides
The main task is to verify that the two sides are sufficiently segregated to 
limit error propagation and common point failure."

"High quality software is obtained using a quality plan, as defined in (DO178A).
[I discuss this standard, and my qualms about it and the n-version programming
used to justify the high reliability numbers, in a previous Risks. It was 
reprinted in SEN. NGL]  This document is agreed as a basis for certification.  
Its recommendations are used during the software design phase, and each time 
software must be recertified because of modifications."

"This (and any) rigorous design and testing methodology is not acknowledged as
a fault free software warrant."

"Therefore, each computers uses two different programs, one in each channel.
The dissymetry (diversity) between the two programs is obtained using:
  -- different software design teams,
  -- different languages,
and, depending on the computer, different algorithms, functions, etc.."...

"A major concern is about maintenance faults.  Their effects are limited,
thanks to the power-up self tests, and the computer aided maintenance system.
The pilots are generating input to the computers, and it is recognized that
in most of the accidents, at one time, a pilot takes a wrong option.  This
error may not be an important one, and it has to be noted that in these cases,
the pilot is generally in very bad working conditions.  Anyway, pilot errors
are a major concern.  Two ways to cope with these are:
  -- a computer aided decision (in case of emergency) system (Airbus 300-600,
     310, 320)
  -- a limitation of the authority of the pilot (A320)."

"The first system called Electronic Centralized Aircraft Monitoring displays 
adequate procedures on a screen....  More, on the A320, the pilot cannot drive 
the aircraft outside the flight envelope.  [Wasn't this what the manufacturers 
claimed happened in the recent accident? NGL]  Ziegler and Durandeau (1984) have
discussed further this point."  [Citation to a paper entitled "Flight control
system on modern civil Aircraft," Proc. of the International Council of the
Aeronautical Sciences (ICAS '84), Toulouse, France, Sept. 1984.]

"The flight control of the A320 is ensured by a computing system, with a
limited mechanical backup.  The design objective is for the computing system
to be sufficiently reliable in order not to use the mechanical back-up.  This
back-up is on board to ease the certification of the aircraft and to help
people to be confident in the aircraft."  

"The safety requirement still exists for the computers.  Therefore, they are
built following the rules defined above (two channels computer, different
programs, high segregation ...)."

"Reliability is gained using five computers.  All of them participate in the
control of the aircraft on the roll axis, four of them on the pitch axis.  At
each time, each surface is controlled by one computer, the other being hot
back-up.  Two types of computers are used.  One is called 'ELAC' (Elevator
and Aileron Computer) is manufactured by THOMSON-CSF, around microprocessors
of the 68000 type.  The other is built by SFENA and AEROSPATIALE with 80186
type processors, and is called 'SEC' (Spoiler and Elevator Computer)."

"Each computer has two different programs.  Therefore, two types of computer
have been designed and four programs.  The repartition of the computers is
shown on Table  1."

                PITCH     ROLL   'SIDE'
    ELAC 1       x         x       1
    ELAC 2       x         x       2
    SEC 1        x         x       1
    SEC 2        x         x       2
    SEC 3        -         x       2
      TABLE 1 - Repartition of the computers

"With this architecture, the aircraft can tolerate:
   -- multiple hardware failures
   -- a complete loss of one 'side' of the aircraft
   -- at least a software error, or a hardware design error, even if it
      shuts down one type of computer
   -- combinations of the above."
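
[Illustrative sketch, not from the paper: Table 1 can be checked mechanically
against the tolerance claims just listed.  The code below encodes the table
and asks whether at least one computer remains for each axis after losing a
whole 'side', a whole computer type, or both.  Treating "one computer left
per axis" as sufficient is a simplifying assumption; it ignores actuators,
hydraulics, and everything else a real safety assessment would cover.]

```python
COMPUTERS = {
    "ELAC 1": {"type": "ELAC", "side": 1, "axes": {"pitch", "roll"}},
    "ELAC 2": {"type": "ELAC", "side": 2, "axes": {"pitch", "roll"}},
    "SEC 1": {"type": "SEC", "side": 1, "axes": {"pitch", "roll"}},
    "SEC 2": {"type": "SEC", "side": 2, "axes": {"pitch", "roll"}},
    "SEC 3": {"type": "SEC", "side": 2, "axes": {"roll"}},
}


def lost(side=None, kind=None):
    """Names of the computers taken out by losing a side and/or a type."""
    return {
        name
        for name, c in COMPUTERS.items()
        if c["side"] == side or c["type"] == kind
    }


def controllable(failed):
    """True if every axis still has at least one working computer."""
    alive = [c for name, c in COMPUTERS.items() if name not in failed]
    return all(
        any(axis in c["axes"] for c in alive) for axis in ("pitch", "roll")
    )


# Every combination of "no/one side lost" with "no/one type lost".
scenarios = {
    (side, kind): controllable(lost(side, kind))
    for side in (None, 1, 2)
    for kind in (None, "ELAC", "SEC")
}
# Any single computer down, as in the four-computer take-off case.
single_failures_ok = all(controllable({name}) for name in COMPUTERS)
```

Under these assumptions every scenario leaves both axes covered, including
the loss of one side combined with the shutdown of one computer type.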

"More details about the Electronic Flight Control System of the A320 can be
found in a paper by Ziegler and Durandeau (1984)."  [Cited above]

"When an equipment fails, two problems appear.  First, to find it, and second
to replace it.  On the A320, the localization is done at two levels: each
equipment records failure indication, and at the aircraft level, a computer
collects all the information and correlates them.  It is thus possible to
have at a terminal the list of all the failed computers, through the
'Centralized Fault Display System.'"

"Spare parts are not available in all the airports.  In order not to ground 
an aircraft, it is needed to have spares on board, or for the aircraft to
be able to take off with failed equipment.  The second way is taken and
computing systems are generally designed to reach their the [sic] reliability
requirements, even if one of the computer is down.  For example electrical
flight control system of the A320 is composed of five computers, but, provided
some limitations of the flight envelope, it will be allowed, and safe, to
take-off with only four computers up."

... "The demonstration [of the assigned reliability] is based on ground or 
flight test (to measure the effect of a failure), on probability numbers, and 
on a software design quality plan in accordance with (DO 178A).  No number is 
assigned to a software, even if measures have been done by Troy and Baluteau 
(1985)."  [Reference to a paper in the proceedings of FTCS-15, June 1985,
pp. 438-443].

"The 'Zonal Safety Assessment Document' analyzses the effect of such things
as an engine burst, waste liquids, ..."

"We rely also on ground and flight tests.  For example, the equipment of the
A320 are tested on ground in an 'iron bird' for one year prior to the first
flight.  During the year between the first flight and the certification, both
ground and flight tests will be performed.  The ground tests include power
supply transients, electrical environment hazard."

[There is a section on accrued experience] ... "Our record is satisfactory.
No aircraft crashed, and even came close to this situation.  Design errors
have been found in operation, both in computer specification, and in programs.
We plan to examine all of them, but at first glance, none of them is dangerous.
The use of design diversity is successful, as no error has been found in both
versions of a software." 

NGL:  The argument in Britain about the A320, led by Mike Hennell and 
Bev Littlewood, has focused on the lack of proof of the claims by the 
manufacturers of the Airbus A320 that they have the safest plane flying 
because of the ultra-high reliability of the computer systems.  Despite the 
claim of 10^-9/H, there was an incident where the A320 computers all failed 
in test.  The manufacturers explain this as a "teething problem" that will 
disappear after test.  They also stress that the test pilot was able to safely 
land the plane on the back-up system.  

I have been concerned about claims that the use of n-version programming 
(aka "design diversity") will provide ultra-high reliability.  John Knight
and I have written several papers describing experiments with this technique.  
Tim Shimeall and I have also just completed another experiment that compares 
n-version programming and more traditional reliability techniques.  I will 
send anyone copies of these papers upon request.   

Let me add to the voices suggesting that we wait for the final data before 
judging the latest accident.  In almost all accidents, the manufacturers of
the equipment involved immediately claim that it was a result of operator error
(for very good reasons which are usually of a liability and monetary nature).
It did not seem like the pieces had all stopped smoking before the cause of
the A320 accident was announced.  Unfortunately, with the immense amount of
money involved, it is not clear that the truth will ever be known.  If it truly
is a technical problem, it may require multiple accidents and deaths before 
this is admitted.  This, by the way, is what happened with the Therac 25.
It now appears that some early accidents involving the Therac 25 were the
result of software error even though hardware was blamed in the official 
accident reports.  After several such accidents, it was no longer possible to 
continue to blame the operators and the mechanical systems, and the software 
errors responsible were finally found.  The Three Mile Island accident is often 
attributed to operator error although four separate hardware failures occurred 
before the operators even got into the act.  Most accidents are not attributable
to a single cause -- they are a result of the interaction of several factors.  
It is always possible to blame the operator for not taking the correct steps to
"save the day" after the accident scenario has already been started; accidents 
in complex systems are usually not a matter of operator error alone.  At the 
least, why was the system involved designed to be unsafe on a single point 
failure like an operator error? If it is, then the limiting reliability is 
that of operator error, which is usually counted as 10^-5 in risk assessments.  
Note that according to the paper cited above, the A320 contains a computer-aided
decision system and a limitation of the authority of the pilot.  The pilot also 
cannot drive the aircraft outside the flight envelope.  If this is true, it 
seems odd to blame the recent accident solely on the pilot driving the aircraft
outside the flight envelope, as reported by the press.
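
[Illustrative arithmetic for the "limiting reliability" point above: if a
single operator error is enough to cause an accident, the operator and the
computing system are effectively in series, and the larger failure
probability dominates.  The sketch assumes the two failure sources are
independent and that both figures are expressed on the same basis; neither
assumption comes from the text, which gives no units for the 10^-5 figure.]

```python
def combined_failure_probability(*probabilities):
    """Failure probability of a series system with independent parts."""
    survive = 1.0
    for p in probabilities:
        survive *= 1.0 - p
    return 1.0 - survive


COMPUTER_FAILURE = 1e-9   # the claimed figure for the computing system
OPERATOR_ERROR = 1e-5     # the figure usually counted in risk assessments

total = combined_failure_probability(COMPUTER_FAILURE, OPERATOR_ERROR)
ratio = total / COMPUTER_FAILURE  # how far the total is from the 10^-9 claim
```

With these numbers the total is essentially the operator figure, about four
orders of magnitude worse than the 10^-9 claimed for the computers alone.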

------------------------------

Date: Mon, 04 Jul 88 11:18:37 EST
From: davy@intrepid.ecn.purdue.edu (Dave Curry)
Subject: Virus aimed at EDS gets NASA instead

Taken from The Lafayette Journal & Courier, 7/4/88, Page A2.

Destructive computer program sabotages government data

  NEW YORK (AP) - A computer program designed to sabotage a Texas computer
company destroyed information stored on personal computers at NASA and other
government agencies, according to _The_New_York_Times_.
  It was not known whether the program had been deliberately introduced at
the agencies or brought in accidentally, but NASA officials have asked the
FBI to enter the case.
  Damage to government data was limited, but files were destroyed, projects
delayed and hundreds of hours spent tracking the electronic culprit.
  The rogue program destroyed files over a five-month period beginning in
January at the National Aeronautics and Space Administration, the
Environmental Protection Agency, the National Oceanic and Atmospheric
Administration and the U.S. Sentencing Commission, the _Times_ reported.
  The program, or virus, infected close to 100 computers at NASA facilities
in Washington, Maryland and Florida.
  The virus was designed to sabotage computer programs at Electronic Data
Systems, a private company in Dallas, Bill Wright, a company spokesman,
said.  The program did little damage, he said.

--Dave Curry
Purdue University

------------------------------

End of RISKS-FORUM Digest 7.15
************************