[comp.risks] RISKS DIGEST 10.31

risks@CSL.SRI.COM (RISKS Forum) (09/06/90)

RISKS-LIST: RISKS-FORUM Digest  Wednesday 5 September 1990  Volume 10 : Issue 31

        FORUM ON RISKS TO THE PUBLIC IN COMPUTERS AND RELATED SYSTEMS 
   ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Contents:
  March 1989 British Rail Train Crash (Brian Randell)
  Complexity, safety and computers (Martyn Thomas)
  Software bugs "stay fixed"? (Martyn Thomas)
  Re: Stonefish mine (Mark Lomas, Bill Davidsen, Bill Ricker)
  Reply to "Computer Unreliability" Stars vs Selves (Dave Davis)
  "Wild Failure Modes" in Analog Systems
    (Jim Hoover, Richard D. Dean, Will Martin, Pete Mellor)

The RISKS Forum is moderated.  Contributions should be relevant, sound, in good
taste, objective, coherent, concise, and nonrepetitious.  Diversity is welcome.
CONTRIBUTIONS to RISKS@CSL.SRI.COM, with relevant, substantive "Subject:" line
(otherwise they may be ignored).  REQUESTS to RISKS-Request@CSL.SRI.COM.
TO FTP VOL i ISSUE j:  ftp CRVAX.sri.com<CR>login anonymous<CR>AnyNonNullPW<CR>
cd sys$user2:[risks]<CR>GET RISKS-i.j <CR>; j is TWO digits.  Vol summaries in 
risks-i.00 (j=0); "dir risks-*.*<CR>" gives directory listing of back issues.
ALL CONTRIBUTIONS ARE CONSIDERED AS PERSONAL COMMENTS; USUAL DISCLAIMERS APPLY.
The most relevant contributions may appear in the RISKS section of regular
issues of ACM SIGSOFT's SOFTWARE ENGINEERING NOTES, unless you state otherwise.

----------------------------------------------------------------------

Date: Wed, 5 Sep 90 13:50:34 BST
From: Brian Randell <Brian.Randell@newcastle.ac.uk>
Subject: March 1989 British Rail Train Crash

  [Yesterday's Independent carried a number of articles related to the ending
  of the trial of a British Rail train train driver in connection with a major
  train accident that occurred on 4 March 1989 in south London. There was a
  fairly lengthy front page article, and two further articles taking up a
  complete half-page.  I have selected from these some paragraphs which should
  be on interest to RISKs readers.  Brian Randell, Computing Laboratory,
  University of Newcastle upon Tyne, UK PHONE =	+44 91 222 7923 
                                                     [Further excerpted by PGN]

A train driver who admitted passing through a red light and causing the Purley
rail crash, in which five people died and 87 were injured, was jailed
yesterday.  Robert Morgan 47, was sentenced to 18 months' imprisonment, 12
months of which was suspended, after pleading guilty to two charges of
manslaughter.  The sentence drew strong criticisms from the rail unions.  The
Old Bailey was told Morgan received two commendations in a previously exemplary
23 years as a train driver.  [...]

Julian Bevan, for the prosecution, told the Old Bailey that Morgan was in
hospital with face and neck injuries a few hours after the crash when he said
he had jumped the red light at about 70mph.  The track limit was 90mph.
Describing the safety system thrown into question by the crash, Mr Bevan said
drivers were given two amber warning signals before coming to a red light.
Each is accompanied by a klaxon sounding in the cab.  If the driver fails to
switch it off, the brakes are applied automatically within three seconds, he
told the court.  Morgan, a single man, of Ferring, West Sussex, admitted that
he must have switched off the klaxon each time, but the memory loss he suffered
prevented him from being more precise.  [...]  Robert Morgan, a driver since
1966, had been well warned by the signalling system that there was going to be
a train ahead of him.  What went wrong?

For every signal a driver passes, the system provides him with an instantly
recognisable set of acknowledgements.  Every time a signal is passed at green,
a bell rings in the cab indicating the line ahead is clear.  If the signal is
at double amber, or amber, or red, a klaxon sounds and has to be acknowledged
by the driver.  He has three seconds to do this and if he does not press the
button, the brakes start to apply and are fully operational in five seconds.
However, he can override this system.

It is a weakness in the system that BR has now recognised.  The Purley crash
came just three months after the disaster at Clapham where 35 people were
killed when a signal failure resulted in a commuter express ramming another
stationary rush hour train.  The real cause, the public inquiry said, was too
much repetitive, painstaking work, not enough time off - lack of supervision
and improper testing procedures among technicians completing resignalling work
in the area.

Since Purley, and acting on the recommendations of the Hidden Report into the
Clapham disaster, BR has been overhauling its approach to safety.

BR admits that prior to the Clapham crash its approach to safety was
equipment-based.  It reasoned that if the equipment and the rules designed to
protect it worked, then the safety of staff and passengers was assured.  What
happened at Clapham and to Robert Morgan at Purley showed that approach to be
inadequate.  In coming to terms with human error, BR has introduced its new
Safety Management Programme.

Potential train drivers already undergo extensive psychological as well as
practical testing, to ensure they are suited to working in a highly disciplined
atmosphere.  However, after work with Professor James Reason of the University
of Manchester, a specialist in risk analysis, BR recognises that regardless of
personality, all human behaviour is inherently quirky in increasingly
repetitive circumstances.  It understands that drivers can get into a "mind
set" where they believe they have completed a task, or recognised a signal,
when they have not.  In that mental state, a driver could cancel a warning
horn, not realise he had done so and plough on to disaster.

BR has admitted that the chances of an equipment failure being the sole cause
of an accident have been all but engineered out of the safety equation, and
that one of the biggest risks to passengers is drivers passing signals at
danger.  It happens on average between 20 and 30 times a year, and each
incident is investigated.  [...]

------------------------------

Date: Wed, 5 Sep 90 10:57:39 BST
From: Martyn Thomas <mct@praxis.co.uk>
Subject: Complexity, safety and computers

In RISKS (10.29), David Gillespie writes:
: I think one point that a lot of people have been glossing over is that in a
: very real sense, computers themselves are *not* the danger in large,
: safety-critical systems.  The danger is in the complexity of the system itself

I agree. Often, people talk about "the software reliability problem" when
actually the problem is the difficulty of getting complex designs right, and
the impossibility of guaranteeing that any residual errors will cause the
design to fail less frequently than (some very low probability of failure).

There is, of course, the related problem of what we mean by "getting the
design right" and "failure". In general, these can only be defined with
hindsight - we recognise that the system has entered a state which we wish
it hadn't, and we define that as failure. We cannot (usually) guarantee that
we have defined all safe states, or all hazardous states, in advance.

This is seen as a "software problem" because we *choose* to put most of the
system complexity into the software, as a sensible design decision.

Recently, I have started to wonder if some of our difficulties are
exacerbated by this decision. Software is digital (at the moment, at least).
Yet many safety-critical systems involve monitoring analog signals and
driving actuators which cause analog activity in the controlled system 
(for example, monitoring airspeed and driving the elevators of an aircraft).
At some point in the system, the analog signal is digitised - generally
before any computation is performed on it. Then the digital outputs are
reconverted to analog.

The question I would ask is: are we making our systems significantly more
complex by converting to digital too soon (or at all)? Would the system
complexity be reduced if, instead of converting to digital so that we can
use a commercial microprocessor, we processed the signals as analog signals,
using an application-specific integrated circuit (ASIC) and only converted
to digital where there is a clear reduction in complexity from doing so?

This is a serious question: latest technology allows mixed analog-digital
ASICs, and the cost and time to produce an ASIC is competitive
with the cost and time to produce the software and circuit board for a
microprocessor system - and the technology is moving so that economics
increasingly favour the use of ASICs. You can have (some of) your favourite
microprocessors on-chip, too.

To summarise: the issue is system complexity - safety is related (probably
exponentially) to the inverse of complexity (if only we could measure it) -
so reducing complexity is the key to increasing safety; can we make progress
by exploiting analog techniques?

--
Martyn Thomas, Praxis plc, 20 Manvers Street, Bath BA1 1PX UK.
Tel:	+44-225-444700.   Email:   mct@praxis.co.uk

------------------------------

Date: Wed, 5 Sep 90 13:44:05 +0100
From: Martyn Thomas <mct@praxis.co.uk>
Subject: software bugs "stay fixed"?

In RISKS 10.30, Robert L. Smith writes:
...
    "... the reliability advantage software has over hardware and people system
components, which is that once a software bug is truly fixed, it stays fixed!
In contrast consider the many times you repair hardware only to see it fail
again from the same cause ..."

I don't know how you define "truly fixed" (unless it means that the bug doesn't
recur - in which case the claim that it stays fixed is tautological!).

In my experience, software bugs are often reintroduced (which is why regression
testing is important). This source of problems is probably only surpassed by
the number of *new* errors introduced while fixing old ones.

The problem of re-assuring software after "maintenance" is as hard as the
problem of assuring it in the first place - while the industry practices are
probably worse, and the regulatory control is certainly worse. Experience with
"software rot" in past systems suggests that we may well see accidents caused
by "faulty maintenance" in growing numbers over the next few years. I predict
that the individual staff will be blamed, rather than the whole regulatory
structure (whereas a major accident caused by an ab initio design error would
raise the question of how the error managed to get through the certification
process). Somehow, "maintenance errors" sound less threatening, possibly
because they sound as though they only apply to a single system.

------------------------------

Date: Wed, 5 Sep 90 12:19:30 +0100
From: tmal@computer-lab.cambridge.ac.uk
Subject: Re: Stonefish mine

In RISKS DIGEST 10.30 Chaz Heritage wrote in reply to a message from Pete
Mellor:

> > 4. Does Stonefish rely on some sort of sonar transponder
> >    to distinguish friend from foe?
>
> I imagine not, since if the mine were to transmit sonar in order to trigger
> transponders located on friendly ships then it would render the mine very
> susceptible to detection and countermeasures.

I don't know whether Stonefish is able to trigger transponders on
detected ships but let us assume that it can.  We already know that
Stonefish performs pattern recognition on passing ships to distinguish
friend from foe.  There are also some types of hostile ships that
should not be attacked, for instance we would like the mine to remain
undetected as a minesweeper passes.

If the mine has already decided that a ship should not be attacked, because it
has been deemed friendly or a hostile minesweeper, then it need not trigger the
transponder.  Only if it has already decided to attack a ship would it need to
confirm its decision and so try to trigger the transponder.  If there is no
response then the mine intends to explode and so will almost certainly be
detected very shortly afterwards.

The decrease in risk to friendly shipping may make such behaviour worthwhile;
the additional warning that a foe would receive would be of the order of the
round-trip time for the message pair.

	Mark Lomas (tmal@cl.cam.ac.uk)

------------------------------

Date: Wed, 5 Sep 90 10:51:03 EDT
From: davidsen@crdos1.crd.ge.com (bill davidsen)
Subject: Re: Stonefish mine

| From: chaz heritage:wgc1:RX
| Date: 3-September-90 (Monday) 4:34:36 PDT
| 
| > 6. The sophistication of Stonefish's recognition system argues for some
| kind
|     of artificial intelligence. If it's that smart, would it know who was
|     winning and change sides accordingly?<
| 
| Personally I wouldn't consider Stonefish to be an AI. I don't think the
| problem posed is much of a risk to Stonefish operators....

| If, on the other hand, Carlos Cardoen is telling fibs (which would not
| perhaps be entirely out of character) then it's possible that he's sold the
| Iraqis a few Stonefish already. If so, it seems unlikely to me that they'll
| work properly without reprogramming for the target signatures of US and UK
| shipping.

  Here's a real risk of software... after the mines are reprogrammed how
would you like to be the first one to run a ship over one to verify that
they are ignoring "friendlies?" Since Iraq doesn't have enough ships to
worry about this, they don't have the problem, but if they blew the
bottom out of a tanker they might really shut off the flow of oil.

  I believe the mines huddle on the bottom and wait until they detect a
target close enough to be damaged then pop to the surface. Somewhat like
a "Bouncing Betty" mine, for those of us old enough to remember.

bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

------------------------------

Date: Wed, 5 Sep 90 18:06:16 GMT
From: wdr@wang.com (William Ricker)
Subject: S-W controlled mine Risks to Aircraft carriers (Re: Stonefish)

I enjoyed the speculation "From Channel 4 news last night (Tue. 28th Aug)"
about a software-controlled mine. However, after recent discussions in the
sci.military / military-request@att.att.com list/group (initiated after I
repeated something I heard a US Admiral say on CBC repeated over US NPR), I
must quickly comment on:

In comp.risks, p.mellor@uk.ac.city (JANET) writes
>It is reported that Iraq may be deploying some of the Royal Navy's latest
>high-tech weaponry. Apparently this is causing US commanders to be reluctant
>to send aircraft carriers into the northern area of the Gulf.

Carriers are not in the gulf because it it too small to maintain normal flight
operations in -- the standard exclusion zones around a carrier task group would
include oil platforms, Saudi, and Iran; and if operating in north gulf, Iraq.
They also can't steam east or west for wind-accross-deck very long either, but
that is less of a concern.

Stonefish may deter smaller ships and the Battleships (the forgotten class of
capital ship) from approaching for bombardment, but they're irrelevant in that
overgrown estuary for carriers.

/bill ricker/                      wdr@wang.com a/k/a wricker@northeastern.edu

------------------------------

Date: Wed, 05 Sep 90 11:04:23 EDT
From: Dave Davis <davis@mwunix.mitre.org>
Subject: Reply to "Computer Unreliability" Stars vs Selves

In response to Peter Mellor's challenge in RISKS 10.28, 31 Aug 90, let me
offer the following.  On the surface, it would seem that the authors 
of the _Futures_ article have a fresh point of view about the risks 
of using computers in areas where the cost of failure is high, avionics,
automated medical devices, nuclear reactors, etc.  Systems based on large
quantities of software do have large numbers of states, and therefore,
large numbers of failure points.  I addition, such a system may have
previously unknown (to the developers) states caused by errors or outside
factors, such as the EMI-caused failures of the Blackhawk helicopters.

    However, the arguments the authors present (as summarized by Mr.
Mellor) are somewhat similar to previous objections to utilizing relatively
immature technologies.  That is, "we don't understatnd it well enough, so
let's not trust it" is the underlying point.  Almost any significant new
technology fits that argument.  Historically, it has been through applying
a technology that motivation toward better theoretical understanding is
created.  For example, we didn't understand thermodynamics and statistical
mechanics while we applied steam power for several generations.  In
addition, it is significant that the authors object to the use of
statistics as a measurment technique.  One wonders if this is an attempt to
play on the commonly held bias against the use of statistics.  Statistics
are routinely used by all large manufacturing companies to identify
production problems.

    In broader sense, the authors misunderstand how broad the
implementation of an information-intensive system can be.  It is not
necessarily just silicon and software.  One is reminded of the complexity
of the mechanical rail switching systems described so well in previous
RISKS.  The argument that discrete-state machines have inherently wilder
failure modes that an analog systems isn't so.  Any system that has
feedback, intentional or unintentional, may behave wildly if a component
fails, or it is operated outside design limits.  (In the early 70s some
airliners were thought to have crashed due to the pilots over-extending
their controls.)  Moreover, returning to the era of electromechanical
devices that wear out and have their own idiosyncracies is not a path
toward increased reliability.

Dave Davis,  MITRE Corporation, McLean, VA

------------------------------

Date: Tue, 4 Sep 90 21:45:06 MDT
From: hoover@cs.ualberta.ca (Jim Hoover)
Subject: "Wild Failure Modes" in Analog Systems

Hmm, last time I taught a hardware course I emphasized that the digital
computer was just a fiction invented by us theory types.  All the 
implementations I know of use analog devices.  Thus we already comply
with the suggested legislation.

Jim Hoover, Dept. of Computing Science, University of Alberta, Edmonton, 
Canada T6G 2H1 | 403 492 5401 | FAX 403 492 1071 | hoover@cs.ualberta.ca

------------------------------

Date: Wed,  5 Sep 90 11:23:01 -0400 (EDT)
From: "Richard D. Dean" <rd0k+@andrew.cmu.edu>
Subject: Re: Wild failure modes in analog systems

>From: Pete Mellor <pm@cs.city.ac.uk>

>Synopsis of: Forester, T., & Morrison, P. Computer Unreliability and
>Social Vulnerability, Futures, June 1990, pages 462-474.

>In contrast [to digital computers], although analogue devices
>have infinitely many states, most of their behaviour is
>*continuous*, so that there are few situations in which they
>will jump from working perfectly to failing totally.

Although analog behavior is continuous, what about resonance ?  While the
output may still be a continuous function of some inputs, it's certainly very
non-linear in some places....Watch the voltage (or current) on an RLC circuit
go very high given the right (or wrong) frequency.

Drew Dean                          rd0k+@andrew.cmu.edu

------------------------------

Date:     Wed, 5 Sep 90 14:49:04 CDT
From: Will Martin <wmartin@STL-06SIMA.ARMY.MIL>
Subject:  Re:  "wild failure modes" in analog systems

>it is now well known that analogue devices can
>also (through design infelicities or just the perverseness of the universe) do
>inherently "wild" state switches.  The classic example is the simple dribble of
>water from a faucet, which, in the absence of analogue catastrophes, would be a
>steady stream, or an equally spaced series of droplets, but is instead a series
>of droplets whose size and spacing is unpredictable except statistically.

While this is indeed true, I think that you have to look at the "level"
of the possible state change to see the analog/digital difference. In
the example cited, while each individual droplet is of unpredictable
size and falls at (generally) unpredictable intervals, stepping back
from the action and looking at the entire system (water pouring from the
faucet) gives a predictable result -- over a period of time, a certain
amount of water will flow out of that faucet. 

There is not likely to be any sudden change in the rate of flow, nor is
the flow likely to suddenly stop (assuming nobody is messing with the
controls and there are no foreign objects in the water supply to clog
the outlet). So while the individual elements (droplets) of the flow follow
chaotic paths, the flow, as a whole, follows a predictable route.

In a digital system without adequate limiting controls, each succeeding
digital number could vary wildly from the preceeding one. A high-order
bit could be turned on, for example, causing an effect that just could
not happen in an analog system, simply because it takes time for a change
to occur; analog variables can "ramp" up or down but each instance will
depend, to some extent, on those preceeding. Each digital sample,
though, can stand alone and enormous swings can occur in the interval of
milliseconds or nanoseconds between samples. Thus the possible range
of catastrophic effects are inherently greater in digital as opposed to
analog systems. (Of course, well-designed digital systems with limit
checking and sample verification can avoid such ill effects.)

This doesn't mean that analog systems can't suffer similar catastrophes.
In the example given, a lump of something in the water supply could clog
the valve or nozzle in an instant. So the flow could drop to zero in a
shorter-than-normal time. But that is about all that could go wrong. The
flow couldn't change from 1 liter/minute to 1 billion liters/minute in
an instant, or switch to a reverse-direction flow. A digital equivalent
would be subject to such possibilities.

Will       wmartin@st-louis-emh2.army.mil OR wmartin@stl-06sima.army.mil

------------------------------

Date: Wed, 5 Sep 90 22:02:11 PDT
From: Pete Mellor <pm@cs.city.ac.uk>
Subject: Re: "wild failure modes" in analog systems

Kent Paul Dolan in RISKS-10.30 writes about the "wild" (or "catastrophic" in
Forester and Morrison's original terms) failure modes of analogue systems.  He
states:

> Unless my understanding from readings in Chaos Theory is entirely flawed, the
>second sentence is simply false; it is now well known that analogue devices can
>also (through design infelicities or just the perverseness of the universe) do
>inherently "wild" state switches.

The "second sentence" here is: 

>>In contrast [to digital computers], although analogue devices
>>have infinitely many states, most of their behaviour is
>>*continuous*, so that there are few situations in which they
>>will jump from working perfectly to failing totally.

First, let me say that I *almost* entirely agree with Kent. After all, chaotic 
phenomena were originally demonstrated on analogue systems. In that synopsis, I
was trying to present the authors' view without prejudice. I did not pick
that particular bone with them in my subsequent criticism of their paper since
I had plenty of other points to raise.

Kent goes on to say:

> So, if the original authors' intent in demeaning our increasing
> reliance on (possibly "un-failure-proofable") digital systems is
> to promote a return to the halcyon days of analogue controls,
> this is probably misdirected by the time the controls approach
> the order of complexity of operation of the current digital ones.

I agree again, *but*, we would never attempt to build systems of the complexity
of our current digital systems if we had only analogue engineering to rely on.
It would not be possible. Reliability requires simplicity. Analogue systems
would be expected to be more reliable than digital because they are forced to
be simpler. It is the complexity of the software in a digital system which
leads to its unreliability.

> We may just have to continue to live with the fact, true throughout
> recorded history, that our artifacts are sometimes flawed and cause
> us to die in novel and unexpected ways, and that we can only do our
> human best to minimize the problems.

Of course! No human endeavour is free of risk. However we do have a choice:
a) to restrict the complexity of life-critical systems, in the hope of
retaining some kind of intellectual mastery of their modes of failure, and
b) to stop kidding ourselves that software failure makes an insignificant
contribution to the unreliability of digital systems (and we *do* - see 
below [1]).

Returning to chaos (as properly defined: non-linear behaviour of a system,
whose basic laws are well-understood, such that second-order effects predominate
and the future states of the system become unpredictable at the detailed level
since arbitrarily close points in the state space can diverge along widely
differing circuits): how does this differ from digital system behaviour?

When Christopher Zeeman gave a lecture on Chaos at City, I asked him what he 
thought was the relevance of chaos theory to digital systems. To my surprise,
he (and, I would guess, 99.99 per cent of other chaotists) had never given the
problem a single thought! Hardly surprising, if you think about it. There *is*
no physical theory of digital behaviour, and no distinction between 1st and 2nd
order effects (or, if you like, everything is at least 2nd order: the slightest
perturbation from a point in the state space can lead to *anywhere* arbitrarily
quickly).

Has anyone out there thought about what the state space diagram of a modest 
digital device would look like? The closest I could get was a 
billion-dimensional discontinuous space of 0's and 1's (i.e. the Cartesian
product, a billion times over, of {0, 1} with itself). Yuk!

A serious attempt *has* been made (by John Knight - reference not to hand) to
examine the shapes of bugs in programs, i.e. the topological properties of those
subsets of the input space which activate program faults. Chaotists will be
pleased to learn that they were fractal.

So here I side with Forester and Morrison. Although I agree with Kent that 
analogue systems can behave chaotically, digital systems are far, far more
chaotic than chaos!

Just an observation in passing.

[1] By the way, the reference above to belief in the perfection of software
is based on what a representative of Airbus Industrie said when interviewed
on the last Equinox programme on fly-by-wire (see RISKS passim).

UK viewers (and some elsewhere in Europe) should tune into Channel 4 on Sunday
30th September, when an updated version of this programme will be transmitted.
Approximately 50 per cent of the material is new, including some *very*
interesting stuff on the Mulhouse-Habsheim disaster.

Peter Mellor, Centre for Software Reliability, City University, Northampton 
Sq.,London EC1V 0HB +44(0)71-253-4399 Ext. 4162/3/1 p.mellor@uk.ac.city (JANET)

------------------------------

End of RISKS-FORUM Digest 10.31
************************