[comp.software-eng] Information on current state of software safety desired

syspgm%ritcv@cs.rit.edu (09/30/89)

               I am currently starting graduate research into the  area  of
          software  safety.  This research is intended to be initially very
          broad until I can narrow my focus.  I am specifically  interested
          in, but not limited to, the following areas:

          (1)  What is the current state of the art in software engineering
               as it relates to software safety ?

          (2)  Who are the movers and shakers in this field ?  Is there
               one person (or a group of people) who is single-handedly
               championing the cause of software safety ?

          (3)  What, if anything, is motivating  the  current  interest  in
               safety  ?  Has  there  been any single event that might have
               sparked interest in the field recently ?

          (4)  Have there been any breakthroughs in the field recently ?
               Or perhaps, failing a breakthrough, what looks most promising
               in the field ?

          (5)  How will the proposed IEEE Software Safety  Standard  change
               the field ?

          (6)  Have there been any  major,  or  very  informative  articles
               written that might be of use to me ?

               Thank you in advance for any assistance that  you  might  be
          able to provide.
###############################################################################
# Dean Scott Blodgett			| Systems Programmer
# Rochester Institute of Technology	| Make it in Massachusetts 
# (716) 475 - 2079			| Spend it in New Hampshire 	

shimeall@cs.nps.navy.mil (Tim Shimeall x2509) (10/03/89)

Suggestions (these answer most of your questions):
  Read Nancy Leveson's "Software Safety: Why, What, and How", ACM
   Computing Surveys, June 1986.
  Get onto the RISKS list (comp.risks) 
   -- Peter Neumann (risks-request@csl.sri.com) is the moderator.
  Read the IEEE Transactions on Software Engineering, Sept. 1986,
   a special issue on safety.
  Get copies of the proceedings of COMPASS, an IEEE sponsored
   annual conference on safety (among other things).
  Get copies of the proceedings of SAFECOMP, an IFAC sponsored annual
   conference on safety (among other things).  This might be a bit
   tricky, since this is a European conference, but Pergamon Press is
   the publisher.
				Tim

murphyn@cell.mot.COM (Neal P. Murphy) (10/03/89)

>                I am currently starting graduate research into the  area  of
>           software  safety.  This research is intended to be initially very
>           broad until I can narrow my focus.  I am specifically  interested
>           in, but not limited to, the following areas:
> ...
>           (3)  What, if anything, is motivating  the  current  interest  in
>                safety  ?  Has  there  been any single event that might have
>                sparked interest in the field recently ?

One thing that motivated my interest in software safety was the failure of a
radiation therapy (cancer treatment) LINAC built by some North American
company. While I think that the failure resulted from a system design flaw,
the problem is directly related to software safety, since the software was
performing most of the control of the system and should have had access to
sensors that would have enabled the system to detect the massive overdose
of radiation and shut it off in time. The software developers should have
been aware of the lethal radiation levels that could be generated and should
have insisted on a fail-safe shutoff, either as part of the system or parallel
to it.

Ah, well, as long as everyone involved learned from their mistakes. We're only
human. We can only try to do our best. Mostly we succeed, sometimes we don't.
"The operation was a success, but we lost the patient."

NPN

nancy@ics.uci.edu (Nancy Leveson) (10/04/89)

>One thing that motivated my interest in software safety was the failure of a
>radiation therapy (cancer treatment) LINAC built by some North American
>company. While I think that the failure resulted from a system design flaw,

I was an expert witness on one of the law suits involved with this machine.
Unfortunately, a lot of misinformation has been floating around, but I am
unable (at this time) to provide details.  However, the failure resulted from
software bugs, not from system design flaws.

>The software developers should have
>been aware of the lethal radiation levels that could be generated and should
>have insisted on a fail-safe shutoff, either as part of the system or parallel
>to it.

This is not the responsibility of the software developers, but of the system,
nuclear, and safety engineers.  

>
>Ah, well, as long as everyone involved learned from their mistakes. We're only
>human. 

Four people are dead and one is maimed.  Two of these died of cancer, the 
others died or were maimed as a result of the incorrect treatment.  I often
hear software engineers say "there is nothing we can do about software errors,
they will always occur."  This is just not true.  There were many things that
could have been done in this case and in general.

nancy
--
Nancy Leveson

diamond@csl.sony.co.jp (Norman Diamond) (10/05/89)

In article <1989Oct4.055359.15145@paris.ics.uci.edu> Nancy Leveson <nancy@commerce.ics.uci.edu> writes:

>I was an expert witness on one of the law suits involved with this machine.
[LINAC]
>the failure resulted from software bugs, not from system design flaws.

>>The software developers should have
>>been aware of the lethal radiation levels that could be generated and should
>>have insisted on a fail-safe shutoff, either as part of the system or parallel
>>to it.
>
>This is not the responsibility of the software developers, but of the system,
>nuclear, and safety engineers.  

The point of view that system engineers or safety engineers have
ultimate responsibility is understandable.

The suggestion that nuclear engineers have this responsibility is hard
to understand.  Should their machine not do what it was ordered to do?
But morally they should behave with a certain amount of responsibility,
and ask "what if...".

The suggestion that software engineers do not have this responsibility
is also hard to understand.  Morally we should behave with a certain
amount of responsibility too, and ask "what if...".  And we are
certainly obligated to test our code.

Of course, moral responsibility is difficult.  If you are a space
shuttle engineer and testify in court that you *did* ask "what if",
you might be fired for it.  And if you are a software engineer and
ask too many times "what if" or try too many times to test your code
(when management does not understand or reply), you might be fired
even without court testimony.  The difficulties of moral standards.

>I often
>hear software engineers say "there is nothing we can do about software errors,
>they will always occur."

You hear that from fake software engineers (the kind who don't get
fired).

>This is just not true.  There were many things that
>could have been done in this case and in general.

This is absolutely true.  So why did you say that software engineers
did not have responsibility in this particular case?

-- 
Norman Diamond, Sony Corp. (diamond%ws.sony.junet@uunet.uu.net seems to work)
  The above opinions are inherited by your machine's init process (pid 1),
  after being disowned and orphaned.  However, if you see this at Waterloo or
  Anterior, then their administrators must have approved of these opinions.

murphyn@cell.mot.COM (Neal P. Murphy) (10/09/89)

From Article 96 of comp.software-eng:

>unable (at this time) to provide details.  However, the failure resulted from
>software bugs, not from system design flaws.
>     .
>     .
>     .
>This is not the responsibility of the software developers, but of the system,
>nuclear, and safety engineers.  

So, in one breath, you say the deaths and injuries resulted from software
failures.  In the next breath, you state that desirable safety interlocks
are the responsibility of the system designers.  So whose failure was it?
Did the software engineers fail?  Or did the system designers fail?  Or was
it (as I think) a breakdown of the development process?  I accept that
there were errors in the software.  But I must insist that that system
should have been PHYSICALLY INCAPABLE of producing those lethal dosages,
REGARDLESS of what the software controller was telling it to do!

While I have not been an expert witness at any trial, I do know something
about this topic, as I spent a number of years in the radiation testing
field.  All that would have been needed to prevent these tragedies is a
circuit to compute the area beneath the pulse - a diode, analog integrator,
comparator, and a relay switch (for an analog solution), or a diode,
digitizer, and dedicated processor (for a digital solution) whose sole
function is to compute the area beneath the pulse and to trip a relay that
disengages the LINAC should the total dose exceed safety margins.  This is
a version of the `crowbar circuit', whose function is to shunt power surges
on a power supply's mains to ground, much like putting a crowbar across the
hot and common terminals.  How much did the system cost?  One million
dollars?  Two million dollars?  Even as little as $500,000?  Would a
$30,000 fail-safe system have made the system uneconomical?  I think not.
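
For what it's worth, here is a minimal sketch (in Python, purely for
readability) of the digital variant of that watchdog: a dedicated process
integrates detector samples to get the area beneath the pulse and drops a
relay once the accumulated dose passes a safety margin.  The detector,
relay, and limit names are all hypothetical placeholders, not anything from
the actual machine; a real interlock would of course run on its own
hardware, independent of the main controller.

    # Hypothetical sketch only: an independent dose-integrating watchdog.
    # read_detector() and open_beam_relay() stand in for real hardware interfaces.
    import time

    DOSE_LIMIT = 2.0       # maximum permitted dose per treatment (arbitrary units)
    SAMPLE_PERIOD = 0.001  # seconds between detector samples

    def read_detector():
        """Return the instantaneous dose rate from an independent sensor (placeholder)."""
        raise NotImplementedError

    def open_beam_relay():
        """De-energize the relay that enables the beam, i.e. fail safe (placeholder)."""
        raise NotImplementedError

    def watchdog():
        accumulated = 0.0
        while True:
            rate = read_detector()
            accumulated += rate * SAMPLE_PERIOD   # rectangle-rule integration: area under the pulse
            if accumulated > DOSE_LIMIT:
                open_beam_relay()                 # disengage the LINAC, whatever the controller says
                return
            time.sleep(SAMPLE_PERIOD)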

A terrible tragedy it is.  But I will not persecute the designers of the
system.  I imagine they feel pretty rotten anyway.  Did they REALLY do all
they could to ensure the safe use of the machine?  (Look at how many people
are killed on the highways and by-ways of the world.  I don't hear much
call for educating the drivers to make them better, or confiscating the
vehicles of drivers who repeatedly kill and maim people while behind the
wheel.  But I digress.)  I stand by what I said.  What is done, is done.
What can we learn from our mistakes?  We are only human.  We WILL make
mistakes.  I will never claim that any software I create will be so perfect
that it won't need to be independently checked.  Try as I might to avoid
them, I DO make mistakes.  I HAVE hurt people in my lifetime.  And I have
felt that hurt.

All we can do is our best.  And when that best is not good enough, we hurt,
because we realize that we could have done better.  Those of us who care
stand up and accept responsibility for our actions.  To mis-quote the Good
Book, "Let him, who has made no mistakes, cast the first stone."  If there
were more of us working together instead of attacking each other, mayhap we
would make some progress.  No?

NPN

nancy@ics.uci.edu (Nancy Leveson) (10/13/89)

In reading the responses of both Norman Diamond and Neal Murphy to my posting
about the Therac, it appears that I have been unclear.  I meant to say that
the design of the linear accelerator itself (as opposed to the design of 
the software controlling parts of it) is not the responsibility of the 
software engineer but of the nuclear, mechanical, electronic, system, or 
other engineer that designs the linear accelerator hardware.  Of course the 
software engineer or anyone else that sees a flaw in the hardware should 
speak up.  But the software engineer is usually not involved in the actual 
design of the hardware being controlled by the computer so should not take 
the blame for any deficiencies in it.   

Accidents are usually the result of multiple failures and design deficiencies.
It does not help to point the finger at someone else and say "the hardware
should have been built to compensate for all possible software errors."
The Therac should have included hardware interlocks; these have now been
added.  But most systems are such that it is not possible to build hardware
to protect against every incorrect software behavior.  Furthermore, hardware 
interlocks can also fail.  For systems that can potentially kill people, it
is necessary that the software be as safe as possible (which means building
special safeguards into it and using special hazard analysis techniques on
it) as well as building in special hardware protection devices when possible.
Unfortunately, what I see continually is the hardware people assuming that
the software will be correct and thus not building in the interlocks, and
the software people assuming that the hardware will provide protection
against software errors and thus not performing the extra software
engineering procedures necessary for safety-critical software (as partially
outlined in my Computing Surveys paper).  BOTH need to be done in order to
reduce the risk of killing people as much as possible.
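
(Purely as an illustration of what a software-side safeguard can look like,
and not something taken from the paper: one simple example is an independent
check of the commanded beam parameters against a fixed safe envelope before
the beam is ever enabled.  A rough Python sketch follows; every field name
and limit in it is hypothetical, and it complements rather than replaces
the hardware interlocks.)

    # Illustrative sketch only: refuse to enable the beam unless the commanded
    # parameters fall inside a fixed safe envelope.  Names and limits are made up.
    from dataclasses import dataclass

    @dataclass
    class BeamCommand:
        mode: str          # e.g. "electron" or "xray"
        energy_mev: float  # commanded beam energy
        dose_cgy: float    # commanded dose for this treatment

    # The envelope is kept separate from the code that computes the command,
    # so a bug in the treatment-planning path cannot silently widen the limits.
    SAFE_LIMITS = {
        "electron": {"max_energy_mev": 25.0, "max_dose_cgy": 500.0},
        "xray":     {"max_energy_mev": 25.0, "max_dose_cgy": 500.0},
    }

    def command_is_safe(cmd: BeamCommand) -> bool:
        """Return False for any out-of-envelope command; the caller must not enable the beam."""
        limits = SAFE_LIMITS.get(cmd.mode)
        if limits is None:
            return False
        return (0.0 < cmd.energy_mev <= limits["max_energy_mev"]
                and 0.0 < cmd.dose_cgy <= limits["max_dose_cgy"])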

Neal Murphy writes:
>A terrible tragedy it is.  But I will not persecute the designers of the
>system.  I imagine they feel pretty rotten anyway...  What is done, is done.
>What can we learn from our mistakes?  We are only human.  We WILL make
>mistakes.  I will never claim that any software I create will be so perfect
>that it won't need to be independently checked.  Try as I might to avoid
>them, I DO make mistakes.  I HAVE hurt people in my lifetime.  And I have
>felt that hurt.

>All we can do is our best.  And when that best is not good enough, we hurt,
>because we realize that we could have done better.  Those of us who care
>stand up and accept responsibility for our actions.  To mis-quote the Good
>Book, "Let him, who has made no mistakes, cast the first stone."  If there
>were more of us working together instead of attacking each other, mayhap we
>would make some progress.  No?

I do not want to persecute anyone or to point blame.  But we do need to learn
from accidents that occur so that they do not happen again for the same reasons
in the future.  In too many instances, people are building software that
is safety-critical who should not be.  I am sure that they are doing their
best, but they are unqualified to do the job.  Would we say that a person
involved in an auto accident who does not know how to drive and does not
possess a driver's license is "only human" and did his best?  It is not a
matter of casting stones; it is a matter of ensuring public safety by
requiring only the best practices on projects that can endanger innocent
people.

Norman Diamond says:
>The suggestion that software engineers do not have this responsibility
>is also hard to understand.  Morally we should behave with a certain
>amount of responsibility too, and ask "what if...".  And we are
>certainly obligated to test our code....

>So why did you say that software engineers
>did not have responsibility in this particular case?

Let me attempt to be very clear.  I did not say that software engineers
do not have responsibility or that they did not have responsibility in this
particular case.  I said that they did not have responsibility for any
deficiencies in the design of the linear accelerator itself.  That is the
responsibility of the people who designed it.  The software engineers have
responsibility for the safety of the software that provides control 
instructions to that linear accelerator.  Everyone has the responsibility
to communicate with each other and with the system safety engineers to
make sure that all parts of the system working together will be as safe 
as possible.

Relying on testing of software or independent review by someone else to
ensure the safety of software is inadequate.  Building safety-critical
software in the same way you would build other software is inadequate.
Special effort and techniques must be employed by both the software engineers 
and system engineers to minimize risk.  Software engineers who work on such 
projects should be highly qualified in basic software engineering and in 
these special software safety techniques.  

Accidents will happen; there are no such things as risk-free systems.  But 
we need to be able to say that the systems we build are as safe as it is 
possible to make them given the current state-of-the-art knowledge.  And we
need to stand up and argue against using computers to control systems when
even this state-of-the-art knowledge is not adequate to provide acceptable
risk, especially when, as is often the case, the primary reason for introducing
computers into these systems is to save money.

nancy
--
Nancy Leveson

ruffwork@ube.CS.ORST.EDU (Ritchey Ruff) (10/13/89)

Nancy Leveson writes:
}[...] But we do need to learn
}from accidents that occur so that they do not happen again for the same reasons
}in the future.  
}[...]
}Accidents will happen; there are no such things as risk-free systems.  But 
}we need to be able to say that the systems we build are as safe as it is 
}possible to make them given the current state-of-the-art knowledge.  And we
}need to stand up and argue against using computers to control systems when
}even this state-of-the-art knowledge is not adequate to provide acceptable
}risk, especially when, as is often the case, the primary reason for introducing
}computers into these systems is to save money.

Two must-read books on engineering and safety in general:
	(1) To Engineer Is Human: The Role of Failure in Successful Design,
	    Henry Petroski, NY: St. Martin's Press, 1985.  ($10.95 USA).

	(2) Normal Accidents, Charles Perrow, Basic Books, 1984. ($11.95 USA).

(1) points out something we all know but don't tout too much: you
only learn limits from failures, and no matter how careful you are
when you do something never done before, it's impossible to know it
will work as you think until you try it out.  Each new piece of
engineering is an experiment! (boy, would Joe Public have fun with this
one ;-)  All you can do is be as careful as humanly possible and
be prepared to learn if it fails.

(2) points out that one major source of failure is non-linearity
in the coupling of different parts of a system (or between systems),
especially as the systems become too complex for a single person to
understand fully.  (By non-linear coupling I mean that the failure of
part a and part b together is MUCH worse than the failure of part a
or part b added together.)

So, what's the point?  I think that one has to try to stick to
"appropriate technology" (32 bit, 1Meg computers to control a simple
car motor is what I'd call "inappropriate").  Mechanical systems are much
better understood and often are much easier to analyze w.r.t. error
and failure modes---if a simple mechanical system does the job, why
use a complex electronic system? (money...)  Also we have to
try to decouple interactions as much as possible (making analysis
of error and failure modes easier) and try to make sure interactions
are as linear as possible.  Finally we must accept that things will
fail and be willing to accept the results of the failure, and LEARN
from that failure.

--ritchey ruff					ruffwork@cs.orst.edu

snidely@inteloa.intel.com (David P. Schneider) (10/18/89)

In article <13097@orstcs.CS.ORST.EDU>,
Ritchey Ruff (ruffwork@CS.ORST.EDU) writes:
>So, what's the point?  I think that one has to try to stick to
>"appropriate technology" (32 bit, 1Meg computers to control a simple
>car motor is what I'd call "inappropriate").  Mechanical systems are much
>better understood and often are much easier to analyze w.r.t. error
>and failure modes---if a simple mechanical system does the job, why
>use a complex electronic system? (money...)  

I'm a bit suspicious of this comment.  First, microcontrollers and
computers are often applied because the *simple* mechanical system is no
longer adequate, and the required *complex* mechanical upgrade is harder
to do than the electronic upgrade.

Car motors are a case in point.  19th century technology was sufficient to
provide mechanical governors and other control techniques that are adequate
for vehicles travelling at low speeds.  25-40 mph is still within the range
where human oversight of low-level details is acceptable.  Engine
requirements have become more demanding since then, most recently with the
quest for pollution reduction while retaining performance.  Mechanical
systems have been discarded because of the difficulty of designing and
building them to meet these requirements.

As an anecdote against mechanical systems, consider this.  My father worked
for a VW/Porsche dealership in the late 60's.  The joke around the shop was
that the mechanics had to be paid so much because it cost them so much
money to keep their Porsches in tune.  The mechanical fuel injection used
on 911s, just at the time that VW Type IIIs (square backs) were introducing
analog electronic fuel injection, supposedly went out of tune every time

Also, there has been discussion in the RISKS forum that mechanical  systems
may  not  be  easier to analyze; they just have better known rules of thumb
("make your best calculation, and multiply by 2").  These  rules  of  thumb
allow  the engineers to work around problems in the analysis.  Software en-
gineers are just developing their rules of thumb, so of course they  aren't
widely followed or tested.

                                                David P. Schneider
                                                     BiiN (tm)
                                                 Wednesday, 10.18