[comp.software-eng] Software Failure Analysis

shimeall@cs.nps.navy.mil (Tim Shimeall x2509) (09/26/89)

In article <592@halley.UUCP> joannz@halley.UUCP (Joann Zimmerman) writes:
>One other very noticeable difference between other engineering fields and
>computing is in the amount of failure analysis to be found in the field. Did
>anybody reading this EVER take a course in failure analysis of software? In
>fact, where's the literature on this? 

There have been a number of empirical studies on software failures and
on the faults that cause software to fail.  Principally these have
been published as software testing or software fault tolerance
studies.  See (for just a few examples):

P.E. Ammann and J.C. Knight, ``Data Diversity: An Approach to
Software Fault Tolerance'', _IEEE_Transactions_on_Computers_,
April 1988, pp. 418--425.

V.R. Basili and R.W. Selby, ``Comparing the Effectiveness of
Software Testing Strategies'', _IEEE_Transactions_on_Software_Engineering_,
Vol. SE-13, No. 12, December 1987, pp. 1278--1296.

S.S. Brilliant, _Testing_Software_Using_Multiple_Versions_,
Ph.D. Dissertation, University of Virginia, Charlottesville, VA,
September 1987.

W.C. Hetzel, _An_Experimental_Analysis_of_Program_Verification_Methods_,
Ph.D. Dissertation, University of North Carolina at Chapel Hill, 1976.

J.C. Knight and N.G. Leveson, ``Experimental Evaluation of the
Assumption of Independence in Multi-Version Programming,'' 
_IEEE_Transactions_on_Software_Engineering_, January 1986, pp. 96--109.

J.C. Knight and N.G. Leveson, ``An Empirical Study of Failure
Probabilities in Multi-Version Software,'' 
_Sixteenth_International_Symposium_on_Fault-Tolerant_Computing_, 
Vienna, Austria, July 1986, pp. 165--170.

and don't forget :-)
T.J. Shimeall, _An_Experiment_in_Software_Fault_Tolerance_and_
Fault_Elimination_, Ph.D. Dissertation, University of California,
Irvine, 1989.

There have also been a number of works in the testing literature on
the theory of "fault-based testing" that deal with the issue of how
software fails.  See, for example:

D.J. Richardson and M.C. Thompson, ``The RELAY Model of
Error Detection and its Application'', _Proceedings_of_the_Second_
Workshop_on_Software_Testing,_Verification_and_Analysis_, Banff, Alberta,
July 1988, pp. 223--230.

This paper also has references to some of the other fault-based
testing work.
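
To make "fault-based" concrete, here is a minimal sketch in the
mutation-testing style (my own illustration, not the RELAY model
itself; the routine and test data are hypothetical): plant a small
fault in a program and ask whether the test set can distinguish the
faulty version from the original.

def median(a, b, c):
    """Program under test."""
    return sorted([a, b, c])[1]

def median_mutant(a, b, c):
    """Mutant: the same program with one planted fault."""
    return sorted([a, b, c])[0]     # fault: wrong element selected

tests = [(1, 2, 3), (3, 1, 2), (5, 5, 5)]

# A test "kills" the mutant if the two versions disagree on it.
# Mutants that survive the whole test set expose a blind spot.
killed = any(median(*t) != median_mutant(*t) for t in tests)
print("mutant killed:", killed)

A test set that leaves mutants alive is, on this theory, blind to at
least one way the software could fail.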

All in all, a fair amount has been written about software failure
analysis, in major research journals and conferences.
It just hasn't been called by that term.

johnny@lanai.cs.ucla.edu (Jia Hong Chen) (09/29/89)

For ANOTHER view of Knight and Leveson's experiment, one recent paper
which might be interesting to you is "Failure Masking: a Source of
Failure Dependency in Multi-Version Programs", P. G. Bishop and F. D.
Pullen (Central Electricity Research Laboratories, Leatherhead, UK).
It appeared in the proceedings of the first International Working
Conference on Dependable Computing for Critical Applications at Santa
Barbara, August 23-25, 1989.

Quoted from the abstract: " ....  Error masking behavior can be
predicted from the specification (prior to implementation), and simple
modifications to the program design can minimize the error masking
effect and hence the observed dependency."

In the question and answer session after Bishop's presentation,
Prof. Algirdas Avizienis made some comments about the paper.  I don't
remember the exact words, but basically he said that "error masking"
is another way of saying that if you reduce the resolution of
something and then make comparisons on the reduced-resolution
variable(s), you end up losing information.  During the conference, I
had a chance to visit my friend's speech lab on campus (he is an EE
graduate student at UCSB).  He demonstrated some of their speech
coding research (based on Linear Predictive Coding) at different bit
rates.  Intuitively, a higher bit rate gives higher-quality sound,
but with some algorithms you can improve the quality at the same bit
rate.

People who are familiar with the concept of "quantization errors"
should have no difficulty understanding the mumblings in the above
paragraph.  Also, I seem to remember running into similar ideas when
I studied the D-algorithm back when I was an undergraduate at National
Taiwan University.
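
For what it's worth, here is a toy illustration of that point (my own
sketch, with made-up versions and numbers, not anything from the
Bishop and Pullen paper): quantizing two versions' outputs before
comparing them can hide a real discrepancy, just as reducing
resolution loses information.

def version_a(x):
    return 0.10 * x            # treat this as the correct version

def version_b(x):
    return 0.10 * x + 0.004    # faulty version: small systematic error

def quantize(v, step=0.01):
    """Reduce resolution before the versions are compared."""
    return round(v / step) * step

x = 1.0
print("full resolution, versions agree:",
      version_a(x) == version_b(x))                       # False
print("after quantizing, versions agree:",
      quantize(version_a(x)) == quantize(version_b(x)))   # True

Comparing at full resolution, or restructuring the design so the
comparison happens before the resolution is reduced, removes the
masking -- which sounds like the kind of "simple modifications" the
abstract alludes to.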

Check out the paper and make up your own mind.  Don't trust everything
published.
Jia-Hong Chen

johnny@cs.ucla.edu                    ...!ucbvax!cs.ucla.edu!johnny

shimeall@cs.nps.navy.mil (Tim Shimeall x2509) (09/30/89)

I have no desire to start a fresh round of the N-Version programming
flame wars here.  (Suffice it to say that there are at least two views
on the quality of the Knight-Leveson work and on the quality of the
works by Bishop and Avizienis -- readers are encouraged to exercise
their own judgement.)

I would like to clarify the point of raising
the Knight-Leveson experiment.  The value of that work in terms of
failure analysis is two-fold:
 a) the characterization of the faults that were detected in the
    programs involved -- in particular their demonstration that
    groups working independently may introduce similar faults.
    (Their result is arguably stronger than this, but I agree with
     the point that readers should consult the paper and decide for
     themselves.)
 b) the examination of the run-time effect of the faults in increasing
    program failure probability (see the sketch below).
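
A small Monte Carlo sketch of point (b), with made-up failure rates
(mine, not the experiment's data): under the independence assumption
the probability that two versions fail on the same input is the
product of their individual failure probabilities, but when faults
are correlated -- the versions tending to fail on the same "hard"
inputs -- coincident failures become several times more common.

import random

random.seed(1)
TRIALS = 1000000
P_FAIL = 0.01             # per-version failure probability

# Independence model: each version fails on its own random inputs.
both_indep = sum(
    (random.random() < P_FAIL) and (random.random() < P_FAIL)
    for _ in range(TRIALS))

# Correlated model: 20% of the input space is "hard"; both versions
# fail far more often there.  The overall rate is still about 0.01
# per version (0.2 * 0.045 + 0.8 * 0.00125 = 0.01).
both_corr = 0
for _ in range(TRIALS):
    p = 0.045 if random.random() < 0.20 else 0.00125
    both_corr += (random.random() < p) and (random.random() < p)

print("P(both fail), independent:", both_indep / TRIALS)  # about 1e-4
print("P(both fail), correlated: ", both_corr / TRIALS)   # about 4e-4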

The fact that programs occasionally mask faults (i.e. a fault does not
ALWAYS produce a failure when it is executed) is no surprise to those
in the software testing community, who have been dealing with
"coincidentally correct" results for some time now.
					Tim

nancy@ics.uci.edu (Nancy Leveson) (09/30/89)

Besides the papers that Tim Shimeall mentions on failure analysis, there is
also a set of papers on "software safety" that describe how to apply to
software some of the same types of failure analysis done in engineering.
If you are interested in this, one place to start is:
   Leveson, N.B. "Software Safety: Why, What and How,"  ACM Computing Surveys,
   Vol. 18, No. 2, June 1986.
This contains a lot of references.
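
As a taste of the kind of engineering-style analysis meant here, this
sketch (my own illustration; the events and probabilities are
invented) evaluates a miniature fault tree, one of the classical
hazard-analysis techniques that the software safety literature
carries over to software.

def AND(*ps):
    """All input events must occur (assumes independent events)."""
    prob = 1.0
    for p in ps:
        prob *= p
    return prob

def OR(*ps):
    """At least one input event occurs (assumes independent events)."""
    none = 1.0
    for p in ps:
        none *= 1.0 - p
    return 1.0 - none

# Top event: "unsafe command issued" =
#   sensor reads wrong AND (validation check absent OR check faulty)
p_sensor_wrong = 1e-3
p_check_absent = 1e-2
p_check_faulty = 5e-3

p_top = AND(p_sensor_wrong, OR(p_check_absent, p_check_faulty))
print("P(top event) =", p_top)   # about 1.5e-5

In a software fault tree the leaf events are conditions in the code
rather than component failures, but the gate logic works the same way.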

With respect to the letter by Jia Hong Chen about the experiment that
John Knight and I did: I have seen only an earlier paper by Peter
Bishop in which he attempted to explain our results as a consequence
of "failure masking."  Unfortunately, this does not explain the
results, but Peter was hampered by not having the detailed data from
our experiment that one would need in order to know this.

For those interested, there will be a paper appearing this spring in IEEE
Transactions on Software Engineering that provides a detailed explanation 
of the faults that led to statistically dependent failures in our original
experiment, along with a model that attempts to explain this phenomenon.

There has also been a replication of our experiment, and it got the
same results we did: analysis of the failure behavior of the programs
developed for an n-version programming experiment in which UCLA was
one of the four participating universities showed virtually identical
results to those of the Knight and Leveson experiment.

--
Nancy Leveson

dph@crystal.lanl (David Huelsbeck) (10/02/89)

I seem to have missed the original bibliography.
Would some kind soul please mail me a copy?

Thanks,
dph@lanl.gov