[comp.sys.mac.programmer] Software should be robust

ech@cbnewsk.att.com (ned.horvath) (01/13/91)

From article <1991Jan11.173254.26319@dartvax.dartmouth.edu>, by ari@eleazar.dartmouth.edu (Ari Halberstadt):
> In article <1991Jan11.150446.16850@ux1.cso.uiuc.edu> dorner@pequod.cso.uiuc.edu (Steve Dorner) writes:

>>And what happens when the Segment Loader also wants a resource from the
>>now evaporated resource file?  How will you keep your program from crashing
>>*THEN*?

> Presumably, Apple had the forethought to place a loop into the load
> segment routine, something like:

>     try to load segment
>     while (segment not loaded due to disk error && attempts < max) {
>         show alert
>         attempts++
>         try to load segment again
>     }
>     if (not loaded) {
>         show alert "The program has given up because it can't
>                     get information it needs"
>         ExitToShell()
>     }

> I leave it as an "exercise to the reader" to discover if Apple's programmers
> thought of doing this.

Does the phrase "System Error 15" ring any bells?  The Segment Loader uses
the resource manager just like everybody else, and fatals when it can't get
the resource.  Now, I COULD pre-flight every segment load
(GetResource ('CODE', id) || postAlert() would do), and I suppose the
Segment loader could do the same.  I could, in fact, preflight every call
to every toolbox function that calls GetResource.  Shall I?  Answer below.
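
For concreteness, the pre-flight check amounts to something like the following. This is only a sketch: `ResourceGetter`, `preflight_segment`, `volume_present` and `volume_gone` are hypothetical names invented here, and the getter is a stand-in for the Resource Manager so the fragment compiles off the Mac. It is not the Toolbox GetResource.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the Resource Manager: returns NULL once the
   volume holding the resource file is gone. Not the real Toolbox call. */
typedef void *(*ResourceGetter)(const char *type, int id);

typedef enum { SEG_OK, SEG_UNAVAILABLE } SegStatus;

/* Ask for the CODE resource before calling into the segment, so that a
   missing resource becomes a status code the caller can act on. */
SegStatus preflight_segment(ResourceGetter get, int seg_id)
{
    assert(get != NULL);
    return get("CODE", seg_id) != NULL ? SEG_OK : SEG_UNAVAILABLE;
}

/* Two fake getters for exercising the check. */
static char fake_code[4];

void *volume_present(const char *type, int id)
{
    (void)type; (void)id;
    return fake_code;          /* pretend the resource was found */
}

void *volume_gone(const char *type, int id)
{
    (void)type; (void)id;
    return NULL;               /* pretend the disk has evaporated */
}
```

With `volume_gone` the check reports `SEG_UNAVAILABLE` and the caller can post its alert. The catch is that the same test has to wrap every call that might touch a resource.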

> There is little reason why the program may not simply display an alert
> "Could not proceed since the disk is unreadable (error #-1234)"
> (presumably the CPU is still operational)...

Except that the alert resources are typically on the same inaccessible
volume as the rest.  OK, I could preload my disaster-recovery alert
resources.  Shall I?  Answer below.

> ...Normal processing may
> resume as soon as the user has acknowledged the alert (ie, the program
> may reenter its main event loop after taking any other actions needed
> to recover from the error, such as closing of temporary files)...

Presuming that such recovery can be effected with only resources on hand.
GetResource must be presumed to return null forever more.

> The problem of a user pressing command-period when the system requests
> a disk is similar to a user pressing command-period when printing or
> doing something in a dialog box. It isn't reasonable to crash at those
> times just because a user changed his or her mind. It isn't reasonable
> to crash when the user presses command-period when asked to insert a
> disk.

Unless you simply can't function without that disk.  You've seen
several examples of how that might happen.

> You are correct in that software can not protect against all hardware
> failures. However, software should strive for fault tolerance and
> error recovery. If a specific hardware problem may be recovered from,
> then a program should do so...

Oh, my.  Remind me never to accept a contract job from you.  Here are
some failures that one COULD recover from:

1. Disk failure.  We've seen that many of these can be prohibitively
   expensive to anticipate/recover from.  There are ways, of course:
   duplicate all data in RAM and on multiple volumes, preflight every
   disk reference, read every block twice and compare the results.
   I'm sure there are others.

2. Monitor failure.  If my screen fails, I should be able to move all my
   windows to any surviving screens.  I'll leave this one as an exercise;
   you may find IM-V instructive.

3. Mouse failure.  All functions should be accessible using only a
   keyboard.  This one has to be designed into your program, of course.

4. Keyboard failure.  All typing should be possible with only the mouse.

5. Defective Desk Accessory (or other coresident program).  The program
   should frequently checksum all critical resources (as TMON does,
   for example), restoring from disk where possible, and gracefully
   exiting if unable to recover.  For extra credit, use multiple
   independent checksum/recovery algorithms.

6. Defective user.  If the user forgets a password or other critical
   data, you should be able to recover the data from redundant data.
   This probably requires that you maintain a dossier on your user
   so that you can tolerate such human failings.  Obviously, you need to
   maintain all documentation on line as well, since users won't read
   and frequently lose manuals.
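
The checksum-and-verify step in item 5 might look roughly like this. It is a sketch under assumptions: Fletcher-16 is picked only because it is short (it is not claimed to be what TMON uses), and `GuardedBlock`, `guard_init` and `guard_intact` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fletcher-16 over a block of bytes. Chosen for brevity; an assumption,
   not the algorithm any particular debugger uses. */
uint16_t fletcher16(const uint8_t *data, size_t len)
{
    uint16_t sum1 = 0, sum2 = 0;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        sum1 = (uint16_t)((sum1 + data[i]) % 255);
        sum2 = (uint16_t)((sum2 + sum1) % 255);
    }
    return (uint16_t)((sum2 << 8) | sum1);
}

/* Hypothetical record for one guarded block of memory. */
typedef struct {
    const uint8_t *bytes;
    size_t         len;
    uint16_t       expected;  /* checksum taken while the block was known good */
} GuardedBlock;

/* Record the block's checksum at a moment when it is trusted. */
void guard_init(GuardedBlock *g, const uint8_t *bytes, size_t len)
{
    g->bytes = bytes;
    g->len = len;
    g->expected = fletcher16(bytes, len);
}

/* Returns nonzero if the block still matches its recorded checksum. */
int guard_intact(const GuardedBlock *g)
{
    return fletcher16(g->bytes, g->len) == g->expected;
}
```

Calling `guard_intact` "frequently" is where the cost comes in: every check rereads the whole block, and the restore-from-disk path still depends on the disk being there.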

> Writing software, like designing bridges, requires trade-offs of
> robustness versus feasibility...

Amen.  The short list I've given represents some extreme cases; an
unrecoverable disk is such a case.  The costs of bulletproofing
are measured in programmer time and time-to-market, but they are also
measured in code size, memory requirements, and response time.

There are times when such extremes are justified: the famous AT&T
requirement of less than 2 hours of total system failure per 40 years
is an example, since all emergency services depend on communications.
The price is hardware and software redundancy everywhere, which you
pay for every time you make a call (and pay for every month, whether
you make any calls or not), and still degradation of service and local
outages are possible.  Your absolute requirement -- that every possible
failure mode be anticipated and dealt with properly -- is impossible
in principle.  How can I ever know I'm done?

This does suggest a bit of fun, however.  Instead of writing obfuscated
C, perhaps we should have a Dracula competition: write a program that
won't die short of a wooden stake through the processor chip, and even
then corrupts no other co-resident programs.  Shades of REDCODE...

Oh, and if there are any resources left, the program ought to do
something useful.

There's your challenge, Mr. Halberstadt: I'd love to see your efforts...

=Ned Horvath=

lsr@Apple.com (Larry Rosenstein) (01/16/91)

In article <1991Jan11.173254.26319@dartvax.dartmouth.edu>, ari@eleazar.dartmouth.edu (Ari Halberstadt) writes:
> 
> Presumably, Apple had the forethought to place a loop into the load
> segment routine, something like:

You get a System Error in this case.

> There is little reason why the program may not simply display an alert
> "Could not proceed since the disk is unreadable (error #-1234)"

True.  MacApp uses its own routine to pre-load a code segment and uses its
standard error handling code if the segment can't be loaded.  In theory,
this should abort back to the main event loop and display an alert.  (If
the failure is due to lack of memory, then it should work; if the heap is 
corrupted, then all bets are off.)
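
The general shape of "abort back to the main event loop" can be sketched in portable C with setjmp/longjmp. This is not MacApp's code; `fail`, `load_segment_or_fail` and `run_command` are hypothetical names, and the error value -192 is used only as an illustrative "resource not found" code.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

static jmp_buf recovery_point;   /* hypothetical: set up by the event loop */
static int     recovery_armed;   /* nonzero while recovery_point is valid  */
static int     pending_error;    /* error code carried across the longjmp  */

/* Abort the current operation and unwind to the recovery point. */
void fail(int err)
{
    assert(recovery_armed && err != 0);
    pending_error = err;
    longjmp(recovery_point, 1);
}

/* Hypothetical pre-load step: signals failure instead of crashing. */
void load_segment_or_fail(int segment_available)
{
    if (!segment_available)
        fail(-192);              /* illustrative error code */
}

/* Run one command from the event loop. Returns 0 on success, or the
   error code that aborted the command. */
int run_command(int segment_available)
{
    if (setjmp(recovery_point) != 0) {
        /* Arrived here via fail(): the command was abandoned. */
        recovery_armed = 0;
        return pending_error;    /* caller would display the alert */
    }
    recovery_armed = 1;
    load_segment_or_fail(segment_available);
    /* ... the rest of the command would run here ... */
    recovery_armed = 0;
    return 0;
}
```

A failed load returns the error to the event loop, which can show an alert and carry on with the next event. As noted above, this only helps when the machinery needed to recover is itself intact.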

Since the ROM has no standard failure handling mechanism, it couldn't signal
a failure.  I think the Segment Loader has no other choice but to do
a System Error.

Larry