[mod.computers.vax] error recovery on MUA:

egisin@june.cs.washington.edu@watmath.UUCP (Eric Gisin) (12/16/86)

Newsgroups: mod.computers.vax

I spent a day copying some save sets on TK50 tape to save sets on disk.
I normally use `mount mua: volume /block=8192' then just copy them. 
There was a parity error in the third save set,
and this caused the ACP to get confused when it came to that error
(even if I was trying to access the fourth save set).
I managed to copy the fourth save set by mounting /foreign,
and I copied the third with something like:
$ backup MUA: [...]
$ backup [...] saveset/saveset
where backup reported a recoverable error when reading the tape.

My questions are:
- why can't the ACP recover from a data error,
and
- what does backup do to recover from errors that
copy, the ACP or RMS, or the device driver don't do?

LEICHTER-JERRY@YALE.ARPA (12/17/86)

    I spent a day copying some save sets on TK50 tape to save sets on disk.  I
    normally use `mount mua: volume /block=8192' then just copy them.  There
    was a parity error in the third save set, and this caused the ACP to get
    confused when it came to that error (even if I was trying to access the
    fourth save set).  I managed to copy the fourth save set by [using
    BACKUP].

    My questions are:
    - why can't the ACP recover from a data error,
    and
    - what does backup do to recover from errors that
    copy, the ACP or RMS, or the device driver don't do?

A parity error means the data on the tape is unreadable.  The various levels
of the system - device driver through RMS and the COPY program - can retry
the read, but if the data is gone, it's gone.

The data is also gone when BACKUP tries to read it, but fortunately BACKUP is
designed to deal with unreliable media.  There are two levels of error
correction that are available to BACKUP - and only BACKUP, since they are
based on the specific information that BACKUP puts on the tape.

First, during writing, the tape controller does a "read after write" and
compares what is on the tape with what was written.  If they differ, an error
occurs.  This is usually the result of a physical bad spot on the tape.  The
bad spot may just be due to some dirt that can fall off.  Most programs will,
at best, re-try the write, hoping that the error will go away.  BACKUP just
ignores the error where it occurs, but writes another copy of the block that
got the error later on.  The blocks are numbered in sequence, and when reading
a tape, BACKUP watches block numbers.  When it gets a bad block, it can wait
for the second copy to come along.  (It isn't necessarily the next item on the
tape because of buffering while writing.)
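
To make the reader's side of this concrete, here is a minimal sketch in
Python.  It is not BACKUP's actual on-tape format: the record layout (a
sequence number paired with a payload, with None standing in for a block the
drive could not read) and the name read_saveset are assumptions made purely
for illustration.

```python
def read_saveset(tape_records):
    """Collect blocks by sequence number, tolerating unreadable copies.

    tape_records is a hypothetical stream in tape order: each item is
    either None (the drive reported a read error) or (seq, payload).
    """
    recovered = {}
    for record in tape_records:
        if record is None:
            continue  # bad copy: keep going, a rewritten copy may follow
        seq, payload = record
        recovered.setdefault(seq, payload)  # first good copy wins
    if not recovered:
        return [], []
    # Any sequence number never seen in good condition is still missing.
    missing = [n for n in range(1, max(recovered) + 1) if n not in recovered]
    ordered = [recovered[n] for n in sorted(recovered)]
    return ordered, missing


tape = [
    (1, b"alpha"),
    None,            # first copy of block 2 landed on a bad spot
    (3, b"gamma"),
    (2, b"beta"),    # second copy of block 2, written later
    (4, b"delta"),
]
blocks, missing = read_saveset(tape)
```

A program like COPY has no sequence numbers to watch, so it can do nothing
with the unreadable record except report the error.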

Because it takes a rather complex program to deal with bad blocks and out-of-
order buffers, this feature is one of the things /INTERCHANGE disables.

Note that using COPY to duplicate a tape with a saveset on it will fail if
either input or output tape has bad spots, but a BACKUP restore of the files
followed by creation of a new tape will usually work.

Beyond this mechanism, BACKUP also writes additional information.  There is
a parameter called the group size, which is normally 10.  After a group of
blocks has been written, the XOR of those blocks is written as the next
block.  If any one block within the group has gotten clobbered, it can be
recovered from the rest of the group and the XOR block (by just XOR'ing all
of them together, since A XOR B XOR A = B).  Where the previous mechanism
allows for the WRITER to compensate for bad tape areas found at the time
of writing, the XOR block allows the READER to compensate for bad blocks that
developed while the tape was in storage.
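
The XOR recovery can be sketched in a few lines of Python.  The block size,
block contents and the helper name xor_blocks are made up for the example;
only the idea (one XOR block per group, any single lost block rebuilt from
the survivors) comes from the description above.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Ten 8-byte data blocks standing in for one group (group size 10).
group = [bytes([i] * 8) for i in range(1, 11)]
parity = xor_blocks(group)          # written after the group

# Pretend block 3 (zero-based) was clobbered while the tape sat in storage.
lost_index = 3
survivors = group[:lost_index] + group[lost_index + 1:]

# XOR of the nine survivors and the XOR block yields the lost block.
rebuilt = xor_blocks(survivors + [parity])
```

Note that this only works for one lost block per group: with two blocks
missing, the XOR block gives the XOR of the two, not either one separately.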

Obviously, the XOR blocks will increase the size of the save set (by about 10%
with the default group size).  Computing the XOR's also slows down writing,
though not by much.  In situations where you are not concerned about the
robustness of the saveset - when you are using it as an on-disk archiving
mechanism, for example - you can increase the group size or turn off XOR
blocks completely with the /GROUP_SIZE qualifier; /GROUP:0 means "no XOR
blocks".
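
The 10% figure is just one extra block per group of ten.  As a rough sketch
of the arithmetic (ignoring block headers and any other per-save-set
overhead, and using a made-up function name):

```python
def xor_overhead(group_size):
    """Extra blocks written per data block for a given group size.

    A group size of 0 is taken to mean "no XOR blocks", as with /GROUP:0.
    """
    if group_size == 0:
        return 0.0
    return 1.0 / group_size


default_overhead = xor_overhead(10)   # one XOR block per ten data blocks
larger_groups = xor_overhead(50)      # less overhead, less protection
no_xor = xor_overhead(0)
```

A larger group size costs less space, but a single XOR block then has to
cover more data blocks, so two bad blocks in one group become more likely.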

There is yet one more thing BACKUP does to provide robustness:  Above I said
that the XOR blocks could be used to reconstruct a known-bad block.  But how
does the reading BACKUP know the block is bad?  Sometimes the tape drive
will report a problem reading the block, since tape hardware does write
checksums.  These checksums are of varying quality, depending on the tape speed.
(The slower tape speeds, developed at a time when hardware cost more, use
less powerful checksums.  A 6250 bpi tape drive does pretty good error
checking, more than compensating for the inherently larger number of errors on
a more densely-written tape.)  BACKUP adds its own CRC checksum to each block,
so it can provide yet another level of checking.  The chance of a bad block
being passed by both the tape hardware and the BACKUP CRC is extremely small.
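
As an illustration of a per-block CRC, here is a small Python sketch.  It
uses the CRC-32 from Python's zlib module and a 4-byte trailer; both are
assumptions chosen for convenience and are not claimed to match the
polynomial or block layout that BACKUP really uses.

```python
import zlib


def seal(payload):
    """Append a 4-byte CRC-32 of the payload (hypothetical block layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(block):
    """Return True if the trailing CRC matches the payload."""
    payload, stored = block[:-4], int.from_bytes(block[-4:], "big")
    return zlib.crc32(payload) == stored


good = seal(b"save set data")
# Flip one bit in the first byte, as a bad spot on the tape might.
damaged = bytes([good[0] ^ 0x01]) + good[1:]

good_ok = check(good)
damaged_ok = check(damaged)
```

A block that fails this kind of check can then be treated as a known-bad
block and handed to the XOR recovery described earlier.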

As with all things, there's a tradeoff - CRC's take time to compute.  On
smaller VAXes, it can take just about the whole CPU to compute CRC's fast
enough to keep up with the tape drive.  The /[NO]CRC qualifier lets you
turn CRC computation off if you feel you don't need the extra robustness.

BTW, there are some 3rd-party backup tools that like to advertise how much
faster than BACKUP they are.  In general, they are about as fast as a

		$ BACKUP/GROUP:0/NOCRC

- and about as reliable.  Can you guess why?  If you are considering using one
of these to do your backups, pin the vendor down on the product's ability to
deal with errors.
							-- Jerry
-------