egisin@june.cs.washington.edu@watmath.UUCP (Eric Gisin) (12/16/86)
Newsgroups: mod.computers.vax

I spent a day copying some save sets on TK50 tape to save sets on disk. I normally use `mount mua: volume /block=8192' then just copy them. There was a parity error in the third save set, and this caused the ACP to get confused when it came to that error (even if I was trying to access the fourth save set). I managed to copy the fourth save set by mounting /foreign, and I copied the third with something like:

    $ backup MUA: [...]
    $ backup [...] saveset/saveset

where backup reported a recoverable error when reading the tape.

My questions are:
- why can't the ACP recover from a data error, and
- what does backup do to recover from errors that copy, the ACP or RMS, or the device driver don't do?
LEICHTER-JERRY@YALE.ARPA (12/17/86)
    I spent a day copying some save sets on TK50 tape to save sets on disk. I normally use `mount mua: volume /block=8192' then just copy them. There was a parity error in the third save set, and this caused the ACP to get confused when it came to that error (even if I was trying to access the fourth save set). I managed to copy the fourth save set by [using BACKUP]. My questions are:
    - why can't the ACP recover from a data error, and
    - what does backup do to recover from errors that copy, the ACP or RMS, or the device driver don't do?

A parity error means the data on the tape is unreadable. The various levels of the system - device driver through RMS and the COPY program - can retry the read, but if the data is gone, it's gone.

The data is also gone when BACKUP tries to read it, but fortunately BACKUP is designed to deal with unreliable media. There are two levels of error correction that are available to BACKUP - and only BACKUP, since they are based on the specific information that BACKUP puts on the tape.

First, during writing, the tape controller does a "read after write" and compares what is on the tape with what was written. If they differ, an error occurs. This is usually the result of a physical bad spot on the tape. The bad spot may just be due to some dirt that can fall off. Most programs will, at best, re-try the write, hoping that the error will go away. BACKUP just ignores the error where it occurs, but writes another copy of the block that got the error later on. The blocks are numbered in sequence, and when reading a tape, BACKUP watches block numbers. When it gets a bad block, it can wait for the second copy to come along. (It isn't necessarily the next item on the tape because of buffering while writing.) Because it takes a rather complex program to deal with bad blocks and out-of-order buffers, this feature is one of the things /INTERCHANGE disables.
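The "wait for the second copy" behaviour on the reading side can be illustrated with a small Python sketch. This is only a toy model, not BACKUP's actual on-tape format: the record layout `(block_number, payload, read_ok)`, the function name `read_save_set`, and the assumption that blocks are numbered from 1 are all invented for the illustration.

```python
# Hypothetical model: each tape record is (block_number, payload, read_ok).
# A record that failed to read carries no usable number or payload.
# This is NOT BACKUP's real record format, only an illustration.

def read_save_set(records):
    """Return payloads in block-number order.

    Unreadable records are skipped; a later good copy of the same block
    fills the gap. Extra good copies of a block are ignored.
    """
    good = {}  # block number -> payload
    for number, payload, read_ok in records:
        if not read_ok:
            continue  # bad spot: hope a rewritten copy shows up later
        good.setdefault(number, payload)
    if not good:
        return []
    # Assumption: blocks are numbered 1..N with no intentional gaps.
    missing = [n for n in range(1, max(good) + 1) if n not in good]
    if missing:
        raise IOError(f"no readable copy of blocks {missing}")
    return [good[n] for n in sorted(good)]


tape = [
    (1, b"AAAA", True),
    (None, b"", False),    # block 2, written over a bad spot
    (3, b"CCCC", True),    # already buffered, so it precedes the rewrite
    (2, b"BBBB", True),    # second copy of block 2
    (4, b"DDDD", True),
]
print(read_save_set(tape))
```

The point of the sketch is that the reader needs block numbers and some buffering to put things back in order, which is the complexity the text attributes to this feature.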
Note that using COPY to duplicate a tape with a saveset on it will fail if either input or output tape has bad spots, but a BACKUP restore of the files followed by creation of a new tape will usually work.

Beyond this mechanism, BACKUP also writes additional information. There is a parameter called the group size, which is normally 10. After a group of blocks has been written, the XOR of those blocks is written as the next block. If any one block within the group has gotten clobbered, it can be recovered from the rest of the group and the XOR block (by just XOR'ing all of them together, since A XOR B XOR A = B). Where the previous mechanism allows for the WRITER to compensate for bad tape areas found at the time of writing, the XOR block allows the READER to compensate for bad blocks that developed while the tape was in storage.

Obviously, the XOR blocks will increase the size of the save set (by about 10% with the default group size). Computing the XOR's also slows down writing, though not by much. In situations where you are not concerned about the robustness of the saveset - when you are using it as an on-disk archiving mechanism, for example - you can increase the group size or turn off XOR blocks completely with the /GROUP_SIZE qualifier; /GROUP:0 means "no XOR blocks".

There is yet one more thing BACKUP does to provide robustness: Above I said that the XOR blocks could be used to reconstruct a known-bad block. But how does the reading BACKUP know the block is bad? Sometimes the tape drive will report a problem reading the block, since tape hardware does write checksums. These checksums are of varying quality, depending on the tape speed. (The slower tape speeds, developed at a time when hardware cost more, use less powerful checksums. A 6250 bpi tape drive does pretty good error checking, more than compensating for the inherently larger number of errors on a more densely-written tape.)
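To make the XOR-group recovery described earlier concrete, here is a minimal Python sketch. It assumes equal-length blocks and that the reader already knows which single block in the group is bad; the function names `xor_blocks` and `recover` are made up for the example and say nothing about how BACKUP itself is implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(group, xor_block, bad_index):
    """Rebuild group[bad_index] from the surviving blocks plus the XOR block.

    Works because every surviving block appears twice in the overall XOR
    (once here, once inside xor_block) and cancels out, leaving only the
    missing block: A XOR B XOR A = B.
    """
    survivors = [blk for i, blk in enumerate(group) if i != bad_index]
    return xor_blocks(survivors + [xor_block])


# Ten data blocks (the default group size) followed by one XOR block.
group = [bytes([i] * 8) for i in range(1, 11)]
xor_block = xor_blocks(group)

# Pretend block 4 (index 3) became unreadable in storage.
damaged = list(group)
damaged[3] = None
rebuilt = recover(damaged, xor_block, 3)
print(rebuilt == group[3])
```

One extra block per ten data blocks is where the "about 10%" size overhead comes from. Note that a single XOR block can only rebuild one lost block per group; if two blocks in the same group are bad, this scheme alone cannot recover either.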
BACKUP adds its own CRC checksum to each block, so it can provide yet another level of checking. The chances of a bad block being passed by both the tape hardware and the BACKUP CRC are extremely small.

As with all things, there's a tradeoff - CRC's take time to compute. On smaller VAXes, it can take just about the whole CPU to compute CRC's fast enough to keep up with the tape drive. The /[NO]CRC qualifier lets you turn CRC computation off if you feel you don't need the extra robustness.

BTW, there are some 3rd-party backup tools that like to advertise how much faster than BACKUP they are. In general, they are about as fast as a $ BACKUP/GROUP:0/NOCRC - and about as reliable. Can you guess why? If you are considering using one of these to do your backups, pin the vendor down on the product's ability to deal with errors.

-- Jerry
-------