[net.unix-wizards] Some thoughts on enhancing cpio

sam@delftcc.UUCP (Sam Kendall) (04/02/86)

I've had some thoughts recently about features that cpio(1) needs.  Some
of these apply to tar(1) also.

(1) Optional error recovery.  If the header of just one file in a cpio
    archive is munged, cpio will issue the pitiful message "Out of
    phase--get help" and terminate.  This message is confusing to
    ordinary users, and it then takes a guru to recover the files in the
    archive past the garbled point.  This is a bit ridiculous.  There
    should be some optional error recovery, like the ability to retrieve
    the file following the garbled header (even if its name is unknown),
    and then to recognize the next file header in the garbled archive
    and proceed from there.  This might break down if another cpio
    archive were one of the files in the garbled archive, but no big
    deal.
    
(2) Automatic recognition of -c vs.  non-"-c" formats.  The -c option
    could be ignored with -i (copy in); cpio should recognize which
    format the archive is in.  This is easy to implement.  It
    complicates error recovery, though, if the beginning of the
    archive is munged.
    
(3) Fix the bug that -m (restore file modification times) is ineffective
    on directories that are being copied.  This is vital for the next
    feature:
    
(4) Optional save and restore of directory contents, with file
    deletion.  The purpose of this feature is to correctly handle full
    and incremental backups with cpio; specifically, to correctly
    restore a directory in which files have been removed after the full
    backup was made, but before the incremental backup was made.
    
    Currently, when -o (copy out) gets the name of a directory, it
    outputs a header for that directory, but no contents.  My proposal
    is for an option "-D" which would work with both -o and -i.  With
    -o, a list of files in a directory is saved along with the
    directory.  With -i, when a directory is being restored and is
    "replacing" an already existing directory on disk, all files that
    are in the existing directory but NOT in the archived directory are
    REMOVED.
    
    Another way to look at it: with a cpio -i, the action of a file
    replacing an already existing file means, of course, that the
    archived contents replace the contents on disk.  But there is no
    corresponding action for directories.  -D adds such an action.
    N.B.: as with files, the archived directory will replace the
    existing directory only if it is newer or the -u option is given;
    this is why (3) above is necessary.
    
    -D would also work with -p (pass), of course.

    Example: a directory "d" contains files "a" and "b".  A full backup
    (using cpio) is made including "d" and its contents.  The file "b"
    is deleted.  Now an incremental backup of files that have changed
    since the full backup is made using cpio -D.  "d" is on the
    incremental backup, because it has changed since the full backup was
    made.  (It changed when "b" was deleted.)  Now suppose "d" is lost on
    disk, and we try to restore it to disk from backup.  We first
    restore the full backup; "d" contains "a" and "b" again.  We next
    restore the incremental backup.  On the incremental backup, "d"
    contains "a" but not "b".  So "b" is deleted from disk.  The restore
    has worked correctly.  With the current cpio, "b" would still exist,
    incorrectly, after the incremental backup was restored.
    
    This is extremely useful for backup purposes.  It sounds
    complicated, but it fits in beautifully.
    
(5) Preservation of printable ASCII + short lines.  It is too late for
    this, since the format is already frozen, but it would have been
    good.  The idea here is that an archive of mailable files should be
    itself mailable, except perhaps for its size.  A file that is
    mailable has only printable ASCII characters, and has no lines
    longer than some length, maybe 80 characters (I'm not sure).
    
    A cpio -c archive has headers which are about 80 characters plus the
    length of the pathname; this can get too long.  Also, the header
    includes a NUL character or two.  I wish someone had thought about
    this a little bit more before designing the format.  It is so close
    to preserving mailability!
    
    Of course, "shar" and Martin Minow's "arch" programs
    (decvax!minow; I think it's his) do preserve mailability in almost
    all cases.
    
(6) Should be public domain.  This would avoid the annoying scenario
    where people get cpio archives but cannot unpack them.
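
The restore-side logic of the "-D" proposal in (4) can be sketched in a
few lines.  This is only an illustration in Python, not cpio code; the
function names stale_entries and should_replace are hypothetical, and
the archived directory listing is assumed to arrive as a plain list of
names.

```python
def should_replace(archived_mtime, disk_mtime, unconditional=False):
    """Decide whether the archived directory replaces the one on disk.

    Mirrors the rule for files: replace only if the archived copy is
    newer, or if -u (unconditional) was given.
    """
    return unconditional or archived_mtime > disk_mtime


def stale_entries(archived_names, disk_names):
    """Return names that exist on disk but not in the archived listing.

    These are the entries that -D would remove on restore.  "." and ".."
    are never candidates for removal.
    """
    keep = set(archived_names)
    return sorted(
        name for name in disk_names
        if name not in keep and name not in (".", "..")
    )


# The example from (4): the incremental backup lists only "a" in "d",
# while the disk (after restoring the full backup) holds "a" and "b".
to_remove = stale_entries(["a"], ["a", "b"])   # ["b"]
```

The decision depends on the directory's modification time, which is why
the -m fix in (3) matters here.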
    
I haven't recommended that checksums be introduced into cpio, because I
think this can be handled by some other filter.  (There are some tools
to package software for transmission, available through the AT&T
Toolchest, that probably do what I want here.)  One could argue that
mailability can also be handled by other filters; but I would rather
keep things simple for unpacking mailed archives.
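
For concreteness, the mailability test described in (5) might look like
the sketch below.  It is Python for illustration only; the name
is_mailable is hypothetical, the 80-character limit is the guess given
above, and treating TAB as acceptable is an added assumption.

```python
def is_mailable(data, max_line=80):
    """Return True if data is printable ASCII in lines of bounded length.

    data is a bytes object.  Newline separates lines; TAB is allowed as
    an assumption, every other control character (including NUL) and
    every byte above 0x7E is rejected.
    """
    for line in data.split(b"\n"):
        if len(line) > max_line:
            return False
        for byte in line:
            if not (0x20 <= byte <= 0x7E or byte == 0x09):
                return False
    return True


# A -c style header followed by a NUL-terminated pathname fails the
# check on two counts: the NUL and the line length.
fake_header = b"0" * 76 + b"some/path\0"
```

A filter built on such a check could flag archives that will not survive
mailing before they are sent.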

Comments?

----
Sam Kendall			{ ihnp4 | seismo!cmcl2 }!delftcc!sam
Delft Consulting Corp.		ARPA: delftcc!sam@NYU.ARPA

dricej@drilex.UUCP (Craig Jackson) (04/05/86)

Sam @ Delft Consulting Corporation recently proposed several enhancements
to cpio(1).  I think that is a very interesting area of discussion.
I'm not sure where it leads, but it can at least be useful for persons
modifying a system or doing a port.

Sam left out the one change that we have found we needed most: byte
swapping the headers.  We have gotten cpio tapes from VAXes that we could
not read on our big-endian 68000 and Z8000 machines.  We ended up adding
a -h option to cpio, but ideally it would be done automatically, upon
detecting a swapped magic number.

The various byteswapping options which are present today are of limited 
utility if you can't read the header.  The -c option solves this problem, but
only if the person who made the tape thought to use it.
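
Automatic detection of a swapped magic number, together with recognizing
a -c archive, could be sketched as follows.  This is Python for
illustration, not the -h change we made; the name sniff_format is
hypothetical, and it assumes the usual cpio magic of octal 070707, held
as a 16-bit value in binary headers and as six ASCII digits in -c
headers.

```python
import struct
import sys

MAGIC = 0o070707  # cpio magic number (0x71C7 as a 16-bit value)


def sniff_format(first_bytes):
    """Classify an archive from its first bytes.

    Returns ("ascii", None) for a -c archive, ("binary", "little") or
    ("binary", "big") for a binary header in that byte order, or None
    if no magic number is recognized.
    """
    if first_bytes[:6] == b"070707":
        return ("ascii", None)
    if len(first_bytes) < 2:
        return None
    (little,) = struct.unpack("<H", first_bytes[:2])
    (big,) = struct.unpack(">H", first_bytes[:2])
    if little == MAGIC:
        return ("binary", "little")
    if big == MAGIC:
        return ("binary", "big")
    return None


def needs_swap(fmt):
    """True if a binary header was written in the other byte order."""
    return fmt is not None and fmt[0] == "binary" and fmt[1] != sys.byteorder
```

A tape written on a VAX would show up as ("binary", "little"), and a
big-endian reader would then swap each 16-bit header field before use.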

-- 
Craig
UUCP: {harvard,linus}!axiom!drilex!dricej
BIX:  cjackson

allyn@sdcsvax.UUCP (Allyn Fratkin) (04/06/86)

I don't see why the "-c" option isn't the default in the first place.
What advantages are there in having a binary header over an ASCII header?

ASCII headers are portable (that's the point): they need no byteswapping,
have no int/long size problems, and are easier to recover from when cpio
barfs on a bad block.

I definitely think cpio needs to recover from errors.
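
To show why ASCII headers help, here is a rough sketch of resynchronizing
after a bad block.  It is Python for illustration only; the names
find_next_header and make_entry are hypothetical, and it assumes the -c
header layout as I understand it: the magic "070707" plus 70 more octal
digits (76 bytes in all), followed by a NUL-terminated pathname.

```python
import re

HEADER_LEN = 76
# Magic, then dev, ino, mode, uid, gid, nlink, rdev (6 digits each),
# mtime (11), namesize (6), filesize (11): all octal ASCII digits.
HEADER_RE = re.compile(rb"070707[0-7]{70}")


def make_entry(name, body):
    """Build one -c style entry with zeroed metadata (demo helper only)."""
    path = name.encode("ascii") + b"\0"
    sizes = b"%011o%06o%011o" % (0, len(path), len(body))
    return b"070707" + b"000000" * 7 + sizes + path + body


def find_next_header(data, start=0):
    """Return the offset of the next plausible header at or after start.

    A candidate must match the digit pattern and be followed by a
    pathname of the stated length ending in NUL.  Returns -1 if none.
    """
    pos = start
    while True:
        match = HEADER_RE.search(data, pos)
        if match is None:
            return -1
        off = match.start()
        namesize = int(data[off + 59:off + 65], 8)
        name_end = off + HEADER_LEN + namesize
        if namesize > 0 and name_end <= len(data) and data[name_end - 1] == 0:
            return off
        pos = off + 1


first = make_entry("a", b"hello\n")
second = make_entry("b", b"world\n")
damaged = b"XXXX" + first[4:] + second  # first header's magic is munged
resume_at = find_next_header(damaged)   # offset where "b" begins
```

As Sam noted, a scan like this can be fooled by a cpio archive stored
inside the damaged archive.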

-- 
 From the virtual mind of Allyn Fratkin            allyn@sdcsvax.ucsd.edu    or
                          UCSD EMU/Pascal Project  {ucbvax, decvax, ihnp4}
                          U.C. San Diego                         !sdcsvax!allyn

 "Generally you don't see that kind of behavior in a major appliance."

ed@mtxinu.UUCP (Ed Gould) (04/07/86)

In article <109@drilex.UUCP> dricej@drilex.UUCP (Craig Jackson) writes:
>
>Sam @ Delft Consulting Corporation recently proposed several enhancements
>to cpio(1).  I think that is a very interesting area of discussion.
>I'm not sure where it leads, but it can at least be useful for persons
>modifying a system or doing a port.

One of the places it leads is backwards in time, or perhaps just
sideways to another stream.  The dump and restor programs have
always had the facility to remove files that disappeared between
the full dump and a later incremental.

Does anyone *know* why USG decided to drop dump/restor and, for
file-transfer functions, tar in favor of cpio?  This decision
was made fairly early.  PWB 1.0 had dump/restor, I don't know
about 2.0.  They were gone in 3.0, which was released external
to the (then) Bell System as "System III".

-- 
Ed Gould                    mt Xinu, 2910 Seventh St., Berkeley, CA  94710  USA
{ucbvax,decvax}!mtxinu!ed   +1 415 644 0146

"A man of quality is not threatened by a woman of equality."

ka@hropus.UUCP (Kenneth Almquist) (04/12/86)

> Does anyone *know* why USG decided to drop dump/restor and, for
> file-transfer functions, tar in favor of cpio?  This decision
> was made fairly early.  PWB 1.0 had dump/restor, I don't know
> about 2.0.  They were gone in 3.0, which was released external
> to the (then) Bell System as "System III".

I don't know, but I can guess.  They probably dropped dump/restore
because no one wanted them.  Volcopy was faster.  I expect that
dump and restore would have been retained if anybody had had a use
for them.

They didn't drop tar in favor of cpio; they dropped tp.  There were
a number of deficiencies with tp, including:
1)  It could read archives from a disk file rather than tape, but
    it could not write them.
2)  It could handle only a limited number of files specified by name
    because the names had to be passed as arguments (exec used to
    limit argument lists to 512 bytes).
3)  It didn't understand about multiple links to a file.
So USG released cpio and announced that they would drop tp eventually.
I doubt that tar existed at this point.  (If it did, USG might have
reasonably rejected it on the grounds that it solved the first problem
with tp, but not the latter two.)  Some time later USG picked up tar,
and a while after that they dropped tp, as promised.  Tar was not
dropped; it is still in System V today.  It was not widely used because
there was no good reason for users to stop using cpio.
				Kenneth Almquist
				ihnp4!houxm!hropus!ka	(official name)
				ihnp4!opus!ka		(shorter path)