[net.bugs.usg] Some thoughts on enhancing cpio

sam@delftcc.UUCP (Sam Kendall) (04/02/86)

I've had some thoughts recently about features that cpio(1) needs.  Some
of these apply to tar(1) also.

(1) Optional error recovery.  If the header of just one file in a cpio
    archive is munged, cpio will issue the pitiful message "Out of
    phase--get help" and terminate.  This message is confusing to
    ordinary users, and it then takes a guru to recover the files in the
    archive past the garbled point.  This is a bit ridiculous.  There
    should be some optional error recovery, like the ability to retrieve
    the file following the garbled header (even if its name is unknown),
    and then to recognize the next file header in the garbled archive
    and proceed from there.  This might break down if another cpio
    archive were one of the files in the garbled archive, but no big
    deal.
    
(2) Automatic recognition of -c vs.  non-"-c" formats.  The -c option
    could be ignored with -i (copy in); cpio should recognize which
    format the archive is in.  This is easy to implement.  It
    complicates error recovery, though, in the case that the beginning
    of the file is munged.
    
(3) Fix the bug that -m (restore file modification times) is ineffective
    on directories that are being copied.  This is vital for the next
    feature:
    
(4) Optional save and restore of directory contents, with file
    deletion.  The purpose of this feature is to correctly handle full
    and incremental backups with cpio; specifically, to correctly
    restore a directory in which files have been removed after the full
    backup was made, but before the incremental backup was made.
    
    Currently, when -o (copy out) gets the name of a directory, it
    outputs a header for that directory, but no contents.  My proposal
    is for an option "-D" which would work with both -o and -i.  With
    -o, a list of files in a directory is saved along with the
    directory.  With -i, when a directory is being restored and is
    "replacing" an already existing directory on disk, all files that
    are in the existing directory but NOT in the archived directory are
    REMOVED.
    
    Another way to look at it: with a cpio -i, the action of a file
    replacing an already existing file means, of course, that the
    archived contents replace the contents on disk.  But there is no
    corresponding action for directories.  -D adds such an action.
    N.B.: as with files, the archived directory will replace the
    existing directory only if it is newer or the -u option is given;
    this is why (3) above is necessary.
    
    -D would also work with -p (pass), of course.

    Example: a directory "d" contains files "a" and "b".  A full backup
    (using cpio) is made including "d" and its contents.  The file "b"
    is deleted.  Now an incremental backup of files that have changed
    since the full backup is made using cpio -D.  "d" is on the
    incremental backup, because it has changed since the full backup was
    made.  (It changed when "b" was deleted.)  Now suppose "d" is lost on
    disk, and we try to restore it to disk from backup.  We first
    restore the full backup; "d" contains "a" and "b" again.  We next
    restore the incremental backup.  On the incremental backup, "d"
    contains "a" but not "b".  So "b" is deleted from disk.  The restore
    has worked correctly.  With the current cpio, "b" would still exist,
    incorrectly, after the incremental backup was restored.
    
    This is extremely useful for backup purposes.  It sounds
    complicated, but it fits in beautifully.
    
(5) Preservation of printable ASCII + short lines.  It is too late for
    this, since the format is already frozen, but it would have been
    good.  The idea here is that an archive of mailable files should be
    itself mailable, except perhaps for its size.  A file that is
    mailable has only printable ASCII characters, and has no lines
    longer than some length, maybe 80 characters (I'm not sure).
    
    A cpio -c archive has headers which are about 80 characters plus the
    length of the pathname; this can get too long.  Also, the header
    includes a NUL character or two.  I wish someone had thought about
    this a little bit more before designing the format.  It is so close
    to preserving mailability!
    
    Of course, "shar", and also Martin Minow's (decvax!minow; I think
    it's his) "arch" programs do preserve mailability in almost all
    cases.
    
(6) Should be public domain.  This would avoid the annoying scenario
    where people get cpio archives but cannot unpack them.
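
To make the resynchronization in (1) concrete, here is a minimal sketch in Python, not anything taken from cpio itself.  It assumes the -c header layout of a 6-digit octal magic "070707", seven more 6-digit octal fields, an 11-digit mtime, a 6-digit name size and an 11-digit file size, followed by a NUL-terminated pathname.  The names make_entry and find_next_header are hypothetical, invented for the illustration.

```python
import re

MAGIC = b"070707"
HEADER_LEN = 76  # 6 (magic) + 7 * 6 + 11 (mtime) + 6 (namesize) + 11 (filesize)

# A complete -c header is the magic followed by 70 more octal digits.
HEADER_RE = re.compile(rb"070707[0-7]{70}")


def make_entry(name, data):
    """Build one -c style archive member (demo helper, assumed layout)."""
    name_bytes = name.encode("ascii") + b"\0"
    fields = [0, 1, 0o100644, 0, 0, 1, 0]  # dev, ino, mode, uid, gid, nlink, rdev
    header = MAGIC + b"".join(b"%06o" % f for f in fields)
    header += b"%011o%06o%011o" % (0, len(name_bytes), len(data))
    return header + name_bytes + data


def find_next_header(buf, start=0):
    """Return (offset, name, filesize) of the first plausible header at or
    after start, or None if no plausible header is found."""
    pos = start
    while True:
        match = HEADER_RE.search(buf, pos)
        if match is None:
            return None
        off = match.start()
        namesize = int(buf[off + 59:off + 65], 8)
        filesize = int(buf[off + 65:off + 76], 8)
        name_end = off + HEADER_LEN + namesize
        # Plausibility checks: a non-empty name that fits in the buffer,
        # ends in NUL, and contains no other NUL.
        if namesize > 1 and name_end <= len(buf) and buf[name_end - 1:name_end] == b"\0":
            name = buf[off + HEADER_LEN:name_end - 1]
            if b"\0" not in name:
                return off, name.decode("ascii", "replace"), filesize
        pos = off + 1  # false alarm; keep scanning one byte further on
```

Called on an archive whose first header has been overwritten, find_next_header skips ahead to the second member.  As noted in (1), a cpio archive stored inside the garbled archive would fool a scan like this, since its headers look just as plausible.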
    
I haven't recommended that checksums be introduced into cpio, because I
think this can be handled by some other filter.  (There are some tools
to package software for transmission, available through the AT&T
Toolchest, that probably do what I want here.)  One could argue that
mailability can also be handled by other filters; but I would rather
keep things simple for unpacking mailed archives.
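
The mailability test described in (5) is simple enough to state as code.  This is only a sketch: the 80-character limit is the guess given above, and accepting tabs as printable is an assumption of the example.  The name is_mailable is hypothetical.

```python
def is_mailable(data, max_line=80):
    """True if data holds only printable ASCII (plus tab and newline)
    and no line is longer than max_line characters."""
    for line in data.split(b"\n"):
        if len(line) > max_line:
            return False
        # 32..126 are the printable ASCII codes; 9 is tab (an assumption).
        if any(not (32 <= byte <= 126 or byte == 9) for byte in line):
            return False
    return True
```

A -c header fails this test in both ways mentioned in (5): the NUL after the pathname is not printable, and the header plus a long pathname can exceed the line limit.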

Comments?

----
Sam Kendall			{ ihnp4 | seismo!cmcl2 }!delftcc!sam
Delft Consulting Corp.		ARPA: delftcc!sam@NYU.ARPA

dricej@drilex.UUCP (Craig Jackson) (04/05/86)

Sam @ Delft Consulting Corporation recently proposed several enhancements
to cpio(1).  I think that is a very interesting area of discussion.
I'm not sure where it leads, but it can at least be useful for persons
modifying a system or doing a port.

Sam left out the one change that we have found we needed the most: byte
swapping the headers.  We have gotten cpio tapes from VAXes that we could
not read on our big-endian 68000 and Z8000 machines.  We ended up adding
a -h option to cpio, but ideally it would be done automatically, upon
detecting a swapped magic number.

The various byteswapping options which are present today are of limited 
utility if you can't read the header.  The -c option solves this problem, but
only if the person who made the tape thought to use it.
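
Detecting a swapped magic number automatically could look roughly like the sketch below, written in Python for brevity.  It assumes the binary header begins with the magic number 070707 (octal) stored as a 16-bit integer in the writer's byte order, and that a -c archive begins with the six ASCII characters "070707".  The names classify_magic and needs_swap are hypothetical.

```python
import struct
import sys

MAGIC = 0o070707  # the cpio magic number, 0x71C7


def classify_magic(first_bytes):
    """Guess the archive format from the first bytes of the archive."""
    if first_bytes[:6] == b"070707":
        return "ascii"
    if len(first_bytes) < 2:
        return "unknown"
    (little,) = struct.unpack("<H", first_bytes[:2])
    (big,) = struct.unpack(">H", first_bytes[:2])
    if little == MAGIC:
        return "binary-little-endian"
    if big == MAGIC:
        return "binary-big-endian"
    return "unknown"


def needs_swap(kind, host=sys.byteorder):
    """True if a binary header was written in the opposite byte order
    from the machine reading it."""
    if kind not in ("binary-little-endian", "binary-big-endian"):
        return False
    return not kind.startswith("binary-" + host)
```

With a check like this, a reader on a big-endian machine could notice a VAX-written header and swap it without being told to.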

-- 
Craig
UUCP: {harvard,linus}!axiom!drilex!dricej
BIX:  cjackson

allyn@sdcsvax.UUCP (Allyn Fratkin) (04/06/86)

I don't see why the "-c" option isn't the default in the first place.
What advantages are there in having a binary header over an ASCII header?

ASCII headers are portable (that's the point): no byteswapping, no int/long
size problems, and easier recovery when cpio barfs on a bad block.

I definitely think cpio needs to recover from errors.

-- 
 From the virtual mind of Allyn Fratkin            allyn@sdcsvax.ucsd.edu    or
                          UCSD EMU/Pascal Project  {ucbvax, decvax, ihnp4}
                          U.C. San Diego                         !sdcsvax!allyn

 "Generally you don't see that kind of behavior in a major appliance."

dwd@ttrda.UUCP (Dave Dykstra ) (04/08/86)

[--]

Your idea for incremental backups, i.e. saving the directory contents and
removing the files which have disappeared, is good except for the case
where files are moved without changing their modification time.  Then
the files will not show up on the incremental backup but the file in the
former directory will be removed.
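
For reference, the restore step under discussion can be sketched as follows.  This is a minimal illustration in Python, not a cpio feature: the function name prune_directory and its replace argument are hypothetical, and the sketch only removes plain files, leaving subdirectories alone.

```python
import os


def prune_directory(disk_dir, archived_names, replace=True):
    """Remove files in disk_dir that are absent from the archived listing.

    replace stands for the decision that the archived directory is newer
    than the one on disk, or that replacement was forced.
    Returns the sorted names that were removed.
    """
    removed = []
    if not replace:
        return removed
    keep = set(archived_names)
    for name in sorted(os.listdir(disk_dir)):
        path = os.path.join(disk_dir, name)
        is_real_dir = os.path.isdir(path) and not os.path.islink(path)
        if name not in keep and not is_real_dir:
            os.remove(path)
            removed.append(name)
    return removed
```

With a directory "d" holding "a" and "b" on disk and an archived listing of just "a", the call removes "b".  Nothing in this step brings back a file that was moved elsewhere and is absent from the incremental; that depends entirely on how the list of files to back up was chosen.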

For a while we were losing our filesystems on a regular basis and we
found that incremental backups are basically useless; we do complete
backups every night now.

	-- Dave Dykstra
	   ihnp4!ttrdc!ttrda!dwd

henry@utzoo.UUCP (Henry Spencer) (04/11/86)

> ... good except for the case
> where files are moved without changing their modification time.  Then
> the files will not show up on the incremental backup but the file in the
> former directory will be removed.

As has been known at least since the V7 dump/restor system was written
(1978?), you should dump based on the most recent of the mod time and
the change time.  The change time changes when the inode changes, and
since a rename involves (temporarily) changing the link count, the change
time will reflect it.
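
As a rough sketch of that rule (in Python, and not how dump itself is written), selecting files by the later of the two times might look like this.  The name changed_since is hypothetical, and the sketch assumes a Unix system where st_ctime is the inode change time.

```python
import os


def changed_since(root, last_dump):
    """Yield paths under root whose modification time or inode change
    time is later than last_dump (seconds since the epoch)."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: examine links, not their targets
            if max(st.st_mtime, st.st_ctime) > last_dump:
                yield path
```

A file whose inode was touched after the last dump is picked up here even if its modification time is old.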

> For a while we were losing our filesystems on a regular basis and we
> found that incremental backups are basically useless...

Actually they work fine if done right, i.e. using a proper dump/restor
system rather than cpio, and done standalone so the filesystem is not
changing underfoot.  Yes, we have lost filesystems and restored them
perfectly from full+incremental backups.

It is quite possibly true that incrementals are basically useless on a
System V using cpio for backups.
-- 
Support the International League For The Derision Of User-Friendliness!

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

geoff@desint.UUCP (Geoff Kuenning) (04/17/86)

In article <150@ttrda.UUCP> dwd@ttrda.UUCP (Dave Dykstra ) writes:

> ...except for the case
> where files are moved without changing their modification time.  Then
> the files will not show up on the incremental backup but the file in the
> former directory will be removed.

Ah!  But the *creation time* is changed when you move the file.
(Creation time is a misnomer;  the real meaning of this field is
"last change to inode").

Thus, the following find command will pick up all files for a daily
incremental backup:

	find / -mtime -1 -print -o -ctime -1 -print ...

or, if you have the mods to find that allow switches like -newercm, use

	find / -newercm /.lastinc -print -o -newermm /.lastinc -print

(I think I have the character order right;  it might be -newermc).
-- 

	Geoff Kuenning
	{hplabs,ihnp4}!trwrb!desint!geoff

mdb@laidbak.UUCP (Mark Brukhartz) (05/11/86)

> As has been known at least since the V7 dump/restor system was written
> (1978?), you should dump based on the most recent of the mod time and
> the change time.  The change time changes when the inode changes, and
> since a rename involves (temporarily) changing the link count, the change
> time will reflect it.

The change time (ctime) is always set to the present when the modification
time (mtime) is altered. Hence, it is sufficient to look at only the change
time. Also, the rename problem is deeper than implied; what about all of
the file pathnames which change when a directory is renamed?

> 
> > For a while we were losing our filesystems on a regular basis and we
> > found that incremental backups are basically useless...
> 
> Actually they work fine if done right, i.e. using a proper dump/restor
> system rather than cpio, and done standalone so the filesystem is not
> changing underfoot.
> 
> It is quite possibly true that incrementals are basically useless on a
> System V using cpio for backups.

Incrementals appear safe as long as you compare reality to a list of full
pathnames and inode change times from a full backup. Kept ordered by path,
it's easy to generate a list of deleted, added and changed files. Even the
directory-rename case (mentioned above) is handled reasonably well; every
file in the tree is noted "deleted" with the old name and "added" with the
new name.
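
A minimal sketch of the comparison, in Python: it uses dictionaries keyed by pathname rather than a merge of two path-ordered files, which gives the same three lists.  The name compare_listings and the sample paths in the note below are made up for the illustration.

```python
def compare_listings(old, new):
    """Compare two listings mapping full pathname -> inode change time.

    Returns (deleted, added, changed) as sorted lists of pathnames.
    """
    deleted = sorted(path for path in old if path not in new)
    added = sorted(path for path in new if path not in old)
    changed = sorted(
        path for path in old if path in new and new[path] != old[path]
    )
    return deleted, added, changed
```

If /u/x is renamed to /u/z, every file under it shows up once in the deleted list under the old path and once in the added list under the new path, as described.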

It is reasonable to back up active filesystems with this scheme. Races with
users generally result in doubly backed up files or innocuous "cannot open"
messages. Only an unlikely combination of renamed directories and files with
identical inode change times appears likely to break this algorithm.

We have a Sequent with three full Eagles and a couple of Cipher GCR (6250)
CacheStreamers. Full backups via dump(8) take the better part of a day. We've
been using a homegrown cpio(1) replacement (...to be posted to the net Real
Soon Now) with the aforementioned scheme for six months or so. Unfortunately
(?), we've never restored an entire real-life disk, but I see no reason why
we couldn't do so.

						Mark Brukhartz
						Lachman Associates, Inc.
						..!ihnp4!laidbak!mdb

dwd@ttrda.UUCP (Dave Dykstra ) (05/12/86)

In response to the two articles by Mark Brukhartz:

> ... the rename problem is deeper than implied; what about all of
> the file pathnames which change when a directory is renamed?

> Incrementals appear safe as long as you compare reality to a list of full
> pathnames and inode change times from a full backup. Kept ordered by path,
> it's easy to generate a list of deleted, added and changed files. Even the
> directory-rename case (mentioned above) is handled reasonably well; every
> file in the tree is noted "deleted" with the old name and "added" with the
> new name.

Yes, I overlooked the possibility of renamed directories.  Your scheme
of keeping a sorted list of paths sounds like the best solution.

> Directories are recursive. A change to the root directory would turn the
> next incremental backup into a full one. When "filehog" removes the vnews
> core file from his home directory (:-), all of his files will be backed
> up again. Ugh!

I don't understand that.  Changing a directory doesn't require backing up
all of its subdirectories; that is needed only if its name is changed.

Now, about removing missing files.  Mark writes about this possibility:

>	o Assume that nonexistent files are intended to be removed. This
>	  sounds dangerous... It's too easy for a mortal user to "backup"
>	  and "restore" a mode 777 directory containing inaccessible mode
>	  400 files, inadvertently removing the files. Yes, it's possible
>	  to distinguish "no permission" from "no such file" errors, but
>	  the possibility for bugs and other problems is disturbing.
>
>	o Add a "key" as the first character of each input pathname, with
>	  ">" meaning "copy" and "<" meaning "remove". This would have to
>	  be Yet-Another-Option* to preserve what little cpio invocation
>	  compatibility remains. Consider, too, all of the other keys
>	  which could be added by future featurefesters (...say that five
>	  times quickly!).

I'm assuming that the backup and restore will be done by root, so there
will be no problem with inaccessible files.  Treating missing files as
removed sounds like an easier solution than the "key" idea, for which a
separate program would need to be written to put in the keys.
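
For comparison, reading a keyed name list as described in the quoted proposal might look like the following sketch in Python.  Only the ">" and "<" keys come from the proposal; the function name parse_keyed_names and the choice to reject any other key are assumptions.

```python
def parse_keyed_names(lines):
    """Split keyed pathnames into (to_copy, to_remove) lists.

    Each line starts with a one-character key: ">" means copy the file,
    "<" means remove it.  Blank lines are skipped.
    """
    to_copy, to_remove = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, path = line[0], line[1:]
        if key == ">":
            to_copy.append(path)
        elif key == "<":
            to_remove.append(path)
        else:
            raise ValueError("unknown key %r in %r" % (key, line))
    return to_copy, to_remove
```

The extra work is on the writing side: something has to produce the "<" lines, which is the separate program mentioned above.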

					- Dave Dykstra
					  ihnp4!ttrdc!ttrda!dwd

henry@utzoo.UUCP (Henry Spencer) (05/13/86)

> > ... dump based on the most recent of the mod time and the change time...
> 
> The change time (ctime) is always set to the present when the modification
> time (mtime) is altered. Hence, it is sufficient to look at only the change
> time.

This is true if you are positive that all Unixes and Unixoids and Unixish
systems will get this fine point right.  It costs little to be paranoid
and check both, which is presumably why the V7 dump does it that way.

> Also, the rename problem is deeper than implied; what about all of
> the file pathnames which change when a directory is renamed?

With the old-style dump/restor approach, this is a null issue.  When you
restore the parent directory, which changed when its child was renamed,
it has a link to the child under the new name.  Hence the child is known
by the new name and its files and subdirectories are available via the
new path.  It is sufficient to note that their names have changed; it is
not necessary to dump them again.

> It is reasonable to back up active filesystems with this scheme. Races with
> users generally result in doubly backed up files or innocuous "cannot open"
> messages. Only an unlikely combination of renamed directories and files with
> identical inode change times appears likely to break this algorithm.

For most ordinary Unix purposes, this algorithm should be reliable.
There is, however, a more subtle problem which can arise.  Things like
database systems may well be in an inconsistent state when backed up.
There can be race conditions even if all relevant files appear to change
simultaneously, since the backup algorithm does not process them
simultaneously.  The only way to produce a backup tape which is an exact
snapshot of the system is to require everybody to hold still while you take
the snapshot.  (It is admitted that backups which are not exact snapshots
may still be a useful form of backup.)
-- 
Join STRAW: the Society To	Henry Spencer @ U of Toronto Zoology
Revile Ada Wholeheartedly	{allegra,ihnp4,decvax,pyramid}!utzoo!henry