sam@delftcc.UUCP (Sam Kendall) (04/02/86)
I've had some thoughts recently about features that cpio(1) needs. Some of these apply to tar(1) also.

(1) Optional error recovery. If the header of just one file in a cpio archive is munged, cpio will issue the pitiful message "Out of phase--get help" and terminate. This message is confusing to ordinary users, and it then takes a guru to recover the files in the archive past the garbled point. This is a bit ridiculous. There should be some optional error recovery, like the ability to retrieve the file following the garbled header (even if its name is unknown), and then to recognize the next file header in the garbled archive and proceed from there. This might break down if another cpio archive were one of the files in the garbled archive, but no big deal.

(2) Automatic recognition of -c vs. non-"-c" formats. The -c option could be ignored with -i (copy in); cpio should recognize which format the archive is in. This is easy to implement. It complicates error recovery, though, in the case that the beginning of the file is munged.

(3) Fix the bug that -m (restore file modification times) is ineffective on directories that are being copied. This is vital for the next feature:

(4) Optional save and restore of directory contents, with file deletion. The purpose of this feature is to correctly handle full and incremental backups with cpio; specifically, to correctly restore a directory in which files have been removed after the full backup was made, but before the incremental backup was made.

Currently, when -o (copy out) gets the name of a directory, it outputs a header for that directory, but no contents. My proposal is for an option "-D" which would work with both -o and -i. With -o, a list of files in a directory is saved along with the directory. With -i, when a directory is being restored and is "replacing" an already existing directory on disk, all files that are in the existing directory but NOT in the archived directory are REMOVED.

Another way to look at it: with a cpio -i, the action of a file replacing an already existing file means, of course, that the archived contents replace the contents on disk. But there is no corresponding action for directories. -D adds such an action. N.B.: as with files, the archived directory will replace the existing directory only if it is newer or the -u option is given; this is why (3) above is necessary. -D would also work with -p (pass), of course.

Example: a directory "d" contains files "a" and "b". A full backup (using cpio) is made including "d" and its contents. The file "b" is deleted. Now an incremental backup of files that have changed since the full backup is made using cpio -D. "d" is on the incremental backup, because it has changed since the full backup was made. (It changed when "b" was deleted.) Now suppose "d" is lost on disk, and we try to restore it to disk from backup. We first restore the full backup; "d" contains "a" and "b" again. We next restore the incremental backup. On the incremental backup, "d" contains "a" but not "b". So "b" is deleted from disk. The restore has worked correctly. With the current cpio, "b" would still exist, incorrectly, after the incremental backup was restored.

This is extremely useful for backup purposes. It sounds complicated, but it fits in beautifully.

(5) Preservation of printable ASCII + short lines. It is too late for this, since the format is already frozen, but it would have been good. The idea here is that an archive of mailable files should be itself mailable, except perhaps for its size. A file that is mailable has only printable ASCII characters, and has no lines longer than some length, maybe 80 characters (I'm not sure). A cpio -c archive has headers which are about 80 characters plus the length of the pathname; this can get too long. Also, the header includes a NUL character or two. I wish someone had thought about this a little bit more before designing the format. It is so close to preserving mailability! Of course, "shar", and also Martin Minow's (decvax!minow; I think it's his) "arch" programs do preserve mailability in almost all cases.

(6) Should be public domain. This would avoid the annoying scenario where people get cpio archives but cannot unpack them.

I haven't recommended that checksums be introduced into cpio, because I think this can be handled by some other filter. (There are some tools to package software for transmission, available through the AT&T Toolchest, that probably do what I want here.) One could argue that mailability can also be handled by other filters; but I would rather keep things simple for unpacking mailed archives.

Comments?

----
Sam Kendall                     { ihnp4 | seismo!cmcl2 }!delftcc!sam
Delft Consulting Corp.          ARPA: delftcc!sam@NYU.ARPA
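
The restore step of the "-D" proposal in item (4) of Sam Kendall's post can be sketched in a few lines of Python. This is only an illustration of the idea, not cpio code: the function name prune_to_listing is hypothetical, the archived directory listing is assumed to be available as a plain list of names, and the sketch leaves subdirectories alone and ignores the "only if newer or -u" check.

```python
import os
import tempfile


def prune_to_listing(directory, archived_names):
    """Remove entries of `directory` that are absent from `archived_names`.

    `archived_names` stands in for the list of files that -o would have
    saved along with the directory. Real subdirectories are skipped in
    this sketch. Returns the sorted names that were removed.
    """
    keep = set(archived_names)
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name in keep:
            continue
        if os.path.isdir(path) and not os.path.islink(path):
            continue  # pruning subtrees is out of scope here
        os.remove(path)
        removed.append(name)
    return removed


def demo():
    """Replay the example from the post: "d" holds "a" and "b" after the
    full restore, and the incremental listing for "d" holds only "a"."""
    with tempfile.TemporaryDirectory() as d:
        for name in ("a", "b"):
            open(os.path.join(d, name), "w").close()
        removed = prune_to_listing(d, ["a"])
        return removed, sorted(os.listdir(d))


if __name__ == "__main__":
    print(demo())
```

Keeping the pruning separate from the file extraction mirrors the way the post describes it: replacing a directory is its own action, distinct from replacing the files inside it.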
dricej@drilex.UUCP (Craig Jackson) (04/05/86)
Sam @ Delft Consulting Corporation recently proposed several enhancements to cpio(1). I think that is a very interesting area of discussion. I'm not sure where it leads, but it can at least be useful for persons modifying a system or doing a port.

Sam left out the one change that we have found that we needed the most: byte swapping the headers. We have gotten cpio tapes from VAXes that we could not read on our big-endian 68000 and Z8000 machines. We ended up adding a -h option to cpio, but ideally it would be done automatically, upon detecting a swapped magic number. The various byteswapping options which are present today are of limited utility if you can't read the header. The -c option solves this problem, but only if the person who made the tape thought to use it.

--
Craig
UUCP: {harvard,linus}!axiom!drilex!dricej
BIX:  cjackson
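
The automatic detection Craig Jackson asks for, together with item (2) of Sam Kendall's post, can be sketched as follows. The sketch assumes that a binary header starts with the 16-bit magic number 070707 (octal) in the writer's byte order, and that a -c header starts with the same six digits as ASCII text. The function name classify_header and the returned labels are hypothetical.

```python
import struct

BINARY_MAGIC = 0o070707    # 16-bit magic number of a binary header
ASCII_MAGIC = b"070707"    # the same magic written as six octal digits (-c)


def classify_header(first_bytes):
    """Guess the header format from the first bytes of an archive.

    Returns "ascii", "binary-little-endian", "binary-big-endian",
    or "unknown".
    """
    if first_bytes[:6] == ASCII_MAGIC:
        return "ascii"
    if len(first_bytes) < 2:
        return "unknown"
    (as_little,) = struct.unpack("<H", first_bytes[:2])
    (as_big,) = struct.unpack(">H", first_bytes[:2])
    if as_little == BINARY_MAGIC:
        return "binary-little-endian"   # e.g. written on a VAX
    if as_big == BINARY_MAGIC:
        return "binary-big-endian"      # e.g. written on a 68000
    return "unknown"


if __name__ == "__main__":
    print(classify_header(b"\xc7\x71" + bytes(24)))
    print(classify_header(b"\x71\xc7" + bytes(24)))
    print(classify_header(b"070707000001"))
```

A reader would compare the detected byte order with its own and swap the remaining header fields when they differ, so no option would be needed on the command line.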
allyn@sdcsvax.UUCP (Allyn Fratkin) (04/06/86)
I don't see why the "-c" option isn't the default in the first place. What advantages are there in having a binary header over an ASCII header? ASCII headers are portable (that's the point), no byteswapping, no int/long size problems, and are easier to recover when cpio barfs on a bad block.

I definitely think cpio needs to recover from errors.

--
From the virtual mind of Allyn Fratkin            allyn@sdcsvax.ucsd.edu    or
                         UCSD EMU/Pascal Project  {ucbvax, decvax, ihnp4}
                         U.C. San Diego                         !sdcsvax!allyn

 "Generally you don't see that kind of behavior in a major appliance."
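
Allyn Fratkin's point that ASCII headers are easier to recover after a bad block, and the resynchronization Sam Kendall asks for in item (1), can be illustrated with a short Python sketch. It assumes the field layout later documented for the portable ASCII cpio format (eleven octal fields, 76 characters in total, followed by the pathname); the System V -c format discussed in this thread is assumed to match. The names FIELDS, parse_ascii_header, format_ascii_header and resync are hypothetical.

```python
# (name, width) of each octal field, in order; assumed layout, see above.
FIELDS = [
    ("magic", 6), ("dev", 6), ("ino", 6), ("mode", 6), ("uid", 6),
    ("gid", 6), ("nlink", 6), ("rdev", 6), ("mtime", 11),
    ("namesize", 6), ("filesize", 11),
]
HEADER_LEN = sum(width for _, width in FIELDS)   # 76 characters
ASCII_MAGIC = b"070707"


def parse_ascii_header(block):
    """Decode one ASCII header; raise ValueError if it is not plausible."""
    if len(block) < HEADER_LEN or block[:6] != ASCII_MAGIC:
        raise ValueError("not an ASCII cpio header")
    fields, pos = {}, 0
    for name, width in FIELDS:
        # int() rejects non-octal text, so a garbled field raises ValueError.
        fields[name] = int(block[pos:pos + width].decode("ascii"), 8)
        pos += width
    return fields


def format_ascii_header(fields):
    """Inverse of parse_ascii_header, used here only to build test data."""
    return "".join("%0*o" % (w, fields[n]) for n, w in FIELDS).encode("ascii")


def resync(data, start=0):
    """Scan forward for the next offset holding a header that parses.

    Returns (offset, fields) or None. A magic string that appears inside
    garbage is skipped when the fields after it are not valid octal.
    """
    pos = data.find(ASCII_MAGIC, start)
    while pos != -1:
        try:
            return pos, parse_ascii_header(data[pos:pos + HEADER_LEN])
        except ValueError:
            pos = data.find(ASCII_MAGIC, pos + 1)
    return None


def demo():
    fields = {"magic": 0o070707, "dev": 1, "ino": 42, "mode": 0o100644,
              "uid": 0, "gid": 0, "nlink": 1, "rdev": 0,
              "mtime": 512000000, "namesize": 2, "filesize": 0}
    header = format_ascii_header(fields)
    junk = b"070707zz\x00\xff"          # garbage containing a false magic
    pos, parsed = resync(junk + header + b"a\x00")
    return len(header), pos == len(junk), parsed == fields


if __name__ == "__main__":
    print(demo())
```

As the first post notes, a scan like this can be fooled when a file inside the archive is itself a cpio archive; validating every field only narrows the odds of a false match.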
dwd@ttrda.UUCP (Dave Dykstra ) (04/08/86)
[--]

Your idea for incremental backups, i.e. saving the directory contents and removing the files which have disappeared, is good except for the case where files are moved without changing their modification time. Then the files will not show up on the incremental backup but the file in the former directory will be removed.

For a while we were losing our filesystems on a regular basis and we found that incremental backups are basically useless; we do complete backups every night now.

--
Dave Dykstra
ihnp4!ttrdc!ttrda!dwd
henry@utzoo.UUCP (Henry Spencer) (04/11/86)
> ... good except for the case
> where files are moved without changing their modification time.  Then
> the files will not show up on the incremental backup but the file in the
> former directory will be removed.

As has been known at least since the V7 dump/restor system was written (1978?), you should dump based on the most recent of the mod time and the change time. The change time changes when the inode changes, and since a rename involves (temporarily) changing the link count, the change time will reflect it.

> For a while we were losing our filesystems on a regular basis and we
> found that incremental backups are basically useless...

Actually they work fine if done right, i.e. using a proper dump/restor system rather than cpio, and done standalone so the filesystem is not changing underfoot. Yes, we have lost filesystems and restored them perfectly from full+incremental backups.

It is quite possibly true that incrementals are basically useless on a System V using cpio for backups.

--
Support the International League For The Derision Of User-Friendliness!
                                Henry Spencer @ U of Toronto Zoology
                                {allegra,ihnp4,decvax,pyramid}!utzoo!henry
geoff@desint.UUCP (Geoff Kuenning) (04/17/86)
In article <150@ttrda.UUCP> dwd@ttrda.UUCP (Dave Dykstra ) writes:

> ...except for the case
> where files are moved without changing their modification time.  Then
> the files will not show up on the incremental backup but the file in the
> former directory will be removed.

Ah! But the *creation time* is changed when you move the file. (Creation time is a misnomer; the real meaning of this field is "last change to inode"). Thus, the following find command will pick up all files for a daily incremental backup:

    find / -mtime -1 -print -o -ctime -1 -print

... or, if you have the mods to find that allow switches like -newercm, use

    find / -newercm /.lastinc -print -newermm /.lastinc -print

(I think I have the character order right; it might be -newermc).

--
Geoff Kuenning
{hplabs,ihnp4}!trwrb!desint!geoff
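
The selection rule that Henry Spencer and Geoff Kuenning describe, taking a file when either its modification time or its inode change time is recent, can be sketched in Python. The function name changed_since is hypothetical, and the sketch assumes a Unix system, where st_ctime is the inode change time.

```python
import os
import tempfile
import time


def changed_since(root, cutoff):
    """Yield paths under `root` whose mtime or ctime is later than `cutoff`.

    `cutoff` is a time in seconds since the epoch, e.g. the time of the
    previous backup. lstat() is used so symbolic links are not followed.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if max(st.st_mtime, st.st_ctime) > cutoff:
                yield path


def demo():
    """A file whose mtime is set far in the past is still selected,
    because setting the times changed the inode just now."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "f")
        open(path, "w").close()
        os.utime(path, (0, 0))          # old mtime, fresh ctime
        now = time.time()
        recent = [os.path.basename(p) for p in changed_since(root, now - 3600)]
        future = list(changed_since(root, now + 3600))
        return recent, future


if __name__ == "__main__":
    print(demo())
```

The demo mirrors the moved-file case raised earlier in the thread: the modification time alone says nothing has happened, while the change time gives the file away.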
mdb@laidbak.UUCP (Mark Brukhartz) (05/11/86)
> As has been known at least since the V7 dump/restor system was written
> (1978?), you should dump based on the most recent of the mod time and
> the change time.  The change time changes when the inode changes, and
> since a rename involves (temporarily) changing the link count, the change
> time will reflect it.

The change time (ctime) is always set to the present when the modification time (mtime) is altered. Hence, it is sufficient to look at only the change time. Also, the rename problem is deeper than inferred; what about all of the file pathnames which change when a directory is renamed?

>
> > For a while we were losing our filesystems on a regular basis and we
> > found that incremental backups are basically useless...
>
> Actually they work fine if done right, i.e. using a proper dump/restor
> system rather than cpio, and done standalone so the filesystem is not
> changing underfoot.
>
> It is quite possibly true that incrementals are basically useless on a
> System V using cpio for backups.

Incrementals appear safe as long as you compare reality to a list of full pathnames and inode change times from a full backup. Kept ordered by path, it's easy to generate a list of deleted, added and changed files. Even the directory-rename case (mentioned above) is handled reasonably well; every file in the tree is noted "deleted" with the old name and "added" with the new name.

It is reasonable to back up active filesystems with this scheme. Races with users generally result in doubly backed up files or innocuous "cannot open" messages. Only an unlikely combination of renamed directories and files with identical inode change times appear likely to break this algorithm.

We have a Sequent with three full Eagles and a couple of Cipher GCR (6250) CacheStreamers. Full backups via dump(8) take the better part of a day. We've been using a homegrown cpio(1) replacement (...to be posted to the net Real Soon Now) with the aforementioned scheme for six months or so. Unfortunately (?), we've never restored an entire real-life disk, but I see no reason why we couldn't do so.

Mark Brukhartz
Lachman Associates, Inc.
..!ihnp4!laidbak!mdb
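
The comparison Mark Brukhartz describes, a list of full pathnames and inode change times kept ordered by path, lends itself to a single merge pass. The Python below is only a sketch of that idea under stated assumptions: each list holds (pathname, ctime) pairs already sorted by pathname, and the name diff_sorted is hypothetical. It is not the homegrown cpio replacement mentioned in the post.

```python
def diff_sorted(old, new):
    """Compare two path-ordered lists of (pathname, ctime) pairs.

    `old` comes from the previous backup, `new` from the filesystem now.
    Returns three lists of pathnames: (deleted, added, changed).
    """
    deleted, added, changed = [], [], []
    i = j = 0
    while i < len(old) and j < len(new):
        old_path, old_ctime = old[i]
        new_path, new_ctime = new[j]
        if old_path == new_path:
            if old_ctime != new_ctime:
                changed.append(old_path)
            i += 1
            j += 1
        elif old_path < new_path:
            deleted.append(old_path)    # only in the old list
            i += 1
        else:
            added.append(new_path)      # only in the new list
            j += 1
    deleted.extend(path for path, _ in old[i:])
    added.extend(path for path, _ in new[j:])
    return deleted, added, changed


def demo():
    """Directory /u/x is renamed to /u/y; /u/z is modified in place."""
    old = [("/u/x/a", 100), ("/u/x/b", 100), ("/u/z", 50)]
    new = [("/u/y/a", 100), ("/u/y/b", 100), ("/u/z", 60)]
    return diff_sorted(old, new)


if __name__ == "__main__":
    print(demo())
```

In the demo the renamed directory shows up exactly as the post says: every file under the old name is reported as deleted and every file under the new name as added, even though no change time differs.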
dwd@ttrda.UUCP (Dave Dykstra ) (05/12/86)
In response to the two articles by Mark Brukhartz:

> ... the rename problem is deeper than inferred; what about all of
> the file pathnames which change when a directory is renamed?

> Incrementals appear safe as long as you compare reality to a list of full
> pathnames and inode change times from a full backup. Kept ordered by path,
> it's easy to generate a list of deleted, added and changed files. Even the
> directory-rename case (mentioned above) is handled reasonably well; every
> file in the tree is noted "deleted" with the old name and "added" with the
> new name.

Yes, I overlooked the possibility of renamed directories. Your scheme of keeping a sorted list of paths sounds like the best solution.

> Directories are recursive. A change to the root directory would turn the
> next incremental backup into a full one. When "filehog" removes the vnews
> core file from his home directory (:-), all of his files will be backed
> up again. Ugh!

I don't understand that. Changing a directory doesn't require a backup of all its subdirectories; that is needed only if the name is changed.

Now, about removing missing files. Mark writes about this possibility:

>   o Assume that nonexistent files are intended to be removed. This
>     sounds dangerous... It's too easy for a mortal user to "backup"
>     and "restore" a mode 777 directory containing inaccessible mode
>     400 files, inadvertently removing the files. Yes, it's possible
>     to distinguish "no permission" from "no such file" errors, but
>     the possibility for bugs and other problems is disturbing.
>
>   o Add a "key" as the first character of each input pathname, with
>     ">" meaning "copy" and "<" meaning "remove". This would have to
>     be Yet-Another-Option* to preserve what little cpio invocation
>     compatibility remains. Consider, too, of all of the other keys
>     which could be added by future featurefesters (...say that five
>     times quickly!).

I'm assuming that the backup and restore will be done by root so there will be no problem with inaccessible files. It sounds like an easier solution than the "key" idea; a separate program would need to be written to put in the keys.

- Dave Dykstra
  ihnp4!ttrdc!ttrda!dwd
henry@utzoo.UUCP (Henry Spencer) (05/13/86)
> > ... dump based on the most recent of the mod time and the change time...
>
> The change time (ctime) is always set to the present when the modification
> time (mtime) is altered. Hence, it is sufficient to look at only the change
> time.

This is true if you are positive that all Unixes and Unixoids and Unixish systems will get this fine point right. It costs little to be paranoid and check both, which is presumably why the V7 dump does it that way.

> Also, the rename problem is deeper than inferred; what about all of
> the file pathnames which change when a directory is renamed?

With the old-style dump/restor approach, this is a null issue. When you restore the parent directory, which changed when its child was renamed, it has a link to the child under the new name. Hence the child is known by the new name and its files and subdirectories are available via the new path. It is sufficient to note that their names have changed; it is not necessary to dump them again.

> It is reasonable to back up active filesystems with this scheme. Races with
> users generally result in doubly backed up files or innocuous "cannot open"
> messages. Only an unlikely combination of renamed directories and files with
> identical inode change times appear likely to break this algorithm.

For most ordinary Unix purposes, this algorithm should be reliable. There is, however, a more subtle problem which can arise. Things like database systems may well be in an inconsistent state when backed up. There can be race conditions even if all relevant files appear to change simultaneously, since the backup algorithm does not process them simultaneously. The only way to produce a backup tape which is an exact snapshot of the system is to require everybody to hold still while you take the snapshot. (It is admitted that backups which are not exact snapshots may still be a useful form of backup.)

--
Join STRAW: the Society To      Henry Spencer @ U of Toronto Zoology
Revile Ada Wholeheartedly       {allegra,ihnp4,decvax,pyramid}!utzoo!henry