[comp.std.unix] tar or cpio?

std-unix@ut-sally.UUCP (Moderator, John Quarterman) (05/09/87)

Section 10 of the POSIX Trial Use Standard (and of the current draft)
describes a data interchange format based on the tar program.  The P1003.1
Working Group has recently received two related proposals regarding that
section:  one to add cpio format (including old-style, non c option format);
the other to replace the tar format with cpio format.  It was also proposed
in the latest Working Group meeting to drop section 10 altogether and let
P1003.2 handle the issue.

As the moderator of this newsgroup, I solicit comments about what should
be done with section 10.

As a Working Group member, I will take such comments into account when
I submit a proposal in a few weeks.  If there is sufficient interest,
I will post an outline of that proposal in the newsgroup (as myself,
rather than as the moderator).  I can also post the already-submitted
proposals.

Volume-Number: Volume 11

guy@sun.com (Guy Harris) (05/10/87)

From guy@sun.com Sun May 10 02:30:14 1987
From: guy@sun.com (Guy Harris)

> As the moderator of this newsgroup, I solicit comments about what should
> be done with section 10.

One thing that should not be done, under any circumstances, is to
replace "tar" with "cpio" - *especially* if it includes the old
non-"-c" form.  The non-portable form is completely useless for
moving data between systems with different byte orders unless you
have a clever "cpio" that figures out that the byte order is
backwards and undoes the damage.

I discovered this when trying to read a "cpio" tape made on a VAX in
the old format; no combination of "cpio" byte-swapping options and
"dd conv=swab" would help.  I finally ended up fixing our "cpio" to
do the aforementioned look-at-the-header-and-undo-the-damage stuff.
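
A minimal sketch of the look-at-the-header idea, not the actual fix made to Sun's "cpio".  It assumes the old binary header is thirteen 16-bit fields opening with the magic number 070707, with 32-bit values stored as two shorts (most significant first), as later cpio(5) manual pages document it.  The field names and the sample values are illustrative only.

```python
import struct

CPIO_MAGIC = 0o070707   # magic number that opens an old-format binary header
FIELD_NAMES = ("magic", "dev", "ino", "mode", "uid", "gid", "nlink", "rdev",
               "mtime_hi", "mtime_lo", "namesize", "size_hi", "size_lo")

def detect_byte_order(header: bytes) -> str:
    """Return the struct prefix ('<' or '>') under which the magic reads correctly."""
    for order in ("<", ">"):
        (magic,) = struct.unpack(order + "H", header[:2])
        if magic == CPIO_MAGIC:
            return order
    raise ValueError("not an old-format binary cpio header")

def parse_header(header: bytes) -> dict:
    """Decode the 13 sixteen-bit header fields, whichever machine wrote them."""
    order = detect_byte_order(header)
    raw = dict(zip(FIELD_NAMES, struct.unpack(order + "13H", header[:26])))
    # 32-bit values are stored as two shorts, most significant half first.
    raw["mtime"] = (raw.pop("mtime_hi") << 16) | raw.pop("mtime_lo")
    raw["filesize"] = (raw.pop("size_hi") << 16) | raw.pop("size_lo")
    return raw

# Hypothetical sample header, packed once per byte order.
sample = (CPIO_MAGIC, 1, 42, 0o100644, 0, 0, 1, 0, 0x1234, 0x5678, 4, 0, 5)
from_vax = parse_header(struct.pack("<13H", *sample))
from_big_endian = parse_header(struct.pack(">13H", *sample))
```

Both calls decode to the same field values, so only the header needs attention; file names and file data are left untouched.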

The X/OPEN standard uses "cpio".  The rationale given exhibits a
distressing degree of incompetence:

	If an exchange medium is to be read on a target machine that
	is architecturally different from the source machine,
	problems may arise concerning the ordering of bytes within a
	word and words within a long word (see the portability guides
	in Part III).  These can easily be handled when using "cpio"
	as an exchange utility, while with "tar" it may be a little
	more difficult.

Now, I will first note here that the *only* time I had a problem
moving "tar" tapes between machines was when I had to move things to
a Plexus.  The problem was *not* that the machines had different byte
orders; the problem was that the Plexus had a typical brain-damaged
Multibus tape controller that swapped bytes when it transferred data
to and from memory.

"cpio" would not have made this any easier; the System III
byte-swapping option did not swap the bytes on *all* blocks read, but
just swapped the bytes on data blocks and in file names.  The intent
here was clearly that you would read a tape written on a machine with
a different byte order by doing something like

	dd if=/dev/rmt0 conv=swab | cpio -ids

"dd" would swap everything; "cpio -s" would un-swap everything but
the binary data in the header.  (We pause to note that merely
swapping the binary data in the header would be much more efficient,
especially given that "dd" is somewhat of a pig.)  This works, but is
less than wonderful.  (And it doesn't solve the problem with the
Plexus; to solve that you just stick the "dd" in front of "cpio" and
don't bother with "-s" at all.)
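
The two-step pipeline above can be simulated on a few bytes.  This is only a sketch: the "header" is reduced to the magic number, and the file name is a made-up example.  The swab() helper mimics what "dd conv=swab" does to each pair of bytes.

```python
import struct

def swab(data: bytes) -> bytes:
    """Swap each adjacent pair of bytes, which is what `dd conv=swab` does."""
    return b"".join(data[i:i + 2][::-1] for i in range(0, len(data), 2))

MAGIC = 0o070707
HEADER_SIZE = 2          # only the magic, to keep the sketch short

# Hypothetical archive fragment as written on a little-endian machine.
written = struct.pack("<H", MAGIC) + b"README\0\0"

# Step 1: dd swaps everything, binary header and text alike.
after_dd = swab(written)
(magic_seen,) = struct.unpack(">H", after_dd[:HEADER_SIZE])
name_after_dd = after_dd[HEADER_SIZE:]

# Step 2: the System III "-s" behaviour described above un-swaps the rest.
name_restored = swab(name_after_dd)
```

After step 1 the magic reads correctly on a big-endian machine but the name has become "ERDAEM"; step 2 puts the name back, at the cost of touching every byte twice.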

The System V "cpio" byte-swapping and word-swapping options work
*only* on data blocks; they have no effect whatsoever on binary data
in the header or on file names.  This means that the trick that
worked with the System III "cpio" wouldn't work at all - and the
problem with the Plexus still isn't fixed, if that was the intent.
The S5 options are useless for old-style non-"-c" tapes.  They are of
some use with "-c" tapes - but only if all the files on the tape
consist solely of "short"s or "long"s, since the data in the data
blocks are all byte-swapped or word-swapped in the same fashion.
Most files I tend to put on or extract from "cpio" tapes are text
files, which obviously need no swapping.
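
To illustrate why a uniform swap of the data blocks only helps for files made up entirely of "short"s or entirely of "long"s, here is a small sketch.  The record layout and values are invented for the example, and the two helpers only approximate the byte-swapping and word-swapping options.

```python
import struct

def swab(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit unit."""
    return b"".join(data[i:i + 2][::-1] for i in range(0, len(data), 2))

def swap_halfwords(data: bytes) -> bytes:
    """Exchange the two 16-bit halves of every 32-bit word."""
    return b"".join(data[i + 2:i + 4] + data[i:i + 2]
                    for i in range(0, len(data), 4))

# Data as written on a little-endian machine.
shorts = struct.pack("<3h", 1, 2, 3)     # file of nothing but shorts
longs = struct.pack("<2l", 1, 2)         # file of nothing but longs
record = struct.pack("<hl", 1, 2)        # one short followed by one long
text = b"text"                           # ordinary text

# Read back on a big-endian machine after swapping.
shorts_fixed = struct.unpack(">3h", swab(shorts))
longs_fixed = struct.unpack(">2l", swap_halfwords(swab(longs)))
record_swabbed = struct.unpack(">hl", swab(record))
text_swabbed = swab(text)
```

The all-shorts and all-longs files come back intact, but the mixed record decodes its long as 131072 instead of 2, and the text comes out as "ettx".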

In short, the arguments offered by X/OPEN in favor of "cpio" are
completely bogus.

Now for the arguments against "cpio" format:

	1) It is somewhat more UNIX-specific, in that the "mode"
	   field of the "stat" structure is written out numerically.
	   POSIX does not specify required numeric values for this
	   field.  "tar" indicates the file type with a standard
	   symbolic code, so you can read "tar" tapes even if the
	   machine on which the tape was written and the machine on
	   which it is being read do not have the same values for
	   this field.

	2) It does not handle hard links particularly elegantly.
	   "cpio" knows nothing of files with multiple hard links
	   when it writes a tape; if it is told to write "foo" and
	   "bar" to the tape, and they are both hard links to the
	   same file, it writes two copies of this file to the tape.
	   The hard links are established when the tape is read.
	   If the files appear on the tape in the order "foo" and
	   then "bar", "foo" will be read in first.  Once "bar" has
	   been read in, "cpio" will check to see if it has already
	   read in a file with the same dev/inumber value.  If so, it
	   will delete "bar" and make a hard link to "foo" called
	   "bar".

	3) It is less common.  Almost all UNIX systems that support
	   "cpio" also support "tar"; many UNIX systems that support
	   "tar" do not support "cpio".

	4) POSIX has already chosen "tar" format; why should it
	   change horses in midstream, especially given that the new
	   horse is lame and, despite the claims made by the person
	   selling the horse, is not capable of pulling any heavier
	   loads than the existing one?
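
The hard-link behaviour described in point 2 can be sketched as follows.  This is not "cpio" source code; the tuple layout standing in for archive entries, and the dev/inumber values, are invented for the illustration.

```python
import os
import tempfile

def extract(entries, dest):
    """Extract (name, dev, ino, data) entries, re-creating hard links afterwards."""
    first_name = {}                       # (dev, ino) -> first path extracted
    for name, dev, ino, data in entries:
        path = os.path.join(dest, name)
        with open(path, "wb") as f:       # every copy is read in first ...
            f.write(data)
        key = (dev, ino)
        if key in first_name:
            os.remove(path)               # ... then deleted and replaced by a link
            os.link(first_name[key], path)
        else:
            first_name[key] = path

# "foo" and "bar" are links to one file, so the archive carries the data twice.
entries = [("foo", 1, 42, b"same contents"), ("bar", 1, 42, b"same contents")]
archived_bytes = sum(len(data) for _, _, _, data in entries)

with tempfile.TemporaryDirectory() as dest:
    extract(entries, dest)
    foo = os.stat(os.path.join(dest, "foo"))
    bar = os.stat(os.path.join(dest, "bar"))
    linked = foo.st_ino == bar.st_ino and foo.st_nlink == 2
```

The link is restored correctly, but only after 26 bytes were stored and read for 13 bytes of file.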

Anyway, I'll have to dig up the proposal made to POSIX that "cpio"
supplement or replace "tar" and cast a very strong "no" vote citing
the above.

Now, as for the proposal for handing the whole thing off to P1003.2 -
I have some inclination to support this.  It could, in some ways, be
considered neither part of the scope of P1003.1 nor of P1003.2, but
to be a separate standardization topic entirely.  However, if I had
to choose which of the two items - C-language binding to OS
system call and library functions, or command-language functions -
the data interchange standard belonged to, I'd vote in favor of the
latter.  There is no library of functions for reading or writing
"tar" tapes, but there is a command (namely, "tar") for reading and
writing them, so I think it belongs in that category - especially
given that Section 10 currently says "A conforming system shall
implement a user utility..." which really sounds a lot more like
a P1003.2 requirement than a P1003.1 requirement.

Volume-Number: Volume 11, Number 9

caf@omen.uucp (Chuck Forsberg) (05/10/87)

From: caf@omen.uucp (Chuck Forsberg)

I have played with a program "afio" which is what cpio should/might have
been.  Its main features are that it is much faster than cpio and that it
reads all sorts of cpio archives, most notably damaged ones.

Out of Sync --- It gets its own help!

This could be the basis of a POSIX cpio program.
Chuck Forsberg WA7KGX Author of Pro-YAM communications Tools for PCDOS and Unix
...!tektronix!reed!omen!caf  Omen Technology Inc "The High Reliability Software"
  17505-V Northwest Sauvie Island Road Portland OR 97231  Voice: 503-621-3406
TeleGodzilla BBS: 621-3746 2400/1200  CIS:70007,2304  Genie:CAF  Source:TCE022
  omen Any ACU 1200 1-503-621-3746 se:--se: link ord: Giznoid in:--in: uucp
  omen!/usr/spool/uucppublic/FILES lists all uucp-able files, updated hourly

Volume-Number: Volume 11, Number 7

ralph@ralmar.uucp (Ralph Barker) (05/11/87)

From: ralph@ralmar.uucp (Ralph Barker)

Relative to posting proposals already received and the proposal which you
will make to the working group:

I, for one, would be most interested in seeing the proposals you have
already received (assuming that the writers have included both their
suggestions and the underlying reasoning).

[ They're the articles just before yours.  -mod ]

I suspect that such interim postings might stir additional discussion, as
well.

As an aside, THANKS for your efforts within the working group.  The results
of your efforts (and the efforts of all members of the committees) are of
great importance to all of us in the UNIX community.

[ You're welcome. -mod ]

---
Ralph Barker, RALMAR Business Systems, 640 So Winchester Blvd, San Jose,CA 95128
uucp: ...{ucbvax,hplabs}!sun!idi---\!ralmar!ralph
      ...pyramid!amdahl!unixprt----/             Voice: (408) 559-6202




Volume-Number: Volume 11, Number 16

hedrick@topaz.rutgers.edu (Charles Hedrick) (05/11/87)

From: hedrick@topaz.rutgers.edu (Charles Hedrick)

Since tar exists (as far as I know) on all Unix systems, and cpio only
on ATT ones, tar seems like the best choice for portable use.
Obviously any real POSIX will have both tar and cpio, but why not
leave the standard at tar?  (Note that I haven't read the POSIX
standard, so all I know about the question is what you mentioned in
your note.  I'm responding as if the choice is between the existing
tar or cpio programs.  If this is a new facility that will require a
new program to be written, then I have no comment.)

[ The format in POSIX is upwardly compatible with existing tar programs.
The format proposed for cpio is that of the current cpio program.  -mod ]

By the way, what ever happened about job control?  I recall some
discussion, but not the final resolution.  I had hoped that POSIX
would manage to get a few BSD features into general circulation.
Clearly from the end user's point of view, job control is the one most
important thing missing from System V.  (Actually networking is more
important, but it's not clear whether that is the sort of thing POSIX
should be concerned about.)

[ Job control (the HP proposal) is in the current draft.  Networking
is outside the scope of P1003.1, but there is a /usr/group Technical
Committee addressing the subject with the intention of eventually
providing input to an appropriate IEEE Working Group.  -mod ]

Volume-Number: Volume 11, Number 10

gwyn@brl.arpa (Doug Gwyn) (05/11/87)

From: gwyn@brl.arpa (Doug Gwyn)

Let's get the 1003.1 standard adopted and worry about perfection later.

In the real world one HAS to have a working "tar" if one exchanges files
with many random UNIX sites, even if "cpio" might be better technically.

Any proposal for CPIO format that is system-dependent ("old cpio")
rather than portable ("new cpio") should be rejected out of hand.

1003.2 should probably include the "cpio" utility, which has many uses
besides tape archives.  1003.1 should stick to "tar" for tape archives,
or remove that section altogether.

I would prefer to remove tape archive format altogether from what is
supposed to be a program/system interface specification (1003.1).  There
simply isn't a single universal interchange medium anyway (not every
system has 1/2" magtape, for example).

Volume-Number: Volume 11, Number

std-unix@ut-sally.UUCP (Moderator, John Quarterman) (05/11/87)

There have been three documents submitted to the IEEE P1003.1
Working Group recently regarding section 10:

N.043	April 22 1987	``X/OPEN Proposals to IEEE P1003.1,'' X/OPEN Group,
Section 3.25, ``Data Interchange format.''

N.048	April 15, 1987	``a proposal for a cpio format to be added to
Chapter 10,'' Lorraine C. Kevra, AT&T.

N.064	April 23, 1987	``Comments on 1003.1 N.048,'' Dominic Dunlop.

They're about a page apiece.  I may feel energetic enough to type them in.

Volume-Number: Volume 11, Number 11

jsdy@hadron.uucp (Joseph S. D. Yao) (05/12/87)

In article <8006@ut-sally.UUCP> guy@sun.com (Guy Harris) writes:
>	3) It is less common.  Almost all UNIX systems that support
>	   "cpio" also support "tar"; many UNIX systems that support
>	   "tar" do not support "cpio".

Guy's arguments are mostly good, especially when reasoning about
the byte-order problem.  It should perhaps be noted, though, that
cpio pre-dates tar, and that there are probably numerous systems
"out there" that have cpio but not tar.  This, at least, seems to
be one of the arguments used by X/OPEN.

Of course, terms like "numerous", "almost all", and "many" are
hard to argue against, because they're so fuzzy.

Personally, I have found good use for both (cpio -p is rather
more elegant than the 2-tar equivalent kludge).  However, I have
had minutely more foul-ups with cpio than with tar.  (At least,
with current versions of tar.)

	Joe Yao		jsdy@hadron.COM (not yet domainised)
	hadron!jsdy@{seismo.CSS.GOV,dtix.ARPA,decuac.DEC.COM}
{arinc,att,avatar,cos,decuac,dtix,ecogong,kcwc}!hadron!jsdy
     {netex,netxcom,rlgvax,seismo,smsdpg,sundc}!hadron!jsdy

Volume-Number: Volume 11, Number 22

trb@ima.ISC.COM (Andrew Tannenbaum) (05/20/87)

From: trb@ima.ISC.COM (Andrew Tannenbaum)

I don't have Section 10 of the POSIX Trial Use Standard, but I am
interested in what happens to tar and cpio in POSIX.

I see that the netnews discussion of this has been partly a
popularity contest between tar and cpio.  There are more
important issues to discuss than people's provincial biases.  If
you come from BSD land, you probably like tar.  If you come from
AT&T land, you probably like cpio.

I have some comments about cpio, since it is my personal favorite.
They apply to both the file format and to the program function.
Some comments apply to tar as well.

I like the idea of cpio taking a list of files on stdin.  I wish
tar had this option.  tar cv `find / -print | fgrep -v -f except.file`
doesn't cut it.

[ Evidently John Gilmore's public domain implementation that he
posted to comp.sources has this.  I know of no proprietary version
that does.  -mod ]

cpio's binary format should have been killed off long ago.
cpio has a 'portable' format, which still has several problems:

-       Byte swapping and its friends.
	There are systems which swap bytes and/or halfwords.  There are
	even systems which xor 0 and 1 bits on tapes.  If CPIO
	wrote a magic number 0x12345678 in the header, it could resolve
	these problems painlessly.

-       I agree that the binary cpio header is silly.
	The portable header is all printable ASCII data, but the
	filename is terminated by a null, which makes it harder to play
	with the archive.  Here is a shar-like program which makes a
	cpio archive which can unpack itself.

<<>><<>><<>><<>><<>><<>><<>><<>><<>><<>><<>><<>><<>><<>><<>>
#! /bin/sh
# take a list of files on stdin and make them into a bundle which
# can be passed through sh to extract them.
cat << \!
#!/bin/sh
# cpio archive
(read a; read a; read a; read a; cat) < "$0" | cpio -icdm
exit
!
cpio -oac
<<>><<>><<>><<>><<>><<>><<>><<>><<>><<>><<>><<>><<>><<>><<>>

	As I recall, it can have problems because of the fact that the
	filename is null-terminated, like if you try to read its
	output into a mail message with an editor.  It would also be
	neighborly if the ASCII header was more human readable, a space
	or carriage return here or there wouldn't hurt at all.

	I realize that this is a standardization effort, but if you are
	going to enhance the format for some portability reason, you
	might want to consider my enhancement suggestions.

-       The familiar problems with damaged archives should be
	fixed (Out of phase--get help).

-	There are systems which need to extend the archive
	formats in local ways, for instance, to add extra mode
	information for a secure UNIX implementation, or file type
	information when the UNIX system deals with other types of file
	systems.  It would be very useful to have a compatible way to
	extend the header such that any system could check the local
	field and either use or ignore the information.  Right
	now, there is little hope for compatibility, the only
	solutions I can think of (various kinds of shadow files
	which contain mode info) are quite kludgy.

-       These programs should deal with multiple tape archives
	in a standard way.  I have seen many local hacks to do
	this.

-       Blocksizes for speed, space, and streaming efficiency are best
	handled by blocking filters rather than by hacks like -B.  I
	have heard of ctccpio, but can only worry about what it
	actually is.  How many programs are going to have to have
	knowledge about how many goofy devices?  I don't understand why
	cpio has to know anything about a device.  This is UNIX, isn't
	it?  Which brings us back to the question of multi-tape
	archives, maybe the blocking filter should also handle the
	multi-tape problem?  This means lots of data travels over a
	pipe.  Modern OS's should be able to do something smart here.
	(The multi-tape question leaves me with a queasy feeling.)
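
The magic-number suggestion in the first item of the list above can be sketched like this.  It assumes, purely for illustration, that the writer stores 0x12345678 most significant byte first; the labels and helper names are made up.

```python
import struct

MAGIC = 0x12345678       # the value suggested in the post

def swab(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit unit."""
    return b"".join(data[i:i + 2][::-1] for i in range(0, len(data), 2))

def swap_halfwords(data: bytes) -> bytes:
    """Exchange the two 16-bit halves of one 32-bit word."""
    return data[2:4] + data[0:2]

# Each transform is its own inverse, so applying it again undoes the damage.
TRANSFORMS = {
    "intact": lambda b: b,
    "bytes swapped": swab,
    "halfwords swapped": swap_halfwords,
    "bytes and halfwords swapped": lambda b: swap_halfwords(swab(b)),
}

def diagnose(raw: bytes) -> str:
    """Say which rearrangement the first four bytes of an archive have suffered."""
    expected = struct.pack(">I", MAGIC)
    for label, undo in TRANSFORMS.items():
        if undo(raw[:4]) == expected:
            return label
    raise ValueError("magic number not recognised")

results = [diagnose(bytes.fromhex(h))
           for h in ("12345678", "34127856", "56781234", "78563412")]
```

Because all four byte values differ, each rearrangement leaves a distinct pattern, and the reader can pick the matching repair before touching the rest of the header.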

I would like to see a discussion about tar and cpio rather than
opinions about which is better.  I am particularly concerned about
extending the header format to deal with atypical file types.

	Andrew Tannenbaum   Interactive   Boston, MA   +1 617 247 1155

Volume-Number: Volume 11, Number 34

henry@utzoo.uucp (Henry Spencer) (05/28/87)

From: henry@utzoo.uucp (Henry Spencer)

Andy's comments about the facilities offered by tar and cpio are worthy
of note, but irrelevant to the P1003.1 issues.  This was actually raised
at the original /usr/group standards meeting when the question of a
standard interchange format came up:  the facilities offered by the
current programs are quite irrelevant to the choice of format, since
the format does not dictate the user interface.  It is not especially
difficult to write "cpar" or "tpio", to get one user interface with the
other format.
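
As a rough sketch of the "tpio" idea (a cpio-style interface that takes file names on its input but writes tar format), something like the following would do.  It leans on Python's standard tarfile module, whose default output format is a later descendant of the tar format under discussion; the function name and the demonstration file are invented.

```python
import io
import os
import tarfile
import tempfile

def tpio_create(names, out):
    """cpio -o style front end: read file names from `names`, write tar to `out`."""
    with tarfile.open(fileobj=out, mode="w|") as archive:
        for line in names:
            name = line.rstrip("\n")
            if name:
                archive.add(name, recursive=False)

# Demonstration with a throwaway file; a real tool would pass sys.stdin
# and sys.stdout.buffer instead of a list and a BytesIO.
with tempfile.TemporaryDirectory() as work:
    path = os.path.join(work, "notes.txt")
    with open(path, "w") as f:
        f.write("hello\n")
    out = io.BytesIO()
    tpio_create([path + "\n"], out)

with tarfile.open(fileobj=io.BytesIO(out.getvalue())) as archive:
    stored = archive.getnames()
```

The user interface (names in, archive out) is independent of the bytes written, which is the point being made about format versus program.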

I thought the choice of tar by /usr/group was a huge win, and still think
so; the extensions added in the Trial Use Standard strengthen this view.

The cpio binary format is a travesty:  unportable, non-extensible (for
example, it assumes that inode numbers are only 16 bits, often not true
today), and generally a mess.  Cpio ASCII format is better, but it still
shares some of these problems, since its field widths are sized to fit
old systems (for example, it can't deal with 32-bit inode numbers either).
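
A worked example of the field-width limit.  It assumes the ASCII header layout as later cpio(5) manual pages document it: eleven octal fields, six digits each except for the 11-digit mtime and filesize.  The helper and the sample values are illustrative only.

```python
# Field name and width in octal digits, in header order (76 characters in all).
FIELDS = [("magic", 6), ("dev", 6), ("ino", 6), ("mode", 6), ("uid", 6),
          ("gid", 6), ("nlink", 6), ("rdev", 6), ("mtime", 11),
          ("namesize", 6), ("filesize", 11)]

def format_header(values: dict) -> str:
    """Render the header as zero-padded octal text, refusing values that do not fit."""
    parts = []
    for name, width in FIELDS:
        digits = format(values.get(name, 0), "o")
        if len(digits) > width:
            raise OverflowError(f"{name} needs {len(digits)} octal digits, "
                                f"field has {width}")
        parts.append(digits.rjust(width, "0"))
    return "".join(parts)

largest_ino = 0o777777            # six octal digits hold at most 18 bits
header = format_header({"magic": 0o070707, "ino": largest_ino, "namesize": 4})

try:
    format_header({"magic": 0o070707, "ino": 2**32 - 1})
    fits_32_bit = True
except OverflowError:
    fits_32_bit = False
```

Six octal digits top out at 262143, so a 32-bit inode number (eleven octal digits at its largest) cannot be represented.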

Furthermore, I would note that at least the cpio binary format, possibly
the ASCII one as well, has existed in two different versions.  People who
claim that cpio is older than tar are half-lying:  the current version of
cpio is not.  I submit that the mere existence of multiple incompatible
versions of the cpio format is a major black mark against it.  Tar format
is virtually universal, with only minor (compatible!) extensions having
been made here and there.

Andy makes a good point about extensibility.  The tar format extends
gracefully because it has extra room in its header (which the existing
programs helpfully zero rather than filling with random trash) and in its
file-type code space.  (Note that the Trial Use Standard explicitly
reserves certain type codes for local extensions, and others for future
standards.  Note also that the Trial Use Standard's own extensions are
upward-compatible with the existing format and existing programs.)
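
The two extension points mentioned above, the type code and the spare header room, can be inspected with a short sketch.  It uses Python's tarfile module, which writes the later POSIX "ustar" layout rather than the Trial Use text itself, so the offsets (type code at byte 156, unused bytes from 500) are taken from that layout and are an assumption here.

```python
import tarfile

TYPEFLAG_OFFSET = 156    # one-byte symbolic type code in the 512-byte header
SPARE_OFFSET = 500       # bytes from here to the end of the header are unused

KNOWN_TYPES = {
    tarfile.REGTYPE: "regular file",
    tarfile.AREGTYPE: "regular file (old-style NUL code)",
    tarfile.LNKTYPE: "hard link",
    tarfile.SYMTYPE: "symbolic link",
    tarfile.DIRTYPE: "directory",
}

def describe(header: bytes) -> str:
    """Name the file type from its symbolic code, tolerating unknown codes."""
    code = header[TYPEFLAG_OFFSET:TYPEFLAG_OFFSET + 1]
    return KNOWN_TYPES.get(code, "unknown or local extension")

# Build a header for a hypothetical directory entry and look inside it.
info = tarfile.TarInfo("docs")
info.type = tarfile.DIRTYPE
header = info.tobuf(format=tarfile.USTAR_FORMAT)

kind = describe(header)
spare = header[SPARE_OFFSET:]
```

A reader that meets a code it does not know can still skip the entry, and the zero-filled tail is what leaves room for later additions.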

Chapter 10 of the Trial Use Standard is a valuable part of the standard,
it is not broken, and it does not need fixing.  Leave it there.

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry


Volume-Number: Volume 11, Number 39

std-unix@ut-sally.UUCP (06/02/87)

From: seismo!scgvacd!stb!michael (Michael Gersten)

I have some comments on Tar/Cpio.
 
First, Radio Shack does sell a tar that has an argument F for "here is a
file with the files you should read". Works with - for stdin. However,
you can't read the files from stdin and write the output to stdout,
although you can write to a named pipe (does mostly the same thing).

Secondly, whatever you use for backups should know about blocksizes.
In particular, if you lose one floppy, users should be able to restore all
the information on the other floppies. Tar does not do this--linking information
gets lost if this occurs. 

In particular,
	Floppy A has file X on it
	Floppy B has "Link X to Y"
If you lose floppy A, you've got garbage for Y. Worse, if you restore out
of order, no warning is given other than "Cannot link".

Finally, I feel that tar, in order to be usable as a backup facility, should
be required to unlink a file before it restores it. Otherwise, consider this:

Customer uses initialization floppy to initialize hard disk, which puts basic
commands (ls, tar, cp, etc) on disk, then restores the entire system from tarred
floppies.

Initialization system had /bin/l linked to /bin/ls (AT&T versions)
Customer had /bin/ls linked to lc, lf, lx (Berkeley versions), and the
AT&T as ls.old

After untarring, the Berkeley version was lost, and the AT&T version was under
all the names. Took me a while to figure this one out. Guess who the
customer was.
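
The mechanism behind this can be shown in a few lines.  The sketch only demonstrates the general effect of restoring over an existing hard link with and without unlinking first; the exact sequence of events on the system described cannot be reconstructed from the post, and the file contents are placeholders.

```python
import os
import tempfile

def make_linked_pair(directory):
    """Create "ls" and a hard link "l" to it, as on the initialization floppy."""
    ls = os.path.join(directory, "ls")
    l = os.path.join(directory, "l")
    with open(ls, "wb") as f:
        f.write(b"AT&T ls")
    os.link(ls, l)
    return ls, l

def restore(path, data, unlink_first):
    """Write `data` to `path`, optionally removing the old name beforehand."""
    if unlink_first and os.path.lexists(path):
        os.remove(path)
    with open(path, "wb") as f:      # without the unlink this reuses the old inode
        f.write(data)

def read(path):
    with open(path, "rb") as f:
        return f.read()

with tempfile.TemporaryDirectory() as d:
    ls, l = make_linked_pair(d)
    restore(ls, b"Berkeley ls", unlink_first=False)
    other_name_in_place = read(l)

with tempfile.TemporaryDirectory() as d:
    ls, l = make_linked_pair(d)
    restore(ls, b"Berkeley ls", unlink_first=True)
    other_name_unlinked = read(l)
```

Restoring in place changes the contents seen through every existing link to the file; unlinking first gives the restored name a fresh inode and leaves the other names alone.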

I do not consider backup/restore usable as they take 5 minutes per file to
recover individual files. I am not kidding; maybe R/S mucked something, but
that is ridiculous. Sure, you can get faster, but only if you first format the
disk, which takes 2 hours, and also do an incremental dump first.

[ Do you mean dump and restor?  (Or dump and restore?)  -mod ]

---
: Michael Gersten		seismo!scgvaxd!stb!michael
: The above is the result of being educated at a school that discriminates
: against roosters.

Volume-Number: Volume 11, Number 48

std-unix@ut-sally.UUCP (06/02/87)

From: tony@uqcspe.oz.au (Tony O'Hagan)

Last year I wrote an off-line tape archive system for use on our UNIX
machines.  Tar and cpio were carefully compared to decide the appropriate
format for the tape archives.  Eventually we chose cpio because it permitted
retrieval of any pattern of files.  

About 120+ cpio files are stored on each archive tape and from bitter
experience I know that sometimes when appending/retrieving files to/from tape
an insufficient number of files are skipped.  (We count them using taprd now)
It would have been useful to write a "file label" included in the header
of each cpio file and be able to check this when reading.  I would have used it
to check the file number but I'm sure it would have other uses. 
( I would skip to the file before and check its label when appending. )

I recently adapted the archive system from V7 to BSD 4.2 and added the
facility to drive remote tape drives using the blocking filters which fitted
in well with cpio.

[ These are useful points, though they are not problems with the data
interchange format, rather with the program that uses it.  -mod ]

	P.S.
There are a few other bugs in cpio which I had to fix for the local version.
	* creating new otherwise unstored directories with the current
	  mask (not mode 777).  (with -d switch)
	* not changing the ownership/group/permissions of existing
	  directories back to their values at the time of archiving.
==============================================================================
Tony O'Hagan		Australia: (07) 3774125  International: +61 7 3774125
University of Queensland	CSNET:	tony@uqcspe.oz	ACSnet:	tony@uqcspe.oz
Dept. of Computer Science	UUCP:	...!seismo!munnari!uqcspe.oz!tony
St. Lucia, Brisbane, 		ARPA:	tony%uqcspe.oz@seismo.css.gov
AUSTRALIA  4067	 		JANET:	uqcspe.oz!tony@ukc

Volume-Number: Volume 11, Number 42