[comp.unix.questions] tar or cpio?

samperi@mancol.UUCP (Dominick Samperi) (02/01/88)

I've heard that cpio will be used as the unix standard archiver, yet
many people seem to prefer tar. While implementing these programs on
a PC I noticed several advantages/disadvantages of each. In a tar
archive, file headers and file data always begin on a block (512 byte)
boundary, thus making it easier to seek to the beginning of a particular
file, or to append files to a tar archive. On the other hand, files in
a cpio archive can begin at any byte (character format), so a file header
could even span two volumes (floppies), making it difficult to append files
to a cpio archive. It seems that directories and special device files cannot
be written to a tar archive (on the unix systems that I checked), while they
can be written to a cpio archive. This means that more information is stored
in a cpio archive, thus facilitating file restores after a crash. Another
disadvantage of tar archives is the fact that they tend to waste space, since
every file must occupy at least 1K bytes (512 for a header, and 512 for data).
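
The space arithmetic can be sketched in a few lines of Python. This is only
an illustration: the function names are made up here, and the 76-byte figure
for a character-format cpio header is an assumption about the "cpio -c"
layout, not something stated above.

```python
BLOCK = 512  # tar block size in bytes, as described above


def tar_member_size(data_len: int) -> int:
    """Bytes one regular file occupies in a tar archive: one header
    block plus the data rounded up to a whole number of blocks."""
    data_blocks = (data_len + BLOCK - 1) // BLOCK
    return BLOCK + data_blocks * BLOCK


def cpio_char_member_size(data_len: int, name: str) -> int:
    """Bytes one file occupies in a character-format archive, assuming a
    76-byte ASCII header, a NUL-terminated pathname, and unpadded data."""
    return 76 + len(name.encode()) + 1 + data_len


if __name__ == "__main__":
    # Compare the two layouts for a handful of file sizes.
    for size in (1, 100, 512, 513, 5000):
        print(size, tar_member_size(size),
              cpio_char_member_size(size, "etc/motd"))
```

A one-byte file costs 1024 bytes in this tar model, which is the 1K minimum
mentioned above; the waste matters most for archives of many small files.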

I'd be interested to hear about any published standards for tar and/or
cpio (AT&T, POSIX, etc.), especially standards that define how to deal
with multi-volume archives (e.g., how do you start reading starting at
volume N?). Perhaps people can add to the list of advantages/disadvantages
of tar and cpio. Differences in the user interface (command syntax) are not
really important, since tar can be used like cpio, and vice versa, via
shell scripts.

-- 
Dominick Samperi, Manhattan College, NYC
	manhat!samperi@NYU.EDU  ihnp4!cmcl2!manhat!samperi (cmcL2)
	ihnp4!cmcl2!phri!dasys1!samperi

al@gtx.com (0732) (02/02/88)

In article <246@mancol.UUCP> samperi@mancol.UUCP (Dominick Samperi) writes:
>
>I'd be interested to hear about any published standards for tar and/or
>cpio (AT&T, POSIX, etc.), especially standards that define how to deal
>with multi-volume archives (e.g., how do you start reading starting at
>volume N?). Perhaps people can add to the list of advantages/disadvantages

I too would like to know if such information is available.  I once did
a cpio backup onto multiple floppies (AT&T 3B1), and when I tried to
restore,  I had a hard read error on one of the floppies, early on in
the sequence.  Luckily, I had the information elsewhere.  Would there
have been any way to bypass the bad floppy and continue the restore?
Where can one get a document describing cpio format?

    ----------------------------------------------------------------------
   | Alan Filipski, GTX Corp, 2501 W. Dunlap, Phoenix, Arizona 85021, USA |
   | {ihnp4,cbosgd,decvax,hplabs,amdahl}!sun!sunburn!gtx!al (602)870-1696 |
    ----------------------------------------------------------------------

mmengel@cuuxb.ATT.COM (Marc W. Mengel) (02/03/88)

In article <246@mancol.UUCP> samperi@mancol.UUCP (Dominick Samperi) writes:
>I've heard that cpio will be used as the unix standard archiver, yet
>many people seem to prefer tar. 
>...
>I'd be interested to hear about any published standards for tar and/or
>cpio (AT&T, POSIX, etc.), especially standards that define how to deal
>with multi-volume archives (e.g., how do you start reading starting at
>volume N?). Perhaps people can add to the list of advantages/disadvantages
>of tar and cpio. Differences in the user interface (command syntax) are not
>really important, since tar can be used like cpio, and vice versa, via
>shell scripts.

Well, you missed (about 1 month ago) a LONG discussion (TAR WARS (-:) in
comp.std.unix, which can be summarized (this off the top of my head, so
I won't try to credit the appropriate folks) as follows (tar and cpio
here refer to  their respective archive formats):

	1) There is much confusion as to whether tar or cpio is older.

	2) tar implementations are more prevalent (almost every release has 
		some version of tar, many (i.e. the BSD releases and 
		v7 derivatives) do not have any version of cpio)

	3) tar format is easily extensible to handle special files such as
		device nodes, named pipes, etc. and has been so extended
		in the public domain version of tar (posted many months
		ago in comp.sources and a PC version about 2 months ago..)

	4) cpio assumes too many things about inode numbers, (limiting their 
		range, etc.)

	5) non-character format cpio archives are not easily moveable to
		machines with different byte ordering.

	6) cpio builds in information to handle file links properly
		regardless of file extraction order. (however it
		uses inode numbers to do this, see (4))
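
Points (4) and (6) interact, and a small Python sketch may make the failure
concrete. It assumes links are recognized by matching (device, inode) pairs
and that a reader keeps only the low 16 bits of the inode number; the
function name and the sample entries are hypothetical.

```python
def detect_links(entries, ino_bits=16):
    """Return (name, earlier_name) pairs that a reader would treat as
    hard links: entries sharing the same (dev, ino) pair after the
    inode number is truncated to ino_bits bits."""
    mask = (1 << ino_bits) - 1
    seen = {}
    links = []
    for name, dev, ino in entries:
        key = (dev, ino & mask)
        if key in seen:
            links.append((name, seen[key]))  # linked to the earlier file
        else:
            seen[key] = name
    return links


entries = [
    ("a", 1, 5),
    ("b", 1, 5),          # genuine hard link to "a"
    ("c", 1, 65536 + 5),  # different file; same low 16 bits as "a"
]

if __name__ == "__main__":
    print(detect_links(entries, ino_bits=16))
    print(detect_links(entries, ino_bits=32))
```

With 16 bits, "c" is wrongly reported as a link to "a" and no error is
raised; with 32 bits only the genuine link "b" is found.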

As to the command format

	1) taking files on stdin is more convenient for backups (used
		with find(1))
	
	2) taking files as arguments is more convenient for archives
		constructed "by hand"
	
	3) cpio will copy directory trees with an option, tar needs
		2 tar's in a pipeline to do this.
		
	4) points 1 and 2 are resolved in the public domain tar  (it
		has an option to read filenames from stdin.)

These were the points discussed, and the tar format has been chosen (as
of the last I heard) for the POSIX (a.k.a IEEE 1003) standard.
	
>Dominick Samperi, Manhattan College, NYC


-- 
 Marc Mengel	

 attmail!mmengel
 ...!{moss|lll-crg|mtune|ihnp4}!cuuxb!mmengel

snoopy@doghouse.gwd.tek.com (Snoopy) (02/05/88)

In article <246@mancol.UUCP> samperi@mancol.UUCP (Dominick Samperi) writes:
>I've heard that cpio will be used as the unix standard archiver, yet
>many people seem to prefer tar.

- Tar needs fewer options to do what I want it to do.

- Tar handles symbolic links.  Most implementations of cpio don't.
  (I added this to UTek's cpio.  Great fun.)

- The code for tar is nice and clean, easy to figure out, return codes are
  checked for errors, etc.  The code for cpio is a mess.  => I trust tar
  farther than I trust cpio.  (If you are writing your own from scratch
  this isn't a consideration.)

- Most implementations of tar don't handle multiple volumes. (I haven't
  checked John's PD tar, perhaps it does?)  If it doesn't fit on one
  volume, you're stuck with cpio or using one of those multivolume
  programs.

Snoopy
tektronix!doghouse.gwd!snoopy
snoopy@doghouse.gwd.tek.com

NFS: No Frigging Security

lenny@icus.UUCP (Lenny Tropiano) (02/06/88)

In article <556@gtx.com> al@gtx.UUCP (Al Filipski 839-0732) writes:
|> [... reply to a question on the POSIX standards of cpio or tar ...]
|>
|>I too would like to know if such information is available.  I once did
|>a cpio backup onto multiple floppies (AT&T 3B1), and when I tried to
|>restore,  I had a hard read error on one of the floppies, early on in
|>the sequence.  Luckily, I had the information elsewhere.  Would there
|>have been any way to bypass the bad floppy and continue the restore?
|>Where can one get a document describing cpio format?
|>
|>

What you need to get is a program that was posted to the net a while
back, it was called "afio".  I think it was in comp.sources.unix 
(check with your local archive site).  It was a program that acted just
like cpio, but nicely skipped bad records and jumped to the next ASCII header
record (as long as you used the "-c" option to cpio or afio) and continued
with that file.   This means if you have a bad floppy, you might lose
one file if there is some bad data.  This also allows for starting at backup
disk #50 if you like (any disk) and skipping to the first good header.
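
The resynchronization idea can be sketched in Python as follows. This is not
afio's code; it assumes the "-c" header starts with the string "070707" and
is 76 bytes of octal digits in total, and the function name is made up for
the example.

```python
MAGIC = b"070707"   # assumed magic string at the start of a "-c" header
HEADER_LEN = 76     # assumed length of the fixed ASCII header


def next_header(buf: bytes, start: int = 0) -> int:
    """Return the offset of the next plausible ASCII header at or after
    start, or -1 if there is none. A candidate must begin with MAGIC and
    consist only of octal digits for the whole fixed header."""
    pos = buf.find(MAGIC, start)
    while pos != -1:
        header = buf[pos:pos + HEADER_LEN]
        if len(header) == HEADER_LEN and all(c in b"01234567" for c in header):
            return pos
        pos = buf.find(MAGIC, pos + 1)  # false hit inside file data
    return -1
```

After a read error, a restore program built this way would call
next_header() on the data that follows the bad record and resume extracting
from the offset it returns, losing only the file that was cut off.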

Nice program!  Very useful, especially if you backup on floppy.  The
program has an option to compile it with -DCTC3B2 (for 3B2 cartridge tape).


						-Lenny
-- 
============================ US MAIL:   Lenny Tropiano, ICUS Computer Group
 IIIII   CCC   U   U   SSSS             PO Box 1
   I    C   C  U   U  S                 Islip Terrace, New York  11752
   I    C      U   U   SSS   PHONE:     (516) 968-8576 [H] (516) 582-5525 [W] 
   I    C   C  U   U      S  AT&T MAIL: ...attmail!icus!lenny  TELEX: 154232428
 IIIII   CCC    UUU   SSSS   UUCP:
============================    ...{uunet!godfre, harvard!talcott}!\
                   ...{ihnp4, boulder, mtune, bc-cis, ptsfa, sbcs}! >icus!lenny 
"Usenet the final frontier"        ...{cmcl2!phri, hoptoad}!dasys1!/

twh@mibte.UUCP (Tim Hitchcock) (02/10/88)

> >I've heard that cpio will be used as the unix standard archiver, yet
> >many people seem to prefer tar. 
> >...
> Well, you missed (about 1 month ago) a LONG discussion (TAR WARS (-:) in
> comp.std.unix, which can be summarized (this off the top of my head, so
> I won't try to credit the appropriate folks) as follows (tar and cpio
> here refer to  their respective archive formats):
> 
> 	3) tar format is easily extensible to handle special files such as
> 		device nodes, named pipes, etc. and has been so extended
> 		in the public domain version of tar (posted many months
> 		ago in comp.sources and a PC version about 2 months ago..)
> 

"cpio -u" will copy special files.

> 
> 	5) non-character format cpio archives are not easily moveable to
> 		machines with different byte ordering.
> 

The "DD" command will swap bytes. In many cases find, cpio & dd are used.

> As to the command format
> 
> 	1) taking files on stdin is more convenient for backups (used
> 		with find(1))
> 	
> 	2) taking files as arguments is more convenient for archives
> 		constructed "by hand"

There is a limit to how many args are allowed on a command line.
There are many UNIX tools one can use to manipulate pathnames.
This seems to be resolved in the public domain tar (4).
> 	
> 	3) cpio will copy directory trees with an option, tar needs
> 		2 tar's in a pipeline to do this.
> 		
> 	4) points 1 and 2 are resolved in the public domain tar  (it
> 		has an option to read filenames from stdin.)
> 
> These were the points discussed, and the tar format has been chosen (as
> of the last I heard) for the POSIX (a.k.a IEEE 1003) standard.
>

guy@gorodish.Sun.COM (Guy Harris) (02/10/88)

> > 	5) non-character format cpio archives are not easily moveable to
> > 		machines with different byte ordering.
> 
> The "DD" command will swap bytes. In many cases find, cpio & dd are used.

Unfortunately, "dd" will swap the bytes in every 16-bit quantity written to the
tape (I don't know how any of this works with non-8-bit-"char" machines).  This
is not useful under these circumstances.  Equally unfortunately, the
byte-swapping options of "cpio" will swap only the bytes in the data blocks.
This is also not useful, under almost *any* circumstances.

What you *want* to do is swap only the bytes in the headers, *not* the bytes in
the data blocks and *not* the bytes in the pathnames.  Unfortunately, as the
astute reader will note, the combination of the byte-swapping of "dd" and the
byte-swapping of "cpio" results in the data blocks being unswapped and the
headers being swapped - BUT it also results in the pathnames being swapped!
The net result is precisely what you think it is.

The System III "cpio"'s byte-swapping option swapped bytes in the data blocks
*and* in the pathname; combining this with "dd" provided a stupid and
inefficient way of swapping only the bytes in the headers.  The System V
"cpio"s options cannot be combined with "dd" in this fashion to yield something
useful.  (Please do not tell me that this works, or can be made to work.  I
have tried it, when attempting to use the System V "cpio" on a big-endian Sun-3
to read a binary "cpio" tape made with the System V "cpio" on a little-endian
VAX; it does not work, and cannot be made to work without hacking up "cpio".
This is what I eventually did.)

The *correct* thing to do is to make "cpio" detect that the magic number in the
"cpio" header is byte-swapped from its proper value when reading a tape, and
automatically decide to swap the bytes in the headers, and *only* the headers,
as it reads the data.  After trying to read the aforementioned "cpio" tape, I
fixed the 3.2 SunOS "cpio" to do exactly that.
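
A minimal Python sketch of that approach, not the SunOS code: it assumes the
binary header is thirteen 16-bit fields (magic 070707 octal first, name size
in the eleventh, file size in the last two with the high half first), and
the function names are invented for the example.

```python
import struct

MAGIC = 0o070707      # binary cpio magic number
HEADER_LEN = 26       # assumed: thirteen 16-bit fields


def swab(data: bytes) -> bytes:
    """Swap every pair of bytes, as "dd conv=swab" does (even length)."""
    out = bytearray(data)
    out[0::2], out[1::2] = data[1::2], data[0::2]
    return bytes(out)


def header_byte_order(raw: bytes) -> str:
    """Return the struct prefix ('<' or '>') under which the first
    16-bit word of raw equals MAGIC."""
    for order in ("<", ">"):
        if struct.unpack_from(order + "H", raw)[0] == MAGIC:
            return order
    raise ValueError("not a binary cpio header")


def read_header(raw: bytes):
    """Decode one member's header. Only the 16-bit header fields are
    byte-order corrected; the pathname bytes are taken as they appear."""
    order = header_byte_order(raw)
    fields = struct.unpack_from(order + "13H", raw)
    namesize = fields[10]
    filesize = (fields[11] << 16) | fields[12]
    name = raw[HEADER_LEN:HEADER_LEN + namesize - 1]  # drop trailing NUL
    return order, name.decode("ascii"), filesize
```

Reading an archive written on the other byte order then needs no option at
all. Running swab() over the whole archive first, by contrast, also scrambles
the pathname ("hello" comes back as "ehll" plus a NUL), which is the problem
described above.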

This is, of course, useful only when you have a binary "cpio" archive.
Everybody now should be using "cpio -c" to make "cpio" archives.  They should
also, if using "find" without "cpio" to make "cpio" archives, be using the
"-ncpio" option, which produces ASCII "cpio" headers, rather than the "-cpio"
option.  Unfortunately, the implementor of this option didn't see fit to
document it.

dhesi@bsu-cs.UUCP (Rahul Dhesi) (02/11/88)

In article <41499@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>The *correct* thing to do is to make "cpio" detect that the magic number in the
>"cpio" header is byte-swapped from its proper value when reading a tape, and
>automatically decide to swap the bytes in the headers, and *only* the headers,
>as it reads the data.

An even more correct thing to do is for cpio to always write archive headers
in a canonical format that is not dependent on the byte-ordering of the
hardware.  E.g., all header data written least significant byte first.

In other words, portability ought to be achieved by making the cpio *format*
portable, not just by compensating for nonportability in the format (in this
case, ambiguity in byte ordering).
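
As a rough illustration in Python (the function names are hypothetical and
the field list is only an example), a writer that fixes the byte order in
the format itself might look like this:

```python
import struct


def pack_fields(values):
    """Serialize 16-bit header fields least significant byte first,
    whatever the byte order of the machine running this code."""
    return struct.pack("<%dH" % len(values), *values)


def unpack_fields(raw):
    """Inverse of pack_fields; again independent of the host."""
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))
```

Because the "<" prefix names the byte order explicitly, a VAX and a Sun-3
running this code would produce and accept identical bytes.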

This could well be a matter of religion.  Follow-ups to talk.religion.misc.
-- 
Rahul Dhesi         UUCP:  <backbones>!{iuvax,pur-ee,uunet}!bsu-cs!dhesi

gst@gnosys.UUCP (Gary S. Trujillo) (02/16/88)

In article <1629@cuuxb.ATT.COM> mmengel@cuuxb.UUCP (Marc W. Mengel) writes:
| In article <246@mancol.UUCP> samperi@mancol.UUCP (Dominick Samperi) writes:
| | I've heard that cpio will be used as the unix standard archiver, yet
| | many people seem to prefer tar. 
| | ...
| | I'd be interested to hear about any published standards for tar and/or
| | cpio (AT&T, POSIX, etc.)...
| 
| Well, you missed (about 1 month ago) a LONG discussion (TAR WARS (-:) in
| comp.std.unix, which can be summarized (this off the top of my head, so
| I won't try to credit the appropriate folks) as follows (tar and cpio
| here refer to  their respective archive formats):
| 
|  (deleted Marc's summary)
| 
| These were the points discussed, and the tar format has been chosen (as
| of the last I heard) for the POSIX (a.k.a IEEE 1003) standard.
| 	
| | Dominick Samperi, Manhattan College, NYC
| 
| 
| -- 
|  Marc Mengel	
| 
|  attmail!mmengel
|  ...!{moss|lll-crg|mtune|ihnp4}!cuuxb!mmengel

In reviewing my archives, I came across a copy of a message from the Usenix
Association's representatives to the committee responsible for deciding on
a standard for file interchange via magnetic tape.  I thought readers of this
discussion might find it interesting:

| From husc6!ut-sally!std-unix Wed Aug 26 17:14:10 EDT 1987
| Article 114 of comp.std.unix:
| Path: husc6!ut-sally!std-unix
| From: jsq@usenix.uucp (John Quarterman)
| Newsgroups: comp.std.unix
| Subject: cpio format objections
| Message-ID: <8832@ut-sally.UUCP> 
| Date: 24 Aug 87 23:24:22 GMT
| Sender: std-unix@ut-sally.UUCP
| Reply-To: jsq@usenix.uucp (John Quarterman)
| Lines: 128
| Approved: fletcher@sally.utexas.edu (Guest Moderator, Fletcher Mattox)
| 
| From: jsq@usenix.uucp (John Quarterman)
| 
| 	  cpio format objections  Page 1 of 2	    IEEE P1003.1 N.117
| 							24 August 1987
| 
| 			       John S. Quarterman
| 
| 		    Institutional Representative from USENIX
| 				   usenix!jsq
| 
| 
| 
| 	  Secretary, IEEE Standards Board
| 	  Attention: P1003 Working Group
| 	  345 East 47th	St.
| 	  New York, NY 10017
| 
| 	  Cc: 1003.1 Technical Reviewers
| 		      for Section 10:		     for Rationale:
| 	  Stephen Dum		    Lorraine Kevra   Hal Jespersen
| 	  tektronix!athena!steved   attunix!kevra    ucbvax!unisoft!hlj
| 
| 	  The USENIX Association ballots no on the test	balloting of
| 	  IEEE 1003.1 Draft 11,	objecting to the proposed inclusion of
| 	  cpio format, for the following reasons:
| 
| 	    1.	The need for extensions	for symbolic links and
| 		contiguous files has not been properly addressed.
| 		Although three type codes are reserved,	no indication
| 		is given of what they should be	used for.  This	does
| 		not promote the	need for those who implement such
| 		extensions to implement	them the same way.  It is true
| 		that the text of the standard cannot refer to symbolic
| 		links or high performance files, because they are not
| 		defined	in the standard.  But the USTAR	format
| 		indicates the use of its codes for those extensions
| 		both by	the name of the	code given in the standard,
| 		and by explicit	recommendations	in the Rationale.  The
| 		cpio proposal does neither.
| 
| 	    2.	The need for implementation-specific extensions	that
| 		do not conflict	with present or	future standard	file
| 		types has not been addressed.  The USTAR format
| 		addresses the problem by reserving 26 codes for
| 		implementations	to use as they see fit.	 The cpio
| 		proposal does not address the problem at all.
| 
| 	    3.	The c_ino field	of the cpio format is derived from the
| 		UNIX inode number.  Many implementations of cpio use
| 		only 16	bits for this number, and thus cannot properly
| 		resolve	links noted in cpio archives that use more
| 		bits for this number.  Tar and USTAR formats do	not
| 		have this problem, because they	do not use a number
| 		like this to resolve links.  While some	USTAR file
| 		types cannot be	read by	historical tar
| 		implementations, an error will usually be produced.
| 		This cpio problem will cause silent creation of
| 
| 		erroneous links, which is worse.
| 
| 	    4.	There are few, if any, distributions of	UNIX systems
| 		that do	not include the	tar program, which is
| 		compatible with	the POSIX USTAR	format.	 There are
| 		many UNIX systems that do not include cpio.
| 
| 	    5.	There is a public domain implementation	of USTAR
| 		format.	 There is no public domain implementation of
| 		cpio format, with or without extensions.
| 
| 	  There	should be one data interchange/archive format in IEEE
| 	  1003.1.
| 
| 	     + The proposed cpio format	is technically inferior	to
| 	       USTAR format.
| 
| 	     + The program that	cpio format is based on	is not as
| 	       widely available	as the one that	USTAR format is	based
| 	       on, and the same	is true	of the proposed	cpio format
| 	       and of USTAR format, respectively.
| 
| 	  Therefore, the one format in the standard should be USTAR.
| 
| 	  Specific action:  deny the cpio format proposal, and do not
| 	  include in the standard any references to that format	or to
| 	  cpio.
| 
| 						  Thank	you,
| 
| 
| 
| 						  John S. Quarterman
| 						  Texas	Internet Consulting
| 						  701 Brazos, Suite 500
| 						  Austin, TX 78701-3243
| 						  512-320-9031
| 
| Volume-Number: Volume 12, Number 21
| 
| 

Gary S. Trujillo	{ihnp4,harvard,husc6,linus,ima,bbn,m2c}!spdcc!gnosys!gst
Somerville, Massachusetts

fnf@mcdsun.UUCP (Fred Fish) (02/17/88)

In article <2071@bsu-cs.UUCP> dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
>An even more correct thing to do is for cpio to always write archive headers
>in a canonical format that is not dependent on the byte-ordering of the
>hardware.  E.g., all header data written least significant byte first.
>
>In other words, portability ought to be achieved by making the cpio *format*
>portable, not just by compensating for nonportability in the format (in this
>case, ambiguity in byte ordering).

Unfortunately this will not work for one not-so-obvious reason, and that
is because there are systems that when reading the exact same media,
will return bytes ordered differently.  I have seen this happen for example
when reading a given floppy on both a Callan Unistar machine and a Motorola
machine.  Thus there is no way to get around having to detect and do byte
swapping in the archiver as the same media can be read differently on
two different systems.  As long as you never move off the system, you
will never notice this problem.

Several years ago after I got my first Unix system, I was so disgusted
with tar and cpio that I wrote my own backup/archiver type program (bru)
which has always handled this problem completely transparently to the
user by writing ASCII formatted archives and doing whatever byte swapping
was necessary.  The basic algorithm is:

	1.	Examine block's magic number, if correct, no swapping.
	2.	Swap all bytes and try again.  If correct, note byte swap
		needed for all blocks.
	3.	Swap all shorts and try again.  If correct, note short swap
		also needed for all blocks.
	4.	Swap all bytes and try again.  If correct, reset byte swapping
		flag (swap shorts only).

I have never encountered a machine where one of these four combinations of
byte/short swapping didn't result in a readable archive, but I've seen
each combination needed at least once for at least one machine.
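
The four steps can be sketched in Python like this. It is a sketch only, not
bru's source: the 32-bit magic value is invented (the real one is not given
here), the magic is assumed to sit at the start of the block, and block
lengths are assumed to be multiples of four.

```python
import struct

MAGIC = 0x12345678  # hypothetical magic number, for illustration only


def swap_bytes(block: bytes) -> bytes:
    """Swap the two bytes inside every 16-bit word."""
    out = bytearray(block)
    out[0::2], out[1::2] = block[1::2], block[0::2]
    return bytes(out)


def swap_shorts(block: bytes) -> bytes:
    """Swap the two 16-bit words inside every 32-bit word."""
    out = bytearray(block)
    for i in range(0, len(block), 4):
        out[i:i + 2], out[i + 2:i + 4] = block[i + 2:i + 4], block[i:i + 2]
    return bytes(out)


def find_swapping(block: bytes):
    """Follow the four steps in order. Return (byte_swap, short_swap)
    flags under which the leading magic number reads correctly, or
    None if no combination works."""
    steps = [
        (None, (False, False)),        # 1. as read
        (swap_bytes, (True, False)),   # 2. bytes swapped
        (swap_shorts, (True, True)),   # 3. bytes and shorts swapped
        (swap_bytes, (False, True)),   # 4. bytes swapped back: shorts only
    ]
    trial = block
    for transform, flags in steps:
        if transform is not None:
            trial = transform(trial)   # each step builds on the previous one
        if struct.unpack_from(">I", trial)[0] == MAGIC:
            return flags
    return None
```

The flags found for the first block would then be applied to every later
block of the archive.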

-Fred
-- 
# Fred Fish    hao!noao!mcdsun!fnf    (602) 438-3614
# Motorola Computer Division, 2900 S. Diablo Way, Tempe, Az 85282  USA

syd@dsinc.UUCP (Syd Weinstein) (02/18/88)

In article <699@mcdsun.UUCP> fnf@mcdsun.UUCP (Fred Fish) writes:
:In article <2071@bsu-cs.UUCP> dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
:>An even more correct thing to do is for cpio to always write archive headers
:>in a canonical format that is not dependent on the byte-ordering of the
:>hardware.  E.g., all header data written least significant byte first.
:>
[ stuff deleted for brevity ]
:
:Unfortunately this will not work for one not-so-obvious reason, and that
:is because there are systems that when reading the exact same media,
:will return bytes ordered differently.
[again deleted]
:Several years ago after I got my first Unix system, I was so disgusted
:with tar and cpio that I wrote my own backup/archiver type program (bru)
:which has always handled this problem completely transparently to the
:user by writing ASCII formatted archives and doing whatever byte swapping
:was necessary.  The basic algorithm is:
:
:	1.	Examine block's magic number, if correct, no swapping.
:	2.	Swap all bytes and try again.  If correct, note byte swap
:		needed for all blocks.
:	3.	Swap all shorts and try again.  If correct, note short swap
:		also needed for all blocks.
:	4.	Swap all bytes and try again.  If correct, reset byte swapping
:		flag (swap shorts only).
:
:I have never encountered a machine where one of these four combinations of
:byte/short swapping didn't result in a readable archive, but I've seen
:each combination needed at least once for at least one machine.

I had to write a backup system also a while back, and I can add another
step to your algorithm that will force it to 8 steps.  There are some
machines out there that invert the bits on tapes.  I.e., you need to
NOT(~) it to make it work.   Adding this makes one repeat all four steps
again for bit inverted tapes.  Boy was that a pain to figure out.
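
A compact Python sketch of the eight-way search, assuming a made-up 4-byte
magic at the start of each block and block lengths that are multiples of
four (none of the names below come from a real backup program):

```python
from itertools import product

MAGIC = bytes.fromhex("12345678")  # hypothetical magic, for illustration


def invert(block: bytes) -> bytes:
    """Complement every bit, i.e. apply ~ to each byte."""
    return bytes(b ^ 0xFF for b in block)


def swap_bytes(block: bytes) -> bytes:
    """Swap the two bytes inside every 16-bit word."""
    return b"".join(block[i:i + 2][::-1] for i in range(0, len(block), 2))


def swap_shorts(block: bytes) -> bytes:
    """Swap the two 16-bit words inside every 32-bit word."""
    return b"".join(block[i + 2:i + 4] + block[i:i + 2]
                    for i in range(0, len(block), 4))


def find_transform(block: bytes):
    """Try all eight combinations. Return the (inverted, short_swap,
    byte_swap) flags that make the magic readable, or None."""
    for inverted, shorts, bytes_ in product((False, True), repeat=3):
        trial = invert(block) if inverted else block
        if shorts:
            trial = swap_shorts(trial)
        if bytes_:
            trial = swap_bytes(trial)
        if trial[:4] == MAGIC:
            return inverted, shorts, bytes_
    return None
```

The non-inverted combinations are tried first, so ordinary tapes are
recognized before the four bit-inverted cases are considered.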
-- 
=====================================================================
Sydney S. Weinstein, CDP, CCP
Datacomp Systems, Inc.				Voice: (215) 947-9900
{allegra,bellcore,bpa,vu-vlsi}!dsinc!syd	FAX:   (215) 938-0235

wcs@ho95e.ATT.COM (Bill.Stewart) (02/25/88)

In article <699@mcdsun.UUCP> fnf@mcdsun.UUCP (Fred Fish) writes:
:In article <2071@bsu-cs.UUCP> dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
:>An even more correct thing to do is for cpio to always write archive headers
:>in a canonical format that is not dependent on the byte-ordering of the
:>hardware.  E.g., all header data written least significant byte first.
	The cpio -c format uses an ascii character representation for the
	headers, which is transparent across byte orders.  It's
	unfortunately not the default, but if you always use it you won't
	get burned.  Unfortunately, cpio also doesn't detect -c format when
	reading, though I think the PD cpio-replacement program afio does.

	Of course, this can't tell you what to do with your *data*, but
	that's at a level transparent to cpio and tar.
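
Detecting the format when reading is not hard in principle. The Python
sketch below is only an assumption-laden illustration: it takes the "-c"
header to be 76 bytes of ASCII octal starting with "070707" (name size at
offset 59, file size at offset 65) and the binary header to start with the
16-bit value 070707 octal. The function names are invented.

```python
import struct

ASCII_MAGIC = b"070707"
BINARY_MAGIC = 0o070707


def sniff_cpio(head: bytes) -> str:
    """Classify the start of an archive as 'ascii', 'binary-le',
    'binary-be' or 'unknown' from its magic number alone."""
    if head[:6] == ASCII_MAGIC:
        return "ascii"
    if len(head) >= 2:
        if struct.unpack_from("<H", head)[0] == BINARY_MAGIC:
            return "binary-le"
        if struct.unpack_from(">H", head)[0] == BINARY_MAGIC:
            return "binary-be"
    return "unknown"


def ascii_sizes(head: bytes):
    """Return (namesize, filesize) from an ASCII header. The fields are
    octal text, so no byte swapping is ever needed."""
    return int(head[59:65], 8), int(head[65:76], 8)
```

A reader could call sniff_cpio() on the first few bytes and choose its
header parser accordingly, instead of requiring "-c" on the command line.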
-- 
#				Thanks;
# Bill Stewart, AT&T Bell Labs 2G218, Holmdel NJ 1-201-949-0705 ihnp4!ho95c!wcs