[net.unix] tar .vs. cpio

tut@ucbopal.CC.Berkeley.ARPA (08/07/84)

Could someone justify the existence of cpio?  What's wrong with tar?
As the saying goes, "if it ain't broke, don't fix it."  The only
portability problems I've ever encountered with tar were, I believe,
caused by 1) a strange Plexus tape drive, and 2) the unavailability
of tar on bare System V.  Tar descends directory hierarchies, while
cpio requires the aid of find.  Tar always uses character at a time
I/O, while cpio must be passed the -c flag to do this.  So what have
I overlooked?

Bill Tuthill
ucbvax!opal.tut

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (08/10/84)

I changed from tar to cpio for several reasons:
	- tar would overflow its link table (running on a PDP-11)
	  frequently and produce random behavior
	- cpio by default will not overwrite files during extraction
	  if the archive copy is older than the current one
	- cpio will match files using general patterns whereas tar
	  has no such feature
	- cpio can create a copy of a hierarchy using links rather
	  than copies of the files
I don't understand two of your comments, Bill.  "cpio -c" makes the
headers ASCII instead of binary; I don't know what "character at a
time I/O" is supposed to mean but this isn't it.  Also, tar is
supplied with UNIX System V as it comes from AT&T.

I move archives and especially partially-modified archives around
a lot and find cpio to be just what I need for this task.  I think
the choice between the two depends on:
	- whether one is exporting to a non-cpio site
	- whether the above differences are important
	- personal preference

greg@sdcsvax.UUCP (Greg Noel) (08/10/84)

In article <198@ucbopal.CC.Berkeley.ARPA> tut@ucbopal.CC.Berkeley.ARPA writes:
>Could someone justify the existence of cpio?  What's wrong with tar?

Actually, it's easy to justify the existence of cpio.  There are at least

three reasons:

(1) History.  They grew up at different organizations.  Tar comes from
Berkeley and cpio from Bell (now AT&T).  (This may not be completely
accurate -- I first saw cpio in the PWB release and I first saw tar in
a Berkeley system; for all I know, tar may have come with Version 7.  The
point remains the same, even if the different organizations were within
Bell.  Tar has changed (grown?) (groan?) at Berkeley (witness the recent
complaints about the incompatible tar formats); they picked tar and ran
with it, while AT&T went with cpio.  I don't justify it, or point fingers
at either organization; I just report it.)

(2) Problem.  They grew up to solve different problems.  Tar is a "tape
archiver" and its major function is to produce backups of filesystems.
(This was in the days when a filesystem would fit on a single tape.)  (All
right, that's oversimplified.)  Its functionality is based upon a program
called tp from Version 6 (did tp make it into Version 7?).  On the other
hand, cpio was designed from the ground up to solve a very different
problem -- selectively copying lists of files (actually, filesystem elements).
Thus, it is useful for distributions, or for copying recently-changed files
for backup, or for copying a selected part of a directory tree somewhere
else, or .....  Tar takes its list of files from the command line, effectively
limiting the number of arguments, while cpio takes them from the standard
input, giving no such limitation (this is why tar copies directory trees --
otherwise you couldn't get enough on the tape to make it useful).  (I actually
consider it a flaw of tar that you MUST copy ALL of a directory tree; there
is no way to make the choice of files conditional.)

(3) Philosophy.  Cpio is more in keeping with the Unix (tm) philosophy,
since it separates the job of SELECTING the files from the job of COPYING
the files.  ANY algorithm can be used to select the files to be copied,
but cpio can still be used to copy them.  In fact, I have an application
that tries to keep two sets of files in sync on different computers -- it
does it by running a shell script that scans a set of files and determines
which files have changed since the last run and then passes the names to
cpio to be copied.  There are about six thousand files to select from; on
a given day, anywhere from a hundred to several thousand will be selected
for transfer.  I don't think tar could do that as well.

In case you hadn't noticed, I prefer cpio.  There are times when tar is
better (if what you really want to do is copy all of a directory, it's
just fine, and the interface is simpler), but I find that if it is
complicated enough to need to write a shell script then cpio is usually
the program of choice.

Don't get me wrong -- cpio isn't perfect.  Internally, it's a nightmare,
and AT&T would be better off to rewrite the whole thing.  But it works
just fine, and it does what I want it to.

BTW, the -c option of cpio does not cause it to write one character at
a time; it causes the headers for each file to be in ASCII characters
instead of binary.  The output is still blocked.  Now if the null at the
end of the header could be changed into a carriage return, we could use
cpio instead of shar format......

(tm)  Unix is a footnote of AT&T Bell Laboratories
-- 
-- Greg Noel, NCR Torrey Pines       Greg@sdcsvax.UUCP or Greg@nosc.ARPA

wescott@ncrcae.UUCP (Mike Wescott) (08/11/84)

There are several reasons why I prefer cpio to tar, the biggest
one being that I used cpio first and got used to its peculiarities
rather than tar.  But prejudice aside:
	1. cpio handles special files (device nodes)
	2. pass option (-p) in combination with find allows me
		to easily move entire subtrees around
		(I realize it can be done with a tar-to-tar
		pipeline, but cpio is more straightforward)
	3. rename option when de-archiving allows greater flexibility
		in where to put/name things
	4. for archiving, find has -cpio and -ncpio options which
		do not require the pipe

Drawbacks in cpio:
	1. No easy way to override full pathname in the archive
	2. Loses phase easily; a bad spot on some records makes the
		rest of the archive unsalvageable
	3. File extraction does not include the directory and
		(recursively) everything in it when a directory
		is specified as the file to be extracted; it's
		annoying to remember to specify xyz* to get the
		xyz directory and its contents
	4. To be portable to the v7-based UNIXes I still need
		to use tar

Mike Wescott
NCR Corp.

    mcnc!ncsu!\
	       >ncrcae!wescott
akgua!usceast!/

johnl@haddock.UUCP (08/11/84)


>Could someone justify the existence of cpio?  What's wrong with tar?

There's nothing wrong with tar, but I like cpio better because it is a
lot more powerful than tar:

  - reading file names from stdin is a feature, not a bug.  You can use
    find to enumerate just the files you want rather than having to dump
    everything in a directory tree, e.g.

	$ find somedir -mtime -14 -print | cpio -oB >/dev/rmt0

    (dump only files modified within the last two weeks.)  Doing this
    with tar is pretty hard.  For the most common cases of cpio, we
    usually have little shell scripts.  Also (hack) the find command has
    a little of cpio built into it so the above example could be:

	$ find somedir -mtime -14 -cpio /dev/rmt0

  - cpio knows about special files and FIFOs.  Most versions of tar
    don't.  Could be fixed, of course.

  - cpio -p lets you copy a directory tree by linking (so that you
    have new names but the same files.)  Tar can't do that.

Basically, anything you can do with tar, you can do with cpio but not the
converse.

John Levine, ima!johnl

PS:  I offer no defense of the internal coding style of cpio, which
still has innumerable MERT-isms.  Ugh.

henry@utzoo.UUCP (Henry Spencer) (08/12/84)

Cpio pre-dates tar, I believe, and the USG/USDL/whatever-it's-called-this-week
turkeys didn't have the brains to drop cpio when tar arrived.

The last /usr/group standards meeting tentatively recommended using tar
as the standard tape format for interchange work, due to much wider
availability and better standardization ("cpio" is not a single format,
it's at least three different ones).
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

wjb@ariel.UUCP (W.BOGSTAD) (08/12/84)

	With the versions of tar and cpio I have used, there is
a big difference.  "tar" does not backup special files correctly.
The "dump" and "restor" programs available under version 7 are a
real pain to use so cpio was used instead.  I don't know if current
versions of tar have this problem, but it would be one reason to
justify cpio.

					Bill Bogstad

henry@utzoo.UUCP (Henry Spencer) (08/13/84)

It's worth noting that it is quite possible to write "tpio", i.e. a
program with the cpio user interface but the tar data format.  There
is no doubt that some of cpio's facilities are useful, but this is a
case where it's quite possible to have your cake and eat it too, by
combining some of the neat bits of user interface in cpio with the
much-more-widely-standard tar data format.
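
A sketch of the "tpio" idea, assuming a tar that can read its file list from standard input (some later tar implementations grew exactly this, e.g. a -T - flag; the path and device names below follow the earlier examples):

```shell
# cpio-style selection via find, tar data format on the tape.
# Assumes a tar that reads file names from standard input (-T -).
find somedir -mtime -14 -print | tar cf /dev/rmt0 -T -
```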
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

barmar@mit-eddie.UUCP (Barry Margolin) (08/13/84)

In article <226@haddock.UUCP> johnl@haddock.UUCP writes:
>  - reading file names from stdin is a feature, not a bug.  You can use
>    find to enumerate just the files you want rather than having to dump
>    everything in a directory tree, e.g.
>
>	$ find somedir -mtime -14 -print | cpio -oB >/dev/rmt0
>
>    (dump only files modified within the last two weeks.)  Doing this
>    with tar is pretty hard.

It isn't really very hard:
	tar <options> `find ...`

Accepting file names on the command line is the Unix convention.

Note that I have no real opinion on the debate.  I have only used tar so
far.

In response to someone's mention of "tp", the predecessor to "tar": it
is still available in 4.2BSD.
-- 
    Barry Margolin
    ARPA: barmar@MIT-Multics
    UUCP: ..!genrad!mit-eddie!barmar

dan@idis.UUCP (08/13/84)

I believe that the history of the "tar" program reported during the
"tar .vs. cpio" debate in net.unix is a bit confused.  The "tar" program
is from unix V7 and was intended to provide a convenient and machine
independent way of sending collections of files (e.g. a software
distribution) to other machines.  The most immediate ancestor of "tar"
is probably "ar".  Before version 7 (i.e. version 6), we used ad hoc
combinations of "tp", "ar", and "dd".  I suspect that "tp" was originally
designed to work with dectapes (small replaceable blocks) and extended
to handle 9 track mag tapes (potentially large but nonreplaceable blocks)
as an afterthought.

I doubt that "tar" was originally intended to be used for system backup.
This is why most versions of "tar" cannot handle special files or
multiple volumes.

The "cpio" program comes from PWB unix (which was developed at about the
same time as V7 but by a different group of people).  I imagine that
"cpio" was developed for some of the same reasons as "tar" ("tp", "ar",
and "dd" being inadequate).

				Dan Strick
				University of Pittsburgh
				[decvax|mcnc]!idis!dan

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (08/14/84)

There is a (relatively small) limit on the total length of
command line options, typically 5120 bytes.  I have lots of
archives whose component file names far exceed this limit.

wls@astrovax.UUCP (William L. Sebok) (08/14/84)

>In article <226@haddock.UUCP> johnl@haddock.UUCP writes:
>>  - reading file names from stdin is a feature, not a bug.  You can use
>>    find to enumerate just the files you want rather than having to dump
>>    everything in a directory tree, e.g.
>>
>>	$ find somedir -mtime -14 -print | cpio -oB >/dev/rmt0
>>
>>    (dump only files modified within the last two weeks.)  Doing this
>>    with tar is pretty hard.
>
>It isn't really very hard:
>	tar <options> `find ...`
>
>Accepting file names on the command line is the Unix convention.
>    Barry Margolin

No.  It may be the Unix convention but it is not useful for dumping large
numbers of files (like when doing a backup).  There is a limit on how large
an argument list can be passed to a program.  On the Vax under 4.2 BSD as
distributed this is 10240 characters.  This can be easily exceeded in a
medium large file system.  Heck, it is sometimes exceeded in a run-away
uucp spool directory.
-- 
Bill Sebok			Princeton University, Astrophysics
{allegra,akgua,burl,cbosgd,decvax,ihnp4,noao,princeton,vax135}!astrovax!wls

jhall@ihuxu.UUCP (John R. Hall-"the samurai MTS") (08/14/84)

If Berkeley UNIX has the xargs command, I believe you could use the
following technique to avoid exceeding the MAXARGS parameter (usually
10 blocks default):
	find ... | xargs tar [options]

xargs reads an argument list from standard input, and repeatedly
builds up and executes a command line for the specified command.

xargs is part of System V; it's not in my very old Berkeley UNIX manuals...
-- 
--John R. Hall, ...ihnp4!ihuxu!jhall "And may your days be celebrations"

adm@cbneb.UUCP (08/14/84)

cbnap!whp    Aug 14 09:36:00 1984

>>It isn't really very hard:
>>	tar <options> `find ...`
>>
>>Accepting file names on the command line is the Unix convention.
>>    Barry Margolin
>
>No.  It may be the Unix convention but it is not useful for dumping large
>numbers of files (like when doing a backup).  There is a limit on how large
>an argument list can be passed to a program.

I don't know about BSD UNIX, but in sys V you can always use:
	find <options> |xargs tar <tar options>

(xargs is a program that reads stdin, constructs and executes a proper
command line, and repeats until eof is found on stdin)
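
As a hypothetical illustration of that batching (echo stands in for the real command; note that tar's r key, append, is the safe choice here, since each xargs batch re-runs the command and c would overwrite the previous batch's output):

```shell
# xargs splits a long name list into batches, each small enough for
# the exec() argument limit.  -n 2 forces tiny batches so the effect
# is visible; echo shows the command lines instead of running tar.
printf '%s\n' file1 file2 file3 | xargs -n 2 echo tar rf /dev/rmt0
# prints:
#   tar rf /dev/rmt0 file1 file2
#   tar rf /dev/rmt0 file3
```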

marcus@pyuxt.UUCP (M. G. Hand) (08/15/84)

I've always found cpio quite adequate and usable, except for one thing:
I want a -U option which would UNCONDITIONALLY copy files over existing
ones whose permissions were, e.g., 444.  Of course, it should still obey
the other rules about ownership (i.e. it would overwrite those files
which would be deleted by rm -f).  This would save a lot of hassle.

		Marcus Hand	(pyuxt!marcus)

bsa@ncoast.UUCP (The WITNESS) (08/15/84)

First, both tar and tp were in V7 Unix.  Second, V7-flavored Unixes (including
99% of Xenix) lack cpio; -c format with ^J at the end of a header would not
be readable by us.  I find the #!/bin/sh stuff bad enough as it is; PLEASE
don't do THAT to me! :-)

--bsa
-- 
      Brandon Allbery: decvax!cwruecmp{!atvax}!bsa: R0176@CSUOHIO.BITNET
					       ^ Note name change!
	 6504 Chestnut Road, Independence, OH 44131 <> (216) 524-1416

"The more they overthink the plumbin', the easier 'tis tae stop up the drain."

geoff@callan.UUCP (Geoff Kuenning) (08/17/84)

Nobody seems to have mentioned special files.  Cpio correctly saves and
restores the files in /dev and FIFOs.  Most tars break on these (although,
again, this can be fixed easily).
-- 

	Geoff Kuenning
	Callan Data Systems
	...!ihnp4!wlbr!callan!geoff

jab@uokvax.UUCP (08/23/84)


/***** uokvax:net.unix / ihuxu!jhall / 12:56 pm  Aug 14, 1984 */

If Berkeley UNIX has the xargs command, I believe you could use the
following technique to avoid exceeding the MAXARGS parameter (usually
10 blocks default):
--John R. Hall, ...ihnp4!ihuxu!jhall "And may your days be celebrations"
/* ---------- */

Umm, I had the understanding that MAXARGS/NCARGS was the number of bytes passed
as arguments from one program to the program it was exec'ing. The code
in the exec(2) system call allocates only so much space for the arguments
when it begins to fabricate the running program, and it's quite unwilling
to let you pass MORE than those numbers.

xargs(1) is only a program --- it still runs the command in question using
the exec(2) system call, and is still stuck with those constraints.

	Jeff Bowles
	Lisle, IL

barmar@mit-eddie.UUCP (Barry Margolin) (08/23/84)

In article <3948@brl-tgr.ARPA> gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) writes:
>There is a (relatively small) limit on the total length of
>command line options, typically 5120 bytes.  I have lots of
>archives whose component file names far exceed this limit.

I guess I'm completely spoiled by Multics.  I'm used to command lines
that can take up a 1 megabyte segment and stack frames that can handle
argument lists that have up to 16K parameters (64Kwords/max-stack-frame,
two pointers/parameter, 2 words/pointer).
-- 
    Barry Margolin
    ARPA: barmar@MIT-Multics
    UUCP: ..!genrad!mit-eddie!barmar