[comp.std.unix] tar vs. cpio

std-unix@ut-sally.UUCP (Moderator, John Quarterman) (06/02/87)

Included below is a draft proposal for IEEE P1003.1 regarding
the recently raised issue of Archive/Data Interchange Format.
I will deliver a proposal resembling it to P1003.1 at their
next meeting, which is three weeks from today, in Seattle.

Note two things:  this is a proposal for P1003.1, not P1003.2,
or any other group; if you disagree with my conclusions, you
can submit your own proposal-- the address is below.

If you agree with my approach but think it needs adjusting,
you can send me mail or submit articles.  If you disagree, you
can also do those things.



                                  tar vs. cpio      IEEE P1003.1 N.___
                                                           1 June 1987

                               John S. Quarterman

                    Institutional Representative from USENIX
                                   usenix!jsq



          Secretary, IEEE Standards Board
          Attention: P1003 Working Group
          345 East 47th St.
          New York, NY 10017

          In both the Trial Use Standard and Draft 10, POSIX sS10.1
          describes a data interchange format based on the tar
          program.  That section has appeared in every draft of IEEE
          1003.1 in some form and has always been based on tar format.
          The P1003.1 Working Group has recently received two related
          proposals regarding that section: one to add cpio format
          (including old-style, non-ASCII (non c option) format);
          <N.048 Lorraine C. Kevra> <V11N14> <V11N25 Eric S. Raymond>
          the other to replace the existing tar-based format with cpio
          format.  <N.043 X/OPEN> <V11N13> Some clarifications were
          received to the former.  <N.064 Dominic Dunlop> <V11N15> It
          was also proposed verbally in the latest Working Group
          meeting to drop sS10.1 altogether and let P1003.2 handle the
          issue.  <V11N08> <V11N11> <V11N09 Guy Harris> <V11N12 Doug
          Gwyn>

          The present note is a response to those proposals.  Much of
          the detail in it is derived from articles posted in the
          USENET newsgroup comp.std.unix.  Those articles are
          referenced with this format: <V11N09 Guy Harris> which gives
          the volume (11) and number of the article, and the name of
          the submitter.  If no submitter name is given, the posting
          was by the moderator, John S. Quarterman.  Thanks to those
          who submitted articles.  However, the content of this note
          is solely the responsibility of the author.

          There are a number of problems with both cpio formats.
          First, those related to the non-ASCII format:

            1.  Numerous parameters, including inode numbers, mode
                bits, and user and group IDs, are kept in two-byte
                binary integers.  This has historically produced
                serious byte-order problems when data is moved among
                systems with different byte orders.  <V11N09 Guy
                Harris>

            2.  The byte-swapping and word-swapping options to the
                cpio program are inadequate patches; with an ASCII
                format the problem would not be present.  The options
                are not consistent across versions of the program: in
                System III, data blocks and file names are byte
                swapped; in System V, only data blocks are byte
                swapped.  <V11N09 Guy Harris>

            3.  The two-byte integer format limits the range of inode
                numbers to 1..65535.  Many current file systems are
                bigger than that.  <V11N37 Paul Eggert> <V11N39 Henry
                Spencer>

          Non-ASCII cpio format is clearly not portable and should not
          even be considered for standardization.  <V11N12 Doug Gwyn>
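
          The byte-order problem in items 1 and 2 can be illustrated
          with a minimal C sketch.  It assumes the well-known magic
          number 070707 of the binary cpio header, written low byte
          first; the function names get16_le and get16_be are
          hypothetical and come from no cpio source.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a two-byte header field stored low byte first. */
uint16_t get16_le(const unsigned char *p)
{
    assert(p != 0);
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Decode the same two bytes as a high-byte-first reader would. */
uint16_t get16_be(const unsigned char *p)
{
    assert(p != 0);
    return (uint16_t)((p[0] << 8) | p[1]);
}
```

          A reader that decodes the bytes 0xC7 0x71 in the wrong order
          sees 0143561 instead of 070707, and every other two-byte
          field in the header is scrambled the same way.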

          There are several problems that occur even with the ASCII
          cpio format:

            1.  Many implementations of cpio only look at the lower 16
                (or even 15) bits of the inode number, even in ASCII
                format.  <V11N39 Henry Spencer> This is because the
                variable that is used to contain the value is declared
                to be unsigned short, just as in binary format.  Thus,
                even though ASCII cpio format does not constrain this
                number, it is still less than portable.  <V11N37 Paul
                Eggert>

            2.  The proposed cpio ASCII format as specified, <N.048
                Lorraine C. Kevra> <V11N14> is not portable because
                the proposal assumes that sizeof(int) == sizeof(long).
                <N.064 Dominic Dunlop> <V11N15>

            3.  The file type is written in a numerical format, making it
                UNIX specific rather than POSIX specific, since POSIX
                (and tar) specifies symbolic, rather than numerical,
                values for file types.  <V11N09 Guy Harris>

            4.  Hard links are not handled well, since cpio format
                does not record that two files are linked.  If two
                files that are linked are written in cpio format, two
                copies will be written.  There is an option to the
                cpio program to detect duplicate files by matching
                pairs of (h_dev, h_ino) and producing links, but that
                is done after the fact.  <V11N09 Guy Harris> (There is
                a program, afio, that handles cpio format more
                efficiently in this and other cases than the licensed
                versions of the program.) <V11N21 Chuck Forsberg>

            5.  Symbolic links are not handled at all, and no type
                value is reserved for them.  This makes cpio useless
                on a large class of historical implementations (those
                based on 4.2BSD or its file system) for one of the
                main purposes of POSIX sS10.1: archiving files for
                later retrieval and use on the same system.

            6.  The cpio format is less common than tar format: there
                are few historical implementations from Version 7 on
                that do not have tar; there are many that do not have
                cpio.  <V11N09 Guy Harris> <V11N10 Charles Hedrick>
                <V11N24 Jim Cottrell> It is true that cpio (non-ASCII
                format) was invented before tar, <V11N22 Joseph S. D.
                Yao> apparently in PWB System 1.0.  <V11N26 Joseph S.
                D. Yao> However, cpio was not available outside AT&T
                before the release of System III, while tar was in
                wide use with Version 7 and is still much more common.
                Also, it appears that the cpio format of PWB was not
                the same as that of System III.  <V11N39 Henry
                Spencer> Although System III and perhaps early
                releases of System V did not include tar, <V11N26
                Joseph S. D. Yao> current releases of System V do.

            7.  It is very late in the process to propose that P1003.1
                adopt cpio format now, especially considering that it
                was originally proposed to and rejected by the
                /usr/group committee before P1003.1 was even formed.
                <V11N39 Henry Spencer>
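
          How a 16-bit inode variable (item 1) can defeat link
          detection (item 4) might be sketched as follows.  The
          struct and function names are hypothetical, and the sketch
          assumes a reader that treats two members as links whenever
          their (device, inode) pairs match.

```c
#include <assert.h>

/* Hypothetical record of what a reader remembers about each member. */
struct seen {
    unsigned short dev;   /* device number as stored in the header */
    unsigned short ino;   /* inode number, already narrowed to 16 bits */
};

/* Narrow a full inode number the way a 16-bit variable would. */
unsigned short narrow_ino(unsigned long ino)
{
    return (unsigned short)(ino & 0xFFFFu);
}

/* Treat two members as links to one file when (dev, ino) match. */
int looks_linked(const struct seen *a, const struct seen *b)
{
    assert(a != 0 && b != 0);
    return a->dev == b->dev && a->ino == b->ino;
}
```

          Inodes 5 and 65541 on the same device narrow to the same
          16-bit value, so such a reader would wrongly link two
          unrelated files.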

          There are several advantages to the current tar-based format
          as specified in sS10.1:

            1.  There are no byte- or word-swapping issues caused by
                the format, since all the header values are ASCII byte
                streams.  <V11N17 John Gilmore>

            2.  There are no inode numbers recorded, and file types
                are kept in symbolic form, so the format is less
                implementation-specific than cpio format.  <V11N17
                John Gilmore>

            3.  Historical tar format is the most widely used, as
                discussed in 6. above, despite apparent assertions to
                the contrary.  <N.043 X/OPEN> <V11N13>

            4.  The format specified in sS10.1 is upward-compatible
                with tar format.  Old tar archives can be extracted by
                a program that implements sS10.1.  Archives using some
                of the extensions of sS10.1 can be extracted with old
                (Version 7) tar programs, although symbolic links will
                not be extracted and contiguous files will not be
                handled properly (cpio does not handle these
                capabilities at all).  Files with very long names will
                not be handled properly (cpio does no better at this).
                All tar implementations are compatible to this extent.
                <V11N17 John Gilmore>

            5.  The /usr/group working group and P1003.1 have already
                done the work <P.061> <M.019 5.1.121 Pg.13> <RFC.003
                #121> <P.038> <P.006> required to add optional
                extensions (such as symbolic links, contiguous files,
                and long file names) that are needed on many
                historical implementations and that cpio format lacks.

            6.  The format is extensible for future facilities.
                <V11N39 Henry Spencer>

            7.  There is a public domain implementation of the format
                of sS10.1.  That implementation provided feedback which
                led to improvements in the current specification, and
                has been in use for years in transferring data with
                licensed tar implementations.  <V11N17 John Gilmore>

            8.  Many people prefer the user interface of the cpio
                program to that of the tar program, because the former
                can accept a list of pathnames to archive on standard
                input while the latter takes them as arguments,
                limiting the length of the list.  <V11N34 Andrew
                Tannenbaum> However, the above-mentioned public domain
                implementation of tar accepts pathnames on standard
                input.  <V11N17 John Gilmore> <V11N19 Jim Cottrell>
                Diffs to standard tar to add an option to accept
                pathnames on standard input when creating an archive
                have also been posted to USENET.  <V11N36 John
                Gilmore> The user interface is, in any case,
                irrelevant to P1003.1.  <V11N39 Henry Spencer> <V11N40
                Rahul Dhesi>
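
          Item 1 rests on header values being ASCII digit strings
          rather than binary integers.  A minimal sketch of reading
          such a field, assuming octal digits with optional leading
          spaces and a NUL or space terminator (parse_octal is a
          hypothetical name, not taken from any tar source):

```c
#include <assert.h>
#include <stddef.h>

/* Parse an octal number held as ASCII digits in a fixed-width field. */
unsigned long parse_octal(const char *field, size_t width)
{
    unsigned long value = 0;
    size_t i = 0;

    assert(field != 0);
    while (i < width && field[i] == ' ')   /* skip leading spaces */
        i++;
    /* Stop at the first non-octal byte (NUL, space) or at the field end. */
    for (; i < width && field[i] >= '0' && field[i] <= '7'; i++)
        value = value * 8 + (unsigned long)(field[i] - '0');
    return value;
}
```

          Because each digit is a single byte, the result is the same
          whatever the byte order of the machine reading the archive.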

          There are some problems that neither tar nor cpio handles
          well.

            1.  An option to prevent crossing mount points would be
                useful for backups.  <V11N19 Jim Cottrell> <V11N22
                Joseph S. D. Yao> However, this appears to be more of
                an implementation issue than a format issue, <V11N28
                Dave Brower> <V11N32 Joseph S. D. Yao> especially
                considering that there are options to find in 4.2BSD,
                <V11N24 Jim Cottrell> SunOS 3.2, <V11N36 John Gilmore>
                and System V Release 3.0 <V11N35 Mike Akre> that take
                care of this.

            2.  The default block size in many tar implementations is
                too large for some tape controllers to read <V11N27
                Rob Lake> (the 3B20 has this problem).  This is not a
                problem with the interchange format, however.

          There is nothing that the proposed cpio can handle that the
          tar-based format already in POSIX sS10.1 cannot handle; in
          fact, the former is less capable.  If cpio format were
          augmented to handle missing capabilities, it would be
          subject to the same objections now aimed at the format given
          in sS10.1: that it was not identical with an existing format.

          There is no advantage in replacing the current tar-based
          format of sS10.1 with cpio format.  There is also no
          advantage in adding cpio format, because two standards are
          not as good as a single standard.

          Some have recommended removing sS10.1 from POSIX altogether,
          <V11N12 Doug Gwyn> perhaps with a recommendation for P1003.2
          to pick up the idea.  <V11N09 Guy Harris> While I believe
          that that would be preferable to adding cpio format, whether
          or not tar format remains, I recommend leaving sS10.1 as it
          is, because

             o  The inclusion of an archive/interchange file format is
               in agreement with the purpose of POSIX to promote
               portability of application programs across interface
               implementations.  Some format will be used.  It is to
               the advantage of the users of the standard for there to
               be a standard format.

             o  The de facto standard is tar format.  The current sS10.1
               standardizes that, and provides upward-compatible
               extensions in areas that were previously lacking.

          The Archive/Interchange File Format should be left as it is.

                                                  Thank you,



                                                  John S. Quarterman






Volume-Number: Volume 11, Number 41

std-unix@ut-sally.UUCP (06/02/87)

From: <gmt@arizona.edu> (Gregg Townsend)

I agree with you on this and appreciate the effort it took to collect
all the various points in a single document.  Here is one additional
point about the ASCII cpio format that I think is worth mentioning:

	n.  Even when a cpio archive is composed entirely of text
	    files, it can be tricky to transport because the file
	    name is terminated by a 00 byte.


Hope this is helpful.  Do with it what you wish.

     Gregg Townsend / Computer Science Dept / Univ of Arizona / Tucson, AZ 85721
     +1 602 621 4325      gmt@Arizona.EDU       110 57 17 W / 32 13 47 N / +758m

Volume-Number: Volume 11, Number 43

std-unix@ut-sally.UUCP (06/02/87)

From: guy@sun.com (Guy Harris)

I agree with the proposal; these are just some nits.

          ...meeting to drop sS10.1 altogether...

The sequence "s^HS" appears here, and in several other places - is
this intentional or a bizarre result from "nroff"?

[ It's nroff's attempt to produce a section sign.  The actual note
will be formatted with troff, which can handle it.  I will incorporate
your other comments.  -mod ]

            4.  Hard links are not handled well, since cpio format
                does not record that two files are linked.  If two
                files that are linked are written in cpio format, two
                copies will be written.  There is an option to the
                cpio program to detect duplicate files by matching
                pairs of (h_dev, h_ino) and producing links, but that
                is done after the fact.

Actually, this is the standard way "cpio" handles hard links; it's
not an option.

            5.  Symbolic links are not handled at all, and no type
                value is reserved for them.  This makes cpio useless
                on a large class of historical implementations (those
                based on 4.2BSD or its file system) for one of the
                main purposes of POSIX sS10.1: archiving files for
                later retrieval and use on the same system.

(Another s^HS here)  It is possible to extend this format to handle
symbolic links; we have done this.

[ But remember that what was proposed to P1003.1 was existing System V
cpio format. -mod ]

                ...However, cpio was not available outside AT&T
                before the release of System III, while tar was in
                wide use with Version 7 and is still much more common.

Actually, the old "cpio" was available with PWB/UNIX 1.0, which AT&T
did release.

                Also, it appears that the cpio format of PWB was not
                the same as that of System III.  <V11N39 Henry
                Spencer> Although System III and perhaps early
                releases of System V did not include tar, <V11N26
                Joseph S. D. Yao> current releases of System V do.

No, System III and all releases of S5 included "tar".

Volume-Number: Volume 11, Number 45

std-unix@ut-sally.UUCP (06/03/87)

From: trb@ima.ISC.COM (Andrew Tannenbaum)

>            6.  ...
>		<V11N39 Henry Spencer> Although System III and perhaps
>		early releases of System V did not include tar, <V11N26
>		Joseph S. D. Yao> current releases of System V do.

User manuals for UNIX Release 3.0 (June 1980) and Release 5.0 (June
1982) both contain (identical) tar man pages.  The changes between the
V7 (January 1979) tar man page and these two seem to be cosmetic.  Both
contain cpio man pages, the 5.0 cpio has the byteswapping options, 3.0
does not.

This doesn't mean that Sys III and Sys V had tar, but it does indicate
an intention to include it.

	Andrew Tannenbaum   Interactive   Boston, MA   +1 617 247 1155

Volume-Number: Volume 11, Number 47

std-unix@ut-sally.UUCP (06/03/87)

From: jss@ulysses.uucp (Jerry Schwarz)

One problem with tar format is that it has a fixed 100 character
limit on the length of pathnames.   While this may not have
been a problem in practice, it would be short sighted to adopt
a standard with this limitation.  

[ The format in POSIX 10.1 has a way of handling longer file names. -mod ]

Jerry Schwarz

Volume-Number: Volume 11, Number 49

MIKEMAC%UNBMVS1.BITNET@wiscvm.wisc.edu (Michael MacDonald) (06/05/87)

From: MIKEMAC%UNBMVS1.BITNET@wiscvm.wisc.edu (Michael MacDonald)

    I have just finished working on a CPIO tape reader and approx 1 year
ago a TAR tape reader for our IBM3090 180/vf running MVS/XA.

   The following comments may be of interest as they come from a slightly
different point of view. I do not have significant *ix experience and
the following comments come as a result of trying to pick apart these tapes
when they are used for data interchange.

   TAR and CPIO are *used* for purposes of backup AND data interchange.

   TAR Format comments.
       1)  Data is written as blocks of 512 bytes. This allows for faster
           processing and this is important for BIG files.
[ Most implementations allow using tape blocks larger than that.  -mod ]
       2)  There is room left in the header. This allows for customization
           by a site while still allowing other sites to read the tape
           without using the customized version (if they do it right).
       3)  The length of the NAME and the LINKNAME field is not enough.
           Extending the length to 256 would extend the header to 2 blocks
           but I think that extending the length outweighs the disadvantages.
[ In addition to

#define NAMSIZ	100
	char	name[NAMSIZ];

POSIX Section 10.1 also has

#define PFXSIZ	155
	char	prefix[PFXSIZ];

which is used when name isn't big enough.  The total of the two is set
to match the minimum permissible value of PATH_MAX.  -mod ]
       4)  All of the tape drives that I have worked with (not that many)
           are capable of writing a short block. If TAR would recognize
           a physical end of file rather than two blocks of hex 00's,
           this would solve a number of problems with TAR.
       5)  Limited amount of Unix dependent information in the header.
           If a *backup* system is used for data interchange is it really
           necessary to add many Operating System dependent features.
           Are the advantages gained by using these dependencies *really*
           advantages even in a backup system?
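
   The moderator's note on name and prefix under TAR comment 3 can be
   sketched in C.  This assumes, as in the ustar format later published in
   POSIX, that a nonempty prefix is joined to name with a slash and that
   either field may fill its array with no terminating NUL; join_path and
   field_len are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAMSIZ 100
#define PFXSIZ 155

/* Length of a header field that may fill its array without a NUL. */
size_t field_len(const char *field, size_t size)
{
    const char *nul = memchr(field, '\0', size);
    return nul ? (size_t)(nul - field) : size;
}

/* Build prefix + "/" + name in out, which must hold
 * PFXSIZ + 1 + NAMSIZ + 1 bytes.  Returns the path length. */
size_t join_path(const char *prefix, const char *name, char *out)
{
    size_t plen, nlen, n = 0;

    assert(prefix != 0 && name != 0 && out != 0);
    plen = field_len(prefix, PFXSIZ);
    nlen = field_len(name, NAMSIZ);
    if (plen > 0) {              /* prefix is used only when present */
        memcpy(out, prefix, plen);
        n = plen;
        out[n++] = '/';
    }
    memcpy(out + n, name, nlen);
    n += nlen;
    out[n] = '\0';
    return n;
}
```

   With an empty prefix the result is just name, so archives that never
   need the extra field look the same as before.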

   CPIO Format comments.
        1)  Data is not block oriented. This slows down processing
            considerably.
        2)  There is no room left in the header. No customization
            possible (without also sending the customized program).
        3)  Is 128 that much better than 100? See TAR note 3.
        4)  The CPIO end of file mark (TRAILER!!!) why not a physical EOF
            See TAR note 4.
        5)  When it comes to OS dependent information the CPIO header is
            full of it.
        6)  After writing the CPIO tape reader I came across a ?serious?
            problem. (The following note is from the unix manual page cpio(4).)
            The h_name field is "h_namesize rounded to word" long. The
            header must begin on a word boundary (although not documented).
            The wordsize of the machine is not a CPIO option (as far as I can
            tell). This means CPIO tapes cannot be read on a machine with
            a different wordsize. I question if this "feature" should be
            standardized without at least a wordsize option.
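
   The concern in CPIO note 6 can be made concrete with a small sketch.
   The word size is a parameter here purely for illustration, since the
   manual page wording quoted above does not say how many bytes a word is;
   round_to_word is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Round a byte count up to a multiple of the given word size in bytes. */
size_t round_to_word(size_t nbytes, size_t wordsize)
{
    assert(wordsize > 0);
    return (nbytes + wordsize - 1) / wordsize * wordsize;
}
```

   For a 9-byte name, a reader assuming 2-byte words would skip 10 bytes
   and one assuming 4-byte words would skip 12, so the two would disagree
   about where the next header starts.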

 Michael MacDonald
 Software Specialist, School of Computer Science
 University of New Brunswick
 Po. Box 4400
 Fredericton, New Brunswick
 CANADA    E3B 5A3

 (506) 453-4566

 Netnorth/BITNET: MIKEMAC@UNB

Disclaimer: The opinions stated are mine, no one likes them around here either.

Volume-Number: Volume 11, Number 50

std-unix@ut-sally.UUCP (Moderator, John Quarterman) (06/06/87)

Since the tar and cpio comments keep coming in, I will let them collect
(posting them meanwhile) until about 16 June, after which I will incorporate
them into the note I posted previously and deliver same to IEEE P1003.1.

Volume-Number: Volume 11, Number 51

std-unix@ut-sally.UUCP (06/07/87)

From: billj@dvlmarv.uucp (Bill Jones)

In article <8188@ut-sally.UUCP> you write:
>                However, cpio was not available outside AT&T
>                before the release of System III, while tar was in
>                wide use with Version 7 and is still much more common.

My memory is fuzzy now, but I recall cpio having been distributed on
the V7 addendum tape, whose other contents were (I think) fsck, the
line printer driver, and a c2 cured of certain overoptimism.  (This is
a nit picked for historical accuracy only:  I believed then, and still
do, that tar is the better format.  I'm not even keen that this should be
posted, especially if you cannot verify the assertion.)
[ Can anybody verify this?  -mod ]
-- 
Bill Jones, Develcon Electronics, 856 51 St E, Saskatoon S7K 5C7 Canada
uucp:  ...ihnp4!sask!zaphod!billj               phone:  +1 306 931 1504

Volume-Number: Volume 11, Number 53

std-unix@ut-sally.UUCP (06/08/87)

From: seismo!munnari!yabbie.oz!rcodi (Ian Donaldson)

In article <8208@ut-sally.UUCP>:
> From: MIKEMAC%UNBMVS1.BITNET@wiscvm.wisc.edu (Michael MacDonald)
> 
>     I have just finished working on a CPIO tape reader and approx 1 year
> ago a TAR tape reader for our IBM3090 180/vf running MVS/XA.

Sounds reasonable.  About the same time it would seem, I wrote an implementation
of Tar in Pascal under NOS on a Cyber 170, with 60-bit words.  See comments
below.
 
> #define PFXSIZ	155
> 	char	prefix[PFXSIZ];
> 
> which is used when name isn't big enough.  The total of the two is set
> to match the minimum permissible value of PATH_MAX.  -mod ]

I don't agree - there should be no limit to the pathname size.
4.2bsd has a limit of MAXPATHLEN (unfortunately), but it is a reasonable
value (1024).  Probably the reason that was enforced was a side-effect
of the speedups to nami(), using one copyinstr() rather than judicious use
of fubyte().  The code could be tweaked to overcome this limit without
too much trouble no doubt.  V7 and SVR2 don't have any such limit.

100+155 just doesn't seem enough in comparison for Tar, but in practice
it would suffice 99% of the time.

>        4)  All of the tape drives that I have worked with (not that many)
>            are capable of writing a short block. If TAR would recognize
>            a physical end of file rather than two blocks of hex 00's.
>            This would solve a number of problems with TAR.

Yes, but tar is not always used with tapes, and is not always used
with machines that have 8-bit bytes (eg: Cyber).  

I have often written tar archives to raw disk partitions.  If you were 
to use the OS EOF concept here, it would fail miserably since the archive 
is only a fraction of the size of the partition usually. 

The implementation of Tar I had on the Cyber worked well but it would have
been much more complicated to make it recognize a physical EOF half way through
a 60-bit word (yes there are 7.5 8-bit bytes/60 bit word, and that's
how they arrive from the tape-drives or disks).  NOS does provide a
rather perverted way of determining how many unused bits are in the last
word, but that information is typically only available to the assembly
language programmer, or system guru, and is device dependent.

Remember that tar and cpio work with any "file" that you tell them to.
They are not hard-coded for tapes.
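
A minimal sketch of the in-band end marker discussed above, assuming
512-byte blocks and an archive already read into memory.  The function
names are hypothetical, and the sketch ignores that file data may itself
contain all-zero blocks; a real reader would test only blocks where a
header is expected.

```c
#include <assert.h>
#include <stddef.h>

#define BLKSIZ 512   /* tar block size in bytes */

/* Return 1 when every byte of a 512-byte block is zero. */
int block_is_zero(const unsigned char *blk)
{
    size_t i;

    assert(blk != 0);
    for (i = 0; i < BLKSIZ; i++)
        if (blk[i] != 0)
            return 0;
    return 1;
}

/* Return the index of the first of two consecutive zero blocks, or -1. */
long find_end_marker(const unsigned char *buf, size_t nblocks)
{
    size_t i;

    for (i = 0; i + 1 < nblocks; i++)
        if (block_is_zero(buf + i * BLKSIZ) &&
            block_is_zero(buf + (i + 1) * BLKSIZ))
            return (long)i;
    return -1;
}
```

Because the marker is part of the data stream, the same test works on a
tape, a raw disk partition, or an ordinary file.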

>        5)  Limited amount of Unix dependent information in the header.
>            If a *backup* system is used for data interchange is it really
>            necessary to add many Operating System dependent features.

Yes, but remember this is a UNIX archive mechanism.  Thus it should be able
to save/restore files on most UNIX systems.  It was not designed for
other operating systems.  Information that is not common to most
implementations of UNIX is probably not worth putting in the header.
Those implementations that don't support such information can safely
ignore it anyway.
There is no such information there at the moment in Tar headers.
There is an inode/dev pair in cpio's header however, but this is used
for link determination at extract time.

[ We are discussing standardizing a data interchange/archive format 
in a standard that its authors explicitly wanted to be implementable
on hosted, i.e., non-UNIX-based, systems.  The inclusion of inode
numbers is a problem for such implementations, especially when it
is not necessary, as demonstrated by the tar format.  -mod ]

>    CPIO Format comments.
>         1)  Data is not block oriented. This slows down processing
>             considerably.

It's a time/space tradeoff.

Ian D.


Volume-Number: Volume 11, Number 54

std-unix@ut-sally.UUCP (06/11/87)

From: davidsen@steinmetz.uucp (William E. Davidsen Jr)

In article <8208@ut-sally.UUCP>:
>From: MIKEMAC%UNBMVS1.BITNET@wiscvm.wisc.edu (Michael MacDonald)
>   TAR Format comments.
I realize this hasn't stopped some people, but I will pass on tar
comments because I'm not an expert.

>
>   CPIO Format comments.
>        1)  Data is not block oriented. This slows down processing
>            considerably.
I miss this one. It may slow things under MVS, but there's no reason
why reading less physical data should slow things down. Quite the
opposite.

>        2)  There is no room left in the header. No customization
>            possible (without also sending the customized program).
This is a major advantage. Save us from "custom standard" format. The
custom stuff belongs in the *file*, not the format (in my opinion).

>        3)  Is 128 that much better than 100? See TAR note 3.
Although I've never been bitten by this, it could be a problem. I'm not
sure that it justified scrapping a format which is widely used. cpio
does allow dumping from a relative directory if you have a system with
pathnames longer than the files.

>        4)  The CPIO end of file mark (TRAILER!!!) why not a physical EOF
>            See TAR note 4.
cpio will run nicely to other media such as floppy disk and/or
removable disk packs. Most device drivers don't support any EOF on
these other than the physical size of the media. You can also have
multiple cpio dumps on a single file, although this is most useful when
doing incremental backups.

>        5)  When it comes to OS dependent information the CPIO header is
>            full of it.
We *are* talking about a U*IX standard here. For data interchange
between unlike systems we have the ANSI standard for tapes, which has
been around since at least 1975 because I wrote a driver for it on a
custom o/s. In FORTRAN. Yes, barf!
[ See comments in previous article about what IEEE 1003.1 is. -mod ]

>        6)  After writing the CPIO tape reader I came across a ?serious?
>            problem. (The following note is from the unix manual page cpio(4)
>            The h_name field is "h_namesize rounded to word" long. The
>            header must begin on a word boundary (although not documented).
>            The wordsize of the machine is not a CPIO option (as far as I can
>            tell). This means CPIO tapes cannot be read on a machine with
>            a different wordsize. I question if this "feature" should be
>            standardized without at least a wordsize option.
I confess I don't understand the wording here, but cpio is *not*
limited in this way as far as I can tell. I routinely transfer files
from Xenix (16 bit) to PC/IX (16 bit), to VAX, Sun3, and unix-pc (all
32 bit), and from time to time Cray2 (64 bit). It all works, so I think
the wording is at fault here, not the method.
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {chinet | philabs | sesimo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

Volume-Number: Volume 11, Number 56

std-unix@ut-sally.UUCP (06/11/87)

From: jsdy@hadron.uucp (Joseph S. D. Yao)

John,

Thanks for an excellent summary.  However, I don't think I ever
said that System III or System V didn't have tar.  I don't have
an archive to check, but that may have been when I posited that
V7 distribution tapes probably did not come in tar format, since
prior to that tar had not been distributed.  System V may have
had tar from the beginning.  I do not remember about System III.

Berkeley 4BSD tapes did come in tar format, I remember; but they
assumed that you had already received your V7/32V tapes, and were
in a different format from the V7/32V tapes.
[ You're evidently referring to some pretty old BSD tapes. -mod ]

Once tar had come out, versions were written which could read
tar distribution tapes under V6 and PWB 1.0.  (PWB was released
outside of AT&T, BTW.)

	Joe Yao		jsdy@hadron.COM (not yet domainised)
	hadron!jsdy@{seismo.CSS.GOV,dtix.ARPA,decuac.DEC.COM}
{arinc,att,avatar,cos,decuac,dtix,ecogong,kcwc}!hadron!jsdy
     {netex,netxcom,rlgvax,seismo,smsdpg,sundc}!hadron!jsdy

Volume-Number: Volume 11, Number 63

std-unix@ut-sally.UUCP (06/13/87)

From: katzung@laidbak.UUCP (Brian Katzung)

When I was working at the University of California San Francisco,
I made some simple modifications to tar for use as a backup facility.
As far as I know, this is still the method being used for backup on
that system.

The changes I made were:

	o Incremental mode (accepts file names from standard input
		and directories don't recurse).
	o Stay on the device associated with a disk or directory
		(don't cross mount points).
	o Multi-volume tapes on drives that support EOT reporting
		(I added a simple ioctl to our tape driver that
		reported EOT status).
	o A flag for continuing after directory checksum errors (to
		allow starting with any volume in a multi-volume
		set; it quickly syncs up on the first valid header).
	o Copies of all errors could be sent to a log file for
		semi-unattended operation.
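
Resynchronizing on the first valid header presumably relies on the header
checksum, though the post does not show the code.  A minimal sketch,
assuming the historical tar layout of a 512-byte header whose 8-byte
checksum field starts at offset 148 and is counted as spaces while
summing; header_checksum is a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>

#define BLKSIZ     512   /* size of a tar header block */
#define CHKSUM_OFF 148   /* offset of the checksum field */
#define CHKSUM_LEN 8     /* width of the checksum field */

/* Sum the header bytes, counting the checksum field itself as spaces. */
unsigned long header_checksum(const unsigned char *hdr)
{
    unsigned long sum = 0;
    size_t i;

    assert(hdr != 0);
    for (i = 0; i < BLKSIZ; i++) {
        if (i >= CHKSUM_OFF && i < CHKSUM_OFF + CHKSUM_LEN)
            sum += ' ';
        else
            sum += hdr[i];
    }
    return sum;
}
```

A reader can compare this sum with the octal value stored in the checksum
field and skip forward block by block until the two agree.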

-- Brian Katzung  ihnp4!laidbak!katzung


Volume-Number: Volume 11, Number 57

std-unix@ut-sally.UUCP (06/14/87)

From: guy@sun.com (Guy Harris)

> My memory is fuzzy now, but I recall cpio having been distributed on
> the V7 addendum tape, whose other contents were (I think) fsck, the
> line printer driver, and a c2 cured of certain overoptimism. ...
> [ Can anybody verify this?  -mod ]

No, but I think I can refute it with a reasonable degree of accuracy.
It's been a while since I've seen the V7 addendum tape, but I don't
remember "cpio" being on it.  (There were other things, like a
beefed-up F77, some fixes to "fgrep", and a newer version of "awk".
The versions that came with various 4BSD releases seemed to be the V7
addendum tape versions; "cpio" didn't come with any 4BSD release,
which suggests that "cpio" wasn't on the V7 addendum tape, although
it doesn't indicate it for sure.)

Volume-Number: Volume 11, Number 62

std-unix@ut-sally.UUCP (06/15/87)

From: rick@seismo.CSS.GOV (Rick Adams)

This is the README file off of the V7 addendum tape. 
cpio is clearly not on the tape. Note that the addendum tape
is in TAR format! (That should say something...)

---rick

--------------BEGIN README from V7 addendum---------------------
Addenda to UNIX 7th edition distribution tape, 12/2/80.

Format: tar(1), 800 bpi.

Contents:
	
	README: this descriptive file.
	
	lp.c: the missing line printer driver that
	      belongs in /usr/sys/dev/lp.c.
	      The program comes from PWB, and needs minor
	      changes to work in version 7; see comment
	      at head of program.
	
	lpr: a directory containing the lpr utility and
	      its daemon lpd.
	      See lpr/makefile for instructions on putting it
	      together.
	
	lpd.8: the manual section for the line printer daemon.

	fgrep.c: new source for fgrep(1) corrects certain
	      troubles with keys with common prefixes
	
	c2: directory containing C optimizer cured of
	      certain instances of overoptimism.
	      The existing C makefile works

	awk: directory with complete new awk processor,
	      see README and makefile therein

	tmac.r: macros to simulate old "roff" in "nroff",
	      to support -mr option mentioned in roff(1)

	f77: directory with complete new fortran compiler,
	      contains makefiles.
	      Further improvements to the I/O library have
	      been made at UC Berkeley, and may be obtainable
	      from them.

	malloc.c: new source for malloc(3) corrects rare bug

	dev: directory with more robust mag tape drivers for /usr/sys/dev

	fsck: directory with new, stringent file system checking
	      program and manual section, far superior to old
	      [ind]check.  It checks some data not maintained
	      by v7, in particular superblock counts; resulting
	      complaints are harmless

	
Other bug fixes:

/usr/sys/h/param.h: CMAPSIZ and SMAPSIZ
	should both be defined as (NPROC/2)
	otherwise trouble will occur with very large
	memories
/usr/sys/conf/low.s: replace br7+7. with br7+10.
/usr/src/cmd/sed/sed0.c: delete continue after
	case '\0' in compile() 
/usr/src/cmd/cu.c: args 1 and 2 of some ioctl calls
	may be interchanged
	a ~ may be lacking from references to ECHO or CRMOD
	in case (f == 1) of mode(f)

The following bugs exist, no fix is included.
      (1) adb does not report floating registers correctly
      (2) ldiv, lmod fail with largest negative dividend
          (these implement division of longs in C);
          the division (unsigned)32768/1 also fails
      (3) dump(1) maintains ddate incorrectly.
          This bug is relatively innocuous; it causes
          more dumping than necessary on some occasions.
      (4) join(1) treats null keys as end of file
      (5) sort -t includes the following tab in some field comparisons
      (6) hs(4) is irrevocably lost
      (7) exec writes arguments into swap space with buffered
          I/O, which may happen physically much later, after
          the space has been used for a core image.  The
          solution is to preallocate
          a portion of swap space to this single purpose.
      (8) break is turned into a DEL regardless
          of what the current interrupt character is
      (?) and others, see warranty


Volume-Number: Volume 11, Number 65

ka@hropus.UUCP (Kenneth Almquist) (06/15/87)

From: ka@hropus.UUCP (Kenneth Almquist)

> [ We are discussing standardizing a data interchange/archive format
> in a standard that its authors explicitly wanted to be implementable
> on hosted, i.e., non-UNIX-based, systems.  The inclusion of inode
> numbers is a problem for such implementations, especially when it
> is not necessary, as demonstrated by the tar format.  -mod ]

Several people have suggested that tar's method of handling links is
better than cpio's.  After looking at the tar format, I wondered how
tar could possibly handle links correctly.   A quick experiment showed
that it doesn't.  Try the following:

	> file1
	ln file1 file2
	tar -cf archive file1 file2
	rm file1 file2
	tar -xf archive file2

The second tar command will fail because tar will simply try to create
a link from file1 to file2, but since I only requested that file2 be
extracted file1 does not exist.

I claim that this is a bug in the tar archive format rather than just
the tar program.  Consider what tar must do to function correctly.  Tar
could remember the location of file1 and lseek to it in this particular
example, but in general the input to tar is not a regular file and thus
may not be seekable.  The best bug fix that I could come up with is to
make tar write the contents of all files that it does not extract
to a temporary file.  This is unsatisfactory because a user who tried
to extract a single file from a 32 megabyte tape would almost certainly
run out of disk space.
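
The seekability argument above can be tried out with a small sketch. It uses Python's tarfile module rather than the tar program, and an in-memory archive built by hand, so it only illustrates the format issue: the link entry for file2 is a header with no data, a seekable archive lets the reader go back to file1, and a forward-only stream does not. The function names are hypothetical.

```python
import io
import tarfile


def build_archive():
    """Build an in-memory archive holding file1 plus a hard-link entry file2."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        data = b"hello\n"
        first = tarfile.TarInfo("file1")
        first.size = len(data)
        tar.addfile(first, io.BytesIO(data))
        # The second name is recorded as a link: a header only, no data.
        second = tarfile.TarInfo("file2")
        second.type = tarfile.LNKTYPE
        second.linkname = "file1"
        tar.addfile(second)
    return buf.getvalue()


def list_members(archive):
    """Return (name, is_link, linkname, size) for each entry."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return [(m.name, m.islnk(), m.linkname, m.size) for m in tar]


def read_file2_seekable(archive):
    """On a seekable archive the reader can go back to file1's data."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return tar.extractfile("file2").read()


def read_file2_stream(archive):
    """On a forward-only stream, file1's data has already gone by."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r|") as tar:
        for member in tar:
            if member.name == "file2":
                try:
                    return tar.extractfile(member).read()
                except tarfile.StreamError:
                    # The library refuses: it cannot seek back to file1.
                    return None
    return None
```

In this sketch the seekable reader returns file1's contents for file2, while the stream reader gives up, which is the situation a tape or pipe puts tar in.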

So it seems to me that tar cannot be made to handle links correctly
unless the tar archive format is changed.  The cpio format, on the other
hand, allows links to be handled correctly.  The fact that cpio includes
inode numbers is not all that major a problem for non-UNIX based systems.
Since the only thing the inode numbers are used for is resolving links,
a system which does not support (non-symbolic) links can leave garbage
in the inode field when writing tapes.  A system which does have links
but does not have inode numbers can use a sequence number in place of
the inode number.

I recognize that users will very rarely encounter this bug in tar, but
I still view it as a serious problem in a *standard*.  The question is
not whether this bug in tar desperately needs to be fixed (which it
doesn't), but whether it is reasonable to expect vendors selling cpio
to deliberately introduce a bug into cpio.  Unless someone can suggest
a good way to make cpio use the tar format and still work correctly,
vendors will have to do just that to be compatible with the new standard.


I wonder if there is still any chance of a new interchange format that
corrected the deficiencies of both cpio and tar being accepted as the
standard.  Assuming someone could be found to write a public domain
implementation of the new format, would that be sufficient to make it
a reasonable alternative to the existing implementations?
				Kenneth Almquist

Volume-Number: Volume 11, Number 66

std-unix@ut-sally.UUCP (Moderator, John Quarterman) (06/17/87)

Yesterday, 16 June, was the deadline I had set for collecting tar
and cpio comments.  Included below is the revised note for P1003.1,
incorporating those comments.  I will deliver it to P1003.1 in
Seattle Monday.



                                  tar vs. cpio      IEEE P1003.1 N.___
                                                          17 June 1987

                               John S. Quarterman

                    Institutional Representative from USENIX
                                   usenix!jsq



          Secretary, IEEE Standards Board
          Attention: P1003 Working Group
          345 East 47th St.
          New York, NY 10017

          In both the Trial Use Standard and the current Draft 10,
          POSIX sS10.1 describes a data interchange format based on the
          tar program.  That section has appeared in every draft of
          IEEE 1003.1 in some form and has always been based on tar
          format.  The P1003.1 Working Group has recently received two
          related proposals regarding that section: one to add cpio
          format (including old-style, non-ASCII (non c option)
          format); <N.048 Lorraine C. Kevra> <V11N14> <V11N25 Eric S.
          Raymond> the other to replace the existing tar-based format
          with cpio format.  <N.043 X/OPEN> <V11N13> Some
          clarifications were received to the former.  <N.064 Dominic
          Dunlop> <V11N15> It was also proposed verbally in the latest
          Working Group meeting to drop sS10.1 altogether and let
          P1003.2 handle the issue.  <V11N08> <V11N11> <V11N09 Guy
          Harris> <V11N12 Doug Gwyn>

          The present note is a response to those proposals.  Much of
          the detail in it is derived from articles posted in the
          USENET newsgroup comp.std.unix.  Those articles are
          referenced with this format: <V11N09 Guy Harris> which gives
          the volume (always 11) and number of the article, and the
          name of the submittor.  If no submittor name is given, the
          posting was by the moderator, John S. Quarterman.  Thanks to
          those who submitted articles.  However, the content of this
          note is solely the responsibility of the author.

          This note is addressed to P1003.1, and is concerned with
          data interchange formats.  Although user interface issues
          may be of interest to P1003.2, they are not addressed here.

          There are a number of problems with both cpio formats.
          First, those related to the non-ASCII format:

            1.  Numerous parameters, including inode numbers, mode
                bits, and user and group IDs, are kept in two-byte
                binary integers.  This has historically produced
                serious byte-order problems when data is moved among
                systems with different byte orders.  <V11N09 Guy
                Harris>

            2.  The byte-swapping and word-swapping options to the
                cpio program are inadequate patches; with an ASCII
                format the problem would not be present.  The options
                are not consistent across versions of the program: in
                System III, data blocks and file names are byte
                swapped; in System V, only data blocks are byte
                swapped.  <V11N09 Guy Harris> <V11N47 Andrew
                Tannenbaum>

            3.  The two-byte integer format limits the range of inode
                numbers to 0..65535.  Many current file systems are
                bigger than that.  <V11N37 Paul Eggert> <V11N39 Henry
                Spencer>

          Non-ASCII cpio format is clearly not portable and should not
          even be considered for standardization.  <V11N12 Doug Gwyn>
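
The byte-order problem in items 1 and 2 above can be illustrated with the header magic number alone. This is a minimal sketch, assuming the magic value 070707 (octal) stored as a two-byte binary integer in the non-ASCII format and as six octal digits in the ASCII format; the function names are hypothetical.

```python
import struct

MAGIC = 0o070707  # cpio header magic number


def write_binary_magic_little_endian():
    """Store the magic as a two-byte integer, low byte first."""
    return struct.pack("<H", MAGIC)


def read_binary_magic_big_endian(raw):
    """Read the same two bytes on a machine that puts the high byte first."""
    return struct.unpack(">H", raw)[0]


def write_ascii_magic():
    """Store the magic as six ASCII octal digits."""
    return b"%06o" % MAGIC


def read_ascii_magic(raw):
    """ASCII digits read the same way on any byte order."""
    return int(raw, 8)
```

Here the binary value comes back as 0143561 (octal) on the other byte order, while the ASCII form round-trips unchanged. The same applies to every two-byte header field, not only the magic number.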

          There are several problems that occur even with the ASCII
          cpio format:

            1.  Many implementations of cpio only look at the lower 16
                (or even 15) bits of the inode number, even in ASCII
                format.  <V11N39 Henry Spencer> This is because the
                variable that is used to contain the value is declared
                to be unsigned short, just as in binary format.  Thus,
                even though ASCII cpio format only constrains this
                number to the range 0..262143, the format is still
                less than portable.  <V11N37 Paul Eggert>

            2.  The proposed cpio ASCII format as specified, <N.048
                Lorraine C. Kevra> <V11N14> is not portable because
                the proposal assumes that sizeof(int) == sizeof(long).
                <N.064 Dominic Dunlop> <V11N15>

            3.  The file type is written in a numerical format, making
                it UNIX specific rather than POSIX specific, since
                POSIX (and tar) specifies symbolic, rather than
                numerical, values for file types.  <V11N09 Guy Harris>

            4.  Hard links are not handled well, since cpio format
                does not directly record that two files are linked.
                If two files that are linked are written in cpio
                format, two copies will be written.  The cpio program
                detects duplicate files by matching pairs of (h_dev,
                h_ino) and producing links, but that is done after the
                fact.  <V11N09 Guy Harris> <V11N45 Guy Harris> <V11N54
                Ian Donaldson> (There is a program, afio, that handles
                cpio format more efficiently in this and other cases
                than the licensed versions of the program.) <V11N21
                Chuck Forsberg>

            5.  Symbolic links are not handled at all, and no type
                value is reserved for them.  This makes cpio useless
                on a large class of historical implementations (those
                based on 4.2BSD or its file system) for one of the
                main purposes of POSIX sS10.1: archiving files for
                later retrieval and use on the same system.  Although
                it is possible to extend cpio to handle symbolic
                links, and at least one vendor has done this, <V11N45
                Guy Harris> the format proposed to P1003.1 is the
                format in the SVID, and does not handle symbolic
                links.

            6.  The cpio format is less common than tar format: there
                are few historical implementations from Version 7 on
                that do not have tar; there are many that do not have
                cpio.  <V11N09 Guy Harris> <V11N10 Charles Hedrick>
                <V11N24 Jim Cottrell> It is true that cpio (non-ASCII
                format) was invented before tar, <V11N22 Joseph S. D.
                Yao> apparently in PWB System 1.0.  <V11N26 Joseph S.
                D. Yao> The cpio program was first available outside
                AT&T with PWB/UNIX 1.0, <V11N45 Guy Harris> <V11N63
                Joseph S. D. Yao> and later with System III.  However,
                in the interim, Version 7, which did not include cpio
                <V11N53 Bill Jones> <V11N62 Guy Harris> but did
                include tar, became the most influential system.
                There was a V7 addendum tape, but it also did not
                include cpio (according to its README file); <V11N65
                Rick Adams> the addendum tape was in tar format.
                Also, it appears that the cpio format of PWB was not
                the same as that of System III.  <V11N39 Henry
                Spencer> And System III and all releases of System V
                include tar.  <V11N26 Joseph S. D. Yao> <V11N63 Joseph
                S. D. Yao> <V11N45 Guy Harris> <V11N47 Andrew
                Tannenbaum>

            7.  It is very late in the process to propose that P1003.1
                adopt cpio format now, especially considering that it
                was originally proposed to and rejected by the
                /usr/group committee before P1003.1 was even formed.
                <V11N39 Henry Spencer>
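
Items 1 and 4 in the list above interact: links are found by matching (device, inode) pairs, so truncating the inode number can make unrelated files look linked. The following is a sketch only, assuming a six-octal-digit inode field in the ASCII header and a reader that keeps the value in an unsigned short. The names are hypothetical, and a real cpio would also consult the link count, which is omitted here.

```python
ASCII_INO_DIGITS = 6  # assumed width of the inode field, in octal digits


def max_ascii_inode():
    """Largest inode number that fits in the ASCII field (0777777 octal)."""
    return 8 ** ASCII_INO_DIGITS - 1


def as_unsigned_short(ino):
    """Mimic storing the inode number in a 16-bit unsigned short."""
    return ino & 0xFFFF


def find_links(entries, truncate):
    """Map each later name to the earlier name it appears to be linked to.

    entries: list of (name, dev, ino) in archive order.
    truncate: if true, compare inode numbers after 16-bit truncation.
    """
    seen = {}
    links = {}
    for name, dev, ino in entries:
        key = (dev, as_unsigned_short(ino) if truncate else ino)
        if key in seen:
            links[name] = seen[key]
        else:
            seen[key] = name
    return links
```

With inode numbers 5 and 65541 on the same device, the full-width comparison keeps the files apart, but the truncated comparison reports them as links to one another.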

          Advantages of cpio format include:

            1.  Both X/OPEN <N.043 X/OPEN> <V11N13> and the SVID
                <N.048 Lorraine C. Kevra> <V11N14> use it, although
                evidently defined somewhat differently.  <N.064
                Dominic Dunlop> <V11N15>

            2.  Archives made in cpio format are often smaller than
                ones in tar format.  <V11N44 Mark Horton> But this is
                only because of the headers, and thus the effect
                diminishes with larger files.

            3.  On a local (non-networked) system, cpio is more
                efficient at copying directory trees than tar.
                <V11N46 Steve Blasingame> However, this is really an
                implementation issue.
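
The size argument in item 2 above can be checked with a little arithmetic. This sketch assumes a 512-byte tar header with data padded to a multiple of 512 bytes, and a 76-byte ASCII cpio header followed by the NUL-terminated name and unpadded data. End-of-archive trailers and tape blocking are ignored, and the function names are hypothetical.

```python
TAR_BLOCK = 512         # tar header size and data padding unit
CPIO_ASCII_HEADER = 76  # assumed length of the ASCII cpio header


def tar_entry_size(data_len):
    """Bytes used by one file in tar format: header plus padded data."""
    blocks = -(-data_len // TAR_BLOCK)  # ceiling division
    return TAR_BLOCK + blocks * TAR_BLOCK


def cpio_entry_size(data_len, name_len):
    """Bytes used by one file in ASCII cpio format: header, name, NUL, data."""
    return CPIO_ASCII_HEADER + name_len + 1 + data_len


def overhead_ratio(data_len, name_len):
    """How many times larger the tar entry is than the cpio entry."""
    return tar_entry_size(data_len) / cpio_entry_size(data_len, name_len)
```

Under these assumptions a 100-byte file with a three-character name takes 1024 bytes in tar and 180 in cpio, while for a 1,000,000-byte file the two differ by less than 0.1 percent.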

          There are several advantages to the current tar-based format
          as specified in sS10.1:

            1.  There are no byte- or word-swapping issues caused by
                the format, since all the header values are ASCII byte
                streams.  <V11N17 John Gilmore>

            2.  There are no inode numbers recorded, and file types
                are kept in symbolic form, so the format is less
                implementation-specific than cpio format.  <V11N17
                John Gilmore>

            3.  Historical tar format is the most widely used, as
                discussed in 6. above, despite apparent assertions to
                the contrary.  <N.043 X/OPEN> <V11N13>

            4.  The format specified in sS10.1 is upward-compatible
                with tar format.  Old tar archives can be extracted by
                a program that implements sS10.1.  Archives using some
                of the extensions of sS10.1 can be extracted with old
                (Version 7) tar programs, although symbolic links will
                not be extracted and contiguous files will not be
                handled properly (cpio does not handle these
                capabilities at all).  Files with very long names will
                not be handled properly (cpio does no better at this).
                All tar implementations are compatible to this extent.
                <V11N17 John Gilmore>

            5.  The /usr/group working group and P1003.1 have already
                done the work <P.061> <M.019 5.1.121 Pg.13> <RFC.003
                #121> <P.038> <P.006> required to add optional
                extensions (such as symbolic links, long file names,
                <V11N49 Jerry Schwarz> <V11N50 Michael MacDonald> and
                contiguous files) that are needed on many historical
                implementations and that cpio format lacks.

            6.  The format is extensible for future facilities.
                <V11N39 Henry Spencer>

            7.  There is a public domain implementation of the format
                of sS10.1.  That implementation provided feedback which
                led to improvements in the current specification, and
                has been in use for years in transferring data with
                licensed tar implementations.  <V11N17 John Gilmore>

            8.  Many people prefer the user interface of the cpio
                program to that of the tar program, because the former
                can accept a list of pathnames to archive on standard
                input while the latter takes them as arguments,
                limiting the length of the list.  <V11N34 Andrew
                Tannenbaum> However, the above-mentioned public domain
                implementation of tar accepts pathnames on standard
                input, <V11N17 John Gilmore> <V11N19 Jim Cottrell> and
                at least one vendor sells a version of tar that can do
                this.  <V11N48 Michael Gersten> Diffs to standard tar
                to add an option to accept pathnames on standard input
                when creating an archive have also been posted to
                USENET.  <V11N36 John Gilmore> The user interface is,
                in any case, irrelevant to P1003.1.  <V11N39 Henry
                Spencer> <V11N40 Rahul Dhesi>
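
Items 1 and 2 in the list above can be seen by looking at a header block directly. The sketch below uses Python's tarfile module with its USTAR format; the field offsets and type characters are those of that later format, and it is an assumption that they match the draft discussed here. The helper names are hypothetical.

```python
import tarfile


def make_header(name, size=0, typeflag=tarfile.REGTYPE, linkname=""):
    """Return the 512-byte header block for one archive member."""
    info = tarfile.TarInfo(name)
    info.size = size
    info.type = typeflag
    info.linkname = linkname
    return info.tobuf(format=tarfile.USTAR_FORMAT)


def decode_header(block):
    """Pick a few fields out of a header block by byte offset."""
    def text(lo, hi):
        # Fields are NUL-terminated ASCII; numbers are octal digits.
        return block[lo:hi].split(b"\0", 1)[0].decode("ascii")

    return {
        "name": text(0, 100),
        "mode": int(text(100, 108), 8),
        "size": int(text(124, 136), 8),
        "typeflag": block[156:157].decode("ascii"),
        "linkname": text(157, 257),
    }
```

Every field is a string of ASCII characters, so there is nothing to byte-swap, and the file type is a single character code rather than a copy of the mode bits. No inode number appears anywhere in the block.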

          Disadvantages of tar format:

            1.  If an attempt is made to extract only the second of a
                pair of hard linked files the tar program will attempt
                to link the second file to the nonexistent first file,
                and nothing will be extracted.  Although a
                sufficiently clever implementation could avoid this,
                the problem can be considered to be in the archive
                format.  <V11N66 Kenneth Almquist>

          There are some problems that neither tar nor cpio handles
          well.

            1.  The POSIX format allows file names up to the length
                of PATH_MAX (at least 255), <V11N50 Michael
                MacDonald> compared with 128 for cpio and 100 for
                historical tar.  Still longer names would be
                preferable, although the POSIX limit is useful for
                most cases.  <V11N54 Ian Donaldson>

            2.  An option to prevent crossing mount points would be
                useful for backups.  <V11N19 Jim Cottrell> <V11N22
                Joseph S. D. Yao> However, this appears to be more of
                an implementation issue than a format issue, <V11N28
                Dave Brower> <V11N32 Joseph S. D. Yao> especially
                considering that there are options to find in 4.2BSD,
                <V11N24 Jim Cottrell> SunOS 3.2, <V11N36 John Gilmore>
                and System V Release 3.0 <V11N35 Mike Akre> that take
                care of this.

            3.  The default block size in many tar implementations is
                too large for some tape controllers to read <V11N27
                Rob Lake> (the 3B20 has this problem).  This is not a
                problem with the interchange format, however.

          There is nothing that the proposed cpio can handle that the
          tar-based format already in POSIX sS10.1 cannot handle; in
          fact, the former is less capable.  If cpio format were
          augmented to handle missing capabilities, it would be
          subject to the same objections now aimed at the format given
          in sS10.1: that it was not identical with an existing format.

          There is no advantage in replacing the current tar-based
          format of sS10.1 with cpio format.  There is also no
          advantage in adding cpio format, because two standards are
          not as good as a single standard.

          Some have recommended removing sS10.1 from POSIX altogether,
          <V11N12 Doug Gwyn> perhaps with a recommendation for P1003.2
          to pick up the idea.  <V11N09 Guy Harris> While I believe
          that that would be preferable to adding cpio format, whether
          or not tar format remains, I recommend leaving sS10.1 as it
          is, because

             o The inclusion of an archive/interchange file format is
               in agreement with the purpose of POSIX to promote
               portability of application programs across interface
               implementations.  Some format will be used.  It is to
               the advantage of the users of the standard for there to
               be a standard format.

             o The de facto standard is tar format.  The current sS10.1
               standardizes that, and provides upward-compatible
               extensions in areas that were previously lacking.

          The Archive/Interchange File Format should be left as it is.

                                                  Thank you,



                                                  John S. Quarterman


Volume-Number: Volume 11, Number 67

henry@utzoo.UUCP (Henry Spencer) (06/17/87)

From: henry@utzoo.UUCP (Henry Spencer)

> >        1)  Data is not block oriented. This slows down processing...
> I miss this one. It may slow things under MVS, but there's no reason
> why reading less physical data should slow things down. Quite the
> opposite.

The problem being alluded to is that the data is not block-aligned.  This
is a bit of a performance pain when the disks are block-aligned, although
tar's block alignment isn't going to help a lot if the disk blocks are bigger
than tar's (which they normally are, nowadays).

> >        2)  There is no room left in the header. No customization
> >            possible (without also sending the customized program).
> > This is a major advantage. Save us from "custom standard" format. The
> custom stuff belongs in the *file*, not the format (in my opinion).

The point here is that you can customize tar to some degree *without*
making it incompatible with the standard ones.  (We did, for example.)
This is not true of cpio, since there's no spare space in the header.

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

Volume-Number: Volume 11, Number 69

rcodi@yabbie.oz.au (Ian Donaldson) (06/18/87)

From: rcodi@yabbie.oz.au (Ian Donaldson)

In article <8249@ut-sally.UUCP>, std-unix@ut-sally.UUCP (Moderator, John Quarterman) writes:
> 
> [ We are discussing standardizing a data interchange/archive format 
> in a standard that its authors explicitly wanted to be implementable
> on hosted, i.e., non-UNIX-based, systems.  The inclusion of inode
> numbers is a problem for such implementations, especially when it
> is not necessary, as demonstrated by the tar format.  -mod ]

How much of a problem?  Surely the numerical values of these numbers are
unimportant, but their relationship to other files in the same
archive is important.  They are just magic cookies specifying whether
file A is the same as file B.  Any computer can generate such
magic cookies.    

For so-called implementations of "UNIX" (as opposed to UN*X)
that don't have linked file capabilities (gasp!) this is still not a problem,
as archives that they generate won't have linked files, and archives
that they read will simply ignore this information.  

Same goes for tar.

Ian D.

Volume-Number: Volume 11, Number 71

cameron@elecvax.oz (Cameron Simpson) (06/22/87)

From: cameron@elecvax.oz (Cameron Simpson)

>From: ka@hropus.UUCP (Kenneth Almquist)
>Subject: Re: tar vs. cpio
>Message-ID: <8276@ut-sally.UUCP>
>
>[ example of failure of tar's format ]
>
>So it seems to me that tar cannot be made to handle links correctly
>unless the tar archive format is changed.  The cpio format, on the other
>hand, allows links to be handled correctly.  The fact that cpio includes
>inode numbers is not all that major a problem for non-UNIX based systems.
>Since the only thing the inode numbers are used for is resolving links,
>a system which does not support (non-symbolic) links can leave garbage
>in the inode field when writing tapes.  A system which does have links
>but does not have inode numbers can use a sequence number in place of
>the inode number.

Please, monotonic garbage! Imagine extracting such a tape on a system
which *does* support (non-symbolic) links. I can easily envisage an
implementation which simply did not initialise the garbage, and wrote
said garbage the same for each file. On extraction you'd end up with a
single file with LOTS of links!-)
	- Cameron Simpson

ACSnet:	cameron@elecvax.eecs.unsw.oz	JANET: elecvax.eecs.unsw.oz!cameron@ukc
CSNET:	cameron@elecvax.oz		BITNET:	cameron%elecvax.oz@CSNET-RELAY
ARPA:	cameron%elecvax.eecs.unsw.oz@seismo.css.gov
UUCP:	...!seismo!munnari!elecvax.eecs.unsw.oz!cameron
     or	munnari!elecvax.eecs.unsw.oz!cameron@seismo.css.gov


Volume-Number: Volume 11, Number 77

mwm@cuuxb.att.com (Marc W. Mengel) (06/23/87)

From: mwm@cuuxb.att.com (Marc W. Mengel)

In article <8276@ut-sally.UUCP> ka@hropus.UUCP (Kenneth Almquist) writes:
:Several people have suggested that tar's method of handling links is
:better than cpio's.  After looking at the tar format, I wondered how
:tar could possibly handle links correctly.   A quick experiment showed
:that it doesn't.  Try the following:
:
:	> file1
:	ln file1 file2
:	tar -cf archive file1 file2
:	rm file1 file2
:	tar -xf archive file2
:
:The second tar command will fail because tar will simply try to create
:a link from file1 to file2, but since I only requested that file2 be
:extracted file1 does not exist.
:
:I claim that this is a bug in the tar archive format rather than just
:the tar program.  Consider what tar must do to function correctly.  Tar
:could remember the location of file1 and lseek to it in this particular
:example, but in general the input to tar is not a regular file and thus
:may not be seekable.  
:				Kenneth Almquist

Actually this is easily solved using the current format: you need merely
ensure that, when a file has multiple links, the file's data is put
on the archive only the last time that it is referenced.  This guarantees
that the location of file1 in your example is always further along the
archive than file2, and therefore no "rewinding" is ever needed to find 
the file after discovering that a link to it has been requested.  This 
does require the program to make two traversals over the directory tree 
( 1 to determine where the last reference to each file in the subtree 
occurs, 1 to actually write out the files), but the format itself is *not*
inherently broken.
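
The two traversals described above might be planned along these lines. This is one reading of the proposal, not code from any tar; the function name and the (name, file_id) input are hypothetical, with file_id standing for whatever identifies a file, such as a device and inode pair.

```python
def plan_archive(entries):
    """Decide which name carries the data for each multiply linked file.

    entries: list of (name, file_id) in traversal order.
    Returns a list of (name, kind, target): kind is "data" for the last
    name that refers to a file and "link" for every earlier name, with
    target naming the later entry that will carry the data.
    """
    # Pass 1: record the last name seen for each file.
    last_name = {}
    for name, file_id in entries:
        last_name[file_id] = name

    # Pass 2: emit link headers until the last reference, then the data.
    plan = []
    for name, file_id in entries:
        if last_name[file_id] == name:
            plan.append((name, "data", None))
        else:
            plan.append((name, "link", last_name[file_id]))
    return plan
```

For the file1/file2 example, file1 becomes a link entry pointing forward to file2, which carries the data. An extractor asked for file1 alone would still have to keep reading until it reaches file2, a step this sketch does not show.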

-- 
 Marc Mengel
 ...!{moss|lll-crg|mtune|ihnp4}!cuuxb!mwm

Volume-Number: Volume 11, Number 78

henry@utzoo.UUCP (Henry Spencer) (06/24/87)

From: henry@utzoo.UUCP (Henry Spencer)

As a counterpoint to Kenneth Almquist's example of tar fouling up links,
try the following:

	echo hi >one
	ln one two
	ls one two | cpio -o >/tmp/foo
	rm one two
	cpio -i one </tmp/foo
	cpio -i two </tmp/foo

Now we have cpio creating two files where there was supposed to be only one
with two links.  Note that tar would get this one right!  (Although you would
have to be careful to get the order right, admittedly a wart.)

The moral of the story is that it is very hard to handle links properly in
a format like tar or cpio, and *neither* of the existing formats is bullet-
proof in the presence of links.  This is not a valid argument for or against
either format; they merely screw up in different ways.  One can argue about
the relative probabilities of the different screwup types causing trouble,
but I think the point is made.
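
The cpio behaviour in the example above can be modelled with a short sketch. It assumes that every archive entry carries its own copy of the data and that the extractor matches (device, inode) pairs only among files it extracts in the same run. The function and the tuple layout are hypothetical.

```python
def cpio_extract(archive, wanted):
    """Simulate one extraction run over a cpio-style archive.

    archive: list of (name, dev, ino, data); every entry carries its data.
    wanted: set of names to extract.
    Returns {name: ("copy", data)} or {name: ("link", earlier_name)}.
    """
    seen = {}  # (dev, ino) -> first name extracted in *this* run
    result = {}
    for name, dev, ino, data in archive:
        if name not in wanted:
            continue
        key = (dev, ino)
        if key in seen:
            # Same file as one already extracted in this run: link to it.
            result[name] = ("link", seen[key])
        else:
            seen[key] = name
            result[name] = ("copy", data)
    return result
```

Extracting "one" and "two" in separate runs yields two independent copies, because each run starts with an empty table. Asking for both in a single run yields one copy and one link.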

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

Volume-Number: Volume 11, Number 79

henry@utzoo.UUCP (Henry Spencer) (06/24/87)

From: henry@utzoo.UUCP (Henry Spencer)

> ... A system which does have links
> but does not have inode numbers can use a sequence number in place of
> the inode number.

Suppose it wants to use a sequence number that is bigger than 16/18 (16
for binary cpio format, 18 for ASCII cpio format) bits?  For that matter,
what if *inode* numbers are bigger than that?  This argument would be a
whole lot stronger if the cpio formats (note plural) left more room for
this particular magic cookie.

> I wonder if there is still any chance of a new interchange format that
> corrected the deficiencies of both cpio and tar being accepted as the
> standard...

Not unless there were truly overwhelming technical reasons for picking it.
Even the cpio formats, scummy though they are, are readily readable without
new software at a great many existing sites.  The same is true of tar on an
even larger scale.  A new format would start from zero... not an attractive
proposition.  I would also observe that we don't *want* the sort of format
that a standards committee would invent -- have you studied, say, X.400
lately?  Better to pick the best of existing practice and standardize that,
possibly with minor changes.  That is what standards committees are really
supposed to do.

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

Volume-Number: Volume 11, Number 80

michael@stb.UUCP (Michael) (06/29/87)

From: michael@stb.UUCP (Michael)

Actually, Tar CAN handle links properly, with the present file format.

There are (generally) two types of input to tar: regular files, which can
be seeked to the proper place, and special files, which cannot.  The latter
can, however, be close()'d and open()'d again, which does a rewind on any
device that I'm familiar with.  Presto, you've just seeked to the beginning;
now you can skip forward as far as you need to get to the file.

The only thing this won't work with is pipes; the only uses I can think
of for pipes with tar are copying directories (in which case selective
retrieval isn't needed) and compressed archives (you're out of luck).
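
A rough sketch of the close/reopen trick follows.  It assumes the device
really does rewind on reopen and that plain reads are enough to skip
forward; the function name and the 512-byte chunk size are illustrative
only, and a real tape would need reads matched to its block size.

```python
import os

def rewind_and_skip(fd, path, offset, chunk=512):
    """Close fd, reopen path (which starts again at the beginning), then
    read and discard `offset` bytes.  Returns the new file descriptor."""
    os.close(fd)
    fd = os.open(path, os.O_RDONLY)
    remaining = offset
    while remaining > 0:
        data = os.read(fd, min(chunk, remaining))
        if not data:
            raise EOFError("archive ended before the requested offset")
        remaining -= len(data)
    return fd
```

Nothing here calls lseek(), which is the point: only open, read and close
are needed, so it works on anything that can be reopened by name.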
-- 
: Michael Gersten		seismo!scgvaxd!stb!michael
: Ground floor, comming up -- 1-3-7

Volume-Number: Volume 11, Number 76

zink@seismo.CSS.GOV@bunker.uucp (David Zink) (07/04/87)

From: harvard!husc6!bunker!zink@seismo.CSS.GOV (David Zink)

In article <8362@ut-sally.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
)From: henry@utzoo.UUCP (Henry Spencer)
)
)As a counterpoint to Kenneth Almquist's example of tar fouling up links,
)try the following:
)
)	echo hi >one
)	ln one two
)	ls one two | cpio -o >/tmp/foo
)	rm one two
)	cpio -i one </tmp/foo
)	cpio -i two </tmp/foo
)
)Now we have cpio creating two files where there was supposed to be only one
)with two links.  Note that tar would get this one right!  (Although you would
)have to be careful to get the order right, admittedly a wart.)

Actually, this should always create TWO FILES, unless you want to go to
the effort of proving that one has not been modified when two is removed.
We're seeing two separate invocations of cpio with no knowledge of each
other -- how could it be correct to link them?
:g/^>/s//;-)/

[ Um, haven't we about beat this subsubject to death?  - mod ]

Volume-Number: Volume 11, Number 86

ka@gatech.UUCP@opus.uucp (Kenneth Almquist) (07/05/87)

From: ka@gatech.UUCP@opus.uucp (Kenneth Almquist)

OK, here is a simple, backward compatible fix to the tar format.  When
tar encounters a file name which is a link to a file that it previously
dumped, it should first write out a header for the file indicating that
it is a link to a previously dumped file.  It should then write out
another header for the file, this time without linkflag set, and follow
this header with the contents of the file.  This way, if the first link
to a file is not dumped, its contents will be available later when
subsequent links are dumped.

This is backward compatible because an old version of tar would make the
link when it read the first header, and then dump the contents of the
file when it read the second header.  Dumping the contents of the file
does no harm because it will not modify the contents of the file.  Of
course, new implementations of tar might want to recognize this situation
and avoid dumping the contents of the file, but only for reasons of
efficiency.
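
A toy model of the scheme, for illustration only.  The record tuples and
function names below are hypothetical and are not the real tar header
layout; the model only tracks which names end up with which contents.

```python
def write_archive(files):
    """files: list of (name, file_id, data).  Returns a list of records,
    each either ("link", name, target) or ("file", name, data)."""
    first_name = {}
    records = []
    for name, file_id, data in files:
        if file_id in first_name:
            # Later link: write the link header first ...
            records.append(("link", name, first_name[file_id]))
        else:
            first_name[file_id] = name
        # ... then always a regular header followed by the contents.
        records.append(("file", name, data))
    return records

def extract(records, wanted):
    """Toy reader: restore only the names in `wanted`.  Returns
    {name: data}.  A link whose target was not restored is skipped."""
    restored = {}
    for kind, name, payload in records:
        if name not in wanted:
            continue
        if kind == "link":
            if payload in restored:
                restored[name] = restored[payload]
        else:
            restored[name] = payload
    return restored
```

Extracting only the second name still yields its contents, because the
full copy follows the link header.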

I noted Marc Mengel's suggestion that tar write out the contents of a file
when the last link to a file is encountered, rather than the first.  This
would be nice, but I don't see how it could be done in a way that is
backward compatible with the current tar format.  I also read Michael
Gersten's article suggesting that tar could rewind raw magnetic tapes by
closing them and opening them again.  This proposal doesn't deal with
the question of how cpio could be made to use the tar format, since cpio
reads from its standard input, which it has no way of closing and opening
again, and it also ignores the case where tar is reading from a pipe
because the tape drive is not on the same machine that tar is running on.
So I feel that the above change to the tar format is necessary.

The remaining problem with the tar format is the limit on the file name
size.  If memory serves, cpio originally limited file names to 127
characters, and this was recognized as inadequate and increased to 255
characters.  The current maximum file name in tar is 99 characters.

However, the maximum file name supported by tar can be increased while
still allowing files whose names are not more than 99 characters long to
be read by existing implementations.  I will suggest one possibility here.
Increase the size of the linkname field to 200 characters.  Since this
field is at the end of the header structure, this will not alter the
location of any of the other fields.  Place a 100 character name
extension field after the linkname field.  If the file name field does not
contain a nul terminator, the remainder of the file name is assumed to
be in the file name extension field.  This scheme allows file names of
up to 199 characters to be represented, which comes close to the 255
character limit of the current cpio implementation.  It leaves 55 bytes
of the header free for future expansion.
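
The arithmetic can be checked with a short sketch.  It assumes the usual
old-style tar header field widths in a 512-byte block; "name_ext" and the
helper functions are names invented here for the proposed extension field.

```python
BLOCK = 512
# Old-style tar header fields, in order, with widths in bytes (assumed).
OLD_FIELDS = [("name", 100), ("mode", 8), ("uid", 8), ("gid", 8),
              ("size", 12), ("mtime", 12), ("chksum", 8),
              ("linkflag", 1), ("linkname", 100)]

def proposed_fields():
    """Widen linkname to 200 and append a 100-byte name extension."""
    fields = [(n, 200 if n == "linkname" else w) for n, w in OLD_FIELDS]
    fields.append(("name_ext", 100))
    return fields

def free_bytes(fields):
    """Bytes of the header block not used by the given fields."""
    return BLOCK - sum(width for _, width in fields)

def split_name(path):
    """Return (name_field, ext_field) as nul-padded 100-byte strings."""
    raw = path.encode("ascii")
    if len(raw) > 199:
        raise ValueError("name longer than 199 characters")
    if len(raw) < 100:
        return raw.ljust(100, b"\0"), b"\0" * 100
    # No nul in the name field signals that the extension is in use.
    return raw[:100], raw[100:].ljust(100, b"\0")

def join_name(name_field, ext_field):
    """Inverse of split_name, following the rule described above."""
    if b"\0" in name_field:
        return name_field.split(b"\0", 1)[0].decode("ascii")
    return (name_field + ext_field.split(b"\0", 1)[0]).decode("ascii")
```

With these widths, free_bytes(proposed_fields()) comes to 55, matching the
figure above, and names of 99 characters or fewer are stored exactly as
they are today.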

These changes to the tar format would make it possible to write a program
which used the tar format, but otherwise behaved exactly like cpio except
for a slight decrease in the maximum file name length.

I still don't like it, mind you.  I receive a lot more programs over the
net than I do via tape, and here tar fails miserably because it has nul
characters in the header which news and mail programs cannot handle.  It
is hard to get excited over a standard that fails to handle the most
common case (or more accurately, what is the most common case for me).
But I agree with Henry Spencer's statement that the role of standards
committees should be to standardize existing practice, with at most minor
changes.  So we should either forget about developing a standard now, or
standardize on the most widely available format (which is tar) after fixing
the major problems with it.  I could go for either approach.
				Kenneth Almquist

P.S.  A lot of nonsense has appeared in this group about the supposed
      deficiencies of cpio, which I won't rebut since I don't support
      using the cpio format as a standard.  Just please take it all with
      a grain of salt.

[ I'd rather have details than innuendo, thanks.  -mod ]

Volume-Number: Volume 11, Number [unreadable]

mmengel@cuuxb.uucp (Marc W. Mengel) (07/13/87)

From: mmengel@cuuxb.uucp (Marc W. Mengel)

In article <8440@ut-sally.UUCP> Kenneth Almquist writes:
>I noted Marc Mengel's suggestion that tar write out the contents of a file
>when the last link to a file is encountered, rather than the first.  This
>would be nice, but I don't see how it could be done in a way that is
>backward compatible with the current tar format.  


Maybe I was too brief in my attempt at describing it... What you do
is make two recursive descents of the directory tree.  

In the first recursive descent, you generate a table that looks like

	file id #	last place seen in recursive descent
	123		foo/bar/baz
	333		foo/baz/bleem
	...

Each time you encounter a file in the recursive descent, you either add it
to the table, if its file id # isn't there, or replace the pathname
recorded for it in the second column of the table.

In the second recursive descent, you look up each file you see in the table,
(by file id #) and if its path-from-the-table matches the current path,
you write out the file on the archive, otherwise you mark the current 
path as a link to path-from-the-table in the archive.

Since our two recursive descents of the file tree should be identical, we
should always have only the last reference to a given file with data, and
all earlier references to that file listed as links to the one later on
the archive.

(Of course we only need to put files in the table that have multiple
links...)
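
A minimal sketch of the two passes.  It assumes (st_dev, st_ino) can serve
as the file id #, and that walking the tree in sorted order gives the same
sequence both times; plan_archive and the ("file", ...)/("link", ...)
tuples are invented for illustration and are not an archive format.

```python
import os
import stat

def walk_files(root):
    """Yield (path, file_id, nlink) for regular files, in a fixed order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the descent repeatable
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode):
                yield path, (st.st_dev, st.st_ino), st.st_nlink

def plan_archive(root):
    """Return archive entries with data attached to the LAST link seen."""
    # Pass 1: remember the last path at which each multi-link file appears.
    last_seen = {}
    for path, file_id, nlink in walk_files(root):
        if nlink > 1:
            last_seen[file_id] = path
    # Pass 2: earlier paths become links to that last path.
    plan = []
    for path, file_id, nlink in walk_files(root):
        target = last_seen.get(file_id, path)
        if target == path:
            plan.append(("file", path))
        else:
            plan.append(("link", path, target))
    return plan
```

A file whose other links all lie outside the tree is its own last
reference, so it is simply written out with its data.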

So I am asking that a notation be put in the tar archive format for a
link to another file in the archive, along with the requirement that 
link-to-file-"foo" nodes in the archive must always precede file-"foo"
nodes.


[ If you want to make a proposal to the P1003.1 Working Group,
you need to supply actual wording for the standard.  -mod ]

-- 
 Marc Mengel
 attmail!mmengel
	or
 ...!{moss|lll-crg|mtune|ihnp4}!cuuxb!mmengel


Volume-Number: Volume 11, Number 90

std-unix@uunet.UU.NET (Moderator, John Quarterman) (07/15/87)

From: usenix!jsq (John S. Quarterman)

A belated report about the Seattle P1003 meeting,
regarding section 10.

No one proposes non-ASCII cpio format any more.

A revised cpio proposal was received.  It is in the
appropriate format for P1003.1, but is still straight
System V cpio.

The proposer of that proposal has agreed to supply
an updated proposal, including optional extensions
for symbolic links, contiguous files, and a general
method of extension.  This is analogous to what is
already in Draft 10 about the ustar format.

P1003.1 Draft 11 will include the updated cpio proposal
in addition to the already-present ustar format.

Some notes have been moved from Section 10 into the Rationale.

The introductory matter in 10.1 about the use of permission
information on extraction of archives has been reworded, mostly
to avoid the word "utility" (this is 1003.1, i.e., the programming
language interface standard, that we are discussing).

A note is expected from X/OPEN to address the issues raised in my
previous note (IEEE 1003.1 N.100, "tar vs. cpio"), and to include
some comments about the motivation for the cpio proposals.

The cpio proponents have been invited to post that note and
the new cpio proposal in this newsgroup.

N.100 will appear in the next issue of ;login:, the Newsletter
of the USENIX Association.  The cpio proponents have been 
invited to submit equivalent material.  There is a possibility
that similar articles may appear in the EUUG newsletter.


An actual decision on what format(s) will be in the IEEE 1003.1
Full Use Standard is expected at the September meeting in
Nashua, New Hampshire.  Though, of course, there is still the
possibility that it will be determined in actual balloting.

[ Note that I am posting this report as the USENIX Institutional
Representative to IEEE P1003, not as the moderator.  Replies
and related submissions are solicited.  -mod ]

Volume-Number: Volume 11, Number 91