[mod.std.unix] In which gnu dissects the "tar" section of Draft 6; V4N9.

std-unix@ut-sally.UUCP (Moderator, John Quarterman) (12/11/85)
Date: Wed, 11 Dec 85 00:32:33 PST
From: l5!gnu@LLL-CRG.ARPA (John Gilmore)

Section 10.1.1 introduces the terms "interpret" and "translate" for
"load" and "dump".  Can we just use the familiar terms?  I have trouble
remembering which is which.

[ I think using either of the words "dump" or "restore" (the latter
actually used in the section) is a mistake, since they also connote
a completely different set of programs than those usually associated
with the format in question.  -mod ]

This section also says:
> "The format-interpreting utility is defined such that if it is not a
> privileged program, when data is read into the system from the
> transportable media, all protection information is ignored.  Instead the
> user ownership and group ownership are set to that of the process context
> which is running the utility.  All access protection information can be
> set to be no more liberal than that of the process that is running the
> utility.  A privileged version of the utility must have as a minimum, an
> option that obeys the protection information stored on the transportable
> media, such that this format and the corresponding utility can be used as
> a save/restore mechanism."

First, this is self-contradictory; it says all protection information
is ignored, then says it can be set "no more liberal" than the
process.

[ The utility is not prevented from reading anything in the data
because of any protections associated with it.  However, once the
utility converts the data into files in the file system, there *is*
protection associated with it, as with any file:  the utility must
set the appropriate protection bits.  -mod ]

(I would assume the OS takes care of not letting your process
set protection more liberal than its own, else there is no security.)
I think what it means is that it's not legal for System V "tar" to always
chown the files away to whatever owner is on the tape, after which you
can't get them back.

[ Cpio actually does that under System V.  Such chowning is a major
security problem, much like your phrase:  "there is no security",
since the numeric user ids on the "tape" may have completely different
meanings on the system where it is being read than on the one where
it was written.  This problem has been addressed in several other places
in the standard as well as here.  -mod ]

Was there some other reason for this paragraph?  If not, can we replace
the text with something like:

  "The format-loading utility must not set access protections that cannot
  be revoked by the user running the utility (whether the user is
  privileged or not).  If it can be run as a privileged utility, an
  option (or default behaviour) must exist which obeys all the loaded
  protection information, so it can be used for system backups."
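
As a rough illustration of the rule, here is a minimal sketch of how a
loader might pick the permission bits for an extracted file.  The name
load_mode() and its arguments are hypothetical, not from the draft, and
the sketch assumes that file ownership is simply left as that of the
running process unless the privileged option is in effect.

```c
#include <sys/types.h>

/*
 * Hypothetical helper: choose the permission bits for an extracted file.
 *   archived   - mode bits recorded in the archive header
 *   mask       - the umask of the process running the utility
 *   privileged - nonzero if privileged AND asked to obey the archive
 */
mode_t load_mode(mode_t archived, mode_t mask, int privileged)
{
    if (privileged)
        return archived & 07777;    /* obey everything, for backups */

    /*
     * Unprivileged: drop set-uid, set-gid and sticky, then apply the
     * umask so the result is no more liberal than the process allows.
     */
    return archived & 0777 & ~mask;
}
```

With a umask of 022, an archived mode of 04755 comes out as 0755 for an
ordinary user and as 04755 under the privileged option.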

---

Also, section 10.1.2 uses confusing terminology with regard to blocks
and records.  In the data processing world, a block is a big thing and
one or more records fit in it (roughly speaking).  For example, you write
100 records, each 80 chars long, in one 8000-byte block on tape.  Has anybody
checked the ANSI standard for tape format to see what they call 'em?
The Unix standard uses "block" for the small records, "group" for the
large things, and also mentions that a "group" might turn into a single
tape "record".

I also don't see the need for two records of zeros on the end.  One
should be fine, and it won't break compatibility with the Unix tar
program, which quits as soon as it sees the first one.  Tar should
really use EOF rather than this funny end of tape record; this would
solve two or three minor problems with it, but would break
compatibility with existing Unix "tar".  (The problems:  the tape is
positioned wrong after reading a tar archive from a multi-file tape,
since the tape mark has not yet been read; you can't just concatenate
tar archives to combine their contents (which would make multi-volume
tar handling somewhat easier too); extra data is written, which
makes it uneconomical to use a large, tape-efficient block size (like a
megabyte on streaming cartridge tapes), since this will waste up to a
megabyte of space on the tape.)
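
To put a number on the wasted space, here is a small sketch of the
arithmetic.  The function name is hypothetical, and the 512-byte record
size used in the example afterwards is an assumption; the draft's value
is whatever TRECORDSIZE turns out to be.

```c
/*
 * Bytes of padding written when the last block must be full size.
 *   nrecords    - records in the archive, counting the zero record(s)
 *   record_size - bytes per record
 *   blocking    - records per block
 */
long padding_bytes(long nrecords, long record_size, long blocking)
{
    long used = nrecords % blocking;    /* records in the final block */

    if (used == 0)
        return 0;                       /* final block is exactly full */
    return (blocking - used) * record_size;
}
```

With 512-byte records and 2048 records per block (one megabyte), an
archive holding a single record pads out 2047 records, i.e. 1048064
bytes, just short of the full megabyte.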

What I suggest is that ANSI standard tars should be required to work
correctly when reading an archive terminated by EOF (a short last block,
then a zero-length result from read()).  Suggested wording:

  An archive tape or file contains a series of records.  Each record is of
  size TRECORDSIZE (see below).  Although this format may be thought of as
  being on magnetic tape, this does not exclude the use of other
  media.  Each file archived is represented by a header record
  which describes the file, followed by zero or more records which give the
  contents of the file.  At the end of the archive file there may be a record
  filled with binary zeros as an end-of-file indicator.  A conforming
  system must write a record of zeros at the end, but must not assume that
  an end-of-file record exists when reading an archive.

  The records may be blocked for physical I/O operations.  Each block of
  n records (where n is set by the application program creating the
  archive file) may be written with a single write() operation.  On
  magnetic tapes, the result of such a write is a single tape record.
  When writing an archive, the last block of records shall be written
  at the full size, with records after the zero record containing
  undefined data.  When reading an archive, a conforming system shall
  properly handle an archive whose last block is shorter than the rest.

This allows a system to provide an option to write more modern
archives, which will be readable by all P1003 conforming systems, but
requires that the default be compatible (readable with V7 Unix 'tar').
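
A minimal sketch of the reading rule, under some assumptions:  RECSIZE
stands in for TRECORDSIZE with an assumed value of 512, the names are
hypothetical, and the check is applied only at a position where a header
record is expected (a data record of zeros inside a file is not an end
marker).

```c
#include <stddef.h>

#define RECSIZE 512     /* assumed value of TRECORDSIZE */

enum rec_status { REC_HEADER, REC_END };

/* Return 1 if the record is entirely binary zeros. */
int is_zero_record(const char *rec)
{
    size_t i;

    for (i = 0; i < RECSIZE; i++)
        if (rec[i] != '\0')
            return 0;
    return 1;
}

/*
 * Classify what was found where a header is expected.  nread is the
 * number of bytes actually available there; zero means EOF, and a short
 * last block simply runs out of records early.
 */
enum rec_status classify_header(const char *rec, size_t nread)
{
    if (nread < RECSIZE)
        return REC_END;         /* EOF: no end-of-file record needed */
    if (is_zero_record(rec))
        return REC_END;         /* traditional zero record */
    return REC_HEADER;
}
```

Both kinds of archive end the same way as far as the caller is
concerned, which is the point of the wording.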

---

> /* Values used in typeflag field */
> #define REGTYPE   '0'         /* Regular file  */
> #define AREGTYPE  '\0'        /* Regular file  */
> #define LNKTYPE   '1'         /* Link          */
> #define SYMTYPE   '2'         /* Reserved      */
> #define CHRTYPE   '3'         /* Char. special */
> #define BLKTYPE   '4'         /* Block special */
> #define DIRTYPE   '5'         /* Directory     */
> #define FIFOTYPE  '6'         /* FIFO special  */
> #define CONTTYPE  '7'         /* Reserved      */

In the header file, less generic names than e.g. "REGTYPE" should be used.
How about "TF_REGULAR" (typeflag = regular file).  This avoids the well
known problem that a #define is a joy (or a pain) forever, especially
when some other header file wants to use the same name:

  /* The typeflag defines the type of file */
  #define	TF_OLDNORMAL	'\0'		/* Normal disk file, compat */
  #define	TF_NORMAL	'0'		/* Normal disk file */
  #define	TF_LINK		'1'		/* Link to dumped file */
  #define	TF_SYMLINK	'2'		/* Symbolic link */
  #define	TF_CHR		'3'		/* Character special file */
  #define	TF_BLK		'4'		/* Block special file */
  #define	TF_DIR		'5'		/* Directory */
  #define	TF_FIFO		'6'		/* FIFO special file */
  #define	TF_CONTIG	'7'		/* Contiguous file */
  /*
   * All other type values except A-Z are reserved for future standardization
   * and may not be used.  A-Z may be used for implementation-dependent
   * record types.
   */
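
As a sketch of how a reader might apply that comment, the following
sorts a typeflag into "standard", "implementation-dependent" or
"reserved".  The TF_ names are the ones proposed above, repeated so the
fragment compiles alone; the enum and function names are hypothetical.

```c
#define TF_OLDNORMAL    '\0'
#define TF_NORMAL       '0'
#define TF_LINK         '1'
#define TF_SYMLINK      '2'
#define TF_CHR          '3'
#define TF_BLK          '4'
#define TF_DIR          '5'
#define TF_FIFO         '6'
#define TF_CONTIG       '7'

enum tf_class { TF_CLASS_STANDARD, TF_CLASS_LOCAL, TF_CLASS_RESERVED };

/* Decide how a reader should regard a typeflag value. */
enum tf_class classify_typeflag(char flag)
{
    switch (flag) {
    case TF_OLDNORMAL: case TF_NORMAL: case TF_LINK:
    case TF_SYMLINK:   case TF_CHR:    case TF_BLK:
    case TF_DIR:       case TF_FIFO:   case TF_CONTIG:
        return TF_CLASS_STANDARD;
    }
    /* The archive is ASCII, so A-Z is a contiguous range. */
    if (flag >= 'A' && flag <= 'Z')
        return TF_CLASS_LOCAL;
    return TF_CLASS_RESERVED;
}
```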
 
The mode fields should use a prefix like "TM_" rather than just "T".
Also, TSVTX (the sticky bit) cannot be "reserved"; otherwise implementations
cannot write archives that have it turned on.  Call it implementation-defined,
if you must.

> All characters are represented in ASCII, using 8-bit characters without
> parity.  Each field within the structure is contiguous; that is, there is
> no padding used within the structure.  Each character on the archive media
> is stored contiguously.

You'd better be more specific.  USASCII, with the 7-bit character in the
low-order 7 bits and the high-order bit cleared?  What about foreign
sites with funny characters in their file names?

> The fields name, linkname, magic, uname and gname are null-terminated
> character strings.

Does this mean that when writing an archive, you MUST put in the null,
or if the value exactly fills the field, is it OK to not have a null
there?  In other words, caveat writer or caveat reader?  Here again, a
prudent course would be to require the writer to do it right, and
require the reader to accept it either way.
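
A minimal sketch of that strict-writer, liberal-reader split.  The
function names are hypothetical, and the field width is passed in
rather than taken from the draft's structure.

```c
#include <string.h>

/*
 * Reader side: copy a fixed-width header field into out, which must
 * have room for width + 1 bytes.  Accepts the field with or without a
 * terminating null.
 */
void field_get(const char *field, size_t width, char *out)
{
    size_t n = 0;

    while (n < width && field[n] != '\0')
        n++;
    memcpy(out, field, n);
    out[n] = '\0';
}

/*
 * Writer side: store value in a fixed-width field, always leaving at
 * least one null.  Returns 0 on success, -1 if the value does not fit.
 */
int field_put(char *field, size_t width, const char *value)
{
    size_t len = strlen(value);

    if (len >= width)
        return -1;              /* would leave no room for the null */
    memset(field, 0, width);
    memcpy(field, value, len);
    return 0;
}
```

The reader never looks past the end of the field, so an archive written
by a sloppier program is still read correctly.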

> The mtime field is the modification time of the file at the time it was
> archived.  It is the ASCII representation of the octal value of the
> modification time obtained from the stat() call.

This should be spelled out in detail, so the definition of the archive
format can stand alone.
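
For instance, a stand-alone definition would let a reader be written
along the following lines.  This is only a sketch:  the function name is
hypothetical, and the handling of leading spaces and of the terminator
is an assumption, since the quoted text does not say how the field is
padded or ended.  It also does not check for overflow.

```c
#include <stddef.h>

/*
 * Parse an octal ASCII field of the given width.  Leading spaces are
 * skipped; parsing stops at the first byte that is not an octal digit,
 * so a null or a space works as a terminator.  Returns -1 if no digit
 * was found.
 */
long octal_field(const char *field, size_t width)
{
    size_t i = 0;
    long value = 0;
    int seen = 0;

    while (i < width && field[i] == ' ')
        i++;
    for (; i < width && field[i] >= '0' && field[i] <= '7'; i++) {
        value = value * 8 + (field[i] - '0');
        seen = 1;
    }
    return seen ? value : -1;
}
```

A field holding "00000000144" therefore yields 100, i.e. a modification
time 100 seconds after the epoch used by stat().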

> ASCII digit `2' is reserved.
> ASCII digit `7' is reserved.
> ASCII letters `A' through `Z' are reserved for custom implementations.
> All other values are reserved for specification in future revisions of the
> standard.

As I understand standards, something that is reserved canNOT be used by
an implementation to extend the standard.  This is not the intention
here, since I presume compatibility with BSD systems (which use 2 for
symlinks) is desired.  I'm not sure why we don't just standardize
symlinks here; after all, not all systems have fifos or contiguous
files either...

[ They were in there at one point.  I wonder what happened to them.  -mod ]
 
> The encoding of the header is designed to be portable across machines.

This sentence can go... 
 
> 10.1.3  Notes
> ...
> Implementors should be aware that the previous file format did not include
> a mechanism to archive directory type files.  For this reason, the
> convention of using a file name which ended with a slash (/) was adopted
> to specify the archiving of a directory.

But ANSI standard systems are not required to read such a tape?  I think
they should be required to read it but not write it.
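
Reading such a tape costs very little, as this sketch suggests.  The
function name is hypothetical, and it assumes the name has already been
copied out of the header into a null-terminated string.

```c
#include <string.h>

/*
 * Return 1 if a header describes a directory, either by the new
 * typeflag or by the older convention of a name ending in a slash.
 */
int is_directory_entry(const char *name, char typeflag)
{
    size_t len = strlen(name);

    if (typeflag == '5')        /* DIRTYPE in the draft */
        return 1;
    /* Old archives only have regular-file flags; check the name. */
    if ((typeflag == '\0' || typeflag == '0') &&
        len > 0 && name[len - 1] == '/')
        return 1;
    return 0;
}
```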


An additional point.  The standard does not specify what fields are defined
in what record types.  For example, is it OK to have garbage in the linkname
in record type 0 (normal files)?  Is it OK to put zeros in the uid/gid fields
if you have filled in the uname/gname/magic fields (say your system does not
have numeric uids)?  What about the bytes in the header records that
are not defined by the structure?  Or the bytes beyond the end of a file,
in its last record?  I'd suggest that we require these fields to be nulls
on writing, and require them to be ignored on reading, again for prudence.
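
On the writing side this amounts to very little code.  A sketch for the
last record of a file, with a hypothetical function name and RECSIZE as
an assumed stand-in for TRECORDSIZE; the same memset() idea applies to a
header record before its fields are filled in.

```c
#include <string.h>

#define RECSIZE 512     /* assumed value of TRECORDSIZE */

/*
 * Copy the final n bytes of a file (0 < n <= RECSIZE) into a record
 * and zero the remainder, so nothing undefined reaches the archive.
 */
void fill_last_record(char *rec, const char *data, size_t n)
{
    memcpy(rec, data, n);
    memset(rec + n, 0, RECSIZE - n);
}
```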

Volume-Number: Volume 4, Number 9