[net.bugs.4bsd] cpio

earle@smeagol.UUCP (Greg Earle) (09/04/86)

Recently I attempted to read the /usr tape from the System V Release 2.0
distribution for VAX processors.  It comes from AT&T in cpio(1)
format, and Sun distributes a version of cpio from SysV.2 with OS 3.0.
I found an interesting problem:
	If one tried to read the tape directly, lo and behold it was byteswapped
so cpio complained.  Fair enough.  In the manual page for cpio, it
explicitly warns of byteswapped cpio tapes, and also warns that the `-s'
option will not help because it only swaps data bytes, and not those in the
header.  The cure, as prescribed, is to dd(1) the contents first with the
`conv=swab' option to swap all the bytes, including the header, before
feeding to cpio (with the `-s' option set).  As I was only interested in
a table of contents, I merely tried to get one via the `-t' and `-v' options
to produce an `ls -l'-like output.  In doing so, I discovered that swapping
all the bytes made cpio happy, yet somehow the filenames were still coming 
out byteswapped!!  Example:

% dd if=/dev/rmt0 ibs=10b conv=swab | cpio -istvBm | head -15
40775  sys         0  Oct 15 18:55:51 1983  
40775  uucp        0  Nov  4 12:32:17 1983  da
40775  uucp        0  Apr 26 07:35:16 1982  da/mcatc
40775  uucp        0  Jan 28 11:19:44 1982  da/mcatcn/ti
40775  uucp        0  Jan 28 11:19:44 1982  da/mcatcf/siac
40775  uucp        0  Jan 28 11:19:44 1982  da/mcatcs/mua
100600 root        0  Dec 31 21:00:00 1969  da/musol
100664 sys         0  Jun 10 07:05:33 1982  da/mrefrli
100664 uucp        0  Dec 31 21:00:00 1969  da/mapcc
40775  uucp        0  Jun 10 09:21:13 1980  da/masc
40775  sys         0  Nov  7 07:50:23 1983  ib
100775 sys     10148  Nov  5 19:16:31 1983  ib/ncpta
100775 sys     10760  Nov  5 17:14:27 1983  ib/napkc
100775 sys     10148  Nov  5 19:16:31 1983  ib/nnuapkc
100775 sys       964  Nov  5 19:32:05 1983  ib/nuuotk

I assumed that byteswapping everything would take care of the filenames
as well, but apparently they are in the `correct' order (for Suns &
680x0 architecture, at any rate) before the byteswap.

How this might have arisen???  Is it a bug in the way
it (the tape) was written originally, or a bug in cpio(1)?
Or in the way a VAX writes char arrays?

I have implemented a `fix', based on this source version:
>#ifndef lint
>static	char sccsid[] = "@(#)cpio.c 1.1 86/02/03 SMI"; /* from S5R2 1.17 */
>#endif

--------------
diff -l -cb /usr/src/sun/usr.bin/cpio.c /tmp/cpio.c
*** /usr/src/sun/usr.bin/cpio.c	Mon Feb  3 23:58:42 1986
--- /tmp/cpio.c	Wed Sep  3 20:10:23 1986
***************
*** 602,609
  	}
  	if(Cflag)
  		readhdr(Hdr.h_name, Hdr.h_namesize);
! 	else
! 		bread(Hdr.h_name, Hdr.h_namesize);
  	if(EQ(Hdr.h_name, "TRAILER!!!"))
  		return 0;
  	ftype = Hdr.h_mode & Filetype;

--- 602,611 -----
  	}
  	if(Cflag)
  		readhdr(Hdr.h_name, Hdr.h_namesize);
! 	else {
! 		bread(Name, Hdr.h_namesize);
! 		swab(Name, Hdr.h_name, (Hdr.h_namesize + 1) & ~001);
! 	}
  	if(EQ(Hdr.h_name, "TRAILER!!!"))
  		return 0;
  	ftype = Hdr.h_mode & Filetype;
--------------
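
For anyone puzzling over the length expression in that swab() call:
h_namesize counts the trailing NUL, and bytes are exchanged in pairs, so
the count has to be rounded up to an even number.  A minimal sketch of that
step in isolation, using a hand-rolled pair swap rather than swab(3); the
helper names round_up_even and swap_pairs are made up for illustration and
do not appear in cpio.c:

```c
#include <stddef.h>

/* Round a byte count up to the next even number, as (n + 1) & ~001 does. */
size_t round_up_even(size_t n)
{
	return (n + 1) & ~(size_t)1;
}

/*
 * Exchange each adjacent pair of bytes in buf, in place.
 * len is expected to be even; a trailing odd byte is left untouched.
 */
void swap_pairs(char *buf, size_t len)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		char tmp = buf[i];

		buf[i] = buf[i + 1];
		buf[i + 1] = tmp;
	}
}
```

Given the four bytes 'd', 'a', NUL, 'm' as they arrive after `conv=swab'
(an h_namesize of 4 for "adm" plus its NUL), swap_pairs(buf,
round_up_even(4)) leaves "adm" in buf.  The same arithmetic accounts for
the blank first name in the listing above: "." followed by NUL comes through
as NUL followed by ".", which prints as an empty string.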

The results of this fix:
% dd if=/dev/rmt0 ibs=10b conv=swab | cpio.fixed -istvBm | head -16
40775  sys         0  Oct 15 18:55:51 1983  .
40775  uucp        0  Nov  4 12:32:17 1983  adm
40775  uucp        0  Apr 26 07:35:16 1982  adm/acct
40775  uucp        0  Jan 28 11:19:44 1982  adm/acct/nite
40775  uucp        0  Jan 28 11:19:44 1982  adm/acct/fiscal
40775  uucp        0  Jan 28 11:19:44 1982  adm/acct/sum
100600 root        0  Dec 31 21:00:00 1969  adm/sulog
100664 sys         0  Jun 10 07:05:33 1982  adm/errfile
100664 uucp        0  Dec 31 21:00:00 1969  adm/pacct
40775  uucp        0  Jun 10 09:21:13 1980  adm/sa
40775  sys         0  Nov  7 07:50:23 1983  bin
100775 sys     10148  Nov  5 19:16:31 1983  bin/pcat
100775 sys     10760  Nov  5 17:14:27 1983  bin/pack
100775 sys     10148  Nov  5 19:16:31 1983  bin/unpack
100775 sys       964  Nov  5 19:32:05 1983  bin/uuto
100775 sys       357  Nov  5 17:32:42 1983  bin/scc

This looks a little more reasonable; but I don't know if it is a `fix' or
a `kludge to counteract a certain non-uniform condition'.

Any clarification would be appreciated.

-- 
	Greg Earle		UUCP: sdcrdcf!smeagol!earle; attmail!earle
	JPL			ARPA: elroy!smeagol!earle@csvax.caltech.edu
				AT&T: +1 818 354 0876

I'm continually AMAZED at th'breathtaking effects of WIND EROSION!!

ggs@ulysses.UUCP (Griff Smith) (09/04/86)

> Recently I attempted to read the /usr tape from the System V Release 2.0
> distribution for VAX processors.  It comes from AT&T in cpio(1)
> format, and Sun distributes a version of cpio from SysV.2 with OS 3.0.
> I found an interesting problem:
> 	If one tried to read the tape directly, lo and behold it was byteswapped
> so cpio complained.  Fair enough.  In the manual page for cpio, it
> explicitly warns of byteswapped cpio tapes, and also warns that the `-s'
> option will not help because it only swaps data bytes, and not those in the
> header.  The cure, as prescribed, is to dd(1) the contents first with the
> `conv=swab' option to swap all the bytes, including the header, before
> feeding to cpio (with the `-s' option set)....
...
> % dd if=/dev/rmt0 ibs=10b conv=swab | cpio -istvBm | head -15
> 40775  sys         0  Oct 15 18:55:51 1983  
> 40775  uucp        0  Nov  4 12:32:17 1983  da
...
> How this might have arisen???  Is it a bug in the way
> it (the tape) was written originally, or a bug in cpio(1)?
> Or in the way a VAX writes char arrays?

There are two kinds of byte swapping you might encounter: swapping
caused by ill-conceived tape controllers and swapping of binary fields
in cpio headers.  You are trying to correct for the first, but you are
being bitten by the second.  Based on your cpio command, the tape was
written without the "c" option, which means the headers are written in
machine-dependent binary instead of ASCII.  Character fields are in
normal order, however.  The dd byte-swap trick is swapping the
correctly-written ASCII data, which includes the file name.
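
To make the distinction concrete: a binary header field such as the name
size is a two-byte integer whose value depends on which byte the reader
treats as most significant, while the name itself is just bytes in sequence.
A small sketch, assuming 16-bit header fields; get_le16 and get_be16 are
illustrative helpers, not cpio functions:

```c
#include <stdint.h>

/* Interpret two tape bytes as a 16-bit value, low-order byte first (VAX order). */
uint16_t get_le16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Interpret the same two bytes high-order byte first (680x0 order). */
uint16_t get_be16(const unsigned char *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}
```

A name size of 9 written by a VAX lands on tape as the bytes 011, 000
(octal).  get_le16 recovers 9 from them; get_be16 yields 2304, which is the
sort of nonsense that makes cpio complain.  No such ambiguity exists for the
characters of the name.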

As to why the tape wasn't written with -ocB: ask AT&T!  My guess is
that the policy is to write a tape that is compatible with the machine
it is licensed to be read on.
-- 

Griff Smith	AT&T (Bell Laboratories), Murray Hill
Phone:		(201) 582-7736
UUCP:		{allegra|ihnp4}!ulysses!ggs
Internet:	ggs@ulysses.uucp

earle@smeagol.UUCP (Greg Earle) (09/05/86)

In article <760@smeagol.UUCP>, I (earle@smeagol.UUCP) wrote:
> I have implemented a `fix', based on this source version:
> >#ifndef lint
> >static	char sccsid[] = "@(#)cpio.c 1.1 86/02/03 SMI"; /* from S5R2 1.17 */
> >#endif
> 
> --------------
> diff -l -cb /usr/src/sun/usr.bin/cpio.c /tmp/cpio.c
> *** /usr/src/sun/usr.bin/cpio.c	Mon Feb  3 23:58:42 1986
> --- /tmp/cpio.c	Wed Sep  3 20:10:23 1986
> ***************
> *** 602,609
>   	}
>   	if(Cflag)
>   		readhdr(Hdr.h_name, Hdr.h_namesize);
> ! 	else
> ! 		bread(Hdr.h_name, Hdr.h_namesize);
>   	if(EQ(Hdr.h_name, "TRAILER!!!"))
>   		return 0;
>   	ftype = Hdr.h_mode & Filetype;
> 
> --- 602,611 -----
>   	}
>   	if(Cflag)
>   		readhdr(Hdr.h_name, Hdr.h_namesize);
> ! 	else {
> ! 		bread(Name, Hdr.h_namesize);
> ! 		swab(Name, Hdr.h_name, (Hdr.h_namesize + 1) & ~001);
> ! 	}
>   	if(EQ(Hdr.h_name, "TRAILER!!!"))
>   		return 0;
>   	ftype = Hdr.h_mode & Filetype;
> --------------

This should probably read:

diff -l -cb /usr/src/sun/usr.bin/cpio.c /tmp/cpio.c
*** /usr/src/sun/usr.bin/cpio.c	Mon Feb  3 23:58:42 1986
--- /tmp/cpio.c	Wed Sep  3 20:10:23 1986
***************
*** 602,609
  	}
  	if(Cflag)
  		readhdr(Hdr.h_name, Hdr.h_namesize);
! 	else
! 		bread(Hdr.h_name, Hdr.h_namesize);
  	if(EQ(Hdr.h_name, "TRAILER!!!"))
  		return 0;
  	ftype = Hdr.h_mode & Filetype;

--- 602,612 -----
  	}
  	if(Cflag)
  		readhdr(Hdr.h_name, Hdr.h_namesize);
! 	else {
! 		bread(Hdr.h_name, Hdr.h_namesize);
! 		if (Swap)
! 			swap(Hdr.h_name, (Hdr.h_namesize + 1) & ~001);
! 	}
  	if(EQ(Hdr.h_name, "TRAILER!!!"))
  		return 0;
  	ftype = Hdr.h_mode & Filetype;
----------------
since you don't want to swap unless you are converting from a machine that
needs the data bytes swapped (`-s' option).

Further question:
	There are two undocumented options, `-S' and `-b', that are supposed
to be for "swap half words" and "swap both words".  I'm not sure under what
circumstance these switches would be used (swap halfwords only when 
1/2 word != byte?  Swap both when in `pass' mode?).  Any explanation?
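
Going only by those descriptions, and assuming 16-bit halfwords within
32-bit words, the three data swaps would look something like the sketch
below.  The function names are invented, and nothing here is taken from the
cpio source:

```c
#include <stddef.h>

/* Swap the two bytes within each halfword: 1 2 3 4 -> 2 1 4 3. */
void swap_bytes_in_halfwords(unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		unsigned char t = buf[i];

		buf[i] = buf[i + 1];
		buf[i + 1] = t;
	}
}

/* Swap the two halfwords within each word: 1 2 3 4 -> 3 4 1 2. */
void swap_halfwords_in_words(unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i + 3 < len; i += 4) {
		unsigned char t0 = buf[i], t1 = buf[i + 1];

		buf[i] = buf[i + 2];
		buf[i + 1] = buf[i + 3];
		buf[i + 2] = t0;
		buf[i + 3] = t1;
	}
}

/* Do both, which reverses the bytes of each word: 1 2 3 4 -> 4 3 2 1. */
void swap_both(unsigned char *buf, size_t len)
{
	swap_bytes_in_halfwords(buf, len);
	swap_halfwords_in_words(buf, len);
}
```

In terms of the flags, these would presumably line up with `-s', `-S' and
`-b' respectively.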
-- 
	Greg Earle		UUCP: sdcrdcf!smeagol!earle; attmail!earle
	JPL			ARPA: elroy!smeagol!earle@csvax.caltech.edu
				AT&T: +1 818 354 0876

Here I am in the POSTERIOR OLFACTORY LOBULE but I don't see CARL SAGAN
anywhere!!

guy@sun.UUCP (09/06/86)

> The cure, as prescribed, is to dd(1) the contents first with the
> `conv=swab' option to swap all the bytes, including the header, before
> feeding to cpio (with the `-s' option set).  As I was only interested in
> a table of contents, I merely tried to get one via the `-t' and `-v' options
> to produce an `ls -l'-like output.  In doing so, I discovered that swapping
> all the bytes made cpio happy, yet somehow the filenames were still coming 
> out byteswapped!!
> ...
> I assumed that byteswapping everything would take care of the filenames
> as well, but apparently they are in the `correct' order (for Suns &
> 680x0 architecture, at any rate) before the byteswap.
> 
> How this might have arisen???  Is it a bug in the way
> it (the tape) was written originally, or a bug in cpio(1)?
> Or in the way a VAX writes char arrays?

The tape was written correctly.  VAXes write "char" arrays the way any sane
machine does: if character N of an array goes onto frame M of the tape,
character N+1 goes onto frame M+1, etc.  Any sane machine will also read
"char" arrays in the same way, so the character array

	"Kilroy was here"

will, if written to a tape by a sane little-endian machine, produce

	"Kilroy was here"

when read from that tape by any sane machine, whether big-endian or little-
endian.  Thus, the filenames are in the right order, except on insane
machines that swap bytes on character strings when they write them to tape
(there are such machines out there, alas).

(BTW, "for Suns & 680x0 architecture" is redundant in this case; the 680x0
is big-endian regardless of whether it appears in a Sun or anything else.
The only machine I know of where its "endianness" is settable is the WE32000
chip, and maybe the later chips in that family; the endianness of that chip
is settable from one of the pins on the chip, but I don't think there are
many of them running as little-endian machines, if any at all.)

The problem is that a "cpio" tape consists of three kinds of data:

	1) Headers.  All the data in a header (unless the tape was written
	   with the "-c" option) are in the form of "short"s, and must be
	   byte-swapped if they are read on a machine with a different
	   byte order for "short"s.

	2) Pathnames.  This is a "char" array, and must not have its
	   byte order changed.

	3) File contents.  In general, this is either: text, which is,
	   in effect, a gigantic "char" array and must not have its
	   byte order changed, or binary data, which could require
	   an arbitrarily complex transformation, so simply changing the
	   byte order is unlikely to be useful.
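
The split between the three kinds of data can be pictured with a sketch of
the binary header.  The field list follows the old binary format as
documented in cpio(4) (thirteen 16-bit fields, 26 bytes, then the
pathname); the struct and function names are illustrative rather than
copied from any cpio source, and the sketch assumes the header has already
been read into memory as raw bytes:

```c
#include <stddef.h>
#include <stdint.h>

#define BIN_HDR_SHORTS 13	/* number of 16-bit fields before the pathname */

/* Old binary cpio header: every field is a "short" (or a pair of them). */
struct bin_hdr {
	uint16_t h_magic;
	uint16_t h_dev;
	uint16_t h_ino;
	uint16_t h_mode;
	uint16_t h_uid;
	uint16_t h_gid;
	uint16_t h_nlink;
	uint16_t h_rdev;
	uint16_t h_mtime[2];
	uint16_t h_namesize;
	uint16_t h_filesize[2];
};

/*
 * Byte-swap the 13 header shorts in a raw 26-byte header image.
 * The pathname and file contents that follow are deliberately not touched.
 */
void swap_header_only(unsigned char *raw_hdr)
{
	size_t i;

	for (i = 0; i < BIN_HDR_SHORTS * 2; i += 2) {
		unsigned char tmp = raw_hdr[i];

		raw_hdr[i] = raw_hdr[i + 1];
		raw_hdr[i + 1] = tmp;
	}
}
```

Only category 1 goes through swap_header_only(); categories 2 and 3 are
passed along byte for byte.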

"dd" will change the byte order of *all* the data on the tape; thus, the
headers will be read OK but everything else will be garbled.  The System III
"cpio"'s "-b" option would swap the pathnames and the file contents, leaving
the headers alone; thus, you first run the tape through "dd" and then
through "cpio -b" to read it correctly.  Obviously, the person who
implemented this realized that swapping most of the data on the tape twice
was far more efficient than swapping a small amount of it once.

Some bear of equally little brain decided to "fix" this for System V; they
realized that almost all files written to "cpio" tapes consist solely of
characters, "short"s, or "long"s, so there should be options to swap bytes,
halfwords, or both, and those options should apply *only* to the data.
Thus, there is now no way to swap *just* the headers by some combination of
"cpio" and "dd".

The correct fix - available in our next release, because it bit *me* when
trying to read in a "cpio" tape made on a VAX - is to check the "magic
number" in the header.  If it is equal to a byte-swapped version of the
"cpio" magic number, then the tape is almost certainly a "cpio" tape written
on a machine of the opposite byte sex; "cpio" should then byte-swap the
header *and nothing else*.  This way, you don't have to worry about the byte
sex of the machine on which the tape was written (unless you're trying to
transport binary data, but in that case it's not a simple matter of byte sex
anyway); "cpio" will figure it out for you.
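
A rough sketch of that check, which is not the code from the Sun release.
It takes 070707 as the binary "cpio" magic number; the enum and function
names are invented for illustration:

```c
#include <stdint.h>

#define CPIO_MAGIC 070707	/* binary cpio magic number */

enum hdr_order { HDR_NATIVE, HDR_SWAPPED, HDR_UNKNOWN };

/* Exchange the two bytes of a 16-bit value. */
uint16_t swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/*
 * Classify a header by its magic number, as read in the host's own
 * byte order.  HDR_SWAPPED means the header shorts (and only those)
 * need their bytes exchanged before use.
 */
enum hdr_order classify_magic(uint16_t magic_as_read)
{
	if (magic_as_read == CPIO_MAGIC)
		return HDR_NATIVE;
	if (magic_as_read == swap16(CPIO_MAGIC))
		return HDR_SWAPPED;
	return HDR_UNKNOWN;
}
```

The byte-swapped form of 070707 works out to 0143561 octal, so a tape from a
machine of the opposite byte sex shows up as HDR_SWAPPED rather than as
garbage.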

Ideally, the "-c" option should be used; that writes the header in a
printable ASCII format, just as "tar" did N years before the "cpio"
maintainers figured it out.  Unfortunately, there is a bug in the System III
"cpio" that means that the "-c" format doesn't work right.  Equally
unfortunately, the S5R2 distribution tape wasn't written with "-c" ("gee,
why should it be, if it's a VAX distribution tape people are going to read
it in on their VAX, right?").
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)

gwyn@brl-smoke.ARPA (Doug Gwyn) (09/06/86)

In article <760@smeagol.UUCP> earle@smeagol.UUCP (Greg Earle) writes:
>Any clarification would be appreciated.

There are actually two "cpio" modes.  The "old original" one
works with archives that are machine-specific ("binary headers").
As you discovered, it is an oversimplification to analyze the
machine architecture dependency in terms of "byte swapping".  The
"new" cpio mode works with a machine-independent "ASCII header"
format.  AT&T ships add-on software such as DWB in the portable
format, but older distributions were in machine-specific CPIO format.
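
The portability of the ASCII-header format comes from writing each numeric
field as a string of octal digits instead of raw binary.  A sketch of the
idea for a single six-digit field; put_octal6 and get_octal6 are
hypothetical helpers, and the full header layout is not reproduced here:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Write value as exactly six octal digits plus a NUL into out (7 bytes). */
void put_octal6(char out[7], unsigned int value)
{
	/* Mask to 18 bits so the result always fits in six octal digits. */
	snprintf(out, 7, "%06o", value & 0777777u);
}

/* Parse six octal digits back into a number. */
unsigned int get_octal6(const char *field)
{
	char tmp[7];

	memcpy(tmp, field, 6);
	tmp[6] = '\0';
	return (unsigned int)strtoul(tmp, NULL, 8);
}
```

Because the field travels as the characters "070707" and not as two binary
bytes, it reads back the same on a VAX and on a 680x0.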

Unfortunately the "cpio" default is binary header (for efficiency
in the typical use of "cpio" to copy directories, I suppose, as
well as for backward-compatibility reasons).  One should be careful
to specify the "-c" option when writing archives for export to other
sites.  (There is a "-ncpio" option to "find" that makes ASCII-header
CPIO archives, by the way.)

Our version of "cpio" adds to the "out of phase" error a suggestion
that perhaps the "-c" option should be used, since this is often
the cause of that error.