[comp.unix.questions] Resolved tar-to-a-pipe problem

malc@equinox.unr.edu (Malcolm Carlock) (04/02/91)

It looks as if tar commands of the form

  tar cf - files | rsh somewhere dd [ibs=???] [obs=???] of=/dev/sometape

just can't be made to work the way you'd expect (i.e., being able to
read the tape later using a normal tar command).  GNU tar was recommended
as the solution, and indeed does the job very neatly:

  gtar -cf somewhere:/dev/sometape files

Piece o' cake, and highly recommended.  Beats me why dd doesn't seem to
behave more reasonably, though.

GNU tar can be ftp'd from prep.ai.mit.edu (18.71.0.38).

Malcolm L. Carlock                      Internet:  malc@unr.edu
                                        UUCP:      unr!malc
                                        BITNET:    malc@equinox

torek@elf.ee.lbl.gov (Chris Torek) (04/04/91)

In article <5932@tahoe.unr.edu> malc@equinox.unr.edu (Malcolm Carlock) writes:
>It looks as if tar commands of the form
>
>  tar cf - files | rsh somewhere dd [ibs=???] [obs=???] of=/dev/sometape
>
>just can't be made to work the way you'd expect ...

Sure they can:

	tar cf - . | rsh foo dd of=/dev/tape/1n@6250 obs=20b

Be forewarned that most incarnations of `dd' handle this egregiously
slowly.

What is going on?  This answer requires some background:

 A. Tapes have `block sizes'.  Not all tapes, mind you---most SCSI
    tapes have a fixed block size that can, for the most part, be
    ignored.  9-track tapes, however, typically record data in
    `records' separated by `gaps', and only whole records can be
    re-read later.

 B. In order to accommodate this, Unix tape drivers generally translate
    each read() or write() system call into a single record transfer.
    The size of a written record is the number of bytes passed to
    write().  (There may be some additional constraints, such as
    `the size must be even' or `the size must be no more than 32768
    bytes'.  Note that phase encoded [1600 bpi] blocks should be no
    longer than 10240 bytes, and GCR [6250 bpi] blocks should be no
    longer than 32768 bytes, to reduce the chance of an unrecoverable
    error.)  Each read() call must ask for at least one whole record
    (many drivers get this wrong and silently drop trailing portions
    of a record that was longer than the byte count given to read());
    each read() returns the actual number of bytes in the record.

 C. Network connections are generally `byte streams': the two host
    `peers' (above, the machine running tar, and the machine with the
    tape drive) will exchange data but will drop any `record boundary'
    notion at the protocol interface level.  If record boundaries are
    to be preserved, this must be done in a layer above the network
    protocol itself.  (Not all network protocols are stream-oriented,
    not even all flow-controlled, error-recovering ones.  Internet RDP
    and XNS SPP are two examples of reliable record-oriented protocols.
    Many of these, however, impose fairly small record sizes.)

 D. rsh simply opens a stream protocol, and does no work to preserve
    `packet boundaries'.

 E. dd works in mysterious ways.

	dd if=x of=y

    is the same as

	dd if=x of=y ibs=512 obs=512

    which means `open files x and y, then loop doing read(fd_x) with
    a byte count of 512, take whatever you got, copy it into an output
    buffer for file y, and each time that buffer reaches 512 bytes,
    do a single write(fd_y) with 512 bytes'.

    On the other hand,

	dd if=x of=y bs=512

    means something completely different:  `open files x and y, then
    loop doing read(fd_x) with a byte count of 512, take whatever
    you got, and do a single write(fd_y) with that count'.
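
As a rough illustration of item E (a sketch only, not dd's actual source),
the Python model below computes the sizes of the write() calls in the two
modes.  The function names and the `arrivals' list are made up for
illustration, and the model assumes that each read() sees at most one
arrival from the input stream.

```python
def dd_ibs_obs(arrivals, ibs=512, obs=512):
    """Model `dd ibs=N obs=M': read up to ibs bytes, reblock into obs-byte writes."""
    writes = []
    buffered = 0
    for chunk in arrivals:
        remaining = chunk
        while remaining > 0:
            got = min(ibs, remaining)      # one read() returns at most ibs bytes
            remaining -= got
            buffered += got
            while buffered >= obs:         # output buffer full: one write() of obs bytes
                writes.append(obs)
                buffered -= obs
    if buffered:
        writes.append(buffered)            # short final record at end of input
    return writes


def dd_bs(arrivals, bs=512):
    """Model `dd bs=N': each read() result is written out unchanged."""
    writes = []
    for chunk in arrivals:
        remaining = chunk
        while remaining > 0:
            got = min(bs, remaining)       # one read() returns at most bs bytes
            remaining -= got
            writes.append(got)             # one write() of exactly what was read
    return writes


# Hypothetical lumpy input: 20480 bytes arriving in uneven pieces.
arrivals = [1024] * 10 + [1460, 588] * 5

print(sorted(set(dd_ibs_obs(arrivals))))             # record sizes with no options
print(sorted(set(dd_ibs_obs(arrivals, obs=10240))))  # record sizes with obs=20b
print(sorted(set(dd_bs(arrivals, bs=10240))))        # record sizes with bs=20b
```

In this model, only the obs= form produces uniform 10240-byte records; the
bs= form simply mirrors whatever sizes the input happened to arrive in.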

All of this means that

	tar cf - . | rsh otherhost dd of=/dev/tape/0

will write 512 byte blocks (not what you wanted), while

	tar cf - . | rsh otherhost dd of=/dev/tape/0 bs=20b

will be even worse: it will take whatever it gets from stdin---which,
being a TCP connection, will be arbitrarily lumpy depending on the
underlying network parameters and the particular TCP implementation
---and write essentially random-sized records.  On purely `local'
(Ethernet) connections, with typical implementations, you will wind
up with 1024 byte blocks (a tar `block factor' of 2).

If a blocking factor of 2 is acceptable, and if `cat' forces 1024 byte
blocks (both true in some cases), you can use

	tar cf - . | rsh otherhost 'cat >/dev/tape/0'

but this depends on undocumented features in `cat'.  In any case, on
9-track tapes, since each `gap' occupies approximately% 0.7 inches of
otherwise useful tape space, a block size of 1024 has 10 times as many
gaps as a block size of 10240, wasting 9x1600x0.7 = 10 kbytes of tape
for every 10240 bytes of data at 1600 bpi, or 32 times as many as a
size of 32768, wasting 31x6250x0.7 = 136 kbytes of tape for every
32768 bytes of data at 6250 bpi.
-----
% Actual gap sizes vary.  In particular, certain `streaming' drives
  (all too often called `streaming' because they do not---in some cases
  the controller is too `smart' to be able to keep up with the required
  data rate, even when fed back-to-back DMA requests) have been known
  to stretch the gaps to 0.9".
-----
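
The gap arithmetic above can be checked with a few lines of Python.  This
is only a sketch: the 0.7 inch gap is the approximate figure quoted above,
the function names are invented for illustration, and real drives vary.

```python
GAP_TENTHS_OF_INCH = 7   # approximate gap of 0.7 inches, kept as an integer


def gap_bytes(density_bpi):
    """Tape consumed by one inter-record gap, expressed in bytes of data."""
    return density_bpi * GAP_TENTHS_OF_INCH / 10


def extra_gap_bytes(small, large, density_bpi):
    """Tape lost to extra gaps, per `large' bytes of data, when writing
    `small'-byte records instead of `large'-byte ones."""
    extra_gaps = large // small - 1
    return extra_gaps * gap_bytes(density_bpi)


def tape_efficiency(record_size, density_bpi):
    """Fraction of the tape that holds data rather than gaps."""
    return record_size / (record_size + gap_bytes(density_bpi))


print(extra_gap_bytes(1024, 10240, 1600))   # 9 extra gaps at 1600 bpi
print(extra_gap_bytes(1024, 32768, 6250))   # 31 extra gaps at 6250 bpi
print(tape_efficiency(1024, 1600), tape_efficiency(10240, 1600))
```

Under these assumptions, 1024-byte records at 1600 bpi leave a little
under half of the tape for data, against roughly 90% for 10240-byte
records.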

In general, because of tape gaps, you should use the largest record size
that permits error recovery.  Note, however, that some olid% hardware (such
as that found on certain AT&T 3B systems) puts a ridiculous upper limit
(5K) on tape blocks.

-----
% Go ahead, look it up... it is a perfectly good crossword puzzle word :-)
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov