[comp.sys.ibm.pc] File packaging and compression

jbrown@jato.Jpl.Nasa.Gov (Jordan Brown) (10/06/88)

(gotta avoid the A word... :-)

I'm considering building a PUBLIC DOMAIN (that means *no* restrictions on
anything) file packaging and compression program.  I would attempt to
maintain portability across a wide variety of environments (obviously
MS-DOS and UNIX; others as appropriate) and would distribute the source
code.

I wouldn't promise that this would be the most featureful or fastest such
program ever built, but it would be PUBLIC DOMAIN.  And since I'd be
distributing source code, if somebody else figured out a way to be a
little faster or better, we could arrange to work TOGETHER to build
a better program.  (I anticipate compression ratios comparable to those
of the existing A-word programs, because everybody really uses compress.)

I don't have any argument with either SEA or PK.  I'm not sure who is in
the wrong, but it's clear we're all suffering.  I agree completely with
whoever said that we (USENET, BBSes, etc.) simply should not be depending
on a commercial product.

The initial interesting-feature list would include hierarchy support,
compression, and multivolume archive support.

So, what do people think?  Would anybody be interested in working on such
a project?  Would anybody support (as in use) such a program?

Jordan Brown
jbrown@jato.jpl.nasa.gov

bobmon@iuvax.cs.indiana.edu (RAMontante) (10/06/88)

jbrown@jato.UUCP (Jordan Brown) writes:
}
}I'm considering building a PUBLIC DOMAIN (that means *no* restrictions on
}anything) file packaging and compression program.  I would attempt to
}maintain portability across a wide variety of environments (obviously
}MS-DOS and UNIX; others as appropriate) and would distribute the source
}code.

How about basing your program on the Zoo archive format?  You get to
step into an existing format which already has supporters.  As I said
elsewhere, the format IS public-domain; one of Rahul's specific programs
(the only one that can generate zoo archives, I think) is the only part
that has restrictions.
-- 
--    bob,mon			(bobmon@iuvax.cs.indiana.edu)
--    "Aristotle was not Belgian..."	- Wanda

dhesi@bsu-cs.UUCP (Rahul Dhesi) (10/07/88)

In article <259@jato.Jpl.Nasa.Gov> jbrown@jato.UUCP (Jordan Brown) writes:
>I'm considering building a PUBLIC DOMAIN (that means *no* restrictions on
>anything) file packaging and compression program.

Think about the following issue carefully.

If the archive is a concatenation of files like the cpio, tar, and arc
formats, then updating it requires copying the whole archive.

If the archive contains more structure, e.g. a linked list of directory
entries like the zoo format, then updates need direct access writes but
allow you to avoid copying the whole archive.
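
To make the difference concrete, here is a minimal sketch of a
zoo-style directory chain.  The field names and layout are invented
for illustration; the real zoo format differs.

/* Hypothetical directory entry for a zoo-style archive.  Each entry
 * records the offset of the next entry, so the chain can be walked
 * with seeks instead of reading every member's data.  A portable
 * archiver would read the fields byte by byte rather than fread'ing
 * a raw struct, but that detail is omitted here.
 */
#include <stdio.h>

struct dir_entry {
    long next;       /* offset of next directory entry, 0 = end of chain */
    long data;       /* offset of this member's compressed data          */
    long orig_size;  /* uncompressed size                                 */
    long size_now;   /* compressed size                                   */
    char name[14];   /* member name, NUL-terminated                       */
};

/* List member names by walking the chain; no member data is read. */
void list_archive(FILE *fp, long first)
{
    struct dir_entry e;
    long at = first;

    while (at != 0) {
        fseek(fp, at, SEEK_SET);
        fread(&e, sizeof e, 1, fp);
        printf("%-14s %ld -> %ld bytes\n", e.name, e.orig_size, e.size_now);
        at = e.next;
    }
}

Adding a member then means appending its data and entry and patching
one "next" field in place, instead of rewriting the whole file.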

Also, if the compressed file is preceded by length information, as in
cpio, tar, and arc, then you can't easily add a compressed file to the
archive without knowing the compressed size *first*, which means
compressing to a temporary file, which I don't like.
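
For contrast, a length-prefixed format forces something like this
two-pass dance.  compress_stream() and the "LEN" header are invented
for illustration; they stand in for whatever compressor and header
layout the real program would use.

#include <stdio.h>

extern void compress_stream(FILE *in, FILE *out);  /* hypothetical */

/* Add one member to a length-prefixed archive: compress to a scratch
 * file just to learn the compressed size, then copy it in behind a
 * header that carries that size.
 */
long add_member(FILE *in, FILE *archive)
{
    FILE *tmp = tmpfile();
    long clen;
    int c;

    if (tmp == NULL)
        return -1L;

    compress_stream(in, tmp);             /* pass 1: compress to scratch */
    clen = ftell(tmp);                    /* now the size is known       */

    fprintf(archive, "LEN %ld\n", clen);  /* length precedes the data    */
    rewind(tmp);
    while ((c = getc(tmp)) != EOF)        /* pass 2: copy into archive   */
        putc(c, archive);

    fclose(tmp);                          /* scratch file goes away      */
    return clen;
}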

Take a look at the way the zmodem protocol works:  it does not precede
file data with length information.  Instead, it uses an escape sequence
of bytes to denote the end of a file.  This may take some tricky
programming, and it will slow down listing of archive contents, but it
lets you add a compressed file directly to an archive without creating
a temporary file first.
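
A minimal sketch of that kind of in-band termination (this is the
idea, not zmodem's actual encoding; the byte values are arbitrary):

#include <stdio.h>

#define ESC          0x10   /* introduces control codes                */
#define ESC_LITERAL  0x00   /* ESC ESC_LITERAL = one data byte of ESC  */
#define ESC_END      0x01   /* ESC ESC_END     = end of this member    */

/* Copy one member into the archive, escaping ESC bytes in the data
 * and ending with an in-band end-of-member marker.  No length field
 * is needed.
 */
void put_member(FILE *in, FILE *archive)
{
    int c;

    while ((c = getc(in)) != EOF) {
        if (c == ESC) {
            putc(ESC, archive);
            putc(ESC_LITERAL, archive);
        } else {
            putc(c, archive);
        }
    }
    putc(ESC, archive);
    putc(ESC_END, archive);
}

The reader reverses the mapping: ESC followed by ESC_LITERAL yields one
data byte of value ESC, and ESC followed by ESC_END ends the member.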

The first (a plain concatenation) has the advantage that archives can
be read from and written to standard input/output, allowing easy use
of pipes in UNIX.

The second (a structured archive with a directory, like zoo's) has the
advantage that users with limited disk space can still create and
update large archives, and updating a large archive by adding a tiny
file does not need much overhead in CPU or I/O time.
(The tar format allows appending a file to a tar archive, but then you
can get two instances of the same file in the archive, and to extract
the file you extract both and let the second one overwrite the first --
not very elegant.)

If you can combine the advantages of both in an easy way, you have
achieved something very useful.
-- 
Rahul Dhesi         UUCP:  <backbones>!{iuvax,pur-ee}!bsu-cs!dhesi

les@chinet.UUCP (Leslie Mikesell) (10/08/88)

In article <4225@bsu-cs.UUCP> dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:

>If the archive is a concatenation of files like the cpio, tar, and arc
>formats, then updating it requires copying the whole archive.
>
>If the archive contains more structure, e.g. a linked list of directory
>entries like the zoo format, then updates need direct access writes but
>allow you to avoid copying the whole archive.

>Also, if the compressed file is preceded by length information, as in
>cpio, tar, and arc, then you can't easily add a compressed file to the
>archive without knowing the compressed size *first*, which means
>compressing to a temporary file, which I don't like.

There is also a problem with cpio even without compression: if the
length of a file changes between the time the cpio header is written
and the time the end of the file is read, the rest of the archive is
corrupted.  I think the archiver should work in a streaming mode when
necessary, so that it can handle tape drives that don't seek, but
there should be a length field that can be filled in when you can seek
on the medium.  Your idea of a magic escape sequence to mark the end
of an entry solves two problems: the file length where you can't seek
on the device, and re-syncing on an archive with a corrupted entry or
part of a multi-volume set.

The program could also keep a separate (optional) directory in another
file or tacked onto the end of the archive.  This could be used for
several purposes, with obvious advantages when the archive spans
volumes.  A minor extension would be to allow the directory portion to
contain entries for files that are not contained in the archive, which
would allow (a) preserving links that would otherwise not be possible
and (b) restoring a directory tree to exactly the condition it was in
when the last incremental backup was done (i.e., deleting extraneous
files that had been deleted before the incremental but still existed
on the last full backup or intermediate incrementals).  A fairly
simple program could manipulate the information from the directory
files to determine where to find archive copies (disk n of set xxx)
and also determine exactly which files need to be copied in an
incremental backup.
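
Something like this would give the "fill in the length if you can
seek" behavior.  compress_stream() and the "LEN" header are invented
names for illustration; the point is the back-patch.

#include <stdio.h>

extern void compress_stream(FILE *in, FILE *out);  /* hypothetical */

/* Stream one member straight into the archive (assumed open in binary
 * mode), then back-patch its length field if the medium is seekable.
 * On a raw tape the fseek() fails, the placeholder stays, and a reader
 * falls back on the end-of-entry escape sequence instead.
 */
void add_member_seekable(FILE *in, FILE *archive)
{
    long len_at, end_at, clen;

    len_at = ftell(archive);
    fprintf(archive, "LEN %10ld\n", 0L);     /* fixed-width placeholder */

    compress_stream(in, archive);            /* one pass, no temp file  */
    end_at = ftell(archive);

    if (fseek(archive, len_at, SEEK_SET) == 0) {
        clen = end_at - (len_at + 15);       /* 15 = "LEN " + 10 + '\n' */
        fprintf(archive, "LEN %10ld\n", clen);
        fseek(archive, end_at, SEEK_SET);    /* resume appending        */
    }
}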

Les Mikesell