[comp.sys.apple] Sparse files and ShrinkIT

nicholaA@moravian.EDU (04/28/89)

>>I believe sparse files are handled the only way possible on an 8-bit machine-
>>you just ask ProDOS to read them, ProDOS says, hey, there is a lot of 00 data
>>here, and gives it to you anyway.

>It has nothing to do with bits, Andy. Sparse files are designed to reduce disk
>usage of empty blocks. ProDOS is SUPPOSED to return 00's when you read from a
>gap in the file. The gaps represent all of those 00's. What I was suggesting
>(or maybe just jesting) was that you track down and shrink only those blocks
>which actually contain data.

What *I* am suggesting is that trying to "track down and shrink only those
blocks which actually contain data" is a bit much for the current ShrinkIT,
let alone the ][+ versions (which have almost no space left for improvements;
the IIe version has about 3-4K left).  I was just trying to let you know how
ShrinkIT _currently_ works.

>Suppose you have to shrink /USERS.DISK/HELPSCREENS.

>Open the parent directory for the file to shrink.
>	OPEN /USERS.DISK
>	descriptor = directory header entry

>Read the directory until we come to the file descriptor for /HELPSCREENS.
>	while descriptor is not for HELPSCREENS
>		READ [next descriptor]

>Depending on the file's storage type: find and read blocks of data into the
>input stream, shrink them, and write the shrunk data to the output stream.

>	If a seedling, read the single data block.
>		BLOCK_READ [data block]
>		shrink [data block]

>	If a sapling, read the index block and use it to read single data
>	blocks.
>		BLOCK_READ [index block]
>		while more data blocks
>			if non-contiguous data block
>				output SET_MARK for next file position
>			BLOCK_READ [next data block]
>			shrink [data block]

>	If a tree, read the master index block. Use that to read index blocks,
>	and use those to read single data blocks.
>		BLOCK_READ [master index block]
>		while more index blocks
>			BLOCK_READ [next index block]
>			while more data blocks
>				if non-contiguous data block
>					output SET_MARK for next file position
>				BLOCK_READ [next data block]
>				shrink [data block]

>Of course, this method relies on the current ProDOS file system's directory and
>file structures. If Apple ever decided to come out with a new filing system,
>you would have to update ShrinkIT. As your program stands now, it is basically
>file system independent, but sparse files are not handled very elegantly.

I think the easiest method to handle what you are suggesting would be to
include the entire set of index blocks within the archive.  That is probably
also the most space-efficient approach, unless you consider the alternative,
file-system-independent method described below.
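
As a rough illustration, here is what scanning a sapling file's index block
for allocated data blocks might look like in C.  Treat it as a sketch, not
ShrinkIT source: it assumes a ProDOS-order disk image where block N sits at
offset N*512, and the only ProDOS-specific fact it leans on is that an index
block keeps the low bytes of its 256 block pointers at offsets 0-255 and the
high bytes at 256-511, with a zero pointer marking a sparse gap.

    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCK_SIZE 512
    #define ENTRIES    256          /* block pointers per index block */

    static FILE *image;             /* ProDOS-order (.PO) disk image  */

    /* read 512-byte block `blk` from the image into buf */
    static int read_block(unsigned blk, unsigned char *buf)
    {
        if (fseek(image, (long)blk * BLOCK_SIZE, SEEK_SET) != 0)
            return -1;
        return fread(buf, 1, BLOCK_SIZE, image) == BLOCK_SIZE ? 0 : -1;
    }

    /* Walk one sapling file's index block and report which of its 256
       possible data blocks are really allocated.  A zero pointer means
       a sparse gap; only the non-zero entries would get shrunk. */
    static void scan_sapling(unsigned index_blk)
    {
        unsigned char index[BLOCK_SIZE];
        unsigned i, blk;

        if (read_block(index_blk, index) != 0)
            return;
        for (i = 0; i < ENTRIES; i++) {
            /* low bytes sit in offsets 0-255, high bytes in 256-511 */
            blk = index[i] | (index[i + ENTRIES] << 8);
            if (blk == 0)
                continue;           /* sparse: emit a SET_MARK instead */
            printf("data block %u holds file mark %ld\n",
                   blk, (long)i * BLOCK_SIZE);
            /* a real archiver would read block blk and RLE+LZW it here */
        }
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s image.po index-block\n", argv[0]);
            return 1;
        }
        if ((image = fopen(argv[1], "rb")) == NULL) {
            perror(argv[1]);
            return 1;
        }
        scan_sapling((unsigned)strtoul(argv[2], NULL, 0));
        fclose(image);
        return 0;
    }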

Alternatively, even though the ProDOS filing system will probably _never_ (so
says Apple) change, a "better" way of handling sparse files would be to have
a series of threads within a NuFX archive which describe the next mark within
the file which actually contains data.  The overhead for this is 16 bytes per
blank area, be it 1 byte or 4 gigabytes long.  This would allow the archive
to remain file-system independent.  Thus, your example of a single byte
preceded by 16 million bytes of "blankness" would be reduced to the header
record (64 bytes), the filename (xx bytes), a single sparse record (16 bytes),
a single data record (16 bytes), and 1 byte of non-compressed data.  If the
filename is another 10 bytes, we have a grand total of 107 bytes to represent
a 16-megabyte file in a NuFX archive.
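
To make that arithmetic concrete, here is a small C sketch of the worst case.
The 16-byte record layout is only one plausible way to spend the 16 bytes the
proposal budgets per blank area; the 64-byte header and 10-byte filename
figures come straight from the example above.

    #include <stdio.h>
    #include <stdint.h>

    /* hypothetical 16-byte sparse-thread record: one plausible layout */
    struct sparse_thread {
        uint32_t next_mark;     /* file position of the next real data */
        uint32_t gap_length;    /* how many bytes of 00's are skipped  */
        uint8_t  reserved[8];   /* pads the record out to 16 bytes     */
    };

    int main(void)
    {
        long header   = 64;                           /* NuFX header record     */
        long filename = 10;                           /* filename, per the text */
        long sparse   = sizeof(struct sparse_thread); /* one sparse record      */
        long data     = 16;                           /* one data-thread record */
        long payload  = 1;                            /* the lone non-00 byte   */

        printf("bytes to archive the 16 MB sparse file: %ld\n",
               header + filename + sparse + data + payload);   /* prints 107 */
        return 0;
    }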

>>ShrinkIt run-length encodes the block before it
>>does LZW on it, so at most, there is an extra 6 bytes of overhead per 512
>>bytes of "blankness" -- this is probably less for small files which would
>>have to have a "map" of the sparse blocks included in the data anyway...

>ProDOS currently allows for a file of length 16777215 bytes. While RLE and LZW
>are good for the majority of cases, in the extreme case, a file with one byte
>at the end, you still end up with a file which is approximately

>			   block     6 bytes
>	16777215 bytes * --------- * ------- = 196608 bytes long!
>			 512 bytes    block

Um, not really.  The 6 bytes of overhead I quoted come from the RLE'ing of the
512 bytes.  The RLE output is then LZW'd to save space, so the actual overhead
is more like 27 bytes per 4096 bytes read.  That is roughly double the space
needed to represent the same blank area via a NuFX sparse thread (16 bytes)
per 4K represented.  This applies to any run of the same byte, be it 00's or
$20's or whatever...

                           chunk      27 bytes
        16777215 bytes * ---------- * --------- = 110592 bytes
                         4096 bytes     chunk
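
A quick back-of-the-envelope check of both worst-case figures, plus what the
proposed sparse thread would cost for the same file (which, in this extreme
case, is one single 16 MB gap followed by a lone data byte):

    #include <stdio.h>

    #define MAX_FILE 16777215L     /* largest possible ProDOS file */

    int main(void)
    {
        long blocks = (MAX_FILE + 511) / 512;    /* 512-byte blocks */
        long chunks = (MAX_FILE + 4095) / 4096;  /* 4K read chunks  */

        /* RLE alone: roughly 6 bytes per 512-byte block of 00's */
        printf("RLE only:       %ld bytes\n", blocks * 6);    /* 196608 */

        /* RLE then LZW: roughly 27 bytes per 4K of 00's read */
        printf("RLE + LZW:      %ld bytes\n", chunks * 27);   /* 110592 */

        /* proposed sparse thread: 16 bytes per gap, one gap here */
        printf("sparse thread:  %ld bytes\n", 16L);
        return 0;
    }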

And chances are that if you have a sparse file _that_ long, the user will also
have storage space for a measly 110K more.  That is, the MAXimum overhead to
handle sparse files under the ProDOS filing system is 110K.  Never more.
AND, this is an extreme case -- most users will never generate a file with a
single non-00 byte at the end preceded by 16 million 00's, so the actual space
savings would be somewhat more substantial.

ShrinkIT is made to be a fairly simple archive program.  It is _not_ made to be
the answer to all of man's problems.  Right now, I consider the handling of
sparse files adequate.  Perhaps in the future, when the HFS FST becomes
available for GS/OS, the GS version may have to deal with sparse files, if only
because the HFS file system can handle files up to 4 gig in length.  But by
then I hope to have the GS version of ShrinkIT done, and it will always have
more memory available to handle the full implementation of NuFX as it now
stands (since IIe ShrinkIT actually implements only a subset of it).

andy

----
Andy Nicholas                 CsNET: shrinkit@moravian.edu
Box 435                    InterNET: shrinkit%moravian.edu@relay.cs.net 
Moravian College               uucp: rutgers!lafcol!lehi3b15!mc70!shrinkit
Bethlehem, PA  18018          GEnie: shrinkit
----                   AppleLink PE: shrinkit