[comp.binaries.ibm.pc.d] WARNING! Vicious bug in GSARC

rmpinchback@watmum.waterloo.edu (Reid M. Pinchback) (11/05/88)

I've found a nasty bug in GSARC.  It's the sort of bug you come up against
accidentally when converting archives from an old archiver (PKARC) to 
take advantage of a (supposedly better) new archiver.  Luckily I was working
on a copy of the old archive.  Actually, I've found something else, but the 
second observation is merely disappointing, not damaging.

Nasty bug found as follows:
   - de-arc an archive.  Let's say it was the archive TEST.ARC
   - tell GSARC to make an archive of the same name, i.e.:
          GSARC m TEST.ARC *.*
   - notice that (accidentally or purposely) the same archive name is to
     be used, and the old archive was neither renamed nor deleted.
   - GSARC now does NOT attempt to make an archive.  It just deletes all
     the files in the current directory, including the old archive!

                            ACK!

   - Solution: be VERY careful when converting archives, and only use
     the GSARC conversion option, i.e.:
          GSARC c TEST.ARC

The second item is pretty simple.  When using either SEA's ARC or Katz's
PKARC, the archiver will use the most effective compression method for
each file being compressed.  GSARC has this nice new Crushing method.
In fact, GSARC thinks it is SO nice that it is the ONLY method it will use
to compress a file, unless you specifically tell it to create either an
ARC- or PKARC-compatible archive.  As a result, I often end up with updated
archives that are 20-25% LARGER than they were with ARC or PKARC.  To avoid
this you would have to add files to an archive one at a time, experimenting
with the three different compression options to see which yields the best
results (see the sketch below).

                            ICK!
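
Just to show how silly that gets, here's the sort of helper you'd end
up having to write (a rough sketch in C; the GSARC command letters and
archive names below are guesses on my part, so check the docs before
trusting it):

#include <stdio.h>
#include <stdlib.h>

/* Return the size of a file in bytes, or -1 if it can't be opened. */
long file_size(const char *name)
{
    FILE *fp = fopen(name, "rb");
    long size;

    if (fp == NULL)
        return -1L;
    fseek(fp, 0L, SEEK_END);
    size = ftell(fp);
    fclose(fp);
    return size;
}

int main(int argc, char *argv[])
{
    /* The command lines below are GUESSES at the syntax for forcing
       Crushing vs. ARC- and PKARC-compatible output; check the GSARC
       docs for the real option letters.  %s is the file to add. */
    static const char *try_cmd[3] = {
        "GSARC a CRUSH.ARC %s",
        "GSARC a ARCSTYLE.ARC %s",
        "GSARC a PKSTYLE.ARC %s"
    };
    static const char *out[3] = { "CRUSH.ARC", "ARCSTYLE.ARC", "PKSTYLE.ARC" };
    char cmd[200];
    long best = -1L, size;
    int i, best_i = 0;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    for (i = 0; i < 3; i++) {
        sprintf(cmd, try_cmd[i], argv[1]);
        system(cmd);                    /* build one trial archive */
        size = file_size(out[i]);
        if (size >= 0 && (best < 0L || size < best)) {
            best = size;                /* remember the smallest so far */
            best_i = i;
        }
    }
    printf("%s wins at %ld bytes\n", out[best_i], best);
    return 0;
}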

Oh well, so much for a nice new archive tool.  Time to go play with Zoo.  :-)



      Reid M. Pinchback
      CS/C&O Undergraduate
      University of Waterloo

spolsky-joel@CS.YALE.EDU (Joel Spolsky) (11/06/88)

In article <6627@watcgl.waterloo.edu> rmpinchback@watmum.waterloo.edu (Reid M. Pinchback) writes:
>As a result, I often end up with updated
>archives that are 20-25% LARGER than they were with ARC or PKARC.  

Compressing what kind of files? 

+----------------+---------------------------------------------------+
|  Joel Spolsky  | bitnet: spolsky@yalecs     uucp: ...!yale!spolsky |
|                | arpa:   spolsky@yale.edu   voicenet: 203-436-1483 |
+----------------+---------------------------------------------------+
                                               #include <disclaimer.h>

rmpinchback@watmum.waterloo.edu (Reid M. Pinchback) (11/07/88)

In article <42285@yale-celray.yale.UUCP> spolsky-joel@CS.YALE.EDU (Joel Spolsky) writes:
>In article <6627@watcgl.waterloo.edu> rmpinchback@watmum.waterloo.edu (Reid M. Pinchback) writes:
>>As a result, I often end up with updated
>>archives that are 20-25% LARGER than they were with ARC or PKARC.  
>
>Compressing what kind of files? 

First:  All kinds of files are compressed the same way.
Second: The kinds of files where this seems to be a problem appear to be
        text files, but I'm not yet sure exactly what kind of content
        causes the lousy compression.  I first noticed it when trying to
        archive a uuencoded archive (please don't ask why, it's a
        long story) :-)
        Since then, I've noticed it cropping up often when re-arcing
        some of the various archives I've had lying around on my hard
        disk for eons, these being primarily mixed text and executable
        binaries.


   Reid M. Pinchback

miken@wybbs.UUCP (Michael Neuhaus) (11/11/88)

In article <6627@watcgl.waterloo.edu>, rmpinchback@watmum.waterloo.edu (Reid M. Pinchback) writes:
> I've found a nasty bug in GSARC.  It's the sort of bug you come up against
> accidentally when converting archives from an old archiver (PKARC) to ...

 
     I'd like to make you aware that GSARC is now PAK, to avoid infringement
of SEA's trademark on the letters ARC.

     Thanks for finding the bug in the Move command, which occurs when
moving all of the files in a directory to an existing archive in the same
directory.  You are, however, incorrect as to the nature of the bug.  You
state:

> GSARC does NOT attempt to make an archive.  It just deletes all the files
> in the current directory.
 
     PAK (formerly GSARC) does update the archive, but when it then
deletes the moved files, it doesn't check whether the newly updated
archive is one of those files!  Future releases will correct this, of
course, but in
the meantime avoid this situation by keeping the destination archive in
another directory when Moving entire directories.  This bug won't show up
if you are Moving files to a newly created archive, but it's better to be
safe.
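
     For the curious, the logic error boils down to the sketch below.
This is NOT our actual source, just a simplified illustration of the
missing check:

#include <stdio.h>
#include <string.h>

/* Simplified sketch of the Move cleanup step (NOT the actual PAK
   source).  The strcmp() guard is the missing check: without it the
   freshly updated archive gets deleted along with the moved files.
   A real fix would also have to normalize case and path names. */
void move_cleanup(char *files[], int nfiles, const char *archive)
{
    int i;

    for (i = 0; i < nfiles; i++) {
        if (strcmp(files[i], archive) == 0)
            continue;               /* don't delete the archive itself */
        remove(files[i]);
    }
}

int main(void)
{
    char *files[] = { "A.TXT", "B.TXT", "TEST.ARC" };

    /* With the guard, TEST.ARC survives; without it, it's gone too. */
    move_cleanup(files, 3, "TEST.ARC");
    return 0;
}
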
     It's too bad this didn't come out in the six months of beta-test.

> archives are 20%-25% LARGER than they were with ARC or PKARC.

     I'd seriously like to see your data, since everything we've seen has
usually been the reverse.  It is true that on certain rare text files
between 1K and 5K in final size, Crushing comes out marginally larger than
PKARC's Crunching, but never by much more than 1%.  In a sample of 20 text
files in this size range, only 3 exhibited this characteristic (sizes in
bytes; difference relative to the PAK size):

Name         Original    PAK     PKARC    Difference
EVAL2.DOC       2560      1567     1551      1.02%
SUBMIT.DOC      8704      4714     4709      0.11%
WRITE.DOC       5600      3186     3180      0.19%

      Incidentally, the final archive sizes were 34177 bytes for the PAK
archive and 34422 bytes for the PKARC archive, a net savings of 245 bytes
in PAK's favor.  Not much, but excellent considering that PAK has the most
difficulty with files in this range.  On all other test archives, PAK
averaged about 10% smaller than PKARC, and 15% smaller than ARC.

Michael Neuhaus
NoGate Consulting

wtm@neoucom.UUCP (Bill Mayhew) (11/16/88)

I don't know what GSARC uses for its coding scheme.  If it uses
Huffman coding to compress ASCII files, the output can be bigger
than the input on a short file.  Huffman coding uses a variable
number of bits to encode characters based upon the frequency with
which characters appear.  Characters that appear frequently are
encoded with short bit patterns, while infrequently encountered
characters get longer ones.  A table is prepended to such a file so that
the decoding algorithm knows which is which.

On a short file where all characters appear with about the same
frequency, Huffman coding is inefficient.  You are also penalized
by the fact that the lookup table takes some space.  Arcing a
uuencoded file of a few K in length could easily present such a
situation.
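
If you want a feel for the numbers, here is a toy C program that
computes Huffman code lengths for a sample string and tacks on a rough
guess at the table cost.  (The sample text and the two-bytes-per-symbol
table estimate are my own inventions; real archivers store the table
more compactly.)

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* A short uuencode-flavored sample: many distinct characters,
       all appearing about equally often -- Huffman's worst case. */
    const char *s = "M971T<W5N9&%Y,'1H92!Q=6EC:R!B<F]W;B!F;W@(0V]M<')E<W,";
    long freq[256];
    long w[512];                /* node weights: leaves, then internal */
    int parent[512], used[512];
    int i, j, n = 0;
    long bits = 0;

    memset(freq, 0, sizeof freq);
    memset(used, 0, sizeof used);
    for (i = 0; s[i] != '\0'; i++)
        freq[(unsigned char)s[i]]++;
    for (i = 0; i < 256; i++)
        if (freq[i] > 0)
            w[n++] = freq[i];           /* one leaf per distinct char */

    /* Build the tree: n-1 merges of the two lightest unmerged nodes. */
    for (j = 0; j < n - 1; j++) {
        int a = -1, b = -1, k, total = n + j;
        for (k = 0; k < total; k++) {
            if (used[k])
                continue;
            if (a < 0 || w[k] < w[a]) { b = a; a = k; }
            else if (b < 0 || w[k] < w[b]) b = k;
        }
        used[a] = used[b] = 1;
        w[total] = w[a] + w[b];         /* new internal node */
        parent[a] = parent[b] = total;
    }

    /* A leaf's code length is the number of hops up to the root. */
    for (i = 0; i < n; i++) {
        int len = 0, k = i;
        while (k != 2 * n - 2) { k = parent[k]; len++; }
        bits += (long)len * w[i];
    }

    printf("input:  %d bytes, %d distinct chars\n", (int)strlen(s), n);
    printf("coded:  about %ld bytes + roughly %d bytes of table\n",
           (bits + 7) / 8, 2 * n);
    return 0;
}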

Unix compress, for instance, uses Huffman coding.

--Bill

spolsky-joel@CS.YALE.EDU (Joel Spolsky) (11/17/88)

In article <1413@neoucom.UUCP> wtm@neoucom.UUCP (Bill Mayhew) writes:
| 
| I don't know what GSARC uses for its coding scheme.  If it uses
| Huffman coding to compress ASCII files, the output can be bigger
| than the input on a short file. 

Don't be absurd. Huffman encoding went out with pet rocks :-)

| Huffman coding uses a variable
| number of bits to encode characters based upon the frequency with
| which characters appear.  Characters that appear frequently are
| encoded with short bit patterns, while infrequently encountered
| characters get longer ones.  A table is prepended to such a file so that
| the decoding algorithm knows which is which.

OK, you get an A+ in CS100 :-)

| On a short file where all characters appear with about the same
| frequency, Huffman coding is inefficient.  You are also penalized
| by the fact that the lookup table takes some space.  

Well, I guess that's why nobody uses Huffman encoding :-)

| Unix compress, for instance, uses Huffman coding.

False. Unix compress uses Lempel-Ziv-Welch (LZW) encoding, which
achieves much higher compression rates than (snicker) Huffman encoding.
ARC also uses LZW.  GSARC uses a modified form of LZW with variable-length
codes, which is what makes GSARC perform better on very short files.
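
For the terminally curious, the guts of LZW fit on one screen.  Here is
a toy sketch in C that prints whole integer codes; a real coder packs
them into variable-width bit fields, which is where the win on short
files comes from:

#include <stdio.h>
#include <string.h>

#define MAXDICT 4096            /* 12-bit code space, as in compress */

static char dict[MAXDICT][64];  /* toy string table */

/* Linear search for a string's code; real coders use hashing. */
int find(const char *s, int ndict)
{
    int i;

    for (i = 0; i < ndict; i++)
        if (strcmp(dict[i], s) == 0)
            return i;
    return -1;
}

int main(void)
{
    const char *input = "abababababababab";
    int ndict = 256, i;
    char cur[64] = "", trial[64];

    for (i = 0; i < 256; i++) {         /* codes 0-255: single bytes */
        dict[i][0] = (char)i;
        dict[i][1] = '\0';
    }
    for (i = 0; input[i] != '\0'; i++) {
        sprintf(trial, "%s%c", cur, input[i]);
        if (find(trial, ndict) >= 0) {
            strcpy(cur, trial);         /* keep extending the match */
        } else {
            printf("emit %4d  \"%s\"\n", find(cur, ndict), cur);
            if (ndict < MAXDICT)
                strcpy(dict[ndict++], trial);   /* learn a new string */
            cur[0] = input[i];
            cur[1] = '\0';
        }
    }
    printf("emit %4d  \"%s\"\n", find(cur, ndict), cur);

    /* While 256 <= ndict < 512 every code fits in 9 bits; each time
       the table doubles, the code width grows by one bit.  That is
       the "variable length codes" part. */
    return 0;
}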

+----------------+----------------------------------------------------------+
|  Joel Spolsky  | bitnet: spolsky@yalecs.bitnet     uucp: ...!yale!spolsky |
|                | internet: spolsky@cs.yale.edu     voicenet: 203-436-1483 |
+----------------+----------------------------------------------------------+
                                                     #include <disclaimer.h>