[alt.hackers] Recovering corrupted UUE and ABE data.

pruss@ria.ccs.uwo.ca (Alexander Pruss) (01/11/91)

Someone recently tried to send me a zip file for the PC via email.
The first recourse was UUENCODE.  When I received these files and
tried to decode them, I got loads of CRC errors (in some but not
all lines) and of course a corrupted zip file.  I reported these
problems to the sender, and he then tried to send me an ABE encoded
file.  This is a scheme similar to UUENCODE for mailing binary files,
but it uses a different encoding method.  Of course, when I tried to
DABE these files, I got errors due to bad headers (why else would I be
posting?).  The sender gave up and was going to send the file by
ordinary mail, but I couldn't allow him to perpetrate this slight
against the world of computers.  I decided to investigate what might
be wrong with the encoded files.  First on the agenda was to get a
frequency distribution of the characters used in the UUE and ABE files.
I discovered that no backslashes (\) were present in either file, although
they normally appear in both types of encoding.  Also, there were about
twice as many colons (:) as any other character in both files.  Coincidence?
I thought not.  Some machine along the way performed this nefarious
character switch (why, I ask?  I still don't know).
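
A minimal sketch of that kind of frequency check, in Python rather than
whatever the original tool was written in.  The name char_report and the
choice of ASCII 33 to 96 as the expected alphabet are assumptions for
illustration (that range is the classic uuencode body alphabet; an ABE
file would need its own list).

```python
from collections import Counter

# Assumed alphabet: characters a classic uuencoded body may contain,
# ASCII 33 ('!') through 96 ('`').
UUE_ALPHABET = [chr(c) for c in range(33, 97)]

def char_report(text, alphabet=UUE_ALPHABET):
    """Return (missing, counts): alphabet characters that never occur,
    and a Counter of the alphabet characters that do."""
    counts = Counter(ch for ch in text if ch in alphabet)
    missing = [ch for ch in alphabet if counts[ch] == 0]
    return missing, counts

# Simulate the damage: every backslash has been turned into a colon.
sample = "".join(UUE_ALPHABET).replace("\\", ":")
missing, counts = char_report(sample)
print(missing, counts.most_common(1))
```

A character that never appears, paired with another that appears about
twice as often as its neighbours, is the signature described above.
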
Anyway, I figured that there must be enough correct information in the
two files to decode the true zip file.  Several problems of course arose,
but the handy line checksums in both types of files helped out.  ABE files,
for the uninitiated, have a character map header that specifies how each
byte is encoded.  Each printable character (most of them, at least) is used
in three different sets, with escape characters to switch between sets.  Thus
there were six colons in the character map, three of which should have been backslashes.
Using the checksums and the sets that each colon was marked as being in, it was
possible to deduce which three had been switched.  Thus the character mapping
was reconstructed.
I then wrote a filter to check the line checksum of each line in both the
UUE and ABE files.  If only one colon appeared in a line, I could tell if 
it was wrong using the checksum.  Thus all lines with only one error were
repaired.
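
The single-colon repair can be sketched as follows.  This is only an
illustration: toy_checksum_ok is a made-up check (last character holds
the sum of the others modulo 64) standing in for the real UUE and ABE
line checksums, whose exact formulas are not given here, and
repair_single_colon is a hypothetical name.

```python
def toy_checksum_ok(line):
    """Hypothetical line check: the last character encodes the sum of
    the preceding characters' 6-bit values, modulo 64."""
    body, check = line[:-1], line[-1]
    return sum(ord(c) - 32 for c in body) % 64 == ord(check) - 32

def repair_single_colon(line, checksum_ok=toy_checksum_ok):
    """Return a line that passes the checksum, or None if undecidable.

    A line that already passes is returned unchanged.  A failing line
    with exactly one colon is retried with that colon turned into a
    backslash.  Anything else is left for a later pass.
    """
    if checksum_ok(line):
        return line
    if line.count(":") != 1:
        return None
    candidate = line.replace(":", "\\")
    return candidate if checksum_ok(candidate) else None

print(repair_single_colon("AB:CB"))
```

Lines with two or more colons come back as None here; those are the ones
that need the cross-checking between the two encodings.
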
Next on the agenda was to get ABE to decode the file.  It cleverly(?) ignores
any line with a bad checksum, and then gives you errors because certain
line numbers are missing.  I thus filtered the ABE file again and rewrote
the checksum of every line that still failed so that DABE would accept it,
since I couldn't tell where the mistakes were in those lines.  I could now
get both DABE and UUDECODE to produce output zip files, which of course
were still wrong.
My first attempt was a simple fix.  

It worked by stepping through the two zip files, and when a discrepancy was 
found, it used a simple algorithm to guess which was right:
  If character in ABE zip file did not come from a colon, it is right.
  If character in UUE zip is a character that would encode to a backslash 
     under ABE and the character in the ABE zip came from a colon, the
     UUE zip is probably right.
  Else we use the ABE zip byte, not knowing any better.
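
The three rules above can be sketched in Python.  The inputs are
assumptions for illustration: abe_from_colon is a per-byte flag saying
whether the ABE byte was decoded from a colon, and backslash_bytes is the
set of byte values that the ABE character map encodes as a backslash.
Neither is produced by the stock tools; both would have to come from a
modified decoder.  Counting only the last rule as a "guess" is also an
assumption.

```python
def merge_simple(abe, uue, abe_from_colon, backslash_bytes):
    """Merge two damaged decodes byte by byte; return (merged, guesses)."""
    out = bytearray()
    guesses = 0
    for a, u, suspect in zip(abe, uue, abe_from_colon):
        if a == u or not suspect:
            # No discrepancy, or the ABE byte did not come from a colon.
            out.append(a)
        elif u in backslash_bytes:
            # The UUE byte is one that ABE would have written as a backslash.
            out.append(u)
        else:
            # Not knowing any better, keep the ABE byte.
            out.append(a)
            guesses += 1
    return bytes(out), guesses

merged, guesses = merge_simple(bytes([1, 2, 3, 4]), bytes([1, 9, 7, 8]),
                               [False, False, True, True], {7})
print(list(merged), guesses)
```
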

Although not perfect this produced a composite zip file with only two
CRC errors (out of about 10 files in the zip), even though it had to "guess"
about 50 times (of 200K bytes).
My next attempt was more ambitious and MUCH more effective.  Instead of 
using the UUE zip I worked directly from the UUENCODED file and the ABE zip.

The program read four characters from the UUE file (which decode to three
bytes) and the corresponding three bytes from the zip that ABE had produced.
  If the UUE line checksum was right, or if there were no colons in the four
    character segment, then the decoded bytes were the correct ones.
  If the ABE file had no characters that had come from colons, then these
    3 bytes were the correct ones.
  Otherwise, it constructed every variation of the four UUE characters
    obtained by substituting backslashes for colons (e.g. a::b could
    be a::b, a:\b, a\:b or a\\b).  Similarly, it constructed all variations
    of the 3 ABE bytes by replacing characters that had been
    encoded as colons with the character that encoded as a backslash in the
    same set (remember each printable character can represent 3 different
    bytes in the ABE file).  
    It then decoded all the UUE possibilities (up to 16) and compared them
    with all the ABE possibilities (up to 8).
    If it found a unique match
      it used the three bytes as correct
    and otherwise
      reported an error message, using a guess as to the correct bytes.  
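
A sketch of the matching step for one group, in Python; this is not the
original program.  uue_decode_quad uses the standard uuencode 6-bit
packing.  The ABE side is abstracted: alternatives[i] is assumed to hold
the byte that position i would decode to if its colon were really a
backslash, or None if that byte did not come from a colon.  The function
names are hypothetical.

```python
from itertools import product

def uue_decode_quad(quad):
    """Decode four uuencode characters into three bytes."""
    v = [(ord(c) - 32) & 0x3F for c in quad]
    return bytes([
        (v[0] << 2 | v[1] >> 4) & 0xFF,
        (v[1] << 4 | v[2] >> 2) & 0xFF,
        (v[2] << 6 | v[3]) & 0xFF,
    ])

def uue_variants(quad):
    """Decodes of every reading in which a colon may be a backslash."""
    options = [(":", "\\") if c == ":" else (c,) for c in quad]
    return {uue_decode_quad("".join(p)) for p in product(*options)}

def abe_variants(triple, alternatives):
    """Every reading of three ABE bytes, given per-position alternatives."""
    options = [(b,) if alt is None else (b, alt)
               for b, alt in zip(triple, alternatives)]
    return {bytes(p) for p in product(*options)}

def resolve(quad, triple, alternatives):
    """Return the unique three bytes both encodings agree on, else None."""
    matches = uue_variants(quad) & abe_variants(triple, alternatives)
    return next(iter(matches)) if len(matches) == 1 else None

# The first UUE character was a backslash before the damage; the first
# ABE byte decoded to 200 but would be 240 with a backslash.
print(list(resolve(':!"#', bytes([200, 16, 131]), [240, None, None])))
```

Four colons give 2**4 = 16 UUE readings and three suspect bytes give
2**3 = 8 ABE readings, which matches the limits quoted above.
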
      
When I ran it, it reported no errors at all in 205034 bytes, and left
me with a perfectly good zip file.

Thus I had saved the world from the horrors of the US Postal Service, and
recovered my zip file to boot.

If anyone wants a challenge, see how corrupted the two files can be and still
allow the correct file to be recovered.  I still didn't use all the
information available, since I didn't write my own ABE decoder; the line
checksums etc. would give more information about the correctness of the
decoded bytes.

pat