pruss@ria.ccs.uwo.ca (Alexander Pruss) (01/11/91)
Someone recently tried to send me a zip file for the PC via email. The first recourse was UUENCODE. When I received the files and tried to decode them, I got loads of CRC errors (in some but not all lines) and, of course, a corrupted zip file. I reported these problems to the sender, and he then tried to send me an ABE-encoded file. ABE is a scheme similar to UUENCODE for mailing binary files, but it uses a different encoding method. Of course, when I tried to DABE these files, I got errors due to bad headers (why else would I be posting?). The sender gave up and was going to send the file by ordinary mail, but I couldn't allow him to perpetrate this slight against the world of computers.

I decided to investigate what might be wrong with the encoded files. First on the agenda was to get a frequency distribution of the characters used in the UUE and ABE files. I discovered that no backslashes (\) were present in either file, although they normally appear in both types of encoding. Also, there were about twice as many colons (:) as any other character in both files. Coincidence? I thought not. Some machine along the way had performed this nefarious character switch (why, I ask? I still don't know). Anyway, I figured that there must be enough correct information in the two files to decode the true zip file. Several problems of course arose, but the handy line checksums in both types of files helped out.

ABE files, for the ignorant, have a character-map header that specifies how each byte is encoded. Each printable character (most, at least) is used in three different sets, with escape characters to switch between sets. Thus there were six colons in the character map, three of which should have been backslashes. Using the checksums and the sets each colon was marked as belonging to, it was possible to deduce which three had been switched, and so the character mapping was reconstructed. I then wrote a filter to check the line checksum of each line in both the UUE and ABE files.
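The giveaway can be reproduced with a short histogram sketch (Python here, purely for illustration; the sample string and function name are mine, not from any of the original tools): if some gateway maps every backslash to a colon, backslashes vanish from the histogram and the colon count absorbs them.

```python
# A minimal sketch (not the original tool) of the frequency check:
# simulate a gateway that turns every backslash into a colon, and
# compare character histograms before and after. The sample text below
# is made up.
from collections import Counter

def char_frequencies(text):
    """Histogram of the visible characters in an encoded file."""
    return Counter(c for c in text if c.isprintable() and not c.isspace())

original  = r'abc\de:f\g:h:i'            # stand-in encoded text
corrupted = original.replace('\\', ':')  # the nefarious switch

f_orig = char_frequencies(original)
f_corr = char_frequencies(corrupted)
```

On the real files, the tell was exactly this: a zero count for \ and roughly double the expected count for :.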
If only one colon appeared in a line, I could tell whether it was wrong using the checksum, so all lines with only one error were repaired. Next on the agenda was to get ABE to decode the file. It cleverly(?) ignores any line with a bad checksum, and then gives you errors because certain line numbers are missing. I therefore filtered the ABE file again and patched up the remaining bad checksums themselves (I couldn't tell where the mistakes were in those lines) so that no lines would be skipped. I could now get both DABE and UUDECODE to produce output zip files, which of course were still wrong.

My first attempt was a simple fix. It worked by stepping through the two zip files, and when a discrepancy was found, it used a simple algorithm to guess which was right:

  - If the byte in the ABE zip did not come from a colon, it is right.
  - If the byte in the UUE zip is one that would encode to a backslash under ABE, and the byte in the ABE zip came from a colon, the UUE zip is probably right.
  - Otherwise, use the ABE byte, not knowing any better.

Although not perfect, this produced a composite zip file with only two CRC errors (out of about 10 files in the zip), even though it had to "guess" about 50 times (out of some 200K bytes).

My next attempt was more ambitious and MUCH more effective. Instead of using the UUE zip, I worked directly from the UUENCODED file and the ABE zip. The program read four characters from the UUE file (which decode to three bytes) and the corresponding three bytes from the zip that ABE had produced. If the UUE line checksum was right, or if there were no colons in the four-character segment, then the decoded bytes were the correct ones. If the ABE file had no characters that had come from colons, then its three bytes were the correct ones. Otherwise, it constructed all possible variations of the UUE segment by substituting backslashes for colons (e.g. a::b could be a:\b or a\:b or a\\b).
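The single-colon repair from the start of this step can be sketched as follows. A hedged reconstruction, not the original code: it assumes the common uuencode checksum variant in which a data line's last character encodes the sum of the line's decoded bytes modulo 64; the actual checksum scheme on the real files may have differed, and the function names are mine.

```python
# Hedged sketch of the single-colon repair. Assumes the common uuencode
# checksum variant: the last character of a data line encodes the sum
# of that line's decoded bytes modulo 64.

def uu_val(c):
    # uuencode maps the 6-bit values 0..63 onto the characters ' '..'_'
    # (with '`' often standing in for 0)
    return (ord(c) - 32) % 64

def decode_line(line):
    """Decode a uuencoded data line (checksum character removed)."""
    n = uu_val(line[0])                  # first character = byte count
    out = []
    body = line[1:]
    for i in range(0, len(body), 4):     # 4 characters -> 3 bytes
        v = [uu_val(c) for c in body[i:i + 4].ljust(4, '`')]
        out += [(v[0] << 2) | (v[1] >> 4),
                ((v[1] & 0xF) << 4) | (v[2] >> 2),
                ((v[2] & 0x3) << 6) | v[3]]
    return bytes(out[:n])

def checksum_ok(line, check_char):
    return sum(decode_line(line)) % 64 == uu_val(check_char)

def repair_single_colon(line, check_char):
    """If a line holds exactly one colon and fails its checksum, see
    whether turning the colon back into a backslash fixes it."""
    if line.count(':') == 1 and not checksum_ok(line, check_char):
        fixed = line.replace(':', '\\')
        if checksum_ok(fixed, check_char):
            return fixed
    return line
```

Under these assumptions, a corrupted three-byte line like "#:1$1" with checksum character '3' fails the check, while the single-colon variant "#\1$1" passes it, so that variant is the repair.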
On the ABE side, it similarly constructed all variations of the 3 ABE bytes by replacing characters that had been encoded as colons with the character that encoded as a backslash in the same set (remember, each printable character can represent 3 different bytes in the ABE file). It then decoded all the UUE possibilities (up to 16) and compared them with all the ABE possibilities (up to 8). If it found a unique match, it used those three bytes as correct; otherwise it reported an error message and made a guess at the correct bytes. When I ran it, it produced absolutely no errors in 205034 bytes, and left me with a perfectly good zip file. Thus I had saved the world from the horrors of the US Postal Service, and recovered my zip file to boot.

If anyone wants a challenge, see how corrupted the two files can be while still allowing the correct file to be recovered - I still didn't use all the information available, since I didn't write a DABE decoder; the line checksums etc. would give more information about the correctness of the decoded bytes.

pat
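The cross-matching step can be sketched like this (again a hedged reconstruction, not the original program: the uuencode side uses the standard 4-characters-to-3-bytes mapping, while the ABE side, whose real character map came from the repaired header, is abstracted here to a precomputed set of candidate byte triples).

```python
# Sketch of the cross-matching step. The ABE side is abstracted as a
# set of candidate 3-byte values, since the real ABE character map is
# not reproduced here.
from itertools import product

def uu_val(c):
    # standard uuencode character-to-6-bit-value mapping
    return (ord(c) - 32) % 64

def decode_group(grp):
    """Standard uuencode: one 4-character group -> 3 bytes."""
    v = [uu_val(c) for c in grp]
    return bytes([(v[0] << 2) | (v[1] >> 4),
                  ((v[1] & 0xF) << 4) | (v[2] >> 2),
                  ((v[2] & 0x3) << 6) | v[3]])

def colon_variants(grp):
    """Every way of leaving each ':' alone or restoring it to a
    backslash (so 'a::b' yields 4 variants, as in the text)."""
    choices = [(c,) if c != ':' else (':', '\\') for c in grp]
    return [''.join(p) for p in product(*choices)]

def match_group(uue_group, abe_candidates):
    """Return the unique 3-byte value both encodings can agree on,
    or None if the match is absent or ambiguous."""
    uue_set = {decode_group(g) for g in colon_variants(uue_group)}
    common = uue_set & abe_candidates
    return common.pop() if len(common) == 1 else None
```

A unique intersection between the two candidate sets pins down the three bytes; an empty or multi-way intersection is exactly the "report an error and guess" case above.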