[comp.lang.perl] Need help with error correction.

steve@gapos.bt.co.uk (Steve Rooke) (02/05/91)

I have two sites, one sending a file, the other receiving.  There is a lot
of corruption at the receiving end caused by line noise.  I am not able to
use any standard form of error correction on the line but I can request
retransmission of the file, as many times as needed, over another link.

I need to be able to compare these files and reconstruct the original with
reasonable confidence.  By that I mean that if two, or more, files have the
same text at a certain point then I am reasonably confident that the text
is OK.

I originally thought of basing this on diff by trying to understand its
output and collecting up the lines that are not different in each file but
the error rate may be high enough to cause some corruption on each line.
I expect the routine would have to look for substrings and be able to
re-sync when characters are lost/gained.

Has anyone tried to do this sort of thing before and how did you do it,
please?  Solutions based upon diff would do, I guess, for a majority of
the time.

Thanks a lot,
Steve
-- 
Steve Rooke  steve@gapos.bt.co.uk  (...mcsun!ukc!gapos!steve)  UK + 394 693595
BT, CSD/AS, Area 106, Anzani House,   | "You roll the dice with your heart
Trinity Ave, FELIXSTOWE, Suffolk, UK  |   and soul,  But some times you
#include <std/disclaimer>             |    just don't know." - Sam Brown

hunt@dg-rtp.rtp.dg.com (Greg Hunt) (02/06/91)

In article <steve.665748651@paddy>, steve@gapos.bt.co.uk (Steve Rooke) writes:
> I have two sites, one sending a file, the other receiving.  There is a lot
> of corruption at the receiving end caused by line noise.  I am not able to
> use any standard form of error correction on the line but I can request
> retransmission of the file, as many times as needed, over another link.
> 
> I need to be able to compare these files and reconstruct the original with
> reasonable confidence.  By that I mean that if two, or more, files have the
> same text at a certain point then I am reasonably confident that the text
> is OK.
> 
> Has anyone tried to do this sort of thing before and how did you do it,
> please?  Solutions based upon diff would do, I guess, for a majority of
> the time.

When I've had file transmission problems, I've used sum(1) to produce
a checksum of the file on both the sending side machine and the
receiving side machine and compared the results.  If they weren't the
same, then I knew that something got corrupted in the transmission and
I got the file again.

If the systems you're working with have sum(1) that might be an easy
thing to use.  Also, sum(1) will work for any sort of file, it doesn't
just have to be text (which is the only thing diff(1) can look at).

-- 
Greg Hunt                        Internet: hunt@dg-rtp.rtp.dg.com
DG/UX Kernel Development         UUCP:     {world}!mcnc!rti!dg-rtp!hunt
Data General Corporation
Research Triangle Park, NC, USA  These opinions are mine, not DG's.

meissner@osf.org (Michael Meissner) (02/07/91)

In article <1991Feb6.142829.20725@dg-rtp.dg.com>
hunt@dg-rtp.rtp.dg.com (Greg Hunt) writes:

| When I've had file transmission problems, I've used sum(1) to produce
| a checksum of the file on both the sending side machine and the
| receiving side machine and compared the results.  If they weren't the
| same, then I knew that something got corrupted in the transmission and
| I got the file again.
| 
| If the systems you're working with have sum(1) that might be an easy
| thing to use.  Also, sum(1) will work for any sort of file, it doesn't
| just have to be text (which is the only thing diff(1) can look at).

The only hitch is that sum(1) produces different results on System V
based systems and Berkeley based systems.  I think sum -r on System V
gives the BSD behavior.
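
To make the difference concrete, here is a minimal Perl sketch of the two
checksum styles as they are commonly described.  It assumes the System V
style is the plain byte sum reduced modulo 65535 and the BSD style is a
16-bit rotate-right-and-add.  The names sysv_sum and bsd_sum are
hypothetical, and the block counts that sum(1) also prints are left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: System V style checksum, assumed here to be the
# sum of all bytes reduced modulo 65535.
sub sysv_sum {
    my ($data) = @_;
    my $sum = 0;
    $sum += $_ for unpack("C*", $data);
    return $sum % 65535;
}

# Hypothetical helper: BSD style checksum, assumed here to rotate the
# 16-bit running value right by one bit and then add the next byte.
sub bsd_sum {
    my ($data) = @_;
    my $sum = 0;
    for my $byte (unpack("C*", $data)) {
        $sum = ($sum >> 1) | (($sum & 1) << 15);   # 16-bit rotate right
        $sum = ($sum + $byte) & 0xFFFF;            # add byte, keep 16 bits
    }
    return $sum;
}

my $text = "abc";
printf "sysv=%d bsd=%d\n", sysv_sum($text), bsd_sum($text);
```

The two values differ even for a three-byte input, so a checksum taken
on one flavour of system cannot be compared directly with one taken on
the other unless both sides agree on the algorithm.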

--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (02/07/91)

In article <MEISSNER.91Feb6163623@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
: In article <1991Feb6.142829.20725@dg-rtp.dg.com>
: hunt@dg-rtp.rtp.dg.com (Greg Hunt) writes:
: 
: | When I've had file transmission problems, I've used sum(1) to produce
: | a checksum of the file on both the sending side machine and the
: | receiving side machine and compared the results.  If they weren't the
: | same, then I knew that something got corrupted in the transmission and
: | I got the file again.
: | 
: | If the systems you're working with have sum(1) that might be an easy
: | thing to use.  Also, sum(1) will work for any sort of file, it doesn't
: | just have to be text (which is the only thing diff(1) can look at).
: 
: The only hitch is that sum(1) produces different results on System V
: based systems and Berkeley based systems.  I think sum -r on System V
: gives the BSD behavior.

Since this is cross-posted to comp.lang.perl, I suppose it's okay for me
to mention that you can emulate System V sum with

#!/usr/bin/perl

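undef $/;       # no input record separator: read each file in one gulp
while (<>) {
    # byte sum modulo 65535, size in 512-byte blocks, then the file name
    print unpack("%32C*", $_) % 65535, " ", int((length()+511)/512), " $ARGV\n";
}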

The Book, by the way, is wrong when it says you can emulate sum with "%16C*".
That is only guaranteed to work on files less than 256 bytes long (512 if
there are no eighth bits).  Teach me to choose my test cases better...

No, I didn't have any sources to consult.  The man page says sum does a
16-bit checksum, and it lies.  It does modulo 65535 (not 65536).  Ah well.

The above code will only work on files up to 2**24 bytes long or so.
Some machines may need to change the "%32C*" to "%31C*" until 4.0 comes
out, since some machines think that 1 << 32 == 1, GRRR!  I won't mention
any names, because I don't want to get sun4's into trouble...  :-)

Larry

raja@bombay.cps.msu.edu (Narayan S. Raja) (02/07/91)

In article <1991Feb6.142829.20725@dg-rtp>, (Greg Hunt) writes:


< When I've had file transmission problems, I've used sum(1) to produce
< a checksum of the file on both the sending side machine and the
< receiving side machine and compared the results.  If they weren't the
< same, then I knew that something got corrupted in the transmission and
< I got the file again.


I've also used sum for the same purpose.
However, according to the man page, sum
may give different checksums if sizeof(int)
is different on the two machines.


Narayan Sriranga Raja.

les@chinet.chi.il.us (Leslie Mikesell) (02/08/91)

In article <steve.665748651@paddy> steve@gapos.bt.co.uk (Steve Rooke) writes:
>I have two sites, one sending a file, the other receiving.  There is a lot
>of corruption at the receiving end caused by line noise.  I am not able to
>use any standard form of error correction on the line but I can request
>retransmission of the file, as many times as needed, over another link.

Unless one or both of the sites are truly arcane or mismanaged, you
should be able to get a version of kermit working to provide error
correction during the transfer.  If you have a problem with getting
appropriate permissions for outbound calls with kermit, you might
consider using a PC as an intermediate, placing the calls into
both machines.  Both the unix and PC versions can be script-driven
so you can probably make it run unattended.

Les Mikesell
  les@chinet.chi.il.us

worley@compass.com (Dale Worley) (02/08/91)

   When I've had file transmission problems, I've used sum(1) to produce
   a checksum of the file on both the sending side machine and the
   receiving side machine and compared the results.  If they weren't the
   same, then I knew that something got corrupted in the transmission and
   I got the file again.

But you're forgetting that in this application there are so many
errors that one cannot expect that more than a few lines get through
without error.  The probability that the entire file gets through
without error is infinitesimal, and waiting for it to happen twice
would take forever.

Here's an idea: Break up lines into, say, ten-character lines.  (In
fact, you are using newlines in the file to resynchronize the
line-breaking algorithm.)  The line length should be chosen so that at
least 3/4 of the created lines have no errors in them.  Then apply GNU
diff or diff3 (for speed) to the resulting files.  Since most of the
ten-character lines get through uncorrupted, diff should be able to
discern how the two files correspond.  Then you can integrate the
output of one or more diffs to reconstruct the file.
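
The line-breaking step might be sketched in Perl as follows.  This is
only an illustration of the idea above: chunk_lines is a hypothetical
name, the width of ten is the figure suggested in the text, and the
demo strings are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split each input line into pieces of at most
# $width characters.  Chunking restarts at every input line, so a lost
# or gained character only shifts the pieces up to the next newline.
sub chunk_lines {
    my ($width, @lines) = @_;
    my @pieces;
    for my $line (@lines) {
        my $text = $line;
        chomp $text;
        if (length $text) {
            # Greedy match yields full-width pieces plus a shorter tail.
            push @pieces, $text =~ /.{1,$width}/gs;
        } else {
            # Keep an empty line as one empty piece.
            push @pieces, "";
        }
    }
    return @pieces;
}

# Demo on made-up input: one piece per output line.
print "$_\n" for chunk_lines(10, "The quick brown fox jumps\n", "over\n");
```

Each received copy would be written out this way, one piece per line,
and the resulting files handed to diff or diff3.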

Dale Worley		Compass, Inc.			worley@compass.com
--
PHOTOVOLTAICS: safe and clean (but not cheap) electricity from the SUN.

guy@auspex.auspex.com (Guy Harris) (02/08/91)

>Some machines may need to change the "%32C*" to "%31C*" until 4.0 comes
>out, since some machines think that 1 << 32 == 1, GRRR!  I won't mention
>any names, because I don't want to get sun4's into trouble...  :-)

Or PCs and clones or other 386-based machines, or 3B{2,5,15}s, or
perhaps DECstations, or MIPS boxes, or MIPS-based SGI boxes, or.... 
SPARC is hardly unique in that regard; maybe MIPS's compiler generates
instructions to compensate for the fact that the shift count is taken
modulo 32, but other compilers don't.

steve@gapos.bt.co.uk (Steve Rooke) (02/08/91)

In article <steve.665748651@paddy>, steve@gapos.bt.co.uk (Steve Rooke) writes:
> I have two sites, one sending a file, the other receiving.  There is a lot
> of corruption at the receiving end caused by line noise.  I am not able to
> use any standard form of error correction on the line but I can request
> retransmission of the file, as many times as needed, over another link.
> 
> I need to be able to compare these files and reconstruct the original with
> reasonable confidence.  By that I mean that if two, or more, files have the
> same text at a certain point then I am reasonably confident that the text
> is OK.
> 
> Has anyone tried to do this sort of thing before and how did you do it,
> please?  Solutions based upon diff would do, I guess, for a majority of
> the time.

Thanks for all your initial replies about using file xfer protocols, kermit
and such, checksumming of the file at each end and splitting the file into
short line lengths to enable diff to at least produce some matches.

I guess I should have expanded on the real problem.  As I stated, I cannot
use any standard xfer protocols which is due to the sending and receiving
equipment being dumb (i.e. LIKE telex [no flames please!]).  The file is
then passed onto a U*IX system where it is checked and acted upon.
The feedback path, for error retransmission, is by voice as, up to now,
a human has checked the file and pieced the contents together from a
number of transmissions.

As you can see I have no way of running kermit (or the like), checksums
or altering the standard xmission of the file but I can request a resending
as many times as necessary.  The only way, I can see, for rebuilding the
correct contents is to compare sets of files and select matching sub-strings
in them.
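
The selection step could look something like the Perl sketch below.  It
rests on a large assumption: the copies have already been cut into
pieces and lined up so that piece N of every copy corresponds to the
same stretch of the original, which is the hard re-sync part and is not
attempted here.  The name vote and the sample data are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: given several copies (array refs of pieces that
# are ASSUMED to be aligned already), keep a piece only when at least
# two copies agree on it.  Positions with no agreement come back undef.
sub vote {
    my (@copies) = @_;
    my $n = 0;
    for my $copy (@copies) {
        $n = @$copy if @$copy > $n;     # longest copy sets the length
    }
    my @result;
    for my $i (0 .. $n - 1) {
        my %count;
        for my $copy (@copies) {
            next unless defined $copy->[$i];
            $count{ $copy->[$i] }++;
        }
        # Most frequent piece first; ties broken alphabetically.
        my ($best) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
        push @result, (defined($best) && $count{$best} >= 2) ? $best : undef;
    }
    return @result;
}

# Made-up sample: three noisy receptions of the same four pieces.
my @first  = ("SEND 10 CR", "ATES TO FE", "LIXSTOWE", "BY FRIDAY");
my @second = ("SEND 10 CR", "AT#S TO FE", "LIXSTOWE", "BY FR!DAY");
my @third  = ("SENx 10 CR", "ATES TO FE", "LIXST0WE", "BY FRIDA?");

for my $piece (vote(\@first, \@second, \@third)) {
    print defined($piece) ? "$piece\n" : "<resend>\n";
}
```

Pieces that fail the vote mark the places where another retransmission
is still needed.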

If you have any further ideas or some code fragments then please let me
know.

Thanks again,
Steve

louie@sayshell.umd.edu (Louis A. Mamakos) (02/11/91)

This is (probably) not a perl solution, but you might investigate using some
sort of forward error correcting code, since the "cost" of retransmitting
parts of the data is "high."
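
As one small illustration of what a forward error correcting code does,
here is a Perl sketch of a Hamming(7,4) code, which sends four data bits
as seven and can repair any single flipped bit per group.  It is chosen
here only as a textbook example, not as a recommendation for this link:
it needs a sender that can encode, and it repairs flipped bits, not lost
or gained characters.  The names hamming_encode and hamming_decode are
hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: encode a 4-bit value (0..15) as seven bits laid
# out as p1 p2 d1 p4 d2 d3 d4 (positions 1..7).
sub hamming_encode {
    my ($nibble) = @_;
    my @d = map { ($nibble >> $_) & 1 } 3, 2, 1, 0;   # d1..d4
    my $p1 = $d[0] ^ $d[1] ^ $d[3];   # covers positions 1,3,5,7
    my $p2 = $d[0] ^ $d[2] ^ $d[3];   # covers positions 2,3,6,7
    my $p4 = $d[1] ^ $d[2] ^ $d[3];   # covers positions 4,5,6,7
    return ($p1, $p2, $d[0], $p4, $d[1], $d[2], $d[3]);
}

# Hypothetical helper: take seven received bits, fix at most one flipped
# bit, and return the 4-bit value.
sub hamming_decode {
    my @b = (0, @_);                  # pad so positions are 1-based
    my $s1 = $b[1] ^ $b[3] ^ $b[5] ^ $b[7];
    my $s2 = $b[2] ^ $b[3] ^ $b[6] ^ $b[7];
    my $s4 = $b[4] ^ $b[5] ^ $b[6] ^ $b[7];
    my $pos = $s1 + 2 * $s2 + 4 * $s4;   # 0 means no error detected
    $b[$pos] ^= 1 if $pos;               # flip the offending bit back
    return ($b[3] << 3) | ($b[5] << 2) | ($b[6] << 1) | $b[7];
}

my @code = hamming_encode(0xB);
$code[4] ^= 1;                           # simulate one bit of line noise
printf "decoded: %X\n", hamming_decode(@code);
```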

louie