[mod.protocols] PROB CRC-FAIL on VMSHARE

NJG@CORNELLA.BITNET (07/21/86)

The following was posted to PROB CRC-FAIL on VMSHARE Friday evening and
has been sent to TRANS-L maillist (sorry for the 2nd and 3rd copies for
those that read these other sources, but I feel that this is important
enough to justify the added exposure that this list gives this subject).

RSCS delivers bad data when using RF modems:

We are currently using RF modems on our broad-band network to
interconnect two VM systems via RSCS v1.3 across a bisync line
supported by an Amdahl 4705 and IBM 3705 both running EP.  We have
found that this configuration has led to RSCS starting to deliver
invalid data (i.e.  the file received is NOT the same as the file
sent).  The problem occurs at rates as high as 1 error in less than 1M
transferred.  It is related to RF modems producing a class of error that
the CRC-16 that the 37x5 uses is incapable of catching.  We have a sev
1 problem (7x748 and 8x017 which is just a reference to 7x748 for
charge back reasons) open with the support center.  This problem was
originally opened against RSCS.  Level 2 EP has issued APAR IR69384
which says this is a hardware problem.  They have also handed
responsibility for the problem to the hardware folks.  In Cornell's
opinion this is not a hardware problem, it is a software problem.  I'll
be explaining this in more detail in this file and asking that anyone
using RF modems try a test to see if they are also subject to the same
problem.  Please bear with me as I try to fill you in on all the gory
details.  We feel this is a very serious failure that all sites using
RF modems should be highly concerned about.

First I'll describe our hardware and software of interest:

We are running RSCS v1.3 at SLU312.  This is a modified system, but
tests with vanilla RSCS v1.3 at SLU316 (off the 8508 tape) show the
same symptoms.  All tests have been with the DMTVMB driver.  I have not
brought up RSCS v2 and its NJI driver, nor looked at how it handles its
buffers and use of ITB characters within those buffers to control CRC
calculations.  The RSCS VMB blocksize has been set at both 4072 (the
max) and 824 (the min).  I have also successfully used a modified
DMTVMB (simple zap) with a blocksize of 400 with the same results.  Use
of a blocksize of 300 hardstops the 3705.

VM is at the HPO 4.2 level on a 3081 and a 4381-3.  The 4381-3 was a
3084 in partitioned mode until 1 month ago.  Some 6 weeks ago we were
running HPO 3.4 on these systems.  None of these changes have affected
the problem's symptoms.

We run EP with the Compro code at ship 20 in an IBM 3705 and Amdahl
4705.  Tests of vanilla EP at SLU301 (the latest you can get for VM)
show the same problems and symptoms.  The lines in question have been
on both a type 3 scanner on each of these boxes and we have also
externally wrapped a set of lines in the 3705 on its type 2
scanner.  None of these changes affect the problem.

We have used both General Instruments's Jerrold Metronet 1000 9.6kb
RF modems (hereafter referred to as 9.6kb RF modems) and
Ungermann-Bass's NM670 56kb RF modems (hereafter referred to as 56kb RF
modems).  We also have a set of Codex 56kb RF modems manufactured by
Zeta which seem to be subject to the same problems (although I have not
examined the failed files to verify the exact same symptoms in detail).
The head-end translator is an Ungermann-Bass 3 channel translator sold
for use with their NM670 and NM640 modems.  We have two which we switch
between periodically, running all our RF modem translation traffic thru
the one currently in use (the second for backup purposes only).

Definition of an error:

An error is a file delivered by RSCS which does not match the file
sent.  In all cases we have examined in detail we have found that the
error is a small number of bits changed in a small number of nibbles.
Each error consists of 2 or 3 bad nibbles within 4 consecutive bytes.
We have seen cases of two such errors occurring in a buffer.  In all
cases that we have looked at even closer the total number of incorrect
bits is even and 4 or more.  We have seen 1 and 2 bit errors in
nibbles.  We tend to see mostly a total of 4 bits in error, but have
identified cases of 6 and 8 bits in error in the RSCS buffer.

Some examples of representative errors follow:
                                            x'50004b60bace19'
                                   becomes: x'50005360b0ce19'
   number of bits in error in above nibble:       11   2
                                            x'5850500807f5d5'
                                   becomes: x'5850534806f5d5'
   number of bits in error in above nibble:        21  1
                                            x'999481a37a40c6'
                                   becomes: x'9994e1a35240c6'
   number of bits in error in above nibble:       2   11
                                            x'FF00FF00FF'
                                   becomes: x'FF067F02FF'
   number of bits in error in above nibble:      21  1
                                            x'FF00FF00FF'
                                   becomes: x'FF30FF14FF'
   number of bits in error in above nibble:     2   11
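
The per-nibble counts in the examples above can be reproduced
mechanically.  What follows is only a minimal sketch in Python (not
anything used in the original tests); the function name
nibble_bit_errors is made up for illustration.  It XORs each pair of
hex digits and counts the one bits.

```python
def nibble_bit_errors(sent_hex, received_hex):
    """Return, for each hex digit (nibble), how many bits differ."""
    if len(sent_hex) != len(received_hex):
        raise ValueError("both strings must have the same length")
    return [
        bin(int(s, 16) ^ int(r, 16)).count("1")
        for s, r in zip(sent_hex, received_hex)
    ]


# The five sent/received pairs quoted in the examples above.
examples = [
    ("50004b60bace19", "50005360b0ce19"),
    ("5850500807f5d5", "5850534806f5d5"),
    ("999481a37a40c6", "9994e1a35240c6"),
    ("FF00FF00FF", "FF067F02FF"),
    ("FF00FF00FF", "FF30FF14FF"),
]

for sent, received in examples:
    counts = nibble_bit_errors(sent, received)
    # Show a digit under each bad nibble and a blank under each good one.
    marks = "".join(str(c) if c else " " for c in counts).rstrip()
    print(sent)
    print(received)
    print(marks, " total bits in error:", sum(counts))
```

For each of the five pairs the total comes to 4 bits, which agrees with
the observation that the bit count is even and at least 4.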

Observed undetected error rates:

The bad data arrives at what we consider to be very high rates.  When
using 56kb RF modems tuned to be as clean as possible, rates as high as
1 error in 15M have been seen.  With 9.6kb RF modems which are properly
tuned things are much better with 30-100M of data being transmitted
between errors.  It is possible to de-tune the 9.6kb modems to achieve
rates of failure as high as 1 per 1M; likewise, if the 56kb RF modems
are used without explicitly tuning them, similar rates are observed.
When 9.6kb twisted pair modems (de-tuned to generate line errors at
higher rates) are tested I have passed over 200M on them without a
single bad file being delivered.

These above rates of failure have been shown to be directly related to
RSCS's reported line error rate (i.e.  data check count).  It would
seem that from the data check error counts and the bad delivered file
counts that we are seeing between 10% and 50% of the line errors
passing the CRC16 check that the 3705 is performing (i.e.
(10-50%)*(line_error_rate+bad_file_count)=bad_file_count).
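
As a worked illustration of the relation above, here is a small Python
sketch.  The counts used are hypothetical, picked only to land on the
two ends of the 10%-50% range; they are not measurements from our
tests.  The function name undetected_fraction is likewise made up.

```python
def undetected_fraction(data_checks, bad_files):
    """Fraction of all line errors that slipped past the CRC.

    data_checks: errors the CRC caught (RSCS's data check count)
    bad_files:   errors the CRC missed (bad files delivered)
    """
    total_errors = data_checks + bad_files
    if total_errors == 0:
        raise ValueError("no errors observed, fraction is undefined")
    return bad_files / total_errors


# Hypothetical counts: 9 data checks for every bad file gives 10%,
# 1 data check for every bad file gives 50%.
print(undetected_fraction(9, 1))
print(undetected_fraction(1, 1))
```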

How we tested:

We first noticed this problem when modules which worked on one CPU
started failing on the other CPU (although we now believe that we have
suffered from this problem for several years, it is only recently that
we have started to use such a connection for high volume distribution
of files between VM systems).  These MODULEs were transferred by RSCS
across the RF modems.  I wrote a quick and dirty type exec to act as a
server talking to RSCS.  This server punches a data file to RSCS, which
transfers it to another RSCS where it is delivered to a duplicate of
the original server.  The original server then waits for a file to
arrive in its reader.  When it does it compares it to the original data
file, reports any differences, and sends another copy of the file to
the other server.  In this manner we can get and keep the link as busy
as we want (i.e.  have files going in both directions or just have one
file active at a time as we choose, the choice doesn't seem to matter).
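
For sites that want to run a similar test on other software, the
compare step of such a server might look like the following Python
sketch.  This is not the exec we actually ran; the name
find_differences and the report format are assumptions made for
illustration, and how the data gets sent and received is left out.

```python
def find_differences(original, received):
    """Compare two byte strings.

    Returns a list of (offset, sent_byte, received_byte) for every
    position that differs, plus a flag telling whether the lengths
    differ.
    """
    diffs = [
        (offset, a, b)
        for offset, (a, b) in enumerate(zip(original, received))
        if a != b
    ]
    return diffs, len(original) != len(received)


# One of the error examples from this memo, used as test data.
original = bytes.fromhex("FF00FF00FF")
received = bytes.fromhex("FF067F02FF")

diffs, length_mismatch = find_differences(original, received)
for offset, a, b in diffs:
    print("offset %d: sent x'%02X' received x'%02X'" % (offset, a, b))
if length_mismatch:
    print("lengths differ")
```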

I have been running my tests as of late on a single CPU.  Thus I have
both servers and RSCS machines on the same VM.  The two RSCS
userids communicate via two lines in the same 3705.  These lines go out
to RF modems (either 9.6kb or 56kb) which are hooked up to our campus
wide broad-band network.  This test behaves the same as the use of two
real CPUs and the 3705 and 4705 described earlier in this memo.

We have watched the data leaving CP via the CCWTRACE facility as well
as watching it arrive back in from the 3705.  We can see from these
traces that the data leaves in good condition and arrives in bad
condition and that there is no data check condition.  We have thus
eliminated CP and RSCS as the source of the bad data.

We have also run an EP trace on a vanilla EP at SLU 301 over 9.6kb RF
modems hooked to type 2 scanners.  This trace shows the data being
handed to the scanner correctly and then coming back in from the
scanner on the other line with the data incorrect.  It also shows EP
deciding that the CRC16 does match.  We could not run EP trace with the
56kb lines on type 3 scanners as this hung the 3705.  We do not know
what caused this problem.  These EP traces show EP behaving correctly.
Due to the failures on both the IBM 3705 and Amdahl 4705 and due to the
correct behaviour of twisted pair we do not believe that we have a case
of the hardware CRC failing to calculate the CRC value correctly.

I have tried random data and data with a high pattern to it (alternating
x'FF' and x'00'). We can observe no effects of the data on the problem.

Why does it pass the CRC16:

From Tanenbaum's 'Computer Networks' book: "A 16-bit checksum, such as
CRC-16 or CRC-CCITT, catches all single and double errors, all errors
with an odd number of bits, all burst errors of length 16 or less,
99.997% of 17-bit error bursts, and 99.998% of 18-bit and longer
bursts".  By implication, errors with an even number of bad bits (4 or
more) are not always (or even very well?) caught.  IBM hardware people
seem to feel that this is a reasonable statement.
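
To make the point concrete, here is a minimal Python sketch of a 4-bit
error that a CRC-16 cannot see.  Several things in it are assumptions:
the generator is taken to be the usual CRC-16 polynomial
x^16+x^15+x^2+1, the bits are processed most significant first to keep
the code short (the real hardware's bit ordering may differ, but the
algebra is the same), and the error pattern is chosen by hand.  It is
not one of the patterns we observed.  Any error pattern that is a
multiple of the generator leaves the remainder unchanged, and the
generator itself has only 4 one bits.

```python
POLY = 0x18005  # x^16 + x^15 + x^2 + 1 (assumed CRC-16 generator)


def crc16(data):
    """Bitwise CRC-16, MSB first, zero initial value (simplified)."""
    reg = 0
    for byte in data:
        reg ^= byte << 8
        for _ in range(8):
            reg <<= 1
            if reg & 0x10000:
                reg ^= POLY  # subtract the generator, clears bit 16
    return reg


def make_frame(payload):
    """Append the 2 check bytes so the whole frame has remainder 0."""
    return payload + crc16(payload).to_bytes(2, "big")


def frame_ok(frame):
    """Receiver's check: a good frame leaves a zero remainder."""
    return crc16(frame) == 0


def xor_pattern(frame, pattern):
    """Flip the bits of frame selected by the integer pattern."""
    value = int.from_bytes(frame, "big") ^ pattern
    return value.to_bytes(len(frame), "big")


frame = make_frame(bytes.fromhex("ff00ff00ff00ff00"))
one_bit = xor_pattern(frame, 1 << 37)      # a single flipped bit
sneaky = xor_pattern(frame, POLY << 37)    # 4 flipped bits, 17-bit span

print("clean frame passes:  ", frame_ok(frame))
print("1-bit error passes:  ", frame_ok(one_bit))
print("4-bit pattern passes:", frame_ok(sneaky))
```

The single bit error is rejected, while the frame with 4 bad bits is
accepted even though its data differs from what was sent.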

Tanenbaum also writes earlier "As a result of the physical processes
causing the noise, errors tend to come in bursts rather than singly."
However, at the time of the printing of the edition I am quoting from,
1981, RF modems were not commonly available on the market place.  It
would appear from our experience that single and double bit errors in a
nibble are now a common mode of failure with RF modems.  At this point
we have had no luck getting a statement from someone in the know (modem
designers, etc) as to whether this is truly the case.  If it is, then it
seems that the designers of RF modems have made a mistake.  As they have
to live in a world where 16 bit CRC checks are a standard, they should
not produce a design that tends to produce errors that will pass the
16-bit CRC tests.

There have been statements made that the CRC-16 will tend to perform
better when small blocks are checked.  We have been told that 742 is
one of the 'magic' numbers.  Unfortunately RSCS v1.3's DMTVMB driver
will only allow block sizes as small as 824...  Its NJI driver, which
goes down to 300, will not talk to another copy of itself.  However we
have tried a modified version of the VMB driver with a block size of
400 and have seen no change in the symptoms.  Attempts to use this same
driver at 300 hardstop the 3705.  We are not currently running (nor do
we have any desire to run) v2 of RSCS, so we have not been in a
position to try its NJI driver at 300 or 400 byte blocks.  Even if we
could run such a test, we found from the VMB test that at 400 byte
blocks we got effective thruput rates of only 20kb on a 56kb line.

We have also been told that 256 is another 'magic number' for CRCs.
None of RSCS's drivers support block sizes this small.  The VMB driver
does not use ITBs to force the re-calculation of the CRC on smaller
quantities.  It is not yet clear to us if the NJI driver uses ITBs to
improve the performance of the CRC and yet keep the line performance up
by allowing large blocks to be sent before turning the line around.
This issue is under investigation by IBM.

It is not clear to us that reduction of the amount of data to 'reasonable'
numbers such as 256 will help this problem. The statements by Tanenbaum
and our own experiences seem to imply that this is a problem inherent
to use of 16bit CRCs.

What IBM has said:

The support center problem mentioned above has had a long and varied
history so far.  It started out against RSCS, then was passed to EP
and hardware.  From there, once we found that the CRC was being computed
correctly yet was not catching the error, it was passed back to RSCS.
RSCS has claimed that they depend upon EP and the 3705 to 'provide
reliable communications' and have refused to address the problem any
further 'as that would involve a design change'.  They sent the problem
back to EP, who has sent it to hardware and issued the APAR which is
closed 'hardware'.
The problem is STILL open as a sev 1 problem, now against hardware.

The IBM support staff associated with our account seem to agree with
our stance that this is a problem which IBM needs to address.  They
have been, and continue to be, actively working with us to reach an
acceptable resolution of the situation.  It is clear that this issue
has raised many red flags within IBM and been the topic of many
conversations within IBM.  It is not a problem to be dismissed
lightly, as it has such far reaching ramifications across their (and
many other vendors') product line.

It is clear to us that IBM needs to address this problem.  It is our
feeling that RSCS needs to add its own end-to-end verification of data
validity, not simply rely upon what we have shown to be in today's
world an unreliable CRC check performed by the front-end.
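
As an illustration of what an end-to-end check could look like, here is
a minimal Python sketch.  It is not a proposal for how RSCS should do
it and does not describe any existing RSCS function: the names wrap and
unwrap, the choice of a 32-bit CRC (Python's zlib.crc32), and the
4-byte trailer layout are all assumptions.  The sender adds a check
value computed over the whole file, and the receiver recomputes it no
matter what the line-level CRC said.

```python
import zlib


def wrap(payload):
    """Sender side: append a 32-bit check value to the file data."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unwrap(blob):
    """Receiver side: verify the check value and return the data."""
    if len(blob) < 4:
        raise ValueError("too short to hold a check value")
    payload, trailer = blob[:-4], blob[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        raise ValueError("end-to-end check failed")
    return payload


# One of the sent/received pairs from this memo, used as test data.
sent = bytes.fromhex("50004b60bace19")
blob = wrap(sent)
print(unwrap(blob) == sent)

# Same trailer, but with the data damaged as in the memo's example.
damaged = bytes.fromhex("50005360b0ce19") + blob[-4:]
try:
    unwrap(damaged)
except ValueError as err:
    print("rejected:", err)
```

A 32-bit check is still only a check.  It narrows the class of errors
that get through; it does not make undetected errors impossible.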

What else is affected by this problem?

We suspect that lots of other packages also depend upon EP and the 3705
to provide reliable communications and thus are also affected by this
problem.  We have not yet verified it, but we believe that Passthru
will have the same problems.  We know that VM's remote 3270 support
does not attempt to send small blocks to the 3274 and thus will be
subject to the problem.  However, at least remote 3274s send things as
256 byte blocks for the 3705 to validity check.  While this smaller
block size improves the performance of the CRC-16 we are suspicious
that it does not solve the problems related to this class of errors.
We suspect that packages such as CICS, VTAM, etc, etc, etc are all also
subject to the same problems of use of large blocks and relying upon
the 3705 for all validity checking.  We have not checked into this
question relative to any of these other packages.

What you can do to help us:

One of the first things I think every site with RF modems should do is
to run a similar test to see if they are also subject to these
problems.  I don't think that it matters what kind of software you are
running.  The interesting thing will be to see how common it is to have
RF modems generate errors that are not caught by the error checking
mechanisms, be they CRC-16 or otherwise.  Based upon the error rates
for our cleaner RF modem connections it would appear that you need to
run several hundred Meg of data thru the connection before you can be
relatively sure of not being subject to the problem (assuming our rates
are at all representative, which they may not be).
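
A rough way to see why several hundred Meg are needed is sketched below
in Python.  It assumes that bad files turn up independently at a steady
average rate (a Poisson model), which is our simplification and not
something we have measured.  The function name chance_of_clean_run is
made up for illustration.

```python
import math


def chance_of_clean_run(megabytes_sent, megabytes_per_error):
    """Chance of seeing no bad file at all in a test run.

    Assumes errors arrive independently at an average of one per
    megabytes_per_error megabytes (Poisson model, an assumption).
    """
    return math.exp(-megabytes_sent / megabytes_per_error)


# Using the 30M and 100M figures quoted for well tuned 9.6kb RF modems.
for sent in (100, 300, 500):
    for per_error in (30, 100):
        p = chance_of_clean_run(sent, per_error)
        print("%3dM sent, 1 error per %3dM: clean run chance %.4f"
              % (sent, per_error, p))
```

Under this model, a link that fails once per 100M would still pass a
100M test more than a third of the time, while a clean 300M test would
happen only about 5% of the time.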

The second thing you can do is express your concerns to IBM that its
RSCS (and other high level) protocols do not have their own end-to-end
verification of data validity.  It appears from our experience that EP
and the 3705's CRC-16 checks can not be relied upon in today's
environment which includes RF modems on broad-band networks.

-Nick Gimbrone, Cornell University, NJG@CORNELLA.BITNET, (607)255-3747

W8SDZ@SIMTEL20.ARPA (Keith Petersen) (07/23/86)

Have you checked for overrun of the front end by the sending machine?
This is a real problem with the new "error-free" MNP modems.  They
give one a false sense of security.  Also it has been pointed out that
the "error-free" modems don't deal with errors associated with overrun
of the receiving machine.

--Keith