[mod.protocols] CRC failure analysis

NJG@CORNELLA.BITNET (07/24/86)
I received the following from Craig Watkins, the author of Jnet.
I will also be uploading this information to PROB CRC-FAIL on VMSHARE.
We are also following up on a suspicion expressed by a couple of sources
that the reason that these RF broadband modems are generating this class
of error is related to the scrambling polynomial used by them. The basic
analysis is that if they have factors in common then we would see the
sort of problems we are currently seeing. I'll upload more information
as this suspicion is firmed up.
-NJGimbrone

-------------------------------------------------------------------------

In the following, I will probably tell you much that you already know
since it is obvious that you have put a lot of work into this problem.
Please forgive any of my explanation that you may already know.

I first took the blocks of trace data that you sent me and ran them
through some software and calculated the CRCs on them in the exact
manner that your communications processor would do.  This shows us
that your modems are indeed creating errors that CRC16 is unable to catch
and that your hardware is working correctly.

I also performed CRC calculations on the data that you had in your
posting.  Each of the pairs of (short) data streams showed the same resultant
CRC, again confirming that these were errors that are undetected with
CRC16.  I then tried to characterize what type of error represents something
that will be undetected.

A very basic understanding of the math behind CRCs is necessary.  A
transmitted bit stream can be thought of as a very large binary number.
A CRC polynomial, such as CRC16 (X|16 + X|15 + X|2 + 1) can also be represented
as the binary number 11000000000000101.  The "CRC" code is the remainder
resulting from the division of the bit stream by the CRC polynomial.

To divide these binary numbers "modulo mathematics" is used.  Involved in
modulo mathematics is the use of exclusive-or'ing instead of the traditional
additions and subtractions.  This method makes the hardware simpler, as
well as the concepts.

It is fairly clear that if we divide G by P and get remainder X, we will
also yield X as a remainder if we divide G+P by P.  Further, we will achieve
a remainder X when we divide G+nP by P where n is an integer.  In the process
of calculating a CRC, G is the binary bit stream, P is the CRC polynomial
and X is the CRC check code.

What the above suggests is that we can add any multiple of the polynomial to
the actual bit stream and yield the same CRC check code.  Since we are using
modulo math, our adds will take on the form of exclusive-or's.  The logical
operation "exclusive-or" can be thought of as a bit-toggle, or in our
particular analysis, an error.

What I am suggesting is that you can take the CRC polynomial in the format
of 11000000000000101 and exclusive-or it into the good bit stream, producing
an erroneous bit stream which will produce the same CRC check code.  Further,
this mask can be exclusive-or'ed into the bit stream in any position, as
many times as desired, and the resultant CRC check code will remain the
same.

You will note that in the five examples in your posting, each one was
an instance of the single mask 11000000000000101 being XOR'ed into the
bit stream.  While looking at a hex dump, it is not immediately obvious
that this is so.  One must realize that each byte is sent LSB first.  As an
example, 52 7A B5 would be sent as:
(first bit sent this end) 0100 1010  0101 1110  1010 1101 (last bit sent).

Thus, it seems reasonable to assume that for some reason your RF modems
tend to produce errors in the pattern of the CRC16 polynomial.

Comments on specifics you mentioned in your posting:

> We have seen cases of two such errors occurring in a buffer.  In all
> cases that we have looked at even closer the total number of incorrect
> bits is even and 4 or more.  We have seen 1 and 2 bit errors in
> nibbles.  We tend to see mostly a total of 4 bits in error, but have
> identified cases of 6 and 8 bits in error in the RSCS buffer.

Four-bit errors are the smallest number of bit errors that will be passed
undetected because there are four terms in the CRC16 polynomial.  Likewise,
following the above discussion of exclusive-or'ing various positioned masks
in the bit stream, you will also note that it is impossible to achieve an
odd number of bit errors that go undetected.  Your observations seem to
be correct.

> There has been statements made that the CRC-16 will tend to perform
> better when small blocks are checked.  We have been told that 742 is
> one of the 'magic' numbers.  Unfortunately RSCS V1.3's DMTVMB driver
> will only allow block sizes as small as 824...  Its NJI driver, which
> goes down to 300, will not talk to another copy of itself.  However we
> have tried a modified version of the VMB driver with a block size of
> 400 and have seen no change in the symptoms.

For certain types of errors, it may be true that CRCs perform better on
smaller blocks, but for your type of error, I don't believe the size of
the block has much (if any) significance.  Given that it seems that your
RF modems produce errors in the same pattern of the CRC16 polynomial, and
the error's placement in the block is not significant, varying the block
size would seem to me to have no (or little) effect.

Also, in regard to this, when you tell RSCS to send a buffer size of 742,
RSCS sends blocks up to the size of the buffer and hardly ever fills the
buffer to perfect capacity.

> None of RSCS's drivers support block sizes this small.  The VMB driver
> does not use ITBs to force the re-calculation of the CRC on smaller
> quantities.

This is true.  VMB always uses a single ITB in each block.

> It is not yet clear to us if the NJI driver uses ITBs to
> improve the performance of the CRC and yet keep the line performance up
> by allowing large blocks to be sent before turning the line around.

NJI doesn't use ITBs.


I hope this information helps somewhat.