NJG@CORNELLA.BITNET (07/21/86)
The following was posted to PROB CRC-FAIL on VMSHARE Friday evening and has been sent to the TRANS-L maillist (sorry for the 2nd and 3rd copies for those who read these other sources, but I feel that this is important enough to justify the added exposure that this list gives the subject).

RSCS delivers bad data when using RF modems:

We are currently using RF modems on our broad-band network to interconnect two VM systems via RSCS v1.3 across a bisync line supported by an Amdahl 4705 and an IBM 3705, both running EP. We have found that this configuration has led to RSCS delivering invalid data (i.e. the file received is NOT the same as the file sent). The problem occurs at rates as high as 1 error in less than 1M transferred. It is related to RF modems producing a class of error that the CRC-16 used by the 37x5 is incapable of catching.

We have a sev 1 problem (7x748, and 8x017 which is just a reference to 7x748 for charge back reasons) open with the support center. This problem was originally opened against RSCS. Level 2 EP has issued APAR IR69384, which says this is a hardware problem. They have also handed responsibility for the problem to the hardware folks. In Cornell's opinion this is not a hardware problem, it is a software problem. I'll be explaining this in more detail in this file and asking that anyone using RF modems try a test to see if they are also subject to the same problem. Please bear with me as I try to fill you in on all the gory details. We feel this is a very serious failure that all sites using RF modems should be highly concerned about.

First I'll describe our hardware and software of interest:

We are running RSCS v1.3 at SLU312. This is a modified system, but tests with vanilla RSCS v1.3 at SLU316 (off the 8508 tape) show the same symptoms. All tests have been with the DMTVMB driver.
I have not brought up RSCS v2 and its NJI driver, nor looked at how it handles its buffers and its use of ITB characters within those buffers to control CRC calculations. The RSCS VMB blocksize has been set at both 4072 (the max) and 824 (the min). I have also successfully used a modified DMTVMB (simple zap) with a blocksize of 400, with the same results. Use of a blocksize of 300 hardstops the 3705.

VM is at the HPO 4.2 level on a 3081 and a 4381-3. The 4381-3 was a 3084 in partitioned mode until 1 month ago. Some 6 weeks ago we were running HPO 3.4 on these systems. None of these changes have affected the problem's symptoms.

We run EP with the Compro code at ship 20 in an IBM 3705 and an Amdahl 4705. Tests of vanilla EP at SLU301 (the latest you can get for VM) show the same problems and symptoms. The lines in question have been on a type 3 scanner on each of these boxes, and we have also externally wrapped the 3705 on a set of lines on its type 2 scanner. None of these changes affect the problem.

We have used both General Instrument's Jerrold Metronet 1000 9.6kb RF modems (hereafter referred to as 9.6kb RF modems) and Ungermann-Bass's NM670 56kb RF modems (hereafter referred to as 56kb RF modems). We also have a set of Codex 56kb RF modems manufactured by Zeta which seem to be subject to the same problems (although I have not examined the failed files to verify the exact same symptoms in detail). The head-end translator is an Ungermann-Bass 3 channel translator sold for use with their NM670 and NM640 modems. We have two which we switch between periodically, running all our RF modem translation traffic thru the one currently in use (the second is for backup purposes only).

Definition of an error:

An error is a file delivered by RSCS which does not match the file sent. In all cases we have examined in detail we have found that the error is a small number of bits changed in a small number of nibbles.
The errors are always 2 or 3 bad nibbles within 4 consecutive bytes. We have seen cases of two such errors occurring in a buffer. In all cases that we have looked at even closer, the total number of incorrect bits is even and 4 or more. We have seen 1 and 2 bit errors in nibbles. We tend to see mostly a total of 4 bits in error, but have identified cases of 6 and 8 bits in error in the RSCS buffer. Some examples of representative errors follow:

  x'50004b60bace19' becomes:
  x'50005360b0ce19'
        11   2         number of bits in error in above nibble

  x'5850500807f5d5' becomes:
  x'5850534806f5d5'
         21  1         number of bits in error in above nibble

  x'999481a37a40c6' becomes:
  x'9994e1a35240c6'
        2   11         number of bits in error in above nibble

  x'FF00FF00FF' becomes:
  x'FF067F02FF'
       21  1           number of bits in error in above nibble

  x'FF00FF00FF' becomes:
  x'FF30FF14FF'
      2   11           number of bits in error in above nibble

Observed undetected error rates:

The bad data arrives at what we consider to be very high rates. When using 56kb RF modems tuned to be as clean as possible, rates as high as 1 error in 15M have been seen. With 9.6kb RF modems which are properly tuned, things are much better, with 30-100M of data being transmitted between errors. It is possible to de-tune the 9.6kb modems to achieve rates of failure as high as 1 per 1M; likewise, if the 56kb RF modems are used without explicitly tuning them, similar rates are observed. When 9.6kb twisted pair modems (de-tuned to generate line errors at higher rates) are tested, I have passed over 200M on them without a single bad file being delivered.

The above rates of failure have been shown to be directly related to RSCS's reported line error rate (i.e. data check count). It would seem from the data check error counts and the bad delivered file counts that we are seeing between 10% and 50% of the line errors passing the CRC16 check that the 3705 is performing (i.e. (10-50%)*(line_error_rate+bad_file_count)=bad_file_count).
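For anyone wanting to classify their own failed files the same way, here is a minimal sketch (Python, not part of the original tests) that counts the differing bits in each nibble of a sent/received pair. The function name `nibble_bit_errors` is made up for illustration, and the input is assumed to be two hex strings of equal length, such as the examples listed above.

```python
def nibble_bit_errors(sent_hex, received_hex):
    """Return a list giving, for each nibble position, how many bits differ."""
    if len(sent_hex) != len(received_hex):
        raise ValueError("both strings must contain the same number of nibbles")
    return [bin(int(s, 16) ^ int(r, 16)).count("1")
            for s, r in zip(sent_hex, received_hex)]


# The five sent/received pairs quoted in the post.
EXAMPLES = [
    ("50004b60bace19", "50005360b0ce19"),
    ("5850500807f5d5", "5850534806f5d5"),
    ("999481a37a40c6", "9994e1a35240c6"),
    ("FF00FF00FF", "FF067F02FF"),
    ("FF00FF00FF", "FF30FF14FF"),
]

if __name__ == "__main__":
    for sent, received in EXAMPLES:
        errors = nibble_bit_errors(sent, received)
        # Print a marker line with a digit under every nibble that changed.
        marks = "".join(str(n) if n else " " for n in errors)
        print("x'%s' becomes:" % sent)
        print("x'%s'" % received)
        print("  %s   total bits in error: %d" % (marks, sum(errors)))
```

Each of the five pairs comes out at 4 bits in error spread over 3 nibbles, which agrees with the hand counts given with the examples.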
How we tested:

We first noticed this problem when modules which worked on one CPU started failing on the other CPU (although we now believe that we have suffered from this problem for several years, it is only recently that we have started to use such a connection for high volume distribution of files between VM systems). These MODULEs were transferred by RSCS across the RF modems.

I wrote a quick and dirty exec to act as a server talking to RSCS. This server punches a data file to RSCS, which transfers it to another RSCS where it is delivered to a duplicate of the original server. The original server then waits for a file to arrive in its reader. When one does, it compares it to the original data file, reports any differences, and sends another copy of the file to the other server. In this manner we can get and keep the link as busy as we want (i.e. have files going in both directions or just have one file active at a time as we choose; the choice doesn't seem to matter).

I have been running my tests as of late on a single CPU. Thus I have both servers and both RSCS machines on the same VM. The two RSCS userids communicate via two lines in the same 3705. These lines go out to RF modems (either 9.6kb or 56kb) which are hooked up to our campus wide broad-band network. This test behaves the same as the use of two real CPUs and the 3705 and 4705 described earlier in this memo.

We have watched the data leaving CP via the CCWTRACE facility as well as watching it arrive back in from the 3705. We can see from these traces that the data leaves in good condition and arrives in bad condition, and that there is no data check condition. We have thus eliminated CP and RSCS as the source of the bad data.

We have also run an EP trace on a vanilla EP at SLU 301 over 9.6kb RF modems hooked to type 2 scanners. This trace shows the data being handed to the scanner correctly and then coming back in from the scanner on the other line with the data incorrect.
It also shows EP deciding that the CRC16 does match. We could not run EP trace with the 56kb lines on type 3 scanners as this hung the 3705. We do not know what caused this problem. These EP traces show EP behaving correctly.

Due to the failures on both the IBM 3705 and the Amdahl 4705, and due to the correct behaviour of twisted pair, we do not believe that we have a case of the hardware failing to calculate the CRC value correctly.

I have tried random data and data with a strong pattern to it (alternating x'FF' and x'00'). We can observe no effect of the data on the problem.

Why does it pass the CRC16:

From Tanenbaum's 'Computer Networks' book: "A 16-bit checksum, such as CRC-16 or CRC-CCITT, catches all single and double errors, all errors with an odd number of bits, all burst errors of length 16 or less, 99.997% of 17-bit error bursts, and 99.998% of 18-bit and longer bursts". By implication, errors with an even number of bits, 4 or more, are not always (or even very well?) caught. IBM hardware people seem to feel that this is a reasonable statement.

Tanenbaum also writes earlier "As a result of the physical processes causing the noise, errors tend to come in bursts rather than singly." However, at the time of the printing of the edition I am quoting from, 1981, RF modems were not commonly available in the marketplace. It would appear from our experience that single and double bit errors in a nibble are now a common mode of failure with RF modems. At this point we have had no luck getting a statement from someone in the know (modem designers, etc) as to whether this is truly the case. If it is, then it seems that the designers of RF modems have made a mistake. Since they have to live in a world where 16-bit CRC checks are a standard, they should not produce a design that tends to produce errors that will pass the 16-bit CRC tests.

There have been statements made that the CRC-16 will tend to perform better when small blocks are checked.
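To make concrete why a 16-bit CRC can miss an even-weight error of 4 bits, here is a small illustrative sketch (Python). It is not a model of the 37x5 scanner: it assumes the CRC-16 generator x^16 + x^15 + x^2 + 1 (0x8005) processed most-significant-bit first with a zero initial value, which is a simplification of the actual line bit order, and the helper names `crc16` and `flip_pattern` are invented here. The principle it shows is general: any error pattern that is a multiple of the generator polynomial leaves the CRC unchanged, and the generator itself is a 4-bit pattern spanning 17 bits. It does not attempt to reproduce the specific error patterns in our files.

```python
GENERATOR = 0x18005  # x^16 + x^15 + x^2 + 1, written out as a 17-bit pattern


def crc16(data, poly=0x8005):
    """Bitwise CRC-16, MSB first, initial value 0 (assumed parameters)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def flip_pattern(data, pattern, shift):
    """Return data with the bits of `pattern` flipped, `shift` bits from the end."""
    value = int.from_bytes(data, "big") ^ (pattern << shift)
    return value.to_bytes(len(data), "big")


if __name__ == "__main__":
    block = bytes(range(32))

    # Four flipped bits arranged like the generator: the CRC does not change.
    missed = flip_pattern(block, GENERATOR, 45)
    print("4-bit error detected:", crc16(missed) != crc16(block))

    # Two adjacent flipped bits: always caught, as Tanenbaum says.
    caught = flip_pattern(block, 0b11, 45)
    print("2-bit error detected:", crc16(caught) != crc16(block))
```

The first case is one of the 17-bit bursts that fall outside the guarantees in the Tanenbaum quote; the second falls inside them.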
We have been told that 742 is one of the 'magic' numbers. Unfortunately RSCS v1.3's DMTVMB driver will only allow block sizes as small as 824... Its NJI driver, which goes down to 300, will not talk to another copy of itself. However, we have tried a modified version of the VMB driver with a block size of 400 and have seen no change in the symptoms. Attempts to use this same driver at 300 hardstop the 3705. We are not currently running (nor do we have any desire to run) v2 of RSCS, so we have not been in a position to try its NJI driver at 300 or 400 byte blocks. Even if we could run such a test, we found from the VMB test that at 400 byte blocks we got effective thruput rates of only 20kb on a 56kb line.

We have also been told that 256 is another 'magic number' for CRCs. None of RSCS's drivers support block sizes this small. The VMB driver does not use ITBs to force the re-calculation of the CRC on smaller quantities. It is not yet clear to us whether the NJI driver uses ITBs to improve the performance of the CRC while keeping the line performance up by allowing large blocks to be sent before turning the line around. This issue is under investigation by IBM.

It is not clear to us that reducing the amount of data to 'reasonable' numbers such as 256 will help this problem. The statements by Tanenbaum and our own experiences seem to imply that this is a problem inherent in the use of 16-bit CRCs.

What IBM has said:

The support center problem mentioned above has had a long and varied history so far. It started out against RSCS, then was passed to EP and hardware. From there, once we found that the CRC was correctly not catching the error, it was passed back to RSCS. RSCS has claimed that they depend upon EP and the 3705 to 'provide reliable communications' and have refused to address the problem any further 'as that would involve a design change'. They sent the problem back to EP, who sent it to hardware and issued the APAR, which is closed 'hardware'.
The problem is STILL open as a sev 1 problem, now against hardware.

The IBM support staff associated with our account seem to agree with our stance that this is a problem which IBM needs to address. They have been, and continue to be, actively working with us to reach an acceptable resolution of the situation. It is clear that this issue has raised many red flags within IBM and has been the topic of many conversations there. It is not a problem that can be dismissed lightly, as it has such far reaching ramifications across their (and many other vendors') product lines.

It is clear to us that IBM needs to address this problem. It is our feeling that RSCS needs to add its own end-to-end verification of data validity, not simply rely upon what we have shown to be, in today's world, an unreliable CRC check performed by the front-end.

What else is affected by this problem?

We suspect that lots of other packages also depend upon EP and the 3705 to provide reliable communications and thus are also affected by this problem. We have not yet verified it, but we believe that Passthru will have the same problems. We know that VM's remote 3270 support does not attempt to send small blocks to the 3274 and thus will be subject to the problem. However, at least remote 3274s send things as 256 byte blocks for the 3705 to validity check. While this smaller block size improves the performance of the CRC-16, we are suspicious that it does not solve the problems related to this class of errors.

We suspect that packages such as CICS, VTAM, etc, etc, etc are all also subject to the same problems of using large blocks and relying upon the 3705 for all validity checking. We have not checked into this question relative to any of these other packages.

What you can do to help us:

One of the first things I think every site with RF modems should do is to run a similar test to see if they are also subject to these problems. I don't think that it matters what kind of software you are running.
The interesting thing will be to see how common it is to have RF modems generate errors that are not caught by the error checking mechanisms, be they CRC-16 or otherwise. Based upon the error rates for our cleaner RF modem connections, it would appear that you need to run several hundred Meg of data thru the connection before you can be relatively sure of not being subject to the problem (assuming our rates are at all representative, which they may not be).

The second thing you can do is express your concerns to IBM that its RSCS (and other high level) protocols do not have their own end-to-end verification of data validity. It appears from our experience that EP and the 3705's CRC-16 checks can not be relied upon in today's environment, which includes RF modems on broad-band networks.

-Nick Gimbrone, Cornell University, NJG@CORNELLA.BITNET, (607)255-3747
W8SDZ@SIMTEL20.ARPA (Keith Petersen) (07/23/86)
Have you checked for overrun of the front end by the sending machine? This is a real problem with the new "error-free" MNP modems. They give one a false sense of security. Also it has been pointed out that the "error-free" modems don't deal with errors associated with overrun of the receiving machine. --Keith