baskett@decwrl.UUCP (Forest Baskett) (05/03/84)
[History as vaguely remembered by a participant.]

Los Alamos had serial one of the Cray-1 in 1976 for a six month evaluation period. During those six months, the machine was exercised in a variety of ways to test performance and reliability. One of the interesting things the testing uncovered was that, at 7000 feet above sea level, the mean time between memory parity errors when running memory tests over the whole eight megabytes was about 30 minutes! This number was discovered early in the six month period and continued to be confirmed as the months wore on, despite the attempts of the on site engineers to find something marginal in the memory system. This discovery seemed to lead to the decision of Cray Research Inc. to change the machine from parity main memory to error corrected main memory.

Serial two was already built but never delivered. Serial three was about three inches taller and a memory reference took 11 cycles instead of the original 10. When it was delivered to Los Alamos we discovered that the MTBF of the memory system tests was now about 6 hours; that is, it took about 6 hours to get a double bit, uncorrectable memory error. Ordinary programs would run the machine for more than 24 hours between double bit errors. This wasn't wonderful but it was considered acceptable because the machine was so fast.

The on site and off site engineers continued to look for something marginal in the memory system, and I wrote the first program to take the log of single bit error correction syndromes, translate them into (board number, chip number) pairs, and then sort the resulting list by frequency of correction. What that revealed was that some chips would occasionally go through a period of apparently marginal operation where they generated a lot of one bit errors, and then they would apparently return to normal.
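For readers who want a picture of that log-reduction step, here is a minimal sketch in Python. It is not the original program, and the real Cray-1 syndrome-to-chip mapping is not given here. The log format (address, syndrome), the function names, and the `toy_locate` mapping (16 boards picked by the low address bits, syndrome value used directly as the chip number) are all assumptions invented for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Location = Tuple[int, int]  # (board number, chip number)


def toy_locate(address: int, syndrome: int) -> Location:
    """Hypothetical mapping, invented for this sketch only.

    Assumes 16 boards selected by the low 4 address bits, and that the
    syndrome value directly names the chip holding the corrected bit.
    The rest of the address (the location within the chip) is dropped.
    """
    board = address % 16
    chip = syndrome
    return (board, chip)


def rank_chips(
    log: Iterable[Tuple[int, int]],
    locate: Callable[[int, int], Location] = toy_locate,
) -> List[Tuple[Location, int]]:
    """Count single bit corrections per (board, chip), most frequent first."""
    counts = Counter(locate(address, syndrome) for address, syndrome in log)
    # Sort by descending count; break ties by location for a stable report.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Made-up log entries: (address, syndrome)
    sample_log = [(0x10, 5), (0x20, 5), (0x11, 3), (0x30, 5)]
    for (board, chip), n in rank_chips(sample_log):
        print(f"board {board:2d} chip {chip:2d}: {n} corrections")
```

The chips at the top of such a list would be the candidates for replacement. Counting per chip rather than per address is a deliberate choice in this sketch: corrections at different addresses on the same chip all add to one total.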
The on site engineers started taking this translated and sorted log and using it as a guide for replacing "weak" memory chips during the weekly preventive maintenance period. The MTBF started going up. By 1978, as I vaguely recall, it was on the order of one or two weeks. We never did figure out what was wrong with the "weak" memory chips, but the key to the success of this program did seem to be the assumption that the whole memory chip was weak and that the location on the chip that failed was just noise information (it was).

Serial one was sold by Cray Research to the British government and I can easily believe it was made reliable by careful attention to the memory system.

(I mentioned the altitude at Los Alamos because of the cosmic ray theory :-)

Forest Baskett - Western Research Laboratory - Digital Equipment Corporation