jamesa%betelgeuse@Sun.COM (James D. Allen) (05/06/88)
Recently virtual-address aliasing, cache consistency, and "volatile" variables have been discussed. Here are three "real world" 10-year-old examples of bugs associated with these three ideas.

There is nothing of theoretical interest in what follows and I will surely be flamed for irrelevancy. All I can offer in defense is that "real" bugs are more interesting than textbook examples. Hopefully some comp.arch readers will find these anecdotes entertaining.

Bug 1) Virtual Address Aliasing

On the IBM 370 if `SRC_A' and `DEST_A' are distinct virtual aliases for the same double-word storage object, the instruction

        MVC     DEST_A,SRC_A

should just be a sort of no-op, possibly setting the reference and change bits. On models 135 or 138, however, if there is a single-bit (correctable) error in the low-order 4 bytes of the storage object the bit-in-error will be *inverted* by the MVC instruction! (The bit was already inverted of course, but the MVC will alter the check bits to make the inversion appear "correct".)

This bug was interesting but harmless since no IBM Operating System permitted virtual-address aliasing within a single context.

The bug results from the following:

a) ECC checking is done on 8 bytes, but the 135 has only a 4-byte path to/from storage.
b) 370 principles require that `MVC' on different doublewords "validate" an uncorrectable error in the destination.
c) 135 microcode checks for source/destination overlap, but only virtual addresses are checked. The aliasing goes unnoticed.
d) 135 implements the "validation" by disabling ECC. DBE detection must be suppressed; SBE correction is a "don't care" and happens to be suppressed also.
e) There would be no problem if the firmware did:
        Fetch Hi; Fetch Lo; Store Hi (disable ECC); Store Lo.
   Instead the sequence is:
        Fetch Hi; Store Hi (disable ECC); Fetch Lo; Store Lo.
f) The inferior sequence is used because of the paucity of high-speed work registers available to 135 microcode.
There is barely room for 4 bytes of data let alone 8.

Bug 2) Cache Consistency

The second bug was more subtle and more significant since real customers were affected and several $100,000's were ultimately spent by several manufacturers.

The 370/158AP has two cpu's, each with a cache, and a common storage. Writes are store-through with a cycle "stolen" in the remote cpu to invalidate its cache. Cache operation is almost completely transparent to software and microcode and absolute consistency is required.

No problems were noticed during the first years but in late 1977 IBM delivered new OS's (MVS/SE and VM370/SE). One of the speed enhancements was to replace hardware locking (eg, compare-and-swap instruction) with code that relied on existing kernel single-threadedness constraints.

Several machines developed strange symptoms: jobsteps went into coma, or one of the cpus would enter a tight loop. Different machines failed differently although a given machine usually failed the same way repeatedly. IBM supplied custom kernel patches at each site to "zero in" on the problem(s). Eventually two common symptoms were found:

a) cache inconsistencies were noted, always on the "AP side".
b) each affected machine had Brand N add-on memory installed.

Brand N Memories got a large bill from IBM (never paid :-) ) and took over the problem. What was happening was this:

a) Cpu A initiates a Write, updates its cache, steals a cycle in cpu B to invalidate its cache entry, if any.
b) Cpu A attempts to update Main Memory but is held off; the add-on memory is doing a refresh.
c) Cpu B initiates a Read at the same address (triple coincidence), sees that cache is invalid and therefore also selects Main Memory.
d) The refresh completes and the cpu B read request is serviced before cpu A's write (AP side has priority, not first-come).
e) The obsolete data can be held in B's cache indefinitely since the invalidation signal has come and gone.

Other dynamic memory add-on manufacturers had the same problem, but it was masked by schemes that usually "hid" the refresh. Instead of taking a few hits per day, they took a few hits per month or year. Several years later they too became aware of the bug.

Bug 3) Volatile Variable

This bug arose in the aftermath of Bug 2. Brand N Memories supplied a hardware fix for its machines; all but one were cured. A site in Columbus Ohio insisted on running with its modified "bug detecting" kernel which continued to produce bug reports.

The kernel modification had the following form. "memoryfetch()" is really a CS instruction (compare_and_swap) which happens to bypass the cache on the model 158:

        memorydata = memoryfetch(&shared_variable);
        cache_data = cache_fetch(&shared_variable);
        if (cache_data != memorydata)   /* possible race */
                memorydata = memoryfetch(&shared_variable);
        if (cache_data != memorydata)   /* still bad ? */
                panic("cache inconsistency");

Do you see the bug in this bug-detector? There is significant code path between updates of `shared_variable' in the dispatcher running on the other cpu and interrupts are disabled on this cpu. But even with interrupts disabled, the 158 IO firmware can break in; hence various timings are possible.

Columbus refused to remove this kernel patch even after the flaw was pointed out. Eventually Brand N Memories lost the account; the kernel was restored only after the "re-virginized" 158 continued to "fail."

The Columbus bug-detector bug taught me a healthy respect for Murphy's Law as it applies to multi-processor races.

- James Allen