[comp.arch] True Horror Stories

jamesa%betelgeuse@Sun.COM (James D. Allen) (05/06/88)

	Recently virtual-address aliasing, cache consistency, and "volatile"
variables have been discussed.  Here are three "real world" 10-year-old
examples of bugs associated with these three ideas.
	There is nothing of theoretical interest in what follows and I will
surely be flamed for irrelevancy.  All I can offer in defense is that "real"
bugs are more interesting than textbook examples.  Hopefully some comp.arch
readers will find these anecdotes entertaining.

Bug 1) Virtual Address Aliasing

	On the IBM 370 if `SRC_A' and `DEST_A' are distinct virtual
aliases for the same double-word storage object, the instruction
		MVC	DEST_A,SRC_A
should just be a sort of no-op, possibly setting the reference and change bits.
	On models 135 or 138, however, if there is a single-bit (correctable)
error in the low-order 4 bytes of the storage object the bit-in-error will be
*inverted* by the MVC instruction!  (The bit was already inverted of course,
but the MVC will alter the check bits to make the inversion appear "correct".)
	This bug was interesting but harmless since no IBM Operating System
permitted virtual-address aliasing within a single context.  The bug results
from the following:
	a) ECC checking is done on 8 bytes, but the 135 has only a 4-byte
		path to/from storage.
	b) 370 principles require that `MVC' on different doublewords
		"validate" an uncorrectable error in the destination.
	c) 135 microcode checks for source/destination overlap, but only
		virtual addresses are checked.  The aliasing goes unnoticed.
	d) 135 implements the "validation" by disabling ECC.  DBE detection
		must be suppressed; SBE correction is a "don't care" and
		happens to be suppressed also.
	e) There would be no problem if the firmware did:
			Fetch Hi; Fetch Lo; Store Hi (disable ECC); Store Lo.
		Instead the sequence is:
			Fetch Hi; Store Hi (disable ECC); Fetch Lo; Store Lo.
	f) The inferior sequence is used because of the paucity of high-speed
		work registers available to 135 microcode.  There is barely
		room for 4 bytes of data let alone 8.
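
	The difference between the two sequences in (e) can be sketched with a
toy model.  Everything in it is an assumption made for illustration, not 135
microcode: the names (`struct dword', `mvc_alias', and so on) are hypothetical,
the check bits are a simplified position-XOR code rather than IBM's ECC, and a
4-byte store is assumed to be a read-modify-write of the whole 8-byte word
that regenerates the check bits.

```c
#include <stdint.h>

/* Toy model of one 8-byte ECC word reached through a 4-byte data path. */
struct dword {
    uint64_t raw;    /* data bits as physically stored */
    unsigned check;  /* toy check bits: XOR of (bit index + 1) over all 1 bits */
};

static unsigned make_check(uint64_t v)
{
    unsigned c = 0;
    for (unsigned i = 0; i < 64; i++)
        if ((v >> i) & 1)
            c ^= i + 1;
    return c;
}

/* Build a word holding `good' with one data bit flipped (a correctable SBE). */
static struct dword with_soft_error(uint64_t good, unsigned bit)
{
    struct dword m = { good ^ (1ULL << bit), make_check(good) };
    return m;
}

/* Read the full word.  With ECC on, a single flipped data bit is corrected. */
static uint64_t read_dword(const struct dword *m, int ecc_on)
{
    unsigned syndrome = make_check(m->raw) ^ m->check;
    if (ecc_on && syndrome >= 1 && syndrome <= 64)
        return m->raw ^ (1ULL << (syndrome - 1));
    return m->raw;
}

static uint32_t fetch_half(const struct dword *m, int hi)
{
    uint64_t v = read_dword(m, 1);              /* fetches always use ECC */
    return (uint32_t)(hi ? v >> 32 : v);
}

/* Assumed: a 4-byte store rewrites the word and regenerates the check bits.
   With ECC off, an existing bad bit is folded into the new check bits. */
static void store_half(struct dword *m, int hi, uint32_t half, int ecc_on)
{
    uint64_t v = read_dword(m, ecc_on);
    if (hi)
        v = (v & 0xFFFFFFFFULL) | ((uint64_t)half << 32);
    else
        v = (v & ~0xFFFFFFFFULL) | half;
    m->raw = v;
    m->check = make_check(v);
}

/* MVC where source and destination are the same doubleword.
   interleaved != 0 models the 135 order; 0 models the safe order. */
static uint64_t mvc_alias(struct dword m, int interleaved)
{
    uint32_t hi, lo;
    if (interleaved) {
        hi = fetch_half(&m, 1);
        store_half(&m, 1, hi, 0);   /* ECC disabled: bad bit gets "validated" */
        lo = fetch_half(&m, 0);     /* too late, the bad bit now looks correct */
        store_half(&m, 0, lo, 1);
    } else {
        hi = fetch_half(&m, 1);
        lo = fetch_half(&m, 0);     /* corrected copy taken before any store */
        store_half(&m, 1, hi, 0);
        store_half(&m, 0, lo, 1);   /* corrected low half overwrites the bad bit */
    }
    return read_dword(&m, 1);       /* what later code would see */
}
```

	In this model, calling `mvc_alias(with_soft_error(x, 3), 0)' returns x,
while passing 1 as the second argument returns x with bit 3 flipped.  Whether
ECC is also disabled for the Store Lo does not change either outcome here.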

Bug 2) Cache Consistency

	The second bug was more subtle and more significant since real
customers were affected and several hundred thousand dollars were ultimately
spent by several manufacturers.
	The 370/158AP has two cpus, each with a cache, and a common storage.
Writes are store-through with a cycle "stolen" in the remote cpu to invalidate
its cache.  Cache operation is almost completely transparent to software
and microcode and absolute consistency is required.
	No problems were noticed during the first years but in late 1977 IBM
delivered new OS's (MVS/SE and VM370/SE).  One of the speed enhancements
was to replace hardware locking (e.g., the compare-and-swap instruction) with
code that relied on existing kernel single-threadedness constraints.  Several
machines developed strange symptoms: jobsteps went into coma, or one of the
cpus would enter a tight loop.
	Different machines failed differently although a given machine usually
failed the same way repeatedly.  IBM supplied custom kernel patches at each
site to "zero in" on the problem(s).  Eventually two common symptoms were
found:
	a) cache inconsistencies were noted, always on the "AP side".
	b) each affected machine had Brand N add-on memory installed.

	Brand N Memories got a large bill from IBM (never paid  :-) ) and
took over the problem.  What was happening was this:
	a) Cpu A initiates a Write, updates its cache, steals a cycle in
		cpu B to invalidate its cache entry, if any.
	b) Cpu A attempts to update Main Memory but is held off; the add-on
		memory is doing a refresh.
	c) Cpu B initiates a Read at the same address (triple coincidence),
		sees that cache is invalid and therefore also selects
		Main Memory.
	d) The refresh completes and the cpu B read request is serviced
		before cpu A's write (AP side has priority, not first-come).
	e) The obsolete data can be held in B's cache indefinitely since
		the invalidation signal has come and gone.
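
	Steps (a) through (e) can be replayed as a small deterministic model.
This is only a sketch under stated assumptions: the names (`struct line',
`b_cache_is_stale') are hypothetical, one address stands in for all of
storage, and the refresh is reduced to "both requests are pending when it
completes".

```c
/* One cache entry for the single address being modeled. */
struct line { int valid; int data; };

/* Replay steps (a)-(e).  Returns 1 if cpu B ends up caching stale data.
   ap_side_has_priority selects who is serviced first after the refresh. */
static int b_cache_is_stale(int ap_side_has_priority)
{
    enum { OLD = 1, NEW = 2 };
    int memory = OLD;
    struct line cache_a = { 1, OLD };
    struct line cache_b = { 1, OLD };

    /* (a) A writes: updates its own cache and invalidates B's entry. */
    cache_a.data = NEW;
    cache_b.valid = 0;

    /* (b) A's store-through is held off by the memory refresh. */
    int a_write_pending = 1;

    /* (c) B reads the same address, misses, and also goes to memory. */
    int b_read_pending = !cache_b.valid;

    /* (d) Refresh completes.  With AP-side priority B's read goes first. */
    if (ap_side_has_priority && b_read_pending) {
        cache_b.data = memory;      /* still the old value */
        cache_b.valid = 1;
        b_read_pending = 0;
    }
    if (a_write_pending)
        memory = cache_a.data;      /* A's write finally lands */
    if (b_read_pending) {           /* first-come order: B reads after A */
        cache_b.data = memory;
        cache_b.valid = 1;
    }

    /* (e) No further invalidation arrives, so a stale entry stays valid. */
    return cache_b.valid && cache_b.data != memory;
}
```

	Here `b_cache_is_stale(1)' returns 1 and `b_cache_is_stale(0)' returns
0: in the model the stale entry appears only when the read is serviced ahead
of the older write.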

	Other dynamic memory add-on manufacturers had the same problem, but
it was masked by schemes that usually "hid" the refresh.  Instead of taking
a few hits per day, they took a few hits per month or year.  Several years
later they too became aware of the bug.

Bug 3) Volatile Variable

	This bug arose in the aftermath of Bug 2.  Brand N Memories supplied
a hardware fix for its machines; all but one were cured.  A site in Columbus,
Ohio insisted on running with its modified "bug detecting" kernel which
continued to produce bug reports.  The kernel modification had the following
form.  "memoryfetch()" is really a CS instruction (compare_and_swap) which
happens to bypass the cache on the model 158:

		memorydata = memoryfetch(&shared_variable);
		cache_data = cache_fetch(&shared_variable);
		if (cache_data != memorydata)	/* possible race */
			memorydata = memoryfetch(&shared_variable);
		if (cache_data != memorydata)	/* still bad ? */
			panic("cache inconsistency");

	Do you see the bug in this bug-detector?  There is a significant
code path between updates of `shared_variable' in the dispatcher running
on the other cpu, and interrupts are disabled on this cpu.  But even with
interrupts disabled, the 158 I/O firmware can break in; hence various timings
are possible.
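
	One reading of the flaw is that the three fetches are not atomic with
respect to the other cpu, so a perfectly consistent cache can still trip the
panic if `shared_variable' is updated twice at unlucky moments.  The sketch
below models that reading only; `detector_panics' and its arguments are
hypothetical, and both fetches simply read the same variable (i.e. the cache
is assumed to be coherent).

```c
/* Returns 1 if the detector would panic even though cache and memory always
   agree.  Each argument is nonzero if the other cpu updates shared_variable
   just before the corresponding fetch. */
static int detector_panics(int update_before_fetch1,
                           int update_before_fetch2,
                           int update_before_fetch3)
{
    int shared_variable = 0;
    int memorydata, cache_data;

    if (update_before_fetch1) shared_variable++;
    memorydata = shared_variable;           /* memoryfetch() */

    if (update_before_fetch2) shared_variable++;
    cache_data = shared_variable;           /* cache_fetch(), coherent */

    if (cache_data != memorydata) {         /* possible race */
        if (update_before_fetch3) shared_variable++;
        memorydata = shared_variable;       /* second memoryfetch() */
    }
    return cache_data != memorydata;        /* still bad? then panic */
}
```

	A single update between the first two fetches is absorbed by the retry
(`detector_panics(0, 1, 0)' returns 0), but a second update before the retry
produces a false alarm (`detector_panics(0, 1, 1)' returns 1).
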
	Columbus refused to remove this kernel patch even after the flaw was
pointed out.  Eventually Brand N Memories lost the account; the kernel was
restored only after the "re-virginized" 158 continued to "fail."
	The Columbus bug-detector bug taught me a healthy respect for Murphy's
Law as it applies to multi-processor races.

	- James Allen