[comp.sys.dec] Help! Nonrepeatable bugs on DS3100

nr@notecnirp.Princeton.EDU (Norman Ramsey) (07/28/89)

I am writing a code generator for a DECstation 3100 (MIPS R2000
architecture) running

Ultrix Worksystem V2.0 (rev. 7) System #1: Fri Mar  3 19:46:51 EST 1989

(I am told this is equivalent to Ultrix 3.0.)  Instead of generating
assembly code, I generate machine code directly into a file.  At run
time, this code gets loaded into the data space (the heap) and then
gets branched to.

My programs are failing in nonrepeatable ways.  The nonrepeatability
has made things damned difficult to debug.  I did find one unusual
behavior that I want to ask the net about.  

I have a piece of straight-line code (40 instructions) that consists
entirely of adds and stores.  The stores are all fullword stores (4
bytes), and they store into 14 consecutive locations. The order of the
locations stored is 2 1 5 4 3 8 7 6 11 10 9 14 13 12, so the 2nd word
is stored first, then the 1st, then the 5th, etc.  These stores are
onto the heap, in new memory that has never been stored into before (so
everything is 0 until stored).  On one run not all the stores went to
the right places; post-mortem analysis showed that stores number 1
through 5 (counting in execution order) went to the right locations.
Stores 6 and 7 landed 12 bytes lower than they should have (i.e. in
locations 5 and 4 instead of locations 8 and 7).  Store 8 went to the
correct location.  Stores 9 and 10 went to unknown locations (perhaps
overwritten by a later store).  Store 11 was correctly placed.  Store
12 went to an unknown location.  Store 13 was 24 bytes low.  Store 14
was correctly placed.

I observed this pattern in memory after some small number of
instructions (<100?) had executed.  C code fetched a pointer from
location 14 (which should have been stored by store number 12).  Since
that location had never been stored into, its contents were zero, and
I got a segmentation fault when the C code tried to dereference the
zero.

I isolated the offending code fragment and ran it several hundred
thousand times in an effort to make it, and it alone, fail.  It works
fine by itself both in the text segment and in the data segment.

I should add that I'm not doing anything fancy with delay slots; all
the delay slots are filled with nops (add $0,$0,$0).


Question:   Can anyone out in net-land envision a failure mode (hardware
	    or software) that would either lead to the results I describe
	    or cause successive runs of the same program to behave very
	    differently?

I would be happy to hear from anyone via email (nr@princeton.edu, 
...!allegra!princeton!nr) or by phone (609/452-5135).

Please note I am sending followups to comp.sys.dec.
Norman Ramsey
nr@princeton.edu