SYSRUTH@utorphys.bitnet (Ruth Milner) (01/12/89)
OK, I've got a real poser for all the SunOS gurus out there. A couple of months back we had a very bizarre episode with our C compiler on a Sun 3/180 running SunOS 3.4. One afternoon, the code it compiled started producing the wrong answers. One of our users noticed this, and wrote a nice simple program to test it with, which went something like this (all variables are float; initial values may have been different):

    a = 3.0;
    b = 4.0;
    if (a < b) then
        printf("%f",a);
    if (b < a) then
        printf("%f",b);

(Ignore any syntax errors here; I'm not a proficient C programmer.)

The result should have been (e.g.) 3.0 printed. I don't recall now exactly what the output numbers were, but somehow both printf statements were executed, and both numbers were wrong.

I had the compiler produce the 68020 assembly code and looked at it. Whereas a correctly-behaving compiler (and I checked one) simply loaded one number into a floating-point register and compared the other against it, the bad code was dividing one number by the other and *storing* the result back into the first. It then did the same again for the second "if" statement, so the second comparison also came out true, since the numbers were by then different. It did this for *all* floating-point code, whether compiled for -fsoft, the 68881, or the FPA.

I checked all our compiler-related programs; none had been touched in months, and they were identical to those on a working system. The sticky bit is *not* set on any of them; we have a suspected bad block in swap which causes occasional core dumps in things like vi, so I un-stickied everything I could find quite a while ago (as an aside: is there any way to make diag format just a specified group of cylinders instead of the whole disk?).

Late in the afternoon the compiler suddenly started behaving correctly again. It has never boo-booed since (that we know of :-) ). At the time there was lots of free memory (out of a total of 16MB) and swap (ditto - a hangover from when we had only 8MB of memory), and few other users, none of whom were compiling.
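For anyone wanting to repeat the check on their own machine, the test program described above might be sketched in valid C roughly as follows. This is a reconstruction, not the user's actual program: the function name `check_compare`, the flag constants, and the operand check at the end are assumptions added for illustration, and 3.0 and 4.0 are only the example values given in the description. C has no `then` keyword, so it is dropped here.

```c
#include <stdio.h>

/* Hypothetical result flags, not part of the original test program. */
enum {
    TOOK_A_LT_B = 1,     /* first printf ran */
    TOOK_B_LT_A = 2,     /* second printf ran */
    OPERAND_CHANGED = 4  /* a or b no longer holds its input value */
};

/* Runs the two comparisons from the post and reports what happened. */
int check_compare(float a_in, float b_in)
{
    /* volatile keeps the compiler from folding the comparisons away */
    volatile float a = a_in;
    volatile float b = b_in;
    int flags = 0;

    if (a < b) {
        printf("%f\n", a);
        flags |= TOOK_A_LT_B;
    }
    if (b < a) {
        printf("%f\n", b);
        flags |= TOOK_B_LT_A;
    }
    /* A comparison must never write to its operands. */
    if (a != a_in || b != b_in)
        flags |= OPERAND_CHANGED;

    return flags;
}
```

With a correctly behaving compiler, `check_compare(3.0f, 4.0f)` prints 3.000000 and returns `TOOK_A_LT_B` only. The misbehaviour described above, where both printf statements ran, would show up as both branch flags being set.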
Over the Christmas holidays, the user who first noticed this happening was talking to a friend who "lives and breathes for weird Sun problems", and this fellow told him that sometimes Suns will keep a copy of a heavily-used program in RAM and keep using that copy rather than reloading the one on disk. If this copy were corrupted, it could have caused the behaviour we saw, and would explain why it appeared and disappeared the way it did (though I have a couple of objections to this: 1. the program was not really heavily used when this started happening, and 2. wouldn't corruption be much likelier to result in an unrunnable copy, or else something that produced unrunnable code?).

The question: do Suns do this when the sticky bit is *not* set on a program? If so, would this copy actually stay in RAM rather than swap? And why, when it was getting heavier use while we were testing it, would it have gotten rid of it? I know there is much more sharing of code in 4.0, but this machine is still at 3.4.

Any educated guesses about this would be welcome. Please reply directly to me, and if I learn something people should generally be aware of, I'll pass it on to the list.

Thanks (yet again), everyone.

Ruth Milner
Systems Manager
University of Toronto Physics
sysruth@helios.physics.utoronto.ca