[comp.sys.sun] shared executables without sticky bit?

SYSRUTH@utorphys.bitnet (Ruth Milner) (01/12/89)

OK, I've got a real poser for all the SunOS gurus out there.

A couple of months back we had a very bizarre episode with our C compiler
on a Sun 3/180 running SunOS 3.4. One afternoon, the code it compiled
started producing the wrong answers. One of our users noticed this, and
wrote a nice simple program to test it with, which went something like
this (all variables are float, and the initial values may have been
different; I'm not a proficient C programmer, so this is a paraphrase):

    float a, b;

    a = 3.0;
    b = 4.0;
    if (a < b)
        printf("%f\n", a);
    if (b < a)
        printf("%f\n", b);

The result should have been just one number (e.g. 3.0) printed. I don't
recall now exactly
what the output numbers were, but somehow both printf statements were
executed, and both numbers were wrong. I had the compiler produce the
68020 assembly code and looked at it. Whereas a correctly-behaving
compiler (and I checked one) simply loaded one number into a
floating-point register and compared the other against it, the bad code
was dividing one number by the other and *storing* the result back into
the first. It then repeated this for the second "if" statement, winding up
with another true condition since the numbers were different.
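
The symptom described above (both branches taken, and the operands altered
by the comparison itself) can be checked mechanically. What follows is
only a minimal sketch in present-day C, not the user's original program;
the helper name check_float_compare and its return codes are made up for
illustration, and it assumes the inputs are ordinary (non-NaN) numbers.

```c
#include <stdio.h>

/*
 * Returns 0 if comparing a_init and b_init behaves sanely, otherwise a
 * bitmask: 1 = wrong number of branches taken, 2 = an operand changed.
 * Hypothetical helper; assumes neither input is NaN.
 */
int check_float_compare(float a_init, float b_init)
{
    /* volatile stops the compiler from folding the comparisons away */
    volatile float a = a_init;
    volatile float b = b_init;
    int expected = (a_init != b_init) ? 1 : 0;
    int taken = 0;
    int problems = 0;

    if (a < b)
        taken++;
    if (b < a)
        taken++;

    /* Exactly one branch fires for distinct values, none for equal ones. */
    if (taken != expected)
        problems |= 1;
    /* A comparison must never modify its operands. */
    if (a != a_init || b != b_init)
        problems |= 2;

    if (problems)
        printf("bad compare: taken=%d a=%f b=%f\n", taken, a, b);
    return problems;
}
```

Calling check_float_compare(3.0f, 4.0f) from main() and testing for a
nonzero return would flag both parts of the misbehaviour we saw, without
anyone having to read the assembly output.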

It did this for *all* floating-point code: -fsoft, the 68881, and the FPA.

I checked all our compiler-related programs; none had been touched in
months, and they were identical to those on a machine with a working
compiler. The sticky bit
is *not* set on any of them; we have a suspected bad block in swap which
causes occasional core dumps in things like vi, so I un-stickied
everything I could find quite a while ago (as an aside: is there any way
to make diag format just a specified group of cylinders instead of the
whole disk?).
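
For anyone wanting to verify the sticky bit on a set of programs without
eyeballing "ls -l" output, here is a minimal sketch using stat(). It
assumes a POSIX-style <sys/stat.h>; the helper name has_sticky_bit is
invented for illustration, and the fallback value 01000 is the traditional
octal value of the bit.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Traditional value of the sticky bit, in case the header hides it. */
#ifndef S_ISVTX
#define S_ISVTX 01000
#endif

/*
 * Returns 1 if the sticky bit is set on path, 0 if it is not,
 * and -1 if the file cannot be examined.
 * (has_sticky_bit is a hypothetical helper name.)
 */
int has_sticky_bit(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        perror(path);   /* report why the file could not be examined */
        return -1;
    }
    return (st.st_mode & S_ISVTX) ? 1 : 0;
}
```

One would call this once for each compiler-related program (the exact list
of files is site-specific) and print any path for which it returns 1.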

Late in the afternoon it suddenly started behaving correctly again. It has
never boo-booed since (that we know of :-) ). At the time there was lots
of free memory (out of a total of 16MB) and swap (ditto - a hangover from when
we had only 8MB memory), and few other users, none of whom were compiling.

Over the Christmas holidays, the user who first noticed this happening was
talking to a friend who "lives and breathes for weird Sun problems", and
this fellow told him that sometimes Suns will keep a copy of a
heavily-used program in RAM and keep using that copy rather than reloading
the one on disk. If this copy were corrupted, it could have caused the
behaviour we saw, and would explain why it appeared and disappeared the
way it did (though I have a couple of objections to this: 1. the program
was not really heavily used when this started happening, and 2. wouldn't
corruption be much likelier to result in an unrunnable copy, or else
something that produced unrunnable code?).

The question: do Suns do this when the sticky bit is *not* set on a
program?  If so, would this copy actually stay in RAM rather than swap?
And why, when it was getting heavier use while we were testing it, would
it have gotten rid of it? I know there is much more sharing of code in
4.0, but this machine is still at 3.4.

Any educated guesses about this would be welcome. Please reply directly to
me, and if I learn something people should generally be aware of, I'll
pass it on to the list.

Thanks (yet again), everyone.

Ruth Milner
Systems Manager
University of Toronto Physics

sysruth@helios.physics.utoronto.ca