[comp.mail.elm] Serious elm problems on Sun OS 3.5.2

mikes@oakhill.UUCP (Mike Schultz) (08/16/89)

Our site has been running elm 2.2 ever since it became available.  We are now
running patch level 10.

Shortly after I installed elm 2.2, I began getting complaints from users that
elm was crashing.  I blamed the problem on system overloading:  elm doesn't
always check all system calls for the less frequent error returns caused by
stuff like process table overflows (which do happen occasionally :-)).
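
As an illustration of the kind of check meant here, below is a minimal sketch
in C.  It is not taken from the elm source; classify_fork and the fork_status
names are hypothetical.  The only outside fact it relies on is that fork()
returns -1 and sets errno to EAGAIN when the system cannot create another
process, which is the process-table-full case.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical result codes for this sketch; not part of elm. */
enum fork_status { FORK_CHILD, FORK_PARENT, FORK_RETRY, FORK_FATAL };

/*
 * Classify the return value of fork().  `err` is the value errno held
 * immediately after the call; it is only consulted when pid is -1.
 */
static enum fork_status classify_fork(pid_t pid, int err)
{
    if (pid == 0)
        return FORK_CHILD;      /* running in the new process */
    if (pid > 0)
        return FORK_PARENT;     /* fork succeeded; pid is the child */
    /* pid == -1: EAGAIN means no process slot was free, so retrying
     * later may work.  Anything else is treated as fatal here. */
    return (err == EAGAIN) ? FORK_RETRY : FORK_FATAL;
}
```

A caller would do something like `pid = fork(); st = classify_fork(pid, errno);`
and then report the error or retry, rather than carrying on as if the child
existed.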

But then it happened to me one day and I could repeat the failure, so I got out
adb and went bug hunting!  And what did I find?  I found an illegal instruction
in the middle of the object file.  No, I didn't disassemble in the middle of
an instruction; it was right there in the middle of the disassembly of the
legal instructions... And then it wasn't!

Yes, that's right.  While I was poking around trying to figure out how, on a
virtual memory machine with page protection, elm had managed to get an
instruction trashed, the instruction fixed itself and elm ran my test case
without failure!

Whoa!  I took my evidence to the system managers, and reported something was
wrong.  For my trouble I got blank looks and was told that no one else was
having this problem, and that usenet software always has bugs in it.  (So
they're human and overworked; who isn't?)

I let the problem drop, confident people would begin to report other
mysterious problems and that the problem would eventually get corrected.

But they didn't, and the complaints about elm went away too.

Unfortunately, the problems are coming back again, now several months later.

The symptoms are still the same.  Elm randomly crashes, and adb disassemblies
show illegal instructions in the code, but almost never in the same place.
The most frequent sequence that causes the crash is to start elm, change
folders, use * to advance to the end, back up one message, and then perform
a reply.  I don't know if it is significant, but the folder has 51 messages
in it, and when I back up a message, elm has to scroll backwards one page.
The routine that most frequently shows the problem is get_return.

It has happened using a vt100 termcap entry, and the Sun OS version is 3.5.2.

Anybody heard of this problem?  Is there a bug in the Sun OS?  Is anybody
currently running elm 2.2 on a Sun OS 3.5.2?

Please respond using email.

Hurry up guys!  The natives are beginning to want 2.1 back.

Mike Schultz
mikes@oakhill.uucp
...!uunet!cs.utexas.edu!oakhill!mikes

"At Motorola, we make many types of high speed microprocessor chips.
	So NUMBER CRUNCH all you want, we'll make more!"

scs@itivax.iti.org (Steve Simmons) (08/17/89)

mikes@oakhill.UUCP (Mike Schultz) writes:

>But then it happened to me one day and I could repeat the failure, so I got out
>adb and went bug hunting!  And what did I find?  I found an illegal instruction
>in the middle of the object file.  No, I didn't disassemble in the middle of
>an instruction, it was right there in the middle of the disassembly of the 
>legal instructions,.... And then it wasn't!

>Yes, that's right.  While poking around trying to figure out how on a virtual
>memory machine with page protection, elm managed to get an instruction trashed,
>the instruction fixed itself and elm ran my test case without failure!

>Whoa!  I took my evidence to the system managers . . .

And they said "You're crazy".  We had the same thing happen (tho not
with elm) at another site where *I* was the system manager -- Suns, and
certain programs would randomly crash with illegal instructions.  Only
certain programs.  And relinking a new version would make the problem go
away.  The programmers complained to the system manager (me) and I said
they were crazy.

They weren't.

It turns out that there is a hardware bug in some Sun 3/50s (those made before
1988, I think).  It affects very, very few programs, and requires some really
odd circumstances -- a branch instruction at the end of a virtual page
which crosses into the 'next' page, which causes a page fault, and other
I/O must be happening at the same time (I'm serious!).  Even then it only
happens sometimes.
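
To make the first of those conditions concrete, here is a minimal sketch in C
of a test for whether an instruction straddles a page boundary.  The 8 KB page
size is an assumption about the Sun-3 MMU and should be checked against the
hardware manual; straddles_page is a hypothetical helper, not part of adb or
any Sun tool.  It covers only the address arithmetic, not the page fault or
the concurrent I/O that the bug also seems to need.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: 8 KB pages.  Change this if the machine differs. */
#define PAGE_SIZE 8192u

/*
 * Return true if an instruction of `len` bytes starting at virtual
 * address `addr` has its first and last bytes on different pages.
 * Assumes addr + len - 1 does not wrap around the address space.
 */
static bool straddles_page(uint32_t addr, uint32_t len)
{
    if (len == 0)
        return false;                     /* nothing to straddle */
    uint32_t first_page = addr / PAGE_SIZE;
    uint32_t last_page  = (addr + len - 1) / PAGE_SIZE;
    return first_page != last_page;
}
```

Fed the addresses and lengths of the branch instructions from an adb
disassembly, a check like this would pick out the candidates worth a closer
look.  It would also fit the observation that relinking makes the problem
come and go, since relinking moves instructions relative to page boundaries.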

So: do you have old 3/50s?  Does the problem occur only there?  Does it
come and go as versions of elm change?  If so, you've been bit.
-- 
Steve Simmons		          scs@vax3.iti.org
Industrial Technology Institute     Ann Arbor, MI.
"Velveeta -- the Spam of Cheeses!" -- Uncle Bonsai