[net.unix-wizards] vi core dumping on Sun 2

malloy@ittral.UUCP (William P. Malloy) (06/21/85)

We ran into a rather odd little problem here a few days ago, and would like
someone else's opinion.

SYSTEM description:
	Sun 2/170  running Sun release 1.2 (4.2bsd binary)
	4 Meg memory
	1 Xylogics 450 SMD disk controller
	1 Eagle (Fujitsu M 2351A) disk drive
	1 SKY FPA
	1 3COM Ethernet board
	1 Systech Terminal Mux (16 ports) 
In short, a small 16-user computer with virtual memory and a big disk.

Problem Description:
	vi dumps core, repeatedly and reliably.  I.e. it doesn't work.

How it started:
Someone here had just ignored the old saying about not trying to fix that
which is not broken,  and had just rebooted all our Sun 2/170's (in order to
force fsck to check the disks) when the problem started (vi core dumping).

NOTE the following:
1) Binary hadn't been modified at all since pulled off the Sun release tape.
2) Rcp-ed vi from another Sun over ethernet and did a ``cmp'' on the two
   binaries and noted they were EXACTLY the same.
3) Tried running the vi binary from the other Sun, noted it works FINE.
4) Tried copying the vi binary to another file, works FINE.
5) Get clever.  Notice vi has 6 links, make 5 links to the other Sun's vi and
   run it.  Works FINE.
6) Get cleverer still.  Notice vi has sticky bit turned on, so turn on the
   sticky bit of the other vi, try that.  Works FINE.
7) Rebooted system, problem goes away.      <************* NOTE *************
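
Checks 2, 5 and 6 above can be sketched as a small script.  This is only an
illustration in Python, not what was run on the Sun: the function names
describe_binary and same_contents are made up here, and it assumes a system
whose stat() reports the link count and the sticky bit in the usual way.

```python
import filecmp
import os
import stat


def describe_binary(path):
    """Report the attributes looked at in checks 5 and 6 (hypothetical helper)."""
    st = os.stat(path)
    return {
        "links": st.st_nlink,                       # hard link count
        "sticky": bool(st.st_mode & stat.S_ISVTX),  # sticky bit set?
        "size": st.st_size,
    }


def same_contents(path_a, path_b):
    """Byte-for-byte comparison in the spirit of cmp(1).

    shallow=False forces the file contents to be read rather than
    trusting size and modification time alone.
    """
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Note that both helpers read through the filesystem, so they can only say
that the on-disk binary is intact.  They say nothing about any other copy
of the text the kernel may be holding.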

Needless to say we were ANNOYED.  The only thing worse than a bug is a
non-repeatable bug.  And the only thing worse than that is a repeatable, but
incomprehensible one.  The only thing I can think of is that vi (perhaps
because of the sticky bit) had been loaded into memory and retained there
by UNIX, and some location in memory where vi had been put was BAD.  Sun is
well known for having memory problems.  However, we had no memory errors in
our /usr/adm/messages file (at least none recently).

Anybody out there want to guess at another explanation?  It's not too important
so long as it doesn't repeat itself, but it is ODD.

BTW, to people who have Suns: in section 8 of the manual Sun has a standalone
command called ``imemtest'' which is supposed to run a memory testing routine.
The man page says it's interactive and self-explanatory.  Well, it doesn't work
for me.  Has anyone ever gotten this to work (or is it just some nonsense from
Sun that is best deleted)?
-- 
Address: William P. Malloy, ITT Telecom, B & CC Engineering Group, Raleigh NC
         {ihnp4!mcnc, burl, ncsu, decvax!ittvax}!ittral!malloy

reno@bunker.UUCP (Jim Reno) (06/26/85)

Oddly, we recently had a similar problem on a VAX, also with 'vi'.

Since the sticky bit is on, a copy of the text is being maintained
in the swap area (even when nobody is executing it). Later invocations
result in the text being read from the swap area, not the binary
file. Hence if the image in swap has a bad block, it will bomb,
even if the original binary is still good. (Our situation was caused
by a hardware problem in accessing the swap area).

Of course, this just changes the problem to 'why did the swap image
get corrupted?'

tim@cithep.UucP (Tim Smith ) (06/27/85)

The problem lies, I believe, with the sticky bit.  You mention that vi has
the sticky bit on.  Assuming that 4.2bsd treats this bit the same way that
UNIX does, then this means that there will be a copy of the text of vi on
the swapping device (or maybe paging device).  Assume that this copy is
bad.  I think all your symptoms are explained then:

> 1) Binary hadn't been modified at all since pulled off the Sun release tape.
> 2) Rcp-ed vi from another Sun over ethernet and did a ``cmp'' on the two
>    binaries and noted they were EXACTLY the same.

Fine.  cmp opens /bin/vi, and gets the data from the filesystem, which has
not been trashed.

> 3) Tried running the vi binary from the other Sun, noted it works FINE.
> 4) Tried copying the vi binary to another file, works FINE.

To be expected, since cp gets the data from the filesystem.

> 5) Get clever.  Notice vi has 6 links, make 5 links to the other Sun's vi and
>    run it.  Works FINE.
> 6) Get cleverer still.  Notice vi has sticky bit turned on, so turn on the
>    sticky bit of the other vi, try that.  Works FINE.

All this is being done to the new vi.  The old one is still sitting wounded
on your swapping or paging area(s).

> 7) Rebooted system, problem goes away.      <************* NOTE *************

Rebooting makes the system forget the bad copy on the swapping area.
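
The argument can be played out in a toy model.  This is a sketch of the
reasoning only, not 4.2bsd code: the class ToyKernel and its methods are
invented for illustration, and it keys the saved text by path name where a
real kernel would identify the file itself (so all six links would share
one saved image).

```python
class ToyKernel:
    """Toy model of sticky-text handling (illustrative only)."""

    def __init__(self, filesystem):
        self.filesystem = filesystem  # path -> bytes: the binaries on disk
        self.sticky = set()           # paths with the sticky bit set
        self.swap_text = {}           # path -> bytes: text images kept in swap

    def exec_text(self, path):
        """Return the text image a newly started process would run."""
        if path in self.swap_text:
            # A saved image exists, so the filesystem copy is not consulted.
            return self.swap_text[path]
        text = self.filesystem[path]
        if path in self.sticky:
            self.swap_text[path] = text  # keep it around for next time
        return text

    def corrupt_swap(self, path, offset):
        """Flip one byte of the saved image to stand in for a bad block."""
        image = bytearray(self.swap_text[path])
        image[offset] ^= 0xFF
        self.swap_text[path] = bytes(image)

    def reboot(self):
        """A reboot forgets every saved text image."""
        self.swap_text.clear()
```

In this model a corrupted saved image makes every later exec_text() of the
old vi return bad text, while the file itself still compares equal and a
freshly copied vi runs from its own clean image, until reboot() discards
the saved copy.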
-- 
					Tim Smith
				ihnp4!{wlbr!callan,cithep}!tim

chris@umcp-cs.UUCP (Chris Torek) (06/27/85)

(This is not an attempt at an explanation.)  The sticky bit doesn't
keep things in core; it keeps them around in swap space once they've
made it out there.  (They will hang around in the buffer cache for
a short while, but at 200K (for a 2M Sun) that isn't usually very
long.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland

thomas@utah-gr.UUCP (Spencer W. Thomas) (06/27/85)

I would bet on a bad spot in the swap area (or memory).  As Chris points
out, the sticky bit means "keep a copy of this program in swap space,
even if nobody is running it".  If the "running" copy of the program got
corrupted, and then swapped out, you would keep on using the corrupted
copy.  A process is loaded into the swap area by first reading it into
memory, then writing it out to swap, so if you had a bad memory
location, it would be possible to corrupt the swap copy.

Things you could try if it happens again:
1. Turn off the sticky bit and run vi a couple of times.  I think the
system will throw away the copy in the swap area, so you should be
getting a clean copy.

2. Write a program that uses ptrace() to compare a "running" copy of vi
to the binary (the running copy should get the bad swap version).
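
The comparison step in item 2 might look something like this once the two
images are in hand.  It is a sketch in Python: getting the running text out
of the process (via ptrace() or otherwise) is left out entirely, the
function name differing_pages is made up, and the page_size default is an
illustrative assumption rather than a statement about the Sun.

```python
def differing_pages(reference, suspect, page_size=2048):
    """Return the page numbers at which two text images differ.

    reference: bytes of the text segment as read from the binary file
    suspect:   bytes of the text segment as read from the running process
    page_size: assumed page size in bytes (hypothetical default)
    """
    bad = []
    length = max(len(reference), len(suspect))
    for offset in range(0, length, page_size):
        # Slices past the end are simply shorter, so a length mismatch
        # shows up as a difference in the final page(s).
        if reference[offset:offset + page_size] != suspect[offset:offset + page_size]:
            bad.append(offset // page_size)
    return bad
```

Knowing which pages differ would help tell a single bad block or memory
location from more widespread damage.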

-- 
=Spencer   ({ihnp4,decvax}!utah-cs!thomas, thomas@utah-cs.ARPA)

terryl@tekcrl.UUCP () (06/27/85)

>Problem Description:
>	vi dumps core, repeatedly and reliably.

>How it started:
>Someone here had just ignored the old saying about not trying
>to fix that which is not broken,  and had just rebooted all our Sun 2/170's
>(in order to force fsck to check the disks) when the problem started
>(vi core dumping).

>NOTE the following:
>1) Binary hadn't been modified at all since pulled off the Sun release tape.
>2) Rcp-ed vi from another Sun over ethernet and did a ``cmp'' on the two
>   binaries and noted they were EXACTLY the same.
>3) Tried running the vi binary from the other Sun, noted it works FINE.
>4) Tried copying the vi binary to another file, works FINE.
>5) Get clever.  Notice vi has 6 links, make 5 links to the other Sun's vi and
>   run it.  Works FINE.
>6) Get cleverer still.  Notice vi has sticky bit turned on, so turn on the
>   sticky bit of the other vi, try that.  Works FINE.
>7) Rebooted system, problem goes away.      <************* NOTE *************

     We saw this happen once on a VAX 11/780 with cp, of all programs!!! We
did exactly the same things that you did, with the exception of making the
multiple links (cp has no links). The problem disappeared on reboot. What we
think happened (and we were never able to prove it, since the problem never
recurred) is that 4.2 (and 4.1, and 4.0) does not give up the pages of a
pure-text program when the program exits, but keeps them marked as pages of
that program, so that if the program is run again soon, the pages do not
have to be read in again from disk. The theory is that a page somehow got
corrupted before its next use, but hadn't been reclaimed for some other use,
so the in-core copy of that text page was no longer valid. Unfortunately,
when a program dies and dumps core, only the data portion of the program is
dumped, and not the pure-text portion, so if one wanted to verify this
theory, one would have to get a dump of PHYSICAL memory before rebooting and
poke around in there. We didn't think of this (and even if we had, we
wouldn't have felt like poking around in PHYSICAL memory to test this theory).
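
A toy model of that theory, for what it is worth.  This is a sketch in
Python and not BSD code: the class ToyTextCache, its method names, and the
dictionary standing in for the free list are all invented for illustration.

```python
class ToyTextCache:
    """Toy model of text pages kept after a program exits (illustrative only)."""

    def __init__(self, disk_pages):
        self.disk_pages = disk_pages  # (program, page_no) -> bytes on disk
        self.free_list = {}           # freed pages, still tagged by owner
        self.pageins = 0              # number of reads that went to disk

    def exit_program(self, program, resident):
        """On exit, free the text pages but leave them tagged."""
        for page_no, data in resident.items():
            self.free_list[(program, page_no)] = data

    def fault(self, program, page_no):
        """Satisfy a text page fault: reclaim a tagged page, else read disk."""
        key = (program, page_no)
        if key in self.free_list:
            return self.free_list.pop(key)  # no disk read; contents trusted
        self.pageins += 1
        return self.disk_pages[key]

    def reuse_for_other_purpose(self, program, page_no):
        """The page frame is handed to someone else; the tag is lost."""
        self.free_list.pop((program, page_no), None)
```

If a tagged page is damaged while it sits on the free list, fault() hands
back the damaged contents without touching the disk.  Once the frame is
reused for something else, or the machine is rebooted, the next fault reads
a clean copy from disk.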



					Terry Laskodi
					     of
					Tektronix

guy@sun.uucp (Guy Harris) (06/29/85)

> What we think happened (and we were never able to prove it since the problem
> never re-occurred), is that 4.2 (and 4.1, and 4.0) does not give up pages of
> pure-text programs when the program exits, but marks them as pages of
> pure-text programs, so that if the program is run again soon, the page
> does not have to be read in again from disk.

This is, indeed, the case.  (Try bringing a 4.xBSD machine up single-user,
timing a compile, and then timing the same compile again.  The second one
will be faster, both due to the pagein of the compiler passes being bypassed
and due to the inodes for the passes and input files (and their directory
entries, in 4.3BSD) being in an in-core cache.  Of course, if you have so
little physical memory that each pass of the compiler flushes the previous
one out of memory, you won't get any speedup; the blocks won't be in the
buffer cache because pageins don't go through the buffer cache.)
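
The timing experiment can be sketched as below.  This is an illustration in
Python for a present-day system, not something from 4.xBSD: the function
names time_command and time_twice are made up, and the command to run is
whatever compile you care to try.

```python
import subprocess
import time


def time_command(argv):
    """Run argv once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(
        argv,
        check=True,                  # raise if the command fails
        stdout=subprocess.DEVNULL,   # discard output so it doesn't skew timing
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def time_twice(argv):
    """Time the same command twice in a row: (first_run, second_run)."""
    first = time_command(argv)
    second = time_command(argv)
    return first, second
```

For example, time_twice(["cc", "-c", "foo.c"]) on a freshly booted, idle
machine would give the two figures to compare (foo.c being a placeholder
for any source file).  On a busy or memory-starved system the second run
need not be faster, for the reason given above.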

	Guy Harris