[comp.sys.dec] swap problem on DS3100

richk@june.cs.washington.edu (Richard Korry) (10/13/89)

behavior: jobs that are swapped (ps stat = W) get placed on the
run queue (ps stat = RW) but never seem to run. the load average shoots
up but nothing is actually executing. The only fix has been to reboot.
Anyone else ever seen this?
	rich

litwack@operations.dccs.upenn.edu (Mark Litwack) (10/13/89)

In article <9475@june.cs.washington.edu> richk@june.cs.washington.edu (Richard Korry) writes:
>behavior: jobs that are swapped (ps stat = W) get placed on the
>run queue (ps stat = RW) but never seem to run. the load average shoots
>up but nothing is actually executing. The only fix has been to reboot.
>Anyone else ever seen this?
>	rich

  Yes, I'm glad (in a way) that we aren't the only ones.  DEC claims
that they've never heard of a problem like this.  Our system
continually hangs either every 3-5 days or after a few hours.

  Once the problem starts happening it locks up in two flavors:
quickly in a matter of 5 minutes, or slowly over a period of 2-3
hours.  In both cases pstat always shows processes runnable but
swapped.  Eventually, I guess the wrong process gets swapped out and
the system dies completely.
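
  For anyone who wants to watch for this state from a script, here is a
minimal sketch in C of the check being described.  It assumes BSD-style
ps state strings, where the first letter is the run state (R = runnable)
and the second letter is W when the process is swapped out.  The function
names are hypothetical, and collecting the state strings from ps is left
out.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if a ps state string looks like "runnable but swapped out".
 * Assumption: BSD-style STAT field, first letter = run state,
 * second letter = 'W' when the process is swapped out. */
int is_runnable_but_swapped(const char *stat)
{
    return stat != NULL && stat[0] == 'R' && stat[1] == 'W';
}

/* Counts how many of the n state strings are in the RW state.
 * A count that stays high while nothing makes progress would match
 * the symptom reported in this thread. */
size_t count_stuck_candidates(const char *const stats[], size_t n)
{
    size_t hits = 0;

    assert(stats != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (is_runnable_but_swapped(stats[i]))
            hits++;
    }
    return hits;
}
```

  A process showing plain R or plain W is not counted; only the
combination is.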

  We've noted that already-running processes that don't need to
start any new processes keep working.  named and nfsd both work fine when
the system is in this state.  inetd works fine until it has to fork
off a telnetd or other server for you.  More often than not, large
programs on the order of 1-6meg have been running when the lockup
occurs.

  We've had a series of other very strange problems and I don't know
if they're related or not:

- Large programs, like emacs, sometimes get a segmentation fault when
they start.  If you run something else then try again it works.
Usually no core file is produced.  Once it did, and it was the core
file of a csh.

- Someone running emacs on a small plain text file crashed (not
hung) our system twice with a "tlbmod on invalid pte" error.  I have
no idea what this error means.

- The crash dump procedure included with Ultrix 3.1 failed the first
two times we tried it.  The dump code appears to have been corrupted.
It worked on successive tries.

  We've tried everything that we can think of to stop this from
happening.  We've already tried a different system unit, altering
configuration parameters, and disabling any suspect local or third
party programs.  In desperation, we even tried rebooting the system
once a day from crontab.  Nothing helped.

  I've got about 5 people at DEC looking at this now.  They have
several guesses:

- A manufacturing problem caused memory modules to not be seated
correctly and the diagnostics won't pick it up (this was a DEC
internal report).  So, they were here today to reseat my memory
modules.  I am skeptical of this being the problem.

- They say our system is "memory starved".  They recommended doubling
our memory from 12meg to 24meg.  Granted, 12meg is not much for a
system with a lot of users, let alone a RISC architecture, but it still
shouldn't cause this type of problem.  Our swap space has never been
over half full with 64meg configured.

- They also recommended that we drop our maxuprc from 100 back to the
default 50.  Ok, I'll try it, but it isn't a large savings.

  It looks to me like this is a kernel problem and not a parameter
tuning or configuration mismatch.

-mark

paul@speedmetal.engin.umich.edu (Paul Killey) (10/13/89)

In article <15444@netnews.upenn.edu> litwack@operations.dccs.upenn.edu (Mark Litwack) writes:
>In article <9475@june.cs.washington.edu> richk@june.cs.washington.edu (Richard Korry) writes:

About weird swapping/paging behavior ...

Same here with both of the above!

--paul


"Don't call me baby when she's waiting in the car."

swilson@pprg.unm.edu (Scott Wilson [CHTM]) (10/13/89)

In article <15444@netnews.upenn.edu> litwack@operations.dccs.upenn.edu (Mark Litwack) writes:
>In article <9475@june.cs.washington.edu> richk@june.cs.washington.edu (Richard Korry) writes:
>>behavior: jobs that are swapped (ps stat = W) get placed on the
>>run queue (ps stat = RW) but never seem to run. the load average shoots
>>up but nothing is actually executing. The only fix has been to reboot.
>>Anyone else ever seen this?
>>	rich
>
>  Yes, I'm glad (in a way) that we aren't the only ones.

We have seen something like this too - I happened to catch a "sysmon"
of what it was doing. I had a program running that had dynamically
allocated large amounts of memory, in little bits at a time. At some point
the program wanted "just a little more memory", say 200 bytes.
The kernel would SWAP the entire working set out to disk, and then 
fault it all back in a little shred at a time until we finally get back 
to asking for the SAME "little bit of memory", and then this all 
happens again, ad infinitum.  Load average goes nuts. May or may not be 
able to suspend or kill the job.  May lock up other system resources too.  
Often times, the amount of free memory was still a Meg or so...

This was apparently "fixed" when we upgraded to the latest 3100 rev of
Ultrix, but the behavior persists. If the memory allocation is done in
all big chunks (totaling the same amount as before, but not in little
chunks), then everything seems to be OK.
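
The two allocation patterns can be sketched roughly as follows in C. This
is only an illustration of the difference described above, not the actual
program; the function names, the 4096-byte touch stride, and the sizes are
assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Pattern 1: build up `total` bytes from many separate `piece`-byte
 * requests (e.g. 200 bytes each).  Each block is touched so its memory
 * is really used.  Returns the number of bytes obtained; everything is
 * freed again before returning. */
size_t alloc_in_small_pieces(size_t total, size_t piece)
{
    assert(piece > 0);

    size_t count = total / piece;
    char **blocks = malloc(count * sizeof *blocks);
    if (blocks == NULL)
        return 0;

    size_t got = 0;
    size_t i;
    for (i = 0; i < count; i++) {
        blocks[i] = malloc(piece);
        if (blocks[i] == NULL)
            break;
        blocks[i][0] = 1;          /* touch the block */
        got += piece;
    }

    while (i > 0)
        free(blocks[--i]);
    free(blocks);
    return got;
}

/* Pattern 2: ask for the same amount in one big request.
 * Touches one byte every 4096 bytes (an assumed page size).
 * Returns `total` on success, 0 on failure. */
size_t alloc_in_one_chunk(size_t total)
{
    char *chunk = malloc(total);
    if (chunk == NULL)
        return 0;

    for (size_t off = 0; off < total; off += 4096)
        chunk[off] = 1;

    free(chunk);
    return total;
}
```

For example, alloc_in_small_pieces(4 * 1024 * 1024, 200) and
alloc_in_one_chunk(4 * 1024 * 1024) ask for roughly the same total. Going
by what we saw, only the first pattern would be expected to provoke the
swap-out/fault-in loop on an affected system.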

Note that the behavior I describe will occur even if I am the ONLY user,
and (other than a few of the background processes) am the only runnable process.
Working off of an NFS-mounted partition seems to make it worse, sort-of.

Scott Wilson
University of New Mexico
Center for High Technology Materials
Albuquerque, NM 87131
(505)277-0780

te07@edrc.cmu.edu (Thomas Epperly) (10/13/89)

I have had very similar problems.  I have gotten the panic invalid pte
crash many times.  I also get the RW problem from the way you describe it.
It has happened on several different main CPU boards and with a completely
replaced set of RAM.  The segmentation faults when Emacs starts also appear
often.

DEC please fix this bug/these bugs!

Tom Epperly

mikem+@andrew.cmu.edu (Michael Meyer) (10/14/89)

> Excerpts from netnews.comp.sys.dec: 13-Oct-89 Re: swap problem on DS3100
> Thomas Epperly@edrc.cmu. (354)

> I have had very similar problems.  I have gotten the panic invalid pte
> crash many times.  I also get the RW problem from the way you describe it.
> It has happened on several different main CPU boards and with a completely
> replaced set of RAM.  The segmentation faults when Emacs starts also appear
> often.

Me too.
--Mike