[comp.sys.sun] SparcStation I Memory Leak?

murthy@algron.cs.cornell.edu (Chet Murthy) (06/12/90)

I have been running a large LISP application on a SparcStation I for a
while now, and I have noticed some really awful problems with the
allocation of memory as time goes on.

Here's the scenario:

I have a SparcStation I with 16MB of memory, two 100MB internal ProDrives,
and an external 700MB SCSI drive (HP).  I use 120MB of the internal drives
for swapping, and all of the HP drive as well.

So I start up a Lucid LISP process which takes up 256MB of swap.
Initially, I usually get around 8MB of resident memory (RSS, as shown by
ps), but that drops off to around 3.9MB pretty soon.  If somebody else
starts up a semi-large program, my RSS allocation drops further, and even
after they finish their application, I don't get any of that memory back.

After about a day or two, my LISP is getting less than 2MB of RSS, on a
16MB machine!

Now, I thought that perhaps this was because of the particular kind of
LISP job I was running.  But when I run the same exact job on a SUN-4/260
Sparc with 32MB of memory, even in the presence of other large,
compute-intensive jobs my working set never drops below 10MB, and often
hovers near 21MB.

Moreover, I can see a noticeable slowdown from day 1 to day 2 to day 3,
as my program gets less and less memory.

This makes the Sparc almost useless by the end of the third day.  Just to
check, I killed my LISP job and restarted it after the Sparc had been
running 3 days, and it got only around 3-4MB of memory, down from 8MB
initially.

A friend noticed that if you write a C program that mallocs 64MB and then
repeatedly copies a random number of bytes (between 1 and 256) between two
random pointers into the space just allocated, say 100,000 times, pretty
soon the program ends up with a working set of around a quarter of a
megabyte on a 16MB workstation.
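
I don't have my friend's actual program to post, so here is a minimal
sketch of the kind of test described above, not his code.  The names
copy_stress, rand_offset and MAX_COPY are made up for illustration, and the
initial memset is my own assumption (the description doesn't say whether
the original touched the memory first).  Region size and iteration count
are parameters so it can be tried at smaller scale.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_COPY 256    /* largest single copy, per the description */

/* Pick an offset in [0, limit).  RAND_MAX may be as small as 32767, so
 * two draws are combined; the slight modulo bias doesn't matter here. */
static size_t rand_offset(size_t limit)
{
    unsigned long r = ((unsigned long)rand() << 15) ^ (unsigned long)rand();
    return (size_t)(r % limit);
}

/*
 * Allocate region_bytes, then do `iterations` copies of 1..MAX_COPY bytes
 * between two randomly chosen places inside the region.
 * Returns 0 on success, -1 on unusable arguments or if malloc fails.
 */
int copy_stress(size_t region_bytes, long iterations, unsigned seed)
{
    char *region;
    long i;

    if (region_bytes <= MAX_COPY || iterations < 0)
        return -1;

    region = malloc(region_bytes);
    if (region == NULL)
        return -1;

    /* Assumption: initialise the region so the copies read defined bytes. */
    memset(region, 0, region_bytes);
    srand(seed);

    for (i = 0; i < iterations; i++) {
        size_t n = (size_t)(rand() % MAX_COPY) + 1;   /* 1..256 bytes */
        size_t span = region_bytes - n + 1;           /* valid start offsets */
        size_t src = rand_offset(span);
        size_t dst = rand_offset(span);

        assert(src + n <= region_bytes && dst + n <= region_bytes);
        memmove(region + dst, region + src, n);       /* ranges may overlap */
    }

    free(region);
    return 0;
}
```

Calling copy_stress((size_t)64 << 20, 100000L, 1) corresponds to the 64MB,
100,000-copy case; the thing to watch is the RSS column in ps while it runs.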

Does anybody know WHY the Sparc is so brain-damaged?  I don't see how this
could be anything other than a memory leak.

As I mentioned before, if I kill the LISP and restart it, I don't get
nearly as much RSS as I had the first time around.  So it doesn't seem
possible that it could be anything other than a leak in the kernel.

What's going wrong?  Why is my Sparc useless after 3 days?

	murthy@cs.cornell.edu

murthy@algron.cs.cornell.edu (Chet Murthy) (06/15/90)

murthy@algron.cs.cornell.edu (Chet Murthy) writes:

>I have been running a large LISP application on a SparcStation I for a
>while now, and I have noticed some really awful problems with the
>allocation of memory as time goes on.

Well, after some talking with a Sun OS ambassador at a new products
session, I found out some interesting stuff.

The phenomenon is called "pmeg stealing".  I'm not sure exactly what's
going on, but the idea seems to be that something in the kernel is stealing
memory from the pool, and not putting it back.

So it looks like there's less and less memory available.  The fix, from
someone who may choose to remain anonymous (otherwise, he can raise his
hand - I didn't figure this out myself), is to turn off the swapper,
leaving only the pager running.

To turn off the swapper (this patches nosched in /vmunix, so it takes
effect after the reboot):

   % su
   # adb -wk /vmunix /dev/mem
   nosched?W 1
   ^D
   # reboot

I've gotten conflicting reports as to whether this is fixed in SunOS 4.1
or not.  So we'll just have to wait and see...

	murthy@cs.cornell.edu