murthy@algron.cs.cornell.edu (Chet Murthy) (06/12/90)
I have been running a large LISP application on a SparcStation I for a while now, and I have noticed some really awful problems with the allocation of memory as time goes on.

Here's the scenario: I have a SparcStation I with 16MB of memory, two 100MB internal prodrives, and an external 700MB SCSI drive (HP). I use 120MB of the internal drives for swapping, and all of the HP drive as well. So I start up a Lucid LISP process which takes up 256MB of swap. Initially, I usually get around 8MB of memory (RSS, as shown by ps), but that drops off to around 3.9MB pretty soon. If somebody else starts up a semi-large program, my RSS allocation drops further, and even after they finish their application, I don't get any more memory. After about a day or two, my LISP is getting less than 2MB of RSS, on a 16MB machine!

Now, I thought that perhaps this was because of the particular kind of LISP job I was running. But when I run the exact same job on a SUN-4/260 Sparc with 32MB of memory, even in the presence of other large, compute-intensive jobs, my working set never drops below 10MB, and often hovers near 21MB.

Moreover, I can see a noticeable slowdown from day 1 to day 2 to day 3, as my program gets less and less memory. This makes the Sparc almost useless by the end of the third day - just to check, I killed my LISP job and restarted it after the Sparc had been running 3 days, and it got around 3-4MB of memory - down from 8MB initially.

A friend noticed that if you write a C program that mallocs 64MB and then repeatedly copies a random number of bytes (between 1 and 256) between two random pointers into the data space just allocated, say 100,000 times, pretty soon your C program will end up with a working set of around a quarter of a Meg on a 16MB workstation.

Does anybody know WHY the Sparc is so brain-damaged? I don't see how this could be anything other than a memory leak.
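For anyone who wants to try the friend's test described above, here is a rough sketch of it in C. This is a reconstruction from the description, not the friend's actual program: the function names `thrash_copy` and `rand_offset`, the seed parameter, and the returned byte count are all assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build a random offset in [0, limit) from several
 * rand() calls, since RAND_MAX may be as small as 32767 and the region is
 * much larger than that. The modulo introduces a slight bias, which does
 * not matter for this test. */
static size_t rand_offset(size_t limit)
{
    size_t r = 0;
    int i;

    for (i = 0; i < 4; i++)
        r = (r << 15) ^ (size_t)(rand() & 0x7fff);
    return r % limit;
}

/* Hypothetical reconstruction of the test: malloc `region` bytes, then do
 * `iterations` copies of 1..256 bytes between two random places inside
 * the region. Returns the total number of bytes copied, or 0 if the
 * arguments are unusable or malloc fails. */
unsigned long thrash_copy(size_t region, long iterations, unsigned seed)
{
    const size_t max_len = 256;
    unsigned long total = 0;
    char *buf;
    long i;

    if (region <= max_len || iterations <= 0)
        return 0;

    buf = malloc(region);
    if (buf == NULL)
        return 0;

    srand(seed);
    for (i = 0; i < iterations; i++) {
        size_t len = (size_t)rand() % max_len + 1;  /* 1..256 bytes */
        size_t span = region - len + 1;             /* valid start offsets */
        size_t src = rand_offset(span);
        size_t dst = rand_offset(span);

        /* memmove rather than memcpy: the two ranges may overlap. */
        memmove(buf + dst, buf + src, len);
        total += len;
    }

    free(buf);
    return total;
}
```

Calling `thrash_copy(64UL * 1024 * 1024, 100000, 1)` from a `main` matches the sizes given in the description; the RSS can then be watched with ps from another window while it runs. The random offsets are the point of the exercise, since they scatter the page references across the whole 64MB instead of keeping them in a small working set.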
As I mentioned before, if I kill the LISP and restart it, I don't get nearly as much RSS as I had the first time around. So it doesn't seem possible that it could be anything other than a leak in the kernel.

What's going wrong? Why is my Sparc useless after 3 days?

murthy@cs.cornell.edu
murthy@algron.cs.cornell.edu (Chet Murthy) (06/15/90)
murthy@algron.cs.cornell.edu (Chet Murthy) writes:
>I have been running a large LISP application on a SparcStation I for a
>while now, and I have noticed some really awful problems with the
>allocation of memory as time goes on.

Well, after some talking with a Sun OS ambassador at a new products session, I found out some interesting stuff. The phenomenon is called "pmeg stealing". I'm not sure what's going on, exactly, but the idea seems to be that somebody in the kernel is stealing memory from the pool, and not putting it back. So it looks like there's less and less.

The fix, from someone who may choose to remain anonymous (otherwise, he can raise his hand - I didn't figure this out myself) is to turn off the swapper, leaving only the pager running.

To turn off swap:

    % su
    # adb -wk /vmunix /dev/mem
    nosched?W 1
    ^D
    # reboot

And I've gotten conflicting reports as to whether it is fixed in 4.1 or not. So we'll just have to wait and see...

murthy@cs.cornell.edu