[comp.unix.wizards] Gripe about mickey-mouse VM behaviour on many Unixes

tbray@watsol.waterloo.edu (Tim Bray) (01/16/90)

gm@keysec.kse.com (Greg McGary) writes:
>Our MIPS M/120 has been spending too much time page-thrashing lately.
>I would like to have a utility that tells me about the paging behavior...

Good luck, you'll need it.  Here at the New OED project we have got seriously
cheesed off about VM implementations on many unix systems.  No matter how much
you have, memory remains a critical resource.

But even well-regarded Unixes don't give you the tools to manage it.  In many
applications (e.g. database indexing) performance of algorithms can be greatly
improved by knowing how much physical memory you can use, and tuning to use it 
efficiently.  But not on Unix.  Some case studies:

1. On a 32-Mb (4.3bsd) machine with *nothing* else happening, the OS stupidly 
   pages away at you if you try to use more than about 20 Mb in the inane 
   belief that the memory will be needed any moment for one of those gettys or
   nfsds or something that aren't doing anything.

2. A process using only a moderate amount of memory (you think) runs like
   a dog, and you note that the system is spending much of its time in
   system state or idle.  Why, you wonder.  It quickly becomes apparent 
   that the information produced by items such as ps, vmstat, vsar, top, 
   and so on, is comparable in relevance and accuracy to Robert Ludlum novels 
   or peyote visions.  (SunOS the villain here).

3. On a 64-Mb (MIPS) machine, your paging rate, system time, and idle time 
   all go through the roof if your process insolently tries to random-access
   more than 32 Mb of memory at once.

Look, we all appreciate the tender loving care that VM architects have put
into strategies that are friendly to 100+ moderate-size processes context
switching rapidly in time-sharing mode.  But there are other ways to use
computers, and they are currently very poorly supported.  We paid for that
memory, we have a good use for it, and the OS is getting in our way, and it's
also REFUSING TO TELL US ACCURATELY WHAT'S GOING ON - an unforgivable sin by
my Unix dogma.

Harrumph, Tim Bray, New OED Project, U of Waterloo

Ed@alderaan.scrc.symbolics.com (Ed Schwalenberg) (01/17/90)

    From: Tim Bray <tbray@watsol.waterloo.edu>
    Date: 16 Jan 90 03:28:19 GMT

    Good luck, you'll need it.  Here at the New OED project we have got seriously
    cheesed off about VM implementations on many unix systems.  No matter how much
    you have, memory remains a critical resource.

And if you don't have enough, you lose just as badly.  Under System V
Unix for the 386, when your large process exceeds the amount of
non-wired physical memory, the paging algorithm pages out the ENTIRE
process (which takes a LONG time), then lets your poor process fault
itself in again, oh so painfully, until you exceed physmem again and
start the cycle over.

lm@snafu.Sun.COM (Larry McVoy) (01/17/90)

In article <19821@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes:
>1. On a 32-Mb (4.3bsd) machine with *nothing* else happening, the OS stupidly 
>   pages away at you if you try to use more than about 20 Mb in the inane 
>   belief that the memory will be needed any moment for one of those gettys or
>   nfsds or something that aren't doing anything.

Not a great alg, but not terrible if you are running a time-sharing system.
Take your 32 meg, chop off the ~2 meg for the kernel, chop off the ~4 meg
for the buffer cache, and you have about 26 meg left.  Now there is still 
something fishy here - if I've got the numbers right, it does seem odd that
the pager is beating you up with 6 megs free.  I don't believe this for a
second.  Try this:

$ adb /vmunix /dev/kmem
lotsfree/D
freemem/D
^D

The pager does not turn on until freemem < lotsfree (and lotsfree on Suns
is typically small, like 256K or so).  So something is wacko.

>2. A process using only a moderate amount of memory (you think) runs like
>   a dog, and you note that the system is spending much of its time in
>   system state or idle.  Why, you wonder.  It quickly becomes apparent 
>   that the information produced by items such as ps, vmstat, vsar, top, 
>   and so on, is comparable in relevance and accuracy to Robert Ludlum novels 
>   or peyote visions.  (SunOS the villain here).

Yeah, well, um, yeah.  Right. Well, it's like this see...  Actually, the real
problem is sharing.  Who do you charge shared libraries to?  The numbers
displayed by all those programs don't take that into account, but they should
give you a general idea.  Oh, yeah, I assume 4.0 or greater, things were
easy before then.

>3. On a 64-Mb (MIPS) machine, your paging rate, system time, and idle time 
>   all go through the roof if your process insolently tries to random-access
>   more than 32 Mb of memory at once.

Waddya expect?  :-) :-)

>Look, we all appreciate the tender loving care that VM architects have put
>into strategies that are friendly to 100+ moderate-size processes context
>switching rapidly in time-sharing mode.  But there are other ways to use
>computers, and they are currently very poorly supported.  We paid for that
>memory, we have a good use for it, and the OS is getting in our way, and it's
>also REFUSING TO TELL US ACCURATELY WHAT'S GOING ON - an unforgivable sin by
>my Unix dogma.

Hmm.  The SunOS VM model was designed with exactly this in mind.  You can
use damn near 100% of physical mem on a 4.0 or greater rev of the OS (the
os uses some, but on a 32 meg machine you should be looking at close to
30 megs of user usable ram).

At any rate, qwitchyerbitchin and tell me what you want to have happen.
Don't forget that your solution has to work well when I'm time sharing,
when one process wants the whole machine, and when two processes want the
whole machine.  And if you get it right, I'll get it into SunOS or die
trying.  Looking forward to your reply,
---
What I say is my opinion.  I am not paid to speak for Sun, I'm paid to hack.
    Besides, I frequently read news when I'm drjhgunghc, err, um, drunk.
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

dwc@cbnewsh.ATT.COM (Malaclypse the Elder) (01/18/90)

In article <130347@sun.Eng.Sun.COM>, lm@snafu.Sun.COM (Larry McVoy) writes:
> >Look, we all appreciate the tender loving care that VM architects have put
> >into strategies that are friendly to 100+ moderate-size processes context
> >switching rapidly in time-sharing mode.  But there are other ways to use
> >computers, and they are currently very poorly supported.  We paid for that
> >memory, we have a good use for it, and the OS is getting in our way, and it's
> >also REFUSING TO TELL US ACCURATELY WHAT'S GOING ON - an unforgivable sin by
> >my Unix dogma.
> 
> Hmm.  The SunOS VM model was designed with exactly this in mind.  You can
> use damn near 100% of physical mem on a 4.0 or greater rev of the OS (the
> os uses some, but on a 32 meg machine you should be looking at close to
> 30 megs of user usable ram).
> 
actually, the vm model addresses using all of physical memory because
it has integrated the paging pool with the buffer pool.  but it really
hasn't done much for such things as page stealing.  in fact, on the
version that was ported into system v release 4, i believe it still uses the
two-hand clock algorithm, which goes through physical memory regardless
of what each page is being used for.  my studies have shown that you
really want to classify pages according to "type" even with reference
information.  i worked with some developers on prototyping some
improvements in the old regions architecture (system v release 3)
and maybe will get around to integrating it into the vm model.

danny chen
att!hocus!dwc

dwc@cbnewsh.ATT.COM (Malaclypse the Elder) (01/30/90)

In article <1424@eutrc3.urc.tue.nl>, wsinpdb@eutws1.win.tue.nl (Paul de Bra) writes:
> In article <22105@adm.BRL.MIL> Ed@alderaan.scrc.symbolics.com (Ed Schwalenberg) writes:
> >...
> >And if you don't have enough, you lose just as badly.  Under System V
> >Unix for the 386, when your large process exceeds the amount of
> >non-wired physical memory, the paging algorithm pages out the ENTIRE
> >process (which takes a LONG time), then lets your poor process fault
> >itself in again, oh so painfully, until you exceed physmem again and
> >start the cycle over.
> 
> This most certainly is not true.
> I have experimented with growing processes and what really happens is
> that when the total size of all processes approaches physical memory
> size the pager starts to page out some old pages. I can have a process
> grow slowly and never really be paged (or swapped) out completely.
> (I have tried this with a 20Mbyte process on a machine with 8Mbyte of
> memory).
> 
> However, if a process is using the standard malloc() routine to allocate
> memory, then in order to allocate more memory malloc will search through
> a linked list of pointers, which are scattered throughout your process'
> memory. This usually involves a lot of paging, and it indeed is possible
> that all pages of a process are paged out, while other pages (of the
> same process) are being paged in. I have observed this behaviour with
> a process that exceeded physical memory only by a small margin.
> The solution is to use the routines in libmalloc, which do not use the
> scattered linked list of pointers. Switching to libmalloc completely
> stopped the thrashing.
> 
> The malloc() routine in BSD does not use the linked list approach either,
> so a growing process does not cause this kind of thrashing in BSD.
> 
i'm not sure about the user side of things (e.g. malloc) but i think what
the original poster was referring to was the fact that in system v release 3,
if a page fault could not find any free physical memory, the faulting
process would roadblock and be put in SXBRK state.  the memory
scheduler, sched, would then be awakened to swap out one or more processes.
note that when this happens, it is VERY LIKELY for other processes
to also roadblock on memory in the same state.  i use the "keystone cops"
as a visualization aid for this effect.  this is the reason why SVR3
could go into idle state on a busy system (if you examine sar output for
a memory-overloaded system).  but i digress.

i don't remember what the final method was for handling the case of
a single process on the run queue faulting and using more than physical
memory but one iteration of it had the memory scheduler do nothing
in that situation.  it would then be up to the paging daemon to steal
pages from that process (page aging was done according to wall clock time).

note that the swapping was not done in a single i/o operation but was
ultimately broken up by at least the device driver into track-size pieces.
but that doesn't address the problem of latency.  the process was subject
to latency on swapping and would have to painfully page its 'working set'
back in.  it would certainly make more sense to swap out in conveniently
sized pieces, leaving the process ineligible to run (don't want a process
that is being swapped out continuing to contend for pages) until either
the memory shortage cleared up or the entire process was swapped out.

this idea was incorporated into a regions-based prototype designed to
handle memory contention, load control, and page replacement in a more
sane manner.  of course, with SVR4, the regions architecture went out
the window and we would have to redesign a prototype based on VM (not
an easy task).  we may eventually do it though.

danny chen
att!hocus!dwc