[comp.sys.att] Help! Altos 5.3.1 fork is failing!

dlm@cuuxb.ATT.COM (Dennis L. Mumaugh) (10/17/89)
Ordinarily I don't answer questions like this as I work for
support and customers pay money for answers, but .... this is a
System V error and generally true for all System V 3.xxx
machines.

Skip to the end for a concise answer rather than a blow-by-blow
analysis.

In article <506@oglvee.UUCP> jr@oglvee.UUCP (Jim Rosenberg)
writes:

        We have 4M RAM and before the upgrade the machine just
        screamed.  Now we are paging like mad and getting
        sporadic fork failures.

        The system error reporting is filled with messages like
        this:

        000146 07:50:06 00e6f0f6 ... 0000 00 NOTICE: getcpages -
                waiting for 1 contiguous pages
        000147 08:13:16 00e80082 ... 0000 00
        000148 08:13:16 00e80082 ... 0000 00 NOTICE: getcpages -
                Insufficient memory to allocate 1 contiguous page -
                system call failed

        In many cases I can exactly correlate one of these
        "system call failed" messages with a fork failure.

Usually true.  Sometimes it will be an exec and rarely a stack
growth.

        According to the man page for fork(2) there are 3 ways a
        fork can fail:  No process table slots left, exceeding
        the per-user limit, and a most obscure indeed 3rd one:
        "Total amount of system memory available when reading via
        raw IO is temporarily insufficient".  Either the man page
        lies or this third one is it.

In a sense.

        I took a blind stab and guessed that the parameter
        involved here is PBUF.  Altos recommends PBUF=8 straight
        across the board no matter how much memory you have.
        Sounds pretty odd to me, since on a 6386 running V.3.2
        with 2 Meg RAM I've got 20, and never fiddled with it.  I
        jacked up PBUF to 16 -- but it made no difference.

Sorry, wrong guess.  Try /etc/swap -a ......

        So, my questions are:

        What the bleep is getcpages?  It sounds like an internal
        kernel routine to get continuous pages in RAM.

Ordinarily true.  But we also call it when we need only one page,
because it is fast.

        Is this call issued by the paging daemon?

Close.  It is called by a kernel routine looking for pages, such as
grow or dupproc.

        How could it fail on a request to get only 1 page unless
        I'm out of swap space?

How did you guess?

        (Which I'm not.  We're getting these with many many
        thousand blocks of free swap space -- we have a swap(1)
        which will show these.)

Not true! /etc/swap shows only actual use of swap, not committed use
of swap.  The same goes for sar reports.

        Is there a tunable parameter that will rescue me here?

        Altos seems to think that a failed fork should only get a
        "NOTICE".  Yeah, well, I notice all right.  It's bad
        enough when the shell reports "No more processes" -- you
        just try again and it works.  But we have all kinds of
        batch jobs that spawn uux requests and other such things
        and they're just getting shot right out of the sky.

True, some code isn't very robust and ought to sleep and wait for
less load, but people who do forks don't examine error codes, nor
do people who do execs.  fork and exec fail with errno set to either
ENOSPC or EAGAIN, if you check it.
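
For what "sleep and wait for less load" could look like in practice,
here is a minimal sketch in C.  The helper names
(is_transient_fork_error, fork_with_retry) and the retry policy are
illustrative assumptions, not anything shipped with System V; the set
of errno values treated as "try later" is the one named in this post.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Nonzero if err is one of the "out of resources, try later" codes
 * named in this post (EAGAIN, ENOMEM, ENOSPC). */
static int is_transient_fork_error(int err)
{
    return err == EAGAIN || err == ENOMEM || err == ENOSPC;
}

/* Hypothetical helper: call fork() up to max_tries times, sleeping
 * delay_secs between attempts while the failure looks transient.
 * Returns what fork() returned; on -1, errno holds the last error.
 * With max_tries <= 0 no fork is attempted and -1 is returned. */
static pid_t fork_with_retry(int max_tries, unsigned delay_secs)
{
    pid_t pid = -1;
    int attempt;

    for (attempt = 0; attempt < max_tries; attempt++) {
        pid = fork();
        if (pid >= 0)
            return pid;                 /* parent (>0) or child (0) */
        if (!is_transient_fork_error(errno))
            break;                      /* not a load problem: give up */
        if (attempt + 1 < max_tries) {
            int saved = errno;          /* sleep() may disturb errno */
            sleep(delay_secs);
            errno = saved;
        }
    }
    return pid;
}
```

A batch job would call fork_with_retry(5, 10) or similar in place of
a bare fork() and still report errno if every attempt fails.  The
same check applies around exec.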

        Any words of wisdom gratefully accepted!  I skimmed over
        the likeliest parts of Bach to see if the light would
        dawn -- looks like I better go back and reread the
        section on demand paging pretty carefully.

Answer:

When a process execs or forks, the kernel must ensure there is
enough space on the paging device to hold all of the memory owned
by the process.  Since all of the data and bss (and depending on
the type of program even the text) can be written and then paged
out, we must make sure that there is enough swap space for all of
this.  Hence we have a kernel variable called availsmem
(available swap memory) that holds how much swap memory is
uncommitted.  The kernel checks that variable and does not look at
how much of the swap device is actually in use.

Needless to say, the kernel is pessimistic and expects all pages
to be dirtied and thus assumes each and every page of a fork will
be touched.  Your swap device isn't big enough to hold all of the
programs' memory if all of it were swapped out at once.
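
To make the bookkeeping concrete, here is a toy model in C.  It is
not kernel source: the struct, its fields other than the name
availsmem, and the function names are invented for illustration.  It
only shows why a reservation can fail while the swap device still
looks nearly empty.

```c
/* Toy model of commitment accounting.  All sizes are in pages. */
struct swap_model {
    long total_pages;   /* size of the swap device */
    long availsmem;     /* pages not yet promised to any process */
    long in_use;        /* pages actually written out to swap */
};

static void swap_model_init(struct swap_model *m, long total_pages)
{
    m->total_pages = total_pages;
    m->availsmem = total_pages;
    m->in_use = 0;
}

/* Reserve npages up front, as described above for fork and exec.
 * Returns 0 on success, -1 if the request would over-commit.
 * Note that in_use is never consulted. */
static int reserve_pages(struct swap_model *m, long npages)
{
    if (npages < 0 || npages > m->availsmem)
        return -1;
    m->availsmem -= npages;
    return 0;
}

/* Give a reservation back, e.g. when a process exits. */
static void release_pages(struct swap_model *m, long npages)
{
    m->availsmem += npages;
}
```

With a 1000-page device, reserving 600 pages succeeds and a second
600-page reservation fails, even though in_use is still 0.  That is
the situation where /etc/swap reports thousands of free blocks and
fork fails anyway.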

The ONLY solution is to increase swap, either by enlarging the swap
partition or by adding swap with the
	/etc/swap -a ...
command.  That or reduce process load.

If you checked, each NOTICE: logged by the kernel resulted in a
failed syscall and a return of ENOSPC or EAGAIN or ENOMEM or something
that means "out of resources, try later".  The problem, as you note,
is prevalent under heavy load and lots of paging.

Changing your paging parameters and making the paging daemon more
aggressive might help.  But ultimately you need more swap or more
real memory.
-- 
=Dennis L. Mumaugh
 Lisle, IL  ...!{att,lll-crg,attunix}!cuuxb!dlm  OR dlm@cuuxb.att.com