dlm@cuuxb.ATT.COM (Dennis L. Mumaugh) (10/17/89)
Ordinarily I don't answer questions like this as I work for support and customers pay money for answers, but .... this is a System V error and generally true for all System V 3.xxx machines. Skip to end for concise answer rather than a blow by blow analysis.

In article <506@oglvee.UUCP> jr@oglvee.UUCP (Jim Rosenberg) writes:
> We have 4M RAM and before the upgrade the machine just screamed. Now we are paging like mad and getting sporadic fork failures. The system error reporting is filled with messages like this:
>
> 000146 07:50:06 00e6f0f6 ... 0000 00 NOTICE: getcpages - waiting for 1 contiguous pages
> 000147 08:13:16 00e80082 ... 0000 00
> 000148 08:13:16 00e80082 ... 0000 00 NOTICE: getcpages - Insufficient memory to allocate 1 contiguous page - system call failed
>
> In many cases I can exactly correlate one of these "system call failed" messages with a fork failure.

Usually true. Sometimes it will be an exec and rarely a stack growth.

> According to the man page for fork(2) there are 3 ways a fork can fail: No process table slots left, exceeding the per-user limit, and a most obscure indeed 3rd one: "Total amount of system memory available when reading via raw IO is temporarily insufficient". Either the man page lies or this third one is it.

In a sense.

> I took a blind stab and guessed that the parameter involved here is PBUF. Altos recommends PBUF=8 straight across the board no matter how much memory you have. Sounds pretty odd to me, since on a 6386 running V.3.2 with 2 Meg RAM I've got 20, and never fiddled with it. I jacked up PBUF to 16 -- but it made no difference.

Sorry, wrong guess. Try /etc/swap -a ......

> So, my questions are: What the bleep is getcpages? It sounds like an internal kernel routine to get continuous pages in RAM.

Ordinarily true. But when we need only one page we call it as it is fast.

> Is this call issued by the paging daemon?

Close, by a kernel routine looking for pages, such as grow, or dupproc.

> How could it fail on a request to get only 1 page unless I'm out of swap space?

How did you guess?

> (Which I'm not. We're getting these with many many thousand blocks of free swap space -- we have a swap(1) which will show these.)

Not true! /etc/swap only shows actual use of swap not committed use of swap. Similarly for sar reports.

> Is there a tunable parameter that will rescue me here? Altos seems to think that a failed fork should only get a "NOTICE". Yeah, well, I notice all right. It's bad enough when the shell reports "No more processes" -- you just try again and it works. But we have all kinds of batch jobs that spawn uux requests and other such things and they're just getting shot right out of the sky.

True, some code isn't very robust and ought to sleep and wait for less load, but people who do forks don't examine error codes, nor do people who do execs. fork and exec will return either ENOSPC or EAGAIN if you would check errno.

> Any words of wisdom gratefully accepted! I skimmed over the likeliest parts of Bach to see if the light would dawn -- looks like I better go back and reread the section on demand paging pretty carefully.

Answer:

When a process execs or forks, the kernel must ensure there is enough space on the paging device to hold all of the memory owned by the process. Since all of the data and bss (and depending on the type of program even the text) can be written and then paged out, we must make sure that there is enough swap space for all of this. Hence we have a kernel variable called availsmem (available swap memory) that holds how much swap memory is uncommitted. The kernel uses that and does not check swap for size.

Needless to say, the kernel is pessimistic and expects all pages to be dirtied and thus assumes each and every page of a fork will be touched. Your swap device isn't big enough to hold all of the programs' memory were all to be swapped out.
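
For the batch jobs that get shot out of the sky, the "sleep and wait for less load" advice above might be sketched as follows. This is only an illustration, not anything shipped with System V: fork_with_retry and fork_error_is_transient are hypothetical names, and the retry count and the doubling delay are assumptions to tune for your own load. The set of errno values treated as "try later" is the one named in this article (EAGAIN, ENOMEM, ENOSPC).

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: nonzero if err means "out of resources, try later". */
int fork_error_is_transient(int err)
{
    return err == EAGAIN || err == ENOMEM || err == ENOSPC;
}

/*
 * Hypothetical helper: call fork(), and on a transient failure sleep and
 * try again, doubling the delay each time.  Returns what fork() returns:
 * 0 in the child, the child's pid in the parent, or -1 with errno set by
 * the last failed fork() once the tries are used up.
 */
pid_t fork_with_retry(int max_tries, unsigned int first_delay_secs)
{
    unsigned int delay = first_delay_secs;
    pid_t pid = (pid_t)-1;
    int attempt;
    int saved;

    for (attempt = 1; attempt <= max_tries; attempt++) {
        pid = fork();
        if (pid >= 0)
            return pid;            /* success, in parent or child */

        /* Give up on hard errors, or when this was the last try. */
        if (!fork_error_is_transient(errno) || attempt == max_tries)
            break;

        saved = errno;             /* fprintf/sleep may clobber errno */
        fprintf(stderr, "fork failed (%s), retrying in %u sec\n",
                strerror(saved), delay);
        sleep(delay);
        delay *= 2;
    }
    return (pid_t)-1;
}
```

A caller would use it where it now calls fork(), for example pid = fork_with_retry(5, 1), and still has to handle the final -1. Retrying only papers over short spikes in load; it does nothing if swap is simply too small for the normal process mix.
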

The ONLY solution is to increase swap, either by enlarging a partition or by adding swap with the /etc/swap -a ... command. That, or reduce process load.

If you checked, each of the NOTICE: messages logged by the kernel resulted in a failed syscall and a return of ENOSPC or EAGAIN or ENOMEM or something that means "out of resources, try later".

The problem, as you note, is prevalent under heavy load and lots of paging. Changing your paging parameters and making the paging daemon more aggressive might help. But ultimately you need more swap or more real memory.

-- 
=Dennis L. Mumaugh
 Lisle, IL       ...!{att,lll-crg,attunix}!cuuxb!dlm  OR dlm@cuuxb.att.com