jr@oglvee.UUCP (Jim Rosenberg) (10/15/89)
We just recently "upgraded" [sic] an Altos 2000 from Xenix 5.2c to UNIX 5.3d.
uname reports the operating system as 5.3.1.  We have 4M RAM and before the
upgrade the machine just screamed.  Now we are paging like mad and getting
sporadic fork failures.  The increased paging activity has my users bitching
and moaning, but the fork failures are like a sniper loose in my system
gunning down processes sporadically.

The problem is surely *not* insufficient process table slots.  crash(1)
reports we have 180 slots (NPROC is 0 in the tuning parameter file, which on
this system is called /usr/sys/master.d/kernel) and we've got nowhere within
a country mile of that many processes.  The per-user limit is 30, and we're
getting fork failures where that's not exceeded either.  The system error
reporting is filled with messages like this:

000146 07:50:06 00e6f0f6 ... 0000 00 NOTICE: getcpages - waiting for 1 contiguous pages
000147 08:13:16 00e80082 ... 0000 00
000148 08:13:16 00e80082 ... 0000 00 NOTICE: getcpages - Insufficient memory to allocate
                                             1 contiguous page - system call failed
                                                                 ^^^^^^^^^^^^^^^^^^

In many cases I can exactly correlate one of these "system call failed"
messages with a fork failure.  According to the man page for fork(2) there
are 3 ways a fork can fail:  no process table slots left, exceeding the
per-user limit, and a most obscure indeed 3rd one:  "Total amount of system
memory available when reading via raw IO is temporarily insufficient".
Either the man page lies or this third one is it.  I took a blind stab and
guessed that the parameter involved here is PBUF.  Altos recommends PBUF=8
straight across the board no matter how much memory you have.  Sounds pretty
odd to me, since on a 6386 running V.3.2 with 2 Meg RAM I've got 20, and
never fiddled with it.  I jacked up PBUF to 16 -- but it made no difference.

So, my questions are:

What the bleep is getcpages?  It sounds like an internal kernel routine to
get contiguous pages in RAM.  Is this call issued by the paging daemon?  How
could it fail on a request to get only 1 page unless I'm out of swap space?
(Which I'm not.  We're getting these with many many thousand blocks of free
swap space -- we have a swap(1) which will show these.)

Is there a tunable parameter that will rescue me here?

Altos seems to think that a failed fork should only get a "NOTICE".  Yeah,
well, I notice all right.  It's bad enough when the shell reports "No more
processes" -- you just try again and it works.  But we have all kinds of
batch jobs that spawn uux requests and other such things and they're just
getting shot right out of the sky.

Any words of wisdom gratefully accepted!  I skimmed over the likeliest parts
of Bach to see if the light would dawn -- looks like I better go back and
reread the section on demand paging pretty carefully.
-- 
Jim Rosenberg
      pitt                      Oglevee Computer Systems
 >--!amanue!oglvee!jr           151 Oglevee Lane
      cgh                       Connellsville, PA 15425
                                #include <disclaimer.h>
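A practical stopgap for batch jobs that spawn uux requests is to wrap fork()
in a retry loop instead of dying on the first transient failure.  The sketch
below is illustrative only -- the function name, retry count, and delay are
made up, and it assumes the shortage shows up in errno as EAGAIN or ENOMEM,
as discussed later in this thread:

    /* Illustrative only: retry fork() when it fails with a transient
     * resource shortage.  Nothing here is from Altos or AT&T source. */
    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    pid_t
    forgiving_fork(int tries, unsigned int delay)
    {
        pid_t pid;

        while (tries-- > 0) {
            pid = fork();
            if (pid != -1)
                return pid;         /* 0 in the child, child's pid in the parent */
            if (errno != EAGAIN && errno != ENOMEM)
                break;              /* a real error -- don't bother retrying */
            sleep(delay);           /* resource shortage -- back off and retry */
        }
        return -1;                  /* still failing after all the retries */
    }

A caller would use something like forgiving_fork(5, 10) in place of fork()
and still check for -1.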
larry@hcr.UUCP (Larry Philps) (10/16/89)
In article <506@oglvee.UUCP> jr@oglvee.UUCP writes:
>We just recently "upgraded" [sic] an Altos 2000 from Xenix 5.2c to UNIX 5.3d.
>uname reports the operating system as 5.3.1.  We have 4M RAM and before the
>upgrade the machine just screamed.  Now we are paging like mad and getting
>sporadic fork failures.  The increased paging activity has my users bitching
>and moaning, but the fork failures are like a sniper loose in my system gunning
>down processes sporadically.
 ...
>
>   NOTICE: getcpages - waiting for 1 contiguous pages
>   NOTICE: getcpages - Insufficient memory to allocate 1 contiguous page - system call failed

I had this exact problem after I ported System 5.3.1 to a VAX (no comments
from the cheap seats please!).  Not very surprisingly, Release 3 requires
significantly more memory to run than Release 2.  So, like us, you were
probably running your system close to, but not over, the memory thrash
point.  After upgrading, the extra OS memory requirement pushed you over the
limit and paging death occurred.

Getcpages is indeed get "contiguous" physical pages.  There are parts of the
paging system on some processors that require this.  The complaint about a
failure on 1 page simply means that ALL RAM was being used when the fork
appeared and the system needed a page to hold page tables or the like.

Now, for some reason unknown to me, in fork (procdup actually), dupreg is
called with arguments that specify that it is not to sleep.  I couldn't come
up with any sensible reason why this had to be, so I changed the call to
allow sleeps.  The fork failure problems simply went away, and no other
problems manifested.

So, what you should do is bitch to your supplier, and if you are very
daring, binary patch your kernel and change the second argument to "dupreg",
as called from "procdup" (there is only one such call), to be a 0 rather
than a 1, and try things out.  It works on a VAX; you can give it a try on
your Altos if you are brave.

By the way, this did NOT solve our thrashing problems; we only got rid of
that by buying more memory for the machine.  Bet you are really happy to
hear that!
----
Larry Philps
HCR Corporation
130 Bloor St. West, 10th floor
Toronto, Ontario.  M5S 1N5
(416) 922-1937
{utzoo,utcsri}!hcr!larry
dlm@cuuxb.ATT.COM (Dennis L. Mumaugh) (10/17/89)
Ordinarily I don't answer questions like this as I work for support and
customers pay money for answers, but ....  This is a System V error and
generally true for all System V 3.xxx machines.  Skip to the end for the
concise answer rather than a blow by blow analysis.

In article <506@oglvee.UUCP> jr@oglvee.UUCP (Jim Rosenberg) writes:

    We have 4M RAM and before the upgrade the machine just screamed.  Now we
    are paging like mad and getting sporadic fork failures.  The system
    error reporting is filled with messages like this:

    000146 07:50:06 00e6f0f6 ... 0000 00 NOTICE: getcpages - waiting for 1 contiguous pages
    000147 08:13:16 00e80082 ... 0000 00
    000148 08:13:16 00e80082 ... 0000 00 NOTICE: getcpages - Insufficient memory to allocate 1 contiguous page - system call failed

    In many cases I can exactly correlate one of these "system call failed"
    messages with a fork failure.

Usually true.  Sometimes it will be an exec and rarely a stack growth.

    According to the man page for fork(2) there are 3 ways a fork can fail:
    No process table slots left, exceeding the per-user limit, and a most
    obscure indeed 3rd one:  "Total amount of system memory available when
    reading via raw IO is temporarily insufficient".  Either the man page
    lies or this third one is it.

In a sense.

    I took a blind stab and guessed that the parameter involved here is
    PBUF.  Altos recommends PBUF=8 straight across the board no matter how
    much memory you have.  Sounds pretty odd to me, since on a 6386 running
    V.3.2 with 2 Meg RAM I've got 20, and never fiddled with it.  I jacked
    up PBUF to 16 -- but it made no difference.

Sorry, wrong guess.  Try /etc/swap -a ......

    So, my questions are:  What the bleep is getcpages?  It sounds like an
    internal kernel routine to get contiguous pages in RAM.

Ordinarily true.  But when we need only one page we call it as it is fast.

    Is this call issued by the paging daemon?

Close; by a kernel routine looking for pages, such as grow, or procdup.

    How could it fail on a request to get only 1 page unless I'm out of swap
    space?

How did you guess?

    (Which I'm not.  We're getting these with many many thousand blocks of
    free swap space -- we have a swap(1) which will show these.)

Not true!  /etc/swap only shows actual use of swap, not committed use of
swap.  Similarly for sar reports.

    Is there a tunable parameter that will rescue me here?  Altos seems to
    think that a failed fork should only get a "NOTICE".  Yeah, well, I
    notice all right.  It's bad enough when the shell reports "No more
    processes" -- you just try again and it works.  But we have all kinds of
    batch jobs that spawn uux requests and other such things and they're
    just getting shot right out of the sky.

True, some code isn't very robust and ought to sleep and wait for less load,
but people who do forks don't examine error codes, nor do people who do
execs.  fork and exec will return either ENOSPC or EAGAIN if you would check
errno.

    Any words of wisdom gratefully accepted!  I skimmed over the likeliest
    parts of Bach to see if the light would dawn -- looks like I better go
    back and reread the section on demand paging pretty carefully.

Answer:  When a process execs or forks, the kernel must ensure there is
enough space on the paging device to hold all of the memory owned by the
process.  Since all of the data and bss (and depending on the type of
program even the text) can be written and then paged out, we must make sure
that there is enough swap space for all of this.

Hence we have a kernel variable called availsmem (available swap memory)
that holds how much swap memory is uncommitted.  The kernel uses that and
does not check swap for size.  Needless to say, the kernel is pessimistic
and expects all pages to be dirtied and thus assumes each and every page of
a fork will be touched.

Your swap device isn't big enough to hold all of the programs' memory were
all to be swapped out.  The ONLY solution is to increase swap by either
increasing a partition or by adding swap with the

    /etc/swap -a ...

command.  That or reduce process load.

If you checked, each of the NOTICEs logged by the kernel resulted in a
failed syscall and a return of ENOSPC or EAGAIN or ENOMEM or something that
means "out of resources, try later".  The problem, as you note, is prevalent
under heavy load and lots of paging.  Changing your paging parameters and
making the paging daemon more aggressive might help.  But ultimately you
need more swap or more real memory.
-- 
=Dennis L. Mumaugh
 Lisle, IL       ...!{att,lll-crg,attunix}!cuuxb!dlm  OR dlm@cuuxb.att.com
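The availsmem accounting described above can be made concrete with a toy
user-level model.  Only the name availsmem comes from the kernel; the sizes,
the helper names, and the page arithmetic below are invented purely to show
the shape of the pessimistic commitment check, not actual kernel code:

    /* Toy model of SVR3-style swap commitment accounting -- not kernel code. */
    #include <stdio.h>

    static long availsmem = 2048;   /* uncommitted swap/memory pages (toy value) */

    /* At fork/exec time every page the child could conceivably dirty is
     * charged against availsmem, even if it never reaches the swap device. */
    static int commit(long npages)
    {
        if (availsmem < npages)
            return -1;              /* the real kernel fails the call here (EAGAIN) */
        availsmem -= npages;
        return 0;
    }

    static void release(long npages) { availsmem += npages; }  /* on exit/exec */

    int main(void)
    {
        long child = 1536;          /* a 6 MB process, counted in 4 KB pages */
        int  ok = commit(child);

        printf("fork %s; availsmem now %ld pages\n",
               ok == 0 ? "succeeds" : "fails", availsmem);
        release(child);
        return 0;
    }

This is why swap(1) and sar can show the swap device nearly empty while the
kernel still refuses to fork: what has run out is the commitment, not the
occupancy.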
nvk@ddsw1.MCS.COM (Norman Kohn) (10/18/89)
In article <4219@cuuxb.ATT.COM> dlm@cuuxb.UUCP (Dennis L. Mumaugh) writes:
...
>The ONLY solution is to increase swap by either increasing a partition
>or by adding swap with the
>        /etc/swap -a ...
>command.

I used to have the line

        swap -a /dev/dsk/1s2 0 16592

in /etc/rc2.d/S10swap, for automatic installation of extra swap space on my
second drive.  This seemed to work ok... but if a large enough process was
swapped out to actually make reference to it, I would start to see garbage
in the printout of ps -ef even though the processes in question worked fine.
This is with uport 386; unfortunately, I cannot recall whether the release
was 3.0e or its predecessor.

I concluded that, at least in microport's implementation, swap -a is unhappy
when called with space on another drive than the primary one.  I finally
broke down and remade my disk partitions.
-- 
Norman Kohn             | ...ddsw1!nvk
Chicago, Il.            | days/ans svc: (312) 650-6840
                        | eves: (312) 373-0564
scott@altos86.Altos.COM (Scott A. Rotondo) (10/19/89)
A recent posting by Jim Rosenberg of Oglevee Computer Systems reports that
fork() exhibits intermittent failures on an Altos 2000 with 4 MB of RAM.
Each fork() failure is accompanied by a message from getcpages() reporting
that memory is unavailable.

Fork() can fail when getcpages() finds insufficient physical memory
available to allocate a u-area for the new process.  In this case, fork()
returns -1 and sets errno to EAGAIN.  This is the error condition mentioned
on the man page, although the reference to raw I/O is incorrect.  The NPBUF
kernel parameter controls the number of scatter/gather address lists
available for raw I/O, but it does not affect this fork() failure except by
reducing slightly the total amount of memory available for user processes.

Memory requirements for processes are somewhat different between UNIX and
Xenix.  I am sending (via US mail) an Altos document that outlines
recommended memory sizes as a function of system load.  It also describes
tuning kernel parameters for maximum performance, based on activity reports
generated by the sar utility.  In this particular case, the best thing to do
is probably to add more memory or allocate less space for structures like
the buffer pool.  The latter can be accomplished by tuning the kernel
parameter NBUF.

A more reliable way to raise issues like this one is to send mail to
issues@Altos.COM or {sun,apple,pyramid}!altos86!issues.  I happened to see
this problem on the net, but mail to the above address is guaranteed to
reach Altos engineering.

						Scott Rotondo
						Project Leader, UNIX OS Development
-- 
===============================================================================
Scott A. Rotondo, Altos Computer Systems                        (408) 946-6700
{sun|pyramid|uunet}!altos86!scott                               scott@Altos.COM
jerry@altos86.Altos.COM (Jerry Gardner) (10/19/89)
In article <506@oglvee.UUCP> jr@oglvee.UUCP (Jim Rosenberg) writes:
>
>What the bleep is getcpages?  It sounds like an internal kernel routine to get
>contiguous pages in RAM.  Is this call issued by the paging daemon?  How could
>it fail on a request to get only 1 page unless I'm out of swap space?  (Which
>I'm not.  We're getting these with many many thousand blocks of free swap
>space -- we have a swap(1) which will show these.)
>

Getcpages is an internal kernel routine that allocates contiguous pages of
kernel virtual memory.  It's not called by the paging daemon, but rather to
allocate or grow regions, among other things.

>Is there a tunable parameter that will rescue me here?
>

Not really.  You really are running out of swap space.  Even though
"swap -l" may show plenty of swap space remaining, it is misleading.

UNIX allocates swap space for the entire .data and .bss regions whenever a
process is exec'ed.  Even though swap -l shows plenty of swap space
available, most of the swap space is allocated to processes, which, although
they may not currently be swapped out, still tie up the swap space.

Your best solution: get more RAM.  The 2000 in my office that I use as a
single-user personal machine has 24MB.  If you can't get more RAM, you could
try a larger swap partition, but if your system is heavily loaded, it'll
just thrash, constantly paging and swapping things in and out.
-- 
Jerry Gardner, NJ6A                                     Altos Computer Systems
UUCP: {sun|pyramid|sco|amdahl|uunet}!altos86!jerry      2641 Orchard Parkway
Internet: jerry@altos.com                               San Jose, CA  95134
I survived the Big One, October 17, 1989                946-6700
jr@oglvee.UUCP (Jim Rosenberg) (10/19/89)
In article <2296@hcr.UUCP> larry@zeus.UUCP (Larry Philps) writes:
>Getcpages is indeed get "contiguous" physical pages.  There are parts of the
>paging system on some processors that require this.  The complaint about a
>failure on 1 page simply means that ALL RAM was being used when the fork
>appeared and the system needed a page to hold page tables or the like.
>
>Now, for some reason unknown to me, in fork (procdup actually), dupreg is
>called with arguments that specify that it is not to sleep.  I couldn't come
>up with any sensible reason why this had to be, so I changed the call to
>allow sleeps.  The fork failure problems simply went away, and no other
>problems manifested.

OK, kernel gurus, what's the word:  *is there* a good reason why the call to
dupreg shouldn't sleep???

We are also running V.3.2 on a bunch of AT&T 6386en.  Those machines have
only 2M RAM.  I know damn well that we're just on the borderline of what's
doable with that little memory -- it's a budget issue, not a technical
issue.  Although I do often suffer from the overhead of paging, I've *NEVER*
seen a fork failure on these machines.  Admittedly this is V.3.2 and not
V.3.1.  But I wonder if AT&T did go ahead and change the dupreg call to
allow a sleep.  Can someone from AT&T comment?

I must say this, though:  while I've never seen an identifiable fork failure
on one of the 6386en, I *have* seen a phenomenon which I call Kernel
Narcolepsy:  the whole system just seems to fall asleep now and then.  I had
one machine a couple of months ago that had an extremely sick disk.  To make
sure another machine didn't have the problem I intentionally loaded it with
enough continuous compiles of our database language (Progress) to cause
solid thrashing.  Every now and then the thrashing would just stop.  After
about 5 minutes it would pick up again.  I don't know for a fact that it was
really sleeping:  it could have been a kind of beat frequency where the
processes just happened to hit on the same pages.  But I did suffer one
definite case where the whole system went to sleep and even though
characters would echo I could get no response from any getty and the system
was definitely just plain stuck.  This took a full reboot, fsck found minor
damage, etc. etc.

So I guess the question is this:  if the dupreg call from fork allows
sleeps, could this lead to a deadlock?  Is it possible I may be seeing this
on V.3.2?

If the dupreg call *can be* safely changed to allow sleeping, then my Altos
problem is a flat out case of a bug in their System V.3.1.  If it *can't*
safely be changed, then as I understand the situation V.3 DOES NOT RELIABLY
IMPLEMENT VIRTUAL MEMORY!!  Is it not true that pages are freed by an
asynchronous kernel process?  Is it not true that, given the indeterminate
way things work in UNIX, one cannot absolutely guarantee when this process
will run?  If you can't allow a sleep from fork in dupreg, then the only way
of guaranteeing that fork won't fail is to guarantee that you don't page --
i.e. if you page, you run a certain risk that forks will fail no matter how
much swap space you have.  I.e. don't really exercise virtual memory.  I.e.
V.3 virtual memory is NOT RELIABLE because if you use it you may trigger
fork failures.

Please tell me it ain't so!!!!!
-- 
Jim Rosenberg
      pitt                      Oglevee Computer Systems
 >--!amanue!oglvee!jr           151 Oglevee Lane
      cgh                       Connellsville, PA 15425
                                #include <disclaimer.h>
jr@oglvee.UUCP (Jim Rosenberg) (10/20/89)
In article <4219@cuuxb.ATT.COM> dlm@cuuxb.UUCP (Dennis L. Mumaugh) writes:
>Ordinarily I don't answer questions like this as I work for
>support and customers pay money for answers, but ....

Thank you for going above and beyond the call of duty.  Since I have an
unreliable operating system for which we paid real money, it's a comfort to
know we don't have to pay more real money to find out how to get relief from
the defects in what we already paid our money for.

>In article <506@oglvee.UUCP> jr@oglvee.UUCP (Jim Rosenberg) writes:
>
>    What the bleep is getcpages?
>
>    [...]
>
>    How could it fail on a request to get only 1 page unless
>    I'm out of swap space?
>
>How did you guess?

Are you *ABSOLUTELY* sure this is the only way getcpages can fail???  I
already have one response to the contrary.

>    (Which I'm not.  We're getting these with many many
>    thousand blocks of free swap space -- we have a swap(1)
>    which will show these.)
>
>Not true!  /etc/swap only shows actual use of swap, not committed use
>of swap.  Similarly for sar reports.

OK, you can tell me all you like that swap is broken and is lying to me and
that sar is broken and is lying to me (these are *my* fault???) and that I
really really am out of swap space, but frankly I just don't believe this.
I *DID* add a new swap partition with swap -a (*before* posting the original
article, as a matter of fact.)  The system is clearly using it.  I got one
fork failure with no interactive users logged in -- we had 4 database
servers up and one client batch job, which had three or four child UNIX
processes -- enough to page a bit perhaps but nowhere *NEAR ENOUGH* loading
to exhaust 24,000 blocks of swap space.  If my swap space runs out with lots
of users then I can deal with that, but if that were my problem then the
whole system would come crashing to its knees many times a day.  I'm sorry,
but I just don't believe you're right that every fork failure happens
because I truly am out of swap space.

>True, some code isn't very robust and ought to sleep and wait for
>less load, but people who do forks don't examine error codes, nor
>do people who do execs.  fork and exec will return either ENOSPC or
>EAGAIN if you would check errno.
           ^^^
If **WHO** would check errno???  I beg your pardon?  I am supposed to dig
into cron with a can opener (we are a binary licensee, not source!) and
somehow "check" errno?  When I get a fork failure from a fork issued by cron
it cutely logs the fact that fork failed, and that it is "rescheduling".
Right.  It then just falls asleep and no more cron jobs run.  When csh gets
the fork failure it simply reports "No more processes".  Um, just what would
you like me to check here?  It's *you folks at AT&T* who should check errno,
don't you think?
-- 
Jim Rosenberg
      pitt                      Oglevee Computer Systems
 >--!amanue!oglvee!jr           151 Oglevee Lane
      cgh                       Connellsville, PA 15425
                                #include <disclaimer.h>
allbery@NCoast.ORG (Brandon S. Allbery) (10/20/89)
As quoted from <506@oglvee.UUCP> by jr@oglvee.UUCP (Jim Rosenberg):
+---------------
| We just recently "upgraded" [sic] an Altos 2000 from Xenix 5.2c to UNIX 5.3d.
| per-user limit is 30, and we're getting fork failures where that's not exceeded
| either.  The system error reporting is filled with messages like this:
+---------------

Ah, so someone else *is* getting those little buggers.

As far as I can tell, "fork failed"s happen when memory is mostly full and
something wants to fork and for some stupid reason Altos 5.3[a-d][DT][0-9]
doesn't want to page anything out to make more room in core even though it
can do so.  I have some "sar" output that corroborates this, "fork failed"
happens when a process tries to fork and there are < 100 free 512-byte (I
think that's the units sar uses, I need to check) chunks of memory.

I plan to ram this down Altos T/S's collective throat, since they haven't
fixed it in 5.3dT1 and I reported it in 5.3bT1 (3 upgrades have gone by so
far...).

++Brandon, for this message speaking as the tech guru of telotech, inc.
-- 
Brandon S. Allbery, moderator of comp.sources.misc          allbery@NCoast.ORG
uunet!hal.cwru.edu!ncoast!allbery ncoast!allbery@hal.cwru.edu bsa@telotech.uucp
161-7070 (MCI), ALLBERY (Delphi), B.ALLBERY (GEnie), comp-sources-misc@backbone
[comp.sources.misc-related mail should go ONLY to comp-sources-misc@<backbone>]
*Third party vote-collection service: send mail to allbery@uunet.uu.net (ONLY)*
allbery@NCoast.ORG (Brandon S. Allbery) (10/20/89)
As quoted from <4219@cuuxb.ATT.COM> by dlm@cuuxb.ATT.COM (Dennis L. Mumaugh):
+---------------
| Ordinarily I don't answer questions like this as I work for
| support and customers pay money for answers, but .... this is a
| System V error and generally true for all System V 3.xxx
| machines.
+---------------

I hope the answers you give paying customers are a bit more on the mark.
(Disclaimer: I'm in this business too.  And I'm familiar with Altos's OS,
which you apparently aren't.)

+---------------
|     jacked up PBUF to 16 -- but it made no difference.
|
| Sorry, wrong guess.  Try /etc/swap -a ......
+---------------

Sorry, wrong guess.  I have repeated this bug with /etc/swap -l showing 100
out of 25000 swap blocks in use.  Somehow, that doesn't look like "out of
swap" to me.

The point is that Altos 5.3.1 (and, as it turns out, AT&T System V Release
3.1) has a stupid bug that makes the kernel not want to page stuff out when
a fork() comes up with insufficient free memory in core.  I claimed this to
Altos a year ago; I now have confirmation.

PLEASE check your facts before posting.  I at least had the sense to pull
out every tool I possess, from sar to /etc/swap -l to sysmon to some custom
jobs I whipped up (years of practice babysitting a SVR2 system) before
claiming a kernel problem.

++Brandon
-- 
Brandon S. Allbery, moderator of comp.sources.misc          allbery@NCoast.ORG
uunet!hal.cwru.edu!ncoast!allbery ncoast!allbery@hal.cwru.edu bsa@telotech.uucp
161-7070 (MCI), ALLBERY (Delphi), B.ALLBERY (GEnie), comp-sources-misc@backbone
[comp.sources.misc-related mail should go ONLY to comp-sources-misc@<backbone>]
*Third party vote-collection service: send mail to allbery@uunet.uu.net (ONLY)*
roe@sobmips.UUCP (r.peterson) (10/21/89)
From article <3684@altos86.Altos.COM>, by jerry@altos86.Altos.COM (Jerry Gardner):
> In article <506@oglvee.UUCP> jr@oglvee.UUCP (Jim Rosenberg) writes:
>
> Getcpages is an internal kernel routine that allocates contiguous pages of
> kernel virtual memory.  It's not called by the paging daemon, but rather to
> allocate or grow regions, among other things.
>
>>Is there a tunable parameter that will rescue me here?
>>
>
> Not really.  You really are running out of swap space.  Even though
> "swap -l" may show plenty of swap space remaining, it is misleading.
>
> UNIX allocates swap space for the entire .data and .bss regions whenever
> a process is exec'ed.  Even though swap -l shows plenty of swap space
> available, most of the swap space is allocated to processes, which, although
> they may not currently be swapped out, still tie up the swap space.
>
> Your best solution: get more RAM.

Not true.  While unix does CHECK FOR available swap space for the entire
.data and .bss regions whilst fork()ing, it does NOT necessarily USE that
space.

Simply increasing available swap space will solve the problem.  You will not
begin to thrash, since demand paging is still in effect.  The BSD file
systems simply want to know that, if necessary, there is enough swap space
to swap.

We saw the same problem on our MIPS systems, and adding another swap
partition solved the problem - with no noticeable (sar, vsar, etc)
performance degradation.
-- 
Roe Peterson
{attcan,mcgill-vision,telly}!sobeco!roe
dwc@cbnewsh.ATT.COM (Malaclypse the Elder) (10/21/89)
i originally sent a reply to the poster of the question stating that the
reason that getcpages is failing trying to get 1 contiguous page is that
there is probably no free memory for a page table.

it's been a while since i looked at the problem but i seem to remember that
the reason getcpages() can fail without sleeping is to prevent deadlock-type
situations.  on release 3, there are certain process data structures that
are not swapped out:  the ublock (depending on the version), the page tables
and DBDs, and maybe more.  well, you can get into a situation of deadlock in
which all memory is committed to these data structures and no process can
continue because they are all both holding memory and waiting for more.
allowing the sleep to happen is okay if you make the sleep interruptible.
then at least the user can attempt to abort his program voluntarily (the
problem is determining when you are in this deadlock situation... you can't
run user level programs to tell you this).

my solution to this was really very simple.  at fork time, the parent knows
how much memory it will take to create this process (ublock, page tables,
dbds, etc.).  with this information, the parent can check the freemem level,
reserve the necessary amount of memory to satisfy the fork, and sleep until
that amount of memory is available.  this sleep is safe since no resources
have been committed to the child yet (the child doesn't even exist).  we
prototyped this for release 3 and it was going to go into some future
release when they decided to use sun's VM architecture instead of regions.
i suspect that release 4 will have a similar problem but i'm not sure.

if you don't have source to modify, i suggested to the original requestor
that he set a very high value for GETPGSLO and GETPGSHI.  this will make the
paging daemon very active and MAY prevent you from hitting situations where
freemem goes to zero.  it's not guaranteed since the demand for freemem is
VERY bursty and the reaction time of vhand is fairly slow.

danny chen
att!hocus!dwc
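The reservation idea described above can be sketched in a toy, user-level
style.  freemem is a real release 3 kernel variable, but every other name
and value below is invented; this is a sketch of the approach, not the AT&T
prototype:

    /* Sketch of the reserve-before-fork idea described above.  freemem is a
     * real kernel variable; every other name and value here is invented. */
    #include <stdio.h>
    #include <unistd.h>

    static long freemem = 64;       /* free physical pages (toy value) */

    #define UBLOCK_PAGES 2          /* made-up fixed costs for the new child: */
    #define PT_PAGES     2          /*   u-block, page tables, DBDs           */
    #define DBD_PAGES    1

    /* Parent side of fork: wait until the child's unswappable overhead can
     * be satisfied, then reserve it.  Sleeping here is safe because nothing
     * has been committed to the child yet -- the child does not exist. */
    static void reserve_child_overhead(void)
    {
        long need = UBLOCK_PAGES + PT_PAGES + DBD_PAGES;

        while (freemem < need)
            sleep(1);               /* in the kernel: an interruptible sleep on
                                     * a memory-wanted event until vhand frees
                                     * enough pages */
        freemem -= need;            /* commit the pages to the child */
    }

    int main(void)
    {
        reserve_child_overhead();
        printf("child overhead reserved; freemem now %ld pages\n", freemem);
        return 0;
    }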
ka@cs.washington.edu (Kenneth Almquist) (10/23/89)
jerry@altos86.Altos.COM (Jerry Gardner) writes:
> You really are running out of swap space.  Even though
> "swap -l" may show plenty of swap space remaining, it is misleading.
>
> UNIX allocates swap space for the entire .data and .bss regions whenever
> a process is exec'ed.  Even though swap -l shows plenty of swap space
> available, most of the swap space is allocated to processes, which, although
> they may not currently be swapped out, still tie up the swap space.

Are you sure about this?  This summer I was working on a project that was
running System V release 3.2 on machines with 10 megabytes of swap and 8
megabytes of memory.  We were getting messages in the error log about
getcpages failing and forks failing, so I looked at the code and some sar
output, and concluded:

  - We were running out of virtual address space and doing a lot of paging.

  - The total amount of virtual memory that the system will allocate to
    processes is bounded by the *sum* of the physical memory and the swap
    space.  (This makes sense because a page can either be in physical
    memory or on the swap device; there is no need for it to be both
    places.)

So I had the amount of memory increased to 16 megabytes, and everything
worked fine.  Of course this might not have worked with release 3.1.

> Your best solution: get more RAM.  The 2000 in my office that I use as a
> single-user personal machine has 24MB.  If you can't get more RAM, you could
> try a larger swap partition, but if your system is heavily loaded, it'll just
> thrash, constantly paging and swapping things in and out.

If the system is allocating swap space for pages that are in RAM, then
getting more RAM won't help the problem of running out of swap space.  But
if I read the code correctly, then release 3.2 will not allocate swap space
for pages in RAM, so adding more RAM will solve both the space problem and
the excessive paging rate.

All this assumes that the diagnosis of running out of swap space is correct.
I've never used "swap -l", but I've never had any reason to doubt the output
of sar.  On the other hand, I've never tried to tune a release 3.1 system.
If Jim happens to have an unused partition on his disk, he could easily see
whether adding more swap space makes the fork problem go away.
				Kenneth Almquist
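Taking Kenneth's figures at face value, and ignoring whatever the kernel
itself pins down, the bound he describes works out roughly as follows:

    before the upgrade:   8 MB RAM + 10 MB swap  =  ~18 MB of allocatable virtual memory
    after the upgrade:   16 MB RAM + 10 MB swap  =  ~26 MB of allocatable virtual memory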
jerry@altos86.Altos.COM (Jerry Gardner) (10/24/89)
In article <1989Oct20.020309.2081@NCoast.ORG> allbery@ncoast.ORG (Brandon S. Allbery) writes:
>
>As far as I can tell, "fork failed"s happen when memory is mostly full and
>something wants to fork and for some stupid reason Altos 5.3[a-d][DT][0-9]
>doesn't want to page anything out to make more room in core even though it can
>do so.  I have some "sar" output that corroborates this, "fork failed" happens
>when a process tries to fork and there are < 100 free 512-byte (I think that's
>the units sar uses, I need to check) chunks of memory.
>
>I plan to ram this down Altos T/S's collective throat, since they haven't
>fixed it in 5.3dT1 and I reported it in 5.3bT1 (3 upgrades have gone by so
>far...).
>

The fork() failures you are seeing occur when procdup() calls dupreg().
Dupreg() calls ptsalloc(), which eventually calls getcpages() to allocate
memory for page tables to map the new child process's u-area.  Apparently,
the kernel is paranoid in one place here and it calls ptsalloc() with a flag
that doesn't allow it to sleep.

The best way to make this problem go away is to get more RAM.  You'd be
amazed how little paging a 64MB 2000 will do.
-- 
Jerry Gardner, NJ6A                                     Altos Computer Systems
UUCP: {sun|pyramid|sco|amdahl|uunet}!altos86!jerry      2641 Orchard Parkway
Internet: jerry@altos.com                               San Jose, CA  95134
I survived the Big One, October 17, 1989                946-6700
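The mechanism described above -- an allocator called with a flag saying
whether the caller may sleep -- can be illustrated with a toy fragment.  All
of the names below are invented; the point is only the shape of the decision
getcpages() faces when it is told not to sleep:

    /* Toy illustration of a no-sleep vs. may-sleep page allocator -- not
     * Altos or AT&T code; all names are invented. */
    #include <stdio.h>

    #define NO_SLEEP   0
    #define CAN_SLEEP  1

    static long freepages = 0;      /* pretend free memory is exhausted */

    static int alloc_page(int sleepok)
    {
        while (freepages == 0) {
            if (sleepok == NO_SLEEP)
                return -1;          /* immediate failure; fork() turns this
                                     * into the EAGAIN the users are seeing */
            /* in the kernel: sleep until the page daemon frees something;
             * here we just pretend one page was freed while we waited.   */
            freepages = 1;
        }
        freepages--;
        return 0;
    }

    int main(void)
    {
        printf("no-sleep allocation %s\n",
               alloc_page(NO_SLEEP) == 0 ? "succeeded" : "failed (EAGAIN)");
        printf("may-sleep allocation %s\n",
               alloc_page(CAN_SLEEP) == 0 ? "succeeded" : "failed");
        return 0;
    }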
allbery@NCoast.ORG (Brandon S. Allbery) (10/25/89)
As quoted from <9556@june.cs.washington.edu> by ka@cs.washington.edu (Kenneth Almquist):
+---------------
| All this assumes that the diagnosis of running out of swap space is
| correct.  I've never used "swap -l", but I've never had any reason to
| doubt the output of sar.  On the other hand, I've never tried to tune
| a release 3.1 system.  If Jim happens to have an unused partition on
| his disk, he could easily see whether adding more swap space makes the
| fork problem go away.
+---------------

It doesn't.  I was running into this on a system rented to a client; I
doubled the swap space, but nothing changed.  (Yes, this is the system with
25776 blocks of swap... after the addition.  It was the first time I
encountered this problem.  I have since seen it on three other systems, one
of which is not currently expandable with more RAM (Series 2000; this is
changing, but the client in question cannot presently take advantage of
it).)

I still question the diagnosis, and continue to suspect that Altos 5.3.1
does not page when it needs space for a page table during a fork(), even
when it can do so.

It should be noted that I patched a Series 600 (the 1000's little brother)
Unix kernel as suggested earlier (although the flag was the reverse of that
specified; did the poster get it backwards or did Altos already do this?)
and am currently running tests on it.  Unfortunately, hitting the trigger
point on a 2MB system is a bit tricky, so I haven't yet reproduced the core
memory conditions which trigger the failure.

One more comment:  I have observed this on systems which are relatively
unloaded, which don't complain when much more heavily loaded.  Specifically,
a few occurrences on our office system, which is usually kept busy by two
people with a full complement of MultiView windows and a minimum of two
sessions not running under MultiView.  I have gotten "fork fail" when not at
the full complement of windows *if* the users start up processes in a
particular order, again arguing for an out-of-*core*-memory condition
causing the problem rather than an out-of-swap condition.

++Brandon
-- 
Brandon S. Allbery, moderator of comp.sources.misc          allbery@NCoast.ORG
uunet!hal.cwru.edu!ncoast!allbery ncoast!allbery@hal.cwru.edu bsa@telotech.uucp
161-7070 (MCI), ALLBERY (Delphi), B.ALLBERY (GEnie), comp-sources-misc@backbone
[comp.sources.misc-related mail should go ONLY to comp-sources-misc@<backbone>]
*Third party vote-collection service: send mail to allbery@uunet.uu.net (ONLY)*
jim@applix.UUCP (Jim Morton) (10/25/89)
In article <3696@altos86.Altos.COM>, jerry@altos86.Altos.COM writes:
> The best way to make this problem go away is to get more RAM.  You'd be
> amazed how little paging a 64MB 2000 will do.

C'mon Jerry - you left the smiley face off the end of that line!  I would
hope 64 megs solve that guy's problem!  Tell us, what does Altos charge for
64 megs of RAM?
--
Jim Morton, APPLiX Inc., Westboro, MA
...uunet!applix!jim     jim@applix.com
dwc@cbnewsh.ATT.COM (Malaclypse the Elder) (10/26/89)
In article <1989Oct25.010725.18353@NCoast.ORG>, allbery@NCoast.ORG (Brandon S. Allbery) writes:
>
> I still question the diagnosis, and continue to suspect that Altos 5.3.1 does
> not page when it needs space for a page table during a fork(), even when it
> can do so.
>

the people who believe that it is a swap space shortage are confusing this
implementation with BSD/SunOS, which requires that a page of swap be
allocated with each page that is faulted.  the system v release 3
implementation did not have this restriction.  but the system DOES page
whenever freemem drops below the low water mark, and it swaps when freemem
hits zero.  the question is whether the fork sleeps waiting for the memory,
and as i wrote in an earlier article, it does not, to avoid certain deadlock
conditions.

> It should be noted that I patched a Series 600 (the 1000's little brother)
> Unix kernel as suggested earlier (although the flag was the reverse of that
> specified; did the poster get it backwards or did Altos already do this?) and
> am currently running tests on it.  Unfortunately, hitting the trigger point on
> a 2MB system is a bit tricky, so I haven't yet reproduced the core memory
> conditions which trigger the failure.
>

for those playing around with paging experiments on release 3, there is a
tunable parameter, maxpmem, which allows one to configure the system with
LESS memory than there actually is.  i think it was put in just for doing
such experiments and never taken out.

danny chen
att!hocus!dwc