lkaplan@bbn.com (Larry Kaplan) (07/13/90)
In article <5855@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp
(Masataka Ohta) writes:
>In article <5DL4SPD@xds13.ferranti.com>
>	peter@ficc.ferranti.com (Peter da Silva) writes:
>
>>> 1) An utterly broken implementation where some important system
>>> process (such as inetd, ypbind or sendmail) may be killed if there
>>> is not enough swap space.
>
>>Alternatively, put the program in a wait state until swap space is available.
>>Deadlocks are possible, but unlikely. Indefinite deferment is more likely.
>
>No.
>
>Once swap space shortage occurs, it will tend to occur continually
>until some large process exits. So, if all such processes are put in
>wait states (which is very likely to occur, because an active process
>often requires new pages) the situation is a deadlock.

First, preallocation still has some problems with respect to dynamic
allocations. While deadlock doesn't occur, random processes can still die
(of their own doing) when some malloc call fails to reserve swap space.
This will not happen without preallocation, which gives you the opportunity
to get more work done and possibly never have trouble.

Next, there are actually ways to handle the deadlock. Note that Mach-based
implementations are willing to page on just about any vfs available. This
means that Mach will page to Unix filesystems or NFS filesystems if
desired. The kernel (or some appropriately "wired" daemon) could note that
the system was running out of paging space, and make arrangements either to
suspend memory-consuming processes or to mount more filesystems. Even if
the deadlock actually occurred, you could suspend all the processes waiting
for swap space, and then mount some reserve filesystem. Some care would
then be necessary to let the important jobs finish. It may be necessary to
continue jobs selectively instead of all at once, to prevent a repeat of
the deadlock.
Even if you need some more memory to get the mounting done, you could kill
some non-critical system daemon that could be restarted later (like lpd or
something). Later on, you could decide to restart the daemon killed
earlier, and/or unmount no-longer-used filesystems. Eventually, you could
return to normal operation. This is a little complicated but certainly
doable, and it allows you to not reserve swap space on memory allocation
and to use a true COW fork(). Even if such a daemon were not implemented,
some of this could actually be done by hand by an operator. People may
complain that this is not a truly general solution, and I would agree.
However, combined with the added flexibility of no preallocation, it seems
justifiable.

As a side note, on the large systems I work on, we don't do preallocation
and have never run out of paging (swap) space. This is not to say that we
never will, but typical systems have on the order of at least 10 times as
much disk storage as main memory. In some cases, as much as 100 times more.
Even if the user filesystems are full, some care is taken to leave some
other partitions available for paging. I claim it is hard to fill that much
disk space with paging and swapping traffic and still have a usable system.
You'll probably be thrashing to death long before that.

#include <std_disclaimer>
_______________________________________________________________________________
Laurence S. Kaplan		BBN Advanced Computers		lkaplan@bbn.com
10 Fawcett St.			(617) 873-2431			Cambridge, MA 02238
mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/16/90)
In article <58184@bbn.BBN.COM> lkaplan@BBN.COM (Larry Kaplan) writes:

>While deadlock doesn't occur, random processes can still
>die (of their own doing) when some malloc call fails to be able to reserve
>swap space.

A failing malloc is very different from being killed at random.

The situation is well under control. Important processes can be programmed
to try mallocing several times with exponential back-off. Even if a process
dies, it can clean up its environment.

>This will not happen without pre-allocation, and gives you the
>opportunity to get more work done and possibly never have trouble.

First, it is very probable (except for mmap) that a malloced area is
actually used. So, if there is not enough swap space, processes will almost
certainly be killed.

Without pre-allocation, a process will be killed without notice. There is
no opportunity for retry nor for graceful shutdown. It can be a great
trouble.

>Next, there are actually ways to handle the deadlock.

Yes, it is always possible to resolve deadlock by human intervention.

>Note that Mach based
>implementations are willing to page on just about any vfs available.
>This means that Mach will page to Unix filesystems or NFS filesystems if
>desired. The kernel (or some appropriately "wired" daemon) could note
>that the system was running out of paging space, and make arrangements to
>either suspend memory consuming processes or mount more filesystems.

Of course, it is not impossible; it is just next to impossible.

By the way, suspending memory-consuming processes is the worst thing to do.
The consumed memory will not be released until those processes are
reactivated, and the processes will not be reactivated until a large amount
of memory is released. The situation is partial deadlock, and meanwhile,
other processes will easily consume the rest of the memory, causing
system-wide deadlock.
>Even if the deadlock actually occurred, you could suspend all the processes
>waiting for swap space, and then mount some reserve filesystem.

It is very strange that you have a reserve filesystem available for
swapping. You should have already allocated such free space in advance as
swap area.

>Some care would then be necessary to let the important jobs finish. It may
>be necessary to continue jobs selectively instead of all at once, to
>prevent a repeat of the deadlock. Even if you need some more memory to get
>the mounting done, you could kill some non-critical system daemon that
>could be started later (like lpd or something). Later on, you could decide
>to restart the daemon killed earlier, and/or unmount no longer used
>filesystems. Eventually, you could return to normal operation.

Who takes care of all these things? Are you proposing to attach a
knowledgeable person all the time? A person who can understand what a
deadlock is seems to be very uncommon even in this newsgroup.

>This is a little complicated but certainly doable and allows you to not
>reserve swap space on memory allocation and to use a true COW fork().

But it is not worth doing so. Use vfork.

>People may complain that this is not a truly
>general solution, and I would agree.

Vfork is the true solution.

>However, combined with
>the added flexibility of no preallocation, it seems justifiable.

No. Vfork does no preallocation either.

>As a side note, on the large systems I work on, we don't do preallocation
>and have never run out of paging (swap) space. This is not to say that
>we never will, but typical systems have on the order of at least 10 times as
>much disk storage as main memory. In some cases, as much as 100 times more.

You have 100 times more swap space because you think it may be filled,
don't you?

>I claim it is hard to fill that much disk
>space with paging and swapping traffic and still have a usable system. You'll
>probably be thrashing to death long before that.
As you may know, programs manipulating large arrays, if written properly,
can use very large virtual space with little real memory and without
thrashing. That is why some of your systems are configured with 100 times
more swap space, isn't it?

						Masataka Ohta
lkaplan@bbn.com (Larry Kaplan) (07/17/90)
In article <5870@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp
(Masataka Ohta) writes:
>In article <58184@bbn.BBN.COM> lkaplan@BBN.COM
>	(Larry Kaplan) writes:
>
>>While deadlock doesn't occur, random processes can still
>>die (of their own doing) when some malloc call fails to be able to reserve
>>swap space.
>
>A failing malloc is very different from being killed at random.
>
>The situation is well under control. Important processes can be programmed
>to try mallocing several times with exponential back-off. Even if a process
>dies, it can clean up its environment.

This requires significant modification of programs. My proposals have not
required any change to applications, only additional system code. A general
solution does not require old applications to be rewritten.

>>Next, there are actually ways to handle the deadlock.
>
>Yes, it is always possible to resolve deadlock by human intervention.

Don't put words in my mouth. I say below that you can write a daemon (which
is a program) to do this. If you want to let a human do it, you can, but
that is suboptimal.

>>Even if the deadlock actually occurred, you could suspend all the
>>processes waiting for swap space, and then mount some reserve filesystem.
>
>It is very strange that you have a reserve filesystem available for
>swapping. You should have already allocated such free space in advance
>as swap area.

This is only a convenience. Depending on where you implemented this code,
you could simply check a high-water paging mark, as mentioned by others.
Holding some portion of your swap space in reserve is just another way to
give you the opportunity to play these games when trouble starts. Using a
high-water mark can work just as well.

>>Some care would then be necessary to let the important jobs finish. It
>>may be necessary to continue jobs selectively instead of all at once, to
>>prevent a repeat of the deadlock.
>>Even if you need some more memory to get the mounting done, you could kill
>>some non-critical system daemon that could be started later (like lpd or
>>something). Later on, you could decide to restart the daemon killed earlier,
>>and/or unmount no longer used filesystems. Eventually, you could return to
>>normal operation.
>
>Who takes care of all these things? Are you proposing to attach a
>knowledgeable person all the time? A person who can understand what a
>deadlock is seems to be very uncommon even in this newsgroup.

Read the posting. It says daemon. That means a program. It could also be
part of the kernel. (How about not deriding the readership while you're at
it.)

>>This is a little complicated but certainly doable and allows you to not
>>reserve swap space on memory allocation and to use a true COW fork().
>
>But it is not worth doing so. Use vfork.

This is your opinion (if I understand the sentence). Making statements like
this does not address the other benefits that many people seem to like
about COW fork.

>>People may complain that this is not a truly
>>general solution, and I would agree.
>
>Vfork is the true solution.

True?? By whose standards? This is what we are debating. Making statements
like this without justification is worthless. When I say it's not general,
I mean that there are situations where this may not be appropriate. There
are most certainly situations where vfork() is not appropriate. So much for
being a true solution.

>>As a side note, on the large systems I work on, we don't do preallocation
>>and have never run out of paging (swap) space. This is not to say that
>>we never will, but typical systems have on the order of at least 10 times as
>>much disk storage as main memory. In some cases, as much as 100 times more.
>
>You have 100 times more swap space because you think it may be filled,
>don't you?

NO. These numbers come from simply looking at the systems that people have
running.
These numbers are true for almost all computers in general. It may be that
most people can't page on all their filesystems, but that's not what I
said.

>>I claim it is hard to fill that much disk
>>space with paging and swapping traffic and still have a usable system. You'll
>>probably be thrashing to death long before that.
>
>As you may know, programs manipulating large arrays, if written properly,
>can use very large virtual space with little real memory without
>thrashing. That is why some of your systems are configured with 100 times
>more swap space, isn't it?

You are attempting to justify the reasons for the way I have my systems
configured, when in reality this is the way just about everybody's systems
are configured.

The point about sparse matrix programs may be true to some extent. These
programs may or may not represent a significant portion of a machine's
workload and therefore have some bearing on the techniques selected.
Depending on the reference patterns of the programs, though, if you really
don't reference parts of the matrix at all, then pre-allocation of swap
space is going to prevent the program from running when it would run fine
without preallocation. Even just reads from parts of the matrix wouldn't
require swap space allocation, since the pages aren't dirtied. This sounds
quite likely.

On inspection, it appears that one of our big systems currently in the
field has 1 gigabyte of physical memory and only about 6.4 gigabytes of
disk storage. They run sparse matrix problems. They run lots of scientific
codes. They have yet to have any problems with swap space.

It may be that the 100 number is a little exaggerated. Having disk storage
be 10 times the amount of physical memory in large machines seems to be
more the rule. Disk servers, however, move more towards the 100 mark.

#include <std_disclaimer>
mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/19/90)
In article <58227@bbn.BBN.COM> lkaplan@BBN.COM (Larry Kaplan) writes:

>>The situation is well under control. Important processes can be programmed
>>to try mallocing several times with exponential back-off. Even if a process
>>dies, it can clean up its environment.

>This requires significant modification of programs.

No. All you have to do is redefine malloc and realloc.

>My proposals have not
>required any change to applications, only additional system code. A general
>solution does not require old applications to be rewritten.

But your proposal is not a solution.

>>Yes, it is always possible to resolve deadlock by human intervention.

>Don't put words in my mouth. I say below that you can write a daemon
>(which is a program) to do this. If you want to let a human do it, you can,
>but that is suboptimal.

You may claim you can write a perfect AI, which I don't mind.

>>It is very strange that you have a reserve filesystem available for
>>swapping. You should have already allocated such free space in advance
>>as swap area.

>This is only a convenience. Depending on where you implemented this code, you
>could simply check a high-water paging mark as mentioned by others. Holding
>some portion of your swap space in reserve is just another way to give you the
>opportunity to play these games when trouble starts. Using a high-water mark
>can work just as well.

But I don't want to play with trouble. I would rather play nethack instead.

>Read the posting. It says daemon. This means a program.

Oops! A program? Are you joking? What if the program itself runs short of
pages?

>It could also be part of the kernel.

A more reasonable proposal. It can be a dirty and unreliable workaround.

>>But it is not worth doing so. Use vfork.

>This is your opinion (if I understand the sentence). Making statements like
>this does not address the other benefits that many people seem to like
>about COW fork.

I will use COW fork, of course, if it exists and it is necessary.
But, to fork-exec something, you should use vfork.

>>>People may complain that this is not a truly
>>>general solution, and I would agree.

>>Vfork is the true solution.

>True?? By whose standards? This is what we are debating. Making statements
>like this without justification is worthless.

I showed several justifications why simple fork is not appropriate for
fork-exec. On the other hand, you showed no justification not to use vfork.
You only showed it is inelegant not to use vfork, and insist on using fork.

>There are most certainly situations where vfork() is not appropriate.

Have you shown any justification for claiming so? Or is it merely your
desire?

>>>As a side note, on the large systems I work on, we don't do preallocation
>>>and have never run out of paging (swap) space. This is not to say that
>>>we never will, but typical systems have on the order of at least 10 times as
>>>much disk storage as main memory. In some cases, as much as 100 times more.

>>You have 100 times more swap space because you think it may be filled,
>>don't you?

>NO. These numbers come from simply looking at the systems that people have
>running. These numbers are true for almost all computers in general.

You want to say almost all computers in general have 10 to 100 times more
swap space than real memory? That is simply incorrect.

Very few systems have 10 times more swap space.

>It may be
>that most people can't page on all their filesystems, but that's not what I
>said.

Then, what do you want to say? You should make your point clear.

>>>I claim it is hard to fill that much disk
>>>space with paging and swapping traffic and still have a usable system. You'll
>>>probably be thrashing to death long before that.

>>As you may know, programs manipulating large arrays, if written properly,
>>can use very large virtual space with little real memory without
>>thrashing. That is why some of your systems are configured with 100 times
>>more swap space, isn't it?
>You are attempting to justify the reasons for the way I have my systems
>configured, when in reality this is the way just about everybody's systems are
>configured.

The thread of the vfork discussion began because I said your system without
vfork is broken.

Moreover, it is you who brought the configuration of your system into the
discussion.

So, why can't I refer to your system?

>The point about sparse matrix programs may be true to some extent.

Sparse matrix? I am not referring to such a thing.

>These programs may or may not represent a significant portion of a machine's
>workload and therefore have some bearing on the techniques selected.
>Depending on the reference patterns of the programs though, if you really don't
>reference parts of the matrix at all, then pre-allocation of swap space is
>going to prevent the program from running when it would run fine without
>preallocation.

If you don't have to reference parts of a matrix, there is an efficient
algorithm to do it without extra swap space.

>Even just reads from parts of the matrix wouldn't require
>swap space allocation since the pages aren't dirtied. This sounds quite
>likely.

You can do so with mmap, specifying the read-only option. A clever
implementation won't allocate extra swap space.

BUT, such a thing has nothing to do with fork or vfork.

>On inspection, it appears that one of our big systems currently in the field
>has 1 gigabyte of physical memory and only about 6.4 gigabytes of disk storage.
>They run sparse matrix problems. They run lots of scientific codes. They have
>yet to have any problems with swap space.

You might have been lucky, or you might have just overlooked the problem.

>It may be that the 100 number is a little exaggerated. Having disk storage
>be 10 times the amount of physical memory in large machines seems to be
>more the rule. Disk servers, however, move more towards the 100 mark.

So what?

						Masataka Ohta
lkaplan@bbn.com (Larry Kaplan) (07/19/90)
To start with, I think we've about beaten this topic into the ground. We
all have our preferences, and no one is going to change them easily. :-)
Anyway, I'll address some unclear points and provide a little more
justification than I have before. I'm going to try to avoid making
deprecating (or insulting) remarks, as some have done on this topic.

In article <5894@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp
(Masataka Ohta) writes:

> (stuff about my proposed swap space monitor daemon)
>
>Oops! A program? Are you joking? What if the program itself runs short
>of pages?

Our kernel (and others, I am sure) allows sufficiently privileged users
(like root) to ask that some of their memory be wired (not paged). This
facility could certainly be used for this daemon. I did say this in the
original posting, but some may have overlooked it.

>>It could also be part of the kernel.
>
>A more reasonable proposal. It can be a dirty and unreliable workaround.

(Another obnoxious comment.)

>I showed several justifications why simple fork is not appropriate
>for fork-exec.

But the debate is COW fork vs vfork.

>On the other hand, you showed no justification not to use vfork.
>You only showed it is inelegant not to use vfork and insist on using
>fork.
>
>>There are most certainly situations where vfork() is not appropriate.
>
>Have you shown any justification for claiming so? Or is it merely your
>desire?

Let me attack this one now. Let me start by saying that the problems we
have with a regular vfork are probably fairly machine-specific. However,
some of this reasoning may soon apply to other machines, if it doesn't
already. Our machine is a non-uniform memory architecture multiprocessor.
This means that each node has memory local to it. If you were to use vfork
when forking onto another processor, the child would be executing out of
remote memory. While caches help alleviate the performance penalty, making
things local is a much better idea.
By using COW fork, we always set up a local text (initially empty) and
stack segment, and even support COR (copy on reference) for remote memory
marked INHERIT_COPY. This reason, in addition to the other problems with
vfork (such as the broken semantics people seem to agree upon), made it
fairly clear that we should get rid of it. To make vfork set up local text
anyway would eliminate most of the performance advantage it had. Note that
even for forks onto the same processor, the text segment manipulations
aren't that expensive, since in this case you do share the segment and only
have to set up page table entries on demand (or you could preallocate the
page tables for the text).

>>>>As a side note, on the large systems I work on, we don't do preallocation
>>>>and have never run out of paging (swap) space. This is not to say that
>>>>we never will, but typical systems have on the order of at least 10 times as
>>>>much disk storage as main memory. In some cases, as much as 100 times more.
>
>>>You have 100 times more swap space because you think it may be filled,
>>>don't you?
>
>>NO. These numbers come from simply looking at the systems that people have
>>running. These numbers are true for almost all computers in general.
>
>You want to say almost all computers in general have 10 to 100 times more
>swap space than real memory? That is simply incorrect.
>
>Very few systems have 10 times more swap space.

Please, please, please, I said disk space to real memory, not swap space.
Part of the point was that Mach is willing to page to all the disks,
whether they are true swap partitions or regular UNIX filesystems. So the
ratio is very relevant to Mach-based systems.

>>>As you may know, programs manipulating large arrays, if written properly,
>>>can use very large virtual space with little real memory without
>>>thrashing. That is why some of your systems are configured with 100 times
>>>more swap space, isn't it?
>
>>You are attempting to justify the reasons for the way I have my systems
>>configured, when in reality this is the way just about everybody's systems are
>>configured.
>
>The thread of the vfork discussion began because I said your system without
>vfork is broken.
>
>Moreover, it is you who brought the configuration of your system into
>the discussion.
>
>So, why can't I refer to your system?

You can most certainly refer to my system. But you were suggesting the
reasons for why my system looks the way it does, and those reasons are
simply NOT true.

>>The point about sparse matrix programs may be true to some extent.
>
>Sparse matrix? I am not referring to such a thing.
>
>>Even just reads from parts of the matrix wouldn't require
>>swap space allocation since the pages aren't dirtied. This sounds quite
>>likely.
>
>You can do so with mmap, specifying the read-only option. A clever
>implementation won't allocate extra swap space.
>
>BUT, such a thing has nothing to do with fork or vfork.

But it does have to do with preallocation and COW fork vs vfork. Again, any
mmap calls require application code changes that the non-preallocating
scheme doesn't need.

>>On inspection, it appears that one of our big systems currently in the field
>>has 1 gigabyte of physical memory and only about 6.4 gigabytes of disk storage.
>>They run sparse matrix problems. They run lots of scientific codes. They have
>>yet to have any problems with swap space.
>
>You might have been lucky, or you might have just overlooked the
>problem.

Lucky? Maybe. Overlooked? Not possible. While this has been a very
interesting discussion, and most of the necessary primitives exist in our
kernel to support these fancy emergency schemes, none of the higher-level
things have been implemented yet. If they ran out of swap space, they (and
we) would know.

#include <std_disclaimer>
mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/20/90)
In article <58307@bbn.BBN.COM> lkaplan@BBN.COM (Larry Kaplan) writes:

>Anyway, I'll address some unclear points and provide a little more
>justification than I have before.

OK, now, at last, you present a justification. Now the debate on COW fork
vs. vfork can begin. But be sure that vfork is only useful just before
exec.

Let me show that your justification is wrong.

>>Oops! A program? Are you joking? What if the program itself runs short
>>of pages?

>Our kernel (and others, I am sure) allows sufficiently privileged users
>(like root) to ask that some of their memory be wired (not paged). This
>facility could certainly be used for this daemon. I did say this in the
>original posting but some may have overlooked this.

Perhaps your kernel (but not all UNIX, I am SURE) may be able to do so.
But it has nothing to do with the problem. Can your kernel force
yet-unallocated pages to be copied into swap space and make them wired?

>>I showed several justifications why simple fork is not appropriate
>>for fork-exec.
>
>But the debate is COW fork vs vfork.

Simple fork, as I mean it, includes COW fork.

>>Have you shown any justification for claiming so? Or is it merely your
>>desire?

>Let me attack this one now. Let me start by saying that the problems
>we have with a regular vfork are probably fairly machine specific. However,
>some of this reasoning may soon apply to other machines if not already.

I don't think so. But I don't want to fork the discussion. So, let's focus
on your architecture.

>If you were to use vfork when forking
>onto another processor, the child would be executing out of remote memory.
>While caches help alleviate the performance penalty, making things local
>is a much better idea.

You misunderstand the semantics of vfork. When you want a simple,
general-purpose fork (including COW fork), it is often (but not always)
reasonable to use other processors for the child process.
But, if you are vforking, the parent process is suspended until the child
process does exit or exec.

So, there is no reason to run the child process on another processor. You
should, instead, arrange for exec to be done on the other processor.

>To make vfork set up
>local text anyway would eliminate most of the performance advantage it had.
>Note that even for forks onto the same processor, the text segment
>manipulations aren't that expensive

Though it has nothing to do with the main-stream discussion, you should
learn more about the history of UNIX. Text segment copying was never a
reason for vfork. Vfork was introduced because copying of large data/stack
segments was time-consuming. Sharing of the read-only text segment has a
very old history.

>Please, please, please, I said disk space to real memory, not swap space.

Sorry about my misunderstanding.

>Part of the point was that Mach is willing to page to all the disks whether
>they are true swap partitions or regular UNIX filesystems. So the ratio
>is very relevant to Mach based systems.

But, if you think you have free disk space, you should have allocated it
for swapping in advance.

>>>Even just reads from parts of the matrix wouldn't require
>>>swap space allocation since the pages aren't dirtied. This sounds quite
>>>likely.

>>You can do so with mmap, specifying the read-only option. A clever
>>implementation won't allocate extra swap space.

>>BUT, such a thing has nothing to do with fork or vfork.

>But it does have to do with preallocation and COW fork vs vfork. Again any
>mmap calls require application code changes that the non-preallocating
>scheme doesn't need.

But, without mmap, how can you set values into the read-only matrix? It
cannot be part of the data segment, because your program, practically,
cannot have a huge number of initializer lines for the large matrix. I
thought it could only be a file output from another program.

						Masataka Ohta
lkaplan@bbn.com (Larry Kaplan) (07/20/90)
I wrote:
>>Our kernel (and others, I am sure) allows sufficiently privileged users
>>(like root) to ask that some of their memory be wired (not paged). This
>>facility could certainly be used for this daemon. I did say this in the
>>original posting but some may have overlooked this.

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
>Perhaps your kernel (but not all UNIX, I am SURE) may be able to
>do so. But it has nothing to do with the problem. Can your kernel
>force yet-unallocated pages to be copied into swap space and make
>them wired?

Why does this have nothing to do with the problem? We were discussing
methods for handling swap space exhaustion without pre-allocation. The
ability to wire memory is a very valid aid to solving the problem. When you
wire pages on a non-preallocating system, there is no reason to copy them
into the swap area. They will never go there.

>>If you were to use vfork when forking
>>onto another processor, the child would be executing out of remote memory.
>>While caches help alleviate the performance penalty, making things local
>>is a much better idea.

>You misunderstand the semantics of vfork. When you want a simple,
>general-purpose fork (including COW fork), it is often (but not always)
>reasonable to use other processors for the child process.

>But, if you are vforking, the parent process is suspended until the child
>process does exit or exec.

>So, there is no reason to run the child process on another processor.
>You should, instead, arrange for exec to be done on the other processor.

This is an interesting idea, and it actually might make vfork useful for
us. However, I am unsure that the performance gained by maintaining the
vfork hack and creating a new "exec on processor n" syscall would be worth
the effort. With COW "fork on processor n" and regular exec, we probably
get very close to the same performance and have a syscall that is useful
for other things, like parallel processing.
I think at this point we have alternative solutions that each have
benefits, but, of course, the time and utility of implementing each needs
to be weighed.

>>To make vfork set up
>>local text anyway would eliminate most of the performance advantage it had.
>>Note that even for forks onto the same processor, the text segment
>>manipulations aren't that expensive

>Though it has nothing to do with the main-stream discussion, you
>should learn more about the history of UNIX. Text segment copying
>was never a reason for vfork. Vfork was introduced because copying
>of large data/stack segments was time-consuming. Sharing of the
>read-only text segment has a very old history.

I think you misunderstood this one. I know what vfork was intended for. Our
system has other requirements, though. Certainly text segment copying is
not a reason for vfork. But in a NUMA multiprocessor, vfork (or fork) to
another node is a reason for text segment copying, though no copy is really
made except for the pages that are referenced on that node. The same is
true for data and stack when going to another node. However, here, copy on
reference (or write) can be used. Actually, you could say that the way the
text segment is brought over is also copy on reference.

I think it is apparent that vfork to another node in a NUMA multiprocessor
changes the semantics of the call enough to make it inappropriate. vfork to
the same node would be fine, but then how do you get any UNIX jobs (lots of
serial processes like compiles) running on other nodes? The "exec on
processor n" would work, but is limited to this use only. COW fork() and
regular exec, on the other hand, do pretty well at this job, and COW fork
is very useful for real parallel programming.

>>Part of the point was that Mach is willing to page to all the disks whether
>>they are true swap partitions or regular UNIX filesystems. So the ratio
>>is very relevant to Mach based systems.
>But, if you think you have free disk space, you should have allocated
>it for swapping in advance.

Why?  Then it wouldn't be available for users.  The name of the game here
is flexibility.  Why constrain swap space or user disk space when you can
just allow both to allocate from the same place?  If you are worried about
runaway user disk consumption, that's what quotas are for.

#include <std_disclaimer>
_______________________________________________________________________________
                       ____ \ / ____
Laurence S. Kaplan    |  \ 0 /  |       BBN Advanced Computers
lkaplan@bbn.com        \____|||____/    10 Fawcett St.
(617) 873-2431          /__/ | \__\     Cambridge, MA 02238
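The "page to regular UNIX filesystems" escape hatch Kaplan describes maps
onto what later systems expose as file-backed swap.  A hedged operator
sketch, using modern Linux-era commands (mkswap/swapon and the file path
are assumptions; the Mach systems under discussion used their own
interfaces):

```shell
# Create a 64 MB file on an ordinary filesystem and add it as paging
# space.  Requires root; these are the modern equivalents, not Mach's.
dd if=/dev/zero of=/var/extra-swap bs=1M count=64
chmod 600 /var/extra-swap
mkswap /var/extra-swap     # write the swap signature
swapon /var/extra-swap     # kernel now pages to this file
swapon --show              # verify it is in use
```

This is the "mount some reserve filesystem" recovery step from earlier in
the thread, done by hand by an operator.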
peter@ficc.ferranti.com (Peter da Silva) (07/23/90)
I don't think that M. Ohta is arguing against retaining COW fork, but in
favor of retaining vfork.  In the case of multiprocessors, you would want
to have fork-on-other-CPU, exec-on-other-CPU, and vfork.  Each has its
uses.
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
<peter@ficc.ferranti.com>
mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/23/90)
In article <58330@bbn.BBN.COM> lkaplan@BBN.COM (Larry Kaplan) writes:

>>Perhaps, your kernel (but not all UNIX, I am SURE) may be able to
>>do so.  But, it has nothing to do with the problem.  Can your kernel
>>allow forcing yet-unallocated pages to be copied into swap space
>>and wired?

>Why does this have nothing to do with the problem?

Because you can't wire a yet-unallocated page.

>We were discussing methods
>for handling swap space exhaustion without pre-allocation.  The ability to
>wire memory is a very valid aid to solving the problem.

Wiring is not essential.  It is only necessary to actually allocate space.

>When you wire pages
>on a non-preallocating system, there is no reason to copy them into the swap
>area.  They will never go there.

Again, if a page is not allocated, there is nothing to wire.

>>So, there is no reason to run the child process on another processor.
>>You should, instead, arrange for the exec to be done on the other processor.

>This is an interesting idea, and actually might make vfork useful for us.
>However, I am unsure that the performance gained by maintaining the vfork
>hack and creating a new "exec on processor n" syscall would be worth the effort.

You were worrying very much about the copying effort for the text segment,
weren't you?  If you always exec on the other processor, the text segment
on the original processor is untouched and can perhaps be used later,
without accessing global memory.

>I think it is apparent that vfork to another node in a NUMA multiprocessor
>changes the semantics of the call enough to make it inappropriate.

Nothing is apparent to me.

>vfork to the same node would be fine, but then how do you get any UNIX
>jobs (lots of serial processes like compiles) running on other nodes?
>The "exec on processor n" would work but is limited to this use only.

"exec on processor n"?  Why can't a regular exec be performed on some
other appropriate processor?  Is a process id tied to a processor?
If you have an application on current UNIX which uses lots of parallel
processes with vfork and exec, you can run it on a multiprocessor with
full parallelism without changing the code.

>COW fork() and regular exec, on the other hand, do pretty well at this job

As for COW-fork performance, which I am not so interested in, someone
said some workaround (pre-copying of several pages) is necessary to gain
sufficient speed.  As for the pre-allocation problem, only a complex and
incomplete workaround exists.

>and COW fork is very useful for real parallel programming.

Yes, you may have fork, which may be COW.

>>But, if you think you have free disk space, you should have allocated
>>it for swapping in advance.

>Why?  Then it wouldn't be available for users.  The name of the game here
>is flexibility.  Why constrain swap space or user disk space when you can
>just allow both to allocate from the same place?

Flexibility?  It is an illusion.  Actually, it is only complexity of
management.  There is a positive correlation, for a common process,
between the amount of file space and the amount of swap space it uses.

>If you are worried about
>runaway user disk consumption, that's what quotas are for.

It is not acceptable to me, but if you don't mind losing computation
results due to quota, you shouldn't mind a pre-allocating COW fork.
Pre-allocating fork is only as bad as disk quota.

Masataka Ohta
jkenton@pinocchio.encore.com (Jeff Kenton) (07/23/90)
From article <5928@titcce.cc.titech.ac.jp>, by mohta@necom830.cc.titech.ac.jp (Masataka Ohta):
>
> Nothing is apparent to me.
>
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
jeff kenton  ---  temporarily at jkenton@pinocchio.encore.com
             ---  always at (617) 894-4508
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
tif@doorstop.austin.ibm.com (Paul Chamberlain) (07/23/90)
In article <5928@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
>In article <58330@bbn.BBN.COM>
> lkaplan@BBN.COM (Larry Kaplan) writes:
>>>[ deeply nested crud that we're all tired of reading ]

Why not agree to disagree?

Some people don't want their programs to run unless they are absolutely
guaranteed that they will finish.  The rest of us like things to run as
often as possible, even when troubled with the occasional failure due to
lack of resources.

Paul Chamberlain | I do NOT represent IBM     tif@doorstop, sc30661@ausvm6
512/838-7008     | ...!cs.utexas.edu!ibmaus!auschs!doorstop.austin.ibm.com!tif
fetter@cos.com (Bob Fetter) (07/25/90)
In article <2845@awdprime.UUCP> tif@doorstop.austin.ibm.com (Paul Chamberlain) writes:
>In article <5928@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
>>In article <58330@bbn.BBN.COM>
>> lkaplan@BBN.COM (Larry Kaplan) writes:
>>>>[ deeply nested crud that we're all tired of reading ]

  Not me.  I've found this discussion interesting, and it has given me
several occasions to stop and rethink what I thought I already knew,
(re-)learning some things in the process.  Hey, it's the real *fun* of
this business, at least for me.

>Why not agree to disagree?
>
>Some people don't want their programs to run unless they are absolutely
>guaranteed that they will finish.  The rest of us like things to run as
>often as possible, even when troubled with the occasional failure due to
>lack of resources.

  Ah, but from the OS platform standpoint, an occasional failure,
especially one due to lack of resources, is pretty important.

  As far as conservation of swap space goes, on a system that runs with a
fixed swapfile mechanism it is a finite resource and should be husbanded
with appropriate care ('fixed', of course, at a given point in time; this
is not to say that swapfile space can't be grown or shrunk).  From an
application standpoint, the sequence of resource exhaustion, failure, and
restart can be survivable in many cases -- though long-running,
compute-bound, memory-expensive applications, of course, aren't good
candidates to let this happen to.

  If the OS platform kills or locks up end-user processes, due to resource
shortfalls or other reasons, more than one process suffers.  This really
isn't just a case of "not wanting a program to run unless you're sure it
WILL run", but of wanting OTHER programs -- read: user processes -- to
keep running.  But enough on this, I guess.
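The "survivable failure" position sketched above amounts to handling
allocation failure at the application level instead of having the kernel
kill processes.  A minimal sketch (`malloc_retry` is a hypothetical helper,
not from the posts): retry the allocation after a backoff, on the theory
that swap space frees up when some large process exits.

```c
#include <stdlib.h>
#include <unistd.h>

/* Retry malloc() up to `attempts` times, sleeping between tries to give
 * the system a chance to reclaim swap space.  Returns NULL only if every
 * attempt fails -- the application, not the kernel, decides what to do. */
void *malloc_retry(size_t n, int attempts)
{
    for (int i = 0; i < attempts; i++) {
        void *p = malloc(n);
        if (p != NULL)
            return p;
        sleep(1);   /* back off; resources may be freed by other exits */
    }
    return NULL;
}
```

This is the application-side analogue of Peter da Silva's earlier "put the
program in a wait state until swap space is available", with the deadlock
risk bounded by the retry limit.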
  As to vfork(), please correct me if I'm wrong, but I have the mental
model that vfork can be considered a degenerate case of lightweight
processes (LWP); let me term this degenerate case a "pseudo-LWP", or pLWP.
I call it degenerate in that:

 1 - the main HWP is suspended until the exec*() of the pLWP, because
 2 - the stack is logically shared between the two, and
 3 - there can be only one pLWP paired to the main HWP at a time.

(One could say that the existence of the pLWP is bounded by the vfork()
call for its creation and the exec() call terminating its existence as a
*pLWP*, transforming it into a full-blown HWP.)

  I see and understand the argument that one doesn't want to create a copy
of the stack for this pLWP since, in *this particular case*, the end goal
is to exec() a new process with its own resources -- so why copy what one
will toss away (using finite resources in the process; the infamous
swapfile discussions to date), etc.  To effect this "savings", I can see
why one then needs to suspend the main HWP, so one doesn't get "dueling
stack frames syndrome".  Having one pLWP per HWP like this appears to be
painful enough, hence limit 3 above.

  My understanding is that, using vfork(), all segments known to the pLWP
are exactly those known to the main HWP, and there is no COW being done.
As such, the pLWP is operationally the same as a "regular" LWP in this
regard, except, of course, that the pLWP and the main HWP share the same
stack segment ("regular" LWPs have their own stacks).  It seems to me that
it is this sharing of the stack segment which is the issue (use the same
stack with its attendant risks, or perform COW a la fork() with ITS
problems such as swapfile allocation, etc.), along with the "hidden"
semantics of vfork() creating a pLWP when fork() doesn't, and the
confusion this difference may cause.
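The pLWP lifetime described above -- created by vfork(), parent suspended,
transformed into a full process by exec() -- can be sketched directly.  A
minimal sketch; `/bin/true` and the name `plwp_demo` are assumptions for
illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child borrows the parent's address space (no COW, no stack copy);
 * the parent is suspended until the child's exec() or _exit(). */
int plwp_demo(void)
{
    pid_t pid = vfork();
    if (pid == 0) {
        /* Only exec*() or _exit() are safe in a vfork child, precisely
         * because the stack is shared with the suspended parent. */
        execl("/bin/true", "true", (char *)0);
        _exit(127);   /* exec failed */
    }
    int status;
    waitpid(pid, &status, 0);   /* parent resumed after the exec */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The "dueling stack frames syndrome" is what would happen if the parent ran
concurrently: both processes would push and pop frames on the same stack.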
  Standing back a bit (hopefully not too far back), I get the impression
that the overall issue might be thought of as the conversion of an LWP to
an HWP.  Now, it's not exactly that as of this point in time as regards
the vfork() call, in that "regular" LWPs have their own stack, whereas a
pLWP is a parasite on its main HWP.  But...

  If/when actual direct kernel support for LWPs is available, why not just
treat vfork() as the creation of an LWP, suspending the main HWP as done
now and mapping all segment descriptors/references from the LWP back to
the HWP (including the stack), and then, when the exec() call occurs:

 1 - remap all segment descriptors for the LWP to new, appropriately
     sized segments for the exec()-ed process;
 2 - remove the suspension on the main HWP, putting in the appropriate
     return value from the vfork() call.

This puts the LWP/task management into the system dispatcher/context
switcher, removing the existing runtime inter-process
implementation/multiplexing technique.

  I don't see (additional) usage of swapfile space in the above during the
time between vfork() and exec() in the 'child', which appears to be one of
the major points under discussion.  It also appears to allocate memory
(and the associated swapfile resource) only as required by the new HWP,
not carrying over the 'parent' HWP load for a transient time only to
release it effectively unused.  After all, can't one consider that an LWP
is just an HWP which has one or more of its segments bound to/shared with
another HWP?  Or am I missing something here?  (Issues like files,
signals, etc., would of course percolate as done today.)

  I seem to remember an article in this chain (or a related one) which
discussed segment management system calls which could be implemented.  Am
I on a similar train of thought?

-Bob-