[comp.arch] fork and preallocation

lkaplan@bbn.com (Larry Kaplan) (07/13/90)

In article <5855@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
>In article <5DL4SPD@xds13.ferranti.com>
>	peter@ficc.ferranti.com (Peter da Silva) writes:
>
>>> 1) An utterly broken implementation where some important system
>>> process (such as inetd, ypbind or sendmail) may be killed if there
>>> is not enough swap space.
>
>>Alternatively, put the program in a wait state until swap space is available.
>>Deadlocks are possible, but unlikely. Indefinite deferment is more likely,
>
>No.
>
>Once swap space shortage occurs, it will tend to occur continually
>until some large process exits. So, if all such processes are put in
>wait states (which is very likely to occur, because active processes
>often require new pages), the situation is deadlock.
>

First, preallocation still has some problems with respect to dynamic
allocations.  While deadlock doesn't occur, random processes can still
die (of their own doing) when some malloc call fails to reserve 
swap space.  This will not happen without pre-allocation, which gives you the
opportunity to get more work done and possibly never have trouble.

Next, there are actually ways to handle the deadlock.  Note that Mach based
implementations are willing to page on just about any vfs available.
This means that Mach will page to Unix filesystems or NFS filesystems if
desired.  The kernel (or some appropriately "wired" daemon) could note
that the system was running out of paging space, and make arrangements to 
either suspend memory consuming processes or mount more filesystems.  Even
if the deadlock actually occurred, you could suspend all the processes waiting
for swap space, and then mount some reserve filesystem.  Some care would then
be necessary to let the important jobs finish.  It may be necessary to continue
jobs selectively instead of all at once, to prevent a repeat of the deadlock.
Even if you need some more memory to get the mounting done, you could kill 
some non-critical system daemon that could be started later (like lpd or 
something).  Later on, you could decide to restart the daemon killed earlier, 
and/or unmount no longer used filesystems.  Eventually, you could return to
normal operation.
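
For concreteness, the monitoring half of such a daemon can be sketched in a
few lines.  This is a hypothetical sketch, not the Mach code under
discussion: it assumes a modern Linux-style /proc/meminfo interface, and the
recovery action (mounting or swapon-ing a reserve filesystem, killing lpd,
etc.) is left as a comment.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse the "SwapFree:" line out of a /proc/meminfo-style buffer
 * (the /proc interface is a modern-Linux assumption, not something
 * the systems in this thread had).  Returns kilobytes of free swap,
 * or -1 if the field is missing. */
long parse_swap_free(const char *meminfo)
{
    const char *p = strstr(meminfo, "SwapFree:");
    if (p == NULL)
        return -1;
    return strtol(p + strlen("SwapFree:"), NULL, 10);
}

/* One poll of the monitor loop: returns 1 if free swap has dropped
 * below low_kb and the daemon should act (add reserve backing store,
 * suspend or kill a non-critical daemon), 0 otherwise. */
int swap_is_low(long low_kb)
{
    char buf[8192];
    FILE *f = fopen("/proc/meminfo", "r");
    if (f == NULL)
        return 0;                     /* no /proc: nothing to report */
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    buf[n] = '\0';
    fclose(f);
    long free_kb = parse_swap_free(buf);
    return free_kb >= 0 && free_kb < low_kb;
}
```

A real daemon would sit in a loop calling something like `swap_is_low()`
periodically and take the recovery steps described above when it fires.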

This is a little complicated but certainly doable and allows you to not
reserve swap space on memory allocation and to use a true COW fork().
Even if such a daemon were not implemented, some of this could actually be
done by hand by an operator.  People may complain that this is not a truly
general solution, and I would agree.  However, combined with
the added flexibility of no preallocation, it seems justifiable.

As a side note, on the large systems I work on, we don't do preallocation
and have never run out of paging (swap) space.  This is not to say that
we never will, but typical systems have on the order of at least 10 times as
much disk storage as main memory.  In some cases, as much as 100 times more.
Even if the user filesystems are full, some care is taken to leave some other 
partitions available for paging.  I claim it is hard to fill that much disk
space with paging and swapping traffic and still have a usable system.  You'll
probably be thrashing to death long before that.

#include <std_disclaimer>
_______________________________________________________________________________
				 ____ \ / ____
Laurence S. Kaplan		|    \ 0 /    |		BBN Advanced Computers
lkaplan@bbn.com			 \____|||____/		10 Fawcett St.
(617) 873-2431			  /__/ | \__\		Cambridge, MA  02238

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/16/90)

In article <58184@bbn.BBN.COM> lkaplan@BBN.COM
	(Larry Kaplan) writes:

>While deadlock doesn't occur, random processes can still
>die (of their own doing) when some malloc call fails to reserve 
>swap space.

Failing malloc is very different from being killed at random.

The situation is well under control. Important processes can be programmed
to try mallocing several times with exponential back-off. Even if a process
dies, it can clean up its environment.
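
A minimal sketch of that back-off loop (the function name and retry bound
are illustrative, not from any particular system):

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Retry malloc with exponential back-off, giving a transient
 * swap-space shortage time to clear before the process gives up
 * and cleans up after itself. */
void *malloc_backoff(size_t size, int max_tries)
{
    unsigned delay = 1;               /* seconds: 1, 2, 4, 8, ... */
    for (int i = 0; i < max_tries; i++) {
        void *p = malloc(size);
        if (p != NULL)
            return p;
        sleep(delay);
        delay *= 2;
    }
    return NULL;                      /* caller can now shut down gracefully */
}
```

As a later posting in this thread notes, a routine like this can simply be
named malloc and linked ahead of the C library, so existing applications
need no source changes.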

>This will not happen without pre-allocation, and gives you the
>opportunity to get more work done and possibly never have trouble.

First, it is very probable (except for mmap) that a malloced area is
actually used. So, if there is not enough swap space, processes will
almost certainly be killed.

Without pre-allocation, a process will be killed without notice.

There is no opportunity for retry nor for graceful shutdown.

It can be a great trouble.

>Next, there are actually ways to handle the deadlock.

Yes, it is always possible to resolve deadlock by human intervention.

>Note that Mach based
>implementations are willing to page on just about any vfs available.
>This means that Mach will page to Unix filesystems or NFS filesystems if
>desired.  The kernel (or some appropriately "wired" daemon) could note
>that the system was running out of paging space, and make arrangements to 
>either suspend memory consuming processes or mount more filesystems.

Of course, it is not impossible, it is just next to impossible.

By the way, suspending memory-consuming processes is the worst thing to do.
The consumed memory will not be released until those processes are
reactivated, and the processes will not be reactivated until a large
amount of memory is released. The situation is a partial deadlock, and
meanwhile, other processes will easily consume the rest of the memory, causing
system-wide deadlock.

>Even
>if the deadlock actually occurred, you could suspend all the processes waiting
>for swap space, and then mount some reserve filesystem.

It is very strange that you have a reserve filesystem available for
swapping. You should have already allocated such free space in advance
as swap area.

>Some care would then
>be necessary to let the important jobs finish.  It may be necessary to continue
>jobs selectively instead of all at once, to prevent a repeat of the deadlock.
>Even if you need some more memory to get the mounting done, you could kill 
>some non-critical system daemon that could be started later (like lpd or 
>something).  Later on, you could decide to restart the daemon killed earlier, 
>and/or unmount no longer used filesystems.  Eventually, you could return to
>normal operation.

Who takes care of all these things? Are you proposing to attach a
knowledgeable person all the time? A person who understands what deadlock
is seems to be very uncommon, even in this newsgroup.

>This is a little complicated but certainly doable and allows you to not
>reserve swap space on memory allocation and to use a true COW fork().

But it is not worth doing so. Use vfork.

>People may complain that this is not a truly
>general solution, and I would agree.

Vfork is the true solution.

>However, combined with
>the added flexibility of no preallocation, it seems justifiable.

No. Vfork does no preallocation either.

>As a side note, on the large systems I work on, we don't do preallocation
>and have never run out of paging (swap) space.  This is not to say that
>we never will, but typical systems have on the order of at least 10 times as
>much disk storage as main memory.  In some cases, as much as 100 times more.

You have 100 times more swap space because you think it may be filled,
don't you?

>I claim it is hard to fill that much disk
>space with paging and swapping traffic and still have a usable system.  You'll
>probably be thrashing to death long before that.

As you may know, programs manipulating large arrays, if written properly,
can use very large virtual space with little real memory without
thrashing. That is why some of your systems are configured with 100 times
more swap space, isn't it?

						Masataka Ohta

lkaplan@bbn.com (Larry Kaplan) (07/17/90)

In article <5870@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
>In article <58184@bbn.BBN.COM> lkaplan@BBN.COM
>	(Larry Kaplan) writes:
>
>>While deadlock doesn't occur, random processes can still
>>die (of their own doing) when some malloc call fails to reserve 
>>swap space.
>
>Failing malloc is very different from being killed at random.
>
>The situation is well under control. Important processes can be programmed
>to try mallocing several times with exponential back-off. Even if a process
>dies, it can clean up its environment.
>

This requires significant modification of programs.  My proposals have not 
required any change to applications, only additional system code.  A general
solution does not require old applications to be rewritten.

>>Next, there are actually ways to handle the deadlock.
>
>Yes, it is always possible to resolve deadlock by human intervention.

Don't put words in my mouth.  I say below that you can write a daemon
(which is a program) to do this.  If you want to let a human do it, you can,
but that is suboptimal.

>>Even
>>if the deadlock actually occurred, you could suspend all the processes waiting
>>for swap space, and then mount some reserve filesystem.
>
>It is very strange that you have a reserve filesystem available for
>swapping. You should have already allocated such free space in advance
>as swap area.

This is only a convenience.  Depending on where you implement this code, you 
could simply check a high-water paging mark, as mentioned by others.  Holding 
some portion of your swap space in reserve is just another way to give you the 
opportunity to play these games when trouble starts.  Using a high-water mark
can work just as well.

>>Some care would then
>>be necessary to let the important jobs finish.  It may be necessary to continue
>>jobs selectively instead of all at once, to prevent a repeat of the deadlock.
>>Even if you need some more memory to get the mounting done, you could kill 
>>some non-critical system daemon that could be started later (like lpd or 
>>something).  Later on, you could decide to restart the daemon killed earlier, 
>>and/or unmount no longer used filesystems.  Eventually, you could return to
>>normal operation.
>
>Who takes care of all these things? Are you proposing to attach a
>knowledgeable person all the time? A person who understands what deadlock
>is seems to be very uncommon, even in this newsgroup.

Read the posting.  It says daemon.  This means a program.  It could also be
part of the kernel.  (How about not deriding the readership while you're at it.)

>>This is a little complicated but certainly doable and allows you to not
>>reserve swap space on memory allocation and to use a true COW fork().
>
>But it is not worth doing so. Use vfork.

This is your opinion (if I understand the sentence).  Making statements like 
this does not address the other benefits that many people seem to like 
about COW fork.
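
For reference, the semantics at stake can be demonstrated in a few lines:
with fork (COW or otherwise), the child's writes are invisible to the
parent, which is exactly what vfork gives up.  A hedged sketch; whether the
copy happens eagerly or on first write is the implementation detail under
debate, and the observable result below is the same either way.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* After fork, the child's writes must not be visible to the parent.
 * A COW implementation defers the page copy until the child's first
 * write; the visible behavior is identical. */
int child_write_is_private(void)
{
    char *buf = malloc(4096);
    if (buf == NULL)
        return -1;
    strcpy(buf, "parent");

    pid_t pid = fork();
    if (pid < 0) {
        free(buf);
        return -1;
    }
    if (pid == 0) {                   /* child: scribble on its copy */
        strcpy(buf, "child");
        _exit(0);
    }
    waitpid(pid, NULL, 0);            /* parent: wait, then check */
    int intact = (strcmp(buf, "parent") == 0);
    free(buf);
    return intact;                    /* 1 means parent's copy untouched */
}
```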

>>People may complain that this is not a truly
>>general solution, and I would agree.
>
>Vfork is the true solution.

True?? By whose standards?  This is what we are debating.  Making statements
like this without justification is worthless.  When I say it's not general, I 
mean that there are situations where this may not be appropriate.  There are 
most certainly situations where vfork() is not appropriate.  So much for being 
a true solution.

>>As a side note, on the large systems I work on, we don't do preallocation
>>and have never run out of paging (swap) space.  This is not to say that
>>we never will, but typical systems have on the order of at least 10 times as
>>much disk storage as main memory.  In some cases, as much as 100 times more.
>
>You have 100 times more swap space because you think it may be filled,
>don't you.

NO.  These numbers come from simply looking at the systems that people have 
running.  These numbers are true for almost all computers in general.  It may be 
that most people can't page on all their filesystems, but that's not what I 
said.  

>>I claim it is hard to fill that much disk
>>space with paging and swapping traffic and still have a usable system.  You'll
>>probably be thrashing to death long before that.
>
>As you may know, programs manipulating large arrays, if written properly,
>can use very large virtual space with little real memory without
>thrashing. That is why some of your systems are configured with 100 times
>more swap space, isn't it?

You are attempting to justify the reasons for the way I have my systems 
configured, when in reality this is the way just about everybody's systems are 
configured.  

The point about sparse matrix programs may be true to some extent.
These programs may or may not represent a significant portion of a machine's
workload and therefore have some bearing on the techniques selected.
Depending on the reference patterns of the programs though, if you really don't
reference parts of the matrix at all, then pre-allocation of swap space is
going to prevent the program from running when it would run fine without 
preallocation.  Even just reads from parts of the matrix wouldn't require
swap space allocation since the pages aren't dirtied.  This sounds quite
likely.

On inspection, it appears that one of our big systems currently in the field
has 1 gigabyte of physical memory and only about 6.4 gigabytes of disk storage.
They run sparse matrix problems.  They run lots of scientific codes.  They have
yet to have any problems with swap space.  

It may be that the 100 number is a little exaggerated.  Having disk storage 
being 10 times the amount of physical memory in large machines seems to be 
more the rule.  Disk servers, however, move more towards the 100 mark.

  

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/19/90)

In article <58227@bbn.BBN.COM> lkaplan@BBN.COM (Larry Kaplan) writes:

>>The situation is well under control. Important processes can be programmed
>>to try mallocing several times with exponential back-off. Even if a process
>>dies, it can clean up its environment.

>This requires significant modification of programs.

No. All you have to do is redefine malloc and realloc.

>My proposals have not 
>required any change to applications, only additional system code.  A general
>solution does not require old applications to be rewritten.

But your proposal is not a solution.

>>Yes, it is always possible to resolve deadlock by human intervention.

>Don't put words in my mouth.  I say below that you can write a daemon
>(which is a program) to do this.  If you want to let a human do it, you can,
>but that is suboptimal.

You may claim you can write a perfect AI, which I don't mind.

>>It is very strange that you have a reserve filesystem available for
>>swapping. You should have already allocated such free space in advance
>>as swap area.

>This is only a convenience.  Depending where you implemented this code, you 
>could simply check a high water paging mark as mentioned by others.  Holding 
>some portion of your swap space in reserve is just another way to give you the 
>opportunity to play these games when trouble starts.  Using a high water mark
>can work just as well.

But I don't want to play with trouble. I would rather play nethack, instead.

>Read the posting.  It says daemon.  This means a program.

Oops! A program? Are you joking? What if the program itself runs short
of pages?

>It could also be part of the kernel.

More reasonable proposal. It can be a dirty and unreliable workaround.

>>But it is not worth doing so. Use vfork.

>This is your opinion (if I understand the sentence).  Making statements like 
>this does not address the other benefits that many people seem to like 
>about COW fork.

I will use COW fork, of course, if it exists and it is necessary. But,
to fork-exec something, you should use vfork.
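
The fork-exec idiom being recommended looks like this; a minimal sketch
using the classic vfork contract (the child does nothing but exec, or _exit
on failure, so no address space ever needs copying):

```c
#include <assert.h>
#include <sys/wait.h>
#include <unistd.h>

/* fork-exec via vfork: no address space is copied and the parent is
 * suspended until the child execs or exits.  Returns the child's
 * exit status, or -1 on error. */
int spawn(const char *path, char *const argv[])
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execv(path, argv);            /* replaces the borrowed image */
        _exit(127);                   /* exec failed; never return */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```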

>>>People may complain that this is not a truly
>>>general solution, and I would agree.

>>Vfork is the true solution.

>True?? By whose standards?  This is what we are debating.  Making statements
>like this without justification is worthless.

I showed several justifications why simple fork is not appropriate
for fork-exec.

On the other hand, you showed no justification not to use vfork.

You only showed it is inelegant not to use vfork and insist on using
fork.

>There are most certainly situations where vfork() is not appropriate.

Have you shown any justification to claim so? Or, is it merely your
desire?

>>>As a side note, on the large systems I work on, we don't do preallocation
>>>and have never run out of paging (swap) space.  This is not to say that
>>>we never will, but typical systems have on the order of at least 10 times as
>>>much disk storage as main memory.  In some cases, as much as 100 times more.

>>You have 100 times more swap space because you think it may be filled,
>>don't you.

>NO.  These numbers come from simply looking at the systems that people have 
>running.  These numbers are true for almost all computers in general.

You want to say almost all computers in general have 10 to 100 times more
swap space than real memory? That is simply incorrect.

Very few systems have 10 times more swap space.

>It may be 
>that most people can't page on all their filesystems, but that's not what I 
>said.  

Then, what do you want to say?

You should make your point clear.

>>>I claim it is hard to fill that much disk
>>>space with paging and swapping traffic and still have a usable system.  You'll
>>>probably be thrashing to death long before that.

>>As you may know, programs manipulating large arrays, if written properly,
>>can use very large virtual space with little real memory without
>>thrashing. That is why some of your systems are configured with 100 times
>>more swap space, isn't it?

>You are attempting to justify the reasons for the way I have my systems 
>configured, when in reality this is the way just about everybody's systems are 
>configured.  

The thread of the vfork discussion began because I said your system without
vfork is broken.

Moreover, it is you who brought the configuration of your system into
the discussion.

So, why can't I refer to your system?

>The point about sparse matrix programs may be true to some extent.

Sparse matrix? I am not referring to such a thing.

>These programs may or may not represent a significant portion of a machine's
>workload and therefore have some bearing on the techniques selected.
>Depending on the reference patterns of the programs though, if you really don't
>reference parts of the matrix at all, then pre-allocation of swap space is
>going to prevent the program from running when it would run fine without 
>preallocation.

If you don't have to reference parts of a matrix, there is an efficient
algorithm to do it without extra swap space.

>Even just reads from parts of the matrix wouldn't require
>swap space allocation since the pages aren't dirtied.  This sounds quite
>likely.

You can do so with mmap, specifying the read-only option. A clever implementation
won't allocate extra swap space.
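
A sketch of that read-only mmap usage, in modern POSIX terms (PROT_READ with
MAP_PRIVATE; the helper name is illustrative).  Since the pages are never
dirtied, a paging system can always refetch them from the file instead of
reserving swap space for them:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a file read-only and touch every byte.  No page is dirtied, so
 * a reasonable implementation reserves no swap space: clean pages can
 * always be re-read from the file itself. */
long sum_file_readonly(const char *path, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        /* the mapping survives the close */
    if (p == MAP_FAILED)
        return -1;
    long sum = 0;
    for (size_t i = 0; i < len; i++)  /* read-only references */
        sum += p[i];
    munmap(p, len);
    return sum;
}
```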

BUT, such a thing has nothing to do with fork nor vfork.

>On inspection, it appears that one of our big systems currently in the field
>has 1 gigabyte of physical memory and only about 6.4 gigabytes of disk storage.
>They run sparse matrix problems.  They run lots of scientific codes.  They have
>yet to have any problems with swap space.  

You might have been lucky, or, you might have just overlooked the
problem.

>It may be that the 100 number is a little exaggerated.  Having disk storage 
>being 10 times the amount of physical memory in large machines seems to be 
>more the rule.  Disk servers, however, move more towards the 100 mark.

So what?

						Masataka Ohta

lkaplan@bbn.com (Larry Kaplan) (07/19/90)

To start with, I think we've about beaten this topic into the ground.  
We all have our preferences and no one is going to change them easily. :-)

Anyway, I'll address some unclear points and provide a little more 
justification than I have before.  I'm going to try and avoid making
deprecating (or insulting) remarks as some have done on this topic.

In article <5894@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
> (stuff about my proposed swap space monitor daemon)
>
>Oops! A program? Are you joking? What if the program itself runs short
>of pages?

Our kernel (and others, I am sure) allows sufficiently privileged users
(like root) to ask that some of their memory be wired (not paged).  This
facility could certainly be used for this daemon. I did say this in the 
original posting but some may have overlooked this.
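
In modern POSIX terms the wiring facility looks like mlock; this is a hedged
sketch, not the BBN kernel's actual interface:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Wire one page so it can never be paged out.  mlock is the modern
 * POSIX spelling; the kernel in the posting presumably had its own
 * call.  Returns 1 if the page was wired, 0 if the system refused
 * (privilege or lockable-memory quota), -1 on any other error. */
int wire_one_page(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    void *p = NULL;
    if (posix_memalign(&p, (size_t)pagesz, (size_t)pagesz) != 0)
        return -1;
    if (mlock(p, (size_t)pagesz) == 0) {
        munlock(p, (size_t)pagesz);   /* undo: this is only a demo */
        free(p);
        return 1;
    }
    int refused = (errno == EPERM || errno == ENOMEM || errno == EAGAIN);
    free(p);
    return refused ? 0 : -1;
}
```

A daemon built this way would wire its whole working set at startup, so
that the monitor itself can never be a victim of the shortage it watches for.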

>>It could also be part of the kernel.
>
>More reasonable proposal. It can be a dirty and unreliable workaround.

(Another obnoxious comment.)

>I showed several justifications why simple fork is not appropriate
>for fork-exec.

But the debate is COW fork vs vfork.

>On the other hand, you showed no justification not to use vfork.
>You only showed it is inelegant not to use vfork and insist on using
>fork.
>
>>There are most certainly situations where vfork() is not appropriate.
>
>Have you shown any justification to claim so? Or, is it merely your
>desire?

Let me attack this one now.  Let me start by saying that the problems
we have with a regular vfork are probably fairly machine specific.  However, 
some of this reasoning may soon apply to other machines if not already.
Our machine is a non-uniform memory architecture multiprocessor.  This means
that each node has memory local to it.  If you were to use vfork when forking
onto another processor, the child would be executing out of remote memory.
While caches help alleviate the performance penalty, making things local
is a much better idea.  By using COW fork, we always set up a local text
(initially empty) and stack segment, and even support COR (copy on reference) 
for remote memory marked INHERIT_COPY.  This reason, in addition to the 
other problems with vfork (such as the broken semantics people seem to agree 
upon), made it fairly clear that we should get rid of it.  To make vfork set up
local text anyway would eliminate most of the performance advantage it had.
Note that even for forks onto the same processor, the text segment 
manipulations aren't that expensive since in this case you do share the segment
and only have to set up page table entries on demand (or you could preallocate 
the page tables for the text).

>>>>As a side note, on the large systems I work on, we don't do preallocation
>>>>and have never run out of paging (swap) space.  This is not to say that
>>>>we never will, but typical systems have on the order of at least 10 times as
>>>>much disk storage as main memory.  In some cases, as much as 100 times more.
>
>>>You have 100 times more swap space because you think it may be filled,
>>>don't you.
>
>>NO.  These numbers come from simply looking at the systems that people have 
>>running.  These numbers are true for almost all computers in general.
>
>You want to say almost all computers in general have 10 to 100 times more
>swap space than real memory? That is simply incorrect.
>
>Very few systems have 10 times more swap space.

Please, please, please, I said disk space to real memory, not swap space.
Part of the point was that Mach is willing to page to all the disks whether
they are true swap partitions or regular UNIX filesystems.  So the ratio
is very relevant to Mach based systems.

>>>As you may know, programs manipulating large arrays, if written properly,
>>>can use very large virtual space with little real memory without
>>>thrashing. That is why some of your systems are configured with 100 times
>>>more swap space, isn't it?
>
>>You are attempting to justify the reasons for the way I have my systems 
>>configured, when in reality this is the way just about everybody's systems are 
>>configured.  
>
>The thread of the vfork discussion began because I said your system without
>vfork is broken.
>
>Moreover, it is you who brought the configuration of your system into
>the discussion.
>
>So, why can't I refer to your system?

You can most certainly refer to my system.  But you were suggesting reasons
why my system looks the way it does, and those reasons are simply NOT
true.

>>The point about sparse matrix programs may be true to some extent.
>
>Sparse matrix? I am not referring to such a thing.
>
>>Even just reads from parts of the matrix wouldn't require
>>swap space allocation since the pages aren't dirtied.  This sounds quite
>>likely.
>
>You can do so with mmap, specifying the read-only option. A clever implementation
>won't allocate extra swap space.
>
>BUT, such a thing has nothing to do with fork nor vfork.

But it does have to do with preallocation and COW fork vs vfork.  Again any
mmap calls require application code changes that the non-preallocating
scheme doesn't need.

>>On inspection, it appears that one of our big systems currently in the field
>>has 1 gigabyte of physical memory and only about 6.4 gigabytes of disk storage.
>>They run sparse matrix problems.  They run lots of scientific codes.  They have
>>yet to have any problems with swap space.  
>
>You might have been lucky, or, you might have just overlooked the
>problem.

Lucky?  Maybe.  Overlooked?  Not possible.  While this has been a very 
interesting discussion, and most of the necessary primitives exist in our
kernel to support these fancy emergency schemes, none of the higher-level things
have been implemented yet.  If they ran out of swap space, they (and we) 
would know.


mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/20/90)

In article <58307@bbn.BBN.COM> lkaplan@BBN.COM (Larry Kaplan) writes:

>Anyway, I'll address some unclear points and provide a little more 
>justification than I have before.

OK, now, at last, you present a justification.

Now, the debate on COW fork vs. vfork can begin. But be sure that vfork
is only useful just before exec.

Let me show that your justification is wrong.

>>Oops! A program? Are you joking? What if the program itself runs short
>>of pages?

>Our kernel (and others, I am sure) allows sufficiently privileged users
>(like root) to ask that some of their memory be wired (not paged).  This
>facility could certainly be used for this daemon. I did say this in the 
>original posting but some may have overlooked this.

Perhaps your kernel (but not all UNIX, I am SURE) may be able to
do so. But it has nothing to do with the problem. Can your kernel
force as-yet-unallocated pages to be copied into swap space
and wired?

>>I showed several justifications why simple fork is not appropriate
>>for fork-exec.
>
>But the debate is COW fork vs vfork.

Simple fork, I mean, includes COW fork.

>>Have you shown any justification to claim so? Or, is it merely your
>>desire?

>Let me attack this one now.  Let me start by saying that the problems
>we have with a regular vfork are probably fairly machine specific.  However, 
>some of this reasoning may soon apply to other machines if not already.

I don't think so. But, I don't want to fork the discussion. So, let's focus
on your architecture.

>If you were to use vfork when forking
>onto another processor, the child would be executing out of remote memory.
>While caches help alleviate the performance penalty, making things local
>is a much better idea.

You misunderstand the semantics of vfork. When you want a simple, general
purpose fork (including COW fork), it is often (but not always)
reasonable to use other processors for the child process.

But, if you are vforking, the parent process is suspended until the
child process does exit or exec.

So, there is no reason to run the child process on another processor.

You should, instead, arrange for the exec to be done on the other processor.

>To make vfork set up
>local text anyway would eliminate most of performance advantage it had.
>Note that even for forks onto the same processor, the text segment 
>manipulations aren't that expensive

Though it has nothing to do with the mainstream discussion, you
should learn more about the history of UNIX. Text segment copying
was never a reason for vfork. Vfork was introduced because copying
of large data/stack segments was time-consuming. Sharing of the read-only
text segment has a very old history.

>Please, please, please, I said disk space to real memory, not swap space.

Sorry about my misunderstanding.

>Part of the point was that Mach is willing to page to all the disks whether
>they are true swap partitions or regular UNIX filesystems.  So the ratio
>is very relevant to Mach based systems.

But, if you think you have free disk space, you should have allocated
it for swapping in advance.

>>>Even just reads from parts of the matrix wouldn't require
>>>swap space allocation since the pages aren't dirtied.  This sounds quite
>>>likely.

>>You can do so with mmap, specifying the read-only option. A clever implementation
>>won't allocate extra swap space.

>>BUT, such a thing has nothing to do with fork nor vfork.

>But it does have to do with preallocation and COW fork vs vfork.  Again any
>mmap calls require application code changes that the non-preallocating
>scheme doesn't need.

But, without mmap, how can you set values in the read-only matrix?

It cannot be part of the data segment, because your program practically
cannot have a huge number of initializer lines for the large matrix.

I thought it can only be a file output by another program.

						Masataka Ohta

lkaplan@bbn.com (Larry Kaplan) (07/20/90)

I wrote:
>>Our kernel (and others, I am sure) allows sufficiently privileged users
>>(like root) to ask that some of their memory be wired (not paged).  This
>>facility could certainly be used for this daemon. I did say this in the 
>>original posting but some may have overlooked this.

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
>Perhaps your kernel (but not all UNIX, I am SURE) may be able to
>do so. But it has nothing to do with the problem. Can your kernel
>force as-yet-unallocated pages to be copied into swap space
>and wired?

Why does this have nothing to do with the problem?  We were discussing methods
for handling swap space exhaustion without pre-allocation.  The ability to
wire memory is a very valid aid to solving the problem.  When you wire pages
on a non-preallocating system, there is no reason to copy them into the swap
area.  They will never go there.

>>If you were to use vfork when forking
>>onto another processor, the child would be executing out of remote memory.
>>While caches help alleviate the performance penalty, making things local
>>is a much better idea.

>You misunderstand the semantics of vfork. When you want a simple, general
>purpose fork (including COW fork), it is often (but not always)
>reasonable to use other processors for the child process.
>But, if you are vforking, the parent process is suspended until the
>child process does exit or exec.
>So, there is no reason to run the child process on another processor.
>You should, instead, arrange for the exec to be done on the other processor.

This is an interesting idea, and actually might make vfork useful for us.
However, I am unsure that the performance gained by maintaining the vfork
hack and creating a new "exec on processor n" syscall would be worth the effort.
With COW "fork on processor n" and regular exec, we probably get very close
to the same performance and have a syscall that is useful for other things
like parallel processing.  I think at this point we have alternative solutions
that each have benefits but, of course, the time and utility of implementing
each needs to be weighed.

>>To make vfork set up
>>local text anyway would eliminate most of performance advantage it had.
>>Note that even for forks onto the same processor, the text segment 
>>manipulations aren't that expensive

>Though it has nothing to do with the mainstream discussion, you
>should learn more about the history of UNIX. Text segment copying
>is never a reason for vfork. Vfork was introduced because copying
>a large data/stack segment was time-consuming. Sharing of the read-only
>text segment has a very long history.

I think you misunderstood this one.  I know what vfork was intended for.
Our system has other requirements, though.  Certainly text segment copying is
not a reason for vfork.  But in a NUMA multiprocessor, vfork (or fork) to
another node is a reason for text segment copying, though no copy is really
made except for the pages that are referenced on that node.  The same is
true for data and stack when going to another node; here, however,
copy-on-reference (or write) can be used.  Actually, you could say that the
way the text segment is brought over is also copy-on-reference.

I think it is apparent that vfork to another node in a NUMA multiprocessor
changes the semantics of the call enough to make it inappropriate.
vfork to the same node would be fine, but then how do you get ordinary UNIX
jobs (lots of serial processes like compiles) running on other nodes?
The "exec on processor n" call would work but is limited to this use only.
COW fork() and regular exec, on the other hand, do pretty well at this job,
and COW fork is very useful for real parallel programming.

>>Part of the point was that Mach is willing to page to all the disks whether
>>they are true swap partitions or regular UNIX filesystems.  So the ratio
>>is very relevant to Mach based systems.

>But, if you think you have free disk space, you should have allocated
>it for swapping in advance.

Why?  Then it wouldn't be available for users.  The name of the game here
is flexibility.  Why constrain swap space or user disk space when you can
just allow both to allocate from the same place?  If you are worried about
runaway user disk consumption, that's what quotas are for.

#include <std_disclaimer>
_______________________________________________________________________________
				 ____ \ / ____
Laurence S. Kaplan		|    \ 0 /    |		BBN Advanced Computers
lkaplan@bbn.com			 \____|||____/		10 Fawcett St.
(617) 873-2431			  /__/ | \__\		Cambridge, MA  02238

peter@ficc.ferranti.com (Peter da Silva) (07/23/90)

I don't think that M. Ohta is arguing against retaining COW fork, but in favor
of retaining vfork. In the case of multiprocessors, you would want to have
fork-on-other-CPU, exec-on-other-CPU, and vfork. Each has its uses.
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
<peter@ficc.ferranti.com>

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/23/90)

In article <58330@bbn.BBN.COM>
	lkaplan@BBN.COM (Larry Kaplan) writes:

>>Perhaps, your kernel (but not all UNIX, I am SURE) may be able to
>>do so. But it has nothing to do with the problem. Can your kernel
>>force yet-unallocated pages to be copied into swap space
>>and make them wired?

>Why does this have nothing to do with the problem?

Because you can't wire a yet-unallocated page.

>We were discussing methods
>for handling swap space exhaustion without pre-allocation.  The ability to
>wire memory is a very valid aid to solving the problem.

Wiring is not essential. It is only necessary to actually allocate
space.

>When you wire pages
>on a non-preallocating system, there is no reason to copy them into the swap
>area.  They will never go there.

Again, if a page is not allocated, there is nothing to wire.

>>So, there is no reason to run the child process on another processor.
>>You should, instead, arrange exec, to be done on other processor.

>This is an interesting idea, and actually might make vfork useful for us.
>However, I am unsure that the performance gained by maintaining the vfork
>hack and creating a new "exec on processor n" syscall would be worth the effort.

You were worrying very much about the copying effort for the text segment,
weren't you?

If you always exec on the other processor, the text segment on the original
processor is untouched and can perhaps be used later without accessing
global memory.

>I think it is apparent that vfork to another node in a NUMA multiprocessor 
>changes the semantics of the call enough to make it inappropriate.

Nothing is apparent to me.

>vfork to the same node would be fine, but then how do you get any UNIX
>jobs (lots of serial processes like compiles) running on other nodes.
>The "exec on processor n" would work but is limited to this use only.

"exec on processor n"? Why can't a regular exec be performed on some
other appropriate processor? Is a process id tied to a processor?

If you have an application on current UNIX which uses lots of parallel
processes with vfork and exec, you can run it on a multiprocessor
with full parallelism without changing the code.

>COW fork() and regular exec, on the other hand, do pretty well at this job 

As for COW-fork performance, which I am not so interested in,
someone said some workaround (pre-copying of several pages) is necessary
to gain sufficient speed.

As for the pre-allocation problem, only complex and incomplete
workarounds exist.

>and COW fork is very useful for real parallel programming.

Yes, you may have fork, which may be COW.

>>But, if you think you have free disk space, you should have allocated
>>it for swapping in advance.

>Why?  Then it wouldn't be available for users.  The name of the game here
>is flexibility.  Why constrain swap space or user disk space when you can
>just allow both to allocate from the same place.

Flexibility? It is an illusion. Actually, it is only complexity of
management. For a typical process, there is a positive correlation
between the amount of file space and the amount of swap space used.

>If you are worried about 
>runaway user disk consumption, thats what quotas are for.

That is not acceptable to me, but if you don't mind losing computation
results due to quota, you shouldn't mind a pre-allocating COW fork.
A pre-allocating fork is only as bad as disk quota.

						Masataka Ohta

jkenton@pinocchio.encore.com (Jeff Kenton) (07/23/90)

From article <5928@titcce.cc.titech.ac.jp>, by mohta@necom830.cc.titech.ac.jp (Masataka Ohta):
> 
> Nothing is apparent to me.
> 




- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      jeff kenton  ---	temporarily at jkenton@pinocchio.encore.com	 
		   ---  always at (617) 894-4508  ---
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

tif@doorstop.austin.ibm.com (Paul Chamberlain) (07/23/90)

In article <5928@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
>In article <58330@bbn.BBN.COM>
>	lkaplan@BBN.COM (Larry Kaplan) writes:
>>>[ deeply nested crud that we're all tired of reading]

Why not agree to disagree?

Some people don't want their programs to run unless they are absolutely
guaranteed that they will finish.  The rest of us like things to run as
often as possible, even when troubled with the occasional failure due to
lack of resources.

Paul Chamberlain | I do NOT represent IBM         tif@doorstop, sc30661@ausvm6
512/838-7008     | ...!cs.utexas.edu!ibmaus!auschs!doorstop.austin.ibm.com!tif

fetter@cos.com (Bob Fetter) (07/25/90)

In article <2845@awdprime.UUCP> tif@doorstop.austin.ibm.com (Paul Chamberlain) writes:
>In article <5928@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
>>In article <58330@bbn.BBN.COM>
>>	lkaplan@BBN.COM (Larry Kaplan) writes:
>>>>[ deeply nested crud that we're all tired of reading]
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  Not me.  I've found this discussion to be interesting, and it has
given me several reasons and occasions to stop and rethink what
I thought I already knew--in the process (re-)learning some things.
Hey, it's the real *fun* of this business, at least for me.

>
>Why not agree to disagree?
>
>Some people don't want their programs to run unless they are absolutely
>guaranteed that it will finish.  The rest of us like things to run as
>often as possible even when troubled with the occasional failure due to
>lack of resources.

  Ah, but from the OS platform standpoint, an occasional
failure, especially one due to lack of resources, is pretty important.  As
far as conservation of swap space goes, it (if the system runs with a
fixed swapfile mechanism) is a finite resource and should be
husbanded with appropriate care ('fixed', of course, at a given point
in time; not to say that swapfile space can't be grown/shrunk).
  From an application program standpoint, resource exhaustion/failure/restart
can be a survivable sequence in many cases -- though long-running,
compute-bound, memory-expensive apps, of course, are not ones you want
this to happen to.

  If the OS platform dies/locks up end-user processes, due to resource
shortfalls or other reasons, not just one process suffers.  This
really isn't just a case of "not wanting a program to run unless
they're sure it WILL run", but wanting OTHER programs to run -- read:
user processes.  But, enough on this, I guess.

  As to vfork(), please correct me if I'm wrong, but I have the mental
model that vfork can be considered a degenerate case of lightweight
processes (LWP -- let me term this degenerate case a "pseudo-LWP", or
pLWP).  I term it degenerate in that

  1 - the main HWP is suspended until the exec*() of the pLWP, because
  2 - the stack is logically shared between the two, and
  3 - there can be only one pLWP paired to the main HWP at a time.

(One could say that the existence of the pLWP is bounded by the vfork()
call for its creation and the exec() call terminating its existence
as a *pLWP*, transforming it into a full-blown HWP.)

  I see and understand the argument that one doesn't want to create a
copy of the stack for this pLWP since, in the context of *this
particular case*, the end-goal is to exec() a new process with its own
resources, so why copy what one will toss away (using finite resources
in the process -- the infamous swapfile discussions to-date), etc.  To
effect this "savings", I can see why one then needs to suspend the
main HWP, so one doesn't get "dueling stack frames syndrome".  Having
one pLWP per HWP like this appears to be painful enough, hence
limit 3 above.

  My understanding is that, using vfork(), all segments known to the
pLWP are exactly those known to the main HWP and there is no COW being
done.  As such, the pLWP is operationally the same as a "regular" LWP
in this regard, except, of course, that the pLWP and the main HWP
share the same stack segment ("regular" LWP have their own stacks).
It seems to me that it is this sharing of the stack segment which is
the issue (to use the same with its attendant risks, or perform COW
a-la fork() with ITS problems such as swapfile allocation, etc), along
with the "hidden" semantics of vfork() creating a pLWP when fork()
doesn't and the confusion this difference may cause.

  Standing back a bit (hopefully not too far back), I get the
impression that the overall issue just might be thought of as being
the conversion of a LWP to a HWP.  Now, it's not exactly that as of
this point in time as regards the vfork() call, in that "regular" LWPs
have their own stack, whereas a pLWP is a parasite on its main
HWP.  But...

  If/when actual direct kernel support for LWP is available, why not
just then treat vfork() as the creation of a LWP, suspending the main
HWP as done now and mapping all segment descriptors/references from
the LWP back to the HWP (including the stack) and, when the exec()
call occurs,
  1 - remap all segment descriptors for the LWP to new, appropriately
      sized segments for the exec()-ed process,
  2 - remove the suspension on the main HWP, putting in the appropriate
      return value from the vfork() call.

  This puts the LWP/task management into the system dispatcher/context
switcher, removing the existing runtime inter-process
implementation/multiplexing technique.

  I don't see (additional) usage of swapfile space in the above, which
appears to be one of the major points under discussion, during the
time between vfork() and exec() in the 'child'.  It appears to also
only allocate memory (and associated swapfile resource) as is required
by the new HWP, not carrying over the 'parent' HWP load for a transient
time only to release it effectively unused.

  After all, can't one consider a LWP to be just a HWP which has one or
more of its segments bound to/shared with another HWP?  Or am I
missing something here?  (Issues like files, signals, etc. would of
course percolate as done today.)  I seem to remember an article in
this chain (or a related one) which discussed segment management
system calls which could be implemented.  Am I on a similar train of
thought?

  -Bob-