[net.arch] paging and loading

dwc@hou2b.UUCP (D.CHEN) (09/16/86)

>>3) The real reason for virtual memory (and the one which won't go
>>away when memories get big) is that I can quickly load [a]
>>working set of a large program ...

>... but it seems to me that you'll use up [the faster startup] and
>a lot more in page faults.  Since you are reading the program
>piecemeal into virtual memory you are going to be a lot slower
>because of the extra seek and rotational delays. It is sort of like
>a guy driving on a surface street, instead of going over and getting
>on the freeway. You get started faster, but you hit all those
>traffic lights. The only case that is valid is if the overwhelming
>majority of the pages in the program are never referenced.

another analogy (and one that i've been using) is this:  imagine
that you have to xerox ten sets of notes and each set consists of
ten pages.  if there is a large setup time on the machine, you would
like to copy the 100 pages in one shot.  even without a large setup
time, if there are other people on line, you would probably want to
copy all of your work in one shot instead of getting on the end of
the line every 10 pages.

the answer is "it depends".  if your executable is on a unix file
system, you probably would have to do multiple i/os to load the
entire address space anyway.  however, if it is contiguous on some
swap device, then it depends on program behavior.

one important aspect that people rarely consider when talking about
response time and loading is what happens if, in the process of demand
paging (and loading of "working sets"), memory runs out?  what are
the implications on response time then?  it would seem that without
any other aids, pure demand paging is a clear loser in this situation.

danny chen
ihnp4!hou2b!dwc

jlg@lanl.ARPA (Jim Giles) (09/17/86)

In article <832@hou2b.UUCP> dwc@hou2b.UUCP (D.CHEN) writes:
>...
>another analogy (and one that i've been using) is this:  imagine
>that you have to xerox ten sets of notes and each set consists of
>ten pages.  if there is a large setup time on the machine, you would
>like to copy the 100 pages in one shot.  even without a large setup
>time, if there are other people on line, you would probably want to
>copy all of your work in one shot instead of getting on the end of
>the line every 10 pages.
>...

This is a remarkably good analogy!  It gives exactly the flavor of
computing on modern supercomputers.  Working with the Cray, for example,
is like having a xerox machine that can duplicate 1000 pages a second
but takes 1000 seconds to get to - even when no one else is in line for
it.  You clearly don't want to divide your 100 page xerography task
into 10 page chunks on such a machine.  You also don't want to force
Cray users to dribble their code into the machine in tiny amounts
for the same reason.

Note that the computational speed of the Cray really does exceed I/O speed
by about a factor of 10^6 - page faults in this kind of environment are
EXTREMELY costly.  If you did page the memory on such a system, it would be
desirable to make the pages large.  That way you can take advantage of the
fact that most I/O comes from consecutive disk locations - therefore there
are fewer seeks with large page size.
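
To put rough numbers on that, here is a small back-of-the-envelope sketch
in C; the seek, rotation, and transfer figures are assumptions chosen only
to show the shape of the tradeoff, not measurements of any particular drive:

    #include <stdio.h>

    int main(void)
    {
        double seek_ms     = 16.0;     /* assumed average seek time           */
        double half_rot_ms = 8.3;      /* assumed average rotational latency  */
        double kb_per_ms   = 1.0;      /* assumed sustained transfer, ~1 MB/s */
        double total_kb    = 4096.0;   /* amount of program/data to load      */
        int page_kb;

        for (page_kb = 1; page_kb <= 64; page_kb *= 4) {
            double n_io = total_kb / page_kb;   /* one seek per transfer      */
            double t_ms = n_io * (seek_ms + half_rot_ms) + total_kb / kb_per_ms;
            printf("%2d KB pages: %5.0f transfers, %7.0f ms total\n",
                   page_kb, n_io, t_ms);
        }
        return 0;
    }

The pure transfer time is the same in every case; only the per-transfer
seek and rotation overhead changes, which is why larger pages win here.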

J. Giles
Los Alamos

tuba@ur-tut.UUCP (Jon Krueger) (09/18/86)

In article <832@hou2b.UUCP> dwc@hou2b.UUCP (D.CHEN) writes:
>...one important aspect that people rarely consider when talking about
>response time and loading is what happens if, in the process of demand
>paging (and loading of "working sets"), memory runs out?  what are
>the implications on response time then?  it would seem that without
>any other aids, pure demand paging is a clear loser in this situation.
>
>danny chen
>ihnp4!hou2b!dwc

Paging systems I'm familiar with handle this in (at least) three ways:

1) When many processes are demanding memory, and system-wide available free
memory becomes low in consequence, per-process limits on physical memory are
lowered.  The operating system shrinks individual process's working sets.
"Times are lean," says the pager, "and we're all going to have to tighten
our belts for a while".  Response time degrades, but smoothly.

2) If all processes have been shrunk to their lowest permissible limits, and
more memory is still needed, someone gets swapped.  Typically someone who
was holding memory but not executable.  "All of us on this lifeboat won't
make it, so some of us have to go into suspended animation until we are
rescued".  This is a mixed paging/swapping system, which shares physical
memory with paging until the sharing gets too thin to do anybody any good.
Then it swaps, thus lowering the demand.  For the survivors, response time
degrades smoothly again.

3) If most processes are swapped out, you're still paging like mad, and
you've run out of good swap candidates, you lose.  A bad candidate gets
swapped out. Lousy response time for him and every one else.  On the other
hand, such cases are examples of trying to substitute virtual memory for
adequate physical memory.  This is also known as belief in magic, making the
wrong tradeoffs, and typical lousy system management.  So I claim it's not
vm's fault, it did the best it could.
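
A minimal sketch of those three stages as a page-reclaim policy (the process
table, thresholds, and numbers below are invented for illustration; no real
pager is this simple):

    #include <stdio.h>

    #define NPROC 4

    struct proc {
        const char *name;
        int resident_pages;   /* pages currently held            */
        int min_pages;        /* lowest permissible working set  */
        int runnable;         /* 0 = blocked, but holding memory */
        int swapped;
    };

    static struct proc ptab[NPROC] = {
        { "editor",    40, 16, 1, 0 },
        { "compiler", 200, 64, 1, 0 },
        { "daemon",    30, 12, 0, 0 },   /* idle: a good swap candidate */
        { "shell",     20,  8, 1, 0 },
    };

    /* Try to recover 'need' pages, preferring the gentlest measure. */
    static void reclaim(int need)
    {
        int i, got = 0;

        /* Stage 1: shrink everyone toward their minimum working set. */
        for (i = 0; i < NPROC && got < need; i++) {
            int spare = ptab[i].resident_pages - ptab[i].min_pages;
            if (!ptab[i].swapped && spare > 0) {
                int take = spare < need - got ? spare : need - got;
                ptab[i].resident_pages -= take;
                got += take;
            }
        }

        /* Stage 2: swap out processes holding memory but not runnable. */
        for (i = 0; i < NPROC && got < need; i++) {
            if (!ptab[i].swapped && !ptab[i].runnable) {
                got += ptab[i].resident_pages;
                ptab[i].resident_pages = 0;
                ptab[i].swapped = 1;
            }
        }

        /* Stage 3: out of good candidates - swap a runnable process. */
        for (i = 0; i < NPROC && got < need; i++) {
            if (!ptab[i].swapped) {
                got += ptab[i].resident_pages;
                ptab[i].resident_pages = 0;
                ptab[i].swapped = 1;
            }
        }
        printf("reclaimed %d of %d pages\n", got, need);
    }

    int main(void) { reclaim(120); return 0; }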

-- 
--> Jon Krueger
uucp: {seismo, allegra, decvax, cmcl2, topaz, harvard}!rochester!ur-tut!tuba
Phone: (716) 275-2811 work, 235-1495 home	BITNET: TUBA@UORDBV
USMAIL:  Taylor Hall, University of Rochester, Rochester NY  14627 

cdshaw@alberta.UUCP (Chris Shaw) (09/19/86)

In article <7597@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>
>Note that the computational speed of the Cray really does exceed I/O speed
>by about a factor of 10^6 - page faults in this kind of environment are
>EXTREMELY costly.  If you did page the memory on such a system, it would be
>desirable to make the pages large.  That way you can take advantage of the
>fact that most I/O comes from consecutive disk locations - therefore there
>are fewer seeks with large page size.
>
>J. Giles
>Los Alamos

This is a good point. One should tailor the paging to the machine. However,
simply saying 
  "x is sooo expensive. gosh. Better not do x, because none of the advantages
	of x are attainable"

.. is not a very scientific attitude. It lies in that fuzzy realm of engineering
judgement. Why doesn't someone simulate the Cray on a slower machine that DOES
have paging, slowing the paging to an appropriately low rate, and compare
throughput? Run a VAX from floppy disk as paging store!
Do an experiment! Get some numbers!! (Maybe this is silly, but you get the idea)

Why? Because one's intuition about performance or throughput is usually wrong.
The reason is that the systems we are considering are too complex, and
looking at paging speed vs CPU speed alone is far too simple an analysis
to base a large design decision upon.

In fact, more recent news articles from people on this subject seem to imply
that VM may make you lose 10% on memory references, but the job of using
the machine without VM is so horrendous that you lose all the advantages.

People who are saying "the 205 runs programs x, y, and z faster without VM than
with VM" are looking at only one job at a time. This is fine if your 205 does
ONLY 1 thing all day, every day. Most machines run a mix of jobs, however,
and in this much larger context, individual program performance is far less
important than overall throughput. Anyone who was around in the days of
"batch only" knows this argument inside and out. Batch is fine if you want
to maximize individual program performance, but overall turnaround time is 
dreadful, since ALL things sit in the queue, including the 1/2 second
practice runs.

The point? Looking at single-job performance, no matter how large the job, is 
not good enough. It's the old "1 benchmark doesn't tell anything" rule. You
have to look at a large range of programs before making a decision.
This is also true in design. Running experiments is usually a very rude
awakening: things you thought were significant aren't, and vice versa.
Therefore, not designing something into a machine because you think it hits
performance may be a lose if you look at only 1 type of performance, and not
the overall performance picture.


Chris Shaw    cdshaw@alberta
University of Alberta
Bogus as HELL !

rb@cci632.UUCP (Rex Ballard) (09/22/86)

In article <832@hou2b.UUCP> dwc@hou2b.UUCP (D.CHEN) writes:
>>>3) The real reason for virtual memory (and the one which won't go
>>>away when memories get big) is that I can quickly load [a]
>>>working set of a large program ...
>
>>... but it seems to me that you'll use up [the faster startup] and
>>a lot more in page faults.  Since you are reading the program
>>piecemeal into virtual memory you are going to be a lot slower
>>because of the extra seek and rotational delays. It is sort of like
>>a guy driving on a surface street, instead of going over and getting
>>on the freeway. You get started faster, but you hit all those
>>traffic lights. The only case that is valid is if the overwhelming
>>majority of the pages in the program are never referenced.

In computing, there is an apt analogy to the "freeway" described:
the single process, single processor, single task, single level,
single loop program, in which data is accessed in a nearly "random"
or "large loop" fashion.  For applications of that kind, such as
scientific matrix manipulation and calculation on single-processor
systems like the Cray 1, VM is definitely a lose.

The next question is, is this model still appropriate?  Scientific
and number crunching applications are finding a better home in
multi-processor environments, which effectively become multi-tasking
systems even when only a single "program" is being run.

>another analogy (and one that i've been using) is this:  imagine
>that you have to xerox ten sets of notes and each set consists of
>ten pages.  if there is a large setup time on the machine, you would
>like to copy the 100 pages in one shot.  even without a large setup
>time, if there are other people on line, you would probably want to
>copy all of your work in one shot instead of getting on the end of
>the line every 10 pages.

Ok, let's continue with this.  Do you really want to make the person
who needs two copies wait until your "batch job" is complete?  It
may be more desirable to have a second machine for the "smaller jobs",
or to split the 100 page copy among more machines, by making one
copy and putting that on another machine.  If you make the "little
jobs" wait, then they are more likely to collect many "little jobs"
to get one "big job" that is "worth the wait".

I even lean toward the "grocery store" analogy.  Many such stores
have "express lanes" so that the person with 2 items doesn't have
to wait for five people who are buying for the month.  The "bulk
buyers" line may also be staffed with a bagger, a scanner, and
a nice conveyer belt, while the "express lane" might only be a
counter and a manual cash register.  Many stores adhere to a "3
person" rule, where clerks are added as the number of people
in a line exceeds 3, be they express or bulk.  Without this
technique, people would only come to the grocery store when
they were buying large quantities, and go to other stores for
small purchases.

>the answer is "it depends".  if your executable is on a unix file
>system, you probably would have to do multiple i/os to load the
>entire address space anyway.  however, if it is contiguous on some
>swap device, then it depends on program behavior.

Some Unix variations, which are specifically designed to support
VM, keep executable binaries in contiguous form.  The read-only
portion need not be mapped or swapped.

>one important aspect that people rarely consider when talking about
>response time and loading is what happens if, in the process of demand
>paging (and loading of "working sets"), memory runs out?  what are
>the implications on response time then?  it would seem that without
>any other aids, pure demand paging is a clear loser in this situation.

Depending on the mechanism used, this might require up to 500 processes
to be running simultaneously, assuming only a modest 1 meg memory
of 1K pages.  There may be an argument for "request without fault"
type accesses, where the OS could be advised by the application that
a new page will be needed in the near future.
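
A sketch of what such a "request without fault" interface might look like;
vm_advise() is a hypothetical call invented here for illustration, not an
existing system interface:

    #include <stddef.h>
    #include <stdio.h>

    enum vm_hint { VM_WILLNEED, VM_DONTNEED };

    /* Hypothetical call: ask the OS to start reading in (or to release)
     * the pages covering [addr, addr+len).  Stubbed here so the sketch
     * compiles; a real kernel would queue the read-ahead and return. */
    static int vm_advise(void *addr, size_t len, enum vm_hint hint)
    {
        (void)addr; (void)len; (void)hint;
        return 0;
    }

    /* While working on the current record, hint that the next one will
     * be needed, so any paging overlaps useful computation instead of
     * stopping the program at a fault. */
    static void process_record(char *buf, size_t reclen)
    {
        vm_advise(buf + reclen, reclen, VM_WILLNEED);
        /* ... parse and process buf[0..reclen-1] here ... */
        printf("processed %lu bytes\n", (unsigned long)reclen);
    }

    int main(void)
    {
        static char records[8192];
        process_record(records, 4096);
        return 0;
    }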

>danny chen
>ihnp4!hou2b!dwc

There are several types of processing.  Batch, interactive, pipelined,
I/O intensive, and transaction.  Some unpleasant experiences with
a non-virtual memory transaction processing system have given good
cause to prefer VM for this type of application.  Similar experiences
with interactive applications also lead to the same conclusion.
Even a batch processing compiler that attempted to do "hand optimised"
overlays in a VM environment (it thrashed itself to death) tends to
make a case for VM.

Most applications spend 90% of their time executing only 10% of their
code.  Interactive and transaction processing applications spend
nearly 95% of their time in a "wait for I/O, parse, loop" with an
occasional "do special purpose 'case' processing".  There are more
than a few stories of such applications where swapping that involved
a loop smaller than 2K caused serious degradation due to the 64K to
1Mb "tails" that would get swapped in with them.

Perhaps when it is possible to get 4 Gigabytes of 50ns ram for
$200-$300 that will require only a few watts, VM will become
unnecessary.  Even then however, the lack of need for "heap compaction"
and the ability to "remap" instead of "copy" data from one place
to another will continue to be a win for most applications.

Even this statement should have :-)'s all over it.  I remember thinking
that 2K was a lot of memory on my VIP; the transition from PDP-8
to PDP-11 thrilled many because of the thought of 64K of addressing
space.  Microsoft's Bill Gates couldn't imagine why anyone would
need more than 640K with MS-DOS, and even Motorola was surprised
to discover that some systems couldn't fit in the 16 megabyte
virtual address space of the 68010.

There is also the issue of software costs in relation to the
system costs.  When adding an enhancement requires 10-50 times
the actual enhancement cost to "make room" for the enhancement,
the long-term systems costs can get out of hand very quickly.

With complete, detailed structure charts, data-flow diagrams
and data-hierarchy, it is possible to "manually" support
anything from simple "physical=logical" addressing, bank switching,
overlays, segment registers, and/or swapping to "virtual memory
with virtual libraries".  For "automatic" support via linkers
and compilers, however, the benefits of "virtual memory" are
difficult to match.

Finally, one of the main trade-offs of VM is the TLB lookup time.
When the processor is spending 10% of its time "looking for
something to do", CPU speed becomes much less important than
memory management, disk caching, and co-processor interlock.
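
As a rough illustration of where a figure like that 10% can come from, here
is a small cost model; the reference rate, miss rates, and miss penalty are
assumptions for illustration, not measurements of any machine:

    #include <stdio.h>

    int main(void)
    {
        double base_cpi     = 1.0;    /* cycles per instruction, TLB ignored */
        double refs_per_ins = 1.3;    /* memory references per instruction   */
        double miss_penalty = 20.0;   /* cycles to walk the tables on a miss */
        double miss_rate;

        for (miss_rate = 0.001; miss_rate <= 0.05; miss_rate *= 2) {
            double cpi = base_cpi + refs_per_ins * miss_rate * miss_penalty;
            printf("TLB miss rate %5.3f -> effective CPI %.2f (%.0f%% slower)\n",
                   miss_rate, cpi, (cpi / base_cpi - 1.0) * 100.0);
        }
        return 0;
    }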

jlg@lanl.ARPA (Jim Giles) (09/25/86)

In article <78@alberta.UUCP> cdshaw@pembina.UUCP (Chris Shaw) writes:
>People who are saying "the 205 runs programs x, y, and z faster without VM than
>with VM" are looking at only one job at a time. This is fine if your 205 does
>ONLY 1 thing all day, every day. Most machines run a mix of jobs, however,
>and in this much larger context, individual program performance is far less
>important than overall throughput. Anyone who was around in the days of
>"batch only" knows this argument inside and out. Batch is fine if you want
>to maximize individual program performance, but overall turnaround time is 
>dreadful, since ALL things sit in the queue, including the 1/2 second
>practice runs.

There are places where you really want to maximize individual program
performance.  Take global weather modelling for example: these people have
one big program (that must produce results faster than real time or the
result is not very interesting).  Other programs running in the system
at the same time are incidental - or even counter-productive.  Yes, there
is a trade-off between throughput and individual program speed.  My point
is that some applications don't care a bit about throughput - the machine
was purchased for the purpose of running one particular program as fast
as possible.

J. Giles
Los Alamos

jlg@lanl.ARPA (Jim Giles) (09/25/86)

In article <389@cci632.UUCP> rb@ccird1.UUCP (Rex Ballard) writes:
>Most applications spend 90% of their time executing only 10% of their
>code.  Interactive and transaction processing applications spend
>nearly 95% of their time in a "wait for I/O, parse, loop" with an
>occaisional "do special purpose 'case' processing".  There are more
>than a few stories of such applications where swapping that involved
>a loop smaller than 2K caused serious degradation due to the 64K to
>1Mb "tails" that would get swapped in with them.
>
I've seen codes that spend 99% of their time in just 1-2% of the code.
So what?  The code only occupies 2% of the memory image of the program.
The rest of the memory image is data, and 90% of it is being processed
repeatedly by that tight loop in the code.

Yes, when your job mix is composed mainly of small and/or interactive
processes, by all means you should have a VM system.  I question the
need for VM in contexts where this is not true.

J. Giles
Los Alamos

rb@cci632.UUCP (Rex Ballard) (09/26/86)

In article <7963@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>In article <389@cci632.UUCP> rb@ccird1.UUCP (Rex Ballard) writes:
>>Most applications spend 90% of their time executing only 10% of their
>>code.  Interactive and transaction processing applications spend
>>nearly 95% of their time in a "wait for I/O, parse, loop" with an
>>occaisional "do special purpose 'case' processing".  There are more
>>than a few stories of such applications where swapping that involved
>>a loop smaller than 2K caused serious degradation due to the 64K to
>>1Mb "tails" that would get swapped in with them.
>>
>I've seen codes that spend 99% of their time in just 1-2% of the code.
>So what?  The code only occupies 2% of the memory image of the program.
>The rest of the memory image is data, and 90% of it is being processed
>repeatedly by that tight loop in the code.
>
>Yes, when your job mix is composed mainly of small and/or interactive
>processes, by all means you should have a VM system.  I question the
>need for VM in contexts where this is not true.
>
>J. Giles

Actually, when you have a lot of small processes, segmentation and
paging are about equal in terms of trade-offs, especially when
processes are extremely small.

When large tasks are being run in a multi-tasking environment,
the advantages of paging begin to be more dramatic.

If a task is run to completion without swapping, overlays, or
segmentation, and the OS is smart enough to avoid running more
processes than currently available memory can hold, and each task
can be run in a reasonable amount of time (1/2 second or less),
and there is a separate processor to handle the I/O, then it
is possible to get superior performance.

Having some hands-on experience with all three types of systems,
most of my observations tend to favor paging for multi-tasking.
100% memory resident subsystems are also a big win without paging.

Get a 300meg disk and a 300meg ram, treat shared data in the same
area, using signals, semaphores, and queues, "flush the world"
asynchronously, and the overhead becomes the drive interrupt handler.
Even under these conditions the "tagging" of memory references will
reduce the number of "flushes", since only those "pages" written to
since the last flush are affected, rather than "read only this time
around" sectors.  Tagging can be done in software, but it wouldn't be
as fast as doing it in hardware.
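
A minimal sketch of that "flush only what was written" idea; the page table
layout, flush routine, and numbers are invented for illustration:

    #include <stdio.h>

    #define NPAGES 8

    struct page {
        int frame;     /* which RAM frame backs this page            */
        int dirty;     /* set by hardware (or software) on a write   */
    };

    static struct page pages[NPAGES];

    static void write_page_to_disk(int frame)
    {
        printf("flushing frame %d\n", frame);   /* stand-in for real I/O */
    }

    /* Asynchronous "flush the world": only dirty pages go out. */
    static void flush_world(void)
    {
        int i, flushed = 0;

        for (i = 0; i < NPAGES; i++) {
            if (pages[i].dirty) {
                write_page_to_disk(pages[i].frame);
                pages[i].dirty = 0;
                flushed++;
            }
        }
        printf("%d of %d pages flushed\n", flushed, NPAGES);
    }

    int main(void)
    {
        int i;
        for (i = 0; i < NPAGES; i++)
            pages[i].frame = i;
        pages[2].dirty = 1;         /* pretend two pages were written */
        pages[5].dirty = 1;
        flush_world();
        return 0;
    }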

One of the "convincers" for me was seeing a "task segmented" system
running a bunch of "named pipe" transaction processing applications.
When the system was fully configured, over 200 tasks might be
running at the same time.  Every few seconds, the system would
go through the "Big swap", during which almost nothing else could
run.  Converting to a paged system actually increased performance
by a couple orders of magnitude, in spite of slower memory accesses
due to TLB translation.

Think about some of the "tails" that come with most well-written
applications.

Argument parsing - do it at start up, probably not used again.
Buffer initialization - you might not need this more than once.
File opening - one of those things you don't want to do too often.
Error handling - lots of messages you hope you'll never use.
State machines - typically only one choice executed at a time.
Nested subroutines - Only the "active path" is needed.
Default linked libraries - Do you always use the floating point
	support when you call printf()?

Do you really want to swap all of that every time you have to swap?
Even ramdisk management schemes require the cost of copying in
the appropriate task, in its entirety, before executing.

If the average application on a system is only a modest 40K, the
working set in a task-segmented system is still limited to
a smaller number of tasks than in a paged system with even 4K
pages.

One feature I've always wanted to see is mapping of sharable
library executables in a paged system.  Popular subroutines
like "printf" are often very large (6-10K on some systems)
and usually not shared.

Disk cache paging can also be useful.  This is different from
CPU/memory access paging in that things like directory entries,
commonly used data files, and similar frequently used data can
be kept in core.  This is real handy when running several unix
shell scripts.

elg@usl.UUCP (Eric Lee Green) (09/27/86)

In article <78@alberta.UUCP> cdshaw@pembina.UUCP (Chris Shaw) writes:
>People who are saying "the 205 runs programs x, y, and z faster without VM than
>with VM" are looking at only one job at a time. This is fine if your 205 does
>ONLY 1 thing all day, every day. Anyone who was around in the days of
>"batch only" knows this argument inside and out. Batch is fine if you want
>to maximize individual program performance, but overall turnaround time is 
>dreadful, since ALL things sit in the queue, including the 1/2 second
>practice runs.

>Chris Shaw    cdshaw@alberta

A rehash of past arguments, about working through the huge-matrix problem:
   "By the time we reach the end of the list, the front of the list will
have been paged out".
    If the front of the matrix has been paged out, that means that the
matrix was too big to fit in physical memory. That means that with no
paging, the performance would be 0 -- a 100% degradation.
  "We can take advantage of disk seek times -- just put the data on
close tracks."
    All the disk systems I have ever seen become somewhat fragmented
over time. 
    Also note that, because of intelligent i/o controllers, pulling a
page off of disk takes virtually no CPU time, since the processor is
running another process while the first one is blocked waiting for the
i/o processor to bring in the page (assuming a multi-tasking system --
Crays and such DO multi-task, no?). In other words, taking advantage of
explicit multi-processing.

Most of the other arguments seen have been spurious line noise :-)
:-). The only one which is really valid is the table lookup times
necessary for paging. Worst case: the large matrix previously
discussed.  For n=page size, every n bytes we have to fetch a new
associative table entry, upon which one-two levels of a tree must be
traversed. There is also the overhead of paging in the MMU lookup
table the initial time. With today's faster disk drives, larger pages
may help, both in increasing the time N between memory-to-associative
fetches, and in decreasing the size of the MMU lookup table. If 4k was
OK back in Multics days, 8K or even 16K might work today. Also note
that this is WORST CASE. Your mileage will probably be better.

I must conclude that for 4k word pages, every 4k memory operations, a
2-4 cycle overhead would be necessary, or a 4/4096 reduction in
performance. If we are talking about a 100 MIPS machine, it would now
only be running at about 99.9 MIPS. Like wow. Especially considering that
because associative lookups probably run about the same speed as
checking bounds registers, the memory speed would be otherwise about
the same.
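
The arithmetic above, spelled out in a few lines (the 4-cycle figure and the
one-extra-lookup-per-page worst case come from the text; the rest is an
assumption for illustration):

    #include <stdio.h>

    int main(void)
    {
        double page_ops    = 4096.0;  /* memory operations per page        */
        double miss_cycles = 4.0;     /* worst-case extra cycles per page  */
        double raw_mips    = 100.0;

        double overhead = miss_cycles / page_ops;        /* about 0.1%     */
        double eff_mips = raw_mips * (1.0 - overhead);

        printf("overhead %.4f%%, effective rate %.2f MIPS\n",
               overhead * 100.0, eff_mips);
        return 0;
    }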

Another argument was that paging messes up pipelining.  Correct me if
I'm wrong (we obviously don't have a Cray handy :-), but a Cray only
does pipelined vector operations upon the vector registers (although I
think filling and saving those registers are pipelined operations,
also). Requiring vectors to be page-aligned and having the pages the
same size as the registers or a binary size larger would eliminate any
pipeline lossage (just do your virtual memory grokking BEFORE you get
your pipeline rolling). Since the compilers already have to do
considerable grokking to vectorize code, that doesn't seem too
extreme. And, of course, MIPS machines & others prove that for
ordinary code, pipelining and paging are not mutually exclusive.

Of course, the OTHER Big Win of paging is when you want to run a Truly
Humongous working set of programs.. But I think others have already
covered this, plus there's always the argument "Hey, I've got offset
registers and 2 gigabytes of RAM, what more could I want!". Upon which
someone will discover a problem that takes 2,147,483,650 bytes to
solve...

In summary, I really can't see any reason to not have paging, and
since there's a multitude of reasons to HAVE paging (running problem
sets larger than available memory, "graceful" user response time compared
with multitasking schemes such as segmentation or swapping, etc.)...
-- 

      Eric Green {akgua,ut-sally}!usl!elg
        (Snail Mail P.O. Box 92191, Lafayette, LA 70509)

" In the beginning the Universe was created. This has made a lot of
 people very angry and been widely regarded as a bad move."

elg@usl.UUCP (Eric Lee Green) (09/27/86)

In article <78@alberta.UUCP> cdshaw@pembina.UUCP (Chris Shaw) writes:
>In article <7597@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>>
>>Note that the computational speed of the Cray really does exceed I/O speed
>>by about a factor of 10^6 - page faults in this kind of environment are
>>EXTREMELY costly. 

Not at all. The I/O processor does all the work of reading a page off
of disk and stuffing it into memory (via DMA). Meanwhile, the CPU goes
off and runs somebody else's job, taking advantage of that explicit
multiprocessing. Assuming you're not using the entire machine for a
single batch job, that is. So even if you have a working set larger
than available memory, if you have a good scattering of jobs so that
thrashing doesn't occur, the CPU still gets the same amount of work
done -- it's just divided amongst users, instead of all happening to
the same process.

 Also note that page faults do not occur unless the working set of the
program exceeds available physical memory -- if a page fault occurs,
it means your program wouldn't have run without paging, anyway. And if
a page fault DOESN'T occur (that is, if the program would have run in
physical memory), then you can discard the overhead of page faults
altogether, and just concentrate on the overhead of virtual address
translation. Loading is loading. Loading 64K of data via 8 page faults
should not be any more expensive than loading 64K of data in one fell
swoop.

>>desireable to make the pages large.  That way you can take advantage of the
>>fact that most I/O comes from consecutive disk locations - therefore there
>>are fewer seeks with large page size.

I have never seen a disk system that did not fragment. Unless the disk
page size was also similarly large. Upon which you get lots of wasted
space, because most files are about 1K long or so, and you'd have
these mungo 16K disk pages only 1/16th full... Imagine, a 160 megabyte
disk with only 10 megabytes used on it, totally full.

>>J. Giles
>>Los Alamos
>
>important than overall throughput. Anyone who was around in the days of
>"batch only" knows this argument inside and out. Batch is fine if you want
>to maximize individual program performance, but overall turnaround time is 
>dreadful, since ALL things sit in the queue, including the 1/2 second
>practice runs.

Very good point. See my article elsewhere about the performance
drawbacks of paging (which are so minimal as to be ridiculous).
-- 

      Eric Green {akgua,ut-sally}!usl!elg
        (Snail Mail P.O. Box 92191, Lafayette, LA 70509)

" In the beginning the Universe was created. This has made a lot of
 people very angry and been widely regarded as a bad move."

joel@peora.UUCP (10/08/86)

>>desirable to make the pages large.  That way you can take advantage  of
>>the fact that most I/O comes from consecutive disk locations - therefore
>>there are fewer seeks with large page size.
>
>I have never seen a disk system that did not fragment.  Unless the  disk
>page  size was also similarly large.  Upon which you get lots of wasted
>space, because most files are about 1K long or so, and you'd have  these
>mungo  16K  disk pages only 1/16th full...  Imagine, a 160 megabyte disk
>with only 10 megabytes used on it, totally full.
>      Eric Green {akgua,ut-sally}!usl!elg

	You seem to assume that all disk I/O is done with a fixed block
	size. On our drives you can do a disk I/O transfer using a block
	size from 256 bytes up to 16MB. We have a file type that lets
	you specify how large the file will be at allocation and reserves
	a contiguous block of disk space for it. We also have another
	file type that will allocate new blocks as required for the file,
	but even there the block size can be specified when the file
	is allocated.  The fixed disk block size used by Unix seems to be
	designed to simplify the operating system, rather than for efficient
	disk I/O.

	I'm not familiar with the internals of virtual memory versions of
	Unix. It would seem reasonable to me that paging I/O would
	attempt to bypass most of the file system overhead and use a
	more primitive level of disk access. It therefore seems to me
	that there isn't any necessary relationship between the blocking
	factors used by the file system and the system page size.
-- 
     Joel Upchurch @ CONCURRENT Computer Corporation (A Perkin-Elmer Company)
     Southern Development Center
     2486 Sand Lake Road/ Orlando, Florida 32809/ (305)850-1031
     {decvax!ucf-cs, ihnp4!pesnta, vax135!petsd, akgua!codas}!peora!joel

aglew@ccvaxa.UUCP (10/09/86)

> [Eric Green {akgua,ut-sally}!usl!elg]:

First, my own conclusions: virtual memory may not be necessary for big job
systems, but is for just about anything else. Since I don't see myself
building big single job systems in the near future (or any systems, sigh) I
will take virtual memory as a given. But, I still can see reasons not to have
virtual memory - in certain circumstances.


> A rehash of past arguments, about the working through huge matrix problem:
>    "By the time we reach the end of the list, the front of the list will
> have been paged out".
>     If the front of the matrix has been paged out, that means that the
> matrix was too big to fit in physical memory. That means that with no
> paging, the performance would be 0 -- a 100% degradation.

Talk about spurious line noise! If people explicitly overlay the matrix,
they can still process it - that's a lot better than 0 performance.


>   "We can take advantage of disk seek times -- just put the data on
> close tracks."
>     All the disk systems I have ever seen become somewhat fragmented
> over time. 

Right. But high performance systems give you a way to make certain files
physically contiguous. Whether it's worth the bother is another question.


>     Also note that, because of intelligent i/o controllers, pulling a
> page off of disk takes virtually no CPU time, since the processor is
> running another process while the first one is blocked waiting for the
> i/o processor to bring in the page (assuming a multi-tasking system --
> Crays and such DO multi-task, no?). In otherwords taking advantage of
> explicit multi-processing.

Three things: (1) it is running *another* process, not the one that you want
to make really fast. The crux is, sometimes you want to make one process
really fast, at the cost of its mates. (2) Sure, your CPU isn't idle - it's
taking all sorts of TLB misses and cache misses and swapping out its
registers as it does the context switch between the blocked, paging
process and the new one. The IBM 3090 has already shown an example of a
system where it is faster to page synchronously rather than context switch,
in which case your CPU *is* idle while paging - and these are likely to
become more common. (3) By initiating overlays in advance, asynchronously, a
non-virtual process also does not leave the CPU idle - nor does it waste
CPU cycles in a context switch.

{Has anyone a good way of estimating time lost to context switches - totting
up register save, misses, etc.? I need a fairly abstract definition of
context for a model of multiprocessing that shows that multiprocessing is
sometimes faster than any uniprocessor system can be, for the same problem}
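
One crude first cut at such an estimate, with every number below an
assumption chosen only to show the form of the calculation:

    #include <stdio.h>

    int main(void)
    {
        double cycle_ns        = 10.0;   /* assumed cycle time               */
        double regs_saved      = 64.0;   /* registers saved plus restored    */
        double cycles_per_reg  = 2.0;
        double cold_tlb_misses = 50.0;   /* TLB refills after the switch     */
        double tlb_penalty     = 20.0;
        double cold_cache_miss = 500.0;  /* lines refetched after the switch */
        double cache_penalty   = 10.0;

        double cycles = regs_saved * cycles_per_reg
                      + cold_tlb_misses * tlb_penalty
                      + cold_cache_miss * cache_penalty;

        printf("about %.0f cycles (%.1f microseconds) per switch\n",
               cycles, cycles * cycle_ns / 1000.0);
        return 0;
    }

The interesting part is that the register save itself is the smallest term;
most of the cost is the cold TLB and cache the new context starts with.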

 
> The only one which is really valid is the table lookup times
> necessary for paging. There is also the overhead of paging in the MMU lookup
> table the initial time. 
Should that be prefetched at a context switch?

> [Page size]: If 4k was OK back in Multics days, 8K or even 16K might work
> today. 
Gould already has 8K pages. The Japanese have gone for 1M pages.

> Also note that this is WORST CASE. Your mileage will probably be better.
Sometimes you have to design for worst case (like, when the benchmarks that
customers use to decide whether they should buy your machine are your worst
case scenarios). I have a cute little idea for cramming more stuff into a
TLB if the system will occasionally try to allocate contiguous physical
memory. The architect I showed it to said yes, my idea was better than the
standard trick for multiple page size systems - but it wasn't worth doing,
because he would still have to make the TLB large enough to handle the
worst case, where each contiguous chunk was the smallest page size. Sigh.


> Especially considering that
> because associative lookups probably run about the same speed as
> checking bounds registers, the memory speed would be otherwise about
> the same.

Do really fast systems use bounds registers? What bounds registers?


> Another argument was that paging messes up pipelining...
> Requiring vectors to be page-aligned and having the pages the
> same size as the registers or a binary size larger would eliminate any
> pipeline lossage (just do your virtual memory grocking BEFORE you get
> your pipeline rolling).

This won't work if you want to pipeline calculations of a vector of addresses
(or indexes) with the actual access. I.e., the addresses are not available
before the operation starts. So, either you don't pipeline this (a
considerable loss, but one that can be tolerated) or you make it possible
to recover from a page fault in the middle of an operation (ouch).

> And, of course, MIPS machines & others prove that for
> ordinary code, pipelining and paging are not mutually exclusive.

Carefully designed instruction sets mean that page faults will only occur
at a limited number of known locations, making recovery easier.

aglew@ccvaxa.UUCP (10/09/86)

> Loading 64K of data via 8 page faults should not be any more expensive than
> loading 64K of data in one fell swoop.

You are way off base on this. 

jlg@lanl.ARPA (Jim Giles) (10/12/86)

>>>Note that the computational speed of the Cray really does exceed I/O speed
>>>by about a factor of 10^6 - page faults in this kind of environment are
>>>EXTREMELY costly. 
>
>Not at all. The I/O processor does all the work of reading a page off
>of disk and stuffing it into memory (via DMA). Meanwhile, the CPU goes
>off and runs somebody else's job, taking advantage of that explicit
>multiprocessing. Assuming you're not using the entire machine for a
>single batch job, that is. So even if you have a working set larger
>than available memory, if you have a good scattering of jobs so that
>thrashing doesn't occur, the CPU still gets the same amount of work
>done -- it's just divided amongst users, instead of all happening to
>the same process.

Thank you for making my point so clearly!  There are applications (and whole
computing environments) where turnaround is MUCH MORE important than
throughput.
>
> Also note that page faults do not occur unless the working set of the
>program exceeds available physical memory -- if a page fault occurs,
>it means your program wouldn't have run without paging, anyway.

Sure it does! And more efficiently too!  Paging means that the program
must make do with the memory management and I/O scheduling that is
provided by the VM system.  The VM system is built with all sorts of
compromises since it must work acceptably well for many different
data usage patterns - it cannot make use of any specific knowledge
of the data usage that is known to the programmer.  (Oh, yes.
Some people have pointed out that the VM system COULD be augmented
with special calls to allow the programmer to inform it of the
data usage patterns to be expected.  There are three problems with this
approach: 1> Such calls are quite vague - at least as proposed, 2> the
use of such calls is as much a programming chore as explicit memory
management anyway, 3> no existing VM system I'm aware of has this type
of call implemented.)

> ...  And if
>a page fault DOESN'T occur (that is, if the program would have run in
>physical memory), then you can discard the overhead of page faults
>altogether, and just concentrate on the overhead of virtual address
>translation. Loading is loading. Loading 64K of data via 8 page faults
>should not be any more expensive than loading 64K of data in one fell
>swoop.
>
Not true.  8 page faults is 8 system calls (or equivalent); probably 8
disk seeks (unless you are alone on the system and no other user has
requested data from that disk since the last page fault); also it's
8 disk revolutions (unless the page faults are so close together that
the disk hasn't moved past the start of the next sector before you
ask for it).  The truth is that a single long consecutive disk transfer
is ALWAYS more efficient than a lot of small transfers.
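
For concreteness, here is that comparison with assumed (not measured) drive
figures: eight separate 8K page faults versus one 64K read:

    #include <stdio.h>

    int main(void)
    {
        double seek_ms     = 16.0;   /* assumed average seek                 */
        double half_rot_ms = 8.3;    /* assumed average rotational latency   */
        double kb_per_ms   = 1.0;    /* assumed sustained transfer, ~1 MB/s  */

        /* each fault pays a seek, half a rotation, and an 8 KB transfer */
        double piecemeal = 8.0 * (seek_ms + half_rot_ms + 8.0 / kb_per_ms);
        /* the single read pays those overheads once for 64 KB          */
        double one_shot  = (seek_ms + half_rot_ms) + 64.0 / kb_per_ms;

        printf("eight 8 KB faults: %.0f ms\n", piecemeal);
        printf("one 64 KB read:    %.0f ms\n", one_shot);
        return 0;
    }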

>>>desirable to make the pages large.  That way you can take advantage of the
>>>fact that most I/O comes from consecutive disk locations - therefore there
>>>are fewer seeks with large page size.
>
>I have never seen a disk system that did not fragment. Unless the disk
>page size was also similiarly large. Upon which you get lots of wasted
>space, because most files are about 1K long or so, and you'd have
>these mungo 16K disk pages only 1/16th full... Imagine, a 160 megabyte
>disk with only 10 megabytes used on it, totally full.
>
The CTSS (Cray Time Sharing System) requires an initial length in order
to create a file.  The file is then allocated CONSECUTIVELY (if there
is a large enough hole - some idle system time is spent consolidating
holes).  The file is only segmented if it is extended after its creation.
Even for an extension, the file is extended consecutively if possible.
In practice, most files are consecutive for most of their length.

Also, very little space is taken up by files as small as '1K long or so'.
To be sure, there are a lot of such files, but it would take 32,000 of
them to equal the size of an executing program image (I assume you
mean 1K bytes; program images can be 4M words).  Also - the disk sector
size is the granularity of disk allocation: it is 512 words no matter
what size the 'pages' are.  Changing page size does not affect disk
allocation in any way.

I see that you really have no idea whatsoever about the environment or
problems inherent in large memory systems.  Even your examples give you
away - 160MB is not a very large disk drive, 1KB is not a typical file
size, etc.  Our disk drives hold about 1.2GB, files can be as large as
40MW (=320MB), average user data files are over 1MW (=8MB) - only source
code is as small as 1KB (which is rounded up to 512 words = 4KB because
that is the sector size of the disk drives).  Such small code sources
are also rare - the main users have codes which amount to 100,000 to
500,000 lines of Fortran (not counting the library support).  In short,
you are talking about a completely different environment than I am.

Learn something about large computing environments before you try to
force unwanted features upon them!

J. Giles
Los Alamos

henry@utzoo.UUCP (Henry Spencer) (10/15/86)

> ... The IBM 3090 has already shown an example of a
> system where it is faster to page synchronously rather than context switch,
> in which case your CPU *is* idle while paging - and these are likely to
> become more common. ...

Ah, but what device are they paging from?  If I remember it correctly --
I'm not sure of this -- the 3090 is paging out of ordinary RAM, which they
would have preferred to make directly addressable except that they are
constrained by backward compatibility and not enough address bits.  This
is a kludge, not something that is likely to become widespread.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

barada@maggot.UUCP (10/16/86)

Written  2:47 am  Sep 27, 1986 by usl.UUCP!elg
>In article <78@alberta.UUCP> cdshaw@pembina.UUCP (Chris Shaw) writes:
>>In article <7597@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>>>
>>>Note that the computational speed of the Cray really does exceed I/O speed
>>>by about a factor of 10^6 - page faults in this kind of environment are
>>>EXTREMELY costly. 
>
>Not at all. The I/O processor does all the work of reading a page off
>of disk and stuffing it into memory (via DMA). Meanwhile, the CPU goes
>off and runs somebody else's job, taking advantage of that explicit
>multiprocessing. Assuming you're not using the entire machine for a
>single batch job, that is. So even if you have a working set larger
>than available memory, if you have a good scattering of jobs so that
>thrashing doesn't occur, the CPU still gets the same amount of work
>done -- it's just divided amongst users, instead of all happening to
>the same process.
>
> Also note that page faults do not occur unless the working set of the
>program exceeds available physical memory -- if a page fault occurs,
>it means your program wouldn't have run without paging, anyway. And if
>a page fault DOESN'T occur (that is, if the program would have run in
>physical memory), then you can discard the overhead of page faults
>altogether, and just concentrate on the overhead of virtual address
>translation. Loading is loading. Loading 64K of data via 8 page faults
>should not be any more expensive than loading 64K of data in one fell
>swoop.
>

	Your argument is good, but it does have its weaknesses:

	1) The CPU would not get the same amount of work done under page
	   faults, since the context switches occur and the kernel
	   has to reschedule running processes and their biases. So a program
	   that goes page-fault happy (such as a monster hashing algorithm)
	   would actually cause the WHOLE system to run slower due to all
	   of those page faults spending time in the scheduler.

	2) What do you do if two processes each need 80% of the available
	   memory? Run one so it doesn't fault and then run the other?
	   This is great for batch systems, not timesharing. Actually,
	   a good design would try to balance it out by allowing each
	   a page set that takes up half of memory. This would give each
	   about the same probability (theoretically) of having the proper
	   page in memory. Now if one process runs in only 10% of its
	   workspace, then the workspace can be reduced to allow the other
	   piggish process a better chance of having a page hit.
	   So the likelihood of a process having its WS available is better if
	   the process is small. But if it is really large, then that
	   likelihood is inversely proportional to the WS needed. So
	   really large processes run even SLOWER since they cannot get
	   the large WS unless they run in a `batch' style.

	   The calculations to determine the adjustment of the WS are done
	   on each page fault, which can slow things down even more for huge
	   systems.
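
	A small sketch of the balancing idea in (2); the numbers and
	policy below are invented for illustration only:

    #include <stdio.h>

    struct ws {
        const char *name;
        double wanted_pages;    /* working set the process asks for        */
        double touch_fraction;  /* fraction of that WS it actually touches */
    };

    int main(void)
    {
        struct ws p[2] = {
            { "A", 800.0, 0.10 },   /* runs in only 10% of its workspace */
            { "B", 800.0, 0.90 },
        };
        double mem_pages = 1000.0;  /* both want 80% of this             */
        double demand = 0.0;
        int k;

        for (k = 0; k < 2; k++)
            demand += p[k].wanted_pages * p[k].touch_fraction;

        /* grant memory in proportion to what each process really uses */
        for (k = 0; k < 2; k++) {
            double grant = p[k].wanted_pages * p[k].touch_fraction
                         * (mem_pages / demand);
            if (grant > p[k].wanted_pages)
                grant = p[k].wanted_pages;
            printf("%s: wants %4.0f pages, granted %4.0f\n",
                   p[k].name, p[k].wanted_pages, grant);
        }
        return 0;
    }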


	I am not faulting VM systems for Crays, I'm just pointing out some
	things that were overlooked.

--
Peter Barada				       | (617)-671-9905
Applicon, Inc. A division of Schlumberger Ltd. | Billerica MA, 01821

UUCP: {allegra|decvax|mit-eddie|utzoo}!linus!raybed2!applicon!barada
      {amd|bbncca|cbosgd|wjh12|ihnp4|yale}!ima!applicon!barada

	Sanity is yet another perversion of reality.

josh@polaris.UUCP (Josh Knight) (10/17/86)

In article <7232@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>> ... The IBM 3090 has already shown an example of a
>> system where it is faster to page synchronously rather than context switch,
>> in which case your CPU *is* idle while paging - and these are likely to
>> become more common. ...
>
>Ah, but what device are they paging from?  If I remember it correctly --
>I'm not sure of this -- the 3090 is paging out of ordinary RAM, which they
>would have preferred to make directly addressable except that they are
>constrained by backward compatibility and not enough address bits.  This
>is a kludge, not something that is likely to become widespread.

What IBM prefers to do is a difficult thing to determine (wish I
could ;-).  It is true that the expanded storage announced for the
3090 is "ordinary RAM".  However, it is not necessarily true that one
would prefer to have real memory instead of a fast page addressable
store.  The address bits required to maintain the latter (only page 
addresses are maintained) are fewer and one does not have to maintain
whatever protection mechanisms (in the case of the IBM 370 architecture
storage protect keys) are provided for "main storage".   In addition,
page stores could have different porting and provide access to multiple
processors and/or I/O servers with less overhead.  These latter features
are not supported by the 3090 expanded storage, but other systems might
choose page addressable stores over byte addressable stores for these
reasons.
 
Of course, I'm speaking only for myself, not my employer.
 
-- 

	Josh Knight, IBM T.J. Watson Research
 josh@ibm.com, josh@yktvmh.bitnet,  ...!philabs!polaris!josh

jnw@mcnc.UUCP (John White) (10/21/86)

In article <7232@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
> ... If I remember it correctly --
> I'm not sure of this -- the 3090 is paging out of ordinary RAM, which they
> would have preferred to make directly addressable except that they are
> constrained by backward compatibility and not enough address bits.  This
> is a kludge, not something that is likely to become widespread.
> -- 
> 				Henry Spencer @ U of Toronto Zoology
> 				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

With memory prices dropping, it makes some sense to use a RAMdisk as the
paging device on a high performance system. The RAMdisk can use slower,
less expensive memory. Also, the main memory bus can be driven much faster
if it is known that there will be no more than, say, 500 memory chips on it.
The RAMdisk, on the other hand, could have more than 50,000 chips.
-jnw@mcnc