dwc@hou2b.UUCP (D.CHEN) (09/16/86)
>>3) The real reason for virtual memory (and the one which won't go
>>away when memories get big) is that I can quickly load [a]
>>working set of a large program ...

>... but it seems to me that you'll use up [the faster startup] and
>a lot more in page faults. Since you are reading the program
>piecemeal into virtual memory you are going to be a lot slower
>because of the extra seek and rotational delays. It is sort of like
>a guy driving on a surface street, instead of going over and getting
>on the freeway. You get started faster, but you hit all those
>traffic lights. The only case that is valid is if the overwhelming
>majority of the pages in the program are never referenced.

another analogy (and one that i've been using) is this: imagine that
you have to xerox ten sets of notes and each set consists of ten
pages. if there is a large setup time on the machine, you would like
to copy the 100 pages in one shot. even without a large setup time,
if there are other people on line, you would probably want to copy
all of your work in one shot instead of getting on the end of the
line every 10 pages.

the answer is "it depends". if your executable is on a unix file
system, you probably would have to do multiple i/os to load the
entire address space anyway. however, if it is contiguous on some
swap device, then it depends on program behavior.

one important aspect that people rarely consider when talking about
response time and loading is what happens if, in the process of
demand paging (and loading of "working sets"), memory runs out? what
are the implications for response time then? it would seem that
without any other aids, pure demand paging is a clear loser in this
situation.

danny chen
ihnp4!hou2b!dwc
jlg@lanl.ARPA (Jim Giles) (09/17/86)
In article <832@hou2b.UUCP> dwc@hou2b.UUCP (D.CHEN) writes:
>...
>another analogy (and one that i've been using) is this: imagine
>that you have to xerox ten sets of notes and each set consists of
>ten pages. if there is a large setup time on the machine, you would
>like to copy the 100 pages in one shot. even without a large setup
>time, if there are other people on line, you would probably want to
>copy all of your work in one shot instead of getting on the end of
>the line every 10 pages.
>...

This is a remarkably good analogy! It gives exactly the flavor of
computing on modern supercomputers. Working with the Cray, for
example, is like having a xerox machine that can duplicate 1000 pages
a second but takes 1000 seconds to get to - even when no one else is
in line for it. You clearly don't want to divide your 100 page
xerography task into 10 page chunks on such a machine. You also don't
want to force Cray users to dribble their code into the machine in
tiny amounts for the same reason.

Note that the computational speed of the Cray really does exceed I/O
speed by about a factor of 10^6 - page faults in this kind of
environment are EXTREMELY costly. If you did page the memory on such
a system, it would be desirable to make the pages large. That way you
can take advantage of the fact that most I/O comes from consecutive
disk locations - therefore there are fewer seeks with a large page
size.

J. Giles
Los Alamos
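The copier analogy reduces to simple arithmetic: each trip to the device pays a fixed setup cost (seek, rotation, queueing) on top of the per-page transfer. A sketch of the model (every number below is invented for illustration; only the ratio of setup time to transfer time matters):

```python
def load_time(pages, pages_per_request, setup_s, transfer_s_per_page):
    """Total time to load `pages` pages when every request to the
    device pays a fixed setup cost plus a per-page transfer cost."""
    requests = -(-pages // pages_per_request)  # ceiling division
    return requests * setup_s + pages * transfer_s_per_page

# Illustrative numbers: 100 pages, 30 ms setup, 1 ms/page transfer.
one_shot = load_time(100, 100, 0.030, 0.001)   # ~0.13 s: one setup
demand   = load_time(100, 1,   0.030, 0.001)   # ~3.1 s: 100 setups
```

With these made-up numbers the one-shot load wins by more than 20x; demand paging breaks even only when most pages are never referenced, which is exactly the caveat in the quoted article.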
tuba@ur-tut.UUCP (Jon Krueger) (09/18/86)
In article <832@hou2b.UUCP> dwc@hou2b.UUCP (D.CHEN) writes:
>...one important aspect that people rarely consider when talking about
>response time and loading is what happens if, in the process of demand
>paging (and loading of "working sets"), memory runs out? what are
>the implications on response time then? it would seem that without
>any other aids, pure demand paging is a clear loser in this situation.
>
>danny chen
>ihnp4!hou2b!dwc

Paging systems I'm familiar with handle this in (at least) three ways:

1) When many processes are demanding memory, and system-wide available
free memory becomes low in consequence, per-process limits on physical
memory are lowered. The operating system shrinks individual processes'
working sets. "Times are lean," says the pager, "and we're all going
to have to tighten our belts for a while." Response time degrades, but
smoothly.

2) If all processes have been shrunk to their lowest permissible
limits, and more memory is still needed, someone gets swapped.
Typically someone who was holding memory but not executing. "All of us
on this lifeboat won't make it, so some of us have to go into
suspended animation until we are rescued." This is a mixed
paging/swapping system, which shares physical memory with paging until
the sharing gets too thin to do anybody any good. Then it swaps, thus
lowering the demand. For the survivors, response time again degrades
smoothly.

3) If most processes are swapped out, you're still paging like mad,
and you've run out of good swap candidates, you lose. A bad candidate
gets swapped out. Lousy response time for him and everyone else. On
the other hand, such cases are examples of trying to substitute
virtual memory for adequate physical memory. This is also known as
belief in magic, making the wrong tradeoffs, and typical lousy system
management. So I claim it's not vm's fault; it did the best it could.
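The three stages above can be sketched as a toy policy. The thresholds, the flat per-process working-set floor, and the victim-selection rule are all invented for illustration; real pagers are far more elaborate:

```python
def relieve_pressure(procs, free_pages, low_water, ws_floor):
    """Toy memory-pressure policy mirroring the three stages above.
    `procs` maps pid -> {'ws': working-set pages, 'runnable': bool}.
    Returns (action, pages_freed)."""
    if free_pages >= low_water:
        return ('none', 0)
    # Stage 1: shrink every working set down to its floor.
    freed = 0
    for p in procs.values():
        shrink = max(0, p['ws'] - ws_floor)
        p['ws'] -= shrink
        freed += shrink
    if freed:
        return ('shrink', freed)
    # Stage 2: swap out someone holding memory but not executing.
    idle = [pid for pid, p in procs.items() if not p['runnable'] and p['ws']]
    if idle:
        victim = procs[idle[0]]
        freed, victim['ws'] = victim['ws'], 0
        return ('swap-idle', freed)
    # Stage 3: no good candidate left -- a runnable process loses.
    for p in procs.values():
        if p['ws']:
            freed, p['ws'] = p['ws'], 0
            return ('swap-runnable', freed)
    return ('none', 0)
```

Repeated calls under sustained pressure walk through the same shrink, swap-idle, swap-runnable progression the article describes, with response time degrading at each step.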
--
--> Jon Krueger
uucp: {seismo, allegra, decvax, cmcl2, topaz, harvard}!rochester!ur-tut!tuba
Phone: (716) 275-2811 work, 235-1495 home
BITNET: TUBA@UORDBV
USMAIL: Taylor Hall, University of Rochester, Rochester NY 14627
cdshaw@alberta.UUCP (Chris Shaw) (09/19/86)
In article <7597@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>
>Note that the computational speed of the Cray really does exceed I/O speed
>by about a factor of 10^6 - page faults in this kind of environment are
>EXTREMELY costly. If you did page the memory on such a system, it would be
>desirable to make the pages large. That way you can take advantage of the
>fact that most I/O comes from consecutive disk locations - therefore there
>are fewer seeks with large page size.
>
>J. Giles
>Los Alamos

This is a good point. One should tailor the paging to the machine.
However, simply saying "x is sooo expensive. gosh. Better not do x,
because none of the advantages of x are attainable" is not a very
scientific attitude. It lies in that fuzzy realm of engineering
judgement.

Why doesn't someone do a simulation of the Cray on a slower machine
that DOES have paging, only slow the paging speed to an appropriately
low rate to compare throughput? Run a VAX from floppy disk as paging
store! Do an experiment! Get some numbers!! (Maybe this is silly, but
you get the idea.)

Why? Because one's intuition about performance or throughput is
usually wrong. The reason is that the systems we are considering are
too complex, and simply looking at paging speed vs cpu speed is far
too simple an analysis to base a large design decision upon. In fact,
more recent news articles from people on this subject seem to imply
that VM may cost you 10% on memory references, but the job of using
the machine without VM is so horrendous that without it you lose all
the advantages.

People who are saying "the 205 runs programs x, y, and z faster
without VM than with VM" are looking at only one job at a time. This
is fine if your 205 does ONLY 1 thing all day, every day. Most
machines run a mix of jobs, however, and in this much larger context,
individual program performance is far less important than overall
throughput. Anyone who was around in the days of "batch only" knows
this argument inside and out.
Batch is fine if you want to maximize individual program performance,
but overall turnaround time is dreadful, since ALL things sit in the
queue, including the 1/2 second practice runs.

The point? Looking at single-job performance, no matter how large the
job, is not good enough. It's the old "1 benchmark doesn't tell you
anything" rule. You have to look at a large range of programs before
making a decision.

This is also true in design. Running experiments is usually a very
rude awakening. Things you thought were significant aren't, and vice
versa. Therefore, not designing something into a machine because you
think it hurts performance may be a lose if you look at only 1 type
of performance, and not the overall performance picture.

Chris Shaw cdshaw@alberta
University of Alberta
Bogus as HELL !
rb@cci632.UUCP (Rex Ballard) (09/22/86)
In article <832@hou2b.UUCP> dwc@hou2b.UUCP (D.CHEN) writes:
>>>3) The real reason for virtual memory (and the one which won't go
>>>away when memories get big) is that I can quickly load [a]
>>>working set of a large program ...
>
>>... but it seems to me that you'll use up [the faster startup] and
>>a lot more in page faults. Since you are reading the program
>>piecemeal into virtual memory you are going to be a lot slower
>>because of the extra seek and rotational delays. It is sort of like
>>a guy driving on a surface street, instead of going over and getting
>>on the freeway. You get started faster, but you hit all those
>>traffic lights. The only case that is valid is if the overwhelming
>>majority of the pages in the program are never referenced.

In computing, there is an appropriate analogy to the "freeway"
described. This would be the single process, single processor, single
task, single level, single loop, in which data is accessed in either
a nearly "random" or a "large loop" fashion. For such applications,
such as scientific matrix manipulation or calculations, on single
processor systems like the Cray 1, VM is definitely a lose.

The next question is: is this model still appropriate? Scientific and
number crunching applications are finding a better home in
multi-processor environments, which effectively become multi-tasking
systems even when only a single "program" is being run.

>another analogy (and one that i've been using) is this: imagine
>that you have to xerox ten sets of notes and each set consists of
>ten pages. if there is a large setup time on the machine, you would
>like to copy the 100 pages in one shot. even without a large setup
>time, if there are other people on line, you would probably want to
>copy all of your work in one shot instead of getting on the end of
>the line every 10 pages.

Ok, let's continue with this. Do you really want to make the person
who needs two copies wait until your "batch job" is complete?
It may be more desirable to have a second machine for the "smaller
jobs", or to split the 100 page copy among more machines, by making
one copy and putting that on another machine. If you make the "little
jobs" wait, then they are more likely to collect many "little jobs"
into one "big job" that is "worth the wait".

I even lean toward the "grocery store" analogy. Many such stores have
"express lanes" so that the person with 2 items doesn't have to wait
for five people who are buying for the month. The "bulk buyers" line
may also be staffed with a bagger, a scanner, and a nice conveyor
belt, while the "express lane" might only be a counter and a manual
cash register. Many stores adhere to a "3 person" rule, where clerks
are added as the number of people in a line exceeds 3, be they
express or bulk. Without this technique, people would only come to
the grocery store when they were buying large quantities, and go to
other stores for small purchases.

>the answer is "it depends". if your executable is on a unix file
>system, you probably would have to do multiple i/os to load the
>entire address space anyway. however, if it is contiguous on some
>swap device, then it depends on program behavior.

Some Unix variations, which are specifically designed to support VM,
keep executable binaries in contiguous form. The read-only portion
need not be mapped or swapped.

>one important aspect that people rarely consider when talking about
>response time and loading is what happens if, in the process of demand
>paging (and loading of "working sets"), memory runs out? what are
>the implications on response time then? it would seem that without
>any other aids, pure demand paging is a clear loser in this situation.

Depending on the mechanism used, exhausting memory might require as
many as 500 processes to be running simultaneously, assuming only a
modest 1 megabyte of memory with 1K pages.
There may be an argument for "request without fault" type accesses,
where the OS could be advised by the application that a new page will
be needed in the near future.

>danny chen
>ihnp4!hou2b!dwc

There are several types of processing: batch, interactive, pipelined,
I/O intensive, and transaction. Some unpleasant experiences with a
non-virtual-memory transaction processing system have given good
cause to prefer VM for this type of application. Similar experiences
with interactive applications lead to the same conclusion. Even a
batch processing compiler that attempted to do "hand optimised"
overlays in a VM environment (it thrashed itself to death) tends to
make a case for VM.

Most applications spend 90% of their time executing only 10% of their
code. Interactive and transaction processing applications spend
nearly 95% of their time in a "wait for I/O, parse, loop" with an
occasional "do special purpose 'case' processing". There are more
than a few stories of such applications where swapping that involved
a loop smaller than 2K caused serious degradation due to the 64K to
1Mb "tails" that would get swapped in with it.

Perhaps when it is possible to get 4 gigabytes of 50ns RAM for
$200-$300 that requires only a few watts, VM will become unnecessary.
Even then, however, the lack of need for "heap compaction" and the
ability to "remap" instead of "copy" data from one place to another
will continue to be a win for most applications. Even this statement
should have :-)'s all over it. I remember thinking that 2K was a lot
of memory on my VIP; the transition from PDP-8 to PDP-11 thrilled
many because of the thought of a 64K address space. Microsoft's Bill
Gates couldn't imagine why anyone would need more than 640K with
MS-DOS, and even Motorola was surprised to discover that some systems
couldn't fit in the 16 megabyte virtual address space of the 68010.

There is also the issue of software costs in relation to the system
costs.
When adding an enhancement requires 10-50 times the actual
enhancement cost to "make room" for the enhancement, the long-term
system costs can get out of hand very quickly. With complete,
detailed structure charts, data-flow diagrams and data hierarchy, it
is possible to "manually" support anything from simple
"physical=logical" addressing, bank switching, overlays, segment
registers, and/or swapping to "virtual memory with virtual
libraries". For "automatic" support via linkers and compilers,
however, the benefits of "virtual memory" are difficult to match.

Finally, one of the main trade-offs of VM is the TLB lookup time.
When the processor is spending 10% of its time "looking for something
to do", CPU speed becomes much less important than memory management,
disk caching, and co-processor interlock.
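The TLB lookup cost is usually quantified as an effective access time: a hit costs one memory reference after the lookup, while a miss costs an extra reference to fetch the page-table entry (assuming a single-level page table). The hit ratio and cycle times below are illustrative assumptions, not measurements of any particular machine:

```python
def effective_access_ns(hit_ratio, tlb_ns, mem_ns):
    """Average memory access time with a TLB in front of a
    single-level page table."""
    hit = tlb_ns + mem_ns            # translation found in the TLB
    miss = tlb_ns + 2 * mem_ns       # extra reference for the PTE
    return hit_ratio * hit + (1 - hit_ratio) * miss

# Illustrative: 98% hit ratio, 5 ns TLB, 100 ns memory
# gives 107 ns average -- a 7% slowdown over raw memory.
```

The point of the exercise: for good hit ratios the slowdown is dominated by the fixed TLB lookup, which is why the trade-off is tolerable on most machines.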
jlg@lanl.ARPA (Jim Giles) (09/25/86)
In article <78@alberta.UUCP> cdshaw@pembina.UUCP (Chris Shaw) writes:
>People who are saying "the 205 runs programs x, y, and z faster without VM than
>with VM" are looking at only one job at a time. This is fine if your 205 does
>ONLY 1 thing all day, every day. Most machines run a mix of jobs, however,
>and in this much larger context, individual program performance is far less
>important than overall throughput. Anyone who was around in the days of
>"batch only" knows this argument inside and out. Batch is fine if you want
>to maximize individual program performance, but overall turnaround time is
>dreadful, since ALL things sit in the queue, including the 1/2 second
>practice runs.

There are places where you really want to maximize individual program
performance. Take global weather modelling for example: these people
have one big program (that must produce results faster than real
time, or the result is not very interesting). Other programs running
in the system at the same time are incidental - or even
counter-productive.

Yes, there is a trade-off between throughput and individual program
speed. My point is that some applications don't care a bit about
throughput - the machine was purchased for the purpose of running one
particular program as fast as possible.

J. Giles
Los Alamos
jlg@lanl.ARPA (Jim Giles) (09/25/86)
In article <389@cci632.UUCP> rb@ccird1.UUCP (Rex Ballard) writes:
>Most applications spend 90% of their time executing only 10% of their
>code. Interactive and transaction processing applications spend
>nearly 95% of their time in a "wait for I/O, parse, loop" with an
>occasional "do special purpose 'case' processing". There are more
>than a few stories of such applications where swapping that involved
>a loop smaller than 2K caused serious degradation due to the 64K to
>1Mb "tails" that would get swapped in with them.
>

I've seen codes that spend 99% of their time in just 1-2% of the
code. So what? The code only occupies 2% of the memory image of the
program. The rest of the memory image is data, and 90% of it is being
processed repeatedly by that tight loop in the code.

Yes, when your job mix is composed mainly of small and/or interactive
processes, by all means you should have a VM system. I question the
need for VM in contexts where this is not true.

J. Giles
Los Alamos
rb@cci632.UUCP (Rex Ballard) (09/26/86)
In article <7963@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>In article <389@cci632.UUCP> rb@ccird1.UUCP (Rex Ballard) writes:
>>Most applications spend 90% of their time executing only 10% of their
>>code. Interactive and transaction processing applications spend
>>nearly 95% of their time in a "wait for I/O, parse, loop" with an
>>occasional "do special purpose 'case' processing". There are more
>>than a few stories of such applications where swapping that involved
>>a loop smaller than 2K caused serious degradation due to the 64K to
>>1Mb "tails" that would get swapped in with them.
>>
>I've seen codes that spend 99% of their time in just 1-2% of the code.
>So what? The code only occupies 2% of the memory image of the program.
>The rest of the memory image is data, and 90% of it is being processed
>repeatedly by that tight loop in the code.
>
>Yes, when your job mix is composed mainly of small and/or interactive
>processes, by all means you should have a VM system. I question the
>need for VM in contexts where this is not true.
>
>J. Giles

Actually, when you have a lot of small processes, segmentation and
paging are about equal in terms of trade-offs, especially when
processes are extremely small. When large tasks are being run in a
multi-tasking environment, the advantages of paging begin to be more
dramatic.

If a task is run to completion without swapping, overlays, or
segmentation, and the OS is smart enough not to run more processes
than currently available memory can hold, and each task can be run in
a reasonable amount of time (1/2 second or less), and there is a
separate processor to handle the I/O, then it is possible to get
superior performance. Having some hands-on experience with all three
types of systems, most of my observations tend to favor paging for
multi-tasking. 100% memory-resident subsystems are also a big win
without paging.
Get a 300 meg disk and 300 meg of RAM, treat shared data in the same
area, use signals, semaphores, and queues, "flush the world"
asynchronously, and the overhead becomes the drive interrupt handler.
Even under these conditions the "tagging" of memory references will
reduce the number of "flushes", since only those "pages" written to
since the last flush are affected, rather than "read only this time
around" sectors. Tagging can be done with software, but wouldn't be
much faster than hardware.

One of the "convincers" for me was seeing a "task segmented" system
running a bunch of "named pipe" transaction processing applications.
When the system was fully configured, over 200 tasks might be running
at the same time. Every few seconds, the system would go through the
"Big Swap", during which almost nothing else could run. Converting to
a paged system actually increased performance by a couple of orders
of magnitude, in spite of slower memory accesses due to TLB
translation.

Think about some of the "tails" that come with most well-written
applications:

Argument parsing - done at start-up, probably not used again.
Buffer initialization - you might not need this more than once.
File opening - one of those things you don't want to do too often.
Error handling - lots of messages you hope you'll never use.
State machines - typically only one choice executed at a time.
Nested subroutines - only the "active path" is needed.
Default linked libraries - do you always use the floating point when
you call printf()?

Do you really want to swap all of that every time you have to swap?
Even ramdisk management schemes require the cost of copying in the
appropriate task, in its entirety, before executing. If the average
application on a system is only a modest 40K, the working set in a
task-segmented system is still limited to a smaller number of tasks
than in a paged system with even 4K pages.

One feature I've always wanted to see is mapping of sharable library
executables in a paged system.
Popular subroutines like "printf" are often very large (6-10K on some systems) and usually not shared. Disk cache paging can also be useful. This is different from CPU/memory access paging in that things like directory entries, commonly used data files, and similar frequently used data can be kept in core. This is real handy when running several unix shell scripts.
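The "tails" complaint above can be put in numbers: what has to move over the channel to bring a blocked task back in? A sketch comparing whole-image swapping with paging in only the touched pages (the 64K image and 2K hot loop are the figures from the article; the 4K page size is an assumed example):

```python
def swap_traffic_kb(image_kb, touched_kb, page_kb=None):
    """I/O traffic to bring a blocked task back into memory.
    With no page size (task-segmented system) the whole image moves;
    with paging, only the referenced pages move, rounded up to
    whole pages."""
    if page_kb is None:
        return image_kb                     # the entire image swaps
    pages = -(-touched_kb // page_kb)       # ceiling division
    return pages * page_kb

# A 64K image whose hot loop touches only 2K:
# swap_traffic_kb(64, 2)      -> 64 KB moved per swap-in
# swap_traffic_kb(64, 2, 4)   ->  4 KB moved per swap-in
```

A 16x reduction in traffic per reactivation is the kind of effect behind the "couple of orders of magnitude" improvement reported for the 200-task system.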
elg@usl.UUCP (Eric Lee Green) (09/27/86)
In article <78@alberta.UUCP> cdshaw@pembina.UUCP (Chris Shaw) writes:
>People who are saying "the 205 runs programs x, y, and z faster without VM than
>with VM" are looking at only one job at a time. This is fine if your 205 does
>ONLY 1 thing all day, every day. Anyone who was around in the days of
>"batch only" knows this argument inside and out. Batch is fine if you want
>to maximize individual program performance, but overall turnaround time is
>dreadful, since ALL things sit in the queue, including the 1/2 second
>practice runs.
>Chris Shaw cdshaw@alberta

A rehash of past arguments, about the working-through-a-huge-matrix
problem:

"By the time we reach the end of the list, the front of the list will
have been paged out."

If the front of the matrix has been paged out, that means that the
matrix was too big to fit in physical memory. That means that with no
paging, the performance would be 0 -- a 100% degradation.

"We can take advantage of disk seek times -- just put the data on
close tracks."

All the disk systems I have ever seen become somewhat fragmented over
time. Also note that, because of intelligent I/O controllers, pulling
a page off of disk takes virtually no CPU time, since the processor
is running another process while the first one is blocked waiting for
the I/O processor to bring in the page (assuming a multi-tasking
system -- Crays and such DO multi-task, no?). In other words, taking
advantage of explicit multi-processing.

Most of the other arguments seen have been spurious line noise :-)
:-). The only one which is really valid is the table lookup time
necessary for paging. Worst case: the large matrix previously
discussed. For n = page size, every n bytes we have to fetch a new
associative table entry, upon which one or two levels of a tree must
be traversed. There is also the overhead of paging in the MMU lookup
table the initial time.
With today's faster disk drives, larger pages may help, both in
increasing the interval N between memory-to-associative fetches, and
in decreasing the size of the MMU lookup table. If 4K was OK back in
Multics days, 8K or even 16K might work today. Also note that this is
WORST CASE. Your mileage will probably be better.

I must conclude that for 4K word pages, every 4K memory operations a
2-4 cycle overhead would be necessary, or a 4/4096 reduction in
performance. If we are talking about a 100 MIPS machine, it would now
only be running at about 99.9 MIPS. Like wow. Especially considering
that, because associative lookups probably run about the same speed
as checking bounds registers, the memory speed would be otherwise
about the same.

Another argument was that paging messes up pipelining. Correct me if
I'm wrong (we obviously don't have a Cray handy :-), but a Cray only
does pipelined vector operations upon the vector registers (although
I think filling and saving those registers are pipelined operations,
also). Requiring vectors to be page-aligned and making the pages the
same size as the registers, or a binary multiple larger, would
eliminate any pipeline lossage (just do your virtual memory grokking
BEFORE you get your pipeline rolling). Since the compilers already
have to do considerable grokking to vectorize code, that doesn't seem
too extreme. And, of course, MIPS machines & others prove that for
ordinary code, pipelining and paging are not mutually exclusive.

Of course, the OTHER Big Win of paging is when you want to run a
Truly Humongous working set of programs. But I think others have
already covered this, plus there's always the argument "Hey, I've got
offset registers and 2 gigabytes of RAM, what more could I want!"
Upon which someone will discover a problem that takes 2,147,483,650
bytes to solve...
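That worst-case estimate can be checked directly. The 4096-operation interval and the 2-4 cycle refill cost are the article's assumptions; the code just carries out the division:

```python
def paged_mips(raw_mips, ops_per_refill=4096, cycles_per_refill=4):
    """Worst-case throughput when every `ops_per_refill` operations
    pay `cycles_per_refill` extra cycles to refill the associative
    translation table (one refill per page-sized stride)."""
    overhead = cycles_per_refill / ops_per_refill
    return raw_mips * (1 - overhead)

# paged_mips(100) -> 99.90234375, i.e. about 99.9 MIPS
# out of a nominal 100 -- roughly a 0.1% slowdown.
```

Halving or doubling the page size only moves the answer by a twentieth of a percent either way, which is the article's point: the translation overhead is lost in the noise.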
In summary, I really can't see any reason not to have paging, and
there's a multitude of reasons to HAVE paging (running problem sets
larger than available memory, "graceful" degradation of user response
time versus multitasking schemes such as segmentation or swapping,
etc.)...

--
Eric Green {akgua,ut-sally}!usl!elg
(Snail Mail P.O. Box 92191, Lafayette, LA 70509)

"In the beginning the Universe was created. This has made a lot of
people very angry and been widely regarded as a bad move."
elg@usl.UUCP (Eric Lee Green) (09/27/86)
In article <78@alberta.UUCP> cdshaw@pembina.UUCP (Chris Shaw) writes:
>In article <7597@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>>
>>Note that the computational speed of the Cray really does exceed I/O speed
>>by about a factor of 10^6 - page faults in this kind of environment are
>>EXTREMELY costly.

Not at all. The I/O processor does all the work of reading a page off
of disk and stuffing it into memory (via DMA). Meanwhile, the CPU
goes off and runs somebody else's job, taking advantage of that
explicit multiprocessing. Assuming you're not using the entire
machine for a single batch job, that is. So even if you have a
working set larger than available memory, if you have a good
scattering of jobs so that thrashing doesn't occur, the CPU still
gets the same amount of work done -- it's just divided amongst users,
instead of all happening to the same process.

Also note that page faults do not occur unless the working set of the
program exceeds available physical memory -- if a page fault occurs,
it means your program wouldn't have run without paging anyway. And if
a page fault DOESN'T occur (that is, if the program would have run in
physical memory), then you can discard the overhead of page faults
altogether, and just concentrate on the overhead of virtual address
translation. Loading is loading. Loading 64K of data via 8 page
faults should not be any more expensive than loading 64K of data in
one fell swoop.

>>If you did page the memory on such a system, it would be
>>desirable to make the pages large. That way you can take advantage of the
>>fact that most I/O comes from consecutive disk locations - therefore there
>>are fewer seeks with large page size.

I have never seen a disk system that did not fragment -- unless the
disk page size was also similarly large. Upon which you get lots of
wasted space, because most files are about 1K long or so, and you'd
have these mungo 16K disk pages only 1/16th full... Imagine, a 160
megabyte disk with only 10 megabytes used on it, totally full.

>>J. Giles
>>Los Alamos
>
>important than overall throughput. Anyone who was around in the days of
>"batch only" knows this argument inside and out. Batch is fine if you want
>to maximize individual program performance, but overall turnaround time is
>dreadful, since ALL things sit in the queue, including the 1/2 second
>practice runs.

Very good point. See my article elsewhere about the performance
drawbacks of paging (which are so minimal as to be ridiculous).

--
Eric Green {akgua,ut-sally}!usl!elg
(Snail Mail P.O. Box 92191, Lafayette, LA 70509)

"In the beginning the Universe was created. This has made a lot of
people very angry and been widely regarded as a bad move."
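The internal-fragmentation worry above is easy to quantify: if every file occupies a whole number of fixed-size pages, small files waste most of each page. The 1K average file size is the article's figure; the rest is arithmetic:

```python
def disk_utilization(avg_file_kb, page_kb):
    """Fraction of allocated disk space actually holding data when
    every file is rounded up to a whole number of fixed-size pages."""
    pages = -(-avg_file_kb // page_kb)   # ceiling division
    return avg_file_kb / (pages * page_kb)

# 1K files on 16K disk pages: utilization is 1/16 = 6.25%,
# which is exactly the "160 MB disk, 10 MB used, totally full"
# scenario in the article.
```

The flip side, argued in the follow-up below, is that the page size used for paging I/O need not be the file system's allocation unit at all.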
joel@peora.UUCP (10/08/86)
>>desirable to make the pages large. That way you can take advantage of
>>the fact that most I/O comes from consecutive disk locations - therefore
>>there are fewer seeks with large page size.
>
>I have never seen a disk system that did not fragment. Unless the disk
>page size was also similarly large. Upon which you get lots of wasted
>space, because most files are about 1K long or so, and you'd have these
>mungo 16K disk pages only 1/16th full... Imagine, a 160 megabyte disk
>with only 10 megabytes used on it, totally full.
> Eric Green {akgua,ut-sally}!usl!elg

You seem to assume that all disk I/O is done with a fixed block size.
On our drives you can do a disk I/O transfer using a block size from
256 bytes up to 16MB. We have a file type that lets you specify how
large the file will be at allocation and reserves a contiguous block
of disk space for it. We also have another file type that will
allocate new blocks as required for the file, but even there the
block size can be specified when the file is allocated.

The fixed disk block size used by Unix seems to be designed to
simplify the operating system, rather than for efficient disk I/O.
I'm not familiar with the internals of virtual memory versions of
Unix. It would seem reasonable to me that paging I/O would attempt to
bypass most of the file system overhead and use a more primitive
level of disk access. It therefore seems to me that there isn't any
necessary relationship between the blocking factor used by the file
system and the system page size.

--
Joel Upchurch @ CONCURRENT Computer Corporation (A Perkin-Elmer Company)
Southern Development Center
2486 Sand Lake Road / Orlando, Florida 32809 / (305) 850-1031
{decvax!ucf-cs, ihnp4!pesnta, vax135!petsd, akgua!codas}!peora!joel
aglew@ccvaxa.UUCP (10/09/86)
> [Eric Green {akgua,ut-sally}!usl!elg]:

First, my own conclusions: virtual memory may not be necessary for
big single-job systems, but it is for just about anything else. Since
I don't see myself building big single-job systems in the near future
(or any systems, sigh) I will take virtual memory as a given. But I
can still see reasons not to have virtual memory - in certain
circumstances.

> A rehash of past arguments, about the working through huge matrix problem:
> "By the time we reach the end of the list, the front of the list will
> have been paged out".
> If the front of the matrix has been paged out, that means that the
> matrix was too big to fit in physical memory. That means that with no
> paging, the performance would be 0 -- a 100% degradation.

Talk about spurious line noise! If people explicitly overlay the
matrix, they can still process it - that's a lot better than 0
performance.

> "We can take advantage of disk seek times -- just put the data on
> close tracks."
> All the disk systems I have ever seen become somewhat fragmented
> over time.

Right. But high performance systems give you a way to make certain
files physically contiguous. Whether it's worth the bother is another
question.

> Also note that, because of intelligent i/o controllers, pulling a
> page off of disk takes virtually no CPU time, since the processor is
> running another process while the first one is blocked waiting for the
> i/o processor to bring in the page (assuming a multi-tasking system --
> Crays and such DO multi-task, no?). In other words taking advantage of
> explicit multi-processing.

Three things: (1) It is running *another* process, not the one that
you want to make really fast. The crux is, sometimes you want to make
one process really fast, at the cost of its mates. (2) Sure, your CPU
isn't idle - it's taking all sorts of TLB misses and cache faults and
swapping out its registers as it does the context switch between the
blocked, paging, process and the new one.
The IBM 3090 has already shown an example of a system where it is faster to page synchronously than to context switch, in which case your CPU *is* idle while paging - and such systems are likely to become more common. (3) By initiating overlays in advance, asynchronously, a non-virtual process also does not leave the CPU idle - nor does it waste CPU cycles in a context switch.

{Has anyone a good way of estimating time lost to context switches - totting up register saves, misses, etc.? I need a fairly abstract definition of context for a model of multiprocessing that shows that multiprocessing is sometimes faster than any uniprocessor system can be, for the same problem.}

> The only one which is really valid is the table lookup times
> necessary for paging.

There is also the overhead of paging in the MMU lookup table the initial time. Should that be prefetched at a context switch?

> [Page size]: If 4k was OK back in Multics days, 8K or even 16K might work
> today.

Gould already has 8K pages. The Japanese have gone for 1M pages.

> Also note that this is WORST CASE. Your mileage will probably be better.

Sometimes you have to design for the worst case (like when the benchmarks that customers use to decide whether they should buy your machine are your worst-case scenarios). I have a cute little idea for cramming more stuff into a TLB, provided the system will occasionally try to allocate contiguous physical memory. The architect I showed it to said yes, my idea was better than the standard trick for multiple page size systems - but it wasn't worth doing, because he would still have to make the TLB large enough to handle the worst case, where each contiguous chunk was the smallest page size. Sigh.

> Especially considering that
> because associative lookups probably run about the same speed as
> checking bounds registers, the memory speed would be otherwise about
> the same.

Do really fast systems use bounds registers? What bounds registers?
> Another argument was that paging messes up pipelining...
> Requiring vectors to be page-aligned and having the pages the
> same size as the registers or a binary size larger would eliminate any
> pipeline lossage (just do your virtual memory grokking BEFORE you get
> your pipeline rolling).

This won't work if you want to pipeline calculations of a vector of addresses (or indexes) with the actual access, i.e. when the addresses are not available before the operation starts. So either you don't pipeline this (a considerable loss, but one that can be tolerated) or you make it possible to recover from a page fault in the middle of an operation (ouch).

> And, of course, MIPS machines & others prove that for
> ordinary code, pipelining and paging are not mutually exclusive.

Carefully designed instruction sets mean that page faults will only occur at a limited number of known locations, making recovery easier.
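Aglew's point (3) above - initiating overlays in advance, asynchronously, so a non-virtual process never leaves the CPU idle - amounts to double buffering. The C sketch below is purely illustrative: it is not any actual Cray or overlay-library interface, and the "asynchronous" read is simulated with an ordinary pread() for brevity (a real system would have the next read in flight while the compute phase runs).

```c
/* Double-buffered explicit overlaying: while one chunk is being
   processed, the next chunk's read would already be in progress. */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK 4096

/* Sum all bytes of a file, processing it in CHUNK-sized overlays
   and alternating between two buffers. */
long sum_file_overlaid(const char *path)
{
    char buf[2][CHUNK];
    long total = 0;
    int cur = 0;
    off_t off = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, buf[cur], CHUNK, off);   /* initial overlay */
    while (n > 0) {
        /* a real overlay scheme would issue this read asynchronously,
           then do the compute phase while it completes */
        ssize_t next = pread(fd, buf[1 - cur], CHUNK, off + n);
        for (ssize_t i = 0; i < n; i++)
            total += (unsigned char)buf[cur][i];   /* "compute" phase */
        off += n;
        cur = 1 - cur;
        n = next;
    }
    close(fd);
    return total;
}
```

The point of the structure is that the CPU alternates between buffers without ever blocking in a context switch; whether it wins in practice depends on the I/O system keeping up with the compute phase.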
aglew@ccvaxa.UUCP (10/09/86)
> Loading 64K of data via 8 page faults should not be any more expensive than
> loading 64K of data in one fell swoop.

You are way off base on this.
jlg@lanl.ARPA (Jim Giles) (10/12/86)
>>>Note that the computational speed of the Cray really does exceed I/O speed
>>>by about a factor of 10^6 - page faults in this kind of environment are
>>>EXTREMELY costly.
>
>Not at all. The I/O processor does all the work of reading a page off
>of disk and stuffing it into memory (via DMA). Meanwhile, the CPU goes
>off and runs somebody else's job, taking advantage of that explicit
>multiprocessing. Assuming you're not using the entire machine for a
>single batch job, that is. So even if you have a working set larger
>than available memory, if you have a good scattering of jobs so that
>thrashing doesn't occur, the CPU still gets the same amount of work
>done -- it's just divided amongst users, instead of all happening to
>the same process.

Thank you for making my point so clearly! There are applications (and whole computing environments) where turnaround is MUCH MORE important than throughput.

> Also note that page faults do not occur unless the working set of the
>program exceeds available physical memory -- if a page fault occurs,
>it means your program wouldn't have run without paging, anyway.

Sure it would! And more efficiently too! Paging means that the program must make do with the memory management and I/O scheduling that is provided by the VM system. The VM system is built with all sorts of compromises, since it must work acceptably well for many different data usage patterns - it cannot make use of any specific knowledge of the data usage that is known to the programmer. (Oh, yes. Some people have pointed out that the VM system COULD be augmented with special calls to allow the programmer to inform it of the data usage patterns to be expected. There are three problems with this approach: 1> such calls are quite vague - at least as proposed, 2> the use of such calls is as much a programming chore as explicit memory management anyway, 3> no existing VM system I'm aware of has this type of call implemented.)

> ...
>And if
>a page fault DOESN'T occur (that is, if the program would have run in
>physical memory), then you can discard the overhead of page faults
>altogether, and just concentrate on the overhead of virtual address
>translation. Loading is loading. Loading 64K of data via 8 page faults
>should not be any more expensive than loading 64K of data in one fell
>swoop.

Not true. 8 page faults means 8 system calls (or equivalent); probably 8 disk seeks (unless you are alone on the system and no other user has requested data from that disk since the last page fault); and 8 disk revolutions (unless the page faults are so close together that the disk hasn't moved past the start of the next sector before you ask for it). The truth is that a single long consecutive disk transfer is ALWAYS more efficient than a lot of small transfers.

>>>desireable to make the pages large. That way you can take advantage of the
>>>fact that most I/O comes from consecutive disk locations - therefore there
>>>are fewer seeks with large page size.
>
>I have never seen a disk system that did not fragment. Unless the disk
>page size was also similarly large. Upon which you get lots of wasted
>space, because most files are about 1K long or so, and you'd have
>these mungo 16K disk pages only 1/16th full... Imagine, a 160 megabyte
>disk with only 10 megabytes used on it, totally full.

The CTSS (Cray Time Sharing System) requires an initial length in order to create a file. The file is then allocated CONSECUTIVELY (if there is a large enough hole - some idle system time is spent consolidating holes). The file is only segmented if it is extended after its creation. Even for an extension, the file is extended consecutively if possible. In practice, most files are consecutive for most of their length. Also, very little space is taken up by files as small as '1K long or so'.
To be sure, there are a lot of such files, but it would take 32,000 of them to equal the size of an executing program image (I assume you mean 1K bytes; program images can be 4M words). Also - the disk sector size is the granularity of disk allocation: it is 512 words no matter what size the 'pages' are. Changing page size does not affect disk allocation in any way.

I see that you really have no idea whatsoever about the environment or the problems inherent in large memory systems. Even your examples give you away - 160MB is not a very large disk drive, 1KB is not a typical file size, etc. Our disk drives hold about 1.2GB, files can be as large as 40MW (=320MB), average user data files are over 1MW (=8MB) - only source code is as small as 1KB (which is rounded up to 512 words = 4KB because that is the sector size of the disk drives). Such small source files are also rare - the main users have codes which amount to 100,000 to 500,000 lines of Fortran (not counting the library support). In short, you are talking about a completely different environment than I am. Learn something about large computing environments before you try to force unwanted features upon them!

J. Giles
Los Alamos
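Giles's seek-and-rotation argument can be put into rough numbers. The cost model below is a back-of-envelope sketch; the constants (16ms average seek, 3600rpm, 1MB/s sustained transfer) are illustrative guesses for a drive of that era, not measurements from any particular system:

```c
#include <assert.h>

/* Estimated milliseconds to move `bytes` of data in `transfers`
   separate requests, each request paying one average seek plus,
   on average, half a rotation of latency. */
double xfer_ms(double bytes, int transfers)
{
    const double seek_ms      = 16.0;    /* average seek time        */
    const double halfrot_ms   = 8.3;     /* half a rev at 3600 rpm   */
    const double bytes_per_ms = 1000.0;  /* 1 MB/s sustained rate    */
    return transfers * (seek_ms + halfrot_ms) + bytes / bytes_per_ms;
}
```

With these made-up but plausible figures, loading 64K in eight 8K page faults takes nearly three times as long as one consecutive 64K transfer: the data movement is identical, and the whole difference is the per-request seek and rotational overhead.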
henry@utzoo.UUCP (Henry Spencer) (10/15/86)
> ... The IBM 3090 has already shown an example of a
> system where it is faster to page synchronously rather than context switch,
> in which case your CPU *is* idle while paging - and these are likely to
> become more common. ...

Ah, but what device are they paging from? If I remember it correctly -- I'm not sure of this -- the 3090 is paging out of ordinary RAM, which they would have preferred to make directly addressable except that they are constrained by backward compatibility and not enough address bits. This is a kludge, not something that is likely to become widespread.
--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,decvax,pyramid}!utzoo!henry
barada@maggot.UUCP (10/16/86)
Written 2:47 am Sep 27, 1986 by usl.UUCP!elg

>In article <78@alberta.UUCP> cdshaw@pembina.UUCP (Chris Shaw) writes:
>>In article <7597@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>>>
>>>Note that the computational speed of the Cray really does exceed I/O speed
>>>by about a factor of 10^6 - page faults in this kind of environment are
>>>EXTREMELY costly.
>
>Not at all. The I/O processor does all the work of reading a page off
>of disk and stuffing it into memory (via DMA). Meanwhile, the CPU goes
>off and runs somebody else's job, taking advantage of that explicit
>multiprocessing. Assuming you're not using the entire machine for a
>single batch job, that is. So even if you have a working set larger
>than available memory, if you have a good scattering of jobs so that
>thrashing doesn't occur, the CPU still gets the same amount of work
>done -- it's just divided amongst users, instead of all happening to
>the same process.
>
> Also note that page faults do not occur unless the working set of the
>program exceeds available physical memory -- if a page fault occurs,
>it means your program wouldn't have run without paging, anyway. And if
>a page fault DOESN'T occur (that is, if the program would have run in
>physical memory), then you can discard the overhead of page faults
>altogether, and just concentrate on the overhead of virtual address
>translation. Loading is loading. Loading 64K of data via 8 page faults
>should not be any more expensive than loading 64K of data in one fell
>swoop.

Your argument is good, but it does have its weaknesses:

1) The CPU would not get the same amount of work done under heavy page faulting, since context switches occur and the kernel has to reschedule running processes and recompute scheduling biases. So a program that goes page-fault happy (such as a monster hashing algorithm) would actually cause the WHOLE system to run slower, because all of those page faults spend time in the scheduler.
2) What do you do if two processes each need 80% of the available memory? Run one so it doesn't fault and then run the other? This is great for batch systems, not timesharing. Actually, a good design would try to balance it out by allowing each a page set that takes up half of memory. This would give each about the same probability (theoretically) of having the proper page in memory. Now if one process runs in only 10% of its workspace, then its workspace can be reduced to allow the other, piggish process a better chance of having a page hit. So the likelihood of a process having its WS available is better if the process is small. But if it is really large, then that likelihood is inversely proportional to the WS needed. So really large processes run even SLOWER, since they cannot get the large WS unless they run in a `batch' style. The calculation to adjust the WS is done on each page fault, which can slow things down even more for huge systems.

I am not faulting VM systems for Crays, I'm just pointing out some things that were overlooked.
--
Peter Barada                                     | (617)-671-9905
Applicon, Inc. A division of Schlumberger Ltd.   | Billerica MA, 01821
UUCP: {allegra|decvax|mit-eddie|utzoo}!linus!raybed2!applicon!barada
      {amd|bbncca|cbosgd|wjh12|ihnp4|yale}!ima!applicon!barada

Sanity is yet another perversion of reality.
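The balancing Barada describes - shrinking the allotment of a process that touches little of its workspace so the piggish one gets a better chance of a page hit - can be sketched as a proportional reallocation. The formula below is a toy illustration; no real pager is claimed to use exactly this rule:

```c
#include <assert.h>

/* Reassign `total` page frames among n processes in proportion to
   how many frames each actually touched in the last interval
   (used[i]).  A process using only 10% of its workspace is shrunk,
   freeing frames for one that touches everything it has. */
void rebalance(int total, int n, const int used[], int frames[])
{
    long total_used = 0;
    for (int i = 0; i < n; i++)
        total_used += used[i];
    if (total_used == 0)
        return;                            /* no information yet */

    int given = 0;
    for (int i = 0; i < n - 1; i++) {
        frames[i] = (int)((long)total * used[i] / total_used);
        given += frames[i];
    }
    frames[n - 1] = total - given;         /* remainder, none lost */
}
```

Barada's cost point shows up directly: if something like this runs on every page fault, a fault-heavy workload pays the recomputation over and over, on top of the faults themselves.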
josh@polaris.UUCP (Josh Knight) (10/17/86)
In article <7232@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>> ... The IBM 3090 has already shown an example of a
>> system where it is faster to page synchronously rather than context switch,
>> in which case your CPU *is* idle while paging - and these are likely to
>> become more common. ...
>
>Ah, but what device are they paging from? If I remember it correctly --
>I'm not sure of this -- the 3090 is paging out of ordinary RAM, which they
>would have preferred to make directly addressable except that they are
>constrained by backward compatibility and not enough address bits. This
>is a kludge, not something that is likely to become widespread.

What IBM prefers to do is a difficult thing to determine (wish I could ;-). It is true that the expanded storage announced for the 3090 is "ordinary RAM". However, it is not necessarily true that one would prefer to have real memory instead of a fast page-addressable store. Fewer address bits are required to maintain the latter (only page addresses are maintained), and one does not have to maintain whatever protection mechanisms (in the case of the IBM 370 architecture, storage protect keys) are provided for "main storage". In addition, page stores could have different porting and provide access to multiple processors and/or I/O servers with less overhead. These latter features are not supported by the 3090 expanded storage, but other systems might choose page-addressable stores over byte-addressable stores for these reasons.

Of course, I'm speaking only for myself, not my employer.
--
Josh Knight, IBM T.J. Watson Research
josh@ibm.com, josh@yktvmh.bitnet, ...!philabs!polaris!josh
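Knight's remark that a page-addressable store needs fewer address bits than a byte-addressable one is easy to quantify. The sizes below (a 2GB store, 4KB pages) are hypothetical examples, not the 3090's actual expanded-storage parameters:

```c
#include <assert.h>

/* Number of address bits needed to name n distinct units,
   where n is a power of two. */
int addr_bits(unsigned long long n)
{
    int b = 0;
    while ((1ULL << b) < n)
        b++;
    return b;
}
```

Byte-addressing a 2^31-byte store takes 31 bits; addressing the same store as 4KB pages takes only 19 bits, and the per-byte protection machinery (storage protect keys, in the 370 case) never enters the picture.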
jnw@mcnc.UUCP (John White) (10/21/86)
In article <7232@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
> ... If I remember it correctly --
> I'm not sure of this -- the 3090 is paging out of ordinary RAM, which they
> would have preferred to make directly addressable except that they are
> constrained by backward compatibility and not enough address bits. This
> is a kludge, not something that is likely to become widespread.
> --
> Henry Spencer @ U of Toronto Zoology
> {allegra,ihnp4,decvax,pyramid}!utzoo!henry

With memory prices dropping, it makes some sense to use a RAMdisk as the paging device on a high performance system. The RAMdisk can use slower, less expensive memory. Also, the main memory bus can be driven much faster if it is known that there will be no more than, say, 500 memory chips on it. The RAMdisk, on the other hand, could have more than 50,000 chips.

-jnw@mcnc
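The attraction of White's RAMdisk-as-paging-device is that a page-in degenerates into a memory copy: no seek, no rotational delay. A minimal sketch of such a sector-addressed RAM "drive" (sizes and names invented for illustration):

```c
#include <assert.h>
#include <string.h>

#define SECTOR   512
#define NSECTORS 1024            /* a tiny 512KB "drive" */

/* The backing store: just a slower, cheaper bank of RAM. */
static char ramdisk[NSECTORS][SECTOR];

/* Read one sector into buf; returns 0 on success, -1 if out of range. */
int ramdisk_read(unsigned sector, void *buf)
{
    if (sector >= NSECTORS)
        return -1;
    memcpy(buf, ramdisk[sector], SECTOR);   /* no seek, no rotation */
    return 0;
}

/* Write one sector from buf; returns 0 on success, -1 if out of range. */
int ramdisk_write(unsigned sector, const void *buf)
{
    if (sector >= NSECTORS)
        return -1;
    memcpy(ramdisk[sector], buf, SECTOR);
    return 0;
}
```

White's bus point is the interesting design choice here: because this store sits behind a sector interface rather than on the main memory bus, the main bus can stay short and fast while the paging store grows to tens of thousands of chips.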