chiueh@sprite.Berkeley.EDU (Tzi-cker Chiueh) (01/11/91)
Here is a summary of the responses I got regarding the protection facilities in Cray machines.

Question: How does Cray provide protection?

I am currently investigating methods of minimizing virtual memory overheads in high-performance computers. I think I have a way of solving logical-to-physical address translation, but I can't think of an efficient way of providing protection. I figure that since the Cray doesn't have virtual memory, maybe it can teach me something about achieving protection inexpensively. Can anybody enlighten me about how Cray provides the protection that normal virtual memory systems offer, or tell me where I can find a useful description? Thank you.

-- tzi-cker
--------------------------------------------------------------------------
CRAYs (at least the X and Y models) use base and limit register pairs (one set for code, one for data) for each process. The base register is added to the logical address (as generated by the program) to give the physical memory address. It is, I think, the physical address that is checked against the limit register, and, if it is out of bounds, a program or operand range exception is generated. With separate code/data base/limit registers, code sharing is possible, but I don't think it is much used.

-- david
-------------------------------------------------------------------------------
The kind of protection I have in mind is access-right control (e.g., read-only). "Normal virtual memory systems" perform this kind of protection check while doing logical-to-physical address mapping. The protection bits are either in the page tables or in the TLB. Now, since the Cray doesn't have virtual memory, the question is whether it provides access control and, if so, where it puts this check.

From the previous responses, it seems that the Cray provides only an out-of-bounds protection check. Furthermore, this check is done for EVERY reference. If this is indeed the case, this protection check should be as expensive as address mapping in machines that have VM.
So why does Cray get rid of virtual memory altogether? Or does anybody know how much performance improvement we can gain from getting rid of VM?

-- tzi-cker
------------------------------------------------------------------------------
I'm not exactly sure of your question, but I'll try to give you some insight. Cray uses a pair of base and limit address registers. There is a base and limit address for instruction memory and a base and limit address for data memory. When a program makes a reference to logical location zero, the base address is added to it to get the physical address. If the physical address is larger than the limit address, then a "program range" error interrupt is generated. The limit address is your protection mechanism. That's about as simple a memory management h/w as you can get.

You might also want to take a look at the memory management scheme in an old Digital pdp8 or Data General Eclipse. The DEC used memory segments and the DG used memory banks, as far as I can remember.

--Larry
------------------------------------------------------------------------------
The Cray protection scheme is so simple that it is confusing. A bounds check is done on every single address generated. But it is easy, and takes place in the few cycles necessary, because it is just a simple integer add and compare. Period. The reason that it can do this is because memory is *contiguous* !!!

Doesn't that cause a lot of problems with memory fragmentation? Yes. The Cray kernel does a lot of copying to compact memory. And it can only do it at certain times. This does cause poor memory utilization compared with a system with VM.

Every design is a compromise. The Cray 1/X/Y are upward compatible, and the Cray-1 didn't have VM because it was viewed as an unnecessary complication at the time it was designed (early 1970's).

-- Hugh
-----------------------------------------------------------------------------
1. On a machine of this speed, a comparison with a bounds value is _much_ faster than a second memory reference (i.e., to the page table).
2. The page table lookup has to be done before the real memory reference. The bounds check can be done in parallel with the real memory reference.
3. On a vector operation, it is possible to do the bounds checking for just the first and final addresses. Potentially, the page table lookup would have to be done for each address.
4. Cray crams as much circuitry as possible on their boards. Adding circuitry to handle the paging probably would have meant giving up something else.

-- Kurt
-----------------------------------------------------------------------------
chiueh@sprite.Berkeley.EDU (Tzi-cker Chiueh) writes:
> So why does Cray get rid of virtual memory altogether ? Or does anybody
> know how much performance improvement can we gain from getting rid of VM ?

I suggest you see the IEEE proceedings from the Supercomputing conference that was held last month in New York. Cray published an article in these proceedings that describes their memory architecture and gives clock timings for current and future memory architectures.

Summary of what follows:
- Memory speed is THE supercomputing bottleneck.
- Cray can fetch from memory in 17 cycles. Demand paging would lengthen this time significantly.
- Virtual memory trades speed for money. Supercomputers do not compromise on speed.
- Cray Y-MP/8s have 4 gigabyte per second memory bandwidths.
- Supercomputing working sets and problem sizes tend to be equal.
- Demand paging would complicate an already very complicated instruction scheduler.

Memory speed is THE bottleneck in supercomputing. It is what makes Cray king of the hill. The Japanese have faster peak CPU speeds, but their memory bandwidths are inferior. This is a key reason why Cray machines are the fastest computers available for most production benchmarks (with notable exceptions).
The number of cycles needed to transfer the first word from memory to a register is one of the most critical timings in the supercomputer. Cray can do this in 17 cycles. An SX3 requires 70 cycles. An ETA 10 needed hundreds of cycles. Adding demand paging will significantly lengthen this cycle time. If you can add demand paging without adding cycles to this memory fetch time, then I am sure Cray will make you a rich person.

Supercomputers with virtual memories have been tried. The CDC 205 and the ETA10 are examples. When these machines ran codes where the problem size exceeded the RAM size (paging), they ran 10 times slower than when paging did not occur.

Virtual memory is a technique of trading time for money. Virtual memory costs less than real memory, but is slower. Slower memory is not an option for supercomputing. Witness the success of Cray and the demise of ETA.

The Cray achieves two words read and one word written per clock per CPU. On a Y-MP/8 this is a memory bandwidth of 4 gigabytes per second. Disk bandwidths are not adequate to keep up with this type of demand.

The theory of virtual memory depends on the working set being smaller than the problem size. In most supercomputer applications the working set is the problem size. I am sure the architecture of these applications was influenced by programming for real-memory machines, so this is somewhat of a circular argument. However, for the status quo, this is true.

Crays are vector machines with extremely sophisticated instruction schedulers. The Cray often has several instructions issued at once in the same CPU. X-MPs and Y-MPs scoreboard conflicts between instructions and are able to compensate for bank and section memory delays. These delays tend to be for one to four cycles. The instruction scheduler architecture would be even more difficult if it had to account for page-fault delays of many thousands of cycles. An approach to this problem would be to require the compilers to never allow a vector sub-section to cross a page boundary.

-- Kent
--------------------------------------------------------------------------------
Saw your information request about Crays, and thought that I might be able to point you towards some useful information:

I suggest that you check up on Control Data's Cyber 180-series (currently Cyber 2000-series) machines - they are a full hardware Multics implementation, and have some truly "unique" virtual memory hardware. I can personally vouch that the address translation hardware, which also is doing access control checking, is VERY fast, and it has several more levels of indirection than most other folks' virtual memory architectures. Cyber 180 is such a complete Multics that there is actually NO REAL MEMORY ADDRESSING MODE. It is NOT POSSIBLE to access memory by real memory address; the hardware doesn't have the capability!

It is also interesting that when a Cyber 180 is emulating Cyber 170 mode, it ALSO has base/limit register hardware in operation, since the 170 architecture is real-memory, and only has base/limit restrictions. When a Cyber 180 is running in 170 mode, it really is running a virtual real-memory machine on its virtual memory hardware (just saying this makes my mind feel like a pretzel).

If nothing else, the CDC stuff should make interesting counter-culture reading material for you. It was/is truly different.

I also suspect that in the Crays (although I have never read the hardware prints of a Cray, only the CDC machines), the bounds checking is being done on the VIRTUAL address, as it were, not the real memory address. This method allowed the old CDC machines (the ones Seymour Cray designed) to do their access checking in the CPU, not the memory controller, and thus kill off the references earlier in the instruction.
-- Gregory
----------------------------------------------------------------------------
> Furthermore, this check is done for EVERY reference.
> If this is indeed the case, this protection check process should be as
> expensive as address mapping in machines that have VM.

Why do you assume this? Given that the latency of Cray memory is 4 cycles or so, the check can be done after the address is sent off to memory and can generate a fault before the data gets back.

> So why does Cray get rid of virtual memory altogether ?

Well, many supercomputer applications can't page and have to swap. In that case, why provide VM?

-- greg
----------------------------------------------------------------------------
In article <1990Dec19.181343.10365@agate.berkeley.edu> you write:
> The kind of protection I have in mind is access right control (e.g., read-only)
> "Normal virtual memory systems" perform this kind of protection check while
> doing logical-physical address mapping. The protection bits are either in page
> tables or TLB. Now, since Cray doesn't have virtual memory, the question is
> does it provide access control, if so, where does it put this check ?

The Cray does not provide extensive access control. For each running program a (consecutive) part of actual memory is mapped to the logical address space of the program (which starts at 0). With each reference the logical address is compared to the logical bounds register, and the base register is added to it before going to memory.

> From the previous responses, it seemed that Cray only provides out-of-bound
> protection check. Furthermore, this check is done for EVERY reference.
> If this is indeed the case, this protection check process should be as
> expensive as address mapping in machines that have VM.

Clearly this is much less expensive than true VM; only two registers are needed to do everything (address translation and bound checking), and those two registers reside directly in the CPU.

> So why does Cray get rid of virtual memory altogether ? Or does anybody
> know how much performance improvement can we gain from getting rid of VM ?

This is much less expensive because check and translation go on in parallel within a single clock cycle.

-- dik
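The base/limit scheme that several respondents describe, together with Kurt's point about checking only the first and final address of a vector operation, can be sketched in a few lines of Python. This is only an illustration and not Cray's actual logic: the names `Segment`, `translate`, `translate_vector` and `RangeError` are made up here, the limit is modelled as a physical upper bound (david's and Larry's description; dik describes a compare against a logical bound, which gives the same result for a contiguous region), and the real hardware does the add and the compare in parallel rather than in sequence.

```python
from __future__ import annotations

from dataclasses import dataclass


class RangeError(Exception):
    """Stands in for the program/operand range exception (hypothetical name)."""


@dataclass(frozen=True)
class Segment:
    base: int   # physical address of the job's logical word 0
    limit: int  # first physical address past the job's contiguous region


def translate(seg: Segment, logical: int) -> int:
    """One add and one compare; no table lookup in memory is needed."""
    physical = seg.base + logical
    if logical < 0 or physical >= seg.limit:
        raise RangeError(f"logical address {logical} is out of range")
    return physical


def translate_vector(seg: Segment, start: int, stride: int, count: int) -> list[int]:
    """Range-check a strided vector access using only its two end points.

    With a constant stride the addresses are monotonic, so if the first
    and the last element are in range, every element between them is too.
    """
    if count <= 0:
        return []
    last = start + stride * (count - 1)
    translate(seg, start)  # check the first element
    translate(seg, last)   # check the final element
    return [seg.base + start + stride * i for i in range(count)]
```

For a job loaded at `Segment(base=4096, limit=4096 + 1024)`, logical word 0 maps to physical word 4096, while logical word 1024 raises `RangeError`. A paged design would instead need a table lookup (or a TLB hit) per page touched, which is the cost the respondents are contrasting this with.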
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (01/11/91)
> On 10 Jan 91 23:07:15 GMT,chiueh@sprite.Berkeley.EDU (Tzi-cker Chiueh) said:
chiueh> So why does Cray get rid of virtual memory altogether ? Or
chiueh> does anybody know how much performance improvement can we gain
chiueh> from getting rid of VM
kent> The number of cycles needed to transfer the first word from
kent> memory to a register is one of the most critical timings in
kent> the supercomputer. Cray can do this in 17 cycles. An SX3
kent> requires 70 cycles. An ETA 10 needed hundreds of cycles.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is/was a common misconception. The ETA-10 actually attained
vector startup times of as short as about 23 cycles if the first pages
of all the operands were in real memory. This includes both the time
required to get the first element from memory into the pipe as well as
3-4 more cycles to get the pipe filled. Results were then available
from the first operation on about the 24th cycle (if the result was to
be immediately reused) or about the 30th cycle if the result had to go
all the way back to memory. The startup time varied between about 16
and 32 cycles depending on whether the memory banks were aligned and
whether or not operations were being chained (in which case there were
two pipes to fill, not just one).
On a number of test loops, the ETA-10 was significantly *faster* on
short vector operations than the 8.5ns Cray X/MP. This did not
typically mean that short-vector *application codes* ran faster on the
ETA-10, though.... :-(
kent> Adding demand paging will significantly lengthen this cycle
kent> time. If you can add demand paging without adding cycles to
kent> this memory fetch time, then I am sure Cray will make you a
kent> rich person.
ETA/CDC did it, and it certainly did not make them rich! I believe
that the two Cray companies simply decided that the benefits of VM
were not worth the hassle. So far the market has proven them right.
kent> Supercomputers with virtual memories have been tried. The CDC
kent> 205 and the ETA10 are examples. When these machines ran codes
kent> where the problem size exceeded the RAM size (paging), they ran
kent> 10 times slower than when paging did not occur.
This is hardly surprising. Anyone with any experience at all realizes
that VM is to be used to make a small class of jobs much easier to
code by letting the hardware handle the large address space -- *not*
to just run larger-than-real-memory jobs. It should be noted that it
is possible to write jobs that are larger than real memory but which
do not slow down significantly in a VM system. One application was a
straightforward LU-decomposition of a 2000x2000 dense matrix. Only
about 2 million words were available to the user on the machine, while
the matrix required 4 million words of virtual space. By using a block-mode
algorithm and the best ETA UNIX swapping code, our CDC applications
specialist was able to get nearly full performance on this problem.
The advantage relative to the Cray was that on the ETA it could be
done in standard Fortran, while the Cray would have required explicit
I/O.
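To make "block-mode algorithm" concrete, here is a minimal sketch of a right-looking blocked LU factorization in plain Python. It is not the ETA code mentioned above: the function names `blocked_lu` and `multiply_lu` are invented for this illustration, pivoting is left out (so it assumes a matrix that needs none, such as a diagonally dominant one), and the block size is an arbitrary parameter. The point is only the access pattern: each outer step factors one narrow panel and then reuses that panel for the whole trailing update, which is the kind of locality that lets a paged machine keep the active data resident.

```python
def blocked_lu(a, block):
    """In-place LU without pivoting on a square list-of-lists matrix.

    On return, the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle, including the diagonal, holds U.
    """
    n = len(a)
    for k0 in range(0, n, block):
        k1 = min(k0 + block, n)
        # Step 1: factor the panel (columns k0..k1-1, rows k0..n-1).
        for k in range(k0, k1):
            pivot = a[k][k]
            for i in range(k + 1, n):
                a[i][k] /= pivot
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 2: finish the block row of U (rows k0..k1-1, columns k1..n-1)
        # by forward substitution with the unit lower triangle of the panel.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 3: update the trailing submatrix with the panel just computed.
        for i in range(k1, n):
            for j in range(k1, n):
                s = 0.0
                for k in range(k0, k1):
                    s += a[i][k] * a[k][j]
                a[i][j] -= s
    return a


def multiply_lu(lu):
    """Rebuild L*U from the packed factors, for checking the result."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                s += l_ik * lu[k][j]
            out[i][j] = s
    return out
```

Calling `multiply_lu(blocked_lu(copy_of_a, block))` should reproduce the original matrix up to rounding, for any block size from 1 to n. A production code would add pivoting and would pick the block size to match the amount of real memory (or cache) available.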
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@brahms.udel.edu
College of Marine Studies, U. Del. J.MCCALPIN/OMNET
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (01/11/91)
My apologies to the net for two of the three previous accidental posts. One of them was intentional -- let the reader decide.
Note the Followup-To: comp.sys.super
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@brahms.udel.edu
College of Marine Studies, U. Del. J.MCCALPIN/OMNET
dik@cwi.nl (Dik T. Winter) (01/18/91)
In article <DREIER.91Jan13223533@husc9.harvard.edu> dreier@husc9.harvard.edu (Roland Dreier) writes:
> I guess one thing that should be noted about how Crays handle memory: under
> Unicos, since it lacks virtual memory, whole jobs get swapped. What this
> means is that if I'm running two jobs on a 256 megaword machine and each of
> them wants 129 megawords of memory, every time a job gets swapped out, I
> have to do 129 megawords of IO, rather than just the small amount of overlap
> that I "really" have to do; so clearly, even a slightly slower virtual memory
> system would be at an advantage here.

But this is not a defect of the lack of VM; it is a defect of the OS. If the OS wants to swap out the complete job it can do so, but it is not necessary.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
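The arithmetic behind this exchange is short enough to write out. The sketch below uses the numbers from the quoted post (a 256 megaword machine and two 129 megaword jobs); the function name `swap_traffic_megawords` is made up for the illustration, and it ignores operating system space and any alignment or contiguity constraints, so it is a lower bound rather than a description of what Unicos does.

```python
def swap_traffic_megawords(memory, job_a, job_b):
    """Compare whole-job swapping with moving only the overlapping part.

    All sizes are in megawords. 'whole_job' is what has to be written out
    when job_a is swapped out entirely; 'overlap_only' is the amount by
    which the two jobs together exceed real memory.
    """
    overlap = max(0, job_a + job_b - memory)
    return {"whole_job": job_a, "overlap_only": overlap}


traffic = swap_traffic_megawords(memory=256, job_a=129, job_b=129)
print(traffic)
```

With these numbers the two jobs exceed memory by only 2 megawords, against 129 megawords for a full swap, which is the gap both posters are arguing about. Whether an OS on a base/limit machine can actually move just that overlap depends on keeping each job contiguous, which the sketch does not model.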