[comp.unix.cray] Summary for Protection in Cray

chiueh@sprite.Berkeley.EDU (Tzi-cker Chiueh) (01/11/91)

Here is a summary of the responses I got regarding the protection facilities in Cray machines.

Question: How does Cray provide protection?

I am currently investigating methods of minimizing virtual memory overheads
in high-performance computers. I think I have a way of handling 
logical-to-physical address translation, but I can't think of an efficient way 
of providing protection. I figure that since Cray doesn't have virtual memory, maybe
it can teach me something about achieving protection inexpensively. 
Can anybody enlighten me about how Cray provides the kind of protection that 
normal virtual memory systems offer, or where I can find a useful description in this regard? 
Thank you. 

-- tzi-cker

--------------------------------------------------------------------------
CRAYs (at least X and Y models) use base and limit register pairs (one
set for code, one for data) for each process.  The base register is added
to the logical address (as generated by the program) to give the physical
memory address.  It is, I think, the physical address that is checked
against the limit register, and, if out of bounds, a program or operand
range exception is generated.

With separate code/data base/limit registers code sharing is possible, but
I don't think it is much used.
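
For concreteness, here is a minimal C sketch of the base/limit scheme described
above.  The structure, the field names, and the choice of checking the logical
address against a length (the replies below disagree on whether the compare
happens before or after the base is added) are illustrative assumptions, not
Cray specifics.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical per-process relocation state: one base/limit pair.
     * A real X/Y-MP keeps separate pairs for instructions and data. */
    struct region {
        uint64_t base;    /* physical address added to every logical address */
        uint64_t limit;   /* size of the region, in words */
    };

    /* Translate a logical address and bounds-check it.  The add and the
     * compare are independent, so hardware can do them in parallel and
     * overlap the compare with the memory access itself. */
    static uint64_t translate(const struct region *r, uint64_t logical)
    {
        if (logical >= r->limit) {          /* out of bounds? */
            fprintf(stderr, "operand range error at %llu\n",
                    (unsigned long long)logical);
            exit(1);                        /* stand-in for a hardware fault */
        }
        return r->base + logical;           /* relocated physical address */
    }

    int main(void)
    {
        struct region data = { .base = 0x40000, .limit = 4096 };
        printf("%llx\n", (unsigned long long)translate(&data, 100));  /* ok */
        translate(&data, 5000);             /* faults: beyond the limit */
        return 0;
    }

Note that there are no per-page protection bits anywhere in this path, which is
exactly the limitation the original poster asks about below.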

-- david

-------------------------------------------------------------------------------
The kind of protection I have in mind is access-right control (e.g., read-only).
"Normal virtual memory systems" perform this kind of protection check while 
doing logical-to-physical address mapping; the protection bits are in the page 
tables or the TLB.  Now, since Cray doesn't have virtual memory, the question is: 
does it provide access control, and if so, where does it put this check?
From the previous responses, it seems that Cray only provides an out-of-bounds
check.  Furthermore, this check is done for EVERY reference. 
If this is indeed the case, this protection check should be as 
expensive as address mapping in machines that have VM. 
So why does Cray get rid of virtual memory altogether?  And does anybody 
know how much performance improvement we can gain from getting rid of VM?

-- tzi-cker

------------------------------------------------------------------------------
I'm not exactly sure of your question, but I'll try to give you some
insight.  Cray uses a pair of base and limit address registers.
There is a base and a limit address for instruction memory and a base
and a limit address for data memory.  When a program makes a reference
to a logical location (logical addresses start at zero), the base address
is added to it to get the physical address.  If the physical address is
larger than the limit address, then a "program range" error interrupt is
generated.  The limit address is your protection mechanism.  That's about
as simple as memory-management hardware gets.

You might also want to take a look at the memory management scheme 
in an old Digital PDP-8 or Data General Eclipse.  As I recall, the DEC used 
memory segments and the DG used memory banks.
 
  --Larry


------------------------------------------------------------------------------
The Cray protection scheme is so simple that it is confusing.  A bounds
check is done on every single address generated.  But it is easy, and takes
place in the few cycles necessary, because it is just a simple integer add and
compare.  Period.  The reason it can do this is that each process's memory is 
*contiguous* !!!

Doesn't that cause a lot of problems with memory fragmentation?  Yes.  The
Cray kernel does a lot of copying to compact memory.  And it can only do it
at certain times.  This does cause poor memory utilization compared with a
system with VM.  Every design is a compromise.  The Cray 1/X/Y are upward
compatible, and the Cray-1 didn't have VM because it was viewed as an
unnecessary complication at the time it was designed (early 1970s).
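
As a rough illustration of that compaction, here is a toy C sketch that packs
contiguous partitions toward low memory and fixes up each job's base register.
The partition structure and field names are invented for the example; this is
not how the actual Cray kernel represents jobs.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Toy model: each job owns one contiguous partition of a flat physical
     * memory array, located by its base register. */
    struct partition {
        size_t base;   /* offset of the partition in physical memory (words) */
        size_t size;   /* length of the partition (words); 0 = free slot */
    };

    /* Slide every live partition down so all free space ends up as one hole
     * at the top.  Assumes the partitions are listed in ascending base
     * order.  The copying is the real cost: with only base/limit relocation
     * this is how external fragmentation gets repaired, and it can only
     * happen while the affected jobs are stopped. */
    static void compact(uint64_t *mem, struct partition *p, size_t nparts)
    {
        size_t next = 0;                    /* next free physical offset */
        for (size_t i = 0; i < nparts; i++) {
            if (p[i].size == 0)
                continue;                   /* skip free slots */
            if (p[i].base != next)
                memmove(&mem[next], &mem[p[i].base],
                        p[i].size * sizeof mem[0]);
            p[i].base = next;               /* new base register value */
            next += p[i].size;
        }
    }

After compact() runs, each job's limit is unchanged but its base points at the
new location, which is why the add-and-compare scheme tolerates moving jobs
around.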



-- Hugh 

-----------------------------------------------------------------------------
1. On a machine of this speed, a comparison with a bounds value is _much_
   faster than a second memory reference (i.e., to the page table).

2. The page table lookup has to be done before the real memory reference.  The
   bounds check can be done in parallel with the real memory reference.

3. On a vector operation, it is possible to do the bounds checking for just
   the first and final addresses (see the sketch after this list).  Potentially,
   the page table lookup would have to be done for each address.

4. Cray crams as much circuitry as possible on their boards.  Adding circuitry
   to handle the paging probably would have meant giving up something else.
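
A minimal sketch of point 3 in C: for a strided vector reference of known
length, checking the two extreme addresses is enough, because every other
element lies between them.  The function and parameter names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    /* Check a whole strided vector reference against a base/limit region.
     *   first  = logical address of element 0 (in words)
     *   stride = signed distance between consecutive elements (in words)
     *   n      = number of elements
     *   limit  = region length (in words)
     * The addresses first, first+stride, ..., first+(n-1)*stride are
     * monotonic, so one pair of compares covers the whole vector, where a
     * paged machine would potentially need a lookup per element. */
    static bool vector_in_bounds(uint64_t first, int64_t stride,
                                 uint64_t n, uint64_t limit)
    {
        if (n == 0)
            return true;
        int64_t last = (int64_t)first + (int64_t)(n - 1) * stride;
        if (last < 0 || first >= limit)
            return false;                   /* ran outside the region */
        return (uint64_t)last < limit;      /* endpoints bound every element */
    }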

-- Kurt 

-----------------------------------------------------------------------------
chiueh@sprite.Berkeley.EDU (Tzi-cker Chiueh) writes:
> So why does Cray get rid of virtual memory altogether ?  Or does anybody 
> know how much performance improvement can we gain from getting rid of VM ?


I suggest you see the IEEE proceedings from the Supercomputing conference 
that was held last month in New York.  Cray published an article in these 
proceedings that describes their memory architecture and gives clock 
timings for current and future memory architectures.

Summary of what follows:
- Memory speed is THE supercomputing bottleneck.
- Cray can fetch from memory in 17 cycles.  Demand paging would
lengthen this time significantly.
- Virtual memory trades speed for money.  Supercomputers do not compromise 
on speed.
- Cray Y-MP/8s have 4 gigabyte per second memory bandwidths.
- Supercomputing working sets and problem sizes tend to be equal.
- Demand paging would complicate an already very complicated instruction 
scheduler.

Memory speed is THE bottleneck in supercomputing.  It is what makes Cray 
king of the hill.  The Japanese have faster peak CPU speeds, but their 
memory bandwidths are inferior.  This is a key reason why Cray machines 
are the fastest computers available for most production benchmarks (with 
notable exceptions).

The number of cycles needed to transfer the first word from memory to a 
register is one of the most critical timings in the supercomputer.  Cray 
can do this in 17 cycles.  An SX3 requires 70 cycles.  An ETA 10 needed 
hundreds of cycles.  Adding demand paging would significantly lengthen this 
fetch time.  If you can add demand paging without adding cycles to this 
memory fetch time, then I am sure Cray will make you a rich person.

Supercomputers with virtual memories have been tried.  The CDC 205 and the 
ETA10 are examples.  When these machines ran codes where the problem size 
exceeded the RAM size (i.e., when they paged), they ran 10 times slower than 
when paging did not occur.  

Virtual memory is a technique of trading time for money.  Virtual memory 
costs less than real memory, but is slower.  Slower memory is not an 
option for supercomputing.   Witness the success of Cray and the demise of 
ETA.

The Cray achieves two words read and one word written per clock per CPU.  
On a Y-MP/8 this is a memory bandwidth of 4 gigabytes per second.  Disk 
bandwidths are not adequate to keep up with this type of demand.

The theory of virtual memory depends on the working set being smaller than 
the problem size.  In most supercomputer applications the working set is the 
problem size.  I am sure the architecture of these applications was 
influenced by programming for real-memory machines, so this is somewhat of 
a circular argument.  However, for the status quo, this is true.

Crays are vector machines with extremely sophisticated instruction 
schedulers.  A Cray often has several instructions issued at once in the 
same CPU.  X-MPs and Y-MPs scoreboard conflicts between instructions and 
are able to compensate for bank and section memory delays.  These delays 
tend to be one to four cycles.  The instruction scheduler architecture 
would be even more difficult if it had to account for page-fault delays of 
many thousands of cycles.  One approach to this problem would be to require 
the compilers to never allow a vector sub-section to cross a page 
boundary.  
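
As a sketch of that last suggestion, here is what such strip-mining could look
like in C: the generated loop is broken into chunks that never span a
(hypothetical) page boundary, so a page fault can only happen between strips,
when no vector operation is in flight.  The page size, the helper, and the
assumption that x and y share the same offset within a page are illustrative,
not Cray or compiler specifics.

    #include <stddef.h>

    #define PAGE_WORDS 512   /* hypothetical page size, in 8-byte words */

    /* y[i] += a * x[i], strip-mined so that no strip crosses a page
     * boundary of x (and, by the alignment assumption, of y either). */
    static void saxpy_strips(double *y, const double *x, double a, size_t n)
    {
        size_t i = 0;
        while (i < n) {
            /* words remaining before the next page boundary of &x[i] */
            size_t off = ((size_t)(x + i) / sizeof *x) % PAGE_WORDS;
            size_t to_boundary = PAGE_WORDS - off;
            size_t len = (n - i < to_boundary) ? n - i : to_boundary;
            for (size_t j = 0; j < len; j++)    /* one page-contained strip */
                y[i + j] += a * x[i + j];
            i += len;
        }
    }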

-- Kent 



--------------------------------------------------------------------------------
Saw your information request about Crays, and thought that I might be
able to point you towards some useful information:

I suggest that you check up on Control Data's Cyber 180-series
(currently Cyber 2000-series) machines - they are a full hardware
Multics implementation, and have some truly "unique" virtual memory
hardware. I can personally vouch that the address translation
hardware, which also does access control checking, is VERY fast,
and it has several more levels of indirection than most
other folks' virtual memory architectures. Cyber 180 is such a
complete Multics that there is actually NO REAL MEMORY ADDRESSING
MODE. It is NOT POSSIBLE to access memory by real memory address, the
hardware doesn't have the capability!
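
For contrast with the Cray's bare bounds check, here is a toy C sketch of the
kind of translate-plus-access-check step a descriptor-based VM machine does.
The descriptor layout and field names are invented for illustration; they are
not the actual Cyber 180 formats.

    #include <stdbool.h>
    #include <stdint.h>

    enum access { ACC_READ = 1, ACC_WRITE = 2, ACC_EXEC = 4 };

    /* Invented segment descriptor: translation and protection live in the
     * same structure, so the access-right check costs nothing extra once
     * you are translating anyway. */
    struct segment {
        uint64_t base;     /* physical base of the segment */
        uint64_t length;   /* segment length, in words */
        unsigned rights;   /* bitwise OR of enum access */
    };

    /* Translate (segment, offset) and enforce access rights in one step.
     * Returns false on a bounds or protection violation. */
    static bool seg_translate(const struct segment *seg, uint64_t offset,
                              enum access want, uint64_t *phys)
    {
        if ((seg->rights & want) != want)   /* e.g. a write to read-only data */
            return false;
        if (offset >= seg->length)          /* bounds check, as on the Cray */
            return false;
        *phys = seg->base + offset;
        return true;
    }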

It is also interesting that when a Cyber 180 is emulating Cyber 170
mode, it ALSO has base/limit register hardware in operation, since the
170 architecture is real-memory, and only has base/limit restrictions.
When a Cyber 180 is running in 170 mode, it really is running a
virtual real-memory machine on its virtual memory hardware (just
saying this makes my mind feel like a pretzel).

If nothing else, the CDC stuff should make interesting counter-culture
reading material for you. It was/is truly different.

I also suspect that in the Crays (although I have never read the
hardware prints of a Cray, only the CDC machines), the bounds checking
is being done on the VIRTUAL address, as it were, not the real memory
address. This method allowed the old CDC machines (the ones Seymour
Cray designed) to do their access checking in the CPU, not the memory
controller, and thus kill off the reference earlier in the
instruction.
 
-- Gregory 



----------------------------------------------------------------------------
> Furthermore, this check is done for EVERY reference. 
>If this is indeed the case, this protection check process should be as 
>expensive as address mapping in machines that have VM. 

Why do you assume this? Given that the latency of Cray memory is 4
cycles or so, the check can be done after the address is sent off to
memory and can generate a fault before the data gets back.

>So why does Cray get rid of virtual memory altogether ?

Well, many supercomputer applications can't page and have to swap. In
that case, why provide VM?

-- greg





In article <1990Dec19.181343.10365@agate.berkeley.edu> you write:
 > The kind of protection I have in mind is access right control (e.g., read-only)
 > "Normal virtual memory systems" perform this kind of protection check while 
 > doing logical-physical address mapping. The protection bits are either in page 
 > tables or TLB.  Now, since Cray doesn't have virtual memory, the question is 
 > does it provide access control, if so, where does it put this check ?
The Cray does not provide extensive access control.  For each running program
a (contiguous) part of actual memory is mapped to the logical address space
of the program (which starts at 0).  With each reference the logical address
is compared to the logical bounds register, and the base register is added
to it before going to memory.
 > From the previous responses, it seemed that Cray only provides out-of-bound
 > protection check. Furthermore, this check is done for EVERY reference. 
 > If this is indeed the case, this protection check process should be as 
 > expensive as address mapping in machines that have VM. 
Clearly this is much less expensive than true VM; only two registers are needed
to do everything (address translation and bound checking), and those two
registers reside directly in the CPU.
 > So why does Cray get rid of virtual memory altogether ?  Or does anybody 
 > know how much performance improvement can we gain from getting rid of VM ?
This is much less expensive because check and translation go on in parallel
within a single clock cycle.

-- dik 


mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (01/11/91)

> On 10 Jan 91 23:07:15 GMT, chiueh@sprite.Berkeley.EDU (Tzi-cker Chiueh) said:

chiueh> So why does Cray get rid of virtual memory altogether ?  Or
chiueh> does anybody know how much performance improvement can we gain
chiueh> from getting rid of VM

kent> The number of cycles needed to transfer the first word from
kent> memory to a register is one of the most critical timings in
kent> the supercomputer.  Cray can do this in 17 cycles.  An SX3
kent> requires 70 cycles.  An ETA 10 needed hundreds of cycles.
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
This is/was a common misconception.  The ETA-10 actually attained
vector startup times as short as about 23 cycles if the first pages
of all the operands were in real memory.  This includes both the time
required to get the first element from memory into the pipe and the
3-4 more cycles needed to get the pipe filled.  Results were then available
from the first operation on about the 24th cycle (if the result was to
be immediately reused) or about the 30th cycle if the result had to go
all the way back to memory.  The startup time varied between about 16
and 32 cycles depending on whether the memory banks were aligned and
whether or not operations were being chained (in which case there were
two pipes to fill, not just one).

On a number of test loops, the ETA-10 was significantly *faster* on
short vector operations than the 8.5 ns Cray X-MP.  This did not
typically mean that short-vector *application codes* ran faster on the
ETA-10, though....  :-(

kent> Adding demand paging will significantly lengthen this cycle
kent> time.  If you can add demand paging without adding cycles to
kent> this memory fetch time, then I am sure Cray will make you a
kent> rich person.

ETA/CDC did it, and it certainly did not make them rich!  I believe
that the two Cray companies simply decided that the benefits of VM
were not worth the hassle.  So far the market has proven them right.

kent> Supercomputers with virtual memories have been tried.  The CDC
kent> 205 and the ETA10 are examples.  When these machines ran codes
kent> where the problem size exceeded the RAM size (paging), they ran
kent> 10 times slower than when paging did not occur.

This is hardly surprising.  Anyone with any experience at all realizes
that VM is to be used to make a small class of jobs much easier to
code by letting the hardware handle the large address space -- *not*
just to run larger-than-real-memory jobs.  It should be noted that it
is possible to write jobs that are larger than real memory but which
do not slow down significantly in a VM system.  One application was a
straightforward LU decomposition of a 2000x2000 dense matrix.  Only
about 2 million words were available to the user on the machine, while
the problem required 4 million words of virtual space.  By using a block-mode
algorithm and the best ETA UNIX swapping code, our CDC applications
specialist was able to get nearly full performance on this problem.
The advantage relative to the Cray was that on the ETA it could be
done in standard Fortran, while the Cray would have required explicit
I/O. 
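
For concreteness, here is a minimal, unpivoted sketch in C of the block-mode
idea: the matrix is factored one column panel at a time, so only the current
panel and block-sized pieces of the trailing matrix need to be resident at any
moment, which is what lets paging (or explicit I/O) keep up.  The block size
and the lack of pivoting are simplifications for illustration; this is not the
code the CDC applications specialist actually used.

    #include <stddef.h>

    #define N  2000   /* matrix order, as in the example above */
    #define NB 64     /* panel width -- an assumed tuning knob */

    /* In-place blocked LU factorization of a dense N x N matrix, no pivoting.
     * a[i][j] is overwritten by L (below the diagonal, unit diagonal implied)
     * and U (on and above the diagonal). */
    static void lu_blocked(double a[N][N])
    {
        for (size_t k = 0; k < N; k += NB) {
            size_t kb = (N - k < NB) ? N - k : NB;

            /* 1. Unblocked LU of the tall panel a[k..N-1][k..k+kb-1]. */
            for (size_t j = k; j < k + kb; j++)
                for (size_t i = j + 1; i < N; i++) {
                    a[i][j] /= a[j][j];              /* multiplier l(i,j) */
                    for (size_t l = j + 1; l < k + kb; l++)
                        a[i][l] -= a[i][j] * a[j][l];
                }

            /* 2. U12 = inverse(L11) * A12: forward substitution against the
             *    unit-diagonal triangle just produced. */
            for (size_t j = k + kb; j < N; j++)
                for (size_t i = k; i < k + kb; i++)
                    for (size_t l = k; l < i; l++)
                        a[i][j] -= a[i][l] * a[l][j];

            /* 3. Trailing update A22 -= L21 * U12: the bulk of the work,
             *    and a regular block access pattern that pages well. */
            for (size_t i = k + kb; i < N; i++)
                for (size_t j = k + kb; j < N; j++)
                    for (size_t l = k; l < k + kb; l++)
                        a[i][j] -= a[i][l] * a[l][j];
        }
    }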
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET


dik@cwi.nl (Dik T. Winter) (01/18/91)

In article <DREIER.91Jan13223533@husc9.harvard.edu> dreier@husc9.harvard.edu (Roland Dreier) writes:
 > I guess one thing that should be noted about how Crays handle memory: under
 > Unicos, since it lacks virtual memory, whole jobs get swapped.  What this
 > means is that if I'm running two jobs on a 256 megaword machine and each of
 > them wants 129 megawords of memory, every time a job gets swapped out, I
 > have to do 129 megawords of IO, rather than just the small amount of overlap
 > (129 + 129 - 256 = 2 megawords) that I "really" have to do; so clearly, even a
 > slightly slower virtual memory
 > system would be at an advantage here.
 > 
But this is not a defect of the lack of VM; it is a defect of the OS.
If the OS wants to swap out the complete job it can do so, but it is
not necessary.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl