[net.arch] Cyber 205

aeh@h.cc.purdue.edu (Dale Talcott) (09/08/86)

This is not directly related to the discussion about large main
memories, but I thought I would try to supply some information
about the CDC Cyber 205, since we have one of these beasties.  The
context is the comparison between the Cray, which does not use virtual
memory, and the 205, which does.

The hardware part of virtual memory on the 205 is implemented using a
two-level lookup table to translate virtual page addresses to physical
pages.  The first level is a 16-entry set of "associative registers"
(ARs) which can do this mapping in 1 cycle.  These are similar to the
"translation lookaside buffer" on other virtual memory machines, except
smaller.  The second level is the "space table", which has an entry for
each occupied real memory page (plus some dummy entries used by the
system).  This table resides at a fixed address in real memory and its
first 16 entries are loaded into the associative registers.  That is,
the ARs cache the first 16 entries of the space table.

When a memory reference cannot be mapped using the associative
registers, the affected instruction is suspended, the ARs are stored
into the start of the space table, the space table is searched for the
mapping at a rate of two entries per cycle until the match is found,
the match is moved to the first entry in the space table, the ARs are
reloaded from the space table, and the instruction is resumed.  (If a
mapping is not found, this constitutes a page fault, which I am
ignoring.)  The ARs need to be stored and reloaded because they are
kept in LRU order by the hardware, and thus rapidly get out of sync
with the part of the space table they are caching.  The overhead for
stopping, saving, reloading, and restarting is about 80 cycles.
(Choke!)
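
A toy model of the lookup just described, in Python, may make the
sequence easier to follow.  It is a sketch under assumptions, not the
205's actual logic: the class and method names are invented for
illustration, the ARs are modelled as the first 16 entries of a single
list, and the search is assumed to begin just past the entries the ARs
already cover.  The post does not say where the search starts.

```python
AR_COUNT = 16        # associative registers (from the post)
MISS_OVERHEAD = 80   # approx. cycles to stop, save, reload, restart (from the post)
SEARCH_RATE = 2      # space table entries examined per cycle (from the post)


class PageFault(Exception):
    """No space table entry maps the requested virtual page."""


class SpaceTable:
    """Toy model: entries[0:AR_COUNT] stand in for the ARs."""

    def __init__(self, mappings):
        # One (virtual_page, physical_page) pair per occupied real page.
        self.entries = list(mappings)

    def translate(self, vpage):
        """Return (physical_page, cycles) for one memory reference."""
        for i, (v, p) in enumerate(self.entries):
            if v != vpage:
                continue
            # The most recently used entry moves to the front (LRU order).
            self.entries.insert(0, self.entries.pop(i))
            if i < AR_COUNT:
                return p, 1  # AR hit: one cycle
            # Assumption: the search begins just past the AR-backed entries.
            searched = i - AR_COUNT + 1
            search_cycles = -(-searched // SEARCH_RATE)  # ceiling division
            return p, MISS_OVERHEAD + search_cycles
        raise PageFault(vpage)


table = SpaceTable((v, v + 100) for v in range(64))
print(table.translate(0))    # hit in the ARs
print(table.translate(40))   # miss: overhead plus search
print(table.translate(40))   # now at the front, so a hit
```

The second lookup of page 40 is cheap because the match was moved to
the first entry, which is the behaviour the post describes.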

When running in monitor mode, all memory references are physical
references, but take just as long to execute.  I suspect, but have not
checked the prints to be sure, that monitor mode just forces the ARs to
always respond with the identity map.

With that as background,

1) Whatever programming technique you use on a Cray to fit ten pounds
of data into 5 pounds of real memory will also work on the 205.  That
is, if your program can be run on a Cray with X Mwords of memory, it
will run on a 205 with X Mwords of memory WITHOUT FAULTING.

2) Monitor mode on the 205 disables external interrupts, so I seriously
doubt that any sites are running their 205's "with virtual memory
disabled".

3) Nonetheless, I ran some timing tests to determine the cost of space
table searches (which would not happen in monitor mode, due to the
faked identity map).  Using as a test case a vector add of two arrays
into a third array, each array one third the size of available real
memory, I came up with a cost of 0.48% with 4 Mwords and 1.5% for 32
Mwords (both assuming a page size of 64k words, which is the largest
available on the 205).  Over the course of a day, the latter amounts to
about 22 minutes.

The previous example has about the best AR hit rate for code which
still references every word of real memory just once.  If the code were
a gather/scatter which touched only one word per page, the cost would
be a whopping 32800% for a 32 Mword system.

Fortunately, most codes have better locality of reference than either
of these, so there is little incentive to run a 205 as a single user,
real memory system.
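
The figures in point 3 can be reproduced with a simple cost model.
This is a reconstruction, not something the post states: it assumes
every miss scans all space table entries past the 16 ARs, that the
vector add misses once per page and otherwise produces one result per
cycle, and that a one-word-per-page reference would otherwise take one
cycle.  All function names are made up for the sketch.

```python
AR_COUNT = 16            # associative registers (from the post)
MISS_OVERHEAD = 80       # cycles to stop, save, reload, restart (from the post)
SEARCH_RATE = 2          # space table entries searched per cycle (from the post)
PAGE_WORDS = 64 * 1024   # largest page size (from the post)
MWORD = 2 ** 20          # assumption: binary megaword


def pages(mwords):
    """Number of 64k-word pages in a memory of the given size."""
    return mwords * MWORD // PAGE_WORDS


def miss_cycles(mwords):
    # Assumption: worst case, every miss scans all entries past the ARs.
    return MISS_OVERHEAD + (pages(mwords) - AR_COUNT) / SEARCH_RATE


def vector_add_cost_percent(mwords):
    # Assumption: one miss per page, one result per cycle otherwise.
    elements = mwords * MWORD / 3  # each array is one third of memory
    return 100 * pages(mwords) * miss_cycles(mwords) / elements


def one_word_per_page_cost_percent(mwords):
    # Assumption: each reference would otherwise take one cycle.
    return 100 * miss_cycles(mwords)


for m in (4, 32):
    print(f"{m:2d} Mwords: vector add {vector_add_cost_percent(m):.2f}%")
print(f"32 Mwords: one word per page {one_word_per_page_cost_percent(32):.0f}%")
print(f"1.5% of a day: {24 * 60 * 0.015:.1f} minutes")
```

Under these assumptions the model gives 0.48% and 1.50% for the vector
add and 32800% for the one-word-per-page case, matching the numbers
quoted above.  1.5% of a 1440-minute day is 21.6 minutes.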

4) Note that the previous test compares the 205 with itself.  That is,
from the results we cannot determine how much faster the 205 would run
if it had no virtual memory at all, only how much faster it would run
if we did not use what is there.  The 205, like the Cray, is very
parallel, so it is possible that the AR mapping takes place at the same
time as other functions and ends up adding nothing to the total
execution time of memory reference instructions which find a hit in the
ARs.  (Again, I haven't checked this against the prints or microcode.)

5) The 205's ancestor, the STAR-100, had at most 1 Mword of memory, and
thus all real memory could be mapped with just the ARs (if you used
only the large page size).  It would have been nice if CDC could have
kept this property as they increased the size of central memory.
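
The arithmetic behind point 5 can be checked quickly.  The sketch below
assumes the STAR-100 had the same 16 ARs and the same 64k-word large
page as the 205, which the post implies but does not state.  The
function names are hypothetical.

```python
AR_COUNT = 16                  # associative registers (from the post)
LARGE_PAGE_WORDS = 64 * 1024   # assumption: same large page as the 205
MWORD = 2 ** 20                # assumption: binary megaword


def ar_reach_words(page_words=LARGE_PAGE_WORDS):
    """Words of memory the ARs can map without a space table search."""
    return AR_COUNT * page_words


def fraction_covered(mwords):
    """Share of a real memory of `mwords` Mwords that the ARs can map."""
    return min(1.0, ar_reach_words() / (mwords * MWORD))


for m in (1, 4, 32):
    print(f"{m:2d} Mwords: ARs cover {100 * fraction_covered(m):.3f}%")
```

With 16 ARs and 64k-word pages the reach is exactly 1 Mword, so a
1 Mword machine is fully covered, while the coverage drops to a quarter
at 4 Mwords and about 3% at 32 Mwords.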

(My .signature is copied from someone else's [Hi Tom] so I can't guarantee
I am really reachable where it says, nor even that it gets included.  So
far as I can tell, news is voodoo.)

-- 
Dale Talcott                            Systems programmer
ARPANET: aeh@j.cc.Purdue.EDU            Purdue University Computing Center
 BITNET: AEH@PURCCVM                    Mathematical Sciences Bldg.
 USENET: aeh@pucc-j.UUCP                West Lafayette, IN 47907
  Phone: (317) 494-1787

jlg@lanl.ARPA (Jim Giles) (09/11/86)

In article <2993@h.cc.purdue.edu> aeh@h.cc.purdue.edu.UUCP (Dale Talcott) writes:
>...
>1) Whatever programming technique you use on a Cray to fit ten pounds
>of data into 5 pounds of real memory will also work on the 205.  That
>is, if your program can be run on a Cray with X Mwords of memory, it
>will run on a 205 with X Mwords of memory WITHOUT FAULTING.
>
>2) Monitor mode on the 205 disables external interrupts, so I seriously
>doubt that any sites are running their 205's "with virtual memory
>disabled".
>
Since the trip through the associative registers costs a machine cycle on
the 205, that means that every memory reference is delayed by at least that
amount.  (Actually, I suspect that the cost may be higher in that the
selection of the clock rate may have been partially driven by this memory
interface.)  Most memory references are, of course, overlapped by other
computation, so you might think that the extra clock would be negligible.
Unfortunately, as we all found out, a sizable part of execution time in
a 'real' application is inherently scalar.  And for this code, an extra
clock in the memory interface usually means an extra clock overall.  This
may be part of the explanation of the fact that Crays seem to outperform
Cybers even when small benchmarks indicate the opposite should be true.
(The other reason is the MUCH longer setup time for vector operations
on the Cyber, which can only be offset if the vector sizes are very long
on average.  The distribution of vector lengths within a computation is
not that simple, but it seems that there tend to be a lot of short vectors
in 'real' codes.)

I suspect that the main saving on the Cyber of going into monitor mode is
to prevent parts of your data from being swapped out when you don't want
them to be.  This prevents a page fault later.  As I said before, most of
the programmers I know who have run codes on Cybers prefer to have the
virtual memory off.