[comp.arch] Cray & Amdahl

pardo@june.cs.washington.edu (David Keppel) (07/22/88)

[ Comparisons of Amdahls, Crays, 68Ks, and an 11/750 ]
[ Cray 1Gb, Amdahl 256Mbyte ]

I believe (please correct me if I'm wrong ...) that the Cray's memory
limit is ~750Mword (* 8 bytes/word = ~6Gb) but that few machines have
anywhere near this much.

More to the point, I also believe that the Crays don't have virtual
memory (because it slows down the computer!) while the Amdahls do
(have virtual memory).

Relevant (really?) question: Does it make more sense to buy a little
bit of very fast memory and slow it down with virtual memory, or to
buy a whole bunch of fast physical memory and slow it down by putting
it farther away?  (Assume: $ is no problem).  Obviously the answer
depends on the access patterns (and dataset size) of the programs
being run.  I wonder if anybody has insight on this?


	;-D on  ( Registers for me )  Pardo

dik@cwi.nl (Dik T. Winter) (07/22/88)

In article <5342@june.cs.washington.edu> pardo@uw-june.UUCP (David Keppel) writes:
 > More to the point, I also believe that the Crays don't have virtual
 > memory (because it slows down the computer!) while the Amdahls do
 > (have virtual memory).
 > 
 > Relevant (really?) question: Does it make more sense to buy a little
 > bit of very fast memory and slow it down with virtual memory, or to
 > buy a whole bunch of fast physical memory and slow it down by putting
 > it farther away?  (Assume: $ is no problem).  Obviously the answer
 > depends on the access patterns (and dataset size) of the programs
 > being run.  I wonder if anybody has insight on this?
 > 
The major problem with virtual memory on vector machines is that you get
paging interrupts during the execution of an instruction.  The CDC 205
has virtual memory, and there are problems.

Let me explain a bit how it works on the 205.  The machine (of course)
maintains a page table, mapping virtual to real memory.  Of course you
do not want to interrupt a vector instruction if memory access crosses
a page boundary, so the machine has 16 associative registers that hold
the mapping entries for the 16 pages most recently accessed.  Whenever
a vector instruction crosses a page boundary to a page whose mapping
information is in the associative registers, the next page of real
memory is easily found, and the instruction continues without
interrupt (all translation etc. is done during buffering and unbuffering
the 205 performs in its pipes).  However, if the crossing is to a page
whose information is not in the associative registers, the mapping entry
has to be found in memory.  This involves interrupting the instruction,
draining the pipes, saving state, reading the mapping info and restarting
the instruction.  That takes a lot of time.  The 205 has two different
page sizes, large pages of 65536 words (8 bytes/word) and small pages
of (site selectable) 512, 2048 or 8192 words.  The number of
associative registers is 16, and these are shared amongst the jobs on
a system.  It appears that the selection of small page size is very
critical.  I have a small (~10 lines) program that runs twice as
fast on a 1-pipe 205 with small pages of 2048 words as on the
same machine with small pages of 512 words.  This is all due to the
page boundary crossing.  (Oh, we have also a single instruction that
takes 90 seconds to complete; too long for the timer to handle.)
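The associative registers Dik describes act, in essence, as a 16-entry TLB.
A toy model (hypothetical modern Python, written purely for illustration;
the LRU replacement policy is my assumption, while the 16-entry count and
the page sizes come from the article) shows how the small-page size sets
the interrupt count for one long sequential sweep:

```python
# Toy model of the 205's 16 associative mapping registers.  The
# 16-entry count and the page sizes come from the article; treating
# the registers as an LRU-replaced table is an assumption.

from collections import OrderedDict

def count_interrupts(addresses, page_size, n_regs=16):
    """One interrupt per access to a page whose mapping is not among
    the n_regs most recently used pages."""
    regs = OrderedDict()              # page -> None, kept in LRU order
    interrupts = 0
    for addr in addresses:
        page = addr // page_size
        if page in regs:
            regs.move_to_end(page)    # hit: refresh LRU position
        else:
            interrupts += 1           # miss: drain pipes, reload entry
            if len(regs) >= n_regs:
                regs.popitem(last=False)
            regs[page] = None
    return interrupts

# A sequential sweep over 65536 contiguous words:
print(count_interrupts(range(65536), 512))    # 128 pages -> 128
print(count_interrupts(range(65536), 2048))   # 32 pages  -> 32
```

Sequential access is the friendly case; any access pattern that rotates
through more than 16 pages would miss on every crossing, which is the
page-boundary behavior described above.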

So what this amounts to is that virtual memory requires address
translation.  This in turn requires page tables, and part of these
(or all?) need to be in very fast (associative) registers.
Associative, because there is no time to do a search if you do not
want to drain the pipes.

The Cray on the other hand addresses all its memory directly, so no
address translation is needed and no vector instruction interrupt.

Strangely enough, the Cray has a maximal vector length of 64, and all
instructions except load/store are through registers.  The 205 on
the other hand has only vector instructions that go from memory
to memory, and the maximal vector size is 65535.  So my arguments
above would imply that you could have a Cray with virtual memory,
but not a 205!

Another point about VM on vector processors: it makes you think
the machine is large enough to handle your problem, while it will
mostly be thrashing pages.  A theoretical example for a 205 with
1 Mwords of memory:  try to multiply two 1024*1024 matrices to
get a third.  The program will be accepted, and CP time will be
something like 60 seconds.  But the paging alone will take about a
year (disk access only, and the disks are fast).
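That one-year figure checks out on the back of an envelope (Python used
here as a calculator; the 30 ms per-fault disk time is my assumed
1988-era figure, while the matrix size and the 1 Mword memory are from
the article):

```python
# Back-of-envelope check of the "one year of paging" claim.

n = 1024
words = 3 * n * n       # three n x n matrices, one 8-byte word each
print(words)            # 3145728 words, in a 1 Mword machine

# Pessimal case for a naive row-major product: each inner-loop access
# down a column of B lands on a different page, and with the working
# set three times physical memory nearly all of them fault.
faults = n ** 3                    # about one fault per B access
seconds_per_fault = 0.030          # assumed 1988-era disk access time
years = faults * seconds_per_fault / (365 * 24 * 3600)
print(round(years, 1))             # about 1.0
```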
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax

dre%ember@Sun.COM (David Emberson) (07/22/88)

> 
> Relevant (really?) question: Does it make more sense to buy a little
> bit of very fast memory and slow it down with virtual memory, or to
> buy a whole bunch of fast physical memory and slow it down by putting
> it farther away?  (Assume: $ is no problem).

If money is no object, it makes sense to buy a ton of memory AND have VM.
This may be apocryphal, but I have been told that Seymour Cray, replying to
the question of why the Cray 1 did not have virtual memory, replied, "Because
I don't understand it."

With virtual caches, VM does not cause a performance penalty worth mentioning.
Even on some machines with physical caches, address translation can take place
in parallel with the cache tag access--thus no penalty.
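The parallel-translation trick works when the cache index is drawn
entirely from the page-offset bits, which are identical in the virtual
and physical address.  A quick sketch (Python for illustration; the
4 KB page and the cache geometry are made-up numbers, not any
particular machine's):

```python
# With a 4 KB page and a direct-mapped 4 KB cache of 32-byte lines,
# the set index uses only address bits 5..11, all inside the page
# offset, so it is the same before and after translation.

PAGE = 4096
LINE = 32
SETS = PAGE // LINE                # 128 sets; SETS * LINE <= PAGE

def cache_index(addr):
    return (addr // LINE) % SETS   # uses only page-offset bits here

virtual = 0x12345678
physical = (0xABCDE * PAGE) | (virtual % PAGE)       # same page offset
print(cache_index(virtual) == cache_index(physical)) # True
```

The tag comparison still needs the translated physical address, but the
TLB lookup and the set access can proceed in parallel, which is
Emberson's point.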

			Dave Emberson (dre@sun.com)

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (07/22/88)

In article <7588@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:

>The major problem with virtual memory on vector machines is that you get
>paging interrupts during the execution of an instruction.  The CDC 205
>has virtual memory, and there are problems.

I beg to differ.  I have seen no "problems" with virtual memory on
the Cyber 205, other than those arising from:
1)  The confusion that users sometimes experience when they have a
larger set of facilities to choose from, or,
2)  Problems which are exactly the same on Crays- namely, there is
never enough real memory for some people and their programs.

>the instruction.  That takes a lot of time.  The 205 has two different
>page sizes, large pages of 65536 words (8 bytes/word) and small pages
>of (site selectable) 512, 2048 or 8192 words.  The number of
>associative registers is 16, and these are shared amongst the jobs on
>a system.  It appears that the selection of small page size is very
>critical.  I have a small (~10 lines) program that will run 2 times

The Cyber 205 does have a "problem" because of its memory to memory
vector instruction set, as opposed to Cray's vector registers.  The
problem is/was that vector startup time was fairly long on the Cyber 205.
This problem appears to have been solved on the ETA-10 with better
overlap, etc., and short vector performance seems to be much better.
(Still the same memory to memory architecture.)  The "problem" with
small page sizes is actually an installation option intended to benefit
installations with a relatively small amount of main memory.  The
solution, if you have more memory, is to use a larger small page size.

>
>The Cray on the other hand addresses all its memory directly, so no
>address translation is needed and no vector instruction interrupt.

It is true that Cray vector instructions are atomic, and those on the
205 are restartable, but the context is saved quite efficiently on the
205.  A complete context switch actually takes fewer cycles on the 205
than it does on the Cray 1/X/Y-MP's, and many fewer than the Cray-2.
So, the assumption that virtual memory OR memory to memory vector
instructions cause long context switch times relative to Cray is not
correct.  As stated above, the "price" of the 205 architecture was
poor short vector performance and, of course, the extra hardware that
virtual memory consumes (a virtual memory MMU takes up a LOT more
space than a non-virtual MMU - I think it is worth it, but others
disagree...).  I note also that a benefit of the 205 architecture is
excellent long vector performance.

>to memory, and the maximal vector size is 65535.  So my arguments
>above would imply that you could have a Cray with virtual memory,
>but not a 205!

There is, in fact, no reason why a Cray type architecture can't have
virtual memory.  In fact, a number of vector machines built by other
companies have done approximately that.

The real debate, in my opinion, is not between virtual and non-virtual
(virtual won a long time ago, in my opinion- Cray is an anachronism
in this respect) but between a memory to memory pipeline and a vector
register architecture.  But these are only two of many proposed
possibilities, and some good performing machines have been built with
other architectures entirely.  None have been commercially
successful yet, but I would not assume that that will always be the 
case.



-- 
  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

sharma@mit-vax.LCS.MIT.EDU (Madhumitra Sharma) (07/23/88)

	The hardware complexity also goes up if the vector pipeline is to
be able to take interrupts in the middle of a vector instruction. This
increased complexity (most probably) implies a longer cycle time for the
machine, too. Cray wanted to make his pipelines as simple as possible so 
that he could run them as fast as possible. Therefore, he decided he would
not handle any interrupts in the middle of a vector instruction. Hence,
no virtual memory.


Madhu Sharma
sharma@xx.lcs.mit.edu

blu@hall.cray.com (Brian Utterback) (07/23/88)

In article <60952@sun.uucp> dre%ember@Sun.COM (David Emberson) writes:
>If money is no object, it makes sense to buy a ton of memory AND have VM.
>This may be apocryphal, but I have been told that Seymour Cray, replying to
>the question of why the Cray 1 did not have virtual memory, replied, "Because
>I don't understand it."

I have never heard anything of the kind.  Well, sort of the kind.  What he
did say was that the CDC machines used ones-complement arithmetic instead of
twos-complement because he did not understand it.  He went on to say that
he figured it out by the time he built the Cray-1, since it is twos-complement.
I tend to think that he was joking.  By the way, this is anecdotal rather
than apocryphal because the talk was video taped and I have seen it.

Crays continue to use only physical memory rather than virtual for one reason:
it's faster.  That's our charter: faster.
-- 
Brian Utterback     |UUCP:{ihnp4!cray,sun!tundra}!hall!blu |  "Aunt Pheobe, 
Cray Research Inc.  |ARPA:blu%hall.cray.com@uc.msc.umn.edu |  we looked like
One Tara Blvd. #301 |                                      |    Smurfs!"
Nashua NH. 03062    |Tele:(603) 888-3083                   |

ddb@ns.UUCP (David Dyer-Bennet) (07/23/88)

In article <5342@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) writes:
> Relevant (really?) question: Does it make more sense to buy a little
> bit of very fast memory and slow it down with virtual memory, or to
> buy a whole bunch of fast physical memory and slow it down by putting
> it farther away?  (Assume: $ is no problem).  
                     ^^^^^^^^^^^^^^^^^^^^^^^ 
  Obviously :-) the correct solution is to buy a whole bunch of VERY
fast physical memory.  A cache system (I consider virtual memory to be
essentially a caching system) is never as fast as an entire main
memory made out of that same technology.
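Dyer-Bennet's claim is just the average-access-time identity: a
hierarchy can only approach, never beat, a flat memory built entirely
from its fastest level.  (Python as a calculator; the latencies and
hit rate are invented illustrative numbers.)

```python
# Average access time of a cache hierarchy vs. a flat fast memory.
# t_fast, t_slow and the hit rate are invented illustrative numbers.

t_fast, t_slow = 10e-9, 100e-9

def hierarchy(hit_rate):
    return hit_rate * t_fast + (1 - hit_rate) * t_slow

print(hierarchy(1.00) == t_fast)   # True: only a 100% hit rate ties
print(hierarchy(0.95) > t_fast)    # True: any miss makes it slower
```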


-- 
	-- David Dyer-Bennet
	...!{rutgers!dayton | amdahl!ems | uunet!rosevax}!umn-cs!ns!ddb
	ddb@Lynx.MN.Org, ...{amdahl,hpda}!bungia!viper!ddb
	Fidonet 1:282/341.0, (612) 721-8967 hst/2400/1200/300

dik@cwi.nl (Dik T. Winter) (07/23/88)

In article <12174@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov.UUCP (Hugh LaMaster) writes:
 > In article <7588@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
 > >the instruction.  That takes a lot of time.  The 205 has two different
 > >page sizes, large pages of 65536 words (8 bytes/word) and small pages
 > >of (site selectable) 512, 2048 or 8192 words.  The number of
 > >associative registers is 16, and these are shared amongst the jobs on
 > >a system.  It appears that the selection of small page size is very
 > >critical.  I have a small (~10 lines) program that will run 2 times
 > 
 > The Cyber 205 does have a "problem" because of its memory to memory
 > vector instruction set, as opposed to Cray's vector registers.  The
 > problem is/was vector startup time was fairly long on the Cyber 205.
 > This problem appears to have been solved on the ETA-10 with better
 > overlap, etc., and short vector performance seems to be much better.
 > (Still the same memory to memory architecture.)  The "problem" with
 > small page sizes is actually an installation option intended to benefit
 > installations with a relatively small amount of main memory.  The
 > solution, if you have more memory, is to use a larger small page size.
 > 
Well, let me also disagree.  The factor of 2 I mentioned was from memory,
and not substantiated by fact; it is more like a factor of 30.  Anyhow,
here is what we experienced when the 205 was installed next door (1 Mword
of memory, small page size 512 words): our program, which was consuming
lots of time on the 205 (it used in total something like 1 CP-month),
ran twice as fast when the small page size was increased to 2048.  The original
runs gave no indication at all that something was wrong; we had typically
something like 250 page faults in a 1 hour run.  The problem was, the
program fitted in memory (with lots of memory to spare), but there were
not enough associative registers to cope with it, so every vector
instruction was interrupted at least 3 times for every 512 elements.
That is quite a lot if you know that vector startup time is 50 to 70 cycles.
The main problem is of course that there are not enough associative
registers (you ought to have enough to map all of real memory).  My
estimate, from experience: on a 1-pipe 205 set the small page size to
at least 2048, on a 2-pipe 205 use 8192; and what about on a 4-pipe 205?
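Dik's numbers can be turned into a rough cost model (illustrative
Python; the three-streams-per-instruction count follows from the
memory-to-memory architecture described above, while the per-crossing
restart framing is an approximation of mine):

```python
# A 205 memory-to-memory vector instruction reads two source streams
# and writes one result stream, so it crosses roughly one page
# boundary per full page traversed, per stream.

def restarts(vector_len, page_size, streams=3):
    # approximate: one crossing per full page traversed, per stream
    return streams * (vector_len // page_size)

n = 65535                    # the 205's maximum vector length
print(restarts(n, 512))      # 381 potential restarts
print(restarts(n, 2048))     # 93 potential restarts
```

With vector startup at 50 to 70 cycles and a pipe drain on every
crossing that misses the 16 associative registers, the gap between 381
and 93 potential restarts per maximal vector instruction is exactly the
kind of factor Dik measured.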
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax

jlg@beta.lanl.gov (Jim Giles) (07/23/88)

In article <12174@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> The real debate, in my opinion, is not between virtual and non-virtual
> (virtual won a long time ago, in my opinion- Cray is an anachronism
> in this respect) [...]

In general, this is true.  Most machines and applications are better
off with VM.  But I think Cray did the right thing for his market.
Most buyers of supercomputers have long since figured out how to do
memory management in software.  And, since these users know the data
usage patterns of their programs, their software VM is MUCH more
efficient than existing hardware can supply.  It's no coincidence
that many 205 users still run with VM turned off - their codes run
faster that way!

The problem is that hardware VM isn't flexible enough to deal with a
large variety of data usage patterns.  As a result, most VM machines
just do some variant of demand paging.  This is exactly the WRONG data
usage model for most large-scale scientific codes.  Providing more
sophisticated VM mechanisms would be more expensive and wouldn't really
help unless the user code is able to give 'hints' about the data usage
patterns.  But, if the user is required to give 'hints' in order to get
efficiency, he might as well do VM in software as he's always done
(figuring out what you need next is the hard part - actually reading it
in is easy).

Unless the hardware VM mechanism can look ahead far enough to avoid
page faults entirely (several hundred thousand instructions with the
current difference in disk and memory speed), it will never beat the
clever use of asynchronous I/O that more sophisticated users have been
doing for years.  Of course, if the hardware can somehow divine the
data usage patterns of the code automatically (a channeller perhaps? :-),
then it could maybe even beat the user's software VM.
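The "software VM" Giles describes is, at bottom, double buffering:
start the read of the next block asynchronously, compute on the current
one, then swap.  A minimal sketch (hypothetical Python; a thread stands
in for the asynchronous I/O calls a 1988 code would issue directly,
e.g. BUFFER IN on CDC machines):

```python
# Double-buffered read-ahead: overlap compute(block i) with the read
# of block i+1.  A thread stands in for an asynchronous I/O call.

import threading

def process_blocks(read_block, n_blocks, compute):
    pending = {}

    def prefetch(i):
        pending[i] = read_block(i)

    prefetch(0)                       # prime the first block
    for i in range(n_blocks):
        block = pending.pop(i)
        reader = None
        if i + 1 < n_blocks:
            reader = threading.Thread(target=prefetch, args=(i + 1,))
            reader.start()            # read-ahead runs during compute
        compute(block)
        if reader is not None:
            reader.join()             # block i+1 is now resident

# Demo with an in-memory "disk" of four blocks:
disk = [[j, j + 1] for j in range(0, 8, 2)]
result = []
process_blocks(lambda i: disk[i], len(disk), result.extend)
print(result)     # [0, 1, 2, 3, 4, 5, 6, 7]
```

The hard part, as Giles says, is knowing which block to name in the
prefetch; the overlap itself is mechanical.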

J. Giles
Los Alamos

seanf@sco.COM (Sean Fagan) (07/24/88)

In article <12174@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov.UUCP (Hugh LaMaster) writes:
>In article <7588@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>
>It is true that Cray vector instructions are atomic, and those on the
>205 are restartable, but the context is saved quite efficiently on the
>205.  A complete context switch actually takes fewer cycles on the 205
>than it does on the Cray 1/X/Y-MP's, and many fewer than the Cray-2.

My $0.02 worth:  by a complete context switch, I assume you mean something
which will save the vector registers?  The Cray, when it does a context
switch, will save 24 registers (8 address, 8 index/offset, 8 data), plus the
program counter, the starting address of relative word 0, and the limit of
the program's size (incidentally, this is, for some unknown reason, very similar
to the CDC Cyber 170 machines' context switch 8-)).  It does not save vectors
(understandable), which the operating system must then do (if it feels the
need; if all it's doing is OS type work, then there is probably no reason to
save them).  Since storing things is so *slow* (relatively speaking), you
try to avoid memory like the plague.  Again, incidentally, if Seymour designed
the Cray 1 et al as he did the Cybers, when the machine does a context
switch, the hardware starts storing the exchange package (described above)
and then reading it at the same time (the Cybers had a long wire into which
the signal would go; since there was some travel time, it could safely read
into the registers without worrying about whether or not the values were
done being saved), causing *very* fast context switches (without vector
registers, of course).

As a result, there are two tradeoffs between the two architectures:  Crays
use vector registers, which are a pain to load and store, but very fast for
multiple operations (and very RISCy, of course 8-)), while the 205's (and
ETA's) allow for larger vectors, with somewhat faster memory access.

>  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster

-- 
Sean Eric Fagan  |  "An Amdahl mainframe is lots faster than a Vax 750."
seanf@sco.UUCP   |      -- Charles Simmons (chuck@amdahl.uts.amdahl.com)
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

smryan@garth.UUCP (Steven Ryan) (07/24/88)

>The problem is that hardware VM isn't flexible enough to deal with a
>large variety of data usage patterns.  As a result, most VM machines
>just do some variant of demand paging.

VSOS on the 205 provides a Q5ADVISE call for asynchronous swap-in/swap-out.
Pretty arcane, but it's there.

The real battle is between a development shop with lots of interactive jobs and
a production shop which is dedicated to one program.

jlg@beta.lanl.gov (Jim Giles) (07/26/88)

In article <1070@garth.UUCP>, smryan@garth.UUCP (Steven Ryan) writes:
> VSOS on 205 provides an Q5ADVISE for an asynchronous swap in/swap out.
> Pretty arcane, but its there.

But, as I pointed out, if you have already worked out what you need next,
why not just do the asynchronous I/O yourself?  The hard part is figuring
out what to tell Q5ADVISE.  The part that Q5ADVISE does is just the I/O
initialization.

As I said, sophisticated users have been doing this stuff in software
for years.  What do they need hardware VM for?

J. Giles

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (07/26/88)

In article <7819@hall.cray.com> blu@hall.UUCP (Brian Utterback) writes:

>Crays continue to use only physical memory rather than virtual for one reason:
>it's faster.  That's our charter: faster.

Help!  I have looked in my trusty computer architecture books (latest is
5 years old) and can find very little on just how much complexity (number
of gates, real estate on or off chip, etc.) various architectural features
consume.  Now, I suppose that since the whole world is interested in "RISC"
these days, there must be a whole slew of books out there which give such
information so that correct trade-offs can be made.  I would like to know
how much space a 16 entry MMU consumes versus an adder with bounds checking.
And also, how many gates deep the critical path is for each.  And so on.
None of my hardware books have anything more than the cost of several
simple adders.  Any suggestions on more recent books that contain good
information of this type?


-- 
  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

aglew@urbsdc.Urbana.Gould.COM (07/26/88)

>As I said, sophisticated users have been doing this stuff in software
>for years.  What do they need hardware VM for?
>
>J. Giles

There aren't enough sophisticated users in the world. 
We have to sell computers to the unsophisticated ones, too.

gillies@p.cs.uiuc.edu (07/27/88)

In my undergrad systems course we learned to optimize a multi-level
memory design for speed, given a constant number of $$$.  We used:

1.  Paper & pencil
2.  Simple model of caching/paging hit ratio versus cache size (often
    some given linear function  hitRate = f(memorySize)).
3.  The price/performance of various kinds of memory, for each kind:
    a.  $/K
    b.  access time

For a 2-level memory system (main memory, cache), you could plot a
2-dimensional curve (main memory size versus cache size), then derive
the highest performance point on the curve.

Of course, this analysis is impossible if you don't know your
instruction mix and software paging patterns.  And if the customer
wants to expand main memory, he should probably expand the cache at
the same time (I think this is uncommon).  So I doubt many companies
pay attention to this analysis -- maybe it's mostly academic.

My point is that it's an optimization problem, which if
oversimplified, can even be handled with paper & pencil.  If not, then
it can probably be solved by nonlinear optimization methods.
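Gillies' exercise can be sketched numerically (Python; every figure
here, the prices, the latencies, and the hit-rate model
hitRate = f(size), is an invented illustration of the method, not real
hardware data):

```python
# Split a fixed budget between cache and main memory to minimize
# average access time.  All prices, latencies, and hit-rate curves
# are invented numbers standing in for measured data.

def avg_time(cache_kb, main_kb):
    h1 = min(0.98, 0.5 + 0.1 * cache_kb ** 0.5)   # cache hit rate
    h2 = min(1.0, main_kb / 8000.0)               # fraction of the
                                                  # working set resident
    t_cache, t_main, t_disk = 50e-9, 500e-9, 20e-3
    return h1 * t_cache + (1 - h1) * (h2 * t_main + (1 - h2) * t_disk)

def best_split(budget, cache_per_kb=10.0, main_per_kb=1.0):
    """Exhaustive search over cache sizes within the budget."""
    best = None
    for cache_kb in range(1, int(budget // cache_per_kb)):
        main_kb = (budget - cache_kb * cache_per_kb) / main_per_kb
        t = avg_time(cache_kb, main_kb)
        if best is None or t < best[0]:
            best = (t, cache_kb, main_kb)
    return best

t, cache_kb, main_kb = best_split(budget=10000.0)
print(cache_kb, main_kb)   # 24 9760.0: saturate the cache hit rate,
                           # spend the rest keeping the data resident
```

With these made-up curves the optimizer buys just enough cache to
saturate its hit rate and puts everything else into main memory, which
is the flavor of trade-off the paper-and-pencil exercise produces.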


Don Gillies, Dept. of Computer Science, University of Illinois
1304 W. Springfield, Urbana, Ill 61801      
ARPA: gillies@cs.uiuc.edu   UUCP: {uunet,ihnp4,harvard}!uiucdcs!gillies

jhm@cs.cmu.edu (Jim Morris) (07/28/88)

I heard him say he didn't understand VM at a talk at Livermore
in 1975.