[comp.arch] KM's vs. Supers

seanf@sco.COM (Sean Fagan) (01/07/90)

In article <39807@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>And looking just a little further ahead, an "R9000"
>(just making this up out of whole cloth) with a starting clock speed of
>100MHz, scaled up to 200 MHz by 1992 or 1993, could put Cray out of business.
>SGI will build scalar graphics workstations with half the power of the then
>current Crays at 1/100 the cost, and Ardent will do the vector version, with
>a similar price advantage.  An amusing idle speculation (?)

Ok.  I've just sat down with a list, and tried to figure something out;
hopefully, *somebody* can help me on this.  My "list" was a list of timings
for a CDC Cyber 170/760, running at 40MHz (thanks Brian!).  From experience,
I'd say that the list is correct (i.e., not a lie 8-)).  Excluding loads,
divides, and other memory references, average cycle count is between 2 and 3
clocks / instruction.
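A back-of-envelope sketch of the averaging described above (the per-instruction cycle counts here are invented placeholders, not the real Cyber 170/760 timings from the list):

```python
# Rough sketch of the average-CPI figure described above.  The cycle
# counts are made-up stand-ins, NOT the real Cyber 170/760 timings;
# loads, divides, and other memory references are excluded, as stated.
timings = {            # instruction class -> cycles on the 40 MHz clock
    "integer add": 2,
    "shift": 2,
    "boolean": 2,
    "fp add": 3,
    "fp multiply": 3,
}
avg_cpi = sum(timings.values()) / len(timings)
print(f"average: {avg_cpi:.1f} clocks/instruction")  # lands between 2 and 3
```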

Actually, all that was for background (and to plug Cybers and Seymour 8-)).
On with the point:  the Cyber architecture is 25 years old; I seriously doubt that any of
the Crays have *slower* performance (despite being 64-bit two's complement
instead of 60-bit one's complement, I have a high opinion of Seymour), yet,
according to Eugene Brooks, a 66MHz R6000 will outperform a Cray-2, which
runs at, what, 250MHz?

So, *how* does it do this?  The things I could come up with were:  the
Cray-2 has slower cycles than the Cyber, which is frightening; the larger
register count on the MIPS chip helps that much (possible, but I don't know;
that's why I'm asking); and / or the R6000 has more functional units than a
Cray-2.

So, now for speculation of my own.  Eugene has said that he thinks the
supercomputers of the future will be merely bunches of KM's running in
parallel; I'm not sure.  I think that Seymour (and his ilk) is definitely 
going to have to adopt some of the advantages of the KM's; I see no doubt of
that (otherwise, we *will* end up with thousands of KM's with vector
processors, and this might be best; but, then again, maybe not).  But which
ones?  Seymour has been using 8 registers per set for the last 25 or more
years (I don't know about the CDC 3x00 series); would more registers allow
for faster code to be generated, up to a certain point?  How about register
windows?

I guess part of what I'm saying, and asking, is this:  there is little
reason why a Cray *must* be slower than a MIPS chip, and, if nothing else,
there is more room on the Cray to put stuff directly in hardware (such as,
oh, a 1 cycle multiply, or, better yet, a 1 cycle divide 8-)).  What needs
to be done, and why hasn't it been done?

Sorry for the length, but this seems like a good discussion for this group
to get started on.

-- 
Sean Eric Fagan  | "Time has little to do with infinity and jelly donuts."
seanf@sco.COM    |    -- Thomas Magnum (Tom Selleck), _Magnum, P.I._
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (01/09/90)

In article <4328@scolex.sco.COM> seanf@sco.COM (Sean Fagan) writes:
>In article <39807@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:

>Ok.  I've just sat down with a list, and tried to figure something out;
>hopefully, *somebody* can help me on this.

>according to Eugene Brooks, a 66MHz R6000 will outperform a Cray-2, which
>runs at, what, 250MHz?

>So, *how* does it do this?


Once upon a time, I wrote 4 benchmarks which
showed that each of the following machines was faster:
IBM 3033, Cray-1/S, CDC Cyber 203, and CDC 7600.  I was able to do it by
knowing something about the weaknesses of each machine.

*It all depends on your applications.*  Eugene Brooks' application is unusually
hard on the Cray, it appears.  On the other hand, even for codes which aren't
so hard on the Cray, there is now a cost advantage in many cases for the KMs
even if the Cray is still much faster.

>processors, and this might be best; but, then again, maybe not).  But which
>ones?  Seymour has been using 8 registers per set for the last 25 or more
>years (I don't know about the CDC 3x00 series);

This is not really correct for the Crays.  You can't forget about the
second level scalar registers ("programmable cache") or the vector registers.

> would more registers allow

He already has a lot more scalar registers than
MIPSCo.  The better question is why the extra registers don't seem to
produce a gross advantage in cycles per instruction.

>for faster code to be generated, up to a certain point?  How about register
>windows?

Do register windows produce fewer loads and stores?  The results seem to
indicate that they don't make much difference.   Not that they seem to hurt, 
either.  They are, it seems, no big deal; just another design choice.

>I guess part of what I'm saying, and asking, is this:  there is little
>reason why a Cray *must* be slower than a MIPS chip, and, if nothing else,

The Cray is generally faster.  The question, rather, is whether it is enough
faster to justify the cost.  Also, don't forget that the Cray is still the fastest
data engine around.  More throughput than anybody else.  You might even
see a Cray used as a fileserver for a farm of Killer Micros someday :-)

>there is more room on the Cray to put stuff directly in hardware (such as,
>oh, a 1 cycle multiply, or, better yet, a 1 cycle divide 8-)).  What needs
>to be done, and why hasn't it been done?

****************************************

Another speculation: would superscalar instruction issue of Cray scalar
instructions be possible?  What are the conditions necessary for issue
of multiple instructions per cycle? 
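One way to state the minimal conditions: two consecutive instructions can issue in the same cycle only if the second neither reads the first's result nor writes the same destination, and they don't contend for a functional unit. A toy check (the register names, instruction encoding, and unit labels are all invented for illustration; WAR hazards and unit pipelining are ignored for simplicity):

```python
# Toy dual-issue check: a pair can issue in one cycle only if there is
# no read-after-write or write-after-write hazard between them and the
# two instructions use different functional units.  Encoding invented.
def can_dual_issue(i1, i2):
    raw = bool(i1["dst"] & i2["src"])    # i2 reads i1's result
    waw = bool(i1["dst"] & i2["dst"])    # both write the same register
    unit_conflict = i1["unit"] == i2["unit"]
    return not (raw or waw or unit_conflict)

add = {"dst": {"S1"}, "src": {"S2", "S3"}, "unit": "add"}
mul = {"dst": {"S4"}, "src": {"S5", "S6"}, "unit": "mul"}
dep = {"dst": {"S7"}, "src": {"S1"}, "unit": "mul"}  # uses add's result

print(can_dual_issue(add, mul))  # independent pair -> True
print(can_dual_issue(add, dep))  # RAW hazard -> False
```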

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

davec@proton.amd.com (Dave Christie) (01/09/90)

In article <4328@scolex.sco.COM> seanf@sco.COM (Sean Fagan) writes:
[Trying to understand how a 66MHz KM could possibly beat a 250MHz super]
>
>Ok.  I've just sat down with a list, and tried to figure something out;
>hopefully, *somebody* can help me on this.  My "list" was a list of timings
>for a CDC Cyber 170/760, running at 40MHz (thanks Brian!).  From experience,
>I'd say that the list is correct (i.e., not a lie 8-)).  Excluding loads,
                                                          ^^^^^^^^^^^^^^^
>divides, and other memory references, average cycle count is between 2 and 3
>clocks / instruction.
 [...]
>according to Eugene Brooks, a 66MHz R6000 will outperform a Cray-2, which
>runs at, what, 250MHz?
>
>So, *how* does it do this?  The things I could come up with were:  the

First of all, one simply can't exclude loads in coming up with a meaningful
cpi figure (divides, maybe).  And it is loads that probably make the most
difference in a KM/super comparison (specifically, loads that can't be
initiated well ahead of when the data is needed).  I don't know the memory
access time on a 760 off hand, but it will certainly be several 25ns clocks
(the 760 has no cache).  I'm willing to bet Cray-2 memory access is nowhere
near the same number of 4ns clocks; in fact I wouldn't be surprised if total
memory access time were longer on the Cray-2 (damn, I hate speculating
without the facts, but I think my point is still valid).  Considering that
loads tend to amount to 25-30% of instructions (if I may generalize), both
of these machines will spend a large amount of time simply waiting for
memory access (on certain codes) and since the memory access time doesn't
scale at 250/40, overall performance won't.  (Warning: this gross
generalization doesn't include other aspects such as branches and is 
merely for illustrative purposes.)  However, KMs, having caches, tend
to have much shorter effective memory access times (on many codes) and
so eliminate much of the time spent on loads, which could amount to 
something like 75% of the overall time spent on a problem.  So a 66MHz
KM could easily beat a 125MHz super (I believe the minimum instruction 
time on a Cray-2 is two cycles).  
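The arithmetic above can be sketched as a toy model (every latency and fraction below is an illustrative guess in the spirit of the paragraph, not a measured figure for either machine):

```python
# Toy model of the argument above: effective time per instruction is
# base compute cycles plus stall cycles on the fraction of instructions
# that are loads waiting on memory.  All numbers are illustrative.

def ns_per_instr(clock_mhz, base_cpi, load_frac, load_stall_cycles):
    cycle_ns = 1000.0 / clock_mhz
    cpi = base_cpi + load_frac * load_stall_cycles
    return cpi * cycle_ns

# Hypothetical super: 250 MHz (4 ns clock), 2-cycle minimum instruction,
# ~27% loads, each stalling many cycles on uncached memory.
super_t = ns_per_instr(250, 2.0, 0.27, 40)

# Hypothetical KM: 66 MHz (~15 ns clock), 1-cycle base, same load mix,
# but a cache cuts the average load stall way down.
km_t = ns_per_instr(66, 1.0, 0.27, 2)

print(f"super: {super_t:.1f} ns/instr, KM: {km_t:.1f} ns/instr")
```

With these made-up numbers the 66MHz machine comes out ahead per instruction, which is the point: performance doesn't scale with the 250/40 clock ratio when memory access time doesn't.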

Remember, this demonstration is only valid for scalar codes where loads 
cannot be initiated well ahead of time.  Which is precisely the sort of
thing that RISC architectures are optimized for.  Supers, on the other
hand, have a much different heritage - numeric codes with lots of
parallelism.  So super-bashing really isn't warranted; it's largely an
apples-to-oranges comparison.


------------
Dave Christie            My opinions only, not my employers'