[comp.arch] superscalar

brooks@maddog.llnl.gov (06/04/89)

There has been quite a lot of discussion on what a computer architecture must
have in it to be called a "superscalar."  I thought I would contribute some
real data to this discussion.  Last week, I had the chance to benchmark Intel's
i860 running at 33 MHz with an "alpha" compiler.  The compiler did not
take advantage of delayed branches yet, and did not use any of the dual or
pipelined mode instructions.  On a radiation transport Monte Carlo code,
which is something we routinely crunch on supercomputers like the Cray
machines, that wimpy little i860 with an alpha compiler outran the
Cray 1S by 10% or so.  I don't think that anyone, including myself, took the
marketing hype that showed a little Cray machine on top of the i860 chip
seriously.  I don't think that even Intel took it seriously.  At this point,
for applications which mesh well with a cache, it's not marketing hype.  Of
course, all the other microprocessor vendors are within 6 months or less of
obtaining the same performance goal.  The MIPS R3000 is probably within epsilon
of this performance level, and the rumored ECL RISC implementations from various
vendors coming down the pike must be truly impressive.

For those that might say one should have compared to the XMP or YMP, the XMP is
30% faster than the Cray 1S on this application, and the YMP is 50% faster yet.
With good compilers, the i860, particularly the announced 40 MHz part or the
rumored 50 MHz models, will be knocking on the door of the YMP pretty loudly.
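
To make the arithmetic in the two paragraphs above concrete, here is a rough
sketch that normalizes everything to the Cray 1S.  The three ratios are the
ones quoted above; reading "50% faster yet" as 50% over the XMP is an
assumption, and so is the hypothetical helper scaled_by_clock, which pretends
speed scales linearly with clock rate and ignores compiler improvements and
memory effects entirely.

```python
# Relative speeds on this one Monte Carlo code, normalized to the Cray 1S.
# The three ratios come from the posting; everything else is an assumption.
CRAY_1S = 1.00
I860_33 = 1.10 * CRAY_1S   # "outran the Cray 1S by 10% or so"
XMP = 1.30 * CRAY_1S       # "30% faster than the Cray 1S"
YMP = 1.50 * XMP           # "50% faster yet", read here as 50% over the XMP


def scaled_by_clock(speed_at_33, mhz):
    """Hypothetical helper: assume speed scales linearly with clock rate."""
    return speed_at_33 * mhz / 33.0


for mhz in (33, 40, 50):
    speed = scaled_by_clock(I860_33, mhz)
    print(f"i860 at {mhz} MHz: {speed:.2f}x Cray 1S, {speed / YMP:.2f}x YMP")
```

Under these assumptions the 33 MHz part lands at a bit over half of the YMP,
and a 50 MHz part at roughly 85% of it, before any gain from a better compiler.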

Needless to say, when the application starts missing cache (for any of the
microprocessors) the performance rapidly drops into a hole when compared to the
classic supercomputer.  The microprocessor vendors now need to learn the last
lesson in supercomputer architecture, which is getting adequate main memory
bandwidth.  Since interleaving memory chips with glue logic would raise cost
too much, the micro vendors need to get in close collaboration with the memory
chip vendors to get the interleaving done on the memory chips themselves.  This
may be a good way for the U.S. manufacturers to get back into the memory chip
biz.  Design your micro with interleave control on the chip and then design
memory chips that have a compatible arrangement, then don't tell the
foreign memory chip vendors about the micro/memory chip interface until you
get to market.  Interleaving on the memory chip is not a difficult thing to do;
one only has to decide that it is time to do it.
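
As a rough illustration of why interleaving matters for bandwidth, here is a
toy model, not a description of the i860 or of any real memory part.  It
assumes low-order interleaving (word address a lives in bank a % n_banks), a
fixed busy time per bank, and at most one access issued per cycle.  The
function name and all parameters are hypothetical.

```python
def cycles_for_stream(n_words, stride, n_banks, bank_busy):
    """Cycles to issue n_words accesses in a toy interleaved-memory model.

    Word address a lives in bank a % n_banks (low-order interleaving).
    A bank is busy for bank_busy cycles after each access; one access can
    be issued per cycle if its bank is free, otherwise the issue stalls.
    """
    free_at = [0] * n_banks            # cycle at which each bank is next free
    now = 0
    for i in range(n_words):
        bank = (i * stride) % n_banks
        now = max(now, free_at[bank])  # stall until the target bank is free
        free_at[bank] = now + bank_busy
        now += 1                       # the next access issues a cycle later
    return now


if __name__ == "__main__":
    print(cycles_for_stream(64, stride=1, n_banks=8, bank_busy=8))
    print(cycles_for_stream(64, stride=1, n_banks=1, bank_busy=8))
```

In this model, 64 unit-stride words across 8 banks with an 8-cycle busy time
issue in 64 cycles, while a single bank (or a stride of 8, which hits the same
bank every time) takes 505.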

Just in case the Intel marketing pukes might be tempted to use this posting
for their own purposes, please read the disclaimer below:

(C) Copyright 1989, by Eugene Brooks III, all rights reserved.
This posting is the personal opinion solely of the author, and does not
reflect the opinions of the U.S. Govt or the University of CA in any official
capacity.  This posting may be transmitted only on the USENET Newsgroup
comp.arch, for the purposes of stimulating technical discussion, and may be
excerpted for the purposes of further discussion on the USENET if the copyright
is left in place.  This posting may NOT be printed on paper, and may NOT be
used for product endorsement purposes.



brooks@maddog.llnl.gov, brooks@maddog.uucp

aglew@mcdurb.Urbana.Gould.COM (06/05/89)

>[Brooks]
>Needless to say, when the application starts missing cache (for any of the
>microprocessors) the performance rapidly drops into a hole when compared to the
>classic supercomputer.  The microprocessor vendors now need to learn the last
>lesson in supercomputer architecture, which is getting adequate main memory
>bandwidth.  Since interleaving memory chips with glue logic would raise cost
>too much, the micro vendors need to get in close collaboration with the memory
>chip vendors to get the interleaving done on the memory chips themselves.  This
>may be a good way for the U.S. manufacturers to get back into the memory chip
>biz.  Design your micro with interleave control on the chip and then design
>memory chips that have a compatible arrangement, then don't tell the
>foreign memory chip vendors about the micro/memory chip interface until you
>get to market.  Interleaving on the memory chip is not a difficult thing to do;
>one only has to decide that it is time to do it.

So, we're back to memory again. I hesitate to get involved, since we all 
heard Mark Johnson saying "Don't talk about it, buy it!", to us intellectual
weenies who can only talk about things - but I can only talk about it for
the moment, so here goes:

Q:  how many processor chip vendors will be willing to tie themselves
    tightly into a memory manufacturer?  Well, there are some companies
    who do both...  but are you going to risk customers refusing to
    buy your processor chip because they have to use your memory chips
    with it?
Prediction: people will be very slow to get into tightly coupled 
    processor/memory.  But when they do, the processor companies will 
    probably put out both custom memory and non-custom memory processor
    chips - probably by last stage customization of the die. Ditto memory.
    This will probably be suboptimal.

mat@uts.amdahl.com (Mike Taylor) (06/05/89)

In article <26356@lll-winken.LLNL.GOV>, brooks@maddog.llnl.gov writes:
>                          ...  On a radiation transport Monte Carlo code,
> which is something we routinely crunch on supercomputers like the Cray
> machines, that wimpy little i860 with an alpha compiler outran the
> Cray 1S by 10% or so.

I presume this code doesn't vectorize well (or at all?)  What % vector
is it on the Cray?  Just curious....
-- 
Mike Taylor                               ...!{hplabs,amdcad,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

brooks@vette.llnl.gov (Eugene Brooks) (06/06/89)

In article <14FU029t326G01@amdahl.uts.amdahl.com> mat@uts.amdahl.com (Mike Taylor) writes:
>In article <26356@lll-winken.LLNL.GOV>, brooks@maddog.llnl.gov writes:
>>                          ...  On a radiation transport Monte Carlo code,
>> which is something we routinely crunch on supercomputers like the XXXX
>> machines, that wimpy little XXX with an alpha compiler outran the
>> XXXX by 10% or so.
Please note that my posting contained a copyright notice which specifically
forbids excerpting such as this without including the copyright.  I don't
mind excerpting sections of the posting which did not contain any reference
to specific vendors, but this type of excerpt MUST include the copyright.

>I presume this code doesn't vectorize well (or at all?)  What % vector
>is it on the Cray?  Just curious....

To answer the question, the code vectorizes on the Cray only with extreme
recoding effort, which yields a factor of 5 in speed.  Parallel machines based on the
VLSI chips will still have a factor of 20 in cost and performance leverage
after this is taken into account.


brooks@maddog.llnl.gov, brooks@maddog.uucp

grunwald@flute.cs.uiuc.edu (Dirk Grunwald) (06/07/89)

If I remember correctly, electron transport code is separable, and ports well
to distributed memory multi-processors, right?

Intel has plans to produce a successor to the iPSC/2 based on the i860, and
finally using a reasonable I/O architecture. They plan to produce 2048-node
systems for DARPA.


--
Dirk Grunwald -- Univ. of Illinois 		  (grunwald@flute.cs.uiuc.edu)

brooks@vette.llnl.gov (Eugene Brooks) (06/09/89)

In article <GRUNWALD.89Jun7094931@flute.cs.uiuc.edu> grunwald@flute.cs.uiuc.edu writes:
>
>If I remember correctly, electron transport code is separable, and ports well
>to distributed memory multi-processors, right?
This was a photon transport code, and it evolved the interaction between photons
and atoms in an implicit manner.  The implicit coupling causes a linear system
to get created as the result of the Monte Carlo and this must be solved for
the atomic populations and resulting photon weights.  It's not as trivial to
parallelize as one might think, but the micro based machines have incredible
leverage.

>Intel has plans to produce a successor to the iPSC/2 based on the i860, and
>finally using a reasonable I/O architecture. They plan to produce 2048-node
>systems for DARPA.
I have no comment on this rumor.


brooks@maddog.llnl.gov, brooks@maddog.uucp

grunwald@flute.cs.uiuc.edu (Dirk Grunwald) (06/09/89)

In article <26641@lll-winken.LLNL.GOV> brooks@vette.llnl.gov (Eugene Brooks) writes:
  >Intel has plans to produce a successor to the iPSC/2 based on the i860, and
  >finally using a reasonable I/O architecture. They plan to produce 2048-node
  >systems for DARPA.
  I have no comment on this rumor.

--

I got this from Federal Computer Week, April 10, 1989, ``Intel Lands
DARPA Super Award''. DARPA gives $7.6M to Intel; Intel plows in $20M.
2048 i860 processors running at 60 to 80 MHz, giving a machine ``50
to 100 times faster than YMP.'' Many quotes from Rattner.

The 3 year project is called Touchstone.
--
Dirk Grunwald -- Univ. of Illinois 		  (grunwald@flute.cs.uiuc.edu)