mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (10/15/89)
In article <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov writes:
>mash@mips.com pointed out some important considerations in the issue
>of whether supercomputers as we know them will survive.  I thought
>that I would attempt to get a discussion started.  Here is a simple
>fact for the mill, related to the question of whether or not machines
>delivering the fastest performance at any price have room in the
>market.
>Fact number 1:
>The best of the microprocessors now EXCEED supercomputers for scalar
>performance and the performance of microprocessors is not yet stagnant.
>On scalar codes, commodity microprocessors ARE the fastest machines at
>any price and custom cpu architectures are doomed in this market.
>brooks@maddog.llnl.gov, brooks@maddog.uucp

This much has been fairly obvious for a few years now, and was made especially clear by the introduction of the MIPS R-3000 based machines at about the beginning of 1989.

I think that this point is irrelevant to the more appropriate purpose of supercomputers, which is to run long (or large), compute-intensive problems that happen to map well onto available architectures. Both factors (memory/time and efficiency) are important here. It is generally not necessary to run short jobs on supercomputers, and it is not cost-effective to run scalar jobs on vector machines.

On the other hand, I have several codes that run >100 times faster on the ETA-10G relative to a 25 MHz MIPS R-3000. Since I need to run these codes for hundreds of ETA-10G hours, the equivalent time on the workstation is over one year.

The introduction of vector workstations (Ardent & Stellar) changes these ratios substantially. The ETA-10G runs my codes only 20 times faster than the new Ardent Titan. In this environment, the important question is, "Can I get an average of more than 1.2 hours of supercomputer time per day?" If not, then the Ardent provides better average wall-clock turnaround.

It seems to me that the introduction of fast scalar and vector workstations can greatly enhance the _important_ function of supercomputers --- which is to allow the calculation of problems that are otherwise too big to handle. By removing scalar jobs and vector jobs of short duration from the machine, more resources can be allocated to the large calculations that cannot proceed elsewhere.

Enough mumbling....
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
                   mccalpin@scri1.scri.fsu.edu
                   mccalpin@delocn.udel.edu
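The 1.2-hours-per-day figure above is just the break-even arithmetic for a shared fast machine versus a dedicated slower one. A minimal sketch in C, taking the 20x ETA-10G vs. Ardent Titan ratio from the post and treating everything else as an assumption:

    /* Break-even turnaround: a shared supercomputer that is `speedup`
     * times faster than a dedicated workstation only wins on wall-clock
     * time if you can average more than 24/speedup hours of it per day.
     * The 20x ratio is the ETA-10G vs. Ardent Titan figure quoted above.
     */
    #include <stdio.h>

    int main(void)
    {
        double speedup   = 20.0;            /* supercomputer : workstation */
        double breakeven = 24.0 / speedup;  /* hours of super time per day */

        printf("break-even allocation: %.1f supercomputer hours/day\n",
               breakeven);                  /* prints 1.2 for speedup = 20 */
        return 0;
    }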
brooks@maddog.llnl.gov (10/15/89)
mash@mips.com pointed out some important considerations in the issue of whether supercomputers as we know them will survive. I thought that I would attempt to get a discussion started. Here is a simple fact for the mill, related to the question of whether or not machines delivering the fastest performance at any price have room in the market.

Fact number 1: The best of the microprocessors now EXCEED supercomputers for scalar performance and the performance of microprocessors is not yet stagnant. On scalar codes, commodity microprocessors ARE the fastest machines at any price and custom cpu architectures are doomed in this market.

brooks@maddog.llnl.gov, brooks@maddog.uucp
colwell@mfci.UUCP (Robert Colwell) (10/15/89)
In article <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov () writes:
>Fact number 1:
>The best of the microprocessors now EXCEED supercomputers for scalar
>performance and the performance of microprocessors is not yet stagnant.
>On scalar codes, commodity microprocessors ARE the fastest machines at
>any price and custom cpu architectures are doomed in this market.

I take my hat off to them, too, because that's no mean feat. But don't forget that the supercomputers didn't set out to be the fastest machines on scalar code. If they had, they'd all have data caches, non-interleaved main memory, and no vector facilities. What the supercomputer designers are trying to do is balance their machines to optimally execute a certain set of programs, not the least of which are the LLL loops. In practice this means that said machines have to do very well on vectorizable code, while not falling down badly on the scalar stuff (lest Amdahl's law come to call).

So while it's ok to chortle at how the micros have caught up on the scalar stuff, I think it would be an unwarranted extrapolation to imply that the supers have been superseded unless you also specify the workload.

And by the way, it's the design constraints at the heavy-duty, high-parallelism, all-functional-units-going-full-tilt-using-the-entire-memory-bandwidth end that make the price of the supercomputers so high, not the constraints that predominate at the scalar end. That's why I conclude that when the micro/workstation guys want to play in the supercomputer sandbox they'll either have to bring their piggy banks to buy the appropriate I/O and memory, or convince the users that they can live without all that performance.

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
31 Business Park Dr.
Branford, CT 06405     203-488-6090
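Colwell's Amdahl's-law point can be put in numbers. A minimal sketch in C, with made-up vectorized fractions and a made-up vector-unit speedup, showing how the scalar residue caps the overall gain:

    /* Amdahl's law: overall speedup when a fraction f of run time is
     * accelerated by a factor s.  The fractions and the 20x factor
     * below are illustrative assumptions, not figures from the thread.
     */
    #include <stdio.h>

    static double amdahl(double f, double s)
    {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void)
    {
        printf("90%% vectorized, 20x vector unit: %.2fx overall\n",
               amdahl(0.90, 20.0));      /* about 6.9x  */
        printf("99%% vectorized, 20x vector unit: %.2fx overall\n",
               amdahl(0.99, 20.0));      /* about 16.8x */
        return 0;
    }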
preston@titan.rice.edu (Preston Briggs) (10/15/89)
In article <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov () writes:
>The best of the microprocessors now EXCEED supercomputers for scalar
>performance and the performance of microprocessors is not yet stagnant.

Is this a fair statement? I've played some with the i860 and I can write (by hand so far) code that is pretty fast. However, the programs where it really zooms are vectorizable. That is, I can make this micro solve certain problems well; but these are the same problems that vector machines handle well.

Getting good FP performance from a micro seems to require pipelining. Keeping the pipe(s) full seems to require a certain amount of parallelism and regularity. Vectorizable loops work wonderfully well.

Perhaps I've misunderstood your intent, though. Perhaps you meant that an i860 (or Mips or whatever) can outrun a Cray (or Nec or whatever) on some programs. I guess I'm still doubtful. Do you have examples you can tell us about?

Thanks,
Preston Briggs
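For concreteness, the sort of loop Briggs is describing -- regular, independent iterations that keep a pipelined floating-point unit (or a vector unit) busy -- looks like this. An illustrative DAXPY kernel in C, not code from the thread:

    /* DAXPY: y = y + a*x.  Each iteration is independent, so loads,
     * multiplies and adds can be overlapped in the FP pipeline, or
     * issued as a single vector operation on a vector machine.
     */
    void daxpy(int n, double a, const double *x, double *y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = y[i] + a * x[i];
    }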
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (10/15/89)
Gordon Bell, in the September CACM (p. 1095) says, "By the end of 1989, the performance of the RISC, one-chip microprocessor should surpass and remain ahead of any available minicomputer or mainframe for nearly every significant benchmark and computational workload. By using ECL gate arrays, it is relatively easy to build processors that operate at 200 MHz (5 ns. clock) by 1990." (For those who don't know, Mr. Bell has his name on the PDP-11, the VAX, and the Ardent workstation.)

The big iron is fighting back, and that involves reducing their chip count. Once, a big cpu took ~10^4 chips: now it's more like 10^2. I expect it will shortly be ~10 chips. Shorter paths, you know.

I see the hot micros and the big iron meeting in the middle. What will distinguish their processors? Mainly, there will be cheap systems. And then, there will be expensive ones, with liquid cooling, superdense packaging, mongo buses, bad yield, all that stuff. Even when no multichip processors remain, there will still be $1K systems and $10M systems. Of course, there is no chance that the $10M system will be uniprocessor.

--
Don   D.C.Lindsay   Carnegie Mellon Computer Science
brooks@vette.llnl.gov (Eugene Brooks) (10/16/89)
In article <1081@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>So while it's ok to chortle at how the micros have caught up on the scalar
>stuff, I think it would be an unwarranted extrapolation to imply that the
>supers have been superseded unless you also specify the workload.

Microprocessor development is not ignoring vectorizable workloads. The latest have fully pipelined floating point and are capable of pipelining several memory accesses. As I noted, interleaving directly on the memory chip is trivial and memory chip makers will do it soon. Micros now dominate the performance game for scalar code and are moving on to vectorizable code. After all, these little critters mutate and become more voracious every 6 months and vectorizable code is the only thing left for them to conquer. No NEW technology needs to be developed; all the micro-chip and memory-chip makers need to do is to decide to take over the supercomputer market. They will do this with their commodity parts.

Supercomputers of the future will be scalable multiprocessors made of many hundreds to thousands of commodity microprocessors. They will be commodity parts because these parts will be the fastest around and they will be cheap. These scalable machines will have hundreds of commodity disk drives ganged up for parallel access. Commodity parts will again be used because of the cost advantage leveraged into a scalable system using commodity parts. The only custom logic will be the interconnect which glues the system together, and error correcting logic which glues many disk drives together into a reliable high performance system. The CM data vault is a very good model here.

NOTHING WILL WITHSTAND THE ATTACK OF THE KILLER MICROS!

brooks@maddog.llnl.gov, brooks@maddog.uucp
brooks@vette.llnl.gov (Eugene Brooks) (10/16/89)
In article <2121@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov () writes:
>>The best of the microprocessors now EXCEED supercomputers for scalar
>>performance and the performance of microprocessors is not yet stagnant.
>
>Is this a fair statement?  I've played some with the i860 and

Yes, in the sense that a scalar dominated program has been compiled for the i860 with a "green" compiler, no pun intended, and the same program was compiled with a mature optimizing compiler on the XMP, and the 40 MHz i860 is faster for this code. Better compilers for the i860 will open up the speed gap relative to the supercomputers.

>I can write (by hand so far) code that is pretty fast.
>However, the programs where it really zooms are vectorizable.

Yes, this micro beats the super on scalar code, and is not too sloppy for hand written code which exploits its cache and pipes well. The compilers are not there yet for the vectorizable stuff on the i860. Even if there were good compilers, the scalar-vector speed differential is not as great on the i860 as it is on a supercomputer. Of course, interleaved memory chips will arrive and microprocessors will use them. Eventually the high performance micros will take the speed prize for vectorizable code as well, but this will require another few years of development.

brooks@maddog.llnl.gov, brooks@maddog.uucp
brooks@vette.llnl.gov (Eugene Brooks) (10/16/89)
In article <6523@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>Gordon Bell, in the September CACM (p.1095) says, "By the end of
>1989, the performance of the RISC, one-chip microprocessor should
>surpass and remain ahead of any available minicomputer or mainframe
>for nearly every significant benchmark and computational workload.

It has already happened for SOME workloads, those which hit cache well and are scalar dominated. This was done without ECL parts. The ECL parts will only make matters worse for custom processors; as Bell indicates, they will let the one-chip micros dominate performance for all workloads.

>I see the hot micros and the big iron meeting in the middle. What
>will distinguish their processors?

Nothing.

>Mainly, there will be cheap
>systems. And then, there will be expensive ones, with liquid cooling,
>superdense packaging, mongo buses, bad yield, all that stuff. Even
>when no multichip processors remain, there will still be $1K systems
>and $10M systems. Of course, there is no chance that the $10M system
>will be uniprocessor.

The $10M systems will be scalable systems built out of the same microprocessor. These systems will probably be based on coherent caches, the micros having respectable on chip caches which stay in sync with very large off chip caches. The off chip caches are kept coherent through scalable networks. The "custom" value added part of the machine for the supercomputer vendor to design is the interconnect and the I-O system. The supercomputer vendor will still have a cooling problem on his hands because of the density of heat sources in such a machine.

brooks@maddog.llnl.gov, brooks@maddog.uucp
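As a rough illustration of the coherent-cache bookkeeping Brooks describes, here is a bare-bones three-state (MSI-style) cache-line transition in C. It is a sketch of the general idea only, not any of the protocols actually under study at the time:

    /* Minimal per-line coherence state machine (Invalid/Shared/Modified).
     * Real scalable designs are directory-based and far more elaborate;
     * this only shows the kind of state a coherent cache must track.
     */
    enum line_state { INVALID, SHARED, MODIFIED };

    struct cache_line {
        unsigned long   tag;
        enum line_state state;
    };

    /* local processor read: miss if invalid, otherwise a hit */
    int read_line(struct cache_line *l)
    {
        if (l->state == INVALID) {
            /* fetch from memory or a remote cache, join the sharers */
            l->state = SHARED;
            return 0;               /* miss */
        }
        return 1;                   /* hit  */
    }

    /* local processor write: must gain exclusive ownership first */
    void write_line(struct cache_line *l)
    {
        if (l->state != MODIFIED) {
            /* invalidate remote copies over the interconnect, then own it */
            l->state = MODIFIED;
        }
    }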
mike@thor.acc.stolaf.edu (Mike Haertel) (10/16/89)
In article <1081@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>I take my hat off to them, too, because that's no mean feat.  But don't
>forget that the supercomputers didn't set out to be the fastest machines
>on scalar code.  If they had, they'd all have data caches, non-interleaved
>main memory, and no vector facilities.  What the supercomputer designers

Excuse me, non-interleaved main memory? I've always assumed that interleaved memory could help scalar code too. After all, instruction fetch tends to take place from successive addresses. Of course if main memory is very fast there is no point to interleaving it, but if all you've got is drams with slow cycle times, I would expect that interleaving them would benefit even straight scalar code.

--
Mike Haertel <mike@stolaf.edu>
``There's nothing remarkable about it.  All one has to do is hit the
right keys at the right time and the instrument plays itself.''
	-- J. S. Bach
eric@snark.uu.net (Eric S. Raymond) (10/16/89)
In <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov wrote:
> The best of the microprocessors now EXCEED supercomputers for scalar
> performance and the performance of microprocessors is not yet stagnant.
> On scalar codes, commodity microprocessors ARE the fastest machines at
> any price and custom cpu architectures are doomed in this market.

Yes. And though this is a recent development, an unprejudiced observer could have seen it coming for several years. I did, and had the temerity to say so in print way back in 1986. My reasoning then is still relevant; *speed goes where the volume market is*, because that's where the incentive and development money to get the last mw-sec out of available fabrication technology is concentrated.

Notice that nobody talks about GaAs technology for general-purpose processors any more? Or dedicated Lisp machines? Both of these got overhauled by silicon microprocessors because commodity chipmakers could amortize their development costs over such a huge base that it became economical to push silicon to densities nobody thought it could attain.

You heard it here first: The supercomputer crowd is going to get its lunch eaten the same way. They're going to keep sinking R&D funds into architectural fads, exotic materials, and the quest for ever more ethereal heights of floating point performance. They'll have a lot of fun and generate a bunch of sexy research papers. Then one morning they're going to wake up and discover that the commodity silicon guys, creeping in their petty pace from day to day, have somehow managed to get better real-world performance out of their little boxes. And supercomputers won't have a separate niche market anymore.

And the supercomputer companies will go the way of LMI, taking a bunch of unhappy investors with them. La di da. Trust me. I've seen it happen before...

--
Eric S. Raymond = eric@snark.uu.net  (mad mastermind of TMN-Netnews)
seibel@cgl.ucsf.edu (George Seibel) (10/16/89)
In <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov wrote:
> The best of the microprocessors now EXCEED supercomputers for scalar
> performance and the performance of microprocessors is not yet stagnant.
> On scalar codes, commodity microprocessors ARE the fastest machines at
> any price and custom cpu architectures are doomed in this market.

Speaking of "commodities", I think a lot of people have lost sight of, or perhaps never recognized, something about the vast majority of supercomputers. They are shared. How often do you get a Cray processor all to yourself? Not very often, unless you have lots of money, or Uncle Sam is picking up the tab so you can design atomic bombs faster. As soon as you have more than one job per processor, you're talking about *commodity Mflops*. The issue is no longer performance at any cost, because if it was you would order another machine at that point. The important thing is Mflops/dollar for most people, and that's where the micros are going to win in a lot of cases.

George Seibel, UCSF
rpeglar@csinc.UUCP (Rob Peglar x615) (10/16/89)
In article <35825@lll-winken.LLNL.GOV>, brooks@maddog.llnl.gov writes:
> mash@mips.com pointed out some important considerations in the issue of whether
> supercomputers as we know them will survive.  I thought that I would attempt
> to get a discussion started.  Here is a simple fact for the mill, related to
> the question of whether or not machines delivering the fastest performance
> at any price have room in the market.
>
> Fact number 1:
> The best of the microprocessors now EXCEED supercomputers for scalar
> performance and the performance of microprocessors is not yet stagnant.
> On scalar codes, commodity microprocessors ARE the fastest machines at
> any price and custom cpu architectures are doomed in this market.
>
> brooks@maddog.llnl.gov, brooks@maddog.uucp

Brooks is making a good point here. By "this market", I assume he means the one defined above (as well as by mash) - to paraphrase, "the fastest box at any price". I'll let go what "fastest" and "box" mean for sake of easy discussion :-) Most of us, I hope, can fathom what price is.

Anyway, I agree with mash that there is - albeit small - a market for the machine with the highest peak absolute performance (pick your number, the most popular one recently seems to be Linpack 100x100 all Fortran, Dongarra's Table One). The national labs have proven that point for almost a generation. I believe that it will take at least one more generation - those who were weaned on machines from CDC, then CRI - before a more reasonable approach to machine procurement comes to pass. Thus, I disagree that there will *always* be a market for this sort of thing. Status symbols may be OK in cars, but for machines purchased with taxpayer dollars, the end is near. Hence, Brooks' "attack of the killer micros".

However, I do believe that there will always be a market for various types of processors and processor architectures. Killer scalar micros are finding wide favor as above. Vector supers and their offspring, e.g. the i860 and other 64-bit things, will always dominate codes which can be easily vectorized and do not lend themselves well to parallel computation. Medium-scale OTS-technology machines like Sequent will start (are starting) to dominate OLTP and RDBMS work, perfect tasks for symmetric MP machines. (Pyramid, too; hi Chris). Massively parallel machines will eventually settle into production shops, perhaps running one and only one application, but running it at speeds that boggle the mind.

It's up to the manufacturers to decide 1) which game they want to play 2) for what stakes 3) with what competition 4) for how long 5) etc. etc. etc. That's what makes working for a manufacturer such fun and terror at once.

Rob
------
mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (10/17/89)
In article <35896@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>Microprocessor development is not ignoring vectorizable workloads.  The
>latest have fully pipelined floating point and are capable of pipelining
>several memory accesses.  As I noted, interleaving directly on the memory
>chip is trivial and memory chip makers will do it soon.  [ ... more
>stuff deleted ... ]
>They will do this with their commodity parts.

It is not at all clear to me that the memory bandwidth required for running vector codes is going to be developed in commodity parts. To be specific, a single 64-bit vector pipe requires a sustained bandwidth of 24 bytes per clock cycle. Is an ordinary, garden-variety commodity microprocessor going to be able to use 6 32-bit words-per-cycle of memory bandwidth on non-vectorized code? If not, then there is a strong financial incentive not to include that excess bandwidth in commodity products....

In addition, the engineering/cost trade-off between memory bandwidth and memory latency will continue to exist for the "KILLER MICROS" as it does for the current generation of supercomputers. Some users will be willing to sacrifice latency for bandwidth, and others will be willing to do the opposite. Economies of scale will not eliminate this trade-off, except perhaps by eliminating the companies that take the less profitable position (e.g. ETA).

>Supercomputers of the future will be scalable multiprocessors made of
>many hundreds to thousands of commodity microprocessors.  They will
>be commodity parts because these parts will be the fastest around and
>they will be cheap.

It seems to me that the experience in the industry is that general-purpose processors are not usually very effective in parallel-processing applications. There is certainly no guarantee that the uniprocessors which are successful in the market will be well-suited to the parallel supercomputer market -- which is not likely to be a big enough market segment to have any control over what processors are built....

The larger chip vendors are paying more attention to parallelism now, but it appears to be in the context of 2-4 processor parallelism. It is not likely to be possible to make these chips work together in configurations of 1000's with the application of "glue" chips....

This is not to mention the fact that software technology for these parallel supercomputers is depressingly immature. I think traditional moderately parallel machines (e.g. Cray Y/MP-8) will be able to handle existing scientific workloads better than 1000-processor parallel machines for quite some time....

--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
                   mccalpin@scri1.scri.fsu.edu
                   mccalpin@delocn.udel.edu
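The 24-bytes-per-clock figure is just the operand traffic of a 64-bit vector stream: two loads and one store of 8 bytes each per cycle. A small sketch of that arithmetic in C; the clock rate is an arbitrary assumption used only to turn it into a bytes-per-second number:

    /* Sustained memory traffic for a single 64-bit vector pipe:
     * 2 input operands + 1 result = 3 x 8 bytes every clock.
     * The 100 MHz clock is an illustrative assumption.
     */
    #include <stdio.h>

    int main(void)
    {
        double clock_mhz       = 100.0;
        double bytes_per_clock = 3.0 * 8.0;                /* 24 bytes */
        double mb_per_sec      = bytes_per_clock * clock_mhz;

        printf("sustained memory traffic: %.0f MB/s\n", mb_per_sec);
        return 0;                                          /* 2400 MB/s */
    }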
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/17/89)
There's more to supercomputing than scalar speed. One of the primary things you can do on a supercomputer is run large programs quickly. Virtual memory is nice, but some programs cause it to thrash. That's when it's nice to have a real 4GB machine. The same thing can be said about vector processing: some programs can be done using vector processors (or lots of parallel processors) faster than scalar.

I don't see the death of the supercomputer, but a redefinition of problems needing one. I have more memory on my home computer than all the computers at this site when I started working here (hell, the total was <2MB). Likewise CPU and even disk. The number of problems which I can't solve on my home system is a lot smaller than it was back then. However, that's the kicker: real problems are limited in size.

Someone said that the reason for micros catching up is that the development cost could be spread over the users. For just that reason the vector processors will stay expensive, because fewer users will need (ie. buy) them. There will always be a level of hardware needed to solve problems which are not shared by many users. While every problem has a scalar portion, many don't need vectors, or even floating point.

I think this goes for word size, too. When I see that the Intel 586 will have a 64 bit word I fail to generate any excitement. The main effect will be to break all the programs which assume that short==16 bits (I've ported to the Cray, this *is* a problem). If you tell me I can have 64 bit ints, excuse me if I don't feel the need to run right out and place an order. Even as memory gets cheaper I frequently need 1-2 million ints, and having them double in size is not going to help keep cost down.

I think that the scalar market will continue to be micros, but I don't agree with Eric that the demand for supercomputers will vanish, or that micros will catch them for the class of problems which are currently being run on supercomputers. The improving scalar performance will reduce the need for vector processing, and keep them from getting economies of scale. He may well be right that some of the companies will fall, since the micros will be able to solve a lot of the problems which are not massively vectorable or do not inherently require huge addressing space.

--
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called 'reason' in the face of the church and common sense. Any fool can see that the world is flat!" - anon
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/17/89)
>In article <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov () writes:
>

This article certainly generated some responses. Unfortunately, some responders seemed to miss (or chose to ignore :-) the tongue-in-cheek nature of the title.

I used to argue, only a couple of years ago, that supercomputers produced cheaper scalar computing cycles than "smaller" systems. That isn't true today. However, supercomputers still produce cheaper floating point results on vectorizable jobs. And, they produce memory bandwidth cheaper than other systems. That may change, too.

Q: What will it take to replace a Cray with a bunch of micros?

A: (IMHO): A "cheap" Multiport Interleaved Memory subsystem. In order to do that, you need to provide a way to build such subsystems out of a maximum of 3 different chips, and be able to scale the number of processors and interleaving up and down. A nice goal might be a 4-port/32-way-interleaved 64-bit-wide subsystem cheap enough for a $100 K system. (That is only enough memory bandwidth for a 1 CPU Cray-like system, or 4 micro based CPUs with only 1 word/cycle required, but it would sure be a big step forward.) The subsystem needs to provide single level local-like memory, like a Cray. [Or, show a way to make, in software, a truly distributed system as efficient as a local memory system (PhD thesis material... - I am betting on hardware solutions in the short run...)]. You also need to provide a reasonably reliable way for the memory to subsystem connections to be made. This is sort of hard hardware level engineering. For example, you probably can't afford the space for 32 VME buses... Does anyone have any suggestions on how the connections into and out of such memory subsystems could be made without a Cray-sized bundle of connectors?

On the topic of the original posting, what I have seen is that micro based workstations are eating away fast at the minicomputer market, just on the basis of price performance, leaving only workstation clusters, vector machines (Convex-sized to Cray-sized), and other big iron, such as very large central storage servers. So, I wouldn't write off big iron just yet, but obviously some companies will be selling a lot more workstations and a lot fewer minicomputers than they were planning.

Quiz: Why does Cray use *8* way interleaving per memory *port* on the Cray Y-MP?

Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035    Phone:  (415)694-6117
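The quiz has a back-of-the-envelope flavor: the interleave factor a port needs is roughly the ratio of the memory bank cycle time to the processor cycle time, rounded up to a power of two, so that one word can be delivered every processor cycle. A sketch with purely illustrative timing numbers (not Cray's or anyone else's actual figures):

    /* Banks needed per port to sustain one access per processor cycle:
     * enough banks that, cycled round-robin, each bank gets its full
     * cycle time back before it is asked for another word.
     */
    #include <stdio.h>

    int main(void)
    {
        double mem_cycle_ns = 120.0;   /* assumed DRAM bank cycle time   */
        double cpu_cycle_ns = 6.0;     /* assumed processor clock period */
        int    banks        = 1;

        while (banks * cpu_cycle_ns < mem_cycle_ns)
            banks *= 2;                /* round up to a power of two     */

        printf("banks per port needed: %d-way interleave\n", banks);
        return 0;                      /* 32-way for these sample times  */
    }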
colwell@mfci.UUCP (Robert Colwell) (10/17/89)
In article <7369@thor.acc.stolaf.edu> mike@thor.stolaf.edu () writes:
>In article <1081@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>>I take my hat off to them, too, because that's no mean feat.  But don't
>>forget that the supercomputers didn't set out to be the fastest machines
>>on scalar code.  If they had, they'd all have data caches, non-interleaved
>>main memory, and no vector facilities.  What the supercomputer designers
>
>Excuse me, non-interleaved main memory?  I've always assumed that
>interleaved memory could help scalar code too.  After all, instruction
>fetch tends to take place from successive addresses.  Of course if
>main memory is very fast there is no point to interleaving it, but
>if all you've got is drams with slow cycle times, I would expect
>that interleaving them would benefit even straight scalar code.

I meant that as a shorthand way of putting across the idea that the usual compromise is one of memory size, memory bandwidth, and memory latency. For the canonical scalar code you don't need a very large memory, and the bandwidth may not be as important to you as the latency (pointer chasing is an example).

The point I was making was that the supercomputers have incorporated design decisions, such as very large physical memory, and very high bandwidth to and from that memory, so that their multiple functional units can be kept usefully busy while executing 'parallel' code. Were you to set out to design a machine which didn't (or couldn't) use those multiple buses (pin limits on a single-chip micro for instance) then that bandwidth isn't worth as much to you and you might be better off with a flat, fast memory, which is what most workstations do (or used to do, anyway).

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
31 Business Park Dr.
Branford, CT 06405     203-488-6090
pan@propress.com (Philip A. Naecker) (10/17/89)
In article <35825@lll-winken.LLNL.GOV>, brooks@maddog.llnl.gov writes:
> Fact number 1:
> The best of the microprocessors now EXCEED supercomputers for scalar
> performance and the performance of microprocessors is not yet stagnant.
> On scalar codes, commodity microprocessors ARE the fastest machines at
> any price and custom cpu architectures are doomed in this market.

Alas, I believe you have been sucked into the MIPS=Performance fallacy. There is *not* a simple relationship between something as basic as scalar performance and something as complex as overall application (or even routine) performance.

Case in point: The R2000 chipset implemented on the R/120 (mentioned by others in this conversation) has, by all measures, *excellent* scalar performance. One would benchmark it at about 12-14 times a microVAX. However, in real-world, doing-useful-work, not-just-simply-benchmarking situations, one finds that actual performance (i.e., performance in very simple routines with very simple algorithms doing simple floating point operations) is about 1/2 that expected. Why? Because memory bandwidth is *not* as good on a R2000 as it is on other machines, even machines with considerably "slower" processors. There are several components to this, the most important being the cache implementation on an R/120. Other implementations using the R2000/R3000/Rx000 chipsets might well do much better, but only with considerable effort and cost, both of which mean that those "better" implementations will begin to approach the price/performance of the "big" machines that you argue will be killed by the price/performance of commodity microprocessors.

I think you are to a degree correct, but one must always tailor such generalities with a dose of real-world applications. I didn't, and I got bit to the tune of a fine bottle of wine. :-(

Phil
_______________________________________________________________________________
Philip A. Naecker                      Consulting Software Engineer
Internet: pan@propress.com             Suite 101
          uunet!prowest!pan            1010 East Union Street
Voice:    +1 818 577 4820              Pasadena, CA 91106-1756
FAX:      +1 818 577 0073
Also:     Technology Editor, DEC Professional Magazine
_______________________________________________________________________________
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/17/89)
In article <35896@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes (another amusing challenge):
>After all, these little critters mutate and become more voracious every
>6 months and vectorizable code is the only thing left for them to conquer.

(I like the picture of fat computer vendors, or at least fat marketing depts, hunched together in bunkers hiding from the killer micros. I have no doubt that they are planning a software counterattack. Watch out for a giant MVS robot built to save the day! :-)

>No NEW technology needs to be developed, all the micro-chip and memory-chip
>makers need to do is to decide to take over the supercomputer market.
>
>They will do this with their commodity parts.

The only problem I see with this is the interconnection technology. The *rest* of it is, or will soon be, commodity market stuff.

>Supercomputers of the future will be scalable multiprocessors made of many
>hundreds to thousands of commodity microprocessors.

The appropriate interconnection technology for this has not, to my knowledge, been determined. Perhaps you might explain how it will be done? The rest, I agree, is doable at this point, though some of it is not trivial.

Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035    Phone:  (415)694-6117
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/17/89)
In article <127@csinc.UUCP> rpeglar@csinc.UUCP (Rob Peglar x615) writes:
>In article <35825@lll-winken.LLNL.GOV>, brooks@maddog.llnl.gov writes:
>that point for almost a generation.  I believe that it will take at least
>one more generation - those who were weaned on machines from CDC, then CRI -
>before a more reasonable approach to machine procurement comes to pass.

In my experience, gov't labs are very cost conscious. I could tell a lot of stories on this. Suffice it to say that many people who have come to gov't labs from private industry get frustrated with just how cost conscious the gov't can be (almost an exact quote: "In my last company, if we needed another 10GBytes, all we had to do was ask, and they bought it for us." That was when 10 GBytes cost $300 K.) The reason supercomputers are used so much is that they get the job done more cheaply. You may question whether or not new nuclear weapons need to be designed, but I doubt if the labs doing it would use Crays if that were not the cheapest way to get the job done. Private industry concerns with the same kinds of jobs also use supercomputers the same way. Oil companies, for example. At various times, oil companies have owned more supercomputers than govt labs.

>Thus, I disagree that there will *always* be a market for this sort of
>thing.  Status symbols may be OK in cars, but for machines purchased with
>taxpayer dollars, the end is near.  Hence, Brooks' "attack of the killer
>micros".

I will make a reverse claim: People who want status symbols buy PC's for their office. These PC's, the last time I checked, were only 1/1000th as cost effective at doing scientific computations as supercomputers. Talk about *waste*... :-)

Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035    Phone:  (415)694-6117
brooks@vette.llnl.gov (Eugene Brooks) (10/17/89)
In article <33798@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>>Supercomputers of the future will be scalable multiprocessors made of many
>>hundreds to thousands of commodity microprocessors.
>
>The appropriate interconnection technology for this has not, to my knowledge,
>been determined.  Perhaps you might explain how it will be done?  The rest,
>I agree, is doable at this point, though some of it is not trivial.

This is the stuff of research papers right now, and rapid progress is being made in this area. The key issue is not having the components which establish the interconnect cost much more than the microprocessors, their off chip caches, and their main memory. We have been through message passing hypercubes and the like, which minimize hardware cost while maximizing programmer effort. I currently lean to scalable coherent cache systems which minimize programmer effort. The exact protocols and hardware implementation which work best for real applications is a current research topic. The complexity of the situation is much too high for a vendor to just pick a protocol and build without first running very detailed simulations of the system on real programs.

brooks@maddog.llnl.gov, brooks@maddog.uucp
brooks@vette.llnl.gov (Eugene Brooks) (10/17/89)
In article <MCCALPIN.89Oct16141656@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>The larger chip vendors are paying more attention to parallelism now,
>but it appears to be in the context of 2-4 processor parallelism.  It
>is not likely to be possible to make these chips work together in
>configurations of 1000's with the application of "glue" chips....

These microprocessors, for the most part, are being designed to work in a small processor count coherent cache shared memory environment. This is the reason why examining scalable coherent cache systems is so important. The same micros, with their capability to lock a cache line for a while to do an indivisible op, will work fine in the scalable systems. I agree that they won't be optimal, but they will be within 90% of optimal and that is all that is required. The MAJOR problem with current micros in a scalable shared memory environment is their 32 bit addressing. Unfortunately, no 4 processor system will ever need more than 32 bit addresses, so we will have to BEG the micro vendors to put in bigger pointer support.

>This is not to mention the fact that software technology for these
>parallel supercomputers is depressingly immature.  I think traditional
>moderately parallel machines (e.g. Cray Y/MP-8) will be able to handle
>existing scientific workloads better than 1000-processor parallel
>machines for quite some time....

The software question is the really hairy one; that is why LLNL is sponsoring the Massively Parallel Computing Initiative. We see scalable machines being very cost effective and are making a substantial effort in the application software area.

brooks@maddog.llnl.gov, brooks@maddog.uucp
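The 32-bit addressing complaint is easy to quantify: a flat 32-bit address space tops out at 4 GB, which even a modest scalable machine exceeds. The node count and per-node memory below are illustrative assumptions, not LLNL figures:

    /* Aggregate memory of an assumed scalable machine vs. the reach of
     * a flat 32-bit address.  All configuration numbers are made up.
     */
    #include <stdio.h>

    int main(void)
    {
        unsigned long nodes       = 1024;          /* assumed node count    */
        unsigned long mb_per_node = 64;            /* assumed MB per node   */
        unsigned long total_mb    = nodes * mb_per_node;
        unsigned long limit_mb    = 4UL * 1024;    /* 2^32 bytes = 4096 MB  */

        printf("aggregate memory: %lu MB, 32-bit limit: %lu MB\n",
               total_mb, limit_mb);                /* 65536 vs 4096         */
        return 0;
    }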
brooks@vette.llnl.gov (Eugene Brooks) (10/17/89)
In article <33802@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>I will make a reverse claim: People who want status symbols buy PC's for their
>office.  These PC's, the last time I checked, were only 1/1000th as cost
>effective at doing scientific computations as supercomputers.  Talk about
>*waste*... :-)

A "PC" with a MIPS R3000 or an Intel i860 in it is about 70 times more cost effective for scalar codes, and we run a lot of those on our supercomputers at LLNL, and about 3 to 7 times more cost effective for highly vectorized codes. In fact, much to our computer center's dismay, research staff are voting with their wallet and buying these "PC"s in droves. Our computer center is responding by buying microprocessor powered machines, currently in bus based shared memory multiprocessor form, but eventually in scalable shared memory multiprocessor form.

brooks@maddog.llnl.gov, brooks@maddog.uucp
mg@notecnirp.Princeton.EDU (Michael Golan) (10/17/89)
This came from various people - the references are so confusing I removed them so as not to put the wrong words in someone's mouth:

>>>Supercomputers of the future will be scalable multiprocessors made of many
>>>hundreds to thousands of commodity microprocessors.
>>
>This is the stuff of research papers right now, and rapid progress is being
>made in this area.  The key issue is not having the components which establish
>the interconnect cost much more than the micros, their off chip caches,
>I currently lean to scalable coherent cache systems which minimize programmer
>effort.  The exact protocols and hardware implementation which work best
>for real applications is a current research topic.

Last year, I took a graduate level course in parallel computing here at Princeton. I would like to make the following comments, which are my *own*:

1) There is no parallel machine currently that works faster than non-parallel machines for the same price. The "fastest" machines are also non-parallel - these are vector processors.

2) A lot of research is going on - and went on for over 10 years now. As far as I know, no *really* scalable parallel architecture with shared memory exists that will scale far above 10 processors (i.e. 100). And it does not seem to me this will be possible in the near future. "A lot of research" does not imply any effective results - especially in CS - just take a look how many people write articles improving time from O(N log log N) to O(N log log log N), which will never be practical for N<10^20 or so (the log log is just an example; you know what I mean).

3) Personally I feel parallel computing has no real future as the single cpu gets a 2-4 fold performance boost every few years, and parallel machine construction just can't keep up with that. It seems to me that for at least the next 10 years, non-parallel machines will still give the best performance and the best performance/cost.

4) I think Cray-like machines will be here for a long long time. People talk about Cray-sharing. This is true, but when an engineer needs a simulation to run and it takes 1 day each time, if you run it on a machine where it takes 2 or 3 days, he sits doing nothing for that time, which costs you a lot, i.e. it is turn-around time that really matters. And while computers get faster, it seems software complexity and the need for faster and faster machines are growing even more rapidly.

Michael Golan
mg@princeton.edu
The opinions expressed above are my *own*. You are welcome not to like them.
kleonard@gvlv2.GVL.Unisys.COM (Ken Leonard) (10/17/89)
In article <12070@cgl.ucsf.EDU> seibel@cgl.ucsf.edu (George Seibel) writes:
* In <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov wrote:
* > On scalar codes, commodity microprocessors ARE the fastest machines at
* > any price and custom cpu architectures are doomed in this market.
*
* Speaking of "commodities",...
* ...
* *commodity Mflops*. The issue is no longer performance at any cost, because
* if it was you would order another machine at that point. The important
* thing is Mflops/dollar for most people, and that's where the micros are
* going to win in a lot of cases.
---- well, first...
Maybe, even, the _commodity_ is _not_ _M_flops per dollar, but just
_flop_flops per dollar? That is, if the cycle time to "set up the problem",
"crunch the numbers", "get the plot/list/display" is under _whatever_ upper
limit fits with _my_ mode of "useful work", then I very likely _do_not_care_
if it gets any shorter (i.e. if the _flop_flops per second per dollar goes
higher). This becomes, IMHO, even more significant if my "useful" cycle time
is available to me _truly_ whenever _I_ darn well feel the need.
All of which works, again, to the advantage of microcrunchers.
---- and, second...
A non-trivial part of the demand for megacrunchers, IMHO, stems from
solution methods which have evolved from the days when _only_ "big"
machines were available for "big" jobs (any jobs?) and _just_had_to_be_
shared. For what _I_ do, anyhow, (and probably a _lot_ of other folk
somewhere out there), the "all-in-one-swell-foop" analyses/programs/techniques
are not the _only_ way to get to the _results_ needed to do the job--and
they may well _not_ be the "best" way. I often find that somewhat more of
somewhat smaller steps get me to my target faster than otherwise. That is,
if I can only get 1 or 2 or 10 passes per day through the megacruncher, the
job takes more work from me and more time on the calendar and more bucks from
whoever is paying the tab, than if I can run as many smaller passes as I
need.
---- also third...
And those smaller passes may well be easier (and thus faster) to program,
and more amenable to validation/assurance/etc.
And they may admit algorithms which work plenty fast on a dedicated machine
even if it is pretty small but would not work very fast at all on a shared
machine even if it is quite big (maybe especially because it is "big
architecture").
---- so, finally...
I believe in micros.
-------------
regardz,
Ken Leonard
swarren@eugene.uucp (Steve Warren) (10/17/89)
In article <33788@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
[...]
>Does anyone have any suggestions on how the connections into and out of such
>memory subsystems could be made without a Cray-sized bundle of connectors?
[...]

Multiplexed optical busses driven by integrated receivers with the optics, decoders, and logic-level drivers on the same substrate. It's the obvious solution (one I think many companies are working on).

DISCLAIMER: This opinion is in no way related to my employment with Convex Computer Corporation. (As far as I know we aren't working on optical busses, but then I'm not in New Products).

--Steve
-------------------------------------------------------------------------
	  {uunet,sun}!convex!swarren; swarren@convex.COM
swarren@eugene.uucp (Steve Warren) (10/17/89)
In article <20336@princeton.Princeton.EDU> mg@notecnirp.edu (Michael Golan) writes:
>Last year, I took a graduate level course in parallel computing here at
>Princeton.  I would like to make the following comments, which are my *own*:
>
>1) There is no parallel machine currently that works faster than non-parallel
>machines for the same price.  The "fastest" machines are also non-parallel -
>these are vector processors.
>

The Cray XMP with one processor costs approx. $2.5M. The 4 processor Convex C240S costs $1.5M. On typical scientific applications the performance of the 240S is about 140% of the single processor Cray XMP. (The 240S is the newest model with enhanced performance CPUs).

Also, vector processors are technically nonparallel, but the implementation involves parallel function units that are piped up so that at any one instant in time there are multiple operations occurring. Vectors are a way of doing parallel processing on a single stream of data.

These were the only points I would disagree with.

--Steve
-------------------------------------------------------------------------
	  {uunet,sun}!convex!swarren; swarren@convex.COM
rpeglar@csinc.UUCP (Rob Peglar x615) (10/17/89)
In article <33802@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> In article <127@csinc.UUCP> rpeglar@csinc.UUCP (Rob Peglar x615) writes:
>
> >that point for almost a generation.  I believe that it will take at least
> >one more generation - those who were weaned on machines from CDC, then CRI -
> >before a more reasonable approach to machine procurement comes to pass.
>
> In my experience, gov't labs are very cost conscious.  I could tell a lot of
> stories on this.  Suffice it to say that many people who have come to gov't labs
> from private industry get frustrated with just how cost conscious the gov't can
> be (almost an exact quote: "In my last company, if we needed another 10GBytes,
> all we had to do was ask, and they bought it for us."  That was when 10 GBytes
> cost $300 K.)  The reason supercomputers are used so much is that they get the
> job done more cheaply.  You may question whether or not new nuclear weapons
> need to be designed, but I doubt if the labs doing it would use Crays
> if that were not the cheapest way to get the job done.  Private industry
> concerns with the same kinds of jobs also use supercomputers the same way.
> Oil companies, for example.  At various times, oil companies have owned more
> supercomputers than govt labs.

Good point. However, oil companies in particular are notorious for having procurements follow the "biggest and baddest = best" philosophy. Hugh, you know as well as I that supercomputer procurement is not a rational or scientific process - it's politics, games, and who knows who. Cheap, efficient, usable, etc. etc. - all take a back seat to politics. However, if the "job" is defined as running one (or some small number of) code(s) for hours then there is no question that only a super will do.

The point that Brooks doesn't make, but implies only, is that the *way* scientific computing is being done changes all the time. One-job killer codes are becoming less prevalent. The solutions must change as the workload changes. Sure, there are always codes which cannot be run (Lincoln's attributed quote, compressed: "a supercomputer is only one generation behind the workload") - but yesterday's killer code, needing 8 hours and 4 million 64-bit words, can now be done on the desktop. (see below)

> >Thus, I disagree that there will *always* be a market for this sort of
> >thing.  Status symbols may be OK in cars, but for machines purchased with
> >taxpayer dollars, the end is near.  Hence, Brooks' "attack of the killer
> >micros".
>
> I will make a reverse claim: People who want status symbols buy PC's for their

Please. Are you saying that NAS, LLNL, LANL, etc. etc. don't compete for status defined as big, bad hardware? Just the glorious battle between Ames and Langley provides one with enough chuckles to last quite a while.

> office.  These PC's, the last time I checked, were only 1/1000th as cost
> effective at doing scientific computations as supercomputers.  Talk about
> *waste*... :-)

Look again. I'll give you a real live example. Buy any 386 33 MHz machine, with a reasonable cache (e.g. at least 128 kB) of fast SRAM, and 8 MB or so of (slower) DRAM. Plug in a Mercury co-processor board, and use Fortran (supplied by Mercury) to compile Dr. Dongarra's Table One Linpack. Results on the PC - 1.8 Mflops. Using a coded BLAS, you get 4.7 Mflops. This is 64-bit math. Last time *I* checked, the Cray Y-MP stood at 79 Mflops. Cost of Cray Y-MP? You and I know what that is.

Even discounting life cycle costing (which for any Cray machine is huge due to bundled maintenance, analysts, etc. etc.), the performance ratio of Y to PC is 79/1.8 = 43.88. I'll bet my year's salary that the price ratio is higher than that. To ballpark, price for the PC setup is around $20K. Moving down all the time. Even if the Y-MP 1/32 was only $2M (which it is not) that would be a 100:1 price ratio. Of course, that is only one code. Truly, your mileage will vary. The price/performance ratio of an overall system is dependent on many variables.

After all that, Brooks' point is still valid. Micros using commodity HW and cheap (sometimes free) software are closing the gap. They have already smashed the price/performance barrier (for many codes), and the slope of their absolute performance improvements over time is much larger than any of the true super vendors (any==1 now, at least US). The game is nearly over.

Rob
...uunet!csinc!rpeglar
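Peglar's ratios written out as a small C calculation; the Linpack and price figures are the ones quoted in his post and should be read as his numbers, not independently verified data:

    /* Price/performance comparison using the figures from the post:
     * 386 + Mercury board at 1.8 Mflops for ~$20K, Cray Y-MP at
     * 79 Mflops, priced at the $2M "even if" figure used above.
     */
    #include <stdio.h>

    int main(void)
    {
        double pc_mflops   = 1.8,   pc_price   = 20.0e3;
        double cray_mflops = 79.0,  cray_price = 2.0e6;

        double perf_ratio   = cray_mflops / pc_mflops;            /* ~43.9 */
        double price_ratio  = cray_price  / pc_price;             /* 100   */
        double pp_advantage = (pc_mflops / pc_price) /
                              (cray_mflops / cray_price);         /* ~2.3  */

        printf("performance ratio %.1f, price ratio %.0f\n",
               perf_ratio, price_ratio);
        printf("PC price/performance advantage: %.1fx\n", pp_advantage);
        return 0;
    }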
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/17/89)
In article <35986@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>A "PC" with a MIPS R3000 or an Intel i860 in it is about 70 times more cost

A quick clarification: The "PC's" I was talking about are IBM PC's and clones based on Intel 80x86 chips, *not* SGI or DEC machines based on R3000/R3010s. "PC" may also be extended to Apple Mac and Mac II machines by some people. Most of the "PC" boosters that I am thinking of, and from which we have heard in this newsgroup recently, are also "offended" by the "excessive" power and cost of MIPSCo based machines. Not me, obviously, but most of these people do not consider an SGI 4D/25 a "PC".

>effective for scalar codes, and we run a lot of those on our supercomputers
>at LLNL, and about 3 to 7 times more cost effective for highly vectorized
>codes.

Well, I admit, I hadn't done a calculation for some months. Last time I did it, I was somewhat disappointed by the inflated claims surrounding micro based systems. I have been hearing "wolf!" for 15 years, so it is easy to be blase' about it. But, this USENET discussion stimulated me to look at it again. Another quick calculation shows a *big change*. It appears to me, on the face of it, that cost/delivered FLOP is now about even. I don't see the 3-7X advantage to the micros yet, but maybe you are looking at the faster 60-100MHz systems that will soon be arriving. I used SGI 4D/280's as the basis of comparison, since that appears to be the most cost effective of such systems that I have good pricing information on.

Anyway, how long has it taken Cray to shave a few ns off the clock? In less than a year we should see systems based on the new micro chips. Yikes. It looks like the ATTACK OF THE KILLER MICROS.

Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035    Phone:  (415)694-6117
desnoyer@apple.com (Peter Desnoyers) (10/18/89)
> > In my experience, gov't labs are very cost conscious.  I could tell a
> > lot of stories on this.  Suffice it to say that many people who have come
> > to gov't labs from private industry get frustrated with just how cost
> > conscious the gov't can be.  (almost an exact quote: "In my last company,
> > if we needed another 10GBytes, all we had to do was ask, and they bought
> > it for us."  That was when 10 GBytes cost $300 K.)  The reason
> > supercomputers are used so much is that they get the job done more
> > cheaply.

From what I know of DOD procurement (my father works for a US Navy lab) one factor may be that the time and effort needed to justify spending $25,000 of Uncle Sam's money on a super-micro, along with the effort of spec'ing it as sole-source or taking bids, is no doubt far more than 1/400th the effort needed to procure a $10M supercomputer.

Peter Desnoyers
Apple ATG
(408) 974-4469
brooks@vette.llnl.gov (Eugene Brooks) (10/18/89)
In article <20336@princeton.Princeton.EDU> mg@notecnirp.edu (Michael Golan) writes:
>1) There is no parallel machine currently that works faster than non-parallel
>machines for the same price.  The "fastest" machines are also non-parallel -
>these are vector processors.

This is false. There are many counter examples for specific applications.

>2) A lot of research is going on - and went on for over 10 years now.  As far
>as I know, no *really* scalable parallel architecture with shared memory exists
>that will scale far above 10 processors (i.e. 100).  And it does not seem to
>me this will be possible in the near future.

Again, this is wrong. Many scalable architectures exist in the literature and some of them are well proven using simulation on real application codes.

>3) Personally I feel parallel computing has no real future as the single cpu
>gets a 2-4 fold performance boost every few years, and parallel machine
>construction just can't keep up with that.  It seems to me that for at least
>the next 10 years, non-parallel machines will still give the best performance
>and the best performance/cost.

Massively parallel computing has a future because the performance increases are 100 or 1000 fold. I agree with the notion that using 2 processors, if the software problems are severe, is not worth it because next year's micro will be twice as fast. Next year's supercomputer, however, will not be twice as fast.

>4) I think Cray-like machines will be here for a long long time.  People talk
>about Cray-sharing.  This is true, but when an engineer needs a simulation to
>run and it takes 1 day each time, if you run it on a machine where it takes 2
>or 3 days, he sits doing nothing for that time, which costs you a lot, i.e. it
>is turn-around time that really matters.

Cray-like machines will be here for a long time indeed. They will, however, be implemented on single or nearly single chip microprocessors. I do not think that the "architecture" is bad, only the implementation has become nearly obsolete. It is definitely obsolete for scalar code and vectorized code will follow within 5 years.

brooks@maddog.llnl.gov, brooks@maddog.uucp
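Brooks's "next year's micro will be twice as fast" argument is a compound-growth comparison. A sketch with assumed doubling times; the rates are illustrative assumptions, not data from the thread:

    /* Compound growth of micro vs. big-iron uniprocessor performance
     * over a fixed horizon, using made-up doubling times.
     */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double micro_doubling_yrs = 1.5;   /* assumed */
        double super_doubling_yrs = 5.0;   /* assumed */
        double years              = 5.0;

        double micro_gain = pow(2.0, years / micro_doubling_yrs);
        double super_gain = pow(2.0, years / super_doubling_yrs);

        printf("over %.0f years: micro x%.1f, supercomputer x%.1f\n",
               years, micro_gain, super_gain);   /* ~10.1 vs 2.0 */
        return 0;
    }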
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/18/89)
In article <36057@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>Cray like machines will be here for a long time indeed.  They will, however,
>be implemented on single or nearly single chip microprocessors.  I do not
>think that the "architecture" is bad, only the implementation has become
>nearly obsolete.  It is definitely obsolete for scalar code and vectorized
>code will follow within 5 years.

I agree with you here. In fact, did anyone notice a recent newspaper article (in Tuesday's Merc. News - from Knight Ridder):

"Control Data to use Mips design"

"Control Data Corp. has cast its lot with Mips Computer Systems, Inc. to design the brains of its future computers, choosing a new computer architecture developed by the Sunnyvale company." ... "The joint dev. agreement with Mips means Control Data will use [...] the RISC architecture developed by that firm..."

Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035    Phone:  (415)694-6117
ggw@wolves.uucp (Gregory G. Woodbury) (10/18/89)
In article <MCCALPIN.89Oct16141656@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>In article <35896@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene
>Brooks) writes:
>
>>Microprocessor development is not ignoring vectorizable workloads.  The
>>latest have fully pipelined floating point and are capable of pipelining
>>several memory accesses.
>>[ ... more stuff deleted ... ]
>
>It is not at all clear to me that the memory bandwidth required for
>running vector codes is going to be developed in commodity parts.  To
>be specific, a single 64-bit vector pipe requires a sustained
>bandwidth of 24 bytes per clock cycle.  Is an ordinary, garden-variety
>commodity microprocessor going to be able to use 6 32-bit
>words-per-cycle of memory bandwidth on non-vectorized code?  If not,
>then there is a strong financial incentive not to include that excess
>bandwidth in commodity products....
>

This is quite a statement. Don't forget - even if the micro can not make FULL use of a vector pipeline, including one will enhance performance significantly. The theoretical folks in this forum are quite useful in the development of theoretical maxima, but even some partial vector capabilities in a floating point unit will be greeted with joy. Lots and lots of "commodity" programs out there do things that would benefit from some primitive vector computations.

Just in the past couple of weeks we have had some discussions here about the price/performance aspects of these "Killer Micros". (I do want to acknowledge that my price figures were a little skewed - another round of configuration work with various vendors has shown that I can find a decent bus speed and SCSI disks in the required price range - thanks for some of the pointers!)

>In addition, the engineering/cost trade-off between memory bandwidth
>and memory latency will continue to exist for the "KILLER MICROS" as
>it does for the current generation of supercomputers.  Some users will
>be willing to sacrifice latency for bandwidth, and others will be
>willing to do the opposite.  Economies of scale will not eliminate
>this trade-off, except perhaps by eliminating the companies that take
>the less profitable position (e.g. ETA).

This is a good restatement of the recent "SCSI on steroids" discussion. The vendor who can first put a "real" supercomputer or "real" mainframe on (or beside) the desktop for <$50,000 will make a killing. Calling something a "Personal Mainframe" makes marketing happy, but not being able to keep that promise makes for unhappy customers ;-)

--
Gregory G. Woodbury
Sysop/owner Wolves Den UNIX BBS, Durham NC
UUCP: ...dukcds!wolves!ggw   ...dukeac!wolves!ggw           [use the maps!]
Domain: ggw@cds.duke.edu     ggw@ac.duke.edu     ggw%wolves@ac.duke.edu
Phone: +1 919 493 1998 (Home)  +1 919 684 6126 (Work)
[The line eater is a boojum snark! ]       <standard disclaimers apply>
kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/18/89)
In article <35897@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >In article <2121@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes: >>In article <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov () writes: >>>The best of the microprocessors now EXCEED supercomputers for scalar >>Is this a fair statement? I've played some with the i860 and >Yes, in the sense that a scalar dominated program has been compiled for >the i860 with a "green" compiler, no pun intended, and the same program >was compiled with a mature optimizing compiler on the XMP, and the 40MHZ >i860 is faster for this code. Better compilers for the i860 will open The Cray-XMP is considerably slower than the YMP. The single-processor XMP is no longer a supercomputer. Take a program requiring more than 128MBytes of memory (or 64 MBytes for that matter (but I personally prefer more than 256M to exercise the VM system a little!)) (i.e. a relatively BIG job, a *supercomputer* job) and then compare any micro you want or any other system you want with the YMP, or something in that class, and then try it on a multiprocessor YMP. And please STOP USING A SINGLE-PROCESSOR xmp AS THE DEFINITION OF A SUPERCOMPUTER, thank you. And it would be nice if people used "LIST PRICE" for "COMPLETE SYSTEMS" when comparing prices. (LIST PRICE = PEAK PRICE !!) (COMPLETE SYSTEM = with all needed software and a few GBytes of disk with a few controllers)
kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/18/89)
In article <127@csinc.UUCP> rpeglar@csinc.UUCP (Rob Peglar x615) writes: >(pick your number, the most popular one recently seems to be Linpack >100x100 all Fortran, Dongarra's Table One). The national labs have proven Throw away ALL your copies of the LINPACK 100x100 benchmark if you are interested in supercomputers. The 300x300 is barely big enough and uses a barely good-enough algorithm to qualify for supercomputer comparison as a low-impact guideline only. JJD has lots of warning words in the first paragraphs of his list but it looks like most people go right to the table and never read the paper. If you must use a single-program benchmark, use the lesson taught by the Sandia people (John Gustafson, et al.): Keep the time fixed and vary the problem size.
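For reference, the fixed-time measurement the Sandia group argued for is two lines of arithmetic next to Amdahl's fixed-size version; a sketch in C, with an arbitrary 1% serial fraction chosen purely for illustration:

    #include <stdio.h>

    int main(void)
    {
        double s = 0.01;    /* assumed serial fraction, for illustration only */
        int n;

        for (n = 1; n <= 1024; n *= 4) {
            double fixed_size = 1.0 / (s + (1.0 - s) / n);  /* Amdahl           */
            double fixed_time = s + (1.0 - s) * n;          /* Gustafson/Sandia */
            printf("%5d procs: fixed-size speedup %7.1f  fixed-time speedup %7.1f\n",
                   n, fixed_size, fixed_time);
        }
        return 0;
    }

The fixed-size number saturates near 1/s no matter how many processors you add; the fixed-time number keeps growing because the problem grows with the machine, which is the whole point of the Sandia exercise.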
mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (10/18/89)
In article <35896@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >Supercomputers of the future will be scalable multiprocessors made of >many hundreds to thousands of commodity microprocessors. They will be >commodity parts because these parts will be the fastest around and >they will be cheap. These scalable machines will have hundreds of >commodity disk drives ganged up for parallel access. Commodity parts >will again be used because of the cost advantage leveraged into a >scalable system using commodity parts. The only custom logic will be >the interconnect which glues the system together, and error correcting >logic which glues many disk drives together into a reliable high >performance system. The CM data vault is a very good model here. I think that it is interesting that you expect the same users who can't vectorize their codes on the current vector machines to be able to figure out how to parallelize them on these scalable MIMD boxes. It seems to me that the automatic parallelization problem is much worse than the automatic vectorization problem, so I think a software fix is unlikely.... In fact, I think I can say it much more strongly than that: Extrapolating from current experience with MIMD machines, I don't think that the fraction of users that can use a scalable MIMD architecture is likely to be big enough to support the economies of scale required to compete with Cray and their vector machines. (At least for the next 5 years or so). What I *do* think is that the romance with vector machines has worn off, and people are realizing that they are not the answer to everyone's problems. This is a good thing --- I like it when people migrate their scalar codes off of the vector machines that I am trying to get time on!!! What is driving the flight from traditional supercomputers to high-performance micros is turnaround time on scalar codes. From my experience, if the code is really not vectorizable, then it is probably not parallelizable either, and scalable machines won't scale. These users are going to want the fastest single-processor micro available, unless their memory requirements are too big for their ability to purchase. The people who can vectorize their codes are still getting 100:1 improvements going to supercomputers --- my code is over 500 times faster on an 8-cpu Cray Y/MP than on a 25 MHz R-3000/3010. So the market for traditional supercomputers won't disappear, it will just be more limited than many optimists have predicted. -- John D. McCalpin - mccalpin@masig1.ocean.fsu.edu mccalpin@scri1.scri.fsu.edu mccalpin@delocn.udel.edu
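The distinction McCalpin is leaning on is usually a genuine dependence chain rather than a compiler weakness; a minimal sketch of the two cases (the loops are illustrative, not taken from any of the codes discussed in the thread):

    /* Independent iterations: meat for a vectorizer or a parallelizer alike. */
    void independent(int n, double *a, const double *b, const double *c)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = b[i] + 2.0 * c[i];     /* no dependence between iterations */
    }

    /* First-order recurrence: each a[i] needs a[i-1], so neither a vector
       pipe nor a thousand processors helps this loop directly. */
    void recurrence(int n, double *a, const double *b)
    {
        int i;
        for (i = 1; i < n; i++)
            a[i] = a[i-1] * b[i] + 1.0;
    }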
rwa@cs.AthabascaU.CA (Ross Alexander) (10/18/89)
brooks@vette.llnl.gov (Eugene Brooks) writes: >In article <33802@ames.arc.nasa.gov> (Hugh LaMaster) writes: >>[...] a reverse claim: People who want status symbols buy PC's for their >>office. These PC's, the last time I checked, were only 1/1000th as cost >>effective at doing scientific computations as supercomputers. Talk about >>*waste*... :-) >A "PC" with a MIPS R3000 or an Intel i860 in it is about 70 times more cost >effective for scalar codes, and we run a lot of those on our supercomputers C'mon, Eugene, address the claim, not a straw man of your own invention. Hugh means people who buy intel-hackitecture machines from Big Blue. Do you really mean LLNL people buy mips-engine boxes as office status symbols? Not a very nice thing to say about your own team ;-). And then you contradict yourself by saying these same mips-or-whatever boxes are 70 times more effective: are they status symbols, or are they machines to do work? Make up your mind :-) :-). Ross
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/18/89)
In article <33802@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes: | I will make a reverse claim: People who want status symbols buy PC's for their | office. These PC's, the last time I checked, were only 1/1000th as cost | effective at doing scientific computations as supercomputers. Talk about | *waste*... :-) What you say is true, but you seem to draw a strange conclusion from it... very few people do scientific calculations on a PC. They are used for spreadsheets, word processing, and even reading news ;-) These are things which supercomputers do poorly. Benchmark nroff on a Cray... EGAD! it's slower than an IBM 3081! Secondly *any* computer becomes less cost effective as it is used less. Unless you have the workload to heavily use a supercomputer you will find the cost gets really steep. Think of it this way, a technical worker costs a company about $50000 a year (or more), counting salary and benefits. The worker works 240 days a year (2 weeks vacation, 10 days holiday and sick), at a cost per *working hour* of $26 more or less. For a $1600 PC to be cost effective in just a year it must save about 16 minutes a day, which is pretty easy to do. You also get increased productivity. Obviously not every PC is utilized well. Neither are workstations (how many hours drawing fractals and playing games) or supercomputers, for that matter. That problem is a management issue, not a factor of computer size. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "The world is filled with fools. They blindly follow their so-called 'reason' in the face of the church and common sense. Any fool can see that the world is flat!" - anon
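The break-even figure above is a few divisions; a sketch using exactly the numbers quoted in the posting:

    #include <stdio.h>

    int main(void)
    {
        double cost_per_year  = 50000.0;    /* salary plus benefits, as quoted */
        double days_per_year  = 240.0;
        double hours_per_day  = 8.0;
        double pc_price       = 1600.0;

        double cost_per_hour   = cost_per_year / (days_per_year * hours_per_day);
        double pc_cost_per_day = pc_price / days_per_year;
        double breakeven_min   = 60.0 * pc_cost_per_day / cost_per_hour;

        printf("worker costs about $%.0f per working hour\n", cost_per_hour);
        printf("a $%.0f PC pays for itself in a year if it saves %.1f minutes a day\n",
               pc_price, breakeven_min);
        return 0;
    }

This comes out to roughly 15 minutes a day, in line with the "about 16 minutes" in the posting.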
ingoldsb@ctycal.UUCP (Terry Ingoldsby) (10/19/89)
In article <MCCALPIN.89Oct16141656@masig3.ocean.fsu.edu>, mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes: > In article <35896@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene > Brooks) writes: > > >Microprocessor development is not ignoring vectorizable workloads. The > >latest have fully pipelined floating point and are capable of pipelining ... > It seems to me that the experience in the industry is that > general-purpose processors are not usually very effective in > parallel-processing applications. There is certainly no guarantee > that the uniprocessors which are successful in the market will be > well-suited to the parallel supercomputer market -- which is not > likely to be a big enough market segment to have any control over what > processors are built.... Agreed. The only general purpose systems that I am aware of that exploit parallel processing do so through specialized processors to handle certain functions (e.g. matrix multipliers, I/O processors) or have a small (< 16) number of general purpose processors. > > The larger chip vendors are paying more attention to parallelism now, > but it appears to be in the context of 2-4 processor parallelism. It > is not likely to be possible to make these chips work together in > configurations of 1000's with the application of "glue" chips.... It doesn't seem to be just a case of using custom designed chips as opposed to generic glue. The problem is fundamentally one of designing a system that allows the problem to be divided across many processors AND (this is the tricky part) that provides an efficient communication path between the sub-components of the problem. In the general case this may not be possible. Note that mother nature hasn't been able to do it (e.g. the human brain isn't very good at arithmetic, but for other applications it's stupendous). > > This is not to mention the fact that software technology for these > parallel supercomputers is depressingly immature. I think traditional > moderately parallel machines (e.g. Cray Y/MP-8) will be able to handle > existing scientific workloads better than 1000-processor parallel > machines for quite some time.... > -- I don't think we should berate ourselves about the techniques available for splitting workloads. No one has ever proved that such an activity is even possible for most problems (at a *large* scale). The activities that are amenable to parallel processing (e.g. image processing, computer vision) will probably only be feasible on architectures specifically designed for those functions. Note that I'm not saying to give up on parallel processing; on the contrary I believe that it is the only way to do certain activities. I am saying that the notion of a general purpose massively parallel architecture that efficiently executes all kinds of algorithms is probably a naive and simplistic view of the world. -- Terry Ingoldsby ctycal!ingoldsb@calgary.UUCP Land Information Systems or The City of Calgary ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb
slackey@bbn.com (Stan Lackey) (10/19/89)
In article <MCCALPIN.89Oct18103933@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes: >In article <35896@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene >Brooks) writes: > >>Supercomputers of the future will be scalable multiprocessors made of >>many hundreds to thousands of commodity microprocessors. >I think that it is interesting that you expect the same users who >can't vectorize their codes on the current vector machines to be able >to figure out how to parallelize them on these scalable MIMD boxes. >It seems to me that the automatic parallelization problem is much >worse than the automatic vectorization problem, ... Yes, there seems to be the perception running around that "parallelization" must be harder than "vectorization". I am not saying it isn't, because I am not a compiler writer, but I sure can give some reasons why it might not be. Vectorization requires the same operation to be repeatedly performed on the elements of a vector. Parallel processors can perform different operations, such as conditional branching within a loop that is being performed in parallel. Dependencies between loop iterations can be handled in a PP that has the appropriate communication capabilities, whereas most (all?) vector machines require that all elements be independent (except for certain special cases, like summation and dot product.) This can be done by message passing, or if you have shared memory, with interlocks. Parallel processors are not limited to operations for which there are corresponding vector instructions provided in the hardware. Well that's all I can think of right now. Anyone else care to add anything? -Stan
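A small illustration of the kind of loop being described: the iterations are independent, so a MIMD machine can hand them to different processors, but the branch and the scalar subroutine call give a vector unit little to work with short of executing both arms under a mask. The work routine here is hypothetical, purely for the sake of the example.

    double hard_case(double x);    /* hypothetical expensive scalar routine */

    void per_element_branch(int n, double *a, const double *b)
    {
        int i;
        for (i = 0; i < n; i++) {          /* iterations are independent */
            if (b[i] > 0.0)
                a[i] = b[i] * b[i];        /* cheap arm                  */
            else
                a[i] = hard_case(b[i]);    /* expensive, scalar arm      */
        }
    }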
chris@dg.dg.com (Chris Moriondo) (10/19/89)
In article <35977@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >This is the stuff of research papers right now, and rapid progress is being >made in this area. The key issue is not having the components which establish >the interconnect cost much more than the microprocessors, their off chip >caches, and their main memory. The only really scalable interconnect schemes of which I am aware are multistage interconnects which grow (N log N) as you linearly increase the numbers of processors and memories. So in the limit the machine is essentially ALL INTERCONNECT NETWORK, which obviously costs more than the processors and memories. (Maybe this is what SUN means when they say "The Network IS the computer"? :-) How do you build a shared-memory multi where the cost of the interconnect scales linearly? Obviously I am discounting busses, which don't scale well past very small numbers of processors. >We have been through message passing hypercubes and >the like, which minimize hardware cost while maximizing programmer effort. >I currently lean to scalable coherent cache systems which minimize programmer >effort. While message passing multicomputers maximize programmer effort in the sense that they don't lend themselves to "dusty deck" programs, they have the advantage that the interconnect costs scale linearly with the size machine. They also present a clean programmer abstraction that presents the true cost of operations to the programmer. I read a paper by (I think) Larry Snyder wherein he argued that the PRAM abstraction causes programmers to produce suboptimal parallel algorithms by leading one to think that simple operations have linear cost when in reality they can't be better than N log N. chrism -- Insert usual disclaimers here --
bga@odeon.ahse.cdc.com (Bruce Albrecht) (10/19/89)
In article <35979@lll-winken.LLNL.GOV>, brooks@vette.llnl.gov (Eugene Brooks) writes: > Unfortunately, no 4 processor system will ever need more than 32 bit > addresses, so we will have to BEG the micro vendors to put in bigger > pointer support.. Oh really? CDC has several customers that have databases that exceed 2**32 bytes. Our file organization considers files to be virtual memory segments. We already need pointers larger than 32 bits. IBM's AS400 has a virtual address space greater than 32 bits, too. If the micro vendors don't see a need for it, they're not paying attention to what the mainframes are really providing for their very large system customers.
chris@dg.dg.com (Chris Moriondo) (10/19/89)
In article <20336@princeton.Princeton.EDU> mg@notecnirp.edu (Michael Golan) writes: >3) personally I feel parallel computing has no real future as the single cpu >gets a 2-4 folds performance boost every few years, and parallel machines >constructions just can't keep up with that. It seems to me that for at least >the next 10 years, non-parallel machines will still give the best performance >and the best performance/cost. Actually, the rate of improvement in single cpu performance seems to have flattened out in recent supercomputers, and they have turned to more parallelism to continue to deliver more performance. If you project the slope of the clock rates of supercomputers, you will see sub-nanosecond CYCLE times before 1995. I don't see any technologies in the wings which promise to allow this to continue... chrism
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/19/89)
In article <35979@lll-winken.LLNL.GOV>, brooks@vette.llnl.gov (Eugene Brooks) writes: | The MAJOR problem with current micros | in a scalable shared memory environment is their 32 bit addressing. | Unfortunately, no 4 processor system will ever need more than 32 bit | addresses, so we will have to BEG the micro vendors to put in bigger | pointer support.. The Intel 80386 has 32 bit segments, but it's still a segmented system, and the virtual address space is (I believe) 40 bits. The *physical* space is 32 bits, though. The 586 has been described in the press as a 64 bit machine. Seems about right; the problem which people are seeing right now is that file size is getting over 32 bits, and that makes all the database stuff seriously ugly, complex, and subject to programming error. I think you can assume that no begging will be needed, but if you let the vendors think that you need it the price will rise ;-) -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "The world is filled with fools. They blindly follow their so-called 'reason' in the face of the church and common sense. Any fool can see that the world is flat!" - anon
rec@dg.dg.com (Robert Cousins) (10/19/89)
In article <2450@odeon.ahse.cdc.com> bga@odeon.ahse.cdc.com (Bruce Albrecht) writes: >In article <35979@lll-winken.LLNL.GOV>, brooks@vette.llnl.gov (Eugene Brooks) writes: >> Unfortunately, no 4 processor system will ever need more than 32 bit >> addresses, so we will have to BEG the micro vendors to put in bigger >> pointer support.. > >Oh really? CDC has several customers that have databases that exceed 2**32 >bytes. Our file organization considers files to be virtual memory segments. >We already need pointers larger than 32 bits. IBM's AS400 has a virtual >address space greater than 32 bits, too. If the micro vendors don't see a >need for it, they're not paying attention to what the mainframes are really >providing for their very large system customers. In 1947, John Von Neumann anticipated that 4K words of 40 bits each was enough for contemporary problems and so the majority of machines then had that much RAM (or what passed for it in the technology of the day). This is ~2**12 bytes worth of usefulness in today's thinking (though not in bits). Over the next 40 years we've grown to the point where 2**32 bytes is a common theoretical limit, and a large number of machines in the 2**30 byte range is fairly common. This translates into 18-20 bits of address over 40 years. Or, 1 bit of address every 2 years or so. Given the trend to having micro architectures last 5 to 8 years, this means that a micro architecture should have at least 4 additional address lines at its announce or 5 additional when its development is started. In the PC space, 16 megabytes seems to be the common upper limit. Any PC therefore should have not 2**24 as a limit but 2**26 at the minimum. IMHO, at least :-) Robert Cousins Dept. Mgr, Workstation Dev't. Data General Corp. Speaking for myself alone.
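The rule of thumb in the posting reduces to a couple of divisions; a sketch, using the lifetime figure quoted there (the two-year development lead time is inferred from the posting's 4-versus-5 figures, not stated explicitly):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double years_per_bit   = 2.0;   /* ~1 address bit every 2 years     */
        double lifetime_years  = 8.0;   /* architecture stays in the field  */
        double lead_time_years = 2.0;   /* assumed development before announce */

        double extra_at_announce = ceil(lifetime_years / years_per_bit);
        double extra_at_start    = ceil((lifetime_years + lead_time_years) /
                                        years_per_bit);

        printf("extra address bits needed at announce:     %.0f\n",
               extra_at_announce);
        printf("extra address bits needed at design start: %.0f\n",
               extra_at_start);
        return 0;
    }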
hascall@atanasoff.cs.iastate.edu (John Hascall) (10/19/89)
In article <???> bga@odeon.ahse.cdc.com (Bruce Albrecht) writes: }In article <???>, brooks@vette.llnl.gov (Eugene Brooks) writes: }> Unfortunately, no 4 processor system will ever need more than 32 bit }Oh really? CDC has several customers that have databases that exceed 2**32 ... }We already need pointers larger than 32 bits. IBM's AS400 has a virtual }address space greater than 32 bits, too. I don't know about CDC, but the AS/400 uses what is called Single Level Storage, that is, all memory and disk are in one humongous address space. Many people do require more than 2**32 bytes of disk farm, but very few people are using 2**32 bytes of memory space--so in a more typical system the need for (pointers) more than 32 bits is rather uncommon % John Hascall % although I'm sure we'll hear from a number of them now :-)
joel@cfctech.UUCP (Joel Lessenberry) (10/19/89)
In article <1633@atanasoff.cs.iastate.edu> hascall@atanasoff.UUCP (John Hascall) writes: >... >}We already need pointers larger than 32 bits. IBM's AS400 has a virtual >}address space greater than 32 bits, too. > > I don't know about CDC, but the AS/400 uses what is called Single Level > Storage, that is, all memory and disk are in one humongous address space. > John Hascall > is anyone else out there interested in starting an AS/400 thread? It is IBM's most advanced system.. Single level storage Object Oriented Arch Context addressing Hi level machine Instruction set 64 bit logical addressing True complete I/D split, no chance for self modifying code joel Joel Lessenberry, Distributed Systems | +1 313 948 3342 joel@cfctech.UUCP | Chrysler Financial Corp. joel%cfctech.uucp@mailgw.cc.umich.edu | MIS, Technical Services {sharkey|mailrus}!cfctech!joel | 2777 Franklin, Sfld, MI
hsu@uicsrd.csrd.uiuc.edu (William Tsun-Yuk Hsu) (10/20/89)
In article <220@dg.dg.com> chris@dg.dg.com (Chris Moriondo) writes: > >The only really scalable interconnect schemes of which I am aware are >multistage interconnects which grow (N log N) as you linearly increase the >numbers of processors and memories... > >While message passing multicomputers maximize programmer effort in the sense >that they don't lend themselves to "dusty deck" programs, they have the >advantage that the interconnect costs scale linearly with the size machine. Ummm, message passing does not necessarily mean a single-stage interconnect. Also, most commercial message passing systems these days are hypercubes, and it's oversimplifying to claim that the cost of the hypercube interconnect scales linearly with system size. Remember that there are O(logN) ports per processor. Check out the paper by Abraham and Padmanabhan in the '86 International Conference on Parallel Processing, for another view on interconnect cost and performance comparisons. Most point-to-point parallel architectures whose interconnect cost really does grow only linearly with the system size (i.e. constant fan-out per processor) tend to be things like rings and meshes, which are less popular for more general purpose parallel computing. Are you referring to these rather than hypercubes? Bill Hsu
brooks@vette.llnl.gov (Eugene Brooks) (10/20/89)
In article <9078@batcomputer.tn.cornell.edu> kahn@tcgould.tn.cornell.edu writes: >The Cray-XMP is considerably slower than the YMP. The YMP is 30% faster than the XMP I was referring to. This is for scalar dominated compiled code and is a rather general result. Just in case you doubt my sources, I run codes on both a YMP 8/32 and an XMP 4/16 frequently enough to be a good judge of speed. >The single-processor XMP is no longer a supercomputer. Only if the difference between supercomputer and not is a 30% speed increase. I argue that a 30% speed increase is not significant, a frigging factor of 2 is not significant from my point of view. Both the XMP and the YMP are in the same class. Perhaps later YMPs will have more memory putting them in a slightly improved class. >Take a program requiring more than 128MBytes of memory (or 64 MBytes >for that matter (but I personally prefer more than 256M to exercise the >VM system a little!)) (i.e. a relatively BIG job, a *supercomputer* job) >and then compare any micro you want >or any other system you want with the YMP, or something in >that class, and then try it on a multiprocessor YMP. And please >STOP USING A SINGLE-PROCESSOR xmp AS THE DEFINITION OF A SUPERCOMPUTER, >thank you. I have no interest in single cpu micros with less than 128MB. I prefer 256 MB. I want enough main memory to hold my problems. >And it would be nice if people used "LIST PRICE" for "COMPLETE SYSTEMS" >when comparing prices. (LIST PRICE = PEAK PRICE !!) (COMPLETE SYSTEM = >with all needed software and a few GBytes of disk with a few controllers) I am talking list price for the system. A frigging XMP eating micro with suitable memory, about 64 meg at the minimum, can be had for 60K. The YMP costs about 3 million a node. The micro matches its performance for my applications. Which do you think I want to buy time on? Of course, I prefer a 3 million dollar parallel micro based system which has 50-100 nodes and runs circles around the YMP processor for my application. brooks@maddog.llnl.gov, brooks@maddog.uucp
brooks@vette.llnl.gov (Eugene Brooks) (10/20/89)
In article <MCCALPIN.89Oct18103933@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes: >I think that it is interesting that you expect the same users who >can't vectorize their codes on the current vector machines to be able >to figure out how to parallelize them on these scalable MIMD boxes. I can only point out specific examples which I have experience with. For certain Monte Carlo radiation transport codes, vectorization is a very painful experience which involves much code rewriting to obtain meager performance increases. I have direct experience with such a vectorization effort on a "new" and not dusty deck code. We got a factor of 2 as the upper bound for performance increases from vectorization on the XMP. The problem was all the operations performed under masks. LOTS of wasted cycles. The same problem, however, was easily coded in an EXPLICITLY PARALLEL language and obtained impressive speedups of 24 out of 30 processors on a Sequent Symmetry. It ran at 2.8 times XMP performance on hardware costing much less. We are moving on to a 126 processor BBN Butterfly-II now which should deliver more than 40 times the performance of the XMP at similar system cost. >It seems to me that the automatic parallelization problem is much >worse than the automatic vectorization problem, so I think a software >fix is unlikely.... Automatic vectorization is much easier than automatic parallelization in a global sense. This is why high quality vectorizing compilers exist, in addition to the high availability of hardware, and why automatic GLOBALLY parallelizing compilers don't. The problem with some codes is that they must be globally parallelized, and right now an explicitly parallel lingo is the way to get it done. >In fact, I think I can say it much more strongly than that: >Extrapolating from current experience with MIMD machines, I don't >think that the fraction of users that can use a scalable MIMD >architecture is likely to be big enough to support the economies of >scale required to compete with Cray and their vector machines. (At >least for the next 5 years or so). I do not agree; LLNL (a really big user of traditional supercomputers) has hatched the Massively Parallel Computing Initiative to achieve this goal on a broad application scale within 3 years. We will see what happens... >What is driving the flight from traditional supercomputers to >high-performance micros is turnaround time on scalar codes. From my >experience, if the code is really not vectorizable, then it is >probably not parallelizable either, and scalable machines won't scale. Not true; I have several counterexamples of highly parallel but scalar codes. >The people who can vectorize their codes are still getting 100:1 >improvements going to supercomputers --- my code is over 500 times >faster on an 8-cpu Cray Y/MP than on a 25 MHz R-3000/3010. So the >market for traditional supercomputers won't disappear, it will just be >more limited than many optimists have predicted. Yes, using all 8 cpus on the YMP and if each cpu is spending most of its time doing 2 vector reads, a multiply and an add, and one vector write, all chained up it will run circles around the current killer micros which are tuned for scalar performance. This situation will change in the next few years. brooks@maddog.llnl.gov, brooks@maddog.uucp
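A rough sketch of why work under a mask caps the payoff Brooks describes: a masked vector pipe streams every element through and discards the dead ones, so if the masked work dominates the cost of the tests, the useful speedup is roughly the mask density times the pipe's raw advantage. The numbers below are made-up illustrations, not figures from the Monte Carlo code under discussion.

    #include <stdio.h>

    int main(void)
    {
        double raw_vector_speedup = 10.0;  /* assumed pipe advantage over scalar */
        double density[] = { 0.5, 0.2, 0.1, 0.05 };
        int i;

        for (i = 0; i < 4; i++) {
            double f = density[i];                   /* fraction of elements live */
            double useful = f * raw_vector_speedup;  /* speedup on useful work    */
            printf("mask density %.2f: effective speedup %4.1f\n", f, useful);
        }
        return 0;
    }

Once the density drops near the reciprocal of the pipe's advantage, the vector version is no faster than scalar code that simply skips the dead elements, which is consistent with the modest factor-of-2 ceiling reported above.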
rod@venera.isi.edu (Rodney Doyle Van Meter III) (10/20/89)
In article <490@ctycal.UUCP> ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes: > >Note that I'm not saying to give up on parallel processing; on the contrary >I believe that it is the only way to do certain activities. I am saying >that the notion of a general purpose massively parallel architecture that >efficiently executes all kinds of algorithms is probably a naive and >simplistic view of the world. Depends on how you classify "all" algorithms. Nary a machine ever made is good at every algorithm ever invented. I suspect fine-grain SIMD machines are the way to go for a broader class of algorithms than we currently suspect. Cellular automata, fluid flow, computer vision, certain types of image processing and computer graphics have all shown themselves to be amenable to running on a Connection Machine. I'm sure the list will continue to grow. In fact Dow Jones himself now owns two; anybody know what he's doing with them? Peak performance for a CM-2, fully decked out, is on the order of 10 Gflops. This is with 64K 1-bit processors and 2K Weitek FP chips. The individual processors are actually pretty slow, 10-100Kips, I think. Imagine what this baby'd be like if they were actually fast! Their Datavault only has something like 30MB/sec transfer rate, which seems pretty poor for that many disks with that much potential bandwidth. Rumors of a CM-3 abound. More memory (1 Mbit/processor?), more processors (I think the addressing for processors is already in the neighborhood of 32 bits), more independent actions perhaps going as far as local loops, etc. I was told by a guy from Thinking Machines that they get two basic questions when describing the machine: 1) Why so many processors? 2) Why so few processors? Answering the second one is easy: It was the most they could manage. Answering the first one is harder, because the people who ask tend not to grasp the concept at all. What do I think? I think the next ten years are going to be very interesting! --Rod
brooks@vette.llnl.gov (Eugene Brooks) (10/20/89)
In article <220@dg.dg.com> chris@dg.dg.com (Chris Moriondo) writes: >The only really scalable interconnect schemes of which I am aware are >multistage interconnects which grow (N log N) as you linearly increase the >numbers of processors and memories. So in the limit the machine is essentially >ALL INTERCONNECT NETWORK, which obviously costs more than the processors and >memories. (Maybe this is what SUN means when they say "The Network IS the >computer"? :-) How do you build a shared-memory multi where the cost of the >interconnect scales linearly? Obviously I am discounting busses, which don't >scale well past very small numbers of processors. The cost of the interconnect can't be made to scale linearly. You can only get a log N scaling per processor. The key is the base of the log and not having N too large, i.e. using a KILLER MICRO and not a pipsqueak. Eight by eight switch nodes are practical at this point, with four by four being absolutely easy. Pin count is the main problem, not silicon area. Assuming 8x8 nodes, a 512 node system takes three stages, a 4096 node system takes 4 stages. Are 4 switch chips cheaper, or equivalent in cost to a killer micro and 32 meg of memory? Sun's "The network is the computer" is meant for ethernet types of things but it really does apply to multiprocessors. If you don't have real good communication capability between the computing nodes what you can do with the machine is limited. Could anyone handle a KILLER MICRO powered system with 4096 nodes? Just think, 4096 times the power of a YMP for scalar but MIMD parallel codes. ~400 times the power of a YMP cpu for vectorized and MIMD parallel codes. It boggles the mind. brooks@maddog.llnl.gov, brooks@maddog.uucp
brooks@vette.llnl.gov (Eugene Brooks) (10/20/89)
>Assuming 8x8 nodes, a 512 node system takes three stages, a 4096 node >system takes 4 stages. Are 4 switch chips cheaper, or equivalent in >cost to a killer micro and 32 meg of memory? Oops! It should be: are 4 switch chips cheaper than 8 killer micros and 256 Meg of memory? The switch is 4 stages deep, but there are 8 micros hung on each switch port. The bottom line is that the switch is probably not more than half the cost of the machine, even given the fact that it is not a commodity part. Of course, a good design for the switch chip and node interface might become a commodity part! Depending on the cache hit rates one might hang more than one micro on each node and further amortize the cost of the switch. brooks@maddog.llnl.gov, brooks@maddog.uucp
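The stage and chip counts being corrected here follow from simple counting: with k-by-k switch chips, an N-processor multistage network needs log-base-k-of-N stages of N/k chips each. A sketch that reproduces the 8x8 cases in the two postings (this is only the counting argument, not a costed design):

    #include <stdio.h>

    int main(void)
    {
        int k = 8;                        /* 8x8 switch chips        */
        int sizes[] = { 512, 4096 };
        int i;

        for (i = 0; i < 2; i++) {
            int n = sizes[i];
            int stages = 0, span = 1, chips;

            while (span < n) {            /* stages = ceil(log_k(n)) */
                span *= k;
                stages++;
            }
            chips = stages * (n / k);     /* n/k chips in each stage */
            printf("%4d nodes: %d stages, %4d switch chips (%d per %d micros)\n",
                   n, stages, chips, chips * k / n, k);
        }
        return 0;
    }

For 512 nodes this gives 3 stages and 3 switch chips per group of 8 micros; for 4096 nodes, 4 stages and the 4-chips-per-8-micros ratio in the correction above.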
khb%chiba@Sun.COM (Keith Bierman - SPD Advanced Languages) (10/20/89)
In article <33870@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes: >I agree with you here. In fact, did anyone notice a recent newspaper article >(In Tuesday's Merc. News - from Knight Ridder:) > >"Control Data to use Mips design" > >"Control Data Corp. has cast its lot with Mips Computer Systems, inc. to design >the brains of its future computers, choosing a new computer architecture >developed by the Sunnyvale Company." CDC has been selling the MIPS based SGI workstation under its label for a while now ... so this is either total non-news ... or CDC has simply decided to cut SGI out of the picture. When I had a chance to play with the CDC labeled SGI box I couldn't find _any_ differences from the SGI equivalent (except that the SGI had a newer software release and different power up message). Keith H. Bierman |*My thoughts are my own. !! kbierman@sun.com It's Not My Fault | MTS --Only my work belongs to Sun* I Voted for Bill & | Advanced Languages/Floating Point Group Opus | "When the going gets Weird .. the Weird turn PRO" "There is NO defense against the attack of the KILLER MICROS!" Eugene Brooks
rodger@chorus.fr (Rodger Lea) (10/20/89)
From article <17045@cfctech.UUCP>, by joel@cfctech.UUCP (Joel Lessenberry): > It is IBM's most advanced system.. > > Single level storage ^^^^^ At last !! > Object Oriented Arch What exactly do you/they mean by object oriented. Are we talking something along the lines of the intel approach ? I would be interested in details - anybody in the know ? Rodge rodger@chorus.fr
munck@chance.uucp (Robert Munck) (10/20/89)
In article <1259@crdos1.crd.ge.COM> davidsen@crdos1.UUCP (bill davidsen) writes: > > The Intel 80386 has 32 bit segments, but its still a segmented system, >and the virtual address space is (I believe) 40 bits. You're both too high and too low. The 386 supports 16,384 segments of up to 4GB, 14 bits plus 32 bits => 46 bit addresses. HOWEVER, the segments map into either real memory (page translation disabled), maximum 4GB, or linear virtual memory (paging enabled), also maximum 4GB. Virtual addresses are 46 bits and the virtual address space is 4GB. I think it's cute. -- Bob <Munck@MITRE.ORG>, linus!munck.UUCP -- MS Z676, MITRE Corporation, McLean, VA 22120 -- 703/883-6688
hascall@atanasoff.cs.iastate.edu (John Hascall) (10/20/89)
In article <3394@chorus.fr> rodger@chorus.fr (Rodger Lea) writes: }From article <17045@cfctech.UUCP>, by joel@cfctech.UUCP (Joel Lessenberry): }> It is IBM's most advanced system.. }> Object Oriented Arch } What exactly do you/they mean by object oriented. Are we }talking something along the lines of the intel approach ? The AS/400 architecture makes the VAX architecture look like RISC--it is *so* CISC!! As I understand it, there are 2 levels of microcode. Your instruction (I was told one instruction was "create database") executes the top level of microcode which in turn executes the bottom level of microcode which in turn actually causes the hardware to do something. Most unusual. John
bga@odeon.ahse.cdc.com (Bruce Albrecht) (10/21/89)
In article <126561@sun.Eng.Sun.COM>, khb%chiba@Sun.COM (Keith Bierman - SPD Advanced Languages) writes: > CDC has been selling the MIPS based SGI workstation under its label > for a while now ... so this is either total non-news ... or CDC has > simply decided to cut SGI out of the picture. As far as I know, CDC will still be selling SGI workstations. CDC will be working with Mips directly to develop high-performance versions of the Mips architecture.
chris@dg.dg.com (Chris Moriondo) (10/21/89)
In article <1989Oct19.172050.20818@ux1.cso.uiuc.edu> hsu@uicsrd.csrd.uiuc.edu (William Tsun-Yuk Hsu) writes: >In article <220@dg.dg.com> chris@dg.dg.com (Chris Moriondo) writes: >> >>The only really scalable interconnect schemes of which I am aware are >>multistage interconnects which grow (N log N) as you linearly increase the >>numbers of processors and memories... >> >>While message passing multicomputers maximize programmer effort in the sense >>that they don't lend themselves to "dusty deck" programs, they have the >>advantage that the interconnect costs scale linearly with the size machine. > >Ummm, message passing does not necessarily mean a single-stage >interconnect. Also, most commercial message passing systems these >days are hypercubes... Too right. I confess I was thinking more along the lines of the current crop of fine-grained mesh-connected message-passing multicomputers that are being worked on at CALTECH (Mosaic) and MIT (the Jelly-bean machine and the Apiary.) At least with machines of this ilk you only pay message latency proportional to how far you are communicating, rather than paying on every (global) memory reference with the shared-memory approach. Some of the hot-spot contention results indicate that the cost of accessing memory as seen by a processor might bear little relationship to its own referencing behavior. >...and it's oversimplifying to claim that the >cost of the hypercube interconnect scales linearly with system size. >Remember that there are O(logN) ports per processor. With hypercubes, what concerns me more than the scaling of the number of ports is the scaling of the length of the longest wires, and the scaling of the number of wires across the midpoint of the machine. (Unless of course you can figure out a way to wire your hypercube in hyperspace... :-)
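The wires-across-the-midpoint worry can be made concrete with a counting sketch: a binary hypercube's bisection grows linearly with the number of nodes, while a square 2-D mesh's grows only as the square root, which is one reason the wiring of large hypercubes is the concern here. Pure counting, no technology assumptions:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int n;

        for (n = 64; n <= 4096; n *= 4) {
            double cube_cut = n / 2.0;          /* binary n-cube bisection   */
            double mesh_cut = sqrt((double)n);  /* square 2-D mesh bisection */
            printf("%5d nodes: hypercube %6.0f links across the cut, 2-D mesh %3.0f\n",
                   n, cube_cut, mesh_cut);
        }
        return 0;
    }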
vorbrueg@bufo.usc.edu (Jan Vorbrueggen) (10/21/89)
In article <10200@venera.isi.edu> rod@venera.isi.edu.UUCP (Rodney Doyle Van Meter III) writes: >In article <490@ctycal.UUCP> ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes: >> ... I am saying >>that the notion of a general purpose massively parallel architecture that >>efficiently executes all kinds of algorithms is probably a naive and >>simplistic view of the world. >Depends on how you classify "all" algorithms. Nary a machine ever made >is good at every algorithm ever invented. I learned in school that it is hard to write a good numerical algorithm (e.g., to solve differential equations), but fairly easy to find an example that makes it fall flat. Maybe the same applies to building computers :-) Rolf
seanf@sco.COM (Sean Fagan) (10/21/89)
In article <9078@batcomputer.tn.cornell.edu> kahn@tcgould.tn.cornell.edu writes: >The Cray-XMP is considerably slower than the YMP. >The single-processor XMP is no-longer a supercomputer. >Take a program requiring more than 128MBytes of memory (or 64 MBytes >for that matter (but I personally prefer more than 256M to excerice the >VM system alittle!)) (i.e. a relatively BIG job, a *supercomputer* job) What?! Uhm, exercising the Cray's VM system is definitely going to be an interesting job -- Seymour doesn't *believe* in VM! (Well, anecdote has it that he doesn't *understand* it 8-).) I have mixed feelings about VM (as anybody who's seen more than three of my postings probably realizes 8-)): on one hand, yes, getting page faults will tend to slow things down. However, the system can be designed, from a software point of view, in such a way that page faults will be kept to a minimum. Also, having about 4 Gbytes of real memory tends to help. And, face it, swapping programs in and out of memory can be a time-consuming process, even on a Cray -- if you're dealing with 100+ Mword programs! Other supercomputers have VM, of course. However, I have never gotten the chance to play on, say, an ETA-10 to compare it to a Cray (I asked someone, once, at FSU for an account, and I was turned down 8-)). My personal opinion is that the machine is not as fast for quite a number of applications, but having the VM might help it beat a Cray in Real-World(tm) situations. Anybody got any data on that? And, remember: memory is like an orgasm: it's better when it's real (paraphrasing Seymour). 8-) -- Sean Eric Fagan | "Time has little to do with infinity and jelly donuts." seanf@sco.COM | -- Thomas Magnum (Tom Selleck), _Magnum, P.I._ (408) 458-1422 | Any opinions expressed are my own, not my employers'.
stein@dhw68k.cts.com (Rick Stein) (10/22/89)
In article <220@dg.dg.com> chris@dg.dg.com (Chris Moriondo) writes: >In article <35977@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >While message passing multicomputers maximize programmer effort in the sense >that they don't lend themselves to "dusty deck" programs, they have the >advantage that the interconnect costs scale linearly with the size machine. Indeed, the "dusty deck" (aka toxic waste dump) is generally not organized to exploit the linear scalable potential of the multicomputer. To my knowledge, no university in the U.S. teaches how to create linear scalable software, the cornerstone of multicomputers. Until the shared-memory s/w engineering styles are abandoned, no real progress in multicomputing can begin (at least in this country). Europe and Japan are pressing on without (despite us). -- Richard M. Stein (aka, Rick 'Transputer' Stein) Sole proprietor of Rick's Software Toxic Waste Dump and Kitty Litter Co. "You build 'em, we bury 'em." uucp: ...{spsd, zardoz, felix}!dhw68k!stein
pcg@emerald.cs.aber.ac.uk (Piercarlo Grandi) (10/23/89)
In article <17045@cfctech.UUCP> joel@cfctech.UUCP (Joel Lessenberry) writes:
> is anyone else out there interested in starting an AS/400 thread?
> It is IBM's most advanced system..
> Single level storage
> Object Oriented Arch
> Context addressing
> Hi level machine Instruction set
> 64 bit logical addressing
> True complete I/D split, no chance for self modifying code
Rumours exist that the AS/400 (nee S/38) is the result of putting
Peter Bishop's dissertation (a landmark work) "Very large address
spaces and garbage collection", MIT TR 107, in the hands of the
same team that had designed the System/3 (arrgghh!). IMNHO the
S/38 is a poor implementation of a great design. That it still is
good is more a tribute to the great design than to the
implementation skills of the System/3 "architects".
--
Piercarlo "Peter" Grandi | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
brooks@vette.llnl.gov (Eugene Brooks) (10/23/89)
In article <27203@dhw68k.cts.com> stein@dhw68k.cts.com (Rick Stein) writes a followup to something attributed to me, but 180 degrees out of phase with my opinion on the great shared memory vs message passing debate: >Indeed, the "dusty deck" (aka toxic waste dump) is generally not organized >to exploit the linear scalable potential of the multicomputer. To my >knowledge, no university in the U.S. teaches how to create linear scalable >software, the cornerstone of multicomputers. Until the shared-memory >s/w engineering styles are abandonded, no real progress in multicomputing >can begin (at least in this country). Europe and Japan are pressing on >without (despite us).> The posting he quoted here was incorrectly attributed to me. It was in fact someone's retort to something I wrote. Scalable shared memory machines, which provide coherent caches (local memory where shared memory is used as such), are buildable, usable, and cost effective. Some students and professors at Caltech, which included someone by the name of Brooks before his rebirth into the "real" world of computational physics, were so desperate for computer cycles that they sidetracked the parallel computer industry by hooking up a bunch of Intel 8086-8087 powered boxes together in a system with miserable communication performance. Industry, in its infinite wisdom, followed their lead by providing machines with even poorer communication performance. When you quote, please be sure to get the right author when it is from a message with several levels of quoting. I had something to do with the message passing hypermania, but it is not my party line these days.... brooks@maddog.llnl.gov, brooks@maddog.uucp
carr@mfci.UUCP (George R Carr Jr) (10/23/89)
In article <MCCALPIN.89Oct16141656@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes: > .... [software for] >parallel supercomputers is depressingly immature. I think traditional >moderately parallel machines (e.g. Cray Y/MP-8) will be able to handle >existing scientific workloads better than 1000-processor parallel >machines for quite some time.... I know of several problem domains where I strongly disagree. More than one aerospace company is currently looking at 1000+ node parallel machines because no Cray, ETA, NEC, or other 'conventional' machine can give them the time to solution required. The major area of excitement with parallel machines is finding the problems for which algorithms exist that become computable on them but are not computable otherwise. George R Carr Jr internet: carr@multiflow.com Multiflow Computer, Inc. uucp: uunet!mfci!mfci-la!carr 16360 Roscoe Blvd., Suite 215 fax: (818)891-0395 Van Nuys, CA 91406 voice: (818)892-7172
kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/23/89)
In article <36232@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >The YMP is 30% faster than the XMP I was referring to. This is >for scalar dominated compiled code and is a rather general result. If you have scalar dominated code that fits in a workstation's memory and you don't want to run more than one job at a time, then you are right. I am sure other users of the YMP will be happy to keep the machine busy and get good 64-bit megaflops. >>The single-processor XMP is no longer a supercomputer. >Only if the difference between supercomputer and not is a 30% speed increase. I have little desire to defend or promote a YMP, but you can't run a scalar code on a vector machine and complain, too! On the NASA benchmarks, which I am sure some of this audience has seen, the YMP sustained over 1 GFlops. THAT is significantly faster than a single processor XMP. REWRITE the code!! Or have someone do it for you (there was a company that would get your code to run at least twice as fast or your money back, I forget the name and don't know them or anyone who does). If they don't perform, throw away all the dusty decks. Refuse to use dusty-deck oriented code. But if that's all the algorithm can do for now, then yes, use whatever gives you the desired performance at the least life-time cost (not price!) >I have no interest in single cpu micros with less than 128MB. >I prefer 256 MB. I want enough main memory to hold my problems. A 256 MB micro can cost you some. And not so little. And all that for just one user. I am not sure the numbers come out. And how about I/O bandwidth and file size? Maybe your application doesn't need any. Talk to a Chemist. By the time micros become killers, they won't be micros anymore! >I am talking list price for the system. A frigging XMP eating micro with Yes. My comment about list-price was not directed at Eugene. Sorry. I meant to emphasize the importance of using peak-price to go with peak-performance (I have seen cases where the reported performance is on a high-end machine, but the reported price is not!).
grunwald@Tokyo.ira.uka.de (Grunwald Betr. Tichy) (10/23/89)
I have followed the articles for some time and want to mention some points. 1. Hardware costs are only a fraction of the cost. To do really big problems you need lots of support software and you rely on it. So if you use a PC you will have to write more code (or buy specialized code at a high price) and trust your version. This is hard, because numerical mathematics is not as easy as it seems, and if your aircraft comes down or your bridge cracks, it's too late to blame yourself. 2. Parallel computers will need a Pascal (C, Modula, Ada, ..) like language which can be compiled and run on a scalable architecture. Nobody wants to rewrite all programs when he gets more processors. It would be even better to have it scale at runtime, so the program runs faster if no other users want the processors too. I know only the Connection Machine doing that, and this machine is not as general purpose as a workstation. (What OS does the CM have? What languages? Can you compile a CM program to work on other computers? (not simulated)) 3. Some problems are just too big for a PC. Even if you have a more sophisticated system than the normal Primitive Computer, there are a lot of problems which have already been scaled down to run on supercomputers. So further downscaling is not possible without a substantial loss of accuracy. (Accuracy is not only the length of a floating point number. It's how many points your grids can have. What differential equations are possible? What about error control? (It's useless getting wrong results faster. You have to know about the error range.)) My opinion is that supercomputers will exist a long time in the future and MICROS still have a long way to go to match the performance. Most people comparing the power don't think of the background of the numbercrunchers, and that is lots of software packages and big disks to record the results, which is a big part of the machine's cost. Don't get me wrong: I'm a Micro User (OS9-680x0) and I like it, but I know that things are not so easy in the supercomputing area as some people might think. Knut Grunwald, Raiffeisenstr. 8, 7555 Elchesheim-Illingen, West-Germany
brooks@vette.llnl.gov (Eugene Brooks) (10/23/89)
In article <9119@batcomputer.tn.cornell.edu> kahn@batcomputer.tn.cornell.edu (Shahin Kahn) writes: >If you have scalar dominated code that fits in a workstation's memory One should not attempt to infer that a workstation's memory is small. A YMP 8/32 has 4 megawords (32 MB) available per processor. If all you want is 32 MB per processor you can buy this with a killer micro for about 40K, simply throw it away in a year when its performance has been eclipsed by the next killer micro, and still have your computer time work out to be about 5 dollars an hour. They have the gall to charge $250 an hour for Cray YMP time, for low priority time at that. >THAT is significantly faster than a single processor XMP. > >REWRITE the code!! Or have someone do it for you (there was a company >that would get your code to run at least twice as fast or your money back, >I forget the name and don't know them or anyone who does). We did! And we showed that you could asymptotically get the factor of 2 you suggest with infinite work. Why suggest doing such a thing when one can get a factor of 100 with little work on 100 killer micros? >If they don't perform, >throw away all the dusty decks. Refuse to use dusty-deck oriented code. This was not a dusty deck. This code was written in the last couple of years with modern tooling, for both vectorized and MIMD parallel machines. It is not the code which is scalar, it is the algorithm. One could say toss out the algorithm, but it is one of the most robust ones available for the application in question. >A 256 MB micro can cost you some. And not so little. But it is much cheaper than a SUPERCOMPUTER for my application, and it is FASTER. To bring back the car analogy, the accelerator is still pressed to the metal for speed improvements in killer micros. brooks@maddog.llnl.gov, brooks@maddog.uucp
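The five-dollars-an-hour remark is just an amortization; a sketch using the prices quoted in the posting, with the machine written off after one year of round-the-clock use:

    #include <stdio.h>

    int main(void)
    {
        double micro_price    = 40000.0;     /* killer micro with 32 MB, as quoted */
        double hours_per_year = 24.0 * 365.0;
        double cray_rate      = 250.0;       /* $/hour, low-priority YMP time      */

        double micro_rate = micro_price / hours_per_year;

        printf("micro works out to $%.2f per hour\n", micro_rate);
        printf("quoted YMP rate is about %.0f times that\n",
               cray_rate / micro_rate);
        return 0;
    }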
henry@utzoo.uucp (Henry Spencer) (10/23/89)
In article <74731@linus.UUCP> munck@chance.UUCP (Robert Munck) writes: >... The 386 supports 16,384 segments of up >to 4GB, 14 bits plus 32 bits => 46 bit addresses... Except that it's not a 46-bit address space, it's a bunch of 32-bit ones. There is a difference. As witness the horrors that are perpetrated on 8086/88/186/286 machines to try to cover up their lack of a unified address space. "Near" and "far" pointers, anyone? -- A bit of tolerance is worth a | Henry Spencer at U of Toronto Zoology megabyte of flaming. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
rang@cs.wisc.edu (Anton Rang) (10/23/89)
In article <36593@lll-winken.LLNL.GOV> brooks@vette.llnl.gov (Eugene Brooks) writes: > They have the gall to charge $250 an >hour for Cray YMP time, for low priority time at that. You haven't got much reason to complain...out here I have the privilege of spending $300/hour for VAX-11/785 time... :-) Schools can be SO much fun.... Anton +----------------------------------+------------------+ | Anton Rang (grad student) | rang@cs.wisc.edu | | University of Wisconsin--Madison | | +----------------------------------+------------------+
henry@utzoo.uucp (Henry Spencer) (10/23/89)
In article <27203@dhw68k.cts.com> stein@dhw68k.cts.com (Rick Stein) writes: >...no university in the U.S. teaches how to create linear scalable >software, the cornerstone of multicomputers. Until the shared-memory >s/w engineering styles are abandonded, no real progress in multicomputing >can begin (at least in this country). Europe and Japan are pressing on >without (despite us).> What remains to be seen is whether they are pressing on up a blind alley. Remember where this discussion thread started out: the mainstream of high-volume development has vast resources compared to the more obscure byways. Results from those byways have to be awfully damned good if they are going to be competitive except in ultra-specialized niches. As I've mentioned in another context, "gonna have to change our whole way of thinking to go parallel real soon, because serial's about to run out of steam" has been gospel for quite a while now... but the difficulty of that conversion has justified an awful lot of highly successful work on speeding up non-parallel computing. Work which is still going and still succeeding. I'm neutral on the nationalism -- you're all foreigners to me :-) -- but highly skeptical on the parallelism. -- A bit of tolerance is worth a | Henry Spencer at U of Toronto Zoology megabyte of flaming. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
wen-king@cit-vax.Caltech.Edu (King Su) (10/24/89)
In article <36549@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >...................................... Some students and professors at <Caltech, which included someone by the name of Brooks before his rebirth into >the "real" world of computational physics, were so desperate for computer <cycles that they sidetracked the parallel computer industry by hooking up >a bunch of Intel 8086-8087 powered boxes together in a system with miserable <communication performance. Industry, in its infinite wisdom, followed their >lead by providing machines with even poorer communication performance. Huh? As far as I know, every commercially available multicomputer that was built after our original multicomputer has better communication performance. We did not lead anybody into anything, as Caltech CS has never been a strong influence on the industry. Nor have we advocated low communication performance. Today's multicomputers are as much as three orders of magnitude better in message latency and throughput, thanks to worm-hole routing hardware. There will be further improvements when low-dimensional networks are in use. Perhaps we could have provided more positive influence on the industry, but we are operating under the guideline that university research groups should not be turned into joint-ventures. The taxpayers did not give us money for us to make more money for ourselves. -- /*------------------------------------------------------------------------*\ | Wen-King Su wen-king@vlsi.caltech.edu Caltech Corp of Cosmic Engineers | \*------------------------------------------------------------------------*/
gil@banyan.UUCP (Gil Pilz@Eng@Banyan) (10/27/89)
In article <12345@cit-vax.Caltech.Edu> wen-king@cit-vax.UUCP (Wen-King Su) writes: >Perhaps we could have provided more positive influence on the >industry, but we are operating under the guideline that university >research groups should not be turned into joint-ventures. The >taxpayers did not give us money for us to make more money for ourselves. Why not ? If you start off a (successful) joint venture won't you end up employing people ? People who will be paying taxes on the money they make as well as buying goods and services etc. It would seem that the "tax payers" would be much better off if a state-funded research group _were_ turned into a joint venture rather than let its research be used later by someone else outside of the taxpaying area (i.e. it's better for California if the research funded in California schools went to build businesses in California rather than, say Texas . . at a national level it all evens out, but locally it does make a difference . . this is why the whole Massachusetts "Miracle" schtick is such a joke . . a "Miracle" wow ! . . lots of schools & research ==> lots of start-up companies ==> a moderate number of successful companies ==> money coming in . . amazing ! education works !) -=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=- Gilbert W. Pilz Jr. gil@banyan.com Banyan Systems Inc. (617) 898-1196 -=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-
frazier@oahu.cs.ucla.edu (Greg Frazier) (10/27/89)
In article <562@banyan.UUCP> gil@banyan.com writes: >In article <12345@cit-vax.Caltech.Edu> wen-king@cit-vax.UUCP (Wen-King Su) writes: >>Perhaps we could have provided more positive influences to the >>industry, but we are operating under the guideline that university >>research groups should not be turned into joint-ventures. The taxpayers >>did not give us money for us to make more money for ourselves. > >Why not? If you start off a (successful) joint venture won't you >end up employing people? People who will be paying taxes on the money >they make as well as buying goods and services etc. It would seem >that the "tax payers" would be much better off if a state-funded >research group _were_ turned into a joint venture rather than to let [ etc about benefits of start-up and Mass miracle ] No, the problem is that the taxpayer outlay for the start-up is not compensated by the jobs "created" or the taxes received. I put "created" in quotes, because I do not believe that all of these jobs are created - the increased competition is going to cost somebody something, be it another startup which goes under, a major corp which loses some market share and lays people off, or the reduction of a substitutive market, such as typewriter manufacturers. But let's not get into social dynamics - even if the jobs are all "created", the taxpayer has footed a major bill, the rewards of which will, for the most part, end up in only a few people's pockets. To address the Mass miracle, it wasn't the concentration of major universities which brought it about, it was the concentration of defense contractors, which is why the carpet was pulled out from under Mass when the defense cutbacks went through (you will recall Mass has had deficits recently - no more miracle). Greg Frazier @@@@@@@@@@@@@@@@@@@@@@@@))))))))))))))))))##############3 "They thought to use and shame me but I win out by nature, because a true freak cannot be made. A true freak must be born." - Geek Love Greg Frazier frazier@CS.UCLA.EDU !{ucbvax,rutgers}!ucla-cs!frazier
kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/28/89)
In article <36593@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >>If you have scalar dominated code that fits in a workstation's memory >One should not attempt to infer that a workstation's memory is small. >An YMP 8/32 has 4 megawords (32 MB) available per processor. If all you >want is 32 MB per processor you can buy this with a killer micro for >about 40K, simply throw it away in a year when its performance has been >eclipsed by the next killer micro, and still have your computer time work Point well taken. One of the reasons I have little desire to defend or promote a YMP is precisely that. The YMP has 128 MWords of memory, by the way. This is for 8 processors. The Cray-2 has 256 MWords (but a terrible latency, even in the S model). (The Cray C90 is supposed to have 512 MW, and the Cray-4 1000+ MWords, but these are paper machines for now.) So, the point is that fp performance alone does not make a supercomputer anymore (surprise, surprise!). My feeling these days is that one needs a sophisticated VM system, with a large hierarchical memory system, and first-rate I/O and networking, so that one could run a single job very fast (the traditional domain of supers has been just this; they've been "benchmark machines" in my opinion) but you could also sit in a network and handle many users and many jobs (many = say, 37)! And on top of that, you need libraries, compilers, debuggers, editors, profilers, etc. The emergence of the powerful micro is welcome, indeed. And when they can be ganged up and you know how to program them and have an application that uses their strengths and does not exercise their weaknesses... indeed, they are fast. BUT: 1) they don't have the software, libraries, compilers, etc. 2) they often have a low bandwidth connection to a not-so-strong front-end 3) they can't handle I/O so well yet. 4) there are no standards for anything 5) you need a pretty large job to get speed-up, anyway. Remember, the way the guys at Sandia got their great speed-ups was to make the jobs larger. Much larger. You need over 99.99% parallelism for a 1000 processor machine! So the parallel part of your program should be allowed to grow (and fortunately, if the algorithm is parallelizable, the parallel parts tend to grow faster than the serial parts in many cases, if not most. Same thing with vectorizable parts if the algorithm is vectorizable). Except that it just so happens that when you have a large job, it also runs much faster on a super! (a modern super with lots of memory, that is, not the OLD definition of super.) My point is that there is no point in getting too excited about highly parallel machines, nor about fast microprocessors. A micro is not called a micro just because it has a microprocessor in it. Not anymore. It usually has a low bandwidth memory system, not much of a cache, not much of a VM system, not much I/O, etc. That's what keeps the price down. (Prices are going down for supers, too.) It's great to have a fast micro on your desk, but it'll have plenty to do rendering the data that you got from the super! and delivering mail, etc. And if you want to gang them up, you'll end up paying exactly as much as you would if you got a super, maybe more! This is how it will end up being. If you had all the software and all the I/O and all the disk and all the networking, etc., you'll have to pay for those! Hardware costs will not differ enough to bury it. (I am comparing a contemporary super with a contemporary high-end parallel system.
They could very well be the same thing in the 5 years that was specified. All supers are multiprocessors now and are increasing the number of processors. So that's another reason why you'll be paying exactly the same price, if not more!) Conclusion: Like a teacher said a long time ago, there is the law of conservation of difficulty!! Highly parallel systems will NOT be a revolutionary deal where you suddenly can do something much more cheaply. It has been evolutionary. Which is why I said: by the time micros become killers, they won't be micros anymore. Highly parallel systems are good. They have merit. They are here to stay, etc. Fast micros are also nice. But let's not sensationalize the issues. And by the way, most of the Japanese machines achieve their speed by multiple functional units: more than one adder and one multiplier. And a final note about "pagemaker": no insult was intended. PageMaker is a sophisticated application that requires all components of the machine, from the cpu to the screen to the printer to font calculations, etc. It would have been quite unimaginable to try to do something like that on a computer 30 years ago. I think it was clear what I meant.
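To put the "99.99% parallelism for 1000 processors" remark above on a slightly firmer footing, here is a minimal Amdahl's-law sketch in C. The serial fractions tried are illustrative assumptions, not profiles of any real code.

    /* Amdahl's-law sketch: speedup = 1 / (s + (1 - s)/P) for serial
     * fraction s on P processors.  Fractions below are illustrative. */
    #include <stdio.h>

    static double amdahl(double serial_frac, double nproc)
    {
        return 1.0 / (serial_frac + (1.0 - serial_frac) / nproc);
    }

    int main(void)
    {
        double fracs[] = { 0.01, 0.001, 0.0001 };  /* serial fraction */
        int i;

        for (i = 0; i < 3; i++)
            printf("serial %.4f%%: speedup on 1000 procs = %.0f\n",
                   fracs[i] * 100.0, amdahl(fracs[i], 1000.0));
        return 0;
    }

Even a 0.1% serial residue caps the speedup near 500; only at 0.01% serial (99.99% parallel) does a 1000-processor machine get close to its nominal speed, which is exactly why the Sandia runs had to grow the problem size.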
kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/28/89)
In article <1218@iraun1.ira.uka.de> grunwald@Tokyo.UUCP (Grunwald Betr. Tichy) writes: >2. Parallel computers will need a Pascal (C,Modula,Ada,..) like language, which ... >Knut Grunwald, Raiffeisenstr. 8, 7555 Elchesheim-Illingen, West-Germany I agree with you on the points that you made. But your choice of languages was unexpected. I don't want to start a language/religion debate, but I do want to ask what language you use for supercomputing applications in Germany. I am looking to see if there are trends in different countries.
mash@mips.COM (John Mashey) (10/31/89)
In article <428@propress.com> pan@propress.com (Philip A. Naecker) writes: >Case in point: The R2000 chipset implemented on the R/120 (mentioned by others >in this conversation) has, by all measures *excellent* scalar performance. One >would benchmark it at about 12-14 times a microVAX. However, in real-world, >doing-useful-work, not-just-simply-benchmarking situations, one finds that >actual performance (i.e., performance in very simple routines with very simple >algorithms doing simple floating point operations) is about 1/2 that expected. Please be a little more specific, as this is contrary to large numbers of people's experience with "doing-useful-work, not-just-simply-benchmarking" situations. Note: it is perfectly possible that one can encounter realistic programs for which the performance is half of what is expected, on some given class of benchmarks. Is the statement above: a) The M/120 is really a 6-7X microVAX machine OR b) We've run some programs in which it is found to be a 6-7X uVAX machine. Note that, as posted, this reads more like a) than b), so please say more. >Why? Because memory bandwidth is *not* as good on a R2000 as it is on other >machines, even machines with considerably "slower" processors. There are >several components to this, the most important being the cache implementation >on an R/120. Other implementations using the R2000/R3000/Rx000 chipsets might >well do much better, but only with considerable effort and cost, both of which >mean that those "better" implementations will begin to approach the price/ >performance of the "big" machines that you argue will be killed by the >price/performance of commodity microprocessors. The R2000 in an M/120 indeed has a very simple memory system. The rest of the comments seem overstated to me: we just announced a new machine (the RC3240), which is a CPU-board upgrade to an M/120, uses an R3000, gains another 40-50% performance from the same old memory boards, and costs the same as an M/120 did when it was announced. If it had been designed from scratch, it would be faster, with little or no increase in cost. PLEASE look at the data on the various machines built of such parts. The one-word refill of the R2000 certainly slowed it down; the multi-word refill & instruction-streaming on the R3000 certainly help improve the kinds of programs that hurt an R2000, and the cost differences are really pretty minimal. In addition, if you look at R3000s in larger system designs, I think it is hard to claim that these implementations are anywhere near the price/performance (anywhere near as high price, that is) as those of "big" machines, at least for CPU performance. > >I think you are to a degree correct, but one must always tailor such >generalities with a dose of real-world applications. I didn't, and I got bit >to the tune of a fine bottle of wine. :-( Anyway, we all agree on that: "your mileage may vary". How about posting something on the particular applications to generate some insight about what these things are good for or not? -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
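As a rough illustration of why the refill policy mentioned above matters, here is a back-of-envelope CPI sketch in C. The base CPI, miss rate, and miss penalties are assumed numbers chosen only to show the mechanism; they are not MIPS measurements of the M/120 or RC3240.

    /* Effect of cache refill policy on effective CPI.  All numbers
     * are invented for illustration, not measured data. */
    #include <stdio.h>

    int main(void)
    {
        double base_cpi  = 1.0;   /* CPI with a perfect cache (assumed)      */
        double miss_rate = 0.05;  /* misses per instruction (assumed)        */
        double one_word  = 14.0;  /* cycles/miss, one-word refill (assumed)  */
        double block     = 8.0;   /* effective cycles/miss with multi-word
                                     refill + instruction streaming (assumed)*/

        printf("CPI, one-word refill: %.2f\n", base_cpi + miss_rate * one_word);
        printf("CPI, block refill:    %.2f\n", base_cpi + miss_rate * block);
        return 0;
    }

The point is only that shaving the per-miss penalty feeds straight into delivered performance without touching the CPU clock, which is consistent with the same memory boards gaining speed from a better refill scheme.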
mash@mips.COM (John Mashey) (10/31/89)
In article <9119@batcomputer.tn.cornell.edu> kahn@batcomputer.tn.cornell.edu (Shahin Kahn) writes: >If you have scalar dominated code that fits in a workstation's memory >and you dont want to run more than one job at a time, then you are right. Note: some of this discussion has seemed to assume that micro == workstation. To help unconfuse people, let us remember that the same CPU chip can be used in various different configurations, only some of which are workstations. Note that desktop workstations are unlikely to get enough memory to keep real supercomputer users happy, given the usual cost tradeoffs. This might not be true of big desksides, and is least of an issue for servers. >A 256 MB micro can cost you some. And not so little. And all that >for just one user. I am not sure the numbers come out. And how about IO >bandwidth and file-size. maybe your application doesnt need any. Again, note that the issue is not necessarily single-user workstations versus supercomputers, it's mixtures of desktops, desksides, and servers versus supercomputers. I.e., a whole lot of this discussion has seemed like a classic "domain-of-discourse" argument, in which the argument "A is true" gets heated replies of "No, it isn't", and should be converted to: In domain 1 (not very vectorizable), A is true. (micros are tough) But in domain 2 (elsewhere), A is not true. (micros are not so tough) This makes clear that the real argument is more like: How big are domains 1 & 2? Will that change? THUS: FOR SUPERCOMPUTER USERS: a) How much of your code is vectorizable? b) How much is parallelizable? c) How much mostly needs big memories? d) How much is dominated by turnaround time, cost-is-no-object processing? e) Do you have some more data points, i.e., SUPERCOMPUTER X versus microprocessor-based-system Y, including elapsed times & costs? In most of this discussion, we've gotten a few data points; more would help. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jesup@cbmvax.UUCP (Randell Jesup) (11/18/89)
In article <AGLEW.89Nov7130958@chant.urbana.mcd.mot.com> aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) writes: >What has happened is that better value is being provided, but also the >amount of money people are willing to spend on computing has gone up. >The system that bears the same relationship to the state of the art as >the 4,000$ PC did a few years ago now costs at least 10,000$. > >The inexpensive "home computer" has been slightly lost in these developments. >The Amiga, perhaps... but even the Amiga is running up the prices. Well, the Amiga 500 (the "home" machine) can be gotten for about the same price as the original C-64 (tape drive - disk drive pushes it up). The C-64 was introduced at $600 for the CPU unit alone. The Amiga 500 includes a disk drive as well. Then again, selling at good performance/price levels is Commodore's business. Of course, there's very little competition in that part of the market nowadays. Maybe we're ripe for another upswing (following the recent resurgence of video games) in the home computer market. Disclaimer: I work for Commodore-Amiga, Inc. -- Randell Jesup, Keeper of AmigaDos, Commodore Engineering. {uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com BIX: rjesup Common phrase heard at Amiga Devcon '89: "It's in there!"
nelson@m.cs.uiuc.edu (11/20/89)
> parallelism to continue to deliver more performance. If you project the > slope of the clock rates of supercomputers, you will see sub-nanosecond > CYCLE times before 1995. I don't see any technologies in the wings which > promise to allow this to continue... Actually, I don't see this (dare I say it) EVER occurring. Ignoring delay due to capacitance, a nanosecond is only 12 inches of wire -- and I'm reasonably sure that the "critical path" length is at least on the order of a foot (does anyone know?). Once capacitance delay comes into the picture (even on-chip there is a significant amount), even with new technologies, that 12 inches is reduced at least tenfold (opinion/guess). That leaves you with an inch of wiring for the critical path for this super technology -- that does not seem nearly enough to build a nano-processor around. Anyone else have Opinions? Facts? -- Taed.
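For reference, a few lines of C that work out the maximum signal path per cycle behind the "12 inches per nanosecond" figure. The speed of light is exact; the 0.5 velocity factor for a real wire or board trace is an assumption, not a measured value.

    /* Maximum signal-propagation distance per clock cycle.
     * Velocity factor of 0.5 is an assumed, typical-ish value. */
    #include <stdio.h>

    int main(void)
    {
        double c  = 2.998e8;   /* speed of light in free space, m/s */
        double vf = 0.5;       /* assumed velocity factor on a trace */
        double cycles[] = { 1e-9, 100e-12, 10e-12 };  /* 1 ns, 100 ps, 10 ps */
        int i;

        for (i = 0; i < 3; i++) {
            double d = c * vf * cycles[i];
            printf("cycle %6.0f ps -> max signal path ~ %6.1f mm (%5.2f in)\n",
                   cycles[i] * 1e12, d * 1000.0, d / 0.0254);
        }
        return 0;
    }

With that assumed velocity factor, a 1 ns cycle allows roughly 6 inches of round-trip-free path and a 100 ps cycle only about half an inch, which is the scale of the problem being argued about here.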
jonah@db.toronto.edu (Jeffrey Lee) (11/20/89)
nelson@m.cs.uiuc.edu writes: >> parallelism to continue to deliver more performance. If you project the >> slope of the clock rates of supercomputers, you will see sub-nanosecond >> CYCLE times before 1995. I don't see any technologies in the wings which >> promise to allow this to continue... >Actually, I don't see this (dare I say it) EVER occurring. NEVER say "never." :-) > Ignoring > delay due to capacitance, a nanosecond is only 12 inches of wire -- > and I'm reasonably sure that the "critical path" length is at least > on the order of a foot (does anyone know?). Once capacitance delay > comes into the picture (even on-chip there is a significant amount), > even with new technologies, that 12 inches is reduced at least > tenfold (opinion/guess). That leaves you with an inch of wiring > for the critical path for this super technology -- that does not > seem nearly enough to build a nano-processor around. Hierarchy and locality are wonderful for dodging these sorts of problems. Put a large register set, simple ALU, and tiny instruction cache onto a single GaAs or ECL (or whatever) chip. Assume a 4-level memory where the first three levels have a .8 hit rate and a 5-fold slowdown to the next level, which is 64 times larger:

  level   access   hit    Ehit(ns)   size
    1       1ns    .8       1.0      256B
    2       5ns    .16      1.6      16KB
    3      25ns    .032     2.4      1MB
    4     125ns    .008     3.4      64MB+    [294 Mword/s ==> 150 MIPS]

Now, 5ns gives you just enough time to get off the chip to a close neighbour cache chip, 25ns gives you enough time to get elsewhere on the board, and 125ns is enough time to go to the bus. Each critical path gets slightly longer and slightly slower. Each level can be made from a slower and cheaper technology. With a hit rate of .8, the effective access time is 3.4 ns/word, or about 294 Mwords/s, which should put you in the 150 MIPS range with RISC technology. [The ratio of 2W ==> 1 MIPS assumes that each operation (on average) uses one instruction and one data word. The SPARC seems to have a MIPS rating of about 1/2 its MHz.] Ok, so the numbers are all out of a hat. Let's try some different hats:

  level   access   hit    Ehit(ns)   size
    1       1ns    .7       1.0      256B
    2       5ns    .21      1.75     16KB
    3      25ns    .063     3.33     1MB
    4     125ns    .027     6.7      64MB+    [149 Mword/s ==> 75 MIPS]

  level   access   hit    Ehit(ns)   size
    1       1ns    .9       1.0      256B
    2       5ns    .09      1.35     16KB
    3      25ns    .009     1.58     1MB
    4     125ns    .001     1.7      64MB+    [588 Mword/s ==> 300 MIPS]

I'm more inclined to believe the values of .8 or .9 for locality given the 64x expansion at each level. I've no facts though. Is a 5ns single-chip 16KB cache possible, now or in 5 years? What about a 25ns multi-chip 1MB cache? What is the normal hit rate for a 16KB cache? Comments?
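The effective-access arithmetic in the first table can be checked with a few lines of C; the hit rates and access times below are the assumed values from that table, not measured data.

    /* Reproduces the effective-access arithmetic of the first table.
     * Hit rates and access times are the post's assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double access[4] = { 1.0, 5.0, 25.0, 125.0 };   /* ns per level      */
        double hit[4]    = { 0.8, 0.16, 0.032, 0.008 }; /* fraction per level*/
        double eff = 0.0;
        int i;

        for (i = 0; i < 4; i++)
            eff += hit[i] * access[i];

        printf("effective access: %.1f ns/word\n", eff);
        printf("throughput:       %.0f Mwords/s\n", 1000.0 / eff);
        printf("MIPS at 2 words/instruction: %.0f\n", 1000.0 / eff / 2.0);
        return 0;
    }

This gives 3.4 ns/word, about 294 Mwords/s, and roughly 147 MIPS, matching the bracketed figures in the table.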
mcdonald@uxe.cso.uiuc.edu (11/21/89)
> parallelism to continue to deliver more performance. If you project the > slope of the clock rates of supercomputers, you will see sub-nanosecond > CYCLE times before 1995. I don't see any technologies in the wings which > promise to allow this to continue... >Anyone else have Opinions? Facts? No, of course not. How fast could the speed demon people do a PDP-8 on a chip right now - just the CPU and 4k words of memory plus a couple of serial lines - let's say 200 megabaud? Has anyone else out there looked at the schematic of a PDP-8? It's pretty RISC. Doug McDonald
paul@taniwha.UUCP (Paul Campbell) (11/22/89)
In article <46500087@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
-No, of course not. How fast could the speed demon people do a
-PDP8 on a chip right now - just the CPU and 4k words of memory plus
-a couple of serial lines - lets say 200 megabaud? HAs anyone else
-out there looked at the schematic of a PDP-8? Its pretty RISC.
Sounds like a good student MOSIS project :-)
Paul
--
Paul Campbell UUCP: ..!mtxinu!taniwha!paul AppleLink: CAMPBELL.P
"### Error 352 Too many errors on one line (make fewer)" - Apple Computer
"We got a thousand points of light for the homeless man,
Got a kinder, gentler, machine gun hand ..." - Neil Young 'Freedom'
nelson@m.cs.uiuc.edu (11/22/89)
> Actually, I don't see this (dare I say it) EVER occurring. Ignoring ... > tenfold (opinion/guess). That leaves you with an inch of wiring > for the critical path for this super technology -- that does not > seem nearly enough to build a nano-processor around. Well, I was thinking only in terms of reasonable today-type technology ideas. I came up with something of a lower bound. I assumed that the smallest transistor is an angstrom in length. Then I used some guesses as to what a processor has to contain, etc. As it all comes down, it seems that our lower bound is on the order of 10 picoseconds for a cycle time in a processor. Other parts would obviously have a lower lower bound. Now you may say that there is no way that a transistor or transistor-work-alike can be built that small... Maybe so, but it is (?) a lower bound. -- Taed.
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (11/22/89)
In article <3300084@m.cs.uiuc.edu> nelson@m.cs.uiuc.edu writes: | As it all comes down, it seems that our lower bound is on the order | of 10 picoseconds for a cycle time in a processor. Other parts | would obviously have a lower lower bound. I think your lower bound is too high. I believe that _Electronics News_ had an article about a 200GHz counter. My subscription lapsed two years ago. There is a lower limit, because you have to make things smaller (as you said), and when the diameter of a conductor becomes small enough it becomes an exercise in probability to see if an electron put in one end comes out the other. An article a few years ago claimed that this occurs at about 17 orders of magnitude smaller and faster than a Cray-2. Warning: The only thing I'm sure is true is that the article said so ;-) -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "The world is filled with fools. They blindly follow their so-called 'reason' in the face of the church and common sense. Any fool can see that the world is flat!" - anon