[comp.arch] Don't look back

mash@mips.COM (John Mashey) (02/20/89)

In article <4290@pt.cs.cmu.edu> shivers@centro.soar.cs.cmu.edu (Olin Shivers) writes:
>Andy Glew mentioned a barrel processor discussion, and says that
>mash@mips posted a good argument against them (barrel processors, not
>discussions). I would very much like to see that argument.

I don't have a copy of the original, but the argument follows two
inter-related areas: technical issues and business issues, and
can be summarized as follows:
	1) For technical reasons, it's more complicated to build VLSI
	micros as barrels.
	2) Cheap general-purpose chips tend to dominate special-purpose
	solutions, unless the special-purpose ones have substantial
	long-term cost or performance advantages.

Good background material can be found in:
	Bell, Mudge, McNamara, "COMPUTER ENGINEERING", a DEC View of
	Hardware Systems Design, 1978, Digital Press.
Specifically, read Chapter 1 "Seven Views of Computer Systems",
especially Views 3 and 4, and in particular Figure 7 on levels of integration.

Following is a (brief) technical argument, followed by a (long) business
argument that addresses a bunch of related issues that people have asked about.
Also, sorry if I step on anybody's toes; maybe this will stir up some
discussion.

1) Technical:
	The first-order determinant of CPU performance, for general purpose
	machines, is the aggregate bandwidth into the CPU, with about
	1 VUP ==(approx) 10 MB/sec [try this rule-of-thumb and see].
	Take the same technology and cache memory.  You can either build
	an N-way barrel processor, where each barrel slot generates B VUPs,
	or you can build 1 CPU that generates about N*B VUPs, because
	the basic hardware is running at the same speed.  The single CPU
	has to fight with latency issues that are avoided by the barrel,
	but the barrel:
	-wastes whole slots whenever there are fewer than N tasks available;
	-needs N copies of registers and state, in general, i.e., things that
	 are fast, and therefore expensive, if only in opportunity cost.
	-probably has worse cache behavior, in terms of the separate
	 tasks banging into each other more.
		OF COURSE, ALL THIS NEEDS QUANTIFICATION.
	The more you split the hardware apart [like separate caches],
	the closer you get to separate processors.
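
To put rough numbers on the slot-wasting point, here is a toy model in C
(every figure in it is an illustrative assumption of mine, not a measurement;
it just restates the argument above, using the 1 VUP ~= 10 MB/sec rule of
thumb and charging the barrel for empty slots):

/* Toy model of the barrel-vs-single-CPU argument above.
 * All numbers are illustrative assumptions, not measurements. */
#include <stdio.h>

#define MB_PER_VUP 10.0   /* rule of thumb: 1 VUP ~= 10 MB/sec into the CPU */

int main(void)
{
    double bandwidth  = 200.0; /* assumed MB/sec the technology can deliver */
    double total_vups = bandwidth / MB_PER_VUP;   /* ~20 VUPs either way    */
    int    n_slots    = 4;     /* N-way barrel                              */
    double slot_vups  = total_vups / n_slots;     /* B VUPs per barrel slot */
    int    runnable;

    printf("single CPU: %.1f VUPs (one task sees all of it)\n", total_vups);
    for (runnable = 1; runnable <= n_slots; runnable++)
        /* the barrel wastes whole slots when fewer than N tasks exist */
        printf("barrel, %d runnable task(s): %.1f VUPs used, "
               "best single task sees %.1f VUPs\n",
               runnable, slot_vups * runnable, slot_vups);
    return 0;
}
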
I think barrel designs might make more sense in board-level
implementations than they do in chip-level designs.  It is often less
expensive to replicate state in the former, and also to afford
really wide busses all over the place.  It might make sense to
do a barrel design for one design round if you think you can get to VLSI
in the next.

Anyway, quite specifically, the detailed tradeoffs in building VLSI CPU chips
seem to argue against building them as barrels:
	I don't know of any existing popular CISC or RISC chips that are
	barrels; if anybody does, please point them out.
	Likewise, although this is harder data to know, the next round of
	chips is not likely to do this either: everybody is working on
	more integrated chips and things like super-scalar or super-pipelined
	designs, but the MIPS Competitive Intelligence Division has yet to turn
	up any barrel chips out there.  Maybe if we're all building
	2M-transistor chips, we'll find that we can't think of anything
	better to do, although I doubt it...

2) Business issues: (now, it gets long)
	The computer business is fundamentally different than it was even 10
years ago, basically because of the microprocessor.  Specifically, if you
are a systems designer, and if you choose to design your own CPUs, rather
than implement your system out of commercially-available micros, you'd
better have a Real Good reason, of which the following are some:
	a) You're building something that has to be binary-compatible with
	an existing line.  Your choice is either to build things out of
	gate-arrays, or semi-custom, or full-custom VLSI, in order of
	ascending cost and difficulty. [Gate-arrays: most supermini & mainframe
	vendors; full-custom VLSI: DEC CVAX.]
	b) You're a semiconductor vendor, also, and your business is building
	VLSI chips anyway. [Intel, Motorola, etc.]
	c) You're a system vendor who thinks it can design a CPU architecture
	and make it popular enough that it gains access to successive
	technologies and stays at the leading edge of the technology
	curve in a cost-effective way [Sun & MIPS].
	d) You're building something whose performance or functionality
	cannot be achieved with existing micros or next year's micros [CONVEX,
	CRAY]

However, if you're building something from scratch, it had better do something
a lot better than next year's micros, or you'll get run over from behind
by CPUs that have
	1) bottom-to-top range of applicability, not limited
	to a narrow price-performance niche,
	2) volume, and hence lower cost, and
	3) a bigger software base*, and
	4) more $$ coming in to fuel the next round of development to the
	next range of performance.
* Caveat: you do have to be careful that you don't just count # packages
available, but number of RELEVANT packages for the kinds of machines you're
building.  For example, ability to run MSDOS applications is a plus for
workstations, but probably not very relevant to somebody who wants a Convex,
so you can't compare architectures by counting applications.  Nevertheless,
application availability does count.

Not that many years ago, there used to be LOTS of companies who built
mini / supermini class machines out of TTL (and then maybe ECL).
You'd probably be surprised how many different proprietary minis have
been built: I looked at the DataPro research reports, 1987, and found about
50 different mini or supermini architectures [there used to be more].
Of these, some were produced by companies that have since disappeared;
many of them may never be upgraded; only a few are supported by companies
successful enough to make the continued enhancement worthwhile.

In the early 1980s, proprietary minis started getting badly hurt by
the 16-bit micros, and low-end superminis were getting threatened by
32-bit micros.  Only a few mini/supermini vendors are left, really.
Of course, this is the second wave of this: consider the consolidations
in the companies building mainframes and others in the 1950s and 1960s...

OPINION, PERHAPS BIASED (REMEMBER WHERE I WORK): 
1) There exist VLSI RISCs in production that already show faster integer
and scalar FP performance than any of the popular superminis. Before the
end of the year, people will ship ECL VLSI RISCs at supermini prices,
whose corresponding uniprocessor performance is equivalent to Amdahl
5990s or IBM 3090s.  In addition, one should expect to see, during
1990-1992, CMOS or BiCMOS chips from which one can build 50-100 VUPs
machines (still uniprocessor).  There's no reason not to have a
1000-VUP multi in a large file-cabinet-size box by 1992 / 1993 (although
we'd sure better get some faster disks by then!) at costs competitive
with current superminis.

2) Most mini/supermini architectures born in the 1970s or early 1980s
are essentially doomed, unless they're owned by a company with strong
finances, a big customer base, or, perhaps, a customer base that's heavily
locked in for some reason or other.  Some of the older mainframe architectures
are also doomed, for the same reason.  [Note: doom doesn't mean they disappear
overnight, but that it gets harder and harder to justify upgrades, and if
a company takes the approach of relying only on its installed base of locked-in
customers, trouble is coming.]
	Now, this doesn't mean that the company owning those architectures
	is doomed.  Some mini companies have taken thoughtful and timely
	steps to adapt to the new technology without dumping their customers:
	HP would be a good example: think how long ago they saw the RISC stuff
	coming, and how much work went into assuring reasonable migration.
	Others have been working the problem as well; some have not, to the
	best of my knowledge, and I suspect they're going to get hurt.

3) Proprietary mini-supers are in serious danger in the next year or two:
one can already see the bloodbath going on there.  (Apologies to my
friends at various places), but it's hard to see why anybody but Convex
is really going to prosper and remain independent in this.
Note that Convex seems wisely to be taking the strategy of moving up
chasing supercomputers and staying out of the frenzy at the lower-end of
this market, which is, of course, the part starting to be attacked by
the VLSI RISCs.  I know this overlap is starting to happen, because we
(MIPS and some of its friends) are seeing a lot more competitive run-ins
with some of the mini-super guys.  We lose some (like: real vector problem,
need 1GB of memory, need some application that we don't have yet), but
we win some already on cost/performance, and sometimes even on performance.
An M/120 (a $30K thing in a PC/AT box) has been known to beat some
mini-supers in some kinds of big number-crunching benchmarks, and that is
Not Good News.... (well, it's good news for us...:-)

What happens in 1989/1990? Well, we expect to see the first VLSI ECL RISCs
appear, at least from us and Sun.  These things have got to be Bad News,
as they'll be in the 30-60 VUPs range, with reasonable scalar floating-point.
They're likely to be quite competitive (on a performance basis) with
many of the mini-supers, except in really heavy vector or vector-parallel
applications, and they'll probably win on cost/performance numbers in
even more cases, leaving a fairly narrow niche.
However, even worse is the software problem.   One of the biggest problems
for mini-supers is the difficulty of getting software on them: the machines
are expensive enough that you don't just leave them around at bunches of
3rd-party software developers.  BTW, 3rd-party developers are sane people,
and they don't port software for free, and they care about the number of
machines on which they can sell their software.  This makes it Real Hard
if you only have a few hundred machines in the field, unless your machines
are among the few able to run the application. (Note how important it is to
be the first to get to a new zone of cost/performance, i.e., part of why
CRAY and Convex have been successful).
This is not a problem faced by the ECL RISCs,
which both already have large numbers of software-compatible machines out there.
To get a feeling for the scope of the problem, here are some numbers:
From COMPUTERWORLD, Feb 13, 1989, page 130 "High Performance Computers":
Minisuper installed base as of yearend 88 (Computer Technology Research Corp):
450	FPS
430	Convex
335	Alliant
110	Elxsi
 45	SCS
150??	Multiflow* (from a different source):
----
1520	TOTAL
This article didn't include Multiflow: CSN 2/13/89, p46. says "As of June
1988, Multiflow had sold 44 of its Trace computers.  Since then, the company
has stopped revealing how many systems it has sold, but Joseph Fisher,
co-founder and EVP, said the 4th and 3rd quarters generated the largest and
second-largest revenue for the company in its four-year history."
Assume the installed base is now 150 machines (probably optimistic).
(And of course, who knows how accurate these numbers really are? However,
they're probably the right order of magnitude.  To be fair, the CW article
claimed minisupers were a real hot growth area, and I'm using the numbers
in the opposite direction....)

Now, MIPS and/or semiconductor partners have shipped about 20,000 chipsets,
as of YE1988.  Of course, many of them have gone into prototypes, or into
dedicated applications, or other things.  Still, MIPS itself built on
the order of 1000 machines, as well as a lot of boards that have gone into
others, and of course, some of our friends have shipped more MIPS-based
machines than we have.  Although I'm not privy to the numbers :-),
there must be 5-15K SPARC-based things out there, mostly in Sun-4s.
In late 1989, the mini-supers will have to face the spectre of competing with
fast and cost-effective machines whose CPU performance overlaps
at least the lower-middle of the minisuper performance
range, each of which has an installed base of lots of 10s of thousands,
low-end machines in the $10K range or lower, lots of software, and
little messing around to get reasonable performance.

Of course, CPU performance alone does not a minisuper make, and none of
this should be taken as disparagement of folks who work at any of these
companies, some of whom have built hardware or software that I respect
greatly.  All I suggest is that the old quote is appropriate:
	"Don't look back.  Something might be gaining on you."

To finish this long tome with the thing that started it: a barrel design
had better show some compelling and lasting advantage over VLSI RISCs,
because it will probably be more expensive to build, and if it doesn't
get volume, business reality will make its life very hard.
Sorry for the length of this, but the topics have come up in a number of
side e-mail conversations, and it seemed to fit here.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

brooks@vette.llnl.gov (Eugene Brooks) (02/21/89)

In article <13582@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
A long, and well founded, analysis of why superminis are being squeezed out
of their performance niche from the rear by VLSI based machines.

This article is conservative at best; there are a whole lot of users of Cray
time buying the latest VLSI-based machines as a more cost-effective alternative.
With the latest microprocessors these machines are within 1/5th of the
performance of a Cray supercomputer for all but the most highly vectorized
codes.  For scalar codes the performance of these microprocessors can be
as high as 1/2 of a Cray-1S.  The performance of supercomputers has stagnated
in the last 10 years, with only about one factor of 2 in performance per CPU
having been achieved.  Needless to say, while traditional computer technology
has stagnated performance-wise,  microprocessors have really accelerated as
their designers have learned the basics of pipelining and have had enough gates
on a chip to support full functionality.  Supercomputer vendors shudder when
we show them where the best microprocessors stand in relation to mainframes
for the Livermore Loops and point out where their performance will be a year
or two from now.  Next year's microprocessors will meet or beat the scalar
performance of supercomputers, and I expect at least one or two further
doublings in speed of these parts before they reach an asymptote.  At that
point you will start to see higher bandwidth memory connections for these
parts (as opposed to a simple stall on a cache miss model) and the distinction
between a micro and a supercomputer architecture will be completely blurred.

Supercomputers at this point will still exist, but they will be built out of
modestly large numbers of VLSI processors (shared memory or otherwise depending
on the application).  The only hope for supercomputer vendors is to start using
higher levels of integration than they currently use so the cost of their
hardware can be reduced and their reliability increased.



Is the news software incompatible with your mailer too?
brooks@maddog.llnl.gov, brooks@maddog.uucp, uunet!maddog.llnl.gov!brooks

mccalpin@loligo.uucp (John McCalpin) (02/21/89)

In article <20667@lll-winken.LLNL.GOV>
                 brooks@maddog.llnl.gov (Eugene Brooks) writes:
>In article <13582@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>A long, and well founded, analysis of why superminis are being squeezed out
>of their performance niche from the rear by VLSI based machines.
>
>This article is conservative at best, there are a whole lot of users of Cray
>time buying the latest VLSI based machines as a more cost effective alternative
>With the latest microprocessors these machines are within 1/5th of the
>performance of a Cray supercomputer for all but the most highly vectorized
>codes.  For scalar codes the performance of these microprocessors can be
>as high as 1/2 of a Cray-1S. 

I have had a great deal of trouble believing the poor performance of
"supercomputers" on scalar code lately.  I just ran the LINPACK 100x100
test on the FSU ETA-10 (10.5 ns = 95 MHz) and got a result of 3.8 64-bit
MFLOPS for fully optimized (but not vectorized) code.  I used the
version of the code with unrolled loops. This performance is EXACTLY
the same as the MIPS R-3000/3010 pair running at 25 MHz.  I understand
that there must be tradeoffs, but considering the difference in cost,
this is a bit surprising....
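
For readers who haven't looked inside LINPACK: the kernel that dominates the
100x100 case is DAXPY, and the "unrolled loops" variant looks roughly like the
C sketch below (the reference benchmark is Fortran; this rendering and its
test values are mine, just to show what the compiler is being asked to chew on):

#include <stdio.h>

/* Rough C rendering of the unrolled DAXPY kernel that dominates the
   100x100 LINPACK case (the reference code is Fortran; unrolling by 4
   follows the usual "unrolled loops" variant of the benchmark). */
static void daxpy_unrolled(int n, double a, const double *x, double *y)
{
    int i, m;

    if (n <= 0 || a == 0.0)
        return;
    m = n % 4;                      /* clean up the odd elements first */
    for (i = 0; i < m; i++)
        y[i] += a * x[i];
    for (i = m; i < n; i += 4) {    /* main loop, unrolled by 4 */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}

int main(void)
{
    double x[100], y[100];
    int i;

    for (i = 0; i < 100; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }
    daxpy_unrolled(100, 3.0, x, y);
    printf("y[99] = %.1f (expect 5.0)\n", y[99]);
    return 0;
}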

Of course, the vectorized version runs at 60 MFLOPS on the ETA-10 now
(90 MFLOPS with the 7 ns CPU's), and gets rapidly faster for larger
systems.

I don't mean to pick on CDC/ETA --- even the fastest Crays are going
to get caught by the highest performance RISC chips pretty soon.

I haven't seen any MC88000 results yet, but it looks to be able to put
out results in the same performance range.  Does anyone know if the
memory bandwidth of the 88000 is going to be able to keep the floating-
point pipeline filled?  This could push the performance of the 88000
up closer to 10 MFLOPS....
----------------------   John D. McCalpin   ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu		mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (02/22/89)

John Mashey has argued convincingly that single-chip processors are on a
faster trend curve than mainframe processors, and in fact are just plain
catching up.

The basic reason is the speed of light. As Mr. Cray knows, small == fast.
In the long run, the smallest system is the one that fits on a single
chip.

Now that you've all nodded sagely ... I don't agree with the last
sentence above. I think that we're going to see really large chips -
perhaps the much fabled wafer-scale integration. And if you think about
wire lengths, those chips are going to have some awfully long
interconnects: wires just centimeters and centimeters long. We might do
better by going to three dimensions instead of two.

The breakthrough I'd like to see, is chip vias. For the hardware-
impaired, what I mean is, I'd like to see signal paths between the two
surfaces of a chip. I'd like to take a stack of naked chips, and then
solder them together into a solid cube.

-- 
Don		D.C.Lindsay 	Carnegie Mellon   Computer Science
-- 

colwell@mfci.UUCP (Robert Colwell) (02/22/89)

In article <7330@pyr.gatech.EDU> mccalpin@loligo.cc.fsu.edu (John McCalpin) writes:
>In article <20667@lll-winken.LLNL.GOV>
>                 brooks@maddog.llnl.gov (Eugene Brooks) writes:
>>In article <13582@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>A long, and well founded, analysis of why superminis are being squeezed out
>>of their performance niche from the rear by VLSI based machines.
>>
>I don't mean to pick on CDC/ETA --- even the fastest Cray's are going
>to get caught by the highest performance RISC chips pretty soon.

A note from the other side of the aisle.  "Even the fastest Crays"?  Are
you kidding?  If you believe the Cray-3 is going to be manufacturable
(an entertaining discussion all by itself) then how the heck do you think
a micro is going to get 1800 mflops any time soon?  I think that's wishful
thinking or outright fantasy.  Do you realize how many bits you'd be 
stuffing into its pins per unit time?  Or maybe you think you're going to
make the micro out of GaAs?  You still have to feed it.  Cray has the
wire lengths down below 1" to maintain this kind of data rate.  I don't
think you're going to touch that kind of computation rate without the same
big bucks the big boys must spend.  Are you going to do CMOS with 100K ECL
I/O's?  Or do you think you're going to get there with TTL switching levels?

And those large machines put way more than half their money into their
memory subsystems.  What sleight of hand will make it possible for the 
micros to do better?  And if they don't do better, then their systems
won't cost significantly less than the less integrated machines, in which
case their cost advantage dissipates.  The same goes for the I/O needed
to support all the flops being predicted willy-nilly in this stream of
discussion.  The cost/performance of I/O isn't increasing at anything 
near the rate of the CPUs.  So my (possibly broken) crystal ball says
that the default future isn't so much a world filled with satisfied
customers of nothing but micros as one filled with CPUs spending
an awful lot of time waiting on badly mismatched memory and I/O systems.

Data caches aren't going to help you much, either, running the kinds of
codes that Crays are good for.  Name a data cache size, and every user
will say it's too small.

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
175 N. Main St.
Branford, CT 06405     203-488-6090

seeger@poe.ufnet.ufl.edu (F. L. Charles Seeger III) (02/22/89)

In article <4330@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
|The breakthrough I'd like to see, is chip vias. For the hardware-
|impaired, what I mean is, I'd like to see signal paths between the two
|surfaces of a chip. I'd like to take a stack of naked chips, and then
|solder them together into a solid cube.

I believe through-wafer vias are being done, at least in some labs.  However,
as you might expect, they are rather large by VLSI standards.  Memory-check:
my recollection could be about a proposal to do these vias, rather than a
report of it being done.

AT&T is working on using wafers as circuit boards, with >= 4 conducting
layers, including power and ground planes.  Individual chips are mounted
to the wafer with solder techniques similar to SMT.  A big win here is that
this mounting is done before packaging, so that the IO pads on the chips
can be scattered about the chip anywhere that is most convenient (i.e. the
pads need not be around the periphery).  The initiative for this work was
to increase the interconnect density, which hasn't been keeping pace with
chip density.  You can then mount your CPU, MMU, FPU, sRAM, dRAM, etc. all
on one wafer, while still using different fabs for the different chips.
Though their work is currently planar, it seems that combining this
technology with through-wafer vias would point in the direction that you
suggest.

--
  Charles Seeger            216 Larsen Hall		+1 904 392 8935
  Electrical Engineering    University of Florida
  seeger@iec.ufl.edu        Gainesville, FL 32611

brooks@vette.llnl.gov (Eugene Brooks) (02/23/89)

In article <656@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>A note from the other side of the aisle.  "Even the fastest Crays"?  Are
>you kidding?  If you believe the Cray-3 is going to be manufacturable
>(an entertaining discussion all by itself) then how the heck do you think
>a micro is going to get 1800 mflops any time soon?  I think that's wishful
Whether or not the Cray-3 is manufacturable, there will certainly be super-
computers with many gigaflops of VECTOR performance in the near term.  We were
talking about scalar performance, and not vector performance.  Certain codes
which are heavily run on Cray machines are scalar and would score high hit
rates in a rather small cache.  I predict that a microprocessor will outrun the
scalar performance of the Cray-1S within a year.   The "supercomputers" will
only hold on for those applications which are 99% vectorized, which are darned
few, and because of this supercomputers will share the computer center floor
with micro based hardware soon, and on an equal footing.



Is the news software incompatible with your mailer too?
brooks@maddog.llnl.gov, brooks@maddog.uucp, uunet!maddog.llnl.gov!brooks

rpw3@amdcad.AMD.COM (Rob Warnock) (02/23/89)

As John Mashey says, with current chip technology, barrel processors don't
seem to make much sense. But there *are* upcoming technologies for which
barrel architectures will make sense, at least for a time, just as there was
a time in the past in which they made sense -- when memory was much slower
than the CPU logic (e.g. the CDC 6600 Peripheral Processors).

Case in point: Several groups -- most notably that I know of, Alan Huang
et al. at Bell Labs (see some recent issue of "Scientific American" or
"Discover", I forget) -- are working on true optical computers, where the
fundamental logic operations are done with non-linear optics. The total
"state" of the CPU might be in the pattern of on/off dots in a planer wavefront
of light. A "microcycle" would consist of that wavefront travelling through
the "logic", mixing with itself and getting pieces sliced and diced, passing
through a regenerator (amplifier/limiter), and looping back to the beginning.
(This would *really* be done with mirrors!) That is, all of the optical devices
("gates", if you like) would be operating in parallel, on different pieces of
the wavefront.

Now I'm guessing these machines will initially have optical loop paths
("microcycle" times?) in the low to medium nanoseconds (circa 5ns/meter
in glass?), since they won't be sub-miniaturized (initially). But from what
I hear, even initially the optical devices will be *very* fast (just a few
picoseconds or less), so that you'll only be "using" the gates for the
"thickness" of the wavefront. So they're already thinking about taking
another wavefront and positioning it "behind" the first one (of course
with some guard time to avoid interference).

Voila! A barrel processor! In the limit, a given hunk of glass/&c. could
support "loop_time/switching_time" CPUs in the barrel. And if I/O or access
to "main" memory (whatever that might be) was slow, it might make sense to
artificially increase the microcycle (loop) time to match the external world,
which at the same time lets you stack more CPUs in the barrel. (Pushing the
analogy, the width of one "stave" is fixed by the speed of the optical logic,
including guard bands. "Staves/sec", or circumferential speed is fixed by
speed-of-light in the glass/silicon/air/whatever in the loop. But the "RPMs"
can be slowed by adding staves to the circumference of the barrel.)
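
A quick back-of-the-envelope version of the stave arithmetic (every figure
below is a guess of mine, just to show the shape of the calculation):

/* Back-of-the-envelope slot count for the optical barrel sketched above.
 * All figures are illustrative guesses, not data from any real design. */
#include <stdio.h>

int main(void)
{
    double loop_len_m    = 2.0;   /* assumed optical loop length, meters */
    double ns_per_m      = 5.0;   /* ~5 ns/m propagation in glass        */
    double slot_width_ns = 0.05;  /* device switching time + guard band  */
    double loop_time_ns  = loop_len_m * ns_per_m;
    double slots         = loop_time_ns / slot_width_ns;

    printf("loop time  : %.1f ns\n", loop_time_ns);
    printf("barrel CPUs: about %.0f (loop_time / slot_width)\n", slots);
    /* stretching the loop to match slow memory adds staves, not speed */
    return 0;
}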

Anyway, just to point out that there is some chance that barrel processors
may live again someday...


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

paulf@ece-csc.UUCP (Paul D. Franzon) (02/23/89)

In article <19814@uflorida.cis.ufl.EDU> seeger@iec.ufl.edu (F. L. Charles Seeger III) writes:
>In article <4330@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>|The breakthrough I'd like to see, is chip vias. For the hardware-
>|impaired, what I mean is, I'd like to see signal paths between the two
>|surfaces of a chip. I'd like to take a stack of naked chips, and then
>|solder them together into a solid cube.
>
>I believe through-wafer vias are being done, at least in some labs.  However,
>as you might expect, they are rather large by VLSI standards. 

Hughes is working on mechanical through-wafer vias.  They are large (0.5 mm
square) but you can put circuits underneath them.  They have proposed
multi-layer structures.  I've heard nothing about mechanical reliability.

AT&T is working on through-wafer optical interconnects.  At the moment this
is at a research stage only.

>
>AT&T is working on using wafers as circuit boards, with >= 4 conducting
>layers, including power and ground planes.  Individual chips are mounted

This effort has been dropped.  Several other groups are working on Ceramic
or Al high density "circuit boards", on which chips are flip mounted.
This gives you a very high density I/O capability and very fast interconnect.
Some people here are currently  exploring structures that can use these
capabilities effectively.


-- 
Paul Franzon					Aussie in residence
Ph. (919) 737 7351				ECE Dept, NCSU

malcolm@Apple.COM (Malcolm Slaney) (02/24/89)

In article <24582@amdcad.AMD.COM> rpw3@amdcad.UUCP (Rob Warnock) writes:
>As John Mashey says, with current chip technology, barrel processors don't
>seem to make much sense. 

Maybe I missed something in the definition of a barrel processor but isn't
the new Stellar machine a barrel processor much like the HEP?  I just read
the machine overview last night and they have four "virtual" pipelines that
time share a single long pipeline.  I wonder what it is about the Stellar
architecture that makes them think they can succeed where Burton Smith (HEP)
couldn't?  Is it just the graphics?

This is one of the advantages they claim:
	At any particular moment, 12 instructions are active in the pipeline,
	but each stream (of four) has only three instructions active.  In this
	way, the architecture can achieve the performance of a deep (12 stage,
	50 nsec) pipeline, while experiencing the minimal "pipeline drain"
	cost of a shallow (3 stage, 200 nsec) pipeline.
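
The arithmetic behind that claim is easy to sketch (the only inputs are the
figures quoted above; the flush accounting below, where a stream loses
everything it has in flight, is my own simplification, not Stellar's):

/* Quick arithmetic on the Stellar claim quoted above: four interleaved
 * streams share a 12-stage, 50 ns/stage pipeline, so each stream sees an
 * effective 3-stage, 200 ns pipeline.  The "flush" accounting is a
 * simplification of mine, not Stellar's description. */
#include <stdio.h>

int main(void)
{
    int    stages       = 12;
    double stage_ns     = 50.0;
    int    streams      = 4;
    int    per_stream   = stages / streams;    /* 3 in flight per stream */
    double eff_cycle_ns = stage_ns * streams;  /* 200 ns between issues  */

    printf("deep pipe, one stream alone: %d instructions lost on a flush\n",
           stages);
    printf("barrel-style, per stream   : %d instructions lost on a flush\n",
           per_stream);
    printf("effective cycle per stream : %.0f ns\n", eff_cycle_ns);
    return 0;
}
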
I don't have to pay for the machine so I can't comment on its price. I did
recompile my ear models on the machine and they worked without changes and
ran much faster than a Sun...but not as fast as the code on our Cray XMP.

								Malcolm

rodman@mfci.UUCP (Paul Rodman) (02/24/89)

In article <20821@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov.UUCP (Eugene Brooks) writes:
>Whether or not the Cray-3 is manufacturable, there will certainly be super-
>computers with many gigaflops of VECTOR performance in the near term.  

Please stop using the word VECTOR.  Use "large data aggregate" or "parallel"
instead. There are many, many problems that are not vectorizable but
have large amounts of parallelism. You would call it "scalar" parallelism, 
but you would be in error if you thought a small RISC chip would compete
in performance with a VLIW (or WISC) machine.

It's interesting that the same folks who find "CISC" non-optimal can
also refer to vector architectures without flinching! I'm waiting for the
day when somebody announces a SPARC or MIPS based processor with a vector
unit! :-) :-) 

[Reminds me of chess programs that follow MCO for the opening
but as soon as they're out of the "book" they have no idea why they did what
they did, and they start undoing moves!]

>We were
>talking about scalar performance, and not vector performance.  Certain codes
>which are heavily run on Cray machines are scalar and would score high hit
>rates in a rather small cache.  
>I predict that a microprocessor will outrun the
>scalar performance of the Cray-1S within a year.   The "supercomputers" will
>only hold on for those applications which are 99% vectorized, 
>which are darned
>few, and because of this supercomputers will share the computer center floor
>with micro based hardware soon, and on an equal footing.
>

Well, I have more faith in parallel compilation than you seem to. Probably
because I've been able to build hardware for some of the best compiler-writers 
in the world.

I *DO* agree that canonical supers are dead ducks, in short order. 
VLIWs using VLSI to much greater advantage will replace them. 



    Paul K. Rodman 
    rodman@mfci.uucp

colwell@mfci.UUCP (Robert Colwell) (02/24/89)

In article <20821@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov.UUCP (Eugene Brooks) writes:
>In article <656@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>>A note from the other side of the aisle.  "Even the fastest Crays"?  Are
>>you kidding?  If you believe the Cray-3 is going to be manufacturable
>>(an entertaining discussion all by itself) then how the heck do you think
>>a micro is going to get 1800 mflops any time soon?  I think that's wishful
>
>Whether or not the Cray-3 is manufacturable, there will certainly be super-
>computers with many gigaflops of VECTOR performance in the near term.  We were
>talking about scalar performance, and not vector performance.  Certain codes

I gathered that, but I was going to just let it slide.  "Vector" 
performance does not necessarily mean "floating point" performance,
and it isn't just floating point that makes supercomputers super.
It's also the other things I mentioned.  I didn't say the micros
will hit a plateau and nothing they ever do thereafter will make
interesting applications run any faster.  I meant that making 
balanced systems is just as important for them as for their costlier
competition, and that users who see high flop numbers on benchmarks
and think it means commensurately high performance on micros may
be in for a bigger than usual shock when they try to use I/O or
fit their application into main memory.

>which are heavily run on Cray machines are scalar and would score high hit
>rates in a rather small cache.  

I guess we could each "prove" our point by judicious selection of
interesting benchmarks.

>I predict that a microprocessor will outrun the
>scalar performance of the Cray-1S within a year.   The "supercomputers" will
>only hold on for those applications which are 99% vectorized, which are darned
>few, and because of this supercomputers will share the computer center floor
>with micro based hardware soon, and on an equal footing.

I hope you mean that there will be some micro somewhere in a system
that achieves a higher throughput on a scalar program, because anything
else doesn't count.  And there, I further predict that said micro,
having achieved this feat, will then have trouble on one of three other
counts -- having enough physical memory present to handle the same
size jobs that people want, having enough flops to not be embarrassing
on more vectorizable codes, and having enough I/O to support all of the
above without making the user smash the keyboard in frustration.  And
all of that at workstation prices.  If you go higher in the cost space,
then your one-chip solution must start competing with multi-chip solutions
that have much more flexibility in their implementation.  And as I've
argued before, they don't pay all that much of a penalty for it either, 
because systems at these performance levels put most of the implementation
dollars into memory and I/O, not CPU.

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
175 N. Main St.
Branford, CT 06405     203-488-6090

brooks@maddog.llnl.gov (Eugene Brooks) (02/24/89)

In article <661@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
A bit of a flame, followed by a statement that VLIW machines will dominate
the world of computing.

Some time back I posted a challenge to VLIW proponents to compile and run a
parallel Monte Carlo code of mine and compare the performance to a box full
of microprocessors.  There were no takers.  This challenge is still open.

I would also like to see a detailed justification for the statement that VLIW
processors will displace vector processors at what they do best.  A VLIW
machine has not yet outrun a vector processor on its favorite workload
and I do not see any technology trend that leads to this conclusion even for
the long term.


Is the news software incompatible with your mailer too?
brooks@maddog.llnl.gov, brooks@maddog.uucp, uunet!maddog.llnl.gov!brooks

mccalpin@loligo.uucp (John McCalpin) (02/24/89)

In article <656@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>In article <7330> mccalpin@loligo.cc.fsu.edu (John McCalpin) writes:
>>
>>I don't mean to pick on CDC/ETA --- even the fastest Cray's are going
>>to get caught by the highest performance RISC chips pretty soon.
>
>A note from the other side of the aisle.  "Even the fastest Crays"?  Are
>you kidding?  If you believe the Cray-3 is going to be manufacturable
>(an entertaining discussion all by itself) then how the heck do you think
>a micro is going to get 1800 mflops any time soon?  I think that's wishful
>thinking or outright fantasy. 
>Bob Colwell               ..!uunet!mfci!colwell

If you re-read my original message, you will see that I am talking about 
SCALAR code only.  Comparing the performance of EXISTING Cray machines
to the fastest RISC chips shows that the Whetstone performance of a
Cray X-MP is not that much faster than a 25 MHz MIPS R-3000/3010
(or an MC88000). Cray may have a factor of 2 better performance (I
don't have the numbers right in front of me), which I again claim is
not impressive when the clock speeds (118 MHz vs 25 MHz) and prices
($3,000,000+ vs $150,000) are taken into consideration.  Not all the
important codes in the world can be vectorized to any significant degree.
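
Putting those numbers side by side (the factor-of-2 scalar figure is the rough
guess from the previous sentence, not a measured result):

/* Side-by-side of the rough numbers above: an assumed factor-of-2 scalar
 * edge for the Cray, against the quoted clock and price ratios. */
#include <stdio.h>

int main(void)
{
    double cray_clock = 118.0, micro_clock = 25.0;     /* MHz          */
    double cray_price = 3.0e6, micro_price = 150.0e3;  /* dollars      */
    double scalar_edge = 2.0;                          /* Cray / micro */

    printf("clock ratio        : %.1fx\n", cray_clock / micro_clock);
    printf("price ratio        : %.0fx\n", cray_price / micro_price);
    printf("scalar perf ratio  : %.1fx (assumed)\n", scalar_edge);
    printf("scalar perf/dollar : micro ahead by about %.0fx\n",
           (cray_price / micro_price) / scalar_edge);
    return 0;
}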

I certainly agree with you that micros will never compete with the 
VECTOR performance of these machines simply because the memory
bandwidth is not going to be available.   For my large scientific 
problems (which are >98% vector code), I much prefer the CDC memory-
to-memory approach.  Having a data cache would be very little help.
However, Cray and CDC/ETA machines are not likely to ever be cost-
effective on scalar codes, precisely because most of their budget
goes into producing huge bandwidth memory subsystems....
----------------------   John D. McCalpin   ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu		mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------

turner@uicsrd.csrd.uiuc.edu (02/25/89)

In article <4330@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
> The breakthrough I'd like to see, is chip vias. For the hardware-
> impaired, what I mean is, I'd like to see signal paths between the two
> surfaces of a chip. I'd like to take a stack of naked chips, and then
> solder them together into a solid cube.

I just heard a talk by Kai Hwang who reported that Hughes labs has
produced 4" wafers in stacks of 6 that implement an array of
processors 32x32 in size!  He only had one slide on this, and he said
he had taken it from Hughes.  The specs I can remember said that 1"
square of the wafer contained circuitry.  It was all CMOS, and at
10MHz consumed 1.3W.  I believe that the processors they are talking
about are bit sliced - but I'm not sure.  They have plans for much
larger scale (512x512 procs) on 6" wafers, sometime around '93.

Meanwhile I personally feel that this type of technology has *lots* of
obstacles to overcome.  1) heat (obviously).  2) fault tolerance to
an unheard-of degree: think about it, if a single wafer has a fault
in a column, the system may have to eliminate the entire column to
avoid it!  3) I/O: how do you think the problem of pin limitation
applies to the square/cube law?
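
To put rough numbers on the pin-limitation worry, here is a toy
perimeter-versus-area scaling model (nothing in it comes from the actual
Hughes design; it only shows why I/O falls behind as the arrays grow):

/* Toy illustration of the pin-limitation worry for wafer-scale processor
 * arrays: processors grow with the area (n*n) while edge connections grow
 * only with the perimeter (4*n).  Purely a scaling sketch, not Hughes data. */
#include <stdio.h>

int main(void)
{
    int side;

    for (side = 32; side <= 512; side *= 2)
        printf("%3dx%-3d array: %6d processors, %5d edge positions "
               "(%.0f processors per edge site)\n",
               side, side, side * side, 4 * side,
               (double)(side * side) / (4 * side));
    return 0;
}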

Overall, not a pretty picture.  But *I* sure won't say it can't be done.

---------------------------------------------------------------------------
 Steve Turner (on the Si prairie  - UIUC CSRD)

UUCP:    {ihnp4,seismo,pur-ee,convex}!uiucdcs!uicsrd!turner
ARPANET: turner%uicsrd.csrd.uiuc.edu
CSNET:   turner%uicsrd@uiuc.csnet            *-)    Mutants for
BITNET:  turner@uicsrd.csrd.uiuc.edu                Nuclear Power!  (-%

mash@mips.COM (John Mashey) (02/25/89)

In article <7367@pyr.gatech.EDU> mccalpin@loligo.cc.fsu.edu (John McCalpin) writes:
...
>If you re-read my original message, you will see that I am talking about 
>SCALAR code only.  Comparing the performance of EXISTING Cray machines
>to the fastest RISC chips show that the Whetstone performance of a 
>Cray X/MP is not that much faster than a 25 MHz MIPS R-3000/3010
>(or an MC88000).
Note: we haven't seen any published numbers on the 88K yet,
but we generally expect a noticeable difference between the R3000 and
88K on 64-bit scalar codes [noticeable = 1.5-2X].
Hopefully, DG will provide a nice bunch of performance data to go with
their 88K product announcements next week.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.COM (John Mashey) (02/25/89)

In article <665@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
......
>I hope you mean that there will be some micro somewhere in a system
>that achieves a higher throughput on a scalar program, because anything
>else doesn't count.  And there, I further predict that said micro,
>having achieved this feat, will then have trouble on one of two other
>counts -- having enough physical memory present to handle the same
>size jobs that people want, having enough flops to not be embarrassing
>on more vectorizable codes, and having enough I/O to support all of the
>above without making the user smash the keyboard in frustration.  And
>all of that at workstation prices.  If you go higher in the cost space,
>then your one-chip solution must start competing with multi-chip solutions
>that have much more flexibility in their implementation.  And as I've
>argued before, they don't pay all that much a penalty for it either, 
>because systems at these performance levels put most of the implementation
>dollars into memory and I/O, not CPU.

I agree 100% with Robert on the issue of building balanced systems:
it's important.  I also don't expect widely-available workstations
to give supers or minisupers a hard time any time soon.  [Why?  When you
look at the tradeoffs you tend to make for the volume workstations,
you tend to limit them in terms of memory size, I/O, and expandability.
Of course, one's idea of workstation certainly goes up over the years.]
Workstations are not small servers/minicomputers, which are
not large servers/big superminis/small mainframes, which are not
big mainframes or mini-supers, which are not supercomputers.

On the other hand, you can take the same chips that you might use in
a very unbalanced workstation [i.e., lots of CPU, and less I/O],
and build fairly powerful, balanced machines with the same technology,
and save a tremendous amount of money across a product line,
even though, in the largest machines, the CPU chips themselves are
basically almost free. In the small machines, you may well waste CPU
power to lower cost, and not have smart peripheral controllers, etc.
In the bigger ones, you may have multiple CPUs, smart controllers,
high-powered backplanes, etc, etc.  A good example that can be seen right
now is the way DEC uses its CMOS VAX chips in different configurations,
and I have to believe that it really helps their overall product line costs.

The second issue is a more subtle one, which is, that as the volume goes
up, the unit costs go down, and this can be seen in the I/O area as
well.  In particular, cheap things that you wished would exist don't
come into being until the systems that want them make sense.
Let me give a few examples, observing first, that in the Old Days,
if you built computers, it meant you built everything yourself,
CPUs, memories, disks, tapes, etc, etc, and if you couldn't do it,
you weren't in the business.  That's changed a lot; only the largest
companies can afford to, and even they do a lot of judicious outsourcing.
Now, you can put together some pretty good machines by integrating
a lot of other people's work:
	a) Remember when having an Ethernet controller meant you had a
	good-sized board full of logic?  If the only systems that wanted
	Ethernet were large superminis, you might not have LANCE chips
	at reasonable costs.  [Why would anybody bother to build them if
	there weren't going to be volume?]
	b) SCSI controllers: same thing.
	c) Very fast 5 1/4" (or 3 1/2") disks: why would you want these
	if all you had were mainframes [which want huge fast disks],
	or PCs [which started wanting small cheap disks].  On the other
	hand, with workstations and fast supermicros, you can get real
	use from these things, which raises the volume, which drops the
	price, and makes it worth investing the effort to make them faster.
	The best of these compete well with ESMDs on some speed metrics,
	and now people are looking real hard at building big, fast, cheap,
	disk subsystems out of arrays of these (like the UCB work,
	as just one example).
	d) High-speed controller boards: people like Interphase build some
	pretty fast controller boards.  Who uses them?  People who have
		1) fast CPUs
		2) nonproprietary I/O busses
	People who have slow CPUs naturally choose lower-performance
	controllers.  People who have proprietary I/O busses spend a lot
	of money building their I/O systems (and sometimes, necessarily so
	to get performance they want, or, perhaps, redundancy or other
	special functionality).  But look what happens when you
	get cheap, fast CPUs in systems that use industry busses.
	All of a sudden there's a market there for somebody who wants to
	supply controllers, and there might be enough volume to justify
	the effort, and although the high-performance controllers might
	command a premium price, the costs of these boards are much less
	than the corresponding costs of more proprietary ones, if the latter
	must be engineered for lower-volume products, even in the same
	performance range.

What you see is a pattern:
	a) People will build high-performance I/O products, if there are
	systems they make sense in, and the costs will go down.
	b) If cheap CPUs keep getting faster, there will be more pressure
	to boost the I/O performance (cheaply), and that creates markets
	for people doing higher-integration-level VLSI to fill the demand.

So that brings us back to the original discussion: if you build systems
out of the same (or a small number of different, like CMOS & ECL)
VLSI CPUs, the lower part of a product range gets drastic percentage
savings [a large board or two less is serious business], the middle
can take the cost savings and put some of it in I/O, and the top gets
the benefit of spending little engineering effort on the CPU, and can
take that effort and also put it into I/O.  Maybe all of this gets enough
volume that people can say "we should have a super-whizzy bus chip,
and we can now afford to build it."

(There are more words than I like in this, and I'm not sure I've
communicated the industry-interaction effects as well as I'd like,
so maybe somebody else can say it better.)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mcdonald@uxe.cso.uiuc.edu (02/25/89)

>Subject is possibility that microprocessors will beat Crays in
speed sometime soon. (For scalar code.)

Remember that Crays are in a sense special-purpose machines.
For some purposes (i.e. some type of calculations) a lowly 386 PC
will beat any single processor Cray. What purpose? Can you say
"integer remainder instruction" bound code? [Some math problems
fall in this category.] One of my most common programs almost
has my PC beating a Cray (but not quite). It is hopelessly scalar,
has gigantic arrays (of numbers which never get bigger than 200),
totally integer, and has just enough remainders to make the Cray
unhappy. 


Doug McDonald

colwell@mfci.UUCP (Robert Colwell) (02/25/89)

In article <20862@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov.UUCP (Eugene Brooks) writes:
>Some time back I posted a challenge to VLIW proponents to compile and run a
>parallel Monte Carlo code of mine and compare the performance to a box full
>of microprocessors.  There were no takers.  This challenge is still open.

As I'm sure you know, Eugene, if there were a potential sale behind this
challenge, our level of interest would rise considerably.  As it is, just
what do you think you'd be entitled to conclude if we DID run something
that was tailored for a vector box and came up short?  How about this:
nothing at all.  A VLIW isn't a replacement for a vector machine.  It's
a different way of computing that does very well on vector code, but also
does well on code that chokes a vector machine.  Because of Amdahl's law,
this situation arises more often than not.  So what's your point?

>I would also like to see a detailed justification for the statement that VLIW
>processors will displace vector processors at what they do best.  A VLIW
>machine has not yet outrun a vector processor yet on its favorite workload
>and I do not see any technology trend that leads to this conclusion even for
>the long term.

The first answer is to wait and see.  What is your rejoinder when it is
pointed out that the TRACE (as an example of a VLIW) routinely achieves
fractions of a Cray X-MP far out of proportion to the difference in 
cycle times?  I'd imagine the only thing you can say is that you suspect
there's something fundamental in the design of a VLIW that will always
make it require a much longer cycle time.  You can think that if you
want to.  If you want the opinions of the folks who designed the TRACE,
we think that's hogwash.

Arguing in your style, name me a vector machine that can touch a dedicated
single-algorithm digital signal processor on ITS favorite workload. 
Right, none.  Did I just prove that digital signal processors are the
wave of all computing in the future?

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
175 N. Main St.
Branford, CT 06405     203-488-6090

rogerk@mips.COM (Roger B.A. Klorese) (02/26/89)

In article <661@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>I'm waiting for the
>day when somebody announces a SPARC or MIPS based processor with a vector
>unit! :-) :-) 

You mean, like Ardent?!
-- 
Roger B.A. Klorese                                  MIPS Computer Systems, Inc.
{ames,decwrl,pyramid}!mips!rogerk      928 E. Arques Ave.  Sunnyvale, CA  94086
rogerk@servitude.mips.COM (rogerk%mips.COM@ames.arc.nasa.gov)   +1 408 991-7802
"I majored in nursing, but I had to drop it.  I ran out of milk." - Judy Tenuta

rodman@mfci.UUCP (Paul Rodman) (02/27/89)

In article <20862@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov.UUCP (Eugene Brooks) writes:

>Some time back I posted a challenge to VLIW proponents to compile and run a
>parallel Monte Carlo code of mine and compare the performance to a box full
>of microprocessors.  There were no takers.  This challenge is still open.

A "box" full of microprocessors? 

Look, if you've got some money, call a salesman and benchmark your code.
If you like the results buy the Trace. Don't conclude that just because 
there were no takers that you have "proved" anything. 

Secondly, if somehow you've managed to port your application to a "box" of
micros, congratulations, but most folks don't have the time to do such things.
Most computer users have large programs that couldn't be ported to a "box"
of micros in a month of Sundays.

Eventually, *more* use for multiple-CPU systems will find its way to more
and more rank-and-file computer users (for something more than
time-sharing performance).  When that happens you'll still be better off
with a small number of faster machines (VLIWs) than a large number of
slow ones. Unless, of course, you have just the right application.

>
>I would also like to see a detailed justification for the statement that VLIW
>processors will displace vector processors at what they do best.  

Why would I need to prove such a silly thing? Why shouldn't *you* have to
prove that a vector processor can even come close to VLIWs at what *they* do
best?  (A much larger set of programs, by the way.)  Would you like to write
vector programs, or programs with parallel expressions? The latter is much
easier.

Even if your vector performance is some fraction less than a vector machine's,
unless the user's application is incredibly vectorizable, you'll easily
win due to non-vector speedups.  I'm sure you've read Olaf Lubeck's paper
from Los Alamos, haven't you? What does that paper tell *you*? It tells
me that even if the vector machines at Los Alamos had *infinite* speed 
vector units the speedups are dismal. In a *decade*+ of programming vector
machines, some of the best scientists in the world haven't improved the
percent vectorization. 
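
For anyone who wants to plug in their own numbers, the Amdahl's-law arithmetic
behind that point fits in a few lines of C (the vectorized fractions below are
made-up examples, not the Los Alamos data):

/* Amdahl's-law arithmetic behind the "infinite-speed vector unit" point.
 * The fractions below are illustrative, not the Los Alamos measurements. */
#include <stdio.h>

/* speedup of the whole program when fraction f of it runs s times faster */
static double amdahl(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    double fractions[] = { 0.50, 0.70, 0.90, 0.99 };
    int i;

    for (i = 0; i < 4; i++) {
        double f = fractions[i];
        printf("%2.0f%% vectorized: x%.1f with a 10x vector unit, "
               "x%.1f with an infinite one\n",
               100.0 * f, amdahl(f, 10.0), 1.0 / (1.0 - f));
    }
    return 0;
}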

>A VLIW
>machine has not yet outrun a vector processor yet on its favorite workload
>and I do not see any technology trend that leads to this conclusion even for
>the long term.
>

That's because you aren't looking. Vector compilers are topped out, in case
you haven't noticed. Vector compilers have been around a long, long time
and have gotten quite good.  This is *bad* news for vector machines, not
good news.

A decent VLIW+compiler has only just popped on the commercial scene, relatively
speaking. Do you think that we are standing still?



    Paul K. Rodman 
    rodman@mfci.uucp

rodman@mfci.UUCP (Paul Rodman) (02/27/89)

In article <7367@pyr.gatech.EDU> mccalpin@loligo.cc.fsu.edu (John McCalpin) writes:
>
>If you re-read my original message, you will see that I am talking about 
>SCALAR code only.  

OOOOOOhhhhh, ok, you mean you want permission to ignore Amdahl's law, do you?

Also, I keep trying to get you guys to stop using the word "SCALAR" when
what you really mean is "a small amount of parallelism". This is an
extremely sloppy situation.  What if my vector lengths are of length 2?
Aren't you going to stand by your claim?  You will?  Then DON'T use the
word SCALAR, please.  You meant to say "no parallelism".

The whole misuse of the terms "SCALAR" and vector on this net just
underlines the lack of understanding about what makes computers slow, or
fast. On the one hand I've got folks trashing on me saying that "box"es
of micros are going to be faster than hell due to all the parallelism
in programs.  On the other hand I've got RISC/micro guys saying "well
as long as there is no parallelism, we'll beat a CRAY!". :-) :-)

>
>I certainly agree with you that micros will never compete with the 
>VECTOR performance of these machines simply because the memory
>bandwidth is not going to be available.   

Then how come I have to answer flames that claim the opposite? :-)

>For my large scientific 
>problems (which are >98% vector code), I much prefer the CDC memory-
>to-memory approach.  Having a data cache would be very little help.
>However, Cray and CDC/ETA machines are not likely to ever be cost-
>effective on scalar codes, precisely because most of their budget
>goes into producing huge bandwidth memory subsystems....

However, the ETA machine is fundamentally damaged in its ability to
do other than stride 1 accesses. Presumably they will fix this someday.

You may simplify your statement: "Cray and CDC machines are not going
to be cost effective." :-)



    Paul K. Rodman 
    rodman@mfci.uucp

rodman@mfci.UUCP (Paul Rodman) (02/28/89)

In article <13888@admin.mips.COM> rogerk@mips.COM (Roger B.A. Klorese) writes:
>In article <661@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>>I'm waiting for the
>>day when somebody announces a SPARC or MIPS based processor with a vector
>>unit! :-) :-) 
>
>You mean, like Ardent?!

Sheesh, I posted that mail expecting a farrago of replies and received only
4, 2 of which were from MIPS folks (the Ardent uses MIPS chips). None
were from Ardent. I guess they don't read comp.arch.....they must be 
smarter than I thought...:-):-):-). 


    Paul K. Rodman 
    rodman@mfci.uucp

Calm down?! Calm down!? But.., but..., I AM calm!....

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (02/28/89)

In article <661@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>In article <20821@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov.UUCP (Eugene Brooks) writes:
>>computers with many gigaflops of VECTOR performance in the near term.  
>Please stop using the word VECTOR. use "large data aggregate" or "parallel"
>instead. There are many, many problems that are not vectorizable but

It makes sense to use VECTOR when that is what you mean.  The poster was
correct - higher speed vector processors are, in a sense, a known quantity.
Given a level of technology, it is a known problem of how to build a high
performance vector processor with a known behavior relative to a scalar
processor of the same technology.  Parallel processing is not, yet,
equivalent, though the number of parallel approaches to solving problems
is increasing.

>day when somebody announces a SPARC or MIPS based processor with a vector
>unit! :-) :-) 

I have not looked at the SPARC architecture in sufficient detail to know
whether a vector processor that is upward compatible with SPARC is a good
idea, but I suspect it is.  After all, the Cray machines and the CDC Cyber 205,
without their vector capabilities, are high performance "RISC" machines
(load store, instruction set which is easily pipelined, simple addressing modes,
simple R-to-R instructions).

Anyway, a vector micro makes perfect sense if you can build a data path into
and out of it that is wide enough.  If VLIW matures, you can expect to see
some VLIW "vector" micros.  There is already at least one machine that is close
to this - the Weitek 64 bit vector micro.

>I *DO* agree that canonical supers are dead ducks, in short order. 
>VLIWs using VLSI to much greater advantage will replace them. 

I'm not sure what a "canonical super" is, but VLIW machines are still SIMD,
like vector machines (of which they are a generalization from a certain point
of view).

A "true parallel" machine would be MIMD.  Like the Cray X-MP and its successors,
which allow true parallelism, but with a relatively small number of processors.
I note that Cray, CDC/ETA, Convex, etc., all seem to have concluded that
a vector (or possibly VLIW machine in the future) processor makes a jim dandy
building block to build a parallel machine out of, but that building a purely
parallel machine with only SISD sub-processors is not optimal.


  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

lm@snafu.Sun.COM (Larry McVoy) (02/28/89)

In article <675@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
$ I'm sure you've read Olaf Ludbeck paper
$ from Los Alamos, haven't you? What does that paper tell *you*? It tells
$ me that even if the vector machines at Los Alamos had *infinite* speed 
$ vector units the speedups are dismal. In a *decade*+ of programming vector
$ machines, some of the best scientists in the world haven't improved the
$ percent vectorization. 


I really don't have a lot to add to this other than to repeat it.  It
seems that a lot of people out there think VECTOR is the word of God.  I
remember this paper and I remember various discussions over coffee when I
worked at ETA.  The conclusion was then and is now "if you can't scream on
non-vectorizable, integer code, forget it" (ETA can forget it).  You want
a fast Unix box?  Get an Amdahl - 30 MIPS and the I/O to go with it.  You
want to build your own?  Concentrate on I/O and integer performance.
That's your bread and butter.

Larry McVoy, Lachman Associates.			...!sun!lm or lm@sun.com