[comp.arch] What with these Vector's anyways?

martin@felix.UUCP (Martin McKendry) (01/01/70)

In article <3652@well.UUCP> rchrd@well.UUCP (Richard Friedman) writes:

>Just a minute.  As I understand it, the so-called Dhrystone benchmarks
>are in  C  and Cray does not have an optimizing  C  compiler in general
>distribution.  What you need is to look at Dongarra's LINPACK
>benchmarks, which are in FORTRAN.  

Actually, if you really want to, you can recode the function of Dhrystone
in any language you want, and you might learn something.  For example,
there is a version in Burroughs Algol.  If you will cover my legal fees,
I'll try to remember the results :-).

--
	Martin S. McKendry
	FileNet Corp
	{hplabs,trwrb}!felix!martin

jdg@elmgate.UUCP (Jeff Gortatowsky) (07/20/87)

In article <2378@ames.arpa> lamaster@ames.UUCP (Hugh LaMaster) writes:
[..............]
>vectors.  If the vectors are not contiguous, then the advantage disappears.
>
>  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
>(Disclaimer: "All opinions solely the author's responsibility")


Could someone out there explain to me what the basic idea is behind 
supercomputer CPU's?  I know what an interrupt vector is (i.e. an address
pointing to another address) and so forth.  But that obviously (or maybe
not?) has nothing to do with the vectors talked of when dealing with
supercomputers.  Just as Hugh mentions above, I've seen others talking
of how a certain function can be 'vectorized(??)' or can't be.  And when
they can't, supercomputers are dogs (well, slower) etc....
Now I'm no math wiz (that's why I use computers!) but if it could be
explained from a hardware standpoint I should be able to grasp the concept.
Email if you feel this isn't of interest.....
Thank you,
	Jeff



-- 
Jeff Gortatowsky       {seismo,allegra}!rochester!kodak!elmgate!jdg
Eastman Kodak Company  
These comments are mine alone and not Eastman Kodak's. How's that for a
simple and complete disclaimer? 

ron@topaz.rutgers.edu (Ron Natalie) (07/21/87)

> Could someone out there explain to me what the basic idea is behind 
> supercomputer CPU's?  I know what an interrupt vector is (i.e. an address
> pointing to another address) and so forth.  But that obviously (or maybe
> not?) has nothing to do with the vectors talked of when dealing with
> supercomputers.

Supercomputers are categorized by being very fast computers.  There are
various ways of accomplishing this.  First, you can use rather exotic
technology to get a single CPU (such as you're used to in your VAX)
to run a thousand times faster.  This of course is very hard, the main
problem being that you can't make things small enough that the propagation
of the signals in the wires (no faster than the speed of light) doesn't
overly slow you down.

Well, here come the tricky parts.  What happens if we use reasonably
fast CPU parts in parallel?  We ought to be able to get a speedup
proportional to the number of parallel parts.  There are a couple of
ways we can do this.

One way is to build our processor to work with arrays in the computations
rather than just scalar numbers.  One dimensional arrays (vectors) are
useful in many calculations.  Multidimensional array calculations can be
done as multiple vector operations, just as vector operations can be
done as multiple scalar operations.  Of course, you need to have
problems for which using an array is a speedup.  This is the ability
to be "vectorized" that was referred to.  Currently, one needs to figure
out how to do this in your source code, but work is underway all over to have
compilers find constructs in the code that lend themselves to vectorization
and produce the appropriate vector operations.

Another way is to replicate the processor so that many operations are done
in parallel.  One approach is to have a number of processor elements
each executing the same program but operating on different pieces of data.
This is referred to as SIMD (Single Instruction stream, Multiple Data stream)
parallelism.  Another approach is to have Multiple Instruction and Multiple
Data streams (MIMD), each processor executing its own program with whatever
data.  To get a MIMD architecture to work, there must be some method of
sharing the data.  One approach is for all to share the same memory (such as
with multiple CPU's on the same bus); another is to have some form of
interprocessor communication between CPU's.

Let's look at some popular CPU's:

CRAY 1 - Vector processor
CRAY X-MP - Vector Processor, replicated 1-4 times, shares memory.
CRAY 2 - Similar to X-MP, newer and faster technology

Denelcor HEP - MIMD parallel, CPU's grouped into PEM's.  Each PEM shares
	data memory, but all PEM's in system share the same data address
	space.  Requests for non-local memory are made and returned through
	an inter-PEM process switch.
NASA MPP (Massively Parallel Processor) - SIMD, on the order of ten thousand
	CPU data streams.
Hypercubes - MIMD, each CPU is interconnected to some number of neighbors
	as indicated by the edges in a hypercube (the CPU's being at the
	vertices).

-Ron
	

johnw@astroatc.UUCP (John F. Wardale) (07/21/87)

In article <13401@topaz.rutgers.edu> ron@topaz.rutgers.edu (Ron Natalie) writes:
>> Could someone out there explain to me what the basic idea is behind 
>> supercomputer CPU's?  ...vectors...etc.

Thanks, Ron, for a generally excellent response.  Just a few additions:

>  Of course, you need to have
>problems that using an array would be a speed up.  This is the ability
>to being "vectorized" that was referred to.

The requirements are hard to explain, but just using arrays (in
Fortran or whatever) is NOT enough.  The data values have to be
known in time for the operators to operate on them.
ARY(I) = some_function_of(ARY(I-1))  is *HARD* or impossible to
vectorize.
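The distinction can be sketched in Python (the function names are illustrative only, not from any real compiler):

```python
def scale_all(ary):
    # Independent iterations: element i never reads element i-1, so a
    # vector machine could issue all the multiplies as one vector op.
    return [2.0 * x for x in ary]

def prefix_sum(ary):
    # Loop-carried dependence, ARY(I) = f(ARY(I-1)): each iteration
    # needs the previous result before it can start, so the loop
    # cannot be collapsed into a single vector operation.
    out = []
    total = 0.0
    for x in ary:
        total += x
        out.append(total)
    return out
```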

> ... SIMD (Single Instruction stream, Multiple Data stream)
> ... Mulitiple Instruction and Multiple Data Streams (MIMD) 
don't forget SISD (Single Instr. Single Data)  (most common form)
and MISD  (in theory... I doubt anyone ever built one)

>CRAY 2 - Similar to X-MP, newer and faster technology
(a nit:)  A Cray-2 is NOT like an XMP, but that's a different argument.

Comments on shared-memory MIMD machines:  This is probably the
easiest and a fairly flexible solution.  It works great for ~10
CPUs, but if you try it with ~100 CPUs, the memory gets very
overloaded (i.e. real slow system response).


The code in different programs can be
o SCALAR          (none of the following apply; use SISD)
o VECTORIZABLE    (see above; use SIMD)
o PARALLELIZABLE  (well suited for MIMD machines)
o Vec or Par      (use either)

The problem is that most programs contain sections in each
of the above classes.  Each type of machine will do best on
different sections, and it's usually the SCALAR portions that
contribute the LARGEST part of the run times.
BTW:  The Cray-1 & XMP are also very-good Scalar machines!

			John W

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Name:	John F. Wardale
UUCP:	... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
arpa:   astroatc!johnw@rsch.wisc.edu
snail:	5800 Cottage Gr. Rd. ;;; Madison WI 53716
audio:	608-221-9001 eXt 110

To err is human, to really foul up world news requires the net!

roy@phri.UUCP (Roy Smith) (07/21/87)

In article <687@elmgate.UUCP> jdg@aurora.UUCP (Jeff Gortatowsky) writes:
wants to know what "vector" means in the context of "vector processors"

	Let's say you have 3 floating point arrays, x, y, and z and you
want to set each element in z equal to the product of the corresponding
elements in x and y.  On a scalar processor (i.e. Vax, Sun, etc) you would
write:

	for i goes from 1 to upper-limit-of-x,y,z
	do
		z[i] = x[i] * y[i]
	end

	The problem is that the cpu wastes a lot of time doing the donkey
work of executing the loop (incrementing the index and checking the upper
limit), computing the addresses for the array references, fetching and
decoding the multiply instruction opcode, etc., and only after all that does
it get to do the "real" work of doing the floating-point multiply.  On a
vector processor, you would have a single instruction to do the whole loop.

	Further, if you look carefully at a floating multiply operation,
you see it takes a dozen or so atomic steps: multiply the mantissas, add
the exponents, normalize the result, check for under/overflow, etc.  On a
scalar machine these operations get done in series.  On a vector machine,
if you have 6 multiplies to do (call them M1-M6), once you get the pipeline
primed, you can be doing mantissa-multiply for M4 at the same time that
another bit of hardware is doing the exponent-add for M3 while some other
piece of hardware is doing the normalize for M2 and the overflow-check for
M1 is being done by yet another bit of hardware.  Thus, if it takes N clock
cycles to do a complete multiply, on a vector machine you need N clock
cycles before the first result is complete, and thereafter you get another
result every clock cycle.
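The cycle arithmetic above can be put into a tiny Python model (a sketch; the 12-stage count below merely stands in for the "dozen or so" steps):

```python
def scalar_multiply_cycles(stages, count):
    # Serial machine: each multiply runs through every stage before
    # the next multiply can begin.
    return stages * count

def pipelined_multiply_cycles(stages, count):
    # Pipelined machine: the first result completes after `stages`
    # clocks, then one more result completes every clock thereafter.
    return stages + (count - 1)
```

For the example of 6 multiplies through a 12-stage pipeline, that is 72 cycles in series versus 17 pipelined.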

	Some problems vectorize easily, some don't.  If you have the type
of problem that does, running it on a vector machine is a big win.  If you
have the type of problem that doesn't, running it on a vector machine is
just a good way to waste expensive hardware.
-- 
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

suhler@im4u.UUCP (Paul A. Suhler) (07/22/87)

Here's one more way to explain the purpose of vector instructions.
Pardon me if I missed it among the spate of earlier postings.

In article <2806@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>[...]
>	The problem is that the cpu wastes a lot of time doing the donkey
>work of executing the loop (incrementing the index and checking the upper
>limit), computing the addresses for the array references, fetching and
>decoding the multiply instruction opcode, etc., and only after all that does
>it get to do the "real" work of doing the floating-point multiply.  On a
>vector processor, you would have a single instruction to do the whole loop.
>[...]

If you view the limit on performance as being the processor-memory
bandwidth, you notice that fetching all of those instructions Roy mentions
above consumes a lot of it.  Solution:  fetch the instruction once
(and set up control registers once, etc., etc.) and then just fetch
data.  The increased data fetch rate also lets you use an arithmetic
pipeline efficiently.

[I first heard it explained this way by Harvey Cragon a few years ago.]

-- 
Paul Suhler        suhler@im4u.UTEXAS.EDU	512-474-9517/471-3903

henry@utzoo.UUCP (Henry Spencer) (07/23/87)

> BTW:  The Cray-1 & XMP are also very-good Scalar machines!

It should be noted that this is an important reason why they sell so well
(the Cray production runs are an order of magnitude larger than those of a
lot of other supercomputer projects in the last two decades).  If a program
is 90% vectorizable, then magically making the vector part of it *infinitely*
fast will only speed it up by a factor of ten.  Too many of the pre-Cray
supercomputers did vectors really fast but were pigs on scalar computation.
Cray's secret is that its machines are blindingly-fast *scalar* engines that
sort of incidentally happen to do vectors even faster.  For some reason :-),
saying "our machines run your code real fast if you rewrite it somewhat"
doesn't have nearly the customer appeal of saying "our machines run your
code real fast -- even faster if you rewrite it somewhat".
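The factor-of-ten arithmetic above is Amdahl's law (cited elsewhere in this thread); a minimal Python sketch:

```python
def amdahl_speedup(vector_fraction, vector_speedup):
    # Overall speedup when only vector_fraction of the original run
    # time benefits from vector_speedup; the scalar remainder is
    # untouched (Amdahl's law).
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)
```

With a 90% vectorizable program, even an effectively infinite vector speedup is capped at 1/0.1 = 10, exactly the factor of ten above.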
-- 
Support sustained spaceflight: fight |  Henry Spencer @ U of Toronto Zoology
the soi-disant "Planetary Society"!  | {allegra,ihnp4,decvax,utai}!utzoo!henry

gene@cooper.UUCP (Gene from EK Enterprises) (07/23/87)

in article <2806@phri.UUCP>, roy@phri.UUCP (Roy Smith) says:
> [ ... ]
> elements in x and y.  On a scalar processor (i.e. Vax, Sun, etc) you would
> write:
> 
> 	for i goes from 1 to upper-limit-of-x,y,z
> 	do
> 		z[i] = x[i] * y[i]
> 	end
> [ ... ]
> 	Some problems vectorize easily, some don't.  If you have the type
> of problem that does, running it on a vector machine is a big win.  If you
> have the type of problem that doesn't, running it on a vector machine is
> just a good way to waste expensive hardware.

Ditto for some machines like the Convex C1 series. All you need is a nice
set of rather expensive compilers (C, FORTRAN, etc.) that automatically
optimize the _source_ code to make use of the vector hardware, and then
compile, making full use of the {improved|optimized} source.


					Gene

					...!ihnp4!philabs!phri!cooper!gene


	"If you think I'll sit around as the world goes by,
	 You're thinkin' like a fool 'cause it's case of do or die.
	 Out there is a fortune waitin' to be had.
	 You think I'll let it go? You're mad!
	 You got another thing comin'!"

			- Robert John Aurthur Halford

eugene@pioneer.arpa (Eugene Miya N.) (07/24/87)

In article <8344@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>If a program
>is 90% vectorizable, then magically making the vector part of it *infinitely*
>fast will only speed it up by a factor of ten.  Too many of the pre-Cray
>supercomputers did vectors really fast but were pigs on scalar computation.

How many programs do you know which are 90% vectorizable?  [as opposed
to portions 90% vectorizable.]  {Ref: G. M. Amdahl, SJCC,1967}
Too many?  I am only aware of one which was produced: the STAR-100.
The Cyber 203 came PC: Post-Cray-1.  Everything else could be considered
a paper design.   [^^ A new acronym for people in the know? ;-)]
Pig is not a nice thing to say.

>saying "our machines run your code real fast if you rewrite it somewhat"
>doesn't have nearly the customer appear of saying "our machines run your
>code real fast -- even faster if you rewrite it somewhat".

Surprisingly, people do rewrite; there are over 40 Cyber 205 (the STAR
follow-on) sites and people do rewrite for those.  You also have to
distinguish between writing compiler directives and executable code.

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene

I have begun to really learn that the people who say the most know the
least in many cases.

suhler@im4u.UUCP (Paul A. Suhler) (07/24/87)

In article <2398@ames.UUCP> eugene@pioneer.UUCP (Eugene Miya N.) writes:
>In article <8344@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>>..............................................  Too many of the pre-Cray
>>supercomputers did vectors really fast but were pigs on scalar computation.
>
>Too many?  I am only aware of one which was produced: the STAR-100.

The TI ASC (Advanced Scientific Computer) had one to four pipelines,
but no scalar unit, which hurt performance badly.  The designers
underestimated the amount of scalar code in their target programs.
Still, they sold more ASCs than CDC sold Star-100s.   (Four, I believe.)

-- 
Paul Suhler        suhler@im4u.UTEXAS.EDU	512-474-9517/471-3903

nelson@ohlone.UUCP (Bron Nelson) (07/24/87)

In article <8344@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
> > BTW:  The Cray-1 & XMP are also very-good Scalar machines!
> 
> It should be noted that this is an important reason why they sell so well
> (the Cray production runs are an order of magnitude larger than those of a
> lot of other supercomputer projects in the last two decades).  If a program
> is 90% vectorizable, then magically making the vector part of it *infinitely*
> fast will only speed it up by a factor of ten.

A good rule of thumb is that running a problem vectorized is about 10
times faster than doing the same problem scalar on an XMP (single
processor).  Thus, if some old problem you have is 90% vectorizable,
it will spend half of its execution time doing the part that is (still)
scalar.  So scalar performance is very important.
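The arithmetic above can be checked with a short Python sketch (the function name is illustrative):

```python
def run_time_split(vector_fraction, vector_speedup):
    # Normalize the all-scalar run time to 1.0, speed up only the
    # vectorizable part, and report (new total, scalar share of it).
    vector_time = vector_fraction / vector_speedup
    scalar_time = 1.0 - vector_fraction
    total = vector_time + scalar_time
    return total, scalar_time / total
```

At 90% vectorizable with a 10x vector unit, the new total is 0.19 of the old run time, and the still-scalar part is 0.1/0.19 of that, about 53% -- roughly the "half of its execution time" claimed above.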

Now, it is very true that only a small fraction of the *number* of jobs
run in a typical day have any significant vectorizable parts even if
you look at the sites that buy our machines: compilers and editors (and
your news reading program :-)) do not vectorize very well.  However the
kind of jobs that take the majority of the *cpu* time DO tend to have
significant vectorizable parts.  The big cpu/memory hogs tend to cycle
through huge volumes of data, frequently doing pretty much the same
thing to each datum (this is of course a vast over-simplification,
but you get the idea).  Some codes (for example, oil companies' analysis
of seismic data at a potential drilling site) can be almost embarrassingly
vectorizable - 99+%.  And of course people designing large new codes tend
to consider how to make the solution vectorizable/parallelizable.  But
it IS a big selling point to be able to say "slap that crufty old monster
on this baby and she'll run like lightning from the word go."

-----------------------
Bron Nelson     {ihnp4, lll-lcc}!ohlone!nelson
Not the opinions of Cray Research

roy@phri.UUCP (Roy Smith) (07/24/87)

In article <8344@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
> Too many of the pre-Cray supercomputers did vectors really fast but
> were pigs on scalar computation.

	Correct me if I'm wrong, but don't the Crays have relatively short
(64?) element vector registers with relatively short pipelines compared to
some of the earlier attempts at vector processors?  The result of the short
pipelines is that you get your first result faster, which is a real win on
short vectors.
-- 
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

johnw@astroatc.UUCP (John F. Wardale) (07/25/87)

In article <2806@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>In article <687@elmgate.UUCP> jdg@aurora.UUCP (Jeff Gortatowsky) writes:
>wants to know what "vector" means in the context of "vector processors"
>
>	for i goes from 1 to upper-limit-of-x,y,z
>	do
>		z[i] = x[i] * y[i]
>	end
>
>	The problem is that the cpu wastes a lot of time doing the donkey
>work of executing the loop (incrementing the index and checking the upper
>limit), computing the addresses for the array references, fetching and
>decoding the multiply instruction opcode, etc., and only after all that does
>it get to do the "real" work of doing the floating-point multiply.  On a
>vector processor, you would have a single instruction to do the whole loop.

Generally, you have a limit on the "vector length" (64 for the
Crays), but the compiler will break the loop into
for i goes from 1 to max by 64
	z[i..i+63] = x[i..i+63] * y[i..i+63]
(with a special set-length and multiply for the last group of max mod 64)
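That strip-mining transformation can be sketched in Python (a toy model; `chunked_multiply` is an illustrative name, and each inner slice stands in for one vector instruction on real hardware):

```python
VECTOR_LENGTH = 64  # Cray-style vector register length

def chunked_multiply(x, y):
    # Strip-mine an elementwise product into full 64-element chunks,
    # plus one shorter chunk for the max mod 64 remainder.
    z = [0.0] * len(x)
    for i in range(0, len(x), VECTOR_LENGTH):
        end = min(i + VECTOR_LENGTH, len(x))  # "set-length" for last group
        # On real hardware this whole slice would be ONE vector multiply.
        z[i:end] = [a * b for a, b in zip(x[i:end], y[i:end])]
    return z
```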

On a scalar processor, this code need not be as grim as Roy would
lead you to think.  The loop will be in an I-cache, and the CPU
could have auto-increment modes.

There are 3 common limits:
* issue limited:  An instruction is issued each clock.  The memory
and functional units keep up.  The bottleneck is in the pipeline
(decoding etc.).  This is an argument for vectors, and for RISC.

* memory limited:  Memory bandwidth is saturated.  To improve
this you'll (probably) have to change the processor bus (or take
some other drastic measure).  Changing the program may help,
but that is labeled as "cheating."

* compute limited:  Functional units are busy, so "issues" must wait.
(common with vectors; very rare without vectors)

The real question is: can the scalar loop fetch and store data
(into and out of memory) faster than the multiplier can multiply?
This is an obvious requirement for vector instructions to be
practical.  [Side question:  Are there any "micros" that have (or
could benefit from) vectors, or are their memory interfaces too
low-bandwidth for this?]

>
>	Further, if you look carefully at a floating multiply operation,
>you see it takes a dozen or so atomic steps; multiply the mantissas, add
>the exponents, normalize the result, check for under/overflow, etc.  On a
>scalar machine these operations get done in series.  On a vector machine,
===================================================
>....[description of a pipelined multiplier]

Not necessarily so!  A scalar machine *CAN* have pipelined
functional units!  Another concern is segment time (or how often
you can start the op: once a clock?  once in two clocks? ...)


			John W

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Name:	John F. Wardale
UUCP:	... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
arpa:   astroatc!johnw@rsch.wisc.edu
snail:	5800 Cottage Gr. Rd. ;;; Madison WI 53716
audio:	608-221-9001 eXt 110

To err is human, to really foul up world news requires the net!

lamaster@pioneer.arpa (Hugh LaMaster) (07/27/87)

In article <2398@ames.UUCP> eugene@pioneer.UUCP (Eugene Miya N.) writes:

>Too many?  I am only aware of one which was produced: the STAR-100.
>The Cyber 203 came PC: Post-Cray-1.  Everything else could be considered
>a paper design.   [^^ A new acronym for people in the know? ;-)]

Don't forget about the TI ASC.  






  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov


                 "IBM will have it soon"


(Disclaimer: "All opinions solely the author's responsibility")

lamaster@pioneer.arpa (Hugh LaMaster) (07/27/87)

In article <687@elmgate.UUCP> jdg@aurora.UUCP (Jeff Gortatowsky) writes:
>In article <2378@ames.arpa> lamaster@ames.UUCP (Hugh LaMaster) writes:
>[..............]
>>vectors.  If the vectors are not contiguous, then the advantage disappears.

>Could someone out there explain to me what the basic idea is behind 
>supercomputer CPU's?  I know what an interrupt vector is (i.e. an address

In the simplest terms, a "vector" computer is one which permits a single
CPU instruction to specify an operation on multiple operands.  The operations
may then proceed in parallel, or serially (usually using a "pipeline").  Many,
but certainly not all, engineering and scientific programs can be speeded
up on a machine with a vector instruction set.  

A vector machine is NOT the same thing as a SUPERCOMPUTER.  A supercomputer is
a loose term generally applied to this year's two or three fastest (usually
vector) or most expensive (:-) machines.  But there are now minicomputers with
vector CPU's and we can expect to see microcomputers with vector CPU's REAL
SOON NOW.  A vector micro might have a 100ns clock and a peak vector speed of
20MFLOPS, while a vector supercomputer might have a clock of 4ns, say (Cray 2)
and a peak speed of 500 MFLOPS.  If you are doing numerical simulations or
graphics you will probably benefit from a vector machine, even if it is "only"
a vector mini or micro.  







  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov


                 "IBM will have it soon"


(Disclaimer: "All opinions solely the author's responsibility")

chuck@amdahl.amdahl.com (Charles Simmons) (07/27/87)

In article <2408@ames.arpa> lamaster@ames.UUCP (Hugh LaMaster) writes:
>A vector machine is NOT the same thing as a SUPERCOMPUTER.  A supercomputer is
>a loose term generally applied to this year's two or three fastest (usually
>vector) or most expensive (:-) machines.
>
>  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!

While I'm not as exposed to supercomputers as Hugh, in my experience
the term supercomputer is only applied to extremely fast scalar processors
with attached vector processors.  For example, people around here tell
me that an Amdahl 5890 is at least as fast as a Cray (which one?) for
scalar processing.  However, a 5890 is "only" a mainframe and not a
supercomputer.  I think if you attach a vector processor to a 5890, however,
you do end up with a supercomputer.

Hopefully Hugh or Eugene will correct any of my misconceptions...
 
-- Chuck

fouts@orville (Marty Fouts) (07/27/87)

Actually, any vendor who has a machine which can reach a flops rating
within an order of magnitude (you pick the base ;-) of the currently
believed fastest speed will call his machine a supercomputer.  I
have seen advertisements for massively parallel supercomputers
recently, which achieve the "not to be exceeded" rating via many (>> 1)
slow processors.

What is in a name is that a rose is much easier to sell than a pansy,
so makes more money (:-(

rchrd@well.UUCP (Richard Friedman) (07/27/87)

The best supercomputers are fast scalar machines first, with vector
processing hardware for additional speedup.  Machines like the Cray X-MP
have a vector-to-scalar speedup factor of about 10.  But their scalar
performance is faster than any conventional machine.  

Everyone should also be aware of the SX-2 from NEC, handled here in the
US by Honeywell.  The SX-2 is the fastest machine today.  Beats the
Cray X-MP and Cray-2.  It has a 6 nanosecond clock (X-MP=8).
They claim 1300 megaflops (1.3 billion flops).
  
I think we will be hearing more from this machine!  There is one
now in Houston at the Houston Advanced Research Center.


...Richard Friedman [rchrd], Pacific-Sierra Research (Berkeley)
uucp:  {ucbvax,lll-lcc,ptsfa,hplabs}!well!rchrd
- or -   rchrd@well.uucp

littauer@amdahl.amdahl.com (Tom Littauer) (07/28/87)

In article <3636@well.UUCP> rchrd@well.UUCP (Richard Friedman) writes:
>The best supercomputers are fast scalar machines first, with vector
>processing hardware for additional speedup.  Machines like the Cray X-MP
>have a vector-to-scalar speedup factor of about 10.  But their scalar
>performance is faster than any conventional machine.  
                !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

If the Dhrystone benchmarks are to be believed, this isn't the case.
The May '87 report shows Cray X-MP at 18,530, and the IBM 3090-200
at 31,250. I'd mention our (much faster) machines, but that'd get
perilously close to advertising :-).

Nevertheless, the basic point is valid. If you could get it, you'd
want a machine fast enough to do EVERYTHING quickly, not just a
subset of things. This is not to demean the Cray machines: they do
vectorizable work very quickly, but not all work is vectorizable.

Until compilers are clever enough to make multithread/vector work
out of work the programmer thinks of as serial, we're just gonna
have to pick the right tool for the task at hand.

End of pontification.
-- 
UUCP:  littauer@amdahl.amdahl.com
  or:  {sun,decwrl,hplabs,pyramid,ihnp4,ames,seismo,cbosgd}!amdahl!littauer
DDD:   (408) 737-5056
USPS:  Amdahl Corp.  M/S 330,  1250 E. Arques Av,  Sunnyvale, CA 94086

I'll tell you when I'm giving you the party line. The rest of the time
it's my very own ravings (accept no substitutes).

lamaster@pioneer.arpa (Hugh LaMaster) (07/28/87)

In article <10883@amdahl.amdahl.com> chuck@amdahl.UUCP (Charles Simmons)
writes:

>In article <2408@ames.arpa> lamaster@ames.UUCP (Hugh LaMaster) writes:

>>A vector machine is NOT the same thing as a SUPERCOMPUTER.  A supercomputer is
>>a loose term generally applied to this year's two or three fastest (usually
>>vector) or most expensive (:-) machines.

> ...   in my experience
>the term supercomputer is only applied to extremely fast scalar processors
>with attached vector processors.  For example, people around here tell
>me that an Amdahl 5890 is at least as fast as a Cray (which one?) for
>scalar processing.  However, a 5890 is "only" a mainframe and not a
>supercomputer.  I think if you attach a vector processor to a 5890, however,
>you do end up with a supercomputer.

There are several points here.  The first is that "scalar" means several
things.  To some people it means the speed at which the O/S kernel will run.
To others, it means the speed at which a small sort, or other "small" data
intensive application, will run.  To others, it means floating point code that
is not vectorizable.  All of these are different.  A subsequent poster
mentioned Dhrystone, for example, which is very string intensive and does no
floating point; this is not the same "scalar" that a typical Cray user is
thinking of ("My Monte-Carlo code runs in scalar...").  When typical
supercomputers are touted as good scalar machines, it is with reference to the
third version of "scalar" above, not the first.  It is worth noting, though,
that for many years supercomputers from CDC and Cray were fastest in all three
categories.

The second point is that there are many other supercomputer designs besides
vector machines.  A list of them all would be exhausting to read.  It so
happens that the only really successful machines to date have been vector
machines.  In the future?  A matter of intense debate :-)

The third point is, as I have mentioned previously, that vector capability is a
cost effective thing to add to minicomputers, and now microcomputers, that are
intended for engineering, scientific, and graphics uses.  Even though it
doesn't make Dhrystone run any faster.

It is also worth noting that there are other operations besides floating point
that can be vectorized.  In particular, bitwise logical operations are useful
for image processing/graphics, and encoding/decoding.  Vector instructions
allow these operations, which can be done in one instruction, and which are
thus candidates for a RISC machine, to proceed at (say) one per CPU cycle, on
a system where the CPU cycle speed is significantly faster than memory speed
(the memory bandwidth on these machines is provided by multiple banks).  So,
vectorization is not restricted to floating point, and can pay off on even
some very simple "one cycle" instructions.
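A minimal Python sketch of that non-floating-point case (illustrative only; the whole list comprehension stands in for what would be chunks of one vector logical-AND instruction on such a machine):

```python
def vector_and(mask_a, mask_b):
    # Elementwise bitwise AND across whole arrays -- the kind of
    # simple "one cycle" logical operation that still pays off as a
    # vector instruction (e.g. masking pixels in image processing).
    return [a & b for a, b in zip(mask_a, mask_b)]
```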






  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov


                 "IBM will have it soon"


(Disclaimer: "All opinions solely the author's responsibility")

pf@diab.UUCP (Per Fogelstrom) (07/29/87)

In article <10956@amdahl.amdahl.com> littauer@amdahl.UUCP (Tom Littauer) writes:
>In article <3636@well.UUCP> rchrd@well.UUCP (Richard Friedman) writes:
>>The best supercomputers are fast scalar machines first, with vector
>>processing hardware for additional speedup.  Machines like the Cray X-MP
>>have a vector-to-scalar speedup factor of about 10.  But their scalar
>>performance is faster than any conventional machine.  
>                !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>
>If the Dhrystone benchmarks are to be believed, this isn't the case.
>The May '87 report shows Cray X-MP at 18,530, and the IBM 3090-200
>at 31,250. I'd mention our (much faster) machines, but that'd get
>perilously close to advertising :-).
>

Be serious!  Do you really believe in Dhrystone?  Okay, I do admit that we don't
have anything better for the moment, but soon, I hope.

What this benchmark really tests is the compiler's ability to remove code that
doesn't really do anything.  And of course the cpu's ability to handle strings.
Just check the Am29000.  The designers put in an instruction just to speed up
this benchmark...  Now what, have we reached a new point, where cpu's are
designed for the benchmarks instead of vice versa, as it was before????

[ There are three types of lie: lie, damned lie, and BENCHMARKS !! ]

-- 
Per Fogelstrom,  Diab Data AB
SNAIL: Box 2029, S-183 02 Taby, Sweden
ANALOG: +46 8-7680660
UUCP: seismo!mcvax!enea!diab!pf

rchrd@well.UUCP (Richard Friedman) (07/31/87)

In article <10956@amdahl.amdahl.com> littauer@amdahl.UUCP (Tom Littauer) writes:
>
>If the Dhrystone benchmarks are to be believed, this isn't the case.
>The May '87 report shows Cray X-MP at 18,530, and the IBM 3090-200
>at 31,250. 
>
  
Just a minute.  As I understand it, the so-called Dhrystone benchmarks
are in  C  and Cray does not have an optimizing  C  compiler in general
distribution.  What you need to look at is Dongarra's LINPACK
benchmarks, which are in FORTRAN.  

-- 
...Richard Friedman [rchrd]                         
uucp:  {ucbvax,lll-lcc,ptsfa,hplabs}!well!rchrd
- or -   rchrd@well.uucp

lamaster@pioneer.arpa (Hugh LaMaster) (07/31/87)

In article <279@diab.UUCP> pf@diab.UUCP (Per Fogelstrom) writes:
>In article <10956@amdahl.amdahl.com> littauer@amdahl.UUCP (Tom Littauer) writes:
>>In article <3636@well.UUCP> rchrd@well.UUCP (Richard Friedman) writes:
>>>The best supercomputers are fast scalar machines first, with vector
(omitted discussion about scalar perf. in supercomputers)
>
>Be serious! Do you really belive in Dhrystone? Okay i do admitt that we don't
>have anything better for the moment, but soon i hope.
(omitted discussion about Dhrystone)
>
>[ There are three types of lie, lie, damned lie, and BENCHMARKS !! ]
>-- 
>Per Fogelstrom,  Diab Data AB

I have to add something else to this discussion.  Ten years ago, when Crays
first came out, IBM was still trying to peddle the 370/168 and Amdahl had its
first faster machines.  Folks at the national labs started saying that the
Cray was not only the fastest, but also the most cost-effective, for "scalar"
work.  They were right, at the time.  A lot of water has gone under the bridge
since then.  There wasn't much of a market for fast machines in the early and
mid 70's, but the last four or five years have changed all that.  Even IBM is
trying to keep up.  But, to get back to the question of scalar performance:

Suppose you want to buy the most cost effective machine for doing large sorts.
Ten years ago, that might have been a Cray.  Parallel Computing (Vol 4 1987 pp
49-61) recently had a comparison of sorting performance using scalar and
vector algorithms on big iron.  The scalar performance of some non-Cray big
machines has now caught up with Cray scalar performance (scalar Quicksort
being a good example of a scalar code).  Vector processors are being
incorporated in more "mainstream" mainframes (Amdahl 1200 examined in the
article, but also the IBM 3090 VF machines).  And there are now vectorized
sorting algorithms which can provide significant benefits for some cases.
Overall, for sorting the Amdahl 1200 appeared to have the advantage for scalar
and vector sorting over the Cray X-MP.

There are several points here.  The first is that as more companies are
building fast machines and vector architectures have become mainstream, the
members of the set "supercomputers" are a bit harder to define (again) than
they were ten years ago.  Even for traditional "business" "scalar" computing
like sorting, there are now vector algorithms which show significant
performance improvements over scalar algorithms.  

Finally, the question of what makes a good benchmark:  

If you want to do a lot of sorting, sorting makes a good benchmark.
(Extrapolate to whatever you want to do).
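As a concrete (hypothetical) instance of "benchmark what you do": if sorting is the workload, time the sort itself on representative data rather than a synthetic mix.  The function names below are mine.

```c
#include <stdlib.h>

/* Three-way comparison for longs, avoiding overflow from subtraction. */
static int cmp_long(const void *p, const void *q)
{
    long a = *(const long *)p, b = *(const long *)q;
    return (a > b) - (a < b);
}

/* The kernel to time: sorting n longs.  On a vector machine one would
   compare this scalar sort against a vectorized sorting algorithm, as in
   the Parallel Computing article cited above. */
void sort_longs(long *v, size_t n)
{
    qsort(v, n, sizeof *v, cmp_long);
}
```

Wrapping this kernel in a timing loop over data sized like your real files gives a benchmark far more predictive for a sorting shop than any synthetic instruction mix.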

The original purpose of Dhrystone was to produce a synthetic program that used
"recent statistics" for "real" programs.  Weicker's PROGRAM has been widely
criticized, but the STATISTICS behind it are probably valid for records and
pointers type code.  A new implementation of the code which prints results
that depend on the correct execution of all the code is certainly needed -
Dhrystone II?

A problem with "small" benchmarks which depend on multiple passes over the
same data is that typically code and data can run cache contained, which is
also very artificial.  A new Dhrystone III benchmark which uses the same
statistics but has a much larger data area would be more appropriate for
testing big machines with lots of cache and memory.  

It should be noted that one thing that Dhrystone does do "right" is make lots
of procedure calls.  In my experience, on typical machines that are similar in
other respects, it is often the cost of procedure calls, comparisons, and
branches that determine the "apparent speed" of a scalar machine used for
scalar purposes.  The reason Dhrystone looks SO slow on the Cray is very
likely due to the relatively much larger cost of procedure calls on the Cray
(and the CDC 6600, CDC 7600, Cyber 205, to name a few popular supercomputers).
This effect is real and is a valid result of Dhrystone, as long as the
compiler doesn't do true global optimization.
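The call-overhead effect can be sketched with a toy microbenchmark (the names are mine; compile without inlining, e.g. at -O0, so the calls survive).  Timing the two loops against each other isolates the call/return cost, which is exactly what dominates on the machines named above.

```c
/* add_one lives in its own function so every iteration of with_calls()
   pays a full call/return sequence. */
long add_one(long x) { return x + 1; }

long with_calls(long n)
{
    long i, acc = 0;
    for (i = 0; i < n; i++)
        acc = add_one(acc);     /* one call + one return per iteration */
    return acc;
}

long without_calls(long n)
{
    long i, acc = 0;
    for (i = 0; i < n; i++)
        acc = acc + 1;          /* same arithmetic, no call */
    return acc;
}
```

The difference in running time between the two, divided by n, approximates the per-call cost; a machine with expensive calls shows a large ratio even though both loops compute the same result.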

MIPS computers, and others, have been tending to use Un*x utilities to measure
the "general purpose" speed of machines.  This makes sense: it should be
remembered, however, with respect to Dhrystone, that when it and previous
benchmarks were written there was no standard environment available for most
processors as there is today.  Some (not on this net) still argue about it
today.









  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov


                 "IBM will have it soon"


(Disclaimer: "All opinions solely the author's responsibility")

chuck@amdahl.amdahl.com (Charles Simmons) (07/31/87)

In article <279@diab.UUCP> pf@diab.UUCP (Per Fogelstrom) writes:
>In article <10956@amdahl.amdahl.com> littauer@amdahl.UUCP (Tom Littauer) writes:
>>If the Dhrystone benchmarks are to be believed, this isn't the case.
>>The May '87 report shows Cray X-MP at 18,530, and the IBM 3090-200
>>at 31,250. I'd mention our (much faster) machines, but that'd get
>>perilously close to advertising :-).
>>
>
>Be serious! Do you really believe in Dhrystone? Okay, I do admit that we don't
>have anything better for the moment, but soon, I hope.
>
>What this benchmark really tests is the compiler's ability to remove code that
>doesn't really do anything.  And of course the CPU's ability to handle strings.
>Just check the Am29000.  The designers put in an instruction just to speed up
>this benchmark.... Now what, have we reached a new point, where CPUs are
>designed for the benchmarks instead of vice versa, as it was before?
>
>Per Fogelstrom,  Diab Data AB

Be serious!  Are you really suggesting that the 370 architecture
was designed to execute Dhrystone quickly?  Also, the portable C
compiler that we use on our 5890's doesn't perform any really special
optimizations.

-- Chuck

ran@utah-cs.UUCP (Ran Ginosar) (08/01/87)

Has Weicker's Dhrystone source code ever been published? Where?

earl@mips.UUCP (Earl Killian) (08/01/87)

I was hoping someone would respond to the Cray dhrystone message.
Hugh LaMaster's reply seemed right on the mark to me.  I did want to
clarify one thing, however.  And, I'll end by suggesting two integer
benchmarks we could use to replace dhrystone.

In article <2425@ames.arpa>, lamaster@pioneer.arpa (Hugh LaMaster) writes:
> ...
> MIPS computers, and others, have been tending to use Un*x utilities
> to measure the "general purpose" speed of machines.

MIPS Computer uses Unix utilities (nroff, diff, grep, yacc) as one
component in our performance analysis.  For example, we devote 1 page
of our 20 page performance brief to those four.  While these are ten
times better than dhrystone for estimating performance, they still
overestimate our performance slightly, we believe.  For example, using
just nroff we would call our m/1000 11.7x a 780 instead of 10x a 780.
Internally, we use several phases of our compiler for rating integer
performance, since they are the largest and nastiest things we can
find.  The C front-end (one of our worst) gets as slow as 9.0x a 780
because it cache misses a lot.  Of course, our compilers aren't
something we can give out as a standard benchmark.  So, in a future
brief we will add espresso (PLA reduction) and timber wolf (routing by
simulated annealing) as large integer program benchmarks (well, timber
wolf does a little floating point).  People with integer cad
applications can probably relate to these, just as an average Unix
site can probably relate to using text processing and compiling as
benchmarks.  Perhaps espresso and timber wolf can become good general
purpose benchmarks, even for non-Unix machines?  Are there better
choices of real programs?

Fortunately, for floating point there is no shortage of decent
benchmarks.

mash@mips.UUCP (John Mashey) (08/02/87)

In article <2425@ames.arpa> lamaster@ames.UUCP (Hugh LaMaster) writes:
...long, reasonable discussion on vector machines, benchmarking

....good discussion of where Dhrystone should be changed for different
environments.  100% agree, except for the following:
>Weicker's PROGRAM has been widely
>criticized, but the STATISTICS behind it are probably valid for records and
>pointers type code....

Actually, some of the pointer behavior doesn't seem quite typical, at least in the
C version.  For example, about 50% of the loads/stores (on MIPS machines,
anyway) use 0-offsets, and the more typical percentages are 10-15% in
user-level C programs.
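For readers unfamiliar with the terminology: a 0-offset load is one whose address is just a base register plus a zero displacement.  A tiny hypothetical example of where each kind arises:

```c
/* Accessing the first field of a record needs only the base pointer
   (a 0-offset load); later fields need a nonzero displacement.  Code
   heavy in multi-field record accesses therefore shows mostly nonzero
   offsets, which is John's point about typical user-level C. */
struct rec { long first; long second; };

long get_first(struct rec *p)  { return p->first;  }  /* 0-offset load  */
long get_second(struct rec *p) { return p->second; }  /* nonzero offset */
```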

>It should be noted that one thing that Dhrystone does do "right" is make lots
>of procedure calls....

In general, I'd agree.  However, Dhrystone somewhat overemphasizes the
importance of a fast function call. On our systems, the numbers of
instructions/call for Dhrystone are:
35	-O3 [global + inter-procedural register allocation]
36	-O2 [global opt]  [call this typical]
41	-O1 [no global opt]

here are a few other numbers for integer user-level programs:
 54	nroff
 56	ccom
 57	uopt [global optimizer]
 69	tex
 85	as1 [1st pass of assembler]
350	espresso

and some for a few floating point programs, or at least with some FP:
 38	whetstone single
 48	whetstone single
140	hspice
370	timberwolf
735	DP linpack, FORTRAN

Of course, these are INSTRUCTION COUNTS, not including cache/tlb degradation,
or multi-cycle instruction effects, but it certainly gives a gross idea
of what's going on.   Basically, Dhrystone does function calls 1.5X to 2X
more frequently than large user-level C programs.  Needless to say, this
effect makes VAXen look especially bad, relative to how they actually
perform on other programs.  [I'm not defending slow function calls, of course!]

Actually an amusing test was using the Pascal version on an 8600, using the
Pastel compiler, which avoids the VAX CALL instructions in favor of compiler-
constructed sequences.  This is NOT exactly comparable, since the performance
difference on character strings for Dhrystone is much better in the Pascal
version [fixed-lengths, rather than null-terminated], and since Pastel
optimizes better than Ultrix 1.2's cc.  Still, there was a 2X performance
increase. I'd guess the Dhrystone understates VAX performance (relative to
architectures with leaner calls) about 15-25%, given similar compilers.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

ron@topaz.rutgers.edu (Ron Natalie) (08/03/87)

> While I'm not as exposed to supercomputers as Hugh, in my experience
> the term supercomputer is only applied to extremely fast scalar processors
> with attached vector processors.

This is not true.  This definition is perpetrated by mainframe manufacturers
who try to sell their machines as supercomputers with attached boxes, usually
from FPS (Fly-by-night Convolvers).  Supercomputers are, as previously defined,
extremely fast (and hence generally expensive) computers.  They may accomplish
this through use of vectors and parallelism, but it is generally the number of
floating point operations per second that characterizes the system.  Generally,
only machines that provide a homogeneous approach are considered.

-Ron

ron@topaz.rutgers.edu (Ron Natalie) (08/03/87)

> Just a minute.  As I understand it, the so-called Dhrystone benchmarks
> are in  C  and Cray does not have an optimizing  C  compiler in general
> distribution.  What you need to look at is Dongarra's LINPACK
> benchmarks, which are in FORTRAN.  

Ah, now we see the problem with benchmarks, don't we?  When our group
was procuring computers, performance was measured in execution time for
C code, since we do all our work in C.  Who cares how fast it could do
Fortran; we weren't about to go and rewrite our applications in Fortran.
This caused several manufacturers with probably competitive machines but
with lousy C compilers to be passed over (notably ELXSI at the time).

-Ron

larry@mips.UUCP (Larry Weber) (08/05/87)

In article <24645@sun.uucp> dgh@sun.UUCP (David Hough) writes:
>                  How many C compilers can handle Spice-3 fully optimized?
>And how many Fortran compilers can handle Doduc's program fully optimized?

In case anyone wants to count, the MIPS compilers optimize both Spice-3
and Doduc's program.  They also optimize HSPICE and the UNIX kernel.  David
is absolutely right: real programs put a heavy strain on getting the optimizer
done correctly.  I wonder if any "dhrystone" runs actually have bugs in them
but no one can tell, because it computes nothing that can be tested.

Larry

chris@mimsy.UUCP (Chris Torek) (08/18/87)

In article <558@winchester.UUCP> mash@mips.UUCP (John Mashey) writes:
>Actually an amusing test was using the Pascal version on an 8600, using the
>Pastel compiler, which avoids the VAX CALL instructions in favor of compiler-
>constructed sequences.

The copy of Pastel I have botches the compiler-constructed subroutine
calls, unfortunately.

>This is NOT exactly comparable, ....

Indeed.

It is also amusing to note that the faster VAXen generally have
less of a spread between `average' instruction time and CALLS+RET
instruction time.  The ratio is ~14:1 on the 780, meaning you could
execute ~14 `regular' instructions in the time it takes to do one
call and one return; on the 8600, it is more like ~4:1, I think.
(The 8600 is pipelined, which makes such ratios a bit of a joke.
Perhaps I should construct some test cases and run it on the various
VAXen we have around here: KA630, KA785, KA825, KA860---the 750s
have been officially decommissioned at last!---but there are some
downstairs I guess I could borrow....)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
Domain:	chris@mimsy.umd.edu	Path:	seismo!mimsy!chris

bcase@apple.UUCP (Brian Case) (08/18/87)

In article <567@gumby.UUCP> larry@gumby.UUCP (Larry Weber) writes:
>I wonder if any "dhrystone" runs actually have bugs in them
>but no one can tell because it computes nothing that can be tested.

This is a good point; I have one experience to relate.  While writing an
optimizing C compiler at AMD, I was a little worried that dhrystone was
running incorrectly.  To get a reasonably warm/fuzzy feeling, I put a line
like
	fprintf (outfile, "X\n");
where 'X' is a unique number (or a unique string, etc.), at each possible
fork in a branch (this includes procedure/function entrances, etc).  Then
I ran the program on the VAX and PC RT and simulated it on the Am29000.
In each case the outfile contained the same sequence of numbers.  At least
the *branching* behavior was equivalent in each case.  This seems like a
good first cut at correctness, and it should be possible to construct a
tool (based on a compiler front end) to insert the fprintf statements
automatically.  Is this reasonable or harebrained?
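A sketch of the marker technique Brian describes (the macro and names are mine): a TRACE(id) at every function entry and branch arm writes the id to a trace file, and diffing the trace files produced on two machines verifies that the branching behavior matched.

```c
#include <stdio.h>

/* The harness sets tracefile before running; NULL disables tracing, so
   the instrumented code can also run untraced. */
FILE *tracefile;
#define TRACE(id) \
    do { if (tracefile) fprintf(tracefile, "%d\n", (id)); } while (0)

int classify(int x)
{
    TRACE(1);                        /* function entry */
    if (x < 0) { TRACE(2); return -1; }
    if (x == 0) { TRACE(3); return 0; }
    TRACE(4);                        /* fall-through arm */
    return 1;
}
```

Running the same inputs through the same instrumented source on each machine, then comparing the trace files, gives the "same sequence of numbers" check described above.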

    bcase

steve@nuchat.UUCP (Steve Nuchia) (09/03/87)

I wasn't going to say anything about this if a discussion
started, but it needs saying before I expire the article:

In article <1496@apple.UUCP>, bcase@apple.UUCP (Brian Case) writes:
> optimizing C compiler at AMD, I was a little worried that dhrystone was
> running incorrectly.  To get a reasonably warm/fuzzy feeling, I put a line
> like
> 	fprintf (outfile, "X\n");
> where 'X' is a unique number (or a unique string, etc.), at each possible
> fork in a branch (this includes procedure/function entrances, etc).  Then
> I ran the program on the VAX and PC RT and simulated it on the Am29000.
> In each case the outfile contained the same sequence of numbers.  At least
> the *branching* behavior was equivalent in each case.  This seems like a
> good first cut at correctness, and it should be possible to construct a
> tool (based on a compiler front end) to insert the fprintf statements
> automatically.  Is this reasonable or harebrained?

This is indeed very reasonable.  Such tools exist; they are known
as test coverage or source line profiling tools.  The idea is to
measure the number of times each line (or so) of the source is
executed, either for performance tuning or to check the completeness
of the test cases executed.

I'm pretty sure I saw an article on one for C in the Bell System
Technical Journal back when someone I knew subscribed.  The author
of that article chose to run a preprocessor to insert source statements
to extract the information, leaving the compiler system intact.

The alternative is to put the tests in the code generator or an
assembly language post-processor, both of which have advantages
but were rejected by that author for the usual, valid reasons.

People in comp.compilers probably could supply a lot more info.

	steve nuchia
-- 
Steve Nuchia			Of course I'm respectable!  I'm old!
{soma,academ}!uhnix1		Politicians, ugly buildings, and whores
!nuchat!steve			all get respectable if they last long enough.
(713) 334 6720				- John Huston, Chinatown