[comp.arch] The Killer Micro From Hell

brooks@maddog.llnl.gov (Eugene Brooks) (12/20/89)

Well, at SC'89 I speculated that the MIPS R6000, specifically in the MIPS 6280
box (but that doesn't really matter), would perform at roughly 2.5 times the
speed of the XMP 4/16 CPU on some of my favorite SCALAR compute bound
applications, which I have burned some serious computer time on in recent
years.

The XMP 4/16 is the "fast one" for those who don't know, and the YMP is only
30% faster than the XMP 4/16 on the code in question.

Well guys, I WAS WRONG!  I wish to APOLOGIZE for the terrible error!

The 6280 (and when I was given permission to share this bit of data, I was
also told to inform you that this is a preliminary result on a pre-production
machine) has run at 3.3 times the performance of the XMP 4/16 CPU on a SCALAR
packet-switched network simulator.

The R6000 is probably, for a very short fleeting moment in the lifetime of
a KILLER MICRO, the FASTEST UNIPROCESSOR COMPUTER IN THE WORLD on this code.

Of course, we have to keep in mind that this year's Killer Micro is next
year's Lawn Sprinkler Controller, but what a year it has been and what a year
the coming one will be!

           NO ONE WILL SURVIVE THE ATTACK OF THE KILLER MICROS!

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

COPYRIGHT 1989, Eugene D. Brooks III, all rights reserved.  You are
expressly forbidden to use this posting for product endorsement or
advertising purposes, or to print it on paper to show to a customer
for any reason.

This posting is the personal opinion of the author and is in no way to be
construed as the opinion of the U.S. govt. or the University of California.
brooks@maddog.llnl.gov, brooks@maddog.uucp

rhealey@umn-d-ub.D.UMN.EDU (Rob Healey) (12/27/89)

In article <42007@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>
>The 6280 (and when I was given permission to share this bit of data, I was
>also told to inform you that this is a preliminary result on a pre-production
>machine) has run at 3.3 times the performance of the XMP 4/16 CPU on a SCALAR
>packet-switched network simulator.
>
	Was the code running on the XMP/YMP optimized to the Cray architecture
	as much as the code running on the R6000 was optimized to the MIPS
	architecture? How much time was spent on both in order to arrive at
	the code that gave the results above? Hmmm, I seem to remember the
	Crays being touted for their VECTOR capability in addition to
	"respectable" SCALAR performance. Can the 6000 do seamless vector
	operations too?

	Just asking whether comparing Apples to Oranges and saying Apples are
	better is a valid claim.

	Also, with apologies to DEC and Cray:

	Cray has it now...
	
	While I'm using my spare pocket-change millions for other things
	right now, B^), one generally buys a high powered system for
	many reasons. The overall performance of the whole system - CPU,
	memory, I/O, networking - strongly influences the sale of a system. I'd
	be interested to see the R6000 system that can beat a Cray in
	memory, I/O and networking bandwidth.

	My main reason for responding to this excited article is that
	I find it disturbing that A LOT of people pay attention only to
	MIPS, or only one aspect of a system, and not to full systems as a
	whole. To oversimplify:

	A CPU is only as fast as its slowest sub-system.

		Just some musings,

			-Rob

#include <std/disclaimers.h>

I speak for myself and no one else.

brooks@maddog.llnl.gov (Eugene Brooks) (12/28/89)

In article <3090@umn-d-ub.D.UMN.EDU> rhealey@ub.d.umn.edu (Rob Healey) writes:
>	Was the code running on the XMP/YMP optimized to the Cray architecture
>	as much as the code running on the R6000 was optimized to the MIPS
>	architecture?
We have Cray machines on site, the MIPS R6000 system was a compile and go
benchmark run done by the vendor.  Just which system do you think the code
was "tuned" for, within the limits of keeping the code portable, readable
and maintainable?

>	Hmmm, I seem to remember the
>	Crays being touted for their VECTOR capability in addition to
>	"respectable" SCALAR performance. Can the 6000 do seamless vector
>	operations too?
The scalar performance of the Cray machines is no longer "respectable",
is it?   I do not believe that the R6000 has vector registers, but
I haven't seen technical data on this issue.
>
>	Just asking whether comparing Apples to Oranges and saying Apples are
>	better is a valid claim.
I was not comparing apples to oranges.  I was comparing the performance of
compiled C code on two computers...  In the best notion of benchmarking,
the code was one of MY compute bound applications.  Your mileage will vary.
>
>	My main reason for responding to this excited article is that
>	I find it disturbing that A LOT of people pay attention only to
>	MIPS, or only one aspect of a system, and not to full systems as a
>	whole. To oversimplify:
My main reason for responding to this article is that there are a lot
of people with their heads in the sand who still think that traditional
supercomputers or mainframes are good buys.  I hate to see people get
bushwhacked by Killer Micros when they can just ride the wave.  Killer
Micro powered systems are no longer just more cost effective; for
scalar application codes they are faster...


brooks@maddog.llnl.gov, brooks@maddog.uucp

csimmons@oracle.com (Charles Simmons) (12/28/89)

In article <42527@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
[Description of a benchmark comparing the performance of a Cray
versus the performance of an R6000.]

>My main reason for responding to this article is that there are a lot
>of people with their heads in the sand who still think that traditional
>supercomputers or mainframes are good buys.  I hate to see people get
>bushwhacked by Killer Micros when they can just ride the wave.  Killer
>Micro powered systems are no longer just more cost effective; for
>scalar application codes they are faster...
>
>brooks@maddog.llnl.gov, brooks@maddog.uucp

The comparison would be slightly more interesting if an Amdahl 5990
were compared to the R6000.  For scalar processing, Amdahl mainframes
are (were?) generally considered the fastest obtainable...

-- Chuck

rhealey@umn-d-ub.D.UMN.EDU (Rob Healey) (12/29/89)

In article <42527@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>In article <3090@umn-d-ub.D.UMN.EDU> rhealey@ub.d.umn.edu (Rob Healey) writes:
>We have Cray machines on site, the MIPS R6000 system was a compile and go
>benchmark run done by the vendor.  Just which system do you think the code
>was "tuned" for, within the limits of keeping the code portable, readable
>and maintainable?
>
	Could also be the fact that MIPS has some of the best compiler
	technology around. If I remember correctly, the Cray C compiler is a
	pcc derivative; YUCK-O-RAMA.

>>	I find it disturbing that A LOT of people pay attention only to
>>	MIPS, or only one aspect of a system, and not to full systems as a
>>	whole. To oversimplify:
>My main reason for responding to this article is that there are a lot
>of people with their heads in the sand who still think that traditional
>supercomputers or mainframes are good buys.  I hate to see people get
>bushwhacked by Killer Micros when they can just ride the wave.  Killer
>Micro powered systems are no longer just more cost effective; for
>scalar application codes they are faster...
>
	Since my head is currently in Minnesota, I'd say it might be in
	a snow bank but definitely NOT in the sand. What makes you think
	the bigger systems won't adopt the same technology as the "killer
	micros", with the costs coming down accordingly? How well will your
	scalar 6000 do on HUGE data sets that require movement to and from I/O?
	The MIPS performance of the 6000 may well beat a super or mainframe, but
	what about scalar problems that require heavy I/O? Will your
	low cost workstation be able to handle those problems better?

	Supercomputers and mainframes ARE GREAT buys when LOTS of
	users need to be serviced. You'd be foolish to think 1000 users
	would be best served by networked workstations maxed out with disk
	and memory so they can run at top speed. That situation requires a
	hierarchy of disk, CPU and memory networked together very carefully.

	My original point is being totally ignored here:

	MIPS is useless if the data can't flow in and out of the CPU
	at the rating of the CPU. The "Killer Micro" is a glorified oscillator
	when it has to wait for I/O to complete. DON'T use a diskless
	"Killer Micro" low cost workstation to try to do REAL work. Let the
	manufacturer nickel and dime you for fast disks and fast memory in
	vast quantities.  While the MIPS argument might work on the
	ignorant IBM PeeWee masses, technical people know better than to
	just look at one aspect of a problem and think the problem solved
	based only on that one aspect/criterion. When you solve a problem with
	a computer you have to weigh MIPS vs memory vs disk vs networking vs ??.
	You'll screw yourself over BIG TIME if you totally ignore any of the 4
	in heavy favor of 1 or 2 of the factors.
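
	[Editorial sketch, not part of the original posting: the "slowest
	sub-system" point as a toy model in C. Charge each subsystem for
	the demand the job puts on it and the weak link dominates; every
	number below is invented purely for illustration.]

	#include <stdio.h>

	int main(void)
	{
	    double instr   = 1e12;    /* instructions the job executes */
	    double io_mb   = 40000.0; /* megabytes moved to and from disk */
	    double net_mb  = 1000.0;  /* megabytes moved over the network */

	    double mips    = 50.0;    /* CPU rate, millions of instr/sec */
	    double disk_bw = 1.0;     /* sustained MB/s to disk */
	    double net_bw  = 0.5;     /* sustained MB/s over the network */

	    double cpu_t = instr / (mips * 1e6);
	    double t     = cpu_t + io_mb / disk_bw + net_mb / net_bw;

	    printf("estimated job time: %.0f seconds\n", t);
	    printf("the CPU is busy for only %.0f%% of it\n",
	           100.0 * cpu_t / t);
	    return 0;
	}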

	This is my point; it looks like I picked the wrong article to bring it
	out on.

	'Nuff said before we waste bandwidth on a subject most MIPS junkies
	will "stick their heads in the sand" on...

			-Rob

I speak for no one but myself; they'd ignore me anyway...

brooks@maddog.llnl.gov (Eugene Brooks) (12/29/89)

In article <3091@umn-d-ub.D.UMN.EDU> rhealey@ub.d.umn.edu (Rob Healey) writes:
>	Could also be the fact that MIPS has some of the best compiler
>	technology around. If I remember correctly, the Cray C compiler is a
>	pcc derivative; YUCK-O-RAMA.
No, a high quality optimizing (and vectorizing, for that matter) C compiler
was used on the Cray.  It was the LLNL C/Civic hybrid compiler, which uses
the same back end and optimizer as our Civic Fortran compiler.
The compiler was not a PCC derivative.  The code quality on the Cray was
very good; the poor Cray supercomputer just couldn't be made to go faster at
reasonable coding cost.  We could have gotten another 50% out of the Cray
in speed for 6 months of coding work, and possibly a factor of 2 in one
man-year.  The R6000 just compiled and ran the code 3.3 times faster.  What
choice would a sensible buyer of computer time make here???

>	What makes you think
>	the bigger systems won't adopt the same technology as the "killer
>	micros", with the costs coming down accordingly?
I do think that "big systems" will adopt Killer Micro technology.
Supercomputer system integrators that don't will not survive
the coming decade, and I personally doubt that they will survive the
next 5 years.  No one will survive the attack of the Killer Micros,
except those system integrators and users who choose to ride the wave.



>	scalar 6000 do on HUGE data sets that require movement to and from I/O?
>	The MIPS performance of the 6000 may well beat a super or mainframe, but
>	what about scalar problems that require heavy I/O? Will your
>	low cost workstation be able to handle those problems better?
Yes, but I am not talking about a low cost workstation here.  I am referring
to a system with a respectable number of Killer Micro processors.  Vendors
are integrating high performance and high reliability disk systems out
of commodity disks, just as vendors will integrate supercomputers out of
Killer Micros.  These disk systems are appearing on boxes in a price
range which is dirt cheap compared to traditional supercomputers but which
is much more expensive than what you would put on a desk.  These are
time shared computers for large numbers of users.
>
>	Super computers and mainframes ARE GREAT buys when LOTS of
>	users need to be serviced. You'd be foolish to think 1000 users
I think that that cold weather has gotten to your neurons.

>	My original point is being totally ignored here:
Your original point is not being ignored; you are ignoring the high
performance I/O systems that are appearing on Killer Micro powered systems.
These high performance I/O systems are built of commodity disk drives
and are much cheaper, while being faster, than the high performance disk
drives used on supercomputers.


brooks@maddog.llnl.gov, brooks@maddog.uucp

rhealey@umn-d-ub.D.UMN.EDU (Rob Healey) (12/29/89)

In article <42600@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>I do think that "big systems" will adopt Killer Micro technology.
>Supercomputer system integrators that don't will not survive
>the coming decade, and I personally doubt that they will survive the
>next 5 years.  No one will survive the attack of the Killer Micros,
>except those system integrators and users who choose to ride the wave.
>
	You sound like your definition of supercomputer is static. As we
	all know, micro, mini, main and super are defined in terms relative
	to the others. Traditionally every level snarfs ideas from the
	level above as technology enables it to be done. Whatever technology
	a micro uses can obviously be used on a faster and more expensive
	scale in a super; why this is not already the case is probably due
	to the fact that the supers aren't threatened enough yet.

>Yes, but I am not talking about a low cost workstation here.  I am referring
>to a system with a respectable number of Killer Micro processors.  Vendors
>are integrating high performance and high reliability disk systems out
>of commodity disks, just as vendors will integrate supercomputers out of
>Killer Micros.  These disk systems are appearing on boxes in a price
>range which is dirt cheap compared to traditional supercomputers but which
>is much more expensive than what you would put on a desk.

	Hmmm, parallel OS technology, REAL stable stuff once you get above
	a dozen or so CPUs... Again, anything in the I/O systems can easily
	be improved upon in the next level up. The need for a computer with
	abilities beyond the killer would still exist; the killer would still
	not eliminate the super. The super wouldn't necessarily be a bunch
	of micros thrown together in parallel either.

>you are ignoring the high
>performance I/O systems that are appearing on Killer Micro powered systems.
>These high performance I/O systems are built of commodity disk drives
>and are much cheaper, while being faster, than the high performance disk
>drives used on supercomputers.
>I think that that cold weather has gotten to your neurons.
>
	NOPE, I have high powered heaters for the neurons. B^) In order for the
	killer micros to beat out supers, supers would have to stand still in
	parallel OS, I/O subsystems and implementation technologies. I sincerely
	doubt that will happen; the scale will shift as it always has. Micros
	will still be less powerful than supers; the definition of the terms
	makes that certain.

	There will always be supercomputers; there will just be more people
	using killer micros, since that's all they can afford for what they
	need to do. But by the same token, there will always be a few
	problems that the killer micros just can't quite cut, and this is
	where, by definition, supercomputers are usually used.

	As far as commodity disk drives go, let's hope our banks don't decide
	that commodity disks are more cost effective; OOOOOPS, lost a bit
	or two there, Joe... Problem solved by volume shadowing and error
	correction technologies, but geez, that sounds familiar from somewhere...
	Again, the techniques for correction and detection can be improved
	if your data warrants it.
	
	To overuse yet another big boy phrase: One way or another, you get what
	you pay for.

	The killer micros will always be a notch or two below the killer supers
	in the real world. Just because supers haven't been threatened enough
	from below doesn't mean they won't bite back hard when they are.

	The 6000 in the original article was a VERY state of the art
	pre-production CPU; compare its performance to a VERY state of
	the art pre-production super and see what the results are.

	Let's continue the banter via e-mail, I'm sure comp.arch is sick of
	us already.

			-Rob

chris@mimsy.umd.edu (Chris Torek) (12/29/89)

In article <42600@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov
(Eugene Brooks) writes:
>... I am not talking about a low cost workstation here.

(Note that the R6000-based MIPS system is expected to be in the $100k
to $200k range, if I remember right: rather a bit more than your desktop
$10k micro.)

>Your original point is not being ignored; you are ignoring the high
>performance I/O systems that are appearing on Killer Micro powered systems.
>These high performance I/O systems are built of commodity disk drives
>and are much cheaper, while being faster, than the high performance disk
>drives used on supercomputers.

A note of caution here: they are cheaper, but not (yet) faster.  The
CM Datavault (or whatever they are calling it these days) runs 39 SCSI
disks in parallel (32 bits + ECC).  These are doing fairly well if they
sustain > 1 MB/s each, so a Datavault gets ~32 MB/s.  With IPI disks
expected to do 8 MB/s each in the near future, a Datavault style system
could do 256 MB/s: still slower than Cray, but quite respectable.
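
[Editorial sketch: the arithmetic above in C, assuming the stated
per-drive rates; of the 39 drives, 32 carry data bits and 7 the ECC.]

	#include <stdio.h>

	int main(void)
	{
	    int    data_drives = 32;   /* 32 data bits; 7 more hold ECC */
	    double scsi_rate   = 1.0;  /* MB/s sustained per SCSI drive */
	    double ipi_rate    = 8.0;  /* MB/s expected per IPI drive */

	    printf("SCSI Datavault: ~%.0f MB/s\n", data_drives * scsi_rate);
	    printf("with IPI:       ~%.0f MB/s\n", data_drives * ipi_rate);
	    return 0;
	}
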
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

rpeglar@csinc.UUCP (Rob Peglar x615) (12/29/89)

Eugene (Brooks) has already responded to this, quite elegantly.  Just
wanted to throw in my $.02.  

In article <3091@umn-d-ub.D.UMN.EDU>, rhealey@umn-d-ub.D.UMN.EDU (Rob Healey) writes:

> 	Since my head is currently in Minnesota, I'd say it might be in
> 	a snow bank but definitely NOT in the sand. What makes you think
> 	the bigger systems won't adopt the same technology as the "killer
> 	micros", with the costs coming down accordingly? How well will your
> 	scalar 6000 do on HUGE data sets that require movement to and from I/O?
> 	The MIPS performance of the 6000 may well beat a super or mainframe, but
> 	what about scalar problems that require heavy I/O? Will your
> 	low cost workstation be able to handle those problems better?
                                                              ^^^^^^

If the meaning of "better" is absolute performance, no, not today - but
as Eugene says, "ride the wave".  No will be yes, and Soon.

If, on the other hand, the meaning is "price/performance", most assuredly
yes, yes, yes - today, tomorrow, and forever more.

There are many meanings.  Be specific.

As far as being in the snow banks, I'm there too - fortunately, it's my
feet, not my head :-)


> 
> 	Supercomputers and mainframes ARE GREAT buys when LOTS of
> 	users need to be serviced. You'd be foolish to think 1000 users
> 	would be best served by networked workstations maxed out with disk
> 	and memory so they can run at top speed. That situation requires a
> 	hierarchy of disk, CPU and memory networked together very carefully.

Look around you.  The very same "systems" - in the broad sense of the word,
i.e. many components - are indeed overtaking centralized, vertical machines.
There was a long thread on this topic (degree of centralization) a while
back.  Personally, I held the same opinion (as described above) for many
years, and have since changed my mind.  Rather, opened my mind.

As in almost every problem to be solved, there are many solutions.  Use
what's best for you - and don't be afraid to change.  If a large super
serving 200 people gives those people the most "numerator" (MIPS, Flops,
I/Os, etc.) for the "denominator" (dollars, time, effort, etc.) then great,
use the super.  If not, swallow hard and accept the Killer Micro as a
fact of life.

> 
> 	My original point is being totally ignored here:
> 
> 	MIPS is useless if the data can't flow in and out of the CPU
> 	at the rating of the CPU. The "Killer Micro" is a glorified oscillator
> 	when it has to wait for I/O to complete. DON'T use a diskless
> 	"Killer Micro" low cost workstation to try to do REAL work. Let the
> 	manufacturer nickel and dime you for fast disks and fast memory in
> 	vast quantities.  While the MIPS argument might work on the
> 	ignorant IBM PeeWee masses, technical people know better than to
> 	just look at one aspect of a problem and think the problem solved
> 	based only on that one aspect/criterion. When you solve a problem with
> 	a computer you have to weigh MIPS vs memory vs disk vs networking vs ??.
> 	You'll screw yourself over BIG TIME if you totally ignore any of the 4
> 	in heavy favor of 1 or 2 of the factors.
> 
> 	This is my point; it looks like I picked the wrong article to bring it
> 	out on.

Editorial note - you aren't scoring any points for phrases like "REAL work",
"PeeWee masses", and "BIG TIME".  

Anyway, you should carefully look at the issue of CPU starvation on some
of the very machines you tout - like the Cray-2.  Some (not all) of the
smaller machines exhibit much less CPU starvation.  The ETA-10 is (was)
another notable example of real and potential CPU starvation as an
architectural flaw.


There will always be room for big supers.  The room, however, is becoming
smaller.  Don't get squeezed.

Rob
-- 
Rob Peglar	Control Systems, Inc.	2675 Patton Rd., St. Paul MN 55113
...uunet!csinc!rpeglar		612-631-7800

The posting above does not necessarily represent the policies of my employer.

mccalpin@stat.fsu.edu (John Mccalpin) (12/30/89)

In article <158@csinc.UUCP> rpeglar@csinc.UUCP (Rob Peglar x615) writes:
>
>Anyway, you should carefully look at the issue of CPU starvation on some
>of the very machines you tout - like the Cray-2.  Some (not all) of the
>smaller machines exhibit much less CPU starvation.  The ETA-10 is (was)
>another notable example of real and potential CPU starvation as an
>architectural flaw.

It seems odd to mention the Cray-2 and the ETA-10 in the same sentence
with regard to "CPU starvation".  It seems to me that the ETA-10 is a
much more balanced design with regard to memory bandwidth -- I don't
know about I/O speeds past the shared memory, though...  With the most
recent release of the operating system, we have gotten paging rates of
>500 MB/s on thrashing jobs.  This is almost 1/2 of the physical I/O
bandwidth to shared memory.  Earlier system software certainly left the
cpu hungry, but the hardware is capable of some pretty tremendous
bandwidth, and the software is finally starting to catch up....

>There will always be room for big supers.  The room, however, is becoming
>smaller.  Don't get squeezed.

When Cray Research was founded, they estimated a world market for
supercomputers that was in the neighborhood of 40 units.  Maybe they
weren't so far off after all!

Anyway, here at FSU we have been pushing the KILLER MICRO bandwagon,
too.  Let's get all those !@#$%^&* scalar jobs _off_ of our vector
machines and onto the killer micros where they belong....  Then those
of us who can effectively use the vector machines will have more time
available.

By the way, I estimate that the (soon-to-be-installed) FSU Cray
Y/MP-4/432 will only be about 125 times as fast as the new MIPS "KILLER
MICRO from HELL" on my code.  Yep, they are closing the gap all right....

>Rob Peglar	Control Systems, Inc.	2675 Patton Rd., St. Paul MN 55113
>...uunet!csinc!rpeglar		612-631-7800

mcdonald@aries.scs.uiuc.edu (Doug McDonald) (12/30/89)

>> 
>> 	My original point is being totally ignored here:
>> 
>> 	MIPS is useless if the data can't flow in and out of the CPU
>> 	at the rating of the CPU. The "Killer Micro" is a glorified oscillator
>> 	when it has to wait for I/O to complete. DON'T use a diskless
>> 	"Killer Micro" low cost workstation to try to do REAL work. Let the
>> 	manufacturer nickel and dime you for fast disks and fast memory in
>> 	vast quantities.  While the MIPS argument might work on the
>> 	ignorant IBM PeeWee masses, technical people know better than to
>> 	just look at one aspect of a problem and think the problem solved
>> 	based only on that one aspect/criterion. 

Well, I am a member of both the IBM PeeWee masses and a "technical person".
This comment is so obvious that it should go without saying - but I guess
it isn't obvious to the above poster.

>>technical people know better than to
>> 	just look at one aspect of a problem and think the problem solved
>> 	based only on that one aspect/criterion. When you solve a problem with
>> 	a computer you have to weigh MIPS vs memory vs disk vs networking vs ??.
>> 	You'll screw yourself over BIG TIME if you totally ignore any of the 4
>> 	in heavy favor of 1 or 2 of the factors.
>> 

This is quite true - BUT - and it's a big but - when you DO look at the big
picture, you will find that some people need only MIPS, others
(the IBM mainframe accounting crowd?) mainly I/O bandwidth, and others
need abnormally large memory. Once you get to the final decision of
benchmarking systems to buy, you may well want to weigh one aspect
at 90% of the total decision. The problem with the IBM mainframes and
the Cray supercomputers is that they have very large, very expensive
I/O systems that some people RIGHT NOW simply don't need. That is
(one reason) why killer micros are selling so very well.

>>DON'T use a diskless
>> 	"Killer Micro" low cost workstation to try to do REAL work. 

It is this statement that I find offensive. For some people it is
indeed what is needed for "real work". I once had the fastest computer
in the world run for 16 hours with ZERO "I" requests (literally) and
only a few kilobytes of "O". (This was long ago on the Illiac IV and
it was a miracle that it didn't die in the 16 hours - but it was free.)

Doug McDonald

brooks@maddog.llnl.gov (Eugene Brooks) (12/30/89)

In article <787@stat.fsu.edu> mccalpin@stat.fsu.edu (John Mccalpin) writes:
>By the way, I estimate that the (soon-to-be-installed) FSU Cray
>Y/MP-4/432 will only be about 125 times as fast as the new MIPS "KILLER
>MICRO from HELL" on my code.  Yep, they are closing the gap all right....
Would you care to enlighten the masses with regard to the basis for this
estimate?
brooks@maddog.llnl.gov, brooks@maddog.uucp

brooks@maddog.llnl.gov (Eugene Brooks) (12/30/89)

In article <1989Dec28.000031.14774@oracle.com> csimmons@oracle.UUCP (Charles Simmons) writes:
>The comparison would be slightly more interesting if an Amdahl 5990
>were compared to the R6000.  For scalar processing, Amdahl mainframes
>are (were?) generally considered the fastest obtainable...

I am never one to pass up a chance to collect data...  Let's do it!
Has anyone got access to an Amdahl 5990 with a decent C compiler?

brooks@maddog.llnl.gov, brooks@maddog.uucp

mccalpin@stat.fsu.edu (John Mccalpin) (12/31/89)

In article <787@stat.fsu.edu> I wrote:
>By the way, I estimate that the (soon-to-be-installed) FSU Cray
>Y/MP-4/432 will only be about 125 times as fast as the new MIPS "KILLER
>MICRO from HELL" on my code.  Yep, they are closing the gap all right....

In article <42701@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks)
asked:
>Would you care to enlighten the masses with regard to the basis for
>this estimate? 
>brooks@maddog.llnl.gov, brooks@maddog.uucp

The estimate is based on the _observed_ performance of an 8-processor
Cray Y/MP vs a 25 MHz R-3000 (SGI 4D/2x0).  The speed ratio in that
case is 536:1, and this Cray is an internal machine with a 6.5 ns
clock, rather than the 6 ns clock that will be installed at FSU.

So applying some scaling suggests that a 4-cpu Cray Y/MP at 6 ns will
be about 290 times as fast as the R-3000 box.  Then scale the MIPS cpu
speed by the ratio of the clocks of the R-6000 to R-3000 to reduce this
ratio to about 120:1.  (I am assuming a 60 MHz clock on the R-6000 ---
I don't know what the exact value will be....).
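
[Editorial sketch of the same scaling arithmetic, in C; the 60 MHz
R-6000 clock is the assumption stated above.]

	#include <stdio.h>

	int main(void)
	{
	    double observed = 536.0;  /* 8-cpu Y/MP (6.5 ns) vs 25 MHz R-3000 */
	    double ymp4, vs_r6000;

	    /* scale to 4 cpus and to the 6 ns production clock */
	    ymp4 = observed * (4.0 / 8.0) * (6.5 / 6.0);

	    /* credit the micro with the assumed R-6000/R-3000 clock ratio */
	    vs_r6000 = ymp4 / (60.0 / 25.0);

	    printf("Y/MP-4 (6 ns) vs R-3000: %.0f to 1\n", ymp4);     /* ~290 */
	    printf("Y/MP-4 (6 ns) vs R-6000: %.0f to 1\n", vs_r6000); /* ~121 */
	    return 0;
	}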

Since the code is highly parallelizable, a multi-processor R-6000 based
machine will show good speedups up to about 16 processors.  The experience
on the Cray and Ardent machines suggests that a speedup of 12x should be
possible on a 16-cpu system.  However, multi-processor Cray Y/MP's exist
today, and multi-processor R-6000 machines do not....

The code is a hybrid finite-element/finite-difference ocean circulation
model written in portable FORTRAN-77.  The calculations are all done in
64-bit precision, and require 64 bits for reasonable accuracy.

This is all just an excuse to remind Eugene :-) that some users will
still be able to make effective use of vector supercomputers.  In
price/performance ratios, the scalar KILLER MICROs are not even
significantly ahead of the traditional supercomputers on optimal
codes.  They are certainly not _yet_ competitive with regard to
turnaround time on large vector jobs, though I agree that that will
change soon, as 8-16 cpu machines in the R-6000 class become available.

My next project is porting this code to a Connection Machine CM-2.
I anticipate about the same performance as the 8-processor Y/MP,
but in a much more scalable architecture, and at about 1/4 of the
price.

brooks@maddog.llnl.gov (Eugene Brooks) (12/31/89)

In article <788@stat.fsu.edu> mccalpin@stat.fsu.edu (John Mccalpin) writes:
>So applying some scaling suggests that a 4-cpu Cray Y/MP at 6 ns will
>be about 290 times as fast as the R-3000 box.  Then scale the MIPS cpu
So to really compare one processor to one processor, as any reasonable person
would do, we divide the 290 by 4 to get a ratio of 72 for the 6 ns Y to the
R3000.  This is the kind of single cpu speed ratio that we see here,
and expect at this point, for codes running near 100% vectorization levels.
If you take the manufacturer's hint of a speed ratio of 2.5 between
the R3000 and the R6000, you get a factor of 29 for the YMP vs the R6000.
Now the ONE data point I have indicates that the ratio between the R3000 and
the R6000 can be as good as 2.7, so I am inclined to believe the
manufacturer's estimate, which is lower.

I do not know what kind of a deal you fellows got on a Y, but
an 8 processor Y with 32 megawords (that's 32 megabytes per cpu)
cost (system cost, disk drives included) around 3 million per processor.
Yes, we are looking at increasing the size of the memory of the one
here, at a cost I don't care to mention in an open forum.
The single cpu R6000 is going to be between 100K and 200K, depending
on whether you go for more memory per cpu, and many gigabytes
of disk.  The bottom line: roughly 30 times the speed for 30 times
the cost for code which is fully vectorized on the Y.  There is an absolute
performance advantage but no cost-performance advantage.  If your
code is not 99% vectorized, however, you are very foolish to run
it on a traditional supercomputer cpu, as you correctly point out.
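
[Editorial sketch, using Eugene's round numbers; the 100K figure is the
low end of his quoted price range.]

	#include <stdio.h>

	int main(void)
	{
	    double ymp4_vs_r3000 = 290.0;           /* from previous posting */
	    double per_cpu  = ymp4_vs_r3000 / 4.0;  /* one Y cpu: ~72 to 1 */
	    double vs_r6000 = per_cpu / 2.5;        /* manufacturer's hint */

	    double y_cost   = 3.0e6;   /* per Y cpu, disk drives included */
	    double km_cost  = 1.0e5;   /* single cpu R6000 box, low end */

	    printf("speed, Y cpu vs R6000: %.0f to 1\n", vs_r6000);  /* 29 */
	    printf("cost,  Y cpu vs R6000: %.0f to 1\n",
	           y_cost / km_cost);                                /* 30 */
	    return 0;
	}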


>This is all just an excuse to remind Eugene :-) that some users will
>still be able to make effective use of vector supercomputers.  In
I pointed out in my posting that Killer Micros have overrun traditional
supercomputers in scalar performance.  I qualified this very explicitly
in my posting.  The notion that I need to be reminded that traditional
supercomputers are still hanging in there for codes which are nearly 100%
vectorized is silly.

brooks@maddog.llnl.gov, brooks@maddog.uucp

mccalpin@stat.fsu.edu (John Mccalpin) (12/31/89)

In article <788@stat.fsu.edu> I wrote:
>So applying some scaling suggests that a 4-cpu Cray Y/MP at 6 ns will
>be about 290 times as fast as the R-3000 box.  Then scale the MIPS cpu....

To which brooks@maddog.llnl.gov (Eugene Brooks) replied:
>So to really compare one processor to one processor, as any reasonable person
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^
>would do, we divide the 290 by 4 to get a ratio of 72 for the 6 ns Y to the
 ^^^^^^^^
>R3000. [...details deleted...] you get a factor of 29 for the YMP vs the R6000.

So, I am not a reasonable person?  I compared a configuration of the
Cray which is _smaller_ than the one I ran on with the only
configuration of the MIPS product that I have even heard of.  The MIPS
is not even announced yet as a single-processor, so it is giving a
slight advantage to the killer micro, since it is comparing a delivered
system to an unannounced one....

Maybe I should use single-cpu performance comparisons with my Connection
Machine results? :-)


>The bottom line: roughly 30 times the speed for 30 times
>the cost for code which is fully vectorized on the Y.  There is an absolute
>performance advantage but no cost-performance advantage.

If the MIPS box had enough memory and disk to run the same jobs that I
run on the Cray, then the Cray should be about 2 times more
cost-effective in that naive measure.  Of course if I have a job that
takes 100 hours on an 8-processor Y/MP, then I would have to wait for
59 weeks on the (almost) equally cost-effective "KILLER MICRO from
HELL".

>If your code is not 99% vectorized, however, you are very foolish to run
>it on a traditional supercomputer cpu, as you correctly point out.

Well, I didn't say "very foolish", but as a taxpayer I would prefer
people to use the more expensive of the government-owned machines only for
jobs for which they are reasonably cost-effective....


I wrote:
>This is all just an excuse to remind Eugene :-) that some users will
>still be able to make effective use of vector supercomputers.  In

Eugene replied:
>I pointed out in my posting that Killer Micros have overrun traditional
>supercomputers in scalar performance.  I qualified this very explicitly
>in my posting.  The notion that I need to be reminded that traditional
>supercomputers are still hanging in there for codes which are nearly 100%
>vectorized is silly.
>brooks@maddog.llnl.gov, brooks@maddog.uucp

That's what the smiley face was there for....

By the way, the most cost-effective machine on my code is the new
Stardent 3000.  It runs at about 1/15 of the speed of the Cray on a
per-cpu basis and is less than 1/50 of the cost....  Too bad I can't
afford one!

shekita@provolone.cs.wisc.edu (E Shekita) (01/01/90)

Speaking of killer micros from hell: In case anyone missed it, 
MIPS went public recently on the OTC market. 

brooks@maddog.llnl.gov (Eugene Brooks) (01/01/90)

In article <791@stat.fsu.edu> mccalpin@stat.fsu.edu (John Mccalpin) writes:
>In a short series of articles, Eugene Brooks and I have been flaming
>back and forth (in a reasonably light-hearted sort of way) about the
>relative merits of vector supercomputers vs KILLER MICROS.
I think that we can cut back on the flames a bit....

Actually, this line of discussion started out with a posting of a real
measurement for a specific "very scalar" code on the XMP4/16 CPU and on an
R6000.  I speculated that the R6000 is the fastest single CPU computer in the
world on this specific code.  I will provide the code to any person who would
like to run it on another machine and disprove this speculation.  I will even
take accurate simulation results for any traditional supercomputer which will
BE DELIVERED in the same time frame as the R6000's lifetime, which I estimate
will come to a close one year from now.

I also speculated that the R6000 is the first Killer Micro to cleanly overrun
traditional supercomputers for scalar dominated computation.  I suggest that
we compare the performance of the 5 scalar LLNL loops for the YMP and the
R6000, when MIPS cares to release the figures, as a way to decide this
question.  We should also compare the R6000 to the Japanese machines which
are currently on the market as well; I understand that their scalar
performance is quite impressive.

The notion that the R6000 is not really here yet, that it is not yet being
delivered, is a red herring.  Its lifetime will be over, having been replaced
by much meaner hardware, before any of the next generation of traditional
supers are ready for benchmarking.

How well KMs are doing on vectorizable workloads is, for the purposes of this
discussion, a red herring.  I prefer to wait for the appropriate time
to discuss KMs overrunning traditional supercomputers for vector workloads.
Killer Micros have been quite brilliant in their strategy of market conquest;
they have always waited for a clean unambiguous kill of their prey before
visibly moving into the fray.  Let's wait till they make their move to worry
about splitting hairs on the issue.

brooks@maddog.llnl.gov, brooks@maddog.uucp

rpeglar@csinc.UUCP (Rob Peglar x615) (01/02/90)

In article <787@stat.fsu.edu>, mccalpin@stat.fsu.edu (John Mccalpin) writes:
> In article <158@csinc.UUCP> rpeglar@csinc.UUCP (Rob Peglar x615) writes:
> >
> >Anyway, you should carefully look at the issue of CPU starvation on some
> >of the very machines you tout - like the Cray-2.  Some (not all) of the
> >smaller machines exhibit much less CPU starvation.  The ETA-10 is (was)
> >another notable example of real and potential CPU starvation as an
> >architectural flaw.
> 
> It seems odd to mention the Cray-2 and the ETA-10 in the same sentence
> with regard to "CPU starvation".  It seems to me that the ETA-10 is a
> much more balanced design with regard to memory bandwidth -- I don't
> know about I/O speeds past the shared memory, though...  With the most
> recent release of the operating system, we have gotten paging rates of
> >500 MB/s on thrashing jobs.  This is almost 1/2 of the physical I/O
> bandwidth to shared memory.  Earlier system software certainly left the
> cpu hungry, but the hardware is capable of some pretty tremendous
> bandwidth, and the software is finally starting to catch up....

Sounds like the work of Chris' group (particularly JPH) is finally bearing
fruit - seven months too late..... :-(

McCalpin is correct about the ETA-10 being a "more" balanced design.  Let's
take a look at the ETA-10 from the "external" memory perspective - ignoring
the "internal" (e.g. RNI) paths from 1st level store to CPU(s).  Take
my word for it, the internal paths from 1st level store to the CPUs are
sufficient.  Otherwise, multi-pipe operations would not be possible.
ETA-10 shared memory (SM) (2nd level store) can feed central memory (CM) 
(1st level store) at the rate of one 64-bit word per clock.  The CPU can
compute at the rate of needing four 64-bit operands (input) per clock
(2 pipes each doing M-M vector A op vector B).  Assume for this case that
the input operands are considered "used" after the computation, i.e. they
won't be needed (ever) again.  Thus, to avoid CPU starvation from the
hardware perspective, the SM-->CM bandwidth is too small by a factor of
four.  If the "software" (OS or application) can manage its own memory
correctly (i.e. four SM-->CM transfers of N words for every computation
on N words) then the computation can continue at peak forever.  Alas,
Babylon.  Peak rates are not sustainable.  This problem becomes even worse
if one needs third level store (typ. disk) to refresh SM in a similar
manner.  This is exacerbated in the liquid cooled machines, typically
because the ratio of IOUs to SM size was too low.  Current hardware can
only extract about 70% of the max IOU-->SM bandwidth due to the handshaking
across the IOI.  Current (1.1.5) software can only get about 70% of that
through the file system.  E-mail me for more discussion.
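
[Editorial sketch of the bandwidth bookkeeping above, with the rates as
stated in the posting.]

	#include <stdio.h>

	int main(void)
	{
	    double sm_to_cm  = 1.0;   /* 64-bit words/clock, SM -> CM */
	    double cpu_needs = 4.0;   /* 2 pipes, 2 input operands each */
	    double ioi_eff   = 0.70;  /* handshaking loss across the IOI */
	    double fs_eff    = 0.70;  /* 1.1.5 file system overhead */

	    printf("SM->CM path is short by a factor of %.0f\n",
	           cpu_needs / sm_to_cm);
	    printf("file system sees ~%.0f%% of raw IOU->SM bandwidth\n",
	           100.0 * ioi_eff * fs_eff);
	    return 0;
	}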

> 
> When Cray Research was founded, they estimated a world market for
> supercomputers that was in the neighborhood of 40 units.  Maybe they
> weren't so far off after all!

Probably only a factor of ten.

> 
> Anyway, here at FSU we have been pushing the KILLER MICRO bandwagon,
> too.  Let's get all those !@#$%^&* scalar jobs _off_ of our vector
> machines and onto the killer micros where they belong....  Then those
> of us who can effectively use the vector machines will have more time
> available.

Amen.

> 
> By the way, I estimate that the (soon-to-be-installed) FSU Cray
> Y/MP-4/432 will only be about 125 times as fast as the new MIPS "KILLER
> MICRO from HELL" on my code.  Yep, they are closing the gap all right....

See the comment from Eugene Brooks.  The key words, of course, are "my
code" ... there are no absolute answers.  Once again, the "gap" of
absolute performance is there.  The "gap" of price/performance, on the
other hand, is now in the Killer Micro camp, for enough codes to make
it interesting...

John, if you want to discuss more, e-mail...


Rob
-- 
Rob Peglar	Control Systems, Inc.	2675 Patton Rd., St. Paul MN 55113
...uunet!csinc!rpeglar		612-631-7800

The posting above does not necessarily represent the policies of my employer.

desnoyer@apple.com (Peter Desnoyers) (01/04/90)

Just a few thoughts on this ongoing debate -

  Eugene Brooks is claiming that the R6000 is (probably) faster than any 
other computer for ONE specific simulation that he runs. He describes this 
as a packet-switched network simulation, almost completely scalar. 

Claims that {supercomputer X} runs {FP app Y} faster don't alter this 
claim. 

Claims that {super X} has much more memory or much more I/O bandwidth than 
the R6000 are probably irrelevant as well, as event-driven simulations (I 
assume a PSN simulator would be event-driven) may not need the amounts of 
memory and I/O that other types of simulations require.

[Gross generalization. However, consider that in many scientific codes - 
e.g. weather simulations - you can increase the simulated detail, and 
hence accuracy and memory requirements, by decreasing the grid size. To do 
the equivalent with an event-driven simulation may require describing the 
finer detail yourself in code.]
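
[To make the contrast concrete, an editorial sketch in C - a minimal
event-driven simulator skeleton, not anything from Eugene's code. The
working set is just a time-ordered event list, and the inner loop is
pointer chasing and branching: scalar work with little need for vector
hardware or huge memory.]

	#include <stdio.h>
	#include <stdlib.h>

	struct event {
	    double        time;
	    int           node;        /* switch the packet reaches */
	    struct event *next;
	};

	static struct event *queue = NULL;

	/* insert an event, keeping the list ordered by time */
	static void schedule(double time, int node)
	{
	    struct event  *e = malloc(sizeof *e);
	    struct event **p = &queue;

	    e->time = time;
	    e->node = node;
	    while (*p != NULL && (*p)->time <= time)
	        p = &(*p)->next;
	    e->next = *p;
	    *p = e;
	}

	int main(void)
	{
	    double now = 0.0;

	    schedule(1.0, 0);          /* a packet enters node 0 */
	    while (queue != NULL) {
	        struct event *e = queue;
	        queue = e->next;
	        now = e->time;
	        printf("t=%5.2f  packet at node %d\n", now, e->node);
	        if (e->node < 3)       /* hop to the next switch */
	            schedule(now + 0.5, e->node + 1);
	        free(e);
	    }
	    return 0;
	}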

In other words, Eugene may not be comparing apples to oranges; however, he 
is discussing the merits of his apple in a conference full of orange 
growers :-)

                                      Peter Desnoyers
                                      Apple ATG
                                      (408) 974-4469

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (01/15/90)

In article <28674@amdcad.AMD.COM> davec@proton.amd.com (Dave Christie) writes:
>But coming up with faster versions of classic supercomputers
>has (IMHO) been much more difficult and costly, and the resulting
>performance improvements not so spectacular, as compared to micros over
>the past several years.

Yes. A case in point is the new Cyclone processor from Tandem. I'm
not knocking it: I'm sure that it was built by sharp people, and will
be sold successfully. It has the important property that it's bit-
compatible with Tandem's previous stack machines.

However, it was clearly a major effort - they wrote 420 KB of
microcode, and they did in-house metallization of ECL gate arrays.
Nor did the machine come out small:  each CPU+IOP fills three 18 x
18" boards, and the microcode alone takes over a hundred chips.

So, did they get much for all that? No. It only runs at 22.2 MHz,
although in its defence I should add that it often issues two
instructions per clock: at best, the equivalent of roughly 45 million
instructions issued per second.  I don't know the
MIPS/VUPS ratio, but even if the ratio is better than I think, the
Cyclone still isn't as fast as the new ECL RISCs. It's also pretty
well under the wheels of the CMOS steamroller.

Is it reliable? Well, yes, it's a Tandem product. It has parity and
temperature compensation and a diagnostic processor and spare cache
RAMs. But a Killer Micro with the same throughput could be made more
reliable, at a lower price, simply from its reduced chip count.
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science