[comp.arch] MIPS/MFLOPS ratio long; here we go again; sorry

mash@mips.COM (John Mashey) (07/06/89)

1. INTRODUCTION

In article <112807@sun.Eng.Sun.COM> khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS) writes:

1) Some comments about SPARC integer-vs-floating point that seem to
rewrite history from before Keith was at Sun, as well as some comments
about Hot Chips that need some balancing comments (which you can take either as
objective data, or as opposite-bias opinions; your call).

2) ``So the FPU
>integration/implementation variable is tilting towards SPARC (unless
>one assumes that MIPSco is smarter than Ross, Fuji.,BIT, LSI, TI,
>Solb., Prisma and all the others.''
Marketing B.S. doesn't make something ("tilt") true; only being true makes
it true; in any case, in my opinion, the logic (only if MIPS is smarter is there
no tilt towards SPARC) is flawed, and I'll show why.
-------
Some of this discussion inherently contains industry-oriented stuff,
which I'm forced into, as well as some serious technical meat, thank goodness.
If you don't like the former, hit "n" now.

OUTLINE OF REST:
2. KHB's MODEL OF SUN FP TRADEOFFS; ANOTHER MODEL
3. FP  PRESENT, INCLUDING COMPILER ISSUES
4. ANALYSIS OF "TILTING TOWARDS SPARC", INCLUDING HOT CHIPS
4.1 WHAT KHB SAYS
4.2 WHAT MASH SAYS
4.3 HOT CHIPS, GENERAL
4.4 HOT CHIPS, CMOS FPU SESSION
4.5 "TILTING TOWARDS SPARC, UNLESS MIPS SMARTER THAN EVERYBODY" : UNPROVEN

2. KHB's MODEL OF SUN FP TRADEOFFS; ANOTHER MODEL
>In article <596@megatek.UUCP> mark@megatek.UUCP () writes:

>>This seems a little out of whack... it seems that older scientific
>>processors had ratios in the 3-4 range.

>Current SPARC implementations (chips and system) from Sun were
>intended for "more general purpose use" hence the (relatively) narrow
>gap between integer performance on a Cray to a 4/330. While floating
>point is fun (and is typically my reason for existing on a project) I
>spend most of my day doing compiles, editing, running schedtool, and
>other non-FP things. So using the 80-20 rule... the first machines
>should be the ones we need 80% of the time.

FACT:  I admit to a nasty habit of keeping old marketing material
and press clippings, which I believe predate khb's tenure at Sun;
I often keep such things as a reality check.

The following are quotes from the July 87 Sun-4 introductory material:
``Relative to other manufacturer's high-end offerings,
the Sun-4/200 excels in floating-point performance.
In fact, the Sun-4/200 will execute floating-point-intensive applications
faster than the VAX 8800 superminicomputer.'' ....
``...giving users an overwhelming reason to migrate applications that
currently run on supercomputers, minisupers, and superminis onto workstations.''
``..first supercomputing workstation...''
``Sun-4/200 Series is ideally suited for all compute-intensive, floating-point,
or graphics-intensive applications.  The primary markets targeted are high-end
mechanical-CAD (MCAD) applications such as solids modeling and finite element
analysis, electrical-CAD (ECAD) applications including IC and PC layout
and routing; Artificial Intelligence (AI) development, earth resources,
molecular modelling, and other compute-intensive applications.''
``..ideal for applications in the scientific computing and electrical CAD
markets.''

OPINION: FP not important?? Less important for Sun-4s??

OPINION: I think the original assertion (==VAX8800 FP) is probably true, if you
replace Sun-4/200 (1987) by SPARCstation 3xx (1989). As pointed out shortly
thereafter, the VAX 8700 and 8800 are NOT the same: 8800 has 2 8700 CPUs.
It turned out that a Sun-4/200 was usually slower on many real
FP applications than an 8700, (especially if using VMS compilers, which is
what actually runs on most 8700/8800s).  [OPINION] SS3xxs do appear to be better
balanced than Sun-4/2xxs with regard to FP versus integer performance.

3. FP  PRESENT, INCLUDING COMPILER ISSUES
(....why people think MIPS FP is faster than SPARC FP...)
>Compilers are often cited, but according to my weeks of staring at
>huge volumes of data, it seems that the compiler differences are
>minimal on large codes. The current sun compilers are somewhat less
>clever about certain operations, but not enough to explain the
>difference in performance.

I suspect much of the code looks similar, which is not surprising,
given the similarities of the register sets available at any one time,
FP instruction sets that are fairly similar, and IEEE.
At least one SPARC architectural difference was described by Tom Pennello of
Metaware at Hot Chips, but khb failed to mention:
passing FP arguments in the integer registers, and not having
direct moves to/from IU and FP, means that (in C, at least),
saying y = glurp(x), with floats x,y, gives you something like:

(x sitting in FP reg)
	store x to memory; load it to integer register z.
call glurp
	store z to memory; reload it into FP reg; compute
	store result into memory, reload it into integer result reg
return
	store result to memory; reload into FP reg (y)
I have no idea how often this happens; fortunately for SPARC, FORTRAN is
call-by-reference.  Note also that int<->float conversions go
thru a similar drill; that one is truly architectural, not architecture-plus-
language-convention like the previous example (which, even if not architectural,
is probably so wired into things that it would be nontrivial to change).
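
In rough C terms, the cross-domain traffic looks like the sketch below.
The comments are my guess at what a compiler following that convention
has to emit; they are illustrative, not a dump of any particular SPARC
compiler's actual output.

	/* Sketch only: assumes float arguments and results travel in
	 * integer registers, and that there is no direct IU<->FPU move,
	 * so every transfer goes through memory. */
	float glurp(float x)      /* x arrives in an integer register      */
	{                         /* callee: store x, reload into an FP reg */
	    return x * 2.0f;      /* compute in FP regs; store the result,  */
	}                         /* reload it into the integer result reg  */

	float caller(float x)     /* here x is already sitting in an FP reg */
	{
	    /* store x, reload into an integer register, call glurp;
	     * then store the integer-register result and reload it into
	     * an FP register to become y: four round trips through
	     * memory for one scalar call. */
	    float y = glurp(x);
	    return y;
	}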

The main reasons, I think, for the differences are:
	1) The SPARC multi-cycle loads and stores, which is not ISA,
		but SYSTEM architecture and implementation.
	2) The MIPS FPUs have lower cycle counts.
	3) The compiler thing is an open question; I haven't looked at
	much SPARC FP code lately, so I don't personally know.  Maybe
	some UNBIASED third-parties would care to comment and give some DATA.

4. ANALYSIS OF "TILTING TOWARDS SPARC", INCLUDING HOT CHIPS
4.1 WHAT KHB SAYS
>What is interesting is that the benchmarks which SPARC does worst on
>are highly FP and memory intensive (say 30-50% loads and stores).
(See the discussion on DP LINPACK later, which is actually one of the
SS3xx and Sun-4/2xx's best FP benchmarks; SPARC systems have good external
memory systems that are well-suited to memory-intensive applications.)

>MIPSco built their own FPU and tightly coupled it to their IU. This
>resulted in early units which were superior to the SPARC
>implementation philosophy (let's buy whatever is laying around and
>glue it in -- in the first implementations that meant a weitek 1164
>and 1165 and a controller ... "leftovers" from the sun3/fpa project).
>At yesterday's IEEE HOT CHIPS conference, we were treated to three
>papers about dedicated SPARC FPU's in addition to the papers focused
>on FPU's.  BIT is already sampling ECL SPARC chips. So the FPU
>integration/implementation variable is tilting towards SPARC (unless
>one assumes that MIPSco is smarter than Ross, Fuji.,BIT, LSI, TI,
>Solb., Prisma and all the others.

4.2 WHAT MASH SAYS
Sigh.  What does "tilting towards SPARC" mean?  Does it mean that
SPARC is getting ahead, or might be catching up ("tilting back towards
parity")?  I'm tired of this, but I can't let this
argument go past.... I believe SPARC is getting closer, but that doesn't
mean "tilting towards SPARC".

There is nothing wrong, a priori, with the SPARC implementation strategy
(of using some existing FPU parts, and getting to market quickly),
although calling the WTL parts "leftovers" is a somewhat
Sun-centric view of the world, as those parts were used in plenty of
other machines, including early MIPS M/500s (before R2010s existed).
I'd use existing parts to get started, too; in fact, we did.
The original SPARC team was small, and didn't have infinite resources,
so this was all perfectly reasonable.  In retrospect, [OPINION], the
only problem was in not having somebody going like crazy to build a
serious CMOS SPARC FPU early enough, and I have no idea whether somebody
wanted to do this, and wasn't allowed to, or whether the partners didn't
want to, or whether nobody had time to think about it at the right time,
or what.  Maybe we could be enlightened.

In any case, the sequence is (with jiggles of a quarter possible on any date):
	MIPS				SPARC
4Q86	WTL 116x in M/500		WTL 116x in Sun-3
2Q87	R2010 in M/500 socket, M/800
3Q87					WTL 116x in Sun-4
4Q87	R2010 in M/1000
2Q88	R2010 in M/120
4Q88	R3010 in M/2000
1Q89
2Q89					TI8847 in Sun-4 and SS300
					WTL 3170 in SS1

4.3 HOT CHIPS, GENERAL
1) FACT: presentations at conferences are not deliveries of systems.

2) OPINION: The BIT+Sun ECL design looks well-done, with some reasonable
and informed thinking in many places.  Maybe before SPARC victory is declared
by khb on the ECL front, we ought to wait for the first actual ECL systems
to be shipped, and see how they run real programs.  Anant Agrawal's talk
was well-done, and mostly solid technical content (except for "World's
first single chip ECL 32 bit processor" and "World's fastest microprocessor.
80MHz 12.5ns cycle."  If you add "announced" to those, I might agree.:-)
Despite such claims, it didn't give any SPECIFIC performance data (simulations
of real programs).....  There was a good treatment of cache interface,
although a few interesting parts (like actual cache and MMU designs,
and getting enough fast enough SRAM hooked up) of building a
complete system are Left To The Reader.....  Khb might want to ask his
ECL colleagues about some of these issues.  Still, this was a credible
presentation and design, and for reasons that will be obvious sooner or
later, there is reason to expect FP performance to be more similar across
designs than it has been in the past.

3) OPINION: Pete Wilson's Prisma talk was delightful and fascinating; I admit
that MIPS is not, to my knowledge, building a GaAs supercomputer of
the $500K-$1M ilk, so I wish them well.

4) FACT: Solbourne did not present at the conference.
Fujitsu referenced WTL 3170, but didn't otherwise talk about FP that I can
recall.  Cypress/Ross mentioned the CY7C602-FPU (which is, I think, the same as
the TI ....602).

5) That leaves LSI, TI; I guess Weitek is "all the rest", unless
I missed somebody, which is possible.

4.4 HOT CHIPS, CMOS FPU SESSION
khb: "treated to three papers"

FACT: we had a session with 3 CMOS SPARC FPUs (Weitek, TI, LSI),
followed by Earl Killian of MIPS. The session chair introduced Earl as someone
who would not talk about a SPARC FPU.  This comment elicited a
noticeable round of applause from the audience..... perhaps khb would
comment on that reaction to a "treat".

Now, the 3 CMOS SPARC FPU papers described reasonable devices
that in some cases include fairly clever things.  On the other hand,
we were given almost zero serious performance analysis,
or motivational material to say why things were done differently;
the LSIL presentation did include a cycle count comparison, which unfortunately
was not included in the handouts, and I couldn't write it down fast enough,
or I'd repeat it here.  Presumably, if I were a SPARC customer, I might be able
to get enough information on realistic usages and environments to figure
out what programs would run faster with which chip combinations;
such insight was NOT obvious from the presentations.

Khb could do much to turn his comments into real DATA,
and maybe thus offer a thesis that could be analyzed, if he
would do the following:
	a) Gather all of the ACTUAL cycle counts of these various chips,
	and put them in a table like the LSIL speaker showed, and post it here.
	(This data is clearly publicly available, I think.)
	b) Give a clear description of the overlap characteristics of these
	chips.  I think most of them overlap {add/sub/conv, mul/div/sqrt, and
	load/store}, and I don't think any of them are pipelined, but I
	could be wrong.
	c) Give a terse, clear description of these chips in terms of which
	ones are used in which currently-public SPARC systems, and dispel any
	confusion about already-cited benchmark numbers.  [When I read the
	trade press, I get confused, because they talk about things like
	shipping some SS1s with TI parts, but enough WTL parts are now available
	to use them instead, and I have no idea if that's press error, or real,
	and if real, what difference it would make.]
	d) If there are REAL benchmarks, or even simulations of the performance
	of these things that exist somewhere public, point us at them.

MIPS:
	Earl Killian described the R3010 FPU, including a large set of measured
	MFLOPS numbers [Livermore harmonic, geometric, arithmetic];
	Gaussian Elimination [linpack, fortran, rolled, linpack hand-coded,
	1000x1000], Matrix Multiply [50x50 handcoded], Multiply/Add Peak.
	(i.e., all numbers from the Performance Brief).
	He explained, with examples, why we chose low-latency, multiple
	overlapped FP operational units  (the R3010 appears to have
	somewhat more concurrency than some of the SPARC FPUs), rather than
	pipelined ones.  He talked about simulation tradeoffs, like
	simulating Spice (and other large programs) with a tweakable
	simulator to examine the effects of different pipelining
	strategies and latency tradeoffs.  He gave the cycle counts
	for most of the operations.
He also observed that although the 25MHz R3010 was shipped in production
systems 8 months ago (almost a year ago @ 20MHz), and it was just a shrink
of the R2010, which was shipped in production systems over
2 years ago, the CMOS SPARC FPUs still haven't caught up, even the
forthcoming ones.
[MASH: Or, at least, no compelling evidence
was presented that they're going to blow it away, as there was a lot of talk
of handcoded LINPACK inner loop peak performance, sometimes offered
in tables comparing them with measured LINPACKs on real machines....
In fact, I think that only a few of the cycle counts on these
parts are better than the corresponding R3010 ones.  All of them suffer the
(SPARC architectural) lack of direct data path between CPU & FPU.
Again, if khb, or somebody, would post the actual cycle counts, we can see
whether my belief has any validity.]
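
The latency-versus-pipelining point is easy to see in a few lines of C;
the cycle numbers below are invented purely for illustration (they
describe neither the R3010 nor any SPARC FPU), and this is not one of
Earl's examples, just the generic shape of the argument:

	/* The adds into sum form a serial chain: each one needs the
	 * previous result.  A deeply pipelined adder with, say, 6-cycle
	 * latency can deliver only one result every 6 cycles here, no
	 * matter how often it can issue; a non-pipelined 2-cycle adder
	 * delivers one every 2 cycles.  The multiplies, being independent
	 * of one another, can be overlapped with the adds either way. */
	double dot(const double *a, const double *b, int n)
	{
	    double sum = 0.0;
	    int i;
	    for (i = 0; i < n; i++)
	        sum += a[i] * b[i];    /* latency-bound recurrence */
	    return sum;
	}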

Now, somebody might claim [well, they do], that the forthcoming
FPUs are targeted to 33 to 50MHz, (in some cases, people only listed the
timings corresponding to these rates), and that they'll run faster than
any R3010 ever will, AND THAT THEY'LL DO IT WHILE IT STILL MATTERS.
Maybe they will, maybe they won't, but I'd suggest that, to add some
credibility, I'd ask for the following DATA:
	0) Talk about synchronizing the CPU and FPU at these speeds.
	Do you have PLL's, or some other technique, or magic?
	1) What are the access times of the SRAMs needed to
	run at 30ns, 25ns, and 20ns cycle times? (Some of these parts
	were claimed to scale to 50Mhz, so the 20ns is relevant.)
	2) What are the sizes, part-numbers, costs, and availability
	of those parts, and how many do you need? 
	3) What are the rest of the pieces that you need to
	run at those speeds?  and when can you really get them?
The only thing close to answering this question was the Cypress/Ross
chipset description, and I'm not really sure what's happening there,
simply because I have a hard time relating their chip dates to system dates.

Basically, to use the RISCar metaphor, these are simple questions
to see if a million-RPM engine can actually be put into a
{buildable, sellable, maintainable} car, or whether the engine slows down.

SPARC implementation combinations that I've heard of:
	1) Fujitsu FPC + WTL 1164/65 (Sun-4/110, 200) (1987, 1988)
	2) FPU2 (TI 8847+ FPC) for Sun-4/110,200 (1989)
	3) WTL 3170 for LSIL/Fujitsu in SS1 (1989)
	4) TI 8847+FPC in SS3xx (I think), with Cypress 601 IU (1989)
	5) WTL 3171 (coming, to go with Cypress 601s) (1989)
	6) TI TMS390C602 (coming) (which, I think really combines an 8847+FPC),
	to go with Cypress 601s (1989)
	7) LSIL L64814 FPU, coming, which also goes with Cypress 601s, or the
	LSIL IU with that pinout rather than the LSIL SPARC IUs used in SS1s.
(If I've missed anybody, I didn't mean to, and I'm sorry if I'm confused
about any of these: please correct me if I'm wrong).
BTW: as a side note to Sun: if you change FPUs in a system model, where it makes
a performance difference, PLEASE consider giving it a succinct, different
model number, or some identification, so people can know what they're
measuring and label them correctly. 

The corresponding  MIPS sequence is:
	1) R2010, with R2000  (R2xxxAs are R3xxxs in R2xxx packages) (1987, 88)
	2) R3010 (shrunken R2010) with R3000 (which was changed some) (88, 89)

Keith is right: we're horribly outnumbered....still, in the CMOS
world, nobody yet is shipping any SPARC systems that equal a 25MHz R3x pair at
FP benchmarks, and in fact, the 25MHz SS300 (based on minimal data) looks
not much different from a MIPS M/120, which has a 16.7MHz R2xxx pair.
	
4.5 "TILTING TOWARDS SPARC, UNLESS MIPS SMARTER THAN EVERYBODY" : UNPROVEN
Now, I finally get to the comment that set all of this off: ``So the FPU
>integration/implementation variable is tilting towards SPARC (unless
>one assumes that MIPSco is smarter than Ross, Fuji.,BIT, LSI, TI,
>Solb., Prisma and all the others.''

In order to bring sense from this, and to carefully avoid being
misinterpreted, I'll recast this with some logic for clarity:
	A: "....is tilting towards SPARC."
	B: "MIPSco is smarter than ...."
Now, khb's thesis may be rendered symbolically as:
	not-A ==> B  (i.e., that's what "A, unless B" means).
	not-B  [I think: after reading this several times, I think the
		reader is being invited to disbelieve B as impossible,
		or to expect MIPSco to disprove A by proving B (which
		is impossible; there are smart people at lots of companies).
		khb does not SAY this, and if he didn't mean this,
		then you can ignore a lot of this.  However, I have heard
		this syllogism before, so it's not new....]
	therefore, by modus tollens: not-(not-A), i.e., A.

I claim that:
	1) There is, as yet, little DELIVERABLE evidence for A,
	with the exception that SPARCland is ahead of MIPSland in GaAs
	supercomputers.  The ECL verdict isn't in yet; so the rest of
	this discussion covers CMOS, only.
		[I've covered this somewhat above].
	2) Not (not-A ==> B), i.e., there could be plenty of reasons
	why A might not be true, without requiring B to be true.
	3) C, where C: "MIPSco may be able to hold its own in these wars,
		based on past history, and on the requirements for doing so."

Note that my claims are NOT, and should not be misconstrued as:
	1) B (MIPSco is smarter)
	2) E: where E is "MIPS will always be ahead, at every instant."

Now, perhaps khb did not observe a difference in style or strategy
amongst the {SPARC FPUs} vs {MIPS FPU} talks.  I did observe some,
and I add some other data, in defense of assertion C:

[OPINION] Here's some of what it takes to build hot CMOS chips (& the software
they need), in a timely and competitive fashion, and especially for the next
round (the integrated superchips):

a) Good simulation/analysis methodology for looking at design alternatives.
b) Close coupling of chip designers with systems designers, and smart sw folks:
	compiler folks: to answer questions  like "if we make multiply
		X cycles, how much overlap can you get back with a smarter
		pipeline organizer?" 
	OS & graphics folks, to answer all sorts of questions about
		memory hierarchy and other tradeoffs
c) Smart chip designers; we like having logic and circuit folks sitting next
	 to each other; others split it other ways.
d) People who know CMOS technology, yield, reliability, testability, etc.
e) CAD tools; diagnostics; design verification suites, etc, etc.
f) A whole lot of computing power to support all of this.
	(like, the DV folks will use an infinite amount if you let them :-)
g) Good chip technology and production.

Now, only a few of these are "smart people"..... which is what makes
the original khb thesis silly.  To do well, you need to combine at least
most of the above (not necessarily, or even usually, in one company,
but at least in a team).  
		
OK, almost done.
1) I'm NOT claiming MIPSco is smarter than everybody else;
I'm just arguing against the claim that the balance is on SPARC's side
UNLESS MIPSco is smarter than everybody else.

2) There are plenty of reasons why competitive balance swings
back and forth, and only some are smartness.

3) It really is boring having to respond to marketing FUD and
rewritings of history in comp.arch.  There are better things to do, and I'd much
rather see discussion of things like (to pick a simple case):
	Which is better: 2-cycle + & 5-cycle *, or 3-cycle + & 4-cycle *?
	On which kinds of benchmarks? why?
	How much difference does it make in performance? in silicon space?

I.e., things that give DATA, and even better INSIGHT........

4) It would be nice to get some clear DATA posted about the forthcoming
SPARC FPUs.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

khb%chiba@Sun.COM (chiba) (07/07/89)

In article <22792@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>1. INTRODUCTION
>
>In article <112807@sun.Eng.Sun.COM> khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS) writes:
>
>1) Some comments about SPARC integer-vs-floating point that seem to
>rewrite history from before Keith was at Sun,

Fellow asked a question; I "reverse engineered" history as best I
could. It is true that while the RISC revolution was starting I was off
doing Kalman filtering.

> as well as some comments
>about Hot Chips that need some balancing comments (which you can take either as
>objective data, or as opposite-bias opinions; your call).

The bulk of my posting was simply the schedule. The editorial comments
were quite slight. A 400+ line rebuttal seems a bit of overkill.

>
>2) ``So the FPU ...

>Marketing B.S. doesn't make something ("tilt") true; only being true makes
>it true; in any case, in my opinion, the logic (only if MIPS is smarter is there
>no tilt towards SPARC) is flawed, and I'll show why.

Counted number of chip houses, etc. BIT is shipping ECL SPARC samples ...

>-------

>The following are quotes from the July 87 Sun-4 introductory material:
>``Relative to other manufacturer's high-end offerings,
>the Sun-4/200 excels in floating-point performance.

Marketing hype is marketing hype. A Sun4/2xx was much faster for FP
than a VAX. FP was not as key to the design as it is to, say, a Cray YMP.

>3. FP  PRESENT, INCLUDING COMPILER ISSUES
>At least one SPARC architectural difference was described by Tom Pennello of
>Metaware at Hot Chips, but khb failed to mention:
>passing FP arguments in the integer registers, and not having
>direct moves to/from IU and FP, means that (in C, at least),
>saying y = glurp(x), with floats x,y, gives you something like:

Tom's point was well taken if:

1)  Most FP codes pass by value rather than by address
2)  One wants to penalize machines w/o FP hardware a lot, vs. w/FP a bit
3)  There wasn't an effective ABI in place ("official" or not, Solb. code
    etc. does run on a Sun, and vice versa).

Clearly using the FP registers would be better. How much better
depends on your model of the execution universe. Fortran codes (where
FP is king) are all pass by address (language semantics) so until f88
or mass conversion to c++ this is not the huge issue portrayed.

>I have no idea how often this happens; fortunately for SPARC, FORTRAN is
>call-by-reference.  Note also that int<->float conversions go
>thru a similar drill; that one is truly architectural, not architecture-plus-
>language-convention like the previous example (which, even if not architectural,
>is probably so wired into things that it would be nontrivial to change).

The convention is changeable. The problems are more "political" than
technical. If the statistics show that the convention should change, I
have no doubt that it will.

The int<->float is architectural, but there are few statistics to
indicate that this is a serious bottleneck in SPARC performance.

>	1) The SPARC multi-cycle loads and stores, which is not ISA,
>		but SYSTEM architecture and implementation.

agreed. I thought I made this clear.

>	2) The MIPS FPUs have lower cycle counts.

agreed, I thought the point of all those FPU talks we sat through was
that SPARC cycle counts were dropping quite rapidly (new TI divide,
sqrt, for example).

>	3) The compiler thing is an open question; I haven't looked at
>	much SPARC FP code lately, so I don't personally know.  Maybe
>	some UNBIASED third-parties would care to comment and give some DATA.

My job is to break 'em, not build 'em.  So I am relatively unbiased.
Data will follow as time permits.

>
>4. ANALYSIS OF "TILTING TOWARDS SPARC", INCLUDING HOT CHIPS

>4.2 WHAT MASH SAYS
>Sigh.  What does "tilting towards SPARC" mean?  Does it mean that
>SPARC is getting ahead, or might be catching up ("tilting back towards
>parity")?  I'm tired of this, but I can't let this
>argument go past.... I believe SPARC is getting closer, but that doesn't
>mean "tilting towards SPARC".

Assuming BIT's marketing numbers are true (idealized assumption), they
are shipping samples of 14 Mflop DP Linpack chips now. I am not in chip
design. I am not in workstation design. Sun may or may not be using
such chips.  But parts sampling now at 14 Mflops are faster than
MIPSco samples.


>other machines, including early MIPS M/500s (before R2010s existed).
>I'd use existing parts to get started, too; in fact, we did.
>The original SPARC team was small, and didn't have infinite resources,
>so this was all perfectly reasonable.  In retrospect, [OPINION], the
>only problem was in not having somebody going like crazy to build a
>serious CMOS SPARC FPU early enough, and I have no idea whether somebody

true. Wish we had someone like you with a bat to force 'em.


>2) OPINION: The BIT+Sun ECL design looks well-done, with some reasonable
>and informed thinking in many places.  Maybe before SPARC victory is declared
>by khb on the ECL front, we ought to wait for the first actual ECL systems
>to be shipped, 

It was a chip conference. Not a systems conference. 


>3) OPINION: Pete Wilson's Prisma talk was delightful and fascinating; I admit
>that MIPS is not, to my knowledge, building a GaAs supercomputer of
>the $500K-$1M ilk, so I wish them well.

Neither are we (I think). So we too wish them well.


>4.4 HOT CHIPS, CMOS FPU SESSION
>khb: "treated to three papers"
>
>FACT: we had a session with 3 CMOS SPARC FPUs (Weitek, TI, LSI),
>followed by Earl Killian of MIPS. The session chair introduced Earl as someone
>who would not talk about a SPARC FPU.  This comment elicited a
>noticeable round of applause from the audience..... perhaps khb would
>comment on that reaction to a "treat".

I applauded also. I went to hear about non-SPARC stuff. So much of the
audience is already involved in SPARC that few really wanted to hear
about the pin-outs again.

>	a) Gather all of the ACTUAL cycle counts of these various chips,
>	and put them in a table like the LSIL speaker showed, and post it here.
>	(This data is clearly publicly available, I think.)

Since Sun isn't in the business of using all known chips, and I am
already working 16 hour days, and because the data is publicly
available, I am not going to undertake that survey just now. Sorry, John.

>	b) Give a clear description of the overlap characteristics of these
>	chips.  I think most of them overlap {add/sub/conv, mul/div/sqrt, and
>	load/store}, and I don't think any of them are pipelined, but I
>	could be wrong.

Again, a job for the chip vendors (at least until Sun announces
products based on given chips).

>	c) Give a terse, clear description of these chips in terms of which
>	ones are used in which currently-public SPARC systems, and dispel any
>	confusion about already-cited benchmark numbers.  [When I read the
>	trade press, I get confused, because they talk about things like
>	shipping some SS1s with TI parts, but enough WTL parts are now available
>	to use them instead, and I have no idea if that's press error, or real,
>	and if real, what difference it would make.]

Sun ships

WTL 1164/65         old Sun4/110, Sun4/2xx
TI 8847 (aka FPU2)  current Sun4/110, Sun4/2xx
WTL 3170            SS-1
TI 8847             SS-330

Since many of the chips are pin compatible, numbers with funny
combinations are typically real. Just take out your toolkit ...:>


>	d) If there are REAL benchmarks, or even simulations of the performance
>	of these things that exist somewhere public, point us at them.

Rag on the chip houses. Or simply buy a machine and a chip and run
your real codes. It is what we do for our customers ....

>
>MIPS:

>of handcoded LINPACK inner loop peak performance, sometimes offered
>in tables comparing them with measured LINPACKs on real machines....
>In fact, I think that only a few of the cycle counts on these
>parts are better than the corresponding R3010 ones.  All of them suffer the
>(SPARC architectural) lack of direct data path between CPU & FPU.
>Again, if khb, or somebody would post the actual cycle counts, we can see
>whether my belief has any validity.)

A posting with some of that data is forthcoming.

>

>Maybe they will, maybe they won't, but I'd suggest, that to add some
>credibility, I'd ask for the following DATA:
>	0) Talk about synchronizing the CPU and FPU at these speeds.
>	Do you have PLL's, or some other technique, or magic?
>	1) What are the access times of the SRAMs needed to
>	run at 30ns, 25ns, and 20ns cycle times? (Some of these parts
>	were claimed to scale to 50Mhz, so the 20ns is relevant.)
>	2) What are the sizes, part-numbers, costs, and availability
>	of those parts, and how many do you need? 
>	3) What are the rest of the pieces that you need to
>	run at those speeds?  and when can you really get them?

Does Macy's tell Gimbels ? C'mon John, why didn't you ask the chip
vendors who were presenting ?


>Basically, to use the RISCar metaphor, these are simple questions
>to see if a million-RPM engine can actually be put into a
>{buildable, sellable, maintainable} car, or whether the engine slows down.

Since BIT is shipping parts, it is a question anyone with a yen to
experiment can try out.

>
>model number, or some identification, so people can know what they're
>measuring and label them correctly. 

And make life easy :> It would violate some sort of Marketing Policy,
no doubt :>

>
>The corresponding  MIPS sequence is:
>	1) R2010, with R2000  (R2xxxAs are R3xxxs in R2xxx packages) (1987, 88)
>	2) R3010 (shrunken R2010) with R3000 (which was changed some) (88, 89)

Your naming convention is better than ours.

>
>4.5 "TILTING TOWARDS SPARC, UNLESS MIPS SMARTER THAN EVERYBODY" : UNPROVEN
>Now, I finally get to the comment that set all of this off: ``So the FPU

Then why didn't you just comment on this ?

>	1) There is, as yet, little DELIVERABLE evidence for A,
>	with the exception that SPARCland is ahead of MIPSland in GaAs
>	supercomputers.  The ECL verdict isn't in yet; so the rest of
>	this discussion covers CMOS, only.

One can call BIT and order a chip. That means (to me) that one side is
ahead. Perhaps not by much. But anyone can order.

Did recasting an argument in symbolic logic form make it clearer?


>Now, perhaps khb did not observe a difference in style or strategy
>amongst the {SPARC FPUs} vs {MIPS FPU} talks.  I did observe some,

                                        singular

>a) Good simulation/analysis methodology for looking at design alternatives.

Does anyone NOT use simulation ? Just because you are willing to
publish many of your working notes doesn't mean that everyone else doesn't
use the tools!

>b) Close coupling of chip designers with systems designers, and smart sw folks:

Intel and Moto seem to disagree. But sun clearly agrees with you. :>

>d) People who know CMOS technology, yield, reliability, testability, etc.
    all chip houses are more or less expert in these areas
>e) CAD tools; diagnostics; design verification suites, etc, etc.
    ditto.
>f) A whole lot of computing power to support all of this.
    of course. we all know verilog's appetite for cycles
>g) Good chip technology and production.
    same as d, e, f.

Stating the obvious, eh?
	
>
>Now, only a few of these are "smart people"..... which is what makes
>the original khb thesis silly.  To do well, you need to combine at least
>most of the above (not necessarily, or even usually, in one company,
>but at least in a team).  

Far from clear. d,e,f,g are all somewhat decoupled from a,b. Did Moto
or Intel build the best systems ?  Systems, ISA, and compilers need to
be close (so I say, and it appears you agree). d,e,f,g can be dealt
with through vendors.


>2) There are plenty of reasons why competitive balance swings
>back and forth, and only some are smartness.

true. I think you are blowing the point all out of proportion.

>4) It would be nice to get some clear DATA posted about the forthcoming
>SPARC FPUs.

Chat with the chip folks. My lips are sealed.


Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &    |   Languages and Performance Tools. 
Opus  (* strange as it may seem, I do more engineering now     *)

bron@bronze.wpd.sgi.com (Bron Campbell Nelson) (07/07/89)

In article <22792@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
[ A whole bunch of stuff, including: ]

> 3) It really is boring having to respond to marketing FUD and
> rewritings of history in comp.arch.  There are better things to do, and I'd much
> rather see discussion of things like (to pick a simple case):
> 	Which is better: 2-cycle + & 5-cycle *, or 3-cycle + & 4-cycle *?
> 	On which kinds of benchmarks? why?
> 	How much difference does it make in performance? in silicon space?
> 
> I.e., things that give DATA, and even better INSIGHT........
[...]
> -john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
> UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
> DDD:  	408-991-0253 or 408-720-1700, x253
> USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

Here is a single data point, drawn from Lawrence Livermore National Labs.
{Ref: Harry Nelson, "Using the Performance Monitors on the X-MP/48";
Tentacle; vol V, num. 9, Sept/Oct 1985 (LLNL internal publication).}
Results are reported for a 30 hour weekend full production run (i.e. almost
all batch jobs doing "real" work, very little interactivity).  Exactly which
programs were running is not known, but the author claims (based on several
similar experiments) that this is a representative sample.  Note by the
way that these were real jobs doing real work, not a set of benchmarks or
test programs.

During that time, the following operation counts were seen (1cpu):
	F.P. add:	1198 *10^9
	F.P. multiply:	1346 *10^9
	F.P. reciprocal: 135 *10^9
These numbers include both scalar and vector F.P. operations.  The multiply
numbers are slightly inflated due to the lack of an F.P. divide operation on
the X-MP; to do a full divide (i.e. A/B) requires one reciprocal and
three multiplies.  If we assume all the reciprocals represent divides,
and subtract these from the above we get
	+:	1198  => 53%
	*:	 941  => 41%
	/:	 135  =>  6%

The surprising thing (to me) is how close the + and * numbers are.  What
this unfortunately means is that the answer is not very clear.  It involves
answering questions like "how frequently can an add be overlapped with
a multiply?", and "how often is an add on the critical path?"  F.P. adds
are not so abundant (relative to multiplies) that the question can be
dismissed, but it is certainly not something I'd recommend without a lot
of supporting evidence, and even then it looks to be a fairly marginal
optimization.  The silicon might be better invested in doing something else
(maybe hardware sqrt?).
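
As a strictly back-of-the-envelope check (and nothing more: it ignores
overlap and critical paths entirely, which is exactly the caveat above),
here is the divide adjustment and the naive weighted-latency arithmetic
for the two hypothetical designs John asked about; the 20-cycle divide
latency is invented and identical for both, so it doesn't affect the
comparison:

	/* Reproduce the adjustment above (each X-MP divide = 1 reciprocal
	 * + 3 multiplies), then weight the two latency choices by the
	 * resulting operation mix.  Deliberately ignores overlap,
	 * pipelining, and critical paths. */
	#include <stdio.h>

	int main(void)
	{
	    /* raw counts, units of 10^9 operations (30-hour run, 1 CPU) */
	    double fadd = 1198.0, fmul = 1346.0, frecip = 135.0;

	    /* treat every reciprocal as a divide; strip its 3 multiplies */
	    double divs  = frecip;
	    double muls  = fmul - 3.0 * divs;      /* 1346 - 405 = 941 */
	    double total = fadd + muls + divs;     /* 2274             */

	    double p_add = fadd / total;           /* ~0.53 */
	    double p_mul = muls / total;           /* ~0.41 */
	    double p_div = divs / total;           /* ~0.06 */

	    double div_cycles = 20.0;              /* made up, same for both */
	    double d1 = p_add * 2 + p_mul * 5 + p_div * div_cycles;
	    double d2 = p_add * 3 + p_mul * 4 + p_div * div_cycles;

	    printf("mix: +%.0f%%  *%.0f%%  /%.0f%%\n",
	           100 * p_add, 100 * p_mul, 100 * p_div);
	    printf("2-cycle +, 5-cycle *: %.2f cycles/FP op\n", d1); /* ~4.3 */
	    printf("3-cycle +, 4-cycle *: %.2f cycles/FP op\n", d2); /* ~4.4 */
	    return 0;
	}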

--
Bron Campbell Nelson
bron@sgi.com  or possibly  ..!ames!sgi!bron
These statements are my own, not those of Silicon Graphics.

khb%chiba@Sun.COM (chiba) (07/07/89)

In article <37530@sgi.SGI.COM> bron@bronze.wpd.sgi.com (Bron Campbell Nelson) writes:

>The surprising thing (to me) is how close the + and * numbers are.  What

It shouldn't be surprising. (If there is interest I can key in
complete op counts for some common Kalman filter algorithms, as
examples.) Of the infamous BLAS, both dot products and saxpy do one
multiply and one add every time through the innermost loop ... shops
which do serious number crunching typically do stuff like Householder
transformations, Givens rotations, matrix factorizations, etc., where
the algorithms are typically so close to tied between multiplies and
adds that most folks just count one or the other and multiply by two.
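
For concreteness, here is the innermost loop in question: a plain C
rendering of the BLAS saxpy, written out only to make the one-multiply,
one-add pairing visible (it is nobody's tuned library code):

	/* y = a*x + y: every trip through the loop does exactly one
	 * multiply and one add, so kernels built from saxpy and dot
	 * products come out almost exactly tied between the two. */
	void saxpy(int n, float a, const float *x, float *y)
	{
	    int i;
	    for (i = 0; i < n; i++)
	        y[i] = a * x[i] + y[i];  /* one multiply, one add */
	}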

>this unfortunately means is that the answer is not very clear.  It involves
>answering questions like "how frequently can an add be overlapped with
>a multiply?", and "how often is an add on the critical path?"  F.P. adds

These can be overlapped a very large fraction of the time. The easiest
way to see this is to examine common algorithms.

Cheers.

Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &    |   Languages and Performance Tools. 
Opus  (* strange as it may seem, I do more engineering now     *)

dave@micropen (David F. Carlson) (07/07/89)

In article <114015@sun.Eng.Sun.COM>, khb%chiba@Sun.COM (chiba) writes:
> 
> Chat with the chip folks. My lips are sealed.
> 
> Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*

I know there's a reason I read this forum...  
But right now I can't think of what it is.

-- 
David F. Carlson, Micropen, Inc.
micropen!dave@ee.rochester.edu

"The faster I go, the behinder I get." --Lewis Carroll

roelof@idca.tds.PHILIPS.nl (R. Vuurboom) (07/08/89)

In article <22792@winchester.mips.COM> mash@mips.COM (John Mashey) writes:

[Another view of the HOT CHIPS conference]

[A lengthy analysis of the verb tilt as in "tilting towards sparc" :-)] 
(Something tells me we've just witnessed the birth of a new expression 
as in so-and-so's showing a definite sparc tilt today :-) :-)

[Flames aimed at Keith Bierman designed to scorch Keith's toes.]

Since I'm the "fellow" who asked the question that prompted Keith's
posting that prompted your posting, and since you were worried that Keith's
posting may have been overly biased and therefore may have unduly
influenced the General Public (me plus other interested readers),
the following:

[Another view]

Thanks, your extra DATA does provide more INSIGHT :-).
No, seriously, I mean it. Two heads are better than one, and even more so if
one of those two heads happens to be yours.

[lengthy analysis of the expression "tilting towards sparc" :-)]
I took Keith's remark to mean simple numerical superiority, viz. more
SPARC-like implementations, not derision. Seeing MIPS's track record,
MIPS may indeed be smarter. Of course it's not the individual folks
that are smarter but the organization itself. The way it lets various
software and hardware folks (concepts) interact.
The way it lets the parts become a greater sum. Anybody with half an eye
can see that MIPS is doing pioneering work in this area.
 
[Flames at Keith]
Of course Keith can take care of himself and God can take care of us all
but I think it is a little unfair to demand full data sheets from him because
he was kind enough to give a little info (and -his- insight) on the
proceedings. Rustling up that sort of info is obviously very time
consuming and Keith is already working long days (So how come you're
reading this Keith? :-)


But (as usual) you do bring up some good points, particularly:


>The main reasons, I think, for the differences are:
>	1) The SPARC multi-cycle loads and stores, which is not ISA,
>		but SYSTEM architecture and implementation.
>	2) The MIPS FPUs have lower cycle counts.
>	3) The compiler thing is an open question; I haven't looked at
>	much SPARC FP code lately, so I don't personally know.  Maybe
>	some UNBIASED third-parties would care to comment and give some DATA.
>

>and maybe thus offer a thesis that could be analyzed, if he

or anybody else!!!

>would do the following:
>	a) Gather all of the ACTUAL cycle counts of these various chips,
>	and put them in a table like the LSIL speaker showed, and post it here.
>	(This data is clearly publicly available, I think.)
>	chips.  I think most of them overlap {add/sub/conv, mul/div/sqrt, and
>	load/store}, and I don't think any of them are pipelined, but I
>	could be wrong.
>	c) Give a terse, clear description of these chips in terms of which
>	ones are used in which currently-public SPARC systems, and dispel any

>credibility, I'd ask for the following DATA:
>	0) Talk about synchronizing the CPU and FPU at these speeds.
>	Do you have PLL's, or some other technique, or magic?
>	1) What are the access times of the SRAMs needed to
>	run at 30ns, 25ns, and 20ns cycle times? (Some of these parts
>	were claimed to scale to 50Mhz, so the 20ns is relevant.)

>	Which is better: 2-cycle + & 5-cycle *, or 3-cycle + & 4-cycle *?
>	On which kinds of benchmarks? why?
>	How much difference does it make in performance? in silicon space?

and of course my favourite :-)
>I.e., things that give DATA, and even better INSIGHT........
>
-- 
Roelof Vuurboom  SSP/V3   Philips TDS Apeldoorn, The Netherlands   +31 55 432226
domain: roelof@idca.tds.philips.nl             uucp:  ...!mcvax!philapd!roelof

acockcroft@pitstop.West.Sun.COM (Adrian Cockcroft) (07/11/89)

There was a call for some cycle time summaries for SPARC FPU's and khb
didn't have time to provide them. I happen to have a summary online so
here it is.

The Weitek ABACUS 3170 is an LSI/Fujitsu compatible SPARC FPU which uses the
F-bus to hang off the side of the IU. It runs at 25 MHz only. As used in the SS1.

The Weitek ABACUS 3171 is a Cypress compatible SPARC FPU which picks up
its operands in parallel to the IU. It runs at 25, 33 and 40 MHz.

Once the FP instructions have been despatched (I think 2 cycles on Fujitsu
and 1 cycle on Cypress) the performance is the same.

I compare below with data taken from the CY7C609 FPC (with TI8847) data sheet.
For Linpack comparisons, the 8847 is about 1.5 DP MFLOPS and the 3170 is about
1.36 DP MFLOPS in an SS1 (20 MHz). Some early SS1's were fitted with 8847's on
daughter boards; it's not an option because it's much more expensive than
the 3170 for a marginal improvement.

I'm not sure how much pipelining takes place inside each FPU; these cycles
are for a single instruction from start to finish.

Adrian

Instruction  3170 cycles    TI8847 cycles
fitos		10		8
fitod		5		8
fstoi		5		8
fdtoi		5		8
fstod		5		8
fdtos		5		8
fmovs		3		8
fnegs		3		8
fabss		3		8
fsqrts		60		15
fsqrtd		118		22
fadds		5		8
faddd		5		8
fsubs		5		8
fsubd		5		8
fmuls		5		8
fmuld		8		9
fdivs		38		13
fdivd		66		18
fcmps		3		8
fcmpd		3		8
fcmpes		3		8
fcmped		3		8
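
To turn these counts into time, divide by the clock rate in MHz (cycles
divided by MHz gives microseconds). The sketch below does that for one
instruction, with the clock rate as a parameter since the two columns do
not necessarily run at the same frequency; the 25 MHz used for the 8847
line below is just an assumption, and the 20 MHz is the SS1 figure quoted
above, so substitute whichever clocks actually apply:

	/* Convert a start-to-finish cycle count into wall-clock time. */
	#include <stdio.h>

	static double cycles_to_us(int cycles, double mhz)
	{
	    return cycles / mhz;    /* e.g. 66 cycles at 20 MHz = 3.3 us */
	}

	int main(void)
	{
	    /* example: double-precision divide, counts from the table */
	    printf("fdivd, 3170 @ 20 MHz: %.2f us\n", cycles_to_us(66, 20.0));
	    printf("fdivd, 8847 @ 25 MHz: %.2f us\n", cycles_to_us(18, 25.0));
	    return 0;
	}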


-- 
Adrian Cockcroft Sun Cambridge UK TSE sun!sunuk!acockcroft
Disclaimer: These are my own opinions