mash@mips.COM (John Mashey) (07/06/89)
1. INTRODUCTION

In article <112807@sun.Eng.Sun.COM> khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS) writes:

1) Some comments about SPARC integer-vs-floating-point that seem to rewrite
history from before Keith was at Sun, as well as some comments about Hot
Chips that need some balancing comments (which you can take either as
objective data, or as opposite-bias opinions; your call).

2) ``So the FPU
>integration/implementation variable is tilting towards SPARC (unless
>one assumes that MIPSco is smarter than Ross, Fuji., BIT, LSI, TI,
>Solb., Prisma and all the others.''

Marketing B.S. doesn't make something ("tilt") true; only being true makes it
true. In any case, in my opinion, the logic (only if MIPS is smarter is there
no tilt towards SPARC) is flawed, and I'll show why.
-------
Some of this discussion inherently contains industry-oriented material, which
I'm forced into, as well as some serious technical meat, thank goodness. If
you don't like the former, hit "n" now.

OUTLINE OF REST:
2. KHB's MODEL OF SUN FP TRADEOFFS; ANOTHER MODEL
3. FP PRESENT, INCLUDING COMPILER ISSUES
4. ANALYSIS OF "TILTING TOWARDS SPARC", INCLUDING HOT CHIPS
4.1 WHAT KHB SAYS
4.2 WHAT MASH SAYS
4.3 HOT CHIPS, GENERAL
4.4 HOT CHIPS, CMOS FPU SESSION
4.5 "TILTING TOWARDS SPARC, UNLESS MIPS SMARTER THAN EVERYBODY": UNPROVEN

2. KHB's MODEL OF SUN FP TRADEOFFS; ANOTHER MODEL

>In article <596@megatek.UUCP> mark@megatek.UUCP () writes:
>>This seems a little out of whack... it seems that older scientific
>>processors had ratios in the 3-4 range.

>Current SPARC implementations (chips and system) from Sun were
>intended for "more general purpose use" hence the (relatively) narrow
>gap between integer performance on a Cray to a 4/330. While floating
>point is fun (and is typically my reason for existing on a project) I
>spend most of my day doing compiles, editing, running schedtool, and
>other nonFP things. So using the 80-20 rule...
>the first machines
>should be the ones we need 80% of the time.

FACT: I admit to a nasty habit of keeping old marketing material and press
clippings, which I believe predate khb's tenure at Sun; I often keep such
things as a reality check. The following are quotes from the July 87 Sun-4
introductory material:

``Relative to other manufacturer's high-end offerings, the Sun-4/200 excels
in floating-point performance. In fact, the Sun-4/200 will execute
floating-point-intensive applications faster than the VAX 8800
superminicomputer.'' ....

``...giving users an overwhelming reason to migrate applications that
currently run on supercomputers, minisupers, and superminis onto
workstations.''

``..first supercomputing workstation...''

``Sun-4/200 Series is ideally suited for all compute-intensive,
floating-point, or graphics-intensive applications. The primary markets
targeted are high-end mechanical-CAD (MCAD) applications such as solids
modeling and finite element analysis; electrical-CAD (ECAD) applications
including IC and PC layout and routing; Artificial Intelligence (AI)
development; earth resources; molecular modelling; and other
compute-intensive applications.''

``..ideal for applications in the scientific computing and electrical CAD
markets.''

OPINION: FP not important?? Less important for Sun-4s??

OPINION: I think the original assertion (==VAX 8800 FP) is probably true, if
you replace Sun-4/200 (1987) with SPARCstation 3xx (1989). As was pointed out
shortly thereafter, the VAX 8700 and 8800 are NOT the same: the 8800 has two
8700 CPUs. It turned out that a Sun-4/200 was usually slower on many real FP
applications than an 8700 (especially if using VMS compilers, which is what
actually runs on most 8700/8800s). [OPINION] SS3xxs do appear to be better
balanced than Sun-4/2xxs with regard to FP versus integer performance.

3. FP PRESENT, INCLUDING COMPILER ISSUES

(....why people think MIPS FP is faster than SPARC FP...)
>Compilers is often stated, but according to my weeks of staring at
>huge volumes of data, it seems that the compiler differences are
>minimal on large codes. The current sun compilers are somewhat less
>clever about certain operations, but not enough to explain the
>difference in performance.

I suspect much of the code looks similar, which is not surprising, given the
similarities of the register sets available at any one time, FP instruction
sets that are fairly similar, and IEEE.

At least one SPARC architectural difference was described by Tom Pennello of
Metaware at Hot Chips, but khb failed to mention it: passing FP arguments in
the integer registers, and not having direct moves to/from IU and FP, means
that (in C, at least) saying y = glurp(x), with floats x, y, gives you
something like:

	(x sitting in FP reg)
	store x to memory; load it to integer register z
	call glurp
		store z to memory; reload it into FP reg; compute
		store result into memory, reload it into integer result reg
		return
	store result to memory; reload into FP reg (y)

I have no idea how often this happens; fortunately for SPARC, FORTRAN is
call-by-reference. Note also that conversions from int<->float go thru a
similar drill (which is truly architectural, not architecture + language
convention, like the previous example, which, if not architectural, is
probably so wired into things that it would be nontrivial to change).

The main reasons, I think, for the differences are:
	1) The SPARC multi-cycle loads and stores, which is not ISA,
	   but SYSTEM architecture and implementation.
	2) The MIPS FPUs have lower cycle counts.
	3) The compiler thing is an open question; I haven't looked at
	   much SPARC FP code lately, so I don't personally know. Maybe
	   some UNBIASED third-parties would care to comment and give
	   some DATA.
4. ANALYSIS OF "TILTING TOWARDS SPARC", INCLUDING HOT CHIPS

4.1 WHAT KHB SAYS

>What is interesting is that the benchmarks which SPARC does worst on
>are highly FP and memory intensive (say 30-50% loads and stores).

(See the discussion on DP LINPACK later, which is actually one of the SS3xx
and Sun-4/2xx's best FP benchmarks; SPARC systems have good external memory
systems that are well-suited to memory-intensive applications.)

>MIPSco built their own FPU and tightly coupled it to their IU. This
>resulted in early units which were superior to the SPARC
>implementation philosophy (let's buy whatever is laying around and
>glue it in -- in the first implementations that meant a weitek 1164
>and 1165 and a controller ... "leftovers" from the sun3/fpa project).
>At yesterday's IEEE HOT CHIPS conference, we were treated to three
>papers about dedicated SPARC FPU's in addition to the papers focused
>on FPU's. BIT is already sampling ECL SPARC chips. So the FPU
>integration/implementation variable is tilting towards SPARC (unless
>one assumes that MIPSco is smarter than Ross, Fuji., BIT, LSI, TI,
>Solb., Prisma and all the others.

4.2 WHAT MASH SAYS

Sigh. What does "tilting towards SPARC" mean? Does it mean that SPARC is
getting ahead, or might be catching up ("tilting back towards parity")? I'm
tired of this, but I can't let this argument go past.... I believe SPARC is
getting closer, but that doesn't mean "tilting towards SPARC".

There is nothing wrong, a priori, with the SPARC implementation strategy (of
using some existing FPU parts, and getting to market quickly), although
calling the WTL parts "leftovers" might be a little Sun-centric view of the
world, as those parts were used in plenty of other machines, including early
MIPS M/500s (before R2010s existed). I'd use existing parts to get started,
too; in fact, we did. The original SPARC team was small, and didn't have
infinite resources, so this was all perfectly reasonable.
In retrospect, [OPINION], the only problem was in not having somebody going
like crazy to build a serious CMOS SPARC FPU early enough, and I have no idea
whether somebody wanted to do this and wasn't allowed to, or whether the
partners didn't want to, or whether nobody had time to think about it at the
right time, or what. Maybe we could be enlightened. In any case, the sequence
is (with jiggles of a quarter possible on any date):

	MIPS				SPARC
4Q86	WTL 116x in M/500		WTL 116x in Sun-3
2Q87	R2010 in M/500 socket, M/800
3Q87					WTL 116x in Sun-4
4Q87	R2010 in M/1000
2Q88	R2010 in M/120
4Q88	R3010 in M/2000
1Q89
2Q89					TI 8847 in Sun-4 and SS300
					WTL 3170 in SS1

4.3 HOT CHIPS, GENERAL

1) FACT: presentations at conferences are not deliveries of systems.

2) OPINION: The BIT+Sun ECL design looks well-done, with some reasonable and
informed thinking in many places. Maybe before SPARC victory is declared by
khb on the ECL front we ought to wait for the first actual ECL systems to be
shipped, and see how they run real programs. Anant Agrawal's talk was
well-done, and mostly solid technical content (except for "World's first
single chip ECL 32 bit processor" and "World's fastest microprocessor. 80MHz
12.5ns cycle." If you add "announced" to those, I might agree. :-) Despite
such claims, it didn't give any SPECIFIC performance data (simulations of
real programs)..... There was a good treatment of the cache interface,
although a few interesting parts of building a complete system (like actual
cache and MMU designs, and getting enough fast-enough SRAM hooked up) are
Left To The Reader..... Khb might want to ask his ECL colleagues about some
of these issues. Still, this was a credible presentation and design, and for
reasons that will be obvious sooner or later, there are more reasons for FP
performance to be more similar than in past designs.
3) OPINION: Pete Wilson's Prisma talk was delightful and fascinating; I admit
that MIPS is not, to my knowledge, building a GaAs supercomputer of the
$500K-$1M ilk, so I wish them well.

4) FACT: Solbourne did not present at the conference. Fujitsu referenced the
WTL 3170, but didn't otherwise talk about FP that I can recall. Cypress/Ross
mentioned the CY7C602 FPU (which is, I think, the same as the TI ....602).

5) That leaves LSI and TI; I guess Weitek is "all the rest", unless I missed
somebody, which is possible.

4.4 HOT CHIPS, CMOS FPU SESSION

khb: "treated to three papers"

FACT: we had a session with 3 CMOS SPARC FPUs (Weitek, TI, LSI), followed by
Earl Killian of MIPS. The session chair introduced Earl as someone who would
not talk about a SPARC FPU. This comment elicited a noticeable round of
applause from the audience..... perhaps khb would comment on that reaction
to a "treat".

Now, the 3 CMOS SPARC FPU papers described reasonable devices that in some
cases include fairly clever things. On the other hand, we were given almost
zero serious performance analysis, or motivational material to say why things
were done differently; the LSIL presentation did include a cycle-count
comparison, which unfortunately was not included in the handouts, and I
couldn't write it down fast enough, or I'd repeat it here. Presumably, if I
were a SPARC customer, I might be able to get enough information on realistic
usages and environments to figure out which programs would run faster with
which chip combinations; such insight was NOT obvious from the presentations.

Khb could do much to turn his comments into real DATA, and maybe thus offer
a thesis that could be analyzed, if he would do the following:
	a) Gather all of the ACTUAL cycle counts of these various chips,
	and put them in a table like the LSIL speaker showed, and post it
	here. (This data is clearly publicly available, I think.)
	b) Give a clear description of the overlap characteristics of
	these chips.
	I think most of them overlap {add/sub/conv, mul/div/sqrt, and
	load/store}, and I don't think any of them are pipelined, but I
	could be wrong.
	c) Give a terse, clear description of these chips in terms of which
	ones are used in which currently-public SPARC systems, and dispel
	any confusion about already-cited benchmark numbers. [When I read
	the trade press, I get confused, because they talk about things
	like shipping some SS1s with TI parts, but enough WTL parts are now
	available to use them instead, and I have no idea if that's press
	error, or real, and if real, what difference it would make.]
	d) If there are REAL benchmarks, or even simulations of the
	performance of these things, that exist somewhere public, point us
	at them.

MIPS: Earl Killian described the R3010 FPU, including a large set of measured
MFLOPS numbers: Livermore (harmonic, geometric, arithmetic means); Gaussian
Elimination (LINPACK FORTRAN rolled, LINPACK hand-coded, 1000x1000); Matrix
Multiply (50x50 hand-coded); Multiply/Add Peak. (I.e., all numbers from the
Performance Brief.) He explained, with examples, why we chose low-latency,
multiple overlapped FP operational units (the R3010 appears to have somewhat
more concurrency than some of the SPARC FPUs), rather than pipelined ones.
He talked about simulation tradeoffs, like simulating Spice (and other large
programs) with a tweakable simulator to examine the effects of different
pipelining strategies and latency tradeoffs. He gave the cycle counts for
most of the operations. He also observed that although the 25MHz R3010 was
shipped in production systems 8 months ago (almost a year ago @ 20MHz), and
it was just a shrink of the R2010, which was shipped in production systems
over 2 years ago, the CMOS SPARC FPUs still haven't caught up, even the
forthcoming ones.
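The win from overlapped-but-not-pipelined functional units can be sketched in
a few lines. The 2-cycle add and 5-cycle multiply latencies below are purely
illustrative assumptions for the sketch, not the figures for any chip
discussed here:

```python
# Toy model: independent operations issued to separate non-pipelined FP
# units (one adder, one multiplier) finish in max() of the per-unit busy
# times, instead of the sum of all latencies on a single shared unit.
# ADD_CYCLES / MUL_CYCLES are assumed values for illustration only.

ADD_CYCLES, MUL_CYCLES = 2, 5

def serial(ops):
    """Single shared unit: every operation waits for the previous one."""
    return sum(ops)

def overlapped(ops):
    """One adder + one multiplier; independent ops issue back-to-back.
    Same-unit ops still serialize (the units are not pipelined)."""
    return max(sum(c for c in ops if c == ADD_CYCLES),
               sum(c for c in ops if c == MUL_CYCLES))

# An inner loop doing one independent add and one multiply per pass:
ops = [ADD_CYCLES, MUL_CYCLES]
print(serial(ops), "cycles serial vs", overlapped(ops), "cycles overlapped")
```

The point of the sketch is that once the units overlap, the *longest* unit
latency dominates a mixed loop, which is why low multiply/divide latency
matters more than peak rate for non-pipelined designs.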
[MASH: Or, at least, no compelling evidence was presented that they're going
to blow it away, as there was a lot of talk of hand-coded LINPACK inner-loop
peak performance, sometimes offered in tables comparing it with measured
LINPACKs on real machines.... In fact, I think that only a few of the cycle
counts on these parts are better than the corresponding R3010 ones. All of
them suffer the (SPARC architectural) lack of a direct data path between CPU
and FPU. Again, if khb, or somebody, would post the actual cycle counts, we
can see whether my belief has any validity.]

Now, somebody might claim [well, they do] that the forthcoming FPUs are
targeted to 33 to 50MHz (in some cases, people only listed the timings
corresponding to these rates), and that they'll run faster than any R3010
ever will, AND THAT THEY'LL DO IT WHILE IT STILL MATTERS. Maybe they will,
maybe they won't, but to add some credibility, I'd ask for the following
DATA:
	0) Talk about synchronizing the CPU and FPU at these speeds.
	Do you have PLLs, or some other technique, or magic?
	1) What are the access times of the SRAMs needed to run at 30ns,
	25ns, and 20ns cycle times? (Some of these parts were claimed to
	scale to 50MHz, so the 20ns is relevant.)
	2) What are the sizes, part numbers, costs, and availability of
	those parts, and how many do you need?
	3) What are the rest of the pieces that you need to run at those
	speeds? And when can you really get them?

The only thing close to answering these questions was the Cypress/Ross
chipset description, and I'm not really sure what's happening there, simply
because I have a hard time relating their chip dates to system dates.
Basically, to use the RISCar metaphor, these are simple questions to see if
a million-RPM engine can actually be put into a {buildable, sellable,
maintainable} car, or whether the engine slows down.
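The SRAM question above is really just cycle-budget arithmetic. Here is a
minimal sketch; the 8ns of fixed per-cycle overhead (clock-to-address, data
setup, board delays) is an assumed round number for illustration, not a
figure from any vendor or from the conference:

```python
# Back-of-the-envelope SRAM budget: whatever the cycle time is, the SRAM
# access time must fit inside it after fixed overhead is subtracted.
# OVERHEAD_NS is an assumption for illustration only.

OVERHEAD_NS = 8.0   # assumed: clock-to-address + data setup + board delay

def max_sram_access_ns(cycle_ns, overhead_ns=OVERHEAD_NS):
    """SRAM access time left over after fixed per-cycle overhead."""
    return cycle_ns - overhead_ns

for cycle in (30.0, 25.0, 20.0):
    print(f"{cycle:.0f}ns cycle -> SRAM must be <= "
          f"{max_sram_access_ns(cycle):.0f}ns")
```

Even with a generous overhead guess, the 20ns (50MHz) case lands in
very-fast-SRAM territory, which is why the part-number/cost/availability
questions matter as much as the raw cycle time.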
SPARC implementation combinations that I've heard of:
	1) Fujitsu FPC + WTL 1164/65 (Sun-4/110, 200) (1987, 1988)
	2) FPU2 (TI 8847 + FPC) for Sun-4/110, 200 (1989)
	3) WTL 3170 for LSIL/Fujitsu in SS1 (1989)
	4) TI 8847 + FPC in SS3xx (I think), with Cypress 601 IU (1989)
	5) WTL 3171 (coming, to go with Cypress 601s) (1989)
	6) TI TMS390C602 (coming) (which, I think, really combines an
	   8847 + FPC), to go with Cypress 601s (1989)
	7) LSIL L64814 FPU (coming), which also goes with Cypress 601s, or
	   the LSIL IU with that pinout, rather than the LSIL SPARC IUs
	   used in SS1s.
(If I've missed anybody, I didn't mean to, and I'm sorry if I'm confused
about any of these: please correct me if I'm wrong.)

BTW, as a side note to Sun: if you change FPUs in a system model, where it
makes a performance difference, PLEASE consider giving it a succinct,
different model number, or some identification, so people can know what
they're measuring and label them correctly.

The corresponding MIPS sequence is:
	1) R2010, with R2000 (R2xxxAs are R3xxxs in R2xxx packages) (1987, 88)
	2) R3010 (shrunken R2010) with R3000 (which was changed some) (88, 89)

Keith is right: we're horribly outnumbered.... Still, in the CMOS world,
nobody yet is shipping any SPARC systems that equal a 25MHz R3x pair at FP
benchmarks, and in fact, the 25MHz SS300 (based on minimal data) looks not
much different from a MIPS M/120, which has a 16.7MHz R2xxx pair.

4.5 "TILTING TOWARDS SPARC, UNLESS MIPS SMARTER THAN EVERYBODY": UNPROVEN

Now, I finally get to the comment that set all of this off:

``So the FPU
>integration/implementation variable is tilting towards SPARC (unless
>one assumes that MIPSco is smarter than Ross, Fuji., BIT, LSI, TI,
>Solb., Prisma and all the others.''

In order to bring sense from this, and to carefully avoid being
misinterpreted, I'll recast this with some logic for clarity:

	A: "....is tilting towards SPARC."
	B: "MIPSco is smarter than ...."
Now, khb's thesis may be rendered symbolically as:

	not-A ==> B	(i.e., that's what "A, unless B" means)
	not-B		[I think: after reading this several times, I think
			the reader is being invited to disbelieve B as
			impossible, or to expect MIPSco to disprove A by
			proving B (which is impossible; there are smart
			people at lots of companies). khb does not SAY this,
			and if he didn't mean this, then you can ignore a
			lot of this. However, I have heard this syllogism
			before, so it's not new....]
	therefore not-(not-A), i.e., A.

I claim that:

1) There is, as yet, little DELIVERABLE evidence for A, with the exception
that SPARCland is ahead of MIPSland in GaAs supercomputers. The ECL verdict
isn't in yet; so the rest of this discussion covers CMOS only. [I've covered
this somewhat above.]

2) Not (not-A ==> B), i.e., there could be plenty of reasons why A might not
be true, without requiring B to be true.

3) C, where C: "MIPSco may be able to hold its own in these wars, based on
past history, and on the requirements for doing so."

Note that my claims are NOT, and should not be misconstrued as:

1) B (MIPSco is smarter)

2) E, where E is "MIPS will always be ahead, at every instant."

Now, perhaps khb did not observe a difference in style or strategy amongst
the {SPARC FPUs} vs {MIPS FPU} talks. I did observe some, and I add some
other data, in defense of assertion C:

[OPINION] Here's some of what it takes to build hot CMOS chips (and the
software they need) in a timely and competitive fashion, and especially for
the next round (the integrated superchips):

a) Good simulation/analysis methodology for looking at design alternatives.

b) Close coupling of chip designers with systems designers, and smart sw
folks:
	compiler folks, to answer questions like "if we make multiply X
	cycles, how much overlap can you get back with a smarter pipeline
	organizer?"
	OS and graphics folks, to answer all sorts of questions about
	memory hierarchy and other tradeoffs.

c) Smart chip designers; we like having logic and circuit folks sitting next
to each other; others split it other ways.

d) People who know CMOS technology, yield, reliability, testability, etc.

e) CAD tools; diagnostics; design verification suites, etc., etc.

f) A whole lot of computing power to support all of this. (Like, the DV
folks will use an infinite amount if you let them. :-)

g) Good chip technology and production.

Now, only a few of these are "smart people"..... which is what makes the
original khb thesis silly. To do well, you need to combine at least most of
the above (not necessarily, or even usually, in one company, but at least in
a team).

OK, almost done.

1) I'm NOT claiming MIPSco is smarter than everybody else; I'm just arguing
against the claim that the balance is on SPARC's side UNLESS MIPSco is
smarter than everybody else.

2) There are plenty of reasons why competitive balance swings back and
forth, and only some are smartness.

3) It really is boring having to respond to marketing FUD and rewritings of
history in comp.arch. There are better things to do, and I'd much rather see
discussion of things like (to pick a simple case):
	Which is better: 2-cycle + & 5-cycle *, or 3-cycle + & 4-cycle *?
	On which kinds of benchmarks? Why?
	How much difference does it make in performance? In silicon space?
I.e., things that give DATA, and even better, INSIGHT........

4) It would be nice to get some clear DATA posted about the forthcoming
SPARC FPUs.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
khb%chiba@Sun.COM (chiba) (07/07/89)
In article <22792@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>1. INTRODUCTION
>
>In article <112807@sun.Eng.Sun.COM> khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS) writes:
>
>1) Some comments about SPARC integer-vs-floating point that seem to
>rewrite history from before Keith was at Sun,

Fellow asked a question, I "reverse engineered" history as best I could. It
is true that while the RISC revolution was starting I was off doing Kalman
filtering.

>as well as some comments about Hot Chips that need some balancing
>comments (which you can take either as objective data, or as
>opposite-bias opinions; your call).

The bulk of my posting was simply the schedule. The editorial comments were
quite slight. A 400+ line rebuttal seems a bit of overkill.

>2) ``So the FPU ...
>Marketing B.S. doesn't make something ("tilt") true; only being true makes
>it true; in any case, in my opinion, the logic (only if MIPS smarter is
>there no tilt towards SPARC) is flawed, and I'll show why.

Counted number of chip houses, etc. BIT is shipping ECL SPARC samples ...

>-------
>The following are quotes from the July 87 Sun-4 introductory material:
>``Relative to other manufacturer's high-end offerings,
>the Sun-4/200 excels in floating-point performance.

Marketing hype is marketing hype. A Sun4/2xx was much faster for FP than a
VAX. It was not as key to the design as, say, a Cray YMP.

>3. FP PRESENT, INCLUDING COMPILER ISSUES
>At least one SPARC architectural difference was described by Tom Pennello
>of Metaware at Hot Chips, but khb failed to mention:
>passing FP arguments in the integer registers, and not having
>direct moves to/from IU and FP, means that (in C, at least),
>saying y = glurp(x), with floats x,y, gives you something like:

Tom's point was well taken if:
	1) Most FP codes pass by value rather than by address
	2) One wants to penalize machines w/o FP hardware a lot, vs.
	   w/FP a bit
	3) There wasn't an effective ABI in place ("official" or not,
	   Solb.
	code etc. does run on a Sun, and vice versa).

Clearly using the FP registers would be better. How much better depends on
your model of the execution universe. Fortran codes (where FP is king) are
all pass-by-address (language semantics), so until f88 or mass conversion to
C++ this is not the huge issue portrayed.

>I have no idea how often this happens; fortunately for SPARC, FORTRAN is
>call-by-reference. Note also that conversions from int<->float go
>thru a similar drill (which is truly architectural, not architecture+
>language convention, like the previous example, which, if not architectural
>is probably so wired into things it would be nontrivial to change.)

The convention is changeable. The problems are more "political" than
technical. If the statistics show that the convention should change, I have
no doubt that it will. The int<->float is architectural, but there are few
statistics to indicate that this is a serious bottleneck in SPARC
performance.

>	1) The SPARC multi-cycle loads and stores, which is not ISA,
>	but SYSTEM architecture and implementation.

Agreed. I thought I made this clear.

>	2) The MIPS FPUs have lower cycle counts.

Agreed. I thought the point of all those FPU talks we sat through was that
SPARC cycle counts were dropping quite rapidly (new TI divide, sqrt, for
example).

>	3) The compiler thing is an open question; I haven't looked at
>	much SPARC FP code lately, so I don't personally know. Maybe
>	some UNBIASED third-parties would care to comment and give some
>	DATA.

My job is to break 'em, not build 'em. So I am relatively unbiased. Data
will follow as time permits.

>4. ANALYSIS OF "TILTING TOWARDS SPARC", INCLUDING HOT CHIPS
>4.2 WHAT MASH SAYS
>Sigh. What does "tilting towards SPARC" mean? Does it mean that
>SPARC is getting ahead, or might be catching up ("tilting back towards
>parity")? I'm tired of this, but I can't let this
>argument go past.... I believe SPARC is getting closer, but that doesn't
>mean "tilting towards SPARC".
Assuming BIT's marketing numbers are true (idealized assumption), they are
shipping samples of 14 Mflop DP LINPACK chips now. I am not in chip design.
I am not in workstation design. Sun may or may not be using such chips. But
samples at 14 Mflops now are faster than MIPSco samples.

>other machines, including early MIPS M/500s (before R2010s existed).
>I'd use existing parts to get started, too; in fact, we did.
>The original SPARC team was small, and didn't have infinite resources,
>so this was all perfectly reasonable. In retrospect, [OPINION], the
>only problem was in not having somebody going like crazy to build a
>serious CMOS SPARC FPU early enough, and I have no idea whether somebody

True. Wish we had someone like you with a bat to force 'em.

>2) OPINION: The BIT+Sun ECL design looks well-done, with some reasonable
>and informed thinking in many places. Maybe before SPARC victory is
>declared by khb on the ECL front we ought to wait for the first actual
>ECL systems to be shipped,

It was a chip conference. Not a systems conference.

>3) OPINION: Pete Wilson's Prisma talk was delightful and fascinating; I
>admit that MIPS is not, to my knowledge, building a GaAs supercomputer of
>the $500K-$1M ilk, so I wish them well.

Neither are we (I think). So we too wish them well.

>4.4 HOT CHIPS, CMOS FPU SESSION
>khb: "treated to three papers"
>
>FACT: we had a session with 3 CMOS SPARC FPUs (Weitek, TI, LSI), followed
>by Earl Killian of MIPS. The session chair introduced Earl as someone who
>would not talk about a SPARC FPU. This comment elicited a noticeable round
>of applause from the audience..... perhaps khb would comment on that
>reaction to a "treat".

I applauded also. I went to hear about non-SPARC stuff. So much of the
audience is already involved in SPARC that few really wanted to hear about
the pin-outs again.
>	a) Gather all of the ACTUAL cycle counts of these various chips,
>	and put them in a table like the LSIL speaker showed, and post it
>	here. (This data is clearly publicly available, I think.)

Since Sun isn't in the business of using all known chips, and I am already
working 16-hour days, and because the data is publicly available, I am not
going to undertake that survey just now. Sorry John.

>	b) Give a clear description of the overlap characteristics of
>	these chips. I think most of them overlap {add/sub/conv,
>	mul/div/sqrt, and load/store}, and I don't think any of them are
>	pipelined, but I could be wrong.

Again, a job for the chip vendors (at least until Sun announces products
based on given chips).

>	c) Give a terse, clear description of these chips in terms of which
>	ones are used in which currently-public SPARC systems, and dispel
>	any confusion about already-cited benchmark numbers. [When I read
>	the trade press, I get confused, because they talk about things
>	like shipping some SS1s with TI parts, but enough WTL parts are
>	now available to use them instead, and I have no idea if that's
>	press error, or real, and if real, what difference it would make.]

Sun ships:
	wtl1164/65		old sun4/110, sun4/2xx
	TI8847 aka FPU2		current sun4/110, sun4/2xx
	wtl3170			SS-1
	TI8847			SS-330
Since many of the chips are pin compatible, numbers with funny combinations
are typically real. Just take out your toolkit ... :>

>	d) If there are REAL benchmarks, or even simulations of the
>	performance of these things, that exist somewhere public, point us
>	at them.

Rag on the chip houses. Or simply buy a machine and a chip and run your
real codes. It is what we do for our customers ....

>MIPS:
>of handcoded LINPACK inner loop peak performance, sometimes offered
>in tables comparing them with measured LINPACKs on real machines....
>In fact, I think that only a few of the cycle counts on these
>parts are better than the corresponding R3010 ones.
>All of them suffer the
>(SPARC architectural) lack of direct data path between CPU & FPU.
>Again, if khb, or somebody would post the actual cycle counts, we can see
>whether my belief has any validity.

A posting with some of that data is forthcoming.

>Maybe they will, maybe they won't, but to add some credibility, I'd ask
>for the following DATA:
>	0) Talk about synchronizing the CPU and FPU at these speeds.
>	Do you have PLLs, or some other technique, or magic?
>	1) What are the access times of the SRAMs needed to
>	run at 30ns, 25ns, and 20ns cycle times? (Some of these parts
>	were claimed to scale to 50MHz, so the 20ns is relevant.)
>	2) What are the sizes, part numbers, costs, and availability
>	of those parts, and how many do you need?
>	3) What are the rest of the pieces that you need to
>	run at those speeds? And when can you really get them?

Does Macy's tell Gimbels? C'mon John, why didn't you ask the chip vendors
who were presenting?

>Basically, to use the RISCar metaphor, these are simple questions
>to see if a million-RPM engine can actually be put into a
>{buildable, sellable, maintainable} car, or whether the engine slows down.

Since BIT is shipping parts, it is a question anyone with a yen to
experiment can try out.

>model number, or some identification, so people can know what they're
>measuring and label them correctly.

And make life easy :> It would violate some sort of Marketing Policy, no
doubt :>

>The corresponding MIPS sequence is:
>	1) R2010, with R2000 (R2xxxAs are R3xxxs in R2xxx packages) (1987, 88)
>	2) R3010 (shrunken R2010) with R3000 (which was changed some) (88, 89)

Your naming convention is better than ours.

>4.5 "TILTING TOWARDS SPARC, UNLESS MIPS SMARTER THAN EVERYBODY": UNPROVEN
>Now, I finally get to the comment that set all of this off: ``So the FPU

Then why didn't you just comment on this?
>	1) There is, as yet, little DELIVERABLE evidence for A,
>	with the exception that SPARCland is ahead of MIPSland in GaAs
>	supercomputers. The ECL verdict isn't in yet; so the rest of
>	this discussion covers CMOS only.

One can call BIT and order a chip. That means (to me) that one side is
ahead. Perhaps not by much. But anyone can order. Did recasting an argument
in symbolic logic form make it clearer?

>Now, perhaps khb did not observe a difference in style or strategy
>amongst the {SPARC FPUs} vs {MIPS FPU} talks. I did observe some,

singular

>a) Good simulation/analysis methodology for looking at design alternatives.

Does anyone NOT use simulation? Just because you are willing to publish many
of your working notes doesn't mean that everyone else doesn't use the tools!

>b) Close coupling of chip designers with systems designers, and smart sw
>folks:

Intel and Moto seem to disagree. But Sun clearly agrees with you. :>

>d) People who know CMOS technology, yield, reliability, testability, etc.

All chip houses are more or less expert in these areas.

>e) CAD tools; diagnostics; design verification suites, etc., etc.

Ditto.

>f) A whole lot of computing power to support all of this.

Of course. We all know Verilog's appetite for cycles.

>g) Good chip technology and production.

As d, e, f. Stating the obvious, eh?

>Now, only a few of these are "smart people"..... which is what makes
>the original khb thesis silly. To do well, you need to combine at least
>most of the above (not necessarily, or even usually, in one company,
>but at least in a team).

Far from clear. d, e, f, g are all somewhat decoupled from a, b. Did Moto or
Intel build the best systems? Systems, ISA, and compilers need to be close
(so I say, and it appears you agree). d, e, f, g can be dealt with as
vendors.

>2) There are plenty of reasons why competitive balance swings
>back and forth, and only some are smartness.

True. I think you are blowing the point all out of proportion.
>4) It would be nice to get some clear DATA posted about the forthcoming
>SPARC FPUs.

Chat with the chip folks. My lips are sealed.

Keith H. Bierman    |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault   | Marketing Technical Specialist ! kbierman@sun.com
I Voted for Bill &  | Languages and Performance Tools.
Opus                (* strange as it may seem, I do more engineering now *)
bron@bronze.wpd.sgi.com (Bron Campbell Nelson) (07/07/89)
In article <22792@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
[ A whole bunch of stuff, including: ]
> 3) It really is boring having to respond to marketing FUD and
>    rewritings of history in comp.arch. There are better things to do, and I'd
>    much rather see discussion of things like (to pick a simple case):
>    Which is better: 2-cycle + & 5-cycle *, or 3-cycle + & 4-cycle *?
>    On which kinds of benchmarks? why?
>    How much difference does it make in performance? in silicon space?
>
> I.e., things that give DATA, and even better INSIGHT........
[...]
> -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
> DDD: 408-991-0253 or 408-720-1700, x253
> USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

Here is a single data point, drawn from Lawrence Livermore National Labs. {Ref: Harry Nelson, "Using the Performance Monitors on the X-MP/48"; Tentacle; vol V, num. 9, Sept/Oct 1985 (LLNL internal publication).} Results are reported for a 30-hour weekend full production run (i.e. almost all batch jobs doing "real" work, very little interactivity). Exactly which programs were running is not known, but the author claims (based on several similar experiments) that this is a representative sample. Note, by the way, that these were real jobs doing real work, not a set of benchmarks or test programs. During that time, the following operation counts were seen (1 cpu):

	F.P. add:        1198 * 10^9
	F.P. multiply:   1346 * 10^9
	F.P. reciprocal:  135 * 10^9

These numbers include both scalar and vector F.P. operations. The multiply numbers are slightly inflated due to the lack of a F.P. divide operation on the X-MP; a full divide (i.e. A/B) requires one reciprocal and three multiplies. If we assume all the reciprocals represent divides, and subtract three multiplies per divide from the above, we get:

	+: 1198 => 53%
	*:  941 => 41%
	/:  135 =>  6%

The surprising thing (to me) is how close the + and * numbers are.
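The divide adjustment is simple arithmetic; a minimal Python sketch, using the raw op counts quoted above, checks that the percentages come out as stated:

```python
# Raw X-MP operation counts from the LLNL production run (units of 10^9 ops).
adds, muls, recips = 1198, 1346, 135

# On the X-MP, A/B is one reciprocal plus three multiplies.  If every
# reciprocal stands for a divide, three multiplies per divide are really
# part of that divide, so remove them from the multiply count.
divides = recips
true_muls = muls - 3 * divides  # 1346 - 405 = 941

total = adds + true_muls + divides
for name, count in (("+", adds), ("*", true_muls), ("/", divides)):
    print(f"{name}: {count:5d} => {round(100 * count / total)}%")
# Prints 53%, 41%, and 6%, matching the figures above.
```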
What this unfortunately means is that the answer is not very clear. It involves answering questions like "how frequently can an add be overlapped with a multiply?", and "how often is an add on the critical path?" F.P. adds are not so abundant (relative to multiplies) that the question can be dismissed, but it is certainly not something I'd recommend without a lot of supporting evidence, and even then it looks to be a fairly marginal optimization. The silicon might be better invested in doing something else (maybe hardware sqrt?).
--
Bron Campbell Nelson    bron@sgi.com  or possibly  ..!ames!sgi!bron
These statements are my own, not those of Silicon Graphics.
khb%chiba@Sun.COM (chiba) (07/07/89)
In article <37530@sgi.SGI.COM> bron@bronze.wpd.sgi.com (Bron Campbell Nelson) writes:
>The surprising thing (to me) is how close the + and * numbers are. What

It shouldn't be surprising. (If there is interest I can key in complete op counts for some common Kalman filter algorithms, as examples.) Of the infamous BLAS, both dot products and saxpy do one multiply and one add every time through the innermost loop ... shops which do serious number crunching typically do stuff like Householder transformations, Givens rotations, matrix factorizations, etc., where the algorithms are typically so close to tied between multiplies and adds that most folks just count one or the other and multiply by two.

>this unfortunately means is that the answer is not very clear. It involves
>answering questions like "how frequently can an add be overlapped with
>a multiply?", and "how often is an add on the critical path?" F.P. adds

These can be overlapped a very large fraction of the time. The easiest way to see this is to examine common algorithms.

Cheers.

Keith H. Bierman    |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault   | Marketing Technical Specialist ! kbierman@sun.com
I Voted for Bill &  | Languages and Performance Tools.
Opus                (* strange as it may seem, I do more engineering now *)
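The 1:1 multiply-to-add ratio in the BLAS inner loops is easy to see in code; a minimal Python sketch (illustrative only, not any particular BLAS implementation):

```python
def saxpy(a, x, y):
    """y := a*x + y.  Each trip through the loop does exactly one
    multiply (a * xi) and one add (+ yi), so over a long vector the
    multiply:add ratio is 1:1."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def dot(x, y):
    """Dot product: again one multiply and one add per element."""
    s = 0.0
    for xi, yi in zip(x, y):
        s += xi * yi  # one *, one + per iteration
    return s

print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
print(dot([1.0, 2.0], [3.0, 4.0]))                      # 11.0
```

An FPU that can overlap the independent multiply and add of successive iterations approaches one result per add-latency, which is why overlap matters more than either latency alone.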
dave@micropen (David F. Carlson) (07/07/89)
In article <114015@sun.Eng.Sun.COM>, khb%chiba@Sun.COM (chiba) writes: > > Chat with the chip folks. My lips are sealed. > > Keith H. Bierman |*My thoughts are my own. Only my work belongs to Sun* I know there's a reason I read this forum... But right now I can't think of what it is. -- David F. Carlson, Micropen, Inc. micropen!dave@ee.rochester.edu "The faster I go, the behinder I get." --Lewis Carroll
roelof@idca.tds.PHILIPS.nl (R. Vuurboom) (07/08/89)
In article <22792@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
[Another view of the HOT CHIPS conference]
[A lengthy analysis of the verb tilt as in "tilting towards sparc" :-)]
(Something tells me we've just witnessed the birth of a new expression, as in so-and-so's showing a definite sparc tilt today :-) :-)
[Flames aimed at Keith Bierman designed to scorch Keith's toes.]

Since I'm the "fellow" who asked the question that prompted Keith's posting that prompted your posting, and since you were worried that Keith's posting may have been overly biased and therefore may have unduly influenced the General Public (me plus other interested readers), the following:

[Another view]
Thanks, your extra DATA does provide more INSIGHT :-). No, seriously, I mean it. Two heads are better than one, and even more so if one of those two heads happens to be yours.

[lengthy analysis of the expression "tilting towards sparc" :-)]
I took Keith's remark to mean simple numerical superiority, viz. more SPARC-like implementations, not derisory. Seeing MIPS's track record, MIPS may indeed be smarter. Of course it's not the individual folks that are smarter but the organization itself: the way it lets various software and hardware folks (concepts) interact, the way it lets the parts become a greater sum. Anybody with half an eye can see that MIPS is doing pioneering work in this area.

[Flames at Keith]
Of course Keith can take care of himself and God can take care of us all, but I think it is a little unfair to demand full data sheets from him because he was kind enough to give a little info (and -his- insight) on the proceedings. Rustling up that sort of info is obviously very time consuming, and Keith is already working long days. (So how come you're reading this, Keith? :-)

But (as usual) you do bring up some good points, particularly:

>The main reasons, I think, for the differences are:
> 1) The SPARC multi-cycle loads and stores, which is not ISA,
>    but SYSTEM architecture and implementation.
> 2) The MIPS FPUs have lower cycle counts.
> 3) The compiler thing is an open question; I haven't looked at
>    much SPARC FP code lately, so I don't personally know. Maybe
>    some UNBIASED third-parties would care to comment and give some DATA.
>
>and maybe thus offer a thesis that could be analyzed, if he or anybody else!!!
>would do the following:
> a) Gather all of the ACTUAL cycle counts of these various chips,
>    and put them in a table like the LSIL speaker showed, and post it here.
>    (This data is clearly publicly available, I think.)
>    chips. I think most of them overlap {add/sub/conv, mul/div/sqrt, and
>    load/store}, and I don't think any of them are pipelined, but I
>    could be wrong.
> c) Give a terse, clear description of these chips in terms of which
>    ones are used in which currently-public SPARC systems, and dispel any
>credibility, I'd ask for the following DATA:
> 0) Talk about synchronizing the CPU and FPU at these speeds.
>    Do you have PLL's, or some other technique, or magic?
> 1) What are the access times of the SRAMs needed to
>    run at 30ns, 25ns, and 20ns cycle times? (Some of these parts
>    were claimed to scale to 50Mhz, so the 20ns is relevant.)
> Which is better: 2-cycle + & 5-cycle *, or 3-cycle + & 4-cycle *?
> On which kinds of benchmarks? why?
> How much difference does it make in performance? in silicon space?

and of course my favourite :-)

>I.e., things that give DATA, and even better INSIGHT........

--
Roelof Vuurboom  SSP/V3  Philips TDS Apeldoorn, The Netherlands  +31 55 432226
domain: roelof@idca.tds.philips.nl  uucp:  ...!mcvax!philapd!roelof
acockcroft@pitstop.West.Sun.COM (Adrian Cockcroft) (07/11/89)
There was a call for some cycle time summaries for SPARC FPUs, and khb didn't have time to provide them. I happen to have a summary online, so here it is.

The Weitek ABACUS 3170 is a LSI/Fujitsu-compatible SPARC FPU which uses the F-bus to hang off the side of the IU. It runs at 25 MHz only. As used in the SS1. The Weitek ABACUS 3171 is a Cypress-compatible SPARC FPU which picks up its operands in parallel to the IU. It runs at 25, 33 and 40 MHz. Once the FP instructions have been despatched (I think 2 cycles on Fujitsu and 1 cycle on Cypress) the performance is the same. I compare below with data taken from the CY7C609 FPC (with TI8847) data sheet.

For Linpack comparisons, the 8847 is about 1.5 DP MFLOPS and the 3170 is about 1.36 DP MFLOPS in a SS1 (20 MHz). Some early SS1s were fitted with 8847s on daughter boards; it's not an option because it's much more expensive than the 3170 for a marginal improvement. I'm not sure how much pipelining takes place inside each FPU; these cycles are for a single instruction from start to finish.

Adrian

	Instruction	3170 cycles	TI8847 cycles
	fitos		 10		 8
	fitod		  5		 8
	fstoi		  5		 8
	fdtoi		  5		 8
	fstod		  5		 8
	fdtos		  5		 8
	fmovs		  3		 8
	fnegs		  3		 8
	fabss		  3		 8
	fsqrts		 60		15
	fsqrtd		118		22
	fadds		  5		 8
	faddd		  5		 8
	fsubs		  5		 8
	fsubd		  5		 8
	fmuls		  5		 8
	fmuld		  8		 9
	fdivs		 38		13
	fdivd		 66		18
	fcmps		  3		 8
	fcmpd		  3		 8
	fcmpes		  3		 8
	fcmped		  3		 8
--
Adrian Cockcroft  Sun Cambridge UK TSE  sun!sunuk!acockcroft
Disclaimer: These are my own opinions
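A table like Adrian's lets one estimate serialized latency for a given instruction mix; a hedged Python sketch using a few of the rows above (assuming no overlap or pipelining, since these are start-to-finish counts, and an instruction mix that is purely illustrative):

```python
# Start-to-finish cycle counts for a few instructions, from the table above.
cycles = {
    "fadds":  {"3170": 5,   "8847": 8},
    "fmuld":  {"3170": 8,   "8847": 9},
    "fdivd":  {"3170": 66,  "8847": 18},
    "fsqrtd": {"3170": 118, "8847": 22},
}

def mix_cycles(fpu, mix):
    """Total cycles for a {instruction: count} mix, fully serialized."""
    return sum(cycles[op][fpu] * n for op, n in mix.items())

# Illustrative mix: divide- and sqrt-heavy code favors the TI8847,
# even though the 3170 wins on plain adds.
mix = {"fadds": 100, "fmuld": 100, "fdivd": 10, "fsqrtd": 5}
for fpu in ("3170", "8847"):
    print(fpu, mix_cycles(fpu, mix))  # 3170: 2550 cycles, 8847: 1990 cycles
```

With no divides or square roots in the mix, the comparison flips, which is one reason single benchmark numbers like Linpack MFLOPS hide as much as they show.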