rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/07/91)
I have seen relatively little about the i860 chip on this newsgroup. Also, compared to MIPS, it doesn't seem to be very popular as the base processor for computers; Alliant uses it in their shared-memory machines (800 & 2800), Intel has the Touchstone experimental mpp machine, and the i860 seems popular as a graphics coprocessor (e.g. in the NeXT), but generally, I see surprisingly little interest in the chip.

Is there something wrong with the architecture? As a platform, what are its advantages and disadvantages over the competition? I am particularly interested in this, as I am thinking of buying an Alliant FX/800; I did a lot of benchmarks, and the performance seems extremely good compared to other RISC architectures (even on a per-processor basis running on throughput rather than parallelization), so I'm wondering if there isn't some "catch" I haven't encountered yet.
yoshio@maui.cs.ucla.edu (Yoshio Turner) (05/07/91)
rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
>Is there something wrong with the architecture? As a platform,
>what are its advantages and disadvantages over the competition?

At the Touchstone special session at DMCC6, I recall one intel speaker who said the floating point performance suffers from insufficient memory bandwidth. He also said an updated i860 will be announced in "a couple months" that addresses this problem.

Yoshio
hays@iSC.intel.com (Kirk Hays) (05/08/91)
In article <1991May7.145407.18417@midway.uchicago.edu>, rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes: |> I have seen relatively little about the i860 chip on this newsgroup. |> Also, compared to MIPS, it doesn't seem to be very popular as the |> base processor for computers; Alliant uses it in their shared-memory |> machines (800 & 2800), Intel has the Touchstone experimental mpp machine, Just to set the record straight, we do sell the iPSC/860, which contains up to 128 i860 processors, connected as a hypercube. The Touchstone Delta Field Prototype (the "mpp machine" referenced above, aka "DFP") was recently available for inspection by the attendees of DMCC6 in Portland, OR. Over 500 i860 processors, connected as a mesh, with a peak performance of 30 GFLOPS. |> and the i860 seems popular as a graphics coprocessor (e.g. in the NeXt), |> but generally, I see surprisingly little interest in the chip. |> No comment. Obviously, I do not speak for Intel, being but a lowly engineer. -- Kirk Hays - NRA Life. Message for Timothy Fay - "Do not eat/wear/exploit things you will not kill."
dik@cwi.nl (Dik T. Winter) (05/08/91)
In article <1991May7.145407.18417@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
> I have seen relatively little about the i860 chip on this newsgroup.
> Also, compared to MIPS, it doesn't seem to be very popular as the
> base processor for computers; Alliant uses it in their shared-memory
> machines (800 & 2800), Intel has the Touchstone experimental mpp machine,
> and the i860 seems popular as a graphics coprocessor (e.g. in the NeXt),
> but generally, I see surprisingly little interest in the chip.

FPS and Stardent (will) use it as computational coprocessor (I think).

>
> Is there something wrong with the architecture? As a platform,
> what are its advantages and disadvantages over the competition?
> I am particularly interested in this, as I am thinking of buying
> an Alliant F/800; I did a lot of benchmarks, and the performance
> seems extremely good compared to other RISC architectures (even
> on a per-processor basis running on throughput rather than
> parallelization), so I'm wondering if there isn't some "catch"
> I haven't encountered yet.

The problem I see with the i860 is that it is very good with specially tuned code, but that it is extremely bothersome for compilers to get that performance. (This holds for f-p only; if you are thinking non-f-p, it can compete with the others.) My opinion is that you would want the i860 only if your f-p workload consists mainly of the use of standard libraries. Do not expect such good performance if you code everything yourself. So for chemical/physical research it is quite good (if you use the standard libraries), but when you are doing research in numerical mathematics it is less well suited.

On the other hand, Alliant's model for parallelism is very good for doing basic research in parallel algorithms (in that case performance is not the main problem; we are still on an FX/4; talking about performance :-)).
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
dik@cwi.nl (Dik T. Winter) (05/08/91)
In article <yoshio.673633924@maui> yoshio@maui.cs.ucla.edu (Yoshio Turner) writes:
> At the Touchstone special session at DMCC6, I recall one intel speaker
> who said the floating point performance suffers from insufficient
> memory bandwidth. He also said an updated i860 will be announced in
> "a couple months" that addresses this problem.

While memory bandwidth is indeed a problem, it is not the only one. I think that the lack of registers also plays a role, as does the lack of good compilers, etc. And while we are talking about memory bandwidth, I think the major problem is that at most three loads can be posted concurrently.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
jgriffit@isis.cs.du.edu (Jonathan Griffitts) (05/08/91)
The i860 also has a reputation for being buggy, and apparently the bug list is hard to get. From what I understand (not first-hand information), some of the bugs have a significant adverse effect on performance. I presume that these bugs are being, or may already have been, worked out.

--JCG
--
--JCG AnyWare Engineering, Boulder CO 303 442-0556
akhiani@ricks.enet.dec.com (Homayoon Akhiani) (05/09/91)
The following is from Computer Design May 1,1991 issue: "860 bugs kept under cover" "...At least one board vendor has been required to sign a nondisclosure agreement with Intel prohibiting the discussion of the bugs with customers. ...after nearly two years in the market, the 860 still has five bugs, one of which is said to significantly affect floating-point performance."
cfj@iSC.intel.com (Charlie Johnson) (05/09/91)
In article <1991May7.145407.18417@midway.uchicago.edu>, rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes: |> I have seen relatively little about the i860 chip on this newsgroup. |> Also, compared to MIPS, it doesn't seem to be very popular as the |> base processor for computers; Alliant uses it in their shared-memory |> machines (800 & 2800), Intel has the Touchstone experimental mpp machine, |> and the i860 seems popular as a graphics coprocessor (e.g. in the NeXt), |> but generally, I see surprisingly little interest in the chip. |> |> Is there something wrong with the architecture? As a platform, |> what are its advantages and disadvantages over the competition? |> I am particularly interested in this, as I am thinking of buying |> an Alliant F/800; I did a lot of benchmarks, and the performance |> seems extremely good compared to other RISC architectures (even |> on a per-processor basis running on throughput rather than |> parallelization), so I'm wondering if there isn't some "catch" |> I haven't encountered yet. |> . Intel sells the iPSC/860 which is not an experimental machine. It is a supported product which has up to 128 i860s. -- Charles Johnson Intel Corporation Supercomputer Systems Division MS CO1-01 15201 NW Greenbrier Pkwy Beaverton, OR 97006 phone: (503)629-7605 email: cfj@ssd.intel.com
paul@taniwha.UUCP (Paul Campbell) (05/09/91)
In article <yoshio.673633924@maui> yoshio@maui.cs.ucla.edu (Yoshio Turner) writes: >At the Touchstone special session at DMCC6, I recall one intel speaker >who said the floating point performance suffers from insufficient >memory bandwidth. He also said an updated i860 will be announced in >"a couple months" that addresses this problem. Oh great !-( - I can just see it - the engineers put a 128 bit bus on the thing to get the system performance up and the Intel marketting people blaze out with "Intel releases the first 128 bit micro" (remember you saw it here first :-) Paul -- Paul Campbell UUCP: ..!mtxinu!taniwha!paul AppleLink: CAMPBELL.P My son is now 2 months old, in that time he has doubled his weight, if he does this every 2 months for the next year he will weigh over 300lbs.
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (05/12/91)
In article <1991May7.145407.18417@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes: >I have seen relatively little about the i860 chip on this newsgroup. >Alliant uses it in their shared-memory >machines (800 & 2800),... >Is there something wrong with the architecture? Long ago, Intel announced that it would be getting its 860 compilers from Alliant. They obviously thought (as did I) that Alliant was pretty good. That product is now about a year overdue. Unfortunately, the pessimists seem to have been on the mark. -- Don D.C.Lindsay Carnegie Mellon Robotics Institute
terry@venus.sunquest.com (Terry R. Friedrichsen) (05/14/91)
paul@taniwha.UUCP (Paul Campbell) writes: >yoshio@maui.cs.ucla.edu (Yoshio Turner) writes: >> ... an updated i860 will be announced in >>"a couple months" that addresses this [memory bandwidth] problem. >Oh great !-( - I can just see it - the engineers put a 128 bit bus on >the thing to get the system performance up and the Intel marketting >people blaze out with "Intel releases the first 128 bit micro" >(remember you saw it here first :-) OK, let's look a year or two into the future: The new chip (i870?) will only cost about 6 times what it ought to. However, you will be able to bring this down to 3 times what it ought to cost by purchasing a crippled version of the chip (i870SX) which multiplexes the 128-pin bus onto the lower 64 pins. Then, for just 150% of the cost of the original i870, you will be able to buy an i871 co-processor chip. This will be nothing more than an i870 with "i871" printed on the outside, which fits into a socket alongside the crippled i870SX and disables it, restoring the 128-pin bus capability of your machine. Thus, Intel will once again give the customer what he wants (can you guess what that is?). End of trip to the future. (If you think this sounds a lot like a current Intel processor's evolution, by golly, you're right!) Oh, what I would give to have the whole planet just say "NO" to the i586 ... Terry R. Friedrichsen terry@venus.sunquest.com (Internet) uunet!sunquest!terry (Usenet) terry@sds.sdsc.edu (alternate address; I live in Tucson) Quote: "Do, or do not. There is no 'try'." - Yoda, The Empire Strikes Back
pteich@cayman.amd.com (Paul Teich) (05/15/91)
In article <3486@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes: | In article <1991May7.145407.18417@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes: | > Alliant uses it in their shared-memory machines (800 & 2800) | FPS and Stardent (will) use it as computational coprocessor (I think). The Anderson Report, April 1991 "Stardent has discontinued the Stardent 500 Stiletto workstation family. They had problems with the vector unit integration of the MIPS R3000 and Intel i860 processors. Stiletto customers are being offered the new 750 system as a replacement..." RISC Management, March 1991 "Stratus Computer, which is still using 68K processors while struggling to implement i860-based systems, ... The company did start delivery of FTX, its long-delayed UNIX SVR3 port last December; no word yet, however, on the i860-based systems." "Another company apparently having difficulty with the i860 is Stardent Computer. ... Stardent has been working on an i860-based replacement for the former Stellar product line, but, like Stratus, must be questioning whether to continue." "The remaining announced i860-based multiuser system vendor, Alliant Computer Systems, reported its third straight loss, $3 million." For those designing high-end graphics subsystems, remember, there is an alternative to the i860 - the Am29050. This is not an advertisement, so no flames...; call or email me for more information. I speak only for myself, etc. -- Paul R. Teich pteich@cayman.AMD.COM Advanced Micro Devices, Inc. Direct 1-512-462-4268 5900 E. Ben White Blvd., MS 561 WATS 1-800-531-5202 x54268 Austin, Texas 78741 FAX 1-512-462-5051 =============================================================================== The mind of man, the soul, spirit or whatever, is infinite in its grasp; the universe may be only finite. At least there is no limit to what we've been able to imagine so far. -Jessie Greenstein, Radio Astronomer,_The_Astronomers_
carroll@ssc-vax (Jeff Carroll) (05/16/91)
In article <3486@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes: >In article <1991May7.145407.18417@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes: > > I have seen relatively little about the i860 chip on this newsgroup. > > Also, compared to MIPS, it doesn't seem to be very popular as the > > base processor for computers; Alliant uses it in their shared-memory > > machines (800 & 2800), Intel has the Touchstone experimental mpp machine, > > and the i860 seems popular as a graphics coprocessor (e.g. in the NeXt), > > but generally, I see surprisingly little interest in the chip. >FPS and Stardent (will) use it as computational coprocessor (I think). FPS pitched this box to me more than a year ago, and if it's not out yet it's probably never going to be. It's essentially the old Celerity box with SPARCs where the Celerity chips used to be, and two or three optional add-in boards, one of which is supposed to be up to 84 i860s sharing a single bank of memory. They claim that bandwidth to memory is sufficient; I don't remember the actual numbers and will thus leave this to the judgment of the reader, but I thought that claim a little hard to believe at the time... > > > > Is there something wrong with the architecture? As a platform, > > what are its advantages and disadvantages over the competition? > > I am particularly interested in this, as I am thinking of buying > > an Alliant F/800; I did a lot of benchmarks, and the performance > > seems extremely good compared to other RISC architectures (even > > on a per-processor basis running on throughput rather than > > parallelization), so I'm wondering if there isn't some "catch" > > I haven't encountered yet. The i860, to the best of my knowledge, has the best floating-point performance of any microprocessor in the world today (possible bugs notwithstanding; I haven't seen 'em, but I don't build i860 systems. I just use 'em.) 
The integer performance is closer to other recent RISC chips. >The problem I see with the i860 is that it is very good with >especially tuned code, but that it is extremely bothersome for >compilers to get that performance. (This holds for f-p only, if you >are thinking non-f-p it can compete with the others.) But if you are thinking non-floating-point there might well be strong reasons for going with another chip. As long as you don't have to write the compilers, or live with them while the vendor is coming up the learning curve, you don't care how "bothersome" it is. In fact there are some i860 compilers that are better than others, and I've been told by people I trust that the Portland Group compilers are quite good (maybe some day soon I'll be able to verify this myself). > My opinion is >that you would want the i860 only if your f-p work-load consist mainly >of the use of standard libraries. Do not expect such a good >performance if you code everything yourself. So for chemical/physical >research it is quite good (if you use the standard libraries), I use a canned library on our i860 boards, but I do it mostly to save time in coding, not because it is especially fast (which it isn't). I'll withhold the name in order to protect the guilty, and also the innocent (namely myself). > but >when you are doing research in numerical mathematics it is less well >suited. On the other hand Alliants model for parallellism is very >good to do basic research in parallel algorithms (in that case >performance is not the main problem; we are still on an FX/4; talking >about performance :-)). Actually the i860 made possible some work on parallel numerical algorithms that would have been orders of magnitude more expensive (in both time and dollars) without it. In particular an Intel/Boeing team found that it was possible to achieve very nearly peak performance from the i860 using very little assembly code, on certain problems. 
-- Jeff Carroll carroll@ssc-vax.boeing.com "Do you think I care? ... I have an infinite amount of money." -Bill Gates
dank@blacks.jpl.nasa.gov (Dan Kegel) (05/17/91)
carroll@ssc-vax (Jeff Carroll) writes:
>... an Intel/Boeing team found that it was possible to achieve very nearly
>peak performance from the i860 using very little assembly code, on certain
>problems.

I just ran a stupid benchmark (the FFT from Numerical Recipes, length 1024, 2000 reps) on both a Sun 4/470 and an Alliant FX/2800. On the Sun, I used f77 -fast; on the Alliant, fortran -O. Performance was within 10% of identical. [Gee, everyone said the i860 was much faster than a Sun; guess I misunderstood.]

Now that I've compared the speed of Fortran programs, I'd like to compare the speed of hand-coded assembly versions of the same thing. The Alliant comes with a canned signal processing library which is presumably hand-optimized assembly code. Does anyone know of such a library for the Sun 4?

- Dan K
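To turn a timing like the one in the post above into a rate, one common convention is to charge 5 N log2 N floating-point operations to a length-N complex radix-2 FFT. That convention is an assumption here, not something the post states, and the real operation count of the Numerical Recipes routine may differ. A minimal sketch; the function names and the 10-second figure in the usage note are hypothetical:

```python
def fft_flops(n):
    """Nominal flop count for one length-n complex radix-2 FFT (5 * n * log2 n).

    This is the conventional estimate, assumed here for illustration only.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    log2_n = n.bit_length() - 1  # exact log2 for a power of two
    return 5 * n * log2_n


def fft_mflops(n, reps, seconds):
    """Nominal MFLOPS for `reps` transforms of length `n` taken in `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return fft_flops(n) * reps / seconds / 1e6
```

Under this convention the benchmark above (length 1024, 2000 reps) is 102.4 million nominal flops, so a hypothetical run of 10 seconds would correspond to 10.24 MFLOPS.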
preston@ariel.rice.edu (Preston Briggs) (05/17/91)
dank@blacks.jpl.nasa.gov (Dan Kegel) writes:
>I just ran a stupid benchmark (the FFT from Numerical Recipes, length 1024,
>2000 reps) on both a Sun 4/470 and an Alliant FX/2800.
>Performance was within 10% of identical.
>[Gee, everyone said the i860 was much faster than a Sun;
>guess I misunderstood.]

The i860 can smoke a Sparc, but it takes a smart (mostly non-existent) compiler and the right applications. A man at Alliant characterized the 860 as having a small "sweet spot". In early experiments (meaning primitive compilers), we measured a factor of 21 improvement from (very tedious) hand coding over compiler results.

Basically, really advanced architectures require really advanced compilers to get best results. For example, vector machines need vectorizing compilers. The i860 needs many of the same techniques.

Preston Briggs
wright@Stardent.COM (David Wright) (05/17/91)
In article <1991May15.152456.4246@dvorak.amd.com> pteich@cayman.amd.com (Paul Teich) writes: >In article <3486@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes: >| In article <1991May7.145407.18417@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes: >| > Alliant uses it in their shared-memory machines (800 & 2800) >| FPS and Stardent (will) use it as computational coprocessor (I think). > >The Anderson Report, April 1991 > > "Stardent has discontinued the Stardent 500 Stiletto workstation family. >They had problems with the vector unit integration of the MIPS R3000 and >Intel i860 processors. Stiletto customers are being offered the new 750 >system as a replacement..." > >RISC Management, March 1991 > > "Another company apparently having difficulty with the i860 is Stardent >Computer. ... Stardent has been working on an i860-based replacement for the >former Stellar product line, but, like Stratus, must be questioning >whether to continue." I'm an OS engineer, not an official spokesman for the company, so let's be clear that I've now established deniability, OK? That done, a few things should be straightened out here... The discontinuation of Stiletto had nothing to do with defects in the i860 or the i860 being a bad processor. (No, I am not going to discuss why it was cancelled.) We are NOT questioning whether to continue our i860-based product line. Quite the contrary. This week, Stardent announced three new models based on the i860, one of which contains a nifty graphics accelerator that contains i860s of its own. And we certainly have plans for future systems based around the i860 and its successors. (Intel has been a pretty good outfit to deal with, as far as we're concerned.) So don't believe everything you read in the papers. (Or do believe everything, but then be prepared to be contradicted a lot.) 
><blatant AMD self-promotion deleted> No, the i860 is not perfect and the trap code can be kind of a pain, but it's a pretty slick chip, even if the floating point is a bear to optimize. -- David Wright, not officially representing Stardent Computer Inc wright@stardent.com or uunet!stardent!wright "In accordance with our principles of free enterprise and healthy competition, I'm going to ask you two to fight to the death for it." -- Monty Python
brooks@tazdevil.llnl.gov (Eugene D. Brooks III) (05/17/91)
In article <1991May16.221437.10751@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
>The i860 can smoke a Sparc, but it takes a smart (mostly non-existent)
>compiler and the right applications. A man at Alliant characterized
>the 860 as having a small "sweet spot".

The "sweet spot" of the i860, 4 K bytes, is indeed very small. That the chip is so difficult to compile for indicates a poorly designed architecture.

The "sweet spot" of several more recently available micros, IBM's and HP's, is much larger, with data caches of 128 K bytes. Sweet spots which are 32 times larger are definitely better.

The "sweet spot" of a true supercomputer ranges in the hundreds of megawords. Sweet spots which are more than a thousand times larger are better, but not always a thousand times better.

It is clear that the fundamental problem with leading edge micros today is that when they miss their sweet spot they go hungry. In time, micro vendors and memory chip vendors will get together and provide DRAM chips with the internal latches required to interleave memory on a chip. Then the micro's sweet spot will be in the same size class as that of a supercomputer and the conventional supercomputer will move on to complete extinction.

By then, the i860 and all the other micros lacking some form of explicit vector functionality to saturate the available memory bandwidth will be fodder for lawn sprinkler controllers.
uh311ae@sunmanager.lrz-muenchen.de (Henrik Klagges) (05/17/91)
Hi, could you forward me a little information about the 29050 ? Thanks a lot ! Rick@vee.lrz-muenchen.de
preston@ariel.rice.edu (Preston Briggs) (05/17/91)
preston@ariel.rice.edu (Preston Briggs) writes:
>>The i860 can smoke a Sparc, but it takes a smart (mostly non-existent)
>>compiler and the right applications. A man at Alliant characterized
>>the 860 as having a small "sweet spot".

brooks@tazdevil.llnl.gov (Eugene D. Brooks III) writes:
>The "sweet spot" of the i860, 4 K bytes, is indeed very small.

By sweet spot, I would include more than just cache (which is 8K btw). If the application is not FP intense, then you're wasting a good portion of the chip. If it's DP, you give up a factor of 2 in multiplication rate. If it's not balanced in FP adds and multiplies, you give up further big chunks. You also need fair-sized loops to help mitigate the long latencies associated with the pipelines. And, if there's inadequate reuse of data, you're going to be bound up by the cost of memory accesses (for example, when adding 2 long vectors). Note that larger caches are no help if there's no reuse.

>By then, the i860 and all the other micros lacking some form of explicit
>vector functionality to saturate the available memory bandwidth will be
>fodder for lawn sprinkler controllers.

But the 860 (and RS/6000 and HP's machines) can easily saturate the available memory bandwidth! What we want is more bandwidth or paths to memory or something to keep up with the FP. Of course, this is all expensive.

>That the chip is so difficult to compile for indicates a
>poorly designed architecture.

Perhaps so. But weren't vector machines considered difficult targets for years (aren't they still)? And consider the difficulties compilers have had with parallel machines of all sorts. The 860's just a new and interesting problem.

The RS/6000 and HP-snakes spend more hardware on the implementation and are able to have a cleaner and simpler architecture (an approach I support), but the 860 is still faster in some cases (say, multiplication of large matrices).

Preston Briggs
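The add/multiply balance and double-precision points in the post above can be put into a toy model. This is only a sketch under stated assumptions, not a description of the real chip: it assumes one adder and one multiplier that can each start one operation per cycle, with a double-precision multiply occupying the multiplier for two cycles (the "factor of 2" from the post), and it ignores pipeline latency and memory entirely. The function name is made up for illustration.

```python
def fraction_of_peak(adds, mults, double_precision=False):
    """Fraction of the single-precision peak reached by a loop body.

    Toy model (assumption): peak is one add plus one multiply per cycle.
    A double-precision multiply takes two multiplier cycles.
    """
    if adds < 0 or mults < 0 or adds + mults == 0:
        raise ValueError("need a non-negative, non-empty operation mix")
    mult_cycles = mults * (2 if double_precision else 1)
    cycles = max(adds, mult_cycles)      # the busier unit sets the pace
    return (adds + mults) / (2 * cycles)  # peak would be 2 flops per cycle
```

In this model a balanced single-precision loop reaches the full peak, a pure vector add (no multiplies) tops out at half of it, and a balanced double-precision loop also lands at half, all before any memory effects are counted.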
woan@exeter.austin.ibm.com (Ronald S Woan) (05/17/91)
In article <848@llnl.LLNL.GOV> brooks@tazdevil.llnl.gov (Eugene D. Brooks III) writes:
>In article <1991May16.221437.10751@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
>>The i860 can smoke a Sparc, but it takes a smart (mostly non-existent)
>>compiler and the right applications. A man at Alliant characterized
>>the 860 as having a small "sweet spot".
>The "sweet spot" of the i860, 4 K bytes, is indeed very small. That
>the chip is so difficult to compile for indicates a poorly designed
>architecture.

I seriously doubt that by "sweet spot" he was referring to just the size of the cache, which can't be directly compared across architectures on size alone, since it matters how caches are managed and replenished. Rather, I think he was referring to the complexity of coding for the i860's multiple processing units, which I believe are exposed to the software (it is not one of those supersmart superscalar out-of-order-execution guys).
--
+-----All Views Expressed Are My Own And Are Not Necessarily Shared By------+
+------------------------------My Employer----------------------------------+
+ Ronald S. Woan woan@cactus.org or woan@austin.vnet.ibm.com +
+ other email addresses Prodigy: XTCR74A Compuserve: 73530,2537 +
carroll@ssc-vax (Jeff Carroll) (05/18/91)
In article <13008@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes: > >Long ago, Intel announced that it would be getting its 860 compilers >from Alliant. They obviously thought (as did I) that Alliant was >pretty good. > >That product is now about a year overdue. Unfortunately, the >pessimists seem to have been on the mark. Intel has also OEMed i860 compilers from Green Hills and the Portland Group, and struck an (unannounced) deal with Multiflow for compiler technology at about the same time they announced the Alliant deal. I think that Alliant's failure to produce its compiler on schedule likely says more about Alliant than it does about the i860. -- Jeff Carroll carroll@ssc-vax.boeing.com "Like sands through the hourglass, so are the days of our lives." - Socrates
chased@rbbb.Eng.Sun.COM (David Chase) (05/19/91)
carroll@ssc-vax.UUCP (Jeff Carroll) writes:
> Intel has also OEMed i860 compilers from Green Hills and the Portland
>Group, and struck an (unannounced) deal with Multiflow for compiler
>technology at about the same time they announced the Alliant deal.
>
> I think that Alliant's failure to produce its compiler on schedule
>likely says more about Alliant than it does about the i860.

I don't think this is a valid criticism. There were compilers for the i860 some time ago; the question is whether the writers of those compilers were attempting to generate code for (as Preston Briggs put it) the i860's "sweet spot". If so, the only thing they can be faulted for is thinking they could be done by now.

The BEST code for the i860 is very hard to generate. I expect that when compilers are finally generating very good code for that chip, they will do so by blocking techniques used to ensure that operands are in the cache at known alignments (that is, blocks will be copied). It is also possible in good situations to generate code that does not require this copying, but I've only seen it done by hand. I tried some of it myself (taking into account all the rules about stalled instructions in the reference manual) and managed to write some code for matrix multiply (not the transposed version in the manual) that would apparently hit "full speed" for any matrix where a single row would fit in the cache.

It was very hard. The difficulty comes from having to get everything right simultaneously -- in the triply nested loop, you unroll an outer loop by three, jam it into the innermost loop, unroll that by four, recognize that you can accumulate three inner products simultaneously in the adder pipeline (but that's because you know to select the right instruction), load the proper operands from the cache, load the other operands using the cache-bypassing pipelined load instructions, and know that everything is correctly aligned so that multi-word loads can be used. If you didn't unroll by three, or if you selected the wrong instruction, or if you didn't do the registers-in-the-pipeline trick, or if you didn't get the proper assignment of operands to cache and main memory, or if the alignments weren't right, then the performance drops precipitously.

For other operations (e.g., elementary row operations of linear algebra) you pretty much have to reorganize operands so that everything is in the cache. If not, the bottleneck is architectural; if you assume that B is cached but A is not (i.e., we expect to eliminate B from many rows, so we put it in the cache) in

    A[i] = A[i] + F * B[i]

you end up with 64 bits of off-chip I/O per cycle. This is the max, and the chip could do it, except that the instruction grouping rules say, "one from column F, one from column I". If you have N Mpy-adds, that's N instructions to schlep A around, but you still need N/4 more to load B from the cache (using quad-word loads that are not available in the pipelined form) plus one to control the loop.

Of course, it goes w/o saying that you've already done dependence analysis on everything, and that you are running the chip in dual-instruction mode. I haven't even begun to talk about the details, so you begin to see what a "challenge" this chip is.

Another approach would be to use pattern-matching (big patterns, too) and just call canned routines written by hand.

David Chase
Sun
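Two pieces of the post above can be sketched in a few lines. This is an illustration written for this thread archive, not the poster's code, and all function names are hypothetical. The first function shows only the loop restructuring (row loop unrolled by three and jammed into the inner loop, so one load of B feeds three running inner products); it omits the further unroll by four and everything hardware-specific (pipelined loads, alignment, instruction selection). The second function is one reading of the instruction count for A[i] = A[i] + F * B[i]: it assumes each cycle pairs one floating-point instruction with one core instruction, so N multiply-adds need at least N + N/4 + 1 cycles.

```python
from fractions import Fraction


def matmul_naive(a, b):
    """Reference triple loop: C = A * B for lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_unroll_jam3(a, b):
    """C = A * B with the row loop unrolled by three and jammed inward."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    i = 0
    while i + 3 <= n:
        a0, a1, a2 = a[i], a[i + 1], a[i + 2]
        for j in range(m):
            s0 = s1 = s2 = 0              # three inner products at once
            for p in range(k):
                bpj = b[p][j]             # one load of B, used three times
                s0 += a0[p] * bpj
                s1 += a1[p] * bpj
                s2 += a2[p] * bpj
            c[i][j], c[i + 1][j], c[i + 2][j] = s0, s1, s2
        i += 3
    for r in range(i, n):                 # leftover rows when n % 3 != 0
        for j in range(m):
            c[r][j] = sum(a[r][p] * b[p][j] for p in range(k))
    return c


def fp_issue_bound(n):
    """Upper bound on FP-slot utilisation for n multiply-adds (assumed model)."""
    core = n + Fraction(n, 4) + 1         # moves of A, quad loads of B, loop branch
    return Fraction(n) / max(Fraction(n), core)
```

Under that reading, the floating-point side can be busy at most 400/501 of the time for N = 400, approaching 80% as N grows, even though the off-chip traffic alone would fit.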
jgreen@Alliant.COM (John C Green Jr) (05/21/91)
Eugene D Brooks III <llnl.gov> and Preston Briggs <rice.edu> have mentioned the size of the i860 "sweet spot". Alliant builds the FX/2800, an air cooled supercomputer from multiple i860s with global shared memory and has extensive experience in looking for, finding, and expanding the "sweet spot."

RISC microprocessors, including the i860, need lots of bandwidth. Alliant believes the way to deliver this bandwidth to the i860 is a multiple level cache:

Location     Size                           Bandwidth Per Processor
===========  =============================  =======================
On chip      4 KB Instruction+8 KB Data     960 MB/sec
On board     256 KB                         128 MB/sec
Global       16 MB                          80 MB/sec
Main Memory  4096 MB                        40 MB/sec

If your program does enough data reuse to make most of its memory references out of cache then it will run well. The i860 "sweet spot" may be small, but the Alliant caches enlarge it to include:

Advertised speed of light              2240 MFLOPS SP
Advertised speed of light              1120 MFLOPS DP
SPECthru                                313
Linpack 100x100                          31 MFLOPS DP
Linpack 1000x1000                       325 MFLOPS DP
Convolution (500 filterx50,000 data)   2150 MFLOPS SP
2-D FFT (Complex 1K)                    420 MFLOPS SP
Matrix Multiply 1000x1000 (DGEMM)       985 MFLOPS DP (DP REAL matrix)
Matrix Multiply 1000x1000 (ZGEMM)      1018 MFLOPS DP (DP COMPLEX matrix)

To put this in perspective:

* As of Dongarra's 3/19/91 Linpack report the 100x100 31 MFLOPS was the fastest air cooled system available. The next fastest was a NEC SX-1E eking out 32 MFLOPS for $Millions more.

* Also as of 3/19/91 the Linpack 1000x1000 325 MFLOPS was the fastest air cooled system available. The next fastest are the late ETA10-E at 334 MFLOPS and the Cray-2/4-256 at 360 MFLOPS.

* Getting more realistic in price: on May 7 Convex announced the $8 Million air cooled GaAs C3800. This machine's advertised speed of light is 960 MFLOPS DP and 1920 MFLOPS SP. In important scientific library routines the Alliant i860 killer micro delivers 1018 and 2150 MFLOPS, i.e. greater than the C3800 speed of light for about 1/5 the price.

* The Alliant FX/2800 was announced in Jan 1990 and shipped in Mar 1990. The FX/800 version starts at $189K.

* As Eugene Brooks has said: "Nothing will survive the attack of the killer micros."
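To see why data reuse matters so much with the per-processor bandwidths quoted in the post, here is a minimal sketch (not from the original poster). It assumes every byte is served by exactly one level at that level's quoted rate with no overlap between levels, and the traffic fractions in the example are hypothetical.

```python
# Per-processor bandwidths in MB/sec, as quoted in the post above.
BANDWIDTH_MB_S = {
    "on_chip": 960.0,
    "on_board": 128.0,
    "global": 80.0,
    "main_memory": 40.0,
}


def effective_bandwidth(traffic_fractions):
    """Average MB/sec when each level serves the given fraction of the bytes."""
    total = sum(traffic_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("traffic fractions must sum to 1")
    # Time to move one MB is the fraction-weighted sum of each level's time per MB.
    seconds_per_mb = sum(
        fraction / BANDWIDTH_MB_S[level]
        for level, fraction in traffic_fractions.items()
    )
    return 1.0 / seconds_per_mb


if __name__ == "__main__":
    # Hypothetical mix: 90% of bytes from the on-chip cache, 10% from main memory.
    print(effective_bandwidth({"on_chip": 0.9, "main_memory": 0.1}))
```

In this model even a small share of traffic going to main memory dominates the total time, which is one way to read the remark that a program runs well only when most of its references come out of cache.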
alan@uh.msc.umn.edu (Alan Klietz) (05/21/91)
In article <848@llnl.LLNL.GOV> brooks@tazdevil.llnl.gov (Eugene D. Brooks III) writes:
<The "sweet spot" of the i860, 4 K bytes, is indeed very small.
<That the chip is so difficult to compile for indicates a poorly designed architecture.
Another difficulty is the lack of registers. A trivial application
like a vector sum requires 7 registers. With more than a few vector
operands you quickly run out of registers for pipelines, and then have to
either stall to reuse registers or else break the vector expression into
chunks and stream them in and out of memory, tripling the memory bandwidth
requirement.
--
Alan E. Klietz
Minnesota Supercomputer Center, Inc.
1200 Washington Avenue South
Minneapolis, MN 55415
Ph: +1 612 626 1737 Internet: alan@msc.edu
rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/21/91)
>(jgreen of Alliant writes about the performance of the
>FX/2800, which uses the i860)

This summary may sound a little 'boosterish', so I thought I'd chime in, as a user, that I have been considering the FX/800 and have done extensive benchmarks; the thing really does perform as advertised, and if there are any "gotchas" I haven't found them yet. On a lot of scientific codes I have been using, a detached processor in the Alliant runs at about the speed of an IBM RS/6000 model 530, which I think is pretty good, given that you have 8 processors in the thing.

There does seem to be some problem with the onboard chip cache in a parallel environment. In going from running code on a detached processor to running parallel, you have to disable the onboard cache. This is a big performance hit, and has the result that running parallel on two processors often isn't a whole lot faster than running detached on a single processor. You see big performance gains by the time you get up to four processors in parallel, but on an FX/800, that is your limit of how many processors you can normally use on a job (because of memory bandwidth).

In hand-coded routines, like their FFT, Alliant manages to use the onboard cache even in a parallel situation, by maintaining cache coherency through clever tricks. This trick also lets them use all 8 processors in parallel (given that there is a lot of data re-use in the FFT). Chip mods to the i860 that allowed compilers to find this kind of thing would greatly enhance the utility of the chip.

Basically, I think it is POSSIBLE to write a good compiler for the chip, and Alliant has more or less done it, but the extremes to which they had to go suggest also that there are some design flaws in the i860.
brandis@inf.ethz.ch (Marc Brandis) (05/21/91)
In article <1991May17.143025.24242@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
>>That the chip is so difficult to compile for indicates a
>>poorly designed architecture.
>
>Perhaps so. But weren't vectors machines considered difficult targets
>for years (aren't they still)? And consider the difficulties
>compilers have had with parallel machines of all sorts.
>The 860's just a new and interesting problem.

In other words: The hardware designers guarantee that compiler researchers still have something to do by making a bad design from time to time. -:)

It was always my impression that a good architecture tried to balance the things done in hardware and the things done in software, to get the most out of current technology in both fields. The market does not care whether somebody may eventually master the hurdle of writing a compiler that achieves good speed on the i860; the i860 has to stand against current architecture/compiler pairs, not against the ones in 1995.

>The RS/6000 and HP-snakes spend more hardware on the implementation
>and are able to have a cleaner and simpler architecture (an approach
>I support), but the 860 is still faster in some cases
>(say, multiplication of large matrices).

They spend more hardware, and they get more performance. The RS/6000 FP hardware is three times as fast as the one in the i860 (both the adder and the multiplier run in one cycle, while on the i860 they need three). If you have completely vectorizable code, you can get one add and one multiply per cycle out of both the RS/6000 and the i860. Note that the startup time for such a pipelined loop is pretty high on the i860.

Anyway, could you please explain how the i860 should be faster on matrix multiply than the RS/6000, assuming both run at the same clock rate?

Marc-Michael Brandis
Computer Systems Laboratory, ETH-Zentrum (Swiss Federal Institute of Technology)
CH-8092 Zurich, Switzerland
email: brandis@inf.ethz.ch
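The startup-cost remark in the post above can be illustrated with the textbook pipeline-fill formula. This is a sketch, not a model of either chip: it takes the latencies quoted in the post (three cycles versus one) at face value, assumes one result per cycle once the pipeline is full, and ignores memory and loop overhead. The function names are invented for the example.

```python
def pipelined_cycles(n_results, latency_cycles):
    """Cycles to produce n results from a unit with the given latency."""
    if n_results == 0:
        return 0
    # The first result appears after `latency_cycles`; each later one adds a cycle.
    return latency_cycles + n_results - 1


def efficiency(n_results, latency_cycles):
    """Fraction of the steady-state rate achieved on a loop of n results."""
    return n_results / pipelined_cycles(n_results, latency_cycles)


if __name__ == "__main__":
    for n in (4, 16, 256):
        print(n, efficiency(n, latency_cycles=3), efficiency(n, latency_cycles=1))
```

In this model the two latencies converge to the same throughput on long vectors, and the three-cycle unit loses ground only on short loops, where the fill time is a visible share of the total.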
colin@array.UUCP (Colin Plumb) (05/30/91)
In article <dank.674423728@blacks> dank@blacks.jpl.nasa.gov (Dan Kegel) writes:
>Now that I've compared the speed of Fortran programs, I'd like to compare the
>speed of hand-coded assembly versions of the same thing.
>The Alliant comes with a canned signal processing library which is presumably
>hand-optimized assembly code.

Various i860 library vendors claim the following times for 1024-point complex FFT's: 741 us, 830 us, 745 us, 1040 us. I've heard rumours that someone's got a routine operating at <700 us, but that's on the edge of credibility based on some analyses I've made.
--
-Colin
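For comparison with the MFLOPS figures elsewhere in the thread, the quoted FFT times can be converted to a nominal rate. The sketch below is not from the original poster and assumes the common benchmarking convention of 5 N log2(N) floating-point operations for an N-point complex FFT; the vendors' actual operation counts are not given in the post.

```python
import math


def fft_nominal_mflops(n_points, time_us):
    """Nominal MFLOPS for an N-point complex FFT finishing in time_us microseconds."""
    # Assumed convention: 5 * N * log2(N) floating-point operations.
    flops = 5 * n_points * math.log2(n_points)
    # Operations per microsecond is numerically the same as MFLOPS.
    return flops / time_us


if __name__ == "__main__":
    # Times quoted in the post, plus the rumoured 700 us figure.
    for time_us in (741, 830, 745, 1040, 700):
        print(time_us, round(fft_nominal_mflops(1024, time_us), 1))
```

Under this convention a 1024-point transform counts as 51,200 operations, so each quoted time maps directly onto a single-processor rate.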