[comp.arch] Anything wrong with the i860

rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/07/91)

I have seen relatively little about the i860 chip on this newsgroup.
Also, compared to MIPS, it doesn't seem to be very popular as the
base processor for computers;  Alliant uses it in their shared-memory
machines (800 & 2800), Intel has the Touchstone experimental mpp machine,
and the i860 seems popular as a graphics coprocessor (e.g. in the NeXT),
but generally, I see surprisingly little interest in the chip.

Is there something wrong with the architecture?  As a platform,
what are its advantages and disadvantages over the competition?
I am particularly interested in this, as I am thinking of buying
an Alliant FX/800; I did a lot of benchmarks, and the performance
seems extremely good compared to other RISC architectures (even
on a per-processor basis running on throughput rather than
parallelization), so I'm wondering if there isn't some "catch"
I haven't encountered yet.
.

yoshio@maui.cs.ucla.edu (Yoshio Turner) (05/07/91)

rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
>Is there something wrong with the architecture?  As a platform,
>what are its advantages and disadvantages over the competition?

At the Touchstone special session at DMCC6, I recall one Intel speaker
who said the floating point performance suffers from insufficient
memory bandwidth.  He also said an updated i860 will be announced in
"a couple months" that addresses this problem.

Yoshio

hays@iSC.intel.com (Kirk Hays) (05/08/91)

In article <1991May7.145407.18417@midway.uchicago.edu>, rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
|> I have seen relatively little about the i860 chip on this newsgroup.
|> Also, compared to MIPS, it doesn't seem to be very popular as the
|> base processor for computers;  Alliant uses it in their shared-memory
|> machines (800 & 2800), Intel has the Touchstone experimental mpp machine,

Just to set the record straight, we do sell the iPSC/860, which contains up to
128 i860 processors, connected as a hypercube.  The Touchstone Delta Field
Prototype (the "mpp machine" referenced above, aka "DFP") was recently available
for inspection by the attendees of DMCC6 in Portland, OR.  It has over 500 i860
processors, connected as a mesh, with a peak performance of 30 GFLOPS.

|> and the i860 seems popular as a graphics coprocessor (e.g. in the NeXt),
|> but generally, I see surprisingly little interest in the chip.
|> 

No comment.

Obviously, I do not speak for Intel, being but a lowly engineer.
-- 
Kirk Hays - NRA Life.
Message for Timothy Fay - "Do not eat/wear/exploit things you will not kill."

dik@cwi.nl (Dik T. Winter) (05/08/91)

In article <1991May7.145407.18417@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
 > I have seen relatively little about the i860 chip on this newsgroup.
 > Also, compared to MIPS, it doesn't seem to be very popular as the
 > base processor for computers;  Alliant uses it in their shared-memory
 > machines (800 & 2800), Intel has the Touchstone experimental mpp machine,
 > and the i860 seems popular as a graphics coprocessor (e.g. in the NeXt),
 > but generally, I see surprisingly little interest in the chip.
FPS and Stardent (will) use it as a computational coprocessor (I think).
 > 
 > Is there something wrong with the architecture?  As a platform,
 > what are its advantages and disadvantages over the competition?
 > I am particularly interested in this, as I am thinking of buying
 > an Alliant F/800; I did a lot of benchmarks, and the performance
 > seems extremely good compared to other RISC architectures (even
 > on a per-processor basis running on throughput rather than
 > parallelization), so I'm wondering if there isn't some "catch"
 > I haven't encountered yet.
The problem I see with the i860 is that it is very good with specially tuned
code, but that it is extremely bothersome for compilers to get that performance.
(This holds for f-p only; if you are thinking non-f-p it can compete with the
others.)  My opinion is that you would want the i860 only if your f-p work-load
consists mainly of the use of standard libraries.  Do not expect such good
performance if you code everything yourself.  So for chemical/physical research
it is quite good (if you use the standard libraries), but when you are doing
research in numerical mathematics it is less well suited.  On the other hand,
Alliant's model for parallelism is very good for doing basic research in
parallel algorithms (in that case performance is not the main problem; we are
still on an FX/4; talking about performance :-)).
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

dik@cwi.nl (Dik T. Winter) (05/08/91)

In article <yoshio.673633924@maui> yoshio@maui.cs.ucla.edu (Yoshio Turner) writes:
 > At the Touchstone special session at DMCC6, I recall one intel speaker
 > who said the floating point performance suffers from insufficient
 > memory bandwidth.  He also said an updated i860 will be announced in
 > "a couple months" that addresses this problem.
While memory bandwidth is indeed a problem, it is not the only problem.
I think that lack of registers also plays a role, as does lack of good
compilers, etc.  And while we are talking about memory bandwidth, I think the
major problem is that at most three loads can be posted concurrently.
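
The limit on posted loads can be turned into a rough bandwidth ceiling with
Little's law.  The sketch below is only an illustration: the limit of three
outstanding loads comes from the post, while the 8-byte load size, the 40 MHz
clock and the latency values are assumptions, not measured i860 figures.

```python
def load_bandwidth_ceiling(outstanding_loads, bytes_per_load,
                           latency_cycles, clock_mhz):
    """Upper bound on sustained load bandwidth in MB/s (Little's law).

    At most `outstanding_loads` requests are in flight, and each one
    occupies its slot for `latency_cycles` cycles.
    """
    bytes_per_cycle = outstanding_loads * bytes_per_load / latency_cycles
    return bytes_per_cycle * clock_mhz  # bytes/cycle * Mcycles/s = MB/s


if __name__ == "__main__":
    # Hypothetical latencies; only the limit of three loads is from the post.
    for latency in (3, 6, 12):
        print(latency, load_bandwidth_ceiling(3, 8, latency, 40.0))
```

Under these assumptions the ceiling falls in direct proportion to memory
latency, which is why a fixed limit on loads in flight hurts more as the
memory gets further away.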
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

jgriffit@isis.cs.du.edu (Jonathan Griffitts) (05/08/91)

The i860 also has a reputation of being buggy, and apparently the bug
list is hard to get.  From what I understand (not first-hand
information) some of the bugs have a significant adverse effect on
performance.

I presume that these bugs are being/may have been worked out.


				--JCG

--
			   --JCG
AnyWare Engineering, Boulder CO
303 442-0556

akhiani@ricks.enet.dec.com (Homayoon Akhiani) (05/09/91)

The following is from Computer Design May 1,1991 issue:

"860 bugs kept under cover"
"...At least one board vendor has been required to sign a nondisclosure 
agreement with Intel prohibiting the discussion of the bugs with customers.
...after nearly two years in the market, the 860 still has five bugs,
one of which is said to significantly affect floating-point performance."

cfj@iSC.intel.com (Charlie Johnson) (05/09/91)

In article <1991May7.145407.18417@midway.uchicago.edu>, rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
|> I have seen relatively little about the i860 chip on this newsgroup.
|> Also, compared to MIPS, it doesn't seem to be very popular as the
|> base processor for computers;  Alliant uses it in their shared-memory
|> machines (800 & 2800), Intel has the Touchstone experimental mpp machine,
|> and the i860 seems popular as a graphics coprocessor (e.g. in the NeXt),
|> but generally, I see surprisingly little interest in the chip.
|> 
|> Is there something wrong with the architecture?  As a platform,
|> what are its advantages and disadvantages over the competition?
|> I am particularly interested in this, as I am thinking of buying
|> an Alliant F/800; I did a lot of benchmarks, and the performance
|> seems extremely good compared to other RISC architectures (even
|> on a per-processor basis running on throughput rather than
|> parallelization), so I'm wondering if there isn't some "catch"
|> I haven't encountered yet.
|> .

Intel sells the iPSC/860 which is not an experimental machine.  It is a
supported product which has up to 128 i860s.

-- 
Charles Johnson
Intel Corporation
Supercomputer Systems Division
MS CO1-01
15201 NW Greenbrier Pkwy
Beaverton, OR  97006           phone: (503)629-7605  email: cfj@ssd.intel.com

paul@taniwha.UUCP (Paul Campbell) (05/09/91)

In article <yoshio.673633924@maui> yoshio@maui.cs.ucla.edu (Yoshio Turner) writes:
>At the Touchstone special session at DMCC6, I recall one intel speaker
>who said the floating point performance suffers from insufficient
>memory bandwidth.  He also said an updated i860 will be announced in
>"a couple months" that addresses this problem.


Oh great !-( - I can just see it - the engineers put a 128 bit bus on
the thing to get the system performance up and the Intel marketing
people blaze out with "Intel releases the first 128 bit micro"
(remember you saw it here first :-)


	Paul


-- 
Paul Campbell    UUCP: ..!mtxinu!taniwha!paul     AppleLink: CAMPBELL.P

My son is now 2 months old, in that time he has doubled his weight,
if he does this every 2 months for the next year he will weigh over 300lbs.

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (05/12/91)

In article <1991May7.145407.18417@midway.uchicago.edu>
   rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
>I have seen relatively little about the i860 chip on this newsgroup.
>Alliant uses it in their shared-memory
>machines (800 & 2800),...
>Is there something wrong with the architecture?

Long ago, Intel announced that it would be getting its 860 compilers
from Alliant. They obviously thought (as did I) that Alliant was
pretty good.

That product is now about a year overdue. Unfortunately, the
pessimists seem to have been on the mark.
-- 
Don		D.C.Lindsay 	Carnegie Mellon Robotics Institute

terry@venus.sunquest.com (Terry R. Friedrichsen) (05/14/91)

paul@taniwha.UUCP (Paul Campbell) writes:
>yoshio@maui.cs.ucla.edu (Yoshio Turner) writes:
>> ... an updated i860 will be announced in
>>"a couple months" that addresses this [memory bandwidth] problem.

>Oh great !-( - I can just see it - the engineers put a 128 bit bus on
>the thing to get the system performance up and the Intel marketting
>people blaze out with "Intel releases the first 128 bit micro"
>(remember you saw it here first :-)

OK, let's look a year or two into the future:

The new chip (i870?) will only cost about 6 times what it ought to.
However, you will be able to bring this down to 3 times what it ought
to cost by purchasing a crippled version of the chip (i870SX) which
multiplexes the 128-pin bus onto the lower 64 pins.

Then, for just 150% of the cost of the original i870, you will be able
to buy an i871 co-processor chip.  This will be nothing more than an i870
with "i871" printed on the outside, which fits into a socket alongside the
crippled i870SX and disables it, restoring the 128-pin bus capability of
your machine.

Thus, Intel will once again give the customer what he wants (can you guess
what that is?).

End of trip to the future.  (If you think this sounds a lot like a
current Intel processor's evolution, by golly, you're right!)

Oh, what I would give to have the whole planet just say "NO" to the i586 ...

Terry R. Friedrichsen

terry@venus.sunquest.com  (Internet)
uunet!sunquest!terry	  (Usenet)
terry@sds.sdsc.edu        (alternate address; I live in Tucson)

Quote:  "Do, or do not.  There is no 'try'." - Yoda, The Empire Strikes Back

pteich@cayman.amd.com (Paul Teich) (05/15/91)

In article <3486@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
| In article <1991May7.145407.18417@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
|  > Alliant uses it in their shared-memory machines (800 & 2800)
| FPS and Stardent (will) use it as computational coprocessor (I think).

The Anderson Report, April 1991

  "Stardent has discontinued the Stardent 500 Stiletto workstation family.
They had problems with the vector unit integration of the MIPS R3000 and
Intel i860 processors.  Stiletto customers are being offered the new 750
system as a replacement..."

RISC Management, March 1991

  "Stratus Computer, which is still using 68K processors while struggling to
implement i860-based systems, ... The company did start delivery of FTX, its
long-delayed UNIX SVR3 port last December; no word yet, however, on the
i860-based systems."
  "Another company apparently having difficulty with the i860 is Stardent
Computer. ... Stardent has been working on an i860-based replacement for the
former Stellar product line, but, like Stratus, must be questioning whether to
continue."
  "The remaining announced i860-based multiuser system vendor, Alliant
Computer Systems, reported its third straight loss, $3 million."

For those designing high-end graphics subsystems, remember, there is an
alternative to the i860 - the Am29050.  This is not an advertisement, so
no flames...; call or email me for more information.

I speak only for myself, etc.
--
Paul R. Teich                          pteich@cayman.AMD.COM
Advanced Micro Devices, Inc.           Direct 1-512-462-4268
5900 E. Ben White Blvd., MS 561        WATS   1-800-531-5202 x54268
Austin, Texas  78741                   FAX    1-512-462-5051
===============================================================================
The mind of man, the soul, spirit or whatever, is infinite in its grasp; the
universe may be only finite.  At least there is no limit to what we've been
able to imagine so far.  -Jessie Greenstein, Radio Astronomer,_The_Astronomers_

carroll@ssc-vax (Jeff Carroll) (05/16/91)

In article <3486@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <1991May7.145407.18417@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
> > I have seen relatively little about the i860 chip on this newsgroup.
> > Also, compared to MIPS, it doesn't seem to be very popular as the
> > base processor for computers;  Alliant uses it in their shared-memory
> > machines (800 & 2800), Intel has the Touchstone experimental mpp machine,
> > and the i860 seems popular as a graphics coprocessor (e.g. in the NeXt),
> > but generally, I see surprisingly little interest in the chip.
>FPS and Stardent (will) use it as computational coprocessor (I think).

	FPS pitched this box to me more than a year ago, and if it's not
out yet it's probably never going to be. It's essentially the old
Celerity box with SPARCs where the Celerity chips used to be, and two
or three optional add-in boards, one of which is supposed to be up to
84 i860s sharing a single bank of memory. They claim that bandwidth to
memory is sufficient; I don't remember the actual numbers and will thus
leave this to the judgment of the reader, but I thought that claim a 
little hard to believe at the time...

> > 
> > Is there something wrong with the architecture?  As a platform,
> > what are its advantages and disadvantages over the competition?
> > I am particularly interested in this, as I am thinking of buying
> > an Alliant F/800; I did a lot of benchmarks, and the performance
> > seems extremely good compared to other RISC architectures (even
> > on a per-processor basis running on throughput rather than
> > parallelization), so I'm wondering if there isn't some "catch"
> > I haven't encountered yet.

	The i860, to the best of my knowledge, has the best floating-point
performance of any microprocessor in the world today (possible bugs
notwithstanding; I haven't seen 'em, but I don't build i860 systems.
I just use 'em.) The integer performance is closer to other recent RISC chips.

>The problem I see with the i860 is that it is very good with
>especially tuned code, but that it is extremely bothersome for
>compilers to get that performance.  (This holds for f-p only, if you
>are thinking non-f-p it can compete with the others.) 

But if you are thinking non-floating-point there might well be strong
reasons for going with another chip. As long as you don't have to write
the compilers, or live with them while the vendor is coming up the 
learning curve, you don't care how "bothersome" it is. In fact there
are some i860 compilers that are better than others, and I've been told
by people I trust that the Portland Group compilers are quite good
(maybe some day soon I'll be able to verify this myself).

> My opinion is
>that you would want the i860 only if your f-p work-load consist mainly
>of the use of standard libraries.  Do not expect such a good
>performance if you code everything yourself.  So for chemical/physical
>research it is quite good (if you use the standard libraries),

I use a canned library on our i860 boards, but I do it mostly to save time
in coding, not because it is especially fast (which it isn't). I'll
withhold the name in order to protect the guilty, and also the innocent
(namely myself).

> but
>when you are doing research in numerical mathematics it is less well
>suited.  On the other hand Alliants model for parallellism is very
>good to do basic research in parallel algorithms (in that case
>performance is not the main problem; we are still on an FX/4; talking
>about performance :-)).

Actually the i860 made possible some work on parallel numerical algorithms
that would have been orders of magnitude more expensive (in both time and
dollars) without it. In particular an Intel/Boeing team found that it
was possible to achieve very nearly peak performance from the i860 using
very little assembly code, on certain problems.


-- 
Jeff Carroll
carroll@ssc-vax.boeing.com

"Do you think I care? ... I have an infinite amount of money."	-Bill Gates

dank@blacks.jpl.nasa.gov (Dan Kegel) (05/17/91)

carroll@ssc-vax (Jeff Carroll) writes:
>... an Intel/Boeing team found that it was possible to achieve very nearly 
>peak performance from the i860 using very little assembly code, on certain 
>problems.

I just ran a stupid benchmark (the FFT from Numerical Recipes, length 1024,
2000 reps) on both a Sun 4/470 and an Alliant FX/2800.  On the Sun, I used
f77 -fast; on the Alliant, fortran -O.  Performance was within 10% of identical.
[Gee, everyone said the i860 was much faster than a Sun; guess I misunderstood.]

Now that I've compared the speed of Fortran programs, I'd like to compare the
speed of hand-coded assembly versions of the same thing.
The Alliant comes with a canned signal processing library which is presumably
hand-optimized assembly code.
Does anyone know of such a library for the Sun 4?
- Dan K

preston@ariel.rice.edu (Preston Briggs) (05/17/91)

dank@blacks.jpl.nasa.gov (Dan Kegel) writes:

>I just ran a stupid benchmark (the FFT from Numerical Recipies, length 1024,
>2000 reps) on both a Sun 4/470 and an Alliant FX/2800.
>Performance was within 10% of identical.
>[Gee, everyone said the i860 was much faster than a Sun; 
>guess I misunderstood.]

The i860 can smoke a Sparc, but it takes a smart (mostly nonexistent)
compiler and the right applications.  A man at Alliant characterized
the 860 as having a small "sweet spot".

In early experiments (meaning primitive compilers), we measured a factor
of 21 improvement of (very tedious) hand coding over compiler results.

Basically, really advanced architectures require really advanced compilers
to get best results.  For example, vector machines need vectorizing
compilers.  The i860 needs many of the same techniques.

Preston Briggs

wright@Stardent.COM (David Wright) (05/17/91)

In article <1991May15.152456.4246@dvorak.amd.com>
pteich@cayman.amd.com (Paul Teich) writes: 
>In article <3486@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>| In article <1991May7.145407.18417@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
>|  > Alliant uses it in their shared-memory machines (800 & 2800)
>| FPS and Stardent (will) use it as computational coprocessor (I think).
>
>The Anderson Report, April 1991
>
>  "Stardent has discontinued the Stardent 500 Stiletto workstation family.
>They had problems with the vector unit integration of the MIPS R3000 and
>Intel i860 processors.  Stiletto customers are being offered the new 750
>system as a replacement..."

>
>RISC Management, March 1991
>
>  "Another company apparently having difficulty with the i860 is Stardent
>Computer. ... Stardent has been working on an i860-based replacement for the
>former Stellar product line, but, like Stratus, must be questioning
>whether to continue."

I'm an OS engineer, not an official spokesman for the company, so
let's be clear that I've now established deniability, OK?  That done,
a few things should be straightened out here...

The discontinuation of Stiletto had nothing to do with defects in the
i860 or the i860 being a bad processor.  (No, I am not going to
discuss why it was cancelled.)

We are NOT questioning whether to continue our i860-based product
line.  Quite the contrary.  This week, Stardent announced three new
models based on the i860, one of which contains a nifty graphics
accelerator that contains i860s of its own.  And we certainly have
plans for future systems based around the i860 and its successors.
(Intel has been a pretty good outfit to deal with, as far as we're
concerned.)

So don't believe everything you read in the papers.  (Or do believe
everything, but then be prepared to be contradicted a lot.)

><blatant AMD self-promotion deleted>

No, the i860 is not perfect and the trap code can be kind of a pain,
but it's a pretty slick chip, even if the floating point is a bear to
optimize.

  -- David Wright, not officially representing Stardent Computer Inc
     wright@stardent.com  or  uunet!stardent!wright

"In accordance with our principles of free enterprise and healthy
competition, I'm going to ask you two to fight to the death for it."
  -- Monty Python

brooks@tazdevil.llnl.gov (Eugene D. Brooks III) (05/17/91)

In article <1991May16.221437.10751@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
>The i860 can smoke a Sparc, but it takes a smart (mostly non-existant)
>compiler and the right applications.  A man at Alliant charactarized
>the 860 as having a small "sweet spot".
The "sweet spot" of the i860, 4 K bytes, is indeed very small.
That the chip is so difficult to compile for indicates a poorly designed architecture.

The "sweet spot" of several more recently available micros, IBM's and HP's, is
much larger with data caches of 128 K bytes.  Sweet spots which are 32 times
larger are definitely better.

The "sweet spot" of a true supercomputer ranges in the hundreds of megawords.
Sweet spots which are more than a thousand times larger are better,
but not always a thousand times better.

It is clear that the fundamental problem with leading edge micros today is
when they miss their sweet spot they go hungry.  In time, micro vendors
and memory chip vendors will get together and provide DRAM chips with
the internal latches required to interleave memory on a chip.  Then the
micro's sweet spot will be in the same size class as that of a supercomputer
and the conventional supercomputer will move on to complete extinction.

By then, the i860 and all the other micros lacking some form of explicit
vector functionality to saturate the available memory bandwidth will be
fodder for lawn sprinkler controllers.

uh311ae@sunmanager.lrz-muenchen.de (Henrik Klagges) (05/17/91)

Hi, could you forward me a little information about the 29050 ?
Thanks a lot !

Rick@vee.lrz-muenchen.de

preston@ariel.rice.edu (Preston Briggs) (05/17/91)

preston@ariel.rice.edu (Preston Briggs) writes:
>>The i860 can smoke a Sparc, but it takes a smart (mostly non-existant)
>>compiler and the right applications.  A man at Alliant charactarized
>>the 860 as having a small "sweet spot".

brooks@tazdevil.llnl.gov (Eugene D. Brooks III) writes:
>The "sweet spot" of the i860, 4 K bytes, is indeed very small.

By sweet spot, I would include more than just cache (which is 8K btw).
If the application is not FP intense, then you're wasting a good
portion of the chip.  If it's DP, you give up a factor of 2 in
multiplication rate.  If it's not balanced in FP adds and multiplies,
you give up further big chunks.  You also need fair-sized loops
to help mitigate the long latencies associated with the pipelines.
And, if there's inadequate reuse of data, you're going to be bound
up by the cost of memory accesses (for example, when adding 2 long vectors).
Note that larger caches are no help if there's no reuse.
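
The add/multiply balance and double-precision points above can be put into a
toy model.  This is a sketch under stated assumptions, not a description of
the real pipelines: it assumes one add and one multiply can issue per cycle,
that a double-precision multiply issues only every other cycle, and it
ignores latency, loads and loop overhead.  The function name is made up for
the illustration.

```python
def fp_pipeline_utilization(n_adds, n_muls, double_precision=False):
    """Fraction of the single-precision peak (2 flops/cycle) that an
    ideally scheduled loop body with this operation mix could reach.

    Toy model: one add and one multiply can issue per cycle, and a
    double-precision multiply issues only every other cycle.
    """
    mul_cycles = n_muls * (2 if double_precision else 1)
    cycles = max(n_adds, mul_cycles)
    if cycles == 0:
        return 0.0
    return (n_adds + n_muls) / (2 * cycles)


if __name__ == "__main__":
    print(fp_pipeline_utilization(1, 1))                          # balanced SP
    print(fp_pipeline_utilization(1, 1, double_precision=True))   # balanced DP
    print(fp_pipeline_utilization(2, 0))                          # adds only
```

In this model a balanced single-precision loop reaches the full rate, while
an adds-only loop or a balanced double-precision loop reaches half of it.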

>By then, the i860 and all the other micros lacking some form of explicit
>vector functionality to saturate the available memory bandwidth will be
>fodder for lawn sprinkler controllers.

But the 860 (and RS/6000 and HP's machines) can easily saturate the
available memory bandwidth!  What we want is more bandwidth or paths
to memory or something to keep up with the FP.  Of course, this is
all expensive.

>That the chip is so difficult to compile for indicates a
>poorly designed architecture.

Perhaps so.  But weren't vector machines considered difficult targets
for years (aren't they still)?  And consider the difficulties
compilers have had with parallel machines of all sorts.
The 860's just a new and interesting problem.

The RS/6000 and HP-snakes spend more hardware on the implementation
and are able to have a cleaner and simpler architecture (an approach
I support), but the 860 is still faster in some cases
(say, multiplication of large matrices).

Preston Briggs

woan@exeter.austin.ibm.com (Ronald S Woan) (05/17/91)

In article <848@llnl.LLNL.GOV> brooks@tazdevil.llnl.gov (Eugene D. Brooks III) writes:
>In article <1991May16.221437.10751@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
>>The i860 can smoke a Sparc, but it takes a smart (mostly non-existant)
>>compiler and the right applications.  A man at Alliant charactarized
>>the 860 as having a small "sweet spot".
>The "sweet spot" of the i860, 4 K bytes, is indeed very small.  That
>the chip is so difficult to compile for indicates a poorly designed
>architecture.

I seriously doubt that by "sweet spot", he was referring to just the
size of the cache which can't be directly compared across
architectures based on just size as it matters how they are managed
and replenished. Rather, I think he was referring to the complexity of
coding for the i860's multiple processing units which I believe they
expose to the software (not one of those supersmart superscalar
out-of-order-execution guys).

-- 
+-----All Views Expressed Are My Own And Are Not Necessarily Shared By------+
+------------------------------My Employer----------------------------------+
+ Ronald S. Woan                woan@cactus.org or woan@austin.vnet.ibm.com +
+ other email addresses             Prodigy: XTCR74A Compuserve: 73530,2537 +

carroll@ssc-vax (Jeff Carroll) (05/18/91)

In article <13008@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>
>Long ago, Intel announced that it would be getting its 860 compilers
>from Alliant. They obviously thought (as did I) that Alliant was
>pretty good.
>
>That product is now about a year overdue. Unfortunately, the
>pessimists seem to have been on the mark.

	Intel has also OEMed i860 compilers from Green Hills and the Portland
Group, and struck an (unannounced) deal with Multiflow for compiler
technology at about the same time they announced the Alliant deal.

	I think that Alliant's failure to produce its compiler on schedule
likely says more about Alliant than it does about the i860.



-- 
Jeff Carroll
carroll@ssc-vax.boeing.com

"Like sands through the hourglass, so are the days of our lives." - Socrates

chased@rbbb.Eng.Sun.COM (David Chase) (05/19/91)

carroll@ssc-vax.UUCP (Jeff Carroll) writes:
>	Intel has also OEMed i860 compilers from Green Hills and the Portland
>Group, and struck an (unannounced) deal with Multiflow for compiler
>technology at about the same time they announced the Alliant deal.
>
>	I think that Alliant's failure to produce its compiler on schedule
>likely says more about Alliant than it does about the i860.

I don't think this is a valid criticism.  There were compilers for the
i860 some time ago; the question is whether the writers of those
compilers were attempting to generate code for (as Preston Briggs put
it) the i860's "sweet spot".  If so, the only thing they can be
faulted for is thinking they could be done by now.

The BEST code for the i860 is very hard to generate.  I expect that
when compilers are finally generating very good code for that chip,
they will do so by blocking techniques used to ensure that operands
are in the cache at known alignments (that is, blocks will be copied).

It is also possible in good situations to generate code that does not
require this copying, but I've only seen it done by hand.  I tried
some of it myself (taking into account all the rules about stalled
instructions in the reference manual) and managed to write some code
for matrix multiply (not the transposed version in the manual) that
would apparently hit "full speed" for any matrix where a single row
would fit in the cache.  It was very hard.

The difficulty comes from having to get everything right
simultaneously -- in the triply nested loop, you unroll an outer loop
by three, jam it into the innermost loop, unroll that by four,
recognize that you can accumulate three inner products simultaneously
in the adder pipeline (but that's because you know to select the right
instruction), and load the proper operands from the cache, load the
other operands using the cache-bypassing pipelined load instructions,
and know that everything is correctly aligned so that multi-word loads
can be used.

If you didn't unroll by three, or if you selected the wrong
instruction, or if you didn't do the registers-in-the-pipeline trick,
or if you didn't get the proper assignment of operands to cache and main
memory, or if the alignments weren't right, then the performance drops
precipitously.
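
The loop restructuring described above is usually called unroll-and-jam.
Here is a minimal sketch of one possible reading of it, written in Python
for readability rather than speed.  It assumes the unrolled outer loop is
the row loop, and it shows only the unroll by three and the three
accumulators; the further unroll by four, the instruction selection, the
cache-bypassing loads and the alignment work are not modelled.

```python
def matmul_unroll_jam(a, b):
    """C = A * B with the row loop unrolled by three and jammed inward.

    Three inner products are accumulated at once, so every element of B
    loaded in the inner loop is used three times.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    full = n - n % 3
    for i in range(0, full, 3):
        for j in range(p):
            s0 = s1 = s2 = 0.0
            for k in range(m):
                bkj = b[k][j]            # loaded once, used three times
                s0 += a[i][k] * bkj
                s1 += a[i + 1][k] * bkj
                s2 += a[i + 2][k] * bkj
            c[i][j], c[i + 1][j], c[i + 2][j] = s0, s1, s2
    for i in range(full, n):             # leftover rows, plain loop
        for j in range(p):
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(m))
    return c
```

The three independent sums are what would let an adder pipeline stay busy:
each accumulation can be in flight while the other two advance.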

For other operations (e.g., elementary row operations of linear
algebra) you pretty much have to reorganize operands so that
everything is in the cache.  If not, the bottleneck is architectural;
if you assume that B is cached but A is not (i.e., we expect to
eliminate B from many rows, so we put it in the cache) in

   A[i] = A[i] + F * B[i]

you end up with 64 bits of off-chip I/O per cycle.  This is the max,
and the chip could do it, except that the instruction grouping rules
say, "one from column F, one from column I".  If you have N Mpy-adds,
that's N instructions to schlep A around, but you still need N/4 more
to load B from the cache (using quad-word loads that are not available
in the pipelined form) plus one to control the loop.
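
Taking the counts in the paragraph above at face value, the imbalance can be
worked out directly.  The sketch assumes dual-instruction mode issues one
core instruction and one floating-point instruction per cycle, and that the
core column does all the data movement and loop control; the function names
are made up for the illustration.

```python
def core_side_instructions(n):
    """Core-column instructions for n multiply-adds, per the post's counts."""
    move_a = n            # instructions that move A on and off the chip
    load_b = n / 4        # quad-word loads of B from the cache
    loop_control = 1
    return move_a + load_b + loop_control


def fp_issue_fraction(n):
    """Fraction of cycles in which the FP column has a multiply-add to issue,
    assuming one core and one FP instruction can issue per cycle."""
    return n / core_side_instructions(n)


if __name__ == "__main__":
    for n in (4, 16, 64, 1000):
        print(n, core_side_instructions(n), round(fp_issue_fraction(n), 3))
```

Under these assumptions the floating-point column can never be busy more
than 80% of the time on this loop, however far it is unrolled.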

Of course, it goes w/o saying that you've already done dependence
analysis on everything, and that you are running the chip in
dual-instruction mode.  I haven't even begun to talk about the
details, so you begin to see what a "challenge" this chip is.

Another approach would be to use pattern-matching (big patterns, too)
and just call canned routines written by hand.

David Chase
Sun

jgreen@Alliant.COM (John C Green Jr) (05/21/91)

Eugene D Brooks III <llnl.gov> and Preston Briggs <rice.edu> have mentioned the
size of the i860 "sweet spot". Alliant builds the FX/2800, an air cooled
supercomputer from multiple i860s with global shared memory and has extensive
experience in looking for, finding, and expanding the "sweet spot."

RISC microprocessors, including the i860, need lots of bandwidth. Alliant
believes the way to deliver this bandwidth to the i860 is a multiple level
cache:

    Location    Size                          Bandwidth Per Processor
    =========== ============================= =======================
    On chip        4 KB Instruction+8 KB Data  960 MB/sec
    On board     256 KB                        128 MB/sec
    Global        16 MB                         80 MB/sec
    Main Memory 4096 MB                         40 MB/sec
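
As a rough illustration of what these per-processor bandwidths mean for a
loop with no data reuse, the sketch below computes the floating-point rate
each level could sustain.  The bandwidth figures are the ones in the table;
the kernel (a double-precision vector add, c[i] = a[i] + b[i], moving 24
bytes per flop) is an assumption chosen for the example.

```python
LEVELS_MB_PER_SEC = {
    "on chip": 960,
    "on board": 128,
    "global": 80,
    "main memory": 40,
}

BYTES_PER_FLOP = 24  # assumed kernel: c[i] = a[i] + b[i] in double precision


def bandwidth_bound_mflops(mb_per_sec, bytes_per_flop=BYTES_PER_FLOP):
    """MFLOPS ceiling for a kernel that moves bytes_per_flop bytes per flop."""
    return mb_per_sec / bytes_per_flop


if __name__ == "__main__":
    for level, bandwidth in LEVELS_MB_PER_SEC.items():
        print(level, round(bandwidth_bound_mflops(bandwidth), 2))
```

For this assumed kernel the ceiling is 40 MFLOPS from the on-chip cache and
under 2 MFLOPS from main memory.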

If your program does enough data reuse to make most of its memory references
out of cache then it will run well. The i860 "sweet spot" may be small, but the
Alliant caches enlarge it to include:

    Advertised speed of light            2240 MFLOPS SP
    Advertised speed of light            1120 MFLOPS DP
    SPECthru                              313
    Linpack 100x100                        31 MFLOPS DP
    Linpack 1000x1000                     325 MFLOPS DP
    Convolution (500 filterx50,000 data) 2150 MFLOPS SP
    2-D FFT (Complex 1K)                  420 MFLOPS SP
    Matrix Multiply 1000x1000 (DGEMM)     985 MFLOPS DP (DP REAL matrix)
    Matrix Multiply 1000x1000 (ZGEMM)    1018 MFLOPS DP (DP COMPLEX matrix)
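
For readers who want the figures above as a fraction of the advertised speed
of light, the arithmetic can be sketched as follows. The numbers are copied
from the table; the names PEAK_MFLOPS, RESULTS and fraction_of_peak are
hypothetical. SPECthru is left out because it is not an MFLOPS figure.

```python
# Advertised speed of light, in MFLOPS, from the table above.
PEAK_MFLOPS = {"SP": 2240, "DP": 1120}

# (benchmark, measured MFLOPS, precision), also from the table above.
RESULTS = [
    ("Linpack 100x100", 31, "DP"),
    ("Linpack 1000x1000", 325, "DP"),
    ("Convolution", 2150, "SP"),
    ("2-D FFT (Complex 1K)", 420, "SP"),
    ("DGEMM 1000x1000", 985, "DP"),
    ("ZGEMM 1000x1000", 1018, "DP"),
]


def fraction_of_peak(mflops, precision):
    """Measured rate divided by the advertised peak for that precision."""
    return mflops / PEAK_MFLOPS[precision]


for name, mflops, precision in RESULTS:
    share = 100.0 * fraction_of_peak(mflops, precision)
    print(f"{name:22s} {share:5.1f}% of {precision} peak")
```

The library routines with heavy data reuse (convolution, DGEMM, ZGEMM) land
close to the advertised peak, while Linpack 100x100 reaches only a few
percent of it.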

To put this in perspective:

* As of Dongarra's 3/19/91 Linpack report the 100x100 31 MFLOPS was the fastest
  air cooled system available. The next fastest was an NEC SX-1E eking out 32
  MFLOPS for $Millions more.

* Also as of 3/19/91 the Linpack 1000x1000 325 MFLOPS was the fastest air
  cooled system available. The next fastest are the late ETA10-E at 334 MFLOPS
  and the Cray-2/4-256 at 360 MFLOPS.

* Getting more realistic in price: on May 7 Convex announced the $8 Million air
  cooled GaAs C3800. This machine's advertised speed of light is 960 MFLOPS DP
  and 1920 MFLOPS SP. In important scientific library routines the Alliant i860
  killer micro delivers 1018 and 2150 MFLOPS, i.e. greater than the C3800 speed
  of light for about 1/5 the price.

* The Alliant FX/2800 was announced in Jan 1990 and shipped in Mar 1990. The
  FX/800 version starts at $189K.

* As Eugene Brooks has said: "Nothing will survive the attack of the killer
  micros."

alan@uh.msc.umn.edu (Alan Klietz) (05/21/91)

In article <848@llnl.LLNL.GOV> brooks@tazdevil.llnl.gov (Eugene D. Brooks III) writes:
<The "sweet spot" of the i860, 4 K bytes, is indeed very small.
<That the chip is so difficult to compile for indicates a poorly designed architecture.

Another difficulty is the lack of registers.  A trivial application
like a vector sum requires 7 registers.  With more than a few vector
operands you quickly run out of registers for pipelines, and then have to
either stall to reuse registers or else break the vector expression into
chunks and stream them in and out of memory, tripling the memory bandwidth
requirement.

--
Alan E. Klietz
Minnesota Supercomputer Center, Inc.
1200 Washington Avenue South
Minneapolis, MN  55415
Ph: +1 612 626 1737	       Internet: alan@msc.edu

rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/21/91)

>(jgreen of Alliant writes about the performance of the
>FX/2800, which uses the i860)

This summary may sound a little 'boosterish', so I thought I'd
chime in, as a user, that I have been considering the FX/800, and
done extensive benchmarks; the thing really does perform as
advertised, and if there are any "gotchas" I haven't found them yet.

On a lot of scientific codes I have been using, the performance of
a detached processor in the Alliant runs at about the speed of
an IBM RS/6000 model 530, which I think is pretty good, given
that you have 8 processors in the thing.

There does seem to be some problem with the onboard chip cache in
a parallel environment.  In going from running code on a detached
processor to running parallel, you have to disable the onboard cache.
This is a big performance hit, and has the result that running
parallel on two processors often isn't a whole lot faster than
running detached on a single processor.  You see big performance
gains by the time you get up to four processors in parallel, but
on an FX/800, that is your limit of how many processors you can
normally use on a job (because of memory bandwidth).  

In hand-coded routines, like their FFT, Alliant manages to use
the onboard cache even in a parallel situation, by maintaining
cache-coherency through clever tricks.  This trick also lets
them use all 8 processors in parallel (given that there is a
lot of data re-use in the FFT).  Chip mods to the i860 that
allowed compilers to find this kind of thing would greatly
enhance the utility of the chip.

Basically, I think it is POSSIBLE to write a good compiler for
the chip, and Alliant has more or less done it, but the extremes
to which they had to go suggest also that there are some
design flaws in the i860.

brandis@inf.ethz.ch (Marc Brandis) (05/21/91)

In article <1991May17.143025.24242@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
>>That the chip is so difficult to compile for indicates a
>>poorly designed architecture.
>
>Perhaps so.  But weren't vector machines considered difficult targets
>for years (aren't they still)?  And consider the difficulties
>compilers have had with parallel machines of all sorts.
>The 860's just a new and interesting problem.

In other words: The hardware designers guarantee that compiler researchers
still have something to do by making a bad design from time to time. -:)
It was always my impression that a good architecture tried to balance the
things done in hardware and the things done in software, to get the most out
of current technology in both fields. The market does not care whether
somebody may eventually master the hurdle to write a compiler that achieves
good speed on the i860; the i860 has to stand against current architecture/
compiler pairs, not against the ones in 1995.

>The RS/6000 and HP-snakes spend more hardware on the implementation
>and are able to have a cleaner and simpler architecture (an approach
>I support), but the 860 is still faster in some cases
>(say, multiplication of large matrices).

They spend more hardware, and they get more performance. The RS/6000 FP 
hardware is three times as fast as the one in the i860 (both the adder and
the multiplier run in one cycle, while on the i860 they need three). If you
have completely vectorizable code, you can get one add and one multiply per cycle
out of both the RS/6000 and the i860. Note that the startup time for such a
pipelined loop is pretty high on the i860. Anyway, could you please explain
how the i860 should be faster on matrix multiply than the RS/6000, assuming
both run at the same clock rate?
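
The latency argument above can be put into a toy cycle count. This is a
deliberately simplified model, not a description of either chip: it assumes
one floating point unit with a fixed latency (1 cycle for the RS/6000, 3 for
the i860, as stated above), one issue per cycle, and no memory stalls. The
function names are hypothetical.

```python
def dependent_chain_cycles(n_ops, latency):
    """Cycles when each operation must wait for the previous result."""
    return n_ops * latency


def pipelined_cycles(n_ops, latency):
    """Cycles when independent operations are issued one per cycle.

    The first result appears after `latency` cycles (the startup cost);
    each later result follows one cycle after the one before it.
    """
    if n_ops == 0:
        return 0
    return latency + n_ops - 1


for n_ops in (4, 1000):
    for name, latency in (("1-cycle unit", 1), ("3-cycle unit", 3)):
        chain = dependent_chain_cycles(n_ops, latency)
        piped = pipelined_cycles(n_ops, latency)
        print(f"n={n_ops:5d} {name}: dependent {chain:5d}, pipelined {piped:5d}")
```

In this model the 3-cycle unit is three times slower on a dependent chain,
but on a long, fully pipelined loop the two differ only by the startup
cycles, which matter mostly for short vectors.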

Marc-Michael Brandis
Computer Systems Laboratory, ETH-Zentrum (Swiss Federal Institute of Technology)
CH-8092 Zurich, Switzerland
email: brandis@inf.ethz.ch

colin@array.UUCP (Colin Plumb) (05/30/91)

In article <dank.674423728@blacks> dank@blacks.jpl.nasa.gov (Dan Kegel) writes:

>Now that I've compared the speed of Fortran programs, I'd like to compare the
>speed of hand-coded assembly versions of the same thing.
>The Alliant comes with a canned signal processing library which is presumably
>hand-optimized assembly code.

Various i860 library vendors claim the following times for 1024-point
complex FFT's:  741 us, 830 us, 745 us, 1040 us.  I've heard rumours
that someone's got a routine operating at <700 us, but that's on the
edge of credibility based on some analyses I've made.
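
For comparison with the MFLOPS figures quoted elsewhere in this thread, the
times above can be converted to a nominal rate. The sketch assumes the
conventional operation count of 5*N*log2(N) flops for an N-point complex
FFT; that count is a common convention and not something the vendors quoted
here are known to have used. The function name is hypothetical.

```python
import math


def fft_nominal_mflops(n_points, microseconds):
    """Nominal MFLOPS for an n_points complex FFT taking `microseconds`.

    Assumes 5 * N * log2(N) floating point operations, the usual
    convention for radix-2 complex FFTs. Flops per microsecond is
    the same quantity as MFLOPS.
    """
    flops = 5 * n_points * math.log2(n_points)
    return flops / microseconds


# Vendor-claimed times from the post, plus the rumoured 700 us figure.
for microseconds in (741, 830, 745, 1040, 700):
    rate = fft_nominal_mflops(1024, microseconds)
    print(f"{microseconds:5d} us -> {rate:5.1f} MFLOPS (nominal)")
```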
-- 
	-Colin