[comp.arch] i860 and compilers

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (03/21/89)

In all this talk about the Intel 860, there's been very little mention
of compilers.

It seems clear that pipeline mode is beyond the compiler technology
that Intel is bringing to bear. I expect them to ship a library of
handcoded routines, which do dot-product and the like. At some point,
the compilers will acquire directives ( "do a dot product" ) and/or
pattern matching ( "gosh, that looks like code for a dot product" ).
The compiler will then inline the library handcode, with appropriate
parameterization. 
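
For concreteness, the kind of loop the matcher would have to spot.
Strictly a sketch - lib_sdot() is a made-up name standing in for one
of those handcoded routines:

    /* The kind of source loop a pattern matcher would have to
       recognize.  lib_sdot() is a hypothetical stand-in for one of
       the handcoded pipelined-mode library routines. */
    float dot(const float *a, const float *b, int n)
    {
        float s = 0.0f;
        int i;
        for (i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;       /* candidate for: return lib_sdot(a, b, n); */
    }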

Part of the problem is the (poor) idea of having to catch the hot,
smoking data as it flies out of the pipe.  Self-draining pipelines would
have been easier on both the compilers and the interrupt handling.  The
pipeline mode's startup and shutdown transients, which trash things,
are also undesirable. 
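
To make the objection concrete, here is a toy model (plain C, nothing
like actual i860 code) of a 3-stage exposed pipe.  Note the garbage
you must discard at startup and the dummy operations you must issue
at the end to drain it:

    /* Toy model of an exposed 3-stage adder pipe: each advance()
       pushes one operation in and hands back whatever falls out the
       far end, i.e. the result of the op issued 3 calls earlier.
       The first 3 results are startup garbage, and 3 dummy ops are
       needed at the end to flush the real results out. */
    #include <stdio.h>

    #define STAGES 3
    static double pipe[STAGES];

    double advance(double a, double b)
    {
        int i;
        double out = pipe[STAGES - 1];   /* the "hot, smoking data" */
        for (i = STAGES - 1; i > 0; i--)
            pipe[i] = pipe[i - 1];
        pipe[0] = a + b;                 /* emerges 3 advances later */
        return out;
    }

    int main(void)
    {
        double in[5] = {1, 2, 3, 4, 5}, out[5];
        int issued = 0, caught = 0;
        while (caught < 5) {
            double r = advance(issued < 5 ? in[issued] : 0.0, 10.0);
            issued++;
            if (issued > STAGES)         /* skip the startup transient */
                out[caught++] = r;
        }
        for (issued = 0; issued < 5; issued++)
            printf("%g\n", out[issued]); /* 11 12 13 14 15 */
        return 0;
    }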

The data-chaining features sound intractable.  Perhaps another
dual-instruction mode would have been better - one instruction to the
FADD unit, the other to FMUL.

The existing dual instruction mode is OK, because compilers can largely
ignore it. A postprocessor (or an optimizing assembler) can do all the
code scheduling. It can do a good job, too, if the compiler emits
enough hints.
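
Something like this, at the source level (a sketch, not i860
assembly) - the same computation before and after the scheduler moves
independent work into the shadow of a long-latency operation:

    /* Both functions compute the same thing; in the second, the
       independent integer update has been hoisted ahead of the
       long-latency FP divide so it issues in the divide's shadow.
       Dependence hints from the compiler make this reordering safe
       for a postprocessor to do mechanically. */
    double unscheduled(double x, double y, int *cnt)
    {
        double q = x / y;   /* long-latency operation                 */
        *cnt += 1;          /* waits behind the divide in issue order */
        return q * 2.0;
    }

    double scheduled(double x, double y, int *cnt)
    {
        *cnt += 1;          /* independent work: issue it first       */
        double q = x / y;   /* divide latency now overlaps the add    */
        return q * 2.0;
    }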

Perhaps Intel should have dropped the floating point unit, and put in a
second integer ALU instead.  It would have goosed the Dhrystones, no?

-- 
Don		D.C.Lindsay 	Carnegie Mellon School of Computer Science

mccalpin@loligo.uucp (John McCalpin) (03/21/89)

In article <4524@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>It seems clear that pipeline mode is beyond the compiler technology
>that Intel is bringing to bear. I expect them to ship a library of
>handcoded routines, which do dot-product and the like. At some point,
>the compilers will acquire directives ( "do a dot product" ) and/or
>pattern matching ( "gosh, that looks like code for a dot product" ).
>The compiler will then inline the library handcode, with appropriate
>parameterization. 

In other words, you expect them to repeat the last 12 years of
supercomputer compiler/vectorizer development :-) ....
Of course, source-to-source vectorizers exist already - I expect that
the folks at Pacific Sierra or Kuck and Associates would be happy to 
sell their products to Intel (for appropriate remuneration).
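
What those tools do, in miniature (hypothetical C for illustration -
VAST and KAP actually work on Fortran):

    /* Source-to-source vectorization in miniature: the scalar loop
       is strip-mined so that each inner strip maps onto one fixed-
       length vector operation (VL = vector register length). */
    #define VL 64      /* e.g. one Cray vector register */

    void scale_scalar(double *y, const double *x, double a, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i];
    }

    void scale_stripmined(double *y, const double *x, double a, int n)
    {
        int i, j, len;
        for (i = 0; i < n; i += VL) {
            len = (n - i < VL) ? n - i : VL;
            for (j = 0; j < len; j++)    /* one vector op per strip */
                y[i + j] = a * x[i + j];
        }
    }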

But vectorization is only part of the problem - see below....

>Part of the problem is the (poor) idea of having to catch the hot,
>smoking data as it flies out of the pipe.  Self-draining pipelines would
>have been easier on both the compilers and the interrupt handling.  The
>pipeline mode's startup and shutdown transients, which trash things,
>are also undesirable. 

The real problem with the i860 (and the other micro-based vector
processors) is memory bandwidth.  The Cray-1 is an excellent example of
what happens when the pipe cannot be kept full.  A dyadic vector
operation needs 2 input elements per clock and generates one output
element, but the Cray-1's memory channel can transfer only one word per
cycle.  The Cray-1 has 8 vector registers of 64 elements each - 512
64-bit words in all - which can be considered as a (very) small cache.
Only very special codes have enough locality to get good performance
on this machine.  On the X/MP, the memory bandwidth was upped to 3
words/clock and all of a sudden, *lots* of code was running at a good
fraction of full speed.

CDC/ETA is now hitting the same trouble on the ETA-10, only the "cache"
in this case is the 32 MB local CPU memory.  This (SRAM) "cache" is
connected to the 512+ MB (DRAM) "shared memory" by a 1 word/cycle
interface, so code that overflows the cache has a noticeable performance
hit.  There are real live applications that suffer from this bottleneck.
(The same comments apply to the Cray SSD).

The (ridiculously small) data cache on the i860 makes absolutely no sense
on a vector processor. 

Of course I don't expect an i860 system to have more than 32 MB of
cache (!), but the experience in the vector-computing world is that
memory bandwidth limitations are always nearby, and cannot always be
beaten with compiler technology alone.  Just think about the cost of a
memory subsystem that can feed a 100 MFLOPS vector unit
(100 million * 4 bytes/word * (2 in + 1 out) = 1.2 GB/second) out of main
memory!
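
Spelled out (single-precision 4-byte words assumed, as above):

    /* The bandwidth arithmetic above, spelled out.  A dyadic vector
       op streams 2 operands in and 1 result out, so a unit retiring
       100 M ops/sec needs 3 words per op of memory traffic. */
    #include <stdio.h>

    int main(void)
    {
        double rate  = 100e6;   /* 100 MFLOPS vector unit    */
        double bytes = 4.0;     /* per 32-bit word           */
        double words = 2 + 1;   /* 2 in, 1 out per operation */
        printf("%.2g bytes/sec\n", rate * bytes * words);
        /* prints 1.2e+09: 1.2 GB/second out of main memory */
        return 0;
    }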

>The data-chaining features sound intractable.  Perhaps another
>dual-instruction mode would have been better - one instruction to the
>FADD unit, the other to FMUL.

Why is chaining intractable?  Cray and CDC have been doing it for years.
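
For readers who haven't met it: chaining forwards each element coming
out of one pipe straight into the next, so d = a*b + c runs as one
overlapped stream.  A scalar C rendering of the data flow (the real
machines overlap these operations in time, not in program order):

    /* Chaining's data flow: each element leaving the multiply pipe
       is forwarded directly into the add pipe, so the add starts
       long before the multiply has processed its whole vector. */
    void chained_madd(double *d, const double *a, const double *b,
                      const double *c, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            double m = a[i] * b[i];  /* leaves the multiply pipe...    */
            d[i] = m + c[i];         /* ...enters the add pipe at once */
        }
    }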

>Perhaps Intel should have dropped the floating point unit, and put in a
>second integer ALU instead.  It would have goosed the Dhrystones, no?

If the chip had been designed as a Dhrystone engine, the caches would
be large enough to be useful, too.

>Don		D.C.Lindsay 	Carnegie Mellon School of Computer Science
----------------------   John D. McCalpin   ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu		mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------

bcase@cup.portal.com (Brian bcase Case) (03/22/89)

>Part of the problem is the (poor) idea of having to catch the hot,
>smoking data as it flies out of the pipe.  Self-draining pipelines would
>have been easier on both the compilers and the interrupt handling.  The
>pipeline mode's startup and shutdown transients, which trash things,
>are also undesirable. 

Yes, I thought the idea of having to catch "the hot, smoking data" (I
*love* that characterization!) was a dumb one too, until I realized
(i.e. was told) why it is done that way:  This way, the FP register
file is much more useful.  With "normal" interlocking, a full FP pipe
would lock nearly *all* the FP registers.  With the way the i860 does
it, the registers that will be destinations can be used while the
results that will eventually go there are being computed.  This makes
sense, assuming that one can have only 32 FP registers (16 dp registers).
That might be where the mistake was made....
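
A back-of-envelope version of the argument (the pipe depths here are
my assumption, purely for illustration):

    /* With scoreboard-style interlocks, every in-flight result ties
       up its destination register until it retires.  With both FP
       pipes full, that's a big slice of a 16-register (dp) file; the
       i860's catch-it-yourself scheme leaves those registers usable
       in the meantime.  Pipe depths below are assumed, not the
       chip's actual figures. */
    #include <stdio.h>

    int main(void)
    {
        int dp_regs   = 16;  /* 32 sp registers = 16 dp registers */
        int add_depth = 3;   /* assumed adder pipe depth          */
        int mul_depth = 3;   /* assumed multiplier pipe depth     */
        int locked    = add_depth + mul_depth;
        printf("locked: %d of %d regs (%.0f%%)\n",
               locked, dp_regs, 100.0 * locked / dp_regs);
        return 0;
    }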

>The data-chaining features sound intractable.  Perhaps another
>dual-instruction mode would have been better - one instruction to the
>FADD unit, the other to FMUL.

Yep, it's a bitch.  Again, the constraint was probably somewhat
related to implementation technology.  Again, I think they should
have given up 4K bytes of the data cache in order to permit them to
implement a cleaner *architecture* instead of sacrificing architecture
to *implementation*.  But then, what do I know.

>Perhaps Intel should have dropped the floating point unit, and put in a
>second integer ALU instead.  It would have goosed the Dhrystones, no?

I love it!  *Now* you're talking!  I mean, what's more important,
FP or having the longest Dhrystone histogram bar in your marketing
"literature"! :-) :-) :-)