[comp.arch] i860 floating-point

mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (08/03/89)

Well, I haven't heard anything about the i860 lately on this group, so
maybe I should start things up again....

I have been re-reading various postings about the i860 floating-point unit
and my current impression is that it is seriously bizarre.

The pipelined floating-point instructions are definitely not vector
instructions, since one instruction is required for every operation.
(Some instructions use both the adder and the multiplier, but a
separate instruction is still required for each input data pair.)

So it looks like optimized fp code would be something like the code below -
apologies to those who know what the instruction format _really_ is. :-)
All the src? and dest? are register addresses, pointing to the 16 64-bit
floating-point registers.

	f_mul	src1, src2, dest0	% begin loading pipe
	f_mul	src3, src4, dest0	% load second stage of pipe
	pf_mul	src5, src6, dest1	% specifies dest for 1st multiply!!!
	pf_mul	src7, src8, dest2	% specifies dest for 2nd multiply
	pf_mul	dum0, dum0, dest3	% begin unloading pipe
	pf_mul	dum0, dum0, dest4	% end unloading pipe

At this point I am finished, as I have used 8 source registers and 4
destination registers.  Three more registers are available to store
other scalars and such, but since loads are by 128-bit quantities, it
is convenient to work with an even number of elements. (Note that only
15 registers can be used, since fp0 is a fixed floating-point zero.)
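
(For intuition, here is a rough C model of that "delayed destination"
behavior.  This is only a sketch -- the pipeline depth is left as a
named constant, since the true depth gets pinned down later in the
thread, and all the names are invented.)

	/* Each pipelined multiply pushes a new product into the pipe
	 * and pops the oldest one out into the named destination, so
	 * the destination written in an instruction receives the result
	 * of an operation issued DEPTH instructions earlier.
	 */
	#define DEPTH 3			/* stand-in for the real depth */

	static double pipe[DEPTH];	/* products in flight */

	void pfmul(double src1, double src2, double *dest)
	{
		int i;
		*dest = pipe[DEPTH-1];		/* oldest product retires */
		for (i = DEPTH-1; i > 0; i--)
			pipe[i] = pipe[i-1];	/* the rest advance a stage */
		pipe[0] = src1 * src2;		/* new product enters */
	}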

So I have done 4 floating-point operations in 6 cycles, and now I have
to store the results to memory and grab some new sources from memory.
I can load/store in pipelined mode, too, with two 64-bit operands
transferred every cycle, with a 3-cycle pipeline length.  So I would
need two store instructions (4 cycles), and four load instructions (6
cycles) to re-fill the registers that I used.

Overall, then, it takes something like 16 cycles to deliver 4 64-bit
floating-point operations, assuming no data or instruction cache misses.
To get this level of performance requires loop unrolling to a depth of
4 for this simple operation.  More complicated loops may not be
unrollable to even this depth, since only 16 registers are available.
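
(For scale: 4 flops in 16 cycles is 0.25 flops/cycle; at the 40 MHz
clock used for illustration later in the thread, that is about 10
MFLOPS.)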

Lots of questions:

(1) Am I missing something obvious?

(2) Can more things be overlapped than this?

(3) The programmer's manual refers to instructions that use both the
    adder and the multiplier, but most of these look like accumulate
    functions (e.g. dot product).  Is there a "linked triad" instruction
    which takes 3 operands and does x = y + q*z, where 4 registers are
    used?

Comments welcome....



--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu - mccalpin@nu.cs.fsu.edu
		   mccalpin@delocn.udel.edu

chase@Ozona.orc.olivetti.com (David Chase) (08/04/89)

In article <MCCALPIN.89Aug3124057@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>Well, I haven't heard anything about the i860 lately on this group, so
>maybe I should start things up again....

>I have been re-reading various postings about the i860 floating-point unit
>and my current impression is that it is seriously bizarre.

>Lots of questions:

>(1) Am I missing something obvious?
>(2) Can more things be overlapped than this?
>(3) The programmer's manual refers to instructions that use both the
>    adder and the multiplier, but most of these look like accumulate
>    functions (e.g. dot product).  Is there a "linked triad" instruction
>    which takes 3 operands and does x = y + q*z, where 4 registers are
>    used?

(1) I would say, no, you are not missing anything obvious.
(2) Yes, quite a bit.
(3) Sort of.

I bet you'd like an explanation of that, wouldn't you?  The best way
to make the chip scream is to get it into dual-instruction mode,
issuing one dual-operation floating point instruction and one core
instruction at each clock tick.  This is how you get the 120 MOPS
figure for a 40MHz chip (I have no idea what clock rate is actually
being shipped, if any, but 40 is a nice round number).  Anyhow, to do
this, and actually be doing something useful, you have to (1) come up
with some computation that can actually make use of the many strange
ways that the adder and multiplier pipes can be hooked up and (2)
figure out how to amortize the load-store traffic of each floating
point instruction onto the ONE core instruction paired with it.
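
(The arithmetic behind the 120 MOPS figure: one dual-operation
floating point instruction retires 2 flops, and the core instruction
paired with it makes 3 operations per clock, so 40 MHz x 3 = 120 MOPS.
The floating point part alone gives 40 MHz x 2 = 80 MFLOPS, the "full
speed" that comes up below.)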

There are three problems that illustrate this, sort of (assume
row-major storage, i.e., not Fortran, in the examples that follow):

(1) row-row dot product
    inner loop of A * B-transpose
    (This example appears in the i860 PRM in several forms)

(2) row-column dot product
    inner loop of A * B

(3) elementary row operations
    Gaussian elimination, among other things

For a row-row dot product, the inner loop looks like (in LaTeX):

d.ml2apm.ss  f4, f12, f0 & fld.q  16(r29)++, f8     & Load $A$ \\
d.ml2apm.ss  f5, f13, f0 & pfld.d 8(r24)++, f16     & Load $B$ \\
d.ml2apm.ss  f6, f14, f0 & pfld.d 8(r24)++, f18     & Load $B$ \\
d.ml2apm.ss  f7, f15, f0 & fld.q  16(r29)++, f4     & Load $A$ \\
d.ml2apm.ss  f8, f16, f0 & nop                      &          \\
d.ml2apm.ss  f9, f17, f0 & pfld.d 8(r24)++, f12     & Load $B$ \\
d.ml2apm.ss f10, f18, f0 & bla r27,r28, inner\_loop &          \\
d.ml2apm.ss f11, f19, f0 & pfld.d 8(r24)++, f14     & Load $B$

The idea is that the row of matrix A is stored in the cache, but the
row of matrix B is not.  The loop is unrolled 8 times in order to get
the required separation (for top speed) between the instructions which
load the registers and the instructions which use them.  The 8
floating point instructions require 16 input operands, and these must
somehow be loaded in 7 core unit instructions.  This is achieved by
using quad-word loads from cache (8/4 = 2) and double-word pipeline
loads (8/2 = 4).  I have omitted the gory details of start-up code,
and have not bothered to detail which load puts which result where,
and when.
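
For those who don't read i860 assembler, here is roughly the same
computation in C.  All names are invented, and the three partial sums
stand in for the three values circulating in the adder pipe (explained
just below); this is a sketch of the schedule's effect, not of its
timing.

	float row_dot(const float *a, const float *b, int n)
	{
		float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f;	/* "adder pipe" */
		int i;
		for (i = 0; i + 8 <= n; i += 8) {	/* unrolled by 8 */
			s0 += a[i+0]*b[i+0];  s1 += a[i+1]*b[i+1];
			s2 += a[i+2]*b[i+2];  s0 += a[i+3]*b[i+3];
			s1 += a[i+4]*b[i+4];  s2 += a[i+5]*b[i+5];
			s0 += a[i+6]*b[i+6];  s1 += a[i+7]*b[i+7];
		}
		for (; i < n; i++)			/* leftovers */
			s0 += a[i]*b[i];
		return s0 + s1 + s2;			/* drain the pipe */
	}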

One detail that should not be omitted is the floating point
instruction used here.  "ml2apm.ss" advances the (3-stage) multiplier
pipeline one step and advances the (3-stage) adder pipeline one step.
The inputs to the multiplier are the two register operands; the inputs
to the adder are the output of the multiplier and the output of the
adder.  In all the instructions above the output register is f0,
meaning "discard".  (Those of you who care about re-association of
your floating point arithmetic should be a little worried by now.)
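
(A minimal C sketch of one such step, assuming exactly the hookup just
described: both pipes advance, the multiplier takes the two register
operands, and the adder takes the multiplier's output together with
its own.  Invented names, and no attempt at cycle accuracy.)

	static float mpipe[3], apipe[3];	/* 3-stage pipes */

	float m12apm_step(float src1, float src2)
	{
		float mout = mpipe[2];		/* multiplier result retiring */
		float aout = apipe[2];		/* adder result retiring */
		mpipe[2] = mpipe[1]; mpipe[1] = mpipe[0];
		mpipe[0] = src1 * src2;		/* operands feed the multiplier */
		apipe[2] = apipe[1]; apipe[1] = apipe[0];
		apipe[0] = mout + aout;		/* mult out + adder out accumulate */
		return aout;			/* written to rdest; f0 = discard */
	}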

Anyhow, the chip runs at full speed once the row of A is in the cache.
Since only that row is cached, it can be pretty large.
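(With an 8 Kbyte data cache -- the size usually quoted for the i860 --
"pretty large" means a row of some 2048 single-precision elements.)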


(2) is much more subtle.  The problem is that however you unroll that
loop, the elements of the column of B are not stored consecutively,
and thus must be loaded one-at-a-time.  For an unrolling by N, this
means that N (load B) + N/4 (load A) + 1 (branch) core instructions
are required for N floating point instructions.  That is, if N = 8
then the chip can only operate at 8/11 (72%) of potential MFlops.

However, matrix multiplication (of the non-Strassen sort) is
O(N-cubed), and matrix transposition is only O(N-squared).
Asymptotically, you can win that way.

However however, there is a way to do it in place.  Observe that so
far we have been multiplying one row of A by one column of B, and we
are choking on the instructions to load B.  Obviously, all we need to
do is reuse those loads a few times.  We can do this by multiplying 3
rows of A by one column of B.  Why 3?  The adder has three stages, and
when running as an accumulator we can treat those as registers.  It is
claimed that the following code will do the trick:

d.ml2apm.ss f6,f16,f0  & nop                       & \\
d.ml2apm.ss f10,f16,f0 & nop                       & \\
d.ml2apm.ss f14,f16,f0 & pfld.s r20(r19)++, f18    & \\
\\
d.ml2apm.ss f7,f17,f0  & fld.q  16(r16)++, f4      & \\
d.ml2apm.ss f11,f17,f0 & fld.q  16(r17)++, f8      & \\
d.ml2apm.ss f15,f17,f0 & fld.q  16(r18)++, f12     & \\
\\
d.ml2apm.ss f4,f18,f0  & pfld.s r20(r19)++, f19    & \\
d.ml2apm.ss f8,f18,f0  & nop                       & \\
d.ml2apm.ss f12,f18,f0 & pfld.s r20(r19)++, f16    & \\
\\
d.ml2apm.ss f5,f19,f0  & nop                       & \\
d.ml2apm.ss f9,f19,f0  & bla r21, r22, inner\_loop & \\
d.ml2apm.ss f13,f19,f0 & pfld.s r20(r19)++, f17    & \\

In this case, the floating point unit is actually computing three
inner products simultaneously.  Doing this also means that the
additions in the inner product are not reordered, which is a small win.

This might help a little.  Each iteration of the loop will multiply
these registers, loaded from A

   f6 &  f7 &  f4 &  f5
  f10 & f11 &  f8 &  f9
  f14 & f15 & f12 & f13

by these registers, loaded from B

  f16
  f17
  f18
  f19

updating three quantities circulating around in the adder pipeline.
At the end of the loop, the pipelines look like this:

|--------Multiplier----------|-------------Adder----------------|

                                 +------------------------------+
                                 V                              |
f13*f19 -> f9*f19 -> f5*f19 ---> f12*f18 --> f8*f18 --> f4*f18 -+-> f0
                                +f15*f17    +f11*f17   +f7*f17
                                +f14*f16    +f10*f16   +f6*f16
                                +etc        +etc       +etc
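
In C the same trick looks mundane: three rows of A against one column
of B, so each expensive strided load of a B element is reused three
times.  The names and the leading-dimension argument below are
invented for the sketch.

	void three_dots(const float *a0, const float *a1, const float *a2,
			const float *b, int ldb, int n,
			float *d0, float *d1, float *d2)
	{
		float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f;
		int k;
		for (k = 0; k < n; k++) {
			float bk = b[k*ldb];	/* one strided load of B... */
			s0 += a0[k]*bk;		/* ...amortized over three */
			s1 += a1[k]*bk;		/* rows of A */
			s2 += a2[k]*bk;
		}
		*d0 = s0; *d1 = s1; *d2 = s2;
	}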

Obviously, this is not obvious, and all the edge conditions are enough
to make you tear your hair.  Also note that those delightful load
operations prefer that their operands be aligned (at least, that's how
I understood it).  Anyhow, once you cross all the i's and dot all the
t's, you should be able to multiply matrices in the "usual way" at
full speed, i.e. 80 MFlops for a 40MHz chip.

I'll post more later, if people are interested.  Dinner and a movie
are waiting.  Problem (3) is not so easy, because of bandwidth
constraints (more operands), but I haven't tried to rearrange the
algorithms yet.

David

preston@titan.rice.edu (Preston Briggs) (08/05/89)

In article <MCCALPIN.89Aug3124057@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>The pipelined floating-point instructions are definitely not vector
>instructions, since one instruction is required for every operation.

    However, vectorizing compiler technology can be readily (has been, in fact)
    adapted to make good code for the i860.  They treat the on-chip
    cache as vector registers of reasonable length, and use tight,
    highly pipelined loops to operate in these "registers."
    The scheme is actually more flexible than using "real" vector registers
    since they can do arbitrarily complex operations,
    not just simple adds or multiplies.
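
    A sketch of the idea, with a made-up strip length: strip-mine the
    loop so each strip fits in the on-chip cache, then run a tight,
    pipelined kernel over the strip as if it were a long vector
    register.

	#define VL 256	/* "vector register" length; cache-sized, invented */

	void vadd(float *x, const float *y, const float *z, int n)
	{
		int i, j, len;
		for (i = 0; i < n; i += VL) {	/* one strip at a time */
			len = (n - i < VL) ? (n - i) : VL;
			for (j = 0; j < len; j++)	/* tight inner kernel */
				x[i+j] = y[i+j] + z[i+j];
		}
	}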

>So it looks like optimized fp code would be something like the code below -
>All the src? and dest? are register addresses, pointing to the 16 64-bit
>floating-point registers.

    Actually, there are 32 32-bit FP registers.  f0 and f1 are reserved.
    They can be treated as 16 64-bit registers if desired.

>	f_mul	src1, src2, dest0	% begin loading pipe
>	f_mul	src3, src4, dest0	% load second stage of pipe
>	pf_mul	src5, src6, dest1	% specifies dest for 1st multiply!!!
>	pf_mul	src7, src8, dest2	% specifies dest for 2nd multiply
>	pf_mul	dum0, dum0, dest3	% begin unloading pipe
>	pf_mul	dum0, dum0, dest4	% end unloading pipe

    You're close.  There's actually another stage in the pipeline.
    More accurate is
	pfmul.ss	f2,f3,f0	start f2*f3 (single precision)
	pfmul.ss	f4,f5,f0	dump garbage into f0
	pfmul.ss	f6,f7,f0
	pfmul.ss	f8,f9,f2	f2 gets result of f2*f3
	...

>At this point I am finished, as I have used 8 source registers and 4
>destination registers.  Three more registers are available to store
>other scalars and such, but since loads are by 128-bit quantities, it
>is convenient to work with an even number of elements. (Note that only
>15 registers can be used, since fp0 is a fixed floating-point zero.)

>So I have done 4 floating-point operations in 6 cycles, and now I have
>to store the results to memory and grab some new sources from memory.
>I can load/store in pipelined mode, too, with two 64-bit operands
>transferred every cycle, with a 3-cycle pipeline length.  So I would
>need two store instructions (4 cycles), and four load instructions (6
>cycles) to re-fill the registers that I used.

>Overall, then, it takes something like 16 cycles to deliver 4 64-bit
>floating-point operations, assuming no data or instruction cache misses.
>To get this level of performance requires loop unrolling to a depth of
>4 for this simple operation.  More complicated loops may not be
>unrollable to even this depth, since only 16 registers are available.

>Lots of questions:
>(1) Am I missing something obvious?
>(2) Can more things be overlapped than this?
>(3) The programmer's manual refers to instructions that use both the
>    adder and the multiplier, but most of these look like accumulate
>    functions (e.g. dot product).  Is there a "linked triad" instruction
>    which takes 3 operands and does x = y + q*z, where 4 registers are
>    used?
>John D. McCalpin - mccalpin@masig1.ocean.fsu.edu - mccalpin@nu.cs.fsu.edu
>		   mccalpin@delocn.udel.edu


Basically, more stuff can be overlapped.

1st, there is the "dual instruction mode".
This allows you to simultaneously execute an integer ("core") instruction and
a floating-point instruction.  Core instructions include loading and
storing FP registers.  So, we can overlap loading and storing
with actual arithmetic.  The i860 also has an auto-increment addressing
mode for zapping quickly through vectors.

2nd, there are the "dual operation instructions".
These are the ones that allow simultaneous addition and
multiplication.  Unfortunately, it isn't possible to be completely
general when specifying the sources and destinations of the adder
and multiplier.  However, it is possible to use both the
dual instruction mode and the dual operation instructions
at the same time.

3rd and so on, 
you can get more use out of the registers
than you have suggested.  After the 1st instruction (in my example),
it would be possible to reuse registers f2 and f3 immediately.

It's also possible to use techniques like "software pipelining"
or "perfect pipelining" to help keep the pipelines full during loops.

For matrix multiply, it looks possible to achieve a rate of 1 FLOP/cycle
without using the dual-operation instructions.  Using
the dual-operation instruction, it should be possible to approach
5/3 FLOP/cycle.  This includes memory latency, and is perhaps
achievable using automatic techniques.

All this will be fairly difficult for a simple compiler to use.
However, by using vectorizing front-ends, they can take
advantage of hand-built subroutines implementing many
common operations.  In addition, a really aggressive
compiler will have lots of opportunities for optimization.

I think that a fairly complex chip like the i860 will
accentuate the difference between the best dependence-based
optimizing compilers and the comparatively simple compilers
common today.  Instead of a 30% improvement, or even a factor
of 2 improvement, over a PCC-like compiler, we'll see a 
fairly sizeable integer factor.

Optimistically yours,
Preston Briggs

chase@Ozona.orc.olivetti.com (David Chase) (08/05/89)

Continuing a previous follow-up:

Note -- the instruction written as "ml2apm" in the previous posting is
probably really called "m12apm".  If you're reading this in a true
typewriter font, you won't see the difference.

>> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>> Well, I haven't heard anything about the i860 lately on this group, so
>> maybe I should start things up again....

>>(3) The programmer's manual refers to instructions that use both the
>>    adder and the multiplier, but most of these look like accumulate
>>    functions (e.g. dot product).  Is there a "linked triad" instruction
>>    which takes 3 operands and does x = y + q*z, where 4 registers are
>>    used?

>There are three problems that illustrate this, sort of (assume
>row-major storage, i.e., not Fortran, in the examples that follow)

[1 & 2 discussed in previous post]

(3) elementary row operations
    Gaussian elimination, among other things

Elementary row operations are not as easy to get going fast as inner
product operations, because each instruction has (as written) two
input operands and one output operand.  (Actually, there are three
input operands, but we're about to answer John McCalpin's question
above).  Typically, also, it will happen that one of the input
operands will be cached, but the other input operand and the output
operand will not be cached (i.e., will be loaded and stored by the
pipelined load/store operations).  The difficulty with this is that
pipelined load/store operands have a maximum operand size of 64 bits
(consider the width of the bus to memory, and you'll see why), that
is, at most two single-precision operands per load.  Working in this
style with a loop unrolled 8 times will
give at best 8/11 of peak floating-point speed.

A digression -- typically, the repeated part of a row operation is of
the form:

     Row1[k] := Row1[k] + factor * Row2[k]

for increasing k.

That is, there are four operands, but one of them doesn't change.  The
i860 FPU accommodates this by having a number of register-like things
in the FPU named KR, KI, and T.  KR and KI can act as inputs to the
multiplier, and T can buffer a result from the multiplier before it is
fed to the adder.  That is, m12apm feeds the result of the multiplier
directly to the adder (with the result of the adder used for the other
adder input), while m12ttpa inserts T between the multiplier and the
adder.  All in all, I count 32 different ways to hook up the
multiplier and the adder.  In addition, the adder can also subtract,
making for a grand total of 64 different dual-operation pipelined FPU
instructions.  For the example above, either r2p1 or i2p1 might do the
trick.
(End digression)
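
The digression's repeated operation, written out in C: this is the
"linked triad" shape from John McCalpin's question, and the point of
KR, KI, and T is that "factor" can sit in a K register while Row2
streams through the multiplier.

	void row_update(float *row1, const float *row2, float factor, int n)
	{
		int k;
		for (k = 0; k < n; k++)
			row1[k] = row1[k] + factor * row2[k];
	}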

Anyhow, row operations don't go at full speed if the updated row is
stored in memory and the operands are single precision.  Things get
even slower in double precision, since even more instructions are
necessary to load the operands.

I'm still thinking about how this might be sped up.  It might be
possible to double- or triple-up the row operations (they're usually
part of a larger algorithm, like QR factorization or Gaussian
elimination), but I haven't thought it through yet.

David