[comp.arch] what if the i860 had vector registers... ?

gl8f@astsun9.astro.Virginia.EDU (Greg Lindahl) (05/24/91)

After reading a tech report on i860 performance (available from
uvacs.cs.virginia.edu: pub/techreports/ipc...), I was left with one
question:

The i860 has an 8k data cache, and both normal loads and pipelined
loads that bypass the data cache. It seems to be very nasty to write a
compiler that can do the right thing with the particular memory
system that Intel put in the iPSC i860 machine (page-mode DRAM).

So, what if the i860 instead had a small data cache, and pipelined
loads into vector registers? It is relatively easy to design a
compiler that can do a good job with vector registers. And you would
be able to do pipelined loads of data to be re-used. And you wouldn't
end up starved for registers. Did Intel simulate this, or did they
guess? Did they guess wrong?

chased@rbbb.Eng.Sun.COM (David Chase) (05/25/91)

gl8f@astsun9.astro.Virginia.EDU (Greg Lindahl) writes:
>After reading a tech report on i860 performance (available from
>uvacs.cs.virginia.edu: pub/techreports/ipc...), I was left with one
>question:
...
>So, what if the i860 instead had a small data cache, and pipelined
>loads into vector registers? It is relatively easy to design a
>compiler that can do a good job with vector registers. And you would
>be able to do pipelined loads of data to be re-used. And you wouldn't
>end up starved for registers.

Can you be more explicit about what you mean by vector registers?  I
was about to say that using vector registers was about the same as
calling library routines, but then I remembered the matrix
multiplication example -- calling an "inner product" subroutine does
not get the same performance as simultaneously computing three inner
products, which requires that you know something about the algorithm
surrounding the inner product.
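
To make that concrete, here is the distinction in C (all names made
up).  The win comes from reusing each x[i] across three products,
which a subroutine interface hides from the compiler:

    /* One inner product per call: x is reloaded from memory each time. */
    double ddot(int n, const double *x, const double *y)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < n; i++)
            s += x[i] * y[i];
        return s;
    }

    /* Three inner products against the same x in one loop: each x[i]
       is loaded once and used three times.  You can only write this
       if you see the surrounding algorithm, not just ddot(). */
    void ddot3(int n, const double *x, const double *y1,
               const double *y2, const double *y3, double s[3])
    {
        double s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;
        for (i = 0; i < n; i++) {
            double xi = x[i];
            s1 += xi * y1[i];
            s2 += xi * y2[i];
            s3 += xi * y3[i];
        }
        s[0] = s1; s[1] = s2; s[2] = s3;
    }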

One constraint on elementary row operations on the i860 (implemented
in an elementary way -- i.e., without blocking the algorithm within
which the EROs are used) is that the FP unit is capable of
consuming/producing 64 bits of off-chip (single precision) operands
per cycle.  (Actually, the grouping rules preclude attaining this rate;
the best you can do in that style is about 80% of the speed of light.)
I'm using the model:

  offchip[i..n] := offchip[i..n] + factor * onchip[i..n]

(row major, i.e., C/Pascal array layout.)  In a "naive"
implementation of Gaussian elimination (and QR, too, I believe) the
intent is to eliminate the column "i" from a collection of offchip
rows by repeatedly subtracting a multiple of the onchip row from them.
Each cycle, the i860 can complete one "off := off + F * on" operation,
which means that (on average) 32 bits must come on-chip per cycle and
32 bits must be stored per cycle.  This is the "speed of light" for data
on the i860 -- unless you use a wider bus, or a faster clocking rate,
or something, vector registers won't get the data on and off chip any
faster.

Note -- max pipelined load width is 64 bits, max cached load width is
128 bits.  An N-unrolled loop to do the elementary row op has N
pipelined loads and stores, plus N/4 cached loads, plus one branch
instruction.  In double op mode, you pair this with N mpy-and-adds and
N/4 + 1 fnops.  Thus, in N + N/4 + 1 cycles you perform N mpy-and-adds
at best, attaining a speed of N/(N + N/4 + 1) times C -- for N == 16,
this is .76C.
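
(To be explicit: N here counts 64-bit memory instructions -- N/2
pipelined loads and N/2 stores, each moving two single-precision
operands.)  In C, the N == 8 case looks roughly like the following;
the comments give the instruction mix I have in mind, and the
schedule is approximate:

    /* Elementary row op, unrolled 8-wide.  The off[] row streams on
       and off chip; the on[] row sits in the data cache. */
    void ero8(float *off, const float *on, float factor, int n)
    {
        int i;
        for (i = 0; i < n; i += 8) {
            /* core side: 4 pipelined loads (pfld.d), 4 stores (fst.d),
               2 cached quad loads (fld.q), 1 branch = 11 instructions.
               FP side: 8 dual-op mpy-and-adds plus 3 fnops.
               => 8 results in 11 cycles = 8/11 of C. */
            off[i+0] += factor * on[i+0];
            off[i+1] += factor * on[i+1];
            off[i+2] += factor * on[i+2];
            off[i+3] += factor * on[i+3];
            off[i+4] += factor * on[i+4];
            off[i+5] += factor * on[i+5];
            off[i+6] += factor * on[i+6];
            off[i+7] += factor * on[i+7];
        }
    }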

Note, too, that something as simple as a quad-word (128 bit) pipelined
load (i.e., double the width of the off-chip bus) solves the
speed-of-light problem without adding vector registers or operations.
This still leaves the compiler working hard, of course.

David Chase
Sun

gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) (05/29/91)

In article <13990@exodus.Eng.Sun.COM> chased@rbbb.Eng.Sun.COM (David Chase) writes:

>>So, what if the i860 instead had a small data cache, and pipelined
>>loads into vector registers? It is relatively easy to design a
>>compiler that can do a good job with vector registers. And you would
>>be able to do pipelined loads of data to be re-used. And you wouldn't
>>end up starved for registers.
>
>Can you be more explicit about what you mean by vector registers?

I was thinking of something like Cray vector registers, or just
explicitly-managed cache, with pipelined loads into cache instead of
the current non-pipelined loads into cache. The ALU would then be
cache-to-cache instead of register-to-register.

>One constraint on elementary row operations on the i860 (implemented
>in an elementary way -- i.e., without blocking the algorithm within
>which the EROs are used) is that the FP unit is capable of
>consuming/producing 64 bits of off-chip (single precision) operands
>per cycle.

This is what kills the iPSC i860 box -- my code is guaranteed to run
at the rate of the memory system, with no useful caching effects. I
read a lot of data. And the compilers can't even get up to memory
speed: they can't figure out how to optimize the loads and stores very
well, because the problem is so complex and they run out of registers.

>Note -- max pipelined load width is 64 bits, max cached load width is
>128 bits.  An N-unrolled loop to do the elementary row op has N
>pipelined loads and stores, plus N/4 cached loads, plus one branch
>instruction.  In double op mode, you pair this with N mpy-and-adds and
>N/4 + 1 fnops.  Thus, in N + N/4 + 1 cycles you perform N mpy-and-adds
>at best, attaining a speed of N/(N + N/4 + 1) times C -- for N == 16,
>this is .76C.

This analysis is a bit theoretical in my case, as the iPSC memory
system provides (with maximal pipelining) 2-cycle loads if you stay on
the same DRAM page, with a big penalty when you go to a different
page. Even when you block loads until you run out of registers,
you're paying a nasty penalty and complicating your compiler.
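
To make the penalty concrete, here is a toy cost model in C.  All
three constants are placeholders -- Moyer's TR has the real numbers:

    #define PAGE_WORDS 1024  /* placeholder page size, 32-bit words  */
    #define LOAD_HIT   2     /* cycles: pipelined load, page open    */
    #define MISS_EXTRA 6     /* placeholder: extra cycles, page miss */

    /* Cycles to load n words starting at word address addr, charging
       MISS_EXTRA whenever the access leaves the open DRAM page. */
    static long stream_cost(long addr, long n)
    {
        long cycles = 0, page = -1, i;
        for (i = 0; i < n; i++) {
            long p = (addr + i) / PAGE_WORDS;
            cycles += LOAD_HIT + (p != page ? MISS_EXTRA : 0);
            page = p;
        }
        return cycles;
    }

Under this model, alternating x[i], y[i] loads from arrays sitting in
different DRAM pages pay MISS_EXTRA on nearly every access, while
streaming each array whole pays it about once per PAGE_WORDS.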

For example:

do i = 1, 128
   z(i) = a * x(i) + y(i)
enddo

With vector registers you can do a simple pipelined load of all of x
and y while calculating the results, and then write the results back.
3 page-mode-DRAM miss penalties total, and a solved compiler problem.
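
Written out in C, the hypothetical code is something like this (VR1
through VR3 stand for the vector registers; nothing like this exists
on the real chip):

    /* Hypothetical vector-register version.  Each loop would be one
       burst of pipelined loads or stores; each array is walked
       sequentially, so each burst opens its DRAM page once. */
    void vr_saxpy(float *z, float a, const float *x, const float *y)
    {
        float VR1[128], VR2[128], VR3[128];
        int i;
        for (i = 0; i < 128; i++) VR1[i] = x[i];   /* page miss 1 */
        for (i = 0; i < 128; i++) VR2[i] = y[i];   /* page miss 2 */
        for (i = 0; i < 128; i++)                  /* no bus traffic */
            VR3[i] = a * VR1[i] + VR2[i];
        for (i = 0; i < 128; i++) z[i] = VR3[i];   /* page miss 3 */
    }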

With the current scheme you unroll by 4, load 4 x's, load 4 y's taking
a page-miss penalty, while calculating the results, and then write the
results back taking another page miss. That's 3 memory system page
misses per cycle. (Note that I'm not talking about virtual memory page
faults; for details on the memory system don't trust my memory, read
the techreport.)

When you go to more complex expressions I think vector registers win
bigger and bigger, as long as you're looking at vector-type
expressions. For scalar expressions, I think the two methods would
provide about the same speed.

So, to repeat: did Intel simulate this kind of alternative design, and
would it provide better performance with a simpler compiler?

preston@ariel.rice.edu (Preston Briggs) (05/29/91)

gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) writes:

>This is what kills the iPSC i860 box -- my code is guaranteed to run
>at the rate of the memory system, with no useful caching effects.

This problem kills most everything except vector machines (a la Cray).
That's why BLAS-2 and BLAS-3 routines were invented (i.e., to provide
higher-level operations with enough data reuse that the memory
system can keep up with the FP unit).
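
For instance, a blocked (BLAS-3 style) matrix multiply in C.  BS is a
made-up blocking factor; the point is that every operand brought into
the cache gets used about BS times before it leaves:

    #define BS 16   /* placeholder block size; tune to the cache */

    /* c += a*b for n-by-n row-major matrices, n a multiple of BS.
       The current tiles of a and b stay cached across the inner
       loops, so each element loaded is reused ~BS times. */
    void matmul(int n, const double *a, const double *b, double *c)
    {
        int i0, j0, k0, i, j, k;
        for (i0 = 0; i0 < n; i0 += BS)
            for (j0 = 0; j0 < n; j0 += BS)
                for (k0 = 0; k0 < n; k0 += BS)
                    for (i = i0; i < i0 + BS; i++)
                        for (j = j0; j < j0 + BS; j++) {
                            double s = c[i*n + j];
                            for (k = k0; k < k0 + BS; k++)
                                s += a[i*n + k] * b[k*n + j];
                            c[i*n + j] = s;
                        }
    }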

Try saxpy's on an HP or IBM when the data is not in cache.
They aren't going to get close to peak performance either.

--------------------

>For example:
>
>do i = 1, 128
>   z(i) = a * x(i) + y(i)
>enddo

>With the current scheme you unroll by 4, load 4 x's, load 4 y's taking
>a page-miss penalty, while calculating the results, and then write the
>results back taking another page miss. That's 3 memory system page
>misses per cycle.
            ^^^^^  iteration

Generally, you won't be able to use the load/store quad instructions because
it's difficult to be assured of alignment.

It's better to handle the above case by copying x into an aligned
chunk of memory (called VR1, say) using pipelined loads (4 in a row)
and store-quads.  Then copy y into VR2.  Next, calculate the sum
into VR3 (using quad-loads and stores).  Finally, copy VR3 back to z
using quad-loads and singleton stores.

VR3 might be identical with either VR1 or VR2.
Generally, the VRs will end up in cache.

Naturally, we could use subroutines for the copying between
memory and cache/VRs and for the vector add.
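
A C sketch of the whole scheme (the alignment annotation is modern
compiler syntax, there only to mark the intent):

    /* z = a*x + y via cache-resident "vector registers".  VR1/VR2
       are static and quad-aligned, so the 128-bit instructions can be
       used on them even though x, y, z have unknown alignment. */
    static float VR1[128] __attribute__((aligned(16)));
    static float VR2[128] __attribute__((aligned(16)));

    void saxpy128(float *z, float a, const float *x, const float *y)
    {
        int i;
        for (i = 0; i < 128; i++)  /* pipelined loads, store-quads */
            VR1[i] = x[i];
        for (i = 0; i < 128; i++)  /* pipelined loads, store-quads */
            VR2[i] = y[i];
        for (i = 0; i < 128; i++)  /* quad loads/stores, in cache; */
            VR1[i] = a * VR1[i] + VR2[i];  /* VR3 identical to VR1 */
        for (i = 0; i < 128; i++)  /* quad loads, singleton stores */
            z[i] = VR1[i];
    }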

------------

Of course, Intel thought of all this.
They do it, using a traditional vectorizing front-end.

The problem is that there's still insufficient memory bandwidth
to support vector operations at 80 MFlops.  That kind of performance
requires hefty $$s.
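
The arithmetic, assuming the 40 MHz part with its 64-bit external bus
(round numbers):

    80 MFlops single precision  = 40 M mpy-adds/sec = 40 M elements/sec
    saxpy traffic per element   = 4 bytes x + 4 y + 4 z  = 12 bytes
    bandwidth required          = 40 M/sec * 12 bytes    = 480 MByte/sec
    bus peak (64 bits @ 40 MHz) =                          320 MByte/sec

That's short by a third before counting page misses or refresh.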

------------

Note also that the "page misses" mentioned above and in Moyer's TR
are on a particular board, built around 8 MBytes of dynamic RAM.
Other boards have been built with static RAM ($$'s) that don't have
the problem of crossing page boundaries.

-------------

I think the difficulties with the i860 lie more in generating code
for the exposed pipelines and handling its unhelpful multiply-add
instruction.  The bandwidth questions are going to afflict all
the micros for a while.

Preston Briggs