[comp.arch] i860 registers, follow up

rstewart@megatek.UUCP (Rich Stewart) (06/13/90)

I guess I should add a bit more info to my previous posting.
I am interested in the chip from the point of view of
the best software performance I can get in critical routines.
I still have not found a compiler that does a good job of this, so
here is why I feel separating the register sets leads to slower
software:

On the i860 you have to do an integer to float register move in order
to do an integer multiply, and then you have to move it back to do
any other integer ops on the result. (Integer multiplies take place in
the fpu)

Also the floating point stores and loads can work on 64 bit aligned
words in the same amount of time as 32 bits, but you have got to
move all of those integer results back into the floating point registers
to take advantage of this. 
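
For what it's worth, the idea of moving two 32-bit results per 64-bit
access can be sketched in portable C. This is only an illustration, not
i860 code: the name copy_pairs is made up for the example, and whether
a compiler really emits one 64-bit load and store per pair depends on
the target and the optimizer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n 32-bit integers, moving two per step through one 64-bit
 * temporary. memcpy keeps this free of alignment and aliasing
 * problems; it is the compiler's job to turn each pair into a
 * single 64-bit access where the hardware allows it. */
void copy_pairs(int32_t *dst, const int32_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        uint64_t pair;
        memcpy(&pair, &src[i], sizeof pair);   /* one 64-bit read  */
        memcpy(&dst[i], &pair, sizeof pair);   /* one 64-bit write */
    }
    if (i < n)                  /* odd element left over */
        dst[i] = src[i];
}
```

Calling copy_pairs(dst, src, 5) moves elements 0..3 as two pairs and
element 4 on its own.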

ssr@taylor.Princeton.EDU (Steve S. Roy) (06/16/90)

In article <501@tau.megatek.uucp> rstewart@megatek.UUCP (Rich Stewart) writes:
>I guess I should add a bit more info to my previous posting.
>I am interested in the chip from the point of view of
>the best software performance I can get in critical routines.
>I still have not found a compiler that does a good job of this, so
>here is why I feel separating the register sets leads to slower
>software:
>
>On the i860 you have to do an integer to float register move in order
>to do an integer multiply, and then you have to move it back to do
>any other integer ops on the result. (Integer multiplies take place in
>the fpu)
>
>Also the floating point stores and loads can work on 64 bit aligned
>words in the same amount of time as 32 bits, but you have got to
>move all of those integer results back into the floating point registers
>to take advantage of this. 


Well, I think that separating the integer and floating point registers
isn't what makes integer multiplies slow, and it isn't exactly what
makes the i860 difficult to write a compiler for.

The main reason there are different integer and floating point
registers is that the integer and floating point units are very
separate on this chip.  They run almost independently, and one can
freeze and let the other continue.  You can have them run completely
in parallel.  This is a major source of the speed of the chip, since you
can overlap reading or writing to memory with computation.

There's no real intrinsic reason they couldn't have put in an integer
multiply instruction; they just didn't feel it was worth the chip
space to duplicate what was already in the floating point section.  As
I understand it, when people do instruction frequency analysis, they
find that integer multiply isn't used that often, so a few extra
clocks don't hurt too much.

But given that you aren't going to have an integer-register to
integer-register multiply, it does make compiler writing a bit tougher.
And the fact that you cannot have a floating-register to
floating-register truncate also hurts.  But writing a compiler that
just works and produces correct code is no more difficult on the i860
than on anything else; what is really difficult is to produce one that
generates code that runs as fast as possible.

After all, part of the point of this chip is that it's supposed to be
really fast.  Peak speeds of 60 double precision MFLOPS and 80 single
precision MFLOPS are approaching Cray 1 speeds.  I've written code that
reaches those speeds.  What makes it difficult to write a compiler for
this chip that actually gets that sort of speed is:

   1:  The processor is faster than standard memory, the on-chip
       cache is microscopic, and there are no real provisions for an
       off-chip cache.
   2:  The fast multiplies and adds are pipelined, meaning that you
       can have several going at once.  Current compiler technology
       doesn't seem to know how to deal with that.  There are some
       isolated groups that do, but their knowledge hasn't diffused out.
   3:  The multiply-accumulate instructions are arcane and don't even
       begin to think about being orthogonal.
   4:  There are only 15 double precision registers.  That may sound
       like a lot to standard microprocessor folks, but with
       the cache, pipelining, and non-orthogonality stuff you need more
       than that.
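
To make item 2 concrete, here is a minimal sketch in plain C of the
kind of restructuring a compiler (or a person) has to do to keep a
pipelined adder and multiplier busy. It is not i860 code: the name dot4
and the unroll factor of 4 are assumptions picked for illustration.

```c
#include <stddef.h>

/* Dot product with four independent partial sums. Each sum forms its
 * own dependency chain, so a pipelined adder can start a new add
 * before the previous one has finished. A single running sum would
 * make every add wait for the one before it. */
double dot4(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* remainder when n is not a multiple of 4 */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that this changes the order of the floating point additions, so
the result can differ in the last bits from a simple one-sum loop.
Each extra partial sum also ties up a register, which is where item 4
starts to bite.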

I don't think it's impossible to write a compiler that gets a
significant fraction of the peak speed of this machine, but it is
difficult.  As a matter of fact, it's more difficult than it had to be.
It seems like the people designing the chip never talked to compiler
writers, because there are several things that needlessly make it
difficult to write compilers.

Steve Roy