[comp.sys.transputer] T800 Transputer Floating Point

fred@meiko.co.uk (Mark Homewood) (02/09/90)
> In reply to vorbrueggen@mpih-frankfurt.mpg.dbp.de

Dear Jan,
 
Somebody handed me the mail you had sent about the T800.
 
I was asked in 1984 by David May to look at floating point on the then integer
transputer. In December 1986 WE were able to produce an FPU fabricated on
the side of the T414 die.  An intermediate stopgap had been
produced on the T414 by me and Guy Harriman which made its single length
fp go three times faster. Actually it might be quite a good idea here to
list the people who worked positively on the T800, as Inmos as a company never
gave many of them recognition of their achievement, neither publicly, verbally
nor in their pay packets. Needless to say a couple of people will say I
missed them off, but I won't have meant to; I'm not nasty enough, otherwise
I would have done much better while at Inmos.

Principal Work		Person		Current Contact
T424 Improvement :	Guy Harriman	NeXT
			Jon Beecroft	jon@meiko.co

I worked on the implementation of floating point, quite obviously.
While I was doing this I worked closely with Guy Harriman
and Jon Beecroft, who had implemented the
T414 and were involved in various projects to improve its performance and
all round 'niceness'.
 
1) Workspace zero is used by some scheduling instructions, ALT ..
   This is documented vaguely in the Transputer Instruction Set.
 
2) GUY is the most awful piece of tack that Inmos ever exposed. It was
   written by Boris (Cownie) now at Meiko at the request of Guy Harriman
   for writing test code. Its effects in the occam compilers were always
   rather bizarre, even to Boris, who always denies writing it. The
   pipeline of the compiler is interrupted and odd tokens passed
   between the parser and code generator. All names are lost and relative
   addressing (addressing at all was an upgrade) is near impossible.
   Try guyharriman@next.com for a second opinion.

3) The timings put in the manual are a compromise. They reflect what a user
   MIGHT see in his code. The nature of the interface between the FPU and
   IU, and how the FPU instructions execute, is too complex to sum up in a
   single number, but that is what I had to do. E.g. FPDUP takes 2 cycles on
   the IU, but only 1 on the FPU. FPLDNLDB takes 3 on the IU and 2 on the
   FPU; the synchronisation points vary between the instructions.
 
   The CPU doesn't preprocess FPDUP. The FPU stack is entirely separate
   from the IU(CPU). The IU merely initiates data transfers and hands
   over instructions.
 
   The figures you have for fpadd and fpmul are wrong. Nearly all the
   floating point operations have a variable duration while executing on
   the FPU. The instruction is decoded automatically, being fetched as part
   of the IU's instruction code (actually we chose them to be exactly the
   same). To give you some idea as to the complexity of ADD, for example,
   here is a piece of occam which describes the first two cycles of an
   aligned add operation.
 
          {{{  FPUADD
          NextInst = FPUADD -- 1
            SEQ -- VBC
              ExpZbus := Breg [Exp] - Areg [Exp]
              Sreg := ExpZbus
              ShiftRight (Shifter, Aregslave [Frac], Sreg)
              FracZbus := Breg [Frac]
 
              ExpZbusEqZ := ExpZbus = ZERO
              ExpZbusNeg := ExpZbus < ZERO
              IF
                ExpZbusEqZ
                  IF
                    {{{  Can't happen
                    ExpZbusNeg
                      NextInst := FPUMULENDsetUpLoop
                    }}}
                    {{{  Aligned
                    TRUE
                      NextInst := FPUADDaligned
                    }}}
                TRUE
                  IF
                    {{{  A greater than B
                    ExpZbusNeg
                      SEQ
                        NextInst := FPUADDaGtb
                    }}}
                    {{{  B greater than A
                    TRUE
                      NextInst := FPUADDbGta
                    }}}
          }}}
          {{{  FPUADDaligned ()
          NextInst = FPUADDaligned -- 2
            SEQ -- VBC
              ExpBregslaveToAreg ()
              ExpAregslaveToBreg ()
              FracZbus := Breg [Frac] - Areg [Frac]
              ExpZbus := Breg [Exp] - ExpConstant [s.RealExp][Areglen]
 
              ExpZbusEqZ := ExpZbus = ZERO
              SignsDiffer := Aregslave [Sign] <> Bregslave [Sign]
              IF
                ExpZbusEqZ
                  IF
                    {{{  Both nan or inf, Signs differ
                    SignsDiffer
                      NextInst := FPUADDbothINsignsDiffer
                    }}}
                    {{{  Both nan or inf, Signs same
                    TRUE
                      NextInst := FPUADDbothINsignsSame
                    }}}
                TRUE
                  IF
                    {{{  AlignedSub
                    SignsDiffer
                      SEQ
                        NextInst := FPUADDalignedSubtract
                    }}}
                    {{{  AlignedAdd
                    TRUE
                      SEQ
                        NextInst := FPUADDalignedAddition
                    }}}
          }}}

   Please note the above is from my memory and may not be exact but serves
   my argument.
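
   The branching in the occam above can be modelled roughly in Python.
   This is only a sketch under stated assumptions, not the microcode
   itself: the function names are hypothetical, the operands are taken
   as IEEE 754 single length values, `a` stands for Areg and `b` for
   Breg, and the comparison against ExpConstant is assumed (from the
   "Both nan or inf" fold comments) to test for the all-ones exponent.
   The "Can't happen" branch is left out.

```python
import struct

EXP_ALL_ONES = 0xFF  # assumption: single length exponent field of inf/NaN


def fields(x):
    """Split x, rounded to IEEE 754 single length, into (sign, exp, frac)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def add_path(a, b):
    """Name the microcode state selected after the two cycles listed above."""
    a_sign, a_exp, _ = fields(a)
    b_sign, b_exp, _ = fields(b)

    # Cycle 1: ExpZbus := Breg [Exp] - Areg [Exp]
    exp_diff = b_exp - a_exp
    if exp_diff < 0:
        return "FPUADDaGtb"
    if exp_diff > 0:
        return "FPUADDbGta"

    # Cycle 2 (FPUADDaligned): exponents are equal, so either one can be
    # compared against the inf/NaN exponent constant.
    signs_differ = a_sign != b_sign
    if a_exp == EXP_ALL_ONES:
        if signs_differ:
            return "FPUADDbothINsignsDiffer"
        return "FPUADDbothINsignsSame"
    if signs_differ:
        return "FPUADDalignedSubtract"
    return "FPUADDalignedAddition"
```

   For instance add_path(1.0, 1.5) selects the aligned addition state,
   while add_path(4.0, 1.0) selects FPUADDaGtb, so even before any
   shifting or rounding the two adds have taken different routes.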
 
   It is pretty clear that the majority of the operations are conditional.
   For this reason the various types of ADD, i.e. aligned, AgtB, BgtA ...,
   execute at different speeds. Further, even the rounding functions are
   conditional. Naturally we optimised the "normal" paths, but it must
   remain that operations vary in length. This is merely an attribute of
   the way in which both the FPU (and IU) were implemented, using microcoded
   datapaths. Other FPUs, Weitek for example, execute all operations in
   a fixed number of cycles at a much greater cost in hardware.
   See the IEEE standard 754 and this will become obvious.
 
   The cycle time of ADD can therefore vary depending on how simple it is
   for the hardware to do. (Denormalised numbers are slow, but most Weitek
   chips don't implement them at all !!!). I.e. it's data dependent, and although
   I would have liked to quote a non-integer number for cycle count based
   on number distributions common in numerical work, it would only
   have begged more questions, and at least put you off till now.
   Try some denormalised numbers in round minus and you may get
   cycle times of 10.
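
   To find operands for such an experiment, a small Python check can
   tell whether a value is denormalised in single length. This is a
   sketch; the function name is hypothetical, and it uses only the
   IEEE 754 definition (exponent field zero, fraction non-zero). It
   says nothing about how many cycles the T800 then takes.

```python
import struct


def is_denormal_single(x):
    """True if x, rounded to IEEE 754 single length, is denormalised."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    exp = (bits >> 23) & 0xFF      # 8-bit biased exponent field
    frac = bits & 0x7FFFFF         # 23-bit stored fraction
    return exp == 0 and frac != 0
```

   The smallest normalised single length value is 2**-126, so
   is_denormal_single(2.0 ** -127) is true while
   is_denormal_single(2.0 ** -126) is false.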
 
   You're much closer on multiply: the 3 bits per cycle takes 9 cycles;
   Booth recoding requires at least one extra bit because it goes into
   a redundant form. The two extra cycles are for the ROUND and TIDY
   functions, which occur on all arithmetic functions which require rounding.
   MUL is obviously a lot less conditional than ADD and never requires
   normalisation.
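
   The arithmetic behind those figures can be written out as a short
   Python sketch. It assumes a 24-bit single length significand (23
   stored bits plus the hidden bit) and reads the text above as "one
   extra Booth bit, 3 bits retired per cycle, then 2 cycles for ROUND
   and TIDY". The function and parameter names are hypothetical.

```python
import math


def mul_cycles(frac_bits, bits_per_cycle=3, booth_extra_bits=1,
               round_tidy_cycles=2):
    """Return (multiply loop cycles, total cycles) under the assumptions above."""
    loop = math.ceil((frac_bits + booth_extra_bits) / bits_per_cycle)
    return loop, loop + round_tidy_cycles


single_loop, single_total = mul_cycles(24)   # 24-bit single length significand
```

   With 24 fraction bits this gives 9 loop cycles and 11 in total,
   matching the 9 cycles plus two quoted above.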
 
   Memory requests are handled separately.
 
4) CJ was like that for no good reason. Supposedly compiler studies
   indicated that it was best, but I think they got the sense of the results
   wrong, and CJ false and destroy (because of time slice) was in fact worst.
 
   I wish I could have had just one primary for the FPU but it was too
   late. Although the instruction set hadn't been released (so Inmos could
   change the instruction set if it liked ??), Dave wasn't
   prepared to screw up all the compiler work by such a radical difference
   between T414 and T800. The short secondaries were also allocated and
   were a lot less use to the FPU; a FOPR is what I wanted.
 
   I added dup to the T800 because it ought to have been there all the time.
   It ought to have been a short. The xenophobia at Inmos however was such
   that for OCCAM you could always compile such that expressions wouldn't
   require ROT or DUP or DROP, so what use could they be ?
 
Fred