fred@meiko.co.uk (Mark Homewood) (02/09/90)
> In reply to vorbrueggen@mpih-frankfurt.mpg.dbp.de

Dear Jan,

Somebody handed me the mail you had sent about the T800. I was asked
in 1984 by David May to look at floating point on the then integer
transputer. In December 1986 WE were able to produce an FPU fabricated
on the side of the T414 die. An intermediate stopgap had been produced
on the T414 by me and Guy Harriman which made its single length fp go
three times faster.

Actually it might be quite a good idea here to list the people who
worked positively on the T800, as Inmos as a company never gave many
of them recognition of their achievement, whether publicly, verbally
or in their pay packets. Needless to say a couple of people will say I
missed them off, but I won't have meant to; I'm not nasty enough,
otherwise I would have done much better while at Inmos.

Principal Work        Person          Current Contact
T424 Improvement :    Guy Harriman    NeXT
                      Jon Beecroft    jon@meiko.co

I worked on the implementation of floating point with quite obviously.
While I was doing this I worked closely with Guy Harriman and Jon
Beecroft, who had implemented the T414 and were involved in various
projects to improve its performance and all round 'niceness'.

1) Workspace zero is used by some scheduling instructions, ALT .. This
is documented vaguely in the Transputer Instruction Set.

2) GUY is the most awful piece of tack that Inmos ever exposed. It was
written by Boris (Cownie), now at Meiko, at the request of Guy Harriman
for writing test code. The effects in the occam compilers were always
rather bizarre, even to Boris, who always denies writing it. The
pipeline of the compiler is interrupted and odd tokens are passed
between the parser and code generator. All names are lost and relative
addressing (addressing at all was an upgrade) is near impossible. Try
guyharriman@next.com for a second opinion.

3) The timings put in the manual are a compromise. They reflect what a
user MIGHT see in his code.
The nature of the interface between the FPU and IU, and how the FPU
instructions execute, is too complex to sum up in a single number, but
that is what I had to do. For example, FPDUP takes 2 cycles on the IU
but only 1 on the FPU. FPLDNLDB takes 3 on the IU and 2 on the FPU;
the synchronisation points vary between the instructions.

The CPU doesn't preprocess FPDUP. The FPU stack is entirely separate
from the IU (CPU). The IU merely initiates data transfers and hands
over instructions.

The figures you have for fpadd and fpmul are wrong. Nearly all the
floating point operations have a variable duration while executing on
the FPU. The instruction is automatically decoded, being fetched as
part of the IU's instruction code (actually we chose them to be exactly
the same). To give you some idea of the complexity of ADD, for example,
here is a piece of occam which describes the first two cycles of an
aligned add operation.

{{{ FPUADD
NextInst = FPUADD  -- 1
  SEQ  -- VBC
    ExpZbus := Breg [Exp] - Areg [Exp]
    Sreg := ExpZbus
    ShiftRight (Shifter, Aregslave [Frac], Sreg)
    FracZbus := Breg [Frac]
    ExpZbusEqZ := ExpZbus = ZERO
    ExpZbusNeg := ExpZbus < ZERO
    IF
      ExpZbusEqZ
        IF
          {{{ Can't happen
          ExpZbusNeg
            NextInst := FPUMULENDsetUpLoop
          }}}
          {{{ Aligned
          TRUE
            NextInst := FPUADDaligned
          }}}
      TRUE
        IF
          {{{ A greater than B
          ExpZbusNeg
            SEQ
              NextInst := FPUADDaGtb
          }}}
          {{{ B greater than A
          TRUE
            NextInst := FPUADDbGta
          }}}
}}}
{{{ FPUADDaligned ()
NextInst = FPUADDaligned  -- 2
  SEQ  -- VBC
    ExpBregslaveToAreg ()
    ExpAregslaveToBreg ()
    FracZbus := Breg [Frac] - Areg [Frac]
    ExpZbus := Breg [Exp] - ExpConstant [s.RealExp][Areglen]
    ExpZbusEqZ := ExpZbus = ZERO
    SignsDiffer := Aregslave [Sign] <> Bregslave [Sign]
    IF
      ExpZbusEqZ
        IF
          {{{ Both nan or inf, Signs differ
          SignsDiffer
            NextInst := FPUADDbothINsignsDiffer
          }}}
          {{{ Both nan or inf, Signs same
          TRUE
            NextInst := FPUADDbothINsignsSame
          }}}
      TRUE
        IF
          {{{ AlignedSub
          SignsDiffer
            SEQ
              NextInst := FPUADDalignedSubtract
          }}}
          {{{ AlignedAdd
          TRUE
            SEQ
              NextInst := FPUADDalignedAddition
          }}}
}}}

Please note the above is from my memory and may not be exact, but it
serves my argument. It is pretty clear that the majority of the
operations are conditional. For this reason the various types of ADD
(aligned, AgtB, BgtA ...) execute at different speeds. Further, even
the rounding functions are conditional. Naturally we optimised the
"normal" paths, but it must remain that operations vary in length.

This is merely an attribute of the way in which both the FPU (and IU)
were implemented, using microcoded datapaths. Other FPUs, Weitek for
example, execute all operations in a fixed number of cycles at a much
greater cost in hardware. See the IEEE standard 754 and this will
become obvious.

The cycle time of ADD can therefore vary depending on how simple it is
for the hardware to do. (Denormalised numbers are slow, but most Weitek
chips don't implement them at all !!!) That is, it's data dependent,
and although I would have liked to quote a non-integer number for
cycle count based on number distributions common in numerical work, it
would only have begged more questions, and at least put you off till
now. Try some denormalised numbers in round minus and you may get
cycle times of 10.

You're much closer on multiply: at 3 bits per cycle it takes 9 cycles,
since Booth recoding requires at least one extra bit because it goes
into a redundant form. The two extra cycles are for the ROUND and TIDY
functions, which occur on all arithmetic functions which require
rounding. MUL is obviously a lot less conditional than ADD and never
requires normalisation. Memory requests are handled separately.

4) CJ was like that for no good reason. Supposedly compiler studies
indicated that it was best, but I think they got the sense of the
results wrong, and CJ false and destroy (because of time slice) was in
fact worst.

I wish I could have had just one primary for the FPU, but it was too
late. Although the instruction set hadn't been released (so Inmos
could change the instruction set if it liked ??),
Dave wasn't prepared to screw up all the compiler work by such a
radical difference between T414 and T800. The short secondaries were
also allocated and were a lot less use to the FPU; a FOPR is what I
wanted.

I added dup to the T800 because it ought to have been there all the
time. It ought to have been a short. The xenophobia at Inmos, however,
was such that for OCCAM you could always compile so that expressions
wouldn't require ROT or DUP or DROP, so what use could they be?

Fred
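
P.S. The multiply arithmetic in point 3 can be checked with a rough
sketch. It assumes a 24-bit single length fraction (the IEEE 754 single
format with the hidden bit included), one extra bit for the Booth
recoding, 3 bits retired per cycle, and two further cycles for ROUND
and TIDY. The function name and its parameters are made up for
illustration and are not taken from the T800 microcode; this is a
back-of-envelope model, not a statement of how the FPU is built.

```python
import math


def mul_cycles(fraction_bits=24, bits_per_cycle=3,
               booth_extra_bits=1, round_tidy_cycles=2):
    """Estimate multiply cycles under the assumptions stated above.

    All parameters are illustrative assumptions, not microcode facts.
    Returns (loop_cycles, total_cycles).
    """
    # Bits the multiply loop must consume, including the extra
    # bit needed because Booth recoding uses a redundant form.
    bits_to_process = fraction_bits + booth_extra_bits
    # A partial group of bits still costs a whole cycle.
    loop_cycles = math.ceil(bits_to_process / bits_per_cycle)
    # ROUND and TIDY follow every arithmetic operation that rounds.
    return loop_cycles, loop_cycles + round_tidy_cycles


loop, total = mul_cycles()
print(loop, total)  # prints: 9 11
```

Without the extra Booth bit the loop would be 8 cycles (24 / 3), which
is why the recoding bit pushes it to 9.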