[comp.sys.transputer] TDS Questions, and the Speed of Light on a T800

vorbrueggen@mpih-frankfurt.mpg.dbp.DE (02/03/90)

Hello Transputer wizards,

I've got a few, somewhat esoteric questions on the TDS and
transputer (esp. T800) instructions. If you're not interested in
counting processor cycles, just skip this :-)

1. When I converted an occam routine (consisting almost entirely
of assembler) from an `in-line' procedure to an SC, I noticed that
the compiler reserves a workspace slot (via ajw) that is never used.
I noticed this because I accessed the argument list of the procedure
directly, and that broke (of course) after the conversion to an SC.
Fortunately, the compiler is able to generate the correct instructions
for LDLP vector.argument (even turns it into an LDL!), and it also
understands LDL (SIZE vector.argument), although this is undocumented;
thus, the need for direct access to the workspace is diminished.

Nonetheless, I don't see why this space (and the time, in some cases,
for the ajw pair) has to be wasted in SC's. My programmes consist
almost entirely of nested SC's, and this waste of valuable on-chip RAM
is quite undesirable!
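
For concreteness, the access pattern I mean looks roughly like this
(a sketch only - TDS GUY code, with made-up names):

   PROC magnitude (VAL []REAL32 vector.argument, REAL32 result)
     GUY
       LDLP vector.argument          -- compiler emits an LDL of the pointer slot
       LDL (SIZE vector.argument)    -- the undocumented form: loads the hidden
                                     -- length word passed with the vector
       -- rest of the routine elided
   :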

2. Another annoying `feature' of GUYs (just quibbling, I know) is that
the offset for LEND has to be put in by hand. Or can we write
LDC (loop.end-loop.start) ? (I haven't tried, I must admit.)
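
For concreteness, the sort of thing I mean is (a sketch; the names and
the byte count are illustrative):

        -- the loop body, n.bytes bytes of code, starts here
        LDLP loop.control     -- pointer to the {index, count} word pair
        LDC  n.bytes          -- distance back to the loop head, hand-computed
        LEND                  -- step the index, decrement the count, and
                              -- jump back n.bytes if the count isn't exhausted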

3. Now comes the strange part. In Parallelogram 17/89, Jerry Sanders
calls peak theoretical performance "only that the manufacturer
guarantees that programmes will not exceed these rates, a speed of
light for a computer". Just to play around (and to see whether I could
beat Topexpress's Veclib), I wrote an assembler routine for the T800 to
compute the magnitude of an n-dimensional vector. The core
instructions per vector element are:

         fpldnlsn      ( 3)
         fpdup         ( 1)
         fpmul         (11)
         fpadd         ( 7)

The numbers are the processor cycles required, according to the
Transputer Instruction Set. If you add them up, you get 22 cycles
per element - assuming all code and operands on-chip, and *not*
accounting for the fact that they are long instructions, i.e.,
require a prefix to build the operation code. This should be overlapped
for the load (with the preceding add) and the add (with the preceding
multiply), but not for the duplicate and the multiply. Thus we expect
two cycles in addition. On the other hand, maybe the FPU does the
operand fetch (for the fpldnlsn), so the CPU would be free to
pre-process the fpdup in the meantime. Of course, there are some CPU
instructions for address arithmetic and loop control; we'll assume
that they completely overlap FPU operations.
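
For concreteness, one element's worth of the unrolled loop looks
roughly like this (a sketch; v.ptr is a made-up workspace name, and
the CPU address arithmetic is the part assumed to overlap the FPU):

         LDL    v.ptr     -- (CPU) pointer to the current block of 8 elements
         LDNLP  3         -- (CPU) address of, say, element 3 of the block
         fpldnlsn         -- push that element onto the FP stack       ( 3)
         fpdup            -- duplicate it                              ( 1)
         fpmul            -- square it                                 (11)
         fpadd            -- add the square to the running sum         ( 7)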

Now, here are the results:
Unrolling the loop by 8 and computing the time difference for
different vector lengths (all multiples of 8), one trip around the
loop takes 8.150 us on a 20MHz T800, code and data on-chip.
This time corresponds to 163 cycles. The loop overhead (time wasted
jumping back and not overlapped with calculations), checked via the
analogous routine unrolled by a factor of 4, is exactly 3 cycles.
This leaves us with 160 cycles per 8 elements, or 20 cycles per
element. We've beaten the speed of light of the T800! 
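
For reference, the timing was done with the on-chip TIMER, roughly as
in the following sketch (made-up names; vec is assumed to be filled in
beforehand). The per-trip figure comes from subtracting the times for
two vector lengths differing by many multiples of 8, so the fixed call
overhead cancels out:

   [800]REAL32 vec:
   REAL32 mag:
   TIMER clock:
   INT t.start, t.stop, elapsed:
   SEQ
     clock ? t.start
     magnitude (vec, mag)
     clock ? t.stop
     elapsed := t.stop MINUS t.start   -- repeat with a longer vec, subtract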

BTW, the times given for the fpadd and fpmul instructions are quite
consistent with other documentation. According to section 6.4 of
Communicating Process Architecture, the T800 multiplies three bits per
cycle and normalises in two cycles. This gives: 

fpadd: 1 (get inst) + 2 (denormalize) + 1 (add) + 2 (normalize result) = 6
fpmul: 1 (get inst) + 8 (24 bits/3 multiply) + 2 (normalize result) = 11

I don't know why the fpadd seems to take only 6 cycles when the
documentation says 7.

Now the question goes to somebody who knows about the internal workings
of the T800: Why do I get these results?

Connected with this is the question: How are memory accesses (both
internal and external) handled? Is there an independent entity (i.e.,
a separate process running in parallel to the CPU, FPU, and the DMA
machines) which handles these requests? This would explain why a LDL
takes two cycles (have to wait for data to arrive), while STL takes
only one (just get rid of it and go on). Of course, there would be a
penalty for an LDL following an STL, but maybe not if the load comes from
internal memory and the store goes to external. What about that?
(This is no idle speculation. Compiler writers for and users of
vector machines spend much time avoiding these types of memory access
conflicts!) 
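
The pattern I have in mind is simply (a sketch, made-up names):

         STL  temp       -- one cycle: hand the word over and carry on?
         LDL  other      -- two cycles: does this have to wait until the
                         -- preceding store has actually completed?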

4. Lastly, I'd like to know why some instructions have been done the
way they are. I can understand that cj (conditional jump) behaves as a
`jump zero' or `jump false', that's a matter of taste - a RISC
processor just doesn't have the orthogonal instruction set of a CISC.
But why, for whoever's sake, does it remove the operand if not jumping
and leave the zero when jumping? Most of the time, I find myself
preceding it with a dup so I have a copy of my laboriously computed
loop count after I've checked whether it's zero! Well, of course the
zero is easier to remove (with a diff) than a non-zero, but in that
case, we could invert the result of the comparison.
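
In other words, the usual workaround looks like this (a sketch; the
name is made up, and as in point 2 the jump offset has to be put in
by hand):

         LDL  count        -- the laboriously computed loop count
         dup               -- keep a copy, since cj pops it when not jumping
         cj   skip.bytes   -- jump past the loop (keeping the zero) if zero
         -- falls through here with count still on the stack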

In general, I find the selection of direct and short instructions very
reasonable. Exceptions are the startp and endp instructions: Why do
they, taking 10 and 11 cycles, get the privilege of a one-byte
instruction, while the much more useful dup now takes two cycles just
because it has to be a two-byte instruction? (Why dup wasn't there
from the start, but was added as an afterthought to the T800, would
probably make an amusing historical anecdote. Does anybody know the
reason?)

Any comments on these little criticisms?
Ok, I now eagerly await the flood of mail from all over the world!

   Jan Vorbr"uggen
   Max-Planck-Institute for Brain Research
   Frankfurt, F.R.G.

des@inmos.co.uk (David Shepherd) (02/06/90)

In reply to article <268:vorbrueggen@mpih-frankfurt.mpg.dbp.de>
>Hello Transputer wizards,

hi!

>3. Now comes the strange part. In Parallelogram 17/89, Jerry Sanders
>calls peak theoretical performance "only that the manufacturer
>guarantees that programmes will not exceed these rates, a speed of
>light for a computer". Just to play around (and to see whether I could
>beat Topexpress's Veclib), I wrote an assembler routine for the T800 to
>compute the magnitude of an n-dimensional vector. The core
>instructions per vector element are:
>
>         fpldnlsn      ( 3)
>         fpdup         ( 1)
>         fpmul         (11)
>         fpadd         ( 7)
>
>The numbers are the processor cycles required, according to the
>Transputer Instruction Set. If you add them up, you get 22 cycles
>per element - assuming all code and operands on-chip, and *not*
>accounting for the fact that they are long instructions, i.e.,
>require a prefix to build the operation code. This should be overlapped
>for the load (with the preceding add) and the add (with the preceding
>multiply), but not for the duplicate and the multiply. Thus we expect
>two cycles in addition. 

I think the prefix for the fpdup will get overlapped (I can't remember
exactly how that overlapping works - that is if I ever understood it!)

> ..... This leaves us with 160 cycles per 8 elements or 20 cycles per
>element. We've beaten the speed of light of the T800! 
>
>BTW, the times given for the fpadd and fpmul instructions are quite
>consistent with other documentation. 

glad to hear this :-)

However if you look carefully at the Transputer Instruction Set book
you will find the following paragraph

	The time taken to execute many floating point instructions
	is extremely data dependent. The timings given here are
	typical times for such instructions, such as arithmetic on
	normalised data. Certain values in the registers may cause
	execution to take more or less time.

basically your problem is due to the fact that the time for fpadd is
*totally* data dependent. 7 cycles is on the pessimistic side and 5 or 6
is probably more likely for data that doesn't involve denorms.

>Communicating Process Architecture, the T800 multiplies three bits per
>cycle and normalises in two cycles. This gives: 
>
>fpadd: 1 (get inst) + 2 (denormalize) + 1 (add) + 2 (normalize result) = 6
                          ^^^^^^^^^^^                 ^^^^^^^^^^^^^^^^
these ought to read       align fractions             round result
        ^^^^^^^^^^^
        get inst shouldn't be here - we hope this is overlapped.

>fpmul: 1 (get inst) + 8 (24 bits/3 multiply) + 2 (normalize result) = 11
        ^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^^
        shouldn't be counted here                 round result
                       should be 9 as you need to generate 26+ bits for rounding
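
re-summing with these corrections (and taking the instruction fetch as
overlapped) gives

        fpadd: 2 (align fractions) + 1 (add) + 2 (round result)    =  5
        fpmul: 9 (26+ bits at 3 bits/cycle) + 2 (round result)     = 11

so if fpadd really comes in at 5 cycles here, 3 + 1 + 11 + 5 = 20, which
is just the 20 cycles per element you measured.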

>Now the question goes to somebody who knows about the internal workings
>of the T800: Why do I get these results?

see above - fpu timings are very data dependent -- eg arithmetic involving
0s, infs or NaNs typically takes 3 cycles for all instructions as the
result is immediate. Other timings vary depending on whether denorms
need to be normalized (mul and div) or fractions need to be aligned (add
and sub) etc.

>4. Lastly, I'd like to know why some instructions have been done the
>way they are. I can understand that cj (conditional jump) behaves as a
>`jump zero' or `jump false', that's a matter of taste - a RISC
>processor just doesn't have the orthogonal instruction set of a CISC.
>But why, for whoever's sake, does it remove the operand if not jumping
>and leave the zero when jumping?

a very good question! i believe someone was being overzealous in tuning
the instructions to implement short-circuit boolean evaluation in an
optimal manner!

> Most of the time, I find myself
>preceding it with a dup so I have a copy of my laboriously computed
>loop count after I've checked whether it's zero! Well, of course the
>zero is easier to remove (with a diff) than a non-zero, but in that
>case, we could invert the result of the comparison.

well, at least you have a dup instruction now :-)

>In general, I find the selection of direct and short instructions very
>reasonable. Exceptions are the startp and endp instructions: Why do
>they, taking 10 and 11 cycles, get the privilege of a one-byte
>instruction, while the much more useful dup now takes two cycles just
>because it has to be a two-byte instruction? (Why dup wasn't there
>from the start, but was added as an afterthought to the T800, would
>probably make an amusing historical anecdote. Does anybody know the
>reason?)

again i believe someone said that it wouldn't be needed for an occam
compiler.

david shepherd
INMOS ltd

disclaimer: all the above is probably based on personal opinion and bias
and should not be taken as representing anything approaching official
INMOS support, opinion or whatever .... etc.