[comp.sys.transputer] Transputer timing

eb@lri.lri.fr (Erick Bizouarn) (05/10/89)

We have some questions on the notion of time of the transputer. It has a 
strange behavior. Consider the following example where we measures the
execution time of the ldc (load constant) instruction. The time does not
always increase with the number of the instruction in the measured loop.

We have obtained the following results on a T414:

	-----------------------------------------
	| number of ldc inst |  execution time	|
	-----------------------------------------
	|	0	     |  T = 16832 us.	|
	-----------------------------------------
	|	1	     |  T - 960 us.	|
	-----------------------------------------
	|	2	     |  T - 1984 us.	|
	-----------------------------------------
	|	3	     |  T - 512 us.	|
	-----------------------------------------
	|	4	     |  T + 2048 us.	|
	-----------------------------------------


-------------------------------------------------------------------------------

	.text
_mess:
	.ascii	"time = %d us.\12\0"
.globl _main
_main:
	adjw	-3
	ldc	0
	sttimer
	ldtimer
	stl	1
	ldc	10000
	stl	2
	.align 2	; alignment on a divisible by 4 boundary.
L0:
###
ldc	0
.....	(n times)
ldc	0
###
	ldl	2
	adc	-1
	stl	2

	ldl	2
	eqc	0
	cjmp	L0

	ldtimer
	ldl	1
	sub
	ldc	64	; for low priority timer
	mul
	adc	-16832	; time for a loop without any ldc 0 instruction.
	ldc	_mess-L1
	ldpi
L1:
	ldl	4
	call	_printf
	adjw	3
	ret

--------------------------------------------------------------------------------

Where is the mistake ???

May a longer program be executed faster ? Perhaps it will be explained by
instruction prefetch, but we would like an detailled explanation.

Furthermore is there a means to write out of the memory of the INMOS B004 
transputer card without halting the transputer.

Thanks in advance for any hint.

********************************************************************************

Bizouarn Erick
L.R.I. (Universite d'Orsay) France

E-mail: eb@lri.lri.fr

********************************************************************************

krste@gec-rl-hrc.co.uk (krste) (05/24/89)

Dear Erick

After our meeting last week I returned to find your question about
transputer timings on the mailing list. I am posting this reply via the
list as I think this might be of general interest.

Briefly, the original question was over execution times for a loop coded
as follows:

        ldc     10000
        stl     2

.align  /* Put L0 on a word boundary. */

L0:
        /* Insert variable number of "ldc 0" instructions. */

        ldl     2
        adc     -1
        stl     2
        ldl     2
        eqc     0
        cj      L0

The number of "ldc 0" instructions is varied and the total execution time
for the run is measured. Below are Erick's original timings (for a Parsytek
card?) together with some I gathered on the Tadpole card in our Sun. The
execution time *decreases* with added "ldc 0" instructions!

Number of | Erick's       | Tadpole
"ldc 0"s  | timings in us | card
----------+---------------+---------
    0     |    16382      |  20352
    1     |    15422      |  19328
    2     |    14398      |  17856
    3     |    15870      |  17856
    4     |    18430      |  22848
    5     |       ?       |  21888

As Erick suggested, this occurs due to the operation of the instruction
pre-fetch buffer. Below, I've attempted a more detailed analysis based on
my own understanding of the transputer's architecture but would appreciate
any more precise information (Inmos?).

Instructions are pre-fetched in parallel with program execution. On a given
cycle, the CPU is taking instructions from one 4-byte buffer while the
pre-fetch may be filling another. When the CPU takes the last instruction
from it's buffer, the buffers are swapped and another instruction pre-fetch
cycle is started. The transputer has only one memory space (i.e. non-Harvard
architecture) and so instruction pre-fetches compete with data accesses for
the memory bus.

When a jump occurs, an instruction fetch must occur before the CPU can begin
executing code at the new location. However, a second pre-fetch cycle is
started immediately to fill the pre-fetch buffer. If the user code contains
instructions which access memory (ldl, stl etc.) then these will be delayed
until after the second pre-fetch cycle completes.

If the "cj" instruction is the last byte in a word, then a new pre-fetch is
started even if jump is taken. This pre-fetch will be for the instructions
five bytes after the "cj" byte and must be completed before the instructions
at the jump destination can be fetched. The "cj" instruction takes 4 cycles
if taken and these can overlap with the useless pre-fetch, so this end of the
pipeline break is not too disastrous.

I've drawn some diagrams below which show the interaction between the
pre-fetch's use of the memory bus and the program's data accesses, and these
illustrate how extra instructions can cause faster execution of the loop.
The diagrams are for the Tadpole card which has a 20MHz T414 with
250ns DRAM cycles. Each character position horizontally represents
one clock cycle (50ns). Instructions codes are written vertically and
are placed on the cycle on which they're read. The periods labelled
"execution" show when the transputer is running program code. The asterisks
show when either prefetch or the program's data access is using the memory
bus.

First with no "ldc" instructions, then with 3 "ldc" instructions:

cycle no.           11111111112222222222333333333344
          012345678901234567890123456789012345678901

inst      L    l          nas    l          enc
          0    d          fdt    d          qfj
          :    l          icl    l          ciL
               2          x 2    2          0x0
execution           .............     ............
data acc.           ******  *****     ******
prefetch  **********             *****         *****  takes 42 cycles


inst      L    llll       nas       l     enc
          0    dddd       fdt       d     qfj
          :    cccl       icl       l     ciL
               0002       x 2       2     0x0
execution      ...  ........   .................
data acc.           ******     ***********
prefetch  **********      *****           *****       takes 37 cycles

These diagrams are not 100% accurate, I believe greater overlap occurs than
these diagrams suggest (I haven't the time to set up a logic
analyser and Inmos keep quiet about the transputer micro-architecture).
However, they are within a cycle or two of the measured figures.

CONCLUSIONS

In general, the "ldc" instructions can be replaced with any instruction which
doesn't access external memory. These will then be executed in parallel with
the second instruction pre-fetch. A compiler could re-arrange code to take
advantage of this, by placing code which doesn't reference memory at the head
of word-aligned looping constructs. Also, within a section of code it is
theoretically possible to optimise code (perhaps by adding "pfix 0" nops)
so that instruction pre-fetches occur when non-memory instructions are
executed, but this is very tricky and linked to a given hardware
implementation's memory cycle.

Unnecessary pre-fetches at the tail of a loop are avoided if the "cj"
instruction is not the last byte of a word. This might also be tricky to
arrange and doesn't gain you much given that "cj" takes four cycles in any
case.

The second execution trace shows how well the transputer can saturate its
memory bus, but also shows how tightly its performance is coupled to
the speed of external memory (how do you fit a Lisp system into internal
RAM?). Expect to see caches on future transputers.....


+---------------------------------------------------------------------------+
| Krste Asanovic                        email: krste@uk.co.gec-rl-hrc       |
| GEC Hirst Research Centre             phone: +44 (1) 908 9662             |
| East Lane                             fax:   +44 (1) 904 7582             |
| Wembley                               telex: 923429 GECLAB G              |
| Middlesex                                                                 |
| HA9 7PP                                                                   |
| England                                                                   |
+---------------------------------------------------------------------------+