eb@lri.lri.fr (Erick Bizouarn) (05/10/89)
We have some questions on the notion of time of the transputer. It has a strange behavior. Consider the following example where we measures the execution time of the ldc (load constant) instruction. The time does not always increase with the number of the instruction in the measured loop. We have obtained the following results on a T414: ----------------------------------------- | number of ldc inst | execution time | ----------------------------------------- | 0 | T = 16832 us. | ----------------------------------------- | 1 | T - 960 us. | ----------------------------------------- | 2 | T - 1984 us. | ----------------------------------------- | 3 | T - 512 us. | ----------------------------------------- | 4 | T + 2048 us. | ----------------------------------------- ------------------------------------------------------------------------------- .text _mess: .ascii "time = %d us.\12\0" .globl _main _main: adjw -3 ldc 0 sttimer ldtimer stl 1 ldc 10000 stl 2 .align 2 ; alignment on a divisible by 4 boundary. L0: ### ldc 0 ..... (n times) ldc 0 ### ldl 2 adc -1 stl 2 ldl 2 eqc 0 cjmp L0 ldtimer ldl 1 sub ldc 64 ; for low priority timer mul adc -16832 ; time for a loop without any ldc 0 instruction. ldc _mess-L1 ldpi L1: ldl 4 call _printf adjw 3 ret -------------------------------------------------------------------------------- Where is the mistake ??? May a longer program be executed faster ? Perhaps it will be explained by instruction prefetch, but we would like an detailled explanation. Furthermore is there a means to write out of the memory of the INMOS B004 transputer card without halting the transputer. Thanks in advance for any hint. ******************************************************************************** Bizouarn Erick L.R.I. (Universite d'Orsay) France E-mail: eb@lri.lri.fr ********************************************************************************
krste@gec-rl-hrc.co.uk (krste) (05/24/89)
Dear Erick After our meeting last week I returned to find your question about transputer timings on the mailing list. I am posting this reply via the list as I think this might be of general interest. Briefly, the original question was over execution times for a loop coded as follows: ldc 10000 stl 2 .align /* Put L0 on a word boundary. */ L0: /* Insert variable number of "ldc 0" instructions. */ ldl 2 adc -1 stl 2 ldl 2 eqc 0 cj L0 The number of "ldc 0" instructions is varied and the total execution time for the run is measured. Below are Erick's original timings (for a Parsytek card?) together with some I gathered on the Tadpole card in our Sun. The execution time *decreases* with added "ldc 0" instructions! Number of | Erick's | Tadpole "ldc 0"s | timings in us | card ----------+---------------+--------- 0 | 16382 | 20352 1 | 15422 | 19328 2 | 14398 | 17856 3 | 15870 | 17856 4 | 18430 | 22848 5 | ? | 21888 As Erick suggested, this occurs due to the operation of the instruction pre-fetch buffer. Below, I've attempted a more detailed analysis based on my own understanding of the transputer's architecture but would appreciate any more precise information (Inmos?). Instructions are pre-fetched in parallel with program execution. On a given cycle, the CPU is taking instructions from one 4-byte buffer while the pre-fetch may be filling another. When the CPU takes the last instruction from it's buffer, the buffers are swapped and another instruction pre-fetch cycle is started. The transputer has only one memory space (i.e. non-Harvard architecture) and so instruction pre-fetches compete with data accesses for the memory bus. When a jump occurs, an instruction fetch must occur before the CPU can begin executing code at the new location. However, a second pre-fetch cycle is started immediately to fill the pre-fetch buffer. If the user code contains instructions which access memory (ldl, stl etc.) then these will be delayed until after the second pre-fetch cycle completes. If the "cj" instruction is the last byte in a word, then a new pre-fetch is started even if jump is taken. This pre-fetch will be for the instructions five bytes after the "cj" byte and must be completed before the instructions at the jump destination can be fetched. The "cj" instruction takes 4 cycles if taken and these can overlap with the useless pre-fetch, so this end of the pipeline break is not too disastrous. I've drawn some diagrams below which show the interaction between the pre-fetch's use of the memory bus and the program's data accesses, and these illustrate how extra instructions can cause faster execution of the loop. The diagrams are for the Tadpole card which has a 20MHz T414 with 250ns DRAM cycles. Each character position horizontally represents one clock cycle (50ns). Instructions codes are written vertically and are placed on the cycle on which they're read. The periods labelled "execution" show when the transputer is running program code. The asterisks show when either prefetch or the program's data access is using the memory bus. First with no "ldc" instructions, then with 3 "ldc" instructions: cycle no. 11111111112222222222333333333344 012345678901234567890123456789012345678901 inst L l nas l enc 0 d fdt d qfj : l icl l ciL 2 x 2 2 0x0 execution ............. ............ data acc. ****** ***** ****** prefetch ********** ***** ***** takes 42 cycles inst L llll nas l enc 0 dddd fdt d qfj : cccl icl l ciL 0002 x 2 2 0x0 execution ... ........ ................. data acc. ****** *********** prefetch ********** ***** ***** takes 37 cycles These diagrams are not 100% accurate, I believe greater overlap occurs than these diagrams suggest (I haven't the time to set up a logic analyser and Inmos keep quiet about the transputer micro-architecture). However, they are within a cycle or two of the measured figures. CONCLUSIONS In general, the "ldc" instructions can be replaced with any instruction which doesn't access external memory. These will then be executed in parallel with the second instruction pre-fetch. A compiler could re-arrange code to take advantage of this, by placing code which doesn't reference memory at the head of word-aligned looping constructs. Also, within a section of code it is theoretically possible to optimise code (perhaps by adding "pfix 0" nops) so that instruction pre-fetches occur when non-memory instructions are executed, but this is very tricky and linked to a given hardware implementation's memory cycle. Unnecessary pre-fetches at the tail of a loop are avoided if the "cj" instruction is not the last byte of a word. This might also be tricky to arrange and doesn't gain you much given that "cj" takes four cycles in any case. The second execution trace shows how well the transputer can saturate its memory bus, but also shows how tightly its performance is coupled to the speed of external memory (how do you fit a Lisp system into internal RAM?). Expect to see caches on future transputers..... +---------------------------------------------------------------------------+ | Krste Asanovic email: krste@uk.co.gec-rl-hrc | | GEC Hirst Research Centre phone: +44 (1) 908 9662 | | East Lane fax: +44 (1) 904 7582 | | Wembley telex: 923429 GECLAB G | | Middlesex | | HA9 7PP | | England | +---------------------------------------------------------------------------+