[comp.sys.transputer] Transputer instruction set criticisms

roger@inmos.co.uk (Roger Shepherd) (02/06/90)
Some criticisms were made of the transputer instruction set. These are
my responses. I hope they are correct: I was around while these
decisions were being made, but time causes one's memory to fade
(just like DRAM).

I suspect that one of my colleagues has explained the positronics used in
the transputers, which enable instructions to complete before they have
started and thus to go faster than light :-)

>                        ............   How are memory accesses (both
> internal and external) handled? Is there an independent entity (i.e.,
> a separate process running in parallel to CPU, FPU, and the DMA
> machines) which handles these requests? This would explain why a LDL
> takes two cycles (have to wait for data to arrive), while STL takes
> only one (just get rid of it and go on). Of course, there would be a
> penalty for a LDL following a STL, but maybe not if the load comes from
> internal memory and the store goes to external. What about that?
> (This is no idle speculation. Compiler writers for and users of
> vector machines spend much time avoiding these types of memory access
> conflicts!) 

The above description of the operation of LDL and STL is almost correct.
LDL takes 2 cycles: the first performs an address calculation and the
second makes the memory access. STL spends its first cycle on the same
address calculation, and the write occurs during the next cycle.
The sequence STL; LDL takes 3 cycles, because the memory access for the
STL overlaps the address calculation of the LDL. The sequence
LDL; STL also takes 3 cycles.
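
The cycle counts above can be reproduced with a toy timing model. This
is only a sketch in Python, not a description of the real hardware: the
function name schedule and the event tuples are invented for
illustration, and it assumes that the write cycle of an STL always
overlaps whatever follows it, so a trailing STL occupies the processor
for only one cycle.

```python
def schedule(seq):
    """Return (cycles, events) for a list of 'LDL' / 'STL' mnemonics.

    cycles is the number of processor cycles the sequence occupies;
    events is a list of (cycle, op, phase) tuples for inspection.
    """
    cycle = 0
    events = []
    for op in seq:
        # Both instructions spend their first cycle on address calculation.
        events.append((cycle, op, "address"))
        if op == "LDL":
            # The processor waits for the data, so the read costs a cycle.
            events.append((cycle + 1, op, "read"))
            cycle += 2
        elif op == "STL":
            # Assumption: the write proceeds while the next instruction
            # starts, so only the address cycle is charged here.
            events.append((cycle + 1, op, "write"))
            cycle += 1
        else:
            raise ValueError("unknown instruction: " + op)
    return cycle, events


for seq in (["STL", "LDL"], ["LDL", "STL"]):
    cycles, events = schedule(seq)
    print("; ".join(seq), "->", cycles, "cycles")
```

In the STL; LDL case the events list shows the STL write and the LDL
address calculation sharing cycle 1, which is the overlap described
above.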

> 4. Lastly, I'd like to know why some instructions have been done the
> way they are. I can understand that cj (conditional jump) behaves as a
> `jump zero' or `jump false', that's a matter of taste - a RISC
> processor just doesn't have the orthogonal instruction set of a CISC.
> But why, for whoever's sake, does it remove the operand if not jumping
> and leave the zero when jumping? Most of the time, I find myself
> preceding it with a dup so I have a copy of my laboriously computed
> loop count after I've checked whether it's zero! Well, of course the
> zero is easier to remove (with a diff) than a non-zero, but in that
> case, we could invert the result of the comparison.

The reason that CJ behaves as it does is that it optimises the
generation of the occam OR and AND operators (equivalent to the C ||
and && operators). In retrospect it would have been better to have had
the popping behaviour of CJ reversed, so that it threw away a known
value and kept an unknown one.  The choice of JUMP IF FALSE or JUMP IF
TRUE is non-trivial. The transputer instruction set, as set up, allows
easy generation of fairly dense code, and this depends on a careful
choice of instructions. For example, if we assume CJ as it is, then the
choice of whether to have an EQC (equals constant) or a NEQC (not
equals constant) instruction depends on the frequency of constructs
such as "IF x = constant" versus "IF x <> constant".
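
To see why leaving the zero behind suits AND and OR, here is a small
sketch in Python. It is an illustration only: the interpreter run, the
tuple encoding of instructions and the label names are all invented, the
evaluation stack is modelled as a Python list rather than the real
three-register stack, and the code sequences are one plausible
compilation, not necessarily what any INMOS compiler emitted.

```python
def run(program, local_vars):
    """Interpret a tiny, hypothetical subset of transputer-like code."""
    stack = []
    labels = {arg: i for i, (op, arg) in enumerate(program) if op == "label"}
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "ldl":
            stack.append(local_vars[arg])
        elif op == "eqc":
            stack.append(1 if stack.pop() == arg else 0)
        elif op == "cj":
            if stack[-1] == 0:
                pc = labels[arg]   # jump taken: the zero stays on the stack
            else:
                stack.pop()        # jump not taken: the operand is removed
        elif op == "label":
            pass
        else:
            raise ValueError("unknown instruction: " + op)
    return stack


# a AND b: if a is false, the zero left by cj is already the result.
AND = [("ldl", "a"), ("cj", "end"), ("ldl", "b"), ("label", "end")]

# a OR b, written as NOT ((NOT a) AND (NOT b)), with eqc 0 acting as NOT.
OR = [("ldl", "a"), ("eqc", 0), ("cj", "end"),
      ("ldl", "b"), ("eqc", 0), ("label", "end"), ("eqc", 0)]

for a in (0, 1):
    for b in (0, 1):
        env = {"a": a, "b": b}
        print(a, b, "AND:", run(AND, env), "OR:", run(OR, env))
```

In both sequences the short-circuit path needs no extra instruction to
produce the result, because the zero that CJ leaves on the stack is the
value wanted at the join point.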

> In general, I find the selection of direct and short instructions very
> reasonable. Exceptions are the startp and endp instructions: Why do
> they, taking 10 and 11 cycles, get the privilege of a one-byte
> instruction, while the much more useful dup now takes two cycles just
> because it has to be a two-byte instruction? (Why dup wasn't there
> from the start, but was added as an afterthought to the T800, would
> probably make an amusing historical anecdote. Does anybody know the
> reason?)

DUP wasn't in the T424 from the start because it did not seem a very
useful instruction. So, you ask, why didn't DUP seem like a useful
instruction?  Well, the working assumptions that were made in the early
80s when the instruction set was designed included things like: all code
would be compiler generated, register allocation is hard (one reason
for producing a stack machine), LDL would be 1-cycle, global
optimisation is hard.....  When viewed in this light, DUP would not be
a common instruction and would, therefore, be 2 cycles. There would be a
distinct penalty for using the sequence LDL; DUP rather than LDL; LDL,
and there would be no benefit to using DUP; STL rather than STL; LDL. So
why introduce DUP?  Well, DUP was introduced to improve various of the
double-length conversion operations in the T800. The primary reason for
its inclusion was that (i) LDL was a two-cycle operation, and (ii) LDL
was often even slower since workspaces were not always on chip.
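
The trade-off can be put into numbers with a back-of-envelope sketch in
Python. The cycle counts for LDL and DUP are the ones given in this
thread; the 1-cycle STL, the helper names cost and actual, and the
extra_wait parameter (extra cycles for an off-chip workspace) are
assumptions made purely for illustration.

```python
def cost(seq, table):
    """Sum the per-instruction cycle counts for a sequence."""
    return sum(table[op] for op in seq)


# Early-80s working assumptions: LDL in 1 cycle, DUP (uncommon) in 2.
EARLY = {"LDL": 1, "STL": 1, "DUP": 2}


def actual(extra_wait=0):
    """Cycle table with a 2-cycle LDL plus hypothetical off-chip waits."""
    return {"LDL": 2 + extra_wait, "STL": 1, "DUP": 2}


pairs = [(["LDL", "DUP"], ["LDL", "LDL"]),
         (["DUP", "STL"], ["STL", "LDL"])]

for name, table in (("early assumption", EARLY),
                    ("on-chip", actual()),
                    ("off-chip, 3 waits", actual(extra_wait=3))):
    for with_dup, without_dup in pairs:
        print(name, "|", "; ".join(with_dup), "=", cost(with_dup, table),
              "vs", "; ".join(without_dup), "=", cost(without_dup, table))
```

Under the early assumptions the DUP sequences are always the slower
ones; with a 2-cycle LDL they break even, and once LDL has to wait for
off-chip memory the DUP sequences come out ahead.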

As to which instructions should be short codes and which long codes, I
don't think we did too badly given that we were guessing about the
structure of concurrent programs. I now think that the assignment of
opcodes to instructions should be driven by the aim of minimising the
bytes of code fetched during program execution. At the end of the day
the memory bandwidth taken by instruction fetch will be limiting - this
is what will limit the speed of RISCs and what will cause a re-emergence
of machines with addressing modes - they will get better code density.
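
As a sketch of what "minimise the bytes of code fetched" means in
practice, the Python below gives the one-byte encodings to the
instructions that are executed most often. Everything here is
hypothetical: the execution counts are made up, the function names are
invented, and the encoding is simplified to one byte for a short code
and two bytes for everything else.

```python
def bytes_fetched(counts, short_ops):
    """Total opcode bytes fetched, given dynamic execution counts."""
    return sum(n * (1 if op in short_ops else 2) for op, n in counts.items())


def best_short_set(counts, slots):
    """Pick the `slots` most frequently executed instructions."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:slots])


# Made-up dynamic counts from an imaginary program run.
counts = {"dup": 900, "add": 500, "startp": 20, "endp": 20}

chosen = best_short_set(counts, slots=2)
print("short codes:", sorted(chosen))
print("bytes fetched:", bytes_fetched(counts, chosen))
print("with startp/endp short:", bytes_fetched(counts, {"startp", "endp"}))
```

With these invented counts the frequency-driven assignment fetches 1480
bytes against 2840 when startp and endp hold the short codes. Real
counts would of course have to come from measured programs.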


Roger Shepherd, INMOS Ltd   JANET:    roger@uk.co.inmos 
1000 Aztec West             UUCP:     ukc!inmos!roger or uunet!inmos-c!roger
Almondsbury                 INTERNET: roger@inmos.com
+44 454 616616              ROW:      roger@inmos.com OR roger@inmos.co.uk