[net.micro.68k] EA orthogonality

doug@terak.UUCP (Doug Pardee) (05/14/85)

> ... looking at over 7,000 instructions generated by
> Pascal and C compilers on the NS32032 we have found very little use
> of any memory to memory operations not already available on the M68000
> family. In fact, the National code (with half the registers of the M68000)
> only uses them about 2 percent of the time.

Jeepers, is the architecture of the 68xxx series being dictated entirely
by what instructions compilers use?  Why should a compiler care if the
instruction set is orthogonal or not?  The people who care are the
assembly language programmers.

Okay, what are the criteria we need to consider?

 1) Ease of programming in assembler -- the more flexible, the better.
 2) Size of object code -- it depends on whether the instruction words
    for the old register-type operations expand.  If not, then the
    memory-to-memory mode could only shrink the object code.
 3) Speed of execution -- this is a more complex issue.  Details follow.

The big advantage to having memory-to-memory instructions is that the
address of the destination is only computed once even though the
destination is also an operand.  If the address is quite complex this
means that a 4-instruction sequence (LEA,MOVE,operation,MOVE) can be
compressed into one instruction.  It also bypasses using two registers
(one for the address, one for the computation).
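As a sketch (hypothetical 68000-style code; the addressing mode, labels,
and register choices are my own), the 4-instruction sequence for adding a
memory word _b into a word at a complex address might look like:

        LEA     8(A1,D2.L),A0     compute the complex destination address once
        MOVE.W  (A0),D0           fetch the destination operand
        ADD.W   _b,D0             the operation, source operand from memory
        MOVE.W  D0,(A0)           store the result via the saved address

A memory-to-memory ADD.W _b,8(A1,D2.L) (not available on the 68000) would
do the same job in one instruction, computing the destination address only
once internally and leaving both A0 and D0 free.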

When we're considering speed of execution, we are almost totally
concerned with instructions which are located inside of loops.  Inside
of a loop, it might well be that the operations are done on values in
registers, with the operands loaded prior to entering the loop and
stored after exiting the loop, in which case the memory-to-memory
instructions wouldn't be helpful.
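For instance (hypothetical 68000-style code; the labels are my own), a
loop summing 100 words keeps all the work in registers:

        MOVEQ   #0,D0             accumulator, cleared before the loop
        LEA     _array,A0
        MOVE.W  #99,D1            loop count minus one, for DBRA
loop    ADD.W   (A0)+,D0          everything inside the loop is register-bound
        DBRA    D1,loop
        MOVE.W  D0,_sum           stored once, after the loop exits

Here a memory-to-memory ADD would buy nothing, since the destination
already lives in D0.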

If the performance of the CPU on the old register-type operations is
degraded by checking for memory-to-memory operations, then there would
be serious doubt about their value.  The NS32016 requires 4 clock cycles
to figure out that it's doing a register-to-register operation for most
instructions (the really important ones like MOV and ADD have been
special-cased to remove that overhead).

A less obvious degradation occurs if the co-processors also support the
memory-to-memory operations.  The NS32081 FPU is a "slave" processor,
not a co-processor like the MC68881.  The difference is that when the
main CPU requests a floating-point operation, the National CPU has to
stop and wait for the operation to be done, because the result might
have to be stored back into memory.  The Motorola CPU does not have to
wait until it needs the FPU again, because the result will simply be
left in an FPU register.  (At least, that's the way I understand it).

In summary, memory-to-memory is fine, but I wouldn't tolerate very much
in the way of penalties to have it.
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
               ^^^^^--- soon to be CalComp

brooks@lll-crg.ARPA (Eugene D. Brooks III) (06/09/85)

> And furthermore, the orthogonal sequence is normally atomic;
> in an OS kernel the non-orthogonal sequence might easily have to
> be protected by a "disable/enable interrupt" sequence around it,
> or "test-and-set" or some such in a multi-processor system 
> (e.g., "a" and "b" might be global vars).
> Multi-process user-programs would need "enter/exit monitor" or
> "block-on-semaphore" sequences.  Besides being a pain (sometimes
> a royal pain) this has the potential for eating a lot of CPU time.
> -- 
Considerations for multiprocessing are one of the strongest arguments
in favor of a load/store type of instruction set.  The fundamental problem
to be overcome in a multiprocessor is memory latency.  You increase efficiency
in an environment with high memory latency by using a load/store type of
instruction set in conjunction with a processor composed of pipelined functional
units and careful instruction ordering.  For example:

a += b;

load r0,_a
load r1,_b
add r0,r1
store r0,_a

The performance gain shows up when there is more work to do.  For example:

a += b;
c += d;

load r0,_a
load r1,_b
load r2,_c
load r3,_d
add r0,r1
add r2,r3
store r0,_a
store r2,_c

The loads overlap their latencies, resulting in higher performance than is
possible with the sequence

add _a,_b
add _c,_d
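
To put rough numbers on that (my own assumptions: a 4-cycle memory
latency, one new memory access started per cycle, and the operand fetches
of a single memory-to-memory instruction done one after the other):

    load/store:        4 loads overlapped + 2 adds + 2 stores   ~ 10 cycles
    memory-to-memory:  (fetch + fetch + add + store) x 2        ~ 20 cycles

In the load/store version all four loads are in flight at once, so the
last operand is back by about cycle 8 and both results are stored by
about cycle 10.  The memory-to-memory version must wait out the full
latency of _a before fetching _b, and of _b before it can add and store,
so each add costs roughly 10 cycles and the pair roughly 20.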