doug@terak.UUCP (Doug Pardee) (05/14/85)
> ... looking at over 7,000 instructions generated by
> Pascal and C compilers on the NS32032 we have found very little use
> of any memory to memory operations not already available on the M68000
> family. In fact, the National code (with half the registers of the M68000)
> only uses them about 2 percent of the time.

Jeepers, is the architecture of the 68xxx series being dictated entirely
by what instructions compilers use?  Why should a compiler care if the
instruction set is orthogonal or not?  The people who care are the
assembly language programmers.

Okay, what are the criteria we need to consider?

1) Ease of programming in assembler -- the more flexible, the better.

2) Size of object code -- it depends on whether the instruction words
   for the old register-type operations expand.  If not, then the
   memory-to-memory mode could only shrink the object code.

3) Speed of execution -- this is a more complex issue.  Details follow.

The big advantage to having memory-to-memory instructions is that the
address of the destination is only computed once even though the
destination is also an operand.  If the address is quite complex, this
means that a 4-instruction sequence (LEA, MOVE, operation, MOVE) can be
compressed into one instruction.  It also bypasses using two registers
(one for the address, one for the computation).

When we're considering speed of execution, we are almost totally
concerned with instructions which are located inside of loops.  Inside
of a loop, it might well be that the operations are done on values in
registers, with the operands loaded prior to entering the loop and
stored after exiting the loop, in which case the memory-to-memory
instructions wouldn't be helpful.

If the performance of the CPU on the old register-type operations is
degraded by checking for memory-to-memory operations, then there would
be serious doubt about their value.
The NS32016 requires 4 clock cycles to figure out that it's doing a
register-to-register operation for most instructions (the really
important ones like MOV and ADD have been suboptimized to remove that
overhead).

A less obvious degradation occurs if the co-processors also support the
memory-to-memory operations.  The NS32081 FPU is a "slave" processor,
not a co-processor like the MC68881.  The difference is that when the
main CPU requests a floating-point operation, the National CPU has to
stop and wait for the operation to be done, because the result might
have to be stored back into memory.  The Motorola CPU does not have to
wait until it needs the FPU again, because the result will simply be
left in an FPU register.  (At least, that's the way I understand it).

In summary, memory-to-memory is fine, but I wouldn't tolerate very much
in the way of penalties to have it.
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
                                                          ^^^^^--- soon to be CalComp
brooks@lll-crg.ARPA (Eugene D. Brooks III) (06/09/85)
> And furthermore, the orthogonal sequence is normally atomic;
> in an OS kernel the non-orthogonal sequence might easily have to
> be protected by a "disable/enable interrupt" sequence around it,
> or "test-and-set" or some such in a multi-processor system
> (e.g., "a" and "b" might be global vars).
> Multi-process user-programs would need "enter/exit monitor" or
> "block-on-semaphore" sequences.  Besides being a pain (sometimes
> a royal pain) this has the potential for eating a lot of CPU time.

Considerations for multiprocessing are one of the strongest arguments in
favor of a load/store type of instruction set.  The fundamental problem
to be overcome in a multiprocessor is memory latency.  You increase
efficiency in an environment with high memory latency by using a
load/store type of instruction set in conjunction with a processor
composed of pipelined functional units and careful instruction ordering.
For example:

	a += b;

	load	r0,_a
	load	r1,_b
	add	r0,r1
	store	r0,_a

The performance gain is achieved when there is more work to do.  For
example:

	a += b;
	c += d;

	load	r0,_a
	load	r1,_b
	load	r2,_c
	load	r3,_d
	add	r0,r1
	add	r2,r3
	store	r0,_a
	store	r2,_c

The loads overlap their latencies, resulting in higher performance than
is possible with the sequence

	add	_a,_b
	add	_c,_d