[comp.os.misc] Complex Instructions

schow@bnr-public.uucp (Stanley Chow) (04/27/89)

Since this discussion of complex instructions is venturing outside of
architectural issues and getting into language/OS issues, I have cross-posted
to what I think are appropriate groups. Hopefully, we can get some other
viewpoints on this issue. Please limit followups to the appropriate group(s).


In article <25384@amdcad.AMD.COM> prem@crackle.amd.com (Prem Sobel) writes:
>Many years ago, there was a machine called the Interdata Model 70 which
>had instructions for atomically adding or removing items from circular
>double ended queues. The data structure was defined reasonable effeciently
>and the machine was microcoded.
>
>Yet no one, no compiler seriously used these instructions. The reason was,
>amazingly, that individual instructions were faster!!! I never looked at
>the microcode, so I cannot comment why that was.

Strangely enough, we have a proprietary machine that has micro-coded
instructions for much the same functions. The queueing instructions happen
to be at the top of the usage list.
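
For readers who have not met queue instructions of this kind, here is a rough
sketch in Python of a fixed-size circular double-ended queue. It is only an
illustration: the class and method names are made up, the layout is not that of
the Interdata or of our machine, and the lock merely stands in for the fact
that a micro-coded instruction runs to completion without interruption.

```python
import threading


class CircularDeque:
    """Fixed-capacity double-ended queue stored in a circular buffer."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._head = 0                 # index of the current front item
        self._count = 0                # number of items currently queued
        self._lock = threading.Lock()  # stand-in for instruction atomicity

    def add_back(self, item):
        with self._lock:
            if self._count == len(self._slots):
                return False           # queue full; caller decides what to do
            tail = (self._head + self._count) % len(self._slots)
            self._slots[tail] = item
            self._count += 1
            return True

    def add_front(self, item):
        with self._lock:
            if self._count == len(self._slots):
                return False
            self._head = (self._head - 1) % len(self._slots)
            self._slots[self._head] = item
            self._count += 1
            return True

    def remove_front(self):
        with self._lock:
            if self._count == 0:
                return None            # queue empty
            item = self._slots[self._head]
            self._slots[self._head] = None
            self._head = (self._head + 1) % len(self._slots)
            self._count -= 1
            return item

    def remove_back(self):
        with self._lock:
            if self._count == 0:
                return None
            tail = (self._head + self._count - 1) % len(self._slots)
            item = self._slots[tail]
            self._slots[tail] = None
            self._count -= 1
            return item
```

Each method does an index calculation, a bounds check, one memory update and a
counter update. That is the kind of short, fixed sequence that can plausibly be
folded into a single instruction.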

Even more amazing, micro-coding of frequently used instruction sequences
essentially doubled performance. Since I wrote much of the micro-code (and
did much of the analysis to begin with), I can state that the main reasons are:
  - reduced program bandwidth
  - better pipelining of program and data access
  - better parallelism for using the hardware units.

All this is done with a peephole optimizer! And *all* the instructions 
fitted into 4K by 40 bits of micro-code!
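
The details of the peephole optimizer are not given here. One plausible
reading is that it scans the generated code for known sequences of simple
instructions and replaces each with the single micro-coded instruction. A
minimal sketch of that idea in Python follows; the opcode names and the fusion
table are hypothetical, and operands are ignored to keep it short.

```python
# Hypothetical fusion table: a sequence of simple opcodes on the left is
# replaced by the single micro-coded opcode on the right.
FUSIONS = {
    ("LOAD_HEAD", "LOAD_NEXT", "STORE_HEAD"): "DEQUEUE",
    ("LOAD_TAIL", "STORE_NEXT", "STORE_TAIL"): "ENQUEUE",
}


def peephole(opcodes, fusions=FUSIONS):
    """Return a new opcode list with known sequences fused, longest first."""
    lengths = sorted({len(pattern) for pattern in fusions}, reverse=True)
    out = []
    i = 0
    while i < len(opcodes):
        for n in lengths:
            window = tuple(opcodes[i:i + n])
            if window in fusions:
                out.append(fusions[window])  # emit the fused instruction
                i += n
                break
        else:
            out.append(opcodes[i])           # no pattern starts here
            i += 1
    return out
```

For example, peephole(["ADD", "LOAD_HEAD", "LOAD_NEXT", "STORE_HEAD", "SUB"])
gives ["ADD", "DEQUEUE", "SUB"]: three instruction fetches become one, which
is where the reduced program bandwidth comes from.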


You think the VAX procedure calling instructions are big? We have special
instructions for swapping processes in and out! The code for swapping processes
is something like:
     
     SaveRegisters();                   ; one instruction;
                                        ; the old process is implicit
     RestoreRegisters(new_process);     ; another instruction

These instructions manipulate the hardware registers and firmware registers,
help the scheduler with its software bookkeeping, calculate the CPU time spent
in the current process, save/restore the runtime stacks, and do some other
things that I cannot remember offhand.
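
To make the division of labour concrete, here is a toy model in Python of what
such a pair of instructions might do. Everything in it is an assumption made
for illustration (the Cpu and Process classes, the tick counter, the field
names). It covers only the register save/restore and the CPU-time accounting,
not the stacks or the scheduler support.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    name: str
    saved_registers: dict = field(default_factory=dict)
    cpu_ticks: int = 0           # accumulated CPU time, in clock ticks


class Cpu:
    def __init__(self):
        self.registers = {}
        self.current = None      # the running process is implicit state
        self.clock = 0           # free-running tick counter
        self.dispatched_at = 0   # clock value when `current` was dispatched

    def save_registers(self):
        """Save the implicit current process and charge it for its CPU time."""
        proc = self.current
        proc.saved_registers = dict(self.registers)
        proc.cpu_ticks += self.clock - self.dispatched_at
        self.current = None
        return proc

    def restore_registers(self, new_process):
        """Load new_process's saved state and start timing it."""
        self.registers = dict(new_process.saved_registers)
        self.current = new_process
        self.dispatched_at = self.clock
```

A swap is then cpu.save_registers() followed by cpu.restore_registers(next),
mirroring the two-instruction sequence above. In software each of those steps
costs instruction fetches; in micro-code only the data transfers remain.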

The end result is that process swapping happens at data memory bandwidth. We
looked at the options and concluded that, even with absolutely no program store
wait-states, it is impossible for any software (compiled or hand-tuned) to
even come close to this performance.



Note that this is on a machine designed for micro-coding in the early 70's
so the comparisons may not be valid for current machines. Considering that it
uses only MSI TTL on 4-layer boards, we get very good throughput. [We have
already come out with a 68K based replacement and are working on more fun
stuff, more about that next decade.]

A word of caution for people who want to look into micro-coding: get control
of your operating system and compiler before you try it. There is no point in
micro-coding instructions for your application unless you can make the OS and
the compiler like it. (I managed to introduce new syntax into the language and
changed whole chunks of the OS to support some of the fancy micro-code.)



Stanley Chow    ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
                (613) 763-2831

Disclaimer: Since I am only talking about an old system, and all the information
            has already been published in one form or another, I don't think my
            employer minds me talking about it. That does not mean I represent
            anyone.
--
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request