[comp.arch] What should be in hardware but isn't

aglew@ccvaxa.UUCP (09/23/87)

| cik@l.cc.purdue.edu (Herman Rubin)
| There are many instructions which are easy to implement in hardware, but
| for which software implementation may even be so costly that a procedure
| using the instruction may be worthless.  Some of these instructions have
| been implemented in the past and have died because the ill-designed
| languages do not even recognize their existence.  Others have not been
| included due to the non-recognition of them by the so-called experts and
| by the stupid attitude that something should not be implemented unless
| 99.99% of the users of the machine should be able to want the instruction
| _now_.  As you can tell from this article, I consider the present CISC
| computers to be RISCy.

Well, I'm definitely a member of the RISC camp, but I do think that Herman
has raised two good points:
	(1) there are instructions that are not in the 99% usage group
	    that can still be more efficiently implemented in hardware
	    than software.
	(2) languages should provide better (more efficient) ways of
	    interfacing to machine features, without the expense of
	    wrapping them in procedure call overhead.

(1) Remember, frequency of use is not the fundamental criterion for inclusion
    of an instruction. It's more like (number of times the operation is
    required) * (time of a software implementation)/(time of the hardware
    implementation) / (slowdown of other instructions...)
	With (customers who simply want a benchmark that can make good use
    of that particular instruction) factored in.
	Of course, these are highly nonlinear functions, and frequency of
    use is often a good approximation, other things being equal.

       	But other things are not always equal. "Unusual" instructions do not
    always mean microcode; if they can be combinatorially implemented,
    especially without state machines, then there may be no slowdown apart
    from wiring effects. And they may be extremely slow to do in software.
	One of my favorite examples of this is the bit-reversed indexing that
    is so convenient for FFT applications. Bit reversing an 8 or 16 bit
    address isn't too bad, but some applications are getting into the 32 bit
    range, and may be passing that soon. Even with table lookup in decent
    sized tables, bit reversal is expensive in software - and yet it can
    be easily done in hardware.
	If FFTs are a primary application for your system it may be
    worthwhile looking at bit reversal or carry-reversed addition.
    Even if they aren't, the instruction may be so cheap to implement
    that it may be worth including - because it may turn out to be
    important in the future.

	Of course, there is a designer's trap here: each additional instruction
    may appear cheap, but the cost of a horde of extra instructions may
    exceed the sum of their individual costs, because you have crossed a
    hard limit of your implementation, like chip size.

	I try to avoid instructions that require sequencing, but occasionally
    amuse myself by thinking of purely combinatorial operations that might
    be useful, and cheap to implement. Like bit reversal, or counting
    the number of set bits in a word, or branching (ooops) based
    on the uppermost three bits. Unfortunately, there aren't too many.

(2) The second point, about interfacing to high level languages, is more
    important. Say that I do have a code that needs to count the number
    of set bits in a bitstring, and it already uses a POP function.
    Well, I can simply replace the POP function by my POP instruction,
    can't I? Unfortunately, most $%^^#@!!! languages do not let you;
    you either have to wrap the POP instruction in a function call,
    which loses most of its benefit, or you have to use asms after
    putting stuff into register variables that you *KNOW* the compiler
    maps to R7 and R6... Not good either way.
	It would be nice if the compiler could be made to know about every
    instruction in the machine, even though it didn't generate code
    for them; it would be nice if you could say
		count = asm_pop(bitreg)
    and the compiler could say
	"Oh, he wants to use the pop instruction that this strange machine
	provides. I don't know how to use it myself, but I know how
	to arrange for the programmer to use it. Let's see, it takes
	an input in a register and puts an output into a register.
	He wants pop of bitreg - well, that's in memory, so it needs
	to be fetched into a register. There's no register free, so I'll
	have to save R7. Now, he wants the result in count. Oh, that's
	already in a register R3. Well, I just have to unsave R7, and
	now I've got
		save R7
		R7 = bitreg
		pop R7 -> count
		unsave R7
	Well, that's a nice bit of code that I wouldn't have been able
	to produce automatically, but at least I was able to arrange
	the register use for my programmer"

	I've just received a UNIX PC in the fire sale, and I'm told that
    the inline assembler functions can be made to do something like
    this, so maybe the world is slowly getting better.

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa

I always felt that disclaimers were silly and affected, but there are people
who let themselves be affected by silly things, so: my opinions are my own,
and not the opinions of my employer, or any other organisation with which I am
affiliated. I indicate my employer only so that other people may account for
any possible bias I may have towards my employer's products or systems.

lamaster@pioneer.arpa (Hugh LaMaster) (09/29/87)

In article <28200048@ccvaxa> aglew@ccvaxa.UUCP writes:

>
>(2) The second point, about interfacing to high level languages, is more
>    important. Say that I do have a code that needs to count the number
>    of set bits in a bitstring, and it already uses a POP function.
>    Well, I can simply replace the POP function by my POP instruction,
>    can't I? Unfortunately, most $%^^#@!!! languages do not let you;
>    you either have to wrap the POP instruction in a function call,
>    which loses most of it's benefit, or you have to use asms after
>    putting stuff into register variables that you *KNOW* the compiler
>    maps to R7 and R6... Not good either way.
>	It would be nice if the compiler could be made to know about every
>    instruction in the machine, even though it didn't generate code
>    for them; it would be nice if you could say

As described, it is semi-machine-independent.  This is hard to do.  But there
have been compilers that let you insert instructions in a machine-dependent
way.  This is a nice compromise with assembly language code, because most of
the time you can let the compiler do the work.  The Cyber 205 has a set of
"Q8" calls, so called because they look like, say, Q8MERGE to use a "MERGE"
instruction.  The compiler uses program names to generate the addresses:
e.g.  CALL MERGE(A,B,C)  so you don't have to do anything weird to get the
address.  It is a nice feature, as long as it doesn't encourage the compiler
writer to get lazy.  ("Well, I won't bother to recognize that special case and
generate the "MERGE" instruction, because if someone really needs it they can
call it directly.")  I wish more compilers had this feature; it eliminates the
need for a lot of assembly language coding.  (Or speeds up the process,
depending on how you look at it.)

  Hugh LaMaster, m/s 233-9,  UUCP {topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

(Disclaimer: "All opinions solely the author's responsibility")

tim@amdcad.AMD.COM (Tim Olson) (10/02/87)

In article <340@oracle.UUCP> bradbury@oracle.UUCP (Robert Bradbury) writes:
| So from my experience the cost of mapping CISC functions into CISC instructions
| can be quite a large part of the code generator of a compiler.  Do the RISC
| people have any measures of how much work goes into a RISC code generator
| for things like DIV/MUL, STRCPY/MEMCPY or BRANCH scheduling?  (Some of the code
| published for the AMD 29000 indicates these aren't afternoon efforts :-).)

In our "development" C compiler, div/mul, strcpy/memcpy are simply calls
to the runtime routines to perform these functions, so there was no cost
in the code generator for these.  I didn't write the code generator, but
the delayed-branch scheduling code in the optimizer is very small.


| Have we gotten to the point where we can estimate the hardware development
| costs of branch destination caching vs. the software development costs
| of branch scheduling and trade them off against each other?

The two aren't mutually exclusive (the Am29000 implements both). 
Delayed-branches allow execution of instructions following the branch
which are already in the pipeline, while the Branch Target Cache reduces
or eliminates the latency involved in starting a new instruction stream.

Perhaps you mean the tradeoff between delayed-branches and branch
prediction?

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)

turner@uicsrd.csrd.uiuc.edu (10/02/87)

> Written  9:22 pm  Sep 28, 1987 by stachour@umn-cs.UUCP in comp.arch
> 
> I've NEVER seen anyone design compilers for a machine that is only
> being similated, and chose the architecture of the hardware based
> on measurement, and build the machine later. (Well, one exception,
> Multics many years ago, but that design set goals seldom met now.)
> 
> Paul Stachour
> Honeywell SCTC (Stachour@HI-Multics)
> UMinn. Computer Science (stachour at umn-cs.edu)

Here at CSRD we have been designing compilers for Cedar since BEFORE
the machine was ever simulated.  I think compiler design is clearly
important enough to be done concurrently with architecture
evaluation.  Does anyone feel differently?
---------------------------------------------------------------------------
 Steve Turner (on the Si prairie  - UIUC CSRD)

UUCP:	 {ihnp4,seismo,pur-ee,convex}!uiucdcs!uicsrd!turner
ARPANET: turner%uicsrd@a.cs.uiuc.edu           
CSNET:	 turner%uicsrd@uiuc.csnet            *-))    Mutants for
BITNET:	 turner@uicsrd.csrd.uiuc.edu                Nuclear Power  (-%

aglew@ccvaxa.UUCP (10/07/87)

>I believe the current AT&T C compilers (Release 3) allow "Enhanced Assembly
>Language Escapes for C" which include a fairly sophisticated mechanism
>for defining asm() pseudo-macros which look like C functions.  Does anyone
>know how much this "cost" in compiler effort?  
> 
>Robert Bradbury

I'd heard about this, and had been hoping to play with it when I received
my AT&T 3B1 UNIX PC, running 3.5.1, which I thought was up to R3.

But, if this compiler has these pseudo-functions, they're well hidden.

Does anyone know if they exist on the 3B1?

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa


howard@cpocd2.UUCP (Howard A. Landman) (10/16/87)

>> Written  9:22 pm  Sep 28, 1987 by stachour@umn-cs.UUCP in comp.arch
>> I've NEVER seen anyone design compilers for a machine that is only
>> being similated, and chose the architecture of the hardware based
>> on measurement, and build the machine later. (Well, one exception,
>> Multics many years ago, but that design set goals seldom met now.)

In article <43700027@uicsrd> turner@uicsrd.csrd.uiuc.edu writes:
>Here at CSRD we have been designing compilers for Cedar since BEFORE
>the machine was ever simulated.  I think that compiler design is
>obviously important enough to be done concurrently with architecture
>evaluation, anyone else feel differently??

No, I agree.  When we were designing the RISC I at Berkeley a while back,
we had a compiler and an instruction-level simulator long before the design
was finalized.  We also could run lower level simulations (eventually, right down
to switch simulation of the transistor netlist extracted from the chip layout)
at the same time and compare them with the high-level simulation.  This was
used to test out the design as it progressed.  Finally, the coding for the
control PLA of the chip was generated automatically from the instruction-set
level description used for high level simulation.  Thus it was not only
possible, but routine, for us to change the instruction set, "push a button",
and have a new control PLA generated, dropped into the chip layout, the new
chip layout extracted, 3 levels of simulation run and results compared, all
in under 24 hours with no human intervention.  This was done about 30 times
in the last 45 days of the design process.  The final "tweak" to the
architecture was done 1 day before we sent out the database for maskmaking!

The switch level simulator was never run for more than about 100 instruction
cycles, because of CPU limitations, but the high-level simulator had many
complete programs run on it.

Only one minor functional bug, affecting the condition codes after a certain
instruction, got past all this into the final chip.  Rather than declare it
a "feature", we modified the assembler to insert an additional instruction
where needed to assure that the condition code was correct.  This had an
insignificant effect on the performance of the machine.

While the design wasn't based on measurements taken from the compiler before
design began, it *was* based on many, many measurements of the performance
of VAXes, and was modified as compiler results became available.

-- 
	Howard A. Landman
	{oliveb,hplabs}!intelca!mipos3!cpocd2!howard	<- works
	howard%cpocd2%sc.intel.com@RELAY.CS.NET		<- recently flaky
	"Unpick a ninny - recall Mecham"