[comp.arch] RISC is NOT micro-code

schow@bnr-public.uucp (Stanley Chow) (07/16/89)

In article <12233@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>
>Let's assume that the RISC and CISC machines under consideration are
>identical except for instruction encoding; the CISC being microcoded and
>each RISC instruction corresponding to one micro instruction.

This has been stated many times in the RISC vs CISC debates. Usually,
the argument is that RISC opcodes running from the I-cache are the
equivalent of micro-code in a CISC machine.

This argument is wrong for the following reasons:

  1) CISC machines do not have to be micro-coded. It just happens that
     a majority of the CISC CPU's are micro-coded. Please remember that
     VAX is just one example of CISC. There are many other examples of
     CISC that are totally unlike VAX. If you are arguing against the
     VAX, then say so. Arguing against CISC means you think the CISC
     concept has some *inherent* flaw that you are exposing.

  2) The typical RISC instruction is far from equal to a typical
     micro-code instruction. Almost any interesting micro-coded CPU
     will have a micro-word that is much wider than 32 bits, with many
     parallel operations going on. In fact, the whole premise of RISC
     guarantees that a RISC instruction will be *less* than a micro-word.

  3) Even if a sequence of RISC instructions correspond to a sequence
     of micro-code, the effects are still not the same. Usually, a
     micro-coded instruction is atomic (or can be easily made to be).
     Most RISC CPU's require very special fudging to achieve
     fine-grain atomicity.


>
>CISC compilers tend to pack as many operations together in single
>instructions as they possibly can, but this implies a certain amount of
>redundancy because the complex instructions represent general case sequences
>of micro instructions.  Initially, there will be several RISC instructions
>for each of the CISC instructions, perhaps 3:1.  However, many of them will
>be optimized away.
>

  4) RISC has more opportunity for optimizations (I will leave open
     the question of whether RISC *requires* optimization). This means
     the compiler is much harder to get right. Please note that I am
     not saying the current RISC compilers don't work. I am merely
     saying that if I have an obscure bug in my program, I would be
     more inclined to suspect a RISC compiler than a dumb CISC compiler.


>This is because the RISC compiler can recognize and remove redundancies in
>the CISC micro instruction sequences.  Generally, RISC implies more available
>expressions, which makes value propagation and common subexpression
>elimination more productive.  Taking arbitrary basic blocks of RISC code and
>doing rather simple analysis, we have often found that only about 1/4 of the
>operations remain after optimization...  this is a very unlikely scenario
>for CISC operations.

The question is, what makes it possible to throw away the 3/4? The possible
candidates:
  - Data-flow analysis
  - other optimizations
  - more registers
  - RISC instruction set

It is not clear to me that the RISC instruction set should get all the
credit just because the compiler is a RISC compiler.



Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831		     ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
Me? Represent other people? Don't make them laugh so hard.

hankd@pur-ee.UUCP (Hank Dietz) (07/17/89)

In article <753@bnr-fos.UUCP> schow%BNR.CA.bitnet@relay.cs.net (Stanley Chow) writes:
>In article <12233@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>>
>>Let's assume that the RISC and CISC machines under consideration are
        ^^^^^^
>>identical except for instruction encoding; the CISC being microcoded and
>>each RISC instruction corresponding to one micro instruction.
>
>This has been stated many times in the RISC vs CISC debates. Usually,
>the argument is that RISC opcodes running in I-Cache is the equivalent
>of micro-code in a CISC machine. 
>
>This argument is wrong for the following reasons:
....

You missed my point.  I wasn't saying that people should or do design RISC
and CISC machines this way, but rather that this is a convenient and
reasonable assumption so that a more formal analysis could be made of the
possibility of RISC:CISC instruction counts being less than 1:1.

>  4) RISC has more opportunity for optimizations (I will leave open
>     the question of whether RISC *requires* optimization). This means
>     the compiler is much harder to get right. Please note that I am
>     not saying the current RISC compilers don't work. I am merely
>     saying that if I have an obscure bug in my program, I would be
>     more inclined to suspect a RISC compiler than a dumb CISC compiler.

Not me...  I'd be more inclined to suspect a bug in the NEWER compiler, be it
RISC or CISC.  However, if it's an optimizing CISC compiler, I'll bet it has
at least a few bugs even if it has been out for years.  Perhaps I'm too much
into the field of optimizing compilers, but from personal experience I have
seen *decent* CISC compilers nearly always be harder to build than those
performing similar optimizations for RISC...  it is just that the average
RISC compiler is expected to do more optimization than the average CISC
compiler.

>>elimination more productive.  Taking arbitrary basic blocks of RISC code and
>>doing rather simple analysis, we have often found that only about 1/4 of the
>>operations remain after optimization...  this is a very unlikely scenario
>>for CISC operations.
>
>The question is, what makes it possible to throw away the 3/4? The possible
>candidates:
>  - Data-flow analysis
>  - other optimizations
>  - more registers
>  - RISC instruction set
>
>It is not clear to me that the RISC instruction set should get all the
>credit just because the compiler is a RISC compiler.

The RISC instruction sets tend to naturally require intermediate results to
be placed in registers and this naturally yields more available expressions.
I'm not quoting gospel here, but the little compilers we've been doing often
manage to get rid of about 3/4 of the ops just doing things straight from
Aho & Ullman's "Principles of Compiler Design":

[1]	Value propagation (get values from where cheapest, remember what
	values appear where and which quantities are equivalent)
[2]	Constant folding (also uses value prop.:  A=5; B=A*A; would get
	coded as if it were A=5; B=25;)
[3]	Common subexpression elimination
[4]	Dead code elimination (get rid of dead stores, etc.)
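To make [1]-[4] concrete, here is a hypothetical toy pass (modern Python, not from any of the compilers discussed; the three-address IR format and the variable names are invented for illustration) that does value propagation, constant folding, and dead-store elimination over a single basic block:

```python
# Toy optimizer sketch: value propagation, constant folding, and dead-store
# elimination over one basic block of three-address code.
# The IR format ('dst', 'op', args) is invented for illustration.

def optimize(block):
    # Pass 1: value propagation + constant folding.
    env = {}          # variable -> known constant value
    folded = []
    for dst, op, *args in block:
        # Replace argument variables whose values are known constants.
        vals = [env.get(a, a) for a in args]
        if op == 'const':
            env[dst] = vals[0]
            folded.append((dst, 'const', vals[0]))
        elif op in ('add', 'mul') and all(isinstance(v, int) for v in vals):
            # Both operands known: fold at compile time.
            result = vals[0] + vals[1] if op == 'add' else vals[0] * vals[1]
            env[dst] = result
            folded.append((dst, 'const', result))
        else:
            env.pop(dst, None)   # dst is no longer a known constant
            folded.append((dst, op, *vals))
    # Pass 2: dead-store elimination (backward scan; 'ret' operands are live).
    live = set()
    out = []
    for dst, op, *args in reversed(folded):
        if op == 'ret':
            live.update(a for a in args if isinstance(a, str))
            out.append((dst, op, *args))
        elif dst in live:
            live.discard(dst)
            live.update(a for a in args if isinstance(a, str))
            out.append((dst, op, *args))
        # else: the store is dead; drop it
    return list(reversed(out))

# A=5; B=A*A; C=A+B; D=A+1 (never used); return C
block = [
    ('A', 'const', 5),
    ('B', 'mul', 'A', 'A'),
    ('C', 'add', 'A', 'B'),
    ('D', 'add', 'A', 1),
    (None, 'ret', 'C'),
]
```

On this input, four of the five operations disappear (the result is just a return of the constant 30), in the spirit of the "only about 1/4 remain" figure above.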

By the way, my best guess is that a comparable CISC compiler would get rid of
perhaps 1/3 of the instructions.  But that hardly matters.  My point relies
only on the fact that if an entire CISC instruction can be eliminated, then
so can the sequence of RISC instructions which would implement it; whereas a
RISC instruction might be eliminated but the CISC instruction which would
incorporate it cannot be eliminated (because other parts of the CISC
instruction are needed).

I don't want to start another war, but we *never* run out of registers, so
that has no impact.

The prime point of my posting is that it is quite possible that, for
"equivalent" RISC and CISC machines (using the definition I gave in the
first sentence of my posting), the RISC:CISC instruction count ratio could
be 1:1. Lower ratios would indicate that the CISC compiler isn't doing
analysis which the RISC compiler is -- somebody ought to go fix the CISC
compiler.

							-hankd

henry@utzoo.uucp (Henry Spencer) (07/17/89)

In article <753@bnr-fos.UUCP> schow%BNR.CA.bitnet@relay.cs.net (Stanley Chow) writes:
>  4) RISC has more opportunity for optimizations ... This means
>     the compiler is much harder to get right...

No, it means it is harder to get optimal.  There's a difference between
generating suboptimal code and generating wrong code.  Truly well-built
compilers (obviously the semantics of "well-built" here are somewhat a
matter of opinion) do much optimization by applying transformations that
are *hoped* to improve the code but are *known* to preserve semantics.
Such compilers may not always generate the fastest code but will seldom
generate wrong code.
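A minimal illustration of such a transformation (a Python sketch added here for concreteness, not Henry's own example): strength-reducing a multiply by a power of two into a shift is *hoped* to be faster but *known* to compute the same value, so applying it can never make the code wrong.

```python
# A transformation *hoped* to improve the code but *known* to preserve
# semantics: strength-reduce multiplication by a positive power of two
# into a left shift. (Toy IR triple; the names are invented.)

def strength_reduce(op, a, b):
    if op == 'mul' and b > 0 and b & (b - 1) == 0:
        return ('shl', a, b.bit_length() - 1)   # x * 2**k  ->  x << k
    return (op, a, b)                           # no improvement: leave as-is

def evaluate(op, a, b):
    return {'mul': a * b, 'shl': a << b}[op]

# The transformed code always computes the same value as the original.
for a in range(-8, 9):
    assert evaluate(*strength_reduce('mul', a, 8)) == a * 8
```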

Incidentally, do you really think the greater complexity of CISC
instruction sets makes CISC compilers *easier* to get right?  The
compiler, if nobody else, has to know all the gory details of the
machine if it's really pushing for good code quality.
-- 
$10 million equals 18 PM       |     Henry Spencer at U of Toronto Zoology
(Pentagon-Minutes). -Tom Neff  | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

schow@bnr-public.uucp (Stanley Chow) (07/17/89)

In article <1989Jul17.035238.6497@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <753@bnr-fos.UUCP> schow%BNR.CA.bitnet@relay.cs.net (Stanley Chow) writes:
>>  4) RISC has more opportunity for optimizations ... This means
>>     the compiler is much harder to get right...
>
>No, it means it is harder to get optimal.  There's a difference between
>generating suboptimal code and generating wrong code.  Truly well-built
>compilers (obviously the semantics of "well-built" here are somewhat a
>matter of opinion) do much optimization by applying transformations that
>are *hoped* to improve the code but are *known* to preserve semantics.
>Such compilers may not always generate the fastest code but will seldom
>generate wrong code.
>
>Incidentally, do you really think the greater complexity of CISC
>instruction sets makes CISC compilers *easier* to get right?  The
>compiler, if nobody else, has to know all the gory details of the
>machine if it's really pushing for good code quality.
>-- 

Henry (and other posters) quite correctly pointed out the mistake in my
statement. What I meant to say is:

    a) RISC "needs" more optimization in the compiler (to achieve the "same"
       degree of "optimality")
    b) Therefore, in *minimal* (meaning compilers with barely enough brains
       to get acceptable results) compilers, the RISC compiler will tend to
       have more and fancier optimizations.
    c) Therefore, in minimal compilers, the RISC compiler will tend to be
       less reliable.

Obviously, the "native" complexity of a CPU will have substantial
impact on the complexity and reliability of the compiler. As other posters
have pointed out, CISC does not *necessarily* mean a complex CPU. CISC can
be very regular and simple to use, especially if you don't attempt to optimize
in the compiler.

I am by no means saying that *all* CISC machines lend themselves to a nice
simple reliable compiler. I *am* saying that CISC is just as likely to *allow*
such a compiler. It all depends on the actual complexity of the machine, not
whether it is RISC or CISC.

Incidentally, I agree that truly optimal compilers are much harder for 
complex machines. I will even agree that most of the "classical" machines
are not nice machines for optimizing compilers.

Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831		     ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
Me? Represent other people? Don't make them laugh so hard.

bruce@heather.pooh.com (Bruce Robertson) (07/19/89)

These optimizations have been mentioned:

[1]	Value propagation (get values from where cheapest, remember what
	values appear where and which quantities are equivalent)
[2]	Constant folding (also uses value prop.:  A=5; B=A*A; would get
	coded as if it were A=5; B=25;)
[3]	Common subexpression elimination
[4]	Dead code elimination (get rid of dead stores, etc.)

Shouldn't all these optimizations be done in the machine independent
portion of the compiler?  The question of RISC v. CISC is raised
primarily in the code generator, which in my experience is not the
largest portion of the compiler.  A well designed compiler should have
all the machine dependencies isolated in one relatively small module
(relative to the size of the entire compiler). RISCy things like delay
slot filling are easiest to handle in the assembler, or other post
processor.
--
	Bruce Robertson
	Hundred Acre Software, Reno, NV
	Domain: bruce@pooh.com
	UUCP:   ...!uunet!tahoe.unr.edu!heather!bruce

cjosta@tasu77.UUCP (Jonathan Sweedler) (07/19/89)

In article <756@bnr-fos.UUCP> schow%BNR.CA.bitnet@relay.cs.net (Stanley Chow) writes:
>In article <1989Jul17.035238.6497@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>>In article <753@bnr-fos.UUCP> schow%BNR.CA.bitnet@relay.cs.net (Stanley Chow) writes:
>>>  4) RISC has more opportunity for optimizations ... This means
>>>     the compiler is much harder to get right...
>>
>>No, it means it is harder to get optimal.  There's a difference between
>>generating suboptimal code and generating wrong code.  
>
>Henry (and other posters) quite correctly pointed out the mistake in my
>statment. What I meant to say is:
>

Wait a second.  I wouldn't give in so quickly if I were you.  I'm more
of a hardware guy than a software guy, but I was always under the
impression that RISC machines need more complex compilers, not just
because of the need for more optimization, but because RISC chips are
'dumber' than CISC chips.  i.e. a lot of the complexity of the chip is
moved to the software.  Don't forget that MIPS, one of the early RISC
chips, stands for 'Microprocessor without Interlocked Pipeline Stages.'
This means that interlocks must be handled in the compiler.  So
forgetting about all optimization, just writing a compiler that WORKS
on a RISC chip is by DEFINITION harder than writing a compiler for a
CISC chip.

Correct me if I'm wrong, but I remember looking at some of the initial
specs for the Intel 860, and that thing looked like a nightmare to write
a compiler for.  Forget about optimizations: just getting floating
point results out of the FP pipeline requires that the compiler have
knowledge of the internals of the hardware.  How many CISC chips do
you know that make the same requirements?  (I'm probably going to be
sorry I asked that :-)  But I think you know the point I'm trying to make).

Just so I don't start a new war here, I'm not saying any of this is
bad.  RISC seems to be proving its point.  But I think that a RISC
compiler is NECESSARILY more complex than a CISC compiler, regardless
of the level of optimization performed by the compiler.
Jonathan Sweedler  ===  National Semiconductor Israel
UUCP:    ...!{amdahl,hplabs,decwrl}!nsc!taux01!cjosta
Domain:  cjosta@taux01.nsc.com

slackey@bbn.com (Stan Lackey) (07/19/89)

In article <2203@taux01.UUCP> cjosta@tasu77.UUCP (Jonathan Sweedler) writes:
>impression that RISC machines need more complex compilers, not just
>because of the need for more optimization, but because RISC chips are
>'dumber' than CISC chips...
>This means that interlocks must be handled in the compiler.  So
>forgetting about all optimization, just writing a compiler that WORKS
>on a RISC chip is by DEFINITION harder than writing a compiler for a
>CISC chip.

Not necessarily.  The winner of the 'dumb compiler that works' award
goes to one that just puts nop's anywhere that could have a problem; 
after loads, in branch delay slots...
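That award-winner is only a few lines of code. A toy sketch (Python, with invented mnemonics; it assumes exactly one delay slot after each load and each branch):

```python
# Toy "dumb but correct" cleanup pass: insert a NOP after every load and
# every branch, assuming one load-delay slot and one branch-delay slot.
# The mnemonics are invented for illustration.

LOADS = {'lw', 'lb'}
BRANCHES = {'beq', 'bne', 'j', 'jal'}

def insert_nops(code):
    out = []
    for insn in code:
        out.append(insn)
        op = insn.split()[0]
        if op in LOADS or op in BRANCHES:
            out.append('nop')   # never try to fill the slot; always safe
    return out

prog = ['lw r1, 0(r2)', 'add r3, r1, r1', 'beq r3, r0, done', 'add r4, r4, 1']
```

A smarter pass would try to move an independent instruction into each slot and fall back to a nop only when nothing fits.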

>Correct me if I'm wrong, but I remember looking at some of the initial
>specs for the Intel 860 and that thing looked like a nightmare to write
>a compiler for.  Forget about optimizations, just getting floating
>point results out of the FP pipeline, requires that the compiler have
>knowledge of the internals of the hardware.

This is true, and makes me believe that IEEE FP and RISC are
incompatible.  Both major implementations (860 and 88000 (I'm glad I
can admit knowledge of the 88000 now)) show that this force-fit causes
problems.  By RISC arguments, let sw handle the exceptions.  But IEEE
FP has so many exceptions by its over-complex nature that [not the
compiler really, the OS] absolutely must be able to handle said
hardware internals, and must do it efficiently; underflow happens
more often than you think.
-Stan

henry@utzoo.uucp (Henry Spencer) (07/19/89)

In article <2203@taux01.UUCP> cjosta@tasu77.UUCP (Jonathan Sweedler) writes:
>... I was always under the
>impression that RISC machines need more complex compilers, not just
>because of the need for more optimization, but because RISC chips are
>'dumber' than CISC chips...

As has been pointed out before (sigh), this is just not true.  Most
compilers can make very little use of the extra complexity of CISCs;
simple code generation for a RISC may not exploit the full power of
the machine, but the result is often reasonably fast code.

>... Don't forget that MIPS, one of the early RISC
>chips, stands for 'Microprocessor without Interlocked Pipeline Stages.'
>This means that interlocks must be handled in the compiler...

Things like software pipe interlocks, delayed branches, etc. can almost
always be deferred to a cleanup pass that is very stupid and simple.  (On
some RISC chips, this sort of thing is done in the assembler, not the
compiler at all.)  One of the surprises that came out of RISC work is
that moving some of these things into the software really costs very
little.  You may occasionally miss an optimization opportunity by not
having the earlier passes aware of the final rearrangement, but it's
not really very frequent.

>Correct me if I'm wrong, but I remember looking at some of the initial
>specs for the Intel 860 and that thing looked like a nightmare to write
>a compiler for...

Hey, but look who made it. :-)  I mean, you really expected the source
of the 8086 to produce a machine that was easy to generate code for?
Not everyone has made that big a mess of their RISC designs.

The other side of the coin, of course, is that if you are willing (as
John Mashey puts it) "to commit unspeakable acts", the i860 is REALLY
FAST.  You get what you pay for.
-- 
$10 million equals 18 PM       |     Henry Spencer at U of Toronto Zoology
(Pentagon-Minutes). -Tom Neff  | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

jlg@lanl.gov (Jim Giles) (07/20/89)

From article <BRUCE.89Jul18181114@heather.pooh.com>, by bruce@heather.pooh.com (Bruce Robertson):
> These optimizations have been mentioned:
> [1]	Value propagation (get values from where cheapest, remember what
> 	values appear where and which quantities are equivalent)
> [2]	Constant folding (also uses value prop.:  A=5; B=A*A; would get
> 	coded as if it were A=5; B=25;)
> [3]	Common subexpression elimination
> [4]	Dead code elimination (get rid of dead stores, etc.)

They may have been mentioned, but I didn't mention them.  All the
above are machine independent (except perhaps [1], which relates
to register allocation).  The optimizations I mentioned were:

(A)   Instruction selection (if several different sequences of instructions
      can perform a given action - which is best?)

(B)   Register allocation (store intermediate results in registers if at
      all possible).

(C)   Code ordering (issue the instructions in the order which best uses
      pipelining or other parallelism in the architecture).

Any one of the above is fairly easy to do in isolation.  Any two of the
above are difficult to do because they interfere with each other.  Only
CISC architectures even have to consider (A) since, in a RISC, this
step is _very_ simple.  I claim that CISC compilers must do all three
of the above in order to be competitive.  Hence, my claim that CISC is
harder to compile for than RISC.
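The interference is easy to show for (A) and (B). A hypothetical Python sketch (instruction sequences and cycle costs invented for illustration): the best instruction choice for "a = a + 1" on a CISC depends on whether the register allocator already has the value in a register, which is why the two steps cannot be done in isolation.

```python
# Toy instruction selection for "a = a + 1" on a CISC, showing how the
# best choice depends on register allocation. The sequences and cycle
# costs are invented for illustration.

def select_increment(a_in_register):
    if a_in_register:
        candidates = [
            (['add r1, r1, 1'], 1),   # cheapest once the value is in r1
            (['inc mem[a]'], 4),      # would also leave the r1 copy stale
        ]
    else:
        candidates = [
            (['inc mem[a]'], 4),      # one complex memory instruction
            (['ld r1, mem[a]', 'add r1, r1, 1', 'st r1, mem[a]'], 6),
        ]
    # pick the sequence with the lowest cost
    return min(candidates, key=lambda c: c[1])[0]
```

On a RISC only the load/add/store form exists, so step (A) collapses to a table lookup, as claimed above.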

jlg@lanl.gov (Jim Giles) (07/20/89)

From article <2203@taux01.UUCP>, by cjosta@tasu77.UUCP (Jonathan Sweedler):
> [...]                                      I was always under the
> impression that RISC machines need more complex compilers, not just
> because of the need for more optimization, but because RISC chips are
> 'dumber' than CISC chips.  i.e. a lot of the complexity of the chip is
> moved to the software.

You are confusing complexity with quantity.  RISC compilers usually
(but not always) generate more instructions for a given task than
CISC compilers do.  However, the task of generating code for RISC
is _much_ easier than generating code for CISC.  So, the RISC compiler
is simpler than the CISC compiler even if it does produce more output.


> [...]                   Don't forget that MIPS, one of the early RISC
> chips, stands for 'Microprocessor without Interlocked Pipeline Stages.'
> This means that interlocks must be handled in the compiler.  So
> forgetting about all optimization, just writing a compiler that WORKS
> on a RISC chip is by DEFINITION harder than writing a compiler for a
> CISC chip.

You stated that incorrectly.  Forgetting about all optimization, just
writing a compiler that WORKS on a non-interlocked chip is by DEFINITION
harder than writing a compiler for an interlocked chip.  This doesn't
necessarily have anything to do with RISC vs. CISC.

[...]
> Just so I don't start a new war here, I'm not saying any of this is
> bad.  RISC seems to be proving its point.  But I think that a RISC
> compiler is NECESSARILY more complex than a CISC compiler, regardless
> of the level of optimization performed by the compiler.

The strange fact is: for a given degree of optimization, it is harder
to compile for CISC than RISC.  This is hardly surprising.  Computers
are real good at simple repetitive tasks (like generating code for
RISC machines).  Complex tasks (like making good use of a rich
instruction set on a CISC) require more effort on the part of the
_programmer_.

mash@mips.COM (John Mashey) (07/20/89)

In article <42942@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
....
>>Correct me if I'm wrong, but I remember looking at some of the initial
>>specs for the Intel 860 and that thing looked like a nightmare to write
>>a compiler for.  Forget about optimizations, just getting floating
>>point results out of the FP pipeline, requires that the compiler have
>>knowledge of the internals of the hardware.

>This is true, and makes me believe that IEEE FP and RISC are
>incompatible.  Both major implementations (860 and 88000 (I'm glad I
>can admit knowledge of the 88000 now)) show that this force-fit causes
>problems.  By RISC arguments, let sw handle the exceptions.  But IEEE
>FP has so many exceptions by its over-complex nature that [not the
>compiler really, the OS] absolutely must be able to handle said
>hardware internals, and must do it efficiently; underflow happens
>more often than you think.

This seems to be an over-generalization.
Maybe this means that 860s and 88000s have problems (beyond whether or
not these are the 2 "major" implementations :-).  I assume these refer to
the issues of dealing with imprecise exceptions, which is certainly
possible, if complicated, or extremely complicated.  Anyway, please explain
more why IEEE FP and RISC are incompatible, as the RISCs that are
already out there in systems manage to deal with it....
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

jac@paul.rutgers.edu (Jonathan A. Chandross) (07/20/89)

cjosta@tasu77.UUCP (Jonathan Sweedler)
> Don't forget that MIPS, one of the early RISC
> chips, stands for 'Microprocessor without Interlocked Pipeline Stages.'
> This means that interlocks must be handled in the compiler.  So
> forgetting about all optimization, just writing a compiler that WORKS
> on a RISC chip is by DEFINITION harder than writing a compiler for a
> CISC chip.

Actually, this is usually handled by a recoding assembler.  You 
don't really want to handle it in your compiler unless your compiler 
integrates the assembly phase and emits .o files directly.  The reason 
is that you have to do some analysis on the whole program when you 
are finished to try to fill the delayed branch slots, and to reorder 
instructions, inserting no-ops where necessary, to avoid pipeline 
conflicts.  So you might as well do it then.

And writing a RISC compiler isn't `by DEFINITION' harder: you just
pad out everything with no-ops (both delayed branch slots and the
actual instructions to ensure that the pipeline flushes).  Pretty
poor object code, but you don't have to do any analysis at all.

Writing a compiler that makes use of the non-trivial features of a 
machine, RISC or CISC, is very tricky.  For instance, I still haven't
found out how I can generate code to use all those nice BCD
instructions in my 68k.....


Jonathan A. Chandross
Internet: jac@paul.rutgers.edu
UUCP: rutgers!paul.rutgers.edu!jac

mash@mips.COM (John Mashey) (07/20/89)

In article <2203@taux01.UUCP> cjosta@tasu77.UUCP (Jonathan Sweedler) writes:
.....
>Correct me if I'm wrong, but I remember looking at some of the initial
>specs for the Intel 860 and that thing looked like a nightmare to write
>a compiler for.  Forget about optimizations, just getting floating
>point results out of the FP pipeline, requires that the compiler have
>knowledge of the internals of the hardware.  How many CISC chips do
>you know that make the same requirements?  (I'm probably going to be
>sorry I asked that :-)  But I think you know the point I'm trying to make).
>
>Just so I don't start a new war here, I'm not saying any of this is
>bad.  RISC seems to be proving its point.  But I think that a RISC
>compiler is NECESSARILY more complex than a CISC compiler, regardless
>of the level of optimization performed by the compiler.

This topic has been argued frequently in this newsgroup.  Whether or not
an i860 is hard to compile for proves zero about whether any given other
RISC is hard to compile for. It is hard to get data upon this topic,
but there is a little.  First, perhaps it would be good to hear from
people who have WRITTEN compilers for different architectures and see
what THEY think [please].  Second, I'll cite one study, by an
unbiased source, especially since folks have been asking about it,
since I used a few of the foils at Hot Chips as an answer to a question
(not part of the talk, and if A.K. would send me his address, I'll send him
a copy):

	Funk, Bud K, "RISC and CISC benchmarks and insights",
	UNISYS World, January 1989, 11-14.

They evaluated 8 chips {80386, 32532, 68020, 68030, SPARC, 88000, 29000,
R3000}, running the same {small} integer benchmarks written in both {C, assembly
language}, using the same external memory, to keep things as fair as
possible.  [This is hard work, by the way, which is why you don't see more
of this kind of study.]

1) First, they handcoded the small benchmarks in assembler for all machines,
and called the 68020 == 1 (bigger is better).  The CISC performance ranged
from about .75 to 1.1; the 4 RISCs were about 2-2.5X faster.

2) They then compiled the C code.  The CISCs dropped about 50-60% in
performance; the RISCs dropped 20-40% in performance, so the resulting
ratio was that the RISCs ended up with a 4-5.5X performance advantage.

3) OF COURSE THESE ARE SMALL BENCHMARKS, AND SO ALL THE USUAL CAVEATS APPLY,
and I don't know (except in one case :-) exactly whose compilers were being
used, etc, etc.
However, the group doing these evaluations is NOT a purveyor of RISC chips
or compilers, and is trying to do a straight evaluation.

A quote from the article:
	"The RISC compilers are relatively new and still significantly
	better than their CISC counterparts."

Now, if that statement is true, I observe that it is NOT from lack of
competent people, working for years, to build good compilers for the
CISCs.......
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

schow@bnr-public.uucp (Stanley Chow) (07/20/89)

In article <23663@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>
>They evaluated 8 chips {80386, 32532, 68020, 68030, SPARC, 88000, 29000,
>R3000},  ....

Just a reminder that the four CISCs here are by no means representative
of CISC architecture. [Who cares about the actual number sold.]

Conclusions reached on a small sample of architecture cannot be generalized
to *all* CISC or RISC machines.

Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831		     ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
Me? Represent other people? Don't make them laugh so hard.

slackey@bbn.com (Stan Lackey) (07/20/89)

In article <23659@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <42942@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>>This is true, and makes me believe that IEEE FP and RISC are
>>incompatible.
>
>This seems to be an over-generalization.
>Anyway, please explain
>more why IEEE FP and RISC are incompatible, ...


Sure.  As I understand it, the major basis of RISC is:

Simplify the hardware.  Include only those features that:
  Provably improve performance;
    Their presence speeds up certain operation(s)
    Their presence does not slow down other operation(s)
  Are straightforward to implement;
  Are usable by straightforward compiler design;
  Anything else emulate in software.

[I'm going to let it stand here.  Probing more deeply into any of the
above is a separate discussion.]

IEEE floating point goes contrary to each of those issues.

1. Gradual underflow is not straightforward to implement.  It requires
more cases to be handled in virtually every fp operation, for example
pre-alignment before multiplication.  Now you can leave time before
every multiplication for this prealignment (Cydrome), take a
special control flow to handle the case (68881 etc), just zero the
operands/results and ignore it (essentially what DEC, Cray, IBM, etc
do), or trap to the OS (i860).

BTW [at least] DEC has trap-on-underflow.  Guess what happens real
quick?  Users shut it off.  [Was that necessary? Bad boy. -Ed.]

Gradual underflow does not improve performance.  It [inaccurately]
extends the low end of the range a little.  It adds significantly to
design and verification time.  The only case I ever heard of it
mattering is a drummed-up case Kahan wrote.

This does not affect the compiler, other than requiring the runtime
libraries to be able to convert yet another format on output.

In the case of an i860 software supplier, if he wants stuff like
transcendentals to run fast, he must tune them so that they don't
cause underflow traps, rather than just using 0, which is what he
wants anyway.

2. Rounding modes.  No compiler on Earth uses them, other than to get
around instruction set deficiencies (rounding when converting from
FP to integer).  This particular problem is more efficiently solved by
adding a second version of the convert-to-integer instruction.
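The truncate-vs-round distinction behind that "second convert instruction" can be seen at source level. A small Python check (Python floats are IEEE 754 doubles on common platforms); a machine whose only convert instruction truncates must fiddle the rounding-mode bits, or synthesize rounding, to get the second behaviour:

```python
import math

x = 2.7
# Two different FP-to-integer conversions a compiler may need:
assert int(x) == 2               # truncate toward zero
assert round(x) == 3             # round to nearest, as a primitive
assert math.floor(x + 0.5) == 3  # round to nearest, naively synthesized

# Truncation is not floor for negative values, yet another flavour:
assert int(-2.7) == -2
assert math.floor(-2.7) == -3
```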

Their use has been intended for range arithmetic.  Now, when I asked,
lots of people knew what they were for, but no one [on comp.arch] had
ever heard of them having ever been used.  Interesting how much
hardware out there can do these operations that are so rarely used.
Kind of like the decimal string instruction set on the VAX, except
that the VAX decimal strings are occasionally used. [There you go
again! -Ed.]

Their implementation requires not much logic (maybe 50-100 gates), but
logic that affects critical path timing.  It requires yet more
verification (test vectors).

--
My belief is that if IEEE FP had been done a few years after it
actually was, after RISC became the trend, it would have come out much
different.  In fact, I think the suppliers should get an official
subset recognized by the IEEE that just uses 0 for underflowed
operands/results, and just implements round-to-nearest.  I would be
very surprised if it ever came up in a sales situation anyway; in fact
all I care about is how fast it runs SPICE.

I'm done.  [Finally! -Ed.]  -Stan

khb@gammara.Sun.COM (gammara) (07/20/89)

In article <43004@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:

>Sure.  As I understand it, the major basis of RISC is:
>
>Simplify the hardware.  Include only those features that:
>  Provably improve performance;
>    Their presence speeds up certain operation(s)
>    Their presence does not slow down other operation(s)
>  Are straightforward to implement;
>  Are usable by straightforward compiler design;
>  Anything else emulate in software.

Getting the _right_ answer should be on the list.

>
>IEEE floating point goes contrary to each of those issues.
>
>1. Gradual underflow is not straightforward to implement. 
>
>Gradual underflow does not improve performance.  It [inaccurately]
>extends the low end of the range a little.  It adds significantly to
>design and verification time.  The only case I ever heard of it
>mattering is a drummed-up case Kahan wrote.

GU is one of the MOST useful parts of the standard. It, in fact,
allows many algorithms to run in 32-bits that would have required
64-bits (or more) with, say, IBMath. Kahan is one of the most skillful
folk at finding SIMPLE examples ... real codes are much too hard to
fully analyze ... and that is exactly why GU _should_ be employed for
all 32-bit machines. 

While purists will deep fry me in oil (or should), I can see the
great utility of a 64-bit non-compliant mode ... there are many codes
which are run in 64-bits for hysterical reasons (broke on a 360
therefore it must always be DP ...) which get along just fine with
abrupt underflow. AU is sometimes desired in the 32-bit realm, but
there one really should analyze the whole program (or run just
selected parts in AU). This can be accomplished on SunOS 4.x with a
runtime call.
>
>In the case of an i860 software supplier, if he wants stuff like
>transcendentals to run fast, he must tune them so that they don't
>cause underflow traps.  Rather that just use 0 which is what he wants
>anyway.

Depends on the algorithm. If 0 is what they want, one can toggle back
and forth.

>
>2. Rounding modes.  No compiler on Earth uses them, other than to get
>around instruction set deficiencies (rounding when converting from
>FP to integer).  This particular problem is more efficiently solved by
>adding a second version of the convert-to-integer instruction.

The folks who do interval arithmetic would strongly disagree here. For
better or worse, most of those folks seem to live in Europe. I have
seen versions of Pascal and Fortran extended to use 'em.
>
>--
>My belief is that if IEEE FP had been done a few years after it
>actually was, after RISC became the trend, it would have come out much
>different.  In fact, I think the suppliers should get an official

Getting the right answer is important. When dealing with FP problems
one has at LEAST the following issues to sweat:

1)	is the algorithm stable ?
2)	is it robust ?               

Requires a very hard backwards analysis to prove.

3)	is the implementation stable
4)	robust ?

This requires massive testing

5)	is the problem ill conditioned ?

The symptoms are EXACTLY those of 1-4 being broken. 

6)	is my computer arithmetic doing something funny ?

The symptoms are EXACTLY those of 1-5 being broken.

Using IEEE math removes 6 from the things you must sweat
bullets over when your 100+K line application is acting wonky.  The
cost to implementors is non-trivial, but the savings to application
specialists is huge ... and they need not even understand that this is
happening on their behalf ... they just get fewer strange edge conditions.

>subset recognized by the IEEE that just uses 0 for underflowed
>operands/results, and just implements round-to-nearest.  I would be
>very surprised if it ever came up in a sales situation anyway; in fact
>all I care about is how fast it runs SPICE.

You should also care that the circuit is analyzed correctly. The sad
fact is that this is exceedingly hard to benchmark. That is why we
need the standard.

Kahan has a paper or two on presubstitution, a technique that allows
one to avoid a lot of the expensive precise-exception machinery for a huge
number of applications.


Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &    |   Languages and Performance Tools. 
Opus  (* strange as it may seem, I do more engineering now     *)

bcase@cup.portal.com (Brian bcase Case) (07/21/89)

>>Correct me if I'm wrong, but I remember looking at some of the initial
>>specs for the Intel 860 and that thing looked like a nightmare to write
>>a compiler for.  Forget about optimizations, just getting floating
>>point results out of the FP pipeline, requires that the compiler have
>>knowledge of the internals of the hardware.
>
>This is true,

This is not true; the i860 has non-pipelined FP instructions that
don't require any knowledge of the pipeline.  Of course, if you
use them, you don't get the performance....

seanf@sco.COM (Sean Fagan) (07/21/89)

In article <BRUCE.89Jul18181114@heather.pooh.com> bruce@heather.pooh.com (Bruce Robertson) writes:
>Shouldn't all these optimizations be done in the machine independent
>portion of the compiler?  

Yes, all of the optimizations you mentioned (which I deleted) *should* be
done in the mi part of the compiler.  For compilers which are intended to
support multiple architectures (remember, not all are), it usually is.

>A well designed compiler should have
>all the machine dependencies isolated in one relatively small module
>(relative to the size of the entire compiler). RISCy things like delay
>slot filling are easiest to handle in the assembler, or other post
>processor.

Except that delay-slot filling is not the only type of optimization that
exists.  For some machines (such as a CDC Cyber), issuing certain instructions
right after others slows things down, while on other machines it speeds them up.
Specifically:  memory accesses and various register combinations.  Some
microprocessors can be optimized to do certain kinds of memory fetches
faster, while the Cyber does sequential accesses best.  Also, something
like:
	FX6	X5*X3
	FX7	X6/X1

is bad practice on a Cyber (put other instructions in between).  On some
other machines (specifically VLIW and its ilk), that could be reduced to a
pseudo-one instruction which did:

	X6 <- X5*X3, X7 <- <multiply unit> / X1

which would execute in parallel as much as possible (I know, I know, that's
really a bad example.  I had some better code than that lying around, but I
forget where I put it).

Then there's the choice of instruction selection.  Sometimes, which
instructions you choose can change some of your "machine independent"
optimizations.  Also, you have to choose registers properly (is it better to
put array addresses into registers, or would it be better to put other
variables in registers and let the hardware compute the array address?  For
something like a '486, this could be a tough choice...).

In other words, life isn't as simple as we would like it to be...

-- 
Sean Eric Fagan  |    "Uhm, excuse me..."
seanf@sco.UUCP   |      -- James T. Kirk (William Shatner), ST V: TFF
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

stevev@tekchips.LABS.TEK.COM (Steve Vegdahl) (07/21/89)

In article <1989Jul19.165456.20210@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
> Things like software pipe interlocks, delayed branches, etc. can almost
> always be deferred to a cleanup pass that is very stupid and simple.  (On
> some RISC chips, this sort of thing is done in the assembler, not the
> compiler at all.)  One of the surprises that came out of RISC work is
> that moving some of these things into the software really costs very
> little.  You may occasionally miss an optimization opportunity by not
> having the earlier passes aware of the final rearrangement, but it's
> not really very frequent.

The early RISC compiler work (e.g., Gross) seems to have suggested this, but
later work (e.g., Bird's thesis from U. Michigan) has shown that doing
register allocation before instruction scheduling will often cause code
quality to be much poorer.

The register allocator generally wants to do all of one computation, then
all of the next, etc. to minimize the number of concurrently live variables.
The instruction scheduler generally wants to interleave "unrelated"
computations to minimize the effects of memory latency and such.  This issue
came up when we were considering instruction scheduling for our 88000 Scheme
compiler.  (See our paper in the ASPLOS-III proceedings).

What happens in practice when register allocation is done first, is that the
register assignments introduce a lot of artificial constraints on the
scheduler, causing it to do a much poorer job.  In other words, the
assignment of common registers to computations that could otherwise be
interleaved  makes such interleaving impossible.

		Steve Vegdahl
		Computer Research Lab
		Tektronix Labs
		Beaverton, Oregon

stevev@tekchips.LABS.TEK.COM (Steve Vegdahl) (07/21/89)

In article <13995@lanl.gov>, jlg@lanl.gov (Jim Giles) writes:
> (A)   Instruction selection
 ...
> (B)   Register allocation
 ...
> (C)   Code ordering
 ...
> Any one of the above is fairly easy to do in isolation.  Any two of the
> above are difficult to do because they interfere with each other.  Only
> CISC archetectures even have to consider (A) since, in a RISC, this
> step is _very_ simple.  I claim that CISC compilers must do all three
> of the above in order to be competitive.  Hence, my claim that CISC is
> harder to compile for than RISC.

On the other hand, (C) is generally much more difficult for a RISC
architecture, due to architectural features such as delayed branches and
exposed memory latency.  These features tend not to pose difficult problems
when treated in isolation---as was done in the early RISC compiler work---but
cause additional complications when combined with other phases.

I think that the RISC vs. CISC compiler issue is more or less a toss-up.
With either type of architecture, you can write a *simple* compiler by
compiling:
  * to a RISC subset
  * without code reordering
For higher quality code, CISC compilers are complicated by the need to
consider complex addressing modes, while RISC compilers are complicated
by the need to consider delayed branches and memory latency.  Both kinds
of complications have interdependencies with other aspects of optimization.

		Steve Vegdahl
		Computer Research Lab
		Tektronix Labs
		Beaverton, Oregon

gideony@microsoft.UUCP (Gideon Yuvall) (07/22/89)

In article <43004@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>2. Rounding modes.  No compiler on Earth uses them, other than to get
>around instruction set deficiencies (rounding when converting from
>FP to integer).  This particular problem is more efficiently solved by
>adding a second version of the convert-to-integer instruction.
>
>Their use has been intended for range arithmetic.  Now, when I asked,
>lots of people knew what they were for, but no one [on comp.arch] had
>ever heard of them having ever been used.  Interesting how much
... ...

K.C.Ng's square-root code (part of 4.3BSD; look under
the "#if 0" conditional) uses rounding modes to compute correct
SQRTs.
-- 
Gideon Yuval, gideony@microsof.UUCP, 206-882-8080 (fax:206-883-8101;TWX:160520)

mash@mips.COM (John Mashey) (07/22/89)

In article <4356@tekcrl.LABS.TEK.COM> stevev@tekchips.LABS.TEK.COM (Steve Vegdahl) writes:
>In article <13995@lanl.gov>, jlg@lanl.gov (Jim Giles) writes:
> ...
>> (C)   Code ordering
> ...
>> Any one of the above is fairly easy to do in isolation.  Any two of the
>> above are difficult to do because they interfere with each other.  Only
>> CISC archetectures even have to consider (A) since, in a RISC, this
>> step is _very_ simple.  I claim that CISC compilers must do all three
>> of the above in order to be competitive.  Hence, my claim that CISC is
>> harder to compile for than RISC.
>
>On the other hand, (C) is generally much more difficult for a RISC
>architcture, due to architectural features such as delayed branches and
>exposed memory latency.  These features tend not pose difficult problems
>when treated in isolation---as was done in the early RISC compiler work---but
>cause additional compications when combined with other phases.

1) As I recall, our first reorganizer was written and debugged by 1 person
(Larry Weber) in a week.  Admittedly he'd written several before.
It certainly has been improved and rewritten since, but it lasted a good while.

2) It is clear that code reordering varies in usefulness from architecture
to architecture, BUT the boundary is not RISC-vs-CISC, but determined
by presence or absence of various features and the ways compilers use them.
In general, any machine with nontrivial pipelines and/or separate
functional units may well benefit.  Note that CDCs did all this years ago.  
Also, I heard long ago that people were able to get >5% improvement
on S/370-architecture systems by code-reordering.  Maybe somebody has
a reference.  Also, maybe somebody from Amdahl or IBM would care to
offer some information on what the pipelines and functional units
look like on their current products, because I'd suspect there are places
where code rearrangement is usable.
I'd strongly suspect that most machines that could execute:
	load	reg1,a
	load	reg2,b
	add	reg1,20
	sub	reg2,30
are better off doing it that way than the following:
	load	reg1,a
	add	reg1,20
	load	reg2,b
	sub	reg2,30
because most machines will stall on the add/sub more on the second case.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

brooks@vette.llnl.gov (Eugene Brooks) (07/23/89)

In article <4355@tekcrl.LABS.TEK.COM> stevev@tekchips.LABS.TEK.COM (Steve Vegdahl) writes:
>The register allocator generally wants to do all of one computation, then
>all of the next, etc. to minimize the number of concurrently live variables.
>The instruction scheduler generally wants to interleave "unrelated"
>compuations to minimize the effects of memory latency and such.  This issue
>came up when we were considering instruction scheduling for our 88000 Scheme
>compiler.  (See our paper in the ASPLOS-III proceedings).
>
>What happens in practice when register allocation is done first, is that the
>register assignments introduce a lot of artificial constraints on the
>scheduler, causing it to do a much poorer job.  In other words, the
>assignment of common registers to computations that could otherwise be
>interleaved  makes such interleaving impossible.
We dealt with this problem in the Cerberus multiprocessor simulator by
simply having the register allocator get registers from the scratch pile
in a round-robin manner, instead of starting upwards from R0 in the search
for a free one each time.  This minimized the number of "false dependencies"
and allowed a postprocessing scheduler to work quite well.  In the limit
of large register counts it is of course perfect (256 registers is effectively
an infinite number), but in practice it worked quite well at 16 or 32 registers.
If one is constrained to do the scheduling in a machine-dependent postprocessor,
with the register allocation done in a PCC- or GCC-like portable compiler, this
technique is quite useful and produces well-scheduled code.

brooks@maddog.llnl.gov, brooks@maddog.uucp

sbf10@uts.amdahl.com (Samuel Fuller) (07/26/89)

In article <23802@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
> 2) It is clear that code reordering varies in usefulness from architecture
> to architecture, BUT the boundary is not RISC-vs-CISC, but determined
> by presence or absence of various features and the ways compilers use them.
> In general, any machine with nontrivial pipelines and/or separate
> functional units may well benefit.  Note that CDCs did all this years ago.  
> Also, I heard long ago that people were able to get >5% improvement
> on S/370-architecture systems by code-reordering.  Maybe somebody has
> a reference.  Also, maybe somebody from Amdahl or IBM would care to
> offer some inforomation on what the pipelines and functional units
> look like on their current products, because I'd suspect there are places
> where code rearrangement is usable.
> I'd strongly suspect that most machines that could execute:
> 	load	reg1,a
> 	load	reg2,b
> 	add	reg1,20
> 	sub	reg2,30
> are better off doing it that way than the following:
> 	load	reg1,a
> 	add	reg1,20
> 	load	reg2,b
> 	sub	reg2,30
> because most machines will stall on the add/sub more on the second case.

Some experiments have been conducted here at Amdahl wrt code reordering.
Fairly simple code rearrangements have produced approximately a 5% improvement
in execution speed.

Current Amdahl machines have 5 or 6 stage pipelines
depending on the model.  Although the machines are microcoded
most instructions take just one flow down the pipeline.
The microcoded approach does not impact the cycle time (10ns) because
the cycle time is determined mostly by the cache access latency.
If data is in the cache, execution proceeds without delay.
Access to cache and to registers takes the same amount of time,
although a Register-Register architecture would probably allow a
shorter pipe.

-- 
---------------------------------------------------------------------------
Sam Fuller / Amdahl System Performance Architecture

I speak for myself, from the brown hills of San Jose.

UUCP: {ames,decwrl,uunet}!amdahl!sbf10 | USPS: 1250 E. Arques Ave (M/S 139)
INTERNET: sbf10@amdahl.com             |       P.O. Box 3470
PHONE: (408) 746-8927                  |       Sunnyvale, CA 94088-3470
---------------------------------------------------------------------------