[comp.arch] Compiling - RISC vs. CISC

jlg@lanl.gov (Jim Giles) (07/08/89)

Every once in a while, someone in this newsgroup makes the claim
that RISC trades off hardware complexity for compiler complexity.
This is simply _NOT_ true.  It is always _easier_ to write a compiler
for a RISC machine than for a CISC machine.

Compiling is generally divided into two parts.  The "front end"
parses the input language and determines whether the program is
syntactically and semantically correct.  The "front end" then
passes the program along in some intermediate form.  The "back
end" selects instructions which implement the original program,
allocates registers (and other temporary resources), and outputs
the code.  Some compilers combine the "front" and "back end" parts,
or divide labor in more unusual ways, but this does not affect the
validity of the following argument.

The "front end" processing of a compiler is identical whether the
target machine is a RISC or a CISC.  So, no trade-off of complexity
can occur there.

For a RISC machine, the only hard part of the "back end" is register
allocation.  Instruction selection is fairly simple since there is
generally only one way to perform each intermediate code operation.
On a pipelined machine, code ordering comes into play (at least if
you want optimized code).  This complicates matters, since a different
code ordering imposes different register allocation constraints.
For this reason, optimizing can be difficult, even on a RISC machine.
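[Editorial sketch] The ordering/allocation interaction can be made concrete with a toy liveness computation.  The tuple representation and the rule of counting only compiler temporaries are simplifying assumptions, not any real allocator.

```python
# Toy demonstration that instruction ordering changes register
# pressure.  Ops are (dest, sources); only temporaries t* are
# counted as competing for registers.

def max_live_temps(ops, live_out=("t5",)):
    live, worst = set(live_out), len(live_out)
    for dest, srcs in reversed(ops):        # backward liveness walk
        live.discard(dest)                  # a value dies at its def
        live.update(s for s in srcs if s.startswith("t"))
        worst = max(worst, len(live))
    return worst

# (a+b) + (c+d) + (e+f), summing partial results as soon as possible:
eager = [("t1", ("a", "b")), ("t2", ("c", "d")), ("t3", ("t1", "t2")),
         ("t4", ("e", "f")), ("t5", ("t3", "t4"))]

# same ops, all independent leaf adds first (friendlier to a pipeline):
wide = [("t1", ("a", "b")), ("t2", ("c", "d")), ("t4", ("e", "f")),
        ("t3", ("t1", "t2")), ("t5", ("t3", "t4"))]

print(max_live_temps(eager), max_live_temps(wide))   # 2 3
```

The pipeline-friendly schedule needs one more register: the scheduler and the allocator pull in opposite directions.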

On a CISC machine, instruction selection can be a real problem.  This
is because there are generally several instruction sequences possible
to implement each operation.  Instruction selection obviously
affects the register allocation phase (and the code ordering phase).
To make matters worse, instruction selection is context sensitive.
This means that different code ordering during optimization could
invalidate the instruction selection choice (or, at least make the
selected instructions sub-optimal).

There may indeed be a trade off between RISC and CISC.  But this
trade off is certainly _not_ in the complexity of the compiler.
In fact, as shown above, the circumstance is quite the opposite.

frazier@oahu.cs.ucla.edu (Greg Frazier) (07/08/89)

In article <13976@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>Every once in a while, someone in this newsgroup makes the claim
>that RISC trades off hardware complexity for compiler complexity.
>This is simply _NOT_ true.  It is always _easier_ to write a compiler
>for a RISC machine than for a CISC machine.
>
[ deleted explanation of compiler front- and backend ]
>For a RISC machine, the only hard part of the "back end" is register
>allocation.  Instruction selection is fairly simple since there is
>generally only one way to perform each intermediate code operation.
>On a pipelined machine, code ordering comes into play (at least if
>you want optimized code).  This complicates matters, since a different
>code ordering imposes different register allocation constraints.
>For this reason, optimizing can be difficult, even on a RISC machine.
>
>On a CISC machine, instruction selection can be a real problem.  This
>is because there are generally several instruction sequences possible
>to implement each operation.  Instruction selection obviously
>affects the register allocation phase (and the code ordering phase).
>To make matters worse, instruction selection is context sensitive.
>This means that different code ordering during optimization could
>invalidate the instruction selection choice (or, at least make the
>selected instructions sub-optimal).
>
>There may indeed be a trade off between RISC and CISC.  But this
>trade off is certainly _not_ in the complexity of the compiler.
>In fact, as shown above, the circumstance is quite the opposite.

Actually, I believe you are wrong, Jim.  While code selection
is easier on a RISC, CISC compilers tend to avoid this problem by
using only the simple instructions.  On the other hand, RISCs require
very good optimizers in order to take advantage of their RISC-ness.
This is complex.  In addition, most RISC architectures expose their
pipelines, and hence require the compiler to avoid interlocks,
etc.  This is also complex.  On a related note, RISCs tend to
have delayed branches, register windows, etc. which the compiler
must understand.  Finally, as you pointed out, RISCs require
sophisticated register allocation.

So, you are right, in that the reduced instruction set makes
instruction selection much simpler.  So I guess it is easier to
write a stupid compiler for a RISC.  However, because a generic
RISC architecture allows and requires a high level of optimization,
RISC compilers end up being very complex.

Greg Frazier
&&&&&&&&&&&&&&#########################))))))))))))))))))))
Greg Frazier	    o	Internet: frazier@CS.UCLA.EDU
CS dept., UCLA	   /\	UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
	       ----^/----
		   /

davet@oakhill.UUCP (David Trissel) (07/08/89)

In article <13976@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>Every once in a while, someone in this newsgroup makes the claim
>that RISC trades off hardware complexity for compiler complexity.
>This is simply _NOT_ true.  It is always _easier_ to write a compiler
>for a RISC machine than for a CISC machine.

False. My group has produced essentially the same compiler for both CISC 
(68K family) and RISC (88K family) and we've found the various tradeoffs
cancel each other out. The amount of work is virtually the same. The
code produced by both compilers is comparable.

>For a RISC machine, the only hard part of the "back end" is register
>allocation.  

What about the required pairing of registers for double wide operations such as
floating-point or shifting? That means the compiler has added complexity in 
its register allocation scheme. Never does the 68K architecture require paired
registers. What about the lack of a 32-bit effective address? This forces some 
RISC systems to have to resort to Intel segment-like (UGH!!) approaches to 
support access to the program's common data base. Some RISCS don't even have 
divide or multiply. (The 88K does have them.)

For every CISC compiler hassle you come up with I'll give you a RISC compiler 
hassle. For every RISC compiler hassle I'll give you a CISC compiler hassle.

>Instruction selection is fairly simple since there is
>generally only one way to perform each intermediate code operation.

This is a strange statement. Since, in general terms, CISC instruction
sets are supersets of RISC models, why are the "extra" available CISC
instructions mandated to be used by a CISC compiler? Indeed, one of the
arguments for RISC is the elimination of "unused" instructions from the
instruction set. Although this may bring up important architectural differences
between RISC and CISC it has no bearing on the complexity of a compiler.

>On a pipelined machine, code ordering comes into play (at least if
>you want optimized code).  This complicates matters, since a different
>code ordering imposes different register allocation constraints.
>For this reason, optimizing can be difficult, even on a RISC machine.

But as CISC implementations become more advanced the applicability of code
reordering is starting to surface there as well.

>On a CISC machine, instruction selection can be a real problem.  This
>is because there are generally several instruction sequences possible
>to implement each operation.  

I presume what you are driving at is that a CISC may have a shorter, more
complex instruction sequence available to it than a RISC for a given piece of 
code.  Point one is that the CISC compiler does not have to generate the more
complex sequence since it can still produce the degenerate case. Point two
is that the CISC compiler has the option of improving the instruction stream
while the RISC implementation has to wait for a future more advanced hardware
implementation to come along which can support something like simultaneous 
instruction execution.

As a case in point consider the somewhat common C expression *p++. The 68K
family could issue the standard:

   mov.l  (%an),%dn
   add.l  &4,%an

or the faster

   mov.l  (%an)+,%dn

The 68K requires a routine in the compiler peephole optimizer to "discover" 
and implement this optimization. But the result is a single 16-bit instruction
which (I think) executes in a single clock on the MC68040.
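[Editorial sketch] A minimal version of such a peephole rule, over a made-up textual form of the 68K instructions above; Motorola's actual optimizer surely differs.

```python
# Peephole rule: "mov.l (%an),%dn" immediately followed by
# "add.l &4,%an" folds into the post-increment form "mov.l (%an)+,%dn".
# The textual IR is invented for this sketch.

def peephole(insns):
    out, i = [], 0
    while i < len(insns):
        cur = insns[i]
        if (i + 1 < len(insns) and cur.startswith("mov.l (%a")
                and "+" not in cur):
            reg = cur[7:cur.index(")")]          # e.g. "%a0"
            if insns[i + 1] == f"add.l &4,{reg}":
                # rewrite the pair as one post-increment move
                out.append(cur.replace(f"({reg})", f"({reg})+"))
                i += 2
                continue
        out.append(cur)
        i += 1
    return out

code = ["mov.l (%a0),%d1", "add.l &4,%a0", "add.l %d1,%d2"]
print(peephole(code))
# ['mov.l (%a0)+,%d1', 'add.l %d1,%d2']
```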

>Instruction selection obviously
>affects the register allocation phase (and the code ordering phase).

True for both RISC and CISC. No difference.

>To make matters worse, instruction selection is context sensitive.
>This means that different code ordering during optimization could
>invalidate the instruction selection choice (or, at least make the
>selected instructions sub-optimal).

True for both RISC and CISC. No difference.

>There may indeed be a trade off between RISC and CISC.  But this
>trade off is certainly _not_ in the complexity of the compiler.

Correct.

>In fact, as shown above, the circumstance is quite the opposite.

This is counter-indicated above. Perhaps you have some concrete examples
in mind to show otherwise.

 -- Dave Trissel  Motorola Austin

tim@crackle.amd.com (Tim Olson) (07/10/89)

In article <25547@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
| Actually, I believe you are wrong, Jim.  While code selection
| is easier on a RISC, CISC compilers tend to avoid this by only
| using the simple compilers.  On the other hand, RISCs require
| very good optimizers in order to take advantage of their RISC-ness.

No, RISCs don't *require* very good optimizers; they just make it easier
to perform some optimizations.  Very simple compilers can be used with
fairly good results.

| This is complex.  In addition, most RISC architectures expose their
| pipelines, and hence require the compiler to avoid interlocks,
| etc.  This is also complex.

*Most* RISC architectures?  The Stanford MIPS processor did not have
interlocks, but nearly all other RISC processors do (a recent exception
is the i860, which exposes the pipeline in pipelined floating-point mode).


| On a related note, RISCs tend to
| have delayed branches, register windows, etc. which the compiler
| must understand.

I would claim that the stack-cache operation of the Am29000 register
file is *easier* to generate code for than a standard register file. 
Just determine the maximum number of registers required by a function
and allocate them on function entry -- register spilling & filling is
taken care of automatically by bounds checks.  Yes, delayed branches
have to be taken care of, but they aren't particularly hard -- some
systems don't put this in the compiler; it is done at assembly or link
time, along with other optimizations.
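[Editorial sketch] Delay-slot filling of the kind described can be sketched as a post-pass.  The (text, defs, uses) representation and the "only look one instruction back" heuristic are assumptions for illustration, not any vendor's tool.

```python
# Post-pass branch-delay-slot filling.  Each instruction is
# (text, defs, uses); "nop" marks an unfilled slot after a branch.
# Heuristic: only try the instruction just before the branch, and
# only move it if the branch does not read anything it writes
# (the slot instruction executes either way, so order is preserved).

def fill_delay_slot(insns):
    out = list(insns)
    for i in range(1, len(out) - 1):
        cur, nxt, prev = out[i], out[i + 1], out[i - 1]
        if cur is None or nxt is None or prev is None:
            continue
        if cur[0].startswith("br") and nxt[0] == "nop":
            if not set(prev[1]) & set(cur[2]):
                out[i + 1] = prev            # instruction fills the slot
                out[i - 1] = None            # hole, squeezed out below
    return [ins for ins in out if ins is not None]

prog = [("add r1,r2,r3", ("r1",), ("r2", "r3")),
        ("br r4,loop",   (),      ("r4",)),
        ("nop",          (),      ())]
print([text for text, _, _ in fill_delay_slot(prog)])
# ['br r4,loop', 'add r1,r2,r3']
```

If the branch depends on the candidate, the nop simply stays -- which is why this can live in the assembler or linker rather than the compiler.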

| Finally, as you pointed out, RISCs require
| sophisticated register allocation.

No, it is just useful to help reduce memory traffic.



	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

frazier@oahu.cs.ucla.edu (Greg Frazier) (07/10/89)

In article <26247@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>No, RISCs don't *require* very good optimizers; they just make it easier
>to perform some optimizations.  Very simple compilers can be used with
>fairly good results.

To take advantage of their RISC-ness, they DO require the optimizations.
Of course, they can run lousy code - they just go slow.

>| This is complex.  In addition, most RISC architectures expose their
>| pipelines, and hence require the compiler to avoid interlocks,
>| etc.  This is also complex.
>
>*Most* RISC architectures?  The Stanford MIPS processor did not have
>interlocks, but nearly all other RISC processors do (a recent exception
>is the i860, which exposes the pipeline in pipelined floating-point mode).

When you have single-cycle instructions, talking about exposed
pipelines is rather nonsensical.  On most RISCs, the only
instructions which are multi-cycle are loads and stores, which
are variable and thus require interlocks.
The i860 is the only RISC with a significant number of multi-cycle
instructions on it (and it's still a RISC! :-) - as more processors
move in that direction, you are going to see more exposed pipelines.
Also, if you'll allow me to stretch things a bit, putting inst'ns
in delayed branch slots does count as exposing the pipeline, I
think - it is very implementation dependent.

>I would claim that the stack-cache operation of the Am29000 register
>file is *easier* to generate code for than a standard register file. 
>Just determine the maximium number of registers required by a function
>and allocate them on function entry -- register spilling & filling is
>taken care of automatically by bounds checks.  Yes, delayed branches
>have to be taken care of, but they aren't particularly hard -- some
>systems don't put this in the compiler; it is done at assembly or link
>time, along with other optimizations.

You're right - unfortunately, I was confusing in my memory someone
complaining how hard it is to write assembly code for register-
window machines.  Yes, they do make life easier for the compiler,
unless you are trying to optimize over/underflows!

>| Finally, as you pointed out, RISCs require
>| sophisticated register allocation.
>
>No, it is just useful to help reduce memory traffic.

Well, if you don't do the optimizations, and don't reduce the
memory traffic, then once again you are throwing out the
advantages of a RISC arch, and you might as well be running
on a CISC.  My basic thrust, please correct me if I'm wrong,
is that RISCs are much more dependent upon compiler optimizations
for their performance, and if you're not willing to invest in
a sophisticated compiler, you're probably working on the wrong
machine.

Greg Frazier
&&&&&&&&&&&&&&&&&&&##########################%%%%%%%%%%%%%%%%%%%%%

Greg Frazier	    o	Internet: frazier@CS.UCLA.EDU
CS dept., UCLA	   /\	UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
	       ----^/----
		   /

khb@gammara.Sun.COM (gammara) (07/10/89)

In article <13976@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>This is simply _NOT_ true.  It is always _easier_ to write a compiler
>for a RISC machine than for a CISC machine.
>
> stuff about compilers

Well, if RISC machines were simple (everything in 1 cycle, no
overlapped execution, sans multiple functional units, etc.) you would
be 100% right.

Unfortunately RISC is commonly applied to machines like SPARC, MIPSco,
i860 and others.

It is said that Metaflow is working on a SPARC with several functional
units, for example.

Even for the current chips employed by Sun, instruction scheduling is
one of the most interesting optimizations (wrt effectiveness on real
programs). On a machine such as Metaflow is working on, instruction
scheduling could become as interesting as a TRACE/28 ....

Such compilers are NOT simpler than CISC compilers.

Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &    |   Languages and Performance Tools. 
Opus  (* strange as it may seem, I do more engineering now     *)

davecb@yunexus.UUCP (David Collier-Brown) (07/10/89)

In article <13976@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>Every once in a while, someone in this newsgroup makes the claim
>that RISC trades off hardware complexity for compiler complexity.
>This is simply _NOT_ true.  It is always _easier_ to write a compiler
>for a RISC machine than for a CISC machine.
[well-reasoned argument for a Vax]

  If one is to consider CISC machines, there are two things to
remember:
	1) they come in families
	2) the popular ones aren't always the ones you want to consider.

  The IBM /360 and VAXen are good examples of a particular era.  That does
not make them the only CISC machines.  
  The Honeywell (now Bull) DPS-8 is a conscious attempt at a
very-complex-instruction set computer, based on the basic architecture
of the era of IBM 7090s and DEC-10s.  Given a compiler (say, FORTRAN
IV) for the basic machine language, one can add all the constructs for
PL/1 and (you should pardon the expression) COBOL to your language in
about two man-days: Honeybun put the primitives into the EISbox
(Extended Instruction Set).

  One can even generate good code for them (;-)), because they're
either
	regular, or
	only exist in one variant.

  The machine is still in production, is still very CISC, and actually
runs rather well.  They really did "narrow the semantic gap between
the machine language and the language understood by the compiler",
which was the raison d'être of the vCISC machines.

 Mind you, I can't recommend the underlying order code (the stuff that
preceded the EISbox) to my worst enemy.  The machine is a sorta wart
with an elegant bag on the side.

  And Waterloo even has a very standard/rather good C compiler for it.

	       --dave (I once worked on Honeybuns) c-b
-- 
David Collier-Brown,  | davecb@yunexus, ...!yunexus!davecb or
72 Abitibi Ave.,      | {toronto area...}lethe!dave 
Willowdale, Ontario,  | Joyce C-B:
CANADA. 223-8968      |    He's so smart he's dumb.

jkrueger@daitc.daitc.mil (Jonathan Krueger) (07/10/89)

In article <13976@lanl.gov>, jlg@lanl (Jim Giles) writes:
>Every once in a while, someone in this newsgroup makes the claim
>that RISC trades off hardware complexity for compiler complexity.
>This is simply _NOT_ true.  It is always _easier_ to write a compiler
>for a RISC machine than for a CISC machine.

False.  It is always easier to generate code for a simple, orthogonal
instruction set.  (Corollary: it is always easier to generate optimal
code for a simple, orthogonal instruction set with simple, regular
rules for timings.)  The belief that every RISC machine exhibits these
characteristics more than any CISC is not based on fact.

-- Jon

rec@dg.dg.com (Robert Cousins) (07/10/89)

In article <13976@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>Every once in a while, someone in this newsgroup makes the claim
>that RISC trades off hardware complexity for compiler complexity.
>This is simply _NOT_ true.  It is always _easier_ to write a compiler
>for a RISC machine than for a CISC machine.

>In fact, as shown above, the circumstance is quite the opposite.

I must agree.  Our experience on the 88K with a number of pieces 
of software from gcc to an internally developed FORTH interpreter 
has shown that RISCs are easier to generate code for.  I would 
carry this a step further.  While the assembly language development 
effort is minimal, we have had little difficulty with assembly 
language to the point where I wonder if RISCs might not be easier 
to program at that level, also.  If a programmer can do it easily . . . .

Concerning your optimization comment, I would like to throw out an
interesting phenomenon we have noticed (perhaps other RISCers will
comment also):  At one point, people were talking about RISC processors
needing to execute MORE instructions to do the same amount of work.
This implied that where a CISC might require 100 instructions, a
RISC might require >100 instructions, though the RISC instructions
would take less time.  We have noticed that in many cases, the 
instruction count goes down.  In fact, the best example is the
Dhrystone benchmark.  Since Dhrystone is an artificial measure of
integer compute power, a "Dhrystone MIPS" is considered one VAX MIPS
of compute power.  On the 88K, we need less than 1 MIPS to generate
a Dhrystone MIPS.  In other words, we get Dhrystone MIPS > CPU
clock speed.  When this first happened here, many people were
mumbling about clocks running backward and Escher paintings . . . . :-)
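[Editorial sketch] The arithmetic behind that observation, in a few lines.  The 1757 Dhrystones/s VAX-11/780 baseline is the standard definition of one Dhrystone MIPS; the machine figures plugged in below are hypothetical, chosen only to show the effect.

```python
# "Dhrystone MIPS > native MIPS" happens exactly when a machine needs
# fewer instructions per Dhrystone iteration than the crossover value.

BASELINE = 1757.0        # VAX-11/780 Dhrystones/s = 1 Dhrystone MIPS

def mips(dhry_per_sec, insns_per_iteration):
    native = dhry_per_sec * insns_per_iteration / 1e6   # insns issued
    dmips = dhry_per_sec / BASELINE                     # benchmark rating
    return native, dmips

crossover = 1e6 / BASELINE          # ~569 instructions per iteration
native, dmips = mips(35_000, 450)   # hypothetical RISC figures
print(round(crossover), native < dmips)   # 569 True
```

So a RISC whose compiled Dhrystone loop runs under ~569 instructions rates more Dhrystone MIPS than instructions it actually issues -- no backward-running clocks required.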

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.

guy@auspex.auspex.com (Guy Harris) (07/11/89)

>They really did "narrow the semantic gap between the machine language
>and the language understood by the compiler",
>which was the raison d'être of the vCISC machines.

As far as I can tell, so would a library routine implementing the same
primitives as the EIS instructions; as such, while it may close some
particular semantic gap (i.e., some piece of the semantic gap between
solid-state physics and your favorite programming language), I don't
know that it closed any very interesting semantic gap....

tim@crackle.amd.com (Tim Olson) (07/11/89)

In article <25562@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
| In article <26247@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
| >No, RISCs don't *require* very good optimizers; they just make it easier
| >to perform some optimizations.  Very simple compilers can be used with
| >fairly good results.
| 
| To take advantage of their RISC-ness, the DO require the optimizations.
| Of course, they can run lousy code - they just go slow.

I don't agree with this.  Our experience with the early (PCC-based)
internal compiler for the Am29000 showed pretty good performance with
little or no standard optimizations (no load scheduling,
common-subexpression elimination, dead-code elimination, live/dead
analysis for register allocation, etc.) The only things it performed
were those required by the architecture, i.e.  delayed branches.  Of
course, the compiler utilized the features of the architecture, such as
keeping all scalar local variables and function parameters in registers,
which you may or may not want to count as "optimization". 

| >| This is complex.  In addition, most RISC architectures expose their
| >| pipelines, and hence require the compiler to avoid interlocks,
| >| etc.  This is also complex.

When I first read this, I thought you were referring to architectures
without hardware interlocks -- now I see that you were probably
referring to increasing the performance by scheduling instructions,
trying to avoid the interlocks.

| When you have single-cycle instructions, talking about exposed
| pipelines is rather non-sensical.  On most RISCs, the only
| instructions which are multi-cycle are loads and stores, which
| are variable and thus require interlocks.
| The i860 is the only RISC with a significant number of multi-cycle
| instructions on it (and it's still a RISC! :-) - as more processors
| move in that direction, you are going to see more exposed piplines.

If by "exposed pipeline" you mean that performance will be
enhanced by instruction scheduling, then I agree.  However, I don't
agree that most RISCs will expose the pipeline to software like the i860
-- most will provide hardware interlocks.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

jlg@lanl.gov (Jim Giles) (07/11/89)

From article <25547@shemp.CS.UCLA.EDU>, by frazier@oahu.cs.ucla.edu (Greg Frazier):
> [...]
> Actually, I believe you are wrong, Jim.  While code selection
> is easier on a RISC, CISC compilers tend to avoid this by only
> using only the simple instructions.

In this case, you are eliminating the supposed advantage to CISC
machines - their richer instruction set.  If you don't use a fairly
sophisticated instruction selection algorithm, your instruction
selection will almost always fall far short of optimal.

> [...]             Finally, as you pointed out, RISCs require
> sophisticated register allocation.

So do CISCs!!  To be sure, CISCs tend to be less reliant upon
registers.  But they also tend to have _fewer_ registers - so
register spilling occurs as often as on RISCs.  CISCs require
ALL the optimization steps (for the same performance) as RISCs,
PLUS the extra complexity of instruction selection (which feeds
back into the other optimization steps).

jlg@lanl.gov (Jim Giles) (07/11/89)

From article <2190@oakhill.UUCP>, by davet@oakhill.UUCP (David Trissel):
> In article <13976@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
> [...]
>>For a RISC machine, the only hard part of the "back end" is register
>>allocation.  
> 
> What about the required pairing of registers for double wide operations
> such as floating-point or shifting? 

In what way does a machine which requires register pairing qualify as
a RISC?  If an instruction requires 2 operands, they should be allowed
to be any two general purpose registers.  Furthermore, you are assuming
that floating point is larger than other intrinsic types.  The best RISCs
are those which only have _one_ data size.  (By the way, my model of a
reasonable RISC would be a Cray-I instruction set without vectors.  This
is certainly RISCy - all data is 64 bits, all operations are reg to reg,
only one memory addressing mode, etc..)

> [...]
>>Instruction selection is fairly simple since there is
>>generally only one way to perform each intermediate code operation.
> 
> This is a strange statement. Since in general terms CISC instruction sets are 
> supersets of RISC models then why are the "extra" available CISC 
> instructions mandated to be used by a CISC compiler? Indeed, one of the
> arguments for RISC is the elimination of "unused" instructions from the
> instruction set. Although this may bring up important architectural
> differences between RISC and CISC it has no bearing on the complexity
> of a compiler.

This is _really_ a strange statement.  Since the supposed advantage of
CISC is the richer instruction set, failure to use it would not take
advantage of the machine.  I've heard CISC designers claim that individual
instructions can be allowed to be slower than possible in order to
provide the additional instructions.  If you are not using those
extra instructions, you might as well have a RISC which provides
only the instructions you _do_ use.  The hardware designer could
then spend more time making those work faster instead of making sure
that the unused instructions work.

So, this issue _does_ have a bearing on the complexity of the compiler.
If you are not willing to provide the sophisticated compilers required
to adequately use a machine, you have wasted money (read: design effort,
chip space, etc.) on the hardware.

>>On a pipelined machine, code ordering comes into play (at least if
>>you want optimized code).  This complicates matters, since a different
>>code ordering imposes different register allocation constraints.
>>For this reason, optimizing can be difficult, even on a RISC machine.
> 
> But as CISC implementations become more advanced the applicability of code
> reordering is starting to surface there as well.

_EXACTLY_!!!!!   All the optimizations required on a RISC are also
required on a CISC.  CISC just adds more complexity to the mix.

> [... example with C: *p++ ...]
>    mov.l  (%an),%dn
>    add.l  &4,%an
> or the faster
>    mov.l  (%an)+,%dn
> The 68K requires a routine in the compiler peephole optimizer to "discover" 
> and implement this optimization. But the result is a single 16-bit instruction
> which (I think) executes in a single clock on the MC68040.

Exactly my point.  There are actually several other possibilities
for instruction selection in this case.  For example, p may already
be resident in a register.  The use of the data may require it to
end up in a register.  The further use of p may require it to be
left in a register.  Etc..  Only a pretty sophisticated compiler
can determine which instructions to use in each context.  By contrast,
there is only _one_ instruction sequence which will work on most RISC
machines:  load p, load data, store data, increment p, store p.
The 'peephole' optimizer need only discover the redundant loads
and stores to fit this sequence into context.  The instruction 
scheduler can reorder the last four of these any way it likes.
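[Editorial sketch] That context-dependence can be shown with a toy pruner.  The five-instruction sequence and its mnemonics are invented stand-ins for a generic RISC, not any particular machine.

```python
# Generic RISC sequence for "use(*p++)" and what survives once the
# compiler knows where p and the loaded value actually live.
# Mnemonics and the sequence itself are invented for illustration.

GENERIC = ["load  r1,p", "load  r2,0(r1)", "store r2,tmp",
           "add   r1,r1,4", "store r1,p"]

def prune(seq, p_in_register, value_wanted_in_register):
    out = list(seq)
    if p_in_register:                 # p lives in r1: drop its load/store
        out.remove("load  r1,p")
        out.remove("store r1,p")
    if value_wanted_in_register:      # consumer reads r2 directly
        out.remove("store r2,tmp")
    return out

print(prune(GENERIC, True, True))
# ['load  r2,0(r1)', 'add   r1,r1,4']
```

The selected instructions never change; only redundant loads and stores fall away -- which is the claimed contrast with CISC instruction selection.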

Now, clearly 5 instructions may take longer than the 1 in your 68K example.
But, RISC machines are easier to pipeline, easier to speed up the clock
for, easier to provide staged functional units for, etc..  I don't
know of any CISC machines with 'hardwired' instruction sets.
Microcoding slows the machine down, but is typically the only way to fit
a CISC on a chip.  All this may mean that 5 instructions on a RISC
may be _faster_ than one on a CISC.

roelof@idca.tds.PHILIPS.nl (R. Vuurboom) (07/11/89)

In article <26257@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <25562@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>| In article <26247@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>| >No, RISCs don't *require* very good optimizers; they just make it easier
>| >to perform some optimizations.  Very simple compilers can be used with
>| >fairly good results.
    ^^^^^^^^^^^
>
>I don't agree with this.  Our experience with the early (PCC-based)
>internal compiler for the Am29000 showed pretty good performance with
					  ^^^^^^^^^^^
>little or no standard optimizations (no load scheduling,
>common-subexpression elimination, dead-code elimination, live/dead
>analysis for register allocation, etc.) The only things it performed
>were those required by the architecture, i.e.  delayed branches.  Of

Anybody care to quantify this? Just what sort of performance improvement
can I expect from no-holds-barred optimization over only-what-I-have-to
optimization?

>
>If by "exposed pipeline" you mean that performance will be
>enhanced by instruction scheduling, then I agree.  However, I don't

This term has been confusing me (too?). Is this the (an?) "accepted"
definition of exposed pipeline?

>agree that most RISCs will expose the pipeline to software like the i860
>-- most will provide hardware interlocks.
>
Well...who will and who won't?

>	-- Tim Olson
>	Advanced Micro Devices
>	(tim@amd.com)

-- 
Roelof Vuurboom  SSP/V3   Philips TDS Apeldoorn, The Netherlands   +31 55 432226
domain: roelof@idca.tds.philips.nl             uucp:  ...!mcvax!philapd!roelof

davet@oakhill.UUCP (David Trissel) (07/11/89)

In article <13980@lanl.gov> jlg@lanl.gov (Jim Giles) writes:

>> What about the required pairing of registers for double wide operations
>> such as floating-point or shifting? 

>In what way does a machine which requires register pairing qualify as
>a RISC?  

The point was that RISC architectures DO require the pairing. Can you name
any that don't?

>If an instruction requires 2 operands, they should be allowed
>to be any two general purpose registers. 

This is not what was being discussed.

>Furthermore, you are assuming
>that floating point is larger than other intrinsic types. 

Well, unless you limit yourself to only single-precision, floating-point IS
larger than other intrinsic types.

>The best RISCs are those which only have _one_ data size. 

Such an architecture would utterly fail in the marketplace. This is so far
off base it doesn't need a rebuttal.

>(By the way, my model of a reasonable RISC would be a Cray-I instruction 
>set without vectors.  This is certainly RISCy - all data is 64 bits, all 
>operations are reg to reg, only one memory addressing mode, etc..)

Do you have any inkling why your ideas aren't being implemented by RISC 
designers?

>So, this issue _does_ have a bearing on the complexity of the compiler.
>If you are not willing to provide the sophisticated compilers required
>to adequately use a machine, you have wasted money (read: design effort,
>chip space, etc.) on the hardware.

Nope, no bearing on the compiler. There is no gun being pointed at the compiler
writer forcing her to utilize all the available instructions. Your assumption
that most of the instructions have to be used "to adequately use a machine"
is simply incorrect.

>All the optimizations required on a RISC are also required on a CISC.  

Wrong. 68K compilers don't have to worry about register pairing. 68K compilers
don't have to worry about branch delayed slot filling, setting up segment
registers for data addressability, etc.

>CISC just adds more complexity to the mix.

I showed otherwise in that RISC has its own set of headaches. I don't see
much difference in the total amount of work involved for either RISC or
CISC.

>Now, clearly 5 instructions may take longer than the 1 in your 68K example.

Actually RISCs only require 2 or 3 instructions to do the work of
the one in that example.

>All this may mean that 5 instructions on a RISC may be _faster_ than one 
>on a CISC.

Not when using the latest CISCs.

 -- Dave Trissel  Motorola Austin

roelof@idca.tds.PHILIPS.nl (R. Vuurboom) (07/11/89)

In article <13980@lanl.gov> jlg@lanl.gov (Jim Giles) writes:

>that floating point is larger than other intrinsic types.  The best RISCs
>are those which only have _one_ data size.  (By the way, my model of a
			   ^^^^^

Why?

-- 
Roelof Vuurboom  SSP/V3   Philips TDS Apeldoorn, The Netherlands   +31 55 432226
domain: roelof@idca.tds.philips.nl             uucp:  ...!mcvax!philapd!roelof

daveh@cbmvax.UUCP (Dave Haynie) (07/11/89)

in article <13980@lanl.gov>, jlg@lanl.gov (Jim Giles) says:

> From article <2190@oakhill.UUCP>, by davet@oakhill.UUCP (David Trissel):
>> In article <13976@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>>    mov.l  (%an),%dn
>>    add.l  &4,%an
>> or the faster
>>    mov.l  (%an)+,%dn
>> The 68K requires a routine in the compiler peephole optimizer to "discover" 
>> and implement this optimization. But the result is a single 16-bit instruction
>> which (I think) executes in a single clock on the MC68040.

> Now, clearly 5 instructions may take longer than the 1 in your 68K example.
> But, RISC machines are easier to pipeline, easier to speed up the clock
> for, easier to provide staged functional units for, etc..  I don't
> know of any CISC machines with 'hardwired' instruction sets.  

That's probably why the 68040 was mentioned in the original example.  It has a
decent number of hardwired instructions, thus the aforementioned move with
post increment executes in one _clock_ (at least I had the same recollection;
many of the instructions execute in a single clock).  You're going to be hard
pressed to make a RISC do better than one instruction/clock (some do by 
exploiting parallel execution units, a concept not limited to RISC), and if your
RISC machine needs more instructions to achieve the same result, you lose. 
The main advantage any RISC is going to have in the long run is its 
simplicity.  A thing like the 68040 can adopt many if not all of the tricks 
you'll find in RISC machines, but it's a very large processor, and you're 
probably not going to see it in ECL or GaAs anytime soon.  Its CMOS process 
is probably not good for much over 50MHz (top speed on the 68030 anyway).  
Folks are talking about ECL designs starting at 60MHz-80MHz and continuing on 
with GaAs into the 100MHz to 200MHz range.  The first CPUs that get there will
be extremely simple ones so that the number of chips (and thus slow external 
connections) can be kept to a minimum.  And so they can be made in the first
place with the very latest technologies, which aren't counting millions of
transistors yet.
-- 
Dave Haynie Commodore-Amiga (Systems Engineering) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
           Be careful what you wish for -- you just might get it

henry@utzoo.uucp (Henry Spencer) (07/11/89)

In article <13980@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>> What about the required pairing of registers for double wide operations
>> such as floating-point or shifting? 
>
>In what way does a machine which requires register pairing qualify as
>a RISC?  ...

Don't be too hard on the guy; he's judging all RISCs by the lousy one he's
got.  Only Motorola could build a RISC with register pairing... :-)

>... If you are not using those
>extra instructions, you might as well have a RISC which provides
>only the instructions you _do_ use.  The hardware designer could
>then spend more time making those work faster...

In fact, it is notoriously true that many CISCs run faster if the compiler
writer treats them as RISCs and ignores all the fancy stuff.

>...RISC machines are easier to pipeline, easier to speed up the clock
>for, easier to provide staged functional units for, etc..  I don't
>know of any CISC machines with 'hardwired' instruction sets...

I can think of a couple of old ones.  The first pdp11, the 11/20, was
hardwired.  (This accounts for some of the little irregularities in the
11 instruction set, in fact, like the way INC isn't quite the same as
ADD #1...)  And I seem to recall that the 360/75 was mostly hardwired,
for speed.
-- 
$10 million equals 18 PM       |     Henry Spencer at U of Toronto Zoology
(Pentagon-Minutes). -Tom Neff  | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

frazier@oahu.cs.ucla.edu (Greg Frazier) (07/12/89)

In article <151@ssp1.idca.tds.philips.nl> roelof@idca.tds.PHILIPS.nl (R. Vuurboom) writes:
>
>This term has been confusing me (too?). Is this the (an?) "accepted"
>definition of exposed pipeline?

My understanding is that an exposed pipeline is one in which
interlocks, register reservations, etc. are done in software.
Thus, if you have a 10 cycle divide and a 2 cycle addition,
and have in your machine code
R1 <- R2 / R3
R4 <- R1 + R2,
you are going to get an incorrect value in R4, unless you
stick a bunch of no-ops in between.  In an unexposed pipeline,
the compiler does not have to know how long operations take;
the hardware performs register reservations and prevents such
things from happening.  Hence, in my second posting, the
question as to the applicability of discussing exposed/unexposed
pipelines in the context of a RISC, even though I was the one
who brought it up in the first place.  Yes, I'm a bit of a
chuckle-head :-)

Greg Frazier
******************@@@@@@@@@@@@@@@@@@@@@@@@@====================
Greg Frazier	    o	Internet: frazier@CS.UCLA.EDU
CS dept., UCLA	   /\	UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
	       ----^/----
		   /

paul@moncam.co.uk (Paul Hudson) (07/12/89)

In article <2199@oakhill.UUCP> davet@oakhill.UUCP (David Trissel) writes:

   In article <13980@lanl.gov> jlg@lanl.gov (Jim Giles) writes:

   >> What about the required pairing of registers for double wide operations
   >> such as floating-point or shifting? 

   >In what way does a machine which requires register pairing qualify as
   >a RISC?  

   The point was that RISC architectures DO require the pairing. Can you name
   any that don't?

ARM. (Acorn Risc Machine) last time I looked ...

   >If an instruction requires 2 operands, they should be allowed
   >to be any two general purpose registers. 

   This is not what was being discussed.
But it is relevant to the complexity of the compiler.

   >Furthermore, you are assuming
   >that floating point is larger than other intrinsic types. 

   Well, unless you limit yourself to only single-precision, floating-point IS
   larger than other intrinsic types.

	[ omitted ]

   >So, this issue _does_ have a bearing on the complexity of the compiler.
   >If you are not willing to provide the sophisticated compilers required
   >to adequately use a machine, you have wasted money (read: design effort,
   >chip space, etc.) on the hardware.

   Nope, no bearing on the compiler. There is no gun being pointed at the compiler
   writer forcing her to utilize all the available instructions. Your assumption
   that most of the instructions have to be used "to adequately use a machine"
   is simply incorrect.

But what's the point in having those instructions, then? Why not use all that silicon
real-estate for something else?

   >All the optimizations required on a RISC are also required on a CISC.  

   Wrong. 68K compilers don't have to worry about register pairing. 68K compilers
   don't have to worry about branch delayed slot filling, setting up segment
   registers for data addressability, etc.

Neither do the compilers for the ARM (except for addressing literals,
and that wasn't difficult).

   >CISC just adds more complexity to the mix.

   I showed otherwise in that RISC has its own set of headaches. I don't see
   much difference in the total amount of work involved for either RISC or
   CISC.

   >Now, clearly 5 instructions may take longer than the 1 in your 68K example.

   Actually RISCs only require 2 or 3 instructions to do the work of
   the one in that example.

   >All this may mean that 5 instructions on a RISC may be _faster_ than one 
   >on a CISC.

   Not when using the latest CISCs.

I think all this shows that it depends on the particular examples of
CISC or RISC rather than being generic. I would argue that most RISCs
are more regular in their instruction sets than most CISCs, and that
this does make compilers significantly easier.

My experience in this is limited to writing a replacement code
generator for pcc for the Acorn Risc Machine. The result had only
peephole optimisation and couldn't compete with a fully-optimizing
compiler, but still gave quite good results.

More importantly, it was working quite quickly - here the regularity
of register use and instructions was a big win.

--
Paul Hudson	 MAIL: Monotype ADG, Science Park, Cambridge, CB4 4FQ, UK.
		PHONE: +44 (223) 420018	  EMAIL: paul@moncam.co.uk,
	;"	  FAX: +44 (223) 420911		 ...!ukc!acorn!moncam!paul
 `"";";"        "/dev/null full: please empty the bit bucket"

tim@crackle.amd.com (Tim Olson) (07/12/89)

In article <151@ssp1.idca.tds.philips.nl> roelof@idca.tds.PHILIPS.nl (R. Vuurboom) writes:
| Anybody care to quantify this? Just what sort of performance improvement can I
| expect from no-holds-barred optimization over only-what-I-have-to optimization.?

Here are a couple of data points.  The only optimizations performed by
the internal pcc-derived compiler were delayed-branch slot filling,
loop-rotation, leaf-procedure optimization (no frame allocated), and
some loop-invariant code-motion (mainly constant addresses).  We can
compare this to the MetaWare High-C compiler for the Am29000, which
performs many more optimizations, including common-subexpression
elimination, dead-code elimination, constant and variable propagation,
register assignment by coloring, etc:

Benchmark	VAX 11/780 pcc	29K pcc		29K MetaWare

diff		   3.2 s	0.208 s		0.157 s (+32%)
grep		   2.1 s	0.193 s		0.142 s (+35%)
nroff		   7.1 s	0.564 s		0.507 s (+11%)

(29K simulation model was 25MHz, with separate 8Kbyte caches, which were
2-cycle first access, single cycle burst.)

So you can see that there is a definite improvement, but it certainly
isn't the 3X - 5X implied by the assertion that "you have to use
highly-optimizing compilers with RISC, otherwise you might as well use a
CISC processor."

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

fotland@hpihoah.HP.COM (Dave Fotland) (07/12/89)

> davet@oakhill.UUCP (David Trissel) writes:
>In article <13980@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>
>>> What about the required pairing of registers for double wide operations
>>> such as floating-point or shifting? 

>>In what way does a machine which requires register pairing qualify as
>>a RISC?  

>The point was that RISC architectures DO require the pairing. Can you name
>any that don't?

Hewlett Packard precision architecture doesn't require register pairing.
Double shifts can use any two general registers.  The floating point
registers are separate from the general registers and each can hold either
a 32 bit or 64 bit floating point value.  The compiler writers were against
requiring pairing and the floating point experts felt that double precision
was more important than single so there was no need to try to put two
singles in a double precision register.  This is also why double precision
floating point is only slightly slower than single on most HP-PA machines.

Floating point operations are naturally slower than integer operations
so it makes sense to have them handled by a coprocessor so they can overlap
with integer operations.  Having separate floating point registers makes
the coprocessor design easier and saves communication between the IU and FPU.

David Fotland
fotland@hpda.hp.com

mac3n@babbage.acc.virginia.edu (Alex Colvin) (07/13/89)

>   The Honeywell (now Bull) DPS-8 is a conscious attempt at a
> very-complex-instruction set computer, based on the basic architecture
> of the era of IBM 7090s and DEC-10s.

> .... Given a compiler (say, FORTRAN
> IV) for the basic machine language, one can add all the constructs for
> PL/1 and (you should pardon the expression) COBOL to your language in
> about two man-days: Honeybun put the primitives into the EISbox
> (Extended Instruction Set).

>   The machine is still in production, is still very CISC, and actually
> runs rather well.  They really did "narrow the semantic gap between
> the machine language and the language understood by the compiler",
> which was the reason d'etre of the vCISC machines.

Only for some languages.  PL/I, COBOL.  Try C.  While the machine does
understand character strings, it doesn't understand single characters very
well.  They're pretty much restricted to the EIS instructions, which are
strictly memory-memory, in contrast to the rest of the instruction set,
which is memory-register.

>   One can even generate good code for them (;-)), because they're
> either regular, or only exist in one variant.

Sure, it's easy to do optimal register allocation when all you've got is one
good register (EAQ).

>  Mind you, I can't recommend the underlying order code (the stuff that
> preceded the EISbox) to my worst enemy.  The machine is a sorta wart
> with an elegant bag on the side.

"Order code" gives you an idea of the age of the basic architecture.  But
EIS is a mess.  Seriously, I grew up on this machine, and I can say that I
prefer i80*86s.  Perhaps I'm biased by the early flaky implementations of
EIS (such as 0-length string copy faulting) and the wild variation in
instruction timing among different models.

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (07/13/89)

In article <199@dg.dg.com> uunet!dg!rec (Robert Cousins) writes:
>of compute power.  On the 88K, we need less than 1 MIPS to generate
>a Dhrystone MIPS.  In other words, we get Dhrystone MIPS > CPU
>clock speed.  When this first happened here, many people were

Could you clarify what you mean? I know of no correlation between
"Dhrystones/second" and "Instructions issued/second".

Also, on a lot of machines, the Dhrystone rating is quite a bit higher
than a simple "MIPS"/"VAX MIPS" ratio would indicate, thus demonstrating
that Dhrystone is but one of many benchmarks, and the 11/780 does better
on others...

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

rec@dg.dg.com (Robert Cousins) (07/13/89)

In article <28471@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>In article <199@dg.dg.com> uunet!dg!rec I wrote:
>>of compute power.  On the 88K, we need less than 1 MIPS to generate
>>a Dhrystone MIPS.  In other words, we get Dhrystone MIPS > CPU
>>clock speed.  When this first happened here, many people were
>Could you clarify what you mean? I know of no correlation between
>"Dhrystones/second" and "Instructions issued/second".
> ....
>  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
>  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
>  Moffett Field, CA 94035     
>  Phone:  (415)694-6117       

Before going any further, I would like to point out that I am (was) pointing
out anecdotal information.  When we first began to run benchmarks on the
AViiON 88K products, we came up with Dhrystone numbers which, when converted to 
VAX MIPS, came out higher than the raw MIPS of the machine.  The 16.67 MHz 
machine has a peak raw MIPS rating of 16.67 MIPS (1 instruction per clock).
16.67 Dhrystone MIPS is 29289 Dhrystones/second.  Since we were getting 
more Dhrystones/second than this, some people were surprised.

Hugh, I agree that Dhrystone is one benchmark in a whole field of
benchmarks.  However, I feel comfortable drawing a simple conclusion
from this result:  It is not clear that a CISC processor will take
FEWER INSTRUCTIONS to solve a problem than a RISC processor.  In some
cases RISCs may require fewer instructions.

This contrasts with some early claims that RISCs could execute 
short instructions in a shorter time than a CISC could execute long 
instructions doing the same work.  From then on, in some circles,
RISCs have been tacitly assumed to require more instructions to
do a given job than CISCs.

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.

slackey@bbn.com (Stan Lackey) (07/13/89)

In article <199@dg.dg.com> uunet!dg!rec (Robert Cousins) writes:
>Concerning your optimization comment, I would like to throw out an
>interesting phenomenon we have noticed (perhaps other RISCers will
>comment also):  At one point, people were talking about RISC processors
>needing to execute MORE instructions to do the same amount of work.
>THis implied that where a CISC might require 100 instructions, a
>RISC might require >100 instructions, though the RISC instructions
>would take less time.  We have noticed that in many cases, the 
>instruction count goes down.  

I assume you mean 'instruction count goes down relative to our old
architecture'.  I bet going from 4 to 30 registers did something for
register spills and lots of other interesting stuff...  As well as
byte addressability...
:-) Stan

schow@bnr-public.uucp (Stanley Chow) (07/14/89)

In article <200@dg.dg.com> uunet!dg!rec (Robert Cousins) writes:
>
>Hugh, I agree that Dhrystone is one benchmark in a whole field of
>benchmarks.  However, I feel comfortable drawing a simple conclusion
>from this result:  It is not clear that a CISC processor will take
>FEWER INSTRUCTIONS to solve a problem than a RISC processor.  In some
>cases RISCs may require fewer instructions.
>

This is such a simple issue that arguing is pointless. You can just
post the static and dynamic instruction counts for some sample CPUs
and settle the question.

Static numbers ought to be easy to get. Dynamic numbers can be either
true dynamic counts or approximations from counting inner loops.
Since you are making a "surprising" claim, you ought to come up with
the numbers.



Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831		     ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
Me? Represent other people? Don't make them laugh so hard.

acockcroft@pitstop.West.Sun.COM (Adrian Cockcroft) (07/14/89)

> In other words, we get Dhrystone MIPS > CPU
> clock speed.  When this first happened here, many people were
> mumbling about clocks running backward and Escher paintings . . . . :-)
> 
> Robert Cousins

I'm interested to see what code the compiler produced for the 88k.
Could you mail me your Dhrystone source code and the resultant
assembler output (cc -S on Sun)?  I'll compile your source and compare it
to the SPARC equivalent to try to see what optimisations have been
performed. I can also run the SPARC Dhrystone through SPARCsim to
get an instruction-mix analysis. If I get anywhere I will post the results
back here.

As an aside, I used to use the Greenhills 68k C compiler a lot for
realtime work, and I found that the compiler's optimisations were
style-dependent. I eventually found what sort of expressions the
compiler could cope with best by tweaking the C code and looking
at the assembler produced. In particular, strength-reduction
optimisations were very dependent on the way I wrote loops.


-- 
Adrian Cockcroft Sun Cambridge UK TSE sun!sunuk!acockcroft
Disclaimer: These are my own opinions

mash@mips.COM (John Mashey) (07/15/89)

In article <28471@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>In article <199@dg.dg.com> uunet!dg!rec (Robert Cousins) writes:
>>of compute power.  On the 88K, we need less than 1 MIPS to generate
>>a Dhrystone MIPS.  In other words, we get Dhrystone MIPS > CPU
>>clock speed.  When this first happened here, many people were
>
>Could you clarify what you mean? I know of no correlation between
>"Dhrystones/second" and "Instructions issued/second".
>
>Also, on a lot of machines, the Dhrystone rating is quite a bit higher
>than a simple "MIPS"/"VAX MIPS" ratio would indicate, thus demonstrating
>that Dhrystone is but one of many benchmarks, and the 11/780 does better
>on others...

Yes, 100% [I'm on the road with little bandwidth, but I just can't pass
this one up...]
1) Even with no gimmickry, Dhrystone is the benchmark of choice for showing
how much better than a VAX you are, because Dhrystone has a lower-than-usual
number of cycles per subroutine call, i.e., it seriously penalizes anything
with a slow function-call sequence.  This data has been published for years.....

2) With the current level of compiler gimmickry, you must obey Hr. Dr. Weicker's
advice about disbelieving anything unless you see the generated code.
What does any of this mean when you can change the numbers +40% based on
a single optimization of a single statement???
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

hankd@pur-ee.UUCP (Hank Dietz) (07/16/89)

Let's assume that the RISC and CISC machines under consideration are
identical except for instruction encoding; the CISC being microcoded and
each RISC instruction corresponding to one micro instruction.

CISC compilers tend to pack as many operations together in single
instructions as they possibly can, but this implies a certain amount of
redundancy because the complex instructions represent general case sequences
of micro instructions.  Initially, there will be several RISC instructions
for each of the CISC instructions, perhaps 3:1.  However, many of them will
be optimized away.

This is because the RISC compiler can recognize and remove redundancies in
the CISC micro instruction sequences.  Generally, RISC implies more available
expressions, which makes value propagation and common subexpression
elimination more productive.  Taking arbitrary basic blocks of RISC code and
doing rather simple analysis, we have often found that only about 1/4 of the
operations remain after optimization...  this is a very unlikely scenario
for CISC operations.

In summary, redundant "micro instructions" of the CISC operations nearly all
disappear from the optimized RISC code, hence a 1:1 ratio of RISC:CISC is
quite possible.  How do you get better than 1:1?  The only way is if some
entire CISC instructions are useless; i.e., you can't do better than 1:1
unless the RISC compiler is "smarter" than the CISC compiler.

								-hankd

PS: If the above sounds like "why nobody should ever build a CISC," let me
    remind folks that we're assuming instruction bandwidth isn't an issue
    and the code you care about really has such low-level redundancies.  Of
    course, finer-grain (RISC) compiler control can require more instruction
    bits per computation, so there's still plenty to argue about. ;-)

lord@se-sd.NCR.COM (Dave Lord ) (07/20/89)

In article <199@dg.dg.com> uunet!dg!rec (Robert Cousins) writes:
:In article <13976@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
::Every once in a while, someone in this newsgroup makes the claim
::that RISC trades off hardware complexity for compiler complexity.
::This is simply _NOT_ true.  It is always _easier_ to write a compiler
::for a RISC machine than for a CISC machine.
:
:I must agree.  Our experience on the 88K with a number of pieces 
:of software from gcc to an internally developed FORTH interpreter 
:has shown that RISCs are easier to generate code for.  I would 
:carry this a step further.  While the assembly language development 
:effort is minimal, we have had little difficulty with assembly 
:language to the point where I wonder if RISCs might not be easier 
:to program at that level, also.  If a programmer can do it easily . . . .

Uh, yes and no. Compilers for _some_ languages may be easier to
write for RISC but having recently been working on a COBOL compiler
for the 88K I can tell you that a _huge_ amount of work is involved
in handling commercial data types (decimal and variations of decimal).

I will agree that contrary to popular belief writing assembly language
code for the 88k is a piece of cake.

guy@auspex.auspex.com (Guy Harris) (07/21/89)

>Uh, yes and no. Compilers for _some_ languages may be easier to
>write for RISC but having recently been working on a COBOL compiler
>for the 88K I can tell you that a _huge_ amount of work is involved
>in handling commercial data types (decimal and variations of decimal).

So, just out of curiosity:

	1) how much work relative to the work involved on CISCs with
	   decimal instructions, counting both the compiler work *and* the
	   incremental CPU design work for the decimal instructions for
	   *all* generations of the CISC (unless, of course, the compiler
	   work for the RISC has to be redone for subsequent generations)?

	2) how much work relative to the work involved on CISCs without
	   decimal instructions, or at least without the whizzo ones on
	   370s and VAXes and the like?

davet@oakhill.UUCP (David Trissel) (07/22/89)

In article <1989Jul9.202858.27121@jarvis.csri.toronto.edu> norvell@csri.toronto.edu (Theodore Stevens Norvell) writes:
>
>But the choice of instructions is not the hard part of instruction selection
>(on a CISC).  There is also the choice of address modes.

Valid point but that's only one side of the coin. Indeed our 68K compiler has 
extra templates for recognizing the more complex trees (actually called 'shapes'
for you pcc2 fans) that some addressing modes require.  However, our 88K 
compiler template file has become just as large as that of the 68K compiler 
because it requires extra templates to catch several special cases of 3 
register operand possibilities that are missed with the simple classic stock 
templates one would create for a base 88K machine. We found it rather
fascinating that the template files for the two machines are almost the same 
size.

>compiling the C statement   i = *p ;  where i and p are both memory (not
>register) variables.  Do you:
>	(A) Emit a move using an indirect address mode.
>	(B) load p and do a memory to memory move
>	(C) load *p with an indirect address mode and store i
>	(D) load p, load *p and then store i
> [...]
>In other words, even in a simple assignment statement, there are phase ordering
>problems between instruction selection and, optimization, register allocation
>and scheduling.

You are correct. But then the 68K compiler doesn't have the massive headache
of trying to support data addressability segment registers because of the RISC
inability to directly access an item with a full 32-bit address. Both are
massive headaches, which gets back to my original point some postings ago
that I don't see one architecture having a big advantage over the other when
it comes to purely 'amount-of-work-to-build-a-compiler' issues.

>
>Mr. Trissel's group seem to have solved this phase ordering problem by
>putting some of the address mode selection into a peephole optimizer.  
>This seems reasonable, but it moves the complexity to the peephole optimizer.

Correct again.  But once you figure in the amount of work that a RISC compiler
(or its assembler) requires for code reordering [and for some non-Motorola 
RISCs the pipeline interlock worries -> :-)], is RISC really ahead in the amount
of work required? 

>>>To make matters worse, instruction selection is context sensitive.
>>>This means that different code ordering during optimization could
>>>invalidate the instruction selection choice (or, at least make the
>>>selected instructions sub-optimal).
>>
>>True for both RISC and CISC. No difference.
>
>The difference is that if there is only one good way to select the
>instructions (as is more often true on a RISC than a CISC), then there
>is no phase ordering problem, You simply select the instructions first
>with no fear of picking a sequence that screws up the register allocation
>or scheduling.  Not because it won't, but because there probably wasn't
>another sequence that would have been better.

But it seems to me that once you take into consideration the requirement that
a RISC compiler has to support code-reordering scheduling to produce decent
output, then the code reorderer must be every bit as complex as a 68K peephole 
optimizer (for example).  A reorderer has to keep track of live/dead 
registers and operand uses/defines, and know what can be swapped 
around to where, etc., which is even more complex in many respects than what 
my 68K peephole is required to do.

-- Dave Trissel   Motorola Austin

mash@mips.COM (John Mashey) (07/23/89)

In article <2231@oakhill.UUCP> davet@oakhill.UUCP (David Trissel) writes:

>But it seems to me that once you take into consideration the requirement that
>a RISC compiler has to support code reorder scheduling to produce decent
>output then the code reorderer must be every bit as complex as a 68K peephole 
>optimizer (for example.) A reorderer has to keep up with live/dead 
>registers, operand use/defines, have the knowledge of what can be swapped 
>around to where etc. which is even more complex in many respects than what 
>my 68K peephole is required to do.

How about publishing some data on the size and complexity of the
peephole for the 68K and the code reorderer for the 88K?
This would seem to be a useful data point, esp. if the rest of the
compiler system is similar.  Also, estimates of staff-weeks to do this
would be really nice, to help start building an element of quantitative
data into this ongoing discussion.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

preston@titan.rice.edu (Preston Briggs) (07/23/89)

In article <23868@winchester.mips.COM> mash@mips.COM (John Mashey) writes:

>How about publishing some data on the size and complexity of the
>peephole for the 68K and the code reorderer for the 88K?
>This would seem to be a useful data point, esp. if the rest of the
>compiler system is similar.  Also, estimates of staff-weeks to do this
>would be really nice, to help start building an element of quantitative
>data into this ongoing discussion.

Well, here's a data point, though for an IBM/RT.
In an experimental FORTRAN compiler,
I do instruction selection and choice of addressing modes
via peephole optimization (ala Fraser and Davidson).
The code is pretty simple and short, less than 250 lines of C.
For the integer portion of the machine, there are only about 70 patterns.
These give complete coverage of the (very limited) instruction set.
For the floating point portion, I have more than 200 patterns,
evenly split between single and double precision.
However, these cover only a fraction of the available addressing
mode combinations.

I do code scheduling before and after register allocation,
using the scheme of Gibbons and Muchnick (SIGPLAN '86).
I found it significantly more difficult to program,
at least to achieve good space and time performance.
The code size is less than 450 lines of C.

Unfortunately, I can't give you good guesses about the
development time.  Less than two years of grad student labor though!

A version for the 68K would avoid the scheduling problems,
but the number of peephole patterns needed for complete
coverage of the instruction set would be prohibitive
(millions).  Fraser and Davidson would either use a
table-driven interpreter to get complete coverage,
or use a large set of test programs and select a
useful subset of the available patterns.
These schemes, separately or together, would however
be more difficult to program.

Preston