[comp.arch] '040 vs. SPARC

baum@Apple.COM (Allen J. Baum) (02/08/90)

[]
>In article <160@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:
>Has anyone seen a 68040?  I thought not.  You are comparing
>a chip that won't ship until this summer with one that is in
>a machine that has been in production for some time.  This
>occurs over and over in the RISC/CISC debate, but that doesn't
>seem to keep people from making these silly comparisons.

I'm afraid that I have to agree with this one. The '040 has hit
silicon, but you're still comparing a 1.2-million-transistor chip,
with built-in caches, to a much smaller chip (how many transistors in
the Cypress SPARC, anyone know?) 

It is still very significant that they are claiming to be faster AT THE SAME
CLOCK RATE. It also took them a few more years to build the complex chip
that would do that - not an easy task, even with the extra time. The Moto
folks appear to have done a very nice job on the design of this chip.

We still need to wait for real benchmarks. Anyone planning SPECmarks for
the '040?

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (02/08/90)

In article <38415@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:

| It is still very significant that they are claiming to be faster AT THE SAME
| CLOCK RATE. It also took them a few more years to build the complex chip
| that would do that - not an easy task, even with the extra time. The Moto
| folks appear to have done a very nice job on the design of this chip.

  This is very impressive. I would like to propose using MISC instead of
CISC, since the microcode which used to require many cycles per
instruction is now replaced by hard logic for virtually all of the
instructions, maybe all in the 040. I expect the 586 to have 1+
instructions per cycle average, too, indicating that traditional RISC
may have been the way to go when chips were small, and that richer
instruction sets may become possible in the next decade without giving
up any performance.

-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

jskuskin@eleazar.dartmouth.edu (Jeffrey Kuskin) (02/08/90)

In article <2101@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>In article <38415@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>
>| It is still very significant that they are claiming to be faster AT THE SAME
>| CLOCK RATE. It also took them a few more years to build the complex chip
>| that would do that - not an easy task, even with the extra time. The Moto
>| folks appear to have done a very nice job on the design of this chip.
>
>  This is very impressive. I would like to propose using MISC instead of
>CISC, since the microcode which used to require many cycles per
>instruction is now replaced by hard logic for virtually all of the
>instructions, maybe all in the 040. I expect the 586 to have 1+
>instructions per cycle average, too, indicating that traditional RISC
>may have been the way to go when chips were small, and that richer
>instruction sets may become possible in the next decade without giving
>up any performance.


Yes, but how much do we benefit from the richer instruction sets, even
if all the instructions are hardwired and execute at 1 cycle/instruction?
Isn't one of the RISC folks' main arguments for simple instruction sets
that current compilers don't effectively exploit the complex addressing
modes and instructions supported in CISC chips?  Perhaps someone would
like to speculate on what progress the next decade will bring in
compiler technology...

-- Jeff Kuskin, Dartmouth College

jskuskin@eleazar.dartmouth.edu

scott@bbxsda.UUCP (Scott Amspoker) (02/08/90)

In article <19233@dartvax.Dartmouth.EDU> jskuskin@eleazar.dartmouth.edu (Jeffrey Kuskin) writes:
>Yes, but how much do we benefit from the richer instruction sets, even
>if all the instructions are hardwired and execute at 1 cycle/instruction?
>Isn't one of the RISC folks' main arguments for simple instruction sets
>that current compilers don't effectively exploit the complex addressing
>modes and instructions supported in CISC chips?  Perhaps someone would
>like to speculate on what progress the next decade will bring in
>compiler technology...

Well, it doesn't take much to find instructions on a 680x0 that are
not used by a C compiler.  However, my code tends to do a lot of
structure accesses with pointers such as "pointer->field".  The
68020 double-indirect-with-offset addressing mode is a real life
saver and I haven't seen that on the few RISC machines I've used.
Admittedly, a good optimizing compiler would not need such a
mode.
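
The access pattern in question can be sketched in C. The function and type
names here are hypothetical, not Scott's code; the comments note how a 68020
could fold the second dereference into one memory-indirect instruction where
a typical RISC needs two plain loads.

```c
#include <assert.h>

/* Hypothetical linked structure -- the kind of "pointer->field"
 * chain discussed above. */
struct node {
    int value;
    struct node *next;
};

/* p->next->value: on a 68020 the second dereference can use the
 * double-indirect-with-offset mode in a single instruction; a
 * typical RISC emits two loads instead, roughly:
 *   load  t0, next_offset(p)     ; t0 = p->next
 *   load  t1, value_offset(t0)   ; t1 = t0->value
 */
int next_value(const struct node *p)
{
    return p->next->value;
}
```

Either way the C source is the same; the difference is entirely in what the
compiler and the addressing hardware do with it.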

-- 
Scott Amspoker
Basis International, Albuquerque, NM
(505) 345-5232
unmvax.cs.unm.edu!bbx!bbxsda!scott

davec@proton.amd.com (Dave Christie) (02/09/90)

In article <2101@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
| 
|    I would like to propose using MISC instead of
| CISC, since the microcode which used to require many cycles per
| instruction is now replaced by hard logic for virtually all of the
| instructions, maybe all in the 040. I expect the 586 to have 1+

According to an article on the '040 in this week's EE Times:

"The IU integer pipeline has three different [control] mechanisms:
an initial decode PLA for the EA [address formation] stage,
a ROM-driven microcode engine for following stages and a finite
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
state machine to control the EU stage."

Just thought I'd pass that on....

--------
Dave Christie

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (02/09/90)

In article <19233@dartvax.Dartmouth.EDU> jskuskin@eleazar.dartmouth.edu (Jeffrey Kuskin) writes:

| Yes, but how much do we benefit from the richer instruction sets, even
| if all the instruction are hardwired and execute at 1 cycle/instruction?
| Isn't one of the RISC folks' main arguments for simple instruction sets
| that current compilers don't effectively exploit the complex addressing
| modes and instructions supported in CISC chips?  

  A fair question, but hard to answer in the middle ground. There are
some instructions which make code generation and execution faster for
almost all applications, such as mpy and div. The question has always
been whether the gates could be better used to make something else faster,
not whether those instructions would be useful. At the other end, there are
instructions which are really special purpose, and I don't think that
anyone would argue for including them in a general purpose CPU, such as
the FFT instruction I discussed here a few weeks ago.

  The answer is that instructions should be added if the sequence of
simple instructions to do the same thing is (a) common, and (b) slower.
If the sequence is more than a few instructions long some tradeoff comes
in because fewer instructions mean fewer hits on the memory. The guide
has got to be the overall speed of the CPU for a general mix (assuming a
g.p. CPU), rather than aiming for a single benchmark. This compromise
leaves room for lots of competition, because performance is, to some
extent, a function of load characteristics.
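
Multiply is the classic instance of the "common and slower" test above:
before hardware mpy, the equivalent sequence was a shift-and-add loop.
A minimal sketch (my own illustration, not any particular chip's
microcode):

```c
#include <assert.h>
#include <stdint.h>

/* Software multiply as a loop of shifts and adds -- the sequence
 * of simple instructions that a hardware mpy replaces.  It is both
 * common and much slower than one instruction, so it passes the
 * add-an-instruction test described above. */
uint32_t shift_add_mul(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b) {
        if (b & 1)        /* add the shifted multiplicand when the bit is set */
            product += a;
        a <<= 1;
        b >>= 1;
    }
    return product;       /* same result as a single a * b */
}
```

One hardware mpy replaces up to 32 iterations of this loop, which is why
it cleared the bar while something like an FFT instruction does not.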

  As long as the added instructions and addressing modes don't slow
down other stuff, directly or by stealing gates, they can be a net win.
Another compromise is in register scoreboarding. With a complex
instruction, part of the execution may be overlapped with execution of
following instructions. This rapidly gets into interactions between the
compiler quality and features.

  I am told that the 586 will have an SPU for the string operations.
While I would expect this to have very little effect on general
performance, kernel bitmap searches and bitblt *may* now be overlappable
with other things. Is this a better use of gates than more cache? Is the
rumor even true? I don't claim to have the answers, but I have some
programs which use strchr(), strcat(), memcpy(), and such *very*
heavily, and I would be willing to try writing a few routines in
assembler if I could get 20-30% better performance. You have to take
advantage of the hardware.
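
For reference, the kind of routine a string unit would accelerate is a
byte-at-a-time scan like this sketch of strchr semantics (my own code,
not any vendor's library):

```c
#include <assert.h>
#include <stddef.h>

/* Byte-at-a-time character search -- the inner loop a dedicated
 * string unit (if the 586 rumor is true) could run overlapped with
 * other work instead of tying up the integer pipeline. */
const char *my_strchr(const char *s, int c)
{
    for (;; s++) {
        if (*s == (char)c)
            return s;        /* found; may be the terminator itself */
        if (*s == '\0')
            return NULL;     /* end of string, no match */
    }
}
```

Every iteration is a load, two compares, and a branch, which is exactly
the sort of tight, predictable sequence worth offloading or hand-coding
in assembler for that 20-30%.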

  Some addressing complexity, at least in the area of autoincrement, is
usually a win, but it may require a smart compiler or scoreboarding to
make best use of it. Operating directly on memory is a favorite whipping
boy of the RISC people, but it often saves use of a register, saves two
instructions, and if it allows fewer registers to be implemented, or
fewer saved on a context switch with only dirty registers saved, it may
be an overall win.
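
The autoincrement case maps directly onto the C idiom below (a generic
copy loop of my own, for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* The "*p++" idiom is autoincrement addressing at source level:
 * access through the pointer, then bump it -- ideally in the dead
 * cycle after the address has been used, as argued above. */
void copy_ints(int *dst, const int *src, size_t n)
{
    while (n--)
        *dst++ = *src++;   /* one move plus two pointer increments */
}
```

Without the mode, each `++` becomes a separate add instruction; with it,
the increments can ride along with the load and store.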

  Sorry for the long reply, but I said initially that the question was
complex. 
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

yam@nttmhs.ntt.jp (Toshihiko YAMAKAMI) (02/09/90)

From article <2101@crdos1.crd.ge.COM>, by davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr):

>   This is very impressive. I would like to propose using MISC instead of
> CISC, since the microcode which used to require many cycles per
> instruction is now replaced by hard logic for virtually all of the
> instructions, maybe all in the 040. I expect the 586 to have 1+
> instructions per cycle average, too, indicating that traditional RISC
> may have been the way to go when chips were small, and that richer
> instruction sets may become possible in the next decade without giving
> up any performance.

It is impressive. In the next design, we have to think about how we
can fill the 5,000,000 or more transistors per chip that LSI
technology will offer us.

I agree on this point.

However, how about exploiting hidden parallelism?
As discussed in this group, RISC technology has exploited
hidden parallelism in high-level language descriptions.
When one accesses a variable in memory, a RISC chip loads it
into a register and then operates on it. The value remains
in the register, so it can be reused in another operation.
Current RISC optimizing compilers exploit this to a certain
level of sophistication.
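
The register reuse Yamakami describes looks like this at source level
(a made-up example of mine):

```c
#include <assert.h>

/* The value at *p is loaded from memory once; both later uses read
 * the register copy instead of touching memory again -- the hidden
 * parallelism a load/store architecture exposes to the compiler. */
int use_twice(const int *p, int *doubled)
{
    int t = *p;        /* single load; t lives in a register */
    *doubled = t * 2;  /* reuse the registered value... */
    return t + 1;      /* ...and reuse it again */
}
```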

I am very interested in another RISC/CISC war in this decade, the 1990s.

-- Toshihiko YAMAKAMI


Toshihiko YAMAKAMI	NTT Telecommunication Networks Laboratories
 Telephone:	+81-468-59-3781 	FAX:	+81-468-59-2546
 junet:	yam@nttmhs.ntt.jp		CSNET:	yam%nttmhs.ntt.jp@relay.cs.net
 snail-mail:	Take 1-2356-523A, Yokosuka, Kanagawa 238-03 JAPAN

tim@nucleus.amd.com (Tim Olson) (02/09/90)

In article <2105@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
|   As long as adding the instructions and addressing modes don't slow
| down other stuff, directly or by stealing gates, they can be a net win.
| Another compromise is in register scoreboarding. By using a complex
| instruction part of the execution may be overlapped with execution of
| following instructions. This rapidly gets into interactions between the
| compiler quality and features.

But the complex instruction typically binds many operations together,
*reducing* the ability to efficiently overlap subsequent operations.
However, if the complex instruction is split into its constituent
parts, there is much more opportunity for instruction scheduling.
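
Tim's point can be shown at source level. A memory-to-memory add binds a
load, an add, and a store into one unsplittable unit; written as its
constituent parts, independent work can be slotted between them (my own
illustration; a scheduling compiler does this on the instruction stream,
not the C source):

```c
#include <assert.h>

/* A memory-to-memory "add src,dst" decomposed into its constituent
 * operations, with an unrelated operation scheduled into the gap
 * between the loads and the store. */
int add_mem_split(int *dst, const int *src, int *counter)
{
    int a = *dst;      /* constituent load               */
    int b = *src;      /* constituent load               */
    *counter += 1;     /* independent work fills the gap */
    *dst = a + b;      /* constituent add + store        */
    return *dst;
}
```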

|   Some address complexity, at least in the area of having autoincr on
| things is usually a win, but it may require a smart compiler or
| scoreboarding to make best use of it.

Either this will take an extra cycle to write back the incremented
address register (in which case an explicit add is just as fast), or
an extra register file port just to write the incremented address at
the same time the load data is written.  If more register file ports
are going to be added, I'd rather issue multiple, general-purpose
instructions, which have a much greater chance of being used than a
limited auto-increment mode.



	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

bcase@cup.portal.com (Brian bcase Case) (02/09/90)

>It is still very significant that the[y] (Moto about the 040)
>are claiming to be faster AT THE SAME
>CLOCK RATE. It also took them a few more years to build the complex chip
>that would do that- not an easy task, even with the extra time.

Well, it's not entirely clear what clock rate means here.  It is 
interesting to note that the 040 doubles the clock internally and
uses four edges.  The "execute pipeline stage" does all of the
following *in one clock cycle*:  register read, ALU, and register
writeback.  That hardly sounds like a pipeline comparable to other
optimized pipelines.  Question:  could a 25 MHz 040 operate at
50 MHz with a better pipeline?  It seems the answer is yes.  What
does this say about RISC vs. CISC?  I don't know, and besides I
am speculating anyway.

REGARDLESS, the 040 is a really great chip and it will make some
damn nice Macintoshes.  Add a graphics accelerator (using a RISC,
let's say), and WOW!

bcase@cup.portal.com (Brian bcase Case) (02/09/90)

>  This is very impressive. I would like to propose using MISC instead of
>CISC, since the microcode which used to require many cycles per
>instruction is now replaced by hard logic for virtually all of the
>instructions, maybe all in the 040. I expect the 586 to have 1+
>instructions per cycle average, too, indicating that traditional RISC
>may have been the way to go when chips were small, and that richer
>instruction sets may become possible in the next decade without giving
>up any performance.

No, hard logic is there for the simple instructions; the complex ones
do a "Hold everything for a few cycles until we can finish this
thing."  Yes, the 586 may indeed have 1+ instructions per cycle
on average, but that will only be for programs that use the simple
instruction subset.

Hey guys, only certain kinds of instructions can be made to fit in
a reasonable-length, uniform, lock-step pipeline.  That's a fact and
it's one of the ones on which RISC is based.  CISC chips are looking
good because they are making the simple instructions go fast and because
the compilers are changing.  And that's a function of $$ available
for development efforts....

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (02/10/90)

In article <29099@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:

| But the complex instruction typically binds many operations together,
| *reducing* the ability to efficiently overlap subsequent operations.
| However, if the complex instruction is split into its constituent
| parts, there is much more opportunity for instruction scheduling.

  Performance depends on how it's done. If the CPU can't do anything
else when it starts a complex instruction, then the gains from possible
internal overlap of phases will have to outweigh the blocking of the
CPU. If the CPU can continue to execute at least some other
instructions, then a smart compiler can probably find instructions.

  This isn't black and white, where all complex instructions are a lose
and all simple ones are a win. Volume of instructions impacts memory
bandwidth, too.

| Either this will take an extra cycle to write back the incremented
| address register (in which case an explicit add is just as fast), or
| an extra register file port just to write the incremented address at
| the same time the load data is written.  If more register file ports
| are going to be added, I'd rather issue multiple, general-purpose
| instructions, which have a much greater chance of being used than a
| limited auto-increment mode.

  What I said about memory bandwidth applies here, but even more to the
point, a load or store through a pointer (address register) usually has
at least one cycle overhead after the address is used, even with cache.
This can be used to do the increment without slowing anything down, and
without running another instruction decode. The issue which I believe is
primary is whether the added complexity of the instruction decode slows it
down. Given the number of gates available I believe the answer is
"usually not."

  There are people who argue against having increment, stating that it's
not general purpose and that the increment should be two discrete
instructions, namely (1) load immediate value 1 into a 2nd register, and (2)
add 2nd register to the register to be incremented. I don't agree with
this, either, but I can see that it is the ultimate extension of the
RISC method.

-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

csimmons@jewel.oracle.com (Charles Simmons) (02/11/90)

In article <38415@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes:
> From: baum@Apple.COM (Allen J. Baum)
> Subject: '040 vs. SPARC (was: Next computer...)
> Date: 7 Feb 90 18:02:06 GMT
> 
> >In article <160@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:
> >Has anyone seen a 68040?  I thought not.  You are comparing
> >a chip that won't ship until this summer with one that is in
> >a machine that has been in production for some time.  This
> >occurs over and over in the RISC/CISC debate, but that doesn't
> >seem to keep people from making these silly comparisons.
> 
> I'm afraid that I have to agree with this one. The '040 has hit
> silicon, but you're still comparing a 1.2-million-transistor chip,
> with built-in caches, to a much smaller chip (how many transistors in
> the Cypress SPARC, anyone know?) 
> --
> 		  baum@apple.com		(408)974-3385
> {decwrl,hplabs}!amdahl!apple!baum

While I am firmly a RISC bigot, there is a good CISC argument here.
When comparing the '040 and Sparc, you are not comparing a 1.2M
transistor chip with a 120K transistor chip.  In the 1.2M transistors
of the '040, there's an ALU, FPU, portions of a cache, and probably an
MMU.  For an accurate comparison, you'd want to consider the Sparc chip
[ALU and portions of an MMU?], the FPU chip used with the Sparc, and
at least some of the transistors used to implement the off-chip Sparc
cache.

It has been suggested that when looked at in this light, the Sparc
uses just about as many transistors as the '040.

-- Chuck

tom@parcom.nl (Tom van Peer) (02/12/90)

yam@nttmhs.ntt.jp (Toshihiko YAMAKAMI) writes:

>From article <2101@crdos1.crd.ge.COM>, by davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr):

>However, how about exploiting hidden parallelism?
>As discussed in this group, RISC technology has exploited
>hidden parallelism in high-level language descriptions.
>When one accesses a variable in memory, a RISC chip loads it
>into a register and then operates on it. The value remains
>in the register, so it can be reused in another operation.

Big fun if you want to make a multi processor set-up.

-- 
Tom van Peer.
Parallel Computing, Amsterdam.
+31-20-233274
E-mail: tom@parcom.nl

jdarcy@pinocchio.encore.com (Jeff d'Arcy) (02/12/90)

Either yam@nttmhs.ntt.jp (Toshihiko YAMAKAMI)
	or davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr):
> However, how about exploiting hidden parallelism?
> As discussed in this group, RISC technology has exploited
> hidden parallelism in high-level language descriptions.
> When one accesses a variable in memory, a RISC chip loads it
> into a register and then operates on it. The value remains
> in the register, so it can be reused in another operation.

tom@parcom.nl (Tom van Peer):
> Big fun if you want to make a multi processor set-up.

Quite right, Tom.  In fact, it's big enough fun that doing this for any
non-local variables is probably too dangerous to try.  I don't know of any
scheme by which bus-snooper logic could tell the CPU to invalidate a value
in a register that wouldn't involve truly hideous complexity.  Fortunately,
access to shared variables is less frequent than access to locals, and in
many such cases you have to use more complex mutual-exclusion mechanisms
already.  If you have to go through all that anyway, the extra cost of not
being able to keep the value in a register is pretty negligible.
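
In C the escape hatch for exactly this is the `volatile` qualifier, which
forbids the compiler from caching the value in a register (a minimal
sketch of mine; the multiprocessor coherence question itself is the hard
part Jeff describes, and this only addresses the register-caching half):

```c
#include <assert.h>

/* A shared flag.  Without volatile, a compiler may load "done" once
 * and reuse the register; volatile forces a fresh load from memory
 * on every access, as sharing with another processor (or an
 * interrupt handler) requires. */
volatile int done = 0;

int check_twice(void)
{
    int first = done;    /* load from memory */
    int second = done;   /* volatile: loads again, no register reuse */
    return first + second;
}
```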

Disclaimer: I'm in OS, not compilers, so there may be issues here beyond
my ken.


Jeff d'Arcy     OS/Network Software Engineer     jdarcy@encore.com
  Encore has provided the medium, but the message remains my own

henry@utzoo.uucp (Henry Spencer) (02/13/90)

In article <604@bbxsda.UUCP> scott@bbxsda.UUCP (Scott Amspoker) writes:
>Well, it doesn't take much to find instructions on a 680x0 that are
>not used by a C compiler.  However, my code tends to do a lot of
>structure accesses with pointers such as "pointer->field".  The
>68020 double-indirect-with-offset addressing mode is a real life
>saver and I haven't seen that on the few RISC machines I've used.

Have you measured the costs of doing without it?  Those fancy addressing
modes are usually quite slow.  Just because it's one addressing mode rather
than an instruction or two doesn't mean it's faster.
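
The "doing without it" version is just the decomposition through an
explicit temporary (hypothetical names; the point is that this load/load
pair is the thing to time against the single fancy-mode instruction, not
assume slower):

```c
#include <assert.h>

/* The double-indirect access rewritten as two plain register-offset
 * loads -- the sequence any RISC, and a fast 680x0 path, would use. */
struct rec {
    struct rec *inner;
    int field;
};

int via_temp(const struct rec *p)
{
    const struct rec *t = p->inner;  /* load 1: fetch the pointer      */
    return t->field;                 /* load 2: offset off the register */
}
```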
-- 
SVR4:  every feature you ever |     Henry Spencer at U of Toronto Zoology
wanted, and plenty you didn't.| uunet!attcan!utzoo!henry henry@zoo.toronto.edu

henry@utzoo.uucp (Henry Spencer) (02/22/90)

In article <9755@cbmvax.commodore.com> jesup@cbmvax.cbm.commodore.com (Randell Jesup) writes:
>	2) The 68030 improves the speeds of many of the new addressing modes.
>I think some of them become useful.

And the 68040 improves them even more... except that it improves the speed
of the *simple* stuff by a much larger margin.  Nobody is going to optimize
for the 030 and ignore the 040 at this point... and the list of fast modes
on the 040 is the original 68000 mode list minus indexed.  Not one of the
new-on-020 modes is included; they all take the slow path.
-- 
"The N in NFS stands for Not, |     Henry Spencer at U of Toronto Zoology
or Need, or perhaps Nightmare"| uunet!attcan!utzoo!henry henry@zoo.toronto.edu