[net.micro.68k] RISC

doug@terak.UUCP (Doug Pardee) (05/28/85)

> If we created the chip
> using only the instructions the compiler needed, we could use less logic.
> We could decode the instructions faster because the microcode is simpler
> (more ooomph per MHz), production is simpler, yield is higher, speeds are
> faster, and everyone is happy except the assembler programmer.
> ...
> This is the concept behind the RISC architecture ...

RISC is an interesting concept, but I have a major reservation.  Perhaps
someone can explain to me how on earth you're going to feed instructions
to a RISC machine fast enough?

All of the popular 8 and 16 bit microprocessors are speed limited by
instruction fetch, not by instruction complexity.  I will entertain the
objection that the 6502, with its critical shortage of on-chip
registers, is also limited by operand accesses.  The usual RISC machine
has lots of registers, so operand accesses shouldn't be a problem.
And nobody says that a non-RISC cpu can't have lots of registers.

The knee-jerk answer is "cache".  But that's only an answer if one
refuses to allow non-RISC cpus to use cache; they can fit more logic
into any given cache than can RISC cpus, thereby having a better "hit
ratio" than RISC.

Of course, one could design a RISC machine with a super-high-speed ROM
or cache in which one could store the commonly used functions like
multiplication and division, and then one would only have to fetch a
subroutine call from the (slow) instruction stream.

But doesn't that sound like your everyday, garden variety microcoded
non-RISC cpu?
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
               ^^^^^--- soon to be CalComp

brooks@lll-crg.ARPA (Eugene D. Brooks III) (05/30/85)

The knee-jerk answer to the instruction fetch bandwidth problem, cache,
is a valid answer.  The argument that one can give as much cache to a
complicated instruction set engine and thereby get as much performance is
not valid.  The performance reduction for the complicated instruction set
comes from the time spent in the microcode that decodes and executes each
instruction.

A good example of this is the VAX instruction set vs the Ridge instruction set.
The Ridge achieves the same performance as the VAX at a much lower hardware
cost.  The performance increase arises from the simplicity of the instruction
set.  This difference also shows up when you emulate these architectures
in software.  The instruction decode for such emulators on the VAX is a
nightmare, and the designers of the VAX faced the same nightmare when they
designed the hardware and wrote the microcode for it.

thoth@tellab3.UUCP (Marcus Hall) (05/31/85)

>RISC is an interesting concept, but I have a major reservation.  Perhaps
>someone can explain to me how on earth you're going to feed instructions
>to a RISC machine fast enough?

One way to help with this problem is to use fairly wide memory accesses
(at least for instruction fetches).  Thus, in one memory cycle, many
instructions may be fetched simultaneously.  Of course this is done for
non-RISC machines as well, but a non-RISC machine will become
execution-bound sooner.

Also, since the RISC instruction set is simpler, the op-codes require fewer
bits, so a memory fetch will get more RISC instructions in one cycle
than it would non-RISC instructions.

marcus hall
..!ihnp4!tellab1!tellab2!thoth

mat@amdahl.UUCP (Mike Taylor) (05/31/85)

> All of the popular 8 and 16 bit microprocessors are speed limited by
> instruction fetch, not by instruction complexity.  I will entertain the
> objection that the 6502, with its critical shortage of on-chip
> registers, is also limited by operand accesses.  The usual RISC machine
> has lots of registers, so operand accesses shouldn't be a problem.
> And nobody says that a non-RISC cpu can't have lots of registers.
> 
> The knee-jerk answer is "cache".  But that's only an answer if one
> refuses to allow non-RISC cpus to use cache; they can fit more logic
> into any given cache than can RISC cpus, thereby having a better "hit
> ratio" than RISC.
> 
> Of course, one could design a RISC machine with a super-high-speed ROM
> or cache in which one could store the commonly used functions like
> multiplication and division, and then one would only have to fetch a
> subroutine call from the (slow) instruction stream.
> 
> But doesn't that sound like your everyday, garden variety microcoded
> non-RISC cpu?
> -- 
> Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
>                ^^^^^--- soon to be CalComp

Fetching one instruction per cycle is pretty much necessary to execute
one instruction per cycle.  Whether this is a bug or a feature is a
good question.  The difference between fetching a microword every cycle
and fetching an instruction is that the microcoded machine has a fixed
microsequence in ROM while the instructions come from RAM.  Since they
come from RAM, they can be optimized by the compiler for the particular
job being done. Clearly you pay for this by diverting fetches from
microstore to RAM.  Some RISC implementations try to get this cost back by
using the microstore's chip space for registers and/or instruction queues,
reducing total fetch traffic.

How can you tell a machine is not a RISC?  A proposal...

If it is possible to replace some instructions with sequences of other
instructions with only small performance penalties (or gains, a la VAX)
then those instructions probably have no business in the instruction
set.  Their presence is probably slowing down the whole system.  Take
them out.  Redesign to use the space you got back by removing them.

Repeat the above cycle until no more instructions can be removed or
the cycle doesn't improve performance.  You now have a RISC.

Arguments ?
-- 
Mike Taylor                        ...!{ihnp4,hplabs,amd,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

joel@peora.UUCP (Joel Upchurch) (05/31/85)

        I think your argument ignores  the  fact  that  with  a  given
        fabrication technology there is only so much function you  can
        put on a given chip.  If you choose to  have  a  large control
        store  ROM  for  a  complex instruction set then you must make
        sacrifices in other parts of the  chip.  This  may  mean fewer
        registers,  or  less  cache,  or a slower ALU, or other trade-
        offs.

        RISC argues that  this  is  not  a  good  trade:  that  it  is
        difficult  to  write  compilers to use complicated instruction
        sets, and that these high-level operations are not necessarily
        any  faster  than a series of low-level operations.  Also with
        a simple instruction set you have the option of  hardwiring it
        for additional performance.

doug@terak.UUCP (Doug Pardee) (06/03/85)

Responses to early comments on my question about how a RISC cpu can
fetch instructions fast enough to keep up with non-RISC cpus:

> The knee jerk answer to the instruction fetch bandwidth problem, cache,
> is a valid answer.  The argument that one can give as much cache to a
> complicated instruction set engine and thereby get as much performance is
> not valid.  The performance reduction for the complicated instruction set
> comes from the time spent running microcode decode and execute instructions.

I'd believe this except that it ain't so.  The performance reduction on
current non-RISC single-chip cpus comes from instruction fetch.  For
example, the NS32016 can do your basic RISC-type operations in 3 or 4
clock cycles, but it takes 5 clock cycles to fetch the next instruction.
RISC-type instructions on the 32016 therefore take 5 clock cycles each.
And the 32016 is "burdened" by non-RISC "microcode decode and execute
instructions."

> One way to help with this problem is to use fairly wide memory accesses
> (at least for instruction fetches).  Thus, in one memory cycle, many
> instructions may be fetched simultaneously.  Of course this is done for
> non-RISC machines as well, but a non-RISC machine will become
> execution-bound sooner.

The NS32032 and MC68020 both fetch 32 bits of instruction data at one
time.  Both have "zillions" of pins on the package in order to pull that
off; I can't imagine building a RISC chip with 128-bit wide instruction
fetch, requiring over 250 pins on the package.

And if we're not talking single-chip architectures, we're going to have
a devil of a time making rational comparisons.  After all, IBM has used
a non-RISC architecture for some time and it goes pretty fast :-)

I'll believe that a non-RISC machine will become execution-bound sooner,
but this is just another way of saying that the RISC machine is much more
limited by instruction fetch than the non-RISC machine is.

> Also, since the RISC instruction set is simpler, the op-codes require fewer
> bits, so a memory fetch will get more RISC instructions in one cycle
> than it would non-RISC instructions.

I thought this too, until I looked into some RISC machines.  They use
32-bit instruction words, twice as wide as the equivalent instructions
in, say, the 680xx and 320xx cpus.

Still looking for the answer...
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
               ^^^^^--- soon to be CalComp

steveg@hammer.UUCP (Steve Glaser) (06/05/85)

In article <1590@amdahl.UUCP> mat@amdahl.UUCP (Mike Taylor) writes:
>> All of the popular 8 and 16 bit microprocessors are speed limited by
>> instruction fetch, not by instruction complexity.  I will entertain the
>> objection that the 6502, with its critical shortage of on-chip
>> registers, is also limited by operand accesses.  The usual RISC machine
>> has lots of registers, so operand accesses shouldn't be a problem.
>> And nobody says that a non-RISC cpu can't have lots of registers.
>> 
>> The knee-jerk answer is "cache".  But that's only an answer if one
>> refuses to allow non-RISC cpus to use cache; they can fit more logic
>> into any given cache than can RISC cpus, thereby having a better "hit
>> ratio" than RISC.
>> 
>> Of course, one could design a RISC machine with a super-high-speed ROM
>> or cache in which one could store the commonly used functions like
>> multiplication and division, and then one would only have to fetch a
>> subroutine call from the (slow) instruction stream.
>> 

The basic tenets of RISC are that you throw out the useless baggage
that's in there "cause somebody wanted it and microcode was cheap" and
get back to a reasonable minimal set (do chuck out an instruction;
measure performance; if better or same, repeat).  With the real estate
gained, you then start putting reasonable sized caches and/or prefetch
queues on chip instead of useless instructions.

Some folks have decided that RISC means that you eliminate the
microcode and directly decode the operations.  This is too narrow a
view.  It doesn't matter how you implement what's left, as long as you
only implement what is really needed.

As for caches, one issue that hasn't been mentioned is the use of block
mode transfers for fast cache filling.  There are ram chips out there
that can access "nearby" ram cells significantly faster than random
ones (so called page mode, nibble mode, or static column rams).  These
were mainly developed for the raster display markets (so you don't
waste all your memory bandwidth refreshing the screen and you get a
chance to make updates to it), but are useful in other areas as well.
Since cache blocks are typically a power of 2 in size and start on
nice boundaries, it's pretty easy to map these into the notion of
"nearby" supported by a ram chip.  The only problems you have with this
kind of architecture are (1) you need a new bus protocol, and (2) ECC
gets harder because you need to do it faster (anybody for a pipelined ECC
chip that can keep up with 50nsec ram cycles?).

	Steve Glaser

henry@utzoo.UUCP (Henry Spencer) (06/06/85)

> I thought this too, until I looked into some RISC machines.  They use
> 32-bit instruction words, twice as wide as the equivalent instructions
> in, say, the 680xx and 320xx cpus.

Yes and no.  The Berkeley RISC project adopted 32-bit instructions for
simplicity in initial work, not because they thought it was right for
the final design.  If you look, you'll find at least one paper from them
discussing a support chip which is (a) an instruction cache, and (b) an
instruction-encoding expander.  The latter function makes a large
difference to instruction density without introducing any extra delays.

Also, don't be too sure that the 68* and 32* chips use 16-bit instructions
a lot.  Remember that things like offsets take extra bytes, and those get
used *a lot* on those machines -- virtually every memory reference needs
one.  On the pdp11, a 16-bit machine if there ever was one, the average
instruction length was in fact about 32 bits.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

doug@terak.UUCP (Doug Pardee) (06/07/85)

> Some folks have decided that RISC means that you eliminate the
> microcode and directly decode the operations.  This is too narrow a
> view.

An elephant is much like a tree, or is it a rope, or is it a snake...

If we're going to talk about RISC, we'd better decide what we're going
to call RISC.  Is RISC:
  1) a way to save silicon real estate on single-chip cpus?  or
  2) a concept that's equally applicable to board-level cpus?

Is RISC:
  1) strictly limited instruction set, essentially microcode level?  or
  2) any cpu without "useless" instructions? (define "useless")  or
  3) any cpu with oodles of registers?

For the moment, I'm going to talk about non-microcoded single-chip cpus
with strictly limited instruction sets.  Something you'd put in a TTL
system with dynamic RAMs.  I've come to the conclusion that for this
kind of RISC, its time has passed.  It made a lot of sense in the late
'70s when cpu cycles were 250-400 ns and memory cycles were 300-400 ns.

But since then the cpu manufacturers have concentrated on speed while
the memory manufacturers concentrated on capacity.  With cpu cycles now
at 60-100 ns and memory cycles at 225-325 ns, the way to improve speed
in a system is to decrease memory cycles, not cpu cycles.  That means
making instructions even *more* complex, so that each instruction
fetched does as much work as possible.  At 100 ns per cpu cycle, you can
afford to let the cpu do a lot of work that isn't always needed, because
you can throw away *half* of it and still have the cpu be faster than
if you fetched the microinstructions from main memory (using 120 ns
access/220 ns cycle time 256K DRAMs).

By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
only 25 cpu cycles in order to compete.  All the cache in the world
ain't gonna help...
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
               ^^^^^--- soon to be CalComp

hal@cornell.UUCP (Hal Perkins) (06/09/85)

In article <601@terak.UUCP> doug@terak.UUCP (Doug Pardee) writes:
>By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
>10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
>8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
>only 25 cpu cycles in order to compete.  All the cache in the world
>ain't gonna help...

Now just a moment...  The Berkeley chips were built by academic folks
using conservative design rules, etc.  If they had been built by
experienced professional chip designers using state-of-the-art
technology they would have been a lot faster.  It's not fair to pick on
the relatively slow clock cycle of the experimental chips -- they were
built to prove a point (which they did), not to produce a product.
(I'm not making this up -- if you read some of the RISC papers by
Patterson et al., they make the same point.)

Hal Perkins                         UUCP: {decvax|vax135|...}!cornell!hal
Cornell Computer Science            ARPA: hal@cornell  BITNET: hal@crnlcs

jans@mako.UUCP (Jan Steinman) (06/10/85)

In article <5673@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Also, don't be too sure that the 68* and 32* chips use 16-bit instructions
>a lot.  Remember that things like offsets take extra bytes...

In my experience 16 bit instructions get used quite often on the NS32000 in
well written assembly code.  The NS32000 has an edge over the MC68000 in code
density in that instructions need not be word-aligned.  Henry is correct in
stating that offsets are used on virtually every memory reference, but those
offsets are usually 8 bits.  Following are some statistics for one module
pulled at random from my present project.  The code is highly optimized for
speed, so code density is not even as good as it should be, i.e. "jump" is
used instead of "br", saving one clock, but costing two bytes, etc.

	Size	Count
	8	0
	16	14
	24	13
	32	1
	40	0
	48	2
	Total	30
	Ave	22.1 bits per instruction.

The average instruction size of 22.1 bits is indeed more than the 16 bits the
original poster assumed, but much less than the 32 bits Henry expected, due
in large part to the large number of 24 bit instructions available on the
NS32000.  I believe the MC68000 would have a larger number of 32 bit
instructions with the consequent increase in average bits-per-instruction.

Of course, no analogy can be drawn for compiled code.  Many compilers
underuse register space, and I suspect the average bits-per-instruction
would be much higher for compiled code on both machines.
-- 
:::::: Jan Steinman		Box 1000, MS 61-161	(w)503/685-2843 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::

doug@terak.UUCP (Doug Pardee) (06/11/85)

Once again, I find that I didn't do a good job of explaining myself.
Let me try again...

In the following comment, notice the word "equivalent":

me> I thought this too, until I looked into some RISC machines.  They use
me> 32-bit instruction words, twice as wide as the equivalent instructions
me> in, say, the 680xx and 320xx cpus.

> Also, don't be too sure that the 68* and 32* chips use 16-bit instructions
> a lot.  Remember that things like offsets take extra bytes, and those get
> used *a lot* on those machines -- virtually every memory reference needs
> one.

My point was supposed to be that you could take a 680xx/320xx and limit
yourself to the RISC instruction set (the "equivalent" instructions) and
have 16-bit instructions instead of RISC's 32-bit instructions.

Presumably, the reason that the longer instructions, with their offsets
etc., are used so much is that those instructions are more effective
than the RISC instructions for the task at hand (this sounds like
another discussion that I'm involved in :-)

> instruction-encoding expander.  The latter function makes a large
> difference to instruction density without introducing any extra delays.

In a microcoded cpu, the opcode is decoded and causes a sequence of
very simple micro-instructions to be executed.

In the proposed system, the opcode is decoded and causes a sequence of
very simple RISC instructions to be executed.

What's the difference?  Where does the much-vaunted performance gain
come from?
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
               ^^^^^--- soon to be CalComp

john@frog.UUCP (John Woods) (06/12/85)

> In article <5673@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
> >Also, don't be too sure that the 68* and 32* chips use 16-bit instructions
> >a lot.  Remember that things like offsets take extra bytes...
>In my experience 16 bit instructions get used quite often on the NS32000 in
>well written assembly code.  The NS32000 has an edge over the MC68000 in code
>density in that instructions need not be word-aligned.  Henry is correct in
>stating that offsets are used on virtually every memory reference, but those
>offsets are usually 8 bits.  Following are some statistics for one module
>pulled at random from my present project.  The code is highly optimized for
>speed, so code density is not even as good as it should be, i.e. "jump" is
>used instead of "br", saving one clock, but costing two bytes, etc.
> 	Size	Count	Size Count
> 	8	0	16	14
> 	24	13	32	1
> 	40	0	48	2
> 	Total	30	Ave	22.1 bits per instruction.
>The average instruction size of 22.1 bits is indeed more than the 16 bits the
>original poster assumed, but much less than the 32 bits Henry expected, due
>in large part to the large number of 24 bit instructions available on the
>NS32000.  I believe the MC68000 would have a larger number of 32 bit
>instructions with the consequent increase in average bits-per-instruction.
> 
I counted words for some random compiled code here for the 68000 (the famous
Knight's Tour, which I still have lying around).

Size (bytes)		2	4	6	8	Total	Ave
Sample 1: main()	23	4	3	0	30	21.3 bits/instr.
Sample 2: try()		18	7	4	0	29	24.3 bits/instr.

The Greenhills compiler tries to make good use of registers.

On the other hand, I don't know how much these instructions actually
accomplished.  I wouldn't be surprised to find that the 32032 does more with
those 22 bits per instruction, but I wouldn't be horribly shocked to find that
the 68000 does more [despite my employer, I am a closet 32032 fan, by the way-
it is almost as nice as the PDP-11 !-) ].  I suppose the true lesson is that
"bits per instruction" is yet another Red Herring in Heavy Oil:  a computer
with 64 bit instructions would have "poor" "code density", but if one of those
instructions was "Solve the (reg0)x(reg0) Knight's Tour and output the table
to channel (reg1)", the program would be awfully short !-).


--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA

Five tons of flax!

west@sdcsla.UUCP (Larry West) (06/12/85)

In article <2335@cornell.UUCP> hal@gvax.UUCP (Hal Perkins) writes:
>In article <601@terak.UUCP> doug@terak.UUCP (Doug Pardee) writes:
>>By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
>>10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
>>8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
>>only 25 cpu cycles in order to compete.  All the cache in the world
>>ain't gonna help...
>
>Now just a moment...  The Berkeley chips were built by academic folks
>using conservative design rules, etc.  If they had been built by
>experienced professional chip designers using state-of-the-art
>technology they would have been a lot faster. ... <text omitted> ...

Hal's point is valid.   Another point to consider is this: what you
really want for a general-purpose computer is throughput.   How much
of a typical instruction stream is composed of 32-bit signed multiplies?

Obviously, in many applications, multiplication would take on great
significance.   But one point of RISC is that optimizing a few
special-purpose instructions can hurt you in general-purpose applications.
-- 

Larry West			Institute for Cognitive Science
(USA+619-)452-6220		UC San Diego (mailcode C-015) [x6220]
ARPA: <west@nprdc.ARPA>		La Jolla, CA  92093  U.S.A.
UUCP: {ucbvax,sdcrdcf,decvax,ihnp4}!sdcsvax!sdcsla!west OR ulysses!sdcsla!west

chuck@dartvax.UUCP (Chuck Simmons) (06/13/85)

> In article <5673@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
> >Also, don't be too sure that the 68* and 32* chips use 16-bit instructions
> >a lot.  Remember that things like offsets take extra bytes...
> 
> In my experience 16 bit instructions get used quite often on the NS32000 in
> well written assembly code.  The NS32000 has an edge over the MC68000 in code
> density in that instructions need not be word-aligned.  Henry is correct in
> stating that offsets are used on virtually every memory reference, but those
> offsets are usually 8 bits.  Following are some statistics for one module
> pulled at random from my present project.  The code is highly optimized for
> speed, so code density is not even as good as it should be, i.e. "jump" is
> used instead of "br", saving one clock, but costing two bytes, etc.
> 
> 	Size	Count
> 	8	0
> 	16	14
> 	24	13
> 	32	1
> 	40	0
> 	48	2
> 	Total	30
> 	Ave	22.1 bits per instruction.
> 
> The average instruction size of 22.1 bits is indeed more than the 16 bits the
> original poster assumed, but much less than the 32 bits Henry expected, due
> in large part to the large number of 24 bit instructions available on the
> NS32000.  I believe the MC68000 would have a larger number of 32 bit
> instructions with the consequent increase in average bits-per-instruction.
> 
> Of course, no analogy can be drawn for compiled code.  Many compilers under-
> use register space and I suspect the average bits-per-instruction would be
> much higher for compiled code on both machines.
> :::::: Jan Steinman		Box 1000, MS 61-161	(w)503/685-2843 ::::::

Pulling out a "random" chunk of 68000 code (the only chunk I have sitting
near by) gives us the following table:

     Size    Count
      16      22
      32      17
      48       1
     Total    40
     Ave    23.6 bits per instruction.

I'm not sure that code density is really what we want to be looking
at, however.  I can recode the above algorithm to use more instructions
that are smaller on the average and thus achieve a higher code density.
(The 17 32-bit instructions can be replaced by 33 16-bit instructions
and 2 32-bit instructions, giving an average of 17.1 bits per instruction.)
However, this denser code would be both slower and longer.  Somewhere
we need to take into account the "usefulness" of each instruction.

Chuck Simmons

hull@hao.UUCP (Howard Hull) (06/15/85)

In article 195@frog, John Woods writes:
> On the other hand, I don't know how much these instructions actually
> accomplished.  I wouldn't be surprised to find that the 32032 does more with
> those 22 bits per instruction, but I wouldn't be horribly shocked to find that
> the 68000 does more [despite my employer, I am a closet 32032 fan, by the way-
> it is almost as nice as the PDP-11 !-) ].

Just out of curiosity, what *ARE* you 320xx and 68k fans using for the PDP-11
instruction     MOV (SP)+,@(R5)+    eh?  (It's useful for initializing tables
or mapped video windows from stack images.)
I only need one or two examples, please, so read all your micro mail before
replying to this.  If you see a correct answer, and you think yours is better,
please send me mail; In a subsequent posting, I'll summarize whatever I get.
Thanks.

								     Howard Hull
[If yet unproven concepts are outlawed in the range of discussion...
                   ...Then only the deranged will discuss yet unproven concepts]
        {ucbvax!hplabs | allegra!nbires | harpo!seismo } !hao!hull

henry@utzoo.UUCP (Henry Spencer) (06/17/85)

> By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
> 10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
> 8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
> only 25 cpu cycles in order to compete.  All the cache in the world
> ain't gonna help...

The RISC II was designed by a bunch of grad students, using simplistic
design rules and mediocre MOS processing, with limited opportunity to try
again if the original chips didn't work.  The 10MHz NS32016 was done by
a swarm of professional silicon bashers, using professional facilities
and souped-up processes, over a long period of time with *many* design
iterations.  (In fact, too %@$@%#@% many for those of us who were waiting
for things to settle down and work...)  This comparison means little.

Also, multiplies are actually fairly infrequent operations in most programs.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

henry@utzoo.UUCP (Henry Spencer) (06/17/85)

> My point was supposed to be that you could take a 680xx/320xx and limit
> yourself to the RISC instruction set (the "equivalent" instructions) and
> have 16-bit instructions instead of RISC's 32-bit instructions.

Try doing memory references in 16 bits on either the 680xx or 320xx.
It can't be done, because the opcode alone is 16 bits.  And you need
memory references more on the commercial machines, because the RISC's
register architecture isn't present to reduce memory usage.

> > instruction-encoding expander.  The latter function makes a large
> > difference to instruction density without introducing any extra delays.
> 
> In a microcoded cpu, the opcode is decoded and causes a sequence of
> very simple micro-instructions to be executed.
> 
> In the proposed system, the opcode is decoded and causes a sequence of
> very simple RISC instructions to be executed.

Wrong, it causes *one* very simple RISC instruction to be executed.
The expander function is purely an encoding technique; it does not
introduce another level of emulation.  The point is that the poor code
density of current RISCs is the result of optimization for experimental
work rather than production.  More compact encodings, without change to
the basic concept, are both possible and desirable once the basic issues
are understood.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

jer@peora.UUCP (J. Eric Roskos) (06/20/85)

henry@utzoo.UUCP (Henry Spencer @ U of Toronto Zoology) writes:

> Also, multiplies are actually fairly infrequent operations in most programs.

That is, if "most programs" don't use multidimensional arrays...
-- 
Shyy-Anzr:  J. Eric Roskos
UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
US Mail:    MS 795; Perkin-Elmer SDC;
	    2486 Sand Lake Road, Orlando, FL 32809-7642

	    Bar ol bar / Gur pbyq rgpurq cyngr /
	    Unf cevagrq gur jnez fgnef bhg.

henry@utzoo.UUCP (Henry Spencer) (06/21/85)

> > Also, multiplies are actually fairly infrequent operations in most programs.
> 
> That is, if "most programs" don't use multidimensional arrays...

Except in a few specific environments, most programs indeed do not use
multidimensional arrays.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

kds@intelca.UUCP (Ken Shoemaker) (06/21/85)

> introduce another level of emulation.  The point is that the poor code
> density of current RISCs is the result of optimization for experimental
> work rather than production.  More compact encodings, without change to
> the basic concept, are both possible and desirable once the basic issues
> are understood.
> -- 
> 				Henry Spencer @ U of Toronto Zoology
> 				{allegra,ihnp4,linus,decvax}!utzoo!henry

er, I don't think so...one of the basic concepts of RISCs as I understand
them is to reduce the number of pipeline stages in the execution of
instructions.  Adding another just to do an expansion/shuffling of
opcode bytes flies directly in the face of that thinking.
-- 
...and I'm sure it wouldn't interest anybody outside of a small circle
of friends...

Ken Shoemaker, Microprocessor Design for a large, Silicon Valley firm

{pur-ee,hplabs,amd,scgvaxd,dual,qantel}!intelca!kds
	
---the above views are personal.  They may not represent those of the
	employer of its submitter.

henry@utzoo.UUCP (Henry Spencer) (06/24/85)

> er, I don't think so...one of the basic concepts of RISCs as I understand
> them is to reduce the number of pipeline stages in the execution of
> instructions.  Adding another just to do an expansion/shuffling of
> opcode bytes flies directly in the face of that thinking.

Remember that the RISC is necessarily decoding its opcodes; not even on
a machine that simple is a numeric opcode a direct encoding of the internal
control signals.  The point is that the current instruction encoding was
chosen for simplicity, not compactness, and one can do better *without*
compromising the principles of the machine.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

baba@spar.UUCP (Baba ROM DOS) (06/25/85)

>                                           The 10MHz NS32016 was done by
> a swarm of professional silicon bashers, using professional facilities
> and souped-up processes, over a long period of time with *many* design
> iterations.
>

Don't you see a contradiction lurking in there?

					Baba ROM DOS

brooks@lll-crg.ARPA (Eugene D. Brooks III) (06/25/85)

> > > Also, multiplies are actually fairly infrequent operations in most programs.
> > 
> > That is, if "most programs" don't use multidimensional arrays...
> 
> Except in a few specific environments, most programs indeed do not use
> multidimensional arrays.

Perhaps we could do without the integer multiplier :-), but we had better
not drop the floating point adder and multiplier, and they had both better
operate at one op per clock after filling the pipeline.

meissner@rtp47.UUCP (Michael Meissner) (06/25/85)

Another case where multiplication occurs frequently is constructs of the
sort:

	struct {
		long	l;
		short	s;
	} *p;

	int	i;

	main(){
		/*...*/
		p += i;
		p[i].l = 1;
		/*...*/
	}

I.e., pointer arithmetic involving non-constant integers, particularly if
the size is not a power of 2.

--
	Michael Meissner
	Data General Corporation
	...{ ihnp4, decvax }!mcnc!rti-sel!rtp47!meissner

guy@sun.uucp (Guy Harris) (06/26/85)

> > > > Also, multiplies are actually fairly infrequent operations in most
> > > > programs.
> > > 
> > > That is, if "most programs" don't use multidimensional arrays...
> > 
> > Except in a few specific environments, most programs indeed do not use
> > multidimensional arrays.
> 
> Perhaps we could do without the integer multiplier :-), but we had better
> not drop the floating point adder and multiplier, and they had both better
> operate at one op per clock after filling the pipeline.

Methinks "most" is in the eye of the beholder.  Most UNIX utilities, and the
UNIX kernel, do relatively few multiplications *or* floating point
operations (the kernel need not do any; the 4.2BSD scheduler does, but the
Sun 4.2BSD scheduler and the 2.9BSD scheduler, which have the same
load-average dependent decay on the average weighting for "p_cpu", don't)
and rarely use multidimensional arrays.  The same is probably true for many
other classes of applications, such as the office automation/"personal
productivity" applications popular on PCs, for instance.

However, I suspect a scientific programmer would answer differently.  I have
used multidimensional arrays in scientific programs I've written; they are quite
common in such programs (they are a natural data structure for many types of
calculations, like those which use models of two-dimensional or
three-dimensional spaces).  Those programs also would like the FP adder and
multiplier; I don't know how many integer multiplies or divides they do,
other than in support of accesses to multi-dimensional arrays.

In the case of such accesses, if the indices being used are linear functions
of the induction variable of a loop (I don't know whether there are any
looser conditions possible), a strength reduction can make the
multiplications unnecessary (C programmers may know this trick as "using a
pointer instead of an array subscript").

	Guy Harris

henry@utzoo.UUCP (Henry Spencer) (06/26/85)

> >                                           The 10MHz NS32016 was done by
> > a swarm of professional silicon bashers, using professional facilities
> > and souped-up processes, over a long period of time with *many* design
> > iterations.
> 
> Don't you see a contradiction lurking in there?

I wish I did...  Professional people and facilities plus plenty of time
would suggest getting it right the first time, except that it just makes
them more ambitious instead.  The souped-up processes help speed but don't
do a lot for correctness.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

ken@turtlevax.UUCP (Ken Turkowski) (06/26/85)

>>> Also, multiplies are actually fairly infrequent operations in most programs.
>> 
>> That is, if "most programs" don't use multidimensional arrays...
>
>Except in a few specific environments, most programs indeed do not use
>multidimensional arrays.

Depending on what your product is, you may not need multiplication at
all.  If the end product is a UN*X machine, most UN*X utilities do not
need multiplications.  However, if you are running any engineering or
scientific applications, the time devoted to multiplication is
considerable, and may even dominate execution time if there is no
hardware support for it.

Saving the status register is an infrequent operation; why not do
multiple conditional branches instead?  Virtually no application
programs need it at all.  The operating system just needs it to switch
contexts, which is a task done infrequently, and takes so much time as
it is... :-)
-- 

Ken Turkowski @ CADLINC, Menlo Park, CA
UUCP: {amd,decwrl,hplabs,nsc,seismo,spar}!turtlevax!ken
ARPA: turtlevax!ken@DECWRL.ARPA

***** Support your local number cruncher: join PIFOM!	*****
***** (Programmers In Favor Of Multiplication)		*****

bobbyo@celerity.UUCP (Bob Ollerton) (06/29/85)

Gee, if someone built a computer that did really fast multiplication
(whatever "really fast" is), should they call it the "Rabbit"?

I am sure that there are many designs that offer fast, slow, or no
hardware support for multiplication.  In the final analysis, the user
pays the cost of multiplication support, either in the cost of the
system or in the time spent knitting paper clips together while
running a job.

Maybe designs without fast multiplication (and floating point!)
should be called the "Lizard".

Then someone can introduce the Micro-Lizard II.  ::)) <-echo.

bob.

-- 
Bob Ollerton; Celerity Computing; 
9692 Via Excelencia; San Diego, Ca 92126; (619) 271 9940
{decvax || ucbvax || ihnp4}!sdcsvax!celerity!bobbyo
                              akgua!celerity!bobbyo

sasaki@harvard.ARPA (Marty Sasaki) (06/29/85)

This may be an incredibly stupid question, but I remember reading an
article (in CACM, I think) on RISC machines where most of the
instructions occupied a single byte of memory. Through cleverness,
something like 95% of all instructions executed occupied a single
byte (including operands). This meant that programs would be smaller
and would run faster by the simple fact that they required fewer
memory references to get work done.

All of the recent discussion on RISC machines hasn't mentioned this at
all. Have things changed sufficiently in the recent past that this
small program size doesn't matter? Is this a really dumb question?

-- 
----------------
  Marty Sasaki				net:   sasaki@harvard.{arpa,uucp}
  Harvard University Science Center	phone: 617-495-1270
  One Oxford Street
  Cambridge, MA 02138

mash@mips.UUCP (John Mashey) (06/30/85)

Michael Meissner ...{ ihnp4, decvax }!mcnc!rti-sel!rtp47!meissner writes:
> Another case where multiplication occurs frequently is constructs of the
> sort:
> 	struct {
> 		long	l;
> 		short	s;
> 	} *p;
> 	int	i;
> 	main(){
> 		/*...*/
> 		p += i;
> 		p[i].l = 1;
> 		/*...*/
> 	}
> I.e., pointer arithmetic involving non-constant integers, particularly if
> the size is not a power of 2.

1) In this example, sizeof(*p) == 8 anyway, because of padding.  To get the
intended effect (a size that is not a power of 2), use short l[2] in place
of long l, for example.
2) In any case, at least some compilers not only do multiplies of
powers of 2 by adds or shifts, but do so for constants that are almost
powers of 2 (i.e., few 1-bits) by shifts and adds or subtracts.
3) Indeed, multiplication does occur frequently in these cases; the real
question is how frequent these cases themselves are.  [Not an answer, but
a question whose answer needs to be known when you're making tradeoffs
in CPU design.]
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043

larry@mips.UUCP (Larry Weber) (07/01/85)

> Another case where multiplication occurs frequently is constructs of the
> sort:
> 
> 	struct {
> 		long	l;
> 		short	s;
> 	} *p;
> 
> 	int	i;
> 
> 	main(){
> 		/*...*/
> 		p += i;
> 		p[i].l = 1;
> 		/*...*/
> 	}
> 

Almost all the examples provided actually use multiplication by a constant.
Thus, on all but the largest (and most expensive) of machines there are
shift/add/subtract sequences that do the trick nicely:
   x * 6  == ((x<<1)+x)<<1		three instructions
   x * 7  == (x<<3)-x			two instructions
   x * 119== (((x<<4)-x)<<3)-x		four instructions
These sequences are often smaller than a call to a multiply routine and
almost always faster than a general purpose multiply instruction.  Most
of the constants are powers of two, near powers of two (31, 255) or simple sums
of powers of two (6, 10, 24).

Some machines are better than others at this sort of thing so study
those timings carefully.  This is a winner on the 68k.
-- 
-Larry B Weber
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!larry
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043

kds@intelca.UUCP (Ken Shoemaker) (07/02/85)

> Remember that the RISC is necessarily decoding its opcodes; not even on
> a machine that simple is a numeric opcode a direct encoding of the internal
> control signals.  The point is that the current instruction encoding was
> chosen for simplicity, not compactness, and one can do better *without*
> compromising the principles of the machine.

I didn't say that it didn't decode them...My understanding is that the
various bits of the instructions always come into the processor on the
same set of data lines, which eliminates the need for some other unit
out there to steer the bits from the appropriate data lines to the
correct decoding unit.
-- 
...and I'm sure it wouldn't interest anybody outside of a small circle
of friends...

Ken Shoemaker, Microprocessor Design for a large, Silicon Valley firm

{pur-ee,hplabs,amd,scgvaxd,dual,qantel}!intelca!kds
	
---the above views are personal.  They may not represent those of the
	employer of its submitter.