[net.arch] RISC

paulsc@tekecs.UUCP (06/21/83)

I have only seen a couple articles on RISCs, and I don't believe
I understand just what is new about a RISC. Could someone
explain the difference between programming a RISC and writing
microcode?

Paul H. Scherf
P. O. Box 1000
Del. Sta. 61-201
Tektronix Engineering Computing Systems
Wilsonville, Oregon, USA

UUCP:	...!XXX!teklabs!tekecs!paulsc
	(where XXX is one of: aat cbosg chico decvax harpo ihnss
	lbl-unix ogcvax pur-ee reed ssc-vax ucbvax zehntel)
CSNET:	tekecs!paulsc @ tektronix
ARPA:	tekecs!paulsc.tektronix @ rand-relay

bcase@uiucdcs.UUCP (06/28/83)

uiucdcs!bcase    Jun 27 19:03:00 1983

First, to my mind, using 'RISC' may mean only the Berkeley RISC machine,
but it is rapidly becoming a generic term used to describe an architecture
which is designed to execute simple instructions quickly.  For some of
the 'RISC'-like machines, there is very little difference between the
language of the machine and traditional microcode; it just depends on
which one you are talking about.  Compiling for the Stanford MIPS is very
much like compiling to microcode.  Rather than make this a dissertation,
please accept the following as recommended readings:

"The Case for the Reduced Instruction Set Computer," Computer Arch. News,
Vol. 8, No. 6, Oct. 1980.

"RISC I:  A Reduced Instruction Set VLSI Computer," Proceedings from the
Eighth Symp. on Comp. Arch., May 1981.

"Comments on 'The Case for the Reduced Instruction Set Computer,'" CAN,
Vol. 8, No. 6, Oct. 1980.

"MIPS:  A Microprocessor Architecture," Proceedings from the 15th Workshop
on Microprogramming, Nov. 1982.

"The 801 Minicomputer," Proceedings from the Symp. on Arch. Support for 
Programming Langs. and OSs, March 1982.

There are many more, especially by Patterson, but they are mostly repetitive.
There is one in VLSI Design which describes how the first RISC I chips performed.
Have fun.  This is an interesting area; everybody seems to think that
everyone else's design is wrong ("They'll never succeed, they're doing
it all wrong.").
(I didn't intend to show favoritism in my list of references; if anyone
has any other suggestions, I want to read 'em.)

doug@terak.UUCP (Doug Pardee) (05/28/85)

> If we created the chip
> using only the instructions the compiler needed, we could use less logic.
> We could decode the instructions faster because the microcode is simpler
> (more ooomph per MHz), production is simpler, yield is higher, speeds are
> faster, and everyone is happy except the assembler programmer.
> ...
> This is the concept behind the RISC architecture ...

RISC is an interesting concept, but I have a major reservation.  Perhaps
someone can explain to me how on earth you're going to feed instructions
to a RISC machine fast enough?

All of the popular 8 and 16 bit microprocessors are speed limited by
instruction fetch, not by instruction complexity.  I will entertain the
objection that the 6502, with its critical shortage of on-chip
registers, is also limited by operand accesses.  The usual RISC machine
has lots of registers, so operand accesses shouldn't be a problem.
And nobody says that a non-RISC cpu can't have lots of registers.

The knee-jerk answer is "cache".  But that's only an answer if one
refuses to allow non-RISC cpus to use cache; they can fit more logic
into any given cache than can RISC cpus, thereby having a better "hit
ratio" than RISC.

Of course, one could design a RISC machine with a super-high-speed ROM
or cache in which one could store the commonly used functions like
multiplication and division, and then one would only have to fetch a
subroutine call from the (slow) instruction stream.

But doesn't that sound like your everyday, garden variety microcoded
non-RISC cpu?
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
               ^^^^^--- soon to be CalComp

lucius@tardis.UUCP (Lucius Chiaraviglio) (05/30/85)

> RISC is an interesting concept, but I have a major reservation.  Perhaps
> someone can explain to me how on earth you're going to feed instructions
> to a RISC machine fast enough?
> 
> All of the popular 8 and 16 bit microprocessors are speed limited by
> instruction fetch, not by instruction complexity.  I will entertain the
> objection that the 6502, with its critical shortage of on-chip
> registers, is also limited by operand accesses.  The usual RISC machine
> has lots of registers, so operand accesses shouldn't be a problem.
> And nobody says that a non-RISC cpu can't have lots of registers.

	How about giving the processor an instruction and operand queue in
addition to a cache, and giving both the processor-to-cache and
cache-to-memory paths a width that is some multiple of the length of a
normal data word?  The idea is to do memory access in parallel (even more
than it already is -- parallel not only at the bit level but at the word
level as well; I believe this is not a foreign idea -- if I remember
correctly the VAX has a cache-to-memory data path of this type) so as to
make up for the slow serial speed of these operations.

> The knee-jerk answer is "cache".  But that's only an answer if one
> refuses to allow non-RISC cpus to use cache; they can fit more logic
> into any given cache than can RISC cpus, thereby having a better "hit
> ratio" than RISC.

	Why can't you fit as much logic into the cache of a RISC machine as
into the cache of a standard CPU?  The rest of the CPU would still be RISC; I
thought part of the idea of some RISC projects was to save hardware space for
things like this.  Of course, some RISC projects also have the idea of making
the compiler and/or the operating system take care of managing the cache.

> Of course, one could design a RISC machine with a super-high-speed ROM
> or cache in which one could store the commonly used functions like
> multiplication and division, and then one would only have to fetch a
> subroutine call from the (slow) instruction stream.
> 
> But doesn't that sound like your everyday, garden variety microcoded
> non-RISC cpu?

	Yes, but with simpler and faster hardware, and no fixed microcode
(unless you take the step of putting these routines in ROM, which could be
useful for some applications).  Some CPUs have writable control store that
can be loaded without rebooting the system, but on a RISC the software (the
compiler, for one thing) takes more of the responsibility for this, so you
can arrange to have even more flexibility and less hassle with context
switches -- for the non-assembly programmer, that is; at the low level the
hardware and the software do a great amount of work (the new hardware that
RISC machines put in the space they save on conventional hardware is for
this kind of thing) to make things easy for the upper levels, and thus to
make the whole system run more smoothly.  In other words, you still get
the advantages of your "everyday, garden variety microcoded non-RISC cpu"
(except that you have to spend more money on lots of fast memory, including
the cache, but memory is getting cheap enough these days that the improved
performance is worth the additional expense), but lack some of the
disadvantages.

-- 
	-- Lucius Chiaraviglio
	{ seismo!tardis!lucius | lucius@tardis.ARPA | lucius@tardis.UUCP }

brooks@lll-crg.ARPA (Eugene D. Brooks III) (05/30/85)

The knee-jerk answer to the instruction fetch bandwidth problem, cache,
is a valid answer.  The argument that one can give as much cache to a
complicated instruction set engine and thereby get as much performance is
not valid.  The performance reduction for the complicated instruction set
comes from the time spent running microcode to decode and execute
instructions.

A good example of this is the VAX instruction set vs. the Ridge instruction
set.  The Ridge achieves the same performance as the VAX at a much lower
hardware cost.  The performance increase arises from the simplicity of the
instruction set.  This difference also shows up when you emulate these
architectures in software.  The instruction decode for such emulators on
the VAX is a nightmare, and the designers of the VAX faced the same
nightmare when they designed the hardware and wrote the microcode for it.

thoth@tellab3.UUCP (Marcus Hall) (05/31/85)

>RISC is an interesting concept, but I have a major reservation.  Perhaps
>someone can explain to me how on earth you're going to feed instructions
>to a RISC machine fast enough?

One way to help with this problem is to use fairly wide memory accesses
(at least for instruction fetches).  Thus, in one memory cycle, many
instructions may be fetched simultaneously.  Of course this is done for
non-RISC machines as well, but a non-RISC machine will become
execution-bound sooner.

Also, since the RISC instruction set is simpler, the op-codes require fewer
bits, so a memory fetch will get more RISC instructions in one cycle
than it would non-RISC instructions.
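As a rough illustration of the point (the fetch and opcode widths here are hypothetical, not measurements of any particular machine), the instructions delivered per memory cycle are just the fetch width divided by the instruction width:

```python
# Instructions delivered per memory cycle, assuming fixed-size
# instructions and an aligned fetch (illustrative numbers only).
def instrs_per_fetch(fetch_bits, instr_bits):
    return fetch_bits // instr_bits

# A hypothetical 64-bit memory path:
print(instrs_per_fetch(64, 16))  # 16-bit opcodes -> 4 per cycle
print(instrs_per_fetch(64, 32))  # 32-bit opcodes -> 2 per cycle
```

Whether RISC opcodes are in fact narrower is disputed later in this thread, so treat the 16-bit case as an assumption rather than a given.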

marcus hall
..!ihnp4!tellab1!tellab2!thoth

mat@amdahl.UUCP (Mike Taylor) (05/31/85)

> All of the popular 8 and 16 bit microprocessors are speed limited by
> instruction fetch, not by instruction complexity.  I will entertain the
> objection that the 6502, with its critical shortage of on-chip
> registers, is also limited by operand accesses.  The usual RISC machine
> has lots of registers, so operand accesses shouldn't be a problem.
> And nobody says that a non-RISC cpu can't have lots of registers.
> 
> The knee-jerk answer is "cache".  But that's only an answer if one
> refuses to allow non-RISC cpus to use cache; they can fit more logic
> into any given cache than can RISC cpus, thereby having a better "hit
> ratio" than RISC.
> 
> Of course, one could design a RISC machine with a super-high-speed ROM
> or cache in which one could store the commonly used functions like
> multiplication and division, and then one would only have to fetch a
> subroutine call from the (slow) instruction stream.
> 
> But doesn't that sound like your everyday, garden variety microcoded
> non-RISC cpu?
> -- 
> Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
>                ^^^^^--- soon to be CalComp

Fetching one instruction per cycle is pretty much necessary to execute
one instruction per cycle.  Whether this is a bug or a feature is a
good question.  The difference between fetching a microword every cycle
and fetching an instruction is that the microcoded machine has a fixed
microsequence in ROM while the instructions come from RAM.  Since they
come from RAM, they can be optimized by the compiler for the particular
job being done. Clearly you pay for this by diverting fetches from
microstore to RAM.  Some RISC implementations try to get this cost back by
using the microstore's chip space for registers and/or instruction queues,
reducing total fetch traffic.

How can you tell a machine is not a RISC ?  Proposal...

If it is possible to replace some instructions with sequences of other
instructions with only small performance penalties (or gains, a la VAX)
then those instructions probably have no business in the instruction
set.  Their presence is probably slowing down the whole system.  Take
them out.  Redesign to use the space you got back by removing them.

Repeat the above cycle until no more instructions can be removed or
the cycle doesn't improve performance.  You now have a RISC.
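The pruning cycle proposed above can be sketched as a loop; note that `benchmark` here is a hypothetical stand-in for whatever measurement process you use, and the toy cost model below (fewer instructions decode faster) is purely illustrative:

```python
# Iteratively remove candidate instructions, keeping each removal only
# if measured performance stays the same or improves.
def prune(instruction_set, candidates, benchmark):
    baseline = benchmark(instruction_set)
    for instr in candidates:
        trial = instruction_set - {instr}
        if benchmark(trial) >= baseline:  # same or better without it
            instruction_set = trial
            baseline = benchmark(trial)
    return instruction_set

# Toy usage with a stand-in cost model: smaller sets score higher.
full = {"ADD", "SUB", "LOAD", "STORE", "POLY", "EDITPC"}
score = lambda iset: 100 - 2 * len(iset)
print(prune(full, ["POLY", "EDITPC"], score))
```

Under this toy model both complex instructions are removed; on a real design the benchmark would have to account for the longer instruction sequences that replace them.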

Arguments ?
-- 
Mike Taylor                        ...!{ihnp4,hplabs,amd,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

joel@peora.UUCP (Joel Upchurch) (05/31/85)

        I think your argument ignores the fact that with  a  given
        fabrication technology there is only so much  function  you  can
        put on a given chip.  If you choose to  have  a  large  control
        store ROM for a complex instruction set,  then  you  must  make
        sacrifices in other parts of the  chip.  This  may  mean  fewer
        registers,  or  less  cache,  or a slower ALU, or other trade-
        offs.

        RISC argues that this is not a good trade:  that  it  is
        difficult  to  write  compilers to use complicated instruction
        sets, and that these high level operations are not  necessarily
        any  faster  than a series of low level operations.  Also, with
        simple instruction sets you have the option of  hardwiring  it
        for additional performance.

doug@terak.UUCP (Doug Pardee) (06/03/85)

Responses to early comments on my question about how a RISC cpu can
fetch instructions fast enough to keep up with non-RISC cpus:

> The knee jerk answer to the instruction fetch bandwidth problem, cache,
> is a valid answer.  The argument that one can give as much cache to a
> complicated instruction set engine and thereby get as much performance is
> not valid.  The performance reduction for the complicated instruction set
> comes from the time spent running microcode decode and execute instructions.

I'd believe this except that it ain't so.  The performance reduction on
current non-RISC single-chip cpus comes from instruction fetch.  For
example, the NS32016 can do your basic RISC-type operations in 3 or 4
clock cycles, but it takes 5 clock cycles to fetch the next instruction.
RISC-type instructions on the 32016 therefore take 5 clock cycles each.
And the 32016 is "burdened" by non-RISC "microcode decode and execute
instructions."

> One way to help with this problem is to use fairly wide memory accesses
> (at least for instruction fetches).  Thus, in one memory cycle, many
> instructions may be fetched simultaniously.  Of course this is done for
> non-RISC machines as well, but a non-RISC machine will become
> execution-bound sooner.

The NS32032 and MC68020 both fetch 32 bits of instruction data at one
time.  Both have "zillions" of pins on the package in order to pull that
off; I can't imagine building a RISC chip with 128-bit wide instruction
fetch, requiring over 250 pins on the package.

And if we're not talking single-chip architectures, we're going to have
a devil of a time making rational comparisons.  After all, IBM has used
a non-RISC architecture for some time and it goes pretty fast :-)

I'll believe that a non-RISC machine will become execution-bound sooner,
but this is just another way of saying that the RISC machine is much more
limited by instruction fetch than the non-RISC machine is.

> Also, since the RISC instruction set is simpler, the op-codes require fewer
> bits, so a memory fetch will get more RISC instructions in one cycle
> than it would non-RISC instructions.

I thought this too, until I looked into some RISC machines.  They use
32-bit instruction words, twice as wide as the equivalent instructions
in, say, the 680xx and 320xx cpus.

Still looking for the answer...
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
               ^^^^^--- soon to be CalComp

steveg@hammer.UUCP (Steve Glaser) (06/05/85)

In article <1590@amdahl.UUCP> mat@amdahl.UUCP (Mike Taylor) writes:
>> All of the popular 8 and 16 bit microprocessors are speed limited by
>> instruction fetch, not by instruction complexity.  I will entertain the
>> objection that the 6502, with its critical shortage of on-chip
>> registers, is also limited by operand accesses.  The usual RISC machine
>> has lots of registers, so operand accesses shouldn't be a problem.
>> And nobody says that a non-RISC cpu can't have lots of registers.
>> 
>> The knee-jerk answer is "cache".  But that's only an answer if one
>> refuses to allow non-RISC cpus to use cache; they can fit more logic
>> into any given cache than can RISC cpus, thereby having a better "hit
>> ratio" than RISC.
>> 
>> Of course, one could design a RISC machine with a super-high-speed ROM
>> or cache in which one could store the commonly used functions like
>> multiplication and division, and then one would only have to fetch a
>> subroutine call from the (slow) instruction stream.
>> 

The basic tenets of RISC are that you throw out the useless baggage
that's in there "'cause somebody wanted it and microcode was cheap" and
get back to a reasonable minimal set (chuck out an instruction;
measure performance; if better or the same, repeat).  With the real estate
gained, you then start putting reasonable sized caches and/or prefetch
queues on chip instead of useless instructions.

Some folks have decided that RISC means that you eliminate the
microcode and directly decode the operations.  This is too narrow a
view.  It doesn't matter how you implement what's left, as long as you
only implement what is really needed.

As for caches, one issue that hasn't been mentioned is the use of block
mode transfers for fast cache filling.  There are RAM chips out there
that can access "nearby" RAM cells significantly faster than random
ones (so-called page mode, nibble mode, or static column RAMs).  These
were mainly developed for the raster display market (so you don't
waste all your memory bandwidth refreshing the screen and you get a
chance to make updates to it), but they are useful in other areas as well.
Since cache blocks are typically a power of 2 in size and start on
nice boundaries, it's pretty easy to map them onto the notion of
"nearby" supported by a RAM chip.  The only problems you have with this
kind of architecture are (1) you need a new bus protocol, and (2) ECC
gets harder because you need to do it faster (anybody for a pipelined ECC
chip that can keep up with 50 ns RAM cycles?).

	Steve Glaser

henry@utzoo.UUCP (Henry Spencer) (06/06/85)

> I thought this too, until I looked into some RISC machines.  They use
> 32-bit instruction words, twice as wide as the equivalent instructions
> in, say, the 680xx and 320xx cpus.

Yes and no.  The Berkeley RISC project adopted 32-bit instructions for
simplicity in initial work, not because they thought it was right for
the final design.  If you look, you'll find at least one paper from them
discussing a support chip which is (a) an instruction cache, and (b) an
instruction-encoding expander.  The latter function makes a large
difference to instruction density without introducing any extra delays.

Also, don't be too sure that the 68* and 32* chips use 16-bit instructions
a lot.  Remember that things like offsets take extra bytes, and those get
used *a lot* on those machines -- virtually every memory reference needs
one.  On the pdp11, a 16-bit machine if there ever was one, the average
instruction length was in fact about 32 bits.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

atbowler@watmath.UUCP (Alan T. Bowler [SDG]) (06/06/85)

In article <591@terak.UUCP> doug@terak.UUCP (Doug Pardee) writes:
>
>I thought this too, until I looked into some RISC machines.  They use
>32-bit instruction words, twice as wide as the equivalent instructions
>in, say, the 680xx and 320xx cpus.
>
Actually, a brief examination of code actually produced by compilers
on a 68000 shows that the compiler seldom generates anything but
32 bit instructions (including addressing).  It generates 16 bit
instruction formats less often than it generates 48 bit ones (absolute
address).
   This is not to say I am a big fan of what is currently being called
RISC.  It strikes me that a good deal of the claims about performance
don't jibe with what happened, and where there were actual performance
gains they were attributed to the wrong part of the architecture.
   I am also not a big fan of very complex instructions.  My
experience shows that supplied "CALL" instructions that do any
more than store a return address in a register make subroutine call
design difficult, and usually slower.
  There is entirely too much religious faith and too little fact in
most "RISC" discussion.

doug@terak.UUCP (Doug Pardee) (06/07/85)

> Some folks have decided that RISC means that you eliminate the
> microcode and directly decode the operations.  This is too narrow of a
> view.

An elephant is much like a tree, or is it a rope, or is it a snake...

If we're going to talk about RISC, we'd better decide what we're going
to call RISC.  Is RISC:
  1) a way to save silicon real estate on single-chip cpus?  or
  2) a concept that's equally applicable to board-level cpus?

Is RISC:
  1) strictly limited instruction set, essentially microcode level?  or
  2) any cpu without "useless" instructions? (define "useless")  or
  3) any cpu with oodles of registers?

For the moment, I'm going to talk about non-microcoded single-chip cpus
with strictly limited instruction sets.  Something you'd put in a TTL
system with dynamic RAMs.  I've come to the conclusion that for this
kind of RISC, its time has passed.  It made a lot of sense in the late
'70s when cpu cycles were 250-400 ns and memory cycles were 300-400 ns.

But since then the cpu manufacturers have concentrated on speed while
the memory manufacturers concentrated on capacity.  With cpu cycles now
at 60-100 ns and memory cycles at 225-325 ns, the way to improve speed
in a system is to decrease memory cycles, not cpu cycles.  That means
making instructions even *more* complex, so that each instruction
fetched does as much work as possible.  At 100 ns per cpu cycle, you can
afford to let the cpu do a lot of work that isn't always needed, because
you can throw away *half* of it and still have the cpu be faster than
if you fetched the microinstructions from main memory (using 120 ns
access/220 ns cycle time 256K DRAMs).

By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
only 25 cpu cycles in order to compete.  All the cache in the world
ain't gonna help...
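For what it's worth, the 25-cycle break-even figure follows directly from the numbers quoted:

```python
# Break-even cycle count: how many 330 ns RISC-II cycles fit in the
# 8.3 us the 10 MHz NS32016 needs for a 32-bit signed multiply.
ns32016_multiply_s = 8.3e-6
risc_ii_cycle_s = 330e-9
print(round(ns32016_multiply_s / risc_ii_cycle_s))  # -> 25
```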
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
               ^^^^^--- soon to be CalComp

brownc@utah-cs.UUCP (Eric C. Brown) (06/07/85)

In article <14882@watmath.UUCP> atbowler@watmath.UUCP (Alan T. Bowler [SDG]) 
writes:
>   I am also not a big fan of very complex instructions either.  My
>experience shows that supplied "CALL" instructions that do any
>more than store a return address in a register make subroutine call
>design difficult, and usually slower.

I would agree with you to a point:  the return address should be saved on
a stack.  Jamming it into a register only means you blow an instruction
inside the subroutine saving the old pc anyway.  (Recursion? recursion?
all I want to do is call another routine inside the first...)

Eric C. Brown
..!{decvax, ihnp4, seismo}!utah-cs!brownc
brownc@utah-cs

mash@mips.UUCP (John Mashey) (06/09/85)

Eric C. Brown ..!{decvax, ihnp4, seismo}!utah-cs!brownc writes:
> I would agree with you to a point:  the return address should be saved on
> a stack.  Jamming it into a register only means you blow an instruction
> inside the subroutine saving the old pc anyway.  (Recursion? recursion?
> all I want to do is call another routine inside the first...)

Not necessarily: 1) Leaf routines (i.e., those that make no further calls)
might not want to burn memory references for saving/restoring the register
when it need not be saved/restored at all.  2)  As usual, one must make
all sorts of design tradeoffs - simplifying something somewhere may
complexify something else.  In this case, one must carefully weigh the
usefulness of having subroutine call/return save/restore regs, versus
complexity of fault-handling and slowdown of basic cycle to allow for it,
especially in heavily pipelined designs.  This is not to say that
that mechanism is a bad idea, merely that the ramifications can surprise you
and must be taken into account.
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043

hal@cornell.UUCP (Hal Perkins) (06/09/85)

In article <601@terak.UUCP> doug@terak.UUCP (Doug Pardee) writes:
>By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
>10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
>8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
>only 25 cpu cycles in order to compete.  All the cache in the world
>ain't gonna help...

Now just a moment...  The Berkeley chips were built by academic folks
using conservative design rules, etc.  If they had been built by
experienced professional chip designers using state-of-the-art
technology they would have been a lot faster.  It's not fair to pick on
the relatively slow clock cycle of the experimental chips -- they were
built to prove a point (which they did), not to produce a product.
(I'm not making this up -- if you read some of the RISC papers by
Patterson et al., they make the same point.)

Hal Perkins                         UUCP: {decvax|vax135|...}!cornell!hal
Cornell Computer Science            ARPA: hal@cornell  BITNET: hal@crnlcs

jans@mako.UUCP (Jan Steinman) (06/10/85)

In article <5673@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Also, don't be too sure that the 68* and 32* chips use 16-bit instructions
>a lot.  Remember that things like offsets take extra bytes...

In my experience 16 bit instructions get used quite often on the NS32000 in
well written assembly code.  The NS32000 has an edge over the MC68000 in code
density in that instructions need not be word-aligned.  Henry is correct in
stating that offsets are used on virtually every memory reference, but those
offsets are usually 8 bits.  Following are some statistics for one module
pulled at random from my present project.  The code is highly optimized for
speed, so code density is not even as good as it could be, e.g. "jump" is
used instead of "br", saving one clock but costing two bytes, etc.

	Size	Count
	8	0
	16	14
	24	13
	32	1
	40	0
	48	2
	Total	30
	Ave	22.1 bits per instruction.
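As a sanity check, the 22.1 figure is just the count-weighted mean of the table above:

```python
# Weighted average instruction size from the table above.
sizes_and_counts = {8: 0, 16: 14, 24: 13, 32: 1, 40: 0, 48: 2}
total = sum(sizes_and_counts.values())               # 30 instructions
bits = sum(s * c for s, c in sizes_and_counts.items())
print(round(bits / total, 1))  # -> 22.1
```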

The average instruction size of 22.1 bits is indeed more than the 16 bits the
original poster assumed, but much less than the 32 bits Henry expected, due
in large part to the large number of 24 bit instructions available on the
NS32000.  I believe the MC68000 would have a larger number of 32 bit
instructions with the consequent increase in average bits-per-instruction.

Of course, no analogy can be drawn for compiled code.  Many compilers under-
use register space and I suspect the average bits-per-instruction would be
much higher for compiled code on both machines.
-- 
:::::: Jan Steinman		Box 1000, MS 61-161	(w)503/685-2843 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::

doug@terak.UUCP (Doug Pardee) (06/11/85)

Once again, I find that I didn't do a good job of explaining myself.
Let me try again...

In the following comment, notice the word "equivalent":

me> I thought this too, until I looked into some RISC machines.  They use
me> 32-bit instruction words, twice as wide as the equivalent instructions
me> in, say, the 680xx and 320xx cpus.

> Also, don't be too sure that the 68* and 32* chips use 16-bit instructions
> a lot.  Remember that things like offsets take extra bytes, and those get
> used *a lot* on those machines -- virtually every memory reference needs
> one.

My point was supposed to be that you could take a 680xx/320xx and limit
yourself to the RISC instruction set (the "equivalent" instructions) and
have 16-bit instructions instead of RISC's 32-bit instructions.

Presumably, the reason that the longer instructions, with their offsets
etc., are used so much is because those instructions are more effective
than the RISC instructions for the task at hand (this sounds like
another discussion that I'm involved in :-)

> instruction-encoding expander.  The latter function makes a large
> difference to instruction density without introducing any extra delays.

In a microcoded cpu, the opcode is decoded and causes a sequence of
very simple micro-instructions to be executed.

In the proposed system, the opcode is decoded and causes a sequence of
very simple RISC instructions to be executed.

What's the difference?  Where does the much-vaunted performance gain
come from?
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
               ^^^^^--- soon to be CalComp

john@frog.UUCP (John Woods) (06/12/85)

> In article <5673@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
> >Also, don't be too sure that the 68* and 32* chips use 16-bit instructions
> >a lot.  Remember that things like offsets take extra bytes...
>In my experience 16 bit instructions get used quite often on the NS32000 in
>well written assembly code.  The NS32000 has an edge over the MC68000 in code
>density in that instructions need not be word-aligned.  Henry is correct in
>stating that offsets are used on virtually every memory reference, but those
>offsets are usually 8 bits.  Following is some statistics for one module
>pulled at random from my present project.  The code is highly optimized for
>speed, so code density is not even as good as it should be, i.e. "jump" is
>used instead of "br", saving one clock, but costing two bytes, etc.
> 	Size	Count	Size Count
> 	8	0	16	14
> 	24	13	32	1
> 	40	0	48	2
> 	Total	30	Ave	22.1 bits per instruction.
>The average instruction size of 22.1 bits is indeed more than the 16 bits the
>original poster assumed, but much less than the 32 bits Henry expected, due
>in large part to the large number of 24 bit instructions available on the
>NS32000.  I believe the MC68000 would have a larger number of 32 bit
>instructions with the consequent increase in average bits-per-instruction.
> 
I counted words for some random compiled code here for the 68000 (the famous
Knight's Tour, which I still have lying around).
Size (bytes)		2	4	6	8
Sample 1 (main):	23	4	3	0	Average 21.3 bits/instr.
Sample 2 (try):		18	7	4	0	Average 23.4 bits/instr.

The Greenhills compiler tries to make good use of registers.

On the other hand, I don't know how much these instructions actually
accomplished.  I wouldn't be surprised to find that the 32032 does more with
those 22 bits per instruction, but I wouldn't be horribly shocked to find that
the 68000 does more [despite my employer, I am a closet 32032 fan, by the way -
it is almost as nice as the PDP-11 !-) ].  I suppose the true lesson is that
"bits per instruction" is yet another Red Herring in Heavy Oil:  a computer
with 64 bit instructions would have "poor" "code density", but if one of those
instructions was "Solve the (reg0)x(reg0) Knight's Tour and output the table
to channel (reg1)", the program would be awfully short !-).


--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA

Five tons of flax!

west@sdcsla.UUCP (Larry West) (06/12/85)

In article <2335@cornell.UUCP> hal@gvax.UUCP (Hal Perkins) writes:
>In article <601@terak.UUCP> doug@terak.UUCP (Doug Pardee) writes:
>>By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
>>10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
>>8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
>>only 25 cpu cycles in order to compete.  All the cache in the world
>>ain't gonna help...
>
>Now just a moment...  The Berkeley chips were built by academic folks
>using conservative design rules, etc.  If they had been built by
>experienced professional chip designers using state-of-the-art
>technology they would have been a lot faster. ... <text omitted> ...

Hal's point is valid.   Another point to consider is this: what you
really want for a general-purpose computer is throughput.   How much
of a typical instruction stream is composed of 32-bit signed multiplies?

Obviously, in many applications, multiplication would take on great
significance.   But one point of RISC is that optimizing a few
special-purpose instructions can hurt you in general-purpose applications.
-- 

Larry West			Institute for Cognitive Science
(USA+619-)452-6220		UC San Diego (mailcode C-015) [x6220]
ARPA: <west@nprdc.ARPA>		La Jolla, CA  92093  U.S.A.
UUCP: {ucbvax,sdcrdcf,decvax,ihnp4}!sdcsvax!sdcsla!west OR ulysses!sdcsla!west

chuck@dartvax.UUCP (Chuck Simmons) (06/13/85)

> In article <5673@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
> >Also, don't be too sure that the 68* and 32* chips use 16-bit instructions
> >a lot.  Remember that things like offsets take extra bytes...
> 
> In my experience 16 bit instructions get used quite often on the NS32000 in
> well written assembly code.  The NS32000 has an edge over the MC68000 in code
> density in that instructions need not be word-aligned.  Henry is correct in
> stating that offsets are used on virtually every memory reference, but those
> offsets are usually 8 bits.  Following are some statistics for one module
> pulled at random from my present project.  The code is highly optimized for
> speed, so code density is not even as good as it should be, i.e. "jump" is
> used instead of "br", saving one clock, but costing two bytes, etc.
> 
> 	Size	Count
> 	8	0
> 	16	14
> 	24	13
> 	32	1
> 	40	0
> 	48	2
> 	Total	30
> 	Ave	22.1 bits per instruction.
> 
> The average instruction size of 22.1 bits is indeed more than the 16 bits the
> original poster assumed, but much less than the 32 bits Henry expected, due
> in large part to the large number of 24 bit instructions available on the
> NS32000.  I believe the MC68000 would have a larger number of 32 bit
> instructions with the consequent increase in average bits-per-instruction.
> 
> Of course, no analogy can be drawn for compiled code.  Many compilers under-
> use register space and I suspect the average bits-per-instruction would be
> much higher for compiled code on both machines.
> :::::: Jan Steinman		Box 1000, MS 61-161	(w)503/685-2843 ::::::

Pulling out a "random" chunk of 68000 code (the only chunk I have sitting
near by) gives us the following table:

     Size    Count
      16      22
      32      17
      48       1
     Total    40
     Ave    23.6 bits per instruction.

I'm not sure that code density is really what we want to be looking
at, however.  I can recode the above algorithm to use more instructions
that are smaller on the average and thus achieve a higher code density.
(The 17 32-bit instructions can be replaced by 33 16-bit instructions
and 2 32-bit instructions, giving an average of 18.4 bits per instruction.)
However, this denser code would be both slower and longer.  Somewhere
we need to take into account the "usefulness" of each instruction.

Chuck Simmons

hull@hao.UUCP (Howard Hull) (06/15/85)

In article 195@frog, John Woods writes:
> On the other hand, I don't know how much these instructions actually
> accomplished.  I wouldn't be surprised to find that the 32032 does more with
> those 22 bits per instruction, but I wouldn't be horribly shocked to find that
> the 68000 does more [despite my employer, I am a closet 32032 fan, by the way-
> it is almost as nice as the PDP-11 !-) ].

Just out of curiosity, what *ARE* you 320xx and 68k fans using for the PDP-11
instruction     MOV (SP)+,@(R5)+    eh?  (It's useful for initializing tables
or mapped video windows from stack images.)
I only need one or two examples, please, so read all your micro mail before
replying to this.  If you see a correct answer, and you think yours is better,
please send me mail; In a subsequent posting, I'll summarize whatever I get.
Thanks.

								     Howard Hull
[If yet unproven concepts are outlawed in the range of discussion...
                   ...Then only the deranged will discuss yet unproven concepts]
        {ucbvax!hplabs | allegra!nbires | harpo!seismo } !hao!hull

henry@utzoo.UUCP (Henry Spencer) (06/17/85)

> By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
> 10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
> 8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
> only 25 cpu cycles in order to compete.  All the cache in the world
> ain't gonna help...

The RISC II was designed by a bunch of grad students, using simplistic
design rules and mediocre MOS processing, with limited opportunity to try
again if the original chips didn't work.  The 10MHz NS32016 was done by
a swarm of professional silicon bashers, using professional facilities
and souped-up processes, over a long period of time with *many* design
iterations.  (In fact, too %@$@%#@% many for those of us who were waiting
for things to settle down and work...)  This comparison means little.

Also, multiplies are actually fairly infrequent operations in most programs.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

henry@utzoo.UUCP (Henry Spencer) (06/17/85)

> My point was supposed to be that you could take a 680xx/320xx and limit
> yourself to the RISC instruction set (the "equivalent" instructions) and
> have 16-bit instructions instead of RISC's 32-bit instructions.

Try doing memory references in 16 bits on either the 680xx or 320xx.
It can't be done, because the opcode alone is 16 bits.  And you need
memory references more on the commercial machines, because the RISC's
register architecture isn't present to reduce memory usage.

> > instruction-encoding expander.  The latter function makes a large
> > difference to instruction density without introducing any extra delays.
> 
> In a microcoded cpu, the opcode is decoded and causes a sequence of
> very simple micro-instructions to be executed.
> 
> In the proposed system, the opcode is decoded and causes a sequence of
> very simple RISC instructions to be executed.

Wrong, it causes *one* very simple RISC instruction to be executed.
The expander function is purely an encoding technique, it does not
introduce another level of emulation.  The point is that the poor code
density of current RISCs is the result of optimization for experimental
work rather than production.  More compact encodings, without change to
the basic concept, are both possible and desirable once the basic issues
are understood.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

jer@peora.UUCP (J. Eric Roskos) (06/20/85)

henry@utzoo.UUCP (Henry Spencer @ U of Toronto Zoology) writes:

> Also, multiplies are actually fairly infrequent operations in most programs.

That is, if "most programs" don't use multidimensional arrays...
-- 
Shyy-Anzr:  J. Eric Roskos
UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
US Mail:    MS 795; Perkin-Elmer SDC;
	    2486 Sand Lake Road, Orlando, FL 32809-7642

	    Bar ol bar / Gur pbyq rgpurq cyngr /
	    Unf cevagrq gur jnez fgnef bhg.

henry@utzoo.UUCP (Henry Spencer) (06/21/85)

> > Also, multiplies are actually fairly infrequent operations in most programs.
> 
> That is, if "most programs" don't use multidimensional arrays...

Except in a few specific environments, most programs indeed do not use
multidimensional arrays.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

kds@intelca.UUCP (Ken Shoemaker) (06/21/85)

> introduce another level of emulation.  The point is that the poor code
> density of current RISCs is the result of optimization for experimental
> work rather than production.  More compact encodings, without change to
> the basic concept, are both possible and desirable once the basic issues
> are understood.
> -- 
> 				Henry Spencer @ U of Toronto Zoology
> 				{allegra,ihnp4,linus,decvax}!utzoo!henry

er, I don't think so...one of the basic concepts of RISCs as I understand
them is to reduce the number of pipeline stages in the execution of
instructions.  Adding another just to do an expansion/shuffling of
opcode bytes flies directly in the face of that thinking.
-- 
...and I'm sure it wouldn't interest anybody outside of a small circle
of friends...

Ken Shoemaker, Microprocessor Design for a large, Silicon Valley firm

{pur-ee,hplabs,amd,scgvaxd,dual,qantel}!intelca!kds
	
---the above views are personal.  They may not represent those of the
	employer of its submitter.

henry@utzoo.UUCP (Henry Spencer) (06/24/85)

> er, I don't think so...one of the basic concepts of RISCs as I understand
> them is to reduce the number of pipeline stages in the execution of
> instructions.  Adding another just to do an expansion/shuffling of
> opcode bytes flies directly in the face of that thinking.

Remember that the RISC is necessarily decoding its opcodes; not even on
a machine that simple is a numeric opcode a direct encoding of the internal
control signals.  The point is that the current instruction encoding was
chosen for simplicity, not compactness, and one can do better *without*
compromising the principles of the machine.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

baba@spar.UUCP (Baba ROM DOS) (06/25/85)

>                                           The 10MHz NS32016 was done by
> a swarm of professional silicon bashers, using professional facilities
> and souped-up processes, over a long period of time with *many* design
> iterations.
>

Don't you see a contradiction lurking in there?

					Baba ROM DOS

brooks@lll-crg.ARPA (Eugene D. Brooks III) (06/25/85)

> > > Also, multiplies are actually fairly infrequent operations in most programs.
> > 
> > That is, if "most programs" don't use multidimensional arrays...
> 
> Except in a few specific environments, most programs indeed do not use
> multidimensional arrays.

Perhaps we could do without the integer multiplier :-), but we had better
not drop the floating point adder and multiplier, and they had both better
operate at one op per clock after filling the pipeline.

meissner@rtp47.UUCP (Michael Meissner) (06/25/85)

Another case where multiplication occurs frequently is constructs of the
sort:

	struct {
		long	l;
		short	s;
	} *p;

	int	i;

	main(){
		/*...*/
		p += i;
		p[i].l = 1;
		/*...*/
	}

I.e., pointer arithmetic involving non-constant integers, particularly if
the size is not a power of 2.

--
	Michael Meissner
	Data General Corporation
	...{ ihnp4, decvax }!mcnc!rti-sel!rtp47!meissner

guy@sun.uucp (Guy Harris) (06/26/85)

> > > > Also, multiplies are actually fairly infrequent operations in most
> > > > programs.
> > > 
> > > That is, if "most programs" don't use multidimensional arrays...
> > 
> > Except in a few specific environments, most programs indeed do not use
> > multidimensional arrays.
> 
> Perhaps we could do without the integer multiplier :-), but we had better
> not drop the floating point adder and multiplier, and they had both better
> operate at one op per clock after filling the pipeline.

Methinks "most" is in the eye of the beholder.  Most UNIX utilities, and the
UNIX kernel, do relatively few multiplications *or* floating point
operations (the kernel need not do any; the 4.2BSD scheduler does, but the
Sun 4.2BSD scheduler and the 2.9BSD scheduler, which have the same
load-average dependent decay on the average weighting for "p_cpu", don't)
and rarely use multidimensional arrays.  The same is probably true for many
other classes of applications, such as the office automation/"personal
productivity" applications popular on PCs, for instance.

However, I suspect a scientific programmer would answer differently.  I used
multidimensional arrays in scientific programs I've written; they are quite
common in such programs (they are a natural data structure for many types of
calculations, like those which use models of two-dimensional or
three-dimensional spaces).  Those programs also would like the FP adder and
multiplier; I don't know how many integer multiplies or divides they do,
other than in support of accesses to multi-dimensional arrays.

In the case of such accesses, if the indices being used are linear functions
of the induction variable of a loop (I don't know whether there are any
looser conditions possible), a strength reduction can make the
multiplications unnecessary (C programmers may know this trick as "using a
pointer instead of an array subscript").

	Guy Harris

henry@utzoo.UUCP (Henry Spencer) (06/26/85)

> >                                           The 10MHz NS32016 was done by
> > a swarm of professional silicon bashers, using professional facilities
> > and souped-up processes, over a long period of time with *many* design
> > iterations.
> 
> Don't you see a contradiction lurking in there?

I wish I did...  Professional people and facilities plus plenty of time
would suggest getting it right the first time, except that it just makes
them more ambitious instead.  The souped-up processes help speed but don't
do a lot for correctness.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

ken@turtlevax.UUCP (Ken Turkowski) (06/26/85)

>>> Also, multiplies are actually fairly infrequent operations in most programs.
>> 
>> That is, if "most programs" don't use multidimensional arrays...
>
>Except in a few specific environments, most programs indeed do not use
>multidimensional arrays.

Depending on what your product is, you may not need multiplication at
all.  If the end product is a UN*X machine, most UN*X utilities do not
need multiplications.  However, if you are running any engineering or
scientific applications, the time devoted to multiplication is
considerable, and may even dominate execution time if there is no
hardware support for it.

Saving the status register is an infrequent operation; why not do
multiple conditional branches instead?  Virtually no application
programs need it at all.  The operating system just needs it to switch
contexts, which is a task done infrequently, and takes so much time as
it is... :-)
-- 

Ken Turkowski @ CADLINC, Menlo Park, CA
UUCP: {amd,decwrl,hplabs,nsc,seismo,spar}!turtlevax!ken
ARPA: turtlevax!ken@DECWRL.ARPA

***** Support your local number cruncher: join PIFOM!	*****
***** (Programmers In Favor Of Multiplication)		*****

rbt@sftig.UUCP (R.Thomas) (06/28/85)

> Saving the status register is an infrequent operation; why not do
> multiple conditional branches instead?  Virtually no application
> programs need it at all.  The operating system just needs it to switch
> contexts, which is a task done infrequently, and takes so much time as
> it is... :-)
> -- 
> 
> Ken Turkowski @ CADLINC, Menlo Park, CA

It's not as funny as you think!  The old UNIVAC 1107 did just exactly that
to save the status of the carry and overflow flags.  And in order to
restore them, it had to actually do an arithmetic sequence that would
force the flags to have the right values.  There were *no* instructions to
directly save or restore these flags!

An old time hacker,
Rick Thomas
attunix!rbt

For that matter, some benighted hardwares protect the instructions that
load the flags register because there are flags in that register that you
would rather not have the user mucking around with.  On those machines, if
you want to save and restore the flags from user mode, you have to resort
to some such trick as the above.  I know, there are better ways to design
systems, but some designers are not as smart as some designers.

bobbyo@celerity.UUCP (Bob Ollerton) (06/29/85)

Gee, if someone built a computer that did really fast multiplication
(whatever "really fast" is) should they call it the "Rabbit"?

I am sure that there are many designs that offer fast, slow, or
no hardware support for multiplication.  The user pays for the cost
of multiplication hardware support in the final analysis,
either in the cost of the system, or the time spent knitting
paper clips together while running a job.

Maybe designs without fast multiplication (and floating point!)
should be called the "Lizard".

Then someone can introduce the Micro-Lizard II.  ::)) <-echo.

bob.

-- 
Bob Ollerton; Celerity Computing; 
9692 Via Excelencia; San Diego, Ca 92126; (619) 271 9940
{decvax || ucbvax || ihnp4}!sdcsvax!celerity!bobbyo
                              akgua!celerity!bobbyo

sasaki@harvard.ARPA (Marty Sasaki) (06/29/85)

This may be an incredibly stupid question, but I remember reading an
article (in CACM, I think) on RISC machines where most of the
instructions occupied a single byte of memory. Through cleverness,
something like 95% of all instructions executed occupied a single
byte (including operands). This meant that programs would be smaller
and would run faster by the simple fact that they required fewer
memory references to get work done.

All of the recent discussion on RISC machines hasn't mentioned this at
all. Have things changed sufficiently in the recent past that this
small program size doesn't matter? Is this a really dumb question?

-- 
----------------
  Marty Sasaki				net:   sasaki@harvard.{arpa,uucp}
  Harvard University Science Center	phone: 617-495-1270
  One Oxford Street
  Cambridge, MA 02138

mash@mips.UUCP (John Mashey) (06/30/85)

Michael Meissner ...{ ihnp4, decvax }!mcnc!rti-sel!rtp47!meissner writes:
> Another case where multiplication occurs frequently is constructs of the
> sort:
> 	struct {
> 		long	l;
> 		short	s;
> 	} *p;
> 	int	i;
> 	main(){
> 		/*...*/
> 		p += i;
> 		p[i].l = 1;
> 		/*...*/
> 	}
> I.e., pointer arithmetic involving non-constant integers, particularly if
> the size is not a power of 2.

1) In this example, sizeof(*p) == 8 anyway.  To get the intended effect,
use short l[2], for example.
2) In any case, at least some compilers not only do multiplies of
powers of 2 by adds or shifts, but do so for constants that are almost
powers of 2 (i.e., few 1-bits) by shifts and adds or subtracts.
3) Indeed, multiplication does occur frequently in these cases; the real
question is: how frequent are these cases, really?  [Not an answer, but
a question whose answer needs to be known when you're making tradeoffs
in CPU design.]
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043

larry@mips.UUCP (Larry Weber) (07/01/85)

> Another case where multiplication occurs frequently is constructs of the
> sort:
> 
> 	struct {
> 		long	l;
> 		short	s;
> 	} *p;
> 
> 	int	i;
> 
> 	main(){
> 		/*...*/
> 		p += i;
> 		p[i].l = 1;
> 		/*...*/
> 	}
> 

Almost all the examples provided actually use multiplication by a constant.
Thus, on all but the largest (and most expensive) of machines there are
shift/add/subtract sequences that do the trick nicely:
   x * 6  == (((x<<1)+x)<<1) 		three instructions
   x * 7  == (x<<3)-x			two instructions
   x * 119 == (((x<<4)-x)<<3)-x		four instructions
These sequences are often smaller than a call to a multiply routine and
almost always faster than a general purpose multiply instruction.  Most
of the constants are powers of two, near powers of two (31, 255) or simple sums
of powers of two (6, 10, 24).

Some machines are better than others at this sort of thing so study
those timings carefully.  This is a winner on the 68k.
-- 
-Larry B Weber
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!larry
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043

kds@intelca.UUCP (Ken Shoemaker) (07/02/85)

> Remember that the RISC is necessarily decoding its opcodes; not even on
> a machine that simple is a numeric opcode a direct encoding of the internal
> control signals.  The point is that the current instruction encoding was
> chosen for simplicity, not compactness, and one can do better *without*
> compromising the principles of the machine.

I didn't say that it didn't decode them...My understanding is that the
various bits of the instructions always come into the processor on the
same set of data lines, which eliminates the need for some other unit
out there to steer the bits from the appropriate data lines to the
correct decoding unit.
-- 
...and I'm sure it wouldn't interest anybody outside of a small circle
of friends...

Ken Shoemaker, Microprocessor Design for a large, Silicon Valley firm

{pur-ee,hplabs,amd,scgvaxd,dual,qantel}!intelca!kds
	
---the above views are personal.  They may not represent those of the
	employer of its submitter.

jeff@gatech.CSNET (Jeff Lee) (07/04/85)

> > Saving the status register is an infrequent operation; why not do
> > multiple conditional branches instead?  Virtually no application
> > programs need it at all.  The operating system just needs it to switch
> > contexts, which is a task done infrequently, and takes so much time as
> > it is... :-)
> 
> It's not as funny as you think!  The old UNIVAC 1107 did just exactly that
> to save the status of the carry and overflow flags.  And in order to
> restore them, it had to actually do an arithmetic sequence that would
> force the flags to have the right values.  There were *no* instructions to
> directly save or restore these flags!

And then (moving slightly off the subject) there was the code that was
needed to save all your registers on a CDC-6000 type machine. It had an
interesting architecture where when you loaded register A1-A5 with a number
the corresponding X1-X5 received the contents of that memory location. When
you loaded registers A6-A7 with a number the contents of X6-X7 was stored in
that location. Registers A0 and X0 were not associated like this. In order
to store a register, you had to change an A-register value.

As I remember it, the code used one of the index registers (B-registers) and
performed a series of 18 conditional Return-Jumps based on the sign of the
chosen B-register and shifting it 1 bit to the left at a time. Then it could
nuke that B-register and save A6 or A7 in it and start storing values. It then
could reconstruct the killed B-register by tracing through the memory locations
used as targets for the return jumps. Can you say CONVOLUTED boys and girls??
Needless to say, the software designers decided that if YOU want a register
then YOU better save it because they aren't about to save them all.

	Enjoy,
-- 
Jeff Lee
CSNet:	Jeff @ GATech		ARPA:	Jeff%GATech.CSNet @ CSNet-Relay.ARPA
uucp:	...!{akgua,allegra,hplabs,ihnp4,linus,seismo,ulysses}!gatech!jeff

johnl@ima.UUCP (07/07/85)

Having a hardware multiplication instruction isn't as much of a win as
you might think.  On everybody's favorite chip, the 8088, a 16 x 16
multiply takes about 115 cycles, while shifts and adds are 2 and 3 cycles
respectively.  This means that for practically any constant multiplier
you'll get faster code by constructing your multiply from shifts, adds,
and subtracts.

Even on other higher powered machines a multiply is still typically 10
times slower than an add or shift, so that it's still usually faster to
do the addition chain for small multipliers such as the ones you usually
run into when doing subscripting.

John Levine, ima!johnl

gottlieb@cmcl2.UUCP (Allan Gottlieb) (07/09/85)

Yes, the 6600 was a poor architecture for saving all the registers,
but it did multiply fast :-).  We are actually quite fond of the
convoluted code sequence that you mentioned, since it was discovered
here when we had an early timesharing system for the 6600.

Allan Gottlieb
GOTTLIEB@NYU
{floyd,ihnp4}!cmcl2!gottlieb   <---the character before the 2 is an el

franka@mmintl.UUCP (Frank Adams) (07/09/85)

In article <149@mips.UUCP> mash@mips.UUCP (John Mashey) writes,
concerning integer multiplies:
>the real
>question is: how frequent are these cases, really?  [Not an answer, but
>a question whose answer needs to be known when you're making tradeoffs
>in CPU design.]

Actually, this still isn't quite the right question.  The right question
is how common *should* these cases be?  By which I mean, assuming good
programming techniques.  I have seen too many structure definitions and
arrays which were padded or shrunk to make the size a power of two.
Programmers shouldn't (and shouldn't have to) worry about such things!

Analysis of current programming practices reflects the reactions to
current (and past) architectures, and tends to propagate mistakes.

All right, I will admit that such analysis is better than no analysis
at all.  But be aware of its limitations.

richardt@orstcs.UUCP (richardt) (07/10/85)

As I read, I see a *lot* of comments about how "most 68xxx & 32xxx 
instructions take 32 bits anyway!"  I would like to suggest that a large
factor in this may be due to *sloppy compilers*!  I have written
68000 code as a hobbyist for about the last three years.  I have started
writing BASIC interpreters several times and a FORTH-relative once.  In
all of those cases, I found that the average instruction width was
  # of bits of code / # of instructions = ~18 bits
That's a rough estimate; it may be a little high.  When you add in data,
then things get interesting.  A program that uses a lot of predetermined
data runs up that average a lot more than you think.  When I sat down
and started to write op system routines, the average instruction was
about 20 bits.  (Note: I am using a mathematical average here.)
My point is, if most of your programs have larger average instruction
widths, maybe you've got a sloppily-written compiler.  The average
instruction width on a 'thousand needn't be anywhere near 32 bits.
	So look at your compiler before you go running off screaming
	'Give me a RISC.'
I'll step down off my soapbox now, and someone else can defend the National
chips.  I don't have the info (yet.)
----------------------------------------
Is there an assembly-language programmer in the house?
						orstcs!richardt
/* ---------- */

mat@amdahl.UUCP (Mike Taylor) (07/12/85)

> 
> Having a hardware multiplication instruction isn't as much of a win as
> you might think.  On everybody's favorite chip, the 8088, a 16 x 16
> multiply takes about 115 cycles, while shifts and adds are 2 and 3 cycles
> respectively.  This means that for practically any constant multiplier
> you'll get faster code by constructing your multiply from shifts, adds,
> and subtracts.
> 
Unless the multiplier is very wide and smart enough to do things like sum
partial products in parallel with each multiplication.  This gives speed
you couldn't get with the above.  However, it sure is not obvious what
the right thing to do is.
-- 
Mike Taylor                        ...!{ihnp4,hplabs,amd,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

peterb@pbear.UUCP (07/13/85)

/* Written  9:12 pm  Jul 11, 1985 by amdahl!mat in pbear:net.arch */
>>
>> Having a hardware multiplication instruction isn't as much of a win as
>> you might think.  On everybody's favorite chip, the 8088, a 16 x 16
>> multiply takes about 115 cycles, while shifts and adds are 2 and 3 cycles
>> respectively.  This means that for practically any constant multiplier
>> you'll get faster code by constructing your multiply from shifts, adds,
>> and subtracts.
>>
>Unless the multiplier is very wide and smart enough to do things like sum
>partial products in parallel with each multiplication.  This gives speed
>you couldn't get with the above.  However, it sure is not obvious what
>the right thing to do is.
>--
>Mike Taylor                        ...!{ihnp4,hplabs,amd,sun}!amdahl!mat
/* End of text from pbear:net.arch */

From the timing, one can see that since an 8-bit multiply can vary in cycles
from 70 - 77 and a 16-bit multiply takes from 118 - 133, the difference
in timing is 7 and 15 respectively.  This is a byproduct of the shift/add
sequences used to do the multiply, and it is slower than need be.  A 26116 can
do a 16 by 16 multiply in about 19 clock cycles; that's about 1.9 microseconds
for a 100ns cycle time.

Why it is so slow I don't know.  Could anybody explain this???

Peter Barada
{ihnp4!inmet|{harvard|cca}!ima}!pbear!peterb

henry@utzoo.UUCP (Henry Spencer) (07/15/85)

> As I read, I see a *lot* of comments about how "most 68xxx & 32xxx 
> instructions take 32 bits anyway!"  I would like to suggest that a large
> factor in this may be due to *sloppy compilers*!  ...
> ...
> Is there an assembly-language programmer in the house?

Speaking as the one who started this particular line of discussion, the
numbers I originally quoted were from people working in assembler, doing
implementations of the same little interpretive-language kernel on various
different machines.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

jeff@gatech.CSNET (Jeff Lee) (07/16/85)

> > 
> > Having a hardware multiplication instruction isn't as much of a win as
> > you might think.  On everybody's favorite chip, the 8088, a 16 x 16
> > multiply takes about 115 cycles, while shifts and adds are 2 and 3 cycles
> > respectively.  This means that for practically any constant multiplier
> > you'll get faster code by constructing your multiply from shifts, adds,
> > and subtracts.

Most engineers (I am NOT an engineer) would disagree with the statement
that having hardware multiplication is not a big win.

> Unless the multiplier is very wide and smart enough to do things like sum
> partial products in parallel with each multiplication.  This gives speed
> you couldn't get with the above.  However, it sure is not obvious what
> the right thing to do is.

I remember looking over the algorithms for the Cyber 170/750 hardware
(or was it a 170/755; those old machines).  The machine's central clock was
on the order of 50ns.  It performed weird splits of the bits to produce the
intermediate products and performed a heap of adding in parallel.  The result
was that the first stage of the pipeline (I think it was a 2 stage job)
could accept a pair of operands (48-bit mantissa, 11-bit exponent, 1-bit sign)
every 2 cycles.

What sort of algorithm does Cray now use for his Cray-II machine? Does
he get his multiplies done in a single clock cycle or are they still
pipelined? A friend of mine (previously an electrical engineer, but turned
computer science) took a class in hardware algorithms. It looked pretty
interesting but also not your typical "seat-of-the-pants" type algorithms.
Quite a bit of thought went into the magic....
-- 
Jeff Lee
CSNet:	Jeff @ GATech		ARPA:	Jeff%GATech.CSNet @ CSNet-Relay.ARPA
uucp:	...!{akgua,allegra,hplabs,ihnp4,linus,seismo,ulysses}!gatech!jeff

franka@mmintl.UUCP (Frank Adams) (07/16/85)

With all the discussion about multiplies, let me express one of my pet
peeves, which is that integer multiplies are hardly ever (never, in my
experience) defined in a manner suitable for their most common use.
I would guess, conservatively, that well over 90% of the integer multiplies
that are done generate a result with no more bits than the larger of the
two operands.  That is, one multiplies two words and wants a word result.
But hardware multiplies invariably generate a two word result, leaving the
high-order word to be allowed for and/or disposed of.  This is a problem
equally for compilers and hand-written assembler code.

The same problem occurs for division.

Incidentally, re the subscript computation problem, why not a command which
multiplies a word by a one-byte constant?  This would deal cleanly with
most two-dimensional arrays and arrays of structures, yet should be
implementable as a reasonably efficient instruction.

nather@utastro.UUCP (Ed Nather) (07/18/85)

> I would guess, conservatively, that well over 90% of the integer multiplies
> that are done generate a result with no more bits than the larger of the
> two operands.

I didn't know that over 90% of multiplication operations were multiplying
something by 1.  Wow.  You learn something new on the net every day!

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather%utastro.UTEXAS@ut-sally.ARPA

mahar@weitek.UUCP (mahar) (07/18/85)

In article <493@mmintl.UUCP>, franka@mmintl.UUCP (Frank Adams) writes:
> Which is that integer multiplies are hardly ever (never, in my
> experience) defined in a manner suitable for their most common use.
> I would guess, conservatively, that well over 90% of the integer multiplies
> that are done generate a result with no more bits than the larger of the
> two operands.  That is, one multiplies two words and wants a word result.
> But hardware multiplies invariably generate a two word result, leaving the
> high-order word to be allowed for and/or disposed of.

The Decsystem-10 & 20 have an instruction IMUL which does exactly what you
want: it multiplies one 36-bit value by another and gives a 36-bit result.
They also have a divide that takes 36-bit operands and gives a 36-bit
quotient.

crandell@ut-sally.UUCP (Jim Crandell) (07/19/85)

> > I would guess, conservatively, that well over 90% of the integer multiplies
> > that are done generate a result with no more bits than the larger of the
> > two operands.
> 
> I didn't know that over 90% of multiplication operations were multiplying
> something by 1.  Wow.  You learn something new on the net every day!

They're not.  Actually, 45% multiply by 1, and another 45%+ multiply by 0.
-- 

    Jim Crandell, C. S. Dept., The University of Texas at Austin
               {ihnp4,seismo,ctvax}!ut-sally!crandell

jss@sjuvax.UUCP (J. Shapiro) (07/22/85)

> > ... one multiplies two words and wants a word result.
> > But hardware multiplies invariably generate a two word result, leaving the
> > high-order word to be allowed for and/or disposed of.
> 
> The Decsystem-10 & 20 have an instruction IMUL which does exactly what you
> want: it multiplies one 36-bit value by another and gives a 36-bit result.
> They also have a divide that takes 36-bit operands and gives a 36-bit
> quotient.

For comparison and elucidation, so do:

	VAX
	PDP-11
	National 32016 and family
	Motorola 68000, 68020, 6800, 6809, ...
	Zilog Z8000
	Z80
	Intel 8086, 80186, 82086

-----

Bell Laboratories is an Artificial Intelligence project of the UNIX (tm)
operating system... (borrowed from an unknown wit)

Jon Shapiro

franka@mmintl.UUCP (Frank Adams) (07/22/85)

[Me]
>> I would guess, conservatively, that well over 90% of the integer multiplies
>> that are done generate a result with no more bits than the larger of the
>> two operands.

[Ed Nather]
>I didn't know that over 90% of multiplication operations were multiplying
>something by 1.  Wow.  You learn something new on the net every day!

Are you being intentionally obtuse?  The number of bits in an operand is
the number of bits being used to store the operand, not the number of
significant bits in the operand.  Sheesh.

jans@mako.UUCP (Jan Steinman) (07/23/85)

In article <2397@ut-sally.UUCP> Jim Crandell writes, quoting:
>>>I would guess, conservatively, that well over 90% of the integer multiplies
>>>that are done generate a result with no more bits than the larger of the
						^^^^
>>>two operands.
>> 
>>I didn't know that over 90% of multiplication operations were multiplying
>>something by 1.  Wow.  You learn something new on the net every day!
>
>They're not.  Actually, 45% multiply by 1, and another 45%+ multiply by 0.
>
*** Sarcasm Mode Off ***

One would assume the original poster meant atomic units, such as machine
words, rather than (the obviously wrong) bits.  It should be stressed that
such a determination is essentially a run-time matter, so the additional
overhead of overflow detection MUST then be paid.
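That run-time check can be sketched in C (again assuming a 16-bit word;
real hardware would set an overflow flag rather than perform this explicit
test):

```c
typedef unsigned short word;     /* assume a 16-bit word */
typedef unsigned long  dword;    /* at least 32 bits */

/* Does a * b overflow a single word?  Form the double-length product
 * and see whether any high-order bits survive. */
int mul_overflows(word a, word b)
{
    dword product = (dword)a * (dword)b;
    return (product >> 16) != 0;
}
```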
-- 
:::::: Jan Steinman		Box 1000, MS 61-161	(w)503/685-2843 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::

louie@umd5.UUCP (07/23/85)

In article <1202@sjuvax.UUCP> jss@sjuvax.UUCP (J. Shapiro) writes:
>> > ... one multiplies two words and wants a word result.
>> > But hardware multiplies invariably generate a two word result, leaving the
>> > high-order word to be allowed for and/or disposed of.
>> The Decsystem-10 & 20 have an instruction IMUL which does exactly what you
>> want: it multiplies one 36-bit value by another and gives a 36-bit result.
>> They also have a divide that takes 36-bit operands and gives a 36-bit
>> quotient.
>
>For comparison and elucidation, so do:
>
>	VAX
>	PDP-11
>	National 32016 and family
>	Motorola 68000, 68020, 6800, 6809, ...
>	Zilog Z8000
>	Z80

Huh?  You think that a Z80 is clever enough to MULTIPLY?  Mine sure doesn't.
By the way, the Sperry 1100 series also has a Multiply Single Integer (MSI)
instruction that gives a 36-bit result from two 36-bit operands.  This seems
to get used much more often (at least in my code) than the Multiply Integer
(MI) instruction that gives a 72-bit result.

>	Intel 8086, 80186, 82086
>
>Jon Shapiro


-- 
Louis A. Mamakos WA3YMH   University of Maryland, Computer Science Center
 Internet: louie@umd5.arpa
 UUCP: {seismo!umcp-cs, ihnp4!rlgvax}!cvl!umd5!louie

darrell@sdcsvax.UUCP (Darrell Long) (07/23/85)

In article <233@weitek.UUCP> mahar@weitek.UUCP (mahar) writes:
>In article <493@mmintl.UUCP>, franka@mmintl.UUCP (Frank Adams) writes:
>> Which is that integer multiplies are hardly ever (never, in my
>> experience) defined in a manner suitable for their most common use.
>> I would guess, conservatively, that well over 90% of the integer multiplies
>> that are done generate a result with no more bits than the larger of the
>> two operands.  That is, one multiplies two words and wants a word result.
>> But hardware multiplies invariably generate a two word result, leaving the
>> high-order word to be allowed for and/or disposed of.
>
>The Decsystem-10 & 20 have an instruction IMUL which does exactly what you
>want: it multiplies one 36-bit value by another and gives a 36-bit result.
>They also have a divide that takes 36-bit operands and gives a 36-bit
>quotient.

This is also true for the VAX-11 and the AT&T 3B series.  If you want
a two word result then you must ask for it explicitly (if you can get
it at all).
-- 
Darrell Long
Department of Electrical Engineering and Computer Science
University of California, San Diego

USENET: sdcsvax!darrell
ARPA:   darrell@sdcsvax

doug@terak.UUCP (Doug Pardee) (07/23/85)

> But hardware multiplies invariably generate a two word result, leaving the
> high-order word to be allowed for and/or disposed of.  This is a problem
> equally for compilers and hand-written assembler code.
> 
> The same problem occurs for division.

National's NS320xx CPUs do byte*byte->byte, word*word->word, and
long*long->long multiplication (and division) as the normal cases.

They also have "extended" multiply and divide instructions with the
"conventional" double-length product/dividend, but these are a bit more
difficult to use because they are oriented toward extended-precision
arithmetic (for example, these are strictly unsigned operations).

> Incidentally, re the subscript computation problem, why not a command which
> multiplies a word by a one-byte constant? ...

The NS320xx instruction set includes the INDEX instruction, which 
multiplies a subscript value by a length and adds the result to a
register.  The length can be contained in a register or memory or
can be a constant, and can be 8, 16, or 32 bits long (unsigned).
The catch is that it must be the same length as the subscript.
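In C terms, the operation described above amounts to the following (a
sketch based on this description, not on National's manual; the width and
names are illustrative):

```c
typedef unsigned long reg_t;     /* illustrative register width */

/* INDEX as described: multiply the subscript by the length and add
 * the product into the accumulator register. */
reg_t index_op(reg_t accum, reg_t subscript, reg_t length)
{
    return accum + subscript * length;
}
```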
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
               ^^^^^--- soon to be CalComp

peterb@pbear.UUCP (07/24/85)

Jon,

The 8086 / 80186 do NOT produce same-width results from multiplication
and division.  AX is multiplied by the operand and the result is left in
DX:AX.  That is 32 bits, not the original 16 bits of the operands!
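A C model of the 16-bit case just described (registers as plain variables;
the names follow the 8086 convention):

```c
typedef unsigned short u16;
typedef unsigned long  u32;

/* 16-bit 8086 MUL as described: AX times the source operand; the
 * 32-bit product is split across DX (high word) and AX (low word). */
void mul16(u16 ax_in, u16 src, u16 *dx_out, u16 *ax_out)
{
    u32 product = (u32)ax_in * (u32)src;
    *dx_out = (u16)(product >> 16);
    *ax_out = (u16)(product & 0xFFFFUL);
}
```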

Peter Barada
{ihnp4!inmet|{harvard|cca}!ima}!pbear!peterb

tc@amd.UUCP (Tom Crawford) (07/24/85)

The SDS 9300 had an instruction called Twin Multiply.  It treated two
24-bit operands as pairs of 12-bit integers and performed two 
12 x 12 => 24 multiplies.

Who's SDS, you ask?  Scientific Data Systems, of course!
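A C model of Twin Multiply (a sketch from the description above, not from
SDS documentation):

```c
typedef unsigned long u24;       /* only the low 24 bits are used */

/* Treat each 24-bit operand as a pair of 12-bit integers and form
 * two independent 12 x 12 -> 24 products. */
void twin_multiply(u24 a, u24 b, u24 *upper_prod, u24 *lower_prod)
{
    *upper_prod = ((a >> 12) & 0xFFF) * ((b >> 12) & 0xFFF);
    *lower_prod = (a & 0xFFF) * (b & 0xFFF);
}
```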

			Tom Crawford
			...amd!tc

jer@peora.UUCP (J. Eric Roskos) (07/25/85)

> Incidentally, re the subscript computation problem, why not a command which
> multiplies a word by a one-byte constant?  This would deal cleanly with
> most two-dimensional arrays and arrays of structures, yet should be
> implementable as a reasonably efficient instruction.

Why not have instructions that contain small constants built into the
opcode?  What?  Never heard of that?  I guess it's because it's not a RISC...

Seriously, though, the above sorts of things are quite common in current
CISC research... in fact, there are some good papers (e.g., by Flynn) on
optimally encoding instructions; this is where the recently-much-maligned
ideas on having machines with different instruction sets for different
applications/compilers come from (whether or not different COMPILERS need
different instruction sets, or whether it's really the APPLICATION that
determines it, depends on who you talk to).  Actually it's much more
general than that, though, since you use an optimization scheme to minimize
the length of instructions for some desired set of programs; and a result
of this is that short constants come out encoded in the opcodes for some
common classes of programs.
-- 
Shyy-Anzr:  J. Eric Roskos
UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
US Mail:    MS 795; Perkin-Elmer SDC;
	    2486 Sand Lake Road, Orlando, FL 32809-7642

henry@utzoo.UUCP (Henry Spencer) (07/25/85)

> > The Decsystem-10 & 20 have an instruction IMUL which does exactly what you
> > want: it multiplies one 36-bit value by another and gives a 36-bit result.
> > They also have a divide that takes 36-bit operands and gives a 36-bit
> > quotient.
> 
> For comparison and elucidation, so do:
> 
> 	VAX
> 	PDP-11
> 	National 32016 and family
> 	Motorola 68000, 68020, 6800, 6809, ...
> 	Zilog Z8000
> 	Z80
> 	Intel 8086, 80186, 82086

From personal experience, I can assure you that the PDP-11 and the 6809
most definitely do not.  And the 6800 has no multiply instruction at all.
I don't think the Z80 does either.  The 68000 has 16x16->32, but neither
32x32->32 nor 16x16->16 that I remember.  And my (admittedly dim) memory
of the Z8000 is that its multiply instructions yield double-width results.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

franka@mmintl.UUCP (Frank Adams) (07/25/85)

In article <1202@sjuvax.UUCP> jss@sjuvax.UUCP (J. Shapiro) writes:
>> > ... one multiplies two words and wants a word result.
>> > But hardware multiplies invariably generate a two word result, leaving the
>> > high-order word to be allowed for and/or disposed of.
>> 
>> The Decsystem-10 & 20 have an instruction IMUL which does exactly what you
>> want.
>
>For comparison and elucidation, so do:
>
>	VAX
>	PDP-11
>	National 32016 and family
>	Motorola 68000, 68020, 6800, 6809, ...
>	Zilog Z8000
>	Z80
>	Intel 8086, 80186, 82086
>

All right, I shouldn't have said *invariably*.  The PDP-10 is a good
machine.  I am not sufficiently familiar with most of the others on this
list, but the 8086 does *not* have such an instruction.  You can multiply
two 8-bit quantities to get a 16-bit result, or two 16-bit quantities
to get a 32-bit result.  The MUL/IMUL distinction on the 8086 is for
signed vs unsigned operands.

mhs@enmasse.UUCP (Mike Schloss) (07/26/85)

> > The Decsystem-10 & 20 have an instruction IMUL which does exactly what you
> > want: it multiplies one 36-bit value by another and gives a 36-bit result.
> > They also have a divide that takes 36-bit operands and gives a 36-bit
> > quotient.
> 
> For comparison and elucidation, so do:
> 
> 	VAX
> 	PDP-11
> 	National 32016 and family
> 	Motorola 68000, 68020, 6800, 6809, ...
> 	Zilog Z8000
> 	Z80
> 	Intel 8086, 80186, 82086
> 
> 
> Jon Shapiro


I may be missing something, but my Motorola 68000 manual says nothing
about 32-bit multiplies, just 16*16 signed and unsigned.  The 68020 does
have 32*32, but I doubt any in the 6800 family do.

					Mike Schloss

jack@boring.UUCP (08/05/85)

In article <1202@sjuvax.UUCP> jss@sjuvax.UUCP (J. Shapiro) writes:
<Included text about a machine that multiplies and divides two N
 bit numbers, giving an N bit result>
>For comparison and elucidation, so do:
>
>	VAX
Dunno, could be.
>	PDP-11
Wrong.  Both MUL and DIV (if you don't own an 11/10, which doesn't
even have a DIV :-) use a register *pair* for a 32-bit quantity.
>	National 32016 and family
Right.
>	Motorola 68000, 68020
More-or-less right (though there's some tricky sign bit stuff, I 
remember).
>				6800
MUL? DIV???? WHoeaaaaaaa!!! You can be glad that it has an ADD and a
SUB!!
>					6809
Wrong. 8x8->16 multiply, and no DIV, I believe.
>	Zilog Z8000
Dunno.
>	Z80
Cannot DIV or MUL.
>	Intel 8086, 80186, 82086
Dunno.

Well, that kind of invalidates your point, I believe.
-- 
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.