hascall@atanasoff.cs.iastate.edu (John Hascall) (01/31/89)
In response to someone proposing adding instructions to a RISC machine, I wrote:

>> Here it is again, adding instructions to a RISC machine... won't
>> be long before we have a RISC machine with more instructions
>> than a VAX.... :-)

And was "corrected" by someone* thusly:

> And again...
> Sigh, RISC doesn't mean a small number of instructions. RISC means....

REDUCED Instruction Set Computer (i.e., a reduced number of instructions)

True, many RISC machines incorporate a number of other features which,
because they have been used by a number of RISC machines, have come to be
considered a part of RISC--but there is no reason these features could not
be part of a CISC machine (other than chip real-estate).

I think the real problem here is a poorly named acronym, but it probably
sounded "cute" (I, for one, am quite tired of papers titled "A RISCy blah
blah blah" etc).

Perhaps we could have a new buzzword contest; how about SOC (simple,
orthogonal computer)?

My $.02 (or less) worth,

John Hascall
ISU Comp Center

* My apologies for losing the attribution above, but rn barfed on the
  overly long "References:" field and I had to do this by hand.
diamond@csl.sony.JUNET (Norman Diamond) (01/31/89)
In article <747@atanasoff.cs.iastate.edu>, hascall@atanasoff.cs.iastate.edu (John Hascall) writes:
> > Sigh, RISC doesn't mean a small number of instructions. RISC means....
> REDUCED Instruction Set Computer (i.e., a reduced number of instructions)

Maybe a reduced number of KINDS of instructions.

If you have an add instruction for a byte and another for a word ... if you
have an add instruction for signed and another for unsigned ... do you
think these are ciscy?

Having an add instruction for a little-endian word and another for a
big-endian word strikes me as a little silly (maybe a big silly :-), but
still riscy.

Incidentally, wouldn't little-beginnian and big-beginnian be more accurate?
-- 
Norman Diamond, Sony Computer Science Lab (diamond%csl.sony.jp@relay.cs.net)
The above opinions are my own.  | Why are programmers criticized for
If they're also your opinions,  | re-inventing the wheel, when car
you're infringing my copyright. | manufacturers are praised for it?
colwell@mfci.UUCP (Robert Colwell) (01/31/89)
In article <747@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu (John Hascall) writes:
>  In response to someone proposing adding instructions to a RISC
>  machine, I wrote:
>
>>> Here it is again, adding instructions to a RISC machine... won't
>>> be long before we have a RISC machine with more instructions
>>> than a VAX.... :-)
>
>  And was "corrected" by someone* thusly:
>
>> And again...
>> Sigh, RISC doesn't mean a small number of instructions. RISC means....
>
>  REDUCED Instruction Set Computer (i.e., a reduced number of instructions)

No, that's what the *acronym* stands for.  That acronym was not invented by
the folks who started the RISC effort (IBM); it came from Berkeley.  Which
doesn't by itself make it invalid (no, I really mean that :-)), but it does
cast doubt on appeals to authority in this case.  And early on, it fit the
concept a lot better than it does now.

I gave a talk at MIT a couple of months ago arguing, I hope successfully,
that our VLIW embodies almost all of the major RISC concepts.  The one
thing it does not have is "few" instructions: 2**1024 is a big number.
But, as Brian said, the intrinsic complexity of an instruction isn't the
source of evil; it's the cost of implementing it vs. the performance cost
of not having it vs. the compiler cost of each.  We implement that large
number of instructions by having many functional units, each with its own
fully decoded instruction residing in its own ICache slice, so what you may
have thought would be the large runtime cost of decoding an instruction
word that big is actually negligible.

Besides, "reduced" can be taken to mean "reduced in number" or "reduced in
complexity".  I think it's high time to leave the obsolete notion of
numbers of instructions as a measure of "RISCness" and move to the more
appropriate "function/implementation level".  If you move a lot of
functionality into runtime hardware, you are following in the tradition of
CISC machines.  If you are maximizing performance by moving as much as you
can into compile time, thus minimizing the machine's cycle time, you are
probably building a RISC.  The days of counting instructions and drawing
some legitimate conclusion are long gone.

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
175 N. Main St.
Branford, CT 06405     203-488-6090
rodman@mfci.UUCP (Paul Rodman) (02/01/89)
In article <747@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu (John Hascall) writes:
>  In response to someone proposing adding instructions to a RISC
>  machine, I wrote:
>
>>> Here it is again, adding instructions to a RISC machine... won't
>>> be long before we have a RISC machine with more instructions
>>> than a VAX.... :-)
>
>  And was "corrected" by someone* thusly:
>
>> And again...
>> Sigh, RISC doesn't mean a small number of instructions. RISC means....
>
>  REDUCED Instruction Set Computer (i.e., a reduced number of instructions)

At the risk of starting more pointless RISC/CISC flameage, let me add my 2
cents worth here (I know many of you out there won't agree....:-):

The term RISC has been terribly misused, but my personal definition has
been widened to include machines that don't have a "small" number of
instructions.  E.g. the Multiflow Trace (which I am using to compose this
mail) has a VERY large space of possible instructions.  I would still term
this machine RISCy, as each functional unit is controlled directly by the
instruction word, and is decoupled from instruction packets that are wired
to other functional units.  Hence the original purpose of the RISC idea is
served.

Conventional RISCs are designed to approach 1 "op" per cycle.  We designed
a multiple-functional-unit machine that executes >1 ops/cycle.  The VLIW
compiler is considerably "smarter" than a typical RISC compiler, and the
compiler <-> hardware fusion is even more important than for a simple RISC,
but the basic mind set is still the same.

Paul K. Rodman
rodman@mfci.uucp
aglew@mcdurb.Urbana.Gould.COM (02/01/89)
> I think the real problem here is a poorly named acronym, but it
> probably sounded "cute" (I, for one, am quite tired of papers
> titled "A RISCy blah blah blah" etc).
>
> Perhaps we could have a new buzzword contest, how about SOC (simple,
> orthogonal computer)?
>
> ISU Comp Center

RAMM  = Reduced Addressing Mode Machine
SEISM = Small, Efficient, Instruction Set Machine
alan@pdn.nm.paradyne.com (Alan Lovejoy) (02/02/89)
In article <10030@diamond.csl.sony.JUNET> diamond@csl.sony.JUNET (Norman Diamond) writes:
>In article <747@atanasoff.cs.iastate.edu>, hascall@atanasoff.cs.iastate.edu (John Hascall) writes:
>> > Sigh, RISC doesn't mean a small number of instructions. RISC means....
>Maybe reduced number of KINDS of instructions.
>
>Having an add instruction for a little-endian word and another for a
>big-endian word strikes me as a little silly (maybe a big silly :-),
>but still riscy.

RISC should stand for "Reduced Instruction Set Complexity".  This means
minimizing the number of different instruction formats, having only one
instruction size (e.g., 32 bits), eliminating instructions that can't be
performed in one cycle with a reasonable amount of hardware or without
making the cycle time considerably longer than is needed for most
instructions, and a host of other things less well characterized by the
nominal semantics of R.I.S.C. (caches, pipelines, parallelism...).

Whether an architecture is little-endian or big-endian depends upon how it
maps the byte addresses (well, on a byte-addressed machine, anyway) of the
bytes in a word or longword to the arithmetic significance of those bytes.
Once a word or longword has been fetched from memory into a register, the
bytes of that word or longword no longer have addresses, so the endianness
becomes undefined.  Since traditional RISC architectures only allow load
and store operations to reference memory, all other operations must be
carried out on values in registers.  So there would be no need to have
big-endian and little-endian versions of "ADD" or any other operation
besides LOAD and STORE.

-- 
Alan Lovejoy; alan@pdn; 813-530-2211; ATT-Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for ATT-Paradyne.  They do not speak for me.
This Month's Slogan: Reach out and BUY someone (tm).
Motto: If nanomachines will be able to reconstruct you, YOU AREN'T DEAD YET.
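[Moderator's aside: Alan's point -- that endianness only exists at the LOAD,
and evaporates once the value is in a register -- can be sketched in a few
lines.  The 4-byte word size and the function names here are illustrative
assumptions, not taken from any machine discussed above.]

```python
def load_word_le(mem, addr):
    """Little-endian load: the byte at the lowest address is least significant."""
    return sum(mem[addr + i] << (8 * i) for i in range(4))

def load_word_be(mem, addr):
    """Big-endian load: the byte at the lowest address is most significant."""
    return sum(mem[addr + i] << (8 * (3 - i)) for i in range(4))

mem = [0x12, 0x34, 0x56, 0x78]
le = load_word_le(mem, 0)   # 0x78563412
be = load_word_be(mem, 0)   # 0x12345678

# Once the value is "in a register", ADD neither knows nor cares which
# load produced it -- so only LOAD and STORE need endian variants:
assert le + 1 == 0x78563413
assert be + 1 == 0x12345679
```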
pauls@Apple.COM (Paul Sweazey) (02/03/89)
How about a bit in the page tables that indicates the endian-ness of data accesses through that map? Seems like the right place to me (and I didn't think it up either). Paul Sweazey pauls@apple.com
keithe@tekgvs.LABS.TEK.COM (Keith Ericson) (02/04/89)
> > Sigh, RISC doesn't mean a small number of instructions. RISC means..

Seems to me that the "reduced" is a totally incorrect moiniker (sp?): the
truly salient point is that all the instructions are equal length, to
reduce problems maintaining the instruction pipeline.

>Incidentally, wouldn't little-beginnian and big-beginnian be more
>accurate?

Yeah, you're right.  Guess it depends on which eye you look out of. :-)

kEITH
rpw3@amdcad.AMD.COM (Rob Warnock) (02/07/89)
+---------------
| > Incidentally, wouldn't little-beginnian and big-beginnian be more accurate?
| Yeah, you're right. Guess it depends on which eye you look out of. :-)
+---------------

Well, the names "Big-Endian" and "Little-Endian" were borrowed by Danny
Cohen (in his classic paper "On Holy Wars And A Plea For Peace") from
peoples of the same names in one of Jonathan Swift's stories.  Seems there
was this "holy war" between folks who liked to eat their soft-boiled eggs
from the "big end" and those who preferred to eat from the "little end"...

So "accuracy" here should probably yield to classic usage, eh?  ;-}

Rob Warnock
Systems Architecture Consultant
UUCP:     {amdcad,fortune,sun}!redwood!rpw3
ATTmail:  !rpw3
DDD:      (415)572-2607
USPS:     627 26th Ave, San Mateo, CA 94403
jk3k+@andrew.cmu.edu (Joe Keane) (02/08/89)
Keith Ericson writes:
> Seems to me that the "reduced" is a totally incorrect moiniker (sp?): the
> truly salient point is that all the instructions are equal length, to reduce
> problems maintaining the instruction pipeline.

As much as i dislike VAX instruction encoding, i can't agree with this.
Single-size instructions are nice, but you'll pay a price in code density.
The RT has two instruction sizes, and i think it was the right choice.
dennis@gpu.utcs.toronto.edu (Dennis Ferguson) (02/08/89)
In article <cXvqKRy00Wo=0TV282@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
>Single-size instructions are nice, but you'll pay a price in code density. The
>RT has two instruction sizes, and i think it was the right choice.

Except that, because they had to encode both the opcode and a couple of
registers into 16-bit instructions, the RT ended up with only 16 registers.
There just isn't enough room for more registers if you have to accommodate
the entire instruction set in 16 bits.

Personally, I think that was the wrong choice.  I think I'd rather have
longer instructions and more registers, thanks.

Dennis Ferguson
University of Toronto
firth@sei.cmu.edu (Robert Firth) (02/08/89)
Keith Ericson writes:
> Seems to me that the "reduced" is a totally incorrect moiniker (sp?): the
> truly salient point is that all the instructions are equal length, to reduce
> problems maintaining the instruction pipeline.

In article <cXvqKRy00Wo=0TV282@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
>As much as i dislike VAX instruction encoding, i can't agree with this.
>Single-size instructions are nice, but you'll pay a price in code density. The
>RT has two instruction sizes, and i think it was the right choice.

This issue has been argued quite vigorously in the DoD RISC program, and
I'd like to offer an unobjective opinion.

It seems pretty clear that instruction density can be improved by having
more than one length.  The GE design used two lengths (16 and 32) and
claimed a 15% improvement in instruction density as a result.  This seems
reasonable to me.  However, set against that the following:

. more complicated instruction decode
. more complicated pipeline management
. more complicated Icache design
. loss of one bit in span of relative branch or call

If the goal is pure speed, the question I ask is: are you better off with
15% more bytes of instructions and a bigger Icache?  If you can raise the
hit rate from 93% to 94%, you have offset the difference in instruction
size (you fetch 6% rather than 7%, for a reduction of 14% in instructions
fetched).  Moreover, very little new logic is involved, just more of the
same.

My view (for what it's worth) is that with CURRENT technology it is better
to have all instructions the same length.  However, you do need a big
Icache (as I think the evolution of the Mips Inc machines demonstrates).
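[Moderator's aside: Robert's 93%-vs-94% arithmetic checks out under a toy
model.  The assumption -- that refill traffic is proportional to code
footprint times miss rate -- is his; the figures below are just his numbers
plugged in.]

```python
def miss_traffic(code_bytes, hit_rate):
    """Relative bytes refilled from memory, in a crude model where refill
    traffic is proportional to code footprint times the Icache miss rate."""
    return code_bytes * (1.0 - hit_rate)

dense = miss_traffic(1.00, 0.93)   # variable-length: smaller code, 93% hits
fixed = miss_traffic(1.15, 0.94)   # fixed-length: 15% more bytes, 94% hits

# Cutting misses from 7% to 6% (a 14% reduction) just offsets the 15%
# size penalty, so the fixed-length machine comes out very slightly ahead:
assert fixed < dense
```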
rodman@mfci.UUCP (Paul Rodman) (02/09/89)
In article <cXvqKRy00Wo=0TV282@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
>Keith Ericson writes:
>> Seems to me that the "reduced" is a totally incorrect moiniker (sp?): the
>> truly salient point is that all the instructions are equal length, to reduce
>> problems maintaining the instruction pipeline.
>
>As much as i dislike VAX instruction encoding, i can't agree with this.
>Single-size instructions are nice, but you'll pay a price in code density. The
>RT has two instruction sizes, and i think it was the right choice.

Well, maybe it was the right choice.  Code density is one of the LEAST
important aspects of instruction set design if performance is one of your
goals.

When I was a grad student at CMU in 1978 we actually studied several
existing architectures with an ISP simulator.  One of the metrics used to
"rate" the architectures was the static code size.  A better one was the
"dynamic" code size, which would correspond to instruction cache fetches.
None of these metrics really tried to quantify the things that make
pipelining an instruction set difficult.

Trading static code size for speed in execution is a tradeoff that most
folks would love.  Who cares what the size of the text is when the data is
100 Mb, anyway?  The important thing is getting control bits to the
functional units.

Some machines try to get the best of both worlds.  The Multiflow Trace, for
example, uses a very long instruction word to potentially execute many
operations per cycle.  However, we don't want to drag around 1024 bits for
EVERY instruction word, or static code size would be a problem.  So we have
a "mask" word in memory with the instruction "packets" that tells us which
packets were unused, and these packets are loaded to a zero (nop) at cmiss
time.  By processing the mask word at cmiss time we take something that
would normally be a first-order effect (i.e. cycle time) and move it to
second order (cmiss time).
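[Moderator's aside: the mask-word trick can be sketched as below.  The
8-slot width, the mask encoding, and the zero-means-nop convention are
guesses for illustration, not the Trace's actual format.]

```python
NOP = 0   # assume an all-zero packet decodes as a no-op

def expand_line(mask, packed, slots=8):
    """Rebuild one full-width instruction line at cache-miss time: bit i of
    the mask says slot i has a real packet in the compressed memory image;
    clear bits get a NOP, so NOPs cost no space in main memory."""
    src = iter(packed)
    return [next(src) if (mask >> i) & 1 else NOP for i in range(slots)]

# The compressed image holds only the two live packets;
# the full 8-slot line is reconstituted on refill:
line = expand_line(0b00000101, ["iadd", "fmul"])
assert line == ["iadd", NOP, "fmul", NOP, NOP, NOP, NOP, NOP]
```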
You could even do tricks like this at page-fault time (3rd order).  I
believe a company called Computer Consoles built a vax clone that
translated the instructions at cmiss time.

Personally, I think the vax has about the worst possible architecture one
could come up with (assuming you aren't PURPOSEFULLY trying to create a bad
machine :-).  It was great in the days when control store proms were 50ns
and main memory was 1us.  Today, cache rams are the same speed as the
control store rams (and the control store better be rams, because the
microcode is so complex that you have to be able to fix it).  Byte-aligned
instructions make the hardware more difficult and buy you nothing.  So many
instructions that serve no purpose, etc, etc.  Blech, what a mess.  And so
many minds worked *hard* to create it!  Ha!

Paul Rodman
rodman@mfci.uucp

Who says engineering is a science?  Engineering is an Art, and don't ever
forget it!
bla@hpcupt1.HP.COM (Brad Ahlf) (02/09/89)
>> Sigh, RISC doesn't mean a small number of instructions. RISC means..
Instead of RISC, how about RCC:
Reduced Complexity Computers
I have seen this discussed and coined in some of the analyst papers like
Gartner Group's (I think) discussion of current 'RISC' computers. I like
this term and think it is both easier to grasp and a better expression of
the current successful implementations.
jk3k+@andrew.cmu.edu (Joe Keane) (02/09/89)
Dennis Ferguson writes:
> Except that, because they had to encode both the OP code and a couple of
> registers into 16 bit instructions, the RT ended up with only 16 registers.
> There just isn't enough room for more registers if you have to accomodate the
> entire instruction set in 16 bits.

I agree register sets are limited by instruction coding, but i don't think
that different-sized instructions is part of the problem.  Most of the RT's
32-bit instructions consist of an 8-bit opcode, two 4-bit register
specifiers, and a 16-bit immediate field.  You can't add more registers
even if you ignore 16-bit instructions.

The PDP-11 has single-size instructions (plus immediate words, does this
count?) but only 3-bit register specifiers.  Can't fix this either.  The
R2000 has big single-size instructions, but still only a 5-bit register
specifier.  32 registers is nothing to get excited about.

What can we do about this?  More on this later...
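[Moderator's aside: the bit-budget arithmetic behind these register-count
limits, using the field widths quoted in the thread:]

```python
# RT 32-bit form: 8-bit opcode + two 4-bit register fields + 16-bit immediate
assert 8 + 2 * 4 + 16 == 32    # the budget is exactly spent -- no spare bits

# An n-bit register specifier names at most 2**n registers:
assert 2 ** 4 == 16            # RT's limit
assert 2 ** 3 == 8             # PDP-11
assert 2 ** 5 == 32            # R2000
```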
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (02/10/89)
In article <6310013@hpcupt1.HP.COM> bla@hpcupt1.HP.COM (Brad Ahlf) writes:
>>> Sigh, RISC doesn't mean a small number of instructions. RISC means..
>
>Instead of RISC, how about RCC:
>
>Reduced Complexity Computers

How about FIFC?

Fixed Instruction Format Computer

"To avoid these [described previously] problems two criteria were used in
the design of the VAX-11 instruction format: (1) all instructions should
have the 'natural' number of operands, and (2) all operands should have the
same generality in specification.  These criteria led to a highly variable
instruction format. ..."

	- W. D. Strecker, "VAX 11/780" (reprinted in Siewiorek, Bell, and
	  Newell).

Now, as history has shown, it was exactly that highly variable format, with
a variable number of operands, and addressing-mode encoding requiring
separate decoding, which has made such instruction sets hard to pipeline,
and thus harder to build high-performance machines with.  (And thus the
cause of more problems than those solved.)  The secret to success for all
the RISC machines that I know of, from the CDC 6600 on, has been the choice
of simple, fixed, easy-to-decode instruction formats.

As has been pointed out repeatedly on this group, the NUMBER of
instructions is really a second-order determinant of performance.  (On the
other hand, including a lot of special hardware for functions which are
never exercised certainly is a first-order determinant.)  So, while we are
proposing, I propose turning Strecker's phrase around: the opposite of
"highly variable" is "fixed", and it also happens to be one of the few
common denominators of all "RISC" machines.

As an aside, I note, as previously, that the VAX was very successful at
meeting another mid-70's problem: keeping object code size small.  I notice
that, recently, some results from the GNU compiler on some RISCs have been
a challenge to the object code size of VAX code.  Obviously, GNU is doing
something interesting there.  Does anyone know why the code from the GNU C
compiler appears to be significantly more compact than other compilers on
the RISC machines?

Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035    Phone: (415)694-6117
lexw@idca.tds.PHILIPS.nl (A.H.L. Wassenberg) (02/10/89)
In article <21606@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>
> How about FIFC?
>
> Fixed Instruction Format Computer
>

It feemf to me af if you had a fpeech-defect!  :-)

Lex Wassenberg
Philips Telecommunication & Data Systems B.V.
Apeldoorn, The Netherlands
Internet: lexw@idca.tds.philips.nl
UUCP: ..!mcvax!philapd!lexw
jesup@cbmvax.UUCP (Randell Jesup) (02/14/89)
In article <8476@aw.sei.cmu.edu> firth@bd.sei.cmu.edu (Robert Firth) writes:
>In article <cXvqKRy00Wo=0TV282@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
>>As much as i dislike VAX instruction encoding, i can't agree with this.
>>Single-size instructions are nice, but you'll pay a price in code density. The
>>RT has two instruction sizes, and i think it was the right choice.
>
>This issue has been argued quite vigorously in the DoD RISC program,
>and I'd like to offer an unobjective opinion.
>
>It seems pretty clear that instruction density can be improved by having
>more than one length.  The GE design used two lengths (16 and 32) and
>claimed a 15% improvement in instruction density as a result.  This seems
>reasonable to me.

Not quite.  The GE RPM-40 had one instruction size: 16 bits.  One of the
instructions was 'prefix' (PFX), which supplied 12 bits of immediate for
use in the next instruction (most instructions using immediates could use 4
bits of immediate directly, except things like branch, which used 12).  You
could put multiple PFX instructions before a regular one to build up a
32-bit immediate.  Most constants (I think the number was 90%) fall in 4
bits, and close to 99% fall in 16 (1 PFX instruction).  The other
disadvantage of 16-bit instructions is two-operand instructions vs
3-operand (though many times three operands aren't needed).  The advantage
of 16-bit instructions is memory bandwidth.

>However, set against that the following
>
>. more complicated instruction decode

Very slightly: since PFX is an instruction, it just routes the result to
the immediate-value register.  The most complex part of this (not very) is
shifting the value in the IVR over when a PFX is executed (easy because
it's a fixed shift).

>. more complicated pipeline management

All the reorganizer has to do is keep PFXs with their associated
instructions.  It does add a small amount of complexity, but not much.
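[Moderator's aside: the PFX mechanism Randell describes can be sketched as
below.  The exact shift widths and the final-masking behavior are one
reading of his post, not RPM-40 documentation.]

```python
IMM_BITS = 4    # immediate bits carried directly by an ordinary instruction
PFX_BITS = 12   # immediate bits supplied by each PFX instruction

def pfx(ivr, value):
    """Each PFX shifts the immediate-value register and ORs in 12 more bits."""
    return ((ivr << PFX_BITS) | (value & 0xFFF)) & 0xFFFFFFFF

def consume(ivr, inst_imm):
    """The consuming instruction contributes its own 4 low bits; the IVR
    is then conceptually cleared for the next instruction."""
    return ((ivr << IMM_BITS) | (inst_imm & 0xF)) & 0xFFFFFFFF

# Build the constant 0x1234567 with two PFXs plus the instruction itself:
ivr = pfx(0, 0x123)
ivr = pfx(ivr, 0x456)
assert consume(ivr, 0x7) == 0x1234567

# A 4-bit constant -- the ~90% common case -- needs no PFX at all:
assert consume(0, 0xA) == 0xA
```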
>. more complicated Icache design

On this you're wrong, since PFX is just another instruction.

>. loss of one bit in span of relative branch or call

Also wrong, since a branch doesn't need to indicate whether a following
word is part of the instruction: it just takes the IVR and masks in the
rest of the immediate from the branch.  For one PFX, you are limited to 24
bits of relative addressing.  That's usually enough.  If it isn't, use two
PFXs for 32 bits relative.

>If the goal is pure speed, the question I ask is: are you better off
>with 15% more bytes of instructions and a bigger Icache?  If you can
>raise the hit rate from 93% to 94% you have offset the difference in
>instruction size (you fetch 6% rather than 7% for a reduction of 14%
>in instructions fetched).  Moreover, very little new logic is involved,
>just more of the same.

I say there is no difference in Icache complexity due to 16-bit
instructions.  Therefore, you should get twice as many instructions into
it, and if the 15% figure is true, then you should get effectively 15% more
done with what's in the Icache.  (This assumes a loop the size of the
Icache.  For other conditions, it may change the hit rate instead.)

>My view (for what it's worth) is that with CURRENT technology it is
>better to have all instructions the same length.  However, you do
>need a big Icache (as I think the evolution of the Mips Inc machines
>demonstrates).

Icache is what makes the world go around.  The next stepping stone: making
dcaches more efficient (their current hit rates are lousy unless they're
ridiculously large).  This may require yet more integration of back-end
software and silicon, or even front-end software.

Randell Jesup, Commodore Engineering
{uunet|rutgers|allegra}!cbmvax!jesup
rodman@mfci.UUCP (Paul Rodman) (02/15/89)
In article <5964@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>	Very slightly, since PFX is an instruction, it just routes the result
>to the immediate value register.  The most complex part of this (not very)
>is shifting the value in the IVR over when a PFX is executed (easy because
>it's a fixed shift).

You mean it takes me *cycles* to build a >4-bit constant???  *Gasp, choke.*
I guess that's fine for some machines, but if you're reading 4 x 64-bit
words from a large array every beat from a common block, you may need lots
of constants without such a penalty.

I'd much rather have the instruction bits to do *exactly* what I (i.e. the
compiler) want to do, than have a program that is statically smaller.
Cache, main memory, and disk are cheap and getting cheaper.  Anytime I can
trade them for speed it's a win.

>	I say there is no difference in Icache complexity due to 16-bit
>instructions.

There would be on a machine that was trying to do more than one lousy
operation per cycle.

>Therefor, you should get twice as many instructions into
>it, and if the 15% figure is true, then you should get effectively 15%
>more done with what's in the icache.  (This assumes a loop the size of the
>icache.  For other conditions, it may change the hit-rate instead.)

Icaches can be made huge.  Who cares?

>	Icache is what makes the world go around.  The next stepping stone:
>making dcaches more efficient (their current hit-rates are lousy unless
>they're ridiculously large).

Why not have a *real* pipelined memory system and a compiler that can
handle more than one miserable outstanding load in flight?  Or have both.

Paul Rodman
rodman@mfci.uucp
aglew@mcdurb.Urbana.Gould.COM (02/16/89)
>Why not have a *real* pipelined memory system and a compiler than can
>handle more than one miserable outstanding load in flight? Or have both.
>
>	Paul Rodman

Denelcor's HEP.

We discussed barrel processors (processors that execute a different process
at every pipe stage, hence no dependencies) a while back; someone (I think
it was John Mashey) posted a rather good argument against them.  Does
anyone have that around?

(BTW, is comp.arch archived anywhere?)
w-colinp@microsoft.UUCP (Colin Plumb) (02/16/89)
rodman@mfci.UUCP (Paul Rodman) wrote:
> In article <5964@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>> Very slightly, since PFX is an instruction, it just routes the result
>> to the immediate value register.  The most complex part of this (not very)
>> is shifting the value in the IVR over when a PFX is executed (easy because
>> it's a fixed shift).
>
> You mean it takes me *cycles* to build a >4 bit constant??? *Gasp,choke.*
> I guess thats fine for some machines, but if you're reading 4 x 64 bit words
> from a large array every beat from a common block, you may need lots of
> constants without such a penalty.

You'd *hate* the transputer.  It takes one prefix instruction per 4 bits.
Currently, that's one cycle per 4 bits (past the first).  There are some
noises about speeding up this particular part of the decode process.

> Icaches can be made huge. Who cares?

Anyone who wants 15% more Icache than the competitor.  Assume we both have
the same Icache technology.  If my instructions are denser, I get more of
my inner loop in the Icache, and run faster.  If the denser instructions
don't cost me in decode as much as they get me in hit rate, I win.

Although, as you point out, Icache capacity is a rapidly moving target.
Still, it can be a valid tradeoff.

-Colin (uunet!microsoft!w-colinp)
"Don't listen to me.  I never do."
rodman@mfci.UUCP (Paul Rodman) (02/17/89)
In article <28200275@mcdurb> aglew@mcdurb.Urbana.Gould.COM writes:
>>Why not have a *real* pipelined memory system and a compiler than can
>>handle more than one miserable outstanding load in flight? Or have both.
>>
>>	Paul Rodman
>
>Denelcore's HEP.

Yuk.  Please don't bring up that pile of junk.  I was referring more to our
machine, where the compiler sees an exposed pipeline several beats long to
main memory.  In some cases a data cache is useful for reducing latency, if
the hit rate will be high enough.  Of course, your compiler must do a good
job of deciding when to go for the cache.

...I'm disgusted at these piddly RISC chips that brag about register
scoreboarding of one lousy outstanding load at a time.  Great for use with
data caches, not much use for high-performance computation on large data
sets (where I'd much rather not use the cache, but want to get decent
bandwidth).

If you want to do a daxpy, for example, you need two loads and a store per
2 flops.  What good are floating-point chips if you can't feed 'em?  The
architecture is broken if I can't at least do 1 load/store (from memory,
please) per flop.  :-)

Paul Rodman
rodman@mfci.uucp
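[Moderator's aside: the daxpy bookkeeping Paul cites -- per element, two
loads (x[i] and y[i]), one store (y[i]), and two flops (a multiply and an
add) -- in a quick sketch:]

```python
def daxpy(a, x, y):
    """y := a*x + y, the kernel Paul cites: 2 loads + 1 store per 2 flops."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]   # loads x[i], y[i]; one mul, one add; stores y[i]
    return y

assert daxpy(2.0, [1.0, 2.0], [3.0, 4.0]) == [5.0, 8.0]

loads, stores, flops = 2, 1, 2   # per vector element
# why "1 load/store per flop" is the minimum a sane architecture must sustain:
assert (loads + stores) / flops == 1.5
```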
rcd@ico.ISC.COM (Dick Dunn) (02/17/89)
In article <649@m3.mfci.UUCP>, rodman@mfci.UUCP (Paul Rodman) writes:
...
>>Denelcore's HEP.
...
> Yuk. Please don't bring up that pile of junk.

This is rude, out of place in the newsgroup, and unfounded anyway.  The
ideas that Denelcor (mostly Burton Smith) attempted to develop in the HEP
were reasonably well-founded; pieces of them can be found in more
contemporary designs.  Remember that the HEP design was really mid-'70's,
and also keep in mind that corporate vicissitudes (financing, contracts,
and all that piddly money stuff) do not determine technical merit.

> I was referring more to our machine...

How nice.  Slam machine X, then boost your own.

> ...In some
> cases a data cache is useful for reducing latency if the hit rate will be
> high enough. Of course, your compiler must do a good job
> about deciding when to go for the cache...

Remarkable as it may seem, there are data caches with very high hit rates,
and the compilers don't even have to worry about their existence.

Dick Dunn    UUCP: {ncar,nbires}!ico!rcd    (303)449-2870
   ...Just say no to mindless dogma.
aglew@mcdurb.Urbana.Gould.COM (02/17/89)
>Andy also mentioned at one point that he thought string ops weren't
>very necessary in a machine that had word ops with masks (or somesuch;
>I don't recall exactly). I wanted to hear your reasons for that, Andy.
>I sent you email, but it was bounced. Would you please post a msg with
>more details?
>
>	-Olin

My reasoning: the most commonly used string ops are moves.  There's only
one way to make moves faster - move more data per cycle => larger busses
[*].  Most string moves are small, so would, e.g., be able to fit into a
128-bit, 16-byte wide bus.

Given that you can move wide words, how do you handle misaligneds?  By a
decomposition of the word into power-of-two-sized transactions - works, but
gets more difficult as word size increases, and, in the dynamic case,
requires decisions.  Most byte-addressable architectures already have
signals similar to "store the data off the bus in this word only".  Provide
explicit control of these.  Similar arguments apply for string lengths.

I don't have enough real data behind these statements, yet; but I've been
flogging them for a few years, and have finally got support to examine them
in detail.  Since I need to invent the tools to do the study first, you
could still probably beat me to publication - I wouldn't mind, just tell me
so that I can do something more interesting.

Now, I admit that John Mashey's statements about the inefficacy of
optimizing strings tend to imply that this is not a very rewarding area of
research, but I point out that the same things also apply to block moves -
and, running profiling on my system, I regularly see block moves occupying
up to 10% of system time.  Why is a discussion for an OS group.

[*] Well, remapping and parallel moves are possibilities, appropriate for
large moves, but probably not for the most frequent case.
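[Moderator's aside: the power-of-two decomposition of a misaligned move can
be sketched with a greedy widest-aligned-first split.  The 16-byte bus
width is just the example figure from the post; the splitting policy is an
illustration, not a description of any shipping memory controller.]

```python
def decompose(addr, length, bus_bytes=16):
    """Split a byte move into naturally aligned power-of-two transactions,
    taking the widest transaction that is aligned and fits at each step."""
    parts = []
    while length > 0:
        w = bus_bytes
        while w > 1 and (addr % w != 0 or w > length):
            w //= 2
        parts.append((addr, w))
        addr += w
        length -= w
    return parts

# A 13-byte move starting at address 3 takes three transactions --
# this is the "requires decisions" dynamic case:
assert decompose(3, 13) == [(3, 1), (4, 4), (8, 8)]

# An aligned bus-width move is a single transaction, the cheap common case:
assert decompose(32, 16) == [(32, 16)]
```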
aglew@mcdurb.Urbana.Gould.COM (02/18/89)
>The architecture is broken if I can't at least do 1 load/store
>(from memory,please) per flop. :-)
>
>	Paul Rodman
>	rodman@mfci.uucp

Actually, given some of the instruction frequency statistics I've been
seeing recently, you need about 3 load/stores per arithmetic operation.
From cache, admittedly.
jesup@cbmvax.UUCP (Randell Jesup) (02/25/89)
In article <644@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>In article <5964@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>>	Very slightly, since PFX is an instruction, it just routes the result
>>to the immediate value register.  The most complex part of this (not very)
>>is shifting the value in the IVR over when a PFX is executed (easy because
>>it's a fixed shift).
>
>You mean it takes me *cycles* to build a >4 bit constant??? *Gasp,choke.*
>I guess thats fine for some machines, but if you're reading 4 x 64 bit words
>from a large array every beat from a common block, you may need lots of
>constants without such a penalty.

Excuse me, but what does loading from common blocks have to do with
constant sizes?  As I said, if you look at statistics on usage of
constants, a very large percentage will fit in 4 bits.

This design was not a total pedal-to-the-metal design, but one that
balanced memory speed, size, and bandwidth against processor speed.
16-bit instructions allow us to use much denser, slower, and cheaper
instruction memories while still running at 40MHz, and much faster than we
would have with 32-bit instructions at 20MHz (given constant I-Mem speed).
Remember, there are other uses of RISC chips than in Unix workstations.
Embedded controllers, for one.

>>	I say there is no difference in Icache complexity due to 16-bit
>>instructions.
>
>There would be on a machine that was trying to do more than one lousy
>operation per cycle.

Why the flame?  I was talking about the design decision between 32-bit
instructions and 16-bit ones on the RPM-40.  Not many 'RISC' chips execute
more than one instruction per cycle, certainly not the RPM-40.  It pushed
the design rules a fair ways, and had several close-to-critical paths (in
other words, without process/design-rule changes it would be very hard to
add a lot to it).

>>Therefor, you should get twice as many instructions into
>>it, and if the 15% figure is true, then you should get effectively 15%
>>more done with what's in the icache.  (This assumes a loop the size of the
>>icache.  For other conditions, it may change the hit-rate instead.)
>
>Icaches can be made huge. Who cares?

People who can't afford huge external Icaches.  Also, at the speeds we're
talking about, there are loading limitations to how much ram you can attach
to the processor.  The Icache I was talking about was the on-chip Icache.

>>	Icache is what makes the world go around.  The next stepping stone:
>>making dcaches more efficient (their current hit-rates are lousy unless
>>they're ridiculously large).
>
>Why not have a *real* pipelined memory system and a compiler than can
>handle more than one miserable outstanding load in flight?  Or have both.

We do.  We can do one load per cycle, and the memory systems are pipelined.
(The latency is more than one cycle, of course.)

>	Paul Rodman
>	rodman@mfci.uucp

Calm down.

Randell Jesup, Commodore Engineering
{uunet|rutgers|allegra}!cbmvax!jesup