[comp.arch] A simple question on RISC

aiko@xanth.cs.odu.edu (John K Hayes) (11/01/88)

Can someone explain, in general, what RISC technology is?  Starting, perhaps,
with what RISC stands for.  I'm not exactly an architecture person.

I shall appreciate it-----------john
-- 
 ---{john hayes}     Old Dominion Univ; Norfolk, Virginia USA
                     UUCP:  aiko@xanth.UUCP  or  ...!uunet!xanth!aiko
                     ARPA:  aiko@xanth.cs.odu.edu      
                     Home: (804) 622-8348     Work: (804) 460-2241 ext 195  

dumesny@batcomputer.tn.cornell.edu (Alain Dumesny) (11/01/88)

In article <6544@xanth.cs.odu.edu> aiko@cs.odu.edu (John K Hayes) writes:
>Can someone explain, in general, what RISC technology is?  Starting, perhaps,
>with what RISC stands for.  I'm not exactly an architecture person.

RISC stands for Reduced Instruction Set Computer; these are processors like
the M88000.  On the other side are what are called CISC (Complex Instruction
Set Computer) processors, like the M680x0 and 80x86 (yuck).
The main difference between RISC and CISC is that RISC processors only include
the most basic (and necessary) instructions.  Therefore when you send one an
instruction, the instruction won't have to be "decoded" (by decoded I mean that
it won't go through a lookup table of micro-code), which yields a big
improvement in performance.  On CISC processors you have instructions that are
rather complicated (i.e. they do a lot of things) and they need to go through
the look-up table to be "decoded".

A simple (if you program in assembly language) example might be a jump
to subroutine.  On a CISC chip you just say JSR address, and the
processor does the rest for you.  On a RISC chip you normally won't have this
kind of instruction, because it can be broken down into basic ones:
saving whatever registers you need on the stack (including the return
address, which is in the PC) and loading the PC (program counter) with the
new address.
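
A toy C model of the same decomposition (my own sketch: the names are made
up, and a real RISC would usually keep the return address in a register
rather than push it on a stack):

    /* What one CISC "JSR address" does for you, spelled out as the
       separate steps a RISC would use individual instructions for. */
    static unsigned stack[64];
    static int      sp = 64;        /* stack grows downward */
    static unsigned pc = 0x100;     /* pretend program counter */

    static void jsr(unsigned target)
    {
        stack[--sp] = pc;    /* save the return address: one store     */
        pc = target;         /* load the PC with the new address: jump */
    }

    static void rts(void)
    {
        pc = stack[sp++];    /* pop the return address back into the PC */
    }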

Also RISC chips tend to have a lot more registers built in (made possible
because RISC chips require a lot fewer components, and therefore the extra
space can be used for fast memory: registers).

Advantages of RISC are that they are much faster (in instructions
per cycle) and cheaper.  However IN GENERAL you need more instructions for
a RISC chip for a given program than you need for a CISC chip for the same
program (maybe about 30% more).  But you still get faster processing, which 
explains the "recent" rush for RISC technology (which actually was there
before CISC).

You might wonder why isn't everybody using RISC then?  Right.
Well you have to take into consideration the huge investment in software, and
also the fact that RISC chips require smarter compilers when you program in
a high level language.

Anyway, I hope this helped you instead of confusing you !  :-)

----------------
Alain Dumesny
Cornell University
awzy@crnlvax5.bitnet

No flame if I am wrong, I am learning this stuff now !  :-)

khb%chiba@Sun.COM (Keith Bierman - Sun Tactical Engineering) (11/01/88)

In article <6544@xanth.cs.odu.edu> aiko@cs.odu.edu (John K Hayes) writes:
>
>
>
>Can someone explain, in general, what RISC technology is?  Starting, perhaps,
>with what RISC stands for.  I'm not exactly an architecture person.
>
>I shall appreciate it-----------john
>-- 
RISC == reduced instruction 

The IBM 360 was the first really popular microcoded machine. This
allowed a very large number of instructions to be supported (i.e.
different assembly language op codes). Thus was born CISC (complex
instruction set computers). The rationale was: if it's hard to do in
software (or is slow), let's put it in the hardware. Especially when
people compared the relative costs of hardware and software. And so
everyone was happy. 

After a few years, some bright people noticed what Seymour Cray never
forgot: hardwired logic was faster. Also that maybe not all those
nifty wizbang instructions ever got used. Or if they got used, it was
so rare that it didn't matter. Or they got used, and it was slower
than some combination of simple instructions. Or all of the above.

Thus was born RISC. If we ignore Seymour (always a mistake), the early
RISC people were IBM, UCB and Stanford. With the exception of Seymour,
all of these early RISC machines were too simple.

Now we have the Clipper, the MIPS chip, SPARC, the AMD29K, the
Moto 88K, and dozens of experimental chips.

The basic design philosophy is: pick what you think the real problems
are. Measure how they behave. Design your RISC. Simulate it. Recurse
until you feel good. Make it. Market it. Get rich (one hopes :>). The
point being: don't just put stuff in the chip; decide what you want,
and make sure it does it.

In a very real way, RISC places a big burden on compilers (to get good
performance). On the other hand, since RISCs tend to be simple,
optimizers don't flail around as much. (On a VAX, say, when do you use
the compute-polynomial instruction? When is it better to have
something in a register or in memory? What about saving registers: do it
"by hand" or use the push-all instruction?) RISC machines typically
only have one way to do anything. Thus the compiler writer can
concentrate on other questions (like how to keep the pipe always full).

Thus (IMHO) RISC is a design methodology.

Of course, I don't speak for Sun. Furthermore Sun designed SPARC
without any help from me.

Keith H. Bierman
It's Not My Fault ---- I Voted for Bill & Opus

kyriazis@rpics (George Kyriazis) (11/01/88)

In article <75577@sun.uucp> khb@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes:
>
>After a few years, some bright people noticed what Seymour Cray never
>forgot: hardwired logic was faster. Also that maybe not all those
>nifty wizbang instructions ever got used. Or if they got used, it was
>so rare that it didn't matter. Or they got used, and it was slower
>than some combination of simple instructions. Or all of the above.
>
>Thus was born RISC....
>

One small detail that I remember reading in the thesis concerning the
RISC II chip:  if one instruction is not used very often, it not only
takes up silicon space, but also slows down the rest of the decoding
circuitry, since it adds output capacitance (in the case of MOS) to the
previous gate level.  The elimination of one single instruction won't do
much good, but if you do it systematically and organize the decoding
logic so it has a minimum number of levels and minimum input capacitance,
you score.  (Remember that the input capacitance of the decoding
circuitry is the output capacitance of the previous stage (a
simplification), namely the instruction register.)


  George Kyriazis
  kyriazis@turing.cs.rpi.edu
------------------------------

bcase@cup.portal.com (Brian bcase Case) (11/02/88)

>...RISC technology (which actually was there before CISC).

This is a common misconception.  Before machine architectures got as
complex as the 370 and VAX, there were indeed simple machines.  But
they were not RISCs.  RISC means more than simplicity, it means the
*right* simplicity.  The PDP-8 is a very simple machine, but it is
hardly a RISC.

bcase@cup.portal.com (Brian bcase Case) (11/03/88)

>RISC == reduced instruction 

That's reduced instruction set computer.

>Thus was born RISC. If we ignore Seymour (always a mistake), the early
>RISC people were IBM, UCB and Stanford. With the exception of Seymour,
>all of these early RISC machines were too simple.

Just to put in my 2 cents, not everyone (e.g., me) agrees that Seymour
builds RISC machines.  They might be simple, but they are not RISCs, by
my estimation.

As I understand it, the MIPS guys concluded that the early Stanford MIPS
architecture was too *complicated*, not too simple (the packing of one
or two instructions per word really complicated the pipeline control, at
least that's what I remember reading, and some instructions were two-
address while others were three).  The 801 had a very complex register
file (3 reads, two writes).  Subsequent machines seem to be simpler
(although we might all be surprised by John Cocke's America machine).

Notice also that Seymour's first machine came out in '76 (I believe).
The 801 research began in 1975, though it was perhaps not as focused
as it later became.  I'd say this is a pretty close call to who was first.
The 801 guys set down the ground rules and then let the compiler guys
define the instructions.  I don't think this is the way Seymour did it.
Seymour is a smart guy and understands what can go fast and what cannot,
but little things like the lack of three-address integer instructions
shows that he was not designing from the point of view of a compiler
guy.  And the Multiflow VLIW and Wulf's WM machine are closer to RISC
vector machines than are the Crays.

henry@utzoo.uucp (Henry Spencer) (11/04/88)

In article <75577@sun.uucp> khb@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes:
>... If we ignore Seymour (always a mistake), the early
>RISC people were IBM, UCB and Stanford. With the exception of Seymour,
>all of these early RISC machines were too simple.

Uh, do remember that the SPARC is essentially the UCB machine plus some very
minor (although useful) improvements.  There was nothing terribly wrong with
most of those early machines.
-- 
The Earth is our mother.        |    Henry Spencer at U of Toronto Zoology
Our nine months are up.         |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

khb%chiba@Sun.COM (Keith Bierman - Sun Tactical Engineering) (11/04/88)

In article <10802@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:

>As I understand it, the MIPS guys concluded that the early Stanford MIPS
>architecture was too *complicated*, not too simple (the packing of one
>or two instructions per word really complicated the pipeline control, at
>least that's what I remember reading, and some instructions were two-
>address while others were three).  The 801 had a very complex register
>file (3 reads, two writes).  Subsequent machines seem to be simpler
>(although we might all be surprised by John Cocke's America machine).

Simpler in some ways, more complex in others is the trend.

>
>Notice also that Seymour's first machine came out in '76 (I believe).

Wrong. UNIVAC, then all of the CDC machines until he left to found Cray
Research. The CDC 6600 is often listed as the first supercomputer. 

>The 801 research began in 1975, though it was perhaps not as focused
>as it later became.  I'd say this is a pretty close call to who was first.

6600 was a 1960's machine. So if it's RISC, it won by a decade.
                              ^^

>The 801 guys set down the ground rules and then let the compiler guys
>define the instructions.  I don't think this is the way Seymour did it.
>Seymour is a smart guy and understands what can go fast and what cannot,
>but little things like the lack of three-address integer instructions
>shows that he was not designing from the point of view of a compiler
>guy.  And the Multiflow VLIW and Wulf's WM machine are closer to RISC
>vector machines than are the Crays.

True on both counts. Seymour designs simple things which go fast. He
assumes that compiler writers can cope. This is, perhaps, not the best
approach...but his track record is pretty good (CDC owned the high
performance computer market during his long reign). The Multiflow and
Cydra machines are, arguably, RISCs.

Keith H. Bierman
It's Not My Fault ---- I Voted for Bill & Opus

colwell@mfci.UUCP (Robert Colwell) (11/04/88)

In article <76083@sun.uucp> khb@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes:
>In article <10802@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>
>>The 801 had a very complex register
>>file (3 reads, two writes).

Maybe this is due to all the hair they added on behalf of their step-multiply
and divide; this is one of the RISC patents.  Apparently, they can interrupt
on some of the bits in the register file.

>>The 801 research began in 1975, though it was perhaps not as focused
>>as it later became.  I'd say this is a pretty close call to who was first.

I was at one of Patterson's first RISC talks at Bell Labs in late '79
or early '80.  I recall him giving credit to John Cocke directly (then)
as to how he got involved with RISC.  I don't know how Hennessy got
involved, and never thought to ask him.

>>The 801 guys set down the ground rules and then let the compiler guys
>>define the instructions.  I don't think this is the way Seymour did it.
>>Seymour is a smart guy and understands what can go fast and what cannot,
>>but little things like the lack of three-address integer instructions
>>shows that he was not designing from the point of view of a compiler
>>guy.  And the Multiflow VLIW and Wulf's WM machine are closer to RISC
>>vector machines than are the Crays.
>
>True on both counts. Seymour designs simple things which go fast. He
>assumes that compiler writers can cope. This is, perhaps, not the best
>approach...but his track record is pretty good (CDC owned the high
>performance computer market during his long reign). The Multiflow and
>Cydra machines are, arguably, RISCs.

I agree; calling a Cray a RISC is (to my mind) more palatable than 
linking lots of registers to RISCs, but not by that much.

I like discussing our VLIW in the RISC context, because I think it
demonstrates the important RISC principles beautifully (smart compilers,
simple hardware (that's what our software people think, anyway), exposed
pipelines, load/store, no microcode).  But!! -- the instruction word
can be 512 bits long in our largest announced machine.  2**512 is a
lot of instructions.  And yes, I'm being a bit mischievous, too,
because any one functional unit's instruction set is small enough
to need only the simplest hardware for decode.  The rest of the control
hardware is to handle the self-draining pipelines, a design decision
that (I think) was the correct one, even for a RISC machine.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

albaugh@dms.UUCP (Mike Albaugh) (11/05/88)

From article <76083@sun.uucp>, by khb%chiba@Sun.COM
(Keith Bierman - Sun Tactical Engineering):
> In article <10802@cup.portal.com> bcase@cup.portal.com (Brian bcase Case)
writes:
>>Notice also that Seymour's first machine came out in '76 (I believe).
> 
> Wrong. UNIVAC, then all of the CDC machines until he left to found Cray
> Research. The CDC 6600 is often listed as the first supercomputer. 

	I wondered who else was going to notice this.
> 
> 6600 was a 1960's machine. So if it's RISC, it won by a decade.
>                               ^^
	If things like the 88000 are, what possible "rule" can exclude
the 6600?
> 
>>Seymour is a smart guy and understands what can go fast and what cannot,
>>but little things like the lack of three-address integer instructions
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>shows that he was not designing from the point of view of a compiler
> 
> True on both counts. Seymour designs simple things which go fast. He

	Uhhh... Am I the only one who remembers "IX7 X2+X3"?
Actually, the thing I liked _most_ about the 6600 (which it shared
with the DEC PDP10 and PDP11) was that one could learn a very
small "grammar" for instruction composition and know that any
instruction that "fit" would in fact exist _and_do_what_the_name_implied_.
Not all the special instructions on any of these machines fit the simplest
possible grammar, but all the generally useful ones did. The second
condition (do_what_...) is widely violated by most recent CISC's (e.g.
680x0) and the first even more so ("whadya mean I can only use that
addressing mode on instructions whose mnemonic, when translated into
Urdu, has a length which modulo 7 == 3?" :-). Reading about the current
crop of "RISC" processors, I am overwhelmed by Deja Vu. Multiple
function units, scoreboarding, forwarding, multiple instructions per
word, orthogonal instruction sets, compare_and_branch vs CC's...

	Oh, yeah, lack of Integer Mul & div was a bit of a pain, the
"Return Jump" a kluge (although a common one at the time), and one's
complement not exactly my cup of tea, so I don't maintain that the
6600 was perfect, but considering the competition at the time...

> Keith H. Bierman
> It's Not My Fault ---- I Voted for Bill & Opus
                          ^^^^^^^^^^^^^^^^^^^^^^
	I hope you will again, I probably will	:-)

| Mike Albaugh (albaugh@dms.UUCP || {...decwrl!turtlevax!}weitek!dms!albaugh)
| Atari Games Corp (Arcade Games, no relation to the makers of the ST)
| 675 Sycamore Dr. Milpitas, CA 95035		voice: (408)434-1709
| The opinions expressed are my own (Boy, are they ever)

baum@Apple.COM (Allen J. Baum) (11/05/88)

[]
>In article <547@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>>>The 801 had a very complex register
>>>file (3 reads, two writes).
>
>Maybe this is due to all the hair they added on behalf of their step-multiply
>and divide; this is one of the RISC patents.  Apparently, they can interrupt
>on some of the bits in the register file.

Actually--
 3 reads - required for indexed store (RegA -> Mem(RegB+RegX))
 2 writes- required for delayed loads (load RegA in cyc 1, stores in cyc 2 along
                                       with whatever else is executed in cyc 2)

Um, about interrupting on some of the bits in the register file- I don't recall
seeing that in the patents. Do you have a reference?
--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

bcase@cup.portal.com (Brian bcase Case) (11/06/88)

In article <76083@sun.uucp> khb%chiba@Sun.COM (Keith Bierman - Sun Tactical Engineering) writes:
>>Notice also that Seymour's first machine came out in '76 (I believe).
>
>Wrong. UNIVAC, then all of the CDC machines until he left to found Cray
>Research. The CDC 6600 is often listed as the first supercomputer. 

Yes, yes, yes.  Ok, please, please, don't send me any more mail about
the CDC6600.  Things that need to be said:  I just assume that when 
people talk about Cray's machines they are talking about Cray Research's
machines.  Thankfully, there are those out there who remember (better
than I) that he did the CDC6600.

The 6600 is somewhat close to being a RISC machine.  It is hardwired,
has three-address operations (is that right?), some registers, sorta
fixed-length instructions, etc.  But I find it interesting that subsequent
architectures, even Seymour's, diverged from this simplicity.  This
tells me that the architects of those days didn't "get the point."  The
fact that nearly all new architectures are RISCs or RISCy these days, tells
me that the formulation known as RISC has enabled architects to "get the
point."  RISC is a logical, reasoned approach that didn't exist in 1966,
even though someone might have had a simple machine.  This makes me tend
to *dis*believe that those early simple machines were RISCs:  they were
experiments that were not deemed successful enough to change the way
things were done.  RISC was an experiment too, but someone could demonstrate
why it was good, and the architects saw that it was good, and it was good.
Par-ump-a-dum-dum.  The mouse and lamb kept time, par-ump-a-dum-dum....

>>The 801 research began in 1975, though it was perhaps not as focused
>>as it later became.  I'd say this is a pretty close call to who was first.
>
>6600 was a 1960's machine. So if it's RISC, it won by a decade.
                              ^^
Right.

>Seymour designs simple things which go fast. He
>assumes that compiler writers can cope. This is, perhaps, not the best
>approach...but his track record is pretty good (CDC owned the high
>performance computer market during his long reign). The Multiflow and
>Cydra machines are, arguably, RISCs.

Well put, I think.

bcase@cup.portal.com (Brian bcase Case) (11/06/88)

>Actually--
> 3 reads - required for indexed store (RegA -> Mem(RegB+RegX))
> 2 writes- required for delayed loads (load RegA in cyc 1, stores in cyc 2 along
>                                       with whatever else is executed in cyc 2)

Well, two write ports are not required to handle one outstanding load; the
29K has a patented mechanism for buffering the load result in a separate
latch and then forwarding from there into the data path until an instruction
that creates a register-write hole (e.g. jump, store) comes along.  Notice
also that the next load always creates an opportunity to write back the
previously loaded value:  the newly loaded data is first destined for the
latch anyway, so whatever is already in the latch can use the writeback stage
of the outgoing load.  This is the brilliant invention, I wish I were the
inventor!  (A once-again Danish citizen named Ole Moller thought of it.)

colwell@mfci.UUCP (Robert Colwell) (11/06/88)

>>In article <547@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>>>>The 801 had a very complex register
>>>>file (3 reads, two writes).
>>
>>Maybe this is due to all the hair they added on behalf of their step-multiply
>>and divide; this is one of the RISC patents.  Apparently, they can interrupt
>>on some of the bits in the register file.
>
>Actually--
> 3 reads - required for indexed store (RegA -> Mem(RegB+RegX))
> 2 writes- required for delayed loads (load RegA in cyc 1, stores in cyc 2 along
>                                       with whatever else is executed in cyc 2)
>
>Um, about interrupting on some of the bits in the register file- I don't recall
>seeing that in the patents. Do you have a reference?

I was confusing two separate patents.  The first, related to the 801,
was on their condition code register, which collected bits from all
over the place.  You could interrupt on them, and this register could
also be transferred to the general registers, manipulated, and written
back.  (patent 4,589,087)

The one I was mixing it up with is 4,630,195, which is evidently not
related to the 801 at all, and concerns tagging the general registers
such that overlapping transfers are properly interlocked.

Thus proving once again that I can't read legalese worth a darn.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

bga@raspail.UUCP (Bruce Albrecht) (11/07/88)

In article <559@dms.UUCP>, albaugh@dms.UUCP (Mike Albaugh) writes:
> 	Oh, yeah, lack of Integer Mul & div was a bit of a pain, the
> "Return Jump" a kluge (although a common one at the time), and one's
> complement not exactly my cup of tea, so I don't maintain that the
> 6600 was perfect, but considering the competition at the time...
> 

The 6600 did have integer multiply (although the PP's did not).  I'm curious,
though, are the Cybers the only machines around that simulate integer divide
by doing a convert to floating point, floating point divide, normalize,
and convert back to integer?

cik@l.cc.purdue.edu (Herman Rubin) (11/07/88)

In article <1021@raspail.UUCP>, bga@raspail.UUCP (Bruce Albrecht) writes:
> In article <559@dms.UUCP>, albaugh@dms.UUCP (Mike Albaugh) writes:
> > 	Oh, yeah, lack of Integer Mul & div was a bit of a pain, the
> > "Return Jump" a kluge (although a common one at the time), and one's
> > complement not exactly my cup of tea, so I don't maintain that the
> > 6600 was perfect, but considering the competition at the time...
> > 
> 
> The 6600 did have integer multiply (although the PP's did not).  I'm curious,
> though, are the Cybers the only machines around that simulate integer divide
> by doing a convert to floating point, floating point divide, normalize,
> and convert back to integer?

The integer multiply on the 6600 was an afterthought, and required a kludge. 
To avoid special cases, one either had to avoid the possibility of 48
significant bits in both factors, or go through conversion to floating
point.

As for divide, the CRAYs do not have any kind at all.  They do have a
reciprocal (floating point only) instruction which is accurate to a claimed
30 bits, and a correction factor instruction.  It is not necessarily true
that, say, 7n/n = 7.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

baum@Apple.COM (Allen J. Baum) (11/08/88)

[]
>In article <10938@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>Actually--
>> 3 reads - req'd for indexed store (RegA -> Mem(RegB+RegX))
>> 2 writes- req'd for delayed loads (load RegA in cyc 1, stores in cyc 2 along
>>                                   with whatever else is executed in cyc 2)
>
>Well, two write ports are not required to handle one outstanding load; the
>29K has a patented mechanism for buffering the load result in a separate
>latch and then forwarding from there into the data path until an instruction
>that creates a register-write hole (e.g. jump, store) comes along.  Notice
>also that the next load always creates an opportunity to write back the
>previously loaded valued:  the newly loaded data is first destined for the
>latch anyway, so whatever is already in the latch can use the writeback stage
>of the outgoing load.  This is the brilliant invention, I wish I were the
>inventor!  (A once-again Danish citizen named Ole Moller thought of it.)

I don't know if the original 801 had autoincrementing addressing modes, but
that in itself is enough to require a two-write-port reg. file.

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

khb%chiba@Sun.COM (Keith Bierman - Sun Tactical Engineering) (11/08/88)

In article <1021@raspail.UUCP> bga@raspail.UUCP (Bruce Albrecht) writes:
>In article <559@dms.UUCP>, albaugh@dms.UUCP (Mike Albaugh) writes:
>> 	Oh, yeah, lack of Integer Mul & div was a bit of a pain, the
>> "Return Jump" a kluge (although a common one at the time), and one's
>> complement not exactly my cup of tea, so I don't maintain that the
>> 6600 was perfect, but considering the competition at the time...
>> 
>
>The 6600 did have integer multiply (although the PP's did not).  I'm curious,
>though, are the Cybers the only machines around that simulate integer divide
>by doing a convert to floating point, floating point divide, normalize,
>and convert back to integer?

If you look at the output of early (and probably current) Lahey compilers
for the PC, you will find that they do much integer arithmetic on the 8087
(or other NDP) side. It turns out that 32-bit math was faster over there
(on the 8088, 8086, 80186). I don't know about the 286/386s.
Keith H. Bierman
It's Not My Fault ---- I Voted for Bill & Opus

cprice@mips.COM (Charlie Price) (11/08/88)

In article <1021@raspail.UUCP> bga@raspail.UUCP (Bruce Albrecht) writes:
>
>The 6600 did have integer multiply (although the PP's did not).  I'm curious,
>though, are the Cybers the only machines around that simulate integer divide
>by doing a convert to floating point, floating point divide, normalize,
>and convert back to integer?

Saying without any further explanation that the 6600 (or any of the
6000 series) had integer multiply, and then saying that it didn't have
integer divide, is wrong.

For those who are not familiar with the story here, the 6000
series machines had 60-bit words and thus 60-bit integers (maybe)
and a floating point format with a 48-bit mantissa.
The main objective of the machines was fast scientific
computation so the floating point times were quite fast.
There were operations for integer add and subtract on the 60-bit registers.
There were not, however, multiply or divide instructions.
It was possible to use the always-present floating point hardware to
do multiply/divide for some integers.
To quote the "Control Data 6000 Series Computer Systems Reference Manual"
Revision W (1-15-73):

  Fixed point integer multiplication is handled in the multiply functional
  units as a subset operation of the unrounded Floating Multiply (40, 42)
  instructions.  The multiply is double precision (96 bits) and allows
  separate recovery of upper and lower products.  The multiply requires
  that both of the integer operands be converted (by program) to
  floating point format to provide biased exponents.  This insures that
  results are not sensed as under-flow conditions.
  The bias is removed when the result is unpacked.

  An integer divide takes several steps and makes use of the Divide and
  Shift Units. ...

I don't want to type in the example text -- the basic story is that the
floating result is unpacked and shifted by the exponent amount to get
the bits into the right position for the integer representation.

So the hardware doesn't really support 60-bit ints very well at all.
If you limit integers to 48 bits of significance then multiply
and divide operations don't give you any surprises -- but of course then you
are faced with the problem that add and subtract don't really
work for the same domain and range as multiply and divide.
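
A rough modern analogy in C (mine, not from the manual): let IEEE double
play the role of the 6000's floating format, with a 53-bit significand
instead of 48.  Products that fit the significand come back exactly; wider
ones silently lose low bits, which is just the kind of surprise described
above for 60-bit ints.

    #include <stdio.h>
    #include <inttypes.h>

    /* Integer multiply by way of the floating point unit, 6600 style. */
    static int64_t fmul(int64_t a, int64_t b)
    {
        return (int64_t)((double)a * (double)b);
    }

    int main(void)
    {
        int64_t small = INT64_C(1) << 20;   /* 2^20 */
        int64_t big   = INT64_C(1) << 30;   /* 2^30 */

        /* 2^40 fits in 53 bits: the FP route is exact. */
        printf("%" PRId64 " vs %" PRId64 "\n",
               fmul(small, small), small * small);

        /* (2^30+1)^2 needs 61 bits: the FP route rounds off the +1. */
        printf("%" PRId64 " vs %" PRId64 "\n",
               fmul(big + 1, big + 1), (big + 1) * (big + 1));
        return 0;
    }
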
-- 
Charlie Price    cprice@mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086

seanf@sco.COM (Sean Fagan) (11/11/88)

In article <7819@winchester.mips.COM> cprice@winchester.UUCP (Charlie Price) writes:
>Saying that the 6600 (or any of the 6000 series) had
>integer multiply without any further explanation and then saying
>that it didn't have integer divide is wrong.
>To quote the "Control Data 6000 Series Computer Systems Reference Manual"
>Revision W (1-15-73):
>
[to multiply integers, convert to floating point, then back]

The CDC Cyber 170 machines (which were, essentially, 7600's) did integer
multiply, in hardware, without having to convert to FP and back.  However,
you could only multiply 48-bit integers (that is, exponent equal to 0).
They still didn't have a divide instruction, although COMPASS *did* provide
one (as a "micro," but it trashed your source registers and a B register or
two (I forget, sorry)).

And, of course, Crays don't have divides at all.  Does anybody besides me
get the impression that Seymour don't like 'em? 8-)

>Charlie Price    cprice@mips.com        (408) 720-1700
>MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086


-- 
Sean Eric Fagan  | "Engineering without management is *ART*"
seanf@sco.UUCP   |     Jeff Johnson (jeffj@sco)
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

idall@augean.OZ (Ian Dall) (11/14/88)

In article <75577@sun.uucp> khb@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes:
>
> Or if they (wizzbang instructions) got used, it was
> so rare that it didn't matter. Or they got used, and it was slower
> than some combination of simple instructions. Or all of the above.

Can anyone tell me *why* some of these microcoded instructions were
slower than a combination of simpler instructions on the same machine?
I am not debating CISC vs RISC here since both cases run on the *same*
(CISC) machine.  If nothing else the second case must have resulted in
more memory accesses for instruction fetches. Was the difference
simply incompetence on the part of the microcode writer, or is there
some reason for this?
-- 
 Ian Dall           life (n). A sexually transmitted disease which afflicts
                              some people more severely than others.
idall@augean.oz

brooks@maddog.llnl.gov (Eugene Brooks) (11/15/88)

In article <1688@scolex> seanf@sco.COM (Sean Fagan) writes:
>And, of course, Crays don't have divides at all.  Does anybody besides me
>get the impression that Seymour don't like 'em? 8-)
Seymour made the observation that floating point divides are much less
frequent than multiply or add, so one could implement them with a less
complex reciprocal approximation unit followed by a Newton-Raphson hit or
two.  This disproves the notion that Seymour just hacked in what he could do
fast; he obviously made careful decisions about what to put in hardware
based on how heavily instructions were used and how fast they could be
implemented.
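
For the curious, a minimal C sketch of the scheme (my own; the crude seed
below stands in for the hardware reciprocal approximation unit):

    #include <stdio.h>

    /* One Newton-Raphson step for 1/a: each step roughly doubles the
       number of correct bits, so a ~30-bit seed needs only a hit or two. */
    static double refine(double r, double a)
    {
        return r * (2.0 - a * r);
    }

    int main(void)
    {
        double a = 7.0;
        double r = 0.14;            /* crude first guess at 1/7 */
        for (int i = 0; i < 3; i++)
            r = refine(r, a);
        printf("1/7  ~= %.17g\n", r);
        printf("21/7 ~= %.17g\n", 21.0 * r);
        return 0;
    }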

tim@crackle.amd.com (Tim Olson) (11/15/88)

In article <419@augean.OZ> idall@augean.OZ (Ian Dall) writes:
| In article <75577@sun.uucp> khb@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes:
| >
| > Or if they (wizzbang instructions) got used, it was
| > so rare that it didn't matter. Or they got used, and it was slower
| > than some combination of simple instructions. Or all of the above.
| 
| Can anyone tell me *why* some of these microcoded instructions were
| slower than a combination of simpler instructions on the same machine?
| I am not debating CISC vs RISC here since both cases run on the *same*
| (CISC) machine.  If nothing else the second case must have resulted in
| more memory accesses for instruction fetches. Was the difference
| simply incompetence on the part of the microcode writer, or is there
| some reason for this?

There are a number of reasons, most of them due to the main problem of
limited microcode space:

	1) "Free" microcode sequences.  "Hey, look at this!  If we just
	change one input to the 'add' sequence, we get 'clear'!"  (Too bad
	clear now reads the data before storing a zero into it.)

	2) Limited data areas (and thus, limited algorithms for code
	sequences).  Microcode doesn't normally have access to
	arbitrary precompiled data tables in memory like a library
	routine does, so we see things like the recent ARM reverse-bit
	sequence which is slower than a standard table-lookup. 

	3) Microcode is not as easily optimized.  A general microcode
	routine to multiply two integers is easy enough to write, but it
	is usually faster to perform a series of shifts and adds in
	macrocode when you are multiplying by a constant.  One could
	write each of these optimal sequences in microcode, and have a
	horrendous set of mul_by_5, mul_by_37, etc. instructions, but it
	is much easier and less wasteful to let the compiler handle it (a
	small sketch of this follows below).
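
To make 3) concrete, here is the promised C sketch (mine; only the
mul_by_5 name comes from the text above):

    /* General multiply: what a microcoded MUL routine has to handle,
       one shift/add step per bit of y. */
    unsigned mul_general(unsigned x, unsigned y)
    {
        unsigned p = 0;
        while (y) {
            if (y & 1)
                p += x;
            x <<= 1;
            y >>= 1;
        }
        return p;
    }

    /* What the compiler can emit instead when it sees "x * 5":
       a shift and an add rather than a full multiply. */
    unsigned mul_by_5(unsigned x)
    {
        return (x << 2) + x;    /* x*4 + x == x*5 */
    }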


	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)

ok@quintus.uucp (Richard A. O'Keefe) (11/15/88)

In article <419@augean.OZ> idall@augean.OZ (Ian Dall) writes:
>Can anyone tell me *why* some of these microcoded instructions were
>slower than a combination of simpler instructions on the same machine?

There have been several micro-coded machines which had hardware
support for decoding _part_ of an instruction.
For a specific case, imagine a machine which has both
	add.2	Dst,Src		Dst +:= Src
and	add.3	Dst,Src1,Src2	Dst := Src1+Src2
but suppose that the hardware has special support for decoding the
first two operands, while additional operands have to be decoded by
microcode.  Then
	mov.2	Dst,Src1
	add.2	Dst,Src2
would be fully decoded by hardware, while
	add.3	Dst,Src1,Src2
would have one of its operands decoded by microcode.

Another case would be an operation which could be done with a fast
algorithm and a large table or with a slow algorithm and no table.
For example, to find the first bit in a 16-bit word, just index
a precomputed table of 2**16 bytes (there are 17 cases).  Or do it
with a dumb ALU and no table.
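
A small C sketch of that trade-off (my own; here "first bit" means the
lowest set bit).  The table route spends 2**16 bytes to answer with one
indexed load; the loop is the dumb-ALU route:

    #include <stdio.h>

    /* Position of the lowest set bit of a 16-bit word (0..15),
       or 16 if no bit is set: 17 possible answers. */
    static unsigned char table[1 << 16];

    static void init_table(void)
    {
        for (unsigned w = 0; w < (1u << 16); w++) {
            table[w] = 16;                   /* "no bit set" */
            for (unsigned char i = 0; i < 16; i++)
                if (w & (1u << i)) { table[w] = i; break; }
        }
    }

    /* Fast algorithm, large table: one indexed load. */
    static int ffb_table(unsigned short w) { return table[w]; }

    /* Slow algorithm, no table: shift and test. */
    static int ffb_loop(unsigned short w)
    {
        for (int i = 0; i < 16; i++)
            if (w & (1u << i))
                return i;
        return 16;
    }

    int main(void)
    {
        init_table();
        printf("%d %d\n", ffb_table(0x0050), ffb_loop(0x0050));  /* 4 4 */
        return 0;
    }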

richt@breakpoint.ksr.com (Rich Title) (11/15/88)

In article <419@augean.OZ> idall@augean.OZ (Ian Dall) writes:
>Can anyone tell me *why* some of these microcoded instructions were
>slower than a combination of simpler instructions on the same machine?
>... Was the difference
>simply incompetence on the part of the microcode writer, or is there
>some reason for this?
>-- 

When I was at DEC, I remember one of the software guys asking one
of the hardware guys essentially this question: why, on the VAX 780,
are the add-compare-branch compound instructions slower than doing
separate add, compare, and branch instructions? The answer had to
do with exceptions and restartability. E.g., if the add-compare-branch
gets a memory fault on the compare step, it has to be restartable,
which means the result of the add cannot yet have been written. I.e.,
the add-compare-branch cannot be implemented straightforwardly by 
doing an add, compare, and branch; but instead it needs to go about
its operations in an unnatural order and save enough state as it
goes so it can always back out if it gets an exception.

I think on later VAXes they figured out how to do this better. 

    - Rich

henry@utzoo.uucp (Henry Spencer) (11/16/88)

In article <23541@amdcad.AMD.COM> tim@crackle.amd.com (Tim Olson) writes:
>There are a number of reasons, most of them due to the main problem of
>limited microcode space:
>
>	3) Microcode is not as easily optimized...

Also, optimization of microcode can require more microcode space.  One
reason why the simple instructions tend to be fast is that a lot of work
(microcode space, decode hardware, etc.) is put into optimizing them.
For example, the hardware may break out the major variants of the MOVE
instruction, rather than having the microcode do test-branch-test-branch
to finish the decoding.  This sort of thing can't be done for the whole 
instruction set without exceeding the available resources (space,
development time, staff), and the difference can be major enough to make
the fancy stuff slower than "doing it by hand".
-- 
Sendmail is a bug,             |     Henry Spencer at U of Toronto Zoology
not a feature.                 | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

mph@praxis.co.uk (Martin Hanley) (11/16/88)

In article <419@augean.OZ> idall@augean.OZ (Ian Dall) writes:
>In article <75577@sun.uucp> khb@sun.UUCP (Keith Bierman) writes:
>>
>> Or if they (wizzbang instructions) got used, it was
>> so rare that it didn't matter. Or they got used, and it was slower
>> than some combination of simple instructions. Or all of the above.
>
>Can anyone tell me *why* some of these microcoded instructions were
>slower than a combination of simpler instructions on the same machine?
>-- 
> Ian Dall

I think, in general, the microcoded instructions tend to be more
general than the simple ones, and thus have more cases to consider.

Also, the decode path (is that the right term?) is shorter for a
non-microcoded instruction, and thus again is quicker. The difference
here is akin to the difference between machine code and interpreted
code, though of course the magnitudes of difference vary.

							Martin Hanley.

-------------------------------------------------------------------------------
| "I'm not a god, I was just       | This drivel was brought to you by:       |
| misquoted" - Lister, "Red Dwarf" |                        mph@praxis.co.uk  |
-------------------------------------------------------------------------------

firth@sei.cmu.edu (Robert Firth) (11/16/88)

In article <13844@lll-winken.llnl.gov> brooks@maddog.UUCP (Eugene Brooks) writes:

>Seymour made the observation that floating point divides are much less
>frequent than multiply or add...

True

>... so one could implement them with a less
>complex reciprocal approximation unit followed by a Newton Raphson hit or
>two...

False.  This leads to results less accurate than those given by a true
divide.  For instance, 21.0/7.0 gives an exact 3.0 with a true divide,
but usually will not if computed as 21.0*(1/7.0).
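
To see the point numerically, here is a C sketch of mine: it fakes a
reciprocal good to only about 30 bits (the accuracy quoted elsewhere in
this thread for the CRAY approximation unit, before any correction) and
compares 21/7 done both ways:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* Round the true reciprocal of 7 to roughly 30 significant
           bits, standing in for a hardware approximation unit. */
        double recip = ldexp(rint(ldexp(1.0 / 7.0, 30)), -30);

        printf("21.0 / 7.0     = %.17g\n", 21.0 / 7.0);   /* exactly 3 */
        printf("21.0 * (1/7.0) = %.17g\n", 21.0 * recip); /* not quite */
        return 0;
    }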

>...This disproves the notion that Seymour just hacked in what he could do
>fast...

Really?  I'd say it confirms the notion.

>... he obviously made careful decisions about what to put in hardware
>based on how heavily instructions were used and how fast they could be
>implemented.

Too bad he didn't base his decision in part on whether users performing
floating-point calculations prefer better answers to worse ones.  Too
bad also that the 'megaflops per millisecond' benchmarking culture tends
to be heavily biased in favour of "wrong answers fast".

firth@sei.cmu.edu (Robert Firth) (11/16/88)

In article <392@ksr.UUCP> richt@ksr.UUCP (Rich Title) writes:

>When I was at DEC, I remember one of the software guys asking one
>of the hardware guys essentially this question: why, on the VAX 780,
>are the add-compare-branch compound instructions slower than doing
>separate add, compare, and branch instructions? The answer had to
>do with exceptions and restartability. E.g., if the add-compare-branch
>gets a memory fault on the compare step, it has to be restartable,
>which means the result of the add cannot yet have been written. I.e.,
>the add-compare-branch cannot be implemented straightforwardly by 
>doing an add, compare, and branch; but instead it needs to go about
>its operations in an unnatural order and save enough state as it
>goes so it can always back out if it gets an exception.

[ The instruction is

	ACB limit.rx add.rx index.mx displ.bw
]

I don't believe this.  First, the instruction cannot get a memory
fault after the add step, since the address of the index operand is the
last one to be evaluated.

Secondly, the instruction is not restartable.  For example, an overflow
during the add step will still cause the index to be overwritten, a
junk compare to be performed, and a synchronous trap to be taken with
the value of PC unpredictable.  Only a reserved operand fault leaves
the machine in a known state.

As an aside, the instruction doesn't even work.  It implements an add,
compare and branch at the bottom of the loop.  The only language that
defines a FOR loop that is always executed at least once is Fortran,
and it requires the sign of the step to be known at compile time.  For
all other languages, you have to hand code a compare at the top of the
loop anyway, so it's pretty pointless to code it again at the bottom.
(Well, actually you put the compare at the bottom and have an initial
branch to it, but you see what I mean.)  In other words, there is no
programming language construct to which this instruction corresponds.

It also implements in hardware that famous Pascal bug: an iteration
whose termination value is an extremum of the integer range fails.

Lest this post fall into the hands of young persons, I shall not mention
the floating-point form of the instruction.

slackey@bbn.com (Stan Lackey) (11/16/88)

In article <419@augean.OZ> idall@augean.OZ (Ian Dall) writes:
>In article <75577@sun.uucp> khb@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes:
>> Or if they (wizzbang instructions) got used, it was
>> so rare that it didn't matter. Or they got used, and it was slower
>> than some combination of simple instructions. Or all of the above.
>Can anyone tell me *why* some of these microcoded instructions were
>slower than a combination of simpler instructions on the same machine?
A significant number of these claims come from experience on some of the VAX
machines.  There are two major cases that come to mind.

1. It was sometimes the case that the complex instruction did more
than the application required.  For example, the CALL instructions
saved a lot of stuff to the stack, aligned the stack pointer, etc.  A
compiler which implemented a "lightweight call" would run faster.

2. There was a particular detail with the VAX-11/780 INDEX instruction
- it ran slower than its emulation.  The reason is that it was
microcoded using the microcoded integer multiply.  The emulation used
the simpler MULL instruction, which was done by the floating point
accellerator.  The instruction was added to the instruction set very
late in the development, too late to make the hardware change to the
FPA.

So, aside from implementation details, I see no "RISC superiority"
arguments here.
-Stan

aglew@urbsdc.Urbana.Gould.COM (11/16/88)

>> Or if they (wizzbang instructions) got used, it was
>> so rare that it didn't matter. Or they got used, and it was slower
>> than some combination of simple instructions. Or all of the above.
>
>Can anyone tell me *why* some of these microcoded instructions were
>slower than a combination of simpler instructions on the same machine?
>I am not debating CISC vs RISC here since both cases run on the *same*
>(CISC) machine.  If nothing else the second case must have resulted in
>more memory accesses for instruction fetches. Was the difference
>simply incompetence on the part of the microcode writer, or is there
>some reason for this?
>
> Ian Dall

On at least one machine I am familiar with there was a lot of hardware
provided expressly for optimizing, e.g., instruction memory accesses.
It was a lot harder to access memory from microcode than it was from
the instruction stream.

trt@rti.UUCP (Thomas Truscott) (11/17/88)

> >Can anyone tell me *why* some of these microcoded instructions were
> >slower than a combination of simpler instructions on the same machine?

Henry Spencer answered this well -- the performance-critical instructions
(such as load/store/move) get enormous attention by the hardware people
and by the micro-coders.  Other instructions often lack necessary hardware,
or may be poorly microcoded.  Examples abound; many concern the VAX
because it is the flagship of CISC and is so popular and accessible.

Here are some examples on other machines.  These examples should
not be construed as criticism of these machines;
they are simply the machines with which I am familiar.

PDP 11/70 (and probably most other models)
	The MOV instruction is the workhorse of this machine,
	and the designers no doubt worked hard to make it fast.
	They did a good enough job that
		JMP	LABEL
	is slower than
		MOV	#LABEL,pc	/ move address into program counter
	MOV is faster than JMP in all cases, including some such as
		MOV	@(r2)+,pc
	that JMP cannot even support.
	So JMP is a useless instruction on the PDP.

	The RTS (return from subroutine) instruction is heavily used
	on the PDP.  Yet
		RTS	r5
	is slower than the equivalent
		MOV	(r5)+,pc
	So RTS is a useless instruction on the PDP.

	There are other pointless instructions on the PDP (remember MARK?)
	and this on a machine with not that many instructions!
	Different models of PDP had different instruction sets,
	peculiarities with some of the instructions,
	and peculiarities with some of the addressing modes.
	So compiler writers wrote for the common subset,
	leaving the peculiar things unused, so they were "useless".
	
	What is wrong with having useless instructions?
	I am sure various RISC papers address this topic.

Gould PowerNode 9050
	There is a nice "Add-Bit-To-Register" instruction that
	can add any power of 2 to a register,
	which the timing table claims* takes one cycle.
	(* Vendor timing tables always have total disclaimers.)
	But it actually takes several cycles.
	The compiler writers are well aware of this,
	and never use this instruction.
	I estimate that, for one reason or another,
	over half of the PowerNode instructions are never* used
	(* of course the diagnostics check their operation).

Gould PowerNode 6000
	This is a somewhat slower Gould model, which micro-codes
	the multiple-bit-shift operation as several single-bit shifts.
	The compiler writers are well aware of this,
	and use various tricks to minimize (or eliminate) the shift distance.

There are plenty more machines out there, but this is getting a bit long.

> So, aside from implementation details, I see no "RISC superiority"
> arguments here.

One of the arguments against CISC is that they have
far too many "implementation details".
	Tom Truscott

guy@auspex.UUCP (Guy Harris) (11/18/88)

>As an aside, the instruction doesn't even work.  It implements an add,
>compare and branch at the bottom of the loop.  The only language that
>defines a FOR loop that is always executed at least once is Fortran,

It's interesting to note that Fortran 66 didn't, as I remember,
explicitly say that DO-loops were one-trip; they just all happened to be
implemented that way.  Fortran 77 explicitly said that they were
*zero*-trip; I seem to remember somebody from DEC arguing that they
should be one-trip, but fortunately the F77 committee didn't listen to
him.... 

khb%chiba@Sun.COM (Keith Bierman - Sun Tactical Engineering) (11/18/88)

In article <464@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
>>As an aside, the instruction doesn't even work.  It implements an add,
>>compare and branch at the bottom of the loop.  The only language that
>>defines a FOR loop that is always executed at least once is Fortran,

No!
>
>It's interesting to note that Fortran 66 didn't, as I remember,
>explicitly say that DO-loops were one-trip; 

Right!

>they just all happened to be
>implemented that way.

Not all. Which is why ...

>Fortran 77 explicitly said that they were
>*zero*-trip;

i.e. since there was NOT universal implementation of 1 trip do loops,
we should pick the sensible meaning.


>I seem to remember somebody from DEC arguing that they
>should be one-trip, but fortunately the F77 committee didn't listen to
>him.... 

Since 1 trip is what DEC did, the argument was probably powerful...to
DEC :>


Keith H. Bierman
It's Not My Fault ---- I Voted for Bill & Opus

jk3k+@andrew.cmu.edu (Joe Keane) (11/19/88)

Robert Firth writes:
> The only language that defines a FOR loop that is always executed at least
> once is Fortran, and it requires the sign of the step to be known at compile
> time.

C's do-while is uncommon, but they haven't removed it yet.

> For all other languages, you have to hand code a compare at the top of the
> loop anyway, so it's pretty pointless to code it again at the bottom.

The naive way to code `for (i = lower; i < upper; i += step) foo;' is:

        mov     lower,i
test    cmp     i,upper
        blt     end
        foo
        add     step,i
        bra     test
end

> (Well, actually you put the compare at the bottom and have an initial branch
> to it, but you see what I mean.)

Good idea; let's try it:

        mov     lower,i
        bra     entry
body    foo
        add     step,i
entry   cmp     i,upper
        bge     body

Statically, same number of instructions; dynamically, less except in the
zero-trip case.  But look, the add, compare, and branch are consecutive.

Suppose we want to eliminate the unconditional branch.  We can duplicate the
test at the top.  Or the compiler may figure out that the initial test can't
fail (mostly if the loop has constant bounds).  In either of these cases, the
add, compare, and branch are still consecutive.

--Joe

cik@l.cc.purdue.edu (Herman Rubin) (11/19/88)

In article <oXV-=Wy00V4109jmdK@andrew.cmu.edu>, jk3k+@andrew.cmu.edu (Joe Keane) writes:
> Robert Firth writes:
> > The only language that defines a FOR loop that is always executed at least
> > once is Fortran, and it requires the sign of the step to be known at compile
> > time.
> 
> C's do-while is uncommon, but they haven't removed it yet.
> 
> > For all other languages, you have to hand code a compare at the top of the
> > loop anyway, so it's pretty pointless to code it again at the bottom.
> 
> The naive way to code `for (i = lower; i < upper; i += step) foo;' is:
> 
>         mov     lower,i
> test    cmp     i,upper
>         blt     end
>         foo
>         add     step,i
>         bra     test
> end
> 
> > (Well, actually you put the compare at the bottom and have an initial branch
> > to it, but you see what I mean.)
> 
> Good idea; let's try it:
> 
>         mov     lower,i
>         bra     entry
> body    foo
>         add     step,i
> entry   cmp     i,upper
>         bge     body
> 
> Statically, same number of instructions; dynamically, less except in the
> zero-trip case.  But look, the add, compare, and branch are consecutive.
> 
> Suppose we want to eliminate the unconditional branch.  We can duplicate the
> test at the top.  Or the compiler may figure out that the initial test can't
> fail (mostly if the loop has constant bounds).  In either of these cases, the
> add, compare, and branch are still consecutive.
> 
> --Joe

Even if we do not eliminate the unconditional branch, it should be pointed out
emphatically that the second version is likely to be faster.  With the
conditional branch at the beginning, we have two consecutive branch instruc-
tions at run time, and this can be quite expensive timewise.  The other way,
only one branch instruction is run PER LOOP.  And Joe's remark that the 
programmer or the compiler may know that the loop will be run at least once
is important.  If the language does not allow the programmer to provide this
information, this is a built-in machine-independent deficiency in the language.

The only time the branch-at-the-beginning wins is when the loop should not be
performed.  If this is an important consideration, a third alternative is:


         mov     lower,i
 test    cmp     i,upper
	 blt	 lpexit
 body    foo
         add     step,i
 	 cmp     i,upper
         bge     body
 lpexit

This is one instruction longer, and meets the speed of the branch-at-the-
beginning code if the loop is not to be executed, but will beat it if the
loop is executed at least once.  It may be marginally slower than Joe's 
if the loop is carried out.  I would recommend this one if zero execution
steps are of moderate importance.  But I find it hard to see any justification
for Robert's objection to having two conditional branches in the CODE,  one
being in the loop, rather than having a conditional and an unconditional branch
in the LOOP.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

deraadt@dataspan.UUCP (Theo De Raadt) (11/21/88)

>Can anyone tell me *why* some of these microcoded instructions were
>slower than a combination of simpler instructions on the same machine?
>I am not debating CISC vs RISC here since both cases run on the *same*
>(CISC) machine.  If nothing else the second case must have resulted in
>more memory accesses for instruction fetches. Was the difference
>simply incompetence on the part of the microcode writer, or is there
>some reason for this?
>
> Ian Dall

I don't know; just that on the CISC processor I am most familiar with,
the bus cycles are longer, so that all sorts of slow/fast/synchronous/
asynchronous types of memory and IO can be talked to easily without having
to add much external hardware; the circuitry to talk to any type of memory
is built into the chip. Of course, I guess that they should have been
shorter, but then I don't know if that has much to do with the instruction
set timing at all.
 <tdr.
-- 
_____                 _                   -----------------------------------
  / /            /   / \ _   _      /_/_  Theo de Raadt:       (403) 289-4620
 / /_ _  ___  __/_  /__/ _\  _\  __/ /    DATASPAN
/ / /</_(_)  (_/</_/  \_(_/\(_/\(_/_(_/   ..!alberta!calgary!dataspan!deraadt

bcase@cup.portal.com (Brian bcase Case) (11/21/88)

Concerning the ACB (VAX's add-compare-branch instruction)
>Secondly, the instruction is not restartable.

?????!!!!! It had better be!  Otherwise, what happens when a clock
interrupt comes along!

>As an aside, the instruction doesn't even work.  It implements an add,
>compare and branch at the bottom of the loop.  The only language that
>defines a FOR loop that is always executed at least once is Fortran,
>and it requires the sign of the step to be known at compile time.

No, there is an optimization called loop rotation that allows test-at-
the-top loops (whiles and fors in C, e.g.) to be converted to test-at-
the-bottom.  This is useful because it eliminates one jump instruction.

Of course, the VAX's ACB instruction is a poor match for loop rotation
of C loops because you want to test the condition *without* doing the
add the first time, but this can be overcome (Don't misunderstand, I do
not think this is a good instruction; I'm just saying that it isn't
useless).
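
Here is the source-level picture, with a made-up C loop of my own (a real
compiler does this on its intermediate code, of course):

    /* Before rotation: test at the top; each iteration ends with an
       unconditional jump back up to the test. */
    void zero_before(int *a, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = 0;
    }

    /* After rotation: one guard test, then a test-at-the-bottom loop.
       Each iteration now costs a single conditional branch. */
    void zero_after(int *a, int n)
    {
        int i = 0;
        if (i < n)
            do {
                a[i] = 0;
                i++;
            } while (i < n);
    }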

>For all other languages, you have to hand code a compare at the top of the
>loop anyway, so it's pretty pointless to code it again at the bottom.

Yes, this is one way to overcome the fact that the ACB first adds then
tests.  This is still worth it because it saves one (untaken) jump on 
each loop iteration.  A compiler can do this (or at least I could have made
one of my C compilers do this without much trouble at all).

>In other words, there is no
>programming language construct to which this instruction corresponds.

Yeah, it is a stupid instruction.

bcase@cup.portal.com (Brian bcase Case) (11/21/88)

>Good idea; let's try it:
>
>        mov     lower,i
>        bra     entry
>body    foo
>        add     step,i
>entry   cmp     i,upper
>        bge     body
>
>Statically, same number of instructions; dynamically, less except in the
>zero-trip case.  But look, the add, compare, and branch are consecutive.

Ok, but notice the label in the middle of the add-compare-branch sequence?
I don't think it is possible to specify the ACB instruction in a version
that has a branch destination in the middle!  This is one of the points of
RISC:  Hey, *I'M* the compiler, let *me* decide on the semantics of
instruction sequences.

hankd@pur-ee.UUCP (Hank Dietz) (11/23/88)

In article <11562@cup.portal.com>, bcase@cup.portal.com (Brian bcase Case) writes:
> >        mov     lower,i
> >        bra     entry
> >body    foo
> >        add     step,i
> >entry   cmp     i,upper
> >        bge     body
>
> Ok, but notice the label in the middle of the add-compare-branch sequence?
> I don't think it is possible to specify the ACB instruction in a version
> that has a branch destination in the middle!  This is one of the points of
> RISC:  Hey, *I'M* the compiler, let *me* decide on the semantics of
> instruction sequences.

You can do the same thing by simply replicating the test code above the loop
rather than jumping into the loop.  It's a smidge harder to do, but it isn't
*that* hard, and it works even if the branching-backward loop becomes a
single instruction. The example would become:

        mov     lower,i
        cmp     i,upper		;replicated test
        blt     skipit		;opposite condition to bge
body    foo
        add     step,i
        cmp     i,upper
        bge     body
skipit

Which obviously doesn't mind having the code between body and skipit end up
being a single instruction.

					-hankd@ee.ecn.purdue.edu

PS:  I'm not saying I like ACB, but you're not *my* compiler (:-).

shankar@src.honeywell.COM (Son of Knuth) (11/23/88)

In article <11562@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>        mov     lower,i
>>        bra     entry
>>body    foo
>>        add     step,i
>>entry   cmp     i,upper
>>        bge     body
>>
>>Statically, same number of instructions; dynamically, less except in the
>>zero-trip case.  But look, the add, compare, and branch are consecutive.
>
>Ok, but notice the label in the middle of the add-compare-branch sequence?
>I don't think it is possible to specify the ACB instruction in a version
>that has a branch destination in the middle!  This is one of the points of
>RISC:  Hey, *I'M* the compiler, let *me* decide on the semantics of
>instruction sequences.

But this can be transformed into:
          bra    entry
body      foo
          acb    step,i,upper,body  (or whatever the ACB operands are)
entry     cmp    i,upper
          bge body

This doesn't justify the ACB instruction - but it does point out its
usefulness in the above example, assuming it doesn't take too many
microcode cycles.

urjlew@ecsvax.uncecs.edu (Rostyk Lewyckyj) (11/24/88)

In article <12369@srcsip.UUCP>, shankar@src.honeywell.COM (Son of Knuth) writes:
% In article <11562@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
% >>        mov     lower,i
% >>        bra     entry
% >>body    foo
% >>        add     step,i
% >>entry   cmp     i,upper
% >>        bge     body
% >>
% >>Statically, same number of instructions; dynamically, less except in the
% >>zero-trip case.  But look, the add, compare, and branch are consecutive.
% >
% >Ok, but notice the label in the middle of the add-compare-branch sequence?
% >I don't think it is possible to specify the ACB instruction in a version
% >that has a branch destination in the middle!  This is one of the points of
% >RISC:  Hey, *I'M* the compiler, let *me* decide on the semantics of
% >instruction sequences.
% 
% But this can be transformed into:
%           bra    entry
% body      foo
%           acb    step,i,upper,body  (or whatever the ACB operands are)
% entry     cmp    i,upper
%           bge body
% 
% This doesn't justify the ACB instruction - but it does point out its
% usefulness in the above example, assuming it doesn't take too many
% microcode cycles.


Is this ACB instruction being discussed a real instruction in
some machine?  And is it in any way different from the usual loop
closing instructions like the IBM 360 BXLE or BXH?
In the dynamic evaluations, what costs/weights are being assigned
to instruction fetch/issue for the two cases?  And if it's a close
comparison, what frequencies are expected for the zero trip case
versus the non-zero trip?
Also in these arguments and in real evaluations, does anyone ever
try to assign costs to instruction set complexity?  I.e., the cost of
instruction name space?  Register dependencies (I am thinking
of the funny even/odd rules in the IBM 360)?
-----------------------------------------------
  Reply-To:  Rostyslaw Jarema Lewyckyj
             urjlew@ecsvax.UUCP ,  urjlew@tucc.bitnet
       or    urjlew@tucc.tucc.edu    (ARPA,SURA,NSF etc. internet)
       tel.  (919)-962-9107