[net.arch] Addressing modes, really "RISCness"

mash@mips.UUCP (John Mashey) (02/28/86)

Clif Purkiser, Intel, writes (on top of a long sequence of news, whose
beginning I lost track of):
> > >> As far as I'm concerned, the test for RISCness should be: given any piece
> > >> of source code, is there only one reasonable code sequence which can be
> > >> output by the compiler?
> > >
> > >How is that a test for "RISCness"?  Following the above scheme, you could,
> > >for example, have a machine that directly (and optimally) interpreted some
> > >concise representation of the source code... but that's exactly what people
> > >are calling a CISC machine, and are claiming is bad...

1) I'd agree that isn't a test for RISCness, since even some obvious RISC CPUs
have multiple reasonable code sequences, even in the simplest cases.  For
example, suppose you want to place a small constant in a register, given an
architecture with 2 source operands, a destination reg, and a register that
holds a zero.  Then: 
	add	reg,zero,const
	xor	reg,zero,const
	or	reg,zero,const
	sub	reg,zero,-const
all do the same thing, and it's pretty unreasonable to have reduced the
instruction set to eliminate those.

2) When you start using multiple levels of optimization, and allowing
different time-space tradeoffs, you can get wildly-varying code.
This has nothing to do with the architecture.

3) I hope a "test for RISCness" is never seen as a goal in itself.
What counts are things like performance, price, ease of interfacing,
ease of compiler-writing, performance headroom for the future, etc, etc.
For performance, a reasonable measure of architectural efficiency is
clock cycles / "unit of work", since the length of the clock cycle is
usually the fundamental technology limit at any given time.

4) However, if there is such a thing as a "test for RISCness" as an
architectural metric, try clock cycles / instruction.  Note the difference
between this metric and the previous one. A "true" RISC would have
1 cycle/instruction, but it might have crippled its capabilities so
much that it executes many more instructions per "equivalent work".

5) As an example of the above, consider a VAX 11/780, which averages about
10 of its 200ns cycles per instruction (i.e., 5 MHz / 10 = roughly .5 million
native instructions/second), but is considered to be a 1-Mips CPU.
This is confusing.  The 780 is really a .5-native-VAX-Mips machine, so that
1 "Mips" of rated work is only about .5 million VAX instructions; thus the
VAX takes 10 cycles/instr, but only 5 cycles/"unit of work", where 1 Mips =
1 million "units of work"/second.
A MIPS chip at 8MHz is about 5X a VAX, using about 1.6 cycles/instruction.
In this case, cycles/instr & cycles/"standard Mips" are the same, 1.6.
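
For anyone who wants that arithmetic spelled out, here is a minimal sketch
in C (the clock rates, cycles/instruction figures, and the 5X ratio are just
the round numbers quoted above, not measurements of any particular machine):

	#include <stdio.h>

	int main(void)
	{
	    /* VAX 11/780: 200ns cycle = 5 MHz, roughly 10 cycles/instruction */
	    double vax_mhz = 5.0;
	    double vax_cpi = 10.0;
	    double vax_native_mips = vax_mhz / vax_cpi;        /* ~0.5 */

	    /* by convention the 780 does 1 "Mips" of work per second,
	       so one unit of work costs vax_mhz / 1.0 cycles */
	    double vax_cycles_per_unit = vax_mhz / 1.0;        /* 5 */

	    /* an 8 MHz RISC rated at about 5X a 780 */
	    double risc_cycles_per_unit = 8.0 / 5.0;           /* 1.6 */

	    printf("780:  %.1f native Mips, %.0f cycles/instr, %.0f cycles/unit\n",
	           vax_native_mips, vax_cpi, vax_cycles_per_unit);
	    printf("RISC: %.1f cycles/unit of work\n", risc_cycles_per_unit);
	    return 0;
	}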

6) When you add features to a chip, you never lessen the cycle time.
When the new features increase the cycle time faster than they lessen
the number of cycles, it's time to quit adding things, as is noted
by Clif's next comment:

> 	The problem with RISC when taken to its logical conclusion
> is that you get rid of a lot of useful instructions and very useful addressing 
> modes.
> 	I am sure no good RISC designer would include such obscure instructions
> as Ascii Adjust for Addition, Subtraction, AAA,AAS etc. or XLAT 
> (Table Look-up Translation).  Yet they all have their uses.  The Ascii Adjust
> instructions are used by COBOL (yes, people still use it) compilers and 
> spreadsheet designers because COBOL uses BCD and these instructions speed 
> this process up.  Likewise, XLAT can be used for converting ASCII to EBCDIC
> in 5 clocks.  

7) Actually, you might be greatly surprised: these things can be useful,
but the only way that can be proved is by measuring large numbers of real
COBOL programs, then looking at different ways to spend the silicon.
	a) XLAT (or equivalent on other machines, like S/370 TR) is more
	dependent on memory system speed than anything else. Assuming
	reasonable cache behavior, my favorite RISC takes 6 cycles/byte
	to do the S/370 TR instruction.  (A sketch of the kind of
	translate loop involved follows this list.)
	b) You'd be surprised how little time COBOL actually spends doing
	BCD arithmetic, compared to everything else.  In particular,
	numbers from both VAXen and S/370s [which have these things]
	never show these as very frequently-used things.  This is not to
	say that they may not be reasonable things to do, given an
	existing micro-coded architecture.  However, appropriately-designed
	RISCs can actually do fairly well on this stuff. [This was a shock
	to us; we hadn't designed for it!]
	c) Real COBOL programs spend tons of time in I/O libraries,
	ISAM routines, the kernel, etc.
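
To make point (a) concrete, here is a rough sketch, in C, of the kind of
translate loop a RISC compiler would generate in place of XLAT/TR (the
translate() routine and its arguments are hypothetical, not our library
code); the 6 cycles/byte figure assumes the table and data stay in the cache:

	/* Translate n bytes in place through a 256-entry table,
	 * e.g. an ASCII-to-EBCDIC map.  Per byte this is a handful of
	 * simple instructions: load the byte, index the table, load the
	 * translated byte, store it back, bump the pointer, branch. */
	void
	translate(unsigned char *buf, unsigned long n, unsigned char *table)
	{
	    while (n-- > 0) {
	        *buf = table[*buf];
	        buf++;
	    }
	}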

> 	One disagreement I have with the RISC proponents is the theory that
> everyone writes in a HLL.  It seems that despite years of trying to force
> everyone to write in HLL languages there will always be a few assembly 
> language programmers.  Because no matter how much performance Semiconductor and 
> Computer companies give programmers they always want their programs to run 
> faster.  So these CISC instructions while not useful to compiler writers are 
> useful to assembly language jocks. 

Given an architecture, if adding an instruction doesn't slow the machine
down, then you may want to have it whether compilers can get to it or not.
However, that's not the point.  The point is: if you have a choice, do you
want a collection of complex instructions that do whatever they do,
or would you rather write your own code (assembler, in some cases), but have
it all run at microcode speed?
> 
> 	I also think that many of the addressing modes on CISC machines
> result in higher performance than the simpler RISC addressing modes.
> For example take
> long x, a[10], i; 
> x = a[i];
> 
> 	This probably compiles into these RISC instructions assuming x, a, and i
> are in registers R1, R2, and R3 respectively.
> ShiftL R4, R3, #2
> Add    R4, R4, R2
> Load   R1, [R4]
> This is a single 4 clock instruction on the 80386 vs 3 clocks for the RISC
> chip.  However, the RISC chip has had to fetch 3 instructions vs one for 
> the CISC processor.  So unless the RISC chip has a large on-chip cache it
> will be slower.    

The I-cache doesn't have to be on chip.  In any case, where this counts is
inside loops.  In that case, you probably have code where a decent optimizer
creates an induction variable holding the address of a[i], then increments
that induction variable by 4 each time around the loop, eliminating both the
shift and the add, so that the equivalent RISC code (just to fetch a[i])
takes just 1 cycle (+ cache miss effects, if any).  Of course, what you
really must compare is the entire loop code, with enough data to believe that
the sample chosen happens often enough in real code to make it worthwhile.
NOTE: good optimizers quite frequently eliminate most of the need for
"index mode with indexing by operand type".
> 
> 	I guess my main point is that a RISC designer in the 
> search to simplify the chips complexity may easily throw out obscure
> instructions which are useful to certain applications.  Or they may eliminate
> generally useful addressing modes.
> 
> 	I think that the good thing about the RISC philosophy is that it
> will reduce the tendency of designers to add new instructions or addressing
> modes just because they look whizzy or Brand X has them.  If a complex way of
> doing something is slower than a simple way don't put it in.

RIGHT ON!! That's the way to design, all right.  However, one has GOT to
have reasonable data to support the inclusion/exclusion of an instruction or
mode. Human intuition on this stuff is notoriously bad.
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086