nathan@orstcs.UUCP (03/01/84)
We've seen a lot lately about RISC architectures: a bunch of people down at Berkeley did a chip, and Pyramid Computers has a box out that they claim far outperforms a Vax. I would like to make the claim that a RISC with a medium-to-large instruction cache could be called a "virtual-microcode" machine. Since on a RISC all the complicated operations are implemented with a somewhat brain-damaged subroutine call, the subroutines that implement what would be instructions on another architecture tend to be in the cache all the time. This includes the Enter and Exit operations, loop operations, and the like; esoterica such as long divides would sit out in main memory until needed.

Does anyone have any objection to this analogy?

	Nathan Myers
jvz@sdcsvax.UUCP (03/04/84)
I don't have any problems with your analogy, but beware that a machine like that was done by Burroughs Corp. -- the B1700/B1800 series. Those machines were microcoded, with an instruction cache so that only small amounts of expensive fast memory were required.

	John Van Zandt
nather@utastro.UUCP (Ed Nather) (03/09/84)
> I have advocated "cache-advisor" instructions for RISC's--instructions
> that don't *do* anything, from the programmer's point of view, but which
> inform the cache-manager of highly-probable-to-certain upcoming memory
> events.  This is another trade-off--a loss of code-density with a
> possible net gain in performance.  That need not complicate the
> architecture.  It could, in fact, simplify it--if cache-block-prefetching
> were more effective, then instructions could be looser and "fatter"
> (more easily decoded and executed) with less worry over the fetching
> costs.  Possible analogies: impurity doping of semiconductors to
> increase electron mobility, or adding minute amounts of certain minerals
> in steel-making to improve its tensile strength.  It's worth some
> experimentation, I think.
> --- Michael Turner (ucbvax!ucbesvax.turner)

This sounds like a splendid idea to me. If the programmer inserts a single instruction that serves as "advice" to the cache-manager, then the overhead involved is just 1 fetch-time (plus decoding time, but that's true for any instruction). It would act like a "pseudo-op" in an assembler program, helping manage things but generating no output. If such advice saves only 1 fetch it breaks even; everything after that is gravy.

I found that the "sieve" benchmark could be sped up by a factor of 2 on our Vax by advising the "C" compiler about what to put into registers. The "C" compiler is the only one we have humble enough to accept advice.

I find the thought of a much closer relationship between compiler and programmer, via direct "advice" where it can really matter, a marvelous one. Advising the computer directly could be even better!

-- 
Ed Nather
ihnp4!{ut-sally,kpno}!utastro!nather
Astronomy Dept., U. of Texas, Austin
guy@rlgvax.UUCP (Guy Harris) (03/10/84)
This is a spinoff from a comment about speeding up C programs by advising the C compiler what to put into registers. The idea is good, but there's one problem with its current implementation: there's no way to make those hints truly "machine-independent".

The problem is that you want to use all the registers you can (actually, there's even a tradeoff here; you get extra performance from instructions referencing registers, but you do have to save and restore those registers, and if you have complicated expressions you may run out of scratch registers and be forced to use temporaries in memory), but "all the registers you can" is machine-dependent. On the PDP-11, "all the registers you can" in the Ritchie and Johnson compilers is 3; on the VAX-11, it's 6; and on the M68000, it's (at least on our compiler) 6 registers not containing pointers and four registers containing pointers.

Just declaring everything you can to be "register" wins only if 1) there are enough registers to satisfy your requests or 2) the variables that will cause the greatest improvement when put in registers are assigned to registers first, and the less important ones cause the compiler to ignore the "register" declaration. You can sort of rig it to work right, on compilers that assign "register" variables to registers in the order they encounter them in the source text, by putting the "best" candidates first; however, 1) this means you may not be able to use the "register parameter" feature, 2) it makes your code a little less readable, and 3) most importantly, it requires you to know what the compiler does - but what happens if the compiler doesn't work that way?

Writing portable code to use registers efficiently is tricky at times.

	Guy Harris
	{seismo,ihnp4,allegra}!rlgvax!guy
rpw3@fortune.UUCP (03/11/84)
And while all you guys are busy stuffing things in registers, let us not forget that process-switch time gets worse the more "short term" state there is to save/restore. Especially when using an operating system model of the Concurrent Euclid sort, or any other system in which the "interrupt handlers" are full processes in their own right, saving and restoring ("swapping") the enormous amount of state implied by all those registers can result in a net reduction of throughput. A properly designed non-write-through cache does not suffer from that problem, since many pieces of many interrupt processes can come to live in the cache.

Small aside: I really wish that the Motorola 68000 had only 8 general registers rather than 8 address + 8 data. For any instruction set of the VAX/68k/16k style, 16 registers is too many if you do a lot of process switching.

If the "registers" are implemented as main memory locations which are addressed with short addresses relative to some process base, and the identification of "register" addresses is fast enough so that cached "registers" compete with hardware register addressing, then the number of registers should be set solely by the number of bits we wish to use for them in the instructions. I still feel that even in this case the number of specially named "registers" (short addresses) will be quite small (32 or less).

An interesting example to look at for register naming is the old RCA-1602 (?) 4-bit microcontroller chip. It had a few registers (4?), but the neat thing was that the actual scratchpad cell that was used for a given register was run-time settable. To do a "process switch" you said "Use loc. 234 for the N register, 107 for the T reg, and (last) loc. 025 for the PC".

Now, following that through to bigger machines, we should look at decoupling the NAMING of "registers" (which is restricted by the desire to keep the instruction stream efficient) from the LOCATION of registers (which should be a large address space). The context-switch problem then becomes changing the name-location map quickly, as on the TI 9900. Additionally, with smart enough compilers (but don't ask me to program in assembly for THIS machine!) you can change the assignment of registers to locations as needed throughout a given module or even subroutine, to balance the "register working set". ("Quick Martha, call the nut house! He's gone and re-invented register page-maps!")

If the register page-map was actually of the translation-lookaside-cache style (with the same number of entries as registers), and was onboard the CPU chip, it could probably compete in performance with fixed hardwired register sets.

Just a little something to jog the discussion along...

Rob Warnock
UUCP:	{sri-unix,amd70,hpda,harpo,ihnp4,allegra}!fortune!rpw3
DDD:	(415)595-8444
USPS:	Fortune Systems Corp, 101 Twin Dolphin Drive, Redwood City, CA 94065
hal@cornell.UUCP (Hal Perkins) (03/12/84)
The IBM 801, which also has a simple register-register instruction set like the RISC, has explicit instructions to control the cache. For example, if the program is about to store the registers into a block of memory, these instructions can be used to prevent the cache from fetching the old contents of the storage line when the first byte of the line is referenced.

Now, a question. The only thing I've seen on the 801 is the paper in the 1982 conference on hardware support for high-level languages (printed in a special issue of SIGPLAN Notices/SIGARCH News). This was later revised and printed in the IBM Journal of R&D. Has anyone seen any other information about this? I'd particularly like to see a "Principles of Operation" or architecture manual for the 801, but I'm afraid that information is probably still classified IBM Top Secret.

Hal Perkins			UUCP: {decvax|vax135|...}!cornell!hal
Cornell Computer Science	ARPA: hal@cornell  BITNET: hal@crnlcs
aaw@pyuxss.UUCP (Aaron Werman) (03/12/84)
To add my wish list - first, a note: the TMS9900 series was one of the slowest microprocessors ever publicly sold, because although context switching and process switching did not take much time, the lack of registers slowed it to a crawl.

How about an architecture with mixed memory - that is, have subroutines consisting of a small chunk of very fast memory trailed by the rest of the subroutine contained in normal memory. The obvious place for this "uncache" would be inside the memory controller. This scheme would allow fast context switching as well as real-time program execution.

The problem with nonregister architectures, though, is most visible in 0-address (pure stack) machines, which inherently clog memory busses rather than perform useful work. Some algorithms are enhanced by registers; others don't need 'em.

	{harpo,houxm,ihnp4}!pyuxss!aaw
	Aaron Werman
mark@umcp-cs.UUCP (03/13/84)
The Ridge computer has cache control instructions, so they say, as well as instructions to turn on and off updating the virtual memory translation look-aside buffer. Its branches also have advice to the cache about which way they expect to go. -- Spoken: Mark Weiser ARPA: mark@maryland CSNet: mark@umcp-cs UUCP: {seismo,allegra,brl-bmd}!umcp-cs!mark
tgg@hou5e.UUCP (03/14/84)
>The idea is good, but there's one problem with its current implementation;
>there's no way to make those hints truly "machine-independent". The problem
>is that you want to use all the registers you can (actually, there's even
>a tradeoff here; you get extra performance from instructions referencing
>registers, but you do have to save and restore those registers, and if you
>have complicated expressions you may run out of scratch registers and be
>forced to use temporaries in memory), but "all the registers you can" is
>machine-dependent. On the PDP-11, "all the registers you can" in the Ritchie
>and Johnson compilers is 3; on the VAX-11, it's 6, and on the M68000, it's
>(at least on our compiler) 6 registers not containing pointers and four
>registers containing pointers. Just declaring everything you can to be
>"register" wins only if 1) there are enough registers to satisfy your requests
>or 2) the variables that will cause the greatest improvement when put in
>registers are assigned to registers first, and the less important ones cause
>the compiler to ignore the "register" declaration.
>
>	Guy Harris
>	{seismo,ihnp4,allegra}!rlgvax!guy

No no no - you're taking the meaning of 'register' too literally! It does not mean that the compiler is forced to use a register for these variables; it just means that you're giving it a hint that these variables are more likely to be referenced than other variables. If that means that, for the machine at hand, the variables should be three's-complemented and stored in reverse bit order starting at memory location 3fd3.8 (octal), because that is the absolute best place to store an integer for speedy calculations, then the compiler should do that for 'register' variables. Let the compiler worry about what to do if it runs out of registers or other resources. The point is that 'register' exists to differentiate some variables from others so that a compiler can make some reasonable tradeoffs if possible.
Making every variable in a program 'register' isn't any good because they're *all* important. If you want overall optimization, make a better optimizer. Personally, I think that 'register' should have been changed to 'fast' or something like it long ago...

	Tom Gulvin	ABI - Holmdel
smk@axiom.UUCP (Steven M. Kramer) (03/15/84)
You have to take into account: 1) the number of register assignments allowed on your machine, and 2) the cost in terms of saving/restoring registers, as Guy Harris said. I usually order my register declarations so that the most-used variable comes first, so that if my program were ported to a PDP-11 it would still run sort of fast. I know that's sort of implementation-dependent, but all optimization is that way.

On the 68000 port we are using, there is a difference between address registers and data registers that ALSO must be taken into consideration. On the VAX and PDP-11 that's not so, but using 'register' on any machine should keep in mind all of the differences. The best code would #ifdef the differences (especially if another processor handled things differently), but few of us think that way (I don't; I can't anticipate all uses). So each new port has to look at register declarations. Few even use them to begin with, so ANY use is a plus.

-- 
	--steve kramer
	{allegra,genrad,ihnp4,utzoo,philabs,uw-beaver}!linus!axiom!smk	(UUCP)
	linus!axiom!smk@mitre-bedford	(MIL)
turner@ucbesvax.UUCP (03/20/84)
> /***** ucbesvax:net.arch / orstcs!nathan / 10:42 pm Mar 3, 1984 *****/
> I would like to make the claim that a RISC with a medium-to-large
> instruction cache could be called a "virtual-microcode" machine.
> ... the subroutines that implement what would be instructions on another
> architecture tend to be in the cache all the time.  This includes the
> Enter and Exit operations, loop operations, and the like; esoterica such
> as long divides would sit out in main memory until needed.  [Nathan Myers]

I think that's what they had in mind (I haven't kept up with the ever-shifting rationale for RISC). I still question whether the gains reaped from the various simplifications and optimizations possible in RISC architectures aren't lost in some ways.

RISC has no BCD or I/O-formatting instructions, because "compiler-writers don't use them." Fine. But programs nevertheless spend much time translating back and forth between ASCII/EBCDIC and binary. Instructions with a static frequency approaching zero might still have a dynamic frequency (and a cost for emulating them in software) that is quite high. So if "printf" is in the cache, rather than having much of it hardwired in microcode, this leaves less space for other code that competes for the cache. Also, since RAM has a lower bit-density than a ROM/PLA implementation, there is that much less space to compete for. But then again--when you are *not* using BCD/io-format heavily, on a RISC you don't have to pay for them. It's a trade-off.

I have advocated "cache-advisor" instructions for RISC's--instructions that don't *do* anything, from the programmer's point of view, but which inform the cache-manager of highly-probable-to-certain upcoming memory events. This is another trade-off--a loss of code-density with a possible net gain in performance. That need not complicate the architecture. It could, in fact, simplify it--if cache-block-prefetching were more effective, then instructions could be looser and "fatter" (more easily decoded and executed) with less worry over the fetching costs. Possible analogies: impurity doping of semiconductors to increase electron mobility, or adding minute amounts of certain minerals in steel-making to improve its tensile strength. It's worth some experimentation, I think.

--- Michael Turner (ucbvax!ucbesvax.turner)