[comp.arch] registerless architecture

spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) (11/12/90)

Has anyone ever thought about or done a registerless architecture?
registers, after all, are just a sort of cache, another level in the
memory hierarchy.  but a fixed-size, hard-wired one.  Consider
a machine with a 4-level memory hierarchy:

0) the fpu and alu      0Kb
1) on-chip cache       10Kb
2) normal cache       100Kb
3) main ram        10 000Kb
4) magnetic disk  100 000Kb

It is very easy to expand the size/speed of caches, but not to add
registers.  I think this is a big advantage.  The way a cache works
generalizes the behavior of things like register windows.

One problem is that instructions would have to be very large (3 addresses).
Using a stack-based approach would help.  The 3 addresses are then
relative to the stack pointer, and can be small enough to fit into the
instruction.  That's 8 or 9 bits each for 32-bit machines, or twice that
for 64-bit machines.  Again, it scales easily.

Context switching is fast and easy; there's nothing to save but the CCR, PC, and FP.

any thoughts on this?  stupid idea, or the wave of the future?  :)


			Consume
Scott Draves		Be Silent
spot@cs.cmu.edu		Die

cgy@cs.brown.edu (Curtis Yarvin) (11/12/90)

In article <1990Nov12.145410.29035@cs.cmu.edu> spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
>
>Has anyone ever thought about or done a registerless architecture?
>registers, after all, are just a sort of cache, another level in the
>memory hierarchy.  but a fixed size, hard-wired one.

>One problem is that instructions would have to be very large (3 addresses).
>using a stack based approach would help.  The 3 addresses are then
>relative to the stack pointer, and can be small enough to fit into the
>instruction. That's 8 or 9 bits for 32 bit machines, or twice that
>for 64 bit machines.  again, it scales easily.

This is one of only two reasons to use registers.  The other is that
registers can still be made a bit faster: no associative lookup or
anything is necessary (an advantage that evaporates if you are one of
those direct-mapped cache people).  This capability isn't much used in
practice, though - generally both register and cache hits take one
clock cycle.

>context switch is fast and easy, there's nothing but CCR, PC, and FP.

Ah, but no... you have to flush your cache anyway, so you don't really
gain anything here.

>Scott Draves		Be Silent
>spot@cs.cmu.edu		Die

		-Curtis

"I tried living in the real world
 Instead of a shell
 But I was bored before I even began." - The Smiths

tom@ssd.csd.harris.com (Tom Horsley) (11/13/90)

>>>>> Regarding registerless architecture; spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) adds:

spot> Has anyone ever thought about or done a registerless architecture?
spot> registers, after all, are just a sort of cache, another level in the
spot> memory hierarchy.  but a fixed size, hard-wired one.  Consider
spot> a machine with a 4 level memory

Once a long long time ago in a universe far far away I worked on a compiler
for a new machine that was going to be registerless because, as the engineers
said, "cache is just as fast as registers anyway".

By the time we got to the point where they were ready to cancel the project
the engineers had taken to pleading with the compiler writers to come up
with some way to allocate variables in locations such that frequently used
variables would be in spots that didn't get cache collisions with other
frequently used variables...

There is a common technique for doing something like this in compilers. It
is called "register allocation". Unfortunately, it is orders of magnitude
more difficult to do when there are no registers...

spot> any thoughts on this?  stupid idea, or the wave of the future?  :)

Stupid idea (that's your phrase, not mine :-).
--
======================================================================
domain: tahorsley@csd.harris.com       USMail: Tom Horsley
  uucp: ...!uunet!hcx1!tahorsley               511 Kingbird Circle
                                               Delray Beach, FL  33444
+==== Censorship is the only form of Obscenity ======================+
|     (Wait, I forgot government tobacco subsidies...)               |
+====================================================================+

jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) (11/13/90)

From article <1990Nov12.145410.29035@cs.cmu.edu>,
by spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves):
> 
> Has anyone ever thought about or done a registerless architecture?

My Ultimate RISK (Computer Architecture News, 1988) is a memory-to-memory
architecture with no registers in the instruction execution unit other
than the PC.  It has no arithmetic unit in the IEU either, which is why
I call it an IEU instead of a CPU.  The registers and arithmetic unit(s)
are out on the memory bus.  It was proposed as a purely pedagogical
exercise, but it can be pipelined to death, and with appropriate ALU(s)
out on the bus, it can be quite powerful.  I gather a few people have
built or are building machines based on my design, but I haven't heard
much from them.
					Doug Jones
					jones@herky.cs.uiowa.edu

my@dtg.nsc.com (Michael Yip) (11/13/90)

Someone mentioned a registerless architecture that uses a large
on-chip cache instead of registers.  The reasoning was that registers
limit the machine architecture and instruction set, and that expanding
the cache is easier than adding more registers.

The transputers (e.g. T400, T800) are basically "registerless"
machines.  The transputer is essentially a "stack-based RISC machine"
which does not use any registers other than the 3 temporary
stack registers.  Instructions operate on the stack instead
of registers.  The transputers have on-chip RAM (not cache) for
storage, so the context of a process, including the
contents of the stack, can be stored in the on-chip RAM.  I think
that newer transputers also use caches, but I am not sure
anymore, since I only designed with the transputer a long time
ago when it first came out.

So does the Transputer architecture fit into the registerless
computer architecture?

By the way, I think that the AT&T CRISP (????) is also a
stack-based machine.  But I don't know any details about it.

About instructions and the number of registers ... doesn't
the register windowing technique also solve the problem, since
the instruction set does not really depend on the total number of
registers available on the chip (only on the number of registers
available at one time)?

Just my $0.02!  ;)

-- Mike
   my@dtg.nsc.com

mash@mips.COM (John Mashey) (11/13/90)

In article <1990Nov12.145410.29035@cs.cmu.edu> spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:

>Has anyone ever thought about or done a registerless architecture?
>registers, after all, are just a sort of cache, another level in the
>memory hierarchy.  but a fixed size, hard-wired one.  Consider
....
>It is very easy to expand the size/speed of caches, but not to add registers.
>I think this is a big advantage.  The way a cache works generalizes
>the behavior of things like register windows.
....
>using a stack based approach would help.  The 3 addresses are then
>relative to the stack pointer, and can be small enough to fit into the
>instruction. That's 8 or 9 bits for 32 bit machines, or twice that
>for 64 bit machines.  again, it scales easily.

Bell Labs' CRISP chips were this way. This architecture was a fairly
elegant evolution of the register windows path, i.e., it had a true
"stack cache", with on-chip registers laid fairly invisibly over the
top of the stack.  I.e., register numbers were really offsets from
the stack pointer; if they were within range, you got the register,
else you had to fetch the data.  Of interest to compiler writers was the
fact that if you generated an address via other routes, and the address was
in the stack cache, you got it also, eliminating the need to deal
with address aliasing  (i.e.,  x ...  y = &x;   func(y)).
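A rough software model of that behavior (entirely illustrative: the window size, hit/miss counters, and bookkeeping are assumptions sketched from the description above, not the actual CRISP design):

```python
# Rough model of a CRISP-style stack cache.  "Register" n is just the
# word at SP+n; any address that lands inside the cached window hits,
# even one that arrived via an aliased pointer.  Toy model only.
class StackCache:
    def __init__(self, mem, window_words=16):
        self.mem = mem                 # backing memory: {addr: value}
        self.window = window_words     # words of stack kept on chip
        self.sp = 0
        self.hits = 0                  # register-speed accesses
        self.misses = 0                # accesses that go to memory

    def read(self, addr):
        # Addresses in [SP, SP+window) hit the on-chip copy.
        if self.sp <= addr < self.sp + self.window:
            self.hits += 1
        else:
            self.misses += 1
        return self.mem.get(addr, 0)

    def read_reg(self, n):
        # A register number is really an offset from the stack pointer.
        return self.read(self.sp + n)
```

Note how `read_reg(5)` and a dereference of address SP+5 behave identically, which is the aliasing point made above.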

So, anyway, they've been built, and serious software work done with them,
although CRISPs never did get to the commercial market, which is a little
sad.  (I may disagree with some of the design choices, but it did have some
elegant ideas.)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

jeremy@cs.adelaide.edu.au (Jeremy Webber) (11/13/90)

In article <1990Nov12.145410.29035@cs.cmu.edu> spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:

   Has anyone ever thought about or done a registerless architecture?


Have a look at the INMOS Transputer.  It has 3 general purpose registers, which
aren't addressed directly, but via stack operations.  It also has a small
amount of on-chip 1-cycle RAM, mapped into the processor's address space.
Its negatives are no memory management support, and that the on-chip RAM
isn't a cache; it is hardwired into the low memory addresses.

Still, they have a lot of virtues, particularly if you're rolling your own
hardware. 

		-jeremy
--
Jeremy Webber			   ACSnet: jeremy@chook.ua.oz
Digital Arts Film and Television,  Internet: jeremy@chook.ua.oz.au
3 Milner St, Hindmarsh, SA 5007,   Voicenet: +61 8 346 4534
Australia			   Papernet: +61 8 346 4537 (FAX)

cik@l.cc.purdue.edu (Herman Rubin) (11/13/90)

In article <1990Nov12.145410.29035@cs.cmu.edu>, spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
 
> Has anyone ever thought about or done a registerless architecture?
> registers, after all, are just a sort of cache, another level in the
> memory hierarchy.  but a fixed size, hard-wired one.  Consider
> a machine with a 4 level memory
 
> 0) the fpu and alu      0Kb
> 1) on-chip cache       10Kb
> 2) normal cache       100Kb
> 3) main ram        10 000Kb
> 4) magnetic disk  100 000Kb
 
> It is very easy to expand the size/speed of caches, but not to add registers.
> I think this is a big advantage.  The way a cache works generalizes
> the behavior of things like register windows.
 
> One problem is that instructions would have to be very large (3 addresses).
> using a stack based approach would help.  The 3 addresses are then
> relative to the stack pointer, and can be small enough to fit into the
> instruction. That's 8 or 9 bits for 32 bit machines, or twice that
> for 64 bit machines.  again, it scales easily.
 
> context switch is fast and easy, there's nothing but CCR, PC, and FP.
 
> any thoughts on this?  stupid idea, or the wave of the future?  :)

Even with registers, it is sometimes necessary to change code, but it can
be made infrequent.  Without registers, ugh!

Only a 9-bit field relative to a pointer?  One of the stupid (in my opinion)
things about the 86-class machines is the 16 bit field relative to a pointer,
and more than one such field could be active.  

Indirect addressing and addressing relative to registers are extremely
important; to replace registers with cache intelligently would require
allowing arbitrary depth of indirection, which is not a bad idea.  But
there would be at least a cache access for each level.  Also, the idea of
allowing instructions of arbitrary address length seems to be out of
fashion.  It would allow indexing of registers, which should be allowed
anyhow.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

foo@titan.rice.edu (Mark Hall) (11/13/90)

)In article <1990Nov12.145410.29035@cs.cmu.edu> spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
)
)   Has anyone ever thought about or done a registerless architecture?
)
    Just for a sense of history: the TI 9900 (and 99000 I believe) were
  also registerless.  They never made it very big in the marketplace.

    (this is almost folklore to me, so correct me if I am wrong. It has
  been a long time since I looked at a chip spec. Any chip spec.)

  - mark

bean@putter.wpd.sgi.com (David (Bean) Anderson) (11/13/90)

In article <1990Nov12.145410.29035@cs.cmu.edu>, spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
|> 
|> Has anyone ever thought about or done a registerless architecture?
|> registers, after all, are just a sort of cache, another level in the
|> memory hierarchy.  but a fixed size, hard-wired one.  Consider
|> a machine with a 4 level memory
|> 
|> 0) the fpu and alu      0Kb
|> 1) on-chip cache       10Kb
|> 2) normal cache       100Kb
|> 3) main ram        10 000Kb
|> 4) magnetic disk  100 000Kb
|> 
|> It is very easy to expand the size/speed of caches, but not to add registers.
|> I think this is a big advantage.  The way a cache works generalizes
|> the behavior of things like register windows.
|> 
|> One problem is that instructions would have to be very large (3 addresses).
|> using a stack based approach would help.  The 3 addresses are then
|> relative to the stack pointer, and can be small enough to fit into the
|> instruction. That's 8 or 9 bits for 32 bit machines, or twice that
|> for 64 bit machines.  again, it scales easily.
|> 
|> context switch is fast and easy, there's nothing but CCR, PC, and FP.
|> 
|> any thoughts on this?  stupid idea, or the wave of the future?  :)


1.  Register files are typically multi-ported -- one can usually get
two reads and one write to the file in one clock (indeed, usually in
a small fraction of the clock) -- whereas a cache is typically
single-ported, and while it can deliver one data item per clock, it is
usually on the "next" clock.  Caches will always be slower than
registers because (if for no other reason) the path length and gate
count to the cache will be higher than to a register file.

2.  Why are registers considered a *problem*?  Modern compilers usually
do a good job of using registers effectively, as opposed to relying on
*stupid* cache hardware.  Indeed, some interesting work on "blocking
algorithms" (coaxing the cache into behaving like a large register file)
has resulted in some impressive performance figures.
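For the curious, a minimal sketch of one such blocking algorithm, a tiled matrix multiply (the tile size is an arbitrary illustrative choice):

```python
# A minimal "blocking" (tiled) matrix multiply of the kind alluded to
# above: working on bs x bs tiles keeps the active data resident in
# the cache, so the cache ends up behaving a bit like a big register
# file.  The tile size bs is an arbitrary illustrative choice.
def matmul_blocked(A, B, n, bs=2):
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):                 # tile row of C
        for kk in range(0, n, bs):             # tile of the inner dimension
            for jj in range(0, n, bs):         # tile column of C
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The arithmetic is identical to the naive triple loop; only the order of memory traffic changes.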

3.  The HP3000 is a stack machine with no GPRs.  The hardware (on
some models) would keep the top four stack items in a register file
in order to increase performance.  

4.  Register window architectures are an interesting compromise.
They use a large register file that the compiler can use as it 
sees fit, however, one can address registers either by name 
or relative to the window base. 


Who decides what data items should go in high speed memory is
the critical issue:  hardware implemented heuristics (cache)
or compiler handled directives (registers)?   There are places for both.

		Bean

ts@cup.portal.com (Tim W Smith) (11/13/90)

< any thoughts on this?  stupid idea, or the wave of the future?  :)

Why do you assume it can't be both a "stupid idea" and "the wave of the
future"? :-)

						Tim Smith

nather@ut-emx.uucp (Ed Nather) (11/14/90)

In article <3168@ns-mx.uiowa.edu>, jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) writes:
> From article <1990Nov12.145410.29035@cs.cmu.edu>,
> by spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves):
> > 
> > Has anyone ever thought about or done a registerless architecture?
> 
> My Ultimate RISK (Computer Architecture News, 1988) is a memory-to-memory
> architecture with no registers in the instruction execution unit other

Many years ago there was this microprocessor, see, that was 16 bits (!!)
when all the others were only 8 bits, and it was going to be a real
world-beater and wipe out Intel, Motorola, etc.  The thing HAD NO
REGISTERS either, went memory-to-memory because that's where everything
ends up anyway, so what good are registers?  It was made by that powerhouse
of computing called Texas Instruments who, as you know, wiped out all the
competition and changed its name to IBM and ...

Actually, I've forgotten (or suppressed) the chip number, but it was a
real dog, much too slow compared with its competition, and died the
Death of Dumb Chips long, long ago.

Aren't there any CS courses that teach the History of Computer Architectures?

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin

ig@caliban.uucp (Iain Bason) (11/14/90)

curtis>In article <56084@brunix.UUCP> cgy@cs.brown.edu (Curtis Yarvin) writes:
scott>In article <1990Nov12.145410.29035@cs.cmu.edu> spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
tom>In article <TOM.90Nov12122800@hcx2.ssd.csd.harris.com> tom@ssd.csd.harris.com (Tom Horsley) writes:
herman>In article <2731@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
mark>In article <1990Nov13.011231.4899@rice.edu> foo@titan.rice.edu (Mark Hall) writes:

scott>Has anyone ever thought about or done a registerless architecture?
scott>registers, after all, are just a sort of cache, another level in the
scott>memory hierarchy.  but a fixed size, hard-wired one.
curtis>
curtis>...registers can still be made a bit faster; no association or anything necessary
curtis>(this goes unless you are one of those direct-mapped cache people).  This
curtis>capability isn't much used in practice, though - generally both register
curtis>and cache hits take one clock cycle.
curtis>

I agree here.  One other point is that if you're doing a stack cache,
and all your instructions use indexing off the stack pointer, you have
to add the index to the stack pointer.  That is going to take > 0 time.

scott>context switch is fast and easy, there's nothing but CCR, PC, and FP.
curtis>
curtis>Ah, but no... you have to flush your cache anyway, you don't really
curtis>gain anything here.

This is not entirely true.  It doesn't take much hardware to add a (small)
process-id tag to cache lines.  Then cache flushing can take place in the
background, while the CPU does useful work.  In some cases (e.g., a simple
interrupt handler) only a few cache lines will be flushed before the CPU
returns to the interrupted process.
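A toy model of that tagging idea (all names and structures here are illustrative assumptions, not any particular design):

```python
# Toy model of process-id-tagged cache lines: a context switch just
# changes the current pid, and lines belonging to other processes are
# simply refilled on their next use instead of being flushed eagerly.
class TaggedCache:
    def __init__(self):
        self.lines = {}        # addr -> (owner_pid, value)
        self.pid = 0           # currently running process

    def switch(self, new_pid):
        self.pid = new_pid     # O(1): no flush loop on context switch

    def read(self, addr, mem):
        line = self.lines.get(addr)
        if line is not None and line[0] == self.pid:
            return line[1]            # hit on our own line
        value = mem.get(addr, 0)      # miss or stale line: refill
        self.lines[addr] = (self.pid, value)
        return value
```

A short interrupt handler touches only a few lines, so only those few get refilled when the interrupted process resumes.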

tom>Once a long long time ago in a universe far far away I worked on a compiler
tom>for a new machine that was going to be registerless because, as the engineers
tom>said, "cache is just as fast as registers anyway".
tom>
tom>By the time we got to the point where they were ready to cancel the project
tom>the engineers had taken to pleading with the compiler writers to come up
tom>with some way to allocate variables in locations such that frequently used
tom>variables would be in spots that didn't get cache collisions with other
tom>frequently used variables...
tom>
tom>There is a common technique for doing something like this in compilers. It
tom>is called "register allocation". Unfortunately, it is orders of magnitude
tom>more difficult to do when there are no registers...

Most (maybe all? anyone know?) C compilers will allocate local variables on
the stack.  Hardware can certainly be designed to cache a stack (all you
have to do is avoid collisions from contiguous memory; I would think this
would be the normal way to do a cache).  The compiler could create new
locals and just pretend they are registers (although I'm sure there would
be smarter ways to optimize for the architecture).

I expect many languages other than C can also be made to allocate local
variables on the stack.  Lisp might be tough, and Smalltalk, but then
they usually are on any architecture.

A compiler for a machine like this would obviously be different.  For
instance, I imagine "register" coloring would be difficult to do when
the number of "registers" is variable.  You have to take into account
the fact that other routines may have data in the cache, and only 
allocate space if you think it will save this routine more time than
it will cost other routines.

tom>spot> any thoughts on this?  stupid idea, or the wave of the future?  :)
tom>
tom>Stupid idea (that's your phrase, not mine :-).

This is far from clear.

herman>Only a 9-bit field relative to a pointer?  One of the stupid (in my opinion)
herman>things about the 86-class machines is the 16 bit field relative to a pointer,
herman>and more than one such field could be active.  

I don't think Scott is proposing to limit *all* indexes to 9 bits.  Look
at it this way: most CPUs limit you to 5-bit indexes into their register
files, but they let you use larger indexes into memory.

herman>Indirect addressing and addressing relative to registers is extremely 
herman>important; to replace registers with cache intelligently would require
herman>allowing arbitrary depth of indirection, which is not a bad idea.

Gaaak!  Please banish the thought from your mind.  I believe one company
(Data General?) had a hell of a time trying to do virtual memory
with such a "feature".  Apparently it was almost never used, anyway.

mark>    Just for a sense of history: the TI 9900 (and 99000 I believe) were
mark>  also registerless.  They never made it very big in the marketplace.
mark>
mark>    (this is almost folklore to me, so correct me if I am wrong. It has
mark>  been a long time since I looked at a chip spec. Any chip spec.)

I believe you are correct, although I'd never even heard of the 99000.
"They never made it very big" is being charitable.  Speaking of which,
does anyone remember the Fairchild F8?




-- 

			Iain Bason
			..uunet!caliban!ig

baum@Apple.COM (Allen J. Baum) (11/14/90)

I can't let all this stuff go by without my three cents...
>In article <39637@ut-emx.uucp> nather@ut-emx.uucp (Ed Nather) writes:
>In article <3168@ns-mx.uiowa.edu>, jones@pyrite.cs.uiowa.edu writes:

>> > Has anyone ever thought about or done a registerless architecture?

As many posters have pointed out, most stack architectures can be considered
registerless, in the sense that they can be built without physical registers,
and, even where registers are implemented, software cannot address them directly.

>Many years ago there was this microprocessor, see,...The thing HAD NO
>REGISTERS either, went memory-to-memory because that's where everything
>ends up anyway, so what good are registers?  It was made by... Texas
>Instruments..... it was a >real dog, much too slow .....

Actually, there was at least one implementation of the TI9900 that was fast...
because they put registers in, but I don't remember if it block-loaded
them when the register pointer was switched (like the Intel 960 does)
or if they were a register cache.

The CRISP is in some sense the ultimate expression of the registerless
machine.  It is a stack architecture, but at the same time it is a
2 1/2 address machine.  It can be built with no physical registers, or
can have many.  The physical registers are used as a cache.  If an
interrupt occurs, the SP is changed, and accesses off the stack
pointer suddenly miss.  It is not required to mess with the
stack cache at that point, although doing so makes mucho sense from a
performance point of view.  Note that getting to interrupt code is very
fast - there isn't much that must be saved, though more can be saved if
you want, for performance reasons.  The thing that makes CRISP a bit
different is that the 'cache' is not automatically loaded on a miss;
special function call instructions do that.

Note that there have been register architectures with no physical
registers, notably the early DEC PDP-10s (and maybe PDP-6s?), where the
registers overlaid the first 16 memory locations, and there was an
option that installed real registers.  I think I remember reading that
no PDP-10 was sold without the option.  Hmmmm.

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

philip@pescadero.Stanford.EDU (Philip Machanick) (11/14/90)

In article <1990Nov13.035859.4777@relay.wpd.sgi.com>, bean@putter.wpd.sgi.com (David (Bean) Anderson) writes:
|> In article <1990Nov12.145410.29035@cs.cmu.edu>, spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
|> |> 
|> |> Has anyone ever thought about or done a registerless architecture?
[detail deleted]
|> |> One problem is that instructions would have to be very large (3 addresses).
|> |> using a stack based approach would help.  The 3 addresses are then
|> |> relative to the stack pointer, and can be small enough to fit into the
|> |> instruction. That's 8 or 9 bits for 32 bit machines, or twice that
|> |> for 64 bit machines.  again, it scales easily. [stuff deleted]
|> |> any thoughts on this?  stupid idea, or the wave of the future?  :)
[more deleted]
|> 2.  Why are registers considered a *problem*?   Modern compilers usually
|> do a good job of effectively using the registers as opposed to *stupid*
|> cache hardware.  Indeed, some interesting work in "blocking algorithms"
|> (faking the cache into behaving like a large register file) have resulted
|> in some impress performance figures.
|> 
|> 3.  The HP3000 is a stack machine with no GPRs.  The hardware (on
|> some models) would keep the top four stack items in a register file
|> in order to increase performance.  
[more deleted]

In fact, I believe the Burroughs B5500 series introduced this
registers-at-top-of-stack scheme.  It was a very pure stack-based
architecture, with most instructions operating relative to the top of
stack.  Because the instructions had no addresses, they were packed 4
to a 48-bit word.  Very efficient?  See Hennessy and Patterson,
"Computer Architecture: A Quantitative Approach", Morgan Kaufmann, 1990,
for why RISC performs better.  In other words, this is not a stupid idea,
just one that's been tried and hasn't delivered - a wave of the past, if
you will.
-- 
Philip Machanick
philip@pescadero.stanford.edu

jmaynard@.hsch.utexas.edu (Jay Maynard) (11/14/90)

In article <39637@ut-emx.uucp> nather@ut-emx.uucp (Ed Nather) writes:
>Many years ago there was this microprocessor, see, that was 16 bits (!!)
>when all the others were only 8 bits, and it was going to be a real
>world-beater and wipe out Intel, Motorola, etc.  The thing HAD NO
>REGISTERS either, went memory-to-memory because that's where everything
>ends up anyway, so what good are registers?  It was made by that powerhouse
>of computing called Texas Instruments who, as you know, wiped out all the
>competition and changed its name to IBM and ...

>Actually, I've forgotten (or suppressed) the chip number, but it was a
>real dog, much too slow compared with its competition, and died the
>Death of Dumb Chips long, long ago.

The chip you're thinking of is the TMS9900. I have one in a small card cage
of a development system, with some flaky bubble memory and a cassette port,
and one of the screwiest dialects of BASIC I've ever met. It's a real
{curi,monstr}osity.

The 9900's big failing, though, was its 64K addressing limit; that was a
severe competitive disadvantage compared to the newer 16-bit chips being
introduced. I think there were later versions that didn't have that problem,
but my memory for such details is getting fuzzy.

Oh, yes...there was one common application for the 9900: the TI 99/4[a]
home computer.
-- 
Jay Maynard, EMT-P, K5ZC, PP-ASEL | Never ascribe to malice that which can
jmaynard@thesis1.hsch.utexas.edu  | adequately be explained by stupidity.
         "With design like this, who needs bugs?" - Boyd Roberts

faiman@m.cs.uiuc.edu (11/14/90)

For Ed Nather at UT-Austin ....

It was the TI 9900 (RIP).  See, for example, "16-bit Microprocessor
Architecture," by Terry Dollhoff, Reston, 1979, which contains a
10-chapter case study that uses this device.  Quote (without comment)
from the dust jacket: "Complete analysis of the 9900 microprocessor
with stand-alone programs, performance ratings of six competing 16-bit
machines, and more!"

Mike (used to teach a microprocessor course) Faiman, Urbana

torbenm@gere.diku.dk (Torben Ægidius Mogensen) (11/14/90)

All this discussion about registers versus no registers (but a cache)
has centered on the cost of implementing local variables: either
keeping them in registers and loading/saving them across calls, or
keeping them in memory and making the cache take care of loading/saving
to main memory.

There are several issues involved in this:

1) What are the relative costs of accessing registers and cache
memory?

2) What are the relative costs of loading/saving a (large) set of
registers in a burst, versus letting the cache do so at its own pace?

3) How do you effectively map a fixed number of registers to a
variable number of local variables?  (Register allocation.)


As for 1), it is generally accepted that registers are faster
than cache, but only slightly so.  The main problem is that, with
present cache architectures, you can access only one cache element per
cycle, whereas it is possible to access several registers.  This can be
solved by implementing a multiported cache.  Also, it might be possible
to access on-chip cache as fast as registers (to within a few percent)
if the architecture is designed for this.


As for 2), there have been arguments pointing both ways.  Sequential
access to memory is faster than random access, which speaks in
favor of loading/saving in bursts.  It must be noted that many modern
cache architectures do the same: if there is a cache miss, a block of
cells is saved/loaded in one go.

The main difference is that with registers, loading/saving is done in
a consistent (compile-time) fashion, whereas with a cache it is done by
need (run-time).  Compile-time register saving will often lead to
unnecessary memory traffic, as registers might be saved only to be
loaded again immediately afterwards, because the procedure returns
immediately (typically the leaf calls in a recursive algorithm).  The
same may happen, to a lesser degree, if the cache uses burst-mode
access to main memory.

On the other hand, local variables need not be saved when leaving a
procedure, and the register-saving strategy will know that.  The local
variables will still be kept in the cache, so they will be saved
unnecessarily when their cache locations are re-used.  Note that this
does not happen all the time: if a return is followed by another call,
the same memory locations are re-used, so no cache miss occurs.

It is possible to make a stack cache that knows that values above
the stack top are garbage, and sets a do-not-save bit for them.  This
begins to look very much like variable-sized register windows, the
difference being that you can address arbitrarily far down the stack
in a transparent fashion.
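A sketch of that do-not-save idea (a toy model assuming an upward-growing stack where SP sits just past the top; nothing here is real hardware):

```python
# Sketch of the do-not-save idea: dirty stack-cache lines that fall
# outside the live stack after a return are marked dead, so eviction
# skips the write-back.  Toy model with an upward-growing stack.
class WriteBackStackCache:
    def __init__(self):
        self.lines = {}        # addr -> [value, dirty]
        self.writebacks = 0

    def write(self, addr, value):
        self.lines[addr] = [value, True]

    def pop_to(self, new_sp):
        # On return the stack shrinks; anything at or above the new
        # stack top is garbage, so clear its dirty ("save me") bit.
        for addr, line in self.lines.items():
            if addr >= new_sp:
                line[1] = False

    def evict(self, addr, mem):
        value, dirty = self.lines.pop(addr)
        if dirty:              # only live data is written back
            mem[addr] = value
            self.writebacks += 1
```

Garbage above the stack top then dies quietly in the cache instead of generating memory traffic.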


As for 3), register-based architectures require register allocation to
map local variables to a possibly smaller number of registers.  While
good algorithms exist, they are all compile-time based and must thus
take worst-case behaviour into account.  This will invariably lead to
saving registers when doing so is (in a particular run-time situation)
not necessary.  While the problem is in most cases small, it can in
some cases have a noticeable effect.  This is especially true in
languages like LISP or Prolog, where basic blocks and procedures are
small.


The above discussion has centered on loading and saving of local
variables, but there are other points to consider.  In languages that
use a lot of heap access (LISP, Prolog, ...), a multiported cache is a
huge benefit, whereas a large set of registers is almost no help at all.
Even if the cache uses burst-mode access, it can benefit a heap.  In
fact, the article "Cache Performance of Combinator Graph Reduction" by
Koopman, Lee and Siewiorek in this year's IEEE International Conference
on Computer Languages argues that a large cache block size is
beneficial for such languages.


All in all, IMHO a well-designed registerless architecture with a
multiported cache can perform just as well as a register architecture
on C-like languages and far better on languages like LISP and Prolog.


	Torben Mogensen (torbenm@diku.dk)

moss@cs.umass.edu (Eliot Moss) (11/14/90)

In article <1990Nov14.064225.14406@caliban.uucp> ig@caliban.uucp (Iain Bason) writes:

   I expect many languages other than C can also be made to allocate local
   variables on the stack.  Lisp might be tough, and Smalltalk, but then
   they usually are on any architecture.

LISP is actually pretty easy, I believe; Smalltalk is harder because its
"stack frames" can be treated as objects (you can send messages to them!),
they can be retained, etc. I suppose that LISPs supporting continuations also
present problems, but that the stack can be used most of the time (see the
paper in the most recent SIGPLAN conference on the subject).

   A compiler for a machine like this would obviously be different.  For
   instance, I imagine "register" coloring would be difficult to do when
   the number of "registers" is variable.  You have to take into account
   the fact that other routines may have data in the cache, and only 
   allocate space if you think it will save this routine more time than
   it will cost other routines.

On the contrary, register allocation via graph coloring would work reasonably
well. Ordinary register allocation is trying to compensate for the difference
in speed between registers and memory. If you eliminate registers and rely
solely on some kind of cache, then register allocation is easy, and equivalent
to having as many registers as you like. However, you *should* try to use as
few memory locations as possible, and graph coloring is nicely suited to doing
that. The nodes of the graph are the original collection of "variables"
(including various computed expressions, etc.); when the graph is k-colorable,
then these variables fit into k locations. The graph coloring algorithms used
in compilers attempt to minimize the number of colors (i.e., locations) used.
Hope this is reasonably clear ....
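A minimal sketch of the coloring Moss describes (greedy coloring in Python for illustration; real allocators use Chaitin-style simplify/select orderings, and the variable names are invented):

```python
# Greedy coloring of an interference graph: each "color" is a memory
# location in the registerless scheme.
def color_variables(interference):
    """Map each variable to the lowest location number not already
    used by an interfering neighbor.  `interference` maps a variable
    to the set of variables live at the same time."""
    location = {}
    for var in sorted(interference):          # fixed order, for the sketch
        taken = {location[n] for n in interference[var] if n in location}
        c = 0
        while c in taken:
            c += 1
        location[var] = c
    return location

# Four "variables": a interferes with b and c, so it needs its own slot,
# while b, c and d can share.
g = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
locs = color_variables(g)
```

Here the graph is 2-colorable, so four variables fit into two locations, which is exactly the "use as few memory locations as possible" goal.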
--

		J. Eliot B. Moss, Assistant Professor
		Department of Computer and Information Science
		Lederle Graduate Research Center
		University of Massachusetts
		Amherst, MA  01003
		(413) 545-4206; Moss@cs.umass.edu

peter@ficc.ferranti.com (Peter da Silva) (11/15/90)

In article <56084@brunix.UUCP> cgy@cs.brown.edu (Curtis Yarvin) writes:
> >context switch is fast and easy, there's nothing but CCR, PC, and FP.

> Ah, but no... you have to flush your cache anyway, you don't really
> gain anything here.

Unless you're one of those direct-mapped cache people... :->
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com 

berg@cip-s04.informatik.rwth-aachen.de (AKA Solitair) (11/16/90)

Scott Draves writes:
>   Has anyone every thought about or done a registerless architecture?

Ever had a look at the FORTH engine (I think the part number was 4016)?
See my other post under the subject: optimal processor.
--
Sincerely,                 berg%cip-s01.informatik.rwth-aachen.de@unido.bitnet
           Stephen R. van den Berg.
"I code it in 5 min, optimize it in 90 min, because it's so well optimized:
it runs in only 5 min.  Actually, most of the time I optimize programs."

urjlew@uncecs.edu (Rostyk Lewyckyj) (11/17/90)

The original purpose of registers, or B-boxes as they were called
on the machine at Manchester where they were invented, was
address modification for addressing arrays, rather than doing
permanent code modification on the fly. Their use as temporary
volatile fast storage came later. So don't forget the original
register uses.

bimandre@saturnus.cs.kuleuven.ac.be (Andre Marien) (11/20/90)

article <1990Nov14.113748.3677@diku.dk> says:

> This will invariably lead to
> saving registers when this is (in a particular run-time situation) not
> necessary. While the problem in most cases is small, it can in some
> cases have a noticeable effect. This is especially true in languages
> like LISP or Prolog, where basic blocks and procedures are small.

> In languages that
> use a lot of heap access (LISP, Prolog,...), a multiported cache is a
> huge benefit, whereas a large set of registers almost no help at all.

> All in all, IMHO a well-designed registerless architecture with a
> multiported cache can perform just as well as a register architecture
> on C-like languages and far better on languages like LISP and Prolog.

While it is true that architectures seem to forget languages like Prolog,
the above quotes are not quite right.
Let me first say that the neglect of Prolog can be justified by its small
commercial interest compared to other languages. I hope this will change,
of course (see signature).

In Prolog, basic blocks are small, but then Prolog is compiled very differently
from C. There is nothing but recursion in Prolog, which is not translated
to 'procedure calls' by any decent system I know of. Some attempts have been
made to map Prolog stacks to C/Pascal procedure stacks, none of them really
successful.

Register allocation for Prolog is also different from C/Pascal/... .
There is a more complex abstract machine, with a lot of often used registers,
let's say 8. The calling conventions can use another 4 registers just for
fast argument passing. The kernel algorithm in Prolog is unification, which
adds another 2 registers. So some 16 registers can be used to great benefit.

Anyone who both did a 386 and a SPARC/MIPS port will know the difference
the number of available registers makes.

Prolog does have burst heap/stack accesses:
creating choicepoints : some 10 words
creating environments : some 6 words
creating structured data : some 6 words
but I don't see a reason here for multiported access.

Decent support for tag manipulation would be far more useful.
This was one of the big disappointments on the SPARC.
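The kind of tag manipulation in question can be sketched as low-order tag bits, a standard dynamic-language word layout (Python purely for illustration; the tag assignments are invented):

```python
# Low-order tag bits in a machine word (tag values invented for the sketch).
TAG_BITS = 2
TAG_MASK = (1 << TAG_BITS) - 1
TAG_INT, TAG_ATOM, TAG_REF, TAG_STRUCT = range(4)

def make_tagged(value, tag):
    return (value << TAG_BITS) | tag   # shift the payload, OR in the tag

def tag_of(word):
    return word & TAG_MASK             # one AND on most machines

def value_of(word):
    return word >> TAG_BITS            # one shift
```

With decent architectural support these shift-and-mask steps can be folded into the arithmetic itself, which is exactly what a Prolog or LISP implementation wants on every heap reference.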

Andre' Marien
bimandre@cs.kuleuven.ac.be
(ProLog by BIM, ex BIM_Prolog)
If opinions are found, they are not guaranteed to belong to anyone.

hankd@dynamo.ecn.purdue.edu (Hank Dietz) (11/21/90)

As to all the comments about needing only cache, I've said it before
and I'll say it again....  Registers help because:

[1]	They are fast
[2]	Register refs don't interfere with memory data path
[3]	You never miss (i.e., have static timing for schedules)
[4]	Register names are shorter than addresses

A conventional cache gets you only benefit [1]; however, ambiguously
aliased references (array elements and pointer targets) are
effectively managed by a cache whereas they require frequent flushing
from registers.  If you want all the benefits, you need both....

Well, almost.  Actually, all you need is a mutant thing like CRegs.
See the paper:  H.  Dietz and C-H Chi, "CRegs: A New Kind of Memory
for Referencing Arrays and Pointers," from Supercomputing '88.  If you
don't like that, look at the paper in Supercomputing '90 by B. Heggy
and M. Soffa -- it describes a somewhat more complex register value
forwarding mechanism which works like fully-associative CRegs.

						-hankd

mshute@cs.man.ac.uk (Malcolm Shute) (11/26/90)

In article <3168@ns-mx.uiowa.edu> jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) writes:
>My Ultimate RISK (Computer Architecture News, 1988) is a memory-to-memory
>architecture with no registers in the instruction execution unit other
>than the PC.  It has no arithmetic unit in the IEU either, which is why
>I call it an IEU instead of a CPU.  The registers and arithmetic unit(s)
>are out on the memory bus.  It was proposed as a purely pedagogical
>exercise, [...]

Mine went the other way (Microelectronics Journal Vol 15, No 3&5)...
It had an ACC, but no PC.

Instead there was an instruction in location zero of memory which,
when executed, had its address field incremented, written back,
and used as the address of the next instruction to be fetched.

You might have gathered that it wasn't tuned for high-speed use!
Instead, the aim was to see if I could design a 16-bit processor
using only 600 transistors.  There were only 4 instructions in the
instruction set, and getting it to do the equivalent of a PDP11
MOV memory,memory operation took about 7 instructions, in much
the same contorted sort of way as Single Instruction Computers.

It was a fun exercise.  Probably not much use though.
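The self-incrementing instruction at location zero can be simulated directly (a toy model; the encoding below is invented for illustration, not Shute's actual design: memory[0] holds just an address, and other words stand in for instructions):

```python
# Toy simulation of a PC-less machine: the "program counter" is the
# address field of the word stored at location 0.
def run(memory, steps):
    trace = []
    for _ in range(steps):
        addr = memory[0]              # fetch the address field at location 0
        trace.append(memory[addr])    # "execute" the instruction found there
        memory[0] = addr + 1          # increment the field, write it back
    return trace
```

For example, `run({0: 10, 10: "LOAD", 11: "ADD", 12: "STORE"}, 3)` walks locations 10, 11, 12 in order; a jump is then just a store into location 0.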
--

Malcolm SHUTE.         (The AM Mollusc:   v_@_ )        Disclaimer: all

daveh@cbmvax.commodore.com (Dave Haynie) (01/08/91)

In article <1990Nov21.004355.212@noose.ecn.purdue.edu> hankd@dynamo.ecn.purdue.edu (Hank Dietz) writes:
>As to all the comments about needing only cache, I've said it before
>and I'll say it again....  Registers help because:

>[1]	They are fast
>[2]	Register refs don't interfere with memory data path
>[3]	You never miss (i.e., have static timing for schedules)
>[4]	Register names are shorter than addresses

>A conventional cache gets you only benefit [1]; however, ambiguously
>aliased references (array elements and pointer targets) are
>effectively managed by a cache whereas they require frequent flushing
>from registers.  If you want all the benefits, you need both....

Well, it seems to me that if you built a registerless machine right, you
could pick up a few more points.  A good cache is fast these days.  So
let's have three: one for data, one for instructions, and one to replace actual
registers.  So we get [1].

As for [2], registers do interfere with the memory path -- when they are swapped
to main memory during a context switch.  So if we have a good-sized register
cache, in many cases we avoid interference not only at context switches, but
within a task as well.  Like a Harvard machine, only with three internal
data paths rather than two.  I guess you have to decide how the register cache
actually works during program execution -- one could treat each virtual
register as on a normal fixed-register machine, but it would probably make as
much sense to make it act like a register window machine.  In today's silicon,
you could have a 4-8K register cache with multi-way set associativity.

Number [3] is something of an issue -- with a task swap on a conventional
machine, you "miss" only on task boundaries.  Here, you miss the first time
you access a register, but never again, at least until your task is swapped out
and back in, in which case you may miss, but even that's not guaranteed.

Number [4] is solved by making all working register references relative to a
real register, which points to the base of register space.  The time to add
in the offset from the base pointer can be hidden in the CPU pipeline if 
there's a dedicated adder for this purpose.
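The fix for [4] amounts to this addressing rule (a sketch with invented constants; Python purely for illustration):

```python
# Base-relative register naming: a short register name in the instruction
# is an offset from a base pointer that the OS (or the application) can
# move to resize or relocate the task's register space.
REG_NAME_BITS = 5                      # 5-bit names: 32 "registers" visible

def reg_address(reg_base, reg_name):
    assert 0 <= reg_name < (1 << REG_NAME_BITS)
    return reg_base + reg_name         # the add a dedicated adder would hide
```

Moving the base pointer renames every register at once, which is what makes the fast context switch possible.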

Still, with all that said, I'm not sure this puppy buys you much over the
conventional approach, and it does make the design of the CPU more complex.
It would definitely cut down on the context swap time, and does have the
interesting property of making the number of logical registers used in a
task definable by the OS, or even the application if you split things into
user and supervisor/kernel space.

>						-hankd


-- 
Dave Haynie Commodore-Amiga (Amiga 3000) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: hazy     BIX: hazy
	"Don't worry, 'bout a thing. 'Cause every little thing, 
	 gonna be alright"		-Bob Marley

baum@Apple.COM (Allen J. Baum) (01/09/91)

[]
>In article <17212@cbmvax.commodore.com> daveh@cbmvax.commodore.com (Dave Haynie) writes:
 --arguments for a 'register cache'
>e.g. one could treat each virtual
>register as on a normal fixed register machine, but it would probably make as
>much sense to make it act like a register window machine.

>make all working register references relative to a
>real register, which points to the base of register space.  The time to add
>in the offset from the base pointer can be hidden in the CPU pipeline if 
>there's a dedicated adder for this purpose.

You've just described the AT&T CRISP pretty much. Look it up...

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum