[comp.arch] 80960 "stack frame cache" - multiple register sets

mcg@omepd (Steven McGeady) (04/20/88)
In article <29454@linus.UUCP> sid@linus.arpa (Sid Stuart) writes:
>
>>   The highly-integrated 80960KB has a number of functions on-chip that
>>are characteristic of multiple-chip solutions.  On-chip functions include
>>32 32-bit registers, the FPU [with four additional 80-bit registers],
>>a 512-byte instruction cache, a stack frame cache, and a 32-bit multiplexed
>>burst bus.			^^^^^^^^^^^^^^^^^?
>
>	I have a copy of the 80960 Programmer's Reference Manual. I
>can find no reference in it to a "stack frame cache". Can someone point
>out where this is mentioned and what size this mythical cache is? Are the
>four sets of local registers supposed to be the stack frame cache? 

I probably should not have used that term, since it is an internal and
somewhat archaic name for the "multiple local register sets" described
in the 80960 manual.

The definition of the local registers (of which you get a new copy at the
entry to each procedure entered with a 'call' instruction) is an important
aspect of the 80960 architecture, and an important place where we find
architectural performance "headroom".  On its face, the 80960 has 32
addressable registers (actually, there is an encoding bit for an additional
32 "special function registers", but that bit is unimplemented in the
current architecture).  32 registers seems to be a common compromise
between enough registers and consuming too much encoding space with register
addressing.
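
To put rough numbers on the encoding-space tradeoff, here is a small
back-of-the-envelope sketch.  The helper name and the three-operand
instruction shape are assumptions made for illustration only; they are not
taken from the 80960 manual.

```python
def register_field_bits(num_registers, operands=3):
    """Bits one instruction spends naming registers.

    Hypothetical helper: assumes every operand can name any register and
    that the instruction has `operands` register fields.
    """
    bits_per_operand = (num_registers - 1).bit_length()
    return operands * bits_per_operand


if __name__ == "__main__":
    for n in (16, 32, 64, 128):
        print(n, "registers:", register_field_bits(n), "bits of register fields")
```

Under these assumptions, each doubling of the directly addressable register
count costs one more bit per operand field in every instruction.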

The 80960 effectively allows an arbitrarily large number of additional
registers to be implemented by the processor, without the need for more
encoding space, and without the need for extremely sophisticated inter-module
global optimization techniques in HLL compilers (although these don't hurt).
The current processors (KA, KB, MC) implement four sets of local registers
(16 per set), which together with the 16 global registers gives a total of
80 registers.  The architecture manages the
saving and restoring of registers into a save area in a process' stack space
as additional calls are made.  On the current implementations, programs
which oscillate between procedures with a difference in call depth less
than or equal to 4 will never need to spill any registers to memory.
In future (high-performance) implementations, this number is likely to be
increased.  In implementations where cost or die size (same thing) is
a crucial factor, the number of on-chip registers can be traded off against
performance.
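
The spill behavior described above can be sketched with a toy model.  This
is only an illustration under stated assumptions, not a description of the
actual hardware: it assumes the chip keeps the most recent `num_sets` frames
resident, spills one set when a call finds all sets in use, and reloads one
set when a return finds the caller's frame absent.  The function name and
the trace format are hypothetical.

```python
def simulate(trace, num_sets=4):
    """Count register-set spills and fills for a list of 'call'/'ret' events.

    Toy model (an assumption, not the real chip): execution starts with one
    frame resident; only the most recent `num_sets` frames stay on chip.
    """
    depth = 1        # live frames, including the initial procedure
    resident = 1     # how many of the topmost frames are held on chip
    spills = fills = 0
    for event in trace:
        if event == "call":
            depth += 1
            if resident == num_sets:
                spills += 1      # oldest resident set goes to the stack
            else:
                resident += 1
        elif event == "ret":
            if depth == 1:
                raise ValueError("return from the outermost frame")
            depth -= 1
            resident -= 1
            if resident == 0:
                fills += 1       # caller's set must be reloaded from memory
                resident = 1
        else:
            raise ValueError("unknown event: %r" % (event,))
    return spills, fills


if __name__ == "__main__":
    shallow = (["call"] * 3 + ["ret"] * 3) * 10   # touches four frames
    deeper = (["call"] * 4 + ["ret"] * 4) * 10    # touches five frames
    print("shallow:", simulate(shallow))
    print("deeper: ", simulate(deeper))
```

In this model a program that bounces among at most four frames never touches
memory, while one extra level costs a spill and a fill per round trip.  The
exact boundary on a real part depends on how call depth is counted and on
details this sketch does not try to capture.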

Further, while the current architecture spills all 16 local registers when
needed, future implementations will be able to mark used registers with
'dirty' bits, spilling only those registers actually used by the procedure.
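
The dirty-bit idea can be sketched as follows.  This is a minimal
illustration of the general technique, not a description of any planned
implementation; the class and method names are hypothetical, and the model
ignores any registers the hardware itself would write on a call.

```python
LOCALS_PER_SET = 16


class LocalSet:
    """One set of local registers with a per-register dirty bit (toy model)."""

    def __init__(self):
        self.regs = [0] * LOCALS_PER_SET
        self.dirty = 0                    # bit i set => register i was written

    def write(self, index, value):
        self.regs[index] = value
        self.dirty |= 1 << index

    def spill(self, save_area):
        """Copy only dirty registers to save_area; return words written."""
        written = 0
        for i in range(LOCALS_PER_SET):
            if self.dirty & (1 << i):
                save_area[i] = self.regs[i]
                written += 1
        self.dirty = 0                    # the saved copy is now up to date
        return written


if __name__ == "__main__":
    frame = LocalSet()
    frame.write(3, 7)
    frame.write(5, 9)
    area = [None] * LOCALS_PER_SET
    print("words written:", frame.spill(area), "instead of", LOCALS_PER_SET)
```

A procedure that uses only a couple of its locals would then cost a couple
of memory writes on a spill rather than a full 16.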

The multiple local register sets are a compromise between a Berkeley
(Patterson) style sliding (overlapping) register window architecture and a
Stanford (Hennessy) style flat register space.  Like the Berkeley approach,
the 80960 allows good performance from less-than-spectacular compiler
technology; like the Stanford approach, it does not require complex addressing
logic in speed-critical register access hardware controls.

Finally, if the whole notion grosses you out for some reason, the call
and return instructions can be forgotten and you can implement your very
own procedure calling mechanism using the 'branch-and-link' instruction,
and pretend that you have a normal, flat, 32-register machine.
Why anyone would do that I don't know, but it is in the spirit of the
80960 design to let users make their own decisions about such things.

S. McGeady
Intel Corp.