mcg@omepd (Steven McGeady) (04/20/88)
In article <29454@linus.UUCP> sid@linus.arpa (Sid Stuart) writes:
>
>>	The highly-integrated 80960KB has a number of functions on-chip that
>>are characteristic of multiple-chip solutions. On-chip functions include
>>32 32-bit registers, the FPU [with four additional 80-bit registers],
>>a 512-byte instruction cache, a stack frame cache, and a 32-bit multiplexed
>>burst bus.                      ^^^^^^^^^^^^^^^^^?
>
>	I have a copy of the 80960 Programmer's Reference Manual. I
>can find no reference in it to a "stack frame cache". Can someone point
>out where this is mentioned and what size this mythical cache is? Are the
>four sets of local registers supposed to be the stack frame cache?

I probably should not have used that term, since it is an internal and
somewhat archaic name for the "multiple local register sets" described
in the 80960 manual. The definition of the local registers (of which
you get a new copy at the entry to each procedure entered with a 'call'
instruction) is an important aspect of the 80960 architecture, and an
important place where we find architectural performance "headroom".

On its face, the 80960 has 32 addressable registers (actually, there is
an encoding bit for an additional 32 "special function registers", but
that bit is unimplemented in the current architecture). 32 registers
seems to be a common compromise between having enough registers and
consuming too much encoding space with register addressing.

The 80960 effectively allows an arbitrarily large number of additional
registers to be implemented by the processor, without the need for more
encoding space, and without the need for extremely sophisticated
inter-module global optimization techniques in HLL compilers (although
these don't hurt). The current processors (KA, KB, MC) implement four
sets of local registers (16 per set) which, together with the 16 global
registers, gives a total of 80 registers. The architecture manages the
saving and restoring of registers into a save area in a process's stack
space as additional calls are made.
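
The save-and-restore behaviour can be sketched with a toy model. This
is an illustration only, not a description of the actual hardware: the
function names, the 'c'/'r' trace encoding, and the spill-the-oldest-set
policy are all assumptions made for the example. It counts how often a
register set would have to be written to or read back from the stack
for a given pattern of calls and returns.

```python
GLOBAL_REGS = 16      # registers shared by all procedures
REGS_PER_SET = 16     # local registers handed to each called procedure


def total_registers(num_sets=4):
    """On-chip registers for a part with num_sets local register sets."""
    return GLOBAL_REGS + num_sets * REGS_PER_SET


def simulate(trace, num_sets=4):
    """Return (spills, fills) for a trace of 'c' (call) and 'r' (return).

    Assumed policy (hypothetical): when a call finds every on-chip set
    in use, the oldest resident set is spilled to the stack; when a
    return lands in a frame whose set was spilled, it is filled back.
    """
    depth = 0     # index of the frame currently executing
    oldest = 0    # shallowest frame whose locals are still on-chip
    spills = fills = 0
    for event in trace:
        if event == "c":
            depth += 1
            if depth - oldest + 1 > num_sets:   # no free set left
                oldest += 1
                spills += 1
        elif event == "r":
            if depth == 0:
                raise ValueError("return without a matching call")
            depth -= 1
            if depth < oldest:                  # caller's set was spilled
                oldest = depth
                fills += 1
        else:
            raise ValueError("unknown event: %r" % event)
    return spills, fills


if __name__ == "__main__":
    print(total_registers(4))                  # 16 global + 4 * 16 local
    print(simulate("ccc" + "rc" * 100))        # stays within four frames
    print(simulate("c" * 6 + "r" * 6))         # runs three frames too deep
```

In this model a program that stays within four resident frames reports
(0, 0), however long it runs, while the six-deep chain reports (3, 3):
three sets pushed out on the way down and three brought back on the way
up.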

On the current implementations, programs which oscillate between
procedures with a difference in call depth less than or equal to 4 will
never need to spill any registers to memory. In future
(high-performance) implementations, this number is likely to be
increased. In implementations where cost or die size (same thing) is a
crucial factor, the number of on-chip registers can be traded off
against performance. Further, while the current architecture spills all
16 local registers when needed, future implementations will be able to
mark used registers with 'dirty' bits, spilling only those registers
actually used by the procedure.

The multiple local register sets are a compromise between a Berkeley
(Patterson) style sliding (overlapping) register window architecture
and a Stanford (Hennessy) style flat register space. Like the Berkeley
approach, the 80960 allows good performance from less-than-spectacular
compiler technology; like the Stanford approach, it does not require
complex addressing logic in speed-critical register access hardware
controls.

Finally, if the whole notion grosses you out for some reason, the call
and return instructions can be forgotten and you can implement your
very own procedure calling mechanism using the 'branch-and-link'
instruction, and pretend that you have a normal, flat, 32-register
machine. Why anyone would do that I don't know, but it is in the spirit
of the 80960 design to let users make their own decisions about such
things.

S. McGeady
Intel Corp.