marc@oahu.cs.ucla.edu (Marc Tremblay) (10/28/88)
>In article <469@oracle.UUCP> csimmons@oracle.UUCP (Charles Simmons) writes:
>
>If I remember the arguments from MIPS correctly (want to help me out
>John?), there's a stronger objection to multiple-window-register-files.
>I think it's something to the effect that register-windows cause the
>load/store access time to be slower.

Having a multiple-window register file, or more precisely, having many registers, slows down the processor cycle. Even with an independent port for the load/store, the operation is still based on the basic processor cycle. With a longer cycle the load/store accesses become slower. There are two reasons: 1) for a large register file, let's say 128 registers, the decoding of the register addresses is longer (more bits to decode, even if you use partial decoding there is still a penalty), 2) the data bus is longer because it has to go over so many registers. A longer data bus implies larger capacitance and longer discharge time, thus longer processor cycle. Usually the access to the register file, either on a register read/write or on a load/store, is part of the critical path.

You can play some tricks to get around those drawbacks, for example the Am29000 uses overlapping to avoid the penalty caused by the decoding. Even though the hardware is quite expensive (3 large decoders, 3 small adders, and some multiplexers), it is a gain. The Intel 80960 uses a cache for local register sets. I haven't seen the layout :-), but it seems like the sets are separated in a way that the data bus is not lengthened.

Finally you can organize the layout in such a way that the current register set is always at the same place. Every time there is a change of window, you need to shift out the current window to a backup window, and shift in the new window into the current window; this whole operation can be done in ONE cycle for register files of a reasonable size (we've done it for a register file of 128 registers).
This method makes the length of the data bus independent of the number of windows.

So the question is: Is it clever to invest in a large register file with windows or is it better to use the silicon for other circuitry? The answer depends on how good your compiler people are!

Marc Tremblay
marc@CS.UCLA.EDU
...!(ihnp4,ucbvax)!ucla-cs!marc
Computer Science Department, UCLA
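A rough sketch of the two costs named in the post above (more address bits to decode, and a data bus that crosses more register rows), together with the fixed-position layout for the current window. The function names, the 8-window by 16-register split, and the use of "rows crossed" as a stand-in for bus capacitance are all illustrative assumptions, not figures or circuits from the post.

```python
def address_bits(num_registers: int) -> int:
    """Address bits a decoder must resolve to pick one of num_registers."""
    return max(1, (num_registers - 1).bit_length())


def bus_rows_spanned(num_windows: int, regs_per_window: int,
                     fixed_current_window: bool) -> int:
    """Register rows the data bus crosses; a crude proxy for bus length."""
    if fixed_current_window:
        # Assumed layout: the current window always sits in the same slot,
        # so the bus crosses one window however many backup windows exist.
        return regs_per_window
    # Flat layout: the bus has to run past every register in the file.
    return num_windows * regs_per_window


if __name__ == "__main__":
    for n in (32, 128):
        print(n, "registers ->", address_bits(n), "address bits")
    # Hypothetical split of a 128-register file into 8 windows of 16.
    print("flat layout rows:", bus_rows_spanned(8, 16, False))
    print("fixed-slot rows:", bus_rows_spanned(8, 16, True))
```

Going from 32 to 128 registers adds two address bits in this model, and the fixed-slot layout keeps the row count constant as windows are added. How much either effect costs in real cycle time is exactly what the rest of the thread debates.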
mash@mips.COM (John Mashey) (10/28/88)
In article <17268@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>Having a multiple-window register file, or more precisely, having many
>registers, slows down the processor cycle. Even with an independent port for
>the load/store, the operation is still based on the basic processor cycle.
>With a longer cycle the load/store accesses become slower.
>There are two reasons: 1) for a large register file, let's say 128 registers,
>the decoding of the register addresses is longer (more bits to decode,
>even if you use partial decoding there is still a penalty),
>2) the data bus is longer because it has to go over so many registers.
>A longer data bus implies larger capacitance and longer discharge time,
>thus longer processor cycle....
>So the question is: Is it clever to invest in a large register file with
>windows or is it better to use the silicon for other circuitry? ....

Could you quantify the hit from these issues, or point at some references that show data for this? In any of my competitive analyses, I've never made an issue of it, and generally haven't thought it was an issue....if somebody else proves it is an issue, I won't argue too much! :-)

--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
tim@crackle.amd.com (Tim Olson) (10/28/88)
In article <17268@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
| Having a multiple-window register file, or more precisely, having many
| registers, slows down the processor cycle. Even with an independent port for
| the load/store, the operation is still based on the basic processor cycle.
| With a longer cycle the load/store accesses become slower.

While larger register files do slow down register access by a very small amount, I don't think that the register file access is really the critical path on most machines (it certainly isn't the main critical path on the Am29000). As has been stated here before, cache accesses (for instructions, data, or both), or MMU translations are probably the limiting factor.

The empirical evidence also tends to support this: both Am29000 and SPARC (with large register files) have 30MHz versions available, while MIPS and 88000 (32-register files) are at 20 - 20MHz. Note that this certainly does not mean that they couldn't be at 30MHz; I'm just pointing out that the register file size has little to do with the overall cycle time in real processors.

-- Tim Olson
Advanced Micro Devices
(tim@crackle.amd.com)
bcase@cup.portal.com (Brian bcase Case) (10/29/88)
>Having a multiple-window register file, or more precisely, having many
>registers, slows down the processor cycle. With a longer cycle the
>load/store accesses become slower. There are two reasons: 1) for a large
>register file, let's say 128 registers, the decoding of the register
>addresses is longer (more bits to decode, even if you use partial decoding
>there is still a penalty), 2) the data bus is longer because it has to go
>over so many registers. A longer data bus implies larger capacitance and
>longer discharge time, thus longer processor cycle.

This is true in theory. However, there are two effects that prevent a (within reason) large reg. file from slowing down the basic cycle:

Circuit design can solve some speed problems, it's just a matter of spending the power budget. It is possible to make a register file that reads and writes *at the same time*. But it is larger and more wasteful. Yes, decoding time and capacitive effects can be a problem. But even for the 29K's reg file, the access time is something like 10ns, and that's good old 1.25 micron technology.

However, no one designs a reg. file that reads and writes at the same time because it isn't necessary; the register file isn't the critical path. Things like the cache tag access-> compare-> set_select and ALU_subtract->condition_code (or its equivalent) are the harder things. Again, clever circuit design applies, but what usually happens is that a compromise is reached so that almost everything "is the critical path." If you think about it, you see that anything else would be a poor, unbalanced design, and the circuit designers would get fired!

>You can play some tricks to get around those drawbacks,
>for example the Am29000 uses overlapping to avoid the penalty caused by
>the decoding.

The 29k does no more overlapping than the next guy, to my knowledge. BUT, I am not one of the circuit designers, so I might be wrong.

>The Intel 80960 uses a cache for local register sets.
>I haven't seen the layout :-), but it seems like the sets are separated
>in a way that the data bus is not lengthened.

Again, the lengthening of the data bus is a concern, but need not be a problem. In a reasonable architecture, a pipeline register sits right after the register file and right after the ALU. The lengthening of the data bus is important only for those internal bus operations that have to traverse the entire length of the processor bit cells. Very few operations have to do this (maybe things involving indirect jumps or something; the PC section might be at the end opposite the reg. file in the bit cell), and, again, you just have to have mega drivers for them. I don't mean to say this is *free*, but it is possible, and so a large register file does not necessarily slow the processor cycle.

>So the question is: Is it clever to invest in a large register file with
>windows or is it better to use the silicon for other circuitry?
>The answer depends on how good your compiler people are!

Well, the windows question is a hard one, so I won't even try that one. But as for *lots* of registers: Registers have three times the bandwidth and much less latency than the next level in the memory hierarchy. Maybe I can't think of what to do with them today (actually, I can think of several things), but putting a critical resource in a register instead of a cache makes a BIG difference on almost any machine, certainly on simple ones. The 29k UNIX implementations keep operating system goodies in protected global registers while user code is running. So do some CISC processors, but in them, you have no choice about what gets put there; it's defined by the architecture. Lots of registers give you a powerful resource that you can use any way you want!
When 29k compilers get the sort of capability found in the DEC Titan compilers (wishful thinking?), the ability to do universal register allocation, global registers will be useful for keeping a program's global scalars in registers. This will probably necessitate implementing the "missing" 64 global registers because the greedy OS guys have already hoarded most of the 64 existing globals. :-)
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/29/88)
In article <7195@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>In article <17268@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>
>>Having a multiple-window register file, or more precisely, having many
>>registers, slows down the processor cycle. Even with an independent port for
>Could you quantify the hit from these issues, or point at some references
>that show data for this?

I don't know enough of the details to know why this is so, but I have been told that "all other things being equal" (i.e. a typical processor design with typical technology) that it is difficult to provide more than about 32 registers with single cycle read/write access. The only machine that I know well that is germane is the Cyber 205, which has 256 G.P. registers and takes two cycles to read/write the register file, and which has additional logic to hide that fact in most circumstances. So, it doesn't prove the case, but it doesn't contradict it either.

Since 32 is usually acknowledged to be in the diminishing returns area, I'm not sure how many people have tried to provide more registers. I have no knowledge of how the registers are organized in one of the new register window machines, and whether this is a problem on those machines. Anyway, the original poster's hypothesis, that "many" registers creates an access speed problem, seems reasonable at this point.

--
Hugh LaMaster, m/s 233-9, UUCP ames!lamaster
NASA Ames Research Center ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035 Phone: (415)694-6117
bcase@cup.portal.com (Brian bcase Case) (10/30/88)
>I don't know enough of the details to know why this is so, but I have been
>told that "all other things being equal" (i.e. a typical processor design
>with typical technology) that it is difficult to provide more than about
>32 registers with single cycle read/write access.

Wow, I don't know who told you this, but it is not true for VLSI implementations. Creating a register file from discrete memory chips is *much* harder (I did it once in ECL; try it with two write ports!). In VLSI, the register file is typically implemented in a write-before-read fashion; so in some sense, the register file is doing two accesses per cycle. I.e., register file read must take less than 1/2 cycle (pipeline register set-up time, some clock margin, etc.).

WRT the discussion about variable-sized windows, the 29K allows (doesn't force) activation records (stack frames) to be allocated in the local register file (this is what windowing means). These activation records are variable-sized windows, with resolution of one register, up to the size of the local register file (128 registers). The "stack-pointer-plus-offset" calculation is overlapped with the register file write operation (write-before-read). The register file can also be used as a flat space of 128 registers.
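For readers who want to see the two ideas in the post above side by side, here is a minimal behavioural sketch: variable-sized activation records addressed as stack pointer plus offset within a 128-entry local file, and a cycle in which the write lands before the reads. The class and method names, the register-granular stack pointer, the downward growth, and the modulo-128 wrap are simplifying assumptions made for illustration. Spilling and filling to memory on overflow or underflow is left out entirely, and nothing here describes the actual 29K circuitry.

```python
NUM_LOCAL = 128  # size of the local register file, as stated in the post


class LocalRegisterFile:
    """Toy model of stack-pointer-relative local registers (hypothetical)."""

    def __init__(self) -> None:
        self.regs = [0] * NUM_LOCAL
        self.sp = 0  # assumed: stack pointer counted in whole registers

    def physical(self, offset: int) -> int:
        # "stack-pointer-plus-offset", wrapped to stay inside the file
        return (self.sp + offset) % NUM_LOCAL

    def call(self, frame_size: int) -> None:
        # Allocate a frame of any size from 1 register up to the whole file.
        if not 1 <= frame_size <= NUM_LOCAL:
            raise ValueError("frame must fit in the local register file")
        self.sp = (self.sp - frame_size) % NUM_LOCAL

    def ret(self, frame_size: int) -> None:
        # Release the frame allocated by the matching call().
        self.sp = (self.sp + frame_size) % NUM_LOCAL

    def cycle(self, write=None, reads=()):
        # Write-before-read: the write happens in the first half of the
        # cycle, so a read of the same register in this cycle sees it.
        if write is not None:
            offset, value = write
            self.regs[self.physical(offset)] = value
        return [self.regs[self.physical(o)] for o in reads]


if __name__ == "__main__":
    rf = LocalRegisterFile()
    rf.call(5)                                  # caller: 5-register frame
    print(rf.cycle(write=(0, 42), reads=(0,)))  # same-cycle write then read
    rf.call(3)                                  # callee: 3-register frame
    print(rf.cycle(reads=(3,)))                 # caller's register 0, seen at offset 3
```

Because frames are just offsets from a moving pointer, a callee can reach the caller's registers at offsets beyond its own frame size, which is one way such a scheme can pass arguments without copying.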