[net.arch] Cache vs. Registers / Re: Hybrid Stack/Register Machines

phipps@fortune.UUCP (Clay Phipps) (03/29/84)

[My apologies to those who have already seen this follow-up;
 I withdrew it a few hours after its original posting, 
 but was unsuccessful in reposting it, so I am trying it again.
 My original text has not been modified.]

For any given level of hardware technology,
caches are going to be slower than registers
because they have much more work to do than registers.
The following description of the processing involved for 
an arithmetic & logic unit (ALU) to read data is informal.
Virtual memory issues are ignored.  Details of real machines vary.  

To read data from a simple array of n registers w bits wide,
you input an address (i.e., a register number) and it outputs the contents.
An n-way decoder and n rows of w flip-flops each will just about do it.
I suspect that almost every non-micro manufacturer 
would use high-speed RAMs instead of actual flip-flops
(except maybe Seymour Cray).

To read data from a simple cache, you feed it a memory address;
it must check all of the memory addresses in the cache tags
to see if it has a match.
If there is a match, you have a 'cache hit': the data is present,
and you can now perform the steps above for register access.
If there is no match, you have a 'cache miss': the data is absent, 
and you have more work to do (the rest of this paragraph describes that).
You have to put the ALU on hold, 
and request the data from the main memory.
Wait until the main memory provides the data;
this will take much longer than a cache hit access would have (factor >= 4).
You must find a place in the cache to put the data;
this means selecting some data to throw out, 
as determined by your replacement strategy (e.g., LRU). 
Once the data arrives from main memory, 
you must overwrite the old data at that place in the cache.
Update the cache tag with the requested address; take the ALU off hold.
Now you can either output the data directly to the ALU,
or you can perform the steps above for register access.

Ideally, you design the cache to do as much as possible in parallel;
for example, you probably want to compare the address to as many cache tags
as possible at the same time.  Cache miss handling can be complicated,
and is a substantial source of bugs in the ALU; 
you may want to handle this in microcode rather than random logic, 
possibly giving up some speed for a big win in development modifiability.

The gate delays alone are probably sufficient to explain the speed difference
between registers and caches, even when the registers and cache 
in a machine are based on the same RAM chips and logic family.

This was written from a mainframe perspective.
Are things any different for you microprocessor people out there?

-- Clay Phipps

-- 
   {allegra  amd70  cbosgd  decwrl!amd70  harpo  hplabs!hpda  
    ihnp4  megatest  nsc  oliveb  sri-unix  ucbvax!amd70  varian}
   !fortune!phipps