[comp.arch] Throwaway: Speedup address formation for cache lookup

aglew@crhc.uiuc.edu (Andy Glew) (09/12/90)

Here's another of my throwaway ideas[*]:

Want to remove logic from the critical path of your processor?  Is
address formation/translation/cache lookup on the critical path?
Virtual caches remove address translation, but you still have address
formation - since typical addresses are formed by an addition
(AMD29000 notwithstanding) - followed by cache lookup.

Problem: the address addition has to be completed before cache lookup.
   That might be a while for 64 bit addresses.
(Well, you can play around with self-timing, or with piping out the lsbs
before the msbs have formed.)
    (Or you can use the IBM trick of assuming that the next address
will fall in the same set as the previous access, to reduce cache setup time.)


Possible hacks to reduce the need to wait for the address addition:

(1) Many addresses have limited carry propagation.  In the extreme case,
    you could just OR the base+index together and send that to the cache quickly,
    falling back to using the full addition in case of an error.
    (I've already published results on the degree of success of ORing).
    Or, you can use limited-carry-length propagate circuits, which are
    possibly faster than full carry propagation, again with a fallback in case
    of error.
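A rough sketch of the OR trick in Python (the addresses and the 64-bit width here are made-up examples, not from the post): OR equals ADD exactly when no bit position generates a carry, i.e. when base AND offset is zero, which is also the cheap error check for the fallback.

```python
def or_predict(base, offset, width=64):
    """Predict base+offset by ORing; report whether the guess was right."""
    mask = (1 << width) - 1
    guess = (base | offset) & mask      # fast "addition" with no carry chain
    actual = (base + offset) & mask     # the real sum, for the fallback path
    # OR == ADD precisely when no bit position carries: base & offset == 0.
    correct = (base & offset) & mask == 0
    assert correct == (guess == actual)
    return guess, correct

# A base with its low bits clear plus a small offset: prediction succeeds.
g1, ok1 = or_predict(0xFFFFF000, 0x24)
# Overlapping bits force a carry: prediction fails, use the full add instead.
g2, ok2 = or_predict(0xFFFFF010, 0x34)
```

The point of the check being `base & offset == 0` is that it is a wide AND-NOR, much shallower than a 64-bit carry chain, so the "was I wrong?" signal arrives early enough to trigger the fallback.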

(2) Most variables are accessed in only two possible ways: either through a full pointer,
    or through a single fixed base and offset.  Usually only one of the many possible
    base+offset pairs is used.  If this holds for all variables
    in a cache line, then tripled tags (one full-address tag, plus separate
    base and index tags) might be used, with tripled match logic.
    	This would permit the base and index to be sent directly to the cache,
    although the extra circuitry and delay might lose any potential speedup.
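A behavioral sketch of the tripled-tag idea (the class and values are hypothetical, for illustration only): the fast path compares base and index against their own tags with no addition; the slow path is the conventional add-then-compare fallback.

```python
# Hypothetical model of one cache line that remembers the (base, index)
# pair that last formed its address, alongside the conventional full tag.
class TaggedLine:
    def __init__(self, base, index):
        self.full_tag = base + index   # conventional tag, for the pointer case
        self.base_tag = base           # compared without any carry chain
        self.index_tag = index

    def hit(self, base, index):
        # Fast path: match base and index separately -- no addition needed.
        if base == self.base_tag and index == self.index_tag:
            return True
        # Slow fallback: do the full add, then a conventional tag compare.
        return base + index == self.full_tag

line = TaggedLine(0x1000, 0x24)
line.hit(0x1000, 0x24)   # fast path: exact base/index match
line.hit(0x1004, 0x20)   # slow path: different pair, same sum
```

The post's caveat applies: the win only materializes if the extra comparators are cheaper than the adder delay they hide.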

    Like I said, they're throwaway ideas.

[*] I'm bored, waiting for a trace collection to finish.
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

hankd@dynamo.ecn.purdue.edu (Hank Dietz) (09/14/90)

In article <AGLEW.90Sep11231456@dwarfs.crhc.uiuc.edu> aglew@crhc.uiuc.edu (Andy Glew) writes:
>Here's another of my throwaway ideas[*]:
...
>Possible hacks to reduce the need to wait for the address addition:
>
>(1) Many addresses have limited carry propagation.  In the extreme case,
>    you could just OR the base+index together and send that to the cache quickly,
>    falling back to using the full addition in case of an error.
...
>(2) Most variables are accessed in only two possible ways:  either by a full pointer,
>    or by a single fixed base and offset.  Usually only one of the many possible
>    base+offset pairs is used.  If this statement is true for all variables
>    in a cache line, then triple the tags (one full address, one base+index)
>    might be used, with triple the match logic.

How about these instead:

[3]	Although ORing an offset in works fine given small offsets and a
	compiler which performs the appropriate data layout, the size
	constraint sucks.  Unlike OR, XOR can be used without imposing
	a size constraint.  Sure, the data structure's elements end up
	arranged in a rather random-looking memory sequence, but so what?
	This is well within the grasp of a smart compiler.
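A quick sketch of the XOR point (the base value is an arbitrary example): XOR-ing offsets 0..7 into a base with nonzero low bits permutes the elements within their aligned block, so the layout is scrambled but collision-free, and no offset-size constraint applies.

```python
def xor_addr(base, offset):
    """Form an address by XOR instead of ADD; works for any offset width."""
    return base ^ offset

base = 0x8F05                      # nonzero low bits, so XOR != ADD here
addrs = [xor_addr(base, off) for off in range(8)]
# All eight elements land somewhere in the same aligned 8-byte block,
# just in a shuffled order -- exactly the "random-looking sequence"
# the post describes, and harmless if the compiler lays data out to match.
```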

[4]	Gray codes only have one bit change for increment.  By using such an
	address encoding, things can be simplified, but this gets somewhat
	nasty to deal with...  I think.
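The single-bit-change property is easy to check with the standard binary-reflected Gray code (n XOR n>>1); this snippet just verifies the claim, it is not from the post:

```python
def to_gray(n):
    """Binary-reflected Gray code: successive values differ in one bit."""
    return n ^ (n >> 1)

# Incrementing the underlying counter flips exactly one bit of the code.
for i in range(15):
    diff = to_gray(i) ^ to_gray(i + 1)
    assert diff != 0 and diff & (diff - 1) == 0   # exactly one bit set
```

The "nasty" part Hank alludes to is presumably that addresses are not only incremented: arbitrary base+offset arithmetic does not stay simple under a Gray encoding.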

						-hankd@ecn.purdue.edu

jack@leo.UUCP (Jack Benkual) (09/14/90)

In article <AGLEW.90Sep11231456@dwarfs.crhc.uiuc.edu>, aglew@crhc.uiuc.edu (Andy Glew) writes:
$ Here's another of my throwaway ideas[*]:
$ 
$ Want to remove logic from the critical path of your processor?  Is
$ address formation/translation/cache lookup on the critical path?
$ ....
$ Problem: the address addition has to be completed before cache lookup.
$    That might be a while for 64 bit addresses.
$..... 
$ (1) Many addresses have limited carry propagation.  .....
$     possibly faster than full carry propagate, again, with fallback in case
$     of error.
$ 
$ (2) Most variables are accessed in only two possible ways: 
$     either by a full pointer, or by a single fixed base and offset. .....
$     in a cache line, then triple the tags (one full address, one base+index)
$     might be used, with triple the match logic.  
$     	This might permit the base+index to be sent directly to the cache, 
$     although the extra circuitry + delays might lose any potential speedup.

The above optimizations, as well as the following one, are possible when
on-chip caches and MMUs are used:

  (3) The cache index address bits are typically divided into two groups. One
  group is decoded to select a row, and the other group is used to select the
  desired column. The row decoding needs to be done first.
  Interesting simplifications occur when you try to combine the add and the
  decode: one can build a row decoder that takes two operands and an expected
  carry from the lower-order bits. It can fetch both possible rows regardless
  of the incoming carry and let a multiplexer select between the two. In any
  case only a relatively small number of bits need to be added to start
  accessing the cache, and this can be balanced by moving more bits into the
  multiplexing phase.
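A behavioral sketch of that carry-select row decode (bit widths, the row array, and all values are illustrative assumptions): both candidate rows are read speculatively, and the late-arriving carry from the low-order add only has to steer a mux.

```python
ROW_BITS = 4   # assumed width of the upper index bits decoded into a row

def carry_select_row(base_hi, off_hi, carry_in, rows):
    """Fetch both possible rows, then let the incoming carry pick one."""
    mask = (1 << ROW_BITS) - 1
    row0 = (base_hi + off_hi) & mask        # sum assuming carry-in = 0
    row1 = (base_hi + off_hi + 1) & mask    # sum assuming carry-in = 1
    data0, data1 = rows[row0], rows[row1]   # speculative reads of both rows
    return data1 if carry_in else data0     # late mux select on the real carry

rows = list(range(16))                      # stand-in for the cache's row data
carry_select_row(0b0011, 0b0101, 0, rows)   # selects row 8
carry_select_row(0b0011, 0b0101, 1, rows)   # selects row 9
```

This is the same structure as a carry-select adder, applied to the decode: the expensive part (two RAM reads) overlaps the slow part (the low-order carry).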