aglew@crhc.uiuc.edu (Andy Glew) (09/12/90)
Here's another of my throwaway ideas[*]:

Want to remove logic from the critical path of your processor? Is
address formation/translation/cache lookup on the critical path?
Virtual caches remove address translation, but you still have address
formation - since typical addresses are formed by an addition
(AMD29000 notwithstanding) - and cache lookup.

Problem: the address addition has to be completed before cache lookup.
That might be a while for 64 bit addresses. (Well, you can play around
with self-timing, or pipe out the lsbs before the msbs have formed.)
(Or you can use the IBM trick of assuming that the next address will
be in the same set as the previous access, to reduce cache setup.)

Possible hacks to reduce the need to wait for the address addition:

(1) Many addresses have limited carry propagation. In the extreme case,
    you could just OR the base+index together and send that to the cache
    quickly, falling back to using the full addition in case of an error.
    (I've already published results on the degree of success of ORing.)
    Or, you can use limited-carry-length propagate circuits, which are
    possibly faster than full carry propagate, again with fallback in
    case of error.

(2) Most variables are accessed in only two possible ways: either by a
    full pointer, or by a single fixed base and offset. Usually only one
    of the many possible base+offset pairs is used. If this statement is
    true for all variables in a cache line, then triple the tags (one
    full address, one base+index) might be used, with triple the match
    logic. This might permit the base+index to be sent directly to the
    cache, although the extra circuitry + delays might lose any
    potential speedup.

Like I said, they're throwaway ideas.

[*] I'm bored, waiting for a trace collection to finish.
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
hankd@dynamo.ecn.purdue.edu (Hank Dietz) (09/14/90)
In article <AGLEW.90Sep11231456@dwarfs.crhc.uiuc.edu> aglew@crhc.uiuc.edu (Andy Glew) writes:
>Here's another of my throwaway ideas[*]: ...
>Possible hacks to reduce the need to wait for the address addition:
>
>(1) Many addresses have limited carry propagation. In the extreme case,
>    you could just OR the base+index together and send that to the cache
>    quickly, falling back to using the full addition in case of an error. ...
>(2) Most variables are accessed in only two possible ways: either by a
>    full pointer, or by a single fixed base and offset. Usually only one
>    of the many possible base+offset pairs is used. If this statement is
>    true for all variables in a cache line, then triple the tags (one
>    full address, one base+index) might be used, with triple the match
>    logic.

How about these instead:

[3] Although ORing an offset in works fine given small offsets and a
    compiler which performs the appropriate data layout, the size
    constraint sucks. Unlike OR, XOR can be used without imposing a size
    constraint. Sure, the data structure's elements end up arranged in a
    rather random-looking memory sequence, but so what? This is well
    within the grasp of a smart compiler.

[4] Gray codes have only one bit change per increment. By using such an
    address encoding, things can be simplified, but this gets somewhat
    nasty to deal with... I think.

-hankd@ecn.purdue.edu
jack@leo.UUCP (Jack Benkual) (09/14/90)
In article <AGLEW.90Sep11231456@dwarfs.crhc.uiuc.edu>, aglew@crhc.uiuc.edu (Andy Glew) writes:
$ Here's another of my throwaway ideas[*]:
$
$ Want to remove logic from the critical path of your processor? Is
$ address formation/translation/cache lookup on the critical path?
$ ....
$ Problem: the address addition has to be completed before cache lookup.
$ That might be a while for 64 bit addresses.
$.....
$ (1) Many addresses have limited carry propagation. .....
$ possibly faster than full carry propagate, again, with fallback in case
$ of error.
$
$ (2) Most variables are accessed in only two possible ways:
$ either by a full pointer, or by a single fixed base and offset. .....
$ in a cache line, then triple the tags (one full address, one base+index)
$ might be used, with triple the match logic.
$ This might permit the base+index to be sent directly to the cache,
$ although the extra circuitry + delays might lose any potential speedup.
The above optimizations, as well as the following ones, are possible when
on-chip caches and MMUs are used:

(3) The cache index address bits are typically divided into two groups.
One group is decoded to access the selected row, and the other group is
used to select the desired column. The row decoding needs to be done
first. Interesting simplifications occur when you try to perform the add
and the decode together: one can build a row decoder that takes two
operands and an expected carry from the lower-order bits. It can fetch
both possible addresses regardless of the incoming carry and let a
multiplexer select between the two. In any case, only a relatively small
number of bits needs to be added before the cache access can start; this
can be balanced by moving more bits into the multiplexing phase.