[comp.arch] Smart I-cache?

jcallen@Encore.COM (Jerry Callen) (11/01/90)

davidsen@crdos1.crd.ge.com (bill davidsen) writes:

>  The feature is intelligent I-cache, which only stores instructions
>which are the target of jumps.

This is, of course, the AMD29K branch target cache, which stores the
target of the branch and the next 3 instructions. The idea is that 
this gives the memory system 4 clock ticks in which to deliver the
4th instruction after the branch target, and then hopefully the memory
system can deliver subsequent instructions every cycle (page mode DRAM,
for instance).
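
For anyone who hasn't run into one of these, a toy software model of the
idea goes something like the following.  The entry count and the simple
direct-mapped indexing are made up for illustration; this is not the 29K's
actual organization.

/* Toy model of a branch target cache: each entry caches the first four
 * instructions found at some branch target. */
#include <stdio.h>
#include <string.h>

#define BTC_ENTRIES     32
#define WORDS_PER_ENTRY 4

struct btc_entry {
    int      valid;
    unsigned tag;                       /* word address of the branch target */
    unsigned insn[WORDS_PER_ENTRY];     /* target instruction + next three   */
};

static struct btc_entry btc[BTC_ENTRIES];

/* On a taken branch, a hit supplies four instructions at once, which is
 * what buys the memory system its four clocks to start the burst.        */
static const unsigned *btc_lookup(unsigned target)
{
    struct btc_entry *e = &btc[target % BTC_ENTRIES];
    return (e->valid && e->tag == target) ? e->insn : NULL;
}

/* On a miss, once memory delivers the words, remember them for next time. */
static void btc_fill(unsigned target, const unsigned words[WORDS_PER_ENTRY])
{
    struct btc_entry *e = &btc[target % BTC_ENTRIES];
    e->valid = 1;
    e->tag   = target;
    memcpy(e->insn, words, sizeof e->insn);
}

int main(void)
{
    unsigned words[WORDS_PER_ENTRY] = { 0x11, 0x22, 0x33, 0x44 };

    if (btc_lookup(0x1000) == NULL)     /* first trip: miss, stall on DRAM  */
        btc_fill(0x1000, words);
    if (btc_lookup(0x1000) != NULL)     /* next trip around the loop: hit   */
        printf("hit: 4 instructions while the DRAM access starts\n");
    return 0;
}

The payoff comes on loops and other revisited branches: the first taken
branch to a target pays the full DRAM latency, while later ones hit in the
cache and overlap the start of the memory access.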

I attended an AMD marketing pitch a few years ago on the 29K; the
argument was that the branch target cache enables the 29K to squeeze
reasonable performance out of a cheap DRAM memory system. Can anyone
comment on how effective the branch target cache is?

-- Jerry Callen
   jcallen@encore.com

rfk@egr.duke.edu (Robert F. Krick) (11/02/90)

From article <2823@crdos1.crd.ge.COM>, by davidsen@crdos1.crd.ge.COM 
(Wm E Davidsen Jr):
> 
>   The feature is intelligent I-cache, which only stores instructions
> which are the target of jumps. 
>
> [stuff deleted]
> 
>   If anyone has any info on recent work (if any) I'd like to hear it. If
> there are any good papers I should look up I'd like to see them, too.
> Obviously this must either be harder to do than I think, or provide less
> benefit, or everyone would be doing it.

Although our approach is not a "smart I-cache", Dr. Apostolos Dollas 
(my advisor) and I have been working to *eliminate* the performance 
degradation associated with I-cache misses and branch penalties.  To 
accomplish this, we have developed the Sustained Performance Architecture 
(SPA).  This architecture 
is based on the concept that program flow information (extracted prior to
execution) can be used to identify those basic blocks which are candidates 
for execution within the next n cycles (where n is the number of cycles
of latency for the memory).  The instructions from these basic blocks can 
be prefetched from multiple banks of instruction memory such that the 
required instructions are available when they are needed by the processor.  
(Of course, some instructions will be prefetched unnecessarily, but these 
are not delivered to the processor.)  In essence, SPA leverages information 
available when the program is compiled in order to improve performance.
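
In rough C terms, the lookahead amounts to something like the sketch below.
The flow table, the two-successor limit, the one-instruction-per-cycle
accounting, and the prefetch_bank() routine are simplifications invented
for this posting, not the actual SPA mechanism.

/* Sketch of compile-time-guided instruction prefetch: given a table of
 * basic blocks and their statically known successors, prefetch every
 * instruction that could be needed within the next 'cycles' issue slots. */
#include <stdio.h>

#define NBANKS   4                /* assumed number of instruction banks   */
#define MAX_SUCC 2                /* fall-through plus one branch target   */

struct bblock {
    unsigned addr;                /* word address of the first instruction */
    unsigned len;                 /* number of instructions in the block   */
    const struct bblock *succ[MAX_SUCC];  /* known before execution        */
};

/* Stand-in for starting a fetch in bank (addr % NBANKS). */
static void prefetch_bank(unsigned addr)
{
    printf("bank %u: prefetch word %u\n", addr % NBANKS, addr);
}

/* Prefetch 'cur' and every block reachable from it within 'cycles'
 * instruction slots, so the words are on hand when the processor wants
 * them.  Blocks on the path not taken get fetched too; they are simply
 * never delivered to the processor.                                     */
static void lookahead(const struct bblock *cur, unsigned cycles)
{
    unsigned i;
    int s;

    if (cur == NULL || cycles == 0)
        return;
    for (i = 0; i < cur->len && i < cycles; i++)
        prefetch_bank(cur->addr + i);
    if (cur->len < cycles)
        for (s = 0; s < MAX_SUCC; s++)
            lookahead(cur->succ[s], cycles - cur->len);
}

int main(void)
{
    /* tiny flow graph: b0 ends in a conditional branch to b1 or b2 */
    static const struct bblock b1 = { 100, 3, { NULL, NULL } };
    static const struct bblock b2 = { 200, 5, { NULL, NULL } };
    static const struct bblock b0 = { 0,   2, { &b1, &b2 } };

    lookahead(&b0, 6);            /* pretend the memory latency is 6 cycles */
    return 0;
}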

Since the program flow information is extracted prior to execution, SPA 
cannot *improve* the performance for branches which have dynamically 
defined targets (e.g. jmp r2).  With the notable exception of subroutine
returns, which SPA can handle without any loss in performance, the degradation
associated with this class of branch instructions is insignificant (<< 1%),
because these instructions are sufficiently rare in compiled code from 
languages such as FORTRAN and C.

We have developed a working `proof of concept' prototype.  In addition, 
we have applied for patent protection in both the USA and Japan.  If you 
would like more details about this architecture, please see our article 
in the December '89 issue of the ACM "Computer Architecture News".  Depending
on demand, I will make additional technical reports available via ftp.

========================================================================
Robert F. Krick; Dept. of Elec. Eng.; Duke University; Durham, NC  27706
    Internet:  rfk@cameron.egr.duke.edu 	AT&T: (919)660-5268

	  "When the branching gets conditional, the prefetching 
	   gets tough, and the tough get going!"

jbuck@galileo.berkeley.edu (Joe Buck) (11/02/90)

In article <1157@cameron.egr.duke.edu>, rfk@egr.duke.edu (Robert F. Krick) writes:
|> Since the program flow information is extracted prior to execution, SPA 
|> cannot *improve* the performance for branches which have dynamically 
|> defined targets (e.g. jmp r2).  With the notable exception of subroutine
|> returns, which SPA can handle without any loss in performance, the degradation
|> associated with this class of branch instructions is insignificant (<< 1%),
|> because these instructions are sufficiently rare in compiled code from 
|> languages such as FORTRAN and C.

Object-oriented languages are becoming more popular; there are many more
indirect branches (in C++, these are virtual function calls) in such
languages.  However, I have no statistics on what fraction of instructions
in such programs are indirect branches, and this may vary widely from
program to program.  You'd do very poorly on Forth code, but I doubt that
you care all that much. :-)
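
In C terms (to match the other code in this thread), a virtual call boils
down to roughly the following.  The shape/area names are invented for the
example.

/* A hand-written C analogue of a C++ virtual function call. */
#include <stdio.h>

struct shape;

struct shape_ops {                       /* plays the role of the vtable  */
    double (*area)(const struct shape *);
};

struct shape {
    const struct shape_ops *ops;
    double a, b;
};

static double rect_area(const struct shape *s) { return s->a * s->b; }
static double tri_area(const struct shape *s)  { return 0.5 * s->a * s->b; }

static const struct shape_ops rect_ops = { rect_area };
static const struct shape_ops tri_ops  = { tri_area };

int main(void)
{
    struct shape r = { &rect_ops, 3.0, 4.0 };
    struct shape t = { &tri_ops,  3.0, 4.0 };
    const struct shape *list[2];
    int i;

    list[0] = &r;
    list[1] = &t;
    for (i = 0; i < 2; i++)
        /* a load of ops->area followed by an indirect call: the target
           depends on the object, so it cannot be known before execution */
        printf("area = %g\n", list[i]->ops->area(list[i]));
    return 0;
}

Until the object is in hand, the call site gives a compile-time prefetcher
nothing to go on.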

> We have developed a working `proof of concept' prototype.  In addition, 
> we have applied for patent protection in both the USA and Japan.

Have you gotten it to the point where you can run the SPECmarks on it
(meaning that you've finished Fortran and C compilers that can provide
the scheduling information)?


--
Joe Buck
jbuck@galileo.berkeley.edu	 {uunet,ucbvax}!galileo.berkeley.edu!jbuck	

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (11/02/90)

In article <1157@cameron.egr.duke.edu> rfk@egr.duke.edu (Robert F. Krick) writes:

| Since the program flow information is extracted prior to execution, SPA 
| cannot *improve* the performance for branches which have dynamically 
| defined targets (e.g. jmp r2).  With the notable exception of subroutine
| returns, which SPA can handle without any loss in performance, the degradation
| associated with this class of branch instructions is insignificant (<< 1%),
| because these instructions are sufficiently rare in compiled code from 
| languages such as FORTRAN and C.

  register int (*state)();
  
  /* code */
  if (m < n) (*state)(foo, mumble, barf);

  I agree that this is an infrequent case.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
      The Twin Peaks Halloween costume: stark naked in a body bag

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (11/02/90)

In article <13120@encore.Encore.COM> jcallen@encore.Com (Jerry Callen) writes:

| I attended an AMD marketing pitch a few years ago on the 29K; the
| argument was that the branch target cache enables the 29K to squeeze
| reasonable performance out of a cheap DRAM memory system. Can anyone
| comment on how effective the branch target cache is?

  A number of people have pointed out that a branch target cache makes
very good use of its capacity in terms of keeping the CPU busy, but is
less effective at reducing the load on main memory.

  I could easily be convinced that a small branch target cache, 4K or
so, plus a larger general cache for both I and D would be desirable.
Then you get into the delays caused by a multilevel cache...
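
Very roughly, the fetch path I have in mind would behave like this (the
sizes and per-level cycle counts are guesses, purely for illustration):

/* Rough model of that fetch path: a small branch target cache in front of
 * a larger unified cache.  Each level that has to be consulted adds to the
 * latency, which is the multilevel delay I mean.  Valid bits are omitted
 * for brevity; address 0 is never fetched here.                           */
#include <stdio.h>

#define BTC_LINES   256       /* roughly "4K or so" of branch targets       */
#define CACHE_LINES 4096      /* the larger general cache for I and D       */

static unsigned btc_tag[BTC_LINES];
static unsigned cache_tag[CACHE_LINES];

/* Cycles a fetch of 'addr' costs in this model: 1 for a BTC hit on a taken
 * branch, 2 for a hit in the general cache, 20 for a trip to main memory.  */
static int fetch_cost(unsigned addr, int taken_branch)
{
    if (taken_branch && btc_tag[addr % BTC_LINES] == addr)
        return 1;                          /* BTC keeps the CPU busy        */
    if (cache_tag[addr % CACHE_LINES] == addr)
        return 2;                          /* one more level of lookup      */
    cache_tag[addr % CACHE_LINES] = addr;  /* fill both on the way back     */
    if (taken_branch)
        btc_tag[addr % BTC_LINES] = addr;
    return 20;                             /* main memory                   */
}

int main(void)
{
    printf("%d %d %d\n",
           fetch_cost(0x40, 1),   /* cold: goes all the way to memory (20)  */
           fetch_cost(0x40, 1),   /* warm taken branch: BTC hit (1)         */
           fetch_cost(0x40, 0));  /* same word, not a branch: cache hit (2) */
    return 0;
}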
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
      The Twin Peaks Halloween costume: stark naked in a body bag

tim@proton.amd.com (Tim Olson) (11/04/90)

In article <13120@encore.Encore.COM> jcallen@encore.Com (Jerry Callen) writes:
| I attended an AMD marketing pitch a few years ago on the 29K; the
| argument was that the branch target cache enables the 29K to squeeze
| reasonable performance out of a cheap DRAM memory system. Can anyone
| comment on how effective the branch target cache is?

Mark Hill's thesis compared a number of different caches, including
the 29K's BTC.  He concluded that the BTC was more effective than a
regular cache up to about 2K bytes.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

rfk@duke.cs.duke.edu (Robert F. Krick) (11/04/90)

I have accumulated several followups to my last posting.  Here are
some replies:

In article <39324@ucbvax.BERKELEY.EDU>, jbuck@galileo.berkeley.edu 
(Joe Buck) writes:
> Object-oriented languages are becoming more popular; there are many more
> indirect branches (in C++, these are virtual function calls) in such
> languages.  However, I have no statistics on what fraction of instructions
> in such programs are indirect branches, and this may vary widely from
> program to program.  You'd do very poorly on Forth code, but I doubt that
> you care all that much. :-)

Indeed, the Sustained Performance Architecture (SPA) which I described may 
not be appropriate for all languages, because of the requirement that "most" 
of the program flow be known prior to execution.  As I said, SPA cannot
*improve* the performance for portions of the program where the branches
have dynamically defined targets.  A special-purpose i-cache could easily 
be incorporated into SPA to handle these situations explicitly, in the same 
manner as other architectures.  The decision to incorporate such an i-cache
would depend on the frequency of, and penalty associated with, these 
dynamically defined branch targets.


>> We have developed a working `proof of concept' prototype.  In addition, 
>> we have applied for patent protection in both the USA and Japan.

> Have you gotten it to the point where you can run the SPECmarks on it
> (meaning that you've finished Fortran and C compilers that can provide
> the scheduling information)?

The goal of our `proof of concept' prototype was to demonstrate that critical 
aspects of the architecture could be implemented in hardware.  We have 
accomplished this goal by building a working system.  We have no plans to 
develop the existing prototype further into a system capable of executing 
the SPECmarks.  However, we are actively seeking licensing and/or funding 
that would enable us to develop the next-generation prototype, namely a 
full-fledged general-purpose computer.  My current research is the design 
and implementation of the necessary `scheduling' algorithms for the compiler.


In article <13120@encore.Encore.COM>, jcallen@encore.Com (Jerry Callen) writes:
> This is, of course, the AMD29K branch target cache, which stores the
> target of the branch and the next 3 instructions. 

As you realize, any technique which is heuristic or based upon locality of 
reference cannot be 100% effective.  As a result, there will *always* be 
some performance degradation for techniques such as branch prediction, 
i-caches, branch target caches, etc.  In contrast, the SPA concept addresses 
the *cause* of performance degradation in instruction sequencing by looking 
ahead in the program flow to prefetch all instructions which may be required.  
As the disparity between processor speed and main memory speed increases, 
heuristic and locality-based solutions are becoming increasingly expensive 
and IMHO will eventually give way to structured solutions which provide 
higher performance at lower cost.


In article <2829@crdos1.crd.ge.COM>, davidsen@crdos1.crd.ge.com (bill davidsen)
writes:
>> In article <1157@cameron.egr.duke.edu> rfk@egr.duke.edu (Robert F. Krick) 
>> writes:
>> Since the program flow information is extracted prior to execution, SPA 
>> cannot *improve* the performance for branches which have dynamically 
>> defined targets (e.g. jmp r2).  With the notable exception of subroutine
>> returns, which SPA can handle without any loss in performance, the degradation
>> associated with this class of branch instructions is insignificant (<< 1%),
>> because these instructions are sufficiently rare in compiled code from 
>> languages such as FORTRAN and C.
>
>  register int (*state)();
>
> /* code */
> if (m < n) (*state)(foo, mumble, barf);
>
> I agree that this is an infrequent case.

Thanks for your support!  There is another (possibly more frequent) situation 
in which branches with dynamically defined targets may be used:  a branch 
table for the switch/case construct in C.  For small branch tables, it is 
more appropriate in SPA to generate code for a binary decision tree (despite 
the slight increase in code size).  The code for these binary decision 
trees can then be executed without any loss in performance.
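
To illustrate the trade-off, here are the two code shapes written out by
hand in C.  The op_* routines are dummies, and a real compiler's branch
table would of course use code labels rather than function pointers; the
point is only where the dynamically defined target does or does not appear.

/* Two ways to lower a small switch/case.  dispatch_table() mimics the
 * usual branch-table form (one indirect branch with a dynamically
 * computed target); dispatch_tree() is the binary-decision-tree form,
 * in which every branch target is fixed at compile time.               */
#include <stdio.h>

static int op_add(void) { return 1; }
static int op_sub(void) { return 2; }
static int op_mul(void) { return 3; }
static int op_div(void) { return 4; }

/* In effect what "switch (op)" with a jump table does. */
static int dispatch_table(int op)
{
    static int (* const table[4])(void) = { op_add, op_sub, op_mul, op_div };
    return table[op]();                  /* target known only at run time */
}

/* The decision-tree lowering: slightly more code, all targets static. */
static int dispatch_tree(int op)
{
    if (op < 2)
        return (op == 0) ? op_add() : op_sub();
    else
        return (op == 2) ? op_mul() : op_div();
}

int main(void)
{
    int op;
    for (op = 0; op < 4; op++)
        printf("%d %d\n", dispatch_table(op), dispatch_tree(op));
    return 0;
}

dispatch_tree() costs an extra compare or two, but every one of its branch
targets is visible to the compile-time flow analysis.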

========================================================================
Robert F. Krick; Dept. of Elec. Eng.; Duke University; Durham, NC  27706
    Internet:  rfk@cameron.egr.duke.edu 	AT&T: (919)660-5268

	  "When the branching gets conditional, the prefetching 
	   gets tough, and the tough get going!"