[comp.sys.intel] Code prefetch

m5d@bobkat.UUCP (Mike McNally ) (11/20/86)

Perhaps I am giving away a valuable idea, but I won't flatter myself
that this could be the case.  

Various architectures (the x86 architecture in particular) involve
a mechanism whereby the CPU is able to fetch code during unused/
unusable bus cycles.  In the x86 series, the prefetch is dumb (less
dumb on the 2- and 386 than the original chips, but dumb nevertheless).
The prefetch queue is kept small (although I was surprised at the 
size (16 bytes) of the queue in the 386) because it was found that
the point of diminishing returns was encountered at about 6 bytes.
The obvious problem is that branch instructions do not cooperate
with the idea of prefetching.  Attempts to follow different paths
beyond a branch seem like Koyaanisqatsi solutions to me.

A compiler with a minimal amount of smarts can detect runs of code
which can be guaranteed to execute in sequence.  Since the compiler
is generating the code, why not provide information on the size
of the upcoming run of certainly sequential instructions?  This
could easily be provided in the 86 architecture as a special
tag byte, leaving maybe four or five bits to tell how many instructions
can safely be prefetched.  The advantages to such a system are even
greater given a burst-mode bus cycle.

Do any existing architectures incorporate such a scheme?  Seems pretty
obvious to me.  Are there any problems that I can't see?  Could this
be as significant as the development of multitronics ("an entirely
new approach...") (note my username)?

-- 
****                                                         ****
**** At Digital Lynx, we're almost in Garland, but not quite ****
****                                                         ****

Mike McNally                                    Digital Lynx Inc.
Software (not hardware) Person                  Dallas  TX  75243
uucp: ...convex!ctvax!bobkat!m5                 (214) 238-7474

dan@prairie.UUCP (Daniel M. Frank) (11/21/86)

In article <242@bobkat.UUCP> m5d@bobkat.UUCP (Mr Mike McNally) writes:
>The obvious problem is that branch instructions do not cooperate
>with the idea of prefetching.  Attempts to follow different paths
>beyond a branch seem like Koyaanisqatsi solutions to me.

   Many pipelined machines (for about the last 20 years) have had
two prefetch queues, and fetched down both branch paths for a while
until the branch was resolved.  This is a good solution because there
may be several instructions ahead of the branch in the pipeline, so
you have no idea whether the branch is going to be taken or not.  Also,
prefetch queues are seldom that long, so the case where a branch
is followed in the prefetch by another branch is infrequent enough
not to be costly (usually it just stalls fetching for a while).

   The idea has been broached (and I think used in some experimental
processors) of a "branch usually" or "branch seldom" instruction, in
which the compiler indicates to the processor which branch is most
likely to be taken.  If branch predictions are relatively accurate,
you only pay for the branch in the occasional case when it goes
against the prediction.  The question of how to make accurate predictions
is a current research topic.

   Of course, this is all complicated when the branch target address
is being computed when the branch instruction hits the decode pipeline
step.  This usually happens when the target address is in a register.
Instruction fetch may have to come to a halt until the interlock can
be resolved.  This problem can be reduced somewhat by careful coding.

   One processor under development at the UW actually EXECUTES down
both branch paths (I won't say any more about that, as a paper is in
the works by the folks developing it).

-- 
    Dan Frank
    uucp: ... uwvax!prairie!dan
    arpa: dan%caseus@spool.wisc.edu

crowl@rochester.ARPA (Lawrence Crowl) (11/21/86)

In article <242@bobkat.UUCP> m5d@bobkat.UUCP (Mr Mike McNally) writes:
>Various architectures involve a mechanism whereby the CPU is able to fetch
>code during unused/unusable bus cycles.  In the x86 series, the prefetch
>is dumb.  The prefetch queue is kept small because it was found that the
>point of diminishing returns was encountered at about 6 bytes.  The
>obvious problem is that branch instructions do not cooperate with the
>idea of prefetching.  Attempts to follow different paths beyond a branch
>seem like Koyaanisqatsi solutions to me.     

>A compiler with a minimal amount of smarts can detect runs of code which
>can be guaranteed to execute in sequence.  Since the compiler is
>generating the code, why not provide information on the size of the
>upcoming run of certainly sequential instructions?  ...  The advantages
>to such a system are even greater given a burst-mode bus cycle.   

>Mike McNally

The sequence approach still does not help in prefetching code past the 
branch point before actually finding out which way the branch will go.  
I suspect a variable length burst mode will be no more effective than a
fixed burst mode which is always applied on instruction fetch.  

One possible solution to this problem is to have the branch instructions
themselves indicate which branch is more likely.  The machine could then
prefetch the most likely sequence, and most often it will already have
the instructions it needs.  When the branch taken is the least common,
the prefetch queue must be flushed and re-filled.   

The RISC approach is to have the instructions after the branch executed
reguardless if the branch is taken or not.  For a small pipe or prefetch
queue, the compiler can usually fill this "delay slot" with a useful
instruction.  Though this represents a problem if the pipe or prefetch
queue gets to large.  
-- 
  Lawrence Crowl		716-275-5766	University of Rochester
			crowl@rochester.arpa	Computer Science Department
 ...!{allegra,decvax,seismo}!rochester!crowl	Rochester, New York,  14627