[comp.arch] Branch Target Cache vs. Instruction Cache

hansen@mips.UUCP (05/16/87)

Let me start by apologizing for the tone of my last message - I'll try to be
less caustic and more informative this time.

In article <767@apple.UUCP>, bcase@apple.UUCP (Brian Case) writes:
 (about how a branch target cache performs comparably to an instruction
    cache, in a real system environment)
> it can perform comparably because, logically, part of the instruction cache
> is in the external rams.  the branch target cache is only there to cover the
> latency of the initial access in that external ram (how many times do we
> have to say this?).  so, in essence, the Am29000 branch target cache can
> cache 32 loops of *any* size.  this is most certainly not true of traditional
> instruction caches.  on the other hand, this is not necessarily a very
> important attribute; the point is that a branch target cache can perform,
> and in fact does perform, as well as an instruction cache.  i feel that it
> should be i (or someone defending the am29000) who should be asking *you*
> to defend a traditional instruction cache relative to the branch target
> cache (always assuming an external burst mode memory for the branch target
> cache, of course.  the burst mode memory is a critical part of the concept.).

My point was that the four words of information in the BTC are not sufficient
to cover the latency of DRAM accesses in a real system environment. I would
freely acknowledge that it _is_ sufficient if the next-level store is SRAM -
however, in that case, the BTC is a _supplement_ to the SRAM-based
instruction cache. In fact, the system that is simulated by AMD in their
performance simulation has an external cache.

I quote from the McFarling and Hennessy paper "Reducing the cost of
branches" (13th Annual Int. Symp. on Computer Architecture, June 1986) the
following statement: "In our simulations, we noticed that a direct mapped
BTB and an instruction cache of the same size had about the same hit ratio."

As to defending a traditional instruction cache relative to the branch
target cache, this is an apples-and-oranges comparison. The instruction cache
leaves main memory bandwidth free for other purposes, such as data
references, memory refresh, and I/O traffic. What I could compare is the
effective branch latency of the two approaches.  Drawing again on the
results of the paper above: they compute an average branch time of 1.53
cycles for something similar to the MIPS R2k branch scheme (labelled in the
paper as the "Fast Compare" scheme). I would add 0.06 to that value to
account for the differences (explained below).  They don't have an exact
match for the AMD branch target cache, but do have results for a 64-entry
branch target buffer that generally has shorter penalties for mis-predicted
branches and buffer misses, which requires 1.38 cycles per branch.  However,
the AMD branch scheme also doesn't permit branches on conditions other than
the sign bit of a register, so we have to add time for the comparison
operations (.36 cycles for compare equal/not equal and .08 for full
arithmetic compares), making the real total 1.38+.36+.08 = 1.82 cycles per
branch. (Numbers are from the paper, and assume an 80% hit ratio in the BTC.)
In summary:

Branch Scheme		average cycles per branch
=======================================================
Fast Compare (R2k)		1.59
64-entry BTC/sign compare (29k)	1.82
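
For the record, a small C fragment that just reproduces the two totals
from the paper's figures as quoted above:

    /* Branch-cost arithmetic from the discussion above.  All constants
     * are the McFarling & Hennessy figures as quoted in this message. */
    #include <stdio.h>

    int main(void)
    {
        double fast_compare = 1.53 + 0.06;   /* paper's Fast Compare + Micro-TLB */
        double btc = 1.38 + 0.36 + 0.08;     /* 64-entry BTB + eq/ne + arith compares */

        printf("Fast Compare (R2k):              %.2f cycles/branch\n", fast_compare);
        printf("64-entry BTC/sign compare (29k): %.2f cycles/branch\n", btc);
        return 0;
    }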

Selecting between these schemes, holding all else constant, gives the fast
compare scheme about a 2% edge in overall performance over the BTC scheme.
However, McFarling & Hennessy give this warning about the fast compare
scheme: "...the timing of the simple compare is a concern, because it must
complete in time to change the instruction address going out in the next
cycle. This could easily end up as the critical path controlling instruction
cache access and, hence, cycle time." They are right; the timing was a
concern, and we had to put in special hardware to perform the fast compare
and to quickly translate branch addresses to physical addresses.  It is this
translation hardware (Micro-TLB) that adds about 1% of a cycle per
instruction or 6% per branch.  However, the critical paths are on-chip ones,
and can be expected to scale with the technology.

> >I guess the AMD folks are excited about their new child, but when reality
> >sets in, and you try to build a real UNIX-based computer system out of this
> >fine controller part, you'll be sorely disappointed if you expected 17 MIPS
> >at 25 MHz with a nibble-mode DRAM as your primary memory system.  Based on
> 
> *not* nibble-mode, video dram; there is a big difference.
> 
> >their benchmarks (puzzle, dhrystone, and a tiny piece of linpack; sorry, but
> >that's all they've got to compare with...) with an external cache, external
> >cache control hardware, and a fast main memory system, a 29000 at 25 MHz that
> >you can build next year doesn't perform any faster than a MIPS R2000 system
> >at 16.7 MHz that you can build now.
> 
> i will be one of the first to say that mipsco has a good part.  you guys have
> done a great job in the sense that you have gotten great performance at lower
> clock rates.  but what happens to your bus at 25, 30, 35 mhz?  i'm not saying
> that you guys are going to fail to fix things, but let's not be casting
> stones!  how do you know that amd will be sorely disappointed in its
> expectations of 17 mips at 25 mhz?  there are accurate simulations to support
> just this data!  i am not trying to say you are wrong, but don't just make
> the statement, back it up with solid data or at least some observations you
> have made about problems with the architecture/implementation!  please,
> follow the lead of your co-worker john mashey.

I agree - there is a difference between nibble-mode and video DRAM, though
both of these and page-mode DRAM were stated as working perfectly well with a
4-word BTC and no I-cache. Now, the benchmark data is hard to interpret - the
benchmarks we can compare with AMD's all run faster on our 16.7 MHz part than
on AMD's 25 MHz part.  However, these are small benchmarks that do not stress
the cache and memory subsystems of either design, and I would welcome the
appearance of standard benchmarks that are more representative of
applications' performance.  For example, the DoDuc benchmark would be a big
step forward over most existing benchmarks, and contains a mix of
both integer and floating-point code. I think John Mashey & I have
already discussed differences in the architecture of the two machines that
can account for the difference in benchmark performance, and I'm not out to
repeat that discussion, but to try to help draw what conclusions can be made
from the benchmark data available to date.

It will take time and considerable further development before AMD can supply
performance data from running large benchmarks under real system conditions,
and I understand that we'll have to make do with what we've got.  However,
the only benchmark of potentially meaningful size from AMD (simasm) performs
at substantially less than 17 (780-relative) MIPS even with an external cache
and burst-mode memory, and as I understand it, these results do not take
multiprogramming cache effects or finite memory write speed into account.

I'm at a disadvantage in quoting performance data for the AMD part, but the
branch target cache miss rate on the larger benchmarks is in the 50% range,
is it not? That would mean that much of the branch behaviour of these larger
programs is not simple loops, and the analogy of the AMD 29k BTC holding 32
loops of any size isn't really valid. What looks bad for the AMD part is
that branches that miss in the BTC get a four-cycle delay. (Please correct
me if I'm wrong on this, but I'm assuming that the 4 words in each BTC entry
reflect that it will take at least 4 cycles to start up again after a BTC
miss.) At the risk of contradicting a statement made earlier in this message,
that would cause the average branch time to be about 3.5 cycles.
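
The model behind that 3.5-cycle figure is just a weighted average.  A
small C fragment makes it explicit; the 50% miss rate and the 4-cycle
penalty are the assumptions stated above (and questioned below), and the
~1.5-cycle hit cost is taken from the paper's numbers:

    /* Rough model behind the "about 3.5 cycles" estimate.  The miss
     * rate and penalty are assumptions from this message, not
     * measured Am29000 figures. */
    #include <stdio.h>

    int main(void)
    {
        double hit_rate   = 0.50;  /* assumed BTC hit rate on large programs */
        double hit_cost   = 1.5;   /* approx. cycles for a branch that hits */
        double miss_extra = 4.0;   /* assumed extra cycles on a BTC miss */

        printf("average branch time: %.1f cycles\n",
               hit_cost + (1.0 - hit_rate) * miss_extra);   /* ~3.5 */
        return 0;
    }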

As to what MIPS will do to make external caches work at 25, 30, 35 MHz,
I'm afraid this isn't the right forum to be discussing our future
products. We have publicly stated that we will improve the performance
of our products at a rate of doubling performance every eighteen months,
and our current product plans are running faster than that rate.
If I could ask, what happens to the Am29k bus at 40 to 55 MHz?

Regards,
Craig
-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen

tim@amdcad.UUCP (05/16/87)

In article <397@dumbo.UUCP>, hansen@mips.UUCP (Craig Hansen) writes:

+-----
| My point was that the four words of information in the BTC are not sufficient
| to cover the latency of DRAM accesses in a real system environment. I would
| freely acknowledge that it _is_ sufficient if the next-level store is SRAM -
| however, in that case, the BTC is a _supplement_ to the SRAM-based
| instruction cache. In fact, the system that is simulated by AMD in their
| performance simulation has an external cache.
+-----
You correctly point out one of the uses of the BTC -- as a supplement to an
external instruction cache.  However, it is certainly not the case that
all Am29000 systems will be Unix supermicros with 32 meg of main memory
and ECC.  In many potential applications, a small (1-6 MB) amount of
main memory is sufficient.  In these applications, video-DRAM based
memory designs can perform 4-cycle accesses with single-cycle burst
accesses for the instruction stream.  The BTC is then used to cover the
latency of a branch while the first instruction access is performed to
the memory.
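
To spell out that timing, here is a small C sketch using the figures
just given (4-cycle first access, single-cycle burst, one instruction
issued per cycle); the arithmetic, not the code, is the point:

    /* A 4-entry BTC line vs. a 4-cycle video-DRAM first access: the
     * four cached target instructions issue while the external fetch
     * completes, so a taken branch that hits sees no stall. */
    #include <stdio.h>

    int main(void)
    {
        int btc_words      = 4;   /* instructions held per BTC entry */
        int initial_access = 4;   /* cycles until the first burst word arrives */

        int stall = initial_access - btc_words;
        if (stall < 0)
            stall = 0;
        printf("stall cycles on a BTC hit: %d\n", stall);   /* 0: latency covered */
        return 0;
    }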

+-----
| I quote from the McFarling and Hennessy paper "Reducing the cost of
| branches" (13th Annual Int. Symp. on Computer Architecture, June 1986) the
| following statement: "In our simulations, we noticed that a direct mapped
| BTB and an instruction cache of the same size had about the same hit ratio."
+-----
It is very hard to generalize the relative performance of a BTC vs a
standard instruction cache for all machines and cache sizes. The term
"hit ratio" doesn't even mean the same thing.  When we calculate hit
ratio for the BTC, it is done only for branch instructions and other
"branch-like" operations, such as interrupt vectoring and returning.  If
there is a miss, the subsequent instructions which are then cached are
*not* counted as hits, as would be the case for a standard instruction cache.
Given the limited chip area we could afford, the BTC gives us better
overall *performance* than a similar-sized instruction cache.
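
A made-up example of the definitional difference; every trace number in
this C fragment is invented purely for illustration:

    /* The BTC "hit ratio" is computed over branch-target fetches only,
     * while an I-cache hit ratio is computed over all fetches, so the
     * two percentages are not directly comparable. */
    #include <stdio.h>

    int main(void)
    {
        long fetches        = 10000;  /* total instruction fetches (invented) */
        long branch_targets = 1500;   /* fetches that are branch/vector targets */
        long target_hits    = 900;    /* of those, found in the BTC */
        long fetch_hits     = 9400;   /* of all fetches, found in an I-cache */

        printf("BTC hit ratio:     %.1f%% (targets only)\n",
               100.0 * target_hits / branch_targets);
        printf("I-cache hit ratio: %.1f%% (all fetches)\n",
               100.0 * fetch_hits / fetches);
        return 0;
    }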

+-----
| It will take time and considerable further development before AMD can supply
| performance data from running large benchmarks under real system conditions,
| and I understand that we'll have to make do with what we've got.  However,
| the only benchmark of potentially meaningful size from AMD (simasm) performs
| at substantially less than 17 (780-relative) MIPS even with an external cache
| and burst-mode memory, and as I understand it, these results do not take
| multiprogramming cache effects or finite memory write speed into account.
+-----

Here are some numbers for some "substantial" programs.

	(*NOTE* -- all MIPS numbers are Am29000 MIPS.  17 Am29000 MIPS ~=
	 15 VAX 11/780 MIPS, but again, this is hard to generalize. 
	 Craig correctly points out that multiprogramming cache effects
	 were not taken into account, mainly because to do so accurately
	 requires a full multitasking kernel simulation and a "general"
	 simulated machine load, whatever that is.)

	First system specification:
	
	separate, 64K byte external instruction & data caches -- 2 cycle
	access with single-cycle burst capability.  4-cycle main memory
	*access time* with single-cycle burst (i.e., video DRAM).  Data
	cache is write-through with a *single* write-buffer (finite
	memory write speed *is* taken into account).

Assembler:
Statistics of "simasm" simulation:

User Mode:		  377268 cycles	(0.01509072 seconds)
Supervisor Mode:	   25497 cycles	(0.00101988 seconds)
Total:			  402765 cycles	(0.01611060 seconds)
Simulation speed:	 16.25 MIPS (1.54 cycles per instruction)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
nroff:
Statistics of "nroff" simulation:

User Mode:		  126780 cycles	(0.00507120 seconds)
Supervisor Mode:	    3226 cycles	(0.00012904 seconds)
Total:			  130006 cycles	(0.00520024 seconds)
Simulation speed:	 17.99 MIPS (1.39 cycles per instruction)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff:
Statistics of "diff" simulation:

User Mode:		  397135 cycles	(0.01588540 seconds)
Supervisor Mode:	     610 cycles	(0.00002440 seconds)
Total:			  397745 cycles	(0.01590980 seconds)
Simulation speed:	 18.33 MIPS (1.36 cycles per instruction)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

	Second system specification:
	
	NO external caches. Video-DRAM main memory, 4-cycle access,
	single-cycle instruction burst. 4 cycle loads and stores. (loads
	and stores are not, as yet, scheduled by the compiler.)

Assembler:
Statistics of "simasm" simulation:

User Mode:		  478285 cycles	(0.01913140 seconds)
Supervisor Mode:	   35565 cycles	(0.00142260 seconds)
Total:			  513850 cycles	(0.02055400 seconds)
Simulation speed:	 12.90 MIPS (1.94 cycles per instruction)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
nroff:
Statistics of "nroff" simulation:

User Mode:		  151836 cycles	(0.00607344 seconds)
Supervisor Mode:	    3998 cycles	(0.00015992 seconds)
Total:			  155834 cycles	(0.00623336 seconds)
Simulation speed:	 14.62 MIPS (1.71 cycles per instruction)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff:
Statistics of "diff" simulation:

User Mode:		  526944 cycles	(0.02107776 seconds)
Supervisor Mode:	     650 cycles	(0.00002600 seconds)
Total:			  527594 cycles	(0.02110376 seconds)
Simulation speed:	 13.15 MIPS (1.90 cycles per instruction)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As you can see, a substantial fraction of the potential Am29000
performance can be had *without caches* (and most of the difference
would be made up by adding a data cache only).
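
As a sanity check, the reports above are internally consistent: at a 25
MHz clock, native MIPS is just 25 divided by the cycles-per-instruction
figure.  A small C fragment for the cached "simasm" run (the instruction
count is back-computed from the reported CPI, so the result matches the
reported 16.25 MIPS only up to rounding):

    /* Consistency check on the simulation reports: MIPS = clock / CPI. */
    #include <stdio.h>

    int main(void)
    {
        double clock_hz = 25e6;              /* 25 MHz Am29000 clock */
        double cycles   = 402765.0;          /* total cycles, cached simasm run */
        double instrs   = cycles / 1.54;     /* reported 1.54 cycles/instruction */

        double seconds = cycles / clock_hz;
        printf("time: %.8f s, rate: %.2f native MIPS\n",
               seconds, instrs / seconds / 1e6);   /* ~16.2 */
        return 0;
    }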

+-----
| I'm at a disadvantage in quoting performance data for the AMD part, but the
| branch target cache miss rate on the larger benchmarks is in the 50% range,
| is it not? That would mean that much of the branch behaviour of these larger
| programs is not simple loops, and the analogy of the AMD 29k BTC holding 32
| loops of any size isn't really valid.
+-----
Yes, we attempt to cache *all* branch targets, not just loops.  They are
a mix of conditional branch, unconditional branch, and procedure call
targets, as well as interrupt vector heads and instructions following a
page boundary crossing.
 
More numbers:

Assembler:	Branch cache hit ratio:	 42.59% (our lowest)
nroff:		Branch cache hit ratio:	 59.88%
diff:		Branch cache hit ratio:	 86.49%


+-----
|                                   ... What looks bad for the AMD part is
| that branches that miss in the BTC get a four-cycle delay. (Please correct
| me if I'm wrong on this, but I'm assuming that the 4 words in each BTC entry
| reflect that it will take at least 4 cycles to start up again after a BTC
| miss.) At the risk of contradicting a statement made earlier in this message,
| that would cause the average branch time to be about 3.5 cycles.
+-----
No, any BTC misses stall the pipeline only for the amount of time it
takes to fetch the first instruction from external memory. 

+-----
| If I could ask, what happens to the Am29k bus at 40 to 55 MHz?
+-----
It turns into a giant antenna, wiping out everyone's radio reception for
a radius of 10 miles ;-)

	Tim Olson
	Processor Strategic Planning
	Advanced Micro Devices
	(tim@amdcad.amd.com)
	

bcase@apple.UUCP (Brian Case) (05/18/87)

In article <397@dumbo.UUCP> hansen@mips.UUCP (Craig Hansen) writes:
>Let me start by apologizing for the tone of my last message - I'll try to be
>less caustic and more informative this time.

Yeah, I should apologize too for over-reacting.

>In article <767@apple.UUCP>, bcase@apple.UUCP (Brian Case) writes:
> (about how a branch target cache performs comparably to an instruction
>    cache, in a real system environment)
>> it can perform comparably because, logically, part of the instruction cache
>> is in the external rams.  the branch target cache is only there to cover the
>> latency of the initial access in that external ram
>My point was that the four words of information in the BTC are not sufficient
>to cover the latency of DRAM accesses in a real system environment. I would
>freely acknowledge that it _is_ sufficient if the next-level store is SRAM -
>however, in that case, the BTC is a _supplement_ to the SRAM-based
>instruction cache. In fact, the system that is simulated by AMD in their
>performance simulation has an external cache.

How about 80ns 256K DRAMs, interleaved two-way?  With these RAMs (available
at Fry's for $7 each, so probably for $4 from a real supplier), the Am29000
need have only *two* instructions in each branch target entry.

We're not even talking about video DRAMs, which are probably cheaper than
these high-speed guys.  But, video DRAMs are available at least
in 120 ns versions, three 25 MHz Am29000 cycles, so it should still be
possible, even considering that extra control logic may add a cycle to
the first access.  I know that such RAMs are more expensive than
pedestrian 256K DRAMs (by about a factor of 2), and this is a real
consideration for some people, but a full main memory system can be built
for real computers with these parts (look inside a Macintosh Plus or
Mac II (UNIX, but not fantastic performance) for a lesson in minimalist
design).

>As to defending a traditional instruction cache relative to the branch
>target cache, this is an apples-and-oranges comparison. The instruction cache
>leaves main memory bandwidth free for other purposes, such as data
>references, memory refresh, and I/O traffic. What I could compare is the

But, with separate instruction and data paths to the processor (the case
of the Am29000), leaving main memory bandwidth free for other purposes
is much less of an issue.  Things like I/O traffic had better not be directly
visible on the Am29000 bus (or the processor/memory bus of ANY high speed
processor chip) if high performance is to be obtained.

>the sign bit of a register, so we have to add time for the comparison
>operations (.36 cycles for compare equal/not equal and .08 for full
>arithmetic compares), making the real total 1.38+.36+.08 = 1.82 cycles per
>branch. (Numbers are from the paper, and assume an 80% hit ratio in the BTC.)

It is not fair to add the time taken to do the compare to the time taken
for the branch:  sometimes the compare is overlapped with something else
(it can fit in the delay slot of another delayed branch or can be overlapped
with a load or store).  However, since this overlapping will not hide
all compares (or perhaps even a significant portion of them), you are right
to charge some cost for them relative to a MIPSCO-type branch.
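
That adjustment is easy to parameterize.  In this C fragment the
overlap fraction is a free parameter, not a measured figure; the
per-branch compare costs are the paper's numbers quoted earlier:

    /* Only compares that cannot be hidden in a delay slot or under a
     * load/store should be charged against the branch. */
    #include <stdio.h>

    int main(void)
    {
        double base         = 1.38;          /* 64-entry BTB cycles/branch */
        double compare_cost = 0.36 + 0.08;   /* compare cycles/branch from the paper */
        double f;

        for (f = 0.0; f <= 1.0; f += 0.25)   /* fraction of compares overlapped */
            printf("overlap %.2f -> %.2f cycles/branch\n",
                   f, base + (1.0 - f) * compare_cost);
        return 0;
    }
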
>In summary:
>
>Branch Scheme		average cycles per branch
>=======================================================
>Fast Compare (R2k)		1.59
>64-entry BTC/sign compare (29k)	1.82
>
>Selecting between these schemes, holding all else constant, gives the fast
>compare scheme about a 2% edge in overall performance over the BTC scheme.
>However, McFarling & Hennessy give this warning about the fast compare
>scheme: "...the timing of the simple compare is a concern, because it must
>complete in time to change the instruction address going out in the next
>cycle. This could easily end up as the critical path controlling instruction
>cache access and, hence, cycle time." They are right; the timing was a
>concern, and we had to put in special hardware to perform the fast compare
>and to quickly translate branch addresses to physical addresses.  It is this
>translation hardware (Micro-TLB) that adds about 1% of a cycle per
>instruction or 6% per branch.  However, the critical paths are on-chip ones,
>and can be expected to scale with the technology.

Yeah, MIPSCO may have done a good thing here; we'll all have to wait for
the next technologies to know for sure.  I also think the MIPSCO approach
reflects the facts that: (1) you guys are going for a slightly (if not
significantly) different market and (2) you guys have control over almost
everything in your system environment, by virtue of being able to write
the compilers, write the operating systems, and build the system hardware.
This is a significant advantage in the UNIX market.  At AMD, things were
much less under control, so we opted for features that are easy to
understand, clearly scalable with technology (no ifs, ands, or buts),
and modular (i.e., it is easy to take the TLB out of the Am29000; maybe
it is as easy to take it out of the MIPS part, I haven't really thought about
it).  The separate instruction and data buses clearly scale easily with
clock speed, at least up to a point.  30 MHz Am29000s will be no problem
and should be available (I am speculating) soon after 25 MHz parts (one
has to wonder if the RAMs can keep up, but this isn't just the Am29000's
problem).

>It will take time and considerable further development before AMD can supply
>performance data from running large benchmarks under real system conditions,
>and I understand that we'll have to make do with what we've got.  However,
>the only benchmark of potentially meaningful size from AMD (simasm) performs
>at substantially less than 17 (780-relative) MIPS even with an external cache
>and burst-mode memory, and as I understand it, these results do not take
>multiprogramming cache effects or finite memory write speed into account.

I think Tim responded to this with some hard data.  Perhaps there was some
confusion (and I didn't do anything to clear it up before):  the Am29000/
VDRAM combination *IS* lower performance than the Am29000/cache combination,
at least on average.  However, for some graphics benchmarks, at least,
the Am29000/VDRAM combination would surprise you (anyway, it surprised
me and I tend (as of late) to be optimistic!).  The Am29000 probably makes
a good UNIX-box CPU too, but this is much more debatable until a proof-by-
existence can be constructed.

>I'm at a disadvantage in quoting performance data for the AMD part, but the
>branch target cache miss rate on the larger benchmarks is in the 50% range,
>is it not? That would mean that much of the branch behaviour of these larger
>programs is not simple loops, and the analogy of the AMD 29k BTC holding 32
>loops of any size isn't really valid.

Well, holding 32 loops of any size isn't really a big win anyway since we
all know that loops tend to be smaller than "any size."  I guess I
overstated that one a bit.

>What looks bad for the AMD part is
>that branches that miss in the BTC get a four-cycle delay. (Please correct
>me if I'm wrong on this, but I'm assuming that the 4 words in each BTC entry
>reflect that it will take at least 4 cycles to start up again after a BTC
>miss.) At the risk of contradicting a statement made earlier in this message,
>that would cause the average branch time to be about 3.5 cycles.

You are off here.  The 4 words in each BTC entry are there so that *up to*
4 cycles of initial latency can be overlapped.  If the external memory
responds to the initial request sooner, so much the better, especially
when the BTC misses.  So, a BTC miss incurs the latency of the external
instruction memory, not a fixed 4 cycles.
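
So the miss cost is a property of the memory system, not of the BTC
line size.  A small C fragment with some example first-access
latencies (the values are illustrative, not measurements):

    /* A BTC miss stalls for the external memory's first-access
     * latency, whatever it is; a hit hides up to 4 cycles of it. */
    #include <stdio.h>

    int main(void)
    {
        int latencies[] = { 2, 3, 4, 6 };    /* example first-access latencies */
        int btc_words   = 4;                 /* cycles a BTC hit can hide */
        int i;

        for (i = 0; i < 4; i++) {
            int lat = latencies[i];
            int hit = (lat > btc_words) ? lat - btc_words : 0;
            printf("latency %d: miss stalls %d cycles, hit stalls %d\n",
                   lat, lat, hit);
        }
        return 0;
    }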

>As to what MIPS will do to make external caches work at 25, 30, 35 MHz,
>I'm afraid this isn't the right forum to be discussing our future
>products.

Understood; anyway I didn't mean that the MIPS bus was un-fixable.
Clearly there are (some easy) things which will fix the problem.

>We have publicly stated that we will improve the performance
>of our products at a rate of doubling performance every eighteen months,
>and our current product plans are running faster than that rate.
>If I could ask, what happens to the Am29k bus at 40 to 55 MHz?

Well, if it were left alone, then its cycle time would scale and a device
would have only 25 to 18 ns to respond.  25 is sorta reasonable, but
18 sounds pretty silly.  Probably the buses will have to be made wider
so that adequate bandwidth can be had with reasonable (bus) cycle times.
By the time technology (at least at AMD, and I am speculating now since
I don't work there anymore) gets to Am29000s with 55 MHz clocks, there
will be better packaging technology in the mainstream (which is where
it must be if commodity parts like the Am29000 are to use it).

    bcase

bcase@apple.UUCP (Brian Case) (05/19/87)

In article <781@apple.UUCP>, bcase@apple.UUCP (Brian Case) writes:

A bunch of garbage.  I don't know what drug I was on, but I won't be
trying that one again.  Let me summarize some corrections pointed out by
Phil Ngai.

> How about 80ns 256K DRAMs, interleaved two-way?  With these RAMs (available
> at Fry's for $7 each, so probably for $4 from a real supplier), the Am29000
> need have only *two* instructions in each branch target entry.

Well, I got the density and the price right, but that is about it.  There is
*no* way an 80 ns DRAM can give a 2-cycle first access.  The Am29000, like
any processor, has clock-to-address-valid and set-up times.  Sheesh, sorry.
These DRAMs interleaved are still interesting, but because of DRAM cycle-
time constraints, more than two-way interleaving is probably necessary.
> We're not even talking about video DRAMs, which are probably cheaper than
> these high-speed guys.  But, video DRAMs are available at least
> in 120 ns versions, three 25 MHz Am29000 cycles, so it should still be
> possible, even considering that extra control logic may add a cycle to
> the first access.  I know that such RAMs are more expensive than
> pedestrian 256K DRAMs (by about a factor of 2), and this is a real
> consideration for some people, but a full main memory system can be built
> for real computers with these parts (look inside a Macintosh Plus or
> Mac II (UNIX, but not fantastic performance) for a lesson in minimalist
> design).

It was pointed out that (at least some) VDRAMs have some bizarreness that
lengthens the apparent first access time.  Also, I seem to have implied
that the Mac uses VDRAMs for its main memory.  This is clearly not the
case!  I simply meant to say that the Mac is an example of minimalist
design (mostly for cheap, easy manufacturability).

> >We have publicly stated that we will improve the performance
> >of our products at a rate of doubling performance every eighteen months,
> >and our current product plans are running faster than that rate.
> >If I could ask, what happens to the Am29k bus at 40 to 55 MHz?
> 
> Well, if it were left alone, then its cycle time would scale and a device
> would have only 25 to 18 ns to respond.  25 is sorta reasonable, but
> 18 sounds pretty silly.

Sigh, again, at any clock speed, the time given to a memory to respond is
reduced by clock-to-address-valid and set-up times.  At 40 to 55 MHz,
assuming all the relevant timing parameters scale (this is what everyone
assumes), the Am29000 would give about 9 to 7 ns for a memory to respond
in one cycle.  Not much, but burst mode is much better, leaving about 20
to 15 ns for each sequential access.  This is the price you have to pay
for high performance.
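
The budget arithmetic, in the same spirit; the overhead fractions in
this C fragment are back-solved from the windows quoted above and are
illustrative, not datasheet values:

    /* Response windows at higher clock rates, assuming the overheads
     * (clock-to-address-valid plus set-up) scale with the clock. */
    #include <stdio.h>

    int main(void)
    {
        double mhz[] = { 40.0, 55.0 };
        int i;

        for (i = 0; i < 2; i++) {
            double cycle = 1000.0 / mhz[i];   /* ns per cycle */
            printf("%.0f MHz: cycle %.1f ns, single-cycle window %.1f ns, "
                   "burst window %.1f ns\n",
                   mhz[i], cycle, cycle * 0.37, cycle * 0.80);
        }
        return 0;
    }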

Again, with tail between my legs, let me apologize for so fervently
spouting rubbish.

    bcase