[comp.arch] target caching

lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) (02/20/88)

The TF-1 people at IBM intend to use an interesting trick to simplify their
CPU.

DRAMs can be purchased that have "page mode" - that is, you can access the
next-address value much more quickly than a randomly addressed value.  This
is because each random access can leave a large number of bits in a long
register (say, 1024 bits, in the case of a 1Mb RAM). A page-mode access just
shifts the register. 

So, the TF-1 CPU chip will expect another 32 bits of instruction every 20ns.
As long as the PC just upcounts, they claim that page-mode RAMs will be fast
enough.
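The timing argument above can be sketched as a toy model. The numbers here are illustrative assumptions, not actual TF-1 or DRAM datasheet figures (the post gives only the 20ns page-mode cycle and the 1024-bit row):

```python
# Toy model of page-mode DRAM fetch timing: a random access must open
# a new row (slow); further accesses within the open row run at the
# page-mode rate (fast).  T_RANDOM is an assumed figure.

ROW_BITS = 1024                      # bits latched per row (1Mb-RAM example)
WORD_BITS = 32                       # instruction width
WORDS_PER_ROW = ROW_BITS // WORD_BITS

T_RANDOM = 120                       # ns for a full random access -- assumed
T_PAGE = 20                          # ns per page-mode access (from the post)

def fetch_time(addresses):
    """Total fetch time in ns for a sequence of word addresses."""
    total = 0
    open_row = None
    for addr in addresses:
        row = addr // WORDS_PER_ROW
        if row == open_row:
            total += T_PAGE          # same row: page-mode cycle
        else:
            total += T_RANDOM        # new row: full random access
            open_row = row
    return total

# Straight-line code: one row open, then pure page-mode.
sequential = fetch_time(range(32))
# Worst case: every fetch lands in a different row.
scattered = fetch_time(range(0, 32 * WORDS_PER_ROW, WORDS_PER_ROW))
```

With these assumed numbers, 32 sequential fetches cost one row open plus 31 page-mode cycles, while 32 scattered fetches pay the full random-access penalty every time — which is exactly why an upcounting PC keeps the RAMs fast enough.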

When the CPU decides to branch, of course, there's trouble. They solve this
by keeping a cache of the instruction streams at 32 recent branch targets.
If the target PC hits, then they fetch instructions from the cached stream,
until the RAMs have done their random access, and are ready to page-mode
again.
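The branch-target scheme can be sketched as follows. The post only says "32 recent branch targets"; the stream length, the LRU replacement policy, and all names here are assumptions for illustration:

```python
# Sketch of a branch-target stream cache: on a taken branch, look up
# the target PC; on a hit, issue from the cached stream while the DRAM
# performs its slow random access.  Sizes/policy are assumed, not TF-1's.
from collections import OrderedDict

STREAM_WORDS = 6      # instructions cached per target -- assumed to be
                      # enough to cover the DRAM random-access latency

class BranchTargetCache:
    def __init__(self, entries=32):
        self.entries = entries
        self.streams = OrderedDict()        # target PC -> instruction list

    def lookup(self, target_pc):
        """On a branch: return the cached stream, or None on a miss."""
        stream = self.streams.get(target_pc)
        if stream is not None:
            self.streams.move_to_end(target_pc)   # LRU touch (assumed)
        return stream

    def fill(self, target_pc, instructions):
        """On a miss, install the first few instructions at the target."""
        self.streams[target_pc] = instructions[:STREAM_WORDS]
        self.streams.move_to_end(target_pc)
        if len(self.streams) > self.entries:
            self.streams.popitem(last=False)      # evict least recent
```

On a hit, the cached words bridge the gap until the RAMs have opened the new row and can resume page-mode fetch at the target's continuation address.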

I haven't studied the recent RAM offerings well enough to count the cycles,
and critique the speed expectations. I guess it sounds fine, and it does
sound simple. But, there's a major catch: it's a Harvard architecture.  The
memory is code-only, so that grubby data won't spoil the code's pipelined
perfection.

I know that some recent RAM chips are dual-ported, supposedly so that a
processor can write image data through the random port, while a graphics
screen is being refreshed through the page-mode port. Would these chips
allow the TF-1 trick to work in non-Harvard designs?
-- 
	Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

oconnor@sunset.steinmetz (Dennis M. O'Connor) (02/21/88)

An article by lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) says:
] The TF-1 people at IBM intend to use an interesting trick to simplify
] their CPU.

] DRAMs can be purchased that have "page mode" - that is, you can access the
] next-address value much more quickly than a randomly addressed value.  This
] is because each random access can leave a large number of bits in a long
] register (say, 1024 bits, in the case of a 1Mb RAM). A page-mode access just
] shifts the register. 
] So, the TF-1 CPU chip will expect another 32 bits of instruction every 20ns.
] As long as the PC just upcounts, they claim that page-mode RAMs will be fast
] enough.
] When the CPU decides to branch, of course, there's trouble. They solve this
] by keeping a cache of the instruction streams at 32 recent branch targets.
] If the target PC hits, then they fetch instructions from the cached stream,
] until the RAMs have done their random access, and are ready to page-mode
] again.

Well, it may be interesting, but it's not original. GE's own RPM40
already does this (but better, IMHO, than you describe), and I believe
the AMD29000 gives you the CHOICE of doing something like this.

That memory system is not going to be simple, by the way:
branches are not your only problem. You need to handle crossing
page boundaries in your RAM as well. But that's doable.

As described, it's also not going to be Rad-Hard. Dynamic never is.

] I haven't studied the recent RAM offerings well enough to count the cycles,
] and critique the speed expectations. I guess it sounds fine, and it does
] sound simple. But, there's a major catch: it's a Harvard architecture.  The
] memory is code-only, so that grubby data won't spoil the code's pipelined
] perfection.

(Humor mode on) That's not a catch, that's a FEATURE! (HM off).
Seriously folks, at 200MBytes/sec of JUST instruction fetch,
you weren't thinking of sharing that nice, simple, unidirectional
instruction bus with messy old bi-directional data, were you?

] I know that some recent RAM chips are dual-ported, supposedly so that a
] processor can write image data through the random port, while a graphics
] screen is being refreshed through the page-mode port. Would these chips
] allow the TF-1 trick to work in non-Harvard designs ? 

No. The "TF-1 trick" (which was the "RPM40 trick" and the "29000
trick" FIRST, BTW) needs a Harvard architecture, to provide sufficient
bandwidth and, more importantly, to separate nice regular simple
instruction-stream behavior from complex semi-random data access.
] -- 
] 	Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

--
	Dennis O'Connor 	oconnor@sunset.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
    "Nuclear War is NOT the worst thing people can do to this planet."

tim@amdcad.AMD.COM (Tim Olson) (02/21/88)

In article <910@PT.CS.CMU.EDU> lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) writes:
| The TF-1 people at IBM intend to use an interesting trick to simplify their
| CPU.
| 
| DRAMs can be purchased that have "page mode" - that is, you can access the
| next-address value much more quickly than a randomly addressed value.  This
| is because each random access can leave a large number of bits in a long
| register (say, 1024 bits, in the case of a 1Mb RAM). A page-mode access just
| shifts the register. 

This sounds more like Video-DRAM (VRAM) to me.  VRAMs have a
static-column shifter that can shift out the next sequential bit every
cycle without any subsequent addresses, whereas page-mode or
static-column mode must be supplied a partial address for every access.

| When the CPU decides to branch, of course, there's trouble. They solve this
| by keeping a cache of the instruction streams at 32 recent branch targets.
| If the target PC hits, then they fetch instructions from the cached stream,
| until the RAMs have done their random access, and are ready to page-mode
| again.

Wow!  Either there is serendipity involved here, or the TF-1 architects
closely studied the Am29000 Manual -- this is the exact method we use to
keep the pipeline fed during branches -- even the number of entries is
the same!

| I haven't studied the recent RAM offerings well enough to count the cycles,
| and critique the speed expectations. I guess it sounds fine, and it does
| sound simple. But, there's a major catch: it's a Harvard architecture.  The
| memory is code-only, so that grubby data won't spoil the code's pipelined
| perfection.
|
| I know that some recent RAM chips are dual-ported, supposedly so that a
| processor can write image data through the random port, while a graphics
| screen is being refreshed through the page-mode port. Would these chips
| allow the TF-1 trick to work in non-Harvard designs ? 

That's exactly what a VRAM does.  It has effectively two ports: the
random access port (for loads/stores and branch addresses), and the
serial port (for sequential instruction fetches).  This allows a
Harvard-architecture machine to have separate buses for performance,
while maintaining a shared instruction/data memory.
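The two-port arrangement Tim describes can be sketched as a minimal model. This is a simplification for illustration (real VRAM parts have row-transfer timing constraints and other details omitted here):

```python
# Minimal model of a VRAM's two ports: a random-access port for
# loads/stores and branch-target fetches, and a serial port that, once
# a row has been transferred into the shift register, clocks out
# sequential words with no further addressing.

class VRAM:
    def __init__(self, words, row_words=32):
        self.mem = [0] * words
        self.row_words = row_words
        self.shift_reg = []
        self.shift_pos = 0

    # --- random-access port (data side) ---
    def read(self, addr):
        return self.mem[addr]

    def write(self, addr, value):
        self.mem[addr] = value

    # --- serial port (instruction side) ---
    def transfer_row(self, addr):
        """Latch the row containing addr into the shift register."""
        start = (addr // self.row_words) * self.row_words
        self.shift_reg = self.mem[start:start + self.row_words]
        self.shift_pos = addr - start

    def shift_out(self):
        """Clock out the next sequential word -- no address needed."""
        word = self.shift_reg[self.shift_pos]
        self.shift_pos += 1
        return word
```

Note that the shift register holds a snapshot of the row, so random-port writes can proceed concurrently without disturbing the serial instruction stream — which is what lets the two buses stay separate while backing onto one memory.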

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)

bcase@apple.UUCP (Brian Case) (02/22/88)

In article <20482@amdcad.AMD.COM> tim@amdcad.UUCP (Tim Olson) writes:
>In article <910@PT.CS.CMU.EDU> lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) writes:
>| When the CPU decides to branch, of course, there's trouble. They solve this
>| by keeping a cache of the instruction streams at 32 recent branch targets.
>| If the target PC hits, then they fetch instructions from the cached stream,
>| until the RAMs have done their random access, and are ready to page-mode
>| again.
>
>Wow!  Either there is serendipity involved here, or the TF-1 architects
>closely studied the Am29000 Manual -- this is the exact method we use to
>keep the pipeline fed during branches -- even the number of entries is
>the same!

And by the way, AMD has a patent pending on this (Phil Frieden's name leads
the list; thanks Phil!  (hope I spelled your last name right!)).