[comp.arch] cache speed

mlewis@unocss.UUCP (Marcus S. Lewis) (08/17/89)

Dumb question time.  (I hate it when I'M the one asking the dumb questions.)

A co-worker is building a "no compromise" performance-oriented '386
MSDOS machine. Aside from that obvious contradiction, I have a real
question and comp.arch is the only place I expect to get an answer.

How fast do the cache chips have to be on a 33MHz 386? For main memory,
I have always believed in the "reciprocal of the clock speed" rule for
chip speed, but the caches offered on the systems under consideration
are 15-20 ns chips.  Now, I am informed by this same performance-
oriented person that the cache is all dual-ported and thus has to run
twice as fast.  I have seen a lot of discussion of multi-ported memory
here, but never a mention of what is required to pull it off.
I think what I want is a very brief tutorial on what it means to multi-
port memory.  I know WHY, but I don't know HOW.
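
To show exactly what arithmetic I'm doing, here is a little C sketch;
the 33 MHz figure is from the board, and the factor of two is purely my
co-worker's dual-port claim, not anything I can vouch for:

    /* Back-of-the-envelope for the "reciprocal of the clock" rule.
     * The 2x factor is the dual-port claim, not an established rule. */
    #include <stdio.h>

    int main(void)
    {
        double clock_mhz = 33.0;
        double cycle_ns  = 1000.0 / clock_mhz;  /* one clock period */

        printf("clock period:           %.1f ns\n", cycle_ns);       /* ~30.3 */
        printf("claimed dual-port need: %.1f ns\n", cycle_ns / 2.0); /* ~15.2 */
        return 0;
    }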

"Thank you for your support"
Marc

-- 
What the hell do I need a sig for?    |  Internet: cs057@zeus.unl.edu      
                                      |  UUCP:     uunet!btni!unocss!mlewis
Go for it!                            |  Bitnet:   CS057@UNOMA1            
---------------------------------------------------------------------------

roy@phri.UUCP (Roy Smith) (08/18/89)

In article <1473@unocss.UUCP> mlewis@unocss.UUCP (Marcus S. Lewis) writes:
> A co-worker is building a "no compromise" performance-oriented '386
> MSDOS machine. [...] How fast do the cache chips have to be on a 33MHz 386?

	Cache is, by definition, a compromise.  If you really want to build
a "no compromise" machine, make the entire main memory out of SRAM fast
enough to keep up with the CPU.
-- 
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
{att,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@alanine.phri.nyu.edu
"The connector is the network"

oconnordm@CRD.GE.COM (Dennis M. O'Connor) (08/18/89)

roy@phri (Roy Smith) writes:
] Cache is, by definition, a compromise.  If you really want to build
]a "no compromise" machine, make the entire main memory out of SRAM fast
]enough to keep up with the CPU.

Sorry, not true.  In a large memory system, propagation delays, buffer
latencies and the like rapidly become the single largest component
of your memory latency.  Buffers add delay, more chips on one line
add capacitance, and eventually even the 1-nanosecond-per-foot
speed-of-light delay becomes a factor.

Once you start adding off-chip, off-board, down-the-bus,
onto-the-board, onto-the-RAM-chip, off-the-RAM-chip, off-board,
up-the-bus, onto-the-board and back-onto-the-chip delays,
you begin to really want on-chip cache.  Lacking that,
even on-same-board cache looks good.  And even lacking that,
on-only-a-few-boards cache looks better than an in-only-a-few-cabinets
large RAM system.

Cache can give you higher performance than a single-level
memory system by exploiting the probability that your program
really isn't using ALL that memory at once.  Smaller amounts of RAM,
due to the laws of physics, are faster than large amounts,
all other things being equal.

Summary : it's currently impossible to build a 16-megabyte NMOS,
CMOS, or TTL memory system with a 25ns access time. But with a good
cache design, you can build a system with almost the same performance.
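
To put a number on "almost the same performance", here is a minimal
sketch of the usual effective-access-time figure; the 95% hit rate and
the 25ns/150ns access times are invented for illustration, not
measurements of any real machine:

    /* Effective access time of a cache backed by slower main memory:
     *   t_eff = hit_rate * t_cache + (1 - hit_rate) * t_memory
     * All three numbers below are illustrative assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double t_cache  = 25.0;   /* ns, small fast SRAM cache    */
        double t_memory = 150.0;  /* ns, big main memory          */
        double hit_rate = 0.95;   /* plausible for a decent cache */

        double t_eff = hit_rate * t_cache + (1.0 - hit_rate) * t_memory;
        printf("effective access time: %.2f ns\n", t_eff);  /* 31.25 ns */
        return 0;
    }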



--
 Dennis O'Connor      OCONNORDM@CRD.GE.COM       UUNET!CRD.GE.COM!OCONNORDM
	 "Did you exchange
	  a walk-om part in the war,

mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (08/18/89)

In article <1736@crdgw1.crd.ge.com>, Dennis O'Connor writes:
>Summary : it's currently impossible to build a 16-megabyte NMOS,
>CMOS, or TTL memory system with a 25ns access time. But with a good
>cache design, you can build a system with almost the same performance.

Well, it is not exactly 25ns, but each of the 4 CPUs of our ETA-10G
at FSU has 32 MBytes of 30ns CMOS SRAM.  The time required for a load
is about 6 cycles, at 7ns/cycle.  Subtracting the instruction issue time
(1-2 cycles?), this is getting awfully close to 30ns.
They never could get the 128MByte memory arrays working at 7ns....
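
The arithmetic, with my uncertainty about the issue time spelled out
(a sketch of my estimate, nothing official):

    /* ETA-10G scalar load estimate: 6 cycles at 7 ns/cycle,
     * minus 1-2 cycles of instruction issue time (my guess). */
    #include <stdio.h>

    int main(void)
    {
        double cycle_ns = 7.0;
        double load_ns  = 6.0 * cycle_ns;  /* 42 ns, issue included */

        printf("minus 1 issue cycle:  %.0f ns\n", load_ns - 1.0 * cycle_ns); /* 35 */
        printf("minus 2 issue cycles: %.0f ns\n", load_ns - 2.0 * cycle_ns); /* 28 */
        return 0;
    }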

--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu - mccalpin@nu.cs.fsu.edu
		   mccalpin@delocn.udel.edu

davidb@braa.inmos.co.uk (David Boreham) (08/21/89)

In article <MCCALPIN.89Aug18054110@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>In article <1736@crdgw1.crd.ge.com>, Dennis O'Connor writes:
>>Summary : it's currently impossible to build a 16-megabyte NMOS,
>>CMOS, or TTL memory system with a 25ns access time. But with a good
>>cache design, you can build a system with almost the same performance.
>
>Well, it is not exactly 25ns, but each of the 4 CPUs of our ETA-10G
>at FSU has 32 MBytes of 30ns CMOS SRAM.  The time required for a load
>is about 6 cycles, at 7ns/cycle.  Subtracting the instruction issue time
>(1-2 cycles?), this is getting awfully close to 30ns.
>They never could get the 128MByte memory arrays working at 7ns....

Are you sure that the actual *memory* subsystem was cycled in 30ns?
As the previous poster pointed out, you can get 20-25ns SRAMs, but
building a 32MByte system which randomly cycles in 30ns out of CMOS
sounds rather unlikely.  Perhaps the memory is interleaved?
Also, surely the instruction fetch would be pipelined to overlap
with the previous instruction?
Anyway, I wonder how many others share my concern that SRAM
designers work to achieve cycle times equal to their access times,
whereas very few systems can actually use a cycle time that fast?
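
For what it's worth, here is a toy C model of how interleaving lets
slow-cycling parts sustain a fast reference rate; the 4 banks, 120ns
bank cycle, and 30ns issue rate are all invented numbers, not a claim
about the ETA-10:

    /* Toy model of N-way interleaving: a reference may start in a
     * bank only after that bank's previous cycle completes.  With
     * 4 banks of 120ns cycle time, one reference per 30ns is
     * sustained on a sequential stream.  All numbers invented. */
    #include <stdio.h>

    #define NBANKS 4

    int main(void)
    {
        double bank_cycle_ns = 120.0;
        double busy_until[NBANKS] = { 0.0 };
        double t = 0.0;
        int i;

        for (i = 0; i < 16; i++) {      /* 16 sequential references   */
            int bank = i % NBANKS;      /* low address bits pick bank */
            if (t < busy_until[bank])
                t = busy_until[bank];   /* stall on a bank conflict   */
            busy_until[bank] = t + bank_cycle_ns;
            t += 30.0;                  /* next issue slot            */
        }
        printf("16 references issued by %.0f ns\n", t);  /* 480 ns */
        return 0;
    }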

David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol,  England            |      (us): uunet!inmos-c!davidb
+44 454 616616 ex 543        | Internet : @col.hp.com:davidb@inmos-c

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (08/22/89)

>	Cache is, by definition, a compromise.  If you really want to build
>a "no compromise" machine, make the entire main memory out of SRAM fast
>enough to keep up with the CPU.
>-- 
>Roy Smith, Public Health Research Institute
>455 First Avenue, New York, NY 10016
>{att,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@alanine.phri.nyu.edu


Now, then, what about these fast new DRAMs?

IBM announced a 16ns part, followed by Hitachi announcing a 20ns part.
I believe that they were reasonably large (1 Mbit -- I just left the
latest of a series of articles at home).

Apparently this big jump in DRAM performance is attained by just doing
things the sensible, brute-force way -- no more multiplexed address
lines, plus a bit of bipolar on the CMOS memory chip for drivers.

Anyone have more details?  Anyone care to speculate on what faster DRAMs
will do for computer architecture?  Has anyone run simulations, either
hardware (faster DRAMs let me do away with cache), or, probably more important,
economic (fast DRAMs with lotsa pins will ride the technology curve down 
1 yr? 2 yrs? behind regular DRAMs)?

--
Andy "Krazy" Glew,  Motorola MCD,    	    	    aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.          {uunet!,}uiucuxc!udc!aglew
   
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

mlewis@unocss.UUCP (Marcus S. Lewis) (08/22/89)

From article <1736@crdgw1.crd.ge.com>, by oconnordm@CRD.GE.COM
 (Dennis M. O'Connor):
> roy@phri (Roy Smith) writes:
> ] Cache is, by definition, a compromise.  If you really want to build
 
> Summary : it's currently impossible to build a 16-megabyte NMOS,

Gentles, please, my apologies.  I had no intention of starting a war here
of all places.  In my original post, I mentioned a co-worker "building"
a hi-perf 386.  This is, at least in THIS forum, a misstatement.  I
have received lots of mail, almost all of which at least ended by asking
why not use the 386 cache controller.  Now I see my mistake.

He is not _DESIGNING_ a system.  He is looking over motherboards, and I
was questioning (a) why he needed 128K cache, and (b) why it needed to
be 15 ns for a 33 MHz CPU.  Especially since the board has the 15 ns
chips as an option (upgrade from 25 ns).  I'm trying to save my company
some money here.  He's buying the stuff, but I get to assemble it into a
machine.  We did this once before, with a Texas MicroSystems 16 MHz CPU
in a passive backplane.  He wants 33 MHz, and the only place he can get
it NOW is as a MB system. 

We have a "strategic product", a voice-response system based on this
machine, currently running 8 channels under MS-DOS, and the CPU is
having trouble keeping up.  The next step is to get some faster voice
boards, port the existing system to a faster setup, then convert the
whole shot to some Unix variant.  My contention is that 128K cache MAY
help in the MS-DOS version, since it is one humongo program, but in the
Unix version it will be multiple tasks, scheduled by Unix, in which case
I doubt that the cache is going to be all that useful, at least not in
that size. 

Is this clearer?  I am still somewhat in the dark about some of the
terminology you folks use, having lurked here for over a year.  Like
(dumb, yes, but I'm a system manager): what do you mean by a cache
line?  And how does the controller deal with an invalidation: is there
a penalty on a real miss, and if so does a larger cache increase that
penalty?
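
So you can see where my understanding stands (and correct it), here is
my guess at the mechanics in C; the direct mapping and the 16-byte line
size are pure assumptions on my part, only the 128K total comes from
the board:

    /* My guess: a "cache line" is the unit the cache moves and tags.
     * Direct-mapped 128K cache, 16-byte lines -- both line size and
     * mapping are assumptions on my part, corrections welcome. */
    #include <stdio.h>

    #define CACHE_BYTES (128L * 1024L)
    #define LINE_BYTES  16L
    #define NLINES      (CACHE_BYTES / LINE_BYTES)  /* 8192 lines */

    int main(void)
    {
        unsigned long addr   = 0x0012ABCDUL;
        unsigned long offset = addr % LINE_BYTES;             /* byte within line */
        unsigned long index  = (addr / LINE_BYTES) % NLINES;  /* which line slot  */
        unsigned long tag    = addr / (LINE_BYTES * NLINES);  /* disambiguates    */

        /* A miss (or an invalidated line) means refilling all
           LINE_BYTES bytes, so a longer line raises that penalty. */
        printf("tag %#lx, index %lu, offset %lu\n", tag, index, offset);
        return 0;
    }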

But please don't argue about it.
Thanks a bunch.  This has been a very productive lurk.
Marc

-- 
What the hell do I need a sig for?    |  Internet: cs057@zeus.unl.edu      
                   preferred machine->|  UUCP:     uunet!btni!unocss!mlewis
Go for it!                            |  Bitnet:   CS057@UNOMA1            
---------------------------------------------------------------------------

mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (08/22/89)

In article <MCCALPIN.89Aug18054110@masig3.ocean.fsu.edu> I wrote that
the memory subsystem on the ETA-10G had an access time near 30 ns.

In article <1878@brwa.inmos.co.uk> davidb@braa.inmos.co.uk (David
Boreham) replied:
>Are you sure that the actual *memory* subsystem was cycled in 30ns ?

I'm not sure how to answer this, but I will make the following notes:
(1) There is no cache.
(2) The memory is definitely interleaved.  I don't remember how many
    banks, or what the bank busy time is.
(3) A scalar load instruction requires either 6 or 8 cycles -- I don't
    remember which.
(4) The instruction decode is probably overlapped with the previous
    instruction.

So I conclude that the memory access time is 40-50 ns. Not as good as
I thought originally, but still fairly impressive.  It will be even
more impressive if they can deliver the 128 MB array at those speeds.
I hear that the 128 MB array works fine at 10.5 ns, and they are trying
to tweak it to run at 7ns to deliver to FSU.

A similar access time number can be obtained from the vector unit
timings.  The vector startup overhead is 16 cycles in the best case.
This consists of the time to load the first element of each array,
plus the pipeline length (5 cycles ???), plus the time required to put
the first result back to memory, plus any other work that can't be
overlapped.  This suggests an access time pretty close to 5-6 cycles....
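
Spelling that budget out (the 5-cycle pipeline length is a guess, as
noted, and I've assumed the leftover splits evenly between the first
load and the first store):

    /* Vector startup budget from the guesses above:
     *   16 cycles = load first element + pipeline (5?, a guess)
     *             + store first result + non-overlapped work (~0?) */
    #include <stdio.h>

    int main(void)
    {
        int startup  = 16;  /* best-case startup, cycles (measured) */
        int pipeline = 5;   /* pipeline length -- a guess           */
        int leftover = startup - pipeline;  /* load + store budget  */

        printf("per-access share: ~%d cycles\n", leftover / 2);  /* ~5 */
        return 0;
    }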

>As the previous poster pointed out, you can get 20--25ns SRAMs, but
>to build a 32Mbyte system which randomly cycles in 30ns, using CMOS
>sounds rather unlikely. Perhaps the memory is interleaved ?

The previous timings are based on the absence of memory bank conflicts.
The memory is CMOS SRAM.

>Also, surely the instruction fetch would be pipelined to overlap
>with the previous instruction ?

Most likely, but I don't have a reference on what exactly is overlapped
on that machine.  It seems that the details of the Cyber 205 are much
more widely known than the details of the ETA-10.
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu - mccalpin@nu.cs.fsu.edu
		   mccalpin@delocn.udel.edu

slackey@bbn.com (Stan Lackey) (08/22/89)

In article <1878@brwa.inmos.co.uk> davidb@inmos.co.uk (David Boreham) writes:
>Anyway, I wonder how many others share my concern that SRAM
>designers work to achieve cycle times equal to the access times,
>whereas very few systems can actually use a cycle time that fast ?

The thing that bothers me is that designers may be trading off access
time for cycle time.  I read in this n/g that customers demand equal
cycle and access times.  The only way I can imagine this is that those
customers either A) when asked didn't understand the question, B) were
nervous about somebody changing something on them, or C) have the
fantasy that they're someday going to actually use vanilla SRAMs
cycling at their access times (I guess they're either going to get
registers with zero prop delays, zero setup times, and distribute
clocks with zero skew; or have a brilliant design innovation that in
some way breaks a law of physics).

Now if the SRAMs had internal registers on address and data, it would
be a different matter (actually, some do).  Of course, pipelining is
required for this to be useful; the registers just end up making
things worse if the goal is minimum access time rather than maximum
bandwidth.
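
A quick sketch of why a vanilla SRAM can't be cycled at its access time
in a real system; every delay number here is invented, but the budget
is the point:

    /* The registers around a flow-through SRAM eat into the cycle:
     *   t_cycle >= t_clk_to_q + t_access + t_setup + t_skew
     * All delay values below are invented, for illustration only. */
    #include <stdio.h>

    int main(void)
    {
        double t_access   = 25.0;  /* ns, SRAM access time       */
        double t_clk_to_q = 3.0;   /* ns, address register delay */
        double t_setup    = 2.0;   /* ns, data register setup    */
        double t_skew     = 2.0;   /* ns, clock skew budget      */

        printf("real minimum cycle: %.0f ns\n",
               t_access + t_clk_to_q + t_setup + t_skew);  /* 32 ns */
        return 0;
    }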
-Stan

hutch@celerity.uucp (Jim Hutchison) (08/23/89)

>Roy Smith, Public Health Research Institute, writes:
>	Cache is, by definition, a compromise.  If you really want to build
>a "no compromise" machine, make the entire main memory out of SRAM fast
>enough to keep up with the CPU.

Yes, cache is a compromise -- mainly with your wallet and the speed of light.
Electrons, as you know, take time to negotiate their way down a length of
wire.  So with that in mind, some memory will (probably) always be faster
to use than other memory.  As soon as you leave the board, you have to deal
with a bus or a long I/O path.  As soon as you leave the cabinet...  You get
the picture (at least how I see it).  It's a compromise you can't really
avoid if you want to have a lot of memory.

/*    Jim Hutchison   		{dcdwest,ucbvax}!ucsd!celerity!hutch  */
/*    Disclaimer:  I am not an official spokesman for FPS computing   */

rpeglar@csinc.UUCP (Rob Peglar x615) (08/24/89)

In article <MCCALPIN.89Aug22071730@masig3.ocean.fsu.edu>, mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
> In article <MCCALPIN.89Aug18054110@masig3.ocean.fsu.edu> I wrote that
> the memory subsytem on the ETA-10G had an access time near 30 ns.
> 
> In article <1878@brwa.inmos.co.uk> davidb@braa.inmos.co.uk (David
> Boreham) replied:
> >Are you sure that the actual *memory* subsystem was cycled in 30ns ?
> 
> I'm not sure how to answer this, but I will make the following notes:
> (1) There is no cache.

Not in the sense of a general-purpose I/D cache.  There is, however, a
"buffer" (really a cache) for instructions.  The instruction fetch is
rather coarse-grained -- thus, many loops can fit without accessing main
memory.

> (2) The memory is definitely interleaved.  I don't remember how many
>     banks, or what the bank busy time is.

I do, but I can't tell you, lest the hordes of CDC lawyers descend upon
me.  The banking/bank busy may be considered proprietary info, and people
like me (former employees) can be sued.  In my opinion, however, it is
similar to its predecessor machine(s).

> (3) A scalar load instruction requires either 6 or 8 cycles -- I don't
>     remember which.
> (4) The instruction decode is probably overlapped with the previous
>     instruction.

See above.  In my opinion, the scalar load (0x7e {64-bit}, 0x5e {32-bit})
instructions are abysmally slow -- more cycles than you would guess.  This
was, and is, a major factor in the ETA-10's scalar/vector imbalance, an
attribute which left a bad taste in the mouths of a lot of customers
and potential customers.
Just my opinion.

> 
> So I conclude that the memory access time is 40-50 ns. Not as good as
> I thought originally, but still fairly impressive.  It will be even
> more impressive if they can deliver the 128 MB array at those speeds.
> I hear that the 128 MB array works fine at 10.5 ns, and they are trying
> to tweak it to run at 7ns to deliver to FSU.

Only 4 months later than promised, and counting   :-)

> 
> A similar access time number can be obtained from the vector unit
> timings.  The vector startup overhead is 16 cycles in the best case.
> This consists of the time to load the first element of each array,
> plus the pipeline length (5 cycles ???), plus the time required to put
> the first result back to memory, plus any other work that can't be
> overlapped.  This suggests an access time pretty close to 5-6 cycles....

You're in the ballpark.

> 
> >As the previous poster pointed out, you can get 20--25ns SRAMs, but
> >to build a 32Mbyte system which randomly cycles in 30ns, using CMOS
> >sounds rather unlikely. Perhaps the memory is interleaved ?
> 
> The previous timings are based on the absence of memory bank conflicts.
> The memory is CMOS SRAM.
> 
> >Also, surely the instruction fetch would be pipelined to overlap
> >with the previous instruction ?
> 
> Most likely, but I don't have a reference on what exactly is overlapped
> on that machine.  It seems that the details of the Cyber 205 are much
> more widely known than the details of the ETA-10.

You are quite correct, John.  This was, in my opinion, a deliberate strategy
on CDC's part.  


Have fun.


Rob

Rob Peglar	...!uunet!csinc!rpeglar
Manager, Software R&D, Workstation Group
Control Systems, Inc., St. Paul, MN

The opinions expressed herein are solely those of the author.  
Such opinions do not reflect in any way upon my employer.
So there.