[comp.arch] fast memories

jsp@b.gp.cs.cmu.edu (John Pieper) (06/06/89)

Why should we go to custom memory chips?

With an on-chip cache, there are several ways to implement fast memory without
going to custom chips. The processor needs to support out-of-order memory
operations to support some of the fancier optimizations, but given this,
page-mode DRAMS can be interleaved to give you very good performance,
especially if you have an on-chip cache and load a line at a time.

Without this, a Harvard architecture can use page mode to get cache-like
speeds for the instruction stream, and a fancier interleaved scheme for data
accesses.

Why bend over backwards (inter-company contracts, risk, design cost, etc.)
for a 100% solution when you can have an easy 90% solution? The marginal
gain isn't worth it.
-- 
John Pieper				 jsp@cs.cmu.edu
School of Computer Science,  Carnegie-Mellon University

brooks@vette.llnl.gov (Eugene Brooks) (06/06/89)

In article <5128@pt.cs.cmu.edu> jsp@b.gp.cs.cmu.edu (John Pieper) writes:
>
>Why should we go to custom memory chips?
Memory chips which are built to be compatible with a high performance
micro destined for commodity use won't be "custom."

>With an on-chip cache, there are several ways to implement fast memory without
>going to custom chips. The processor needs to support out-of-order memory
>operations to support some of the fancier optimizations, but given this,
>page-mode DRAMS can be interleaved to give you very good performance,
>especially if you have an on-chip cache and load a line at a time.
The latest and greatest VLSI micros utilize "page mode" drams and those
page mode drams are not fast enough.

>Why bend over backwards (inter-company contracts, risk, design cost, etc)
>for a 100% when you can have an easy 90% solution? The marginal gain isn't
>worth it.
Someone who bends over backwards to create a 100 MFLOP micro is not bending
over backwards to create a ram chip to go with it.



brooks@maddog.llnl.gov, brooks@maddog.uucp

brooks@maddog.llnl.gov (Eugene Brooks) (06/06/89)

In article <26450@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>>With an on-chip cache, there are several ways to implement fast memory without
>>going to custom chips. The processor needs to support out-of-order memory
>>operations to support some of the fancier optimizations, but given this,
>>page-mode DRAMS can be interleaved to give you very good performance,
>>especially if you have an on-chip cache and load a line at a time.
>The latest and greatest VLSI micros utilize "page mode" drams and those
>page mode drams are not fast enough.
Sorry about this piece of brain damage; I read the posting too quickly and
shouldn't have responded to this. My point was that interleaving several
chips raises the cost of the memory system, and this was the reason to
interleave directly on the chip.  If the micro chips capable of 60 MFLOP
floating-point rates are not custom, the memory chips to go with them are
not custom either.


brooks@maddog.llnl.gov, brooks@maddog.uucp

slackey@bbn.com (Stan Lackey) (06/06/89)

In article <26450@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>In article <5128@pt.cs.cmu.edu> jsp@b.gp.cs.cmu.edu (John Pieper) writes:
>>
>>Why should we go to custom memory chips?
>Memory chips which are built to be compatible with a high performance
>micro destined for commodity use won't be "custom."

There are already many custom memory chips.  Cache tag RAM's, video RAM's
and the like all started out that way.  The 88200 is a custom memory
chip.  An extremist (like me) might consider the i860 as a custom memory
chip.  :-)

What percentage of a chip should be memory cells in order for it to really
be considered a custom memory chip?

>>page-mode DRAMS can be interleaved to give you very good performance,
>>especially if you have an on-chip cache and load a line at a time.

I have seen many applications (the big problems for which people want
the heavy iron) which don't utilize a cache well, even with a
half-megabyte cache.  For example, a matrix multiplication processes
one matrix down columns and the other across rows.  Some cases like this
can actually get poorer performance with a large line size.

>>Why bend over backwards (inter-company contracts, risk, design cost, etc)
>>for a 100% when you can have an easy 90% solution? The marginal gain isn't
The best case is where a company makes both the uP and the memory (88000
for example).
The gain isn't marginal, either.  Your statements may be OK now, but the
next generation will see CPU cycle times in the 5 to 20ns range, with
SRAMs in the 20ns range and DRAMs in the 80ns range.  Clearly this needs work.
-Stan  "Do I have an opinion yet?"

khb@chiba.Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS) (06/07/89)

In article <40985@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>
>I have seen many applications (the big problems for which people want
>the heavy iron) which don't utilize a cache well, even with a
>half-megabyte cache.  For example, a matrix multiplication processes
>one matrix down columns and the other across rows.  Some cases like this
>can actually get poorer performance with a large line size.

Many of those big-iron applications run just fine with a modest cache (say
64-128K) combined with a sensible implementation. Consider languages which
are array-savvy (say APL and f88), or libraries that are ... then cache can
be effectively utilized.

Or consider

				CONMAN

Code from CalTech, authors: Arthur Raefsky, Scott D. King, and
Bradford Hager.

This code is a "well vectorized" _F_inite _E_lement analysis code. The
algorithm is numerically stable, robust, and computationally
efficient. It has been hosted on numerous scalar machines, hypercubes and
vector machines.
..... stuff deleted for space reasons

				Machines

CrayXMP 4/8 ctss, cft77               65Mflops (measured for BM1,2) 45Mflops(BM3)
    YMP 8/16 UNICOS, one processor    95Mflops (measured for BM1,2)
Cray-2 (one processor) UNICOS     
Convex C1XP                     
Sun 4/260 FPU1, f77 v1.2 
Sun 4/330 FPU2, f77 v1.2 dalign and not
Campus          f77 v1.2 dalign and not
The Cray Mflops figures were measured (not computed) by the hardware
speedometer. 

It should be noted that the earlier generation of code (which this code is
meant to replace) ran slower in VECTOR mode than this code does in SCALAR
mode.  It is the authors' contention that this is often the case: a good
implementation often does well on many different machines.

....
                               Timing table

                 bm1                bm2                   bm3
          scalar     vector    scalar   vector     scalar   vector

Cray-2      -        180.2      -        178.0       -       573.4
XMP       2869.3     153.2      -        154.3     6094.7    398.6
YMP         -         92.0      -         91.2       -       233.7

convex    9383.2    2021.2    9383.3    1979.8    14513.8   4383.2

4/330               6808                6871.6              7678.7
4/330dalign         5975.56             5993.0              6145.31

ss-1                8880.84             9044.51            10177.09
ss-1dalign          7804.50             7779.85             9946.17  

4/260              10290.91            10263.47            12613.88
4/280fpu2           8445.43             8428.12            10228.52

.....

The point being that Arthur's (et al.) algorithm runs quite nicely on
well-designed vector machines (i.e., achieves good vectorization rates)
_and_ on scalar machines. A later implementation employs a better-vectorized
matrix factorization step, which increases the overall vectorization
considerably.

The key is that this is a modified skyline direct solver ... so cache
works quite nicely.

Arthur can be reached at arthur@oasis.stanford.edu for more details
about the science involved.

>
>>>Why bend over backwards (inter-company contracts, risk, design cost, etc)
>>>for a 100% when you can have an easy 90% solution? The marginal gain isn't
>The best case is where a company makes both the uP and the memory (88000
>for example).
>The gain isn't marginal, either.  You statements may be OK now, but next
>generation will see the CPU in the 5 to 20ns range, with srams in the
>20ns and drams in the 80ns range?  Clearly needs work.
>-Stan  "Do I have an opinion yet?"

Well, a large register file has been described as a compiler managed
cache :>



Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &    |   Languages and Performance Tools. 
Opus  (* strange as it may seem, I do more engineering now     *)

bcase@cup.portal.com (Brian bcase Case) (06/09/89)

>You[r] statements may be OK now, but next generation will see the CPU
>in the 5 to 20ns range, with srams in the 20ns and drams in the 80ns
>range?  Clearly needs work.

And some of that work is being done now.  True, processors will be at
50 MHz and beyond.  However, the contention that memories will not keep
up is not necessarily true.  I know of an experimental SRAM now being
fabbed that will have an access time of about 4 ns with a 16K x 4
organization and no on-chip address registers or other gunk.  One
problem:  the die is big (4mm x 6mm?).  I think Cypress is already
sampling a 4K SRAM in the same access-time neighborhood.  IBM said
they have a 22ns 1Mbit DRAM (although nobody can figure out how real
it is).

Memory latency is a problem, but to say that it won't be solved seems
alarmist.

slackey@bbn.com (Stan Lackey) (06/10/89)

In article <19257@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>And some of that work is being done now.  True, processors will be at
>50 MHz and beyond.  However, the contention that memories will not keep
>up is not necessarily true.  I know of an experimental SRAM now being

I was just quoting history; for example, in 1977 I designed DRAMs into
a project, and the best that could be had were around 100ns.  Here we
are 12 years later and the fastest are 70ns (among parts that suppliers
will accept a purchase order for in large quantities).  In the meantime,
processors have gone from 5MHz to 20.  That doesn't look like scaling to me.

>Memory latency is a problem, but to say that it won't be solved seems
>alarmist.

I'm not being an alarmist.  For example, to deal with faster CPU's
than memories (and for other reasons not to the point), caches got
invented.  I'm just saying that more innovation will be required than
just cranking up semiconductor processes.
-Stan