jsp@b.gp.cs.cmu.edu (John Pieper) (06/06/89)
Why should we go to custom memory chips?  With an on-chip cache, there
are several ways to implement fast memory without going to custom chips.
The processor needs to support out-of-order memory operations to support
some of the fancier optimizations, but given this, page-mode DRAMs can
be interleaved to give you very good performance, especially if you have
an on-chip cache and load a line at a time.  Without this, a Harvard
architecture can use page mode to get cache-like speeds for the
instruction stream, and a fancier interleaved scheme for data accesses.

Why bend over backwards (inter-company contracts, risk, design cost,
etc.) for a 100% solution when you can have an easy 90% solution?  The
marginal gain isn't worth it.
--
John Pieper				jsp@cs.cmu.edu
School of Computer Science, Carnegie-Mellon University
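[The interleaving argument above can be made concrete with a toy bank
model.  This is only a sketch: the bank count, cycle counts, and page
size below are illustrative assumptions, not figures for any real DRAM
part mentioned in the thread.]

```python
# Toy model of N-way interleaved page-mode DRAM.  All timing parameters
# are illustrative assumptions, not real part specifications.

def access_time(addresses, banks=4, cycle=8, page_hit=3, page_size=1024):
    """Total cycles to service a stream of word addresses.

    Each bank is busy for `cycle` cycles per access, or only `page_hit`
    cycles if the access falls in the bank's currently open DRAM page.
    Accesses to *different* banks overlap; accesses to the same bank
    serialize behind that bank's busy time.
    """
    free_at = [0] * banks          # cycle at which each bank is next free
    open_page = [None] * banks     # currently open page per bank
    t = 0                          # cycle at which the next request issues
    for a in addresses:
        b = a % banks              # low-order interleave: spread across banks
        page = a // page_size
        latency = page_hit if open_page[b] == page else cycle
        start = max(t, free_at[b])
        free_at[b] = start + latency
        open_page[b] = page
        t += 1                     # one request issued per cycle
    return max(free_at)

# A sequential stream (e.g. a cache-line fill) spreads across all banks:
seq = list(range(64))
# A stride equal to the bank count hits one bank every time and serializes:
stride = list(range(0, 256, 4))
```

Under these assumptions the sequential stream finishes in a small
fraction of the time of the same-bank stream, which is the "very good
performance" case for line-at-a-time loads.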
brooks@vette.llnl.gov (Eugene Brooks) (06/06/89)
In article <5128@pt.cs.cmu.edu> jsp@b.gp.cs.cmu.edu (John Pieper) writes:
>Why should we go to custom memory chips?

Memory chips which are built to be compatible with a high-performance
micro destined for commodity use won't be "custom."

>With an on-chip cache, there are several ways to implement fast memory
>without going to custom chips. The processor needs to support
>out-of-order memory operations to support some of the fancier
>optimizations, but given this, page-mode DRAMS can be interleaved to
>give you very good performance, especially if you have an on-chip cache
>and load a line at a time.

The latest and greatest VLSI micros utilize "page mode" DRAMs, and those
page-mode DRAMs are not fast enough.

>Why bend over backwards (inter-company contracts, risk, design cost, etc)
>for a 100% when you can have an easy 90% solution? The marginal gain isn't
>worth it.

Someone who bends over backwards to create a 100-MFLOP micro is not
bending over backwards to create a RAM chip to go with it.

brooks@maddog.llnl.gov, brooks@maddog.uucp
brooks@maddog.llnl.gov (Eugene Brooks) (06/06/89)
In article <26450@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>>With an on-chip cache, there are several ways to implement fast memory
>>without going to custom chips. The processor needs to support
>>out-of-order memory operations to support some of the fancier
>>optimizations, but given this, page-mode DRAMS can be interleaved to
>>give you very good performance, especially if you have an on-chip cache
>>and load a line at a time.
>The latest and greatest VLSI micros utilize "page mode" drams and those
>page mode drams are not fast enough.

Sorry about this piece of brain damage; I read the posting too quickly
and shouldn't have responded to it.  My point was that interleaving
several chips raises the cost of the memory system, and this is the
reason to interleave directly on the chip.  If the micro chips capable
of 60-MFLOP floating-point rates are not custom, the memory chips to go
with them are not custom either.

brooks@maddog.llnl.gov, brooks@maddog.uucp
slackey@bbn.com (Stan Lackey) (06/06/89)
In article <26450@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>In article <5128@pt.cs.cmu.edu> jsp@b.gp.cs.cmu.edu (John Pieper) writes:
>>Why should we go to custom memory chips?
>Memory chips which are built to be compatible with a high performance
>micro destined for commodity use won't be "custom."

There are already many custom memory chips.  Cache tag RAMs, video RAMs
and the like all started out that way.  The 88200 is a custom memory
chip.  An extremist (like me) might consider the i860 a custom memory
chip. :-)  What percentage of a chip should be memory cells in order for
it to really be considered a custom memory chip?

>>page-mode DRAMS can be interleaved to give you very good performance,
>>especially if you have an on-chip cache and load a line at a time.

I have seen many applications (the big problems for which people want
the heavy iron) which don't utilize a cache well, even with a
half-megabyte cache.  For example, a matrix multiplication processes one
matrix down columns and the other across rows.  Some cases like this can
actually get poorer performance with a large line size.

>>Why bend over backwards (inter-company contracts, risk, design cost, etc)
>>for a 100% when you can have an easy 90% solution? The marginal gain isn't

The best case is where a company makes both the uP and the memory (88000
for example), and the gain isn't marginal, either.  Your statements may
be OK now, but the next generation will see the CPU in the 5-to-20ns
range, with SRAMs in the 20ns range and DRAMs in the 80ns range.
Clearly needs work.
-Stan "Do I have an opinion yet?"
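[The column-versus-row point above can be demonstrated with a toy
direct-mapped cache model.  This is a sketch; the cache geometry, matrix
size, and element size are illustrative assumptions chosen to make the
conflict pattern visible, not parameters of any machine in the thread.]

```python
# Count misses in a direct-mapped cache for a stream of byte addresses.
# Geometry (64 lines of 16 bytes) is an illustrative assumption.

def misses(addresses, lines=64, line_bytes=16):
    tags = [None] * lines
    miss = 0
    for a in addresses:
        block = a // line_bytes       # which memory block this address is in
        idx = block % lines           # which cache line it maps to
        if tags[idx] != block:
            tags[idx] = block         # evict whatever was there
            miss += 1
    return miss

N = 128       # 128x128 matrix of 8-byte elements, stored row-major
elem = 8
# Walking across rows touches consecutive addresses, so each fetched
# line is fully used before moving on:
row_order = [(i * N + j) * elem for i in range(N) for j in range(N)]
# Walking down columns strides by a whole row, so successive accesses
# land in different blocks that collide in the same cache lines:
col_order = [(i * N + j) * elem for j in range(N) for i in range(N)]
```

With these parameters the column-order walk misses on every single
access, twice the miss count of the row-order walk; a larger line size
only widens the gap, since more of each fetched line goes unused.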
khb@chiba.Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS) (06/07/89)
In article <40985@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>I have seen many applications (the big problems for which people want
>the heavy iron) which don't utilize a cache well, even with a
>half-megabyte cache. For example, a matrix multiplication processes
>one matrix down columns and the other across rows. Some cases like this
>can actually get poorer performance with a large line size.

Many of those big-iron applications run just fine with a modest (say
64-128K) cache combined with a sensible implementation.  Consider
languages which are array-savvy (say APL and f88), or libraries that
are; then cache can be effectively utilized.

Or consider the CONMAN code from CalTech; authors: Arthur Raefsky,
Scott D. King, and Bradford Hager.  This code is a "well vectorized"
_F_inite _E_lement analysis code.  The algorithm is numerically stable,
robust, and computationally efficient.  It has been hosted on numerous
scalar machines, hypercubes and vector machines.

..... stuff deleted for space reasons

Machines
  Cray XMP 4/8   ctss, cft77            65 Mflops (measured for BM1,2)
                                        45 Mflops (BM3)
  Cray YMP 8/16  UNICOS, one processor  95 Mflops (measured for BM1,2)
  Cray-2         UNICOS, one processor
  Convex C1XP
  Sun 4/260      FPU1, f77 v1.2
  Sun 4/330      FPU2, f77 v1.2, dalign and not
  Campus         f77 v1.2, dalign and not

The Cray Mflops figures were measured (not computed) by the hardware
speedometer.  It should be noted that the earlier generation of code
(which this code is meant to replace) ran slower in VECTOR mode than
this code does in SCALAR mode.  It is the authors' contention that this
is often the case: a good implementation often does well on many
different machines.
Timing table

                 bm1                  bm2                  bm3
             scalar   vector      scalar   vector      scalar   vector
Cray-2          -      180.2         -      178.0         -      573.4
XMP          2869.3    153.2         -      154.3      6094.7    398.6
YMP             -       92.0         -       91.2         -      233.7
convex       9383.2   2021.2      9383.3   1979.8     14513.8   4383.2
4/330        6808                 6871.6               7678.7
4/330dalign  5975.56              5993.0               6145.31
ss-1         8880.84              9044.51             10177.09
ss-1dalign   7804.50              7779.85              9946.17
4/260       10290.91             10263.47             12613.88
4/280fpu2    8445.43              8428.12             10228.52

.....

The point being that Arthur (et al.'s) algorithm runs quite nicely on
well-designed vector machines (i.e., it achieves good vectorization
rates) _and_ on scalar machines.  A later implementation employs a
better-vectorized matrix factorization step, which increases the overall
vectorization considerably.  The key is that this is a modified skyline
direct solver, so cache works quite nicely.  Arthur can be reached at
arthur@oasis.stanford.edu for more details about the science involved.

>>>Why bend over backwards (inter-company contracts, risk, design cost, etc)
>>>for a 100% when you can have an easy 90% solution? The marginal gain isn't
>The best case is where a company makes both the uP and the memory (88000
>for example).
>The gain isn't marginal, either. You statements may be OK now, but next
>generation will see the CPU in the 5 to 20ns range, with srams in the
>20ns and drams in the 80ns range? Clearly needs work.
>-Stan "Do I have an opinion yet?"

Well, a large register file has been described as a compiler-managed
cache :>

Keith H. Bierman    |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault   | Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &  | Languages and Performance Tools.
Opus                | (* strange as it may seem, I do more engineering now *)
bcase@cup.portal.com (Brian bcase Case) (06/09/89)
>You[r] statements may be OK now, but next generation will see the CPU
>in the 5 to 20ns range, with srams in the 20ns and drams in the 80ns
>range? Clearly needs work.

And some of that work is being done now.  True, processors will be at
50 MHz and beyond.  However, the contention that memories will not keep
up is not necessarily true.  I know of an experimental SRAM now being
fabbed that will have an access time of about 4ns with a 16K x 4
organization and no on-chip address registers or other gunk.  One
problem: the die is big (4mm x 6mm?).  I think Cypress is already
sampling a 4K SRAM in the same access-time neighborhood.  IBM said they
have a 22ns 1-Mbit DRAM (although nobody can figure out how real it is).
Memory latency is a problem, but to say that it won't be solved seems
alarmist.
slackey@bbn.com (Stan Lackey) (06/10/89)
In article <19257@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>And some of that work is being done now. True, processors will be at
>50 MHz and beyond. However, the contention that memories will not keep
>up is not necessarily true. I know of an experimental SRAM now being

I was just quoting history; for example, in 1977 I designed DRAMs into a
project, and the best that could be had were around 100ns.  Here we are
12 years later and the fastest are 70ns (at least, the fastest that
suppliers will accept a purchase order for in large quantities).  In the
meantime, processors have gone from 5MHz to 20MHz.  That doesn't look
like scaling to me.

>Memory latency is a problem, but to say that it won't be solved seems
>alarmist.

I'm not being an alarmist.  For example, to deal with CPUs faster than
memories (and for other reasons not to the point), caches got invented.
I'm just saying that more innovation will be required than just cranking
up semiconductor processes.
-Stan
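[The gap Stan describes is exactly what the standard average memory
access time (AMAT) formula captures.  The 20ns and 80ns figures are the
SRAM/DRAM numbers quoted earlier in the thread; the 5% miss rate is an
illustrative assumption.]

```python
# Average memory access time: every access pays the cache hit time, and
# misses additionally pay the DRAM penalty.

def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    return hit_time_ns + miss_rate * miss_penalty_ns

# 20ns SRAM cache in front of 80ns DRAM, assumed 5% miss rate:
print(amat(20.0, 0.05, 80.0))   # roughly 24 ns on average
```

Even a well-behaved cache only averages the latencies; a CPU with a
5-to-20ns cycle still stalls on every miss, which is why more than
process scaling is needed.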