[comp.arch] Do you have bandwidth?

schow@leibniz.uucp (Stanley Chow) (04/15/89)

Since Michael Slater has already posted a good summary of the '040 and
486, there is not much point in starting a new battle in the 68K vs. i86
war. (At least wait till there is more public information.)

In the meantime, I offer a new topic to burn up bandwidth (that is,
net.bandwidth, not bus.bandwidth).
 
In a recent series of articles about address modes and other topics,
some posters claim that memory bandwidth is not a problem - to quote
Brian Case, "bandwidth can be had in abundance". I happen to think that
we do not have enough bandwidth now. What do other people think?
 
Just to make sure there are enough pieces so that everyone can post a
different answer, I will start with a list of pieces and you can fill
in the interfaces. Please try to at least type what you mean and by all
means, put in a couple of real (or maximum) numbers.
 
[Feel free to talk about parallel/multi-processing.]
 
    Piece of system
 
    Execution Core (possibly many)
    On-chip Cache  (possibly split)              chip
			 ---------------------------------
    Off-chip Cache (possibly multi-level)        board
    Main memory    (possibly multi-level)
    Bulk memory    (for lack of a better term)
 
 
Specific interfaces that may be of interest:
 
1) Execution Core  to  On-chip I-Cache.
 
    It seems people can already build cores that are faster than the
    on-chip cache. One can always throw silicon at a multiplier to make
    it faster (I know, there are limits with loading, ..).
 
 1a) Straight-line execution
 
    Even in this simpler case, I understand that most chips are limited
    by the cache, not the core. Any chip designers want to comment?
 
 
2) Execution Core  to  On-chip D-cache.
 
    A very hard problem by all accounts. Everyone (almost) adds delay
    slots one way or another.
 
 
3) On-chip  to  off-chip.
 
    A well-known problem. How wide do you think buses will get? How fast?
 
 
4) How should off-chip cache be controlled? By the cpu chip?
 
 
5) Invent the interface problem of your choice. This can be made as hard
   or as easy as you want.
 
 
 
Stanley Chow   ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
 
 
Please don't tell my boss I am starting this discussion, he thinks I
am working hard on software!
 

davis@clocs.cs.unc.edu (Mark Davis) (04/15/89)

In article <407@bnr-fos.UUCP> schow@leibniz.uucp (Stanley Chow) writes:
>In a recent series of articles about address modes and other topics,
>some posters claim that memory bandwidth is not a problem - to quote
>Brian Case, "bandwidth can be had in abundance". I happen to think that
>we do not have enough bandwidth now. What do other people think?
>  ...
>    It seems people can already build cores that are faster than the
>    on-chip cache. One can always throw silicon at a multiplier to make
>    it faster (I know, there are limits with loading, ..).

You can always improve bandwidth with silicon (and wires).  To double
bandwidth, double the data bus size.  You can also use interleave or
special chip modes (static column or page mode access) to improve
bandwidth.
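
Just to make the scaling concrete, here is a toy calculation in C (the
40 ns bus cycle time is a number I picked for illustration, not anybody's
spec sheet):

#include <stdio.h>

int main(void)
{
    double cycle_ns = 40.0;            /* assumed memory bus cycle time */
    int widths[] = { 32, 64, 128 };    /* data bus widths in bits       */
    int i;

    for (i = 0; i < 3; i++) {
        /* peak bytes/second = bytes per transfer / seconds per transfer */
        double bytes_per_s = (widths[i] / 8.0) / (cycle_ns * 1.0e-9);
        printf("%3d-bit bus at %2.0f ns/cycle: %4.0f MB/s peak\n",
               widths[i], cycle_ns, bytes_per_s / 1.0e6);
    }
    return 0;
}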

As I remember, Brian Case's statement was indeed referring to
bandwidth.

On the other hand, latency (roughly the number of cycles to get the
data after you figure out its address) is a much more difficult
problem.  Making the latency twice as good (50% as long) can be very
tough.  Some latencies are not possible with current technology (1 ns
latency for a 1-megaword system, for example).  Can you rephrase your
questions to discriminate between bandwidth and latency?

Thanks - Mark (davis@cs.unc.edu or uunet!mcnc!davis)

mccalpin@loligo.cc.fsu.edu (John McCalpin) (04/16/89)

In article <7766@thorin.cs.unc.edu> davis@cs.unc.edu (Mark Davis) writes:
>You can always improve bandwidth with silicon (and wires).  To double
>bandwidth, double the data bus size.
>
>On the other hand, latency (roughly the number of cycles to get the
>data after you figure out its address) is a much more difficult problem.
>Making the latency twice as good (50% as long) can be very tough.
>Mark (davis@cs.unc.edu or uunet!mcnc!davis)

One place where the distinction between latency and bandwidth shows up
very clearly is in the CDC/ETA line of supercomputers.  These machines
(the Cyber 205 and ETA-10) use a memory-to-memory vector architecture.

The machine being installed now at FSU (an ETA-10G) supports a
sustained bandwidth of 6.85 GByte/s from each CPU's 32 MByte (soon 128
MByte) local memory through the vector pipes and back to memory.  This
bandwidth consists of two 64-bit loads and one 64-bit store for each of
2 vector pipes on each CPU every 7 ns cycle. Total: 6 words/clock * 8
Bytes/word * 143 M clock/s = 6.85 GB/s.  I think that this memory is
implemented in 35 ns SRAM.  Latency on the 10.5 ns machine is about 6-8
cycles, or 60-80 ns.  I don't know how this will scale with the faster
CPU's.

The second-level memory consists of 1 GByte of DRAM.  It has 8 ports,
each capable of sustaining one 64-bit word per clock. The aggregate 
transfer rate is thus 9.1 GByte/s. The hardware setup time for a transfer
is supposed to be about 256 cycles, but I don't know what fraction of this
is the actual memory latency.
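
For anyone who wants to check the arithmetic, here it is as a few lines
of C (the small differences from the 6.85 and 9.1 figures above are just
rounding of the 143 M clock/s):

#include <stdio.h>

int main(void)
{
    double clock_hz = 1.0 / 7.0e-9;    /* 7 ns cycle, ~143 M clocks/s   */
    double words_per_clock = 6.0;      /* 2 pipes x (2 loads + 1 store) */
    double bytes_per_word  = 8.0;

    double cpu_bw  = words_per_clock * bytes_per_word * clock_hz;
    double dram_bw = 8.0 * bytes_per_word * clock_hz;  /* 8 ports x 1 word/clock */

    printf("per-CPU local memory bandwidth:  %.2f GB/s\n", cpu_bw / 1.0e9);
    printf("shared DRAM aggregate bandwidth: %.2f GB/s\n", dram_bw / 1.0e9);
    return 0;
}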

Disclaimer: I don't work for CDC/ETA.  In fact, I don't work much at all....
-- 
----------------------   John D. McCalpin   ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu		mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------

mccalpin@loligo.cc.fsu.edu (John McCalpin) (04/17/89)

In article <592@loligo.cc.fsu.edu> I wrote:

>One place where the distinction between latency and bandwidth shows up
>very clearly is in the CDC/ETA line of supercomputers.  These machines
>(the Cyber 205 and ETA-10) use a memory-to-memory vector architecture.

I then went on to discuss the bandwidth, but not the latency.  I guess
that I didn't make a very clear distinction. :-)

Recap: we have a machine with four 7 ns CPUs, each with 32 MB of SRAM and
a 6850 MB/s memory channel.  The CPUs share another 1 GB of DRAM, with
four 1140 MB/s channels currently installed (one to each CPU).

The latency is important because the overhead of setting up a
memory-to-memory vector operation includes the memory latency plus the
pipe length, plus other stuff relating to decoding the instruction,
etc.

The latency of the SRAM on the ETA-10 is about 6-8 cycles, and the pipe
length is 5.  So even if the instruction took zero time to decode
(don't we all wish!), there should be an overhead of 11-13 cycles on
each vector operation.  In fact, the hardware overhead on the ETA-10 is
down to about 16-23 cycles, depending on how the banks are aligned for
the input and output vectors.  This allows very good performance on
fairly short vectors.
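
To see what an overhead of that size means for short vectors, here is a
toy model in C.  The 20-cycle startup and the 2-results-per-clock rate
are my round numbers based on the figures above, not official ETA specs:

#include <stdio.h>

int main(void)
{
    double startup  = 20.0;   /* cycles; middle of the 16-23 range above  */
    double per_elem = 0.5;    /* cycles/element with 2 results per clock  */
    int n;

    for (n = 16; n <= 1024; n *= 4) {
        double total = startup + n * per_elem;
        double eff   = (n * per_elem) / total;   /* fraction of peak rate */
        printf("n = %4d   cycles = %6.1f   efficiency = %3.0f%%\n",
               n, total, 100.0 * eff);
    }
    return 0;
}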

The latency tends to be more of a bother in the random gather/scatter
instructions.  The ETA-10 (like the Cray machines) uses banked memory,
set up so that sequential accesses come at full (6850 MB/s) speed.
Random accesses can be MUCH slower.  Repeated accesses to the same bank
(typically resulting from a stride through an array which is a multiple
of 8 or 16) result in a full latency delay on each access.  Most ETA-10
users would really like to see the latency go down so that bank
conflicts would be less trouble on random gathers/scatters.
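
Here is a toy illustration of why those strides hurt.  The 8-way
interleave is my assumption for the example, not ETA's actual bank count:

#include <stdio.h>

#define NBANKS 8   /* assumed interleave factor, for illustration only */

int main(void)
{
    int strides[] = { 1, 3, 8, 16 };
    int s, i;

    for (s = 0; s < 4; s++) {
        printf("stride %2d touches banks:", strides[s]);
        for (i = 0; i < 8; i++)          /* first 8 elements of the vector */
            printf(" %d", (i * strides[s]) % NBANKS);
        printf("\n");
    }
    return 0;
}

With 8 banks, strides of 1 and 3 spread across all the banks, while
strides of 8 and 16 hit bank 0 on every single access.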


Disclaimer: I don't work for CDC/ETA.  In fact, I don't work much at all....
-- 
----------------------   John D. McCalpin   ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu		mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------

schow@bnr-public.uucp (Stanley Chow) (04/18/89)

In article <7766@thorin.cs.unc.edu> davis@cs.unc.edu (Mark Davis) writes:
>You can always improve bandwidth with silicon (and wires).  To double
>bandwidth, double the data bus size.  You can also use interleave or
>special chip modes (static column or page mode access) to improve
>bandwidth.
>

Within a chip, yes, one can widen the bus. Even there, routing problems
will restrict it. The 128-bit bus on the recent Intel chips seems to be
a practical limit for now. 512-bit buses probably need triple metal levels
even in sub-micron processes.

Outside of a chip, I would have serious doubts about a very wide bus
unless you have lots of money. A 128-bit bus with a 32-bit address comes
to 160 pins before control lines; add in power and ground, double it for
I & D, and we are looking at a packaging problem. Come to think of it,
ground bounce will probably make the packaging look easy.
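
To put rough numbers on the pin budget (the control-line and power/ground
counts below are my guesses, just to make the point):

#include <stdio.h>

int main(void)
{
    int data    = 128;
    int addr    = 32;
    int control = 20;              /* guess: strobes, byte enables, etc.      */
    int signal  = data + addr + control;
    int power   = signal / 4;      /* guess: 1 power/ground pin per 4 signals */
    int one_bus = signal + power;

    printf("single bus:           %d pins\n", one_bus);
    printf("separate I & D buses: %d pins\n", 2 * one_bus);
    return 0;
}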

My view is that even with interleaving, page mode, etc., we can make
execution cores and on-chip caches that are much faster than any bus,
in terms of raw bandwidth but especially in latency.

>Can you rephrase your
>questions to discriminate between bandwidth and latency?
>

Actually, this question on bandwidth is my follow-up to the recent 
discussion. I am frustrated by bandwidth and latency at every turn; yet
all the people on the net seem to think bandwidth and/or latency is not
a problem. Is this because everyone knows something I don't? Do I have
a particularly difficult problem?

Basically, I would like to get a feel for what other people think the
bottlenecks are. So comments on bandwidth *and* latency are appreciated.

The real hidden agenda is the old RISC-CISC war. If bandwidth is a real
problem, then RISC is not a good solution. [Oh no, did I just restart
a religious war?]


Stanley Chow   ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public

What opinion? Did I say something? Come on, you wouldn't fire me just
because I didn't put in a disclaimer?

jesup@cbmvax.UUCP (Randell Jesup) (04/21/89)

In article <418@bnr-fos.UUCP> schow@bnr-public.UUCP (Stanley Chow) writes:
>Outside of a chip, I would have serious doubts about a very wide bus
>unless you have lots of money. A 128-bit bus with a 32-bit address comes
>to 160 pins before control lines; add in power and ground, double it for
>I & D, and we are looking at a packaging problem. Come to think of it,
>ground bounce will probably make the packaging look easy.
...
>Actually, this question on bandwidth is my follow-up to the recent 
>discussion. I am frustrated by bandwidth and latency at every turn; yet
>all the people on the net seem to think bandwidth and/or latency is not
>a problem. Is this because everyone knows something I don't? Do I have
>a particularly difficult problem?

	Well, my opinions on this are pretty well known here, I think.

	To restate: I agree bandwidth could well become a problem.  Bandwidth
problems come in several flavors: packaging is a big one, and RAM speed gets in
there also (especially if you don't want to pay astronomical prices for
it).  I won't even go into bus bandwidth.

	The traditional ways to improve bandwidth are running out of steam,
or at least starting to.  It's getting harder to keep adding pins to these
(very large) packages, while still running them at reasonable rates.  Also,
the signals are getting fast enough that capacitive pad loads from static
protection (combined with fan-out) are limiting the speed at which you can
run the lines.
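
A back-of-the-envelope version of the pad-loading argument, treating the
driver as a resistor charging the total pad + package + trace capacitance
(every number below is a guess of mine, just to show the shape of the
problem):

#include <stdio.h>

int main(void)
{
    double r_drv  = 50.0;                  /* ohms, assumed driver impedance */
    double c_pf[] = { 10.0, 25.0, 50.0 };  /* assumed pad + package + trace  */
    double cycle_ns = 25.0;                /* a 40 MHz bus cycle             */
    int i;

    for (i = 0; i < 3; i++) {
        /* 10-90% rise time of an RC charge is about 2.2*R*C */
        double rise_ns = 2.2 * r_drv * c_pf[i] * 1.0e-12 * 1.0e9;
        printf("%3.0f pF load: %4.1f ns rise (%2.0f%% of a %2.0f ns cycle)\n",
               c_pf[i], rise_ns, 100.0 * rise_ns / cycle_ns, cycle_ns);
    }
    return 0;
}

At 40 MHz a heavily loaded line is already eating a good chunk of the
cycle in rise time alone, and it only gets worse as clocks get faster.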

	However, there are interesting non-traditional solutions to these
problems that may save our bacon.

	RAM speed is also an issue: you can get 50 MHz '030s, but the RAM
to keep up with them is EXPENSIVE.  Processor speed has been increasing
faster than RAM access time has been decreasing (for both CISC and RISC).

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

erskine@dalcsug.UUCP (Neil Erskine) (04/21/89)

In article <6658@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>
>	The traditional ways to improve bandwidth are running out of steam,
>or at least starting to.  It's getting harder to keep adding pins to these
>(very large) packages, while still running them at reasonable rates.  Also,
>the signals are getting fast enough that capacitive pad loads from static
>protection (combined with fan-out) are limiting the speed at which you can
>run the lines.
>

	I'm no engineer, but if the capacitive pad loads are
restricting the speed of off-chip signalling, why not dispense with
them, and provide the static protection at the board level?  This
might make board assembly more costly (due to the increased care
required), and the board itself more costly (it might have to be
encased in metal), but if it gives a significant degree of additional
speed, the bother and expense seem worth it.  Alternatively, there may
be some reasons why board-level protection can't do the job; in which
case, what are those reasons?

matloff@crow.Berkeley.EDU (Norman Matloff) (04/27/89)

In article <7766@thorin.cs.unc.edu> davis@cs.unc.edu (Mark Davis) writes:
>In article <407@bnr-fos.UUCP> schow@leibniz.uucp (Stanley Chow) writes:

>>In a recent series of articles about address modes and other topics,
>>some posters claim that memory bandwidth is not a problem - to quote
>>Brian Case, "bandwidth can be had in abundance". I happen to think that
>>we do not have enough bandwidth now. What do other people think?

>You can always improve bandwidth with silicon (and wires).  To double
>bandwidth, double the data bus size.  You can also use interleave or
>special chip modes (static column or page mode access) to improve
>bandwidth.

These measures, e.g. wider buses, may just shift the bottleneck to 
something else.  There is still a strong limitation on a chip's 
number of pins, right?  The area of a rectangle grows much faster 
than the perimeter, and of course there are mechanical reasons why 
pins can't be too small.  Thus the ratio of number of I/O channels
of a chip to bits stored in the chip will probably get worse, not
better.
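
Here is the scaling spelled out in a few lines of C.  The pad pitch and
storage density are round numbers I made up; only the trend matters:

#include <stdio.h>

int main(void)
{
    double pad_pitch_um = 150.0;    /* assumed perimeter pad pitch     */
    double bits_per_mm2 = 40000.0;  /* assumed on-chip storage density */
    double edge_mm;

    for (edge_mm = 5.0; edge_mm <= 20.0; edge_mm *= 2.0) {
        double pads = 4.0 * edge_mm * 1000.0 / pad_pitch_um;  /* ~ perimeter */
        double bits = edge_mm * edge_mm * bits_per_mm2;       /* ~ area      */
        printf("%4.0f mm edge: %4.0f pads, %9.0f bits, %.6f pads/bit\n",
               edge_mm, pads, bits, pads / bits);
    }
    return 0;
}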

We are developing an optical interconnect which has plenty of bandwidth,
since it bypasses the pins and reads from the chip directly [see 1988
ACM Supercomputing Conf.].  But it does indeed seem to us  --  at this
stage, at least  --  that huge bandwidth cannot be exploited fully in
many, maybe most, applications.

I certainly would like to hear what others have to say about this.

   Norm

jps@wucs1.wustl.edu (James Sterbenz) (05/02/89)

In article <23649@agate.BERKELEY.EDU> matloff@heather.ucdavis.edu (Norm Matloff) writes:

>These measures, e.g. wider buses, may just shift the bottleneck to 
>something else.  There is still a strong limitation on a chip's 
>number of pins, right?  The area of a rectangle grows much faster 
>than the perimeter, and of course there are mechanical reasons why 
>pins can't be too small.

This is a problem with packages that have pins only on the perimeter
(such as DIPs), but not for PGAs.  Of course, pin limitation is still
a problem, but not quite as bad for PGAs.

>Thus the ratio of number of I/O channels
>of a chip to bits stored in the chip will probably get worse, not
>better.

In spite of all the other things that most of us think of as important,
packaging remains one of the most important limitations on system
performance.  This is one of the reasons that micros and workstations
will have trouble reaching the performance of mainframes and supercomputers;
current high-performance packaging and cooling is just too expensive.

It will be very interesting to see what happens when a cheap, easy
chip interconnect allowing close 3-D stacking becomes available
(assuming corresponding cooling).
-- 
James Sterbenz  Computer and Communications Research Center
                Washington University in St. Louis 314-726-4203
INTERNET:       jps@wucs1.wustl.edu
UUCP:           wucs1!jps@uunet.uu.net

schmitz@fas.ri.cmu.edu (Donald Schmitz) (05/03/89)

>In spite of all the other things that most of us think of as important,
>packaging remains one of the most important limitations on system
>performance.  This is one of the reasons that micros and workstations
>will have trouble reaching the performance of mainframes and supercomputers;
>current high-performance packaging and cooling is just too expensive.

Just saw a blurb in one of the trade papers about a new interconnect
technology, developed by Cinch, called "Cinapse".  I'm still waiting on info,
but from the description, they use silver butt contacts - I'm guessing the
trick is to somehow make the contacts springy so that all of them in
an array touch.  From memory, the density was about twice that of current
PGAs (240 contacts/in^2 sticks in my mind).  They are pushing this as a bus
connector technology, but it seems possible to use it for chips too.

Interesting question: if packaging allowed you to have twice as many pins per
CPU (pick your favorite existing design), what would you do with them?

Don Schmitz
-- 

jesup@cbmvax.UUCP (Randell Jesup) (05/04/89)

In article <833@wucs1.wustl.edu> jps@wucs1.UUCP (James Sterbenz) writes:
>This is a problem with packages that have pins only on the perimeter
>(such as DIPs), but not for PGAs.  Of course, pin limitation is still
>a problem, but not quite as bad for PGAs.

	But then again PGA's have thermal expansion coefficient problems due
to mismatch with the coefficient of the board they mount on (or so I was told).
That's why the RPM-40 is in a leadless chip carrier instead of a PGA.  (Perhaps
PGA's with sufficient pins weren't rated for 40 MHz, either.)

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

mark@mips.COM (Mark G. Johnson) (05/04/89)

In article <6759@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
 >	But then again PGA's have thermal expansion coefficient problems
 >  due to mismatch with the coefficient of the board they mount on (or so I
 >  was told).  That's why the RPM-40 is in a leadless chip carrier
 >  instead of a PGA.  (Perhaps PGA's with sufficient pins weren't
 >   rated for 40 MHz, either.)
 >Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup


Seems to me that PGA's do just fine with respect to both temperature
and frequency.  The de facto industry-standard CMOS video DAC (Brooktree 458)
comes in an 84-pin PGA, dissipates 2.2 Watts, and runs at 125 MHz.
Maybe Brooktree is more clever with "rating" their package than RPM-40 was.

Intel's i860 CMOS microprocessor comes in a 168-pin PGA, runs at 40 MHz,
and dissipates 3.5 Watts.  Maybe Intel .....

Hewlett-Packard's most recent HP-n000 series microprocessor, built in
NMOS, dissipates 26 Watts and is mounted in a 408 pin PGA.

Various ECL and GaAs RISC processors are about to be introduced, from
several sources, dissipating godzilla amounts of power and running at mucho
Megahertz --- and several of them are in PGA packages.

Looks like folks who really want to can make PGAs go quite far indeed.
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
	...!decwrl!mips!mark	(408) 991-0208

jesup@cbmvax.UUCP (Randell Jesup) (05/05/89)

In article <18753@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>Seems to me that PGA's do just fine with respect to both temperature
>and frequency.  The de facto industry-standard CMOS video DAC (Brooktree 458)
>comes in an 84-pin PGA, dissipates 2.2 Watts, and runs at 125 MHz.
>Maybe Brooktree is more clever with "rating" their package than RPM-40 was.

	Maybe.  We needed 140 or so pins, most of them running at 40 MHz,
CMOS device, 2-3 watts.  Also, this was in '85 or '86, and we wanted to use
an out-of-the-box package (this wasn't a production part, so we didn't want 
to pay to develop/certify a package).

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (05/06/89)

In article <18753@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>Hewlett-Packard's most recent HP-n000 series microprocessor, built in
>NMOS, dissipates 26 Watts and is mounted in a 408 pin PGA.

That's an impressive number of pins. It's probably not very dense,
though - on 100-mil centres, that's over four square inches, to hold
a square centimeter (or so) of chip.  Long paths!  Also, pins take up
board area, since they go through all signal layers, rather than just
through relevant ones.

One obvious improvement is to go surface mount, with a pad-grid
array.  The Motorola "Hypermodule" uses 88000's mounted in these.
I've only seen photos so far, but the claim is 288 pads in 1.1 inch x
1.1 inch, using 60-mil centres.

Actually, there are only 143 signal lines - the rest are power, ground,
and thermal. It will be interesting to see the response from
competitors.
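
The implied densities work out roughly like this (the PGA grid size is
my approximation):

#include <stdio.h>

int main(void)
{
    double pga_area = 2.1 * 2.1;   /* in^2: ~21x21 grid at 100-mil centres */
    double pad_area = 1.1 * 1.1;   /* in^2: the Hypermodule claim above    */

    printf("408-pin PGA:  %3.0f pins/in^2\n", 408.0 / pga_area);
    printf("288-pad grid: %3.0f pads/in^2\n", 288.0 / pad_area);
    return 0;
}

Interestingly, that pad-grid figure is in the same ballpark as the
Cinapse density quoted earlier in this thread.
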
-- 
Don		D.C.Lindsay 	Carnegie Mellon School of Computer Science
--