[comp.arch] One aspect of bandwidth

mark@mips.COM (Mark G. Johnson) (04/15/89)

In article <407@bnr-fos.UUCP> schow@leibniz.uucp (Stanley Chow) writes:
>In a recent series of articles about address modes and other topics,
>some posters claim that memory bandwidth is not a problem - to quote
>Brian Case, "bandwidth can be had in abundance". I happen to think that
>we do not have enough bandwidth now. What do other people think?
> 
>[ ... parallel/multi-processing.]


Seems to me that backplanes (or, in general terms, processor-to-
main-storage data trunks) are nearly pooped out.  At least for the
kinds of air cooled, medium-priced computers that tend to have
microprocessors for CPUs.

For example, the 20-VUP {1 VUP == 1 VAX-780 unit of performance}
"M/2000" machine transfers 32 data bits (+ checkbits, etc) per
40 nsec, over a backplane that's about 1/3 meter long.  This is
traffic between the 128 kbytes of cache and the dRAM.

Wow.  100 Mbytes/sec for a 20-VUP uniprocessor.  Imagine what will
be needed to keep a multiprocessor fed, where each of the processors
is 60-VUPS or more.  Bigger caches alone won't solve the problem;
hit rates are pretty good already with `only' 128 kbytes of cache.
Plus there may be extra backplane traffic, over and above the
necessary CPU<->memory transfers, to implement coherency.  Yuck.

Wider datapaths _might_ get a 4X improvement, if you don't mind
paying for 128 wires.  And slick electrical wizardry like Backplane
Transceiver Logic (TM nat'l semi) might allow a backplane bus
containing 12 card slots to perform one transfer per 20 ns, giving
another 2X improvement.  But we still need another 3X to keep
an 8way x 60VUP multiprocessor from starving!
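
A quick back-of-envelope check of that arithmetic, as a small C sketch.  The
bytes-per-VUP ratio is simply inferred from the M/2000 figures above and
assumed to scale linearly with VUPs, which is only a rough guess:

#include <stdio.h>

int main(void)
{
    /* M/2000 backplane: 32 data bits every 40 ns, cache <-> DRAM */
    double bytes_per_xfer = 4.0;
    double ns_per_xfer    = 40.0;
    double m2000_mb_sec   = bytes_per_xfer / ns_per_xfer * 1000.0;   /* 100  */

    /* Assume traffic scales linearly with VUPs (a rough assumption) */
    double mb_per_vup = m2000_mb_sec / 20.0;                         /* ~5   */
    double needed     = mb_per_vup * 8.0 * 60.0;                     /* 2400 */

    /* 4X from 128 data wires, 2X from 20 ns transfers */
    double achievable = m2000_mb_sec * 4.0 * 2.0;                    /* 800  */

    printf("M/2000 backplane:  %.0f MB/sec\n", m2000_mb_sec);
    printf("8 x 60-VUP needs:  %.0f MB/sec\n", needed);
    printf("wider+faster bus:  %.0f MB/sec (still %.0fX short)\n",
           achievable, needed / achievable);
    return 0;
}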

And that's just for the very next generation of air cooled, medium
priced CPU --- 60 VUP performance is eminently foreseeable in the next
6 to 24 months (schedule varies depending upon who you talk to :-).
Soon thereafter it will be 1994, when you can buy 125-250 VUP micros
for cheap, and when the few remaining software issues concerning
32-way multiprocessors will be solved :-).  *Those* kinds of machines
will, um, provide a challenge for the processor-to-memory
data transfer engineers.
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
	...!decwrl!mips!mark	(408) 991-0208

mash@mips.COM (John Mashey) (04/16/89)

In article <17500@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
.....
>Seems to me that backplanes (or, in general terms, processor-to-
>main-storage data trunks) are nearly pooped out.  At least for the
>kinds of air cooled, medium-priced computers that tend to have
>microprocessors for CPUs.
....
>And that's just for the very next generation of air cooled, medium
>priced CPU --- 60 VUP performance is eminently foreseeable in the next
>6 to 24 months (schedule varies depending upon who you talk to :-).
>Soon thereafter it will be 1994, when you can buy 125-250 VUP micros
>for cheap, and when the few remaining software issues concerning
>32-way multiprocessors will be solved :-).

There's an issue of UNIX Review coming up a while off on something
like "fundamental systems technologies".  We (the editorial board, that is)
were discussing this issue, which might tentatively include 3 articles, looking at
the technologies over the next few years:

1. CPUs (great! mips/flops are almost free)

2. Busses (oh, oh! we're in trouble trying to match the CPUs)

3. I/O (esp. disks) (well, now we're REALLY in trouble).

A particularly interesting issue about busses: MultiBus and, especially,
VMEBUS happened about the same time as the micros to which they were matched.
VMEBUS had enough headroom to remain adequate, even for a 20-VUPS machine
(albeit barely, as an I/O bus).  This was good, because it let people
leverage off many other people's controller development.  (Remember that
a VMEBUS is something like 40MB/sec peak, 25MB/sec sustained.)

As Mark notes, the coming 100+ VUP micros aren't so forgiving.
1) In a balanced system, a single VMEBUS (with block-mode controllers,
appropriately interleaved/fifoed memory controllers, etc) is out of gas
for anything above a 20-25-VUPS system, for sure.

2) Maybe no one will expect to have a "standard" memory bus of any sort,
and instead, just connect up controllers to I/O adaptors.
This is sort of sad.

3) It is pretty easy to look ahead just a few years, to see how to build
systems in the 500-1000-VUPS range, which could fit in a deskside (large
deskside) package, if you wanted, assuming you could build a main
memory bus in the 750-1500MB/sec region (a rough sketch of that scaling
follows this list)......

4) Does anybody know of anything like that being proposed as a standard?
No? well, it looks like we're in for a lot more multiple-bus architectures...:-)
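
A rough C sketch of where a given bus runs out of gas, using bytes-per-VUP
ratios inferred from the figures in this thread (inferences, not measured
constants):

#include <stdio.h>

/* VUP level at which a bus of given sustained bandwidth saturates,
   assuming a fixed ratio of memory traffic to CPU speed. */
static double saturation_vups(double bus_mb_sec, double mb_per_vup)
{
    return bus_mb_sec / mb_per_vup;
}

int main(void)
{
    /* Ratios inferred from this thread:
       M/2000 cache<->DRAM traffic: 100 MB/sec / 20 VUPs = 5 MB/sec/VUP.
       Mashey's deskside target: 750-1500 MB/sec for 500-1000 VUPs,
       i.e. about 1.5 MB/sec/VUP behind a big cache. */
    double vme_sustained = 25.0;    /* MB/sec */

    printf("VMEBUS saturates near %.0f VUPs at 1.5 MB/sec/VUP\n",
           saturation_vups(vme_sustained, 1.5));
    printf("VMEBUS saturates near %.0f VUPs at 5.0 MB/sec/VUP\n",
           saturation_vups(vme_sustained, 5.0));
    printf("a 1000-VUP deskside at 1.5 MB/sec/VUP wants %.0f MB/sec\n",
           1000.0 * 1.5);
    return 0;
}
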
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

bcase@cup.portal.com (Brian bcase Case) (04/17/89)

>>In a recent series of articles about address modes and other topics,
>>some posters claim that memory bandwidth is not a problem - to quote
>>Brian Case, "bandwidth can be had in abundance". I happen to think that
>>we do not have enough bandwidth now. What do other people think?
>
>Seems to me that backplanes (or, in general terms, processor-to-
>main-storage data trunks) are nearly pooped out.  At least for the
>kinds of air cooled, medium-priced computers that tend to have
>microprocessors for CPUs.

Yes, I couldn't agree more.  Look at the fastest backplane buses around.
Ardent's is a "measly" 256 megabytes (now there's a nice round number!)
per second.  That's really good for a backplane, but kinda awful for a
CPU<-->Memory interconnect.  Backplane bandwidth is relatively hard to
improve.  I once worked on a CPU design whose ECL backplane would have
had 1.2 or so Gbytes/sec of bandwidth.  This would have just kept up
with cache misses on two *integer* processors (or was it one?  the
instructions were quite wide)!  And the latency was nothing to write home
about (but commensurate with the technology).

If I remember correctly, what I was referring to with the slightly-out-
of-context quote "bandwidth can be had in abundance" was tightly-coupled,
or even on-chip, CPU<-->Memory interconnect.  Even then, the statement
sounds a little overexuberant, I guess.  Still, we have a box full of tools
for increasing bandwidth, but hardly anything besides "shrink the process!"
for decreasing latency.  At least on-chip RAM will scale with the CPU.

>And that's just for the very next generation of air cooled, medium
>priced CPU --- 60 VUP performance is eminently foreseeable in the next
>6 to 24 months (schedule varies depending upon who you talk to :-).
>Soon thereafter it will be 1994, when you can buy 125-250 VUP micros
>for cheap, and when the few remaining software issues concerning
>32-way multiprocessors will be solved :-).

This is one set of reasons why I like uniprocessors.  Nice, simple,
easy to understand uniprocessors.

bcase@cup.portal.com (Brian bcase Case) (04/17/89)

>2) Maybe no one will expect to have a "standard" memory bus of any sort,
>and instead, just connect up controllers to I/O adaptors.
>This is sort of sad.

But very necessary.  At very high rates, it just doesn't make sense for
small-to-medium-sized computers to connect to memory over a "bus", unless
that bus is ECL, 32 to 64 bytes wide, with 8 transfers per transaction
(needed to recover from the latency) at 20ns per transfer, for a bandwidth
of 1.6 to 3.2 Gbytes/sec.  Using current DRAM technology, the array size
needed to implement this takes a lot of space and the minimum memory size
is huge.  With next-generation DRAM technology, the minimum size is even
bigger!!
Though it's done for different reasons, Macintosh is a sign of times to
come:  all the (high-speed) main memory is "special," and is right next
to the CPU/cache.  What do you do for multiprocessors? Build that ECL bus
and charge several $million.
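
To put a number on "the minimum memory size is huge", a C sketch assuming the
array is built from x1-organized DRAM parts, which is only one of several ways
such an array might be built:

#include <stdio.h>

int main(void)
{
    int    widths_bytes[] = { 32, 64 };     /* hypothetical ECL bus widths  */
    double ns_per_xfer    = 20.0;
    double chip_mbits[]   = { 1.0, 4.0 };   /* current / next-gen DRAM      */

    for (int i = 0; i < 2; i++) {
        int width_bits = widths_bytes[i] * 8;
        double gbytes_sec = widths_bytes[i] / ns_per_xfer;  /* bytes/ns = GB/s */

        printf("%2d-byte path: %.1f Gbytes/sec;", widths_bytes[i], gbytes_sec);
        for (int j = 0; j < 2; j++) {
            /* one x1 chip per data bit => minimum bank size */
            double min_mbytes = width_bits * chip_mbits[j] / 8.0;
            printf(" min %.0f MB with %.0f-Mbit parts%s",
                   min_mbytes, chip_mbits[j], j == 0 ? "," : "\n");
        }
    }
    return 0;
}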

kleonard@gvlv2.GVL.Unisys.COM (Ken Leonard) (04/17/89)

* ... 
* FDDI-speed components that could be used in a super-simple "external serial 
* backplane" based on point-to-point links. For example, consider AMD's "TAXI" 
* ... 
* This is "only" 12.5 Mbyte/sec per channel, but you could have many of 
* them, even one per device. Note for comparison that even the fastest 
* ... 
* I'd certainly like to see an El Cheapo fiber-optic I/O bus ("Cheap-O-Bus"?), 
* ... 
* Something along the lines of a 100 Mbit/s Starlan or 10baseT, that is, 
* active hubs or concentrators (unless only two ends) with point-to-point 
* links to CPUs or devices. 
* The trick is in keeping the design (and link protocol) cheap enough that 
* you could have "100baseF" disks and other peripherals the same way you have 
* SCSI disks today. And of course, it could *also* be used as an LAN, much 
* the same way "370"-like mainframes use Channel-To-Channel adapters today. 
* (FDDI is supposed to so this, but it looks too expensive/complex to me. And 
* do you really need 1300nm light so you can put your disks two kilometers away?) 
* Anybody interested? 
We (Unisys Defense Systems, NIS/SSE) are presently building a first cut at a
very high speed local-like bus-like lan-like thing. For reasons not necessarily
very good, it is (probably unfortunately) called Very High Speed Local Area
Network (VHSLAN). The very first implementation, four CPU nodes, is to be up
and running late this year. We are aiming at 40MByte/sec to prove the pieces
work, and about 120MByte/sec to deliver next year. These are "real" "net after
all overhead" "really uninterrupted indefinitely continuous" data rates. The
rates apply per node-to-node path, regardless of the number of paths
concurrently operating, including the case of every node in the network
concurrently talking to some other node.

The ostensible application is "supercomputer" memory-to-memory megablock
transfer or continuous streams, although we generally tell those who sign the
chits that it's "file transfer" or similar too-restrictive terms. We haven't
completely figured out how to deal with the "where-to-send-this-piece" info for
a given "block" as fast as we can move the piece--that is, even with darn-near
no protocol, we can't get away with absolutely no protocol, and any protocol
has to be "figured out" and "acted on" by something inside the net, and that
something, so far, looks like it still has to be electronic logic.

Of course, a degenerate, strictly two-ended implementation or a dedicated ring
(SuperCPU process distribution?) wouldn't need decisions by the net, since it
wouldn't be a net, so it could really run at full rate all the time. And then,
a smart-enough disk-bank controller thing which made the platter space look
like a flat memory space might be interestingly sharable/interleavable across
a processor cluster. And there are a few other ways we think it might be
useful.

Now 40MB/s or even 120MB/s is no contender for CPU-bus-of-the-year when
compared to rates between a SuperCPU and its own corehouse or its own tightly
coupled disk bank or its busplane-coupled twin neighbor. But it looks pretty
good, we think, as an arbitrarily configurable routeable directable
interconnectable "full performance at full concurrency" LAN-like thing for a
cluster or assemblage or campus full of SuperCPUs and their consorts. And,
after all, we are talking about these rates per single-conductor connection,
with _n_ conductors in parallel per "channel" _not_ nearly _n_ times as
expensive as one per channel.

It is also probably going to be affordable, and maybe even economically
justifiable. We are doing all this because, among other things, we think that
busplane-derived architectures _are_ just about at their limit, and both the
problem-program folk and the system-architecture folk are going to _have_ to
really work at process distribution or problem segmentation or computational
load sharing--none of which are really the same as parallelism, necessarily.
 
Ken Leonard 
--- 
This represents neither my employer's opinion nor my own--It's just something 
I overheard in a low-class bar down by the docks. 

gd@geovision.uucp (Gord Deinstadt) (04/17/89)

In article <17500@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:

 Seems to me that backplanes (or, in general terms, processor-to-
 main-storage data trunks) are nearly pooped out.  At least for the
 kinds of air cooled, medium-priced computers that tend to have
 microprocessors for CPUs.
 [gives examples]
                                 ...*Those* kinds of machines
 will, um, provide a challenge for the processor-to-memory
 data transfer engineers.

Fiber optics.
-- 
Gord Deinstadt           gd@geovision.uucp

kds@blabla.intel.com (Ken Shoemaker) (04/18/89)

In article <17527@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <17500@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>.....
>>Soon thereafter it will be 1994, when you can buy 125-250 VUP micros
>>for cheap, and when the few remaining software issues concerning
>>32-way multiprocessors will be solved :-).
>
>3) it is pretty easy to look ahead just a few years, to see how to build
>systems in the 500-1000-VUPS range, which could fit in a deskside (large
>deskside) package, if you wanted, assuming you could build a main
>memory bus in the 750-1500MB/sec region......
>
>4) Does anybody know of anything like that being proposed as a standard?
>No? well, it looks like we're in for a lot more multiple-bus architectures...:-)

Well, large memory arrays tend to be long latency/fast transfer time
devices, so supporting multiple outstanding requests seems like a good
start.  As for speed on the bus, I believe that the Futurebus supports
asynchronous transfers running just as fast as the handshaking lines can
wiggle.  And it also supports a distributed multiprocessor cache consistency
protocol.  How nice.  Also, large, multi-level caches tend to favor long
line sizes (well, the multi-level is kinda superfluous to the point, but is
probably necessary for the system implementation) which fits well in all
this.

On chip caches can also take care of lots of the processor to memory
bandwidth, but you don't want to make them too large(!) because as you make
them larger, they get slower (even if they are on chip).  The good news is
they don't have to be tremendously huge to have an effect.  Our measurements
of the 8K i486 cache show a better than 90% hit ratio for a large number of
program traces run on the Sun 386i.  And this is for both user and system
code.  Of course, your mileage will vary.  And in a large multi-processor
system, you would still want to have a second level (external) cache, but it
would have to be at least 256K to make a difference (reference to an Alliant
(?) paper from a while ago about using multi-level caches in a large,
multiprocessor system).
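
A rough C sketch of why even a 90% hit ratio leaves real bus traffic behind
it; the reference rate and line sizes here are assumed purely for
illustration, not i486 or Alliant figures:

#include <stdio.h>

/* Bus traffic generated by cache misses: refs/sec * miss ratio * line size */
static double miss_traffic_mb(double refs_per_sec, double miss_ratio,
                              int line_bytes)
{
    return refs_per_sec * miss_ratio * line_bytes / 1e6;
}

int main(void)
{
    double refs = 20e6;   /* assumed: 20 million memory references/sec/CPU */

    /* on-chip cache alone: 90% hits, 16-byte refill lines (assumed) */
    double l1 = miss_traffic_mb(refs, 0.10, 16);

    /* add a big external cache catching 90% of those misses,
       refilling with longer 64-byte lines (also assumed) */
    double l2 = miss_traffic_mb(refs * 0.10, 0.10, 64);

    printf("per CPU:  %.0f MB/sec with on-chip cache only, "
           "%.0f MB/sec behind a second-level cache\n", l1, l2);
    printf("8 CPUs on one bus: %.0f MB/sec vs. %.0f MB/sec\n",
           8.0 * l1, 8.0 * l2);
    return 0;
}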

And as for I/O, maybe the next generation of systems will use their mips 
to do something besides generating I/O requests.  The kind of thing that 
the MIPS guy at Comdex said absolutely requires "risc performance" 
(whatever that means, maybe VUPS will become RUPS or RIPS) to do.  
Nah, too radical a thought.
----------
I've decided to take George Bush's advice and watch his press conferences
	with the sound turned down...			-- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|pur-ee|hacgate|oliveb}!intelca!mipos3!kds

mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) (04/18/89)

In article <17298@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>Though it's done for different reasons, Macintosh is a sign of times to
>come:  all the (high-speed) main memory is "special," and is right next
>to the CPU/cache.  What do you do for multiprocessors? Build that ECL bus
>and charge several $million.

Excuse me, I'm a complete novice in this area, but is it necessary that
_all_ the memory of each processor be shared, and thus be on a very
fast and expensive bus?  Why can't the bulk of each process's memory
(most of the data, and all of the instructions) be local, with shared memory
space requested explicitly, with the understanding that access will be
significantly slower?  Transferring data would still be faster than in
message-type systems, like a Connection Machine.

I guess that in this scheme, there would be some extra physical memory
riding on the bus for the shared data, but if one really didn't want
to waste anything, is it possible to have a software-selectable but
physically implemented device to configure banks of real memory to
individual processors  or shared pool?
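
A minimal C sketch of the bank-assignment idea being asked about; the
ownership table and the notion of a per-bank configuration register are
hypothetical, not a description of any real machine:

#include <stdio.h>

#define NBANKS 16
#define SHARED (-1)

/* Hypothetical: one entry per physical memory bank, naming the CPU that
   owns it privately, or SHARED for the common pool.  In hardware this
   would be a configuration register per bank. */
static int bank_owner[NBANKS];

static void assign_bank(int bank, int owner)
{
    bank_owner[bank] = owner;
}

int main(void)
{
    /* give each of 4 CPUs three private banks, leave four banks shared */
    for (int b = 0; b < NBANKS; b++)
        assign_bank(b, b < 12 ? b / 3 : SHARED);

    for (int b = 0; b < NBANKS; b++) {
        if (bank_owner[b] == SHARED)
            printf("bank %2d: shared pool\n", b);
        else
            printf("bank %2d: private to CPU %d\n", b, bank_owner[b]);
    }
    return 0;
}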

Matt Kennel
   mbkennel@phoenix.princeton.edu

dave@lethe.UUCP (Dave Collier-Brown) (04/18/89)

In article <17298@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
| Though it's done for different reasons, Macintosh is a sign of times to
| come:  all the (high-speed) main memory is "special," and is right next
| to the CPU/cache.  What do you do for multiprocessors? Build that ECL bus
| and charge several $million.

  Well, sort of.
  You do charge several millions, but you don't so much build a bus as
you do a star, with the memory in the middle and the processors out on
the arms.  The thing in the middle is called a system controller on a 'bun
and costs a non-trivial amount of money.

  One of my dear friends still works on Honeywells and regularly gets
into uncontrolled snickering fits when the magazine-types start talking
about bus minis replacing mainframes because they have so many more
mips...

--dave (who worked for HW before all the Bull) c-b

-- 
David Collier-Brown,  | {toronto area...}lethe!dave
72 Abitibi Ave.,      |  Joyce C-B:
Willowdale, Ontario,  |     He's so smart he's dumb.
CANADA. 223-8968      |

rec@dg.dg.com (Robert Cousins) (04/18/89)

In article <7794@phoenix.Princeton.EDU> mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) writes:
>
>Excuse me, I'm a complete novice in this area, but is it necessary that
>_all_ the memory of each processor be shared, and thus be on a very
>fast and expensive bus?  
>
>Matt Kennel
>   mbkennel@phoenix.princeton.edu

Actually, this makes sense for the general case.  A large portion of the
average Unix system's RAM contains read-only data structures and code
which will be constant from boot to shutdown.  This can be replicated
throughout the system.  Approximately 40% of all CPU time on a Unix
machine (varies with load and OS release) is spent in the kernel.  Therefore,
if each CPU has a private copy of the kernel text, the code fetches during
that 40% of the time can stay away from the shared memory system.  In
effect, this reduces the demand on the shared memory system by some large
amount, which will depend upon the cache and CPU configuration.
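
To make "some large amount" concrete, a C sketch; the per-CPU traffic and the
bus rating are assumed figures, and treating every kernel-time fetch as
locally satisfiable makes this an upper bound:

#include <stdio.h>

int main(void)
{
    double kernel_fraction = 0.40;    /* per the estimate above        */
    double per_cpu_demand  = 5.0;     /* MB/sec of shared traffic per  */
                                      /* CPU, an assumed figure        */
    double bus_sustained   = 100.0;   /* MB/sec, also assumed          */

    /* upper bound: every fetch made during kernel time now hits the
       local copy of the kernel text instead of shared memory */
    double with_repl = per_cpu_demand * (1.0 - kernel_fraction);

    printf("CPUs one bus can feed: %.1f without replication, "
           "%.1f with kernel text replicated\n",
           bus_sustained / per_cpu_demand,
           bus_sustained / with_repl);
    return 0;
}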

However, there are notable applications where this fails.  These 
include kernels which are paged, swapped or otherwise change with time.
In this case, the management of replicated memories becomes somewhat
hardware and/or software intensive.  Interestingly enough, even this
disadvantage is not too much for some machines to still get a win out 
of it.

Robert Cousins

Speaking for myself alone.

prem@crackle.amd.com (Prem Sobel) (04/18/89)

In article <7794@phoenix.Princeton.EDU> mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) writes:
>Excuse me, I'm a complete novice in this area, but is it necessary that
>_all_ the memory of each processor be shared, and thus be on a very
>fast and expensive bus?

The real issue is not really cost (at least for some), but speed. No matter
how fast you build the memory, the memory bandwidth will be saturated for
some number of processors. To get unlimited linear speedup one MUST use
local memory and minimize either interprocessor communications or use of
shared memory.

>I guess that in this scheme, there would be some extra physical memory
>riding on the bus for the shared data, but if one really didn't want
>to waste anything, is it possible to have a software-selectable but
>physically implemented device to configure banks of real memory to
>individual processors  or shared pool?

Creve Maples, when he was at Lawrence Berkeley Labs, built just such a
machine called MIDAS which had reconfigurable memory that could be made
local, shared and dynamically switched. It got quite impressive speedup
and reliability (when the software to support that was done). The real
trick was the reconfigurable interconnects.

kleonard@gvlv2.GVL.Unisys.COM (Ken Leonard) (04/21/89)

* [stuff deleted] 
* Is anyone out there actually doing work with fiber optic busses? 
--well, our so-called VHSLAN (now building prototype) can be run 
as a "bus" (sort of ring-like) as well as a "net" (sort of crossbar-like.) 
* I remember reading that the bandwidth of current fiber optic 
--true, but watch out for what "bandwidth" really means 
* cables is in the 3500 _Gigahertz_ range in the IR band alone. 
--let's say that the channel bandpass is 3500 gigabits max with present 
technology but not forget that by the time we're almost economically reasonable 
and have gotten the error rate really under control, we're talking about 
2000 to 2400 gigabaud at approx 1/2 bit information per baud (yes, I typed 
it right, about 2 line state times per INFORMATION bit) 
* This would seem to be adequate for the next generation of 
* computers (but probably not the generation after that....), esp 
--well, it just about fits the current generation top-end systems with 
one fiber-wavelength-channel per data channel 
* considering that fiber optic cables were meant for multiplexing. 
--OUCH--that word again--_multiplexing_--multiple frequencies/wavelengths 
per fiber is NOT MULTIPLEXING--unless you regard the common case of umpteen 
zillion radio and TV stations sharing the same luminiferous aether to be 
also a case of multiplexing 
--so-called frequency- or wavelength-division multiplexing is NOT AT ALL 
SIMILAR TO time-division or contention or code-division multiplexing 
and is almost an entirely separate discussion 
--FDM/WDM is really PROPAGATION MEDIUM SHARING 
--TDM/CM/CDM etc is SIGNAL CHANNEL SHARING, at best, and is often (usually 
at the present?) more like DATA CHANNEL SHARING 
* Each board could have its own frequency range, allowing 
* multiple boards to talk at the same time. Any comments? 
--and let's not forget ((decision) engine sharing?) and (switch space sharing?) 
and (information sharing?) and (?) 
--ain't this fun? 
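
Multiplying those figures out in a few lines of C (same numbers as above,
nothing new):

#include <stdio.h>

int main(void)
{
    /* figures from above: 2000-2400 gigabaud, ~1/2 information bit/baud */
    double gbaud[2] = { 2000.0, 2400.0 };
    double bits_per_baud = 0.5;

    for (int i = 0; i < 2; i++)
        printf("%.0f gigabaud -> %.0f Gbit/sec -> %.0f Gbyte/sec of payload\n",
               gbaud[i], gbaud[i] * bits_per_baud,
               gbaud[i] * bits_per_baud / 8.0);
    return 0;
}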

henry@utzoo.uucp (Henry Spencer) (04/23/89)

In article <17298@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>... all the (high-speed) main memory is "special," and is right next
>to the CPU/cache.  What do you do for multiprocessors? Build that ECL bus
>and charge several $million.

Or admit that global shared memory simply cannot be made fast enough for
such systems without several $million, and provide local memory for speed
with slower global memory for coordination only.  Potentially much more
of a hassle for the software, but still workable.
-- 
Mars in 1980s:  USSR, 2 tries, |     Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries.      | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

henry@utzoo.uucp (Henry Spencer) (04/23/89)

In article <199@gvlv2.GVL.Unisys.COM> kleonard@gvlv2.GVL.Unisys.COM (Ken Leonard) writes:
>...We haven't completely figured out how to deal 
>with the "where-to-send-this-piece" info for a given "block" as fast as we can 
>move the piece...

Are you familiar with Chesson's "protocol engine" work?  He's looking at
a fully general virtual-circuit protocol that can keep up with 100Mbps FDDI.
Yes, it's done in hardware.  The protocol is carefully organized so it can
be processed "on the fly", i.e. the hardware can keep up with 100Mpbs
indefinitely, not just in bursts, provided there's somewhere for the data
to go at that speed.
-- 
Mars in 1980s:  USSR, 2 tries, |     Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries.      | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

ching@pepsi.amd.com (Mike Ching) (04/23/89)

In article <1989Apr22.225625.5883@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
 >In article <199@gvlv2.GVL.Unisys.COM> kleonard@gvlv2.GVL.Unisys.COM (Ken Leonard) writes:
 >>...We haven't completely figured out how to deal 
 >>with the "where-to-send-this-piece" info for a given "block" as fast as we can 
 >>move the piece...
 >
 >Are you familiar with Chesson's "protocol engine" work?  He's looking at
 >a fully general virtual-circuit protocol that can keep up with 100Mbps FDDI.
 >Yes, it's done in hardware.  The protocol is carefully organized so it can
 >be processed "on the fly", i.e. the hardware can keep up with 100Mpbs
 >indefinitely, not just in bursts, provided there's somewhere for the data
 >to go at that speed.

Actually 100Mbits is the low end of the range being addressed by the protocol
engine. The intent is to have a protocol and hardware that can support future
gigabit networks as well as the current 100megabit one.

Mike Ching

kleonard@gvlv2.GVL.Unisys.COM (Ken Leonard) (04/24/89)

In article <25368@amdcad.AMD.COM> ching@pepsi.AMD.COM (Mike Ching) writes:
* In article <1989Apr22.225625.5883@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
*  >In article <199@gvlv2.GVL.Unisys.COM> kleonard@gvlv2.GVL.Unisys.COM (Ken Leonard) writes:
*  >>...We haven't completely figured out how to deal 
*  >>with the "where-to-send-this-piece" info for a given "block" as fast as we can 
*  >>move the piece...
*  >
*  >Are you familiar with Chesson's "protocol engine" work?  He's looking at
*  >a fully general virtual-circuit protocol that can keep up with 100Mbps FDDI.
*  >Yes, it's done in hardware.  The protocol is carefully organized so it can
*  >be processed "on the fly", i.e. the hardware can keep up with 100Mbps
*  >indefinitely, not just in bursts, provided there's somewhere for the data
*  >to go at that speed.
* 
* Actually 100Mbits is the low end of the range being addressed by the protocol
* engine. The intent is to have a protocol and hardware that can support future
* gigabit networks as well as the current 100megabit one.
Yes, we have seen some of Chesson's work, and similar work including some being
done elsewhere in Unisys.  And there are two key differences (undermining
assumptions?) between those works and what we are trying to do...
-
1)  "fully general"
We are not attempting, and do not particularly want to achieve, a network
to handle "the general case".  Our objective is to interconnect VERY high-speed,
VERY high-volume, VERY high-load [hosts].  The immediate case is a campus-scale
set of supercomputers and associated peripherals [storage or source or sink or
interaction] communicating at something close to as fast as they can compute.
Now 100 MFlop/sec at (only?) 32 bit/number is 3.2 Gbit/sec.  Which is not quite
yet what these systems _sustain_ for _many_ seconds, but that is our target
range.  Refer to current work on "computational visualization", and similar
ideas, then consider _numerous_ "visualization workstations" and _several_
or more "primary computational engines".  
-
2)  "virtual circuit"
The virtual circuit concept, as usually interpreted and applied in networks
of extremely many endpoints, is necessarily (?) part of any general case
approach, but unfortunately carries unnecessary baggage in both the
intellectual and practical senses. (did I say that right?)  
We do, necessarily, have a concept of "circuit", at least to the extent
that there is an established and finite agreement between two end points 
to exchange information. 
And our concept of a circuit is "virtual", at least to some extent, in that
not all segments of the intervening signal media are irrevocably and
uninterruptedly dedicated to the particular circuit or session, or that
on two disjoint occasions, an apparently same circuit between a given pair of
end points may utilize different signal paths.
-------
We view a transaction as involving at least a MegaByte, and well likely
numerous MegaBytes of data in at least one of the two directions of the
circuit.  And we view a session as involving numerous to numerous _hundreds_
or even _thousands_ of transactions.
So a session easily approaches several to numerous GigaBytes of traffic
or more, and becomes many to very many seconds in duration.
-------
In the case of "vizualization", or the more general case of "process
distribution", or the worse case of "coupled computations", there are also
the elements of "timeliness" and "computational compactness".
--Timeliness applies to things like step-results (e.g. arrays?) (large chunks)
which must be moved from one process engine/locus to another with enough
expediency (?) that the receiver will always find its next needed input
ready to be used.  Timeliness also applies, maybe even more critically,
when the unit intermediate result is not so large.  That is, a megaword
intermediate array may admit some significant buffering for some processes
but a 20-word unit vector probably needs to "get there now" and
"be used immediately" so that a return result can "be returned PDQ".
So the idea of timeliness runs head-on into things like packet-delay or
channel-access-latency or acknowledgement-turnaround--in terms of what
has to happen "inside" the "network space" and also in terms of
how much computation the sender or receiver  has to devote just to
sending and receiving instead of "useful" computing.
--Computational compactness applies to things like array-like or
whole-image-like or file-like transaction data units, in that none of
the unit is useful to the receiver until all of the unit has been received,
or that the sender cannot compute the next unit until the now completed
unit has been sent.  So even if we can move data a fair bit better than the
long-average computation rate, we end up losing a lot of very expensive
computation engine cycles waiting for these bunches of data to "get here"
of "get gone".
--The concepts of shared networking, shared channel media, virtual circuits
sharing physical route segments, etc., running everything at levelled or
average rates, even aggregated onto very high-speed facilities, end up
having a backlash effect that EVERYTHING, END-TO-END, ends up being
shared and multiplexed and multiprocessed and levelled.  Which ends up,
at the very least, with an awful lot of overhead spread across the
realm or concentrated in a couple of bottlenecks--but still a lot
of overhead and, worse, logical complexity which we think is unnecessary.
---------
So, anyhow, that's enough for one diatribe disguised as a tutorial.
---
Regardz,
Ken

aglew@mcdurb.Urbana.Gould.COM (04/29/89)

>In article <17298@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>| Though it's done for different reasons, Macintosh is a sign of times to
>| come:  all the (high-speed) main memory is "special," and is right next
>| to the CPU/cache.  What do you do for multiprocessors? Build that ECL bus
>| and charge several $million.
>
>  Well, sort of.
>  You do charge several millions, but you don't so much build a bus as
>you do a star, with the memory in the middle and the processors out on
>the arms.  The thing in the middle is called a system controller on a 'bun
>and costs a non-trivial amount of money.
>
>  One of my dear friends still works on Honeywells and regularly gets
>into uncontrolled snickering fits when the magazine-types start talking
>about bus minis replacing mainframes because they have so many more 
>mips...
>
>David Collier-Brown,  | {toronto area...}lethe!dave

Now that there is no bus to snoop on, cache coherence between caches at the 
ends of the arms of the star [?] becomes problematic.  Start looking at
software managed cache coherency, or strictly private memory. Start thinking
about making a second stage cache controller at the center of the star,
that copies the tags in the private caches at the ends of the star, and uses
that to control "selectcasts" for snooping. Start worrying about having more
tags than data in your cache - or start upping the page size to 32K - 1M.
Realize that interprocessor communication is *slow*; think about a special
synchronization bus...
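
A C sketch of the duplicate-tag "selectcast" filter described above; the
direct-mapped organization, the sizes, and the interface are assumptions made
up for illustration:

#include <stdio.h>
#include <string.h>

/* Sketch of the "second stage controller copies the private tags" idea:
   the hub keeps a duplicate of each CPU's cache tags and forwards
   ("selectcasts") an invalidate only to caches that could hold the block. */
#define NCPUS 8
#define NSETS 1024                      /* one tag per set, direct-mapped */

static unsigned long dup_tag[NCPUS][NSETS];
static unsigned char dup_valid[NCPUS][NSETS];

/* hub-side record of a cache fill by 'cpu' for block address 'block' */
static void note_fill(int cpu, unsigned long block)
{
    int set = (int)(block % NSETS);
    dup_tag[cpu][set]   = block / NSETS;
    dup_valid[cpu][set] = 1;
}

/* on a write by 'writer', return a bitmask of caches needing an
   invalidate, instead of broadcasting to everybody */
static unsigned selectcast_mask(int writer, unsigned long block)
{
    int set = (int)(block % NSETS);
    unsigned long tag = block / NSETS;
    unsigned mask = 0;

    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (cpu == writer)
            continue;
        if (dup_valid[cpu][set] && dup_tag[cpu][set] == tag) {
            mask |= 1u << cpu;
            dup_valid[cpu][set] = 0;    /* hub copy tracks the invalidate */
        }
    }
    return mask;
}

int main(void)
{
    memset(dup_valid, 0, sizeof dup_valid);
    note_fill(0, 0x1234);
    note_fill(3, 0x1234);
    note_fill(5, 0x9999);

    /* only CPU 3 shares block 0x1234 with the writer, so mask = 0x08 */
    printf("write to 0x1234 by CPU 0: selectcast mask 0x%02x\n",
           selectcast_mask(0, 0x1234));
    return 0;
}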

Yep... I've been through it once. Now that I'm working with microprocessors
again, I'll not have to worry about it for, oh, say maybe a year or so.

dave@lethe.UUCP (Dave Collier-Brown) (04/30/89)

>>  Well, sort of.
>>  You do charge several millions, but you don't so much build a bus as
>>you do a star, with the memory in the middle and the processors out on
>>the arms.  The thing in the middle is called a system controller on a 'bun
>>and costs a non-trivial amount of money.

In article <28200304@mcdurb> aglew@mcdurb.Urbana.Gould.COM writes:
>Now that there is no bus to snoop on, cache coherence between caches at the 
>ends of the arms of the star [?] becomes problematic.  Start looking at
>software managed cache coherency, or strictly private memory. 
[several more reasonable alternatives left out]

  If memory serves, there is no true cache on the DPS-8 processor, but there
is some sort of load/store consistency logic. There is supposed to be a quite
normal-looking cache **in the controller** of the NEC-designed DPS 90s, but I
know so little about the new machines that this may be completely wrong.  I
don't work on honeybuns any more (alas!).
Does anyone know the cache consistency scheme on the DPS-90 or the
NEC/Honeywell supercomputer?

--dave


-- 
David Collier-Brown,  | {toronto area...}lethe!dave
72 Abitibi Ave.,      |  Joyce C-B:
Willowdale, Ontario,  |     He's so smart he's dumb.
CANADA. 223-8968      |