[comp.arch] Networking for Distributed Computing

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (04/05/91)

The recent introduction of the HP "Snakes" computer systems
underscores a critical characteristic of modern scientific computing:
the rules of the game change *very* quickly.

It is easy to convince oneself that the most cost-effective
computing environment over the long term (which these days means "a
few years") is a heterogeneous distributed network, with low-cost
hardware that is updated incrementally.

It is vital in this case to use open systems and off-the-shelf
technology.   Unfortunately, the most commonly available networking
option (ethernet) uses a broadcast approach, which is definitely
sub-optimal for the communications needs of many "natural"
parallel distributed algorithms.

In talking to my IBM salescritter about this topic, he suggested using
an additional SCSI interface as the networking option.  I think that
this is a potentially important idea.  It uses off-the-shelf hardware,
and only requires the writing of some (fairly simple?) SCSI device
drivers.

The system I envision would consist of some moderate number of
high-performance cheap workstations (IBM RS/6000-320 or HP/9000-720),
let's take 32 as an example.  An additional SCSI interface on each unit
would provide 7 high-speed, point-to-point interfaces, which could be
reconfigured (via a patch panel) to provide a large number of
different topologies, including: a line, a 2-D mesh, a 3-D mesh, a
ring, etc.
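
To make the wiring concrete, here is a throwaway C sketch that prints
the patch-panel connections for one possible arrangement: the 32 nodes
as a 4x8 2-D mesh, using 4 of the 7 links on each node.  The node and
port numbering are arbitrary choices of mine and nothing here touches
a real SCSI driver; it is only meant to show how regular the cabling
would be.

/* mesh_wiring.c -- illustrative only: print a hypothetical patch-panel
 * wiring list for 32 nodes arranged as a 4x8 2-D mesh, one spare SCSI
 * link per neighbor.  Port numbers are arbitrary. */
#include <stdio.h>

#define ROWS 4
#define COLS 8

int main(void)
{
    static const char *port_name[4] = { "north", "south", "east", "west" };
    int node, p;

    for (node = 0; node < ROWS * COLS; node++) {
        int r = node / COLS;
        int c = node % COLS;
        int nbr[4];

        nbr[0] = (r > 0)        ? node - COLS : -1;   /* north */
        nbr[1] = (r < ROWS - 1) ? node + COLS : -1;   /* south */
        nbr[2] = (c < COLS - 1) ? node + 1    : -1;   /* east  */
        nbr[3] = (c > 0)        ? node - 1    : -1;   /* west  */

        printf("node %2d:", node);
        for (p = 0; p < 4; p++)
            if (nbr[p] >= 0)
                printf("  link %d (%s) -> node %2d", p, port_name[p], nbr[p]);
        printf("\n");
    }
    return 0;
}

A ring, a line, or a 2x16 mesh is just a different table of
connections; that is the attraction of the patch panel.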

I believe that it would be easy to modify a Linda-like software system
to operate efficiently on such a configuration.  Simply map particular named
tuple-spaces into particular device drivers so that point-to-point
communications may be executed directly.  A "default" or un-named
tuple space could still be handled in a global fashion using the
broadcast network.  It might be used for things like global
accumulations and synchronization messages.
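
As a sketch of what I mean by the mapping (and nothing more than a
sketch: the /dev/scsilinkN names and the table below are inventions of
mine, not a real driver or Linda interface), the dispatch could be as
dumb as a table lookup:

/* route_sketch.c -- toy dispatch of tuple-space traffic.  Named
 * tuple spaces are steered to a dedicated point-to-point link (the
 * /dev/scsilinkN names are made up); anything else falls back to
 * the shared broadcast network. */
#include <stdio.h>
#include <string.h>

struct route {
    const char *space;       /* tuple-space name                  */
    const char *device;      /* point-to-point link to a neighbor */
};

/* one entry per neighbor, set up to match the patch-panel wiring */
static struct route table[] = {
    { "north", "/dev/scsilink0" },
    { "south", "/dev/scsilink1" },
    { "east",  "/dev/scsilink2" },
    { "west",  "/dev/scsilink3" },
};

static const char *route_for(const char *space)
{
    int i;
    for (i = 0; i < (int)(sizeof table / sizeof table[0]); i++)
        if (strcmp(space, table[i].space) == 0)
            return table[i].device;
    return "broadcast (ethernet)";   /* default / un-named tuple space */
}

int main(void)
{
    printf("out into \"north\" -> %s\n", route_for("north"));
    printf("out into \"sync\"  -> %s\n", route_for("sync"));
    return 0;
}

The point is only that an out() into a named space can be steered to
the right link with a lookup like this, while the default space and
the global synchronization traffic stay on the ethernet.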

A suitably designed code (for example a 3-D spectral element code for
fluid dynamics using explicit time marching techniques) should be
capable of 1 GFLOPS performance on a network of 32 IBM RS/6000-320's.
With current university discounts and third-party memory, such a
system (with 32 MB RAM per node) could be assembled for less than
$500,000.  Apparently, the HP/9000-720 would be capable of slightly
higher performance at a similar cost (though I don't know about
3rd-party memory for the HP's).
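
Just dividing through (my arithmetic, not anyone's benchmark figure or
price quote):

        1 GFLOPS  / 32 nodes  =  about 31 MFLOPS sustained per node
        $500,000  / 32 nodes  =  about $15,600 per node (with the 32 MB)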

So what is wrong with this idea?  Am I misunderstanding what can be
done with a SCSI interface or how hard the implementation of a
buffered FIFO would be?
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

-- 
=========================== MODERATOR ==============================
Steve Stevenson                            {steve,fpst}@hubcap.clemson.edu
Department of Computer Science,            comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell

gnn@minestrone.Berkeley.EDU (George Neville-Neil) (04/06/91)

In article <1991Apr5.182853.20728@hubcap.clemson.edu>,
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:

|>The system I envision would consist of some moderate number of
|>high-performance cheap workstations (IBM RS/6000-320 or HP/9000-720),
|>lets take 32 as an example.  An additional SCSI interface on each unit
|>would provide 7 high-speed, point-to-point interfaces, which could be
|>reconfigured (via a patch panel) to provide a large number of
|>different topologies, including: a line, a 2-D mesh, a 3-D mesh, a
|>ring, etc.

|>So what is wrong with this idea?  Am I misunderstanding what can be
|>done with a SCSI interface or how hard the implementation of a
|>buffered FIFO would be?


Hmm.  I am not sure about this idea.  I am not an expert on SCSI, but I seem
to remember that you can only have seven (maybe eight) devices on a SCSI
string (one cable/controller).  You would also be limited by transmission
distance.

If it is true that the processors/systems/whatever would have to be in
close proximity (a meter or two), wouldn't it be cheaper to build something
like this using cards in a backplane??

Also, the SCSI spec was basically built around disk architectures,
synchronizing reads and writes, etc.  If this is all you want the net to do,
then fine.  But it also means that if you want TCP/IP or UDP or some other
protocol, it will have to be done in software, with the SCSI stuff as a
low-level protocol in your network.  It's doable, but is it worth it??

To sum up, I'm not sure, but I don't think it's an optimal approach.

You might look into the literature to see if anyone has used SCSI as a
network instead of as just a master/slave controller system.

Hope this helped....

Later,
George


George Neville-Neil      		Kinky is as kinky does.
gnn@mammoth.berkeley.edu 

Warning: Taking drugs approved by the FDA can be hazardous to your health.

tbray@watsol.waterloo.edu (Tim Bray) (04/06/91)

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
 Unfortunately, the most commonly available networking
 option (ethernet) uses a broadcast approach, which is definitely
 sub-optimal for the communications needs of many "natural"
 parallel distributed algorithms.
 ...proposed complex SCSI message-passer...
 So what is wrong with this idea?  Am I misunderstanding what can be
 done with a SCSI interface or how hard the implementation of a
 buffered FIFO would be?

What's wrong with this idea: first you have to prove that when you build
this system, you have a bottleneck at the Ethernet.  Then you have to
prove you can't fix it by just dropping a same-but-faster FDDI spine or
suchlike.

As for me, when I first saw the Ethernet protocol, I said: mickey mouse -
that'll never fly for real work.  When I heard about people wanting to use
TCP/IP for local-area-networking, I said: that protocol will blow Ethernet
out of the water in about 10 minutes.  Sigh, 0 for 2, and another lesson in
the futility of intuition in predicting performance bottlenecks.  My own data
point: I've worked on a lot of different distributed environments, and 95% of
the time, you run out of CPU, or filesystem bandwidth, or context switches,
or something, before your Ethernet runs out of gas.

Of course, you could be right.
Tim Bray, Open Text Systems

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (04/08/91)

In article <1991Apr5.182853.20728@hubcap.clemson.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
>Unfortunately, the most commonly available networking
>option (ethernet) uses a broadcast approach, which is definitely
>sub-optimal for the communications needs of many "natural"
>parallel distributed algorithms.

Actually, shared-channel broadcast is optimal, as long as
- you don't run out of bandwidth (or pile up big latencies).
- detecting that a message isn't for you doesn't impact your performance.

The trouble with point-to-point links is that you wind up implementing
message forwarding, the downside being
- more code
- more latency
- performance impact on the intermediaries

You can sort-of cure this with the right topology, or with hardware
support, but you still have to write the code or design the hardware.
(There have been some very interesting attempts lately - e.g. DEC SRC
AutoNet - to build systems which dynamically discover their topology.)

>In talking to my IBM salescritter about this topic, he suggested using
>an additional SCSI interface as the networking option.

>A suitably designed code (for example a 3-D spectral element code for
>fluid dynamics using explicit time marching techniques) should be
>capable of 1 GFLOPS performance on a network of 32 IBM RS/6000-320's.

Kung's "Law" says that if you scale node performance, without
increasing communication bandwidth, then nodes require more memory:
N, N^2 or even N^3 as much, depending on algorithm. Before choosing a
communications setup, I would want to study your application's
characteristics, and work up some ratios and granularities.
-- 
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (04/08/91)

> On 7 Apr 91 17:50:25 GMT, lindsay@gandalf.cs.cmu.edu (Donald Lindsay) said:

I wrote:
me> Unfortunately, the most commonly available networking option
me> (ethernet) uses a broadcast approach, which is definitely
me> sub-optimal for the communications needs of many "natural"
me> parallel distributed algorithms.

Donald> Actually, shared-channel broadcast is optimal, as long as
Donald> - you don't run out of bandwidth (or pile up big latencies).
Donald> - detecting that a message isn't for you doesn't impact your
Donald>   performance.

The shared broadcast channel might be "optimal" for some hypothetical
general problem, but if the problem naturally requires only
nearest-neighbor communications, then I can't see how broadcast is
more "optimal" than nearest-neighbor point-to-point communications.

The potential problem with the broadcast net is that in a
well-balanced system, all the processors are going to try to dump all
their communications at once, thus saturating the network.  I am not
sure how to take this properly into account in the performance
analysis....



Donald> The trouble with point-to-point links is that you wind up implementing
Donald> message forwarding, the downside being
Donald> - more code
Donald> - more latency
Donald> - performance impact on the intermediaries

No!  I have no interest in message forwarding!  That is what the
ethernet is for.  The SCSI-based network idea is intended strictly as a
supplement, so that additional point-to-point links can operate in parallel
for nearest-neighbor communications.



me> A suitably designed code (for example a 3-D spectral element code
me> for fluid dynamics using explicit time marching techniques) should
me> be capable of 1 GFLOPS performance on a network of 32 IBM
me> RS/6000-320's.

Donald> Kung's "Law" says that if you scale node performance, without
Donald> increasing communication bandwidth, then nodes require more
Donald> memory: N, N^2 or even N^3 as much, depending on algorithm.
Donald> Before choosing a communications setup, I would want to study
Donald> your application's characteristics, and work up some ratios
Donald> and granularities.

Well, I have done a fair bit of work on this.  The work per node per
step is:

	(FP ops)/node/step = 140 L M N^2

while the communication required per interface per step is 

	(64-bit words read/written)/side/step = 7 * (L,M)*N

where (L,M) means either L or M, depending on what side one is
communicating through.  

"Interior" nodes will have 4 "sides", "edge" nodes will have 3
"sides", and "corner" nodes will have 2 "sides".

I envision a 4x4 mesh of nodes, with L and M between 25 and 75 and N
between 5 and 12.  This gives node computation times (on a 5-10
MFLOPS cpu) of roughly 1 second between communications.

The part of the problem that I do not know how to model is the time
required for the communications part.  If it were point-to-point, I
would use a latency plus a quantity of data divided by the transfer
rate.   With a broadcast network, I do not know how to model the
reduction of the transfer rate caused by network saturation....
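
For the point-to-point case, here is the sort of back-of-the-envelope
estimate I have in mind, coded up in C.  The MFLOPS, latency, and
bandwidth figures are pure guesses on my part, not measurements of any
real SCSI (or other) link, and the problem size is just one point in
the ranges above:

/* model.c -- back-of-the-envelope compute vs. point-to-point
 * communication time per step, using the formulas above.  All of
 * the rate and latency numbers are guesses, not measurements. */
#include <stdio.h>

int main(void)
{
    double L = 25.0, M = 25.0, N = 8.0;  /* one possible problem size      */
    double mflops    = 7.5;              /* sustained per node (guess)     */
    double latency   = 1.0e-3;           /* seconds per message (guess)    */
    double bandwidth = 2.0e6;            /* bytes/second per link (guess)  */
    double sides     = 4.0;              /* an "interior" node has 4 sides */

    /* (FP ops)/node/step = 140 L M N^2 */
    double flops_per_step = 140.0 * L * M * N * N;
    double compute_time   = flops_per_step / (mflops * 1.0e6);

    /* (64-bit words)/side/step = 7 * L * N  (taking an L side; L = M here) */
    double words_per_side = 7.0 * L * N;
    double comm_time = sides * (latency + words_per_side * 8.0 / bandwidth);

    printf("compute: %.2f s/step    point-to-point comm: %.3f s/step\n",
           compute_time, comm_time);
    printf("compute/communicate ratio: %.0f to 1\n", compute_time / comm_time);
    return 0;
}

With those made-up numbers the compute/communicate ratio comes out
near 30 to 1, which is why the broadcast saturation term is the piece
I am worried about getting right.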
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

stein@dhw68k.cts.com (Rick 'Transputer' Stein) (04/12/91)

In article <12606@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>Kung's "Law" says that if you scale node performance, without
>increasing communication bandwidth, then nodes require more memory:
>N, N^2 or even N^3 as much, depending on algorithm. Before choosing a
>communications setup, I would want to study your application's
>characteristics, and work up some ratios and granularities.
>-- 
>Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics
Can you provide a literature reference on this "law"?
-- 
Richard M. Stein (aka, Rick 'Transputer' Stein)
Sole proprietor of Rick's Software Toxic Waste Dump and Kitty Litter Co.
"You build 'em, we bury 'em." uucp: ...{spsd, zardoz, felix}!dhw68k!stein 

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (04/21/91)

In article <1991Apr12.015503.28001@dhw68k.cts.com> stein@dhw68k.cts.com (Rick 'Transputer' Stein) writes:
>>Kung's "Law" says that if you scale node performance, without
>>increasing communication bandwidth, then nodes require more memory:
>>N, N^2 or even N^3 as much, depending on algorithm.
>Can you provide a literature reference on this "law?"

"Memory Requirements for Balanced Computer Architectures"
H.T. Kung
13th Annual International Symposium on Computer Architecture
1986, Tokyo

I think it's also a CMU tech report, and I believe he published
something similar in the Journal of Complexity.

Sorry to take so long to respond: deadlines.



-- 
Don		D.C.Lindsay 	Carnegie Mellon Robotics Institute