mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (04/05/91)
The recent introduction of the HP "Snakes" computer systems underscores a critical characteristic of modern scientific computing: the rules of the game change *very* quickly. It is easy to convince oneself that the most cost-effective computing environment over the long term (which these days means "a few years") is a heterogeneous distributed network, with low-cost hardware that is updated incrementally. In that case it is vital to use open systems and off-the-shelf technology.

Unfortunately, the most commonly available networking option (ethernet) uses a broadcast approach, which is definitely sub-optimal for the communications needs of many "natural" parallel distributed algorithms. In talking to my IBM salescritter about this topic, he suggested using an additional SCSI interface as the networking option. I think this is a potentially important idea. It uses off-the-shelf hardware, and requires only the writing of some (fairly simple?) SCSI device drivers.

The system I envision would consist of some moderate number of cheap, high-performance workstations (IBM RS/6000-320 or HP/9000-720); let's take 32 as an example. An additional SCSI interface on each unit would provide 7 high-speed, point-to-point interfaces, which could be reconfigured (via a patch panel) to provide a large number of different topologies: a line, a 2-D mesh, a 3-D mesh, a ring, etc.

I believe that it would be easy to modify a Linda-like software system to operate efficiently on such a system. Simply map particular named tuple-spaces onto particular device drivers so that point-to-point communications may be executed directly. A "default" or un-named tuple space could still be handled in a global fashion using the broadcast network; it might be used for things like global accumulations and synchronization messages.
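The tuple-space routing idea above can be sketched minimally as follows. This is purely an illustration in Python, not any real Linda implementation; the class and channel names ("east", "scsi-east", etc.) are hypothetical:

```python
# Sketch: route Linda-style "out" operations on named tuple spaces to
# dedicated point-to-point links, falling back to the shared broadcast
# network for the default (un-named) tuple space.

class Channel:
    """Stand-in for a point-to-point (e.g. SCSI) or broadcast link."""
    def __init__(self, name):
        self.name = name
        self.sent = []          # tuples delivered via this link

    def send(self, tup):
        self.sent.append(tup)

class TupleSpaceRouter:
    def __init__(self, broadcast):
        self.broadcast = broadcast   # default/un-named tuple space
        self.links = {}              # named tuple space -> dedicated link

    def bind(self, space_name, channel):
        """Map a particular named tuple space onto a particular device."""
        self.links[space_name] = channel

    def out(self, tup, space=None):
        """Linda-style 'out': deposit a tuple into a tuple space."""
        self.links.get(space, self.broadcast).send(tup)

# Example: a neighbor exchange goes point-to-point; a global sum
# goes over the broadcast net.
eth = Channel("ethernet")
router = TupleSpaceRouter(broadcast=eth)
router.bind("east", Channel("scsi-east"))
router.out(("boundary", 1.0), space="east")   # dedicated link
router.out(("global-sum", 42.0))              # broadcast network
```

The point of the design is that the application names its communication pattern (the tuple space), and the runtime maps that name onto whichever physical link the patch panel provides.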
A suitably designed code (for example, a 3-D spectral element code for fluid dynamics using explicit time-marching techniques) should be capable of 1 GFLOPS performance on a network of 32 IBM RS/6000-320's. With current university discounts and third-party memory, such a system (with 32 MB RAM per node) could be assembled for less than $500,000. Apparently, the HP/9000-720 would be capable of slightly higher performance at a similar cost (though I don't know about third-party memory for the HP's).

So what is wrong with this idea? Am I misunderstanding what can be done with a SCSI interface, or how hard the implementation of a buffered FIFO would be?
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET
--
=========================== MODERATOR ==============================
Steve Stevenson				{steve,fpst}@hubcap.clemson.edu
Department of Computer Science, comp.parallel
Clemson University, Clemson, SC 29634-1906	(803)656-5880.mabell
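A quick back-of-envelope check of the claim above. The per-node sustained rate used here is an assumption chosen to make the aggregate work out, not a vendor figure (McCalpin himself uses 5-10 MFLOPS for his application later in the thread):

```python
# Sanity check of the "1 GFLOPS for under $500K" claim.
# The 31.25 MFLOPS sustained-per-node figure is an illustrative
# assumption, not a measured or quoted number.

nodes = 32
mflops_per_node = 31.25              # assumed sustained rate per RS/6000-320
total_gflops = nodes * mflops_per_node / 1000.0

budget = 500_000                     # total system cost ceiling (dollars)
cost_per_node = budget / nodes       # must cover CPU, 32 MB RAM, SCSI card

print(total_gflops)    # 1.0 GFLOPS aggregate
print(cost_per_node)   # 15625.0 dollars per node
```

So the claim requires each node to sustain roughly 30 MFLOPS and to cost under about $16K fully configured.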
gnn@minestrone.Berkeley.EDU (George Neville-Neil) (04/06/91)
In article <1991Apr5.182853.20728@hubcap.clemson.edu>, mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
|>The system I envision would consist of some moderate number of
|>high-performance cheap workstations (IBM RS/6000-320 or HP/9000-720),
|>lets take 32 as an example. An additional SCSI interface on each unit
|>would provide 7 high-speed, point-to-point interfaces, which could be
|>reconfigured (via a patch panel) to provide a large number of
|>different topologies, including: a line, a 2-D mesh, a 3-D mesh, a
|>ring, etc.
|>
|>So what is wrong with this idea? Am I misunderstanding what can be
|>done with a SCSI interface or how hard the implementation of a
|>buffered FIFO would be?

Hmm. I am not sure about this idea. I am not an expert on SCSI, but I seem to remember that you can only have seven (maybe eight) SCSI devices on a SCSI string (one cable/controller). You would also be limited by transmission distance. If it is true that the processors/systems/whatever would have to be within close proximity (a meter or two), wouldn't it be cheaper to build something like this using cards in a backplane?

Also, the SCSI spec was basically built around disk architectures, synchronizing reads and writes, etc. If this is all you want the net to do, then fine. But it also means that if you want TCP/IP or UDP or some other protocol, it will have to be done in software, with the SCSI stuff as a low-level protocol in your network. It's doable, but is it worth it?

To sum up, I'm not sure, but I don't think it's an optimal approach. You might look into the literature to see if anyone has used SCSI as a network instead of as just a master/slave controller system. Hope this helped....

Later,
George

George Neville-Neil			Kinky is as kinky does.
gnn@mammoth.berkeley.edu
Warning: Taking drugs approved by the FDA can be hazardous to your health.
tbray@watsol.waterloo.edu (Tim Bray) (04/06/91)
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
Unfortunately, the most commonly available networking
option (ethernet) uses a broadcast approach, which is definitely
sub-optimal for the communications needs of many "natural"
parallel distributed algorithms.
...proposed complex SCSI message-passer...
So what is wrong with this idea?
done with a SCSI interface or how hard the implementation of a
buffered FIFO would be?
What's wrong with this idea: first you have to prove that when you build
this system, you have a bottleneck at the Ethernet. Then you have to
prove you can't fix it by just dropping a same-but-faster FDDI spine or
suchlike.
As for me, when I first saw the Ethernet protocol, I said: mickey mouse -
that'll never fly for real work. When I heard about people wanting to use
TCP/IP for local-area-networking, I said: that protocol will blow Ethernet
out of the water in about 10 minutes. Sigh, 0 for 2, and another lesson in
the futility of intuition in predicting performance bottlenecks. My own data
point: I've worked on a lot of different distributed environments, and 95% of
the time, you run out of CPU, or filesystem bandwidth, or context switches,
or something, before your Ethernet runs out of gas.
Of course, you could be right.
Tim Bray, Open Text Systems
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (04/08/91)
In article <1991Apr5.182853.20728@hubcap.clemson.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
>Unfortunately, the most commonly available networking
>option (ethernet) uses a broadcast approach, which is definitely
>sub-optimal for the communications needs of many "natural"
>parallel distributed algorithms.

Actually, shared-channel broadcast is optimal, as long as
- you don't run out of bandwidth (or pile up big latencies), and
- detecting that a message isn't for you doesn't impact your performance.

The trouble with point-to-point links is that you wind up implementing message forwarding, the downside being
- more code
- more latency
- performance impact on the intermediaries

You can sort-of cure this with the right topology, or with hardware support, but you still have to write the code or design the hardware. (There have been some very interesting attempts lately - e.g. DEC SRC AutoNet - to build systems which dynamically discover their topology.)

>In talking to my IBM salescritter about this topic, he suggested using
>an additional SCSI interface as the networking option.
>A suitably designed code (for example a 3-D spectral element code for
>fluid dynamics using explicit time marching techniques) should be
>capable of 1 GFLOPS performance on a network of 32 IBM RS/6000-320's.

Kung's "Law" says that if you scale node performance without increasing communication bandwidth, then nodes require more memory: N, N^2, or even N^3 as much, depending on the algorithm. Before choosing a communications setup, I would want to study your application's characteristics and work up some ratios and granularities.
--
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics
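The balance argument Lindsay cites can be illustrated schematically. This is a rough sketch of the scaling relationship, not a formula from Kung's paper; the exponent alpha stands in for the algorithm-dependent N, N^2, N^3 cases:

```python
# Schematic illustration of Kung's balance argument: if node compute
# speed grows by a factor `speedup` while communication bandwidth stays
# fixed, memory per node must grow by roughly speedup**alpha to keep the
# machine balanced, where alpha depends on the algorithm's
# compute-to-communication ratio.  (Illustrative only.)

def balanced_memory_factor(speedup, alpha):
    return speedup ** alpha

# A 10x faster node with unchanged network bandwidth:
for alpha in (1, 2, 3):
    print(balanced_memory_factor(10, alpha))   # 10, 100, 1000
```

The practical reading: bolting 30 MFLOPS nodes onto an unchanged 10 Mbit/s ethernet can force a large per-node memory (i.e., problem-granularity) increase just to keep communication from dominating.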
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (04/08/91)
> On 7 Apr 91 17:50:25 GMT, lindsay@gandalf.cs.cmu.edu (Donald Lindsay) said:
I wrote:
me> Unfortunately, the most commonly available networking option
me> (ethernet) uses a broadcast approach, which is definitely
me> sub-optimal for the communications needs of many "natural"
me> parallel distributed algorithms.
Donald> Actually, shared-channel broadcast is optimal, as long as -
Donald> you don't run out of bandwidth (or pile up big latencies). -
Donald> detecting that a message isn't for you, doesn't impact your
Donald> performance.
The shared broadcast channel might be "optimal" for some hypothetical
general problem, but if the problem naturally requires only
nearest-neighbor communications, then I can't see how broadcast is
more "optimal" than nearest-neighbor point-to-point communications.
The potential problem with the broadcast net is that in a
well-balanced system, all the processors are going to try to dump all
their communications at once, thus saturating the network. I am not
sure how to take this properly into account in the performance
analysis....
Donald> The trouble with point-to-point links is that you wind up implementing
Donald> message forwarding, the downside being
Donald> - more code
Donald> - more latency
Donald> - performance impact on the intermediaries
No! I have no interest in message forwarding! That is what the
ethernet is for. The SCSI-based network idea is strictly intended as
a supplement to allow more networks to be in operation for
nearest-neighbor communications.
me> A suitably designed code (for example a 3-D spectral element code
me> for fluid dynamics using explicit time marching techniques) should
me> be capable of 1 GFLOPS performance on a network of 32 IBM
me> RS/6000-320's.
Donald> Kung's "Law" says that if you scale node performance, without
Donald> increasing communication bandwidth, then nodes require more
Donald> memory: N, N^2 or even N^3 as much, depending on algorithm.
Donald> Before choosing a communications setup, I would want to study
Donald> your application's characteristics, and work up some ratios
Donald> and granularities.
Well, I have done a fair bit of work on this. The work per node per
step is:
(FP ops)/node/step = 140 L M N^2
while the communication required per interface per step is
(64-bit words read/written)/side/step = 7 * (L,M)*N
where (L,M) means either L or M, depending on what side one is
communicating through.
"Interior" nodes will have 4 "sides", "edge" nodes will have 3
"sides", and "corner" nodes will have 2 "sides".
I envision a 4x4 mesh of nodes, with L & M between 25 and 75 and N
being between 5 and 12. This gives node computation times (on a 5-10
MFLOPS cpu) of near 1 second between communications.
The part of the problem that I do not know how to model is the time
required for the communications part. If it were point-to-point, I
would use a latency plus a quantity of data divided by the transfer
rate. With a broadcast network, I do not know how to model the
reduction of the transfer rate caused by network saturation....
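For the point-to-point case, the model described above (latency plus data volume divided by transfer rate) can be put into numbers. The work and communication formulas are from the post; the SCSI latency and bandwidth figures are illustrative assumptions, not measurements:

```python
# Sketch of McCalpin's performance model.  The formulas for work and
# communication per step are from the post; the 1 ms latency and
# 2 MB/s link bandwidth are assumed values for illustration.

def flops_per_node_per_step(L, M, N):
    # (FP ops)/node/step = 140 * L * M * N^2
    return 140 * L * M * N**2

def words_per_side_per_step(side_dim, N):
    # (64-bit words read/written)/side/step = 7 * (L or M) * N
    return 7 * side_dim * N

def step_time(L, M, N, mflops=10.0, latency_s=1e-3,
              mbytes_per_s=2.0, sides=4):
    """Compute time and point-to-point comm time per step, interior node."""
    t_compute = flops_per_node_per_step(L, M, N) / (mflops * 1e6)
    words = words_per_side_per_step(L, N)   # assume L == M for simplicity
    t_comm = sides * (latency_s + words * 8 / (mbytes_per_s * 1e6))
    return t_compute, t_comm

# Mid-range values from the post: L = M = 40, N = 6, a 10 MFLOPS node.
t_compute, t_comm = step_time(L=40, M=40, N=6)
print(t_compute)   # ~0.8 s of computation between communications
print(t_comm)      # ~0.03 s, a small fraction of the compute time
```

Under these assumptions the point-to-point network is far from the bottleneck; the open question in the post, how a saturated broadcast channel degrades the effective transfer rate, is exactly what this simple model cannot capture.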
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@brahms.udel.edu
College of Marine Studies, U. Del. J.MCCALPIN/OMNET
stein@dhw68k.cts.com (Rick 'Transputer' Stein) (04/12/91)
In article <12606@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>Kung's "Law" says that if you scale node performance, without
>increasing communication bandwidth, then nodes require more memory:
>N, N^2 or even N^3 as much, depending on algorithm. Before choosing a
>communications setup, I would want to study your application's
>characteristics, and work up some ratios and granularities.

Can you provide a literature reference on this "law?"
--
Richard M. Stein (aka, Rick 'Transputer' Stein)
Sole proprietor of Rick's Software Toxic Waste Dump and Kitty Litter Co.
"You build 'em, we bury 'em."
uucp: ...{spsd, zardoz, felix}!dhw68k!stein
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (04/21/91)
In article <1991Apr12.015503.28001@dhw68k.cts.com> stein@dhw68k.cts.com (Rick 'Transputer' Stein) writes:
>>Kung's "Law" says that if you scale node performance, without
>>increasing communication bandwidth, then nodes require more memory:
>>N, N^2 or even N^3 as much, depending on algorithm.
>Can you provide a literature reference on this "law?"

"Memory Requirements for Balanced Computer Architectures",
H. T. Kung, 13th Annual International Symposium on Computer
Architecture, Tokyo, 1986.

I think it's also a CMU tech report, and I believe he published something similar in the Journal of Complexity. Sorry to take so long to respond: deadlines.
--
Don		D.C.Lindsay	Carnegie Mellon Robotics Institute