pasek@ncrcce.StPaul.NCR.COM (Michael A. Pasek) (10/24/89)
In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes
(in response to article <36597@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov
(Eugene Brooks)):
>Eugene, ever seen a transputer? Overhead for receiving or sending a
>message is 19 cycles (630 ns for a 30 MHz part). The actual transfer
>is done by a dedicated DMA machine at a maximum rate of 1.7 Mbyte/s
                                               ^^^^^^^^^^^^^^^^^^^^^

Although you describe this as "message passing" between the computers,
it sounds to me like you have one "global" memory which is controlled by
the "dedicated DMA machine", and this memory is "shared" by all the
"local" processors.  What difference does it make whether you set some
register in your "dedicated DMA machine" and tell it to move some data to
another location, or just set an address latch in your micro somewhere
and do a "store" instruction?

M. A. Pasek          Switching Software Development        NCR Comten, Inc.
(612) 638-7668            CNG Development              2700 N. Snelling Ave.
pasek@c10sd3.StPaul.NCR.COM                              Roseville, MN 55113
davidb@titan.inmos.co.uk (David Boreham) (10/26/89)
In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes
(in response to article <36597@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov
(Eugene Brooks)):
>Eugene, ever seen a transputer? Overhead for receiving or sending a
>message is 19 cycles (630 ns for a 30 MHz part). The actual transfer
>is done by a dedicated DMA machine at a maximum rate of 1.7 Mbyte/s

and then in <1646@ncrcce.StPaul.NCR.COM> pasek@c10sd3.StPaul.NCR.COM
(M. A. Pasek) says some things which I don't really understand.

Anyway, here is what a transputer link actually IS:

   ------------
   !          !        ----------
   !  Memory  !        !  DMA   !        --------------
   ! on CPU A !========! Engine !========! Serializer !-------
   !          !        !        !        !            !      !
   ------------        ----------        --------------      !
                                                             ! Serial
                                                             ! Link
  -----------------------------------------------------------!-----
                                                             !
   ------------                                              !
   !          !        ----------                            !
   !  Memory  !        !  DMA   !        --------------      !
   ! on CPU B !========! Engine !========! Serializer !-------
   !          !        !        !        !            !
   ------------        ----------        --------------

Everything above the dotted line is on transputer A, everything below
the line is on transputer B.  Of course there is also a CPU which fights
for local memory on a cycle-by-cycle basis.  Each transputer has N copies
of the link circuitry (including 2N DMA engines).  Currently 2 <= N <= 4
and the serial link goes at 20 Mbits/s.

David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol, England             |     (us): uunet!inmos-c!davidb
+44 454 616616 ex 543        | Internet : @col.hp.com:davidb@inmos-c
rik@cs.washington.edu (Rik Littlefield) (10/27/89)
In article <1646@ncrcce.StPaul.NCR.COM>, pasek@ncrcce.StPaul.NCR.COM
(Michael A. Pasek) writes: [about the transputer]
> Although you describe this as "message passing" between the computers,
> it sounds to me like you have one "global" memory which is controlled by the
> "dedicated DMA machine", and this memory is "shared" by all the "local"
> processors.  What difference does it make whether you set some register in
> your "dedicated DMA machine" and tell it to move some data to another
> location, or just set an address latch in your micro somewhere and do a
> "store" instruction ?

It matters a great deal, both in raw performance and in how you program
the beasts.  On a transputer, 10 microseconds is what, 30 instructions?
Say you're in a tiny network, average 3 hops each out and back.  That's
6 hops, 180 instructions latency.  We can quibble about the numbers, but
the point is, it's a long time.

What this means is that it's dangerous to think "when I need this datum,
I'll just go get it".  That programming model works OK on the "shared
memory" machines, which have much lower latency.  On a transputer array,
it gives out quickly as you try programs that do progressively more
sharing.  The same thing happens on any other existing distributed-memory
machine.  Of course, as several posters have pointed out, if each node
has enough things to do, you may be able to hide the latency.

Perhaps a more fundamental difference is that shared-memory machines
provide hardware to guarantee that all processors have a consistent view
of memory.  Cooperation between processors can be controlled just by
synchronizing.  There's no need to explicitly update other processors'
copies of shared data.  With the distributed-memory machines, no
processor will find out about an update until it asks or another one
tells it.
The programmer can be insulated from this necessity by the compiler
and/or runtime system, as with Kai Li's shared virtual memory (why isn't
it "virtual shared memory"?), but the performance penalty can be severe
if you're unlucky about the sharing patterns.

It looks to me like the distributed-memory machines are pretty good at
running three kinds of programs.  First are those that just don't require
much write-sharing.  Second are those with communication patterns that
can be mapped to nearest-neighbor links, assuming that you're lucky
enough to have a low-overhead machine like the transputer.  Third are
those in which the latency can be hidden and overhead amortized by
handling lots of independent store/fetch requests at once.
(Combinations of the above are best of all ;-)

Using that third approach, there seem to be quite a few applications
that can run well even with high communications overhead.  (See Fox's
book, "Solving Problems on Concurrent Processors".)  But lack of
software support is a big problem.  Programs like that can be written
using explicit message passing, but it's not easy.  Several people,
including myself, are working on compilation and runtime techniques to
let you write programs using a shared-memory data model, but have them
execute efficiently on a message passer.  It's not an easy problem
(Ph.D. thesis material), but there are several promising approaches.
Stay tuned, but don't hold your breath.

--Rik
sc@klingon.pc.cs.cmu.edu (Siddhartha Chatterjee) (10/27/89)
In article <9605@june.cs.washington.edu> rik@cs.washington.edu
(Rik Littlefield) writes:
>Perhaps a more fundamental difference is that shared memory machines
>provide hardware to guarantee that all processors have a consistent view of
>memory. Cooperation between processors can be controlled just by

Well, yes and no.  On a bus-connected machine like the Encore you have
snoopy caches that keep the data consistent and KEEP IT CLOSE TO THE
PROCESSOR.  On something like a Butterfly, the default behaviour is that
shared data is not cached; so yes, memory is consistent, but you have to
go out across the network to access it.  You can do explicit cache
flushing in software, but that's a different game.
--
ARPA: Siddhartha.Chatterjee@CS.CMU.EDU
USPS: School of Computer Science, CMU, Pittsburgh, PA 15213
brooks@vette.llnl.gov (Eugene Brooks) (10/27/89)
In article <6708@pt.cs.cmu.edu> sc@klingon.pc.cs.cmu.edu
(Siddhartha Chatterjee) writes:
>Well, yes and no. On a bus-connected machine like the Encore you have
>snoopy caches that keep the data consistent and KEEP IT CLOSE TO THE
>PROCESSOR. On something like a Butterfly, the default behaviour is that
>shared data is not cached; so yes, memory is consistent, but you have to go
>out across the network to access it. You can do explicit cache flushing in
>software, but that's a different game.

One can have hardware enforcement of cache coherence in a scalable
system; many proposed mechanisms exist in the literature for this.  I
think these systems are much too complex to "guess" at the best method,
but simulation results will tell the story, and vendors will be able to
make suitable hardware available in the next couple of years.  Coherent
cache systems are just too friendly not to build.

brooks@maddog.llnl.gov, brooks@maddog.uucp