[comp.arch] Shared Memory vs. Distributed Systems

pasek@ncrcce.StPaul.NCR.COM (Michael A. Pasek) (10/24/89)

In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes (in
response to article <36597@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene
Brooks)):
>Eugene, ever seen a transputer? Overhead for receiving or sending a 
>message is 19 cycles (630 ns for a 30 MHz part). The actual transfer
>is done by a dedicated DMA machine at a maximum rate of 1.7 Mbyte/s
              ^^^^^^^^^^^^^^^^^^^^^

Although you describe this as "message passing" between the computers,
it sounds to me like you have one "global" memory which is controlled by the
"dedicated DMA machine", and this memory is "shared" by all the "local"
processors.  What difference does it make whether you set some register in
your "dedicated DMA machine" and tell it to move some data to another 
location, or just set an address latch in your micro somewhere and do a
"store" instruction ?  

M. A. Pasek          Switching Software Development         NCR Comten, Inc.
(612) 638-7668              CNG Development               2700 N. Snelling Ave.
pasek@c10sd3.StPaul.NCR.COM                               Roseville, MN  55113

davidb@titan.inmos.co.uk (David Boreham) (10/26/89)

In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes (in
response to article <36597@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene
Brooks)):
>Eugene, ever seen a transputer? Overhead for receiving or sending a 
>message is 19 cycles (630 ns for a 30 MHz part). The actual transfer
>is done by a dedicated DMA machine at a maximum rate of 1.7 Mbyte/s

and then  in <1646@ncrcce.StPaul.NCR.COM>  pasek@c10sd3.StPaul.NCR.COM (M. A. Pasek) 
says some things which I don't really understand. 

Anyway, here is what a transputer link actually IS:


                              ------------
                             !            !
           ----------        !   DMA      !        ------------
          !  Memory  !       !   Engine   !       !             !
          ! on CPU A !=======!            !=======!  Serializer !--------
          !          !       !            !       !             !       !
           ----------        !            !        ------------         !
                             !            !                             !
                             !            !                             !
                              ------------                              ! Serial
                                                                        !  Link
                                                                        !
  ----------------------------------------------------------------      !
                                                                        !
                              ------------                              !
                             !            !                             !
           ----------        !   DMA      !        ------------         !
          !  Memory  !       !   Engine   !       !             !       !
          ! on CPU B !=======!            !=======!  Serializer !--------
          !          !       !            !       !             !
           ----------        !            !        ------------
                             !            !
                             !            !
                              ------------


Everything above the dotted line is on transputer A, everything below the line
is on transputer B. Of course there is also a CPU which fights for local memory
on a cycle-by-cycle basis. Each transputer has N copies of the link circuitry
(including 2N DMA engines). Currently 2 <= N <= 4 and the serial link runs at
20 Mbits/s.
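
From the CPU's side, starting a transfer costs only a few cycles because all
it does is point the link's DMA engine at a buffer. In C caricature (the real
thing is a single transputer instruction, the process is descheduled rather
than spinning, and all the names here are invented):

    /* Invented register names; a real link is driven by one instruction. */
    struct link_regs {
        void * volatile ptr;       /* buffer the DMA engine reads        */
        volatile unsigned cnt;     /* bytes left to serialize            */
        volatile unsigned go;      /* write 1 to start                   */
        volatile unsigned done;    /* set by hardware when cnt hits 0    */
    };

    void link_out(struct link_regs *link, const void *buf, unsigned len)
    {
        link->ptr = (void *) buf;  /* DMA engine fetches from memory...  */
        link->cnt = len;           /* ...and feeds the serializer        */
        link->go  = 1;
        while (!link->done)        /* a transputer deschedules the       */
            ;                      /* process here instead of spinning   */
    }

The quoted 19-cycle overhead is roughly the cost of those few stores plus
rescheduling; once started, the CPU and all the links proceed in parallel,
contending only for memory cycles.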



David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol,  England            |      (us): uunet!inmos-c!davidb
+44 454 616616 ex 543        | Internet : @col.hp.com:davidb@inmos-c

rik@cs.washington.edu (Rik Littlefield) (10/27/89)

In article <1646@ncrcce.StPaul.NCR.COM>, pasek@ncrcce.StPaul.NCR.COM 
(Michael A. Pasek) writes:  [about the transputer]

> Although you describe this as "message passing" between the computers,
> it sounds to me like you have one "global" memory which is controlled by the
> "dedicated DMA machine", and this memory is "shared" by all the "local"
> processors.  What difference does it make whether you set some register in
> your "dedicated DMA machine" and tell it to move some data to another 
> location, or just set an address latch in your micro somewhere and do a
> "store" instruction ?  
> 

It matters a great deal, both in raw performance and in how you program
the beasts.

On a transputer, 10 microseconds is what, 30 instructions?  Say you're in a
tiny network, averaging 3 hops each way, out and back.  That's 6 hops at
roughly 30 instructions apiece, 180 instructions of latency.  We can quibble
about the numbers, but the point is,
it's a long time.  What this means is that it's dangerous to think "when I
need this datum, I'll just go get it".  That programming model works OK on
the "shared memory" machines, which have much lower latency.  On a
transputer array, it gives out quickly as you try programs that do
progressively more sharing.  The same thing happens on any other existing
distributed memory machine.  Of course, as several posters have pointed out,
if each node has enough things to do, you may be able to hide the latency.
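
To make that concrete, here are the two styles side by side; remote_fetch,
remote_fetch_start, and remote_fetch_wait are hypothetical primitives, not
anybody's real library:

    /* Hypothetical primitives: a blocking remote read, and a split-phase
     * version that allows many reads to be in flight at once. */
    extern double remote_fetch(int node, int index);
    extern void   remote_fetch_start(int node, int index, double *dest);
    extern void   remote_fetch_wait(void);    /* all started reads done */

    /* "When I need this datum, I'll just go get it": pays the full
     * round-trip latency n times. */
    double sum_naive(int node, int n)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < n; i++)
            s += remote_fetch(node, i);
        return s;
    }

    /* Issue all n requests first, then wait once. */
    double sum_pipelined(int node, int n, double *buf)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < n; i++)
            remote_fetch_start(node, i, &buf[i]);
        remote_fetch_wait();
        for (i = 0; i < n; i++)
            s += buf[i];
        return s;
    }

With the numbers above, the first loop eats something like 180 instructions
of latency per element; the second pays it roughly once, provided the network
can keep that many requests in flight.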

Perhaps a more fundamental difference is that shared memory machines
provide hardware to guarantee that all processors have a consistent view of
memory.  Cooperation between processors can be controlled just by
synchronizing.  There's no need to explicitly update other processors'
copies of shared data.  With the distributed memory machines, no processor
will find out about an update until it asks or another one tells it.  The
programmer can be insulated from this necessity by the compiler and/or
runtime system, as with Kai Li's shared virtual memory (why isn't it
"virtual shared memory"?), but the performance penalty can be severe if
you're unlucky about the sharing patterns.
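
The contrast is easy to see in a two-process flag-passing fragment (msg_send,
msg_recv, and the node numbers are hypothetical message-passing primitives):

    extern void use(int);                /* stand-in for real work      */
    extern void msg_send(int node, void *buf, unsigned len);  /* hypothetical */
    extern void msg_recv(int node, void *buf, unsigned len);  /* hypothetical */
    #define PRODUCER_NODE 0
    #define CONSUMER_NODE 1

    /* Shared memory: the coherence hardware propagates the new x, so
     * synchronizing on the flag is all that's needed. */
    volatile int x, ready;
    void producer_shm(void) { x = 42; ready = 1; }
    void consumer_shm(void) { while (!ready) /* spin */ ; use(x); }

    /* Distributed memory: no one learns the new value until a message
     * carries it there explicitly. */
    void producer_msg(void) { int v = 42; msg_send(CONSUMER_NODE, &v, sizeof v); }
    void consumer_msg(void) { int v; msg_recv(PRODUCER_NODE, &v, sizeof v); use(v); }

(The shared-memory version assumes the machine is sequentially consistent,
which is exactly the guarantee the hardware is providing.)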

It looks to me like the distributed memory machines are pretty good at
running three kinds of programs.  First are those that just don't require
much write-sharing.  Second are those with communication patterns that can
be mapped to nearest-neighbor links, assuming that you're lucky enough to
have a low-overhead machine like the transputer.  Third are those in which
the latency can be hidden and overhead amortized by handling lots of
independent store/fetch requests at once.  (Combinations of the above are
best of all ;-)
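
The second kind, for what it's worth, usually looks like the relaxation
sketch below: each node holds a strip of the array plus a ghost cell at each
end, and a step exchanges only the boundary values with its two neighbours.
The link calls are hypothetical; think of link_send as merely starting one of
the transputer's autonomous link DMAs and returning, with link_recv blocking
until the data arrives.

    #define STRIP 256                 /* local points per node (example)   */
    #define LEFT  0                   /* link numbers, invented            */
    #define RIGHT 1

    double u[STRIP + 2];              /* u[0] and u[STRIP+1] are ghosts    */

    extern void link_send(int link, double *buf, unsigned len); /* starts DMA */
    extern void link_recv(int link, double *buf, unsigned len); /* blocks     */
    extern void link_drain(void);     /* wait for our own sends to finish  */

    void step(void)
    {
        static double unew[STRIP + 2];
        int i;

        /* All four transfers can be in flight at once on a transputer. */
        link_send(LEFT,  &u[1],     sizeof u[0]);
        link_send(RIGHT, &u[STRIP], sizeof u[0]);
        link_recv(LEFT,  &u[0],         sizeof u[0]);
        link_recv(RIGHT, &u[STRIP + 1], sizeof u[0]);

        for (i = 1; i <= STRIP; i++)  /* everything else is local          */
            unew[i] = 0.5 * (u[i-1] + u[i+1]);

        link_drain();                 /* don't overwrite u while a link    */
        for (i = 1; i <= STRIP; i++)  /* may still be reading it           */
            u[i] = unew[i];
    }

Communication per step stays constant while computation grows with STRIP,
which is why this class tolerates even mediocre links.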

Using that third approach, there seem to be quite a few applications that
can run well even with high communications overhead.  (See Fox's book,
"Solving Problems on Concurrent Processors".)  But lack of software support
is a big problem.  Programs like that can be written using explicit message
passing, but it's not easy.  Several people, including myself, are working
on compilation and runtime techniques to let you write programs using a
shared memory data model, but have them execute efficiently on a message
passer.  It's not an easy problem (Ph.D. thesis material), but there are
several promising approaches.  Stay tuned, but don't hold your breath.

--Rik

sc@klingon.pc.cs.cmu.edu (Siddhartha Chatterjee) (10/27/89)

In article <9605@june.cs.washington.edu> rik@cs.washington.edu (Rik Littlefield) writes:
>Perhaps a more fundamental difference is that shared memory machines
>provide hardware to guarantee that all processors have a consistent view of
>memory.  Cooperation between processors can be controlled just by

Well, yes and no.  On a bus-connected machine like the Encore you have
snoopy caches that keep the data consistent and KEEP IT CLOSE TO THE
PROCESSOR.  On something like a Butterfly, the default behaviour is that
shared data is not cached; so yes, memory is consistent, but you have to go
out across the network to access it.  You can do explicit cache flushing in
software, but that's a different game.
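
That game, roughly: copy the shared block into local memory, work on it
there, and write it back, doing in software what the snoopy caches do for
free. A sketch, with invented lock primitives rather than real Butterfly
calls:

    #include <string.h>

    #define N 1024
    extern double shared_block[N];   /* remote/global memory, uncached    */
    extern void   lock(void);        /* hypothetical mutual exclusion     */
    extern void   unlock(void);

    void scale_block(void)
    {
        static double local[N];      /* private, close to the processor   */
        int i;

        lock();
        memcpy(local, shared_block, sizeof local);  /* one bulk fetch     */
        unlock();

        for (i = 0; i < N; i++)      /* all references are now local      */
            local[i] *= 2.0;

        lock();                      /* explicit "flush": write it back   */
        memcpy(shared_block, local, sizeof local);
        unlock();
    }

    /* Assumes this node logically owns the block between the two copies;
     * a real program needs a protocol to guarantee that. */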

-- 
----
ARPA:	Siddhartha.Chatterjee@CS.CMU.EDU
USPS:	School of Computer Science, CMU, Pittsburgh, PA 15213
----

brooks@vette.llnl.gov (Eugene Brooks) (10/27/89)

In article <6708@pt.cs.cmu.edu> sc@klingon.pc.cs.cmu.edu (Siddhartha Chatterjee) writes:
>Well, yes and no.  On a bus-connected machine like the Encore you have
>snoopy caches that keep the data consistent and KEEP IT CLOSE TO THE
>PROCESSOR.  On something like a Butterfly, the default behaviour is that
>shared data is not cached; so yes, memory is consistent, but you have to go
>out across the network to access it.  You can do explicit cache flushing in
>software, but that's a different game.
One can have hardware-enforced cache coherence in a scalable system; many
proposed mechanisms for this exist in the literature.  I think these systems
are much too complex to "guess" at the best method, but simulation results
will tell the story, and vendors will be able to make suitable hardware
available in the next couple of years.  Coherent cache systems are just too
friendly not to build.
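
For the flavor of it, a full-map directory scheme in the style of Censier
and Feautrier reduces to a small state machine per memory block. The sketch
below is illustrative C invented for this post, not any proposed machine's
actual design:

    #define NCPU 32

    /* Per-block directory entry: which caches hold a copy, and whether
     * one of them holds it dirty. */
    struct dir_entry {
        unsigned long sharers;          /* bit i set: CPU i has a copy  */
        int dirty;                      /* one sharer holds it modified */
    };

    extern void invalidate(int cpu, int block);  /* hypothetical network ops */
    extern void fetch_back(int cpu, int block);  /* retrieve a dirty copy    */

    static int sole_owner(unsigned long bits)    /* index of the one set bit */
    {
        int i = 0;
        while (!(bits & 1UL)) { bits >>= 1; i++; }
        return i;
    }

    /* CPU cpu misses on a read of block. */
    void dir_read(struct dir_entry *d, int cpu, int block)
    {
        if (d->dirty) {
            fetch_back(sole_owner(d->sharers), block); /* freshen memory */
            d->dirty = 0;
        }
        d->sharers |= 1UL << cpu;                      /* add the reader */
    }

    /* CPU cpu wants to write block. */
    void dir_write(struct dir_entry *d, int cpu, int block)
    {
        int i;
        for (i = 0; i < NCPU; i++)                     /* kill other copies */
            if ((d->sharers & ~(1UL << cpu)) & (1UL << i))
                invalidate(i, block);
        d->sharers = 1UL << cpu;                       /* sole, dirty owner */
        d->dirty = 1;
    }

The hard part is everything this leaves out: the transient states while
invalidations are in flight, races between writers, and the directory
storage itself. That is what the simulations have to sort out.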


brooks@maddog.llnl.gov, brooks@maddog.uucp