[comp.arch] Is Shared Memory Necessary?

crowl@cs.rochester.edu (Lawrence Crowl) (05/10/88)

In article <674@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>... Everyone's favourite trick seems to be finding ever more complicated ways of
>getting large numbers of CPUs to talk to the memory all at once.  Just imagine
>an ever increasing number of waiters trying to get in and out of the same
>kitchen all at once through one door, and you can see the mess.  OK, let's
>increase the number of doors ... in hardware terms this means separating the
>memory into several pages which can be accessed simultaneously, thereby
>increasing the effective bandwidth of the memory.  Is this really admitting
>that shared memory is not necessary?  Surely the highest bandwidth is achieved
>when each processor has its own memory which it shares with no one else?  It
>also makes the hardware a lot smaller. ...  shared-memory is not necessary;
>it's a software issue that shouldn't be solved in hardware.

Yes, the highest bandwidth is achieved when each processor has exclusive
memory.  However, processes on different processors must still communicate with
each other.  Non-shared memory communication typically costs two orders of
magnitude more than shared memory communication.  What's worse, even when
processes are on the same processor, software engineering issues often require
that they communicate via the same slow inter-processor mechanism.  So the
fastest time to solve a problem may well lie with shared memory even though its
bandwidth is lower.  Communication is a software issue.  Making it cheap
requires hardware.  My conclusion is that shared-memory is not necessary, but
that it is worth its cost.
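
To make the cost difference concrete, here is a rough sketch (not any
particular machine's interface; msg_send() is a hypothetical primitive) of
the same one-word handoff done both ways.  The shared-memory version is a
store and a flag; the message-passing version adds buffering, a trap into
the OS, routing, and a copy at the far end, which is where the extra
orders of magnitude go.

#include <string.h>

/* Hypothetical message-passing primitive; a real one also traps into
 * the OS, routes the message, and copies it out on the far side. */
extern int msg_send(int dest_node, const void *buf, unsigned len);

/* Shared-memory handoff: both processes map the same segment.  (A real
 * version needs an atomic flag or a fence; this is only a sketch.) */
struct mailbox {
    volatile int full;          /* set by producer, cleared by consumer */
    int          payload;
};

void sm_send(struct mailbox *m, int value)
{
    while (m->full)             /* wait for the last value to be taken */
        ;
    m->payload = value;         /* one store ...                       */
    m->full = 1;                /* ... plus a flag update              */
}

/* Message-passing handoff: the data is copied at least twice and
 * crosses a protection boundary on the way. */
void mp_send(int dest_node, int value)
{
    char buf[sizeof value];
    memcpy(buf, &value, sizeof value);
    msg_send(dest_node, buf, sizeof value);
}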
-- 
  Lawrence Crowl		716-275-9499	University of Rochester
		      crowl@cs.rochester.edu	Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl	Rochester, New York,  14627

retrac@titan.rice.edu (John Carter) (05/10/88)

In article <9559@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>In article <674@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>>Surely the highest bandwidth is achieved
>>when each processor has its own memory which it shares with no one else?  It
>>also makes the hardware a lot smaller. ...  shared-memory is not necessary;
>>it's a software issue that shouldn't be solved in hardware.
>
>Yes, the highest bandwidth is achieved when each processor has exclusive
>memory.  However, processes on different processors must still communicate with
>each other.  Non-shared memory communication typically costs two orders of
>magnitude more than shared memory communication.  What's worse, even when
>processes are on the same processor, software engineering issues often require
>that they communicate via the same slow inter-processor mechanism.  So the
>fastest time to solve a problem may well lie with shared memory even though its
>bandwidth is lower.  Communication is a software issue.  Making it cheap
>requires hardware.  My conclusion is that shared-memory is not necessary, but
>that it is worth its cost.

I think that it's currently worth the cost, but I doubt seriously that it'll
be a realistic approach as we try to increase the number of processors beyond
several dozen.  A shared (hardware) memory that needed to handle thousands of
processors would be much slower than a conventional memory (how much slower
would depend on the actual architecture and what decisions you made about the
semantics of memory access).  The RP-3 project at IBM is doing some interest-
ing work on large shared memory architectures.  Their work aside, I think that
you should be able to get "optimal" performance by providing very fast
distributed memory communication, hardware support for IPC (interprocess[or]
communication), intelligent/clever communication protocols, and intelligent
compilers that can schedule the IPC so as to remove or reduce the delay
associated with waiting for remote memory access.  Kai Li (at Princeton,
I believe) has done some very good work on implementing shared memory
on top of a distributed memory machine/system (i.e., the architecture is
distributed memory, but the programmer's view is of shared memory).
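
For readers who haven't seen that work, the general flavor (this is only
my sketch of the idea, not Li's code; all the helper routines below are
hypothetical) is to run the virtual-memory hardware's page faults through
a handler that fetches the page from whichever node currently owns it:

#define PAGE_SIZE 4096

/* Hypothetical helpers for the sketch below. */
extern int  my_node_id(void);
extern int  current_owner_of(unsigned long page);     /* consult directory */
extern void set_owner(unsigned long page, int node);
extern void fetch_page_from(int node, unsigned long page, void *buf);
extern void map_page(unsigned long page, void *buf);  /* install locally   */

void page_fault_handler(void *faulting_addr)
{
    unsigned long page  = (unsigned long)faulting_addr / PAGE_SIZE;
    int           owner = current_owner_of(page);

    if (owner != my_node_id()) {
        char buf[PAGE_SIZE];
        fetch_page_from(owner, page, buf);   /* message over the network   */
        map_page(page, buf);                 /* now an ordinary local page */
        set_owner(page, my_node_id());
    }
    /* return; the faulting load or store is retried against local memory */
}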


John Carter               Internet: retrac@rice.edu
Dept of Computer Science  CSNET:    retrac@rice.edu
Rice University           UUCP:     {internet node or backbone}!rice!retrac
Houston, TX

brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (05/17/88)

In article <685@thalia.rice.edu> retrac@rice.edu (John Carter) writes:
>I think that it's currently worth the cost, but I doubt seriously that it'll
>be a realistic approach as we try to increase the number of processors beyond
>several dozen.  A shared (hardware) memory that needed to handle thousands of
Experience with the 30 CPU Sequent Symmetry system indicates that the
limit for bus-based systems with the right copy-back cache protocol
is beyond 30 processors.

Experience with the Cerberus multiprocessor simulator indicates that
a non-bus shared memory system can be very successfully used with
processor counts of several hundred and beyond without fancy fetch-and-op
support in the memory subsystem.  Give me a couple hundred Cray class
processors and I would be happy for at least a week. :-)

hankd@pur-ee.UUCP (Hank Dietz) (05/17/88)

In article <685@thalia.rice.edu>, retrac@titan.rice.edu (John Carter) writes:
> In article <9559@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes:
> >In article <674@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
> >>Surely the highest bandwidth is achieved
> >>when each processor has its own memory which it shares with no one else?  It
> >>also makes the hardware a lot smaller. ...  shared-memory is not necessary;
> >>it's a software issue that shouldn't be solved in hardware.
> >
> >Yes, the highest bandwidth is achieved when each processor has exclusive
> >memory.  However, processes on different processors must still communicate with
> >each other.  Non-shared memory communication typically costs two orders of
> >magnitude more than shared memory communication.  What's worse, even when
> >processes are on the same processor, software engineering issues often require
> >that they communicate via the same slow inter-processor mechanism.  So the
...
> several dozen.  A shared (hardware) memory that needed to handle thousands of
> processors would be much slower than a conventional memory (how much slower
> would depend on the actual architecture and what decisions you made about the
> semantics of memory access).  The RP-3 project at IBM is doing some interest-
> ing work on large shared memory architectures.  Their work aside, I think that
...
> associated with waiting for remote memory access.  Kai Li (at Princeton,
> I believe) has done some very good work on implementing shared memory
> on top of a distributed memory machine/system (i.e., the architecture is
> distributed memory, but the programmer's view is of shared memory).

The following is a list, in approximate chronological order, of very large
scale shared memory address space (but physically distributed memory) MIMD
machines which have been proposed, simulated, and/or built:

CHoPP		Columbia University HOmogeneous Parallel Processor; a machine
		with a LogN stage smart switch network between processors and
		memory modules.  Each switch contains a pre- and post-arrival
		cache called an RFM:  Repetition Filter Memory.  Multiple
		requests for (at least temporarily) read-only items are
		combined (serviced by the cache, rather than by memory) at
		the point in the net where their paths cross.

Denelcor HEP	Uses microtasking to hide shared memory latency.

NYU Ultra	Instead of RFMs, uses F&O (Fetch and Op) smart switches,
		which can combine semaphore operations in the net, but only
		if they actually collide in a particular switch node.  Does
		nothing about read-only objects; however, it also has a
		local memory for each processor.  F&O happens to be really
		good at doing associative reductions (a fetch-and-add sketch
		follows this list), but not all that good at combining
		semaphore ops because in a MIMD they rarely collide (rather,
		they usually cross paths).

BBN Butterfly	A "dumb" LogN stage network is used, but it is much faster
		than the 68K-family processors of the machine, so it works.

IBM RP3		Research Parallel Processing Prototype; originally to be the
		NYU machine "done right."  Memory is physically local to
		processors but globally addressable, with hardware support
		for address mapping.  Was to have two networks:  slow F&O
		and fast circuit switching...  but only the fast one was
		implemented due to cost constraints.

RFM-MIMD	Son-of-CHoPP.  This machine differs in that it uses RFM+
		switches, which can combine both read and write requests
		which cross paths, and it uses only a single stage net and
		memory modules are local to processor nodes, yet globally
		addressable.  Further, the processor nodes of the MIMD
		constitute VLIW machines tailored to hide occasionally large
		memory latencies.

Cedar		From University of Illinois...  a bunch of Alliants sharing
		a fast global store.  Uses a cluster structure for memory
		interconnection.  Not really as scalable as the others....

BBN Monarch	Next-generation of the Butterfly?

CARP Machine	Compiler-oriented Architecture Research at Purdue Machine;
		vaguely son-of-RFM-MIMD and son-of-PASM.  Lots of compiler
		driven "tricks" used to manage memory....

Horizon/Tera	Burton Smith's new machine (HEP being his old one).  Sort-of
		like RFM-MIMD, but with microtasking and with a mesh net.
		Details not yet available.
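
The fetch-and-add sketch promised under the NYU Ultra entry: N processors
self-schedule loop iterations through one shared counter.  fetch_and_add()
stands in for whatever primitive the hardware actually provides; on the
Ultracomputer-style machines the adds combine in the network rather than
serializing at the memory module.

extern int  fetch_and_add(int *addr, int delta);  /* hypothetical primitive */
extern void do_iteration(int i);

int next_iteration = 0;        /* shared by all processors */

void worker(int n_iterations)
{
    int i;

    /* Each call hands back a distinct old value of the counter, so every
     * iteration is executed exactly once, with no locks in software. */
    while ((i = fetch_and_add(&next_iteration, 1)) < n_iterations)
        do_iteration(i);
}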

#ifdef	FLAME
		Don't you guys read the earlier postings before posting your
		responses?  I posted something a couple of weeks ago that
		would have cleared-up most of the argument, but you guys
		just seem to want to argue, rather than to discuss.
#endif

     __         /|
  _ |  |  __   / |  Compiler-oriented
 /  |--| |  | |  |  Architecture
/   |  | |__| |_/   Researcher from
\__ |  | | \  |     Purdue
    \    |  \  \
	 \      \   Prof. Hank Dietz, Purdue EE

turner@uicsrd.csrd.uiuc.edu.UUCP (05/18/88)

/* Written 11:31 am  May 10, 1988 by retrac@titan.rice.edu in comp.arch */

...                     The RP-3 project at IBM is doing some interest-
ing work on large shared memory architectures.  Their work aside, I think that
you should be able to get "optimal" performance by providing very fast
distributed memory communication, hardware support for IPC (interprocess[or]
communication), intelligent/clever communication protocols, and intelligent
compilers that can schedule the IPC so as to remove or reduce the delay
associated with waiting for remote memory access. ...

John Carter               Internet: retrac@rice.edu
Dept of Computer Science  CSNET:    retrac@rice.edu
Rice University           UUCP:     {internet node or backbone}!rice!retrac
Houston, TX
/* End of text from uicsrd.csrd.uiuc.edu:comp.arch */

For "optimal" performance on *some* applications, local memories and
fast interprocessor communication may well be enough.  But for others
communication costs may well eliminate any speedup due to parallelism.

Why is it that everyone seems to assume that machines must either have
shared memory OR distributed memory, never a little of both?  From my
point of view I would like to see a machine with: fast local memory;
slower, but deeply pipelined, shared memory; and a thin global sync
bus (for barrier sync).  We have had memory hierarchies for decades
now, why should they cease to be useful now?
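
To be concrete about that last item: the sync bus would replace something
like the following software barrier (an illustrative sense-reversing
barrier; atomic_decrement() is a stand-in for whatever atomic primitive a
given machine offers), which otherwise turns the shared counter into a hot
spot as the processor count grows.

extern int atomic_decrement(volatile int *addr);   /* returns the new value */

struct barrier {
    volatile int count;     /* processors still to arrive this episode */
    volatile int sense;     /* flips each time the barrier opens       */
    int          nproc;
};

void barrier_wait(struct barrier *b)
{
    int my_sense = !b->sense;

    if (atomic_decrement(&b->count) == 0) {
        b->count = b->nproc;        /* last arrival re-arms the barrier ... */
        b->sense = my_sense;        /* ... and releases everyone else       */
    } else {
        while (b->sense != my_sense)
            ;                       /* spin until the barrier opens         */
    }
}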
---------------------------------------------------------------------------
 Steve Turner (on the Si prairie  - UIUC CSRD)

UUCP:	 {ihnp4,seismo,pur-ee,convex}!uiucdcs!uicsrd!turner
ARPANET: turner%uicsrd@a.cs.uiuc.edu           
CSNET:	 turner%uicsrd@uiuc.csnet            *-)    Mutants for
BITNET:	 turner@uicsrd.csrd.uiuc.edu                Nuclear Power!  (-%

lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) (05/19/88)

In article <8125@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) gives a list of
MIMD machines with large-scale shared memory address space but physically
distributed memory:
....
>CHoPP
>Denelcor HEP
>NYU Ultra
>BBN Butterfly
>IBM RP3
>RFM-MIMD
>Cedar
>BBN Monarch
>CARP Machine
>Horizon/Tera
....

Is the list restricted to existing machines? If not, then I would like to add
the "Fluent Machine" being projected here at Yale. Physically this consists
of processors with local memory connected in a butterfly pattern. These
local memories will, however, be seen as one shared memory by programs.  To
accomplish this, messages have to be routed in the network, and this is done
by something called "fluent routing", which is the essential new thing for
this machine. It will also support combining operations such as full
multiprefix.

If I am to give my personal opinion on very large scale global address space
machines, then it is that I think they must provide local, fixed routing
between close processors as an alternative addressing mode. The generality
of the global address scheme will always have to be paid for in longer access
times compared with local access.  Many algorithms, especially in fields such as
scientific computing, have a quite data-independent structure which means
that the communication pattern between the computation steps can be
determined at compile-time and mapped to the network of processors to
provide a high degree of locality.
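
An example of the kind of pattern I mean: a one-dimensional relaxation,
where each processor holds a slice of the grid and every step needs only
the boundary values of its two fixed neighbors.  The routing is known at
compile time.  exchange_with_neighbor() below is a hypothetical
fixed-routing primitive, not the Fluent Machine interface.

#define LOCAL_N 1024    /* grid points held by this processor */

/* Hypothetical primitive: send one value to the neighbor in direction
 * dir (-1 = left, +1 = right) and receive its boundary value back. */
extern void exchange_with_neighbor(int dir, const double *send, double *recv);

double u[LOCAL_N + 2];  /* local slice plus two ghost cells */

void relax_step(void)
{
    int i;

    exchange_with_neighbor(-1, &u[1],       &u[0]);            /* left edge  */
    exchange_with_neighbor(+1, &u[LOCAL_N], &u[LOCAL_N + 1]);   /* right edge */

    for (i = 1; i <= LOCAL_N; i++)          /* the sweep itself is all local */
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);
}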

Bjorn Lisper

kuehn@super.ORG (James T. Kuehn) (05/20/88)

In article <29577@yale-celray.yale.UUCP> lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) writes:
>
>If I am to give my personal opinion on very large scale global address space
>machines, then it is that I think they must provide local, fixed routing
>between close processors as an alternative addressing mode. The generality
>of the global address scheme will always have to be paid in long access time
>compared with local access.

Not necessarily.  The confusion seems to be that you are making
large global address space synonymous with long, fixed access time.  
Consider the Denelcor HEP, one of the examples on Hank's list.
It had a large global address space and a variable memory access time 
that depended on the distance from the processor (number of network hops).
If your compiler knows about memory hierarchies and can distinguish
"near" addresses from "far" ones (in terms of latencies), it can schedule
values accordingly.  The flat, shared address space makes life convenient
for the compiler, especially when it doesn't know where to schedule a value.
If in addition, you have a machine that hides latency by some technique
(as did HEP), you're home free.  In summary:

Reducing latency is always a win, but when you don't know how, hide it too!
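
For instance, with a split-phase (start/wait) interface to far memory,
either the compiler or the programmer can move the start well ahead of the
use.  remote_read_start()/remote_read_wait() below are hypothetical;
HEP-style hardware gets the same overlap by switching to another task
instead.

extern int    remote_read_start(const double *far_addr);  /* returns a tag */
extern double remote_read_wait(int tag);

double combine(const double *far_value, const double *local, int n)
{
    int    tag = remote_read_start(far_value);  /* far access in flight ... */
    double sum = 0.0;
    int    i;

    for (i = 0; i < n; i++)              /* ... overlapped with local work  */
        sum += local[i];

    return sum + remote_read_wait(tag);  /* block only when the value is
                                            actually needed */
}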

Jim Kuehn
kuehn%super.org@sh.cs.net

billo@cmx.npac.syr.edu (Bill O) (05/23/88)

In article <43700039@uicsrd.csrd.uiuc.edu> turner@uicsrd.csrd.uiuc.edu writes:

>Why is it that everyone seems to assume that machines must either have
>shared memory OR distributed memory, never a little of both?  From my
>point of view I would like to see a machine with: fast local memory;
>slower, but deeply pipelined, shared memory; and a thin global sync
>bus (for barrier sync).  We have had memory hierarchies for decades
>now, why should they cease to be useful now?

I know of one architecture that has both local memory and globally
shared memory -- the BBN Butterfly.  As I understand it (correct me if
I'm wrong), the processing nodes on the Butterfly each have memory
which is local in the sense that it can be accessed very quickly by
the local processor, but which is globally shared, in the sense that
all the other processors can access it too, via the big global switch.
Architecturally, each node "thinks" it has a very big globally shared
memory, but software can take advantage of fast local access by
arranging for each processor's most frequently accessed data to be in
that portion of memory which is truly local.

Hmm... I hope I've got that right.
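
In software terms, taking advantage of it amounts to choosing where things
get allocated, something like the sketch below (local_alloc() and
shared_alloc() are made-up names, not the actual Butterfly library calls):

/* Hypothetical allocators: everything is globally addressable either way,
 * but one of them puts the data behind this node's own processor. */
extern void *local_alloc(unsigned nbytes);   /* memory on my own node          */
extern void *shared_alloc(unsigned nbytes);  /* memory anywhere in the machine */

struct worker_state {
    double *scratch;   /* touched constantly, only by this processor */
    double *results;   /* written here, read later by everyone       */
};

void worker_init(struct worker_state *w, int n)
{
    w->scratch = local_alloc(n * sizeof(double));   /* fast local accesses  */
    w->results = shared_alloc(n * sizeof(double));  /* remote reads cost
                                                       more but still look
                                                       like ordinary loads  */
}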

Are there any other multi-processor architectures that have some mix of
both local and shared memory?

Bill O'Farrell, Northeast Parallel Architectures Center at Syracuse University
(billo@cmx.npac.syr.edu)

emiller@bbn.com (ethan miller) (05/23/88)

In article <501@cmx.npac.syr.edu> billo@cmx.npac.syr.edu (Bill O'Farrell) writes:
->In article <43700039@uicsrd.csrd.uiuc.edu> turner@uicsrd.csrd.uiuc.edu writes:
->
->>Why is it that everyone seems to assume that machines must either have
->>shared memory OR distributed memory, never a little of both?  From my
->>point of view I would like to see a machine with: fast local memory;
->>slower, but deeply pipelined, shared memory; and a thin global sync
->>bus (for barrier sync).  We have had memory hierarchies for decades
->>now, why should they cease to be useful now?
->
->I know of one architecture that has both local memory and globally
->shared memory -- the BBN Butterfly.  As I understand it (correct me if
->I'm wrong), the processing nodes on the Butterfly each have memory
->which is local in the sense that it can be accessed very quickly by
->the local processor, but which is globally shared, in the sense that
->all the other processors can access it too, via the big global switch.
->Architecturally, each node "thinks" it has a very big globally shared
->memory, but software can take advantage of fast local access by
->arranging for each processor's most frequently accessed data to be in
->that portion of memory which is truly local.
->
->Hmm... I hope I've got that right.
->
->Bill O'Farrell, Northeast Parallel Architectures Center at Syracuse University
->(billo@cmx.npac.syr.edu)

The description of the Butterfly is essentially correct.  One element it
lacks is a memory associated with no processor node.  The delay for
a single longword (32 bit) read/write to another processor is about 4 micro-
seconds, while it's very fast (~250ns (?)) to local memory.  A nice
addition, which I think "turner" is referring to, is an intermediate: a memory
accessible with the same delay to all nodes, along with local memory for
each node.  The Butterfly could do this with memory-only nodes, though
I don't know of any plans for those (I work for a different part of BBN).

Essentially, then, the Butterfly has a distributed-only memory model,
but to the programmer it may SEEM like the memory is globally shared.  Memory transfer
between nodes is handled by the PNC, which is a microcoded processor.
Global sharing is an illusion created by this processor.

ethan
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
ethan miller              | "Quod erat demonstrandum, baby."
BBN Laboratories          | "Oooh, you speak French!"
ARPAnet : emiller@bbn.com |
PHONEnet: (617) 873-3091  | Disclaimer: It's MY opinion, not BBN's.

rogers@ofc.Columbia.NCR.COM (H. L. Rogers) (05/23/88)

In article <501@cmx.npac.syr.edu> billo@cmx.npac.syr.edu (Bill O'Farrell) writes:
>In article <43700039@uicsrd.csrd.uiuc.edu> turner@uicsrd.csrd.uiuc.edu writes:
>
>Are there any other multi-processor architectures that have some mix of
>both local and shared memory?

Yes.  The NCR TOWER 32/800 has local memory for each processing element,
whether it be an application processor, file processor, terminal
processor, communications processor, or what-have-you processor.  It
provides fast local access and is shared globally through messaging
across Multibus II (I think I got that right.).

The original article implies that many people think shared
memory is necessary for speed.  Like most things in life, it depends on
what you want to do.  Shared memory is necessary for cooperating processes
which share the same data to enjoy high speed, but single-threaded
processes can't stand the overhead.  Intelligent customers will choose
the architecture that best fits their application environment.
-- 

------------
HL Rogers    (hl.rogers@ncrcae.Columbia.NCR.COM)

tpo@dde.uucp (Thomas P.S. Olesen) (05/31/88)

In article <43700039@uicsrd.csrd.uiuc.edu> turner@uicsrd.csrd.uiuc.edu writes:
>
>Why is it that everyone seems to assume that machines must either have
>shared memory OR distributed memory, never a little of both?  From my
>point of view I would like to see a machine with: fast local memory;
>slower, but deeply pipelined, shared memory; and a thin global sync
>bus (for barrier sync).  We have had memory hierarchies for decades
>now, why should they cease to be useful now?

The Supermax computer made by "my" company has both distributed and
shared memory.
The Supermax is a multiprocessor machine with up to 8 x 68020
application processors and up to 8 other 680x0/8085 processors for
handling I/O.

Every processor has local memory (up to 256Mb), but each can set up
its MMU to do data read/write/read-modify-write accesses in any other
CPU's memory over the common bus.  Instruction fetches from remote
memory, however, are NOT allowed.

This architecture makes it possible to run all processors at high
speed and have fast memory access AND have shared memory.

The OS is SVR3.1, and shared memory for user processes is fully
supported.  A memory access to another CPU's memory is a little slower
than a local access; I think a remote access takes about 4 times as
long as a local one.

Thomas P. Olesen

-- 
*****************************************************************************
Thomas P.S. Olesen                               Dansk Data Elektronik A/S
E-mail: ..!mcvax!diku!dde!tpo                    System Software Department
        or       tpo@dde.dk