[comp.arch] Shared Memory

laurel@super.ORG (Our Friends Up the Way) (11/01/88)

In article <13529@lll-winken.llnl.gov> brooks@maddog.UUCP (Eugene Brooks) writes:
>The Crays are not only faster, but they also have real honest to God shared
>memory.

I would not call Cray's version of shared memory real honest-to-God 
shared memory. There are many things it can't do that other systems can,
such as shared text and the like.

-- 
Michael Tighe
Supercomputer Research Center
ARPA: laurel@super.org

brooks@maddog.llnl.gov (Eugene Brooks) (11/02/88)

In article <850@super.ORG> laurel@super.UUCP (Michael Tighe) writes:
>I would not call Cray's version of shared memory, real honest to God 
>shared memory. There are many things it can't do, that other systems can,
>such as shared text and the like.
The Cray hardware supports separate text and data segments; whether
the text is "shared" or not is a feature of the operating system and
not the hardware.  It would be nice if there were more than one data
segment, so that private memory with statically allocated addresses
could be supported, provided that it did not make the machine go
slower.

ccc_ldo@waikato.ac.nz (02/07/90)

In <7695@pt.cs.cmu.edu>, lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
mentions, as a drawback of shared memory, the example where a commit log
for reliable transactions might be written to battery-backed memory
instead of to disk. The idea was that, by complicating the instruction
sequence necessary to do the write, you lessen the chance of a wild program
corrupting the log.

Lindsay says this works, but "only partially", as the wild program can
still corrupt the log by calling the subroutine that knows how to write to it.

It seems to me the same objection applies to an on-disk log.

How about if you don't have a subroutine to do the write, but only in-line
code? Of course, this would lead to even stranger behaviour if you
should accidentally branch to the code.

dhepner@hpisod2.HP.COM (Dan Hepner) (02/09/90)

From: ccc_ldo@waikato.ac.nz
>The idea was that, by complicating the instruction
>sequence necessary to do the write, you lessen the chance of a wild program
>corrupting the log.

>How about if you don't have a subroutine to do the write, but only in-line
>code? Of course, this would lead to even more strange behaviour if you
>should accidentally branch to the code.

The way to address this potential for corruption is with software design,
not through attempts to make the instruction sequences tougher to land in.

Reasonably designed software would almost certainly encounter a
failed assertion before actually corrupting the log.

ccc_ldo correctly notes the lack of fundamental difference between memory 
and disk for protecting logs from corruption due to software bugs.  Buggy 
software, or maybe even a perfect wild branch, could corrupt the log 
regardless of whether it's written to memory, disk, or stone.

Dan Hepner

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/12/90)

In article <81.25d07596@waikato.ac.nz> ccc_ldo@waikato.ac.nz writes:
>In <7695@pt.cs.cmu.edu>, lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
>mentions, as a drawback of shared memory, the example where a commit log
>for reliable transactions might be written to battery-backed memory
>instead of to disk. The idea was that, by complicating the instruction
>sequence necessary to do the write, you lessen the chance of a wild program
>corrupting the log.
>
>Lindsay says this works, but "only partially", as the wild program can
>still corrupt the log by calling the subroutine that knows how to write to 
>it. It seems to me the same objection applies to an on-disk log.

In article <13910015@hpisod2.HP.COM> dhepner@hpisod2.HP.COM 
	(Dan Hepner) writes:
>Reasonably designed software would almost certainly encounter a
>failed assertion before actually corrupting the log.

Yes, accessing disks is complicated, so that's not the difference.

I think that the disk is fundamentally more reliable, because it's
slow.  Let me explain by sketching a "stable storage" scheme.

Assume that a data record R fits in a disk sector, and that when you
read it back, there is enough checksumming (at whatever levels) so
that trashed data will be known as such. Dedicate a block of three
disk sectors: call them {A,B,C}. They will be read after reboots.

Write is done by filling a buffer with {R,0,R} and doing a
three-sector write to {A,B,C}.  The "failure model" of the disk is
that a crash may corrupt one sector that is being written, or two,
but not three.

Reading R involves reading both copies. There are several
possibilities. For example, A and C may both checksum fine, but be
different.  In that case, we assume that the crash occurred while
writing B, so crash recovery uses the record in A.
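
For concreteness, here is a minimal sketch of that scheme in C, under
some invented assumptions: 512-byte sectors, a toy additive checksum
standing in for the disk's own ECC, and an ordinary file descriptor
standing in for the dedicated sector block {A,B,C}.  The names
stable_write, stable_recover, and cksum are made up for illustration.

    /* Sketch of the three-sector stable-storage scheme described above.
     * Invented details: 512-byte sectors, a toy checksum in place of the
     * controller's ECC, and a plain fd standing in for sectors {A,B,C}. */
    #define _XOPEN_SOURCE 500
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define SECTOR 512

    struct rec {                      /* record R, padded to one sector   */
        uint32_t checksum;            /* detects a torn or garbled sector */
        char     payload[SECTOR - 4];
    };

    static uint32_t cksum(const struct rec *r)
    {
        uint32_t s = 0;
        for (size_t i = 0; i < sizeof r->payload; i++)
            s = s * 31 + (unsigned char)r->payload[i];
        return s;
    }

    /* Write {R, 0, R} to sectors {A, B, C} as one three-sector write. */
    int stable_write(int fd, struct rec *r)
    {
        unsigned char buf[3 * SECTOR];
        r->checksum = cksum(r);
        memset(buf, 0, sizeof buf);              /* sector B is all zeros */
        memcpy(buf,              r, SECTOR);     /* sector A = R          */
        memcpy(buf + 2 * SECTOR, r, SECTOR);     /* sector C = R          */
        return pwrite(fd, buf, sizeof buf, 0) == (ssize_t)sizeof buf ? 0 : -1;
    }

    /* After a reboot: read A and C and keep whichever copy survives.  If
     * both checksum correctly but differ, the crash is assumed to have
     * hit sector B mid-write, so recovery uses the record in A. */
    int stable_recover(int fd, struct rec *out)
    {
        struct rec a, c;
        if (pread(fd, &a, SECTOR, 0) != SECTOR ||
            pread(fd, &c, SECTOR, 2 * SECTOR) != SECTOR)
            return -1;
        if (cksum(&a) == a.checksum) { *out = a; return 0; }
        if (cksum(&c) == c.checksum) { *out = c; return 0; }
        return -1;                    /* all three sectors trashed: outside
                                         the stated failure model */
    }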

This works fairly well. But why would it work better than the special
memory? I think it's because it's *slow*. The chance that the system
will be insane for duration D surely goes down as D increases. 

This is bad news, because the whole point of having the memory was to
speed things up.

Of course, one way out of this bind is to claim that power failures
are the dominant field failure, and that a good power-fail trap can
make systems die in nanoseconds. Kind of an easy answer: I'd love to
hear a better way out.
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (10/04/90)

In article <1990Oct3.013509.1470@news.iastate.edu> carter@iastate.edu 
	(Carter Michael Brannon) writes:
>While I must agree
>with you that shared-memory systems are far from dinosaurs, I must disagree
>categorically that they are the only cost-effective way to build a successful
>massively-parallel computer.

It is clear that it is simplest to design a message-based ensemble
machine.  A lot of issues can be punted to software, and the hardware
that is designed pretty well has to be there in any case.

It is also clear that a shared-memory ensemble machine is desirable.
Sharing allows some nice programming techniques, and makes a
difference when trying to port a lot of software to the new machine.

So: the shared-memory machine is harder to design, but (broadly) more
valuable.  Now for the Big Question: is it necessarily more
expensive?

I don't think so, although that's easier to say than prove.

To begin with, adding function to an existing chip may be free, from
the manufacturing viewpoint.  This suggests that increasing a
design's complexity doesn't necessarily increase its price.

Next, at what level will we share?  If it's at the level of virtual
memory pages, then this is already being done, over LANs, by several
operating systems (e.g. Mach).  The only hardware facilities required
are 1) a decent MMU on each node, and 2) page-sized messages.
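
As a rough illustration of the MMU-does-the-work idea (invented code,
not Mach's or any other system's): map a region with no access
permissions, and service each fault by fetching the page and enabling
access.  Here fetch_remote_page() is a placeholder for the page-sized
message to the page's owner, and everything runs on one node for
brevity; a real system would also track ownership and invalidate
writers.

    /* Toy page-sharing sketch: PROT_NONE pages plus a SIGSEGV handler.
     * fetch_remote_page() stands in for a page-sized message to the
     * page's owner; error checks are omitted for brevity. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char  *region;
    static size_t region_len;
    static long   pagesz;

    static void fetch_remote_page(char *page)
    {
        memset(page, 'x', pagesz);       /* pretend the remote copy arrived */
    }

    static void fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1));
        if (page < region || page >= region + region_len)
            _exit(1);                    /* a genuinely wild reference */
        mprotect(page, pagesz, PROT_READ | PROT_WRITE);
        fetch_remote_page(page);         /* the faulting access then retries */
    }

    int main(void)
    {
        pagesz     = sysconf(_SC_PAGESIZE);
        region_len = 16 * (size_t)pagesz;
        region     = mmap(NULL, region_len, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = fault_handler;
        sa.sa_flags     = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        printf("page 3 starts with '%c'\n", region[3 * pagesz]);  /* one fault */
        return 0;
    }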

The problem with many current ensemble machines is their indecent
MMUs.  I submit that an MMU is valuable in any case, so page-sharing
makes no special hardware demands: zero, none.  Not that extra hardware
couldn't be useful, and the interconnect has to be fast enough that
the machine remains balanced under the new usage patterns.

If a machine is to have finer-grained sharing, then a number of
efficiency issues come up.  I am aware of a variety of approaches,
each with its own penalty or price.  I believe that eventually, one
of these will be implemented in a way that adds essentially nothing
to the machine's manufacturing cost.  


-- 
Don		D.C.Lindsay

misha@teenage-mutant.ai.mit.edu (Mike Bolotski) (10/04/90)

In article <10651@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>It is clear that it is simplest to design a message-based ensemble
>machine.  A lot of issues can be punted to software, and the hardware
>that is designed, pretty well has to be there in any case.
>
>It is also clear that a shared-memory ensemble machine is desirable.

This is not necessarily clear.  There are various penalties
associated with shared memories, including performance and cost.

>Sharing allows some nice programming techniques, and makes a
>difference when trying to port a lot of software to the new machine.

Granted, most parallel software was written under the shared memory
model, and if the main constraint is porting existing software, then
shared memory machines win. This same argument has been used in the
past against various innovations.

>Next, at what level will we share?  If it's at the level of virtual
>memory pages, then this is already being done, over LANs, by several
>operating systems (e.g. Mach).  The only hardware facilities required
>are 1) a decent MMU on each node, and 2) page-sized messages.

3) a tolerable interconnection network.

While sharing at the page level across a network works, increasing
the number of workstations to, say, 1000 will seriously impede
performance.

>If a machine is to have finer-grained sharing, then a number of
>efficiency issues come up.  I am aware of a variety of approaches,
>each with its own penalty or price.  I believe that eventually, one
>of these will be implemented in a way that adds essentially nothing
>to the machine's manufacturing cost.  

I disagree.  Shared memory automatically creates contention
for frequently used data.  Caching is a possible solution, but 
cache coherence protocols for fine-grained machines become incredibly
complex, and further impact performance.

Further, shared memory does not obey the natural laws of physics,
in a sense.  The abstract shared memory model pretends that there
is a large number of communication paths to a single point. This
simply isn't true, and additional hardware is required to provide
this illusion to the programmer. This hardware costs time and
dollars.

Parallel programs designed to reflect the underlying physical
reality will operate faster than those that work in a virtual
model. I suppose an IMHO is in order here.

Mike Bolotski          Artificial Intelligence Laboratory, MIT
misha@ai.mit.edu       Cambridge, MA

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (10/06/90)

In article <11181@life.ai.mit.edu> misha@teenage-mutant.ai.mit.edu 
	(Mike Bolotski) writes:
>While sharing at the page level across a network works, increasing
>the number of workstations to say, 1000, will seriously impede
>performance.

Well, yes.  However, I prefer to talk about the cost difference
between a large message-based MIMD machine, and a large shared-memory
MIMD machine.  I assume that the message-based machine is reasonably
well designed.  Hence, asking it to transport page-sized messages is
not unreasonable: at worst, the message traffic will be dissimilar to
the workload expected by the designers.

>>If a machine is to have finer-grained sharing, then a number of
>>efficiency issues come up.  I am aware of a variety of approaches,
>>each with its own penalty or price.  I believe that eventually, one
>>of these will be implemented in a way that adds essentially nothing
>>to the machine's manufacturing cost.

>I disagree.  Shared memory automatically creates contention
>for frequently used data.  Caching is a possible solution, but
>cache coherence protocols for fine grained machines become incredibly
>complex, and further impact performance.

As I pointed out, complexity does not necessarily make a machine more
expensive to build.  What costs money is things like extra pins and
wider connectors.  A cache coherence protocol mostly involves adding
control hardware that can send and receive messages.  We were already
assuming some sort of hardware support for messages.

(If the enlarged cache line tags require extra SRAM chips, then
that's also a cost.  Also, message hardware can be semi-adequate
without having MMU access, whereas the sharing hardware definitely
talks with the MMU.)
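
To make the "control hardware that sends and receives messages" view
concrete, here is a toy sketch, with invented message names and not
modeled on any real machine, of the directory side of a simple
MSI-style invalidation protocol: one entry per cache line, and each
processor request turns into a lookup, a few messages, and an update
of the entry.

    /* Directory entry for one cache line in a toy MSI-style protocol.
     * send() stands in for putting a message on the interconnect. */
    #include <stdio.h>

    #define NODES 8

    enum state { INVALID, SHARED, MODIFIED };

    struct dir_entry {
        enum state st;
        unsigned   sharers;          /* bitmap of nodes holding the line */
        int        owner;            /* valid when st == MODIFIED        */
    };

    static void send(const char *msg, int to)
    {
        printf("  -> %s to node %d\n", msg, to);
    }

    static void dir_read(struct dir_entry *e, int requester)
    {
        if (e->st == MODIFIED) {
            send("fetch/downgrade", e->owner);   /* get the latest copy back */
            e->sharers = 1u << e->owner;
            e->st = SHARED;
        }
        e->sharers |= 1u << requester;
        e->st = (e->st == INVALID) ? SHARED : e->st;
        send("data reply", requester);
    }

    static void dir_write(struct dir_entry *e, int requester)
    {
        for (int n = 0; n < NODES; n++)          /* invalidate other copies */
            if ((e->sharers >> n & 1u) && n != requester)
                send("invalidate", n);
        e->sharers = 1u << requester;
        e->owner   = requester;
        e->st      = MODIFIED;
        send("data reply (exclusive)", requester);
    }

    int main(void)
    {
        struct dir_entry line = { INVALID, 0, -1 };
        printf("node 2 reads:\n");  dir_read(&line, 2);
        printf("node 5 reads:\n");  dir_read(&line, 5);
        printf("node 5 writes:\n"); dir_write(&line, 5);
        printf("node 2 reads:\n");  dir_read(&line, 2);
        return 0;
    }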

>Further, shared memory does not obey the natural laws of physics,
>in a sense.  The abstract shared memory model pretends that there
>is a large number of communication paths to a single point. This
>simply isn't true, and additional hardware is required to provide
>this illusion to the programmer. This hardware costs time and
>dollars.

Message-based systems offer the same pretense.  Any decent message-
based MIMD machine will allow a message to go quickly from any node
to any other node.  Yes, there are nearest-neighbor designs out
there.  But they don't pretend to generality, and inevitably, library
software is written for forwarding messages.  The original NCUBE
machine was like this, but you will notice that the newer NCUBE-2 has
hardware support for forwarding.

>Parallel programs designed to reflect the underlying physical
>reality will operate faster than those that work in a virtual
>model. I suppose an IMHO is in order here.

IMHO you're right.  A parallel program, of either kind, ignores the
underlying truths at its peril.  The interconnect can have traffic
jams.  If 1023 nodes are all requesting decisions from a single node,
then everybody gets to do a lot of waiting.  And so on.
-- 
Don		D.C.Lindsay

priol@irisa.fr (Thierry Priol) (10/10/90)

A software user's point of view on using a Shared Virtual Memory (SVM)
on a Distributed Memory Parallel Computer (DMPC).

We have recently worked on implementing an SVM on an iPSC/2. Using
an SVM on a DMPC can improve the efficiency of some irregular algorithms
(ray tracing, LocusRoute, Metal, etc.). We have tested both a message-based
and a shared-memory approach on our iPSC/2, and the results are better
with the second approach!

Several recent efforts have aimed at designing hardware for
implementing an SVM efficiently (PLUS from CMU, ALEWIFE from MIT,
Kendall Square Research, and some European Esprit projects). These
parallel computers are scalable and provide shared memory. Cache
coherence is a difficult problem (not solved on the BBN machines, which
have no cache, or on the RP3, which I think keeps its caches coherent
in software rather than hardware). There are also ideas for managing
cache coherence by analyzing programs, which can minimize its cost
(the paper by A. Veidenbaum, or Cytron et al.).

Maybe these new architectures are the next generation of DMPCs?

What are your feelings about that?


NB: I apologize for my poor English.

-- 
Thierry PRIOL                                Phone:  99 36 20 00
IRISA / INRIA U.R. Rennes                    Fax:    99 38 38 32
Campus Universitaire de Beaulieu             Telex:  UNIRISA 950 473F
35042 RENNES CEDEX - FRANCE                  E-mail: priol@irisa.fr

david@tera.com (David Callahan) (10/10/90)

I've seen three basic issues raised in the discussion
of shared memory multiprocessors:

1. They are/are not scalable.

2. There is a performance advantage to having the programmer
explicitly manage data placement.

3. There is a hardware cost advantage in not implementing
shared-memory but a software cost disadvantage.

I'd like to discuss the first two; the third I believe is 
obvious, but the net cost advantage is unknown and subject to 
market forces and budget pressures. This discussion does not seem
limited to comp.arch. On comp.parallel there was an advance program
for a workshop on ``Unstructured Scientific Computation on 
Scalable Multiprocessors'' where it is clear that ``scalable'' is
being used as a code word for non-shared memory, and in the current
issue of CACM a sidebar to the article by Gul Agha flatly asserts
that shared-memory multiprocessors are not scalable, with a
reference to work by Dally.

What is shared-memory? An innocuous-looking question, but one which is
hard to answer since (given points 2 and 3) we are really interested
in the programming issue. At that level we might have this property:

SM1. Shared-memory implies that any thread of control can access any
word of memory without the explicit assistance of another thread.

This definition is meant to exclude traditional message-passing
programming models, which are the rule in production hypercube machines,
but not to exclude experiments with other language paradigms such as
Linda or Prolog, or the various attempts to compile shared-memory programs
onto message-passing systems. Clearly shared-memory could be implemented on
essentially any hardware by having the language implementation provide
server threads to service read and write requests. Or that capability
can be directly implemented in hardware --- which is the usual meaning
of shared memory in this group.

This definition does not capture the programming costs inherent when 
``remote'' memory is much more costly to access than ``close'' memory
and so we need an additional constraint:

SM2. Shared-memory implies that, with probability 1, the placement of
data does not influence performance by more than a factor of 2.

The clause ``with probability 1'' means that bank conflicts are
unlikely (hot spots can still be a problem). The factor of 2 is
obviously arbitrary, but 2 seems small enough that it's not worth
programming around.  This constraint was first described to me by
Burton Smith and was influenced by Leslie Valiant (everyone should
read his recent CACM article).  Hot spots in an algorithm are a
problem for non-shared memory systems as well and so they simply must
be programmed around.

What is scalability? This is another slippery term, made difficult
because of different assumptions about what is scaling. In particular,
if I double the number of processors I am willing to run a bigger
problem to get twice my current performance (no way around Amdahl's
Law) but I don't want to reprogram. (Actually, most programs won't
scale arbitrarily since new performance bottlenecks arise in the
algorithm unless it's designed with extreme care, but that is 
architecture independent.) This gives rise to the definitions:

S1. A system architecture is scalable if the deliverable performance
is a linear function of the number of processors.

S2. A shared-memory system is scalable if it satisfies S1 and
satisfies SM2 for any number of processors.

A basic question is: can you design a scalable, shared-memory
architecture? I'll answer this question by giving an example.  Start
with a packet-switched interconnection network. This network consists of
an NxNxN 3D mesh, toroidally connected in all dimensions.  Each
dimension is ``folded over'' so that there are no long wires (and so
satisfies S1).  Now uniformly embed in this network P=NxN processors
and P memory modules.  This basic model satisfies SM1, and we must
argue that it can also satisfy SM2 and therefore S2. Note that hypercubes
and meshes with dimension greater than 3 will not scale because of
long wires unless corners and edges are made visible to the
programmer.

We will assume that memory references are uniformly distributed; the
processors will hash virtual addresses to get physical
addresses (as in RP3 and Cydrome) and so this assumption is valid except
for hot spots. The average distance in the network is 
	3*(N/4) = 3*sqrt(P)/4
and we define the latency, L, of the network to be the round-trip
time through the network plus the memory chips. The network is larger than
the number of processors to provide adequate bisection bandwidth.
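
As a concrete, made-up instance: with P = 256 processors, N = sqrt(P)
= 16, so the network is a 16x16x16 torus of 4096 switching nodes, and
the average one-way distance is 3*(N/4) = 12 hops.  Quadrupling the
machine to P = 1024 gives N = 32 and an average distance of 24 hops;
distance, and hence latency, grows only as sqrt(P).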
	
If each processor only issues one memory operation and then waits for
it to complete, the system could not scale since the performance would
degrade as the latency increased. In fact the processor must be able
to maintain L memory operations in concurrent execution where L grows
with sqrt(P).

One way to do this is to have a vector processor with vector length L.
Unfortunately this vector length is exposed to the programmer and
restricts the general applicability of the system.  An alternative is
to use a data-flow style processor, which has the disadvantage that
registers are hard to use effectively (though variants are currently
being studied).

A compromise is to have each processor support L independent
instruction streams. Each stream has its own program counter and
register set but all streams share the same set of functional units
and the same memory port via low-level time multiplexing. These
processors are generally referred to as ``multi-threaded'' processors.
They have the advantage of supporting general parallelism but also have
a register set to support compile-time optimizations.
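
A toy simulation (all numbers invented) makes the point: give several
streams one shared issue slot, charge a fixed latency for every memory
operation, and utilization climbs as the number of streams approaches
the latency.

    /* Toy model of a multi-threaded processor: `streams` instruction
     * streams share one issue slot, and every issued instruction is a
     * memory operation with a fixed round-trip latency.  Utilization
     * approaches 100% once there is enough parallel slackness. */
    #include <stdio.h>

    #define LATENCY 16                  /* memory round trip, in cycles */

    static double utilization(int streams, int cycles)
    {
        int wait[64] = { 0 };           /* cycles until each stream is ready */
        int issued = 0;

        for (int t = 0; t < cycles; t++) {
            int fired = 0;
            for (int s = 0; s < streams; s++) {
                if (wait[s] > 0)
                    wait[s]--;          /* still waiting on its memory op */
                else if (!fired) {
                    fired = 1;          /* one issue slot per cycle */
                    issued++;
                    wait[s] = LATENCY;  /* the issued op is a memory reference */
                }
            }
        }
        return (double)issued / cycles;
    }

    int main(void)
    {
        for (int streams = 1; streams <= 32; streams *= 2)
            printf("%2d streams: %.0f%% of issue slots used\n",
                   streams, 100.0 * utilization(streams, 100000));
        return 0;
    }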

In all of these architectures, not all of the parallelism in a program
is translated into ``speedup''. Rather some of it is ``spent'' to
support shared memory (and also functional unit pipeline latencies).
This parallelism is what Valiant refers to as ``parallel
slackness''. 

For this architecture, if we double the number of processors, then the
latency increases by no more than sqrt(2) (about 1.4) and so the
number of streams in a running program should increase by twice this
in order to realize a doubling in delivered performance.  For a
problem of fixed size, you will eventually run out of parallelism and
the only way to improve performance is to reduce memory latency, but
this is merely Amdahl's Law.  To reduce memory latency for this
extreme case we can add local memory to each processor and expose its
presence to the programmer so he can leave the shared-memory world
when his problem is tight on parallelism and he absolutely must have
the extra performance. This machine is essentially the product being
developed by us at Tera Computer Co.
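
Continuing the made-up numbers from above: going from P = 256 to
P = 512 raises the average distance from 12 to about 17 hops, a factor
of sqrt(2), so each processor needs about 1.4 times as many streams in
flight, and the program as a whole, now spread over twice as many
processors, needs roughly 2*sqrt(2) (about 2.8) times as many streams
to double its delivered performance.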

We can make this system less effective at providing shared-memory by
reducing the size of the network to NxNxK where K < N but leaving the
number of processors at NxN. As K decreases, the system moves toward
a 2D mesh-connected system which does not have the bandwidth to
support shared-memory, and the programmer will have to use local memory
more aggressively to sustain the same level of performance.  Since the
network is a major cost in the system, we see that giving up shared
memory will reduce hardware cost.  Does it raise software costs by
more?

--
David Callahan  				       david@tera.com
Tera Computer Co. 400 North 34th Street, Suite 300  Seattle WA, 98103

grunwald@foobar.colorado.edu (Dirk Grunwald) (10/13/90)

I'm not terribly surprised that a program with a largely read-only
database, such as a ray tracer, does better under SVM/NVM/DVM.
Consider such a program on a disk system w/o enough memory. You can
either check residence on each object access, or you can just have the
MMU handle it. The 95% case is that the object is local.

Of course, this won't work so well if you naively transform a
program to a network-memory machine; likewise, porting a shared-memory
problem to, e.g., an XPC Encore or the new BBN machines will be
hard if you don't think about it.