laurel@super.ORG (Our Friends Up the Way) (11/01/88)
In article <13529@lll-winken.llnl.gov> brooks@maddog.UUCP (Eugene Brooks) writes:
>The Crays are not only faster, but they also have real honest to God shared
>memory.

I would not call Cray's version of shared memory, real honest to God shared
memory. There are many things it can't do, that other systems can, such as
shared text and the like.
--
Michael Tighe
Supercomputer Research Center
ARPA: laurel@super.org
brooks@maddog.llnl.gov (Eugene Brooks) (11/02/88)
In article <850@super.ORG> laurel@super.UUCP (Michael Tighe) writes:
>I would not call Cray's version of shared memory, real honest to God
>shared memory. There are many things it can't do, that other systems can,
>such as shared text and the like.

The Cray hardware supports separate text and data segments; whether the text
is "shared" or not is a feature of the operating system and not the hardware.
It would be nice if there were more than one data segment, so that private
memory with statically allocated addresses could be supported, provided that
it did not make the machine go slower.
ccc_ldo@waikato.ac.nz (02/07/90)
In <7695@pt.cs.cmu.edu>, lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
mentions, as a drawback of shared memory, the example where a commit log for
reliable transactions might be written to battery-backed memory instead of to
disk. The idea was that, by complicating the instruction sequence necessary
to do the write, you lessen the chance of a wild program corrupting the log.

Lindsay says this works, but "only partially", as the wild program can still
corrupt the log by calling the subroutine that knows how to write to it. It
seems to me the same objection applies to an on-disk log.

How about if you don't have a subroutine to do the write, but only in-line
code? Of course, this would lead to even more strange behaviour if you should
accidentally branch to the code.
dhepner@hpisod2.HP.COM (Dan Hepner) (02/09/90)
From: ccc_ldo@waikato.ac.nz
>The idea was that, by complicating the instruction
>sequence necessary to do the write, you lessen the chance of a wild program
>corrupting the log.
>How about if you don't have a subroutine to do the write, but only in-line
>code? Of course, this would lead to even more strange behaviour if you
>should accidentally branch to the code.

The way to address this potential is with software design, not through
attempts to make the instruction sequences tougher to land in. Reasonably
designed software would almost certainly encounter a failed assertion before
actually corrupting the log.

ccc_ldo correctly notes the lack of fundamental difference between memory and
disk for protecting logs from corruption due to software bugs. Buggy
software, or maybe even a perfect wild branch, could corrupt the log
regardless of whether it's written to memory, disk, or stone.

Dan Hepner
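To make the point about failed assertions concrete, here is a minimal sketch
(an illustration only, not anything from Hepner's systems) of a log-append
routine that validates both its arguments and the log's own header before
touching the battery-backed region. The structure layout, magic value, and
checksum are invented for the example.

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    #define LOG_MAGIC    0x4C4F4731u   /* hypothetical "this is a log" marker */
    #define LOG_CAPACITY 4096          /* hypothetical size of the record area */

    struct log_header {
        uint32_t magic;      /* must equal LOG_MAGIC, or the log is suspect     */
        uint32_t next_off;   /* offset of the next free byte in the record area */
        uint32_t checksum;   /* header checksum, re-verified before every write */
    };

    static uint32_t header_checksum(const struct log_header *h)
    {
        /* toy checksum over the other two fields */
        return h->magic ^ h->next_off ^ 0xDEADBEEFu;
    }

    /* Append a record to the battery-backed log region.  A wild caller is
     * likely to trip one of these assertions before the log is damaged. */
    void log_append(struct log_header *h, unsigned char *area,
                    const void *rec, uint32_t len)
    {
        assert(h != NULL && area != NULL && rec != NULL);
        assert(h->magic == LOG_MAGIC);               /* region looks like a log */
        assert(h->checksum == header_checksum(h));   /* header not scribbled on */
        assert(len > 0 && len <= LOG_CAPACITY);
        assert(h->next_off <= LOG_CAPACITY - len);   /* record fits             */

        memcpy(area + h->next_off, rec, len);
        h->next_off += len;
        h->checksum  = header_checksum(h);
    }

A branch into the middle of this routine with garbage registers would, with
high probability, fail one of the checks rather than overwrite a valid record.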
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/12/90)
In article <81.25d07596@waikato.ac.nz> ccc_ldo@waikato.ac.nz writes:
>In <7695@pt.cs.cmu.edu>, lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
>mentions, as a drawback of shared memory, the example where a commit log
>for reliable transactions might be written to battery-backed memory
>instead of to disk. The idea was that, by complicating the instruction
>sequence necessary to do the write, you lessen the chance of a wild program
>corrupting the log.
>
>Lindsay says this works, but "only partially", as the wild program can
>still corrupt the log by calling the subroutine that knows how to write to
>it. It seems to me the same objection applies to an on-disk log.

In article <13910015@hpisod2.HP.COM> dhepner@hpisod2.HP.COM (Dan Hepner) writes:
>Reasonably designed software would almost certainly encounter a
>failed assertion before actually corrupting the log.

Yes, accessing disks is complicated, so that's not the difference. I think
that the disk is fundamentally more reliable, because it's slow. Let me
explain by sketching a "stable storage" scheme.

Assume that a data record R fits in a disk sector, and that when you read it
back, there is enough checksumming (at whatever levels) so that trashed data
will be known as such. Dedicate a block of three disk sectors: call them
{A,B,C}. They will be read after reboots.

Write is done by filling a buffer with {R,0,R} and doing a three-sector write
to {A,B,C}. The "failure model" of the disk is that a crash may corrupt one
sector that is being written, or two, but not three.

Reading R involves reading both copies. There are several possibilities. For
example, A and C may both checksum fine, but be different. In that case, we
assume that the crash occurred while writing B, so crash recovery uses the
record in A.

This works fairly well. But why would it work better than the special memory?
I think it's because it's *slow*. The chance that the system will be insane
for duration D surely goes down as D increases. This is bad news, because the
whole point of having the memory was to speed things up.

Of course, one way out of this bind is to claim that power failures are the
dominant field failure, and that a good power-fail trap can make systems die
in nanoseconds. Kind of an easy answer: I'd love to hear a better way out.
--
Don    D.C.Lindsay    Carnegie Mellon Computer Science
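Purely as an illustration of the scheme just described, here is one way the
write and crash-recovery steps might look in code. The sector size and the
read_sector/write_sectors primitives are assumed stand-ins for whatever the
real driver provides; only the {R,0,R} layout and the recovery cases follow
the post.

    #include <stdbool.h>
    #include <string.h>

    #define SECTOR 512                    /* assumed sector size */

    /* Hypothetical device primitives -- stand-ins for the real driver calls. */
    extern void write_sectors(int first, const void *buf, int count);
    extern bool read_sector(int n, void *buf);   /* false if checksum failed */

    enum { SEC_A = 0, SEC_B = 1, SEC_C = 2 };    /* the dedicated block {A,B,C} */

    /* Write record R as {R, 0, R} in one three-sector transfer. */
    void stable_write(const void *rec)
    {
        unsigned char buf[3 * SECTOR];
        memcpy(buf,              rec, SECTOR);   /* A: copy of R */
        memset(buf + SECTOR,     0,   SECTOR);   /* B: zero fill */
        memcpy(buf + 2 * SECTOR, rec, SECTOR);   /* C: copy of R */
        write_sectors(SEC_A, buf, 3);
    }

    /* Recover R after a reboot.  Failure model: a crash may corrupt one or
     * two sectors of the write in progress, never all three.  Returns false
     * only if that model is violated. */
    bool stable_read(void *rec)
    {
        unsigned char a[SECTOR], c[SECTOR];
        bool a_ok = read_sector(SEC_A, a);
        bool c_ok = read_sector(SEC_C, c);

        if (a_ok) {
            /* If A and C both check but differ, the crash hit B: A is newest. */
            memcpy(rec, a, SECTOR);
            return true;
        }
        if (c_ok) {
            /* A was being written when the crash came; C still holds old R. */
            memcpy(rec, c, SECTOR);
            return true;
        }
        return false;   /* both copies trashed: outside the failure model */
    }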
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (10/04/90)
In article <1990Oct3.013509.1470@news.iastate.edu> carter@iastate.edu (Carter Michael Brannon) writes:
>While I must agree
>with you that shared-memory systems are far from dinosaurs, I must disagree
>categorically that they are the only cost-effective way to build a successful
>massively-parallel computer.

It is clear that it is simplest to design a message-based ensemble machine.
A lot of issues can be punted to software, and the hardware that is designed,
pretty well has to be there in any case.

It is also clear that a shared-memory ensemble machine is desirable. Sharing
allows some nice programming techniques, and makes a difference when trying
to port a lot of software to the new machine.

So: the shared-memory machine is harder to design, but (broadly) more
valuable. Now for the Big Question: is it necessarily more expensive? I don't
think so, although that's easier to say than prove.

To begin with, adding function to an existing chip may be free, from the
manufacturing viewpoint. This idea hints to us that increasing a design's
complexity doesn't necessarily increase its price.

Next, at what level will we share? If it's at the level of virtual memory
pages, then this is already being done, over LANs, by several operating
systems (e.g. Mach). The only hardware facilities required are 1) a decent
MMU on each node, and 2) page-sized messages. The problem with many current
ensemble machines is their indecent MMUs. I submit that an MMU is valuable in
any case, so page-sharing makes no special hardware demands: zero, none. Not
that hardware couldn't be useful: and the interconnect has to be fast enough
so that the machine remains balanced under the new usage patterns.

If a machine is to have finer-grained sharing, then a number of efficiency
issues come up. I am aware of a variety of approaches, each with its own
penalty or price. I believe that eventually, one of these will be implemented
in a way that adds essentially nothing to the machine's manufacturing cost.
--
Don    D.C.Lindsay
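For a rough idea of what page-level sharing built from "a decent MMU and
page-sized messages" can look like, here is a user-level sketch in the style
of later software distributed-shared-memory systems, not anything taken from
Mach itself. The shared region is mapped with no access, the fault handler
grants access and fills the page from its owner, and fetch_page_from_owner
stands in for a hypothetical message layer; real systems add ownership,
invalidation, and consistency machinery (and signal-safety care) that this
omits.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static size_t page_size;
    static char  *shared_base;            /* start of the shared region */
    static size_t shared_len;

    /* Hypothetical: ask the owning node to send one page over the network. */
    extern void fetch_page_from_owner(void *page_addr, size_t len);

    static void dsm_fault(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        char *addr = (char *)((uintptr_t)si->si_addr & ~(page_size - 1));

        if (addr < shared_base || addr >= shared_base + shared_len)
            abort();                      /* a genuine wild reference */

        /* Make the page accessible locally, then fill it from its owner;
         * returning from the handler retries the faulting instruction. */
        mprotect(addr, page_size, PROT_READ | PROT_WRITE);
        fetch_page_from_owner(addr, page_size);
    }

    void dsm_init(size_t bytes)
    {
        page_size  = (size_t)sysconf(_SC_PAGESIZE);
        shared_len = bytes;

        /* Reserve the region with no access: every first touch faults. */
        shared_base = mmap(NULL, bytes, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa = {0};
        sa.sa_sigaction = dsm_fault;
        sa.sa_flags     = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
    }

The point of the sketch is that nothing here is exotic hardware: an ordinary
MMU plus a way to ship a page-sized message is enough to get started.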
misha@teenage-mutant.ai.mit.edu (Mike Bolotski) (10/04/90)
In article <10651@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>It is clear that it is simplest to design a message-based ensemble
>machine. A lot of issues can be punted to software, and the hardware
>that is designed, pretty well has to be there in any case.
>
>It is also clear that a shared-memory ensemble machine is desirable.

This is not necessarily clear. There are various penalties associated with
shared memories, including performance and cost.

>Sharing allows some nice programming techniques, and makes a
>difference when trying to port a lot of software to the new machine.

Granted, most parallel software was written under the shared memory model,
and if the main constraint is porting existing software, then shared memory
machines win. This argument was used in the past on various innovations.

>Next, at what level will we share? If it's at the level of virtual
>memory pages, then this is already being done, over LANs, by several
>operating systems (e.g. Mach). The only hardware facilities required
>are 1) a decent MMU on each node, and 2) page-sized messages.

3) a tolerable interconnection network. While sharing at the page level
across a network works, increasing the number of workstations to, say, 1000
will seriously impede performance.

>If a machine is to have finer-grained sharing, then a number of
>efficiency issues come up. I am aware of a variety of approaches,
>each with its own penalty or price. I believe that eventually, one
>of these will be implemented in a way that adds essentially nothing
>to the machine's manufacturing cost.

I disagree. Shared memory automatically creates contention for frequently
used data. Caching is a possible solution, but cache coherence protocols for
fine-grained machines become incredibly complex, and further impact
performance.

Further, shared memory does not obey the natural laws of physics, in a sense.
The abstract shared memory model pretends that there is a large number of
communication paths to a single point. This simply isn't true, and additional
hardware is required to provide this illusion to the programmer. This
hardware costs time and dollars.

Parallel programs designed to reflect the underlying physical reality will
operate faster than those that work in a virtual model. I suppose an IMHO is
in order here.

Mike Bolotski            Artificial Intelligence Laboratory, MIT
misha@ai.mit.edu         Cambridge, MA
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (10/06/90)
In article <11181@life.ai.mit.edu> misha@teenage-mutant.ai.mit.edu (Mike Bolotski) writes:
>While sharing at the page level across a network works, increasing
>the number of workstations to say, 1000, will seriously impede
>performance.

Well, yes. However, I prefer to talk about the cost difference between a
large message-based MIMD machine, and a large shared-memory MIMD machine. I
assume that the message-based machine is reasonably well designed. Hence,
asking it to transport page-sized messages is not unreasonable: at worst, the
message traffic will be dissimilar to the workload expected by the designers.

>>If a machine is to have finer-grained sharing, then a number of
>>efficiency issues come up. I am aware of a variety of approaches,
>>each with its own penalty or price. I believe that eventually, one
>>of these will be implemented in a way that adds essentially nothing
>>to the machine's manufacturing cost.
>I disagree. Shared memory automatically creates contention
>for frequently used data. Caching is a possible solution, but
>cache coherence protocols for fine grained machines become incredibly
>complex, and further impact performance.

As I pointed out, complexity does not necessarily make a machine more
expensive to build. What does cost is things like extra pins and wider
connectors. A cache coherence protocol mostly involves adding control
hardware that can send and receive messages. We were already assuming some
sort of hardware support for messages. (If the enlarged cache line tags
require extra SRAM chips, then that's also a cost. Also, message hardware can
be semi-adequate without having MMU access, whereas the sharing hardware
definitely talks with the MMU.)

>Further, shared memory does not obey the natural laws of physics,
>in a sense. The abstract shared memory model pretends that there
>is a large number of communication paths to a single point. This
>simply isn't true, and additional hardware is required to provide
>this illusion to the programmer. This hardware costs time and
>dollars.

Message-based systems offer the same pretense. Any decent message-based MIMD
machine will allow a message to go quickly from any node to any other node.
Yes, there are nearest-neighbor designs out there. But they don't pretend to
generality, and inevitably, library software is written for forwarding
messages. The original NCUBE machine was like this, but you will notice that
the newer NCUBE-2 has hardware support for forwarding.

>Parallel programs designed to reflect the underlying physical
>reality will operate faster than those that work in a virtual
>model. I suppose an IMHO is in order here.

IMHO you're right. A parallel program, of either kind, ignores the underlying
truths at its peril. The interconnect can have traffic jams. If 1023 nodes
are all requesting decisions from a single node, then everybody gets to do a
lot of waiting. And so on.
--
Don    D.C.Lindsay
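As a toy illustration of coherence control as "hardware that can send and
receive messages", here is a directory entry for one cache line with read and
write request handlers. The states, message names, and send_msg primitive are
invented for the sketch, and acknowledgements and transient states (the part
that makes real protocols hard) are deliberately left out.

    #include <stdint.h>

    #define MAX_NODES 64

    enum line_state { UNCACHED, SHARED, EXCLUSIVE };

    struct dir_entry {
        enum line_state state;
        uint64_t        sharers;   /* bitmap: which nodes hold a copy */
        int             owner;     /* valid only in EXCLUSIVE state   */
    };

    /* Hypothetical interconnect primitive. */
    extern void send_msg(int dest_node, const char *msg_type, uint64_t line);

    /* A node asks to read a line. */
    void dir_read_request(struct dir_entry *e, uint64_t line, int requester)
    {
        if (e->state == EXCLUSIVE) {
            /* Current owner must write back before we hand out a copy. */
            send_msg(e->owner, "WRITEBACK_REQUEST", line);
            e->sharers = (1ull << e->owner);
        }
        e->state    = SHARED;
        e->sharers |= (1ull << requester);
        send_msg(requester, "DATA_SHARED", line);
    }

    /* A node asks for exclusive (write) access to a line. */
    void dir_write_request(struct dir_entry *e, uint64_t line, int requester)
    {
        for (int n = 0; n < MAX_NODES; n++)       /* invalidate other copies */
            if (((e->sharers >> n) & 1) && n != requester)
                send_msg(n, "INVALIDATE", line);

        if (e->state == EXCLUSIVE && e->owner != requester)
            send_msg(e->owner, "WRITEBACK_REQUEST", line);

        e->state   = EXCLUSIVE;
        e->owner   = requester;
        e->sharers = (1ull << requester);
        send_msg(requester, "DATA_EXCLUSIVE", line);
    }

The control logic is small; the costs are the directory storage (the wider
tags mentioned above) and the message traffic it generates.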
priol@irisa.fr (Thierry Priol) (10/10/90)
A software user's point of view on using a Shared Virtual Memory on a
Distributed Memory Parallel Computer.

We have recently worked on implementing an SVM on an iPSC/2. Using an SVM on
a DMPC can improve the efficiency of some irregular algorithms (ray-tracing,
LocusRoute, Metal, etc.). We have tested both a message-based and a
shared-memory approach on our iPSC/2. Results are better with the second
approach!

Several projects have recently been undertaken to design hardware for
implementing an SVM efficiently (PLUS from CMU, ALEWIFE from MIT, Kendall
Square Research, and some European Esprit projects). These parallel computers
are scalable and provide a shared memory. Cache coherence is a difficult
problem: it is not solved on the BBN (no cache), nor on the RP3 (caches, but
coherence in software only, I think). There are some ideas for managing cache
coherence by analyzing programs, which would minimize its cost (the
A. Veidenbaum paper, or Cytron et al.). Maybe these new architectures are the
next generation of DMPC? What is your feeling about that?

NB: I apologize for my poor English.
--
Thierry PRIOL                          Phone: 99 36 20 00
IRISA / INRIA U.R. Rennes              Fax: 99 38 38 32
Campus Universitaire de Beaulieu       Telex: UNIRISA 950 473F
35042 RENNES CEDEX - FRANCE            E-mail: priol@irisa.fr
david@tera.com (David Callahan) (10/10/90)
I've seen three basic issues raised in the discussion of shared memory
multiprocessors:

1. They are/are not scalable.
2. There is a performance advantage to having the programmer explicitly
   manage data placement.
3. There is a hardware cost advantage in not implementing shared-memory,
   but a software cost disadvantage.

I'd like to discuss the first two; the third I believe is obvious, but the
net cost advantage is unknown and subject to market forces and budget
pressures.

This discussion does not seem limited to comp.arch. On comp.parallel there
was an advance program for a workshop on ``Unstructured Scientific
Computation on Scalable Multiprocessors'' where it is clear that ``scalable''
is being used as a code word for non-shared memory, and in the current issue
of CACM, in a sidebar to the article by Gul Agha, there is a simple assertion
that shared-memory multiprocessors are not scalable, with a reference to work
by Dally.

What is shared-memory? An innocuous-looking question, but one which is hard
to answer since (given points 2 and 3) we are really interested in the
programming issue. At that level we might have this property:

 SM1. Shared-memory implies that any thread of control can access any word
      of memory without the explicit assistance of another thread.

This definition is meant to exclude traditional message-passing programming
models, which are the rule in production hypercube machines, but not to
exclude experiments with other language paradigms such as Linda or Prolog, or
the various attempts to compile shared memory programs onto message-passing
systems. Clearly shared-memory could be implemented on essentially any
hardware by having the language implementation provide server threads to
service read and write requests. Or that capability can be directly
implemented in hardware --- which is the usual meaning of shared memory in
this group.

This definition does not capture the programming costs inherent when
``remote'' memory is much more costly to access than ``close'' memory, and so
we need an additional constraint:

 SM2. Shared-memory implies that, with probability 1, the placement of data
      does not influence performance by more than a factor of 2.

The clause ``with probability 1'' means that bank conflicts are unlikely (hot
spots can still be a problem). The factor of 2 is obviously arbitrary, but 2
seems small enough that it's not worth programming around. This constraint
was first described to me by Burton Smith and was influenced by Leslie
Valiant (everyone should read his recent CACM article). Hot spots in an
algorithm are a problem for non-shared memory systems as well, and so they
simply must be programmed around.

What is scalability? This is another slippery term, made difficult because of
different assumptions about what is scaling. In particular, if I double the
number of processors I am willing to run a bigger problem to get twice my
current performance (no way around Amdahl's Law), but I don't want to
reprogram. (Actually, most programs won't scale arbitrarily, since new
performance bottlenecks arise in the algorithm unless it's designed with
extreme care, but that is architecture independent.) This gives rise to the
definitions:

 S1. A system architecture is scalable if the deliverable performance is a
     linear function of the number of processors.
 S2. A shared-memory system is scalable if it is scalable and satisfies SM2
     for any number of processors.

A basic question is: can you design a scalable, shared-memory architecture?
I'll answer this question by giving an example.
Start with a packet-switched interconnection network. This network consists
of an NxNxN 3d mesh, toroidally connected in all dimensions. Each dimension
is ``folded over'' so that there are no long wires (and so satisfies S1). Now
uniformly embed in this network P=NxN processors and P memory modules. This
basic model satisfies SM1, and we must argue that it can also satisfy SM2 and
therefore S2. Note that hypercubes and meshes with dimension greater than 3
will not scale, because of long wires, unless corners and edges are made
visible to the programmer.

We will assume that memory references are uniformly distributed; the
processors will hash virtual addresses to get physical addresses (as in RP3
and Cydrome), and so this assumption is valid except for hot spots. The
average distance in the network is 3*(N/4) = sqrt(P)*3/4, and we define the
latency, L, of the network to be the round trip time through the network plus
memory chips. The network is larger than the number of processors to provide
adequate bisection bandwidth.

If each processor only issues one memory operation and then waits for it to
complete, the system could not scale, since the performance would degrade as
the latency increased. In fact the processor must be able to maintain L
memory operations in concurrent execution, where L grows with sqrt(P). One
way to do this is to have a vector processor with vector length L.
Unfortunately this vector length is exposed to the programmer and restricts
the general applicability of the system. An alternative is to use a data-flow
style processor, which has the disadvantage that registers are hard to use
effectively (though variants are currently being studied).

A compromise is to have each processor support L independent instruction
streams. Each stream has its own program counter and register set, but all
streams share the same set of functional units and the same memory port via
low-level time multiplexing. These processors are generally referred to as
``multi-threaded'' processors. They have the advantage of supporting general
parallelism but also have a register set to support compile-time
optimizations.

In all of these architectures, not all of the parallelism in a program is
translated into ``speedup''. Rather, some of it is ``spent'' to support
shared memory (and also functional unit pipeline latencies). This parallelism
is what Valiant refers to as ``parallel slackness''. For this architecture,
if we double the number of processors, then the latency increases by no more
than sqrt(2) (about 1.4), and so the number of streams in a running program
should increase by twice this in order to realize a doubling in delivered
performance.

For a problem of fixed size, you will eventually run out of parallelism, and
the only way to improve performance is to reduce memory latency; but this is
merely Amdahl's Law. To reduce memory latency for this extreme case we can
add local memory to each processor and expose its presence to the programmer,
so he can leave the shared-memory world when his problem is tight on
parallelism and he absolutely must have the extra performance.

This machine is essentially the product being developed by us at Tera
Computer Co. We can make this system less effective at providing
shared-memory by reducing the size of the network to NxNxK where K < N, but
leaving the number of processors at NxN.
As K decreases, the system moves toward a 2D mesh-connected system, which
does not have the bandwidth to support shared-memory, and the programmer will
have to use local memory more aggressively to sustain the same level of
performance. Since the network is a major cost in the system, we see that
giving up shared memory will reduce hardware cost. Does it raise software
costs by more?
--
David Callahan                                        david@tera.com
Tera Computer Co.    400 North 34th Street, Suite 300    Seattle WA, 98103
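To put rough numbers on the latency argument in the post above (my own
arithmetic, not Callahan's), the following throwaway program assumes P = NxN
processors in an NxNxN torus, an average one-way distance of 3N/4 hops, and a
latency equal to the round trip plus a fixed memory cost. The cycle constants
are made up; the sqrt(P) growth in the required per-processor concurrency
("parallel slackness") is the point.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Assumed constants: cycles per network hop and a fixed memory-chip
         * cost, chosen only to make the numbers concrete. */
        const double hop_cycles = 2.0, memory_cycles = 40.0;

        for (int N = 8; N <= 64; N *= 2) {
            double P        = (double)N * N;         /* processors           */
            double distance = 3.0 * N / 4.0;         /* average one-way hops */
            double latency  = 2.0 * distance * hop_cycles + memory_cycles;

            /* To hide this latency, each processor must keep roughly
             * 'latency' memory operations in flight. */
            printf("N=%2d  P=%5.0f  avg dist=%5.1f  latency=%6.1f cycles"
                   "  streams needed ~%4.0f\n",
                   N, P, distance, latency, ceil(latency));
        }
        return 0;
    }

Quadrupling P doubles N, hence doubles the average distance and roughly
doubles the latency to be hidden, which is the sqrt(P) scaling claimed above.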
grunwald@foobar.colorado.edu (Dirk Grunwald) (10/13/90)
I'm not terribly surprised that a program with a largely read-only database,
such as a ray tracer, does better under SVM/NVM/DVM. Consider such a program
on a disk system without enough memory. You can either check residence on
each object access, or you can just have the MMU handle it. The 95% case is
that the object is local.

Of course, this won't work so well if you naively transform a program to a
network memory machine; porting a shared memory problem to, e.g., an XPC
Encore or the new BBN machines will be hard if you don't think about it.