crowl@cs.rochester.edu (Lawrence Crowl) (05/10/88)
In article <674@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>... Everyone's favourite trick seems to be finding ever more complicated
>ways of getting large numbers of CPUs to talk to the memory all at once.
>Just imagine an ever increasing number of waiters trying to get in and out
>of the same kitchen all at once through one door, and you can see the mess.
>OK, let's increase the number of doors ... in hardware terms this means
>separating the memory into several pages which can be accessed
>simultaneously, thereby increasing the effective bandwidth of the memory.
>Is this really admitting that shared memory is not necessary? Surely the
>highest bandwidth is achieved when each processor has its own memory which
>it shares with no one else? It also makes the hardware a lot smaller. ...
>shared-memory is not necessary; it's a software issue that shouldn't be
>solved in hardware.

Yes, the highest bandwidth is achieved when each processor has exclusive
memory. However, processes on different processors must still communicate
with each other. Non-shared-memory communication typically costs two orders
of magnitude more than shared-memory communication. What's worse, even when
processes are on the same processor, software engineering issues often
require that they communicate via the same slow inter-processor mechanism.
So the fastest time to solve a problem may well lie with shared memory even
though its bandwidth is lower. Communication is a software issue; making it
cheap requires hardware. My conclusion is that shared memory is not
necessary, but that it is worth its cost.
--
Lawrence Crowl			716-275-9499	University of Rochester
crowl@cs.rochester.edu				Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl	Rochester, New York, 14627
retrac@titan.rice.edu (John Carter) (05/10/88)
In article <9559@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>In article <674@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>>Surely the highest bandwidth is achieved when each processor has its own
>>memory which it shares with no one else? It also makes the hardware a lot
>>smaller. ... shared-memory is not necessary; it's a software issue that
>>shouldn't be solved in hardware.
>
>Yes, the highest bandwidth is achieved when each processor has exclusive
>memory. However, processes on different processors must still communicate
>with each other. Non-shared-memory communication typically costs two orders
>of magnitude more than shared-memory communication. What's worse, even when
>processes are on the same processor, software engineering issues often
>require that they communicate via the same slow inter-processor mechanism.
>So the fastest time to solve a problem may well lie with shared memory even
>though its bandwidth is lower. Communication is a software issue; making it
>cheap requires hardware. My conclusion is that shared memory is not
>necessary, but that it is worth its cost.

I think that it's currently worth the cost, but I doubt seriously that it'll
be a realistic approach as we try to increase the number of processors
beyond several dozen. A shared (hardware) memory that needed to handle
thousands of processors would be much slower than a conventional memory (how
much slower would depend on the actual architecture and what decisions you
made about the semantics of memory access). The RP3 project at IBM is doing
some interesting work on large shared memory architectures.
Their work aside, I think that you should be able to get "optimal"
performance by providing very fast distributed memory communication,
hardware support for IPC (interprocess[or] communication),
intelligent/clever communication protocols, and intelligent compilers that
can schedule the IPC so as to remove or reduce the delay associated with
waiting for remote memory access. Kai Li (at Princeton, I believe) has done
some very good work on implementing shared memory on top of a distributed
memory machine/system (i.e., the architecture is distributed memory, but the
programmer's view is of shared memory).

John Carter			Internet: retrac@rice.edu
Dept of Computer Science	CSNET:	  retrac@rice.edu
Rice University			UUCP:	  {internet node or backbone}!rice!retrac
Houston, TX
brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (05/17/88)
In article <685@thalia.rice.edu> retrac@rice.edu (John Carter) writes:
>I think that it's currently worth the cost, but I doubt seriously that it'll
>be a realistic approach as we try to increase the number of processors beyond
>several dozen. A shared (hardware) memory that needed to handle thousands of

Experience with the 30 CPU Sequent Symmetry system indicates that the limit
for bus based systems with the right copy-back cache protocol is beyond 30
processors. Experience with the Cerberus multiprocessor simulator indicates
that a non-bus shared memory system can be very successfully used with
processor counts of several hundred and beyond without fancy fetch-and-op
support in the memory subsystem. Give me a couple hundred Cray class
processors and I would be happy for at least a week. :-)
hankd@pur-ee.UUCP (Hank Dietz) (05/17/88)
In article <685@thalia.rice.edu>, retrac@titan.rice.edu (John Carter) writes:
> In article <9559@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes:
> >In article <674@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
> >>Surely the highest bandwidth is achieved when each processor has its own
> >>memory which it shares with no one else? It also makes the hardware a lot
> >>smaller. ... shared-memory is not necessary; it's a software issue that
> >>shouldn't be solved in hardware.
> >
> >Yes, the highest bandwidth is achieved when each processor has exclusive
> >memory. However, processes on different processors must still communicate
> >with each other. Non-shared-memory communication typically costs two
> >orders of magnitude more than shared-memory communication. What's worse,
> >even when processes are on the same processor, software engineering issues
> >often require that they communicate via the same slow inter-processor
> >mechanism. So the ...
> several dozen. A shared (hardware) memory that needed to handle thousands
> of processors would be much slower than a conventional memory (how much
> slower would depend on the actual architecture and what decisions you made
> about the semantics of memory access). The RP3 project at IBM is doing
> some interesting work on large shared memory architectures. Their work
> aside, I think that ...
> associated with waiting for remote memory access. Kai Li (at Princeton,
> I believe) has done some very good work on implementing shared memory
> on top of a distributed memory machine/system (i.e., the architecture is
> distributed memory, but the programmer's view is of shared memory).
The following is a list, in approximate chronological order, of very large
scale shared memory address space (but physically distributed memory) MIMD
machines which have been proposed, simulated, and/or built:

CHoPP		Columbia University HOmogeneous Parallel Processor; a
		machine with a LogN stage smart switch network between
		processors and memory modules. Each switch contains a pre-
		and post-arrival cache called an RFM: Repetition Filter
		Memory. Multiple requests for (at least temporarily)
		read-only items are combined (serviced by the cache, rather
		than by memory) at the point in the net where their paths
		cross.

Denelcor HEP	Uses microtasking to hide shared memory latency.

NYU Ultra	Instead of RFMs, uses F&O (Fetch and Op) smart switches,
		which can combine semaphore operations in the net, but only
		if they actually collide in a particular switch node. Does
		nothing about read-only objects; however, it also has a
		local memory for each processor. F&O happens to be really
		good at doing associative reductions, but not all that good
		at combining semaphore ops, because in a MIMD they rarely
		collide (rather, they usually cross paths).

BBN Butterfly	A "dumb" LogN stage network is used, but it is much faster
		than the 68K-family processors of the machine, so it works.

IBM RP3		Research Parallel Processing Prototype; originally to be
		the NYU machine "done right." Memory is physically local to
		processors but globally addressable, with hardware support
		for address mapping. Was to have two networks: slow F&O and
		fast circuit switching... but only the fast one was
		implemented due to cost constraints.

RFM-MIMD	Son-of-CHoPP. This machine differs in that it uses RFM+
		switches, which can combine both read and write requests
		which cross paths, and it uses only a single stage net;
		memory modules are local to processor nodes, yet globally
		addressable. Further, the processor nodes of the MIMD
		constitute VLIW machines tailored to hide occasionally
		large memory latencies.
Cedar		From University of Illinois... a bunch of Alliants sharing
		a fast global store. Uses a cluster structure for memory
		interconnection. Not really as scalable as the others....

BBN Monarch	Next-generation of the Butterfly?

CARP Machine	Compiler-oriented Architecture Research at Purdue Machine;
		vaguely son-of-RFM-MIMD and son-of-PASM. Lots of compiler
		driven "tricks" used to manage memory....

Horizon/Tera	Burton Smith's new machine (HEP being his old one).
		Sort-of like RFM-MIMD, but with microtasking and with a
		mesh net. Details not yet available.

#ifdef FLAME
Don't you guys read the earlier postings before posting your responses? I
posted something a couple of weeks ago that would have cleared up most of
the argument, but you guys just seem to want to argue, rather than to
discuss.
#endif
--
Compiler-oriented Architecture Researcher from Purdue
Prof. Hank Dietz, Purdue EE
turner@uicsrd.csrd.uiuc.edu.UUCP (05/18/88)
/* Written 11:31 am May 10, 1988 by retrac@titan.rice.edu in comp.arch */
... The RP3 project at IBM is doing some interesting work on large shared
memory architectures. Their work aside, I think that you should be able to
get "optimal" performance by providing very fast distributed memory
communication, hardware support for IPC (interprocess[or] communication),
intelligent/clever communication protocols, and intelligent compilers that
can schedule the IPC so as to remove or reduce the delay associated with
waiting for remote memory access. ...

John Carter			Internet: retrac@rice.edu
Dept of Computer Science	CSNET:	  retrac@rice.edu
Rice University			UUCP:	  {internet node or backbone}!rice!retrac
Houston, TX
/* End of text from uicsrd.csrd.uiuc.edu:comp.arch */

For "optimal" performance on *some* applications, local memories and fast
interprocessor communication may well be enough. But for others,
communication costs may well eliminate any speedup due to parallelism.

Why is it that everyone seems to assume that machines must either have
shared memory OR distributed memory, never a little of both? From my point
of view I would like to see a machine with: fast local memory; slower, but
deeply pipelined, shared memory; and a thin global sync bus (for barrier
sync). We have had memory hierarchies for decades now; why should they
cease to be useful now?
---------------------------------------------------------------------------
Steve Turner (on the Si prairie - UIUC CSRD)
UUCP:	 {ihnp4,seismo,pur-ee,convex}!uiucdcs!uicsrd!turner
ARPANET: turner%uicsrd@a.cs.uiuc.edu
CSNET:	 turner%uicsrd@uiuc.csnet		*-) Mutants for
BITNET:	 turner@uicsrd.csrd.uiuc.edu		    Nuclear Power! (-%
lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) (05/19/88)
In article <8125@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) gives a list of
MIMD machines with large-scale shared memory address space but physically
distributed memory:
....
>CHoPP
>Denelcor HEP
>NYU Ultra
>BBN Butterfly
>IBM RP3
>RFM-MIMD
>Cedar
>BBN Monarch
>CARP Machine
>Horizon/Tera
....

Is the list restricted to existing machines? If not, then I would like to
add the "Fluent Machine" being projected here at Yale. Physically this
consists of processors with local memory connected in a butterfly pattern.
These local memories will however be seen as one shared memory by programs.
To accomplish this, messages have to be routed in the network, and this is
done by something called "fluent routing", which is the essential new thing
for this machine. It will also support combining operations such as full
multiprefix.

If I am to give my personal opinion on very large scale global address
space machines, it is that I think they must provide local, fixed routing
between close processors as an alternative addressing mode. The generality
of the global address scheme will always have to be paid for in long access
time compared with local access. Many algorithms, especially in fields such
as scientific computing, have a quite data-independent structure, which
means that the communication pattern between the computation steps can be
determined at compile time and mapped to the network of processors to
provide a high degree of locality.

Bjorn Lisper
kuehn@super.ORG (James T. Kuehn) (05/20/88)
In article <29577@yale-celray.yale.UUCP> lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) writes:
>
>If I am to give my personal opinion on very large scale global address space
>machines, then it is that I think they must provide local, fixed routing
>between close processors as an alternative addressing mode. The generality
>of the global address scheme will always have to be paid in long access time
>compared with local access.

Not necessarily. The confusion seems to be that you are making a large
global address space synonymous with long, fixed access time. Consider the
Denelcor HEP, which was Hank's first example. It had a large global address
space and a variable memory access time that depended on the distance from
the processor (number of network hops). If your compiler knows about memory
hierarchies and can distinguish "near" addresses from "far" ones (in terms
of latencies), it can schedule values accordingly. The flat, shared address
space makes life convenient for the compiler, especially when it doesn't
know where to schedule a value. If, in addition, you have a machine that
hides latency by some technique (as did HEP), you're home free.

In summary: Reducing latency is always a win, but when you don't know how,
hide it too!

Jim Kuehn
kuehn%super.org@sh.cs.net
billo@cmx.npac.syr.edu (Bill O) (05/23/88)
In article <43700039@uicsrd.csrd.uiuc.edu> turner@uicsrd.csrd.uiuc.edu writes:
>Why is it that everyone seems to assume that machines must either have
>shared memory OR distributed memory, never a little of both? From my
>point of view I would like to see a machine with: fast local memory;
>slower, but deeply pipelined, shared memory; and a thin global sync
>bus (for barrier sync). We have had memory hierarchies for decades
>now, why should they cease to be useful now?

I know of one architecture that has both local memory and globally shared
memory -- the BBN Butterfly. As I understand it (correct me if I'm wrong),
the processing nodes on the Butterfly each have memory which is local in
the sense that it can be accessed very quickly by the local processor, but
which is globally shared, in the sense that all the other processors can
access it too, via the big global switch. Architecturally, each node
"thinks" it has a very big globally shared memory, but software can take
advantage of fast local access by arranging for each processor's most
frequently accessed data to be in that portion of memory which is truly
local.

Hmm... I hope I've got that right.

Are there any other multi-processor architectures that have some mix of
both local and shared memory?

Bill O'Farrell, Northeast Parallel Architectures Center at Syracuse University
(billo@cmx.npac.syr.edu)
emiller@bbn.com (ethan miller) (05/23/88)
In article <501@cmx.npac.syr.edu> billo@cmx.npac.syr.edu (Bill O'Farrell) writes:
->In article <43700039@uicsrd.csrd.uiuc.edu> turner@uicsrd.csrd.uiuc.edu writes:
->
->>Why is it that everyone seems to assume that machines must either have
->>shared memory OR distributed memory, never a little of both? From my
->>point of view I would like to see a machine with: fast local memory;
->>slower, but deeply pipelined, shared memory; and a thin global sync
->>bus (for barrier sync). We have had memory hierarchies for decades
->>now, why should they cease to be useful now?
->
->I know of one architecture that has both local memory and globally
->shared memory -- the BBN Butterfly. As I understand it (correct me if
->I'm wrong), the processing nodes on the Butterfly each have memory
->which is local in the sense that it can be accessed very quickly by
->the local processor, but which is globally shared, in the sense that
->all the other processors can access it too, via the big global switch.
->Architecturally, each node "thinks" it has a very big globally shared
->memory, but software can take advantage of fast local access by
->arranging for each processor's most frequently accessed data to be in
->that portion of memory which is truly local.
->
->Hmm... I hope I've got that right.
->
->Bill O'Farrell, Northeast Parallel Architectures Center at Syracuse University
->(billo@cmx.npac.syr.edu)

The description of the Butterfly is essentially correct. One element it is
lacking is a memory associated with no processor node. The delay for a
single longword (32 bit) read/write to another processor is about 4
microseconds, while it's very fast (~250 ns (?)) to local memory. A nice
addition, which I think "turner" is referring to, is an intermediate -- a
memory accessible with the same delay from all nodes, along with local
memory for each node. The Butterfly could do this with memory-only nodes,
though I don't know of any plans for those (I work for a different part of
BBN).
Essentially, then, the Butterfly has a distributed-only memory model, but
accesses may SEEM like the memory is globally shared. Memory transfer
between nodes is handled by the PNC, which is a microcoded processor.
Global sharing is an illusion created by this processor.

ethan
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
ethan miller			| "Quod erat demonstrandum, baby."
BBN Laboratories		| "Oooh, you speak French!"
ARPAnet : emiller@bbn.com	| PHONEnet: (617) 873-3091
				| Disclaimer: It's MY opinion, not BBN's.
rogers@ofc.Columbia.NCR.COM (H. L. Rogers) (05/23/88)
In article <501@cmx.npac.syr.edu> billo@cmx.npac.syr.edu (Bill O'Farrell) writes:
>In article <43700039@uicsrd.csrd.uiuc.edu> turner@uicsrd.csrd.uiuc.edu writes:
>
>Are there any other multi-processor architectures that have some mix of
>both local and shared memory?

Yes. The NCR TOWER 32/800 has local memory for each processing element,
whether it be an application processor, file processor, terminal processor,
communications processor, or what-have-you processor. It provides fast
local access and is shared globally through messaging across Multibus II (I
think I got that right). The point of the original article implies that
many people think shared memory is necessary for speed. Like most things in
life, it depends on what you want to do. Shared memory is necessary for
cooperating processes which share the same data to enjoy high speed, but
single-threaded processes can't stand the overhead. Intelligent customers
will choose the architecture that best fits their application environment.
--
------------
HL Rogers    (hl.rogers@ncrcae.Columbia.NCR.COM)
tpo@dde.uucp (Thomas P.S. Olesen) (05/31/88)
In article <43700039@uicsrd.csrd.uiuc.edu> turner@uicsrd.csrd.uiuc.edu writes:
>
>Why is it that everyone seems to assume that machines must either have
>shared memory OR distributed memory, never a little of both? From my
>point of view I would like to see a machine with: fast local memory;
>slower, but deeply pipelined, shared memory; and a thin global sync
>bus (for barrier sync). We have had memory hierarchies for decades
>now, why should they cease to be useful now?

The Supermax computer made by "my" company has both distributed and shared
memory. The Supermax is a multiprocessor machine with up to 8 x 68020
application processors, and up to 8 other 680x0/8085 processors for
handling IO. Every processor has local memory (up to 256 Mb), but each can
set up its MMU and perform data read/write/read-modify-write in every other
CPU's memory, using the common bus -- but NOT instruction fetch. This
architecture makes it possible to run all processors at high speed and have
fast memory access AND have shared memory. The OS is SVR3.1, and shared
memory for user processes is fully supported. A memory access to another
CPU is a little slower than a local access; I think a remote access takes
about 4 times a local access.

Thomas P. Olesen
--
*****************************************************************************
Thomas P.S. Olesen			Dansk Data Elektronik A/S
E-mail: ..!mcvax!diku!dde!tpo		System Software Department
or tpo@dde.dk