hhd0@GTE.COM (Horace Dediu) (10/18/89)
In article <20336@princeton.Princeton.EDU>, mg@notecnirp.Princeton.EDU (Michael Golan) writes:
> This came from various people - the references are so confusing I removed them
> so as not to put the wrong words in someone's mouth:
>
> >>>Supercomputers of the future will be scalable multiprocessors made of many
> >>>hundreds to thousands of commodity microprocessors.
> >>
> >This is the stuff of research papers right now, and rapid progress is being
> >made in this area. The key issue is not having the components which establish
> >the interconnect cost much more than the micros, their off chip caches,
> >I currently lean to scalable coherent cache systems which minimize programmer
> >effort. The exact protocols and hardware implementation which work best
> >for real applications is a current research topic.
>
> 1) There is no parallel machine currently that works faster than non-parallel
> machines for the same price. The "fastest" machines are also non-parallel -
> these are vector processors.

Consider the 8k-processor NCUBE 2--"The World's Fastest Computer."
(yes, one of those). According to their literature:
"8,192 64 bit processors each equivalent to one VAX 780. It delivers
60 billion instructions per second, 27 billion scalar FLOPS, exceeding the
performance of any other currently available or recently announced
supercomputer." It's distributed memory, .5MB per processor, runs UNIX,
and is a hypercube.

I don't know the price, but I bet it's less than a Cray. Interesting to
talk about GigaFLOPS. This is fast.

> 2) A lot of research is going on - and has gone on for over 10 years now. As far
> as I know, no *really* scalable parallel architecture with shared memory exists
> that will scale far above 10 processors (i.e. 100). And it does not seem to
> me this will be possible in the near future.

Who cares about shared memory? Distributed is the only way to scale.
Everybody realizes this since it can be proven.
The only reason shared memory machines exist is because we don't yet know
how to make good distributed machines. (Yeah, right! tell that to Ncube)
IMHO shared memory is a hack using available bus technology while waiting for
the real parallel machines to come. (they're already here)

> 3) personally I feel parallel computing has no real future as the single cpu
> gets a 2-4 fold performance boost every few years, and parallel machine
> construction just can't keep up with that. It seems to me that for at least
> the next 10 years, non-parallel machines will still give the best performance
> and the best performance/cost.

This is very ambiguous. Parallel machines can use off-the-shelf CPUs. If a
fast micro is available then you can design a parallel machine around it as
you would any workstation. The other problem: if CPUs improve 2-4 fold every
few years, and if this can be maintained for 10 years, you can only expect
about a 32-fold increase (doubling every two years for ten years is 2^5 = 32).
This is nothing. You can't expect problems to stay that small. If you expect
to go beyond that you'll hit a wall with the fundamental boundary of the
speed of light. You can't drive clock rates to infinity. The only way to
speed up is to do it in parallel. Sure it's hard to program, but it's a new
field, the tools are rudimentary, and only hardware people are involved in
their development. If enough effort is put into it, parallel machines should
not be any harder to program than your basic workstation.

> time that really matters. And while computers get faster, it seems software
> complexity and the need for faster and faster machines is growing even more
> rapidly.

Of course. To solve hard problems you *need* parallel execution.
It's no secret that every big iron maker and every supercomputer shop is
developing parallel machines. These are still modest efforts (<100 CPUs),
but the leading edge is now at 10k coarse-grained, 64k fine-grained
processors. This should scale nicely to 1M processors in the next decade.
After that we can expect some kind of new barriers to come up.

> Michael Golan
> mg@princeton.edu
--
Horace Dediu      "That's the nature of research--you don't know   |GTE Laboratories
(617) 466-4111     what in hell you're doing."  `Doc' Edgerton     |40 Sylvan Road
UUCP: ...!harvard!bunny!hhd0................................|Waltham, MA 02254
Internet: hhd0@gte.com or hhd0%gte.com@relay.cs.net..........|U. S. A.
yodaiken@freal.cs.umass.edu (victor yodaiken) (10/19/89)
In article <7651@bunny.GTE.COM> hhd0@GTE.COM (Horace Dediu) writes:
>Consider the 8k processor NCUBE 2--"The World's Fastest Computer."
>(yes, one of those). According to their literature:
>"8,192 64 bit processors each equivalent to one VAX 780. It delivers
>60 billion instructions per second, 27 billion scalar FLOPS, exceeding the
>performance of any other currently available or recently announced
>supercomputer." It's distributed memory .5MB per processor, runs UNIX,
>and is a hypercube.
>
>I don't know the price, but I bet it's less than a Cray.

I'd like to see the delivered price of an 8k processor system.

>Interesting to
>talk about GigaFLOPS. This is fast.

This sounds like one of those total b.s. measures obtained by multiplying
the number of processors by the max mips/mflops rate per processor.

>Who cares about shared memory? Distributed is the only way to scale.
>Everybody realizes this since it can be proven.

Proof citation? Sketch?

There is a lot of mythologizing about parallelism. Parallel processing is a
standard technique for speed which is used in every carry-lookahead adder,
every bus, etc. It seems reasonable to believe that parallelism will be an
important technique in the future. It seems POSSIBLE that using multiple
CPUs will be a useful technique. On the other hand there is no reason why
this technique must work, and it seems at least as possible that CPUs
should not be the basic unit of parallel computation.

>It's no secret that every big iron maker and every supercomputer shop is
>developing parallel machines. These are still modest efforts (<100 cpu's),
>but the leading edge is now in the 10k coarse grained, 64k fine grained
>processors. This should scale nicely to 1M processors in the next decade.
>After that we can expect some kind of new barriers to come up.

I admire your confidence, but am unconvinced. Evidence?

victor
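Yodaiken's suspicion that the aggregate figure is just processors times peak per-node rate is easy to check. A back-of-envelope calculation (assuming the quoted NCUBE numbers are aggregate peaks, as he suggests):

```python
# Sanity check of the NCUBE 2 marketing numbers quoted above.
# Assumption: the aggregate figures are per-node peak rates multiplied
# by the node count, which is what Yodaiken calls a "b.s. measure".
processors = 8192
total_ips = 60e9      # "60 billion instructions per second"
total_flops = 27e9    # "27 billion scalar FLOPS"

per_node_mips = total_ips / processors / 1e6
per_node_mflops = total_flops / processors / 1e6
print(f"{per_node_mips:.1f} MIPS, {per_node_mflops:.1f} MFLOPS per node")
```

This yields about 7.3 MIPS and 3.3 MFLOPS per node, plausible peak rates for a single late-80s micro, and several times a VAX 780's roughly 1 MIPS, consistent with "equivalent to one VAX 780" describing the architecture rather than the performance.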
jdarcy@multimax.UUCP (Jeff d'Arcy) (10/19/89)
From article <7651@bunny.GTE.COM>, by hhd0@GTE.COM (Horace Dediu):
> Who cares about shared memory? Distributed is the only way to scale.
I'd like to see you back this one up with some *real* proof. I could
possibly agree with the statement that distributed is the *best* way
to scale, or that distribution is necessary for *large* (>~30) scale
multiprocessing. I think that shared memory architectures will still
be viable for a long time, perhaps as a component of a distributed
environment. If you disagree please provide reasons.
Jeff d'Arcy jdarcy@encore.com "Quack!"
Encore has provided the medium, but the message remains my own
tomlic@yoda.ACA.MCC.COM (Chris Tomlinson) (10/19/89)
From article <20416@princeton.Princeton.EDU>, by mg@notecnirp.Princeton.EDU (Michael Golan):
> In article <7651@bunny.GTE.COM> hhd0@GTE.COM (Horace Dediu) writes:
>>
>>Consider the 8k processor NCUBE 2--"The World's Fastest Computer."
>>(yes, one of those). According to their literature:
>>"8,192 64 bit processors each equivalent to one VAX 780. It delivers
>>60 billion instructions per second, 27 billion scalar FLOPS, exceeding the
>
> Does this imply a VAX 780 is a 7 MIPS machine?

The architecture of the processor is similar to the VAX ISA, not the
performance.

>>performance of any other currently available or recently announced
>>supercomputer." It's distributed memory .5MB per processor, runs UNIX,
>>and is a hypercube.
>
> .5MB? And this is faster than a Cray? How many problems can't you even
> solve on this? And for how many will a 32MB single VAX 780 beat it?!

I understand that NCUBE makes provisions for up to 64MB per node on those
systems using the 64 bit processors. They also apparently have incorporated
a through-routing capability in the processors similar to that found on the
Symult mesh-connected machines.

> One of the well known problems with Hypercubes is that if you look at a job
> that uses the whole memory (in this case 4Gb = Big Cray), a single machine
> with the same performance as one processor (and all the memory) will be
> almost as good and sometimes even better.

The current trends in distributed memory MIMD machines are towards very low
communication latencies, by comparison with the first generation machines
that used software routing and slow communication hardware. This tends to
drive the machines towards shared-memory-like access times. Of course,
physical limitations mean that a DM-MIMD machine approximates shared memory
worse and worse as it gets larger -- but at least the machine can get
larger.

> My original point was that MIMD, unless it has shared memory, is very hard
> to make use of with typical software/algorithms. Some problems can be solved
> nicely on a Hypercube, but most of them can not! And the state of the art
> in compilers, while having some luck with vectorized code, and less luck
> with shared memory code, has almost no luck with message-passing machines.

The state of the art in parallel algorithm development is advancing rapidly
as machines become available to experiment on. It is more an issue of
algorithm design than of parallelizing sequential codes. There are quite a
number of problems that are tackled on Crays because of superior scalar
performance that do not make significant use of the SIMD vector
capabilities. I would point to the development of BLAS-2 and -3 as
indications that even on current supercomputers compiler technology just
doesn't carry the day by itself.

> Michael Golan
> mg@princeton.edu
> My opinions are my own. You are welcome not to like them.

Chris Tomlinson
tomlic@MCC.COM
--opinions....
david@cs.washington.edu (David Callahan) (10/19/89)
In article <7651@bunny.GTE.COM> hhd0@GTE.COM (Horace Dediu) writes:
>Who cares about shared memory? Distributed is the only way to scale.

Perhaps you forgot a smiley? Or perhaps when you say "shared" you mean
"centralized"? "Shared" memory is part of the virtual machine and clearly
can be implemented on a machine with "distributed" packaging of memory with
processors. The BBN Butterfly and the RP3 are both "distributed" memory
machines in the sense that memory is packaged with processors and the
hardware takes care of building "messages" for every memory request.

From a programming point of view, machines like the NCUBE have three (IMHO)
serious faults: message passing is done in software and therefore has
orders of magnitude more latency than on a "shared" memory machine; data
movement now requires software-controlled cooperation on both processors;
and finally, the programmer must determine the location of the "most
recent" value of every variable and which processor was the last to write
it or next to use it.

I care about shared memory --- it makes parallel machines much easier to
program.

>Everybody realizes this since it can be proven.
>The only reason shared memory machines exist is because we don't yet know
>how to make good distributed machines. (Yeah, right! tell that to Ncube)
>IMHO shared memory is a hack using available bus technology while waiting for
>the real parallel machines to come. (they're already here)

Shared memory has nothing to do with busses --- it has to do with
programming.

Disclaimer: I work for a company designing a multiprocessor that supports
shared memory programming.
--
David Callahan  (david@tera.com, david@june.cs.washington.edu, david@rice.edu)
Tera Computer Co.     400 North 34th Street     Seattle WA, 98103
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (10/20/89)
Since I care about parallel machines, my two cents worth:

FACT: we have proof-by-existence that
- massively parallel machines can be built, can be reliable, etc.
  (defining massive as "more than 1000 processor chips").
- they can have aggregate properties (GIPS, GFLOPS, GB, GB/s IO) that are
  in the supercomputer league. Yes, I have details.
- they allow memory-intensive algorithms, since they can use, in main
  memory, the slower/cheaper DRAMs that Cray uses only in backing memory.
  Yes, I can back this up.
- for selected large applications, these machines already are the fastest
  hardware, and the cheapest hardware. Yes, both.
- for selected applications, these machines aren't that hard to program.

IT SEEMS AGREED THAT:
- MIMD machines can have automatic load balancing, timesharing, etc.
  (Actually, the timesharing is called "spacesharing".)
- MIMD machines with virtual memory could conveniently fault pages around
  between nodes.
- some applications could use a Connection Machine with millions of
  processors.
- programming isn't as easy as we'd like.

RESEARCHERS HAVE FOND HOPES THAT, SOME DAY,
- automatic parallelization onto these machines will become practical for
  many applications.
- shared-memory/cache-coherency will be cost-effective on large MIMDs.

MY OPINION:
- most supercomputer applications will wind up on massively parallel
  machines.
- there will always be a market for the fastest possible single CPU.
- MIMD machines don't want the fastest possible node, because (so far) the
  money is better spent buying several cheaper nodes.
- conventional instruction set architectures are suitable bases for MIMD
  nodes.
- "scaling laws" are sophistry when 8K node MIMDs are here now.

Sorry to be so wordy.
--
Don             D.C.Lindsay     Carnegie Mellon Computer Science
mbutts@mentor.com (Mike Butts) (10/20/89)
From article <10164@encore.Encore.COM>, by jdarcy@multimax.UUCP (Jeff d'Arcy):
> From article <7651@bunny.GTE.COM>, by hhd0@GTE.COM (Horace Dediu):
>> Who cares about shared memory? Distributed is the only way to scale.
>
> I'd like to see you back this one up with some *real* proof. I could
> possibly agree with the statement that distributed is the *best* way
> to scale, or that distribution is necessary for *large* (>~30) scale
> multiprocessing. I think that shared memory architectures will still
> be viable for a long time, perhaps as a component of a distributed
> environment. If you disagree please provide reasons.

Distributed architectures are obviously most desirable from a hardware
point of view, because they are simple and (nearly) arbitrarily scalable.
IMHO there are two reasons why shared memory architectures will continue to
be more important for many of us for a long time to come.

1) Shared memory systems are *much* easier to program, and software
development costs *much* more than hardware nowadays. I say this based on
my experience as a hardware engineer in a mostly software engineering
environment.

2) Many problems have proven inefficient so far on distributed memory
architectures. Distributed machines succeed beautifully on problems where
some real physical space can be mapped onto processor/memory nodes arranged
in a regular topology similar to the topology of the problem. Modeling
physical media, such as solids, liquids or gases, takes advantage of the
fact that the state of one parcel only directly affects the state of its
nearby neighbors. Problems with irregular topology, such as electronic
circuits, are very much harder to solve efficiently, because the state of
one gate or transistor may affect the state of another at a great distance.
Communication is far more irregular and expensive, so speedups suffer.
Static load balancing among the processors is also much harder.

Shared architectures need not statically partition the problem, and
communication speed depends much less on distance. I'm aware of new
algorithmic technology being developed to attack these problems, such as
distributed discrete-event simulation techniques, but there's still a lot
of work ahead.

I agree that distributed memory is the only way to scale if you can, but
there are important problems which are much more readily solved on shared
architectures, at least so far. A hybrid architecture, with physically
distributed but logically shared memory, of which several examples have
been built, may be the best transition path.
--
Michael Butts, Research Engineer   KC7IT   503-626-1302
Mentor Graphics Corp., 8500 SW Creekside Place, Beaverton, OR 97005
!{sequent,tessi,apollo}!mntgfx!mbutts        mbutts@pdx.MENTOR.COM
Opinions are my own, not necessarily those of Mentor Graphics Corp.
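Butts's point about physical media mapping well onto node grids can be made concrete with a toy 1-D domain decomposition. The sketch below is purely illustrative (all names are mine, not from the thread): each "node" holds a contiguous chunk of the grid, and one smoothing step needs only a single boundary cell ("halo") from each neighbor, no matter how large the chunk is -- exactly the locality that regular-topology machines exploit.

```python
# Toy 1-D domain decomposition, in the spirit of Butts's observation that
# "the state of one parcel only directly affects the state of its nearby
# neighbors". Illustrative sketch; real machines of the era did this with
# explicit message passing between nodes.

def partition(grid, nprocs):
    """Split the global grid into equal contiguous chunks, one per node."""
    n = len(grid) // nprocs
    return [grid[i*n:(i+1)*n] for i in range(nprocs)]

def smooth_step(chunks):
    """One Jacobi averaging step. Each node fetches exactly one halo cell
    from each neighbor, regardless of chunk size (reflecting at the ends)."""
    new_chunks = []
    for p, chunk in enumerate(chunks):
        left = chunks[p-1][-1] if p > 0 else chunk[0]                # halo from left node
        right = chunks[p+1][0] if p < len(chunks)-1 else chunk[-1]   # halo from right node
        padded = [left] + chunk + [right]
        new_chunks.append([(padded[i-1] + padded[i+1]) / 2
                           for i in range(1, len(padded)-1)])
    return new_chunks

grid = [0.0]*8 + [8.0]*8           # a step profile, 16 cells
chunks = partition(grid, 4)        # 4 "nodes", 4 cells each
chunks = smooth_step(chunks)
flat = [x for c in chunks for x in c]
print(flat)                        # the step has begun to diffuse
```

An irregular problem like Butts's circuit example has no such neighbor structure, which is why the same partitioning trick fails there.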
bzs@world.std.com (Barry Shein) (10/22/89)
Why oh why are people searching around for a philosopher's stone of
computing architectures? This isn't science, this is aberrant psychology.

There exist problems which are best run on (pick one or more) MIMD, SIMD,
scalar, vector processors, and/or hybrids. There also exist problems which
don't care what they're run on, at least not much.

For example, time-sharing and typical database environments seem to run
very well on MIMD systems for very little re-programming effort (in the
case of time-sharing, none). This is because of the large granularity of
the applications and their measure of "runs well" as mere response time.
There are other examples for MIMD and examples for SIMD and other types of
processors.

To say (MIMD/SIMD) processors are a bad idea because there exists some
large set of problems which are either impossible or very hard to optimize
for those architectures is so goddamn stupid it boggles the mind.

MIMD processors are relatively easy and cheap to build out of mostly
commodity parts. SIMD processors appear to be very, very good at certain
classes of problems which are very important to some people. Important
enough that they'll buy a SIMD box just to run those problems and tell
people with other problems that don't fit so well to go fly a kite (which
is the mathematically correct answer to them).

We've had multi-processing and hardware optimizations almost since
computing began. What do you think makes a mainframe a mainframe?
Multi-processing, particularly in the I/O channels, because mainframes are
bought for high I/O throughput. Most DP shops aren't CPU bound, they're I/O
bound, so they buy their CPUs in the channels.

It continues to astound me how, particularly in the academic computer
science community, some dodo will stand in front of an audience, show that
there exists a class of problems which don't run well on parallel
architectures, and conclude that therefore parallel architectures are bad.

What *frightens* me is that N people will sit in the audience and nod their
heads in agreement and go out to spread this gospel (as we've seen on this
list) instead of riding the dodo out on the first rail.

LOOK, there are folks out there using all these architectures and winning
big. Consider that before you attempt to prove on paper again that it's
impossible for a honeybee to fly.

What would be *useful* would be a taxonomy of algorithms classified by how
well (and with what difficulty of adaptation) they map to various
architectures. But it's so much easier to throw peanut shells from the
bleachers.
--
        -Barry Shein

Software Tool & Die, Purveyors to the Trade         | bzs@world.std.com
1330 Beacon St, Brookline, MA 02146, (617) 739-0202 | {xylogics,uunet}world!bzs
mccalpin@masig3.masig3.ocean.fsu.edu (John D. McCalpin) (10/22/89)
In article <1989Oct21.234930.905@world.std.com> bzs@world.std.com (Barry Shein) writes:
>Why oh why are people searching around for a philosopher's stone of
>computing architectures? This isn't science, this is aberrant
>psychology.

While there may be some small component of "holy grail"-itis in this
thread, I think that most people are discussing a different problem here.
This isn't science --- it is technology plus market forces....

>There exist problems which are best run on (pick one or more) MIMD,
>SIMD, scalar, vector processors, and/or hybrids. There also exist
>problems which don't care what they're run on, at least not much.

I don't think that anyone disputes this. The question is whether or not
each of these types of architectures can acquire a sufficient market to be
competitive with whatever architecture is selling best, and which therefore
has the most R&D money and the best economies of scale.

>To say (MIMD/SIMD) processors are a bad idea because there exists some
>large set of problems which are either impossible or very hard to
>optimize for those architectures is so goddamn stupid it boggles the
>mind.

On the other hand, it is a perfectly reasonable thing to decide that it is
not worth my time to learn how to work with/program on/optimize on a
particular architecture because it is not likely to be commercially
successful. It is easy enough to be wrong about the commercial success
aspects, but it is an unavoidable question.

>It continues to astound me how, particularly in the academic computer
>science community, some dodo will stand in front of an audience, show
>that there exists a class of problems which don't run well on parallel
>architectures, and conclude that therefore parallel architectures are
>bad.
>What *frightens* me is that N people will sit in the audience and nod
>their heads in agreement and go out to spread this gospel (as we've
>seen on this list) instead of riding the dodo out on the first rail.

I believe that it is far more common for people to conclude that parallel
architectures are not an effective approach for the class of problems being
discussed. Since many of us out here are _users_, rather than designers, it
is hardly surprising that we would downplay the potential usefulness of
architectures that are believed to be unhelpful in our chosen work. This is
quite a reasonable response -- it reminds me of democracy and enlightened
self-interest and all of that stuff.... :-)

>LOOK, there are folks out there using all these architectures and
>winning big. Consider that before you attempt to prove on paper again
>that it's impossible for a honeybee to fly.

It is a good point to note here that lots of the people who think that
parallel architectures are not useful in their field are wrong.

>But it's so much easier to throw peanut shells from the bleachers.

And so much more fun! ;-)
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
                   mccalpin@scri1.scri.fsu.edu
                   mccalpin@delocn.udel.edu
crowl@cs.rochester.edu (Lawrence Crowl) (10/23/89)
In article <7651@bunny.GTE.COM> hhd0@GTE.COM (Horace Dediu) writes:
>Who cares about shared memory? Distributed is the only way to scale.
>Everybody realizes this since it can be proven. The only reason shared
>memory machines exist is because we don't yet know how to make good
>distributed machines. IMHO shared memory is a hack using available bus
>technology while waiting for the real parallel machines to come.

You are mixing two concepts --- memory architecture (as the processors see
it) and communication interconnect. Commonly available shared memory
systems tend to use a bus interconnect, so people assume that this is the
only interconnect for shared memory. This assumption is wrong.

SHARED MEMORY DOES NOT IMPLY A BUS INTERCONNECT. The BBN Butterfly, IBM
RP3, and NYU Ultracomputer all supported shared memory implemented over an
FFT-style (butterfly) interconnect without busses. The Butterfly is
commercially available with up to 512 processors. The RP3 is a research
machine designed for as many as 512 processors, though I don't know if IBM
has configured one that large. I don't recall the Ultracomputer size.

SHARED MEMORY IS SCALABLE. If the system supports a scalable
interconnection, and processors have local memory, then a shared memory
system is scalable. With local memory, only information that is truly
shared need be communicated between processors. This is exactly the
information that must be communicated on a distributed memory system via
message passing.

SHARED MEMORY IS DESIRABLE. The latency of a remote memory access is
typically two orders of magnitude lower than that of message passing on
distributed memory. Applications with small messages sent on an infrequent
basis will see significant performance improvements. For instance, a shared
memory system can increment a shared counter far faster than any
distributed memory system.

SHARED MEMORY HAS A COST. Implementing shared memory over a scalable
interconnect may require a larger aggregate bandwidth than that of
distributed memory systems. I don't think there has been enough research
here to know the real tradeoff, but such a result would not surprise me.
--
Lawrence Crowl		716-275-9499	University of Rochester
crowl@cs.rochester.edu			Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl	Rochester, New York, 14627
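Crowl's shared-counter example contrasts two programming models. The sketch below (my own illustration, not from the thread) models both styles in one process: a lock-protected shared variable versus increment "messages" queued to an owner task. In the shared version each update is a single locked read-modify-write; in the message version every increment pays for an enqueue, a dequeue, and scheduling, which is the overhead Crowl is pointing at.

```python
# Model of Crowl's point: incrementing a counter via shared memory
# (a lock-protected variable) vs. message passing (increment requests
# queued to the task that "owns" the counter). A model of the two
# programming styles, not a benchmark.
import threading, queue

N_WORKERS, N_INCR = 4, 1000

# --- shared-memory style: one locked update per increment ---
shared_count = 0
lock = threading.Lock()

def shared_worker():
    global shared_count
    for _ in range(N_INCR):
        with lock:                 # one atomic read-modify-write
            shared_count += 1

threads = [threading.Thread(target=shared_worker) for _ in range(N_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()

# --- message-passing style: the owner task applies queued requests ---
msg_count = 0
mailbox = queue.Queue()

def msg_worker():
    for _ in range(N_INCR):
        mailbox.put(1)             # each increment is a message

threads = [threading.Thread(target=msg_worker) for _ in range(N_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
while not mailbox.empty():         # owner drains its mailbox
    msg_count += mailbox.get()

print(shared_count, msg_count)     # both arrive at 4000
```

Both styles compute the same result; the difference is purely in how much machinery sits between a worker and the counter.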
brooks@vette.llnl.gov (Eugene Brooks) (10/24/89)
In article <1989Oct23.152120.25967@cs.rochester.edu> crowl@snipe.cs.rochester.edu (Lawrence Crowl) writes:
>SHARED MEMORY IS DESIRABLE. The latency of a remote memory access is typically
>two orders of magnitude lower than that of message passing on distributed memory.

Given equivalent performance interconnect, which rarely occurs because the
message passing machines tend to get short-changed on the communication
hardware, I have found the "shared memory" systems to have much better
communication performance. This is because the communication between
processors is directly supported in the memory management hardware. In the
message passing machines sending a message invokes a "kernel call" on both
the sending and receiving ends. This system call overhead is much greater
than the hardware latency itself, amounting to a factor of 5 or more. One
could try for complex hardware support of messaging, but a better solution
is to just memory map it...

Please note: I am not talking about the really horrible interrupt handling
of message forwarding here. That only compounds a bad situation for kernel
overhead.

brooks@maddog.llnl.gov, brooks@maddog.uucp
ingoldsb@ctycal.UUCP (Terry Ingoldsby) (10/24/89)
In article <7651@bunny.GTE.COM>, hhd0@GTE.COM (Horace Dediu) writes:
> The only reason shared memory machines exist is because we don't yet know
> how to make good distributed machines. (Yeah, right! tell that to Ncube)
> IMHO shared memory is a hack using available bus technology while waiting for
> the real parallel machines to come. (they're already here)
....
> Of course. To solve hard problems you *need* parallel execution.
> It's no secret that every big iron maker and every supercomputer shop is
> developing parallel machines. These are still modest efforts (<100 cpu's),
> but the leading edge is now in the 10k coarse grained, 64k fine grained
> processors. This should scale nicely to 1M processors in the next decade.
> After that we can expect some kind of new barriers to come up.

It is true that the only hope for the kind of performance improvement that
will be required for the next generation of software is parallel
processing. It is also true that many different parallel processing
machines exist. Here is where I start to be less optimistic about how
quickly we can adopt parallel processing.

The basic problem is *NOT* the software. Although awkward to write, it can
be done, and it is reasonable to assume that techniques will be developed
as the years go by. The (IMHO) fundamental problem is that different
problems partition differently. By this I mean that to be executed across
many P.E.s (processing elements) a problem must be partitioned. One
criterion for partitioning the problem is to subdivide it in such a way
that the different P.E.s do as little inter-P.E. communication as possible.
Generally speaking, the more you divide the problem, the more P.E.s need to
talk to each other. This, of itself, is not necessarily bad (at least, it
is tolerable). The problem is that different classes of problems subdivide
in different ways.

Thus it is difficult to set up any *efficient* communication strategy that
will let P.E.s talk to other P.E.s (without going through lots of
intermediary P.E.s) for general problems. Until someone addresses this
issue, I am hard pressed to believe that a *general purpose* parallel
machine will be developed. I can easily foresee a time when there will be a
custom (parallel) vision processor in a computer, a custom speech
recognition processor, a custom database processor, ... and so on. I cannot
see a given parallel architecture doing all of the above.

My 2 cents!
--
  Terry Ingoldsby                ctycal!ingoldsb@calgary.UUCP
  Land Information Systems                 or
  The City of Calgary       ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb
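Ingoldsby's trade-off (finer partitioning means more inter-P.E. talk) can be quantified for the simplest regular case. The toy model below is my own illustration, assuming a square grid split into square blocks with nearest-neighbor coupling: per-node computation shrinks with the block *area* while per-node communication shrinks only with the block *perimeter*, so communication grows relative to computation as the partition gets finer.

```python
# Toy surface-to-volume model of partitioning a regular problem.
# Assumptions (mine, not from the post): an n x n grid of cells split
# into p x p square blocks; each cell costs one unit of work per step;
# each block exchanges its boundary cells with its neighbours each step.
def per_node_costs(n, p):
    side = n // p                  # cells along one edge of a block
    work = side * side             # "volume": computation per node
    comm = 4 * side                # "surface": boundary cells exchanged
    return work, comm

n = 1024
for p in (2, 8, 32):
    work, comm = per_node_costs(n, p)
    print(p*p, "nodes:", work, "work,", comm, "comm,",
          round(comm / work, 3), "comm/work ratio")
```

For irregular problems there is no such surface-to-volume bound at all, which is exactly why a single efficient communication strategy for general problems is so elusive.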
vorbrueg@bufo.usc.edu (Jan Vorbrueggen) (10/24/89)
In article <36597@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>Given equivalent performance interconnect, which rarely occurs because the
>message passing machines tend to get short-changed on the comm. hardware,
>I have found the "shared memory" systems to have much better communication
>performance. This is because the communication between processors is
>directly supported in the memory management hardware. In the message passing
>machines sending a message invokes a "kernel call" on both the sending and
>receiving ends. This system call overhead is much greater than the hardware
>latency itself, amounting to a factor of 5 or more. One could try for complex
>hardware support of messaging, but a better solution is to just memory map it.
>
>Please note: I am not talking about the really horrible interrupt handling
>of message forwarding here. This only compounds a bad situation for kernel
>overhead.

Eugene, ever seen a transputer? Overhead for receiving or sending a message
is 19 cycles (630 ns for a 30 MHz part). The actual transfer is done by a
dedicated DMA machine at a maximum rate of 1.7 Mbyte/s unidirectional or
2.4 MByte/s bidirectional. At 4 links/transputer this gives 9.6 Mbytes/s,
close to what most memory interfaces will allow. Of course, very short
messages will limit your transfer rate; however, at 128 bytes/message you
see about 80% of the maximum rate. There is no system call involved - the
compiler just generates the necessary instruction.

Message forwarding isn't so difficult either. I've read of a system
requiring less than 10 us overhead per through-route (this probably is for
the case where the destination link is available). No interrupt handling is
involved here - that part is all handled in hardware.

Next generation (i.e., promised for start of 1991) will have 100 Mbit/s per
link and the possibility of hardware routing (a la wormhole). The CPU will
be faster by a factor of 4 or so, with a memory bandwidth to match.
--
Jan Vorbrueggen
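The transputer figures quoted above are internally consistent; a quick check, taking the numbers at face value from the post:

```python
# Check the arithmetic in the transputer figures quoted above.
cycles, clock_hz = 19, 30e6
overhead_ns = cycles / clock_hz * 1e9
print(round(overhead_ns))          # ~633 ns, quoted (rounded) as "630 ns"

links, per_link_mb = 4, 2.4        # MByte/s bidirectional per link
print(links * per_link_mb)         # 9.6 Mbytes/s aggregate, as quoted
```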
brooks@vette.llnl.gov (Eugene Brooks) (10/24/89)
In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes:
>Eugene, ever seen a transputer? Overhead for receiving or sending a

Yes I have; a group here is trying to use them in an image processing
project.

>Message forwarding isn't so difficult either. I've read of a system
>requiring less than 10 us overhead per through-route (this probably
>is for the destination link being available). No interrupt handling
>involved here - that part is all handled in hardware.

This level of overhead for each hop is completely intolerable.

brooks@maddog.llnl.gov, brooks@maddog.uucp
peter@ficc.uu.net (Peter da Silva) (10/24/89)
In article <1989Oct23.152120.25967@cs.rochester.edu> crowl@snipe.cs.rochester.edu (Lawrence Crowl) writes:
> SHARED MEMORY HAS A COST. Implementing shared memory over a scalable
> interconnect may require a larger aggregate bandwidth than that of distributed
> memory systems. I don't think there has been enough research here to know
> the real tradeoff, but such a result would not surprise me.

Also, the job of programming a shared memory system is a lot harder than
programming a system with messages as the communication medium. The number
of successful message-based operating systems demonstrates this. Of course
you can implement messages in shared memory (a trivial proof is the
aforementioned operating systems), and gain a performance improvement. The
question is whether this offsets the extra bandwidth.

Ah well, I'd much rather have both.
--
Peter da Silva, *NIX support guy @ Ferranti International Controls Corporation.
Biz: peter@ficc.uu.net, +1 713 274 5180. Fun: peter@sugar.hackercorp.com.
`-_-' "That particular mistake will not be repeated. There are plenty of
 'U`   mistakes left that have not yet been used." -- Andy Tanenbaum (ast@cs.vu.nl)
slackey@bbn.com (Stan Lackey) (10/24/89)
In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes:
>In article <36597@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>
>>Given equivalent performance interconnect, which rarely occurs because the
>>message passing machines tend to get short changed on the comm. hardware,
>>I have found the "shared memory" systems to have much better communication
>>performance ...
>
>Eugene, ever seen a transputer? Overhead for receiving or sending a
>message is 19 cycles (630 ns for a 30 MHz part). The actual transfer
>is done by a dedicated DMA machine at a maximum rate of 1.7 Mbyte/s
>unidirectional or 2.4 MByte/s bidirectional.

1) Although it has high peak bandwidth, the latency is still there. The
interconnect system waits for the processor to do an access, then the
processor waits for the interconnect to get the data. Many microseconds go
by between the time the CPU needs dependent data and the time it is usable.

2) This thread has been leaning toward 'many killer micros in parallel'
being the supercomputers of the future. I agree, but I think it is still
far away; commodity micros just don't include what they need to support
massively parallel systems, in the sense of providing ease-of-use in a
general purpose way. And they never will, if the #1 priority is to run
whetstone and dhrystone fast. Providing these capabilities requires
assumptions all the way from application development, through the compiler,
the OS, the chip interface, and on down to the bare silicon. It is Real
Tough for the guys who design commodity chips to anticipate what the vast
range of users are going to want, and they're just going to end up not
pleasing everybody.

Please recall that even the hypercubes (well, Inmos, NCUBE and Thinking
Machines anyway) do not use commodity microprocessors, but proprietary ones
that do the extras in hardware that they need.

-Stan
Disclaimer: not necessarily the views of my organization.
a186@mindlink.UUCP (Harvey Taylor) (10/24/89)
In Msg-ID: <47279@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
|In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes:
|>In article <36597@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
|>
|>> Given equivalent performance interconnect, which rarely occurs
|>> because the message passing machines tend to get short changed on
|>> the comm. hardware, I have found the "shared memory" systems to
|>> have much better communication performance ...
|
|>Eugene, ever seen a transputer? Overhead for receiving or sending a
|>message is 19 cycles (630 ns for a 30 MHz part). The actual transfer
|>is done by a dedicated DMA machine at a maximum rate of 1.7 Mbyte/s
|>unidirectional or 2.4 MByte/s bidirectional.
|
|1) Although it has high peak bandwidth, the latency is still there.
|The interconnect system waits for the processor to do an access, then
|the processor waits for the interconnect to get the data. Many
|microseconds go by between the time the CPU needs dependent data and
|it is usable.

This is getting close to something I have wondered about. How applicable
is Amdahl's Law to the transputer? How might it be affected by the network
topology?

<-Harvey
"Insanity: A perfectly rational adjustment to an insane world." -RD Laing
Harvey Taylor, Meta Media Productions
uunet!van-bc!rsoft!mindlink!Harvey_Taylor a186@mindlink.UUCP
hsu@uicsrd.csrd.uiuc.edu (William Tsun-Yuk Hsu) (10/25/89)
In article <6655@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
>
>Also, the job of programming a shared memory system is a lot harder than
>programming a system with messages as the communication medium.

Umm, have you tried? When I took a parallel programming class a couple of
years ago, everybody griped about the problems of coding Gaussian
elimination for the hypercube. Hardly anybody griped about implementing it
for the Sequent Balance. (This single example of course doesn't prove
anything, but I would like to see some examples of problems that are
clearly easier to code for a message-passing system.)

Bill
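For what it's worth, the shared-memory version really is short: every worker reads the pivot row in place and updates its own interleaved slice of the rows below it, with nothing but a barrier between pivot steps. This is my own minimal sketch (threads standing in for processors, no pivoting), not code from the class mentioned above:

```python
# Shared-memory forward elimination: workers update disjoint rows in
# place; a barrier separates pivot steps. No explicit data movement.
import threading

def parallel_eliminate(a, n_workers=2):
    """In-place forward elimination of a square matrix (list of lists)."""
    n = len(a)
    barrier = threading.Barrier(n_workers)

    def worker(wid):
        for k in range(n - 1):                     # pivot column k
            # each worker owns an interleaved slice of the rows below k
            for i in range(k + 1 + wid, n, n_workers):
                factor = a[i][k] / a[k][k]
                for j in range(k, n):
                    a[i][j] -= factor * a[k][j]
            barrier.wait()   # all of column k cleared before next pivot

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a
```

The message-passing version has to broadcast the pivot row and scatter/gather the matrix explicitly, which is exactly the extra bookkeeping the class presumably griped about.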
priol@irisa.irisa.fr (Thierry Priol,TB131,Equipe Hypercubes,9936200-547,) (10/25/89)
A software user's point of view: sometimes it is interesting to simulate
shared memory on a distributed memory parallel computer. We have found that
a ray-tracing algorithm is more efficient on a DMPC when it is implemented
with a shared memory programming model. This is because it is difficult to
partition the database a priori (for example, with a geometric partition).
A software shared memory service performs a dynamic data distribution
(when caches are used) during the computation. A paper describing our
experiments will be available soon. Maybe other algorithms are in the same
situation? I hope that such a software shared memory service will be
available on DMPCs in future operating systems.

Thierry PRIOL

PS: I apologize for my poor English.
---------------------------------------------------------------------
Thierry PRIOL Phone: 99 36 20 00
IRISA / INRIA U.R. Rennes Fax: 99 38 38 32
roger@wraxall.inmos.co.uk (Roger Shepherd) (10/25/89)
In article <47279@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes:
>
>>Eugene, ever seen a transputer? Overhead for receiving or sending a
>>message is 19 cycles (630 ns for a 30 MHz part). The actual transfer
>>is done by a dedicated DMA machine at a maximum rate of 1.7 Mbyte/s
>>unidirectional or 2.4 MByte/s bidirectional.
>
>1) Although it has high peak bandwidth, the latency is still there.
>The interconnect system waits for the processor to do an access, then
>the processor waits for the interconnect to get the data. Many
>microseconds go by between the time the CPU needs dependent data and
>it is usable.

This is true, BUT the time between a request being made and the data
returning is NOT WASTED. It can be used to execute other processes. This
use of `excess' parallelism is important precisely because it can hide
latency. This is one reason why the transputer incorporates not only high
performance communication links but also a hardware scheduler. The use of
excess parallelism to hide latency is not new; it was used in the HEP.

>Please recall that even the hypercubes (well inmos, ncube and thinking
>machines anyway) do not use commodity microprocessors, but proprietary
>ones that do the extras in hardware that they need

I'm not sure what you mean by commodity microprocessors. The Inmos
transputers ARE commodity microprocessors; they are freely available to
anyone who wants to buy them. The communication system which is provided
with every transputer is there because it is generically useful in
multiprocessor systems (and you might be surprised just how many electronic
systems are multiprocessors - my PC has at least 3 microprocessors in it,
before I plug in my transputer card)!

Roger Shepherd, INMOS Ltd          JANET: roger@uk.co.inmos
1000 Aztec West                    UUCP: ukc!inmos!roger or uunet!inmos-c!roger
Almondsbury                        INTERNET: roger@inmos.com
+44 454 616616                     ROW: roger@inmos.com OR roger@inmos.co.uk
daver@dg.dg.com (Dave Rudolph) (10/25/89)
In article <36662@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes:
>>Message forwarding isn't so difficult either. I've read of a system
>>requiring less than 10 us overhead per through-route (this probably
>>is for the destination link being available). No interrupt handling
>>involved here - that part is all handled in hardware.
>This level of overhead for each hop is completely intolerable.

And what is the overhead for each interconnect level in a "shared memory"
machine such as the Ultracomputer? Keep in mind that in such a machine,
every non-local memory access must go through each level and back while the
processor waits.
slackey@bbn.com (Stan Lackey) (10/25/89)
In article <2658@ganymede.inmos.co.uk> roger@inmos.co.uk (Roger Shepherd) writes:
>In article <47279@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>>[in a hypercube]
>>the interconnect system waits for the processor to do an access, then
>>the processor waits for the interconnect to get the data. Many
>>microseconds go by between the time the CPU needs dependent data and
>>it is usable.
>This is true BUT the time between a request being made and the data
>returning is NOT WASTED. It can be used to execute other processes.

The context of my posting was getting high speedups on a single application
on a massively parallel computer in an easy-to-use, general purpose way. My
statements concerned the use of commodity micros without special
architectural mechanisms intended to support massively parallel processing.
By 'commodity' I meant true general purpose micros produced in massive
volumes and from multiple sources.

>The communication system which is
>provided with every transputer is there because it is generically
>useful in multiprocessor system (and you might be surprised just how
>many electronic systems are multiprocessor - my PC has at least 3
>microprocessors in it (before I plug in my transputer card))!

If the three micros in your PC don't provide a speedup for Lotus by running
the Lotus code in parallel, they are outside the scope of this discussion.
Which should probably be in comp.parallel anyway. I'm not saying that the
communication mechanism in a transputer is 'bad' or unnecessary; I
mentioned it as a supportive example of architectural extensions at the
processor level for making massively parallel systems more effective.

-Stan
stan@Solbourne.COM (Stan Hanks) (10/25/89)
In article <6655@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
>Also, the job of programming a shared memory system is a lot harder than
>programming a system with messages as the communication medium. The number
>of successful message-based operating systems demonstrates this.

Not necessarily so. It just demonstrates that there is more hardware
available with which to build message-passing operating systems. It's
bunches easier to build the hardware needed for a message passing system --
all you need is CPUs with memory and some messaging media connection,
right? You can use PCs with serial lines in the trivial case, on up to some
sort of massively parallel system in the more complex. You can't even begin
to count the number of ways you can construct such a system.

It is, however, substantially more complicated to construct a shared memory
system. If you don't believe me, count the number available (relative to
the number of systems which have the hardware necessary to build a
messaging system). Or try building one yourself sometime.... 8{)

Regards,

-- Stanley P. Hanks, Science Advisor, Solbourne Computer, Inc.
Phone: Corporate: (303) 772-3400   Houston: (713) 964-6705
E-mail: ...!{boulder,sun,uunet}!stan!stan   stan@solbourne.com
david@cs.washington.edu (David Callahan) (10/26/89)
In article <224@dg.dg.com> uunet!dg!daver (David Rudolph) writes:
>In article <36662@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>>In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes:
>>>Message forwarding isn't so difficult either. I've read of a system
>>>requiring less than 10 us overhead per through-route
>>This level of overhead for each hop is completely intolerable.
>And what is the overhead for each interconnect level in a "shared
>memory" machine such as the ultracomputer? Keep in mind that in such a
>machine, every non-local memory access must go through each level and
>back while the processor waits.

How does two clock ticks per node sound? One for routing and one on the
wires. If your processor is slow enough it might be only one :-)

Your second statement has two false assumptions in it. First, there is no
need to build a "dancehall" machine: embed the processors in a 3D mesh so
that each processor is closer to some memory than the rest. Say 256
processors in a 16x16x16 cube gives an average distance of 12 hops to a
memory module. Note that this is half again log(256)=8, but there are
packing issues, message flux issues, and redundant path issues that make
the comparison difficult.

The second false assumption is that the processor waits. Sure, with memory
50 cycles away it's easy to build a machine that waits, but it is also
possible to build a machine that multiplexes 50 independent instruction
streams on one ALU, so that with sufficient parallelism the memory latency
is a non-issue, as in the HEP. Of course, you need 50 streams.

-- David Callahan (david@tera.com, david@june.cs.washington.edu, david@rice.edu)
Tera Computer Co. 400 North 34th Street Seattle WA, 98103
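Callahan's last point fits in a one-line model (my own sketch, not from the post): if each of N streams computes for C cycles and then waits L cycles on memory, the shared ALU is busy for at most min(1, N*C/(C+L)) of the time.

```python
def alu_utilization(n_streams, compute_cycles, latency_cycles):
    """Fraction of cycles the shared ALU is busy, assuming perfect
    interleaving of the streams and no stalls other than memory."""
    return min(1.0, n_streams * compute_cycles /
                    (compute_cycles + latency_cycles))

# Memory 50 cycles away, one compute cycle per memory reference:
print(alu_utilization(1, 1, 50))    # a machine that waits: ~2% busy
print(alu_utilization(51, 1, 50))   # enough streams saturate the ALU
```

With C=1 and L=50 you need on the order of 50 streams for full utilization, which is exactly the HEP-style tradeoff the post describes.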
khoult@bbn.com (Kent Hoult) (10/26/89)
In article <9600@june.cs.washington.edu> david@june.cs.washington.edu (David Callahan) writes:
>In article <224@dg.dg.com> uunet!dg!daver (David Rudolph) writes:
>>In article <36662@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>>>In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes:
>>>>Message forwarding isn't so difficult either. I've read of a system
>>>>requiring less than 10 us overhead per through-route
>>>This level of overhead for each hop is completely intolerable.
>
>>And what is the overhead for each interconnect level in a "shared
>>memory" machine such as the ultracomputer? Keep in mind that in such a
>>machine, every non-local memory access must go through each level and
>>back while the processor waits.

In a machine like the BBN TC-2000 with a butterfly-type switch, all remote
memory is equidistant from all nodes. Each switch stage takes 25 ns in each
direction, and there are at most 3 stages (in a 504 node machine). The
normal time for a 32-bit read or write is around 1.5 us to any other node.
The time for local references tends to be more in the 200 ns area.

Kent Hoult TEL: (617) 873-4385 ARPA: khoult@bbn.com
brooks@maddog.llnl.gov (Eugene Brooks) (10/26/89)
In article <224@dg.dg.com> uunet!dg!daver (David Rudolph) writes:
>And what is the overhead for each interconnect level in a "shared
>memory" machine such as the ultracomputer? Keep in mind that in such a
>machine, every non-local memory access must go through each level and
>back while the processor waits.

A good figure is a two clock pipeline delay per stage. A network using 8x8
switch nodes will have 4 stages for a 4096 processor system (why think
small). The round trip is 16 clocks; at a 40 MHz clock this is a round trip
transit time of 0.4 microseconds. You would add to this, of course, the
memory chip cycle time, the pipeline delays associated with getting on and
off each end of the network, and the number of clocks required to wormhole
the message through the available wires. The memory chip cycle time and the
"fixed delays" associated with getting on and off the network amount to
more than the pipeline delay through the network. A one microsecond total
delay for a remote memory reference (cache miss) is not out of the
question; a two microsecond delay is trivial to achieve.

The CPU does not have to wait for each remote reference to complete before
starting another - we certainly did not stall the processor on every shared
memory reference in the Cerberus multiprocessor simulator - but current
KILLER MICROS have a very limited capability for multiple outstanding
memory requests. Hopefully we can keep them alive with cache hits for now,
and they will get better about multiple outstanding requests in the future.

brooks@maddog.llnl.gov, brooks@maddog.uucp
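The arithmetic in the first paragraph is easy to check - a sketch of mine using only the figures in the post (8x8 switches, 2 clocks per stage, 40 MHz clock):

```python
# Round-trip pipeline delay through a multistage interconnect built
# from radix-wide switch nodes: enough stages that radix**stages
# covers every processor, traversed out and back.

def round_trip_clocks(n_procs, radix, clocks_per_stage=2):
    stages, ports = 0, 1
    while ports < n_procs:       # stages needed: radix**stages >= n_procs
        ports *= radix
        stages += 1
    return 2 * stages * clocks_per_stage

clocks = round_trip_clocks(4096, 8)   # 4 stages of 8x8 switches
print(clocks)                         # 16 clocks round trip
print(clocks / 40.0)                  # 0.4 us at a 40 MHz clock
```

This is only the switch pipeline; as the post says, memory cycle time and the fixed on/off-network delays dominate, pushing a full remote reference toward the one-to-two microsecond range.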