mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (10/15/89)
In article <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov writes:
>mash@mips.com pointed out some important considerations in the issue
>of whether supercomputers as we know them will survive.  I thought
>that I would attempt to get a discussion started.  Here is a simple
>fact for the mill, related to the question of whether or not machines
>delivering the fastest performance at any price have room in the
>market.
>Fact number 1:
>The best of the microprocessors now EXCEED supercomputers for scalar
>performance and the performance of microprocessors is not yet stagnant.
>On scalar codes, commodity microprocessors ARE the fastest machines at
>any price and custom cpu architectures are doomed in this market.
>brooks@maddog.llnl.gov, brooks@maddog.uucp

This much has been fairly obvious for a few years now, and was made especially clear by the introduction of the MIPS R-3000 based machines at about the beginning of 1989.

I think that this point is irrelevant to the more appropriate purpose of supercomputers, which is to run long (or large), compute-intensive problems that happen to map well onto available architectures. Both factors (memory/time and efficiency) are important here. It is generally not necessary to run short jobs on supercomputers, and it is not cost-effective to run scalar jobs on vector machines.

On the other hand, I have several codes that run >100 times faster on the ETA-10G relative to a 25 MHz MIPS R-3000. Since I need to run these codes for hundreds of ETA-10G hours, the equivalent time on the workstation is over one year.

The introduction of vector workstations (Ardent & Stellar) changes these ratios substantially. The ETA-10G runs my codes only 20 times faster than the new Ardent Titan. In this environment, the important question is, "Can I get an average of more than 1.2 hours of supercomputer time per day?" If not, then the Ardent provides better average wall-clock turnaround.

It seems to me that the introduction of fast scalar and vector workstations can greatly enhance the _important_ function of supercomputers --- which is to allow the calculation of problems that are otherwise too big to handle. By removing scalar jobs and vector jobs of short duration from the machine, more resources can be allocated to the large calculations that cannot proceed elsewhere.

Enough mumbling....
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
                   mccalpin@scri1.scri.fsu.edu
                   mccalpin@delocn.udel.edu
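The 1.2-hours-per-day figure above is just the break-even arithmetic for a shared fast machine versus a dedicated slower one. A minimal sketch in C, taking the 20x ETA-10G vs. Ardent Titan ratio from the post and treating everything else as an assumption:

    /* Break-even turnaround: a shared supercomputer that is `speedup`
     * times faster than a dedicated workstation only wins on wall-clock
     * time if you can average more than 24/speedup hours of it per day.
     * The 20x ratio is the ETA-10G vs. Ardent Titan figure quoted above.
     */
    #include <stdio.h>

    int main(void)
    {
        double speedup   = 20.0;            /* supercomputer : workstation */
        double breakeven = 24.0 / speedup;  /* hours of super time per day */

        printf("break-even allocation: %.1f supercomputer hours/day\n",
               breakeven);                  /* prints 1.2 for speedup = 20 */
        return 0;
    }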
brooks@maddog.llnl.gov (10/15/89)
mash@mips.com pointed out some important considerations in the issue of whether supercomputers as we know them will survive. I thought that I would attempt to get a discussion started. Here is a simple fact for the mill, related to the question of whether or not machines delivering the fastest performance at any price have room in the market.

Fact number 1: The best of the microprocessors now EXCEED supercomputers for scalar performance and the performance of microprocessors is not yet stagnant. On scalar codes, commodity microprocessors ARE the fastest machines at any price and custom cpu architectures are doomed in this market.

brooks@maddog.llnl.gov, brooks@maddog.uucp
colwell@mfci.UUCP (Robert Colwell) (10/15/89)
In article <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov () writes:
>Fact number 1:
>The best of the microprocessors now EXCEED supercomputers for scalar
>performance and the performance of microprocessors is not yet stagnant.
>On scalar codes, commodity microprocessors ARE the fastest machines at
>any price and custom cpu architectures are doomed in this market.

I take my hat off to them, too, because that's no mean feat. But don't forget that the supercomputers didn't set out to be the fastest machines on scalar code. If they had, they'd all have data caches, non-interleaved main memory, and no vector facilities. What the supercomputer designers are trying to do is balance their machines to optimally execute a certain set of programs, not the least of which are the LLL loops. In practice this means that said machines have to do very well on vectorizable code, while not falling down badly on the scalar stuff (lest Amdahl's law come to call).

So while it's ok to chortle at how the micros have caught up on the scalar stuff, I think it would be an unwarranted extrapolation to imply that the supers have been superseded unless you also specify the workload.

And by the way, it's the design constraints at the heavy-duty, high-parallelism, all-functional-units-going-full-tilt-using-the-entire-memory-bandwidth end that make the price of the supercomputers so high, not the constraints that predominate at the scalar end. That's why I conclude that when the micro/workstation guys want to play in the supercomputer sandbox they'll either have to bring their piggy banks to buy the appropriate I/O and memory, or convince the users that they can live without all that performance.

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
31 Business Park Dr.
Branford, CT 06405     203-488-6090
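Colwell's Amdahl's-law point can be put in numbers. A minimal sketch in C, with made-up vectorized fractions and a made-up vector-unit speedup, showing how the scalar residue caps the overall gain:

    /* Amdahl's law: overall speedup when a fraction f of run time is
     * accelerated by a factor s.  The fractions and the 20x factor
     * below are illustrative assumptions, not figures from the thread.
     */
    #include <stdio.h>

    static double amdahl(double f, double s)
    {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void)
    {
        printf("90%% vectorized, 20x vector unit: %.2fx overall\n",
               amdahl(0.90, 20.0));      /* about 6.9x  */
        printf("99%% vectorized, 20x vector unit: %.2fx overall\n",
               amdahl(0.99, 20.0));      /* about 16.8x */
        return 0;
    }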
preston@titan.rice.edu (Preston Briggs) (10/15/89)
In article <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov () writes:
>The best of the microprocessors now EXCEED supercomputers for scalar
>performance and the performance of microprocessors is not yet stagnant.

Is this a fair statement? I've played some with the i860 and I can write (by hand so far) code that is pretty fast. However, the programs where it really zooms are vectorizable. That is, I can make this micro solve certain problems well; but these are the same problems that vector machines handle well.

Getting good FP performance from a micro seems to require pipelining. Keeping the pipe(s) full seems to require a certain amount of parallelism and regularity. Vectorizable loops work wonderfully well.

Perhaps I've misunderstood your intent, though. Perhaps you meant that an i860 (or Mips or whatever) can outrun a Cray (or Nec or whatever) on some programs. I guess I'm still doubtful. Do you have examples you can tell us about?

Thanks,
Preston Briggs
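For concreteness, the sort of loop Briggs is describing -- regular, independent iterations that keep a pipelined floating-point unit (or a vector unit) busy -- looks like this. An illustrative DAXPY kernel in C, not code from the thread:

    /* DAXPY: y = y + a*x.  Each iteration is independent, so loads,
     * multiplies and adds can be overlapped in the FP pipeline, or
     * issued as a single vector operation on a vector machine.
     */
    void daxpy(int n, double a, const double *x, double *y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = y[i] + a * x[i];
    }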
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (10/15/89)
Gordon Bell, in the September CACM (p. 1095) says, "By the end of 1989, the performance of the RISC, one-chip microprocessor should surpass and remain ahead of any available minicomputer or mainframe for nearly every significant benchmark and computational workload. By using ECL gate arrays, it is relatively easy to build processors that operate at 200 MHz (5 ns. clock) by 1990." (For those who don't know, Mr. Bell has his name on the PDP-11, the VAX, and the Ardent workstation.)

The big iron is fighting back, and that involves reducing their chip count. Once, a big cpu took ~10^4 chips: now it's more like 10^2. I expect it will shortly be ~10 chips. Shorter paths, you know.

I see the hot micros and the big iron meeting in the middle. What will distinguish their processors? Mainly, there will be cheap systems. And then, there will be expensive ones, with liquid cooling, superdense packaging, mongo buses, bad yield, all that stuff. Even when no multichip processors remain, there will still be $1K systems and $10M systems. Of course, there is no chance that the $10M system will be uniprocessor.

--
Don   D.C.Lindsay   Carnegie Mellon Computer Science
brooks@vette.llnl.gov (Eugene Brooks) (10/16/89)
In article <1081@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>So while it's ok to chortle at how the micros have caught up on the scalar
>stuff, I think it would be an unwarranted extrapolation to imply that the
>supers have been superseded unless you also specify the workload.

Microprocessor development is not ignoring vectorizable workloads. The latest have fully pipelined floating point and are capable of pipelining several memory accesses. As I noted, interleaving directly on the memory chip is trivial and memory chip makers will do it soon. Micros now dominate the performance game for scalar code and are moving on to vectorizable code. After all, these little critters mutate and become more voracious every 6 months and vectorizable code is the only thing left for them to conquer. No NEW technology needs to be developed; all the micro-chip and memory-chip makers need to do is to decide to take over the supercomputer market. They will do this with their commodity parts.

Supercomputers of the future will be scalable multiprocessors made of many hundreds to thousands of commodity microprocessors. They will be commodity parts because these parts will be the fastest around and they will be cheap. These scalable machines will have hundreds of commodity disk drives ganged up for parallel access. Commodity parts will again be used because of the cost advantage leveraged into a scalable system using commodity parts. The only custom logic will be the interconnect which glues the system together, and error correcting logic which glues many disk drives together into a reliable high performance system. The CM data vault is a very good model here.

NOTHING WILL WITHSTAND THE ATTACK OF THE KILLER MICROS!

brooks@maddog.llnl.gov, brooks@maddog.uucp
brooks@vette.llnl.gov (Eugene Brooks) (10/16/89)
In article <2121@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov () writes:
>>The best of the microprocessors now EXCEED supercomputers for scalar
>>performance and the performance of microprocessors is not yet stagnant.
>
>Is this a fair statement?  I've played some with the i860 and

Yes, in the sense that a scalar dominated program has been compiled for the i860 with a "green" compiler, no pun intended, and the same program was compiled with a mature optimizing compiler on the XMP, and the 40 MHz i860 is faster for this code. Better compilers for the i860 will open up the speed gap relative to the supercomputers.

>I can write (by hand so far) code that is pretty fast.
>However, the programs where it really zooms are vectorizable.

Yes, this micro beats the super on scalar code, and is not too sloppy for hand written code which exploits its cache and pipes well. The compilers are not there yet for the vectorizable stuff on the i860. Even if there were good compilers, the scalar-vector speed differential is not as great on the i860 as it is on a supercomputer. Of course, interleaved memory chips will arrive and microprocessors will use them. Eventually the high performance micros will take the speed prize for vectorizable code as well, but this will require another few years of development.

brooks@maddog.llnl.gov, brooks@maddog.uucp
brooks@vette.llnl.gov (Eugene Brooks) (10/16/89)
In article <6523@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>Gordon Bell, in the September CACM (p.1095) says, "By the end of
>1989, the performance of the RISC, one-chip microprocessor should
>surpass and remain ahead of any available minicomputer or mainframe
>for nearly every significant benchmark and computational workload.

It has already happened for SOME workloads, those which hit cache well and are scalar dominated. This was done without ECL parts. The ECL parts will only make matters worse for custom processors; as Bell indicates, they will let the one-chip micros dominate performance for all workloads.

>I see the hot micros and the big iron meeting in the middle. What
>will distinguish their processors?

Nothing.

>Mainly, there will be cheap
>systems. And then, there will be expensive ones, with liquid cooling,
>superdense packaging, mongo buses, bad yield, all that stuff. Even
>when no multichip processors remain, there will still be $1K systems
>and $10M systems. Of course, there is no chance that the $10M system
>will be uniprocessor.

The $10M systems will be scalable systems built out of the same microprocessor. These systems will probably be based on coherent caches, the micros having respectable on chip caches which stay in sync with very large off chip caches. The off chip caches are kept coherent through scalable networks. The "custom" value added part of the machine for the supercomputer vendor to design is the interconnect and the I-O system. The supercomputer vendor will still have a cooling problem on his hands because of the density of heat sources in such a machine.

brooks@maddog.llnl.gov, brooks@maddog.uucp
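As a rough illustration of the coherent-cache bookkeeping Brooks describes, here is a bare-bones three-state (MSI-style) cache-line transition in C. It is a sketch of the general idea only, not any of the protocols actually under study at the time:

    /* Minimal per-line coherence state machine (Invalid/Shared/Modified).
     * Real scalable designs are directory-based and far more elaborate;
     * this only shows the kind of state a coherent cache must track.
     */
    enum line_state { INVALID, SHARED, MODIFIED };

    struct cache_line {
        unsigned long   tag;
        enum line_state state;
    };

    /* local processor read: miss if invalid, otherwise a hit */
    int read_line(struct cache_line *l)
    {
        if (l->state == INVALID) {
            /* fetch from memory or a remote cache, join the sharers */
            l->state = SHARED;
            return 0;               /* miss */
        }
        return 1;                   /* hit  */
    }

    /* local processor write: must gain exclusive ownership first */
    void write_line(struct cache_line *l)
    {
        if (l->state != MODIFIED) {
            /* invalidate remote copies over the interconnect, then own it */
            l->state = MODIFIED;
        }
    }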
mike@thor.acc.stolaf.edu (Mike Haertel) (10/16/89)
In article <1081@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>I take my hat off to them, too, because that's no mean feat.  But don't
>forget that the supercomputers didn't set out to be the fastest machines
>on scalar code.  If they had, they'd all have data caches, non-interleaved
>main memory, and no vector facilities.  What the supercomputer designers

Excuse me, non-interleaved main memory? I've always assumed that interleaved memory could help scalar code too. After all, instruction fetch tends to take place from successive addresses. Of course if main memory is very fast there is no point to interleaving it, but if all you've got is drams with slow cycle times, I would expect that interleaving them would benefit even straight scalar code.

--
Mike Haertel <mike@stolaf.edu>
``There's nothing remarkable about it.  All one has to do is hit the
right keys at the right time and the instrument plays itself.''
	-- J. S. Bach
eric@snark.uu.net (Eric S. Raymond) (10/16/89)
In <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov wrote:
> The best of the microprocessors now EXCEED supercomputers for scalar
> performance and the performance of microprocessors is not yet stagnant.
> On scalar codes, commodity microprocessors ARE the fastest machines at
> any price and custom cpu architectures are doomed in this market.

Yes. And though this is a recent development, an unprejudiced observer could have seen it coming for several years. I did, and had the temerity to say so in print way back in 1986. My reasoning then is still relevant; *speed goes where the volume market is*, because that's where the incentive and development money to get the last mw-sec out of available fabrication technology is concentrated.

Notice that nobody talks about GaAs technology for general-purpose processors any more? Or dedicated Lisp machines? Both of these got overhauled by silicon microprocessors because commodity chipmakers could amortize their development costs over such a huge base that it became economical to push silicon to densities nobody thought it could attain.

You heard it here first: The supercomputer crowd is going to get its lunch eaten the same way. They're going to keep sinking R&D funds into architectural fads, exotic materials, and the quest for ever more ethereal heights of floating point performance. They'll have a lot of fun and generate a bunch of sexy research papers. Then one morning they're going to wake up and discover that the commodity silicon guys, creeping in their petty pace from day to day, have somehow managed to get better real-world performance out of their little boxes. And supercomputers won't have a separate niche market anymore.

And the supercomputer companies will go the way of LMI, taking a bunch of unhappy investors with them. La di da. Trust me. I've seen it happen before...

--
Eric S. Raymond = eric@snark.uu.net  (mad mastermind of TMN-Netnews)
seibel@cgl.ucsf.edu (George Seibel) (10/16/89)
In <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov wrote:
> The best of the microprocessors now EXCEED supercomputers for scalar
> performance and the performance of microprocessors is not yet stagnant.
> On scalar codes, commodity microprocessors ARE the fastest machines at
> any price and custom cpu architectures are doomed in this market.

Speaking of "commodities", I think a lot of people have lost sight of, or perhaps never recognized, something about the vast majority of supercomputers. They are shared. How often do you get a Cray processor all to yourself? Not very often, unless you have lots of money, or Uncle Sam is picking up the tab so you can design atomic bombs faster. As soon as you have more than one job per processor, you're talking about *commodity Mflops*. The issue is no longer performance at any cost, because if it was you would order another machine at that point. The important thing is Mflops/dollar for most people, and that's where the micros are going to win in a lot of cases.

George Seibel, UCSF
rpeglar@csinc.UUCP (Rob Peglar x615) (10/16/89)
In article <35825@lll-winken.LLNL.GOV>, brooks@maddog.llnl.gov writes:
> mash@mips.com pointed out some important considerations in the issue of whether
> supercomputers as we know them will survive.  I thought that I would attempt
> to get a discussion started.  Here is a simple fact for the mill, related to
> the question of whether or not machines delivering the fastest performance
> at any price have room in the market.
>
> Fact number 1:
> The best of the microprocessors now EXCEED supercomputers for scalar
> performance and the performance of microprocessors is not yet stagnant.
> On scalar codes, commodity microprocessors ARE the fastest machines at
> any price and custom cpu architectures are doomed in this market.
>
> brooks@maddog.llnl.gov, brooks@maddog.uucp

Brooks is making a good point here. By "this market", I assume he means the one defined above (as well as by mash) - to paraphrase, "the fastest box at any price". I'll let go what "fastest" and "box" mean for sake of easy discussion :-) Most of us, I hope, can fathom what price is.

Anyway, I agree with mash that there is - albeit small - a market for the machine with the highest peak absolute performance (pick your number, the most popular one recently seems to be Linpack 100x100 all Fortran, Dongarra's Table One). The national labs have proven that point for almost a generation. I believe that it will take at least one more generation - those who were weaned on machines from CDC, then CRI - before a more reasonable approach to machine procurement comes to pass. Thus, I disagree that there will *always* be a market for this sort of thing. Status symbols may be OK in cars, but for machines purchased with taxpayer dollars, the end is near. Hence, Brooks' "attack of the killer micros".

However, I do believe that there will always be a market for various types of processors and processor architectures. Killer scalar micros are finding wide favor as above. Vector supers and their offspring, e.g. the i860 and other 64-bit things, will always dominate codes which can be easily vectorized and do not lend themselves well to parallel computation. Medium-scale OTS-technology machines like Sequent will start (are starting) to dominate OLTP and RDBMS work, perfect tasks for symmetric MP machines. (Pyramid, too; hi Chris). Massively parallel machines will eventually settle into production shops, perhaps running one and only one application, but running it at speeds that boggle the mind.

It's up to the manufacturers to decide 1) which game they want to play 2) for what stakes 3) with what competition 4) for how long 5) etc. etc. etc. That's what makes working for a manufacturer such fun and terror at once.

Rob
------
mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (10/17/89)
In article <35896@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>Microprocessor development is not ignoring vectorizable workloads.  The
>latest have fully pipelined floating point and are capable of pipelining
>several memory accesses.  As I noted, interleaving directly on the memory
>chip is trivial and memory chip makers will do it soon.  [ ... more
>stuff deleted ... ]
>They will do this with their commodity parts.

It is not at all clear to me that the memory bandwidth required for running vector codes is going to be developed in commodity parts. To be specific, a single 64-bit vector pipe requires a sustained bandwidth of 24 bytes per clock cycle. Is an ordinary, garden-variety commodity microprocessor going to be able to use 6 32-bit words-per-cycle of memory bandwidth on non-vectorized code? If not, then there is a strong financial incentive not to include that excess bandwidth in commodity products....

In addition, the engineering/cost trade-off between memory bandwidth and memory latency will continue to exist for the "KILLER MICROS" as it does for the current generation of supercomputers. Some users will be willing to sacrifice latency for bandwidth, and others will be willing to do the opposite. Economies of scale will not eliminate this trade-off, except perhaps by eliminating the companies that take the less profitable position (e.g. ETA).

>Supercomputers of the future will be scalable multiprocessors made of
>many hundreds to thousands of commodity microprocessors.  They will
>be commodity parts because these parts will be the fastest around and
>they will be cheap.

It seems to me that the experience in the industry is that general-purpose processors are not usually very effective in parallel-processing applications. There is certainly no guarantee that the uniprocessors which are successful in the market will be well-suited to the parallel supercomputer market -- which is not likely to be a big enough market segment to have any control over what processors are built....

The larger chip vendors are paying more attention to parallelism now, but it appears to be in the context of 2-4 processor parallelism. It is not likely to be possible to make these chips work together in configurations of 1000's with the application of "glue" chips....

This is not to mention the fact that software technology for these parallel supercomputers is depressingly immature. I think traditional moderately parallel machines (e.g. Cray Y/MP-8) will be able to handle existing scientific workloads better than 1000-processor parallel machines for quite some time....

--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
                   mccalpin@scri1.scri.fsu.edu
                   mccalpin@delocn.udel.edu
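The 24-bytes-per-clock figure is just the operand traffic of a 64-bit vector stream: two loads and one store of 8 bytes each per cycle. A small sketch of that arithmetic in C; the clock rate is an arbitrary assumption used only to turn it into a bytes-per-second number:

    /* Sustained memory traffic for a single 64-bit vector pipe:
     * 2 input operands + 1 result = 3 x 8 bytes every clock.
     * The 100 MHz clock is an illustrative assumption.
     */
    #include <stdio.h>

    int main(void)
    {
        double clock_mhz       = 100.0;
        double bytes_per_clock = 3.0 * 8.0;                /* 24 bytes */
        double mb_per_sec      = bytes_per_clock * clock_mhz;

        printf("sustained memory traffic: %.0f MB/s\n", mb_per_sec);
        return 0;                                          /* 2400 MB/s */
    }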
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/17/89)
There's more to supercomputing than scalar speed. One of the primary things you can do on a supercomputer is run large programs quickly. Virtual memory is nice, but some programs cause it to thrash. That's when it's nice to have a real 4GB machine. The same thing can be said about vector processing: some programs can be done using vector processors (or lots of parallel processors) faster than scalar.

I don't see the death of the supercomputer, but a redefinition of problems needing one. I have more memory on my home computer than all the computers at this site when I started working here (hell, the total was <2MB). Likewise CPU and even disk. The number of problems which I can't solve on my home system is a lot smaller than it was back then. However, that's the kicker: real problems are limited in size.

Someone said that the reason for micros catching up is that the development cost could be spread over the users. For just that reason the vector processors will stay expensive, because fewer users will need (ie. buy) them. There will always be a level of hardware needed to solve problems which are not shared by many users. While every problem has a scalar portion, many don't need vectors, or even floating point.

I think this goes for word size, too. When I see that the Intel 586 will have a 64 bit word I fail to generate any excitement. The main effect will be to break all the programs which assume that short==16 bits (I've ported to the Cray, this *is* a problem). If you tell me I can have 64 bit ints, excuse me if I don't feel the need to run right out and place an order. Even as memory gets cheaper I frequently need 1-2 million ints, and having them double in size is not going to help keep cost down.

I think that the scalar market will continue to be micros, but I don't agree with Eric that the demand for supercomputers will vanish, or that micros will catch them for the class of problems which are currently being run on supercomputers. The improving scalar performance will reduce the need for vector processing, and keep them from getting economies of scale. He may well be right that some of the companies will fall, since the micros will be able to solve a lot of the problems which are not massively vectorable or do not inherently require huge addressing space.

--
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called 'reason' in the face of the church and common sense. Any fool can see that the world is flat!" - anon
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/17/89)
>In article <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov () writes:
>

This article certainly generated some responses. Unfortunately, some responders seemed to miss (or chose to ignore :-) the tongue-in-cheek nature of the title.

I used to argue, only a couple of years ago, that supercomputers produced cheaper scalar computing cycles than "smaller" systems. That isn't true today. However, supercomputers still produce cheaper floating point results on vectorizable jobs. And, they produce memory bandwidth cheaper than other systems. That may change, too.

Q: What will it take to replace a Cray with a bunch of micros?

A: (IMHO): A "cheap" Multiport Interleaved Memory subsystem. In order to do that, you need to provide a way to build such subsystems out of a maximum of 3 different chips, and be able to scale the number of processors and interleaving up and down. A nice goal might be a 4-port/32-way-interleaved 64-bit-wide subsystem cheap enough for a $100 K system. (That is only enough memory bandwidth for a 1 CPU Cray-like system, or 4 micro based CPUs with only 1 word/cycle required, but it would sure be a big step forward.) The subsystem needs to provide single level local-like memory, like a Cray. [Or, show a way to make, in software, a truly distributed system as efficient as a local memory system (PhD thesis material... - I am betting on hardware solutions in the short run...)]. You also need to provide a reasonably reliable way for the memory to subsystem connections to be made. This is sort of hard hardware level engineering. For example, you probably can't afford the space for 32 VME buses... Does anyone have any suggestions on how the connections into and out of such memory subsystems could be made without a Cray-sized bundle of connectors?

On the topic of the original posting, what I have seen is that micro based workstations are eating away fast at the minicomputer market, just on the basis of price performance, leaving only workstation clusters, vector machines (Convex-sized to Cray-sized), and other big iron, such as very large central storage servers. So, I wouldn't write off big iron just yet, but obviously some companies will be selling a lot more workstations and a lot fewer minicomputers than they were planning.

Quiz: Why does Cray use *8* way interleaving per memory *port* on the Cray Y-MP?

Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035    Phone:  (415)694-6117
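The quiz has a back-of-the-envelope flavor: the interleave factor a port needs is roughly the ratio of the memory bank cycle time to the processor cycle time, rounded up to a power of two, so that one word can be delivered every processor cycle. A sketch with purely illustrative timing numbers (not Cray's or anyone else's actual figures):

    /* Banks needed per port to sustain one access per processor cycle:
     * enough banks that, cycled round-robin, each bank gets its full
     * cycle time back before it is asked for another word.
     */
    #include <stdio.h>

    int main(void)
    {
        double mem_cycle_ns = 120.0;   /* assumed DRAM bank cycle time   */
        double cpu_cycle_ns = 6.0;     /* assumed processor clock period */
        int    banks        = 1;

        while (banks * cpu_cycle_ns < mem_cycle_ns)
            banks *= 2;                /* round up to a power of two     */

        printf("banks per port needed: %d-way interleave\n", banks);
        return 0;                      /* 32-way for these sample times  */
    }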
colwell@mfci.UUCP (Robert Colwell) (10/17/89)
In article <7369@thor.acc.stolaf.edu> mike@thor.stolaf.edu () writes:
>In article <1081@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>>I take my hat off to them, too, because that's no mean feat.  But don't
>>forget that the supercomputers didn't set out to be the fastest machines
>>on scalar code.  If they had, they'd all have data caches, non-interleaved
>>main memory, and no vector facilities.  What the supercomputer designers
>
>Excuse me, non-interleaved main memory?  I've always assumed that
>interleaved memory could help scalar code too.  After all, instruction
>fetch tends to take place from successive addresses.  Of course if
>main memory is very fast there is no point to interleaving it, but
>if all you've got is drams with slow cycle times, I would expect
>that interleaving them would benefit even straight scalar code.

I meant that as a shorthand way of putting across the idea that the usual compromise is one of memory size, memory bandwidth, and memory latency. For the canonical scalar code you don't need a very large memory, and the bandwidth may not be as important to you as the latency (pointer chasing is an example).

The point I was making was that the supercomputers have incorporated design decisions, such as very large physical memory, and very high bandwidth to and from that memory, so that their multiple functional units can be kept usefully busy while executing 'parallel' code. Were you to set out to design a machine which didn't (or couldn't) use those multiple buses (pin limits on a single-chip micro for instance) then that bandwidth isn't worth as much to you and you might be better off with a flat, fast memory, which is what most workstations do (or used to do, anyway).

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
31 Business Park Dr.
Branford, CT 06405     203-488-6090
pan@propress.com (Philip A. Naecker) (10/17/89)
In article <35825@lll-winken.LLNL.GOV>, brooks@maddog.llnl.gov writes:
> Fact number 1:
> The best of the microprocessors now EXCEED supercomputers for scalar
> performance and the performance of microprocessors is not yet stagnant.
> On scalar codes, commodity microprocessors ARE the fastest machines at
> any price and custom cpu architectures are doomed in this market.

Alas, I believe you have been sucked into the MIPS=Performance fallacy. There is *not* a simple relationship between something as basic as scalar performance and something as complex as overall application (or even routine) performance.

Case in point: The R2000 chipset implemented on the R/120 (mentioned by others in this conversation) has, by all measures, *excellent* scalar performance. One would benchmark it at about 12-14 times a microVAX. However, in real-world, doing-useful-work, not-just-simply-benchmarking situations, one finds that actual performance (i.e., performance in very simple routines with very simple algorithms doing simple floating point operations) is about 1/2 that expected. Why? Because memory bandwidth is *not* as good on a R2000 as it is on other machines, even machines with considerably "slower" processors. There are several components to this, the most important being the cache implementation on an R/120. Other implementations using the R2000/R3000/Rx000 chipsets might well do much better, but only with considerable effort and cost, both of which mean that those "better" implementations will begin to approach the price/performance of the "big" machines that you argue will be killed by the price/performance of commodity microprocessors.

I think you are to a degree correct, but one must always tailor such generalities with a dose of real-world applications. I didn't, and I got bit to the tune of a fine bottle of wine. :-(

Phil
_______________________________________________________________________________
Philip A. Naecker                      Consulting Software Engineer
Internet: pan@propress.com             Suite 101
          uunet!prowest!pan            1010 East Union Street
Voice:    +1 818 577 4820              Pasadena, CA 91106-1756
FAX:      +1 818 577 0073
Also:     Technology Editor, DEC Professional Magazine
_______________________________________________________________________________
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/17/89)
In article <35896@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes (another amusing challenge):
>After all, these little critters mutate and become more voracious every
>6 months and vectorizable code is the only thing left for them to conquer.

(I like the picture of fat computer vendors, or at least fat marketing depts, hunched together in bunkers hiding from the killer micros. I have no doubt that they are planning a software counterattack. Watch out for a giant MVS robot built to save the day! :-)

>No NEW technology needs to be developed, all the micro-chip and memory-chip
>makers need to do is to decide to take over the supercomputer market.
>
>They will do this with their commodity parts.

The only problem I see with this is the interconnection technology. The *rest* of it is, or will soon be, commodity market stuff.

>Supercomputers of the future will be scalable multiprocessors made of many
>hundreds to thousands of commodity microprocessors.

The appropriate interconnection technology for this has not, to my knowledge, been determined. Perhaps you might explain how it will be done? The rest, I agree, is doable at this point, though some of it is not trivial.

Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035    Phone:  (415)694-6117
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/17/89)
In article <127@csinc.UUCP> rpeglar@csinc.UUCP (Rob Peglar x615) writes:
>In article <35825@lll-winken.LLNL.GOV>, brooks@maddog.llnl.gov writes:
>that point for almost a generation.  I believe that it will take at least
>one more generation - those who were weaned on machines from CDC, then CRI -
>before a more reasonable approach to machine procurement comes to pass.

In my experience, gov't labs are very cost conscious. I could tell a lot of stories on this. Suffice it to say that many people who have come to gov't labs from private industry get frustrated with just how cost conscious the gov't can be (almost an exact quote: "In my last company, if we needed another 10GBytes, all we had to do was ask, and they bought it for us." That was when 10 GBytes cost $300 K.) The reason supercomputers are used so much is that they get the job done more cheaply. You may question whether or not new nuclear weapons need to be designed, but I doubt if the labs doing it would use Crays if that were not the cheapest way to get the job done. Private industry concerns with the same kinds of jobs also use supercomputers the same way. Oil companies, for example. At various times, oil companies have owned more supercomputers than govt labs.

>Thus, I disagree that there will *always* be a market for this sort of
>thing.  Status symbols may be OK in cars, but for machines purchased with
>taxpayer dollars, the end is near.  Hence, Brooks' "attack of the killer
>micros".

I will make a reverse claim: People who want status symbols buy PC's for their office. These PC's, the last time I checked, were only 1/1000th as cost effective at doing scientific computations as supercomputers. Talk about *waste*... :-)

Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035    Phone:  (415)694-6117
brooks@vette.llnl.gov (Eugene Brooks) (10/17/89)
In article <33798@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>>Supercomputers of the future will be scalable multiprocessors made of many
>>hundreds to thousands of commodity microprocessors.
>
>The appropriate interconnection technology for this has not, to my knowledge,
>been determined.  Perhaps you might explain how it will be done?  The rest,
>I agree, is doable at this point, though some of it is not trivial.

This is the stuff of research papers right now, and rapid progress is being made in this area. The key issue is not having the components which establish the interconnect cost much more than the microprocessors, their off chip caches, and their main memory. We have been through message passing hypercubes and the like, which minimize hardware cost while maximizing programmer effort. I currently lean to scalable coherent cache systems which minimize programmer effort. The exact protocols and hardware implementation which work best for real applications is a current research topic. The complexity of the situation is much too high for a vendor to just pick a protocol and build without first running very detailed simulations of the system on real programs.

brooks@maddog.llnl.gov, brooks@maddog.uucp
brooks@vette.llnl.gov (Eugene Brooks) (10/17/89)
In article <MCCALPIN.89Oct16141656@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>The larger chip vendors are paying more attention to parallelism now,
>but it appears to be in the context of 2-4 processor parallelism.  It
>is not likely to be possible to make these chips work together in
>configurations of 1000's with the application of "glue" chips....

These microprocessors, for the most part, are being designed to work in a small processor count coherent cache shared memory environment. This is the reason why examining scalable coherent cache systems is so important. The same micros, with their capability to lock a cache line for a while to do an indivisible op, will work fine in the scalable systems. I agree that they won't be optimal, but they will be within 90% of optimal and that is all that is required. The MAJOR problem with current micros in a scalable shared memory environment is their 32 bit addressing. Unfortunately, no 4 processor system will ever need more than 32 bit addresses, so we will have to BEG the micro vendors to put in bigger pointer support.

>This is not to mention the fact that software technology for these
>parallel supercomputers is depressingly immature.  I think traditional
>moderately parallel machines (e.g. Cray Y/MP-8) will be able to handle
>existing scientific workloads better than 1000-processor parallel
>machines for quite some time....

The software question is the really hairy one; that is why LLNL is sponsoring the Massively Parallel Computing Initiative. We see scalable machines being very cost effective and are making a substantial effort in the application software area.

brooks@maddog.llnl.gov, brooks@maddog.uucp
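The 32-bit addressing complaint is easy to quantify: a flat 32-bit address space tops out at 4 GB, which even a modest scalable machine exceeds. The node count and per-node memory below are illustrative assumptions, not LLNL figures:

    /* Aggregate memory of an assumed scalable machine vs. the reach of
     * a flat 32-bit address.  All configuration numbers are made up.
     */
    #include <stdio.h>

    int main(void)
    {
        unsigned long nodes       = 1024;          /* assumed node count    */
        unsigned long mb_per_node = 64;            /* assumed MB per node   */
        unsigned long total_mb    = nodes * mb_per_node;
        unsigned long limit_mb    = 4UL * 1024;    /* 2^32 bytes = 4096 MB  */

        printf("aggregate memory: %lu MB, 32-bit limit: %lu MB\n",
               total_mb, limit_mb);                /* 65536 vs 4096         */
        return 0;
    }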
brooks@vette.llnl.gov (Eugene Brooks) (10/17/89)
In article <33802@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>I will make a reverse claim: People who want status symbols buy PC's for their
>office.  These PC's, the last time I checked, were only 1/1000th as cost
>effective at doing scientific computations as supercomputers.  Talk about
>*waste*... :-)

A "PC" with a MIPS R3000 or an Intel i860 in it is about 70 times more cost effective for scalar codes, and we run a lot of those on our supercomputers at LLNL, and about 3 to 7 times more cost effective for highly vectorized codes. In fact, much to our computer center's dismay, research staff are voting with their wallet and buying these "PC"s in droves. Our computer center is responding by buying microprocessor powered machines, currently in bus based shared memory multiprocessor form, but eventually in scalable shared memory multiprocessor form.

brooks@maddog.llnl.gov, brooks@maddog.uucp
mg@notecnirp.Princeton.EDU (Michael Golan) (10/17/89)
This came from various people - the references are so confusing I removed them so as not to put the wrong words in someone's mouth:

>>>Supercomputers of the future will be scalable multiprocessors made of many
>>>hundreds to thousands of commodity microprocessors.
>>
>This is the stuff of research papers right now, and rapid progress is being
>made in this area.  The key issue is not having the components which establish
>the interconnect cost much more than the micros, their off chip caches,
>I currently lean to scalable coherent cache systems which minimize programmer
>effort.  The exact protocols and hardware implementation which work best
>for real applications is a current research topic.

Last year, I took a graduate level course in parallel computing here at Princeton. I would like to make the following comments, which are my *own*:

1) There is no parallel machine currently that works faster than non-parallel machines for the same price. The "fastest" machines are also non-parallel - these are vector processors.

2) A lot of research is going on - and went on for over 10 years now. As far as I know, no *really* scalable parallel architecture with shared memory exists that will scale far above 10 processors (i.e. 100). And it does not seem to me this will be possible in the near future. "A lot of research" does not imply any effective results - especially in CS - just take a look how many people write articles improving time from O(N log log N) to O(N log log log N), which will never be practical for N<10^20 or so (the log log is just an example; you know what I mean).

3) Personally I feel parallel computing has no real future as the single cpu gets a 2-4 fold performance boost every few years, and parallel machine construction just can't keep up with that. It seems to me that for at least the next 10 years, non-parallel machines will still give the best performance and the best performance/cost.

4) I think Cray-like machines will be here for a long long time. People talk about Cray-sharing. This is true, but when an engineer needs a simulation to run and it takes 1 day each time, if you run it on a machine where it takes 2 or 3 days, he sits doing nothing for that time, which costs you a lot, i.e. it is turn-around time that really matters. And while computers get faster, it seems software complexity and the need for faster and faster machines are growing even more rapidly.

Michael Golan
mg@princeton.edu
The opinions expressed above are my *own*. You are welcome not to like them.
kleonard@gvlv2.GVL.Unisys.COM (Ken Leonard) (10/17/89)
In article <12070@cgl.ucsf.EDU> seibel@cgl.ucsf.edu (George Seibel) writes:
* In <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov wrote:
* > On scalar codes, commodity microprocessors ARE the fastest machines at
* > any price and custom cpu architectures are doomed in this market.
*
* Speaking of "commodities",...
* ...
* *commodity Mflops*. The issue is no longer performance at any cost, because
* if it was you would order another machine at that point. The important
* thing is Mflops/dollar for most people, and that's where the micros are
* going to win in a lot of cases.
---- well, first...
Maybe, even, the _commodity_ is _not_ _M_flops per dollar, but just
_flop_flops per dollar? That is, if the cycle time to "set up the problem",
"crunch the numbers", "get the plot/list/display" is under _whatever_ upper
limit fits with _my_ mode of "useful work", then I very likely _do_not_care_
if it gets any shorter (i.e. if the _flop_flops per second per dollar goes
higher). This becomes, IMHO, even more significant if my "useful" cycle time
is available to me _truly_ whenever _I_ darn well feel the need.
All of which works, again, to the advantage of microcrunchers.
---- and, second...
A non-trivial part of the demand for megacrunchers, IMHO, stems from
solution methods which have evolved from the days when _only_ "big"
machines were available for "big" jobs (any jobs?) and _just_had_to_be_
shared. For what _I_ do, anyhow, (and probably a _lot_ of other folk
somewhere out there), the "all-in-one-swell-foop" analyses/programs/techniques
are not the _only_ way to get to the _results_ needed to do the job--and
they may well _not_ be the "best" way. I often find that somewhat more of
somewhat smaller steps get me to my target faster than otherwise. That is,
if I can only get 1 or 2 or 10 passes per day through the megacruncher, the
job takes more work from me and more time on the calendar and more bucks from
whoever is paying the tab, than if I can run as many smaller passes as I
need.
---- also third...
And those smaller passes may well be easier (and thus faster) to program,
and more amenable to validation/assurance/etc.
And they may admit algorithms which work plenty fast on a dedicated machine
even if it is pretty small but would not work very fast at all on a shared
machine even if it is quite big (maybe especially because it is "big
architecture").
---- so, finally...
I believe in micros.
-------------
regardz,
Ken Leonard
swarren@eugene.uucp (Steve Warren) (10/17/89)
In article <33788@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
[...]
>Does anyone have any suggestions on how the connections into and out of such
>memory subsystems could be made without a Cray-sized bundle of connectors?
[...]

Multiplexed optical busses driven by integrated receivers with the optics, decoders, and logic-level drivers on the same substrate. It's the obvious solution (one I think many companies are working on).

DISCLAIMER: This opinion is in no way related to my employment with Convex Computer Corporation. (As far as I know we aren't working on optical busses, but then I'm not in New Products).

--Steve
-------------------------------------------------------------------------
	  {uunet,sun}!convex!swarren; swarren@convex.COM
swarren@eugene.uucp (Steve Warren) (10/17/89)
In article <20336@princeton.Princeton.EDU> mg@notecnirp.edu (Michael Golan) writes:
>Last year, I took a graduate level course in parallel computing here at
>Princeton.  I would like to make the following comments, which are my *own*:
>
>1) There is no parallel machine currently that works faster than non-parallel
>machines for the same price.  The "fastest" machines are also non-parallel -
>these are vector processors.
>

The Cray XMP with one processor costs approx. $2.5M. The 4 processor Convex C240S costs $1.5M. On typical scientific applications the performance of the 240S is about 140% of the single processor Cray XMP. (The 240S is the newest model with enhanced performance CPUs).

Also, vector processors are technically nonparallel, but the implementation involves parallel function units that are piped up so that at any one instant in time there are multiple operations occurring. Vectors are a way of doing parallel processing on a single stream of data.

These were the only points I would disagree with.

--Steve
-------------------------------------------------------------------------
	  {uunet,sun}!convex!swarren; swarren@convex.COM
rpeglar@csinc.UUCP (Rob Peglar x615) (10/17/89)
In article <33802@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> In article <127@csinc.UUCP> rpeglar@csinc.UUCP (Rob Peglar x615) writes:
>
> >that point for almost a generation.  I believe that it will take at least
> >one more generation - those who were weaned on machines from CDC, then CRI -
> >before a more reasonable approach to machine procurement comes to pass.
>
> In my experience, gov't labs are very cost conscious.  I could tell a lot of
> stories on this.  Suffice it to say that many people who have come to gov't labs
> from private industry get frustrated with just how cost conscious the gov't can
> be (almost an exact quote: "In my last company, if we needed another 10GBytes,
> all we had to do was ask, and they bought it for us."  That was when 10 GBytes
> cost $300 K.)  The reason supercomputers are used so much is that they get the
> job done more cheaply.  You may question whether or not new nuclear weapons
> need to be designed, but I doubt if the labs doing it would use Crays
> if that were not the cheapest way to get the job done.  Private industry
> concerns with the same kinds of jobs also use supercomputers the same way.
> Oil companies, for example.  At various times, oil companies have owned more
> supercomputers than govt labs.

Good point. However, oil companies in particular are notorious for having procurements follow the "biggest and baddest = best" philosophy. Hugh, you know as well as I that supercomputer procurement is not a rational or scientific process - it's politics, games, and who knows who. Cheap, efficient, usable, etc. etc. - all take a back seat to politics. However, if the "job" is defined as running one (or some small number of) code(s) for hours then there is no question that only a super will do.

The point that Brooks doesn't make, but implies only, is that the *way* scientific computing is being done changes all the time. One-job killer codes are becoming less prevalent. The solutions must change as the workload changes. Sure, there are always codes which cannot be run (Lincoln's attributed quote, compressed: "a supercomputer is only one generation behind the workload") - but yesterday's killer code, needing 8 hours and 4 million 64-bit words, can now be done on the desktop. (see below)

> >Thus, I disagree that there will *always* be a market for this sort of
> >thing.  Status symbols may be OK in cars, but for machines purchased with
> >taxpayer dollars, the end is near.  Hence, Brooks' "attack of the killer
> >micros".
>
> I will make a reverse claim: People who want status symbols buy PC's for their

Please. Are you saying that NAS, LLNL, LANL, etc. etc. don't compete for status defined as big, bad hardware? Just the glorious battle between Ames and Langley provides one with enough chuckles to last quite a while.

> office.  These PC's, the last time I checked, were only 1/1000th as cost
> effective at doing scientific computations as supercomputers.  Talk about
> *waste*... :-)

Look again. I'll give you a real live example. Buy any 386 33 MHz machine, with a reasonable cache (e.g. at least 128 kB) of fast SRAM, and 8 MB or so of (slower) DRAM. Plug in a Mercury co-processor board, and use Fortran (supplied by Mercury) to compile Dr. Dongarra's Table One Linpack. Results on the PC - 1.8 Mflops. Using a coded BLAS, you get 4.7 Mflops. This is 64-bit math. Last time *I* checked, the Cray Y-MP stood at 79 Mflops. Cost of Cray Y-MP? You and I know what that is.

Even discounting life cycle costing (which for any Cray machine is huge due to bundled maintenance, analysts, etc. etc.), the performance ratio of Y to PC is 79/1.8 = 43.88. I'll bet my year's salary that the price ratio is higher than that. To ballpark, price for the PC setup is around $20K. Moving down all the time. Even if the Y-MP 1/32 was only $2M (which it is not) that would be a 100:1 price ratio. Of course, that is only one code. Truly, your mileage will vary. The price/performance ratio of an overall system is dependent on many variables.

After all that, Brooks' point is still valid. Micros using commodity HW and cheap (sometimes free) software are closing the gap. They have already smashed the price/performance barrier (for many codes), and the slope of their absolute performance improvements over time is much larger than any of the true super vendors (any==1 now, at least US). The game is nearly over.

Rob
...uunet!csinc!rpeglar
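Peglar's ratios written out as a small C calculation; the Linpack and price figures are the ones quoted in his post and should be read as his numbers, not independently verified data:

    /* Price/performance comparison using the figures from the post:
     * 386 + Mercury board at 1.8 Mflops for ~$20K, Cray Y-MP at
     * 79 Mflops, priced at the $2M "even if" figure used above.
     */
    #include <stdio.h>

    int main(void)
    {
        double pc_mflops   = 1.8,   pc_price   = 20.0e3;
        double cray_mflops = 79.0,  cray_price = 2.0e6;

        double perf_ratio   = cray_mflops / pc_mflops;            /* ~43.9 */
        double price_ratio  = cray_price  / pc_price;             /* 100   */
        double pp_advantage = (pc_mflops / pc_price) /
                              (cray_mflops / cray_price);         /* ~2.3  */

        printf("performance ratio %.1f, price ratio %.0f\n",
               perf_ratio, price_ratio);
        printf("PC price/performance advantage: %.1fx\n", pp_advantage);
        return 0;
    }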
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/17/89)
In article <35986@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>A "PC" with a MIPS R3000 or an Intel i860 in it is about 70 times more cost

A quick clarification: The "PC's" I was talking about are IBM PC's and clones based on Intel 80x86 chips, *not* SGI or DEC machines based on R3000/R3010s. "PC" may also be extended to Apple Mac and Mac II machines by some people. Most of the "PC" boosters that I am thinking of, and from which we have heard in this newsgroup recently, are also "offended" by the "excessive" power and cost of MIPSCo based machines. Not me, obviously, but most of these people do not consider an SGI 4D/25 a "PC".

>effective for scalar codes, and we run a lot of those on our supercomputers
>at LLNL, and about 3 to 7 times more cost effective for highly vectorized
>codes.

Well, I admit, I hadn't done a calculation for some months. Last time I did it, I was somewhat disappointed by the inflated claims surrounding micro based systems. I have been hearing "wolf!" for 15 years, so it is easy to be blase' about it. But, this USENET discussion stimulated me to look at it again. Another quick calculation shows a *big change*. It appears to me, on the face of it, that cost/delivered FLOP is now about even. I don't see the 3-7X advantage to the micros yet, but maybe you are looking at the faster 60-100MHz systems that will soon be arriving. I used SGI 4D/280's as the basis of comparison, since that appears to be the most cost effective of such systems that I have good pricing information on.

Anyway, how long has it taken Cray to shave a few ns off the clock? In less than a year we should see systems based on the new micro chips. Yikes. It looks like the ATTACK OF THE KILLER MICROS.

Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035    Phone:  (415)694-6117
desnoyer@apple.com (Peter Desnoyers) (10/18/89)
> > In my experience, gov't labs are very cost conscious.  I could tell a
> > lot of stories on this.  Suffice it to say that many people who have come
> > to gov't labs from private industry get frustrated with just how cost
> > conscious the gov't can be.  (almost an exact quote: "In my last company,
> > if we needed another 10GBytes, all we had to do was ask, and they bought
> > it for us."  That was when 10 GBytes cost $300 K.)  The reason
> > supercomputers are used so much is that they get the job done more
> > cheaply.

From what I know of DOD procurement (my father works for a US Navy lab) one factor may be that the time and effort needed to justify spending $25,000 of Uncle Sam's money on a super-micro, along with the effort of spec'ing it as sole-source or taking bids, is no doubt far more than 1/400th the effort needed to procure a $10M supercomputer.

Peter Desnoyers
Apple ATG
(408) 974-4469
brooks@vette.llnl.gov (Eugene Brooks) (10/18/89)
In article <20336@princeton.Princeton.EDU> mg@notecnirp.edu (Michael Golan) writes:
>1) There is no parallel machine currently that works faster than non-parallel
>machines for the same price.  The "fastest" machines are also non-parallel -
>these are vector processors.

This is false. There are many counter examples for specific applications.

>2) A lot of research is going on - and went on for over 10 years now.  As far
>as I know, no *really* scalable parallel architecture with shared memory exists
>that will scale far above 10 processors (i.e. 100).  And it does not seem to
>me this will be possible in the near future.

Again, this is wrong. Many scalable architectures exist in the literature and some of them are well proven using simulation on real application codes.

>3) Personally I feel parallel computing has no real future as the single cpu
>gets a 2-4 fold performance boost every few years, and parallel machine
>construction just can't keep up with that.  It seems to me that for at least
>the next 10 years, non-parallel machines will still give the best performance
>and the best performance/cost.

Massively parallel computing has a future because the performance increases are 100 or 1000 fold. I agree with the notion that using 2 processors, if the software problems are severe, is not worth it because next year's micro will be twice as fast. Next year's supercomputer, however, will not be twice as fast.

>4) I think Cray-like machines will be here for a long long time.  People talk
>about Cray-sharing.  This is true, but when an engineer needs a simulation to
>run and it takes 1 day each time, if you run it on a machine where it takes 2
>or 3 days, he sits doing nothing for that time, which costs you a lot, i.e. it
>is turn-around time that really matters.

Cray-like machines will be here for a long time indeed. They will, however, be implemented on single or nearly single chip microprocessors. I do not think that the "architecture" is bad, only the implementation has become nearly obsolete. It is definitely obsolete for scalar code and vectorized code will follow within 5 years.

brooks@maddog.llnl.gov, brooks@maddog.uucp
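Brooks's "next year's micro will be twice as fast" argument is a compound-growth comparison. A sketch with assumed doubling times; the rates are illustrative assumptions, not data from the thread:

    /* Compound growth of micro vs. big-iron uniprocessor performance
     * over a fixed horizon, using made-up doubling times.
     */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double micro_doubling_yrs = 1.5;   /* assumed */
        double super_doubling_yrs = 5.0;   /* assumed */
        double years              = 5.0;

        double micro_gain = pow(2.0, years / micro_doubling_yrs);
        double super_gain = pow(2.0, years / super_doubling_yrs);

        printf("over %.0f years: micro x%.1f, supercomputer x%.1f\n",
               years, micro_gain, super_gain);   /* ~10.1 vs 2.0 */
        return 0;
    }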
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/18/89)
In article <36057@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>Cray like machines will be here for a long time indeed.  They will, however,
>be implemented on single or nearly single chip microprocessors.  I do not
>think that the "architecture" is bad, only the implementation has become
>nearly obsolete.  It is definitely obsolete for scalar code and vectorized
>code will follow within 5 years.

I agree with you here. In fact, did anyone notice a recent newspaper article (in Tuesday's Merc. News - from Knight Ridder):

"Control Data to use Mips design"

"Control Data Corp. has cast its lot with Mips Computer Systems, Inc. to design the brains of its future computers, choosing a new computer architecture developed by the Sunnyvale company." ... "The joint dev. agreement with Mips means Control Data will use [...] the RISC architecture developed by that firm..."

Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035    Phone:  (415)694-6117
ggw@wolves.uucp (Gregory G. Woodbury) (10/18/89)
In article <MCCALPIN.89Oct16141656@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>In article <35896@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene
>Brooks) writes:
>
>>Microprocessor development is not ignoring vectorizable workloads.  The
>>latest have fully pipelined floating point and are capable of pipelining
>>several memory accesses.
>>[ ... more stuff deleted ... ]
>
>It is not at all clear to me that the memory bandwidth required for
>running vector codes is going to be developed in commodity parts.  To
>be specific, a single 64-bit vector pipe requires a sustained
>bandwidth of 24 bytes per clock cycle.  Is an ordinary, garden-variety
>commodity microprocessor going to be able to use 6 32-bit
>words-per-cycle of memory bandwidth on non-vectorized code?  If not,
>then there is a strong financial incentive not to include that excess
>bandwidth in commodity products....
>

This is quite a statement. Don't forget - even if the micro can not make FULL use of a vector pipeline, including one will enhance performance significantly. The theoretical folks in this forum are quite useful in the development of theoretical maxima, but even some partial vector capabilities in a floating point unit will be greeted with joy. Lots and lots of "commodity" programs out there do things that would benefit from some primitive vector computations.

Just in the past couple of weeks we have had some discussions here about the price/performance aspects of these "Killer Micros". (I do want to acknowledge that my price figures were a little skewed - another round of configuration work with various vendors has shown that I can find a decent bus speed and SCSI disks in the required price range - thanks for some of the pointers!)

>In addition, the engineering/cost trade-off between memory bandwidth
>and memory latency will continue to exist for the "KILLER MICROS" as
>it does for the current generation of supercomputers.  Some users will
>be willing to sacrifice latency for bandwidth, and others will be
>willing to do the opposite.  Economies of scale will not eliminate
>this trade-off, except perhaps by eliminating the companies that take
>the less profitable position (e.g. ETA).

This is a good restatement of the recent "SCSI on steroids" discussion. The vendor who can first put a "real" supercomputer or "real" mainframe on (or beside) the desktop for <$50,000 will make a killing. Calling something a "Personal Mainframe" makes marketing happy, but not being able to keep that promise makes for unhappy customers ;-)

--
Gregory G. Woodbury
Sysop/owner Wolves Den UNIX BBS, Durham NC
UUCP: ...dukcds!wolves!ggw   ...dukeac!wolves!ggw           [use the maps!]
Domain: ggw@cds.duke.edu     ggw@ac.duke.edu     ggw%wolves@ac.duke.edu
Phone: +1 919 493 1998 (Home)  +1 919 684 6126 (Work)
[The line eater is a boojum snark! ]       <standard disclaimers apply>
kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/18/89)
In article <35897@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >In article <2121@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes: >>In article <35825@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov () writes: >>>The best of the microprocessors now EXCEED supercomputers for scalar >>Is this a fair statement? I've played some with the i860 and >Yes, in the sense that a scalar dominated program has been compiled for >the i860 with a "green" compiler, no pun intended, and the same program >was compiled with a mature optimizing compiler on the XMP, and the 40MHZ >i860 is faster for this code. Better compilers for the i860 will open The Cray-XMP is considerably slower than the YMP. The single-processor XMP is no longer a supercomputer. Take a program requiring more than 128MBytes of memory (or 64 MBytes for that matter (but I personally prefer more than 256M to exercise the VM system a little!)) (i.e. a relatively BIG job, a *supercomputer* job) and then compare any micro you want or any other system you want with the YMP, or something in that class, and then try it on a multiprocessor YMP. And please STOP USING A SINGLE-PROCESSOR xmp AS THE DEFINITION OF A SUPERCOMPUTER, thank you. And it would be nice if people used "LIST PRICE" for "COMPLETE SYSTEMS" when comparing prices. (LIST PRICE = PEAK PRICE !!) (COMPLETE SYSTEM = with all needed software and a few GBytes of disk with a few controllers)
kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/18/89)
In article <127@csinc.UUCP> rpeglar@csinc.UUCP (Rob Peglar x615) writes: >(pick your number, the most popular one recently seems to be Linpack >100x100 all Fortran, Dongarra's Table One). The national labs have proven Throw away ALL your copies of the LINPACK 100x100 benchmark if you are interested in supercomputers. The 300x300 is barely big enough and uses a barely good-enough algorithm to qualify for supercomputer comparison as a low-impact guideline only. JJD has lots of warning words in the first paragraphs of his list but it looks like most people go right to the table and never read the paper. If you must use a single-program benchmark, use the lesson taught by the Sandia people (John Gustafson, et al.): Keep the time fixed and vary the problem size.
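For reference, the fixed-time measurement the Sandia group argued for is two lines of arithmetic next to Amdahl's fixed-size version; a sketch in C, with an arbitrary 1% serial fraction chosen purely for illustration:

    #include <stdio.h>

    int main(void)
    {
        double s = 0.01;    /* assumed serial fraction, for illustration only */
        int n;

        for (n = 1; n <= 1024; n *= 4) {
            double fixed_size = 1.0 / (s + (1.0 - s) / n);  /* Amdahl           */
            double fixed_time = s + (1.0 - s) * n;          /* Gustafson/Sandia */
            printf("%5d procs: fixed-size speedup %7.1f  fixed-time speedup %7.1f\n",
                   n, fixed_size, fixed_time);
        }
        return 0;
    }

The fixed-size number saturates near 1/s no matter how many processors you add; the fixed-time number keeps growing because the problem grows with the machine, which is the whole point of the Sandia exercise.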
mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (10/18/89)
In article <35896@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >Supercomputers of the future will be scalable multiprocessors made of >many hundreds to thousands of commodity microprocessors. They will be >commodity parts because these parts will be the fastest around and >they will be cheap. These scalable machines will have hundreds of >commodity disk drives ganged up for parallel access. Commodity parts >will again be used because of the cost advantage leveraged into a >scalable system using commodity parts. The only custom logic will be >the interconnect which glues the system together, and error correcting >logic which glues many disk drives together into a reliable high >performance system. The CM data vault is a very good model here. I think that it is interesting that you expect the same users who can't vectorize their codes on the current vector machines to be able to figure out how to parallelize them on these scalable MIMD boxes. It seems to me that the automatic parallelization problem is much worse than the automatic vectorization problem, so I think a software fix is unlikely.... In fact, I think I can say it much more strongly than that: Extrapolating from current experience with MIMD machines, I don't think that the fraction of users that can use a scalable MIMD architecture is likely to be big enough to support the economies of scale required to compete with Cray and their vector machines. (At least for the next 5 years or so). What I *do* think is that the romance with vector machines has worn off, and people are realizing that they are not the answer to everyone's problems. This is a good thing --- I like it when people migrate their scalar codes off of the vector machines that I am trying to get time on!!! What is driving the flight from traditional supercomputers to high-performance micros is turnaround time on scalar codes. From my experience, if the code is really not vectorizable, then it is probably not parallelizable either, and scalable machines won't scale. These users are going to want the fastest single-processor micro available, unless their memory requirements are too big for their ability to purchase. The people who can vectorize their codes are still getting 100:1 improvements going to supercomputers --- my code is over 500 times faster on an 8-cpu Cray Y/MP than on a 25 MHz R-3000/3010. So the market for traditional supercomputers won't disappear, it will just be more limited than many optimists have predicted. -- John D. McCalpin - mccalpin@masig1.ocean.fsu.edu mccalpin@scri1.scri.fsu.edu mccalpin@delocn.udel.edu
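The distinction McCalpin is leaning on is usually a genuine dependence chain rather than a compiler weakness; a minimal sketch of the two cases (the loops are illustrative, not taken from any of the codes discussed in the thread):

    /* Independent iterations: meat for a vectorizer or a parallelizer alike. */
    void independent(int n, double *a, const double *b, const double *c)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = b[i] + 2.0 * c[i];     /* no dependence between iterations */
    }

    /* First-order recurrence: each a[i] needs a[i-1], so neither a vector
       pipe nor a thousand processors helps this loop directly. */
    void recurrence(int n, double *a, const double *b)
    {
        int i;
        for (i = 1; i < n; i++)
            a[i] = a[i-1] * b[i] + 1.0;
    }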
rwa@cs.AthabascaU.CA (Ross Alexander) (10/18/89)
brooks@vette.llnl.gov (Eugene Brooks) writes: >In article <33802@ames.arc.nasa.gov> (Hugh LaMaster) writes: >>[...] a reverse claim: People who want status symbols buy PC's for their >>office. These PC's, the last time I checked, were only 1/1000th as cost >>effective at doing scientific computations as supercomputers. Talk about >>*waste*... :-) >A "PC" with a MIPS R3000 or an Intel i860 in it is about 70 times more cost >effective for scalar codes, and we run a lot of those on our supercomputers C'mon, Eugene, address the claim, not a straw man of your own invention. Hugh means people who buy intel-hackitecture machines from Big Blue. Do you really mean LLNL people buy mips-engine boxes as office status symbols? Not a very nice thing to say about your own team ;-). And then you contradict yourself by saying these same mips-or-whatever boxes are 70 times more effective: are they status symbols, or are they machines to do work? Make up your mind :-) :-). Ross
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/18/89)
In article <33802@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes: | I will make a reverse claim: People who want status symbols buy PC's for their | office. These PC's, the last time I checked, were only 1/1000th as cost | effective at doing scientific computations as supercomputers. Talk about | *waste*... :-) What you say is true, but you seem to draw a strange conclusion from it... very few people do scientific calculations on a PC. They are used for spreadsheets, word processing, and even reading news ;-) These are things which supercomputers do poorly. Benchmark nroff on a Cray... EGAD! it's slower than an IBM 3081! Secondly *any* computer becomes less cost effective as it is used less. Unless you have the workload to heavily use a supercomputer you will find the cost gets really steep. Think of it this way, a technical worker costs a company about $50000 a year (or more), counting salary and benefits. The worker works 240 days a year (2 weeks vacation, 10 days holiday and sick), at a cost per *working hour* of $26 more or less. For a $1600 PC to be cost effective in just a year it must save about 16 minutes a day, which is pretty easy to do. You also get increased productivity. Obviously not every PC is utilized well. Neither are workstations (how many hours drawing fractals and playing games) or supercomputers, for that matter. That problem is a management issue, not a factor of computer size. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "The world is filled with fools. They blindly follow their so-called 'reason' in the face of the church and common sense. Any fool can see that the world is flat!" - anon
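The break-even figure above is a few divisions; a sketch using exactly the numbers quoted in the posting:

    #include <stdio.h>

    int main(void)
    {
        double cost_per_year  = 50000.0;    /* salary plus benefits, as quoted */
        double days_per_year  = 240.0;
        double hours_per_day  = 8.0;
        double pc_price       = 1600.0;

        double cost_per_hour   = cost_per_year / (days_per_year * hours_per_day);
        double pc_cost_per_day = pc_price / days_per_year;
        double breakeven_min   = 60.0 * pc_cost_per_day / cost_per_hour;

        printf("worker costs about $%.0f per working hour\n", cost_per_hour);
        printf("a $%.0f PC pays for itself in a year if it saves %.1f minutes a day\n",
               pc_price, breakeven_min);
        return 0;
    }

This comes out to roughly 15 minutes a day, in line with the "about 16 minutes" in the posting.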
ingoldsb@ctycal.UUCP (Terry Ingoldsby) (10/19/89)
In article <MCCALPIN.89Oct16141656@masig3.ocean.fsu.edu>, mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes: > In article <35896@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene > Brooks) writes: > > >Microprocessor development is not ignoring vectorizable workloads. The > >latest have fully pipelined floating point and are capable of pipelining ... > It seems to me that the experience in the industry is that > general-purpose processors are not usually very effective in > parallel-processing applications. There is certainly no guarantee > that the uniprocessors which are successful in the market will be > well-suited to the parallel supercomputer market -- which is not > likely to be a big enough market segment to have any control over what > processors are built.... Agreed. The only general purpose systems that I am aware of that exploit parallel processing do so through specialized processors to handle certain functions (e.g. matrix multipliers, I/O processors) or have a small (< 16) number of general purpose processors. > > The larger chip vendors are paying more attention to parallelism now, > but it appears to be in the context of 2-4 processor parallelism. It > is not likely to be possible to make these chips work together in > configurations of 1000's with the application of "glue" chips.... It doesn't seem to be just a case of using custom designed chips as opposed to generic glue. The problem is fundamentally one of designing a system that allows the problem to be divided across many processors AND (this is the tricky part) that provides an efficient communication path between the sub-components of the problem. In the general case this may not be possible. Note that mother nature hasn't been able to do it (e.g. the human brain isn't very good at arithmetic, but for other applications it's stupendous). > > This is not to mention the fact that software technology for these > parallel supercomputers is depressingly immature. I think traditional > moderately parallel machines (e.g. Cray Y/MP-8) will be able to handle > existing scientific workloads better than 1000-processor parallel > machines for quite some time.... > -- I don't think we should berate ourselves about the techniques available for splitting workloads. No one has ever proved that such an activity is even possible for most problems (at a *large* scale). The activities that are amenable to parallel processing (e.g. image processing, computer vision) will probably only be feasible on architectures specifically designed for those functions. Note that I'm not saying to give up on parallel processing; on the contrary I believe that it is the only way to do certain activities. I am saying that the notion of a general purpose massively parallel architecture that efficiently executes all kinds of algorithms is probably a naive and simplistic view of the world. -- Terry Ingoldsby ctycal!ingoldsb@calgary.UUCP Land Information Systems or The City of Calgary ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb
slackey@bbn.com (Stan Lackey) (10/19/89)
In article <MCCALPIN.89Oct18103933@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes: >In article <35896@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene >Brooks) writes: > >>Supercomputers of the future will be scalable multiprocessors made of >>many hundreds to thousands of commodity microprocessors. >I think that it is interesting that you expect the same users who >can't vectorize their codes on the current vector machines to be able >to figure out how to parallelize them on these scalable MIMD boxes. >It seems to me that the automatic parallelization problem is much >worse than the automatic vectorization problem, ... Yes, there seems to be the perception running around that "parallelization" must be harder than "vectorization". I am not saying it isn't, because I am not a compiler writer, but I sure can give some reasons why it might not be. Vectorization requires the same operation to be repeatedly performed on the elements of a vector. Parallel processors can perform different operations, such as conditional branching within a loop that is being performed in parallel. Dependencies between loop iterations can be handled in a PP that has the appropriate communication capabilities, whereas most (all?) vector machines require that all elements be independent (except for certain special cases, like summation and dot product.) This can be done by message passing, or if you have shared memory, with interlocks. Parallel processors are not limited to operations for which there are corresponding vector instructions provided in the hardware. Well that's all I can think of right now. Anyone else care to add anything? -Stan
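A small illustration of the kind of loop being described: the iterations are independent, so a MIMD machine can hand them to different processors, but the branch and the scalar subroutine call give a vector unit little to work with short of executing both arms under a mask. The work routine here is hypothetical, purely for the sake of the example.

    double hard_case(double x);    /* hypothetical expensive scalar routine */

    void per_element_branch(int n, double *a, const double *b)
    {
        int i;
        for (i = 0; i < n; i++) {          /* iterations are independent */
            if (b[i] > 0.0)
                a[i] = b[i] * b[i];        /* cheap arm                  */
            else
                a[i] = hard_case(b[i]);    /* expensive, scalar arm      */
        }
    }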
chris@dg.dg.com (Chris Moriondo) (10/19/89)
In article <35977@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >This is the stuff of research papers right now, and rapid progress is being >made in this area. The key issue is not having the components which establish >the interconnect cost much more than the microprocessors, their off chip >caches, and their main memory. The only really scalable interconnect schemes of which I am aware are multistage interconnects which grow (N log N) as you linearly increase the numbers of processors and memories. So in the limit the machine is essentially ALL INTERCONNECT NETWORK, which obviously costs more than the processors and memories. (Maybe this is what SUN means when they say "The Network IS the computer"? :-) How do you build a shared-memory multi where the cost of the interconnect scales linearly? Obviously I am discounting busses, which don't scale well past very small numbers of processors. >We have been through message passing hypercubes and >the like, which minimize hardware cost while maximizing programmer effort. >I currently lean to scalable coherent cache systems which minimize programmer >effort. While message passing multicomputers maximize programmer effort in the sense that they don't lend themselves to "dusty deck" programs, they have the advantage that the interconnect costs scale linearly with the size machine. They also present a clean programmer abstraction that presents the true cost of operations to the programmer. I read a paper by (I think) Larry Snyder wherein he argued that the PRAM abstraction causes programmers to produce suboptimal parallel algorithms by leading one to think that simple operations have linear cost when in reality they can't be better than N log N. chrism -- Insert usual disclaimers here --
bga@odeon.ahse.cdc.com (Bruce Albrecht) (10/19/89)
In article <35979@lll-winken.LLNL.GOV>, brooks@vette.llnl.gov (Eugene Brooks) writes: > Unfortunately, no 4 processor system will ever need more than 32 bit > addresses, so we will have to BEG the micro vendors to put in bigger > pointer support.. Oh really? CDC has several customers that have databases that exceed 2**32 bytes. Our file organization considers files to be virtual memory segments. We already need pointers larger than 32 bits. IBM's AS400 has a virtual address space greater than 32 bits, too. If the micro vendors don't see a need for it, they're not paying attention to what the mainframes are really providing for their very large system customers.
chris@dg.dg.com (Chris Moriondo) (10/19/89)
In article <20336@princeton.Princeton.EDU> mg@notecnirp.edu (Michael Golan) writes: >3) personally I feel parallel computing has no real future as the single cpu >gets a 2-4 folds performance boost every few years, and parallel machines >constructions just can't keep up with that. It seems to me that for at least >the next 10 years, non-parallel machines will still give the best performance >and the best performance/cost. Actually, the rate of improvement in single cpu performance seems to have flattened out in recent supercomputers, and they have turned to more parallelism to continue to deliver more performance. If you project the slope of the clock rates of supercomputers, you will see sub-nanosecond CYCLE times before 1995. I don't see any technologies in the wings which promise to allow this to continue... chrism
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/19/89)
In article <35979@lll-winken.LLNL.GOV>, brooks@vette.llnl.gov (Eugene Brooks) writes: | The MAJOR problem with current micros | in a scalable shared memory environment is their 32 bit addressing. | Unfortunately, no 4 processor system will ever need more than 32 bit | addresses, so we will have to BEG the micro vendors to put in bigger | pointer support.. The Intel 80386 has 32 bit segments, but it's still a segmented system, and the virtual address space is (I believe) 40 bits. The *physical* space is 32 bits, though. The 586 has been described in the press as a 64 bit machine. Seems about right; the problem which people are seeing right now is that file size is getting over 32 bits, and that makes all the database stuff seriously ugly, complex, and subject to programming error. I think you can assume that no begging will be needed, but if you let the vendors think that you need it the price will rise ;-) -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "The world is filled with fools. They blindly follow their so-called 'reason' in the face of the church and common sense. Any fool can see that the world is flat!" - anon
rec@dg.dg.com (Robert Cousins) (10/19/89)
In article <2450@odeon.ahse.cdc.com> bga@odeon.ahse.cdc.com (Bruce Albrecht) writes: >In article <35979@lll-winken.LLNL.GOV>, brooks@vette.llnl.gov (Eugene Brooks) writes: >> Unfortunately, no 4 processor system will ever need more than 32 bit >> addresses, so we will have to BEG the micro vendors to put in bigger >> pointer support.. > >Oh really? CDC has several customers that have databases that exceed 2**32 >bytes. Our file organization considers files to be virtual memory segments. >We already need pointers larger than 32 bits. IBM's AS400 has a virtual >address space greater than 32 bits, too. If the micro vendors don't see a >need for it, they're not paying attention to what the mainframes are really >providing for their very large system customers. In 1947, John Von Neumann anticipated that 4K words of 40 bits each was enough for contemporary problems and so the majority of machines then had that much RAM (or what passed for it in the technology of the day). This is ~2**12 bytes worth of usefulness in today's thinking (though not in bits). Over the next 40 years we've grown to the point where 2**32 bytes is a common theoretical limit, and a large number of machines in the 2**30 byte range is fairly common. This translates into 18-20 bits of address over 40 years. Or, 1 bit of address every 2 years or so. Given the trend to having micro architectures last 5 to 8 years, this means that a micro architecture should have at least 4 additional address lines at its announce or 5 additional when its development is started. In the PC space, 16 megabytes seems to be the common upper limit. Any PC therefore should have not 2**24 as a limit but 2**26 at the minimum. IMHO, at least :-) Robert Cousins Dept. Mgr, Workstation Dev't. Data General Corp. Speaking for myself alone.
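The rule of thumb in the posting reduces to a couple of divisions; a sketch, using the lifetime figure quoted there (the two-year development lead time is inferred from the posting's 4-versus-5 figures, not stated explicitly):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double years_per_bit   = 2.0;   /* ~1 address bit every 2 years     */
        double lifetime_years  = 8.0;   /* architecture stays in the field  */
        double lead_time_years = 2.0;   /* assumed development before announce */

        double extra_at_announce = ceil(lifetime_years / years_per_bit);
        double extra_at_start    = ceil((lifetime_years + lead_time_years) /
                                        years_per_bit);

        printf("extra address bits needed at announce:     %.0f\n",
               extra_at_announce);
        printf("extra address bits needed at design start: %.0f\n",
               extra_at_start);
        return 0;
    }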
hascall@atanasoff.cs.iastate.edu (John Hascall) (10/19/89)
In article <???> bga@odeon.ahse.cdc.com (Bruce Albrecht) writes: }In article <???>, brooks@vette.llnl.gov (Eugene Brooks) writes: }> Unfortunately, no 4 processor system will ever need more than 32 bit }Oh really? CDC has several customers that have databases that exceed 2**32 ... }We already need pointers larger than 32 bits. IBM's AS400 has a virtual }address space greater than 32 bits, too. I don't know about CDC, but the AS/400 uses what is called Single Level Storage, that is, all memory and disk are in one humongous address space. Many people do require more than 2**32 bytes of disk farm, but very few people are using 2**32 bytes of memory space--so in a more typical system the need for (pointers) more than 32 bits is rather uncommon % John Hascall % although I'm sure we'll hear from a number of them now :-)
joel@cfctech.UUCP (Joel Lessenberry) (10/19/89)
In article <1633@atanasoff.cs.iastate.edu> hascall@atanasoff.UUCP (John Hascall) writes: >... >}We already need pointers larger than 32 bits. IBM's AS400 has a virtual >}address space greater than 32 bits, too. > > I don't know about CDC, but the AS/400 uses what is called Single Level > Storage, that is, all memory and disk are in one humongous address space. > John Hascall > is anyone else out there interested in starting an AS/400 thread? It is IBM's most advanced system.. Single level storage Object Oriented Arch Context addressing Hi level machine Instruction set 64 bit logical addressing True complete I/D split, no chance for self modifying code joel Joel Lessenberry, Distributed Systems | +1 313 948 3342 joel@cfctech.UUCP | Chrysler Financial Corp. joel%cfctech.uucp@mailgw.cc.umich.edu | MIS, Technical Services {sharkey|mailrus}!cfctech!joel | 2777 Franklin, Sfld, MI
hsu@uicsrd.csrd.uiuc.edu (William Tsun-Yuk Hsu) (10/20/89)
In article <220@dg.dg.com> chris@dg.dg.com (Chris Moriondo) writes: > >The only really scalable interconnect schemes of which I am aware are >multistage interconnects which grow (N log N) as you linearly increase the >numbers of processors and memories... > >While message passing multicomputers maximize programmer effort in the sense >that they don't lend themselves to "dusty deck" programs, they have the >advantage that the interconnect costs scale linearly with the size machine. Ummm, message passing does not necessarily mean a single-stage interconnect. Also, most commercial message passing systems these days are hypercubes, and it's oversimplifying to claim that the cost of the hypercube interconnect scales linearly with system size. Remember that there are O(logN) ports per processor. Check out the paper by Abraham and Padmanabhan in the '86 International Conference on Parallel Processing, for another view on interconnect cost and performance comparisons. Most point-to-point parallel architectures whose interconnect cost really does grow only linearly with the system size (i.e. constant fan-out per processor) tend to be things like rings and meshes, which are less popular for more general purpose parallel computing. Are you referring to these rather than hypercubes? Bill Hsu
brooks@vette.llnl.gov (Eugene Brooks) (10/20/89)
In article <9078@batcomputer.tn.cornell.edu> kahn@tcgould.tn.cornell.edu writes: >The Cray-XMP is considerably slower than the YMP. The YMP is 30% faster than the XMP I was referring to. This is for scalar dominated compiled code and is a rather general result. Just in case you doubt my sources, I run codes on both a YMP 8/32 and an XMP 4/16 frequently enough to be a good judge of speed. >The single-processor XMP is no longer a supercomputer. Only if the difference between supercomputer and not is a 30% speed increase. I argue that a 30% speed increase is not significant, a frigging factor of 2 is not significant from my point of view. Both the XMP and the YMP are in the same class. Perhaps later YMPs will have more memory putting them in a slightly improved class. >Take a program requiring more than 128MBytes of memory (or 64 MBytes >for that matter (but I personally prefer more than 256M to exercise the >VM system a little!)) (i.e. a relatively BIG job, a *supercomputer* job) >and then compare any micro you want >or any other system you want with the YMP, or something in >that class, and then try it on a multiprocessor YMP. And please >STOP USING A SINGLE-PROCESSOR xmp AS THE DEFINITION OF A SUPERCOMPUTER, >thank you. I have no interest in single cpu micros with less than 128MB. I prefer 256 MB. I want enough main memory to hold my problems. >And it would be nice if people used "LIST PRICE" for "COMPLETE SYSTEMS" >when comparing prices. (LIST PRICE = PEAK PRICE !!) (COMPLETE SYSTEM = >with all needed software and a few GBytes of disk with a few controllers) I am talking list price for the system. A frigging XMP eating micro with suitable memory, about 64 meg at the minimum, can be had for 60K. The YMP costs about 3 million a node. The micro matches its performance for my applications. Which do you think I want to buy time on? Of course, I prefer a 3 million dollar parallel micro based system which has 50-100 nodes and runs circles around the YMP processor for my application. brooks@maddog.llnl.gov, brooks@maddog.uucp
brooks@vette.llnl.gov (Eugene Brooks) (10/20/89)
In article <MCCALPIN.89Oct18103933@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes: >I think that it is interesting that you expect the same users who >can't vectorize their codes on the current vector machines to be able >to figure out how to parallelize them on these scalable MIMD boxes. I can only point out specific examples which I have experience with. For certain Monte Carlo radiation transport codes, vectorization is a very painful experience which involves much code rewriting to obtain meager performance increases. I have direct experience with such a vectorization effort on a "new" and not dusty deck code. We got a factor of 2 as the upper bound for performance increases from vectorization on the XMP. The problem was all the operations performed under masks. LOTS of wasted cycles. The same problem, however, was easily coded in an EXPLICITLY PARALLEL language and obtained impressive speedups of 24 out of 30 processors on a Sequent Symmetry. It ran at 2.8 times XMP performance on hardware costing much less. We are moving on to a 126 processor BBN Butterfly-II now which should deliver more than 40 times the performance of the XMP at similar system cost. >It seems to me that the automatic parallelization problem is much >worse than the automatic vectorization problem, so I think a software >fix is unlikely.... Automatic vectorization is much easier than automatic parallelization in a global sense. This is why high quality vectorizing compilers exist, in addition to the high availability of hardware, and why automatic GLOBALLY parallelizing compilers don't. The problem with some codes is that they must be globally parallelized, and right now an explicitly parallel lingo is the way to get it done. >In fact, I think I can say it much more strongly than that: >Extrapolating from current experience with MIMD machines, I don't >think that the fraction of users that can use a scalable MIMD >architecture is likely to be big enough to support the economies of >scale required to compete with Cray and their vector machines. (At >least for the next 5 years or so). I do not agree; LLNL (a really big user of traditional supercomputers) has hatched the Massively Parallel Computing Initiative to achieve this goal on a broad application scale within 3 years. We will see what happens... >What is driving the flight from traditional supercomputers to >high-performance micros is turnaround time on scalar codes. From my >experience, if the code is really not vectorizable, then it is >probably not parallelizable either, and scalable machines won't scale. Not true; I have several counterexamples of highly parallel but scalar codes. >The people who can vectorize their codes are still getting 100:1 >improvements going to supercomputers --- my code is over 500 times >faster on an 8-cpu Cray Y/MP than on a 25 MHz R-3000/3010. So the >market for traditional supercomputers won't disappear, it will just be >more limited than many optimists have predicted. Yes, using all 8 cpus on the YMP and if each cpu is spending most of its time doing 2 vector reads, a multiply and an add, and one vector write, all chained up it will run circles around the current killer micros which are tuned for scalar performance. This situation will change in the next few years. brooks@maddog.llnl.gov, brooks@maddog.uucp
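A rough sketch of why work under a mask caps the payoff Brooks describes: a masked vector pipe streams every element through and discards the dead ones, so if the masked work dominates the cost of the tests, the useful speedup is roughly the mask density times the pipe's raw advantage. The numbers below are made-up illustrations, not figures from the Monte Carlo code under discussion.

    #include <stdio.h>

    int main(void)
    {
        double raw_vector_speedup = 10.0;  /* assumed pipe advantage over scalar */
        double density[] = { 0.5, 0.2, 0.1, 0.05 };
        int i;

        for (i = 0; i < 4; i++) {
            double f = density[i];                   /* fraction of elements live */
            double useful = f * raw_vector_speedup;  /* speedup on useful work    */
            printf("mask density %.2f: effective speedup %4.1f\n", f, useful);
        }
        return 0;
    }

Once the density drops near the reciprocal of the pipe's advantage, the vector version is no faster than scalar code that simply skips the dead elements, which is consistent with the modest factor-of-2 ceiling reported above.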
rod@venera.isi.edu (Rodney Doyle Van Meter III) (10/20/89)
In article <490@ctycal.UUCP> ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes: > >Note that I'm not saying to give up on parallel processing; on the contrary >I believe that it is the only way to do certain activities. I am saying >that the notion of a general purpose massively parallel architecture that >efficiently executes all kinds of algorithms is probably a naive and >simplistic view of the world. Depends on how you classify "all" algorithms. Nary a machine ever made is good at every algorithm ever invented. I suspect fine-grain SIMD machines are the way to go for a broader class of algorithms than we currently suspect. Cellular automata, fluid flow, computer vision, certain types of image processing and computer graphics have all shown themselves to be amenable to running on a Connection Machine. I'm sure the list will continue to grow. In fact Dow Jones himself now owns two; anybody know what he's doing with them? Peak performance for a CM-2, fully decked out, is on the order of 10 Gflops. This is with 64K 1-bit processors and 2K Weitek FP chips. The individual processors are actually pretty slow, 10-100Kips, I think. Imagine what this baby'd be like if they were actually fast! Their Datavault only has something like 30MB/sec transfer rate, which seems pretty poor for that many disks with that much potential bandwidth. Rumors of a CM-3 abound. More memory (1 Mbit/processor?), more processors (I think the addressing for processors is already in the neighborhood of 32 bits), more independent actions perhaps going as far as local loops, etc. I was told by a guy from Thinking Machines that they get two basic questions when describing the machine: 1) Why so many processors? 2) Why so few processors? Answering the second one is easy: It was the most they could manage. Answering the first one is harder, because the people who ask tend not to grasp the concept at all. What do I think? I think the next ten years are going to be very interesting! --Rod
brooks@vette.llnl.gov (Eugene Brooks) (10/20/89)
In article <220@dg.dg.com> chris@dg.dg.com (Chris Moriondo) writes: >The only really scalable interconnect schemes of which I am aware are >multistage interconnects which grow (N log N) as you linearly increase the >numbers of processors and memories. So in the limit the machine is essentially >ALL INTERCONNECT NETWORK, which obviously costs more than the processors and >memories. (Maybe this is what SUN means when they say "The Network IS the >computer"? :-) How do you build a shared-memory multi where the cost of the >interconnect scales linearly? Obviously I am discounting busses, which don't >scale well past very small numbers of processors. The cost of the interconnect can't be made to scale linearly. You can only get a log N scaling per processor. The key is the base of the log and not having N too large, i.e. using a KILLER MICRO and not a pipsqueak. Eight by eight switch nodes are practical at this point, with four by four being absolutely easy. Pin count is the main problem, not silicon area. Assuming 8x8 nodes, a 512 node system takes three stages, a 4096 node system takes 4 stages. Are 4 switch chips cheaper, or equivalent in cost to a killer micro and 32 meg of memory? Sun's "The network is the computer" is meant for ethernet types of things but it really does apply to multiprocessors. If you don't have real good communication capability between the computing nodes what you can do with the machine is limited. Could anyone handle a KILLER MICRO powered system with 4096 nodes? Just think, 4096 times the power of a YMP for scalar but MIMD parallel codes. ~400 times the power of a YMP cpu for vectorized and MIMD parallel codes. It boggles the mind. brooks@maddog.llnl.gov, brooks@maddog.uucp
brooks@vette.llnl.gov (Eugene Brooks) (10/20/89)
>Assuming 8x8 nodes, a 512 node system takes three stages, a 4096 node >system takes 4 stages. Are 4 switch chips cheaper, or equivalent in >cost to a killer micro and 32 meg of memory? Oops! It should be: are 4 switch chips cheaper than 8 killer micros and 256 Meg of memory? The switch is 4 stages deep, but there are 8 micros hung on each switch port. The bottom line is that the switch is probably not more than half the cost of the machine, even given the fact that it is not a commodity part. Of course, a good design for the switch chip and node interface might become a commodity part! Depending on the cache hit rates one might hang more than one micro on each node and further amortize the cost of the switch. brooks@maddog.llnl.gov, brooks@maddog.uucp
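The stage and chip counts being corrected here follow from simple counting: with k-by-k switch chips, an N-processor multistage network needs log-base-k-of-N stages of N/k chips each. A sketch that reproduces the 8x8 cases in the two postings (this is only the counting argument, not a costed design):

    #include <stdio.h>

    int main(void)
    {
        int k = 8;                        /* 8x8 switch chips        */
        int sizes[] = { 512, 4096 };
        int i;

        for (i = 0; i < 2; i++) {
            int n = sizes[i];
            int stages = 0, span = 1, chips;

            while (span < n) {            /* stages = ceil(log_k(n)) */
                span *= k;
                stages++;
            }
            chips = stages * (n / k);     /* n/k chips in each stage */
            printf("%4d nodes: %d stages, %4d switch chips (%d per %d micros)\n",
                   n, stages, chips, chips * k / n, k);
        }
        return 0;
    }

For 512 nodes this gives 3 stages and 3 switch chips per group of 8 micros; for 4096 nodes, 4 stages and the 4-chips-per-8-micros ratio in the correction above.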
khb%chiba@Sun.COM (Keith Bierman - SPD Advanced Languages) (10/20/89)
In article <33870@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes: >I agree with you here. In fact, did anyone notice a recent newspaper article >(In Tuesday's Merc. News - from Knight Ridder:) > >"Control Data to use Mips design" > >"Control Data Corp. has cast its lot with Mips Computer Systems, inc. to design >the brains of its future computers, choosing a new computer architecture >developed by the Sunnyvale Company." CDC has been selling the MIPS based SGI workstation under its label for a while now ... so this is either total non-news ... or CDC has simply decided to cut SGI out of the picture. When I had a chance to play with the CDC labeled SGI box I couldn't find _any_ differences from the SGI equivalent (except that the SGI had a newer software release and different power up message). Keith H. Bierman |*My thoughts are my own. !! kbierman@sun.com It's Not My Fault | MTS --Only my work belongs to Sun* I Voted for Bill & | Advanced Languages/Floating Point Group Opus | "When the going gets Weird .. the Weird turn PRO" "There is NO defense against the attack of the KILLER MICROS!" Eugene Brooks
rodger@chorus.fr (Rodger Lea) (10/20/89)
From article <17045@cfctech.UUCP>, by joel@cfctech.UUCP (Joel Lessenberry): > It is IBM's most advanced system.. > > Single level storage ^^^^^ At last !! > Object Oriented Arch What exactly do you/they mean by object oriented. Are we talking something along the lines of the intel approach ? I would be interested in details - anybody in the know ? Rodge rodger@chorus.fr
munck@chance.uucp (Robert Munck) (10/20/89)
In article <1259@crdos1.crd.ge.COM> davidsen@crdos1.UUCP (bill davidsen) writes: > > The Intel 80386 has 32 bit segments, but its still a segmented system, >and the virtual address space is (I believe) 40 bits. You're both too high and too low. The 386 supports 16,384 segments of up to 4GB, 14 bits plus 32 bits => 46 bit addresses. HOWEVER, the segments map into either real memory (page translation disabled), maximum 4GB, or linear virtual memory (paging enabled), also maximum 4GB. Virtual addresses are 46 bits and the virtual address space is 4GB. I think it's cute. -- Bob <Munck@MITRE.ORG>, linus!munck.UUCP -- MS Z676, MITRE Corporation, McLean, VA 22120 -- 703/883-6688
hascall@atanasoff.cs.iastate.edu (John Hascall) (10/20/89)
In article <3394@chorus.fr> rodger@chorus.fr (Rodger Lea) writes: }From article <17045@cfctech.UUCP>, by joel@cfctech.UUCP (Joel Lessenberry): }> It is IBM's most advanced system.. }> Object Oriented Arch } What exactly do you/they mean by object oriented. Are we }talking something along the lines of the intel approach ? The AS/400 architecture makes the VAX architecture look like RISC--it is *so* CISC!! As I understand it, there are 2 levels of microcode. Your instruction (I was told one instruction was "create database") executes the top level of microcode which in turn executes the bottom level of microcode which in turn actually causes the hardware to do something. Most unusual. John
bga@odeon.ahse.cdc.com (Bruce Albrecht) (10/21/89)
In article <126561@sun.Eng.Sun.COM>, khb%chiba@Sun.COM (Keith Bierman - SPD Advanced Languages) writes: > CDC has been selling the MIPS based SGI workstation under its label > for a while now ... so this is either total non-news ... or CDC has > simply decided to cut SGI out of the picture. As far as I know, CDC will still be selling SGI workstations. CDC will be working with Mips directly to develop high-performance versions of the Mips architecture.
chris@dg.dg.com (Chris Moriondo) (10/21/89)
In article <1989Oct19.172050.20818@ux1.cso.uiuc.edu> hsu@uicsrd.csrd.uiuc.edu (William Tsun-Yuk Hsu) writes: >In article <220@dg.dg.com> chris@dg.dg.com (Chris Moriondo) writes: >> >>The only really scalable interconnect schemes of which I am aware are >>multistage interconnects which grow (N log N) as you linearly increase the >>numbers of processors and memories... >> >>While message passing multicomputers maximize programmer effort in the sense >>that they don't lend themselves to "dusty deck" programs, they have the >>advantage that the interconnect costs scale linearly with the size machine. > >Ummm, message passing does not necessarily mean a single-stage >interconnect. Also, most commercial message passing systems these >days are hypercubes... Too right. I confess I was thinking more along the lines of the current crop of fine-grained mesh-connected message-passing multicomputers that are being worked on at CALTECH (Mosaic) and MIT (the Jelly-bean machine and the Apiary.) At least with machines of this ilk you only pay message latency proportional to how far you are communicating, rather than paying on every (global) memory reference with the shared-memory approach. Some of the hot-spot contention results indicate that the cost of accessing memory as seen by a processor might bear little relationship to its own referencing behavior. >...and it's oversimplifying to claim that the >cost of the hypercube interconnect scales linearly with system size. >Remember that there are O(logN) ports per processor. With hypercubes, what concerns me more than the scaling of the number of ports is the scaling of the length of the longest wires, and the scaling of the number of wires across the midpoint of the machine. (Unless of course you can figure out a way to wire your hypercube in hyperspace... :-)
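The wires-across-the-midpoint worry can be made concrete with a counting sketch: a binary hypercube's bisection grows linearly with the number of nodes, while a square 2-D mesh's grows only as the square root, which is one reason the wiring of large hypercubes is the concern here. Pure counting, no technology assumptions:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int n;

        for (n = 64; n <= 4096; n *= 4) {
            double cube_cut = n / 2.0;          /* binary n-cube bisection   */
            double mesh_cut = sqrt((double)n);  /* square 2-D mesh bisection */
            printf("%5d nodes: hypercube %6.0f links across the cut, 2-D mesh %3.0f\n",
                   n, cube_cut, mesh_cut);
        }
        return 0;
    }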
vorbrueg@bufo.usc.edu (Jan Vorbrueggen) (10/21/89)
In article <10200@venera.isi.edu> rod@venera.isi.edu.UUCP (Rodney Doyle Van Meter III) writes: >In article <490@ctycal.UUCP> ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes: >> ... I am saying >>that the notion of a general purpose massively parallel architecture that >>efficiently executes all kinds of algorithms is probably a naive and >>simplistic view of the world. >Depends on how you classify "all" algorithms. Nary a machine ever made >is good at every algorithm ever invented. I learned in school that it is hard to write a good numerical algorithm (e.g., to solve differential equations), but fairly easy to find an example that makes it fall flat. Maybe the same applies to building computers :-) Rolf
seanf@sco.COM (Sean Fagan) (10/21/89)
In article <9078@batcomputer.tn.cornell.edu> kahn@tcgould.tn.cornell.edu writes: >The Cray-XMP is considerably slower than the YMP. >The single-processor XMP is no-longer a supercomputer. >Take a program requiring more than 128MBytes of memory (or 64 MBytes >for that matter (but I personally prefer more than 256M to excerice the >VM system alittle!)) (i.e. a relatively BIG job, a *supercomputer* job) What?! Uhm, exercising the Cray's VM system is definitely going to be an interesting job -- Seymour doesn't *believe* in VM! (Well, anecdote has it that he doesn't *understand* it 8-).) I have mixed feelings about VM (as anybody who's seen more than three of my postings probably realizes 8-)): on one hand, yes, getting page faults will tend to slow things down. However, the system can be designed, from a software point of view, in such a way that page faults will be kept to a minimum. Also, having about 4 Gbytes of real memory tends to help. And, face it, swapping programs in and out of memory can be a time-consuming process, even on a Cray -- if you're dealing with 100+ Mword programs! Other supercomputers have VM, of course. However, I have never gotten the chance to play on, say, an ETA-10 to compare it to a Cray (I asked someone, once, at FSU for an account, and I was turned down 8-)). My personal opinion is that the machine is not as fast for quite a number of applications, but having the VM might help it beat a Cray in Real-World(tm) situations. Anybody got any data on that? And, remember: memory is like an orgasm: it's better when it's real (paraphrasing Seymour). 8-) -- Sean Eric Fagan | "Time has little to do with infinity and jelly donuts." seanf@sco.COM | -- Thomas Magnum (Tom Selleck), _Magnum, P.I._ (408) 458-1422 | Any opinions expressed are my own, not my employers'.
stein@dhw68k.cts.com (Rick Stein) (10/22/89)
In article <220@dg.dg.com> chris@dg.dg.com (Chris Moriondo) writes: >In article <35977@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >While message passing multicomputers maximize programmer effort in the sense >that they don't lend themselves to "dusty deck" programs, they have the >advantage that the interconnect costs scale linearly with the size machine. Indeed, the "dusty deck" (aka toxic waste dump) is generally not organized to exploit the linear scalable potential of the multicomputer. To my knowledge, no university in the U.S. teaches how to create linear scalable software, the cornerstone of multicomputers. Until the shared-memory s/w engineering styles are abandoned, no real progress in multicomputing can begin (at least in this country). Europe and Japan are pressing on without (despite us). -- Richard M. Stein (aka, Rick 'Transputer' Stein) Sole proprietor of Rick's Software Toxic Waste Dump and Kitty Litter Co. "You build 'em, we bury 'em." uucp: ...{spsd, zardoz, felix}!dhw68k!stein
pcg@emerald.cs.aber.ac.uk (Piercarlo Grandi) (10/23/89)
In article <17045@cfctech.UUCP> joel@cfctech.UUCP (Joel Lessenberry) writes:
> is anyone else out there interested in starting an AS/400 thread?
> It is IBM's most advanced system..
> Single level storage
> Object Oriented Arch
> Context addressing
> Hi level machine Instruction set
> 64 bit logical addressing
> True complete I/D split, no chance for self modifying code
Rumours exist that the AS/400 (nee S/38) is the result of putting
Peter Bishop's dissertation (a landmark work) "Very large address
spaces and garbage collection", MIT TR 107, in the hands of the
same team that had designed the System/3 (arrgghh!). IMNHO the
S/38 is a poor implementation of a great design. That it still is
good is more a tribute to the great design than to the
implementation skills of the System/3 "architects".
--
Piercarlo "Peter" Grandi | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
brooks@vette.llnl.gov (Eugene Brooks) (10/23/89)
In article <27203@dhw68k.cts.com> stein@dhw68k.cts.com (Rick Stein) writes a followup to something attributed to me, but 180 degrees out of phase with my opinion on the great shared memory vs message passing debate: >Indeed, the "dusty deck" (aka toxic waste dump) is generally not organized >to exploit the linear scalable potential of the multicomputer. To my >knowledge, no university in the U.S. teaches how to create linear scalable >software, the cornerstone of multicomputers. Until the shared-memory >s/w engineering styles are abandonded, no real progress in multicomputing >can begin (at least in this country). Europe and Japan are pressing on >without (despite us).> The posting he quoted here was incorrectly attributed to me. It was in fact someone's retort to something I wrote. Scalable shared memory machines, which provide coherent caches (local memory where shared memory is used as such), are buildable, usable, and cost effective. Some students and professors at Caltech, which included someone by the name of Brooks before his rebirth into the "real" world of computational physics, were so desperate for computer cycles that they sidetracked the parallel computer industry by hooking up a bunch of Intel 8086-8087 powered boxes together in a system with miserable communication performance. Industry, in its infinite wisdom, followed their lead by providing machines with even poorer communication performance. When you quote, please be sure to get the right author when it is from a message with several levels of quoting. I had something to do with the message passing hypermania, but it is not my party line these days.... brooks@maddog.llnl.gov, brooks@maddog.uucp
carr@mfci.UUCP (George R Carr Jr) (10/23/89)
In article <MCCALPIN.89Oct16141656@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes: > .... [software for] >parallel supercomputers is depressingly immature. I think traditional >moderately parallel machines (e.g. Cray Y/MP-8) will be able to handle >existing scientific workloads better than 1000-processor parallel >machines for quite some time.... I know of several problem domains where I strongly disagree. More than one aerospace company is currently looking at 1000+ node parallel machines because no Cray, ETA, NEC, or other 'conventional' machine can give them the time to solution required. The major area of excitement with parallel machines is finding the problems for which algorithms exist that become computable on them but are not computable otherwise. George R Carr Jr internet: carr@multiflow.com Multiflow Computer, Inc. uucp: uunet!mfci!mfci-la!carr 16360 Roscoe Blvd., Suite 215 fax: (818)891-0395 Van Nuys, CA 91406 voice: (818)892-7172
kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/23/89)
In article <36232@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >The YMP is 30% faster than the XMP I was referring to. This is >for scalar dominated compiled code and is a rather general result. If you have scalar dominated code that fits in a workstation's memory and you don't want to run more than one job at a time, then you are right. I am sure other users of the YMP will be happy to keep the machine busy and get good 64-bit megaflops. >>The single-processor XMP is no longer a supercomputer. >Only if the difference between supercomputer and not is a 30% speed increase. I have little desire to defend or promote a YMP, but you can't run a scalar code on a vector machine and complain, too! On the NASA benchmarks, which I am sure some of this audience has seen, the YMP sustained over 1 GFlops. THAT is significantly faster than a single processor XMP. REWRITE the code!! Or have someone do it for you (there was a company that would get your code to run at least twice as fast or your money back, I forget the name and don't know them or anyone who does). If they don't perform, throw away all the dusty decks. Refuse to use dusty-deck oriented code. But if that's all the algorithm can do for now, then yes, use whatever gives you the desired performance at the least life-time cost (not price!) >I have no interest in single cpu micros with less than 128MB. >I prefer 256 MB. I want enough main memory to hold my problems. A 256 MB micro can cost you some. And not so little. And all that for just one user. I am not sure the numbers come out. And how about I/O bandwidth and file size? Maybe your application doesn't need any. Talk to a Chemist. By the time micros become killers, they won't be micros anymore! >I am talking list price for the system. A frigging XMP eating micro with Yes. My comment about list-price was not directed at Eugene. Sorry. I meant to emphasize the importance of using peak-price to go with peak-performance (I have seen cases where the reported performance is on a high-end machine, but the reported price is not!).
grunwald@Tokyo.ira.uka.de (Grunwald Betr. Tichy) (10/23/89)
I have followed the articles for some time and want to mention some points. 1. Hardware costs are only a fraction of the cost. To do really big problems you need lots of support software and you rely on it. So if you use a PC you will have to write more code (or buy specialized code at a high price) and trust your version. This is hard, because numerical mathematics is not as easy as it seems, and if your aircraft comes down or your bridge cracks, it's too late to blame yourself. 2. Parallel computers will need a Pascal (C, Modula, Ada, ..) like language which can be compiled and run on a scalable architecture. Nobody wants to rewrite all programs when he gets more processors. It would be even better to have it scale at runtime, so the program runs faster if no other users want the processors too. I know only the Connection Machine doing that, and this machine is not as general purpose as a workstation. (What OS does the CM have? What languages? Can you compile a CM program to work on other computers? (not simulated)) 3. Some problems are just too big for a PC. Even if you have a more sophisticated system than the normal Primitive Computer, there are a lot of problems which have already been scaled down to run on supercomputers. So further downscaling is not possible without a substantial loss of accuracy. (Accuracy is not only the length of a floating point number. It's how many points your grids can have. What differential equations are possible? What about error control? (It's useless getting wrong results faster. You have to know about the error range.)) My opinion is that supercomputers will exist a long time in the future and MICROS still have a long way to go to match the performance. Most people comparing the power don't think of the background of the numbercrunchers, and that is lots of software packages and big disks to record the results, which is a big part of the machine's cost. Don't get me wrong: I'm a Micro User (OS9-680x0) and I like it, but I know that things are not so easy in the supercomputing area as some people might think. Knut Grunwald, Raiffeisenstr. 8, 7555 Elchesheim-Illingen, West-Germany
brooks@vette.llnl.gov (Eugene Brooks) (10/23/89)
In article <9119@batcomputer.tn.cornell.edu> kahn@batcomputer.tn.cornell.edu (Shahin Kahn) writes: >If you have scalar dominated code that fits in a workstation's memory One should not attempt to infer that a workstation's memory is small. A YMP 8/32 has 4 megawords (32 MB) available per processor. If all you want is 32 MB per processor you can buy this with a killer micro for about 40K, simply throw it away in a year when its performance has been eclipsed by the next killer micro, and still have your computer time work out to be about 5 dollars an hour. They have the gall to charge $250 an hour for Cray YMP time, for low priority time at that. >THAT is significantly faster than a single processor XMP. > >REWRITE the code!! Or have someone do it for you (there was a company >that would get your code to run at least twice as fast or your money back, >I forget the name and don't know them or anyone who does). We did! And we showed that you could asymptotically get the factor of 2 you suggest with infinite work. Why suggest doing such a thing when one can get a factor of 100 with little work on 100 killer micros? >If they don't perform, >throw away all the dusty decks. Refuse to use dusty-deck oriented code. This was not a dusty deck. This code was written in the last couple of years with modern tooling, for both vectorized and MIMD parallel machines. It is not the code which is scalar, it is the algorithm. One could say toss out the algorithm, but it is one of the most robust ones available for the application in question. >A 256 MB micro can cost you some. And not so little. But it is much cheaper than a SUPERCOMPUTER for my application, and it is FASTER. To bring back the car analogy, the accelerator is still pressed to the metal for speed improvements in killer micros. brooks@maddog.llnl.gov, brooks@maddog.uucp
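The five-dollars-an-hour remark is just an amortization; a sketch using the prices quoted in the posting, with the machine written off after one year of round-the-clock use:

    #include <stdio.h>

    int main(void)
    {
        double micro_price    = 40000.0;     /* killer micro with 32 MB, as quoted */
        double hours_per_year = 24.0 * 365.0;
        double cray_rate      = 250.0;       /* $/hour, low-priority YMP time      */

        double micro_rate = micro_price / hours_per_year;

        printf("micro works out to $%.2f per hour\n", micro_rate);
        printf("quoted YMP rate is about %.0f times that\n",
               cray_rate / micro_rate);
        return 0;
    }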
henry@utzoo.uucp (Henry Spencer) (10/23/89)
In article <74731@linus.UUCP> munck@chance.UUCP (Robert Munck) writes: >... The 386 supports 16,384 segments of up >to 4GB, 14 bits plus 32 bits => 46 bit addresses... Except that it's not a 46-bit address space, it's a bunch of 32-bit ones. There is a difference. As witness the horrors that are perpetrated on 8086/88/186/286 machines to try to cover up their lack of a unified address space. "Near" and "far" pointers, anyone? -- A bit of tolerance is worth a | Henry Spencer at U of Toronto Zoology megabyte of flaming. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
rang@cs.wisc.edu (Anton Rang) (10/23/89)
In article <36593@lll-winken.LLNL.GOV> brooks@vette.llnl.gov (Eugene Brooks) writes: > They have the gall to charge $250 an >hour for Cray YMP time, for low priority time at that. You haven't got much reason to complain...out here I have the privilege of spending $300/hour for VAX-11/785 time... :-) Schools can be SO much fun.... Anton +----------------------------------+------------------+ | Anton Rang (grad student) | rang@cs.wisc.edu | | University of Wisconsin--Madison | | +----------------------------------+------------------+
henry@utzoo.uucp (Henry Spencer) (10/23/89)
In article <27203@dhw68k.cts.com> stein@dhw68k.cts.com (Rick Stein) writes: >...no university in the U.S. teaches how to create linear scalable >software, the cornerstone of multicomputers. Until the shared-memory >s/w engineering styles are abandonded, no real progress in multicomputing >can begin (at least in this country). Europe and Japan are pressing on >without (despite us).> What remains to be seen is whether they are pressing on up a blind alley. Remember where this discussion thread started out: the mainstream of high-volume development has vast resources compared to the more obscure byways. Results from those byways have to be awfully damned good if they are going to be competitive except in ultra-specialized niches. As I've mentioned in another context, "gonna have to change our whole way of thinking to go parallel real soon, because serial's about to run out of steam" has been gospel for quite a while now... but the difficulty of that conversion has justified an awful lot of highly successful work on speeding up non-parallel computing. Work which is still going and still succeeding. I'm neutral on the nationalism -- you're all foreigners to me :-) -- but highly skeptical on the parallelism. -- A bit of tolerance is worth a | Henry Spencer at U of Toronto Zoology megabyte of flaming. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
wen-king@cit-vax.Caltech.Edu (King Su) (10/24/89)
In article <36549@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >...................................... Some students and professors at <Caltech, which included someone by the name of Brooks before his rebirth into >the "real" world of computational physics, were so desperate for computer <cycles that they sidetracked the parallel computer industry by hooking up >a bunch of Intel 8086-8087 powered boxes together in a system with miserable <communication performance. Industry, in its infinite wisdom, followed their >lead by providing machines with even poorer communication performance. Huh? As far as I know, every commercially available multicomputer that was built after our original multicomputer has better communication performance. We did not lead anybody into anything, as Caltech CS has never been a strong influence on the industry. Nor have we advocated low communication performance. Today's multicomputers are as much as three orders of magnitude better in message latency and throughput, thanks to worm-hole routing hardware. There will be further improvements when low-dimensional networks are in use. Perhaps we could have provided more positive influence on the industry, but we are operating under the guideline that university research groups should not be turned into joint-ventures. The taxpayers did not give us money for us to make more money for ourselves. -- /*------------------------------------------------------------------------*\ | Wen-King Su wen-king@vlsi.caltech.edu Caltech Corp of Cosmic Engineers | \*------------------------------------------------------------------------*/
gil@banyan.UUCP (Gil Pilz@Eng@Banyan) (10/27/89)
In article <12345@cit-vax.Caltech.Edu> wen-king@cit-vax.UUCP (Wen-King Su) writes: >Perhaps we could have provided more positive influence on the >industry, but we are operating under the guideline that university >research groups should not be turned into joint-ventures. The >taxpayers did not give us money for us to make more money for ourselves. Why not ? If you start off a (successful) joint venture won't you end up employing people ? People who will be paying taxes on the money they make as well as buying goods and services etc. It would seem that the "tax payers" would be much better off if a state-funded research group _were_ turned into a joint venture rather than let its research be used later by someone else outside of the taxpaying area (i.e. it's better for California if the research funded in California schools went to build businesses in California rather than, say Texas . . at a national level it all evens out, but locally it does make a difference . . this is why the whole Massachusetts "Miracle" schtick is such a joke . . a "Miracle" wow ! . . lots of schools & research ==> lots of start-up companies ==> a moderate number of successful companies ==> money coming in . . amazing ! education works !) -=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=- Gilbert W. Pilz Jr. gil@banyan.com Banyan Systems Inc. (617) 898-1196 -=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-
frazier@oahu.cs.ucla.edu (Greg Frazier) (10/27/89)
In article <562@banyan.UUCP> gil@banyan.com writes: >In article <12345@cit-vax.Caltech.Edu> wen-king@cit-vax.UUCP (Wen-King Su) writes: >>Perhaps we could have provided more positive influences to the >>industry, but we are operating under the guideline that university >>research groups should not be turned into joint-ventures. The taxpayers >>did not give us money for us to make more money for ourselves. > >Why not? If you start off a (successful) joint venture won't you >end up employing people? People who will be paying taxes on the money >they make as well as buying goods and services etc. It would seem >that the "tax payers" would be much better off if a state-funded >research group _were_ turned into a joint venture rather than to let [ etc about benefits of start-up and Mass miracle ] No, the problem is that the taxpayer outlay for the start-up is not compensated by the jobs "created" or the taxes received. I put "created" in quotes, because I do not believe that all of these jobs are created - the increased competition is going to cost somebody something, be it another startup which goes under, a major corp which loses some market share and lays people off, or the reduction of a substitutive market, such as typewriter manufacturers. But let's not get into social dynamics - even if the jobs are all "created", the taxpayer has footed a major bill, the rewards of which will, for the most part, end up in only a few people's pockets. To address the Mass miracle, it wasn't the concentration of major universities which brought it about, it was the concentration of defense contractors, which is why the carpet was pulled out from under Mass when the defense cutbacks went through (you will recall Mass has had deficits recently - no more miracle). Greg Frazier @@@@@@@@@@@@@@@@@@@@@@@@))))))))))))))))))##############3 "They thought to use and shame me but I win out by nature, because a true freak cannot be made. A true freak must be born." - Geek Love Greg Frazier frazier@CS.UCLA.EDU !{ucbvax,rutgers}!ucla-cs!frazier
kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/28/89)
In article <36593@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >>If you have scalar dominated code that fits in a workstation's memory >One should not attempt to infer that a workstation's memory is small. >An YMP 8/32 has 4 megawords (32 MB) available per processor. If all you >want is 32 MB per processor you can buy this with a killer micro for >about 40K, simply throw it away in a year when its performance has been >eclipsed by the next killer micro, and still have your computer time work Point well taken. One of the reasons I have little desire to defend or promote a YMP is precisely that. The YMP has 128 MWords of memory, by the way. This is for 8 processors. The Cray-2 has 256 MWords (but a terrible latency, even in the S model). (The Cray C90 is supposed to have 512 MW, and the Cray-4 1000+ MWords, but these are paper machines for now.) So, the point is that fp performance alone does not make a supercomputer anymore (surprise, surprise!). My feeling these days is that one needs a sophisticated VM system, with a large hierarchical memory system, and first-rate I/O and networking, so that one could run a single job very fast (the traditional domain of supers has been just this; they've been "benchmark machines" in my opinion) but you could also sit in a network and handle many users and many jobs (many = say, 37)! And on top of that, you need libraries, compilers, debuggers, editors, profilers, etc. The emergence of the powerful micro is welcome, indeed. And when they can be ganged up and you know how to program them and have an application that uses their strengths and does not exercise their weaknesses... indeed, they are fast. BUT: 1) they don't have the software, libraries, compilers, etc. 2) they often have a low bandwidth connection to a not-so-strong front-end 3) they can't handle I/O so well yet. 4) there are no standards for anything 5) you need a pretty large job to get speed-up, anyway. Remember, the way the guys at Sandia got their great speed-ups was to make the jobs larger. Much larger. You need over 99.99% parallelism for a 1000 processor machine! So the parallel part of your program should be allowed to grow (and fortunately, if the algorithm is parallelizable, the parallel parts tend to grow faster than the serial parts in many cases, if not most. Same thing with vectorizable parts if the algorithm is vectorizable). Except that it just so happens that when you have a large job, it also runs much faster on a super! (a modern super with lots of memory, that is, not the OLD definition of super.) My point is that there is no point in getting too excited about highly parallel machines, nor about fast microprocessors. A micro is not called a micro just because it has a microprocessor in it. Not anymore. It usually has a low bandwidth memory system, not much of a cache, not much of a VM system, not much I/O, etc. That's what keeps the price down. (Prices are going down for supers, too.) It's great to have a fast micro on your desk, but it'll have plenty to do rendering the data that you got from the super! and delivering mail, etc. And if you want to gang them up, you'll end up paying exactly as much as you would if you got a super, maybe more! This is how it will end up being. If you had all the software and all the I/O and all the disk and all the networking, etc., you'll have to pay for those! Hardware costs will not differ enough to bury it. (I am comparing a contemporary super with a contemporary high-end parallel system.
They could very well be the same thing in the 5 years that was specified. All supers are multiprocessors now and are increasing the number of processors. So that's another reason why you'll be paying exactly the same price, if not more!) Conclusion: Like a teacher said a long time ago, there is the law of conservation of difficulty!! Highly parallel systems will NOT be a revolutionary deal where you suddenly can do something much more cheaply. It has been evolutionary. Which is why I said: by the time micros become killers, they won't be micros anymore. Highly parallel systems are good. They have merit. They are here to stay, etc. Fast micros are also nice. But let's not sensationalize the issues. And by the way, most of the Japanese machines achieve their speed by multiple functional units: more than one adder and one multiplier. And a final note about "pagemaker": no insult was intended. PageMaker is a sophisticated application that requires all components of the machine, from the cpu to the screen to the printer to font calculations, etc. It would have been quite unimaginable to try to do something like that on a computer 30 years ago. I think it was clear what I meant.
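To put the "99.99% parallelism for 1000 processors" remark above on a slightly firmer footing, here is a minimal Amdahl's-law sketch in C. The serial fractions tried are illustrative assumptions, not profiles of any real code.

    /* Amdahl's-law sketch: speedup = 1 / (s + (1 - s)/P) for serial
     * fraction s on P processors.  Fractions below are illustrative. */
    #include <stdio.h>

    static double amdahl(double serial_frac, double nproc)
    {
        return 1.0 / (serial_frac + (1.0 - serial_frac) / nproc);
    }

    int main(void)
    {
        double fracs[] = { 0.01, 0.001, 0.0001 };  /* serial fraction */
        int i;

        for (i = 0; i < 3; i++)
            printf("serial %.4f%%: speedup on 1000 procs = %.0f\n",
                   fracs[i] * 100.0, amdahl(fracs[i], 1000.0));
        return 0;
    }

Even a 0.1% serial residue caps the speedup near 500; only at 0.01% serial (99.99% parallel) does a 1000-processor machine get close to its nominal speed, which is exactly why the Sandia runs had to grow the problem size.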
kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/28/89)
In article <1218@iraun1.ira.uka.de> grunwald@Tokyo.UUCP (Grunwald Betr. Tichy) writes: >2. Parallel computers will need a Pascal (C,Modula,Ada,..) like language, which ... >Knut Grunwald, Raiffeisenstr. 8, 7555 Elchesheim-Illingen, West-Germany I agree with you on the points that you made. But your choice of languages was unexpected. I don't want to start a language/religion debate, but I do want to ask what language you use for supercomputing applications in Germany. I am looking to see if there are trends in different countries.
mash@mips.COM (John Mashey) (10/31/89)
In article <428@propress.com> pan@propress.com (Philip A. Naecker) writes: >Case in point: The R2000 chipset implemented on the R/120 (mentioned by others >in this conversation) has, by all measures *excellent* scalar performance. One >would benchmark it at about 12-14 times a microVAX. However, in real-world, >doing-useful-work, not-just-simply-benchmarking situations, one finds that >actual performance (i.e., performance in very simple routines with very simple >algorithms doing simple floating point operations) is about 1/2 that expected. Please be a little more specific, as this is contrary to large numbers of people's experience with "doing-useful-work, not-just-simply-benchmarking" situations. Note: it is perfectly possible that one can encounter realistic programs for which the performance is half of what is expected, on some given class of benchmarks. Is the statement above: a) The M/120 is really a 6-7X microVAX machine OR b) We've run some programs in which it is found to be a 6-7X uVAX machine. Note that, as posted, this reads more like a) than b), so please say more. >Why? Because memory bandwidth is *not* as good on a R2000 as it is on other >machines, even machines with considerably "slower" processors. There are >several components to this, the most important being the cache implementation >on an R/120. Other implementations using the R2000/R3000/Rx000 chipsets might >well do much better, but only with considerable effort and cost, both of which >mean that those "better" implementations will begin to approach the price/ >performance of the "big" machines that you argue will be killed by the >price/performance of commodity microprocessors. The R2000 in an M/120 indeed has a very simple memory system. The rest of the comments seem overstated to me: we just announced a new machine (the RC3240), which is a CPU-board upgrade to an M/120, uses an R3000, gains another 40-50% performance from the same old memory boards, and costs the same as an M/120 did when it was announced. If it had been designed from scratch, it would be faster, with little or no increase in cost. PLEASE look at the data on the various machines built of such parts. The one-word refill of the R2000 certainly slowed it down; the multi-word refill & instruction-streaming on the R3000 certainly help improve the kinds of programs that hurt an R2000, and the cost differences are really pretty minimal. In addition, if you look at R3000s in larger system designs, I think it is hard to claim that these implementations are anywhere near the price/performance (anywhere near as high price, that is) as those of "big" machines, at least for CPU performance. > >I think you are to a degree correct, but one must always tailor such >generalities with a dose of real-world applications. I didn't, and I got bit >to the tune of a fine bottle of wine. :-( Anyway, we all agree on that: "your mileage may vary". How about posting something on the particular applications to generate some insight about what these things are good for or not? -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
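As a rough illustration of why the refill policy mentioned above matters, here is a back-of-envelope CPI sketch in C. The base CPI, miss rate, and miss penalties are assumed numbers chosen only to show the mechanism; they are not MIPS measurements of the M/120 or RC3240.

    /* Effect of cache refill policy on effective CPI.  All numbers
     * are invented for illustration, not measured data. */
    #include <stdio.h>

    int main(void)
    {
        double base_cpi  = 1.0;   /* CPI with a perfect cache (assumed)      */
        double miss_rate = 0.05;  /* misses per instruction (assumed)        */
        double one_word  = 14.0;  /* cycles/miss, one-word refill (assumed)  */
        double block     = 8.0;   /* effective cycles/miss with multi-word
                                     refill + instruction streaming (assumed)*/

        printf("CPI, one-word refill: %.2f\n", base_cpi + miss_rate * one_word);
        printf("CPI, block refill:    %.2f\n", base_cpi + miss_rate * block);
        return 0;
    }

The point is only that shaving the per-miss penalty feeds straight into delivered performance without touching the CPU clock, which is consistent with the same memory boards gaining speed from a better refill scheme.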
mash@mips.COM (John Mashey) (10/31/89)
In article <9119@batcomputer.tn.cornell.edu> kahn@batcomputer.tn.cornell.edu (Shahin Kahn) writes: >If you have scalar dominated code that fits in a workstation's memory >and you dont want to run more than one job at a time, then you are right. Note: some of this discussion has seemed to assume that micro == workstation. To help unconfuse people, let us remember that the same CPU chip can be used in various different configurations, only some of which are workstations. Note that desktop workstations are unlikely to get enough memory to keep real supercomputer users happy, given the usual cost tradeoffs. This might not be true of big desksides, and is least of an issue for servers. >A 256 MB micro can cost you some. And not so little. And all that >for just one user. I am not sure the numbers come out. And how about IO >bandwidth and file-size. maybe your application doesnt need any. Again, note that the issue is not necessarily single-user workstations versus supercomputers, it's mixtures of desktops, desksides, and servers versus supercomputers. I.e., a whole lot of this discussion has seemed like a classic "domain-of-discourse" argument, in which the argument "A is true" gets heated replies of "No, it isn't", and should be converted to: In domain 1 (not very vectorizable), A is true. (micros are tough) But in domain 2 (elsewhere), A is not true. (micros are not so tough) This makes clear that the real argument is more like: How big are domains 1 & 2? Will that change? THUS: FOR SUPERCOMPUTER USERS: a) How much of your code is vectorizable? b) How much is parallelizable? c) How much mostly needs big memories? d) How much is dominated by turnaround time, cost-is-no-object processing? e) Do you have some more data points, i.e., SUPERCOMPUTER X versus microprocessor-based-system Y, including elapsed times & costs? In most of this discussion, we've gotten a few data points; more would help. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jesup@cbmvax.UUCP (Randell Jesup) (11/18/89)
In article <AGLEW.89Nov7130958@chant.urbana.mcd.mot.com> aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) writes: >What has happened is that better value is being provided, but also the >amount of money people are willing to spend on computing has gone up. >The system that bears the same relationship to the state of the art as >the 4,000$ PC did a few years ago now costs at least 10,000$. > >The inexpensive "home computer" has been slightly lost in these developments. >The Amiga, perhaps... but even the Amiga is running up the prices. Well, the Amiga 500 (the "home" machine) can be gotten for about the same price as the original C-64 (tape drive - disk drive pushes it up). The C-64 was introduced at $600 for the CPU unit alone. The Amiga 500 includes a disk drive as well. Then again, selling at good performance/price levels is Commodore's business. Of course, there's very little competition in that part of the market nowadays. Maybe we're ripe for another upswing (following the recent resurgence of video games) in the home computer market. Disclaimer: I work for Commodore-Amiga, Inc. -- Randell Jesup, Keeper of AmigaDos, Commodore Engineering. {uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com BIX: rjesup Common phrase heard at Amiga Devcon '89: "It's in there!"
nelson@m.cs.uiuc.edu (11/20/89)
> parallelism to continue to deliver more performance. If you project the > slope of the clock rates of supercomputers, you will see sub-nanosecond > CYCLE times before 1995. I don't see any technologies in the wings which > promise to allow this to continue... Actually, I don't see this (dare I say it) EVER occurring. Ignoring delay due to capacitance, a nanosecond is only 12 inches of wire -- and I'm reasonably sure that the "critical path" length is at least on the order of a foot (does anyone know?). Once capacitance delay comes into the picture (even on-chip there is a significant amount), even with new technologies, that 12 inches is reduced at least tenfold (opinion/guess). That leaves you with an inch of wiring for the critical path for this super technology -- that does not seem nearly enough to build a nano-processor around. Anyone else have Opinions? Facts? -- Taed.
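For reference, a few lines of C that work out the maximum signal path per cycle behind the "12 inches per nanosecond" figure. The speed of light is exact; the 0.5 velocity factor for a real wire or board trace is an assumption, not a measured value.

    /* Maximum signal-propagation distance per clock cycle.
     * Velocity factor of 0.5 is an assumed, typical-ish value. */
    #include <stdio.h>

    int main(void)
    {
        double c  = 2.998e8;   /* speed of light in free space, m/s */
        double vf = 0.5;       /* assumed velocity factor on a trace */
        double cycles[] = { 1e-9, 100e-12, 10e-12 };  /* 1 ns, 100 ps, 10 ps */
        int i;

        for (i = 0; i < 3; i++) {
            double d = c * vf * cycles[i];
            printf("cycle %6.0f ps -> max signal path ~ %6.1f mm (%5.2f in)\n",
                   cycles[i] * 1e12, d * 1000.0, d / 0.0254);
        }
        return 0;
    }

With that assumed velocity factor, a 1 ns cycle allows roughly 6 inches of round-trip-free path and a 100 ps cycle only about half an inch, which is the scale of the problem being argued about here.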
jonah@db.toronto.edu (Jeffrey Lee) (11/20/89)
nelson@m.cs.uiuc.edu writes: >> parallelism to continue to deliver more performance. If you project the >> slope of the clock rates of supercomputers, you will see sub-nanosecond >> CYCLE times before 1995. I don't see any technologies in the wings which >> promise to allow this to continue... >Actually, I don't see this (dare I say it) EVER occurring. NEVER say "never." :-) > Ignoring > delay due to capacitance, a nanosecond is only 12 inches of wire -- > and I'm reasonably sure that the "critical path" length is at least > on the order of a foot (does anyone know?). Once capacitance delay > comes into the picture (even on-chip there is a significant amount), > even with new technologies, that 12 inches is reduced at least > tenfold (opinion/guess). That leaves you with an inch of wiring > for the critical path for this super technology -- that does not > seem nearly enough to build a nano-processor around. Hierarchy and locality are wonderful for dodging these sorts of problems. Put a large register set, simple ALU, and tiny instruction cache onto a single GaAs or ECL (or whatever) chip. Assume a 4-level memory where the first three levels have a .8 hit rate and a 5-fold slowdown to the next level, which is 64 times larger:

  level   access   hit    Ehit(ns)   size
    1       1ns    .8       1.0      256B
    2       5ns    .16      1.6      16KB
    3      25ns    .032     2.4      1MB
    4     125ns    .008     3.4      64MB+    [294 Mword/s ==> 150 MIPS]

Now, 5ns gives you just enough time to get off the chip to a close neighbour cache chip, 25ns gives you enough time to get elsewhere on the board, and 125ns is enough time to go to the bus. Each critical path gets slightly longer and slightly slower. Each level can be made from a slower and cheaper technology. With a hit rate of .8, the effective access time is 3.4 ns/word, or about 294 Mwords/s, which should put you in the 150 MIPS range with RISC technology. [The ratio of 2W ==> 1 MIPS assumes that each operation (on average) uses one instruction and one data word. The SPARC seems to have a MIPS rating of about 1/2 its MHz.] Ok, so the numbers are all out of a hat. Let's try some different hats:

  level   access   hit    Ehit(ns)   size
    1       1ns    .7       1.0      256B
    2       5ns    .21      1.75     16KB
    3      25ns    .063     3.33     1MB
    4     125ns    .027     6.7      64MB+    [149 Mword/s ==> 75 MIPS]

  level   access   hit    Ehit(ns)   size
    1       1ns    .9       1.0      256B
    2       5ns    .09      1.35     16KB
    3      25ns    .009     1.58     1MB
    4     125ns    .001     1.7      64MB+    [588 Mword/s ==> 300 MIPS]

I'm more inclined to believe the values of .8 or .9 for locality given the 64x expansion at each level. I've no facts though. Is a 5ns single-chip 16KB cache possible, now or in 5 years? What about a 25ns multi-chip 1MB cache? What is the normal hit rate for a 16KB cache? Comments?
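The effective-access arithmetic in the first table can be checked with a few lines of C; the hit rates and access times below are the assumed values from that table, not measured data.

    /* Reproduces the effective-access arithmetic of the first table.
     * Hit rates and access times are the post's assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double access[4] = { 1.0, 5.0, 25.0, 125.0 };   /* ns per level      */
        double hit[4]    = { 0.8, 0.16, 0.032, 0.008 }; /* fraction per level*/
        double eff = 0.0;
        int i;

        for (i = 0; i < 4; i++)
            eff += hit[i] * access[i];

        printf("effective access: %.1f ns/word\n", eff);
        printf("throughput:       %.0f Mwords/s\n", 1000.0 / eff);
        printf("MIPS at 2 words/instruction: %.0f\n", 1000.0 / eff / 2.0);
        return 0;
    }

This gives 3.4 ns/word, about 294 Mwords/s, and roughly 147 MIPS, matching the bracketed figures in the table.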
mcdonald@uxe.cso.uiuc.edu (11/21/89)
> parallelism to continue to deliver more performance. If you project the > slope of the clock rates of supercomputers, you will see sub-nanosecond > CYCLE times before 1995. I don't see any technologies in the wings which > promise to allow this to continue... >Anyone else have Opinions? Facts? No, of course not. How fast could the speed demon people do a PDP-8 on a chip right now - just the CPU and 4k words of memory plus a couple of serial lines - let's say 200 megabaud? Has anyone else out there looked at the schematic of a PDP-8? It's pretty RISC. Doug McDonald
paul@taniwha.UUCP (Paul Campbell) (11/22/89)
In article <46500087@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
-No, of course not. How fast could the speed demon people do a
-PDP8 on a chip right now - just the CPU and 4k words of memory plus
-a couple of serial lines - lets say 200 megabaud? HAs anyone else
-out there looked at the schematic of a PDP-8? Its pretty RISC.
Sounds like a good student MOSIS project :-)
Paul
--
Paul Campbell UUCP: ..!mtxinu!taniwha!paul AppleLink: CAMPBELL.P
"### Error 352 Too many errors on one line (make fewer)" - Apple Computer
"We got a thousand points of light for the homeless man,
Got a kinder, gentler, machine gun hand ..." - Neil Young 'Freedom'
nelson@m.cs.uiuc.edu (11/22/89)
> Actually, I don't see this (dare I say it) EVER occurring. Ignoring ... > tenfold (opinion/guess). That leaves you with an inch of wiring > for the critical path for this super technology -- that does not > seem nearly enough to build a nano-processor around. Well, I was thinking only in terms of reasonable today-type technology ideas. I came up with something of a lower bound. I assumed that the smallest transistor is an angstrom in length. Then I used some guesses as to what a processor has to contain, etc. As it all comes down, it seems that our lower bound is on the order of 10 picoseconds for a cycle time in a processor. Other parts would obviously have a lower lower bound. Now you may say that there is no way that a transistor or transistor-work-alike can be built that small... Maybe so, but it is (?) a lower bound. -- Taed.
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (11/22/89)
In article <3300084@m.cs.uiuc.edu> nelson@m.cs.uiuc.edu writes: | As it all comes down, it seems that our lower bound is on the order | of 10 picoseconds for a cycle time in a processor. Other parts | would obviously have a lower lower bound. I think your lower bound is too high. I believe that _Electronics News_ had an article about a 200GHz counter. My subscription lapsed two years ago. There is a lower limit, because you have to make things smaller (as you said), and when the diameter of a conductor becomes small enough it becomes an exercise in probability to see if an electron put in one end comes out the other. An article a few years ago claimed that this occurs at about 17 orders of magnitude smaller and faster than a Cray-2. Warning: The only thing I'm sure is true is that the article said so ;-) -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "The world is filled with fools. They blindly follow their so-called 'reason' in the face of the church and common sense. Any fool can see that the world is flat!" - anon