fouts@orville.nas.nasa.gov (Marty Fouts) (12/30/87)
Barry Shein writes:

> Marty Fouts continues a tradition in discussions of the relative
> merits of parallel machines which can be summed up in the simple
> statement of the "best as enemy of the good".

This certainly wasn't my intent, although the discussion may have drifted that way. It was my intent to counter a certain amount of the enthusiasm which could lead to a belief that parallel architectures are the right answer for all problems, when there are some cases in which a single processor machine is a better choice.

He has also raised another example of parallel processing enthusiasm which I try to counter when I can. He shows that an Encore Multimax gives a speedup of 50x over a 750 for a particular make. Of course, this looks good until you realize that he's talking about two different generations of processors in two different price ranges. I would be interested if Barry would tell us the speedup the Multimax gets over doing the same make on 1 of its cpus. I would also be interested in comparing the Multimax performance to that of a single processor system done in more recent technology than a 750, for instance a Sun 3/260.

His anecdote is an example of a problem which we haven't discussed yet in this exchange over parallelism, which is price performance. We have on our floor a Convex C1 and an Alliant FX8 which I believe are about equivalently priced models. Although the Alliant can parallelize some of my algorithms which the Convex cannot vectorize, they all tend to run faster on the Convex than on the Alliant. I do *NOT* make the general claim that at a given price a parallel machine has poorer performance than a single processor, but for a fairly wide range of problems over a fairly wide price range of machines, this seems to be true. This exchange of anecdotal evidence leads to the hard question: For which problems will a parallel processor using the best available implementation technology give better cost/performance than the best single processor?
Try comparing the best performing $250K single processor system to the best performing $250K parallel processor system on your problems. If you're like me, you will find there is no simple predictor of which problems will work better on either. Next, try picking a different price tag. You will find that the set of programs which work best on one architecture will be different than before. What fun!

He goes on to ask what's the point in this discussion? To me, the point is to learn about the trade-offs between parallel and single processor systems, and to offer some data points to others who are about to enter this arena.

He concludes with the statement:

> Seriously, what's the point? Just the 80's version of all the people
> that used to decry higher level languages because they could pull out
> this algorithm that no compiler could do as good a job on as a
> hand-coded solution. So what!?

No. The point is not to say no one should use parallel systems under any circumstances. Like Mr. Shein, I use them regularly. The point is to try to help new users form realistic expectations of the class of problems for which parallelism is trivial, the class for which it is work (and the nature of the work) and the class for which it should not be utilized.
bzs@bu-cs.BU.EDU (Barry Shein) (12/30/87)
Posting-Front-End: GNU Emacs 18.41.4 of Mon Mar 23 1987 on bu-cs (berkeley-unix)

From: fouts@orville.nas.nasa.gov (Marty Fouts)
>He has also raised another example of parallel processing
>enthusiasm which I try to counter when I can. He shows that an Encore
>Multimax gives a speedup of 50x over a 750 for a particular make. Of
>course, this looks good until you realize that he's talking about two
>different generations of processors in two different price ranges. I
>would be interested if Barry would tell us the speedup the Multimax
>gets over doing the same make on 1 of its cpus. I would also be
>interested in comparing the Multimax performance to that of a single
>processor system done in more recent technology than a 750, for
>instance a Sun 3/260.

Sure, glad to. (Would you believe first there was a problem with tmp on the Sun, then the copy on the Encore someone had, um, patched, so I had to revert the code to its original state as it wouldn't compile at all, then I got everything going and kicked the plug out of my terminal when I got up to empty the washer... no kidding... arghhhh! How do I benchmark my LIFE!)
The software in question was nethack, which has 75 .c modules. The figures below are approximate wall clock times to compile. (That's the third number which csh reports when you say 'time make', after user and system. I prefer wall clock time as I don't pay for the cycles; I pay for employee time and with my own boredom. I can get into that argument if you want; other measures aren't useless, I just consider this the most important for something like compilation.)

System          h:mm:ss  Notes

Vax11/750:      1:39:00  (4.3bsd, 8MB, RA81s)
Encore/Umax:       2:10  (6 CPUS, make told to use at most 8 processes*)
Encore/Umax:      11:00  (same machine, use at most 1 process)
Sun3/280:         13:32  (16MB, Super-eagles, 68881)

Machines were all lightly loaded or unloaded. (The 750 was unloaded, tho it's hard to gauge how much the little thing is bothered by net traffic. There were a few users on the Encore playing games etc. The Sun3/280 might have been doing a little file serving, tho it didn't look like anyone was around; certainly some mail was processed during the make as it's the campus mail server, but the University is closed for the week and it's 11PM so that should be nominal. I could check the syslog I guess.)

In short, most machines were as good or better than when I usually use them. It might not be rarefied, but it's certainly realistic (another argument, about what I call Christmas eve benchmarks, as in "sure the 3090 does 40 MIPS, I can get that on Christmas eve; any other time it has 200 users and seems slow and no one's giving me my own 3090").

Beyond that (this is too long already) I do agree with Marty and also agree that I may have misunderstood/misrepresented what he was saying. It's hard to tell exactly which applications will get exactly what speedup.
I do believe that with general purpose parallel machines such as the Encore or Sequent (as opposed to special purpose machines which seem to put the entire burden on the coder to get any parallelism) one might be able to get enough advantage in price/performance to justify the machine itself. At that point they can explore how they can further exploit the new tool (as apparently both of us are).

Unless you're (this is the general audience "you") convinced that there's no possibility of there being anything in it for you, I can't think of any other way to explore the potential without actually using one.

This doesn't mean it's for everyone (tho I could argue that the general purpose models are very good at some very mundane problems, perhaps even more accessible than with exotic problems) but I am glad to have the opportunity to have gained relatively early access to this technology (again, general purpose parallel processors).

	-Barry Shein, Boston University

* I use slightly more processes than CPUs on makes because I find the performance continues to improve a little beyond NPROCS=NCPUS. I assume this is due to some processes always being in I/O wait. See all the neat things about parallelism we can begin to form intuitions about, if we explore it.
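The speedups Marty asked about can be read straight off the posted wall clock times. The following is only a sketch of that arithmetic: the helper names (`to_seconds`, `speedup`) and the dictionary labels are made up for illustration, and "efficiency" here simply means speedup divided by the 6 CPUs mentioned in the notes column.

```python
def to_seconds(clock: str) -> int:
    """Convert 'h:mm:ss' or 'm:ss' wall clock text to seconds."""
    seconds = 0
    for field in clock.split(":"):
        seconds = seconds * 60 + int(field)
    return seconds


# Wall clock times exactly as posted in the thread; labels are illustrative.
TIMES = {
    "Vax11/750": to_seconds("1:39:00"),
    "Encore/Umax, 8 processes": to_seconds("2:10"),
    "Encore/Umax, 1 process": to_seconds("11:00"),
    "Sun3/280": to_seconds("13:32"),
}

NCPUS = 6  # from the notes column for the Encore run


def speedup(baseline: str, candidate: str) -> float:
    """How many times faster `candidate` finished than `baseline`."""
    return TIMES[baseline] / TIMES[candidate]


if __name__ == "__main__":
    parallel = "Encore/Umax, 8 processes"
    for name in TIMES:
        if name != parallel:
            print(f"{name:26s} -> {speedup(name, parallel):5.1f}x slower than parallel make")
    self_speedup = speedup("Encore/Umax, 1 process", parallel)
    print(f"efficiency on {NCPUS} CPUs: {self_speedup / NCPUS:.0%}")
```

On these figures the parallel make is roughly 5x faster than the same machine restricted to one process, and roughly 46x faster than the 750. This is one run of one workload, so it is an anecdote rather than a benchmark.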
bzs@bu-cs.BU.EDU (Barry Shein) (12/30/87)
Posting-Front-End: GNU Emacs 18.41.4 of Mon Mar 23 1987 on bu-cs (berkeley-unix)

>System          h:mm:ss  Notes
>
>Vax11/750:      1:39:00  (4.3bsd, 8MB, RA81s)
>Encore/Umax:       2:10  (6 CPUS, make told to use at most 8 processes*)
>Encore/Umax:      11:00  (same machine, use at most 1 process)
>Sun3/280:         13:32  (16MB, Super-eagles, 68881)

I found that Sun number a little high, so I re-ran it on a much more lightly loaded Sun3/280 (e.g. virtually no net traffic, which was the thing I was worried about and mentioned in the original note) with otherwise similar hardware and got 9:55. I don't think it changes the conclusions much (relative to the spirit of this particular discussion, trying to provide some anecdotal info), but honesty got the best of me.

	-Barry Shein, Boston University

The problem with moderate affluence is that you end up with so much more laundry to do.
brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (12/31/87)
In article <18079@bu-cs.BU.EDU> bzs@bu-cs.BU.EDU (Barry Shein) writes:
>
>* I use slightly more processes than CPUs on makes because I find the
>performance continues to improve a little beyond NPROCS=NCPUS. I
>assume this is due to some processes always being in I/O wait. See all
>the neat things about parallelism we can begin to form intuitions
>about, if we explore it.

I use twice as many processes as available cpus for parallel makes on the Sequent. If the load average is less than 2, the machine just isn't being worked hard enough! :-)

P.S. Damn that parallel make anyway, it doesn't give me enough time to go get my coffee.
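One rough way to reason about why running more processes than CPUs can help is a toy model in which every compile process spends a fixed fraction of its time blocked on I/O. That fraction (`io_wait`) is purely an assumption for illustration; nobody in the thread measured it, and the model ignores disk saturation, memory pressure and scheduling overhead. The function names are hypothetical.

```python
import math


def busy_cpus(nprocs: int, ncpus: int, io_wait: float) -> float:
    """Expected number of CPUs kept busy under the toy model.

    Each process is assumed runnable for a fraction (1 - io_wait) of the
    time, independently of the others; CPUs cannot be more than fully busy.
    """
    if not 0.0 <= io_wait < 1.0:
        raise ValueError("io_wait must be in [0, 1)")
    runnable = nprocs * (1.0 - io_wait)
    return min(float(ncpus), runnable)


def procs_to_saturate(ncpus: int, io_wait: float) -> int:
    """Smallest process count whose expected runnable load fills all CPUs."""
    if not 0.0 <= io_wait < 1.0:
        raise ValueError("io_wait must be in [0, 1)")
    return math.ceil(ncpus / (1.0 - io_wait))


if __name__ == "__main__":
    # Assumed I/O wait fractions, chosen only to illustrate the two
    # rules of thumb quoted in the thread.
    for io_wait in (0.25, 0.5):
        n = procs_to_saturate(6, io_wait)
        print(f"io_wait={io_wait:.2f}: {n} processes keep 6 CPUs busy "
              f"(6 processes keep {busy_cpus(6, 6, io_wait):.1f} busy)")
```

Under this model, "slightly more processes than CPUs" and "twice as many processes as CPUs" correspond to different assumed I/O wait fractions, which may be one reason the two rules of thumb differ between sites and workloads.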
collins@encore.UUCP (Jeff Collins) (01/03/88)
In article <60@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty Fouts) writes:
>He has also raised another example of parallel processing
>enthusiasm which I try to counter when I can. He shows that an Encore
>Multimax gives a speedup of 50x over a 750 for a particular make. Of
>course, this looks good until you realize that he's talking about two
>different generations of processors in two different price ranges. I
>would be interested if Barry would tell us the speedup the Multimax
>gets over doing the same make on 1 of its cpus. I would also be
>interested in comparing the Multimax performance to that of a single
>processor system done in more recent technology than a 750, for
>instance a Sun 3/260.
>

(Note: the "he" mentioned is Barry Shein at BU.) Yes, Barry is talking about two different processors, but they are the same generation. The original Multimax (now called the Multimax 120) contains NS32032 processors. These are approximately 0.75 MIPS, as is the VAX 750. As to price, yes, the Multimax that was mentioned is more expensive than a 750 (but 750s are your basic scrap metal now). A more interesting comparison would be a new VAX. As we don't have one I can't give the numbers on this, but I can ask the network - how long does it take to compile a 4.3BSD kernel on a real VAX? I would bet it is longer than 4.5 minutes (the best we can do on our fastest machine).

>His anecdote is an example of a problem which we haven't discussed
>yet in this exchange over parallelism, which is price performance. We
>have on our floor a Convex C1 and an Alliant FX8 which I believe are
>about equivalently priced models. Although the Alliant can parallelize
>some of my algorithms which the Convex cannot vectorize, they all
>tend to run faster on the Convex than on the Alliant. I do *NOT* make
>the general claim that at a given price a parallel machine has poorer
>performance than a single processor, but for a fairly wide range of
>problems over a fairly wide price range of machines, this seems to be
>true. This exchange of anecdotal evidence leads to the hard question:
>For which problems will a parallel processor using the best available
>implementation technology give better cost/performance than the best
>single processor?

Vector processors are a special form of parallel processors (this should start lots of flames). Perhaps the Convex is simply implemented better than the Alliant? You are using one comparison to condemn an entire class of machines.

>
>No. The point is not to say no one should use parallel systems under
>any circumstances. Like Mr. Shein, I use them regularly. The point is
>to try to help new users form realistic expectations of the class of
>problems for which parallelism is trivial, the class for which it is
>work (and the nature of the work) and the class for which it should
>not be utilized.

I definitely agree with this; it is very important to determine the match between an application and the type of machine to use. As an attempt to further your point, let me give my perspective on the various applications and machine solutions (from an admittedly biased perspective).

General timesharing: A truly parallel machine like the Encore Multimax or the (I have to say this for fairness) Sequent Balance. This does NOT include the Alliant or any of the current multi-CPU VAXen. The Alliant does have a parallel hardware architecture for the IPs, but the OS does not currently take maximum advantage of it. The multi-CPU VAXen do not parallelize the most important aspect of a timesharing machine - I/O. The best testimonial that I have heard for this is from the University of Oklahoma.
After they installed a multi for their student time sharing, they were able to eliminate the student signup procedure (you remember: you sign up a week in advance for a 2 hour time slot). The machine that they installed did not cost any more than a current generation VAX. NOTE: this is entirely transparent.

Fine-Grained parallelism: A Convex or an Alliant (unless of course you can afford a Cray). There are some new machines that are very interesting in this solution area: Multiflow, ETA. Probably the decision here should be based on compiler technology; the machines are pretty evenly matched.

Coarse-Grained parallelism: This is tougher to define. Probably the best course of action here is to benchmark the application on various machines.
physh@unicom.UUCP (Jon 'Quality in - Quantity out' Foreman) (01/04/88)
In article <18079@bu-cs.BU.EDU> bzs@bu-cs.BU.EDU (Barry Shein) writes:
>[...] then I got everything going and kicked the plug out of my
>terminal when I got up to empty the washer
>The software in question was nethack which has 75 .c modules,
>approximate wall clock time to compile [...]
>
>System          h:mm:ss  Notes
>Vax11/750:      1:39:00  (4.3bsd, 8MB, RA81s)
>Encore/Umax:       2:10  (6 CPUS, make told to use at most 8 processes*)
>Encore/Umax:      11:00  (same machine, use at most 1 process)
>Sun3/280:         13:32  (16MB, Super-eagles, 68881)
>

Why didn't you include the benchmark time from the washing machine? *I* at least am very curious about how the various speed and temperature settings affect overall performance. Does MOVing wet laundry from washer to dryer improve overall response times, or is it better to put the wet laundry on a line outside to dry, thus freeing the capacity of the dryer?

I am sure that many of us out here in netland are on the edge of our seats waiting to find out what a washing machine could do to the 75 .c modules that encompass nethack. If the benchmark is particularly fast, what will be the impact on future mechanical based computing engines? The mind boggles.

Jon :-)
--
ucbvax!pixar!\            | For small letters  | ~~~~~~~\~~~   That's spelled
hoptoad!well!unicom!physh | ( < 10K ) only:    | Jon     }()      "physh" and
       ptsfa!/            | physh@ssbn.wlk.com |        /   pronounced "fish".
csg@pyramid.pyramid.com (Carl S. Gutekunst) (01/05/88)
In article <2428@encore.UUCP> collins@encore.UUCP (Jeff Collins) writes:
> General timesharing: A truly parallel machine like the Encore
> Multimax or the (I have to say this for fairness) Sequent Balance.

As long as you are being fair, don't forget the Elxsi 6400, the Pyramid 9800, and Counterpoint's workstation line. Symmetric parallel processing in general timesharing is not that uncommon these days.

> The multi-CPU VAXen, do not parallelize the most important
> aspect of a timesharing machine - I/O.

The difference is that Multi-CPU VAXen (under UNIX, not VMS) are Master/Slave. The blanket statement that "I/O is not parallelized" is not entirely true, since the system provides other intelligent I/O interfaces that operate in parallel with the main CPUs. But it *is* true that the operating system is not running in parallel, which can be a major bottleneck, especially with more than two CPUs. A similar strategy is used by the CCI Power 6/32, the older Celerity systems (I dunno about the new ones), and the Arete 68010 and 68020 multi-CPU boxes.

Note that any price/performance comparison to DEC is misleading. Any of the second- or third-tier computer companies can beat DEC on price/performance. Likewise with IBM. But this has not stopped either company from selling oodles of machines, sometimes for good reason, sometimes not.

<csg>
fouts@orville.nas.nasa.gov (Marty Fouts) (01/05/88)
In article <2428@encore.UUCP> collins@encore.UUCP (Jeff Collins) writes:
>
>In article <60@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty Fouts) writes:
>>... This exchange of anecdotal evidence leads to the hard question:
>>For which problems will a parallel processor using the best available
>>implementation technology give better cost/performance than the best
>>single processor?
>
> Vector processors are a special form of parallel processors (this
> should start lots of flames). Perhaps the Convex is simply implemented
> better than the Alliant? You are using one comparison to condemn
> an entire class of machines.
>

Please don't put inflammatory words in my mouth. I cited one example to show that sometimes one class of machines can outperform the other and asked under what circumstances this is true. I never condemned either class of architectures. (At worst, a cynical reading of what I said would claim that I condemned the Alliant implementation, and I didn't even do that.)

To clarify my original question:

As an architect, I would like to know: For which problems will a parallel processor using the best available implementation technology give better cost/performance than the best single processor? This question is really asking if there are any inherent limitations due to either implementation technology or architectural constraints which lead to one or the other class of architectures being stronger for a particular class of problem.

As a user, I would like to know: For which problems will parallel processor X give better cost/performance than single processor Y? In this case, I don't care why the relationship holds, only that it holds, and how effectively I can predict it. (So I can best take advantage.)
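The user-level version of the question can at least be tabulated once timings exist. Below is a minimal sketch, assuming that "cost/performance" is scored as purchase price multiplied by run time (lower is better); the thread does not define a metric, so that choice is an assumption. The machine names, prices and timings in the example are hypothetical placeholders, not measurements of any system mentioned here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    price: float  # purchase price in any consistent currency unit


def cost_time_product(machine: Machine, seconds: float) -> float:
    """Price multiplied by run time; lower means better cost/performance."""
    return machine.price * seconds


def better_choice(a: Machine, a_seconds: float,
                  b: Machine, b_seconds: float) -> Machine:
    """Return whichever machine has the lower price x time for one problem."""
    if cost_time_product(a, a_seconds) <= cost_time_product(b, b_seconds):
        return a
    return b


def tally(a: Machine, b: Machine,
          timings: dict[str, tuple[float, float]]) -> dict[str, str]:
    """Map each problem name to the name of the machine that wins it.

    `timings` maps problem name -> (seconds on a, seconds on b).
    """
    return {
        problem: better_choice(a, a_secs, b, b_secs).name
        for problem, (a_secs, b_secs) in timings.items()
    }


if __name__ == "__main__":
    # Hypothetical machines and timings, for illustration only.
    single = Machine("single processor Y", 250_000)
    parallel = Machine("parallel processor X", 250_000)
    timings = {"problem 1": (100.0, 140.0), "problem 2": (300.0, 90.0)}
    for problem, winner in tally(single, parallel, timings).items():
        print(f"{problem}: {winner}")
```

A table like this only records which machine won each problem. It says nothing about why, which is the architect's half of the question, and it has to be redone for every price point.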
roger@celtics.UUCP (Roger B.A. Klorese) (01/06/88)
In article <12600@pyramid.pyramid.com> csg@pyramid.UUCP (Carl S. Gutekunst) writes:
>The difference is that Multi-CPU VAXen (under UNIX, not VMS) are Master/Slave.
>A similar strategy is used by the older Celerity systems...

The Celerity C1260 dyadic system, as well as the new Celerity 6000 system, runs in symmetrical multiprocessing mode, *not* master-slave mode. While portions of the Celerity UNIX kernel are single-threaded, they can be executed on either processor of the C1260 (and any of the up to 4 scalar processors on the Celerity 6000).
--
 ///==\\   (Your message here...)
///        Roger B.A. Klorese, CELERITY (Northeast Area)
\\\        40 Speen St., Framingham, MA 01701  +1 617 872-1552
 \\\==//   celtics!roger@necntc.nec.com - necntc!celtics!roger