[comp.arch] Single/Multi Tasking position summary

fouts@orville.nas.nasa.gov (Marty Fouts) (12/30/87)

Barry Shein writes:

> Marty Fouts continues a tradition in discussions of the relative
> merits of parallel machines which can be summed up in the simple
> statement of the "best as enemy of the good".

This certainly wasn't my intent, although the discussion may have
drifted that way.  It was my intent to counter a certain amount of the
enthusiasm which could lead to a belief that parallel architectures
are the right answer for all problems, when there are some cases in
which a single processor machine is a better choice.

He has also raised another example of parallel processing
enthusiasm which I try to counter when I can.  He shows that an Encore
Multimax gives a speedup of 50x over a 750 for a particular make.  Of
course, this looks good until you realize that he's talking about two
different generations of processors in two different price ranges.  I
would be interested if Barry would tell us the speedup the Multimax
gets over doing the same make on 1 of its cpus.  I would also be
interested in comparing the Multimax performance to that of a single
processor system done in more recent technology than a 750, for
instance a Sun 3/260.

His anecdote is an example of a problem which we haven't discussed
yet in this exchange over parallelism, which is price performance.  We
have on our floor a Convex C1 and an Alliant FX8 which I believe are
about equivalently priced models.  Although the Alliant can parallelize
some of my algorithms which the Convex cannot vectorize, they all
tend to run faster on the Convex than on the Alliant.  I do *NOT* make
the general claim that at a given price a parallel machine has poorer
performance than a single processor, but for a fairly wide range of
problems over a fairly wide price range of machines, this seems to be
true.  This exchange of anecdotal evidence leads to the hard question:
For which problems will a parallel processor using the best available
implementation technology give better cost/performance than the best
single processor?

Try comparing the best performing $250K single processor system to the
best performing $250K parallel processor system on your problems.  If
you're like me, you will find there is no simple predictor of which
problems will work better on either.  Next, try picking a different
price tag.  You will find that the set of programs which work best on
one architecture will be different than before.  What fun!
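
To make the bookkeeping concrete, here is a minimal sketch (the prices
match the $250K figure above, but the machines and run times are purely
hypothetical, chosen only to show how the winner can flip from problem
to problem):

    /* Minimal sketch of the comparison above.  Both machines carry the
     * same $250K price tag; the run times are hypothetical numbers
     * chosen only to show that the winner can flip per problem.        */
    #include <stdio.h>

    int main(void)
    {
        double price = 250e3;                 /* dollars, both systems   */
        const char *name[2] = { "single processor", "parallel processor" };

        /* Hypothetical wall clock hours for two different problems.    */
        double hours[2][2] = {
            /* single  parallel */
            {   4.0,     1.5    },   /* problem A: parallelizes well    */
            {   2.0,     6.0    },   /* problem B: mostly serial        */
        };

        for (int p = 0; p < 2; p++) {
            double cost0 = price * hours[p][0];  /* dollar-hours; lower wins */
            double cost1 = price * hours[p][1];
            printf("problem %c: %s wins\n", 'A' + p,
                   cost0 < cost1 ? name[0] : name[1]);
        }
        return 0;
    }

Swap in your own price tag and your own problem mix and the winner moves
around, which is exactly the point.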

He goes on to ask what the point of this discussion is.  To me, the
point is to learn about the trade-offs between parallel and single
processor systems, and to offer some data points to others who are
about to enter this arena.

He concludes with the statement:

> Seriously, what's the point? Just the 80's version of all the people
> that used to decry higher level languages because they could pull out
> this algorithm that no compiler could do as good a job on as a
> hand-coded solution. So what!?

No.  The point is not to say no one should use parallel systems under
any circumstances.  Like Mr. Shein, I use them regularly.  The point is
to try to help new users form realistic expectations of the class of
problems for which parallelism is trivial, the class for which it is
work (and the nature of the work), and the class for which it should
not be utilized.

bzs@bu-cs.BU.EDU (Barry Shein) (12/30/87)

From: fouts@orville.nas.nasa.gov (Marty Fouts)
>He has also raised another example of parallel processing
>enthusiasm which I try to counter when I can.  He shows that an Encore
>Multimax gives a speedup of 50x over a 750 for a particular make.  Of
>course, this looks good until you realize that he's talking about two
>different generations of processors in two different price ranges.  I
>would be interested if Barry would tell us the speedup the Multimax
>gets over doing the same make on 1 of its cpus.  I would also be
>interested in comparing the Multimax performance to that of a single
>processor system done in more recent technology than a 750, for
>instance a Sun 3/260.

Sure, glad to (would you believe first there was a problem with tmp on
the Sun, then the copy on the encore someone had, um, patched so I had
to revert the code to its original state as it wouldn't compile at
all, then I got everything going and kicked the plug out of my
terminal when I got up to empty the washer...no kidding...arghhhh! How
do I benchmark my LIFE!)

The software in question was nethack, which has 75 .c modules;
approximate wall clock time to compile (that's the third number which
csh reports when you say 'time make', after user and system; I prefer
wall clock time because I don't pay for the cycles, I pay in employee
time and my own boredom, and I can get into that argument if you want;
other measures aren't useless, I just consider this one the most
important for something like compilation):

System		h:mm:ss	Notes

Vax11/750:	1:39:00 (4.3bsd, 8MB, RA81s)
Encore/Umax:	   2:10	(6 CPUS, make told to use at most 8 processes*)
Encore/Umax:	  11:00	(same machine, use at most 1 process)
Sun3/280:	  13:32 (16MB, Super-eagles, 68881)

Machines all lightly loaded or unloaded (the 750 was unloaded tho it's
hard to gauge how much the little thing is bothered by net traffic,
there were a few users on the Encore playing games etc, the Sun3/280
might have been doing a little file serving tho it didn't look like
anyone was around, certainly some mail was processed during the make
as it's the campus mail server but the University is closed for the
week and it's 11PM so that should be nominal, I could check the syslog
I guess.)

In short, conditions on each machine were as good as or better than
when I usually use them; it might not be rarefied, but it's certainly
realistic (another argument about what I call Christmas Eve benchmarks,
as in "sure, the 3090 does 40 MIPS, I can get that on Christmas Eve;
any other time it has 200 users and seems slow, and no one's giving me
my own 3090".)
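
As for the single-CPU speedup Marty asked about, it falls straight out
of the table above; here's the arithmetic as a trivial C program (the
times are simply the posted figures converted to seconds, nothing new
was measured):

    /* Speedup and efficiency implied by the wall clock times above. */
    #include <stdio.h>

    int main(void)
    {
        double t_one = 11.0 * 60.0;          /* Encore, 1 process:   11:00  */
        double t_par = 2.0 * 60.0 + 10.0;    /* Encore, 6 CPUs:       2:10  */
        double t_vax = 99.0 * 60.0;          /* VAX 11/750:        1:39:00  */
        int    ncpus = 6;

        printf("speedup over one of its own CPUs: %.2f\n", t_one / t_par);
        printf("parallel efficiency on %d CPUs:    %.2f\n",
               ncpus, (t_one / t_par) / ncpus);
        printf("speedup over the 750:             %.1f\n", t_vax / t_par);
        return 0;
    }

So the answer to Marty's first question is roughly a 5x speedup at
about 85% efficiency on 6 CPUs, and the 750 comparison works out nearer
46x than 50x.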

Beyond that (this is too long already) I do agree with Marty and also
agree that I may have misunderstood/misrepresented what he was saying.

It's hard to tell exactly which applications will get exactly what
speedup.  I do believe that with general purpose parallel machines such as
the Encore or Sequent (as opposed to special purpose machines which
seem to put the entire burden on the coder to get any parallelism) one
might be able to get enough advantage in price/performance to justify
the machine itself. At that point they can explore how they can
further exploit the new tool (as apparently both of us are.)

Unless you're (this is the general audience "you") convinced that
there's no possibility of there being anything in it for you, I can't
think of any other way to explore the potential without actually using
one. This doesn't mean it's for everyone (tho I could argue that the
general purpose models are very good at some very mundane problems,
perhaps even more accessible than with exotic problems) but I am glad
to have the opportunity to have gained relatively early access to this
technology (again, general purpose parallel processors.)

	-Barry Shein, Boston University

* I use slightly more processes than CPUs on makes because I find the
performance continues to improve a little beyond NPROCS=NCPUS.  I
assume this is due to some processes always being in I/O wait. See all
the neat things about parallelism we can begin to form intuitions
about, if we explore it.
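
One back-of-the-envelope way to see where that intuition comes from
(the I/O-wait fraction below is an assumed, illustrative number, not a
measurement):

    /* If each compile spends some fraction of its time blocked on I/O,
     * only (1 - io_wait) of each process is runnable on average, so it
     * takes roughly NCPUS / (1 - io_wait) processes to keep NCPUS busy.
     * The 0.25 I/O-wait fraction is an assumption for illustration.   */
    #include <stdio.h>

    int main(void)
    {
        int    ncpus   = 6;
        double io_wait = 0.25;   /* assumed fraction of time spent blocked */
        double nprocs  = ncpus / (1.0 - io_wait);

        printf("suggested process limit: about %.0f for %d CPUs\n",
               nprocs, ncpus);   /* prints 8 with these assumed numbers    */
        return 0;
    }

With that (made-up) 25% figure the suggestion happens to land on the
same 8-processes-for-6-CPUs ratio used above, but the right fraction
will vary with the disks and the load.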

bzs@bu-cs.BU.EDU (Barry Shein) (12/30/87)

>System		h:mm:ss	Notes
>
>Vax11/750:	1:39:00 (4.3bsd, 8MB, RA81s)
>Encore/Umax:	   2:10	(6 CPUS, make told to use at most 8 processes*)
>Encore/Umax:	  11:00	(same machine, use at most 1 process)
>Sun3/280:	  13:32 (16MB, Super-eagles, 68881)

I found that Sun number a little high, so I re-ran it on a much more
lightly loaded (e.g. virtually no net traffic, which was the thing I
was worried about and mentioned in the original note) Sun3/280 with
otherwise similar hardware and got 9:55.  I don't think it changes the
conclusions much (relative to the spirit of this particular
discussion, trying to provide some anecdotal info), but honesty got
the best of me.

	-Barry Shein, Boston University

The problem with moderate affluence is that you end up with so much
more laundry to do.

brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (12/31/87)

In article <18079@bu-cs.BU.EDU> bzs@bu-cs.BU.EDU (Barry Shein) writes:
>
>* I use slightly more processes than CPUs on makes because I find the
>performance continues to improve a little beyond NPROCS=NCPUS.  I
>assume this is due to some processes always being in I/O wait. See all
>the neat things about parallelism we can begin to form intuitions
>about, if we explore it.

I use twice as many processes as available CPUs for parallel makes on
the Sequent; if the load average is less than 2 the machine just
isn't being worked hard enough! :-)

P.S.  Damn that parallel make anyway, it doesn't give me enough time
to go get my coffee.

collins@encore.UUCP (Jeff Collins) (01/03/88)

In article <60@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty Fouts) writes:
>He has also raised another example of parallel processing
>enthusiasm which I try to counter when I can.  He shows that an Encore
>Multimax gives a speedup of 50x over a 750 for a particular make.  Of
>course, this looks good until you realize that he's talking about two
>different generations of processors in two different price ranges.  I
>would be interested if Barry would tell us the speedup the Multimax
>gets over doing the same make on 1 of its cpus.  I would also be
>interested in comparing the Multimax performance to that of a single
>processor system done in more recent technology than a 750, for
>instance a Sun 3/260.
>
(Note: the "he" mentioned is Barry Shein at BU.)

	Yes, Barry is talking about two different processors, but they are the
	same generation.  The original Multimax (now called the Multimax 120)
	contains NS32032 processors.  These are approximately 0.75 MIPS, as is
	the VAX 750.  As to price, yes, the Multimax that was mentioned is more
	expensive than a 750 (but 750s are your basic scrap metal now).  A more
	interesting comparison would be a new VAX.  As we don't have one I can't
	give the numbers on this, but I can ask the network: how long does it
	take to compile a 4.3BSD kernel on a real VAX?  I would bet it is
	longer than 4.5 minutes (the best we can do on our fastest machine).

>His anecdote is an example of a problem which we haven't discussed
>yet in this exchange over parallelism, which is price performance.  We
>have on our floor a Convex C1 and an Alliant FX8 which I believe are
>about equivalently priced models.  Although the Alliant can parallelize
>some of my algorithms which the Convex cannot vectorize, they all
>tend to run faster on the Convex than on the Alliant.  I do *NOT* make
>the general claim that at a given price a parallel machine has poorer
>performance than a single processor, but for a fairly wide range of
>problems over a fairly wide price range of machines, this seems to be
>true.  This exchange of anecdotal evidence leads to the hard question:
>For which problems will a parallel processor using the best available
>implementation technology give better cost/performance than the best
>single processor?

	Vector processors are a special form of parallel processors (this 
	should start lots of flames).  Perhaps the Convex is simply implemented
	better than the Alliant?  You are using one comparison to condemn
	an entire class of machines.

>
>No.  The point is not to say no one should use parallel systems under
>any circumstances.  Like Mr. Shein, I use them regularly.  The point is
>to try to help new users form realistic expectations of the class of
>problems for which parallelism is trivial, the class for which it is
>work (and the nature of the work), and the class for which it should
>not be utilized.

	I definitely agree with this; it is very important to determine the
	match between an application and the type of machine to use.  As an
	attempt to further your point, let me give my perspective on the 
	various applications and machine solutions (from an admittedly biased
	perspective).

	General timesharing: A truly parallel machine like the Encore
	    Multimax or the (I have to say this for fairness) Sequent Balance.
	    This does NOT include the Alliant or any of the current multi-CPU
	    VAXen.  The Alliant does have a parallel hardware architecture for
	    the IPs, but the OS does not currently take maximum advantage of
	    it.  The multi-CPU VAXen do not parallelize the most important
	    aspect of a timesharing machine: I/O.  The best testimonial that
	    I have heard for this is from the University of Oklahoma.  After
	    they installed a multi for their student timesharing, they were
	    able to eliminate the student signup procedure (you remember, the
	    one where you sign up a week in advance for a two-hour slot).  The
	    machine that they installed did not cost any more than a current
	    generation VAX.  NOTE: all of this is entirely transparent to the
	    users.

	Fine-Grained parallelism: A Convex or an Alliant (unless of course
	    you can afford a Cray).  There are some new machines that are
	    very interesting in this solution area: Multiflow, ETA.  Probably
	    the decision here should be based on compiler technology; the
	    machines are pretty evenly matched.

	Coarse-Grained parallelism: This is tougher to define.  Probably the
	    best course of action here is to benchmark the application on 
	    various machines.
		

physh@unicom.UUCP (Jon 'Quality in - Quantity out' Foreman) (01/04/88)

In article <18079@bu-cs.BU.EDU> bzs@bu-cs.BU.EDU (Barry Shein) writes:
>[...] then I got everything going and kicked the plug out of my
>terminal when I got up to empty the washer
>The software in question was nethack which has 75 .c modules,
>approximate wall clock time to compile [...]
>
>System		h:mm:ss	Notes
>Vax11/750:	1:39:00 (4.3bsd, 8MB, RA81s)
>Encore/Umax:	   2:10	(6 CPUS, make told to use at most 8 processes*)
>Encore/Umax:	  11:00	(same machine, use at most 1 process)
>Sun3/280:	  13:32 (16MB, Super-eagles, 68881)
>

	Why didn't you include the benchmark time from the washing
machine?  *I* at least am very curious about how the various speed and
temperature settings affect overall performance.  Does MOVing wet
laundry from washer to dryer improve overall response times, or is it
better to put the wet laundry on a line outside to dry, thus freeing the
capacity of the dryer?  I am sure that many of us out here in netland
are on the edge of our seats waiting to find out what a washing machine
could do to the 75 .c modules that encompass nethack.

	If the benchmark is particularly fast, what will be the impact
on future mechanically based computing engines?  The mind boggles.


					Jon   :-)
-- 
ucbvax!pixar!\            | For small letters  | ~~~~~~~\~~~   That's spelled
hoptoad!well!unicom!physh | ( < 10K ) only:    |  Jon  }()      "physh" and 
       ptsfa!/            | physh@ssbn.wlk.com |        /     pronounced "fish".

csg@pyramid.pyramid.com (Carl S. Gutekunst) (01/05/88)

In article <2428@encore.UUCP> collins@encore.UUCP (Jeff Collins) writes:
>	General timesharing: A truely parallel machine like the Encore 
>	    Multimax or the (I have to say this for fairness) Sequent Balance.

As long as you are being fair, don't forget the Elxsi 6400, the Pyramid 9800,
and Counterpoint's workstation line. Symmetric parallel processing in general
timesharing is not that uncommon these days.

>	    The multi-CPU VAXen, do not parallelize the most important
>	    aspect of a timesharing machine - I/O.

The difference is that multi-CPU VAXen (under UNIX, not VMS) are master/slave.
The blanket statement that "I/O is not parallelized" is not entirely true,
since those systems provide intelligent I/O interfaces that operate in
parallel with the main CPUs. But it *is* true that the operating system is not
running in parallel, which can be a major bottleneck especially with more than
two CPUs. 

A similar strategy is used by the CCI Power 6/32, the older Celerity systems
(I dunno about the new ones), and the Arete 68010 and 68020 multi-CPU boxes.
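
To see why a non-parallel kernel bites hardest past two CPUs, here is a
minimal sketch (modern POSIX threads, made-up work amounts; it is an
illustration of the idea, not a model of any machine named above) in
which a single global lock stands in for the serialized kernel:

    /* Every worker alternates between "user mode" work, which runs in
     * parallel, and "kernel" work, which must hold one global lock.
     * With half the work under the lock, throughput stops scaling at
     * about two CPUs no matter how many workers you add.             */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t kernel_lock = PTHREAD_MUTEX_INITIALIZER;

    static void spin(long n) { for (volatile long i = 0; i < n; i++) ; }

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000; i++) {
            spin(20000);                        /* user work: fully parallel    */
            pthread_mutex_lock(&kernel_lock);   /* "system call": one at a time */
            spin(20000);
            pthread_mutex_unlock(&kernel_lock);
        }
        return NULL;
    }

    int main(void)
    {
        enum { NWORKERS = 4 };                  /* pretend four CPUs */
        pthread_t t[NWORKERS];

        for (int i = 0; i < NWORKERS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(t[i], NULL);
        puts("done");
        return 0;
    }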

Note that any price/performance comparison to DEC is misleading. Any of the
second- or third-tier computer companies can beat DEC on price/performance.
Likewise with IBM. But this has not stopped either company from selling oodles
of machines, sometimes for good reason, sometimes not.

<csg>

fouts@orville.nas.nasa.gov (Marty Fouts) (01/05/88)

In article <2428@encore.UUCP> collins@encore.UUCP (Jeff Collins) writes:
>
>In article <60@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty Fouts) writes:
>>...  This exchange of anecdotal evidence leads to the hard question:
>>For which problems will a parallel processor using the best available
>>implementation technology give better cost/performance than the best
>>single processor?
>
>	Vector processors are a special form of parallel processors (this 
>	should start lots of flames).  Perhaps the Convex is simply implemented
>	better than the Alliant?  You are using one comparison to condemn
>	an entire class of machines.
>

Please don't put inflammatory words in my mouth.  I cited one example
to show that sometimes one class of machines can outperform the other
and asked under what circumstances this is true.  I never condemned
either class of architectures.  (At most, a cynical reading of what I
said would claim that I condemned the Alliant implementation, and I
didn't even do that.)

To clarify my original question:

As an architect, I would like to know:

For which problems will a parallel processor using the best available
implementation technology give better cost/performance than the best
single processor?

This question is really asking whether there are any inherent limitations
due to either implementation technology or architectural constraints
which lead to one or the other class of architectures being stronger
for a particular class of problem.

As a user, I would like to know:

For which problems will parallel processor X give better
cost/performance than single processor Y?

In this case, I don't care why the relationship holds, only that it
holds, and how effectively I can predict it.  (So I can best take
advantage.)

roger@celtics.UUCP (Roger B.A. Klorese) (01/06/88)

In article <12600@pyramid.pyramid.com> csg@pyramid.UUCP (Carl S. Gutekunst) writes:
>The difference is that Multi-CPU VAXen (under UNIX, not VMS) are Master/Slave.
>A similar strategy is used by the older Celerity systems...

The Celerity C1260 dyadic system, as well as the new Celerity 6000 system, 
runs in symmetrical multiprocessing mode, *not* master-slave mode.  While
portions of the Celerity UNIX kernel are single-threaded, they can be 
executed on either processor of the C1260 (and any of the up to 4 scalar
processors on the Celerity 6000).

-- 
 ///==\\   (Your message here...)
///        Roger B.A. Klorese, CELERITY (Northeast Area)
\\\        40 Speen St., Framingham, MA 01701  +1 617 872-1552
 \\\==//   celtics!roger@necntc.nec.com - necntc!celtics!roger