[comp.arch] Single tasking the wave of the future?

gnu@hoptoad.uucp (John Gilmore) (12/02/87)

> Another trend...                     ...is that towards individual
> (single-user) computers.  The future of multi-tasking on such machines is
> very much in question...
> -- 
> Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
> Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108

Does anybody seriously believe this?  I don't.

All the little single chip micros in microwave ovens are running realtime
code; depending on internal structure, this could be single threaded polling,
one task with lots of interrupts, or many tasks.

99% of the rest of the computers will be "general purpose" and all of
them will be running multitasking.  Whether they call it "TSR and steal
interrupts" or they call it "fork()" or "multifinder", it will be
multitasking.  Mr.  Adams, who works at a major MSDOS software house,
may be confused by the fact that on MSDOS you have to write your own
scheduler and task switching to get multitasking.  That hasn't stopped
everyone from doing it, given the obvious benefits of multitasking; it
just makes it 30 times as hard.
-- 
{pyramid,ptsfa,amdahl,sun,ihnp4}!hoptoad!gnu			  gnu@toad.com
		"Watch me change my world..." -- Liquid Theatre

dave@sdeggo.UUCP (David L. Smith) (12/03/87)

In article <3445@hoptoad.uucp>, gnu@hoptoad.uucp (John Gilmore) writes:
 franka@mmintl writes:
 > > Another trend...                     ...is that towards individual
 > > (single-user) computers.  The future of multi-tasking on such machines is
 > > very much in question...
 > 99% of the rest of the computers will be "general purpose" and all of
 > them will be running multitasking.  

I agree with John,  however the ideal towards which we are striving is multi-
tasking operating systems and single-tasking processors.  Admit it, don't
you hate having your processor multi-task?
-- 
David L. Smith
{sdcsvax!man,ihnp4!jack!man, hp-sdd!crash, pyramid}!sdeggo!dave
man!sdeggo!dave@sdcsvax.ucsd.edu 
The net.goddesses made me do it!

eugene@pioneer.arpa (Eugene Miya N.) (12/03/87)

In article <147@sdeggo.UUCP> dave@sdeggo.UUCP (David L. Smith) writes:
>Admit it, don't you hate having your processor multi-task?

You should have typed a ;-).

Sometimes, I just love forking 24 processes into the background
(doing serious work) on one of our Sequents.  Sometimes I complain
when the load average gets over 0.15 ;-).  I sometimes have to
do the same thing on our VAX (only with fewer processes).
I have come to the conclusion that these boxes are the way to go
instead of too-loosely coupled workstations (you should have a good
graphical front-end, however...).

From the Rock of Ages Home for Retired Hackers:

--eugene miya, NASA Ames Research Center, eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {uunet,hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene

daveh@cbmvax.UUCP (12/03/87)

in article <147@sdeggo.UUCP>, dave@sdeggo.UUCP (David L. Smith) says:
> In article <3445@hoptoad.uucp>, gnu@hoptoad.uucp (John Gilmore) writes:
>  franka@mmintl writes:
>  > > Another trend...                     ...is that towards individual
>  > > (single-user) computers.  The future of multi-tasking on such machines is
>  > > very much in question...
>  > 99% of the rest of the computers will be "general purpose" and all of
>  > them will be running multitasking.  
> 
> I agree with John,  however the ideal towards which we are striving is multi-
> tasking operating systems and single-tasking processors.  Admit it, don't
> you hate having your processor multi-task?

I'd rather have my processor multi-task than sit idle most of the time, as
it's pretty much doing right now, with just a terminal program, a static CLI,
some dormant background tasks, and a performance monitor running.  At least,
that's my choice between multitasking and not multitasking, given the reality
of a single processor system.

I would object to allowing others to steal time from my CPU, even if I'm not
doing anything with it at the moment.  And I certainly wouldn't mind at all
if I could drop in an additional CPU and have the tasks distributed in a 
reasonable way over the two.  But it'll be some time before I have a processor
available for each individual task, so basically, I'm really happy that I can
let my processor multitask.

> David L. Smith
-- 
Dave Haynie     Commodore-Amiga    Usenet: {ihnp4|uunet|rutgers}!cbmvax!daveh
   "The B2000 Guy"              PLINK : D-DAVE H             BIX   : hazy
		"I can't relax, 'cause I'm a Boinger!"

dan@rose3.Rosemount.COM (Dan Messinger) (12/04/87)

In article <3445@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>> Another trend...                     ...is that towards individual
>> (single-user) computers.  The future of multi-tasking on such machines is
>> very much in question...

>Does anybody seriously believe this?  I don't.

I don't either.  Single user workstations are definitely becoming more
popular.  But single tasking?  Never.  Who wants to tie up their computer
just to get a listing?  And I frequently will find other things to do while
my programs are compiling (Rogue, if nothing else :-).  And then there is
network activity or uucp.

Multitasking is a necessity!  So it should be part of the operating system
instead of kludged into the applications (e.g., the print spooler in MS-DOS).

Dan Messinger

peter@sugar.UUCP (Peter da Silva) (12/05/87)

In article <147@sdeggo.UUCP>, dave@sdeggo.UUCP (David L. Smith) writes:
> I agree with John,  however the ideal towards which we are striving is multi-
> tasking operating systems and single-tasking processors.  Admit it, don't
> you hate having your processor multi-task?

Hello, no. Until you have something incredible like 64000 processors in your
box, you're going to find yourself running out of the suckers. If you have
a connection machine on your desk, great! If you've got a transputer, you lose.
What happens when you want to start up that 5th or 9th program and you've
only got 4 or 8 CPUs?
-- 
-- Peter da Silva  `-_-'  ...!hoptoad!academ!uhnix1!sugar!peter
-- Disclaimer: These U aren't mere opinions... these are *values*.

dave@sdeggo.UUCP (David L. Smith) (12/07/87)

In article <1227@sugar.UUCP>, peter@sugar.UUCP (Peter da Silva) writes:
 >In article <147@sdeggo.UUCP>, dave@sdeggo.UUCP (David L. Smith) writes:
 >>I agree with John,  however the ideal towards which we are striving is multi-
 >>tasking operating systems and single-tasking processors.  Admit it, don't
 >>you hate having your processor multi-task?
 > 
 >Hello, no. Until you have something incredible like 64000 processors in your
 >box, you're going to find yourself running out of the suckers. If you have
 >a connection machine on your desk, great! If you've got a transputer, you 
 >lose.  What happens when you want to start up that 5th or 9th program and 
 >you've only got 4 or 8 CPUs?

Then one of your processors starts to multi-task.  Of course you never have
enough processors to handle everything, but it would be nice if you did.
It's an _ideal_.  

The question was "Don't you hate having your processor multi-task" and the
answers have been "No, because it has to."  If it didn't have to would you
want it to?  Timesharing is a necessary evil, and we use it because we have
finite resources.  The trend has been to increase these resources and at
the point where they become effectively infinite (if you have, say 256 
processors in a work station with a _very_ fast I/O system) the number of
times time-sharing would come into play would be very few.

The same arguments apply to virtual memory.  Do you like swapping, paging
and the rest of those fun things?  I don't, but I have to live with them
because I can't afford 4 gigabytes of RAM.  (Or 4 gig of disk for that 
matter).  Just because it's necessary doesn't mean you have to like it.

-- 
David L. Smith
{sdcsvax!man,ihnp4!jack!man, hp-sdd!crash, pyramid}!sdeggo!dave
man!sdeggo!dave@sdcsvax.ucsd.edu 
I don't need no steenkin' quote!

mmengel@cuuxb.UUCP (12/07/87)

In article <151@sdeggo.UUCP> dave@sdeggo.UUCP (David L. Smith) writes:
$In article <1227@sugar.UUCP>, peter@sugar.UUCP (Peter da Silva) writes:
$ >In article <147@sdeggo.UUCP>, dave@sdeggo.UUCP (David L. Smith) writes:
$ >>I agree with John,  however the ideal towards which we are striving is multi-
$ >>tasking operating systems and single-tasking processors.  Admit it, don't
$ >>you hate having your processor multi-task?
$ > 
$ >Hello, no. Until you have something incredible like 64000 processors in your
$ >box, you're going to find yourself running out of the suckers. If you have
$ >a connection machine on your desk, great! If you've got a transputer, you 
$ >lose.  What happens when you want to start up that 5th or 9th program and 
$ >you've only got 4 or 8 CPUs?
$
$Then one of your processors starts to multi-task.  Of course you never have
$enough processors to handle everything, but it would be nice if you did.
$It's an _ideal_.  
$
$The question was "Don't you hate having your processor multi-task" and the
$answers have been "No, because it has to."  If it didn't have to would you
$want it to?  Timesharing is a necessary evil, and we use it because we have
$finite resources.  


We also use time sharing because even on existing time sharing systems,
there is often a noticeable percentage of *idle time*, when every single
process on the system is waiting for i/o to complete.  This is even
true on *IX systems with disk cache hit rates of 80 and 90%.  The truth
is, on systems like sun workstations and the lot, if you take some 
process accounting statistics while you are running your c compile
in background, and rogue in foreground, with emacs running the whole
show, you still have easily 20% idle time on the cpu, even with little 
or no paging activity.

Before you decide that your system isn't as fast as you'd like it
to be because it's multi-tasking, try looking at some system activity
statistics for your system.
-- 
 Marc Mengel	

 attmail!mmengel
 ...!{moss|lll-crg|mtune|ihnp4}!cuuxb!mmengel

throopw@xyzzy.UUCP (12/07/87)

> peter@sugar.UUCP (Peter da Silva)
> Until you have something incredible like 64000 processors in your
> box, you're going to find yourself running out of the suckers. 

Note that the argument that "we don't need to multiplex a single processor
because we can simply throw more processors at the problem" is the same
argument as "we don't need to have demand paging because we can simply
buy more and more memory".  I think that both are incorrect arguments,
for exactly the reasons that Peter points out:

> What happens when you want to start up that 5th or 9th program and you've
> only got 4 or 8 CPUs?

Or, what happens when you want to keep N processes handy at a convenient
point in mid-execution, when they total to M pages of address space and
you've only got (M-1) pages of physical memory?

I think virtual processors are just as important as virtual memory.  And
I think virtual memory is *very* important.  The fact that many PCs
throw these things away is, I think, a Bad Thing.

--
"Solve THIS!"
"'Mighty Mouse is a wimp.'  But what does it mean?"
"It means you've made me MAD!"
                        --- Mighty Mouse to alien brain.
-- 
Wayne Throop      <the-known-world>!mcnc!rti!xyzzy!throopw

greg@xios.XIOS.UUCP (Greg Franks) (12/08/87)

#> Another trend...                     ...is that towards individual
#> (single-user) computers.  The future of multi-tasking on such machines is
#> very much in question...

#Does anybody seriously believe this?  I don't.

(nor I....)

Perhaps the real trend is to single tasks on single computers, but 
*lots* of computers in one box.  Now if Mr. Tramiel can get his 
Atari-as-Transputer going maybe we'll actually see this concept.

-- 
Greg Franks             XIOS Systems Corporation, 1600 Carling Avenue,
(613) 725-5411          Ottawa, Ontario, Canada, K1Z 8R8
utzoo!dciem!nrcaer!xios!greg    "There's so much to sea in Nova Scotia"

throopw@xyzzy.UUCP (Wayne A. Throop) (12/10/87)

> mmengel@cuuxb.ATT.COM (Marc W. Mengel)
>$ dave@sdeggo.UUCP (David L. Smith)
>$ [...] you never have
>$enough processors to handle everything, but it would be nice if you did.
>$It's an _ideal_.  

True enough.

>$The question was "Don't you hate having your processor multi-task" and the
>$answers have been "No, because it has to."  If it didn't have to would you
>$want it to?

That wasn't the question I heard, though the difference is subtle.  The
question I heard, implied strongly by the title and the discussion of
neglecting the capability for multitasking in PC OS software, is:

    "Don't you hate having your processor *capable* of multitasking?"

or, as I'd further paraphrase it

    "Don't you find the notion of 'virtual processors' useless,
     or at least wasteful?"

And my answer is 

    "No, because my resources are finite, and I'd rather have
     my capabilities degrade at the edges rather than chop off.
     It's no more useless or wasteful than virtual memory."

>$Timesharing is a necessary evil, and we use it because we have
>$finite resources.  

Agreed, but the evil is necessary even on "single user" workstations.

> Before you decide that your system isn't as fast as you'd  like it
>to be because it's multi-tasking, try looking at some system activity
> statistics for your system.

Amen.

--
Sometimes I think the only universal in the computing field
is the fetch-execute cycle.
                                        --- Alan J. Perlis
-- 
Wayne Throop      <the-known-world>!mcnc!rti!xyzzy!throopw

throopw@xyzzy.UUCP (Wayne A. Throop) (12/10/87)

Oops, I just realized that I simply repeated the Q&A Dave complained of:

> throopw@xyzzy.UUCP (Wayne A. Throop)
>>$ dave@sdeggo.UUCP (David L. Smith)
>>$The question was "Don't you hate having your processor multi-task" and the
>>$answers have been "No, because it has to."  If it didn't have to would you
>>$want it to?

Sorry about that.  But to answer the question directly:

Clearly not.  "If your program didn't have to do IO, would you want it
to?"  "If your OS didn't have to consume N Mb of disk space, would you
want it to?"

Anyhow, I view the question phrased this way as a little unrealistic.  Let
me rephrase it again:

    "Other things being equal (same #processors, same #bucks of
     hardware), would you rather do without virtual memory or processors, or
     would you rather pay the extra bucks of hardware/software to support it
     for your personal workstation?"

For a toy computer for non-essentials, I'd answer that I'd want it
cheaper.  For a computer for getting real work done, for real,
essential-to-daily-business computing, I'd answer that I'd want the
extra capability.  And the reason is that a toy/non-essential computer
can afford to have its capability chop off under load, but something
essential to getting work done must degrade gracefully.

--
Sometimes I think the only universal in the computing field
is the fetch-execute cycle.
                                        --- Alan J. Perlis
-- 
Wayne Throop      <the-known-world>!mcnc!rti!xyzzy!throopw

jcz@sas.UUCP (John Carl Zeigler) (12/10/87)

It might be informative to remember why multi-tasking was invented
in the first place.   Back in the 60's Mr Warbucks wandered down to the
Data Processing Department to see what kind of trouble that particular
budgetary sink-hole was getting into.    He wandered into the Computer
Room and spied a Computer Programmer with a couple cases of cards
turning dials and flipping switches.   "Hi, what are you doing?" he asked.
   "Oh, Hello Mr. Warbucks.  I am testing our new Social Security Accounting
and Collating System.   Here, watch this."
   With that he flipped one of the switches and a machine began spewing
cards into a box.   After a few hundred cards had dropped, one popped out
into a slot.  "See, here is your card," explained the Programmer.
   "That's nice, but what does that light mean?" he said,
pointing to a warm red light on the front panel.
   "That's the WAIT light.  It goes on whenever the CPU is idle."
   "But, it NEVER went off!"
   "Right!  I have made this program so fast that the light doesn't have
time to stop glowing enough for you to see it."
   "You mean for most of the time the machine wasn't doing anything!"
   "Well, that is one way to look at it."
   "Why couldn't it have been doing something else????   I am not going to
pay a hundred times your salary a year just to have the damn thing
WAITing!!!!   Get that fixed!"

Seriously.   I do not mind having my computer doing something else
while I am thinking, or while the disk drive is thinking, either.
Most applications are NOT CPU bound.  Most are I/O bound.  The overhead
of task switching will always be paid for by the increased utility of
the machine.  (not forgetting Little's law, L = λW)
Until we get a better architectural paradigm than the current
'von Neumann bottleneck', that is.    Any discussion on that??


-- 
--jcz
John Carl Zeigler
SAS Institute Inc.
Cary, NC  27511           (919) 467-8000        ...!mcnc!rti!sas!jcz

ram%shukra@Sun.COM (Renu Raman, Sun Microsystems) (12/10/87)

    Digressing away from the tasking issue a bit - how long will 
    uni-processor machines keep parallel processors at bay?  Looks
    like saturation of HW technology is nowhere near.

---------------------
   Renukanthan Raman				ARPA:ram@sun.com
   Sun Microsystems			UUCP:{ucbvax,seismo,hplabs}!sun!ram
   M/S 5-40, 2500 Garcia Avenue,
   Mt. View,  CA 94043

fouts@orville.nas.nasa.gov (Marty Fouts) (12/11/87)

In article <36083@sun.uucp> ram@sun.UUCP (Renu Raman, Sun Microsystems) writes:
>
>    Digressing away from the tasking issue a bit - how long will 
>    uni-processor machines keep parallel processors at bay?  Looks
>    like saturation of HW technology is nowhere near.
>

Well, as Gene Amdahl would say, it's not that simple.  Parallel
processing, at least in the sense of throwing multiple processors at a
single problem is currently difficult to use.  Software is hard to
write, easy to get wrong and nearly impossible to debug.  As long as
this stays true, parallel processor machines will have a burden beyond
price/performance to overcome before they are accepted.

As far as parallel processors for traditional multiprocessing, the
extra burden that multiple compute engines add to i/o facilities like
disk drives is such that they don't make much sense except in very
high end machines.  I still remember the PDP 11/782 (dual processor
11/780 used as a master/slave system) which ran i/o bound workloads
slower than an 11/780 because of the introduction of processor control
overhead without upscaling i/o performance.

Saturation of usable HW technology is close.

eugene@pioneer.arpa (Eugene Miya N.) (12/11/87)

In article <18@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty Fouts) writes:
>In article <36083@sun.uucp> ram@sun.UUCP (Renu Raman, Sun Microsystems) writes:
>>how long will uni-processor machines keep parallel processors at bay?
>
>Well, as Gene Amdahl would say, it's not that simple.  Parallel
>processing, at least in the sense of throwing multiple processors at a
>single problem is currently difficult to use.
 . . .
>Saturation of usable HW technology is close.

Merry Christmas, Renu:
Oh, the "how small can you make a Cray?" or Dan Ingalls' "Cray on a
wrist" idea?

Well, 1) we have some parallel processors now.  The most effective ones
are in loosely coupled, lightly stressed situations.  Just look at IBMs,
Univacs, etc.; we've had them for a long time (but with distinct processes
(jobs) per processor).  I think the Sequent/Flex/et al. way is currently
the way to build machines, not just workstations (not enough power), but
you have to make them cheaper.  Marty notes all the problems of the other
types.

2) Since Marty didn't answer the question, I would hazard a generic
guess (I don't know anything) of about 10 more years (far enough into
the future?).  There is a Rand Corp report by Sutherland, Mead, and
Everhart on the limitations of silicon (there are other similar reports)
which chronicles about 9 generations of silicon.  They were "3rd gen"
back in 1976 when they wrote it, and we are sort of "5-6 gen" by their
comments.  After all, how fast can you push the 68K ;-)?  I would just
hate to have some life-saving application which needs just that little bit
more power that silicon could not provide.  If you believe in the
infinite power of computing, then you might be able to tell me when we
have a sequential 100 Teraflops machine.

All comments grossly over-generalized to answer this general question.
From the Rock of Ages Home for Retired Hackers:

--eugene miya, NASA Ames Research Center, eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {uunet,hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene

fay@encore.UUCP (Peter Fay) (12/12/87)

In article <18@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty Fouts) writes:
>In article <36083@sun.uucp> ram@sun.UUCP (Renu Raman, Sun Microsystems) writes:
>>
>>    Digressing away from the tasking issue a bit - how long will 
>>    uni-processor machines keep parallel processors at bay?  Looks
>>    like saturation of HW technology is nowhere near.
>>
>
>Well, as Gene Amdahl would say, it's not that simple.  Parallel
>processing, at least in the sense of throwing multiple processors at a
>single problem is currently difficult to use.

In regard to general-purpose multiprocessors:

Every fork() (or thread_create() in Mach) in every program can get
scheduled on a different cpu (that includes every shell per user,
daemon, background task, ... Also, don't forget all those kernel
processes (or tasks, threads) running on different cpus (pagers,
signal handlers, ...). How difficult is it when the O.S.  does it
transparently? 

And then there are more sophisticated mechanisms ("micro-tasking", gang
scheduling, vectorizing fortran compilers) available to any user who
wants more capability. 

>...Software is hard to
>write, easy to get wrong and nearly impossible to debug...

Writing software which exploits the FULL parallelism of a machine MAY
be hard to do in CERTAIN cases. Otherwise the (user) software is
pretty much identical. Debugging is a whole other soapbox, but my
experience is that debugging coding errors is not much more difficult
than on uniprocessors.  What is hard (or "impossible" with current tools)
is detecting race conditions and bottlenecks - i.e. CONCEPTUAL errors. 
This is one of the many time lags in parallel software tool development,
not an inherent defect in the architecture. Race conditions are not a
common occurrence for users to debug.

> ...As long as
>this stays true, parallel processor machines will have a burden beyond
>price/performance to overcome before they are accepted.
>

Since this is not true, the only thing to overcome is common prejudice.
Price/performance advantage (for small multiprocessor minis and up) is 
huge.

>As far as parallel processors for traditional multiprocessing, the
>extra burden that multiple compute engines add to i/o facilities like
>disk drives is such that they don't make much sense except for in very
>high end machines...

I/O can be a bottleneck for PC's, minis and Crays. The solution is to 
parallelize I/O, which is what multiprocessors do. (General-purpose
multis are NOT "very high-end" -- they run from $70K - $750K).

> ...I still remember the PDP 11/782 (dual processor
>11/780 used as a master/slave system) which ran i/o bound workloads
>slower than an 11/780 because of the introduction of processor control
>overhead without upscaling i/o performance.

That was a long time ago in multiprocessor history. All I/O was done
by one CPU (master/slave) - sequential, not parallel. It was NOT a
symmetric multi like those today. 

>Saturation of usable HW technology is close.

What does this cliche mean?

By the way, I don't fault people for not understanding
about multis (e.g., from past exposure or school). It takes some
time for common misconceptions to catch up to current reality. 

-- 
			peter fay
			fay@multimax.arpa
{allegra|compass|decvax|ihnp4|linus|necis|pur-ee|talcott}!encore!fay

lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) (12/15/87)

Multitasking gives flexibility, which is considered virtuous.

Interestingly, there are cases when you get a performance win by simulating
a multiprocessor on a uniprocessor. For example, suppose you are searching a
big tree of choices while solving a Travelling Salesman type problem.  This
has the nice property that there is variance in the individual compute
times. So, if you have a (simulated) multiprocessor, some of the
computations finish earlier than their peers. This gives the overall search
new data (about lower bounds) which may allow you to kill some of the
computations that haven't finished yet. This can save work, not only
directly, but also by preventing the killed searches from generating
offspring. 


-- 
	Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

fouts@orville.nas.nasa.gov (Marty Fouts) (12/17/87)

In article <2341@encore.UUCP> fay@encore.UUCP (Peter Fay) writes:
>In regard to general-purpose multiprocessors:
>
>Every fork() (or thread_create() in Mach) in every program can get
>scheduled on a different cpu (that includes every shell per user,
>daemon, background task, ... Also, don't forget all those kernel
>processes (or tasks, threads) running on different cpus (pagers,
>signal handlers, ...). How difficult is it when the O.S.  does it
>transparently? 

Mr. Fay is using "transparently" in a way with which I am unfamiliar.  It
is true that Mach provides primitives which allow the programmer to
introduce multitasking into the program; but these are in no sense
transparent.  Task creation and deletion, synchronization, and task
scheduling all require explicit code in the program which is to take
advantage of the tasking.  Even the ancient Unix fork() has to be
explicitly coded for.

>
>And then there are more sophisticated mechanisms ("micro-tasking", gang
>scheduling, vectorizing fortran compilers) available to any user who
>wants more capability. 
>

A problem with these more sophisticated mechanisms is that they can
lead to parallel execution in which the wall clock time goes up as a
function of the number of processors, rather than down. . .  Another
problem is that no compiler can perfectly optimize.  The better the
programmer understands what the optimizer can optimize, the easier it
is to write optimizable code.  With vector code, this is fairly
straightforward to do, with parallel code, it is much more difficult.

>Writing software which exploits the FULL parallelism of a machine MAY
>be hard to do in CERTAIN cases. 

It has been my experience in five years of writing code to exploit
distributed and concurrent parallelism that it is hard in a large
number of cases.  There are well behaved algorithms, such as those
involved in PDE solution, for which parallelism is trivial.  There are
also algorithms requiring high communication cost, much
synchronization, and unpredictable work per task which are nearly
intractable.  There appears to be an entire class of algorithms for
which no parallel solution is more efficient than a sequential solution
on the same processor.

>Debugging is a whole other soapbox, but my
>experience is that debugging coding errors is not much more difficult
>than uniprocessors.  What is hard (or "impossible" with current tools)
>is detecting race conditions and bottlenecks - i.e. CONCEPTUAL errors. 
>This is one of the many time lags in parallel software tool development,
>not an inherent defect in architecture. Race conditions are not a
>common occurrence for users to debug.
>

I will not quibble over the defect's inherent nature, only agree that it
is a most difficult problem, and one which decades of parallel
processing research have not solved adequately.

>Price/performance advantage (for small multiprocessor minis and up) is 
>huge.
>

Again, this is only true for well behaved algorithms.  Throwing a
parallel machine at an average workload is like throwing a vector
machine at it, only more so:  The difference between vendor 'not to be
exceeded' speed for the machine and delivered throughput for the
workload can be two orders of magnitude, dramatically reducing the real
price / performance.

>I/O can be a bottleneck for PC's, minis and Crays. The solution is to 
>parallelize I/O, which is what multiprocessors do. (General-purpose
>multis are NOT "very high-end" -- they run from $70K - $750K).
>

Oddly enough, the Cray 2 does not solve the I/O bottleneck problem by
providing parallel I/O processors.  It is one of the best I/O balanced
multiprocessor machines, and accomplishes this with a single I/O
(foreground) processor, performing work for all four compute
(background) processors.  The point isn't how the scaling is
achieved, but that it frequently isn't achieved.

>> ...I still remember the PDP 11/782 ...
>
>That was a long time ago in multiprocessor history. All I/O was done
>by one CPU (master/slave) - sequential, not parallel. It was NOT a
>symmetric multi like those today. 

Actually, the 11/782 wasn't a long time ago in multiprocessor history.
The earliest multiprocessor machines were the earliest machines.  It
was John Von Neumann's greatest gift to get us out of the completely
parallel processor business into the uniprocessor business.  But even
recognizable dyadic multiprocessors started appearing a decade before
the 11/782.  The point of the analogy, which was poorly drawn, is that
many vendors are still falling into the trap of scaling up one
function of a machine, while not scaling up the rest of the system to
match, leading to very unbalanced systems.  If your favorite PC which
has a faster IP/FLOP rating than a 780 can't support 30 users because
it has poor I/O performance, adding N Transputers isn't going to allow
it to support 30 users. . .

(By the way, not all current multiprocessors are symmetrical.  They
 actually fall into a fairly wide range of classes with respect to how
 i/o load is distributed among the processors.)

>
>>Saturation of usable HW technology is close.
>
>What does this cliche mean?
>

It means I was in a hurry to finish, so didn't draw this point out.
Sorry about that.  It appears that the rate at which performance
improvement through new implementation technology occurs is slowing
dramatically.  Top end clock rates are down to a few nanoseconds, and
it doesn't appear likely that subnanosecond rates are going to arrive
before the end of the century.  This means the bottom end is within
two orders of magnitude of the top and rapidly getting closer.  To me
this means that radical performance improvements over the spectrum of
machines aren't as likely as a narrowing of the spectrum, thus
 "saturation of usable technology."

When you couple this with the fact that after twenty years of various
kinds of software/hardware research, parallel processing is still
limited to the technology described by Dijkstra in '63, and still has
the same problems in software development, it is probably fair to say
that little progress is being made in software either.

>By the way, I don't fault people for not understanding
>about multis (e.g., from past exposure or school). It takes some
>time for common misconceptions to catch up to current reality. 
>

Since I don't see a smiley face on this, I will assume that it was
intended to be as obnoxious and condescending as it sounds.  I won't
argue over who misunderstands what, but I will point out to Mr. Fay,
that I daily program a range of multiprocessors in both distributed
and parallel applications, and that my opinions are the result of
direct experience with all (MIMD, SIMD, multiflow, massive or sparse)
of the currently available kinds of multiprocessors, as well as many
of the dinosaurs.

As an example of a class of algorithms which is difficult to vectorize
or parallelize, let me pull out the ancient prime finder algorithm:

      IPRIME(1) = 1
      IPRIME(2) = 2
      IPRIME(3) = 3
      NPRIME = 3
      DO 50 N = 5, MAXN, 2
         DO 10 I = 3, NPRIME
            IQ = N / IPRIME(I)
            IR = N - (IPRIME(I) * IQ)
            IF (IR .EQ. 0) GO TO 40
            IF (IQ .LT. IPRIME(I)) GO TO 20
 10      CONTINUE
 20      NPRIME = NPRIME + 1
         IPRIME(NPRIME) = N
         IF (NPRIME .GE. MAXP) GO TO 60
 40      CONTINUE
 50   CONTINUE
 60   CONTINUE


Although there are different algorithms for finding primes, I use this
one to illustrate a class of problems which comes up frequently in my
work.  There exists some set from which must be drawn a subset.  There
exists a rule for ordering the set and another rule for determining if
an element is a member of the subset.  The determination rule requires
the subset drawn from all elements ordered before X in the initial set
be known to determine if X is a member of the subset.  The
determination test requires a comparison with some of the elements of
the known subset to decide if X should be added to the subset.
Usually, although not always, there is an ordering of the subset (not
necessarily the same as the ordering of the set) such that by
comparing X with members of the subset in order, it is frequently 
unnecessary to compare X with all of the members.

None of the vectorizing compilers that I have access to will attempt
to vectorize this algorithm.  The Alliant automatic parallelizer will
not attempt to parallelize it.  Most of the mechanisms I have tried to
handcraft a vector or parallel variant which remains true to the
previous paragraph have added sufficient extra work to the algorithm
that it runs more slowly as the number of processors increases.

I would be indebted to Mr. Fay, or anyone else who could provide me
with a parallel or vectorizable version of the algorithm which
maintained it as an example of the kind of problem I have described
above.

fay@encore.UUCP (Peter Fay) (12/22/87)

To begin with, my statement apparently was found offensive:

In article <25@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty Fouts) writes:

#>By the way, I don't fault people for not understanding
#>about multis (e.g., from past exposure or school). It takes some
#>time for common misconceptions to catch up to current reality. 
#
#Since I don't see a smiley face on this, I will assume that it was
#intended to be as obnoxious and condescending as it sounds. [...]

No insult was intended. My point is that, as shown by your own example
(about the multiprocessor VAX 11/782 being slower than a uniprocessor
780) older multiprocessor technology has left a bad impression of
multis in general, much of which has since been solved by symmetric
shared-memory multiprocessors.


Mr. Fouts' original statement regarding parallel processors:

#[...] Parallel
#processing, at least in the sense of throwing multiple processors at a
#single problem is currently difficult to use.  Software is hard to
#write, easy to get wrong and nearly impossible to debug.  As long as
#this stays true, parallel processor machines will have a burden beyond
#price/performance to overcome before they are accepted.


My response was the following:

#In regard to general-purpose multiprocessors:
 NOTE! ->     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Every fork() (or thread_create() in Mach) in every program can get
#scheduled on a different cpu (that includes every shell per user,
#daemon, background task, ... Also, don't forget all those kernel
#processes (or tasks, threads) running on different cpus (pagers,
#signal handlers, ...). How difficult is it when the O.S.  does it
#transparently?
#
#And then there are more sophisticated mechanisms ("micro-tasking", gang
#scheduling, vectorizing fortran compilers) available to any user who
#wants more capability. 

He responded to me:

#Mr. Fay is using transparently in a way with which I am unfamiliar.  It
#is true that Mach provides primitives which allow the programmer to
#introduce multitasking into the program; but these are in no sense
#transparent.  Task creation and deletion, synchronization, and task
#scheduling all require explicit code in the program which is to take
#advantage of the tasking.  Even the ancient Unix fork() has to be
#explicitly coded for.
#

Yes, fork() and thread_create() must be explicitly coded. One way of
looking at this though, is that, for example, Unix utilities already
have about 850 fork()'s ALREADY coded in them.  Compile them and those
fork()'s will generally be run in parallel by the O.S.  This usually is
not optimal code, but it IS parallel and ISN'T difficult.  Our modified
"make" utility uses simple forks, shared memory and a dependency graph
and often produces near-linear speedup. This parallelism enabled our
machine to have the fastest Ada Validation Suite (just over three
hours). No, this isn't a marketing pitch, just a justification of my
remarks. (I still don't know why anyone would want Ada, anyway :-)

I just compiled our 4.2 Unix kernel from scratch (152,000 lines in 13
minutes):

3461.2u 642.1s 13.26 508% 0+2k 1842+9439io 59pf+0w

Using a 6-cpu machine, I got a speedup of 508% (obviously, I had to
share some cpus with the O.S. and daemons). 


#
#A problem with [micro-tasking, gang scheduling, vectorizing] is that they can
#lead to parallel execution in which the wall clock time goes up as a
#function of the number of processors, rather than down. . .  

True. It "can" also go down. As we all know, it depends on the
algorithm and how it is coded. If there is a high amount of data
dependency, many synchronization points, and little computation
between, adding more processors may slow it down (on a supercomputer
or a multi). And if that's the only algorithm you have, then a multi
is not for you.  If, however, you do lots of compiles, use lots of
utilities, write operating system code (like me), use databases, AI,
sorts, TP, editing, concurrent Ada & Pascal, parallel OPS5, etc. (as I
said, _general_purpose_), then a multi is the only way to go.

#Another
#problem is that no compiler can perfectly optimize.  The better the
#programmer understands what the optimizer can optimize, the easier it
#is to write optimizable code.  With vector code, this is fairly
#straightforward to do, with parallel code, it is much more difficult.
#
I agree.

#>Writing software which exploits the FULL parallelism of a machine MAY
#>be hard to do in CERTAIN cases. 
#
#It has been my experience in five years of writing code to exploit
#distributed and concurrent parallelism that it is hard in a large
#number of cases. [...] There are
#also algorithms requiring high communication cost, much
#synchronization and unpredictable work per task which are nearly
#intractable.  There appears to be an entire class of algorithms for
#which no parallel solution is more efficient than a sequential solution
#on the same processor.

I too can come up with many examples of algorithms that are hard or
impossible to parallelize. Some architectures can exploit these better
than others. 

#[...] it is probably fair to say
#that little progress is being made in software either.
#

As I said originally:

"This is one of the many time lags in parallel software tool development,
 not an inherent defect in architecture."


>As an example of a class of algorithms which is difficult to vectorize
>or parallelize, let me pull out the ancient prime finder algorithm:

See my next followup - this is already too long.
-- 
			peter fay
			fay@multimax.arpa
{allegra|compass|decvax|ihnp4|linus|necis|pur-ee|talcott}!encore!fay

daveh@cbmvax.UUCP (Dave Haynie) (12/22/87)

in article <25@amelia.nas.nasa.gov>, fouts@orville.nas.nasa.gov (Marty Fouts) says:
> In article <2341@encore.UUCP> fay@encore.UUCP (Peter Fay) writes:
>>In regard to general-purpose multiprocessors:
>>Every fork() (or thread_create() in Mach) in every program can get
>>scheduled on a different cpu (that includes every shell per user,
>>daemon, background task, ... Also, don't forget all those kernel
>>processes (or tasks, threads) running on different cpus (pagers,
>>signal handlers, ...). How difficult is it when the O.S.  does it
>>transparently? 

> Mr. Fay is using transparently in a way with which I am unfamiliar.  It
> is true that Mach provides primitives which allow the programmer to
> introduce multitasking into the program; but these are in no sense
> transparent.  Task creation and deletion, synchronization, and task
> scheduling all require explicit code in the program which is to take
> advantage of the tasking.  Even the ancient Unix fork() has to be
> explicitly coded for.

I believe he's referring to the processor the code is executing on, not the
specific request for threading, as being transparent.  Right now if you
fork() or thread_create(), you start an asynchronous task of some kind that's
time shared on the same physical CPU.  In the proposed scenario, the OS will
break up the tasks as evenly as possible among the existing CPUs in the
system.  This distribution is completely transparent to the programmer; it
doesn't look any different than if it were running on a single machine, other
than the large increase in execution speed you may get.

This of course bypasses any of the arguments you get into when you introduce
more sophisticated ways of using the multiple processors, like vectorizing
or parallelizing compilers.  Tasks are already atomic units and trivial to
break up between CPUs, and the synchronization mechanisms are already put in
place by the programmer.  Of course, if you were writing code specifically
for a machine like this with 5 or 10 CPUs, you'd consider multiple tasks as
a speed enhancing device in many cases, where on the single CPU machine
they'd likely slow down your program.  A good argument for lightweight tasks
like what you get in Mach.

-- 
Dave Haynie     Commodore-Amiga    Usenet: {ihnp4|uunet|rutgers}!cbmvax!daveh
   "The B2000 Guy"              PLINK : D-DAVE H             BIX   : hazy
		"I can't relax, 'cause I'm a Boinger!"

fouts@orville.nas.nasa.gov (Marty Fouts) (12/23/87)

In article <2989@cbmvax.UUCP> daveh@cbmvax.UUCP (Dave Haynie) writes:

>I believe he's referring to the processor the code is executing on, not the
>specific request for threading, as being transparent.  Right now if you
>fork() or thread_create(), you start an asynchronous task of some kind that's
>time shared on the same physical CPU.  In the proposed scenario, the OS will
>break up the tasks as evenly as possible among the existing CPUs in the
>system.  This distribution is completely transparent to the programmer; it
>doesn't look any different than if it were running on a single machine, other
>than the large increase in execution speed you may get.
>

The last part of this isn't precisely true.  I have frequently found
that timing problems in multitasked codes, especially those which are
synchronization-primitive bugs, will not show up in single-processor
execution because of "accidental" synchronization as a result of
sharing the single processor.  Even pipelines can give different
behaviour on a multiprocessor than on a uniprocessor, although the
producer/consumer synchronization usually prevents this.

There just isn't any such thing as transparent when dealing with
cooperatively multitasked codes.

muller@alliant.Alliant.COM (Jim Muller) (12/23/87)

In <25@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty Fouts) writes,
with some references to article <2341@encore.UUCP> fay@encore.UUCP (Peter Fay):

> >...And then there are more sophisticated mechanisms ("micro-tasking", gang
> >scheduling, vectorizing fortran compilers) available to any user who...

>A problem with these more sophisticated mechanisms is that they can
>lead to parallel execution in which the wall clock time goes up as a
>function of the number of processors, rather than down... 
   With vectorization this is typically due to excessive masking.  In heavily
   IF'ed code, the processor(s) can end up doing much more work than necessary.
   The vector overhead generally is insignificant by comparison.  True parallel
   processing, however, does not suffer this problem, since any one processor
   never has to execute the "not-to-be-executed" portions.  For this reason,
   true parallel processing can reduce runtimes even on code for which vector
   processing is a liability.

   [
   Example of code for which vectorization may be a liability:
       DO 100 I = 1, 100
          IF (NS.GT.M) THEN
             .
             .
          ELSE IF (N.GT.MX) THEN
             IF (II.EQ.IX) THEN
                .
                .
             ENDIF
          ENDIF
   100  CONTINUE
   ]

>Another problem is that no compiler can perfectly optimize.  The better the
>programmer understands what the optimizer can optimize, the easier it
>is to write optimizable code.  With vector code, this is fairly
>straightforward to do, with parallel code, it is much more difficult.
>It has been my experience in five years of writing code to exploit
>distributed and concurrent parallelism that it is hard in a large
>number of cases.
   It is a truism for any machine, vector- or multi- or not, that a good
   program will run better than a poor one, and that a programmer who is
   ignorant of the compiler's capabilities can accidentally defeat its best
   efforts.  The same holds for hardware issues as well as software.  As for
   parallelism being harder to exploit than vectorization, the reverse is more
   likely to hold.  Parallel execution can be applied to practically any
   vectorizable structure, while the reverse is decidedly not true.  Many
   structures can be run as parallel processes, with runtime speedup, even if
   vectorization is either impossible or a liability.  Anyway, it should be
   the compiler's job to analyze code for parallel or vector opportunities,
   and compilers capable of doing that well do exist.

>There are well behaved algorithms, such as those involved in PDE solution
>for which parallelism is trivial. There are also algorithms requiring high
>communication cost, much synchronization and unpredictable work per task
>which are nearly intractable.  There appears to be an entire class of
>algorithms for which no parallel solution is more efficient than a sequential
>solution on the same processor.
   Indeed parallelism is trivial in some problems.  That does not make it
   useless.  Instead, it makes it easy and efficient.  More to the point,
   however, for those problems which do not exhibit inherent parallelism,
   nothing requires the machine (or compiler) to use it where inappropriate.
   All that is needed is that the unused processors be doing something else.
   For problems where it *is* appropriate, it can produce significant runtime
   speedup.  In fact, most problems exhibit "some" parallelism, usually enough
   to be beneficial.  For large, computationally intensive problems, the gain
   in execution speed from parallelism will usually justify the amount of
   effort required to exploit it, especially if the compiler rather than the
   programmer does most of the work.
   
> >Debugging is a whole other soapbox, but my experience is that debugging
> >coding errors is not much more difficult than uniprocessors.  What is hard
> >(or "impossible" with current tools) is detecting race conditions and
> >bottlenecks - i.e. CONCEPTUAL errors. 
   Race conditions are typically a problem only when a compiler has been
   inappropriately "permitted" to optimize code that it normally would have
   refused.  The obvious approach, of course, should be to debug in single-
   processor mode.

>...Throwing a parallel machine at an average workload is like throwing a
>vector machine at it, only more so:  The difference between vendor "not to
>be exceeded" speed for the machine and delivered throughput for the
>workload can be two order of magnitude, dramatically reducing the real
>price / performance.
   This was discussed by this group just a month or so ago.  There are really
   two causes.  The first is the structure of the code itself.  Certainly,
   real life programs are often dominated by scalar code, and better-written
   code will be less likely to thwart the compiler.  More importantly, "peak"
   speeds quoted are usually the output rate during a vector instruction,
   totally ignoring the vector startup time.  Startup time can be significant
   and must be paid every time a vector instruction starts (or repeats if the
   loop length is greater than the vector length).  When this is taken into
   account, the *real* maximum rates can be achieved (well, almost) by codes
   that are well-structured for the machine.  For an "average workload",
   a multi-processor machine can be beneficial if the various processors are
   free to do other jobs when parallelism is inappropriate.

>Top end clock rates are down to a few nanoseconds, and it doesn't appear
>likely that subnanosecond rates are going to arrive before...
   Light (or electromagnetic pulses) can only travel so fast, so you have to
   make things smaller to make them faster (obviously).  This is *exactly* why
   radically different approaches such as parallelism, etc. are *appropriate*
   courses to follow, rather than useless.  It is the single processor that is
   becoming harder to speed up.

>...it is a most difficult problem, and one which decades of parallel
>processing research has not solved adequately...
>When you couple this with the fact that after twenty years of various
>kinds of software/hardware research, parallel processing is still
>limited to the technology described by Dijkstra in '63, and still has
>the same problems in software development, it is probably fair to say
>that little progress is being made in software either.

   My background is science, not computer architecture, so I do not fully
   appreciate the "decades of parallel processing research" or the "past
   twenty years of various kinds of software/hardware".  But I am quite
   familiar with some approaches, and have seen that it does work. The
   Alliant FX/8 was based on Kuck (1976), and in fact, was designed to
   address many of the objections you raised about parallel processing's
   limitations.  It certainly cannot be described as equivalent to anything
   from 20 years ago.  It does have most of the features I have described
   above, including an intelligent compiler (you are right, the programmer
   should not have to explicitly deal with parallelism), parallel I/O
   capability, slowdown with vectorization for heavily-masked code (most
   vector machines behave this way), "other jobs" for free (non-busy)
   processors when parallelism is inappropriate, race-condition failure
   when the compiler is permitted to optimize inappropriately, good
   parallelism for some codes and none at all for others, significant
   vector startup time that reduces the "peak" speed to a more "real" speed,
   and, finally, a dependence, as with any other machine, on the programmer
   avoiding any of the pitfalls that will defeat the machine's best intentions.

REFERENCES
Kuck, D., "Parallel Processing of Ordinary Programs," Advances in Computers,
   Vol. 15, 119-179, M. Rubinoff and M.L. Yovits, Eds., Academic Press, 1976.
-----------------------------------------------------------------------------
This is not an official Alliant response.  I am simply an employee who feels
differently than you do.  I only mentioned the FX/8 because it is a good,
real-world, functioning example that attempts to address the valid objections
you raised, and because I understand it.  Alliant had no input into this
posting, nor did Alliant require me to write this disclaimer.

fouts@orville.nas.nasa.gov (Marty Fouts) (12/24/87)

In article <1030@alliant.Alliant.COM> muller@alliant.UUCP (Jim Muller) writes:
>In <25@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty Fouts) writes,
>>A problem with these more sophisticated mechanisms is that they can
>>lead to parallel execution in which the wall clock time goes up as a
>>function of the number of processors, rather than down... 
>   With vectorization this is typically due to excessive masking.  In heavily
>   IF'ed code, the processor(s) can end up doing much more work than necessary.
>   The vector overhead generally is insignificant by comparison.  True parallel
>   processing, however, does not suffer this problem, since any one processor
>   never has to execute the "not-to-be-executed" portions.  For this reason,
>   true parallel processing can reduce runtimes even on code for which vector
>   processing is a liability.

Although it is true that parallelization doesn't suffer from vector
start up overhead, it does suffer from various degrees of
synchronization overhead and communication cost.  There are pathological
algorithms which will actually perform worse on a multiprocessor in
concurrent mode as a result of that overhead than they perform in
single processor mode on the same machine.

>   Race conditions are typically a problem only when a compiler has been
>   inappropriately "permitted" to optimize code that it normally would have
>   refused.  The obvious approach, of course, should be to debug in single-
>   processor mode.
>

This of course assumes a compiler which can recognize the concurrency
and generate concurrent code.  While Alliant has demonstrated that
there are a wide range of constructs for which concurrency can be
detected, there is still a much wider range of constructs for which it
has to be explicitly generated (and a wide range of vendors whose
compilers do not deal with it.)  For this class of problem, it is very
difficult to generate correct code and even more difficult to debug
code.

>   Light (or electromagnetic pulses) can only travel so fast, so you have to
>   make things smaller to make them faster (obviously).

This is not obvious, only typical.  You can also make them fast by
making them more efficient. (;-)  Anyway, my problem with parallelism
as a way to speed things up is that it speeds up the machine's ability
to handle a workload, which doesn't help me because management will
just put more users on it, but doesn't speed up its ability to handle
my intractable problems.

>
>   My background is science, not computer architecture, so I do not fully
>   appreciate the "decades of parallel processing research" or the "past
>   twenty years of various kinds of software/hardware".  But I am quite
>   familiar with some approaches, and have seen that it does work. The
>   Alliant FX/8 was based on Kuck (1976), and in fact, was designed to
>   address many of the objections you raised about parallel processing's
>   limitations.  It certainly cannot be described as equivalent to anything
>   from 20 years ago. . .

To quibble: your reference is 12 years old, and is based on work done
several years earlier.  I will split the decade with you and say 15
years instead.  And, of course, Kuck's work is based on earlier
theory.  I agree that commercially available multiprocessors have only
become accessible in the last 5 years, and that the technology that
Kuck described isn't widely available yet.  I still maintain that if
you stray from the path 

>This is not an official Alliant response.  I am simply an employee who feels
>differently than you do.  I only mentioned the FX/8 because it is a good,
>real-world, functioning example that attempts to address the valid objections
>you raised, and because I understand it.  Alliant had no input into this
>posting, nor did Alliant require me to write this disclaimer.

I have written code for the Alliant, and it does a good job of
recognizing concurrency that it can find.  In the interest of a
balanced point of view, I am trying to point out that there is a long
way to go before this approach will be widely usable to speed up
delivered performance to general purpose work loads.  By all means,
people with vectorizable code should use vector computers, people with
concurrent code should use multiple processors, and multiple processors
should be used for competitive multitasking.  But we should not look
to these technologies to continue the once rapid growth in overall
computer performance.

cik@l.cc.purdue.edu (Herman Rubin) (12/26/87)

In article <1030@alliant.Alliant.COM>, muller@alliant.Alliant.COM (Jim Muller) writes:
								As for
>    parallelism being harder to exploit than vectorization, the reverse is more
>    likely to hold.  Parallel execution can be applied to practically any
>    vectorizable structure, while the reverse is decidedly not true. 

If you are doing even moderately efficient methods of generating non-uniform
random numbers, this is likely to be false.  A flexible vector structure like
the CYBER 205 makes the acceptance-rejection step trivial.  A rigid vector
structure like the CRAY makes it somewhat harder.  On a MIMD machine, we
have to wait for the unluckiest processor to finish.  On a SIMD machine, we
have this cost, and also must have both an acceptance and a rejection step
for each cycle which has a rejection, as well as testing at each cycle
whether a rejection has occurred _on any processor_.

							More to the point,
>    however, for those problems which do not exhibit inherent parallelism,
>    nothing requires the machine (or compiler) to use it where inappropriate.
>    All that is needed is that the unused processors be doing something else.

This eliminates SIMD machines, which are currently the ones with the greatest
(by far) parallelism.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

bzs@bu-cs.BU.EDU (Barry Shein) (12/28/87)

Posting-Front-End: GNU Emacs 18.41.4 of Mon Mar 23 1987 on bu-cs (berkeley-unix)

Marty Fouts continues a tradition in discussions of the relative
merits of parallel machines which can be summed up in the simple
statement of the "best as enemy of the good".

I don't think any point is made when someone pulls out some algorithm
and challenges the community to parallelize it.  What's the point?  Has
anyone denied that such problematic algorithms exist?

He also questions the merits of Peter Fay's claim of development
transparency in some modern machines. Here's an example:

I compile a system with about 100 C modules on a Vax750
(4.3,8MB,RA81s), elapsed time is 1 hour 39 minutes.

I do the following on exactly the same code on an Encore Multimax
(6x332,28MB,CDC drives):

	% setenv PARALLEL 8
	% time make

and it compiles in 2 minutes and 10 seconds, or about a 50x speedup.

Do you like waiting an hour for a compile (assuming your machine is
two or three times faster than the 750)? Is it worth anything to you
to simply solve that problem? It's not like the parallel systems are
much more expensive than what you're probably using now (the Encore
I describe above is ~$200K.)

So the real question is: Is the value of parallel systems enhanced
by their achievements or diminished by the hard problems which exist?

I also have a parallelized raytrace program which exhibits roughly N
performance increase as it's re-run allowing more CPUs to participate
(same machine.) Sure, it's an easy, nearly trivial problem to
parallelize (NOT transparent, but all I have to do is break the loop
across simple forks, about 20 lines of additional C code), so what?
Does that mean it's not worthwhile and only impossible problems are to
be considered? That would be a strange view.

I am continuously amazed at people who, upon learning that I own
several parallel systems and use them in production work immediately
remark that they find it interesting, but OF COURSE there are all these
problems (most of which they themselves don't need solved) which can't
be parallelized (or not easily.)

Seriously, what's the point? Just the 80's version of all the people
that used to decry higher level languages because they could pull out
this algorithm that no compiler could do as good a job on as a
hand-coded solution. So what!?

	-Barry Shein, Boston University

reggie@pdn.UUCP (George W. Leach) (12/28/87)

In article <18013@bu-cs.BU.EDU> bzs@bu-cs.BU.EDU (Barry Shein) writes:

>I am continuously amazed at people who, upon learning that I own
>several parallel systems and use them in production work immediately
>remark that they find it interesting, but OF COURSE there are all these
>problems (most of which they themselves don't need solved) which can't
>be parallelized (or not easily.)


      So nobody is forcing them to "parallelize" these problems!  However,
for those areas where we need the help, you can utilize this ability.  And
certainly in the future we will get a better handle on how to harness all
of this power in other areas.  One thing that for years I have wanted is
the ability to assign all database processes to specific CPUs.  On a single-
CPU machine, a few DB backends combined with other people compiling and
editing can bring a machine to a slow crawl very quickly.


-- 
George W. Leach					Paradyne Corporation
{gatech,rutgers,attmail}!codas!pdn!reggie	Mail stop LF-207
Phone: (813) 530-2376				P.O. Box 2826
						Largo, FL  34649-2826

eugene@pioneer.arpa (Eugene Miya N.) (12/29/87)

In article <1030@alliant.Alliant.COM> muller@alliant.UUCP (Jim Muller) writes:
>> >Debugging is a whole other soapbox, but my experience is that debugging
>   Race conditions are typically a problem only when a compiler has been
>   inappropriately "permitted" to optimize code that it normally would have
>   refused.  The obvious approach, of course, should be to debug in single-
>   processor mode.

Excuse me for chuckling.  This approach was mentioned and rejected
several years ago at Cray User Group meetings.  [I note your innocence
below.]  There are several good old HEP papers which show single machine
debugging isn't the solution.  I just hope that few other readers make
the same mistake.  Not every problem is a race condition, but we've
seen some good ones.

>   My background is science, not computer architecture, so I do not fully
>   appreciate the "decades of parallel processing research" or the "past

Don't worry, we won't hold that against you.  ;-) [Nor the comment that you
worked for Alliant.]

From the Rock of Ages Home for Retired Hackers:

--eugene miya, NASA Ames Research Center, eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {uunet,hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene

brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (12/30/87)

In article <3771@ames.arpa> eugene@pioneer.UUCP (Eugene Miya N.) writes:
>In article <1030@alliant.Alliant.COM> muller@alliant.UUCP (Jim Muller) writes:
>>> >Debugging is a whole other soapbox, but my experience is that debugging
>>   Race conditions are typically a problem only when a compiler has been
>>   inappropriately "permitted" to optimize code that it normally would have
>>   refused.  The obvious approach, of course, should be to debug in single-
>>   processor mode.
Although the above comments are a little off the wall, the debugging of a
parallel program is best done in two steps.  First you run it in
serial mode (parallel code but with one processor) and find and kill
all of the bugs that show up.  If your program dumps core here the
use of a good serial debugger will be of great help.  After doing this
you run the code with more than one processor.  Bad things that are
hard to pin down can still happen, (gee Eugene, this program ran fine
through 13 cycles and then dumped core out of the blue!) but at
least you won't have the serial program bugs biting you when you are
running the code in parallel.

Of course there are some implicit assumptions:
	1) The parallel programming model you use uses the same code
regardless of the number of processors (the number of processors
is a runtime parameter that may not even be known to the user of the
code)
	2) The code produces the same results (perhaps to machine roundoff
accuracy to satisfy those who have real numerical parallel code experience)
regardless of the number of processors you run it with.