[comp.arch] Speed

lfoard@wpi.wpi.edu (Lawrence C Foard) (05/04/89)

Why isn't parallel computing used more? I can't think of many problems that
couldn't be done in parallel if the hardware and software to do it were done
well. We have an Encore here at WPI, and although most programs don't make
use of the multiple processors directly, simple things like being able to do
a parallel make greatly increase the speed. Doubling a processor's speed
costs much, much more (depending on how close you are to the current limit),
but doubling the number of processors only doubles the cost. For example, 30
ARM chips would cost around $1000 and deliver a total of around 200 MIPS!
With some clever programming it should be possible to make use of this; for
example, you could have the ultimate game machine. 3D hidden-surface removal
using Z-buffering can be done mostly in parallel. Imagine a flight simulator
with real shaded 3D surfaces. Any numeric problem where a cell depends only
on its neighbors can also easily be broken up, for example solving Laplace's
equation, or even better, drawing Mandelbrot sets. Since most applications
that need speed can be made parallel, it would seem logical to spend more
effort finding cheap ways to implement parallel processing rather than
trying to squeeze every last drop out of a single processor.
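
Here is a rough sketch of what I mean, in C with POSIX threads (the thread
calls, image size, and thread count are purely for illustration; on a real
multiprocessor you'd use whatever process or thread mechanism it provides).
Since each Mandelbrot pixel depends only on its own coordinates, the rows
can simply be dealt out among the processors:

/* Rough sketch: Mandelbrot rows dealt out to NTHREADS workers.      */
/* POSIX threads are used purely for illustration; WIDTH, HEIGHT,    */
/* MAXITER and NTHREADS are made-up example values.                  */
#include <pthread.h>
#include <stdio.h>

#define WIDTH    256
#define HEIGHT   256
#define MAXITER  255
#define NTHREADS 4

static unsigned char image[HEIGHT][WIDTH];

/* Iteration count for the point c = cr + ci*i. */
static int mandel(double cr, double ci)
{
    double zr = 0.0, zi = 0.0;
    int i;

    for (i = 0; i < MAXITER && zr * zr + zi * zi < 4.0; i++) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
    }
    return i;
}

/* Each worker takes every NTHREADS-th row; rows are independent, */
/* so no locking is needed while computing.                       */
static void *worker(void *arg)
{
    int id = *(int *) arg;
    int x, y;

    for (y = id; y < HEIGHT; y += NTHREADS)
        for (x = 0; x < WIDTH; x++)
            image[y][x] = (unsigned char)
                mandel(-2.0 + 3.0 * x / WIDTH, -1.5 + 3.0 * y / HEIGHT);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    printf("done: center pixel = %d\n", image[HEIGHT / 2][WIDTH / 2]);
    return 0;
}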

P.S. What important (time-consuming) problems can't make use of parallel
     processing?

-- 
Disclaimer: My school does not share my views about FORTRAN.
            FORTRAN does not share my views about my school.

chuck@melmac.harris-atd.com (Chuck Musciano) (05/04/89)

In article <2187@wpi.wpi.edu> lfoard@wpi.wpi.edu (Lawrence C Foard) writes:
>Why isn't parallel computing used more? I can't think of many problems that
>couldn't be done in parallel if the hardware/software to do it was done well.

     Well, having been a player in the quest for working parallel machines,
I can list some of the problems encountered in building large-scale parallel
machines:

	How do you program it?  Parallelize existing code (the dusty deck
	     problem), retrofit parallel constructs onto existing languages,
	     or design entirely new parallel programming languages?  Also
	     nontrivial: how do you change the mindset of thousands of
	     programmers well-versed in serial programming techniques?

	For shared memory machines: what interconnection scheme do you use?
	     Flat memory space?  Hierarchical?  What about caches to reduce
	     long memory latencies?  Cache coherency for thousands of
	     processors?

	For message passing architectures: what interconnection scheme do
	     you use?  How do you handle messages in general?  What sort
	     of communication synchronization should you use?

	What sort of task granularity are you trying to support?  Fine-grained
	     tasks of just a few instructions each?  Coarse-grained tasks of
	     many thousands of instructions?  If you choose fine-grained, how
	     will you reduce the task overhead to make the tasks effective?
	     If coarse-grained is the way to go, how do you solve the load
	     balancing problem when just a few processors have all the work,
	     starving the remaining processors?

	Will you use static or dynamic scheduling?  Static makes the task
	     overhead smaller, encouraging granularity reduction, but it is
	     very difficult to predict at compile time exactly how big a task
	     will be, so that you can correctly place tasks in the machine.
	     Dynamic scheduling solves the load balancing problem at runtime,
	     but increases task overhead to move tasks through the system
	     (see the work-queue sketch after this list).  In addition,
	     runtime heuristics for data distribution throughout large
	     networks of machines are difficult to formulate.

	How do you debug your programs?  What sort of mental model do you give
	     your application programmers who must deal with hundreds of
	     processors accessing both global and local data objects
	     simultaneously?  How do you implement debugging primitives so
	     that you can easily single-step any, all, or just one processor?
	     How can the debugger access all the data in all the processors,
	     including the caches and local buffers?  How do you even stop
	     an entire machine all at once, given certain propagation delays
	     throughout the architecture?  How do you debug programs which
	     are completely nondeterministic, with execution ordering being
	     essentially random?

	How about your programming model?  Communicating sequential processes?
	     Control flow with barrier synchronization?  Data flow?  If so,
	     static or dynamic?  Graph reduction?  How much of the underlying
	     machine model do you promote into the programming languages?
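
     To make the scheduling and load-balancing trade-off a bit more concrete,
here is a rough sketch of the dynamic approach (in C with POSIX threads, used
purely for illustration; the "task" is a made-up stand-in for whatever unit
of work the application has).  Idle processors pull the next task from a
shared queue, so unevenly sized tasks balance themselves out, but every task
now pays for the lock that protects the queue, which is exactly why very
fine-grained tasks don't pay off:

/* Rough sketch of dynamic scheduling with a shared work queue.      */
/* POSIX threads and the task sizes are illustrative assumptions.    */
#include <pthread.h>
#include <stdio.h>

#define NTASKS   1000
#define NWORKERS 8

static double results[NTASKS];
static int next_task = 0;                 /* index of next undone task */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static double do_task(int i)              /* stand-in for real work, with */
{                                         /* deliberately uneven cost     */
    double s = 0.0;
    int k;

    for (k = 0; k <= i; k++)
        s += (double) k / (i + 1);
    return s;
}

static void *worker(void *arg)
{
    (void) arg;
    for (;;) {
        int i;

        pthread_mutex_lock(&queue_lock);  /* per-task overhead lives here */
        i = next_task++;
        pthread_mutex_unlock(&queue_lock);

        if (i >= NTASKS)
            return NULL;                  /* queue drained */
        results[i] = do_task(i);
    }
}

int main(void)
{
    pthread_t tid[NWORKERS];
    int i;

    for (i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);

    printf("last result = %f\n", results[NTASKS - 1]);
    return 0;
}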

     I could go on and on.  Your instincts are right: there aren't many things
which wouldn't run faster on a parallel machine.  Unfortunately, what seems 
like a straightforward solution is in reality a very hard problem.  If you
are really interested, you might want to browse through past proceedings of
the International Conference on Parallel Processing (Penn State Press) or
some of the architecture and supercomputing conference proceedings.

Chuck Musciano			ARPA  : chuck@trantor.harris-atd.com
Harris Corporation 		Usenet: ...!uunet!x102a!trantor!chuck
PO Box 37, MS 3A/1912		AT&T  : (407) 727-6131
Melbourne, FL 32902		FAX   : (407) 727-{5118,5227,4004}

johnr@ogccse.ogc.edu (John Roberts) (05/05/89)

In article <2187@wpi.wpi.edu>, lfoard@wpi.wpi.edu (Lawrence C Foard) writes:
> 
> Why isn't parallel computing used more? I can't think of many problems...

Most people quote Amdahl's Law, which essentially says that some portion of a
program is sequential in nature, and the amount of time it takes to execute
that sequential code is the lower limit on the amount of time it will take
the program to execute.  Others have disagreed, saying that you would
rewrite the program (eliminating the sequential code) for parallel systems.
This led to Mohler's law, which says that you'd scale up the parallelizable
part of the program and solve larger problems.
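
To make that concrete, here is a toy C program (the 10% serial fraction is
just a made-up example value, and the "scaled" formula is the usual
grow-the-problem-with-the-machine form).  Amdahl's speedup,
1 / (s + (1 - s)/N), tops out near 1/s no matter how many processors you
add, while the scaled version keeps growing:

/* Toy comparison of Amdahl's law vs. scaled speedup.                */
/* The 10% serial fraction is an arbitrary example value.            */
#include <stdio.h>

int main(void)
{
    double s = 0.10;   /* serial fraction of the program */
    int n;

    for (n = 1; n <= 1024; n *= 4) {
        double amdahl = 1.0 / (s + (1.0 - s) / n);  /* fixed problem size   */
        double scaled = s + (1.0 - s) * n;          /* problem grows with n */

        printf("%4d processors: Amdahl %6.2fx, scaled %7.2fx\n",
               n, amdahl, scaled);
    }
    return 0;   /* Amdahl tops out near 1/s = 10x */
}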

There are real parallel systems out there.  The Cogent XTM system from
Cogent Research is one example (of a distributed-memory system).

For around $20K you get two Inmos T800 transputers, each with 4 Mbytes of
RAM, a 90 Mbyte hard drive, an 800K floppy drive, and a 1024-by-808
grey-scale monitor with 256 grey levels.  However, the most important thing
about the system is that it is scalable!  You can attach a "resource server"
with more processors (and memory).  The operating system has built into it a
communication system called "Linda" which makes parallel programming much
easier than on other parallel machines.  The operating system is highly
compatible with Unix System V.
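
For a feel of what Linda-style code looks like, here is a rough master/worker
sketch in C-Linda syntax.  I'm assuming only the classic tuple-space
operations (out, in, eval); I haven't written code for the Cogent
implementation, so its details will certainly differ, and this would need a
Linda preprocessor and runtime to actually run:

/* Rough master/worker sketch in C-Linda style (assumed syntax).  */
#include <stdio.h>

#define NTASKS   100
#define NWORKERS 4

int worker();

int real_main()                         /* C-Linda's entry point         */
{
    int i, result, total = 0;

    for (i = 0; i < NWORKERS; i++)
        eval("worker", worker());       /* spawn workers as live tuples  */

    for (i = 0; i < NTASKS; i++)
        out("task", i);                 /* drop work into tuple space    */

    for (i = 0; i < NTASKS; i++) {
        in("result", ?result);          /* collect answers, any order    */
        total += result;
    }

    out("task", -1);                    /* poison pill: shut workers down */
    printf("total = %d\n", total);
    return 0;
}

int worker()
{
    int i;

    for (;;) {
        in("task", ?i);                 /* idle workers pull the next task */
        if (i < 0) {
            out("task", -1);            /* pass the pill along and quit    */
            return 0;
        }
        out("result", i * i);           /* made-up computation             */
    }
}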

The future of computing lies in parallel programming.
 
John Roberts
johnr@ogccse.ogc.edu