[comp.arch] parallel systems vs. uni-processors

ds@hollin.prime.com (10/25/89)

/* Written  5:06 pm  Oct 24, 1989 by hsu@uicsrd.csrd.uiuc.edu in comp.arch */
In article <6655@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
>
>Also, the job of programming a shared memory system is a lot harder than
>programming a system with messages as the communication medium. 

Umm, have you tried? 

When I took a parallel programming class a couple of years ago, everybody
griped about the problems of coding Gaussian elimination for the
hypercube. Hardly anybody griped about implementing it for the Sequent
Balance. (This single example of course doesn't prove anything, but I
would like to see some examples of problems that are clearly easier to
code for a message-passing system.)
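
For what it's worth, the shared-memory version really is short.  Below is a
sketch in C of the forward-elimination step (no pivoting, for brevity), with
an OpenMP-style parallel-for directive standing in for the Sequent's own
parallel programming library; the structure is the same either way: the rows
below the pivot are updated independently, so one parallel loop over rows
does the job, with no explicit communication.

#include <stdio.h>

#define N 4

/* Forward elimination on an augmented matrix [A | b], no pivoting.
   Each row i > k is updated independently of the others, which is
   exactly what a shared-memory parallel loop wants. */
void eliminate(double a[N][N + 1])
{
    for (int k = 0; k < N - 1; k++) {
        #pragma omp parallel for
        for (int i = k + 1; i < N; i++) {
            double m = a[i][k] / a[k][k];
            for (int j = k; j <= N; j++)
                a[i][j] -= m * a[k][j];
        }
    }
}

int main(void)
{
    /* A small test system whose solution is x = (1, 1, 1, 1). */
    double a[N][N + 1] = {
        { 2, 1, 1, 0,  4 },
        { 4, 3, 3, 1, 11 },
        { 8, 7, 9, 5, 29 },
        { 6, 7, 9, 8, 30 },
    };
    double x[N];

    eliminate(a);

    /* Back-substitution, left sequential: it is O(N^2), not the issue. */
    for (int i = N - 1; i >= 0; i--) {
        x[i] = a[i][N];
        for (int j = i + 1; j < N; j++)
            x[i] -= a[i][j] * x[j];
        x[i] /= a[i][i];
    }
    for (int i = 0; i < N; i++)
        printf("x[%d] = %g\n", i, x[i]);
    return 0;
}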

Bill
/* End of text from hollin:comp.arch */

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

An example of a problem that is clearly easier to code on a message-passing
system than on a shared-memory MIMD system like the Sequent is ordinary array
(matrix) cross-product.  According to its programming literature, the Sequent
offers no particular support beyond running unrelated cross-products
simultaneously on different 386s.  True data-parallel computers, however
(defined as computers powerful enough to simulate a systolic 2-d grid
efficiently), can implement the cross-product with an elegant linear-time
algorithm (measured in the word model, not the bit model).  It simply feeds
one array into the grid's rows and the other into its columns, delaying each
succeeding row (and likewise each succeeding column) by one time unit; each
processor does an add and a multiply per step.  Let me know if you want the
details; you can work them out for yourself easily by examining the terms of
a written-out symbolic cross-product.  (Furthermore, this algorithm can also
"pipeline" unrelated cross-products to bring the multiplicative overhead down
close to 1, although that is a tiny effect compared with the improvement from
cubic time to linear.)
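
To make that concrete, here is a small C simulation of the grid, under my
reading that the "cross-product" in question is the ordinary matrix product
C = A x B (the dataflow is the one sketched above either way).  One matrix
streams in from the left edge and the other from the top, each succeeding
row and column delayed one step; every processor does one multiply-add per
step and passes its operands along, and the whole computation finishes in
3N-2 steps.

#include <stdio.h>
#include <string.h>

#define N 3

int main(void)
{
    double A[N][N] = { {1,2,3}, {4,5,6}, {7,8,9} };
    double B[N][N] = { {9,8,7}, {6,5,4}, {3,2,1} };
    double a[N][N] = {{0}}, b[N][N] = {{0}};  /* operand registers      */
    double c[N][N] = {{0}};                   /* per-processor sums     */
    double a2[N][N], b2[N][N];                /* registers, next step   */

    for (int t = 0; t <= 3 * N - 3; t++) {    /* 3N-2 systolic steps    */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                /* operand from the left neighbor, or from the skewed
                   input stream at the left edge (row i delayed i steps) */
                double ain = (j > 0) ? a[i][j - 1]
                           : (t - i >= 0 && t - i < N) ? A[i][t - i] : 0.0;
                /* operand from above, or from the skewed input stream
                   at the top edge (column j delayed j steps) */
                double bin = (i > 0) ? b[i - 1][j]
                           : (t - j >= 0 && t - j < N) ? B[t - j][j] : 0.0;
                c[i][j] += ain * bin;   /* the processor's multiply-add */
                a2[i][j] = ain;         /* pass a to the right ...      */
                b2[i][j] = bin;         /* ... and b downward           */
            }
        memcpy(a, a2, sizeof a);        /* all registers step at once   */
        memcpy(b, b2, sizeof b);
    }

    for (int i = 0; i < N; i++, putchar('\n'))
        for (int j = 0; j < N; j++)
            printf("%6.0f", c[i][j]);   /* prints A x B                 */
    return 0;
}

The doubly-nested (i, j) loop is just the simulation of the N^2 processors
running in lock step; on the real grid each loop body is one processor.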

David Spector
Prime Computer, Inc.
ds@primerd.prime.com (until the layoff, soon)

(The above reflects my very scanty knowledge about parallel programming.
Please let me know if I've misunderstood something, and thanks...)

kale@m.cs.uiuc.edu (10/26/89)

According to NCUBE glossies in front of me: 

"512 billion bytes of memory capacity"
	So up to 64 Mbytes/processor..

about price: 
"... at less than $1000/MFLOP and $1000/Mbyte"

and elsewhere: ".. varying number of processors - from 16 to 8192 -
	that deliver from 8 to 27,000 scalar MFLOPS"

(So, a price can be deduced - see the sketch below.. But I believe it will be highly non-linear..)
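
For instance, taking the glossy's figures at face value (and assuming,
though the glossy doesn't say, that both dollar ratios apply to the top
configuration), a quick back-of-the-envelope sketch:

#include <stdio.h>

int main(void)
{
    double procs     = 8192.0;            /* largest configuration      */
    double mflops    = 27000.0;           /* ".. 27,000 scalar MFLOPS"  */
    double mbytes    = 512.0 * 1024.0;    /* 512 Gbytes, in Mbytes      */
    double usd_mflop = 1000.0;            /* "less than $1000/MFLOP"    */
    double usd_mbyte = 1000.0;            /* "less than $1000/Mbyte"    */

    printf("memory per processor:  %g Mbytes\n", mbytes / procs);
    printf("bound from MFLOPS:   < $%g million\n", mflops * usd_mflop / 1e6);
    printf("bound from memory:   < $%g million\n", mbytes * usd_mbyte / 1e6);
    return 0;
}

This prints 64 Mbytes/processor and, for a full machine, bounds of under
$27 million (by the MFLOP figure) and under $525 million (by the memory
figure); the MFLOP figure is clearly the tighter one.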

On another point: yes, it is more difficult (but not too difficult)
to program distributed-memory machines.  But the rewards are great,
and the challenge is stimulating.  I believe they will be useful for
a large fraction of applications.  When people claim that "they are
good for some applications, but not most", they may really mean that
a particular algorithm, or a particular way of doing things, is
difficult on message-passing machines.  But not the end application -
it can often be reformulated so that it works fine on them.