[comp.parallel] Information on Massively parallel machines

SAROFF@BNLDAG.AGS.BNL.GOV (11/15/89)

I am interested in getting information, technical specs, software
available, personal experiences, benchmarks, that sort of thing about
the massively parallel machines currently available.  

If some of you have opinions or ideas about

		NCUBE
		Thinking Machines
		Kendall Square
		Evans and Sutherland
		MasPar
		Key Computers
		(or Others . . .)

I would appreciate hearing.  

	Thanks Muchly

	Stephen Saroff
		

gary@cs.utexas.edu (Gary Smith) (11/16/89)

In article <7041@hubcap.clemson.edu>, SAROFF@BNLDAG.AGS.BNL.GOV writes:
> I am interested in getting information, technical specs, software
> available, personal experiences, benchmarks, that sort of thing about
> the massively parallel machines currently available.  
> 
> If some of you have opinions or ideas about
> 
> 		NCUBE
> 		Thinking Machines
> 		Kendall Square
> 		Evans and Sutherland
> 		MasPar
> 		Key Computers
> 		(or Others . . .)
> 
> I would appreciate hearing.  
> 
> 	Thanks Muchly
> 
> 	Stephen Saroff
> 		


I must admit I remain very skeptical that massive MIMD parallelism will
pay off in the near future, if ever.  Modest MIMD parallelism, up to
perhaps 256 processors, might pay off, assuming we get a significant
number of applications that can achieve parallelism in excess of 99.9%,
and if synchronization overhead can be made arbitrarily small.  Of
course, these are pretty severe constraints!

The matrix below is the reason for my skepticism.  I included the short C
program I used to generate the F's.

Gary Smith
UT System CHPC
Balcones Research Center
10100 Burnet Rd
Austin, TX 78758-4497

Internet: g.smith@chpc.utexas.edu

==========================================================================

                  Maximum Speedup Assuming Zero Overhead
                  --------------------------------------


                    Speedup = F(f,p) = 1/[(1-f)+(f/p)]

                     f: fraction of code parallelized
                     p: number of processors employed
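
            For example, f = 0.99 on p = 1024 processors gives
            1/(0.01 + 0.99/1024), or about 91: a thousand processors
            buy less than a factor of 100.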
  

  f \ p  8   16   32   64  128  256  512 1024 2048 4096 8192  16K  32K  65K
---------------------------------------------------------------------------
10.00    1    1    1    1    1    1    1    1    1    1    1    1    1    1
15.00    1    1    1    1    1    1    1    1    1    1    1    1    1    1
20.00    1    1    1    1    1    1    1    1    1    1    1    1    1    1
25.00    1    1    1    1    1    1    1    1    1    1    1    1    1    1
30.00    1    1    1    1    1    1    1    1    1    1    1    1    1    1
35.00    1    1    2    2    2    2    2    2    2    2    2    2    2    2
40.00    2    2    2    2    2    2    2    2    2    2    2    2    2    2
45.00    2    2    2    2    2    2    2    2    2    2    2    2    2    2
50.00    2    2    2    2    2    2    2    2    2    2    2    2    2    2
55.00    2    2    2    2    2    2    2    2    2    2    2    2    2    2
60.00    2    2    2    2    2    2    2    2    2    2    2    2    2    2
65.00    2    3    3    3    3    3    3    3    3    3    3    3    3    3
70.00    3    3    3    3    3    3    3    3    3    3    3    3    3    3
75.00    3    3    4    4    4    4    4    4    4    4    4    4    4    4
80.00    3    4    4    5    5    5    5    5    5    5    5    5    5    5
85.00    4    5    6    6    6    7    7    7    7    7    7    7    7    7
90.00    5    6    8    9    9   10   10   10   10   10   10   10   10   10
90.50    5    7    8    9   10   10   10   10   10   11   11   11   11   11
91.00    5    7    8   10   10   11   11   11   11   11   11   11   11   11
91.50    5    7    9   10   11   11   12   12   12   12   12   12   12   12
92.00    5    7    9   11   11   12   12   12   12   12   12   12   12   12
92.50    5    8   10   11   12   13   13   13   13   13   13   13   13   13
93.00    5    8   10   12   13   14   14   14   14   14   14   14   14   14
93.50    5    8   11   13   14   15   15   15   15   15   15   15   15   15
94.00    6    8   11   13   15   16   16   16   17   17   17   17   17   17
94.50    6    9   12   14   16   17   18   18   18   18   18   18   18   18
95.00    6    9   13   15   17   19   19   20   20   20   20   20   20   20
95.50    6   10   13   17   19   21   21   22   22   22   22   22   22   22
96.00    6   10   14   18   21   23   24   24   25   25   25   25   25   25
96.50    6   10   15   20   24   26   27   28   28   28   28   29   29   29
97.00    7   11   17   22   27   30   31   32   33   33   33   33   33   33
97.50    7   12   18   25   31   35   37   39   39   40   40   40   40   40
98.00    7   12   20   28   36   42   46   48   49   49   50   50   50   50
98.50    7   13   22   33   44   53   59   63   65   66   66   66   67   67
99.00    7   14   24   39   56   72   84   91   95   98   99   99  100  100
99.10    8   14   25   41   60   78   91  100  105  108  110  110  111  111
99.20    8   14   26   43   63   84  101  111  118  121  123  124  125  125
99.30    8   14   26   44   68   92  112  125  134  138  140  142  142  143
99.40    8   15   27   46   73  101  126  143  154  160  163  165  166  166
99.50    8   15   28   49   78  113  144  167  182  191  195  198  199  199
99.60    8   15   28   51   85  127  168  201  223  236  243  246  248  249
99.70    8   15   29   54   93  145  202  252  287  308  320  327  330  332
99.80    8   16   30   57  102  170  253  336  402  446  471  485  493  496
99.90    8   16   31   60  114  204  339  506  672  804  891  943  970  985
99.91    8   16   31   61  115  208  351  533  721  874  979 1041 1075 1093
99.92    8   16   31   61  116  213  363  563  776  958 1085 1161 1204 1227
99.93    8   16   31   61  118  217  377  597  842 1059 1217 1314 1369 1398
99.94    8   16   31   62  119  222  392  635  919 1185 1385 1513 1586 1625
99.95    8   16   32   62  120  227  408  677 1012 1344 1608 1783 1885 1941
99.96    8   16   32   62  122  232  425  727 1126 1553 1916 2169 2323 2408
99.97    8   16   32   63  123  238  444  784 1269 1838 2369 2770 3026 3172
99.98    8   16   32   63  125  244  465  850 1453 2252 3105 3831 4338 4646
99.99    8   16   32   64  126  250  487  929 1700 2906 4503 6210 7662 8676

===========================================================================

#include <stdio.h>

/* f is the parallelized fraction in hundredths of a percent
   (1000 = 10.00%, 9999 = 99.99%); p is the processor count. */
static void speedup(double f)
{
	double p;

	printf("%5.2f", f * 0.01);
	/* Amdahl's law: 1/((1-f) + f/p), with f scaled by 10000 */
	for (p = 8.0; p < 131072.0; p += p)
		printf("%5.0f", 10000.0 / ((10000.0 - f) + (f / p)));
	printf("\n");
}

int main(void)
{
	double f;

	/* coarse steps up to 85%, then progressively finer steps
	   as f approaches 100% */
	for (f = 1000.0; f < 9000.0; f += 500.0)
		speedup(f);
	for (f = 9000.0; f < 9900.0; f += 50.0)
		speedup(f);
	for (f = 9900.0; f < 9990.0; f += 10.0)
		speedup(f);
	for (f = 9990.0; f < 10000.0; f += 1.0)
		speedup(f);
	return 0;
}

workman@decwrl.dec.com (Will Workman) (11/16/89)

I would be pleased to provide information on the MasPar MP-1
series of massively parallel SIMD machines being introduced
within the next two months.  Please let me know your mailing
address and phone number; our Federal Marketing office is in
Bethesda, Maryland, (301) 571-9480.  I will return from
Supercomputing '89 in Reno on Friday, and you can reach me
at the office on Monday.

Best Regards,  Will Workman, Dir of Fed Mkt, MasPar Computer

pardo@cs.washington.edu (David Keppel) (11/17/89)

> SAROFF@BNLDAG.AGS.BNL.GOV writes:
>> [I am interested in massively parallel machines currently available.]

ut-emx!gary@cs.utexas.edu (Gary Smith) writes:
>[I'm skeptical that massive MIMD parallelism will pay off.]
>[Table showing speedup(%code parallelized, number of processors).]

For a discussion of parallelism, see ``Type Architectures, Shared
Memory, and the Corollary of Modest Potential'' by Lawrence Snyder; it
appeared in ``Annual Review of Computer Science,'' Vol. 1, 1986, ISBN
0-8243-3201-6.

	;-D on  ( The parallel of serialization )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

fyodor@decwrl.dec.com (Chris Kuszmaul) (11/20/89)

In article <7070@hubcap.clemson.edu>, ut-emx!gary@cs.utexas.edu (Gary Smith) writes:
> In article <7041@hubcap.clemson.edu>, SAROFF@BNLDAG.AGS.BNL.GOV writes:
> > I am interested in getting information, technical specs, software
> > available, personal experiences, benchmarks, that sort of thing about
> > the massively parallel machines currently available.  
> 
> I must admit I remain very skeptical that massive MIMD parallelism will
> pay off in the near future, if ever.  Modest MIMD parallelism, up to
> perhaps 256 processors, might pay off, assuming we get a significant
> number of applications that can achieve parallelism in excess of 99.9%,
> and if synchronization overhead can be made arbitrarily small.  Of
> course, these are pretty severe constraints!


  There is no question that some applications will never be parallelizable
at all, and that some applications will only be trivially parallelizable,
but there is a large class of problems for which parallel computers
(including, ding ding ding, SIMD computers) will provide speedup
proportional to the number of processors assigned to the task.

  These applications include the fast Fourier transform and linear
systems solvers, as two very wide-ranging examples.  If one examines
the profile of a typical application which uses an FFT, one will
find that for an FFT of 32 data elements, the profile is:

  83% parallelizable core.
  17% serial control code.

  Well, gee, it seems that even if we had an infinite number of
processors, we would not get more than a factor of 6 improvement in
performance (by the speedup formula above, 1/(1 - 0.83) is about 6).

   The problem with this analysis, and with Mr. Smith's analysis,
is the assumption that the time taken to perform the serial
portion of the application grows at the same rate as the time
taken to perform the parallel core.

  In the case of the FFT, it is more appropriate (if not
PERFECTLY accurate) to say that the serial portion of the
code grows linearly with the problem size (O(N)), while
the parallel portion grows as O(N log N).  Thus, the following
table:

problem size:     % parallelizability:
64                85
128               87
256               88
512               90
1024              91
2048              91.6
4096              92.3
8192              92.8
16k               93.3
32k               93.7
64k               94.1
128k              94.4
256k              94.7
512k              95.0
1M                95.2
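
  As a rough sketch (the calibration is my own assumption, not taken
from the figures above: equal constant factors on the O(N) serial work
and the O(N log N) parallel core, so that N = 32 gives roughly the 1:5
split quoted earlier), a few lines of C generate percentages in this
neighborhood:

#include <stdio.h>
#include <math.h>		/* link with -lm */

/* Assumed model: serial work = N, parallel work = N * log2(N),
   with equal constant factors, so N = 32 splits roughly 17%/83%. */
static double parallel_percent(double n)
{
	double serial   = n;
	double parallel = n * (log(n) / log(2.0));

	return 100.0 * parallel / (parallel + serial);
}

int main(void)
{
	double n;

	/* problem sizes 64 through 1M, doubling each time */
	for (n = 64.0; n <= 1048576.0; n *= 2.0)
		printf("%8.0f  %5.1f\n", n, parallel_percent(n));
	return 0;
}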

   Now, actually, the above numbers are quite conservative for the
FFT, since the parallelizable portion of the problem involves
proportionately much more floating-point arithmetic than the serial
portion.  Moreover, the FFT itself is a conservative example,
because its complexity grows so slowly with the problem size.

  Consider now a linear system solver, for which the parallel
portion of the code grows at a rate of O(n^3), and the serial
portion grows at a rate of O(n^2).  Let us assume that the
constant factor on the serial portion is rather large, so that
for n = 10 the percentages work out as:

parallel:  50%
serial:    50%

the ensuing table:

problem size       parallel%
20                   66       
40                   80
60                   85
100                  90.9
200                  95.2
400                  97.5
1000                 99.0
2000                 99.5
5000                 99.8
10000                99.9
20000                99.95
40000                99.98
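
   Under these assumptions the arithmetic collapses nicely: the
parallel-to-serial ratio is n/10, so the parallel percentage is just
100*n/(n + 10).  For n = 200 that is 200/210, or about 95.2%, and for
n = 40000 it is 40000/40010, or about 99.98%, matching the table.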


   Now, one might object that there are not very many applications
which involve matrices of size 40000 x 40000, or which use
million-point FFTs.  I would counter by agreeing!  Of course there
aren't many applications of that size; they REQUIRE too much
parallelism for most of today's computers.

   As I said before, the above is not intended as a proof of the
general applicability of massively parallel computers; it is merely
a demonstration that massively parallel computers will revolutionize
(and are revolutionizing) computing, and how we think about
computing.  It also happens to be the case that a tremendous number
of real applications parallelize and scale in the same way as the
FFT and linear systems solvers.  Thus, I submit that massively
parallel computing is a good business to be in, but who am I to
talk?


    CLK

wsmith@dopey.MDBS.UUCP (Bill Smith) (11/21/89)

In article <7070@hubcap.clemson.edu>, ut-emx!gary@cs.utexas.edu (Gary Smith) writes:
> 
> I must admit I remain very skeptical that massive MIMD parallelism will
> pay off in the near future, if ever.  Modest MIMD parallelism, up to
> perhaps 256 processors, might pay off, assuming we get a significant
> number of applications that can achieve parallelism in excess of 99.9%,
> and if synchronization overhead can be made arbitrarily small.  Of
> course, these are pretty severe constraints!
> 
> The matrix below is the reason for my skepticism.  I included the short C
> program I used to generate the F's.
	[ matrix eliminated ]


There are several reasons that this conclusion is not valid.

1.   It is not necessary for speedup to be linear for a parallel processor
	to be worthwhile.   All that is wanted is faster performance.
	How much one is willing to pay for that performance dictates when
	one should start spending more money on bigger processors vs.
	when one should get more, smaller processors.

2.   Many problems parallelize naturally.  With such problems, it is
	easy to get a highly efficient parallel implementation.
	Examples include computer graphics and complicated numeric processing.

3.   Just because one does not get 100% utilization of a processor on
	one problem, it does not imply that computational power will be
	wasted.   Timesharing systems are a perfect example.   No one
	user uses 100% of the computer, but the sum total of the users
	does get nearly that utilization from the computer.   MIMD
	computers, in principle, could be timeshared as well.   Shared
	memory computers definitely are timeshared.

Bill Smith
pur-ee!mdbs!wsmith
(Not necessarily opinions of MDBS Inc.)

craig@Think.COM (Craig Stanfill) (11/22/89)

I wish I had time to respond to this in more depth, but let me take a
quick stab at this from an informal point-of-view, with a single
application: 3D aerodynamics.  Let us assume that one wishes to
simulate 3D airflow over an airframe, using a 1024 x 1024 x 1024 grid,
for a total of 2**30 cells.  It is probably reasonable to put, for
example, a 16x32x32 subgrid (2**14 points) on each processing element;
this achieves a good ratio of local to non-local computation.  One
then needs 2**16 processors.  Our experience is that, under these
circumstances, the serial front end is not the bottleneck. 
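
For scale, here is a back-of-the-envelope sketch of that decomposition
(using the face-cell count of each subgrid as a stand-in for non-local
work is my reading of "local to non-local computation", not something
spelled out above):

#include <stdio.h>

int main(void)
{
	/* Figures from the post: a 1024^3 grid, 16 x 32 x 32 subgrid per PE. */
	long gx = 1024, gy = 1024, gz = 1024;
	long sx = 16, sy = 32, sz = 32;

	long cells  = gx * gy * gz;		/* 2**30 */
	long per_pe = sx * sy * sz;		/* 2**14 */
	long pes    = cells / per_pe;		/* 2**16 */

	/* Assumed measure of non-local work: one value exchanged per face
	   cell of the subgrid per time step (my assumption). */
	long faces = 2 * (sx * sy + sy * sz + sx * sz);

	printf("cells      = %ld\n", cells);
	printf("cells/PE   = %ld\n", per_pe);
	printf("PEs needed = %ld\n", pes);
	printf("face values per PE per step = %ld (%.0f%% of its cells)\n",
	       faces, 100.0 * faces / per_pe);
	return 0;
}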

This is, by the way, a problem that people very much want to solve. 

				Craig Stanfill
				Advanced Information Systems
				Thinking Machines Corporation
				(617) 876-1111