[comp.arch] chewing up mips with graphics parallel computing

johnw@astroatc.UUCP (John F. Wardale) (06/26/87)

In article <5793@think.UUCP> bradley@godot.think.com.UUCP (Bradley Kuszmaul) writes:
>In article <337@astroatc.UUCP> I (John F. Wardale) wrote:
>>...
>Actually, I interpret the 90%-10% rule as indicating that there might be
>a lot of parallelism in problems.  Since 90% of the time is spent
>executing a very small amount of code, it seems likely that there is a
>lot of data involved.  It is also likely that the dependencies between
>data are something smaller than "completely connected", and so it is
>likely that different parts of the data can be processed in parallel.
Comments:
1) The 90-10 rule was not well explained
2) the lots of data I agree with
3) the lack of dependencies does *NOT* follow!

re: 1:  If you machine-parallelize, you get your gains from a
small section of code.  If you hand-parallelize, you have to KNOW
which 10% to do.
Look at the theoretical (i.e. guaranteed-not-to-exceed) performance
of vector machines (like a Cray) and the actual performance on
real code.  (Try the Linpack benchmark if you believe in
benchmarks.)  While it *IS* true that there *ARE* problems that
get 50+% of peak, most tend to get *MUCH* less.  Why?  Because
scalar performance frequently dominates.
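To put a rough number on "scalar performance frequently
dominates," here is a little back-of-the-envelope C program (my
own sketch, not from anyone's posting; the 10x vector-to-scalar
ratio is an assumed, illustrative figure) that applies the usual
Amdahl-style formula:

	#include <stdio.h>

	/* speedup = 1 / ((1 - f) + f / gain), where f is the fraction of
	   run time that vectorizes and gain is the assumed vector speedup */
	int main(void)
	{
	    double gain = 10.0;                 /* assumed vector/scalar ratio */
	    double fracs[] = { 0.50, 0.75, 0.90, 0.99 };
	    int i;

	    for (i = 0; i < 4; i++) {
	        double f = fracs[i];
	        double s = 1.0 / ((1.0 - f) + f / gain);
	        printf("%2.0f%% vectorizable -> %.2fx overall\n", 100.0 * f, s);
	    }
	    return 0;
	}

Even with 90% of the work vectorized, the overall gain is only
about 5.3x on a machine with a 10x vector peak; the leftover
scalar 10% is what holds the real-code numbers down.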

re: 3:  When scheduling code for pipelined computers, several
studies on real programs (sorry, no refs) have shown that the
number of registers needed for complete register allocation was in
the twenties.  (i.e. 16 is pretty good, 32 is overkill.)
--> pipelining is parallelism (though admittedly limited), so why
should we believe that multi-CPUs can extract more parallelism
than this?

Code blocks can be classified as:
1: parallelizable and vectorizable [example: zero an array]
2: parallelizable    [example: a=5; b=6]
3: vectorizable      
4: neither
An example of #4 is an orbital dif-eq shooting problem:
you have initial conditions and a set of differential equations.
In a loop you:
    increment time a little bit, then calculate [sequentially dependent]
    values for acceleration, velocity, and position
end-loop

If anyone can vectorize or parallelize this, call me, I'll split
the profits with you.  :-)
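To make the sequential dependence concrete, here is a minimal C
sketch of such a time-stepping loop (a made-up 1-D model with
linear drag, explicit Euler steps, and arbitrary constants, not
anything from a real orbital code).  Each iteration reads the
velocity and position produced by the previous iteration, which is
exactly the loop-carried dependence that blocks vectorizing or
parallelizing across time steps:

	#include <stdio.h>

	int main(void)
	{
	    double t = 0.0, dt = 0.01;       /* made-up step size               */
	    double x = 0.0, v = 100.0;       /* made-up initial conditions      */
	    double g = -9.8, k = 0.05;       /* gravity and a linear drag term  */

	    while (t < 10.0) {
	        double a = g - k * v;        /* acceleration needs current v    */
	        v += a * dt;                 /* velocity needs a from this step */
	        x += v * dt;                 /* position needs v from this step */
	        t += dt;
	    }
	    printf("after %.1f s: x = %f, v = %f\n", t, x, v);
	    return 0;
	}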

>  The lesson is "Rules of thumb, such as Amdahl's Law, don't 
> have much to say about parallel computing".
Personally, I think such rules WILL apply to parallel
computations, and will also be used to crudely estimate how much
parallelism can be extracted from average codes.  I think it will
be a one- or low-two-digit number.

As far as >1000 processor machines, if each could talk to >100
other processors, maybe the AI people could build a >= real-time
human brain simulator!

There are also enuf "embarrassingly parallel" problems that such
machines will have a good market, tho not as wide a market as
the real and small "Cray-like" machines.

BTW:  It would be wonderful if I'm wrong, and someone finds a way
to effectively split *ANY* problem to run on >100 processors in
parallel, but I won't believe it until I see it.


			John W

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Name:	John F. Wardale
UUCP:	... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
arpa:   astroatc!johnw@rsch.wisc.edu
snail:	5800 Cottage Gr. Rd. ;;; Madison WI 53716
audio:	608-221-9001 eXt 110

To err is human, to really foul up world news requires the net!

herndon@umn-cs.UUCP (Robert Herndon) (06/27/87)

In article <338@astroatc.UUCP>, johnw@astroatc.UUCP (John F. Wardale) writes:
> In article <5793@think.UUCP> bradley@godot.think.com.UUCP (Bradley Kuszmaul) writes:
> Code blocks can be classified as:
> 1: parallelizable and vectorizable [example: zero an array]
> 2: parallelizable    [example: a=5; b=6]
> 3: vectorizable      
> 4: neither
> An example of #4 is an orbital dif-eq shooting problem:
> you have initial conditions and a set of differential equations.
> In a loop you:
>     increment time a little bit, then calculate [sequentially dependent]
>     values for acceleration, velocity, and position
> end-loop
> 
> If anyone can vectorize or parallelize this, call me, I'll split
> the profits with you.  :-)
  Parallelize that, no.  Parallelizing the same problem, which presumably
is "tell me how projectile X under conditions Y will fall", is
an entirely different question, and it may be parallelizable,
depending on the differential equations involved.  In just the same way, while
In a loop you:
     read the next character
	(c = read)
     if the output table entry indexed by the character is non-zero, output
       the token numbered there.
	( if output[current_state][c] != 0 
		write(output[current_state][c])  )
     index the state-transition-table by the current-state and the input
     character, and make the resulting state the new current state.
	( current_state = state_trans[current_state][c] )
end-loop
  describes a garden-variety serial lexical scanner and is not
trivially parallelizable, a different, parallel algorithm IS
capable of solving the same problem
("Given an input string and a Mealy machine, tell me what
tokens are transduced when the Mealy machine is given the
input string.") in time logarithmic to the length of the
input string.  (See Hillis and Steele in CACM, December 1986
for a solution.)
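  For anyone who doesn't have that CACM issue handy, here is a rough
C sketch of the idea (my own toy code run serially on one CPU, not
Hillis and Steele's; the four-state transition table and the input
string are made up).  Each input character is turned into a
state-to-state function -- one column of the transition table --
and, because function composition is associative, all the prefix
compositions can be combined by a scan in O(log n) combining
rounds; the rounds below are simply run one after another here:

	#include <stdio.h>
	#include <string.h>

	#define NSTATES 4                     /* made-up machine size        */
	#define MAXLEN  64

	/* made-up transition table: delta[state][symbol] for symbols '0','1' */
	static int delta[NSTATES][2] = {
	    {1, 0}, {2, 1}, {3, 2}, {0, 3}
	};

	/* a "function" here is just a table mapping each state to a successor */
	struct sfun { int map[NSTATES]; };

	/* h = "first f, then g": h(s) = g(f(s)); composition is associative */
	static struct sfun compose(struct sfun f, struct sfun g)
	{
	    struct sfun h;
	    int s;
	    for (s = 0; s < NSTATES; s++)
	        h.map[s] = g.map[f.map[s]];
	    return h;
	}

	int main(void)
	{
	    const char *input = "011011";     /* made-up input string */
	    int n = (int) strlen(input);
	    struct sfun pre[MAXLEN], next[MAXLEN];
	    int i, s, step;

	    /* each "processor" builds the function for its own character */
	    for (i = 0; i < n; i++)
	        for (s = 0; s < NSTATES; s++)
	            pre[i].map[s] = delta[s][input[i] - '0'];

	    /* inclusive prefix scan over composition: about log2(n) rounds,
	       each round combining every element with the one `step` slots
	       to its left -- these rounds are what would run in parallel   */
	    for (step = 1; step < n; step *= 2) {
	        for (i = 0; i < n; i++)
	            next[i] = (i >= step) ? compose(pre[i - step], pre[i])
	                                  : pre[i];
	        for (i = 0; i < n; i++)
	            pre[i] = next[i];
	    }

	    /* pre[i] now maps the start state to the state after i+1 chars */
	    for (i = 0; i < n; i++)
	        printf("after %d chars: state %d\n", i + 1, pre[i].map[0]);
	    return 0;
	}

  The same scan can carry the output tokens along as well; only the
state trajectory is shown above, and that is the part that looks
hopelessly serial in the loop.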
  This is why the Connection Machine people think that parallelism
is likely to be a big win for many problems -- and why they are
probably right.
> 
> There are also enuf "embarrassingly parallel" problems that such
> machines will have a good market, tho not as wide a market as
> the real and small "Cray-like" machines.
> 
> BTW:  It would be wonderful if I'm wrong, and someone finds a way
> to effectively split *ANY* problem to run on >100 processors in
> parallel, but I won't believe it until I see it.
>
  I wouldn't call the above problem "embarrassingly parallel",
but depending on the complexity of the scanner, several thousand
processors on a Connection Machine should easily break the 100x
performance barrier of any machine you care to name.
  If you mean "ALL" problems, then I'd have to agree with you.
There certainly are problems that seem to require a serial
solution strategy.  BUT:  you must differentiate between
"a problem" and "code to solve a problem".  They are NOT identical,
and an algorithm change makes it a whole new ball game.

-- 
Robert Herndon				Dept. of Computer Science,
...!ihnp4!umn-cs!herndon		Univ. of Minnesota,
herndon@umn-cs.ARPA			136 Lind Hall, 207 Church St. SE
herndon.umn-cs@csnet-relay.ARPA		Minneapolis, MN  55455

pase@ogcvax.UUCP (Douglas M. Pase) (07/01/87)

In article <umn-cs.1683> herndon@umn-cs.UUCP (Robert Herndon) writes:
> [...]
>  If you mean "ALL" problems, then I'd have to agree with you.
>There certainly are problems that seem to require a serial
>solution strategy.
> [...]
>
>-- 
>Robert Herndon				Dept. of Computer Science,
>...!ihnp4!umn-cs!herndon		Univ. of Minnesota,
>herndon@umn-cs.ARPA			136 Lind Hall, 207 Church St. SE
>herndon.umn-cs@csnet-relay.ARPA		Minneapolis, MN  55455

One such problem (one which is serial) is the calculation of positive integer
powers.  The fastest method is as follows:

	pow(b,i)	/* raise b to the ith power */
	{
		int j;

		for (j=1; i; i>>=1) {
			if (i&1)	/* least significant bit is set */
				j *= b;
			b *= b;		/* square b */
		}
		return(j);
	}

If someone wants to quibble, the last multiplication (squaring 'b') could be
removed by restructuring the code.  However, I have heard that this method
has been proven to be faster than any parallel algorithm which computes the
same function.
--
Doug Pase   --   ...ucbvax!tektronix!ogcvax!pase  or  pase@Oregon-Grad (CSNet)

rentsch@unc.cs.unc.edu (Tim Rentsch) (07/03/87)

In article <1331@ogcvax.UUCP> pase@ogcvax.UUCP (Douglas M. Pase) writes:
> One such problem (one which is serial) is the calculation of positive 
> integer powers.  The fastest method is as follows:
> 
> 	pow(b,i)	/* raise b to the ith power */
> 	{
> 		int j;
> 
> 		for (j=1; i; i>>=1) {
> 			if (i&1)	/* least significant bit is set */
> 				j *= b;
> 			b *= b;		/* square b */
> 		}
> 		return(j);
> 	}
> 
> If someone wants to quibble, the last multiplication (squaring 'b') could be
> removed by restructuring the code.  However, I have heard that this method
> has been proven to be faster than any parallel algorithm which computes the
> same function.

I don't know about faster than any parallel algorithm, but the given
algorithm is not the fastest possible, at least in terms of number
of multiplies.  Example:

	using the given algorithm takes 6 multiplies to compute
	x ** 15 (applying the quibble above, i.e. not counting the
	trivial initial multiply into j or the final, unused
	squaring of b), as in

			b		j
			-------		-------
			x		x
	(1)		x ** 2		
	(2)				x ** 3
	(3)		x ** 4
	(4)				x ** 7
	(5)		x ** 8
	(6)				x ** 15


	However, x ** 15 need take only 5 multiplies, as shown by:

			y		z
			-------		-------
			x		x
	(1)		x ** 2
	(2)		x ** 4
	(3)		x ** 5
	(4)				y ** 2  ( = x ** 10 )
	(5)				y ** 3  ( = x ** 15 )


Larger powers can be done in arbitrarily fewer multiplies than the
number used by the binary decomposition (given) method.  x ** 191 can
be computed with only 11 multiplies, but no fewer;  x ** 382 can also
be computed with 11 multiplies.  

I would love to say I thought up all this, but I didn't -- it's
from Knuth volume 2.
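A tiny throwaway C program (mine, neither Knuth's nor Doug's; the
multiply counter and the counting convention -- the first factor is
loaded for free and the final squaring is skipped, matching the
tables above -- are my own choices) makes the 6-versus-5 difference
for x ** 15 checkable:

	#include <stdio.h>

	static long mults;                    /* number of multiplies performed */

	static long mul(long a, long b) { mults++; return a * b; }

	/* binary (square-and-multiply) method from Doug's posting, i >= 1 */
	static long pow_binary(long b, int i)
	{
	    long j = 0;                       /* 0 flags "no factor loaded yet" */
	    while (i) {
	        if (i & 1)
	            j = j ? mul(j, b) : b;    /* first factor is free           */
	        i >>= 1;
	        if (i)
	            b = mul(b, b);            /* skip the final, unused squaring */
	    }
	    return j;
	}

	/* addition-chain route for x ** 15: 1, 2, 4, 5, 10, 15 */
	static long pow15_chain(long x)
	{
	    long x2  = mul(x, x);             /* x^2  */
	    long x4  = mul(x2, x2);           /* x^4  */
	    long x5  = mul(x4, x);            /* x^5  */
	    long x10 = mul(x5, x5);           /* x^10 */
	    return mul(x10, x5);              /* x^15 */
	}

	int main(void)
	{
	    long r;

	    mults = 0;
	    r = pow_binary(2, 15);
	    printf("binary: 2^15 = %ld in %ld multiplies\n", r, mults);

	    mults = 0;
	    r = pow15_chain(2);
	    printf("chain:  2^15 = %ld in %ld multiplies\n", r, mults);
	    return 0;
	}

Running it prints 32768 both ways, with 6 multiplies for the binary
method and 5 for the chain.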

(And, I know this discussion should not continue on comp.arch --
followups please continue elsewhere.)

cheers,

Tim

johnw@astroatc.UUCP (07/07/87)

-------
Comment on using up MIPS:
a 100-MIPS or 1-GIPS or 10-GIPS ... workstation *IS* useful, even if
its utilization is 1%, .1%, or .01%, as long as it gives faster
*RESPONSE*  ... Any time you do something, the faster it's done,
the better!  

PS:  I said useful, not necessarily cost-effective.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

In article <1683@umn-cs.UUCP> herndon@umn-cs.UUCP (Robert Herndon) writes:
>In article <338@astroatc.UUCP>, I wrote what prompted Robert.

>  Parallelize that, no.  Parallelize the same problem, which presumably
>is "tell me how projectile X under conditions Y will fall", is
>entirely a different question, and may be parallelizable,
>depending on the differential equation involved.  Just as
		...example of... [deleted for brevity]
<paraphrasing...> Your garden variety scanner is serial, but
a fancy-new algorithm called "a Mealy Machine" is much better.

>  If you mean "ALL" problems, then I'd have to agree with you.
>There certainly are problems that seem to require a serial
>solution strategy.  BUT:  you must differentiate between
>"a problem" and "code to solve a problem".  They are NOT identical,
>and an algorithm change makes it a whole new ball game.

New methods are great for new codes, but given the amount of code
in the world, and the number of new machines coming out, how much 
effort do you REALLY think goes into each conversion ???  I'd have
a LOT more faith in your ideas if you had a parallel "guru"
program that could identify all your "poor" algorithms and fix them
for you.  (Maybe it could just say stuff like:  xxx looks like a
scanner.  If you convert this, you could get 10:1 improvements,
see file /usr/lib/guru/algorithm-examples/mealy for details.)
(slightly :-)

Programs run on Cray-cost machines can be, will be, and are being
converted and/or fixed to run well, but most "users" are more
likely to do simple "ports".  Restating this idea: if the machine
is harder [an arbitrary comparison] to program, then it will not be
accepted as well (it will have lower sales).

----------

I think you intentionally chose the "scanner" example so you could
claim that it's in the code distributed with the system, and will
therefore be in all the vendor-supplied (system) programs, but
scanning constitutes a very *SMALL* fraction of the total CPU time,
so these sorts of algorithmic improvements must be applied in a
wholesale manner to all the sub-parts of each program.  That will
(99.9% likely) be VERY expensive!!


			John W

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Name:	John F. Wardale
UUCP:	... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
arpa:   astroatc!johnw@rsch.wisc.edu
snail:	5800 Cottage Gr. Rd. ;;; Madison WI 53716
audio:	608-221-9001 eXt 110

To err is human, to really foul up world news requires the net!

herndon@umn-cs.UUCP (07/08/87)

In article <350@astroatc.UUCP>, johnw@astroatc.UUCP (John F. Wardale) writes:
> In article <1683@umn-cs.UUCP> herndon@umn-cs.UUCP (Robert Herndon) writes:
> >  Parallelize that, no.  Parallelize the same problem, which presumably
> >is "tell me how projectile X under conditions Y will fall", is
> >entirely a different question, and may be parallelizable,
> >depending on the differential equation involved.  Just as
> 		...example of... [deleted for brevity]
> <paraphrasing...> Your garden variety scanner is serial, but
> a fancy-new algorithm called "a Mealy Machine" is much better.
Point #1.

> >  If you mean "ALL" problems, then I'd have to agree with you.
> >There certainly are problems that seem to require a serial
> >solution strategy.  BUT:  you must differentiate between
> >"a problem" and "code to solve a problem".  They are NOT identical,
> >and an algorithm change makes it a whole new ball game.
> 
> New methods are great for new codes, but given the amount of code
> in the world, and the number of new machines coming out, how much 
> effort do you REALLY think goes into each conversion ???
Point #2.

> ...
> I think you intentionally chose the "scanner" example so you could
> claim that it's in the code distributed with the system, and will
> therefore be in all the vendor-supplied (system) programs, but
> scanning constitutes a very *SMALL* fraction of the total CPU time,
> so these sorts of algorithmic improvements must be applied in a
> wholesale manner to all the sub-parts of each program.  That will
> (99.9% likely) be VERY expensive!!
Point #3.
  Point #1.  "The Mealy Machine" is a standard creature from
automata theory.  It has nothing to do with parallelism. I simply
described an algorithm outlined in Hillis and Steele's article
in the Dec. '86 CACM that can be used to simulate a Mealy machine
in parallel.
  Point #2.  Conversion efforts vary with the amount of work
required to convert something, and the benefits to be accrued
from the conversion.  Many heavy-duty numerical simulations (e.g.,
airflow simulations, weather modeling, seismic analysis, ...)
require large amounts (e.g., weeks) of time on Cray-class computers,
and are still economical.  If the time could be cut down even
one or two orders of magnitude by recoding these for parallel
hardware, using entirely different algorithms, many firms would
find it profitable to do so.  Most programs do not require these
kinds of efforts.  So be it.  Future programmers, familiar with
parallel algorithms and hardware, are likely to write programs
that use parallelism.  Just because there is a cost to parallelism,
and the cost is high, doesn't mean that there are no algorithms
that will benefit from it now.
  There will be a learning curve, like the learning curve for
high level languages.  Twenty years ago, the cost of compilers
was very high, and many shops refused to use them, coding
everything in assembler.  The transition to HLLs has been less
than painless, and less than economical for the average shop
in the beginning, but I don't see too many people arguing that
HLLs are a mistake.
  Point #3.  I chose the example because I knew it, because it is of
concern to me (being a languages & OS person), and because it is
something that is always presented alongside serial algorithms.
Scanners and such can be automatically generated, and many programs
use them.  Yes, they do drain only a small amount of CPU time in
most systems -- but so does any one algorithm you care to choose.
The power of the
algorithm scheme (no pun intended) outlined is that it can be applied
to many tasks to yield performance gains.
-- 
Robert Herndon				Dept. of Computer Science,
...!ihnp4!umn-cs!herndon		Univ. of Minnesota,
herndon@umn-cs.ARPA			136 Lind Hall, 207 Church St. SE
herndon.umn-cs@csnet-relay.ARPA		Minneapolis, MN  55455

johnw@astroatc.UUCP (John F. Wardale) (07/09/87)

In article <1716@umn-cs.UUCP> herndon@umn-cs.UUCP (Robert Herndon) writes:
>In article <350@astroatc.UUCP>, I wrote stuff I deleted
>> In article <1683@umn-cs.UUCP> herndon@umn-cs.UUCP wrote more deleted stuff
>  There will be a learning curve, like the learning curve for
>high level languages.  
>..., but I don't see too many people arguing that
>HLLs are a mistake.

Well put!  (Sorry if I sounded like I was at a high simmer.)
Parallelism is the growing outlook for the near future.  By the
time it fully arrives, something newer may be replacing it.

I just read/hear too much from those who think it will be an
instant cure for all our current problems.

I fully agree that many of the results will justify their cost
and time!  (I just don't expect to see it all by tomorrow.)

In terms of the Karp challenge, can the Mealy-machine approach give
a 100-fold increase for normal-length tokens, or are we talking
more like 10 to 20 ??


			John W

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Name:	John F. Wardale
UUCP:	... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
arpa:   astroatc!johnw@rsch.wisc.edu
snail:	5800 Cottage Gr. Rd. ;;; Madison WI 53716
audio:	608-221-9001 eXt 110

To err is human, to really foul up world news requires the net!