johnw@astroatc.UUCP (John F. Wardale) (06/26/87)
In article <5793@think.UUCP> bradley@godot.think.com.UUCP (Bradley Kuszmaul) writes:
>In article <337@astroatc.UUCP> I (John F. Wardale) wrote:
>>...
>Actually, I interpret the 90%-10% rule as indicating that there might be
>a lot of parallelism in problems.  Since 90% of the time is spent
>executing a very small amount of code, it seems likely that there is a
>lot of data involved.  It is also likely that the dependencies between
>data are something smaller than "completely connected", and so it is
>likely that different parts of the data can be processed in parallel.

Comments:
   1) The 90-10 rule was not well explained.
   2) The "lots of data" point I agree with.
   3) The lack of dependencies does *NOT* follow!

re: 1:  If the machine parallelizes for you, you get your gains from a
small section of code.  If you parallelize by hand, you have to KNOW
which 10% to do.  Look at the theoretical (i.e. guaranteed-not-to-exceed)
performance of vector machines (like a Cray) and the actual performance
on real code.  (Try the Linpack benchmark if you believe in benchmarks.)
While it *IS* true that there *ARE* problems that get 50+% of peak, most
tend to get *MUCH* less.  Why?  Because scalar performance frequently
dominates.

re: 3:  When scheduling code for pipelined computers, given real
programs, several studies (sorry, no refs) have shown that the number
of registers needed for complete register allocation was in the
twenties (i.e. 16 is pretty good, 32 is overkill).
--> Pipelining is parallelism (though admittedly limited), so why
should we believe that multi-CPUs can extract more parallelism than
this?

Code blocks can be classed as:
   1: parallelizable and vectorizable   [example: zero an array]
   2: parallelizable                    [example: a=5; b=6]
   3: vectorizable
   4: neither

An example of #4 is an orbital diff-eq shooting problem.  You have
initial conditions and a set of differential equations.
   In a loop you:
      increment time a little bit, calculate [sequentially dependent]
      values for acceleration, velocity, and position
   end-loop
If anyone can vectorize or parallelize this, call me, I'll split the
profits with you.  :-)

>  The lesson is "Rules of thumb, such as Amdahl's Law, don't
>  have much to say about parallel computing".

Personally, I think such rules WILL apply to parallel computations,
and will also be used to crudely estimate how much parallelism can be
extracted from average codes.  Personally I think it will be a one- or
low-two-digit number.

As for >1000 processor machines: if each could talk to >100 other
processors, maybe the AI people could build a >= real-time human brain
simulator!  There are also enough "embarrassingly parallel" problems
that such machines will have a good market, though not as wide a market
as the real and small "Cray-like" machines.

BTW: It would be wonderful if I'm wrong, and someone finds a way to
effectively split *ANY* problem to run on >100 processors in parallel,
but I won't believe it until I see it.

                        John W

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Name:   John F. Wardale
UUCP:   ... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
arpa:   astroatc!johnw@rsch.wisc.edu
snail:  5800 Cottage Gr. Rd. ;;; Madison WI 53716
audio:  608-221-9001 eXt 110

To err is human, to really foul up world news requires the net!
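[To make the #4 example concrete, a minimal C sketch of the
time-stepping loop described above.  The force law, step size, and
names are invented for illustration; the point is the loop-carried
dependence: each iteration needs the velocity and position produced by
the iteration before it, so the iterations cannot run side by side.]

    #include <stdio.h>

    #define DT     0.001        /* time step (arbitrary) */
    #define NSTEPS 100000       /* number of steps (arbitrary) */

    double accel(x)             /* hypothetical force law: a = -k*x */
    double x;
    {
        return (-4.0 * x);
    }

    main()
    {
        double t = 0.0, x = 1.0, v = 0.0, a;
        int n;

        for (n = 0; n < NSTEPS; n++) {
            t += DT;            /* increment time a little bit      */
            a = accel(x);       /* acceleration depends on position */
            v += a * DT;        /* velocity depends on acceleration */
            x += v * DT;        /* position depends on velocity     */
        }
        printf("x(%g) = %g\n", t, x);
        return 0;
    }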
herndon@umn-cs.UUCP (Robert Herndon) (06/27/87)
In article <338@astroatc.UUCP>, johnw@astroatc.UUCP (John F. Wardale) writes:
> In article <5793@think.UUCP> bradley@godot.think.com.UUCP (Bradley Kuszmaul) writes:
> Code blocks can be classed as:
>    1: parallelizable and vectorizable   [example: zero an array]
>    2: parallelizable                    [example: a=5; b=6]
>    3: vectorizable
>    4: neither
> An example of #4 is an orbital diff-eq shooting problem.  You have
> initial conditions and a set of differential equations.
>    In a loop you:
>       increment time a little bit, calculate [sequentially dependent]
>       values for acceleration, velocity, and position
>    end-loop
> If anyone can vectorize or parallelize this, call me, I'll split the
> profits with you.  :-)

  Parallelize that code, no.  Parallelizing the same *problem*, which
presumably is "tell me how projectile X under conditions Y will fall",
is an entirely different question, and may be possible, depending on
the differential equation involved.  Consider:

    In a loop you:
       read the next character
          ( c = read )
       if the output table entry indexed by the character is non-zero,
       output the token numbered there.
          ( if output[current_state][c] != 0
               write(output[current_state][c]) )
       index the state-transition table by the current state and the
       input character, and make the resulting state the new current
       state.
          ( current_state = state_trans[current_state][c] )
    end-loop

This describes a serial lexical scanner of the garden variety, and it
is not trivially parallelizable.  However, a different, parallel
algorithm IS capable of solving the same problem ("Given an input
string and a Mealy machine, tell me what tokens are transduced when
the Mealy machine is given the input string.") in time logarithmic in
the length of the input string.  (See Hillis and Steele in CACM,
December 1986, for a solution.)  This is why the Connection Machine
people think that parallelism is likely to be a big win for many
problems -- and why they are probably right.

> There are also enough "embarrassingly parallel" problems that such
> machines will have a good market, though not as wide a market as the
> real and small "Cray-like" machines.
>
> BTW: It would be wonderful if I'm wrong, and someone finds a way to
> effectively split *ANY* problem to run on >100 processors in
> parallel, but I won't believe it until I see it.

  I wouldn't call the above problem "embarrassingly parallel", but
depending on the complexity of the scanner, several thousand
processors on a Connection Machine should easily break the 100x
performance barrier of any machine you care to name.

  If you mean "ALL" problems, then I'd have to agree with you.  There
certainly are problems that seem to require a serial solution
strategy.  BUT: you must differentiate between "a problem" and "code
to solve a problem".  They are NOT identical, and an algorithm change
makes it a whole new ball game.
-- 
Robert Herndon                          Dept. of Computer Science,
...!ihnp4!umn-cs!herndon                Univ. of Minnesota,
herndon@umn-cs.ARPA                     136 Lind Hall, 207 Church St. SE
herndon.umn-cs@csnet-relay.ARPA         Minneapolis, MN  55455
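[A sketch of the Hillis & Steele idea as described above -- the table
sizes, names, and the serial doubling loop here are invented for
illustration, not taken from the CACM article.  Each input character c
is viewed as a function from states to states (a column of the
transition table); function composition is associative, so the state
reached after every prefix of the input can be computed by a parallel
prefix ("scan") over those functions in log2(n) rounds.  On a parallel
machine, each pass of the inner loops below would run on its own
processor.]

    #define NSTATES 8
    #define N       16     /* input length (a power of 2, for brevity) */

    int state_trans[NSTATES][256]; /* ordinary serial transition table */
    int f[N][NSTATES];  /* f[i][s]: state reached from s by input[0..i] */

    void scan_states(input)
    char *input;
    {
        int i, s, d;
        int g[N][NSTATES];

        /* Round 0: f[i] is the one-character function for input[i]. */
        for (i = 0; i < N; i++)
            for (s = 0; s < NSTATES; s++)
                f[i][s] = state_trans[s][input[i] & 0xff];

        /* log2(N) doubling rounds: compose f[i-d] (applied first)
           with f[i].  Composition is associative, so this converges
           on the prefix functions. */
        for (d = 1; d < N; d *= 2) {
            for (i = 0; i < N; i++)
                for (s = 0; s < NSTATES; s++)
                    g[i][s] = (i < d) ? f[i][s] : f[i][f[i - d][s]];
            for (i = 0; i < N; i++)
                for (s = 0; s < NSTATES; s++)
                    f[i][s] = g[i][s];
        }
        /* Now f[i][start_state] is the machine's state after i+1
           characters, and tokens can be read off the output table. */
    }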
pase@ogcvax.UUCP (Douglas M. Pase) (07/01/87)
In article <umn-cs.1683> herndon@umn-cs.UUCP (Robert Herndon) writes:
> [...]
>   If you mean "ALL" problems, then I'd have to agree with you.  There
> certainly are problems that seem to require a serial solution
> strategy.
> [...]
>
>-- 
>Robert Herndon                          Dept. of Computer Science,
>...!ihnp4!umn-cs!herndon                Univ. of Minnesota,
>herndon@umn-cs.ARPA                     136 Lind Hall, 207 Church St. SE
>herndon.umn-cs@csnet-relay.ARPA         Minneapolis, MN  55455

One such problem (one which is serial) is the calculation of positive
integer powers.  The fastest method is as follows:

    pow(b, i)       /* raise b to the ith power */
    int b, i;
    {
        int j;

        for (j = 1; i; i >>= 1) {
            if (i & 1)      /* least significant bit is set */
                j *= b;
            b *= b;         /* square b */
        }
        return (j);
    }

If someone wants to quibble, the last multiplication (squaring 'b')
could be removed by restructuring the code.  However, I have heard
that this method has been proven to be faster than any parallel
algorithm which computes the same function.
--
Doug Pase  --  ...ucbvax!tektronix!ogcvax!pase  or  pase@Oregon-Grad (CSNet)
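[One possible restructuring that drops the wasted final squaring -- a
sketch of what the quibble amounts to, not the only way to do it: stop
the loop one bit early and fold in the top bit with a single multiply.]

    pow(b, i)       /* raise b to the ith power, no redundant squaring */
    int b, i;
    {
        int j;

        for (j = 1; i > 1; i >>= 1) {
            if (i & 1)      /* least significant bit is set */
                j *= b;
            b *= b;         /* square b */
        }
        return (i ? j * b : j);     /* top bit: multiply, don't square */
    }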
rentsch@unc.cs.unc.edu (Tim Rentsch) (07/03/87)
In article <1331@ogcvax.UUCP> pase@ogcvax.UUCP (Douglas M. Pase) writes:
> One such problem (one which is serial) is the calculation of positive
> integer powers.  The fastest method is as follows:
>
>     pow(b, i)       /* raise b to the ith power */
>     int b, i;
>     {
>         int j;
>
>         for (j = 1; i; i >>= 1) {
>             if (i & 1)      /* least significant bit is set */
>                 j *= b;
>             b *= b;         /* square b */
>         }
>         return (j);
>     }
>
> If someone wants to quibble, the last multiplication (squaring 'b')
> could be removed by restructuring the code.  However, I have heard
> that this method has been proven to be faster than any parallel
> algorithm which computes the same function.

I don't know about faster than any parallel algorithm, but the given
algorithm is not the fastest possible, at least in terms of the number
of multiplies.  Example: even granting the quibbles (start with j = x
rather than j = 1, and drop the final squaring), the given algorithm
takes 6 multiplies to compute x ** 15:

         b                      j
      -------                -------
         x                      x
     (1) x ** 2
                            (2) x ** 3
     (3) x ** 4
                            (4) x ** 7
     (5) x ** 8
                            (6) x ** 15

However, x ** 15 need take only 5 multiplies, as shown by:

         y                      z
      -------                -------
         x                      x
                            (1) x ** 2
                            (2) x ** 4
     (3) x ** 5
                            (4) y ** 2  ( = x ** 10 )
                            (5) y ** 3  ( = x ** 15 )

Larger powers can be done in arbitrarily fewer multiplies than the
number used by the binary-decomposition method given above.  x ** 191
can be computed with only 11 multiplies, but no fewer; x ** 382 can
also be computed with 11 multiplies.

I would love to say I thought up all this, but I didn't -- it's from
Knuth, volume 2.  (And, I know this discussion should not continue on
comp.arch -- followups please continue elsewhere.)

cheers,

Tim
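[The 5-multiply chain for x ** 15 written out as straight-line code --
an illustrative sketch; the variable names follow the table above.]

    pow15(x)        /* x ** 15 in 5 multiplies */
    int x;
    {
        int y, z;

        z = x * x;          /* (1) x ** 2  */
        z = z * z;          /* (2) x ** 4  */
        y = z * x;          /* (3) x ** 5  */
        z = y * y;          /* (4) x ** 10 */
        return (z * y);     /* (5) x ** 15 */
    }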
johnw@astroatc.UUCP (07/07/87)
-------
Comment on using up MIPS: a 100-MIP or 1-GIP or 10-GIP ... workstation
*IS* useful, even if its utilization is 1%, .1%, or .01%, as long as it
gives faster *RESPONSE* ...  Any time you do something, the faster it's
done, the better!  PS: I said useful, not necessarily cost-effective.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

In article <1683@umn-cs.UUCP> herndon@umn-cs.UUCP (Robert Herndon) writes:
>In article <338@astroatc.UUCP>, I wrote what prompted Robert.
>  Parallelize that code, no.  Parallelizing the same *problem*, which
>presumably is "tell me how projectile X under conditions Y will fall",
>is an entirely different question, and may be possible, depending on
>the differential equation involved.  Consider:
>   ...example of...  [deleted for brevity]
<paraphrasing...> Your garden-variety scanner is serial, but a fancy
new algorithm called "a Mealy machine" is much better.

>  If you mean "ALL" problems, then I'd have to agree with you.  There
>certainly are problems that seem to require a serial solution
>strategy.  BUT: you must differentiate between "a problem" and "code
>to solve a problem".  They are NOT identical, and an algorithm change
>makes it a whole new ball game.

New methods are great for new codes, but given the amount of code in
the world, and the number of new machines coming out, how much effort
do you REALLY think goes into each conversion???

I'd have a LOT more faith in your ideas if you had a parallel "guru"
program that could identify all your "poor" algorithms and fix them
for you.  (Maybe it could just say stuff like: xxx looks like a
scanner.  If you convert this, you could get 10:1 improvements; see
file /usr/lib/guru/algorithm-examples/mealy for details.)
(slightly :-)

Programs run on Cray-cost machines can/will/are converted and/or fixed
to run well, but most "users" are more likely to do simple "ports".
Restating this idea: if a machine is harder [admittedly an arbitrary
comparison] to program, then it will not be accepted as well (it will
have lower sales).
----------
I think you intentionally chose the "scanner" example so you could
claim that it's in the code distributed with the system, and will
therefore be in all the vendor-supplied (system) programs; but
scanning constitutes a very *SMALL* fraction of the total CPU time, so
these sorts of algorithmic improvements must be applied in a wholesale
manner to all the sub-parts of each program.  That will (99.9% likely)
be VERY expensive!!

                        John W

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Name:   John F. Wardale
UUCP:   ... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
arpa:   astroatc!johnw@rsch.wisc.edu
snail:  5800 Cottage Gr. Rd. ;;; Madison WI 53716
audio:  608-221-9001 eXt 110

To err is human, to really foul up world news requires the net!
herndon@umn-cs.UUCP (07/08/87)
In article <350@astroatc.UUCP>, johnw@astroatc.UUCP (John F. Wardale) writes:
> In article <1683@umn-cs.UUCP> herndon@umn-cs.UUCP (Robert Herndon) writes:
> >  Parallelize that code, no.  Parallelizing the same *problem*, which
> >presumably is "tell me how projectile X under conditions Y will fall",
> >is an entirely different question, and may be possible, depending on
> >the differential equation involved.  Consider:
>    ...example of...  [deleted for brevity]
> <paraphrasing...> Your garden-variety scanner is serial, but a fancy
> new algorithm called "a Mealy machine" is much better.

Point #1.

> >  If you mean "ALL" problems, then I'd have to agree with you.  There
> >certainly are problems that seem to require a serial solution
> >strategy.  BUT: you must differentiate between "a problem" and "code
> >to solve a problem".  They are NOT identical, and an algorithm change
> >makes it a whole new ball game.
>
> New methods are great for new codes, but given the amount of code in
> the world, and the number of new machines coming out, how much effort
> do you REALLY think goes into each conversion???

Point #2.

> ...
> I think you intentionally chose the "scanner" example so you could
> claim that it's in the code distributed with the system, and will
> therefore be in all the vendor-supplied (system) programs; but
> scanning constitutes a very *SMALL* fraction of the total CPU time, so
> these sorts of algorithmic improvements must be applied in a wholesale
> manner to all the sub-parts of each program.  That will (99.9% likely)
> be VERY expensive!!

Point #3.

  Point #1.  "The Mealy machine" is a standard creature from automata
theory.  It has nothing to do with parallelism.  I simply described an
algorithm outlined in Hillis's and Steele's article in the Dec. '86
CACM that can be used to simulate a Mealy machine in parallel.

  Point #2.  Conversion efforts vary with the amount of work required
to convert something and the benefits to be accrued from the
conversion.  Many heavy-duty numerical simulations (e.g., airflow
simulations, weather modeling, seismic analysis, ...) require large
amounts (e.g., weeks) of time on Cray-class computers, and are still
economical.  If the time could be cut down even one or two orders of
magnitude by recoding these for parallel hardware, using entirely
different algorithms, many firms would find it profitable to do so.

  Most programs do not require these kinds of efforts.  So be it.
Future programs, written by future programmers familiar with parallel
algorithms and hardware, are likely to use parallelism.  Just because
there is a cost to parallelism, and the cost is high, doesn't mean
that there are no algorithms that will benefit from it now.

  There will be a learning curve, like the learning curve for
high-level languages.  Twenty years ago, the cost of compilers was
very high, and many shops refused to use them, coding everything in
assembler.  The transition to HLLs has been less than painless, and
less than economical for the average shop in the beginning, but I
don't see too many people arguing that HLLs are a mistake.

  Point #3.  I chose the example because I knew it, it is of concern
to me (being a languages & OS person), and it is something that is
always presented alongside serial algorithms.  Scanners and such can
be automatically generated, and many programs use them.  So yes, they
drain only a small amount of CPU time in most systems -- any one
algorithm you care to choose does.  The power of the algorithm scheme
(no pun intended) outlined is that it can be applied to many tasks to
yield performance gains.
-- 
Robert Herndon                          Dept. of Computer Science,
...!ihnp4!umn-cs!herndon                Univ. of Minnesota,
herndon@umn-cs.ARPA                     136 Lind Hall, 207 Church St. SE
herndon.umn-cs@csnet-relay.ARPA         Minneapolis, MN  55455
johnw@astroatc.UUCP (John F. Wardale) (07/09/87)
In article <1716@umn-cs.UUCP> herndon@umn-cs.UUCP (Robert Herndon) writes:
>In article <350@astroatc.UUCP>, I wrote stuff I deleted
>> In article <1683@umn-cs.UUCP> herndon@umn-cs.UUCP wrote more deleted stuff
>  There will be a learning curve, like the learning curve for
>high-level languages.
>..., but I don't see too many people arguing that
>HLLs are a mistake.

Well put!  (Sorry if I sounded like I was at a high simmer.)

Parallelism is the growing outlook for the near future.  By the time
it fully arrives, something newer may be replacing it.  I just
read/hear too much from those who think it will be an instant cure for
all our current problems.  I fully agree that many of the results will
justify their cost and time!  (I just don't expect to see it all by
tomorrow.)

In terms of the Karp challenge, can the Mealy-machine approach deliver
a 100-fold speedup given normal-length tokens, or are we talking more
like 10 to 20?

                        John W

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Name:   John F. Wardale
UUCP:   ... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
arpa:   astroatc!johnw@rsch.wisc.edu
snail:  5800 Cottage Gr. Rd. ;;; Madison WI 53716
audio:  608-221-9001 eXt 110

To err is human, to really foul up world news requires the net!