chris@mimsy.UUCP (Chris Torek) (07/01/88)
In article <91odrecXKL1010YEOek@amdahl.uts.amdahl.com>
chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
>You guys aren't thinking big enough.  How about multiple parallel
>pipelines to compute all the various instruction threads in parallel
>and just keep the results of the one that is actually taken?

Actually, this sort of idea is contained in some research and thesis
work that is (was?) going on here at Maryland.  How do you move old
code (`dusty decks') onto a parallel processor?  One way is to slice up
the program into independent pieces that can be combined again later.
In some cases, rather than running the loop

    for i in [0..n-1)
        A := compute
        if test(A) then B[i] := fn1(A) else B[i] := fn2(A) fi
    rof

you run two or three loops:

    1                      2                      3

    for i in [0..n-1)      for i in [0..n-1)      for i in [0..n-1)
        A := compute           A := compute           A := compute
        B[i] := fn1(A)         B[i] := fn2(A)         combine[i] := test(A)
    rof                    rof                    rof

and at the end, you pick the appropriate B[i]s based on the (secret)
combine[i]s (which may be assigned within either of processes 1 and 2,
or might be done on a third path, depending on how hard `test' is).

The separation (slicing) and recombining (splicing) is relatively
straightforward, although there are oodles of messy details (computing
dominators, watching side effects, what about overflow?, ...).  If you
have a massively parallel machine, computing both answers, then
throwing away the wrong one, may be cheapest.  You have to account for
communications costs, and decide where the communication goes (can
process 3 stop processes 1 and 2 short?  can 1 & 2 share A?  etc.).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu    Path: uunet!mimsy!chris
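For concreteness, the slice/splice above can be rendered as a minimal C
sketch.  This is only an illustration, not the Maryland work itself:
compute(), test(), fn1(), and fn2() are hypothetical stand-ins, and
compute is assumed to depend only on the index i.

    /* Minimal C sketch of the slice/splice transformation above.
     * compute(), test(), fn1(), fn2() are hypothetical stand-ins;
     * compute is assumed to depend only on the index i. */
    #include <stddef.h>

    #define N 1024

    extern double compute(size_t i);
    extern int    test(double a);
    extern double fn1(double a);
    extern double fn2(double a);

    void spliced(double B[N])
    {
        double b1[N], b2[N];
        int combine[N];
        size_t i;

        /* Slices 1-3: each loop is independent and could run on
         * its own processor. */
        for (i = 0; i < N; i++) b1[i] = fn1(compute(i));
        for (i = 0; i < N; i++) b2[i] = fn2(compute(i));
        for (i = 0; i < N; i++) combine[i] = test(compute(i));

        /* Splice: keep whichever result the test selected. */
        for (i = 0; i < N; i++)
            B[i] = combine[i] ? b1[i] : b2[i];
    }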
aglew@urbsdc.Urbana.Gould.COM (07/01/88)
>Actually, this sort of idea is contained in some research and thesis
>work that is (was?) going on here at Maryland.  How do you move old
>code (`dusty decks') onto a parallel processor?  One way is to slice up
>the program into independent pieces that can be combined again later.

This comes perilously close to something I'm trying.
Davidson at the UIll, now UMich, proposed a partitioned
access/execute architecture a while back, where you basically
have two processors, one for address and memory computations,
one for (mainly floating point?) calculation.
They run independently, with FIFOs between them for passage of
variables back and forth, condition codes, etc.  They basically
run two versions of exactly the same program, one with all the
FP taken out, and the other with all the memory references
replaced by "get the next value from the memory unit".

Sounds a bit like Wulf's WM, with the extension that, instead of
executing the same program and treating instructions for the other
unit as NOPs, you partition the program.

I'm trying to take this a bit farther - instead of partitioning by
_type_ of data, I want to see if there is enough parallelism within
the same data type, when there are many registers available.
I.e. have different functional units for all operations on registers
R0-R15 and R16-R31, with FIFOs for operations like R0 = R16 + R31.

Anyone else doing similar?
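A rough single-threaded C model of the two-unit FIFO arrangement, just
to pin down the idea.  All of the names here are invented; a real
machine would run the two halves concurrently with full/empty
interlocks on the queue, whereas for brevity these two sides run to
completion in turn, so n must not exceed the queue size.

    /* Rough model of the decoupled access/execute split: the access
     * side streams operands into a FIFO, the execute side consumes
     * them.  Invented names throughout; here the sides run to
     * completion in turn, so n must not exceed QSIZE. */
    #define QSIZE 1024

    typedef struct { double q[QSIZE]; int head, tail; } fifo;

    static void   fifo_put(fifo *f, double v) { f->q[f->tail++] = v; }
    static double fifo_get(fifo *f)           { return f->q[f->head++]; }

    /* Access unit: all the memory references, none of the FP. */
    static void access_side(const double *a, int n, fifo *to_exec)
    {
        for (int i = 0; i < n; i++)
            fifo_put(to_exec, a[i]);
    }

    /* Execute unit: memory references replaced by "get the next
     * value from the memory unit". */
    static void execute_side(double *b, int n, fifo *from_mem)
    {
        for (int i = 0; i < n; i++)
            b[i] = fifo_get(from_mem) * 2.0;   /* stand-in FP work */
    }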
smryan@garth.UUCP (Steven Ryan) (07/02/88)
> How do you move old
>code (`dusty decks') onto a parallel processor?  One way is to slice up
>the program into independent pieces that can be combined again later.
> . . .
>you run two or three loops:
>
>    1                      2                      3
>
>    for i in [0..n-1)      for i in [0..n-1)      for i in [0..n-1)
>        A := compute           A := compute           A := compute
>        B[i] := fn1(A)         B[i] := fn2(A)         combine[i] := test(A)
>    rof                    rof                    rof

On a 205, if compute, test, and fn1 and fn2 are vectorisable, this entire
construct can be hand-vectorised by something like

    compute'  -> promoted A'
    test(A')  -> bit vector
    fn1'(A')  -> B[i] if bit[i] set   (else discard and ignore faults)
    fn2'(A')  -> B[i] if bit[i] clear (else discard and ignore faults)

CFT is supposed to provide a vectorisable conditional expression.  FTN200
code to vectorise ifs may/may not be completed and released.
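In C-level terms, the four pseudo-ops above amount to whole-vector
loops followed by a store under the bit vector.  The sketch below is
one reading of that, with the usual hypothetical compute/test/fn1/fn2;
note that scalar C really does evaluate both arms everywhere, so the
"discard and ignore faults" part is not modelled.

    /* One reading of the controlled-store scheme above, as plain C
     * loops: each loop stands for one full-vector instruction, the
     * last for a store under the bit vector.  compute, test, fn1,
     * fn2 are hypothetical stand-ins. */
    #include <stddef.h>

    extern double compute(size_t i);
    extern int    test(double a);
    extern double fn1(double a), fn2(double a);

    void controlled_store(double *B, size_t n)
    {
        double A[n], t1[n], t2[n];   /* C99 VLAs, for the sketch */
        int bit[n];
        size_t i;

        for (i = 0; i < n; i++) A[i]   = compute(i);  /* compute' -> A'  */
        for (i = 0; i < n; i++) bit[i] = test(A[i]);  /* test -> bit vec */
        for (i = 0; i < n; i++) t1[i]  = fn1(A[i]);   /* fn1' on all A'  */
        for (i = 0; i < n; i++) t2[i]  = fn2(A[i]);   /* fn2' on all A'  */
        for (i = 0; i < n; i++)                       /* controlled store */
            B[i] = bit[i] ? t1[i] : t2[i];
    }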
cik@l.cc.purdue.edu (Herman Rubin) (07/02/88)
In article <853@garth.UUCP>, smryan@garth.UUCP (Steven Ryan) writes:
| How do you move old
|code (`dusty decks') onto a parallel processor?  One way is to slice up
|the program into independent pieces that can be combined again later.
| . . .
|you run two or three loops:
|
|    1                      2                      3
|
|    for i in [0..n-1)      for i in [0..n-1)      for i in [0..n-1)
|        A := compute           A := compute           A := compute
|        B[i] := fn1(A)         B[i] := fn2(A)         combine[i] := test(A)
|    rof                    rof                    rof
> 
> On a 205, if compute, test, and fn1 and fn2 are vectorisable, this entire
> construct can be hand-vectorised by something like
>     compute'  -> promoted A'
>     test(A')  -> bit vector
>     fn1'(A')  -> B[i] if bit[i] set   (else discard and ignore faults)
>     fn2'(A')  -> B[i] if bit[i] clear (else discard and ignore faults)
> CFT is supposed to provide a vectorisable conditional expression.  FTN200
> code to vectorise ifs may/may not be completed and released.

Even better is to first construct vectors A1, which contains only the
elements of A for which bit[i] is set, and A0, which has the elements for
which bit[i] is clear, compute fn1 only on A1 and fn2 only on A0, and use
bit to merge the two result vectors.  What was suggested could be
vectorized on the CRAY or even on SIMD machines; the improvement can be
vectorized on those machines only with difficulty.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet
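Spelled out in the same C style, with the gathers and scatters written
as explicit loops (on a 205 the compress and merge would each be a
single vector instruction); fn1 and fn2 are the same hypothetical
functions as before.

    /* The compress/compute/merge variant.  Gathers and scatters are
     * explicit loops here; on a 205 the compress and merge would be
     * single instructions under the bit vector. */
    #include <stddef.h>

    extern double fn1(double a), fn2(double a);

    void compress_merge(double *B, const double *A, const int *bit,
                        size_t n)
    {
        double a1[n], a0[n];          /* packed operand vectors */
        size_t n1 = 0, n0 = 0, i;

        for (i = 0; i < n; i++)       /* compress under bit */
            if (bit[i]) a1[n1++] = A[i];
            else        a0[n0++] = A[i];

        for (i = 0; i < n1; i++) a1[i] = fn1(a1[i]);  /* fn1 on A1 only */
        for (i = 0; i < n0; i++) a0[i] = fn2(a0[i]);  /* fn2 on A0 only */

        n1 = n0 = 0;
        for (i = 0; i < n; i++)       /* merge under bit */
            B[i] = bit[i] ? a1[n1++] : a0[n0++];
    }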
nusip@maccs.McMaster.CA (Mike Borza) (07/03/88)
In article <12258@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <91odrecXKL1010YEOek@amdahl.uts.amdahl.com>
>chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
>>You guys aren't thinking big enough.  How about multiple parallel
>>pipelines to compute all the various instruction threads in parallel
>>and just keep the results of the one that is actually taken?
>
>Actually, this sort of idea is contained in some research and thesis
>work that is (was?) going on here at Maryland.  How do you move old

I was at a talk given by A.K. Dewdney some time ago, about
non-deterministic programming languages.  The gist of these languages
was exactly what you're talking about.  At the outset, all possible
outcomes of a particular computation are accounted for, then a
selection criterion is applied to accept or reject particular outcomes.

Lots of interesting possibilities and problems here.  One of the most
interesting features is that whole classes of possible solutions can be
selected, rather than a single solution.  I think this leads to a very
natural way to program some problems which are very difficult in more
traditional approaches.  Global optimization of the route-and-place
problem in VLSI and PCB design is a good example.  We're starting to
see the emergence of architectures which address these sorts of ideas
in things like the Connection Machine.  Of course, exponential
explosion of the state space does appear to be something of a
problem :-)

mike borza  <nusip@maccs.UUCP or EE784@SSCvax.McMaster.CA>
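To make the generate-then-select style concrete, here is a toy sketch;
candidate() and acceptable() are invented for illustration, and the
enumeration is of course where the exponential explosion lives.

    /* Toy sketch of generate-then-select: enumerate every candidate
     * outcome, then apply the selection criterion, keeping the whole
     * class of acceptable outcomes rather than a single one.
     * candidate() and acceptable() are invented names. */
    #include <stddef.h>

    extern double candidate(size_t c);    /* c-th possible outcome */
    extern int    acceptable(double v);   /* selection criterion   */

    size_t select_class(double *kept, size_t max, size_t n_candidates)
    {
        size_t n_kept = 0;
        for (size_t c = 0; c < n_candidates && n_kept < max; c++) {
            double v = candidate(c);
            if (acceptable(v))
                kept[n_kept++] = v;
        }
        return n_kept;   /* an accepted class, not one answer */
    }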
pf@csc.ti.com (Paul Fuqua) (07/04/88)
    Date: Friday, July 1, 1988  8:59am (CDT)
    From: aglew at urbsdc.Urbana.Gould.COM

    This comes perilously close to something I'm trying.
    Davidson at the UIll, now UMich, proposed a partitioned
    access/execute architecture a while back, where you basically
    have two processors, one for address and memory computations,
    one for (mainly floating point?) calculation.
    They run independently, with FIFOs between them for passage of
    variables back and forth, condition codes, etc.  They basically
    run two versions of exactly the same program, one with all the
    FP taken out, and the other with all the memory references
    replaced by "get the next value from the memory unit".

    Anyone else doing similar?

I have a copy of a paper, "PIPE: A VLSI Decoupled Architecture," by
James R Goodman and five others, that appeared in a 1985 IEEE something
(sorry, it's not noted on my copy).  PIPE uses two processors connected
by queues:  "Each of PIPE's processors is capable of executing an
entire program by itself (in SP -- single processor -- mode), or the
two processors may work together on a program (in AE -- access/execute
-- mode)."  In detail, it's quite similar to the description above,
with the addition of more queues in the memory system (that don't
connect the processors), and a strange sort of variable-delay branch.
The work was done at the University of Wisconsin-Madison.

Is anyone outside the Great Lakes states working on this idea?

                              pf
Paul Fuqua
Texas Instruments Computer Science Center, Dallas, Texas
CSNet:  pf@csc.ti.com (ARPA too, sometimes)
UUCP:   {smu, texsun, cs.utexas.edu, im4u, rice}!ti-csl!pf
ho@svax.cs.cornell.edu (Alex Ho) (07/05/88)
i have a copy of a paper by allen, kennedy, porterfield, and warren
from rice university (published by the acm, i don't have the exact
source) describing a scheme for eliminating fortran goto statements
through the use of if conversion.  in essence, they assign a boolean
guard expression to statements which are executed conditionally.
for example (from the paper)...

original:
          do 100 i = 1, 100
             statement1
             if (x(i)) goto 200
             statement2
 100      continue
          statement3
 200      statement4

transformed:
          ex1 = .true.
          do 100 i = 1, 100
             if (ex1) statement1
             if (ex1) ex1 = .not. x(i)
             if (ex1) statement2
 100      continue
          if (.not. ex1) goto 200
          statement3
 200      statement4

the do loop can now be vectorized.  has anyone else had experience
with this procedure?

---
alex ho                     university of california, berkeley
ho@svax.cs.cornell.edu      a confused eecs major
                            (spending the summer at) cornell university
lfm@fpssun.fps.com (Larry Meadows) (07/06/88)
In article <12258@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> In article <91odrecXKL1010YEOek@amdahl.uts.amdahl.com>
> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
> >You guys aren't thinking big enough.  How about multiple parallel
> >pipelines to compute all the various instruction threads in parallel
> >and just keep the results of the one that is actually taken?
> 
> Actually, this sort of idea is contained in some research and thesis
> work that is (was?) going on here at Maryland.

Sounds a lot like multiflow to me.

But what do you do about exceptions????
-- 
Larry Meadows @ FPS        ...!tektronix!fpssun!lfm
                           ...!omepd!ihf1!fpssun!lfm
eugene@pioneer.arc.nasa.gov.arpa (Eugene N. Miya) (07/06/88)
In article <822@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>In article <853@garth.UUCP>, smryan@garth.UUCP (Steven Ryan) writes:
>> CFT is supposed to provide a vectorisable conditional expression.  FTN200
>> code to vectorise ifs may/may not be completed and released.
>
>Even better is to first construct vectors A1, which contains only the
>elements of A for which bit[i] is set, and A0, which has the elements
>for which bit[i] is clear, compute fn1 only on A1 and fn2 only on A0,
>and use bit to merge the two result vectors.

I think it was Steve who asked what this type of conditional code looks
like, and I also see that Herman responded.  I have collected some of
this type of information for a little net article on supercomputers and
what should be taught in super-classes which one might not get but
should.... (they are just some ramblings sort of like these).

Anyway, I spoke to one of my officemates for his sage opinions and he
gave me a couple of references by Paul Frederickson.  Well!  Paul just
took a job at RIACS, so I just walked down the hall, and he has this
paper in Parallel Computing in 1984 and a more recent version in 1987.
George only told me PC84 and the author, so you can look up the topic.
These codes are basically Monte Carlo bomb (nuke) codes from places
like LLNL, LANL, and Sandia which model photon, neutron, and energy
transport, those kinds of things.  Herman's description is closer
(dope vectors).

Anyway, if you need a reference: Parallel Computing 1984, a quarterly,
article by Frederickson, so you shouldn't have too much trouble finding
this, or consult your nearest National Laboratory.  "Mr. Fusion" 8-)
"Plutonium at your corner drug store..."

There seems to be a little bit of interest in the subject of
conditionals.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?!  HA!", "If my mail does not reach you, please accept my
  apology."
  {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
bradlee@cg-atla.UUCP (Rob Bradlee X5153) (07/06/88)
In article <28200173@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>This comes perilously close to something I'm trying.
>Davidson at the UIll, now UMich, proposed a partitioned
>access/execute architecture a while back, where you basically
>have two processors, one for address and memory computations,
>one for (mainly floating point?) calculation.
>They run independently, with FIFOs between them for passage of
>variables back and forth, condition codes, etc.  They basically
>run two versions of exactly the same program, one with all the
>FP taken out, and the other with all the memory references
>replaced by "get the next value from the memory unit".
>
>Anyone else doing similar?

This sounds quite similar to the architecture of array processors
built by CSP, Inc of Billerica, MA.  They use an IPU (integer
processor) and an APU (arithmetic processor) connected by FIFOs.  The
IPU would handle computations to determine the next piece of data to
get from memory and send it over to the APU for processing.  The APU
would put the result in another FIFO for return to the IPU, which
could then determine where to put it into memory.  The best part was
that the two processors operated asynchronously, waiting on their
respective input queues if one got ahead of the other.

Unfortunately, to do anything really useful a program had to be done
by hand in assembler.  However, they did have a FORTRAN compiler which
generated sometimes adequate code.
-- 
Rob Bradlee  w:(617)-658-5600 X5153  h:(617)-944-5595
Compugraphic Corp.       ...!{ima,ulowell,ism780c}!cg-atla!bradlee
200 Ballardvale St.
Wilmington, Mass. 01887      The Nordic Way: Ski till it hurts!
smryan@garth.UUCP (Steven Ryan) (07/06/88)
>          ex1 = .true.
>          do 100 i = 1, 100
>             if (ex1) statement1           <-----(1)
>             if (ex1) ex1 = .not. x(i)     <-----(2)
>             if (ex1) statement2
> 100      continue
>          if (.not. ex1) goto 200
>          statement3
> 200      statement4
>
>the do loop can now be vectorized.

Actually, I don't think so: the reference to ex1 in (1) is before a
definition in (2), so ex1 is recursive.

I worked on something like this, provided the control statements were
forward jumps only and did not exit the loop.  The forward jump
restriction was to ensure the flow was acyclic and to ensure guards
were computed before their use, without having to rearrange the loop.

Each transfer of control had an associated guard bit vector.  The
transfer predicate was vectorised and anded with the guard on the
transfer of control.  Labels orred together all guard vectors leading
to them.  Guards were associated with assignment statement sequences.
(Rather, a guard was established by control statements and labels and
was retained until the next.)

The control structures can form a loop within the do loop, but this
requires a bit of cleverness and is of limited utility.  Multiexit
loops either have recursive guards or must buffer assignments and
discard the results of the excess iterations.
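As a small model of the guard-vector scheme for a loop body with one
forward jump (this is one interpretation, with p, s1, s2, s3 invented
for illustration):

    /* Guard-vector model for
     *
     *     do i = 1, n
     *         S1
     *         if (p(x(i))) goto L    -- forward jump, stays in loop
     *         S2
     *     L:  S3
     *     end do
     *
     * Each guard is a bit vector; the branch predicate is anded in,
     * and the label ors together the guards on its incoming paths. */
    #include <stddef.h>

    extern int    p(double v);
    extern double s1(double v), s2(double v), s3(double v);

    void guarded(double *x, size_t n)
    {
        int g_top[n], g_jump[n], g_s2[n], g_L[n];
        size_t i;

        for (i = 0; i < n; i++) g_top[i] = 1;     /* loop entry guard */
        for (i = 0; i < n; i++) if (g_top[i]) x[i] = s1(x[i]);

        for (i = 0; i < n; i++) g_jump[i] = g_top[i] && p(x[i]);
        for (i = 0; i < n; i++) g_s2[i]   = g_top[i] && !g_jump[i];
        for (i = 0; i < n; i++) if (g_s2[i]) x[i] = s2(x[i]);

        /* label L ors the guards of its incoming paths */
        for (i = 0; i < n; i++) g_L[i] = g_s2[i] || g_jump[i];
        for (i = 0; i < n; i++) if (g_L[i]) x[i] = s3(x[i]);
    }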
david@titan.rice.edu (David Callahan) (07/07/88)
In article <880@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
>>          ex1 = .true.
>>          do 100 i = 1, 100
>>             if (ex1) statement1           <-----(1)
>>             if (ex1) ex1 = .not. x(i)     <-----(2)
>>             if (ex1) statement2
>> 100      continue
>>          if (.not. ex1) goto 200
>>          statement3
>> 200      statement4
>>
>>the do loop can now be vectorized.
>
>Actually, I don't think so: the reference to ex1 in (1) is before a
>definition in (2), so ex1 is recursive.

I don't know who made the original statement.  The loop can be
``partially vectorized'' if not all of the statements are part of the
computation of the exit branch (part of the strongly connected
component of the data dependence graph that contains the assignment to
"ex1").  The loop can be distributed so that the vectorizable
statements are isolated from the exit computation.  The exit
computation is then used to compute a new trip count for the other
loops.  The exit flags do not appear in the guards on the statements
during execution but are used to remember which exit was taken.  This
approach can also be applied uniformly to "while" loops.

David Callahan
Rice University
hoefling@uicsrd.csrd.uiuc.edu (07/08/88)
>/* Written 1:21 pm Jul 4, 1988 by ho@svax.cs.cornell.edu */
>
>original:
>          do 100 i = 1, 100
>             statement1
>             if (x(i)) goto 200
>             statement2
> 100      continue
>          statement3
> 200      statement4
>
>transformed:
>          ex1 = .true.
>          do 100 i = 1, 100
>             if (ex1) statement1
>             if (ex1) ex1 = .not. x(i)
>             if (ex1) statement2
> 100      continue
>          if (.not. ex1) goto 200
>          statement3
> 200      statement4
>
>the do loop can now be vectorized.

If "x" in the original is invariant inside the loop (i.e. no
dependences involving it within the loop), then it is trivial to
determine on which iteration the exit will occur [it's iteration i
where x(i) is first .TRUE.].  Knowing that, it is also trivial to
determine how many "statement1"s, "statement2"s, "statement3"s and
"statement4"s will be executed.  It is therefore also trivial to set up
vector statements which do the statements that many times.

The crucial question is whether there are any dependences from
statement2 to statement1.  Such a dependence would make the loop not
vectorizable (at some point, an instance of statement1 would have to
wait for an instance of statement2 to finish).

If we assume that there are no dependences on "x" and no dependences
from statement2 to statement1, then the problem comes down to simply
finding the index of the first occurrence of .TRUE. in "x".

C---Let's say that "first_TRUE_index" returns 0 if no x(i) is .TRUE.
      exit_index = first_TRUE_index(x)
      if (exit_index .EQ. 0) then
         limit1 = 100
         limit2 = 100
      else
         limit1 = exit_index
         limit2 = exit_index-1
      end if
      dovector i=1,limit1
         statement1
      end dovector
      dovector i=1,limit2
         statement2
      end dovector
      if (exit_index .EQ. 0) statement3
      statement4

Jay Hoeflinger
Center for Supercomputing Research and Development
U of Illinois
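The only primitive the code above assumes is first_TRUE_index.  One
plausible scalar realization (here in C; 1-based to match the Fortran,
returning 0 when nothing is true):

    /* One plausible realization of the first_TRUE_index primitive:
     * 1-based to match the Fortran, 0 when no element is .TRUE.
     * A vector machine would instead do a find-first-set on the
     * bit vector. */
    int first_true_index(const int *x, int n)
    {
        for (int i = 0; i < n; i++)
            if (x[i])
                return i + 1;
        return 0;
    }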
colwell@mfci.UUCP (Robert Colwell) (07/08/88)
In article <236@lfm.fpssun.fps.com> lfm@fpssun.fps.com (Larry Meadows) writes:
>In article <12258@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
>> In article <91odrecXKL1010YEOek@amdahl.uts.amdahl.com>
>> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
>> >You guys aren't thinking big enough.  How about multiple parallel
>> >pipelines to compute all the various instruction threads in parallel
>> >and just keep the results of the one that is actually taken?
>>
>> Actually, this sort of idea is contained in some research and thesis
>> work that is (was?) going on here at Maryland.
>
>Sounds a lot like multiflow to me.
>
>But what do you do about exceptions????

Please, that's Multiflow with a capital M (they made me say that).

About the multiple pipelines:  I view our VLIW as being something close
to the converse of what you suggest above.  We don't try to do anything
with multiple instruction threads in parallel.  The essence of trace
scheduling is that you find the right path through the code and remove
all the non-essential branch dependencies.  Whatever is left can be
compacted subject only to data dependencies.  Given that this approach
works (and we certainly think it does), it wouldn't make much sense to
tie up N hardware pipelines with calculations, only to dump N-1 of them
sometime in the future, when you could have gotten all N pipes doing
real work.

The only thing we do that's close to what you suggest is in performing
code motions that sometimes result in operations being kicked off whose
results are never used.  There's a whole raft of interesting problems
that come knocking when you do this, including taking page faults on
memory references that aren't going to be used, avoiding access
violations on references that are way out of bounds due to array
subscripts that have walked off the end, etc.

And then there's exceptions.  Suppose there's a loop that's doing
floating point divides along with a lot of other stuff.  You'll
probably want to kick off the divide early due to its long latency.
But on the last trip through you start the divide, then find out that
this trace is done, and you branch out.  Meanwhile, that divide was
unnecessary and its operands are highly likely to be bogus, so it is
liable to try 0/0 or some other offensive calculation.  You don't want
to trap on an exception here, since that calculation wasn't "real"
anyway.

It's a hard problem.  The TRACE has a "store-check" operation that
watches for NaNs being written to memory from the Floating register
file, on the theory that if you're actually writing the result of a
flop to memory, you must really care about it.  At that point you know
the computation went awry.  Problem is, of course, you may no longer
be in the vicinity of the code where the "event" occurred.  You are
still in the routine where the NaN showed up, and if that's enough
info you're done.  If it isn't, you will probably need to re-compile
in a non-optimizing mode and re-run; in that mode any computation you
do, you want, so the functional units can then report their exceptions
directly and immediately, and you can pinpoint your problem that way.

I suspect that nobody's got all the answers to this problem yet.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090
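The store-check principle can be caricatured in a few lines of C:
speculative flops run in a non-trapping mode and may quietly produce
NaNs, and the check fires only when a result is actually committed to
memory.  This is an illustration of the idea only, not the TRACE
hardware, and it assumes IEEE default (non-trapping) arithmetic.

    /* Toy model of the store-check principle.  Not the TRACE
     * hardware; assumes IEEE default (non-trapping) arithmetic. */
    #include <assert.h>
    #include <math.h>

    static double spec_div(double a, double b)
    {
        /* Speculative divide: 0/0 just yields a quiet NaN, no trap. */
        return a / b;
    }

    static void store_check(double *dst, double v)
    {
        /* A "real" result is being stored; a NaN here means some
         * computation upstream went awry. */
        assert(!isnan(v));
        *dst = v;
    }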
smryan@garth.UUCP (Steven Ryan) (07/09/88)
>Sounds a lot like multiflow to me.
>
>But what do you do about exceptions????

Cry.