[comp.arch] getting rid of branches

chris@mimsy.UUCP (Chris Torek) (07/01/88)

In article <91odrecXKL1010YEOek@amdahl.uts.amdahl.com>
chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
>You guys aren't thinking big enough.  How about multiple parallel
>pipelines to compute all the various instruction threads in parallel
>and just keep the results of the one that is actually taken?

Actually, this sort of idea is contained in some research and thesis
work that is (was?) going on here at Maryland.  How do you move old
code (`dusty decks') onto a parallel processor?  One way is to slice up
the program into independent pieces that can be combined again later.
In some cases, rather than running the loop

	for i in [0..n-1)
		A := compute
		if test(A)
		then B[i] := fn1(A)
		else B[i] := fn2(A)
		fi
	rof

you run two or three loops:

		1			2			3

	for i in [0..n-1)	for i in [0..n-1)	for i in [0..n-1)
		A := compute		A := compute		A := compute
		B[i] := fn1(A)		B[i] := fn2(A)		combine[i] :=
									test(A)
	rof			rof			rof

and at the end, you pick the appropriate B[i]s based on the (secret)
combine[i]s (which may be assigned within either of processes 1 and 2,
or might be done on a third path, depending on how hard `test' is).
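
In C-ish terms the slice and splice might look like this (compute_elem,
fn1, fn2, and test are just stand-ins for the pieces of the loop body;
the sketch assumes they are side-effect free and cannot fault on
elements whose results get thrown away):

	#include <stddef.h>

	extern double compute_elem(size_t i);       /* the `compute' step */
	extern double fn1(double a), fn2(double a);
	extern int    test(double a);

	/* Three independent slices (each could run on its own processor),
	 * followed by the splice that keeps the wanted result per element. */
	void slice_and_splice(double *B, double *B1, double *B2,
	                      int *combine, size_t n)
	{
	    for (size_t i = 0; i < n; i++)     /* slice 1: as if the test were true  */
	        B1[i] = fn1(compute_elem(i));

	    for (size_t i = 0; i < n; i++)     /* slice 2: as if the test were false */
	        B2[i] = fn2(compute_elem(i));

	    for (size_t i = 0; i < n; i++)     /* slice 3: the test by itself        */
	        combine[i] = test(compute_elem(i));

	    for (size_t i = 0; i < n; i++)     /* splice: keep one, discard the other */
	        B[i] = combine[i] ? B1[i] : B2[i];
	}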

The separation (slicing) and recombining (splicing) are relatively
straightforward, although there are oodles of messy details (computing
dominators, watching side effects, what about overflow?, ...).  If you
have a massively parallel machine, computing both answers, then
throwing away the wrong one, may be cheapest.  You have to account
for communications costs, and decide where the communication goes (can
process 3 stop processes 1 and 2 short? can 1 & 2 share A? etc.).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

aglew@urbsdc.Urbana.Gould.COM (07/01/88)

>Actually, this sort of idea is contained in some research and thesis
>work that is (was?) going on here at Maryland.  How do you move old
>code (`dusty decks') onto a parallel processor?  One way is to slice up
>the program into independent pieces that can be combined again later.

This comes perilously close to something I'm trying.
Davidson at the UIll, now UMich, proposed a partitioned
access/execute architecture a while back, where you basically
have two processors, one for address and memory computations,
one for (mainly floating point?) calculation.
They run independently, with FIFOs between them for passage of
variables back and forth, condition codes, etc. They basically
run two versions of exactly the same program, one with all the
FP taken out, and the other with all the memory references
replaced by "get the next value from the memory unit".
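
A single-threaded sketch of the split, with the two programs interleaved
one element at a time (in the real thing they run asynchronously and the
FIFOs take up the slack); the fifo type and the particular FP operation
are made up for illustration:

	#include <stddef.h>

	/* Toy FIFO standing in for the hardware queues between the units. */
	#define QSIZE 1024
	typedef struct { double buf[QSIZE]; size_t head, tail; } fifo;
	static void   put(fifo *q, double v) { q->buf[q->tail++ % QSIZE] = v; }
	static double get(fifo *q)           { return q->buf[q->head++ % QSIZE]; }

	void run_decoupled(const double *x, double *y, size_t n)
	{
	    fifo to_exec = {0}, from_exec = {0};

	    for (size_t i = 0; i < n; i++) {
	        put(&to_exec, x[i]);            /* access unit: load, send operand */
	        double a = get(&to_exec);       /* execute unit: "get next value"  */
	        put(&from_exec, a * a + 1.0);   /* execute unit: the FP work       */
	        y[i] = get(&from_exec);         /* access unit: store the result   */
	    }
	}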

Sounds a bit like Wulf's WM, with the extension that,
instead of executing the same program and treating instructions
for the other unit as NOPs, you partition the program.

I'm trying to take this a bit further - instead of partitioning
by _type_ of data, I want to see whether there is enough parallelism
within the same data type, when there are many registers available.
I.e., have different functional units for all operations on
registers R0-R15 and R16-R31, with FIFOs for operations like
R0 = R16 + R31.

Anyone else doing similar?

smryan@garth.UUCP (Steven Ryan) (07/02/88)

>                                                How do you move old
>code (`dusty decks') onto a parallel processor?  One way is to slice up
>the program into independent pieces that can be combined again later.
>          .   .   .
>you run two or three loops:
>
>		1			2			3
>
>	for i in [0..n-1)	for i in [0..n-1)	for i in [0..n-1)
>		A := compute		A := compute		A := compute
>		B[i] := fn1(A)		B[i] := fn2(a)		combine[i] :=
>									test(A)
>	rof			rof			rof

On a 205, if compute, test, fn1, and fn2 are vectorisable, this entire
construct can be hand-vectorised by something like
               compute' -> promoted A'
               test(A') -> bit vector
               fn1'(A') -> B[i] if bit[i] set (else discard and ignore faults)
               fn2'(A') -> B[i] if bit[i] clear (else discard and ignore faults)
CFT is supposed to provide a vectorisable conditional expression. FTN200 code
to vectorise ifs may or may not be completed and released.
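
Spelled out scalar-fashion (fn1, fn2, and test are stand-ins, and the
sketch assumes they cannot fault, or that faults can be suppressed, on
the elements whose results are discarded):

	#include <stddef.h>

	extern double fn1(double), fn2(double);
	extern int    test(double);

	void vectorised_if(double *B, const double *A, int *bit, size_t n)
	{
	    for (size_t i = 0; i < n; i++)
	        bit[i] = test(A[i]);           /* test(A') -> bit vector            */

	    for (size_t i = 0; i < n; i++) {
	        double r1 = fn1(A[i]);         /* fn1'(A') over the whole vector    */
	        double r2 = fn2(A[i]);         /* fn2'(A') over the whole vector    */
	        B[i] = bit[i] ? r1 : r2;       /* controlled store: keep one result */
	    }
	}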

cik@l.cc.purdue.edu (Herman Rubin) (07/02/88)

In article <853@garth.UUCP>, smryan@garth.UUCP (Steven Ryan) writes:
  |                                                How do you move old
  |code (`dusty decks') onto a parallel processor?  One way is to slice up
  |the program into independent pieces that can be combined again later.
  |          .   .   .
  |you run two or three loops:
  |
  |		1			2			3
  |
  |	for i in [0..n-1)	for i in [0..n-1)	for i in [0..n-1)
  |		A := compute		A := compute		A := compute
  |		B[i] := fn1(A)		B[i] := fn2(a)		combine[i] :=
  |									test(A)
  |	rof			rof			rof
> 
> On a 205, if compute, test, and fn1 and fn2 are vectorisable, this entire
> construct can be hand-vectorised by something like
>                compute' -> promoted A'
>                test(A') -> bit vector
>                fn1'(A') -> B[i] if bit[i] set (else discard and ignore faults)
>                fn2'(A') -> B[i] if bit[i] clear (else discard and ignore faults)
> CFT is supposed to provide a vectorisable conditional expression. FTN200 code
> to vectorise ifs may/may not be completed and released.

Even better is to first construct vectors A1, which contains only the elements
of A for which bit[i] is set, and A0, which has the elements for which bit[i] is
clear; compute fn1 only on A1 and fn2 only on A0, and use bit to merge the two
result vectors.  What was suggested could be vectorized on the CRAY or even on
SIMD machines; the improvement can be so vectorized on those machines only with
difficulty.
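
In scalar C the compress/compute/merge is roughly this (fn1 and fn2 are
hypothetical stand-ins for the two functions; A1 and A0 are caller-supplied
scratch vectors of length n):

	#include <stddef.h>

	extern double fn1(double), fn2(double);

	void compress_compute_merge(double *B, const double *A, const int *bit,
	                            double *A1, double *A0, size_t n)
	{
	    size_t n1 = 0, n0 = 0, i;

	    for (i = 0; i < n; i++) {            /* compress under the bit vector   */
	        if (bit[i]) A1[n1++] = A[i];
	        else        A0[n0++] = A[i];
	    }

	    for (i = 0; i < n1; i++) A1[i] = fn1(A1[i]);   /* dense, no wasted work */
	    for (i = 0; i < n0; i++) A0[i] = fn2(A0[i]);

	    n1 = n0 = 0;
	    for (i = 0; i < n; i++)              /* merge back under the bit vector */
	        B[i] = bit[i] ? A1[n1++] : A0[n0++];
	}

No element of A1 or A0 is wasted, so no work (and no fault) is spent on
results that get discarded; the price is the compress and merge traffic.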
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

nusip@maccs.McMaster.CA (Mike Borza) (07/03/88)

In article <12258@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <91odrecXKL1010YEOek@amdahl.uts.amdahl.com>
>chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
>>You guys aren't thinking big enough.  How about multiple parallel
>>pipelines to compute all the various instruction threads in parallel
>>and just keep the results of the one that is actually taken?
>
>Actually, this sort of idea is contained in some research and thesis
>work that is (was?) going on here at Maryland.  How do you move old

I was at a talk given by A.K. Dewdney some time ago, about
non-deterministic programming languages.  The gist of these
languages was exactly what you're talking about.  At the outset,
all possible outcomes of a particular computation are accounted
for, then a selection criterion is applied to accept or reject
particular outcomes.  Lots of interesting possibilities and
problems here.  One of the most interesting features is that
whole classes of possible solutions can be selected, rather than
a single solution.  I think this leads to a very natural way to
program some problems which are very difficult in more traditional
approaches.  Global optimization of the route-and-place problem
in VLSI and PCB design is a good example.  We're starting to see
the emergence of architectures which address these sorts of ideas
in things like the Connection Machine.

Of course, exponential explosion of the state space does appear to
be something of a problem :-)

mike borza     <nusip@maccs.UUCP or EE784@SSCvax.McMaster.CA>

pf@csc.ti.com (Paul Fuqua) (07/04/88)

    Date: Friday, July 1, 1988  8:59am (CDT)
    From: aglew at urbsdc.Urbana.Gould.COM
    
    This comes perilously close to something I'm trying.
    Davidson at the UIll, now UMich, proposed a partitioned
    access/execute architecture a while back, where you basically
    have two processors, one for address and memory computations,
    one for (mainly floating point?) calculation.
    They run independently, with FIFOs between them for passage of
    variables back and forth, condition codes, etc. They basically
    run two versions of exactly the same program, one with all the
    FP taken out, and the other with all the memory references
    replaced by "get the next value from the memory unit".
    
    Anyone else doing similar?

I have a copy of a paper, "PIPE:  A VLSI Decoupled Architecture," by James R
Goodman and five others, that appeared in a 1985 IEEE something (sorry, it's
not noted on my copy).  PIPE uses two processors connected by queues:  "Each
of PIPE's processors is capable of executing an entire program by itself (in
SP -- single processor -- mode), or the two processors may work together on a
program (in AE -- access/execute -- mode)."

In detail, it's quite similar to the description above, with the addition of
more queues in the memory system (that don't connect the processors), and a
strange sort of variable-delay branch.  The work was done at the University
of Wisconsin-Madison.  Is anyone outside the Great Lakes states working on
this idea?

                              pf

Paul Fuqua
Texas Instruments Computer Science Center, Dallas, Texas
CSNet:  pf@csc.ti.com (ARPA too, sometimes)
UUCP:   {smu, texsun, cs.utexas.edu, im4u, rice}!ti-csl!pf

ho@svax.cs.cornell.edu (Alex Ho) (07/05/88)

i have a copy of a paper by allen, kennedy, porterfield, and warren
from rice university (published by the acm, i don't have the exact
source) describing a scheme for eliminating fortran goto statements 
through the use of if conversion.
in essence, they assign a boolean guard expression to statements
which are executed conditionally. for example (from the paper)...

original:   do 100 i = 1, 100
               statement1
               if (x(i)) goto 200
               statement2
        100 continue
            statement3
        200 statement4

transformed:
            ex1 = .true.
            do 100 i = 1, 100
               if (ex1) statement1
               if (ex1) ex1 = .not. x(i)
               if (ex1) statement2
        100 continue
            if (.not. ex1) goto 200
            statement3
        200 statement4

the do loop can now be vectorized.
has anyone else had experience with this procedure?

---

alex ho
university of california, berkeley             ho@svax.cs.cornell.edu     
a confused eecs major (spending the summer at) cornell university

lfm@fpssun.fps.com (Larry Meadows) (07/06/88)

In article <12258@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> In article <91odrecXKL1010YEOek@amdahl.uts.amdahl.com>
> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
> >You guys aren't thinking big enough.  How about multiple parallel
> >pipelines to compute all the various instruction threads in parallel
> >and just keep the results of the one that is actually taken?
> 
> Actually, this sort of idea is contained in some research and thesis
> work that is (was?) going on here at Maryland.

Sounds a lot like multiflow to me.

But what do you do about exceptions????

-- 
Larry Meadows @ FPS			...!tektronix!fpssun!lfm
					...!omepd!ihf1!fpssun!lfm

eugene@pioneer.arc.nasa.gov.arpa (Eugene N. Miya) (07/06/88)

In article <822@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>In article <853@garth.UUCP>, smryan@garth.UUCP (Steven Ryan) writes:

>> CFT is supposed to provide a vectorisable conditional expression. FTN200 code
>> to vectorise ifs may/may not be completed and released.
>
>Even better is to first construct vectors A1, which contains only the elements
>of A for which bit[i] set, and A0, which has the elements for which bit[i] is
>clear, compute fn1 on A1, fn2 only on A0, and use bit to merge the two result
>vectors.

I think it was Steve who asked what this type of conditional code looks
like.  And I also see that Herman responded.  I have collected some
of this type of information for a little net article on supercomputers
and what should be taught in super-classes which one might not get
but should.... (they are just some ramblings sort of like these).  Anyway,
I spoke to one of my officemates for his sage opinions and he gave me
a couple of references by Paul Frederickson.  Well!  Paul just took a
job at RIACS, so I just walked down the hall, and he has this paper in
Parallel Computing in 1984 and a more recent version in 1987.  George
only told me PC84 and the author, so you can look up the topic.

Anyway, these codes are basically Monte Carlo bomb (nuke) codes
from places like LLNL, LANL, Sandia which model photon, neutron,
and energy transport, those kinds of things.  Herman's description
is closer (dope vectors).  Anyways, if you need a reference,
Parallel Computing 1984, a quarterly, article by Frederickson,
so you shouldn't have too much trouble finding this, or consult
your nearest National Laboratory.
"Mr. Fusion" 8-)  "Plutonium are your corner drug store..."

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
There seems to be a little bit of interest in the subject of
conditionals.

bradlee@cg-atla.UUCP (Rob Bradlee X5153) (07/06/88)

In article <28200173@urbsdc> you write:
>
>
>This comes perilously close to something I'm trying.
>Davidson at the UIll, now UMich, proposed a partitioned
>access/execute architecture a while back, where you basically
>have two processors, one for address and memory computations,
>one for (mainly floating point?) calculation.
>They run independently, with FIFOs between them for passage of
>variables back and forth, condition codes, etc. They basically
>run two versions of exactly the same program, one with all the
>FP taken out, and the other with all the memory references
>replaced by "get the next value from the memory unit".
>
>Anyone else doing similar?

This sounds quite similar to the architecture of array processors built
by CSP, Inc of Billerica, MA.  They use an IPU (integer processor) and
an APU (arithmetic processor) connected by FIFOs.  The IPU would 
handle computations to determine the next piece of data to get from
memory and send it over to the APU for processing. The APU would put
the result in another FIFO for return to the IPU which could then
determine where to put it into memory.  The best part was that the
two processors operated asynchronously, waiting on their respective
input queues if one got ahead of the other.  Unfortunately, to do
anything really useful a program had to be written by hand in assembler.
However, they did have a FORTRAN compiler which sometimes generated
adequate code.

-- 
Rob Bradlee  w:(617)-658-5600 X5153  h:(617)-944-5595
Compugraphic Corp.	    ...!{ima,ulowell,ism780c}!cg-atla!bradlee
200 Ballardvale St.
Wilmington, Mass. 01887           The Nordic Way: Ski till it hurts!

smryan@garth.UUCP (Steven Ryan) (07/06/88)

>            ex1 = .true.
>            do 100 i = 1, 100
>               if (ex1) statement1          <-----(1)
>               if (ex1) ex1 = .not. x(i)    <-----(2)
>               if (ex1) statement2
>        100 continue
>            if (.not. ex1) goto 200
>            statement3
>        200 statement4
>
>the do loop can now be vectorized.

Actually, I don't think so: the reference to ex1 in (1) comes before a definition
in (2); ex1 is recursive.

I worked on something like this, provided the control statements were forward jumps
only and did not exit the loop. The forward jump restriction was to ensure the
flow was acyclic and to ensure guards were computed before their use without
having to rearrange the loop.

Each transfer of control had an associated guard bit vector. The transfer
predicate was vectorised and ANDed with the guard on the transfer of control.
Labels ORed together all guard vectors leading to them. Guards were associated
with assignment statement sequences. (Rather, a guard was established by control
statements and labels and was retained until the next.)

The control structures can form a loop within the do loop, but this requires
a bit of cleverness and is of limited utility. Multi-exit loops either have
recursive guards or must buffer assignments and discard the results of the
excess iterations.
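
For a single forward branch the bookkeeping looks roughly like this (a
scalar C rendering; cond, s1, s2, and s3 are hypothetical stand-ins for
the predicate and the three statement groups in a body of the form
s1; if (cond) goto L; s2; L: s3):

	#include <stddef.h>

	extern int    cond(size_t i);                            /* branch predicate */
	extern double s1(size_t i), s2(size_t i), s3(size_t i);  /* statement groups */

	void guarded_loop(double *r1, double *r2, double *r3,
	                  int *g2, int *gL, size_t n)
	{
	    size_t i;

	    for (i = 0; i < n; i++)       /* s1 runs under the loop guard (all true)  */
	        r1[i] = s1(i);

	    for (i = 0; i < n; i++) {     /* the forward branch                       */
	        int taken = cond(i);
	        g2[i] = !taken;           /* fall-through guard: old guard AND !taken */
	        gL[i] = taken || g2[i];   /* label L: OR of all guards reaching it    */
	    }

	    for (i = 0; i < n; i++)       /* s2 only where its guard is set           */
	        if (g2[i]) r2[i] = s2(i);

	    for (i = 0; i < n; i++)       /* s3 under the label's ORed guard          */
	        if (gL[i]) r3[i] = s3(i);
	}

Here gL comes out all true because L is reachable from both paths; with
nested forward branches the ORs at labels become non-trivial, and a
backward branch would make a guard depend on itself, which is the
recursion problem again.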

david@titan.rice.edu (David Callahan) (07/07/88)

In article <880@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
>>            ex1 = .true.
>>            do 100 i = 1, 100
>>               if (ex1) statement1          <-----(1)
>>               if (ex1) ex1 = .not. x(i)    <-----(2)
>>               if (ex1) statement2
>>        100 continue
>>            if (.not. ex1) goto 200
>>            statement3
>>        200 statement4
>>
>>the do loop can now be vectorized.
>
>Actually, I don't think so: the reference to ex1 in (1) is before a definition
>in (2), ex1 is recursive.
>

I don't know who made the original statement. 
The loop can be ``partially vectorized'' if not all of
the statements are part of the computation of the
exit branch (part of the strongly connected 
component of the data dependence graph that contains
the assignment to "ex1"). The loop can be distributed
so that the vectorizable statements are isolated from
the exit computation. The exit computation is then
used to compute a new trip count for the other loops.

The exit flags do not appear in the guards on the
statements during execution but are used to remember
which exit was taken. This approach can also be applied
uniformly to "while" loops.

David Callahan
Rice University

hoefling@uicsrd.csrd.uiuc.edu (07/08/88)

>/* Written  1:21 pm  Jul  4, 1988 by ho@svax.cs.cornell.edu */
>
>original:   do 100 i = 1, 100
>               statement1
>               if (x(i)) goto 200
>               statement2
>        100 continue
>            statement3
>        200 statement4
>
>transformed:
>            ex1 = .true.
>            do 100 i = 1, 100
>               if (ex1) statement1
>               if (ex1) ex1 = .not. x(i)
>               if (ex1) statement2
>        100 continue
>            if (.not. ex1) goto 200
>            statement3
>        200 statement4
>
>the do loop can now be vectorized.

If "x" in the original is invariant inside the loop (i.e. no dependences 
involving it within the loop), then it is trivial to determine on which
iteration the exit will occur [ it's iteration i where x(i) is first .TRUE. ].
Knowing that, it is also trivial to determine how many "statement1"s, 
"statement2"s, "statement3"s and "statement4"s will be executed. It is 
therefore also trivial to set up vector statements which do the statements
that many times.  The crucial question is whether there are any dependences 
from statement2 to statement1.  Such a dependence would make the loop not 
vectorizable (at some point, an instance of statement1 would have to wait 
for an instance of statement2 to finish).

If we assume that there are no dependences on "x" and no dependences from 
statement2 to statement1, then the problem comes down to simply finding the 
index of the first occurrence of .TRUE. in "x".

C---Let's say that "first_TRUE_index" returns 0 if no x(i) is .TRUE.

	exit_index = first_TRUE_index(x)

	if (exit_index .EQ. 0) then
		limit1 = 100
		limit2 = 100
	else 
		limit1 = exit_index
		limit2 = exit_index-1
	end if

	dovector i=1,limit1
		statement1
	end dovector

	dovector i=1,limit2
		statement2
	end dovector

	if (exit_index .EQ. 0) statement3

	statement4

Jay Hoeflinger
Center for Supercomputing Research and Development
U of Illinois

colwell@mfci.UUCP (Robert Colwell) (07/08/88)

In article <236@lfm.fpssun.fps.com> lfm@fpssun.fps.com (Larry Meadows) writes:
>In article <12258@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
>> In article <91odrecXKL1010YEOek@amdahl.uts.amdahl.com>
>> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
>> >You guys aren't thinking big enough.  How about multiple parallel
>> >pipelines to compute all the various instruction threads in parallel
>> >and just keep the results of the one that is actually taken?
>> 
>> Actually, this sort of idea is contained in some research and thesis
>> work that is (was?) going on here at Maryland.
>
>Sounds a lot like multiflow to me.
>
>But what do you do about exceptions????

Please, that's Multiflow with a capital M (they made me say that).

About the multiple pipelines:  I view our VLIW as being something
close to the converse of what you suggest above.  We don't try to do
anything with multiple instruction threads in parallel.  The essence
of trace scheduling is that you find the right path through the code
and remove all the non-essential branch dependencies.  Whatever is
left can be compacted subject only to data dependencies.  Given that
this approach works (and we certainly think it does), it wouldn't
make much sense to tie up N hardware pipelines with calculations,
only to dump N-1 of them sometime in the future, when you could have
gotten all N pipes doing real work.

The only thing we do that's close to what you suggest is in
performing code motions that sometimes result in operations being
kicked off whose results are never used.  There's a whole raft of
interesting problems that come knocking when you do this, including
taking page faults on memory references that aren't going to be used,
avoiding access violations on references that are way out of bounds
due to array subscripts that have walked off the end, etc.

And then there's exceptions.  Suppose there's a loop that's doing
floating point divides along with a lot of other stuff.  You'll
probably want to kick off the divide early due to its long latency.
But on the last trip through you start the divide, then find out that
this trace is done, and you branch out.  Meanwhile, that divide was
unnecessary and its operands are highly likely to be bogus, so it
is liable to try 0/0 or some other offensive calculation.  You don't
want to trap on an exception here, since that calculation wasn't
"real" anyway.

It's a hard problem.  The TRACE has a "store-check" operation that
watches for NaNs being written to memory from the Floating register
file, on the theory that if you're actually writing the result of a
flop to memory, you must really care about it.  At that point you
know the computation went awry.  Problem is, of course, you may no
longer be in the vicinity of the code where the "event" occurred.
You are still in the routine where the NaN showed up, and if
that's enough info you're done.  If it isn't, you will probably need
to re-compile in a non-optimizing mode and re-run; in that mode any
computation you do, you want, so the functional units can then report
their exceptions directly and immediately, and you can pinpoint your
problem that way.
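
A software caricature of the idea, with store_check as a made-up stand-in
for the hardware operation (IEEE arithmetic quietly produces the NaN
without trapping):

	#include <math.h>
	#include <stdio.h>

	/* The speculated divide may see bogus operands (0/0) and produce a
	 * NaN, and nothing happens at that point.  Only when a value is
	 * actually committed to memory is it checked, on the theory that a
	 * stored result is one you really care about. */
	static void store_check(double *dest, double v)
	{
	    if (isnan(v))
	        fprintf(stderr, "store-check: NaN stored, computation went awry\n");
	    *dest = v;
	}

	double speculative_divide(double num, double den,
	                          int result_wanted, double *dest)
	{
	    double q = num / den;       /* issued early; may be 0/0, no trap taken   */
	    if (result_wanted)
	        store_check(dest, q);   /* the "event" surfaces only on a real store */
	    return q;
	}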

I suspect that nobody's got all the answers to this problem yet.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

smryan@garth.UUCP (Steven Ryan) (07/09/88)

>Sounds a lot like multiflow to me.
>
>But what do you do about exceptions????

Cry.