[comp.arch] taken -vs- untaken branches, Fortran FREQUENCY declaration

johnl@ima.ISC.COM (John R. Levine) (01/07/88)

In article <18251@bu-cs.BU.EDU> hen@bucsd.bu.edu (Wm. H. Henneman) writes:
>As I recall, the original FORTRAN (for the IBM 704) had a FREQUENCY
>statement which allowed the programmer to give hints about the
>probability of branches being taken.  It was used by the compiler for
>index register optimization.  
>
>[Why did it disappear in later versions?]

The FREQUENCY statement disappeared for two reasons, as far as I can tell.
The first is that it didn't improve the code much; changing the order of
the "branch if greater" vs. the "branch if less" instructions after a test
made little difference on the non-overlapped, non-pipelined 7094.  The
second reason, consequent on the first, is that practically nobody ever used it.  The
first Fortran I compiler put enormous emphasis on optimization because there
was a well-founded fear that if the object code were too much worse than what
an assembler programmer would have written, nobody would take the compiler
seriously.  The FREQUENCY statement helped the optimization.  In reality,
people took to Fortran quite fast.  Then they asked if the compiler, which
was pretty slow, couldn't run faster.  IBM added a switch to bypass most of
the optimization so as to produce worse code more quickly.  Guess what fraction
of compiles left optimization on after that.

But enough of this, it's a far cry from hardware architecture.  The lesson
here seems to be that the compiler should make its branch predictions
automatically based on the code or perhaps iteratively based on information
gathered from profiling a previously compiled version.

-- 
John R. Levine, IECC, PO Box 349, Cambridge MA 02238-0349, +1 617 492 3869
{ ihnp4 | decvax | cbosgd | harvard | yale }!ima!johnl, Levine@YALE.something
Gary Hart for President -- Let's win one for the zipper.

cik@l.cc.purdue.edu (Herman Rubin) (01/07/88)

In article <836@ima.ISC.COM>, johnl@ima.ISC.COM (John R. Levine) writes:
> In article <18251@bu-cs.BU.EDU> hen@bucsd.bu.edu (Wm. H. Henneman) writes:
> >As I recall, the original FORTRAN (for the IBM 704) had a FREQUENCY
> >statement which allowed the programmer to give hints about the
> >probability of branches being taken.  It was used by the compiler for
> >index register optimization.  

> >[Why did it disappear in later versions?]
  
> The FREQUENCY statement disappeared for two reasons, as far as I can tell.
> The first is that it didn't improve the code much; changing the order of
> the "branch if greater" vs. the "branch if less" instructions after a test
> made little difference on the non-overlapped, non-pipelined 7094.  .....

> But enough of this, it's a far cry from hardware architecture.  The lesson
> here seems to be that the compiler should make its branch predictions
> automatically based on the code or perhaps iteratively based on information
> gathered from profiling a previously compiled version.

It is a property of the 70x(x) series, and most of the computers of that time,
that a branch was almost costless, and the time required to save and restore
all registers was approximately that of a single multiplication.  In fact, on
some computers a transfer (even a conditional one) was the fastest operation, with
memory access next.  Since then, arithmetic operations have sped up relative to
memory access, branches have slowed, and the number of registers has
greatly increased.  On many machines, a transfer can cost 15-100 instruction
times.  On many which attempt to look ahead, even an untaken transfer can
cost 8 instructions.  Thus it is important that the programmer (algorithm
designer) be able to tell the compiler which way things are going, and that
the hardware be able to charge ahead when there is a possibility of a
rare branch.
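
To make that concrete: a source-level hint could be as simple as a pair of
macros that a cooperating compiler recognizes.  This is only a sketch in C;
the LIKELY/UNLIKELY names are hypothetical, and an ordinary compiler would
just see the bare expression:

	#include <stdio.h>

	/* Hypothetical hint macros.  A compiler that understood them
	 * could lay the code out so the common path falls through and
	 * only the rare path pays for a taken branch. */
	#define LIKELY(x)   (x)
	#define UNLIKELY(x) (x)

	static double safe_divide(double num, double den)
	{
	    if (UNLIKELY(den == 0.0))
	        return 0.0;            /* rare path, kept out of the way */
	    return num / den;          /* common path falls through */
	}

	int main(void)
	{
	    printf("%g %g\n", safe_divide(1.0, 4.0), safe_divide(1.0, 0.0));
	    return 0;
	}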


-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette, IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

baum@apple.UUCP (Allen J. Baum) (01/09/88)

Someone asked:
>As I recall, the original FORTRAN (for the IBM 704) had a FREQUENCY
>statement which allowed the programmer to give hints about the
>probability of branches being taken.  It was used by the compiler for
>index register optimization.  Why did it disappear in later versions?

I was once told that the reason it was discontinued was that the programmer's
intuition about which was the frequent case was frequently (arg) wrong, and
that the statements hurt as much as they helped.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

johnl@ima.ISC.COM (John R. Levine) (01/09/88)

In article <645@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>In article <836@ima.ISC.COM>, johnl@ima.ISC.COM (John R. Levine) writes:
>> [The Fortran FREQUENCY statement made little difference on the IBM 70[49]X
>> machines, was little used, and was dropped from the language.]  The lesson
>> here seems to be that the compiler should make its branch predictions
>> automatically based on the code or perhaps iteratively based on information
>> gathered from profiling a previously compiled version.
>
>...  On many machines, a transfer can cost 15-100 instruction
>times.  On many which attempt to look ahead, even an untaken transfer can
>cost 8 instructions.  Thus it is important that the programmer (algorithm
>designer) be able to tell the compiler which way things are going, and for
>the hardware to be able to charge ahead when there is a possibility of a
>rare branch.

Let me try to clarify my point; then I promise to shut up. I realize that
modern computers have different performance tradeoffs than a 704 did. There's
certainly no doubt that branch prediction can be a big win on pipelined
architectures. What I doubt is that it's a good idea to expect the programmer
to put into the source code explicit hints about which way a branch is
expected to go.

It seems to me that in most cases static analysis by the compiler can figure
out which is the preferred direction, e.g. branches that close loops are
likely to be taken. In the cases where that doesn't work, I'd expect profiling
and recompiling to produce far better hints than a typical programmer would
put in. Perhaps if you're that concerned about code tweaking, you should write
that part of your program in assembler (0.25 :-))
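
For what it's worth, the static half of that is nearly trivial.  Here is a
sketch in C of the "backward branches close loops" heuristic; the structure
and names are mine, not any particular compiler's internals:

	#include <stdio.h>

	struct branch {
	    unsigned from;  /* address of the branch instruction */
	    unsigned to;    /* address of the branch target */
	};

	/* Predict taken iff the branch goes backward, since backward
	 * branches usually close loops. */
	static int predict_taken(struct branch b)
	{
	    return b.to < b.from;
	}

	int main(void)
	{
	    struct branch close_loop = { 0x120, 0x100 };  /* backward */
	    struct branch skip_ahead = { 0x120, 0x140 };  /* forward  */

	    printf("loop branch: %s\n",
	           predict_taken(close_loop) ? "taken" : "not taken");
	    printf("forward branch: %s\n",
	           predict_taken(skip_ahead) ? "taken" : "not taken");
	    return 0;
	}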
-- 
John R. Levine, IECC, PO Box 349, Cambridge MA 02238-0349, +1 617 492 3869
{ ihnp4 | decvax | cbosgd | harvard | yale }!ima!johnl, Levine@YALE.something
Gary Hart for President -- Let's win one for the zipper.

nobody@ism780c.UUCP (Unprivileged user) (01/09/88)

>> The FREQUENCY statement disappeared for two reasons, as far as I can tell.
>> The first is that it didn't improve the code much; changing the order of
>> the "branch if greater" vs. the "branch if less" instructions after a test
>> made little difference on the non-overlapped, non-pipelined 7094.  .....

First, a note on the history of FORTRAN and how it interacted with the IBM
704 architecture.  FORTRAN was originally developed as a programming language
for the IBM 704 computer.  The FREQUENCY statement was associated with the
arithmetic IF statement.  An IF statement like:

       IF (<EXPRESSION>) 1,2,3

was compiled into the following 704 machine code:

	<evaluate-expression>
	TZE  LABEL2      /* transfer to label 2 if result is zero */
	TMI  LABEL1      /* transfer to label 1 if result is minus */
	TRA  LABEL3      /* unconditional transfer to label 3 */
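
(From memory, and so approximate in detail: the FREQUENCY declaration
itself named the statement being predicted and gave relative frequencies
for its exits.  For an arithmetic IF such as

	30     IF (X - Y) 1,2,3
	       ...
	       FREQUENCY 30 (10,1,1)

the declaration would assert that the exit to statement 1, the negative
case, was expected about ten times as often as either of the others.)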

Now, the purpose of the FREQUENCY statement was to allow the programmer
to assert that the most frequent value of the <expression> was negative.
The hope was that the generated branching code would be:

	TMI  LABEL1      /* transfer to label 1 if result is minus */
	TZE  LABEL2      /* transfer to label 2 if result is zero */
	TRA  LABEL3      /* unconditional transfer to label 3 */

thus causing the processor to execute fewer branch instructions on average.
But alas, the semantics of FORTRAN forbade the above optimization!  The
reason is that 'minus' did not mean the same as 'less than zero' on the 704
hardware.  The 704 used signed-magnitude representation for numbers, so
there were *two* forms of zero, +0 and -0, and the TMI instruction
would branch if the result was -0.  Therefore, the compiler was forced to
generate the zero test first, independent of any FREQUENCY assertion.  It was
for this reason that FREQUENCY was removed from FORTRAN.
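
If the two-zeros point seems abstract, here is a little C model of it;
the representation and names are mine, purely for illustration:

	#include <stdio.h>

	/* Toy model of a 704-style sign-magnitude word: a sign bit plus
	 * a magnitude.  Note that +0 and -0 both have magnitude zero. */
	struct word { int minus; unsigned long mag; };

	static int tmi(struct word w) { return w.minus; }     /* sign set?    */
	static int tze(struct word w) { return w.mag == 0; }  /* magnitude 0? */

	int main(void)
	{
	    struct word minus_zero = { 1, 0 };

	    /* TZE first (what the compiler had to emit): -0 is zero. */
	    if (tze(minus_zero))      puts("TZE first: LABEL2 (zero)");
	    else if (tmi(minus_zero)) puts("TZE first: LABEL1 (minus)");

	    /* TMI first (the hoped-for reordering): -0 is misrouted. */
	    if (tmi(minus_zero))      puts("TMI first: LABEL1 (minus) -- wrong!");
	    else if (tze(minus_zero)) puts("TMI first: LABEL2 (zero)");

	    return 0;
	}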

The FORTRAN compiler did generate high quality code.  For example, the
inner loop of a matrix multiply could be written:
	 C(I,J) = C(I,J) + A(I,K)*B(K,J)
the generated code looked like:

  loop     LDQ  a,ik         /* ik is an index register  2 cycles       */
	   FMP  b,kj         /* kj is an index register  4 cycles (ave) */
	   FAD  c,ij         /* ij is an index register  6 cycles (ave) */
	   STO  c,ij         /*                          2 cycles       */
	   TXI  *+1,kj,n     /* increment kj by row size 1 cycle        */
	   TXI  *+1,ik,1     /* increment ik by one      1 cycle        */
	   TXL  loop,ik,n+1  /* to loop if not done      1 cycle        */
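
For those who don't read 704 assembly, a rough C rendering of that inner
loop follows; the row-major layout and the row size n are my assumptions,
mirroring the comments in the assembly:

	#include <stdio.h>

	/* One pass of C(I,J) = C(I,J) + A(I,K)*B(K,J) over K, as in the
	 * 704 loop above: load, multiply, add, store, bump indices. */
	static void inner(const double *a, const double *b, double *c,
	                  int i, int j, int n)
	{
	    for (int k = 0; k < n; k++)
	        c[i*n + j] += a[i*n + k] * b[k*n + j];
	}

	int main(void)
	{
	    double a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, c[4] = {0};
	    inner(a, b, c, 0, 0, 2);   /* c[0][0] = 1*5 + 2*7 */
	    printf("%g\n", c[0]);      /* prints 19 */
	    return 0;
	}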

>It is a property of the 70x(x) series, and most of the computers of that time,
>that a branch was almost costless, and the time required to save and restore
>all registers was approximately that of a single multiplication.
>-- 
>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
>Phone: (317)494-6054
>hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

Second, some history on computer architecture.  On the 709 and
7094 the instruction times were:
       branching       -- 1 cycle
       load, add, etc. -- 2 cycles
       multiply        -- 2 to 5 cycles
       save & restore  -- 52 cycles (all registers)
       subroutine call -- 7 cycles average (FORTRAN calling convention)
       709 cycle       -- 12 microseconds
       7094 cycle      -- 1.2 microseconds (average)

Marvin Rubinstein (Historian)  Interactive Systems.

lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) (01/10/88)

A FREQUENCY statement (or equivalent) is needed so that the tooling
has a way of feeding its discoveries back to the compiler.

RISC machines aren't the only ones that care about branch prediction.
IBM 370's have dedicated considerable hardware to the matter - conditional
execution of instructions down both possible paths, and the like. 
The new VLIW machines, such as the Multiflow Trace, eliminate this
hardware, so the compiler's analysis becomes crucial.
-- 
	Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

franka@mmintl.UUCP (Frank Adams) (01/14/88)

I will note in passing that the assumption that continuing a loop is more
likely than terminating it is often wrong.  Loops of the form:

while (error is present) {
	try to fix it;
}

are not all that uncommon.
-- 

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108

earl@mips.UUCP (Earl Killian) (01/20/88)

In article <2652@mmintl.UUCP>, franka@mmintl.UUCP (Frank Adams) writes:
> I will note in passing that the assumption that continuing a loop is more
> likely than terminating it is often wrong.  Loops of the form:
> 
> while (error is present) {
> 	try to fix it;
> }
> 
> are not all that uncommon.

True, they are common, but precisely because such things execute 0 or
1 times, they don't weight branch statistics as heavily as the
"real" loop that executes 10 times.

Also,
	while (t) {
	  body
	}
is usually compiled as
	if (t) {
	  do {
	    body
	  } while (t);
	}
by a good compiler.  Thus the backward branch in this loop is never
executed in your case, and so doesn't affect the statistics of
backward branches.

When I last measured backward conditional branches on 13 programs, I
found that on average 85% were taken.  For forward branches the figure was
52% (i.e. essentially 50-50).  That's the average behavior; individual programs
tended to vary a whole lot on the frequency of forward branches.

One final statistic: backward branches were only 28% of the total on
average.  Another way of thinking about this is that the average loop
contains about 72/28 = 2.6 forward branches (e.g. executed if statements).

john@geac.UUCP (John Henshaw) (01/22/88)

In article <839@ima.ISC.COM> johnl@ima.UUCP (John R. Levine) writes:
>
>It seems to me that in most cases static analysis by the compiler can figure
>out which is the preferred direction, e.g. branches that close loops are
>likely to be taken. In the cases where that doesn't work, I'd expect profiling
>and recompiling to produce far better hints that a typical programmer would
>put in.
>-- 
>John R. Levine

I endorse this point of view, and wonder if there are compilers
available that insert profiling logic that maintains a comprehensive
history of branch activity, followed by recompilation that takes the
previous branch history into account.  The activity of producing a piece
of production code would be:

	1. Design, code and test your program.
	2. Compile it for "branch profiling".
	3. Execute the program on a dataset large enough that there is
	   sufficient confidence that the program's execution is truly
	   representative of "standard production activity".
	4. Recompile the program. This recompilation should use the
	   information gained from step 3.

There are a few assumptions here :-). I think they're obvious.
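
To make step 2 concrete, here is a minimal sketch in C of what the
instrumentation might look like; the counter table and dump format are
inventions for illustration, and a real compiler would of course generate
all of this behind the scenes:

	#include <stdio.h>
	#include <stdlib.h>

	/* One taken/not-taken counter pair per branch site, indexed by
	 * an id the compiler would assign when instrumenting. */
	#define NSITES 2
	static unsigned long taken[NSITES], fell_through[NSITES];

	/* Instrumented form of "if (cond)": count, then branch as usual. */
	static int profiled(int site, int cond)
	{
	    if (cond) taken[site]++; else fell_through[site]++;
	    return cond;
	}

	/* At exit, dump the counts for the recompilation pass to read. */
	static void dump_profile(void)
	{
	    for (int s = 0; s < NSITES; s++)
	        fprintf(stderr, "site %d: taken %lu, fell through %lu\n",
	                s, taken[s], fell_through[s]);
	}

	int main(void)
	{
	    atexit(dump_profile);
	    for (int i = 0; i < 100; i++) {
	        if (profiled(0, i % 10 == 0))
	            continue;              /* the rare branch */
	        profiled(1, i < 50);       /* a roughly 50-50 branch */
	    }
	    return 0;
	}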

Most of the intent here is to not have programmers guess about program
activity, but rather have the compiler/profiler do this analysis for
him/her. There is the known concern that the data will change, causing 
worse performance than anticipated - steps 2 to 4 should probably be rerun
from time to time.

And of course there are times when we wish to optimize the exception
conditions because they're the reason for automation in the first place.

Comments anyone?

-john-
-- 
John Henshaw,			(mnetor, yetti, utgpu !geac!john)
Geac Computers Ltd.		"Relax. The collapse of the universe and all
Markham, Ontario, Canada, eh?	 life as we know it, could only kill you once."

tak@bnr-rsc.UUCP (Mike Takefman) (01/25/88)

In article <2158@geac.UUCP> john@geac.UUCP (John Henshaw) writes:
.	1. Design, code and test your program.
.	2. Compile it for "branch profiling".
.	3. Execute the program on a dataset large enough that there is
.	   sufficient confidence that the program's execution is truly
.	   representative of "standard production activity".
.	4. Recompile the program. This recompilation should use the
.	   information gained from step 3.
.There are a few assumptions here :-). I think they're obvious.

One assumption that may not be obvious but is important is that this
method is highly intrusive, so it will not work well for a
real-time application.

For vanilla programming I fully endorse this view, but I believe 
that the challenge is in doing this non-intrusively.


-- 
Michael Takefman	       There are 2 basic truths,          
Processor Exploratory	         Never fight a land war in Asia and
tak@bnr-rsc@bnr-vpa@utzoo....      never go against a Sicilian if death is 
Bell Northern Research               on the line.   (The Princess Bride)

jmd@granite.dec.com (John Danskin) (01/27/88)

In article <604@bnr-rsc.UUCP> tak@bnr-rsc.UUCP (Mike Takefman) writes:
>In article <2158@geac.UUCP> john@geac.UUCP (John Henshaw) writes:
>.	1. Design, code and test your program.
>.	2. Compile it for "branch profiling".
>.	3. Execute the program on a dataset large enough that there is
>.	   sufficient confidence that the program's execution is truly
>.	   representative of "standard production activity".
>.	4. Recompile the program. This recompilation should use the
>.	   information gained from step 3.
>.There are a few assumptions here :-). I think they're obvious.
>
>One assumption that may not be obvious but is important is that this 
>method is highly intrusive. This method will not work well for a 
>real time application. 
>
>For vanilla programming I fully endorse this view, but I believe 
>that the challenge is in doing this non-intrusively.


What do you mean by intrusive? Do you mean that the optimizing process
can be unpredictable?

For a given set of inputs, a real time program can exhibit two speed related
characteristics:
	1) Fast enough.
	2) Not fast enough.

If the program is fast enough for some usual case, but not fast enough for
some exception, then there are four options:
	1) Profile with more instances of the exceptional behaviour
	   to try to pump up the performance of those cases, at the expense
	   of the other cases. This may require several iterations, and
	   may expose a design problem: given the state of the
	   compiler technology embodied in your compiler, no amount of
	   jiggling branch frequencies (and global register
	   allocation characteristics etc.) will make all of the
	   code sequences fast enough.
	
	2) Redo the parts that aren't fast enough in assembler.

	3) Change your algorithms so there is more time.

	4) Get an optimizing compiler that understands real time constraints:
		you say:
			this subroutine, given these sets of inputs, must
			complete in xxx microseconds.

	    you would probably give these directives to the compiler in a
	    meta language that you used while profiling.

	    Of course, this last option would be a lot of work 8-{).


-- 
John Danskin				| decwrl!jmd
DEC Workstation Systems Engineering	| (415)853-6724 
100 Hamilton Avenue			| My comments are my own.
Palo Alto, CA  94306			| I do not speak for DEC.