johnl@ima.ISC.COM (John R. Levine) (01/07/88)
In article <18251@bu-cs.BU.EDU> hen@bucsd.bu.edu (Wm. H. Henneman) writes:
>As I recall, the original FORTRAN (for the IBM 704) had a FREQUENCY
>statement which allowed the programmer to give hints about the
>probability of branches being taken.  It was used by the compiler for
>index register optimization.
>
>[Why did it disappear in later versions?]

The FREQUENCY statement disappeared for two reasons, as far as I can tell.
The first is that it didn't improve the code much; changing the order of
the "branch if greater" vs. the "branch if less" instructions after a test
made little difference on the non-overlapped, non-pipelined 7094.  The
second, consequent reason is that practically nobody ever used it.

The first Fortran I compiler put enormous emphasis on optimization because
there was a well-founded fear that if the object code were too much worse
than what an assembler programmer would have written, nobody would take the
compiler seriously.  The FREQUENCY statement helped the optimization.  In
reality, people took to Fortran quite fast.  Then they asked whether the
compiler, which was pretty slow, couldn't run faster.  IBM added a switch
to bypass most of the optimization so as to produce worse code more
quickly.  Guess what fraction of compiles left optimization on after that.

But enough of this, it's a far cry from hardware architecture.  The lesson
here seems to be that the compiler should make its branch predictions
automatically based on the code, or perhaps iteratively based on
information gathered from profiling a previously compiled version.
-- 
John R. Levine, IECC, PO Box 349, Cambridge MA 02238-0349, +1 617 492 3869
{ ihnp4 | decvax | cbosgd | harvard | yale }!ima!johnl, Levine@YALE.something
Gary Hart for President -- Let's win one for the zipper.
cik@l.cc.purdue.edu (Herman Rubin) (01/07/88)
In article <836@ima.ISC.COM>, johnl@ima.ISC.COM (John R. Levine) writes:
> In article <18251@bu-cs.BU.EDU> hen@bucsd.bu.edu (Wm. H. Henneman) writes:
> >As I recall, the original FORTRAN (for the IBM 704) had a FREQUENCY
> >statement which allowed the programmer to give hints about the
> >probability of branches being taken.  It was used by the compiler for
> >index register optimization.
> >[Why did it disappear in later versions?]
> The FREQUENCY statement disappeared for two reasons, as far as I can tell.
> The first is that it didn't improve the code much; changing the order of
> the "branch if greater" vs. the "branch if less" instructions after a test
> made little difference on the non-overlapped, non-pipelined 7094.  .....
> But enough of this, it's a far cry from hardware architecture.  The lesson
> here seems to be that the compiler should make its branch predictions
> automatically based on the code or perhaps iteratively based on information
> gathered from profiling a previously compiled version.

It is a property of the 70x(x) series, and of most of the computers of that
time, that a branch was almost costless, and the time required to save and
restore all registers was approximately that of a single multiplication.
In fact, on some computers a transfer (even conditional) was the fastest
operation, with memory access next.  Since then, arithmetic operations
have, relative to memory access, sped up; branches have slowed down; and
the number of registers has greatly increased.  On many machines, a
transfer can cost 15-100 instruction times.  On many which attempt to look
ahead, even an untaken transfer can cost 8 instruction times.  Thus it is
important that the programmer (algorithm designer) be able to tell the
compiler which way things are going, and for the hardware to be able to
charge ahead when there is a possibility of a rare branch.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet
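[Editor's aside: Rubin's request -- a way for the programmer to tell the
compiler which way things are going -- survives in today's C compilers as
the GCC/Clang intrinsic __builtin_expect, the closest living relative of
FREQUENCY.  A minimal sketch; __builtin_expect is a real intrinsic, but the
LIKELY/UNLIKELY macro names and the sum_checked example are my own:]

```c
/* __builtin_expect plays the role of FREQUENCY: a programmer's assertion
   about which way a branch usually goes.  Macro names are my own. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Sum an array, telling the compiler the error case is rare so it can
   lay out the fall-through (no-branch) path for the common case. */
long sum_checked(const int *a, int n, int *err)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (UNLIKELY(a[i] < 0)) {   /* rare branch: bad input */
            *err = 1;
            return 0;
        }
        s += a[i];
    }
    *err = 0;
    return s;
}
```

The hint changes only code layout and prediction, never the result, so a
wrong guess here costs cycles rather than correctness -- which is exactly
the property FREQUENCY lost on the 704 (see the -0 discussion below).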
baum@apple.UUCP (Allen J. Baum) (01/09/88)
--------
[]
Someone asked:
>As I recall, the original FORTRAN (for the IBM 704) had a FREQUENCY
>statement which allowed the programmer to give hints about the
>probability of branches being taken.  It was used by the compiler for
>index register optimization.  Why did it disappear in later versions?

I was once told that the reason it was discontinued was that the
programmer's intuition about which was the frequent case was frequently
(argh) wrong, and that the statements hurt as much as they helped.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum  (408)973-3385
johnl@ima.ISC.COM (John R. Levine) (01/09/88)
In article <645@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>In article <836@ima.ISC.COM>, johnl@ima.ISC.COM (John R. Levine) writes:
>> [The Fortran FREQUENCY statement made little difference on the IBM 70[49]X
>> machines, was little used, and was dropped from the language.]  The lesson
>> here seems to be that the compiler should make its branch predictions
>> automatically based on the code or perhaps iteratively based on information
>> gathered from profiling a previously compiled version.
>
>... On many machines, a transfer can cost 15-100 instruction
>times.  On many which attempt to look ahead, even an untaken transfer can
>cost 8 instructions.  Thus it is important that the programmer (algorithm
>designer) be able to tell the compiler which way things are going, and for
>the hardware to be able to charge ahead when there is a possibility of a
>rare branch.

Let me try to clarify my point; then I promise to shut up.  I realize that
modern computers have different performance tradeoffs than a 704 did.
There's certainly no doubt that branch prediction can be a big win on
pipelined architectures.  What I doubt is that it's a good idea to expect
the programmer to put into the source code explicit hints about which way
a branch is expected to go.

It seems to me that in most cases static analysis by the compiler can
figure out which is the preferred direction, e.g. branches that close
loops are likely to be taken.  In the cases where that doesn't work, I'd
expect profiling and recompiling to produce far better hints than a
typical programmer would put in.  Perhaps if you're that concerned about
code tweaking, you should write that part of your program in assembler
(0.25 :-))
-- 
John R. Levine, IECC, PO Box 349, Cambridge MA 02238-0349, +1 617 492 3869
{ ihnp4 | decvax | cbosgd | harvard | yale }!ima!johnl, Levine@YALE.something
Gary Hart for President -- Let's win one for the zipper.
nobody@ism780c.UUCP (Unprivileged user) (01/09/88)
>> The FREQUENCY statement disappeared for two reasons, as far as I can tell.
>> The first is that it didn't improve the code much; changing the order of
>> the "branch if greater" vs. the "branch if less" instructions after a test
>> made little difference on the non-overlapped, non-pipelined 7094.  .....

First a note on the history of FORTRAN and how it interacted with the IBM
704 architecture.  FORTRAN was originally developed as a programming
language for the IBM 704 computer.  The FREQUENCY statement was associated
with the arithmetic IF statement.  An IF statement like:

      IF (<EXPRESSION>) 1,2,3

was compiled into the following 704 machine code:

      <evaluate-expression>
      TZE LABEL2    /* transfer to label 2 if result is zero */
      TMI LABEL1    /* transfer to label 1 if result is minus */
      TRA LABEL3    /* unconditional transfer to label 3 */

Now, the purpose of the FREQUENCY statement was to allow the programmer to
assert that the most frequent value of the <expression> was negative.  The
hope was that the generated branching code would be:

      TMI LABEL1    /* transfer to label 1 if result is minus */
      TZE LABEL2    /* transfer to label 2 if result is zero */
      TRA LABEL3    /* unconditional transfer to label 3 */

thus causing the processor to execute fewer branch instructions on average.

But alas, the semantics of FORTRAN forbade the above optimization!  The
reason is that 'minus' did not mean the same as 'less than zero' on the 704
hardware.  The 704 used signed-magnitude representation for numbers, so
there were *two* forms of zero, +0 and -0, and the TMI instruction would
branch if the result was -0.  Therefore, the compiler was forced to
generate the zero test first, independent of any FREQUENCY assertion.  It
was for this reason that FREQUENCY was removed from FORTRAN.

The FORTRAN compiler did generate high quality code.
For example, the inner loop of a matrix multiply could be written:

      C(I,J) = C(I,J) + A(I,K)*B(K,J)

The generated code looked like:

loop  LDQ a,ik         /* ik is an index register    2 cycles */
      FMP b,kj         /* kj is an index register    4 cycles (ave) */
      FAD c,ij         /* ij is an index register    6 cycles (ave) */
      STO c,ij         /*                            2 cycles */
      TXI *+1,kj,n     /* increment kj by row size   1 cycle  */
      TXI *+1,ik,1     /* increment ik by one        1 cycle  */
      TXL loop,ik,n+1  /* to loop if not done        1 cycle  */

>It is a property of the 70x(x) series, and most of the computers of that time,
>that a branch was almost costless, and the time required to save and restore
>all registers was approximately that of a single multiplication.
>--
>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
>Phone: (317)494-6054
>hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

Second, some history on computer architecture.  On the 709 and 7094 the
instruction times were:

      branching        -- 1 cycle
      load, add, etc.  -- 2 cycles
      multiply         -- 2 to 5 cycles
      save & restore   -- 52 cycles (all registers)
      subroutine call  -- 7 cycles average (FORTRAN calling convention)

      709 cycle  -- 12 microseconds
      7094 cycle -- 1.2 microseconds (average)

Marvin Rubinstein (Historian)
Interactive Systems.
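[Editor's aside: the signed-magnitude zero problem described above is easy
to demonstrate with a small simulation.  This is my own C sketch, not real
704 code; it models the accumulator as a sign flag plus a magnitude and
compares the two branch orderings for the arithmetic IF:]

```c
/* Sign-magnitude word: minus=1, mag=0 represents -0. */
struct smword { int minus; unsigned mag; };

/* Correct order mandated by FORTRAN semantics: TZE, then TMI, then TRA.
   Both +0 and -0 must go to the zero label. */
int arith_if_zero_first(struct smword w)
{
    if (w.mag == 0) return 2;   /* TZE */
    if (w.minus)    return 1;   /* TMI */
    return 3;                   /* TRA */
}

/* The reordering FREQUENCY was supposed to enable -- broken on -0,
   because TMI branches on the sign bit alone. */
int arith_if_minus_first(struct smword w)
{
    if (w.minus)    return 1;   /* TMI fires on -0 too */
    if (w.mag == 0) return 2;   /* TZE */
    return 3;                   /* TRA */
}
```

With -0 in hand, the minus-first ordering lands on label 1 where the
language requires label 2, which is why the compiler could never honor
the hint.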
lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) (01/10/88)
A FREQUENCY statement (or equivalent) is needed so that the tooling has a
way of feeding its discoveries back to the compiler.

RISC machines aren't the only ones that care about branch prediction.  IBM
370s have dedicated considerable hardware to the matter - conditional
execution of instructions down both possible paths, and the like.  The new
VLIW machines, such as the Multiflow Trace, eliminate this hardware, so
the compiler's analysis becomes crucial.
-- 
Don    lindsay@k.gp.cs.cmu.edu    CMU Computer Science
franka@mmintl.UUCP (Frank Adams) (01/14/88)
I will note in passing that the assumption that continuing a loop is more
likely than terminating it is often wrong. Loops of the form:
while (error is present) {
try to fix it;
}
are not all that uncommon.
--
Frank Adams ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate 52 Oakland Ave North E. Hartford, CT 06108
earl@mips.UUCP (Earl Killian) (01/20/88)
In article <2652@mmintl.UUCP>, franka@mmintl.UUCP (Frank Adams) writes:
> I will note in passing that the assumption that continuing a loop is more
> likely than terminating it is often wrong.  Loops of the form:
>
> while (error is present) {
>     try to fix it;
> }
>
> are not all that uncommon.

True, they are common, but precisely because such things execute 0 or 1
times, they don't weight branch statistics as heavily as the "real" loop
that executes 10 times.

Also,

      while (t) { body }

is usually compiled as

      if (t) { do { body } while (t); }

by a good compiler.  Thus the backward branch in this loop is never
executed in your case, and so doesn't affect the statistics of backward
branches.

When I last measured backward conditional branches on 13 programs, I found
an average of 85% are taken.  The figure for forward branches was 52%
(i.e. 50-50).  That's the average behavior; individual programs tended to
vary a whole lot in the frequency of forward branches.  One final
statistic: backward branches were only 28% of the total on average.
Another way of thinking about this is that the average loop contains 2.6
forward branches (e.g. executed if statements).
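[Editor's aside: the loop rotation Killian describes can be written out by
hand in C.  This is my own toy version of the error-fixing loop; both forms
compute the same thing, but the rotated one executes no backward branch at
all when the loop runs zero times:]

```c
/* Straightforward form: the loop test sits at the top, so the compiler
   would emit a backward branch even in the zero-iteration case. */
int fix_all_plain(int errors)
{
    int tries = 0;
    while (errors > 0) {        /* "while (error is present)" */
        errors--;               /* "try to fix it" */
        tries++;
    }
    return tries;
}

/* Rotated form a good compiler produces: a forward branch guards entry,
   and the backward branch is executed only on repeat iterations. */
int fix_all_rotated(int errors)
{
    int tries = 0;
    if (errors > 0) {           /* forward branch, rarely taken here */
        do {
            errors--;
            tries++;
        } while (errors > 0);   /* backward branch: errors-1 executions */
    }
    return tries;
}
```

So Frank Adams's 0-or-1-iteration loops charge their misprediction to a
forward branch, leaving the 85% taken-rate of backward branches intact.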
john@geac.UUCP (John Henshaw) (01/22/88)
In article <839@ima.ISC.COM> johnl@ima.UUCP (John R. Levine) writes:
>
>It seems to me that in most cases static analysis by the compiler can figure
>out which is the preferred direction, e.g. branches that close loops are
>likely to be taken.  In the cases where that doesn't work, I'd expect
>profiling and recompiling to produce far better hints than a typical
>programmer would put in.
>-- 
>John R. Levine

I endorse this point of view, and wonder if there are compilers available
that insert profiling logic maintaining a comprehensive history of branch
activity, followed by recompilation that takes the previous branch history
into account.  The activity of producing a piece of production code would
be:

   1. Design, code and test your program.
   2. Compile it for "branch profiling".
   3. Execute the program on a dataset large enough that there is
      sufficient confidence that the program's execution is truly
      representative of "standard production activity".
   4. Recompile the program.  This recompilation should use the
      information gained from step 3.

There are a few assumptions here :-).  I think they're obvious.  Most of
the intent here is not to have programmers guess about program activity,
but rather to have the compiler/profiler do this analysis for him/her.
There is the known concern that the data will change, causing worse
performance than anticipated - steps 2 to 4 should probably be rerun from
time to time.  And of course there are times when we wish to optimize the
exception conditions, because they're the reason for automation in the
first place.

Comments anyone?

-john-
-- 
John Henshaw, (mnetor, yetti, utgpu !geac!john)
Geac Computers Ltd.            "Relax.  The collapse of the universe and all
Markham, Ontario, Canada, eh?   life as we know it, could only kill you once."
tak@bnr-rsc.UUCP (Mike Takefman) (01/25/88)
In article <2158@geac.UUCP> john@geac.UUCP (John Henshaw) writes:
. 1. Design, code and test your program.
. 2. Compile it for "branch profiling".
. 3. Execute the program on a dataset large enough that there is
. sufficient confidence that the program's execution is truly
. representative of "standard production activity".
. 4. Recompile the program. This recompilation should use the
. information gained from step 3.
.There are a few assumptions here :-). I think they're obvious.
One assumption that may not be obvious but is important is that this
method is highly intrusive. This method will not work well for a
real time application.
For vanilla programming I fully endorse this view, but I believe
that the challenge is in doing this non-intrusively.
--
Michael Takefman There are 2 basic truths,
Processor Exploratory Never fight a land war in Asia and
tak@bnr-rsc@bnr-vpa@utzoo.... never go against a Sicilian if death is
Bell Northern Research on the line. (The Princess Bride)
jmd@granite.dec.com (John Danskin) (01/27/88)
In article <604@bnr-rsc.UUCP> tak@bnr-rsc.UUCP (Mike Takefman) writes:
>In article <2158@geac.UUCP> john@geac.UUCP (John Henshaw) writes:
>. 1. Design, code and test your program.
>. 2. Compile it for "branch profiling".
>. 3. Execute the program on a dataset large enough that there is
>.    sufficient confidence that the program's execution is truly
>.    representative of "standard production activity".
>. 4. Recompile the program.  This recompilation should use the
>.    information gained from step 3.
>.There are a few assumptions here :-).  I think they're obvious.
>
>One assumption that may not be obvious but is important is that this
>method is highly intrusive.  This method will not work well for a
>real time application.
>
>For vanilla programming I fully endorse this view, but I believe
>that the challenge is in doing this non-intrusively.

What do you mean by intrusive?  Do you mean that the optimizing process
can be unpredictable?

For a given set of inputs, a real time program can exhibit two speed
related characteristics:

   1) Fast enough.
   2) Not fast enough.

If the program is fast enough for some usual case, but not fast enough for
some exception, then there are four options:

   1) Profile with more instances of the exceptional behaviour to try to
      pump up the performance of those cases, at the expense of the other
      cases.  This may require several iterations, and may expose a design
      problem: namely that, given the state of compiler technology
      embodied in your compiler, no amount of jiggling branch frequencies
      (and global register allocation characteristics, etc.) will make
      all of the code sequences fast enough.
   2) Redo the parts that aren't fast enough in assembler.
   3) Change your algorithms so there is more time.
   4) Get an optimizing compiler that understands real time constraints:
      you say "this subroutine, given these sets of inputs, must complete
      in xxx microseconds."  You would probably give these directives to
      the compiler in a meta language that you used while profiling.

Of course, this last option would be a lot of work 8-{).
-- 
John Danskin                        | decwrl!jmd
DEC Workstation Systems Engineering | (415)853-6724
100 Hamilton Avenue                 | My comments are my own.
Palo Alto, CA 94306                 | I do not speak for DEC.