peter@mit-amt.MEDIA.MIT.EDU (Peter Schroeder) (09/20/89)
/*
** Consider the program below.  When compiling it on an HP9000/835 with
** "cc -O test.c -o test" and then doing a "time test" I get 44 seconds
** of runtime.  If I remove either the mult() call or the add() call from
** the loop (i.e. executing only one of the two function calls), the
** program runs so fast you can barely blink.  In other words: the time
** the program needs to execute the loop with both calls is larger by a
** factor of 50 (!) than the sum of the loop with only the add() and the
** loop with only the mult().
**
** I am not a compiler writer, but I find this rather surprising and,
** needless to say, extremely disappointing...  I discovered this using
** C++ and cfront 1.2.  The program below captures the essence of the
** C++ program, and indeed the timings are practically identical.
**
** Any explanation, workaround, etc. would be much appreciated!
**
** Peter
** peter@media-lab.media.mit.edu
*/

void mult( r, m, v )
double* r;
double* m;
double* v;
{
    r[0] = m[0] * v[0] + m[1] * v[1] + m[2] * v[2];
    r[1] = m[3] * v[0] + m[4] * v[1] + m[5] * v[2];
    r[2] = m[6] * v[0] + m[7] * v[1] + m[8] * v[2];
}

void add( r, a, b )
double* r;
double* a;
double* b;
{
    r[0] = a[0] + b[0];
    r[1] = a[1] + b[1];
    r[2] = a[2] + b[2];
}

main()
{
    double c[3];
    double m[9];
    double hold[3];
    int i;

    c[0] = 10.0;
    c[1] = c[2] = 0.0;

    m[0] = 1.0;
    m[1] = m[2] = m[3] = m[6] = 0.0;
    m[4] = m[7] = m[8] = 0.5;
    m[5] = -0.5;

    for( i = 0; i < 100000; i++ ){
        mult( hold, m, c );
        add( c, c, hold );
    }
}
daryl@hpcllla.HP.COM (Daryl Odnert) (09/21/89)
Note that only when both calls are in the loop are the values of the array
hold[] actually used in any computations.  (If you remove the call to mult(),
the array hold[] is never assigned to.)  When both calls remain in the loop,
the value of hold[0] increases very rapidly, while hold[1] and hold[2] remain
at 0.

What is happening is that by the 1022nd iteration of the loop, the value of
hold[0] has grown so large that it becomes an IEEE infinity value.  When this
happens, the floating-point instructions that use the infinity value as an
operand (I think) trap out to software, which takes a lot longer than
executing the FLOP in hardware.

Note: I didn't do anything fancy to figure this out.  I just placed the
following statement between the call to mult() and the call to add():

    printf("%d %g %g %g\n", i, hold[0], hold[1], hold[2]);

Thus, the performance degradation you are experiencing appears to be the
combined result of your program's behavior (infinity arithmetic) and what
the 835 can actually compute in hardware.

Hope this helps.

Daryl Odnert                    daryl%hpcllla@hplabs.hp.com
Hewlett Packard California Languages Lab
bobm@hpfcmgw.HP.COM (Bob Montgomery) (09/21/89)
Re: strange benchmark results on an 835

Please forgive a quick but incomplete response...  I verified your timings
on an 835 (HP-UX 3.1).  I think I know the reason for the time anomaly, but
I don't know why it's the reason.  How's that for a disclaimer?

When both the mult and add calls are in the loop, some of your numbers are
getting very large very quickly.  In fact, c[0] overflows on iteration 1020
(its value on iteration 1019 is 1.12356e+308) and hold[0] overflows on
iteration 1021 (its value on iteration 1020 is 1.12356e+308).  The largest
number that can be represented in a double is 1.7976(...)e+308.

On the Series 300 (HP-UX 6.5), the program terminates at this point with a
"Floating exception (core dumped)".  On the Series 800, the program runs to
completion, but I'll bet that an exception handler is being secretly invoked
on all subsequent iterations to "acknowledge and ignore" the overflow.  When
either the mult call or the add call is removed, the overflow does not occur.

I don't know the answers to the following questions:

a. Why does the 800 not terminate with an exception?  (Or why does the
   300 do so?)
b. Why does the suspected exception handler not result in system time
   clicks?  (/bin/time reports sys time as 0.0.)
c. Why does your program make such big numbers?  :-)

Bob "Not an 800 guru, but I play one on TV" Montgomery
HP Support or Something
Ft. Collins, CO
renglish@hpisod2.HP.COM (Robert English) (09/21/89)
> / peter@mit-amt.MEDIA.MIT.EDU (Peter Schroeder) / 10:38 am Sep 19, 1989 /
> in other words: the time the program needs to execute the loop with
> both calls is larger by a factor of 50 (!) than the sum of the
> loop with only the add() and the loop with only the mult()...
> Any explanation, workaround, etc. would be much appreciated!

The explanation is that your variables are overflowing!  To quote a friend
of mine who traced the loop:

> after the 1021st iteration of the loop, the inexact and overflow flags
> are set in the FPU status register. on every successive iteration, the
> invalid operation flag is set. The FPU passes the buck to emulation
> code on invalid operations...

You can see this yourself by starting this program up inside adb, stopping
it after a few seconds, and then using the $R command to examine the
contents of the floating-point status register.

The only workaround that I see is to check for overflow in add and/or
multiply.

--bob--
renglish@hpda.hp.com
jima@hplsla.HP.COM (Jim Adcock) (09/22/89)
I'm no expert on the 835, but I tried this program on my 370 and found it
caused a floating-point overflow at about loop iteration 1000.  Printing
out the values showed that they had grown huge.

Next I took the program to an 835 I have access to and found [surprisingly]
that the code ran without aborting, and took a long time, as quoted (about
46 seconds).  Suspecting that the problem still had something to do with
floating-point overflow, I multiplied all the floating-point constants by
1.0e-30 and found that the program then ran in less than a second.

So I suspect the problem lies with floating-point overflow, but I don't
understand why the 370 reports it and the 835 doesn't.

What was this code *supposed* to do?  Certainly you weren't intending to
cause a zillion floating-point overflows, were you?  [If so, they will be
slow.]
dhandly@hpcllz2.HP.COM (Dennis Handly) (09/23/89)
A similar question came up with a Fortran program.  It looked like some
relaxation technique.  The user took some of the SIN and COS function calls
out of the loop to see if it would go faster.  The end result was that with
fewer function calls, it took longer.  When I ran the same program on MPE XL,
Series 900, it aborted with an overflow.  Using the +T option on HP-UX also
caused the overflow.

It turned out that the SIN and COS calls kept the results in the range
-1..+1; without them, the values overflowed.
peter@mit-amt.MEDIA.MIT.EDU (Peter Schroeder) (09/25/89)
In article <3770023@hpcllz2.HP.COM> dhandly@hpcllz2.HP.COM (Dennis Handly)
writes:
>relaxation techniques. The user took some of the SIN and COS function calls
>out of the loop to see if it went faster? The end result was that with
>less function calls, it took longer. When I ran the same program on MPE XL,
>series 900, it aborted with an overflow. Using the +T option on HP-UX
>also caused the overflow.
>
>It turned out that the SIN and COS caused the result to be -1..+1

As many readers of this group have discovered, the problem was indeed an
overflow and the subsequent dispatch of exception handling to software.  I
wrote that loop for timing purposes on some C++ programs I had written and
never bothered to look at the actual result.

Now, the interesting point is that with the new ANSI/IEEE floating-point
behavior, bad floating-point operations don't cause core dumps anymore.  It
is possible to enable this property once again with a little assembly
program that Daryl (daryl@hpcllla.hp.com) sent to me.  [Daryl: can we post
this here, so everybody can take advantage of this fix?]

Thanks everyone for your help!

Peter
peter@media-lab.media.mit.edu
daryl@hpcllla.HP.COM (Daryl Odnert) (09/27/89)
> [Daryl: can we post this here, so everybody can take advantage of this fix?].
Yes, you may post the routine here, but it comes with the following
disclaimer: the subroutine is not officially blessed in any way by the
Hewlett-Packard Company.  It is only a contributed routine submitted
by an individual user of the HP 9000 Series 800.
I'll be glad to fix any problems with it that are discovered
by anyone who is using it (on my own time, of course.)
Daryl Odnert
daryl%hpcllla@hplabs.hp.com
daryl@hpcllla.HP.COM (Daryl Odnert) (09/27/89)
/*
** void enable_fp_traps( mask )
** unsigned mask;
**
** This routine sets the trap-enable bits in the HPPA
** floating-point coprocessor.  This routine expects a single
** unsigned value parameter that describes which traps to enable.
** Only the low-order 5 bits are of interest.  All other bits are ignored.
** The bits are assigned as follows:
**
**    0x1:  inexact result
**    0x2:  underflow
**    0x4:  overflow
**    0x8:  division by zero
**    0x10: invalid operation
**
** For example, to enable all floating-point traps, execute the following
** call from your C program:
**
**    enable_fp_traps( 0x1F );
**
** To compile this routine, just save it in a file with a .s suffix
** and invoke cc on it.
**
** For more information on the HPPA floating-point coprocessor, see
** chapter 6 of the HP Precision Architecture and Instruction Set
** Reference Manual, HP Part No. 09740-90014.
**
** Author: Daryl Odnert
** Unlimited permission to copy is granted.
** Mail problems/questions to:
**    daryl%hpcllla@hplabs.hp.com  or  hplabs!hpcllla!daryl
*/
        .SPACE $TEXT$
        .SUBSPA $CODE$
enable_fp_traps
        .PROC
        .EXPORT enable_fp_traps
        .CALLINFO CALLER,FRAME=8
        .ENTER
        LDO     -56(r30),r31
        FSTDS   fr0,0(0,r31)    /* store coprocessor status register */
        LDWS    0(0,r31),r25    /* load upper status word to r25 */
        DEP     r26,31,5,r25    /* deposit low 5 bits of mask into r25 */
        STWS    r25,0(0,r31)
        FLDDS   0(0,r31),fr0    /* reload status register, traps enabled */
        .LEAVE
        .PROCEND
        .END
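For completeness, here is a sketch of how a C program would call the
routine, following the usage described in its comment block.  This is my
own illustration, specific to the HP 9000 Series 800 (the assembled
enable_fp_traps.s must be linked into the program); it will not build
elsewhere.

    /* Usage sketch for Daryl's contributed routine (Series 800 only).
       With the overflow trap enabled, the runaway loop aborts instead
       of limping along in software infinity arithmetic. */
    extern void enable_fp_traps();

    main()
    {
        enable_fp_traps( 0x4 );     /* 0x4 = trap on overflow */
        /* ... the mult()/add() loop from the original posting ... */
    }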
daryl@hpcllla.HP.COM (Daryl Odnert) (09/28/89)
An important note for C programmers who enable floating-point traps:

The Series 800 C compiler (in HP-UX release 3.0, I think) has added a pragma
that tells the optimizer that floating-point traps have been enabled.  The
primary effect of this pragma is to prevent invariant floating-point
operations from being moved out of a loop by the optimizer.  The syntax is
as follows:

    #pragma FLOAT_TRAPS_ON <list of function names separated by commas>

The list of function names is not optional.  Unless a function is specified
in this list, the optimizer will assume it is safe to move floating-point
operations around.  (I have filed an enhancement request that will allow an
empty function name list to be interpreted as "all functions in this
compilation unit".)

Note that this pragma does not enable floating-point traps; it just tells
the optimizer that you have enabled them yourself (using an assembly routine
like the one I've posted in an earlier response.)

Daryl Odnert
HP California Languages Lab
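Putting Daryl's two posts together, a source file using the pragma would
look something like the sketch below.  This is only an illustration of the
syntax he describes; FLOAT_TRAPS_ON is specific to the HP-UX Series 800 cc,
and compute_loop is a hypothetical function name of mine, not from the
thread.

    /* Sketch only: tell the Series 800 optimizer that traps are live
       in compute_loop, so invariant FP operations are not hoisted. */
    #pragma FLOAT_TRAPS_ON compute_loop

    extern void enable_fp_traps();

    void compute_loop()
    {
        enable_fp_traps( 0x4 );     /* enable the overflow trap */
        /* ... floating-point work that should trap on overflow ... */
    }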