peter@mit-amt.MEDIA.MIT.EDU (Peter Schroeder) (09/20/89)
/*
** Consider the program below.  When compiling it on an HP9000/835 with
** "cc -O test.c -o test" and then doing a "time test" I get 44 seconds
** of runtime.  If I remove either the mult() call or the add() call from
** the loop (i.e. executing only one of the two function calls), the
** program runs so fast you can barely blink.  In other words: the time
** the program needs to execute the loop with both calls is larger by a
** factor of 50 (!) than the sum of the loop with only the add() and the
** loop with only the mult().
**
** I am not a compiler writer, but I find this rather surprising and,
** needless to say, extremely disappointing...  I discovered this using
** C++ and cfront 1.2.  The program below captures the essence of the
** C++ program, and indeed the timings are practically identical.
**
** Any explanation, workaround, etc. would be much appreciated!
**
** Peter
** peter@media-lab.media.mit.edu
*/

void mult( r, m, v )
double* r;
double* m;
double* v;
{
    r[0] = m[0] * v[0] + m[1] * v[1] + m[2] * v[2];
    r[1] = m[3] * v[0] + m[4] * v[1] + m[5] * v[2];
    r[2] = m[6] * v[0] + m[7] * v[1] + m[8] * v[2];
}

void add( r, a, b )
double* r;
double* a;
double* b;
{
    r[0] = a[0] + b[0];
    r[1] = a[1] + b[1];
    r[2] = a[2] + b[2];
}

main()
{
    double c[3];
    double m[9];
    double hold[3];
    int i;

    c[0] = 10.0;
    c[1] = c[2] = 0.0;

    m[0] = 1.0;
    m[1] = m[2] = m[3] = m[6] = 0.0;
    m[4] = m[7] = m[8] = 0.5;
    m[5] = -0.5;

    for( i = 0; i < 100000; i++ ){
        mult( hold, m, c );
        add( c, c, hold );
    }
}
daryl@hpcllla.HP.COM (Daryl Odnert) (09/21/89)
Note that only when both calls are in the loop are the values of the array
hold[] actually used in any computations.  (If you remove the call to mult(),
the array hold[] is never assigned to.)  When both calls remain in the loop,
the value of hold[0] increases very rapidly, while hold[1] and hold[2] remain
at 0.

What is happening is that by the 1022nd iteration of the loop, the value of
hold[0] has grown so large that it becomes an IEEE infinity value.  When this
happens, the floating-point instructions that use the infinity value as an
operand (I think) trap out to software, which takes a lot longer than
executing the FLOP in hardware.

Note: I didn't do anything fancy to figure this out.  I just placed the
following statement between the call to mult() and the call to add():

    printf("%d %g %g %g\n", i, hold[0], hold[1], hold[2]);

Thus, the performance degradation you are experiencing appears to be the
combined result of your program's behavior (infinity arithmetic) and what
the 835 can actually compute in hardware.

Hope this helps.

Daryl Odnert                    daryl%hpcllla@hplabs.hp.com
Hewlett Packard California Languages Lab
bobm@hpfcmgw.HP.COM (Bob Montgomery) (09/21/89)
Re: strange benchmark results on an 835

Please forgive a quick but incomplete response...  I verified your timings
on an 835 (HP-UX 3.1).  I think I know the reason for the time anomaly, but
I don't know why it's the reason.  How's that for a disclaimer?

When both the mult and add calls are in the loop, some of your numbers are
getting very large very quickly.  In fact, c[0] overflows on iteration 1020
(its value on iteration 1019 is 1.12356e+308) and hold[0] overflows on
iteration 1021 (its value on iteration 1020 is 1.12356e+308).  The largest
number that can be represented in a double is 1.7976(...)e+308.

On the Series 300 (HP-UX 6.5), the program terminates at this point with a
"Floating exception (core dumped)".  On the Series 800, the program runs to
completion, but I'll bet that an exception handler is being secretly invoked
on all subsequent iterations to "acknowledge and ignore" the overflow.  When
either the mult call or the add call is removed, the overflow does not occur.

I don't know the answers to the following questions:

a. Why does the 800 not terminate with an exception?  (Or why does the
   300 do so?)
b. Why does the suspected exception handler not result in system time
   clicks?  (/bin/time reports sys time as 0.0.)
c. Why does your program make such big numbers?  :-)

Bob "Not an 800 guru, but I play one on TV" Montgomery
HP Support or Something
Ft. Collins, CO
renglish@hpisod2.HP.COM (Robert English) (09/21/89)
> / peter@mit-amt.MEDIA.MIT.EDU (Peter Schroeder) / 10:38 am Sep 19, 1989 /
> in other words: the time the program needs to execute the loop with
> both calls is larger by a factor of 50 (!) than the sum of the
> loop with only the add() and the loop with only the mult()...
> Any explanation, workaround, etc. would be much appreciated!

The explanation is that your variables are overflowing!  To quote a friend
of mine who traced the loop:

> after the 1021st iteration of the loop, the inexact and overflow flags
> are set in the FPU status register. on every successive iteration, the
> invalid operation flag is set. The FPU passes the buck to emulation
> code on invalid operations...

You can see this yourself by starting this program up inside adb, stopping
it after a few seconds, and then using the $R command to examine the
contents of the floating-point status register.

The only workaround that I see is to check for overflow in add and/or
multiply.

--bob--
renglish@hpda.hp.com
jima@hplsla.HP.COM (Jim Adcock) (09/22/89)
I'm no expert on the 835, but I tried this program on my 370 and found it
caused a floating-point overflow at about loop iteration 1000.  Printing
out the values showed that they had grown huge.

Next I took the program to an 835 I have access to and found [surprisingly]
that the code ran without aborting, and took a long time, as quoted (about
46 seconds).  Suspecting that the problem still had something to do with
floating-point overflow, I multiplied all the floating-point constants by
1.0e-30 and found that the program then ran in less than a second.

So I suspect the problem lies with floating-point overflow, but I don't
understand why the 370 reports it and the 835 doesn't.

What was this code *supposed* to do?  Certainly you weren't intending to
cause a zillion floating-point overflows, were you?  [If so, they will be
slow.]
dhandly@hpcllz2.HP.COM (Dennis Handly) (09/23/89)
A similar question came up with a Fortran program.  It looked like some
relaxation technique.  The user took some of the SIN and COS function calls
out of the loop to see if it would go faster.  The end result was that with
fewer function calls, it took longer.  When I ran the same program on MPE XL,
Series 900, it aborted with an overflow.  Using the +T option on HP-UX also
caused the overflow.

It turned out that the SIN and COS calls kept the results in the range
-1..+1; without them, the values overflowed.
peter@mit-amt.MEDIA.MIT.EDU (Peter Schroeder) (09/25/89)
In article <3770023@hpcllz2.HP.COM> dhandly@hpcllz2.HP.COM (Dennis Handly)
writes:
>relaxation techniques. The user took some of the SIN and COS function calls
>out of the loop to see if it went faster? The end result was that with
>less function calls, it took longer. When I ran the same program on MPE XL,
>series 900, it aborted with an overflow. Using the +T option on HP-UX
>also caused the overflow.
>
>It turned out that the SIN and COS caused the result to be -1..+1

As many readers of this group have discovered, the problem was indeed an
overflow and the subsequent dispatch of exception handling to software.  I
wrote that loop for timing purposes on some C++ programs I had written and
never bothered to look at the actual result.

Now, the interesting point is that with the new ANSI/IEEE floating-point
behavior, bad floating-point operations don't cause core dumps anymore.  It
is possible to enable this property once again with a little assembly
program that Daryl (daryl@hpcllla.hp.com) sent to me.  [Daryl: can we post
this here, so everybody can take advantage of this fix?]

Thanks everyone for your help!

Peter
peter@media-lab.media.mit.edu
daryl@hpcllla.HP.COM (Daryl Odnert) (09/27/89)
> [Daryl: can we post this here, so everybody can take advantage of this fix?].
Yes, you may post the routine here, but it comes with the following
disclaimer: the subroutine is not officially blessed in any way by the
Hewlett-Packard Company.  It is only a contributed routine submitted
by an individual user of the HP 9000 Series 800.
I'll be glad to fix any problems with it that are discovered
by anyone who is using it (on my own time, of course.)
Daryl Odnert
daryl%hpcllla@hplabs.hp.com
daryl@hpcllla.HP.COM (Daryl Odnert) (09/27/89)
/*
** void enable_fp_traps( mask )
** unsigned mask;
**
** This routine sets the trap-enable bits in the HPPA
** floating-point coprocessor.  This routine expects a single
** unsigned value parameter that describes which traps to enable.
** Only the low-order 5 bits are of interest.  All other bits are ignored.
** The bits are assigned as follows:
**
**    0x1:  inexact result
**    0x2:  underflow
**    0x4:  overflow
**    0x8:  division by zero
**    0x10: invalid operation
**
** For example, to enable all floating-point traps, execute the following
** call from your C program:
**
**    enable_fp_traps( 0x1F );
**
** To compile this routine, just save it in a file with a .s suffix
** and invoke cc on it.
**
** For more information on the HPPA floating-point coprocessor, see
** chapter 6 of the HP Precision Architecture and Instruction Set
** Reference Manual, HP Part No. 09740-90014.
**
** Author: Daryl Odnert
** Unlimited permission to copy is granted.
** Mail problems/questions to:
**    daryl%hpcllla@hplabs.hp.com  or  hplabs!hpcllla!daryl
*/
        .SPACE $TEXT$
        .SUBSPA $CODE$
enable_fp_traps
        .PROC
        .EXPORT enable_fp_traps
        .CALLINFO CALLER,FRAME=8
        .ENTER
        LDO     -56(r30),r31
        FSTDS   fr0,0(0,r31)    /* store coprocessor status register */
        LDWS    0(0,r31),r25    /* load upper status word to r25 */
        DEP     r26,31,5,r25    /* deposit low 5 bits of mask into r25 */
        STWS    r25,0(0,r31)
        FLDDS   0(0,r31),fr0    /* reload status register, traps enabled */
        .LEAVE
        .PROCEND
        .END
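For completeness, here is a sketch of how a C program would call the
routine, following the usage described in its comment block.  This is my
own illustration, specific to the HP 9000 Series 800 (the assembled
enable_fp_traps.s must be linked into the program); it will not build
elsewhere.

    /* Usage sketch for Daryl's contributed routine (Series 800 only).
       With the overflow trap enabled, the runaway loop aborts instead
       of limping along in software infinity arithmetic. */
    extern void enable_fp_traps();

    main()
    {
        enable_fp_traps( 0x4 );     /* 0x4 = trap on overflow */
        /* ... the mult()/add() loop from the original posting ... */
    }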
daryl@hpcllla.HP.COM (Daryl Odnert) (09/28/89)
An important note for C programmers who enable floating-point traps:

The Series 800 C compiler (in HP-UX release 3.0, I think) has added a pragma
that tells the optimizer that floating-point traps have been enabled.  The
primary effect of this pragma is to prevent invariant floating-point
operations from being moved out of a loop by the optimizer.  The syntax is
as follows:

    #pragma FLOAT_TRAPS_ON <list of function names separated by commas>

The list of function names is not optional.  Unless a function is specified
in this list, the optimizer will assume it is safe to move floating-point
operations around.  (I have filed an enhancement request that will allow an
empty function name list to be interpreted as "all functions in this
compilation unit".)

Note that this pragma does not enable floating-point traps; it just tells
the optimizer that you have enabled them yourself (using an assembly routine
like the one I've posted in an earlier response.)

Daryl Odnert
HP California Languages Lab
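Putting Daryl's two posts together, a source file using the pragma would
look something like the sketch below.  This is only an illustration of the
syntax he describes; FLOAT_TRAPS_ON is specific to the HP-UX Series 800 cc,
and compute_loop is a hypothetical function name of mine, not from the
thread.

    /* Sketch only: tell the Series 800 optimizer that traps are live
       in compute_loop, so invariant FP operations are not hoisted. */
    #pragma FLOAT_TRAPS_ON compute_loop

    extern void enable_fp_traps();

    void compute_loop()
    {
        enable_fp_traps( 0x4 );     /* enable the overflow trap */
        /* ... floating-point work that should trap on overflow ... */
    }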