[comp.arch] Alignment on RS/6000

aglew@crhc.uiuc.edu (Andy Glew) (11/20/90)

Newsgroups: info.rs6000
From: U644401@hnykun11.bitnet (Wilfred Janssen)
Subject: performance breakdown due to misalignment
Original-To: Multiple recipients of list POWER-L <POWER-L@NDSUVM1>
Reply-To: POWER-L IBM RS/6000 POWER Family <POWER-L@VM1.NoDak.EDU>
Organization: University of Illinois at Urbana
Distribution: info
Date: Mon, 19 Nov 90 14:20:57 MET

We experienced a problem with our RS6000 model 320. Some
FORTRAN applications ran at .2 Mflop/s, instead of at the usual
7.5 Mflop/s.

We found that this was caused by a misalignment of
double precision arrays. As an illustration we include a little
test program, which calls the routine DROT (in library BLAS).
When the arrays X and Y start at a double word boundaries (I0 = 0),
the program requires 7.5 sec.
Shift the begin address by one byte (I0 = 1) and the routine
requires 398 sec!

                                          Wilfred Janssen
                                          Paul Wormer
-----------------------------------------------------------------

C TEST TO ILLUSTRATE THE PENALTY OF MISALIGNMENT ON THE RS6000.
      PARAMETER (KILO=1024)
      CHARACTER*1 SPACE(0:16*KILO)
C------------------------------------------------------------------
C THE FOLLOWING STATEMENT IS THE CULPRIT, CHANGE I0 TO 0 AND THE
C PROGRAM RUNS 53 TIMES FASTER!
C------------------------------------------------------------------
      I0 = 1
      I1 = I0 + 8*KILO
      CALL EXEC(SPACE(I0), SPACE(I1), KILO)
      END

      SUBROUTINE EXEC(X, Y, N)
      IMPLICIT DOUBLE PRECISION (A-H,O-Z)
      DIMENSION X(N), Y(N)

      DO 10 I=1,N
         X(I) =  1.D0
         Y(I) = -1.D0
   10 CONTINUE
      C =  SQRT(0.5D0)
      S = -C
      DO 20 I=1,N*20
         CALL DROT(N, X,1, Y,1, C,S)
   20 CONTINUE
      END


--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

aglew@crhc.uiuc.edu (Andy Glew) (11/20/90)

I'll bet the misalignment slowdown noted in the last post was due to
trapping and handling the misalignment in the trap handler.


But it makes you think: is it really worth being backward compatible
if there is such a huge performance penalty?  Can you imagine the
novice user who would not know how to look for such a problem?  Can
you imagine UNIX Review publishing a benchmark with that sort of
problem?


Put another way, maybe such backward compatibility should not be
present by default.  Maybe the program should die with the message
    
    Error: misaligned accesses. "fort -misaligned" may be used.

so that the user would have to go and turn on the misaligned compatbility
mode, and would be aware of the possible performance costs.

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (11/20/90)

>>>>> On 20 Nov 90 05:45:34 GMT, aglew@crhc.uiuc.edu (Andy Glew) said:

[concerning the huge performance penalty on the RS/6000 for misaligned
accesses] 
Andy> I'll bet the misalignment slowdown noted in the last post was due to
Andy> trapping and handling the misalignment in the trap handler.

Most likely.

Andy> But it makes you think: is it really worth being backward compatible
Andy> if there is such a huge performance penalty?  Can you imagine the
Andy> novice user who would not know how to look for such a problem?  Can
Andy> you imagine UNIX Review publishing a benchmark with that sort of
Andy> problem?

Anyone who writes code like the test case presented (which calls a
double-precision arithmetic routine and passes it a character*1
variable array to work on!) deserves whatever miserable performance
that they get.

Andy> Put another way, maybe such backward compatibility should not be
Andy> present by default.  Maybe the program should die with the message
Andy>     
Andy>     Error: misaligned accesses. "fort -misaligned" may be used.
Andy> so that the user would have to go and turn on the misaligned compatbility
Andy> mode, and would be aware of the possible performance costs.

The example given was not terribly representative of real codes.  A
*much* more likely cause of poor alignment is placing a
double-precision entity after an odd number of integer items in a
common block.  In these cases the user is given an informative error
message by the compiler (or loader?).

I don't understand why IBM chose not to include an 'auto-double' flag
in the RS/6000 FORTRAN compiler.  If 'auto-double' upgraded INTEGERs
to 64-bits as well as REALs, then the alignment problem would be
solved on the overwhelming majority of real codes.

P.S. How many people out there caught my quotes in UNIX TODAY (November
12, page 8), complete with a misspelled name.  At least my comments
were transmitted accurately!
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@vax1.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

ian@sibyl.eleceng.ua.OZ (Ian Dall) (11/21/90)

In article <AGLEW.90Nov19234534@lasso.crhc.uiuc.edu> aglew@crhc.uiuc.edu (Andy Glew) writes:
>I'll bet the misalignment slowdown noted in the last post was due to
>trapping and handling the misalignment in the trap handler.
>
>But it makes you think: is it really worth being backward compatible
>if there is such a huge performance penalty?

In a similar vein, how many people have been caught by a floating point
program taking "forever" on a sparc (no doubt othe machines as well)
because it was spending all it's time doing NaN and Inf exception
handling?

I know the handling of these faults can be changed, but the point is
that the result takes *so* much longer to calculate as to be useless.
Much better to make the default to core dump on a floating point
exception.

-- 
Ian Dall     life (n). A sexually transmitted disease which afflicts
                       some people more severely than others.       

gillies@m.cs.uiuc.edu (11/22/90)

I think Andy@Theory.Stanford.Edu has a very warped perspective on
history.  IBM fought 360/370 compatibles every step of the way,
including refusing to release I/O specifications and refusing to
service machines with 3rd party devices and refusing to renew
maintenance contracts on machines owned by customers who purchased
3rd-party devices.

This proprietary dictatorship led to a 10-year government battle to
disband IBM as an illegal monopolistic trust.

IBM is STILL fighting to keep others from making exploiting clones of
their architecture, all the way up until a few years back when they
were forced (by a lawsuit involving hitachi) to offer their operating
system for public sale.

IBM is not now nor ever has been an open architecture company.

khb@chiba.Eng.Sun.COM (Keith Bierman fpgroup) (11/22/90)

In article <893@sibyl.eleceng.ua.OZ> ian@sibyl.eleceng.ua.OZ (Ian Dall) writes:


   In a similar vein, how many people have been caught by a floating point
   program taking "forever" on a sparc (no doubt othe machines as well)
   because it was spending all it's time doing NaN and Inf exception
   handling?

Only underflow costs a lot.

   I know the handling of these faults can be changed, but the point is
   that the result takes *so* much longer to calculate as to be useless.
   Much better to make the default to core dump on a floating point
   exception.

Well, there is compliance with the word and spirit of IEEE 754 to
consider. Since there are enough folks who disagree with the
committee, current compilers (C1.0, f77v1.3, etc.)  allow one to
compile -fnonstd to get that behavior.

--
----------------------------------------------------------------
Keith H. Bierman    kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33			 | (415 336 2648)   
    Mountain View, CA 94043

baum@Apple.COM (Allen J. Baum) (11/22/90)

[]
In article <893@sibyl.eleceng.ua.OZ> ian@sibyl.eleceng.ua.OZ (Ian Dall) writes:
>
>   In a similar vein, how many people have been caught by a floating point
>   program taking "forever" on a sparc (no doubt othe machines as well)
>   because it was spending all it's time doing NaN and Inf exception
>   handling?

Its always been my impression that NaN, Inf, and Denorms were reasonably
rare- they don't usually occur. Is this not the case? Are there statistics
somewhere that show how often these cases occur?

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

cet1@cl.cam.ac.uk (C.E. Thompson) (11/23/90)

In article <MCCALPIN.90Nov20092451@pereland.cms.udel.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
>
>The example given was not terribly representative of real codes.  A
>*much* more likely cause of poor alignment is placing a
>double-precision entity after an odd number of integer items in a
>common block.  In these cases the user is given an informative error
>message by the compiler (or loader?).
>
But this specific case (double floating point load/store access on a
multiple of 4 which is not a multiple of 8) *is* dealt with in hardware on
the RS/6000, isn't it? Reference not immediately to hand, but surely I didn't
dream it?

Chris Thompson
JANET:    cet1@uk.ac.cam.phx
Internet: cet1%phx.cam.ac.uk@nsfnet-relay.ac.uk

ian@sibyl.eleceng.ua.OZ (Ian Dall) (11/25/90)

In article <46760@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>[]
>In article <893@sibyl.eleceng.ua.OZ> ian@sibyl.eleceng.ua.OZ (Ian Dall) writes:
>>
>>   In a similar vein, how many people have been caught by a floating point
>>   program taking "forever" on a sparc (no doubt othe machines as well)
>>   because it was spending all it's time doing NaN and Inf exception
>>   handling?
>
>Its always been my impression that NaN, Inf, and Denorms were reasonably
>rare- they don't usually occur. Is this not the case? Are there statistics
>somewhere that show how often these cases occur?

In a well behaved program they are extremely rare. Therefore, if they occur
you almost certainly have a bug and you might as well core dump. Once one
occurs, they tend to be contageous, so that soon your program is trapping
on every floating point operation. *That* is what makes it take "forever".

The output of such a program is useless. The results are meaningless and
worse, they give you no clue as to where the problem might be. A core dump
is preferable except maybe in commercial software, and even then it only
gives an illusion of robustness.


-- 
Ian Dall     life (n). A sexually transmitted disease which afflicts
                       some people more severely than others.       
  ACSnet: ian@sibyl.eleceng.ua.oz
internet: ian@sibyl.eleceng.ua.oz.au

jkenton@pinocchio.encore.com (Jeff Kenton) (11/26/90)

From article <897@sibyl.eleceng.ua.OZ>, by ian@sibyl.eleceng.ua.OZ (Ian Dall):
> In article <46760@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>>
>>Its always been my impression that NaN, Inf, and Denorms were reasonably
>>rare- they don't usually occur. Is this not the case? Are there statistics
>>somewhere that show how often these cases occur?
> 
> In a well behaved program they are extremely rare. Therefore, if they occur
> you almost certainly have a bug and you might as well core dump.

This is probably true for NaN's and Infinities, but Denorms occur all the
time.  printf() often generates them when dealing with floats, and the
math library routines do also.

With denorms, the real issue is whether a particular program cares about
numbers in the denormalized range.  Most don't and would work just as well
(and faster) if there were an *option* to allow underflows to truncate to
zero.  "You might as well dump core" is not the thing to do.

Note that I am *not* suggesting we eliminate denorms.  They make good sense
mathematically -- they're just slow on many RISC machines.


----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----
-----  jeff kenton:    	consulting at jkenton@pinocchio.encore.com  -----
-----		        until 11/30/90 -- always at (617) 894-4508  -----
----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----

peter@ficc.ferranti.com (Peter da Silva) (11/26/90)

Actually, I'd think there was one environment where all the fancy trapping
and fixup for IEEE floating point makes sense: an interpreter.
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com 

khb@chiba.Eng.Sun.COM (Keith Bierman fpgroup) (11/27/90)

In article <AGLEW.90Nov26130207@cobra.crhc.uiuc.edu> aglew@crhc.uiuc.edu (Andy Glew) writes:


   Hmmm.... what should a spreadsheet do when the user has said to
   divide by 0?

If memory serves, the solution is already published,

                  +Inf      x is positive
	x/0 ==     NaN      x is zero
                  -Inf      x is negative


--
----------------------------------------------------------------
Keith H. Bierman    kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33			 | (415 336 2648)   
    Mountain View, CA 94043

aglew@crhc.uiuc.edu (Andy Glew) (11/27/90)

>In a well behaved program they are extremely rare. Therefore, if they occur
>you almost certainly have a bug and you might as well core dump. Once one
>occurs, they tend to be contageous, so that soon your program is trapping
>on every floating point operation. *That* is what makes it take "forever".
>
>The output of such a program is useless. The results are meaningless and
>worse, they give you no clue as to where the problem might be. A core dump
>is preferable except maybe in commercial software, and even then it only
>gives an illusion of robustness.

Hmmm.... what should a spreadsheet do when the user has said to divide by 0?
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

baum@Apple.COM (Allen J. Baum) (11/28/90)

[]
>In article <13342@encore.Encore.COM> jkenton@pinocchio.encore.com (Jeff Kenton) writes:
>
>This is probably true for NaN's and Infinities, but Denorms occur all the
>time.  printf() often generates them when dealing with floats, and the
>math library routines do also.

I still don't understand the assertion that Denorms occur all the time.
Just because printf() knows how to handle them doesn't mean that it
happens very often. Likewise the math routines. What is your evidence
than it happens "often" (whatever that means- 1% is incredibly often by
my definition in this case)

If you are saying that printf() and the math libraries can produce intermediate
results internally which are denorm, even if the inputs and outputs aren't,
would it be possible to re-write them to avoid it (given the impetus that
perhaps denorms are really slow)?


--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

khb@chiba.Eng.Sun.COM (chiba) (11/28/90)

In article <46866@apple.Apple.COM> baum@Apple.COM (Allen J. Baum) writes:


   I still don't understand the assertion that Denorms occur all the time.

In some applications they simply do. I have often seen this in
situations where one is solving for deviations from a nominal and
cases where someone is using a very small stopping criterion. 

Those who have codes like this, usually think everything is OK, worked
fine on their ----- (fill in CDC, Honeywell, Univac, VAX, etc.) and
are perfectly happy with flush to zero semantics.
--
----------------------------------------------------------------
Keith H. Bierman    kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33			 | (415 336 2648)   
    Mountain View, CA 94043

ian@sibyl.eleceng.ua.OZ (Ian Dall) (11/29/90)

In article <AGLEW.90Nov26130207@cobra.crhc.uiuc.edu> aglew@crhc.uiuc.edu (Andy Glew) writes:
>I wrote:
>> Programs should core dump instead of emulating IEEE NaN and Inf handling.
>
>Hmmm.... what should a spreadsheet do when the user has said to divide by 0?

I conceed that there are times when the IEEE defined behaviour is
desirable.  Basically, any interactive program which does trivial
amounts of floating point are better off generating NaNs than core
dumping. One such case which I ran into was in Gdb. I have a machine
which *does* core dump rather than generate NaNs and Infs. It is very
annoying to have gdb core dump because you asked it to print a union
which doesn't happen to currently have a floating point quantity in a
floating element.

Don't misinterprete my position though. I never claimed that one
shouldn't be able to alter the behaviour. Certainly the programmer
ought to be able to specify alternate trap handlers. In the case of
your spreadsheet, you probably don't want all the dependent cells
filled with NaN's and Inf's any more than you want a core dump. What
you really want is to enter some more "user friendly" handler of
errors.

As some people have said, it is possible in Suns unbundled compilers
to specify either behaviour as a compiler option, which is, I think, a
step in the right direction. Especially since the best option when
developing a program may not be the best for the final "production"
version.

-- 
Ian Dall     life (n). A sexually transmitted disease which afflicts
                       some people more severely than others.       
  ACSnet: ian@sibyl.eleceng.ua.oz
internet: ian@sibyl.eleceng.ua.oz.au

jkenton@pinocchio.encore.com (Jeff Kenton) (11/29/90)

From article <46866@apple.Apple.COM>, by baum@Apple.COM (Allen J. Baum):
> []
>>In article <13342@encore.Encore.COM> jkenton@pinocchio.encore.com (Jeff Kenton) writes:
>>
>>This is probably true for NaN's and Infinities, but Denorms occur all the
>>time.  printf() often generates them when dealing with floats, and the
>>math library routines do also.
> 
> If you are saying that printf() and the math libraries can produce intermediate
> results internally which are denorm, even if the inputs and outputs aren't,
> would it be possible to re-write them to avoid it (given the impetus that
> perhaps denorms are really slow)?
> 


That's what I meant.  Most of them could be re-written for RISC machines
which do denorms in software, or could request flushing underflow to zero
if there was such an option.













----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----
-----  jeff kenton:    	consulting at jkenton@pinocchio.encore.com  -----
-----		        until 11/30/90 -- always at (617) 894-4508  -----
----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----

jgk@osc.COM (Joe Keane) (11/30/90)

In article <46760@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>Its always been my impression that NaN, Inf, and Denorms were reasonably
>rare- they don't usually occur. Is this not the case? Are there statistics
>somewhere that show how often these cases occur?

In article <897@sibyl.eleceng.ua.OZ> ian@sibyl.OZ (Ian Dall) writes:
>In a well behaved program they are extremely rare. Therefore, if they occur
>you almost certainly have a bug and you might as well core dump. Once one
>occurs, they tend to be contageous, so that soon your program is trapping
>on every floating point operation. *That* is what makes it take "forever".
>
>The output of such a program is useless. The results are meaningless and
>worse, they give you no clue as to where the problem might be. A core dump
>is preferable except maybe in commercial software, and even then it only
>gives an illusion of robustness.

Ugh, i don't agree with any of this.  Real functions (exp, log, etc.) have
points at which the result is infinite or undefined.  There's no way to avoid
this; the question is, how do you want to deal with it?  The results are
certainly not meaningless, the whole point of IEEE exceptional values is that
such exceptional values are handled in a reasonable way.  These conditions can
be tested for by the program and an appropriate action taken.

Now maybe many simple programs don't want to deal with such conditions in an
intelligent way, and dumping core is an acceptable solution.  If i was using a
spreadsheet or calculator program and it dumped core on me, then i'd be mad at the author for such a dumb program.

donc@microsoft.UUCP (Don CORBITT) (12/07/90)

In article <46866@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>[]
>>In article <13342@encore.Encore.COM> jkenton@pinocchio.encore.com (Jeff Kenton) writes:
>>
>>This is probably true for NaN's and Infinities, but Denorms occur all the
>>time.  printf() often generates them when dealing with floats, and the
>>math library routines do also.
>
>I still don't understand the assertion that Denorms occur all the time.
>Just because printf() knows how to handle them doesn't mean that it
>happens very often. Likewise the math routines. What is your evidence
>than it happens "often" (whatever that means- 1% is incredibly often by
>my definition in this case)
>
>
>--
>		  baum@apple.com		(408)974-3385
>{decwrl,hplabs}!amdahl!apple!baum

"In a previous life" I developed 3D graphics applications on PC's.  We did
lots of 4x4 matrix multiplication and inversions, concatenating matrices
of our hierarchical models, etc.  Although we didn't have 1% denormals, we
certainly had enough of them that a bug in the Intel '287 math chip caused
problems until we worked around it.  (As I recall, it would store a denormal
properly, but when reading one it converted it to NaN.  The fix was to trap
the denormal exception, and flush to zero.)  These denormals were data 
dependent, based on user manipulation of their models.  A simple way to
produce it is to rotate about X 90.0 degrees, then rotate about X -90.0
degrees.  This often converts 0.0 to a denormal tiny value.

It would be a Bad Thing if denormals caused fatal traps.

--
Don Corbitt - donc@microsoft.com - Not a spokesmodel for MS.
mail flames, post retractions.  Support short signatures, 3 lines max.