[comp.lang.fortran] Pipelined FP add

pmontgom@sonia.math.ucla.edu (Peter Montgomery) (12/17/89)

In article <28413@amdcad.AMD.COM> davec@cayman.amd.com (Dave Christie) writes:
>
>Quite true, but (IMHO) for most applications, int<->fp traffic is not very
>high, so having to explicitly move some data around is no big deal ... as
>long as it is done reasonably efficiently so that you don't blow one or two
>applications out of the water.
>
	I have written a multiple precision integer arithmetic package.
Its time-critical routines are conditionally compiled so
the algorithms can be individually optimized for different machines 
(yes, I resemble Herman Rubin in wanting full access to machine
instructions from high-level languages).  On some machines,
almost all arithmetic is integer.  Recently, while transporting 
my program to an Alliant, I vectorized some of these codes.  
A frequent primitive operation requires finding integers X, Y such that 
X*2**30 + Y = A*B + C, where A, B, C are given (vectors of) integers
less than 2**30 (2**30 is the radix for the multiple-precision arithmetic).
The Alliant has a vectorized integer multiply instruction, but
only for the lower 32 bits of a product.  Hence I can get 
Y = IAND(A*B + C, 2**30 - 1) with vectorized integer operations.
To get X (upper half of product) using vector operations, 
I use floating point, such as

	X = NINT((DBLE(A)*DBLE(B) + DBLE(C - Y))*0.5**30)

(DBLE = convert integer to double, NINT = convert double to nearest integer).
This statement uses 4 conversions between integer and floating point
while doing only 3 floating point operations.
The vectorized DBLE is compiled inline, but the vectorized NINT
is done in a subroutine; the NINT subroutine uses 10% of my program's time
(but total program time is down from before vectorization).

	BTW, I have asked the X3J3 committee to add this operation 
(i.e., given A, B, C, D, all nonnegative, and either A < D or B < D,
find X, Y such that X*D + Y = A*B + C) to Fortran 8x.


--------
        Peter Montgomery
        pmontgom@MATH.UCLA.EDU 
	Department of Mathematics, UCLA, Los Angeles, CA 90024

davec@proton.amd.com (Dave Christie) (12/18/89)

  
In article <2100@sunset.MATH.UCLA.EDU> pmontgom@math.ucla.edu (Peter Montgomery) writes:
#In article <28413@amdcad.AMD.COM> davec@cayman.amd.com (Dave Christie) writes:
#>
#>Quite true, but (IMHO) for most applications, int<->fp traffic is not very
#>high, so having to explicitly move some data around is no big deal ... as
#>long as it is done reasonably efficiently so that you don't blow one or two
#>applications out of the water.
#>
#	I have written a multiple precision integer arithmetic package.
#Its time-critical routines are conditionally compiled so
#the algorithms can be individually optimized for different machines 
#(yes, I resemble Herman Rubin in wanting full access to machine
#instructions from high-level languages).  On some machines,

Yes, multiple precision seems to be one of the applications you may
want to be careful about (Herman Rubin replied as well!).  My rash 
generalization comes from a former life at a TLA mainframe company 
where I did detailed instruction frequency measurements of a wide range
of real-world Fortran programs - such things as structural analysis, 
flight simulation, ECAD, many other flavors (but not likely multiple 
precision - notice I'm not using the words "representative" or 
"comprehensive" ;-)  This indicated (among many other things) that the 
frequency of converts would not be a big factor in deciding whether 
to go with separate or shared registers, were one to design a new
general-purpose ISA - other factors carry more weight.  Of course,
your mileage may vary.  What one sees fit to do generally depends
on what one thinks will bring in the most bucks, right?  

#A frequent primitive operation requires finding integers X, Y such that 
#X*2**30 + Y = A*B + C where A, B, C are given (vectors of) integers
#up to 2^30 (2^30 is the radix for multiple-precision arithmetic).
#The Alliant has a vectorized integer multiply instruction, but
#only for the lower 32 bits of a product.  Hence I can get 
#Y = IAND(A*B + C, 2**30 - 1) with vectorized integer operations.
#To get X (upper half of product) using vector operations, 
#I use floating point, such as
#
#	X = NINT((DBLE(A)*DBLE(B) + DBLE(C - Y))*0.5**30)
#
#(DBLE = convert integer to double, NINT = convert double to nearest integer).
#This statement uses 4 conversions between integer and floating point
#while doing only 3 floating point operations.

This sounds like a nice extreme case.  Now I'm curious - does anyone know,
or care to figure out (I'm not curious enough to take the time myself :-)
whether the simultaneous integer/fp issue capability of the i860 would 
more than compensate for the penalty on converts caused by the separate 
int/fp registers, even on this extreme piece of code?

#        Peter Montgomery
#        pmontgom@MATH.UCLA.EDU 
#	Department of Mathematics, UCLA, Los Angeles, CA 90024

------------
Dave Christie
The opinions are mine, the facts belong to everyone .... oops - sorry,
the facts are proprietary.....