[net.arch] Integer multiplies

pmontgom@sdcrdcf.UUCP (Peter Montgomery) (07/01/85)

References:

Admittedly, some programs do very few integer multiplications except while
calculating subscripts and converting outputs to decimal notation.  My
"number crunchers" do very little floating point arithmetic but integer
multiplication is an essential instruction for them.  These programs
implement integer factorization algorithms, and do multiprecision
arithmetic modulo the number being factored (N).  Typically, N has 100
decimal digits (several machine words).  The most important operations are
A+B mod N, A-B mod N, and A*B mod N (the remainders of A+B, A-B, and A*B
when divided by N).  Multiplication remains the most expensive of these
operations, despite attempts (see my article "Modular Multiplication
Without Trial Division" in the April, 1985, "Mathematics of Computation")
to speed it up.  On the VAX 11/780, the inner loop

1:
	emul	r2,(r6)+,r1,r0		# (r0,r1) = mul2*ar2(i) + carry
	emul	r3,(r7)+,(r8)+,r4	# (r4,r5) = mul3*ar3(i) + tmparr(i)
	addl2	r4,r0
	adwc	r5,r1			# quad = (r0,r1) + (r4,r5)
	bicl3   r11,r0,-8(r8)           # tmparr(i-1) = lower 30 bits of quad
	ashq    $2,r0,r0                # r1 = carry = quad/2^30
	sobgtr  r9,1b                   # End of loop

uses EMUL (multiply two signed longwords, add a third longword) more than
any other instruction.

        For those of you implementing instruction sets, please don't
require us to use floating point operations for integer multiplication.
Our algorithms require EXACT answers.  Double length results (e.g., VAX
EMUL instruction) are desirable, but if they are unavailable then we want
the LOWER bits of the product.  Floating point operations return the UPPER
bits.  As an example, we may know that the determinant of the single
precision integer matrix

		    A   B
		    C   D

will be a single precision integer, even though the intermediate results
(A*D and B*C) overflow.  If we use single precision floating point
arithmetic (e.g, as required on the CRAY II), then we lose essentially all
precision when subtracting intermediate results, and the computation is a
waste.  If instead the computations are modulo the machine word size (2^64
on the CRAY), then the final result will be what we want despite the
overflow.


-- 
			Peter Montgomery

	{aero,allegra,bmcg,burdvax,hplabs,
	 ihnp4,psivax,randvax,sdcsvax,trwrb}!sdcrdcf!pmontgom

Don't blame me for the crowded freeways - I don't drive.