[comp.hypercube] Fast Parallel Integer Multiply

fpst@hubcap.clemson.edu (Steve Stevenson [as moderator]) (09/21/87)

[ I have put several messages together concerning the fast integer
	multiply question.  Many were originally posted to sci.math.
		Steve ]

------------------------------------------

From: coraki!pratt@Sun.COM (Vaughan Pratt)

The basic result is by a former teacher of mine at Sydney University,
Chris Wallace, who, while working on the Illiac II in the US in the
early 60's, showed how to multiply n-bit integers with log(n) delay.
The method and its time bound carry over to most if not all parallel
architectures.  For example, a 1974 paper by Larry Stockmeyer and myself
shows (among other things) how to implement Wallace's algorithm on a
bit vector machine whose only instructions are shift, boolean
operations (and-or-not), clear, and test for zero, using time log(n) and
space (length of bit vectors) n^2.

Wallace's method is as follows.  To multiply A x B, form A times each bit
of B (that is, either A shifted left by that bit's position, or 0) to
yield n numbers.  All this work is done in unit delay when the
architecture allows it, and otherwise with log(n) delay (the more usual
situation).
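
A tiny Python sketch of this first step, purely for illustration (the
function name is mine, not Wallace's):

def partial_products(A, B, n):
    # One number per bit of B: A shifted into place, or 0.
    return [(A << i) if (B >> i) & 1 else 0 for i in range(n)]

# 13 * 11: partial products 13, 26, 0, 104 sum to 143.
assert sum(partial_products(13, 11, 4)) == 13 * 11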

Now add up the resulting n numbers in two phases, each phase taking
log(n) time.  In phase 1, repeatedly divide the numbers to be added into
groups of 4 and, for each group, replace the 4 numbers w,x,y,z by two
numbers u,v such that w+x+y+z = u+v.  It turns out that this 4->2 kind
of addition can be done with constant delay, i.e. independent of the
size of the numbers being added (various methods are possible); this was
first pointed out by UCLA's Avizienis.  In this way the n numbers can be
reduced to two numbers with log(n) delay.  In phase 2, add the two
remaining numbers together using, say, a conventional carry-lookahead
adder (again various methods are possible), taking an additional log(n)
units of delay.
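
For concreteness, here is a small software sketch of the two phases.  The
structure is only illustrative (a 4->2 step built from two carry-save
adders is one standard choice, not necessarily Wallace's):

def csa(x, y, z):
    # Carry-save adder (3->2): sum + carry == x + y + z.  Each output bit
    # depends on only three input bits, so the hardware delay is constant,
    # independent of word length.
    return x ^ y ^ z, ((x & y) | (y & z) | (x & z)) << 1

def four_to_two(w, x, y, z):
    # Replace four summands by two with the same total (two chained CSAs).
    s, c = csa(w, x, y)
    return csa(s, c, z)

def reduce_and_add(nums):
    # Phase 1: repeated 4->2 reduction, log(n) rounds, down to two numbers.
    while len(nums) > 2:
        nums = nums + [0] * (-len(nums) % 4)          # pad to a multiple of 4
        nums = [v for i in range(0, len(nums), 4)
                  for v in four_to_two(*nums[i:i+4])]
    # Phase 2: one ordinary addition of the survivors (carry-lookahead in
    # hardware, another log(n) delay).
    return sum(nums)

# Check against an ordinary multiply: 16 partial products of 12345 * 6789.
A, B, n = 12345, 6789, 16
pps = [(A << i) if (B >> i) & 1 else 0 for i in range(n)]
assert reduce_and_add(pps) == A * B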

I don't have the reference to hand, but I think it was an IEEE
journal from 1963, give or take a year, maybe Transactions on
Computers.

------------------------------------------

From: "Paul F. Dietz" <DIETZ%sdr.slb.com@RELAY.CS.NET>

Bob Silverman, responding to a query on parallel algorithms for
integer multiplication, wrote:

>... I'm afraid that the
>old-fashioned algorithms in Knuth which run in O(N^2) time (N = # bits)
>are inherently serial. Ditto for the recursive partitioning algorithms that
>say run in O(N^lg(3)) time.

The old-fashioned algorithms, and the n^1.59 divide-and-conquer algorithm,
are most certainly not inherently serial.  For example, the O(n^2) algorithm
can be implemented as a boolean circuit with O(n^2) gates and depth O(log
n), if one is careful in implementing the tree of adders.  In the divide and
conquer algorithm, the three subproblems can be solved in parallel, and the
splitting and combining steps can be done in logarithmic time, for an
O(log^2 n) time algorithm.
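
A small sequential Python sketch of the O(n^1.59) divide-and-conquer
(Karatsuba) multiply being discussed; the three recursive calls are
independent, which is exactly what the parallel schedule exploits (running
them concurrently is only indicated in a comment here):

def karatsuba(x, y, n):
    # Multiply n-bit nonnegative integers x and y.
    if n <= 16:                      # small base case: single machine multiply
        return x * y
    h = n // 2
    x1, x0 = x >> h, x & ((1 << h) - 1)
    y1, y0 = y >> h, y & ((1 << h) - 1)
    # The next three products are independent subproblems (run in parallel).
    p0 = karatsuba(x0, y0, h)
    p2 = karatsuba(x1, y1, n - h)
    p1 = karatsuba(x0 + x1, y0 + y1, n - h + 1)
    # Combining uses only shifts and additions: logarithmic depth.
    return (p2 << (2 * h)) + ((p1 - p0 - p2) << h) + p0

assert karatsuba(123456789, 987654321, 32) == 123456789 * 987654321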

Paul F. Dietz
dietz@sdr.slb.com

------------------------------------------

From: cik@l.cc.purdue.edu (Herman Rubin)
Summary: It is easy to get a speedup

In article <13410@linus.UUCP>, bs@linus.UUCP (Robert D. Silverman) writes:
> In article <67@rpicsb8> abbottj@csv.rpi.edu (John Abbott) writes:
  >>[...original question...]
> 
> The basic question is: How Large? If the numbers are say several THOUSAND
> computer words long (e.g. say 20000+ decimal digits) then FFT techniques
> can be used and these parallelize well. Under that I'm afraid that the
> old-fashioned algorithms in Knuth which run in O(N^2) time (N = # bits)
> are inherently serial. Ditto for the recursive partitioning algorithms that
> say run in O(N^lg(3)) time.
> 
> Bob Silverman

If we have M^2 processors and M^2 memory available (M = # words,
assuming an instruction that multiplies 2 words to produce a double
word) then the usual algorithm can be implemented in log M time.  Even
if we do not have massive parallelism, a significant speedup can be
obtained using parallelism or even vectorization.
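
A small sketch of this, with sequential Python loops standing in for the
parallel steps (the 32-bit word size and the names are mine):

W = 32                                   # assumed word size in bits

def words(x, M):
    # Split x into M base-2^W digits, least significant first.
    return [(x >> (W * i)) & ((1 << W) - 1) for i in range(M)]

def parallel_schoolbook(x, y, M):
    a, b = words(x, M), words(y, M)
    # Step 1: the M^2 single-word products, each shifted into place
    # (one per processor on the parallel machine).
    prods = [(a[i] * b[j]) << (W * (i + j)) for i in range(M) for j in range(M)]
    # Step 2: balanced-tree summation; each level halves the number of
    # summands, so the parallel machine needs O(log M) addition rounds.
    while len(prods) > 1:
        prods = [prods[k] + prods[k + 1] if k + 1 < len(prods) else prods[k]
                 for k in range(0, len(prods), 2)]
    return prods[0]

x, y = 2**100 + 12345, 3**70 + 6789      # both fit in M = 4 words of 32 bits
assert parallel_schoolbook(x, y, 4) == x * y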

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

------------------------------------------

From: jamesa%betelgeuse@Sun.COM (James D. Allen)
Summary: Winograd's Modulus Multiply

In article <67@rpicsb8>, abbottj@csv.rpi.edu (John Abbott) writes:
> 
> Are there any articles about how to multiply large integers quickly on
> parallel machines? ...

I am surprised modulus representation wasn't mentioned when the topic
`unusual arithmetic representations' came up in this(?) newsgroup a few months
ago.  The multiply speed is much better than O(N^log 3).  I guess it's still
just a curiosity rather than of any practical use.  Perhaps another netlander
will give more info.  All I remember is:
	- the idea is to represent N with a k-tuple N' = (n1,n2,...,nk)
		such that N*M is represented by N'+M' = (n1+m1,n2+m2,...)
	- the domain of N,M is   {1,2,3,...,B-1}  and only the residue mod B
		of N*M is preserved.  Similarly the 'n1,...' are bounded by
			0 <= ni < bi
	- Given a large B, choose (how?) a large k such that the various bi
		are small.  This allows good parallel usage of k machines.
	- Winograd's example was B=64, k=3, (b1,b2,b3) = (16,6,2)
		N = 5^a * 2^b        is represented by 	(a,b,0)
		N = 64 - 5^a * 2^b   is represented by 	(a,b,1)
	- it is related to a similar but simpler technique for addition:
			- Choose B = p*q*...*r with p,q,... relatively prime
			- Represent N with N' = (ap,aq,...ar) where
				N = aq (mod q) etc.
	- unfortunately the additive and multiplicative representations are
		incompatible.  You can multiply quickly but adds are *slow*.

Sorry.  No reference except author's name.
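
To make the idea concrete, here is a small worked Python sketch of my
reading of that example (so treat the details as assumptions): every N in
{1,...,63} can be written N = (-1)^c * 5^a * 2^b (mod 64) with 0 <= a < 16,
0 <= b < 6, 0 <= c < 2, and multiplication mod 64 becomes component-wise
addition of the triples.

B = 64

def encode(N):
    # Find (a, b, c) with N == (-1)**c * 5**a * 2**b  (mod 64).
    for a in range(16):
        for b in range(6):
            for c in range(2):
                if ((-1) ** c * 5 ** a * 2 ** b) % B == N % B:
                    return (a, b, c)
    raise ValueError("N must be nonzero mod 64")

def decode(t):
    a, b, c = t
    return ((-1) ** c * 5 ** a * 2 ** b) % B

def multiply(t1, t2):
    # Component-wise 'addition' of triples implements multiplication mod 64.
    # (If the powers of two total 6 or more, the product is 0 mod 64 and
    # falls outside the representable domain.)
    a1, b1, c1 = t1
    a2, b2, c2 = t2
    if b1 + b2 >= 6:
        raise ValueError("product is 0 mod 64")
    return ((a1 + a2) % 16, b1 + b2, (c1 + c2) % 2)

# Check exhaustively for all pairs whose product is nonzero mod 64.
for n in range(1, 64):
    for m in range(1, 64):
        if (n * m) % B != 0:
            assert decode(multiply(encode(n), encode(m))) == (n * m) % B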

James Allen

------------------------------------------

From: abbottj@csv.rpi.edu (John Abbott)

There had been a suggestion (by Erich Kaltofen) that there would be an
O(log(n)) *time* algorithm for parallel multiplication of big integers
- I imagine FFT achieves this. Unfortunately, I left my Knuth vol 2
back in England.

Also, a couple of months ago I posted an article saying that a
final-year student at Bath University (England) had done a project on
FFT-type multiplication and concluded that using a finite-field FFT
(aka NTT) was a winner with numbers as small as 30 decimal digits.
The project will be/has been made into a technical report. If/when I
ever get a copy I plan to summarise its contents in an article. Thanks
and apologies to those who showed interest in the earlier article.
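
For readers who have not seen a finite-field FFT (NTT) multiply, a small
generic Python sketch follows.  It is not the Bath project's code; the
prime 998244353 = 119*2^23 + 1 (primitive root 3) and base-2^8 digits are
merely convenient choices of mine.

P = 998244353          # 119 * 2^23 + 1, prime; 3 is a primitive root mod P

def ntt(a, root):
    # Recursive Cooley-Tukey transform over GF(P).  len(a) must be a power
    # of 2 and root a primitive len(a)-th root of unity mod P.
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], root * root % P)
    odd  = ntt(a[1::2], root * root % P)
    out, w = [0] * n, 1
    for i in range(n // 2):
        t = w * odd[i] % P
        out[i]          = (even[i] + t) % P
        out[i + n // 2] = (even[i] - t) % P
        w = w * root % P
    return out

def ntt_multiply(x, y):
    # Split x and y into base-2^8 digits, convolve them with two forward
    # NTTs, a pointwise product and one inverse NTT, then reassemble.
    xd = [(x >> (8 * i)) & 0xFF for i in range((x.bit_length() + 7) // 8 or 1)]
    yd = [(y >> (8 * i)) & 0xFF for i in range((y.bit_length() + 7) // 8 or 1)]
    n = 1
    while n < len(xd) + len(yd):
        n *= 2
    xd += [0] * (n - len(xd))
    yd += [0] * (n - len(yd))
    root = pow(3, (P - 1) // n, P)               # primitive n-th root of unity
    fz = [a * b % P for a, b in zip(ntt(xd, root), ntt(yd, root))]
    z = [v * pow(n, P - 2, P) % P for v in ntt(fz, pow(root, P - 2, P))]
    return sum(v << (8 * i) for i, v in enumerate(z))

x = 123456789012345678901234567890               # 30 decimal digits
y = 314159265358979323846264338327
assert ntt_multiply(x, y) == x * y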

John.

------------------------------------------

From: abbottj@csv.rpi.edu (John Abbott)

In article <28499@sun.uucp> jamesa%betelgeuse@Sun.COM (James D. Allen) writes:
...
>I am suprised modulus representation wasn't mentioned when the topic
>`unusual arithmetic representations' came up in this(?) newsgroup a few months
>ago.
...
>	- Winograd's example was B=64, k=3, (b1,b2,b3) = (16,6,2)
...
>James Allen

There is a similar paper in computer algebra which shows how to multiply
polynomials just by adding (again, addition & subtraction become hard).
I cannot find it in my files, but it is fairly recent. It might be in 
Proc EUROCAL 87 (when it appears). I am sure the author was French,
probably D Lugiez (maybe D Lazard or G Viry).

This also reminded me of P L Montgomery's paper "Modular Multiplication
without Trial Division" in Math Comp 44 #170, April 1985, pp 519-521.
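
Since the reference came up: a minimal Python sketch of the Montgomery
reduction idea (the variable names and parameters are mine, not from the
paper).  Operands are kept in "Montgomery form" x*R mod m, so each modular
multiplication needs only shifts, masks and one conditional subtraction,
and never a division by m.

def montgomery_setup(m, k):
    # m must be odd; R = 2^k > m.  Returns (R, m') with m * m' == -1 (mod R).
    R = 1 << k
    m_prime = (-pow(m, -1, R)) % R
    return R, m_prime

def redc(T, m, R, m_prime, k):
    # Return T * R^{-1} mod m, for 0 <= T < m*R, without dividing by m.
    u = (T * m_prime) & (R - 1)          # T * m' mod R
    t = (T + u * m) >> k                 # exact division by R
    return t - m if t >= m else t

m, k = 1000003, 20                       # odd modulus, R = 2^20 > m
R, m_prime = montgomery_setup(m, k)
a, b = 123456, 654321
aR, bR = (a * R) % m, (b * R) % m        # convert to Montgomery form
abR = redc(aR * bR, m, R, m_prime, k)    # product, still in Montgomery form
assert redc(abR, m, R, m_prime, k) == (a * b) % m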

John.