fpst@hubcap.clemson.edu (Steve Stevenson [as moderator]) (09/21/87)
[ I have put several messages together concerning the fast integer
multiply question.  Many were originally posted to sci.math.  Steve ]

------------------------------------------
From: coraki!pratt@Sun.COM (Vaughan Pratt)

The basic result is by a former teacher of mine at Sydney University,
Chris Wallace, who while working on the Illiac II in the US in the early
60's showed how to multiply n-bit integers with log(n) delay.  The method
and its time bound carry over to most if not all parallel architectures.
For example, a 1974 paper by Larry Stockmeyer and myself shows (among
other things) how to implement Wallace's algorithm on a bit-vector
machine whose only instructions are shift, boolean operations
(and-or-not), clear, and test for zero, using time log(n) and space
(length of bit vectors) n^2.

Wallace's method is as follows.  To multiply AxB, form A times each bit
of B (so either A appropriately shifted, or 0) to yield n numbers.  All
this work is done in unit delay when the architecture allows it, and
otherwise with log(n) delay (the more usual situation).  Now add up the
resulting n numbers in two phases, each phase taking log(n) time.  In
phase 1, repeatedly divide the numbers to be added into groups of 4 and
for each group replace the 4 numbers w,x,y,z by two numbers u,v such
that w+x+y+z = u+v.  It turns out that this 4->2 kind of addition can be
done with constant delay (i.e. independent of the size of the numbers
being added) by various methods; this was first pointed out by UCLA's
Avizienis.  In this way the n numbers can be reduced to two numbers with
log(n) delay.  Now add the two remaining numbers together using a
conventional fast carry-propagate adder (say carry lookahead; again
various methods are possible), taking an additional log(n) units of
delay.

I don't have the reference to hand, but I think it was an IEEE journal
from 1963, give or take a year, maybe Transactions on Computers.

------------------------------------------
From: "Paul F. Dietz" <DIETZ%sdr.slb.com@RELAY.CS.NET>

Bob Silverman, responding to a query on parallel algorithms for integer
multiplication, wrote:

>... I'm afraid that the
>old-fashioned algorithms in Knuth which run in O(N^2) time (N = # bits)
>are inherently serial. Ditto for the recursive partitioning algorithms that
>say run in O(N^lg(3)) time.

The old-fashioned algorithms, and the n^1.59 divide-and-conquer
algorithm, are most certainly not inherently serial.  For example, the
O(n^2) algorithm can be implemented as a boolean circuit with O(n^2)
gates and depth O(log n), if one is careful in implementing the tree of
adders.  In the divide-and-conquer algorithm, the three subproblems can
be solved in parallel, and the splitting and combining steps can be done
in logarithmic time, for an O(log^2 n) time algorithm.

Paul F. Dietz
dietz@sdr.slb.com
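[ A minimal Python sketch of the adder-tree reduction described in the
two messages above, for the curious.  It only simulates the arithmetic
identities (a hardware Wallace tree realizes each step as a constant-depth
layer of gates), it uses 3->2 carry-save steps rather than the 4->2
grouping Pratt mentions, and the function names are illustrative, not
taken from any of the posts. ]

def csa(a, b, c):
    """3->2 carry-save step: returns (s, t) with a + b + c == s + t.

    Each output bit depends on a fixed number of input bits, so in
    hardware this step has constant delay regardless of word length.
    """
    s = a ^ b ^ c                            # bitwise sum without carries
    t = ((a & b) | (a & c) | (b & c)) << 1   # carries, shifted into place
    return s, t

def wallace_multiply(a, b, n):
    """Multiply non-negative a and b (b of at most n bits), Wallace-style."""
    # Phase 0: one partial product per bit of b (a shifted, or 0).
    addends = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]

    # Phase 1: repeatedly compress groups of addends; the count shrinks
    # geometrically, so only O(log n) rounds are needed.
    while len(addends) > 2:
        nxt = []
        while len(addends) >= 3:
            s, t = csa(addends.pop(), addends.pop(), addends.pop())
            nxt.extend([s, t])
        nxt.extend(addends)          # 0, 1 or 2 leftovers pass through
        addends = nxt

    # Phase 2: one final carry-propagating addition (Python's own `+`
    # stands in for a log-depth carry-lookahead adder here).
    return sum(addends)

assert wallace_multiply(12345, 6789, 16) == 12345 * 6789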
------------------------------------------
From: cik@l.cc.purdue.edu (Herman Rubin)
Summary: It is easy to get a speedup

In article <13410@linus.UUCP>, bs@linus.UUCP (Robert D. Silverman) writes:
> In article <67@rpicsb8> abbottj@csv.rpi.edu (John Abbott) writes:
>>[...original question...]
>
> The basic question is: How Large? If the numbers are say several THOUSAND
> computer words long (e.g. say 20000+ decimal digits) then FFT techniques
> can be used and these parallelize well. Under that I'm afraid that the
> old-fashioned algorithms in Knuth which run in O(N^2) time (N = # bits)
> are inherently serial. Ditto for the recursive partitioning algorithms that
> say run in O(N^lg(3)) time.
>
> Bob Silverman

If we have M^2 processors and M^2 memory available (M = # words, where
multiplying 2 words to produce a double word is available), then the
usual algorithm can be implemented in log M time.  Even if we do not
have massive parallelism, a significant speedup can be obtained using
parallelism or even vectorization.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette, IN 47907
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

------------------------------------------
From: jamesa%betelgeuse@Sun.COM (James D. Allen)
Summary: Winograd's Modulus Multiply

In article <67@rpicsb8>, abbottj@csv.rpi.edu (John Abbott) writes:
>
> Are there any articles about how to multiply large integers quickly on
> parallel machines? ...

I am surprised modulus representation wasn't mentioned when the topic
`unusual arithmetic representations' came up in this(?) newsgroup a few
months ago.  The multiply speed is much better than O(N^log 3).  I guess
it's still just a curiosity rather than of any practical use.  Perhaps
another netlander will give more info.  All I remember is:

 - the idea is to represent N with a k-tuple N' = (n1,n2,...,nk) such
   that N*M is represented by N'+M' = (n1+m1,n2+m2,...)
 - the domain of N,M is {1,2,3,...,B-1} and only the residue mod B of
   N*M is preserved.  Similarly the n1,...,nk are bounded by 0 <= ni < bi
 - given a large B, choose (how?) a large k such that the various bi are
   small.  This allows good parallel usage of k machines.
 - Winograd's example was B=64, k=3, (b1,b2,b3) = (16,6,2):
       N = 5^a * 2^b      is represented by (a,b,0)
       N = 64 - 5^a * 2^b is represented by (a,b,1)
 - it is related to a similar but simpler technique for addition:
     - choose B = p*q*...*r with p,q,...,r relatively prime
     - represent N by N' = (ap,aq,...,ar) where N = ap (mod p),
       N = aq (mod q), etc.
 - unfortunately the additive and multiplicative representations are
   incompatible.  You can multiply quickly but adds are *slow*.

Sorry, no reference except the author's name.

James Allen
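[ The "simpler technique for addition" in the last list above is ordinary
residue-number-system (Chinese remainder) arithmetic.  Here is a small
Python sketch of it; the moduli and helper names are illustrative, not
Winograd's.  In this representation addition and multiplication mod B are
both componentwise; it is comparison and conversion back that cost. ]

from math import prod

# Pairwise-coprime moduli; their product B bounds the representable range.
MODULI = (7, 11, 13, 15)          # B = 15015, example values only
B = prod(MODULI)

def to_residues(n):
    """Represent n (0 <= n < B) by its residues modulo each modulus."""
    return tuple(n % m for m in MODULI)

def add_residues(x, y):
    """Componentwise addition: each of the k digits can be handled by a
    separate machine, with no carries passing between them."""
    return tuple((a + b) % m for a, b, m in zip(x, y, MODULI))

def from_residues(r):
    """Recover n mod B by the Chinese Remainder Theorem."""
    n = 0
    for a, m in zip(r, MODULI):
        q = B // m
        n = (n + a * q * pow(q, -1, m)) % B
    return n

x, y = 1234, 5678
assert from_residues(add_residues(to_residues(x), to_residues(y))) == (x + y) % B
# Multiplication mod B is also componentwise in this representation:
assert from_residues(tuple((a * b) % m for a, b, m in
                           zip(to_residues(x), to_residues(y), MODULI))) == (x * y) % B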
------------------------------------------
From: abbottj@csv.rpi.edu (John Abbott)

There had been a suggestion (by Erich Kaltofen) that there would be an
O(log(n)) *time* algorithm for parallel multiplication of big integers -
I imagine FFT achieves this.  Unfortunately, I left my Knuth vol 2 back
in England.

Also, a couple of months ago I posted an article saying that a final-year
student at Bath University (England) had done a project on FFT-type
multiplication and concluded that using finite-field FFT (aka NTT) was a
winner with numbers as small as 30 decimal digits.  The project will
be/has been made into a technical report.  If/when I ever get a copy I
plan to summarise its contents in an article.  Thanks and apologies to
those who showed interest in the earlier article.

John.

------------------------------------------
From: abbottj@csv.rpi.edu (John Abbott)

In article <28499@sun.uucp> jamesa%betelgeuse@Sun.COM (James D. Allen) writes:
...
>I am surprised modulus representation wasn't mentioned when the topic
>`unusual arithmetic representations' came up in this(?) newsgroup a few
>months ago.
...
> - Winograd's example was B=64, k=3, (b1,b2,b3) = (16,6,2)
...
>James Allen

There is a similar paper in computer algebra which allows one to multiply
polynomials just by adding (again, addition and subtraction become hard).
I cannot find it in my files, but it is fairly recent.  It might be in
Proc. EUROCAL 87 (when it appears).  I am sure the author was French,
probably D. Lugiez (maybe D. Lazard or G. Viry).

This also reminded me of P. L. Montgomery's paper "Modular Multiplication
without Trial Division", Math. Comp. 44, #170, April 1985, pp. 519-521.

John.
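[ For readers chasing the Montgomery reference, a short Python sketch of
the idea in the usual REDC formulation; the variable names and the small
example modulus are illustrative, not taken from the paper. ]

def montgomery_setup(n):
    """Precompute constants for reduction modulo an odd n > 1.
    R is the smallest power of two exceeding n, so dividing by R is a shift."""
    r_bits = n.bit_length()
    R = 1 << r_bits                      # R > n, and gcd(R, n) == 1 since n is odd
    n_prime = (-pow(n, -1, R)) % R       # n * n_prime == -1 (mod R)
    return R, r_bits, n_prime

def redc(t, n, R, r_bits, n_prime):
    """Montgomery reduction: returns t * R^(-1) mod n without any trial
    division, using only multiplications, masks, and shifts."""
    m = ((t & (R - 1)) * n_prime) & (R - 1)
    u = (t + m * n) >> r_bits            # exact: t + m*n is divisible by R
    return u - n if u >= n else u

def mont_mul(x, y, n, R, r_bits, n_prime):
    """Multiply x and y modulo n via Montgomery form."""
    xbar = (x * R) % n                   # into Montgomery form (one-time cost)
    ybar = (y * R) % n
    zbar = redc(xbar * ybar, n, R, r_bits, n_prime)   # == x*y*R mod n
    return redc(zbar, n, R, r_bits, n_prime)          # back out of Montgomery form

n = 1000003                              # any odd modulus will do
R, r_bits, n_prime = montgomery_setup(n)
assert mont_mul(123456, 654321, n, R, r_bits, n_prime) == (123456 * 654321) % n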