koala@unc.UUCP (08/10/83)
Since the subject has been brought up, I felt I should clear up some of the statements about the FFP machine. The machine consists of a linear vector of small processors which communicate by being connected as the leaves of a binary tree. Roughly speaking, the FFP machine performs general matrix multiplication in O(nxn) space and time. Systolic arrays can multiply matrices in O(n) time, but do not provide flexibility in the size of matrices that can be handled.

Order notation presents only half the picture - in real life, constant factors and other terms are also important. The machine's matrix multiply operation examines each element of the two matrices once. Multiplying two matrices, mxn and nxp, requires accessing (mxn + nxp) values, and this access is the measure of the time for the computation. Each cell performs n multiplications, but that work is dominated by the access time.

Further, when you multiply two matrices, mxn and nxp, the result is of size mxp, which can be larger than the inputs. (Consider multiplying a column by a row.) Thus, when n < (mxp)/(m+p), extra space must be allocated for the result, and this allocation is also a quadratic time operation.

David Middleton
UNC Chapel Hill
decvax!duke!unc!koala
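To make the size bookkeeping above concrete, here is a small sketch in ordinary Haskell (not FFP; matmul and needsExtraSpace are my own illustrative names) of an mxn by nxp multiply and of the test for when the mxp result outgrows its operands:

    import Data.List (transpose)

    -- Naive mxn by nxp multiply: each result element is a dot product.
    -- Matrices are represented as lists of rows.
    matmul :: Num a => [[a]] -> [[a]] -> [[a]]
    matmul a b = [[sum (zipWith (*) row col) | col <- transpose b] | row <- a]

    -- Input values touched: mxn + nxp; result cells produced: mxp.
    -- Extra space beyond the operands is needed exactly when
    -- mxp > n(m+p), i.e. when n < (mxp)/(m+p).
    needsExtraSpace :: Int -> Int -> Int -> Bool
    needsExtraSpace m n p = m * p > n * (m + p)

    -- Extreme case from the posting: a column (mx1) times a row (1xp)
    -- reads only m + p values but produces an mxp result, e.g.
    --   matmul [[1],[2],[3]] [[4,5]]  ==  [[4,5],[8,10],[12,15]]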
sts@ssc-vax.UUCP (Stanley T Shebs) (08/11/83)
I must admit to being a little sloppy when giving the maximum speed of a matrix multiplication on an FFP machine (haven't worked on this stuff for a year, and my memory is slipping). I still stand by the original statement, however: the *maximum* possible speed for the multiplication of two nxn matrices is O(log n). What I should have done is state that the machine architecture is completely unspecified. I am not convinced that the Mago tree machine is the ultimate in FFP designs, although it is very interesting.

The achievement of O(log n) requires several things. Let me enumerate. First, assume that the matrix elements are already distributed to their processors. Second, assume that a single processor can quickly distribute a value to arbitrarily many processors (easy: put it on the bus (buss? :-} ) and let the processors all go through a read cycle simultaneously). Third, assume that the processors can communicate in such a way that the addition of n numbers can be performed in log n time (by adding pairs, then pairs of pairs, etc.). Then the distribution of values takes constant time, the multiplications are all done simultaneously and so take constant time, leaving only the summation to slow things down.

I know this is fast and loose; its main failing is that it assumes the availability of an extraordinarily large number of communication paths (the exact number is left as an exercise for the reader).

stan the leprechaun hacker
ssc-vax!sts (soon utah-cs)

ps For those not familiar with FP, read J. Backus' Turing Lecture in CACM (Aug 78, I believe) - it is very readable, and he also gives a one-liner for matrix multiplication in FP, which I used as a basis for the timing hackery above.
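To make the third assumption concrete, here is a rough sketch in Haskell (not FP; treeSum and dot are my own illustrative names) of the pairs-then-pairs-of-pairs summation, and of a dot product done that way:

    -- Pairwise (tree) summation: add neighbours, then neighbours of those
    -- results, and so on.  With an adder per pair, each pass is one parallel
    -- step, so n values are summed in about log2 n passes.
    treeSum :: Num a => [a] -> a
    treeSum []  = 0
    treeSum [x] = x
    treeSum xs  = treeSum (pairUp xs)
      where
        pairUp (a:b:rest) = (a + b) : pairUp rest
        pairUp rest       = rest              -- odd value carried forward

    -- A dot product under those assumptions: all n multiplications happen
    -- in one parallel step, then the tree sum adds about log n more steps.
    dot :: Num a => [a] -> [a] -> a
    dot xs ys = treeSum (zipWith (*) xs ys)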