[net.ai] Matrix Multiplication on the FFP Machine

koala@unc.UUCP (08/10/83)

	Since the subject has been brought up, I felt I should
clear up some of the statements about the FFP machine.  The
machine consists of a linear vector of small processors connected
as the leaves of a binary tree, through which they communicate.

	Roughly speaking, the FFP machine performs general
matrix multiplication in O(n^2) space and time.  Systolic arrays
can multiply matrices in O(n) time, but do not offer the same
flexibility in the sizes of matrices that can be handled.

	Order notation presents only half the picture - in real
life, constant factors and lower-order terms also matter.  The
machine's matrix multiply operation examines each element of the
two matrices once.  Multiplying an mxn matrix by an nxp matrix
requires accessing (mxn + nxp) values, and this access is the
measure of the time for the computation.  Each cell performs n
multiplications, a cost dominated by the access.  Further, the
result of such a product has size mxp.  (Consider multiplying a
column by a row.)  Thus, when n < (mxp)/(m+p), the result is
larger than the inputs and extra space must be allocated for it.
That allocation is also a quadratic-time operation.
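The accounting above is easy to check with a few throwaway lines (the function names are mine, just for illustration - they are not part of the machine):

```python
# Cost measures for multiplying an m x n matrix by an n x p matrix,
# following the accounting in the text: element accesses in, result
# size out.

def access_count(m, n, p):
    """Values read: every element of both operands, exactly once."""
    return m * n + n * p

def result_size(m, n, p):
    """The product of an m x n and an n x p matrix is m x p."""
    return m * p

def needs_extra_space(m, n, p):
    """Extra space is needed when the result outgrows the inputs,
    i.e. when m*p > m*n + n*p, equivalently n < (m*p)/(m + p)."""
    return n < (m * p) / (m + p)

# Column times row: a 4x1 times a 1x4 reads only 8 values but
# produces a 4x4 = 16-element result.
print(access_count(4, 1, 4))       # 8
print(result_size(4, 1, 4))        # 16
print(needs_extra_space(4, 1, 4))  # True
```

The column-times-row case is the extreme: the inputs are linear in n while the result is quadratic, which is exactly where the extra allocation bites.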

				David Middleton
				UNC Chapel Hill
				decvax!duke!unc!koala

sts@ssc-vax.UUCP (Stanley T Shebs) (08/11/83)

I must admit to being a little sloppy when giving the maximum speed
of a matrix multiplication on an FFP machine (haven't worked on this
stuff for a year, and my memory is slipping).  I still stand by the
original statement, however.  The *fastest* possible multiplication
of two nxn matrices takes O(log n) time.  What I should have
done is state that the machine architecture is completely unspecified.
I am not convinced that the Mago tree machine is the ultimate in
FFP designs, although it is very interesting.  The achievement of
O(log n) requires several things.  Let me enumerate.  First, assume
that the matrix elements are already distributed to their processors.
Second, assume that a single processor can quickly distribute a value to
arbitrarily many processors (easy: put it on the bus (buss? :-} )
and let the processors all go through a read cycle simultaneously).
Third, assume that the processors can communicate in such a way
that addition of n numbers can be performed in log n time (by adding
pairs, then pairs of pairs, etc).  Then the distribution of values
takes constant time, the multiplications are all done simultaneously
and so take constant time, leaving only the summation to slow things
down.  I know this is fast and loose; its main failing is that it
assumes the availability of an extraordinarily high number of
communication paths (the exact number is left as an exercise for the
reader).
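
The log-time summation in the third assumption can be sketched in a few throwaway lines (names mine, for illustration only): reduce pairs, then pairs of pairs, etc., so the number of parallel levels - the time - is ceil(log2 n).

```python
def tree_sum(values):
    """Sum n numbers by pairwise reduction.  Each pass of the loop
    models one parallel step, since all additions on a level would
    happen simultaneously; the number of passes is ceil(log2 n)."""
    level = list(values)
    steps = 0
    while len(level) > 1:
        # Pair up neighbors; an odd element out is carried forward.
        level = [level[i] + level[i + 1] if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
        steps += 1
    return level[0], steps

total, steps = tree_sum(range(16))  # 16 numbers
print(total, steps)                 # 120 4  (16 -> 8 -> 4 -> 2 -> 1)
```

The serial loop is just bookkeeping; on the hypothetical machine each level is one step, which is where the O(log n) bound comes from.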

					stan the leprechaun hacker
					ssc-vax!sts (soon utah-cs)

ps For those not familiar with FP, read J. Backus' Turing Lecture
in CACM (Aug 78, I believe) - it is very readable, and he gives
a one-liner for matrix multiplication in FP, which I used as a basis
for the timing hackery above.