[comp.lang.prolog] vector arithmetic in Prolog

ok@quintus.UUCP (Richard A. O'Keefe) (04/16/88)

Two people were kind enough to give me a full reference to the paper by
David Wise I mentioned earlier.  It appears in 

	Functional Programming Languages and Computer Architecture
	(Portland, Oregon, USA, September 1987; Proceedings)
	Editor: Gilles Kahn,
	Springer-Verlag Lecture Notes in Computer Science No 274
	ISBN 0-387-18317-5
	Price, US$34.60

I enquired at three bookshops before I could locate it (you can find
pretty nearly _anything_ at the Stanford University Bookshop).  The
last place I tried first claimed they didn't have it; when I asked
them to look again using the ISBN, they found they'd listed it under
Goos, who is the _series_ editor, not the _volume_ editor.

The abstract of the paper reads:

	The broad problem of matrix algebra is taken up from the
	perspective of functional programming.  A key question is how
	arrays should be represented in order to admit good
	implementations of well-known efficient algorithms, and whether
	functional architecture sheds any new light on these or other
	solutions.  It relates directly to disarming the "aggregate
	update" problem.

	The major thesis is that 2^d-ary trees should be used to
	represent d-dimensional arrays; examples are matrix operations
	(d=2) and a particularly interesting vector (d=1) algorithm.
	Sparse and dense matrices are represented homogeneously, but at
	some overhead that appears tolerable; encouraging results are
	reviewed and extended.  A Pivot Step algorithm is described
	which offers optimal stability at no extra cost for searching.
	The new results include proposed sparseness measures for
	matrices, improved performance of stable matrix inversion
	through repeating pivoting while deep within a matrix-tree
	(extendible to solving linear systems), and a clean matrix
	derivation of the vector algorithm for the fast Fourier
	transform.  Running code [in a functional language called
	DAISY] is offered in the appendices.

	This work is particularly important because of the importance
	of this family of problems. Progress would be of simultaneous
	use in decomposing algorithms over traditional vector
	multiprocessors, as well as motivate practical interest in
	highly parallel functional architectures.

There is a series of papers on this topic, of which this is only one.
For example, there is a paper  
	Representing matrices as quadtrees for parallel processors
	D.S.Wise,
	Information Processing Letters 20 (May 1985) pp 195-199.
[I haven't seen this yet.]

I'm going on about this at some length because the ideas carry over to
Prolog.  Basically, the idea is that you divide a matrix into four blocks
		[       |       ]
		[   A   |   B   ]
		[_______|_______]
		[       |       ]
		[   C   |   D   ]
		[       |       ]
and represent it by a(A,B,C,D) where A,B,C,D are recursively represented.
An empty block (all elements 0) is represented by, say, e, and a matrix
with all diagonal elements equal to X and zeros elsewhere would be
represented by d(X).  For a dense matrix, the storage cost is about the
same as the list of lists representation.

What makes this approach so effective is that such a recursive
decomposition is quite a natural one for many matrix algebra operations.
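
For instance, matrix addition in this representation might go something
like the following sketch (mine, not code from the paper; mat_add/3 is a
made-up name, and I'm assuming a 1x1 block holding X is written d(X)):

mat_add(e, M, M).
mat_add(M, e, M) :- M \== e.
mat_add(d(X), d(Y), d(Z)) :-
	Z is X+Y.
mat_add(d(X), a(A,B,C,D), a(A1,B,C,D1)) :-
	mat_add(d(X), A, A1),
	mat_add(d(X), D, D1).
mat_add(a(A,B,C,D), d(X), a(A1,B,C,D1)) :-
	mat_add(A, d(X), A1),
	mat_add(D, d(X), D1).
mat_add(a(A1,B1,C1,D1), a(A2,B2,C2,D2), a(A,B,C,D)) :-
	mat_add(A1, A2, A),
	mat_add(B1, B2, B),
	mat_add(C1, C2, C),
	mat_add(D1, D2, D).

The e clauses avoid descending into blocks known to be all zeros, and
the d(X) clauses leave the off-diagonal blocks untouched.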

I thought it would be interesting to see how this 2^d-ary approach worked
out for (dense) vectors.  I wrote vector addition and vector dot product
routines in Prolog for both vector-as-list and vector-as-tree methods.

% add_list(+Xs, +Ys, -Zs)
% Zs is the elementwise sum of the equal-length lists Xs and Ys.
add_list([], [], []).
add_list([X|Xs], [Y|Ys], [Z|Zs]) :-
	Z is X+Y,
	add_list(Xs, Ys, Zs).

% dot_list(+Xs, +Ys, -Dot)
% Dot is the dot product of Xs and Ys, accumulated left to right.
dot_list(Xs, Ys, Dot) :-
	dot_list(Xs, Ys, 0.0, Dot).

dot_list([], [], Dot, Dot).
dot_list([X|Xs], [Y|Ys], D0, Dot) :-
	D1 is X*Y+D0,
	dot_list(Xs, Ys, D1, Dot).


% add_tree(+X, +Y, -Z)
% Here a vector is a complete binary tree: u(X) is a one-element
% vector, and p(L,R) pairs two subvectors.  Z is the elementwise
% sum of X and Y.
add_tree(u(X), u(Y), u(Z)) :-
	Z is X+Y.
add_tree(p(X1,X2), p(Y1,Y2), p(Z1,Z2)) :-
	add_tree(X1, Y1, Z1),
	add_tree(X2, Y2, Z2).

% dot_tree(+X, +Y, -Dot)
% Dot is the dot product of X and Y, with the additions arranged as
% a balanced binary tree rather than a left-to-right sequence.
dot_tree(u(X), u(Y), Dot) :-
	Dot is X*Y.
dot_tree(p(X1,X2), p(Y1,Y2), Dot) :-
	dot_tree(X1, Y1, D1),
	dot_tree(X2, Y2, D2),
	Dot is D1+D2.
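
For anyone who wants to try these predicates, here is one way (a sketch
of mine, using the usual length/2 and append/3 library predicates) to
build the u/p tree from an ordinary list by halving it at each level:

% list_to_tree(+Xs, -Tree)
% Build a balanced u/p tree from a non-empty list of numbers.
list_to_tree([X], u(X)).
list_to_tree([X1,X2|Xs], p(L,R)) :-
	length([X1,X2|Xs], N),
	H is N // 2,
	length(Front, H),
	append(Front, Back, [X1,X2|Xs]),
	list_to_tree(Front, L),
	list_to_tree(Back, R).

For example, list_to_tree([1.0,2.0,3.0,4.0], T) gives
T = p(p(u(1.0),u(2.0)),p(u(3.0),u(4.0))), and then dot_tree(T, T, D)
gives D = 30.0.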

Before we go any further: note that the list-based versions are tail
recursive, which means that they shouldn't push any stack frames, but
the tree-based versions are not tail-recursive.  Half of the calls in
add_tree/3 are tail recursive, but none of the ones in dot_tree/3, so
we can predict that

    time(dot_tree) > time(add_tree) > time(dot_list) ~=~ time(add_list).

We could make dot_tree/3 more like add_tree/3 by giving it an accumulator
parameter:

% dot_seqt(+X, +Y, -Dot)
% Like dot_tree/3, but with an accumulator threaded left to right so
% that the second recursive call becomes a tail call.
dot_seqt(Xs, Ys, Dot) :-
	dot_seqt(Xs, Ys, 0.0, Dot).

dot_seqt(u(X), u(Y), D0, Dot) :-
	Dot is X*Y+D0.
dot_seqt(p(X1,X2), p(Y1,Y2), D0, Dot) :-
	dot_seqt(X1, Y1, D0, D1),
	dot_seqt(X2, Y2, D1, Dot).

There are two reasons why I didn't do that.  The first is that if we are
interested in parallel systems, that accumulator parameter is going to
force sequential execution, whereas the original version has sequential
depth lg(N).  The second is that from a numerical analysis point of view,
we would _much_ rather do a tree of additions.  Since each floating-point
operation can cost you half a ULP on a reasonable machine, a sequence of
2N arithmetic operations can lose N ULPs, but a binary tree of arithmetic
operations loses only about lg(N) ULPs.  For example, if we take the dot
product of two vectors of length 128, we can lose up to 7 bits of the
result in dot_list/3 or dot_seqt/3, but only about 3 bits in dot_tree/3.
(I'm talking about round-off here, not cancellation.  That's another
story.)
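
One could see the effect with a little experiment like this (mine;
fill/2 and roundoff_demo/2 are made-up helpers, and list_to_tree/2 is
the sketch shown earlier):

% Build a vector of 128 copies of 0.1 -- the exact dot product with
% itself is 1.28 -- and compare the left-to-right and tree-shaped sums.
roundoff_demo(Sequential, TreeShaped) :-
	fill(128, Xs),
	dot_list(Xs, Xs, Sequential),
	list_to_tree(Xs, T),
	dot_tree(T, T, TreeShaped).

fill(0, []).
fill(N, [0.1|Xs]) :-
	N > 0,
	N1 is N-1,
	fill(N1, Xs).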

I also wrote C versions of vector addition and dot product, using the
obvious array representation.  C programs on a Sun-3 workstation can
be compiled with
	-f68881		-- use a 68881 chip for floating-point arithmetic
	-fsoft		-- do floating-point arithmetic in software
	-fswitch	-- decide at run-time which to use
So that people on a network can access a single copy of Quintus Prolog
whether they have a floating-point chip or not, Quintus Prolog uses the
-fswitch version.  On a workstation with a floating-point chip, I
obtained the following results (time units arbitrary):

		Prolog	Prolog	C	C
		*_tree	*_list	fswitch	f68881
vector addition 53	36	 8	 6
dot product	70	47	25	16

The differences between {add,dot}_{list,tree} are consistent with the
assumption that the difference between tail recursion and non-tail
recursion is responsible.  It is interesting that the tree version
of dot product is only about 3 times slower than the C-fswitch version
(which is the appropriate one to compare it with).

The differences are large, but compiler technology has a lot to do
with them.  It should be noted that the C-f68881 versions are
very simple loops with all key variables in registers, using *X++ rather
than indexing.  Bearing in mind that the tree representation has a very
simple generalisation to sparse vectors, I regard this first result as
encouraging.  When I have some figures for sparse data, I'll let you know.
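
In the meantime, here is roughly how the sparse generalisation of the
dot product could look (a sketch only, with e standing for an all-zero
subvector, so whole zero subtrees are skipped without being visited):

% dot_sparse(+X, +Y, -Dot)
% Sketch: like dot_tree/3, but e means "every element in this subvector
% is zero", so such a subtree contributes 0.0 without being traversed.
dot_sparse(e, _, 0.0).
dot_sparse(u(_), e, 0.0).
dot_sparse(p(_,_), e, 0.0).
dot_sparse(u(X), u(Y), Dot) :-
	Dot is X*Y.
dot_sparse(p(X1,X2), p(Y1,Y2), Dot) :-
	dot_sparse(X1, Y1, D1),
	dot_sparse(X2, Y2, D2),
	Dot is D1+D2.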