[comp.ai.neural-nets] Pseudo-inverse formula problem. HELP!

rcrane@yoda.eecs.wsu.edu (Robert Crane - EE582) (04/28/91)

In the neural network literature, it is stated that one way to
calculate a weight vector, w, that minimizes the error for simple
linear units (such as perceptrons) is to use the pseudo-inverse
formula:

     w = Q*a'*b

     a is the pxn matrix whose rows are the p input patterns
        (each pattern having n components)
     b is the px1 vector of training outputs
     Q = inverse(a'*a)

     and the pseudo-inverse is Q*a'.
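In modern numpy notation, a minimal sketch of the formula might look
like this (illustrative toy data only; it assumes the rows of a are the
p input patterns and b holds the p target outputs):

import numpy as np

# Toy data: p = 4 input patterns (rows), each of dimension n = 2.
a = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])                # p x n matrix of input patterns
b = np.array([1.0, -1.0, 1.0, -1.0])      # p-vector of training outputs

Q = np.linalg.inv(a.T @ a)                # n x n; exists here since rank(a) = n = 2
w = Q @ a.T @ b                           # w = Q * a' * b
print(w)                                  # weight vector minimizing the squared error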

It is stated that the method for computing the weights applies only if
Q EXISTS, and that this condition requires the input patterns to be
linearly independent.

My problem is that in practical examples I have been using
this formula with great success even though I have p=100 or more
input vectors (divided into two classes which are linearly separable)
spanning only two dimensions.  Clearly, these vectors canNOT be
linearly independent, yet I can still compute this equation.  The
solution appears to minimize the mean square error in every case
in the problems I have been working on.
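For concreteness, here is a small numpy sketch of roughly this
situation (hypothetical data standing in for my actual set): 100
two-dimensional patterns in two separable classes, where Q is only
2x2 and computes without complaint:

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: p = 100 two-dimensional patterns in two
# linearly separable classes (50 per class).
class1 = rng.normal(loc=[+2.0, +2.0], scale=0.5, size=(50, 2))
class2 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
a = np.vstack([class1, class2])                    # 100 x 2
b = np.concatenate([np.ones(50), -np.ones(50)])    # +1 / -1 targets

# The 100 patterns cannot be linearly independent in 2 dimensions,
# but a'*a is only 2 x 2, and it is invertible whenever the patterns
# span both dimensions, i.e. rank(a) = 2.
print(np.linalg.matrix_rank(a))                    # 2
Q = np.linalg.inv(a.T @ a)                         # no trouble here
w = Q @ a.T @ b
print(np.all(np.sign(a @ w) == b))                 # True for this separable data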


To be a little more explicit, the mean square error is defined
to be:

   E = .5*(a*w-b)'*(a*w-b)
     = .5*[w'*a'*a*w - 2*w'*a'*b + b'*b] 

Taking the partial derivative of E with respect to w
and setting it equal to 0 to find the minimum, we get:

   0 = a'*a*w - a'*b
   a'*a*w = a'*b

 and thus

   w = inv(a'*a)*a'*b.
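A quick numerical check of this derivation (a sketch with made-up
data): at the w computed this way the gradient a'*(a*w - b) vanishes,
and the result agrees with a general-purpose least-squares routine.

import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=(100, 2))             # 100 patterns, 2 inputs each (made up)
b = rng.normal(size=100)                  # arbitrary targets

w = np.linalg.inv(a.T @ a) @ a.T @ b      # w = inv(a'*a)*a'*b
grad = a.T @ (a @ w - b)                  # gradient a'*(a*w - b) at the solution
print(np.allclose(grad, 0.0))             # True: the normal equations are satisfied

w_ls, *_ = np.linalg.lstsq(a, b, rcond=None)
print(np.allclose(w, w_ls))               # True: same least-squares minimizer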

Again, it is stated that Q = inverse(a'*a) exists only if
the vectors that make up a are linearly independent.
But as I have stated before, I have been computing this
even when the vectors that make up a are not linearly independent.

What is going on here????
Can someone please help explain the mathematics involved here?
Does not the pseudo-inverse exist for any nxp matrix???
-- 
-bob crane (rcrane@yoda.eecs.wsu.edu)

balden@wimsey.bc.ca (Bruce Balden) (04/28/91)

In article <1991Apr27.213346.9351@serval.net.wsu.edu> rcrane@yoda.eecs.wsu.edu (Robert Crane - EE582) writes:
>In the neural network literature, it is stated that one way to
>calculate a weight vector, w, that minimizes the error for simple
>linear units (such as perceptrons) is to use the pseudo-inverse
>formula:
>
>     w = Q*a'*b
>
>     a is the pxn matrix whose rows are the p input patterns
>        (each pattern having n components)
>     b is the px1 vector of training outputs
>     Q = inverse(a'*a)
>
>     and the pseudo-inverse is Q*a'.
>
>It is stated that the method for computing the weights applies only if
>Q EXISTS, and that this condition requires the input patterns to be
>linearly independent.
>
>My problem is that in practical examples I have been using
>this formula with great success even though I have p=100 or more
>input vectors (divided into two classes which are linearly separable)
>spanning only two dimensions.  Clearly, these vectors canNOT be
>linearly independent, yet I can still compute this equation.  The
>solution appears to minimize the mean square error in every case
>in the problems I have been working on.
>
>What is going on here????
>Can someone please help explain the mathematics involved here?
>Does not the pseudo-inverse exist for any nxp matrix???

For any matrix A of maximal rank, its "Moore-Penrose pseudo-inverse" A+
is given by either (A'*A)^(-1) * A' or A' * (A * A')^(-1), whichever
exists.  Both are easily seen to reduce to the ordinary inverse when A
is square and invertible.  Needless to say, these formulas aren't going
to work unless one of the matrices A'*A and A*A' has an inverse.  This
is guaranteed by A having maximal rank.
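A small numerical illustration (a sketch only, in numpy, with made-up
matrices): for a tall matrix of maximal rank the first formula agrees
with a general pseudo-inverse routine, and for a wide one the second
formula does.

import numpy as np

rng = np.random.default_rng(2)

A_tall = rng.normal(size=(5, 3))          # more rows than columns, rank 3
left = np.linalg.inv(A_tall.T @ A_tall) @ A_tall.T
print(np.allclose(left, np.linalg.pinv(A_tall)))     # True

A_wide = rng.normal(size=(3, 5))          # more columns than rows, rank 3
right = A_wide.T @ np.linalg.inv(A_wide @ A_wide.T)
print(np.allclose(right, np.linalg.pinv(A_wide)))    # True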

This is a lot less mysterious if you take a geometric approach to things.
Consider the matrix of an overdetermined set of equations.  If we think
of the matrix as multiplying the unknown vector on the left, Ax,
then this means that A has more rows than columns.  For a particular
RHS y, there is not much chance that any vector x exists such that
y=Ax.  However, we don't give up.  Instead, we look at the entire
set of images Ax over all possible inputs x.  This set (a subspace) is
not, in general, going to include y, but we can PROJECT y onto it.
A little reasoning will show that the "orthogonal projection" of y
onto the range of A produces the so-called least-squares image y1.
For this y1 there is a solution of y1=Ax, and this x has the property
that no other x produces a result closer to y.
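A short sketch of this picture (made-up overdetermined system, in
numpy): the least-squares residual y - Ax is orthogonal to every column
of A, which is just another way of saying that Ax is the orthogonal
projection of y onto the range of A.

import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 2))               # 6 equations in 2 unknowns (overdetermined)
y = rng.normal(size=6)                    # RHS, generally not in range(A)

x, *_ = np.linalg.lstsq(A, y, rcond=None) # least-squares solution
residual = y - A @ x
print(np.allclose(A.T @ residual, 0.0))   # True: the residual is orthogonal to range(A)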

Next question: how do we carry out the calculation?  If A has a single
column, then its range consists of multiples of that single column.  The
familiar formula for the projection of a vector y onto the range
of A takes the matrix form (in this case) Py = A*(A'*A)^(-1)*A'*y.
This formula turns out to be perfectly general, as long as the indicated
inverse exists.
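In code (same kind of made-up data as above), the projector built from
that formula reproduces the least-squares image Ax and behaves as an
orthogonal projection should.

import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 2))               # full column rank, so inv(A'*A) exists
y = rng.normal(size=6)

P = A @ np.linalg.inv(A.T @ A) @ A.T      # orthogonal projector onto range(A)
x, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(P @ y, A @ x))          # True: P*y is the least-squares image
print(np.allclose(P @ P, P), np.allclose(P, P.T))    # idempotent and symmetric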

Therefore, the equation Py=Ax has a unique solution x.  But we already
know that Py = A*(A'*A)^(-1)*A'*y = Az, where z = (A'*A)^(-1)*A'*y,
and Az is certainly in the image of A; therefore z is the solution
sought.  In this case, therefore, we get the formula

A+ = (A'*A)^(-1)*A'

The interpretation of A+ depends on your point of view.  If you consider
A to act on the right, x -> xA, then for A having more rows than
columns, A represents an UNDERDETERMINED system of equations and
A+ finds the SHORTEST of the possible solutions to xA = y.  In greater
generality, A will not have maximal rank (it reduces to a form having
both a column and a row of zeroes).  In such a case, x = A+y is the
vector with the following properties:
(1) The vector w=Ax (which will usually not equal y) minimizes
the distance between w and y.
(2) Among all solutions of Ax=w, x is the shortest.
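Here is a sketch of the rank-deficient case (invented data): a general,
SVD-based pseudo-inverse routine returns an x with exactly these two
properties, even though inv(A'*A) does not exist.

import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 3))
A[:, 2] = A[:, 0] + A[:, 1]               # third column dependent: rank(A) = 2
y = rng.normal(size=6)

x = np.linalg.pinv(A) @ y                 # general (SVD-based) pseudo-inverse solution
w = A @ x                                 # best approximation to y within range(A)

# (1) w minimizes the distance to y: the residual is orthogonal to range(A).
print(np.allclose(A.T @ (y - w), 0.0))    # True

# (2) among all solutions of A*x = w, this x is the shortest.
null_vec = np.array([1.0, 1.0, -1.0])     # A @ null_vec = 0 by construction
x_other = x + null_vec                    # another solution of A*x = w
print(np.allclose(A @ x_other, w))        # True
print(np.linalg.norm(x) < np.linalg.norm(x_other))   # True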

A+ is defined for every matrix, whatever its shape or rank.

Abstractly, A+ can be uniquely characterized by a list of properties.
I won't go into this here.

As for your success in violating the maximal rank condition, perhaps
you are using a matrix package that calculates pseudo-inverses
in the general case, or maybe we're just seeing round-off error.
-- 
*******************************************************************************
*	Bruce E. Balden	    		Computer Signal Corporation Canada    *
*	Thaumaturgist			225B Evergreen Drive		      *
*	balden@xenophon.wimsey.bc.ca	Port Moody, B.C. V3H 1S1     CANADA   *

janowsky@math.rutgers.edu (Steven Janowsky) (04/29/91)

The need for linear independence of patterns in neural network learning
algorithms is vastly overestimated.  Typically a linearly dependent set will
produce the same results as any spanning linearly independent set.

See Berryman, Inchiosa, Jaffe and Janowsky:
"Convergence of an iterative neural network learning algorithm for linearly
dependent patterns"  J. Phys. A 23 L223-L228 (1990).

"Extending the Pseudo-Inverse Rule" in _Neural_Networks_and_Spin_Glasses_
(Proceedings, Porto Alegre 1989), W.K. Theumann and R. Köberle, eds.,
Teaneck: World Scientific, 1990.
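
As a quick numerical check of the claim above (a sketch, assuming the
targets are consistent with the dependencies among the patterns):
adding patterns that are linear combinations of a spanning independent
set, with correspondingly consistent targets, leaves the pseudo-inverse
solution unchanged.

import numpy as np

rng = np.random.default_rng(5)
n = 5
A_ind = rng.normal(size=(3, n))           # 3 linearly independent patterns (rows)
w_true = rng.normal(size=n)
b_ind = A_ind @ w_true                    # targets consistent with a linear rule

# Add patterns that are linear combinations of the first three, with
# targets produced by the same rule, so the enlarged system stays consistent.
C = rng.normal(size=(4, 3))
A_dep = np.vstack([A_ind, C @ A_ind])     # 7 patterns, still rank 3
b_dep = A_dep @ w_true

w1 = np.linalg.pinv(A_ind) @ b_ind
w2 = np.linalg.pinv(A_dep) @ b_dep
print(np.allclose(w1, w2))                # True: the dependent patterns change nothing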