[comp.ai.neural-nets] Really smart systems

kingsley@hpwrce.HP.COM (Kingsley Morse) (08/15/90)

If we aspire to making truly intelligent machines, our algorithms must scale
up well to large training sets. In other words, the computational complexity
of our algorithms must accommodate many training patterns and many dimensions.
You may have heard of the "curse of dimensionality". If the computational
complexity of our algorithms grows slowly, then we can train with more patterns
and dimensions to get a smarter system. On the other hand, if the computation
required grows explosively as the number of training patterns or dimensions
increases, then even astoundingly fast hardware won't help.

So, our challenge is to find algorithms which scale up well to large problems,
so we can make really smart systems. Listen........... I'll post some
computational complexity figures for some common algorithms. If you add to
(correct?) these, please do. The terminology that I propose is that 
linear computational complexity is better than polynomial which is
better than exponential. Furthermore, the algorithms' scalability must be
measured with respect to learning, recall, patterns, and dimensions.
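
To pin down what "better" means here, a quick throwaway illustration (the
values of n below are arbitrary, just my own example) of how fast each class
of computation grows:

for n in (10, 20, 40, 80):
    print(f"n={n:3d}  linear={n:4d}  polynomial(n^2)={n**2:6d}  "
          f"exponential(2^n)={2**n}")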


Algorithm          Learning                   Recall
             patterns  dimensions      patterns     dimensions
------------------------------------------------------------
Backprop    polynomial     ?           independent  linear

Nearest
   neighbor linear      linear         linear       linear

Cart        nlogn       exponential    independent      ?


This leads me to believe that nearest neighbor algorithms are better suited to
learning from large training sets, because they can be trained faster.
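
To make the nearest neighbor figures concrete, here is a rough toy sketch of
my own (not any particular package): learning just stores the Np training
vectors, and recall scans all of them once, doing on the order of Ni
arithmetic operations per vector.

def nn_learn(training_pairs):
    # "Training" simply copies the (vector, label) pairs: O(Np*Ni) work.
    return list(training_pairs)

def nn_recall(memory, query):
    # Recall visits every stored pattern once: O(Np*Ni) distance arithmetic.
    best_label, best_dist = None, float("inf")
    for vec, label in memory:
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

memory = nn_learn([([0.0, 0.0], "class A"), ([1.0, 1.0], "class B")])
print(nn_recall(memory, [0.9, 0.8]))    # -> class B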

Any contributions to this list are welcome.

russell@minster.york.ac.uk (08/17/90)

In article <3430007@hpwrce.HP.COM> kingsley@hpwrce.HP.COM (Kingsley Morse) writes:
>I'll post some
>computational complexity figures for some common algorithms. If you add to
>(correct?) these, please do. The terminology that I propose is that 
>linear computational complexity is better than polynomial which is
>better than exponential. Furthermore, the algorithms' scalability must be
>measured with respect to learning, recall, patterns, and dimensions.
>
>
>Algorithm          Learning                   Recall
>             patterns  dimensions      patterns     dimensions
>------------------------------------------------------------
>Backprop    polynomial     ?           independent  linear
>
>Nearest
>   neighbor linear      linear         linear       linear
>
>Cart        nlogn       exponential    independent      ?
>
>

Well, if you want tuppeny-worth's thrown in, here's mine.

Backprop has NO guaranteed convergence, so quoting a computational complexity
figure is misleading.  It may be possible to give an "average case" complexity
figure, but this is so dependent on the initial weight settings as to be of
little use.  Tesauro and Janssens report empirical results relating learning
time to the predicate order (q) of the patterns.  The net has q inputs, 2q
nodes in the first layer, fully connected, and 1 output node.  The task set is
the parity function on the q bits.  Learning times in backprop scale as
approximately 4^q; since the task set is of size 2^q, the training time per
pattern is about 2^q.  The conclusion is that, empirically, learning time is
of exponential order.
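
To see the numbers, here is a tiny sketch of my own (not Tesauro and
Janssens' code) that builds the parity task set for a few values of q and
prints the reported scaling:

from itertools import product

def parity_task(q):
    # Every q-bit input pattern, labelled with its parity.
    return [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=q)]

for q in range(2, 7):
    task = parity_task(q)
    print(f"q={q}: patterns 2^q = {len(task):3d}, "
          f"reported learning time ~ 4^q = {4 ** q}")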

The perceptron scales exponentially in input size (Hampson and Volper 1986).

(See Neural Network Design and the Complexity of Learning by Judd for more.....I
ain't read it all yet, but it's interesting so far.....)

I confess to not understanding quite what "Cart" means in the above table - I can
make an educated guess, however.....

To add another algorithm to the list, the ADAM system, based on the Willshaw net
(i.e. a distributed associative memory) (Austin 1987), has the following complexity:

Algorithm        Learning                        Recall
             patterns   dimensions          patterns   dimensions
--------------------------------------------------------------------
ADAM         linear     linear (quadratic)  indep.     linear (nlogn)

The bracketed figures refer to an abnormal, but permissible, parameterisation
(Beale 1990).

Note that recall may take longer than teaching - but also note that in the
realm where ADAM is typically used, the multiplicative constants in front of
the order terms are extremely small.
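
For anyone who hasn't met the Willshaw net, here is a drastically simplified
sketch of my own (a bare binary correlation matrix; the real ADAM adds
N-tuple pre-processing and more) showing where those orders come from:
teaching is one pass per pattern, and recall is a single thresholded
matrix-vector operation, independent of how many patterns have been taught.

def teach(memory, key, data):
    # One pass per taught pattern (linear in patterns); the OR-ed outer
    # product touches len(key)*len(data) cells (roughly quadratic in
    # dimensions when key and data are the same size).
    for i, k in enumerate(key):
        if k:
            for j, d in enumerate(data):
                if d:
                    memory[i][j] = 1

def recall(memory, key):
    # Independent of how many patterns were taught: one thresholded
    # matrix-vector pass, with threshold = number of set bits in the key.
    threshold = sum(key)
    sums = [sum(memory[i][j] for i, k in enumerate(key) if k)
            for j in range(len(memory[0]))]
    return [1 if s >= threshold else 0 for s in sums]

key_bits, data_bits = 6, 4
memory = [[0] * data_bits for _ in range(key_bits)]
teach(memory, [1, 0, 1, 0, 1, 0], [0, 1, 0, 1])
print(recall(memory, [1, 0, 1, 0, 1, 0]))    # -> [0, 1, 0, 1]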

Russell.

usenet@nlm.nih.gov (usenet news poster) (08/18/90)

r> russell@uk.ac.york.minster  (russell) writes:
km> in response to kingsley@hpwrce.HP.COM (Kingsley Morse):

km> I'll post some
km> computational complexity figures for some common algorithms. If you add to
km> (correct?) these, please do. The terminology that I propose is that 
km> linear computational complexity is better than polynomial which is
km> better than exponential. Furthermore, the algorithms' scalability must be
km> measured with respect to learning, recall, patterns, and dimensions...

r> Well, if you want tuppeny-worth's thrown in, here's mine.  Backprop has
r> NO guaranteed convergence, so quoting a computational complexity figure
r> is misleading.  It may be possible to give an "average case" complexity
r> figure, but this is so dependent on the initial weight settings as to
r> be of little use.  Tesauro and Janssens report empirical results
r> relating learning time to the predicate order (q) of the patterns.  The
r> net has q inputs, 2q nodes in the first layer, fully connected, and 1
r> output node.  The task set is the parity function on the q bits.
r> Learning times in backprop scale as approximately 4^q; since the task
r> set is of size 2^q, the training time per pattern is about 2^q.  The
r> conclusion is that, empirically, learning time is of exponential order.
r> 
r> The perceptron scales exponentially in input size (Hampson and Volper 1986).
r> 
r> (See Neural Network Design and the Complexity of Learning by Judd for
r> more.....I ain't read it all yet, but it's interesting so far.....)
r> 
r> I confess to not understanding quite what "Cart" means in the above
r> table - I can make an educated guess, however.....

From a separate communication with km:

CART stands for "Classification and Regression Tree". It is similar to ID3, and
here's how it works: the training vectors are used to "grow" a decision tree,
and the decision tree can then be used to categorize new inputs.
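
To make that concrete, here is a toy tree-grower of my own (an illustration
only, not the real CART procedure): each node greedily splits on the binary
feature with the lowest weighted Gini impurity, so growing costs roughly
Ni*Np work per level (about Ni*Np*log Np for a balanced tree), and recall is
a single root-to-leaf walk of roughly log Np tests.

from collections import Counter

def gini(labels):
    # Impurity of a set of class labels.
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def grow(rows, labels):
    # Greedily split on the binary feature with the lowest weighted impurity.
    if len(set(labels)) <= 1:
        return labels[0]                                   # pure leaf
    def split_cost(f):
        parts = [[l for r, l in zip(rows, labels) if r[f] == v] for v in (0, 1)]
        return sum(len(p) * gini(p) for p in parts if p)
    best = min(range(len(rows[0])), key=split_cost)
    left = [(r, l) for r, l in zip(rows, labels) if r[best] == 0]
    right = [(r, l) for r, l in zip(rows, labels) if r[best] == 1]
    if not left or not right:                              # no split helps
        return Counter(labels).most_common(1)[0][0]
    return (best,
            grow([r for r, _ in left], [l for _, l in left]),
            grow([r for r, _ in right], [l for _, l in right]))

def classify(tree, row):
    # Recall walks one root-to-leaf path: roughly log Np tests.
    while isinstance(tree, tuple):
        feature, zero_branch, one_branch = tree
        tree = zero_branch if row[feature] == 0 else one_branch
    return tree

rows = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = ["no", "no", "yes", "yes"]         # class depends only on feature 0
tree = grow(rows, labels)
print(classify(tree, [1, 0]))               # -> "yes"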

OK, let me add a couple more cents to the pot.  A classification like this
needs to consider both the computational complexity and the storage
requirements.  Algorithms like nearest neighbor, CART? and ADAM? require that
the complete training set be stored for recall, while a perceptron needs
storage of the order of #inputs*#outputs, and a single-hidden-layer, fully
connected neural net needs (#inputs+#outputs)*#hidden_nodes.
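
As a back-of-the-envelope check on those storage figures (the particular
values of Np, Ni, No and Nh below are arbitrary, picked only for
illustration):

Np, Ni, No, Nh = 10000, 64, 10, 32   # patterns, inputs, outputs, hidden nodes

storage = {
    "perceptron (weights)":            Ni * No,
    "net with 1 hidden layer":         Nh * (Ni + No),
    "nearest neighbor (training set)": Np * (Ni + No),
}
for name, cells in storage.items():
    print(f"{name:34s} ~ {cells:,} stored values")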

A second consideration is whether the algorithm will generalize, i.e., whether
it will make use of information from more than one input pattern to formulate
an output.

So, at the risk of grossly misquoting and being flamed horribly, let me
reorder the classification in terms of Np=#patterns, Ni=#inputs,
No=#outputs, and Nh=#hidden nodes:

Algorithm             Learning time         Recall time   Storage       Generalizes
------------------------------------------------------------------------------------
Perceptron            No*(x^Ni)             No*Ni         No*Ni         limited

Neural Net                                  Nh*(Ni+No)    Nh*(Ni+No)    yes
(1 hidden layer)
  trained by:
  Backprop            x^(Nh*(Ni+No)) ?
  Conjugate gradient  (Nh*(Ni+No))^3 - locally
  Monte Carlo         x^(Nh*(Ni+No)) ?

Nearest neighbor      (Ni+No)*Np            Ni*Np         Np*(Ni+No)    no

Class & Reg. Tree     Ni*(Np log Np)        Ni*log Np     Np*(Ni+No)    no
(CART)                + (Ni+No)*Np ?                      + Ni*log Np

Adaptive Memory       Np?                   ?             Np*(Ni+No)?   no?
(ADAM)

David States