[comp.ai.neural-nets] Data Compression

andrew@nsc.nsc.com (andrew) (03/18/89)

Here is the promised collation of Image Compression papers which some
of you requested. The received material is listed here, with a brief review.
A rather light repast (quantity, not quality) consisted of:

1) Peter Foldiak: Adaptive network for optimal linear feature extraction.

Abstract: A network of highly interconnected linear neuron-like processing
units and a simple, local, unsupervised rule for the modification of
connection strengths between these units are proposed. After training the
network on a high (m) dimensional distribution of input vectors, the lower (n)
dimensional output becomes a projection into the subspace of the n largest
principal components (the subspace spanned by the n eigenvectors of largest
eigenvalues of the input covariance matrix) and maximizes the mutual
information between the input and the output in the same way as principal
component analysis does. The purely local nature of the synaptic modification
rule (simple Hebbian and anti-Hebbian) makes the implementation of the network
easier, faster and biologically more plausible than rules depending on error
propagation.

From sun!NSS.CS.UCL.AC.UK!PF103%phoenix.cambridge.ac.uk 
Peter Foldiak
Physiological Laboratory
Downing Street
Cambridge CB2 3EG
England

2a) G.W. Cottrell, P. Munro, D. Zipser, "Image Compression By Back Propagation:
   An Example of Extensional Programming", Feb. 1987, ICS Report 8702, 
   (to be published in N.E. Sharkey (Ed.), Models of Cognition, Norwood, N.J.) 
   Reprint requests to : Institute for Cognitive Science, C-105; UC San Diego, 
   La Jolla, CA 92093.

2b) G.W. Cottrell, "Analysis of Image Compression by Back Propagation" (cf. 2a)

3a) Terence D. Sanger, "Optimal Unsupervised Learning in a Single-Layer
   Linear Feedforward Neural Network", MIT AI Lab., NE43-743, Cambridge, 
   MA 02139. TDS@wheaties.ai.mit.edu

3b) Terence D. Sanger, "An Optimality Principle for Unsupervised Learning" 

(Since I only got #1 electronically, the abstract is included only for #1;
a small illustrative sketch of the kind of rule it describes appears just below).
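
For readers who want a concrete feel for the kind of rule Foldiak's abstract
describes, here is a minimal sketch (my own Python/NumPy illustration, not
code from the paper): feedforward weights trained with an Oja-style Hebbian
rule, plus lateral weights trained anti-Hebbian so that the n outputs
decorrelate and together span the principal subspace. The network sizes,
learning rates, toy input distribution, and the closed-form settling of the
lateral loop are all my assumptions.

import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3                 # input and output dimensions (arbitrary)
alpha, beta = 0.01, 0.002   # Hebbian / anti-Hebbian learning rates (assumed)

# toy inputs: a few high-variance directions mixed into m dimensions
basis, _ = np.linalg.qr(rng.normal(size=(m, m)))
std = np.array([3.0, 2.0, 1.5] + [0.3] * (m - 3))
data = (rng.normal(size=(5000, m)) * std) @ basis.T

Q = rng.normal(scale=0.1, size=(n, m))   # feedforward (Hebbian) weights
W = np.zeros((n, n))                     # lateral (anti-Hebbian) weights, zero diagonal

def settle(x, Q, W):
    # equilibrium of the recurrent output y = Q x + W y (my simplification)
    return np.linalg.solve(np.eye(len(Q)) - W, Q @ x)

for x in data:
    y = settle(x, Q, W)
    # Oja-style Hebbian update of the feedforward weights
    Q += alpha * (np.outer(y, x) - (y ** 2)[:, None] * Q)
    # anti-Hebbian update of the lateral weights (diagonal kept at zero)
    W -= beta * np.outer(y, y)
    np.fill_diagonal(W, 0.0)

# the settled outputs should now be roughly decorrelated, and the rows of Q
# should lie roughly in the span of the top-n principal components
Y = np.array([settle(x, Q, W) for x in data[-1000:]])
print(np.round(np.corrcoef(Y.T), 2))

Note that, as the abstract says, this gives a projection onto the principal
subspace rather than the individual eigenvectors in eigenvalue order.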
---------------------------------------------------------------------------

All three papers are strikingly similar. They all deal with the "encoding
problem". If there is an "odd man out", it is 2a,b), in that 3) - and I
surmise 1) - build upon it and I think subsume its results.

The Sanger 3a) paper is highly germane; he seems to have defined a method
whereby maximal information preservation occurs across one layer, using
only linear elements and a purely local update rule. The learned
matrix of weights becomes (row-wise) the eigenvectors of the input
autocorrelation. He demonstrates compression of about 23:1 on 8-bit
grey-scale photos, good performance on texture segmentation, and interesting
correlations with the receptive fields found in primates when he trains
with random noise. This paper builds upon (e.g.) the Linsker article
in IEEE Computer, 1988. It offers a new way to look at representations in
hidden units (which I've not yet seen so clearly expressed), and allows easy
selection of any desired compression level by selectively ignoring eigenvectors
of lower eigenvalue (ordering is a property of the learning algorithm), and
by selective quantisation of the individual eigenvector outputs.
The techniques used include Oja learning, Gram-Schmidt orthogonalisation,
and the Karhunen-Loeve transformation (a minimal sketch of the update rule
is given after the list of criteria below).
Highly relevant is his comparative data with respect to self-supervised
backprop (cf. #2), where numerous criteria show GHA ("Generalised Hebbian
Algorithm") to be superior. These criteria include:
- performance in noise
- ability to selectively quantise
- training time
- dependence upon training order
- dependence upon initial weight values
- uniqueness of solution
- interpretation of solution as representing significant features

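To make the above concrete, here is a minimal sketch of the Generalised
Hebbian Algorithm itself (my own Python/NumPy illustration; the toy data,
sizes and learning rate are arbitrary assumptions). The update combines an
Oja-style Hebbian term with a lower-triangular, Gram-Schmidt-like decay, and
the rows of W converge, in order of decreasing eigenvalue, to the leading
eigenvectors of the input autocorrelation; compression then amounts to
keeping (and quantising) only the leading outputs.

import numpy as np

rng = np.random.default_rng(1)
m, n = 16, 4          # input dimension and number of components kept (arbitrary)
alpha = 0.005         # learning rate (assumed)

# toy data with a known covariance structure
basis, _ = np.linalg.qr(rng.normal(size=(m, m)))
std = np.array([3.0, 2.0, 1.4, 1.0] + [0.3] * (m - 4))
X = (rng.normal(size=(20000, m)) * std) @ basis.T

W = rng.normal(scale=0.1, size=(n, m))
for x in X:
    y = W @ x
    # Sanger's rule: Hebbian term minus a lower-triangular (Gram-Schmidt-like) decay
    W += alpha * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

# the rows of W should now approximate the leading eigenvectors of E[x x^T];
# compare against an explicit eigendecomposition of the sample autocorrelation
C = X.T @ X / len(X)
evals, evecs = np.linalg.eigh(C)
top = evecs[:, ::-1][:, :n].T           # leading eigenvectors, largest first
print(np.round(np.abs(W @ top.T), 2))   # close to the identity (up to sign) if converged

# "compression": code each vector with n numbers, reconstruct from them
x = X[0]
code = W @ x              # n-dimensional code
x_hat = W.T @ code        # reconstruction from the leading components
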
This should be provocative stuff for ardent backprop aficionados! Bear in
mind, however, that this specifically addresses the encoding problem (at a
minimum). Conversely, the extraction of a significance-ordered, decorrelated
feature set is a quite general and useful analysis process for ANY dataset.

Paper 3b), from Terence Sanger, explores the performance of compression using
GHA (the Generalised Hebbian Algorithm) with nonlinearities and multiple
layers. Results are:

1. He achieves equivalent (or better) convergence than backprop in a
   multilayer net, using an output nonlinearity of rectification at zero
   threshold (a one-line sketch of this modification appears after this
   list). Since training time will scale "at most linearly" with the
   number of layers, this is "an encouragement" to apply GHA to solve
   other problems (heretofore addressed chiefly by backprop systems).

2. The receptive field results for nonlinear GHA are similar to the linear
   case, but include (mostly) double the number of states, since he now
   obtains the complementary patterns (off <-> on interchanged).
   (I find it mysterious that random noise training produces orientation-
   selective receptive fields spontaneously; what is the connection between
   eigenvectors of an input autocorrelation and straight lines?
   Not only are these fields similar to those found in retinal cells of
   cat and monkey, but, as one goes down the list in order of decreasing
   eigenvalue, they somewhat resemble the eigenstates of atomic wave-functions
   from quantum mechanics - perhaps a coincidental isomorphism!).

3. With a 2-layer nonlinear GHA net (hidden nonlinear, output linear), one
   is able to detect stereo disparity edges. He notes that "no linear
   function has this property". This therefore vindicates the nonlinear
   approach.

Future directions may include investigation of classes of nonlinearities
other than rectification or sigmoids, and further analysis of nonlinearity
from a mathematical standpoint.
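
As a purely illustrative gloss on point 1 (this is my reading of
"rectification at zero threshold", not Sanger's code), the nonlinear variant
amounts to rectifying the outputs before the same Hebbian update is applied -
a one-line change relative to the linear GHA sketch given earlier:

import numpy as np

def gha_step(W, x, alpha=0.005, nonlinear=False):
    # one GHA update; with nonlinear=True the outputs are rectified at zero
    # before entering the Hebbian term (assumed placement of the nonlinearity)
    y = W @ x
    if nonlinear:
        y = np.maximum(y, 0.0)    # the only change from the linear case
    return W + alpha * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

Since a rectified unit responds to a pattern but not to its negative, one can
see informally why the complementary (off <-> on interchanged) receptive
fields show up as separate states in the nonlinear case.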

============================================================================
	DOMAIN: andrew@logic.sc.nsc.com  
	ARPA:   nsc!logic!andrew@sun.com
	USENET: ...{amdahl,decwrl,hplabs,pyramid,sun}!nsc!logic!andrew

	Andrew Palfreyman				408-721-4788 work
	National Semiconductor	MS D3969		408-247-0145 home
	2900 Semiconductor Dr.			
	P.O. Box 58090					there's many a slip
	Santa Clara, CA  95052-8090			'twixt cup and lip
============================================================================

fozzard@boulder.Colorado.EDU (Richard Fozzard) (03/18/89)

In article <10199@nsc.nsc.com> andrew@nsc.nsc.com (andrew) writes:

with regard to:
3b) Terence D. Sanger, "An Optimality Principle for Unsupervised Learning"

>   (I find it mysterious that random noise training produces orientation-
>   selective receptive fields spontaneously; what is the connection between
>   eigenvectors of an input autocorrelation and straight lines?
>   Not only are these fields similar to those found in retinal cells of
>   cat and monkey, but, as one goes down the list in order of decreasing
>   eigenvalue, they somewhat resemble the eigenstates of atomic wave-functions
>   from quantum mechanics - perhaps a coincidental isomorphism!).
>
Well, I don't have the solution to the mystery, but Lehky and Sejnowski
report a similar learning of line-segment receptive fields under equally
unexpected circumstances - learning surface curvature from shading.  This
work used standard back-prop instead of the Hebb rule, though.

Here's the reference:

"Neural Network Model for the Cortical Representation
of Surface Curvature from Images of Shaded Surfaces"
Sidney R. Lehky and Terrence Sejnowski
Department of Biophysics
Johns Hopkins University
In: Lund, J.S. (Ed.) Sensory Processing, Oxford (1988)


PS: If you talked to a certain Dr. Mandelbrot, he would insist that your
"coincidental isomorphism" was hardly that - remember fractals?


Richard Fozzard
University of Colorado
fozzard@boulder.colorado.edu

aboulang@bbn.com (Albert Boulanger) (03/18/89)

Here is another one from my collection since I am interested in this subject:

"Dimensionality-Reduction Using Connectionist Networks"
Eric Saud, MIT AI Memo 941 (January 1987)

Also able to reduce dimensionality are the so-called "encoder" networks
trained with backprop, where the desired output is set to be the input
(dimensionality is reduced at the hidden layer, and the hidden-layer
activity can serve as the desired "real" output), as well as Hinton and
McClelland's recirculation generalization of encoder nets (see "Learning
Representations by Recirculation", Geoffrey Hinton & James McClelland,
NIPS Proceedings, AIP Press, 1988); a minimal sketch of an encoder net
appears at the end of this post. In general, the class of learning
algorithms called "unsupervised" learning can potentially reduce
dimensionality. There is, however, a spectrum of characteristics among
the different unsupervised learning procedures:

Do the reduced dimensions span the space?
Are the reduced dimensions orthogonal?

Terry Sanger's algorithm does both. It would be interesting to work
out what his learning rule does with sigmoid transfer functions for
the neurons.
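
For completeness, here is a minimal sketch of the "encoder" scheme mentioned
above (my own Python/NumPy illustration, not code from any of the cited
papers): a backprop net trained to reproduce its input through a narrower
hidden layer, whose activity then serves as the reduced-dimensionality code.
The sizes, learning rate, and the use of purely linear units are arbitrary
assumptions made to keep the example short.

import numpy as np

rng = np.random.default_rng(2)
m, h = 16, 4      # input size and (narrower) hidden size (assumed)
lr = 0.01         # learning rate (assumed)

# toy data with a few dominant directions
basis, _ = np.linalg.qr(rng.normal(size=(m, m)))
X = (rng.normal(size=(5000, m)) * np.linspace(3.0, 0.2, m)) @ basis.T

W1 = rng.normal(scale=0.1, size=(h, m))   # encoder weights
W2 = rng.normal(scale=0.1, size=(m, h))   # decoder weights

for epoch in range(20):
    for x in X:
        code = W1 @ x            # hidden-layer activity = the reduced code
        x_hat = W2 @ code        # reconstruction; the target is the input itself
        err = x_hat - x
        # gradient descent on 0.5 * ||x_hat - x||^2
        gW2 = np.outer(err, code)
        gW1 = np.outer(W2.T @ err, x)
        W2 -= lr * gW2
        W1 -= lr * gW1

mse = np.mean([np.sum((W2 @ (W1 @ x) - x) ** 2) / m for x in X])
print("mean squared reconstruction error per component:", mse)

In the purely linear case the hidden layer ends up spanning the same
principal subspace that Sanger's GHA extracts explicitly (though without
separating out the individual ordered eigenvectors), which is part of the
comparison Sanger draws with self-supervised backprop.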


Albert Boulanger
BBN Systems & Technologies Corporation
aboulanger@bbn.com

kortge@Portia.Stanford.EDU (Chris Kortge) (03/19/89)

In article <10199@nsc.nsc.com> andrew@nsc.nsc.com (andrew) writes:
> [...]
>3a) Terence D. Sanger, "Optimal Unsupervised Learning in a Single-Layer
>   Linear Feedforward Neural Network", MIT AI Lab., NE43-743, Cambridge, 
>   MA 02139. TDS@wheaties.ai.mit.edu
>
>The Sanger 3a) paper is highly germane; he seems to have defined a method
>whereby maximal information preservation occurs across one layer, using
>only linear elements, and a purely local update structure. The learned
>matrix of weights becomes (row-wise) the eigenvectors of the input
>autocorrelation. [...]
>Highly relevant is his comparitive data with respect to (cf. #2) self-supervised
>backprop, where numerous criteria show GHA ("Generalised Hebbian Algorithm")
>to be superior. These criteria include:
>- training time
>  [and several other things]

It wasn't clear to me from reading Sanger's thesis that the GHA is
obviously faster than self-supervised backprop.  He says that, for
backprop, "training time still seems to be an exponential function of
the number of units in the network."  It seems like this would be
problem-dependent, though, and principal components is not that tough
as typical backprop problems go.  Does anyone know of an actual scaling
study on this?  (E.g., where n-dimensional random Gaussian vectors with
m known principal components are used as inputs, and n & m are varied,
keeping percent variance explainable constant, say.)
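
In case anyone wants to run such a study, here is a sketch of the kind of
data generator Chris describes (my interpretation of the parenthetical above;
all names and numbers are arbitrary): n-dimensional Gaussian vectors with m
dominant principal directions, scaled so that the fraction of variance those
directions carry stays constant as n and m are varied.

import numpy as np

def make_pc_data(n, m, n_samples, explained=0.9, seed=0):
    # Gaussian inputs in n dimensions whose m leading principal components
    # carry a fixed fraction `explained` of the total variance
    assert n > m and 0.0 < explained < 1.0
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthonormal directions
    big = np.linspace(2.0, 1.0, m)                     # distinct variances for the m components
    small = np.ones(n - m)
    small *= big.sum() / small.sum() * (1.0 - explained) / explained
    var = np.concatenate([big, small])
    X = (rng.normal(size=(n_samples, n)) * np.sqrt(var)) @ basis.T
    return X, basis[:, :m]                             # data and the m true directions

X, true_pcs = make_pc_data(n=32, m=4, n_samples=10000)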

Another problem with the claim is that the fuzzy term "training time"
hides something important.  Namely, Sanger's algorithm trains output
units one-by-one during the presentation of each pattern, and to my
knowledge this sequentiality is inherent.  Thus it could be that the GHA
is superior to backprop when measured in "pattern time", but not when
measured in real time (i.e. operations of an ideal parallel device).
Here again, I don't know the answer; I would be interested in whatever
info people have on this.

Chris Kortge

efrethei@afit-ab.arpa (Erik J. Fretheim) (03/19/89)

In article <10199@nsc.nsc.com> andrew@nsc.nsc.com (andrew) writes:
>Here is the promised collation of Image Compression papers which some
>of you requested. The received material is listed here, with a brief review.


One more to check out is a paper by Daugman (I think) in the July 88 (or
was it August?) issue of ASSP.  In it the author discussed using a neural
net to find the Gabor coefficients for an image.  Using these they were able
to represent an image in less than 1 bit per pixel.  Gabor transforms are
nice in that they have a close relationship to biological image processors.
Although the information given here is somewhat sketchy, the article shouldn't
be hard to find for anyone interested.
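
For anyone curious about what is being fitted, here is a small sketch of a
2-D Gabor kernel of the sort whose coefficients such a network would
estimate (my own illustration, not code from the ASSP paper; the parameters
are arbitrary):

import numpy as np

def gabor_kernel(size=16, wavelength=6.0, theta=0.0, sigma=3.0, phase=0.0):
    # real part of a 2-D Gabor function: a plane wave under a Gaussian window
    half = size // 2
    y, x = np.mgrid[-half:half, -half:half]
    xr = x * np.cos(theta) + y * np.sin(theta)     # coordinate along the wave direction
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * xr / wavelength + phase)
    return envelope * carrier

# an image patch is then approximated as a weighted sum of such kernels at
# various positions, orientations and scales; those weights are the Gabor
# coefficients the network estimates
kernel = gabor_kernel(theta=np.pi / 4)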
ejf