andrew@nsc.nsc.com (andrew) (03/18/89)
Here is the promised collation of Image Compression papers which some of you
requested. The received material is listed here, with a brief review. A
rather light repast (quantity, not quality) consisted of:

1) Peter Foldiak: Adaptive network for optimal linear feature extraction.

   Abstract: A network of highly interconnected linear neuron-like
   processing units and a simple, local, unsupervised rule for the
   modification of connection strengths between these units are proposed.
   After training the network on a high (m) dimensional distribution of
   input vectors, the lower (n) dimensional output becomes a projection into
   the subspace of the n largest principal components (the subspace spanned
   by the n eigenvectors of largest eigenvalues of the input covariance
   matrix) and maximizes the mutual information between the input and the
   output in the same way as principal component analysis does. The purely
   local nature of the synaptic modification rule (simple Hebbian and
   anti-Hebbian) makes the implementation of the network easier, faster and
   biologically more plausible than rules depending on error propagation.

   From sun!NSS.CS.UCL.AC.UK!PF103%phoenix.cambridge.ac.uk
   Peter Foldiak, Physiological Laboratory, Downing Street,
   Cambridge CB2 3EG, England

2a) G.W. Cottrell, P. Munro, D. Zipser, "Image Compression By Back
    Propagation: An Example of Extensional Programming", Feb. 1987, ICS
    Report 8702 (to be published in N.E. Sharkey (Ed.), Models of Cognition,
    Norwood, N.J.). Reprint requests to: Institute for Cognitive Science,
    C-105; UC San Diego, La Jolla, CA 92093.

2b) G.W. Cottrell, "Analysis of Image Compression by Back Propagation"
    (cf. 2a)

3a) Terence D. Sanger, "Optimal Unsupervised Learning in a Single-Layer
    Linear Feedforward Neural Network", MIT AI Lab., NE43-743, Cambridge,
    MA 02139. TDS@wheaties.ai.mit.edu

3b) Terence D. Sanger, "An Optimality Principle for Unsupervised Learning"

(Since I only got #1 electronically, the abstract is included only for #1.)
---------------------------------------------------------------------------
All three papers are strikingly similar. They all deal with the "encoding
problem". If there is an "odd man out", it is 2a,b), in that 3) - and I
surmise 1) - build upon it and, I think, subsume its results.

The Sanger 3a) paper is highly germane; he seems to have defined a method
whereby maximal information preservation occurs across one layer, using only
linear elements and a purely local update structure. The learned matrix of
weights becomes (row-wise) the eigenvectors of the input autocorrelation. He
demonstrates compression of about 23:1 on 8-bit grey-scale photos, good
performance on texture segmentation, and interesting correlations with the
receptive fields found in primates when he trains with random noise. This
paper builds upon (e.g.) the Linsker article in Computer, 1988. It offers a
new way to look at representations in hidden units (which I've not yet seen
so clearly expressed), and allows easy selection of any desired compression
level by selectively ignoring eigenvectors of lower eigenvalue (ordering is
a property of the learning algorithm) and by selective quantisation of the
individual eigenvector outputs. The techniques used include Oja learning,
Gram-Schmidt orthogonalisation, and the Karhunen-Loeve transformation.
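To make that concrete, here is a toy reconstruction of the GHA update in
present-day Python/NumPy - my own sketch, not Sanger's code, and the
learning rate, epoch count, and test data are arbitrary choices. The rule is
Oja's rule plus a lower-triangular (Gram-Schmidt-like) correction that
decorrelates and orders the outputs:

    import numpy as np

    def gha_train(X, n_components, lr=0.001, epochs=20, seed=0):
        # Rows of W converge to the leading eigenvectors of the input
        # autocorrelation E[x x^T], ordered by decreasing eigenvalue.
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=0.1, size=(n_components, X.shape[1]))
        for _ in range(epochs):
            for x in X:
                y = W @ x
                # Hebbian term minus feedback from units 1..i; np.tril
                # keeps the lower-triangular part, which performs the
                # Gram-Schmidt-like decorrelation that orders the
                # components by eigenvalue.
                W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
        return W

    # Sanity check against a direct eigendecomposition:
    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 8)) @ rng.normal(size=(8, 8))
    W = gha_train(X, n_components=3)
    _, V = np.linalg.eigh(X.T @ X / len(X))
    # rows of W should match V's last (largest-eigenvalue) columns,
    # up to sign

Compression then amounts to keeping only the first k outputs (the
largest-eigenvalue directions) and quantising each according to its
variance - the selective-quantisation knob mentioned above.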
Highly relevant is his comparative data with respect to (cf. #2)
self-supervised backprop, where numerous criteria show GHA ("Generalised
Hebbian Algorithm") to be superior. These criteria include:

- performance in noise
- ability to selectively quantise
- training time
- dependence upon training order
- dependence upon initial weight values
- uniqueness of solution
- interpretation of solution as representing significant features

This should be provocative stuff for ardent backprop aficionados! Bear in
mind, however, that this (minimally) specifically addresses the encoding
problem. Conversely, extraction of a significance-ordered, self-uncorrelated
feature set is a quite general and useful analysis process for ANY dataset.

3b) from Terence Sanger, on compression using GHA (Generalised Hebbian
Algorithm), explores performance with nonlinearities and multiple layers.
Results are:

1. He achieves equivalent (or better) convergence than backprop in a
   multilayer net, using an output nonlinearity of rectification at zero
   threshold (a toy version of this update appears after this post). Since
   training time will scale "at most linearly" with the number of layers,
   this is "an encouragement" to apply GHA to other problems heretofore
   addressed chiefly by backprop systems.

2. The receptive field results for nonlinear GHA are similar to the linear
   case, but include (mostly) double the number of states, since he now
   obtains the complementary patterns (off <-> on interchanged).

   (I find it mysterious that random noise training produces orientation-
   selective receptive fields spontaneously; what's the connection between
   eigenvectors of an input autocorrelation and straight lines? Not only are
   these fields similar to those found in retinal cells of cat and monkey,
   but, as one goes down the list in order of decreasing eigenvalue, they
   somewhat resemble eigenstates of wave-functions of atoms from quantum
   mechanics - perhaps a coincidental isomorphism!)

3. With a 2-layer nonlinear GHA net (hidden nonlinear, output linear), one
   is able to detect stereo disparity edges. He notes that "no linear
   function has this property". This therefore vindicates the nonlinear
   approach.

Future directions may include investigation of classes of nonlinearities
other than rectification or sigmoids, and further analysis of nonlinearity
from a mathematical standpoint.

============================================================================
 DOMAIN: andrew@logic.sc.nsc.com
 ARPA:   nsc!logic!andrew@sun.com
 USENET: ...{amdahl,decwrl,hplabs,pyramid,sun}!nsc!logic!andrew

 Andrew Palfreyman                       408-721-4788 work
 National Semiconductor MS D3969         408-247-0145 home
 2900 Semiconductor Dr.
 P.O. Box 58090                          there's many a slip
 Santa Clara, CA 95052-8090              'twixt cup and lip
============================================================================
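The summary above leaves the nonlinear update unspecified, so the following
is only a guess at the obvious rectified variant of the gha_train sketch
above - the same local rule, with the outputs clipped at zero before use:

    def gha_rectified_step(W, x, lr=0.001):
        # One guess at "rectification at zero threshold": pass y through
        # max(0, .) before both the Hebbian and the decorrelating terms.
        # Uses numpy as np, as in the earlier sketch.
        y = np.maximum(0.0, W @ x)
        W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
        return W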
fozzard@boulder.Colorado.EDU (Richard Fozzard) (03/18/89)
In article <10199@nsc.nsc.com> andrew@nsc.nsc.com (andrew) writes, with
regard to 3b) Terence D. Sanger, "An Optimality Principle for Unsupervised
Learning":

> (I find it mysterious that random noise training produces orientation-
> selective receptive fields spontaneously; what's the connection between
> eigenvectors of an input autocorrelation and straight lines? Not only are
> these fields similar to those found in retinal cells of cat and monkey,
> but, as one goes down the list in order of decreasing eigenvalue, they
> somewhat resemble eigenstates of wave-functions of atoms from quantum
> mechanics - perhaps a coincidental isomorphism!)

Well, I don't have the solution to the mystery, but Lehky and Sejnowski
report a similar learning of line-segment receptive fields under equally
unexpected circumstances - learning surface curvature from shading. This
work used standard backprop instead of the Hebb rule, though. Here's the
reference:

"Neural Network Model for the Cortical Representation of Surface Curvature
from Images of Shaded Surfaces", Sidney R. Lehky and Terrence Sejnowski,
Department of Biophysics, Johns Hopkins University. In: Lund, J.S. (Ed.),
Sensory Processing, Oxford (1988).

PS: If you talked to a certain Dr. Mandelbrot, he would insist that your
"coincidental isomorphism" was hardly that - remember fractals?

Richard Fozzard
University of Colorado
fozzard@boulder.colorado.edu
aboulang@bbn.com (Albert Boulanger) (03/18/89)
Here is another one from my collection, since I am interested in this
subject:

"Dimensionality-Reduction Using Connectionist Networks", Eric Saund, MIT AI
Memo 941 (January 1987).

Also, the so-called "encoder" networks using backprop, where the desired
output is set to be the input (dimensionality is reduced at the hidden
layer, and the hidden-layer activity can serve as the desired "real"
output), can reduce dimensionality, as can Hinton and McClelland's
recirculation networks, a generalization of encoder nets (see "Learning
Representations by Recirculation", Geoffrey Hinton & James McClelland, NIPS
Proceedings, AIP Press, 1988). (A toy encoder net is sketched after this
post.)

In general, the class of learning algorithms called "unsupervised" learning
can potentially reduce dimensionality. There is, however, a spectrum of
characteristics among the different unsupervised learning procedures: Do the
reduced dimensions span the space? Are the reduced dimensions orthogonal?
Terry Sanger's algorithm does both. It would be interesting to work out what
his learning rule does with sigmoid transfer functions for the neurons.

Albert Boulanger
BBN Systems & Technologies Corporation
aboulanger@bbn.com
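A minimal sketch of such an encoder net, in the same Python/NumPy style as
the GHA sketch above. This is my own toy: the sigmoid hidden layer, linear
output, squared-error backprop, and all sizes are assumptions, not details
from the cited papers:

    import numpy as np

    def train_encoder_net(X, n_hidden, lr=0.1, epochs=200, seed=0):
        # "Encoder" net: the target output is the input itself, so the
        # n_hidden-unit bottleneck must find a reduced representation.
        rng = np.random.default_rng(seed)
        n = X.shape[1]
        W1 = rng.normal(scale=0.1, size=(n_hidden, n))
        W2 = rng.normal(scale=0.1, size=(n, n_hidden))
        for _ in range(epochs):
            for x in X:
                h = 1.0 / (1.0 + np.exp(-(W1 @ x)))  # hidden activity
                err = W2 @ h - x                     # reconstruction error
                dh = (W2.T @ err) * h * (1.0 - h)    # backprop to hidden
                W2 -= lr * np.outer(err, h)
                W1 -= lr * np.outer(dh, x)
        return W1, W2  # hidden activity h is the reduced representation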
kortge@Portia.Stanford.EDU (Chris Kortge) (03/19/89)
In article <10199@nsc.nsc.com> andrew@nsc.nsc.com (andrew) writes:
> [...]
> 3a) Terence D. Sanger, "Optimal Unsupervised Learning in a Single-Layer
>     Linear Feedforward Neural Network", MIT AI Lab., NE43-743, Cambridge,
>     MA 02139. TDS@wheaties.ai.mit.edu
>
> The Sanger 3a) paper is highly germane; he seems to have defined a method
> whereby maximal information preservation occurs across one layer, using
> only linear elements and a purely local update structure. The learned
> matrix of weights becomes (row-wise) the eigenvectors of the input
> autocorrelation. [...]
> Highly relevant is his comparative data with respect to (cf. #2)
> self-supervised backprop, where numerous criteria show GHA ("Generalised
> Hebbian Algorithm") to be superior. These criteria include:
> - training time
> [and several other things]

It wasn't clear to me from reading Sanger's thesis that the GHA is obviously
faster than self-supervised backprop. He says that, for backprop, "training
time still seems to be an exponential function of the number of units in the
network." It seems like this would be problem-dependent, though, and
principal components is not that tough as typical backprop problems go. Does
anyone know of an actual scaling study on this? (E.g., where n-dimensional
random Gaussian vectors with m known principal components are used as
inputs, and n & m are varied, keeping percent variance explainable constant,
say - one way to generate such data is sketched after this post.)

Another problem with the claim is that the fuzzy term "training time" hides
something important. Namely, Sanger's algorithm trains output units
one-by-one during the presentation of each pattern, and to my knowledge this
sequentiality is inherent. Thus it could be that the GHA is superior to
backprop when measured in "pattern time", but not when measured in real time
(i.e. operations of an ideal parallel device). Here again, I don't know the
answer; I would be interested in whatever info people have on this.

Chris Kortge
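One way to generate the test data proposed above, in the same Python/NumPy
style as the earlier sketches. The construction (a flat noise floor plus m
equal-variance signal components) is my own choice, not from the thread:

    import numpy as np

    def make_pc_data(n_samples, n, m, explainable=0.9, seed=0):
        # n-dimensional Gaussian vectors with m "signal" principal
        # components carrying a fixed fraction of the total variance,
        # so percent variance explainable stays constant as n and m
        # are varied (requires n > m).
        rng = np.random.default_rng(seed)
        Q, _ = np.linalg.qr(rng.normal(size=(n, n)))  # random orthonormal basis
        var = np.full(n, (1.0 - explainable) / (n - m))  # noise floor
        var[:m] = explainable / m                        # signal components
        X = rng.normal(size=(n_samples, n)) * np.sqrt(var)
        return X @ Q.T  # rotate so the components are not axis-aligned

    # e.g. vary n over {16, 64, 256} and m over {2, 8}, and time GHA
    # versus self-supervised backprop to a fixed reconstruction error.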
efrethei@afit-ab.arpa (Erik J. Fretheim) (03/19/89)
In article <10199@nsc.nsc.com> andrew@nsc.nsc.com (andrew) writes: >Here is the promised collation of Image Compression papers which some >of you requested. The received material is listed here, with a brief review. One more to checl out is a paper by Daugman (I think) in the July 88 (or was that August) issue of ASSP. In it the aouthor discussed using a nueral net to find the Gabor coefficients for an image. Using these they were able to represent an image in less than 1 bit per pixel. Gabor transforms are nice in that they have a close relationship to biological image processors. Although the information given here is somewhat sketchy, the article shouldn't be hard to find for anyone interested. ejf