neuron-request@HPLABS.HP.COM ("Neuron-Digest Moderator Peter Marvit") (12/05/89)
Neuron Digest   Monday,  4 Dec 1989   Volume 5 : Issue 52

Today's Topics:
                          Re: Data Complexity
                          Re: Data Complexity
                          Re: Data Complexity
                          Re: Data Complexity
          Re: Kolmogorov Complexity references (was Data Complexity)
          Re: Kolmogorov Complexity references (was Data Complexity)
                     Neural Nets in Gene Recognition
                  Re: Neural Nets in Gene Recognition
                      genetic algorithm simulators
                   Re: genetic algorithm simulators

Send submissions, questions, address maintenance and requests for old issues
to "neuron-request@hplabs.hp.com" or "{any backbone,uunet}!hplabs!neuron-request".
Use "ftp" to get old issues from hplpm.hpl.hp.com (15.255.176.205).

------------------------------------------------------------

Subject: Re: Data Complexity
From:    ted@nmsu.edu (Ted Dunning)
Organization: NMSU Computer Science
Date:    16 Oct 89 17:38:56 +0000

ivan bach is very enthusiastic, but relatively uninformed about recent
developments in determining the complexity of finite sequences.  he also has
an annoying tendency to post news that is more than 80 columns wide.

[[Editor's Note: Ivan's article was reformatted for Digest inclusion. PM]]

here is a partial commentary on his posting.

In article <4542@imagen.UUCP> ib@apolling (Ivan N. Bach) writes:

   Path: opus!lanl!cmcl2!nrl-cmf!think!ames!amdcad!sun!imagen!daemon
   From: ib@apolling (Ivan N. Bach)

   Claude Shannon, founder of the theory of information, ...  He decided to
   call this measure "entropy," because he realized that his formula:

      H = -K * sum from i=1 to a of (p_i * log p_i)

shannon actually qualified this measure extensively.  it is only valid if
successive symbols are independent, and only valid for stationary ergodic
sequences.  neither of these qualifications holds in most biological systems.

   Lila Gatlin showed that Claude Shannon's formula could be used to
   calculate the entropy of a randomly chosen base in the DNA chain of bases
   in the nucleus of a living cell.  If all four types of bases in a DNA
   chain are equiprobable, the entropy of a randomly chosen base in the DNA
   chain is equal to:  ... = 2 bits

actually lila gatlin showed that you can misuse shannon's formula to get an
upper bound on the amount of information in a dna chain.

   Computer programmers know that they have to use at least 2 bits to encode
   4 different values.

real computer programmers know that this is not true.

   If all bases in a DNA chain are not equiprobable, the probability of a
   certain type of base occurring at a random position in a DNA chain is
   proportional to the number of bases of its type in the ...  Therefore,
   the entropy of a randomly chosen base in the DNA chain of "Micrococcus
   lysodeikticus" is equal to:  ... = 1.87 bits

this is entirely wrong.  see below.

   Lila Gatlin came to the conclusion that the entropy of a complex system
   is equal to the sum of entropies of its components.

lila is again mistaken, unless what actually was said was that the entropy
of a system composed of _independent_ parts is the sum of the entropies of
its components.  of course _complex_ systems generally do not allow such
naive decomposition.

   [ arithmetic maunderings about bits and blueprints deleted ]
   [ fuzzy comparison of a turing machine to protein synthesis ]

   4. Instead of constructing a full-blown neural network, we could specify
   a "blueprint" for constructing that network.  I think that the amount of
   information in the blueprint can always be smaller than the amount of
   information in the network if there is any redundancy or repetition in
   the structure of the network.

actually the blueprint must be less than or equal to the network.
otherwise, we would just use the network as the blueprint.  see below for a
reasonable definition of information content.

   5. If you use a certain "alphabet" of components to build a neural net,
   you should use each type of component with the same frequency if you want
   to achieve the maximum information capacity.

of course maximum information _content_ (or capacity) would also be obtained
by making all components completely independent.  this is the same as
recommending that people use RAM for their neural net.  you _don't_ want a
system with maximum content.  you want a system that encodes and carries
meaning.  these are very different things.

shannon's formulae are simple enough to be misused extensively by all kinds
of people.  such misuse does not mean that they are any more applicable than
they were in 1949.  in particular, if you compute the entropy of english
text, you over-estimate the information content by a factor of 4 or more.
for instance, this posting has an entropy of 4.6 bits per letter, while
english is generally accepted to have an information content of 1 bit per
letter or less.  the difference is entirely due to the assumption of
independence between letters.

an even more dramatic example is a sequence formed by the regular expression
{abcd}*.  if we take any 100 characters from a random location in such a
sequence and compute the entropy for the sequence, we will find that we
apparently have 200 bits of information.  in fact, though, we have only 2
bits of information, because once we know the first letter in the sequence,
we know all the letters.

a much more sound approach to the complexity of sequences was proposed
in the 60's by various soviet mathematicians, and was popularized for
use with chaotic systems by v.i. arnold in the 70's.  this model defines the
complexity of a finite string to be the size of the smallest computer
program which can be used to recreate the sequence.  in arnold's work, the
sequence was the itinerary of a finite map, but this does not really matter
if we are starting with a sequence from a finite alphabet.

obviously, the upper bound on the complexity of a finite sequence is
the sequence size itself, since we can easily write a program for, say, a
ram machine which builds a list containing the sequence in the same number
of steps as the original sequence is long.  the complexity of one of our 100
character substrings of {abcd}* is much less, since we can write a very
simple program which produces one of these substrings.

in practice, the information content of a complex sequence can be estimated
by building relatively simple models for inter-symbol dependencies and then
measuring the entropy of the errors the model makes in predicting the next
symbol.  this is nearly equivalent to the approach used by arnold and co,
since if you can't build a predictor at all, then the error stream becomes
equivalent to the original sequence.
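[[ Editorial illustration, not part of Ted's posting: the short Python sketch
below computes the order-0 entropy and the first-order conditional entropy of
a 100-character string from {abcd}*, showing how the independence assumption
inflates the apparent information content from roughly 0 bits per symbol to 2
bits per symbol. ]]

from collections import Counter
from math import log2

def order0_entropy(s):
    # entropy in bits/symbol, assuming symbols are independent and identically distributed
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())

def conditional_entropy(s):
    # entropy of the next symbol given the previous one (a first-order model)
    pairs = Counter(zip(s, s[1:]))
    contexts = Counter(s[:-1])
    n = len(s) - 1
    return -sum((c / n) * log2(c / contexts[a]) for (a, b), c in pairs.items())

seq = "abcd" * 25                  # 100 symbols drawn from the language {abcd}*
print(order0_entropy(seq))         # 2.0 bits/symbol -> "200 bits" for the whole string
print(conditional_entropy(seq))    # 0.0 bits/symbol -> nearly everything is predictable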
ted@nmsu.edu

        Dem Dichter war so wohl daheime
        In Schildas teurem Eichenhain!
        Dort wob ich meine zarten Reime
        Aus Veilchenduft und Mondenschein

------------------------------

Subject: Re: Data Complexity
From:    park@usceast.UUCP (Kihong Park)
Organization: University of South Carolina, Columbia
Date:    18 Oct 89 15:14:04 +0000

In article <TED.89Oct16113856@aigyptos.nmsu.edu> you write:
>ivan bach is very enthusiastic, but relatively uninformed about recent
>developments in determining the complexity of finite sequences.

I agree with your comments regarding the outdatedness and errors of Ivan's
statements.

As to describing the information content of a sequence over a finite
alphabet, the algorithmic information complexity measure developed
independently by Kolmogorov, Chaitin, and Solomonoff can be fruitfully
applied here.  Basically, they define the complexity of a sequence as the
size of the minimum program which faithfully generates the sequence when
run on some universal Turing machine.  Thus, for a DNA sequence to encode
maximum information, i.e., to be maximally patternless, its algorithmic
complexity must nearly equal the length of the sequence.
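[[ Editorial illustration, not part of the posting above: the algorithmic
(Kolmogorov) complexity itself is uncomputable, but a general-purpose
compressor gives a crude, computable upper bound on it.  The Python sketch
below compares a highly patterned string with a pseudo-random one. ]]

import random
import zlib

def complexity_upper_bound(data: bytes) -> int:
    # length in bytes of one particular compressed description of `data`;
    # an upper bound only, never the true algorithmic complexity
    return len(zlib.compress(data, 9))

patterned = b"abcd" * 2500                                      # 10000 bytes, very regular
random.seed(0)
randomish = bytes(random.getrandbits(8) for _ in range(10000))  # 10000 near-incompressible bytes

print(len(patterned), complexity_upper_bound(patterned))        # compresses to a few dozen bytes
print(len(randomish), complexity_upper_bound(randomish))        # stays close to 10000 bytes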
As to Ivan's comments about the applicability of entropy to represent the
efficiency of information coding in a network, many workers in the field
have realized that for a given task a more or less 'adequate' network size
should be strived for.  One of the principal motivations for doing this is
the empirical observation that if a network is 'too large' compared to the
complexity of the task at hand, it may generate solutions which do not
'generalize' well, given the opportunity to do so.  What 'generalization'
means is a different problem altogether, but what this observation shows us
is that one has to constrain the degrees of freedom of a network to entice
it to capture the most compact representation.  This constitutes, so to
say, the 'minimal program'.  Thus, the problem at hand is knowing the
optimal size of a network for a given task.  I do not see an easy way in
which entropies can help us solve this correspondence problem.

> Dem Dichter war so wohl daheime
> In Schildas teurem Eichenhain!
> Dort wob ich meine zarten Reime
> Aus Veilchenduft und Mondenschein

It's interesting to note that some people equate our profession with that
of poets.  Is this a revival of romanticism?

------------------------------

Subject: Re: Data Complexity
From:    demers@beowulf.ucsd.edu (David E Demers)
Organization: EE/CS Dept. U.C. San Diego
Date:    18 Oct 89 20:37:44 +0000

In a previous article ted@nmsu.edu (Ted Dunning) writes:
>ivan bach is very enthusiastic, but relatively uninformed about recent
>developments in determining the complexity of finite sequences.

Agreed.

[discussion of Bach's posting + critical commentary deleted]

>shannon's formulae are simple enough to be misused extensively by all
>kinds of people.  such misuse does not mean that they are any more
>applicable than they were in 1949.

Despite the misuse, the basic concept is very valuable.

>a much more sound approach to the complexity of sequences was proposed
>in the 60's by various soviet mathematicians, and was popularized for
>use with chaotic systems by v.i. arnold in the 70's.

This work usually goes by the name of "Kolmogorov Complexity".  Solomonoff,
Kolmogorov, and Chaitin did much of the pioneering work in the early 60's,
for independent reasons - examining the notion of randomness [Kolmogorov]
and inference [Solomonoff].

>obviously, the upper bound on the complexity of a finite sequence is
>the sequence size itself,  [[... See Ted's article for description. -PM]]

The above is a very good quick description of the concept.  It essentially
quantifies Occam's Razor in modeling.  Computational learning theory takes
up many of these ideas; Valiant et al. make use of the same principle.

I apologize for once again posting with the 'F' key before gathering
citations.  Walt Savitch just gave a talk last week on Kolmogorov
Complexity for the Cognitive Science seminar here at UCSD; I will scare up
a summary and some references and post them soon for those interested in
pursuing the idea in more depth.
Dave DeMers                                demers@cs.ucsd.edu
Dept. of Computer Science & Engineering    demers@inls1.ucsd.edu
UCSD                                       {other non-Internet mailpaths
La Jolla, CA 92093                          according to normal syntax...}

------------------------------

Subject: Re: Data Complexity
From:    andrew@dtg.nsc.com (Lord Snooty @ The Giant Poisoned Electric Head)
Organization: National Semiconductor, Santa Clara
Date:    23 Oct 89 01:58:24 +0000

I guess the entropy calcs use the p.ln(p) expression, as per Shannon... but
as per Hofstadter, what about the complexity of the context or "frame"?
What tells you which jukebox will play your record?  If you could give da
Vinci a binary dump of a set of PostScript graphics files describing his
drawings, could he work out that they were his?  Or would you have to
provide him with the appropriate "jukebox" (PC + OS + etc.)?  Of course, the
correct answer is the latter.  So my question is: have you included the
complexity of the decoding apparatus in your entropy estimates?

...........................................................................
Andrew Palfreyman             and the 'q' is silent,
andrew@dtg.nsc.com            as in banana
time sucks

------------------------------

Subject: Re: Kolmogorov Complexity references (was Data Complexity)
From:    demers@beowulf.ucsd.edu (David E Demers)
Organization: EE/CS Dept. U.C. San Diego
Date:    28 Oct 89 17:20:47 +0000

I recently promised to provide some follow-up references on Kolmogorov
Complexity, following a thread discussing how one might measure the
"complexity" of a model, such as a neural net.

For Kolmogorov Complexity in general, the best place to start is an
introduction to the notions of Kolmogorov Complexity and its application to
a number of different problems, which can be found in:

   Li, M. and Paul Vitanyi, "Kolmogorov Complexity and Its Applications",
   in Handbook of Theoretical Computer Science (J. van Leeuwen, Ed.),
   North-Holland, 1989.

A preliminary version of the same work is in the Proceedings of the 3rd
IEEE Structure in Complexity Theory Conference, 1988.  The same authors
have a book out from Addison-Wesley, An Introduction to Kolmogorov
Complexity and Its Applications (1989).

For original work, see the references in the bibliography of the above
book.  Let me point out the following, which are probably the most
important:

   Kolmogorov, A.N., "Three approaches to the quantitative definition of
   information", Problems in Information Transmission 1, 1-7, 1965.

   Shannon & Weaver, The Mathematical Theory of Communication, U. of
   Illinois Press, 1949 (for basics of information theory).

   Kolmogorov, A.N., "Logical basis for information theory and probability
   theory", IEEE Trans. on Info. Theory ?? 1968 (sorry).

   Solomonoff, R.J., "A formal theory of inductive inference", Information
   & Control 30, 1-22 and 224-254 (1964).

   Chaitin, G.J., "Randomness and mathematical proof", Scientific American
   232 (May 1975), pp. 47-52.

   Chaitin, G.J., "Algorithmic information theory", IBM J. Res. Dev. 21,
   350-359 (1977).

   Chaitin, G.J., Information, Randomness and Incompleteness (World
   Scientific, 1987).

Other surveys are:

   Zvonkin, A.K. and L.A. Levin, "The complexity of finite objects and the
   development of the concepts of information and randomness by means of
   the Theory of Algorithms", Russ. Math. Surveys 25, 83-124 (1970).

   Schnorr, C.P., "A survey of the theory of random sequences", in Basic
   Problems in Methodology and Linguistics (Butts, R.E., Ed.), 1977,
   pp. 193-210.

Application of Kolmogorov Complexity, or similar ideas, can be seen in the
work of Valiant, Gold, Rissanen and others (maximum likelihood, maximum
entropy...).  I might mention

   Rissanen, J., "Modeling by the shortest data description", Automatica
   14, 465-471 (1978)

for a practical suggestion for statistical modeling - minimum description
length - which has two components, one essentially computing the size of
the model and the other computing the size of the data when encoded using
the theory.
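[[ Editorial illustration, not from DeMers' posting and not Rissanen's actual
formulation: a toy two-part description length for polynomial fits, where
the total cost is bits to state the parameters plus bits to encode the
residuals under a Gaussian model.  Only differences between totals matter
here (the data term is a differential entropy and can go negative). ]]

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0.0, 0.1, x.size)   # quadratic plus noise

def description_length(degree, bits_per_parameter=32):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(residuals.var(), 1e-12)                 # guard against log(0)
    model_bits = bits_per_parameter * (degree + 1)       # crude fixed-precision parameter cost
    data_bits = 0.5 * x.size * np.log2(2 * np.pi * np.e * sigma2)  # Gaussian code length
    return model_bits + data_bits

for d in range(8):
    print(d, round(description_length(d), 1))
# the total bottoms out at the true degree (2); higher degrees shave a little
# off the residuals but pay more than that for the extra parameters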
APPLICATIONS: one can state and prove Goedel's incompleteness theorem using
Kolmogorov Complexity, derive a prime-number theorem (the number of primes
less than n is on the order of n / (log n (log log n)^2)), state and prove
the pumping lemma for regular languages in a couple of lines, and so on...

In the neural net world, Tenorio & Lee used a variant of Rissanen's minimum
description length along with a stochastic growth method for determining
when to add new nodes to a network (TR-EE 89-30, June 1989, Purdue
University School of Electrical Engineering).
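[[ Editorial illustration only -- this is not the Tenorio & Lee algorithm,
just a toy of the general idea of growing a network while a description-
length score keeps improving, using a small radial-basis-function network
whose output weights are fitted by least squares. ]]

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 300)
y = np.sin(6.0 * x) + 0.05 * rng.normal(size=x.size)

def design(x, n_units, width=0.1):
    centers = np.linspace(0.0, 1.0, n_units)
    basis = np.exp(-((x[:, None] - centers[None, :]) / width) ** 2)
    return np.column_stack([np.ones_like(x), basis])   # bias term plus one column per unit

def description_length(n_units, bits_per_parameter=32):
    phi = design(x, n_units)
    w, *_ = np.linalg.lstsq(phi, y, rcond=None)
    sigma2 = max(np.mean((y - phi @ w) ** 2), 1e-12)
    return bits_per_parameter * w.size + 0.5 * x.size * np.log2(2 * np.pi * np.e * sigma2)

n_units, best = 1, description_length(1)
while description_length(n_units + 1) < best:           # grow until the score stops improving
    n_units += 1
    best = description_length(n_units)
print("selected network size:", n_units, "hidden units")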
Hope this is of interest!

Dave DeMers
Dept. of Computer Science & Engineering C-014
and Institute of Non-Linear Science
UCSD
La Jolla, CA 92093
demers@cs.ucsd.edu
demers@inls1.ucsd.edu

------------------------------

Subject: Re: Kolmogorov Complexity references (was Data Complexity)
From:    munnari.oz.au!bruce!lloyd@uunet.uu.net (lloyd allison)
Organization: Monash Uni. Computer Science, Australia
Date:    30 Oct 89 01:28:15 +0000

Try also:

   J. Rissanen, "Stochastic Complexity", and C.S. Wallace and P.R. Freeman,
   "Estimation and Inference by Compact Coding", Journal of the Royal
   Statistical Society (Series B), V 39, No. 3, 1987, pages 223-239 and
   252-265 (Rissanen) and 240-265 (Wallace and Freeman).

and also:

   C.S. Wallace and D.M. Boulton, "An Information Measure for
   Classification", Computer Journal 11(2), 185-194, 1968.

------------------------------

Subject: Neural Nets in Gene Recognition
From:    eesnyder@boulder.Colorado.EDU (Eric E. Snyder)
Organization: University of Colorado, Boulder
Date:    06 Nov 89 03:11:15 +0000

I am looking for some references on the application of neural nets to the
recognition of genes in DNA sequences.  I have heard second-hand of some
research at LANL, but Medline does not seem to be well informed on the
matter....

Thanx,

- -------------------------------------------------------------------------
TTGATTGCTAAACACTGGGCGGCGAATCAGGGTTGGGATCTGAACAAAGACGGTCAGATTCAGTTCGTACTGCTG
Eric E. Snyder
Department of Biochemistry              Proctoscopy recapitulates
University of Colorado, Boulder         hagiography.
Boulder, Colorado 80309
LeuIleAlaLysHisTrpAlaAlaAsnGlnGlyTrpAspLeuAsnLysAspGlyGlnIleGlnPheValLeuLeu
- -------------------------------------------------------------------------

------------------------------

Subject: Re: Neural Nets in Gene Recognition
From:    eesnyder@boulder.Colorado.EDU (Eric E. Snyder)
Organization: University of Colorado, Boulder
Date:    09 Nov 89 15:32:07 +0000

Thanks for the several replies providing references on the subject.
Several more people requested that I forward the information I received.
Our mailer bounced several of these replies, so I'll just post the
information here (the comp.ai people have probably seen it before):

*************************************************************************

From: ohsu-hcx!spackmank@cse.ogc.edu (Dr. Kent Spackman)
Subject: connectionist protein structure

The two articles I mentioned are:

   Holley, L.H.; Karplus, M.  Protein structure prediction with a neural
   network.  Proceedings of the National Academy of Sciences, USA; 1989;
   86: 152-156.

   Qian, Ning; Sejnowski, Terrence J.  Predicting the secondary structure
   of globular proteins using neural network models.  J Mol Biol; 1988;
   202: 865-884.

I have an article that will be published in the proceedings of the
Symposium on Computer Applications in Medical Care, in Washington, D.C., in
November, entitled "Evaluation of Neural Network Performance by ROC
Analysis: Examples from the Biotechnology Domain".  The authors are M.L.
Meistrell and myself.

Kent A. Spackman, MD PhD
Biomedical Information Communication Center (BICC)
Oregon Health Sciences University
3181 SW Sam Jackson Park Road
Portland, OR 97201-3098

From: Lambert.Wixson@MAPS.CS.CMU.EDU
Subject: DNA, RNA, etc.

   Holley and Karplus, Proceedings of the National Academy of Sciences 86,
   152-156 (89).

- ----

From: mv10801@uc.msc.umn.edu
Subject: Re: applications to DNA, RNA and proteins

George Wilcox (mf12801@sc.msc.umn.edu) does work on predicting protein
tertiary structure using large backprop nets.

- --Jonathan Marshall
  Center for Research in Learning, Perception, and Cognition
  205 Elliott Hall, Univ. of Minnesota, Minneapolis, MN 55455

- ----

>From munnari!cluster.cs.su.OZ.AU!ray@uunet.UU.NET Fri Sep 29 23:40:55 1989
Subject: applications to DNA, RNA and proteins

   Borman, Stu, "Neural Network Applications in Chemistry Begin to Appear",
   C&E News, April 24, 1989, pp. 24-28.

   Thornton, Janet, "The shape of things to come?", Nature, Vol. 335
   (1 September 1988), pp. 10-11.

You probably know about the Qian and Sejnowski paper already.  The Thornton
"paper" is a fast overview with a sentence or two comparing Q&S's work with
other work.  Borman's C&E piece is fairly superficial, but it mentions some
other people who have played with this stuff, including Bryngelson and
Hopfield, Holley and Karplus (who apparently have published in Proc. Nat.
Acad. Sci. 86(1), 152 (1989)) and Liebman.

The 1990 Spring Symposium at Stanford (March 27-29, 1990) will have a
session on "Artificial Intelligence and Molecular Biology".  The CFP lists
Neural Networks (very broad-minded of them!), so it might be worth a look
when it comes around.

From: "Evan W. Steeg" <steeg@ai.toronto.edu>
Subject: NNets and macromolecules

There is a fair amount of work on applying neural networks to questions
involving DNA, RNA, and proteins.  The two major types of application are:

1) Using neural networks to predict the conformation (secondary structure
   and/or tertiary structure) of molecules from their sequence (primary
   structure).

2) Using nets to find regularities, patterns, etc. in the sequence itself,
   e.g. finding coding regions, searching for homologies between sequences,
   etc.

The two areas are not disjoint -- one might look for alpha-helix "signals"
in a protein sequence as part of a structure prediction method, for
example.

I did my M.Sc. on "Neural Network Algorithms for RNA Secondary Structure
Prediction", basically using a modified Hopfield-Tank (Mean Field Theory)
network to perform an energy minimization search for optimal structures.  A
technical report and journal paper will be out soon.  I'm currently working
on applications of nets to protein structure prediction.  (Reference
below.)

Qian and Sejnowski used a feed-forward net to predict local secondary
structure of proteins.  (Reference above.)  At least two other groups
repeated and extended the Qian & Sejnowski experiments.  One was Karplus et
al. (ref. above) and the other was Cotterill et al. in Denmark (discussed
in a poster at the Fourth International Symposium on Artificial
Intelligence Systems, Trento, Italy, Sept. 1988).
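[[ Editorial sketch of the general windowed-sequence approach mentioned
above.  The window size, the spacer handling and the single-layer softmax
trained here are simplifications for illustration, not the architecture
from the Qian & Sejnowski or Holley & Karplus papers, and the sequences and
labels below are random stand-ins used only to exercise the code. ]]

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"       # 20 residues plus a spacer symbol
CLASSES = ["helix", "sheet", "coil"]
WINDOW = 13                                 # residues of context around the centre position

def encode_window(seq, centre):
    # one-hot encode a WINDOW-wide slice of seq centred on position `centre`
    half = WINDOW // 2
    vec = np.zeros((WINDOW, len(AMINO_ACIDS)))
    for k in range(-half, half + 1):
        pos = centre + k
        aa = seq[pos] if 0 <= pos < len(seq) else "-"   # pad the chain ends with the spacer
        vec[k + half, AMINO_ACIDS.index(aa)] = 1.0
    return vec.ravel()

rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list(AMINO_ACIDS[:-1]), 60)) for _ in range(20)]
X = np.array([encode_window(s, i) for s in seqs for i in range(len(s))])
labels = rng.integers(0, len(CLASSES), len(X))

# single-layer softmax classifier trained by plain gradient descent
W = np.zeros((X.shape[1], len(CLASSES)))
targets = np.eye(len(CLASSES))[labels]
for _ in range(200):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.5 * (X.T @ (p - targets)) / len(X)

print("predicted class of one window:", CLASSES[int(np.argmax(encode_window(seqs[0], 30) @ W))])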
Finally, a group in Minnesota used a supercomputer and back-prop to try to
find regularities in the 2-d distance matrices (distances between
alpha-carbon atoms in a protein structure).  An interim report on this work
was discussed at the IJCNN-88 (Wash. DC) conference.  (Sorry, I don't
recall the names, but the two researchers were at the Minnesota
Supercomputer Center, I believe.)

As for the numerous research efforts in finding signals and patterns in
sequences, I don't have these references handy.  But the work of Lapedes of
Los Alamos comes to mind as an interesting bit of work.

Refs:

   E.W. Steeg.  Neural Network Algorithms for the Prediction of RNA
   Secondary Structure.  M.Sc. Thesis, Computer Science Dept., University
   of Toronto, Toronto, Ontario, Canada, 1988.

Evan W. Steeg  (416) 978-7321     steeg@ai.toronto.edu (CSnet,UUCP,Bitnet)
Dept of Computer Science          steeg@ai.utoronto    (other Bitnet)
University of Toronto,            steeg@ai.toronto.cdn (EAN X.400)
Toronto, Canada M5S 1A4           {seismo,watmath}!ai.toronto.edu!steeg

- -----

From: pastor@PRC.Unisys.COM (Jon Pastor)
Subject: Re: applications to DNA, RNA and proteins

@article(nakata85a,
  Author="K. Nakata and M. Kanehisa and D. DeLisi",
  Title="Prediction of splice junctions in mRNA sequences",
  Journal="Nucleic Acids Research",
  Year="1985", Volume="13", Pages="5327--5340")

@article(stormo82a,
  Author="G.D. Stormo and T.D. Schneider and L.M. Gold",
  Title="Characterization of translational initiation sites in E. coli",
  Journal="Nucleic Acids Research",
  Year="1982", Volume="10", Pages="2971--2996")

@article(stormo82b,
  Author="G.D. Stormo and T.D. Schneider and L.M. Gold and A. Ehrenfeucht",
  Title="Use of the `perceptron' algorithm to distinguish translational
         initiation sites in E. coli",
  Journal="Nucleic Acids Research",
  Year="1982", Volume="10", Pages="2997--3010")

In addition, there is going to be (I think) a paper by Alan Lapedes, from
Los Alamos, in a forthcoming book published by the Santa Fe Institute; my
group also has a paper in this book, which is how I know about Lapedes'
submission.  I am going to try to contact the editor to see if I can get a
preprint; if so, I'll let you know.  I didn't attend the meeting at which
Lapedes presented his paper, but I'm told that he was looking for splice
junctions.

- ----
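[[ Editorial sketch, not the Stormo et al. or Nakata et al. methods: a plain
perceptron over one-hot encoded DNA windows, of the general kind one might
train to separate windows centred on a site of interest (e.g. a splice
junction or initiation site) from background windows.  The sequences and
labels below are random placeholders. ]]

import numpy as np

BASES = "ACGT"
WINDOW = 15

def encode(window):
    # one-hot encode a DNA window of length WINDOW into a flat feature vector
    vec = np.zeros((WINDOW, len(BASES)))
    for i, base in enumerate(window):
        vec[i, BASES.index(base)] = 1.0
    return vec.ravel()

rng = np.random.default_rng(2)
windows = ["".join(rng.choice(list(BASES), WINDOW)) for _ in range(400)]
X = np.array([encode(w) for w in windows])
y = rng.choice([-1, 1], len(windows))        # +1 = "site", -1 = "background" (placeholders)

# classic perceptron learning rule with a bias weight
w = np.zeros(X.shape[1] + 1)
Xb = np.column_stack([X, np.ones(len(X))])
for _ in range(20):                          # a few passes over the training windows
    for xi, yi in zip(Xb, y):
        if yi * (w @ xi) <= 0:               # misclassified: nudge the weights toward yi * xi
            w += yi * xi

print("weights trained; first window scores", float(w @ Xb[0]))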
From: ff%FRLRI61.BITNET@CUNYVM.CUNY.EDU (Francoise Fogelman)
Subject: proteins

We have done some work on the prediction of secondary structures of
proteins.  This was presented at a NATO meeting (Les Arcs, March 1989) and
will be published in the proceedings.

F. Fogelman
LRI Bat 490
Universite de Paris Sud
91405 ORSAY cedex
FRANCE
Tel 33 1 69 41 63 69
e-mail: ff@lri.lri.fr

- ----

The book "Evolution, Learning and Cognition" contains the article "Learning
to Predict the Secondary Structure of Globular Proteins" by N. Qian & T.J.
Sejnowski.

------------------------------

Subject: genetic algorithm simulators
From:    schmid@spica.cbmvax (Bob Schmid)
Organization: /usr2/schmid/.organization
Date:    02 Oct 89 17:20:28 +0000

I see in David Goldberg's book, _Genetic Algorithms_, mention of public
domain GA programs (page 71) with some "bells & whistles".  Code by Booker
and DeJong (1985) and Grefenstette (1984) is specifically mentioned.  Since
neural nets seem to be prime targets of such algorithms, I was wondering:
does anyone out there know of sources (by ftp, uucp or email) for these or
any other GA public domain source code?

Any leads will be most appreciated!

- -Bob

R. Schmid     <schmid@cbmvax.cbm.commodore.com>     <uunet!cbmvax!schmid>

------------------------------

Subject: Re: genetic algorithm simulators
From:    thearlin@vdsvax.crd.ge.com (Thearling Kurt H)
Organization: General Electric CRD, Schenectady, NY
Date:    02 Oct 89 19:15:43 +0000

>[ ... ] does anyone out there know of sources (by ftp, uucp or
>email) for these or any other GA public domain source code?

Here are a couple of sources.  The Grefenstette code (GENESIS) is available
from him via email (gref@aic.nrl.navy.mil).  Rick Riolo also has a
subroutine package available (CFS-C); his email address is
Rick_Riolo@um.cc.umich.edu.  He prefers to have you send him a tape, and he
will return it to you with the code.  Contact him for the details.

There is also a system that I just came across called "whales and
plankton," which, according to its author, "is a program which models a
small ecology and is a study of genetic algorithms."  It was posted to
comp.sources.misc and is available via anonymous ftp from xanth.cs.odu.edu
[128.82.8.1].  It is located in Volume7 and is named whpl.

I don't know what the Booker and DeJong code is, but I would be interested
in finding out more about it.  If anybody knows where to get it, please let
me know.  Also, is the code included in Goldberg's book available from
somewhere?  I don't really want to type it in if I can get it over the net.
If someone has already typed it in, could you send it to me?

kurt

Kurt Thearling
General Electric CRD             thearlin@vdsvax.crd.ge.com
Bldg. KW, Room C301              uunet!vdsvax.crd.ge.com!thearlin
P.O. Box 8                       kurt@bach.csg.uiuc.edu
Schenectady, NY 12301

------------------------------

End of Neurons Digest
*********************