turpin@ut-sally.UUCP (04/01/87)
Whatever the information content of human DNA is, it should NOT be interpreted as the amount of information required to describe what a human is. A better interpretation is to view this information as describing differences between individuals, so that the "amount of information" determines the potential genetic variability in the human population. (Even this is a rough cut, since not all possible values for human DNA are practical biological values.) Outside of DNA content, what is the other information that determines what a human is? The entire environment under which DNA is an encoding: biochemical "laws" that determine RNA and protein synthesis, the form of human DNA as opposed to other DNA (23 chromosome pairs), etc. Consider a computer that on receiving a one-bit message will either print the Declaration of Independence or Hobbes' Leviathan. That one bit determines which is chosen, but does not fully describe either result. Russell
johnc@haddock.UUCP (04/04/87)
> Consider a computer that on receiving >a one-bit message will either print the Declaration of >Independence or Hobbes' Leviathan. That one bit determines which >is chosen, but does not fully describe either result. Red herring time. The message contains exactly one bit. The complexity of the response has nothing whatsoever to do with the size of the message. The original question was about the information content of human DNA, not about the DNA's environment, which obviously contains much more information. One observation I haven't seen yet is the peculiarity of DNA called "reading frames" This effectively triples the number of amino-acid sequences a given chunk of DNA encodes. Multiply this by two for the complementary strand. Granted, it is very rare that all six readings actually code for something in real life. But this doesn't have much to do with the information content. I remember back in the early days of computing, a prof in a course held up a punchcard with lots of holes in it, and another with no holes punched at all, and asked the class to compare their information contents. The correct answer, of course, was that all the punchcards in the house contained exactly the same amount of information: 12*80 bits. Some of them contained all 0 bits, but that isn't less information than one with some holes punched out. I wonder if anyone has ever built a computer with the possibility of multiple "reading frames". Consider an 8-bit memory, but a 16-bit instruction size. If you start executing at address A and at A+1, you get two possibly very different programs. Can any real-life processors do this? It happens with DNA quite often. -- John Chambers (617)247-1155 ...!ima!johnc [No, I don't work at cdx39 any more.]
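The six readings mentioned above (three frames on each strand) can be sketched in a few lines of Python. This is only an illustration: it assumes the standard genetic code, the function names are made up for the example, and the nine-base input is an arbitrary toy sequence rather than real data.

```python
# Standard genetic code, packed in the usual T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, frame):
    """Translate seq starting at offset frame (0, 1 or 2); '*' marks a stop."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def six_frames(seq):
    """Map '+1'..'+3' and '-1'..'-3' to the six possible readings of seq."""
    rc = reverse_complement(seq)
    readings = {}
    for frame in range(3):
        readings[f"+{frame + 1}"] = translate(seq, frame)
        readings[f"-{frame + 1}"] = translate(rc, frame)
    return readings


if __name__ == "__main__":
    # Toy sequence chosen for illustration only.
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

Every stretch of DNA yields six such strings; whether any of them corresponds to a protein a cell actually makes is a separate question, which the code does not address.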
emigh@ecsvax.UUCP (04/04/87)
In article <425@haddock.UUCP> johnc@haddock.ISC.COM.UUCP (John Chambers) writes: > >I wonder if anyone has ever built a computer with the possibility of multiple >"reading frames". Consider an 8-bit memory, but a 16-bit instruction size. >If you start executing at address A and at A+1, you get two possibly very >different programs. Can any real-life processors do this? It happens >with DNA quite often. When I was still using Microsoft CP/M FORTRAN, I ran across some of their assembler programming in the FORTRAN libraries that did this. The 16 bit load was 1 byte for the instruction and 2 bytes of data. The 8 bit load was 1 byte for the instruction and 1 byte of data. A segment of code might look like:

    Location       1       2       3

If we started executing at 1, it read as

    Instructions   LD HL   Low     High     (Low and High are the 2 bytes of data)

If we started executing at 2, it was

    Instructions   xx      LD A    Data

The HL value was always discarded (a slow NOP), and this presumably saved them a jump at some point. Needless to say, it was a mess to read and very poor programming practice. -- Ted H. Emigh Genetics and Statistics, North Carolina State U, Raleigh NC USENET: emigh@ecsvax.uucp DOMAIN: emigh%ecsvax.ncecs.edu ARPA: ecsvax!emigh@mcnc.org BITNET: NEMIGH@TUCC Distribution to monotremes and flightless waterfowl **RESTRICTED**
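The trick described above can be mimicked with a toy decoder. The sketch below is not the Microsoft library code, which is not shown in the post; it assumes the Z80 encodings 0x21 for LD HL,nn and 0x3E for LD A,n, uses a made-up data byte (0x05), and numbers locations from 0 rather than 1.

```python
# Toy decoder for two Z80-style opcodes (a deliberately tiny subset).
OPCODES = {
    0x21: ("LD HL,nn", 2),  # 16-bit load: opcode, then low and high data bytes
    0x3E: ("LD A,n", 1),    # 8-bit load: opcode, then one data byte
}


def decode(memory, start):
    """Decode memory from offset start; return (offset, mnemonic, value) tuples."""
    listing = []
    pc = start
    while pc < len(memory):
        opcode = memory[pc]
        if opcode not in OPCODES:
            listing.append((pc, "??", opcode))  # unknown byte, skip it
            pc += 1
            continue
        mnemonic, size = OPCODES[opcode]
        operand = memory[pc + 1:pc + 1 + size]
        if len(operand) < size:
            listing.append((pc, mnemonic + " (truncated)", None))
            break
        listing.append((pc, mnemonic, int.from_bytes(operand, "little")))
        pc += 1 + size
    return listing


# Hypothetical three-byte segment: the "low" data byte of the 16-bit load
# happens to be the opcode of the 8-bit load.
MEMORY = bytes([0x21, 0x3E, 0x05])

if __name__ == "__main__":
    print(decode(MEMORY, 0))  # entered at the first byte: one 16-bit load
    print(decode(MEMORY, 1))  # entered one byte later: one 8-bit load
```

Entering at offset 0 loads HL with a throwaway value and falls through; entering at offset 1 executes only the 8-bit load, which is the "reading frame" effect in miniature.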
6065833@pucc.UUCP (04/04/87)
John Chambers writes: >>a discussion of various aspects of DNA coding, quantity and form. >... >One observation I haven't seen yet is the peculiarity of DNA called "reading >frames" This effectively triples the number of amino-acid sequences a given >chunk of DNA encodes. Multiply this by two for the complementary strand. >... >I wonder if anyone has ever built a computer with the possibility of multiple >"reading frames". Consider an 8-bit memory, but a 16-bit instruction size. >If you start executing at address A and at A+1, you get two possibly very >different programs. My mother once asked me to explain the basics about computers: what bits and bytes are, etc. She was confused about something; eventually I figured out that what she really wanted to know was HOW the computer knows where an 8-bit word begins and ends, given that there are only 2 "letters" in the alphabet, and no punctuation or spaces allowed. This is the "reading frame" question from the other side. I'm actually quite surprised that she, who has never even touched a computer to date, came up with such a discerning question. Most people who use computers (as opposed to programming them) seem never to think about this at all (and I've dealt with thousands of such people). Una Smith 6065833@PUCC I thought signature files were silly until I realized I usually forget to identify myself. No longer. Please forgive my apparent rudeness.
diaz@aecom.UUCP (04/05/87)
In article <425@haddock.UUCP>, johnc@haddock.UUCP (John Chambers) writes: > One observation I haven't seen yet is the peculiarity of DNA called "reading > frames" This effectively triples the number of amino-acid sequences a given > chunk of DNA encodes. Multiply this by two for the complementary strand. > > Granted, it is very rare that all six readings actually code for something > in real life. But this doesn't have much to do with the information content. > Forget rare, there ain't no such animal. Although multiple reading frames have been observed in phage, and transcription from complementary strands observed in a variety of organisms (most recently, mice) there is no documented example of anywhere near the six possible readings coding for functional polypeptides. The implications for molecular evolution for such a scheme would be disastrous. Organisms apparently think little about the advantages of compact genomes, prokaryotes included. Rather there seems to be something about having a lot of "junk" DNA that's beneficial, if I may be allowed this ounce of teleology. Granted, we may one day realize that much satellite and intron DNA may have functions we don't even dream about today, but I truly doubt that coding for proteins will be one of them. -- dn/dx = Dan Diaz (philabs!aecom!diaz) Department of Molecular Biology & Pizza Chemistry AECOM "Hold the E.coli"
srp@ethz.UUCP (04/05/87)
In article <425@haddock.UUCP> johnc@haddock.ISC.COM.UUCP (John Chambers) writes: > >One observation I haven't seen yet is the peculiarity of DNA called "reading >frames" This effectively triples the number of amino-acid sequences a given >chunk of DNA encodes. Multiply this by two for the complementary strand. > >Granted, it is very rare that all six readings actually code for something >in real life. But this doesn't have much to do with the information content. > >It happens with DNA quite often. Whoa... I can think of only a few cases where two reading frames are used at one time: PhiX174, SV40 (both viruses) come to mind. I do not know of *any* cases where more than two reading frames are used at once. Can you elaborate? I view the multiple reading frame translation as evolution's way of dealing with size problems, thus it happens mainly in viruses (which would like to be small). Reading frames do overlap, but not for a very long segment. It doesn't take long before one reading frame's Alanine is another reading frame's Stop codon! The only reason this works at any length is because the genetic code is degenerate. As for uses in computer land, I think we are again limited by how difficult it is to have two strands of information running in the same data space. On top of that there isn't any data degeneracy to work with in computers. -- ----------- Scott Presnell Swiss Federal Institute of Technology (ETH-Zentrum) Department of Organic Chemistry Universitaetsstrasse 16 CH-8092 Zurich Switzerland. uucp: ...seismo!mcvax!cernvax!ethz!srp (srp@ethz.uucp) earn/bitnet: Benner@CZHETH5A
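The point about an Alanine in one frame becoming a Stop in another, and about degeneracy giving some room to avoid that, can be made concrete with a short sketch. It assumes the standard genetic code; the two six-base sequences are invented for the example and do not come from any real gene.

```python
# Standard genetic code, packed in the usual T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame):
    """Translate seq starting at offset frame (0, 1 or 2); '*' marks a stop."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


# Hypothetical sequences: both read Ala-Glu in frame 0, because GCT and GCC
# are synonymous codons for Alanine.
ORIGINAL = "GCTGAA"
SYNONYMOUS = "GCCGAA"

if __name__ == "__main__":
    for seq in (ORIGINAL, SYNONYMOUS):
        print(seq, "frame 0:", translate(seq, 0), "frame 2:", translate(seq, 2))
```

In the first sequence the last base of the Alanine codon plus the next two bases spell TGA, a stop, in the shifted frame. Swapping in the synonymous GCC leaves frame 0 unchanged but removes that stop, which is the kind of slack degeneracy provides.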
chiaraviglio@husc2.UUCP (04/06/87)
In article <425@haddock.UUCP>, johnc@haddock.UUCP (John Chambers) writes: > I wonder if anyone has ever built a computer with the possibility of multiple > "reading frames". Consider an 8-bit memory, but a 16-bit instruction size. > If you start executing at address A and at A+1, you get two possibly very > different programs. Can any real-life processors do this? It happens > with DNA quite often. Actually, you don't need 16-bit instruction size to do this, just an 8-bit instruction which expects some bytes of data immediately following it. This is the case on most processors (although in some cases the relevant numbers of bits are 16 and 32 or more); you can observe such an effect even on a 6502. The problem on processors is compounded by the fact that they store states much more than enzymes, so that even if an instruction is at A and the next one at A + 1, the program may behave very differently depending on the address at which execution begins. -- -- Lucius Chiaraviglio lucius@tardis.harvard.edu seismo!tardis.harvard.edu!lucius Please do not mail replies to me on husc2 (disk quota problems, and mail out of this system is unreliable). Please send only to the address given above.
gerryg@laidbak.UUCP (04/07/87)
In article <425@haddock.UUCP> johnc@haddock.ISC.COM.UUCP (John Chambers) writes: >I wonder if anyone has ever built a computer with the possibility of multiple >"reading frames". Consider an 8-bit memory, but a 16-bit instruction size. >If you start executing at address A and at A+1, you get two possibly very >different programs. Can any real-life processors do this? It happens >with DNA quite often. When I first understood that a coded DNA sequence could mean different things when decoded in different "phases", I thought about this connection to computer programs and data. It's not hard to see that this is a characteristic of the software, not the hardware. I think of it as a compression technique: once you have a program or data structure, you can look for ways of compressing it by folding it back on itself. That is, look for places where a piece of code or data is duplicated in another, unintended place. The original copy can be deleted, and references to it redirected to the new location. Of course you can get more fancy by rearranging or modifying things in a way that doesn't affect function to create a redundancy that can be deleted. It's probably not a very productive way to save your computer resources, but it is interesting to think about. In the case of DNA, there are several things that I wonder about. First, how does the cell keep nonsensical things from getting expressed? I know there is a lot of DNA devoted to "control" functions, but it's hard to believe that every possible reading of a DNA sequence either doesn't get expressed, or is necessary to (or at least not dangerous for) the organism. Another thing: when DNA sequences get rearranged in reproduction, new sequences are produced and others destroyed by this process. How can a cell survive this with its genetic information intact? Or is it that we never see the mistakes?
And then, the transcription process can't be 100% reliable, but there's a lot of information in the DNA of most organisms; there must be mistakes. How does the organism cope with this? Is it just redundancy? Or is there some kind of repair mechanism that puts an almost right sequence back together? Well, I'm not a biologist, but it seems to me that these are interesting questions, and I suspect that we haven't gotten very close to answering them. gerry gleason
johnc@haddock.UUCP (04/07/87)
In article <1010@aecom.UUCP> diaz@aecom.UUCP (Dizzy Dan) writes: >In article <425@haddock.UUCP>, johnc@haddock.UUCP (John Chambers) writes: >> One observation I haven't seen yet is the peculiarity of DNA called "reading >> frames" This effectively triples the number of amino-acid sequences a given >> chunk of DNA encodes. Multiply this by two for the complementary strand. >> Granted, it is very rare that all six readings actually code for something >> in real life. But this doesn't have much to do with the information content. > >Forget rare, there ain't no such animal. Although multiple reading >frames have been observed in phage, and transcription from complementary >strands observed in a variety of organisms (most recently, mice) there >is no documented example of anywhere near the six possible readings >coding for functional polypeptides. Come now, isn't it a tad early to make such a declaration? The literature has on the order of 100 genomes published, all but a handful being viruses. A couple of real cases of overlapping genes have been discovered; both are in viruses. From this you are going to presume to predict that there are no cases at all in higher organisms? You have more chutzpa than I. > The implications for molecular evolution for such a scheme would > be disastrous. No, they'd only be disastrous if such overlaps were common. It seems clear after just a little thought that selection would tend to eliminate overlapping genes, perhaps replacing them with replicates that can then mutate independently. Simple info-theoretic calculations would predict that such multiply-read stretches of DNA would be rare; they wouldn't be impossible. I'll go out on a limb myself, and predict that when we finally get entire genomes of vertebrate species, we will in fact find a few overlapping genes. The frequency will be much lower than you would expect in a random list of nucleotides, but they will be found occasionally. 
BTW, there is one mitigating factor that allows them slightly more often than you'd expect: There are many codings for most of the amino acids. The 64 possible codons encode only 20 amino acids and a stop signal. This means that it is possible (although difficult) to make slight changes to one of a pair of overlapping genes without changing the other. But it'll still be quite rare. >Organisms apparently think little about the advantages of compact >genomes, prokaryotes included. Rather there seems to be something about >having a lot of "junk" DNA that's beneficial, if I may be allowed this >ounce of teleology. Granted, we may one day realize that much satellite >and intron DNA may have functions we don't even dream about today, but I >truly doubt that coding for proteins will be one of them. The same thing goes for programs. Most of my programs contain some junk that I hope is never used. I call it "debugging code", and it is turned on by something like a -D5 option on the command line. If you don't know about such an enabling option, you might well look at the code and decide that it is worthless and can never be executed. Consider also the use of #ifdef in C to supply parallel chunks of code that can never be activated together:

    #ifdef SYS5
    ...
    #endif
    #ifdef BSD
    ...
    #endif
    #ifdef XENIX
    ...
    #endif

How do you know that some of the "junk" DNA isn't like this? I sorta suspect that my DNA comes loaded down with similar "junk" sequences that can become enabled by some circumstance that probably won't happen in my lifetime, but happened often enough to my ancestors that the stuff passed the Darwinian tests and got passed on. The fact that current researchers can't explain it is interesting to me, but not to my genome. Note that genetic "diseases" like sickle-cell anemia and diabetes are already known to be adaptive in certain environments.
If such downright damaging genes are "adaptive" in some populations, you'd expect a lot of the apparently-innocuous DNA to also be adaptive somehow. Also, the literature already contains some descriptions of stretches of DNA that have regulatory functions rather than coding for amino acids. Also, sometimes DNA (more often RNA) ends up curling around, interacting with itself like an enzyme, and modifying its own function. This could easily activate stretches that otherwise appear to be dummies. It's not well understood yet, but wait a few more years. -- John Chambers (617)247-1155 <...!ima!johnc> [The above opinions are my own; for a small fee, they can be yours, too.]
eddy@boulder.UUCP (04/08/87)
>Come now, isn't it a tad early to make such a declaration? The literature >has on the order of 100 genomes published, all but a handful being viruses. >A couple of real cases of overlapping genes have been discovered; both are >in viruses. From this you are going to presume to predict that there are >no cases at all in higher organisms? You have more chutzpa than I. No, John, the point Dizzy was making was that while cases of overlap exist, they are 1)very rare 2)very short and 3)only in two of the three reading frames. The party line is that these cases have evolved because of the pressure on viral genomes to be as small as possible (the smaller they are, the faster they replicate). The difficulties in writing overlapping codes are enormous even for a human who is doing the writing deliberately; the difficulties for evolution are incredible. And remember that your original point was that any given sequence potentially represents 6 (!) codings, not just two. Dizzy rightly replied that this is a good approximation to impossible. So what I'm trying to get across is that you're right, in theory; a given nucleotide sequence is capable of coding for 6 different proteins. The practical consideration is that DNA sequence length is not a limiting factor for anything except phage. Thus Dizzy is also right that there is little reason to expect coding region overlaps in anything but phage. (Um, just to be safe, the above applies only to overlaps for the purpose of information compression. Examples exist, I believe, of overlap for the purpose of regulation.) >Also, the literature already contains some descriptions of stretches of >DNA that have regulatory functions rather than coding for amino acids. >Also, sometimes DNA (more often RNA) ends up curling around, interacting >with itself like an enzyme, and modifying its own function. This could >easily activate stretches that otherwise appear to be dummies. It's not >well understood yet, but wait a few more years. 
I know the RNA literature to some degree (considering some of the big guns in RNA 'ribozymes' are here at Colorado). But I have never heard of DNA possessing catalytic activity. My impression was that the 2' OH on RNA was what enabled it to be reactive; DNA ('deoxy') lacks this 2' OH. Could you provide references for catalytic DNA?? --this is not a flame, I am really interested; there's too much molecular biology to hope to know it all. - Sean Eddy - Dept. of Molecular, Cellular, Developmental Biology - Univ. of Colorado, Boulder; Boulder, CO 80309 - - "Ph.D.'s are for suckers." -- from 'Ask Mr. Science'
werner@aecom.UUCP (04/09/87)
In article <430@haddock.UUCP>, johnc@haddock.UUCP (John Chambers) writes: > In article <1010@aecom.UUCP> diaz@aecom.UUCP (Dizzy Dan) writes: > >In article <425@haddock.UUCP>, johnc@haddock.UUCP (John Chambers) writes: > >> Granted, it is very rare that all six readings actually code for something > > > >Forget rare, there ain't no such animal. Although multiple reading > >frames have been observed in phage, and transcription from complementary > >strands observed in a variety of organisms (most recently, mice) there > >is no documented example of anywhere near the six possible readings > >coding for functional polypeptides. > > Come now, isn't it a tad early to make such a declaration? It is quite probable that an organism using all 6 reading frames in a given DNA sequence will never be found. Two are used quite often (usually opposite strands, including genes within an intron, rather than overlapping, which is also stretching the point). Three has been observed only once, and that almost doesn't count. In the splicing of Polyoma T-antigens, the 3' splice sites of the respective introns fall into 3 reading frames. Small T-antigen encounters a stop codon several amino acids later, but large T goes on for quite some distance. The reason that this is cheating is that the first part (95%, 50%, 30%) of each transcript is absolutely identical - there is only one promoter. It is only the 3' ends of the genes that are in all three reading frames. -- Craig Werner (MD/PhD '91) !philabs!aecom!werner (1935-14E Eastchester Rd., Bronx NY 10461, 212-931-2517) "Time flies when you're streaking out N. gonorrheae."
ma_jpb@bath63.UUCP (04/09/87)
An effect similar to that of DNA reading frames will have been experienced by anyone who has tried to write a disassembler. Finding where to start decoding an arbitrary chunk of code, particularly if the code may include static data, is difficult, even for a machine such as the 6502 with a relatively sparse instruction set. It is very easy to find chunks of static data that decode for several instructions as apparently valid code. For machines such as the NS32016 with very densely encoded instruction sets, such that almost any bit sequence is a valid instruction, symbolic disassembly is exceedingly awkward. J.P. Bennett School of Mathematical Sciences University of Bath Bath, England, BA2 7AY Tel: +44 225 826891 Email: ma_jpb@uk.ac.bath.ux63
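The disassembler's dilemma can be sketched with a naive linear sweep. The table below holds a handful of 6502 opcodes only, the helper name is invented for the example, and the four-byte buffer is a contrived case rather than code from any real program.

```python
# A small subset of 6502 opcodes: byte -> (listing template, operand bytes).
OPCODES = {
    0xA9: ("LDA #${:02X}", 1),  # load accumulator, immediate
    0x8D: ("STA ${:04X}", 2),   # store accumulator, absolute
    0x20: ("JSR ${:04X}", 2),   # jump to subroutine, absolute
    0x4C: ("JMP ${:04X}", 2),   # jump, absolute
    0x60: ("RTS", 0),           # return from subroutine
    0xEA: ("NOP", 0),           # no operation
}


def linear_sweep(data, start):
    """Decode from start to the end of data; return None if decoding fails."""
    listing = []
    pc = start
    while pc < len(data):
        entry = OPCODES.get(data[pc])
        if entry is None:
            return None  # byte is not an opcode in this small table
        template, size = entry
        operand = data[pc + 1:pc + 1 + size]
        if len(operand) < size:
            return None  # instruction would run off the end of the buffer
        # Operands are little-endian; templates without a field ignore the value.
        listing.append(template.format(int.from_bytes(operand, "little")))
        pc += 1 + size
    return listing


# Contrived buffer: every starting offset yields a clean-looking listing.
BLOB = bytes([0xA9, 0x20, 0x60, 0xEA])

if __name__ == "__main__":
    for start in range(len(BLOB)):
        print(start, linear_sweep(BLOB, start))
```

Here the operand of the first instruction is itself a valid opcode, so starting one byte later gives a different but equally plausible program. Nothing in the bytes alone says which reading was intended.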
howard@cpocd2.UUCP (04/10/87)
In article <891@sigi.Colorado.EDU> eddy@beagle.Colorado.EDU (Sean Eddy) writes: >No, John, the point Dizzy was making was that while cases of overlap exist, >they are 1)very rare 2)very short and 3)only in two of the three reading >frames. I'm sure there was an article in Sci Am recently about a virus which had a very short segment of triple overlap. This makes point 3 false. >And remember that your original point was that any given sequence >potentially represents 6 (!) codings, not just two. Dizzy rightly >replied that this is a good approximation to impossible. Any sequence not containing a terminator does, in some sense, code for 6 proteins. It would perhaps be more accurate to say that the probability of all 6 of these proteins being at all functional (or, less likely, actually produced by an organism) is very close to zero. The exact probability is a negative exponential of the sequence length, which we could approximate via information theory and statistics about protein mutability vs. function. Anyone have any relevant statistics? -- Copyright (c) 1987 by Howard A. Landman. You may copy this material for any non-commercial purpose as long as this notice is retained. You may also transmit this material to others and charge for such transmission, as long as you place no additional restrictions on retransmission of the material by the recipients.
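One very crude way to see the negative-exponential behaviour is to ask a weaker question than functionality: how likely is it that a random stretch has no stop codon at all? The sketch below assumes uniformly random, independent bases, the standard code's 3 stop codons out of 64, and (unrealistically) independence between the six frames. It says nothing about whether an open frame would be a working protein.

```python
STOP_CODONS = 3
ALL_CODONS = 64


def p_open_frame(n_codons):
    """Chance that n random codons in a single frame contain no stop codon."""
    return ((ALL_CODONS - STOP_CODONS) / ALL_CODONS) ** n_codons


def p_all_six_open(n_codons):
    """Same for all six frames, treating the frames as independent (a rough assumption)."""
    return p_open_frame(n_codons) ** 6


if __name__ == "__main__":
    for n in (10, 100, 300):
        print(n, p_open_frame(n), p_all_six_open(n))
```

Even at 100 codons a single frame stays open less than one time in a hundred under these assumptions, and the six-frame figure is the sixth power of that. Any requirement that the products also function would only push the numbers lower.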
eddy@boulder.UUCP (04/12/87)
In article <569@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes: >In article <891@sigi.Colorado.EDU> eddy@beagle.Colorado.EDU (Sean Eddy) writes: >>No, John, the point Dizzy was making was that while cases of overlap exist, >>they are 1)very rare 2)very short and 3)only in two of the three reading >>frames. > >I'm sure there was an article in Sci Am recently about a virus which had a >very short segment of triple overlap. This makes point 3 false. Sorry, I stand corrected. Thanks; I didn't know about the polyoma example, though I should have. I also take back number two. Did a little reading on phiX174 (Nature 264: 34-41, 1976), which is a bacteriophage of E. coli. Apparently gene E which codes for a host cell lysis protein is located completely within the coding sequence for gene D, which is necessary for replication. No small overlap there, we're talking two complete genes. Pardon me while I extract my foot from my mouth. >>And remember that your original point was that any given sequence >>potentially represents 6 (!) codings, not just two. Dizzy rightly >>replied that this is a good approximation to impossible. > >Any sequence not containing a terminator does, in some sense, code for 6 >proteins. It would perhaps be more accurate to say that the probability of >all 6 of these proteins being at all functional (or, less likely, actually >produced by an organism) is very close to zero. The exact probability >is a negative exponential of the sequence length, which we could approximate >via information theory and statistics about protein mutability vs. function. >Anyone have any relevant statistics? But this I won't buy yet. What is meant by a terminator here? To me, 'terminator' refers to a transcriptional terminator. The regulatory signals for protein translation are different. Having no transcription stop site should, to my mind, make little difference to protein translation. 
Also, something to keep in mind is that translation is a very controlled system, for good reason. Protein synthesis costs a hell of a lot of energy. A cell that wantonly made all 6 possible proteins from a sequence would quickly be selected against in favor of a cell that only produced the functional one. - Sean Eddy - Dept. of Molecular, Cellular, Developmental Biology - Univ. of Colorado, Boulder; Boulder, CO 80309 - - "Science has done some wonderful things, but I'd rather be happy - than right." - "Are you?" - "Well, I'm afraid that's where it all falls down." - - from Hitchhiker's Guide to the Galaxy
howard@cpocd2.UUCP (04/16/87)
>In article <569@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes: >>Any sequence not containing a terminator does, in some sense, code for 6 >>proteins. In article <918@sigi.Colorado.EDU> eddy@boulder.Colorado.EDU (Sean Eddy) writes: >What is meant by a terminator here? To me, 'terminator' refers to a >transcriptional terminator. The regulatory signals for protein >translation are different. Having no transcription stop site >should, to my mind, make little difference to protein translation. All I was trying to do was eliminate the possibility of having the coding for one protein stop and the coding for another start in the same reading frame in the same sequence, because then the counting gets messier. Perhaps it would be clearer to look at it as a function of the base pairs: how many separate proteins are there for which THIS BASE PAIR is part of the coding? Or, rephrasing my above statement: Any base-pair can theoretically be part of the coding for 6 proteins. -- Copyright (c) 1987 by Howard A. Landman. You may copy this material for any non-commercial purpose, or transmit this material to others and charge for such transmission, as long as this notice is retained and you place no additional restrictions on retransmission of the material by the recipients.
evs@duke.cs.duke.edu (Ed Simpson) (04/30/87)
In article <430@haddock.UUCP> johnc@haddock.ISC.COM.UUCP (John Chambers) writes: > .... It seems >clear after just a little thought that selection would tend to eliminate >overlapping genes, perhaps replacing them with replicates that can then >mutate independently. Not if the overlapping DNA sequences confer some sort of "coadaptation". This would be the case if the DNA sequences coded for proteins that conferred a higher fitness on the organism than homologous proteins produced by other DNA sequences. There's lots of discussion in the literature about the possibility of selection favoring increased linkage of certain genotypic combinations. In the case of non-overlapping genes there can never be 100% linkage; there is always the possibility of crossover occurring during meiosis. It seems to me that overlapping genes would be one way of achieving essentially 100% linkage. -- UUCP: {decvax, seismo}!mcnc!duke!evs ARPA: evs@cs.duke.edu CSNET: evs@duke Ed Simpson, P.O.Box 3140, Duke Univ. Medical Center, Durham, NC, USA 27710
c60a-4er@tart17.BERKELEY.EDU (Class Account) (05/04/87)
a product which is deleterious in oversupply, such duplication could be deleterious, preserving the overlapping loci. Also, I seem to remember at least one pair of overlapping in-frame genes whose protein products must share an identical sequence to interact correctly, and which are protected by being overlapping from mutations which would disrupt their interaction. I can't remember what genes or in what organism. Can anyone help me out with this? Mary K. Kuhner