[sci.bio] question

rburns@teknowledge-vaxc.UUCP (03/27/87)

I was wondering roughly how many 'bytes' of information are contained
within human chromosomes?

werner@aecom.UUCP (03/28/87)

In article <11189@teknowledge-vaxc.ARPA>, rburns@teknowledge-vaxc.ARPA (Randy Burns) writes:
> I was wondering roughly how many 'bytes' of information are contained
> within human chromosomes?

	The human genome contains 3 * 10^9 base pairs, which is 1000
times as much as that of Escherichia coli, and about 300 times the
total of all published sequences to date (*).
	Much of that is repeated DNA, either satellite DNA, interspersed
repeats, or moderately repeated gene families (like ribosomal RNA).

	Hence, if a byte is a base pair, that's your answer, although
only two bits are required to specify a base, ergo a 'byte' could 
actually be a tetranucleotide, but most sequences are stored as
letters (ATCG). 
	Similarly, if information is the key phrase here, only about
10-20% of the genome encodes information, so that brings the total
storage requirements down from 3000 Mbp to 300-600Mbp, maybe even
less.
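
	The arithmetic above can be sketched in a few lines of Python.
This is only an illustration: the function name and the choice of 10^6
bytes per megabyte are assumptions, not part of the original figures.

```python
# Rough storage estimates for the genome under the encodings discussed above.
GENOME_BP = 3 * 10**9  # base pairs in the human genome, per the figure above


def storage_megabytes(base_pairs, bits_per_base):
    """Megabytes (10**6 bytes) needed at the given number of bits per base."""
    return base_pairs * bits_per_base / 8 / 10**6


def coding_megabases(base_pairs, percent_coding):
    """Mbp left if only percent_coding percent of the genome is counted."""
    return base_pairs * percent_coding // 100 // 10**6


print(storage_megabytes(GENOME_BP, 8))  # one letter (byte) per base
print(storage_megabytes(GENOME_BP, 2))  # two bits per base, four bases per byte
print(coding_megabases(GENOME_BP, 10), coding_megabases(GENOME_BP, 20))
```

Stored as letters the whole genome needs about 3000 megabytes; packed at
two bits per base it needs a quarter of that.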

(*) Latest release of Genbank contains 10,913 sequences from 13,774
publications, totalling 10,961,365 base pairs.


-- 
			      Craig Werner (MD/PhD '91)
				!philabs!aecom!werner
              (1935-14E Eastchester Rd., Bronx NY 10461, 212-931-2517)
                   "Viruses do to cells what Groucho did to Freedonia."

srp@ethz.UUCP (03/29/87)

In article <11189@teknowledge-vaxc.ARPA> rburns@teknowledge-vaxc.UUCP writes:

>I was wondering roughly how many 'bytes' of information are contained
>within human chromosomes?

This assumes one base-pair is one bit.

Number of basepairs in the human genome = 2,900,000,000
					= 2.9 giga-_bits_
					= 362.5 mega-bytes

(Assuming 1000 base-pairs per gene	= 2.9 mega-genes)

Doesn't seem like much does it? A vast majority of these 'bytes' aren't even
part of a gene or control region. 

These numbers come from the combination of my calculator and the book
"DNA Replication" by Arthur Kornberg, W.H. Freeman and Co. (1980), pg 20.
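
For anyone without a calculator handy, here is a small Python sketch of the
same arithmetic. The helper name is made up for illustration. The second
call shows what changes if a base pair is counted as two bits (four possible
values) instead of one.

```python
BASE_PAIRS = 2_900_000_000  # base-pair count quoted above


def megabytes(base_pairs, bits_per_base_pair):
    """Convert a base-pair count to megabytes (10**6 bytes)."""
    return base_pairs * bits_per_base_pair / 8 / 1_000_000


print(megabytes(BASE_PAIRS, 1))       # one bit per base pair, as assumed above
print(megabytes(BASE_PAIRS, 2))       # two bits per base pair
print(BASE_PAIRS / 1000 / 1_000_000)  # mega-genes at 1000 base pairs per gene
```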

-- 
-----------

Scott Presnell  Swiss Federal Institute of Technology (ETH-Zentrum)
		Department of Organic Chemistry
		Universitaetsstrasse 16
		CH-8092 Zurich Switzerland.

uucp:		...seismo!mcvax!cernvax!ethz!srp     (srp@ethz.uucp)
earn/bitnet:	Benner@CZHETH5A

agranok@udenva.UUCP (03/29/87)

In article <978@aecom.UUCP> werner@aecom.UUCP (Craig Werner) writes:
>In article <11189@teknowledge-vaxc.ARPA>, rburns@teknowledge-vaxc.ARPA (Randy Burns) writes:
>> I was wondering roughly how many 'bytes' of information are contained
>> within human chromosomes?
>
>	Hence, if a byte is a base pair, that's your answer, although
>only two bits are required to specify a base, ergo a 'byte' could 
>actually be a tetranucleotide, but most sequences are stored as
>letters (ATCG). 

The whole argument gets caught up in definitions, here.  I would consider a
bit to be a base pair, and a byte to be the set of three that encodes for one
amino acid.  Instead of eight bits to a byte, there are three.  After all, one
base pair by itself doesn't do much good.  But, if a base pair is a bit, then
what is a nucleotide?  I guess it all depends on what you mean by "informa-
tion."   That's the problem with trying to put restrictions from one system 
onto another.  I think a better question might be something like:
"How many amino acids (words in the language of proteins) are encoded for on the
human chromosomes?"  or "How many books could these words fill?"  I seem to
remember Sagan doing something like this on Cosmos.  Anyway, I think that would
give a much more tangible idea of the enormity of information involved.

-- 

                              Alex Granok 
                              hao!udenva!agranok
                              "Wait a minute.  Strike that.  Reverse it."

emigh@ecsvax.UUCP (03/30/87)

In article <3310@udenva.UUCP> agranok@udenva.UUCP (Alexander Granok) writes:
>(Craig Werner) writes:
>>(Randy Burns) writes:
>>> I was wondering roughly how many 'bytes' of information are contained
>>> within human chromosomes?

>>	Hence, if a byte is a base pair, that's your answer, although
>>only two bits are required to specify a base, ergo a 'byte' could 
>>actually be a tetranucleotide, but most sequences are stored as
>>letters (ATCG). 

>The whole argument gets caught up in definitions, here.  I would consider a
>bit to be a base pair, and a byte to be the set of three that encodes for one
>amino acid.  Instead of eight bits to a byte, there are three.  After all, one
>base pair by itself doesn't do much good.  But, if a base pair is a bit, then

In the same way, a byte doesn't do much good in floating point arithmetic.:-)
The problem, of course, is that not all the genome is used for the message
in polypeptide chains.  There are noncoding regions (particularly in
eukaryotic organisms); rRNAs (often many thousand copies of each gene); tRNAs
(again with lots of copies except in the  prokaryotes and organelles); etc.
In humans, it is estimated that only 1-2% of all the DNA actually encodes
for amino acids.
This is mostly a problem of semantics.  If we wish to use "byte" as the
smallest unit of meaningful information, then the nucleotide is the byte.
The addition of the complementary base (to make a base pair) adds no additional
information, so the base pair could be considered as the "byte" as well.

-- 
Ted H. Emigh     Genetics and Statistics, North Carolina State U, Raleigh  NC
USENET: emigh@ecsvax.uucp		DOMAIN:	emigh%ecsvax.ncecs.edu
ARPA:	ecsvax!emigh@mcnc.org           BITNET: NEMIGH@TUCC
Distribution to monotremes and flightless waterfowl **RESTRICTED**

6065833@pucc.UUCP (03/31/87)

In article <2840@ecsvax.UUCP>, emigh@ecsvax.UUCP (Ted Emigh) writes, with
regard to the information stored in genes:
 
>If we wish to use "byte" as the
>smallest unit of meaningful information, then the nucleotide is the byte.
 
If you wish to regard the RNA's as the media for genetic storage, instead of
DNA, fine.  This is an alternative theory ('DNA is just RNA's backup').  But
consider that even if you discount duplicate pieces of RNA, there is the problem
of huge pieces of RNA junk stuck on the ends which are cut off as the RNA
strings leave the nucleus.  Not to mention that nonsense sequences in DNA
are transcribed into RNA, and only later cut out.  How do you tell which
nucleotides are meaningful?

lew@ihlpa.UUCP (03/31/87)

In article <3310@udenva.UUCP>, agranok@udenva.UUCP (Alexander Granok) writes:
> The whole argument gets caught up in definitions, here.  I would consider a
> bit to be a base pair, and a byte to be the set of three that encodes for one
> amino acid.  Instead of eight bits to a byte, there are three.  After all, one
> base pair by itself doesn't do much good.  But, if a base pair is a bit, then
> what is a nucleotide?  I guess it all depends on what you mean by "informa-
> tion." 

Indeed! But this question has been well considered and has a conventional
answer.  A channel consisting of a series of uncorrelated symbols carries
x bits per symbol where:

	 x = sum ( i = 1 ; i<=N ; i++ ) { P(i) * log2 ( 1 / P(i) ) }

N is the number of symbol values.  P(i) is the probability of
occurrence of the ith symbol value. Note that the information carried
depends on the usage.

If we assume each value occurs with probability 1/N then the expression
simplifies to:

	x = log2 N

With this simplification, DNA has a capacity of 2 bits per base pair
since there are 4 different base pairs.  However, each triplet only
codes for 20 amino acids plus a stop and start symbol so each triplet
carries log2 ( 22 ) = 4.5 bits or 1.5 bits per base pair.
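
As a minimal sketch, the formula and the two special cases above can be
checked in Python. The function name is my own, and the skewed distribution
at the end is a made-up example, not measured base frequencies.

```python
import math


def bits_per_symbol(probabilities):
    """Sum of P(i) * log2(1 / P(i)) over all symbol values with P(i) > 0."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


# Four equally likely bases: the expression reduces to log2(4).
print(bits_per_symbol([0.25] * 4))

# 22 equally likely codon meanings (20 amino acids plus start and stop).
codon_bits = bits_per_symbol([1 / 22] * 22)
print(round(codon_bits, 2), round(codon_bits / 3, 2))

# A skewed (hypothetical) usage carries less than 2 bits per symbol.
print(bits_per_symbol([0.4, 0.4, 0.1, 0.1]))
```

The last line illustrates the remark that the information carried depends
on the usage: any departure from equal probabilities lowers the figure.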

If one goes on to consider the correlations among sequential codons
the bits per base pair are further reduced.  Lila Gatlin ( I think that's
the name ) wrote a book in which she (?) tries to build up a big
theory of genetic information content.  I tried to read it some time
ago and decided she didn't understand information theory. She kept talking
about the information content of a single genome, which is meaningless.

The above definition defines the information carrying capacity of a
channel in terms of the statistics of the messages it carries. The
information content of a given message is not defined.

	Lew Mammel, Jr.

roy@phri.UUCP (03/31/87)

In article <2112@PUCC.PRINCETON.EDU> 6065833@PUCC.PRINCETON.EDU writes:
> But consider that even if you discount duplicate pieces of RNA, there is
> the problem of huge pieces of RNA junk stuck on the ends which are cut
> off as the RNA strings leave the nucleus.  Not to mention that nonsense
> sequences in DNA are transcribed into RNA, and only later cut out.  How
> do you tell which nucleotides are meaningful?

	Sounds like the old disk drive capacity game.  How big is an Eagle?
Unformatted, it's about 450 Mbytes.  But, once you format it, it's down to
about 380 (on a Sun anyway) and if you put a file system on it you're down
to maybe 370 after you leave room for inodes and such, and if you consider
that you have to leave 10% free space, maybe you're down to 335, not to
mention the space you've set aside for swap space and maybe a spare root
partition, and, well, the list goes on.

	The point is, it's all how you look at it.
-- 
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

"you can't spell deoxyribonucleic without unix!"

howard@cpocd2.UUCP (03/31/87)

In article <978@aecom.UUCP> werner@aecom.UUCP (Craig Werner) writes:
>	Hence, if a byte is a base pair, that's your answer, although
>only two bits are required to specify a base, ergo a 'byte' could 
>actually be a tetranucleotide, but most sequences are stored as
>letters (ATCG). 

In article <3310@udenva.UUCP> agranok@udenva.UUCP (Alexander Granok) writes:
>The whole argument gets caught up in definitions, here.  I would consider a
>bit to be a base pair, and a byte to be the set of three that encodes for one
>amino acid.  Instead of eight bits to a byte, there are three.  After all, one
>base pair by itself doesn't do much good.  But, if a base pair is a bit, then
>what is a nucleotide?  I guess it all depends on what you mean by "informa-
>tion."

Most of us use the standard definition, in which a "bit" is enough information
to answer a yes/no question.  The reason a base pair is 2 bits is that there
are 4 possibilities, not 2.  Craig correctly points out that this means that
4 base pairs could be stored in an (8 bit) byte.  He also correctly points out
that most nucleotide sequences, when stored in machine-readable form, use one
byte per base pair.  This makes it easier to search for subsequences, reverse
sequences, and places where genes overlap (as they do in some viruses).  But
the information content is no more than 2 bits; the rest is redundancy, and a
compression program could easily squeeze such a file down.
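
To make the "4 base pairs in a byte" point concrete, here is a rough Python
sketch of such a packing. The particular 2-bit code assigned to each letter
and the function names are arbitrary choices for illustration, and the
sequence is just sample data.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # arbitrary 2-bit assignment
LETTERS = "ACGT"                          # inverse of CODES, indexed by code


def pack(sequence):
    """Pack a DNA string into bytes, four bases per byte (last byte zero-padded)."""
    out = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        value = 0
        for base in chunk:
            value = (value << 2) | CODES[base]
        value <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(value)
    return bytes(out)


def unpack(packed, length):
    """Recover the first `length` bases from bytes produced by pack()."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(LETTERS[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "ATAGCTGATGCAAGATGAAGCTCTTGG"
packed = pack(seq)
print(len(seq), len(packed))            # letters stored vs. bytes used
print(unpack(packed, len(seq)) == seq)  # the round trip is lossless
```

The sequence length has to be kept alongside the packed bytes, since the
padding in the last byte is otherwise indistinguishable from real bases.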

It is possible to store protein sequences using one byte per amino acid, and in
that case you would be partly right.  Here again, though, the real information
content is less than 5 bits per amino acid.

>"How many amino acids (words in the language of proteins) are encoded for on
>the human chromosomes?"

Since three bases code for one amino acid, the simplistic answer is N/3 where
N is the number of base pairs.  In reality things are messier: (1) there are
long stretches of DNA that seem to be doing nothing, (2) there are various
initiation and termination sequences that don't actually code for proteins,
(3) in some organisms a single stretch of DNA/RNA can code for more than one
protein (but never more than three).

>or "How many books could these words fill?"  I seem to
>remember Sagan doing something like this on Cosmos.

My recollection is that it was something like "1500 volumes of Encyclopedia
Britannica" for the human gene set, or a wall full of books.  But you should
be able to calculate that from the numbers Craig posted.  Just count the number
of pages in a book, count the number of letters on a typical page, and divide N
by both.
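
A sketch of that calculation in Python follows.  The pages-per-book and
letters-per-page values are placeholders I made up for illustration; count
them in a real book before trusting the result.

```python
GENOME_LETTERS = 3 * 10**9  # N, the base-pair count posted earlier in the thread
PAGES_PER_BOOK = 1000       # assumed value, for illustration only
LETTERS_PER_PAGE = 3000     # assumed value, for illustration only


def books_needed(letters, pages_per_book, letters_per_page):
    """Divide N by both page count and letters per page, rounding up."""
    letters_per_book = pages_per_book * letters_per_page
    return -(-letters // letters_per_book)  # ceiling division on integers


print(books_needed(GENOME_LETTERS, PAGES_PER_BOOK, LETTERS_PER_PAGE))
```
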
-- 

Copyright (c) 1987 Howard A. Landman.  Transmission of this material
constitutes permission from the intermediary to all recipients to freely
retransmit the material within USENET.  All other rights reserved.

emigh@ecsvax.UUCP (03/31/87)

In article <2112@PUCC.PRINCETON.EDU> 6065833@PUCC.PRINCETON.EDU writes:
>In article <2840@ecsvax.UUCP>, emigh@ecsvax.UUCP (Ted Emigh) writes, with
>regard to the information stored in genes:
> 
>>If we wish to use "byte" as the
>>smallest unit of meaningful information, then the nucleotide is the byte.
> 
>If you wish to regard the RNA's as the media for genetic storage, instead of
>DNA, fine.  This is an alternative theory ('DNA is just RNA's backup').  But
>consider that even if you discount duplicate pieces of RNA, there is the problem
>of huge pieces of RNA junk stuck on the ends which are cut off as the RNA
>strings leave the nucleus.  Not to mention that nonsense sequences in DNA
>are transcribed into RNA, and only later cut out.  How do you tell which
>nucleotides are meaningful?

I do not consider RNA as the media for genetic storage (except in the case
of RNA viruses).  RNA has two roles (and possibly more):  1) As an
intermediate step in going from DNA to polypeptide chains; and 2) as
molecules with enzymatic activity (rRNAs, tRNA, regulatory RNAs, etc).
I also do not feel that just because a segment of DNA will not make it
into a protein that it is "junk" -- only that we may not understand its
functions.  Obviously, 5' flanking regions have regulatory effects.  3'
flanking regions are involved in termination of transcription.  I am
unsure of the role of introns, but feel that they are not there by
chance.  They may serve as protection against endonucleases; or they may
be involved with nucleosome structure; or ....

But even with all of this (rRNAs, introns, flanking regions, ...), some
60% of the DNA in the human genome has **UNKNOWN** function (which doesn't
mean that it has no function).  Rather than trying to say that each
nucleotide MUST have some information, we should say that it is at the
nucleotide level that we have the potential for information.  As an
analogy, my old North Star Horizon CP/M computer effectively has 56K
memory.  I am positive that there are bytes among those 56K that have
NEVER contained ANY meaningful information -- but all 56K had the
potential for containing meaningful information.

-- 
Ted H. Emigh     Genetics and Statistics, North Carolina State U, Raleigh  NC
USENET: emigh@ecsvax.uucp		DOMAIN:	emigh%ecsvax.ncecs.edu
ARPA:	ecsvax!emigh@mcnc.org           BITNET: NEMIGH@TUCC
Distribution to monotremes and flightless waterfowl **RESTRICTED**

werner@aecom.UUCP (04/03/87)

In article <3310@udenva.UUCP>, agranok@udenva.UUCP (Alexander Granok) writes:
> >
> base pair by itself doesn't do much good.  But, if a base pair is a bit, then
> what is a nucleotide?  I guess it all depends on what you mean by "informa-
> tion."   

 	For DNA, base pair and nucleotide are interchangeable terms, so I
had to laugh at the above.  Each base is a nucleotide, and since DNA is
complementary, each base automatically specifies the one across from it.
The same cannot be said for RNA, which is single-stranded.

	For instance, the partial sequence of the recombinant
plasmid, pBmCW5, reads:


	5'ATAGCTGATGCAAGATGAAGCTCTTGG3'

	From that I can read the complementary strand effortlessly
	5'CCAAGAGCTTCATCTTGCATCAGCTAT3'

so that the whole thing becomes:

	5'ATAGCTGATGCAAGATGAAGCTCTTGG3'
        3'TATCGACTACGTTCTACTTCGAGAACC5'

(and yes, the CW in pBmCW5 stands for Craig Werner)
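
	For those who would rather let the machine do the reading, a
minimal Python sketch of the same operation is below. The function names
are my own; only the sequence comes from the example above.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Base-by-base complement, in the same left-to-right order (reads 3'->5')."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand):
    """The complementary strand written in the conventional 5'->3' direction."""
    return complement(strand)[::-1]


top = "ATAGCTGATGCAAGATGAAGCTCTTGG"
print("5'" + top + "3'")
print("3'" + complement(top) + "5'")
print("5'" + reverse_complement(top) + "3'")
```

Since the second strand is computed entirely from the first, it adds no
information of its own.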


-- 
			      Craig Werner (MD/PhD '91)
				!philabs!aecom!werner
              (1935-14E Eastchester Rd., Bronx NY 10461, 212-931-2517)
                          "I just won't sleep, that's all."