HUENSD@vax1.computer-centre.birmingham.ac.uk (02/19/91)
As requested, here is the info on a hopefully interesting problem in the application of computing techniques to finding a possible coding sequence in a region. Sorry that the description is a bit long, but I hoped to describe the problem without all the jargon used in EBV research. (Every group has a different name for similar things! Impenetrable or what?)

Introductory information
========================

EBV is a herpesvirus that infects both B-lymphocytes and some epithelia. It is strongly associated with endemic Burkitt's lymphoma and nasopharyngeal carcinoma, and is the causative agent of infectious mononucleosis (glandular fever). Up to 95% of humanity is infected, mostly asymptomatically.

The virus has a number of modes of existence. It can infect B-lymphocytes, and when it does so it can be in passive latency, where it appears to express only the EBNA-1 protein (HS4 coordinates 107950-109872), which is required for episomal maintenance. In this state it is invisible to the immune system. It can also be in active latency, when it expresses 8 proteins which, at least in vitro, act to immortalise and transform the B-lymphocyte. Six of these are transcribed off two promoters, one at 11305 and the other in the BamHI-W repeats at 14352, 17424, etc. Very complex splicing to distant ORFs produces the mRNAs that encode these distinct proteins. The other two proteins (LMP and TP) use a separate set of promoters.

When infecting epithelia, a different set of genes is expressed. EBNA-1 is still present, but the two promoters (11305 and 14352 etc.) are inactive. As EBNA-1 is transcribed from these promoters in active latency, it must be produced from another promoter here. Evidence suggests this lies left of 62461. LMP and TP are often present, using the same promoters as in lymphocytic infection. An additional 4.8 kb message exists that runs over the region spanning 152000-161000.

The virus can enter lytic cycle in both lymphocytes and epithelia, and about a hundred genes are eventually transcribed in this phase to produce the viral particle. Most of these genes are intronless.

The Problem
===========

In epithelial infection resulting in nasopharyngeal carcinoma, an abundant novel 4.8 kb message exists. Incomplete cDNAs have been cloned for this message. Their content is described below:-

(N.B. - The reference sequence for the virus in HS4 is based on the B95-8 prototype isolate. This isolate has a number of variations, chief of which is a deletion at 152012/152013 in the region of interest. Fortunately, the deleted region has since been sequenced from the Raji isolate and deposited by the original authors in HS4RAJI. There is some uncertainty as to the exact sequence at the border of the deletion.)

The cDNAs were isolated from a library of the NPC C15 cell line, which has small but significant sequence differences. Not all sequence info was divulged in the original publication [1]. The following was reconstructed:-

Transcript runs rightward (toward higher sequence no.)

5' most extent of available clones: at least as far as 9231 in HS4RAJI.
    (Based on mapping in [1]. Longest clone not actually completely sequenced.)

1st observed intron - start: 10630 in HS4RAJI.
                      end:   155724 in HS4.
2nd observed intron - start: 157184 in HS4.
                      end:   Not given in publn. Sequence variation made it
                             difficult to place in the HS4 sequence.
3rd observed intron - start: 157386 in HS4.
                      end:   159083 in HS4.
poly-A              :        161013 in HS4.

An intron observed by one group but not the other: 160068-160238 in HS4. See [2]. (The other group inferred the sequence of this region.)

Reported sequence variation between C15 isolate and prototype strain in this region:

Deletions relative to B95-8: 155730 in HS4.
                             156074 in HS4.

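The coordinates above are enough for a rough check of the message length against the 4.8 kb seen on blots. The following is only a sketch under stated assumptions: exon lengths are taken as simple differences between boundary coordinates (the publication's convention is not given here, so each figure may be a base or so out), the exon between the 2nd and 3rd introns is only bounded because the end of the 2nd intron is unknown, and the poly-A tail is ignored. The names used are purely illustrative.

```python
# Rough length of the spliced message from the coordinates listed above.
# ASSUMPTION: exon length = right boundary - left boundary; whether the
# published boundaries are inclusive is not stated, so allow +/- a base or two.

# (sequence entry, left boundary, right boundary) where both ends are known
KNOWN_EXONS = [
    ("HS4RAJI", 9231, 10630),    # 5' mapped extent -> start of 1st intron
    ("HS4", 155724, 157184),     # end of 1st intron -> start of 2nd intron
    ("HS4", 159083, 161013),     # end of 3rd intron -> poly-A site
]

# The end of the 2nd intron was not published, so the exon lying between the
# 2nd and 3rd introns can be at most the whole 157184..157386 interval.
UNPLACED_EXON_MAX = 157386 - 157184

# Intron reported by one group only [2], counted inclusively (160068-160238).
OPTIONAL_INTRON = 160238 - 160068 + 1


def message_length_bounds():
    """Return (shortest, longest) plausible spliced length in nucleotides."""
    known = sum(right - left for _entry, left, right in KNOWN_EXONS)
    shortest = known - OPTIONAL_INTRON      # optional intron spliced out
    longest = known + UNPLACED_EXON_MAX     # unplaced exon at its maximum
    return shortest, longest


if __name__ == "__main__":
    low, high = message_length_bounds()
    print(f"estimated message length: {low}-{high} nt (poly-A tail not counted)")
```

On these assumptions the estimate comes out at roughly 4.6 to 5.0 kb, which brackets the observed 4.8 kb.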
Proposed ORFs in message by authors [1]:

cLF1
====

   1 ATGGCCGGAG CTCGTCGACG GGCAAGGTGC CAGCGTCAGC AGGATGCGCC
  51 TATAGCGCCC GGCCTCCTCC CCTGTCGACC AGAGGACGCA GGATATCTGC
 101 AGGATCAGGT CAGCCTCGTT GGTGGCCGTG GGGAAGCCCT CCTCCCCCAG
 151 ACACTCGATA TCGAAGGCCA GGGCCTGGTA GGAGGGCCAG GAGCTGTCTT
 201 CACGCCGGAC CGAGAGGTCG CCCACCTCAC AGTCGTACTC GAGCTCGGCG
 251 TACGAGTCCC GGTGCTGGAG GCGGGGGATG GCGCGGCGGC AGCTGTACCA
 301 GCCAAAGGTG ACAAAGTCAT TGTCCAGGAC AAAGCGGCGC GTGGCATCCA
 351 CGTTGGCCTC AAAGATCCGA CACCGTGCTT GTCTTGCAGC CACGTGGCCA
 401 CGTGA

1. This frame is only possible because of the C15 deletion relative to B95-8 at 155730, which generates a frameshift to extend the ORF. However, the deletion at 156074 in HS4 truncates this ORF of some approx 100 residues.

cBALF1
======

   1 TGAGCCCCCG GGTACGCTGT AGAAGCTGTT GAAGGAGGTC TCTATCCAGT
  51 CGCTCGGCTC GATGCCTGGC CATATCAGGG AAGTCAGGAA TGCCTTCTGG
 101 TGGGGCAGCG TACCTGCGGC GTCACAGCAG CGAGCCAGGG CCACGTTGCT
 151 GGGTGGGGGA AAGAGCCCGC TCTCCTCCGC CAGGGGCCCC GTGATGAAGG
 201 TGTACAGGCT GTGCGTCAGC GCGTGCAGGT GCTCCGAGCT CAGGGTCTGG
 251 GTAAACAGGT GTGTTTTGAT GTACTTGGAA TTCTCAAAGG CGGCACCCTC
 301 GCCGGCGCGC CTGTCCTCCC AGGGACCCGA GACGAAGGCC CGTCTGTAGA
 351 GGAAGTGGTT GCGCATGCGG GCCAGCTCCC AGTAGACCAC GTCCCCCCAG
 401 ACGCGCAGGC ACAGGGTCTC GGTCAGGGTC TCGCTCTGTT GCGCCAGGCA
 451 GGACTGCAGC TTGGCCAGAC CCTCGGTGGC CACCTGGCGC AGGTACTGCT
 501 CCTTGCGCTT GAGCGCGTCC GAGAGGGCGC CGGACGGGCC GGGCTCTCGT
 551 GCCCCAGCCG GCCGGGGCAC CTCCGGGCTC TCCCGGGACG CCTCCTCCTC
 601 GCCTCGGCCC AACCGCTGCA TGGCTCGGTT GAGCCGCGTG TACAGCTCGT
 651 TCCTCTTTTG CAGGATGGCC CGGTACTGGG GGTGCGCCGT GAAGGCGGCG
 701 GCGCAGTCCG CCTTCAGCGC CTCCACCGCG TCGCCCGAGG AGCTGTAGAC
 751 CCCGCCGCAG AAGAGCCGCT CCGTGGCCCC GGGAGCCACG GCGTCAAACA
 801 GGTGAGTCAG CCTTGCCCCC GCCAGCGCCT CCTCGCAGGC CCCCCGCACC
 851 AGGGCCAGGC GACGCTCCCG GGCAAACAGG GCAGAGAGGC GGGAATGGCC
 901 GCCACCCTCC CCCTGCCCCG TTGCACCGAT AGCATGGCCG CCAGAGTTCC
 951 AATAGAGGAG CTCCGAGAGC TCCGCCACCT CCGGGGGCAC TGTCGAGAAG
1001 ACGTTGTAGG TGTCCAGCGC TCTGGTCGCC CCCTCTGCCT CCGGCCGCCC
1051 CGGGCCCGGG ACCGCGCCCT CCTCTGGGCC GCCCGGCCTC GCCTTCTCCT
1101 CAGCCTCCAA CAGGTGCCCG AGCCCCGCCT GGCGGACTTC ATTCTCAAAC
1151 AGTCCCGAGA CCGGCTCCGG ATTCACCGGC ACCGCCAGGT GGTTACAGGA
1201 GACGTGGGTC CCCTCTGCCG TGGAAGGGTT GCCGTGGTTG GGCAGAACCA
1251 TCAGCTCGCC CACACAGCGC CAGCAGGGCA CAGAGGTGAT GTAGAGGCGC
1301 GGGTCTGGGA TGGGACTTAC GCCCCGAAAG CGGCCCAGCA GATCCAGGGC
1351 CCGTTCCAGG CTCTCCAGCC CCATGGTGTG AGACATGCAA TAAAACACGC
1401 TATTGATTCT CTTCATTAA

1. This region doesn't have an ATG until some two-thirds of the way through! One suggestion is that a non-AUG start is used.

2. The optional splice described above occurs with the intron spanning 493-663. This splice is in-frame and removes a number of residues from the ORF.

Supplementary Data
==================

As mentioned above, the prototype sequence has a deletion that has since been sequenced in another isolate. This deletion actually removes a lytic origin of replication that is almost identical to the one at the other end of the genome. Region 3565-4609 in the HS4RAJI sequence is virtually identical to 52654-53697 in the HS4 sequence. I mention this because the origin of replication at the left end is known to be a promoter/enhancer region and has bidirectional transcripts extending away from it. However, while the leftward promoter of the origin is maintained in the HS4RAJI sequence, the homology at the righthand end stops some tens of nucleotides before where the rightward promoter would be expected.

The sequence variation observed between the various isolates may or may not be real. The trouble is that the reference sequence is based on the B95-8 isolate, which has been passaged as an actively latent cell line for around 30 years. The above region is not transcribed in this lymphoid line. Raji has also been maintained similarly for almost as long. The cDNA was cloned from the C15 NPC tumour line, which at least expresses this region, though it too has been on the go for a long time - this time as a nude mouse xenograft.

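For anyone wanting to poke at the two proposed ORFs by machine, the checks raised in the notes on them (where is the first in-frame ATG, are there internal stops, is the optional splice in-frame) are easy to script. This is a minimal sketch, not anybody's published method: the function names are invented for illustration, coordinates are 1-based within the listing, and the reading frame to test is left as a parameter because the authors' frame is not stated here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def clean(listing):
    """Strip position numbers and spacing from a GCG-style sequence listing."""
    return "".join(ch for ch in listing.upper() if ch in "ACGTN")


def codons(listing, frame=0):
    """Complete codons of the sequence, starting at 0-based offset `frame`."""
    seq = clean(listing)
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def first_in_frame_atg(listing, frame=0):
    """1-based position of the first ATG in the given frame, or None."""
    for index, codon in enumerate(codons(listing, frame)):
        if codon == "ATG":
            return frame + 3 * index + 1
    return None


def in_frame_stops(listing, frame=0):
    """1-based positions of every stop codon in the given frame."""
    return [frame + 3 * index + 1
            for index, codon in enumerate(codons(listing, frame))
            if codon in STOP_CODONS]


def splice_is_in_frame(intron_start, intron_end):
    """True if removing the intron (inclusive coordinates) keeps the frame."""
    return (intron_end - intron_start + 1) % 3 == 0


def codons_removed(intron_start, intron_end):
    """Whole codons' worth of sequence removed by the splice."""
    return (intron_end - intron_start + 1) // 3


if __name__ == "__main__":
    toy = "1 TGACCCATGG GGTAA"   # made-up listing, not viral sequence
    print(first_in_frame_atg(toy), in_frame_stops(toy))
    print(splice_is_in_frame(493, 663), codons_removed(493, 663))
```

For the optional splice the arithmetic is reassuring: 493-663 inclusive is 171 nt, i.e. 57 codons' worth, and the genomic interval 160068-160238 is also 171 nt. In the cBALF1 listing as posted, positions 493-494 read GT and 662-663 read AG, which fits inclusive intron coordinates.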
(When the genome project gets going, I hope they won't sequence DNA obtained from cell lines. I suspect any gene not transcribed in a particular tissue may well undergo changes on prolonged passaging as a cell line, with the attendant problems described here - and the EBV episomal maintenance system has perhaps the highest fidelity of all viral episomal maintenance systems known.)

Also, if the mRNA is truly 4.8 kb as claimed, then the mapping data would appear to suggest that the inferred sequence is virtually full length (maybe even too long!). The part actually sequenced extends 1.3 kb less in the 5' direction.

One other point is that much of the 3' end of this region is littered with virtually back-to-back leftward ORFs that are known to be expressed as protein. This includes virtually the entire extent of the sequenced part of the mRNA!

If codon preference methods are to be used, there are clear differences in the codon usage of genes expressed in latency and those expressed in lytic cycle (see [3]). Latent messages tend to have high AT content at the third position.

When I refer to HS4 coordinates here, I assume the presence of the entire genome in one file. GCG users will have this split into two files, HS4 and HS4-2, with the first 100000 bp in the first file.

Questions
=========

1. Where is the 5' end of the message (or at least of the 5'-most exon)?

2. Are any of the proposed ORFs reasonable given their location so far from the 5' end of the mRNA? cLF1 is around 1.3 kb away, cBALF1 much further.

3. Given possible frameshift errors, is there a plausible reconstruction at the 5' end that will give a more reasonable message?

4. What is the likelihood that this message doesn't encode protein at all but just serves either as an anti-sense transcript or just as a doggone no-good message designed to confuse molecular biologists!?

I will be meeting the people who actually did the work next month, so I may be able to provide the answers to the above questions as approached by bench methods.
They have had over a year to work since they published this stuff.

References
==========

[1] Hitt,M.M., Allday,M.J., Hara,T., Karran,L., Jones,M.D., Busson,P., Tursz,T., Ernberg,I., Griffin,B.E. (1989) EBV gene expression in an NPC-related Tumor. EMBO J. 8:2639-2651.

[2] Gilligan,K., Sato,H., Rajadurai,P., Busson,P., Young,L.S., Rickinson,A.B., Tursz,T., Raab-Traub,N. (1990) Novel Transcription from the Epstein-Barr Virus Terminal EcoRI Fragment, DIJhet, in a Nasopharyngeal Carcinoma. J. Virol. 64:4948-4956.

[3] Karlin,S., Blaisdell,B.E., Schachtel,G.A. (1990) Contrasts in Codon Usage between Latent versus Productive Genes of Epstein-Barr Virus. J. Virol. 64:4264-4273.
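P.S. Two small helpers for anyone following this up, offered as a sketch only. The first converts the whole-genome HS4 coordinates used in this post to the GCG two-file layout; it assumes coordinates are 1-based and that numbering in HS4-2 restarts at 1, which is my reading of the split and not something checked against the GCG files. The second measures third-position A+T content of a candidate ORF, the simplest number to compare against the latent/lytic contrast described in [3]; it says nothing on its own about what threshold separates the two classes. Function names are made up for the example.

```python
GCG_FIRST_FILE_LENGTH = 100000  # first 100000 bp are in HS4, the rest in HS4-2


def to_gcg(coordinate):
    """Map a 1-based whole-genome HS4 coordinate to (GCG file, position).

    ASSUMPTION: numbering in HS4-2 restarts at 1.
    """
    if coordinate < 1:
        raise ValueError("coordinates are 1-based")
    if coordinate <= GCG_FIRST_FILE_LENGTH:
        return ("HS4", coordinate)
    return ("HS4-2", coordinate - GCG_FIRST_FILE_LENGTH)


def third_position_at_fraction(orf):
    """Fraction of complete codons whose third base is A or T.

    `orf` must start at the first base of a codon; digits and spaces from a
    sequence listing are ignored.
    """
    seq = "".join(ch for ch in orf.upper() if ch in "ACGT")
    thirds = seq[2::3]  # third base of every complete codon
    if not thirds:
        raise ValueError("need at least one complete codon")
    return sum(base in "AT" for base in thirds) / len(thirds)


if __name__ == "__main__":
    print(to_gcg(161013))   # poly-A site quoted in the post
    print(to_gcg(62461))
    print(third_position_at_fraction("ATG GCA TAA"))  # toy ORF, not viral
```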