[bionet.molbio.bio-matrix] The EBV problem promised -Hope it's not too hellish!

HUENSD@vax1.computer-centre.birmingham.ac.uk (02/19/91)

As requested, here is the info on a hopefully interesting problem in the
application of computing techniques to finding a possible coding sequence
in a region. Sorry that the description is a bit long but I hoped to give
a description of the problem without all the jargon used in EBV research
(Every group has a different name for similar things! Impenetrable or what?)

Introductory information
========================
EBV is a herpesvirus that infects both B-lymphocytes and some epithelia. It is
strongly associated with endemic Burkitt's lymphoma, nasopharyngeal carcinoma
and is the causative agent of infectious mononucleosis (glandular fever). Up
to 95% of humanity is infected, mostly asymptomatically.

The virus has a number of modes of existence. It can infect B-lymphocytes and
when it does so it can be in passive latency, where it appears to express only
the EBNA-1 protein (HS4 coordinates 107950-109872) which is required for
episomal maintenance. In this state, it is invisible to the immune system. It
can also be in active latency, when it expresses 8 proteins which, at least in
vitro, act to immortalise and transform the B-lymphocyte. Six of these are
transcribed off two promoters, one at 11305 and the other in the BamHI-W repeats
at 14352, 17424, etc. Very complex splicing to distant ORFs produces the mRNAs
for these distinct proteins. The other two proteins (LMP and TP) use a separate
set of promoters.

When infecting epithelia, a different set of genes is expressed. EBNA-1 is still
present but the two promoters (11305 and 14352 etc) are inactive. As EBNA-1
is transcribed from these promoters in active latency, it must be produced
from another promoter here. Evidence suggests this lies left of 62461. LMP
and TP are often present using the same promoters as in lymphocytic infection.
An additional 4.8 kb message exists that runs over the region spanning
152000-161000.

The virus can enter lytic cycle in both lymphocytes and epithelia and about
a hundred genes are eventually transcribed in this phase to produce the viral
particle. Most of these genes are intronless.

The Problem
===========
In epithelial infection resulting in nasopharyngeal carcinoma, an abundant novel
4.8 kb message exists. Incomplete cDNAs have been cloned for this message. Their
content is described below:-

(N.B. - The reference sequence for the virus in HS4 is based on the B95-8
prototype isolate. This isolate has a number of variations, chief of which is a
deletion at 152012/152013 in the region of interest. Fortunately, the deleted
region has since been sequenced from the Raji isolate and deposited by the
original authors in HS4RAJI. There is some uncertainty as to the exact sequence
at the border of the deletion.)

The cDNAs were isolated from a library of the NPC C15 cell line which has small
but significant sequence differences. Not all sequence info was divulged in the
original publication [1]. The following was reconstructed:-

Transcript runs rightward (toward higher sequence no.)
5' most extent of available clones: at least as far as 9231 in HS4RAJI.
(based on mapping in [1]. Longest clone not actually completely sequenced)
1st observed intron - start: 10630 in HS4RAJI.
                        end: 155724 in HS4.
2nd observed intron - start: 157184 in HS4.
                        end: Not given in publn. Sequence variation made it
                             difficult to place in the HS4 sequence.
3rd observed intron - start: 157386 in HS4.
                        end: 159083 in HS4.
poly-A                     : 161013 in HS4.

An intron observed by one group but not the other: 160068-160238 in HS4. See [2].
(The other group inferred the sequence of this region.)

Reported sequence variation between C15 isolate and prototype strain
in this region:
Deletions relative to B95-8: 155730 in HS4.
                             156074 in HS4.

Proposed ORFs in message by authors [1]:

cLF1
====

       1  ATGGCCGGAG CTCGTCGACG GGCAAGGTGC CAGCGTCAGC AGGATGCGCC

      51  TATAGCGCCC GGCCTCCTCC CCTGTCGACC AGAGGACGCA GGATATCTGC

     101  AGGATCAGGT CAGCCTCGTT GGTGGCCGTG GGGAAGCCCT CCTCCCCCAG

     151  ACACTCGATA TCGAAGGCCA GGGCCTGGTA GGAGGGCCAG GAGCTGTCTT

     201  CACGCCGGAC CGAGAGGTCG CCCACCTCAC AGTCGTACTC GAGCTCGGCG

     251  TACGAGTCCC GGTGCTGGAG GCGGGGGATG GCGCGGCGGC AGCTGTACCA

     301  GCCAAAGGTG ACAAAGTCAT TGTCCAGGAC AAAGCGGCGC GTGGCATCCA

     351  CGTTGGCCTC AAAGATCCGA CACCGTGCTT GTCTTGCAGC CACGTGGCCA

     401  CGTGA

1. This frame is only possible because of the C15 deletion relative to
   B95-8 at 155730 which generates a frameshift to extend the ORF. However,
   the deletion at 156074 in HS4 truncates this ORF of some approx 100 residues.
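
To make the frameshift argument concrete, here is a minimal Python sketch on a
made-up toy sequence (it is not EBV sequence). The size of the C15 deletion is
not stated above, so a one-base deletion is assumed purely for illustration, and
the helper names translate_until_stop and delete_base are my own. It shows how
losing one base ahead of a stop codon shifts the frame and lets the ORF read on.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_until_stop(seq):
    """Translate from the first base, stopping at the first stop codon."""
    seq = seq.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        residue = CODON_TABLE[seq[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def delete_base(seq, index):
    """Return seq with the base at 0-based position index removed."""
    return seq[:index] + seq[index + 1:]


# Toy sequence (assumption, for illustration only): ATG GCC TAA GGG CCC TGA
toy = "ATGGCCTAAGGGCCCTGA"
shifted = delete_base(toy, 3)  # hypothetical one-base deletion after the ATG

print(translate_until_stop(toy))      # MA
print(translate_until_stop(shifted))  # MPKGP
```

In the toy, the undeleted frame hits TAA after two codons, while the shifted
frame reads past it. A second deletion further downstream can, in the same way,
bring a new stop into frame and cut the ORF short.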

cBALF1
======

       1  TGAGCCCCCG GGTACGCTGT AGAAGCTGTT GAAGGAGGTC TCTATCCAGT

      51  CGCTCGGCTC GATGCCTGGC CATATCAGGG AAGTCAGGAA TGCCTTCTGG

     101  TGGGGCAGCG TACCTGCGGC GTCACAGCAG CGAGCCAGGG CCACGTTGCT

     151  GGGTGGGGGA AAGAGCCCGC TCTCCTCCGC CAGGGGCCCC GTGATGAAGG

     201  TGTACAGGCT GTGCGTCAGC GCGTGCAGGT GCTCCGAGCT CAGGGTCTGG

     251  GTAAACAGGT GTGTTTTGAT GTACTTGGAA TTCTCAAAGG CGGCACCCTC

     301  GCCGGCGCGC CTGTCCTCCC AGGGACCCGA GACGAAGGCC CGTCTGTAGA

     351  GGAAGTGGTT GCGCATGCGG GCCAGCTCCC AGTAGACCAC GTCCCCCCAG

     401  ACGCGCAGGC ACAGGGTCTC GGTCAGGGTC TCGCTCTGTT GCGCCAGGCA

     451  GGACTGCAGC TTGGCCAGAC CCTCGGTGGC CACCTGGCGC AGGTACTGCT

     501  CCTTGCGCTT GAGCGCGTCC GAGAGGGCGC CGGACGGGCC GGGCTCTCGT

     551  GCCCCAGCCG GCCGGGGCAC CTCCGGGCTC TCCCGGGACG CCTCCTCCTC

     601  GCCTCGGCCC AACCGCTGCA TGGCTCGGTT GAGCCGCGTG TACAGCTCGT

     651  TCCTCTTTTG CAGGATGGCC CGGTACTGGG GGTGCGCCGT GAAGGCGGCG

     701  GCGCAGTCCG CCTTCAGCGC CTCCACCGCG TCGCCCGAGG AGCTGTAGAC

     751  CCCGCCGCAG AAGAGCCGCT CCGTGGCCCC GGGAGCCACG GCGTCAAACA

     801  GGTGAGTCAG CCTTGCCCCC GCCAGCGCCT CCTCGCAGGC CCCCCGCACC

     851  AGGGCCAGGC GACGCTCCCG GGCAAACAGG GCAGAGAGGC GGGAATGGCC

     901  GCCACCCTCC CCCTGCCCCG TTGCACCGAT AGCATGGCCG CCAGAGTTCC

     951  AATAGAGGAG CTCCGAGAGC TCCGCCACCT CCGGGGGCAC TGTCGAGAAG

    1001  ACGTTGTAGG TGTCCAGCGC TCTGGTCGCC CCCTCTGCCT CCGGCCGCCC

    1051  CGGGCCCGGG ACCGCGCCCT CCTCTGGGCC GCCCGGCCTC GCCTTCTCCT

    1101  CAGCCTCCAA CAGGTGCCCG AGCCCCGCCT GGCGGACTTC ATTCTCAAAC

    1151  AGTCCCGAGA CCGGCTCCGG ATTCACCGGC ACCGCCAGGT GGTTACAGGA

    1201  GACGTGGGTC CCCTCTGCCG TGGAAGGGTT GCCGTGGTTG GGCAGAACCA

    1251  TCAGCTCGCC CACACAGCGC CAGCAGGGCA CAGAGGTGAT GTAGAGGCGC

    1301  GGGTCTGGGA TGGGACTTAC GCCCCGAAAG CGGCCCAGCA GATCCAGGGC

    1351  CCGTTCCAGG CTCTCCAGCC CCATGGTGTG AGACATGCAA TAAAACACGC

    1401  TATTGATTCT CTTCATTAA


1. This region doesn't have an ATG until some two-thirds of the way thru'!
   One suggestion is that a non-AUG start is used.

2. Optional splice described above occurs with intron spanning 493-663.
   This splice is in-frame and removes a number of residues from the ORF.
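
The in-frame claim can be checked with simple arithmetic. The sketch below
assumes the intron coordinates are inclusive (first and last intron base), which
is my reading and not something stated above. It fits the listing, though: the
two lines copied from the cBALF1 sequence show GT at 493-494 and AG at 662-663.

```python
def span_length(first, last):
    """Length of an inclusive coordinate span."""
    return last - first + 1


def is_in_frame(first, last):
    """An intron leaves the reading frame intact if its length is a multiple of 3."""
    return span_length(first, last) % 3 == 0


orf_intron = (493, 663)            # coordinates within the cBALF1 listing
genomic_intron = (160068, 160238)  # the same optional intron in HS4

# Lines 451 and 651 of the cBALF1 listing, spaces removed.
line_451 = "GGACTGCAGCTTGGCCAGACCCTCGGTGGCCACCTGGCGCAGGTACTGCT"
line_651 = "TCCTCTTTTGCAGGATGGCCCGGTACTGGGGGTGCGCCGTGAAGGCGGCG"

donor = line_451[493 - 451:495 - 451]     # bases 493-494
acceptor = line_651[662 - 651:664 - 651]  # bases 662-663

print(span_length(*orf_intron), span_length(*genomic_intron))  # 171 171
print(is_in_frame(*orf_intron))                                # True
print(donor, acceptor)                                         # GT AG
```

Both coordinate pairs give 171 bases, i.e. 57 codons removed from the ORF when
the optional splice is used.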

Supplementary Data
==================
As mentioned above, the prototype sequence has a deletion that has since been
sequenced in another isolate. This deletion actually removes a lytic origin
of replication that is almost identical to the one at the other end of the
genome. Region 3565-4609 in the HS4RAJI sequence is virtually identical to
52654-53697 in the HS4 sequence. I mention this because the origin of replication
on the left end is known to be a promoter/enhancer region and has bidirectional
transcripts extending away from it. However, while the leftward promoter of
the origin is maintained in the HS4RAJI sequence, the homology at the righthand
end ends some tens of nucleotides before where the rightward promoter is
expected to be placed.

The sequence variation observed between the various isolates may or may not
be real. The trouble is that the sequence is based on the B95-8 isolate which
has been passaged as an actively latent cell line for around 30 years. The
above region is not transcribed in this lymphoid line. Raji has also been
maintained similarly for almost as long. The cDNA was cloned from the C15
NPC tumour line which at least expresses this region though it too has been
on the go for a long time - this time as a nude mouse xenograft.

(When the genome project gets going, I hope they won't sequence DNA
obtained from cell lines. I suspect any gene not transcribed in a particular
tissue may well undergo changes on prolonged passaging as a cell line with
attendant problems described here - the EBV episomal maintenance system has
perhaps the highest fidelity of all viral episomal maintenance systems known).

Also, if the mRNA is truly 4.8 kb as claimed, then the mapping data would
appear to suggest that the inferred sequence is virtually full length
(maybe even too long!). The actual sequenced stuff extends 1.3 kb less
in the 5' direction.
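
A rough tally of the mapped exons shows where the "too long" worry comes from.
This is only a sketch under several assumptions of mine: each intron "start" is
taken as the first intron base and each "end" as the last, 9231 is treated as the
5' end although the clones are only said to reach at least that far, the poly-A
coordinate is taken as the last transcribed base, and the two C15 deletions are
ignored. The 5' boundary of the third exon is not given, so it is left out.

```python
# Exon spans as (label, first_base, last_base); None marks a coordinate that
# is not given. Note that exon 1 is in HS4RAJI numbering, the rest in HS4.
EXONS = [
    ("exon 1 (HS4RAJI)", 9231, 10629),   # up to the base before intron 1
    ("exon 2 (HS4)", 155725, 157183),    # between intron 1 and intron 2
    ("exon 3 (HS4)", None, 157385),      # intron 2 end not placed in HS4
    ("exon 4 (HS4)", 159084, 161013),    # from intron 3 end to poly-A
]
OPTIONAL_INTRON = (160068, 160238)       # seen by one group only


def span_length(first, last):
    """Length of an inclusive coordinate span."""
    return last - first + 1


known_total = sum(
    span_length(first, last)
    for _, first, last in EXONS
    if first is not None and last is not None
)
optional_intron_len = span_length(*OPTIONAL_INTRON)

print(known_total)                        # 4788
print(known_total - optional_intron_len)  # 4617
```

On these assumptions the three fully placed exons already come to about 4.8 kb
before the third exon (under roughly 200 bases, since it lies between 157184 and
157385) and any poly-A tail are added.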

One other point is that much of the 3' end of this region is littered with
virtually back-to-back leftward ORFs that are known to be expressed as protein.
This includes virtually the entire extent of the sequenced part of the mRNA!

If codon preference methods are to be used, there are clear differences in the
codon usage of genes expressed in latency and those expressed in lytic cycle.
(see [3]). Latent messages tend to have high AT content at the third position.
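
As a starting point for anyone wanting to try this, here is a minimal sketch of
a third-position AT measure. The function name at3_fraction and the toy input
are my own; it is not the method of [3], just the simplest statistic that
reflects the latent/lytic contrast mentioned above. It assumes the sequence
passed in is already in frame, starting at the first base of a codon.

```python
def at3_fraction(cds):
    """Fraction of codons whose third base is A or T.

    Assumes cds starts at codon position 1; a trailing partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    third_bases = cds[2:usable:3]
    if not third_bases:
        raise ValueError("sequence is shorter than one codon")
    return sum(base in "AT" for base in third_bases) / len(third_bases)


# Toy input (not EBV sequence): ATG GCA TTT GGC -> third bases G, A, T, C
print(at3_fraction("ATGGCATTTGGC"))  # 0.5
```

Running it on each candidate frame of the proposed ORFs, and on known latent and
lytic genes for comparison, would give a first impression of which class a frame
resembles.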

When I refer to HS4 coordinates here, I assume the presence of the entire
genome in one file. GCG users will have this split into two files, HS4 and
HS4-2 with the first 100000 bp in the first file.
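
For GCG users, converting a whole-genome coordinate is a one-line subtraction.
The sketch below assumes that HS4 holds exactly bases 1-100000 and that numbering
in HS4-2 restarts at 1; check this against your own copy of the files. The
function name to_gcg is mine.

```python
SPLIT = 100_000  # size of the first GCG file, as stated above


def to_gcg(coord):
    """Map a 1-based whole-genome HS4 coordinate to (file name, position)."""
    if coord < 1:
        raise ValueError("coordinates are 1-based")
    if coord <= SPLIT:
        return ("HS4", coord)
    # Assumption: HS4-2 numbering restarts at 1.
    return ("HS4-2", coord - SPLIT)


print(to_gcg(14352))   # ('HS4', 14352)
print(to_gcg(155724))  # ('HS4-2', 55724)
```

Note that this applies to HS4 coordinates only; HS4RAJI is a separate entry with
its own numbering.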

Questions
=========
1. Where is the 5' end of message (or at least of 5'-most exon)?

2. Are any of the proposed ORFs reasonable given their location so far from
   5' end of mRNA? cLF1 is around 1.3 kb away, cBALF1 much further.

3. Given possible frameshift errors, is there a plausible reconstruction at
   5' end that will give a more reasonable message?

4. What is the likelihood that this message doesn't encode protein at all
   but just serves either as an anti-sense transcript or just as a doggone
   no-good message designed to confuse molecular biologists!?

I will be meeting the people who actually did the work next month so I may be
able to provide the answers to the above questions as approached by bench methods.
They have had over a year to work since they published this stuff.

References
==========
[1] Hitt,M.M.,Allday,M.J.,Hara,T.,Karran,L.,Jones,M.D.,Busson,P.,Tursz,T.,
    Ernberg,I.,Griffin,B.E. (1989) EBV gene expression in an NPC-related
    Tumor. EMBO J. 8:2639-2651.

[2] Gilligan,K.,Sato,H.,Rajadurai,P.,Busson,P.,Young,L.S.,Rickinson,A.B.,
    Tursz,T.,Raab-Traub,N. (1990) Novel Transcription from the Epstein-Barr
    Virus Terminal EcoRI Fragment, DIJhet, in a Nasopharyngeal Carcinoma.
    J. Virol. 64:4948-4956.

[3] Karlin,S.,Blaisdell,B.E.,Schachtel,G.A. (1990) Contrasts in Codon Usage
    between Latent versus Productive Genes of Epstein-Barr Virus. J. Virol.
    64:4264-4273.