[bionet.molbio.methds-reagnts] Why does RAPD mapping work? SUMMARY

frist@ccu.umanitoba.ca (04/13/91)
===================================================================

                 WHY DOES RAPD/AP MAPPING WORK?

                     Brian Fristensky, Ph.D.
                     Dept. of Plant Science
                     University of Manitoba
                  Winnipeg, MB CANADA  R3T 2N2
                       Phone: 204-474-6085
                        FAX: 204-275-5128
                  Email: frist@ccu.umanitoba.ca
===================================================================

0.   INTRODUCTION
I.   THE TECHNIQUE                for those of you just tuning in, 
                                  a short description of the      
                                  technique.
II.  PROPOSED MECHANISM           a model, and its statistical    
                                  justification
III. AN ALTERNATIVE MECHANISM  
IV.  THEORETICAL CONSIDERATIONS   some ramifications of the model, 
                                  and caveats for potential users.
     A. Assumptions of the model
     B. Imperfect priming
     C. Reaction dynamics
V.   SUGGESTED EXPERIMENTS
VI.  CONCLUSION
VII. REFERENCES

===================================================================
0.   INTRODUCTION
-----------------
Recently, I posted a query regarding a new PCR-based mapping
technique called RAPD (pronounced 'rapid') mapping [Williams et
al.] or AP-PCR [Welsh & McClelland] that potentially could replace
RFLP mapping in many cases. Since the precise mechanism behind this
technique is not yet clear, I discussed possible mechanisms, and
the problems with those mechanisms. Having received a number of
replies on the subject, including some from the authors of one of
the papers describing RAPD mapping, and given the subject some more
thought myself, I shall herein summarize those ideas in what I hope
is a coherent treatment of the subject. 

I. THE TECHNIQUE
----------------
RAPD/AP-PCR involves the use of short PCR primers to amplify
genomic DNA, generating genotype-specific patterns of bands, which
segregate just as traditional RFLP markers do. For a mapping
project, the investigator will typically screen a hundred or more
primers of random design, choosing primers that give a manageable
number of bands (e.g. 5 or 6), and of those primers, choosing as
informative those that detect polymorphism (i.e. presence vs.
absence of a band) in the population. Linkage is established as
with RFLPs. [Welsh & McClelland] have shown that it can also be
used for classification of bacterial strains. 

The puzzle behind RAPD/AP-PCR is that even though the
genomic DNA is not restricted, you still get discrete, reproducible
banding patterns using a single primer. Additionally, linear (as
opposed to exponential) PCR would not be expected to generate
sufficient signal to visualize bands by EtBr staining.

II. PROPOSED MECHANISM
----------------------
In my initial posting, I suggested that RAPD mapping depends on the
fortuitous occurrence in the genome of inverted repeats that can
serve as template for a given primer, such that each repeat unit
defines one end of a given PCR product visualized on a gel. [Welsh
& McClelland] speculate that this might be the case, and suggest
that imperfect primer matches may be sufficient for priming to take
place.  Most of the respondents to my initial posting agreed in
principle that imperfect priming could be part of the mechanism.
However, the situation is complicated by the observations of
[Williams et al.], who showed that even a single base difference
between 10-mer primers could result in a completely different
banding pattern with a particular genome.  

It is necessary, therefore, to reconsider the seemingly unlikely
hypothesis that RAPD mapping depends on _perfect_ inverted repeats,
spaced at a distance of several hundred to several thousand bp. 
For this purpose, let's define the relevant parameters.

Definitions:
     G     genome complexity
     k     the size of the primer
     d     the distance between the inverted repeat units 
           ie. typical fragment size
     p     probability of match between two bases chosen at random

For the sake of argument, let's say that the genome complexity G of
an organism is 10^9.   The frequency with which a sequence of k
nucleotides (nt) occurs is p^k.  Assuming p=0.25, the expected
number of occurrences E(k) of a given k-mer in a genome of
complexity G is
 
     E(k) = G x p^k,   or for tobacco  10^9 x (0.25)^10 = 954 per genome

But that's the number of priming events at ONE position. It is easy
to be misled by the observation that the probability of getting TWO
priming events close together would approximate p^k squared, or
about 10^-12.  Antoni Rafalski (duPont) has cited the reasoning of
Ken Livak and Eric Lander, who conceptualize a fragment of size d
as having d chances for an inverted repeat to occur. I am
embarrassed to say that I initially rejected this idea, in spite of
having written a paper describing an analogous situation, in which
I presented a statistical model of the frequency with which look-up
table-based similarity searches find similarities at a defined
level of identity [Fristensky]. I am now convinced that these are
comparable situations. 

The problem can be stated as follows: given a k-mer match at one
site, what is the probability P(d) of finding the inverse
complement of this sequence within d nucleotides?

  let P = p^k and Q = 1-P. The probability that the first match
  falls at position i is P Q^(i-1), so

  P(d) = SUM(i=1..d) P Q^(i-1)  =  P (1 - Q^d)/(1 - Q)  =  1 - (1 - p^k)^d

For a given 10-mer match at a single site, the probability P(d) of
getting another 10-mer match within d=1000nt is 9.5e-4, or about 1
in 1000. If this 10-mer occurs 954 times in the genome, then you
have an expectation of seeing about 1 band for a typical 10 mer in
the tobacco genome.
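
As a rough numerical check on the figures above, the two formulas can
be evaluated directly. This is only a sketch: the function names are
illustrative, and G = 10^9, k = 10, d = 1000 and p = 0.25 are the
values assumed in the text.

```python
import math

def expected_sites(G, k, p=0.25):
    """Expected occurrences E(k) = G * p^k of one k-mer in a genome of complexity G."""
    return G * p ** k

def prob_partner_within(d, k, p=0.25):
    """P(d) = 1 - (1 - p^k)^d: chance of at least one match in d positions.

    Written with log1p/expm1 to limit rounding error when p^k is tiny.
    """
    return -math.expm1(d * math.log1p(-(p ** k)))

G, k, d = 1e9, 10, 1000           # values used in the text
sites = expected_sites(G, k)      # number of single priming sites
p_d = prob_partner_within(d, k)   # chance of an inverted partner within d nt
bands = sites * p_d               # expected bands per primer

print(f"E(k) = {sites:.0f}")
print(f"P(d) = {p_d:.2e}")
print(f"expected bands = {bands:.2f}")
```

With these inputs the expectation comes out a little under one band
per primer, in line with the "about 1 band" estimate.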

This hand-waving explanation glosses over a number of points, but
it shows that the inverted repeat model is consistent with the
results for eukaryotic genomes. Much harder to explain are the
results obtained with _prokaryotic_ genomes. Fig. 1 of the RAPD
paper shows about the same number of bands for bacterial species as
with plant or human genomes! Section IV discusses some of the other
things that need to be considered in RAPD/AP-PCR.
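
To put a number on the prokaryotic discrepancy, the same random model
can be evaluated for a small genome. A minimal sketch, assuming a
bacterial genome of about 4.6 x 10^6 bp (roughly the size of E. coli;
this figure is an assumption for illustration and does not come from
the papers discussed), with the same k = 10 and d = 1000. The function
name `expected_bands` is illustrative.

```python
def expected_bands(G, k=10, d=1000, p=0.25):
    """Expected bands per primer under the perfect inverted repeat model:
    (number of k-mer sites) x (chance of an inverted partner within d nt)."""
    P = p ** k
    return G * P * (1.0 - (1.0 - P) ** d)

genomes = {
    "eukaryote, G = 1e9 (value used in the text)": 1e9,
    "bacterium, G = 4.6e6 (assumed, roughly E. coli)": 4.6e6,
}

for name, G in genomes.items():
    print(f"{name}: {expected_bands(G):.4f} expected bands per primer")

# Under this model the expectation is simply proportional to G.
ratio = expected_bands(1e9) / expected_bands(4.6e6)
print(f"eukaryote/bacterium ratio: {ratio:.0f}")
```

Under the purely random model the expected band count scales linearly
with G, so a bacterial genome of this size should give roughly 200-fold
fewer bands than a genome of complexity 10^9.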

III. AN ALTERNATIVE MECHANISM
-----------------------------
Antoni Rafalski cites another mechanism, suggested by Hemin Wu
(Yale University). As illustrated in the figure below, some
sequences containing a template for primer i will be able to form
hairpin structures downstream of i, as indicated in step 1. Such a
mechanism has been verified in 1st strand cDNA synthesis, and the
amazing thing about it is that very little secondary structure
seems to be necessary to make it work, as evidenced by the fact
that most genes can be cloned as cDNAs in this fashion. In step 2,
this hairpin elongates, creating the complementary strand,
containing domain a' and i' (ie. the complements of a and i,
respectively). When this duplex denatures (step 3), it forms a huge
inverted repeat. The region bounded by i and i' will now amplify
using primer i.



                   i          a
(1)         5' --------------------------        
                                    3'___)
                         | self priming
                         v

                   i          a
           5' --------------------------      
(2)        3' <_________________________) 
                    i'         a'
                         | duplex denatures
                         v

                   i          a                a'         i'
(3)        5' -------------------------------------------------- 3'
 

This model can be tested in the following experiment: 
     1. dilute PCR reaction
     2. denature 
     3. anneal for short time
     4. (+) S1 digestion  (-) no S1 digestion
     5. run on denaturing gel
     6. bands in (-) lane should be double size of (+) lane

Perhaps the biggest unknown is in step 2, which will only work if 
Taq polymerase can extend transient hairpins as Reverse
Transcriptase does. 
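
The end product of this model (step 3 in the figure) can be sketched
in a few lines. This is an idealised illustration only: the hairpin
loop is ignored, the sequences chosen for domains i and a are made up,
and `fold_back_product` is a hypothetical helper name.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def fold_back_product(strand):
    """Idealised self-primed extension: the 3' end folds back and is
    extended to the 5' end of the strand, so the single strand seen after
    denaturation is the original followed by its reverse complement."""
    return strand + revcomp(strand)

site_i = "GCATGGTACC"            # hypothetical domain i
domain_a = "ATTGCACGTTAGCAGT"    # hypothetical domain a
strand = site_i + domain_a       # 5'- i a -3'

product = fold_back_product(strand)   # 5'- i a a' i' -3'

print(product)
print("starts with i :", product.startswith(site_i))
print("ends with i'  :", product.endswith(revcomp(site_i)))
print("length doubled:", len(product) == 2 * len(strand))
```

In this idealised picture the product is its own reverse complement
and twice the length of the starting strand, which is what the
doubling of band size in the S1 experiment is meant to detect.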

IV. THEORETICAL CONSIDERATIONS
-------------------------------

A. Assumptions of the model

Ron Sederoff (N. Carolina State Univ.) pointed out that any
calculation of probability should be based on genome _complexity_,
rather than genome size. This is very important for eukaryotic
genomes, which tend to have a lot of repetitive DNA. He went on to
say that even experimentally-determined estimates of complexity,
which are based on reassociation kinetics, may be an overestimate
of the true complexity. For example, [Murray et al.] have
demonstrated that pea sequences ("fossil repeats") annealing with
single-copy kinetics under standard conditions can exhibit
repetitive reassociation kinetics under less stringent conditions. 

Another assumption is that inverted repeats occur randomly as a
function of nucleotide frequency. If mechanisms exist by which
short inverted repeats are spontaneously generated, then such an
assumption will underestimate their true frequency.

B. Imperfect priming

In spite of the fact that the random model as described in II puts
the frequency of perfect inverted repeats at least in the right
ball-park for eukaryotic genomes, it still doesn't explain the fact
that bacterial genomes give about the same number of bands as the
genomes of higher eukaryotes. The mutation-scanning experiment of
[Williams et al., Fig. 2] would be consistent with the idea that
more and more mismatch is tolerated with increasing distance from
the 3' end. They showed that single base-substitutions anywhere in
the 3'-most bases of a 10-mer resulted in completely different band
patterns. However, there did appear to be several bands in common
between the patterns given by the original 10-mer, and that of the
10-mer with a base substitution at the 5'end. 
[Welsh & McClelland] have provided further evidence for imperfect
priming by doing the first two cycles with low annealing
temperatures (40C), and subsequent annealing steps under more
stringent conditions (60C). Their Fig.1 shows that as the annealing
temperature of the first two cycles increases, bands are lost from
the final population.

At the same time,  many users of PCR have observed that some
primers that can base pair at their 3'ends can self prime,
producing "primer-dimers". Primer dimers can form with as little as
2 GC base pairs. [Sommer & Tautz] have shown that 17-mers with
several mismatches can still prime, provided that no mismatches
occur in the 3'-most three nucleotides.  Additionally, [White et
al.] have shown that primer-linker pairs with only 9 base overlap
can amplify into long tandem arrays (concatemers) even when
annealing is carried out at temperatures far exceeding the
predicted Td of the duplex.

C. Reaction dynamics

Keith Elliston (Rutgers), Pam Norton (Roger Williams Gen. Hosp.),
John Williams (duPont) and I all recognized that
early events in PCR reactions can have profound effects on the
final products visualized. Remembering that each primer is
physically incorporated into the strand that it produces, and is
then copied faithfully in subsequent reactions, it is obvious that
even imperfect priming events in the early cycles will produce
templates that can be perfectly primed in later cycles.
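
The point about primer incorporation can be made concrete with a toy
example. A minimal sketch, assuming made-up sequences and a single
mismatch opposite the 5'-most base of the primer; the helper names are
illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def mismatches(primer, site):
    """Mismatches between a primer and a template site (both 5'->3')."""
    return sum(a != b for a, b in zip(primer, revcomp(site)))

primer = "GGTACCATGC"            # hypothetical 10-mer
flank = "ATTGCACGTTAGC"          # hypothetical genomic sequence

# Genomic template strand: the site at its 3' end differs from a perfect
# site at the base that pairs with the 5'-most base of the primer.
genomic = flank + revcomp("TGTACCATGC")
before = mismatches(primer, genomic[-len(primer):])

# Cycle 1: the primer anneals imperfectly and is extended, so the new
# strand begins with the primer's own sequence.
product = primer + revcomp(flank)

# Cycle 2: a full-length copy of that product ends in the exact
# complement of the primer.
copy = revcomp(product)
after = mismatches(primer, copy[-len(primer):])

print("mismatches on genomic template:", before)
print("mismatches on copied product  :", after)
```

The genomic site carries one mismatch, but the site on the copied
product carries none, because the primer itself supplied that end of
the molecule.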

Rather than considering priming as the outcome of a match/mismatch
decision, it is probably more realistic to consider priming as a
function of what Ron Sederoff calls "residence time". The better
the duplex that can form, the more likely it is that the duplex will
stay together until another nucleotide is added. 

Finally, it is instructive to visualize the growing population of
fragments in solution as a biological community. In the early
cycles, a whole zoo of sequences is likely to be present. Fairly
quickly, a small number of sequences that are most efficient at
replicating will be selected, and will take over the population. 
Again, what we see in ethidium staining is the end point of the
reaction. 

These considerations may explain the reason that both bacterial and
large eukaryotic genomes are able to give about the same number of
bands. This observation may actually be a systematic artifact of
the limitations of reagents, and competition between templates
during amplification. Regardless of how many 'good' templates were
available in the early cycles, the population dynamics may favor a
small number of major products by later cycles, which is what we
see on the gel.
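
How quickly a modest advantage compounds can be seen in a back-of-the-
envelope calculation. This is a toy sketch, not a model of the real
reaction: the per-cycle efficiencies 0.9 and 0.7 and the cycle count
are hypothetical, and reagent limitation is deliberately left out.

```python
def relative_yield(efficiency, cycles):
    """Fold amplification after `cycles` if a fraction `efficiency` of
    the template molecules is copied in every cycle."""
    return (1.0 + efficiency) ** cycles

cycles = 30
good, poor = 0.9, 0.7   # hypothetical per-cycle efficiencies

ratio = relative_yield(good, cycles) / relative_yield(poor, cycles)
print(f"after {cycles} cycles the better template is {ratio:.0f}-fold ahead")
```

With these made-up numbers, two templates that start at equal copy
number end up roughly 28-fold apart, enough for one to appear as a
strong band while the other is barely visible.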

V.  SUGGESTED EXPERIMENTS
--------------------------
Perhaps the most important experiment to do would be to test the
inverted repeat hypothesis by sequencing the regions flanking the
ends of the RAPD/AP products. To do this, you first need to
determine the sequence internal to gel-isolated fragments from
RAPD/AP reactions. (Note that these fragments are
guaranteed to have perfect primer sequences at both ends. It is
therefore necessary to directly sequence the flanking regions from
genomic DNA.) Once the internal sequence can be determined,
internal sequencing primers can be synthesized, and the flanking
regions sequenced, without cloning, by inverse PCR, using genomic
DNA as a template [Ochman et al., Triglia et al.]. Of course,
sequencing the internal regions will also test the Wu model.

Another experiment to try would be to simply include radiolabeled
nucleotide in the RAPD/AP-PCR reaction, and remove aliquots from
the reaction at different cycles. Next, all aliquots would be co-
electrophoresed, and the gel autoradiographed for different time
intervals. Short exposures would show the relatively small number
of species present by the later cycles, while long exposures would
make it possible to visualize the presumptive zoo of fragments
present in early cycles. In this way, it should be possible to
compare the relative importance of specific priming versus
competition.

VI. CONCLUSION
-------------
I hope I have accurately represented the views of those who
responded to my initial posting, and I would like to thank all
involved for the stimulating discussions and exchanges that we have
had on this subject. Hopefully, the understanding of the RAPD/AP-
PCR mechanism will lead to its improvement as a technique. 

VII. REFERENCES
----------
Fristensky, B. (1986) Improving the efficiency of dot-matrix
similarity searches through use of an oligomer table. NUCL. ACIDS
RES. 14:597-610.

Murray, M., Peters, D.L. and Thompson, W.F. (1981) Ancient repeated
sequences in the pea and mung bean genomes and implications for
genome evolution. J. MOL. EVOL. 17:31-42.

Ochman, H., Gerber, A.S. and Hartl, D.L. (1988) Genetic
applications of an inverse polymerase chain reaction. GENETICS
120:621-623.

Sommer, R. and Tautz, D. (1989) Minimal homology requirements for
PCR primers. NUCL. ACIDS RES. 17:6749.

Triglia, T., Peterson, M.G. and Kemp, D.J. (1988) A procedure for
in-vitro amplification of DNA segments that lie outside the
boundaries of known sequences. NUCL. ACIDS RES. 16:8186.

Welsh, J. and McClelland, M. (1990) Fingerprinting genomes using
PCR with arbitrary primers. NUCL. ACIDS RES. 18:7213-7218.
(Editor's note: "Publication of this paper was delayed by the
authors to allow simultaneous publication with a paper submitted
later by another group. Nucleic Acids Research regrets that due to
administrative errors the other paper, by Williams et al., was
published on pages 6531-6535 of issue 22. Both sets of authors
agree that the two papers should be considered as published
simultaneously and should be referred to together.")

White, M.J., Fristensky, B. and Thompson, W.F. (1991) Concatemer
chain reaction: A Taq DNA polymerase-mediated mechanism for
generating long tandemly-repetitive DNA sequences. (submitted)
 
Williams, J.G.K., Kubelik, A.R., Livak, K.J., Rafalski, J.A. and
Tingey, S.V. (1990) DNA polymorphisms amplified by arbitrary
primers are useful as genetic markers. NUCL. ACIDS RES.
18:6531-6535.