frist@ccu.umanitoba.ca (04/13/91)
===================================================================
WHY DOES RAPD/AP MAPPING WORK?

Brian Fristensky, Ph.D.
Dept. of Plant Science
University of Manitoba
Winnipeg, MB CANADA R3T 2N2
Phone: 204-474-6085
FAX:   204-275-5128
Email: frist@ccu.umanitoba.ca
===================================================================

0.   INTRODUCTION
I.   THE TECHNIQUE
     for those of you just tuning in, a short description of the
     technique.
II.  PROPOSED MECHANISM
     a model, and its statistical justification
III. AN ALTERNATIVE MECHANISM
IV.  THEORETICAL CONSIDERATIONS
     some ramifications of the model, and caveats for potential
     users.
     A. Assumptions of the model
     B. Imperfect priming
     C. Reaction dynamics
V.   SUGGESTED EXPERIMENTS
VI.  CONCLUSION
VII. REFERENCES
===================================================================

0. INTRODUCTION
-----------------

Recently, I posted a query regarding a new PCR-based mapping
technique called RAPD (pronounced 'rapid') mapping [Williams et al.]
or AP-PCR [Welsh & McClelland] that potentially could replace RFLP
mapping in many cases. Since the precise mechanism behind this
technique is not yet clear, I discussed possible mechanisms, and the
problems with those mechanisms. Having received a number of replies
on the subject, including some from the authors of one of the papers
describing RAPD mapping, and having given the subject some more
thought myself, I shall herein summarize those ideas in what I hope
is a coherent treatment of the subject.

I. THE TECHNIQUE
----------------

RAPD/AP-PCR involves the use of short PCR primers to amplify genomic
DNA, generating genotype-specific patterns of bands, which segregate
just as traditional RFLP markers do. For a mapping project, the
investigator will typically screen a hundred or more primers of
random design, choosing primers that give a manageable number of
bands (e.g. 5 or 6), and of those primers, choosing as informative
ones those that detect polymorphism (i.e. presence vs.
absence of a band) in the population. Linkage is established as with
RFLPs. [Welsh & McClelland] have shown that it can also be used for
classification of bacterial strains.

The potential problem behind RAPD/AP-PCR is that even though the
genomic DNA is not restricted, you still get discrete, reproducible
banding patterns, using a single primer. Additionally, priming from a
single site on one strand would give only linear (as opposed to
exponential) amplification, which would not be expected to generate
sufficient signal to visualize bands by EtBr staining.

II. PROPOSED MECHANISM
----------------------

In my initial posting, I suggested that RAPD mapping depends on the
fortuitous occurrence in the genome of inverted repeats that can
serve as template for a given primer, such that each repeat unit
defines one end of a given PCR product visualized on a gel. [Welsh &
McClelland] speculate that this might be the case, and suggest that
imperfect primer matches may be sufficient for priming to take place.
Most of the respondents to my initial posting agreed in principle
that imperfect priming could be part of the mechanism.

However, the situation is complicated by the observations of
[Williams et al.], who showed that even a single base difference
between 10-mer primers could result in a completely different banding
pattern with a particular genome. It is necessary, therefore, to
reconsider the seemingly unlikely hypothesis that RAPD mapping
depends on _perfect_ inverted repeats, spaced at a distance of
several hundred to several thousand bp. For this purpose, let's
define the relevant parameters.

Definitions:

     G    genome complexity
     k    the size of the primer
     d    the distance between the inverted repeat units, i.e.
          typical fragment size
     p    probability of match between two bases chosen at random

For the sake of argument, let's say that the genome complexity G of
an organism is 10^9. The frequency with which a sequence of k
nucleotides (nt) occurs is p^k.
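
To get a feel for how quickly p^k falls off with primer length, here
is a minimal Python sketch. The function names, and the default of
p = 0.25 (all four bases equally frequent), are assumptions made
purely for illustration; multiplying by G treats every position in
the genome as an independent trial, which is an idealization.

```python
# Illustrative sketch only: function names and defaults are assumptions.

def kmer_frequency(k, p=0.25):
    """Per-site chance that k random bases match a given k-mer: p^k."""
    return p ** k


def expected_sites(G, k, p=0.25):
    """Number of matching sites expected in a genome of complexity G,
    treating every position as an independent trial."""
    return G * kmer_frequency(k, p)


if __name__ == "__main__":
    G = 10**9  # genome complexity assumed in the text
    for k in (8, 10, 12):
        print(f"k={k:2d}  p^k={kmer_frequency(k):.3e}  "
              f"expected sites={expected_sites(G, k):.1f}")
```

Under these assumptions, each extra base in the primer cuts the
per-site frequency fourfold.
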
Assuming p=0.25, the expected number of occurrences E(k) of a given
k-mer in a genome of complexity G is

     E(k) = G x p^k,

     or for tobacco  10^9 x (0.25)^10 = 954 per genome

But that's the number of priming events at ONE position. It is easy
to be misled by the observation that the probability of getting TWO
priming events close together would approximate (p^k)^2, or about
10^-12. Antoni Rafalski (duPont) has cited the reasoning of Ken Livak
and Eric Lander, who conceptualize a fragment of size d as having d
chances for an inverted repeat to occur. I am embarrassed to say that
I initially rejected this idea, in spite of having written a paper
describing an analogous situation, in which I presented a statistical
model of the frequency with which look-up table-based similarity
searches find similarities at a defined level of identity
[Fristensky]. I am now convinced that these are comparable
situations.

The problem can be stated as follows: given a k-mer match at one
site, what is the probability P(d) of finding the inverse complement
of this sequence within d nucleotides?

     Let P = p^k and Q = 1-P. Then

     P(d) = SUM(i=1..d) P x Q^(i-1)

          = P x (1 - Q^d)/(1 - Q)

          = 1 - (1 - p^k)^d

Each term P x Q^(i-1) is the probability that the first match falls
at the i-th position, after i-1 positions that did not match.

For a given 10-mer match at a single site, the probability P(d) of
getting another 10-mer match within d=1000 nt is 9.5e-4, or about 1
in 1000. If this 10-mer occurs 954 times in the genome, then you have
an expectation of seeing about 1 band for a typical 10-mer in the
tobacco genome.

This hand-waving explanation glosses over a number of points, but it
shows that the inverted repeat model is consistent with the results
for eukaryotic genomes. Much harder to explain are the results
obtained with _prokaryotic_ genomes. Fig. 1 of the RAPD paper shows
about the same number of bands for bacterial species as with plant or
human genomes! Section IV discusses some of the other things that
need to be considered in RAPD/AP-PCR.

III.
AN ALTERNATIVE MECHANISM
-----------------------------

Antoni Rafalski cites another mechanism, suggested by Hemin Wu (Yale
University). As illustrated in the figure below, some sequences
containing a template for primer i will be able to form hairpin
structures downstream of i, as indicated in step 1. Such a mechanism
has been verified in 1st strand cDNA synthesis, and the amazing thing
about it is that very little secondary structure seems to be
necessary to make it work, as evidenced by the fact that most genes
can be cloned as cDNAs in this fashion. In step 2, this hairpin
elongates, creating the complementary strand, containing domains a'
and i' (i.e. the complements of a and i, respectively). When this
duplex denatures (step 3), it forms a huge inverted repeat. The
region bounded by i and i' will now amplify using primer i.

              i                 a
   (1)   5' --------------------------
                                 3'___)
                    |
                    |  self priming
                    v
              i                 a
         5' --------------------------
   (2)   3' <_________________________)
              i'                a'
                    |
                    |  duplex denatures
                    v
              i          a          a'         i'
   (3)   5' -------------------------------------------------- 3'

This model can be tested in the following experiment:

     1. dilute PCR reaction
     2. denature
     3. anneal for short time
     4. (+) S1 digestion
        (-) no S1 digestion
     5. run on denaturing gel
     6. bands in (-) lane should be double size of (+) lane

Perhaps the biggest unknown is in step 2 of the figure, which will
only work if Taq polymerase can extend transient hairpins as Reverse
Transcriptase does.

IV. THEORETICAL CONSIDERATIONS
-------------------------------

A. Assumptions of the model

Ron Sederoff (N. Carolina State Univ.) pointed out that any
calculation of probability should be based on genome _complexity_,
rather than genome size. This is very important for eukaryotic
genomes, which tend to have a lot of repetitive DNA. He went on to
say that even experimentally-determined estimates of complexity,
which are based on reassociation kinetics, may be an overestimate of
the true complexity. For example, [Murray et al.]
have demonstrated that pea sequences ("fossil repeats") annealing
with single-copy kinetics under standard conditions can exhibit
repetitive reassociation kinetics under less stringent conditions.

Another assumption is that inverted repeats occur randomly as a
function of nucleotide frequency. If mechanisms exist by which short
inverted repeats are spontaneously generated, then such an assumption
will underestimate their true frequency.

B. Imperfect priming

In spite of the fact that the random model as described in II puts
the frequency of perfect inverted repeats at least in the right
ball-park for eukaryotic genomes, it still doesn't explain the fact
that bacterial genomes give about the same number of bands as the
genomes of higher eukaryotes.

The mutation-scanning experiment of [Williams et al., Fig. 2] would
be consistent with the idea that more and more mismatch is tolerated
with increasing distance from the 3' end. They showed that single
base-substitutions anywhere in the 3'-most bases of a 10-mer resulted
in completely different band patterns. However, there did appear to
be several bands in common between the pattern given by the original
10-mer and that of the 10-mer with a base substitution at the 5' end.

[Welsh & McClelland] have provided further evidence for imperfect
priming by doing the first two cycles with low annealing temperatures
(40C), and subsequent annealing steps under more stringent conditions
(60C). Their Fig. 1 shows that as the annealing temperature of the
first two cycles increases, bands are lost from the final population.

At the same time, many users of PCR have observed that some primers
that can base pair at their 3' ends can self prime, producing
"primer-dimers". Primer-dimers can form with as few as 2 GC base
pairs. [Sommer & Tautz] have shown that 17-mers with several
mismatches can still prime, provided that no mismatches occur in the
3'-most three nucleotides. Additionally, [White et al.]
have shown that primer-linker pairs with only 9 base overlap can
amplify into long tandem arrays (concatemers) even when annealing is
carried out at temperatures far exceeding the predicted Td of the
duplex.

C. Reaction dynamics

Keith Elliston (Rutgers), Pam Norton (Roger Williams Gen. Hosp.),
John Williams (duPont) and I all recognized the fact that early
events in PCR reactions can have profound effects on the final
products visualized. Remembering that each primer is physically
incorporated into the strand that it produces, and is then copied
faithfully in subsequent reactions, it is obvious that even imperfect
priming events in the early cycles will produce templates that can be
perfectly primed in later cycles.

Rather than considering priming as the outcome of a match/mismatch
decision, it is probably more realistic to consider priming as a
function of what Ron Sederoff calls "residence time". The better the
duplex that can form, the more likely it is that the duplex will stay
together until another nucleotide is added.

Finally, it is instructive to visualize the growing population of
fragments in solution as a biological community. In the early cycles,
a whole zoo of sequences is likely to be present. Fairly quickly, a
small number of sequences that are most efficient at replicating will
be selected, and will take over the population. Again, what we see in
ethidium staining is the end point of the reaction.

These considerations may explain why both bacterial and large
eukaryotic genomes are able to give about the same number of bands.
This observation may actually be a systematic artifact of the
limitations of reagents, and competition between templates during
amplification. Regardless of how many 'good' templates were available
in the early cycles, the population dynamics may favor a small number
of major products by later cycles, which is what we see on the gel.

V.
SUGGESTED EXPERIMENTS
--------------------------

Perhaps the most important experiment to do would be to test the
inverted repeat hypothesis by sequencing the regions flanking the
ends of the RAPD/AP products. To do this, you first need to determine
the sequence internal to gel-isolated fragments from RAPD/AP
reactions. (Note that these fragments are guaranteed to have perfect
primer sequences at both ends, because the primer itself is
incorporated into each product. It is therefore necessary to directly
sequence the flanking regions from genomic DNA.) Once the internal
sequence has been determined, internal sequencing primers can be
synthesized, and the flanking regions sequenced, without cloning, by
inverse PCR, using genomic DNA as a template [Ochman et al., Triglia
et al.]. Of course, sequencing the internal regions will also test
the Wu model.

Another experiment to try would be to simply include radiolabeled
nucleotide in the RAPD/AP-PCR reaction, and remove aliquots from the
reaction at different cycles. Next, all aliquots would be
co-electrophoresed, and the gel autoradiographed for different time
intervals. Short exposures would show the relatively small number of
species present by the later cycles, while long exposures would make
it possible to visualize the presumptive zoo of fragments present in
early cycles. In this way, it should be possible to compare the
relative importance of specific priming versus competition.

VI. CONCLUSION
-------------

I hope I have accurately represented the views of those who responded
to my initial posting, and I would like to thank all involved for the
stimulating discussions and exchanges that we have had on this
subject. Hopefully, the understanding of the RAPD/AP-PCR mechanism
will lead to its improvement as a technique.

VII. REFERENCES
----------

Fristensky, B. (1986) Improving the efficiency of dot-matrix
     similarity searches through use of an oligomer table. NUCL.
     ACIDS RES. 14:597-610.

Murray, M., Peters, D.L. and Thompson, W.F.
     (1981) Ancient repeated sequences in the pea and mung bean
     genomes and implications for genome evolution. J. MOL. EVOL.
     17:31-42.

Ochman, H., Gerber, A.S. and Hartl, D.L. (1988) Genetic applications
     of an inverse polymerase chain reaction. GENETICS 120:621-623.

Sommer, R. and Tautz, D. (1989) Minimal homology requirements for
     PCR primers. NUCL. ACIDS RES. 17:6749.

Triglia, T., Peterson, M.G. and Kemp, D.J. (1988) A procedure for
     in-vitro amplification of DNA segments that lie outside the
     boundaries of known sequences. NUCL. ACIDS RES. 16:8186.

Welsh, J. and McClelland, M. (1990) Fingerprinting genomes using PCR
     with arbitrary primers. NUCL. ACIDS RES. 18:7213-7218.
     (Editor's note: "Publication of this paper was delayed by the
     authors to allow simultaneous publication with a paper submitted
     later by another group. Nucleic Acids Research regrets that due
     to administrative errors the other paper, by Williams et al.,
     was published on pages 6531-6535 of issue 22. Both sets of
     authors agree that the two papers should be considered as
     published simultaneously and should be referred to together.")

White, M.J., Fristensky, B. and Thompson, W.F. (1991) Concatemer
     chain reaction: A Taq DNA polymerase-mediated mechanism for
     generating long tandemly-repetitive DNA sequences. (submitted)

Williams, J.G.K., Kubelik, A.R., Livak, K.J., Rafalski, J.A. and
     Tingey, S.V. (1990) DNA polymorphisms amplified by arbitrary
     primers are useful as genetic markers. NUCL. ACIDS RES.
     18:6531-6535.