[bionet.molbio.evolution] Calculating expected values in Infinite Allele

snell@utzoo.uucp (snell) (08/05/89)

(I posted this to sci.math.stat yesterday, and am throwing it in this
ring today.   I could provide a program in C which produces exactly
the same thing as this S program.  This group has rather low traffic,
but at least the subject is suitable.)

I am working with allele frequency data based on electrophresis,
and want to test possible departure of my data from the 
Infinite allele--Constant mutation rate (IC) model.
I am trying to generate expected `n' values for various allele frequency 
intervals using equation (5) from Chakraborty, Fuerst and 
Nei (Genetics 94: 1039-1063, 1980).  One needs only an estimate of 
heterozygosity and number of genes in the sample (for diploid species,
N genes  = # individuals X # loci X 2).
A U-shaped distribution is expected, with most alleles being either
very common or very rare (Chakraborty et. al 1980, Fig. 1).
Their equation 5 is printed as follows:
     
     
                   nq           n
                   ___         ____
                   \      1    |  |  n + 1 - j   
     n(p,q) = M     >    ---   |  |  ----------
                   /__    i    |  |  n + M - j
                  i=np+1       j=1	    
     
This equation (apparently) was used by Barrowclough, Johnson and Zink (pages
135-154 _In_ Current Ornithology, Vol. 2, Plenum Press, New York, 1985).
However, Barrowclough et al. modified it somewhat (presumably in an attempt
to clarify it: `n' having originally been used in different places for 
different things).  The equations appear different, as indicated by the 
differing symbols over the product sign.  (I assume, in Chakroborty 
et al.'s formulation, that the string of products is not really `1 to N';
otherwise it would be a constant.)

Barrowclough et al.'s formulation is as follows:

                   Nq           i
                   ___         ____
                   \      1    |  |  N + 1 - j   
     n(p,q) = M     >    ---   |  |  ----------
                   /__    i    |  |  N + M - j
                  i=Np+1       j=1	    

where N = individuals * loci * 2
      M = H/(1-H)    ;H = heterozygosity estimate
      p = lower boundary of frequency class
      q = upper boundary of frequency class

I have assumed, since Nq and Np are not always integers (for instance,
where N = 1100 and p = .05), one would round up or down as appropriate
to include the integer values falling within the frequency class.

I am unable to interpret what is meant by either formulation in a way
that gives a U-shaped expected curve with n(p,q) values similar to
those provided by these authors.  Below is an S program which (I think)
should give expected values acording to Barrowclough et al.'s formula.
It does not.  I have set it up to calculate expected values for 
Barrowclough et al's _Sphyrapicus nuchalis_ example (their Fig. 1).  
From that figure, expected values in the 12 frequency classes are 
about (it is hard to tell): 
2., 1.5, 1.5, 1., 1., 1., 1., 1.5, 1.5, 2., 1.5, and 28.

This S program produces a U-shape, but n(p,q) values are all far too 
small (all less than one, and they sum up to more than one).  Modifying 
this program slightly so that the product string becomes a constant
not give the right sort of values either.

Any suggestions on how this program might be modified, or on how to
correctly calculate these expected values, would be most helpful.
This has everyone about here rather stumped.

                       ***************************
# Barrowclough et al. (1985: Fig. 1) example, _Sphyrapicus nuchalis_ 
# (a Sapsucker type woodpecker) data.
# Barrowclough's Pi product notation used.

H_0.039                     # heterozygosity estimate
M_H/(1-H) 
freqs_c(0,.05,.1,.2,.3,.4,.5,.6,.7,.8,.9,.95,1.)
N_1110                      # number of genes (15 birds * 37 loci * 2)
p.q_floor(freqs*N)          # `floor' value of limits on frequency classes
P_p.q[1:(len(p.q)-1)]+1     # integral np+1
Q_p.q[2:len(p.q)]           # integral nq
Q[len(Q)]_p.q[len(p.q)]     # fixing upper nq value (where q = 1)
n.exp_NA                    # vector for expected n(p,q) values

t.x_NA                             # vector for product terms
for (j in 1:N) {                   # loop for product terms
  t.x_append(t.x,(N+1-j)/(N+M-j))  # getting product terms
}
t.x_t.x[2:len(t.x)]                # getting rid of the NA

for (n in 1:len(P)) {        # frequency class loop

  t.s_NA                                    # vector for sums
  for (i in P[n]:Q[n]) {                    # summation loop
    t.s_append(t.s,(prod(t.x[1:i],T))/i)    # getting summation term
  }
  t.ss_sum(t.s,T)                           # the summation
  n.exp_append(n.exp,M*t.ss) # expected n(p,q) allele frequency--IC model
}   
n.exp_n.exp[2:len(n.exp)]   # getting rid of the NA
n.exp                       # list of expected values under IC model

                       ***************************
-- 
Name: Richard Snell
Mail: Dept. Zoology, Univ. Toronto, Toronto, Ontario, Canada    M5S 1A1
UUCP: uunet!attcan!utzoo!snell      BITNET: snell@zoo.utoronto.ca