[net.math] Drawing balls

davido (05/06/83)

In-Real-Life: David H. Olson @ Tektronix, Instrument Division
----------
A while back, I submitted a problem concerning drawing different
colored balls from an urn containing an infinite population of n total
different colors.  I was interested in the probability of getting k
different colors, k<n after drawing t times where t>k.  Probability
texts provided a formula where t = k,  but I wanted to increase my
chances of getting the k colors by continuing to draw.

Gary Levin of the University of Arizona replied with an answer and
even included a program for doing the computation.  Mark Paulin of
Tektronix also replied with a similar answer.  Since I had posed the
problem, I am also posting the solution provided by Gary and Mark.
Since Gary was the first to reply, I'll use his notation.

His approach was to calculate the cumulative probability as the sum
from i = k to t of the probability of getting exactly k colors on
exactly draw t.  If you call the probability of getting exactly k
different colors in t tries P(n,t,k), then on draw t, the cumulative
probability, cum, is the probability that you had gotten the k colors
on previous draws plus the probability that you got it on this draw.
The last term is equal to the probability that you got exactly k-1
colors on the t-1 draw times the ways of getting the kth color on this
draw, or P(n,t-1,k-1) * n-(k-1)/n.

P(n,t,k) can be computed as the probability of having gotten k-1 balls
on draw t-1 and getting a new one this time plus the probability of
having gotten the k colors on the t-1 draw and drawing a duplicate
color this time.  In the first case, you need to select one of the
remaining n-(k-1) out of n types.  In the second case, you need to
select one of the k out of n types already selected.

     P(n,t,k) = (P(n,t-1,k-1) * n-(k-1)/n)  +  (P(n,t-1,k) * n/k)

If computed in a straightforward recursive fashion, the P function
would take O(fib(t+k)), where fib(i) is the ith Fibonacci number.  Use
a tabular technique to reduce the running time to O(t*k).

I have appended Gary's program to compute the probability of getting k
hits in t or less tries from n types.  Invoke the program as "a.out n t
k".  It will print all non-zero probabilities for tries less than or
equal to t.

####   Program follows  ###
/* Gary Levin's program.  Enter balls picks hits.  Prints for each
pick < hits the cumulative and marginal probabilities of k hits after
t draws from infinite collection of n distinct types of balls */

#include <stdio.h>
#define  maxt 200
float prob();
float P[maxt][maxt];

error(s) char *s; { fprintf(stderr, s); exit(1); }

main(argc,argv) int argc; char **argv; {
    int n,t,k;
    int ik,it;
    int i;
    float cum;
    float pr;

    if( argc == 1 ) error("Usage: balls picks hits\n");
    if( argc != 4 ) error("Wrong number of parameters\n");

    /* get parameters */
	n = atoi(argv[1]); t = atoi(argv[2]); k = atoi(argv[3]);
    printf("n=%d t=%d k=%d\n\n",n,t,k);
    if( t>=maxt )    error("Too many picks, increase maxt.\n");
    if( t<k || k>n ) error("Parameters are incompatable.\n");

    for( ik=0; ik<=k; ik++ )
    for( it=0; it<=t; it++ ) P[it][ik] = -1;

    cum = 0;  /* prob of k hits in it or less tries */
    for( it=k; it<=t; it++ ) {
	pr = prob(n,it-1,k-1)*(n-k+1)/n;
	cum = cum + pr;
	    /* (prob of k hits in it-1 tries or less)
	       + (prob of getting the kth hit on exactly try it) */
	if( cum > 0 )
	    /* print it, cum and pr */
	    printf("t=%d	%g	%g\n", it, cum, pr );
    }
}


/* return probability of getting exactly k different results in t picks
    from n uniform events
*/
float prob(n,t,k) int n,t,k; {
    if( P[t][k] < 0 ) {
	if( k==0 && t==0 )		P[t][k] = 1;
	else if( k>n || k>t || k==0 )	P[t][k] = 0;
	else
	    P[t][k] = prob(n,t-1,k-1)*(n-k+1)/n +
		      prob(n,t-1,k)  * k / n;
	    /* Either in t-1 tries you got k-1 hits and hit a new one this
	    try [the (n-(k-1))/n factor], or in t-1 tries you got k hits
	    and got a duplicate this try [the k/n factor]  */
    }
    return P[t][k];
}
/* The P array is a technique to reduce the running time from exponential
to quadratic.  */