[net.math.stat] Normal distribution probability problem.

gknight@ut-ngp.UUCP (gknight) (01/11/86)

Two questions from the same set of facts.  Assume you have two
populations, A and B, for which you have full information (i.e.,
the value of every event in each population).

	1)  If you draw a sample of size n = 1 from each population,
what is the probability that the sample from population A is larger
than the sample from population B?

	2)  Now assume the only information you have about the two
populations is based on samples of, say, size n = 10.  Thus you have
the mean and standard deviation of a sample from each population, but
know nothing about the individual events within the populations.  Now
what is the probability that a single sample drawn from population A
will be larger than one drawn from population B?

	I think the answer to (1) is simply a z-value and the
probability is the area under the normal curve.  But I haven't a clue
on how to work it out if you only have sample data.  I'm interested in
theory as well as an algorithm.  Any help will be appreciated.

			Thanks. 

hes@ecsvax.UUCP (01/12/86)

> Two questions from the same set of facts.  Assume you have two
> populations, A and B, for which you have full information (i.e.,
> the value of every event in each population).

   Since the Subject: line says that this is about the normal
distribution, full information means knowing the mean and variance
of each population.  (If the question were about two finite populations,
we'd have a different subject.)
> 
> 	1)  If you draw a sample of size n = 1 from each population,
> what is the probability that the sample from population A is larger
> than the sample from population B?
> 
   It's a double integral, based on the conditional probability:
Prob{Obs from A > Obs from B} =

      Int of distn of Oa [for each Oa find prob that Ob < Oa]

where Oa is an obs(ervation) from pop A, and Ob from pop B.
Each distn (distribution) is normal with appropriate mean and var.
The prob inside the bracket is the left tail of B integrated from
neg inf to Oa (and so is a cumulative normal).  The first integral
is taken over the neg inf to pos inf range of Oa.  So we have:

             pos inf           Oa
              Int    Na        Int    Nb  dOb   dOa
            neg inf           neg inf

as the desired probability.  Na and Nb are the two normal density
functions, evaluated at the random variables Oa and Ob.  In the
language you use below, Na is the height of the normal curve at Oa, and
the integral of Nb is an area under the normal curve.
   This may simplify.
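
A minimal sketch of that numerical integration in Python, using only the
standard library.  The inner integral is the cumulative normal of B,
computed here from math.erf; the outer integral is done by composite
Simpson's rule over a finite range around the mean of A.  The function
names, the step count and the truncation width are illustrative
assumptions, not part of the original post, and the draws from A and B
are assumed to be independent.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Left-tail area of N(mean, sd^2) up to x (the inner integral of Nb)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def prob_a_exceeds_b_numeric(mean_a, sd_a, mean_b, sd_b, steps=2000, width=10.0):
    """Approximate Prob{Oa > Ob} by integrating Na(Oa) * Prob{Ob < Oa} over Oa.

    The infinite range of Oa is truncated to mean_a +/- width * sd_a
    (an assumption; the neglected tails are negligible for width = 10).
    """
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    lo = mean_a - width * sd_a
    hi = mean_a + width * sd_a
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        oa = lo + i * h
        if i == 0 or i == steps:
            weight = 1.0
        elif i % 2:
            weight = 4.0
        else:
            weight = 2.0
        total += weight * normal_pdf(oa, mean_a, sd_a) * normal_cdf(oa, mean_b, sd_b)
    return total * h / 3.0


if __name__ == "__main__":
    # Hypothetical parameters: A is N(1, 1), B is N(0, 1).
    print(prob_a_exceeds_b_numeric(1.0, 1.0, 0.0, 1.0))
```

For the hypothetical parameters shown, the result is approximately 0.760.
If the s.d. of B is much smaller than that of A, the step count would
need to be increased so that the grid resolves the steep part of the
cumulative normal.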

> 	2)  Now assume the only information you have about the two
> populations is based on samples of, say, size n = 10.  Thus you have
> the mean and standard deviation of a sample from each population, but
> know nothing about the individual events within the populations.  Now
> what is the probability that a single sample drawn from population A
> will be larger than one drawn from population B?
> 
   One could certainly use the same formula as above, with estimates of the
means and variances of populations A and B replacing the parameters in Na
and Nb.  That would yield an estimate of the desired probability.  (One can't
know the actual probability without knowing the parameters of the two
populations.)  I'd rather not conjecture about the properties of this
estimator of the probability.
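
A minimal sketch of such a plug-in estimate in Python.  It relies on the
fact that, for independent normal observations, the double integral
equals the standard normal cumulative distribution evaluated at
(mean of A - mean of B) / sqrt(var of A + var of B).  The function name
and the sample values are hypothetical, and the n - 1 denominator for the
sample variance is an assumption about which estimator is wanted.

```python
import math
import statistics


def plug_in_estimate(sample_a, sample_b):
    """Estimate Prob{Oa > Ob} by substituting sample means and variances
    for the unknown population parameters."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance, n - 1 denominator
    var_b = statistics.variance(sample_b)
    z = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Standard normal cumulative distribution at z, via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    # Hypothetical samples of size n = 10 from each population.
    sample_a = [5.1, 4.8, 6.0, 5.5, 5.9, 4.7, 5.3, 6.2, 5.0, 5.6]
    sample_b = [4.9, 4.2, 5.1, 4.6, 5.4, 4.0, 4.8, 5.0, 4.5, 4.7]
    print(plug_in_estimate(sample_a, sample_b))
```

If only the two sample means and standard deviations are available, they
can be substituted into the same expression for z directly.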

> 	I think the answer to (1) is simply a z-value and the
> probability is the area under the normal curve.  But I haven't a clue
> on how to work it out if you only have sample data.  I'm interested in
> theory as well as an algorithm.  Any help will be appreciated.
> 
   The theory is the double integral; the algorithm is left as an
exercise for the reader. :-)  (The integration should be reasonably
easy to do numerically.)
> 			Thanks. 

--henry schaffer

eeb@ukc.UUCP (E.E.Bassett) (01/15/86)

The answer to (1) is easy: yes, the probability is the area under the
normal curve determined by what you term a z-value. To be specific, let
A and B represent the random variables from the two populations; let A
be distributed N(meana, vara) and B N(meanb, varb). (Note that the second
parameter shown is the variance rather than its square root, the s.d.)
Then, taking the two draws to be independent, since linear combinations
of independent normals are themselves normal,
   
          A - B  is N (meana - meanb , vara + varb) ,
and the probability you require is simply
               
          Pr (A - B > 0).

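Substituting the (known) values for the means and variances of A
and B is then easy:

          Pr (A - B > 0) = Phi ((meana - meanb) / sqrt (vara + varb)) ,

where Phi is the standard normal cumulative distribution function.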
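
As a minimal sketch, the calculation for (1) can be written in a few
lines of Python.  The function name is illustrative, the parameter values
in the example are hypothetical, and independence of the two draws is
assumed.

```python
import math


def prob_a_exceeds_b(meana, vara, meanb, varb):
    """Pr(A - B > 0) when A - B is N(meana - meanb, vara + varb)."""
    z = (meana - meanb) / math.sqrt(vara + varb)
    # Standard normal cumulative distribution at z, via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    # Hypothetical populations: A is N(1, 1) and B is N(0, 1).
    print(prob_a_exceeds_b(1.0, 1.0, 0.0, 1.0))
```

Note that the arguments are variances, not standard deviations, matching
the N(mean, var) notation used above.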

The answer to (2) is trickier.  Since you have sample information
about the distributions, you can estimate the means and standard deviations
in the usual way, i.e. by the sample means and sample standard deviations.
Substituting these into the formula obtained for (1) gives an
estimate of the probability you require.
Of course, many statisticians - particularly the Bayesian variety - will
argue that one should be able to express odds on A exceeding B in the
situation described in (2). So stick in the odd prior distribution or
two (actually four, if you can take the means and variances as
independent), turn a handle or four and out comes the posterior
probability. I haven't tried it, but it looks rather like the
Behrens-Fisher problem to me.
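
One possible way to turn that handle, sketched here under assumptions the
post does not state: take the standard noninformative prior, proportional
to 1/variance, independently for each population.  Under that prior a
future observation from a population with sample mean xbar, sample s.d. s
and sample size n has a posterior predictive distribution that is
Student's t with n - 1 degrees of freedom, centred at xbar with scale
s * sqrt(1 + 1/n).  The probability that a new A exceeds a new B is then
estimated by simulation.  The function names, the choice of prior, the
number of draws and the seed are all illustrative.

```python
import math
import random


def predictive_draw(rng, xbar, s, n):
    """One draw from the posterior predictive of a new observation,
    assuming normal data and a prior proportional to 1/variance."""
    df = n - 1
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    t = rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)  # Student's t with df d.f.
    return xbar + s * math.sqrt(1.0 + 1.0 / n) * t


def posterior_prob_a_exceeds_b(xbar_a, s_a, n_a, xbar_b, s_b, n_b,
                               draws=100000, seed=0):
    """Monte Carlo estimate of the posterior probability that a new
    observation from A exceeds a new observation from B."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        a = predictive_draw(rng, xbar_a, s_a, n_a)
        b = predictive_draw(rng, xbar_b, s_b, n_b)
        if a > b:
            hits += 1
    return hits / draws


if __name__ == "__main__":
    # Hypothetical summaries from samples of size n = 10.
    print(posterior_prob_a_exceeds_b(5.4, 0.5, 10, 4.7, 0.4, 10))
```

Because the uncertainty in the means and variances is carried through,
the result is typically somewhat closer to 1/2 than the estimate obtained
by substituting the sample values into the formula for (1).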

   Eryl Bassett
   Univ. of Kent
   Canterbury, U.K.