[comp.ai] info wanted on learning probabilities

cho@sol4.cs.psu.edu (Sehyeong Cho) (01/31/91)

Hi, netland.
Has anyone built a system (or read a paper describing a system) that:
 begins with a very rough estimate of conditional probabilities,
 then learns (updates) the conditional probabilities from examples?

For instance, suppose I believed P(death|shoot SCUD) = 0.01,
and then experience a few "shoot SCUD" events along with the
results. (say,  5 SCUD's shot, 1 caused death)
The 5 events are too few to conclude P=0.2, so it must be
somewhere between 0.01 and 0.2.
Any theory (or heuristics, or psychological evidence...) about how to do this?

Thanks in advance.

--
                      |  Yesterday I was a student.
     Sehyeong Cho     |  Today I am a student.
     cho@cs.psu.edu   |  Tomorrow I'll probably still be a student.
                      |  Sigh.. There's so little hope for advancement.

gblee@maui.cs.ucla.edu (Geunbae Lee) (01/31/91)

I think the following book and other works by Pearl can answer your
question (sorry for the BibTeX format). Pearl's Bayesian networks can
formulate, propagate, and revise beliefs (similar to your conditional
probabilities) according to prior experience. The book also has a good
bibliography for this field.

@Book{pearl:probabilistic,
  author = 	"Judea Pearl",
  title = 	"Probabilistic Reasoning in Intelligent Systems:
Networks of Plausible Inference",
  publisher = 	"Morgan Kaufmann",
  year = 	"1988",
  address = 	"San Mateo, CA",
  rem  = 	"pearl.88b"
}

-- 
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+  Geunbae Lee, Artificial Intelligence Lab, Computer Science Dept, UCLA.   +
+  INTERNET:gblee@cs.ucla.edu, PHONE:213-825-5199 (office)                  +
+  Sir, AI is the science that makes machines smart, but people dumb!!!     +

rpg@cs.tulane.edu (Robert Goldman) (01/31/91)

In re Pearl's work on Bayesian networks, Spiegelhalter and Lauritzen have
done some work on training these networks in a Bayesian way.  The
paper I have is titled "Sequential Updating of Conditional
Probabilities on Directed Graphical Structures," in Networks, 1990.

Also, Geoff Hinton's Boltzmann machine is, if I remember correctly,
a trainable Markov Random Field.

Hidden Markov Models are also a class of trainable probability
models.

I think maybe we need more information about the kinds of
probabilities you want to learn.  As far as I can tell, almost all
of Bayesian estimation falls under the rubric of this question!

R

almond@lisbon.stat.washington.edu (Russell Almond) (02/01/91)

I would also be interested in information on learning probabilities
and I am willing to compile and post a reference list.

Let me recap the problem as I understand it.  

Sehyeong Cho asks about learning a conditional probability $P(A|B)$.
The standard Bayesian model calls $P(A|B)$ some parameter $\theta$.
{\it A priori\/}, that is, before any data are available, a probability
distribution is used to express our state of knowledge (or ignorance)
about the parameter $\theta$.  This is called a "prior distribution."
There is some disagreement in the statistical community about what the
correct prior distribution for representing ignorance in this case is.
Bayes and Laplace advocated using a uniform distribution, which is
also a beta distribution with parameters (1,1).  Jeffreys advocated
using a beta(1/2,1/2).  Others have advocated a beta(0,0), which is
really not a probability distribution, but is the limit of a series of
probability distributions.  There are other possibilities which are
not beta distributions, but they lead to greater complexity.

Assume for the sake of simplicity that we have chosen as our prior
distribution a beta distribution with (hyper)parameters
$\alpha,\beta$.  We then observe $n$ cases in which $B$ occurs and
find that in $x$ of them $A$ occurs as well.  Note that we must make
an additional assumption here that our observation is unbiased; that
is, we have no reason to believe that if $(A,B)$ occurs we are any
more likely to have it brought to our attention than if $(\neg A,B)$
occurs.  This might not be the case in Sehyeong's original example if,
for example, newspapers were more likely to omit reporting on SCUD
launches in which no deaths occur.  Assuming the observations are
unbiased, we are led to believe that {\it a posteriori\/} our
knowledge about the parameter has a beta distribution with
(hyper)parameters $\alpha+x, \beta+n-x$.
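As a concrete sketch of this update, using the numbers from Sehyeong's
SCUD example (the prior hyperparameters below are my own illustrative
choices, not prescribed by any of the cited authors):

```python
def beta_posterior(alpha, beta, n, x):
    """Posterior hyperparameters after observing x occurrences of A
    among n occurrences of B, starting from a Beta(alpha, beta) prior."""
    return alpha + x, beta + n - x

def beta_mean(alpha, beta):
    """Point estimate of P(A|B): the mean of a Beta(alpha, beta)."""
    return alpha / (alpha + beta)

n, x = 5, 1   # 5 SCUDs fired, 1 caused death

# Uniform (Bayes/Laplace) prior, Beta(1, 1):
a, b = beta_posterior(1.0, 1.0, n, x)
print(beta_mean(a, b))        # 2/7, about 0.286

# A strong prior belief P = 0.01, encoded as Beta(1, 99)
# (prior mean 0.01, "equivalent sample size" 100):
a, b = beta_posterior(1.0, 99.0, n, x)
print(beta_mean(a, b))        # 2/105, about 0.019
```

The posterior mean always lands between the prior mean and the observed
frequency $x/n$; how close it sits to $x/n$ is governed by
$\alpha+\beta$, the weight given to the prior.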

There is a slightly more complex belief function formulation of this
problem (based loosely on Fisher's fiducial arguments) which yields,
instead of an exact probability distribution for $\theta$, upper and
lower bounds for $\theta$ in the form of a "bivariate beta" belief
function.  This is developed in Dempster [1966] and recapped in
Almond [1989, 1991].  The bounds capture the posterior distributions
corresponding to all three of the prior distributions cited above.
Using these bounds has the advantage of simpler assumptions but the
disadvantages of greater computational complexity and weaker
decision-making power.

Robert (Goldman) brings up the next logical question, which is the one
that I am currently working on.  Suppose we have built a probabilistic
graphical model of the kind developed in Pearl [1988] or Lauritzen and
Spiegelhalter [1988].  We jointly elicit the probabilities
$\theta_1=P(A|B)$ and $\theta_2=P(A|\neg B)$, but there is uncertainty
about those values.  By the Bayesian paradigm, we should express that
uncertainty by a (joint) probability distribution over the two
parameters.

This is the upshot of the 1990 Lauritzen and Spiegelhalter paper which
Robert cites.  There are some non-trivial technical problems here
of which L&S only scratch the surface.  For example, L&S build a
number of models which assume the independence of $\theta_1$ and
$\theta_2$.  This assumption is particularly suspect, even if only
made for the sake of convenience.  In many graphical models, we may
know with some certainty that $P(A|B) > P(A|\neg B)$, or vice versa.
L&S also note that when they observe incomplete data (that is, observe
$A$ but not $B$), $\theta_1$ and $\theta_2$ will be dependent {\it a
posteriori\/}, even if they are independent {\it a priori\/}.
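That last point can be checked numerically with a toy setup of my own
(not taken from the L&S paper): assume $P(B)$ is known, put independent
uniform priors on $\theta_1$ and $\theta_2$, and condition on a single
incomplete case in which $A$ is observed but $B$ is missing.

```python
import numpy as np

# Grid over theta1 = P(A|B) and theta2 = P(A|~B).
t = np.linspace(0.005, 0.995, 199)
T1, T2 = np.meshgrid(t, t, indexing="ij")

p_b = 0.5                  # P(B), assumed known in this toy model
prior = np.ones_like(T1)   # independent uniform priors on theta1, theta2

# One incomplete case: A observed, B missing.  Its likelihood is the
# mixture P(A) = P(B)*theta1 + P(~B)*theta2, which couples the two.
post = prior * (p_b * T1 + (1.0 - p_b) * T2)
post /= post.sum()

# Posterior moments on the grid.
m1 = (post * T1).sum()
m2 = (post * T2).sum()
cov = (post * (T1 - m1) * (T2 - m2)).sum()
print(cov)   # negative: a posteriori dependent despite independent priors
```

With these numbers the covariance works out analytically to $-1/144$:
learning that $A$ occurred can be "explained" by either parameter being
high, so raising one lowers the posterior weight on the other.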

David Madigan, Jeremy York and I have been noodling around with some
alternative models, but we have not yet written anything up.  I would
be eager to talk with anybody else who is working on the problem.

	--Russell Almond
	University of Washington,
	Department of Statistics, GN--22
	Seattle, WA  98195
	(206) 543-4302
	almond@stat.washington.edu


kgo@iesd.auc.dk (Kristian G. Olesen) (02/04/91)

A short version of Spiegelhalter and Lauritzen's work on learning
probabilities can be found in:

Spiegelhalter and Lauritzen: Techniques for Bayesian Analysis in Expert
Systems. Annals of Mathematics and Artificial Intelligence, 2 (1990)
353-366.

As pointed out by Almond, there are some problems involved with,
e.g., incomplete data.

At Aalborg University we're currently working on a prototype implementation
of the scheme, where uncertainty about the conditional probabilities is
modelled by Dirichlet distributions. A series of experiments with this
prototype aims at clarifying the strengths and limits of the method. We're
dealing with learning as well as adaptation of conditional probabilities as
cases become known. Currently four factors are being investigated:

1. Significance of the precision of original conditional probabilities.
2. Observational schemes (incomplete data).
3. Different types of prior distributions.
4. Learning schemes (learning, adaptation).
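For a variable with more than two states, the beta prior generalizes to
a Dirichlet and the counting update stays the same. A minimal sketch
(the pseudo-counts below are invented for illustration; they are not
taken from the Aalborg prototype):

```python
# Prior pseudo-counts for one column of a conditional probability table
# (one parent configuration, a child variable with three states).
alpha = [2.0, 5.0, 3.0]   # hypothetical Dirichlet hyperparameters
counts = [4, 1, 0]        # observed complete cases for this configuration

# Dirichlet posterior: add the observed counts to the pseudo-counts.
posterior = [a + c for a, c in zip(alpha, counts)]

# Point estimate of the conditional probabilities: the posterior mean.
total = sum(posterior)
estimate = [a / total for a in posterior]
print(posterior)   # [6.0, 6.0, 3.0]
print(estimate)    # [0.4, 0.4, 0.2]
```

Sequential adaptation reuses the same update as each case arrives, since
the posterior after one case serves as the prior for the next.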

The results so far seem promising, but it's still too early to conclude. I
do, however, feel confident that practically applicable methods will turn
up. If successful, the method will be integrated in the HUGIN shell.

Kristian G. Olesen
Aalborg University
Institute of Electronic Systems
Frederik Bajers Vej 7 D
9220 Aalborg Ost
Denmark
Phone +45 98 15 85 22, 4960
E-mail: kgo@iesd.auc.dk

kyoden@arpeggio.rl1.isl.mei.co.jp (Tatsuro Kyoden) (02/07/91)


gowj@novavax.UUCP (James Gow) (02/10/91)

I am interested in some references from the MIT Media Lab. Can anyone
suggest how to get access to this facility and these references?

Harel, I. (1990) (Ed.) Constructionist Learning: A 5th Anniversary
Collection of Papers.
 
Papert, S. (1990). A Unified Computer Environment For Schools: A
Constructionist/Cultural Approach.

Resnick, M. (1990). Logo: A Computational Environment for Exploring
Self-Organizing Behavior. Ph.D. Thesis Proposal.


I have been told these items are not available through inter-library loan
or other normal channels.

James