barryn@world.std.com (barry n nelson) (05/30/91)
I am posting this for a friend on CompuServe; please reply to Steven Hokanson's email address on CompuServe, given at the end of the article.

I have been working on multivariate analysis of nominal data. In keeping with my normal apprehension that someone is always gaining ground on me, I have been following the comp.ai.neural-nets dialogue for several months now. The recent spate of comments on the value of cluster analysis caught my eye. Perhaps a problem that I have come across has been visited by others: there might be more than one answer! I will give two hypothetical views of how this might occur. The first is in matrix form, the second is pictorial.

Given a matrix with 72 variables forming the columns and 1,000 observations forming the rows, the task is to predict the values in the 72nd column. In analyzing this data set, the most wondrous methodology ever (i.e., the most recently discovered) is used, and surprisingly three different answers are found. Depending on the parameter starting values, or on the data sequence, or perhaps simply on the multiple runs made for validation with the "leave-one-out" method, several equally stable solutions emerge.

The first 10 variables are quite good predictors all by themselves, with say 90% accuracy. The variables in columns 26 through 50 are almost as good (say 89%). The "best," however, are variables 60 through 71 (91%). Each of these sets of variables (label them A, B, and C respectively) produces erroneous predictions on different observations; let us say that the overall overlap in their errors is only 5%. Now, mathematically, set C is the most accurate predictor. However, I feel as if there should be three different clusterings given as answers. (A toy version of this setup is sketched in the first code fragment below.)

One reason for giving the three answers is that the user (the person making a prediction) might not always have all 71 data values known. The three answers would let the user apply the clustering for which he has the best data. Of course this leads to the conclusion that the solution given should be a function of the known variables, so there should be as many answers as there are nonempty subsets of the 71 predictors: 2^71 - 1 of them. [This thought drifts over into another concern of mine: is there one "absolute" clustering of the data, where the ability of all variables to predict the values of all the other variables is maximized, or are clusters purely utilitarian, the value of a cluster solution depending solely on how it is to be used?]

I now present a similar problem using geometry. Take a simple square as the enclosure for the data points, and let there be three divisions of the area that produce similar levels of accuracy in predicting the "category" to which a data point belongs. The first (1) is a division made by drawing a line from the upper-left vertex to the lower-right vertex; this bisection of the square has an accuracy of 90%. The second (2) division is made by drawing two lines: one from the upper-right vertex to the midpoint of the bottom side of the square, and a second from the same vertex to the midpoint of the left side (89%). The third (3) division is made by two lines: connect the left midpoint to the right midpoint, and connect the top midpoint to the bottom midpoint (say 91%, by analogy with set C). (The second code fragment below sketches these three divisions.)

Again there are three conflicting answers, and none is a subset of another. Depending on whether the clustering algorithm is biased towards many or few groups, the answers are likely to differ. If the methodology is very good, answer C, or 3, should be found.
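Here is the first fragment: a minimal sketch in Python (my own stand-in, not the methodology above). The data is purely synthetic random noise, so the accuracies will sit near chance rather than the 90% figures, and the 1-nearest-neighbour predictor and the exact column ranges for A, B, and C are likewise illustrative assumptions. The point is only the mechanics: score rival variable subsets under leave-one-out and measure how much their errors overlap.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 71))      # the 71 predictor columns
y = rng.integers(0, 2, size=1000)    # column 72, the value to predict

# Hypothetical rival subsets (0-indexed column ranges).
SUBSETS = {"A": range(0, 10), "B": range(25, 50), "C": range(59, 71)}

def loo_errors(cols):
    """Observations misclassified by 1-NN under leave-one-out."""
    Z = X[:, list(cols)]
    errors = set()
    for i in range(len(Z)):
        d = np.linalg.norm(Z - Z[i], axis=1)
        d[i] = np.inf                # leave observation i out
        if y[d.argmin()] != y[i]:
            errors.add(i)
    return errors

errs = {name: loo_errors(cols) for name, cols in SUBSETS.items()}
for name, e in errs.items():
    print(f"subset {name}: LOO accuracy {1 - len(e)/len(X):.1%}")
shared = errs["A"] & errs["B"] & errs["C"]
print(f"errors common to all three subsets: {len(shared)/len(X):.1%}")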
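And the second fragment: the three divisions of the unit square written out as classifiers. The line equations follow directly from the vertices and midpoints named above; the region labels are my own.

# Three rival partitions of the unit square [0,1] x [0,1].

def division_1(x, y):
    """Diagonal from upper-left (0,1) to lower-right (1,0): the line x + y = 1."""
    return "above" if x + y > 1 else "below"

def division_2(x, y):
    """Two lines from the upper-right vertex (1,1): to the bottom midpoint
    (0.5,0), i.e. y = 2x - 1, and to the left midpoint (0,0.5),
    i.e. y = 0.5x + 0.5.  Together they cut the square into three regions."""
    if y < 2*x - 1:
        return "lower-right"
    if y > 0.5*x + 0.5:
        return "upper-left"
    return "middle"

def division_3(x, y):
    """Midpoint-to-midpoint cross: four quadrants."""
    return ("N" if y >= 0.5 else "S") + ("E" if x >= 0.5 else "W")

for x, y in [(0.25, 0.25), (0.9, 0.9), (0.6, 0.2)]:
    print((x, y), division_1(x, y), division_2(x, y), division_3(x, y))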
Still, are we really doing our jobs as analysts if we neglect to mention the other, very good, but different answers? This is one reason I find neural nets scary. Can we know what "almost-as-good" answers have been skipped over in the quest for ever more marginal improvements in performance? Isn't there some value in the solutions discovered (and passed by) that are almost as good? I keep coming back to the thought that reducing a data set to a single predictive clustering is a good and noble thing when one clustering stands out as 100% accurate (or vastly superior to all the others). But when many fairly good clusterings exist, don't they each define significant aspects of the aggregate of the data set's observations?

These thoughts have rattled around in my head long enough. I hope someone can contribute to my confusion so I can give up on this line of reasoning and get on with my real work.

From: Steven Hokanson          | Tyger! Tyger! burning bright
312 Fitzwater Street           | In the forests of the night
Philadelphia PA 19147          | What immortal hand or eye
215-928-1619                   | Could frame thy fearful symmetry?
73700.1212@CompuServe.COM      |             -- William Blake

--
Barry N. Nelson
barryn@world.std.com or {xylogics,uunet}!world!barryn