barryn@world.std.com (barry n nelson) (05/30/91)
I am posting this for a friend on CompuServe; please reply to Steven Hokanson's email address on CompuServe, given at the end of the article.

I have been working on multivariate analysis of nominal data. In keeping with my normal apprehension that someone is always gaining ground on me, I have been following the comp.ai.neural-nets dialogue for several months now. The recent spate of comments on the value of cluster analysis caught my eye. Perhaps a problem that I have come across has been visited by others: there might be more than one answer! I will give two hypothetical views of how this might occur. The first is in matrix form, the second is pictorial.

Given a matrix with 72 variables forming the columns and 1,000 observations forming the rows, the task is to predict the values in the 72nd column. In analyzing this data set, the most wondrous methodology ever (i.e., the most recently discovered) is used, and surprisingly three different answers are found. Depending on the parameter starting values, or on the data sequence, or perhaps simply on the multiple runs made for validation with the "leave-one-out" method, several equally stable solutions emerge.

The first 10 variables are quite good predictors all by themselves, with say 90% accuracy. The variables in columns 26 through 50 are almost as good (say 89%). The "best," however, are variables 60 through 71 (91%). Each of these sets of variables (label them A, B, and C respectively) produces erroneous predictions on different observations; let us say that the overall overlap in their errors is only 5%. Now, mathematically, set C is the most accurate predictor. However, I feel as if there should be three different clusterings given as answers. (A toy version of this setup is sketched in the first code fragment below.)

One reason for giving the three answers is that the user (the person making a prediction) might not always have all 71 data values known. The three answers would let the user apply the clustering for which he has the best data. Of course this leads to the conclusion that the solution given should be a function of the known variables, so there should be as many answers as there are nonempty subsets of the 71 predictors: 2^71 - 1 of them. [This thought drifts over into another concern of mine: is there one "absolute" clustering of the data, where the ability of all variables to predict the values of all the other variables is maximized, or are clusters purely utilitarian, the value of a cluster solution depending solely on how it is to be used?]

I now present a similar problem using geometry. Take a simple square as the enclosure for the data points, and let there be three divisions of the area that produce similar levels of accuracy in predicting the "category" to which a data point belongs. The first (1) is a division made by drawing a line from the upper-left vertex to the lower-right vertex; this bisection of the square has an accuracy of 90%. The second (2) division is made by drawing two lines: one from the upper-right vertex to the midpoint of the bottom side of the square, and a second from the same vertex to the midpoint of the left side (89%). The third (3) division is made by two lines: connect the left midpoint to the right midpoint, and connect the top midpoint to the bottom midpoint (say 91%, by analogy with set C). (The second code fragment below sketches these three divisions.)

Again there are three conflicting answers, and none is a subset of another. Depending on whether the clustering algorithm is biased towards many or few groups, the answers are likely to differ. If the methodology is very good, answer C, or 3, should be found.
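Here is the first fragment: a minimal sketch in Python (my own stand-in, not the methodology above). The data is purely synthetic random noise, so the accuracies will sit near chance rather than the 90% figures, and the 1-nearest-neighbour predictor and the exact column ranges for A, B, and C are likewise illustrative assumptions. The point is only the mechanics: score rival variable subsets under leave-one-out and measure how much their errors overlap.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 71))      # the 71 predictor columns
y = rng.integers(0, 2, size=1000)    # column 72, the value to predict

# Hypothetical rival subsets (0-indexed column ranges).
SUBSETS = {"A": range(0, 10), "B": range(25, 50), "C": range(59, 71)}

def loo_errors(cols):
    """Observations misclassified by 1-NN under leave-one-out."""
    Z = X[:, list(cols)]
    errors = set()
    for i in range(len(Z)):
        d = np.linalg.norm(Z - Z[i], axis=1)
        d[i] = np.inf                # leave observation i out
        if y[d.argmin()] != y[i]:
            errors.add(i)
    return errors

errs = {name: loo_errors(cols) for name, cols in SUBSETS.items()}
for name, e in errs.items():
    print(f"subset {name}: LOO accuracy {1 - len(e)/len(X):.1%}")
shared = errs["A"] & errs["B"] & errs["C"]
print(f"errors common to all three subsets: {len(shared)/len(X):.1%}")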
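And the second fragment: the three divisions of the unit square written out as classifiers. The line equations follow directly from the vertices and midpoints named above; the region labels are my own.

# Three rival partitions of the unit square [0,1] x [0,1].

def division_1(x, y):
    """Diagonal from upper-left (0,1) to lower-right (1,0): the line x + y = 1."""
    return "above" if x + y > 1 else "below"

def division_2(x, y):
    """Two lines from the upper-right vertex (1,1): to the bottom midpoint
    (0.5,0), i.e. y = 2x - 1, and to the left midpoint (0,0.5),
    i.e. y = 0.5x + 0.5.  Together they cut the square into three regions."""
    if y < 2*x - 1:
        return "lower-right"
    if y > 0.5*x + 0.5:
        return "upper-left"
    return "middle"

def division_3(x, y):
    """Midpoint-to-midpoint cross: four quadrants."""
    return ("N" if y >= 0.5 else "S") + ("E" if x >= 0.5 else "W")

for x, y in [(0.25, 0.25), (0.9, 0.9), (0.6, 0.2)]:
    print((x, y), division_1(x, y), division_2(x, y), division_3(x, y))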
Still, are we really doing our jobs as analysts if we neglect to mention the other, very good, but different answers? This is one reason I find neural nets scary. Can we know what "almost-as-good" answers have been skipped over in the quest for ever more marginal improvements in performance? Isn't there some value in the solutions discovered (and passed by) that are almost as good? I keep coming back to the thought that reducing a data set to a single predictive clustering is a good and noble thing when one clustering stands out as 100% accurate (or vastly superior to all the others). But when many fairly good clusterings exist, don't they each define significant aspects of the aggregate of the data set's observations?

These thoughts have rattled around in my head long enough. I hope someone can contribute to my confusion so I can give up on this line of reasoning and get on with my real work.

From: Steven Hokanson          | Tyger! Tyger! burning bright
312 Fitzwater Street           | In the forests of the night
Philadelphia PA 19147          | What immortal hand or eye
215-928-1619                   | Could frame thy fearful symmetry?
73700.1212@CompuServe.COM      |             -- William Blake

--
Barry N. Nelson
barryn@world.std.com or {xylogics,uunet}!world!barryn