[sci.math.stat] Clustering code sought

demers@beowulf.ucsd.edu (David Demers) (02/27/91)

Forgive me if this is a bit long...

I am looking for two pieces of software:

(1) a k-means program to group a large number of points (all of which
are inside of the unit 3-sphere)
(I don't really need to compute k-means, since the points are relatively uniformly
distributed, I just need a fast way to partition
the unit 3-sphere into k "boxes" of roughly equal volume and near-minimum
surface area), and 

(2) a flexible clustering program.

I have a SPARCstation, SGI Personal Iris & IBM PC(!) available as platforms,
would prefer standard Unix stuff, C or FORTRAN.

LONGER DESCRIPTION OF DATA AND PROBLEM:

My data consists of pairs of 3 vectors, which are the independent and
dependent variables of a smooth and continuous a.e. four-to-one
map from a 3-d manifold onto the unit sphere (the interior,
not just the surface...).  Thus X->Y for X & Y both 3-vectors.

The distribution of Y is nearly uniform in the sphere, and I want to
cut it up into smaller bits, say by k-means, then 
cluster the associated vectors X into 4 clusters so that I can 
partition the entire set of data (25,000 or so sample points) 
into 4 separate sets, so that the inverse map for each
of the 4 branches can be approximated. 

The data X for a region in Y should look like ellipsoids; or at 
least I know the convex hulls of the 4 clusters will not intersect. 

I do have a C program to do hierarchical clustering, originally  by
Yoshiro Miyata, I believe, and last modified by Andreas Stolcke, 
which I may be able to tweak.

I'd like to be able to use different clustering criteria on the data.
I do suspect single-linkage will be adequate, which is why
the hierarchical program may be satisfactory if modifiable, but 
flexibility in choosing criteria would be nice.  

I have looked at Fukunaga and Duda & Hart. They are somewhat helpful 
in regards to theory (though there seems to be little theory in
clustering...), but not for algorithms.
I have begun to code this up, and can probably do it all myself
in a week or so, but it is always nice to avoid reinventing the wheel,
to reinvent a phrase.  It's more fun to spend time analyzing the 
data and thinking about it than debugging C...

If anyone can point me to good source code, references or even algorithms,
I'd greatly appreciate it.

-- 
Dave DeMers					demers@cs.ucsd.edu
Computer Science & Engineering	C-014		demers%cs@ucsd.bitnet
UC San Diego					...!ucsd!cs!demers
La Jolla, CA 92093-0114	  (619) 534-8187,-0688  ddemers@UCSD