[comp.theory] Least-squares clustering.

markh@csd4.csd.uwm.edu (Mark William Hopkins) (04/14/91)

   I have a particular algorithm in mind.  It takes a set, S, of N-vectors
and partitions it into sets S1, S2 such that

		    sigma(S1) + sigma(S2) is minimized
                    where
                       sigma(S) = total square deviation of S
			  (that is, sum x in S: |x - average(S)|^2)

It runs sequentially through S and calculates S1 and S2 in one pass.
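
A minimal sketch of the quantity being minimized, in Python with NumPy
(the function names here are illustrative, not from the post):

    import numpy as np

    def sigma(S):
        # Total square deviation of a set of N-vectors (rows of S):
        # sum over x in S of |x - average(S)|^2.
        S = np.asarray(S, dtype=float)
        return np.sum((S - S.mean(axis=0)) ** 2)

    def cost(S1, S2):
        # Objective to minimize over all two-way partitions of S.
        return sigma(S1) + sigma(S2)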

markh@csd4.csd.uwm.edu (Mark William Hopkins) (04/14/91)

In article <11014@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:
>   I have a particular algorithm in mind.  It takes a set, S, of N-vectors
>and partitions it into sets S1, S2 such that
>
>		    sigma(S1) + sigma(S2) is minimized
...

I forgot to finish the previous posting.  The question is: does anyone know
of such an algorithm?

wido@isgtec.uucp (Wido Menhardt) (04/15/91)

In article <11014@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:
> 
>    I have a particular algorithm in mind.  It takes a set, S, of N-vectors
> and partitions it into sets S1, S2 such that
> 
> 		    sigma(S1) + sigma(S2) is minimized
>                     where
>                        sigma(S) = total square deviation of S
> 			  (that is, sum x in S: |x - average(S)|^2)
> 
> It runs sequentially through S and calculates S1 and S2 in one pass.

The algorithm that solves this problem suboptimally and iteratively
is known as K-MEANS. It is very popular, and it works as follows:

1) assign each vector of S randomly (or systematically) to either S1 or S2
2) compute the mean vectors for S1 and S2
3) re-assign each vector of S to the class (S1,S2) whose mean vector is 
   closer (this takes care of the square deviation criterion)
4) go to 2). 

Stop the iteration when nothing changes anymore. The result depends on the
initial assignment in step 1.
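
A rough sketch of this two-class procedure in Python with NumPy (the
function name two_means and the initialization details are assumptions
of this sketch, not part of the post):

    import numpy as np

    def two_means(S, rng=None):
        S = np.asarray(S, dtype=float)
        rng = np.random.default_rng() if rng is None else rng
        # 1) assign each vector randomly to class 0 (S1) or class 1 (S2)
        labels = rng.integers(0, 2, size=len(S))
        while True:
            # 2) compute the mean vectors for S1 and S2
            #    (an empty class would need special handling; omitted here)
            means = np.array([S[labels == k].mean(axis=0) for k in (0, 1)])
            # 3) re-assign each vector to the class whose mean is closer
            d = ((S[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
            new_labels = d.argmin(axis=1)
            # 4) repeat until nothing changes anymore
            if np.array_equal(new_labels, labels):
                return S[labels == 0], S[labels == 1]
            labels = new_labels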

I would be interested to hear about a one-pass algorithm: after all, you 
do not know average(S1) in advance ....
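
As an aside (this illustration is not from the posts): the average of a
single set can be maintained incrementally in one pass, so the hard part
is not computing average(S1) but deciding the S1/S2 assignments before
the final means are known.

    import numpy as np

    def running_mean(stream):
        # One-pass update: mean_n = mean_{n-1} + (x - mean_{n-1}) / n.
        # This maintains an average online; it does not by itself choose
        # the assignment of vectors to S1 and S2.
        mean, n = None, 0
        for x in stream:
            n += 1
            x = np.asarray(x, dtype=float)
            mean = x.copy() if mean is None else mean + (x - mean) / n
        return mean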

-wido.