Statistical considerations on the k-means algorithm

Dana Mihai, Mihai Mocanu


From a data mining point of view, cluster analysis is an important method for knowledge discovery in large databases. Clustering has a wide range of applications in the life sciences and has been used in many other areas over the years. The k-means algorithm is one of the simplest and most popular clustering algorithms that keep the data in main memory. It is well known for its efficiency in clustering large data sets, and it converges to acceptable results in a variety of application areas. With a large number of variables, k-means may also be computationally faster than other clustering algorithms. This work describes two algorithmic procedures involved in implementing the k-means algorithm and presents a statistical study on a known data set. A detailed analysis of the statistics obtained from the experiment can be used to improve future implementations of the k-means algorithm and to optimize model-based data mining algorithms.
