# Technique in data mining

### Abstract

Clustering is a technique in data mining to find interesting pattern in given data sets that means a large data set is grouped into clusters of smaller sets of similar data. K-Means is one of the most popular clustering algorithms. Initial centroids are required as starting point values as input parameters when using k-means clustering algorithm. There are two approaches to choose initial centroids one is Synthetic method starting point having total ten (10) different methods of initial starting points and the other one is based on Actual sample starting point having four (4) different methods. This paper envisages only on initial centroids based on Actual sample starting point of k-means algorithm using intelligent agents. Intelligent agents are now frequently used in distributed networked computing because they reduces network traffic, overcome network latency, operate in heterogeneous environment, autonomous and fault-tolerant behavior. A multiagent system (MAS) is proposed in this research paper for the generation of initial centroids using actual sample datapoints. This multiagent system comprises four agents of k-means clustering algorithm using different initial centroids starting points.

### Key words

Cluster, Clustering, Initial centroid, MAS, Actual method, Synthetic method

### Introduction:

'K' in this algorithm is for number of clusters as an input and 'Means' stands for an average, location of all the members of a particular cluster. It is used for finding the similar patterns due to its simplicity and fast execution. This algorithm uses a square-error criterion for re-assignment of any sample from one cluster to another, which will cause a decrease of the total squared error. It is easy to implement, and its time and space complexity is relatively small.

There are different techniques available to use the initial points (it is required as an input). The following are the different methods to calculate the initial points:

- Corner: In this method all the values in the data sets are scaled to be in [-1, 1], the set of all the clusters close to the vertices (-1,...,-1) is considered. This is usually a 'bad' set of initial points since it lies on the boundary of the data, and can be considered an outlier.
- Bins: This method consists in dividing the space in bin and then takes random points inside each bin, this assures that the set of initial points are distributed covering the entire dataset.
- Centroid: This method consists of choosing all the starting clusters close to the mass centroid of the dataset. Each cluster center is calculated by adding a small random perturbation to the centroid of the dataset.
- Spread: The cluster centers are distributed randomly and trying to cover the entire space. This method is similar to bins.
- PCA: The data points are projected in the space of the principal component, a clustering procedure is applied to this one-dimensional set. The cluster centers are calculated depending on the obtained clusters in the one-dimensional space.

Among these above mentioned methods, in order to calculate the initial starting points, we will opt for the 'centroid' method because it has been advocated throughout the literature that the centroid method produces the best results. A Centroid represents an average location of the particular cluster.

The 'centroid' method is further divided into two different approaches to choose initial starting points one is Synthetic method starting point having total ten (10) different methods of initial starting points and the other one is based on Actual sample starting point having four (4) different methods. In this paper only Actual sample starting point of k-means algorithm using intelligent agents are discussed.

A multiagent system (MAS) is proposed in this research paper for the generation of initial centroids using actual sample datapoints. This multiagent system comprises four agents of k-means clustering algorithm using different initial centroids starting points. The figure -1 below is the architecture of this proposed MAS.

The rest of the paper is organized as follows:

In section 2 we present an overview of K-means clustering algorithm. Section 3 is about the four different formulae for starting point based on actual sample data. In section 4 we draw a comparison between these formulae. Section 5 is about the the results and discussion and finaly section 6 is about the conclusion.

### An overview of K-means Algorithm:

The K-means algorithm is:

- Select the value of 'k', the number of clusters.
- Calculate the initial centroids from the actual sample of dataset using the proposed formula. Divide data points into 'k' clusters.
- Move datapoints into clusters using the distance formula. Recalculate new centroids. These centroids are calculated on the basis of average or means.
- Repeat step 3 until no datapoint is to move.

The figure - 2 below illustrates these steps of K-means in the shape of a flow chart.

Enter the number of clusters and number of iterations, which are the required and basic inputs of the K-Means algorithm, then compute the initial centroids by providing the formulae discussed in section 3. Calculate the distance by using Euclidean's distance formula in equations 1. On the basis of these distances, generate the partition by assigning each sample to the closet cluster. Compute new cluster centers as centroids of the clusters, again compute the distances and generate the partition. Repeat this until the cluster memberships stabilizes. This algorithm uses a square-error criterion in equation 2 for re-assignment of any sample from one cluster to another, which will cause a decrease of the total squared error.

### Euclidean Distance Formula:

The following are major issues in k-means clustering algorithm:

- The algorithm is only applicable to data sets where the notion of the mean is defined. Thus, it is difficult to apply to categorical data sets. There is, however, a variation of the k-means algorithm called k-modes, which clusters categorical data. The algorithm uses the mode instead of the mean as the centroid.
- The user needs to specify the number of clusters k in advance. In practice, several k values are tried and the one that gives the most desirable result is selected. We will discuss the evaluation of clusters later.
- The algorithm is sensitive to outliers. Outliers are data points that are very far away from other data points. Outliers could be errors in the data recording or some special data points with very different values.
- The algorithm is sensitive to initial seeds, which are the initially selected centroids. Different initial seeds may result in different clusters. Thus, if the sum of squared error is used as the stopping criterion, the algorithm only achieves local optimal. The global optimal is computationally infeasible for large data sets.
- The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids or hyper-spheres. [1]

The followings are areas where K-Means Clustering Algorithm can be applied:

- Marketing: Finding groups of customers with similar behaviour given a large database of customer containing their properties and past records. [2]
- Biology: Classification of plants and animals given their features. [2]
- Libraries: Book ordering. [2]
- Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds. [2]
- City-planning: Identifying groups of houses according to their house type, value and geographically location. [2]
- Earthquake studies: Clustering observed earthquake epicenters to identify dangerous zones. [2]
- WWW: Document classification; clustering web log data to discover groups of similar access patterns. [2]
- Medical Sciences: Classification of medicines; patient records according to their doses etc. [2][3]
- Liu, Bing. Web Data Mining Exploring Hyperlinks, Contents, and Usage Data. ISBN-13 978-3-540-37881-5, Springer Berlin Heidelberg New York
- web site http://www.crystalclearsoftware.com/cgi-bin/boost_wiki/wiki.pl?action=browse&id=STLAlgorithmExtensions/ClusteringAlgorithms&revision=3
- Skrypnik, Irina., Terziyan, Vagan., Puuronen, Seppo., and Tsymbal, Alexey. Learning Feature Selection for Medical Databases. CBMS 1999.
- Kantardzic, Mehmed. Data Mining: Concepts, Models, Methods, and Algorithms. ISBN: 471228524 John Wiley & Sons., 2003.
- MacQueen, J.B. "Some Methods for classification and Analysis of Multivariate Observations, Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability", Berkeley, University of California Press, 1:281-297
- Davidson, Ian. Understanding K-Means Non-hierarchical Clustering, SUNY Albany - Technical Report 2002.
- Mirkin, Boris. Towards Comprehensive Clustering of Mixed Scale Data with K-Means. School of Computer Science and Information Systems, Birkbeck College, University of London, Mallet Street, London, WC1E 7HX., 1998.
- web site http://dms.irb.hr/tutorial/tut_clustering_short.php
- Robinson. F, Apon. A., Brewer. D., Dowdy. L., Doug Hoffman. D., Lu. Baochuan., "Initial Starting Point Analysis for K-Means Clustering: A Case Study"

In this research paper will apply this k-means clustering algorithm on two medical datasets Breast and Diabetes.

### Initial centroids methods:

In this section we will discuss the generation of initial centroids based on actual sample data points.

### Initial centroids (proposed formula):

The formula for initial centroids is given below:

The max X - min X will provide the range of 'X' attribute, similarly max Y - min Y will give the range of 'Y' attribute. If both the attributes have zero value then this formula will not work. The value of 'k' must be 1 if 'k' is zero then again it will give an error, the division by zero. Here 'n' in this formula is the number of iterations, the value of 'n' should be small otherwise the complexity time and space both will be very high and the value of initial centroids will also become very high and may be out of the range in the given dataset. This is another drawback of this formula.

Data standardization is the most important step for clustering. The Euclidean distance method requires the standardization of attributes so that all attributes can have equal impact on the distance computation. This is to avoid obtaining clusters that are dominated by attributes with the largest amounts of variation. The proposed formula is also providing the data standardization. The division of range by the number of clusters will give average or means of the attributes. The multiplication with the number of iteration gives new central point for the calculation of distance. The name of this formula is Meansroid. In this way, we can calculate the initial centroids; this will be the starting point of this algorithm by using the square-errors.