Data mining is the process of analysing available data from various perspectives and summarising it into useful information. This information can help to cut costs, increase revenue, or both. Data mining is also used to recognise patterns in the data and to establish the relationships between them. Nowadays data mining is used in many organisations and is considered one of the important tools for data analysis.
Data mining is considered a relatively new concept. Nowadays many organisations use data mining techniques to study and analyse market research reports. Its techniques are also applied in other fields, including mathematics, cybernetics and genetics. In Customer Relationship Management (CRM), web mining, which is one type of data mining, is widely used. As the internet is now used in almost every field, the importance of web mining is increasing rapidly. A large amount of data is collected from a website by a specific program called a data miner. The collected data is then analysed and related to other data to find patterns in user behaviour.
Developments in fields such as data transmission, data storage and data capture have required organisations to develop data warehouses. Like data mining, data warehousing is a new technique, but the concept behind it is not new. Data warehousing is generally used to store historical data. It makes it possible to store data in a centralised manner, which makes the data easier to analyse and improves user access.
Data can be any text, numbers or facts that a computer can process. Nowadays every organisation is trying to expand, and so organisations collect as much data as possible. As a result, they need to store data in different databases and sometimes in different formats.
Information is defined as the relationships among the data available in the database.
The information gathered from the data can be converted into knowledge. This knowledge tells us about the past and also helps in making decisions for the future.
Data Mining consists of various parameters. These parameters are:
Classification is one of the data mining techniques. It places data into predetermined groups, and new and different patterns can be found with its help.
Association is the process of connecting one event with another.
Clustering is the process of grouping the available data according to logical relationships.
Path or Sequence analysis:
Path or sequence analysis looks for patterns in which an earlier event leads to a later one.
Forecasting is the prediction of future values from patterns found in the data.
In data mining there are different levels of analysis. These levels are given below:
- Artificial Neural Networks.
- Genetic Algorithms.
- Decision Trees.
- Nearest Neighbour Method.
- Rule Induction.
- Data Visualisation.
A cluster is a collection of data objects that are similar to one another and are treated collectively as a single group. Thus, the objects within a cluster are similar to each other and dissimilar to the objects in other clusters.
Clustering and its use in Data Mining:
Clustering is the process by which clusters are formed. Clusters are formed by dividing a set of data into a set of logical sub-classes.
Clustering is one of the most important steps in data mining and is frequently used in the data mining process. Its main purpose is to find groups of data that contain similar records, because these groups form the basis for further study of the relationships among the available records.
Nowadays various clustering methods are available for grouping datasets, and these methods use different strategies. The selection of a particular clustering method depends on the output required by the user.
In the data mining process various types of clustering methods are used. These are given below:
- Hierarchical method.
- Partitioning method.
- Grid-Based method.
In the non-hierarchical method, a dataset of 'X' objects is divided into 'Y' clusters. Depending on the variant, the clusters may or may not be allowed to overlap. This method is also classified under "Partitioning Methods". The resulting classes may be mutually exclusive.
When this method is used, the result is 'Y' clusters, and every object belongs to a single cluster. Each cluster has a cluster representative, which summarises the information about the objects in that cluster.
The dataset can be partitioned using the single pass method. The procedure is given below:
- First, a cluster representative, or centroid, is created from the first object, which becomes the first cluster.
- Using the centroids of the existing clusters, the similarity 'S' is calculated for the next object by means of some similarity coefficient.
- The highest value of 'S' is compared with a specified threshold value. If the highest value of 'S' is greater than the threshold, the object is added to the corresponding cluster and that cluster's centroid is recalculated; otherwise, the object is used to start a new cluster and becomes its centroid. Return to step 2 if any object remains unclustered.
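The single pass procedure above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function names, the choice of cosine similarity as the similarity coefficient, and the use of the mean of the members as the centroid are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Similarity coefficient used in this sketch (an assumed choice)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def single_pass_cluster(objects, threshold):
    """Cluster the objects in one pass; returns a list of clusters."""
    clusters = []   # members of each cluster
    centroids = []  # one cluster representative per cluster
    for obj in objects:
        # Step 2: similarity S between the object and every existing centroid
        sims = [cosine_similarity(obj, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else None
        # Step 3: compare the highest S with the threshold
        if best is not None and sims[best] > threshold:
            clusters[best].append(obj)
            members = clusters[best]
            # Recompute the centroid as the mean of the members (assumption)
            centroids[best] = [sum(col) / len(members) for col in zip(*members)]
        else:
            # The object starts a new cluster and is its first centroid
            clusters.append([obj])
            centroids.append(list(obj))
    return clusters
```

Note that the result depends on the order in which the objects are presented and on the threshold chosen, which is a known property of single pass methods.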
In this method, successive clusters are formed by a hierarchical algorithm using the clusters that were established previously. Hierarchical algorithms are of two types: top-down (divisive) and bottom-up (agglomerative).
A "Divisive Algorithm" starts with all the available data in one cluster and proceeds by dividing it into successively smaller clusters, whereas an "Agglomerative Algorithm" starts with each element as a separate cluster and merges them into successively larger clusters.
The clusters can thus be created either by breaking down the hierarchy of clusters (divisive) or by building it up (agglomerative). Traditionally, this hierarchy of clusters is represented by a tree called a "Dendrogram". The cluster that contains all the elements is located at one end (the root), whereas the individual elements are located at the other end (the leaves). A divisive algorithm starts at the root, whereas an agglomerative algorithm begins from the leaves of the tree.
Agglomerative Hierarchical Clustering Method:
The use of the agglomerative hierarchical clustering method is increasing rapidly. The method follows some basic steps, which are given below:
- Find the two objects that are nearest to each other and merge them to form a cluster.
- Next, find the two points that are nearest to each other and merge them. These points may be either clusters of objects or individual objects.
- Return to step 2 if more than one cluster remains.
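The agglomerative steps listed above can be sketched as a short Python function. The names `agglomerative` and `cluster_distance`, the `target_clusters` parameter, and the use of the closest pair of members as the distance between two clusters are assumptions made for illustration; other distance definitions would work in the same loop.

```python
import math

def cluster_distance(a, b):
    """Distance between two clusters: their closest pair of members (assumed)."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, target_clusters=1):
    """Merge the nearest clusters until only target_clusters remain."""
    # Every object starts as its own cluster
    clusters = [[p] for p in points]
    merges = []  # record of the merges, in the order they happened
    while len(clusters) > target_clusters:
        # Find the two clusters that are nearest to each other
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Replace the two clusters with the merged one
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges
```

The recorded `merges` list is the information a dendrogram would display: read from first to last, it goes from the leaves towards the root.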
Grid Based Clustering:
Grid Based Clustering divides the available data space into the cells of a grid. These cells are then merged together to form clusters.
There are various types of Grid Based Clustering. Some of them are given below.
- Wave Cluster
In this approach, every dimension is divided into intervals, and all the dense units are computed.
STING stands for Statistical Information Grid. The approach was proposed by Wang, Yang and Muntz in 1997. It is a multi-resolution, grid based clustering technique. The whole spatial area is divided into rectangular cells, and the necessary statistical information is stored in a hierarchy of cells. Hence clustering can be performed on the cells.
Wave clustering draws on signal processing techniques to handle multidimensional data spaces.
Irregular grid methods, such as the optimal grid and the adaptive grid, are also used.
Grid Based clustering possesses some advantages. These are:
- Grid based clustering has a very fast processing time.
- Its processing time does not depend on the number of objects.
- It depends only on the number of cells, which is why it is fast.
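A minimal Python sketch of the general grid based idea is given below. It is not an implementation of STING or Wave Cluster; the function name `grid_cluster`, the `cell_size` and `min_points` parameters, the restriction to two dimensions, and the rule that only side-by-side dense cells are merged are all assumptions chosen for illustration.

```python
from collections import Counter, deque

def grid_cluster(points, cell_size, min_points):
    """Group 2-D points by merging neighbouring dense grid cells."""
    # One pass over the objects: count how many points fall in each cell
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    # Keep only the cells that hold at least min_points points
    dense = {cell for cell, n in counts.items() if n >= min_points}
    # Merge neighbouring dense cells; this part visits cells, not objects
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), []
        while queue:
            cx, cy = queue.popleft()
            group.append((cx, cy))
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(sorted(group))
    return clusters
```

After the single counting pass, all remaining work is done on cells, which illustrates why the clustering cost is tied to the number of cells rather than to the number of objects.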
SLINK is the single link method. It is a hierarchical method, and at each step it merges the two most similar clusters. As the name suggests, the similarity between two clusters is determined by a single link: the shortest distance between any object in one cluster and any object in the other.
The complete link method is similar to the single link method, and the two share most characteristics. The only difference is that, to determine the inter-cluster similarity, the complete link method uses the least similar pair of objects between two clusters, whereas the single link method uses the most similar pair.
Group Average Method:
The group average method differs from the single link and complete link methods. It depends on the average similarity over all pairs of objects between two clusters, and not on the similarity of a single extreme pair as in the single link and complete link methods.
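The difference between the three methods can be made concrete with a short Python sketch. It works with distances rather than similarities (a small distance meaning a high similarity), and the function names and the use of Euclidean distance are assumptions made for the example.

```python
import math
from itertools import product

def pairwise(a, b):
    """All distances between an object of cluster a and an object of cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def single_link(a, b):
    # Closest (most similar) pair of objects
    return min(pairwise(a, b))

def complete_link(a, b):
    # Farthest (least similar) pair of objects
    return max(pairwise(a, b))

def group_average(a, b):
    # Mean over all pairs of objects
    d = pairwise(a, b)
    return sum(d) / len(d)
```

For the clusters `[(0, 0), (1, 0)]` and `[(3, 0), (5, 0)]`, the pairwise distances are 3, 5, 2 and 4, so the single link distance is 2, the complete link distance is 5 and the group average is 3.5.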
Clusters used in other communities:
Clustering is the process of generating clusters. Clusters are meaningful sets of data obtained by dividing the data into sub-classes. Apart from data mining, clustering is also very useful in many other fields, including marketing, market research, biology, medicine, mathematics, chemistry, insurance and social network analysis.
Clustering used in oral medicine:
Nowadays various clustering techniques are used in oral medicine, and their importance is increasing day by day. These techniques help practitioners to determine the relationships between various classifications and the attitudes of similar patients after they have been examined.
The figure given below shows an example of clustering.
The above figure shows the smaller clusters obtained by dividing large clusters using a partitioning clustering method.
Clustering used in biology:
Clustering techniques are also widely used in various fields of biology. They are very useful in animal and plant ecology, where they help to differentiate and compare organisms.
They are also very useful in bioinformatics, where genes with similar patterns can be grouped together.
Use of Clustering in Marketing:
In market research a variety of data must be collected, typically from surveys. Clustering makes this complex, compound data easier to handle by reducing the general body of data to particular groups. The attributes used for grouping include gender, age, geographical distribution and so on.
The main purposes of clustering in marketing include the following:
Breaking down a compound market segment into simpler, more usable segments whose essential requirements the organisation can then meet.
Supporting product development, which matters to any organisation, and contributing to decisions about the availability of products.
Clustering methods are also used in marketing to modify or improve a product that is already on the market. They can also help to decide how the product should be developed in future.
Organisations also use clustering methods to decide on the particular market in which to launch a product and to choose a technique for testing the product in that market.
Various types of clustering methods are useful in different aspects of marketing. Those most widely used are hierarchical clustering, non-hierarchical clustering and agglomerative clustering.
In hierarchical clustering all the objects are arranged in a hierarchical structure. An example is divisive clustering, which starts with a single large cluster made up of a number of objects. This large cluster is then broken down into successively smaller clusters.
Non-hierarchical clustering is also known as k-means clustering. In non-hierarchical clustering, a centre is first found for each cluster. This centre is called the centroid. Each object is then grouped with the centroid that is nearest to it, and the centroids are recalculated until the groups no longer change.
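A minimal Python sketch of k-means is given below: each object is assigned to its nearest centroid, and each centroid is then recalculated as the mean of its members, until the assignment stops changing. The function name `k_means`, the `max_iter` parameter and the naive choice of the first `k` objects as starting centroids are assumptions made for illustration; practical implementations usually choose the starting centroids more carefully.

```python
import math

def k_means(points, k, max_iter=100):
    """Return (centroids, assignment) for a list of coordinate tuples."""
    # Naive initialisation: the first k objects (an assumption of this sketch)
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the cluster whose centroid is nearest
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clusters are stable
        assignment = new_assignment
        # Recalculate each centroid as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment
```

With the four points `(0, 0)`, `(0, 1)`, `(10, 10)` and `(10, 11)` and `k = 2`, the sketch settles on two clusters of two points each, even though the starting centroids both lie in the first pair.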
In agglomerative clustering, all the objects are first treated as separate clusters. These clusters are then grouped step by step to form larger clusters. Various examples of this method are given below.
The centroid method is used when the distance between cluster centres is to be increased. In this method the distance between two clusters is taken as the distance between their centroids, and the clusters produced are those whose centres are as far apart as possible.
The variance method is used when the variation within clusters is to be reduced. In this method clusters are generated so that the within-cluster variance is minimised.
In the linkage method, clusters are formed by creating links between objects, and the links depend on the distances between the objects. There are two types of linkage method: single linkage and complete linkage. The single linkage method uses the nearest neighbour rule: the distance between two clusters is the distance between their two closest objects, so an object is linked to the object closest to it. The complete linkage method uses the farthest neighbour rule: the distance between two clusters is the distance between their two farthest objects, and the clusters for which this distance is smallest are linked.
Importance of Clustering in Medicine:
In the medical field there are many situations in which things must be distinguished and studied thoroughly. Clustering is really helpful in such situations and thus plays an important role. In medicine it is used to differentiate and classify various types of blood samples or tissues. In this field all information must be accurate, and clustering methods help to achieve this. Clustering also offers a range of rigorous methods, which are helpful where precise results are of great importance.
As discussed above, clustering is now used in almost every field. The importance of data mining has increased, and it is now used in many organisations for analysing data. The use of clustering in data mining is an added advantage: it makes it easy to differentiate various types of data and to recognise the relationships between them. Its use is no longer restricted to databases but extends to various other fields, including marketing, medicine, aircraft, biology and many more. Still, some people are of the opinion that further development is required in this field. Many clustering methods and techniques are available, but some of them do not have a proper theoretical basis: some are based on mathematical methods, while others are based on informal methods. This makes it really difficult to compare them. In spite of these problems, clustering is still a very interesting field, and it is used in many areas.