The world is currently advancing rapidly in every field with the help of available technology: the computerization of almost every field, barcode scanners, digital cameras, satellites and, above all, the World Wide Web. These technologies let us generate and collect terabytes of data, a volume that is overwhelming, and it is now widely believed that very useful information is hidden in this huge amount of raw data.
Therefore, this data should be analysed and the valuable information hidden in it should be extracted. Analysing all of the available data manually is not possible because it is so massive. Automated tools are needed that summarize the data and extract useful, previously unknown patterns or information that help in decision making for business improvement.
This project deals with mining the data obtained from course websites through Google Analytics in order to find interesting patterns that can guide the development of those websites. [B]
What is data mining?
Data mining is a concept that deals mainly with the behaviour of collective observations. [A]
"Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amount of data stored in repositories using pattern recognition technologies as well as statistical and mathematical techniques."
The Gartner Group
Once the source data is ready for analysis, the next step is to choose an appropriate mining algorithm to apply to it. Several types of mining techniques are available, including classification, regression and clustering. These techniques are used in many fields, such as marketing, scientific research and fraud detection in banking, depending on the requirement.
What type of data can be mined?
Almost any available data can be mined, from simple numerical data and plain text documents to complex data such as spatial, web and multimedia data. Sources of data that can be mined include business transactions, scientific data, medical and personal data, surveillance video and pictures, satellite sensing, games, digital media, CAD and software engineering data, virtual worlds, emails and texts, and the World Wide Web.
In this project, the source data is retrieved from the World Wide Web, but it has already been mined by a web mining tool, Google Analytics. That output does not directly provide information useful for developing the website. We therefore take the facts produced by the tool as input data and mine them to find previously unknown, interesting patterns, which in turn help in decision making for the improvement of the website.
Motivation for the project:
Two main things motivated me to choose this project. Firstly, the source data is not direct raw data. The input is itself summary data that has already been mined, so handling it is different from working with the original source data.
Secondly, tools like Google Analytics give facts about a particular website, such as how many people visited the site, how long they stayed, which browsers they used and where the visitors are located. By looking at this data, the owner of the website learns how and from where the site is being visited, but does not get any suggestions or ideas for improving it. To improve the site, the owner has to work further on the facts that the tool provides.
Goals and objectives:
The core of this project is to take source data from the web that has already been mined by the web mining tool Google Analytics and to extract further useful information from it, which helps in developing the course websites by revealing the behaviour of their users.
Therefore, this project mainly deals with building models that mine the facts available from Google Analytics and generate different groups of users based on their behaviour, using the time period as a parameter. The owner of the website can then compare the behaviour of the users with the extra knowledge he or she has about the site, such as the activities that took place during a particular time period. For example, this project uses university course websites, so the owner's extra knowledge includes the times of exams, the times of holidays and so on. In this way, the owner can improve or develop the website based on the users' behaviour.
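The grouping idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual model: the `WeeklySummary` record, its two fields, the week labels and all figures are invented for the example, and plain k-means is used as one possible clustering technique.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class WeeklySummary:
    """One row of a hypothetical weekly export; the fields are assumptions."""
    week: str
    visits: int
    avg_seconds_on_site: float


def min_max_scale(rows):
    """Rescale each feature to [0, 1] so visit counts do not dominate time on site."""
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # avoid division by zero for a constant column
        scaled_columns.append([(v - lo) / span for v in col])
    return [tuple(point) for point in zip(*scaled_columns)]


def k_means(points, k, iterations=50):
    """Plain k-means; centroids start at the first k points for repeatability."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels


# Invented figures, listed in chronological order.
weeks = [
    WeeklySummary("term week 1", 120, 95.0),
    WeeklySummary("holiday week 1", 15, 30.0),
    WeeklySummary("term week 2", 135, 110.0),
    WeeklySummary("exam week", 410, 320.0),
    WeeklySummary("holiday week 2", 10, 25.0),
    WeeklySummary("revision week", 380, 290.0),
]

features = min_max_scale([(w.visits, w.avg_seconds_on_site) for w in weeks])
labels = k_means(features, k=3)

for summary, label in zip(weeks, labels):
    print(f"{summary.week}: group {label}")
```

The owner of the website could then set the resulting groups against what is already known about each period, such as exams or holidays, to see which events coincide with which kind of usage.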
Structure of the report
The main aim of this chapter, the literature review, is to present the theory of data mining and the different data mining concepts and techniques, drawn from journals, books and papers. The main topics covered in this chapter are the different data mining techniques, web mining and the details of Google Analytics.
These days we have various technologies that can generate and store data. Collecting data is no longer the problem; the problem is that we are unable to derive useful information from the large data sets that are available. The amount of data we collect is increasing rapidly day by day, and this rapid increase calls for new technologies and tools that can analyse this massive amount of data and extract the valuable information from it.
In general terms, data mining can be described as the process of identifying previously unknown information and patterns by using statistical and mathematical techniques.
Nowadays, many companies have recognised data mining as a powerful concept that can have a great impact on their performance. Hence, data mining has become an interesting area of research, in which researchers are trying to bring artificial intelligence techniques and statistical techniques together to address the problems of data mining.
Among the technologies available in the market, data mining is a comparatively new one. As a new technology, it faces many challenges, because extracting useful information from a massive amount of data is not easy and is sometimes very complicated. Therefore, data mining should be able to:
Deal with different data
In practice, data is not available in one single format. It comes in different types, such as web data, which is hypertext data,