Data mining is defined as a sophisticated data search capability that uses statistical algorithms to discover patterns and correlations in data. The term is an analogy to gold or coal mining: data mining finds and extracts knowledge ("data nuggets") buried in corporate data warehouses, or in information that visitors have left on a website, most of which can lead to improvements in the understanding and use of the data. The data mining approach is complementary to other data analysis techniques such as statistics, on-line analytical processing (OLAP), spreadsheets, and basic data access. In simple terms, data mining is another way to find meaning in data (Rygielski, Wang & Yen, 2002).
Data mining discovers patterns and relationships hidden in data and is part of a larger process called "knowledge discovery", which describes the steps that must be taken to ensure meaningful results (Hoffer & Prescott, 2009).
Data Mining Concepts
Data mining is a component of a wider process called "knowledge discovery from database" (Berson & Smith, 1997). It involves scientists and statisticians, as well as those working in other fields such as machine learning, artificial intelligence, information retrieval, and pattern recognition.
Before a data set can be mined, it first has to be "cleaned". This cleaning process removes errors, ensures consistency and takes missing values into account. Next, computer algorithms are used to "mine" the clean data looking for unusual patterns. Finally, the patterns are interpreted to produce new knowledge.
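The cleaning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, the plausible age range, and the choice of median imputation are all hypothetical and are not taken from any particular tool.

```python
from statistics import median

# Hypothetical raw customer records; None marks a missing value.
raw_records = [
    {"age": 34, "sex": "F", "marital_status": "married"},
    {"age": None, "sex": "m", "marital_status": "Single"},
    {"age": 215, "sex": "M", "marital_status": "single"},  # implausible age
    {"age": 51, "sex": "female", "marital_status": " Married "},
]

# Assumed mapping used to make the "sex" field consistent.
SEX_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def clean(records, min_age=18, max_age=110):
    """Return cleaned copies of the records (the inputs are not modified)."""
    cleaned = []
    for rec in records:
        rec = dict(rec)
        # Remove errors: an age outside the assumed range is treated as missing.
        age = rec["age"]
        if age is not None and not (min_age <= age <= max_age):
            rec["age"] = None
        # Ensure consistency: one code per category, no stray case or spaces.
        rec["sex"] = SEX_CODES.get(str(rec["sex"]).strip().lower())
        rec["marital_status"] = str(rec["marital_status"]).strip().lower()
        cleaned.append(rec)
    # Take missing values into account: fill them with the median known age.
    known_ages = [r["age"] for r in cleaned if r["age"] is not None]
    fill_value = median(known_ages)
    for rec in cleaned:
        if rec["age"] is None:
            rec["age"] = fill_value
    return cleaned


cleaned_records = clean(raw_records)
for rec in cleaned_records:
    print(rec)
```

Median imputation is just one simple way to handle missing values; dropping incomplete records or modeling the missing field are equally plausible choices depending on the data.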
The following example illustrates how data mining can help bankers enhance their businesses. Records containing information such as age, sex, marital status, occupation, and number of children for the bank's customers over the years are used in the mining process. First, an algorithm is used to identify characteristics that distinguish customers who took out a particular kind of loan from those who did not. Eventually, it develops "rules" by which it can identify customers who are likely to be good candidates for such a loan. These rules are then used to identify such customers in the remainder of the database. Next, another algorithm is used to sort the database into clusters, or groups of people with many similar attributes, in the hope that these might reveal interesting and unusual patterns. Finally, the patterns revealed by these clusters are interpreted by the data miners, in collaboration with bank personnel.
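The rule-building step of this example can be sketched with a very simple rule learner, often called "1R": for each attribute, predict the most common outcome for each of its values, and keep the attribute that makes the fewest mistakes. This is an illustrative stand-in, not the algorithm any particular bank uses; the records, field names, and customer identifiers below are hypothetical, and the clustering step is not shown.

```python
from collections import Counter, defaultdict

# Hypothetical historical records with a known outcome.
history = [
    {"marital_status": "married", "occupation": "engineer", "took_loan": True},
    {"marital_status": "married", "occupation": "teacher", "took_loan": True},
    {"marital_status": "single", "occupation": "engineer", "took_loan": False},
    {"marital_status": "single", "occupation": "teacher", "took_loan": False},
    {"marital_status": "married", "occupation": "clerk", "took_loan": True},
    {"marital_status": "single", "occupation": "clerk", "took_loan": True},
]


def one_rule(records, attributes, target):
    """Return (attribute, rules, correct) for the best single-attribute rule set."""
    best = None
    for attr in attributes:
        # Count outcomes for every value of this attribute.
        counts = defaultdict(Counter)
        for rec in records:
            counts[rec[attr]][rec[target]] += 1
        # Rule: each attribute value predicts its most frequent outcome.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(c[rules[value]] for value, c in counts.items())
        if best is None or correct > best[2]:
            best = (attr, rules, correct)
    return best


attribute, rules, correct = one_rule(
    history, ["marital_status", "occupation"], "took_loan"
)
print(attribute, rules, f"{correct}/{len(history)} correct")

# Apply the rules to the remainder of the database (outcome unknown).
remainder = [
    {"customer_id": "A", "marital_status": "married", "occupation": "clerk"},
    {"customer_id": "B", "marital_status": "single", "occupation": "engineer"},
]
candidates = [r["customer_id"] for r in remainder if rules.get(r[attribute])]
print("Likely candidates:", candidates)
```

On this toy data the learner selects marital status, which classifies five of the six historical records correctly, and flags customer "A" as a candidate.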
Data mining tools and techniques
Data mining tools
Organizations that wish to use data mining tools can purchase mining programs designed for existing software and hardware platforms, which can be integrated into new products and systems as they are brought online, or they can build their own custom mining solution. For instance, feeding the output of a data mining exercise into another computer system, such as a neural network, is quite common and can give the mined data more value. This is because the data mining tool gathers the data, while the second program (e.g., the neural network) makes decisions based on the data collected.
Different types of data mining tools are available in the marketplace, each with its own strengths and weaknesses. "Most of data mining tools can be classified into one of three categories: traditional data mining tools, dashboards, and text-mining tools" (Gargano & Raggad, 1999). Below is a description of each.
- Traditional Data Mining Tools. Traditional data mining programs help companies establish data patterns and trends by using a number of complex algorithms and techniques. Some of these tools are installed on the desktop to monitor the data and highlight trends and others capture information residing outside a database. The majority are available in both Windows and UNIX versions, although some specialize in one operating system only.
- Dashboards. Installed on computers to monitor information in a database, dashboards reflect data changes and updates onscreen, often in the form of a chart or table, enabling the user to see how the business is performing. Historical data can also be referenced, enabling the user to see where things have changed (e.g., an increase in sales over the same period last year). This functionality makes dashboards easy to use and particularly appealing to managers who wish to have an overview of the company's performance.
- Text-mining Tools. The third type of data mining tool is sometimes called a text-mining tool because of its ability to mine data from different kinds of text, from Microsoft Word and Acrobat PDF documents to simple text files. These tools scan content and convert the selected data into a format that is compatible with the tool's database, thus providing users with an easy and convenient way of accessing data without the need to open different applications. Scanned content can be unstructured (i.e., information is scattered almost randomly across the document, as in e-mails, Internet pages, and audio and video data) or structured (i.e., the data's form and purpose are known, such as content found in a database). Capturing these inputs can provide organizations with a wealth of information that can be mined to discover trends, concepts, and attitudes.
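As a rough illustration of how a text-mining tool might turn unstructured text into structured data, the sketch below counts terms in two short documents and flattens the counts into rows that could be loaded into a table. The file names, the sample text, and the stop-word list are hypothetical; real tools also handle formats such as Word and PDF, which is not attempted here.

```python
import re
from collections import Counter

# Assumed list of common words to ignore.
STOP_WORDS = {"the", "a", "is", "and", "was", "to", "of", "but"}

# Hypothetical unstructured inputs, keyed by file name.
documents = {
    "email_001.txt": "The delivery was late and the support team was slow to respond.",
    "review_17.txt": "Great support, fast delivery. The delivery arrived a day early!",
}


def term_counts(text):
    """Lower-case the text, split it into words, and count the non-stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


# Structured output: one (document, term, count) row per term.
rows = [
    (name, term, count)
    for name, text in documents.items()
    for term, count in sorted(term_counts(text).items())
]

# Aggregate across documents to see which concepts recur.
overall = Counter()
for text in documents.values():
    overall.update(term_counts(text))

print(rows[:3])
print(overall.most_common(2))
```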
Data Mining Techniques
The most commonly used techniques include artificial neural networks, decision trees, and the nearest-neighbor method. Each of these techniques analyzes data in different ways:
- Artificial neural networks are non-linear, predictive models that learn through training. Although they are powerful predictive modeling techniques, some of the power comes at the expense of ease of use and deployment.
- Decision trees are tree-shaped structures that represent decision sets. These decisions generate rules, which then are used to classify data. Decision trees are the favored technique for building understandable models.
- The nearest-neighbor method classifies dataset records based on similar data in a historical dataset.
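The nearest-neighbor method is the easiest of the three to sketch. The example below classifies a new record by a majority vote among its k closest records in a historical dataset. The features (age, number of children), the labels, and the choice of k = 3 are assumptions made for illustration, and the features are left unscaled for brevity, so age dominates the distance.

```python
from collections import Counter
from math import dist

# Hypothetical historical dataset: (age, number of children) -> outcome.
historical = [
    ((25, 0), "no_loan"),
    ((27, 1), "no_loan"),
    ((41, 2), "loan"),
    ((45, 3), "loan"),
    ((38, 2), "loan"),
]


def knn_classify(point, labelled, k=3):
    """Label a point by majority vote among its k nearest labelled neighbors."""
    # Sort historical records by Euclidean distance to the new point.
    nearest = sorted(labelled, key=lambda item: dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify((40, 2), historical))
print(knn_classify((26, 0), historical))
```

In practice the features would normally be rescaled to comparable ranges before distances are computed, and k would be chosen by validation rather than fixed in advance.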
Each of these approaches brings different advantages and disadvantages that need to be considered prior to its use. Neural networks, which are difficult to implement, require all input and resultant output to be expressed numerically, so some interpretation is needed depending on the nature of the data-mining exercise. The decision tree technique is the most commonly used methodology, because it is simple and straightforward to implement. Finally, the nearest-neighbor method relies more on linking similar items and, therefore, works better for extrapolation than for predictive enquiries.
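The requirement that neural network inputs and outputs be numeric can be made concrete with a small encoding sketch. The category list, the age range used for scaling, and the 0.5 decision threshold below are assumptions chosen for illustration; no network is trained here, only the translation to and from numbers is shown.

```python
# Assumed set of categories for the marital status field.
MARITAL = ["single", "married", "divorced"]


def one_hot(value, categories):
    """Encode a category as a list with 1.0 in its position and 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max(value, low, high):
    """Rescale a number from the range [low, high] to [0, 1]."""
    return (value - low) / (high - low)


def encode(record):
    """Turn a customer record into a purely numeric input vector."""
    return [min_max(record["age"], 18, 98)] + one_hot(
        record["marital_status"], MARITAL
    )


def decode_output(score, threshold=0.5):
    """Interpret a numeric network output as a human-readable answer."""
    if score >= threshold:
        return "likely to take loan"
    return "unlikely to take loan"


vector = encode({"age": 58, "marital_status": "married"})
print(vector)
print(decode_output(0.8))
```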
A good way to apply advanced data mining techniques is to have a flexible and interactive data mining tool that is fully integrated with a database or data warehouse. A tool that operates outside of the database or data warehouse is less efficient, because it involves extra steps to extract, import, and analyze the data. When a data mining tool is integrated with the data warehouse, it simplifies the application and implementation of mining results. Furthermore, as the warehouse grows with new decisions and results, the organization can mine best practices continually and apply them to future decisions.
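The difference between analysis inside and outside the database can be illustrated on a small scale with Python's built-in sqlite3 module. This is only an analogy under stated assumptions: the sales table and its contents are hypothetical, and an in-memory SQLite database stands in for a data warehouse.

```python
import sqlite3

# An in-memory SQLite database stands in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("north", 80.0), ("south", 200.0),
     ("south", 50.0), ("south", 50.0)],
)

# Integrated approach: the summary is computed where the data lives.
in_database = conn.execute(
    "SELECT region, SUM(amount), COUNT(*) FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()

# External approach: every row is extracted first, then analyzed separately.
extracted = conn.execute("SELECT region, amount FROM sales").fetchall()
totals = {}
for region, amount in extracted:
    total, n = totals.get(region, (0.0, 0))
    totals[region] = (total + amount, n + 1)
outside = sorted((region, total, n) for region, (total, n) in totals.items())

print(in_database)
print(outside)
conn.close()
```

Both approaches give the same summary, but the second one moves every row out of the database before any analysis can start, which is the extra work the paragraph above describes.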
Benefits of data warehouse/data mining to organization
Benefits of data mining
The benefits of data mining to an organization include the following.
Data mining can aid direct marketers by providing them with useful and accurate trends in their customers' purchasing behavior. Based on these trends, marketers can direct their marketing attention to their customers with more precision. For example, marketers at a software company may advertise their new software to consumers who have an extensive history of software purchases. In addition, data mining may help marketers predict which products their customers may be interested in buying. Through this prediction, marketers can surprise their customers and make the customer's shopping experience a pleasant one.
Retail stores can also benefit from data mining in similar ways. For example, through the trends provided by data mining, store managers can arrange shelves, stock certain items, or offer discounts that will attract their customers.
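One common way to find such retail trends is to count which items are bought together. The sketch below computes the share of transactions (the "support") containing each pair of items and keeps the pairs above a threshold. The baskets and the 50% threshold are hypothetical, and this is only the first step of a fuller market-basket analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale transactions, one set of items per basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
    {"milk", "bread"},
]


def frequent_pairs(baskets, min_support=0.5):
    """Return item pairs that appear together in at least min_support of baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("bread", "butter") and ("butter", "bread") the same key.
        counts.update(combinations(sorted(basket), 2))
    total = len(baskets)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


pairs = frequent_pairs(transactions)
print(pairs)
```

A manager could use a result like this to shelve frequently paired items near each other or to discount one item in order to draw attention to the other.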
Data mining can assist financial institutions in areas such as credit reporting and loan information. For example, by examining previous customers with similar attributes, a bank can estimate the level of risk associated with each loan. In addition, data mining can assist credit card issuers in detecting potentially fraudulent credit card transactions. Although data mining is not 100% accurate in its predictions about fraudulent charges, it does help credit card issuers reduce their losses.
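A very simple form of fraud screening compares each new charge with the cardholder's past spending and flags amounts that are far from the usual range. The sketch below uses a z-score test; the transaction amounts and the threshold of three standard deviations are assumptions for illustration, and real issuers rely on far richer models. Flagged charges are candidates for review, not confirmed fraud.

```python
from statistics import mean, stdev


def flag_unusual(history, new_amounts, threshold=3.0):
    """Return the new amounts that lie more than `threshold` standard
    deviations from the mean of the cardholder's past amounts."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past amounts are identical, so any different amount is unusual.
        return [amt for amt in new_amounts if amt != mu]
    return [amt for amt in new_amounts if abs(amt - mu) / sigma > threshold]


# Hypothetical past charges for one cardholder.
past_charges = [42.0, 18.5, 60.0, 35.0, 27.5, 51.0, 44.0, 30.0]

flagged = flag_unusual(past_charges, [38.0, 950.0, 55.0])
print(flagged)
```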
Data mining can aid law enforcers in identifying and apprehending criminal suspects by examining trends in location, crime type, habits, and other patterns of behavior.
Data mining can assist researchers by speeding up their data analysis, thus allowing them more time to work on other projects.
Data mining is the extraction of hidden predictive information from large databases. It is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analysis of past events. Data mining is important to large systems because it finds things in large data repositories that you did not know existed. "A simple metaphor would be finding two needles in a haystack that match. The haystack is the database, the individual lengths of the hay represent your data fields, and the needles represent data fields with a relationship worth more to you than all the hay put together" (Newquist, 1997).
References
- (1999). "Good prospects ahead for data mining". Australian Academy of Science, pp. 1-10.
- Berson, A. and Smith, S.J. (1997). Data Warehousing, Data Mining, & OLAP, McGraw-Hill, New York, NY.
- Gargano, Michael L. and Raggad, Bel G. (1999). "Data mining: a powerful information creating tool". OCLC Systems & Services, vol. 15(2), pp. 81-90.
- Liao, D. and Chien, H. (2005-09-01). "Using Data-Mining to Explore Information Networks in the Legislative Process". Paper presented at the annual meeting of the American Political Science Association, Marriott Wardman Park, Omni Shoreham, Washington Hilton, Washington, DC. Online. Retrieved 2009-05-25 from http://www.allacademic.com/meta/p42593_index.html
- Morphy, Erika. (2002). "The new system would clearly be a trade-off, but the whole privacy debate is a trade-off among varying and competing interests". CRMDaily.com. (pp. 1-3).
- Newquist, H.P. (1997, Oct 3). "Data mining: The AI metamorphosis". Retrieved from http://www.dbpd.com/newquist.html
- Pilot Software. (1997, Oct 3). "An Introduction to Data Mining". Online Internet, Retrieved from http://www.pilotsw/dmpaper/dmindex.htm
- Ville, Barry De. (2001). "Data Mining Tool and Techniques". Microsoft Data Mining, pp. 59-91.
- Wu, J. (2002). "Business Intelligence: The value in mining data". Retrieved from http://www.dmreview.com/master.cmf (pp. 1-4).