The transition to the information age has advanced the use of data and information. It led to the introduction of databases, which in turn expanded into the more detailed concepts of the data warehouse and data mining. This paper discusses the concepts of data warehousing and data mining in detail and highlights data mining tools and techniques. Implementations of data warehouses and data mining are spreading rapidly among organizations because they bring many benefits. This paper also highlights those benefits.
Keywords: Data warehouse; Data mining; Data mart; Data mining tools
In the early days of information technology, various data management problems occurred. Problems arise whenever data is updated while long-running queries are executing at the same time: a user making updates has to wait until the queries complete, which wastes time. One way to avoid this is to build a read-only copy of the data. An application that updates data is called an on-line transaction processing (OLTP) application, while an application that issues queries against the read-only database is called a decision support system (DSS).
Most organizations run disparate OLTP and DSS applications on several databases. For example, the finance OLTP and finance DSS are placed in a different database system from the sales OLTP and sales DSS. Each system therefore stands alone, which prevents users from accessing several kinds of data at once; they must query different DSSs to gather different data. In some cases the data in different DSSs conflict fundamentally because they do not share a format. For example, a measurement may be stored in meters in one DSS and in yards in another.
Alternatives were sought to solve this problem. Organizations concluded that they needed an integrated system, a data warehouse, which integrates data from several stand-alone systems and supports data sharing. A data warehouse responds to users' queries, but it does not reveal patterns in the data. To find such patterns, data mining is used to extract key information from the data warehouse. Data mining is performed by running software that examines a database and finds patterns in the data. Doing so requires data mining tools, and a variety of tools is available for different mining algorithms. Popular data mining tools and techniques include association rules, genetic algorithms, decision trees and neural networks.
A data warehouse can be defined as a central store for data collected from several sources, including operational databases. The data are cleaned and integrated for use in decision making (William Inmon, 1990) as cited by (Mannino, M.V. & Walter, Z., 2004). Besides providing executives and managers with a single view of the truth, a data warehouse is specially designed and organized for data retrieval and analysis. By converting operational and transactional data into enterprise information, a data warehouse supports better decision making.
In addition, a data warehouse gives the organization an opportunity to break down its internal barriers, because dispersed information is collected and combined from various sources. A well-built data warehouse involves architecture, coordination, and the phased migration of data from operational and transactional systems into an environment optimized for business intelligence, decision support, and informational processing.
In essence, a data warehouse stores and analyzes converted data collected from software systems across the entire operational environment. Data warehouses are also important for data analysis in the business intelligence environment.
Classifications of Data Warehouses
The complexity, size, and scope of a data warehouse should be tailored to the organization's schedule, requirements, budget constraints, unique needs, technology infrastructure, and available resources. In practice, organizations generally build and maintain two types of data warehouse: the enterprise data warehouse and the data mart (Adam Getz, 2006).
An enterprise data warehouse is an organization-wide implementation that crosses all business functions and covers data elements from every department and unit. It contains a broad range of interrelated subject areas and includes the various data the organization needs to enhance its data analysis. Data entities and fields from all departments and units are collected and converted into a central repository. All units and divisions, such as marketing and accounting, work together to centralize the analysis of their disparate data. The data are converted to a standard format that the entire organization can use, which improves the organization's analysis methods, data quality, consistency of results and overall efficiency.
A data mart differs from an enterprise data warehouse. It is designed to support a single business function or unit and to answer specific questions within relatively narrow confines. A data mart is also created for a special purpose, namely tactical and quick retrieval, and it is developed on a short schedule with a focus on rapid implementation. Units or departments such as accounting and marketing use a data mart as their reporting and analytical system. To analyze data for its own needs, a department designs a data mart containing enough data fields and entities to support those needs. Data marts can be populated directly from the organization's operational systems or from the data warehouse. Data from both transactional and analytical systems can be converted and kept in a data mart.
Data Warehouse Architecture
In a traditional data warehouse architecture, the data sources come from various locations such as databases, external data and text files (Jarke, M., Jeusfeld, M.A., Quix, C., & Vassiliadis, P., 1999). The architecture also contains a data transfer component that moves data from one database to another. Before data enter the data warehouse, they are processed to make them consistent and standardized; this conflict-resolution stage is performed by the mediator (G. Wiederhold, 1992) as cited by (Jarke, M., Jeusfeld, M.A., Quix, C., & Vassiliadis, P., 1999). The architecture further contains a repository that stores data about the data in the warehouse, that is, metadata. It also includes data marts, which enable the organization to tailor the architecture to departments and business functions such as finance, marketing and purchasing. Users query the data warehouse directly for their various needs.
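The conflict-resolution stage can be illustrated with a minimal sketch. It assumes two hypothetical source systems, one storing lengths in meters and the other in yards; the record layout and the function names to_meters and integrate are invented for illustration and do not come from any cited architecture.

```python
# Hypothetical records from two stand-alone source systems.
# One system stores lengths in meters, the other in yards (assumed layout).
YARDS_TO_METERS = 0.9144  # exact by international definition

finance_rows = [{"item": "cable", "length": 10.0, "unit": "m"}]
sales_rows = [{"item": "rope", "length": 10.0, "unit": "yd"}]

def to_meters(row):
    """Return a copy of the row with its length expressed in meters."""
    factor = {"m": 1.0, "yd": YARDS_TO_METERS}[row["unit"]]
    return {"item": row["item"], "length_m": round(row["length"] * factor, 4)}

def integrate(*sources):
    """Merge rows from all sources into one list with a standard format."""
    return [to_meters(row) for source in sources for row in source]

warehouse = integrate(finance_rows, sales_rows)
```

After integration every record carries the same field, length_m, so a single query can compare measurements that originally used different units.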
Data mining is defined as a process that aims to find valid, novel, potentially useful and understandable correlations and patterns in existing data, using a broad spectrum of formalisms and techniques (H.M. Chung & P. Gray (1999), P. Smyth, D. Pregibon, & C. Faloutsos (2002)) as cited by (Nenad Jukic & Svetlozar Nestorov, 2006). Others define data mining as the extraction of useful information from large data sets (Hand et al., 2001) as cited by (Karthik Jayashankar, 2007). In other words, data mining is the process of extracting patterns and relationships hidden in data in order to find meaning in the data.
Data mining complements other data analysis techniques, including basic data access, statistics, on-line analytical processing and spreadsheets. However, data mining software does not eliminate the need for the organization to understand its data, know the business and be aware of general statistical methods. In addition, data mining does not usually find knowledge or patterns that can be trusted without verification. Data mining can be used to generate hypotheses, but it is not used to validate them.
Data Mining Process
Data mining commonly involves several processes: preparation, classification, clustering, forecasting and association rule learning (Karthik Jayashankar, 2007). The first step is data preparation and exploration, in which the data are cleaned to correct data entry errors, sampled, and reduced in complexity. Classification separates the data into groups. For example, a mail program can classify an email as legitimate or as spam. The process develops rules from data with a known classification and applies those rules to data whose classification is unknown. Forecasting, by contrast, estimates the likely numerical value of a variable rather than assigning a record to a class.
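The email example can be sketched as a small naive Bayes classifier, which is only one of many possible classification techniques and is not prescribed by the cited sources. The training messages, the labels and the function names train and classify are hypothetical.

```python
import math
from collections import Counter

# Hypothetical training data: messages with a known classification.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "legitimate"),
    ("lunch meeting tomorrow", "legitimate"),
]

def train(examples):
    """Count word occurrences per class and messages per class."""
    word_counts = {}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Pick the class with the highest naive Bayes log-probability."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(class_counts[label] / total)
        denom = sum(counts.values()) + len(vocabulary)
        for word in text.split():
            # Laplace smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(training)
prediction = classify("money offer now", *model)
```

The rules here are the word counts learned from the labelled messages; classify then applies them to a message whose classification is unknown.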
In clustering, similar records are grouped according to a clustering algorithm. The process resembles classification, except that the groups are not predefined; the algorithm tries to put similar items together. Association rule learning identifies relationships between items. For example, a supermarket might examine data on customer purchasing habits. Using association rule learning, the supermarket can find out which products are frequently bought together, and the resulting information can be used as a basis for marketing recommendations. Regression, finally, attempts to find a function that represents the data with minimum error.
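The supermarket example can be sketched by counting how often pairs of products appear in the same basket. The baskets, the 0.5 support threshold and the function name pair_rules are assumptions made for illustration; production tools use more scalable algorithms over larger item sets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support >= min_support:
            # Confidence: share of baskets with the antecedent that also
            # contain the consequent.
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return rules

rules = pair_rules(baskets)
```

With these baskets only bread and butter are bought together often enough to pass the threshold, and every basket containing butter also contains bread.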
Data Mining Tools and Techniques
Data mining tools collect data and build models of the data that represent reality. A model describes the relationships and patterns in the data. Based on process orientation, data mining activities fall into three categories: discovery, predictive modeling and forensic analysis (Chris Rygielski, Jyun-Cheng Wang & David C. Yen, 2002). Discovery is the process of searching a database for hidden patterns without a predetermined idea or hypothesis about what the patterns might be. Predictive modeling is the process of taking patterns discovered in the database and using them to predict the future. Forensic analysis is the process of applying the extracted patterns to find anomalous or unusual data elements.
Data mining automates the process of finding relevant patterns in the current and historical data in a database so that they can be analyzed to forecast the future. Because data mining tools can predict and analyze the behavior of data in databases, they can guide the organization toward proactive and efficient decision making and answer questions that need to be resolved urgently.
Various types of data mining tools are available on the market, each with its own strengths and weaknesses. Information professionals have to keep up to date with the different types of tools and recommend the purchase of those that best support the organization's needs. Data mining tools can be classified into three main categories: traditional data mining tools, dashboards and text mining tools (John Silltow, 2006). Traditional data mining tools use complex algorithms and techniques to establish data trends and patterns. These tools are installed on the desktop to monitor data and trends and to capture information that is not held in the database. Most are available in both Windows and UNIX versions.
The second category is the dashboard. Organizations install dashboards to monitor the information in the database, with changes and updates to the data reflected on screen. A dashboard usually presents data as tables and charts so that users can see more easily how the business is performing. Dashboards also allow users to refer to historical data, which makes it possible to spot changes in the data. Because they are easy to use, dashboards are attractive to managers who want an overview of the company's performance.
Text mining tools are the third category. These tools can mine data from various kinds of text, such as Microsoft Word documents and Adobe Acrobat PDF files. They scan the content and convert the data into a format compatible with the tool's database, which gives users easy and convenient access to the data without having to open a different application for each data format. The scanned content may be structured or unstructured. The captured input gives the organization a wealth of information that can be mined to determine attitudes, trends and concepts. Data mining originated with the first storage of data on computers and continued with advances in data access, up to today's technology that allows users to navigate through data in real time.
The best way to apply advanced data mining techniques is to have flexible, interactive data mining tools that are directly integrated with the organization's data warehouse (John Silltow, 2006). Integrating data mining with the data warehouse is best practice because it simplifies the application and the implementation of mining results. In addition, as the data warehouse grows, the organization can continually mine best practices and apply them to future decision making. In contrast, using mining tools outside the data warehouse is inefficient and time-consuming because several extra steps are required.
When implementing data mining tools, the information professional in charge can choose among a variety of suitable data mining techniques. The nearest-neighbor method, artificial neural networks and decision trees are techniques commonly implemented by organizations today, and each mines data in its own way. Artificial neural networks are a powerful predictive technique that helps an organization review records to detect fraud and take action to minimize it. They are more complex to use than other techniques, so they are best applied in areas where they can be reused, for example in reviewing monthly credit card transactions for anomalies. Decision trees are tree-shaped structures that represent decision sets. They can be used to assess whether the organization has made the right decision, and they give auditors models for making decisions. A decision tree can generate rules that are used to classify information, and the technique is generally chosen when an understandable model is needed. The nearest-neighbor method is used to locate items similar to a document or record of interest. It groups records in a dataset with the most similar records in a historical dataset.
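The nearest-neighbor method can be sketched as follows. The historical records, their two numeric features, the labels and the choice of k = 3 are all hypothetical; the sketch uses Euclidean distance, which is one common similarity measure among several.

```python
import math
from collections import Counter

# Hypothetical historical records: (monthly_spend, transactions) -> label.
history = [
    ((100.0, 5.0), "normal"),
    ((120.0, 6.0), "normal"),
    ((110.0, 4.0), "normal"),
    ((900.0, 40.0), "suspicious"),
    ((950.0, 38.0), "suspicious"),
]

def k_nearest(record, labelled, k=3):
    """Return the k historical records closest to `record` (Euclidean distance)."""
    return sorted(labelled, key=lambda item: math.dist(record, item[0]))[:k]

def classify(record, labelled, k=3):
    """Label a new record by majority vote among its k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(record, labelled, k))
    return votes.most_common(1)[0][0]

label = classify((115.0, 5.0), history)
```

In practice the features would usually be scaled first so that one large-valued attribute, such as spend, does not dominate the distance.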
Benefits to the Organization
Implementing a data warehouse makes work easier for decision makers and managers, because the warehouse quickly extracts information to answer the organization's queries. It provides a source for information analysis to support decision making. A successful data warehouse implementation brings several benefits and value to the organization, both immediate and long term.
Through the implementation of a data warehouse and analytical applications, the organization gains substantial cost savings and a positive effect on its financial bottom line. This is supported by a study of the financial impact of business analytics, which found that implementations of business analytics generated a median five-year ROI of 112%. Among the organizations involved in the study, 54% had an ROI of 101% or more (International Data Corporation, 2002). Return on investment (ROI) is the amount of increased revenue or decreased expense that an organization realizes from an investment, usually expressed relative to its cost.
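As a worked illustration, ROI is commonly computed as net benefit divided by cost. The figures below are hypothetical and are not taken from the cited study; they only show what an ROI of 112% means arithmetically.

```python
def roi_percent(total_benefit, total_cost):
    """Return ROI as a percentage: net benefit divided by cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (total_benefit - total_cost) / total_cost * 100

# Hypothetical figures: a project costing 1,000,000 that returns 2,120,000
# over five years.
example = round(roi_percent(2_120_000, 1_000_000), 1)
```

Under these assumed figures the project returns its full cost plus a further 112% of that cost.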
A data warehouse also helps to improve business decisions. It provides the organization with credible facts backed by evidence and by the data held within the organization. Top-level staff such as managers and executives no longer have to make decisions based only on their own knowledge or instinct. In addition, decision makers can ask for actual organizational data and retrieve highly organized information that supports their needs.
A data warehouse also provides timely access to data. Previously, organizations had to spend a great deal of time accessing data from many different sources and then analyzing them as needed. Today, scheduled routines known as extract, transform and load (ETL) processes are set up in the data warehouse environment. They collect and combine relevant data from separate source systems and transform the data into a standard format that is useful for answering queries and for analysis. This gives the organization easy access to data from various sources, with fast retrieval for analysis and queries.
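A minimal sketch of such a routine is shown here, using an in-memory SQLite database to stand in for the warehouse. The two source layouts, the field names and the facts table are hypothetical, and a real routine would be run by a scheduler rather than once.

```python
import sqlite3
from datetime import datetime

# Hypothetical extracts from two separate source systems with different layouts.
sales_source = [{"cust": "A01", "amount": "120.50", "date": "2006-03-01"}]
finance_source = [{"customer_id": "A01", "total": 80.0, "posted": "02/03/2006"}]

def transform():
    """Convert both layouts to (customer_id, amount, iso_date) tuples."""
    for row in sales_source:
        yield row["cust"], float(row["amount"]), row["date"]
    for row in finance_source:
        # The finance system is assumed to use day/month/year dates.
        day = datetime.strptime(row["posted"], "%d/%m/%Y").date().isoformat()
        yield row["customer_id"], float(row["total"]), day

def load(rows):
    """Load standardized rows into an in-memory warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE facts (customer_id TEXT, amount REAL, day TEXT)")
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

warehouse = load(transform())
total = warehouse.execute(
    "SELECT SUM(amount) FROM facts WHERE customer_id = 'A01'"
).fetchone()[0]
```

Once the data share one format, a single query answers a question that previously required visiting both source systems.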
Business users can ask for data directly, with little information technology support, and can use query and analysis tools to produce reports and queries by themselves. This reduces the time information professionals spend producing reports and queries. Furthermore, decision makers can access the data through a single interface without needing to compile them from various locations.
Consistency of data is another benefit of implementing a data warehouse. Data are collected from many separate source systems, combined, and converted into a standard format. Data formats and nomenclature are standardized across organizational units throughout the enterprise, and inconsistent data are removed. In other words, all organizational units, such as human resources, operations and R&D, use the same data repository as the main source for their queries and analysis. As a result, each unit generates results that are consistent with those of the other units in the enterprise.
A data warehouse can also improve system performance. Data warehousing environments are designed and organized with a primary focus on strategic analysis and fast data retrieval. The underlying structure is specialized for storing large amounts of data and allows users to retrieve data quickly. Unlike an operational system, which focuses on processing transactions, a data warehouse is built to optimize the analysis and retrieval of data rather than the efficient creation and modification of data. A data warehouse also reduces the burden on operational systems by distributing the system load more efficiently across the organization's technology infrastructure.
The implementation of data warehouses and data mining has brought several benefits to the improvement and effectiveness of organizational processes. Information professionals should consider what modifications and adjustments could be suggested to enhance the function and capability of the data warehouse. The same applies to data mining: modifications to its tools and techniques will influence the capability of the data warehouse. New data mining techniques can also be devised to provide more options when choosing the best technique to implement.