To understand how a data warehouse works, it is important to understand the ETL process. To get a working knowledge of ETL, we must answer the following questions: What is it? Why is it done? How is it done? Where is it done? And who does it?
What is it?
ETL stands for Extract-Transform-Load. The ETL system sits in the back room of a data warehouse and is the main component of that area. It pulls data from different sources (Extract); consolidates, reconciles, and cleans that data so that it is "good data" and standardizes it into a single format (Transform); and loads the incoming source data into the data warehouse, where users can query it (Load). Given all the tasks involved, it is no surprise that nearly three quarters of the time spent setting up a data warehouse goes to this area. It is also the most expensive area of the data warehouse.
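The three stages described above can be illustrated with a small sketch. This is only a minimal example under stated assumptions, not how any particular ETL tool works: the two sources (`CRM_CSV`, `BILLING_ROWS`), the `dim_customer` table, and all function names are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CSV export with US-style dates and inconsistent casing.
CRM_CSV = """customer,signup_date
alice SMITH,03/15/2021
Bob Jones,11/02/2020
"""

# Hypothetical source 2: rows from another system that already uses ISO dates.
BILLING_ROWS = [
    {"name": "Alice Smith", "joined": "2021-03-15"},
    {"name": "carol white", "joined": "2019-07-30"},
]


def extract():
    """Extract: pull raw rows from each source system."""
    crm_rows = list(csv.DictReader(io.StringIO(CRM_CSV)))
    return crm_rows, BILLING_ROWS


def transform(crm_rows, billing_rows):
    """Transform: clean names, standardize dates, consolidate duplicates."""
    customers = {}
    for row in crm_rows:
        month, day, year = row["signup_date"].split("/")
        name = row["customer"].title()            # standardize casing
        customers[name] = f"{year}-{month}-{day}"  # standardize to ISO dates
    for row in billing_rows:
        name = row["name"].title()
        # Keep the first date seen for a customer known to both sources.
        customers.setdefault(name, row["joined"])
    return sorted(customers.items())


def load(rows, conn):
    """Load: write the consolidated rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer "
        "(name TEXT PRIMARY KEY, signup_date TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO dim_customer VALUES (?, ?)", rows)
    conn.commit()


def run_etl(conn):
    crm_rows, billing_rows = extract()
    rows = transform(crm_rows, billing_rows)
    load(rows, conn)
    return rows
```

Calling `run_etl(sqlite3.connect(":memory:"))` leaves one row per customer in `dim_customer`, even though "Alice Smith" appears in both sources with different casing and date formats. Real ETL systems handle far more cases, which is part of why this area consumes so much of the project time.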
Why is it done?
We perform the ETL process because it is the foundation of the data warehouse. Without it, we could not have a data warehouse that collects data from different source systems, or at least not one that is easy to query and filled with good data.
How is it done?
According to Kimball (p. 375), the ETL process is a complex system with four core areas and 34 subsystems. The four areas are extracting, cleaning and conforming, delivering, and managing.
As stated earlier, extracting involves gathering data from the company's source systems. Three subsystems are involved in this process: Data Profiling, Change Data Capture, and Extract. First there is the data profiling subsystem. The
- Discuss the topics below that you feel are important:
- What is it?
- Why is it done?
- How is it done?
- Where is it done?
- Who does it?
- Role in the DW
- Data quality
The software processes that facilitate the population of the data warehouse are commonly known as Extraction-Transformation-Loading (ETL) processes. ETL processes are responsible for (i) the extraction of the appropriate data from the sources, (ii) their transportation to a special-purpose area of the data warehouse where they will be processed, (iii) the transformation of the source data and the computation of new values (and, possibly, records) in order to obey the structure of the data warehouse relation to which they are targeted, (iv) the isolation and cleansing of problematic tuples, in order to guarantee that business rules and database constraints are respected, and (v) the loading of the cleansed, transformed data to the appropriate relation in the warehouse, along with the refreshment of its accompanying indexes and materialized views.
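Steps (i) through (v) above can be sketched in a few lines. This is a hedged illustration only: the table names (`staging_orders`, `fact_orders`, `rejected_orders`, `agg_orders_by_country`), the two business rules, and the sample rows are all assumptions made for the example. SQLite has no materialized views, so a rebuilt summary table stands in for one.

```python
import sqlite3

# Hypothetical source rows: (order_id, amount_cents, country).
SOURCE_ROWS = [
    (1, 1250, "us"),
    (2, -400, "de"),   # breaks the assumed rule "amount must be non-negative"
    (3, 9900, None),   # breaks the assumed rule "country is required"
    (4, 300, "fr"),
]


def run_pipeline(conn):
    # (i) + (ii): extract the source rows and move them into a staging area.
    conn.execute(
        "CREATE TABLE staging_orders "
        "(order_id INTEGER, amount_cents INTEGER, country TEXT)"
    )
    conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", SOURCE_ROWS)

    conn.execute(
        "CREATE TABLE fact_orders (order_id INTEGER PRIMARY KEY, "
        "amount REAL NOT NULL, country TEXT NOT NULL)"
    )
    conn.execute("CREATE TABLE rejected_orders (order_id INTEGER, reason TEXT)")

    staged = conn.execute(
        "SELECT order_id, amount_cents, country "
        "FROM staging_orders ORDER BY order_id"
    ).fetchall()

    good, rejected = [], []
    for order_id, cents, country in staged:
        # (iv): isolate tuples that would violate rules or constraints.
        if cents < 0:
            rejected.append((order_id, "negative amount"))
        elif country is None:
            rejected.append((order_id, "missing country"))
        else:
            # (iii): compute new values that match the target relation.
            good.append((order_id, cents / 100, country.upper()))

    # (v): load the clean rows, then refresh the index and the summary table.
    conn.execute("DROP INDEX IF EXISTS idx_fact_orders_country")
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", good)
    conn.executemany("INSERT INTO rejected_orders VALUES (?, ?)", rejected)
    conn.execute("CREATE INDEX idx_fact_orders_country ON fact_orders (country)")
    conn.execute("DROP TABLE IF EXISTS agg_orders_by_country")
    conn.execute(
        "CREATE TABLE agg_orders_by_country AS "
        "SELECT country, SUM(amount) AS total FROM fact_orders GROUP BY country"
    )
    conn.commit()
    return good, rejected
```

Keeping rejected rows in their own table, rather than silently dropping them, is one possible design choice: it leaves a record that can be reviewed and corrected at the source.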
Role in the DW
- Data quality