Experiment Framework Design

CHAPTER 4

RESULTS AND DISCUSSIONS

4.1 Experiment Framework Design

The experimental tools for Cube Browser, Pivot Table Analysis and WOLAP design and specification are chosen for benchmarking as to verify the accuracy of the data processing in the multidimensional cubes. All experimental tools are written using Visual Basic for cube browser/data mining Model browser, ASP for WOLAP and Ms.Excel for Pivot Table bridging into Ms. SQL Server 2000 and Ms. Analysis Services. For Data mining model browser, Microsoft SQL Server 2000 data mining benchmark is it has the whole testing methods and results for us to compare and validate.

The data are converted from 2 dimension relational databases from Ms. Access, Text files, and CSV/Excel data type to the dimension data type processed in the OLAP Engine. At present, there are no standard or universal query languages are available for multidimensional data type. Hence, Multidimensional Expressions (MDX) from Microsoft SQL Server is chosen for our experiment for client and web experimental tools. The MDX is a type of language used in Microsoft Analysis Service to access multidimensional database. MDX provides a rich and powerful syntax for querying and manipulating the multidimensional data stored in OLAP server cubes. Table 4.1 explains MDX query for smart school database. And also the result extracted from the experimental tools.

We do not have the detailed specifications of Microsoft SQL Server 2000 data mining benchmark. We randomly generated the dimensional data in the database and use the Data Mining Expression (DMX) of SQL server 2000 as our query language for this research.

Table 4.1 - MDX Query for Smart School Multidimensional Database

SELECT [Academics].MEMBERS ON ROWS,

NEST ([Race].[Race Desc].MEMBERS, [Religion].[Religion Desc].MEMBERS) ON COLUMNS FROM EXAMINATIONS

4.2 Experiment Framework Specifications

The experiment benchmark for this research is to develop a complete framework for Data Mining and data warehousing Application using multidimensional approach. This research focus on Smart School, E-Commerce and Medical informatics database to prove the framework can be adaptable for any kind of databases and the framework are able to processed and the result will be discuss in section 4.3 for result benchmarking using the experimental tools.

This section focus on the experiment specification on discussing how the framework perform the Extraction, Transformation and Loading (ETL) processes into data warehouse 2-Dimensional Database and could easily transform the mentioned 2-Dimendional Database warehouse ER-Diagram schema into Star/Snowflake schema.

To complete the ETL process to data warehouse entity relationship schema and snowflake schema, database transport route have to be defined and also identify all collective source databases pointing to target data warehouse system. The architecture and also implementation of the database transport via ETL are done in Ms SQL Server 2000 data transportation services. Diagram 4.1 depicts the ETL process for Smart School Database and the ETL processes are universal for all type of databases and also text files varies from CSV/EXCEL/TEXT. The completion process of ETL will then result a collective data from many target databases which will produce a 2-Dimensinal Data Warehouse varies from Gigabytes to Terabytes. Diagram 4.2 depicts the Entity-Relationship Diagram (ERD) of the smart school data warehouse schema)

Data Warehouse database are then ready to be transform from 2-D entity relationship schema to OLAP multidimensional database star/snowflake schema where cube production are started out from here. These processes will then need to be carefully executed as the first step is to determine the "fact table" or transaction table from all of the tables. This step need careful analysis as it will reflect the result to be used for mining later and this step should usually be done manually.

For the Smart School database, the table “dttExamResult” & “dttExamResult_Paper2” (See Diagram 4.3) are the fact tables which holds all the transaction such as examination result and also pass and fail score. As for the Ecommerce database, "tblOrder_ProductDetails" (See Diagram 4.4) are the fact table holding transaction details like unit price, postage fees quantity and date time delivered. The primary key of the mentioned transaction tables will then becomes the primary key of this fact table, the source table of the foreign key of transaction tables becomes the dimension tables, and the foreign key of transaction tables becomes the primary key of the dimension table.

After the snowflake schema is transformed, multidimensional database storage design need to be define. There are flavours of such as MOLAP, ROLAP and HOLAP which had been discuss in chapter 2 Literature Review of OLAP. So, OLAP data can be stored in the data warehouse itself (ROLAP) or in a separate cube repository (MOLAP) or HOLAP combination of both ROLAP and MOLAP. Depending on the scale and size of data warehouse and the expected load, the ideal storage method can be chosen. Also, for large scale implementations, MOLAP are preferred as it consists of separate OLAP repository.

Aggregation need to be setup in multidimensional database as it will significantly improve the queries response time and optimize query processing performance. Queries are always made against multidimensional databases and for that larger amount of CPU and memory resources are needed to perform such processing. Designing a Multidimensional database is crucial and identification of the fact and dimension tables may directly influence the growth of the multidimensional database size. The number of aggregations increases exponentially (See Table 4.2) with the number of OLAP levels and dimensions based on the formula:

a = l^d

a - Aggregations, l - Levels, d - Dimensions

Table 4.2 - Number of OLAP Aggregations Increases Exponentially

Levels

2

3

4

5

Dimensions

2

4

9

16

25

3

8

27

64

125

4

16

81

256

625

5

32

243

1024

3125

6

64

729

4096

15625

7

128

2187

16384

78125

4.3 Experiment Results for OLAP Performance

Knowing the aggregation result grows exponentially fast into millions of aggregations, manual pre-calculation of the most common or popular aggregation are made in order to reduce the processing of the multidimensional database. Table 4.3 explains the experiment of calculating the aggregations for smart school, E-commerce and medical databases by testing 100% performance OLAP aggregation test by using Ms Analysis Services. With this aggregation test, avoidances of aggregations exponential growth can be avoided as the design of the snowflake schema can be redesign for efficiency.

Table 4.3 - Calculation of Aggregation for Experimental Databases

Data Storage Type

Aggregation storage type

Number of aggregations

Smart School

MOLAP

9.707MB

695

ROLAP

53.964 MB

695

HOLAP

9.707 MB

695

Ecommerce

MOLAP

0.014 MB

21

ROLAP

0.076 MB

21

HOLAP

0.014 MB

21

Medical

MOLAP

0.058 MB

83

ROLAP

0.377 MB

83

HOLAP

0.058 MB

83

The overall tests shows and prove that MOLAP are a best choice of implementing a multidimensional database. MOLAP source data and aggregations are stores in multidimensional format which makes it the fastest option for data retrieval. ROLAP on the other hand stores aggregations within the source relational database which needs more aggregation storage which are a major concern when larger implementation of data warehouse and this will slow down the data retrieval. Notice that in the experiment ROLAP use almost 5 times the size than MOLAP and HOLAP.

Lastly, notice that MOLAP & HOLAP share the same aggregations storage type but the with differences that HOLAP offers MOLAP performance where data must be obtain from the table and also HOLAP perform as slow as ROLAP.

By referring to the graph on Performance Vs Size in Table 4.4, MOLAP increasing performance gain requires additional disk space utilization. In this experiment, the multidimensional databases are tested 100% with total database size. The result shows that MOLAP again outshine ROLAP and HOLAP implementations.

Table 4.4 Performance Vs Size Test on aggregations

MOLAP

ROLAP

HOLAP

Smart School

Ecommerce

Medical

The experiment framework results demonstrate that Multidimensional database (MOLAP) method is effective in producing efficient and fastest data retrieval querying the database and this method will be then tested in the four experimental tools (Cube Browser, Web-OLAP, Pivot Table and data mining model browser) for result accuracy benchmarking from the framework design & specifications. Cube browser, Web OLAP and Pivot table will be used for result accuracy benchmarking in section 4.4

4.4 Experiment Tools Results

Based on the experiment design, specification and results discussed in section 4.1~4.3, MOLAP are favourable as it is fast for information retrieval and manipulations cubes for slicing and dicing. MOLAP stores data in its own multidimensional database repository which are separated from the relational database/data warehouse. MOLAP are directly linked to applications such as VB, ASP, dot NET, JAVA and et cetera using application programming interfaces (API) because results are published instantly after the complex calculations are deployed when the multidimensional database are formed (George, C., 1996).

In this section, four experimental tools are developed based on VB and ASP programming language to prove the accuracy of the result being executed to the Multidimensional databases. These experiments are separated into three parts:

(1) To prove the accuracy of the result query via Multidimensional Expression (MDX) versus cube browser and Pivot table for Smart School, E-Commerce and Medical Low Birth Weight Databases base on the research questions that are listed in Table 4.5.

Table 4.5 Research Questions to prove the accuracy of the results

RQ1. Smart School: What is the total examinations results for paper 1 based on all academics filter based on Religion nested through gender?

RQ2. E-Commerce: What is result for the overall order list for all categories of products filter based on Religion nested through gender?

RQ3. Medical Low Birth Weight: What is the results for pregnant woman smoking and not smoking during pregnancy based on Birth Weight <= 2500g nested through race?

Table 4.5 research questions are now answered in table 4.6~4.8 separately based for Smart School, E-Commerce and Medical Low Birth Rate Databases. Research Questions are translated to MDX to query the result in the multidimensional database. In each section from 4.6~4.8, results are shown in WOLAP, MDX Client, PIVOT Table and Cube Browser. Noticed that all mentioned results circled in each section are the same which proved that the results are accurate for all four experimental tools used in this research.

Table 4.6 Research Questions 1 Result Accuracy Benchmark for Smart School

RQ1

SELECT [Academics].MEMBERS ON ROWS,

NEST ([Gender].[Gender Desc].MEMBERS, [Religion].[Religion Desc].MEMBERS) ON COLUMNS FROM EXAMINATIONS

WOLAP

MDX

PIVOT Table

Cube Browser

Table 4.7 Research Questions 2 Result Accuracy Benchmark for E-Commerce

RQ2

SELECT [Category].MEMBERS ON ROWS,NEST ([Gender].[Gender Desc].MEMBERS, [Religion].[Religion Desc].MEMBERS) ON COLUMNS FROM [Order List]

WOLAP

Client MDX

Pivot Table

Cube Browser

Table 4.8 Research Questions 3 Result Accuracy Benchmark for Medical Low Birth Rate

RQ3

SELECT [Smoking During Pregnancy].[Smoke Desc].MEMBERS ON ROWS,

NEST ([Low Birth Weight].[Lbw Desc].MEMBERS, [Race].[Race Desc].MEMBERS) ON COLUMNS FROM [Low Birth Weight]

WOLAP

Client MDX

PIVOT Table

Cube Browser

(2) The overall grand total query via cube browser and Pivot table for E-Commerce. This experiment will use different query execution and finally the grand total results are accurate with the total of 977,623.29.

Table 4.9 Total Result for Accuracy Benchmark on E-Commerce Data

Tools

Results

Cube Browser

PIVOT Table

Both experiment proved that the results are accurate while querying the multidimensional database in any form or type of condition for example by gender, religion, race, state and et cetera. The overall experimental tools used for querying the multidimensional database indeed prove that the benchmarking results are accurate and efficient.

(3) to perform Data Mining using Multidimensional Database for Medical Low Birth Weight Database based on the research questions that are listed in Table 4.10.

Table 4.10 Research Questions Multidimensional Data Mining

RQ1. What is the probability of Low Birth Weight for Mothers who Smoke and do not smoke during pregnancy based on Hypertension, Low Birth Weight, Premature, Race, Smoke and Urine type.

RQ2. What is the probability calculation of 5 cluster group for Low Birth Weight for Mothers who Smoke and do not smoke during pregnancy based on Hypertension, Low Birth Weight, Premature, Race, Smoke and Urine type.

Table 4.10 research questions are now answered in table 4.11~4.12 separately based for Smart School, E-Commerce and Medical Low Birth Rate Databases. Research Questions are then carefully plan for data mining process case and model training in the multidimensional database. In each table 4.11, results on the Decision Tree Mining model and table 4.12 shows the results on the clustering Mining Model for calculating the probability of Low Birth Weight for mothers who Smoke and do not smoke during pregnancy based on Hypertension, Low Birth Weight, Premature, Race, Smoke and Urine type. Noticed that all mentioned results circled in table 4.11 and 4.12 are the same which proved that the results are accurate for the data mining mode browser experimental tool used in this research. The total probability cases are 183 for decision Tree and clustering total probability cases for 5 groups are 50.26, 43.45, 39.52, 31.14, 24.62 respectively.

Table 4.11 Research Questions 1 Result Data Mining Benchmark for Medical Low Birth Weight using Decision Tree

RQ1

Decision Tree are used to compute the probability result based on prediction tress of Hypertension, LBW, smoking, urine, premature and race,.

Prediction: Hypertension

Predictive Tree

Hypertension

Cases

Probability

Do Not Suffer from Hypertension

171

93.44%

Suffer Once from Hypertension

12

6.56%

Total

183

100.00%

Prediction: Low Birth Rate

Predictive Tree

Low Birth Weight

Cases

Probability

Birth Weight 2500g

124

67.76%

l = Birth Weight < 2500g

59

32.24%

Total

183

100.00%

Prediction: Smoking

Predictive Tree

Smoking

Cases

Probability

Do Not Smoke During Pregnancy

112

61.20%

Smoking During Pregnancy

71

38.80%

Total

183

100.00%

Prediction: Urine

Predictive Tree

Urine

Cases

Probability

No Presence of Uterine Irritability

156

85.25%

Presence of Uterine Irritability

27

14.75%

Total

183

100.00%

Prediction: Premature Delivery

Predictive Tree

Premature

Cases

Probability

Do Not Suffer from Premature Labor

153

83.61%

Suffer Once from Premature Labor

24

13.11%

Suffer Three Times from Premature Labor

1

0.55%

Suffer Twice from Premature Labor

5

2.73%

Total

183

100.00%

Prediction: Race

Predictive Tree

Race

Cases

Probability

Black

25

13.66%

Other

66

36.07%

White

92

50.27%

Total

183

100.00%

Table 4.12 Research Questions 1 Result Data Mining Benchmark for Medical Low Birth Weight using Clustering

RQ2

Clustering are used to prepare the probability calculation

Node Attributes: Hypertension

Cluster Group

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cases

Probability

Cases

Probability

Cases

Probability

Cases

Probability

Cases

Probability

Do Not Suffer from Hypertension

48.27

96.04%

38.53

88.68%

38.96

98.59%

29.04

93.24%

22.19

90.13%

Suffer Once from Hypertension

1.99

3.96%

4.92

11.32%

0.56

1.41%

2.11

6.76%

2.43

9.87%

Total

50.26

100.00%

43.45

100.00%

39.52

100.00%

31.14

100.00%

24.62

100.00%

Node Attributes: Hypertension

Cluster Group

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cases

Probability

Cases

Probability

Cases

Probability

Cases

Probability

Cases

Probability

Do Not Suffer from Hypertension

48.27

96.04%

38.53

88.68%

38.96

98.59%

29.04

93.24%

22.19

90.13%

Suffer Once from Hypertension

1.99

3.96%

4.92

11.32%

0.56

1.41%

2.11

6.76%

2.43

9.87%

Total

50.26

100.00%

43.45

100.00%

39.52

100.00%

31.14

100.00%

24.62

100.00%

Node Attributes: Low Birth Weight

Cluster Group

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cases

Probability

Cases

Probability

Cases

Probability

Cases

Probability

Cases

Probability

Birth Weight ge 2500g

50.11

99.70%

0.56

1.28%

38.78

98.13%

30.07

96.55%

10.49

42.59%

l = Birth Weight < 2500g

0.15

0.30%

42.9

98.72%

0.74

1.87%

1.08

3.45%

14.14

57.41%

Total

50.26

100.00%

43.45

100.00%

39.52

100.00%

31.14

100.00%

24.62

100.00%

Node Attributes: Race

Cluster Group

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cases

Probability

Cases

Probability

Cases

Probability

Cases

Probability

Cases

Probability

Black

10.2

20.29%

8.03

18.48%

1.87

4.73%

1.41

4.54%

4.49

18.22%

Other

29.28

58.24%

23.79

54.74%

5.05

12.78%

6.77

21.73%

2.12

8.60%

White

10.79

21.46%

11.63

26.77%

32.6

82.49%

22.96

73.73%

18.02

73.18%

Total

50.26

100.00%

43.45

100.00%

39.52

100.00%

31.14

100.00%

24.62

100.00%

Node Attributes: Premature Delivery

Cluster Group

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cases

Probability

Cases

Probability

Cases

Probability

Cases

Probability

Cases

Probability

Do Not Smoke During Pregnancy

47.5

94.49%

26.56

61.12%

27.09

68.55%

6.83

21.94%

7.03

28.54%

Smoking During Pregnancy

2.77

5.51%

16.9

38.88%

12.43

31.45%

24.31

78.06%

17.59

71.46%

Total

50.26

100.00%

43.45

100.00%

39.52

100.00%

31.14

100.00%

24.62

100.00%

Node Attributes: Urine

Cluster Group

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cases

Probability

Cases

Probability

Cases

Probability

Cases

Probability

Cases

Probability

No Presence of Uterine Irritability

46.04

91.59%

30.87

71.05%

33.54

84.88%

27.92

89.64%

22.63

91.92%

Presence of Uterine Irritability

4.23

8.41%

12.58

28.95%

5.98

15.12%

3.23

10.36%

1.99

8.08%

Total

50.26

100.00%

43.45

100.00%

39.52

100.00%

31.14

100.00%

24.62

100.00%

In this research, the experiment will focus on two computational algorithms by Microsoft SQL Server 2000 Data Mining Benchmark which is Microsoft Decision Tree and Microsoft Clustering. In this data mining benchmark, defining the case table for model training are very important as it is this will be the most recourse consuming task for this experiment. This experiment focuses of the case model training for Medical Low Birth Rate Database. This experiment is organized in two parts. The first part covers the training using Decision Trees mining model and second part will be Clustering.

The data is highly concurrent to the generated probability results. Therefore, in the data and the trees created by the Decision Trees and clustering algorithm, there are many patterns result for probability calculation. The effects of the data mining model that will be use for training are the case and nested table of medical low birth rate database.

In the experiment, case table are selected in the Medical Low Birth Weight database for decision tree and clustering training. A case table is a table containing case information related to the non-nested part of the data, and a nested table is a table containing information related to the nested part of the data.

4.5 Comparisons Between Experimental Tools

The experiment covered four important areas of testing involving Web and client based applications to prove that the accuracy of the results queried through the Multidimensional Database are accurate and efficient. Overall experiments proved that the result are accurate based on three test cases as discussed in section 4.4 on the experimental tools result for the purpose of result accuracy benchmark for WOLAP, Cube Browser and PIVOT Table and data mining model browser.

In this experiment, we had concluded the findings of the ten factors in section 3.3 where extensive evaluations, comparisons and also benchmarking of the accuracy of the results using Multidimensional databases. The comparisons of the experimental tools used in this research had proven the multidimensional database have its own scalability, performance and also methods used for querying the database. Here, experimental are done by examining how well the experimental tools fit together with the multidimensional database. Table 4.13 shows the overall results and evaluation of the four experimental tools used in this research to query the multidimensional database.

The experiment results show that the experimental tools method is an effective tool in querying accurate results in a multidimensional database. Our benchmark generated results proved that by using the experimental tools namely WOLAP, Cube Browser, PIVOT table or Data Mining Model Browser are accurate while querying the multidimensional database. The training of the mining models using Decision Tree and clustering methods also prove that the experiment result are accurate base on six prediction tree for Decision Tree and node attributes for clustering.

Table 4.13 Comparison Results of the Experimental Tools

WOLAP

Cube Browser

Pivot Table

Data Mining Browser

Web Capability

P

P

P

Data Drilling

P

P

P

MDX Language

P

P

Charts/Graphs

P

P

P

Connect to MDDB

P

P

P

P

Complication

Y - Need to know MDX query before using

N - Selection of MDDB and can start Drill down data

N - Use drag and drop of MDDB data fields before processing data/Charts

Y - Need to define case table for mining MDDB data fields before processing data/Charts

Easy to use

N - Need to have knowledge in MDX query

Y - Drag and Drop of data

N - Need to have Knowledge in Excel (Not Difficult)

Y - Need to have Knowledge in Data Mining techniques such as association, classification & clustering

Built from

*For This research

ASP/ASP.NET

VB/VB.NET/C#

Excel/VBA

Decision Tree & Clustering Algorithm

Query the MDDB Data Engine

Moderate

*Depend on Bandwidth

Fast

Fast

Depending on the size of the database.

Analysis on Data

*Proven as benchmark in this research

Accurate

Accurate

Accurate

Accurate

4.6 Research Contributions

In this research, benchmark data are gathered from School, Medical and e-Commerce in text, csv/excel and access database. Data are then Extract, Transform and load into Data Warehouse via Data Transformation Services (DTS) in Microsoft SQL Server. Once Data Warehouse data are gathered, multidimensional database can be created through Microsoft Analysis Services OLAP Engine using Multidimensional OLAP (MOLAP) cubes using snowflake schema. Data can be then drill up/down by using the experimental tools. Also, Data Mining Model can be created by using the MOLAP cubes by applying Decision Tree & Clustering model. Lastly, by using the experimental tools WOLAP, Cube Browser, PIVOT Table and Data Mining Model Browser to view and manipulate data via Multidimensional OLAP cubes and Data Mining models Decision Tree & Clustering Trained data for benchmarking and result accuracy. It is well known that good benchmark is the key to drive a successful industry and technology way forward.

Please be aware that the free essay that you were just reading was not written by us. This essay, and all of the others available to view on the website, were provided to us by students in exchange for services that we offer. This relationship helps our students to get an even better deal while also contributing to the biggest free essay resource in the UK!