In this paper, i have done a research work on the most efficient "Distributed database" which is actually used by many of the organizations. I have gathered information about this database through some of the papers and journals. I have also discussed about the issues faced by the distributed database system and have focussed mainly on the "Distributed query optimization issue". At last i have critically evaluated the paper with my findings and understanding of the concept.
In general, a database is the collection of data or interrelated data which is stored in for the purpose of storage and retrieval by implementing a system on to it. Distributed database is one type of database, where its data are widely spread across the global with a interrelated logical relation with them. This database provides its user to have a feel of dealing their own data from their own processor, which is actually not the case. The distributed database are preferred by many of the organizations due to its wide spread advantages such as concurrency control, query optimization, increase in data availability and readability and reduction in break down due to their widely spread data availability. It has its own draw backs which are quiet serious and very difficult to get rectified. Some of researches are still on to bring a optimal solution for this problem. I have researched on it and i have explained in this paper.
DISTRIBUTED DATABASE ISSUES:
Distributed database usage as been gradually increasing among the top level organizations by years from now. But there are some serious issues in this database, which really brings down its value and thereby to overcome it some of the research works are undergone in the last decades. Some of issues in distributed databases are:
- Distributed Query optimization issues
- Distributed Security issues
- Distributed Concurrency control issues
- Distributed Catalog management issues
- Distributed Update propagation issues
Above listed are some of the generally faced issues in the distributed databases. In this paper i have focussed my views on the issue called "Query optimization".
DISTRIBUTED QUERY OPTIMIZATION:
In general a database system is about working on data to storing a data or retrieving a data based on the requirement of the users. This kind of storing, updating and retrieving data process are processed by the queries. So the parts played by the queries are so vital in it. Though the roles of queries are so vital, it has its own de merits if they are not processed properly.
Query processing is the way the query is actually processed. (i.e.) the procedures which are undergone in it before the usage. In general a query is processed initially by translating it into a low level data manipulations and then by a proper optimization. One of the most important aspects in query processing is query optimization. Query optimization is about giving many number of execution paths for a query and finding a best execution path out of the alternatives. This process of bringing down a best execution path out the alternative paths is called as query optimization.
Query processing differs from database to database,(i.e) in centralized databases the queries processing are different compared to the query processing in the distributed database. Distributed database came into process only because of the disadvantages in the previous database (centralized system). Though the merits provided by the distributed database system are highly efficient and productive, the complexities faced in some process of the distributed database are really high and difficult and so it's a good practice to propose a system or solution plan based on the level of complexities faced by them. Thereby, as explained above the query optimization of centralized and distributed varies and that will be discussed in this paper.
QUERY OPTIMIZER AND ITS NEED:
Query optimizer is actually a tool, which is used in the query processing to depict alternative set of execution plans for a single executable query and from that it picks the best out of the alternatives. Each of the alternatives predicted by the optimizer will be producing almost a similar output, but the amount of time spent on the query, cost of the process and also the performance of the system might vary depending on the execution path. So it is always advisable to pick the right path for execution.
QUERY OPTIMIZATION EXAMPLE:
In this part i have explained about the query optimization with a small SQL language query, which will reveal the importance of query optimizer and also the impacts of choosing a bad or an ordinary execution path for an executable query.
Once this query is ready for execution, the role of optimization turns important. Now for executing this query, the query processing technique is implemented. In the query processing, query after undergoing initial steps will be given to the optimizer. Query optimizer now generates different set of execution paths for this query. Considering the above let us see how it actually works.
The query is about just selecting the name and the department floor for the students who have theirs credits more than 75.(i) Description about the contents, structure and the run time environment of this database:( ionnis ) pdf
So based on the above query let us assume a description for the contents , structure and the run time environment of the database. Number of pages in student page and student tuples is 30000 and 300000 respectively. And the filter used in this query restricts the student tuples to 20. Number of pages in student department and student department tuples is 5 and 50 respectively. The indices of student are a "clustered B+ tree" on student.stud_credits and that of student dept is "clustered hashing" on studept.stud_deptno. The cost of accessing a single page is 15ms.
As mentioned above, a single query can be executed in many ways which produces the same result. Likewise, the above query can be executed in more than one way.First way of execution:
Using the clustered B+ tree it is easy to filter the student tuples. Once filtered students are available , then for each tuple it is checked using hashing index to find the tuple which matches student department number.Second way of execution:
The next possible way of executing the same query can be , without using the index and just with the process of page level nested loops.Third way of execution:
The other possible execution path can be by, scanning the entire student table relation with each of the student department tuples. Then by storing all the resulted pairs and scanning with the department number attribute in the tuple pairs.
These are some of the possible execution paths for the above specified query. So now, its the optimizer's part to find the best out this alternatives by cross checking the cost estimation, time evaluations and performance issue for each of this alternative. In this query by a rough estimation it takes almost less than a minute for the query to execute, since it has used indexing at the right time. But when comparing other two paths, though they tend give the same result its not recommended due to its cost and time consumption. Thereby, on using the optimizer it has opted out the best solution.
PRE-OPTIMIZATION PROCESS (alaa pdf)
Best solution in query optimization is about choosing the best alternative path by considering the cost estimation of the query execution, communication cost involved in the data transferring etc.
Before the query optimization, the raw query is translated into the low level language and given as a input to the query parser. The query parser checks for the semantic evaluation and also checks for the valid query structure. The ultimate goals of translating the high level query language is to produce a perfect transformation of query language in such a way that it produces a result as expected with an equivalence between relational calculus and the predefined rules maintained in relational algebra. Then the second objective is to produce an optimal solution. (i.e) a solution which is derived by considering all the cost and performance affecting factors. So in terms of meeting all this requirements properly, the entire process is handled by four divisions namely Query decomposition, data localization, global optimization, and local optimization [ distributed d s pdf]. In query decomposition as discussed above query is parsed and it produces a relational algebraic equivalence. Since the data used in distributed query processing is almost global, they are specified in the global relations irrespective to the data distribution information's. Since the data are specified in global relations, they are localized using data localizer programs. Data localizer, localizes the global relations and produces a fragmented query which is far optimal than the initial query. Then this fragmented query is passed to the query optimizer, which does the procedure as explained before to choose an ultimate execution plan. This part is followed by the global optimizer, which looks to bring out an optimal solution by suggesting the ordering of operations involved in the process, looking after the data communications between the sites and most importantly providing the correct choice of local and distributed algorithms for the database operations.
The optimizer's which follow this kind of approach will be considered to be comprised of three components: namely search space, search strategy and cost model. They proceed with their evaluations based on this set of components. Each of these components almost produces similar kind of result at the end but, they vary by process.
So far in this paper, the query optimization and needs are discussed and their importance is well known by the demands. Though it's a known fact that query optimization plays a vital part in booming up the system's performance it has its own difficulties in handling its process.
Here are some of the difficulties which are faced by the distributed query processing:
DISTRIBUTED QUERY OPTIMIZER ISSUES:
As explained above, the distributed query processing holds one separate optimizer called as global optimizer which optimizes the query by providing suggestion on ordering operation and selecting good algorithms for database operations. But, this global optimizer apart from its usage has some issues to be sorted such as the restrictions which are imposed on cost model, preference to sub set query, thick line of trade-off existence between the amount of money[cost] spent and the quality of the optimization and issue of optimization and re-optimization intervals.
Cost model issue:
Cost model serves as an important component in the global query optimization, which provides most required information about the access methods and its different impacts in the distributed database system. It also provides information about the database related to its physical schema and its attributes. Cost model is very much essential in terms of predicting the cost which will also estimate the cost of the alternative plans for the executable query. Then my making a proper comparison, a best executable path will be selected. Mostly the restrictions are made on to the cost model, restricting their effectiveness in optimizing the query in improving the throughput of the system. Cost models are planned in many formats based on their failures and performance statistics.
Subset query issue:
It's a known fact that due to the high demand on databases and its functions, the existence and availability of advanced query languages are more in existence In spite of all this advanced existence, the global query optimizer prefers using the sub set of the queries such as select, predict and joins with come conjunctives such as "AND" "BETWEEN" etc. This set of queries is important as they are capable of giving good optimal solution. Also used to the usage of this variety of [SPJ] query sets , a separate theory for joining and semi-join ordering is made in place for the betterment of usage. Some of the query sets which provide ultimate promise in giving optimal solutions are by sorting, union, aggregations etc.
Trade -off issue:
As discussed above the optimizer, takes optimization cost in account while optimizing the query. There are plenty of possibilities which would lead to give an possible trade off between optimization cost and the quality of optimization. Trade off between cost and quality of optimization is nothing but the proportionality which incurs between them, when there is a need for increasing the optimization on queries to find a perfect solution. (i.e.) the quality of the optimization can be increased if the cost and time are spent more on finding strategy to find the exact path. But at the mean time, when the cost value is considered high there arises the trade-off between quality and cost. So in this case it is affordable to spent more cost on achieving high quality for the most generally repeated queries. But at the mean time it's a mere waste for the queries such as "Ad hoc" which are used very rarely or once.
Search space & search strategy issue:
The search space is one of the other components which is used in the optimizer tool for the query optimization. Search space can also be termed as solution space, which is depicted with several sets of query execution plans (QEP). Each set of solution space finds almost the same result but with different computation methods. There are several points in the solution space, with each solution point containing its own solution for the problem. Any of the solution in search space is described by a query tree. They play a major part in finding solution points for the join operations. It's a known fact that the "join" operations are generally associative and commutative in nature. Thereby , in particular with the join operations the query tree is found be increased with number of operations leading to number of relations and this in turn makes the search space operation very difficult in terms of execution time and cost. This is the major issue which is been dealt with the search spaces. So considering this issue, search spaces are followed with certain restrictions to limit themselves in exceeding the cost and time by picking the correct strategy.
Re-compilation & re-optimization issue:
In the distributed query optimization, the global optimizer is used just before the execution of the query. The cost model involved in this model can be inaccurate at times due to database re-organization. Therefore due to the changes faced due to the re-organisation in global optimization, the need of re-compilation and re-optimization case occurs. And as discussed above about the cost model issues, which sets a trade-off between quality and cost. Likewise here it brings trade-off problem due the time involved in the recompilation and re-optimization.
In this section, i have made a discussion about my findings through this research work.
As discussed earlier, the query optimization is one of the highly efficient and a cost saving technique which actually increases the overall performance of the system. But in my point of view, this performance of gain through query optimization can only be achieved through the proper implementation works in the process. (i.e) For example, we will consider the [section 3.3]. In this section i have explained about query optimization through that example by referring [ionnis.pdf]. In this he has explained about the needs of query optimizer by mentioning different execution paths for a single executable query and showing their execution time and cost. From this i have clearly understood the need and the importance of the query optimization. Thereby i firmly support author's point of view on the need of query optimization.
In the [section 3.4] i have discussed about the pre-optimization and the whole query processing procedure. In this section i have referred [alaa.pdf], in this the author has explained about the query optimization processing. Since it is a distributed query processing the source of data is all over the global and thereby a special optimizer called global optimizer is used. It plays a major part in addressing the order of the operations and also deals with the transfer of data between communication sites.
In the distributed query optimization issues part, i have discussed about the issues faced in the optimizer tool for query optimization. In the cost model issue, the need of cost model is very clearly discussed. In this, the importance of using a good cost model is discussed as it might lead to trade-off situation. (i.e) A good cost model should be used, which must be good enough in delivering a high quality optimization and also with less cost involved in the optimization process. Because by violating this, it might lead to trade-off between execution quality and cost. In the search space issues, the existence of larger search space will extensively lead to problem of high execution cost and also might change the optimization as a time prohibitive one. So instead of getting into this trouble its highly advisable to make a clear view on the shapes of the search spaces and also restrictions should be made on heuristics. These are some of the important features and problems of distributed database , which i have discussed throughout this paper by undergoing a research work on distributed database.
In this paper, i have discussed my research work on distributed database and have critically evaluated my views. I conclude this paper, by suggesting that extensive research work on distributed database will surely provide an optimal solution.