Distributed database usage as been gradually increasing among the top level organizations by years from now. But there are some serious issues in this database, which really brings down its value and thereby to overcome it some of the research works are undergone in the last decades. Some of issues in distributed databases are:
- Distributed Query optimization issues
- Distributed Security issues
- Distributed Concurrency control issues
- Distributed Catalog management issues
- Distributed Update propagation issues
Above listed are some of the generally faced issues in the distributed databases. In this paper i have focussed my views on three issues, namely security, concurrency and query optimization issues.
Distributed query optimization issues:
In general a database system is about working on data to storing a data or retrieving a data based on the requirement of the users. This kind of storing, updating and retrieving data process are processed by the queries. So the parts played by the queries are so vital in it. Though the roles of queries are so vital, it has its own de merits if they are not processed properly.
Query processing is the way the query is actually processed. (i.e.) the procedures which are undergone in it before the usage. In general a query is processed initially by translating it into a low level data manipulations and then by a proper optimization. One of the most important aspects in query processing is query optimization. Query optimization is about giving many number of execution paths for a query and finding a best execution path out of the alternatives. This process of bringing down a best execution path out the alternative paths is called as query optimization.
Query processing differs from database to database,(i.e) in centralized databases the queries processing are different compared to the query processing in the distributed database. Distributed database came into process only because of the disadvantages in the previous database (centralized system). Though the merits provided by the distributed database system are highly efficient and productive, the complexities faced in some process of the distributed database are really high and difficult and so it's a good practice to propose a system or solution plan based on the level of complexities faced by them. Thereby, as explained above the query optimization of centralized and distributed varies and that will be discussed in this paper.
QUERY OPTIMIZER AND ITS NEED:
Query optimizer is actually a tool, which is used in the query processing to depict alternative set of execution plans for a single executable query and from that it picks the best out of the alternatives. Each of the alternatives predicted by the optimizer will be producing almost a similar output, but the amount of time spent on the query, cost of the process and also the performance of the system might vary depending on the execution path. So it is always advisable to pick the right path for execution.
QUERY OPTIMIZATION EXAMPLE:
In this part i have explained about the query optimization with a small SQL language query, which will reveal the importance of query optimizer and also the impacts of choosing a bad or an ordinary execution path for an executable query.
Once this query is ready for execution, the role of optimization turns important. Now for executing this query, the query processing technique is implemented. In the query processing, query after undergoing initial steps will be given to the optimizer. Query optimizer now generates different set of execution paths for this query. Considering the above let us see how it actually works.
The query is about just selecting the name and the department floor for the students who have theirs credits more than 75.(i) Description about the contents, structure and the run time environment of this database:( ionnis ) pdf
So based on the above query let us assume a description for the contents , structure and the run time environment of the database. Number of pages in student page and student tuples is 30000 and 300000 respectively. And the filter used in this query restricts the student tuples to 20. Number of pages in student department and student department tuples is 5 and 50 respectively. The indices of student are a "clustered B+ tree" on student.stud_credits and that of student dept is "clustered hashing" on studept.stud_deptno. The cost of accessing a single page is 15ms.
As mentioned above, a single query can be executed in many ways which produces the same result. Likewise, the above query can be executed in more than one way.
First way of execution:
Using the clustered B+ tree it is easy to filter the student tuples. Once filtered students are available , then for each tuple it is checked using hashing index to find the tuple which matches student department number.
Second way of execution:
The next possible way of executing the same query can be , without using the index and just with the process of page level nested loops.
Third way of execution:
The other possible execution path can be by, scanning the entire student table relation with each of the student department tuples. Then by storing all the resulted pairs and scanning with the department number attribute in the tuple pairs.
These are some of the possible execution paths for the above specified query. So now, its the optimizer's part to find the best out this alternatives by cross checking the cost estimation, time evaluations and performance issue for each of this alternative. In this query by a rough estimation it takes almost less than a minute for the query to execute, since it has used indexing at the right time. But when comparing other two paths, though they tend give the same result its not recommended due to its cost and time consumption. Thereby, on using the optimizer it has opted out the best solution.
Pre-optimization process (alaa pdf)
Best solution in query optimization is about choosing the best alternative path by considering the cost estimation of the query execution, communication cost involved in the data transferring etc.
Before the query optimization, the raw query is translated into the low level language and given as a input to the query parser. The query parser checks for the semantic evaluation and also checks for the valid query structure. The ultimate goals of translating the high level query language is to produce a perfect transformation of query language in such a way that it produces a result as expected with an equivalence between relational calculus and the predefined rules maintained in relational algebra. Then the second objective is to produce an optimal solution. (i.e) a solution which is derived by considering all the cost and performance affecting factors. So in terms of meeting all this requirements properly, the entire process is handled by four divisions namely Query decomposition, data localization, global optimization, and local optimization [ distributed d s pdf]. In query decomposition as discussed above query is parsed and it produces a relational algebraic equivalence. Since the data used in distributed query processing is almost global, they are specified in the global relations irrespective to the data distribution information's. Since the data are specified in global relations, they are localized using data localizer programs. Data localizer, localizes the global relations and produces a fragmented query which is far optimal than the initial query. Then this fragmented query is passed to the query optimizer, which does the procedure as explained before to choose an ultimate execution plan. This part is followed by the global optimizer, which looks to bring out an optimal solution by suggesting the ordering of operations involved in the process, looking after the data communications between the sites and most importantly providing the correct choice of local and distributed algorithms for the database operations.
The optimizer's which follow this kind of approach will be considered to be comprised of three components: namely search space, search strategy and cost model. They proceed with their evaluations based on this set of components. Each of these components almost produces similar kind of result at the end but, they vary by process.
So far in this paper, the query optimization and needs are discussed and their importance is well known by the demands. Though it's a known fact that query optimization plays a vital part in booming up the system's performance it has its own difficulties in handling its process.
Here are some of the difficulties which are faced by the distributed query processing:
As explained above, the distributed query processing holds one separate optimizer called as global optimizer which optimizes the query by providing suggestion on ordering operation and selecting good algorithms for database operations. But, this global optimizer apart from its usage has some issues to be sorted such as the restrictions which are imposed on cost model, preference to sub set query, thick line of trade-off existence between the amount of money[cost] spent and the quality of the optimization and issue of optimization and re-optimization intervals.
Cost model issue:
Cost model serves as an important component in the global query optimization, which provides most required information about the access methods and its different impacts in the distributed database system. It also provides information about the database related to its physical schema and its attributes. Cost model is very much essential in terms of predicting the cost which will also estimate the cost of the alternative plans for the executable query. Then my making a proper comparison, a best executable path will be selected. Mostly the restrictions are made on to the cost model, restricting their effectiveness in optimizing the query in improving the throughput of the system. Cost models are planned in many formats based on their failures and performance statistics.
Subset query issue:
It's a known fact that due to the high demand on databases and its functions, the existence and availability of advanced query languages are more in existence In spite of all this advanced existence, the global query optimizer prefers using the sub set of the queries such as select, predict and joins with come conjunctives such as "AND" "BETWEEN" etc. This set of queries is important as they are capable of giving good optimal solution. Also used to the usage of this variety of [SPJ] query sets , a separate theory for joining and semi-join ordering is made in place for the betterment of usage. Some of the query sets which provide ultimate promise in giving optimal solutions are by sorting, union, aggregations etc.
Trade -off issue:
As discussed above the optimizer, takes optimization cost in account while optimizing the query. There are plenty of possibilities which would lead to give an possible trade off between optimization cost and the quality of optimization. Trade off between cost and quality of optimization is nothing but the proportionality which incurs between them, when there is a need for increasing the optimization on queries to find a perfect solution. (i.e.) the quality of the optimization can be increased if the cost and time are spent more on finding strategy to find the exact path. But at the mean time, when the cost value is considered high there arises the trade-off between quality and cost. So in this case it is affordable to spent more cost on achieving high quality for the most generally repeated queries. But at the mean time it's a mere waste for the queries such as "Ad hoc" which are used very rarely or once.