Modification of Indexing Using Query Pattern Change Detection Mechanism
High-dimensional databases pose a challenge with respect to efficient access. High-dimensional indexes do not work because of the often-cited “curse of dimensionality.” A potential solution is to use lower dimensional indexes that accurately represent the user access patterns. Query response times under a physical database design that is developed from a static snapshot of the query workload may degrade significantly if the query patterns change. To address these issues, we introduce a parameterizable technique to recommend indexes based on index types that are frequently used for high-dimensional data sets and to dynamically adjust indexes as the underlying query workload changes. We incorporate a query pattern change detection mechanism to determine when the access patterns have changed enough to warrant a change in the physical database design.
An increasing number of database applications such as business data warehouses and scientific data repositories deal with high-dimensional data sets. As the number of dimensions/attributes and the overall size of data sets increase, it becomes essential to efficiently retrieve specific queried data from the database in order to effectively utilize the database. Indexing support is needed to effectively prune out significant portions of the data set that are not relevant for the queries. Multidimensional indexing, dimensionality reduction, and Relational Database Management System (RDBMS) index selection tools all could be applied to the problem. However, for high-dimensional data sets, each of these potential solutions has inherent problems.
We address the high-dimensional database indexing problem by selecting a set of lower dimensional indexes based on the joint consideration of query patterns and data statistics. This approach is analogous to dimensionality reduction or feature selection, with the novelty that the reduction is specifically designed for reducing query response times rather than maintaining data energy, as is the case for traditional approaches. Our reduction considers both data and access patterns and results in multiple and potentially overlapping sets of dimensions rather than a single set. The new set of low-dimensional indexes is designed to address a large portion of expected queries and allows effective pruning of the data space to answer those queries.
Query pattern evolution over time presents another challenging problem. A pattern change could be the result of periodic time variation (for example, different database uses at different times of the month or day), a change in the focus of user knowledge discovery (for example, a researcher's discovery spawns new query patterns), a change in the popularity of a search attribute, or simply the random variation of query attributes. We introduce a dynamic mechanism to detect when the access patterns have changed enough that the introduction of a new index, the replacement of an existing index, or the construction of an entirely new index set is beneficial.
For each query, the query workload representation consists of the attribute sets that occur frequently over the entire query set and have nonempty intersections with the attributes of that query. To estimate the query cost, the data set is represented by a multidimensional histogram, where each unique value represents an approximation of data and contains a count of the number of records that match that approximation. For each possible index for each query, the estimated cost of using that index for the query is computed.
Initial index selection occurs by traversing the query workload representation and determining which frequently occurring attribute set results in the greatest benefit over the entire query set. This process is iterated until an indexing constraint is met or no further improvement is achieved by adding additional indexes. Analysis speed and granularity are affected by tuning the resolution of the abstract representations. The number of potential indexes considered is affected by adjusting the data mining support level. The size of the multidimensional histogram affects the accuracy of the cost estimates associated with using an index for a query.
In order to facilitate online index selection, we propose a control feedback system with two loops: a fine-grained control loop and a coarse control loop. As new queries arrive, we monitor the ratio of the potential performance to the actual performance of the system in terms of cost, and based on the parameters set for the control feedback loops, we make major or minor changes to the recommended index set.
A number of techniques have been introduced to address the high-dimensional indexing problem such as the X-tree and the GC-tree. Although these index structures have been shown to increase the range of effective dimensionality, they still suffer performance degradation at higher index dimensionality.
A query workload is a set of SQL data manipulation statements. The query workload should be a good representative of the types of queries that an application supports. First, one-dimensional candidate indexes are chosen. Then, a candidate index selection step evaluates the queries in a given query workload and eliminates from consideration those candidate indexes that would not provide a useful benefit. The remaining candidate indexes are evaluated in terms of the estimated performance improvement and index cost. The process is iterated for increasingly wider multicolumn indexes until a maximum index width threshold is reached or iteration does not yield any improvement in performance over the last iteration.
Automatic Index Selection
A cost model is used to identify beneficial indexes and decide when to create or drop an index at runtime. Costa and Lifschitz propose an agent-based database architecture for automatic index creation. Microsoft Research has proposed a physical-design alerter to identify when a change to the physical design could result in improved performance.
The overall goal of this work is to develop a flexible index selection framework that can be tuned to achieve effective static index selection and online index selection for high-dimensional data under different analysis constraints. For the static index selection, when no constraints are specified, the goal is to recommend the set of indexes that yields the lowest estimated cost for every query in a workload for any query that can benefit from an index. In cases where a constraint is specified either as the minimum number of indexes or a time constraint, we want to recommend a set of indexes within the constraint, from which the queries can benefit the most. When there is a time constraint, we need to automatically adjust the analysis parameters to increase the speed of analysis.
For the online index selection, the goal is to develop a system that can recommend an evolving set of indexes for incoming queries over time such that the benefit of index set changes outweighs the cost of making those changes. Therefore, an online index selection system that differentiates between low-cost index set changes and higher cost index set changes and can also make decisions about index set changes based on different cost-benefit thresholds is desirable.
While maintaining the original query information for later use to determine the estimated query cost, we apply one abstraction to the query workload to convert each query into the set of attributes referenced in the query. We perform frequent item set mining over this abstraction and only consider those sets of attributes that meet a certain support to be potential indexes. By varying the support, we affect the speed of index selection and the ratio of queries that are covered by potential indexes. We further prune the analysis space using association rule mining by eliminating those subsets above a certain confidence threshold. Lowering the confidence threshold improves the analysis time by eliminating some lower dimensional indexes from consideration but can result in recommending indexes that cover a strict superset of the queried attributes.
Our technique differs from existing tools in the method that we use to determine the potential set of indexes to evaluate and in the quantization-based technique that we use to estimate query costs. All of the commercial index wizards work at design time. The DBA has to decide when to run the wizard and over which workload. The assumption is that the workload is going to remain static over time, and if it changes, the DBA will collect the new workload and run the wizard again. The flexibility afforded by the abstract representation that we use allows our technique to be used either for infrequent index selection that considers a broader analysis space or for frequent online index selection.
Proposed Solution for Index Selection
The goal of the index selection is to minimize the cost of the queries in the workload, given certain constraints. We identify three major components in the index selection framework: the initialization of the abstract representations, the query cost computation, and the index selection loop.
Initialize Abstract Representations
The initialization step uses a query workload and the data set to produce a set of Potential Indexes P, a Query Set Q, and a Multidimensional Histogram H according to the support, confidence, and histogram size specified by the user.
The outputs and how they are generated are described as follows.
Potential index set P. This is a collection of attribute sets that could be beneficial as an index for the queries in the input query workload. This set is computed using traditional data mining techniques.
Considering the attributes involved in each query from the input query workload to be a single transaction, P consists of the sets of attributes that occur together in a query at a ratio greater than the input support. As the input support is decreased, the number of potential indexes increases. Note that our particular system is built independently of a query optimizer, but the sets of attributes appearing in the predicates from a query optimizer log could just as easily be substituted for the query workload in this step.
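The construction of P can be illustrated with a small sketch. It is not the authors' implementation: it enumerates attribute subsets by brute force as a stand-in for a real frequent item set miner, and the function name `potential_indexes` and the toy workload are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def potential_indexes(query_attr_sets, min_support):
    """Return {attribute set: support} for sets whose support exceeds min_support.

    Each query is treated as one transaction whose items are the attributes
    it references. Brute-force enumeration is used here only for clarity;
    it is exponential in the number of attributes per query.
    """
    n = len(query_attr_sets)
    counts = Counter()
    for attrs in query_attr_sets:
        attrs = sorted(set(attrs))
        for k in range(1, len(attrs) + 1):
            for subset in combinations(attrs, k):
                counts[frozenset(subset)] += 1
    return {s: c / n for s, c in counts.items() if c / n > min_support}


# Hypothetical workload: each set lists the attributes referenced by one query.
workload = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
P = potential_indexes(workload, min_support=0.4)
```

With this toy workload, lowering `min_support` from 0.4 to 0.2 enlarges P, which matches the behavior described above.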
In order to enhance analysis speed with limited effect on accuracy, the input confidence is used to prune the analysis space. Confidence is the ratio of a set's occurrence to the occurrence of a subset.
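A minimal sketch of confidence-based pruning follows, under the assumption that a subset is dropped when some frequent proper superset has confidence strictly above the threshold. The function name `prune_by_confidence` and the support values are hypothetical, and the exact pruning rule used by the authors may differ.

```python
def prune_by_confidence(support, min_confidence):
    """Drop attribute sets that are nearly always queried as part of a larger set.

    support: {frozenset of attributes: support ratio}
    Confidence of (subset -> superset) is support[superset] / support[subset].
    A subset is removed if any proper superset exceeds min_confidence.
    """
    kept = {}
    for subset, sub_sup in support.items():
        dominated = any(
            subset < superset and sup / sub_sup > min_confidence
            for superset, sup in support.items()
        )
        if not dominated:
            kept[subset] = sub_sup
    return kept


# Hypothetical supports for five frequent attribute sets.
support = {
    frozenset("a"): 0.75,
    frozenset("b"): 0.75,
    frozenset("c"): 0.5,
    frozenset("ab"): 0.5,
    frozenset("bc"): 0.5,
}
kept_high = prune_by_confidence(support, min_confidence=0.9)
kept_low = prune_by_confidence(support, min_confidence=0.6)
```

In this example, attribute c only ever appears together with b, so it is pruned at the higher threshold. At the lower threshold only the two-attribute sets remain, so a query on a alone could be served only by an index over a superset of its attributes.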
Multidimensional histogram H. An abstract representation of the data set is created in order to estimate the query cost associated with using each query's possible indexes to answer that query. This representation is in the form of a multidimensional histogram H. A single bucket represents a unique bit representation across all the attributes represented in the histogram.
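One way to picture such a histogram is sketched below. It assumes equal-width quantization of each attribute into 2^b cells, which is a choice made for this illustration and is not specified in the text; `build_histogram` and the sample records are likewise hypothetical.

```python
from collections import Counter


def build_histogram(records, mins, maxs, bits):
    """Count records per bucket, where a bucket is a tuple of per-attribute codes.

    Each attribute value is quantized to an integer code of bits[i] bits
    using equal-width cells over [mins[i], maxs[i]].
    """
    hist = Counter()
    for rec in records:
        key = []
        for x, lo, hi, b in zip(rec, mins, maxs, bits):
            cells = 1 << b
            code = int((x - lo) / (hi - lo) * cells)
            key.append(min(code, cells - 1))  # clamp the maximum value into the last cell
        hist[tuple(key)] += 1
    return hist


# Four two-attribute records, two bits per attribute.
records = [(0.1, 5.0), (0.15, 5.5), (0.9, 1.0), (0.5, 9.9)]
H = build_histogram(records, mins=(0.0, 0.0), maxs=(1.0, 10.0), bits=(2, 2))
```

Only nonempty buckets are stored, so the size of H is bounded by the number of records rather than by the number of possible bit patterns.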
Index Selection Notation List
Query Cost Calculation
Once generated, the abstract representations of the query set Q and the multidimensional histogram H are used to estimate the cost of answering each query by using all possible indexes for the query.
We apply a cost estimate that is based on the actual matches that occur over the multidimensional histogram over the attributes that form a potential index. The cost model for R-trees that we use in this work is a function of d and m, where d is the dimensionality of the index, and m is the number of matches returned for query matching attributes in the multidimensional histogram. The cost estimate provided is conservative in that it will provide a result that is at least as great as the actual number of matches in the database.
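The match count m can be sketched as follows. The code counts every record in any bucket that the query touches on the indexed attributes, which is why the estimate can only overcount. The function `estimate_matches`, the bucket encoding, and the sample query are assumptions for illustration, and the R-tree cost formula itself is not reproduced here.

```python
def estimate_matches(hist, index_dims, query_codes):
    """Estimate the number of matches m for a query using an index on index_dims.

    hist: {bucket key (tuple of codes): record count}
    query_codes: {dimension: (lo_code, hi_code)}, inclusive quantized ranges
    Only dimensions covered by the index can prune buckets; a bucket counts
    in full if it overlaps the query on every indexed, queried dimension.
    """
    total = 0
    for key, count in hist.items():
        if all(
            query_codes[d][0] <= key[d] <= query_codes[d][1]
            for d in index_dims
            if d in query_codes
        ):
            total += count
    return total


# Hypothetical histogram over two attributes and a range query on both.
H = {(0, 2): 2, (3, 0): 1, (2, 3): 1}
query = {0: (0, 2), 1: (3, 3)}
m_both = estimate_matches(H, index_dims=(0, 1), query_codes=query)
m_first = estimate_matches(H, index_dims=(0,), query_codes=query)
```

Here the two-dimensional index prunes more of the data than the one-dimensional index on the first attribute, at the price of a higher index dimensionality d in the cost model.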
Index Selection Loop
The potential index i that yields the highest improvement over the query set Q is considered to be the best index. Index i is removed from the potential index set P and is added to the suggested index set S. For the queries that benefit from i, the current query cost is replaced by the improved cost. After each index i is selected, a check is made to determine if the index selection loop should continue. The input indexing constraints provide one of the loop stop criteria. The indexing constraint could be any constraint such as the number of indexes, total index size, or the total number of dimensions indexed. If no potential index yields further improvement or the indexing constraints have been met, then the loop exits. The set of suggested indexes S contains the results of the index selection algorithm.
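The loop described above is a greedy selection, which can be sketched as follows. The sketch assumes that per-query cost estimates are already available and uses a maximum number of indexes as the only indexing constraint; `select_indexes` and the cost figures are hypothetical.

```python
def select_indexes(costs, base_cost, max_indexes=None):
    """Greedily move the most beneficial index from P to S until no gain remains.

    costs: {index name: {query: estimated cost when using that index}}
    base_cost: {query: cost without any index}
    Returns the suggested index list S and the final per-query costs.
    """
    current = dict(base_cost)
    P = set(costs)
    S = []

    def benefit(i):
        # Total cost reduction over all queries that index i would improve.
        return sum(max(0, current[q] - c) for q, c in costs[i].items())

    while P and (max_indexes is None or len(S) < max_indexes):
        best = max(sorted(P), key=benefit)
        if benefit(best) <= 0:
            break  # no potential index yields further improvement
        P.remove(best)
        S.append(best)
        for q, c in costs[best].items():
            current[q] = min(current[q], c)
    return S, current


base = {"q1": 100, "q2": 100, "q3": 100}
costs = {
    "a": {"q1": 40, "q2": 60},
    "ab": {"q1": 10, "q2": 80},
    "c": {"q3": 70},
}
S, final_costs = select_indexes(costs, base)
S_limited, _ = select_indexes(costs, base, max_indexes=2)
```

Because costs are updated after each selection, the benefit of index a shrinks once ab has been chosen, so c is picked before a in this example.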
Proposed Solution for Online Index Selection
The online index selection is motivated by the fact that query patterns can change over time. In our approach, we use control feedback to monitor the performance of the current set of indexes for incoming queries and determine when adjustments should be made to the index set. In a typical control feedback system, the output of a system is monitored, and based on some function involving the input and output, the input to the system is readjusted through a control feedback loop.
In our implementation of dynamic index selection, the system input is a set of indexes and a set of incoming queries. Our system simulates and estimates costs for the execution of incoming queries. System output is the ratio of the potential system performance to the actual system performance in terms of database page accesses to answer the most recent queries. We implement two control feedback loops. One is for fine-grained control and is used to recommend minor inexpensive changes to the index set. The other loop is for coarse control and is used to avoid very poor system performance by recommending major index set changes. Each control feedback loop has decision logic associated with it.
Fine-Grained Control Loop
The fine-grained control loop is used to recommend low-cost minor changes in the index set. This loop is entered in case 2, as described above, when the ratio of the hypothetical performance to the actual performance is below some input minor-change threshold. Then, the indexes are changed to I_new, and appropriate changes are made to update the system data structures. Increasing the input minor-change threshold causes the frequency of minor changes to also increase.
Coarse Control Loop
The coarse control loop is used to recommend changes that are more costly but with greater impact on the future performance of the index set. This loop is entered in case 4, as described above, when the ratio of the hypothetical performance to the actual performance is below some input major-change threshold. Then, the static index selection is performed over the most recent queries, abstract representations are recomputed, and a new set of suggested indexes I_new is generated. Appropriate changes are made to update the system data structures to the new situation. Increasing the input major-change threshold increases the frequency of major changes.
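The decision logic of the two loops can be approximated by the simplified sketch below. It assumes that the monitored ratio is the hypothetical cost divided by the actual cost over a window of recent queries, so that values well below 1 indicate that a better index set exists, and that the major-change threshold is lower than the minor-change threshold. The function name `feedback_decision` and the cost figures are hypothetical, and the authors' case analysis is not reproduced.

```python
def feedback_decision(actual_costs, hypothetical_costs, minor_threshold, major_threshold):
    """Decide whether recent queries warrant a minor or major index set change.

    actual_costs: estimated page accesses per recent query with the current indexes
    hypothetical_costs: estimated page accesses for the same queries with
        the candidate index set
    Returns (action, ratio) where action is "none", "minor", or "major".
    """
    actual = sum(actual_costs)
    hypothetical = sum(hypothetical_costs)
    ratio = hypothetical / actual if actual else 1.0
    if ratio < major_threshold:
        return "major", ratio  # coarse loop: rerun static index selection
    if ratio < minor_threshold:
        return "minor", ratio  # fine-grained loop: small index set adjustment
    return "none", ratio


window_actual = [100, 100]
stable, _ = feedback_decision(window_actual, [95, 95], 0.9, 0.5)
drifting, _ = feedback_decision(window_actual, [80, 80], 0.9, 0.5)
shifted, _ = feedback_decision(window_actual, [30, 40], 0.9, 0.5)
```

Raising either threshold makes the corresponding branch fire for smaller gaps between hypothetical and actual cost, which corresponds to the more frequent changes noted above.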
A flexible technique for index selection is introduced, which can be tuned to achieve different levels of constraints and analysis complexity. A low-constraint, more complex analysis can lead to more accurate index selection over stable query patterns. A more constrained, less complex analysis is more appropriate to adapt index selection to account for evolving query patterns. The technique uses a generated multidimensional histogram to estimate cost. These experiments have shown great opportunity for improved performance using adaptive indexing over real query patterns. A control feedback technique is introduced for measuring the performance and indicating when the database system could benefit from an index change. By changing the threshold parameters in the control feedback loop, the system can be tuned to favor analysis time or pattern change recognition. The foundation provided here will be used to explore this trade-off and to develop an improved utility for real-world applications. The proposed technique affords the opportunity to adjust indexes to new query patterns.
S. Ponce, P.M. Vila, and R. Hersch, “Indexing and Selection of Data Items in Huge Data Sets by Constructing and Accessing Tag Collections,” Proc. 19th IEEE Symp. Mass Storage Systems and 10th Goddard Conf. Mass Storage Systems and Technologies, 2002.
C.-W. Chung and G.-H. Cha, “The GC-Tree: A High-Dimensional Index Structure for Similarity Search in Image Databases,” IEEE Trans. Multimedia, vol. 4, no. 2, pp. 235-247, June 2002.
K. Whang, “Index Selection in Relational Databases,” Proc. Second Int'l Conf. Foundations on Data Organization (FODO '85), 1985.
E. Barucci, R. Pinzani, and R. Sprugnoli, “Optimal Selection of Secondary Indexes,” IEEE Trans. Software Eng., 1990.
S. Choenni, H. Blanken, and T. Chang, “On the Selection of Secondary Indexes in Relational Databases,” Data and Knowledge Eng., 1993.
A. Caprara, M. Fischetti, and D. Maio, “Exact and Approximate Algorithms for the Index Selection Problem in Physical Database Design,” IEEE Trans. Knowledge and Data Eng., 1995.
A. Dogac, A.Y. Erisik, and A. Ikinci, “An Automated Index Selection Tool for Oracle7: Maestro 7,” Technical Report LBNL/PUB-3161, Software Research and Development Center, Scientific and Technical Research Council of Turkey (TUBITAK), 1994.
J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets,” Proc. ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery '00, pp. 21-30, 2000.
C. Bohm, “A Cost Model for Query Processing in High Dimensional Data Spaces,” ACM Trans. Database Systems, vol. 25, no. 2, pp. 129-178, 2000.