# Selectivity estimation for spatial joins

A spatial join is a requirement where two or more datasets are joined based on their spatial relations (Debnath B, undated). Usually a point is defined by its minimum bounding rectangles. It was in 1986 that Orenstein first came up with the technique for spatial joins (Orenstein, 1986). This is the idea that using a grid as a space it then becomes divided up into pixels. These pixels are ordered at a location in space and are then joined by an intersect with a rectangle.

This study focuses on an algorithm that intends to save time when exact answers or precise accuracy is not needed when creating a spatial join. Spatial queries are carried out on a spatially indexed database with the intention of obtaining answers that depend on spatial relationships between data items (Larson, R. R, 1996). This is referring to 'exact' and potentially time consuming queries. When a dataset is very large and exact answers may not be needed an estimation algorithm could be implemented. With regards to business, time can be seen as money and effort and a reduction in the time to gain an adequate answer may be beneficial. Bae et al present 'an incremental refining spatial join (IRSJ) algorithm' which is used to provide results from the query estimates and at the same time providing incrementally refined confidence intervals for these estimates. The idea is to spend less computation time and more time analysing the data to its full potential whilst maintaining a predefined level of accuracy through an incremental refining process.

There are several types of spatial relationships including intersection as shown in the example above, containment, and adjacency; however the focus of Bae's study and the algorithm I am concerned with focuses on finding an estimate of the number of intersections.

Hass, P. J, leads the way regarding query processing algorithms and showed in Hass, 1997 how 'new and existing central limit theorems, simple bounding arguments, and the delta method can be used to derive formulas for both large sample and deterministic confidence intervals' Essentially Hass et al were the pioneers of this algorithm and came up with an interface for online relational aggregation to give the user the ability to control the query execution process (Hass, 1997).

A different and possibly even faster way of estimating join selectivity is Ning An, Zhen-Yu Yang and Anand Sivasubramaniam use of sampling and histogram based techniques which they believe to be less restrictive. They present Geometric Histograms that gain an accurate estimate of spatial joins by showing a distribution of rectangles in space using the 'Orenstein' grid based data sequences. The algorithm for estimating spatial join selectivity considering sets of rectangles as datasets has been furthered by Bellussi et al (2004). They have devised a new algorithm that is independent of the distribution of the rectangles in space that in turn produces an auxiliary structure with an order of magnitude smaller than the corresponding histogram.

As mentioned above spatial relations are found using rectangles. Spatial relations are usually indexed using R-trees (Guttman, A, 1984). An R-tree is a structure of rectangles and when the algorithm is executed it finds the relevant rectangles that are intersecting the query that has been carried out. The algorithm IRSJ intends to give a period of time or a magnitude quicker than the exact rectangle join. It attains the goal of shortened time by using query estimates gathered from a subset of the full join by random sampling. The sampling used is tuple-level or single record level sampling and page-level which sample's the whole page. The data has to be randomly sampled or the algorithm will not have correct query and confidence levels.

The algorithm itself consists of three main steps: random sampling, spatial joining, and refining the estimation of the query result.

The code for the algorithm above as I understand it shows a For...Do loop that could be carried out on programming software such as Visual Basic. The loop runs through the data randomly sampling either by tupile level or page level sampling from the R-tree (shown as L on the algorithm diagram) and the rectangle from a tuple, or record (shown as M) until it hits upon the desired confidence interval level (shown as Cf).

The algorithm has been adapted for use on real datasets. A query was required to join mineral resources with geochemical unconsolidated sediments in the US (Bae et al). The query then provides the estimated number of intersections of mineral resources and geochemical unconsolidated sediments with 95% accuracy. 'The time taken to obtain an exact answer is more than an order of magnitude greater than that needed to get accuracy within 5% using IRSJ' (Bae et al) They have proved the benefits to users that a lot of time can be saved if a 95% accuracy is all one wants.

Future developments on this algorithm for estimation require detailed sampling tests. Finding out which of the sampling tests are best will in turn lead to the best estimate output of the data.

Spatial join research has mainly been centred around topological relations like intersections, and distance relations. More recently 'direction joins' have become increasingly studied for geospatial database applications in GIS (Xiao YQ, Li ZH, Jing N, 2003). Most join algorithms have focused on two-way joins and have been region intersection joins however the use of 'mutli-way' spatial joins has come into contention (Zhu HJ, Su JW, Ibarra OH, 2001).

The second algorithm that I am concerned with is an Algorithm for Interpolating Irregularly-Spaced Data for the application of terrain modelling. An understanding of spatial relationships plays a large part within a comprehensive GIS (Frank, 1991). This is because new meaningful information can be gained from the raw data (Egenhofer, 1994). The main purpose for terrain modelling is to provide estimates of height at random points on a surface. It seems there are many methods for an estimation of point heights or 'surface interpolation' as it is known, some of which are natural neighbour interpolation (Sibson, R, 1981), polynominal interpolation (Barnhill R.E et al, 1975) and bicubic spline interpolation (Holroyd, M.T et al, 1970), however I am concerned with Delauney Triangulation, invented by Boris Delaunay in 1934 (Delaunay B, 1934). Delaunay triangulation and Voronoi diagrams (Voronoi, 1908) spatially capture proximity relationships for a set of points (Bandyopadhyay D and Snoeyink J, 2004). The two methods are closely related to each other and Voronoi diagrams show a correspondence with Delauney triangulation for the same set of points (Aurenhammer F, 1991). Voronoi diagrams create shapes from their generating point meaning any other point inside the shape is closer to this generating point then to any other. Delauney triangulation provides the same outcome by connecting all generating points sharing a common triangular edge. Interpolating at a location involves finding which triangle the point lies within and then an estimation of height by linear interpolation can be carried out defined by the three vertices forming the triangle (Mucke et al, 1999).

There are several effective algorithms for creating a Delaunay triangle or a Voronoi diagram (Fortune, 1987, Watson, 1981). The Wotson algorithm is incremental and works by adding new points to a Voronoi diagram of a subset. The fortune algorithm is a plane sweep algorithm for generating a Voronoi diagram from a set of points in time and space. This involves a sweep line and a beach line that moves across a plane incorporating the points the line moves over into a Voronoi diagram. When the line has passed a set of points can be made with equal distance between the centre point and the sweep line, with the beach line being the outer edge of these points (Fortune, 1987)

Odgaard A, Nielsen, K, 2000 give a clear example of what a Voronoi diagram could be used for. When a city has a certain amount of telephone masts and the nearest mast is the one needing to be connected, the Voronoi diagram can split the areas up so there is one mast within each area and at any point within this area it will be closest to the central point, or the 'generating point'.

H. Ledoux and C. Gold attempt to show how these methods can be used for modelling of three dimensional datasets. They found problems with triangulation and a heightened degree of error if there was already some error with the 2D triangulation.

Gopi et al, 2001 show a local Delaunay triangulation-based method, where points are individually triangulated and the final mesh is obtained by merging all individual fans. Buchart C et al aimed to further this algorithm and present a GPU Local triangulation shown as a flow diagram below.

Essentially what has been presented is a GPU implementation of a surface reconstruction method. They showed that efficient GPU implementation is possible for an interpolating reconstruction through the use of local Delaunay triangulations (C. Buchart et al). They go on to suggest most research has been regarding General Purpose GPU to take advantage of the increasing computational technologies and their specific set of instructions for computer graphics.

With increasingly complex surface reconstruction models, more triangles are needed. At certain points high/low resolutions can be used to help ease memory. A new interpolation wavelet filter has been used in two steps, splitting and elevation for this process (Pradhan B et al, 2007). The aim is to provide better terrain models when dealing with large amounts of data.

### References

- An N., Yang, Z., Sivasubramaniam, A. Selectivity estimation for spatial joins. In ICDE. (2001) 368-375
- Aurenhammer F. Voronoi diagrams: A survey of a fundamental geometric data structure. ACM Computing Surveys, Vol. 23 (1991), pp. 345-405
- Bandyopadhyay D, Snoeyink J. Symposium on Discrete Algorithms. Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms. SESSION: Session 5A, 410 - 419. Society for Industrial and Applied Mathematics.Philadelphia, PA, USA 2004
- Barnhill, RE, Gregory J A Polynomial Interpolation to Boundary Data on Triangles, Mathematics of Computation, Number 29. 1975
- Belussi. A, Bertino. E, Nucita A. Query processing and optimization. In Proceedings of the 12th annual ACM international workshop on Geographic information systems. 2004. 92 - 100
- Biplob Kumar Debnath, Department of Electrical and Computer Engineering, University of Minnesota. Spatial Join. Undated
- Delaunay B: Sur la sphre vide, Izvestia Akademii Nauk SSSR, Otdelenie Matematicheskikh i Estestvennykh Nauk, 7:793-800, 1934
- Egenhofer, Max J., Clementini, E. and di Felice, P. 1994, Topological relations between regions with holes, International Journal of Geographical Information Systems, 129- 142.
- Frank, A., 1991. Qualitative spatial reasoning about cardinal directions. Auto-Carto 10, 148- 169.
- Fortune, S. 1987. A sweepline algorithm for Voronoi diagrams. Algorithmica, pp. 153-174.
- Guttman, A. R-trees: a Dynamic Index Structure for Spatial Searching. In Proc. ACM SIGMOD. (1984) 45-57
- Gopi M., Krishnan S., Silva C. T. Surface reconstruction based on lower dimensional localized Delaunay triangulation. Computer Graphics Forum 19, 3 2000.
- Hass, P. J. Large-sample and deterministic confidence intervals for online aggregation. SSDM. (1997) 51-63
- Holroyd, M T, Bhattacharya, B K., Automatic Contouring of Geophysical Data Using Bicubic Spline Interpolation. Dept. of Energy, Mines, and Resources Publications. Geological Survey of Canada, 1970.
- Larson, R. R. Geographic Information Retrieval and Spatial Browsing. GIS and Libraries. University of Illinois. (1996) 81-127
- Mu¨cke, E. P, Saias, I., and Zhu, B, 1999. Fast randomized point location without preprocessing in two- and three-dimensional Delaunay triangulations. Computational GeometryTheory and Applications pp. 63-83
- Orenstein J. (1986). Spatial Query Processing in an Object-Oriented Database System. Proceedings of the ACM SIGMOD Conference, pages 326-336 Odgaard A, Nielsen, K, (2000). A visual implementation of Fortune's Voronoi algorithm. http://www.diku.dk/hjemmesider/studerende/duff/Fortune/. Accessed on 23/10/09
- B. Pradhan, K. Sandeep, Shattri Mansor, Abdul Rahman Ramli, Abdul Rashid B. Mohamed Sharif. Second generation wavelets based GIS terrain data compression using Delaunay triangulation. Engineering Computations. Bradford: 2007. Vol. 24, Iss. 2; pg. 200
- Sibson, R A. Brief History of Natural Neighbour Interpolation. In Barnett, V. Interpreting Multivariate Data, John Wiley & Sons, New York. 1981
- Voronoi, M. G., 1908, Nouvelles applications des parameters continus ~ la theorie des formes quadratiques: Jour. Reine Angew. Math., v. 134, p. 198-287.
- Watson, D. F., 1981. Computing the n-dimensional Delaunay tessellation with application to Voronoi polytopes. Computer Journal, 24:pp. 167-172.
- Xiao YQ, Li ZH, Jing N proceedings of the 2003 IEEE international conference on information reuse and integrationpp104-11. 2003
- Zhu HJ, Su JW, Ibarra OH. Advances in spatial and temporal databases, proceedings. Book series: lecture notes in computer science. Volume: 2121, pp: 217-235, 2001