This paper presents a measure for image similarity based on local feature descriptions and geometric constraints. We show that this similarity can be used to build an appearance graph representation of the environment of a mobile robot. This graph can represent semantic information about the space and can be used for visual navigation. The image similarity measure is robust to occlusions by people in the neighbourhood of the robot. (Kröse 2007)
An internal representation of the environment is needed for optimal mobile robot navigation. Traditionally such a model is represented as a geometric model indicating admissible and non-admissible areas. The robot has to know its location within such a model and most of the time has to estimate the parameters of the model simultaneously (SLAM). (Kröse 2007)
Now that cameras and processing power are becoming cheaper, visual information is used more often in environment modeling. For example, visual features are used to solve the loop closing problem in geometric SLAM. A step further are approaches which model the environment only in appearances, in contrast to explicit geometric representations of space. In this paper we present our recent work on appearance modeling of the environment. On the basis of a set of omnidirectional camera images an 'appearance graph' is constructed.
This graph can be used for navigation and for categorization. A prerequisite for making the graph is a good similarity measure between images. (Kröse 2007)
Lynch's studies were based on 'salient' objects, or buildings which are (visually) perceived by humans. 'Cognitive maps' have also been introduced in the field of machine intelligence. Kuipers defines a cognitive map as a layered model consisting of the identification and recognition of landmarks and places from local sensory information, control knowledge of routes, a topological model of connectivity and metrical descriptions of shape, distance, direction, orientation, and local and global coordinate systems. In the literature on spatial representations in humans, there is clear evidence that visual information might be a basis of spatial models. (Kröse 2007)
III. APPEARANCE BASED REPRESENTATIONS FOR ROBOTICS
Traditionally robots use a two-dimensional, geometrically accurate representation of a three-dimensional space; a 'mapping' from world coordinates to an indicator which tells whether the position is occupied or not. Sensors to make such maps are typically range sensors such as sonar or laser range scanners. Scanning range sensors make it possible to make 3D geometric representations, sometimes augmented with appearance information from a camera. More recently, computer vision techniques have been presented in which sets of images are used to make 3D representations (structure from motion).
Dense methods have been presented, as well as 3D reconstruction from local salient features ('landmarks'). On-line simultaneous localization and reconstruction of visual landmark positions has also been presented, but currently only for small scale environments.
In addition to the metric mapping it is common to represent the environment in terms of a topological map: distinct places are coded as nodes in a graph structure with edges which encode how to navigate from one node to the other. The nodes and the edges are usually enriched with some local metric information. Mostly such topological maps are derived from geometric maps. However, recently visual information has also been used to characterize nodes. All approaches described above use vision to reconstruct a 3D (or 2D) representation of the environment of the robot.
The question is whether we can also use models that do not try to recover a 2D or 3D representation of space. In machine vision, appearance modeling of objects was introduced about 10 years ago. Nayar showed that an object could be modeled as the set of views of all different poses w.r.t. the camera. In feature space these views form a low dimensional manifold. An unknown object is classified by finding the nearest manifold. Class label and pose are recovered simultaneously (see Fig. 2). For environment modeling, appearance models of space have also been presented. In these approaches, the environment is modeled as an 'appearance map' that consists of a collection of camera (or other sensor) readings obtained at known poses (positions and orientations).
These methods have been shown to be able to localize a mobile robot, but have the problem that supervised data, consisting of images and corresponding poses, are needed. (Kröse 2007)
IV. THE APPEARANCE GRAPH REPRESENTATION
In our current approach to appearance modeling we avoid the problem of a supervised training set. We collect a set of camera images and use this image set to construct an 'appearance graph'.
In an appearance graph each vertex or 'node' represents a pose (which we do not know) and is characterized by the camera image taken at that location. An edge between two nodes is defined if the two images are sufficiently similar. As we will see in the next section, the similarity checks whether it is possible to perform 3D reconstruction of the local space from the two corresponding images. The idea behind this is that we want to have a similarity measure which states that similar images are taken at adjacent positions. The appearance graph contains in a natural way the information about how the space in an indoor environment is separated by the walls and other barriers. Images from a convex space, for example a room, will have many connections between them and just a few connections to images from another space, for example a corridor, that is connected with the room via a narrow passage, for example a door. As a result, from n images we obtain a graph that is described by a set of n nodes V and a symmetric matrix S called the 'similarity matrix'. For each pair of nodes i, j ∈ {1, ..., n} the value of the element S_ij of S defines the similarity of the nodes.
Note that in constructing the graph no information about the positions of the nodes is used. We used these only for visualization. (Kröse 2007)
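The graph described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_similarity_matrix` and `edges_from_similarity`, the `similarity` callable and the `threshold` parameter are hypothetical and do not come from the paper, whose actual similarity is the geometric measure of the next section.

```python
from itertools import combinations


def build_similarity_matrix(images, similarity):
    """Return the n x n symmetric matrix S with S[i][j] = similarity(images[i], images[j]).

    `similarity` is any symmetric pairwise score (an assumption of this sketch).
    No pose information is used anywhere.
    """
    n = len(images)
    S = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        s = similarity(images[i], images[j])
        S[i][j] = s
        S[j][i] = s  # enforce symmetry
    return S


def edges_from_similarity(S, threshold):
    """List the node pairs (i, j), i < j, whose similarity exceeds the threshold."""
    n = len(S)
    return [(i, j) for i, j in combinations(range(n), 2) if S[i][j] > threshold]
```

For example, if each toy "image" is a set of landmark ids and the similarity is the number of shared ids, images taken close together share many ids and become connected, while an image from another room stays isolated.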
V. OUR IMAGE SIMILARITY MEASURE
Extensive work on image similarity has been presented in the field of image database retrieval. Methods developed in this field are generally based on local features, or visual landmarks.
Popular methods try to find a vocabulary of 'codebook' vectors which are used for image matching. These approaches have also been introduced in the field of robotics. Note that these methods remove all geometric information from the similarity measure. Our image similarity measure is based on the knowledge that images are taken from a moving camera in an environment. This is exactly what makes robotics different from most image retrieval methods. As mentioned earlier, our image similarity measure indicates whether a 3D reconstruction is possible between the images. The method for 3D reconstruction is based on local salient features as landmarks. Currently we use the SIFT feature detector.
The SIFT feature detector extracts the scale of the feature point and describes the local neighborhood of the point by a 128-element rotation and scale invariant vector. This vector descriptor is robust to some light changes, which makes it appropriate for our application. The method for computing the similarity between two images is split into two parts:
1) Are there matching landmarks in the two images, and
2) Do these landmarks fulfill the epipolar constraint? (Kröse 2007)
A. Matching Landmarks
Visual landmarks are often used in robotics for navigation. It has been shown that it is possible to reconstruct both the camera poses and the 3D positions of the landmarks by matching (or tracking) landmarks through images. On-line simultaneous localization and reconstruction of landmark positions has also been presented, but currently only for small scale environments.
In this paper we consider the general case when we start with a set of unordered images of the environment. This is similar to cases described in earlier work. First we check if there are many similar (repetitive) landmarks within each image separately. Such landmarks could potentially lead to false matches. We discard those landmarks that have 6 or more similar instances in the same image.
Then, for a landmark from one image we find the best and the second best matching landmark from the second image.
The goodness of the match is defined by the Euclidean distance between the landmark descriptors. If the distance of the best match is less than 0.8 times the distance of the second best match, the match is considered very distinctive. According to reported experiments, this typically discards 95% of the false matches and less than 5% of the good ones. This is repeated for each pair of images and is computationally very expensive. Fast approximate methods have also been discussed. The fundamental matrix relating two images can be estimated with an algorithm which requires at least 8 matching points. With such a small number of false matches it is possible to use the robust M-estimator directly. The whole procedure goes as follows:
• extract SIFT landmarks from all images
• discard self similar landmarks within each image
• find distinctive matches between pairs of images
• if there are more than 8 matches:
  • estimate the fundamental matrix using M-estimator and RANSAC
  • if there are still more than 8 matches then there is an edge in the graph
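The procedure above can be sketched as follows in plain Python. This is a hedged illustration, not the authors' implementation: descriptors are plain tuples of numbers rather than real SIFT vectors, `radius` (the distance below which two landmarks in one image count as "similar") is an assumed parameter, counting only *other* landmarks as similar instances is an assumption, and `count_inliers` is a hypothetical callable standing in for the fundamental matrix estimation, returning how many matches satisfy the epipolar constraint.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def discard_self_similar(descriptors, radius, max_similar=6):
    """Drop landmarks with max_similar or more similar landmarks in the same image."""
    kept = []
    for i, d in enumerate(descriptors):
        similar = sum(
            1 for j, e in enumerate(descriptors)
            if j != i and euclidean(d, e) < radius
        )
        if similar < max_similar:
            kept.append(d)
    return kept


def distinctive_matches(desc_a, desc_b, ratio=0.8):
    """Keep a match only if the best distance is below ratio * second best distance."""
    matches = []
    if len(desc_b) < 2:
        return matches  # no second best match to compare against
    for i, d in enumerate(desc_a):
        dists = sorted((euclidean(d, e), j) for j, e in enumerate(desc_b))
        (best, j), (second, _) = dists[0], dists[1]
        if best < ratio * second:
            matches.append((i, j))
    return matches


def has_edge(desc_a, desc_b, count_inliers, min_matches=8):
    """Decide whether two images are connected in the appearance graph."""
    matches = distinctive_matches(desc_a, desc_b)
    if len(matches) <= min_matches:
        return False
    # count_inliers stands in for the geometric verification step (assumption).
    return count_inliers(matches) > min_matches
```

Note that `distinctive_matches` compares every descriptor in one image with every descriptor in the other, which reflects the high computational cost mentioned in the text; a real system would replace the exhaustive search with an approximate nearest neighbour method.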
VI. CLUSTERING IN THE GRAPH
Using the above presented similarity measure, we are able to make a graph-like representation of the environment. The graph can be considered as a low-level topological map, with the vertices (nodes) indicating omnidirectional images and the links indicating a similarity between the images. By clustering in this representation we are able to come to a higher level topological map, with clusters indicating regions in which the nodes are very similar.
The graph (V, S) gives this information about the structure of the environment. Convex spaces contain nodes which are highly interconnected, and doorways will have nodes with fewer connections. The set of nodes V can be divided into subsets by cutting a number of edges. There exist different graph cut mechanisms. Our approach is a fast approximate solution to the normalized graph cut method. In Fig. 5 it can be seen that the method results in meaningful clusters. The nodes inside the rooms get the same label, and each room gets its own label. The hallway is divided into four regions, which indicates that the appearance is not uniform in the hallway.
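To make the cut criterion concrete, the sketch below evaluates the standard normalized cut value of a two-way partition, Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), directly on a similarity matrix. It is not the fast approximate solution referred to in the text; it only scores a given partition. The function name `ncut_value` is hypothetical, and the sketch assumes that both parts are non-empty and that each contains at least one node with a non-zero similarity.

```python
def ncut_value(S, part_a):
    """Normalized cut value of the partition (part_a, rest) of the graph with matrix S."""
    n = len(S)
    a = set(part_a)
    b = set(range(n)) - a
    if not a or not b:
        raise ValueError("both parts of the partition must be non-empty")
    # Total similarity on the edges that are cut.
    cut = sum(S[i][j] for i in a for j in b)
    # Total similarity from each part to all nodes in the graph.
    assoc_a = sum(S[i][j] for i in a for j in range(n))
    assoc_b = sum(S[i][j] for i in b for j in range(n))
    return cut / assoc_a + cut / assoc_b
```

For two fully connected groups of three nodes joined by a single edge (two "rooms" and a "door"), separating the groups gives a much lower value than splitting off a single node, which is why minimizing this criterion tends to cut at narrow passages.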
We use the clustered representation to obtain a semantic description of space by Human Robot Interaction. We describe a situation where the robot is guided around by a user, while the user occasionally gives a label to a location (for example: 'corridor', 'living room'). The image taken at that location is labeled with that label. By using our clustering method, all images (nodes) in a cluster obtain the same label. (Achar S. 2008)
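The labeling step can be illustrated with a small Python sketch. The function name `propagate_labels` and the data layout (a list giving the cluster id of every node, and a dictionary of user-labeled nodes) are assumptions for illustration. The text does not say what happens when a user gives two different labels inside one cluster; here the first label given is kept, which is an arbitrary choice.

```python
def propagate_labels(cluster_of, user_labels):
    """Give every node the label that the user assigned to some node in its cluster.

    cluster_of:  list where cluster_of[node] is the cluster id of that node
    user_labels: dict mapping a few node indices to user-provided labels
    Nodes in clusters without any user label get None.
    """
    cluster_label = {}
    for node, label in user_labels.items():
        # Keep the first label seen for a cluster (assumption of this sketch).
        cluster_label.setdefault(cluster_of[node], label)
    return [cluster_label.get(c) for c in cluster_of]
```

With five images in three clusters and two user labels, the two labeled clusters are filled in completely and the third stays unlabeled until the user visits it.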
Image based methods are a new approach for solving problems in mobile robotics. Instead of building a metric (3D) model of the environment, these methods work directly in the sensor (image) space. The environment is represented as a topological graph in which each node contains an image taken at some pose in the workspace, and edges connect poses between which a simple path exists. This type of representation is highly scalable and is also well suited to handle the data association problems that affect metric model based methods. In this paper, we present an efficient, adaptive method for qualitative localization using content based image retrieval techniques. In addition, we demonstrate an algorithm which can convert this topological graph into a metric model of the environment by incorporating information about loop closures.
With recent advancements in computer vision, many robotic vision systems have been shown to be practical in the real world. Traditional robotic vision systems employ stereo or structure from motion (a complete 3D reconstruction) for navigation in a 3D world. However, model-free or image based methods have recently emerged as interesting alternatives which enable a robot to operate without an explicit metric reconstruction of the environment. A large amount of research in robot navigation is devoted to the problem of automatically building a metric map of the environment while concurrently using the partially constructed map to localize the robot. This problem is called SLAM (Simultaneous Localization and Mapping). Visual SLAM systems using both monocular and stereo imaging to reconstruct a 3D model of the environment have been studied in depth. Systems have been presented that use monocular vision and structure from motion techniques to provide a realtime estimate of the trajectory being followed by a robot.
On the other hand, image based methods for robot navigation store very little or no metric information in the environment representation. The 'map' takes the form of a topological graph in which each node contains sensor readings (in this case images) taken at some position in the workspace. Nodes are linked with an edge if there is a simple, collision free path between the poses corresponding to the two nodes.
In the context of image based navigation, the localization process is formulated as an image retrieval problem. The graph contains a large collection of images taken from all over the environment. The image acquired at the robot's current position is used as the query. Localization is performed by finding images stored in the graph which are similar to the robot's current view. The more similar a database image is to the robot's current view and the greater the overlap between the two images, the more likely it is that the robot is close to the corresponding node in the graph. Content based image retrieval can now be performed accurately and efficiently even over very large image collections containing millions of images. Hence image retrieval techniques can be employed for fast and effective robot localization.
The focus of this paper is on how learning can improve the performance of an image based robot navigation system. We show that qualitative localization of a robot can be performed effectively using an adaptive vocabulary based approach to image retrieval. The visual vocabulary used by the system is not fixed; it adapts dynamically to better describe the type of visual features that occur in the environment.
This makes it possible for the robot to work better in new environments which are dissimilar in appearance to those it has worked in previously. In addition, we present a method that allows the robot to gradually learn the metric structure of the environment over time from the topological graph that it builds.
The use of local features for qualitative localization provides a significant degree of robustness to occlusions and changes in viewpoint. Extraction of local features from an image is typically computationally more expensive than global features, but the successful use of local features in image retrieval demands the investigation of their applicability in qualitative robot localization. One approach would be to directly match local features (like SIFT descriptors) between the current view and the images stored in the graph. This method provides good results.
But because the current view needs to be matched against each database image, it does not scale well as the number of images is increased. The Bag-of-Words (BoW) approach of modelling images as collections of 'visual words' built from local feature descriptors has made it possible to perform efficient and accurate image retrieval using local features over very large image databases. In one such scheme, the current view is matched to images in the database on the basis of visual words by using a simple voting mechanism in which the number of words each database image has in common with the current view is counted. The matches from the highest scoring images in the first step are geometrically validated by fitting them using a homography. If an image in the database has a sufficiently large number of geometrically validated matches with the current view, the robot is localized to that image. A Bayesian approach to Bag-of-Words localization has also been presented. A generative model for the probability of a set of visual words occurring in an image is learnt from a training dataset. This is used to estimate the probability of two given images coming from nearby poses. Similar looking, highly distinctive views are given high probabilities of being from the same pose, while views that appear frequently in the workspace are given lower probability scores.
The Bag-of-Words based robot localization schemes described above use fixed vocabularies. These vocabularies are built during a separate training process over a set of images that are considered to be representative of what the robot is expected to see while it navigates through an environment.
It is assumed that a sufficiently large vocabulary will allow the CBIR system to function effectively over any image collection. A better alternative would be a dynamic set of visual words for describing an image that adapts to best represent the images of the robot's environment.
This would improve the robot's ability to operate in new environments which are visually dissimilar to those it had seen previously. Adaptive Vocabulary Forests provide a method for doing this. A forest is grown incrementally as new images are added to the collection. Using a set of vocabulary trees helps to overcome problems of quantization near cell boundaries that occur when using a single tree. As new images are added to the collection, nodes are added. Nodes that have not been accessed over a long period of time are considered obsolete and gradually pruned out of the trees.
We extract scale and affine invariant interest points from each image. A typical 640 × 480 image generates around 200 to 300 such interest points. For each interest point we determine the 128 dimensional SIFT feature descriptor. Each tree has a set of inverted files associated with it, one file for each visual word. The files contain the indices of all images in the collection containing that particular visual word. This inverted file structure makes it possible to quickly process queries. When a query image is given to the system, visual words are extracted from it. We score database images according to the number of words they contain that appear in the query. Each tree has its own set of visual words and generates a score for the database images. These scores are totalled and the database image with the highest score is returned as the closest match. Fig. 1 shows some example queries and the results returned by the localization system. The result images clearly match well with their respective queries.
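The inverted file scoring described above can be sketched in Python as follows. This is a simplified illustration under assumptions: the class name `InvertedIndex` and the function `best_match` are hypothetical, visual words are assumed to be already extracted as hashable ids (the quantization of SIFT descriptors by each vocabulary tree is not shown), and a word is counted once per image regardless of how often it occurs.

```python
from collections import Counter, defaultdict


class InvertedIndex:
    """Inverted files of one vocabulary tree: visual word -> ids of images containing it."""

    def __init__(self):
        self.files = defaultdict(set)

    def add_image(self, image_id, words):
        """Register a database image under every visual word it contains."""
        for word in set(words):
            self.files[word].add(image_id)

    def score(self, query_words):
        """Count, per database image, how many query words it contains."""
        scores = Counter()
        for word in set(query_words):
            for image_id in self.files.get(word, ()):
                scores[image_id] += 1
        return scores


def best_match(indexes, query_words_per_tree):
    """Total the scores of all trees and return the highest scoring image id (or None)."""
    total = Counter()
    for index, words in zip(indexes, query_words_per_tree):
        total.update(index.score(words))
    if not total:
        return None
    return total.most_common(1)[0][0]
```

Only the inverted files of the words present in the query are visited, so the cost of a query depends on the number of query words and the length of their files rather than on the total number of database images.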
In this paper we start from a set of images obtained by a robot that is moving around in an environment. We present a method to automatically group the images into groups that correspond to convex subspaces in the environment which are related to the human concept of rooms. Pairwise similarities between the images are computed using local features extracted from the images and geometric constraints. The images with the proposed similarity measure can be seen as a graph, or in a way as a base level dense topological map. From this low level representation the images are grouped using a graph-clustering technique which effectively finds convex spaces in the environment. The method is tested and evaluated on challenging data sets acquired in real home environments. The resulting higher level maps are compared with the maps humans made based on the same data.
(Zivkovic, Booij et al. 2007)
Achar, S. and C. V. Jawahar (2008). Adaptation and learning for image based navigation. Proceedings - 6th Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP 2008.
Bellotto, N., K. Burn, et al. (2008). "Appearance-based localization for mobile robots using digital zoom and visual compass." Robotics and Autonomous Systems 56(2): 143-156.
Chella, A., I. Macaluso, et al. (2007). Automatic place detection and localization in autonomous robotics. Intelligent Robots and Systems, 2007. IROS 2007. IEEE/RSJ International Conference on.
Cheng, C., L. Kin-Man, et al. (2008). "An Efficient Scene-Break Detection Method Based on Linear Prediction With Bayesian Cost Functions." Circuits and Systems for Video Technology, IEEE Transactions on 18(9): 1318-1323.
Cummins, M. and P. Newman (2008). "FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance." The International Journal of Robotics Research 27(6): 647-665.
Guoxuan, Z. and S. Hong (2009). Integration of a prediction mechanism with a sensor model: An anticipatory Bayes filter. Robotics and Automation, 2009. ICRA '09. IEEE International Conference on.
Heritier, M., L. Gagnon, et al. (2009). "Places Clustering of Full-Length Film Key-Frames Using Latent Aspect Modeling Over SIFT Matches." Circuits and Systems for Video Technology, IEEE Transactions on 19(6): 832-841.
Kröse, B. J. A., O. Booij, et al. (2007). A geometrically constrained image similarity measure for visual mapping, localization and navigation. Proceedings of the 3rd European Conference on Mobile Robots, Freiburg, Germany.
Pronobis, A., B. Caputo, et al. (2009). "A realistic benchmark for visual indoor place recognition." Robotics and Autonomous Systems 58(1): 81-96.
Tapus, A. and R. Siegwart (2006). A cognitive modeling of space using fingerprints of places for mobile robot navigation. Robotics and Automation, 2006. ICRA 2006. Proceedings 2006 IEEE International Conference on.
Vasudevan, S., S. Gächter, et al. (2007). "Cognitive maps for mobile robots--an object based approach." Robotics and Autonomous Systems 55(5): 359-371.
Werner, F., J. Sitte, et al. (2007). On the Induction of Topological Maps from Sequences of Colour Histograms. Digital Image Computing Techniques and Applications, 9th Biennial Conference of the Australian Pattern Recognition Society on.
Zivkovic, Z., B. Bakker, et al. (2005). Hierarchical map building using visual landmarks and geometric constraints. Intelligent Robots and Systems, 2005. (IROS 2005). 2005 IEEE/RSJ International Conference on.
Zivkovic, Z., O. Booij, et al. (2007). "From images to rooms." Robotics and Autonomous Systems 55(5): 411-418.