An autonomous exploration rover


The field of computer vision and its applications in robotics have paved the way toward sophisticated vision-based systems that mimic the human sense of sight to perform tasks such as object detection and recognition, and determining the coordinates and locations of these objects in the real world.

On June 10 and July 7, 2003, NASA launched its twin Mars Exploration Rovers, Spirit and Opportunity, toward Mars on a quest to search for possible traces of water on the red planet. The mission is part of NASA's Mars Exploration Program, a long-term commitment to uncovering the secrets of Mars using robots. The mission's objectives included searching for and studying a wide range of soils and rocks that could answer questions about the history of water on the planet. For this reason, the rovers were targeted toward opposite sides of Mars, both characterized as areas that may have contained liquid water in the past. The rovers' first task after landing was to take panoramic images of the surrounding area to help scientists on Earth select the best paths for the twin rovers, leading to promising areas holding potential clues to the history of water. [1]

This shows how important vision was to the mission. The rovers use the stereo cameras mounted on them to understand the surrounding terrain, to detect objects worth studying and obstacles to avoid, and to determine their current location using visual odometry. Visual odometry falls under the broader problem known as Simultaneous Localization and Mapping (SLAM): estimating the robot's current location while building an understanding of its surroundings, so that it can follow a correct path to its destination. [2]

An autonomous vision-based navigation system allows for longer daily drives, which are very hard to achieve with the traditional step-by-step navigation commands sent from the command and control centre. The system takes images using a stereo camera and processes them to generate useful information such as depth and the 3D locations of points on objects and obstacles. The rovers use this information to determine the shortest and safest path toward the target set by scientists on Earth. [3]

This introduction ends by explaining the goals and milestones that are vital to the overall success of the dissertation's project, and the challenges faced during the analysis and development phases.

Chapter 2 presents a thorough investigation of computer vision and its related topics, including the available sensing technologies, autonomous mobile robot navigation, breakthroughs in computer vision, a comparison of vision-based sensing techniques, a comparison of the major obstacle detection algorithms, stereo vision, 3D modelling and reconstruction, and finally the latest state-of-the-art applications of computer vision, including small embedded computer vision systems.

Chapter 3 covers the preliminary system design, including the hardware and software specifications and components of the overall vision-based sensing system. It ends by explaining the integration and development alternatives the project went through before arriving at the final approach, and compares them.

Chapter 4 explains in detail the chosen approach to developing and integrating the different components of the system. It ends by describing the testing phase and the techniques used to test the depth accuracy and the object detection methods that were implemented.

Chapter 5 explains the detailed software design. It starts by explaining how the integrated SDKs and developed applications communicate with each other to achieve error handling, camera handling, image acquisition and handling, image rectification, image pre-processing and disparity generation. Chapter 5 then introduces the newly developed algorithm and how it creates regions of interest, calculates depth, detects obstacles and craters, and finally performs artificial intelligence (AI) decision making to enable the rover to move safely. The chapter ends by explaining the process of 3D point cloud generation, which can be used for model or scene reconstruction.

The dissertation concludes by assessing the progress achieved with respect to the project's goals and milestones. The results generated by the system are discussed and evaluated to understand their potential contribution to related future work.

Goals, Milestones & Challenges

The main goal of my project was to develop a vision-based system and integrate it with an autonomous exploration rover to perform activities including object detection and avoidance, distance (depth) calculation, and AI decision making based on the acquired data. The milestones were: researching computer vision in depth; understanding its applications in autonomous exploration rovers; researching and acquiring the software and hardware required for such a system; understanding and applying the best possible integration approach; and finally developing a system capable of using the available software and hardware to capture images, process them, generate useful depth data, create a 3D point cloud representing the real-world coordinates of points taken from the image, and use the generated data to perform AI decision making. This decision making allows the robot to understand the environment ahead and answer vital questions such as: Is the path ahead clear? Are there any obstacles or steep craters ahead? If there are obstacles, where exactly are they with respect to the rover's position, and how far away are they? How can they be avoided to achieve a safe rover traverse?

A number of existing algorithms address the challenges mentioned above, but they suffer from several problems: a large number of lengthy processing steps that take a long time to apply; the need for large amounts of processing power; and the need for highly sophisticated, properly calibrated stereo cameras that can operate under almost all conditions, such as dusty environments and varying lighting conditions and textures. In addition, I had very limited time to develop such a comprehensive system, and the 3D stereo camera I was given as the source of visual input had a calibration problem affecting the accuracy of its depth data, which is the most important type of data in such systems. This problem can be fixed only by the manufacturer in Canada.

Because of these challenges, it was clear to me that I had to develop a new algorithm capable of tackling them. The limited time meant that the algorithm had to be simple and quick to develop, yet robust enough to provide a safe robot traverse. The algorithm also had to take the camera's calibration problem into account, working around it efficiently while still providing reliable final results.

I was able to develop a new algorithm that is simple and robust and generates reliable results by effectively working around the challenges described above. The algorithm uses the available hardware and software to perform a full cycle of capturing images, processing them, generating depth data, generating a 3D point cloud and finally performing AI decision making in less than 2.5 seconds with minimal processing power.

Literature Review & State of the Art

Many sensing technologies are available, such as sonar, laser and infrared, but given the nature of the Martian surface and the type of missions carried out on its terrain, equipping rovers with a vision-based system has proven to be the best way to give an autonomous robot the ability to localize itself, detect and avoid obstacles, and create 3D models and maps, rather than relying on other sensors that provide non-visual data.

Autonomous Robotics

The major challenges facing an autonomous mobile robot sent to explore unknown terrain are determining its current location, determining its destination, and finding a path to that destination. These challenges are summarised by the questions: where am I, where am I going, and how should I get there? [1]. They can be answered using the output of the sensors available on the robot [2]. As mentioned earlier, many sensing techniques are available, but vision-based sensing has proven to lead this field: scientists and researchers have found that some sensors have limitations or are of no use at all in certain cases, whereas vision-based sensors perform efficiently in most cases. Vision-based sensing is built on computer vision, which has been the subject of considerable interest and research. With rapid developments in technology, computer vision has been widely explored and applied in fields ranging from simple monitoring and surveillance to space exploration.

The key ability of an autonomous exploration rover is to detect obstacles efficiently without human intervention, an area of research that has attracted considerable attention in recent years. In 2002, Guilherme et al. [x] published a survey of computer vision technologies for mobile robot navigation that studied the different environments in which these applications operate. Safe navigation by autonomously detecting and avoiding obstacles is classified into two major environments, indoor and outdoor, each of which can be further divided into two sub-types according to its topographical features: structured and unstructured environments.

The National Institute of Standards and Technology (NIST) in Gaithersburg, Maryland, USA, developed an obstacle detection algorithm intended for environments with rough terrain. The system relies on a Ladar sensor as its main input device. The algorithm is a hybrid of sensor-based and grid-based obstacle detection and mapping methods, and its perception and obstacle detection/mapping module is a component of the integrated 4D Real-time Control System (RCS) [1, 2]. The system has two main parts: an obstacle detection part and a mapping part. The obstacle detection part first processes the range data provided by the Ladar sensor and converts it into Cartesian coordinates in the sensor's coordinate frame, which are then analysed to detect obstacles. The mapping part then projects the obstacle points onto a grid-based map, which the 4D RCS planner module [11] uses to generate a final path that ensures the rover's safe traverse. The algorithm allowed the rover to travel at a speed of 24 km/h while detecting and avoiding obstacles efficiently.
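The two stages described above, converting range readings into Cartesian coordinates in the sensor frame and projecting obstacle points onto a grid map, can be sketched as follows. This is a minimal illustration, not NIST's implementation: it assumes each reading is a (range, azimuth, elevation) triple, that the ground is flat and lies a known sensor height below the sensor, and that any point rising more than a fixed step height above that ground is an obstacle. The function names and parameters are hypothetical.

```python
import math


def spherical_to_cartesian(r, azimuth, elevation):
    """Range (m) and angles (rad) in the sensor frame -> (x, y, z) in metres.
    x points forward, y to the left, z up."""
    x = r * math.cos(elevation) * math.cos(azimuth)
    y = r * math.cos(elevation) * math.sin(azimuth)
    z = r * math.sin(elevation)
    return x, y, z


def build_obstacle_grid(readings, sensor_height, cell_size, grid_size, max_step):
    """Project obstacle points onto a square grid of grid_size x grid_size cells.

    Assumes flat ground at z = -sensor_height in the sensor frame; a point is
    an obstacle if it rises more than max_step above that ground.
    The sensor sits at row 0, centre column, looking along +x.
    Returns the set of (row, col) cells containing obstacle points."""
    obstacles = set()
    half = grid_size // 2
    for r, az, el in readings:
        x, y, z = spherical_to_cartesian(r, az, el)
        if z + sensor_height <= max_step:
            continue  # low enough to be treated as traversable ground
        row = int(x // cell_size)
        col = int(y // cell_size) + half
        if 0 <= row < grid_size and 0 <= col < grid_size:
            obstacles.add((row, col))
    return obstacles
```

For example, with the sensor 1 m above the ground, a return straight ahead at 5 m and zero elevation is marked as an obstacle, while a return that hits the ground 30 degrees below the horizon is not. A real system would also flag negative heights (holes or crater edges) and fuse successive scans.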

Because structured indoor environments have relatively simple terrain and scenes, mobile robotics in this area has achieved a great deal of success and progress. Outdoor unstructured environments, on the other hand, have been far more challenging for autonomous mobility and navigation because of many external factors, such as terrain complexity. Nevertheless, a certain level of success has been achieved in recent years. In the Navlab Project [x], an autonomous ground rover travelled across the United States from the west coast to the east coast using a neural network-based vision system. The DARPA Urban Challenge was another success in outdoor autonomous navigation. It is a competition sponsored by the U.S. Defense Advanced Research Projects Agency (DARPA), the major research organization of the U.S. Department of Defense, and was held on November 3, 2007 at a former air force base in California. Participating teams were challenged to build an autonomous vehicle capable of driving in traffic and performing complex manoeuvres such as parking and negotiating intersections [x]. The greatest success, however, was NASA's Mars Exploration Rovers, Spirit and Opportunity.

Vision-Based Sensing Systems

Two main sensor technologies, laser range finders and cameras, have been intensively researched and used for autonomous navigation and obstacle detection and avoidance in rough, unstructured outdoor terrain. Laser sensors provide straightforward, refined data about the scene ahead, but they have a number of limitations, such as being limited to one degree of freedom, relatively high energy consumption and the relatively large size of the device. Vision-based sensing systems overcome these limitations through features such as the ability to capture the whole scene in front of the camera, high resolution, the capacity to handle larger amounts of data, and the relatively small size of the sensing devices (cameras). Although computer vision algorithms are far more complex and harder to implement than the algorithms used with other sensing devices, vision-based sensing has been shown to deliver far better results, and computer vision is being heavily researched and improved, which suggests that major breakthroughs are likely in the near future.

Vision-based sensing systems can be classified into three types according to the number of cameras integrated with the system. The first type, monocular vision, uses a single camera and relies on optical flow for object detection [x]. This method has proven limited for object detection because it relies on appearance- and colour-based techniques that cannot accurately determine the location of objects or obstacles, although recent research on 3D reconstruction from still images taken with a single camera has yielded better object detection results [x, x, x]. Monocular methods estimate the absolute depth of objects in front of the camera through machine learning, feature extraction and prior knowledge of the environment. Such prior knowledge is not available in an unexplored environment, so these methods could not provide global 3D localization of the scene ahead.

The introduction of binocular and multi-camera methods (two or more cameras) tackled the localization and 3D reconstruction challenges mentioned above. This success can be clearly seen in the results achieved in the DARPA competition and the Mars Exploration missions. Powerful algorithms have been developed for binocular and multi-camera methods, such as the V-Disparity Image algorithm [x, x, x], which detects objects or obstacles by automatically estimating the ground plane. Another algorithm based on the same technique is Bernard's Ground Plane Segment [x, x]. Because object detection with computer vision is complex, both algorithms are limited by the flat-ground assumption, which does not always hold in unstructured environments. Another challenge is their time and energy (processing power) consumption, which can become a heavy burden that cannot be tolerated where mission success and efficiency are concerned. [x]
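The V-disparity idea can be illustrated with a short NumPy sketch. For each image row, a histogram of the disparities on that row is accumulated; on flat ground seen by a roughly level camera, the ground appears as a slanted straight line in this histogram image, while obstacles appear as near-vertical segments. The sketch assumes a dense disparity map from a rectified pair with invalid pixels marked negative, and it fits the ground line by least squares through the most frequent disparity on each row, a simplification standing in for the Hough transform usually used in the literature. It inherits the flat-ground assumption discussed above, and all function names are hypothetical.

```python
import numpy as np


def v_disparity(disparity, max_disp):
    """For every image row v, histogram the integer disparities on that row.
    Pixels outside [0, max_disp] (e.g. invalid, negative) are ignored.
    Returns an array of shape (rows, max_disp + 1)."""
    rows = disparity.shape[0]
    vdisp = np.zeros((rows, max_disp + 1), dtype=np.int32)
    for v in range(rows):
        d = disparity[v]
        d = d[(d >= 0) & (d <= max_disp)].astype(np.int64)
        vdisp[v] = np.bincount(d, minlength=max_disp + 1)
    return vdisp


def fit_ground_line(vdisp):
    """Fit d = a*v + b through the dominant disparity of each non-empty row
    (least squares instead of a Hough transform)."""
    rows = np.arange(vdisp.shape[0])
    dominant = vdisp.argmax(axis=1)
    valid = vdisp.max(axis=1) > 0
    a, b = np.polyfit(rows[valid], dominant[valid], 1)
    return a, b


def obstacle_mask(disparity, a, b, tolerance=2.0):
    """Flag pixels whose disparity exceeds the ground-line prediction for
    their row by more than `tolerance`, i.e. points closer than the ground."""
    rows = np.arange(disparity.shape[0])[:, None]
    expected = a * rows + b
    return disparity > expected + tolerance
```

On a synthetic map where the ground disparity equals the row index and a small block of constant disparity stands above it, the fitted line recovers a slope of 1 and the mask flags exactly the block.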

Stereo Vision

Stereo vision is one of the major topics related to computer vision. It uses stereo cameras as the main input devices in vision-based sensing systems. The stereo vision problem has been researched intensively in recent years, but parts of it remain unsolved. Many methods and techniques generate excellent results, but usually only under a set of carefully chosen constraints on the environments they operate in. Even state-of-the-art stereo vision algorithms still struggle with issues including occlusions, low-texture areas, illumination, computational complexity and distortion. Furthermore, most state-of-the-art stereo vision algorithms are accurate only to the nearest pixel rather than to the more desirable sub-pixel level. As a result, two or more pixels with the same integer disparity are assigned the same depth, even though their true depths may differ.
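One common way to move beyond nearest-pixel accuracy is to fit a parabola through the matching cost at the best integer disparity and its two neighbours, and to take the parabola's minimum as the sub-pixel disparity. With costs C(d-1), C(d) and C(d+1), the offset is (C(d-1) - C(d+1)) / (2(C(d-1) - 2C(d) + C(d+1))). The sketch below assumes a cost where lower is better (such as SSD) and is a general illustration, not a method taken from the works cited here; the function name is hypothetical.

```python
def subpixel_disparity(costs, d):
    """Refine the integer disparity d (the minimum of the cost curve `costs`)
    by fitting a parabola through costs[d-1], costs[d] and costs[d+1]."""
    if d <= 0 or d >= len(costs) - 1:
        return float(d)  # no neighbour on one side: keep the integer value
    c_prev, c_mid, c_next = costs[d - 1], costs[d], costs[d + 1]
    denom = c_prev - 2.0 * c_mid + c_next
    if denom <= 0:
        return float(d)  # flat or non-convex cost: no reliable refinement
    return d + (c_prev - c_next) / (2.0 * denom)
```

For a cost curve that is exactly parabolic, such as (x - 3.3)^2 sampled at integer x, the integer minimum is 3 and the refinement recovers 3.3.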

Stereo vision is achieved with binocular or multi-camera methods, which integrate two or more cameras to acquire stereo 3D images. The baseline, or distance between the cameras, plays a major role in scene reconstruction. A long baseline (a wide separation between the lenses) makes the reconstruction more accurate, but it also increases the perspective distortion between the two views, which makes matching the image pair (taken simultaneously by the two cameras) more difficult. Many methods have been developed to deal with such challenges, including the Gabor filter, which can eliminate minor stereo image distortions. Another primary element of stereo vision is depth generation, which depends heavily on the depth resolution, which in turn depends on the resolution of the raw input image. Despite their limitations, other sensing technologies such as structured light and laser can provide reconstructions accurate to less than a millimetre. Stereo vision-based sensing can also produce such accurate results, but only with much more complex algorithms. New methods such as super resolution and 3D interpolation have been investigated to achieve this high level of accuracy. Super resolution covers a number of methods that enhance image resolution and comes in single-frame and multiple-frame variants. Multiple-frame super resolution exploits sub-pixel shifts between multiple low-resolution images of the same scene, while single-frame super resolution magnifies an image without causing blur.
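The dependence of depth accuracy on baseline and image resolution follows from the standard triangulation relation for a rectified pair, Z = fB/d, where f is the focal length in pixels, B the baseline and d the disparity in pixels. Differentiating gives an approximate depth error of Z^2 * Δd / (fB): the error caused by a one-pixel disparity error grows with the square of the distance and shrinks as the baseline or the image resolution (and hence f in pixels) increases. The numbers used below are assumed example values, not the specifications of any particular camera, and the function names are hypothetical.

```python
def depth_from_disparity(d_px, focal_px, baseline_m):
    """Depth Z = f*B/d for a rectified stereo pair (f and d in pixels)."""
    if d_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / d_px


def depth_resolution(z_m, focal_px, baseline_m, disp_step_px=1.0):
    """First-order depth change caused by a disparity error of disp_step_px
    at depth z_m: dZ = Z^2 / (f*B) * dd."""
    return z_m ** 2 / (focal_px * baseline_m) * disp_step_px


if __name__ == "__main__":
    f, B = 500.0, 0.12  # assumed focal length (px) and baseline (m)
    z = depth_from_disparity(30, f, B)
    print(z, depth_resolution(z, f, B), depth_resolution(2 * z, f, B))
```

With these assumed values, a disparity of 30 px corresponds to 2 m, and a one-pixel disparity error shifts the depth by about 6.7 cm at 2 m but about 27 cm at 4 m. Doubling the image resolution doubles f in pixels and halves these errors.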

Obtaining data about the surrounding environment in a form that closely represents the real world is crucial for computer vision systems. Such data enables these systems to analyse, segment and process the visual input effectively, allowing many vision-dependent applications to perform tasks that were previously impossible. This visual input is usually a 3D representation of the real world obtained through methods such as motion parallax, structured light, object shading and laser range finders, but nowadays the most commonly investigated, developed and used method is stereo vision, which has proven to be the best method for such applications. By mimicking human eyes, such a 3D representation of the real world enables these systems to generate vital data such as depth and localization. Stereo vision is used in various fields, including space exploration, security and face recognition [x].

Some computer vision systems use motion cues for direct 3D reconstruction, tracking motion between image frames to generate a relative depth ordering of the points in a scene, as described by Smith et al. [7]. This technique can be useful in processes such as feature extraction, segmentation and layer recovery, but it is not suitable for capturing detailed features of a real-world scene, which is where stereo vision proves to be the leading technology.

Stereo vision can effectively reconstruct 3D scenes and objects using stereo correspondence-based reconstruction methods, in which the points of the captured scene are matched across the available stereo image pairs and then reconstructed in 3D. Many algorithms have been developed for stereo vision. The most common are known as pixel-based algorithms, which are used as correspondence measures [8, 9]. Pixel-based algorithms look for similarities between pixels in the image pair in order to compare and match these pixels or points.
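The simplest pixel-based correspondence measure compares single pixel intensities. The hypothetical function below assumes rectified grayscale images, so that the match for a left-image pixel lies on the same row of the right image, computes the absolute intensity difference for every candidate disparity, and keeps the disparity with the lowest cost (winner-take-all). Single-pixel comparisons are cheap but ambiguous in low-texture or noisy regions, which is why practical algorithms usually aggregate costs over windows.

```python
import numpy as np


def pixelwise_disparity(left, right, max_disp):
    """Winner-take-all disparity from single-pixel absolute differences.
    Assumes rectified grayscale images: the match for left[v, u] is
    right[v, u - d] for some d in 0..max_disp."""
    rows, cols = left.shape
    left = left.astype(np.float64)
    right = right.astype(np.float64)
    cost = np.full((max_disp + 1, rows, cols), np.inf)
    for d in range(max_disp + 1):
        # columns u < d have no partner in the right image at u - d
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, :cols - d])
    return cost.argmin(axis=0)
```

If the left image is the right image shifted three columns to the right, every pixel that has a partner is assigned a disparity of 3.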

Stereo vision introduces a number of problems, such as the stereo correlation problem, which has been tackled using frequency domain approaches. Frequency domain techniques are efficient because of their processing speed and sub-pixel accuracy [12], and they have been researched intensively. One of the most promising studies on this topic is the work of Ahlvers and Zölzer, who used a number of classical Gabor filters to acquire frequency phase information about points or features in images. In their work, they secure the initial match by incorporating Gabor magnitude response information into the matching measure when the phase information alone does not provide enough information about the candidate matches. They found a significant and noticeable improvement in matching accuracy when both phase and magnitude information are included. The Gabor filter has proven to be one of the most successful frequency domain techniques because it is the most efficient time/frequency analysis primitive, which has made it the most widely used filtering technique in computer vision research.

The stereo correspondence matching process can be improved by projecting light patterns onto the scene. During image acquisition, a light pattern is projected onto the scene or object to be reconstructed, providing easily matchable points for the correspondence algorithm to detect. Several light projection techniques exist, with varying results. Random light projection [15] provides robust and distinctive feature points for the correspondence algorithm to match. Stripe light projection estimates the surface from the distortions of a light stripe as it illuminates the scene or the surface of the object. A coded light pattern uses a structured light sequence to generate a unique code for each pixel, so that correspondence matching is done by identifying pixels with the same code in the image pair [16]. Although light projection techniques provide robust results, they introduce issues such as the cost and complexity of setting up the light projectors.
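One widely used way to give each pixel a unique code is a sequence of binary stripe patterns in reflected Gray code, in which neighbouring stripe indices differ in only one bit, so a decoding error at a stripe boundary shifts the result by at most one stripe. The sketch below assumes that each captured pattern image has already been thresholded into on/off values per pixel; decoding a pixel's bit sequence yields the projector stripe index, and pixels in the two images with the same index are candidate correspondences. It illustrates coded light in general; the pattern used in [16] may differ, and the function names are hypothetical.

```python
def binary_to_gray(n):
    """Stripe index -> reflected Gray code (used to generate the patterns)."""
    return n ^ (n >> 1)


def gray_to_binary(g):
    """Reflected Gray code -> plain binary stripe index."""
    b = g
    shift = g >> 1
    while shift:
        b ^= shift
        shift >>= 1
    return b


def decode_pixel(bits):
    """bits: thresholded on/off values (1/0) one pixel observed across the
    projected pattern sequence, most significant pattern first.
    Returns the projector stripe index encoded by the sequence."""
    g = 0
    for bit in bits:
        g = (g << 1) | (1 if bit else 0)
    return gray_to_binary(g)
```

With six patterns, 64 stripes can be distinguished; a pixel that sees the sequence on, off, off, off, off, off decodes to stripe 63.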

Another reconstruction technique uses laser depth scanners, which are far more complex and expensive than light projection. Laser scanners have other drawbacks, including lengthy scan times and the inability to capture scene textures and colours without the help of other devices, such as a regular camera. This technique follows the same classical triangulation principle as stereo vision depth estimation, but the depth of the scene points is measured using a laser. Despite their drawbacks, laser scanners have gained a lot of attention for being the most accurate technology for depth calculation.

Stereo Correlation Algorithms

Correlating points in the image pair captured by the two cameras is the first step in generating a 3D reconstruction of a scene or object. Widely used correlation algorithms include the feature-based technique, the local window-based method [31, 34] and the Laganiere and Vincent method [36]. To maximize resolution and generate 3D reconstructions rich in detail, dense correlations between the captured images are needed, meaning that each pixel in one image should be matched to exactly one pixel in the other image. This requires other correlation algorithms, such as the Gabor wavelet-based technique [37], which yields relatively accurate results and is not strongly affected by variations in lighting conditions and pose, both common sources of accuracy problems in stereo images.

Dennis Gabor introduced his method in 1946 to represent signals as combinations of elementary functions. The Gabor wavelet has been shown to yield optimal joint resolution in both the frequency and spatial domains. Several researchers have extended this work: Granlund introduced the 2D counterpart of the original wavelet [38]; Daugman provided evidence that 2D Gabor wavelets mimic, to a certain extent, the receptive fields of the human visual cortex [39]; and, more recently, Okajima studied the Gabor wavelet from an information-theory perspective and showed that its receptive fields can extract a maximal amount of information from a local image region [40]. The formula of the 2D form of the Gabor wavelet is as follows:

A mother wavelet is used to generate a set of Gabor wavelets, which makes it possible to analyse a particular region of an image. These Gabor wavelets (filters) are convolved with the image, and the responses of all the filters at a point are combined into a single vector. This vector of Gabor filter responses is known as a Gabor jet. By comparing different Gabor jets, the similarity between regions in the images can be measured. Formula 2.2 defines the jet similarity function for two jets, J and J':

In this type of stereo vision system, the primary seed points in the original image are matched to pixels in the corresponding image by computing the Gabor jet of the filters centred on the seed pixel. This jet is then compared with the jet of each pixel on the corresponding epipolar line, and the most similar pixel is chosen as the match. Osadchy [50] analysed the Gabor correspondence method further and demonstrated its robustness against common stereo vision challenges such as perspective distortion and illumination problems.
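The jet-based matching described above can be sketched in one dimension. Because the images are assumed to be rectified, the epipolar line of a left-image pixel is simply the same row of the right image. The sketch builds a small bank of complex 1D Gabor kernels (a Gaussian envelope multiplied by a complex exponential), forms the jet at a pixel from the kernel responses, and compares jets with the normalised magnitude similarity widely used for Gabor jets; this may differ from the exact similarity function in Formula 2.2. Kernel sizes, frequencies and all function names are hypothetical choices.

```python
import numpy as np


def gabor_bank_1d(half_width, freqs, sigma):
    """Complex 1-D Gabor kernels: Gaussian envelope times exp(i*w*x)."""
    x = np.arange(-half_width, half_width + 1)
    envelope = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return [envelope * np.exp(1j * w * x) for w in freqs]


def jet_at(row, u, bank):
    """Jet: complex responses of every kernel centred on pixel u of a row."""
    h = (len(bank[0]) - 1) // 2
    patch = row[u - h:u + h + 1]
    return np.array([np.sum(patch * k) for k in bank])


def jet_similarity(jet_a, jet_b):
    """Normalised similarity of the jets' magnitudes, in [0, 1]."""
    a, b = np.abs(jet_a), np.abs(jet_b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0


def match_along_row(left_row, right_row, u, max_disp, bank):
    """Search the right-image row (the epipolar line of a rectified pair)
    for the disparity whose jet best matches the left jet at pixel u."""
    h = (len(bank[0]) - 1) // 2
    ref = jet_at(left_row, u, bank)
    best_d, best_s = None, -1.0
    for d in range(max_disp + 1):
        v = u - d
        if v - h < 0:
            break  # the candidate window would leave the image
        s = jet_similarity(ref, jet_at(right_row, v, bank))
        if s > best_s:
            best_d, best_s = d, s
    return best_d, best_s
```

Magnitude-only similarity is used here for simplicity; as the work of Ahlvers and Zölzer discussed earlier suggests, combining phase and magnitude information can sharpen the match.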

The Sum of Squared Differences (SSD) algorithm surrounds each image point to be correlated with a window of points and compares it with a window of the same size in the other image. Subtracting each window pixel from its counterpart in the matching window and summing the squared differences gives a measure of the similarity of the two points at the centres of the windows: the lower the sum, the more similar the points. Choosing the window size is crucial when adopting the SSD algorithm. A window must be large enough to contain sufficient intensity variation for reliable matching, but small enough to avoid including depth discontinuities [34]. Another problem is the algorithm's low tolerance to noise, lighting conditions and perspective variations between the matched images, because it directly compares pixel intensities at a local level. On the other hand, the simplicity of the SSD algorithm makes it one of the fastest stereo correlation algorithms available.
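The SSD procedure and its window-size trade-off can be sketched directly. The hypothetical function below assumes rectified grayscale images and searches disparities 0 to max_disp along the same row; half_window sets the window size discussed above. It uses plain loops for clarity, whereas practical implementations compute the sums incrementally or with box filters for speed.

```python
import numpy as np


def ssd_disparity(left, right, max_disp, half_window):
    """Winner-take-all disparity using the Sum of Squared Differences over a
    (2*half_window + 1)^2 window. Assumes rectified grayscale images; border
    pixels that cannot hold a full window for every disparity stay at 0."""
    left = left.astype(np.float64)
    right = right.astype(np.float64)
    rows, cols = left.shape
    h = half_window
    disp = np.zeros((rows, cols), dtype=np.int64)
    for v in range(h, rows - h):
        for u in range(h + max_disp, cols - h):
            ref = left[v - h:v + h + 1, u - h:u + h + 1]
            best_d, best_cost = 0, np.inf
            for d in range(max_disp + 1):
                cand = right[v - h:v + h + 1, u - d - h:u - d + h + 1]
                cost = np.sum((ref - cand) ** 2)  # lower means more similar
                if cost < best_cost:
                    best_d, best_cost = d, cost
            disp[v, u] = best_d
    return disp
```

Increasing half_window makes matches more reliable in low-texture regions but blurs disparity across depth discontinuities, which is exactly the trade-off described above.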

3D Models

3D modelling has been investigated in many studies over the last two decades. Its main technologies are 3D acquisition, view registration, model construction and 3D modelling systems. 3D acquisition uses structured light, laser scanning or stereo to scan objects and determine their shape and distance. The first two approaches have drawbacks: structured light can illuminate a single object but not a whole surrounding environment, and laser scanning requires the scanner, which can also be heavy and expensive, to remain completely stationary. Stereo vision is therefore the most efficient way of capturing images of the surrounding terrain and converting them into 3D models, because it captures complete images in microseconds without needing to remain stationary. The second main technology, view registration, registers multiple scans together, using a dedicated device, to create a 3D model. The Iterative Closest Point (ICP) algorithm [3] has been the most commonly used algorithm for this registration task. Model construction removes unwanted elements from the registered 3D models, such as extra details and noise, which can be done by constructing geometric models such as 3D surfaces. The fourth technology area, modelling systems, has been used in large-scale scanning such as city scanning [4].
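The ICP registration step can be sketched in its basic point-to-point form: pair each source point with its nearest destination point, solve for the best rigid rotation and translation with the SVD-based Kabsch method, apply it, and repeat until the mean pairing distance stops changing. This minimal version assumes the two scans overlap and are already roughly aligned, and it uses brute-force nearest-neighbour search; practical implementations typically add k-d trees and outlier rejection. Function names are hypothetical.

```python
import numpy as np


def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t with dst ~ src @ R.T + t,
    for paired N x 3 point arrays (Kabsch method via SVD)."""
    c_src = src.mean(axis=0)
    c_dst = dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    # guard against a reflection (det = -1) in the least-squares solution
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = c_dst - R @ c_src
    return R, t


def icp(src, dst, iterations=30, tol=1e-9):
    """Basic point-to-point ICP. Returns the accumulated (R, t) that maps
    src onto dst."""
    current = src.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    prev_err = np.inf
    for _ in range(iterations):
        # brute-force nearest neighbour in dst for every current point
        d2 = ((current[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        nearest = dst[d2.argmin(axis=1)]
        R, t = best_rigid_transform(current, nearest)
        current = current @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
        err = np.sqrt(d2.min(axis=1)).mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R_total, t_total
```

When the initial misalignment is small compared with the spacing between points, the first nearest-neighbour pairing is already correct and the loop converges within a few iterations.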

Latest Applications of Stereo Vision

Computer vision is now being applied in industry by equipping industrial robots with vision sensing capabilities, allowing them to sense their surroundings and perform tasks that were previously impossible. Vision-guided industrial robots can detect and recognize different objects and variations in their position and manipulate them accordingly, for example placing work pieces precisely in the right location on a product. Integrating computer vision with industrial robots has been shown to dramatically improve production lines by making them more flexible and by simplifying the mechanisms for transporting and positioning work pieces. Computer vision approaches applied in industrial robotics include static scene capture-and-move for analysing static scenes, semi-dynamic scene capture-and-move for analysing moving but predictable scenes, and path correction for measuring the offset between the robot's current path and its predefined path. [X]

Fully integrated, miniaturized embedded stereo vision systems have also been developed to study their application in miniature robotics. These systems use very low-power, tiny CMOS cameras and power-efficient embedded media processors to perform the standard image processing steps, from image acquisition to pre-processing, processing, rectification and depth generation. [X]

Preliminary System Design

In this section, the hardware and software requirements and specifications are discussed thoroughly. After that, the different integration and development alternatives that were researched, analyzed and considered are explained and illustrated.

Hardware Specifications

Developing a vision-based system to be integrated with a robot requires two main hardware components: a camera or set of cameras to provide visual input, in other words to act as the "eyes" of the robot, and a processing unit or computer to host the essential software and perform the processing, such as running the computer vision software components.

Stereo Camera

The camera is a Point Grey Research Bumblebee2, a stereo vision camera (two lenses/cameras) that offers a balance between 3D data quality, processing speed, size, and price. Its stereo vision works in a similar way to 3D sensing in human vision. It starts by identifying the pixels in the two captured images that correspond to the same point in the real world. The 3D position of that point can then be found by triangulation, intersecting a ray from each camera. The more corresponding pixels the camera identifies, the more 3D points can be computed from a single set of images, and the whole process can produce tens of thousands of 3D values from every stereo image. The Bumblebee2 also has a number of other advantages. It is provided as a full hardware and software package, including a free license of the FlyCapture SDK for image acquisition and camera control and the Triclops SDK for image rectification and stereo processing. It offers full field-of-view depth measurements from a single image set, real-time transformation of images to 3D data, easy integration with other machines and systems, passive 3D sensing that eliminates the need for other sensors such as lasers or structured light, pre-calibration, high-quality CCD sensors, and high speed [4]. Figure 2 shows the stereo camera (Bumblebee2).
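For a rectified stereo pair, the triangulation described above reduces to a standard closed form. Assuming a focal length f in pixels, a baseline B (the distance between the two lenses), a principal point (c_u, c_v), and a disparity d between the column positions u_L and u_R of a matched pixel in the left and right images, the 3D point and the sensitivity of depth to a disparity error are:

```latex
d = u_L - u_R, \qquad Z = \frac{f\,B}{d}, \qquad X = \frac{(u_L - c_u)\,Z}{f}, \qquad Y = \frac{(v - c_v)\,Z}{f}

\lvert \Delta Z \rvert \approx \frac{Z^{2}}{f\,B}\,\lvert \Delta d \rvert
```

As an illustrative calculation with assumed values f = 400 px and B = 0.12 m (a nominal Bumblebee2 baseline, not a calibrated value from this project), a disparity of 20 px gives Z = 400 x 0.12 / 20 = 2.4 m. The second expression shows why depth precision degrades with distance: with the same assumed values, a one-pixel disparity error at Z = 3 m corresponds to roughly 9 / 48, or about 0.19 m.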


The server used in the developed system is a Hewlett-Packard Pavilion dv6000 laptop running Microsoft Windows Vista. Figure 3 shows the server used in the project.

Server specifications:

  • Processor: Intel Core 2 CPU T5500 @1.66GHz
  • Memory (RAM): 2.00 GB
  • System type: 32-bit Operating System
  • IEEE 1394 port
  • C++ Compiler (Visual Studio 2008)

The server hosts the stereo camera's SDKs, libraries and header files. It also hosts the core software, which controls image capture and processing and generates the navigation information. The server can send commands to the controller software over TCP/IP to instruct the rover to move according to the navigation information. The stereo camera is connected to the server through a high-speed FireWire (IEEE 1394) connection so that captured images are sent rapidly to the core software for these tasks.

The server is built from widely available hardware at a modest cost, which means its processing power is not very high. Running the system on it therefore demonstrates that the algorithm can run effectively and efficiently on machines available to virtually anyone.

Software Specifications

A computer vision system relies on a bundle of integrated software components that perform tasks such as communicating with the camera to capture images and transfer them to the server, processing the grabbed images, generating depth data, detecting objects, computing distances and, finally, using the generated data to achieve the required results.


Since the 3D stereo camera used as the visual input component in this project is the Point Grey Research Bumblebee2, the software bundle the camera needs to operate properly must include the PGR SDKs (FlyCapture and Triclops), which owners of PGR products can download from Point Grey Research's website. The SDKs contain functions for each stage of the stereo processing pipeline: capturing the raw images, rectifying them by correcting and aligning them, performing epipolar validation on image features to identify their edges and 3D locations, and finally producing depth images using the Sum of Absolute Differences (SAD) criterion. [5] Figure 4 shows the stages of the stereo image processing pipeline.


PGR FlyCapture is the result of an extensive development and testing process that produced Point Grey's stable and feature-rich software package. FlyCapture supports both 32- and 64-bit Microsoft Windows and Linux Ubuntu through a common application programming interface (API). [X] FlyCapture includes header files that declare its classes, identifiers and variables, and a number of API functions that perform tasks such as camera communication and image capture.


PGR Triclops provides real-time depth (range) images using stereo vision, allowing accurate distance measurements for every valid pixel in an image. Triclops relies on the Sum of Absolute Differences (SAD) algorithm, a fairly robust approach that includes several validation steps to reduce noise. The algorithm works well on textured, high-contrast scenes, but occlusions, repetitive features and specular reflections can produce weak results. [X]
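The SAD matching idea can be illustrated with a minimal C++ sketch. It is a textbook version written for illustration, not Triclops' implementation, and it omits the validation steps. For a pixel in the rectified left image, it compares a square window against windows shifted along the same row of the right image and keeps the disparity with the smallest sum of absolute differences. The `GreyImage` type and the `sadDisparity` function are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// Hypothetical 8-bit grey-scale image, row-major.
struct GreyImage {
    int width = 0, height = 0;
    std::vector<std::uint8_t> pixels;  // width * height values
    int at(int x, int y) const { return pixels[static_cast<std::size_t>(y) * width + x]; }
};

// Textbook SAD block matching on a rectified pair (illustration only).
// Returns the disparity d in [0, maxDisparity] minimising the SAD between the
// (2*radius+1)^2 window around (x, y) in `left` and the window around (x - d, y)
// in `right`, or -1 if the window does not fit inside the image.
int sadDisparity(const GreyImage& left, const GreyImage& right,
                 int x, int y, int radius, int maxDisparity) {
    assert(left.width == right.width && left.height == right.height);
    if (y - radius < 0 || y + radius >= left.height ||
        x - radius < 0 || x + radius >= left.width) return -1;
    int bestD = -1;
    long bestCost = std::numeric_limits<long>::max();
    // Rectification makes this a 1-D search along the same row.
    for (int d = 0; d <= maxDisparity && x - d - radius >= 0; ++d) {
        long cost = 0;
        for (int dy = -radius; dy <= radius; ++dy)
            for (int dx = -radius; dx <= radius; ++dx)
                cost += std::abs(left.at(x + dx, y + dy) - right.at(x - d + dx, y + dy));
        if (cost < bestCost) { bestCost = cost; bestD = d; }
    }
    return bestD;
}
```

On a textured scene the minimum cost is sharp. On untextured or repetitive regions several disparities give similar costs, which is where validation steps such as those Triclops applies become necessary.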


OpenCV is an open-source computer vision library originally developed by Intel. It is written in C and C++ and runs on Windows, Mac OS X and Linux. One of OpenCV's main objectives is to let developers build fairly sophisticated computer vision applications quickly and easily. The library has more than five hundred functions covering a wide range of computer vision applications, such as industrial, medical, security and exploration applications. OpenCV also includes a full, general-purpose Machine Learning Library (MLL) that provides efficient statistical pattern recognition and clustering, which can be used for various machine learning problems, including sophisticated vision tasks. OpenCV grew out of an Intel research initiative launched in 1999, and its first alpha version was released publicly in 2000. It has since opened the door to many uses of computer vision, such as stitching captured images together for satellite and web maps, as well as medicine, security, military applications, unmanned vehicles and exploration [6]. OpenCV was considered as a candidate software component of the developed vision-based sensing system, but I decided to discard this option for the reasons explained in part C of section 2, "Integration & Development Alternatives".

Core Software

The core software is the hub that ties together all the integrated software components, both the PGR SDKs and the applications I developed, so that they work as a single bundle performing the main steps of the proposed computer vision solution. The core software is linked with PGR Triclops and FlyCapture and uses their application programming interfaces (APIs) to call the pre-defined functions that capture images, process them, generate depth maps and create 3D point clouds. It then runs the newly developed algorithm to generate navigation information: object detection, object localization, distance computation and AI decision making. The core software is explained in detail in section 4, "Detailed Software Design".

Integration & Development Alternatives

I tried three possible design approaches for the proposed solution on the testing server (the laptop).

The first approach relied exclusively on using OpenCV's APIs to call its pre-defined functions to capture images, process them, and generate depth information. Diagram 1 illustrates the first approach.

Testing the three approaches to the problem produced the following results:

The first approach, which relied exclusively on OpenCV to capture images, process them and generate depth information, ran into difficulties in configuring OpenCV to work with stereo cameras. Initial trials showed that it was hard to configure OpenCV to grab images from a pair of stereo cameras rather than from a single mono camera. In this case, an algorithm to calibrate the pair of stereo cameras must be included, which increases the demand for processing power unnecessarily. Since minimizing the required processing power was one of my main goals, and since Bumblebee2 cameras are calibrated before being shipped to their owners, this approach was clearly not the best, and I eventually discarded it.

The second approach, which relied exclusively on the FlyCapture and Triclops SDKs to capture images, process them and generate depth information, proved to be the most suitable way to accomplish these steps. FlyCapture, Triclops and the Bumblebee2 camera form a bundle of hardware and software components designed and manufactured by the same company, Point Grey Research (PGR). These components can therefore be expected to yield the best results when integrated as one entity, so it did not make sense to drop FlyCapture and Triclops from the final design. The SDKs include pre-defined functions that work efficiently and effectively with PGR stereo cameras to grab images, process them and generate depth information.

Although the second approach proved effective and efficient, I decided to investigate a third approach that included OpenCV as one of the system's software components, integrating the PGR SDKs with OpenCV.

OpenCV has attracted a great deal of attention in computer vision, mainly because it is an open-source library that lets developers build fairly sophisticated computer vision applications quickly and easily, with more than five hundred functions and a general-purpose Machine Learning Library (MLL) for statistical pattern recognition and clustering. OpenCV contains powerful algorithms such as the watershed algorithm [Meyer92], which treats lines in an image as "mountains" and uniform regions as "valleys" and uses them to segment objects. It first takes the gradient of the intensity image, which produces basins (low points that lack texture) and ranges (high points that contain texture and lines). The algorithm then floods these basins until the regions meet, and regions that meet are considered part of one feature, in other words they belong together. The image is then segmented into the corresponding marked regions, so an object can be isolated by grouping all the points that belong to it and removing everything else. OpenCV also provides methods such as the Lucas-Kanade [Lucas81] and Horn-Schunck [Horn81] techniques for tracking objects of interest. [10]

Because of these advantages, I investigated the possibility of using OpenCV in the project. However, OpenCV's algorithms are powerful general-purpose algorithms, which means more demand for processing power than this system needs. They were developed to handle a wide variety of colours and textures and involve many stages of analysis and processing, and some were designed for moving objects, such as the Shi and Tomasi [Shi94] feature detector used for tracking. Since my goal was to develop a low-processing-power vision system designed, in theory, for a space exploration rover, I decided to discard OpenCV from the final design despite its advantages, simply because there are no varied textures, colours or moving objects to recognize or track on Mars.

As mentioned above, the second approach was chosen to design the vision system. Its architecture is explained in detail in section 3, "Implementation Strategy".

Implementation Strategy

In this section, the final design approach of the proposed vision-based sensing system is explained. The integration of all the system's hardware and software components, and how these elements communicate with each other, are explained and illustrated. Finally, the testing techniques, including depth accuracy testing and object detection testing, are discussed to show how the system was checked against the goals described in part B of section 1, "Goals, Milestones & Challenges".

Proposed Integration Approach

As described in part C of section 2, "Integration & Development Alternatives", three approaches were tested to arrive at the final system architecture and design. The second approach, which relied exclusively on the FlyCapture and Triclops SDKs to capture images, process them and generate depth information, proved to be the most suitable way to accomplish these tasks.

As mentioned in part B of section 2, "Software Specifications", the PGR SDKs (FlyCapture and Triclops) are a bundle of software components that work with PGR stereo cameras. They cover every stage of the stereo processing pipeline: capturing the raw images, rectifying them by correcting and aligning them, performing epipolar validation on image features to identify their edges and 3D locations, and finally producing depth images using the Sum of Absolute Differences criterion.

The first step was downloading and installing PGR FlyCapture on the server, followed by the Triclops Stereo Vision SDK. Next, the camera was connected to the server through the FireWire (IEEE 1394) port and started, to check the compatibility between the camera and the software components, including the server's operating system. Finally, Microsoft Visual Studio 2008, the integrated development environment of choice, was configured to compile and run programs that use the PGR SDKs. First, the following header folders were added to the Include files of the Visual C++ directories:

  • Triclops Stereo Vision SDK Include folder
  • PGR FlyCapture FlyCap folder
  • PGR FlyCapture Include folder

Second, the following library folders were added to the Library files of the Visual C++ directories:

  • Triclops Stereo Vision SDK Lib folder
  • PGR FlyCapture Lib folder

Next, the following source folders were added to the Source files of the Visual C++ directories:

  • Triclops Stereo Vision SDK Src folder
  • PGR FlyCapture Src folder

Finally, the following SDK library files were added to the project's linker input:

  • PGRFlyCapture.lib
  • pgrflycapturegui.lib
  • pnmutils.lib
  • pnmutilsd.lib
  • triclops.lib

These steps prepare the integrated development environment (Microsoft Visual Studio 2008) to compile the core software's code segments that call the pre-defined functions in the PGR SDKs, and to run them successfully. Diagram 4 illustrates the final system design and architecture.

When the vision system runs, the core software uses the application programming interfaces (APIs) to call a set of pre-defined PGR FlyCapture image acquisition functions. These functions check for errors, acquire the stereo camera's current configuration file, check the camera type and compatibility, and finally start transferring images to the server. The core software then calls another set of pre-defined functions to apply image processing stages to the grabbed image, such as rectification and processing for noise removal. Next, a disparity map is generated from the rectified and processed image. After that, the new algorithm I developed takes the depth data from the disparity image to perform two main tasks: object detection and distance calculation. The algorithm analyses the available data to make an artificial intelligence (AI) decision by answering questions such as: Is the path ahead clear? If there are obstacles, what type are they? How far away are they? How can they be avoided?

Based on the decision made, the system sends the appropriate command to the rover to ensure its safe traverse. Finally, the core software calls another set of PGR Triclops pre-defined functions to create a 3D point cloud file. The final outputs of the different tasks mentioned earlier are saved on the server's hard disk. These steps are explained in depth in section 4: "Detailed Software Design".
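As a purely hypothetical sketch of the kind of rule that can map detected obstacles to a rover command (the thesis's actual decision logic is described in section 4), the following C++ function takes, for each of eight image columns of the ground region, a flag saying whether that column is blocked, and returns a command. The column layout, the `Command` values and the preference for the side with more free columns are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>

// Hypothetical command set; not the rover's real command protocol.
enum class Command { Forward, TurnLeft, TurnRight, Stop };

// blocked[c] is true when column c (0 = leftmost of 8) of the ground rows
// contains an obstacle or crater. Illustrative rule only.
Command decide(const std::array<bool, 8>& blocked) {
    // Path straight ahead: the two centre columns must both be clear.
    if (!blocked[3] && !blocked[4]) return Command::Forward;
    int freeLeft = 0, freeRight = 0;
    for (int c = 0; c < 4; ++c) freeLeft += blocked[c] ? 0 : 1;
    for (int c = 4; c < 8; ++c) freeRight += blocked[c] ? 0 : 1;
    if (freeLeft == 0 && freeRight == 0) return Command::Stop;  // nowhere to go
    // Turn towards the side with more free columns (ties go left).
    return freeLeft >= freeRight ? Command::TurnLeft : Command::TurnRight;
}

// Text label for a command (hypothetical message format).
const char* commandName(Command c) {
    switch (c) {
        case Command::Forward:   return "FORWARD";
        case Command::TurnLeft:  return "TURN_LEFT";
        case Command::TurnRight: return "TURN_RIGHT";
        case Command::Stop:      return "STOP";
    }
    return "UNKNOWN";
}
```

`commandName` returns a text label that could, for instance, be sent to the rover's controller software over the TCP/IP link mentioned earlier. That message format is likewise an assumption.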

Testing Techniques

Thorough testing was applied to ensure the accuracy of the depth calculation performed after generating the disparity map. The new algorithm was then tested in depth to ensure that it reliably detects obstacles or craters ahead of the rover, makes the right AI decisions and sends the correct commands to the rover. The testing techniques are explained and illustrated below.

Depth Accuracy

Depth accuracy is the accuracy of the calculated distance, in meters, between the rover and a valid depth point (an obstacle). This step is very important because it ensures the safety of the rover's traverse by identifying the distance to collision when a valid obstacle is detected. The test was performed by placing an obstacle directly in front of the camera and running a small application that first generated a depth map from the acquired image. Next, a region of interest (ROI) was created in the centre of the acquired image. Finally, the application calculated the distance between the camera and the centre point of the region of interest and converted it to meters. Figure 5 shows a box acting as an obstacle directly in front of the camera.
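The centre-point distance calculation can be sketched in C++ as follows. The sketch assumes the depth map is available as a row-major disparity image in which non-positive values mark invalid pixels, and converts disparity to meters with Z = fB/d. The focal length and baseline are parameters because their calibrated values are not given here. The types and function names are hypothetical and do not correspond to the Triclops API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical row-major disparity image; values <= 0 mark invalid pixels.
struct DisparityImage { int width, height; std::vector<float> d; };

// Hypothetical rectangular region of interest (top-left corner and size).
struct Roi { int x, y, w, h; };

// ROI of size roiW x roiH centred in an imgW x imgH image.
Roi centredRoi(int imgW, int imgH, int roiW, int roiH) {
    return {(imgW - roiW) / 2, (imgH - roiH) / 2, roiW, roiH};
}

// Depth in meters at the ROI's centre pixel using Z = f * B / d,
// or -1 if that pixel has no valid disparity.
double centreDepthMeters(const DisparityImage& img, const Roi& roi,
                         double focalPx, double baselineM) {
    assert(roi.x >= 0 && roi.y >= 0 &&
           roi.x + roi.w <= img.width && roi.y + roi.h <= img.height);
    const int cx = roi.x + roi.w / 2, cy = roi.y + roi.h / 2;
    const float disp = img.d[static_cast<std::size_t>(cy) * img.width + cx];
    if (disp <= 0.0f) return -1.0;
    return focalPx * baselineM / disp;
}
```

For a 320 x 240 image and a 40 x 40 ROI, the centre pixel is (160, 120). With assumed values f = 400 px and B = 0.12 m, a disparity of 24 px at that pixel gives 2.0 m. Reading a single pixel keeps the computation minimal; the median of the valid disparities inside the ROI would be a more noise-tolerant alternative.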

After that, the physical distance between the object and the camera was measured manually to verify the depth calculation. Figure 6 shows the manual measurement used to check the application's calculation.

Because of the stereo camera's calibration problem, mentioned in part B of section 1, "Goals, Milestones & Challenges", the depth measured by the camera matched the manual measurement only under certain conditions: very good lighting, a light-coloured ground, and colourful, textured objects. I discussed this issue with Point Grey Research's technical support team, and after I sent them a number of stereo images captured by the camera and the disparity maps generated from them, they confirmed that the camera needed to be sent to Canada for calibration.

This was not a reason to halt development and send the camera to Canada for what could have been four to six weeks, especially given the limited time available. Instead, the camera was tested and operated under the favourable conditions identified in the initial tests, which compensated for the calibration issue. Further tests under these conditions yielded much better results, which matched the manual measurements most of the time.

Generating disparity maps and using them to calculate the distance between a certain object and the camera (depth information) is explained in detail in section 4, "Detailed Software Design".

Object Detection

Object detection is the process of analyzing the generated disparity maps to detect relatively close objects (obstacles) or distant surfaces that could threaten the rover's traverse (craters). Disparity maps are grey-scale images in which the grey level of each pixel with a valid depth represents that depth. In these disparity images, very close objects are represented by darker shades of grey and far objects by lighter shades.

The testing approach started with taking a number of stereo images of different scenes, rectifying them and generating disparity maps from them. The disparity maps were compared against the original images to make sure that the shades of grey correctly represented the distances of all the objects in the scene. In other words, I checked that very close objects were represented by a very dark shade of grey, objects a little further away by a relatively lighter shade, far away objects by a very light shade, and so on. This confirmed that, under the favourable conditions mentioned earlier, the stereo camera could generate disparity maps in which every object in the original scene was represented by the grey level corresponding to its distance from the camera. Figure 7 shows a rectified stereo image and the generated disparity map, with different grey levels representing the depth of the objects in the original image.

After confirming that objects were detected properly and represented by the correct grey levels, another set of stereo images was taken and rectified, and depth maps were generated from them. These images showed scenes with no detectable close objects (no valid depth), to confirm that the corresponding disparity maps contained no dark shades. This step tested the ability of both the stereo camera and the disparity algorithm to generate correct disparity maps with correct grey levels, and made it possible to examine the depth values that the depth calculation application described above produces for empty scenes. These results were used to understand surface distances, horizon distances and the camera's accurate range, which was approximately three (3) meters, and to set up the thresholds that the new algorithm uses to interpret the scene in front of the rover mathematically and logically, that is, to determine whether obstacles are present and how far away they are. The algorithm is explained in depth in part B of section 4, "The New Algorithm".

After that, regions of interest (ROIs) were created and processed individually to serve as different sectors of the image. Initially, a few regions of interest were created around different parts of the image for testing. Then forty-eight (48) regions of interest were created to cover the whole area of the 320 x 240 image. The average depth of all the pixels within each region of interest, or sector, was calculated. If a region of interest was represented by a dark shade of grey (a close obstacle, i.e. a valid depth sector), the calculated depth had to match the real-world distance of the corresponding part of the object, and vice versa. The forty-eight sectors divided the 320 x 240 image into six (6) rows. The first three rows (1, 2 and 3) represented the horizon ahead of the rover, and the last two rows (5 and 6) represented the ground on which the rover moves, where obstacles or craters are to be detected. Row 4 represented the transition area between the ground and the horizon. The method of using regions of interest to detect obstacles or craters is explained and illustrated in part B of section 4, "The New Algorithm". As mentioned above, the application calculated the average depth of every region of interest, and the results were displayed in the output shown in Figure 8 below.
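The sector averaging can be sketched in C++ as follows. Since 48 sectors in six rows over a 320 x 240 image imply eight sectors per row, the sketch assumes a regular grid of 40 x 40 pixel sectors. It also assumes the depth map stores meters and marks invalid pixels with non-positive values; skipping those pixels when averaging is a design choice of this sketch. The names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed grid: 8 columns x 6 rows = 48 sectors of 40 x 40 pixels.
constexpr int kImgW = 320, kImgH = 240;
constexpr int kCols = 8, kRows = 6;
constexpr int kSecW = kImgW / kCols, kSecH = kImgH / kRows;

// depth: row-major 320 x 240 depths in meters; values <= 0 are invalid.
// Returns the mean valid depth of each sector (row-major, 6 rows x 8 columns);
// a sector with no valid pixel gets -1.
std::array<double, kRows * kCols> sectorMeans(const std::vector<float>& depth) {
    assert(depth.size() == static_cast<std::size_t>(kImgW) * kImgH);
    std::array<double, kRows * kCols> sum{};
    std::array<int, kRows * kCols> count{};
    for (int y = 0; y < kImgH; ++y)
        for (int x = 0; x < kImgW; ++x) {
            const float z = depth[static_cast<std::size_t>(y) * kImgW + x];
            if (z <= 0.0f) continue;                       // skip invalid pixels
            const int s = (y / kSecH) * kCols + (x / kSecW);  // sector index
            sum[s] += z;
            ++count[s];
        }
    std::array<double, kRows * kCols> mean{};
    for (int s = 0; s < kRows * kCols; ++s)
        mean[s] = count[s] ? sum[s] / count[s] : -1.0;
    return mean;
}
```

Sector `s` of the result corresponds to row `s / 8` and column `s % 8` (zero-based), so indices 32 to 47 cover rows 5 and 6, the ground region.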

The results were used to fill data tables that were analyzed to understand the nature and fluctuations of the system's readings. These readings were used to set up the thresholds that the new algorithm uses to detect obstacles and craters, calculate their distances from the camera (and hence the rover) and find their locations relative to the rover. The tables below show the readings generated by the system in two situations: empty scenes and scenes containing obstacles. Each cell represents the reading of the corresponding region of interest in the corresponding row.
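A hypothetical sketch of how thresholds derived from the empty-scene readings might be applied follows. It assumes each sector has a baseline mean depth recorded for an empty scene and a tolerance margin reflecting the observed fluctuation. A ground sector that reads noticeably closer than its baseline is treated as an obstacle, one that reads noticeably farther as a possible crater, and one with no valid depth as unknown. These rules, the names and the example numbers below are illustrative, not the thresholds or readings from the tables.

```cpp
#include <array>
#include <cassert>

enum class Sector { Clear, Obstacle, Crater, Unknown };

// measured: current mean depth (m), or negative if no valid pixel.
// baseline: mean depth (m) the same sector reported for an empty scene.
// margin: tolerance (m) covering the normal fluctuation of readings.
// Illustrative rule only, not the thesis's thresholds.
Sector classify(double measured, double baseline, double margin) {
    if (measured < 0.0) return Sector::Unknown;            // no valid depth
    if (measured < baseline - margin) return Sector::Obstacle;  // closer than ground
    if (measured > baseline + margin) return Sector::Crater;    // ground falls away
    return Sector::Clear;
}

// Reduces the two ground rows (zero-based rows 4 and 5 of the 6 x 8 grid)
// to one flag per column; anything not clear counts as blocked.
std::array<bool, 8> blockedColumns(const std::array<double, 48>& measured,
                                   const std::array<double, 48>& baseline,
                                   double margin) {
    std::array<bool, 8> blocked{};
    for (int row = 4; row < 6; ++row)
        for (int col = 0; col < 8; ++col) {
            const int s = row * 8 + col;
            if (classify(measured[s], baseline[s], margin) != Sector::Clear)
                blocked[col] = true;
        }
    return blocked;
}
```

For instance, with a baseline of 1.5 m and a margin of 0.2 m, a sector reading 1.0 m is classified as an obstacle and one reading 2.5 m as a possible crater. Treating unknown sectors as blocked is a conservative choice for a rover that cannot risk driving into unmeasured terrain.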

Detailed Software Design

As mentioned earlier, the core software uses the application programming interfaces (APIs) to call a set of pre-defined PGR FlyCapture image acquisition functions, which check for errors, acquire the stereo camera's current configuration file, check the camera type and compatibility, and start transferring images to the server. It then calls further pre-defined functions to rectify and process the grabbed image (for example, to remove noise), and a disparity map is generated from the result. The new algorithm I developed takes the depth data from the disparity image to detect objects and calculate distances, and analyses this data to make an AI decision by answering questions such as: Is the path ahead clear? If there are obstacles, what type are they? How far away are they? How can they be avoided?

Based on this decision, the system sends the appropriate command to the rover to ensure a safe traverse. Finally, the core software calls another set of pre-defined PGR Triclops functions to create a 3D point cloud file, and the outputs of all these tasks are saved on the server's hard disk.

All the system calls to the pre-defined functions, the method the algorithm uses to detect objects and calculate their depths, the AI decision-making process and the 3D point cloud generation are explained thoroughly in this section, with code segments. Both the PGR SDK functions and the new algorithm's functions are covered in depth. Flowchart 1 illustrates the steps taken by the system to achieve the final results and outputs.

The core software's source code is written in C++ and was developed using Microsoft's integrated development environment (IDE), Microsoft Visual Studio 2008.

Flowchart 5.1: The steps taken by the system to generate the final results and outputs


This part discusses the role of the pre-defined functions of the PGR SDKs, explaining their functionality and how they contribute, as parts of the whole vision-based sensing system, to the final results: detecting obstacles and craters, calculating their distances from the rover, performing AI decision making and, finally, creating a 3D point cloud of the scene in front of the rover or camera.

System Components

Header files are commonly used in C and C++ source code to support reuse: they contain the declarations of pre-defined functions, classes, variables and so on, which other source files can include. As explained earlier, the system uses application programming interfaces (APIs) to call the pre-defined functions provided by the PGR SDKs. Both PGR Triclops and PGR FlyCapture come with their own header files. The integration requires including the header files that declare the SDK functions at the beginning of the core software's source code, along with the system header files used by the integrated development environment (IDE), Microsoft Visual Studio. The PGR SDK header files are triclops.h, pgrflycapture.h and pgrflycapturestereo.h:

Error Handling

Developing robust software requires handling the errors that could occur, through techniques that perform vital tasks such as validation. Since the vision system is fully autonomous, there is no human-computer interaction and therefore no user input to validate. Instead, validation was replaced by error handling that catches system-generated errors, which are unlikely but possible. System errors could include hardware problems, such as faulty or incompatible components, or integration problems, such as necessary files that are missing because they were corrupted or accidentally deleted. The first step was creating macros to check, report and handle both FlyCapture and Triclops API errors. The following code segment creates macros for FlyCapture API errors:

The second step was to pair every API call to a pre-defined PGR function (FlyCapture and Triclops) with an error-handling variable, so that no API call could return an error without it being caught. The following code segment shows how the result of an API call is assigned to the error-handling variable and checked, throwing an exception in case of error:
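The macro-plus-checked-call pattern described above can be sketched without the SDK headers as follows. `SdkError` is a hypothetical stand-in: the real FlyCapture and Triclops SDKs define their own error types and error-to-string helpers, which would replace `describe` here. `CHECK_SDK` records the text of the call, so that the exception message names the function that failed.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an SDK's error codes (not the real PGR types).
enum class SdkError { Ok = 0, DeviceNotFound, InvalidParameter };

// Human-readable description of an error code.
const char* describe(SdkError e) {
    switch (e) {
        case SdkError::Ok:               return "no error";
        case SdkError::DeviceNotFound:   return "device not found";
        case SdkError::InvalidParameter: return "invalid parameter";
    }
    return "unknown error";
}

// Exception carrying both the failing call and the error code.
class SdkException : public std::runtime_error {
public:
    SdkException(const std::string& call, SdkError e)
        : std::runtime_error(call + " reported " + describe(e)), code(e) {}
    SdkError code;
};

// Throws if an SDK call did not succeed; `call` names the function for the report.
inline void check(const char* call, SdkError e) {
    if (e != SdkError::Ok) throw SdkException(call, e);
}

// Wraps a call so that no returned error can pass unnoticed.
#define CHECK_SDK(expr) check(#expr, (expr))
```

A call such as `CHECK_SDK(someSdkFunction(args))`, where `someSdkFunction` is a placeholder for any SDK call returning an error code, then either returns normally or throws an `SdkException` describing the failure.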

Camera Handling

There is a certain sequence of API calls to the pre-defined PGR functions that must be followed to handle the communication between the hardware (the stereo camera) and the software. The sequence starts by creating contexts (for FlyCapture and Triclops) that serve as the base environment for handling the communication between the camera and the software used to transport images. The following code segment shows how one of the contexts is created:

The third step is to acquire the data from the camera that are used to create the Triclops context and to identify the type of camera used as part of the system's hardware. This information is stored in the camera's configuration file:

After that, elements including the pixel formats and the maximum number of rows and columns in the image are defined according to the camera's type (colour or mono) and model (Bumblebee 2 or Bumblebee 3 XB3). If an incompatible camera is used as part of the system's hardware, the software throws an exception and provides a message explaining the nature of the problem:

It is worth mentioning that the vision-based sensing system can work with three stereo camera models: the low-resolution Bumblebee 2, the high-resolution Bumblebee 2 and the Bumblebee 3 XB3, which has three integrated cameras instead of two and captures three images simultaneously for more accurate depth information. The system works with all three cameras without any changes to its integration approach or to the software's source code.
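The model check can be sketched as a function that maps a camera model to its image format and throws for anything unsupported. The `CameraModel` enum is hypothetical (the real system reads this information from the camera's configuration file through the SDK), and the nominal resolutions used (640 x 480 and 1024 x 768 for the two Bumblebee2 variants, 1280 x 960 for the XB3) are assumptions to be checked against the camera documentation.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical model identifiers (the real values come from the SDK).
enum class CameraModel { Bumblebee2Low, Bumblebee2High, BumblebeeXB3, Other };

struct ImageFormat {
    int maxRows, maxCols;  // assumed nominal sensor resolution
    int cameras;           // 2 for Bumblebee2, 3 for XB3
    bool colour;           // colour or mono sensor
};

// Returns the image format for a supported model; throws otherwise.
ImageFormat formatFor(CameraModel model, bool colour) {
    switch (model) {
        case CameraModel::Bumblebee2Low:  return {480, 640, 2, colour};
        case CameraModel::Bumblebee2High: return {768, 1024, 2, colour};
        case CameraModel::BumblebeeXB3:   return {960, 1280, 3, colour};
        default: break;
    }
    throw std::runtime_error(
        "Unsupported camera: this system handles Bumblebee2 and Bumblebee XB3 only");
}
```

Keeping the model-specific values in one function is what lets the rest of the code stay unchanged when a different supported camera is connected.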

Image Acquisition and Handling

After setting up a custom image format based on the information acquired from the camera and on parameters such as the maximum number of rows and columns, the software grabs an image and transfers it from the camera. This starts with a call to the pre-defined function that handles this task:

The next step after capturing the image is to extract the necessary information from it, including the number of rows and columns in the image and the time stamp in seconds and microseconds. Buffers are created to hold both the mono and colour images, and temporary images are also created to prepare the stereo image. The following code segment shows how these steps are performed:
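The time-stamp handling and buffer allocation can be sketched as follows. `FrameInfo`, `FrameBuffers` and the 4-bytes-per-pixel colour layout are assumptions for illustration, not the FlyCapture image structures. Combining seconds and microseconds into a single integer avoids floating-point rounding when computing the time between two frames.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Information extracted from a grabbed frame (hypothetical structure).
struct FrameInfo {
    int rows;
    int cols;
    std::int64_t seconds;       // time stamp, whole seconds
    std::int64_t microSeconds;  // time stamp, fractional part in microseconds
};

// Combines the two time-stamp fields into a single count of microseconds.
inline std::int64_t timestampMicros(const FrameInfo& f) {
    return f.seconds * 1000000 + f.microSeconds;
}

struct FrameBuffers {
    std::vector<unsigned char> mono;    // 1 byte per pixel per camera
    std::vector<unsigned char> colour;  // assumed 4 bytes per pixel per camera
};

// Allocates zero-initialised buffers large enough for every camera's image.
inline FrameBuffers allocateBuffers(const FrameInfo& f, int numCameras) {
    const std::size_t pixels = static_cast<std::size_t>(f.rows) * f.cols * numCameras;
    return {std::vector<unsigned char>(pixels), std::vector<unsigned char>(pixels * 4)};
}
```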

Finally, the pixel-interleaved raw data is converted to row-interleaved format. This prepares the software to combine the images taken by the right and left cameras into a stereo image. The colour arrays of the image in the red, green and blue (RGB) model are produced by creating pointers to the positions in the mono buffer that correspond to the beginning of the red, green and blue sections. The row-interleaved images described above come from the two integrated left and right cameras (Bumblebee 2) or the three integrated cameras (Bumblebee 3 XB3). They are used to build the RGB Triclops input, in other words the final stereo image, as follows:
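The layout change can be illustrated generically. This sketch rests on an assumption and is not the FlyCapture conversion routine. In pixel-interleaved data, each pixel stores one byte per camera in turn. In row-interleaved data, row r of camera 0 is followed by row r of camera 1, and so on. Camera k's image therefore starts k single-camera rows into the row-interleaved buffer, and its successive rows are `numCameras * cols` bytes apart. `planeFor` computes that start pointer and row increment, which is the role played by the red, green and blue section pointers described above. The sketch does not show which camera is assigned to which colour channel; that would follow the SDK's convention.

```cpp
#include <cstddef>
#include <vector>

// Converts pixel-interleaved data (cam0, cam1, ... for each pixel in turn)
// into row-interleaved data (row r of cam0, row r of cam1, ..., then row r+1).
// Precondition: in.size() == rows * cols * numCameras.
inline std::vector<unsigned char> pixelToRowInterleaved(
        const std::vector<unsigned char>& in, int rows, int cols, int numCameras) {
    std::vector<unsigned char> out(in.size());
    for (int r = 0; r < rows; ++r)
        for (int k = 0; k < numCameras; ++k)
            for (int c = 0; c < cols; ++c)
                out[(static_cast<std::size_t>(r) * numCameras + k) * cols + c] =
                    in[(static_cast<std::size_t>(r) * cols + c) * numCameras + k];
    return out;
}

// Start of one camera's image inside a row-interleaved buffer, plus the
// distance (in bytes) between two consecutive rows of that same camera.
struct PlaneView {
    const unsigned char* start;
    std::size_t rowIncrement;
};

inline PlaneView planeFor(const std::vector<unsigned char>& rowInterleaved,
                          int cols, int numCameras, int camera) {
    return {rowInterleaved.data() + static_cast<std::size_t>(camera) * cols,
            static_cast<std::size_t>(numCameras) * cols};
}
```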

Image Rectification

In stereo vision, one of the most common problems is the correspondence problem. The task is to find the point in the image captured by one camera that corresponds to a given point in the image of the same scene captured by the second camera. In general, because of the cameras' configuration, this requires a search in two dimensions. That need disappears when the two cameras are perfectly aligned, which is the case for the stereo camera used in this system. The search is then conducted in only one dimension, along a line parallel to the line joining the two integrated cameras. The line joining the two cameras is referred to as the baseline.

Image rectification is the process of transforming the images so that the epipolar lines (from epipolar geometry) are aligned horizontally, eliminating what is known as geometric distortion [Oram 2001]. Figure 9 illustrates the geometry behind image rectification.

The calculation is a linear transformation. Rotations about the X and Y axes put the images on the same plane. The image frames are then scaled to the same size and adjusted with a Z rotation and skew so that the pixel rows line up directly. The rigid alignment of the cameras must be known; it is obtained through calibration, and the calibration coefficients are used by the transform [Fusiello 2000].
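Once the transform is known, applying it is a projective (homography) warp. The sketch below assumes a 3 x 3 matrix `Hinv` that maps rectified pixel coordinates back to the original image. Computing that matrix from the calibration coefficients (for example with Fusiello's method) is outside the sketch. Nearest-neighbour sampling is a simplification; bilinear interpolation would normally be used.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using Mat3 = std::array<double, 9>;  // row-major 3x3 matrix

struct Point2 { double x, y; };

// Maps (x, y) through H in homogeneous coordinates: [x' y' w]^T = H [x y 1]^T.
inline Point2 applyHomography(const Mat3& H, double x, double y) {
    const double xp = H[0] * x + H[1] * y + H[2];
    const double yp = H[3] * x + H[4] * y + H[5];
    const double w  = H[6] * x + H[7] * y + H[8];
    return {xp / w, yp / w};
}

// Builds the rectified image by inverse mapping: each output pixel looks up
// its source location through Hinv (nearest neighbour). Pixels that map
// outside the source image are set to 0.
inline std::vector<unsigned char> warpNearest(const std::vector<unsigned char>& src,
                                              int rows, int cols, const Mat3& Hinv) {
    std::vector<unsigned char> dst(src.size(), 0);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            const Point2 p = applyHomography(Hinv, c, r);
            const long sx = std::lround(p.x), sy = std::lround(p.y);
            if (sx >= 0 && sx < cols && sy >= 0 && sy < rows)
                dst[static_cast<std::size_t>(r) * cols + c] =
                    src[static_cast<std::size_t>(sy) * cols + sx];
        }
    return dst;
}
```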

Image Pre-Processing

In computer vision, image pre-processing is the manipulation or enhancement of input images using computer algorithms. It produces output images from which data can be extracted or that other applications can use. Examples of image pre-processing include RGB manipulation and the removal of noise and signal distortion. Noise removal is very common in computer vision applications. It is sometimes also used to remove grounds or surfaces that are considered noise in the captured images. Image pre-processing is the prerequisite step that prepares the image for later processing such as disparity map generation.

The pre-defined function included with the PGR Triclops SDK performs image pre-processing in a number of steps. It starts with data unpacking, which strips the individual channels from the 32-bit packed data and puts them into three raw-channel images. A low-pass filter is then applied to the raw channels, and the images are rectified to correct common stereo image issues such as lens distortion and camera misalignment. Finally, a second-derivative Gaussian edge-processing method is applied to these images to generate what are known as edge images.
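The unpacking, low-pass and edge stages can be sketched generically for a single image. This is an illustration, not the Triclops implementation. The channel order in the packed 32-bit word, the 1-2-1 Gaussian kernel and the 4-neighbour Laplacian are assumptions, and the rectification step is omitted. Smoothing followed by a Laplacian is a common discrete approximation of second-derivative-of-Gaussian filtering.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Image = std::vector<int>;  // single channel, row-major

// Splits 32-bit packed pixels into three channel images.
// Assumed layout: bits 16-23, 8-15 and 0-7 hold channels 0, 1 and 2.
inline std::vector<Image> unpackChannels(const std::vector<std::uint32_t>& packed) {
    std::vector<Image> ch(3, Image(packed.size()));
    for (std::size_t i = 0; i < packed.size(); ++i) {
        ch[0][i] = (packed[i] >> 16) & 0xFF;
        ch[1][i] = (packed[i] >> 8) & 0xFF;
        ch[2][i] = packed[i] & 0xFF;
    }
    return ch;
}

// Applies a 3x3 kernel to interior pixels; border pixels are left at 0.
inline Image convolve3x3(const Image& img, int rows, int cols, const int k[9], int divisor) {
    Image out(img.size(), 0);
    for (int r = 1; r + 1 < rows; ++r)
        for (int c = 1; c + 1 < cols; ++c) {
            int sum = 0;
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc)
                    sum += k[(dr + 1) * 3 + (dc + 1)] * img[(r + dr) * cols + (c + dc)];
            out[r * cols + c] = sum / divisor;
        }
    return out;
}

// Low-pass (Gaussian 1-2-1) smoothing followed by a Laplacian.
inline Image edgeImage(const Image& img, int rows, int cols) {
    static const int gauss[9]   = {1, 2, 1, 2, 4, 2, 1, 2, 1};
    static const int laplace[9] = {0, 1, 0, 1, -4, 1, 0, 1, 0};
    return convolve3x3(convolve3x3(img, rows, cols, gauss, 16), rows, cols, laplace, 1);
}
```

On a uniform region the edge response is zero. Intensity changes produce non-zero responses with opposite signs on either side of the edge.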

The following figure shows the difference after applying a pre-processing method to an image to remove noise:

Disparity Image

The power of stereo vision lies in its ability to generate range measurements through triangulation between slightly offset cameras. The system captures gray-scale or colour images and processes them to generate useful information, analyzing the images to establish correspondences between the pixels of each image. The camera's geometry, together with these pixel correspondences, makes it possible to determine the distance to points in the scene, in other words their depth.

As an illustrative example, consider two images of a scene captured simultaneously by a horizontally displaced camera pair. Two points, A and B, can be identified in both images: point A (left) corresponds to point A (right), and point B (left) corresponds to point B (right). Suppose a ruler is used to measure the horizontal distance between the left edge of each image and a given point. The distance in the left image is greater than the distance to the corresponding point in the right image. The difference between these two distances is called the disparity, and it is used to determine the distance from the camera (or the rover) to the point. To summarize, the disparity is the difference between the coordinates of the same feature in the left and right images.

Let the disparity of point A be D(A) = x_left(A) - x_right(A) and that of point B be D(B) = x_left(B) - x_right(B), where x_left(A) is the x coordinate of point A in the left image. Calculating D(A) and D(B) shows that D(B) > D(A), which means that point B is closer to the camera than point A. [8]

Correspondence between the images is established using the Sum of Absolute Differences (SAD) correlation method. For every pixel in the image, the system takes a square neighbourhood of a given size from the reference image and compares it with a number of neighbourhoods along the same row in the other image. This is done using the following formula:
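A common way to write the SAD search uses a square mask of size m and a disparity range from d_min to d_max. With I_right as the reference image and x, y as the column and row, the disparity chosen for a pixel is

$$d^{*}(x,y) = \arg\min_{d_{\min} \le d \le d_{\max}} \sum_{i=-\frac{m}{2}}^{\frac{m}{2}} \sum_{j=-\frac{m}{2}}^{\frac{m}{2}} \left| I_{\mathrm{right}}[x+i][y+j] - I_{\mathrm{left}}[x+i+d][y+j] \right|$$

Features have a larger x coordinate in the left image (D = x_left - x_right), so the search in the left image runs over x + d with d >= 0. The C++ sketch below implements this winner-takes-all search for one pixel. `Gray` and `sadDisparity` are hypothetical names. Both images are assumed to be rectified and of equal size, and there is no sub-pixel refinement or validation.

```cpp
#include <climits>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Grayscale image stored row-major; at(x, y) reads column x of row y.
struct Gray {
    int rows, cols;
    std::vector<int> px;
    int at(int x, int y) const { return px[static_cast<std::size_t>(y) * cols + x]; }
};

// Returns the disparity d in [dMin, dMax] that minimises the SAD between the
// (2*half+1)^2 window around (x, y) in the right (reference) image and the
// window around (x + d, y) in the left image. The search stays on row y,
// which is only valid for rectified images. Returns -1 if no window fits.
inline int sadDisparity(const Gray& right, const Gray& left, int x, int y,
                        int half, int dMin, int dMax) {
    if (x - half < 0 || x + half >= right.cols || y - half < 0 || y + half >= right.rows)
        return -1;  // reference window does not fit inside the image
    int bestD = -1;
    long bestCost = LONG_MAX;
    for (int d = dMin; d <= dMax; ++d) {
        if (x + d - half < 0 || x + d + half >= left.cols) continue;
        long cost = 0;
        for (int j = -half; j <= half; ++j)
            for (int i = -half; i <= half; ++i)
                cost += std::abs(right.at(x + i, y + j) - left.at(x + d + i, y + j));
        if (cost < bestCost) { bestCost = cost; bestD = d; }
    }
    return bestD;
}
```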

Finally, the distance (depth) from the camera, or in my case the exploration rover, is calculated from the displacement between the images and the geometry of the camera. The 3D position of a matched feature is a function of the disparity, the focal length of the lenses, the resolution of the CCD and the baseline (the displacement between the cameras). The approach is illustrated in figure 5. [9]
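For a rectified pair, the relation between disparity and depth follows from similar triangles: Z = f B / d. Here f is the focal length expressed in pixels (focal length divided by the CCD pixel size), B is the baseline and d is the disparity. A larger disparity therefore gives a smaller depth, which is why point B, with D(B) > D(A), is the closer of the two points. The sketch below is a minimal version under these pinhole-model assumptions; in a real system the parameter values come from the camera's calibration.

```cpp
#include <stdexcept>

// Depth from disparity for a rectified stereo pair (pinhole model):
//   Z = f_px * B / d
// f_px: focal length in pixels, B: baseline in metres,
// d: disparity in pixels (x_left - x_right). Returns Z in metres.
inline double depthFromDisparity(double focalPx, double baselineM, double disparityPx) {
    if (disparityPx <= 0.0)
        throw std::invalid_argument("disparity must be positive");
    return focalPx * baselineM / disparityPx;
}

// Focal length in pixels from the lens focal length and the CCD pixel pitch
// (both in millimetres).
inline double focalLengthInPixels(double focalMm, double pixelSizeMm) {
    return focalMm / pixelSizeMm;
}
```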

The disparity map generation algorithm described above is included with the PGR Triclops SDK. Through the API, the core software of the vision-based sensing system calls the pre-defined function that generates the disparity map (disparity image) and then retrieves this image. The function call is done as follows:

The New Algorithm

After examining the obstacle detection algorithms discussed in section 2, "Literature Review & State of the Art", it was decided to develop a new algorithm. It performs the main task of obstacle detection more rapidly and with less processing power, and it suits environments with very limited variation in natural texture, such as the surface of Mars. The examined algorithms are intended for environments on Earth. These are very rich in components (moving objects, trees, land, water and so on) and in textures, while the surface of Mars consists of rocky terrain that is very similar in colour and texture. These algorithms go through many image processing steps to reach the final result. The steps include intensive processing to eliminate different levels of noise, ground removal and the isolation of blobs that could be potential obstacles. Some also perform object recognition rather than just detection, and in some cases they track moving objects to properly understand the environment ahead of the robot. The same algorithms would also work on the surfaces of other planets, but there is no need for so many steps requiring much more processing power and multiple levels of image processing. This matters especially for a space exploration rover. Such a rover is hard to send in the first place and carries many applications dedicated to mission-critical tasks, and the processing requirements of each application must be balanced to ensure the integrity of the mission and its goals.

Therefore, I decided to develop a new algorithm that is simple yet capable of performing the main task of obstacle detection with minimal processing power. The algorithm is explained in detail in the following parts of this section.

Regions of Interest (ROI) & Depth Calculation

The new algorithm starts by segmenting the full captured image into regions of interest (ROI). These are sectors of the image that are processed and analyzed individually to extract useful information. For efficiency, the resolution of the captured images was set to 320 x 240 pixels. During the testing phase, this resolution proved to offer a good balance between processing efficiency and reliability of results. The 320 x 240 image can be segmented into different numbers of regions of interest covering the whole image, and three segmentation approaches were tested: 16 x 12 (192) regions of interest, 8 x 6 (48) regions of interest and 4 x 3 (12) regions of interest. The first approach yielded very accurate results but required more processing power. The third yielded acceptable results with minimal processing power. Since the goal is a balance between accuracy and processing power requirements, the second approach was chosen. Figure 13 illustrates the chosen image segmentation approach.

This can be repeated 48 times to create the 48 regions of interest.
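Here is a minimal sketch of the segmentation. It assumes the image is divided evenly and that sections are numbered 1 to 48 row by row from the top-left, which is the numbering implied by the section ranges used in this section. `Roi` and `makeRois` are hypothetical names.

```cpp
#include <vector>

// One region of interest: its 1-based section number and pixel bounds.
struct Roi {
    int id;             // 1..gridCols*gridRows, numbered row by row from the top-left
    int x0, y0;         // top-left pixel
    int width, height;  // size in pixels
};

// Splits a width x height image into gridCols x gridRows equal regions.
// For 320 x 240 with an 8 x 6 grid, each region is 40 x 40 pixels.
inline std::vector<Roi> makeRois(int width, int height, int gridCols, int gridRows) {
    std::vector<Roi> rois;
    const int w = width / gridCols, h = height / gridRows;
    for (int r = 0; r < gridRows; ++r)
        for (int c = 0; c < gridCols; ++c)
            rois.push_back({r * gridCols + c + 1, c * w, r * h, w, h});
    return rois;
}
```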

To lower the required processing power further, it was decided to ignore 33 regions of interest: those in the first 3 rows (sections 1 to 24), in the first column and in the last two columns. These ignored regions represent either the horizon ahead of the rover or the areas to its right and left, which are not part of the rover's movement area. Only the remaining regions of interest were analyzed further: sections 26 to 30 (the fourth row), 34 to 38 (the fifth row) and 42 to 46 (the sixth row). Rows 5 and 6 represent the ground ahead of the rover, where obstacles or craters are found, and row 4 represents the transition area between the ground and the horizon. This decision was made after analyzing the large number of images of different scenes captured during the testing phase. The analysis showed which sections could be ignored and which are important because they are part of the area the rover would use to move forward. Ignoring the right and left columns provided a clearance area of 16 cm to the right and left of the rover, ensuring a safe traverse without side-collision threats. Figure 14 highlights the important regions of interest.
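The selection rule can be expressed as a simple test on the section number. The sketch assumes row-by-row numbering of the 8 x 6 grid and uses hypothetical function names. It keeps rows 4 to 6 and drops the first column and the last two columns. This leaves exactly the 15 sections 26-30, 34-38 and 42-46 and discards the other 33.

```cpp
#include <vector>

// True for the analysed sections: rows 4..gridRows of the grid, excluding the
// first column and the last two columns. Sections are numbered from 1.
inline bool isImportantRoi(int id, int gridCols = 8, int gridRows = 6) {
    const int row = (id - 1) / gridCols + 1;  // 1-based row
    const int col = (id - 1) % gridCols + 1;  // 1-based column
    return row >= 4 && row <= gridRows && col >= 2 && col <= gridCols - 2;
}

// Lists the section numbers that are analysed in the 8 x 6 grid.
inline std::vector<int> importantRois() {
    std::vector<int> ids;
    for (int id = 1; id <= 8 * 6; ++id)
        if (isImportantRoi(id)) ids.push_back(id);
    return ids;
}
```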

After determining the important regions of interest, the algorithm calculates the average depth of each of them. Each sector is a grid of 16 pixels (4 x 4) that represents one region of interest. The average depth, that is, the distance between the stereo camera and the part of the physical scene captured in each region of interest, is converted to meters. The following code example shows how the average depth calculation is done:
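Here is a sketch of the per-region average. It assumes a row-major depth image whose values are already in metres, with pixels that lack a valid disparity stored as NaN or a non-positive value. `averageDepth` is a hypothetical helper. Skipping invalid pixels matters: counting them as zero would pull the average towards the camera and could trigger false obstacle detections.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Average of the valid depth values (metres) inside a rectangle of a
// row-major depth image. NaN or non-positive values are treated as invalid
// and skipped. Returns NaN if the region has no valid depth at all.
inline double averageDepth(const std::vector<double>& depth, int imageCols,
                           int x0, int y0, int width, int height) {
    double sum = 0.0;
    int count = 0;
    for (int y = y0; y < y0 + height; ++y)
        for (int x = x0; x < x0 + width; ++x) {
            const double z = depth[static_cast<std::size_t>(y) * imageCols + x];
            if (std::isfinite(z) && z > 0.0) { sum += z; ++count; }
        }
    return count > 0 ? sum / count : std::nan("");
}
```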

Object Detection & AI Decision Making

To detect objects efficiently, the algorithm sends the average depths of all the important regions of interest to a function that analyzes each of them. This function compares each average depth against pre-defined thresholds, which were gathered and set during the object detection testing phase described in part B of section 3, "Testing Techniques". The comparison determines whether the region of interest being analyzed suggests an obstacle (average depth below the normal threshold), a crater (average depth well above the normal threshold) or nothing hazardous (average depth within the normal range), as follows:
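The three-way comparison can be sketched as follows. The threshold values are hypothetical placeholders for those measured during testing, and a single pair of thresholds is assumed for all regions. The `Unknown` state, for a region with no valid depth, is an addition of this sketch and not part of the described algorithm.

```cpp
#include <cmath>

enum class RoiState { Clear, Obstacle, Crater, Unknown };

// Hypothetical thresholds in metres; real values come from the testing phase.
struct DepthThresholds {
    double obstacleBelow;  // closer than this: something rises ahead
    double craterAbove;    // farther than this: the ground drops away
};

// Classifies one region of interest from its average depth.
inline RoiState classify(double avgDepth, const DepthThresholds& t) {
    if (std::isnan(avgDepth)) return RoiState::Unknown;
    if (avgDepth < t.obstacleBelow) return RoiState::Obstacle;
    if (avgDepth > t.craterAbove)   return RoiState::Crater;
    return RoiState::Clear;
}
```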

After determining the closest point or object to the rover, the algorithm matches that point back to the region of interest from which it was generated. Each region of interest has a unique number that indicates its position in the image. Knowing which region contains the closest object therefore makes it possible to determine the position of that object relative to the rover.
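Mapping the closest reading back to a position can be sketched as below. The sketch assumes that average depths are stored per section number in a `std::map` and that the grid has 8 columns. The split of the analysed columns 2 to 6 into left (2-3), centre (4) and right (5-6) is a hypothetical choice for illustration. With it, section 26 falls on the left, consistent with the example that follows.

```cpp
#include <cmath>
#include <limits>
#include <map>
#include <string>

// The section holding the closest reading and where it lies in the grid.
struct ClosestRoi {
    int id;            // section number, -1 if nothing valid was found
    double depth;      // average depth in metres
    int row, col;      // 1-based grid position
    std::string side;  // coarse position relative to the rover
};

inline ClosestRoi findClosest(const std::map<int, double>& avgDepthById, int gridCols = 8) {
    ClosestRoi best{-1, std::numeric_limits<double>::infinity(), 0, 0, "none"};
    for (const auto& [id, z] : avgDepthById)
        if (!std::isnan(z) && z < best.depth) { best.id = id; best.depth = z; }
    if (best.id < 0) return best;
    best.row = (best.id - 1) / gridCols + 1;
    best.col = (best.id - 1) % gridCols + 1;
    // Hypothetical split of columns 2-6: 2-3 left, 4 centre, 5-6 right.
    best.side = best.col <= 3 ? "left" : (best.col == 4 ? "centre" : "right");
    return best;
}
```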

These steps allow the algorithm to make decisions and send the correct command to the rover. For example, if the average depth of the 26th region of interest is lower than the normal threshold, this section indicates an object or obstacle on the left side of the rover's area of movement. The following code segment explains the decision-making process:
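One possible decision rule is sketched below. It is hypothetical: the zone summary, the command strings and the preference for turning right when the left side and centre are both blocked are illustrative assumptions, not the author's rules. With an obstacle found in section 26 (left side), it returns "turn right".

```cpp
#include <string>

// Hazard summary for the three zones of the analysed area
// (true = an obstacle or crater was detected in that zone).
struct Zones { bool left, centre, right; };

// Hypothetical rule: go forward if everything is clear, steer away from a
// hazard towards a clear side, and stop when no clear direction remains.
inline std::string decide(const Zones& z) {
    if (!z.left && !z.centre && !z.right) return "move forward";
    if (!z.right && (z.left || z.centre)) return "turn right";
    if (!z.left) return "turn left";
    return "stop";
}
```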


As mentioned earlier, the final decision made by the algorithm about the situation ahead of the camera is intended to be sent to the rover as a command to its motors, such as stop, turn right or turn left. The physical integration of the vision-based sensing system with a rover (robot) is outside the scope of my project. It could be done by mounting the stereo camera on top of the rover together with a small server (computer) suited to the rover's size. On the software side, all the applications, files and SDKs would be installed on the server. They would communicate with the high-level application, which in turn communicates through TCP/IP with the low-level application that controls the rover's motors. The main part of the vision-based sensing software could run in a loop, for example once every 2 seconds, with system pauses controlling the interval between iterations.
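The proposed periodic execution can be sketched with the standard library alone. `runSensingLoop` is a hypothetical function. `decide` stands in for one pass of capture, disparity generation and region analysis, and `send` stands in for the TCP/IP link, so the sketch can be tested without a camera or a rover.

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Runs the sensing cycle a fixed number of times, pausing between iterations.
// decide: produces a command from one sensing pass (hypothetical hook).
// send:   delivers the command to the rover (hypothetical hook).
// Returns the number of commands sent.
inline int runSensingLoop(const std::function<std::string()>& decide,
                          const std::function<void(const std::string&)>& send,
                          std::chrono::milliseconds period, int iterations) {
    int sent = 0;
    for (int i = 0; i < iterations; ++i) {
        send(decide());
        ++sent;
        if (i + 1 < iterations) std::this_thread::sleep_for(period);  // system pause
    }
    return sent;
}
```

On the rover, the period would be `std::chrono::seconds(2)`, matching the example interval above.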

In the current implementation, the final decision is displayed in a console instead of being sent to a rover. Figures 15 to 18 show different decisions made by the algorithm after analyzing the corresponding scenes and scenarios.

3D Point Cloud

A point cloud file contains scanned points that represent the visible surface of an object or a whole scene. Initially, special devices called 3D scanners were used to capture these points into a cloud file. With the developments in stereo vision, stereo cameras can now capture them as well. Point clouds are used in various computer applications, such as the creation of 3D models, especially in industry and animation. They are usually converted to certain types of models first, so that they can be used in these applications and for other purposes. The process of converting point clouds into usable models is known as point cloud reverse engineering.

An algorithm built from a set of pre-defined PGR Triclops functions generates the 3D point cloud using sub-pixel interpolation, which yields more accurate depth extraction. This function also applies surface validation to remove noise from the depth reconstruction, and each valid depth pixel is converted into a 3D XYZ position. The generated 3D point cloud is written to a 3D point file (PTS). If a rectified colour image is used as the source or input image, the colours of the points can also be written to the file.

The algorithm starts the process by determining the spacing, in pixels, between consecutive rows of the image (the row increment).
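The conversion from disparity to 3D points can be sketched with the pinhole relations X = (u - c_x) Z / f and Y = (v - c_y) Z / f, where Z = f B / d. In this sketch, `rowInc` plays the role of the per-row pixel spacing, which can exceed the image width when rows are padded. `StereoParams` and `writePointCloud` are hypothetical. Treating non-positive disparities as invalid and writing plain "x y z" lines are also assumptions. The SDK's own validation, sub-pixel handling and PTS layout may differ, and this sketch does not write point colours.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <vector>

// Calibration-derived parameters of a rectified stereo pair (pinhole model).
struct StereoParams {
    double focalPx;    // focal length in pixels
    double baselineM;  // baseline in metres
    double cx, cy;     // principal point in pixels
};

// Converts every valid disparity into an X Y Z point (metres) and writes one
// point per line as "x y z". rowInc is the spacing between consecutive rows
// of the disparity buffer, in pixels (>= cols). Returns the points written.
inline int writePointCloud(std::ostream& out, const std::vector<float>& disparity,
                           int rows, int cols, int rowInc, const StereoParams& p) {
    int written = 0;
    for (int v = 0; v < rows; ++v) {
        const float* row = disparity.data() + static_cast<std::size_t>(v) * rowInc;
        for (int u = 0; u < cols; ++u) {
            const double d = row[u];
            if (d <= 0.0) continue;  // no valid match for this pixel
            const double z = p.focalPx * p.baselineM / d;
            const double x = (u - p.cx) * z / p.focalPx;
            const double y = (v - p.cy) * z / p.focalPx;
            out << x << ' ' << y << ' ' << z << '\n';
            ++written;
        }
    }
    return written;
}
```

To produce a file, pass an `std::ofstream` opened on a `.pts` path as `out`.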
