Eye Tracking System for HCI


1. Introduction

Eye Tracking System for HCI is a real-time Human-Computer Interaction system that aims to accomplish the following tasks:

  • eye tracking through real-time web camera input
  • determining the best algorithm for eye tracking among existing algorithms

The goal of the project is to design software for people with severe physical disabilities, such as paralysis, ALS (Lou Gehrig's disease), and multiple sclerosis, creating an interface between the user and the machine. Using eye movements as a control medium with real-time input helps people with severe disabilities interact with simple computer applications in a meaningful way. (Website address)


OpenCV is a computer vision library originally developed by Intel. It is a large real-time image processing library, free for commercial and research use under the open-source BSD license. The library is cross-platform and runs on Windows, Mac OS X, Linux, the PSP, VCRT (a real-time OS on smart cameras), and other embedded devices. OpenCV was designed for high computational efficiency in real-time applications.

Officially launched in 1999, the OpenCV project was initially an Intel Research initiative to advance CPU-intensive applications, part of a series of projects including real-time ray tracing and 3D display walls. One of OpenCV's goals is to provide a simple-to-use computer vision infrastructure that helps people build fairly sophisticated vision applications quickly [1].

System block diagram

1.1 Source Stage

This stage refers to the real-time visual input to the system. The data is fetched from different types of visual data sources.

1.2 Face Extraction Stage

This stage performs face detection on the visual input, extracts the largest face, and passes it to the next stage for further processing.

1.2.1 Face Detection

Face detection is a computer technology that determines the locations and sizes of human faces in digital images. It detects facial features and ignores everything else, such as buildings, trees, and bodies. Face detection is concerned with finding whether there are any faces in a given image, usually in grayscale, and, if any are present, returning the location and extent of each face.

1.2.2 Haar cascade classifier for Face Detection

A rapid face detection algorithm based on the papers published by Paul Viola and Michael Jones in 2001 has been implemented as the face detection module. The algorithm is among the fastest existing face detection algorithms. It performs detection in several stages (a cascade), rejecting most non-face regions at the early stages. Each stage consists of Haar-like features which are used to classify a sub-window as face or non-face. Feature values are computed rapidly using a technique called the integral image, and the features themselves are selected by the AdaBoost algorithm.

OpenCV already ships with a trained classifier for frontal face detection.
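A minimal sketch of this stage, assuming the opencv-python package (whose cv2.data.haarcascades path points to the bundled cascade files) and camera index 0:

    import cv2

    # Load the pre-trained frontal-face Haar cascade shipped with OpenCV.
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    face_cascade = cv2.CascadeClassifier(cascade_path)

    cap = cv2.VideoCapture(0)          # real-time web camera input
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # detectMultiScale returns one (x, y, w, h) rectangle per detected face.
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) > 0:
            # Keep the largest face, as in the Face Extraction Stage.
            x, y, w, h = max(faces, key=lambda r: r[2] * r[3])
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow("faces", frame)
        if cv2.waitKey(1) & 0xFF == 27:   # Esc to quit
            break
    cap.release()
    cv2.destroyAllWindows()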

1.3 Eye Extraction Stage

In this stage a rough estimate of the eye locations is determined. Eyes are usually located in the upper half of the face, so that region is set as the Region of Interest and the eyes are extracted from it.
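A small sketch of this rough estimate, assuming a grayscale frame and a face rectangle (x, y, w, h) produced by the previous stage; the half-splits are illustrative heuristics:

    def extract_eye_regions(gray_frame, face_rect):
        """Crop the upper half of a detected face, where the eyes usually lie,
        and split it into left and right eye candidate regions."""
        x, y, w, h = face_rect
        upper_half = gray_frame[y:y + h // 2, x:x + w]
        left_eye = upper_half[:, :w // 2]    # image-left (subject's right eye)
        right_eye = upper_half[:, w // 2:]   # image-right (subject's left eye)
        return left_eye, right_eye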

1.3.1 Edge Detection using the Canny Edge Detector

The ability to measure gray-level transitions in an image is called edge detection. Since the iris is always darker than the sclera, whatever its color, it is easy to detect the edge points of the iris. The Canny edge detection operator was developed by John F. Canny in 1986 and uses a multi-stage algorithm to detect a wide range of edges in images. The Canny edge detector uses a filter based on the first derivative of a Gaussian to reduce the noise present in raw, unprocessed image data. The raw image is convolved with a Gaussian filter, producing a slightly blurred version of the original that is not significantly affected by any single noisy pixel. The Canny algorithm then uses four filters to detect horizontal, vertical, and diagonal edges in the blurred image.
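A minimal sketch of these steps with OpenCV; the file name, kernel size, and hysteresis thresholds are illustrative assumptions:

    import cv2

    eye_roi = cv2.imread("eye_roi.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input

    # Convolve with a Gaussian first, so no single noisy pixel dominates.
    blurred = cv2.GaussianBlur(eye_roi, (5, 5), 1.4)

    # Hysteresis thresholds: gradients above 150 are strong edges; pixels
    # between 50 and 150 are kept only if connected to a strong edge.
    edges = cv2.Canny(blurred, 50, 150)
    cv2.imwrite("eye_edges.png", edges)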

1.3.2 Iris Detection using Circular Hough Transform

The Circular Hough Transform is applied after some pre-processing steps. First, the eye image extracted in the previous stage is converted to grayscale to remove the effect of illumination. Then the Circular Hough Transform, which internally applies the Canny edge detector, searches for circular curves in the binary edge image.
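A sketch of the pre-processing and the transform using OpenCV's cv2.HoughCircles; the parameter values are assumptions that would need tuning for a particular camera:

    import cv2
    import numpy as np

    eye_bgr = cv2.imread("eye_roi.png")                    # hypothetical input
    eye_gray = cv2.cvtColor(eye_bgr, cv2.COLOR_BGR2GRAY)   # grayscale pre-processing
    eye_gray = cv2.medianBlur(eye_gray, 5)                 # suppress speckle noise

    # HOUGH_GRADIENT applies Canny internally; param1 is Canny's high threshold
    # and param2 is the accumulator threshold for accepting a circle center.
    circles = cv2.HoughCircles(eye_gray, cv2.HOUGH_GRADIENT, dp=1, minDist=20,
                               param1=150, param2=30, minRadius=5, maxRadius=40)
    if circles is not None:
        for cx, cy, r in np.round(circles[0]).astype(int):
            cv2.circle(eye_bgr, (cx, cy), r, (0, 255, 0), 2)  # candidate iris
    cv2.imwrite("iris_detected.png", eye_bgr)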

1.4 Eye and Iris Tracking

1.4.1 CAMShift Algorithm

CAMShift was developed by Gary R. Bradski of Intel. It tracks the mode of a color probability distribution across the frames of a video sequence. The color image data is represented as a probability distribution through back-projection. The back-projection is created for each frame and is generated from a color histogram model of the iris color. The center and size of the iris region are found and used to set the search window for the next frame of the video sequence.

1.5 Pupil Tracking

1.5.1 Lucas Kanade Algorithm

The Lucas-Kanade method is a two-frame differential method for optical flow estimation developed by Bruce D. Lucas and Takeo Kanade.

Optical flow is a method for estimating the motion of objects across a series of frames. The method is based on the assumption that points at the same object location (and therefore the corresponding pixel values) have constant brightness over time.
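A minimal sketch of pupil tracking with pyramidal Lucas-Kanade in OpenCV; the starting coordinates are a hypothetical pupil center handed over by the detection stage:

    import cv2
    import numpy as np

    cap = cv2.VideoCapture(0)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

    # Hypothetical starting point: the pupil center found by the detection stage.
    points = np.array([[[320.0, 240.0]]], dtype=np.float32)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Pyramidal Lucas-Kanade: estimate where the tracked point moved.
        new_points, status, err = cv2.calcOpticalFlowPyrLK(
            prev_gray, gray, points, None, winSize=(15, 15), maxLevel=2)
        if status[0][0] == 1:
            x, y = new_points[0][0]
            cv2.circle(frame, (int(x), int(y)), 4, (0, 0, 255), -1)
            points = new_points
        prev_gray = gray
        cv2.imshow("pupil track", frame)
        if cv2.waitKey(1) & 0xFF == 27:
            break
    cap.release()
    cv2.destroyAllWindows()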

1.6 Mapping Tracked Movement of Eye with Mouse


  • A few real-world applications
  • A few implemented by us

2. Literature Survey/Review

2.1 Face detection using Haar cascade classifier

Face detection is a challenging task, as it must account for all possible appearance variations caused by changes in illumination, facial features, occlusions, etc. It has to detect faces that appear at different scales and poses, and with in-plane rotations. In order to detect faces, the algorithm should employ higher-level feature extraction rather than examining properties of individual pixels.

This paper describes the implementation of the Haar cascade classifier for face detection as one of the most efficient algorithms for processing real-time video.

2.1.1 Haar Wavelet

In the Haar wavelet transform, images are mapped from the space of pixels to an over-complete dictionary of Haar wavelet features that provides a rich description of the pattern. This representation captures the structure of the class of objects to be detected while ignoring the noise inherent in the images. Haar wavelets identify local, oriented intensity-difference features at different scales and are efficiently computable.

The Haar wavelet is perhaps the simplest such feature with finite support. The images are transformed from pixel space to the space of wavelet coefficients, resulting in an over-complete dictionary of features that are then used to train a classifier. The Haar wavelet is one of the simplest wavelets; a step function is a type of Haar wavelet.

2.1.2 Haar-like features

Haar-like features encode the existence of oriented contrasts between regions in the image. A set of these features can be used to encode the contrasts exhibited by a human face and their spatial relationships. Haar-like features are so called because they are computed in a manner similar to the coefficients of the Haar wavelet transform.

2.1.3 Haar cascade classifier

An image is only a collection of color and/or light intensity values. Analyzing these pixels for face detection is time-consuming and difficult because of the wide variation in shape and pigmentation among human faces. Viola and Jones devised an algorithm, called the Haar classifier, to rapidly detect any object, including human faces, using AdaBoost classifier cascades that classify images based on Haar-like features rather than pixels.

The core basis of Haar classifier object detection is the Haar-like feature. These features, rather than using the intensity values of individual pixels, use the change in contrast between adjacent rectangular groups of pixels. The contrast variances between the pixel groups are used to determine relative light and dark areas. Two or three adjacent groups with a relative contrast variance form a Haar-like feature. A simple rectangular Haar-like feature can be defined as the difference between the sum of pixels in the white region and the sum in the dark region of the feature, as shown in Figure 1.

(a) is a two-rectangle feature (two vertically stacked rectangles). Similarly, (c) and (d) are three-rectangle features, and (e) is the four-rectangle feature.

The simple rectangular features of an image are calculated using an intermediate representation of the image called the integral image. The integral image is an array in which the entry at location (x, y) contains the sum of the intensity values of all pixels above and to the left of (x, y), inclusive. Using the integral image, the sum over any rectangle requires only four array references, so taking the difference of six to eight array elements forming two or three connected rectangles computes a feature at any scale.
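A small sketch of the integral image and a two-rectangle feature computed from it; the 6x6 test image is an arbitrary example:

    import numpy as np

    def integral_image(img):
        """Cumulative sums so any rectangle sum needs only four lookups."""
        return img.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, x, y, w, h):
        """Sum of pixels in the rectangle with top-left (x, y), width w and
        height h, using the integral image ii (four lookups regardless of size)."""
        A = ii[y - 1, x - 1] if x > 0 and y > 0 else 0
        B = ii[y - 1, x + w - 1] if y > 0 else 0
        C = ii[y + h - 1, x - 1] if x > 0 else 0
        D = ii[y + h - 1, x + w - 1]
        return D - B - C + A

    # Two-rectangle Haar-like feature: left (white) half minus right (dark) half.
    img = np.arange(36, dtype=np.int64).reshape(6, 6)
    ii = integral_image(img)
    feature = rect_sum(ii, 0, 0, 3, 6) - rect_sum(ii, 3, 0, 3, 6)
    print(feature)   # -54 for this test image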

The cascading of the classifiers allows only the sub-images with the highest probability of containing a face to be analyzed against all the Haar-like features that distinguish an object. It also allows one to vary the accuracy of a classifier. The cascaded classifiers are trained on two sets of data, one containing the object to be identified and another without it. This is how the Haar cascade classifier is built.

2.1.4 Haar Training

1. Positive (Face) Images

We need to collect positive images that contain only objects of interest, e.g., faces.

2. Negative (Background) Images

We need to collect negative images that do not contain the objects of interest, e.g., faces, to train the Haar cascade classifier.

3. Natural Test (Face in Background) Images

We can synthesize test image sets using the create samples utility, but having a natural test image dataset is still valuable.


* Snapshots of face detection

2.2 Eye Detection using Circular Hough Transform

2.2.1 Hough Transform

The Hough transform is a feature extraction technique that can be used to determine the parameters of simple geometric objects, such as lines and circles, present in an image. The purpose of the technique is to find imperfect instances of objects within a certain class of shapes by a voting procedure. This voting procedure is carried out in a parameter space, from which object candidates are obtained as local maxima in a so-called accumulator space that is explicitly constructed by the algorithm computing the Hough transform.

2.2.2 Circular Hough Transform

The Circular Hough Transform (CHT) finds circular patterns within an image. CHT transforms a set of feature points in the image space into a set of accumulated votes in a parameter space: for each feature point, votes are accumulated in an accumulator array over all parameter combinations. The array elements containing the highest numbers of votes indicate the presence of the shape.

The process of finding circles in an image using the CHT (Circular Hough Transform) is:

  1. First we find all edges in the image. This step has nothing to do with the Hough Transform, and any edge detection technique may be used: Canny, Sobel, or morphological operations.
  2. At each edge point we draw a circle centered at that point with the desired radius. In this way we sweep over every edge point in the input image, drawing circles with the desired radii and incrementing the values in our accumulator. When every edge point and every desired radius has been used, we can turn our attention to the accumulator. The accumulator now contains numbers corresponding to the number of circles passing through each coordinate, so the highest values (selected in an intelligent way, in relation to the radius) correspond to the centers of the circles in the image.
  3. One or several maxima in the accumulator are found.
  4. The parameters (r, a, b) corresponding to the maxima are mapped back to the original image.

Finding circles:

One approach is to find the highest peaks in each (a, b) plane corresponding to a particular radius in the accumulator data. If the height of a peak is comparable to the number of edge pixels for a circle of that radius, the coordinates of the peak probably correspond to the center of such a circle.
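A from-scratch sketch of this voting procedure for a single radius, in plain NumPy; the number of angular samples and the toy edge map are assumptions:

    import numpy as np

    def circular_hough(edges, radius, n_angles=100):
        """Accumulate votes for circle centers (a, b) at one fixed radius.
        `edges` is a binary edge map (nonzero pixels are edge points)."""
        h, w = edges.shape
        acc = np.zeros((h, w), dtype=np.int32)
        thetas = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
        ys, xs = np.nonzero(edges)
        for x, y in zip(xs, ys):
            # Every edge point votes for all centers lying `radius` away from it.
            a = np.round(x - radius * np.cos(thetas)).astype(int)
            b = np.round(y - radius * np.sin(thetas)).astype(int)
            keep = (a >= 0) & (a < w) & (b >= 0) & (b < h)
            np.add.at(acc, (b[keep], a[keep]), 1)
        return acc

    # The accumulator maximum gives the most likely center for this radius;
    # sweeping a range of radii yields the full (r, a, b) search.
    edges = np.zeros((64, 64), dtype=np.uint8)
    t = np.linspace(0, 2 * np.pi, 50)   # toy circle of radius 10 at (32, 32)
    edges[np.round(32 + 10 * np.sin(t)).astype(int),
          np.round(32 + 10 * np.cos(t)).astype(int)] = 1
    acc = circular_hough(edges, radius=10)
    b, a = np.unravel_index(acc.argmax(), acc.shape)
    print("center estimate:", (a, b))   # expect approximately (32, 32)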

* Snapshots of circle drawn around the iris

2.3 Edge detection using the Canny edge detector

2.3.1 Edge detection

In an image, edges characterize object boundaries and are useful for segmentation, registration, and identification of objects in a scene. An edge is a jump in intensity: the cross section of an edge has the shape of a ramp, and an ideal edge is a discontinuity (i.e., a ramp with an infinite slope). The first derivative assumes a local maximum at an edge. For a continuous image f(x, y), where x and y are the row and column coordinates respectively, we typically consider the two directional derivatives ∂f/∂x and ∂f/∂y. Of particular interest in edge detection are two functions that can be expressed in terms of these directional derivatives: the gradient magnitude and the gradient orientation. The gradient magnitude is defined as

    |∇f| = sqrt( (∂f/∂x)² + (∂f/∂y)² )

Local maxima of the gradient magnitude identify edges in f(x, y). When the first derivative achieves a maximum, the second derivative is zero. For this reason, an alternative edge-detection strategy is to locate the zeros of the second derivatives of f(x, y). The differential operator used in these so-called zero-crossing edge detectors is the Laplacian.

2.3.2 Canny edge detector

An edge is normally defined as an abrupt change in colour intensity.

2.4 CAMShift algorithm for tracking

The CAMShift algorithm is a color-based object tracking technique using a one-dimensional histogram consisting of quantized channels from the HSV color space.

It is based on an adaptation of Mean Shift that, given a probability density image, finds the mean (mode) of the distribution by iterating in the direction of maximum increase in probability density.

The primary difference between CAMShift and the Mean Shift algorithm is that CAMShift uses continuously adaptive probability distributions (that is, distributions that may be recomputed for each frame) while Mean Shift is based on static distributions, which are not updated unless the target experiences significant changes in shape, size or color.

The CamShift Algorithm

The CamShift algorithm can be summarized in the following steps (Intel Corporation, 2001):

  1. Set the region of interest (ROI) of the probability distribution image to the entire image.
  2. Select an initial location of the Mean Shift search window. The selected location is the target distribution to be tracked.
  3. Calculate a color probability distribution of the region centred at the Mean Shift search window.
  4. Iterate the Mean Shift algorithm to find the centroid of the probability image. Store the zeroth moment (distribution area) and the centroid location.
  5. For the following frame, center the search window at the mean location found in Step 4 and set the window size to a function of the zeroth moment. Go to Step 3.
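A sketch of these steps with OpenCV's cv2.CamShift; the initial window is a hypothetical iris detection result, and the histogram uses only the hue channel, as in Bradski (1998):

    import cv2
    import numpy as np

    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()

    # Hypothetical initial window (x, y, w, h) around the detected iris.
    track_window = (300, 220, 40, 40)

    # Steps 2-3: hue histogram of the initial region (the target model).
    x, y, w, h = track_window
    hsv_roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    roi_hist = cv2.calcHist([hsv_roi], [0], None, [16], [0, 180])
    cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

    # Stop Mean Shift after 10 iterations or when the window moves < 1 pixel.
    term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # Probability distribution image for this frame (back-projection).
        backproj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
        # Steps 4-5: Mean Shift iterations, then re-size/re-center the window.
        rotated_rect, track_window = cv2.CamShift(backproj, track_window, term_crit)
        cv2.ellipse(frame, rotated_rect, (0, 255, 0), 2)
        cv2.imshow("CamShift", frame)
        if cv2.waitKey(1) & 0xFF == 27:
            break
    cap.release()
    cv2.destroyAllWindows()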

Continuously Adaptive Distributions

The probability distribution image (PDF) is determined using any method that associates a pixel value with a probability that the given pixel belongs to the target. A common method is known as Histogram Back-Projection. In order to generate the PDF, an initial histogram is computed at Step 1 of the CamShift algorithm from the initial ROI of the filtered image.

The histogram used in Bradski (1998) consists of the hue channel in HSV color space, however multidimensional histograms from any color space may be used.

The histogram is quantized into bins, which reduces the computational and space complexity and allows similar color values to be clustered together. The histogram bins are then scaled between the minimum and maximum probability image intensities using Equation 2.

Histogram Back-Projection

Histogram back-projection is a primitive operation that associates the pixel values in the image with the value of the corresponding histogram bin.

The back-projection of the target histogram onto any subsequent frame generates a probability image in which the value of each pixel characterizes the probability that the input pixel belongs to the histogram that was used. [CRPITV36Allen.pdf]
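The operation itself amounts to a table lookup. A pure-NumPy illustration, assuming an OpenCV-style hue channel in the range 0..179 and a 16-bin target histogram:

    import numpy as np

    def back_project(hue, hist):
        """Replace each pixel's hue with the value of its histogram bin,
        yielding a per-pixel 'probability of belonging to the target'."""
        nbins = len(hist)
        bin_idx = (hue.astype(np.int32) * nbins) // 180   # OpenCV hue is 0..179
        return hist[np.clip(bin_idx, 0, nbins - 1)]

    # Example: a histogram peaked at one hue bin lights up only the pixels
    # whose hue falls in that bin.
    hist = np.zeros(16); hist[9] = 1.0          # hypothetical iris hue bin
    hue = np.array([[100, 110], [20, 112]])     # toy frame of hue values
    print(back_project(hue, hist))              # [[0. 1.] [0. 1.]]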

The technique performs comparatively well in the presence of occlusions and in images muddled by artifacts such as shadows and noise.


2.4.1 Eye tracking using CAMShift Algorithm

  • How this algorithm is implemented for tracking the eye
  • Snapshot of ellipse drawn around the eye

2.4.2 Iris Tracking using CAMShift Algorithm

  • How this algorithm is implemented for tracking the iris
  • Snapshot of ellipse drawn around the iris and its change in size with the movement of the eye
  • Figure of back projection histogram

2.5 Lucas-Kanade for tracking

Optical flow means tracking specific features (points) in an image across multiple frames. It finds objects from one frame in subsequent frames and determines the speed and direction of their movement. It tracks features, principally corners and edges.

2.5.1 Pupil tracking using Lucas-Kanade

How this algorithm is implemented for tracking the pupil

3. Performance and Analysis

3.1 Performance measurements of algorithms implemented

  • Efficiency of the eye tracking using Lucas-Kanade
  • Efficiency of the eye tracking using CAMShift
  • The better one among the two tracking algorithms

3.2 Performance analysis of EYE TRACKING SYSTEM FOR HCI under a specific environment/platform


    [1] Gary Bradski & Adrian Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library, O'Reilly Media, September 2008.

    [2] Websites: viewed date


    * Distance from camera:

    Iris detection: best detection at a distance of 50 cm from the camera.

    Detection is also possible up to 100 cm from the camera; accuracy decreases beyond that.

    Face detection: at less than 25 cm from the camera, face detection does not occur. Face detection does occur up to a distance of 4.5 m, along with the detection of many non-face objects, but iris detection is not accurate at that distance, so this has little significance.

    Height: the face should be placed at the height of the web camera for the entire process to take place accurately.

    Movement of the head/body is restricted after the process of calibration.

    Performance analysis:


    Face detection: from a normal distance, i.e., 30 cm: 0.2518 seconds for a single face, 0.314145 seconds for two faces, and 0.337079 seconds at a farther distance.

    Conclusion: the distance from the camera has little effect on the speed of face detection.

    Iris detection: from a normal distance: 0.205625 seconds.

    CAMShift tracking: 0.00392829 seconds; very fast and uses few CPU cycles.




    A high-resolution camera is required for better accuracy.

    The system takes a significant amount of CPU time because the entire process runs in real time.

    The algorithm based on a robust nonparametric technique for climbing density gradients to find the mode (peak) of probability distributions is called the mean shift algorithm. In our case, we want to find the mode of a color distribution within a video scene.

    Therefore, the mean shift algorithm is modified to deal with dynamically changing color probability distributions derived from video frame sequences. The modified algorithm is called the Continuously Adaptive Mean Shift (CAMSHIFT) algorithm.

    2.3 HSI, HSV, HSL - Hue Saturation Intensity (Value, Lightness)

    Hue-saturation based colorspaces were introduced when there was a need for the user to specify color properties numerically. They describe color with intuitive values, based on the artist's idea of tint, saturation and tone. Hue defines the dominant color (such as red, green, purple and yellow) of an area; saturation measures the colorfulness of an area in proportion to its brightness [Poynton 1995].

    The "intensity", "lightness" or "value" is related to the color luminance.

    The intuitiveness of the colorspace components and explicit discrimination between luminance and chrominance properties made these colorspaces popular in the works on skin color segmentation [Zarit et al. 1999], [McKenna et al. 1998], [Sigal et al. 2000], [Birchfield 1998], [Jordao et al. 1999]. Several interesting properties of Hue were noted in [Skarbek and Koschan 1994]: it is invariant to highlights at white light sources, and also, for matte surfaces, to ambient light and surface orientation relative to the light source. However, [Poynton 1995], points out several undesirable features of these colorspaces, including hue discontinuities and the computation of "brightness" (lightness, value), which conflicts badly with the properties of color vision.

    An alternative way of computing hue and saturation using log opponent values was introduced in [Fleck et al. 1996], where an additional logarithmic transformation of RGB values aimed to reduce the dependence of chrominance on the illumination level.

    The polar coordinate system of Hue-Saturation spaces results in a cyclic colorspace, which makes it inconvenient for parametric skin color models that need a tight cluster of skin colors for best performance. A different representation of Hue-Saturation using Cartesian coordinates can be used [Brown et al. 2001]:

        X = S cos H,  Y = S sin H    (5)
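    A small sketch of this mapping (Equation 5), illustrating how it removes the hue wrap-around discontinuity:

        import numpy as np

        def hs_to_cartesian(hue_deg, sat):
            """Map cyclic Hue-Saturation coordinates to Cartesian ones (Eq. 5)."""
            h = np.radians(hue_deg)
            return sat * np.cos(h), sat * np.sin(h)

        # Hues of 359 and 1 degree are nearly identical colors; in Cartesian
        # form they are close, while raw hue values put them at opposite ends.
        print(hs_to_cartesian(359.0, 1.0))   # ~(1.0, -0.017)
        print(hs_to_cartesian(1.0, 1.0))     # ~(1.0,  0.017)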
