handheld augmented reality

Augmented Reality Anywhere and Anytime   


   Social Augmented Reality

   Information Presentation

   Real-Time Self Localization

   Structural Modelling

   AR Graphics

   AR Navigation

   Augmented Reality Games

   Past Projects



   Hybrid SLAM

   Panorama SLAM

   Planar SLAM

   Model-Based Tracking

   Marker-Based Tracking


Software Libraries

   Studierstube ES

   Studierstube Tracker










   Student projects


   Site map


The Handheld Augmented Reality
Project is supported by the
following institutions:



Christian Doppler Forschungsgesellschaft


Graz University of Technology




Come to ISMAR 2011



Follow us on twitter

Natural Feature Tracking

Phones are embedded systems with severe limitations in both the computational facilities (low throughput, no floating point support) and memory bandwidth (limited storage, slow memory, tiny caches). Therefore, natural feature tracking on phones has been up to now considered prohibitive.

We developed the first fully self-contained natural feature tracking system capable of tracking full six degrees of freedom (6DOF) at real-time frame rates (20Hz) from natural features using solely the built-in camera and CPU of the phone. To exploit the nature of typical AR applications, our tracking technique uses only textured planar targets, which are known beforehand and can be used to create a training data set.

We have achieved this by examining a leading approach in feature descriptors, namely SIFT. In the original published form, the SIFT approach is unsuitable for low-end embedded platforms such as phones. We noticed that, while some aspects are computationally infeasible on current generation phones and must be replaced by different approaches, other aspects can be simplified to run at the desired level of speed, quality and resource consumption.

Tracking pipeline

The tracking pipeline is shown in the image below, including also a coarse representation of the processing time required on a phone for each of the stages.

Feature detection

The original SIFT algorithm uses Difference-of-Gaussians (DoG) to perform a scale-space search that not only detects features but also estimates their scale. Although several faster implementations of Lowe's approach have been proposed, the approach is inherently resource intensive and therefore not suitable for real-time execution on mobile phones. We therefore replaced the DoG with the FAST corner detector with non-maximum suppression that is known to be one of the fastest corner detectors, but still provides a high repeatability.

Feature tracking

After new features have been detected, they can optionally be tracked by cross correlation. While feature tracking is not mandatory since the match step is independent of frame-to-frame coherence, it provides two benefits: Most obviously, tracking of features gives a speed up, since features that were tracked reliably don't have to be described and matched against the SIFT database. At the same time feature tracking also improves the overall robustness since features that passed all outlier tests are forwarded with highest confidence values into the next frame, which improves outlier removal.

Our feature tracker follows both "good" and "bad" features from frame to frame. Good features passed all tests and finally contributed to the pose estimation in the previous frame. Hence, they provide a good basis for the next camera frame. Bad features could either not be matched or were filtered out by the outlier removal step. Since a bad feature is likely to be re-detected in the next frame, much processing time can be saved by forwarding this information to the next frame. Forwarding both good and bad features removes the need to describe and match them, resulting in a considerable speedup.

Descriptor creation

Most people associate SIFT only with its most complex variant, which is built from 4x4 sub-regions with 8 gradient bins each, resulting in a 128 dimensional vector. For performance and memory reasons we decided to use a variant using only 3x3 sub-regions with 4 bins each (resulting in a 36 dimensional vector). As Lowe outlines, this variant performs only ~10 percent worse than the best variant.

Our SIFT kernel is always 15 pixels wide (3 sub-regions with a size of 5 pixels each). To gain more robustness, we again blur the patch with a 3x3 Gaussian kernel. Like in the original implementation we first estimate the main orientations from the patchÕs gradients: for all pixels of the kernel, we calculate the gradient direction and magnitude. The gradient magnitude is weighted using a distance measure and is added to the respective bin. The resulting histogram is then searched for peaks. If more than 3 peaks are found, the feature is discarded.

For each detected orientation, the patch is rotated using sub-pixel accuracy to compensate that orientation. Based on the rotated patches, descriptor vectors are created: Gradient magnitudes and orientations are estimated again and weighted by their distance to the patch center as well as to the sub-region center. The weighted magnitudes are then written into the 4 bins corresponding to the sub-region.

Descriptor matching

After the descriptors for all features detected in the new camera image (except for those tracked from the previous frame) have been created, they are matched against the descriptors in the database. Brute force matching is not an option, since for each frame ~50-100 features have to be matched against ~5000 features in the database. determined) per pixel to treat a feature as correctly matched. The original SIFT implementation uses a k-d Tree together with a Best-Bin-First strategy. Yet, our tests showed that even with this approach far too many vectors have to be compared for real-time performance on mobile phones. Looking in more detail, we discovered that some (usually 1-3) entries of the descriptors vary strongly from the respective descriptors in the database. These errors increase the required tolerance for searching in the tree tremendously.

A Spill Tree is a variant of a k-d Tree that uses an overlapping splitting area. Values that are within a certain threshold are dropped into both branches. With an increasing threshold, a Spill Tree is capable of tolerating more and more error at the cost of growing larger. We discovered that multiple Spill Trees with randomized dimensions for pivoting allow for a highly robust voting process, similar to the idea of randomized trees. Instead of using a single tree, we combine a number of Spill Trees into a Spill Forest. Each Spill Tree is built up to such a size, that it holds ~50-80 entries in each leaf. While trees with more levels reduce search time, more levels also increase the chance of testing a faulty dimension. Since only a few values of a vector are expected to be wrong, a vector has a high probability of showing up in the "best" leaf of each tree. We therefore only visit a single leaf in each tree and merge the resulting candidates. Descriptors that show up in only one leaf are discarded. All others are matched using Sum of Squared Difference (SSD).

Outlier removal

The outlier removal operates in three steps.

  1. The first step uses the orientations that have already been estimated for the descriptor creation. Since the tracker is limited to planar targets, all features should have a similar orientation. Entering all orientations into a histogram and searching for a peak, a main orientation is quickly estimated and used to filter out those features which do not support this hypothesis. Since the feature orientations are already available, this step is very fast; at the same time it is very efficient in removing most of the outliers.
  2. The second outlier removal step uses simple geometric tests. Starting with the most confident candidates, lines are estimated from two features. All other features are then tested to lie on the same side of the line in camera as well as object space. If too many features (>50%) fail the test for a single line, the test is canceled, since one of the two reference features is probably faulty. Features which fail line tests are removed.
  3. The third outlier removal step creates homographies to remove final outliers. Features that passed the previous two tests obviously have a correct orientation and coarsely lie at the right spot in the camera image. Hence, only few outliers remain. Since a homography must be computed for the initial pose anyway, it can be used for strong final test without introducing overhead. 8 features are combined into sets of 4 and used to calculate 16 candidate homographies from object into camera space. All homographies are then used to test the projection of the features. The homography with the smallest number of outliers and all respective inliers are finally selected for pose refinement.


A video showing the natural feature tracker in action on a mobile phone found can be found in the media section.


The dataset used can be found here: ismar08_data.zip.


copyright (c) 2014 Graz University of Technology