Augmented Reality Anywhere and Anytime
The Handheld Augmented Reality
Come to ISMAR 2011
Natural Feature Tracking
Phones are embedded systems with severe limitations in both the computational facilities (low throughput, no floating point support) and memory bandwidth (limited storage, slow memory, tiny caches). Therefore, natural feature tracking on phones has been up to now considered prohibitive.
We developed the first fully self-contained natural feature tracking system capable of tracking full six degrees of freedom (6DOF) at real-time frame rates (20Hz) from natural features using solely the built-in camera and CPU of the phone. To exploit the nature of typical AR applications, our tracking technique uses only textured planar targets, which are known beforehand and can be used to create a training data set.
We have achieved this by examining a leading approach in feature descriptors, namely SIFT. In the original published form, the SIFT approach is unsuitable for low-end embedded platforms such as phones. We noticed that, while some aspects are computationally infeasible on current generation phones and must be replaced by different approaches, other aspects can be simplified to run at the desired level of speed, quality and resource consumption.
The tracking pipeline is shown in the image below, including also a coarse representation of the processing time required on a phone for each of the stages.
The original SIFT algorithm uses Difference-of-Gaussians (DoG) to perform a scale-space search that not only detects features but also estimates their scale. Although several faster implementations of Lowe's approach have been proposed, the approach is inherently resource intensive and therefore not suitable for real-time execution on mobile phones. We therefore replaced the DoG with the FAST corner detector with non-maximum suppression that is known to be one of the fastest corner detectors, but still provides a high repeatability.
After new features have been detected, they can optionally be tracked by cross correlation. While feature tracking is not mandatory since the match step is independent of frame-to-frame coherence, it provides two benefits: Most obviously, tracking of features gives a speed up, since features that were tracked reliably don't have to be described and matched against the SIFT database. At the same time feature tracking also improves the overall robustness since features that passed all outlier tests are forwarded with highest confidence values into the next frame, which improves outlier removal.
Our feature tracker follows both "good" and "bad" features from frame to frame. Good features passed all tests and finally contributed to the pose estimation in the previous frame. Hence, they provide a good basis for the next camera frame. Bad features could either not be matched or were filtered out by the outlier removal step. Since a bad feature is likely to be re-detected in the next frame, much processing time can be saved by forwarding this information to the next frame. Forwarding both good and bad features removes the need to describe and match them, resulting in a considerable speedup.
Most people associate SIFT only with its most complex variant, which is built from 4x4 sub-regions with 8 gradient bins each, resulting in a 128 dimensional vector. For performance and memory reasons we decided to use a variant using only 3x3 sub-regions with 4 bins each (resulting in a 36 dimensional vector). As Lowe outlines, this variant performs only ~10 percent worse than the best variant.
After the descriptors for all features detected in the new camera image (except for those tracked from the previous frame) have been created, they are matched against the descriptors in the database. Brute force matching is not an option, since for each frame ~50-100 features have to be matched against ~5000 features in the database. determined) per pixel to treat a feature as correctly matched. The original SIFT implementation uses a k-d Tree together with a Best-Bin-First strategy. Yet, our tests showed that even with this approach far too many vectors have to be compared for real-time performance on mobile phones. Looking in more detail, we discovered that some (usually 1-3) entries of the descriptors vary strongly from the respective descriptors in the database. These errors increase the required tolerance for searching in the tree tremendously.
A Spill Tree is a variant of a k-d Tree that uses an overlapping splitting area. Values that are within a certain threshold are dropped into both branches. With an increasing threshold, a Spill Tree is capable of tolerating more and more error at the cost of growing larger. We discovered that multiple Spill Trees with randomized dimensions for pivoting allow for a highly robust voting process, similar to the idea of randomized trees. Instead of using a single tree, we combine a number of Spill Trees into a Spill Forest. Each Spill Tree is built up to such a size, that it holds ~50-80 entries in each leaf. While trees with more levels reduce search time, more levels also increase the chance of testing a faulty dimension. Since only a few values of a vector are expected to be wrong, a vector has a high probability of showing up in the "best" leaf of each tree. We therefore only visit a single leaf in each tree and merge the resulting candidates. Descriptors that show up in only one leaf are discarded. All others are matched using Sum of Squared Difference (SSD).
The outlier removal operates in three steps.
A video showing the natural feature tracker in action on a mobile phone found can be found in the media section.
The dataset used can be found here: ismar08_data.zip.
copyright (c) 2014 Graz University of Technology