
Feature point extraction

One of the most important requirements for a feature point is that it can be differentiated from its neighboring image points. If this were not the case, it would not be possible to match it uniquely with a corresponding point in another image. Therefore, the neighborhood of a feature should be sufficiently different from the neighborhoods obtained after a small displacement.

A second order approximation of the dissimilarity, as defined in Eq. (4.1), between an image window $W$ and a slightly translated image window is given by

\begin{displaymath}
D(\Delta x,\Delta y)=\left[ \begin{array}{cc} \Delta x & \Delta y \end{array} \right] {\bf M} \left[ \begin{array}{c} \Delta x \\ \Delta y \end{array} \right]
\quad \mbox{with} \quad
{\bf M} = \int\!\!\int_W \left[ \begin{array}{c} \frac{\partial I}{\partial x} \\ \frac{\partial I}{\partial y} \end{array} \right] \left[ \begin{array}{cc} \frac{\partial I}{\partial x} & \frac{\partial I}{\partial y} \end{array} \right] w(x,y) \, dx \, dy
\end{displaymath} (4.2)

To ensure that no displacement exists for which $D$ is small, both eigenvalues of ${\bf M}$ should be large. This can be achieved by enforcing a minimal value for the smallest eigenvalue [136] or, alternatively, for the following corner response function $R= \det {\bf M} - k (\mbox{trace } {\bf M})^2$ [45], where $k$ is a parameter that Harris suggested setting to 0.04. In the case of tracking, this is sufficient to ensure that features can be tracked from one video frame to the next. It is then natural to use the tracking neighborhood to evaluate the quality of a feature (e.g. a $7 \times 7$ window with $w(x,y)=1$). Tracking itself is done by minimizing Eq. 4.1 over the parameters of ${\bf T}$. For small steps a translation is sufficient for ${\bf T}$. To evaluate the accumulated difference from the start of the track, it is advisable to use an affine motion model.
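As an illustration, both criteria can be evaluated densely over an image from the entries of ${\bf M}$. The following is a minimal sketch in Python (NumPy/SciPy; the function names and the use of np.gradient and gaussian_filter are my own choices, not prescribed by the text), using a Gaussian window $w(x,y)$ and $k=0.04$:

\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter

def corner_response(I, sigma=0.7, k=0.04):
    """Harris corner response R = det(M) - k * (trace M)^2."""
    I = I.astype(np.float64)
    Iy, Ix = np.gradient(I)          # image gradients dI/dy, dI/dx
    # Entries of M, weighted with a Gaussian window w(x, y)
    Mxx = gaussian_filter(Ix * Ix, sigma)
    Myy = gaussian_filter(Iy * Iy, sigma)
    Mxy = gaussian_filter(Ix * Iy, sigma)
    det_M = Mxx * Myy - Mxy ** 2
    tr_M = Mxx + Myy
    return det_M - k * tr_M ** 2

def min_eigenvalue(I, sigma=0.7):
    """Smallest eigenvalue of M (the criterion of [136])."""
    I = I.astype(np.float64)
    Iy, Ix = np.gradient(I)
    Mxx = gaussian_filter(Ix * Ix, sigma)
    Myy = gaussian_filter(Iy * Iy, sigma)
    Mxy = gaussian_filter(Ix * Iy, sigma)
    tr = Mxx + Myy
    det = Mxx * Myy - Mxy ** 2
    # Closed form for the eigenvalues of a symmetric 2x2 matrix
    return 0.5 * (tr - np.sqrt(np.maximum(tr ** 2 - 4.0 * det, 0.0)))
\end{verbatim}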

In the case of separate frames, as obtained with a still camera, there is the additional requirement that as many image points as possible originating from the same 3D points should be extracted. Therefore, only local maxima of the corner response function are considered as features. Sub-pixel precision can be achieved through quadratic approximation of the neighborhood of each local maximum. A typical choice for $w(x,y)$ in this case is a Gaussian with $\sigma=0.7$. Matching is typically done by comparing small, e.g. $7 \times 7$, windows centered on the features through SSD or NCC. This measure is only invariant to image translations and therefore cannot cope with large variations in camera pose.
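A minimal sketch of the local-maximum selection and sub-pixel refinement, under the assumption that the quadratic fit may be done separately along each axis (the function name and the $3 \times 3$ neighborhood are illustrative choices, not taken from the text):

\begin{verbatim}
import numpy as np
from scipy.ndimage import maximum_filter

def extract_features(R, threshold):
    """Keep local maxima of the response R above a threshold,
    refined to sub-pixel precision with a quadratic fit."""
    is_max = (R == maximum_filter(R, size=3)) & (R > threshold)
    features = []
    for y, x in np.argwhere(is_max):
        if 0 < y < R.shape[0] - 1 and 0 < x < R.shape[1] - 1:
            # 1D quadratic through three samples: offset = -R'/R''
            dx = 0.5 * (R[y, x + 1] - R[y, x - 1])
            dy = 0.5 * (R[y + 1, x] - R[y - 1, x])
            dxx = R[y, x + 1] - 2.0 * R[y, x] + R[y, x - 1]
            dyy = R[y + 1, x] - 2.0 * R[y, x] + R[y - 1, x]
            ox = -dx / dxx if dxx != 0 else 0.0
            oy = -dy / dyy if dyy != 0 else 0.0
            features.append((x + ox, y + oy))
    return features
\end{verbatim}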

To match images that are more widely separated, a larger set of image variations has to be dealt with. Exhaustive search over all possible variations is computationally intractable. A more interesting approach consists of extracting a more complex feature that determines not only the position, but also the other unknowns of a local affine transformation [164] (see Section 4.1.3).

In practice often far too many corners are extracted. In this case it is often interesting to first reduce the number of corners before trying to match them. One possibility consists of selecting only the corners with a value $R$ above a certain threshold. This threshold can be tuned to yield the desired number of features. Since for some scenes most of the strongest corners are located in the same area, it can be interesting to refine this scheme further to ensure that a sufficient number of corners are found in every part of the image, as sketched below.
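One simple way to achieve such a spatial spread is to divide the image into a regular grid of cells and keep only the strongest corners in each cell. A sketch of this idea (the grid size, per-cell count and function name are arbitrary illustrative choices):

\begin{verbatim}
def select_per_cell(features, scores, shape, grid=(4, 4), per_cell=100):
    """Keep the strongest corners in each cell of a regular grid,
    so that features are spread over the whole image."""
    H, W = shape
    rows, cols = grid
    cells = {}
    for (x, y), s in zip(features, scores):
        cell = (min(int(y * rows / H), rows - 1),
                min(int(x * cols / W), cols - 1))
        cells.setdefault(cell, []).append((s, (x, y)))
    kept = []
    for candidates in cells.values():
        candidates.sort(key=lambda t: -t[0])   # strongest first
        kept.extend(f for _, f in candidates[:per_cell])
    return kept
\end{verbatim}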

In figure 4.1 two images are shown with the extracted corners. Note that a corresponding corner cannot be found for every corner, but that for many of them it can.

Figure 4.1: Two images with extracted corners
\begin{figure}\centerline{\psfig{figure=feature/corners1.ps, width=7cm}
\psfig{figure=feature/corners2.ps, width=7cm}}\end{figure}

In figure 4.2 corresponding parts of two images are shown. In each, the positions of 5 corners are indicated. In figure 4.3 the neighborhood of each of these corners is shown. The intensity cross-correlation was computed for every possible combination. This is shown in Table 4.1. It can be seen that in this case the correct matches all yield the highest cross-correlation values (i.e. the highest values are found on the diagonal). However, the combination 2-5, for example, comes very close to 2-2. In practice one can certainly not rely on all matches being correct, and automatic matching procedures should therefore be able to deal with an important fraction of outliers. Robust matching procedures will therefore be introduced further on.

Figure 4.2: Detail of the images of figure 4.1 with 5 corresponding corners.
\begin{figure}\centerline{\psfig{figure=feature/kast1.ps, height=7cm}
\psfig{figure=feature/kast2.ps, height=7cm}}\end{figure}
Figure 4.3: Local neighborhoods of the 5 corners of figure 4.2.
\begin{figure}\centerline{\psfig{figure=feature/c11.ps, width=2cm} ... \psfig{figure=feature/c25.ps, width=2cm}}\end{figure}


Table 4.1: Intensity cross-correlation values for all possible combinations of the 5 corners indicated in figure 4.2 (rows: corners of the first image, columns: corners of the second image).
       1        2        3        4        5
1   0.9639  -0.3994  -0.1627  -0.3868   0.1914
2  -0.0533   0.7503  -0.4677   0.5115   0.7193
3  -0.1826  -0.3905   0.7730   0.1475  -0.7457
4  -0.2724   0.4878   0.1640   0.7862   0.2077
5   0.0835   0.5044  -0.4541   0.2802   0.9876
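The entries of such a table can be computed by normalized cross-correlation of the corner neighborhoods. A minimal sketch, assuming integer corner coordinates that are not too close to the image border (the function names and the half-width parameter are mine):

\begin{verbatim}
import numpy as np

def ncc(A, B):
    """Normalized cross-correlation of two equally sized windows."""
    A = A - A.mean()
    B = B - B.mean()
    return (A * B).sum() / np.sqrt((A ** 2).sum() * (B ** 2).sum())

def ncc_table(image1, image2, corners1, corners2, half=3):
    """Cross-correlate the (2*half+1) x (2*half+1) neighborhoods
    of two corner lists, as in Table 4.1 (half=3 gives 7x7)."""
    T = np.zeros((len(corners1), len(corners2)))
    for i, (x1, y1) in enumerate(corners1):
        W1 = image1[y1 - half:y1 + half + 1, x1 - half:x1 + half + 1]
        for j, (x2, y2) in enumerate(corners2):
            W2 = image2[y2 - half:y2 + half + 1, x2 - half:x2 + half + 1]
            T[i, j] = ncc(W1, W2)
    return T
\end{verbatim}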


If one can assume that the motion between two images is small (which is needed anyway for the intensity cross-correlation measure to yield good results), the location of a feature cannot change much between two consecutive views. This can therefore be used to reduce the combinatorial complexity of the matching. Only features with similar coordinates in both images are compared: for a corner located at $(x,y)$, only the corners of the other image with coordinates in the interval $[x-w_i,x+w_i]\times[y-h_i,y+h_i]$ are considered. Typical values for $w_i$ and $h_i$ are $10\%$ or $20\%$ of the image width and height.
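A sketch of this proximity constraint (the function name and the list-based representation are illustrative):

\begin{verbatim}
def candidate_matches(corners1, corners2, w_i, h_i):
    """For each corner of the first image, list the corners of the
    second image inside the search window around it."""
    candidates = []
    for i, (x, y) in enumerate(corners1):
        for j, (u, v) in enumerate(corners2):
            if abs(u - x) <= w_i and abs(v - y) <= h_i:
                candidates.append((i, j))
    return candidates
\end{verbatim}

The surviving candidate pairs can then be ranked by their cross-correlation values, e.g. with the ncc_table sketch above, so that only a small fraction of all possible corner combinations has to be evaluated.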

