Urban 3D Modelling from Video

University of North Carolina, Chapel Hill and University of Kentucky



Our research aims to develop a fully automated system for the accurate and rapid 3D reconstruction of urban environments from video streams. It is motivated by two factors:
  • the abundance of video data brought about by recent advances in camcorder technology and falling prices
  • the need for compact, 3D descriptions of what has been filmed.
For many applications, 3D models are more descriptive than the frames of the original video. In a model of a city, users can see a very large area at once, grasp the spatial arrangement of the buildings at a single glance, and navigate freely to the parts that interest them most. These tasks are considerably more difficult and time-consuming with the original video.

Achieving accurate 3D reconstructions entails many difficulties. The core problems that need to be addressed are estimating the motion of the camera and the structure of the scene. These problems are generally ill-posed, since they attempt to recover 3D information from 2D image data. Given structure and motion estimates, dense pixel correspondences can be established and polygonal models of the scene can be generated, using frames of the video for texture-mapping.
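The basic geometric step, recovering a 3D point from 2D observations once camera poses are known, can be illustrated with standard linear (DLT) triangulation. The sketch below is ours for illustration only, not the system's actual code; it assumes two calibrated views with known 3x4 projection matrices.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one scene point from two views.

    P1, P2 : 3x4 camera projection matrices.
    x1, x2 : (u, v) pixel observations of the same point.
    Returns the 3D point in world coordinates.
    """
    # Each observation contributes two linear constraints on the
    # homogeneous 3D point X: u * (P[2] @ X) - P[0] @ X = 0, etc.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The point is the null vector of A (singular vector of the
    # smallest singular value), then dehomogenized.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

With more than two views, the same construction simply stacks two rows per observation, which is how tracked features can be reconstructed once poses are available.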

While such reconstructions have been possible for years, as in the previous work of members of our team (see M. Pollefeys et al., "Visual modeling with a hand-held camera", International Journal of Computer Vision 59(3), 2004, and D. Nistér, "Automatic dense reconstruction from uncalibrated video sequences", PhD thesis, Royal Institute of Technology KTH, Stockholm, Sweden, 2001), the computational effort required to obtain them has been a limiting factor for a practical system able to process the massive amounts of video needed to model an entire city. In our current work, the speed of the system is a major consideration. Our algorithms are inherently fast and amenable to GPU implementation; we have achieved real-time performance by leveraging both the CPU and the GPU.


A base with four cameras. The same base on the roof of our van in Chapel Hill.

Data Collection

We use multiple synchronized cameras on a base, such as the one seen above, to collect video streams. Below is a composite video captured by four cameras in Chapel Hill. Three of the four cameras are horizontal, pointing forward, backward and to the side, while the fourth is tilted upward to capture the upper parts of building facades. Additional sensors such as GPS and inertial sensors can also be integrated to provide accurate position measurements.

Composite video captured with four cameras.

The inputs to our processing pipeline are the trajectory of the vehicle and the video sequences. Since the cameras' fields of view do not overlap, reconstruction is performed on the frames of a single camera as it moves through the scene. The main steps of the processing pipeline are the following:

  • 2D feature tracking: Our GPU implementation of the KLT tracker is used to track features in the image. The features are used in the Structure from Motion computation and in the sparse scene analysis step. The source code for the KLT tracker can be downloaded here.
  • Pose refinement: Information from the vehicle trajectory is fused with visual measurements via an Extended Kalman Filter to obtain the pose of each camera at each frame.
  • Sparse scene analysis: The tracked features can be reconstructed in 3D, given the camera poses, to provide valuable information about the scene surfaces and their orientation.
  • Multi-way plane-sweeping stereo: We use the plane-sweeping algorithm for stereo reconstruction. Planes are swept in multiple directions to account for slanted surfaces and a prior probability estimated from the sparse data is used to disambiguate textureless surfaces. Our stereo algorithm is also run on the GPU to take advantage of its efficiency in rendering operations, which are the most costly computations in plane-sweeping stereo.
  • Depth map fusion: Stereo depth maps are computed for each frame in the previous step in real time. There are large overlaps between adjacent depth maps that should be removed to produce a more economical representation of the scene. The depth map fusion stage combines multiple depth estimates for each pixel of a reference view, enforces visibility constraints and thus improves the accuracy of the reconstruction. The result is one fused depth map that replaces several raw depth maps that cover the same part of the scene.
  • Model generation: A multi-resolution mesh is generated for each fused depth map and video frames are used for texture-mapping. The camera gain is adjusted within and across video streams so that transitions in appearance are smoother. In addition, partial reconstructions are merged and holes in the model are filled in, if possible.
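The pose-refinement step fuses the vehicle trajectory with visual measurements. Its core operation is the Kalman measurement update, shown below for a single scalar state. This is only a hedged sketch of the idea; the actual system runs an Extended Kalman Filter over the full vehicle pose, with a motion model and nonlinear measurement functions.

```python
def kalman_fuse(x, P, z, R):
    """One Kalman measurement update for a scalar state.

    x, P : prior estimate and its variance.
    z, R : measurement and its variance.
    Returns the corrected estimate and its (reduced) variance.
    """
    K = P / (P + R)          # Kalman gain: how much to trust z over x
    x_new = x + K * (z - x)  # pull the state toward the measurement
    P_new = (1.0 - K) * P    # uncertainty shrinks after fusing
    return x_new, P_new
```

With equal variances the update lands halfway between prediction and measurement, which is why fusing GPS/inertial data with visual estimates yields poses more accurate than either source alone.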
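The stereo step can be made concrete with a minimal fronto-parallel plane sweep on a rectified image pair, where each plane corresponds to one disparity and warping reduces to a horizontal shift. This sketch, with names of our own choosing, shows only the core winner-take-all idea; the full system warps via homographies to sweep planes in several directions, adds the sparse-data prior, and runs on the GPU.

```python
import numpy as np

def plane_sweep_disparity(ref, src, max_disp):
    """Winner-take-all plane sweep for a rectified grayscale pair.

    For each candidate disparity d (one fronto-parallel plane per d),
    warp `src` toward `ref` by shifting d pixels and score with the
    absolute difference; each pixel keeps the lowest-cost disparity.
    """
    h, w = ref.shape
    best_cost = np.full((h, w), np.inf)
    best_disp = np.zeros((h, w), dtype=int)
    for d in range(max_disp + 1):
        warped = np.empty_like(src)
        warped[:, d:] = src[:, :w - d]  # shift src right by d pixels
        warped[:, :d] = src[:, :1]      # replicate the left border
        cost = np.abs(ref - warped)     # photo-consistency cost
        better = cost < best_cost
        best_cost[better] = cost[better]
        best_disp[better] = d
    return best_disp
```

In practice the per-pixel cost is aggregated over a small window before selecting the winner; the per-plane warps are exactly the rendering operations that make the GPU such a good fit.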
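Finally, the fusion step can be sketched as a per-pixel consensus over several raw depth maps registered to a reference view. The version below is a simplified stand-in using a median with a support test; the actual fusion stage additionally enforces visibility constraints along the viewing rays.

```python
import numpy as np

def fuse_depths(depth_stack, tol=0.05, min_support=2):
    """Fuse several per-pixel depth estimates into one depth map.

    depth_stack : (k, h, w) array of k raw depth maps registered to a
    common reference view. Each pixel takes the median of its k
    estimates, kept only if at least `min_support` estimates agree
    with the median within relative tolerance `tol`; otherwise the
    pixel is rejected (set to 0).
    """
    med = np.median(depth_stack, axis=0)
    # Count how many raw estimates support the median at each pixel.
    support = (np.abs(depth_stack - med) <= tol * med).sum(axis=0)
    return np.where(support >= min_support, med, 0.0)
```

The fused map both removes the redundancy between overlapping raw depth maps and rejects isolated outliers, which is what lets one fused depth map replace several raw ones covering the same part of the scene.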


Below are some screenshots and videos of textured models of Chapel Hill reconstructed using our approach. Both sequences consist of a few thousand frames, while our largest reconstruction comprises several such sequences and totals 500,000 frames.

Top: Overview and details of a reconstruction using two cameras. Bottom: video of the same model.

Top: View from above and details of a reconstruction of a challenging scene using four cameras. Bottom: video of the same model.


University of North Carolina

Marc Pollefeys (Principal Investigator at UNC)

Jan-Michael Frahm

Philippos Mordohai

Herman Towles

Greg Welch

Brian Clipp

David Gallup

Paul Merrell

Christina Salmi

Sudipta Sinha

Brad Talton

University of Kentucky

David Nistér (Principal Investigator at UK)

Ruigang Yang

Amir Akbarzadeh

Henrik Stewénius

Christopher Engels

Liang Wang

Qing-Xiong Yang



We would like to thank DARPA for funding our research under the UrbanScape program.
Approved for Public Release, Distribution Unlimited.