Urban 3D Modelling from Video



Our research aims at developing a fully automated system for the accurate and rapid 3D reconstruction of large-scale urban environments from video streams. It is based on two factors:

For many applications, 3D models are more descriptive than the frames of the original video. In a model of a city, users can see a very large area at once, grasp the spatial arrangement of the buildings at a single glance, and navigate freely to the parts that interest them most. These tasks are considerably more difficult and time-consuming with the original video.


Sample screenshots of our ground and aerial reconstructions

Achieving accurate 3D reconstructions poses many difficulties. The core problems that need to be addressed are estimating the motion of the camera and the structure of the scene. These problems are generally ill-posed, since they attempt to recover 3D information from 2D image data. Given structure and motion estimates, dense pixel correspondences can be established, and polygonal models of the scene can be generated using frames of the videos for texture-mapping.
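As a small illustration of why a single image is not enough, the sketch below back-projects a pixel to a viewing ray (along which depth is unknown) and recovers a 3D point only by intersecting rays from two views with known poses. It is a minimal midpoint triangulation, not the method used in our system; the camera model (principal point at the origin, no distortion, both cameras looking along +z) and all names and numbers are assumptions made for the example.

```python
def back_project(pixel, focal_px):
    """Direction of the viewing ray through a pixel (principal point at the origin)."""
    u, v = pixel
    return (u / focal_px, v / focal_px, 1.0)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def point_on_ray(center, direction, s):
    return tuple(c + s * d for c, d in zip(center, direction))


def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between two viewing rays."""
    w = tuple(p - q for p, q in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        # Parallel rays carry no depth information.
        raise ValueError("rays are parallel; depth cannot be recovered")
    s = (b * e - c * d) / denom  # parameter along the first ray
    t = (a * e - b * d) / denom  # parameter along the second ray
    p = point_on_ray(c1, d1, s)
    q = point_on_ray(c2, d2, t)
    return tuple((x + y) / 2.0 for x, y in zip(p, q))


# Hypothetical setup: two identical cameras, one unit apart along x.
focal_px = 800.0
cam1, cam2 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
ray1 = back_project((200.0, 100.0), focal_px)  # pixel observed in view 1
ray2 = back_project((0.0, 100.0), focal_px)    # same scene point in view 2
point = triangulate_midpoint(cam1, ray1, cam2, ray2)
print(point)  # (1.0, 0.5, 4.0)
```

With noisy correspondences the two rays no longer intersect exactly, which is why the midpoint (or a more careful estimate) is used instead of an exact intersection.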

While such reconstructions have been possible for years, as for instance in the previous work of members of our team (see M. Pollefeys et al., "Visual modeling with a hand-held camera", International Journal of Computer Vision 59(3), 2004), the computational effort required was a limiting factor for a practical system that must process massive amounts of video, such as the amount needed to model an entire city. In our current work, the speed of the system is therefore a major consideration. Our algorithms are fast by design and amenable to GPU implementation. We have achieved real-time performance at 30 Hz on a single consumer PC with a commodity graphics card by leveraging both the CPU and the GPU.
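For a rough sense of scale, the snippet below converts a frame count into processing time at a fixed rate. It assumes the 30 Hz figure is sustained over the whole data set and ignores I/O and model merging; the function name is hypothetical, and the frame count is the size of our largest reconstruction as reported on this page.

```python
def processing_hours(num_frames, frames_per_second):
    """Hours needed to process num_frames at a sustained frame rate."""
    return num_frames / frames_per_second / 3600.0


# 1.3 million frames at an assumed sustained 30 Hz.
hours = processing_hours(1_300_000, 30.0)
print(f"{hours:.1f} hours")  # 12.0 hours
```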

Data Collection


Portable recording system mounted on a backpack



Left: A base with four cameras. Right: The same base on the roof of our van in Chapel Hill.

For data collection at ground level, we have constructed two recording systems. The image on the left shows a low-cost, man-portable camera capture system, designed for flexibility and mobility. The setup consists of a Point Grey Research Ladybug2 omnidirectional camera, a Garmin consumer-grade GPS receiver with Wide Area Augmentation System (WAAS) capability, and a Microstrain 3DM-G inertial sensor. The Ladybug is a multi-camera system consisting of six cameras, of which five form a ring and the sixth points upwards. Together, these provide video coverage of most of the upper hemisphere about the camera unit, except for the area directly below the camera. The GPS unit is accurate to approximately five meters under optimal conditions, meaning many visible satellites and little or no multipath error. In practice, such conditions are highly unusual in urban environments, and errors of 10 meters or more are more typical due to the urban canyon effect, in which surrounding buildings leave very little of the sky visible to the GPS receiver. Finally, the 3DM-G provides an absolute orientation measurement accurate to +/-5 degrees.

The image on the right shows the system we use for large-scale data collection, constructed using multiple synchronized cameras on a base and mounted on a car. Three of the four cameras are horizontal, pointing forward, backward and to the side, while the fourth is tilted upward to capture the upper parts of building facades. While additional sensors such as GPS and inertial sensors can be integrated to give accurate position measurements, our system can also function in the absence of this additional information.

Below is a composite video captured by the four-camera recording system in Chapel Hill.

Composite video captured with the four camera system


We have also obtained 3D reconstructions from aerial video recorded from a helicopter, using a nose-mounted, gyro-stabilized HDTV camera system. Using both ground and aerial video allows us to obtain increased coverage of the facades and rooftops in urban scenes. By aligning the ground and aerial reconstructions, we are thus able to produce more complete 3D models. Below is a photo of the aerial recording system, along with a sample video.

Setup for recording aerial video



Aerial video clip recorded over Chapel Hill


The input to our processing pipeline consists of the video sequence, which may optionally be augmented by the trajectory of the vehicle in the form of GPS/INS measurements. Since the cameras do not overlap, reconstruction is performed on the frames of a single camera as it moves through the scene. The main steps of the processing pipeline are the following:

  • 2D feature tracking: Our highly optimized GPU implementation of the KLT tracker is used to track features in the image, and achieves processing rates of more than 200 frames per second on state-of-the-art consumer GPUs for PAL (720 × 576) resolution data. The features are used in the structure-from-motion computation and in the sparse scene analysis step. Our implementation is also capable of estimating a global gain ratio between successive frames in order to compensate for changes in the camera exposure. The sources for the KLT tracker can be downloaded from here.
  • Pose estimation and refinement: Using the tracked 2D features, we compute camera poses using various structure-from-motion techniques. The computation is done using ARRSAC, a high-performance, real-time robust estimation technique. When additional trajectory information is available, it may be fused with the visual measurements via an Extended Kalman Filter to refine the pose of each camera at each frame.
  • Sparse scene analysis: The tracked features can be reconstructed in 3D, given the camera poses, to provide valuable information about the scene surfaces and their orientation.
  • Multi-way plane-sweeping stereo: We use the plane-sweeping algorithm for stereo reconstruction. Planes are swept in multiple directions to account for slanted surfaces and a prior probability estimated from the sparse data is used to disambiguate textureless surfaces. Our stereo algorithm is also run on the GPU to take advantage of its efficiency in rendering operations, which are the most costly computations in plane-sweeping stereo.
  • Depth map fusion: Stereo depth maps are computed for each frame in the previous step in real time. There are large overlaps between adjacent depth maps that should be removed to produce a more economical representation of the scene. The depth map fusion stage combines multiple depth estimates for each pixel of a reference view, enforces visibility constraints and thus improves the accuracy of the reconstruction. The result is one fused depth map that replaces several raw depth maps that cover the same part of the scene.
  • Model generation: A multi-resolution mesh is generated for each fused depth map and video frames are used for texture-mapping. The camera gain is adjusted within and across video streams so that transitions in appearance are smoother. In addition, partial reconstructions are merged and holes in the model are filled in, if possible.
  • Model matching: We have also developed a technique for efficient 3D scene alignment. By leveraging local shape information, an invariant feature descriptor is extracted, which is then used in a hierarchical matching scheme to match and align 3D scenes efficiently. This allows us to align multiple 3D models, such as those obtained from ground and aerial video, and also to perform loop completion.
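To make the plane-sweeping step above more concrete, the following is a deliberately reduced sketch and not our GPU implementation. It assumes a rectified image pair, a single scanline, and fronto-parallel planes only (one sweeping direction), and it omits the prior from the sparse data; under these assumptions each depth plane induces a constant pixel shift between the two views. All names and numbers are hypothetical.

```python
def plane_sweep_scanline(ref, other, focal_px, baseline, depths, window=1):
    """For each reference pixel, pick the depth plane with the lowest matching cost."""
    n = len(ref)
    best_depth = [None] * n
    best_cost = [float("inf")] * n
    for depth in depths:
        # A fronto-parallel plane at this depth maps reference pixel x
        # to pixel x - shift in the other (rectified) view.
        shift = round(focal_px * baseline / depth)
        for x in range(n):
            cost = 0.0
            valid = True
            for dx in range(-window, window + 1):
                xr = x + dx
                xo = xr - shift
                if not (0 <= xr < n and 0 <= xo < n):
                    valid = False  # window leaves one of the images
                    break
                cost += abs(ref[xr] - other[xo])  # sum of absolute differences
            if valid and cost < best_cost[x]:
                best_cost[x] = cost
                best_depth[x] = depth
    return best_depth


# Synthetic scanline: the other view sees the same texture shifted by 2 pixels,
# which corresponds to a depth of 40 * 0.5 / 2 = 10 units.
ref = [3, 9, 1, 7, 4, 8, 2, 6, 5, 0, 9, 3]
other = ref[2:] + [0, 0]
depth_map = plane_sweep_scanline(ref, other, focal_px=40.0, baseline=0.5,
                                 depths=[5.0, 10.0, 20.0])
print(depth_map[3:11])
```

Pixels near the borders may receive no estimate or an unreliable one because the matching window leaves the image. Sweeping planes in several directions, as described in the list above, replaces the constant shift with a plane-induced homography per plane.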



Below are some screenshots and videos of textured models of Chapel Hill reconstructed using our approach. Our largest reconstruction comprises several such sequences and totals 1.3 million frames.

Screenshot: Overview and details of a ground-based reconstruction using two cameras


Screenshot: View from above and details of a ground-based reconstruction of a challenging scene using four cameras


Videos of the two ground-based reconstructions shown above


Below are screenshots and videos for reconstructions obtained from aerial video captured over UNC Charlotte and UNC Chapel Hill.

Screenshot: Aerial reconstruction over UNC Charlotte


Screenshot: Aerial reconstruction over UNC Chapel Hill


Videos of the two aerial-based reconstructions shown above


Using our technique for model matching, ground and aerial reconstructions can be aligned, as described in the video below.



Our 3D models can also be loaded into Google Earth and displayed in a geo-registered coordinate frame. Click here to download the model files (unzip the folder and drop the .kml file into Google Earth). Below is a video clip that shows one of our models being navigated within Google Earth.

Video of the model within Google Earth




Marc Pollefeys

Jan-Michael Frahm

Greg Welch

Postdocs and Staff

Christopher Zach

Seon-Joo Kim

Herman Towles


Brian Clipp

David Gallup

Rahul Raguram

Changchang Wu

Former Members

University of Kentucky: David Nistér, Ruigang Yang, Amir Akbarzadeh, Henrik Stewénius, Christopher Engels, Liang Wang, Qing-Xiong Yang

UNC: Philippos Mordohai, Paul Merrell, Sudipta Sinha, Brad Talton, Christina Salmi



  • Rahul Raguram, Jan-Michael Frahm, Marc Pollefeys, "A Comparative Analysis of RANSAC Techniques Leading to Adaptive Real-Time Random Sample Consensus", ECCV 2008
  • Changchang Wu, Brian Clipp, Xiaowei Li, Jan-Michael Frahm, Marc Pollefeys, "3D Model Matching with Viewpoint Invariant Patches (VIPs)", CVPR 2008
  • David Gallup, Jan-Michael Frahm, Philippos Mordohai, Marc Pollefeys, "Variable Baseline/Resolution Stereo", CVPR 2008
  • M. Pollefeys, D. Nistér, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.-J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, R. Yang, G. Welch, H. Towles, "Detailed Real-Time Urban 3D Reconstruction From Video", IJCV special issue on Modeling Large-Scale 3D Scenes
  • Christopher Zach, David Gallup and Jan-Michael Frahm, "Fast Gain-Adaptive KLT Tracking on the GPU", CVGPU'08 workshop, in conjunction with CVPR'08
  • Paul Merrell, Amir Akbarzadeh, Liang Wang, Philippos Mordohai, Jan-Michael Frahm, Ruigang Yang, David Nistér, and Marc Pollefeys, "Real-Time Visibility-Based Fusion of Depth Maps", ICCV 2007.
  • Seon Joo Kim, Jan-Michael Frahm, and Marc Pollefeys, "Joint Feature Tracking and Radiometric Calibration from Auto-Exposure Video", ICCV 2007.
  • S. Sinha, J.-M. Frahm, M. Pollefeys and Y. Genc, "Feature Tracking and Matching in Video Using Programmable Graphics Hardware", Machine Vision and Applications, to appear, 2007.
  • P. Mordohai, J.-M. Frahm, A. Akbarzadeh, B. Clipp, C. Engels, D. Gallup, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, H. Towles, G. Welch, R. Yang, M. Pollefeys and D. Nistér, "Real-Time Video-Based Reconstruction of Urban Environments", 3D-ARCH'2007: 3D Virtual Reconstruction and Visualization of Complex Architectures, Zurich, Switzerland, July, 2007
  • D. Gallup, J.-M. Frahm, P. Mordohai, Q. Yang and M. Pollefeys, "Real-time Plane-sweeping Stereo with Multiple Sweeping Directions", International Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, Minnesota, USA, June 2007
  • S.J. Kim, D. Gallup, J.-M. Frahm, A. Akbarzadeh, Q. Yang, R. Yang, D. Nistér and M. Pollefeys, "Gain Adaptive Real-Time Stereo Streaming", International Conference on Computer Vision Systems, 2007
  • A. Akbarzadeh, J.-M. Frahm, P. Mordohai, B. Clipp, C. Engels, D. Gallup, P. Merrell, M. Phelps, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, R. Yang, G. Welch, H. Towles, D. Nistér and M. Pollefeys, "Towards Urban 3D Reconstruction From Video", Third International Symposium on 3-D Data Processing, Visualization and Transmission, Chapel Hill, North Carolina, USA, June 2006

Other Scene Modelling Work at UNC

Visit our page describing the Modeling and Recognition of Landmark Image Collections Using Iconic Scene Graphs


We would like to thank DARPA for sponsoring part of this research under the UrbanScape program.
Approved for Public Release, Distribution Unlimited.
