NSF award IIS-0313047 (2003-2007): Converting 2D Video to 3D with Applications to 3D TV, Video Analysis and Compression.

PI: Marc Pollefeys – Univ. of North Carolina at Chapel Hill


The aim of this project is to augment traditional 2D video content with 3D depth using computer vision approaches. To achieve this goal the camera motion and settings will also have to be recovered from the video data. The main application that is envisioned is to convert existing 2D video content to 3D video to provide sufficient content for stereoscopic displays and enable applications such as 3D TV to emerge and flourish. Besides this, the intended research potentially also has important applications in the context of video analysis and compression. We intend to develop a reliable fully automatic approach. Since it will not always be possible to compute the depth from the available image content (e.g. fixed camera), we intend to correctly deal with ambiguities and provide perceptually acceptable results (e.g. fade depth out when it can’t be computed anymore).

Given a 2D video stream, we intend to

(1)   compute the relative motion between the scene and the camera for each shot,

(2)   detect independent moving objects and compute their motion and deformation,

(3)   compute a detailed depth representation for each video frame.


Research and educational activities (full project)


Articulated and non-rigid motion estimation:  One of the main challenges for 2D-to-3D conversion of video consists of dealing with articulated and non-rigid deformation of objects.  Graduate research assistant Jingyu Yan has been working on new techniques for articulated and non-rigid shape and motion recovery from a video stream.  Some initial results (based on manually tracked features) can be seen below.  

[Figure: initial results of articulated and non-rigid shape and motion recovery (images final_3, final_4, final_8, final_9)]

Our most recent work in this area addresses the subspace properties and the recovery of articulated motion. We have shown that the motion subspace of an articulated object is a combination of a number of intersecting rigid motion subspaces. The rank of that motion subspace is lower than the sum of the ranks of the individual articulated parts, by an amount that depends on how each pair of linked parts is connected: by a rotation axis or by a rotation joint. The reduction in dimension results from the intersection between the motion subspaces of two linked parts, which is exactly the motion subspace of the axis or joint that connects them. From these observations, we have described the rank constraints of articulated motion, developed an algorithm to recover the image motion of the axis or joint, and proposed a novel but simple approach, based on subspace clustering, to recover articulated shape and motion from a single-view image sequence.  In the figure below, two frames of monocular videos of articulated objects are shown that were automatically segmented (white and black feature points) and for which the articulation was also computed (red rotation axis and black joint, respectively). This work was presented at the IEEE Conf. on Computer Vision and Pattern Recognition in 2005 and was very well received.
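The rank constraints described above can be checked numerically on synthetic data. The following is a minimal sketch, not the project's implementation: it assumes an orthographic camera and noise-free tracks, and all function names (rotation, tracks, simulate, numerical_rank) are made up for this illustration. Under these assumptions each rigid part should give a track matrix of rank 4, and the combined matrix of two linked parts should have rank 4 + 4 - 1 = 7 for a joint and 4 + 4 - 2 = 6 for a rotation axis.

```python
import numpy as np

def rotation(axis, angle):
    """Rodrigues formula: rotation by `angle` about the direction `axis`."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    K = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def tracks(rotations, translations, points):
    """Stack the orthographic image tracks (x and y rows) of one rigid part over all frames."""
    return np.vstack([(R @ points + t[:, None])[:2]
                      for R, t in zip(rotations, translations)])

def simulate(link, frames=20, n_points=12, seed=0):
    """Return track matrices W1, W2 of two parts linked by a 'joint' or an 'axis'."""
    rng = np.random.default_rng(seed)
    pivot = rng.normal(size=3)            # joint (or a point on the axis), in part-1 coordinates
    hinge = rng.normal(size=3)            # axis direction, only used when link == "axis"
    X1 = rng.normal(size=(3, n_points))   # part-1 points
    X2 = rng.normal(size=(3, n_points))   # part-2 points, expressed relative to the pivot
    R1s, t1s, R2s, t2s = [], [], [], []
    for _ in range(frames):
        R1 = rotation(rng.normal(size=3), rng.uniform(0.0, np.pi))
        t1 = rng.normal(size=3)
        rel_axis = hinge if link == "axis" else rng.normal(size=3)
        R_rel = rotation(rel_axis, rng.uniform(0.0, np.pi))
        R1s.append(R1)
        t1s.append(t1)
        R2s.append(R1 @ R_rel)            # part 2 rotates relative to part 1 ...
        t2s.append(R1 @ pivot + t1)       # ... about the pivot, which moves with part 1
    return tracks(R1s, t1s, X1), tracks(R2s, t2s, X2)

def numerical_rank(W, tol=1e-8):
    """Rank of W, counting singular values above `tol`."""
    return int(np.linalg.matrix_rank(W, tol=tol))

for link in ("joint", "axis"):
    W1, W2 = simulate(link)
    print(link, numerical_rank(W1), numerical_rank(W2),
          numerical_rank(np.hstack([W1, W2])))
```

With real, noisy tracks the rank would have to be estimated from the singular value spectrum rather than read off with a fixed tolerance.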

A remaining challenge that we had to solve was the problem of motion segmentation.  Most existing motion segmentation techniques are not able to segment articulated motions as they cannot deal with intersecting subspaces.  An exception to this is the recent work on Generalized Principal Component Analysis (GPCA), which has received a lot of attention in the computer vision community.  We initially implemented this technique, but quickly realized that the method has important limitations: in practice it is restricted to two or three motions, because at its core it requires the estimation of the coefficients of a degenerate polynomial in multiple variables with an order equal to the number of motions.  The linear estimation of these coefficients without enforcing the degeneracy is very sensitive to noise and quickly becomes underconstrained for real cases.
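To give a feeling for why this estimation quickly becomes underconstrained, the sketch below counts the coefficients of a homogeneous polynomial whose degree equals the number of motions. The count C(n + D - 1, D - 1) for degree n in D variables is a standard combinatorial fact; the choice D = 5 is only an assumption for this illustration, and the function name is hypothetical.

```python
from math import comb

def num_coefficients(n_motions, n_variables):
    """Number of monomials of degree `n_motions` in `n_variables` variables,
    i.e. the number of coefficients of a homogeneous polynomial of that degree."""
    return comb(n_motions + n_variables - 1, n_variables - 1)

# Assumed ambient dimension (number of variables) for this illustration only.
N_VARIABLES = 5

for n in range(1, 7):
    print(n, "motions ->", num_coefficients(n, N_VARIABLES), "coefficients")
```

Since the number of unknowns grows polynomially with the number of motions, a linear fit needs correspondingly more well-distributed feature tracks, which real sequences often do not provide.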

We proposed several approaches to solve this problem.  Our first approach was based on RANSAC (Random Sample Consensus). The key idea consisted of not sampling the candidate feature tracks used to form a motion hypothesis uniformly, as is customary with RANSAC, but basing the probability of sampling subsequent tracks on their affinity with the previously selected tracks (as this is an indicator that they might be in the same subspace).  While a standard RANSAC algorithm would quickly get lost when only a small fraction of the total number of tracks is located in any single motion subspace, this approach was able to efficiently segment multiple dependent motions and deal with outliers in the data.  This approach was presented last year at the Workshop on Dynamical Vision, co-located with the International Conference on Computer Vision.
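The sampling idea can be sketched as follows. This is an illustration under assumptions, not the published algorithm: the affinity matrix is taken as given, and weighting each candidate by its smallest affinity to the tracks already selected is a choice made here for simplicity. The function name guided_sample is hypothetical.

```python
import random

def guided_sample(affinity, k, rng=random):
    """Draw k distinct track indices for one motion hypothesis.

    The first track is drawn uniformly; each further track is drawn with
    probability proportional to its affinity with the tracks chosen so far
    (here: the minimum affinity to any chosen track, an assumed weighting).
    """
    n = len(affinity)
    chosen = [rng.randrange(n)]
    while len(chosen) < k:
        candidates = [i for i in range(n) if i not in chosen]
        weights = [min(affinity[i][j] for j in chosen) for i in candidates]
        if sum(weights) <= 0:
            # No candidate has any affinity left: fall back to uniform sampling.
            weights = [1.0] * len(candidates)
        chosen.append(rng.choices(candidates, weights=weights, k=1)[0])
    return chosen

# Toy affinity: tracks 0-4 and 5-9 form two groups with no affinity across groups.
toy_affinity = [[1.0 if (i < 5) == (j < 5) else 0.0 for j in range(10)]
                for i in range(10)]
print(guided_sample(toy_affinity, 4, random.Random(0)))
```

In this toy case every sample stays within the group of its first track, whereas uniform sampling would mix the two groups in most draws.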

There was one remaining issue with this technique: the dimensionality of the motion subspaces has to be known.  This works well for articulated rigid bodies, where each subspace is four-dimensional, but is not suited to segmenting a general scene consisting of a mix of rigid, articulated and deformable shapes, as well as degenerate shapes such as flat objects.  For this purpose, we proposed an approach based on spectral clustering that performs significantly better than the state of the art.  This work appeared at the European Conference on Computer Vision in 2006.  Some of the segmentation results can be seen in the figure below.
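For readers unfamiliar with spectral clustering, the following sketch shows the generic textbook two-way version: it splits the items by the sign of the second eigenvector of the normalized graph Laplacian. It is not the affinity measure or the clustering procedure of the ECCV 2006 paper; the affinity matrix is assumed to be given, symmetric, and to have strictly positive row sums, and the function name is hypothetical.

```python
import numpy as np

def spectral_bipartition(affinity):
    """Split items into two groups (labels 0/1) from a symmetric affinity matrix."""
    A = np.asarray(affinity, dtype=float)
    degree = A.sum(axis=1)                      # assumed strictly positive
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Normalized Laplacian: L = I - D^(-1/2) A D^(-1/2)
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    # Second-smallest eigenvector, rescaled so that its sign pattern gives the split.
    fiedler = eigvecs[:, 1] * d_inv_sqrt
    return (fiedler > 0).astype(int)

# Toy affinity: two groups (4 and 6 items) with strong affinity inside a group
# and weak affinity across groups.
n1, n2 = 4, 6
toy = np.full((n1 + n2, n1 + n2), 0.05)
toy[:n1, :n1] = 1.0
toy[n1:, n1:] = 1.0
print(spectral_bipartition(toy))
```

More than two motions can be handled by applying the split recursively or by clustering several eigenvectors at once; which group receives label 0 or 1 is arbitrary.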


Once the motion segmentation problem was solved to our satisfaction, we developed an algorithm to reconstruct the kinematic chains of articulated objects from feature tracks.  To the best of our knowledge this is the first approach that is able to automatically build up a kinematic chain from a video (and not just adjust the parameters of a kinematic chain with a known structure).  We have proposed a simple and elegant algorithm based on estimating minimum spanning trees on an affinity graph between the computed motion subspaces.  This work appeared at the IEEE Conf. on Computer Vision and Pattern Recognition in 2006 and is illustrated below for a puppet and for Jingyu Yan himself (in each case 6 different motions were automatically identified; these are indicated with different colors).
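The spanning-tree step can be illustrated with a small, self-contained sketch. It uses Kruskal's algorithm and assumes that each pair of motion subspaces has already been assigned a cost, where a low cost means the two parts are likely to be linked; how that cost is derived from the subspaces is not shown, and the part names and numbers below are invented for the example.

```python
def minimum_spanning_tree(n_nodes, edges):
    """Kruskal's algorithm. `edges` is a list of (cost, u, v) tuples."""
    parent = list(range(n_nodes))

    def find(x):
        # Union-find root lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for cost, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # adding this edge creates no cycle
            parent[root_u] = root_v
            tree.append((u, v))
    return tree

# Hypothetical parts and pairwise costs (lower = more likely to be linked).
parts = ["torso", "head", "upper arm", "forearm"]
costs = [(0.10, 0, 1), (0.20, 0, 2), (0.15, 2, 3),
         (0.80, 0, 3), (0.90, 1, 2), (0.95, 1, 3)]

for u, v in minimum_spanning_tree(len(parts), costs):
    print(parts[u], "<->", parts[v])
```

The resulting tree links each part to its most plausible neighbour without cycles, which is the structure a kinematic chain requires.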


Our complete approach now starts from raw feature tracks, performs motion segmentation, computes the structure of the kinematic chains and the location of the joints, and reconstructs the shape and motion of the observed objects.  This work has been accepted for publication in the IEEE Transactions on Pattern Analysis and Machine Intelligence and can already be consulted online on the journal's website.  A short version of this overview has also appeared at AMDO 2006, the IV Conference on Articulated Motion and Deformable Objects.  Jingyu Yan successfully defended his PhD in December 2006 and went on to work for Microsoft.


Key-frame selection for 2D-to-3D conversion:  Together with graduate research assistant Jason Repko we have worked on improving the robustness of our 2D-to-3D processing pipeline.  One of the key elements of this is a robust approach to selecting key-frames.  We are investigating a three-view statistical approach, which should work considerably better than earlier two-view approaches because further processing assumes sufficient common tracked features between three consecutive views.  Below we show results obtained on a video sequence.  One of the frames is shown on the left, while the recovered 3D model and camera path are shown on the right.  We have also worked on an approach to deal with projective drift in uncalibrated structure from motion.  If this issue is not taken into account, it often leads to failure of self-calibration algorithms.  These results will be reported at the 3DIM’05 conference.  Jason Repko has obtained a Master's degree and went on to work for the US Patent Office.



Fast depth estimation on commodity graphics hardware:  An important aspect of recovering depth from a video stream is the ability to compute correspondences between multiple views.  We have improved our initial approach for real-time stereo.  These results were presented at the Workshop on Realtime 3D Sensors and Their Use, held in conjunction with IEEE CVPR 2004 (see publications).  This is collaborative work with Ruigang Yang (faculty at University of Kentucky) and his student.  Our new algorithm supports bi-directional consistency checking and variable correlation windows, and is twice as fast as our original implementation on the same hardware.  Different aspects of this research have also been reported in the International Journal of Image and Graphics (IJIG) and the Journal of Real-Time Imaging (RTI).  We now achieve up to 580 million disparity evaluations per second (= image width × image height × disparity search space × frame rate) on ATI X800 GPUs, which is several times faster than the fastest available CPU-based implementations.  An example of some results on a test dataset is shown below for two configurations of the algorithm.
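The two quantities mentioned in this paragraph can be made concrete with a short sketch. It is a plain CPU illustration, not the GPU implementation: the image size and disparity range in the example are hypothetical, the function names are made up, and the consistency check assumes rectified images in which pixel x of the left image matches pixel x - d of the right image.

```python
def disparity_evaluations_per_second(width, height, disparities, frame_rate):
    """Throughput measure: image width x image height x disparity search space x frame rate."""
    return width * height * disparities * frame_rate

def frame_rate_at(throughput, width, height, disparities):
    """Frame rate reachable at a given throughput for one image size and search range."""
    return throughput / (width * height * disparities)

def left_right_consistent(disp_left, disp_right, tolerance=0):
    """Bi-directional check on one scanline.

    A left-image disparity d at pixel x is kept only if the right-image
    disparity at the matched pixel x - d agrees with it within `tolerance`.
    """
    valid = []
    for x, d in enumerate(disp_left):
        xr = x - d
        ok = 0 <= xr < len(disp_right) and abs(disp_right[xr] - d) <= tolerance
        valid.append(ok)
    return valid

# Hypothetical configuration: 640 x 480 images, 32 disparity levels.
print(frame_rate_at(580e6, 640, 480, 32))

# Toy scanline: the first and last left-image disparities fail the check.
print(left_right_consistent([0, 1, 1, 1, 3], [1, 1, 1, 0, 0]))
```

Pixels that fail the check typically correspond to occlusions or mismatches and can be discarded or filled in from their neighbours.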


Course development:  In Fall 2003 we organized the Computer Vision course.  Although it had been taught before, the course was completely redeveloped, as the previous teacher’s specialty area was not computer vision.  The course was taken by 16 graduate students.  Last fall we organized an advanced graduate course on “3D photography” that is closely related to this project.  Ten graduate students took the course, and several other students and faculty regularly audited it.


Undergraduate research opportunities: We hired undergraduate student Matt Goldberg to work with us on this project last summer on an REU supplement.  He worked on a 3D viewer for depth-augmented video on auto-stereoscopic displays.  This has spurred some further research ideas that might lead to an invention disclosure.


Broader dissemination: To obtain a broader dissemination of our past and current research results, we have organized courses at conferences in computer vision and also in related areas such as computer graphics.  More specifically, we have organized or participated in the following courses and presentations:

Besides this, we also participate in the strong tradition of broader dissemination at the computer science department of UNC-CH.  Most Friday afternoons, demonstrations are organized for the general public, specifically targeting groups of interest such as girls in middle school, an age at which they tend to lose interest in science.