Philippos Mordohai
Postdoctoral Research Associate
Department of Computer Science
University of North Carolina at Chapel Hill
Office: Sitterson Hall 260
NEWS I have started as a postdoc at UPenn. My new webpage is at www.seas.upenn.edu/~mordohai.
CURRENT RESEARCH PROJECTS

Video-based Reconstruction of Urban Environments

Since September 2005 I have worked on the DARPA UrbanScape project, which aims at the real-time 3D reconstruction of urban scenes. The video-based part of UrbanScape is carried out by the UNC computer vision group led by Marc Pollefeys and the computer vision group from the Center for Visualization and Virtual Environments of the University of Kentucky led by David Nistér. We work on both the video collection system, which can record eight high-resolution video streams to disk at 30 frames per second, and the 3D reconstruction system that generates 3D models from these videos. More information on UrbanScape can be found at the Urban 3D Modelling from Video webpage.

My efforts are mainly in the second part of the processing pipeline, which addresses dense 3D reconstruction. I have also made smaller contributions to the first part of the pipeline, which addresses feature tracking, camera pose estimation and georegistration. Modules that I have worked on include:
I have also started working on a new project entitled "3D Content Extraction from Video Streams". It is part of the DTO Video Analysis and Content Extraction program. Our work is on the development of algorithms that can automatically extract 3D information from videos captured by unknown cameras under unknown conditions. Our research will focus on robust Structure from Motion algorithms and auto-calibration that will allow us to recover the camera parameters and poses from the videos. This will be followed by model selection to determine whether we can achieve a 3D reconstruction, a panoramic mosaic (if the camera only rotates) or no reconstruction at all from the video sequence. The emphasis is on making our methods applicable to video sequences over which we have very little control. If 3D reconstruction is feasible, it will be performed by a stereo module with an adjustable trade-off between speed and quality. A very fast recognition module will be used to detect previously observed landmarks in order to stitch together partial models potentially reconstructed from different videos. We will also attempt to measure the background and objects of the scene and make higher-level inferences, such as whether the scene is natural or man-made.

Multiple-View Reconstruction using Graph Cuts on an Adaptive Tetrahedral Mesh

In this work, Sudipta Sinha, Marc Pollefeys and I formulate multi-view 3D shape reconstruction as the computation of a minimum cut on the dual graph of a semi-regular, multi-resolution, tetrahedral mesh. Our method uses photo-consistency to guide the adaptive subdivision of a coarse mesh. This generates a multi-resolution volumetric mesh that is densely tessellated in the parts likely to contain the unknown surface and coarse in parts that are empty. The graph cut on the dual graph of this tetrahedral mesh produces a minimum cut corresponding to a triangulated surface that minimizes a global surface cost functional.
We make no assumptions about topology and can recover deep concavities when enough cameras observe them. Our formulation also allows silhouette constraints to be enforced during the graph-cut step to counter its inherent bias for producing minimal surfaces. Local shape refinement via surface deformation is used to recover details in the reconstructed surface. Reconstructions of the Multi-View Stereo Evaluation benchmark datasets and several other real datasets show the effectiveness of our method.
In ICCV 2007, Scott Larsen, Marc Pollefeys, Henry Fuchs and I presented an approach for 3D reconstruction from multiple video streams taken by static, synchronized and calibrated cameras that is capable of enforcing temporal consistency on the reconstruction of successive frames. We attempted to improve the quality of the reconstruction by finding corresponding pixels in subsequent frames of the same camera using optical flow, but also to at least maintain the quality of the single time-frame reconstruction when these correspondences are wrong or cannot be found. This allows us to process scenes with fast motion, occlusions and self-occlusions where optical flow fails for large numbers of pixels. To this end, we modify the belief propagation algorithm to operate on a 3D graph that includes both spatial and temporal neighbors and to be able to discard messages from outlying neighbors. We also propose methods for introducing a bias and for suppressing noise typically observed in uniform regions. The bias term encapsulates information about the background and aids in achieving a temporally consistent reconstruction and in the mitigation of errors caused by occlusion.
Publications
In the fall of 2005, Scott Larsen, Marc Pollefeys, Henry Fuchs and I worked on the development of a belief propagation framework applicable to multiple-view reconstruction. The beliefs for the depth of each pixel are initialized using the plane sweep algorithm, which is repeated after each belief propagation iteration, taking into account the updated visibility information. Probability density functions for the depth of each pixel along the ray emanating from the camera center are maintained for all pixels of all images. The main novelty of our work is a scheme for performing belief propagation in adaptive neighborhoods that include 3D neighbors besides the classic 4-neighbors in the image that contains each pixel (see figure below). Essentially, each pixel has four constant neighbors in its own image and a number of other neighbors that are determined based on its projection in the other images. Messages are passed among all the neighbors, modulated by a compatibility function that takes into account similarity in color, to mitigate the effects of occlusion, and distance in 3D, to suppress the influence of points on different surfaces. For these computations to be feasible, we had to simplify the belief propagation algorithm, thus the title of the project.
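As a rough illustration of the compatibility function described above, the following sketch weights a message by color similarity and by 3D distance. The Gaussian form, the function name `compatibility` and the parameters `sigma_color` and `sigma_dist` are assumptions made for this example; the actual function used in the project is not specified here.

```python
import math


def compatibility(color_a, color_b, point_a, point_b,
                  sigma_color=20.0, sigma_dist=0.1):
    """Hypothetical weight in (0, 1] for a message between two pixels.

    color_a, color_b: RGB tuples of the two pixels.
    point_a, point_b: current 3D estimates for the two pixels.
    The product of two Gaussians is an assumed form, not the published one.
    """
    color_sq = sum((a - b) ** 2 for a, b in zip(color_a, color_b))
    dist_sq = sum((a - b) ** 2 for a, b in zip(point_a, point_b))
    # Dissimilar colors suggest an occlusion boundary between the pixels.
    color_term = math.exp(-color_sq / (2.0 * sigma_color ** 2))
    # Large 3D distance suggests the points lie on different surfaces.
    dist_term = math.exp(-dist_sq / (2.0 * sigma_dist ** 2))
    return color_term * dist_term


# Identical color and position: the message passes at full strength.
same = compatibility((120, 80, 60), (120, 80, 60), (0.0, 0.0, 2.0), (0.0, 0.0, 2.0))
# Similar color, nearby point: the message is attenuated slightly.
near = compatibility((120, 80, 60), (122, 81, 60), (0.0, 0.0, 2.0), (0.0, 0.0, 2.05))
# Similar color but a distant point: the message is effectively suppressed.
far = compatibility((120, 80, 60), (122, 81, 60), (0.0, 0.0, 2.0), (0.0, 0.0, 3.0))
```

With a weight of this kind, a neighbor found by projection into another image contributes strongly only when it plausibly belongs to the same surface as the receiving pixel.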
Binocular Stereo using Tensor Voting

One of the many projects I worked on at USC, and arguably the one I spent the most time on, was binocular stereo. This work was based on the preliminary approach of Mi-Suen Lee and Gérard Medioni, which addressed stereo based on the premise that correct pixel correspondences reconstructed in 3D form the scene surfaces, while wrong correspondences do not form salient surfaces. Under this approach, stereo can be posed as a perceptual organization problem and tensor voting (see below) can be used to infer the surfaces. I worked to develop an algorithm that would use the same philosophy, but would be more effective on challenging real examples and benchmark data. After experimenting with a number of options for establishing pixel correspondences and for integrating monocular information in a way that mitigates the effects of occlusion without committing to premature decisions, we presented an algorithm that offers certain advantages. These include:
Left images, ground truth depth maps, the depth maps we generated and error maps for the Middlebury Stereo Evaluation webpage datasets. White in the error maps indicates errors below 0.5 disparity levels, gray indicates errors between 0.5 and 1 disparity level and black indicates errors greater than 1 disparity level. The error metric is the percentage of pixels above a certain error in disparity. First row: Tsukuba. Second row: Venus. Third row: Teddy. Fourth row: Cones.

We have submitted our results to the Middlebury Stereo Evaluation webpage and rank 14th among 25 algorithms (as of 11/11/2006) when the error threshold is set to 1 disparity level and 9th when the threshold is set to 0.5 disparity levels.
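The error metric mentioned above, the percentage of pixels whose disparity error exceeds a threshold, can be sketched as follows. This is a minimal illustration, not the evaluation code of the benchmark: the function name `bad_pixel_percentage`, the nested-list image format and the use of `None` for pixels without ground truth are assumptions of this example.

```python
def bad_pixel_percentage(estimated, ground_truth, threshold=1.0):
    """Percentage of valid pixels with absolute disparity error above threshold.

    estimated, ground_truth: disparity maps as lists of rows of equal size.
    A ground-truth value of None marks an unknown pixel, which is skipped
    (an assumption of this sketch).
    """
    bad = 0
    valid = 0
    for est_row, gt_row in zip(estimated, ground_truth):
        for est, gt in zip(est_row, gt_row):
            if gt is None:
                continue
            valid += 1
            if abs(est - gt) > threshold:
                bad += 1
    if valid == 0:
        raise ValueError("no pixels with known ground truth")
    return 100.0 * bad / valid


# Toy 2x3 disparity maps; one pixel has no ground truth.
estimated = [[10.0, 12.4, 7.0], [3.0, 5.8, 9.0]]
ground_truth = [[10.0, 12.0, 9.0], [3.6, None, 9.0]]

strict = bad_pixel_percentage(estimated, ground_truth, threshold=0.5)
loose = bad_pixel_percentage(estimated, ground_truth, threshold=1.0)
```

In the toy data, two of the five valid pixels are off by more than 0.5 disparity levels and one is off by more than 1, so the stricter threshold reports the larger percentage. This is why an algorithm's rank can differ between the two thresholds.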
At USC I also worked on multiple view stereo, where the input is a set of more than two images with known calibration. Our goal was to develop an approach with minimum reliance on binocular processing that addresses the problem in 3D and not 2 1/2-D. We also did not want to be restricted by constraints such as having to place all the cameras on the same part of the scene, perform background segmentation or merge partial results. When data from all images are processed simultaneously, the difficulties caused by occlusion and uniform surfaces are reduced. Merging partial noisy depth maps is not guaranteed to have the same effect. Of course, this is only feasible for relatively small sets of images such as the ones we processed here, which do not exceed 36 images. The only binocular step is the detection of potential pixel correspondences, which are then reconstructed in 3D and used as input for tensor voting. Correct correspondences receive a lot of support from their neighbors as parts of salient surfaces, while wrong correspondences do not. Tensor voting on a rather large number of potential correspondences (1.1 million) takes around 45 minutes. Since this work was done for the most part before the integration of monocular information in binocular stereo, there is still room for improvement. Some day I may find the time to improve these results and their visualization...

Six of the input images captured at the CMU dome, a view of the inputs from above (note that the cameras are inside the set of points), a view of the most salient points and a zoomed-in view at the center of the dome where the person is.

Publications
Besides working on specific computer vision and machine learning problems using tensor voting, I put a lot of effort into understanding, evaluating and extending the framework. Tensor voting is a perceptual organization approach based on the Gestalt principles of proximity and good continuation. It has mainly been applied to organizing generic tokens into coherent groups in core perceptual organization scenarios, as well as to computer vision problems formulated as perceptual organization. The two fundamental aspects of tensor voting are the representation of the data by second-order, symmetric, non-negative definite tensors and the information propagation mechanism among the inputs, which cast and receive votes to and from their neighbors. Following the standards set by Gérard Medioni and a number of his students who also worked on this, I tried to ensure that all modifications and additional functionalities adhere to our philosophy and result in an approach that is:
My other large contribution to the framework was a fully general N-D implementation that allows us to tackle problems in high-dimensional spaces. See below for our work on dimensionality estimation, manifold learning and function approximation. What made this implementation feasible is a geometric observation that allowed us to simplify the vote generation process and made the pre-computation of huge high-dimensional voting fields unnecessary. Publications
I also did research on figure completion, a perceptual organization process triggered by the presence of certain configurations of keypoints. For example, for contour completion to occur behind an occluder, two T-junctions at appropriate positions and orientations have to exist. The integration of first order information allows us to detect keypoints such as endpoints of curves, T-junctions and L-junctions. These are indicators of potential completions and are used to generate hypotheses. An important aspect of our approach is that the decision between modal completion, which occurs along the boundary of the occluder, and amodal completion, which occurs along the direction of the occluded contour, can be made completely automatically. If a hypothesis is supported by at least two keypoints, we can infer the completion in a second pass of tensor voting. It should be noted that, while we do not address real images since integrated edge and junction detection in them is far from being solved, our algorithm can explain a few illusions such as the Koffka crosses shown below, the Ehrenstein stimulus and the Poggendorff illusion.

Top row: two examples of the Koffka cross. Notice that the completion perceived by the human visual system changes from a circle to a square depending on the width of the cross' arms. Bottom row: zoomed-in views of the illusory contours produced by our algorithm, which detects that modal completion is feasible and completes the low contrast occluder. Due to pixel quantization the circle appears slightly squared. Note that the junctions in the case of the square completion have been explicitly detected.

Publications
Using the N-D implementation of tensor voting, we were able to tackle problems in instance-based learning. One such problem is the estimation of the intrinsic dimensionality of the data given a set of observations in a high-dimensional space. We can perform this estimation after a round of tensor voting, since the eigenstructure of the resulting tensors provides an estimate of the dimensionality of the structure going through the point. This point-wise estimation makes our method applicable to challenging datasets with varying dimensionality and datasets that are not manifolds, as is the case when they contain intersections. Moreover, the absence of global computations allows us to process very large datasets at reasonable computational costs. We show results of accurate dimensionality estimation at the point level in spaces of up to 150-D.

Top row: data of varying dimensionality in 4D. (The fourth dimension has been dropped for visualization purposes.) The input consists of an empty 3D sphere in 4D (which appears as a full 3D sphere when the fourth dimension is dropped), a 2D cone and a curve. Bottom row: points classified according to their dimensionality as 1D, 2D and 3D. Notice that the intersection between the cone and the curve is correctly classified as 3D.

Publications
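To make the point-wise dimensionality estimation described above more concrete, here is a minimal sketch of how the eigenvalues of a tensor could be turned into an estimate. It assumes the rule that, with eigenvalues sorted in decreasing order, the largest gap between consecutive eigenvalues after position d indicates d normal directions and hence a structure of dimensionality N - d. The function name `estimate_dimensionality` is hypothetical, and the sketch compares only consecutive gaps, so it returns values between 1 and N - 1.

```python
def estimate_dimensionality(eigenvalues):
    """Estimate the dimensionality of the structure through a point.

    eigenvalues: the N eigenvalues of the tensor accumulated at the point.
    Assumed rule: the largest gap lambda_d - lambda_(d+1) marks d dominant
    normal directions, so the structure has N - d dimensions.
    """
    lam = sorted(eigenvalues, reverse=True)
    n = len(lam)
    if n < 2:
        raise ValueError("need at least two eigenvalues")
    if lam[-1] < 0:
        raise ValueError("a non-negative definite tensor has no negative eigenvalues")
    # gaps[i] is the gap after the (i + 1)-th largest eigenvalue.
    gaps = [lam[i] - lam[i + 1] for i in range(n - 1)]
    d = gaps.index(max(gaps)) + 1  # number of dominant normals
    return n - d


# Toy eigenvalue sets for points in a 4D space.
hypersurface = estimate_dimensionality([9.0, 0.5, 0.4, 0.3])  # one normal
surface = estimate_dimensionality([8.0, 7.6, 0.3, 0.2])       # two normals
curve = estimate_dimensionality([8.0, 7.5, 7.0, 0.2])         # three normals
```

Because the estimate uses only the tensor at each point, different points of the same dataset can receive different dimensionalities, which is what allows datasets of mixed dimensionality to be handled.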
Besides the dimensionality, tensor voting provides estimates of local orientation at each point. This allows us to learn the structure of the manifold locally and perform tasks such as geodesic distance estimation and generation of new samples on the manifold. This can be extended to address function approximation in a setting where the function is learned from observations in a joint input-output space. The queries are in the form of points in the lower dimensional input space. The answer is found by finding a starting point on the manifold and marching on it until the coordinates of the query in the input space are reached. More details will be added once this work is published in a peer-reviewed forum.

3D Face Modeling and Recognition

I spent a few years demonstrating and evaluating the 3D face reconstruction and recognition technology developed by Geometrix, Inc. A 3D model of my face was created from two pictures with the Facevision 200 Series system. While I never wrote any code directly used in this project, I thoroughly tested numerous versions of the Geometrix software and hardware systems over a five-year period. The reconstruction system matured to the point that recognition using 3D information only could be reliably performed. The fact that appearance is not used at all makes the system invariant to illumination and viewpoint variations.

Screenshot of a verification test on my face using two models made two months apart. There are large variations in lighting, pose and my appearance, which do not throw the system off.

Seismology

In 2003, Gérard Medioni and I collaborated with Ory Dor and Charles G. Sammis from the Department of Earth Sciences at USC towards developing a technique that uses computer vision to assist in the characterization of the orientation distribution of slip surfaces in fault breccia.
My contribution was to use the Facevision stereo rig to reconstruct rock samples from the fault. I also wrote software that detected markers corresponding to slip planes and slip lines, computed their normals or tangents, respectively, and collected statistics that were analyzed by our collaborators. They were able to draw useful conclusions on the mechanical origin of the set of surfaces. The use of stereo vision made the process considerably faster and more accurate compared to manual measurements of each slip surface.

Publications
During the Spring semester of 1999, I worked as a Research Assistant at the Signal and Image Processing Institute in the Electrical Engineering Department of USC, doing research on Magnetic Resonance Imaging with Richard M. Leahy. My task was to develop a method that segments MR images of the brain into gray matter, white matter and cerebrospinal fluid using morphological processing in 3D.

Lossless Image Compression and Watermarking

For my undergraduate diploma thesis in the Electrical and Computer Engineering Department of the Aristotle University of Thessaloniki, Greece, I developed a plug-in for the Windows version of Netscape Navigator that decodes pyramid encoded images and extracts an embedded watermark from them. The thesis was supervised by Michael G. Strintzis.

Publications