Philippos Mordohai
Postdoctoral Research Associate
Department of Computer Science
University of North Carolina at Chapel Hill

Office: Sitterson Hall 260
Phone Number: (919) 962 1965
E-mail: mordohai@cs.unc.edu





NEWS
I have started as a postdoc at UPenn. My new webpage is at www.seas.upenn.edu/~mordohai.

Most of my research on tensor voting can be found in my thesis, the book I wrote with Gérard Medioni and the short course I gave at CVPR 2007. Note: Some publications are made available here to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders.

CURRENT RESEARCH PROJECTS

Video-based Reconstruction of Urban Environments
Since September 2005 I have worked on the DARPA UrbanScape project, which aims at the real-time 3-D reconstruction of urban scenes. The video-based part of UrbanScape is carried out by the UNC computer vision group led by Marc Pollefeys and the computer vision group from the Center for Visualization and Virtual Environments of the University of Kentucky led by David Nistér. We work on both the video collection system, which can record eight high-resolution video streams to disk at 30 frames per second, and the 3D reconstruction system that generates 3D models from these videos. More information on UrbanScape can be found at the Urban 3D Modelling from Video webpage.

My efforts are mainly in the second part of the processing pipeline that addresses dense 3D reconstruction. I have also made smaller contributions in the first part of the pipeline that addresses feature tracking, camera pose estimation and georegistration. Modules that I have worked on include:
  • the stereo engine that runs at several frames per second
  • fusion of depth maps from a single video stream in order to obtain reliable depth estimates
  • fusion across video streams to achieve maximum coverage and to select the best model for surfaces visible in multiple cameras
  • 3D model generation where the aim is to remove redundancy and obtain a compact representation that still preserves details.
Screenshots of models can be seen below. These reconstructions were made using videos from two cameras. The model at the bottom comprises more than 2,600 frames. Close-ups show that features such as the ground plane, which is viewed at an angle from the camera, and thin objects such as the light pole on the right are reconstructed well. More pictures are available at the Urban 3D Modelling from Video webpage.






Also see the following videos.




Publications

  • M. Pollefeys, D. Nistér, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.-J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, R. Yang, G. Welch, H. Towles, "Detailed Real-Time Urban 3D Reconstruction From Video", International Journal of Computer Vision, published online, 2007 (impact factor for 2006: 6.085)
  • P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J.-M. Frahm, R. Yang, D. Nistér and M. Pollefeys, "Real-Time Visibility-Based Fusion of Depth Maps", International Conference on Computer Vision (ICCV), Rio de Janeiro, Brazil, October 2007 (acceptance rate for oral presentations: 3.9%)
  • P. Merrell, P. Mordohai, J.-M. Frahm and M. Pollefeys. "Evaluation of Large Scale Scene Reconstruction", Virtual Representations and Modeling of Large-scale environments (VRML), Rio de Janeiro, Brazil, October 2007
  • P. Mordohai, J.-M. Frahm, A. Akbarzadeh, B. Clipp, C. Engels, D. Gallup, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, H. Towles, G. Welch, R. Yang, M. Pollefeys and D. Nistér, "Real-Time Video-Based Reconstruction of Urban Environments", 3D-ARCH'2007: 3D Virtual Reconstruction and Visualization of Complex Architectures, Zurich, Switzerland, July, 2007
  • D. Gallup, J.-M. Frahm, P. Mordohai, Q. Yang and M. Pollefeys, "Real-time Plane-sweeping Stereo with Multiple Sweeping Directions", International Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, Minnesota, USA, June 2007 (acceptance rate: 27.5%)
    Videos of an uncalibrated reconstruction (3.8 MB) and a reconstruction using only planes with high prior probability (8.2 MB). Note that the depth maps have not been fused for these results, which were generated at 33Hz.
  • A. Akbarzadeh, J.-M. Frahm, P. Mordohai, B. Clipp, C. Engels, D. Gallup, P. Merrell, M. Phelps, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, R. Yang, G. Welch, H. Towles, D. Nistér and M. Pollefeys, "Towards Urban 3D Reconstruction From Video", Third International Symposium on 3-D Data Processing, Visualization and Transmission, Chapel Hill, North Carolina, USA, June 2006
Demo
  • J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, H. Towles, G. Welch, R. Yang, D. Nistér and M. Pollefeys, "Real-Time Video-Based Reconstruction of Urban Environments", International Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, Minnesota, USA, June 2007. Won the best demo award.
Video Analysis and 3D Content Extraction
I have also started working on a new project entitled "3D Content Extraction from Video Streams". It is part of the DTO Video Analysis and Content Extraction program. Our work is on the development of algorithms that can automatically extract 3D information from videos captured by unknown cameras under unknown conditions.
Our research will focus on robust Structure from Motion algorithms and auto-calibration that will allow us to recover the camera parameters and poses from the videos. This will be followed by model selection to determine whether we can achieve a 3D reconstruction, a panoramic mosaic (if the camera only rotates) or no reconstruction at all from the video sequence. The emphasis is on making our methods applicable to video sequences over which we have very little control. If 3D reconstruction is feasible, it will be performed by a stereo module with an adjustable trade-off between speed and quality. A very fast recognition module will be used to detect previously observed landmarks in order to stitch together partial models, potentially reconstructed from different videos. We will also attempt to measure the background and objects of the scene and make higher level inferences such as whether the scene is natural or man-made.

Multiple-View Reconstruction using Graph Cuts on an Adaptive Tetrahedral Mesh
In this work, Sudipta Sinha, Marc Pollefeys and I formulate multi-view 3D shape reconstruction as the computation of a minimum cut on the dual graph of a semi-regular, multi-resolution, tetrahedral mesh. Our method uses photo-consistency to guide the adaptive subdivision of a coarse mesh. This generates a multi-resolution volumetric mesh that is densely tessellated in the parts likely to contain the unknown surface and coarse in parts that are empty. The graph-cut on the dual graph of this tetrahedral mesh produces a minimum cut corresponding to a triangulated surface that minimizes a global surface cost functional. We make no assumptions about topology and can recover deep concavities when enough cameras observe them. Our formulation also allows silhouette constraints to be enforced during the graph-cut step to counter its inherent bias for producing minimal surfaces. Local shape refinement via surface deformation is used to recover details in the reconstructed surface. Reconstructions of the Multi-View Stereo Evaluation benchmark datasets and several other real datasets show the effectiveness of our method.
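
The minimum-cut computation mentioned above can be illustrated on a toy graph. The following is a minimal sketch, not the implementation used in this work: it assumes each graph node stands for a cell of the volumetric mesh, each edge capacity stands for the photo-consistency cost of the face shared by two cells, and it uses a plain Edmonds-Karp max-flow solver. The function name `min_cut` and the node labels are hypothetical.

```python
from collections import deque


def min_cut(capacity, source, sink):
    """Return (cut_value, source_side) for a directed graph.

    capacity maps (u, v) pairs to non-negative edge capacities.
    The solver is Edmonds-Karp: repeated BFS for augmenting paths
    in the residual graph.
    """
    residual = {}
    neighbors = {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)  # reverse edge for flow cancellation
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    def reachable_from_source():
        # BFS over edges that still have residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_source()
        if sink not in parent:
            # No augmenting path is left: the reachable nodes form the
            # source side of a minimum cut.
            return flow, set(parent)
        # Find the bottleneck capacity along the path found by BFS.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        # Push the bottleneck flow along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck


# Toy chain of cells from the object interior to the exterior.
# The cheapest face (cost 1) is where the surface is placed.
costs = {("inside", "c1"): 5, ("c1", "c2"): 1, ("c2", "outside"): 4}
cut_value, interior = min_cut(costs, "inside", "outside")
```

In this toy case the cut separates `c1` from `c2`, which is the analogue of placing the surface on the most photo-consistent face.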


Three of the input images for the Hygeia dataset. Top row: reconstruction after graph cut. Bottom row: model after local refinement. Notice the details on the clothes.

Publications

Temporally Consistent Multiple-View Reconstruction
At ICCV 2007, Scott Larsen, Marc Pollefeys, Henry Fuchs and I presented an approach for 3D reconstruction from multiple video streams taken by static, synchronized and calibrated cameras that is capable of enforcing temporal consistency on the reconstruction of successive frames. We aim to improve the quality of the reconstruction by finding corresponding pixels in subsequent frames of the same camera using optical flow, while at least maintaining the quality of the single time-frame reconstruction when these correspondences are wrong or cannot be found. This allows us to process scenes with fast motion, occlusions and self-occlusions where optical flow fails for large numbers of pixels. To this end, we modify the belief propagation algorithm to operate on a 3D graph that includes both spatial and temporal neighbors and to be able to discard messages from outlying neighbors. We also propose methods for introducing a bias and for suppressing noise typically observed in uniform regions. The bias term encapsulates information about the background and aids in achieving a temporally consistent reconstruction and in the mitigation of errors caused by occlusion.


Left: reconstruction using standard belief propagation stereo. Right: reconstruction using our method. The static parts of the background do not fluctuate and the boundaries of the dancer are sharper.

Publications

Simplified Belief Propagation for Multiple-View Reconstruction
In the fall of 2005, Scott Larsen, Marc Pollefeys, Henry Fuchs and I worked on the development of a belief propagation framework applicable to multiple-view reconstruction. The beliefs for the depth of each pixel are initialized using the plane sweep algorithm, which is repeated after each belief propagation iteration taking into account the updated visibility information. Probability density functions (pdfs) for the depth of each pixel along the ray emanating from the camera center are maintained for all pixels of all images. The main novelty of our work is a scheme for performing belief propagation in adaptive neighborhoods that include 3D neighbors besides the classic 4-neighbors in the image that contains each pixel (see figure below). Essentially, each pixel has four constant neighbors in its own image and a number of other neighbors that are determined based on its projection in the other images. Messages are passed among all the neighbors, modulated by a compatibility function that takes into account similarity in color, to mitigate the effects of occlusion, and distance in 3D, to suppress the influence of points on different surfaces. For these computations to be feasible, we had to simplify the belief propagation algorithm, thus the title of the project.
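
To make the idea of a compatibility function concrete, here is a minimal sketch. The text above states only that compatibility depends on color similarity and 3D distance; the Gaussian form, the two `sigma` parameters, their default values and the blending rule in `modulate` are all assumptions made for illustration, not the functions used in the project.

```python
import math


def compatibility(color_p, color_q, point_p, point_q,
                  sigma_color=10.0, sigma_dist=0.05):
    """Weight in (0, 1] that is high for similar colors and nearby 3D points.

    Hypothetical form: a product of two Gaussians, one on the squared
    color difference and one on the squared 3D distance.
    """
    color_d2 = sum((a - b) ** 2 for a, b in zip(color_p, color_q))
    dist_d2 = sum((a - b) ** 2 for a, b in zip(point_p, point_q))
    return (math.exp(-color_d2 / (2.0 * sigma_color ** 2))
            * math.exp(-dist_d2 / (2.0 * sigma_dist ** 2)))


def modulate(message, weight):
    """Blend a message over depth hypotheses toward the uniform distribution.

    weight = 1 keeps the message unchanged; weight = 0 makes it
    uninformative, so an incompatible neighbor has no influence.
    """
    n = len(message)
    return [weight * m + (1.0 - weight) / n for m in message]


# A neighbor with a different color and a distant 3D position
# contributes an almost uniform message.
w = compatibility((200, 30, 30), (20, 180, 40), (0.0, 0.0, 1.0), (0.0, 0.0, 1.6))
damped = modulate([0.7, 0.2, 0.1], w)
```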


Illustration of the neighborhood definition for a candidate depth along a single ray. For the ray that goes through pixel p in C2, Pd is a candidate 3-D point. Its neighborhood includes the four neighbors of p in the reference image, as well as its projections q1 and q3 in the other images, rounded to the nearest pixel, along with their four-neighborhoods.

One of the eight input images (courtesy of Microsoft Research) and the corresponding depth map we produced.

Publications

PAST RESEARCH PROJECTS

Binocular Stereo using Tensor Voting
One of the many projects I worked on at USC, and arguably the one I spent the most time on, was binocular stereo. This work was based on the preliminary approach of Mi-Suen Lee and Gérard Medioni that addressed stereo based on the premise that correct pixel correspondences reconstructed in 3D form the scene surfaces, while wrong correspondences do not form salient surfaces. Under this approach, stereo can be posed as a perceptual organization problem and tensor voting (see below) can be used to infer the surfaces. I worked to develop an algorithm that would use the same philosophy, but would be more effective in challenging real examples and benchmark data. After experimenting with a number of options for establishing pixel correspondences and for integrating monocular information in a way that mitigates the effects of occlusion without committing to premature decisions, we presented an algorithm that offers certain advantages. These include:
  • The ability to integrate multiple pixel matching techniques and utilize their different strengths.
  • Information propagation via tensor voting in 3D instead of 2D, so that results do not suffer from interference among different surfaces.
  • No bias towards frontoparallel surfaces.
  • A segmentation scheme that groups points into surfaces using only geometric information. After this, the appearance of each surface can be estimated and outliers can be removed. This scheme is effective for both uniform and textured surfaces.
  • Unreliable depth estimates are corrected by generating depth hypotheses from nearby surfaces with similar appearance. The one which is the best continuation of the surface is determined by tensor voting and selected as the new depth estimate.
  • Depth can be estimated even for occluded pixels, again by detecting their similarity to nearby surfaces.









Left images, ground truth depth maps, the depth maps we generated and error maps for the Middlebury Stereo Evaluation webpage datasets. White in the error maps indicates errors below 0.5 disparity levels, gray errors between 0.5 and 1 disparity level and black errors greater than 1 disparity level. The error metric is the percentage of pixels above a certain error in disparity. First row: Tsukuba. Second row: Venus. Third row: Teddy. Fourth row: Cones.

We have submitted our results to the Middlebury Stereo Evaluation webpage and rank 14th among 25 algorithms (as of 11/11/2006) when the error threshold is set to 1 disparity level and 9th when the threshold is set to 0.5 disparity levels.
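
The error metric described above, the percentage of pixels whose disparity error exceeds a threshold, can be sketched as follows. The function name, the nested-list image layout and the use of `None` to mark pixels without ground truth are assumptions made for this example, not part of the evaluation's own code.

```python
def bad_pixel_percentage(disparity, ground_truth, threshold=1.0):
    """Percentage of pixels whose absolute disparity error exceeds threshold.

    Both inputs are lists of rows. Pixels whose ground truth is None
    are skipped (an assumption made for this sketch).
    """
    total = 0
    bad = 0
    for row_d, row_g in zip(disparity, ground_truth):
        for d, g in zip(row_d, row_g):
            if g is None:
                continue
            total += 1
            if abs(d - g) > threshold:
                bad += 1
    if total == 0:
        raise ValueError("no pixels with ground truth")
    return 100.0 * bad / total


estimated = [[10.0, 12.75], [7.0, 3.0]]
truth = [[10.0, 12.0], [8.5, None]]
at_one_level = bad_pixel_percentage(estimated, truth, threshold=1.0)
at_half_level = bad_pixel_percentage(estimated, truth, threshold=0.5)
```

Lowering the threshold from 1 to 0.5 disparity levels can only increase the percentage, which is why the two thresholds can produce different rankings.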



Publications

Multiple View Stereo using Tensor Voting
At USC I also worked on multiple view stereo, where the input is a set of more than two images with known calibration. Our goal was to develop an approach with minimum reliance on binocular processing that addresses the problem in 3D and not 2 1/2-D. We also did not want to be restricted by constraints such as having to place all the cameras on the same part of the scene, perform background segmentation or merge partial results. When data from all images are processed simultaneously, the difficulties caused by occlusion and uniform surfaces are reduced. Merging partial noisy depth maps is not guaranteed to have the same effect. Of course, this is only feasible for relatively small sets of images such as the ones we processed here, which do not exceed 36 images. The only binocular step is the detection of potential pixel correspondences, which are then reconstructed in 3D and used as input for tensor voting. Correct correspondences receive a lot of support from their neighbors as parts of salient surfaces, while wrong correspondences do not. Tensor voting on a rather large number of potential correspondences (1.1 million) takes around 45 minutes. Since this work was done for the most part before the integration of monocular information in binocular stereo, there is still room for improvement. Some day I may find the time to improve these results and their visualization...


Six of the input images captured at the CMU dome, a view of the inputs from above (note that the cameras are inside the set of points), a view of the most salient points and a zoomed in view at the center of the dome where the person is.

Publications

Tensor Voting: First Order Augmentation and N-D Implementation
Besides working on specific computer vision and machine learning problems using tensor voting, I put a lot of effort into understanding, evaluating and extending the framework. Tensor voting is a perceptual organization approach based on the Gestalt principles of proximity and good continuation. It has mainly been applied to organizing generic tokens into coherent groups in core perceptual organization scenarios, as well as to computer vision problems formulated as perceptual organization. The two fundamental aspects of tensor voting are the representation of the data by second-order, symmetric, non-negative definite tensors and the information propagation mechanism among the inputs that cast and receive votes to and from their neighbors. Following the standards set by Gérard Medioni and a number of his students who also worked on this, I tried to ensure that all modifications and additional functionalities adhere to our philosophy and result in an approach that is:
  • local
  • data-driven
  • able to represent all structure types and their intersections and boundaries
  • robust to noise
  • able to process large amounts of data
  • amenable to a least-commitment strategy postponing hard decisions as far as possible.
One of my contributions to the tensor voting framework was the integration of first order information that allows the inference of boundaries and terminations of perceptual structures. This addresses a limitation of the strictly second-order formulation of tensor voting, which could not reliably infer the endpoints of a curve or the bounding curves of a surface. Second order voting can be interpreted as an excitatory process and first order voting can be viewed as an inhibitory process that complements the former. Similar research was carried out by Chi-Keung Tang and Wai-Shun Tong at the Hong Kong University of Science and Technology. Our collaboration resulted in the PAMI paper listed below.

My other large contribution to the framework was a fully general N-D implementation that allows us to tackle problems in high-dimensional spaces. See below for our work on dimensionality estimation, manifold learning and function approximation. What made this implementation feasible is a geometric observation that allowed us to simplify the vote generation process and made the pre-computation of huge high-dimensional voting fields unnecessary.
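
The second-order tensor representation described above can be illustrated in 2D. This is a minimal sketch that covers only the representation and its interpretation, not the voting fields or the decay of votes with distance and curvature. It assumes the usual reading in which the difference of the two eigenvalues measures how strongly a single orientation is preferred ("stick" saliency) and the smaller eigenvalue measures the lack of a preferred orientation ("ball" saliency). All function names are hypothetical.

```python
import math


def stick_tensor(angle):
    """Symmetric 2x2 tensor n n^T for the unit vector at the given angle.

    Stored as (a, b, c), meaning the matrix [[a, b], [b, c]].
    """
    nx, ny = math.cos(angle), math.sin(angle)
    return (nx * nx, nx * ny, ny * ny)


def add_tensors(t1, t2):
    """Accumulate two tensors by element-wise addition."""
    return tuple(x + y for x, y in zip(t1, t2))


def decompose(tensor):
    """Closed-form eigen-decomposition of a symmetric 2x2 tensor."""
    a, b, c = tensor
    mean = (a + c) / 2.0
    half_gap = math.hypot((a - c) / 2.0, b)
    lam1, lam2 = mean + half_gap, mean - half_gap
    return {
        "stick": lam1 - lam2,  # saliency of the preferred orientation
        "ball": lam2,          # saliency with no preferred orientation
        "orientation": 0.5 * math.atan2(2.0 * b, a - c),
    }


# Two agreeing orientations reinforce each other ...
agree = decompose(add_tensors(stick_tensor(0.0), stick_tensor(0.0)))
# ... while two perpendicular ones leave no preferred orientation.
conflict = decompose(add_tensors(stick_tensor(0.0), stick_tensor(math.pi / 2)))
```

Because tensors are accumulated by addition and only interpreted afterwards, conflicting evidence is retained rather than resolved early, which fits the least-commitment strategy listed above.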

Publications
  • G. Medioni and P. Mordohai, "Saliency in Computer Vision", in Neurobiology of Attention, L. Itti, G. Rees, and J. Tsotsos (editors), Elsevier Science, 2005
  • G. Medioni, P. Mordohai, and M. Nicolescu, "The Tensor Voting Framework", in Handbook of Geometric Computing : Applications in Pattern Recognition, Computer Vision, Neuralcomputing, and Robotics, E. Bayro-Corrochano (editor), Springer-Verlag, 2005
  • G. Medioni and P. Mordohai, "The Tensor Voting Framework", in Emerging Topics in Computer Vision, S.B. Kang and G. Medioni (editors), Prentice Hall, 2004
  • W.S. Tong, C.K. Tang, P. Mordohai and G. Medioni, "First Order Augmentations to Tensor Voting for Boundary Inference and Multiscale Analysis in 3-D", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, No. 5, pp. 594-611, 2004 (impact factor in 2004: 4.352)
Tutorials
  • P. Mordohai, "Tensor Voting: A Perceptual Organization Approach to Computer Vision and Machine Learning", during CVPR 2007. The slides are available here in ppt and pdf form. References to related work can also be downloaded as ppt and pdf.
  • Contributed to the brief introduction to Tensor Voting by Gérard Medioni for CV Online.
  • Contributed in the preparation of the Tensor Voting short course by Gérard Medioni and Chi-Keung Tang for CVPR 2003 in Madison, Wisconsin.
Software

Figure Completion
I also did research on figure completion, a perceptual organization process that is triggered by the presence of certain configurations of keypoints. For example, for contour completion to occur behind an occluder, two T-junctions at appropriate positions and orientations have to exist. The integration of first order information allows us to detect keypoints such as endpoints of curves, T-junctions and L-junctions. These are indicators of potential completions and are used to generate hypotheses. An important aspect of our approach is that the decision between modal completion, which occurs along the boundary of the occluder, and amodal completion, which occurs along the direction of the occluded contour, can be made completely automatically. If a hypothesis is supported by at least two keypoints, we can infer the completion in a second pass of tensor voting. Note that while we do not address real images, since integrated edge and junction detection in them is far from being solved, our algorithm can explain a few illusions such as the Koffka crosses shown below, the Ehrenstein stimulus and the Poggendorff illusion.



               
Top row: two examples of the Koffka cross. Notice that the perceived completion by the human visual system changes from a circle to a square depending on the width of the cross' arms.
Bottom row: zoomed in views of the illusory contours produced by our algorithm that detects that modal completion is feasible and completes the low contrast occluder. Due to pixel quantization the circle appears slightly squared. Note that the junctions in the case of the square completion have been explicitly detected.


Publications

Dimensionality Estimation
Using the N-D implementation of tensor voting, we were able to tackle problems in instance-based learning. One such problem is the estimation of the intrinsic dimensionality of the data given a set of observations in a high-dimensional space. We can perform this estimation after a round of tensor voting, since the eigenstructure of the resulting tensors provides an estimate of the dimensionality of the structure going through the point. This point-wise estimation makes our method applicable to challenging datasets with varying dimensionality and datasets that are not manifolds, as is the case when they contain intersections. Moreover, the absence of global computations allows us to process very large datasets at reasonable computational costs. We show results of accurate dimensionality estimation at the point level in spaces of up to 150-D.
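
One way to read the eigenstructure of a tensor as a dimensionality estimate is sketched below. The text above says only that the eigenstructure provides the estimate; the specific rule used here is an assumption made for illustration: the eigenvectors with dominant eigenvalues are taken to span the normal space, and the largest gap between consecutive eigenvalues is taken as the boundary between normals and tangents. The function name is hypothetical.

```python
def estimate_dimensionality(eigenvalues):
    """Estimate the intrinsic dimensionality at a point from tensor eigenvalues.

    Assumed rule: sort the eigenvalues in decreasing order, append a zero,
    and find the largest gap between consecutive values. If the gap follows
    the d-th eigenvalue, the point has d normals and N - d tangent
    directions, where N is the dimension of the ambient space.
    """
    lam = sorted(eigenvalues, reverse=True)
    n = len(lam)
    if n == 0:
        raise ValueError("at least one eigenvalue is required")
    padded = lam + [0.0]
    gaps = [padded[i] - padded[i + 1] for i in range(n)]
    num_normals = gaps.index(max(gaps)) + 1
    return n - num_normals


# Hypothetical eigenvalues at three points in a 4D space.
on_volume = estimate_dimensionality([9.0, 0.5, 0.4, 0.1])   # one normal
on_surface = estimate_dimensionality([5.0, 4.8, 0.3, 0.2])  # two normals
on_curve = estimate_dimensionality([3.0, 2.9, 2.8, 0.1])    # three normals
```

Since the rule uses only the tensor at each point, every point gets its own estimate, which is what allows data of varying dimensionality to be handled.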





Top row: data of varying dimensionality in 4D. (The fourth dimension has been dropped for visualization purposes.) The input consists of an empty 3D sphere in 4D (which appears as a full 3D sphere when the fourth dimension is dropped), a 2D cone and a curve.
Bottom row: points classified according to their dimensionality as 1D, 2D and 3D. Notice that the intersection between the cone and the curve is correctly classified as 3D.


Publications

Manifold Learning and Function Approximation
Besides the dimensionality, tensor voting provides estimates of local orientation at each point. This allows us to learn the structure of the manifold locally and perform tasks such as geodesic distance estimation and generation of new samples on the manifold. This can be extended to address function approximation in a setting where the function is learned from observations in a joint input-output space. The queries are in the form of points in the lower dimensional input space. The answer is found by finding a starting point on the manifold and marching on it until the coordinates of the query in the input space are reached. More details will be added once this work is published in a peer-reviewed forum.

3D Face Modeling and Recognition
I spent a few years demonstrating and evaluating the 3D face reconstruction and recognition technology developed by Geometrix, Inc.


For a 3-D model of my face using these two pictures click on the pictures or here. This model was created with the Facevision 200 Series system.

While I never wrote any code directly used in this project, I have thoroughly tested numerous versions of the Geometrix software and hardware systems over a five-year period. The reconstruction system matured to the point that recognition using 3D information only could be reliably performed. The fact that appearance is not used at all makes the system invariant to illumination and viewpoint variations.
This is a screenshot of a verification test on my face using two models made two months apart. There are large variations in lighting, pose and my appearance, which do not throw the system off.

Seismology
In 2003, Gérard Medioni and I collaborated with Ory Dor and Charles G. Sammis from the Department of Earth Sciences at USC on developing a technique that uses computer vision to assist in the characterization of the orientation distribution of slip surfaces in fault breccia. My contribution was to use the Facevision stereo rig to reconstruct rock samples from the fault. I also wrote software that detected markers corresponding to slip planes and slip lines, computed their normals or tangents, respectively, and collected statistics that were analyzed by our collaborators. They were able to draw useful conclusions on the mechanical origin of the set of surfaces. The use of stereo vision made the process considerably faster and more accurate compared to manual measurements of each slip surface.
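
The geometric step of turning markers into orientations can be sketched as follows. This is not the software mentioned above. It assumes, purely for illustration, that each slip plane is given by three reconstructed 3D marker points and each slip line by two; the function names are hypothetical.

```python
import math


def _normalize(vector):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in vector))
    if length == 0.0:
        raise ValueError("degenerate input: points coincide or are collinear")
    return [c / length for c in vector]


def plane_normal(p0, p1, p2):
    """Unit normal of the plane through three 3D points (cross product)."""
    u = [b - a for a, b in zip(p0, p1)]
    v = [b - a for a, b in zip(p0, p2)]
    cross = [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]
    return _normalize(cross)


def line_tangent(p0, p1):
    """Unit direction of the line through two 3D points."""
    return _normalize([b - a for a, b in zip(p0, p1)])


normal = plane_normal((0, 0, 0), (1, 0, 0), (0, 1, 0))
tangent = line_tangent((0, 0, 0), (0, 3, 4))
```

The sign of the normal depends on the order of the three points, so any statistics collected over many planes would need a consistent orientation convention.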

Publications
  • P. Mordohai, O. Dor, J. Zechar, C.G. Sammis, and Y. Ben-Zion, "Slip Surfaces in Fault Breccia from the Sierra Madre Fault Zone: Geometry and Mechanical Implications", American Geophysical Union, EOS, 2003
  • O. Dor, P. Mordohai, C.G. Sammis, and Y. Ben-Zion, "Slip Surfaces in Fault Breccia from the Sierra Madre Fault Zone: Geometry and Mechanical Implications", SECE, Proceedings and Abstracts, 2003
Medical Image Segmentation
During the Spring semester of 1999, I worked as a Research Assistant at the Signal and Image Processing Institute in the Electrical Engineering Department of USC doing research on Magnetic Resonance Imaging with Richard M. Leahy. My task was to develop a processing method able to segment MR images of the brain into gray matter, white matter and cerebrospinal fluid using morphological processing in 3D.

Lossless Image Compression and Watermarking
For my undergraduate diploma thesis in the Electrical and Computer Engineering Department of the Aristotle University of Thessaloniki, Greece, I developed a plug-in for the Windows version of Netscape Navigator that decodes pyramid encoded images and extracts an embedded watermark from them. The thesis was supervised by Michael G. Strintzis.

Publications
  • P. Mordohai, "Netscape Navigator plug-in for decoding pyramid-encoded medical images with watermarks" (in Greek), Diploma Thesis, Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Greece, June 1998