Reconstruction Algorithm

Objective:
If we can generate a model of the user (and other real objects) in real time, we can then render that model from the user's viewpoint and present a visually faithful avatar.  Further, we could insert the objects into simulations and rendering to truly incorporate the objects into the virtual environment.

To generate a real-time model of real objects, we try to answer the question:

Given n images of an object (or objects), can we quickly generate a 3D model of the objects?

We can make a tradeoff between speed and accuracy, and depending on the application, this can be an appropriate trade off to make.

We solve for the visual hull, first proposed by Laurentini in 1994, using graphics hardware to accelerate this process.

The visual hull is the tightest volume given a set of silhouettes of an object.

In this 2D example, we have a turquoise box.  Given the two yellow cameras, the box would project (the dotted purple lines) onto the image planes.  We denote the pixels on the camera image plane that have the object projected onto them as object pixels.

Now, if we reverse this and ask given just the camera image planes, what can we say about the object?  The light blue area is the visual hull, the volume that the object could exist within.  The original cube is overlayed for comparison.  Similar to a convex hull, the visual hull is always larger than the original object, and no part of the object is not within the visual hull.  As you can see, certain object topologies can not be determined with just object silhouettes, such as a concave object like a cup.

To answer the question: "Is a point within the visual hull?"  we note that a point inside the visual hull (on the left) will project onto an object pixel in every camera.  Secondly, an object outside the visual hull (on the right) will not project onto an object pixel in every camera.

So we can now determine if a point is within the visual hull, let's extend this to a line.  In the following diagram, we project each cameras' object pixels onto the line in question (orange).  The left camera projects as a bright green line, and the right camera is projected as a purple line.  If we take the intersection along the line where the bright green and purple lines overlap, we can the intersection of the line with the visual hull (denoted as the turquoise line, the same color as the original square).  In affect, we are querrying, "what points along this line are within the visual hull?"

To generate a view-dependent rendering of the visual hull, we can solve for intersections with the visual hull in a series of planes perpendicular to the lookat vector, gradually receeding away from the viewer.  Here we sweep a series of orange planes, starting from the closest to the eye.  What results is sliced representation of the visual hull.  For 3D, we sweep planes instead of lines, and all the intersection approaches do scale up to a third dimension.

Well that's all fine and good you say, but how can we possibly test for every point in a view volume in real-time?  Well you can't test every point (for another approach to this, and one that avoids a Z sampling problem, check out the Image Based Visual Hulls approach via the links page), so you can only sweep the planes.  In practice we find this works well, by determining a reasonable spacing across the volume in question.  For example, we have a 4' x 3' x 3' volume and sweeping 50 planes is more than adequate.

Okay even with planes, how can we possibly calculate all the different points that could be shown onto a screen in real-time?  For a 320 x 200 (screen resolution) x 50 slices x n cameras = 3.2 million* projections and tests per frame.

Well, we note that this projection approach is already accelerated through projected texture mapping.  If we convert each cameras's image into a texture and apply the appropriate texture matrix that corresponds to the camera's extrensic parameters, we can project the object pixels onto any 3D primative that is rendered.

Back to see the forest: remember we are asking whether any points in the view volume are within the visual hull, our approximation to the original object.

Each camera's texture onto a plane.  The intersection of all the textures is the parts of the plane that is within the visual hull. We then sweep a series of planes to generate a first visible surface from the viewpoint.

For example, say we start with the following synthetic data as the input to our algorithm:

Our first step is to determine which pixels in each image correspond to object pixels and which are background pixels (non - object pixels).  We use image subtraction (discussed in the paper, really a brute force technique) that subtracts the current image from a reference image for each camera.  For each pixel, we subtract and compare against a threshold.  If the difference is greater than the threshold, then label the pixel as an object pixel and assign an alpha value of 255.  Background pixels have a alpha value of 0.  Here is alpha map of input image #4 (white = alpha of 255, black = alpha of 0).

This is an overhead diagram of projecting all the cameras (green (N), blue (E), light green (S), light blue (W)) about the user.  The rendering viewpoint is located in the lower left of the image.

On the left, projecting all the camera images as textures onto a plane (orange in the above diagram).  The right image is the intersection of the textures.  Note in this view, the plane is not perpendicular to the view vector for "illustrative purposes".

Thus we have an answer for one plane at the rendering cost of 2 triangles (one quad).  The expensive part is the fill rate since each plane (without optimizations) takes up the entire screen.

Computing the intersection:

Set things up.  We need the blending because we test the alpha.  We don't need to test for depth since we are going to render from front to back, but we want the Z value to still be affected.

glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
glDepthFunc(GL_ALWAYS);
glEnable(GL_BLEND);

First we want to only draw onto a pixel if it is an object pixel

glAlphaFunc(GL_GREATER, 0.0);
glEnable(GL_ALPHA_TEST);

Next we will use the stencil buffer to compute the intersections.
glEnable(GL_STENCIL_TEST);

For a plane, when a pixel is projected onto by a camera's texture (thus an object pixel), we increase the stencil buffer by 1.  After projecting all textures, the pixels with a stencil value of n (where n is the number of cameras) are the points on the plane within the visual hull.

for (i=0;i<state.m_iNumCameras;i++)
{
    if (state.m_pCam[i].m_isCameraActive->m_iVal)
    {
        glStencilFunc(GL_EQUAL,iDrawn,~0);
        ApplyProjTexture(i);
        DrawPlane();
        iDrawn++;
    }
}

Now to clear all pixels that don't have a stencil value = n.

glStencilFunc(GL_GREATER,iDrawn,~0);
glStencilOp(GL_KEEP, GL_KEEP, GL_ZERO);
DrawPlane();

Repeat the above for:

for (fPlane = m_fsNearestPoint->m_fVal; fPlane<m_fsFarthestPoint->m_fVal; fPlane=fPlane + fStep)

Here is the resulting depth buffer (on the left) and another view on the right.  To color, you can weight the alphas depending on camera location.  Or you can generate a depth mesh (we plan on using Dave McCallister's code to go from a depth buffer to a poly mesh quickly.

Collision Detection/Response:

Using the graphics hardware to determine if a primitive (or part of a primitive) is within a visual hull, we can do collision detection (and response) between real and virtual objects with the same projected textures technique.

To determine a collision between a real and virtual object, render each triangle with the projected textures/stencil buffer.  After all the triangles have been rendered, check the frame buffer for any pixel that is "on" from having passed the stencil value!=n clear.  If there are, there was a collision.