Encumbrance-free Telepresence System with Real-time 3D Capture and Display using Commodity Depth Cameras
Figure 1: Three views of a live capture session.
Abstract
We introduce a proof-of-concept telepresence system that
offers fully dynamic, real-time 3D scene capture and continuous-viewpoint, head-tracked stereo 3D display without requiring the
user to wear any tracking or viewing apparatus. We present
a complete software and hardware framework for implementing
the system, which is based on an array of commodity Microsoft
Kinect™ color-plus-depth cameras. Novel contributions include an
algorithm for merging data between multiple depth cameras and
techniques for automatic color calibration and preserving stereo
quality even with low rendering rates. Also presented is a solution to the problem of interference that occurs between Kinect cameras with overlapping views. Emphasis is placed on a fully
GPU-accelerated data processing and rendering pipeline that can apply
hole filling, smoothing, data merger, surface generation, and color
correction at rates of up to 100 million triangles/sec on a single PC
and graphics board. Also presented is a Kinect-based markerless
tracking system that combines 2D eye recognition with depth information to allow head-tracked stereo views to be rendered for
a parallax barrier autostereoscopic display. Our system is affordable and reproducible, offering the opportunity to easily deliver 3D
telepresence beyond the researcher's lab.
Overview
Figure 2: Applications. Left: Multi-user telepresence session. Right: Mixed reality collaboration -- a 3D
virtual object (circuit board) is incorporated into a live 3D capture session and appropriately occludes real objects.
Figure 3: Physical layout. Top: Demonstrated proof-of-concept system configuration. Bottom: Proposed ideal configuration.
Figure 4: Camera coverage. Left: System camera coverage (5 cameras). Right: Color-coded camera contributions.
Figure 5: Data processing and rendering pipeline. All real-time graphics functionality is implemented on the GPU.
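To make the stage ordering concrete, below is a minimal CPU-side sketch (not the authors' code) of one frame passing through the pipeline of Figure 5. In the actual system these stages run as GPU shader passes; the exact ordering of the cross-camera stages and all type and function names here (DepthFrame, fillHoles, etc.) are hypothetical placeholders.

#include <vector>

struct DepthFrame { std::vector<float> depth; std::vector<unsigned char> rgb; int w = 0, h = 0; };
struct Mesh {};

DepthFrame fillHoles(DepthFrame f)   { /* fill small missing-depth regions */ return f; }
DepthFrame smoothDepth(DepthFrame f) { /* edge-preserving depth smoothing  */ return f; }
Mesh generateSurface(const DepthFrame&) { /* depth map -> triangle mesh (Figure 7) */ return {}; }
Mesh mergeAndColorCorrect(const std::vector<Mesh>& m) { /* per-pixel merger (Figure 8) */ (void)m; return {}; }
void render(const Mesh&) { /* head-tracked stereo rendering */ }

int main() {
    std::vector<DepthFrame> cameras(5);            // five Kinect units
    std::vector<Mesh> meshes;
    for (auto& cam : cameras) {                    // per-camera stages
        cam = fillHoles(cam);
        cam = smoothDepth(cam);
        meshes.push_back(generateSurface(cam));    // surface generation
    }
    Mesh scene = mergeAndColorCorrect(meshes);     // cross-camera merger + color correction
    render(scene);
}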
Figure 6: Kinect interference problem. First column: IR images showing the combined projected dot pattern from camera 1 (red dots) and
camera 2 (blue dots). Second column: depth images with no interference. Third column: depth images with interference from the other
camera. Fourth column: difference between the second and third columns.
Figure 7: Fast meshing for the Kinect. Left to right: 1) Triangle mesh template stored in GPU memory. 2) Vertex shader extrudes the template using the camera intrinsics and depth texture. 3) Geometry shader rejects triangles corresponding to missing or discontinuous surfaces (shown in red). 4) Resultant textured mesh.
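The two shader steps in the caption above can be approximated with a small CPU-side sketch: back-projecting a depth pixel through pinhole intrinsics and rejecting triangles that span missing or discontinuous depth. The intrinsics values and the discontinuity threshold below are illustrative assumptions, not values from the paper.

#include <cmath>
#include <cstdio>

struct Vec3 { float x, y, z; };

// Back-project pixel (u,v) with depth z (meters) through pinhole intrinsics.
Vec3 unproject(float u, float v, float z,
               float fx, float fy, float cx, float cy) {
    return { (u - cx) * z / fx, (v - cy) * z / fy, z };
}

// Reject a triangle if any vertex has no depth or if its depth range spans a
// discontinuity (likely a foreground/background boundary).
bool keepTriangle(float z0, float z1, float z2, float maxJump = 0.05f) {
    if (z0 <= 0.f || z1 <= 0.f || z2 <= 0.f) return false;   // missing data
    float zmin = std::fmin(z0, std::fmin(z1, z2));
    float zmax = std::fmax(z0, std::fmax(z1, z2));
    return (zmax - zmin) < maxJump;
}

int main() {
    // Roughly Kinect-like intrinsics (illustrative only).
    const float fx = 585.f, fy = 585.f, cx = 320.f, cy = 240.f;
    Vec3 p = unproject(100.f, 200.f, 1.5f, fx, fy, cx, cy);
    std::printf("point: %.3f %.3f %.3f, keep=%d\n",
                p.x, p.y, p.z, keepTriangle(1.50f, 1.51f, 1.49f));
}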
Figure 8: Data merger algorithm. Depth, color, and quality estimate values are determined at each pixel for each camera. The front surface is determined using the depth information, and the associated color values are weighted by the quality estimates.
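A minimal sketch of the per-pixel merger logic described above follows, assuming a simple depth tolerance for deciding which cameras see the front surface; the tolerance value and the struct and function names are illustrative, not the system's actual parameters.

#include <vector>
#include <cstdio>

struct Sample { float depth; float quality; float r, g, b; bool valid; };

// Merge one output pixel from the per-camera samples at that pixel:
// pick the front-most surface by depth, then blend the colors of all cameras
// that see (approximately) that surface, weighted by their quality estimates.
Sample mergePixel(const std::vector<Sample>& samples, float tol = 0.03f) {
    float front = 1e9f;
    for (const auto& s : samples)                 // front-most valid depth
        if (s.valid && s.depth < front) front = s.depth;

    Sample out{front, 0.f, 0.f, 0.f, 0.f, front < 1e9f};
    float wsum = 0.f;
    for (const auto& s : samples)                 // quality-weighted color
        if (s.valid && s.depth - front < tol) {
            out.r += s.quality * s.r;
            out.g += s.quality * s.g;
            out.b += s.quality * s.b;
            wsum  += s.quality;
        }
    if (wsum > 0.f) { out.r /= wsum; out.g /= wsum; out.b /= wsum; }
    return out;
}

int main() {
    std::vector<Sample> px = { {1.20f, 0.9f, 0.8f, 0.2f, 0.2f, true},
                               {1.21f, 0.3f, 0.6f, 0.3f, 0.3f, true},
                               {1.60f, 1.0f, 0.1f, 0.1f, 0.1f, true} }; // occluded view
    Sample m = mergePixel(px);
    std::printf("depth=%.2f rgb=(%.2f,%.2f,%.2f)\n", m.depth, m.r, m.g, m.b);
}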
Figure 9: Data processing results. 1: No enhancements applied. 2: All enhancements except hole filling. 3: All enhancements except color matching. 4: All enhancements except quality-weighted data merger. 5: All enhancements applied. ("All enhancements" includes hole filling, data merger, and color matching.)
Figure 10: Head-tracked stereo in motion. Left: tracking prediction, with intra-frame rendering disabled. Center: prediction, with intra-frame rendering enabled. Right: head cutout used to test eye tracking, with a camera behind the eye.
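As a rough illustration of the tracking prediction mentioned in the Figure 10 caption, the sketch below extrapolates the tracked eye position forward by an assumed display latency using a constant-velocity model; the system's actual predictor and latency values may differ.

#include <cstdio>

struct Vec3 { float x, y, z; };

// Predict the eye position at (tNow + latency) from the last two tracker samples.
Vec3 predictEye(Vec3 prev, float tPrev, Vec3 curr, float tCurr,
                float tNow, float latency) {
    float dt = tCurr - tPrev;
    if (dt <= 0.f) return curr;                          // no usable velocity estimate
    Vec3 v = { (curr.x - prev.x) / dt,
               (curr.y - prev.y) / dt,
               (curr.z - prev.z) / dt };
    float ahead = (tNow + latency) - tCurr;              // how far to extrapolate
    return { curr.x + v.x * ahead, curr.y + v.y * ahead, curr.z + v.z * ahead };
}

int main() {
    Vec3 p = predictEye({0.00f, 1.60f, 0.50f}, 0.000f,
                        {0.02f, 1.60f, 0.50f}, 0.033f,   // ~30 Hz tracker samples
                        0.040f, 0.050f);                 // assumed 50 ms latency budget
    std::printf("predicted eye: %.3f %.3f %.3f\n", p.x, p.y, p.z);
}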
Contact
Andrew Maimone, Graduate Student
maimone (at) cs.unc.edu
Paper
Coming Soon.
Frequently Asked Questions
Q: Is this really 3D, or does it just provide a view from the user's head position?
A: It does both. The system displays the scene in stereoscopic 3D (each eye sees a different image), providing a
sense of depth similar to current 3D TV and 3D cinema but without requiring glasses. The system also allows the
user to "look around" the scene.
Q: How is this different from earlier Kinect 3D video capture?
A: Earlier Kinect video capture used one or two Kinect units, which do not provide enough coverage to let a
user look around a small room without large missing areas. Additionally, when two Kinect units were used, the data
was not smoothly merged, which caused quality problems. Using more Kinects is a challenge because the units
interfere with each other, causing holes to appear in the output. Our current system uses five Kinect units for
more comprehensive scene coverage, together with new algorithms that overcome the interference problem and merge
the data with improved quality.
Q: Why not just use the 3D cameras used for 3D cinema?
A: The cameras used for 3D cinema are essentially two 2D cameras placed side-by-side. They allow a scene to be
captured from a single general viewpoint at two slightly different positions, one for each eye. However, by
themselves these cameras do not provide the depth information necessary to allow the user to "look around" the
scene.
Q: Why does a teleconferencing system need to be in 3D and allow the user to look around the remote scene? Is this
just a gimmick?
A: The value of these features is two-fold. First, they increase the sense of "presence" or "being there", the
feeling that one is actually co-located with the remote user and his or her environment. This sense helps the user
forget he or she is looking at a display screen and communicate naturally as if talking to someone on the other
side of a table. Second, the ability to "look around" the scene helps preserve information that is lost during
normal 2D video conferencing. For example, imagine that you are seated in a meeting room and someone's head is
blocking your view of a whiteboard. In our system, as in real life, you would naturally move your head around for an
unobstructed view. In traditional video conferencing, you must interrupt the meeting and ask that the remote user
move his or her head. As another example, imagine an engineer is holding up a new prototype part to show to a
remote participant. With our system, the remote participant could simply move his or her head around to inspect the
part. With traditional 2D video conferencing, the remote participant must communicate back and forth with the
engineer regarding the different angles the part should be held until it is fully inspected.