Group Tele-Immersion

Henry Fuchs
Herman Towles
Graduate RAs: Andrew Nashel
Collaborators: Ruigang Yang, University of Kentucky

Funding: Department of Energy ASC VIEWS program
Program manager: Christine Yang, Sandia National Laboratories


Introduction

The goal of group tele-immersion (GTI) is to enable high-fidelity, immersive teleconferencing between groups at distant locations. Most group teleconferencing systems in use today are simply one-to-one systems shared by a group of people at each site. Such single-camera/single-display systems usually suffer from low resolution, small fields of view, and smaller-than-life-size displays. Attempts to overcome these limitations usually involve replicating the one-to-one system for each set of participants, but this creates hard boundaries between video images that can be disconcerting to viewers. The group tele-immersion project instead transmits video data that is reconstructed as a single, continuous, high-resolution image and shown on a life-size projector display to multiple participants at each of two remote locations.


Our conceptual vision for group tele-immersion


Initial Design and Implementation

Based on our experience with previous tele-immersion and multi-projector display wall systems, we developed the following design for a group-to-group teleconferencing system:

  • High resolution, life-size interactive display with a multi-projector PC-driven display wall
  • Scene capture with tens of digital cameras
  • Eye contact via 2D view interpolation or 3D scene reconstruction
  • User tracking for rendering and geometry extraction
  • Spatialized immersive audio with a one-to-one array of microphones and speakers

Initial design concept: 3 projector display with 8 capture cameras mounted above the display

The initial system implementation included:

  • 3 projector abutted display
  • 8 capture cameras mounted above the display
  • Firewire interface cameras for digital video
  • Commodity PCs and networking
  • Custom camera server and rendering software


Line Light Field Rendering

We present a system and techniques for synthesizing views for many-to-many video teleconferencing. Instead of replicating one-to-one systems for each pair of users, or performing complex 3D scene acquisition, our rendering techniques rely on users' tolerance of soft discontinuities. Furthermore, we observed that the participants' eyes usually remain at a constant (sitting) height during video teleconferencing, so we only need to synthesize new views along a horizontal plane. To accomplish this, we have developed a real-time system that uses a linear array of cameras to perform Light Field style rendering. Vertical image strips are chosen from each camera and blended to create a single continuous image for display. The simplicity and robustness of Light Field rendering, combined with the naturally limited view volume of video teleconferencing, allow us to synthesize photo-realistic views for a group of participants at interactive rates.
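
To make the strip selection concrete, the sketch below shows one way the blending step could be written, assuming rectified cameras, a single convergence plane, and an orthogonal virtual view; the names and the use of Python/NumPy are illustrative and do not reflect our actual implementation.

    import numpy as np

    # Minimal sketch of Line Light Field strip blending (illustrative only).
    # Assumes rectified cameras sorted left to right, a shared image height,
    # and a single convergence plane at depth z_f in front of the array.
    def render_llf(images, cam_x, cam_f, z_f, out_w):
        # images: list of HxWx3 arrays from the linear camera array
        # cam_x : horizontal position of each camera (metres, ascending)
        # cam_f : camera focal length in pixels
        # z_f   : depth of the convergence plane (metres)
        # out_w : width in pixels of the synthesized image
        h, w, _ = images[0].shape
        out = np.zeros((h, out_w, 3), np.float32)
        span = cam_x[-1] - cam_x[0]
        for u in range(out_w):
            # Point on the convergence plane seen by this output column.
            x_f = cam_x[0] + span * u / (out_w - 1)
            # The two physical cameras that bracket that point.
            i = int(np.clip(np.searchsorted(cam_x, x_f) - 1, 0, len(cam_x) - 2))
            gap = cam_x[i + 1] - cam_x[i]
            for j, wgt in ((i, cam_x[i + 1] - x_f), (i + 1, x_f - cam_x[i])):
                # Source column in camera j that images the point (x_f, z_f).
                src = int(round(w / 2 + cam_f * (x_f - cam_x[j]) / z_f))
                if 0 <= src < w:
                    # Blend the vertical strip, weighted by proximity.
                    out[:, u] += (wgt / gap) * images[j][:, src]
        return out.astype(np.uint8)

Each output column thus draws a vertical strip from at most two adjacent cameras, and the linear weights produce soft rather than hard seams, as visualized by the color band in the figure below.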


Orthogonal view Line Light Field rendering with the lower color band showing the blended contribution from the eleven cameras, each represented by a unique color

Acquisition and Rendering Implementation

The cameras of the linear array are connected in pairs to commodity PCs, which are interconnected with 100 Mbps Ethernet. The cameras are synchronized via a trigger wire controlled from a PC and are fully calibrated. The PCs act as video servers and JPEG-encode the video data at full VGA resolution, or at a smaller region of interest specified by the rendering algorithm. The video data is sent to a rendering PC that drives a two-projector display using an NVIDIA GeForce4 video card. The rendering algorithm blends these camera images using a Light Field-based technique to create a continuous, high-resolution, wide field-of-view image from a perspective set further behind the screen, a good compromise for a group of viewers.
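
A simplified sketch of one camera server's send loop is shown below; the endpoint, the region-of-interest convention, and the use of OpenCV for JPEG encoding are assumptions made for illustration, not our actual server interface.

    import socket
    import struct
    import cv2  # used here only for JPEG encoding

    RENDERER = ("render-pc.example.edu", 5000)  # hypothetical rendering PC

    def serve_camera(cap, roi=None):
        # cap: an opened cv2.VideoCapture for one Firewire camera
        # roi: optional (x, y, w, h) region of interest requested by the renderer
        sock = socket.create_connection(RENDERER)
        while True:
            ok, frame = cap.read()                   # full VGA frame
            if not ok:
                break
            if roi is not None:
                x, y, w, h = roi
                frame = frame[y:y + h, x:x + w]      # send only the requested region
            ok, jpg = cv2.imencode(".jpg", frame,
                                   [int(cv2.IMWRITE_JPEG_QUALITY), 80])
            if not ok:
                continue
            # Length-prefixed JPEG buffer, streamed to the rendering PC.
            sock.sendall(struct.pack("!I", len(jpg)) + jpg.tobytes())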


Line Light Field design: Abutted projector display with a dense horizontal camera array at table level and actual camera array implementation


Remote participants and local participants with display at UNC


Internet Conferencing Architecture

We have implemented a 3D teleconferencing prototype between the University of Kentucky (UKy) and UNC-CH. Each site has a total of eight Sony digital Firewire cameras arranged in a linear array. The cameras are regularly spaced 65 millimeters apart, very close to the minimum spacing allowed by the form factor of the camera body. All cameras are synchronized via a trigger wire controlled from a PC and are fully calibrated. The video acquisition system (one at each site) includes four PCs interconnected through 100 Mbit Ethernet. Each PC is connected to two Sony cameras and captures and JPEG-encodes the raw image data at full VGA resolution. The resulting JPEG streams are then sent over the network.

We use light field rendering to synthesize new images from a user-driven virtual camera. There are two options for running the view synthesis program. One is to send all the video streams over the Internet and synthesize novel views at the remote site (remote rendering). Alternatively, we can synthesize views locally and only send the final result to the remote site (local rendering). The first approach has lower latency when the viewpoint changes, while the second is easier to manage from a network standpoint. In terms of scalability, the bandwidth requirement for remote rendering is O(kn), where k is the number of cameras and n is the number of sites (assuming multicast is used). The bandwidth requirement for local rendering, on the other hand, grows quadratically at O(n²). Given that our current prototype includes only two sites and the viewpoint does not change rapidly (i.e., there is no head tracking), we chose to implement the local rendering approach.
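
The scaling difference can be made concrete with a back-of-the-envelope calculation; the per-stream rate used below is an arbitrary illustrative figure, not a measured value.

    # Rough bandwidth comparison of the two strategies, with k cameras per
    # site, n sites, and an assumed per-stream rate r in Mbit/s.
    def remote_rendering_bw(k, n, r):
        # Each site multicasts all k of its camera streams: O(k * n) streams.
        return k * n * r

    def local_rendering_bw(n, r):
        # Each site unicasts one rendered stream to every other site:
        # n * (n - 1) streams, i.e. O(n^2).
        return n * (n - 1) * r

    print(remote_rendering_bw(k=8, n=2, r=5))  # 80 Mbit/s
    print(local_rendering_bw(n=2, r=5))        # 10 Mbit/s

For the two-site prototype, local rendering sends only a single synthesized stream in each direction, far cheaper than forwarding all eight camera streams per site.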

Locally, we can achieve an update rate of 4-7 frames per second (fps) for VGA input images. The bottleneck is image capture: we can only capture synchronized VGA-resolution images at 7-8 fps with two cameras on the same 1394 bus, due to the 1394 bus bandwidth limitation and the transfer characteristics of the digital cameras under external triggering. The synthesized view, typically 1024x512, is read back from the rendering program's framebuffer and sent to the remote site over TCP/IP with JPEG encoding. The frame rate between UKy and UNC-CH varies from 5 fps to 10 fps, depending on network traffic. Optimizing the network code or using a more sophisticated compression scheme is expected to substantially increase the frame rate.
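
The readback-and-send step could look roughly like the sketch below, written here with PyOpenGL and OpenCV purely for illustration; the peer address and framing are hypothetical and the actual renderer differs.

    import socket
    import struct
    import numpy as np
    import cv2
    from OpenGL.GL import glReadPixels, GL_RGB, GL_UNSIGNED_BYTE

    REMOTE_SITE = ("display-pc.remote-site.example.edu", 6000)  # hypothetical peer

    def send_synthesized_view(sock, w=1024, h=512):
        # Read the synthesized view back from the OpenGL framebuffer.
        buf = glReadPixels(0, 0, w, h, GL_RGB, GL_UNSIGNED_BYTE)
        img = np.frombuffer(buf, np.uint8).reshape(h, w, 3)
        img = np.ascontiguousarray(img[::-1])    # flip: GL's origin is bottom-left
        ok, jpg = cv2.imencode(".jpg", cv2.cvtColor(img, cv2.COLOR_RGB2BGR))
        if ok:
            # Length-prefixed JPEG frame over TCP to the remote site.
            sock.sendall(struct.pack("!I", len(jpg)) + jpg.tobytes())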


Live LLF-based conferencing between UNC-CH and UKy


Updated: 01 May 2005