Ti_viewer: 3PC Renderer

Sudipta N. Sinha, Feb 16, 2002


Architecture Overview:

Introduction

 
         The Renderer is the backend component in the 3D Tele-immersion pipeline. It receives dense depth maps over multiple network channels, renders them as a single large point cloud, and re-creates a life-size, view-dependent stereo display of the acquired scene. Multiple depth maps are time-synchronized to create a composite display frame. The stereo display system uses two projectors in a passive stereo setup, while the user's current position is estimated with a HiBall wide-area tracker.
          The old version of Ti_viewer ran on a single PC that rendered both the left and right eye images on a dual-output graphics card in twin-view mode. The renderer had two threads: the first read depth maps off the network and re-created a point cloud display list, while the second received updates from the wide-area tracker and performed view-dependent rendering. With 3D reconstruction at 1-2 frames/sec with background segmentation, and depth maps of 320 x 240 pixels, this version of ti_viewer successfully rendered approximately 50,000 points for 3 streams of depth maps. However, the architecture simply did not scale to dense depth maps from full-scene reconstruction at 640 x 480 resolution and possibly higher reconstruction frame rates. This was the motivation for creating a 3 machine architecture for a distributed renderer that could handle up to 2 million points and still render them at an interactive rate.


          The 3 PC Renderer has three modules / processes: the recon_server (a network aggregation node), the left eye renderer and the right eye renderer. Each is described in the sections that follow.


Typical Data Sizes

 
        A typical tele-immersion demo is described along with the bandwidth and rendering data sizes involved.

        Let us assume 5 streams of 640 x 480 resolution depth maps, each with a reconstruction rate of 1 frame/sec. Each depth map consists of a stream definition header followed by the depth map itself. Every pixel in the depth map is represented by 5 bytes (3 bytes for RGB color and 2 bytes for depth). Holes in the depth map (points for which a correspondence was not found with high confidence) are not rendered and are represented by a special value defined as part of the stream definition. Background pixels are likewise marked by a special value and are not rendered. Assuming 75% reconstruction efficiency on average, the size of 1 depth map is:

                       0.75 * 640 * 480 * 5 bytes = 1.1 MB per depth map.

       When a depth map pixel is converted into a 3D world point, it requires 15 bytes of storage: 12 bytes for the 3 coordinates and 3 bytes for RGB color.

       For the above example, 1 depth map produces 0.75 * 640 * 480 = 230,400 points. At 15 bytes per point, storing one such point cloud takes about 3.3 MB.

       With 4 such streams, the renderer already has to render almost 1 million points at an interactive rate.
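       The arithmetic above can be reproduced with a small back-of-the-envelope program (a minimal C++ sketch; the 75% efficiency, the resolution and the stream count are simply the assumptions of this example):

           // data_sizes.cpp : reproduces the depth map / point cloud arithmetic above.
           #include <cstdio>

           int main() {
               const double efficiency      = 0.75;  // assumed reconstruction efficiency
               const int    width           = 640, height = 480;
               const int    depthPixelBytes = 5;     // 3 bytes RGB + 2 bytes depth
               const int    pointBytes      = 15;    // 12 bytes XYZ + 3 bytes RGB
               const int    streams         = 4;

               double points     = efficiency * width * height;   // valid pixels per map
               double mapBytes   = points * depthPixelBytes;      // one depth map
               double cloudBytes = points * pointBytes;           // one point cloud

               printf("depth map   : %.2f MB\n", mapBytes / (1024.0 * 1024.0));
               printf("point cloud : %.2f MB (%.0f points)\n",
                      cloudBytes / (1024.0 * 1024.0), points);
               printf("%d streams  : %.0f points per update\n", streams, streams * points);
               return 0;
           }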


Detailed Architecture

                The various components in the renderer architecture are depicted in the figure below. Each module is described in detail in the sub-sections that follow.
 
 

Fig 1: Overview of the 3 PC Renderer Architecture.

Modes of Operation:

There are 2 main modes of operation. An alternative approach would have been to design two different applications, which would have had considerable overlap in functional modules such as stream parsing, networking and timestamp-based synchronization. Instead, both rendering functionalities were built into the same renderer, and the user can select the mode to run in during a tele-immersion session. The two modes are referred to as follows from now on.


VAR Mode

          In this mode the recon_server does the CPU-intensive work: it parses the incoming streams, converts the synchronized depth maps into 3D points, and sends the composite point cloud to the left and right eye renderers, which render it directly as vertex arrays.
Vertex Shaders / Cg Mode

          The main motivation behind this alternative design of the renderer was the quest for higher rendering speeds. Since the conversion of depth maps into points is a standard matrix multiplication, a vertex shader program could probably do the computation faster. The idea was thus to move this CPU-intensive operation from the recon_server to the rendering processes. An extra benefit could be lower bandwidth over UDP, since a depth map pixel (5 bytes) needs less storage space than the corresponding 3D point (15 bytes). We still had to implement this design to be sure of the trade-offs and gains of the approach.
          The recon_server becomes a simpler module which merely synchronizes the streams based on timestamps and chooses the particular depth maps to display in the next update. Each of these depth maps is forwarded or relayed to the left and right eye renderer over a different UDP connection, received on a UDP port that is automatically selected during initialization.
          In this mode the complete depth map needs to be transported from the recon_server to the left and right eye renderer running in Vertex Shader Cg Mode. For a low reconstruction efficiency, all the depth maps put together are still larger than the composite 3D point cloud payload of the VAR mode. However, since every depth map pixel costs 5 bytes while only the valid pixels travel as 15-byte points, a simple analysis shows that for a reconstruction efficiency above 33% the UDP bandwidth for the 3D point cloud exceeds that of the corresponding depth maps.
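          That analysis amounts to comparing 5 bytes per transmitted depth map pixel against 15 bytes per reconstructed point. A minimal C++ sketch of the comparison follows (it assumes, as above, that every pixel of a depth map is transmitted, holes included, while only the valid fraction e travels as 3D points):

              // crossover.cpp : depth map vs. point cloud payload per frame.
              #include <cstdio>

              int main() {
                  const int width = 640, height = 480;
                  const double mapMB = 5.0 * width * height / (1024.0 * 1024.0);
                  for (int i = 1; i <= 9; i += 2) {
                      double e = i / 10.0;                    // reconstruction efficiency
                      double cloudMB = 15.0 * width * height * e / (1024.0 * 1024.0);
                      printf("e = %.1f : depth map %.2f MB, point cloud %.2f MB\n",
                             e, mapMB, cloudMB);
                  }
                  return 0;  // the point cloud overtakes the depth map at e = 5/15 = 1/3
              }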
  • Renderer in Cg Mode

             In this mode the renderer itself has to parse the depth maps arriving on the UDP stream, in the same way that the recon_server parses the streams in the VAR Mode. The depth map buffer is then handed by the rendering thread directly to the vertex shader program, which converts depth map pixels into 3D points using floating point instructions in a Cg program. Depth map pixels that are holes or background pixels are projected to the plane at infinity and are thus never visible.
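             The exact unprojection is defined by the calibration data in the stream; the C++ sketch below only illustrates the kind of per-pixel computation the Cg vertex program performs. The 1/z recovery from the header's offset and stepsize fields, the HOLE sentinel and the pinhole intrinsics fx, fy, cx and cy are assumptions made for this illustration, not the actual stream definition:

                 // unproject.cpp : illustrative CPU version of the per-vertex work
                 // done by the Cg vertex program on each depth map pixel.
                 #include <cmath>

                 struct Point3 { float x, y, z; };

                 const unsigned short HOLE = 0xFFFF;  // hypothetical "no correspondence" value

                 // Assumed encoding: the 2-byte depth pixel quantizes 1/z using the
                 // offset and stepsize fields of the stream header.
                 float recoverZ(unsigned short d, float offset, float stepsize) {
                     return 1.0f / (offset + stepsize * d);
                 }

                 // Back-project pixel (u, v) with depth value d through assumed
                 // pinhole intrinsics (fx, fy, cx, cy).
                 Point3 unproject(int u, int v, unsigned short d,
                                  float offset, float stepsize,
                                  float fx, float fy, float cx, float cy) {
                     if (d == HOLE)                        // holes / background pixels are
                         return { 0.0f, 0.0f, INFINITY };  // pushed to the plane at infinity
                     float z = recoverZ(d, offset, stepsize);
                     return { (u - cx) * z / fx, (v - cy) * z / fy, z };
                 }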

    A top-level view of the architecture of the renderer in the Cg Mode is shown below:


    Fig 3: 3 PC Renderer in Vertex Shaders / Cg Mode.

    Recon Server (network aggregation node)

    Each incoming depth map stream has the following wire format:


    Field          Size / Type                                     Description
    frame          4-byte unsigned int                             Frame number / timestamp
    height         4-byte int                                      Height of the frame, typically 480 or 240
    width          4-byte int                                      Width of the frame, typically 640 or 320
    offset         4-byte float                                    Offset of the camera from the image plane
    stepsize       4-byte float                                    Depth quantization stepsize used in reconstruction
    red length     4-byte unsigned int                             Length of the buffer of red values for all pixels
    red plane      byte stream of the above length                 Red buffer
    green length   4-byte unsigned int                             Length of the buffer of green values for all pixels
    green plane    byte stream of the above length                 Green buffer
    blue length    4-byte unsigned int                             Length of the buffer of blue values for all pixels
    blue plane     byte stream of the above length                 Blue buffer
    depth rows     4-byte int                                      Depth frame height, typically same as the texture frame
    depth columns  4-byte int                                      Depth frame width, typically same as the texture frame
    depth length   4-byte unsigned int                             Length of the depth pixel buffer
    depth image    2-byte unsigned short stream of above length    Depth pixel buffer, each pixel a 1/z value
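    In C++, the fixed-size part of this header maps naturally onto a struct such as the sketch below. The field names are illustrative, not taken from the actual source, and the variable-length color planes and depth image that follow each length field must still be read separately:

        // Fixed-size fields of the depth map stream header, in wire order.
        #include <cstdint>

        struct DepthMapHeader {
            uint32_t frame;      // frame number / timestamp
            int32_t  height;     // typically 480 or 240
            int32_t  width;      // typically 640 or 320
            float    offset;     // camera offset from the image plane
            float    stepsize;   // depth quantization stepsize
            // The remainder of the stream is variable length:
            //   uint32_t redLength;    then redLength bytes   (red plane)
            //   uint32_t greenLength;  then greenLength bytes (green plane)
            //   uint32_t blueLength;   then blueLength bytes  (blue plane)
            //   int32_t  depthRows;    int32_t depthCols;
            //   uint32_t depthLength;  then the depth buffer of 2-byte 1/z values
        };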

    Left and Right Eye Renderer


    Network Interface in the 3PC Renderer

                  The basic motivation behind using UDP is that it is fast. Moreover, on a dedicated direct link it will probably not drop packets unless something unusual happens. UDP also allows multicasting, which is practical in our case since both the left and right eye renderers need the same data payload from the recon_server.
                  A basic UDP sender and receiver function completely asynchronously, but the various modules of the renderer need to synchronize with each other to work properly. This synchronization is done in our own protocol, described next.

    The idea behind the UDP-based protocol in the 3 PC Renderer is to have a signalling connection and a separate data connection for every abstract stream of data. The signal connection synchronizes the sender and the receiver at the beginning of a transfer. Once both are synchronized, data is transferred over the data connection at the fastest rate possible. The sender informs the receiver how much data is going to be sent in the very first packet. It then sends as much data as the OS send buffer can hold; sending more risks buffer overflow and hence packet loss. The time interval between two consecutive sends is controlled by the user through a command line prompt.

    The receiver knows when it has received all packets and signals back to the sender, saying "I am done". However, both receiver and sender are only allowed a small time interval in which to finish the transfer; this restriction is imposed by the application. Right now the timeout value is a constant, but it could easily be made adaptive, based on the observed update frame rates and packet loss percentages.
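    A minimal C++ sketch of the sender's side of this protocol is given below. The chunk size, ports, addresses and payload size are example values, and the choice to announce the total size over the signalling connection is an assumption of this sketch, not the actual implementation:

        // udp_sender.cpp : illustrative sender side of the signal/data protocol.
        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <sys/socket.h>
        #include <unistd.h>
        #include <cstdint>

        int main() {
            const size_t CHUNK = 8192;          // payload bytes per datagram (assumed)
            const unsigned INTERVAL_US = 500;   // gap between sends (user tunable)

            int sig  = socket(AF_INET, SOCK_DGRAM, 0);   // signalling connection
            int data = socket(AF_INET, SOCK_DGRAM, 0);   // data connection

            sockaddr_in dst{};                  // receiver (example address / ports)
            dst.sin_family = AF_INET;
            inet_pton(AF_INET, "192.168.0.2", &dst.sin_addr);

            static char payload[4 * 1024 * 1024];        // e.g. one composite update
            uint32_t total = sizeof(payload);

            // 1. Synchronize: announce the number of bytes to follow on the
            //    signalling connection, so the receiver knows when it is done.
            dst.sin_port = htons(9001);
            sendto(sig, &total, sizeof(total), 0, (sockaddr*)&dst, sizeof(dst));

            // 2. Blast the payload over the data connection in CHUNK-sized
            //    datagrams, pausing between sends so the OS send buffer does
            //    not overflow and silently drop packets.
            dst.sin_port = htons(9000);
            for (size_t off = 0; off < total; off += CHUNK) {
                size_t n = (total - off < CHUNK) ? total - off : CHUNK;
                sendto(data, payload + off, n, 0, (sockaddr*)&dst, sizeof(dst));
                usleep(INTERVAL_US);
            }

            // 3. Wait for the receiver's "I am done" message; the application
            //    imposes a timeout here (omitted for brevity).
            char ack[32];
            recvfrom(sig, ack, sizeof(ack), 0, nullptr, nullptr);

            close(sig);
            close(data);
            return 0;
        }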
     


    Rendering Performance:

    The columns in the tables below are: the dataset name, the depth map resolution, the bytes delivered by UDP per update (MB), the number of points rendered, the UDP transfer time (seconds), the refresh rate (rendered frames per second) and the update rate (point cloud updates per second).

     
    Dataset                      Resolution   UDP bytes/update   Points rendered   UDP transfer time (s)   Refresh rate (fps)   Update rate (/s)
    Foreground only, 3 streams   320 x 240    0.8 MB                55,000         0.06                    380                  8.9
    Foreground only, 5 streams   320 x 240    1.3 MB               100,000         0.09                    200                  6.3
    Full scene, 1 stream         640 x 480    3.2 MB               232,000         0.11                    110                  5.3
    Full scene, 2 streams        640 x 480    6.3 MB               455,000         0.16                     66                  3.7
    Full scene, 3 streams        640 x 480    9.9 MB               720,000         0.23                     42                  2.7
    Full scene, 4 streams        640 x 480    13.3 MB              900,000         0.29                     27                  2.3
    Full scene, 5 streams        640 x 480    17.5 MB            1,160,000         0.32                     24                  1.8
    Full scene, 7 streams        640 x 480    24.5 MB            1,650,000         0.46                     18                  1.4
    Full scene, 9 streams        640 x 480    31.2 MB            2,100,000         0.56                     16                  1.0

    The next table shows the same measurements with all 9 streams enabled at the recon_server while only the listed number of streams is displayed:

    Dataset               Resolution   UDP bytes/update   Points rendered   UDP transfer time (s)   Refresh rate (fps)   Update rate (/s)
    Enabled, 1 stream     640 x 480    3.36 MB              234,000         0.11                     95                  1.7
    Enabled, 2 streams    640 x 480    7.00 MB              475,000         0.16                     47                  1.5
    Enabled, 3 streams    640 x 480    10.5 MB              710,000         0.23                     35                  1.3
    Enabled, 5 streams    640 x 480    17.3 MB            1,170,000         0.37                     20                  1.2
    Enabled, 7 streams    640 x 480    24.5 MB            1,620,000         0.46                     18                  1.1
    Enabled, 9 streams    640 x 480    31.5 MB            2,120,000         0.56                     16                  1.0

    Conclusion: We are spending far too much time parsing the streams. That is why, with 9 streams running and only 1 being displayed, we achieve only 1.7 updates/sec, whereas with only 1 stream running we can go up to 5.3 updates/sec.
     

    Questions about bandwidth utilization:

    Inter-PC bandwidth:

    During these tests we were transferring data from the recon_server to the left and right eye renderers at (31.5 / 0.56) * 8 Mbits/sec = 450 Mbits/sec.


    Incoming bandwidth:

    During these tests we were running streamplayer on a local machine directly connected to the same Gbit ethernet switch. Each of the 9 depth maps is about 1.5 MB, and we recorded the time taken to read one stream off the network and parse it to be 0.05 seconds on average, so parsing the 9 streams takes up 0.45 seconds at the recon_server. With 9 parallel streams the update rate is about 0.70 frames/sec, and the total incoming bandwidth is only 1.5 * 9 * 8 / 0.45 = 240 Mbits/sec.


    But where is the rest of the bandwidth going? Shouldn't this have added up to roughly 1000 Mbits/sec? In other words, can we do better than 450 Mbits/sec over UDP between the recon_server and the left and right eye renderers, given that we are already trying to send as fast as we can?
     

    The following table shows the TCP vs. RUDP throughput comparison:

    Dataset                      Resolution   UDP bytes/update   Points rendered   UDP transfer time (s)   Refresh rate (fps)   Update rate (/s)
    Foreground only, 3 streams   320 x 240    0.8 MB                55,000
    Foreground only, 5 streams   320 x 240    1.3 MB               100,000
    Full scene, 1 stream         640 x 480    3.2 MB               232,000
    Full scene, 2 streams        640 x 480    6.3 MB               455,000
    Full scene, 3 streams        640 x 480    9.9 MB               720,000
    Full scene, 4 streams        640 x 480    13.3 MB              900,000
    Full scene, 5 streams        640 x 480    17.5 MB            1,160,000
    Full scene, 7 streams        640 x 480
    Full scene, 9 streams        640 x 480
     
     



    Usage:

  • VAR Mode:

  • Demo Setup:

                Renderer is ready ...
                       In either case we shall start seeing the rendered scene on both hires41-tn and hires42-tn (full-screen rendering).

                To check projector alignment.