|
|
|
` Fast and Simple Occlusion Culling Based on Hardware Depth Queries Karl Hillesland, Brian Salomon, Anselmo Lastra, Dinesh Manocha Submitted to Journal of Graphics Tools This research effort was directed at utilizing recent developments in graphics hardware to support occlusion culling. We take advantage of a "depth query" provided by the hardware. A depth query is a report from the graphics adapter indicating whether a given set of polygons (triangles) passed the depth-test (z-test). NVIDIA's 2X line of graphics cards expose this capability through the NV_OCCLUSION_QUERY mechanism. The program specifies the beginning of a query, some geometry to render, and then the end of the query. At a later time the program asks for the result of the query and the API returns the number of pixels from the geometry that were written. Assuming the depth-test is enabled, if this number is greater than zero, then the geometry was not fully occluded. We utilize this query in a straight-forward manner using a uniform grid. The model bounding box is divided into cubes and associated geometry is stored with each cube. The cubes are traversed in either axis-aligned slabs or slabs that are roughly perpendicular to the viewer. In the axis-aligned case, the two dimensions (from x, y, and z) that define planes most orthogonal to the view direction are used. These dimensions are called j and k and and the third dimension is called i. In the second case, "stepping numbers" (signed integers) a and b are chosen for the two dimensions j and k are computed. Within a slab for each abs(a) moves in the positive j direction one move is made in the i direction. The sign of the move in the i direction the sign of a. The same relationship holds for direction k and stepping number b. The axis-aligned traversal can be viewed as being a special case of the non-axis-aligned traversal with a=b=infinity. Here we are looking down the view frustum (blue walls) towards the eye at a view-dependent slab: Cubes are traversed in successive slabs from the eye point. We utilize the fact that the NV_occlusion_query extension allows multiple queries at once by querying up to a whole slab at once. Once the result of queries for a slab are known, the geometry of visible cubes is rendered and we progress to the next slab. View frustum culling is added by only traversing the portions of each slab that intersect the view frustum. Triangles that fall fully within a cube are stored in per-cube lists. Triangles that span two or more cubes are kept in a global list and cubes contain indices for the "global" triangles that intersect them. Each global triangle also has a an integer that is set to the last frame for which the triangle was rendered. This is used to prevent re-rendering of large (high fill) triangles multiple times during a single frame (for multiple cubes). Surprisingly even with an additional conditional statement per global triangle testing this frame number rendering was accelerated. This is because the bottleneck of our system is fill-rate. For each slab of the traversal the entire screen is filled by bounding boxes. Our algorithm has also been implemented with nested grids. Rather than containing geometry, a cell (cube) would contain a pointer to a sub-grid within its space. This sub-grid would then be traversed in the same manner as the top-level grid. Cells with triangle counts above a threshold are chosen to contain sub-grids with a cap on the depth of sub-gridding. Results: In the above picture the 3 panes depict from left to right: the user's view, a side view showing the geometry rendered by our system, and the geometry rendered using only view frustum culling. This example comes from the power plant model we used in testing our algorithm's performance. Below is a sample viewpoint within the model illustrating the high depth complexity and aggregate occlusion caused by many small occluders. Click on the image to see the full-sized version. Our algorithm dramatically increases the rendering rate for this environment. The following data has been taken from a path that starts on the outside of the power plant, enters a window and moves through a complex pipe structure. The actual improvement in rendering time varies as the amount of occlusion varies through the path. Presented below is occlusion data from the same path: Although the algorithm dramatically increases performance even in this difficult occlusion model, the actual visible set of triangles (as found with an item buffer) is still far smaller than what is actually rendered. Occlusion detection can be increased by using smaller cells or a smaller splitting threshold and greater maximum depth on with the nested grid implementation. The parameters used in the presented data were from the parameter configuration that we found to be near optimal for this path. Further Acceleration Techniques: Prior to discovering that our implementation efficiency was fill-rate dependent we implemented a vertex program to generate cube vertices on the card to save bandwidth. Only the cube id (its grid position) and a set of static indices (into a vertex array already on the graphics card) would need to be sent to the graphics card. If a future API extension allowed the storing of indices into vertex arrays on the card, then only the id would need to be transmitted per cube. We attempted two methods of accelerating our algorithm by reducing fill-rate. One method was to subdivide the view frustum into sub-frusta in an attempt to find sub-frusta whose pixels had been completely filled. The traversal of such sub-frusta could be terminated for the purpose of saving occlusion tests on bounding boxes for the cubes in this frustum. This would save fill for these cubes. To test the visibility of these sub-frusta, the screen was split recursively into quarters up to some limit. A rectangle whose corners were guaranteed to be behind all previously rendered geometry was rendered for each screen area. If this rectangle failed the depth test then no further occlusion testing need be done for cells whose screen space projection falls fully in this screen region. Unfortunately, though traversal of sub-frusta was prevented by this addition, the overall rendering time was increased. The rectangles must have added more fill than they saved. There are a few possible explanations for this. The obvious explanation is that there are too many small holes of visibility preventing enough sub-frutsa from being completely filled. Another is that the z-culling capabilities of the NVIDIA cards are able to prevent full rasterization of the cells that this technique was able to prevent traversing. This "z-cull" operation is based upon storing several layers of Ned Greene's z-pyramid in hardware and detecting when a primitive is fully occluded. Greene has proposed the idea of reading back the tip of the z-pyramid from the card to the CPU during a frame. This capability could be used by our algorithm in place of rendering rectangles and would then reduce the overhead of this sub-frusta traversal termination approach. Another approach to reducing the number of cells rendered was to take advantage of frame-to-frame coherence in occlusion. The algorithm is divided in to two passes. In the first pass all cells visible from the previous frame are rendered without first rendering their bounding boxes. Each rendering is contained within an occlusion query but the results of these queries are not gathered till the end of the frame (to prevent pipeline stalling). The queries are used to remove cells from this list of cells that are assumed to be visible. The second pass proceeds exactly as in the original algorithm and simply skips over cells that have been rendered in the first pass. Cells deemed visible in the second pass are added to the list of cells for the first pass of the next frame. This technique showed modest improvements in frame-rate. |