Occluder Segmentation with Motion




Li Guan

lguan@cs.unc.edu

University of North Carolina at Chapel Hill

Department of Computer Science



 


Abstract

We present a system for segmenting occluders in a scene observed from a fixed viewpoint, given a video stream containing arbitrary object motion seen from that view. We first detect moving-object silhouettes using a pre-learned background and shadow model, so that the system is robust against lighting changes, which are frequent in such a setting. We then analyze the motion of the objects by examining spatio-temporal Motion History Images (MHI) of the silhouettes. Based on the motion direction, we propose the concept of the Effective Edge of a moving silhouette, which is guaranteed to enclose the occluder boundary. We show its power in distinguishing the real occluder boundary from the interface that moving objects have never reached. After a final refinement, the actual occluder is segmented as a binary mask image. A real-time system has been implemented to validate the theory.

 

1.     Introduction

For many computer vision applications, knowing the shape of occluders in a static camera view drastically improves performance. Examples include surveillance tracking and image-based reconstruction. It also plays an important role in stereo-pair depth estimation, indicating where pixels in one view have no correspondences in the other view.

In 2005, N. Apostoloff et al. [1] proposed a method to analyze the occlusion problem in the spatio-temporal cube. They re-introduce the concept of the “T-junction” and propose a learning framework to improve the reliability of detecting real occlusions.

T-junction analysis in the spatio-temporal video cube reveals a fundamental property: when occlusion happens, the general motion of the foreground object is suddenly stopped by the occluder.

However, we find that exclusive analysis of T-junctions in a spatio-temporal cube is neither sufficient nor necessary to determine real occlusion. There are mainly two problems, both of which are related to pixel colors. The first is that when the occluder and background have similar colors, no T-junctions at the occlusion boundary can be detected, as shown in Figure 1. The second problem is more subtle. When the moving object has stripe-like textures, we are likely to observe T-junction-like patches in the spatio-temporal cube that are actually due only to the texture illusion, as shown in Figure 2.

Other methods have been tried to avoid such heavy dependence on color information, as in T-junction or optical-flow analysis. G. Brostow et al. [2] use moving foreground object silhouettes extracted with background subtraction as active brushes that delineate the underlying layers of the scene. Original as this approach is, the authors admit that due to its sensitivity to intensity variations, e.g. shadows cast by moving objects, this background-subtraction-based method has limited application areas. Probabilistic improvements that eliminate the effect of shadows as well as reflections are now available [3, 4], but the main problem with using moving foreground object silhouettes for occluder extraction lies in regions that no moving object has ever explored. In Figure 3, we accumulate foreground silhouettes from all image frames into a single binary image, which from now on will be called the Cumulated Silhouette Image (CSI). As image frames accumulate over time, zero-mean pixel noise is mostly eliminated. Edges in the CSI therefore provide strong indications of occluder boundaries, as long as we can detect silhouettes correctly. The only exception happens at the interface between regions reachable and unreachable by any foreground object. We call this the Phantom Boundary (PB).

Figure 1. T-junction detection failure—problem with similar color. Top: One image from a video sequence, where workers are walking towards right. Bottom: The blue horizontal line in the top image changing with time. Note that the blue circle indicates the region where occlusion happens, but because of color similarity, no T-junction can be detected.

Figure 2. T-junction detection failure—problem with moving object texture. Left: One image from a video sequence, where workers are turning his body around. Right: The red vertical line in the left image changing with time. Note that the red circles indicate false T-junctions with the yellow ribbon, caused by stripe-like T-shirt texture intersecting with the horizontal ribbon, where there is actually no up-and-down motion.

Figure 3. Cumulated Silhouette Image (CSI) of video captured at the construction site. The binary boundaries are also occluder boundaries, except for the horizontal curve at the top, the Phantom Boundary (PB), above which foreground moving objects (the wandering workers) have never reached.

The silhouette/CSI approach successfully bypasses the foreground texture problem encountered in T-junction analysis. The PB problem, however, warns us that some parts of a silhouette boundary should not contribute to the actual occluder boundary.

In the following, we propose a new approach that uses silhouette boundaries to detect the occluder boundary directly, i.e., without first building the CSI and computing image edges. Section 2 takes a close look at the formation of the PB and introduces the idea of the Effective Edge (EE). Section 3 explains additional concerns in making the real-time system work. The main algorithm is given in Section 4. Examples and conclusions are presented in Section 5.

2.     Effective Edge (EE)

In this paper, we assume a static scene in which only foreground objects move, and that they are occluded by some occluder in the scene at some time instant.

We find that using the full moving-object silhouette F is overkill for determining the occluder boundary; using ∂F, the silhouette boundary, is enough. To explain the idea, we introduce the Cumulated Silhouette Boundary Image (CSBI), which is similar to the CSI, except that only silhouette boundaries, rather than complete silhouette blobs, are accumulated. Figure 4 shows the CSBI of the construction-site video as well as the ground-truth occluder boundary. As one can see, the latter is a subset of the former.
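
For concreteness, the following is a minimal sketch of the CSI and CSBI accumulation (written with the modern OpenCV C++ API rather than the OpenCV 1.x C API of our actual MSVC++ 6.0 implementation; function and variable names here are illustrative only):

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Accumulate a per-frame binary silhouette into the CSI and CSBI.
// 'silhouette' is a CV_8UC1 mask (255 = foreground); csi and csbi are
// zero-initialized images of the same size, updated in place.
void accumulateSilhouette(const cv::Mat& silhouette, cv::Mat& csi, cv::Mat& csbi)
{
    // CSI: union of all silhouette blobs seen so far.
    cv::bitwise_or(csi, silhouette, csi);

    // Silhouette boundary dF: the morphological gradient of the binary mask
    // (dilation minus erosion) leaves a thin band along the blob outline.
    cv::Mat boundary;
    cv::morphologyEx(silhouette, boundary, cv::MORPH_GRADIENT,
                     cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3)));

    // CSBI: union of all silhouette boundaries seen so far.
    cv::bitwise_or(csbi, boundary, csbi);
}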

 

Figure 4. Left: Cumulated Silhouette Boundary Image (CSBI). Right: Ground truth occluder boundary. The boundaries in the right image are a subset of those in the left image. The ultimate goal of this paper is to achieve the image on the right.

But the PB problem still exists in the CSBI. The task now is to determine which part of ∂F in each frame contributes to the real occluder boundary.

If we take a close look at the CSBI, the parts of ∂F constituting the PB are always parallel to the moving direction of the object (if it is moving at all), and are never the frontal parts of ∂F. Therefore the PB differs from real occluder edges, in the sense that an occluder boundary should suddenly stop foreground motion, as we have observed in T-junction analysis.

This analysis again explains why it is important to look not merely at static images but at a temporally meaningful video sequence: the motion direction provides essential information for occluder boundary detection.

We now give an informal definition of Effective Edges (EE): the frontal parts of a foreground object silhouette in both the moving direction and its reverse, as depicted in Figure 5. The reason we also consider the reverse direction is that the video can be “played back,” and the occluder boundary detected via motion in the same way.

Figure 5. Effective Edges (EE) of a synthetic moving-object silhouette. Note that the silhouette can be concave.

We can then simply accumulate EEs as valid candidates for occluder boundaries. The idea is that even if some EEs are not actual occluder boundaries for some unpredictable reason, e.g. failure of foreground silhouette segmentation, they can easily be detected and removed if those pixels have ever been, or will ever be, occupied by some foreground silhouette in the video sequence.

3.     Preparations

3.1 Background subtraction

As mentioned in Section 1, the traditional background subtraction algorithm, which requires a perfect background image, is very sensitive to intensity changes. However, it can easily be improved by learning a probability model for every pixel.

In this paper, we model the random camera noise at every pixel with a simple 3D Gaussian distribution in RGB color space, and treat shadows separately with another 1D Gaussian.

The construction of the background model is straightforward. We collect a series of background frames and, for each pixel, calculate its mean and covariance.
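
The per-pixel model can be estimated in a single pass over the training frames. The following is a minimal sketch in the modern OpenCV C++ API (our actual implementation used the OpenCV C API of that era; learnBackground and isBackground are illustrative names):

#include <opencv2/core.hpp>
#include <vector>

// Per-pixel Gaussian background model in RGB: a 3-vector mean and a 3x3
// covariance per pixel, estimated from a set of background-only frames.
struct PixelGaussian {
    cv::Vec3d   mean;
    cv::Matx33d cov;
};

// Learn the model from background frames (all CV_8UC3, same size).
std::vector<PixelGaussian> learnBackground(const std::vector<cv::Mat>& frames)
{
    const int rows = frames[0].rows, cols = frames[0].cols;
    std::vector<PixelGaussian> model(rows * cols);
    for (int y = 0; y < rows; ++y)
        for (int x = 0; x < cols; ++x) {
            cv::Vec3d sum(0, 0, 0);
            cv::Matx33d sumSq = cv::Matx33d::zeros();
            for (const cv::Mat& f : frames) {
                cv::Vec3b p = f.at<cv::Vec3b>(y, x);
                cv::Vec3d c(p[0], p[1], p[2]);
                sum   += c;
                sumSq += c * c.t();                    // outer-product accumulation
            }
            PixelGaussian& g = model[y * cols + x];
            const double n = static_cast<double>(frames.size());
            g.mean = sum * (1.0 / n);
            g.cov  = sumSq * (1.0 / n) - g.mean * g.mean.t();
            // In practice a small regularization term keeps cov invertible.
        }
    return model;
}

// Background test: a pixel is considered background if its squared
// Mahalanobis distance to the learned Gaussian is below a threshold.
bool isBackground(const PixelGaussian& g, const cv::Vec3d& c, double maxDist2)
{
    cv::Vec3d d = c - g.mean;
    cv::Matx33d icov = g.cov.inv();
    double dist2 = (d.t() * icov * d)(0, 0);           // 1x1 result, take the scalar
    return dist2 < maxDist2;
}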

Once we have the background model, we handle shadows by comparing the chromaticity difference between the observation and the learned background mean. Independence between chromaticity and intensity is assumed. Suppose the mean background value at a certain pixel location is Cb, and the current color observation at that pixel is Ct. We first normalize both vectors such that Cb becomes (1, 1, 1). To measure chromaticity, we calculate the distance d from the normalized Ct to this vector. Figure 6 shows the setup. Over all pixel positions and all frames, we train a single zero-mean 1D Gaussian probability model on d.
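
For concreteness, a minimal sketch of this distance computation (modern OpenCV C++ API; chromaticityDistance is an illustrative name, not part of OpenCV):

#include <opencv2/core.hpp>
#include <cmath>

// Chromaticity difference d between an observation Ct and the background
// mean Cb at the same pixel: normalize Ct channel-wise by Cb (so that Cb
// maps to (1,1,1)), then take the perpendicular distance from the
// normalized Ct to the line spanned by (1,1,1).
double chromaticityDistance(const cv::Vec3d& Cb, const cv::Vec3d& Ct)
{
    const double eps = 1e-6;                           // avoid division by zero
    cv::Vec3d c(Ct[0] / (Cb[0] + eps),
                Ct[1] / (Cb[1] + eps),
                Ct[2] / (Cb[2] + eps));
    cv::Vec3d u(1.0 / std::sqrt(3.0),
                1.0 / std::sqrt(3.0),
                1.0 / std::sqrt(3.0));                 // unit vector along (1,1,1)
    double along = c.dot(u);                           // projection onto the line
    cv::Vec3d perp = c - along * u;                    // component off the line
    return cv::norm(perp);                             // this is d
}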

The reason that the chromaticity difference can be modeled as a Gaussian is as follows. Intuitively, the closer the chromaticity of a pixel is to that of the background, the higher its shadow probability. We assume that a sample point in the RGB cube lies on a 2D plane whose normal vector is the normalized background color. A sample can then be regarded as a random variable $(x, y)$ in that plane, with the origin placed at the intersection of the background color vector and the plane. The distance $d = \sqrt{x^2 + y^2}$ then, by definition, obeys a Rician distribution [5]:

$$p(d \mid \nu, \sigma) = \frac{d}{\sigma^2}\,\exp\!\left(-\frac{d^2 + \nu^2}{2\sigma^2}\right) I_0\!\left(\frac{d\,\nu}{\sigma^2}\right),$$

where $\nu$ is the distance from the origin to the mean of $(x, y)$ and $I_0$ is the zeroth-order modified Bessel function of the first kind. If we assume $\nu = 0$, this reduces to a Rayleigh distribution, i.e., $d$ is the magnitude of a zero-mean isotropic 2D Gaussian. Since a sample with chromaticity difference $d_0$ is distributed along the circle with radius $d_0$ around the origin, each of its in-plane coordinates is still a zero-mean Gaussian, which is why we train the chromaticity difference as a zero-mean 1D Gaussian.

Figure 6. Normalized RGB cube for learning shadows. Every red dot represents a pixel observation of shadow, normalized with respect to the background mean color at that pixel location before being plotted. The chromaticity difference, measured as d, is trained to indicate the shadow probability.

After the background and shadow models are constructed, we can label pixels having both low background probability and low shadow probability as foreground using simple thresholding. Although smoothness constraints could be further imposed to regularize the shape of the silhouette, we do not discuss this further in this paper.

3.2 Determine motion direction

We use the Motion History Image (MHI) of the silhouettes to infer the motion direction. The MHI is a static image template in which pixel intensity is a function of the recency of motion in the sequence, as shown in Figure 7.

Figure 7. MHI at a certain time instant in construction site video. The blue blobs indicate moving silhouettes. The newest frame is recorded with highest intensity, and the older ones are faded out sequentially.

Green bars in Figure 7 indicate the intensity gradient direction at each location. Red rectangles cluster silhouette blobs, and the red line segment at the center of each rectangle indicates the mean gradient direction of all pixels in the rectangle, which represents the general motion direction of the silhouette.
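
A minimal sketch of the MHI update and the per-blob motion direction estimate is given below (modern OpenCV C++ API; updateMHI and blobMotionDirection are illustrative names, and OpenCV's own motion-template helpers could be used instead):

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <cmath>

// Update a simple Motion History Image: pixels covered by the current
// silhouette are set to the maximum value, all others decay by 'fade'.
void updateMHI(const cv::Mat& silhouette, cv::Mat& mhi, float fade)
{
    mhi -= fade;                                       // older motion fades out
    mhi = cv::max(mhi, 0.0f);                          // clamp at zero
    mhi.setTo(1.0, silhouette);                        // newest motion is brightest
}

// Estimate the general motion direction of a blob as the mean MHI gradient
// inside its bounding rectangle. Returns the angle in radians (image coords).
double blobMotionDirection(const cv::Mat& mhi, const cv::Rect& blob)
{
    cv::Mat gx, gy;
    cv::Sobel(mhi(blob), gx, CV_32F, 1, 0, 3);         // d(MHI)/dx
    cv::Sobel(mhi(blob), gy, CV_32F, 0, 1, 3);         // d(MHI)/dy
    cv::Scalar mx = cv::mean(gx), my = cv::mean(gy);
    return std::atan2(my[0], mx[0]);                   // mean gradient direction
}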

In fact, the MHI can be regarded as a compressed form of the video cube. T-junctions necessarily occur at locations where the ratio of the eigenvalues of the local gradient matrix suddenly decreases from infinity to 1.

3.3 Calculate Effective Edge

In order to simplify the implementation, we do not calculate the EE exactly as depicted in Figure 5. Instead, we redefine it as the part of the silhouette boundary within a given angle of the motion direction or its reverse, as shown in Figure 8. This angle is one of the tuning parameters of our system; generally, a large angle indicates high confidence in the extracted silhouette as the real object shape.

Figure 8. Modified Effective Edges (EE). Notice that the EE on the right is broken into two pieces.

Because of concavity, this EE is by no means the same as the original definition in Section 2, but thanks to the large number of video frames, the missing EE is very likely to be recovered from other moving objects at other time instants.
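
One way to realize this redefinition in code is sketched below (an illustration only; effectiveEdge and its arguments are our own names, and other boundary-normal estimates would work equally well). A boundary pixel is kept if its outward normal lies within the given angle of the blob's motion direction or of its reverse:

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <cmath>

// Modified EE: keep a silhouette boundary pixel if its outward normal
// (approximated by the gradient of a blurred silhouette mask) lies within
// 'maxAngle' radians of the motion direction or of its reverse.
cv::Mat effectiveEdge(const cv::Mat& silhouette, double motionAngle, double maxAngle)
{
    // Thin boundary of the binary silhouette.
    cv::Mat boundary;
    cv::morphologyEx(silhouette, boundary, cv::MORPH_GRADIENT,
                     cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3)));

    // Outward normal direction from the mask gradient.
    cv::Mat blurred, gx, gy;
    cv::GaussianBlur(silhouette, blurred, cv::Size(5, 5), 0);
    cv::Sobel(blurred, gx, CV_32F, 1, 0, 3);
    cv::Sobel(blurred, gy, CV_32F, 0, 1, 3);

    cv::Mat ee = cv::Mat::zeros(silhouette.size(), CV_8UC1);
    for (int y = 0; y < silhouette.rows; ++y)
        for (int x = 0; x < silhouette.cols; ++x) {
            if (!boundary.at<uchar>(y, x)) continue;
            // The mask gradient points into the blob; negate it for the outward normal.
            double normal = std::atan2(-gy.at<float>(y, x), -gx.at<float>(y, x));
            double diff = std::fabs(std::remainder(normal - motionAngle, 2.0 * CV_PI));
            // Keep the frontal parts along the motion direction and its reverse.
            if (diff < maxAngle || std::fabs(CV_PI - diff) < maxAngle)
                ee.at<uchar>(y, x) = 255;
        }
    return ee;
}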

We can now accumulate EEs over time, but eliminate those EEs that happen to lie in places where moving blobs have traveled or will travel. We call this final image the Cumulated Occluder Boundary Image (COBI), as shown in Figure 9.

Figure 9. From left to right, top to bottom: COBIs at frame No. 5, No. 50, No. 100 and No. 274 respectively, showing how the COBI evolves with time. As candidates for occluder boundaries, the EEs appear random at first but become regularized as time goes on. Notice that no PB exists in the final COBI.

3.4 Refine the final boundary

At present, we manually stop the COBI updating loop once the boundary is “visually” good enough for further use. Finding a good metric to evaluate the occluder boundary is one direction for future research.

However, we can enhance the smoothness of the occluder boundary by combining the COBI with the CSI again, since edges in the binary CSI always preserve smoothness and connectivity.

The idea is to find edges in the CSI and let them be voted on by non-zero boundary pixels in the COBI. If the ratio of the number of votes to the edge length is below a threshold R, that edge is taken as a PB and deleted. Otherwise, the edge is declared an occluder boundary, as shown in Figure 10.
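
A sketch of this voting step follows (refineBoundary is an illustrative name; cv::connectedComponents assumes OpenCV 3 or later, while our original C-API implementation labeled edges via contour extraction instead):

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Keep only CSI edges sufficiently supported by the COBI: each connected CSI
// edge is voted on by non-zero COBI pixels lying on it, and edges whose
// vote/length ratio falls below R are discarded as PBs.
cv::Mat refineBoundary(const cv::Mat& csiEdges, const cv::Mat& cobi, double R)
{
    cv::Mat labels;
    int n = cv::connectedComponents(csiEdges, labels, 8, CV_32S);

    std::vector<int> length(n, 0), votes(n, 0);
    for (int y = 0; y < csiEdges.rows; ++y)
        for (int x = 0; x < csiEdges.cols; ++x) {
            int l = labels.at<int>(y, x);
            if (l == 0) continue;                      // background label
            ++length[l];
            if (cobi.at<uchar>(y, x)) ++votes[l];
        }

    cv::Mat occluderBoundary = cv::Mat::zeros(csiEdges.size(), CV_8UC1);
    for (int y = 0; y < csiEdges.rows; ++y)
        for (int x = 0; x < csiEdges.cols; ++x) {
            int l = labels.at<int>(y, x);
            if (l != 0 && votes[l] >= R * length[l])
                occluderBoundary.at<uchar>(y, x) = 255;
        }
    return occluderBoundary;
}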

Figure 10. Top row, from left to right: the final COBI, and the CSI edges colored differently to indicate connectivity. Bottom row: the left image is the result after thresholding with R = 10% (the ratio of votes to edge length, in units of pixels); compare this image with the ground truth in Figure 4. The bottom-right image is the result after flood-filling, used as a mask image for further processing.

4.     Main algorithm

We now give the main algorithm; a code sketch of the per-frame loop follows the listed steps.

FindOccluderBoundary (V, MB, MS)

Input. Video sequence V, background model MB, shadow model MS.

Output. The Cumulated Occluder Boundary Image (COBI).

1.     Background segmentation to get foreground object silhouette S;

2.     Update the Motion History Image (MHI) with S;

3.     Determine motion directions of S:

a)      Calculate gradient of MHI;

b)      Cluster S into meaningful blobs bounded by rectangle R;

c)      Represent the motion direction of R with the mean gradient of all pixels in R;

4.     Calculate Effective Edge (EE) of current S, given the motion direction of S.

5.     Add EE into COBI;

6.     Add S into Cumulated Silhouette Image (CSI);

7.     In the COBI, reset a pixel to zero if the same pixel is non-zero in the CSI.
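
For reference, the per-frame loop might look as follows, combining the illustrative helpers sketched in Section 3 (segmentForeground stands in for the background/shadow classification of Section 3.1 and is not shown here; this is a sketch, not the exact implementation):

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Per-frame update corresponding to steps 1-7 above. COBI, CSI and MHI are
// preallocated and updated in place.
void processFrame(const cv::Mat& frame, cv::Mat& mhi,
                  cv::Mat& csi, cv::Mat& cobi, double maxAngle, float fade)
{
    // 1. Background/shadow segmentation (Section 3.1) yields a binary silhouette.
    cv::Mat S = segmentForeground(frame);              // hypothetical helper

    // 2. Update the Motion History Image with the new silhouette.
    updateMHI(S, mhi, fade);

    // 3. Cluster the silhouette into blobs and estimate their motion directions.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(S.clone(), contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    // 4.-5. Effective Edges of each blob are added into the COBI.
    for (const auto& c : contours) {
        cv::Rect blob = cv::boundingRect(c);
        double dir = blobMotionDirection(mhi, blob);
        cv::Mat blobMask = cv::Mat::zeros(S.size(), CV_8UC1);
        S(blob).copyTo(blobMask(blob));                // restrict to this blob's region
        cv::bitwise_or(cobi, effectiveEdge(blobMask, dir, maxAngle), cobi);
    }

    // 6. Add the silhouette into the CSI.
    cv::bitwise_or(csi, S, csi);

    // 7. Remove COBI pixels that any silhouette has covered.
    cobi.setTo(0, csi);
}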

There are a few parameters that tune the quality of the recovery.

1.   The standard deviation of the shadow model;

2.   The threshold for background segmentation;

3.   The fading ratio of silhouettes in the MHI;

4.   The kernel size for the MHI gradient calculation;

5.   The angle that defines the EE around the motion direction (Section 3.3);

6.   The voting threshold R for the final edge refinement (Section 3.4).

5.     Conclusion

In this paper, we introduce an original method to robustly detect and segment static occluders using the Effective Edges of moving foreground objects in the scene. Based on MSVC++ 6.0 and OpenCV, we have implemented a real-time system that performs background and shadow training and segments occluders from a webcam stream. Figure 11 shows another example captured from a webcam. The main remaining problem, as mentioned before, is that the online COBI updating has no automatic stopping criterion, and depending on the complexity of the setting, the convergence speed of the algorithm varies considerably.

References

[1]       Nicholas Apostoloff and Andrew Fitzgibbon. Learning Spatiotemporal T-junctions for Occlusion Detection. In Proc. CVPR, June 2005.

[2]       Gabriel J. Brostow and Irfan A. Essa. Motion Based Decompositing of Video. In Proc. ICCV, 1999.

[3]       Ahmed M. Elgammal, David Harwood, and Larry S. Davis. Non-parametric Model for Background Subtraction. In Proc. ECCV, 2000.

[4]       Nicolas Martel-Brisson and André Zaccarin. Moving Cast Shadow Detection from a Gaussian Mixture Shadow Model. In Proc. CVPR, June 2005.

[5]       http://www.mathworks.com/access/helpdesk/help/toolbox/commblks/ref/riciannoisegenerator.html

Figure 11. Full-resolution real-time example. (a) camera view; (b) MHI; (c) CSI; (d) silhouette image; (e) silhouette boundary; (f) EE; (g) COBI; (h) final COBI; (i) CSI boundary; (j) after thresholding with R; (k) after flood filling.