A roughly calibrated camera and projector are aimed at a wall (the screen). A line is swept across the screen, and when it falls on an object inside a user-defined volume of interest, a feedback mechanism makes the line settle on and follow one end of that object. In the example setup the line is horizontal, because that orientation gives it one of the largest possible projected areas in the camera image and hence the most visible change. The end chosen is the upper end of the object: the line moves back and forth around the upper end and follows the object's upper edge around within the bounding box.
Structured Light: This program uses an active vision technique called structured light (known light from a projector is introduced into the scene to aid in 3D point reconstruction from a single camera view). The light in question is a single straight line of a predefined thickness (to increase error resilience). Its parameter value, and hence the position where the line lies, is known at each iteration of the projection loop.
Rough Calibration: Instead of an exact calibration that yields intrinsic and extrinsic camera and projector matrices, all I determine are the centers of projection (COPs) of the camera and projector (by measurement, calculation and verification), four 2D points on the projector screen with the corresponding 3D points on the wall, and a warp for the camera image, referred to here as the affine warp (calculated by another program from the screen-space and world-space coordinates of 4 marked points in the camera image).
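The formula for the affine warp is: x' = (Ax + By + C)/(Gx + Hy + 1) and y' = (Dx + Ey + F)/(Gx + Hy + 1), where A..H are the warp coefficients. Strictly speaking, the shared denominator makes this a projective (perspective) warp rather than an affine one; it reduces to a true affine warp when G = H = 0. Its eight coefficients are exactly what four point correspondences determine.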
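As an illustration of the formula above, here is a minimal sketch of applying the warp to one point. The function name `warp_point` and the sample coefficients are hypothetical, chosen only for the example; the real coefficients come from the separate calibration program.

```python
def warp_point(coeffs, x, y):
    """Apply the 8-coefficient warp (A..H) to the point (x, y)."""
    A, B, C, D, E, F, G, H = coeffs
    w = G * x + H * y + 1.0  # shared denominator of both formulas
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this warp")
    return (A * x + B * y + C) / w, (D * x + E * y + F) / w


# Made-up coefficients: scale by 2 and shift, no perspective term (G = H = 0).
example_coeffs = (2.0, 0.0, 1.0, 0.0, 2.0, -1.0, 0.0, 0.0)
print(warp_point(example_coeffs, 3.0, 4.0))
```

With G and H set to zero the denominator is always 1, so the example behaves as a plain affine map; non-zero G or H introduces the perspective division.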
The advantages of rough calibration are that it is simpler to perform and less time consuming (if one already has the center of projection values and the distances in the room can be measured quickly); however, it is less general and less accurate than the full calibration method. In that method I would have used a calibration cube to determine the matrix from projector space to camera space. Here I transform both to wall space coordinates for objects (x left to right and y bottom to top along the wall, z out of it). This space is the most intuitive one in which to measure and assign 3D coordinates to objects.
Calculating 3D: From the rough calibration we can work out the ray in wall space that corresponds to a point in the camera image. The 3D coordinates of the line's endpoints are interpolated from the given 3D coordinates of the projected box. The projector COP and the line in 3D space form a triangle; intersecting it with the ray from the camera COP through a bright pixel seen by the camera gives a unique 3D point. This point is then tested against the bounding box of the volume of interest.
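The intersection step can be sketched as a ray-plane intersection followed by a box test. This is a simplified illustration, not the program's actual code: it intersects the camera ray with the plane of the triangle (without checking that the hit lies inside the triangle), and all function names and coordinates are made up for the example.

```python
def sub(a, b):
    return tuple(p - q for p, q in zip(a, b))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect_ray_with_light_plane(cam_cop, ray_dir, proj_cop, end_a, end_b):
    """Intersect the camera ray with the plane through the projector COP
    and the two 3D endpoints of the projected line. Returns None if the
    ray is parallel to the plane or the hit lies behind the camera."""
    normal = cross(sub(end_a, proj_cop), sub(end_b, proj_cop))
    denom = dot(normal, ray_dir)
    if abs(denom) < 1e-12:
        return None
    t = dot(normal, sub(proj_cop, cam_cop)) / denom
    if t <= 0:
        return None
    return tuple(c + t * d for c, d in zip(cam_cop, ray_dir))


def inside_box(point, box_min, box_max):
    """Axis-aligned bounding box test for the volume of interest."""
    return all(lo <= v <= hi for v, lo, hi in zip(point, box_min, box_max))


# Made-up wall-space numbers: wall at z = 0, z pointing out of the wall.
hit = intersect_ray_with_light_plane(
    cam_cop=(0.0, 6.0, 10.0), ray_dir=(0.0, -1.0, -1.0),
    proj_cop=(0.0, 0.0, 10.0),
    end_a=(-5.0, 2.0, 0.0), end_b=(5.0, 2.0, 0.0))
print(hit, inside_box(hit, (-3.0, 0.0, 1.0), (3.0, 4.0, 8.0)))
```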
Lookup Table Generation: We preprocess each possible value for the line and rasterize the triangle formed by the projector COP and the two line endpoints, projecting 3D points on it into the camera image to get the corresponding image points that would be lit if an object intersected the plane of this triangle at that 3D point. The image point is then warped according to the radial distortion map for the camera, so that one can do a direct lookup in the distorted image instead of undistorting in the rendering loop, which buys us speed.
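A rough sketch of the table-building idea follows. Several parts are assumptions made only for illustration: the camera is modeled as a simple pinhole looking straight at the wall along -z (the real system uses the measured COP and the camera warp), the distortion is a one-coefficient radial model, and the triangle is sampled on a coarse grid rather than properly rasterized. All names are hypothetical.

```python
def lerp(a, b, t):
    return tuple(p + (q - p) * t for p, q in zip(a, b))


def line_endpoints(corners, s):
    """corners = (bottom-left, bottom-right, top-left, top-right) 3D wall
    points of the projected box; s in [0, 1] is the line parameter."""
    bl, br, tl, tr = corners
    return lerp(bl, tl, s), lerp(br, tr, s)


def project_pinhole(p, cam_cop, focal, center):
    """Assumed camera model: pinhole at cam_cop looking along -z."""
    depth = cam_cop[2] - p[2]
    if depth <= 0:
        return None  # point is not in front of the camera
    return (focal * (p[0] - cam_cop[0]) / depth + center[0],
            focal * (p[1] - cam_cop[1]) / depth + center[1])


def distort(u, v, center, k1):
    """Assumed one-coefficient radial distortion about the image center."""
    dx, dy = u - center[0], v - center[1]
    f = 1.0 + k1 * (dx * dx + dy * dy)
    return center[0] + dx * f, center[1] + dy * f


def build_table(corners, proj_cop, cam_cop, n_lines, samples,
                focal, center, k1):
    """One dict per line position (n_lines >= 2), mapping a distorted
    pixel (u, v) to the 3D point that would light it."""
    table = [dict() for _ in range(n_lines)]
    for i in range(n_lines):
        a, b = line_endpoints(corners, i / (n_lines - 1))
        for j in range(samples + 1):          # along the line
            on_line = lerp(a, b, j / samples)
            for k in range(samples):          # from the wall toward the COP
                p = lerp(on_line, proj_cop, k / samples)
                uv = project_pinhole(p, cam_cop, focal, center)
                if uv is None:
                    continue
                u, v = distort(uv[0], uv[1], center, k1)
                table[i][(round(u), round(v))] = p
    return table
```

At run time, a bright pixel seen while line `i` is projected would be looked up directly in `table[i]`, with no undistortion or intersection math in the loop.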
Video Capture and Rendering: I capture frames off a Matrox Meteor-II PCI frame grabber card fed by a Panasonic video camera (future plans are to use a 1394/FireWire-based digital video camera and corresponding capture card, for repeatability and remote control of the camera configuration). To achieve real-time performance (about 25 fps currently), capture is made non-blocking and the capture and processing stages are pipelined. In each iteration of the main loop, we start the capture of a future frame, draw the line corresponding to that frame, and process the image captured in the previous iteration, using the parameter value with which the previous line was drawn.
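The ordering of the pipelined loop can be sketched as below. The `FakeGrabber` class is a hypothetical stand-in, not the Matrox API: it only records which line parameter a frame "shows", so that the pairing of each frame with the parameter used to draw its line can be checked.

```python
class FakeGrabber:
    """Hypothetical non-blocking grabber: start_capture returns at once,
    finish_capture hands back the frame started earlier."""

    def __init__(self):
        self.pending = None

    def start_capture(self, visible_param):
        # A real grabber would begin filling a buffer here.
        self.pending = visible_param

    def finish_capture(self):
        frame, self.pending = self.pending, None
        return frame


def run_pipeline(params, grabber, draw_line, process):
    """Each iteration: collect the previous frame, start the next capture,
    draw the next line, then process the previous frame together with the
    parameter that was used to draw its line."""
    prev_param = None
    for param in params:
        frame = grabber.finish_capture() if prev_param is not None else None
        grabber.start_capture(param)
        draw_line(param)
        if frame is not None:
            process(frame, prev_param)
        prev_param = param


processed = []
run_pipeline([0, 1, 2, 3], FakeGrabber(),
             draw_line=lambda p: None,
             process=lambda frame, p: processed.append((frame, p)))
print(processed)
```

The processing of a frame overlaps the capture of the next one, which is where the throughput comes from; the cost is that every result is one iteration old by the time it is used.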
Feedback from Camera: Without feedback, the line continuously sweeps back and forth, covering the entire range of its motion. Feedback means modifying the movement of the line when a pixel is detected whose value is above the threshold and whose 3D coordinates, looked up in the table, lie in the volume of interest (i.e. we hit an interesting object). The algorithm is then to move up while we detect such a pixel and back down when we do not. The effect is that the system hovers constantly on the edge between object and no object, and so stays localized on one edge of the object as it moves around.
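A minimal sketch of this feedback rule, ignoring pipelining latency. The `tracking` flag and the rule that tracking is dropped when the line reaches either limit of its range are assumptions added to make the example complete; the function names and numbers are hypothetical.

```python
def step_line(pos, direction, tracking, hit, lo, hi, step=1):
    """Advance the line one step. With a hit, move up and start tracking;
    while tracking without a hit, move down; otherwise keep sweeping.
    Reaching either limit reverses the sweep and drops tracking."""
    if hit:
        direction, tracking = +1, True
    elif tracking:
        direction = -1
    pos += direction * step
    if pos >= hi:
        pos, direction, tracking = hi, -1, False
    elif pos <= lo:
        pos, direction, tracking = lo, +1, False
    return pos, direction, tracking


def simulate(object_bottom, object_top, lo, hi, steps):
    """Run the rule against a static object spanning line positions
    object_bottom..object_top and return the visited positions."""
    pos, direction, tracking = lo, +1, False
    trace = []
    for _ in range(steps):
        hit = object_bottom <= pos <= object_top
        pos, direction, tracking = step_line(pos, direction, tracking,
                                             hit, lo, hi)
        trace.append(pos)
    return trace


trace = simulate(object_bottom=5, object_top=10, lo=0, hi=20, steps=40)
print(trace[-6:])
```

In this toy run the line climbs through the object and then alternates between the last position that hits the object and the first one that does not, which is the hovering behaviour described above.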
The system works pretty well (is fairly accurate) for one with so many inherent sources of error, and works in real time (20-25 fps) even though it has to do thousands of table lookups per frame. Complex geometric calculations are involved in exactly determining the centers of projection of the projector and the camera, so what I have is necessarily an approximation with some significant error. Similarly, each of the measurements of 3D points has error. Also, if the camera is perturbed even slightly from the position at which the affine warp was found, the result will be incorrect and both the camera COP and the affine warp will have to be measured again.
There is an inherent frame of latency in the system due to the pipelining. This shows up in feedback that acts one frame too late, so that the line hovers loosely around the edge instead of settling firmly on it. The line has to continue its previous motion until its "shadow" (the previously drawn line) is analysed and an interesting object is found or not found.
The notion of a bounding box is arbitrary and just introduced to bound the complexity of the lookup per frame to those points that are within the bounding box. It can be frustrating when you lose tracking as the object accidentally leaves the bounding box or the envelope of the sweeping line. I have tried to make this box large enough to facilitate a good enough range of tracking for the demo. There are geometric constraints on how big the bounding box can be based on where the line envelope falls in space.
With the current localization strategy one can find the 3D shape of the object by listing all the 3D points discovered for a given position of the sweeping line. I am not doing this at the moment as it is not required for my localization algorithm. If it were done, one could use this system for modeling or pose estimation of a known object, or for tracking (position/orientation). One could reliably project a marker (say a line or cross) on the "tip" of the object, or on its "corner" if the process were run in parallel for a vertical and a horizontal line.
The structured light is currently very visible (the camera counts on it being so). Using a special threshold view, one can vary the lighting conditions in the room and the value of the intensity threshold so that only the projected light shows up, and no other objects. It is possible to make the structured light imperceptible using techniques developed here, for seamless tracking using projector and camera only.
The camera image being shown (and analysed) is in greyscale instead of color because of a problem with getting color image capture to work. Conceivably, color would help to threshold better and to pick out color-coded lines of structured light (e.g. for two or more directions).