Summary:
Our goal here is to calculate the 3D surface locations of an object. There are several ways of accomplishing this, but this article concentrates on stereoscopic reconstruction, which estimates the depth of an object using images from two cameras that view it. As a project in my Computer Vision course (ECGR6090), I implemented such a system.
Below is the setup of a simple stereoscopic reconstruction system, with two cameras imaging the same object. Using the images, the known relative positions and orientations of the cameras, and a model of how each camera forms images, i.e., details regarding its lens and image sensor, we can reconstruct the 3D positions of points that lie on the viewed object's surface.
The theory of stereoscopic reconstruction is straightforward. Given the known locations of the two cameras and the same point identified in two different images (one from each camera), the 3D depth of that point can be calculated. While the theory is straightforward, several tasks must be completed before accurate measurements can be obtained.
Step 1: Camera Calibration
In order to use images captured from a camera, certain intrinsic and extrinsic properties of the system must be known. To calculate these properties, a calibration object was imaged by the system. The calibration object is a box with a checkerboard pattern of known geometry superimposed on its outer surfaces. The calibration object was imaged from both cameras, and matching image-point pairs, i.e., image pixels depicting the same 3D position in each of the two images, were manually selected from the resulting images. Each pair records the 2D image locations of the same 3D world point. Using these pairs, the intrinsic and extrinsic parameters were calculated. The intrinsic properties of interest were the focal length, image center, and image plate size, i.e., the size of the image sensor. The extrinsic parameters were calculated for each camera and specify the relative positions and orientations of the two cameras.
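To make the role of the intrinsic parameters concrete, here is a minimal sketch of the pinhole projection model. It assumes square pixels and no lens distortion, and it uses the sensor size only to convert the focal length from millimetres to pixels. The function names and all numeric values are illustrative and are not taken from the project.

```python
import numpy as np

def focal_length_in_pixels(focal_mm, sensor_width_mm, image_width_px):
    """Convert a focal length in millimetres to pixels using the sensor size."""
    return focal_mm * image_width_px / sensor_width_mm

def project_point(point_cam, focal_px, center_px):
    """Project a 3D point (camera coordinates, z pointing forward) to a pixel."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    u = focal_px * x / z + center_px[0]
    v = focal_px * y / z + center_px[1]
    return np.array([u, v])

# Hypothetical camera: 8 mm lens, 6.4 mm wide sensor, 640 x 480 image.
focal_px = focal_length_in_pixels(8.0, 6.4, 640)
pixel = project_point(np.array([0.1, -0.05, 2.0]), focal_px, center_px=(320.0, 240.0))
```

Calibration runs this model in reverse: given many pixels whose 3D positions on the calibration box are known, it estimates the focal length and image center that best explain them.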
Step 2: Back-Projection
Back-projection takes the indicated point pairs and projects rays back through the cameras into the 3D scene (see blue lines in the diagram below). For each image-point pair, the rays from the left and right cameras should come very close to intersecting at the actual 3D position of the surface point corresponding to that pair (see green circles in the diagram below). The rays never intersect exactly because of noise in the image measurement process, which comes from quantizing the scene into pixels and from noise inherent to any measurement, e.g., thermal noise, illumination variation, etc. Since these points were manually selected, this step serves as an error check on the results of camera calibration. For each image-point pair, the resulting pair of 3D rays has a unique location where the rays are closest. The reconstructed 3D position (shown as green points) is taken as the midpoint of the line segment connecting the two skew 3D rays at their closest point. Because the geometry of the box and calibration pattern is known, one may now subtract the known 3D positions from the corresponding reconstructed positions. The remaining values are a sample of the noise present in the stereoscopic reconstruction system.
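The midpoint construction described above can be sketched as follows. This is a minimal illustration, assuming each ray is given by an origin and a direction in a common coordinate frame. The function name and the example rays are hypothetical and are not the project's code.

```python
import numpy as np

def closest_approach_midpoint(origin_a, dir_a, origin_b, dir_b, eps=1e-12):
    """Return the midpoint of the shortest segment between two 3D lines,
    together with the length of that segment (the gap between the rays)."""
    origin_a, dir_a = np.asarray(origin_a, float), np.asarray(dir_a, float)
    origin_b, dir_b = np.asarray(origin_b, float), np.asarray(dir_b, float)
    w0 = origin_a - origin_b
    a = dir_a @ dir_a
    b = dir_a @ dir_b
    c = dir_b @ dir_b
    d = dir_a @ w0
    e = dir_b @ w0
    denom = a * c - b * b          # zero when the rays are parallel
    if denom < eps * a * c:
        raise ValueError("rays are (nearly) parallel")
    s = (b * e - c * d) / denom    # parameter along ray A
    t = (a * e - b * d) / denom    # parameter along ray B
    point_a = origin_a + s * dir_a
    point_b = origin_b + t * dir_b
    return 0.5 * (point_a + point_b), np.linalg.norm(point_a - point_b)

# Two hypothetical rays that pass 0.2 units apart near the point (0, 0, 5).
midpoint, gap = closest_approach_midpoint([-1.0, 0.0, 0.0], [1.0, 0.0, 5.0],
                                          [1.0, 0.2, 0.0], [-1.0, 0.0, 5.0])

# Residual against a known 3D position, as in the calibration error check.
known_point = np.array([0.0, 0.0, 5.0])
residual = midpoint - known_point
```

The returned gap is a useful sanity check in its own right: a large gap suggests a bad point pair or a poor calibration.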
Provided that the corresponding image-point pairs are selected carefully, back-projection typically gives accurate positions for the calibration pattern. The calibration pattern may also be used to compute the extrinsic and intrinsic camera parameters.
Step 3: Stereo Calibration
Next, the stereo system itself needed to be calibrated by finding the rotation matrix and translation vector that map one camera's coordinate frame onto the other's, so that points seen by both cameras could be expressed in a common coordinate system. Finding this transformation yielded an equation relating the coordinates of a point in the left camera frame to its coordinates in the right camera frame:
Pr = R(Pl - T)
where Pr holds the coordinates of a 3D point in the right camera frame, Pl holds the coordinates of the same point in the left camera frame, and R and T are the rotation matrix and translation vector that map one camera to the other.
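As a small worked example of this relation, the sketch below applies Pr = R(Pl - T) and its inverse, Pl = R^T Pr + T. It reads Pl and Pr as the coordinates of the same 3D point in the left and right camera frames, which is an interpretation assumed here. The rotation angle and baseline are made-up values.

```python
import numpy as np

def left_to_right(p_left, R, T):
    """Apply Pr = R (Pl - T)."""
    return R @ (p_left - T)

def right_to_left(p_right, R, T):
    """Invert the relation: Pl = R^T Pr + T (R is orthonormal)."""
    return R.T @ p_right + T

# Hypothetical stereo rig: 10 degree rotation about the y axis, 0.2 unit baseline.
angle = np.deg2rad(10.0)
R = np.array([[ np.cos(angle), 0.0, np.sin(angle)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(angle), 0.0, np.cos(angle)]])
T = np.array([0.2, 0.0, 0.0])

p_left = np.array([0.1, -0.05, 2.0])
p_right = left_to_right(p_left, R, T)
p_left_again = right_to_left(p_right, R, T)

# With this convention, T is the right camera's center in the left frame.
right_center_in_right_frame = left_to_right(T, R, T)
```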
Step 4: Hough Transform
The Hough Transform is a technique used to identify the parameters of prominent lines in an image. It takes an image as input and generates a second image that plots the prominence of candidate lines within the input. Because lines in an image may be short or long and may contain gaps, detecting them directly can be difficult. The Hough Transform turns this difficult structure-detection problem into a simple peak-detection problem in the generated Hough Transform image (shown below). In this project, the Hough Transform was used to detect the lines along the edges of the black boxes in the calibration pattern. An example of the Hough Transform for our calibration box is plotted below:
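As a rough illustration of the voting idea (not the project's code), the sketch below builds a Hough accumulator for a binary edge image. It assumes the common parameterization rho = x cos(theta) + y sin(theta), with rho rounded to whole pixels. The function name and the synthetic test image are hypothetical.

```python
import numpy as np

def hough_lines(edge_mask, n_theta=180):
    """Accumulate votes for lines rho = x*cos(theta) + y*sin(theta)."""
    rows, cols = edge_mask.shape
    diag = int(np.ceil(np.hypot(rows, cols)))        # largest possible |rho|
    thetas = np.deg2rad(np.arange(n_theta) * 180.0 / n_theta)  # [0, pi)
    rhos = np.arange(-diag, diag + 1)
    accumulator = np.zeros((len(rhos), n_theta), dtype=np.int64)
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    theta_idx = np.arange(n_theta)
    ys, xs = np.nonzero(edge_mask)
    for x, y in zip(xs, ys):
        # Each edge pixel votes once per theta, for the rho it implies.
        rho = np.rint(x * cos_t + y * sin_t).astype(int)
        accumulator[rho + diag, theta_idx] += 1
    return accumulator, rhos, thetas

# Synthetic edge image containing one vertical line at x = 20.
edges = np.zeros((50, 50), dtype=bool)
edges[:, 20] = True

accumulator, rhos, thetas = hough_lines(edges)
rho_idx, theta_idx = np.unravel_index(np.argmax(accumulator), accumulator.shape)
peak_rho, peak_theta = rhos[rho_idx], thetas[theta_idx]
peak_votes = accumulator[rho_idx, theta_idx]
```

The strongest accumulator cell gives the parameters of the most prominent line, so a line with gaps still produces a clear peak as long as enough of its pixels vote.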
Step 5: Correlation Matching
The goal of correlation matching is to match specific pixel locations in one image to pixel locations in the second image. The Hough Transform helps with correlation matching by extracting these specific pixel locations in both images so they can be matched. An edge detector was also used to isolate a specified number of high-intensity points as additional candidate points for correlation matching. The result of correlation matching is a set of point correspondences between the two stereo images. These correspondences are needed to automate the system rather than relying on user input to match points in the two stereo images.
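One common way to score candidate matches is normalized cross-correlation (NCC) between small patches. The sketch below is a minimal version that assumes grayscale images and candidate points lying at least `half` pixels from the image border. The source does not say which correlation measure the project used, so NCC, the function names, and the synthetic data are all assumptions.

```python
import numpy as np

def extract_patch(image, row, col, half):
    """Return the (2*half+1) square patch centered on (row, col)."""
    return image[row - half: row + half + 1, col - half: col + half + 1].astype(float)

def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equally sized patches, in [-1, 1]."""
    a = patch_a - patch_a.mean()
    b = patch_b - patch_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return 0.0 if denom == 0 else float((a * b).sum() / denom)

def match_points(left, right, left_pts, right_pts, half=3):
    """Pair each left point with the right candidate of highest NCC score."""
    matches = []
    for (lr, lc) in left_pts:
        ref = extract_patch(left, lr, lc, half)
        scores = [ncc(ref, extract_patch(right, rr, rc, half)) for (rr, rc) in right_pts]
        best = int(np.argmax(scores))
        matches.append(((lr, lc), right_pts[best], scores[best]))
    return matches

# Synthetic stereo pair: the right image is the left one shifted 4 columns.
rng = np.random.default_rng(0)
left_image = rng.random((64, 64))
right_image = np.roll(left_image, -4, axis=1)

left_points = [(20, 30), (40, 25)]
right_candidates = [(20, 26), (40, 21), (10, 10)]
matches = match_points(left_image, right_image, left_points, right_candidates)
```

Subtracting the mean and dividing by the patch energy makes the score insensitive to brightness and contrast differences between the two cameras.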
Step 6: 3D Reconstruction
Finally, once the camera system is well understood and point correspondences have been established, 3D reconstruction can take place. Triangulation-based reconstruction assumes that the system's intrinsic and extrinsic properties are known from system calibration, and that a number of point correspondences are available, either from user input or from correlation matching. Using the known parameters, a 3D location is triangulated for each point correspondence: a ray is back-projected from each image, and the reconstructed 3D point is taken as the midpoint between the two points where the rays pass closest to each other. The result of 3D reconstruction of a soda bottle using correspondence matching is shown below:
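To tie the steps together, here is a hedged end-to-end sketch that goes from one pixel correspondence to a 3D point in the left camera frame. It assumes both cameras share the same pinhole intrinsics, no lens distortion, and the convention Pr = R(Pl - T). The closest points on the two rays are found with a small least-squares solve. All names and numbers are illustrative rather than the project's own.

```python
import numpy as np

def pixel_to_ray(pixel, focal_px, center_px):
    """Direction (in the camera's own frame) of the ray through a pixel."""
    u, v = pixel
    return np.array([(u - center_px[0]) / focal_px,
                     (v - center_px[1]) / focal_px,
                     1.0])

def triangulate_midpoint(pixel_left, pixel_right, focal_px, center_px, R, T):
    """Midpoint of closest approach of the two back-projected rays (left frame)."""
    dir_left = pixel_to_ray(pixel_left, focal_px, center_px)
    # Rotate the right ray into the left frame; its origin there is T.
    dir_right = R.T @ pixel_to_ray(pixel_right, focal_px, center_px)
    # Find s, t minimizing |s*dir_left - (T + t*dir_right)|.
    A = np.column_stack([dir_left, -dir_right])
    (s, t), *_ = np.linalg.lstsq(A, T, rcond=None)
    return 0.5 * (s * dir_left + (T + t * dir_right))

def project(point_cam, focal_px, center_px):
    """Pinhole projection, used here only to synthesize test pixels."""
    return focal_px * point_cam[:2] / point_cam[2] + np.asarray(center_px)

# Hypothetical rig and scene point.
focal_px, center_px = 800.0, (320.0, 240.0)
angle = np.deg2rad(5.0)
R = np.array([[ np.cos(angle), 0.0, np.sin(angle)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(angle), 0.0, np.cos(angle)]])
T = np.array([0.3, 0.0, 0.0])
true_point_left = np.array([0.1, -0.05, 2.0])

pixel_left = project(true_point_left, focal_px, center_px)
pixel_right = project(R @ (true_point_left - T), focal_px, center_px)
reconstructed = triangulate_midpoint(pixel_left, pixel_right, focal_px, center_px, R, T)
```

With noise-free pixels the two rays intersect and the midpoint reproduces the original point. With real measurements the rays are skew and the midpoint is only an estimate.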