CS 180: Computer Vision, Fall 2024
Project 4: Mosaicing & Auto-Stitching
Rohan Gulati
SID: 3037864000
Computing Homographies
In this project, we blend pictures of scenes taken from a camera rotating about its axis. To do this, we use a select few correspondences to warp one scene onto another, which in this case requires 8 degrees of freedom per transformation, as affine transformations, with only 6 degrees of freedom, can't capture the perspective differences between the images. In short, we want to compute

T A = B

where T is our 3x3 transformation matrix and A & B are matrices of homogeneous coordinates, one column per point. T scales each output point in B by a certain factor (w) per point due to their varying distances from the camera, but we'll always be computing this w value simultaneously, allowing us to recover the true x' and y' for any given input.
Breaking this down for one pair of correspondences (x, y) -> (x', y'), with

T = [ a  b  c ]
    [ d  e  f ]
    [ g  h  i ]

we get:

w x' = a x + b y + c
w y' = d x + e y + f
w    = g x + h y + i

Substituting the third equation for w and rewriting these equations in matrix form, we get:

[ x  y  1  0  0  0  -x'x  -x'y  -x' ] v = 0
[ 0  0  0  x  y  1  -y'x  -y'y  -y' ] v = 0

where v = (a, b, c, d, e, f, g, h, i)^T.
Doing this for all n correspondences, we get a 2n x 9 matrix M, and we want to compute the vector v (a flattened T) such that Mv ≈ 0.
While exactly zero might not always be achievable, we can still minimize ||Mv|| using the matrix's singular value decomposition, M = U Σ V^T. The V matrix contains a series of singular vectors, each corresponding to a singular value on the diagonal of the Σ matrix, sorted in decreasing order of magnitude. Since V is orthonormal, multiplying M by any one of its singular vectors isolates the associated singular value: ||M v_k|| = σ_k.
By choosing the last singular vector as our vector v, we effectively choose our error to be the last (smallest) singular value, which is the minimum achievable over unit vectors and, in the examples below, essentially negligible. Intuitively, this last singular vector is also the direction M scales the most slowly in, so a vector along this direction is a good minimizer. To compute this, I used NumPy's np.linalg.svd() on M, which returns the U, Σ, and V (transposed) matrices. Lastly, I'd extract the last row of V^T and reshape it into the 3x3 matrix, T.
Since the magnitude of T would only change
the w values in the B matrix, we can scale it by any factor and maintain the same ability to compute x' and y'. For example, we can divide the matrix T by the last element "i", anchoring it to 1, resulting in 8 degrees of freedom for perspective warping.
Now we can go back and forth between A & B, using T and its inverse.
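The homography solve described above can be sketched in NumPy as follows (a minimal sketch; the function name and argument layout are just one way to organize it, and the final division anchors the last element to 1 as described):

```python
import numpy as np

def compute_homography(A, B):
    """Estimate the 3x3 homography T mapping points A -> B.

    A, B: (n, 2) arrays of corresponding (x, y) points, n >= 4.
    Builds the 2n x 9 matrix M and takes the singular vector of the
    smallest singular value as the flattened T.
    """
    rows = []
    for (x, y), (xp, yp) in zip(A, B):
        rows.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y, -xp])
        rows.append([0, 0, 0, x, y, 1, -yp * x, -yp * y, -yp])
    M = np.array(rows, dtype=float)
    # np.linalg.svd returns V transposed; its last row pairs with the
    # smallest singular value, i.e. the best minimizer of ||Mv||.
    _, _, Vt = np.linalg.svd(M)
    T = Vt[-1].reshape(3, 3)
    return T / T[2, 2]   # anchor the last element "i" to 1
```

With exactly 4 correspondences this recovers the homography exactly; with more, it gives the least-squares fit.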
Rectification
We can check that the warping works by rectifying planar surfaces so they appear fronto-parallel to us. I selected corners on the rectangles of interest (A) and then corresponding axis-aligned rectangular points (B), like [0, 0], [1500, 0], [0, 1500], [1500, 1500] for a square, or with a longer dimension for a rectangle. After computing a transformation T from A to B, I iterated through the pixels within the axis-aligned B rectangle and applied the inverse of T to each coordinate to find its source pixel in the original image. Since this usually landed between pixels, I used bilinear interpolation, weighting the surrounding pixels by the sample's position between them, for its final value in the B rectangle.
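The inverse warp with bilinear interpolation can be sketched like this (a minimal grayscale sketch; the name, argument shapes, and zero-fill for out-of-bounds samples are assumptions):

```python
import numpy as np

def inverse_warp(src, T, out_h, out_w):
    """Fill an out_h x out_w canvas by mapping each output pixel back
    through T^{-1} into src and bilinearly interpolating.

    src: (H, W) grayscale image; T: 3x3 homography from src to output.
    Pixels that land outside src stay 0.
    """
    H, W = src.shape
    Tinv = np.linalg.inv(T)
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])
    sx, sy, sw = Tinv @ coords
    sx, sy = sx / sw, sy / sw            # divide out w
    out = np.zeros(out_h * out_w)
    # keep only samples whose 2x2 neighborhood lies inside src
    ok = (sx >= 0) & (sx <= W - 1.001) & (sy >= 0) & (sy <= H - 1.001)
    x0, y0 = np.floor(sx[ok]).astype(int), np.floor(sy[ok]).astype(int)
    dx, dy = sx[ok] - x0, sy[ok] - y0
    # bilinear interpolation from the four surrounding pixels
    out[ok] = (src[y0, x0] * (1 - dx) * (1 - dy)
               + src[y0, x0 + 1] * dx * (1 - dy)
               + src[y0 + 1, x0] * (1 - dx) * dy
               + src[y0 + 1, x0 + 1] * dx * dy)
    return out.reshape(out_h, out_w)
```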
Memorial Stadium | Gameday @ the Glade
Memorial Stadium Correspondences | Glade Correspondences
Memorial Stadium Field | Home Depot
Mosaicing
For mosaicing, I would select at least 8 correspondences per image and then compute the transformation T between them.
Then, I would warp the left image into the "space" of the right image, which would remain flat.
Since I didn't know how big my canvas would have to be to contain both images, I warped the corners of the left image to create a silhouette in the right space.
My canvas here would just be bounded by the max and min of the corners of both the right image and the left's silhouette.
Then, same as before with rectification, I would go through the pixels within this silhouette and apply the inverse transform on each pixel to find where in the left image they
came from.
Once they are layered on top of each other correctly, I would blend them using a gradient mask, running the mask and the two images through Laplacian pyramids. The mask increases from 0 to 1 along the x-axis and is centered at the overlap of the two images, creating a smooth transition from one to the other.
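The feathering idea can be sketched as follows (a simplified sketch that does a plain alpha blend with a horizontal ramp; the writeup additionally runs the mask and images through Laplacian pyramids, which is omitted here, and the function name and overlap arguments are illustrative):

```python
import numpy as np

def feather_blend(left, right, overlap_x0, overlap_x1):
    """Blend two already-aligned canvases with a horizontal alpha ramp.

    left, right: (H, W) images on the same canvas; the mask rises from
    0 to 1 across columns [overlap_x0, overlap_x1), so the result
    transitions smoothly from the left image to the right image.
    """
    H, W = left.shape
    alpha = np.zeros(W)
    alpha[overlap_x1:] = 1.0
    alpha[overlap_x0:overlap_x1] = np.linspace(0.0, 1.0,
                                               overlap_x1 - overlap_x0)
    alpha = np.broadcast_to(alpha, (H, W))   # same ramp on every row
    return (1 - alpha) * left + alpha * right
```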
Mosaic 1: Rooftop
Left | Right
Left Correspondences | Right Correspondences
Left in "Right Space" | Right in "Right Space"
Mosaic (No Blending) | Mosaic (Feathered Blending)
Mosaic 2: Near Giannini Hall
Left | Right
Left Correspondences | Right Correspondences
Left in "Right Space" | Right in "Right Space"
Mosaic (No Blending) | Mosaic (Feathered Blending)
Mosaic 3: Sunset State Beach
Left | Right
Left Correspondences | Right Correspondences
Left in "Right Space" | Right in "Right Space"
Mosaic (No Blending) | Mosaic (Feathered Blending)
Auto Stitching
Here, we recreate methods from "Multi-Image Matching using Multi-Scale Oriented Patches" by Brown et al. to detect corresponding points between images and compute their best homography automatically, so images can be stitched without manually clicking correspondences. Below are the examples this method is tested on. As a note, the rooftop, VLSB, and dino images were downsampled using 3x3 or 4x4 blocks to keep the runtime of the Harris detection algorithm low. Also, the VLSB photo in particular was taken with a long exposure setting to capture details of the scene at night.
Rooftop Left | Rooftop Right
Beach Left | Beach Right
VLSB Left | VLSB Right
Dino Top | Dino Bottom
Harris Corner Detection
To locate good candidates for correspondences, we first use the Harris point detector, which finds points in the images with a high Harris corner strength. The initial set of generated corners is shown below.
Rooftop - Harris Corner Detection Output
Beach - Harris Corner Detection Output
VLSB - Harris Corner Detection Output
Dino - Harris Corner Detection Output
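The Harris corner strength behind these detections can be sketched in plain NumPy (a rough sketch, not the exact detector used for the outputs above; the central-difference gradients, 3x3 box window, and k = 0.05 are all assumptions):

```python
import numpy as np

def harris_response(im, k=0.05):
    """Harris corner strength for a grayscale image.

    Central-difference gradients, a 3x3 box window for the structure
    tensor M, and response R = det(M) - k * trace(M)^2.
    """
    Ix = np.zeros_like(im)
    Iy = np.zeros_like(im)
    Ix[:, 1:-1] = (im[:, 2:] - im[:, :-2]) / 2.0
    Iy[1:-1, :] = (im[2:, :] - im[:-2, :]) / 2.0

    def box3(a):
        # 3x3 box filter via shifted sums over a zero-padded copy
        p = np.pad(a, 1)
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3))

    Sxx, Syy, Sxy = box3(Ix * Ix), box3(Iy * Iy), box3(Ix * Iy)
    return Sxx * Syy - Sxy ** 2 - k * (Sxx + Syy) ** 2
```

The response is large at corners (both eigenvalues of M large), near zero in flat regions, and negative along edges.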
Adaptive Non-Maximal Suppression
Since many potential points are generated above, we want to reduce our set of candidate points while keeping them distributed evenly over the image. This is because more spread-out points better capture the perspective differences between the images, leading to better warping. To compute this, we want to find the points that are local maxima of Harris corner strength. For each point i, we compute r_i, the distance to the closest point that has a significantly larger corner strength than i, using a threshold c_robust set to 0.9. In code, we first filter for all points j such that h[j] * c_robust > h[i]. Then, among these, we compute the L1-norm (Manhattan distance) between each remaining point j and i, and set r_i to the minimum result. After sorting all of the points by their r values, we settle on the 500 points with the highest r values per image, representing the 500 points that had a relatively higher corner strength than the other corners in their area. The top 500 for each image are shown below.
Rooftop - ANMS Output
Beach - ANMS Output
VLSB - ANMS Output
Dino - ANMS Output
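The ANMS step described above can be sketched as (an O(n^2) sketch; the function and variable names are illustrative):

```python
import numpy as np

def anms(points, strengths, n_keep=500, c_robust=0.9):
    """Adaptive non-maximal suppression.

    points: (n, 2) array of corner coordinates; strengths: (n,) Harris
    strengths. For each point i, r_i is the L1 distance to the nearest
    point j with strengths[j] * c_robust > strengths[i]; keep the
    n_keep points with the largest r_i.
    """
    n = len(points)
    r = np.full(n, np.inf)   # undominated points keep r = inf
    for i in range(n):
        dominators = strengths * c_robust > strengths[i]
        if dominators.any():
            d = np.abs(points[dominators] - points[i]).sum(axis=1)
            r[i] = d.min()
    keep = np.argsort(-r)[:n_keep]
    return points[keep]
```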
Feature Description Extraction & Matching (Steps 2 & 3)
To extract the features per image, I used NumPy slicing to capture the 80 x 80 pixel window about a pixel and then sampled every 2 pixels per row and column (instead of the 40 x 40 window sampled every 5 pixels as mentioned in the paper, since the runtime difference was negligible and this provided more information). This looked like
im[y-40:y+40:2, x-40:x+40:2]
Next, to match the points in image A with points in image B, I used Lowe's ratio, which for a point in A compares the distance to its closest match in B with the distance to its second-closest match in B. Algorithmically, for each point-and-patch pair i in A, I iterate through all of the points and respective patches j in B, keeping track of the points j1 and j2 whose patches have the smallest and second-smallest L2 norms (e1-NN and e2-NN, respectively) when subtracted from the i'th patch. The idea is that a perfect correspondence has a close match with j1, indicated by a low e1-NN value, and a relatively poor match with j2, indicated by a higher e2-NN value. Thus, points with the lowest (ha) Lowe ratio, e1-NN / e2-NN, are the best correspondences.
The distribution of the Lowe ratio is shown per pair of images (points in the left image were matched with the closest in the right). I used these graphs to choose the number of points to include in RANSAC.
Additionally, the 20 correspondences with the lowest Lowe ratios are shown with lines drawn between them, and the patches of the lowest 10 are displayed as well.
Rooftop - Lowe Distribution
Rooftop - Feature Matched
Rooftop - Closest 10 patches output
Beach - Lowe Distribution
Beach - Feature Matched
Beach - Closest 10 patches output
VLSB - Lowe Distribution
VLSB - Feature Matched
VLSB - Closest 10 patches output
Dino - Lowe Distribution
Dino - Feature Matched
Dino - Closest 10 patches output
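The matching step can be sketched as follows (the ratio cutoff of 0.6 is an assumed value; in the writeup the cutoff was chosen per image pair from the distribution plots above):

```python
import numpy as np

def match_features(desc_a, desc_b, ratio=0.6):
    """Match flattened patch descriptors with Lowe's ratio test.

    desc_a: (na, d), desc_b: (nb, d) with nb >= 2. Returns a list of
    (i, j) index pairs where e1-NN / e2-NN < ratio.
    """
    matches = []
    for i, da in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - da, axis=1)
        j1, j2 = np.argsort(dists)[:2]     # nearest and second nearest
        if dists[j1] / dists[j2] < ratio:  # unambiguous match only
            matches.append((i, j1))
    return matches
```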
RANSAC-based robust homography
Now that we have our correspondences, we look for the best 4 correspondences (8 equations) between A and B to build our homography, using RANSAC. This algorithm filters our set of correspondences S by randomly selecting 4 correspondences each iteration, computing the homography, and applying it to all of the correspondences.
If applying the homography to a point in A leads to an output that is far from the corresponding point in B (beyond a threshold), we discard that point as an outlier, reducing our set of correspondences.
At the end, once our set of remaining "inliers" is small, we compute our final homography using these remaining correspondences.
Since the RANSAC algorithm chooses the initial set of correspondences randomly and narrows down from there, the resulting warp may at times not represent the true transformation.
To handle such random errors, I included a couple of changes. First, I used the Lowe distribution graphs above to choose how many of the best correspondences to include in the initial set while keeping their maximum Lowe ratio low. This is because applying RANSAC to the entire set, including the points with high Lowe ratios, would treat poorly matched points in A and B as correspondences, which would lead to inaccurate results.
Next, I experimented with a threshold decay: as the RANSAC algorithm iterated and reduced the set of inliers with similar homographies, I made the threshold stricter over time so that RANSAC chose the best inliers. This seemed to increase the frequency of arriving at a successful warp. As a note, setting the decay to 1 just runs the initial RANSAC algorithm without any decay. Below are the left images warped and layered onto the right images after RANSAC.
Rooftop - Auto Warping (No Blending)
Beach - Auto Warping (No Blending)
VLSB - Auto Warping (No Blending)
Dino - Auto Warping (No Blending)
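As a sketch, here is vanilla RANSAC for a homography, i.e. a decay of 1 in the scheme above (the iteration count and pixel threshold are assumed values, and this version keeps the largest consensus set each iteration rather than shrinking the candidate set over time):

```python
import numpy as np

def fit_homography(A, B):
    """Least-squares homography A -> B via the SVD nullspace trick."""
    rows = []
    for (x, y), (xp, yp) in zip(A, B):
        rows.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y, -xp])
        rows.append([0, 0, 0, x, y, 1, -yp * x, -yp * y, -yp])
    _, _, Vt = np.linalg.svd(np.array(rows, dtype=float))
    T = Vt[-1].reshape(3, 3)
    return T / T[2, 2]

def ransac_homography(A, B, n_iter=500, thresh=2.0, seed=None):
    """Sample 4 pairs, fit, count inliers within thresh pixels,
    keep the largest consensus set, then refit on those inliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(A), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(A), 4, replace=False)
        T = fit_homography(A[idx], B[idx])
        proj = (T @ np.c_[A, np.ones(len(A))].T).T
        proj = proj[:, :2] / proj[:, 2:3]          # divide out w
        inliers = np.linalg.norm(proj - B, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_homography(A[best_inliers], B[best_inliers]), best_inliers
```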
Mosaicing
I used the same feathering approach to blend the images together as I did with the manual stitching method.
Rooftop - Auto Stitched Mosaic | Rooftop - Manually Stitched Mosaic
Beach - Auto Stitched Mosaic | Beach - Manually Stitched Mosaic
The different exposures in the left and right images seem to have caused a visible seam between the two sides, leading to a flawed yet interesting stitch.
VLSB - Auto Stitched Mosaic
Dino - Auto Stitched Mosaic
Reflection
It was cool to see how useful linear algebra was for graphics and vision, and deriving a formula and implementing it for a case like stitching my own
custom photos together was pretty nice. For auto-stitching, reading through and recreating the methods of a research paper felt rewarding as well,
especially when I could condense the entire stitching process into an automatic recipe where you could feed sets of pictures on one end
for them to come out stitched on the other (without having to manually click on correspondences and check if they were placed and paired correctly).