CS 180: Computer Vision, Fall 2024

Project 4: Mosaicing & Auto-Stitching

Rohan Gulati

SID: 3037864000



Computing Homographies

In this project, we blend pictures of scenes taken from a camera rotating about a fixed axis. In order to do this, we use a select few correspondences to warp one scene onto another, which in this case requires 8 degrees of freedom per transformation, as affine transformations with 6 degrees of freedom can't capture the perspective differences between the images. In short, we want to compute

$$ T A = B $$
where T is our 3x3 transformation matrix and A & B are matrices of homogeneous coordinates. The transformation scales each point in B by a factor w that varies per point due to its distance from the camera, but since we always compute this w alongside the warped coordinates, we can recover the true x' and y' for any given input. Breaking this down for a single point, we get:

$$ \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} w x' \\ w y' \\ w \end{bmatrix} $$
Breaking down one pair of correspondences, we get:

$$ a x + b y + c = w x' $$
$$ d x + e y + f = w y' $$
$$ g x + h y + i = w $$

Substituting the third equation's expression for w into the first two gives two equations that are linear in the unknowns a through i:

$$ a x + b y + c - g x x' - h y x' - i x' = 0 $$
$$ d x + e y + f - g x y' - h y y' - i y' = 0 $$
Rewriting these equations in matrix form, we get:

$$ \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -x x' & -y x' & -x' \\ 0 & 0 & 0 & x & y & 1 & -x y' & -y y' & -y' \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \\ i \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} $$
Doing this for all n correspondences, we stack the rows into a 2n x 9 matrix M, and we want to compute the vector v (a flattened T) such that Mv ≈ 0:

$$ \begin{bmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1 x_1' & -y_1 x_1' & -x_1' \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -x_1 y_1' & -y_1 y_1' & -y_1' \\ & & & & \vdots & & & & \\ x_n & y_n & 1 & 0 & 0 & 0 & -x_n x_n' & -y_n x_n' & -x_n' \\ 0 & 0 & 0 & x_n & y_n & 1 & -x_n y_n' & -y_n y_n' & -y_n' \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \\ i \end{bmatrix} \approx \mathbf{0} $$
While exactly zero might not be achievable, we can still minimize ||Mv|| using the matrix's singular value decomposition, M = UΣVᵀ. The columns of V are the right singular vectors, each corresponding to a singular value on the diagonal of Σ, sorted in decreasing order. Since these vectors are orthonormal, multiplying M by one of them isolates that vector's contribution, and the result has magnitude equal to the associated singular value.

By choosing the last right singular vector to be our vector v, our error ||Mv|| becomes the last (smallest) singular value, which is the minimum achievable over unit vectors and, in the examples below, essentially negligible. Intuitively, this last singular vector is the direction in which M scales vectors the most slowly, so a vector along this direction is a good minimizer. To compute this, I used NumPy's np.linalg.svd() on M, which returns the U, Σ, and Vᵀ matrices. Lastly, I extract the last row of Vᵀ (the last right singular vector) and reshape it into the 3x3 matrix T.

Since scaling T only rescales the homogeneous outputs in B (the w values along with wx' and wy'), we can scale it by any factor and still recover the same x' and y'. For example, we can divide the matrix T by its last element "i", anchoring it to 1 and leaving 8 degrees of freedom for perspective warping.
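Putting the last few steps together, here is a minimal NumPy sketch of the homography solver (pts_A and pts_B are assumed to be n x 2 arrays of matched (x, y) points; the function name is illustrative):

```python
import numpy as np

def compute_homography(pts_A, pts_B):
    """Least-squares homography mapping pts_A -> pts_B via SVD (needs n >= 4 points)."""
    rows = []
    for (x, y), (xp, yp) in zip(pts_A, pts_B):
        # Each correspondence contributes two rows of the 2n x 9 matrix M.
        rows.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp, -xp])
        rows.append([0, 0, 0, x, y, 1, -x * yp, -y * yp, -yp])
    M = np.array(rows)

    # The right singular vector with the smallest singular value minimizes ||Mv||.
    _, _, Vt = np.linalg.svd(M)
    v = Vt[-1]                    # last row of V^T = last right singular vector

    T = v.reshape(3, 3)
    return T / T[2, 2]            # anchor the last element to 1 (8 degrees of freedom)
```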

Now we can go back and forth between A & B, using T and its inverse.






Rectification

We can check that the warping works by rectifying planar surfaces in the images so that they face us head-on. I selected corners of the rectangles of interest (A) and then corresponding axis-aligned points (B), like [0, 0], [1500, 0], [0, 1500], [1500, 1500] for a square, or with one longer dimension for a rectangle. After computing a transformation T from A to B, I iterated through the pixels within the axis-aligned B rectangle and applied the inverse of T to each coordinate to find the pixel it came from in the original image. Since this usually landed between pixels, I used bilinear interpolation over the surrounding pixels to compute its final value in the B rectangle.
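A rough, vectorized sketch of this inverse-warping step (assuming src is an H x W x 3 float image and T maps source coordinates into the output canvas; names are illustrative):

```python
import numpy as np

def inverse_warp(src, T, out_h, out_w):
    """Fill an out_h x out_w canvas by pulling pixels from src through T^-1,
    sampling with bilinear interpolation."""
    T_inv = np.linalg.inv(T)

    # Homogeneous (x, y, 1) coordinates of every canvas pixel.
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    sx, sy, w = T_inv @ np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    sx, sy = sx / w, sy / w                       # divide out the scale factor

    # Four neighboring source pixels around each sampled location.
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    valid = (x0 >= 0) & (y0 >= 0) & (x1 < src.shape[1]) & (y1 < src.shape[0])
    x0, y0, x1, y1 = x0[valid], y0[valid], x1[valid], y1[valid]
    wx, wy = (sx[valid] - x0)[:, None], (sy[valid] - y0)[:, None]

    # Bilinear blend of the four neighbors, weighted by sub-pixel position.
    vals = ((1 - wx) * (1 - wy) * src[y0, x0] + wx * (1 - wy) * src[y0, x1] +
            (1 - wx) * wy * src[y1, x0] + wx * wy * src[y1, x1])

    out = np.zeros((out_h * out_w, 3))
    out[valid] = vals
    return out.reshape(out_h, out_w, 3)
```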

Memorial Stadium
Gameday @ the Glade
Memorial Stadium Correspondences
Glade Correspondences
Memorial Stadium Field
Home Depot

Mosaicing

For mosaicing, I would select at least 8 correspondences per image and then compute the transformation T between them. Then, I would warp the left image into the "space" of the right image, which remains unwarped. Since I didn't know in advance how big my canvas would have to be to contain both images, I warped the corners of the left image to create its silhouette in the right image's space. My canvas is then just bounded by the max and min over the corners of the right image and the left image's warped silhouette.
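A small sketch of this canvas-sizing step (assuming H maps left-image coordinates into the right image's space; names are illustrative):

```python
import numpy as np

def canvas_bounds(H, left_shape, right_shape):
    """Warp the left image's corners with H, then take the min/max over those
    warped corners plus the right image's own corners to bound the mosaic canvas."""
    h, w = left_shape[:2]
    corners = np.array([[0, 0, 1], [w, 0, 1], [0, h, 1], [w, h, 1]], dtype=float).T
    wx, wy, ww = H @ corners
    warped = np.stack([wx / ww, wy / ww], axis=1)        # left corners in right space

    rh, rw = right_shape[:2]
    right_corners = np.array([[0, 0], [rw, 0], [0, rh], [rw, rh]], dtype=float)
    all_pts = np.vstack([warped, right_corners])

    (x_min, y_min), (x_max, y_max) = all_pts.min(axis=0), all_pts.max(axis=0)
    return x_min, y_min, x_max, y_max
```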

Then, just as with rectification, I would go through the pixels within this silhouette and apply the inverse transform to each pixel to find where in the left image it came from.

Once they are layered on top of each other correctly, I use a gradient mask and a Laplacian pyramid of the mask and the two images to blend them together. The mask increases from 0 to 1 along the x-axis and is centered at the overlap of the two images, creating a smooth transition from one image to the other.
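A minimal sketch of that gradient mask (the Laplacian-pyramid blend then combines the two images with this mask at each pyramid level; the ramp bounds here are illustrative):

```python
import numpy as np

def feather_mask(height, width, overlap_start, overlap_end):
    """Alpha mask that is 0 left of the overlap, 1 right of it, and ramps
    linearly from 0 to 1 across [overlap_start, overlap_end] along the x-axis."""
    ramp = np.clip((np.arange(width) - overlap_start) /
                   max(overlap_end - overlap_start, 1), 0.0, 1.0)
    return np.tile(ramp, (height, 1))

# A plain feathered blend at a single scale would then be:
#   blended = (1 - mask)[..., None] * left + mask[..., None] * right
```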

Mosaic 1: Rooftop

Left
Right
Left Correspondences
Right Correspondences
Left in "Right Space"
Right in "Right Space"
Mosaic (No Blending)
Mosaic (Feathered Blending)

Mosaic 2: Near Giannini Hall

Left
Right
Left Correspondences
Right Correspondences
Left in "Right Space"
Right in "Right Space"
Mosaic (No Blending)
Mosaic (Feathered Blending)

Mosaic 3: Sunset State Beach

Left
Right
Left Correspondences
Right Correspondences
Left in "Right Space"
Right in "Right Space"
Mosaic (No Blending)
Mosaic (Feathered Blending)

Auto Stitching

Here, we recreate methods from "Multi-Image Matching using Multi-Scale Oriented Patches" by Brown et al. in order to detect corresponding points between images and compute the best homography between them automatically, so the images can be stitched without manual input. Below are the examples this method is tested on. As a note, the rooftop, VLSB, and dino images were downsampled using 3x3 or 4x4 blocks to keep the runtime of the Harris detection step low. Also, the VLSB photo in particular was taken with a long exposure setting to capture details of the scene at night.

Rooftop Left
Rooftop Right
Beach Left
Beach Right
VLSB Left
VLSB Right
Dino Top
Dino Bottom

Harris Corner Detection

To locate good candidates for correspondences, we first use the Harris point detector, which finds points in the images with a high Harris corner strength. The initial set of generated corners is shown below.
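As a rough sketch of this step using scikit-image (the detector parameters here are illustrative, and the image is assumed to be a grayscale float array):

```python
import numpy as np
from skimage.feature import corner_harris, peak_local_max

def harris_corners(gray, min_distance=5):
    """Harris corner strength map plus its local maxima, used as candidate points."""
    h = corner_harris(gray)                                 # corner strength per pixel
    coords = peak_local_max(h, min_distance=min_distance)   # (row, col) of local maxima
    strengths = h[coords[:, 0], coords[:, 1]]
    return coords, strengths
```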

Rooftop - Harris Corner Detection Output
Beach - Harris Corner Detection Output
VLSB - Harris Corner Detection Output
Dino - Harris Corner Detection Output

Adaptive Non-Maximal Suppression

Since many potential points are generated above, we want to reduce our set of candidates while keeping the points distributed evenly over the image. This is because more spread-out points better capture the perspective differences between the images, leading to better warping. To do this, we look for points that are local maxima of the Harris corner strength.

To do this, for each point i, we compute r_i, the distance to the closest point whose corner strength is significantly larger than i's, using a threshold c_robust set to 0.9. In code, we first filter for all points j such that h[j] * c_robust > h[i]. Then, among these, we compute the L1 (Manhattan) distance between each remaining point j and i, and set r_i to the minimum. After sorting all of the points by their r values, we keep the 500 points with the highest r values per image: the points that dominate the largest neighborhoods around them. The top 500 for each image are shown below.
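Before the results, a sketch of the r_i computation (an O(n²) pairwise version; coords is an n x 2 array of (row, col) points and h their corner strengths):

```python
import numpy as np

def anms(coords, h, num_keep=500, c_robust=0.9):
    """Adaptive non-maximal suppression: keep the num_keep points whose nearest
    significantly-stronger neighbor (h[j] * c_robust > h[i]) is farthest away."""
    n = len(coords)
    r = np.full(n, np.inf)
    for i in range(n):
        stronger = h * c_robust > h[i]              # points that clearly dominate point i
        if stronger.any():
            # L1 (Manhattan) distance to each dominating point; r_i is the closest one.
            r[i] = np.abs(coords[stronger] - coords[i]).sum(axis=1).min()
    keep = np.argsort(-r)[:num_keep]                # largest suppression radii first
    return coords[keep]
```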


Rooftop - ANMS Output
Beach - ANMS Output
VLSB - ANMS Output
Dino - ANMS Output

Feature Descriptor Extraction & Matching (Steps 2 & 3)

To extract the features per image, I used numpy slicing to capture the 80 x 80 pixel window about each point and then downsampled every 2 pixels per row and column (instead of the 40x40 window & every-5-pixel sampling mentioned in the paper, since the runtime difference was negligible and this provided more information). This looked like im[y-40:y+40:2, x-40:x+40:2].
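As a small sketch (assuming integer pixel coordinates, and skipping points whose window would fall outside the image):

```python
import numpy as np

def extract_descriptors(im, points):
    """40x40 descriptors sampled every 2 pixels from an 80x80 window around each point."""
    descriptors, kept = [], []
    for y, x in points:
        # Only keep points whose full 80x80 window fits inside the image.
        if 40 <= y < im.shape[0] - 40 and 40 <= x < im.shape[1] - 40:
            patch = im[y - 40:y + 40:2, x - 40:x + 40:2]
            descriptors.append(patch.ravel())
            kept.append((y, x))
    return np.array(descriptors), np.array(kept)
```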

Next, to match the points in image A with points in image B, I use Lowe's ratio, which, for a point in A, compares the distance to its closest match in B with the distance to its second-closest match in B. Algorithmically, for each point & patch pair i in A, I iterate through all of the points & respective patches j in B, keeping track of the points j1 and j2 whose patches have the least and second-least L2 distances (e1-NN and e2-NN) from the i'th patch. The idea is that a true correspondence would have a close match with j1, indicated by a low e1-NN value, but a relatively poor match with j2, indicated by a higher e2-NN value. Thus, points with the lowest (ha) Lowe ratio, e1-NN / e2-NN, would be the best correspondences.
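A sketch of this matching step, using the descriptors from above (the ratio cutoff here is illustrative; in practice I picked how many matches to keep from the distribution plots below):

```python
import numpy as np

def match_features(desc_A, desc_B, ratio_cutoff=0.5):
    """Match each descriptor in A to B, keeping matches with a low 1-NN/2-NN ratio."""
    matches = []
    for i, d in enumerate(desc_A):
        dists = np.linalg.norm(desc_B - d, axis=1)     # L2 distance to every patch in B
        j1, j2 = np.argsort(dists)[:2]                 # nearest and second-nearest patches
        ratio = dists[j1] / dists[j2]                  # Lowe ratio e1-NN / e2-NN
        if ratio < ratio_cutoff:
            matches.append((i, j1, ratio))
    return sorted(matches, key=lambda m: m[2])         # best (lowest-ratio) matches first
```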

The distribution of the Lowe ratio is shown per pair of images (points in the left image were matched with the closest in the right). I used this graph to choose the number of points to include in RANSAC. Additionally, the 20 correspondences with the lowest Lowe ratios are shown with lines drawn between them, and the patches of the lowest 10 are displayed as well.

Rooftop - Lowe Distribution
Rooftop - Feature Matched
Rooftop - Closest 10 patches output
Beach - Lowe Distribution
Beach - Feature Matched
Beach - Closest 10 patches output
VLSB - Lowe Distribution
VLSB - Feature Matched
VLSB - Closest 10 patches output
Dino - Lowe Distribution
Dino - Feature Matched
Dino - Closest 10 patches output

RANSAC-based robust homography

Now that we have our correspondences, we look for the best 4 correspondences (8 equations) between A and B to build our homography, using RANSAC. This algorithm filters our set of correspondences S by randomly selecting 4 correspondences every iteration, computing the homography, and applying it to all of the correspondences. If applying the homography to a point in A leads to an output that is far from the corresponding point in B (beyond a threshold), we discard this outlier point, reducing our set of correspondences. At the end, once our set of remaining "inliers" is small, we compute our homography using these remaining correspondences.
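For reference, here's a sketch of the standard largest-inlier-set variant of RANSAC (my version instead prunes the correspondence set and tightens the threshold over iterations, as described below, but the reprojection test is the same; compute_homography is the SVD-based fit sketched earlier):

```python
import numpy as np

def ransac_homography(pts_A, pts_B, n_iters=1000, thresh=3.0):
    """Standard RANSAC: fit H to 4 random matches, count matches that reproject
    within thresh pixels, and refit H on the largest inlier set found."""
    A_h = np.hstack([pts_A, np.ones((len(pts_A), 1))])   # homogeneous A points
    best_inliers = np.zeros(len(pts_A), dtype=bool)

    for _ in range(n_iters):
        idx = np.random.choice(len(pts_A), 4, replace=False)
        H = compute_homography(pts_A[idx], pts_B[idx])

        proj = (H @ A_h.T).T
        proj = proj[:, :2] / proj[:, 2:3]                # divide out w
        err = np.linalg.norm(proj - pts_B, axis=1)       # reprojection error per match
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers

    # Final least-squares fit on all inliers of the best model.
    return compute_homography(pts_A[best_inliers], pts_B[best_inliers]), best_inliers
```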

Since the RANSAC algorithm chooses its initial set of correspondences randomly and narrows it down from there, the resulting warp may at times not be representative of the true transformation. To handle such random errors, I made a couple of changes. First, I used the Lowe distribution graphs above to choose a sufficient number of the best correspondences to include in the initial set while keeping their maximum Lowe ratio low. If RANSAC were applied to the entire set, including points with high Lowe ratios, it would treat poorly matched points in A and B as correspondences, leading to inaccurate results.

Next, I experimented with a threshold decay: as the RANSAC algorithm iterated and reduced the set of inliers with similar homographies, I made the distance threshold stricter so that only the best inliers survived. This seemed to increase the frequency of arriving at a successful warp. As a note, setting the decay to 1 just runs the initial RANSAC algorithm without any decay. Below, the warped left image is layered over the right image after RANSAC.

Rooftop - Auto Warping (No Blending)
Beach - Auto Warping (No Blending)
VLSB - Auto Warping (No Blending)
Dino - Auto Warping (No Blending)

Mosaicing

I used the same feathering approach to blend the images together as I did with the manual stitching method.

Rooftop - Auto Stitched Mosaic
Rooftop - Manually Stitched Mosaic
Beach - Auto Stitched Mosaic
Beach - Manually Stitched Mosaic

The different exposures in the left and right images seem to have caused a visible seam between the left and right sides, leading to a flawed yet interesting stitch.

VLSB - Auto Stitched Mosaic
Dino - Auto Stitched Mosaic

Reflection

It was cool to see how useful linear algebra was for graphics and vision, and deriving a formula and implementing it for a case like stitching my own custom photos together was pretty nice. For auto-stitching, reading through and recreating the methods of a research paper felt rewarding as well, especially when I could condense the entire stitching process into an automatic recipe where you could feed sets of pictures on one end for them to come out stitched on the other (without having to manually click on correspondences and check if they were placed and paired correctly).