2017 was an exciting year as we saw deep learning become the dominant paradigm for estimating geometry in computer vision.
Learning geometry has emerged as one of the most influential topics in computer vision over the last few years.
“Geometry is … concerned with questions of shape, size, relative position of figures and the properties of space” (Wikipedia).
We first saw end-to-end deep learning models for these tasks trained with supervised learning, for example depth estimation (Eigen et al. 2014), relocalisation (PoseNet 2015), stereo vision (GC-Net 2017) and visual odometry (DeepVO 2017). Deep learning excels at these applications for a few reasons. Firstly, it is able to learn higher-order features which reason over shapes and objects with larger context than point-based classical methods. Secondly, inference is very efficient: it simply runs a forward pass of a convolutional neural network which approximates an exact geometric function.
Over the last year, I’ve noticed epipolar geometry and reprojection losses improving these models, allowing them to learn with unsupervised learning. This means they can train without expensive labelled data by just observing the world. Reprojection losses have contributed to a number of significant breakthroughs which now allow deep learning to outperform many traditional approaches to estimating geometry. Specifically, photometric reprojection loss has emerged as the dominant technique for learning geometry with unsupervised (or self-supervised) learning. We’ve seen this across a number of computer vision problems:
- Monocular Depth: Reprojection loss for deep learning was first presented for monocular depth estimation by Garg et al. in 2016. In 2017, Godard et al. showed how to formulate left-right consistency checks to improve results.
- Optical Flow: this requires learning reprojection disparities in 2D and has been demonstrated by Yu et al. 2016, Ren et al. 2017 and Meister et al. 2018.
- Stereo Depth: in my PhD thesis I show how to extend our stereo architecture, GC-Net, to learn stereo depth with epipolar geometry & unsupervised learning.
- Localisation: I presented a paper at CVPR 2017 showing how to train relocalisation systems by learning to project 3D geometry from structure from motion models (Kendall & Cipolla 2017).
- Ego-motion: learning depth and ego-motion with reprojection loss now outperforms traditional methods like ORB-SLAM over short sequences under constrained settings (Zhou et al. 2017; Li et al. 2017).
- Multi-View Stereo: projection losses can also be used in a supervised setting to learn structure from motion, for example DeMoN and SfM-Net.
- 3D Shape Estimation: projection geometry also aids learning 3D shape from images in this work from Jitendra Malik’s group.
In this blog post I’d like to highlight the importance of epipolar geometry and how we can use it to learn representations of geometry with deep learning.
What is reprojection loss?
The core idea behind reprojection losses is using epipolar geometry to relate corresponding points in multi-view stereo imagery. To dissect this jargon-filled sentence: epipolar geometry relates the projection of 3D points in space to 2D images. This can be thought of as triangulation (see the figure below). The relation between two 2D images is defined by the fundamental matrix. If we choose a point in one image and know the fundamental matrix, then this geometry tells us that the same point must lie on a line in the second image, called the epipolar line (the red line in the figure below). The exact position of the correspondence on the epipolar line is determined by the 3D point’s depth in the scene.
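To make the epipolar constraint concrete, here is a minimal NumPy sketch. It builds a fundamental matrix from an assumed camera pose, projects one 3D point into both views, and checks that the second projection lies on the epipolar line of the first. The intrinsics, pose and 3D point are made-up illustrative values, and the helper names (`skew`, `fundamental_from_pose`, `project`) are hypothetical, not taken from any of the papers above.

```python
import numpy as np

def skew(t):
    """Cross-product matrix, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_pose(K1, K2, R, t):
    """F such that x2^T F x1 = 0, assuming X2 = R @ X1 + t."""
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def project(K, X):
    """Pinhole projection of a 3D point to homogeneous pixel coordinates."""
    x = K @ X
    return x / x[2]

# Assumed (illustrative) intrinsics and relative pose.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = 0.1  # radians, rotation about the y axis
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([-0.2, 0.0, 0.05])

F = fundamental_from_pose(K, K, R, t)

X = np.array([0.5, -0.2, 4.0])   # 3D point in the first camera's frame
x1 = project(K, X)               # pixel in image 1
x2 = project(K, R @ X + t)       # pixel in image 2

line = F @ x1                    # epipolar line (a, b, c): a*u + b*v + c = 0
residual = x2 @ line             # zero when x2 lies on the line
distance_px = abs(residual) / np.hypot(line[0], line[1])
```

Moving the point `X` along its viewing ray (changing only its depth) slides `x2` along the same line, which is why depth fixes the position of the correspondence.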
If these two images are from a rectified stereo camera then this is a special type of multi-view geometry, and the epipolar line is horizontal. We then refer to the corresponding point’s position on the epipolar line as disparity. Disparity is inversely proportional to metric depth.
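For a rectified pair, the inverse relation between disparity and depth is usually written as Z = f B / d, where f is the focal length in pixels, B the baseline between the two cameras and d the disparity in pixels. The sketch below assumes that standard pinhole model; the function name and the numeric values are illustrative only.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, eps=1e-6):
    """Convert disparity (pixels) to metric depth (metres) via Z = f * B / d.

    Disparities are clamped to `eps` to avoid division by zero; such
    pixels come out as very large depths (points near infinity).
    """
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    return focal_px * baseline_m / np.maximum(disparity_px, eps)

# Illustrative camera: 500 px focal length, 10 cm baseline.
depths = disparity_to_depth([5.0, 10.0, 20.0], focal_px=500.0, baseline_m=0.1)
```

Doubling the disparity halves the depth, so depth resolution degrades quickly for distant points where the disparity is only a few pixels.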
The standard reference for this topic is the textbook, “Multiple View Geometry in Computer Vision” Hartley and Zisserman, 2004.
One way of exploiting this is learning to match correspondences between stereo images along this epipolar line. This allows us to estimate pixel-wise metric depth. We can do this using photometric reprojection loss (Garg et al. 2016). The intuition behind reprojection loss is that pixels representing the same object in two different camera views look the same. Therefore, if we relate pixels, or determine correspondences between two views, the corresponding pixels should have identical RGB intensity values. The better the estimate of geometry, the more closely the photometric (RGB) pixel values will match. We can optimise for the geometry which gives matching pixel intensities between the images, known as minimising the photometric error.
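As a rough sketch of the idea for rectified stereo (not the exact formulation of any paper cited here), the loss can be computed by reconstructing the left image from the right image with a predicted disparity map and comparing the result with the real left image. The code assumes the convention that a pixel at column x in the left image appears at column x - d in the right image, and uses plain NumPy for clarity. Training a network this way would need a differentiable bilinear sampler from a deep learning framework instead. The function names are hypothetical.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Reconstruct the left view by sampling `right` at column x - d.

    right: (H, W, C) image, disparity: (H, W) in pixels.
    Returns the reconstruction and a mask of pixels whose sample
    location falls inside the right image.
    """
    h, w = disparity.shape
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity
    valid = (xs >= 0) & (xs <= w - 1)
    xs = np.clip(xs, 0, w - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    alpha = (xs - x0)[..., None]          # linear interpolation weight
    rows = np.arange(h)[:, None]
    recon = (1 - alpha) * right[rows, x0] + alpha * right[rows, x1]
    return recon, valid

def photometric_loss(left, right, disparity):
    """Mean absolute RGB error over pixels with a valid sample location."""
    recon, valid = warp_right_to_left(right, disparity)
    error = np.abs(left - recon).mean(axis=-1)
    return error[valid].mean()

# Synthetic example: the left view is the right view shifted by 3 pixels.
rng = np.random.default_rng(0)
right = rng.random((8, 16, 3))
left = np.roll(right, 3, axis=1)

loss_true = photometric_loss(left, right, np.full((8, 16), 3.0))
loss_wrong = photometric_loss(left, right, np.zeros((8, 16)))
```

On this toy pair the correct disparity gives a lower photometric error than the wrong one, which is the signal the network is trained on.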
An important property of these losses is that they are unsupervised. This means that we can learn these geometric quantities by observing the world, without expensive human-labelled training data. This is also known as self-supervised learning.
The papers listed at the start of this post extend this idea to optical flow, depth, ego-motion, localisation and more, all of which contain forms of epipolar geometry.
Does this mean learning geometry with deep learning is solved?
I think there are some shortcomings to reprojection losses.
Firstly, photometric reprojection loss makes a photometric consistency assumption. This means it assumes that the same surface has the same RGB pixel value between views. This assumption is usually valid for stereo vision, because both images are taken at the same time. However, this is not always the case for learning optical flow or multi-view stereo, because appearance and lighting change over time. This is because of occlusion, shadows and the dynamic nature of scenes.
Secondly, reprojection suffers from the aperture problem. The aperture problem is an unavoidable ambiguity of structure due to a limited field of view. For example, if we try to learn depth by photometric reprojection, our model cannot learn from areas with no texture, such as sky or featureless walls. This is because the reprojection loss is equal across areas of homogeneous texture. To resolve the correct reprojection we need context! This problem is usually resolved by a smoothing prior, which encourages the output to be smooth where there is no training signal, but this also blurs correct structure.
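One common form of such a smoothing prior is an edge-aware smoothness term, which penalises disparity gradients but down-weights the penalty where the image itself has strong gradients. The sketch below is an illustrative NumPy version of that general idea. The exact weighting and the function name are assumptions, not the formulation of a specific paper.

```python
import numpy as np

def edge_aware_smoothness(disparity, image):
    """Penalise disparity gradients, weighted down at image edges.

    disparity: (H, W), image: (H, W, C) with values roughly in [0, 1].
    """
    d_dx = np.abs(disparity[:, 1:] - disparity[:, :-1])
    d_dy = np.abs(disparity[1:, :] - disparity[:-1, :])
    # Mean absolute colour gradient of the image in each direction.
    i_dx = np.abs(image[:, 1:] - image[:, :-1]).mean(axis=-1)
    i_dy = np.abs(image[1:, :] - image[:-1, :]).mean(axis=-1)
    return (d_dx * np.exp(-i_dx)).mean() + (d_dy * np.exp(-i_dy)).mean()

# A disparity step in the middle of the image.
disparity = np.zeros((4, 6))
disparity[:, 3:] = 1.0

flat_image = np.zeros((4, 6, 3))      # textureless region
edge_image = np.zeros((4, 6, 3))
edge_image[:, 3:] = 1.0               # image edge aligned with the step

penalty_flat = edge_aware_smoothness(disparity, flat_image)
penalty_edge = edge_aware_smoothness(disparity, edge_image)
```

The same disparity step is penalised more heavily in the textureless image than where it coincides with an image edge. This fills in flat regions, but a real depth discontinuity with no visible edge still gets smoothed away, which is the blurring described above.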
Thirdly, we don’t need to reconstruct everything. Learning to reproject pixels is similar to an autoencoder: we learn to encode all parts of the world equally. However, for many practical applications, attention-based reasoning has been shown to be most effective. For example, in autonomous driving we don’t need to learn the geometry of building facades and the sky; we only care about the immediate scene in front of us. Reprojection losses, in contrast, treat all aspects of the scene equally.
How can we improve performance?
It is difficult to learn geometry alone; I think we need to incorporate semantics. There is some evidence that deep learning models learn semantic representations implicitly from patterns in the data. Perhaps our models could exploit this more explicitly?
I think we need to reproject into a better space than RGB photometric space. We would like this latent space to solve the problems above. It should have enough context to address the aperture problem, be invariant to small photometric changes and emphasise task-dependent importance. Training on the projection error in this space should result in a better-performing model.
After the flurry of exciting papers in 2017, I’m looking forward to further advances in 2018 in one of the hottest topics in computer vision right now.
I first presented the ideas in this blog post at the Geometry in Deep Learning Workshop at the International Conference on Computer Vision 2017. Thank you to the organisers for a great discussion.