Image Formation and the Thin Lens Equation

Learn the basics of image formation and the thin lens equation.


Before we even consider how to assemble 3D scenes, we should first consider how we observe our 3D scenes via 2D renders. Much like how a person sees the world through their eyes (and often a pair of glasses), we observe 3D data through 2D renders. The practice of rendering 3D scenes is a complex subject. First, we need to consider the physics of how real-world images are formed by light.


Images need no introduction. Most of us carry cameras in our pockets these days and capture images with the same ease with which we read and write. However, for pedagogical purposes, let’s define in strict terms what an image is.

For our purposes, an image is simply a regularly-spaced 2D grid composed of many rectangular bins that each contain a record of light. When a digital camera takes a photo, rays of light pass through the camera lens and strike a sensor. This sensor has a regular spacing of photosensors. These photosensors, called pixels, gather charge when receiving light, allowing the camera to estimate how much light struck a particular photosensor while the photo was taken. Think of the raw value at each pixel of the 2D image as an estimate of how much light was recorded at that part of the sensor. In the case of RGB images (i.e., color images), pixels record light across three separate wavelength bands: red (R), green (G), and blue (B). Images with only a single value for intensity are called grayscale images.

In the numpy or torch libraries, an image is most often represented as a 3D tensor, optionally batched into a 4D tensor of shape (N, H, W, C), where:

  • H is the number of rows (height)

  • W is the number of columns (width)

  • C is the number of channels (e.g., 3 for RGB and 1 for grayscale)

  • N is the (optional) batch size

For instance, when we say “one RGB image with dimensions 640 × 480,” that could equate to a tensor of shape (1, 480, 640, 3) or (480, 640, 3). Multiple images of the same dimensions and number of channels can be stacked into batches.
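The shapes above can be checked directly in numpy. The sketch below builds a blank 640 × 480 RGB image and stacks four copies into a batch (the all-zeros data is purely illustrative):

```python
import numpy as np

# One 640 x 480 RGB image: 480 rows (H), 640 columns (W), 3 channels (C).
image = np.zeros((480, 640, 3), dtype=np.uint8)
print(image.shape)  # (480, 640, 3)

# Stack 4 images of identical shape into a batch of shape (N, H, W, C).
batch = np.stack([image] * 4)
print(batch.shape)  # (4, 480, 640, 3)
```

Note that the row count (height) comes first in the tensor shape, even though we conventionally quote image dimensions width-first.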

Pinhole camera model

Our theory of image formation will be our bridge between the 2D space of an image and the 3D space of the observed world. As such, having a model of image formation is crucial to 3D machine learning. The simplest model for understanding image formation is the so-called pinhole camera model. Though it does not account for real-world camera effects like distortion, it provides a remarkably intuitive geometric model of how light moves through space to produce the images we capture.

Many of us have made a pinhole camera at home before. It usually consists of a dark chamber, such as a box with a flat surface on one end and a tiny hole, which we call the aperture, through which light can pass on the opposite end. The so-called camera obscura is one type of pinhole camera. As light passes through the aperture, it strikes the flat surface on the other end. Because the aperture is so small, each point on that surface receives light from only a narrow range of directions. The effect is that an inverted image of the outside world is cast upon this surface, and thus, we refer to it as the image plane.

The pinhole camera model depicting similar triangles on either side of the aperture O

In the figure above, O designates the location of the aperture. This marks the crossing point where light from the outside world (for instance, light emanating from the point P) passes through into the camera and strikes the image plane (e.g., at the point Q). Notice how the light follows a straight path in this model.

As mentioned previously, the great thing about the pinhole camera model is that it gives us a geometric understanding of our scene. From any point on our 2D image plane, we can draw a ray that passes through the aperture and into our 3D scene. Anything along that ray is a potential source of reflected or emitted light for our render. This is, in fact, how ray-based rendering works, and it also gives us, in the computer vision space, tools we can use to learn about a 3D scene from 2D images.

Note how we can relate distances on one side of the aperture to those on the other (e.g., from outside the camera to inside it, or vice versa) via similar triangles.

In other words, the ratio between b and a is the same as that between c and f: b/a = c/f.
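This ratio gives a simple way to compute projected sizes. In the sketch below, the variable names mirror the figure (b is the object's height, a its distance from the aperture, f the distance from the aperture to the image plane, and c the projected height); the numeric values are illustrative:

```python
# Similar triangles across the aperture: b / a = c / f,
# so the projected height on the image plane is c = f * b / a.
def projected_height(b: float, a: float, f: float) -> float:
    """Height on the image plane of an object of height b at distance a."""
    return f * b / a

# An object 2 m tall, 10 m away, with the image plane 0.05 m behind
# the aperture, projects to roughly 0.01 m on the image plane.
print(projected_height(2.0, 10.0, 0.05))
```

This is why distant objects appear smaller: the projected height c shrinks in proportion to the object distance a.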

The thin lens equation

In many cases, we can’t ignore the effects of the lens entirely. For instance, we’ll often encounter situations where focus and exposure are important to consider. In cases like these, we can apply the most basic of lens models: the thin lens equation. A lens is simply a piece of glass, often curved on both sides, which is inserted into the aperture of a camera. A so-called thin lens is one whose thickness is negligibly small compared to the radii of curvature of its two surfaces.

Diagram of a thin lens. R1 and R2 denote the radii of curvature of each side of the lens.

This lens focuses the light passing through it onto the image plane. Because of the geometry of the light passing through the lens, only light from objects at a certain distance from the aperture will reach the image plane in tight focus. In the figure above, each beam of light passes through the lens, changes direction via refraction (the change in direction of a light wave as it passes from one medium into another, e.g., from air into glass), and crosses over the optical axis on the other side. The distance between the lens and this crossing point is called the focal length. Due to this geometry, the focal length constrains the light entering from the outside world such that only light originating within a narrow band of distances in front of the lens will be tightly in focus.

The resulting effect is that light from outside of this region appears blurry and out of focus. Photographers refer to this region of focus as the depth of field. Light that arrives from outside of this region is spread over a disk known as the circle of confusion, a name that hints at the ambiguous appearance this unfocused light causes. This is precisely why out-of-focus portions of an image appear as if their details are blurred in a radius around their true position.

Two examples of light passing through a thin lens and focusing on an image plane

The thin lens equation relates the focal length to the object and image distances, telling us which parts of the world will be in focus on the image plane and which will not. This phenomenon is why photographers have to adjust the focus of their cameras to keep objects looking sharp, and they often exploit the effect to create visually stunning portraits and macro photography.

The thin lens equation relates the following properties:

  • focal length f: The distance between the aperture and the focal point.

  • image distance d_I: The distance between the aperture and the image plane.

  • object distance d_O: The distance between the aperture and the observed object.
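These quantities are related by the thin lens equation: 1/f = 1/d_O + 1/d_I. As a quick numeric sketch, we can solve for the image distance given a focal length and object distance (the values below are illustrative, not from the text):

```python
# Thin lens equation: 1/f = 1/d_O + 1/d_I (all distances in the same units).
f = 50.0      # focal length, mm (illustrative)
d_o = 2000.0  # object distance, mm

# Rearranging for the image distance: d_I = 1 / (1/f - 1/d_O).
d_i = 1.0 / (1.0 / f - 1.0 / d_o)
print(round(d_i, 2))  # 51.28
```

Notice that for an object far from the lens, the image distance comes out only slightly larger than the focal length; as the object approaches the lens, the image plane must move further back to stay in focus.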

Now, try to use the thin lens equation, 1/f = 1/d_O + 1/d_I, to estimate the object distance that will be in focus for a given focal length and image distance.

def thin_lens(
    f: float,
    di: float
) -> float:
    """Apply the thin lens equation to solve for the object distance.

    f: Focal length
    di: Distance between the aperture and the image plane
    """
    # 1/f = 1/d_O + 1/d_I  =>  d_O = 1 / (1/f - 1/d_I)
    return 1.0 / (1.0 / f - 1.0 / di)

While these models are simplistic and don’t account for many real-world effects, we’ll see that we can still use them effectively in 3D machine learning, often by using other techniques like camera calibration to coax our data so that it fits these models.