Mesh R-CNN

Overview

Mesh R-CNN is a landmark model in the world of 3D deep learning. It is one of the first computer vision models for 3D shape prediction that works on real-world images. Based on the modular R-CNN design, it relies on much of the same architecture but introduces a new mesh prediction branch with several key innovations. We’ll begin by introducing Mesh R-CNN, then delve into details on the mesh prediction branch, and follow up with some code examples. We lack the time, data, and compute needed to present the entire Mesh R-CNN project, but we’ll explore several code examples that implement the key components.

Introduction to Mesh R-CNN

First introduced in the paper “Mesh R-CNN” in 2019, the Mesh R-CNN architecture builds upon prior research into R-CNN models such as Faster R-CNN and Mask R-CNN. As a result, pretrained Mask R-CNN models can be used as the backbone to incorporate strong priors into Mesh R-CNN’s predictions. Mesh R-CNN adds a mesh prediction branch, which passes image features through two successive stages (a voxel branch followed by a mesh refinement branch) to predict a 3D mesh for each detected object.

Mesh R-CNN has a number of features that make it both innovative for its time and usable today. Some of these features include:

  • Predicts arbitrary (untextured) 3D meshes from a single image

  • Works on real-world images

  • Can use pretrained backbones (such as ResNet) that have been trained on general computer vision tasks

  • Doesn’t require a template mesh
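To make “untextured 3D mesh” concrete, here is a minimal sketch of a triangle mesh as plain data: a list of vertex positions plus a list of faces that index into it, with no colors or texture coordinates. The TriangleMesh class and its edges method are hypothetical names chosen for illustration; they are not part of the Mesh R-CNN codebase, which uses its own mesh data structures.

```python
from dataclasses import dataclass


@dataclass
class TriangleMesh:
    """Hypothetical container for an untextured triangle mesh."""
    vertices: list  # (x, y, z) positions, one tuple per vertex
    faces: list     # (i, j, k) triples of indices into `vertices`

    def edges(self):
        """Return the unique undirected edges implied by the faces."""
        unique = set()
        for i, j, k in self.faces:
            for a, b in ((i, j), (j, k), (k, i)):
                unique.add((min(a, b), max(a, b)))
        return sorted(unique)


# A tetrahedron: the simplest closed triangle mesh.
tetrahedron = TriangleMesh(
    vertices=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    faces=[(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)],
)

num_vertices = len(tetrahedron.vertices)
num_edges = len(tetrahedron.edges())
num_faces = len(tetrahedron.faces)
```

A template-based method would keep the faces fixed and only move the vertices of a predefined shape. Predicting without a template means both the vertex positions and the connectivity have to come from the model.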

Review of Mask R-CNN

Since Mesh R-CNN relies heavily on Mask R-CNN, we first briefly review the Mask R-CNN architecture. Mask R-CNN is built upon the Faster R-CNN architecture for object detection, which consists of two sequential stages:

  1. A region proposal network (RPN) uses a convolutional neural network to propose candidate bounding boxes.

  2. The RoIPool stage aggregates features from the bounding boxes for classification and regression.

Mask R-CNN’s key contribution is RoIAlign, a technique that enables accurate segmentation mask prediction. It replaces RoIPool, the operation Faster R-CNN uses to pool features from bounding boxes. An RoI, or region of interest, is a local region of the image (here, a candidate bounding box) from which features are gathered. RoIPool rounds the box and its bins to whole feature-map cells before pooling, which misaligns the pooled features with the image. RoIAlign skips the rounding and instead applies bilinear interpolation to the underlying feature map to compute a feature at each sampling point. This preserves pixel-level alignment, which is essential for segmentation tasks.
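The difference between the two operations can be sketched in pure Python on a single-channel feature map. This is a simplified illustration, not the Mask R-CNN implementation: the function names, the (y_min, x_min, y_max, x_max) box convention, the 2 × 2 output size, and the choice of four sample points per bin are all assumptions, and the box is assumed to lie inside the feature map. Here roi_pool snaps the box and its bins to whole cells before max pooling, while roi_align averages bilinearly interpolated samples taken at fractional positions.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at a real-valued (y, x)."""
    rows, cols = len(feature_map), len(feature_map[0])
    # Clamp so the four neighbouring cells always exist.
    y = min(max(y, 0.0), rows - 1.0)
    x = min(max(x, 0.0), cols - 1.0)
    y0, x0 = math.floor(y), math.floor(x)
    y1, x1 = min(y0 + 1, rows - 1), min(x0 + 1, cols - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
    bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
    return (1 - wy) * top + wy * bottom


def roi_align(feature_map, box, output_size=2, samples_per_bin=2):
    """Average a regular grid of bilinear samples inside each output bin (no rounding)."""
    y_min, x_min, y_max, x_max = box
    bin_h = (y_max - y_min) / output_size
    bin_w = (x_max - x_min) / output_size
    output = []
    for i in range(output_size):
        row = []
        for j in range(output_size):
            total = 0.0
            for sy in range(samples_per_bin):
                for sx in range(samples_per_bin):
                    y = y_min + (i + (sy + 0.5) / samples_per_bin) * bin_h
                    x = x_min + (j + (sx + 0.5) / samples_per_bin) * bin_w
                    total += bilinear_sample(feature_map, y, x)
            row.append(total / samples_per_bin ** 2)
        output.append(row)
    return output


def roi_pool(feature_map, box, output_size=2):
    """Max-pool each bin after snapping the box and bin edges to whole cells."""
    rows, cols = len(feature_map), len(feature_map[0])
    # First quantization: round the box corners to integer cells.
    y_min, x_min, y_max, x_max = (math.floor(c + 0.5) for c in box)
    bin_h = max(y_max - y_min, 1) / output_size
    bin_w = max(x_max - x_min, 1) / output_size
    output = []
    for i in range(output_size):
        # Second quantization: snap bin edges to integer cells.
        y0 = y_min + math.floor(i * bin_h)
        y1 = max(y_min + math.ceil((i + 1) * bin_h), y0 + 1)
        row = []
        for j in range(output_size):
            x0 = x_min + math.floor(j * bin_w)
            x1 = max(x_min + math.ceil((j + 1) * bin_w), x0 + 1)
            cells = [feature_map[y][x]
                     for y in range(max(y0, 0), min(y1, rows))
                     for x in range(max(x0, 0), min(x1, cols))]
            row.append(max(cells))
        output.append(row)
    return output


# A 4 x 4 feature map whose value equals its x coordinate (a linear ramp).
feature_map = [[float(x) for x in range(4)] for _ in range(4)]

# Two boxes that differ only by a sub-cell shift of 0.3 along x.
box_a = (0.0, 0.1, 2.0, 2.1)
box_b = (0.0, 0.4, 2.0, 2.4)

pooled_a, pooled_b = roi_pool(feature_map, box_a), roi_pool(feature_map, box_b)
aligned_a, aligned_b = roi_align(feature_map, box_a), roi_align(feature_map, box_b)
```

On this ramp, both boxes round to the same cells, so pooled_a and pooled_b are identical and the 0.3 shift is lost. The RoIAlign outputs differ by that same 0.3 in every bin, which is the kind of sub-cell precision a mask (or mesh) prediction head depends on.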
