Mask R-CNN

Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick, ICCV 2017
https://github.com/facebookresearch/Detectron

What’s different?

  • Models so far
    R-CNN: two-stage model for object detection
    Fast R-CNN: RoI pooling on the feature map
    Faster R-CNN: RPN (region proposal network)

  • Instance Segmentation
    Combines two tasks:
    • Object detection (Fast/Faster R-CNN): classify individual objects and localize each with a bounding box.
    • Semantic segmentation (FCN, fully convolutional network): classify each pixel into a fixed set of categories without differentiating object instances.
  • Mask R-CNN:
    1) Model for instance segmentation: a mask prediction branch
    2) FPN (feature pyramid network) before the RPN
    3) RoIAlign

Mask prediction

  • Mask loss
    In the second stage, in parallel with predicting the class and box offset, Mask R-CNN also outputs a binary mask (ones on the object, zeros elsewhere) for each RoI. A multi-task loss is defined on each sampled RoI: $L = L_{cls} + L_{box} + L_{mask}$
    The mask branch has a $Km^2$-dimensional output for each RoI, encoding K binary masks of resolution $m \times m$, one for each of the K classes. To this we apply a per-pixel sigmoid and define $L_{mask}$ as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, $L_{mask}$ is only defined on the k-th mask (the other mask outputs do not contribute to the loss); see the sketch after this list.

  • Decouples mask and class prediction
    This definition of the mask loss allows the network to generate masks for every class without competition among classes; the dedicated classification branch predicts the class label used to select the output mask.
    With a per-pixel sigmoid and a binary loss, masks do not compete across classes. This is in contrast to FCNs for semantic segmentation, which use a per-pixel softmax and a multinomial cross-entropy loss.

  • Mask Representation
    Unlike class labels or box offsets, the spatial structure of masks can be extracted naturally through the pixel-to-pixel correspondence provided by convolutions.
    Predicting an $m \times m$ mask from each RoI with an FCN allows each layer in the mask branch to maintain the $m \times m$ object spatial layout without collapsing it into a vector representation that lacks spatial dimensions.
    This pixel-to-pixel behavior requires the RoI features, which are small cropped feature maps, to be well aligned to faithfully preserve the explicit per-pixel spatial correspondence; hence RoIAlign.
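
A minimal PyTorch sketch of the mask loss (function and tensor names are illustrative, not Detectron's API): for each sampled RoI, only the mask channel of its ground-truth class enters the per-pixel sigmoid and binary cross-entropy.

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_classes, gt_masks):
    """L_mask for a batch of sampled RoIs.

    mask_logits: (N, K, m, m) raw outputs of the mask branch,
                 one m x m map per class.
    gt_classes:  (N,) ground-truth class index k of each RoI.
    gt_masks:    (N, m, m) binary ground-truth masks, resampled to m x m.
    """
    n = mask_logits.shape[0]
    # Keep only the k-th mask per RoI; the other K-1 outputs do not
    # contribute to the loss, so classes never compete for pixels.
    picked = mask_logits[torch.arange(n), gt_classes]          # (N, m, m)
    # Per-pixel sigmoid + average binary cross-entropy.
    return F.binary_cross_entropy_with_logits(picked, gt_masks.float())

# Multi-task loss on each sampled RoI: L = L_cls + L_box + L_mask
```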

RoIAlign

  • RoIPool (or RoI pooling)
    Quantizes a floating-number RoI to the discrete granularity of the feature map (integerizing by rounding); the result is then subdivided into spatial bins, and finally the feature values covered by each bin are aggregated (usually by max pooling).
    • Problem: these quantizations introduce misalignments between the RoI and the extracted features. This may not impact classification, which is robust to small translations, but it has a large negative effect on predicting pixel-accurate masks.
  • RoIAlign layer
    Instead of any quantization of the RoI boundaries or bins, use bilinear interpolation (as in spatial transformer networks) to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average).
    [Figure: bilinear interpolation example, from the CS231n lecture]
    The feature $f_{xy}$ at a point $(x, y)$ is a linear combination of the features at its four neighboring grid cells:
    $f_{xy} = \sum_{i,j=1}^{2} f_{ij} \max(0, 1 - \left\vert x - x_i \right\vert) \max(0, 1 - \left\vert y - y_j \right\vert)$

  • RoIAlign improves mask accuracy by a relative 10% to 50%, with bigger gains under stricter localization metrics; a sketch of the sampling follows below.
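
A minimal NumPy sketch of the two pieces (illustrative only, not Detectron's implementation): bilinear interpolation at a continuous point, and one RoIAlign bin that averages four regularly sampled points without any rounding of coordinates.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Value of feature map feat (H, W) at a continuous point (x, y):
    a weighted sum of the four neighboring grid cells, matching
    f_xy = sum_ij f_ij * max(0, 1-|x-x_i|) * max(0, 1-|y-y_j|)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    val = 0.0
    for xi in (x0, x0 + 1):          # the four neighboring grid cells
        for yj in (y0, y0 + 1):
            if 0 <= xi < feat.shape[1] and 0 <= yj < feat.shape[0]:
                w = max(0.0, 1 - abs(x - xi)) * max(0.0, 1 - abs(y - yj))
                val += w * feat[yj, xi]
    return val

def roi_align_bin(feat, x1, y1, x2, y2):
    """One RoIAlign bin: average of 4 regularly sampled points inside the
    un-rounded bin [x1, x2] x [y1, y2]; RoIPool would instead round the
    bin boundaries to integers before max-pooling."""
    xs = [x1 + (x2 - x1) * f for f in (0.25, 0.75)]
    ys = [y1 + (y2 - y1) * f for f in (0.25, 0.75)]
    return np.mean([bilinear_sample(feat, x, y) for y in ys for x in xs])
```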

Network Architecture

  • backbone: Faster R-CNN with an FPN (ResNet-FPN)
    • FPN, feature pyramid network (Lin et al.):
      Uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input.
    • RPN:
      The RPN runs on each FPN level; RoIAlign then extracts each RoI's features from the pyramid level matching its scale (the FPN paper assigns an RoI of size $w \times h$ to level $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$).
  • head:
    [Figure: head architectures for the C4 and FPN backbones]
    Add a fully convolutional mask prediction branch, extending the Faster R-CNN box heads from the ResNet and FPN papers, and train with the additional mask loss; a sketch of the FPN-style mask head follows below.
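
A minimal PyTorch sketch of the FPN-style mask head (layer sizes follow the paper; class and attribute names are illustrative): a small FCN on each RoI's 14×14 RoIAlign features that predicts K masks of 28×28.

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskHead(nn.Module):
    """Mask branch for the FPN backbone: a small FCN applied to each RoI's
    14x14 RoIAlign features, predicting K masks of 28x28 (one per class)."""
    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        layers = []
        for _ in range(4):  # four 3x3 convs keep the m x m spatial layout
            layers += [nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU()]
            in_channels = 256
        self.fcn = nn.Sequential(*layers)
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)  # 14 -> 28
        self.mask_logits = nn.Conv2d(256, num_classes, 1)  # K maps per RoI

    def forward(self, roi_feats):            # (N, 256, 14, 14)
        x = self.fcn(roi_feats)
        x = F.relu(self.upsample(x))
        return self.mask_logits(x)           # (N, K, 28, 28)
```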

Experiments

  • Comparison to the state-of-the-art methods in instance segmentation

[Table/figures: instance segmentation mask AP on COCO and example results]

  • Ablations
    • Architecture
      [Table: backbone architecture ablation]
    • Multinomial vs. Independent Masks
      [Table: multinomial vs. independent masks]
    • Class-Specific vs. Class-Agnostic Masks
      Interestingly, Mask R-CNN with class-agnostic masks (predicting a single m×m output regardless of class) is nearly as effective as class-specific masks (the default; one m×m mask per class).
    • RoIAlign
      ResNet-50-C4 backbone (stride 16):
      [Table: RoIPool vs. RoIAlign on C4 features]
      ResNet-50-C5 backbone (stride 32):
      [Table: RoIPool vs. RoIAlign on C5 features]
      Note that with RoIAlign, using stride-32 C5 features is more accurate than using stride-16 C4 features. Used with FPN, which has finer multi-level strides, RoIAlign shows further gains.
    • Mask branch
      [Table: mask branch, MLP vs. FCN]
  • Bounding Box Detection Results
    Our approach largely closes the gap between object detection and the more challenging instance segmentation task.

[Table: object detection box AP on COCO]

Mask R-CNN for Human Pose Estimation

  • By modeling a keypoint's location as a one-hot mask and adapting Mask R-CNN to predict K masks, one for each of the K keypoint types, this framework can easily be extended to human pose estimation; see the sketch at the end of this section.

  • Main Results and Ablations:
    [Tables/figures: keypoint detection results on COCO and ablations]
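
A minimal PyTorch sketch of the keypoint variant's training loss (names are illustrative): each of the K keypoint types is a one-hot $m \times m$ mask, trained with cross-entropy over the $m^2$ locations, which encourages a single detected point per keypoint.

```python
import torch
import torch.nn.functional as F

def keypoint_loss(kp_logits, gt_locations, visible):
    """kp_logits:    (N, K, m, m) predicted heatmaps, one per keypoint type.
    gt_locations: (N, K) flattened index (y * m + x) of each ground-truth
                  keypoint in the m x m grid.
    visible:      (N, K) bool; loss is computed on visible keypoints only.
    """
    n, k, m, _ = kp_logits.shape
    # Softmax over the m^2 locations of each heatmap (one-hot target),
    # in contrast to the per-pixel sigmoid used for segmentation masks.
    loss = F.cross_entropy(
        kp_logits.view(n * k, m * m),
        gt_locations.view(n * k),
        reduction="none",
    )
    vis = visible.view(n * k).float()
    return (loss * vis).sum() / vis.sum().clamp(min=1)
```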

$\therefore$ We have a unified model that can simultaneously predict boxes, segments, and keypoints while running at 5 fps.