cs231n - Lecture 15. Detection and Segmentation
Computer Vision Tasks
- Image Classification: No spatial extent
- Semantic Segmentation: No objects, just pixels
- Object Detection / Instance Segmentation: Multiple objects
Semantic Segmentation
- Paired training data:
For each training image, each pixel is labeled with a semantic category.
- At test time, classify each pixel of a new image.
- Problem:
Classifying a single pixel in isolation provides no context information.
- Ideas:
- Sliding Window
Extract a patch from the full image, classify the center pixel with a CNN.
$\color{red}{(-)}$ Very inefficient; shared features between overlapping patches are not reused.
- Convolution
Encode the entire image with a conv net, and do semantic segmentation on top.
$\color{red}{(-)}$ CNN architectures often change the spatial size, but semantic segmentation requires the output size to be the same as the input size.
- Fully Convolutional
Design a network with only convolutional layers, without downsampling operators.
$\color{red}{(-)}$ Convolutions at the original image resolution are very expensive.
$\rightarrow$ Design a convolutional network with downsampling and upsampling.
- Downsampling: Pooling, strided convolution
- In-Network Upsampling: Unpooling, strided transposed convolution
- Unpooling:
Nearest Neighbor: copy the value into the whole extended region
“Bed of Nails”: place the value at a fixed position (e.g. top-left), pad the rest with zeros
- Max Unpooling: reuse the positions remembered by the corresponding earlier pooling layer, pad the rest with zeros
- Learnable Downsampling: Strided convolution
Each output is a dot product between the filter and the input
The stride gives the ratio between movement in the input and the output
- Learnable Upsampling: Transposed convolution
Each input value gives a weight for the filter
The output contains copies of the filter weighted by the input, summed where they overlap in the output
(Minimal PyTorch sketches of max unpooling and transposed convolution follow below.)
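A minimal PyTorch sketch of max unpooling (tensor sizes are illustrative): pooling with `return_indices=True` remembers the argmax positions, and `MaxUnpool2d` places values back at those positions, padding the rest with zeros.

```python
import torch
import torch.nn as nn

# Max pooling that remembers which position held each maximum.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 3, 8, 8)   # (N, C, H, W)
y, indices = pool(x)          # y: (1, 3, 4, 4)

# Place each value back at its remembered position; pad the rest with zeros.
x_up = unpool(y, indices)     # x_up: (1, 3, 8, 8)
print(y.shape, x_up.shape)
```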
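And a matching sketch of learnable upsampling with a strided transposed convolution; the channel count and filter size here are illustrative choices, not values fixed by the lecture.

```python
import torch
import torch.nn as nn

# Stride-2 transposed convolution: each input value weights a copy of the
# 4x4 filter, and overlapping copies are summed in the output.
upsample = nn.ConvTranspose2d(in_channels=64, out_channels=64,
                              kernel_size=4, stride=2, padding=1)

feat = torch.randn(1, 64, 16, 16)
out = upsample(feat)
print(out.shape)  # torch.Size([1, 64, 32, 32]): spatial size doubled
```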
- Summary
Label each pixel in the image with a category label
Don’t differentiate instances, only care about pixels
Object Detection
- Multiple Objects:
Each image needs a different number of outputs;
$\rightarrow$ Apply a CNN to many different crops of the image, CNN classifies each crop as object or background.
$\color{red}{(-)}$ Need to apply CNN to huge number of locations, scales, and aspect ratios, very computationally expensive.
R-CNN
- Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014
- 2-stage Detector: Region Proposal + Region Classification
- Image as input
- Crop bounding boxes with Selective Search
Warp the crops to the same pixel size for the CNN model
- Input the warped images into the CNN
- Run classification on each
- Algorithm:
- Region Proposals: Selective Search
Find “blobby” image regions that are likely to contain objects.
Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU.
- CNN:
Take a pre-trained CNN architecture, change the number of classes in the last classification layer (N detection classes + 1 background), and fine-tune on the object detection dataset. From each region proposal input, it outputs a fixed-length feature vector.
- SVM: Category-Specific Linear SVMs
Positive: ground-truth boxes
Negative: proposals with IoU under 0.3
Scores each feature vector per class, classifying whether each one is positive or negative (is_object).
- Non-Maximum Suppression: based on IoU
Intersection over Union: area of intersection divided by area of union.
If two boxes have IoU over 0.5, consider them proposals for the same object and keep only the one with the highest score (see the IoU/NMS sketch after this list).
- Bounding Box Regression: adjust the boxes from Selective Search
- Algorithm:
Assume a proposal box $P = (P_x, P_y, P_w, P_h)$ and a
ground-truth box $G = (G_x, G_y, G_w, G_h)$.
Define a transformation $d$ that maps $P$ close to $G$:
\(\hat{G}_x = P_w d_x(P) + P_x\)
\(\hat{G}_y = P_h d_y(P) + P_y\)
\(\hat{G}_w = P_w \exp(d_w(P))\)
\(\hat{G}_h = P_h \exp(d_h(P))\)
where $d_{\star}(P) = w_{\star}^T \phi_5(P)$ is modeled as a linear function (learnable weight vector $w_{\star}$) of the POOL5 features $\phi_5(P)$ of proposal $P$. We learn $w_{\star}$ by optimizing a regularized least-squares objective (ridge regression).
Learnable parameters appear in steps 2 (CNN), 3 (SVM), and 5 (bounding box regression).
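A minimal NumPy sketch of the decode step defined by the equations above (the box and offset values in the example are illustrative):

```python
import numpy as np

def decode_box(P, d):
    """Apply learned offsets d = (dx, dy, dw, dh) to a proposal
    P = (Px, Py, Pw, Ph) given in center/size form."""
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    Gx = Pw * dx + Px     # shift the center, scaled by the proposal size
    Gy = Ph * dy + Py
    Gw = Pw * np.exp(dw)  # log-space scaling keeps width/height positive
    Gh = Ph * np.exp(dh)
    return Gx, Gy, Gw, Gh

# Example: nudge a proposal right, slightly up, and 20% wider.
print(decode_box(P=(100.0, 100.0, 50.0, 80.0), d=(0.1, -0.05, 0.2, 0.0)))
```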
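The IoU/NMS sketch referenced above, in NumPy; the box format and the 0.5 threshold follow the notes, everything else is an illustrative implementation choice.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    remaining boxes that overlap it with IoU above the threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        if len(rest) == 0:
            break
        mask = np.array([iou(boxes[i], boxes[j]) <= thresh for j in rest])
        order = rest[mask]
    return keep
```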
- Summary:
Score: 53.7% mAP on PASCAL VOC 2010
Problems:
1. Low performance: images are warped to a fixed 224x224 size for AlexNet
2. Slow: runs the CNN on every candidate from Selective Search
3. Not GPU-optimized: Selective Search and the SVMs run outside the network
4. No backpropagation through the whole pipeline: computation is not shared
Fast R-CNN
- Girshick, “Fast R-CNN”, ICCV 2015
- Idea:
Pass the image through the convnet before cropping; crop the conv features instead.
- Algorithm:
- Pass the full image through a pre-trained CNN and extract feature maps.
- Get RoIs from a proposal method (Selective Search), crop them with RoI Pooling, and get fixed-size feature vectors.
- Pass the RoI feature vectors through fully connected layers and split into two branches:
1) a softmax that classifies the class of each RoI (no SVM used); 2) bounding box regression.
- Cropping Features: RoI Pool
- Project the RoI proposals (given in input-image coordinates) onto the CNN feature map.
- Divide each projected RoI into a fixed grid of subregions.
- Max-pool within each subregion.
$\rightarrow$ Region features are always the same size regardless of the input region size (see the sketch below).
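A minimal sketch using torchvision's `roi_pool`; the feature shape, box coordinates, and 1/16 scale are illustrative assumptions (1/16 is the typical stride of a VGG-style backbone):

```python
import torch
from torchvision.ops import roi_pool

# Backbone feature map for one image (channel count illustrative).
feats = torch.randn(1, 256, 50, 66)

# RoIs in (batch_index, x1, y1, x2, y2) form, in input-image coordinates.
rois = torch.tensor([[0.0,  10.0, 20.0, 200.0, 180.0],
                     [0.0, 300.0, 40.0, 480.0, 220.0]])

# spatial_scale projects image coordinates onto the feature map.
pooled = roi_pool(feats, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]): fixed size for every RoI
```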
Faster R-CNN
- Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
- Idea:
Fast R-CNN's runtime is dominated by the region proposals, which are not computed on the GPU.
Inserting a Region Proposal Network (RPN) makes the architecture end-to-end.
- Algorithm:
- Pass the full image through pre-trained CNN and extract feature maps.
- RPN:
For each of K anchor boxes of different sizes and scales at each point in the feature map, predict whether it contains an object (binary classification), and also predict corrections from the anchor to the ground-truth box (regress 4 numbers per anchor).
- Jointly train with 4 losses:
1) RPN classify object / not object
2) RPN regress box coordinates
3) Final classification score (object classes)
4) Final box coordinates
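A minimal sketch of an RPN head matching the first two losses above; the channel counts and K = 9 anchors are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """For K anchors at each feature-map location, predict 2K objectness
    logits (object / not object) and 4K box-correction numbers."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, num_anchors * 2, kernel_size=1)
        self.reg = nn.Conv2d(512, num_anchors * 4, kernel_size=1)

    def forward(self, feats):
        h = torch.relu(self.conv(feats))
        return self.cls(h), self.reg(h)

head = RPNHead()
scores, deltas = head(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)  # (1, 18, 38, 50), (1, 36, 38, 50)
```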
- Glossing over many details:
- Ignore overlapping proposals with non-max suppression
- How are anchors determined?
- How do we sample positive / negative samples for training the RPN?
- How to parameterize bounding box regression?
- Two-stage object detector:
- First stage: Run once per image
- Backbone network
- Region proposal network(RPN)
- Second stage: Run once per region
- Crop features: RoI pool/ align
- Predict object class
- Predict bbox offset
Single-Stage Object Detectors: YOLO / SSD / RetinaNet
- Algorithm:
- Divide the input image into a grid
- Imagine a set of B base boxes centered at each grid cell
- Within each grid cell:
- Regress from each of the B base boxes to a final box with 5 numbers(dx, dy, dh, dw, confidence)
- Predict scores for each of C classes(including background as a class)
- Looks a lot like RPN, but category-specific
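A minimal sketch of the single-stage output layout; S = 7, B = 2, C = 20 and the 1024-channel input are illustrative YOLO-style sizes, not values fixed by the notes:

```python
import torch
import torch.nn as nn

# For an S x S grid with B base boxes and C classes, each grid cell
# predicts B * 5 box numbers (dx, dy, dh, dw, confidence) + C class scores.
S, B, C = 7, 2, 20
head = nn.Conv2d(1024, B * 5 + C, kernel_size=1)

feats = torch.randn(1, 1024, S, S)  # backbone output over the grid
pred = head(feats)
print(pred.shape)  # torch.Size([1, 30, 7, 7])
```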
Instance Segmentation: Mask R-CNN
- He et al, “Mask R-CNN”, ICCV 2017
Open Source Frameworks
TensorFlow Detection API
Detectron2 (PyTorch)
Beyond 2D Object Detection
Object Detection + Captioning: Dense Captioning
Dense Video Captioning: adds a time dimension (timestep “T”)
Objects + Relationships: Scene Graphs
3D Object Detection
- 2D bounding box: (x, y, w, h)
$\rightarrow$ 3D oriented bounding box: (x, y, z, w, h, l, r, p, y), where r, p, y are roll, pitch, yaw
$\rightarrow$ Simplified bbox: no roll & pitch
- Simple Camera Model:
A point on the image plane corresponds to a ray in 3D space
A 2D bounding box on an image is a frustum in 3D space
Localize an object in 3D: the object can be anywhere in the camera viewing frustum (see the back-projection sketch below)
- Monocular Camera:
- Same idea as Faster R-CNN, but proposals are in 3D
- 3D bounding box proposal; regress 3D box parameters + class score
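A minimal pinhole back-projection sketch of the camera model above; the intrinsics matrix K and the box coordinates are illustrative assumptions:

```python
import numpy as np

# Illustrative pinhole intrinsics (fx, fy, cx, cy).
K = np.array([[720.0,   0.0, 620.0],
              [  0.0, 720.0, 180.0],
              [  0.0,   0.0,   1.0]])

def pixel_to_ray(u, v):
    """Direction of the viewing ray through pixel (u, v), in the camera frame."""
    d = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return d / np.linalg.norm(d)

# The rays through the four corners of a 2D box bound a 3D viewing frustum.
x1, y1, x2, y2 = 100, 120, 400, 300
corners = [(x1, y1), (x2, y1), (x1, y2), (x2, y2)]
print([pixel_to_ray(u, v) for u, v in corners])
```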
- 3D Shape Prediction: Mesh R-CNN