
cs231n - Lecture 15. Detection and Segmentation

Computer Vision Tasks

  • Image Classification: No spatial extent
  • Semantic Segmentation: No objects, just pixels
  • Object Detection / Instance Segmentation: Multiple objects

Semantic Segmentation

  • Paired training data:
    For each training image, each pixel is labeled with a semantic category.
  • At test time, classify each pixel of a new image.
  • Problem:
    Classifying a pixel from that single pixel alone provides no context information.
  • Idea:
    • Sliding Window
      Extract a patch around each pixel from the full image and classify the center pixel with a CNN.
      $\color{red}{(-)}$ Very inefficient; shared features between overlapping patches are not reused.
    • Convolution
      Encode the entire image with a conv net and do semantic segmentation on top.
      $\color{red}{(-)}$ CNN architectures usually change the spatial size, but semantic segmentation requires the output size to match the input size.
    • Fully Convolutional
      Design a network with only convolutional layers, without downsampling operators.
      $\color{red}{(-)}$ Convolutions at the original image resolution are very expensive
      $\rightarrow$ Design a convolutional network with downsampling and upsampling
  • Downsampling: Pooling, strided convolution
  • In-Network Upsampling: Unpooling, strided transpose convolution

  • Unpooling:
    Nearest Neighbor: copy each input value into every position of its output region
    “Bed of Nails”: place each value at a fixed position (no positional information), pad the rest with zeros

  • Max Unpooling: remember which positions held the maxima in the corresponding max-pooling layer; place values back at those positions, pad the rest with zeros
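
A minimal PyTorch sketch of max unpooling (values are illustrative): the pooling layer records the argmax positions, and the paired unpooling layer scatters values back to them.

```python
import torch
import torch.nn as nn

# Max-pool while remembering where each max came from, then unpool:
# the saved indices restore values to their original positions and
# everything else is filled with zeros.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
pooled, indices = pool(x)           # pooled: 1x1x2x2, indices: positions of maxima
restored = unpool(pooled, indices)  # 1x1x4x4, maxima back in place, zeros elsewhere
print(restored)
```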

  • Learnable Downsampling: Strided convolution
    Output is a dot product between filter and input
    Stride gives ratio between movement in input and output

  • Learnable Upsampling: Transposed convolution
    Each input value gives a weight for the filter
    Output contains copies of the filter weighted by the input, summed where they overlap in the output
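
A minimal PyTorch sketch of learnable 2x upsampling with a transposed convolution (filter size and stride are illustrative); the one-hot check confirms the "weighted filter copies" view.

```python
import torch
import torch.nn as nn

# 2x upsampling: stride-2 transposed conv with a 4x4 filter. Each input
# value stamps a copy of the filter into the output, scaled by that
# value; overlapping copies are summed.
upsample = nn.ConvTranspose2d(1, 1, kernel_size=4, stride=2, padding=1, bias=False)

x = torch.randn(1, 1, 8, 8)
print(upsample(x).shape)  # torch.Size([1, 1, 16, 16])

# Sanity check: a single nonzero input reproduces the filter itself,
# scaled by that input value.
one_hot = torch.zeros(1, 1, 8, 8)
one_hot[0, 0, 3, 3] = 2.0
out = upsample(one_hot)
print(torch.allclose(out.abs().sum(), 2.0 * upsample.weight.abs().sum()))  # True
```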

  • Summary
    Label each pixel in the image with a category label
    Don’t differentiate instances, only care about pixels
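
Putting the pieces together, a minimal PyTorch sketch of a fully-convolutional segmentation network; the toy architecture here (layer widths, 4x downsampling) is hypothetical, not the lecture's exact design.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Strided convs downsample 4x, transposed convs upsample back,
    so the output is a per-pixel score map over num_classes."""
    def __init__(self, num_classes):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),   # H/2
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # H/4
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # H/2
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1),    # H
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))  # (N, C, H, W) class scores

net = TinySegNet(num_classes=21)
scores = net(torch.randn(1, 3, 224, 224))    # (1, 21, 224, 224)
# Train with per-pixel cross-entropy against a (N, H, W) label map.
loss = nn.CrossEntropyLoss()(scores, torch.zeros(1, 224, 224, dtype=torch.long))
```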

Object Detection


  • Multiple Objects:
    Each image needs a different number of outputs;
    $\rightarrow$ Apply a CNN to many different crops of the image, CNN classifies each crop as object or background.
    $\color{red}{(-)}$ Need to apply CNN to huge number of locations, scales, and aspect ratios, very computationally expensive.

R-CNN


  • Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014

  • 2-stage Detector: Region Proposal + Region Classification
    1. Image as input
    2. Crop bounding boxes proposed by Selective Search
      Warp each crop to a fixed pixel size for the CNN
    3. Feed the warped crops into the CNN
    4. Run classification on each crop
  • Algorithm:
    1. Region Proposals: Selective Search
      Find “blobby” image regions that are likely to contain objects.
      Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU.
    2. CNN:
      For a pre-trained CNN architecture, change the number of classes on the last classification layer (N detection classes + 1 background class) and fine-tune on the object-detection dataset. From each warped region proposal, the CNN outputs a fixed-length feature vector.
    3. SVM: Category-Specific Linear SVMs
      Positive: ground-truth boxes
      Negative: proposals with IoU under 0.3 with every ground-truth box
      One linear SVM per class scores each feature vector, classifying the region as positive/negative (contains that object or not).
    4. Non-Maximum Suppression: based on IoU
      Intersection over Union: area of intersection divided by area of union
      If two boxes have IoU over 0.5, assume they were proposed for the same object and keep only the one with the highest score (see the sketch after this list).
    5. Bounding Box Regression: adjust boxes from Selective Search
      • Algorithm:
        Assume a bounding box $P^i = (P_x^i, P_y^i, P_w^i, P_h^i)$,
        Ground-truth box $G = (G_x, G_y, G_w, G_h)$.
        Define a transformation $d$ that maps $P$ close to $G$:
        \(\hat{G}_x = P_w d_x(P) + P_x\)
        \(\hat{G}_y = P_h d_y(P) + P_y\)
        \(\hat{G}_w = P_w \exp(d_w(P))\)
        \(\hat{G}_h = P_h \exp(d_h(P))\)
        where $d_{\star}(P) = w_{\star}^T \phi_5(P)$ is modeled as a linear function of the pool5 features $\phi_5(P)$ of proposal $P$, with a learnable weight vector $w_{\star}$. We learn $w_{\star}$ by optimizing a regularized least-squares objective (ridge regression).
        Learnable parameters appear in steps 2 (CNN), 3 (SVMs), and 5 (bbox regression).
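
Minimal NumPy sketches of steps 4 and 5 (illustrative; NMS takes boxes in (x1, y1, x2, y2) form): `apply_deltas` implements the $\hat{G}$ equations above given predicted deltas, and `nms` is the standard greedy variant.

```python
import numpy as np

def apply_deltas(P, d):
    """Apply deltas d = (d_x, d_y, d_w, d_h) to proposals P = (x, y, w, h),
    following the bounding-box regression equations above."""
    gx = P[:, 2] * d[:, 0] + P[:, 0]   # G^_x = P_w d_x(P) + P_x
    gy = P[:, 3] * d[:, 1] + P[:, 1]   # G^_y = P_h d_y(P) + P_y
    gw = P[:, 2] * np.exp(d[:, 2])     # G^_w = P_w exp(d_w(P))
    gh = P[:, 3] * np.exp(d[:, 3])     # G^_h = P_h exp(d_h(P))
    return np.stack([gx, gy, gw, gh], axis=1)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-max suppression over boxes given as (x1, y1, x2, y2)."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]     # process highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the kept box and every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop near-duplicates
    return keep
```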
  • Summary:
    Score: 53.7% mAP on PASCAL VOC 2010
    Problems:
    1. Low performance: warping proposals to a fixed 224x224 input for AlexNet distorts them
    2. Slow: each of the ~2000 Selective Search proposals is run through the CNN independently
    3. Not GPU-optimized: Selective Search and the SVMs run on the CPU
    4. No end-to-end training: the CNN, SVMs, and box regressors are trained separately, and computation is not shared between overlapping proposals

Fast R-CNN


  • Girshick, “Fast R-CNN”, ICCV 2015

  • Idea:
    Pass the image through the convnet before cropping; crop the conv features instead.

  • Algorithm:
    1. Pass the full image through pre-trained CNN and extract feature maps.
    2. Get RoIs from a proposal method (Selective Search) and crop them with RoI Pooling to get fixed-size feature vectors.
    3. Pass the RoI feature vectors through some fully connected layers and split into two branches:
    4. 1) a softmax classifier over RoI classes (no SVM needed); 2) a bounding-box regressor.
  • Cropping Features: RoI Pool
    1. Project RoI proposals (given in input-image coordinates) onto the CNN feature map.
    2. Divide each projected RoI into subregions.
    3. Max-pool within each subregion.
      $\rightarrow$ Region features are always the same size, regardless of the input region size (see the sketch below)
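
A minimal sketch of RoI Pooling using `torchvision.ops.roi_pool`; the feature-map size, stride, and box coordinates here are illustrative.

```python
import torch
from torchvision.ops import roi_pool

# Backbone feature map: 1 image, 256 channels, 50x50 spatial
# (e.g. an 800x800 input downsampled 16x, so spatial_scale = 1/16).
features = torch.randn(1, 256, 50, 50)

# One RoI in input-image coordinates: (batch_index, x1, y1, x2, y2).
rois = torch.tensor([[0, 100.0, 150.0, 400.0, 600.0]])

# Project the RoI onto the feature map, divide it into a 7x7 grid of
# subregions, and max-pool within each: fixed-size output per region.
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])
```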

Faster R-CNN


  • Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015

  • Idea:
    Fast R-CNN's runtime is dominated by computing region proposals, which runs on the CPU.
    Inserting a Region Proposal Network (RPN) that shares the conv features makes the architecture end-to-end.

  • Algorithm:
    1. Pass the full image through pre-trained CNN and extract feature maps.
    2. RPN:
      For K anchor boxes of different sizes and scales at each point in the feature map, predict whether each anchor contains an object (binary classification), and also predict a correction from the anchor to the ground-truth box (regress 4 numbers per anchor).
    3. Jointly train with 4 losses:
      1) RPN classify object / not object
      2) RPN regress box coordinates
      3) Final classification score (object classes)
      4) Final box coordinates
  • Glossing over many details:
    • Ignore overlapping proposals with non-max suppression
    • How are anchors determined?
    • How do we sample positive / negative samples for training the RPN?
    • How to parameterize bounding box regression?
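
On the anchor question: a minimal NumPy sketch of one common scheme (the paper uses 3 scales x 3 aspect ratios, so K = 9 anchors per feature-map location; the exact sizes here are illustrative).

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 sizes=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate K = len(sizes) * len(ratios) anchors, in image
    coordinates, centered at every feature-map location."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = x * stride, y * stride              # center in image coords
            for s in sizes:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)  # area s^2, aspect r
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)                             # (feat_h*feat_w*K, 4)

print(make_anchors(38, 50).shape)  # (38 * 50 * 9, 4) = (17100, 4)
```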
  • Two-stage object detector:
    • First stage: Run once per image
      • Backbone network
      • Region proposal network(RPN)
    • Second stage: Run once per region
      • Crop features: RoI pool/ align
      • Predict object class
      • Predict bbox offset

Single-Stage Object Detectors: YOLO / SSD / RetinaNet

  • Algorithm:
    1. Divide the input image into a grid
    2. Imagine a set of B base boxes centered at each grid cell
    3. Within each grid cell:
      • Regress from each of the B base boxes to a final box with 5 numbers (dx, dy, dh, dw, confidence)
      • Predict scores for each of C classes (including background as a class)
      • Looks a lot like an RPN, but category-specific (see the sketch below)
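
A minimal sketch of a YOLO-style output tensor and its decoding (the channel layout and activations are one common parameterization; details differ across YOLO, SSD, and RetinaNet).

```python
import torch

S, B, C = 7, 2, 20                       # grid size, base boxes, classes
pred = torch.randn(1, B * 5 + C, S, S)   # head output: (N, B*5 + C, S, S)

box_part = pred[:, :B * 5].reshape(1, B, 5, S, S)
dx = box_part[:, :, 0].sigmoid()         # x offset within the grid cell
dy = box_part[:, :, 1].sigmoid()         # y offset within the grid cell
dw = box_part[:, :, 2]                   # log-scale width delta vs. base box
dh = box_part[:, :, 3]                   # log-scale height delta vs. base box
conf = box_part[:, :, 4].sigmoid()       # per-box confidence
class_scores = pred[:, B * 5:].softmax(dim=1)  # (1, C, S, S) class scores per cell
```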


Instance Segmentation: Mask R-CNN

  • He et al, “Mask R-CNN”, ICCV 2017
    Extends Faster R-CNN with a parallel branch that predicts a binary segmentation mask for each RoI, alongside the class and box predictions; RoI features are cropped with RoIAlign instead of RoI Pool.

Open Source Frameworks

TensorFlow Detection API
Detectron2 (PyTorch)

Beyond 2D Object Detection

Object Detection + Captioning: Dense Captioning

Dense Video Captioning: localize and caption events along the time axis (adds a temporal dimension T)


Objects + Relationships: Scene Graphs


3D Object Detection


  • 2D bounding box: (x, y, w, h)
    $\rightarrow$ 3D oriented bounding box: (x, y, z, w, h, l, roll, pitch, yaw)
    $\rightarrow$ Simplified bbox: drop roll & pitch (objects assumed upright), keeping only yaw

  • Simple Camera Model:
    A point on the image plane corresponds to a ray in 3D space
    A 2D bounding box on an image corresponds to a frustum in 3D space
    Localize an object in 3D: the object can be anywhere in the camera viewing frustum (see the sketch below)
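
A minimal NumPy sketch of the pixel-to-ray back-projection; the pinhole intrinsics matrix `K` here is hypothetical.

```python
import numpy as np

# Hypothetical pinhole intrinsics: focal length f, principal point (cx, cy).
f, cx, cy = 500.0, 320.0, 240.0
K = np.array([[f, 0.0, cx],
              [0.0, f, cy],
              [0.0, 0.0, 1.0]])

u, v = 400.0, 300.0                              # a pixel on the image plane
ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # its ray in the camera frame
ray /= np.linalg.norm(ray)
# Every depth lambda > 0 along lambda * ray projects back to (u, v),
# which is why a single image cannot localize the object in depth.
```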

  • Monocular Camera:
    • Same idea as Faster R-CNN, but proposals are in 3D
    • Propose 3D bounding boxes; regress 3D box parameters + a class score
  • 3D Shape Prediction: Mesh R-CNN