
cs231n - Lecture 13. Self-Supervised Learning

Self-Supervised Learning

Generative vs. Self-supervised Learning

  • Both aim to learn from data without manual label annotation
  • Generative learning aims to model data distribution $p_{data}(x)$,
    e.g., generating realistic images.
  • Self-supervised learning methods solve “pretext” tasks that produce good features for downstream tasks.
    • Learn with supervised learning objectives, e.g., classification, regression.
    • Labels of these pretext tasks are generated automatically.

Self-supervised pretext tasks

  • Example: learn to predict image transformations / complete corrupted images;
    e.g., image completion, rotation prediction, “jigsaw puzzles”, colorization.
  1. Solving the pretext tasks allows the model to learn good features.
  2. We can automatically generate labels for the pretext tasks.
  • Learning to generate pixel-level details is often unnecessary; instead, use pretext tasks to learn high-level semantic features (encode only the high-level features sufficient to distinguish different objects; this is the idea behind Contrastive Methods): Epstein, 2016
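
As a concrete illustration of automatically generated pretext labels, here is a minimal PyTorch-style sketch for the rotation-prediction task (the function name and batch layout are illustrative, not from the lecture):

```python
import torch

def make_rotation_batch(images):
    """Build a rotation-prediction pretext batch from unlabeled images.

    images: float tensor of shape (B, C, H, W).
    Returns 4*B rotated images and automatically generated labels
    (0 = 0 deg, 1 = 90 deg, 2 = 180 deg, 3 = 270 deg); no manual annotation needed.
    """
    rotated = [torch.rot90(images, k=k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(images.shape[0])
    return torch.cat(rotated, dim=0), labels

# A standard classifier + cross-entropy loss can then be trained on
# (rotated images, labels) as a 4-way classification problem.
```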

How to evaluate a self-supervised learning method?

  1. Self-supervised learning:
    With lots of unlabeled data, learn good feature extractors from self-supervised pretext tasks, e.g., predicting image rotations.
  2. Supervised Learning:
    With a small amount of labeled data for the target task, attach a shallow network on top of the feature extractor; train the shallow network and evaluate on the target task, e.g., classification, detection.
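
A minimal sketch of this two-stage evaluation protocol, assuming a pretrained `encoder` that outputs `feat_dim`-dimensional features (names are illustrative):

```python
import torch.nn as nn

def build_linear_probe(encoder, feat_dim, num_classes):
    # Stage 1 already happened: `encoder` was trained on a pretext task.
    # Stage 2: freeze the feature extractor...
    for p in encoder.parameters():
        p.requires_grad = False
    encoder.eval()
    # ...and attach a shallow network (here a single linear layer) that is
    # trained on the small labeled dataset of the target task.
    classifier = nn.Linear(feat_dim, num_classes)
    return nn.Sequential(encoder, classifier)

# Only `classifier.parameters()` are passed to the optimizer; the quality of
# the frozen features determines the downstream accuracy.
```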

Pretext tasks from image transformations

  • Predict Rotations
    Gidaris et al., 2018 - (Paper Review)
  • Predict Relative Patch Locations
    Doersch et al., 2015
  • Solving “jigsaw puzzles”; shuffled patches
    Noroozi & Favaro, 2016
  • Predict Missing Pixels (Inpainting); encoder-decoder
    Pathak et al., 2016
  • Image Colorization; Split-Brain Autoencoder
    Richard Zhang / Phillip Isola
  • Video Colorization; propagate colors from the t=0 reference frame to the later frames
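
For the relative-patch-location task above, pretext labels can likewise be generated automatically; a rough sketch (patch and gap sizes are illustrative and assume the image is at least 3 × (patch + gap) pixels per side):

```python
import random
import torch

def relative_patch_pair(image, patch=64, gap=8):
    """Sample (center patch, neighbor patch, label in 0..7) from one image.

    The label is the neighbor's position on a 3x3 grid around the center
    patch; the gap makes the task harder to solve from low-level cues.
    """
    _, H, W = image.shape
    stride = patch + gap
    y0 = random.randint(0, H - 3 * stride)   # top-left corner of the 3x3 grid
    x0 = random.randint(0, W - 3 * stride)

    def crop(r, c):
        return image[:, y0 + r * stride: y0 + r * stride + patch,
                        x0 + c * stride: x0 + c * stride + patch]

    center = crop(1, 1)
    neighbors = [(r, c) for r in range(3) for c in range(3) if (r, c) != (1, 1)]
    label = random.randrange(8)              # automatically generated label
    return center, crop(*neighbors[label]), torch.tensor(label)
```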

Summary: Pretext tasks

  • Pretext tasks focus on “visual common sense”: through image transformations, models can learn without supervision (i.e., without large labeled datasets).
  • The models are forced to learn good features about natural images, e.g., semantic representation of an object category, in order to solve the pretext tasks.
  • We don’t care about the performance of these pretext tasks, but rather how useful the learned features are for downstream tasks.
  • $\color{red}{Problems}$: 1) coming up with individual pretext tasks is tedious, and 2) the learned representations may not be general; tied to a specific pretext task.

Contrastive Representation Learning

For a more general pretext task, contrastive learning takes the following approach: pull the representation of a reference sample close to those of its transformed (“positive”) versions, and push it away from the representations of other (“negative”) samples.

A formulation of contrastive learning

  • What we want:
    $\mbox{score}(f(x), f(x^+)) \gg \mbox{score}(f(x), f(x^-))$
    $x$: reference sample, $x^+$: positive sample, $x^-$: negative sample
    Given a chosen score function, we aim to learn an encoder function f that yields high score for positive pairs and low scores for negative pairs.

  • Loss function given 1 positive sample and N-1 negative samples:
    \(L = -\mathbb{E}\left[\log\frac{\exp(s(f(x), f(x^+)))}{\exp(s(f(x), f(x^+))) + \sum_{j=1}^{N-1}\exp(s(f(x), f(x_j^-)))}\right]\)
    This is the cross-entropy loss for an N-way softmax classifier!
    i.e., learn to find the positive sample from the N samples

    • Commonly known as the InfoNCE loss (van den Oord et al., 2018)
      Minimizing it maximizes a lower bound on the mutual information between $f(x)$ and $f(x^+)$:
      \(MI[f(x), f(x^+)] \ge \log(N) - L\)
      The larger the negative sample size ($N$), the tighter the bound
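
A minimal PyTorch sketch of this loss, using a dot product of normalized features as the score (the temperature and the tensor shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, pos, negs, temperature=0.1):
    """q, pos: (B, D) reference / positive features; negs: (B, N-1, D)."""
    q, pos, negs = (F.normalize(t, dim=-1) for t in (q, pos, negs))
    l_pos = torch.einsum("bd,bd->b", q, pos).unsqueeze(1)    # (B, 1) positive scores
    l_neg = torch.einsum("bd,bnd->bn", q, negs)              # (B, N-1) negative scores
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature  # (B, N)
    # The positive always sits at index 0: an N-way softmax classification.
    labels = torch.zeros(q.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, labels)
```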

SimCLR: A Simple Framework for Contrastive Learning

  • Chen et al., 2020
  • Cosine similarity as the score function:
    \(s(u, v) = \frac{u^T v}{\lVert u \rVert \lVert v \rVert}\)
  • Use a projection network $g(\cdot)$ to project features to a space where contrastive learning is applied.
  • Generate positive samples through data augmentation:
    random cropping, random color distortion, and random blur.

  • Evaluate: freeze the feature encoder and train (fine-tune) a classifier on a supervised downstream task
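
A sketch of positive-pair generation and the score function, using torchvision transforms (the exact augmentation parameters here are illustrative, not the paper's):

```python
import torch.nn.functional as F
from torchvision import transforms

# Applying the same stochastic augmentation twice to one image yields a positive pair.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                                             # random cropping
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),   # color distortion
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),                                       # random blur
    transforms.ToTensor(),
])

def cosine_similarity_score(u, v):
    # s(u, v) = u^T v / (||u|| ||v||)
    return F.cosine_similarity(u, v, dim=-1)

# x_i, x_j = augment(img), augment(img)  -> a positive pair for the same image;
# the other images in the batch serve as negatives.
```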

SimCLR design choices: Projection head ($z = g(h)$)

Adding a projection head (either linear or non-linear) improves representation learning, with the non-linear head working best.

  • A possible explanation:
    • The contrastive learning objective may discard information that is useful for downstream tasks.
    • The projected space $z$ is trained to be invariant to the data transformations.
    • By applying the contrastive loss after the projection head $g(\cdot)$, more information can be preserved in the representation space $h$.
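
A minimal sketch of this design, assuming a ResNet-style encoder with a 2048-d output (dimensions are illustrative): the contrastive loss sees $z = g(h)$, while $h$ is kept for downstream tasks.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Non-linear head g(.): the contrastive loss is applied to z = g(h)."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        # h stays transformation-sensitive and is reused for transfer;
        # z is trained to be invariant to the augmentations.
        return self.net(h)
```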

SimCLR design choices: Large batch size

A large training batch size is crucial for SimCLR, but it causes a large memory footprint during backpropagation and requires distributed training on TPUs.

Momentum Contrastive Learning (MoCo)

  • He et al., 2020
  • Key differences to SimCLR:
    • Keep a running queue of keys (negative samples).
    • Compute gradients and update the encoder only through the queries.
    • Decouple the mini-batch size from the number of keys: can support a large number of negative samples.

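A simplified sketch of the two MoCo ingredients above: the momentum update of the key encoder and the queue of negative keys (function names and the queue handling are simplified assumptions, not the exact paper code):

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # The key encoder is a slowly moving copy of the query encoder and
    # receives no gradients; only the query encoder is updated by backprop.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, new_keys):
    # FIFO queue of past keys used as negatives; its length is independent of
    # the mini-batch size. (Simplified: prepend newest keys, drop the oldest.)
    K = queue.shape[0]
    return torch.cat([new_keys, queue], dim=0)[:K]
```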

MoCo V2

  • Chen et al., 2020
  • A hybrid of ideas from SimCLR and MoCo:
    From SimCLR: non-linear projection head and strong data augmentation.
    From MoCo: a momentum-updated key encoder and a queue of keys that allow training with a large number of negative samples (no TPU required).
  • Key takeaways (vs. SimCLR, MoCo V1):
    • Non-linear projection head and strong data augmentation are crucial for contrastive learning.
    • Decoupling the mini-batch size from the negative sample size allows MoCo V2 to outperform SimCLR with a smaller batch size (256 vs. 8192).
    • Achieved with much smaller memory footprint.

Instance vs. Sequence Contrastive Learning

  • Instance-level contrastive learning:
    Based on positive & negative instances.
    E.g., SimCLR, MoCo
  • Sequence-level contrastive learning:
    Based on sequential / temporal orders.
    E.g., Contrastive Predictive Coding (CPC)

Contrastive Predictive Coding (CPC)

  • van den Oord et al., 2018
  • Contrastive: contrast between “right” and “wrong” sequences using contrastive learning.
  • Predictive: the model has to predict future patterns given the current context.
  • Coding: the model learns useful feature vectors, or “code”, for downstream tasks, similar to other context self-supervised methods.


  1. Encode all samples in a sequence into vectors $z_t = g_{\mbox{enc}}(x_t)$
  2. Summarize context (e.g., half of a sequence) into a context code $c_t$ using an auto-regressive model ($g_{\mbox{ar}}$). The original paper uses GRU-RNN here.
  3. Compute the InfoNCE loss between the context $c_t$ and future code $z_{t+k}$ using the following time-dependent score function: $s_k(z_{t+k}, c_t) = z_{t+k}^T W_k c_t$, where $W_k$ is a trainable matrix.
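
A minimal sketch of step 3's time-dependent score function (shapes and the class name are illustrative):

```python
import torch.nn as nn

class CPCScore(nn.Module):
    """Bilinear score s_k(z_{t+k}, c_t) = z_{t+k}^T W_k c_t, one W_k per offset k."""
    def __init__(self, z_dim, c_dim, max_k):
        super().__init__()
        self.W = nn.ModuleList([nn.Linear(c_dim, z_dim, bias=False) for _ in range(max_k)])

    def forward(self, z_future, c_t, k):
        # z_future: (B, z_dim) codes at time t+k; c_t: (B, c_dim) context at time t.
        pred = self.W[k - 1](c_t)             # W_k c_t
        return (z_future * pred).sum(dim=-1)  # one score per sequence in the batch

# These scores feed the InfoNCE loss: the true z_{t+k} is the positive,
# while codes from other sequences (or time steps) serve as negatives.
```
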
  • Summary (CPC):
    Contrast “right” sequence with “wrong” sequence.
    InfoNCE loss with a time-dependent score function.
    Can be applied to a variety of learning problems, but not as effective in learning image representations compared to instance-level methods.

Other examples
