cs231n - Lecture 13. Self-Supervised Learning
Self-Supervised Learning
Generative vs. Self-supervised Learning
- Both aim to learn from data without manual label annotation
- Generative learning aims to model the data distribution $p_{data}(x)$, e.g., generating realistic images.
- Self-supervised learning methods solve “pretext” tasks that produce good features for downstream tasks.
- Learn with supervised learning objectives, e.g., classification, regression.
- Labels of these pretext tasks are generated automatically.
Self-supervised pretext tasks
- Example: learn to predict image transformations / complete corrupted images;
e.g., image completion, rotation prediction, “jigsaw puzzles”, colorization.
- Solving the pretext tasks allows the model to learn good features.
- We can automatically generate labels for the pretext tasks.
- Learning to generate pixel-level details is often unnecessary; instead, learn high-level semantic features with pretext tasks (encode only the high-level features sufficient to distinguish different objects, as in contrastive methods): Epstein, 2016
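For concreteness, here is a minimal sketch (not from the lecture) of how pretext-task labels can be generated automatically, using rotation prediction as the example; the batch shape and helper name are illustrative assumptions.

```python
# Minimal sketch: automatic label generation for the rotation-prediction
# pretext task. `images` is assumed to be a batch shaped (N, C, H, W);
# the 4-way rotation index serves as a free classification label.
import torch

def make_rotation_batch(images: torch.Tensor):
    """Rotate each image by 0/90/180/270 degrees and return the rotation
    index as the automatically generated label."""
    rotated, labels = [], []
    for k in range(4):                        # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Usage: feed (x, y) to any classifier with a standard cross-entropy loss.
x, y = make_rotation_batch(torch.randn(8, 3, 32, 32))
```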
How to evaluate a self-supervised learning method?
- Self-supervised learning:
With lots of unlabeled data, learn good feature extractors from self-supervised pretext tasks, e.g., predicting image rotations.
- Supervised learning:
With a small amount of labeled data on the target task, attach a shallow network to the feature extractor; train the shallow network and evaluate on the target task, e.g., classification, detection.
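A minimal sketch of this evaluation protocol, assuming a pretrained `encoder`, its output feature dimension `feat_dim`, and a labeled `loader` for the target task (all names are illustrative):

```python
# Minimal sketch of the evaluation protocol: freeze the self-supervised
# feature extractor and train only a shallow (here: linear) head on the
# small labeled target dataset.
import torch
import torch.nn as nn

def linear_probe(encoder: nn.Module, feat_dim: int, num_classes: int,
                 loader, epochs: int = 10, lr: float = 1e-3):
    for p in encoder.parameters():            # freeze the feature extractor
        p.requires_grad = False
    encoder.eval()

    head = nn.Linear(feat_dim, num_classes)   # shallow network on top
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y in loader:                   # small labeled target dataset
            with torch.no_grad():
                feats = encoder(x)
            loss = criterion(head(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head   # evaluate head(encoder(x)) on the target task
```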
Pretext tasks from image transformations
- Predict Rotations: Gidaris et al., 2018 (Paper Review)
- Predict Relative Patch Locations: Doersch et al., 2015
- Solve “jigsaw puzzles” from shuffled patches: Noroozi & Favaro, 2016
- Predict Missing Pixels (Inpainting) with an encoder-decoder: Pathak et al., 2016
- Image Colorization; Split-brain Autoencoder: Richard Zhang / Phillip Isola
- Video Colorization: propagate colors from the t=0 reference frame to later frames
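To illustrate one of these tasks, here is a hedged sketch of label generation for relative patch locations (Doersch et al., 2015); the 3x3 grid layout, patch size, and function name are assumptions for the example, not the paper's exact sampling scheme.

```python
# Minimal sketch: crop the center patch and a random neighbor from a 3x3 grid;
# the neighbor's position index (0..7) is the free classification label.
import random
import torch

def relative_patch_pair(image: torch.Tensor, patch: int = 64):
    """image: (C, H, W) with H, W >= 3*patch. Returns (center, neighbor, label)."""
    _, h, w = image.shape
    top = (h - 3 * patch) // 2
    left = (w - 3 * patch) // 2

    def crop(row, col):
        y, x = top + row * patch, left + col * patch
        return image[:, y:y + patch, x:x + patch]

    center = crop(1, 1)
    neighbors = [(r, c) for r in range(3) for c in range(3) if (r, c) != (1, 1)]
    label = random.randrange(8)               # which of the 8 surrounding positions
    neighbor = crop(*neighbors[label])
    return center, neighbor, torch.tensor(label)
```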
Summary: Pretext tasks
- Pretext tasks focus on “visual common sense”: through image transformations, models can learn without supervision (i.e., without large labeled datasets).
- The models are forced to learn good features about natural images, e.g., a semantic representation of an object category, in order to solve the pretext tasks.
- We don’t care about the performance of these pretext tasks, but rather how useful the learned features are for downstream tasks.
- $\color{red}{Problems}$: 1) coming up with individual pretext tasks is tedious, and 2) the learned representations may not be general; tied to a specific pretext task.
Contrastive Representation Learning
Contrastive representation learning aims to provide a more general pretext task.
A formulation of contrastive learning
- What we want:
$\mbox{score}(f(x), f(x^+)) \gg \mbox{score}(f(x), f(x^-))$
($x$: reference sample, $x^+$: positive sample, $x^-$: negative sample)
Given a chosen score function, we aim to learn an encoder function $f$ that yields high scores for positive pairs and low scores for negative pairs.
- Loss function given 1 positive sample and N-1 negative samples:
$L = -\mathbb{E}\left[\log \frac{\exp(\mbox{score}(f(x), f(x^+)))}{\exp(\mbox{score}(f(x), f(x^+))) + \sum_{j=1}^{N-1} \exp(\mbox{score}(f(x), f(x_j^-)))}\right]$
This looks just like the cross-entropy loss for an N-way softmax classifier, i.e., learn to identify the positive sample among the N samples. Commonly known as the InfoNCE loss (van den Oord et al., 2018).
- A lower bound on the mutual information between $f(x)$ and $f(x^+)$:
\(MI[f(x), f(x^+)] - \log(N) \ge -L\)
The larger the negative sample size $N$, the tighter the bound.
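A minimal sketch of the InfoNCE loss above, assuming already-encoded features and a plain dot product as the score function (the score function is a free choice); shapes and names are illustrative.

```python
# Minimal sketch of InfoNCE with 1 positive and N-1 negatives per reference.
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, pos: torch.Tensor, negs: torch.Tensor):
    """q, pos: (B, D) encoded reference/positive pairs; negs: (B, N-1, D).
    The loss is cross-entropy for picking the positive among the N samples."""
    s_pos = (q * pos).sum(dim=1, keepdim=True)              # (B, 1)
    s_neg = torch.bmm(negs, q.unsqueeze(2)).squeeze(2)       # (B, N-1)
    logits = torch.cat([s_pos, s_neg], dim=1)                # (B, N)
    labels = torch.zeros(q.size(0), dtype=torch.long)        # positive is index 0
    return F.cross_entropy(logits, labels)
```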
SimCLR: A Simple Framework for Contrastive Learning
- Chen et al., 2020
- Cosine similarity as the score function:
\(s(u, v) = \frac{u^T v}{\lVert u \rVert \lVert v \rVert}\)
- Use a projection network $g(\cdot)$ to project features to a space where contrastive learning is applied.
- Generate positive samples through data augmentation:
random cropping, random color distortion, and random blur.
- Evaluate: freeze the feature encoder, then train (fine-tune) on a supervised downstream task.
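A hedged sketch of one SimCLR-style training step with in-batch negatives; `encoder`, the projection head `g`, `augment`, and the temperature value are assumptions for illustration, not the reference implementation.

```python
# Minimal sketch: two augmented views per image, projection head, cosine
# similarity with temperature, and an NT-Xent (InfoNCE-style) loss.
import torch
import torch.nn.functional as F

def simclr_step(encoder, g, augment, x, tau: float = 0.5):
    B = x.size(0)
    z1 = F.normalize(g(encoder(augment(x))), dim=1)   # first augmented view
    z2 = F.normalize(g(encoder(augment(x))), dim=1)   # second augmented view
    z = torch.cat([z1, z2], dim=0)                    # (2B, D)
    sim = z @ z.t() / tau                             # cosine similarities / temperature
    sim = sim.masked_fill(torch.eye(2 * B, dtype=torch.bool), float('-inf'))  # no self-pairs
    # the positive of sample i is its other augmented view: i+B or i-B
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)])
    return F.cross_entropy(sim, targets)              # loss over all 2B anchors
```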
SimCLR design choices: Projection head ($z = g(\cdot)$)
Linear / non-linear projection heads improve representation learning.
- A possible explanation:
- contrastive learning objective may discard useful information for downstream tasks.
- representation space z is trained to be invariant to data transformation.
- by leveraging the projection head g(.), more information can be preserved in the h representation space
SimCLR design choices: Large batch size
Large training batch size is crucial for SimCLR, but it causes a large memory footprint during backpropagation and requires distributed training on TPUs.
Momentum Contrastive Learning (MoCo)
- He et al., 2020
- Key differences to SimCLR:
- Keep a running queue of keys (negative samples).
- Compute gradients and update the encoder only through the queries.
- Decouple the mini-batch size from the number of keys: can support a large number of negative samples.
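A minimal sketch of the two MoCo ingredients above, a momentum-updated key encoder and a fixed-size queue of negative keys; module and variable names are illustrative.

```python
# Minimal sketch of MoCo's momentum update and key queue.
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m: float = 0.999):
    # The key encoder tracks an exponential moving average of the query
    # encoder; gradients flow only through the query encoder.
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.data.mul_(m).add_(pq.data, alpha=1 - m)

@torch.no_grad()
def enqueue(queue: torch.Tensor, keys: torch.Tensor):
    # Drop the oldest keys and append the newest batch: the number of
    # negatives equals the queue length, decoupled from the mini-batch size.
    return torch.cat([queue[keys.size(0):], keys], dim=0)
```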
MoCo V2
- Chen et al., 2020
- A hybrid of ideas from SimCLR and MoCo:
From SimCLR: non-linear projection head and strong data augmentation.
From MoCo: momentum-updated queues that allow training on a large number of negative samples (no TPU required).
- Key takeaways (vs. SimCLR and MoCo V1):
- Non-linear projection head and strong data augmentation are crucial for contrastive learning.
- Decoupling the mini-batch size from the negative sample size allows MoCo V2 to outperform SimCLR with a smaller batch size (256 vs. 8192).
- Achieved with much smaller memory footprint.
Instance vs. Sequence Contrastive Learning
- Instance-level contrastive learning:
Based on positive & negative instances.
E.g., SimCLR, MoCo
- Sequence-level contrastive learning:
Based on sequential / temporal orders.
E.g., Contrastive Predictive Coding (CPC)
Contrastive Predictive Coding (CPC)
- van den Oord et al., 2018
- Contrastive: contrast between “right” and “wrong” sequences using contrastive learning.
- Predictive: the model has to predict future patterns given the current context.
- Coding: the model learns useful feature vectors, or “code”, for downstream tasks, similar to other context self-supervised methods.
- Encode all samples in a sequence into vectors $z_t = g_{\mbox{enc}}(x_t)$
- Summarize context (e.g., half of a sequence) into a context code $c_t$ using an auto-regressive model ($g_{\mbox{ar}}$). The original paper uses GRU-RNN here.
- Compute the InfoNCE loss between the context $c_t$ and a future code $z_{t+k}$ using the following time-dependent score function: $s_k(z_{t+k}, c_t) = z_{t+k}^T W_k c_t$, where $W_k$ is a trainable matrix.
- Summary (CPC):
Contrast “right” sequence with “wrong” sequence.
InfoNCE loss with a time-dependent score function.
Can be applied to a variety of learning problems, but not as effective in learning image representations compared to instance-level methods.
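A minimal sketch of the CPC score and loss for a single prediction step $k$, assuming pre-computed context and future codes and treating $W_k$ as a bias-free linear layer; shapes, names, and the "true code at index 0" convention are illustrative assumptions.

```python
# Minimal sketch: time-dependent bilinear score s_k(z, c_t) = z^T W_k c_t
# plugged into an InfoNCE loss for one prediction step k.
import torch
import torch.nn as nn
import torch.nn.functional as F

def cpc_step_loss(c_t: torch.Tensor, z: torch.Tensor, W_k: nn.Linear):
    """c_t: (B, C) context codes; z: (B, N, D) candidate future codes with
    the true z_{t+k} at index 0; W_k: nn.Linear(C, D, bias=False)."""
    pred = W_k(c_t)                                       # (B, D): W_k c_t
    logits = torch.bmm(z, pred.unsqueeze(2)).squeeze(2)   # (B, N) bilinear scores
    labels = torch.zeros(c_t.size(0), dtype=torch.long)   # true future code at index 0
    return F.cross_entropy(logits, labels)
```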
Other examples