cs231n - Lecture 8. Training Neural Networks II
Optimization
Problems with SGD
- What if loss changes quickly in one direction and slowly in another? Gradient descent makes very slow progress along the shallow dimension and jitters along the steep direction.
- What if the loss function has a local minimum or saddle point? The gradient is zero, so gradient descent gets stuck (saddle points are much more common in high dimensions).
- Gradients computed from minibatches can be noisy.
SGD + Momentum
- To avoid getting stuck at local minima, combine the gradient at the current point with a velocity term to get the step used to update the weights; keep moving in the general direction of the previous iterations.
\(v_{t+1}=\rho v_t + \nabla f(x_t)\)
\(x_{t+1}=x_t - \alpha v_{t+1}\)
where \(\rho\) gives "friction"; typically 0.9 or 0.99
vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx
    x -= learning_rate * vx
Nesterov Momentum
- “Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction
\(v_{t+1}=\rho v_t - \alpha\nabla f(x_t + \rho v_t)\)
\(x_{t+1}=x_t + v_{t+1}\)
rearrange with \(\tilde{x}_t = x_t + \rho v_t\),
\(v_{t+1}=\rho v_t - \alpha\nabla f(\tilde{x}_t)\)
\(\begin{align*} \tilde{x}_{t+1} &= \tilde{x}_t - \rho v_t + (1+\rho)v_{t+1} \\ &= \tilde{x}_t + v_{t+1} + \rho(v_{t+1}-v_t) \end{align*}\)
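A minimal sketch of this rearranged update, in the same style as the other snippets here (compute_gradient as above):
v = 0
while True:
    dx = compute_gradient(x)
    old_v = v
    v = rho * v - learning_rate * dx
    x += -rho * old_v + (1 + rho) * v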
AdaGrad
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
- Added element-wise scaling of the gradient based on the historical sum of squares in each dimension
“Per-parameter learning rates” or “adaptive learning rates”
Progress along "steep" directions is damped; progress along "flat" directions is accelerated
Step size decays to zero over time
RMSProp: “Leaky AdaGrad”
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
Adam
first_moment = 0
second_moment = 0
for t in range(1, num_iterations):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx            # Momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    first_unbias = first_moment / (1 - beta1 ** t)                    # Bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)  # AdaGrad / RMSProp
- Sort of like RMSProp with momentum
Bias correction for the fact that first and second moment estimates start at zero
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point
Learning rate schedules
Learning rate decays over time
- Step decay: reduce the learning rate at a few fixed points (e.g. multiply it by a fixed factor after certain epochs)
Learning Rate Decay
- Reduce the learning rate gradually, e.g.
Cosine: $\alpha_t = \frac{1}{2}\alpha_0(1+\cos(t\pi / T))$
Linear: $\alpha_t = \alpha_0(1-t/T)$
Inverse sqrt: $\alpha_t = \alpha_0 / \sqrt{t}$
where $\alpha_0$ is the initial learning rate, $\alpha_t$ is the learning rate at epoch t, and T is the total number of epochs
Linear Warmup
High initial learning rates can make loss explode; linearly increasing learning rate from 0 over the first ~5000 iterations can prevent this
Empirical rule of thumb: If you increase the batch size by N, also scale the initial learning rate by N
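A minimal sketch combining linear warmup with the cosine decay above (the function name lr_at and its default constants are illustrative assumptions, not from the lecture):
import math

def lr_at(t, T, alpha_0=1e-3, warmup_iters=5000):
    # linear warmup: ramp from 0 to alpha_0 over the first warmup_iters iterations
    if t < warmup_iters:
        return alpha_0 * t / warmup_iters
    # cosine decay: from alpha_0 down to 0 over the remaining iterations
    progress = (t - warmup_iters) / (T - warmup_iters)
    return 0.5 * alpha_0 * (1 + math.cos(math.pi * progress))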
First-Order Optimization
- Use gradient from linear approximation
- Step to minimize the approximation
Second-Order Optimization
- Use gradient and Hessian to form quadratic approximation
- Step to the minima of the (quadratic) approximation
- But the Hessian has O(N^2) elements and inverting it takes O(N^3), where N (the number of parameters) is extremely large
- Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian, approximate the inverse Hessian with rank-1 updates over time
- L-BFGS (Limited-memory BFGS): does not form/store the full inverse Hessian. Usually works very well in full-batch, deterministic mode, but does not transfer very well to the mini-batch setting. Large-scale, stochastic second-order optimization is an active area of research.
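A minimal sketch of a single full Newton step under the quadratic approximation, illustrating the cost that quasi-Newton methods avoid (compute_hessian is a hypothetical helper, not from the lecture):
import numpy as np

while True:
    g = compute_gradient(x)       # gradient, shape (N,)
    H = compute_hessian(x)        # Hessian, shape (N, N): O(N^2) storage (hypothetical helper)
    x -= np.linalg.solve(H, g)    # solving H s = g costs O(N^3)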
Summary
- Adam is a good default choice in many cases; even with constant learning rate
- SGD+Momentum can outperform Adam but may require more tuning of the LR and its schedule. A cosine schedule is preferred, since it has very few hyperparameters.
- L-BFGS is good if you can afford to do full batch updates.
Improve test error
- Better optimization algorithms help reduce training loss, but what we really care about is the error on new data; how do we reduce the gap?
Early Stopping: Always do this
- Stop training the model when accuracy on the validation set decreases. Or train for a long time, but always keep track of the model snapshot that worked best on val.
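A minimal sketch of the snapshot-tracking variant (train_one_epoch, evaluate, and save_checkpoint are hypothetical helpers):
best_val_acc = 0.0
for epoch in range(num_epochs):
    train_one_epoch(model, train_data)
    val_acc = evaluate(model, val_data)
    if val_acc > best_val_acc:          # keep the snapshot that did best on val
        best_val_acc = val_acc
        save_checkpoint(model, 'best_model.ckpt')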
Model Ensembles
- Train multiple independent models
- At test time average their results
(Take average of predicted probability distributions, then choose argmax)
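A minimal sketch of the test-time averaging (predict_proba is a hypothetical per-model method returning class probabilities):
import numpy as np

def ensemble_predict(models, x):
    # average the predicted probability distributions, then take the argmax
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return np.argmax(probs, axis=-1)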
Regularization
- To improve single-model performance, add regularization terms to the loss, e.g. L1, L2, Elastic net.
- Or use Dropout: in each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; 0.5 is common.
It forces the network to have a redundant representation and prevents co-adaptation of features. Dropout can be interpreted as training a large ensemble of models (that share parameters).
p = 0.5  # probability of keeping a unit active (higher = less dropout)

def train_step(X):
    # forward pass with dropout at train time
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 *= U1                             # drop
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 *= U2                             # drop
    out = np.dot(W3, H2) + b3
    # backward pass: compute gradients...
    # perform parameter update...

def predict(X):
    # scale activations by p at test time
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p
    out = np.dot(W3, H2) + b3
- More common: "Inverted dropout". Scale the mask at train time, e.g.
U1 = (np.random.rand(*H1.shape) < p) / p
so that no scaling is needed at test time.
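A minimal sketch of the train-time forward pass with inverted dropout (same W1, b1, ... as the snippet above); predict is then just the plain forward pass with no scaling:
def train_step(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p   # mask already scaled by 1/p at train time
    H1 *= U1
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p
    H2 *= U2
    out = np.dot(W3, H2) + b3
    # backward pass and parameter update as before...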
- A common pattern of regularization:
Training: add some kind of randomness, \(y = f_W(x, z)\)
Testing: average out the randomness (sometimes approximately),
\(y = f(x) = E_z[f(x,z)] = \int p(z)f(x,z)\, dz\)
- Data Augmentation:
Training: Add transformed versions of the data to the training set,
e.g. translation, rotation, stretching, shearing, lens distortions.
- DropConnect:
Training: Drop connections between neurons (set weights to 0)
Testing: Use all the connections
- Fractional Pooling:
Training: Use randomized pooling regions
Testing: Average predictions over several regions
- Stochastic Depth:
Training: Skip some layers in the network
Testing: Use all the layers
- Cutout:
Training: Set random image regions to zero
Testing: Use the full image
- Mixup:
Training: Train on random blends of images
Testing: Use the original images
e.g. randomly blend the pixels of pairs of training images, say 40% cat and 60% dog, and set the target label to cat: 0.4 and dog: 0.6.
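A minimal sketch of mixup on one pair of examples (lam is fixed at 0.4 to mirror the cat/dog example; in practice lam is typically sampled per pair, e.g. from a Beta distribution, which is an assumption beyond these notes):
def mixup_pair(x1, y1, x2, y2, lam=0.4):
    # blend pixels and one-hot labels with the same weight
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2   # e.g. cat: 0.4, dog: 0.6
    return x, y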
Summary
- Consider dropout for large fully-connected layers
- Batch normalization and data augmentation are almost always a good idea
- Try cutout and mixup, especially for small classification datasets
Choosing Hyperparameters
- Step 1: Check initial loss
Turn off weight decay and sanity check the loss at initialization;
e.g. log(C) for softmax with C classes.
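For example, with C = 10 classes the expected loss of an untrained softmax classifier is about log(10):
import numpy as np
print(-np.log(1.0 / 10))   # ~2.3026; the initial loss should be close to this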
- Step 2: Overfit a small sample
Try to train to 100% training accuracy on a small sample of the training data (~5-10 minibatches); fiddle with architecture, learning rate, and weight initialization.
If the loss is not going down, the LR is too low or the initialization is bad. If the loss explodes, the LR is too high or the initialization is bad.
- Step 3: Find an LR that makes the loss go down
Use the architecture from the previous step, use all the training data, turn on a small weight decay, and find a learning rate that makes the loss drop significantly within ~100 iterations.
- Step 4: Coarse grid, train for ~1-5 epochs
Choose a few values of learning rate and weight decay around what worked in Step 3, and train a few models for ~1-5 epochs.
- Step 5: Refine grid, train longer
Pick the best models from Step 4 and train them longer (~10-20 epochs) without learning rate decay.
- Step 6: Look at loss and accuracy curves
If accuracy is still going up, you need to train longer. If validation accuracy goes down while training accuracy keeps rising (a huge train/val gap), that is overfitting: increase regularization or get more data. If there is no gap between train and val, that is underfitting: train longer or use a bigger model.
- Look at learning curves
Losses may be noisy; use a scatter plot and also plot a moving average to see trends better. Cross-validation is useful too.
- Step 7: GO TO Step 5
- Hyperparameters to play with:
network architecture,
learning rate, its decay schedule, update type,
regularization (L2 / dropout strength)
- For hyperparameter optimization, consider both random search and grid search.
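A minimal sketch of random search over learning rate and weight decay, sampling on a log scale (train_and_eval is a hypothetical helper that trains briefly and returns validation accuracy):
import numpy as np

results = []
for trial in range(20):
    lr = 10 ** np.random.uniform(-4, -2)    # learning rate, log-uniform in [1e-4, 1e-2]
    wd = 10 ** np.random.uniform(-5, -2)    # weight decay, log-uniform in [1e-5, 1e-2]
    val_acc = train_and_eval(lr=lr, weight_decay=wd, num_epochs=5)
    results.append((val_acc, lr, wd))
print(sorted(results, reverse=True)[:5])    # best configurations by validation accuracy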
Summary
- Improve your training error:
  - Optimizers
  - Learning rate schedules
- Improve your test error:
  - Regularization
  - Choosing Hyperparameters