
cs231n - Lecture 8. Training Neural Networks II

Optimization

Problems with SGD

  1. What if the loss changes quickly in one direction and slowly in another? Gradient descent makes very slow progress along the shallow dimension and jitters along the steep direction

  2. What if the loss function has a local minimum or saddle point?
    Zero gradient, so gradient descent gets stuck (saddle points are more common in high dimensions)

  3. Gradients computed from minibatches can be noisy

SGD + Momentum

  • To avoid getting stuck in local minima, combine the gradient at the current point with a velocity term to get the step used to update the weights; the update keeps moving in the general direction of the previous iterations
    \(v_{t+1}=\rho v_t + \nabla f(x_t)\)
    \(x_{t+1}=x_t - \alpha v_{t+1}\)
    with rho giving “friction”; typically 0.9 or 0.99
vx = 0
while True:
	dx = compute_gradient(x)
	vx = rho * vx + dx		# integrate velocity
	x -= learning_rate * vx		# integrate position

Nesterov Momentum

  • “Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction
    \(v_{t+1}=\rho v_t - \alpha\nabla f(x_t + \rho v_t)\)
    \(x_{t+1}=x_t + v_{t+1}\)
    rearrange with \(\tilde{x}_t = x_t + \rho v_t\),
    \(v_{t+1}=\rho v_t - \alpha\nabla f(\tilde{x}_t)\)
    \(\tilde{x}_{t+1} = \tilde{x}_t - \rho v_t + (1+\rho)v_{t+1} = \tilde{x}_t + v_{t+1} + \rho(v_{t+1}-v_t)\)
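
    A minimal code sketch of this update, in the same style as the SGD+Momentum snippet above (compute_gradient, rho, and learning_rate as before; here x plays the role of \(\tilde{x}\)):

v = 0
while True:
	dx = compute_gradient(x)		# gradient at the look-ahead point (x is x-tilde)
	old_v = v
	v = rho * v - learning_rate * dx
	x += -rho * old_v + (1 + rho) * v	# the rearranged update above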

AdaGrad

grad_squared = 0
while True:
	dx = compute_gradient(x)
	grad_squared += dx * dx						# accumulate squared gradients per element
	x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)	# element-wise scaled step
  • Added element-wise scaling of the gradient based on the historical sum of squares in each dimension
    “Per-parameter learning rates” or “adaptive learning rates”
    Progress along “steep” directions is damped and progress along “flat” directions is accelerated
    Step size decays to zero over time

RMSProp: “Leaky AdaGrad”

grad_squared = 0
while True:
	dx = compute_gradient(x)
	grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx	# leaky running sum of squared gradients
	x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

Adam

first_moment = 0
second_moment = 0
for t in range(1, num_iterations):
	dx = compute_gradient(x)
	first_moment = beta1 * first_moment + (1 - beta1) * dx		# Momentum
	second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
	first_unbias = first_moment / (1 - beta1 ** t)			# Bias correction
	second_unbias = second_moment / (1 - beta2 ** t)
	x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)	# AdaGrad/ RMSProp
  • Sort of like RMSProp with momentum. Bias correction accounts for the fact that the first and second moment estimates start at zero
    Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point

Learning rate schedules

Learning rate decays over time

  • Reduce the learning rate by a fixed factor at a few fixed points (e.g. after certain epochs)

Learning Rate Decay

  • Reduce learning rate gradually, e.g.
    Cosine: $\alpha_t = \frac{1}{2}\alpha_0(1+\cos(t\pi / T))$
    Linear: $\alpha_t = \alpha_0(1-t/T)$
    Inverse sqrt: $\alpha_t = \alpha_0 / \sqrt{t}$
    where $\alpha_0$ is the initial learning rate, $\alpha_t$ is the learning rate at epoch t, and T is the total number of epochs (a small code sketch of these schedules follows this list)

  • Linear Warmup
    High initial learning rates can make loss explode; linearly increasing learning rate from 0 over the first ~5000 iterations can prevent this
    Empirical rule of thumb: If you increase the batch size by N, also scale the initial learning rate by N
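
    A small code sketch of these schedules and of linear warmup (the function and parameter names here are illustrative, not from the lecture):

import math

def decayed_lr(t, T, alpha_0, schedule="cosine"):
	# t: current epoch, T: total epochs, alpha_0: initial learning rate
	if schedule == "cosine":
		return 0.5 * alpha_0 * (1 + math.cos(t * math.pi / T))
	if schedule == "linear":
		return alpha_0 * (1 - t / T)
	if schedule == "inv_sqrt":
		return alpha_0 / math.sqrt(max(t, 1))	# guard against t = 0

def warmup_lr(it, warmup_iters, alpha_0):
	# linearly ramp the learning rate from 0 over the first warmup_iters iterations
	return alpha_0 * min(1.0, it / warmup_iters)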

First-Order Optimization

  1. Use gradient from linear approximation
  2. Step to minimize the approximation

Second-Order Optimization

  1. Use gradient and Hessian to form quadratic approximation
  2. Step to the minima of the (quadratic) approximation
  • But the Hessian has O(N^2) elements and inverting it takes O(N^3), where N (the number of parameters) is extremely large; a bare Newton step is sketched after this list
    • Quasi-Newton methods (BFGS is the most popular):
      instead of inverting the Hessian, approximate the inverse Hessian with rank-1 updates over time
    • L-BFGS (Limited memory BFGS):
      Does not form/store the full inverse Hessian. Usually works very well in full batch, deterministic mode, but does not transfer very well to mini-batch setting. Large-scale, stochastic setting is an active area of research.
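
    For concreteness, a bare Newton step under the quadratic approximation might look like this (grad_fn and hessian_fn are placeholders; the expensive solve below is exactly what BFGS-style methods approximate):

import numpy as np

def newton_step(x, grad_fn, hessian_fn):
	g = grad_fn(x)				# gradient, shape (N,)
	H = hessian_fn(x)			# Hessian, shape (N, N): O(N^2) storage
	return x - np.linalg.solve(H, g)	# solving the linear system costs O(N^3)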

Summary

  • Adam is a good default choice in many cases; it often works well even with a constant learning rate
  • SGD+Momentum can outperform Adam but may require more tuning of the learning rate and schedule. A cosine schedule is preferred, since it has very few hyperparameters.
  • L-BFGS is good if you can afford to do full batch updates.

Improve test error

  • Better optimization algorithms help reduce training loss, but what we really care about is error on new data - how do we reduce the gap?

Early Stopping: Always do this

  • Stop training the model when accuracy on the validation set decreases. Or train for a long time, but always keep track of the model snapshot that worked best on val.
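
    A rough sketch of the snapshot-tracking loop (train_one_epoch, evaluate, and save_snapshot are placeholders, not from the lecture):

best_val_acc = 0.0
for epoch in range(num_epochs):
	train_one_epoch(model, train_data)
	val_acc = evaluate(model, val_data)
	if val_acc > best_val_acc:		# keep the snapshot that worked best on val
		best_val_acc = val_acc
		save_snapshot(model, epoch)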

Model Ensembles

  1. Train multiple independent models
  2. At test time average their results
    (Take average of predicted probability distributions, then choose argmax)
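
    In code, the test-time averaging could look like the following (assuming probs_list holds each model's softmax output of shape (num_samples, num_classes)):

import numpy as np

avg_probs = np.mean(probs_list, axis=0)		# average the predicted distributions
predictions = np.argmax(avg_probs, axis=1)	# then take the argmax per sample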

Regularization

  • To improve single-model performance, add terms to loss
    e.g. L1, L2, Elastic net.

  • or, use Dropout:
    In each forward pass, randomly set some neurons to zero. Probability of dropping is a hyperparameter; 0.5 is common.
    It forces the network to have a redundant representation; Prevents co-adaptation of features. Dropout can be interpreted as training a large ensemble of models (that share parameters).

p = 0.5 # probability of keeping a unit active; higher = less dropout

def train_step(X):		# drop at train time
	H1 = np.maximum(0, np.dot(W1, X) + b1)
	U1 = np.random.rand(*H1.shape) < p	# first dropout mask
	H1 *= U1				# drop!
	H2 = np.maximum(0, np.dot(W2, H1) + b2)
	U2 = np.random.rand(*H2.shape) < p	# second dropout mask
	H2 *= U2				# drop!
	out = np.dot(W3, H2) + b3

	# backward pass: compute gradients...
	# perform parameter update...

def predict(X):			# scale at test time
	H1 = np.maximum(0, np.dot(W1, X) + b1) * p
	H2 = np.maximum(0, np.dot(W2, H1) + b2) * p
	out = np.dot(W3, H2) + b3
  • more common: “Inverted dropout”
    U1 = (np.random.rand(*H1.shape) < p) / p at train time and no scaling at test time
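
    A sketch of the inverted-dropout version of the code above (same undefined W/b parameters as in that block; only the mask lines and the test-time scaling change):

def train_step(X):
	H1 = np.maximum(0, np.dot(W1, X) + b1)
	U1 = (np.random.rand(*H1.shape) < p) / p	# mask already scaled by 1/p
	H1 *= U1
	H2 = np.maximum(0, np.dot(W2, H1) + b2)
	U2 = (np.random.rand(*H2.shape) < p) / p
	H2 *= U2
	out = np.dot(W3, H2) + b3

def predict(X):
	H1 = np.maximum(0, np.dot(W1, X) + b1)		# no scaling at test time
	H2 = np.maximum(0, np.dot(W2, H1) + b2)
	out = np.dot(W3, H2) + b3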

  • A common pattern of regularization
    Training: Add some kind of randomness
    $y = f_W(x, z)$
    Testing: Average out randomness (sometimes approximate)
    \(y = f(x) = E_z[f(x,z)] = \int p(z)f(x,z)\, dz\)

  • Data Augmentation:
    Adds transformed versions of the data to the training set
    e.g. translation, rotation, stretching, shearing, lens distortions.
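
    For instance, a random horizontal flip plus random crop in numpy (a simple sketch; image is assumed to be an H x W x C array and crop_size an illustrative parameter):

import numpy as np

def augment(image, crop_size):
	if np.random.rand() < 0.5:		# random horizontal flip
		image = image[:, ::-1, :]
	H, W, _ = image.shape			# random crop of crop_size x crop_size
	y = np.random.randint(0, H - crop_size + 1)
	x = np.random.randint(0, W - crop_size + 1)
	return image[y:y + crop_size, x:x + crop_size, :]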

  • DropConnect:
    Training: Drop connections between neurons (set weights to 0)
    Testing: Use all the connections

  • Fractional Pooling:
    Training: Use randomized pooling regions
    Testing: Average predictions from several regions

  • Stochastic Depth:
    Training: Skip some layers in the network
    Testing: Use all the layers

  • Cutout:
    Training: Set random image regions to zero
    Testing: Use full image

  • Mixup:
    Training: Train on random blends of images
    Testing: Use original images
    e.g. Randomly blend the pixels of pairs of training images, say 40% cat and 60% dog, and set the target label as cat:0.4 and dog:0.6.
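
    The cat/dog example written out as a sketch (in practice the blend weight lam is usually sampled from a Beta distribution rather than fixed):

def mixup(img_a, img_b, label_a, label_b, lam=0.4):
	# blend two training images and their one-hot labels with weight lam
	blended = lam * img_a + (1 - lam) * img_b	# e.g. 40% cat + 60% dog
	target = lam * label_a + (1 - lam) * label_b	# soft label: cat 0.4, dog 0.6
	return blended, target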

  • Summary:
    Consider dropout for large fully-connected layers
    Batch normalization and data augmentation are almost always a good idea
    Try cutout and mixup, especially for small classification datasets

Choosing Hyperparameters

  • Step 1: Check initial loss
    Turn off weight decay, sanity check loss at initialization
    e.g. log(C) for softmax with C classes
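
    A quick worked check: with C classes and roughly uniform softmax outputs at initialization, the expected loss is -log(1/C) = log(C):

import numpy as np

C = 10				# e.g. CIFAR-10
expected_loss = np.log(C)	# ~2.303; the initial softmax loss should be close to this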

  • Step 2: Overfit a small sample
    Try to train to 100% training accuracy on a small sample of training data (~5-10 minibatches); fiddle with architecture, learning rate, weight initialization
    If the loss is not going down, the LR is too low or the initialization is bad; if the loss explodes, the LR is too high or the initialization is bad.

  • Step 3: Find LR that makes loss go down
    Use the architecture from the previous step, use all training data, turn on small weight decay, find a learning rate that makes the loss drop significantly within ~100 iterations.

  • Step 4: Coarse grid, train for ~1-5 epochs
    Choose a few values of learning rate and weight decay around what worked from Step 3, train a few models for ~1-5 epochs.

  • Step 5: Refine grid, train longer
    Pick best models from Step 4, train them for longer (~10-20 epochs) without learning rate decay

  • Step 6: Look at loss and accuracy curves
    If accuracy is still going up, you need to train longer. If validation accuracy goes down and there is a huge train / val gap, that means overfitting: increase regularization or get more data. If there's no gap between train and val, that means underfitting: train longer or use a bigger model.

  • Look at learning curves
    Losses may be noisy, use a scatter plot and also plot moving average to see trends better. Cross-validation is useful too.
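
    A simple moving average for noisy loss curves (sketch; losses is a 1-D array of per-iteration losses and window an illustrative smoothing width):

import numpy as np

def moving_average(losses, window=100):
	kernel = np.ones(window) / window		# uniform smoothing kernel
	return np.convolve(losses, kernel, mode='valid')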

  • Step 7: GO TO Step 5

  • Hyperparameters to play with:
    network architecture,
    learning rate, its decay schedule, update type,
    regularization (L2/Dropout strength)

  • for Hyper-Parameter Optimization, consider both Random Search and Grid Search
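
    A minimal random-search sketch over learning rate and weight decay, sampled log-uniformly (train_and_eval is a placeholder for a short training run that returns validation accuracy):

import numpy as np

results = []
for _ in range(20):
	lr = 10 ** np.random.uniform(-4, -1)			# log-uniform learning rate
	wd = 10 ** np.random.uniform(-6, -2)			# log-uniform weight decay
	val_acc = train_and_eval(lr=lr, weight_decay=wd)	# placeholder training call
	results.append((val_acc, lr, wd))
best_acc, best_lr, best_wd = max(results)			# highest validation accuracy wins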

Summary

  • Improve your training error:
    • Optimizers
    • Learning rate schedules
  • Improve your test error:
    • Regularization
    • Choosing Hyperparameters