cs231n - Lecture 8. Training Neural Networks II
Optimization
Problems with SGD
- What if loss changes quickly in one direction and slowly in another? Gradient descent makes very slow progress along the shallow dimension and jitters along the steep direction.
- What if the loss function has a local minimum or saddle point? The gradient is zero, so gradient descent gets stuck (saddle points are much more common in high dimensions).
- Gradients computed from minibatches can be noisy.
SGD + Momentum
- To avoid getting stuck at local minima, combine the gradient at the current point with a velocity term to get the step used to update the weights; keep moving in the general direction of the previous iterations.
\(v_{t+1}=\rho v_t + \nabla f(x_t)\)
\(x_{t+1}=x_t - \alpha v_{t+1}\)
where \(\rho\) gives "friction"; typically 0.9 or 0.99
vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx
    x -= learning_rate * vx
Nesterov Momentum
- “Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction
\(v_{t+1}=\rho v_t - \alpha\nabla f(x_t + \rho v_t)\)
\(x_{t+1}=x_t + v_{t+1}\)
rearrange with \(\tilde{x}_t = x_t + \rho v_t\),
\(v_{t+1}=\rho v_t - \alpha\nabla f(\tilde{x}_t)\)
\(\begin{align*} \tilde{x}_{t+1} &= \tilde{x}_t - \rho v_t + (1+\rho)v_{t+1} \\ &= \tilde{x}_t + v_{t+1} + \rho(v_{t+1}-v_t) \end{align*}\)
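A minimal sketch of this rearranged update, in the same style as the other snippets here (compute_gradient as above):
v = 0
while True:
    dx = compute_gradient(x)
    old_v = v
    v = rho * v - learning_rate * dx
    x += -rho * old_v + (1 + rho) * v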
AdaGrad
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
- Added element-wise scaling of the gradient based on the historical sum of squares in each dimension
“Per-parameter learning rates” or “adaptive learning rates”
Progress along "steep" directions is damped; progress along "flat" directions is accelerated
Step size decays to zero over time
RMSProp: “Leaky AdaGrad”
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
Adam
first_moment = 0
second_moment = 0
for t in range(1, num_iterations):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx            # Momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    first_unbias = first_moment / (1 - beta1 ** t)                    # Bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)  # AdaGrad / RMSProp
- Sort of like RMSProp with momentum
Bias correction for the fact that first and second moment estimates start at zero
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point
Learning rate schedules
Learning rate decays over time
- Step decay: reduce the learning rate at a few fixed points (e.g. multiply it by a fixed factor after certain epochs)
Learning Rate Decay
- Reduce the learning rate gradually, e.g.
Cosine: $\alpha_t = \frac{1}{2}\alpha_0(1+\cos(t\pi / T))$
Linear: $\alpha_t = \alpha_0(1-t/T)$
Inverse sqrt: $\alpha_t = \alpha_0 / \sqrt{t}$
where $\alpha_0$ is the initial learning rate, $\alpha_t$ is the learning rate at epoch t, and T is the total number of epochs
Linear Warmup
High initial learning rates can make loss explode; linearly increasing learning rate from 0 over the first ~5000 iterations can prevent this
Empirical rule of thumb: If you increase the batch size by N, also scale the initial learning rate by N
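A minimal sketch combining linear warmup with the cosine decay above (the function name lr_at and its default constants are illustrative assumptions, not from the lecture):
import math

def lr_at(t, T, alpha_0=1e-3, warmup_iters=5000):
    # linear warmup: ramp from 0 to alpha_0 over the first warmup_iters iterations
    if t < warmup_iters:
        return alpha_0 * t / warmup_iters
    # cosine decay: from alpha_0 down to 0 over the remaining iterations
    progress = (t - warmup_iters) / (T - warmup_iters)
    return 0.5 * alpha_0 * (1 + math.cos(math.pi * progress))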
First-Order Optimization
- Use gradient from linear approximation
- Step to minimize the approximation
Second-Order Optimization
- Use gradient and Hessian to form quadratic approximation
- Step to the minima of the (quadratic) approximation
- But the Hessian has O(N^2) elements and inverting it takes O(N^3), where N (the number of parameters) is extremely large
- Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian, approximate the inverse Hessian with rank-1 updates over time
- L-BFGS (Limited-memory BFGS): does not form/store the full inverse Hessian. Usually works very well in full-batch, deterministic mode, but does not transfer very well to the mini-batch setting. Large-scale, stochastic second-order optimization is an active area of research.
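A minimal sketch of a single full Newton step under the quadratic approximation, illustrating the cost that quasi-Newton methods avoid (compute_hessian is a hypothetical helper, not from the lecture):
import numpy as np

while True:
    g = compute_gradient(x)       # gradient, shape (N,)
    H = compute_hessian(x)        # Hessian, shape (N, N): O(N^2) storage (hypothetical helper)
    x -= np.linalg.solve(H, g)    # solving H s = g costs O(N^3)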
Summary
- Adam is a good default choice in many cases; even with constant learning rate
- SGD+Momentum can outperform Adam but may require more tuning of the LR and its schedule. A cosine schedule is preferred, since it has very few hyperparameters.
- L-BFGS is good if you can afford to do full batch updates.
Improve test error
- Better optimization algorithms help reduce training loss, but what we really care about is the error on new data; how do we reduce the gap?
Early Stopping: Always do this
- Stop training the model when accuracy on the validation set decreases. Or train for a long time, but always keep track of the model snapshot that worked best on val.
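A minimal sketch of the snapshot-tracking variant (train_one_epoch, evaluate, and save_checkpoint are hypothetical helpers):
best_val_acc = 0.0
for epoch in range(num_epochs):
    train_one_epoch(model, train_data)
    val_acc = evaluate(model, val_data)
    if val_acc > best_val_acc:          # keep the snapshot that did best on val
        best_val_acc = val_acc
        save_checkpoint(model, 'best_model.ckpt')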
Model Ensembles
- Train multiple independent models
- At test time average their results
(Take average of predicted probability distributions, then choose argmax)
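A minimal sketch of the test-time averaging (predict_proba is a hypothetical per-model method returning class probabilities):
import numpy as np

def ensemble_predict(models, x):
    # average the predicted probability distributions, then take the argmax
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return np.argmax(probs, axis=-1)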
Regularization
- To improve single-model performance, add regularization terms to the loss, e.g. L1, L2, Elastic net.
- Or use Dropout: in each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; 0.5 is common.
It forces the network to have a redundant representation and prevents co-adaptation of features. Dropout can be interpreted as training a large ensemble of models (that share parameters).
p = 0.5  # probability of keeping a unit active (higher = less dropout)

def train_step(X):
    # forward pass with dropout at train time
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 *= U1                             # drop
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 *= U2                             # drop
    out = np.dot(W3, H2) + b3
    # backward pass: compute gradients...
    # perform parameter update...

def predict(X):
    # scale activations by p at test time
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p
    out = np.dot(W3, H2) + b3
- More common: "Inverted dropout". Scale the mask at train time, e.g.
U1 = (np.random.rand(*H1.shape) < p) / p
so that no scaling is needed at test time.
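A minimal sketch of the train-time forward pass with inverted dropout (same W1, b1, ... as the snippet above); predict is then just the plain forward pass with no scaling:
def train_step(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p   # mask already scaled by 1/p at train time
    H1 *= U1
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p
    H2 *= U2
    out = np.dot(W3, H2) + b3
    # backward pass and parameter update as before...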
- A common pattern of regularization:
Training: add some kind of randomness, \(y = f_W(x, z)\)
Testing: average out the randomness (sometimes approximately),
\(y = f(x) = E_z[f(x,z)] = \int p(z)f(x,z)\, dz\)
- Data Augmentation:
Training: Add transformed versions of the data to the training set,
e.g. translation, rotation, stretching, shearing, lens distortions.
- DropConnect:
Training: Drop connections between neurons (set weights to 0)
Testing: Use all the connections
- Fractional Pooling:
Training: Use randomized pooling regions
Testing: Average predictions over several regions
- Stochastic Depth:
Training: Skip some layers in the network
Testing: Use all the layers
- Cutout:
Training: Set random image regions to zero
Testing: Use the full image
- Mixup:
Training: Train on random blends of images
Testing: Use the original images
e.g. randomly blend the pixels of pairs of training images, say 40% cat and 60% dog, and set the target label to cat: 0.4 and dog: 0.6.
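A minimal sketch of mixup on one pair of examples (lam is fixed at 0.4 to mirror the cat/dog example; in practice lam is typically sampled per pair, e.g. from a Beta distribution, which is an assumption beyond these notes):
def mixup_pair(x1, y1, x2, y2, lam=0.4):
    # blend pixels and one-hot labels with the same weight
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2   # e.g. cat: 0.4, dog: 0.6
    return x, y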
Summary
- Consider dropout for large fully-connected layers
- Batch normalization and data augmentation are almost always a good idea
- Try cutout and mixup, especially for small classification datasets
Choosing Hyperparameters
- Step 1: Check initial loss
Turn off weight decay and sanity check the loss at initialization;
e.g. log(C) for softmax with C classes.
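For example, with C = 10 classes the expected loss of an untrained softmax classifier is about log(10):
import numpy as np
print(-np.log(1.0 / 10))   # ~2.3026; the initial loss should be close to this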
- Step 2: Overfit a small sample
Try to train to 100% training accuracy on a small sample of the training data (~5-10 minibatches); fiddle with architecture, learning rate, and weight initialization.
If the loss is not going down, the LR is too low or the initialization is bad. If the loss explodes, the LR is too high or the initialization is bad.
- Step 3: Find an LR that makes the loss go down
Use the architecture from the previous step, use all the training data, turn on a small weight decay, and find a learning rate that makes the loss drop significantly within ~100 iterations.
- Step 4: Coarse grid, train for ~1-5 epochs
Choose a few values of learning rate and weight decay around what worked in Step 3, and train a few models for ~1-5 epochs.
- Step 5: Refine grid, train longer
Pick the best models from Step 4 and train them longer (~10-20 epochs) without learning rate decay.
- Step 6: Look at loss and accuracy curves
If accuracy is still going up, you need to train longer. If validation accuracy goes down while training accuracy keeps rising (a huge train/val gap), that is overfitting: increase regularization or get more data. If there is no gap between train and val, that is underfitting: train longer or use a bigger model.
- Look at learning curves
Losses may be noisy; use a scatter plot and also plot a moving average to see trends better. Cross-validation is useful too.
- Step 7: GO TO Step 5
- Hyperparameters to play with:
network architecture,
learning rate, its decay schedule, update type,
regularization (L2 / dropout strength)
- For hyperparameter optimization, consider both random search and grid search.
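A minimal sketch of random search over learning rate and weight decay, sampling on a log scale (train_and_eval is a hypothetical helper that trains briefly and returns validation accuracy):
import numpy as np

results = []
for trial in range(20):
    lr = 10 ** np.random.uniform(-4, -2)    # learning rate, log-uniform in [1e-4, 1e-2]
    wd = 10 ** np.random.uniform(-5, -2)    # weight decay, log-uniform in [1e-5, 1e-2]
    val_acc = train_and_eval(lr=lr, weight_decay=wd, num_epochs=5)
    results.append((val_acc, lr, wd))
print(sorted(results, reverse=True)[:5])    # best configurations by validation accuracy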
Summary
- Improve your training error:
  - Optimizers
  - Learning rate schedules
- Improve your test error:
  - Regularization
  - Choosing Hyperparameters