cs231n - Lecture 7. Training Neural Networks I
Activation Functions
Sigmoid
- $\sigma(x)=1/(1+e^{-x})$
- Squashes numbers to range [0,1]
- Historically popular since it has a nice interpretation as the saturating “firing rate” of a neuron.
- Problems:
- Vanishing gradients: saturated neurons “kill” the gradients; when a neuron saturates, its local gradient is near zero, so the gradients flowing back are near zero and the weights stop updating.
- Sigmoid outputs are not zero-centered (always positive), so the gradients on the weights are either all positive or all negative; updates then follow an inefficient zig-zag path.
- exp() is somewhat compute-expensive.
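A tiny NumPy sketch (illustrative, not from the lecture) of how the sigmoid’s local gradient vanishes as |x| grows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # local gradient, at most 0.25 (at x = 0)

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:4.1f}   sigmoid = {sigmoid(x):.4f}   grad = {sigmoid_grad(x):.6f}")
# The local gradient shrinks toward 0 as |x| grows, so upstream gradients get "killed".
```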
tanh(x)
- Squashes numbers to range [-1,1]
Zero-centered
But still kills gradients when saturated
ReLU (Rectified Linear Unit)
- \(f(x) = \mbox{max}(0,x)\)
Does not saturate (in + region)
Very computationally efficient
Converges much faster than sigmoid/tanh
But the output is not zero-centered, and for x < 0 the gradient is zero, so those neurons (“dead ReLUs”) are never updated
Leaky ReLU
- \(f(x) = \mbox{max}(0.01x,x)\)
(or parametric, PReLU: \(f(x) = \mbox{max}(\alpha x, x)\))
Does not saturate
Computationally efficient
Converges much faster than sigmoid/tanh
Will not “die”
ELU (Exponential Linear Units)
- \(f(x)= \begin{cases} x & \mbox{if }x>0 \\
\alpha(\mbox{exp}(x)-1) & \mbox{if }x\le 0\end{cases}\)
($\scriptstyle{\alpha = 1}$)
All benefits of ReLU
Closer to zero mean outputs
Negative saturation regime compared with Leaky ReLU adds some robustness to noise
Computation requires exp()
SELU (Scaled Exponential Linear Units)
- \(f(x)= \begin{cases} \lambda x & \mbox{if }x>0 \\
\lambda\alpha(e^x -1) & \mbox{otherwise}\end{cases}\)
($\scriptstyle{\alpha=1.6733, \lambda=1.0507}$)
Scaled version of ELU that works better for deep networks
“Self-normalizing” property: can train deep SELU networks without BatchNorm
Maxout “Neuron”
- \(\mbox{max}(w_1^T x + b_1, w_2^T x + b_2)\)
Does not have the basic form of dot product followed by a nonlinearity
Generalizes ReLU and Leaky ReLU
Linear regime; does not saturate or die
But doubles the number of parameters per neuron
Swish
- \(f(x)=x\sigma(\beta x)\)
Found by training a neural network to generate and test out different nonlinearities
Outperformed all other options on CIFAR-10 accuracy
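For reference, a minimal NumPy sketch of the elementwise activations above (SELU constants as listed; Maxout is omitted because it needs two sets of weights per unit):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def selu(x, alpha=1.6733, lam=1.0507):
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1))

def swish(x, beta=1.0):
    return x * (1.0 / (1.0 + np.exp(-beta * x)))   # x * sigmoid(beta * x)
```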
Summary
- Use ReLU and be careful with learning rates
Try out Leaky ReLU / Maxout / ELU / SELU to squeeze out some marginal gains
Don’t use sigmoid or tanh
Data Preprocessing
- Options include zero-centering, normalization, decorrelation (PCA), and whitening of the data.
- After normalization, the loss is less sensitive to small changes in the weights and is easier to optimize.
- In practice for images, only centering (mean subtraction) is used.
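A minimal NumPy sketch of these preprocessing options, assuming X is an N x D data matrix (the PCA/whitening steps are rarely used for images):

```python
import numpy as np

X = np.random.randn(1000, 100)                        # hypothetical (N, D) data matrix

X_centered = X - X.mean(axis=0)                       # zero-center each dimension
X_normalized = X_centered / X_centered.std(axis=0)    # unit variance per dimension

# Decorrelation / whitening via PCA
cov = X_centered.T @ X_centered / X_centered.shape[0]
U, S, _ = np.linalg.svd(cov)
X_decorrelated = X_centered @ U                       # rotate into the PCA basis
X_whitened = X_decorrelated / np.sqrt(S + 1e-5)       # equalize scale along each axis
```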
Weight Initialization
- First idea: Small random numbers
(gaussian with zero mean and 1e-2 standard deviation)
W = 0.01 * np.random.randn(D_in, D_out)
- Works okay for small networks, but causes problems with deeper networks:
all activations (and hence gradients) tend to zero and no learning proceeds.
“Xavier” Initialization
- std = 1/sqrt(D_in)
For conv layers, D_in = filter_size^2 * input_channels
W = np.random.randn(D_in, D_out) / np.sqrt(D_in)
x = np.tanh(x.dot(W))
- Activations are nicely scaled for deeper layers
Works well with nonlinearities like tanh and sigmoid
But fails with ReLU: activations again collapse to zero and no learning happens
Kaiming / MSRA Initialization
- ReLU correction: std = sqrt(2/D_in)
W = np.random.randn(D_in, D_out) * np.sqrt(2/D_in)
x = np.maximum(0, x.dot(W))
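A small illustrative experiment (not from the lecture code) that pushes random data through a stack of linear + activation layers and tracks the activation scale, showing why the three initializations behave as described:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def layer_act_std(init_std, act, depth=10, dim=512, n=1000):
    """Return the std of the activations after each of `depth` layers."""
    x = np.random.randn(n, dim)
    stds = []
    for _ in range(depth):
        W = np.random.randn(dim, dim) * init_std
        x = act(x.dot(W))
        stds.append(x.std())
    return stds

print(layer_act_std(0.01, np.tanh)[-1])               # tiny init: activations collapse to ~0
print(layer_act_std(1 / np.sqrt(512), np.tanh)[-1])   # Xavier + tanh: stays well scaled
print(layer_act_std(1 / np.sqrt(512), relu)[-1])      # Xavier + ReLU: shrinks layer by layer
print(layer_act_std(np.sqrt(2 / 512), relu)[-1])      # Kaiming + ReLU: stays well scaled
```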
Batch Normalization
- To make each dimension zero-mean and unit-variance, apply
\(\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mbox{Var}[x^{(k)}]}}\)
- Usually inserted after Fully Connected or Convolutional layers, and before the nonlinearity.
- Makes deep networks much easier to train
Improves gradient flow
Allows higher learning rates, faster convergence
Networks become more robust to initialization
Acts as regularization during training
Zero overhead at test-time: can be fused with conv
Behaves differently during training and testing: a common source of bugs
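A minimal NumPy sketch of a BatchNorm forward pass, including the learnable scale/shift (gamma, beta) and the running statistics that the test-time path uses (illustrative, not the lecture’s code):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, momentum=0.9, eps=1e-5):
    """x: (N, D) mini-batch. Updates the running statistics in place in train mode."""
    if train:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # keep running statistics for use at test time
        running_mean *= momentum
        running_mean += (1 - momentum) * mu
        running_var *= momentum
        running_var += (1 - momentum) * var
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta            # learnable scale and shift

# usage sketch
D = 100
gamma, beta = np.ones(D), np.zeros(D)
running_mean, running_var = np.zeros(D), np.ones(D)
out = batchnorm_forward(np.random.randn(32, D), gamma, beta,
                        running_mean, running_var, train=True)
```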
Comparison of Normalization Layers
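The normalization layers differ mainly in which axes the mean/variance are computed over; a minimal NumPy sketch for an (N, C, H, W) feature map (the group count G is illustrative):

```python
import numpy as np

x = np.random.randn(32, 64, 7, 7)                  # (N, C, H, W) feature map

bn_mean = x.mean(axis=(0, 2, 3), keepdims=True)    # BatchNorm: per channel, across the batch
ln_mean = x.mean(axis=(1, 2, 3), keepdims=True)    # LayerNorm: per sample, over all channels
in_mean = x.mean(axis=(2, 3), keepdims=True)       # InstanceNorm: per sample and per channel
G = 8                                              # GroupNorm: per sample, per group of channels
gn_mean = x.reshape(32, G, 64 // G, 7, 7).mean(axis=(2, 3, 4), keepdims=True)
```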
Transfer Learning
- Deep learning models are trained to capture characteristics of the data, from general features in the first layers to task-specific features in the last layers.
- In transfer learning, we import a pre-trained model and fine-tune it for our own task (see the sketch at the end of this section).
- Strategies: e.g. with a small dataset, freeze the pre-trained layers and retrain only the final classifier; with more data, fine-tune more (or all) of the layers with a small learning rate.
- Transfer learning with CNNs is pervasive,
e.g. for Object Detection (Fast R-CNN), Image Captioning (CNN + RNN), etc.,
but it is not always necessary.
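A minimal sketch of the common recipe, assuming PyTorch/torchvision (the model choice, class count, and hyperparameters are placeholders): load a pre-trained backbone, freeze it, and retrain only a new final layer; with more data, unfreeze more layers and fine-tune them with a smaller learning rate.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10                                   # placeholder for the new task

# 1) Load an ImageNet-pretrained backbone
model = models.resnet18(weights="IMAGENET1K_V1")   # older torchvision: pretrained=True

# 2) Freeze the pre-trained layers
for p in model.parameters():
    p.requires_grad = False

# 3) Replace the final layer with a fresh classifier for the new task
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Train only the new layer first; later, optionally unfreeze earlier layers
# and fine-tune them with a ~10x smaller learning rate.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
```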