/ CS231N

cs231n - Lecture 4. Neural Networks and Backpropagation

Image Features

  • Problem: Linear Classifiers are not very powerful
    • Visual Viewpoint: Linear classifiers learn one template per class
    • Geometric Viewpoint: Linear classifiers can only draw linear decision boundaries
  • Image Features: Motivation
    After applying a feature transform, the points can be separated by a linear classifier (see the sketch at the end of this list), e.g. the polar transform
    $f(x,y) = (r(x,y), \theta(x,y))$

  • Image Features vs. ConvNets
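
A minimal numpy sketch of the feature-transform motivation above. The data (points on circles of radius 1 and 3) and the helper name polar_features are made up for illustration: the two classes are not linearly separable in Cartesian coordinates, but after the transform $(x,y)\mapsto(r,\theta)$ a single linear threshold on $r$ separates them.

import numpy as np

# Toy data (hypothetical): class 0 on a circle of radius 1, class 1 on a circle of radius 3
angles = np.linspace(0, 2 * np.pi, 50, endpoint=False)
inner = np.stack([1.0 * np.cos(angles), 1.0 * np.sin(angles)], axis=1)   # class 0
outer = np.stack([3.0 * np.cos(angles), 3.0 * np.sin(angles)], axis=1)   # class 1

def polar_features(xy):
    """Feature transform f(x, y) = (r(x, y), theta(x, y))."""
    x, y = xy[:, 0], xy[:, 1]
    return np.stack([np.sqrt(x**2 + y**2), np.arctan2(y, x)], axis=1)

# In (r, theta) space the classes are separated by the linear rule r > 2
r_inner = polar_features(inner)[:, 0]
r_outer = polar_features(outer)[:, 0]
print((r_inner < 2).all(), (r_outer > 2).all())   # True True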

Neural Networks

  • Neural networks, also called fully connected networks (FCN) or sometimes multi-layer perceptrons (MLP)
    (Before) Linear score function:
    \(\begin{align*}& f=Wx \\ & x\in\mathbb{R}^D, W\in\mathbb{R}^{C\times D} \end{align*}\)
    $\rightarrow$ 2-layer Neural Network:
    \(\begin{align*}& f=W_2 \mbox{max}(0,W_1 x) \\ & x\in\mathbb{R}^D, W_1\in\mathbb{R}^{H\times D}, W_2\in\mathbb{R}^{C\times H} \end{align*}\)
    $\rightarrow$ or 3-layer Neural Network:
    \(f=W_3\mbox{max}(0,W_2 \mbox{max}(0,W_1 x)) \\ \vdots\)
    (In practice we will usually add a learnable bias at each layer as well)
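
A shape-checked sketch of the 2-layer network above with the learnable biases added; the sizes D, H, C below are arbitrary choices for illustration, not values fixed by the lecture.

import numpy as np

D, H, C = 3072, 100, 10                  # input dim, hidden dim, number of classes (arbitrary)
x = np.random.randn(D)                   # one input example
W1, b1 = np.random.randn(H, D), np.random.randn(H)
W2, b2 = np.random.randn(C, H), np.random.randn(C)

h = np.maximum(0, W1.dot(x) + b1)        # hidden layer: max(0, W1 x + b1), shape (H,)
f = W2.dot(h) + b2                       # class scores: W2 h + b2, shape (C,)
print(h.shape, f.shape)                  # (100,) (10,)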

  • Neural networks: hierarchical computation
    Learn hundreds of templates instead of 10, and share templates between classes

  • Why is max operator important?
    The function $\mbox{max}(0,z)$ is called the activation function.
    Q: What if we try to build a neural network without one?
    A: We end up with a linear classifier again!
    $f=W_2 W_1 x$; with $W_3=W_2 W_1$ this is just $f = W_3 x$, a linear classifier again (see the short check below).
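
A quick numeric check of the claim, using random matrices with arbitrary sizes: without the nonlinearity, stacking two linear layers is the same as one linear layer with $W_3 = W_2 W_1$.

import numpy as np

D, H, C = 5, 4, 3
x = np.random.randn(D)
W1, W2 = np.random.randn(H, D), np.random.randn(C, H)

two_layers = W2.dot(W1.dot(x))             # f = W2 (W1 x), no activation in between
W3 = W2.dot(W1)                            # collapse into a single matrix
one_layer = W3.dot(x)                      # f = W3 x
print(np.allclose(two_layers, one_layer))  # True: still a linear classifier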

  • Activation functions
    ReLU($\mbox{max}(0,z)$) is a good default choice for most problems
    Others: Sigmoid, tanh, Leaky ReLU, Maxout, ELU, etc.
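
For reference, a few of the listed activations written as numpy one-liners; the Leaky ReLU slope and ELU alpha below are common defaults, not values fixed by the lecture.

import numpy as np

relu       = lambda z: np.maximum(0, z)                          # max(0, z)
sigmoid    = lambda z: 1.0 / (1.0 + np.exp(-z))                  # squashes to (0, 1)
tanh       = lambda z: np.tanh(z)                                # squashes to (-1, 1)
leaky_relu = lambda z, a=0.01: np.where(z > 0, z, a * z)         # small slope for z < 0
elu        = lambda z, a=1.0: np.where(z > 0, z, a * (np.exp(z) - 1.0))

z = np.array([-2.0, 0.0, 2.0])
print(relu(z), leaky_relu(z), sigmoid(z))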

  • Neural networks: Architectures
    Example feed-forward computation of a neural network

# forward-pass of a 3-layer neural network:
import numpy as np
f = lambda x: 1.0/(1.0 + np.exp(-x))       # activation function (use sigmoid)
# random weights and biases (layer sizes 3 -> 4 -> 4 -> 1, implied by the shapes below)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)
x = np.random.randn(3, 1)                  # random input vector of three numbers (3x1)
h1 = f(np.dot(W1, x) + b1)                 # calculate first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)                # calculate second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3                  # output neuron (1x1)

Full implementation of training a 2-layer Neural Network:

import numpy as np
from numpy.random import randn

# Define the network: batch size N, input dim D_in, hidden dim H, output dim D_out
N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)       # random training data
w1, w2 = randn(D_in, H), randn(H, D_out)     # random weight initialization

for t in range(2000):
    # Forward pass: sigmoid hidden layer, linear output, squared-error loss
    h = 1 / (1 + np.exp(-x.dot(w1)))
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backward pass: calculate the analytical gradients
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))

    # Gradient descent step
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2

  • Plugging in neural networks with loss functions
    $s = f(x;W_1,W_2) = W_2\mbox{max}(0,W_1 x)$ Nonlinear score function
    $L_i = \sum_{j\ne y_i}\mbox{max}(0,s_j-s_{y_i}+1)$ SVM Loss on predictions
    $R(W)=\sum_k W_k^2$ Regularization
    $L=\frac{1}{N}\sum_{i=1}^N L_i + \lambda R(W_1) + \lambda R(W_2)$ Total loss: data loss + regularization
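
A small numeric sketch of the total loss above for a single example (so $N=1$); the sizes, margin of 1, and regularization strength $\lambda$ below are made up for illustration.

import numpy as np

D, H, C = 4, 5, 3
np.random.seed(0)
x, y_i = np.random.randn(D), 2                       # one example with true class 2
W1, W2 = np.random.randn(H, D), np.random.randn(C, H)

s = W2.dot(np.maximum(0, W1.dot(x)))                 # nonlinear score function
margins = np.maximum(0, s - s[y_i] + 1)              # SVM hinge terms, margin 1
margins[y_i] = 0                                     # don't count the correct class
data_loss = margins.sum()                            # L_i
reg_loss = np.sum(W1**2) + np.sum(W2**2)             # R(W1) + R(W2)
lam = 0.1                                            # regularization strength (arbitrary)
L = data_loss + lam * reg_loss                       # total loss for this single example
print(L)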

  • Problem: How to compute gradients?
    If we can compute the partial derivatives $\frac{\partial L}{\partial W_1}$ and $\frac{\partial L}{\partial W_2}$, then we can learn $W_1$ and $W_2$ with gradient descent.

Backpropagation

  • Chain rule:
    \(\begin{align*} \frac{\partial f}{\partial y}=\underbrace{\frac{\partial f}{\partial q}}_{\mbox{Upstream gradient}} \times \underbrace{\frac{\partial q}{\partial y}}_{\mbox{Local gradient}} \end{align*}\)
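
Here $q$ denotes an intermediate node. A worked example, assuming the usual toy circuit $f = qz$ with $q = x + y$ and input values chosen only for illustration:

# Forward pass for f = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y               # q = 3
f = q * z               # f = -12

# Backward pass: upstream gradient times local gradient
df_dq = z               # local gradient of f = q*z w.r.t. q
df_dz = q
dq_dx = 1.0             # local gradients of q = x + y
dq_dy = 1.0
df_dx = df_dq * dq_dx   # chain rule: -4
df_dy = df_dq * dq_dy   # chain rule: -4
print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0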

  • Patterns in gradient flow
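
The figure referenced here summarizes a few common local-gradient patterns. A small sketch with made-up scalar inputs verifies them: an add gate distributes the upstream gradient unchanged, a mul gate multiplies it by the other input (swap multiplier), and a max gate routes it to whichever input was larger.

x, y = 3.0, -1.0
upstream = 2.0                                   # gradient arriving from above

# add gate: z = x + y  ->  local gradients are 1, so it distributes the upstream gradient
dz_dx, dz_dy = 1.0, 1.0
print(upstream * dz_dx, upstream * dz_dy)        # 2.0 2.0

# mul gate: z = x * y  ->  local gradient w.r.t. each input is the *other* input
print(upstream * y, upstream * x)                # -2.0 6.0

# max gate: z = max(x, y)  ->  gradient is routed to the larger input only
print(upstream * (x >= y), upstream * (x < y))   # 2.0 0.0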