
CS224N - Lecture 3. Backprop and Neural Networks

Named Entity Recognition (NER)

  • Task: find and classify names in text
  • Possible uses:
    • Tracking mentions of particular entities in documents
    • For question answering, answers are usually named entities
  • Often followed by Named Entity Linking/Canonicalization into Knowledge Base

Simple NER: Window classification using binary logistic classifier

  • Using word vectors, build a context window of concatenated word vectors, pass it through a neural network layer, and feed the result to a logistic classifier (a minimal sketch follows the figure below).

  • Idea: classify each word in its context window of neighboring words
  • Train logistic classifier on hand-labeled data to classify center word {yes/no} for each class based on a concatenation of word vectors in a window
    • In practice, we usually use multi-class softmax
  • Example: Classify “Paris” as +/- location (a binary is_location decision) in the context of a sentence with window length 2:
    sentence = “the museums in Paris are amazing to see .”
    $x_{\text{window}} = [x_{\text{museums}}\quad x_{\text{in}}\quad x_{\text{Paris}}\quad x_{\text{are}}\quad x_{\text{amazing}}]^T$
  • Resulting vector $x_{\text{window}} = x\in \mathbb{R}^{5d}$, a column vector.
  • To classify all words: run the classifier for each class on the window vector centered on each word in the sentence

[figure: neural window classifier]
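
A minimal sketch of this window classifier, assuming $d$-dimensional word vectors, one hidden layer, and a logistic output (all names, sizes, and the ReLU choice are illustrative, not taken from the lecture):

```python
import numpy as np

d, window, hidden = 4, 5, 8          # word-vector dim, window size (2 + center + 2), hidden size
rng = np.random.default_rng(0)

# toy embeddings for the example sentence
vocab = ["the", "museums", "in", "Paris", "are", "amazing", "to", "see", "."]
E = {w: rng.standard_normal(d) for w in vocab}

# classifier parameters (random here; learned from hand-labeled data in practice)
W = rng.standard_normal((hidden, window * d))
b = rng.standard_normal(hidden)
u = rng.standard_normal(hidden)

def is_location_prob(words, center):
    """Concatenate the window's word vectors, apply a hidden layer, score with a logistic output."""
    x_window = np.concatenate([E[w] for w in words[center - 2 : center + 3]])  # shape (5d,)
    h = np.maximum(0.0, W @ x_window + b)     # hidden layer (ReLU chosen for concreteness)
    s = u @ h                                 # scalar score
    return 1.0 / (1.0 + np.exp(-s))           # P(center word is a location)

sent = "the museums in Paris are amazing to see .".split()
print(is_location_prob(sent, sent.index("Paris")))
```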

Computing Gradients by Hand

  • Matrix calculus: Fully vectorized gradients
    • “Multivariable calculus is just like single-variable calculus if you use matrices”
    • Much faster and more useful than non-vectorized gradients
    • But working out a non-vectorized gradient (the partial derivative for one parameter) can be good for intuition
  • Vector gradient
    Given a function with 1 output and n inputs
    $f(x) = f(x_1, x_2, \ldots, x_n)$
    Its gradient is a vector of partial derivatives w.r.t. each input
    \(\frac{\partial f}{\partial x} = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \cdots, \frac{\partial f}{\partial x_n} \right]\)
  • Jacobian Matrix: Generalization of the Gradient
    Given a function with m outputs and n inputs
    $f(x) = [f_1(x_1, x_2, \ldots, x_n), \ldots, f_m(x_1, x_2, \ldots, x_n)]$
    Its Jacobian is an $m \times n$ matrix of partial derivatives
    \(\begin{align*} \frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \end{align*}\) $\quad\quad$ \(\begin{align*}\left( \frac{\partial f}{\partial x} \right)_{ij} = \frac{\partial f_i}{\partial x_j}\end{align*}\)
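    A concrete instance (my example, not from the slides): for $f(x) = [x_1 + 2x_2,\; x_1 x_2]$ with $m = n = 2$,
    \(\begin{align*} \frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 1 & 2 \\ x_2 & x_1 \end{bmatrix} \end{align*}\)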

  • Chain Rule
    • For composition of one-variable functions: multiply derivatives
      $z = g(y)$, $y = f(x)$
      \(\begin{align*}\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}\end{align*}\)
    • For multiple variables at once: multiply Jacobians
      $h = f(z)$
      $z = Wx + b$
      \(\begin{align*}\frac{\partial h}{\partial x} = \frac{\partial h}{\partial z}\frac{\partial z}{\partial x} = \cdots\end{align*}\)
  • Example Jacobian: Elementwise activation function

[figure: Jacobian of an elementwise activation function]
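
Assuming $h = f(z)$ applies the activation $f$ elementwise (so $h_i = f(z_i)$), each output depends only on its own input and the Jacobian is diagonal:

\(\begin{align*} \left( \frac{\partial h}{\partial z} \right)_{ij} = \frac{\partial h_i}{\partial z_j} = \begin{cases} f'(z_i) & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \qquad\Rightarrow\qquad \frac{\partial h}{\partial z} = \mathrm{diag}\big(f'(z)\big) \end{align*}\)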

  • Other Jacobians

[figure: other Jacobians]
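
The other Jacobians follow directly from the definitions, assuming $z = Wx + b$ and a scalar score of the form $s = u^\top h$ (this scoring form is my assumption; the slides' exact setup may differ):

\(\begin{align*} \frac{\partial}{\partial x}(Wx + b) = W, \qquad \frac{\partial}{\partial b}(Wx + b) = I, \qquad \frac{\partial}{\partial u}\big(u^\top h\big) = h^\top \end{align*}\)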

  • Write out the Jacobians

[figure: writing out the Jacobians]
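
Putting the pieces together for $\frac{\partial s}{\partial b}$ (again assuming $s = u^\top h$, $h = f(z)$, $z = Wx + b$):

\(\begin{align*} \frac{\partial s}{\partial b} = \frac{\partial s}{\partial h}\frac{\partial h}{\partial z}\frac{\partial z}{\partial b} = u^\top \,\mathrm{diag}\big(f'(z)\big)\, I = \big[\, u_1 f'(z_1) \;\; \cdots \;\; u_n f'(z_n) \,\big] \end{align*}\)

This row vector is the local error signal $\delta = \frac{\partial s}{\partial z}$ (note $\frac{\partial z}{\partial b} = I$, so $\frac{\partial s}{\partial b} = \frac{\partial s}{\partial z}$), and it reappears in $\frac{\partial s}{\partial W}$ below.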

  • Derivative with respect to Matrix: Output shape
    For $\frac{\partial s}{\partial W}$, the output $s$ is a scalar and the input $W \in \mathbb{R}^{n\times m}$, so the Jacobian would be a 1 by $nm$ row vector.
    $\rightarrow$ That shape is inconvenient for the gradient descent update \(\theta^{\text{new}} = \theta^{\text{old}} - \alpha\nabla_\theta J(\theta)\).
    • Instead, use the shape convention: the shape of the gradient is the shape of the parameters
      n by m matrix \(\begin{align*}\frac{\partial s}{\partial W} = \begin{bmatrix} \frac{\partial s}{\partial W_{11}} & \cdots & \frac{\partial s}{\partial W_{1m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial s}{\partial W_{n1}} & \cdots & \frac{\partial s}{\partial W_{nm}} \end{bmatrix}\end{align*}\)
    • $\frac{\partial s}{\partial W} = \delta\frac{\partial z}{\partial W}$
      $z = Wx + b$
      \(\begin{align*} \frac{\partial s}{\partial W} &= \delta^T x^T \\ & = \begin{bmatrix} \delta_1 \\ \vdots \\ \delta_n \end{bmatrix} [x_1, \ldots, x_m] = \begin{bmatrix} \delta_1 x_1 & \cdots & \delta_1 x_m \\ \vdots & \ddots & \vdots \\ \delta_n x_1 & \cdots & \delta_n x_m \end{bmatrix} \end{align*}\)
      Here, $\delta$ is the local error signal at $z$ and $x$ is the local input signal.
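
The outer-product formula is easy to sanity-check numerically. A small sketch, assuming the running setup $s = u^\top f(Wx + b)$ with a sigmoid $f$ (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 5
W, b = rng.standard_normal((n, m)), rng.standard_normal(n)
u, x = rng.standard_normal(n), rng.standard_normal(m)

f = lambda z: 1.0 / (1.0 + np.exp(-z))            # sigmoid activation
f_prime = lambda z: f(z) * (1.0 - f(z))

def score(W_):
    return u @ f(W_ @ x + b)                      # scalar score s

# analytic gradient under the shape convention: ds/dW = delta^T x^T (an n x m outer product)
z = W @ x + b
delta = u * f_prime(z)                            # local error signal at z, shape (n,)
grad_analytic = np.outer(delta, x)                # same shape as W

# numerical gradient by central finite differences
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad_numeric[i, j] = (score(Wp) - score(Wm)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))   # should be tiny (~1e-9 or less)
```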

What shape should derivatives be?

  • Two options:
    1. Use Jacobian form as much as possible, reshape to follow the shape convention at the end:
      • at the end, transpose $\frac{\partial s}{\partial b}$ (a row vector in Jacobian form) to make the derivative a column vector, resulting in $\delta^T$
    2. Always follow the shape convention
      • Look at dimensions to figure out when to transpose and/or reorder terms
      • The error message $\delta$ that arrives at a hidden layer has the same dimensionality as that hidden layer
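
A tiny NumPy illustration of the two options for $\frac{\partial s}{\partial b}$ (a sketch with made-up shapes, not lecture code):

```python
import numpy as np

n = 3
delta = np.random.randn(1, n)      # Jacobian form: ds/db is a 1 x n row vector
b = np.zeros((n, 1))               # the parameter b is an n x 1 column vector

# Option 1: keep the Jacobian form while deriving, transpose at the very end
grad_b = delta.T                   # delta^T, shape n x 1 -- now matches the shape of b

# Option 2: follow the shape convention throughout -- the gradient always has the
# shape of its parameter, so the SGD update works without any reshaping
lr = 0.1
b -= lr * grad_b
```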

Backpropagation: Chain Rule

  • Other trick:
    re-use derivatives computed for higher layers when computing derivatives for lower layers, to minimize computation (see the sketch after the figure below)

[figure: re-using derivatives in backpropagation]
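
A minimal sketch of this re-use, continuing the assumed setup $s = u^\top f(Wx + b)$: the error signal $\delta$ is computed once at $z$ and then shared by the gradients for $b$, $W$, and $x$ (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 5
W, b = rng.standard_normal((n, m)), rng.standard_normal(n)
u, x = rng.standard_normal(n), rng.standard_normal(m)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# forward pass: keep the intermediate values the backward pass will need
z = W @ x + b
h = sigmoid(z)
s = u @ h

# backward pass: compute delta once ...
ds_dh = u                          # from s = u . h
dh_dz = h * (1.0 - h)              # sigmoid'(z), elementwise
delta = ds_dh * dh_dz              # local error signal at z, shape (n,)

# ... and re-use it for everything below z in the computation graph
grad_b = delta                     # ds/db (shape convention: same shape as b)
grad_W = np.outer(delta, x)        # ds/dW = delta^T x^T
grad_x = W.T @ delta               # ds/dx, passed further down if x were computed by earlier layers
```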