cs224n - Lecture 8. Attention (Contd.)
Attention
- Encoder hidden states $\mathbf{h}_1, \ldots, \mathbf{h}_N \in \mathbb{R}^h$
- On timestep $t$, we have Decoder hidden state $\mathbf{s}_t \in \mathbb{R}^h$
- Attention score $\mathbf{e}^t$ for this step:
  \(\mathbf{e}^t = \left[ \mathbf{s}_t^T \mathbf{h}_1, \ldots, \mathbf{s}_t^T \mathbf{h}_N \right] \in \mathbb{R}^N\)
- Take softmax to get the Attention distribution:
  \(\alpha^t = \text{softmax}(\mathbf{e}^t) \in \mathbb{R}^N\)
- Use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the Attention output:
  \(\mathbf{a}_t = \sum_{i=1}^N \alpha_i^t \mathbf{h}_i \in \mathbb{R}^h\)
- Finally, concatenate the attention output $\mathbf{a}_t$ with the decoder hidden state $\mathbf{s}_t$ and proceed as in the non-attention seq2seq model:
  \(\left[ \mathbf{a}_t ; \mathbf{s}_t \right] \in \mathbb{R}^{2h}\)
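To make these steps concrete, here is a minimal NumPy sketch of a single decoder timestep with basic dot-product attention; function and variable names are illustrative, not from the lecture:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def attention_step(H, s_t):
    """One decoder timestep: H holds encoder states h_1..h_N (N x h),
    s_t is the decoder state at timestep t (h,)."""
    e_t = H @ s_t                    # scores e^t = [s_t^T h_1, ..., s_t^T h_N], shape (N,)
    alpha_t = softmax(e_t)           # attention distribution alpha^t, shape (N,)
    a_t = alpha_t @ H                # weighted sum of encoder states, shape (h,)
    return a_t, np.concatenate([a_t, s_t])   # [a_t ; s_t] has shape (2h,)

# Toy usage: N = 5 encoder states of size h = 8.
H = np.random.randn(5, 8)
s_t = np.random.randn(8)
a_t, out = attention_step(H, s_t)
print(a_t.shape, out.shape)          # (8,) (16,)
```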
Attention is great
- Attention significantly improves NMT performance
  - Allows the decoder to focus on certain parts of the source
- Attention provides a more “human-like” model of the MT process
  - You can look back at the source sentence while translating, rather than needing to remember it all
- Attention solves the bottleneck problem
  - Attention allows the decoder to look directly at the source, bypassing the bottleneck
- Attention helps with the vanishing gradient problem
  - Provides a shortcut to faraway states
- Attention provides some interpretability
  - By inspecting the attention distribution, we can see what the decoder was focusing on, and we get (soft) alignment for free: without explicitly training an alignment system, the network learned alignment by itself
Attention variants
- With some values \(\mathbf{h}_1, \ldots, \mathbf{h}_N \in \mathbb{R}^{d_1}\) and a query \(\mathbf{s} \in \mathbb{R}^{d_2}\),
Attention always involves:
1. Computing the attention scores $\mathbf{e}\in\mathbb{R}^N$
2. Taking softmax to get attention distribution:
$\alpha = \text{softmax}(\mathbf{e})\in\mathbb{R}^N$
3. Taking a weighted sum of the values to get the attention output:
$\mathbf{a} = \sum_{i=1}^N \alpha_i \mathbf{h}_i \in \mathbb{R}^{d_1}$
$\rightarrow$ There are multiple ways to compute attention scores $\mathbf{e}\in \mathbb{R}^N$ from $\mathbf{h}_1, \ldots, \mathbf{h}_N \in \mathbb{R}^{d_1}$ and $\mathbf{s} \in \mathbb{R}^{d_2}$:
- Basic dot-product attention: $\mathbf{e}_i = \mathbf{s}^T \mathbf{h}_i \in \mathbb{R}$
- What we saw earlier; this assumes $d_1 = d_2$.
- Multiplicative attention: $\mathbf{e}_i = \mathbf{s}^T \mathbf{W} \mathbf{h}_i \in \mathbb{R}$
- Luong, Pham, and Manning 2015
- Where $\mathbf{W} \in \mathbb{R}^{d_2 \times d_1}$ is a weight matrix of learnable parameters (but too many!)
- Reduced rank multiplicative attention: $\mathbf{e}_i = \mathbf{s}^T (\mathbf{U}^T \mathbf{V}) \mathbf{h}_i = (\mathbf{U}\mathbf{s})^T (\mathbf{V}\mathbf{h}_i)$
- low rank matrices $\mathbf{U} \in \mathbb{R}^{k\times d_2}$, $\mathbf{V} \in \mathbb{R}^{k\times d_1}$, $k \ll d_1, d_2$
- Additive attention: $\mathbf{e}_i = \mathbf{v}^T \tanh (\mathbf{W}_1 \mathbf{h}_i + \mathbf{W}_2 \mathbf{s}) \in \mathbb{R}$
- Bahdanau, Cho, and Bengio 2014
- Where $\mathbf{W}_1 \in \mathbb{R}^{d_3 \times d_1}$, $\mathbf{W}_2 \in \mathbb{R}^{d_3 \times d_2}$ are weight matrices and $\mathbf{v} \in \mathbb{R}^{d_3}$ is a weight vector.
- $d_3$ (the attention dimensionality) is a hyperparameter
- “Additive” is a weird name; it’s really using a neural net layer.
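The variants differ only in how the scores $\mathbf{e}$ are computed. A NumPy sketch of the three parameterized variants, with random arrays standing in for learned weights (the basic dot-product case, which requires $d_1 = d_2$, appears in the earlier sketch):

```python
import numpy as np

d1, d2, d3, k, N = 6, 4, 5, 3, 7
H = np.random.randn(N, d1)          # values h_1..h_N
s = np.random.randn(d2)             # query

# Multiplicative attention: e_i = s^T W h_i
W = np.random.randn(d2, d1)
e_mult = H @ W.T @ s                # shape (N,)

# Reduced-rank multiplicative attention: e_i = (U s)^T (V h_i)
U = np.random.randn(k, d2)
V = np.random.randn(k, d1)
e_lowrank = (H @ V.T) @ (U @ s)     # shape (N,)

# Additive attention: e_i = v^T tanh(W1 h_i + W2 s)
W1 = np.random.randn(d3, d1)
W2 = np.random.randn(d3, d2)
v = np.random.randn(d3)
e_add = np.tanh(H @ W1.T + W2 @ s) @ v   # W2 s broadcasts over rows; shape (N,)
```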
Attention is a general Deep Learning technique
Attention is a great way to improve the sequence-to-sequence model for Machine Translation. However, you can use attention in many architectures (not just seq2seq) and many tasks (not just MT)
More general definition of attention:
- Given a set of vector values, and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.
- We sometimes say that the query attends to the values. A kind of memory access mechanism.
- Intuition:
- The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.
- Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query).
- Upshot:
- Attention has become the powerful, flexible, general way to do pointer and memory manipulation in deep learning models. A new idea from after 2010!
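To underline the generality, here is a minimal sketch of attention as a standalone function of a query and a set of values, with a pluggable scoring function; all names are illustrative:

```python
import numpy as np

def attend(query, values, score_fn):
    """General attention: the query attends to the values and returns a
    fixed-size, query-dependent summary of them."""
    e = score_fn(query, values)               # scores, shape (N,)
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()               # attention distribution
    return alpha @ values                     # weighted sum, shape (d1,)

# Example with basic dot-product scoring (assumes matching dimensions).
dot_score = lambda q, vals: vals @ q
values = np.random.randn(10, 8)               # any set of vectors, not just encoder states
query = np.random.randn(8)
summary = attend(query, values, dot_score)    # shape (8,)
```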