cs224n - Lecture 4. Dependency Parsing
Two views of linguistic structure: Phrase structure
- Constituency = phrase structure grammar = context-free grammars (CFGs)
- Phrase structure organizes words into nested constituents
- Starting unit: words (noun, preposition, adjective, determiner, …)
the, cat, cuddly, by, door
- Words combine into phrases
the cuddly cat (noun phrase),
by the door (prepositional phrase: preposition (by) + noun phrase)
- Phrases can combine into bigger phrases
the cuddly cat by the door (noun phrase)
- Lexicon:
$\text{N} \rightarrow \text{cat}$
$\text{N} \rightarrow \text{door}$
$\text{Det} \rightarrow \text{the}$
$P \rightarrow \text{by}$
$\text{Adj} \rightarrow \text{cuddly}$
$\vdots$
- Grammar:
$\text{NP} \rightarrow \text{Det } \ \text{ (Adj)}^{\ast} \ \text{ N } \ \text{ (PP)}$
$\text{PP} \rightarrow \text{P } \ \text{ NP}$
$\vdots$
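As a quick check of the toy grammar above, here is a minimal sketch using NLTK (an assumption; the lecture does not prescribe any library). Since `nltk.CFG` has no $(\text{Adj})^{\ast}$ or optional-PP notation, those rules are expanded into explicit alternatives.

```python
# Minimal sketch: the toy lexicon and grammar above, expressed as an NLTK CFG.
# The (Adj)* and optional (PP) of the NP rule are expanded into explicit alternatives.
import nltk

grammar = nltk.CFG.fromstring("""
    NP  -> Det N | Det Adj N | Det N PP | Det Adj N PP
    PP  -> P NP
    Det -> 'the'
    Adj -> 'cuddly'
    N   -> 'cat' | 'door'
    P   -> 'by'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cuddly cat by the door".split()):
    tree.pretty_print()   # shows the nested constituents: an NP containing a PP
```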
Two views of linguistic structure: Dependency structure
- Dependency structure shows which words depend on (modify, attach to, or are arguments of) which other words.
Why do we need sentence structure?
- Humans communicate complex ideas by composing words together into bigger units to convey complex meanings
- Listeners need to work out what modifies (attaches to) what
- A model needs to understand sentence structure in order to be able to interpret language correctly
Ambiguities
- Prepositional phrase ambiguity
- A key parsing decision is how we ‘attach’ various constituents
- PPs, adverbial or participial phrases, infinitives, coordinations
e.g.
\(\begin{align*} \text{The board approved [its acquisition]} & \text{[by Royal Trustco Ltd.]} \\ & \text{[of Toronto]} \\ & \text{[for \$27 a share]} \\ & \text{[at its monthly meeting].} \end{align*}\)
- For a sentence with n prepositional phrases at the end, the number of parses is given by the Catalan numbers, $C_n = \frac{(2n)!}{(n+1)!\,n!}$, an exponentially growing series (see the code sketch after this list).
- Coordination scope ambiguity
e.g. Shuttle veteran and longtime NASA executive Fred Gregory appointed to board
$\rightarrow$ 1 or 2 people?
- Adjectival/Adverbial Modifier ambiguity
e.g. Students get first hand job experience
- Verb Phrase (VP) attachment ambiguity
e.g. Mutilated body washes up on Rio beach to be used for Olympics beach volleyball
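To see the Catalan-number explosion mentioned for PP attachment above, a short sketch computing $C_n$ directly from the formula:

```python
# Number of attachment parses for a sentence ending in n prepositional phrases,
# given by the Catalan numbers C_n = (2n)! / ((n+1)! n!).
from math import factorial

def catalan(n: int) -> int:
    return factorial(2 * n) // (factorial(n + 1) * factorial(n))

for n in range(1, 7):
    print(n, catalan(n))   # 1, 2, 5, 14, 42, 132 ... exponential growth
```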
Dependency paths help extract semantic interpretation
- A simple, practical example: extracting protein-protein interactions
Dependency Grammar and Dependency Structure
- Dependency syntax postulates that syntactic structure consists of relations between lexical items, normally binary asymmetric relations (“arrows”) called dependencies
- The arrows are commonly typed with the name of the grammatical relation (subject, prepositional object, apposition, etc.)
- An arrow connects a head (governor, superior, regent) with a dependent (modifier, inferior, subordinate)
- Usually, dependencies form a tree (a connected, acyclic, single-root graph)
- Check: some people draw the arrows one way; some the other way
- Usually add a fake ROOT so every word is a dependent of precisely 1 other node
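One natural (not lecture-prescribed) way to store such a structure is as typed (head, relation, dependent) arcs with the fake ROOT at index 0, e.g. for a toy sentence:

```python
# Hypothetical encoding of "I ate fish" as typed dependency arcs.
# Index 0 is the fake ROOT, so every word is a dependent of exactly one node.
sentence = ["ROOT", "I", "ate", "fish"]

arcs = [                # (head index, relation, dependent index)
    (0, "root",  2),    # ROOT --root-->  ate
    (2, "nsubj", 1),    # ate  --nsubj--> I
    (2, "obj",   3),    # ate  --obj-->   fish
]

for head, rel, dep in arcs:
    print(f"{sentence[head]} --{rel}--> {sentence[dep]}")
```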
The rise of annotated data & Universal Dependencies treebanks
- Advantages of a treebank
- Reusability of the labor
- Many parsers, part-of-speech taggers, etc. can be built on it
- Valuable resource for linguistics
- Broad coverage, not just a few intuitions
- Frequencies and distributional information (statistics)
- A way to evaluate NLP systems
Dependency Conditioning Preferences
- The sources of information for dependency parsing
- Bilexical affinities: The dependency $\text{discussion}\rightarrow\text{issues}$ is plausible
- Dependency distance: Most dependencies are between nearby words
- Intervening material: Dependencies rarely span intervening verbs or punctuation
- Valency of heads: How many dependents on which side are usual for a head?
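A hedged sketch of how these four sources might be folded into a single score for a candidate arc; the weights, the affinity table, and the tiny "verb/punctuation" set below are made-up placeholders, not taken from any real parser:

```python
# Illustrative only: score a candidate arc head -> dep by combining the four
# preferences above. All constants here are made-up placeholders.
def arc_score(words, head_idx, dep_idx, affinities, current_arcs):
    head, dep = words[head_idx], words[dep_idx]
    score = affinities.get((head, dep), 0.0)          # 1. bilexical affinity
    score -= 0.5 * abs(head_idx - dep_idx)            # 2. prefer nearby words
    between = words[min(head_idx, dep_idx) + 1 : max(head_idx, dep_idx)]
    score -= 2.0 * sum(w in {",", ".", "was", "were"} for w in between)  # 3. penalize intervening verbs/punctuation
    score -= 0.3 * sum(1 for h, _ in current_arcs if h == head_idx)      # 4. valency: head already has many dependents
    return score
```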
Dependency Parsing
- A sentence is parsed by choosing for each word what other word (including ROOT) it is a dependent of
- Usually some constraints:
- Only one word is a dependent of ROOT
- Don’t want cycles $A\rightarrow B$, $B\rightarrow A$
- This makes the dependencies a tree
- Final issue is whether arrows can cross (be non-projective) or not
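A small sketch of checking those constraints, encoding a candidate parse as a head-index array (an assumed but common convention: index 0 is ROOT):

```python
# heads[i] is the head of word i; words are 1-indexed and index 0 is ROOT.
def is_valid_tree(heads):
    n = len(heads) - 1                                # number of real words
    if sum(1 for i in range(1, n + 1) if heads[i] == 0) != 1:
        return False                                  # ROOT must have exactly one dependent
    for i in range(1, n + 1):                         # follow heads upward from each word
        seen, node = set(), i
        while node != 0:
            if node in seen:
                return False                          # cycle, e.g. A -> B, B -> A
            seen.add(node)
            node = heads[node]
    return True

print(is_valid_tree([None, 2, 0, 2]))      # "I ate fish" (ate->I, ROOT->ate, ate->fish): True
print(is_valid_tree([None, 2, 0, 4, 3]))   # word 2 is the root, but words 3 and 4 form a cycle: False
```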
Projectivity
- Definition of a projective parse: There are no crossing dependency arcs when the words are laid out in their linear order, with all arcs above the words
- Dependencies corresponding to a CFG tree must be projective
- i.e., by forming dependencies by taking 1 child of each category as head
- Most syntactic structure is projective like this, but dependency theory normally does allow non-projective structures to account for displaced constituents
- You can’t easily get the semantics of certain constructions right without these nonprojective dependencies
- e.g.
From whom did Bill buy the coffee yesterday? (projective)
Who did Bill buy the coffee from yesterday? (non-projective: the stranded 'from' attaches to 'Who', crossing other arcs)
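Under the same head-array encoding as above, projectivity can be checked by looking for crossing arcs; this is just a sketch of the definition, not an algorithm from the lecture:

```python
# heads[i] is the head of word i (0 is ROOT). Two arcs cross iff exactly one
# endpoint of one arc falls strictly inside the span of the other.
def is_projective(heads):
    spans = [(min(h, d), max(h, d)) for d, h in enumerate(heads) if d > 0]
    for l1, r1 in spans:
        for l2, r2 in spans:
            if l1 < l2 < r1 < r2:         # arc 2 starts inside arc 1 but ends outside it
                return False
    return True

print(is_projective([None, 2, 0, 2]))        # "I ate fish": no crossing arcs -> True
print(is_projective([None, 3, 4, 0, 3, 1]))  # arcs spanning 1-3 and 2-4 cross -> False
```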
Methods of Dependency Parsing
- Dynamic programming, Eisner (1996): $O(n^3)$ complexity
- Graph algorithms, McDonald et al. (2005): computing a Minimum Spanning Tree
- Constraint Satisfaction, Karlsson (1990)
- “Transition-based parsing” or “deterministic dependency parsing”
Greedy choice of attachments guided by good machine learning classifiers
E.g., MaltParser, Nivre et al. (2008)
Greedy transition-based parsing, Nivre 2003
- A simple form of greedy discriminative dependency parser
- The parser does a sequence of bottom-up actions
- Roughly like “shift” or “reduce” in a shift-reduce parser, but the “reduce” actions are specialized to create dependencies with head on left or right
- The parser has:
- a stack $\sigma$, written with top to the right
- which starts with the ROOT symbol
- a buffer $\beta$, written with top to the left
- which starts with the input sentence
- a set of dependency arcs A
- which starts off empty
- a set of actions
Basic transition-based dependency parser
- Arc-standard transition-based parser
E.g., Analysis of “I ate fish”
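A minimal sketch of the arc-standard system just described, run on “I ate fish” with a hand-written oracle action sequence (in a real parser the next action would come from a classifier, as in the next section):

```python
# Arc-standard transitions: SHIFT moves the front of the buffer onto the stack;
# LEFT-ARC makes the second-from-top a dependent of the top of the stack;
# RIGHT-ARC makes the top a dependent of the item below it.
def arc_standard_parse(words, actions):
    stack, buffer, arcs = ["ROOT"], list(words), []
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))   # head is the (remaining) top of stack
        elif action == "RIGHT-ARC":
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))   # head is the item below the popped top
    return arcs

# Hand-written oracle sequence for "I ate fish":
print(arc_standard_parse(
    ["I", "ate", "fish"],
    ["SHIFT", "SHIFT", "LEFT-ARC",   # ate -> I
     "SHIFT", "RIGHT-ARC",           # ate -> fish
     "RIGHT-ARC"]))                  # ROOT -> ate
# [('ate', 'I'), ('ate', 'fish'), ('ROOT', 'ate')]
```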
MaltParser, Nivre and Hall 2005
- How do we choose the next action?
Answer: Machine Learning!
- Each action is predicted by a discriminative classifier (e.g., a softmax classifier) over each legal move
- Max of 3 untyped choices; max of $\lvert R \rvert \times 2 + 1 $ when typed
- Features: top of stack word, POS; first in buffer word, POS; etc.
- There is NO search (in the simplest form)
- But you can profitably do a beam search if you wish (slower but better): you keep k good parse prefixes at each time step
- The model’s accuracy is fractionally below the state of the art in dependency parsing, but it provides very fast, linear-time parsing with high accuracy, which makes it great for parsing the web
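To make “a discriminative classifier over legal moves” concrete, here is a hedged sketch: indicator features are read off the top of the stack and the front of the buffer, scored against made-up weights, and a softmax picks the next action. The feature names and weights are illustrative and are not MaltParser’s actual feature model.

```python
# Illustrative only: predict the next transition from simple stack/buffer features.
import numpy as np

ACTIONS = ["SHIFT", "LEFT-ARC", "RIGHT-ARC"]

def extract_features(stack, buffer, pos):
    # Indicator features on the top-of-stack and front-of-buffer word and POS tag.
    feats = []
    if stack:
        feats += [f"s0.w={stack[-1]}", f"s0.t={pos.get(stack[-1], 'NONE')}"]
    if buffer:
        feats += [f"b0.w={buffer[0]}", f"b0.t={pos.get(buffer[0], 'NONE')}"]
    return feats

def predict_action(feats, weights, legal):
    # Sum each action's weights over the active features; softmax over legal moves only.
    scores = np.array([
        sum(weights.get((f, a), 0.0) for f in feats) if a in legal else -np.inf
        for a in ACTIONS
    ])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return ACTIONS[int(np.argmax(probs))]

# Toy usage with made-up weights: with "ate" on the stack and "fish" in the buffer,
# the weights below make SHIFT the highest-scoring legal move.
weights = {("b0.t=NOUN", "SHIFT"): 1.0, ("s0.t=VERB", "RIGHT-ARC"): 0.5}
pos = {"ate": "VERB", "fish": "NOUN"}
feats = extract_features(["ROOT", "ate"], ["fish"], pos)
print(predict_action(feats, weights, legal={"SHIFT", "LEFT-ARC", "RIGHT-ARC"}))  # SHIFT
```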