Commit e86f9e0

Adds parser task using deep biaffine parser (#120)
* Adds metrics for parsing
* Beginning integration
* Adds metrics test.

  One major issue is that this requires us to use negative indices for specials, which breaks assumptions in the indexes. Will have to come back and fix this.
* Draft of parser and its integration
* More work.

  Known issues:
  1. I don't think the metrics test is going to work; I will need to shift all the head indices by special.OFFSET.
  2. I am not passing a parser mask. Do I need to? I think maybe yes.
* Applies shift to metrics test to avoid collisions.
* Moves reverse_edits to data, where it belongs.

  It has no effect in the model, so let's get rid of it.
* Days' debugging work
* More work; still debugging
* Optimizes mmap instructions (#116)
* Updates Black version
* Adds logging for vocabularies (#117)
* Adds logging for vocabularies

  Sample output:

    INFO: 22-Feb-26 17:56:27 - UPOS vocabulary (21): '[PAD]', '[UNK]', '_', 'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X', '_'
    INFO: 22-Feb-26 17:56:27 - XPOS vocabulary (53): '[PAD]', '[UNK]', '_', '$', "''", ',', '-LRB-', '-RRB-', '.', ':', 'ADD', 'AFX', 'CC', 'CD', 'DT', 'EX', 'FW', 'GW', 'HYPH', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NFP', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '_', '``'
    INFO: 22-Feb-26 17:56:27 - Lemma vocabulary (533): [omitted]
    INFO: 22-Feb-26 17:56:27 - Features vocabulary (235): [omitted]

  Closes #115.
* black update
* f-string fix
* driveby: silence more warnings
* Avoids "Crashed" status in sweeps. (#118)

  See Yoyodyne [#369](CUNY-CL/yoyodyne#369) for context. Closes #79.
* Pooling layer efficiency (#119)
* Fix pooling layer regression in UDTubeEncoder.forward

  Special cases pooling_layers=1 to use last_hidden_state directly, avoiding unnecessary allocation of all hidden states. This seems to save a lot of GPU memory.

  A few drive-bys:
  1. suppress progress bar during test data generation
  2. add "not human-readable" to "[omitted]" when logging lemmas
  3. actually log features; why not?
  4. pass information about which heads to build to the data module too, so it logs properly
  5. removes _ from "special", since it doesn't require any special treatment in actuality; it's just another tag as far as we're concerned.
  6. Standardizes trailing """: it's on its own line if the comment is more than one line.
* regeneration last-minute fix
* Update special.py
* fix typo
* Manual merge
* README and bibliography
* stashing incomplete work
* updates parser
* Parser testing
* Expands grid for biaffine parsing hparams
* Adds the parser itself
* Updates tests

  Eliminates a test bug where the file comparisons were against the hypothesis file!
* reflows README
* updates encoder special-casing logic slightly
* Update mappers.py
* Daniel's suggestion
1 parent fed582b commit e86f9e0

43 files changed

Lines changed: 1973 additions & 611 deletions

README.md

Lines changed: 20 additions & 22 deletions
@@ -60,31 +60,31 @@ Dependencies project](https://universaldependencies.org/).
 
 UDTube can perform up to four morphological tasks simultaneously:
 
-- Lemmatization is performed using the `LEMMA` field and [edit
-  scripts](https://aclanthology.org/P14-2111/).
-
+- Lemmatization is performed using the `LEMMA` field and edit scripts.
 - [Universal part-of-speech
   tagging](https://universaldependencies.org/u/pos/index.html) is performed
-  using the `UPOS` field: enable with `data: use_upos: true`.
-
+  using the `UPOS` field.
 - Language-specific part-of-speech tagging is performed using the `XPOS`
-  field: enable with `data: use_xpos: true`.
-
-- Morphological feature tagging is performed using the `FEATS` field: enable
-  with `data: use_feats: true`.
+  field.
+- Morphological feature tagging is performed using the `FEATS` field.
+- Dependency parsing is performed using the `HEAD` and `DEPREL` fields, a deep
+  biaffine parser, and minimum spanning tree decoding.
 
 The following caveats apply:
 
+- By default, lemmatization uses reverse-edit scripts. This is appropriate for
+  predominantly suffixal languages, which are thought to represent the
+  majority of the world's languages. If working with a predominantly prefixal
+  language, disable this with `data: reverse_edits: false`.
 - Note that many newer Universal Dependencies datasets do not have
-  language-specific part-of-speech-tags.
+  language-specific part-of-speech-tags so this task should be disabled
+  (`data: use_xpos: false`).
 - The `FEATS` field is treated as a single unit and is not segmented in any
   way.
 - One can convert from [Universal Dependencies morphological
   features](https://universaldependencies.org/u/feat/index.html) to [UniMorph
   features](https://unimorph.github.io/schema/) using
   [`scripts/convert_to_um.py`](scripts/convert_to_um.py).
-- UDTube does not perform dependency parsing at present, so the `HEAD`,
-  `DEPREL`, and `DEPS` fields are ignored and should be specified as `_`.
 
 ## Usage
 
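The new README entry in the hunk above names a deep biaffine parser. As a rough illustration of the idea only, and not UDTube's implementation, the sketch below scores every candidate head for every dependent with a biaffine form, `scores[i][j] = dep_i^T U head_j + b^T head_j`. All names, dimensions, and weights here are hypothetical, and the greedy head choice at the end is a stand-in: the README says the actual system uses minimum spanning tree decoding, which this sketch does not attempt.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product, one dot product per matrix row."""
    return [dot(row, v) for row in m]


def biaffine_arc_scores(
    deps: Matrix, heads: Matrix, U: Matrix, b: Vector
) -> Matrix:
    """Returns scores[i][j] = deps[i]^T U heads[j] + b^T heads[j].

    The bilinear term models dependent-head affinity; the linear term is a
    per-head prior (how likely a token is to be a head at all).
    """
    return [
        [dot(dep, matvec(U, head)) + dot(b, head) for head in heads]
        for dep in deps
    ]


def greedy_heads(scores: Matrix) -> List[int]:
    """Picks the best-scoring head per dependent (NOT tree-constrained)."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Hypothetical toy inputs: two dependents, three candidate heads.
deps = [[1.0, 0.0], [0.0, 1.0]]
heads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
U = [[0.0, 1.0], [1.0, 0.0]]
b = [0.5, 0.0]
scores = biaffine_arc_scores(deps, heads, U, b)
print(scores)
print(greedy_heads(scores))
```

Greedy selection can produce cycles or multiple roots, which is presumably why a spanning-tree decoder is used on top of the arc scores.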
@@ -132,8 +132,9 @@ supported as they lack an `AutoTokenizer`.
 
 #### Classifier
 
-The classifier layer contains up to four sequential linear heads for the four
-tasks described above. By default all four are enabled.
+The classifier layer contains up to four sequential linear heads for the tagging
+tasks, and a biaffine parser head for the parsing task. By default all heads are
+enabled.
 
 #### Optimization
 
@@ -189,7 +190,7 @@ information](https://github.com/CUNY-CL/yoyodyne/blob/master/README.md#logging).
 
 #### Other options
 
-By default, UDTube attempts to model all four tasks; one can disable the
+By default, UDTube attempts to model all five tasks; one can disable the
 language-specific tagging task using `model: use_xpos: false`, and so on.
 
 Dropout probability is specified using `model: dropout: ...`.
@@ -198,25 +199,19 @@ The encoder has multiple layers. The input to the classifier consists of just
 the last few layers mean-pooled together. The number of layers used for
 mean-pooling is specified using `model: pooling_layers: ...`.
 
-By default, lemmatization uses reverse-edit scripts. This is appropriate for
-predominantly suffixal languages, which are thought to represent the majority of
-the world's languages. If working with a predominantly prefixal language,
-disable this with `model: reverse_edits: false`.
-
 The following YAML snippet shows the default architectural arguments.
 
 ...
 model:
   dropout: 0.5
   encoder: google-bert/bert-base-multilingual-cased
   pooling_layers: 1
-  reverse_edits: true
   use_upos: true
   use_xpos: true
   use_lemma: true
   use_feats: true
+  use_parse: true
 ...
-
 
 Batch size is specified using `data: batch_size: ...` and defaults to 32.
 
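The README text in this hunk describes mean-pooling the last few encoder layers, and the commit message mentions special-casing `pooling_layers=1` so that only the last hidden state is used. A minimal pure-Python sketch of that logic follows; it assumes hidden states arrive as nested lists indexed `[layer][token][dimension]`, and the function name is hypothetical. The real encoder works on tensors, so this only illustrates the control flow.

```python
from typing import List

Layer = List[List[float]]  # [token][dimension]


def mean_pool_layers(hidden_states: List[Layer], pooling_layers: int) -> Layer:
    """Averages the last `pooling_layers` layers elementwise.

    With pooling_layers == 1 the last layer is returned as-is, so no
    averaging (and no extra allocation) is needed.
    """
    if not 1 <= pooling_layers <= len(hidden_states):
        raise ValueError(f"pooling_layers out of range: {pooling_layers}")
    if pooling_layers == 1:
        return hidden_states[-1]
    selected = hidden_states[-pooling_layers:]
    num_tokens = len(selected[0])
    dim = len(selected[0][0])
    return [
        [
            sum(layer[t][d] for layer in selected) / pooling_layers
            for d in range(dim)
        ]
        for t in range(num_tokens)
    ]


# Hypothetical toy input: three layers, one token, two dimensions.
layers = [[[0.0, 0.0]], [[2.0, 4.0]], [[4.0, 8.0]]]
print(mean_pool_layers(layers, 1))
print(mean_pool_layers(layers, 2))
```

The early return is the point of the special case: when only one layer is requested, the other layers never need to be materialized.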
@@ -322,3 +317,6 @@ following document, which describes the model:
 Yakubov, D. 2024. [How do we learn what we cannot
 say?](https://academicworks.cuny.edu/gc_etds/5622/) Master's thesis, CUNY
 Graduate Center.
+
+(See also [`udtube.bib`](udtube.bib) for more work used during the development
+of this library.)

configs/ewt_bert.yaml

Lines changed: 0 additions & 6 deletions
@@ -22,12 +22,6 @@ trainer:
 model:
   dropout: 0.4
   encoder: google-bert/bert-base-cased
-  pooling_layers: 4
-  reverse_edits: true
-  use_upos: true
-  use_xpos: true
-  use_lemma: true
-  use_feats: true
   encoder_optimizer:
     class_path: torch.optim.Adam
     init_args:

configs/ewt_distilbert.yaml

Lines changed: 1 addition & 6 deletions
@@ -22,12 +22,6 @@ trainer:
 model:
   dropout: 0.4
   encoder: distilbert/distilbert-base-cased
-  pooling_layers: 4
-  reverse_edits: true
-  use_upos: true
-  use_xpos: true
-  use_lemma: true
-  use_feats: true
   encoder_optimizer:
     class_path: torch.optim.Adam
     init_args:
@@ -52,6 +46,7 @@ data:
   test: /Users/Shinji/UD_English-EWT/en_ewt-ud-test.conllu
   predict: /Users/Shinji/UD_English-EWT/en_ewt-ud-test.conllu
   batch_size: 32
+  reverse_edits: true
 checkpoint:
   filename: "model-{epoch:03d}-{val_loss:.4f}"
   monitor: val_loss
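This hunk moves `reverse_edits` under `data`, and the README says reversing suits predominantly suffixal languages. One way to see why is sketched below. This is a toy stand-in built on the standard library's `difflib`, not UDTube's edit-script code, and the function names and script representation are hypothetical. It shows that when strings are reversed before diffing, a suffix change gets the same position-indexed script regardless of word length, so words that inflect alike share a label.

```python
import difflib
from typing import Tuple

# One operation: (tag, start, end, replacement), indexed into the source.
Script = Tuple[Tuple[str, int, int, str], ...]


def edit_script(source: str, target: str, reverse: bool = True) -> Script:
    """Extracts the non-matching operations that turn source into target."""
    if reverse:
        source, target = source[::-1], target[::-1]
    matcher = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
    return tuple(
        (tag, i1, i2, target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def apply_script(source: str, script: Script, reverse: bool = True) -> str:
    """Applies a script to a (possibly different) source string."""
    if reverse:
        source = source[::-1]
    chars = list(source)
    # Apply right-to-left so earlier indices remain valid.
    for _, i1, i2, replacement in sorted(
        script, key=lambda op: op[1], reverse=True
    ):
        chars[i1:i2] = replacement
    result = "".join(chars)
    return result[::-1] if reverse else result


# With reversal, both words share one script; without it, they do not.
print(edit_script("walked", "walk"))
print(edit_script("discovered", "discover"))
print(edit_script("walked", "walk", reverse=False))
print(edit_script("discovered", "discover", reverse=False))
print(apply_script("talked", edit_script("walked", "walk")))
```

For a predominantly prefixal language the situation is mirrored, which is consistent with the README's advice to set `data: reverse_edits: false` in that case.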

configs/ewt_roberta.yaml

Lines changed: 0 additions & 6 deletions
@@ -22,12 +22,6 @@ trainer:
 model:
   dropout: 0.4
   encoder: FacebookAI/roberta-base
-  pooling_layers: 4
-  reverse_edits: true
-  use_upos: true
-  use_xpos: true
-  use_lemma: true
-  use_feats: true
   encoder_optimizer:
     class_path: torch.optim.Adam
     init_args:

configs/syntagrus_mbert.yaml

Lines changed: 0 additions & 5 deletions
@@ -22,12 +22,7 @@ trainer:
 model:
   dropout: 0.4
   encoder: google-bert/bert-base-multilingual-cased
-  pooling_layers: 4
-  reverse_edits: true
-  use_upos: true
   use_xpos: false
-  use_lemma: true
-  use_feats: true
   encoder_optimizer:
     class_path: torch.optim.Adam
     init_args:

configs/syntagrus_rubert.yaml

Lines changed: 0 additions & 5 deletions
@@ -22,12 +22,7 @@ trainer:
 model:
   dropout: 0.4
   encoder: DeepPavlov/rubert
-  pooling_layers: 4
-  reverse_edits: true
-  use_upos: true
   use_xpos: false
-  use_lemma: true
-  use_feats: true
   encoder_optimizer:
     class_path: torch.optim.Adam
     init_args:

configs/syntagrus_xlm-roberta.yaml

Lines changed: 0 additions & 5 deletions
@@ -22,12 +22,7 @@ trainer:
 model:
   dropout: 0.4
   encoder: FacebookAI/xlm-roberta-base
-  pooling_layers: 4
-  reverse_edits: true
-  use_upos: true
   use_xpos: false
-  use_lemma: true
-  use_feats: true
   encoder_optimizer:
     class_path: torch.optim.Adam
     init_args:

examples/wandb_sweeps/configs/ewt_grid.yaml

Lines changed: 15 additions & 5 deletions
@@ -1,15 +1,14 @@
-method: random
+method: bayes
 metric:
   name: val_loss
   goal: minimize
 parameters:
-  model.use_xpos:
-    value: true
   model.dropout:
     distribution: uniform
     min: 0
     max: 0.5
   model.encoder:
+    distribution: categorical
     values:
       - FacebookAI/roberta-base
       - distilbert/distilbert-base-cased
@@ -18,7 +17,7 @@ parameters:
     distribution: q_uniform
     q: 1
     min: 1
-    max: 8
+    max: 4
   model.encoder_optimizer.class_path:
     value: torch.optim.Adam
   model.encoder_optimizer.init_args.lr:
@@ -31,7 +30,17 @@ parameters:
     distribution: q_uniform
     q: 1
     min: 1
-    max: 20
+    max: 40
+  model.arc_mlp_size:
+    distribution: q_uniform
+    q: 64
+    min: 64
+    max: 512
+  model.deprel_mlp_size:
+    distribution: q_uniform
+    q: 64
+    min: 64
+    max: 256
   model.classifier_optimizer.class_path:
     value: torch.optim.Adam
   model.classifier_optimizer.init_args.lr:
@@ -49,6 +58,7 @@ parameters:
   model.classifier_scheduler.init_args.patience:
     value: 5
   data.batch_size:
+    distribution: categorical
     values:
       - 8
       - 16
examples/wandb_sweeps/configs/gdt_grid.yaml

Lines changed: 15 additions & 3 deletions
@@ -1,4 +1,4 @@
-method: random
+method: bayes
 metric:
   name: val_loss
   goal: minimize
@@ -10,14 +10,15 @@ parameters:
     min: 0
     max: 0.5
   model.encoder:
+    distribution: categorical
     values:
       - google-bert/bert-base-multilingual-cased
       - FacebookAI/xlm-roberta-base
   model.pooling_layers:
     distribution: q_uniform
     q: 1
     min: 1
-    max: 8
+    max: 4
   model.encoder_optimizer.class_path:
     value: torch.optim.Adam
   model.encoder_optimizer.init_args.lr:
@@ -30,7 +31,17 @@ parameters:
     distribution: q_uniform
     q: 1
     min: 1
-    max: 20
+    max: 40
+  model.arc_mlp_size:
+    distribution: q_uniform
+    q: 64
+    min: 64
+    max: 512
+  model.deprel_mlp_size:
+    distribution: q_uniform
+    q: 64
+    min: 64
+    max: 256
   model.classifier_optimizer.class_path:
     value: torch.optim.Adam
   model.classifier_optimizer.init_args.lr:
@@ -48,6 +59,7 @@ parameters:
   model.classifier_scheduler.init_args.patience:
     value: 5
   data.batch_size:
+    distribution: categorical
     values:
       - 8
       - 16

examples/wandb_sweeps/configs/syntagrus_grid.yaml

Lines changed: 15 additions & 3 deletions
@@ -1,4 +1,4 @@
-method: random
+method: bayes
 metric:
   name: val_loss
   goal: minimize
@@ -10,6 +10,7 @@ parameters:
     min: 0
     max: 0.5
   model.encoder:
+    distribution: categorical
     values:
       - google-bert/bert-base-multilingual-cased
       - FacebookAI/xlm-roberta-base
@@ -18,7 +19,7 @@ parameters:
     distribution: q_uniform
     q: 1
     min: 1
-    max: 8
+    max: 4
   model.encoder_optimizer.class_path:
     value: torch.optim.Adam
   model.encoder_optimizer.init_args.lr:
@@ -31,7 +32,17 @@ parameters:
     distribution: q_uniform
     q: 1
     min: 1
-    max: 20
+    max: 40
+  model.arc_mlp_size:
+    distribution: q_uniform
+    q: 64
+    min: 64
+    max: 512
+  model.deprel_mlp_size:
+    distribution: q_uniform
+    q: 64
+    min: 64
+    max: 256
   model.classifier_optimizer.class_path:
     value: torch.optim.Adam
   model.classifier_optimizer.init_args.lr:
@@ -49,6 +60,7 @@ parameters:
   model.classifier_scheduler.init_args.patience:
     value: 5
   data.batch_size:
+    distribution: categorical
     values:
       - 8
       - 16
