Description
I'm running a quick analysis to evaluate the effect of training corpus size on model performance on a fixed test set. The analysis is performed as follows:
- Choose a set of 30 test docs and 20 val docs
- Choose an initial training set of 150 docs
- Over n iterations (here I've been using n = 7, to get to a 500-doc train set), add 50 more docs to the training set
- For each of those training sets, train a model and evaluate on the test and validation sets
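The procedure above can be sketched roughly as follows. This is a minimal stand-in, not the actual analysis code: the document layout, `train_model`, and `evaluate` are hypothetical placeholders.

```python
import random

def run_analysis(all_docs, n_iters=7, init_size=150, step=50,
                 n_test=30, n_val=20, seed=0):
    """Grow the training set in increments and record each size evaluated."""
    rng = random.Random(seed)
    docs = all_docs[:]
    rng.shuffle(docs)
    # Fixed test and validation splits, as described above.
    test_docs = docs[:n_test]
    val_docs = docs[n_test:n_test + n_val]
    pool = docs[n_test + n_val:]

    sizes = []
    for i in range(n_iters + 1):  # 150, 200, ..., 500 docs
        size = init_size + i * step
        train_docs = pool[:size]
        # model = train_model(train_docs, val_docs)      # hypothetical helper
        # score = evaluate(model, test_docs)             # hypothetical helper
        sizes.append(size)
    return sizes
```

Each iteration reuses the same test and validation docs, so only the training set varies between runs.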
Observed behavior: for some of the models, NER performance is near 0 and relation performance is 0, but this doesn't correlate with training set size. Additionally, the validation-set results reported by the model are completely different from those obtained with allennlp evaluate.
An example run's performances (calculated externally to the model with my own code, but I get basically the same results with allennlp evaluate):
|   | rel_F1   | docnum |
|---|----------|--------|
| 0 | 0.264448 | 150    |
| 1 | 0.380308 | 200    |
| 2 | 0.364521 | 250    |
| 3 | 0.000000 | 300    |
| 4 | 0.459839 | 350    |
| 5 | 0.394745 | 400    |
| 6 | 0.000000 | 450    |
| 7 | 0.427195 | 500    |
Reported validation set performance (best_validation_MEAN__relation_f1 from metrics.json in the model folder) for the 0-performance models is ~0.4, which is on par with the rest of the models. However, if I call allennlp evaluate on the dev set, I also get an F1 of 0.
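A quick way to surface the discrepancy is to read both numbers side by side: the best validation F1 recorded during training (metrics.json in the serialization directory) and the F1 written by a fresh `allennlp evaluate ... --output-file` run. This is a hedged sketch; the paths and the exact metric key in the evaluate output (here assumed to be the training key without the `best_validation_` prefix) are assumptions.

```python
import json
import os

def compare_val_f1(serialization_dir, evaluate_output_path,
                   key="best_validation_MEAN__relation_f1"):
    """Return (F1 recorded at training time, F1 from a fresh evaluate run)."""
    with open(os.path.join(serialization_dir, "metrics.json")) as f:
        trained = json.load(f)[key]
    with open(evaluate_output_path) as f:
        # Assumption: `allennlp evaluate --output-file` writes a flat metrics
        # dict whose key drops the "best_validation_" prefix.
        evaluated = json.load(f)[key.replace("best_validation_", "")]
    return trained, evaluated
```

If the two values diverge as badly as described (~0.4 vs 0), the saved best weights and the weights actually archived/evaluated are likely not the same.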
Other observations:
- When I look at the prediction files, the cause of the poor entity performance seems to be that the model makes huge numbers of spurious entity predictions, almost predicting an entity on every single word of each doc; for relations, all but one doc has no relation predictions at all
- Which training set size results in 0 performance changes when I re-run the analysis with newly selected docs, but there are usually one or two sizes with 0 performance
Do you have any intuition for what might be going on here? To me it seems like something in allennlp may be failing catastrophically on smaller numbers of documents in an unpredictable manner, but I'd love to know your thoughts.
EDIT: on closer inspection, it looks like the model is predicting an entity on every possible span