Wrongly removed tokens

Hello!

Given that you report a 92% F1-score on the Softcite dataset, I was wondering why I was getting an F1-score 8 points lower with the same SciBERT model and dataset. Reproducing your training with the notebook, I found 2 problems in your data preparation which, I think, explain the difference.

The first one is a filtering step that removes all tokens and tags whose label is neither `software` nor `version`. In cell 3 of the notebook `notebooks/Train software mentions model.ipynb`, we have:

```python
data.columns = ['paper_id',"word", "tag", "sentence_id"]
data['word'] = data['word'].astype(str)
data = data[data['tag'].isin(['O', 'B-software', 'B-version', 'I-version','I-software'])]
```
In particular, this removes all the tokens labeled as publisher or url. For example, here "Microsoft" and "SPSS Inc" are removed:

Input:
```xml
Radiographic errors were recorded on individual tick sheets and the information was captured 
in an <rs cert="1.0" resp="#annotator0" type="software" xml:id="a7f72b2925-software-0">Excel</rs> 
spreadsheet (<rs corresp="#a7f72b2925-software-0" resp="#curator" type="publisher">Microsoft</rs>, 
Redmond, WA). The readers resolved any differences by consensus.
```

Tokens:
```
['Radio', '##graphic', 'errors', 'were', 'recorded', 'on', 'individual', 'tick', 'sheets', 'and', 'the', 
'information', 'was', 'captured', 'in', 'an', 'Excel', 'spreads', '##he', '##et', '(', ',', 'Red', 
'##mond', ',', 'WA', ')', '.']
```

Input:
```xml
The <rs cert="1.0" resp="#annotator0" type="software" xml:id="f204e3a468-software-0">SPSS</rs> 
software version <rs corresp="#f204e3a468-software-0" resp="#annotator0" type="version">11.0</rs> 
(<rs corresp="#f204e3a468-software-0" resp="#curator" type="publisher">SPSS Inc</rs>., Chicago, 
USA) was used for the statistical analysis.
```

Tokens:
```
['The', 'SPSS', 'software', 'version', '11', '.', '0', '(', '.', ',', 'Chicago', ',', 'USA', ')', 'was', 
'used', 'for', 'the', 'statistical', 'analysis', '.']
```

So the model is trained and evaluated without the text corresponding to publisher and url. This impacts the evaluation because publishers and urls are often ambiguous with software names.

This can be fixed by replacing the labels to be excluded with `O`, so that the tokens are not removed:

```python
data.columns = ['paper_id',"word", "tag", "sentence_id"]
data['word'] = data['word'].astype(str)
# replace 'B-publisher', 'B-url', 'I-publisher','I-url' and reference marker labels by 'O'
data['tag'] = data['tag'].replace(['B-publisher', 'B-url', 'B-bibr', 'B-table', 'B-figure', 'B-formula', 'I-publisher', 'I-url', 'I-bibr', 'I-table', 'I-figure', 'I-formula'], 'O')
data = data[data['tag'].isin(['O', 'B-software', 'B-version', 'I-version','I-software'])]
```

We then get, as expected:

```
['Radio', '##graphic', 'errors', 'were', 'recorded', 'on', 'individual', 'tick', 'sheets', 'and', 'the', 
'information', 'was', 'captured', 'in', 'an', 'Excel', 'spreads', '##he', '##et', '(', 'Microsoft', ',', 
'Red', '##mond', ',', 'WA', ')', '.']
```

```
['The', 'SPSS', 'software', 'version', '11', '.', '0', '(', 'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'USA', ')', 
'was', 'used', 'for', 'the', 'statistical', 'analysis', '.']
```

With this fix, the evaluation on the Softcite dataset becomes:

```
F1 score: 0.883440
           precision    recall  f1-score   support

 software     0.8468    0.8737    0.8600       974
  version     0.9440    0.9610    0.9524       333

micro avg     0.8713    0.8959    0.8834      1307
macro avg     0.8715    0.8959    0.8836      1307
```

However, I think the token removal probably affects more than the evaluation. It also means that your model has never seen publisher and url tokens, so when it is applied to a new article containing such tokens, inference might be degraded too.

Note: I don't know how to open a PR when a notebook is used, but the code snippet above fixes the problem (in cell 3 of `notebooks/Train software mentions model.ipynb`).




