Wrongly removed tokens #1

Description

@kermitt2

Hello !

Given that you report a 92% F1-score on the Softcite dataset, I was wondering why I get 8 points less in F1-score with the same SciBERT model and dataset. Reproducing your training with the notebook, I found 2 problems in your data preparation which, I think, explain the difference.

The first one is a filter that removes all the tokens and tags whose label is neither software nor version. In cell 3 of the notebook notebooks/Train software mentions model.ipynb, we have:

data.columns = ['paper_id',"word", "tag", "sentence_id"]
data['word'] = data['word'].astype(str)
data = data[data['tag'].isin(['O', 'B-software', 'B-version', 'I-version','I-software'])]

In particular, this removes all the tokens labeled publisher or url. For example, here "Microsoft" and "SPSS Inc" are removed:

Input:

Radiographic errors were recorded on individual tick sheets and the information was captured 
in an <rs cert="1.0" resp="#annotator0" type="software" xml:id="a7f72b2925-software-0">Excel</rs> 
spreadsheet (<rs corresp="#a7f72b2925-software-0" resp="#curator" type="publisher">Microsoft</rs>, 
Redmond, WA). The readers resolved any differences by consensus.

Tokens:

['Radio', '##graphic', 'errors', 'were', 'recorded', 'on', 'individual', 'tick', 'sheets', 'and', 'the', 
'information', 'was', 'captured', 'in', 'an', 'Excel', 'spreads', '##he', '##et', '(', ',', 'Red', 
'##mond', ',', 'WA', ')', '.']

Input:

The <rs cert="1.0" resp="#annotator0" type="software" xml:id="f204e3a468-software-0">SPSS</rs> 
software version <rs corresp="#f204e3a468-software-0" resp="#annotator0" type="version">11.0</rs> 
(<rs corresp="#f204e3a468-software-0" resp="#curator" type="publisher">SPSS Inc</rs>., Chicago, 
USA) was used for the statistical analysis.

Tokens:

['The', 'SPSS', 'software', 'version', '11', '.', '0', '(', '.', ',', 'Chicago', ',', 'USA', ')', 'was', 
'used', 'for', 'the', 'statistical', 'analysis', '.']

So the model is trained and evaluated without the text corresponding to publisher and url. This impacts the evaluation, because publishers and URLs are often ambiguous with software names.
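To make the effect concrete, here is a minimal sketch on a toy DataFrame. The rows are made up for illustration and use whole words rather than wordpieces; only the column names and the tag list come from cell 3.

```python
import pandas as pd

# Hypothetical toy rows mimicking the notebook's four columns.
rows = [
    ("p1", "Excel", "B-software", 0),
    ("p1", "spreadsheet", "O", 0),
    ("p1", "(", "O", 0),
    ("p1", "Microsoft", "B-publisher", 0),
    ("p1", ",", "O", 0),
]
data = pd.DataFrame(rows, columns=["paper_id", "word", "tag", "sentence_id"])

# Same tag list as the filter in cell 3.
KEPT_TAGS = ["O", "B-software", "B-version", "I-version", "I-software"]

# The filter drops whole rows, so the words disappear along with their tags.
filtered = data[data["tag"].isin(KEPT_TAGS)]
dropped_words = data.loc[~data["tag"].isin(KEPT_TAGS), "word"].tolist()

print(filtered["word"].tolist())
print(dropped_words)
```

Listing the dropped words this way on the real data is a quick check of how much text the filter removes.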

This can be fixed by replacing the labels to be excluded with O, so that the tokens are not removed:

data.columns = ['paper_id',"word", "tag", "sentence_id"]
data['word'] = data['word'].astype(str)
# replace 'B-publisher', 'B-url', 'I-publisher','I-url' and reference marker labels by 'O'
data['tag'] = data['tag'].replace(['B-publisher', 'B-url', 'B-bibr', 'B-table', 'B-figure', 'B-formula', 'I-publisher', 'I-url', 'I-bibr', 'I-table', 'I-figure', 'I-formula'], 'O')
data = data[data['tag'].isin(['O', 'B-software', 'B-version', 'I-version','I-software'])]

We then have, as expected:

['Radio', '##graphic', 'errors', 'were', 'recorded', 'on', 'individual', 'tick', 'sheets', 'and', 'the', 
'information', 'was', 'captured', 'in', 'an', 'Excel', 'spreads', '##he', '##et', '(', 'Microsoft', ',', 
'Red', '##mond', ',', 'WA', ')', '.']
['The', 'SPSS', 'software', 'version', '11', '.', '0', '(', 'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'USA', ')', 
'was', 'used', 'for', 'the', 'statistical', 'analysis', '.']
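As a possible variant (not what the snippet above does), any tag outside the kept set could be mapped to O without listing the excluded labels one by one. This is a minimal sketch assuming the same column names as cell 3; `relabel_excluded_tags` and the toy rows are hypothetical names and data for illustration.

```python
import pandas as pd

KEPT_TAGS = ["O", "B-software", "I-software", "B-version", "I-version"]


def relabel_excluded_tags(data: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where every tag outside KEPT_TAGS becomes 'O'.

    No row is dropped, so every token stays in its sentence.
    """
    out = data.copy()
    # Series.where keeps values where the condition holds, else uses 'O'.
    out["tag"] = out["tag"].where(out["tag"].isin(KEPT_TAGS), "O")
    return out


# Hypothetical toy rows (whole words, not wordpieces).
rows = [
    ("p2", "SPSS", "B-software", 1),
    ("p2", "version", "O", 1),
    ("p2", "11.0", "B-version", 1),
    ("p2", "(", "O", 1),
    ("p2", "SPSS", "B-publisher", 1),
    ("p2", "Inc", "I-publisher", 1),
]
toy = pd.DataFrame(rows, columns=["paper_id", "word", "tag", "sentence_id"])
fixed = relabel_excluded_tags(toy)

print(fixed[["word", "tag"]])
```

The trade-off is that an unexpected or misspelled label would silently become O, whereas the explicit list makes the excluded labels visible.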

The evaluation then becomes:

F1 score: 0.883440
           precision    recall  f1-score   support

 software     0.8468    0.8737    0.8600       974
  version     0.9440    0.9610    0.9524       333

micro avg     0.8713    0.8959    0.8834      1307
macro avg     0.8715    0.8959    0.8836      1307
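As a small consistency check, the headline F1 score matches the micro-average row, and each F1 is the harmonic mean of the reported precision and recall. The helper name `f1` is only for illustration; the numbers are copied from the report above.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the report above.
reported = {
    "software": (0.8468, 0.8737),
    "version": (0.9440, 0.9610),
    "micro avg": (0.8713, 0.8959),
}

scores = {name: f1(p, r) for name, (p, r) in reported.items()}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```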

However, I think this probably affects more than the evaluation: the model never sees publisher and url tokens during training, so it might also degrade at inference time when applied to new articles containing such tokens.

Note: I don't know how to open a PR when a notebook is involved, but the code snippet above fixes the problem (in cell 3 of notebooks/Train software mentions model.ipynb).
