Hello!
Since you report a 92% F1-score on the Softcite dataset, I was wondering why I was getting 8 points lower with the same SciBERT model and dataset. While reproducing your training with the notebook, I found 2 problems in your data preparation which, I think, explain the difference.
The first one is a filter that removes all tokens and tags whose label is neither software nor version. In cell 3 of the notebook notebooks/Train software mentions model.ipynb, we have:
data.columns = ['paper_id',"word", "tag", "sentence_id"]
data['word'] = data['word'].astype(str)
data = data[data['tag'].isin(['O', 'B-software', 'B-version', 'I-version','I-software'])]
In particular, this removes all the tokens annotated as publisher or url. For example, here "Microsoft" and "SPSS Inc" are removed:
Input:
Radiographic errors were recorded on individual tick sheets and the information was captured
in an <rs cert="1.0" resp="#annotator0" type="software" xml:id="a7f72b2925-software-0">Excel</rs>
spreadsheet (<rs corresp="#a7f72b2925-software-0" resp="#curator" type="publisher">Microsoft</rs>,
Redmond, WA). The readers resolved any differences by consensus.
Tokens:
['Radio', '##graphic', 'errors', 'were', 'recorded', 'on', 'individual', 'tick', 'sheets', 'and', 'the',
'information', 'was', 'captured', 'in', 'an', 'Excel', 'spreads', '##he', '##et', '(', ',', 'Red',
'##mond', ',', 'WA', ')', '.']
Input:
The <rs cert="1.0" resp="#annotator0" type="software" xml:id="f204e3a468-software-0">SPSS</rs>
software version <rs corresp="#f204e3a468-software-0" resp="#annotator0" type="version">11.0</rs>
(<rs corresp="#f204e3a468-software-0" resp="#curator" type="publisher">SPSS Inc</rs>., Chicago,
USA) was used for the statistical analysis.
Tokens:
['The', 'SPSS', 'software', 'version', '11', '.', '0', '(', '.', ',', 'Chicago', ',', 'USA', ')', 'was',
'used', 'for', 'the', 'statistical', 'analysis', '.']
So the model is trained and evaluated without the text corresponding to publisher and url. This affects the evaluation, because publisher and url mentions are often ambiguous with software names.
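To check how much text the filter discards, one could list the rows whose tag falls outside the kept set before filtering. This is only a minimal sketch on a toy frame: the rows and the `KEPT_TAGS` name are made up for illustration, and only the `word` and `tag` column names come from the notebook.

```python
import pandas as pd

# Toy rows for illustration only; in the notebook, `data` is the full training frame.
data = pd.DataFrame({
    "word": ["Excel", "spreadsheet", "(", "Microsoft", ",", "Redmond", ")"],
    "tag": ["B-software", "O", "O", "B-publisher", "O", "O", "O"],
})

# Hypothetical name for the tag list passed to isin() in cell 3.
KEPT_TAGS = ["O", "B-software", "I-software", "B-version", "I-version"]

# Rows whose tag is outside KEPT_TAGS are exactly the rows the filter drops.
dropped = data[~data["tag"].isin(KEPT_TAGS)]
dropped_counts = dropped["tag"].value_counts().to_dict()

print(dropped_counts)            # number of dropped rows per tag
print(dropped["word"].tolist())  # the words that disappear from the text
```

Running the same check on the real frame should show how many publisher and url tokens are lost.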
This can be fixed by replacing the labels to be excluded with O, so that the tokens themselves are not removed:
data.columns = ['paper_id',"word", "tag", "sentence_id"]
data['word'] = data['word'].astype(str)
# replace 'B-publisher', 'B-url', 'I-publisher','I-url' and reference marker labels by 'O'
data['tag'] = data['tag'].replace(['B-publisher', 'B-url', 'B-bibr', 'B-table', 'B-figure', 'B-formula', 'I-publisher', 'I-url', 'I-bibr', 'I-table', 'I-figure', 'I-formula'], 'O')
data = data[data['tag'].isin(['O', 'B-software', 'B-version', 'I-version','I-software'])]
We then get, as expected:
['Radio', '##graphic', 'errors', 'were', 'recorded', 'on', 'individual', 'tick', 'sheets', 'and', 'the',
'information', 'was', 'captured', 'in', 'an', 'Excel', 'spreads', '##he', '##et', '(', 'Microsoft', ',',
'Red', '##mond', ',', 'WA', ')', '.']
['The', 'SPSS', 'software', 'version', '11', '.', '0', '(', 'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'USA', ')',
'was', 'used', 'for', 'the', 'statistical', 'analysis', '.']
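The difference between the two preparations can be reproduced on a toy sentence. The sketch below is self-contained and not the notebook's code: the rows are made up, and `KEPT_TAGS`, `EXCLUDED_TAGS`, `filter_only` and `relabel_then_filter` are hypothetical names introduced for the illustration.

```python
import pandas as pd

KEPT_TAGS = ["O", "B-software", "I-software", "B-version", "I-version"]
EXCLUDED_TAGS = ["B-publisher", "I-publisher", "B-url", "I-url"]

def filter_only(df):
    """Original behaviour: rows with an excluded tag are dropped entirely."""
    return df[df["tag"].isin(KEPT_TAGS)]

def relabel_then_filter(df):
    """Proposed behaviour: excluded tags become 'O', so the words are kept."""
    out = df.copy()
    out["tag"] = out["tag"].replace(EXCLUDED_TAGS, "O")
    return out[out["tag"].isin(KEPT_TAGS)]

# Toy sentence for illustration only.
words = ["The", "SPSS", "software", "version", "11.0", "(", "SPSS", "Inc", ".", ")"]
tags = ["O", "B-software", "O", "O", "B-version", "O",
        "B-publisher", "I-publisher", "O", "O"]
data = pd.DataFrame({"word": words, "tag": tags})

filtered = filter_only(data)
relabelled = relabel_then_filter(data)

print(filtered["word"].tolist())    # the publisher words are missing
print(relabelled["word"].tolist())  # the full sentence is preserved
```

With relabelling, the sentence keeps all of its words and only the publisher tags change to O.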
The evaluation then becomes:
F1 score: 0.883440
              precision    recall  f1-score   support
    software     0.8468    0.8737    0.8600       974
     version     0.9440    0.9610    0.9524       333
   micro avg     0.8713    0.8959    0.8834      1307
   macro avg     0.8715    0.8959    0.8836      1307
However, I think this probably affects more than the evaluation: the model never sees publisher and url tokens during training, so it may also perform worse at inference time on new articles that contain them.
Note: I don't know how to open a PR when a notebook is involved, but the code snippet above fixes the problem (in cell 3 of notebooks/Train software mentions model.ipynb).