Follow-up of #1 :)
In the notebook notebooks/Train software mentions model.ipynb, when the labels are expanded to cover the extra tokens introduced by the BERT tokenizer, the added labels are copies of the leading label (which starts with B-). Following the IOB scheme, they should instead be "inside" labels starting with I-.
def tokenize_label_sentence(sentence, text_labels):
    tokenized_sentence = []
    labels = []
    for word, label in zip(sentence, text_labels):
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)
        tokenized_sentence.extend(tokenized_word)
        labels.extend([label] * n_subwords)  # <--- problem !
    return tokenized_sentence, labels
For example:
['Statistical', 'analysis', 'was', 'conducted', 'using', 'SPSS', 'software', 'v', '13', '.', '0', '(',
'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'IL', ',', 'USA', ')', '.']
['O', 'O', 'O', 'O', 'O', 'B-software', 'O', 'O', 'B-version', 'B-version', 'B-version', 'O', 'O',
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
The problem is also visible in the notebook notebooks/Software mentions inference mode.ipynb, cells 6-7:
test_sentence = "I used Python package DBSCAN 1.234 for this analysis"
[('I', 'O'), ('used', 'O'), ('Python', 'B-software'), ('package', 'O'), ('DBSCAN', 'B-software'),
('1', 'B-version'), ('.', 'B-version'), ('234', 'B-version'), ('for', 'O'), ('this', 'O'), ('analysis', 'O')]
We should not have 3 separate "version" entities here, but a single one (1.234) spanning 3 tokens.
When seqeval then computes the evaluation scores, it counts every label starting with B- as a separate entity, which over-estimates the scores (they end up closer to token-level than entity-level scores). There are 4093 software names in the Softcite dataset, but the seqeval evaluation in the training/eval notebook, which is supposed to run on 10% of the corpus, reports 959 software names, see the support column below:
F1 score: 0.922123
Accuracy score: 0.995906
              precision    recall  f1-score   support
    software     0.9014    0.9343    0.9176       959
     version     0.9216    0.9515    0.9363       309
   micro avg     0.9063    0.9385    0.9221      1268
   macro avg     0.9063    0.9385    0.9221      1268
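To illustrate why the support is inflated, here is a minimal sketch of entity counting on a label sequence. It is not seqeval's actual implementation: it assumes well-formed IOB2 input and simply treats every B- label as the start of a new entity. The helper name `count_entities` is made up for this example.

```python
def count_entities(labels):
    """Count entities per type, treating each B- label as a new entity.

    Simplified illustration only (assumes well-formed IOB2 labels);
    this is not seqeval's own chunking code.
    """
    counts = {}
    for label in labels:
        if label.startswith("B-"):
            entity_type = label[2:]
            counts[entity_type] = counts.get(entity_type, 0) + 1
    return counts


# Labels for the sub-tokens '13', '.', '0' of a single version string.
copied_labels = ["O", "B-version", "B-version", "B-version", "O"]
iob_labels = ["O", "B-version", "I-version", "I-version", "O"]

print(count_entities(copied_labels))  # {'version': 3}
print(count_entities(iob_labels))     # {'version': 1}
```

With copied B- labels, one version string split into three sub-tokens is counted as three entities, which is consistent with a support figure well above 10% of the corpus.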
To fix the issue, the expanded labels should be prefixed with I-, for example:
def tokenize_label_sentence(sentence, text_labels):
    tokenized_sentence = []
    labels = []
    for word, label in zip(sentence, text_labels):
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)
        if n_subwords == 0:
            # no token was produced for this word: add no label, so tokens and labels stay aligned
            continue
        tokenized_sentence.extend(tokenized_word)
        # the first sub-token keeps the word label, the following ones get I-label instead of B-label
        # (otherwise we have B- at every token and wrong entity scores)
        labels.append(label)
        labels.extend([label.replace("B-", "I-")] * (n_subwords - 1))
    return tokenized_sentence, labels
We then get, as expected:
['Statistical', 'analysis', 'was', 'conducted', 'using', 'SPSS', 'software', 'v', '13', '.', '0', '(',
'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'IL', ',', 'USA', ')', '.']
['O', 'O', 'O', 'O', 'O', 'B-software', 'O', 'O', 'B-version', 'I-version', 'I-version', 'O', 'O', 'O',
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
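The relabeling can be checked without loading BERT. The sketch below is only an approximation: `toy_tokenize` is a hypothetical stand-in for `tokenizer.tokenize` that splits on punctuation (a real WordPiece tokenizer may split words differently), and `expand_labels` is an illustrative name, not a function from the notebook.

```python
import re


def toy_tokenize(word):
    # Hypothetical stand-in for tokenizer.tokenize: splits on punctuation only.
    return re.findall(r"\w+|[^\w\s]", word)


def expand_labels(words, word_labels, tokenize=toy_tokenize):
    """Expand word-level IOB labels to sub-token level.

    The first sub-token keeps the word label; the following ones get
    the I- variant of that label.
    """
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        if not pieces:
            continue  # no token produced, so no label either
        tokens.extend(pieces)
        labels.append(label)
        inside = "I-" + label[2:] if label.startswith("B-") else label
        labels.extend([inside] * (len(pieces) - 1))
    return tokens, labels


words = ["using", "SPSS", "software", "v", "13.0"]
word_labels = ["O", "B-software", "O", "O", "B-version"]
tokens, labels = expand_labels(words, word_labels)
print(tokens)  # ['using', 'SPSS', 'software', 'v', '13', '.', '0']
print(labels)  # ['O', 'B-software', 'O', 'O', 'B-version', 'I-version', 'I-version']
```

Passing the tokenizer as a parameter is just a convenience for this check; the notebook function uses the global `tokenizer` directly.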
The evaluation is then correctly done on 10% of the corpus (404 software names), and the scores become (combined with the fix of #1):
F1 score: 0.855545
              precision    recall  f1-score   support
    software     0.8107    0.8589    0.8341       404
     version     0.9180    0.9412    0.9295       119
   micro avg     0.8345    0.8776    0.8555       523
   macro avg     0.8352    0.8776    0.8558       523
0.85 is more or less the same F1-score as I get with my independent implementation using the same model and the Softcite dataset (I get ~0.84).
This problem possibly also impacts inference and the ability to segment software names and versions spanning several tokens, because the model is trained almost exclusively on B- labels, so it is likely not just an evaluation issue.