Incorrect IOB labels after BERT tokenization #2

Description

@kermitt2

Follow-up of #1 :)

In the notebook notebooks/Train software mentions model.ipynb, when the labels are expanded to cover the extra tokens introduced by the BERT tokenizer, each added label is a copy of the leading label (which starts with B-). Following the IOB scheme, these added labels should instead be "inside" labels starting with I-.

def tokenize_label_sentence(sentence, text_labels):
    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)
        tokenized_sentence.extend(tokenized_word)
        labels.extend([label] * n_subwords)        # <--- problem !
    return tokenized_sentence, labels

For example, the version number below is split into three tokens that all receive B-version:

['Statistical', 'analysis', 'was', 'conducted', 'using', 'SPSS', 'software', 'v', '13', '.', '0', '(', 
'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'IL', ',', 'USA', ')', '.']
['O', 'O', 'O', 'O', 'O', 'B-software', 'O', 'O', 'B-version', 'B-version', 'B-version', 'O', 'O', 
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

The problem is also visible in the notebook notebooks/Software mentions inference mode.ipynb, cells 6-7:

test_sentence = "I used Python package DBSCAN 1.234 for this analysis"
[('I', 'O'), ('used', 'O'), ('Python', 'B-software'), ('package', 'O'), ('DBSCAN', 'B-software'), 
('1', 'B-version'), ('.', 'B-version'), ('234', 'B-version'), ('for', 'O'), ('this', 'O'), ('analysis', 'O')]

We should not get 3 separate "version" entities here, but a single one (1.234) spanning 3 tokens.
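To make the difference concrete, here is a minimal sketch that groups (token, label) pairs into entity spans following the IOB convention. The helper name group_entities is hypothetical and this is a simplified stand-in, not the notebook's code or seqeval's implementation. The "expected" list is the same sentence with the two extra version tokens relabelled I-version.

```python
def group_entities(tagged_tokens):
    """Group (token, label) pairs into (entity_type, [tokens]) spans, IOB-style."""
    entities = []
    current = None  # the (entity_type, tokens) span currently being extended
    for token, label in tagged_tokens:
        if label == "O":
            current = None
            continue
        prefix, entity_type = label.split("-", 1)
        if prefix == "I" and current is not None and current[0] == entity_type:
            # "inside" label continuing the open entity of the same type
            current[1].append(token)
        else:
            # a B- label (or a dangling I- label) opens a new entity
            current = (entity_type, [token])
            entities.append(current)
    return entities


# Output reported in the inference notebook: every version token is tagged B-version
observed = [("I", "O"), ("used", "O"), ("Python", "B-software"), ("package", "O"),
            ("DBSCAN", "B-software"), ("1", "B-version"), (".", "B-version"),
            ("234", "B-version"), ("for", "O"), ("this", "O"), ("analysis", "O")]

# Same tokens with proper IOB labels (assumed correction, not notebook output)
expected = observed[:6] + [(".", "I-version"), ("234", "I-version")] + observed[8:]

observed_versions = [toks for typ, toks in group_entities(observed) if typ == "version"]
expected_versions = [toks for typ, toks in group_entities(expected) if typ == "version"]
print(observed_versions)  # three single-token "version" entities
print(expected_versions)  # one three-token "version" entity
```

With B- on every token, the version number is split into three entities; with I- labels it is recovered as one.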

When seqeval then computes the evaluation scores, it counts every label starting with B- as a separate entity, which over-estimates the scores (they end up closer to token-level scores than entity-level scores). There are 4093 software names in the Softcite dataset, but the seqeval evaluation in the training/eval notebook, which is supposed to run on 10% of the corpus, reports 959 software names (see the support column below):

f1 socre: 0.922123
Accuracy score: 0.995906
           precision    recall  f1-score   support

 software     0.9014    0.9343    0.9176       959
  version     0.9216    0.9515    0.9363       309

micro avg     0.9063    0.9385    0.9221      1268
macro avg     0.9063    0.9385    0.9221      1268

To fix the issue, the expanded labels should be prefixed with I-, for example:

def tokenize_label_sentence(sentence, text_labels):
    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)
        tokenized_sentence.extend(tokenized_word)
        if n_subwords == 0:
            # no token was emitted for this word, so emit no label either
            # (keeps tokens and labels aligned)
            continue
        # the first sub-token keeps the original label; the following ones get
        # I-label instead of B-label (otherwise we have B- at every token and
        # wrong entity scores)
        inside_label = "I-" + label[2:] if label.startswith("B-") else label
        labels.append(label)
        labels.extend([inside_label] * (n_subwords - 1))
    return tokenized_sentence, labels

We then have, as expected:

['Statistical', 'analysis', 'was', 'conducted', 'using', 'SPSS', 'software', 'v', '13', '.', '0', '(', 
'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'IL', ',', 'USA', ')', '.']
['O', 'O', 'O', 'O', 'O', 'B-software', 'O', 'O', 'B-version', 'I-version', 'I-version', 'O', 'O', 'O', 
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
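The label expansion can be checked without loading BERT. The sketch below uses a toy tokenizer that only splits off punctuation; toy_tokenize and expand_labels are hypothetical names, and the real tokenizer.tokenize from the notebook would produce WordPiece tokens instead. It also assumes the word-level input contains "13.0" labelled B-version, which is inferred from the example above rather than stated in the notebook.

```python
import re


def toy_tokenize(word):
    # Hypothetical stand-in for tokenizer.tokenize: splits punctuation from
    # alphanumeric runs, e.g. "13.0" -> ["13", ".", "0"]
    return re.findall(r"\w+|[^\w\s]", word)


def expand_labels(words, word_labels, tokenize=toy_tokenize):
    """Expand word-level IOB labels to token level: B- first, then I-."""
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        if not pieces:
            continue  # nothing emitted for this word, so no label either
        inside = "I-" + label[2:] if label.startswith("B-") else label
        tokens.extend(pieces)
        labels.extend([label] + [inside] * (len(pieces) - 1))
    return tokens, labels


words = ["using", "SPSS", "software", "v", "13.0"]
word_labels = ["O", "B-software", "O", "O", "B-version"]
tokens, labels = expand_labels(words, word_labels)
print(list(zip(tokens, labels)))
```

Only the first token of "13.0" keeps B-version, so counting labels that start with B- gives one version entity instead of three.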

The evaluation is then correctly computed on 10% of the corpus (404 software names), and the scores become (combined with the fix for #1):

F1 score: 0.855545
           precision    recall  f1-score   support

 software     0.8107    0.8589    0.8341       404
  version     0.9180    0.9412    0.9295       119

micro avg     0.8345    0.8776    0.8555       523
macro avg     0.8352    0.8776    0.8558       523

0.85 is more or less the same F1-score as I get with my independent implementation using the same model and the Softcite dataset (I get ~0.84).

This problem may also affect inference, in particular the ability to segment software names and versions that span several tokens: the model is trained with almost only B- labels, so it is likely not just an evaluation issue.
