Incorrect IOB labels after BERT tokenization

Follow-up of #1 :)

In the notebook `notebooks/Train software mentions model.ipynb`, when the labels are expanded to cover the extra tokens introduced by the BERT tokenizer, each added label is a copy of the word's leading label (which starts with `B-`). Following the IOB scheme, these added labels should instead be "inside" labels starting with `I-`.

```python
def tokenize_label_sentence(sentence, text_labels):
    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)
        tokenized_sentence.extend(tokenized_word)
        labels.extend([label] * n_subwords)        # <--- problem !
    return tokenized_sentence, labels
```

For example:

```
['Statistical', 'analysis', 'was', 'conducted', 'using', 'SPSS', 'software', 'v', '13', '.', '0', '(', 
'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'IL', ',', 'USA', ')', '.']
['O', 'O', 'O', 'O', 'O', 'B-software', 'O', 'O', 'B-version', 'B-version', 'B-version', 'O', 'O', 
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
```

The problem is also visible in the notebook `notebooks/Software mentions inference mode.ipynb`, cells 6-7:

```
test_sentence = "I used Python package DBSCAN 1.234 for this analysis"
[('I', 'O'), ('used', 'O'), ('Python', 'B-software'), ('package', 'O'), ('DBSCAN', 'B-software'), 
('1', 'B-version'), ('.', 'B-version'), ('234', 'B-version'), ('for', 'O'), ('this', 'O'), ('analysis', 'O')]
```

We should not get three separate `version` entities here, but a single one (`1.234`) spanning three tokens.
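To make the difference concrete, here is a minimal sketch of entity counting over an IOB sequence. `count_entities` is a hypothetical helper written for this issue, a simplified stand-in for what an entity-level scorer does and not seqeval's actual code; the label lists are the `DBSCAN 1.234` part of the sentence above.

```python
def count_entities(labels):
    """Count entities per type in an IOB label sequence.

    A new entity starts at every `B-` label, or at an `I-` label that does
    not continue an open entity of the same type.
    """
    counts = {}
    prev_type = None  # type of the entity currently open, if any
    for label in labels:
        if label == "O":
            prev_type = None
            continue
        prefix, _, ent_type = label.partition("-")
        if prefix == "B" or ent_type != prev_type:
            counts[ent_type] = counts.get(ent_type, 0) + 1
        prev_type = ent_type
    return counts


# Labels for the tokens: DBSCAN, 1, ., 234
copied = ["B-software", "B-version", "B-version", "B-version"]  # current behaviour
fixed = ["B-software", "B-version", "I-version", "I-version"]   # expected IOB labels

print(count_entities(copied))  # {'software': 1, 'version': 3}
print(count_entities(fixed))   # {'software': 1, 'version': 1}
```

With the copied `B-` labels, one version mention is counted three times, which is how the number of evaluated entities gets inflated.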

When seqeval then computes the evaluation scores, it counts every label starting with `B-` as a separate entity, which over-estimates the scores (they end up closer to token-level than entity-level scores). There are 4093 software names in the Softcite dataset, but the seqeval evaluation in the training/eval notebook, which is supposed to run on 10% of the corpus, is done on 959 software names; see the support column below:

```
f1 socre: 0.922123
Accuracy score: 0.995906
           precision    recall  f1-score   support

 software     0.9014    0.9343    0.9176       959
  version     0.9216    0.9515    0.9363       309

micro avg     0.9063    0.9385    0.9221      1268
macro avg     0.9063    0.9385    0.9221      1268
```

To fix the issue, the expanded labels should be prefixed with `I-`, for example:

```python
def tokenize_label_sentence(sentence, text_labels):
    # `tokenizer` is the BERT tokenizer already defined in the notebook.
    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)
        if n_subwords == 0:
            # no token was produced for this word, so add no label either
            continue
        tokenized_sentence.extend(tokenized_word)
        # The first subword keeps the original label; the extra subwords get
        # the matching I- label (otherwise every subword starts a new entity
        # and the entity-level scores are wrong).
        inside_label = "I-" + label[2:] if label.startswith("B-") else label
        labels.append(label)
        labels.extend([inside_label] * (n_subwords - 1))
    return tokenized_sentence, labels
```

We then get, as expected:

```
['Statistical', 'analysis', 'was', 'conducted', 'using', 'SPSS', 'software', 'v', '13', '.', '0', '(', 
'SPSS', 'Inc', '.', ',', 'Chicago', ',', 'IL', ',', 'USA', ')', '.']
['O', 'O', 'O', 'O', 'O', 'B-software', 'O', 'O', 'B-version', 'I-version', 'I-version', 'O', 'O', 'O', 
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
```

The evaluation is then correctly done on 10% of the corpus (404 software names), and the scores become (combined with the fix of #1):

```
F1 score: 0.855545
           precision    recall  f1-score   support

 software     0.8107    0.8589    0.8341       404
  version     0.9180    0.9412    0.9295       119

micro avg     0.8345    0.8776    0.8555       523
macro avg     0.8352    0.8776    0.8558       523
```

0.85 is more or less the same F1-score as I get with my independent implementation using the same model and the Softcite dataset (I get ~0.84).

It's possible that this problem also impacts inference and the ability to segment software names and versions that span several tokens, because the model is trained almost only on `B-` labels, so it's likely not just an evaluation issue.

