Description
MaskedDataset.__getitem__ does not follow standard BERT masking behavior.
Current implementation:
mask_positions = random.sample(valid_positions, n_to_mask)
actual_mask_positions = random.sample(mask_positions, int(len(mask_positions) * 0.8))
x_masked[actual_mask_positions] = self.mask_idx
This causes two issues:
- Effective masking becomes 0.8 × masked_rate due to double-sampling.
- BERT-style 80/10/10 behavior is incomplete:
  - 80% [MASK]
  - 10% random token
  - 10% unchanged
  Only the [MASK] replacement path is implemented.
The issue is also referenced in the TODO comment in datasets/dataclasses/_masked.py.
To Reproduce
import numpy as np
from pyaptamer.datasets.dataclasses import MaskedDataset
rng = np.random.default_rng(42)
seqs = rng.integers(1, 11, size=(200, 20)).tolist()
ds = MaskedDataset(
    x=seqs,
    y=seqs,
    max_len=20,
    mask_idx=11,
    masked_rate=0.50,
)
fracs = []
for i in range(len(ds)):
    x_masked, _, x, _ = ds[i]
    fracs.append((x_masked == 11).sum() / (x > 0).sum())
print(np.mean(fracs))
# Expected: ~0.40
# Actual: ~0.32
Expected behavior
MaskedDataset should implement standard BERT masking:
- 80% [MASK]
- 10% random token
- 10% unchanged
without reducing the effective masking rate through double-sampling.
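For illustration, the expected behavior could look roughly like the sketch below. It is not the pyaptamer implementation: `bert_mask`, `vocab_size`, and `pad_idx` are hypothetical names, and the sketch assumes padding is encoded as 0 and regular tokens as 1..vocab_size. Positions are sampled once, then split 80/10/10 with a single uniform draw per position.

```python
import numpy as np


def bert_mask(tokens, mask_idx, vocab_size, masked_rate=0.15, pad_idx=0, rng=None):
    """Hypothetical helper: BERT-style 80/10/10 masking of a 1-D token array."""
    rng = np.random.default_rng() if rng is None else rng
    tokens = np.asarray(tokens)
    x_masked = tokens.copy()

    # Only non-padding positions are eligible for selection (assumes pad_idx marks padding).
    valid_positions = np.flatnonzero(tokens != pad_idx)
    n_to_mask = int(round(len(valid_positions) * masked_rate))

    # Sample the selected positions exactly once, so the rate is not reduced again.
    selected = rng.choice(valid_positions, size=n_to_mask, replace=False)

    # Split the selected positions roughly 80/10/10 using one uniform draw per position.
    draw = rng.random(n_to_mask)
    to_mask = selected[draw < 0.8]
    to_random = selected[(draw >= 0.8) & (draw < 0.9)]
    # The remaining ~10% are left unchanged but still count as prediction targets.

    x_masked[to_mask] = mask_idx
    # Random replacements come from the regular tokens 1..vocab_size (assumed range).
    x_masked[to_random] = rng.integers(1, vocab_size + 1, size=len(to_random))
    return x_masked, selected
```

Returning `selected` alongside `x_masked` lets the loss be computed over every selected position, including the ones left unchanged or replaced by a random token. Because the split is drawn per position, the 80/10/10 proportions hold on average rather than exactly for each sequence.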
References
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding