[BUG] MaskedDataset masking deviates from BERT 80/10/10 masking behavior #646

@onkar717

Description

MaskedDataset.__getitem__ does not follow standard BERT masking behavior.

Current implementation:

mask_positions = random.sample(valid_positions, n_to_mask)
actual_mask_positions = random.sample(mask_positions, int(len(mask_positions) * 0.8))
x_masked[actual_mask_positions] = self.mask_idx

This causes two issues:

  1. Effective masking becomes 0.8 × masked_rate due to double-sampling.

  2. BERT-style 80/10/10 behavior is incomplete:

    • 80% [MASK]
    • 10% random token
    • 10% unchanged

Only the [MASK] replacement path is implemented.

The issue is also referenced in the TODO comment in datasets/dataclasses/_masked.py.


To Reproduce

import numpy as np
from pyaptamer.datasets.dataclasses import MaskedDataset

rng = np.random.default_rng(42)
seqs = rng.integers(1, 11, size=(200, 20)).tolist()

ds = MaskedDataset(
    x=seqs,
    y=seqs,
    max_len=20,
    mask_idx=11,
    masked_rate=0.50,
)

fracs = []

for i in range(len(ds)):
    x_masked, _, x, _ = ds[i]
    fracs.append((x_masked == 11).sum() / (x > 0).sum())

print(np.mean(fracs))

# Expected: ~0.40 (80% of the 50% selected positions become [MASK])
# Actual:   ~0.32

Expected behavior

MaskedDataset should implement standard BERT masking:

  • 80% [MASK]
  • 10% random token
  • 10% unchanged

without reducing the effective masking rate through double-sampling.
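A minimal sketch of the behavior described above, assuming plain Python lists and the standard library only. The function name `bert_mask` and its parameters (`vocab_ids`, `pad_idx`, `rng`) are hypothetical and are not part of pyaptamer; the way `n_to_mask` is computed is also an assumption, since the original snippet does not show it. Positions are sampled once, and each selected position then gets exactly one of the three outcomes:

```python
import random


def bert_mask(tokens, mask_idx, vocab_ids, masked_rate=0.15, pad_idx=0, rng=None):
    """Apply 80/10/10 masking to a list of token ids.

    Hypothetical helper for illustration only (not pyaptamer API).
    Returns the masked copy and the sorted list of selected positions.
    """
    if rng is None:
        rng = random.Random()
    masked = list(tokens)

    # Sample the positions to corrupt exactly once, skipping padding.
    valid_positions = [i for i, t in enumerate(tokens) if t != pad_idx]
    n_to_mask = int(len(valid_positions) * masked_rate)  # assumed rounding
    selected = rng.sample(valid_positions, n_to_mask)

    # One draw per selected position decides its outcome.
    for pos in selected:
        draw = rng.random()
        if draw < 0.8:
            masked[pos] = mask_idx               # 80%: replace with [MASK]
        elif draw < 0.9:
            masked[pos] = rng.choice(vocab_ids)  # 10%: replace with a random token
        # remaining 10%: keep the original token

    return masked, sorted(selected)


# Example call mirroring the reproduction settings (ids 1..10, mask_idx=11).
rng = random.Random(0)
tokens = [rng.randint(1, 10) for _ in range(20)]
masked, selected = bert_mask(
    tokens, mask_idx=11, vocab_ids=list(range(1, 11)), masked_rate=0.5, rng=rng
)
```

With this shape, all `n_to_mask` positions remain prediction targets, and only the replacement applied to each one varies. A random replacement may occasionally equal the original token, which is consistent with the scheme as described in the BERT paper.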


References

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
