Skip to content

[BUG] Fix MaskedDataset y_masked initialization from wrong tensor#304

Closed
officialasishkumar wants to merge 1 commit into
gc-os-ai:mainfrom
officialasishkumar:bug/302-fix-masked-dataset-y-clone
Closed

[BUG] Fix MaskedDataset y_masked initialization from wrong tensor#304
officialasishkumar wants to merge 1 commit into
gc-os-ai:mainfrom
officialasishkumar:bug/302-fix-masked-dataset-y-clone

Conversation

@officialasishkumar

Copy link
Copy Markdown
Contributor

Reference Issues/PRs

Fixes #302

What does this implement/fix? Explain your changes.

MaskedDataset.__getitem__ was initializing y_masked by cloning x instead of y. This meant the target tensor for masked language modeling was derived from the input rather than the actual target sequence, corrupting training targets.

The fix changes line 154 in pyaptamer/datasets/dataclasses/_masked.py from y_masked = x.clone().detach() to y_masked = y.clone().detach().

What should a reviewer concentrate their feedback on?

  • Verify the one-line fix in _masked.py is correct
  • Review the new tests in test_masked_dataset.py for completeness

Did you add any tests for the change?

Yes. Added pyaptamer/datasets/tests/test_masked_dataset.py with four tests:

  • test_len — verifies __len__ returns the correct count
  • test_mismatched_lengths_raises — verifies ValueError on mismatched x/y
  • test_getitem_returns_four_tensors — verifies return shape and types
  • test_getitem_y_masked_derived_from_y — regression test ensuring y_masked values come from y, not x

Any other comments?

None.

PR checklist

  • The PR title starts with [BUG]
  • Added/modified tests
  • Used pre-commit hooks

MaskedDataset.__getitem__ was initializing y_masked by cloning x
instead of y. This meant the target tensor for masked language modeling
was derived from the input rather than the actual target sequence.

Added tests for MaskedDataset.__getitem__ to prevent regression.

Fixes gc-os-ai#302

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] MaskedDataset.__getitem__ initializes y_masked from x instead of y

2 participants