[BUG] MaskedDataset initializes y_masked from x instead of y #599

Description

@blenbot

Describe the bug

MaskedDataset.__getitem__ initializes y_masked from x instead of y. When the input and target sequences differ, the masked training targets are taken from the input rather than the target, so they are wrong.

To Reproduce

import torch
from pyaptamer.datasets.dataclasses import MaskedDataset

# x and y intentionally different
x_data = [[1, 2, 3, 0]]
y_data = [[9, 9, 9, 0]]

dataset = MaskedDataset(
    x=x_data,
    y=y_data,
    max_len=4,
    mask_idx=99,
    masked_rate=0.5,
    is_rna=False,
)

x_masked, y_masked, x, y = dataset[0]

print("x:", x.tolist())
print("y:", y.tolist())
print("y_masked:", y_masked.tolist())

Expected behavior

x: [1, 2, 3, 0]                                                          
y: [9, 9, 9, 0]                                                          
y_masked: [0, 9, 0, 0]

What Actually Happened

x: [1, 2, 3, 0]                                                          
y: [9, 9, 9, 0]                                                          
y_masked: [0, 0, 3, 0] 
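
For reference, here is a minimal sketch of the suspected cause and a possible fix, written in plain Python so it runs without pyaptamer or torch. The function `mask_pair`, its `buggy` flag, and the assumption that token 0 is padding are hypothetical and are not taken from the library's code. It only mirrors the behavior seen in the outputs above: y_masked holds a target token at each masked position and 0 elsewhere.

```python
import random


def mask_pair(x, y, mask_idx, masked_rate, rng, buggy=False):
    """Hypothetical stand-in for the masking step; not pyaptamer's code."""
    # Assumption: token 0 is padding and is never selected for masking.
    candidates = [i for i, tok in enumerate(x) if tok != 0]
    n_mask = min(len(candidates), max(1, int(len(candidates) * masked_rate)))
    positions = rng.sample(candidates, n_mask)

    x_masked = list(x)
    y_masked = [0] * len(y)
    # The reported output is consistent with targets being read from x;
    # the proposed fix reads them from y instead.
    source = x if buggy else y
    for i in positions:
        x_masked[i] = mask_idx      # hide the input token
        y_masked[i] = source[i]     # keep the target only at masked positions
    return x_masked, y_masked


x, y = [1, 2, 3, 0], [9, 9, 9, 0]
_, y_buggy = mask_pair(x, y, mask_idx=99, masked_rate=0.5,
                       rng=random.Random(0), buggy=True)
_, y_fixed = mask_pair(x, y, mask_idx=99, masked_rate=0.5,
                       rng=random.Random(0))
print("buggy y_masked:", y_buggy)
print("fixed y_masked:", y_fixed)
```

Both calls use the same seed and therefore mask the same position, so the only difference between the two results is whether the value at that position comes from x or from y.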

Additional context

Versions

0.1.0a1
