Describe the bug
MaskedDataset.__getitem__ initializes y_masked from x instead of y, which can corrupt masked training targets when input and target sequences differ.
To Reproduce
import torch
from pyaptamer.datasets.dataclasses import MaskedDataset
# x and y intentionally different
x_data = [[1, 2, 3, 0]]
y_data = [[9, 9, 9, 0]]
dataset = MaskedDataset(
    x=x_data,
    y=y_data,
    max_len=4,
    mask_idx=99,
    masked_rate=0.5,
    is_rna=False,
)
x_masked, y_masked, x, y = dataset[0]
print("x:", x.tolist())
print("y:", y.tolist())
print("y_masked:", y_masked.tolist())
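The expected and actual outputs below mask different indices, so the masked positions appear to vary between runs. A position-independent check may make the comparison easier. This is a minimal sketch, not part of pyaptamer: the helper name `targets_come_from` is hypothetical, and it assumes unmasked positions in `y_masked` are filled with 0, as the outputs in this report suggest.

```python
def targets_come_from(y_masked, source, fill=0):
    """Return True if every non-fill entry of y_masked equals source at that index.

    Assumption: `fill` marks unmasked positions (0 in this report's outputs).
    """
    return all(m == s for m, s in zip(y_masked, source) if m != fill)


x = [1, 2, 3, 0]
y = [9, 9, 9, 0]
expected_y_masked = [0, 9, 0, 0]  # from "Expected behavior"
reported_y_masked = [0, 0, 3, 0]  # from "What Actually Happened"

print(targets_come_from(expected_y_masked, y))  # True
print(targets_come_from(reported_y_masked, y))  # False
print(targets_come_from(reported_y_masked, x))  # True
```

With real data the check is only meaningful if 0 is not also a valid target token at a masked position. With `dataset[0]` it could be applied as `targets_come_from(y_masked.tolist(), y.tolist())`.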
Expected behavior
x: [1, 2, 3, 0]
y: [9, 9, 9, 0]
y_masked: [0, 9, 0, 0]
What Actually Happened
x: [1, 2, 3, 0]
y: [9, 9, 9, 0]
y_masked: [0, 0, 3, 0]
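The difference between the two outputs can be illustrated with a small standalone sketch. It is not the pyaptamer implementation: `build_masked_pair` and its `target_source` parameter are hypothetical, the masked positions are fixed by hand instead of sampled, and the zero fill for unmasked targets is an assumption inferred from the outputs above.

```python
def build_masked_pair(x, y, positions, mask_idx, target_source):
    """Hypothetical stand-in for the masking step in __getitem__.

    x_masked: copy of x with mask_idx written at the masked positions.
    y_masked: zeros everywhere except the masked positions, which hold
              the value taken from `target_source` (assumed layout).
    """
    x_masked = list(x)
    y_masked = [0] * len(y)
    for i in positions:
        x_masked[i] = mask_idx
        y_masked[i] = target_source[i]
    return x_masked, y_masked


x = [1, 2, 3, 0]
y = [9, 9, 9, 0]
positions = [2]  # fixed by hand so the comparison is deterministic

# Reported behaviour: targets taken from x.
_, y_masked_buggy = build_masked_pair(x, y, positions, 99, target_source=x)
# Intended behaviour: targets taken from y.
x_masked, y_masked_fixed = build_masked_pair(x, y, positions, 99, target_source=y)

print(y_masked_buggy)  # [0, 0, 3, 0]
print(y_masked_fixed)  # [0, 0, 9, 0]
print(x_masked)        # [1, 2, 99, 0]
```

When `x` and `y` are identical both variants give the same result, which may be why the problem goes unnoticed in that case.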
Additional context
Versions
0.1.0a1