MoleculeLoader is not sklearn-CV-sliceable, breaking AptaNet + Benchmarking #706

Description

@siddharth7113

LLM generated content, by Claude Opus 4.8

Summary

AptaNetPipeline now accepts only a MoleculeLoader as input (via PairsToFeatures, which is itself MoleculeLoader-only). However, MoleculeLoader is not sliceable, so it cannot be used with scikit-learn cross-validation or grid search. This breaks Benchmarking with AptaNet estimators.

Details

Benchmarking.run() calls sklearn.model_selection.cross_validate(estimator, X, y, cv=...), which slices X per fold via _safe_indexing. Two problems:

  • Passing a list of pairs (as test_benchmarking.py does) is sliceable, but the list is then rejected by PairsToFeatures (TypeError: PairsToFeatures accepts only a MoleculeLoader as input, got list.).
  • Passing a MoleculeLoader is accepted by PairsToFeatures, but MoleculeLoader has no __len__/__getitem__, so _safe_indexing fails ('MoleculeLoader' object is not subscriptable).

Reproduce:

import numpy as np
from sklearn.utils import _safe_indexing
from pyaptamer.data import MoleculeLoader

ml = MoleculeLoader(data={"aptamer": ["ACGU"] * 40, "protein": ["MK"] * 40})
_safe_indexing(ml, np.arange(10))  # TypeError: 'MoleculeLoader' object is not subscriptable

Proposed fix

Make MoleculeLoader sklearn-sliceable: add __len__ (number of materialized samples) and __getitem__ (integer / array / slice → a sub-MoleculeLoader over the selected rows), so it survives cross_validate and returns a loader each fold. Then migrate test_benchmarking.py to pass a MoleculeLoader and re-enable the skipped tests.
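A minimal sketch of the container protocol described above, for discussion only. `SliceableLoader` is a hypothetical stand-in, not the real `MoleculeLoader`: it assumes the data is a dict of equal-length columns, which may not match pyaptamer's internals. It only illustrates `__len__` and `__getitem__` returning a sub-loader for an integer, a slice, or a sequence of integers. How sklearn's private `_safe_indexing` dispatches on such an object (for example, whether it indexes element by element and collects the results into a list) should be checked against the installed scikit-learn version.

```python
import operator


class SliceableLoader:
    """Hypothetical stand-in for MoleculeLoader (not the pyaptamer class)."""

    def __init__(self, data):
        # Assumption: data maps column name -> sequence, all of equal length.
        lengths = {len(col) for col in data.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.data = {name: list(col) for name, col in data.items()}

    def __len__(self):
        # Number of rows; 0 for a loader with no columns.
        return len(next(iter(self.data.values()), []))

    def __getitem__(self, key):
        if isinstance(key, slice):
            rows = range(len(self))[key]
        else:
            try:
                # Single integer-like key (int, NumPy integer scalar, ...).
                rows = [operator.index(key)]
            except TypeError:
                # Otherwise treat the key as an iterable of integer-like values.
                rows = [operator.index(i) for i in key]
        # Always return a new loader over the selected rows.
        return SliceableLoader(
            {name: [col[i] for i in rows] for name, col in self.data.items()}
        )


loader = SliceableLoader(
    {
        "aptamer": ["ACGU", "GGCA", "UUAC", "CAGU"],
        "protein": ["MK", "AL", "GV", "ST"],
    }
)
fold = loader[[0, 2]]  # index list, as a CV splitter would supply
window = loader[1:3]   # slice
single = loader[3]     # single integer still yields a one-row loader
```

Returning a loader even for a single integer keeps the return type uniform, which matches the "integer / array / slice → sub-MoleculeLoader" wording above; a real implementation may prefer different semantics for scalar keys.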

Affected tests (currently skipped)

  • pyaptamer/benchmarking/tests/test_benchmarking.py::test_benchmarking_with_predefined_split_classification
  • pyaptamer/benchmarking/tests/test_benchmarking.py::test_benchmarking_with_predefined_split_regression

These tests are skipped in the PR that lands the pipeline migration so that main stays green; this issue tracks the real fix.
