
[BUG] Honor MoleculeLoader ignore_duplicates when exporting sequences #313

Closed
officialasishkumar wants to merge 1 commit into gc-os-ai:main from officialasishkumar:issue312


@officialasishkumar

Contributor

Reference Issues/PRs

Fixes #312.

What does this implement/fix? Explain your changes.

This fixes MoleculeLoader.to_df_seq() so the documented ignore_duplicates flag is actually honored. The loader now builds the sequence DataFrame first, drops duplicate sequence rows while keeping the first occurrence when the flag is enabled, and then applies any optional column renaming.
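
The ordering described above (build the frame, drop duplicates, then rename) can be sketched roughly as follows. This is a minimal illustration using plain pandas, not the actual `MoleculeLoader.to_df_seq()` code: the helper name `sequences_to_df`, the default column name `"sequence"`, and the `col_name` parameter are assumptions made for the example.

```python
import pandas as pd


def sequences_to_df(sequences, ignore_duplicates=False, col_name=None):
    """Hypothetical helper mirroring the order of operations in this PR.

    Not the pyaptamer API; names and defaults are assumptions.
    """
    # 1. Build the sequence DataFrame first, under a fixed internal column name.
    df = pd.DataFrame({"sequence": list(sequences)})

    # 2. Drop duplicate sequence rows, keeping the first occurrence.
    if ignore_duplicates:
        df = df.drop_duplicates(subset="sequence", keep="first")

    # 3. Apply optional renaming last, so deduplication never depends on
    #    whatever name the caller picked.
    if col_name is not None:
        df = df.rename(columns={"sequence": col_name})

    return df


if __name__ == "__main__":
    seqs = ["ACGU", "GGCA", "ACGU", "UUAG"]
    # Rows 0, 1 and 3 survive; row 2 repeats row 0 and is dropped.
    print(sequences_to_df(seqs, ignore_duplicates=True, col_name="seq"))
```

Renaming after deduplication means the `subset` argument can always refer to the internal column name. The sketch leaves the original index untouched; whether the real method resets it is not stated in this PR.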

What should a reviewer concentrate their feedback on?

The duplicate-removal behavior in MoleculeLoader.to_df_seq() and whether preserving the first indexed occurrence matches the intended contract for sequence exports.

Did you add any tests for the change?

Yes. I added a regression test that reproduces the bug with duplicate PDB inputs and verifies that:

  • duplicate sequence rows are removed when ignore_duplicates=True
  • optional custom column names still work after deduplication
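
For readers without the diff at hand, a regression test for the two checks listed above might look roughly like this. It is a sketch only: `export_sequences` is a hypothetical stand-in defined inline (the real test in `pyaptamer/data/tests/test_loader.py` exercises `MoleculeLoader` with PDB files), and the column names and sample records are invented for illustration.

```python
import pandas as pd


def export_sequences(records, ignore_duplicates=False, col_names=None):
    """Hypothetical stand-in for the loader's sequence export (not pyaptamer code)."""
    df = pd.DataFrame(records, columns=["chain", "sequence"])
    if ignore_duplicates:
        df = df.drop_duplicates(keep="first")
    if col_names is not None:
        df.columns = list(col_names)
    return df


# The same two chains listed twice, mimicking duplicate PDB inputs.
DUPLICATED = [("A", "ACGU"), ("B", "GGCA"), ("A", "ACGU"), ("B", "GGCA")]


def test_duplicates_removed_when_flag_set():
    df = export_sequences(DUPLICATED, ignore_duplicates=True)
    assert len(df) == 2
    assert not df.duplicated().any()


def test_custom_column_names_survive_deduplication():
    df = export_sequences(
        DUPLICATED, ignore_duplicates=True, col_names=["id", "seq"]
    )
    assert list(df.columns) == ["id", "seq"]
    assert df["seq"].tolist() == ["ACGU", "GGCA"]


def test_duplicates_kept_when_flag_unset():
    # Guards against deduplication being applied unconditionally.
    df = export_sequences(DUPLICATED, ignore_duplicates=False)
    assert len(df) == 4
```

The third test is an addition beyond what the PR lists; it checks that rows are only dropped when the flag is set.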

Any other comments?

Validation run locally:

  • .venv/bin/pytest pyaptamer -p no:warnings
  • .venv/bin/pre-commit run --files pyaptamer/data/loader.py pyaptamer/data/tests/test_loader.py

PR checklist

  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
  • Added/modified tests
  • Used pre-commit hooks when committing to ensure that code is compliant with hooks. Install hooks with pre-commit install.
    To run hooks independent of commit, execute pre-commit run --all-files

@siddharth7113 self-requested a review April 7, 2026 14:54
@siddharth7113 marked this pull request as draft April 7, 2026 15:12

Development

Successfully merging this pull request may close these issues.

[BUG] MoleculeLoader ignores ignore_duplicates when exporting sequences

2 participants