[BUG] Lowercase protein sequences are corrupted by clean_protein_seq #550

@Alleny244

Describe the bug
clean_protein_seq() replaces valid lowercase amino acid characters with 'N' because AMINO_ACIDS only contains uppercase letters. Lowercase input like "acdef" is treated as entirely invalid and returned as "NNNNN".

To Reproduce

from pyaptamer.utils._pseaac_utils import clean_protein_seq

print(clean_protein_seq("acdef"))
# Output: "NNNNN" (expected: "ACDEF")

print(clean_protein_seq("AcDeF"))
# Output: "ANDNF" (expected: "ACDEF")

Expected behavior
Lowercase and mixed-case sequences should be normalized to uppercase before validation,
preserving all valid amino acids. clean_protein_seq("acdef") should return "ACDEF",
not "NNNNN".
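
A minimal sketch of the proposed behavior, for illustration only. It is not the library's actual implementation: the function name clean_protein_seq_sketch is hypothetical, and the contents of AMINO_ACIDS here (the 20 standard residues) are an assumption, since the real constant in pyaptamer may include other symbols. The only point is the order of operations: uppercase first, then validate.

```python
# Assumption: the 20 standard amino acids. The real AMINO_ACIDS constant
# in pyaptamer may differ; this set is only for illustration.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_seq_sketch(seq: str) -> str:
    """Hypothetical stand-in for clean_protein_seq.

    Normalizes the sequence to uppercase, then replaces every character
    that is not in AMINO_ACIDS with 'N'.
    """
    seq = seq.upper()  # normalize case before the membership check
    return "".join(ch if ch in AMINO_ACIDS else "N" for ch in seq)


print(clean_protein_seq_sketch("acdef"))  # ACDEF
print(clean_protein_seq_sketch("AcDeF"))  # ACDEF
print(clean_protein_seq_sketch("ac1b"))   # ACNN ('1' and 'B' are not in the assumed set)
```

With the case normalization in place, genuinely invalid characters are still replaced with 'N', so the existing handling of unknown residues would be unchanged under these assumptions.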

Additional context
PDB files and other bioinformatics tools sometimes output lowercase amino acid sequences.
The function should handle these gracefully rather than treating them as invalid residues.

Versions

0.1.0a1
