[BUG] clean_protein_seq replaces invalid amino acids with Asparagine ("N") instead of removing them #648

@onkar717

Describe the bug

clean_protein_seq() in pyaptamer/utils/_pseaac_utils.py replaces invalid amino acid characters with "N". However, "N" is the IUPAC one-letter code for Asparagine, a real amino acid that is already present in the AMINO_ACIDS list.

This silently corrupts protein sequences by turning unknown characters into a valid amino acid, producing incorrect PSeAAC feature vectors downstream for both PSeAAC and AptaNetPSeAAC.

To Reproduce

from pyaptamer.utils._pseaac_utils import clean_protein_seq, AMINO_ACIDS

# "N" is Asparagine, a valid amino acid
print("N" in AMINO_ACIDS)  # True

# Input sequence with invalid character "X"
result = clean_protein_seq("ACXD")
print(result)  # "ACND" — X was silently replaced with Asparagine!

# This corrupts PSeAAC feature vectors:
from pyaptamer.pseaac import PSeAAC
p = PSeAAC()
# The first string is what "ACXD" * 5 becomes after cleaning; the second has no
# spurious Asparagine. The two vectors differ, so the replacement changes the features.
vec_buggy = p.transform("ACNDACNDACNDACNDACND")  # contains false Asparagine
vec_clean = p.transform("ACDACDACDACDACDACDAC")  # without corruption
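
The effect can also be seen without pyaptamer installed. The following is a minimal standalone sketch that compares residue composition under the two strategies. The helpers `replace_invalid`, `remove_invalid`, and `composition` are hypothetical names that only mimic the behaviour described in this issue; they are not the library's code, and the real function may differ (for example in case handling).

```python
from collections import Counter

# Same alphabet as quoted in this issue.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")


def replace_invalid(seq: str) -> str:
    """Hypothetical stand-in for the current behaviour: unknown -> "N"."""
    return "".join(c if c in AMINO_ACIDS else "N" for c in seq)


def remove_invalid(seq: str) -> str:
    """Hypothetical stand-in for the proposed behaviour: drop unknowns."""
    return "".join(c for c in seq if c in AMINO_ACIDS)


def composition(seq: str) -> dict:
    """Relative frequency of each residue that occurs in seq."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in sorted(counts)}


raw = "ACXD" * 5
replaced = replace_invalid(raw)
removed = remove_invalid(raw)

print(composition(replaced))  # {'A': 0.25, 'C': 0.25, 'D': 0.25, 'N': 0.25}
print(composition(removed))   # no 'N' key: A, C and D each make up one third
```

With replacement, a quarter of the sequence is reported as Asparagine even though the input contained none.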

Expected behavior

Invalid characters should be removed (filtered out) from the sequence rather than replaced with a valid amino acid. The existing UserWarning should be kept to inform the user about the removal.

Note: the aa_str_to_letter() utility in the same package already correctly uses "X" for unknown amino acid codes, showing an inconsistency in how unknowns are handled.
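
A minimal sketch of the expected behaviour, assuming the fix simply filters characters against AMINO_ACIDS and keeps a UserWarning. The function name `clean_protein_seq_sketch` and the warning text are hypothetical; this is not the actual patch.

```python
import warnings

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
_VALID = frozenset(AMINO_ACIDS)


def clean_protein_seq_sketch(seq: str) -> str:
    """Drop characters outside the 20 standard residues.

    Hypothetical sketch: emits a UserWarning listing the removed
    characters instead of substituting a valid residue for them.
    """
    invalid = sorted({c for c in seq if c not in _VALID})
    if invalid:
        warnings.warn(
            f"Removed invalid amino acid characters: {invalid}",
            UserWarning,
            stacklevel=2,
        )
    return "".join(c for c in seq if c in _VALID)


# Capture the warning so the example shows both effects.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = clean_protein_seq_sketch("ACXD")

print(cleaned)      # ACD
print(len(caught))  # 1
```

Removal shortens the sequence, so any downstream length check would apply to the cleaned string rather than the raw input.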

Additional context

  • AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY"); N (Asparagine) is at index 11 (0-based)
  • The bug affects all PSeAAC feature computations when input contains non-standard residues (e.g., X, B, Z, digits, whitespace)
  • All 144 existing tests pass after applying the fix

Versions

pyaptamer 0.1.0a1
