Describe the bug
clean_protein_seq() in pyaptamer/utils/_pseaac_utils.py replaces invalid amino acid characters with "N". However, "N" is the IUPAC one-letter code for asparagine, a real amino acid that is already present in the AMINO_ACIDS list.
This silently corrupts protein sequences by turning unknown characters into a valid amino acid, producing incorrect PSeAAC feature vectors downstream for both PSeAAC and AptaNetPSeAAC.
To Reproduce
from pyaptamer.utils._pseaac_utils import clean_protein_seq, AMINO_ACIDS
# "N" is asparagine, a valid amino acid
print("N" in AMINO_ACIDS) # True
# Input sequence with invalid character "X"
result = clean_protein_seq("ACXD")
print(result) # "ACND": X was silently replaced with asparagine
# This corrupts PSeAAC feature vectors:
from pyaptamer.pseaac import PSeAAC
p = PSeAAC()
# These produce different vectors, because each substituted "N" is counted as a real asparagine
vec_buggy = p.transform("ACNDACNDACNDACNDACND") # contains false Asparagine
vec_clean = p.transform("ACDACDACDACDACDACDAC") # without corruption
Expected behavior
Invalid characters should be removed (filtered out) from the sequence rather than replaced with a valid amino acid. The existing UserWarning should be kept to inform the user about the removal.
Note: the aa_str_to_letter() utility in the same package already correctly uses "X" for unknown amino acid codes, showing an inconsistency in how unknowns are handled.
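The expected behavior could be sketched roughly as follows. This is a minimal illustration, not the actual pyaptamer implementation: the function name clean_protein_seq_filtered is hypothetical, the warning text is made up, and AMINO_ACIDS is redefined locally to match the constant quoted in this report. Case handling is deliberately left out, so lowercase letters would be dropped as well.

```python
import warnings

# Assumption: mirrors the AMINO_ACIDS constant quoted in this report.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_seq_filtered(seq: str) -> str:
    """Hypothetical variant that drops invalid characters instead of replacing them."""
    valid = set(AMINO_ACIDS)
    # Collect the characters that are not standard amino acid codes.
    removed = [ch for ch in seq if ch not in valid]
    if removed:
        # Keep a UserWarning so the caller still learns the input was altered.
        warnings.warn(
            f"Removed {len(removed)} invalid character(s) from the sequence: "
            f"{sorted(set(removed))}",
            UserWarning,
            stacklevel=2,
        )
    # Preserve the order of the remaining valid residues.
    return "".join(ch for ch in seq if ch in valid)
```

With this sketch, clean_protein_seq_filtered("ACXD") returns "ACD" and emits a single UserWarning, so no spurious asparagine enters the sequence.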
Additional context
- AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY"); N (asparagine) is at index 11 (0-based)
- The bug affects all PSeAAC feature computations when the input contains non-standard residues (e.g., X, B, Z, digits, whitespace)
- All 144 existing tests pass after applying the fix
Versions
Details