Description
KMerEncoder silently ignores "U" (uracil) nucleotides when encoding RNA sequences. This happens because the encoder's base alphabet is hardcoded to DNA bases (DNA_BASES = list("ACGT")).
When processing an RNA sequence, any k-mer containing "U" is silently skipped because the membership check (if kmer in kmer_counts:) fails. This causes two major issues:
- Silent feature loss: Every k-mer containing "U" is omitted from the resulting k-mer representation.
- Frequency distortion (inflation): The normalization denominator (total_kmers = sum(kmer_counts.values())) is artificially small because it excludes the skipped k-mers. With k=1, this inflates the frequencies of "A", "C", and "G".
This issue also affects the official user guide example in docs/source/user_guide/aptanet.md, which passes RNA sequence strings containing "U" (e.g., "GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA") to AptaNetPipeline.
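The mechanism described above can be illustrated without pyaptamer. The following is a minimal standalone sketch, not the library's actual code: the function name kmer_frequencies and the zero-total guard are assumptions made for illustration, while DNA_BASES, kmer_counts, and total_kmers mirror the names quoted in this issue.

```python
from itertools import product

DNA_BASES = list("ACGT")  # hardcoded DNA alphabet, as quoted in the issue


def kmer_frequencies(seq, k=1, bases=DNA_BASES):
    """Hypothetical re-creation of the counting logic described above."""
    # One counter per possible k-mer over the fixed alphabet.
    kmer_counts = {"".join(p): 0 for p in product(bases, repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in kmer_counts:  # k-mers containing "U" fail this check
            kmer_counts[kmer] += 1
    # The denominator only counts k-mers that passed the check.
    total_kmers = sum(kmer_counts.values())
    if total_kmers == 0:  # assumed guard so an all-"U" input yields zeros
        return {kmer: 0.0 for kmer in kmer_counts}
    return {kmer: n / total_kmers for kmer, n in kmer_counts.items()}


print(kmer_frequencies("ACGT"))  # every base counted
print(kmer_frequencies("ACGU"))  # "U" skipped, denominator is 3 instead of 4
print(kmer_frequencies("UUUU"))  # nothing counted
```

Because "U" never reaches the counter, the denominator for "ACGU" is 3 rather than 4, which is where the inflated values come from.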
Reproducible Example
You can reproduce this behavior with the following script:
import pandas as pd
from pyaptamer.trafos.encode._kmer import KMerEncoder
# DNA sequence "ACGT" (Works perfectly)
df_dna = pd.DataFrame(["ACGT"], columns=["Sequence"])
enc = KMerEncoder(k=1)
res_dna = enc.fit_transform(df_dna)
res_dna.columns = ["A", "C", "G", "T"]
print("DNA Encoding:")
print(pd.concat([df_dna, res_dna], axis=1))
# RNA sequence "ACGU" (Ignored U, inflated frequencies)
df_rna = pd.DataFrame(["ACGU"], columns=["Sequence"])
res_rna = enc.fit_transform(df_rna)
res_rna.columns = ["A", "C", "G", "T"]
print("\nRNA Encoding:")
print(pd.concat([df_rna, res_rna], axis=1))
# All-U sequence "UUUU" (Returns all zeros)
df_all_u = pd.DataFrame(["UUUU"], columns=["Sequence"])
res_all_u = enc.fit_transform(df_all_u)
res_all_u.columns = ["A", "C", "G", "T"]
print("\nAll-U Encoding:")
print(pd.concat([df_all_u, res_all_u], axis=1))
Output:
DNA Encoding:
Sequence A C G T
0 ACGT 0.25 0.25 0.25 0.25
RNA Encoding:
Sequence A C G T
0 ACGU 0.333333 0.333333 0.333333 0.0
All-U Encoding:
Sequence A C G T
0 UUUU 0.0 0.0 0.0 0.0
Proposed Solution
- Auto-inferred alphabet by default: Set the KMerEncoder tag "property:fit_is_empty" to False and implement _fit(self, X, y=None) to extract the unique characters from the training sequences and store them as the fitted attribute self.alphabet_.
- Expose an alphabet parameter: Add an optional alphabet: list[str] | str | None = None parameter to __init__ so users can override the default auto-inference with a custom alphabet.
- Propagate: Expose the alphabet parameter on AptaNetPipeline and AptaNetFeatureExtractor and pass it down to KMerEncoder.
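To make the first two points concrete, here is a rough standalone sketch of how the inference and the override could interact. It does not use pyaptamer's base classes or tags; the class name KMerSketch, its fit/transform signatures, and the sorted ordering of the inferred alphabet are all assumptions for illustration only.

```python
from itertools import product


class KMerSketch:
    """Hypothetical stand-in for the proposal, not the real KMerEncoder."""

    def __init__(self, k=1, alphabet=None):
        self.k = k
        self.alphabet = alphabet  # None means: infer during fit

    def fit(self, sequences):
        if self.alphabet is None:
            # Infer the alphabet from every character seen in training data.
            # Sorting is an assumption to keep the column order deterministic.
            self.alphabet_ = sorted(set("".join(sequences)))
        else:
            # A str or list[str] override is used as given.
            self.alphabet_ = list(self.alphabet)
        return self

    def transform(self, sequences):
        kmers = ["".join(p) for p in product(self.alphabet_, repeat=self.k)]
        rows = []
        for seq in sequences:
            counts = dict.fromkeys(kmers, 0)
            for i in range(len(seq) - self.k + 1):
                kmer = seq[i:i + self.k]
                if kmer in counts:
                    counts[kmer] += 1
            total = sum(counts.values())
            rows.append([counts[m] / total if total else 0.0 for m in kmers])
        return kmers, rows


# Inferred alphabet: "U" gets its own column.
kmers, rows = KMerSketch(k=1).fit(["ACGU"]).transform(["ACGU"])
print(kmers, rows)

# Explicit override: fixed RNA alphabet regardless of the training data.
kmers, rows = KMerSketch(k=1, alphabet="ACGU").fit(["AAAA"]).transform(["UUUU"])
print(kmers, rows)
```

One design point worth discussing: with inference, the number of output columns depends on the characters present at fit time, and characters first seen at transform time would still be skipped. An explicit alphabet avoids both effects.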
Cc: @fkiraly @siddharth7113 for review and feedback