[BUG] KMerEncoder ignores RNA 'U' nucleotides and distorts frequencies #696

Description

@purvanshjoshi

KMerEncoder silently ignores "U" (uracil) nucleotides when encoding RNA sequences. This happens because the encoder's base alphabet is hardcoded to DNA bases (DNA_BASES = list("ACGT")).

When processing an RNA sequence, any k-mer containing "U" is silently skipped, because the membership check (if kmer in kmer_counts:) fails for it. This causes two major issues:

  1. Silent feature loss: Every k-mer containing "U" is dropped from the resulting k-mer representation.
  2. Frequency distortion (inflation): The normalization denominator (total_kmers = sum(kmer_counts.values())) only counts k-mers that passed the membership check, so it is too small whenever "U" is present. The frequencies of the remaining k-mers are inflated as a result. For k=1 and the sequence "ACGU", "A", "C", and "G" each come out as 1/3 instead of 1/4.
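
To make the mechanism concrete, here is a minimal standalone sketch of the counting logic described above. It is not the pyaptamer source: the function name kmer_frequencies and the zero-total fallback are assumptions made for illustration, and only DNA_BASES, kmer_counts, and total_kmers are names quoted from the encoder.

```python
from itertools import product

DNA_BASES = list("ACGT")  # hardcoded alphabet, as quoted in this issue


def kmer_frequencies(seq, k=1, alphabet=DNA_BASES):
    """Hypothetical stand-in for the encoder's counting step (not the real code)."""
    # One counter per possible k-mer over the fixed alphabet.
    kmer_counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in kmer_counts:  # any k-mer containing "U" fails this check
            kmer_counts[kmer] += 1
    # The denominator only sees k-mers that were counted, so skipped
    # k-mers shrink it and inflate every remaining frequency.
    total_kmers = sum(kmer_counts.values())
    if total_kmers == 0:  # assumed fallback, chosen to match the all-zero output
        return {kmer: 0.0 for kmer in kmer_counts}
    return {kmer: count / total_kmers for kmer, count in kmer_counts.items()}


dna = kmer_frequencies("ACGT")
rna = kmer_frequencies("ACGU")
all_u = kmer_frequencies("UUUU")
print(dna)
print(rna)
print(all_u)
```

Under these assumptions, "ACGU" yields 1/3 for "A", "C", and "G" and 0 for "T", and "UUUU" yields all zeros, which matches the output reported in the reproducible example.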

This issue also affects the official user guide example in docs/source/user_guide/aptanet.md, which passes RNA sequence strings containing "U" (e.g., "GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA") to AptaNetPipeline.


Reproducible Example

You can reproduce this behavior with the following script:

import pandas as pd
from pyaptamer.trafos.encode._kmer import KMerEncoder

# DNA sequence "ACGT" (Works perfectly)
df_dna = pd.DataFrame(["ACGT"], columns=["Sequence"])
enc = KMerEncoder(k=1)
res_dna = enc.fit_transform(df_dna)
res_dna.columns = ["A", "C", "G", "T"]
print("DNA Encoding:")
print(pd.concat([df_dna, res_dna], axis=1))

# RNA sequence "ACGU" (Ignored U, inflated frequencies)
df_rna = pd.DataFrame(["ACGU"], columns=["Sequence"])
res_rna = enc.fit_transform(df_rna)
res_rna.columns = ["A", "C", "G", "T"]
print("\nRNA Encoding:")
print(pd.concat([df_rna, res_rna], axis=1))

# All-U sequence "UUUU" (Returns all zeros)
df_all_u = pd.DataFrame(["UUUU"], columns=["Sequence"])
res_all_u = enc.fit_transform(df_all_u)
res_all_u.columns = ["A", "C", "G", "T"]
print("\nAll-U Encoding:")
print(pd.concat([df_all_u, res_all_u], axis=1))

Output:

DNA Encoding:
  Sequence     A     C     G     T
0     ACGT  0.25  0.25  0.25  0.25

RNA Encoding:
  Sequence         A         C         G    T
0     ACGU  0.333333  0.333333  0.333333  0.0

All-U Encoding:
  Sequence    A    C    G    T
0     UUUU  0.0  0.0  0.0  0.0

Proposed Solution

  1. Auto-inferred alphabet by default: Set the KMerEncoder tag "property:fit_is_empty" to False and implement _fit(self, X, y=None) so that it extracts the unique characters from the training sequences and stores them as a fitted attribute self.alphabet_.
  2. Expose an alphabet parameter: Add an optional alphabet: list[str] | str | None = None parameter to __init__ so that users can override the default auto-inference with a custom alphabet.
  3. Propagate: Expose the alphabet parameter on AptaNetPipeline and AptaNetFeatureExtractor and pass it down to KMerEncoder.
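
The inference step in point 1 could look roughly like the sketch below. It is written as plain functions rather than as the actual KMerEncoder._fit, and the helper names infer_alphabet and kmer_frequencies_with are hypothetical. Sorting the inferred characters is also an assumption, made so that the column order is deterministic.

```python
from itertools import product


def infer_alphabet(sequences):
    """Hypothetical helper: sorted unique characters seen in the training data."""
    return sorted(set("".join(sequences)))


def kmer_frequencies_with(seq, k, alphabet):
    """Same counting scheme as the encoder is described to use, but with a given alphabet."""
    kmer_counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in kmer_counts:
            kmer_counts[kmer] += 1
    total_kmers = sum(kmer_counts.values())
    return {
        kmer: (count / total_kmers if total_kmers else 0.0)
        for kmer, count in kmer_counts.items()
    }


train = ["ACGU", "UUUU"]
alphabet_ = infer_alphabet(train)  # plays the role of self.alphabet_
rna = kmer_frequencies_with("ACGU", 1, alphabet_)
all_u = kmer_frequencies_with("UUUU", 1, alphabet_)
print(alphabet_)
print(rna)
print(all_u)
```

One design point to weigh: with auto-inference the feature columns depend on the training data, so a dataset that mixes "T" and "U" would produce a five-letter alphabet, and characters first seen at transform time would still be skipped. The explicit alphabet parameter in point 2 covers cases where a fixed feature layout is needed.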

Cc: @fkiraly @siddharth7113 for review and feedback
