[BUG] KMerEncoder ignores RNA 'U' nucleotides and distorts frequencies #696

### Description

`KMerEncoder` silently ignores `"U"` (uracil) nucleotides when encoding RNA sequences. This happens because the encoder's base alphabet is hardcoded to DNA bases (`DNA_BASES = list("ACGT")`). 

When an RNA sequence is processed, every k-mer containing `"U"` is skipped without warning because the membership check `if kmer in kmer_counts:` fails for it. This causes two major issues:
1. **Silent feature loss:** Every k-mer that contains `"U"` is omitted from the resulting k-mer representation.
2. **Frequency distortion (inflation):** The normalization denominator (`total_kmers = sum(kmer_counts.values())`) counts only the k-mers that passed the membership check, so it is smaller than the true number of k-mers in the sequence. The frequencies of the remaining k-mers are inflated as a result: for `k=1`, `"ACGU"` yields 1/3 each for `"A"`, `"C"`, and `"G"` instead of 1/4.
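
The arithmetic behind both issues can be checked without the library. The snippet below is a minimal standalone sketch of the count-then-normalize logic described above. It is not the actual `KMerEncoder` source: the function name `kmer_frequencies` is hypothetical, and the zero-total guard is an assumption made so that the all-`"U"` case returns zeros, as it does in the observed output.

```python
from itertools import product


def kmer_frequencies(seq, k=1, alphabet="ACGT"):
    """Hypothetical stand-in for the counting logic described in this issue."""
    # One counter per possible k-mer over the fixed (DNA-only) alphabet.
    kmer_counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}

    for i in range(len(seq) - k + 1):
        kmer = seq[i : i + k]
        # K-mers containing "U" are not keys, so they are skipped here.
        if kmer in kmer_counts:
            kmer_counts[kmer] += 1

    # The denominator only sees the k-mers that survived the check above.
    total_kmers = sum(kmer_counts.values())
    if total_kmers == 0:  # assumed guard, e.g. for an all-"U" sequence
        return {kmer: 0.0 for kmer in kmer_counts}
    return {kmer: count / total_kmers for kmer, count in kmer_counts.items()}


dna = kmer_frequencies("ACGT")  # 4 counted k-mers, denominator 4
rna = kmer_frequencies("ACGU")  # 3 counted k-mers, denominator 3
print(dna)
print(rna)
```

With `"ACGU"`, only three of the four 1-mers are counted, so each surviving frequency is divided by 3 rather than 4.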

This issue also affects the official user guide example in `docs/source/user_guide/aptanet.md`, which passes RNA sequence strings containing `"U"` (e.g., `"GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA"`) to `AptaNetPipeline`.

---

### Reproducible Example

You can reproduce this behavior with the following script:

```python
import pandas as pd
from pyaptamer.trafos.encode._kmer import KMerEncoder

# DNA sequence "ACGT" (Works perfectly)
df_dna = pd.DataFrame(["ACGT"], columns=["Sequence"])
enc = KMerEncoder(k=1)
res_dna = enc.fit_transform(df_dna)
res_dna.columns = ["A", "C", "G", "T"]
print("DNA Encoding:")
print(pd.concat([df_dna, res_dna], axis=1))

# RNA sequence "ACGU" (Ignored U, inflated frequencies)
df_rna = pd.DataFrame(["ACGU"], columns=["Sequence"])
res_rna = enc.fit_transform(df_rna)
res_rna.columns = ["A", "C", "G", "T"]
print("\nRNA Encoding:")
print(pd.concat([df_rna, res_rna], axis=1))

# All-U sequence "UUUU" (Returns all zeros)
df_all_u = pd.DataFrame(["UUUU"], columns=["Sequence"])
res_all_u = enc.fit_transform(df_all_u)
res_all_u.columns = ["A", "C", "G", "T"]
print("\nAll-U Encoding:")
print(pd.concat([df_all_u, res_all_u], axis=1))
```

#### Output:
```text
DNA Encoding:
  Sequence     A     C     G     T
0     ACGT  0.25  0.25  0.25  0.25

RNA Encoding:
  Sequence         A         C         G    T
0     ACGU  0.333333  0.333333  0.333333  0.0

All-U Encoding:
  Sequence    A    C    G    T
0     UUUU  0.0  0.0  0.0  0.0
```

---

### Proposed Solution

1. **Auto-Inferred Alphabet by Default:** Update `KMerEncoder` tag `"property:fit_is_empty"` to `False` and implement `_fit(self, X, y=None)` to automatically extract unique characters from the training sequences and store them as a fitted attribute `self.alphabet_`.
2. **Expose `alphabet` parameter:** Add an optional `alphabet: list[str] | str | None = None` parameter in `__init__` to allow users to override the default auto-inference with a custom alphabet.
3. **Propagate:** Expose and pass this `alphabet` parameter from `AptaNetPipeline` and `AptaNetFeatureExtractor` down to `KMerEncoder`.
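
To make points 1 and 2 concrete, here is a minimal sketch of how an inferred-or-overridden alphabet could behave. Everything in it is illustrative: `SketchKMerEncoder` is a hypothetical plain-Python class, not `pyaptamer`'s `KMerEncoder`. It works on lists of strings instead of DataFrames and leaves out the estimator base class and tag handling.

```python
from itertools import product


class SketchKMerEncoder:
    """Hypothetical illustration of the proposal; not pyaptamer's KMerEncoder."""

    def __init__(self, k=1, alphabet=None):
        self.k = k
        self.alphabet = alphabet  # None means "infer during fit"

    def fit(self, sequences):
        if self.alphabet is None:
            # Infer the alphabet from the training data; sorting keeps the
            # output column order deterministic.
            self.alphabet_ = sorted(set("".join(sequences)))
        else:
            # Accept either a string ("ACGU") or a list of characters.
            self.alphabet_ = list(self.alphabet)
        self.kmers_ = ["".join(p) for p in product(self.alphabet_, repeat=self.k)]
        return self

    def transform(self, sequences):
        rows = []
        for seq in sequences:
            counts = dict.fromkeys(self.kmers_, 0)
            for i in range(len(seq) - self.k + 1):
                kmer = seq[i : i + self.k]
                if kmer in counts:
                    counts[kmer] += 1
            total = sum(counts.values())
            rows.append([counts[m] / total if total else 0.0 for m in self.kmers_])
        return rows


# Alphabet inferred from the data: "U" becomes a regular feature column.
inferred = SketchKMerEncoder(k=1).fit(["ACGU"])
print(inferred.alphabet_)  # ['A', 'C', 'G', 'U']
print(inferred.transform(["ACGU"]))

# Explicit override: the alphabet does not depend on the training data.
explicit = SketchKMerEncoder(k=1, alphabet="ACGU").fit(["UUUU"])
print(explicit.transform(["UUUU"]))
```

One consequence worth discussing: with auto-inference the number of output columns depends on the characters seen during `fit`, so the explicit `alphabet` override is the safer choice when a fixed feature layout is needed across datasets.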

Cc: @fkiraly @siddharth7113 for review and feedback

