[ENH] Unify RNA tokenizer core and refactor encode_rna & seq2vec by SiddhantGahankari · Pull Request #291 · gc-os-ai/pyaptamer

SiddhantGahankari · 2026-04-04T23:33:25Z

Reference Issues/PRs

Closes #282 .

What does this implement/fix? Explain your changes.

This PR improves sequence vectorization and completes tokenizer unification for the current scope.

Added a shared internal greedy tokenizer core in pyaptamer/utils/_rna.py.
Refactored encode_rna in pyaptamer/utils/_rna.py to use the shared core.
Refactored seq2vec in pyaptamer/utils/_aptatrans_utils.py to use the same shared core with span-aware secondary-structure alignment.
Kept public APIs stable for encode_rna and seq2vec.
Preserved behavior contracts for unknown handling, truncation/chunking, and zero-padding.
Added shared-core tests in pyaptamer/utils/tests/test_rna.py and kept parity/edge-case tests in pyaptamer/utils/tests/test_seq2vec.py.

What should a reviewer concentrate their feedback on?

Behavior parity of encode_rna and seq2vec after tokenizer unification.
Correctness of greedy longest-first matching and unknown-token policy handling.
Correctness of span-based secondary-structure alignment in seq2vec.
Whether shared-core placement in pyaptamer/utils/_rna.py is appropriate for follow-up reuse.

Did you add any tests for the change?

Yes.

Added/updated tests include:

pyaptamer/utils/tests/test_rna.py:
- append-zero policy behavior
- skip policy behavior
- chunking with span alignment
- invalid chunk-size guard
- tokenize-to-chunks combined behavior
pyaptamer/utils/tests/test_seq2vec.py:
- legacy parity check
- empty-vocabulary guard
- unknown-region skipping behavior
- shorter-than-pattern edge case
- chunking and padding behavior

Any other comments?

This PR completes tokenizer unification for encode_rna and seq2vec.
Benchmark and lockfile artifacts are intentionally excluded from this PR.

PR checklist

The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
Added/modified tests
Used pre-commit hooks when committing to ensure that code is compliant with hooks. Install hooks with pre-commit install.
To run hooks independent of commit, execute pre-commit run --all-files

satvshr · 2026-04-05T06:53:24Z

Thanks for the PR! I do not see where the said generalization is taking place though, could you point it out?

SiddhantGahankari · 2026-04-05T09:50:22Z

At the moment, the generalization in this PR is only partial: I extracted shared internals into _rna.py and reused them from _aptatrans_utils.py to remove duplicated greedy-pattern and padding logic.
Given your feedback, I will extend this same PR to do full unification by introducing one shared greedy tokenizer core in _rna.py then making both encode_rna and seq2vec use it while keeping public APIs stable.(was thinking of doing this in another PR)

satvshr · 2026-04-05T10:23:58Z

I do not think the increase in speed (1.76x) warrants such a huge change, I was mostly interested in the generalization. Limiting this PR to optimization only makes sense, but I do not know if we need it, though the TODO was not written by me so @NennoMP would know better.

NennoMP · 2026-04-06T12:36:01Z

The TODO comment was mostly a placeholder/suggestion, there isn't a fully-fledged out issue about the optimization aspect.

I do not see any empirical evidence of such speedup, is it shown anywhere? Anyway, I agree with @satvshr that a speedup of 1.76x does not warrant such a huge change. I think we do not need this right now.

SiddhantGahankari · 2026-04-06T12:48:17Z

Understood, thanks for the context @NennoMP The benchmark was run on train_li2014 (2,320 sequences, 10 iterations) and showed a 1.76x median speedup with much tighter variance, happy to share the full output if useful. That said, I agree with both of you that it doesn't justify the structural change at this scale.That said ,Should I close this PR, or is the generalization angle still worth pursuing in a more focused form?

SiddhantGahankari · 2026-04-07T17:38:40Z

@satvshr I have done the generalization , If any issues found please let me know.

SiddhantGahankari · 2026-04-12T09:58:22Z

@satvshr Is this PR good to be merged or do you suggest any changes?

satvshr · 2026-04-12T18:07:15Z

@NennoMP for you to have a look if you think it is required, I still believe if so many changes are needed the codebase is better left untouched

siddharth7113 · 2026-04-12T18:45:15Z

Hi @SiddhantGahankari , to make this review easy , I would request you to add docstrings, since this is adding multiple new functions.

SiddhantGahankari · 2026-04-13T11:15:32Z

Hi @SiddhantGahankari , to make this review easy , I would request you to add docstrings, since this is adding multiple new functions.

Sure @siddharth7113 , I will do so right away.

SiddhantGahankari · 2026-04-15T13:44:46Z

@siddharth7113 Hi , I have already added the required docstrings to the functions , Can you review it once and tell me whether any changes are required to this or it is ready to be merged?

satvshr · 2026-04-15T17:30:34Z

@NennoMP for you to have a look if you think it is required, I still believe if so many changes are needed the codebase is better left untouched

Was my take on it

siddharth7113 · 2026-04-20T12:57:53Z

Hi @SiddhantGahankari,

At this time , this PR is making a lot of file changes and this would be a premature optimisation, especially given the API is not stable, thank you for putting the effort but at this moment, I am closing the PR and the issue.

SiddhantGahankari · 2026-04-20T13:19:42Z

@siddharth7113
Got it, thanks for the feedback. I understand the concern about the scope and the API stability. I’ll hold off on further changes for now.Appreciate you taking the time to review this.Let me know if you'd like me work on any other issue.

SiddhantGahankari added 3 commits April 5, 2026 03:55

feat(utils): optimize seq2vec with regex-based tokenization

0db4b40

test(utils): add seq2vec regression and edge-case tests

f5943ce

extract shared seq2vec tokenizer helpers into rna utils

e79f989

SiddhantGahankari marked this pull request as draft April 7, 2026 14:38

SiddhantGahankari added 2 commits April 7, 2026 20:19

refactor(utils): add shared tokenizer core and route encode_rna

c0292d5

refactor(utils): route seq2vec through shared tokenizer core

e3f0da6

SiddhantGahankari marked this pull request as ready for review April 7, 2026 17:34

SiddhantGahankari changed the title ~~[ENH] Optimize seq2vec and extract shared greedy tokenization helpers~~ [ENH] Unify RNA tokenizer core and refactor encode_rna & seq2vec Apr 7, 2026

siddharth7113 requested review from NennoMP, satvshr and siddharth7113 April 12, 2026 18:43

added docstrings for functions

587b4de

siddharth7113 closed this Apr 20, 2026

Uh oh!

Conversation

SiddhantGahankari commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reference Issues/PRs

What does this implement/fix? Explain your changes.

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

Any other comments?

PR checklist

Uh oh!

satvshr commented Apr 5, 2026

Uh oh!

SiddhantGahankari commented Apr 5, 2026

Uh oh!

satvshr commented Apr 5, 2026

Uh oh!

NennoMP commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SiddhantGahankari commented Apr 6, 2026

Uh oh!

SiddhantGahankari commented Apr 7, 2026

Uh oh!

SiddhantGahankari commented Apr 12, 2026

Uh oh!

satvshr commented Apr 12, 2026

Uh oh!

siddharth7113 commented Apr 12, 2026

Uh oh!

SiddhantGahankari commented Apr 13, 2026

Uh oh!

SiddhantGahankari commented Apr 15, 2026

Uh oh!

satvshr commented Apr 15, 2026

Uh oh!

siddharth7113 commented Apr 20, 2026

Uh oh!

SiddhantGahankari commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

SiddhantGahankari commented Apr 4, 2026 •

edited

Loading

NennoMP commented Apr 6, 2026 •

edited

Loading