Skip to content

[ENH] Unify RNA tokenizer core and refactor encode_rna & seq2vec#291

Closed
SiddhantGahankari wants to merge 6 commits into
gc-os-ai:mainfrom
SiddhantGahankari:main
Closed

[ENH] Unify RNA tokenizer core and refactor encode_rna & seq2vec#291
SiddhantGahankari wants to merge 6 commits into
gc-os-ai:mainfrom
SiddhantGahankari:main

Conversation

@SiddhantGahankari

@SiddhantGahankari SiddhantGahankari commented Apr 4, 2026

Copy link
Copy Markdown

Reference Issues/PRs

Closes #282 .

What does this implement/fix? Explain your changes.

This PR improves sequence vectorization and completes tokenizer unification for the current scope.

  • Added a shared internal greedy tokenizer core in pyaptamer/utils/_rna.py.
  • Refactored encode_rna in pyaptamer/utils/_rna.py to use the shared core.
  • Refactored seq2vec in pyaptamer/utils/_aptatrans_utils.py to use the same shared core with span-aware secondary-structure alignment.
  • Kept public APIs stable for encode_rna and seq2vec.
  • Preserved behavior contracts for unknown handling, truncation/chunking, and zero-padding.
  • Added shared-core tests in pyaptamer/utils/tests/test_rna.py and kept parity/edge-case tests in pyaptamer/utils/tests/test_seq2vec.py.

What should a reviewer concentrate their feedback on?

  • Behavior parity of encode_rna and seq2vec after tokenizer unification.
  • Correctness of greedy longest-first matching and unknown-token policy handling.
  • Correctness of span-based secondary-structure alignment in seq2vec.
  • Whether shared-core placement in pyaptamer/utils/_rna.py is appropriate for follow-up reuse.

Did you add any tests for the change?

Yes.

Added/updated tests include:

  • pyaptamer/utils/tests/test_rna.py:
    • append-zero policy behavior
    • skip policy behavior
    • chunking with span alignment
    • invalid chunk-size guard
    • tokenize-to-chunks combined behavior
  • pyaptamer/utils/tests/test_seq2vec.py:
    • legacy parity check
    • empty-vocabulary guard
    • unknown-region skipping behavior
    • shorter-than-pattern edge case
    • chunking and padding behavior

Any other comments?

  • This PR completes tokenizer unification for encode_rna and seq2vec.
  • Benchmark and lockfile artifacts are intentionally excluded from this PR.

PR checklist

  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
  • Added/modified tests
  • Used pre-commit hooks when committing to ensure that code is compliant with hooks. Install hooks with pre-commit install.
    To run hooks independent of commit, execute pre-commit run --all-files

@satvshr

satvshr commented Apr 5, 2026

Copy link
Copy Markdown
Collaborator

Thanks for the PR! I do not see where the said generalization is taking place though, could you point it out?

@SiddhantGahankari

Copy link
Copy Markdown
Author

At the moment, the generalization in this PR is only partial: I extracted shared internals into _rna.py and reused them from _aptatrans_utils.py to remove duplicated greedy-pattern and padding logic.
Given your feedback, I will extend this same PR to do full unification by introducing one shared greedy tokenizer core in _rna.py then making both encode_rna and seq2vec use it while keeping public APIs stable.(was thinking of doing this in another PR)

@satvshr

satvshr commented Apr 5, 2026

Copy link
Copy Markdown
Collaborator

I do not think the increase in speed (1.76x) warrants such a huge change, I was mostly interested in the generalization. Limiting this PR to optimization only makes sense, but I do not know if we need it, though the TODO was not written by me so @NennoMP would know better.

@NennoMP

NennoMP commented Apr 6, 2026

Copy link
Copy Markdown
Collaborator

The TODO comment was mostly a placeholder/suggestion, there isn't a fully-fledged out issue about the optimization aspect.

I do not see any empirical evidence of such speedup, is it shown anywhere? Anyway, I agree with @satvshr that a speedup of 1.76x does not warrant such a huge change. I think we do not need this right now.

@SiddhantGahankari

Copy link
Copy Markdown
Author

Understood, thanks for the context @NennoMP The benchmark was run on train_li2014 (2,320 sequences, 10 iterations) and showed a 1.76x median speedup with much tighter variance, happy to share the full output if useful. That said, I agree with both of you that it doesn't justify the structural change at this scale.That said ,Should I close this PR, or is the generalization angle still worth pursuing in a more focused form?

@SiddhantGahankari SiddhantGahankari marked this pull request as draft April 7, 2026 14:38
@SiddhantGahankari SiddhantGahankari marked this pull request as ready for review April 7, 2026 17:34
@SiddhantGahankari SiddhantGahankari changed the title [ENH] Optimize seq2vec and extract shared greedy tokenization helpers [ENH] Unify RNA tokenizer core and refactor encode_rna & seq2vec Apr 7, 2026
@SiddhantGahankari

Copy link
Copy Markdown
Author

@satvshr I have done the generalization , If any issues found please let me know.

@SiddhantGahankari

Copy link
Copy Markdown
Author

@satvshr Is this PR good to be merged or do you suggest any changes?

@satvshr

satvshr commented Apr 12, 2026

Copy link
Copy Markdown
Collaborator

@NennoMP for you to have a look if you think it is required, I still believe if so many changes are needed the codebase is better left untouched

@siddharth7113

Copy link
Copy Markdown
Collaborator

Hi @SiddhantGahankari , to make this review easy , I would request you to add docstrings, since this is adding multiple new functions.

@SiddhantGahankari

Copy link
Copy Markdown
Author

Hi @SiddhantGahankari , to make this review easy , I would request you to add docstrings, since this is adding multiple new functions.

Sure @siddharth7113 , I will do so right away.

@SiddhantGahankari

Copy link
Copy Markdown
Author

@siddharth7113 Hi , I have already added the required docstrings to the functions , Can you review it once and tell me whether any changes are required to this or it is ready to be merged?

@satvshr

satvshr commented Apr 15, 2026

Copy link
Copy Markdown
Collaborator

@NennoMP for you to have a look if you think it is required, I still believe if so many changes are needed the codebase is better left untouched

Was my take on it

@siddharth7113

Copy link
Copy Markdown
Collaborator

Hi @SiddhantGahankari,

At this time , this PR is making a lot of file changes and this would be a premature optimisation, especially given the API is not stable, thank you for putting the effort but at this moment, I am closing the PR and the issue.

@SiddhantGahankari

Copy link
Copy Markdown
Author

@siddharth7113
Got it, thanks for the feedback. I understand the concern about the scope and the API stability. I’ll hold off on further changes for now.Appreciate you taking the time to review this.Let me know if you'd like me work on any other issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ENH] Optimize and generalize RNA sequence vectorization (rna2vec) in _aptatrans_utils

4 participants