[ENH] Unify RNA tokenizer core and refactor encode_rna & seq2vec #291
SiddhantGahankari wants to merge 6 commits into
Conversation
|
Thanks for the PR! I do not see where the said generalization is taking place though, could you point it out? |
|
At the moment, the generalization in this PR is only partial: I extracted shared internals into |
|
I do not think the increase in speed (1.76x) warrants such a huge change, I was mostly interested in the generalization. Limiting this PR to optimization only makes sense, but I do not know if we need it, though the TODO was not written by me so @NennoMP would know better. |
|
The TODO comment was mostly a placeholder/suggestion; there isn't a fully fleshed-out issue about the optimization aspect. I do not see any empirical evidence of such a speedup. Is it shown anywhere? Anyway, I agree with @satvshr that a speedup of |
|
Understood, thanks for the context, @NennoMP. The benchmark was run on train_li2014 (2,320 sequences, 10 iterations) and showed a |
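For reviewers who want to reproduce a timing comparison of this kind, a minimal sketch is given here. It is not the script used for the numbers in this thread: the helper names `benchmark` and `speedup` are hypothetical, and taking the best of the repeated passes (rather than the mean) is an assumption about how the 10 iterations are aggregated.

```python
import timeit
from typing import Callable, List


def benchmark(fn: Callable[[str], object], sequences: List[str],
              iterations: int = 10) -> float:
    """Return the best wall-clock time in seconds for one full pass.

    Hypothetical helper: each iteration runs `fn` once on every sequence,
    and the minimum over all iterations is reported.
    """
    def run() -> None:
        for seq in sequences:
            fn(seq)

    # timeit.repeat accepts a callable; number=1 means one pass per repeat.
    times = timeit.repeat(run, number=1, repeat=iterations)
    return min(times)


def speedup(baseline: Callable[[str], object],
            candidate: Callable[[str], object],
            sequences: List[str], iterations: int = 10) -> float:
    """Ratio of baseline time to candidate time (above 1.0 means faster)."""
    return (benchmark(baseline, sequences, iterations)
            / benchmark(candidate, sequences, iterations))
```

To use it, pass the old and new encoders as `baseline` and `candidate` together with the list of sequences loaded from the dataset. Reporting the raw per-iteration times alongside the ratio makes the claim easier to verify.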
|
@satvshr I have done the generalization. If you find any issues, please let me know. |
|
@satvshr Is this PR good to be merged or do you suggest any changes? |
|
@NennoMP, this is for you to look at if you think it is required. I still believe that if so many changes are needed, the codebase is better left untouched. |
|
Hi @SiddhantGahankari, to make this review easier, I would request that you add docstrings, since this PR adds multiple new functions. |
Sure @siddharth7113, I will do so right away.
|
@siddharth7113 Hi, I have already added the required docstrings to the functions. Could you review it and tell me whether any changes are required or whether it is ready to be merged? |
Was my take on it |
|
At this time, this PR changes a lot of files, and it would be a premature optimisation, especially given that the API is not stable. Thank you for putting in the effort, but at this moment I am closing the PR and the issue. |
|
@siddharth7113 |
Reference Issues/PRs
Closes #282.
What does this implement/fix? Explain your changes.
This PR improves sequence vectorization and completes tokenizer unification for the current scope.
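As a rough picture of what "tokenizer unification" can mean, the sketch below shows one shared tokenizing routine reused by two thin wrappers. It is illustrative only and is not the code in this PR: the names `tokenize_core`, `encode_rna_sketch`, and `seq2vec_sketch`, the greedy longest-match rule, the vocabularies, and the use of id 0 for unknown and padding tokens are all assumptions. Returning character spans from the core is one way the second wrapper could keep a secondary-structure string aligned with the sequence tokens.

```python
from typing import Dict, List, Tuple

Token = Tuple[int, int, int]  # (token_id, start, end), end exclusive


def tokenize_core(seq: str, vocab: Dict[str, int], max_k: int = 3) -> List[Token]:
    """Hypothetical shared core: greedy longest-match tokenization with spans."""
    tokens: List[Token] = []
    i = 0
    while i < len(seq):
        # Try the longest piece first, then shorter ones.
        for k in range(min(max_k, len(seq) - i), 0, -1):
            piece = seq[i:i + k]
            if piece in vocab:
                tokens.append((vocab[piece], i, i + k))
                i += k
                break
        else:
            # No piece matched: emit the unknown id 0 for one character.
            tokens.append((0, i, i + 1))
            i += 1
    return tokens


def encode_rna_sketch(seq: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Sequence-only wrapper: token ids, truncated and zero-padded to max_len."""
    ids = [tid for tid, _, _ in tokenize_core(seq, vocab)][:max_len]
    return ids + [0] * (max_len - len(ids))


def seq2vec_sketch(seq: str, ss: str, seq_vocab: Dict[str, int],
                   ss_vocab: Dict[str, int],
                   max_len: int) -> Tuple[List[int], List[int]]:
    """Wrapper that reuses the sequence spans to slice the structure string."""
    tokens = tokenize_core(seq, seq_vocab)[:max_len]
    seq_ids = [tid for tid, _, _ in tokens]
    # Each structure token covers the same characters as its sequence token.
    ss_ids = [ss_vocab.get(ss[start:end], 0) for _, start, end in tokens]
    pad = [0] * (max_len - len(seq_ids))
    return seq_ids + pad, ss_ids + pad
```

With this shape, both wrappers share one tokenization rule, so a parity test only needs to check that they agree on the sequence ids for the same input.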
- pyaptamer/utils/_rna.py.
- encode_rna in pyaptamer/utils/_rna.py to use the shared core.
- seq2vec in pyaptamer/utils/_aptatrans_utils.py to use the same shared core with span-aware secondary-structure alignment.
- encode_rna and seq2vec.
- pyaptamer/utils/tests/test_rna.py and kept parity/edge-case tests in pyaptamer/utils/tests/test_seq2vec.py.
What should a reviewer concentrate their feedback on?
- encode_rna and seq2vec after tokenizer unification.
- seq2vec.
- pyaptamer/utils/_rna.py is appropriate for follow-up reuse.
Did you add any tests for the change?
Yes.
Added/updated tests include:
- pyaptamer/utils/tests/test_rna.py:
- pyaptamer/utils/tests/test_seq2vec.py:
Any other comments?
- encode_rna and seq2vec.
PR checklist
pre-commit install.
To run hooks independent of commit, execute
pre-commit run --all-files