[ENH] PseAAC encoding algorithm by satvshr · Pull Request #29 · gc-os-ai/pyaptamer

satvshr · 2025-07-07T07:55:04Z

Solves #28.
Merge before #13.
Implements the encoding algorithm that takes the amino-acid sequence of the target protein as a string and outputs a feature vector describing its physicochemical properties, based on the distances between amino acids in the sequence.
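
For readers unfamiliar with the encoding, the general idea can be sketched as below. This is a minimal sketch of the classic pseudo amino acid composition scheme (composition frequencies followed by sequence-order correlation factors), not the code from this PR. The function name `pseaac_sketch`, the parameters `lam` and `weight`, and the random property table are all assumptions made for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def pseaac_sketch(seq, props, lam=2, weight=0.05):
    """Return 20 composition features followed by `lam` correlation features.

    `props` has shape (n_props, 20): normalized property values, with columns
    ordered like AMINO_ACIDS. All names here are illustrative.
    """
    idx = np.array([AMINO_ACIDS.index(aa) for aa in seq])
    n = idx.size
    if n <= lam:
        raise ValueError("sequence must be longer than lam")

    # Composition part: occurrence frequency of each amino acid.
    freq = np.bincount(idx, minlength=20) / n

    # Correlation part: mean squared property difference between residues
    # that are k positions apart, for k = 1..lam.
    vals = props[:, idx]  # shape (n_props, n)
    theta = np.array(
        [np.mean((vals[:, :-k] - vals[:, k:]) ** 2) for k in range(1, lam + 1)]
    )

    # Joint normalization so that all features sum to 1.
    denom = freq.sum() + weight * theta.sum()
    return np.concatenate([freq, weight * theta]) / denom


# Random placeholder properties, NOT real physicochemical data.
rng = np.random.default_rng(0)
toy_props = rng.standard_normal((3, 20))
vec = pseaac_sketch("ACDEFGHIKLMNPQRSTVWYAC", toy_props, lam=3)
```

With `lam=3` the vector has 23 entries: 20 composition features and 3 correlation features.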

satvshr · 2025-07-07T07:58:09Z

To do: Add tests comparing it to the implementation in the official AptaNet repository.

fkiraly

Looks like a quite simple algorithm?

Can you kindly change this to a class, and move the many internal functions to private methods? Functions inside functions are not good style.

Further, there are also a number of loops, which look like they can be vectorized using numpy.

Finally, it looks like a lot of stuff gets precomputed every time, i.e., the prop_groups. I would separate this out from the algorithm, and think about caching the results.

fkiraly · 2025-07-07T08:59:22Z

Regarding code structure: this should move into a folder for packaging. I think we are currently using pyaptamer and not src.

satvshr · 2025-07-07T11:50:50Z

Looks like a quite simple algorithm?

Yup! Logically, it is not as complex as I initially thought it was.

Can you kindly change this to a class, and move the many internal functions to private methods? Functions inside functions are not good style.

Will get to it.

it looks like a lot of stuff gets precomputed every time, i.e., the prop_groups. I would separate this out from the algorithm, and think about caching the results.

prop_groups is not getting computed, but I think you are talking about the normalize_properties function, which uses fixed values and hence does not need to be recomputed on every call? I thought that the actual physicochemical values for all the properties should also be mentioned, hence I introduced the function. Should I add 21 more variables for the normalized physicochemical values?

fkiraly · 2025-07-07T13:58:40Z

prop_groups is not getting computed

I meant everything depending on prop_groups and only on that - since prop_groups are hard-coded, the normalized prop_groups will always yield the same result.

I thought that the actual physicochemical values for all the properties should also be mentioned, hence I introduced the function. Should I add 21 more variables for the normalized physicochemical values?

Yes, I think we should have the physicochemical values stored somewhere - the pattern I would personally use is to compute them once and hard-code the result. But then also add a test that the hard-coded results are equal to the computation from the original values, so the compute load for the "verification" is in the tests, and not incurred every time the algorithm gets called (where it is unnecessary).

Plus, of course, ample comments, and maybe a getter function whose docstring explains exactly what the values it returns are, with a pointer to the tests and the original physicochemical tables.
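
The "compute once, hard-code, verify in tests" pattern described above might be sketched like this. Everything here is hypothetical: the names `RAW_VALUES`, `NORMALIZED_VALUES` and `get_normalized_values`, the four-residue table, and the choice of z-score normalization are assumptions for illustration, not the values or helpers used in this PR.

```python
import math
import statistics

# Placeholder raw values for four residues, NOT real physicochemical data.
RAW_VALUES = {"A": 1.0, "C": 2.0, "D": 3.0, "E": 4.0}

# Hard-coded result of z-scoring RAW_VALUES (population standard deviation),
# computed once offline and checked by test_normalized_matches_raw below.
NORMALIZED_VALUES = {"A": -1.341641, "C": -0.447214, "D": 0.447214, "E": 1.341641}


def get_normalized_values():
    """Return the normalized values of the placeholder property.

    The values are the z-scores of RAW_VALUES. They are hard-coded rather than
    recomputed; test_normalized_matches_raw verifies them against the raw table.
    """
    return dict(NORMALIZED_VALUES)  # copy, so callers cannot edit the constant


def _normalize(raw):
    mean = statistics.fmean(raw.values())
    std = statistics.pstdev(raw.values())
    return {aa: (value - mean) / std for aa, value in raw.items()}


def test_normalized_matches_raw():
    # The normalization cost is paid here, in the test suite, not at call time.
    expected = _normalize(RAW_VALUES)
    for aa, value in get_normalized_values().items():
        assert math.isclose(value, expected[aa], abs_tol=1e-6)
```

The getter's docstring is the natural place for the pointer to the tests and to the source of the raw table.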

fkiraly

Great!

May I request some structural changes:

- move the input from __init__ to vectorize - I assume that is possible?
- move the checking whether a string is an amino-acid string to a utils module. Use the set of letters instead of the dict.
- move the self.P1 etc to a separate private module where they stay hard-coded as constants.
- _average_aa is not pythonic, this can be simplified
- various methods look like they can be vectorized or replaced by numpy drop-ins.
- where did you move the precomputations?
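
As an illustration of the second request, a set-based validity check could look like the sketch below. The function name `is_valid_aa` and the decision to accept only the 20 standard uppercase letters are assumptions, not the helper that ended up in the utils module.

```python
# The 20 standard amino-acid letters, stored as a set for membership tests.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_valid_aa(seq: str) -> bool:
    """Return True if `seq` is non-empty and uses only standard residue letters."""
    return bool(seq) and set(seq) <= AMINO_ACIDS
```

The subset comparison replaces a per-character lookup in a dict and makes the accepted alphabet explicit in one place.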

satvshr · 2025-07-09T18:21:48Z

move the input from __init__ to vectorize - I assume that is possible?

Done.

move the checking whether a string is an amino-acid string to a utils module. Use the set of letters instead of the dict.

Done.

move the self.P1 etc to a separate private module where they stay hard-coded as constants.

Done.

_average_aa is not pythonic, this can be simplified

Done.

various methods look like they can be vectorized or replaced by numpy drop-ins.

Done.

where did you move the precomputations?

All the variables named P{i} hold the raw values; the variables named NP{i} hold the normalized ones. I also moved the normalized values to the private file.

fkiraly · 2025-07-18T11:57:48Z

+    3 consecutive properties. Specifically, the groups are arranged in order as follows:
+    Group 1 includes properties 1–3, Group 2 includes properties 4–6, and so on, up to
+    Group 7, which includes properties 19–21. the properties in order are:
+    - Hydrophobicity


Above, you give numbers to the properties, which is good since it is specific. For consistency, I would suggest you use a numbered list instead of a bullet point list.

Minor formatting suggestion at this point: ensure that there is an empty line before and after every list (bullet points, numbered, etc.), i.e., between the preceding text and the first item, and between the last item and the following text. The reason for this is that rst expects it.
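
A docstring following both suggestions could be laid out as in this sketch. Only the first item is taken from the diff excerpt above; the function name and the remaining items are placeholders.

```python
def grouped_properties_doc():
    """Illustrate list formatting in a docstring (placeholder content).

    The properties in order are:

    1. Hydrophobicity
    2. Second property (placeholder)
    3. Third property (placeholder)

    Text following the list starts after another empty line.
    """
```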

fkiraly · 2025-07-18T11:58:32Z

+    amino acid)
+    - 30 sequence-order correlation features based on physicochemical similarity between
+    residues.
+    - These 50 features are computed for each of 7 predefined property groups,


Remove from bullet point and move to top level. This is not another set of features but explanation of how they sum up to everything.
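
The way the counts add up can be spelled out as a small check. The figure of 20 composition features is an assumption inferred from the excerpt (50 features per group minus 30 correlation features, one per standard amino acid); the other numbers are quoted from the docstring.

```python
N_COMPOSITION = 20  # assumption: one frequency per standard amino acid
N_CORRELATION = 30  # sequence-order correlation features, from the docstring
N_GROUPS = 7        # predefined property groups, from the docstring

features_per_group = N_COMPOSITION + N_CORRELATION
total_features = features_per_group * N_GROUPS
```

Under these assumptions each group contributes 50 features and the full vector has 350 entries.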

fkiraly

only small docstring comments, above

satvshr · 2025-07-18T21:07:39Z

After 110 comments (most of them, annoyingly, about docstrings) and with this PR almost merged, I propose some changes if we want to make PSeAAC a standalone algorithm:

- Move all "21 physicochemical properties" references out of PSeAAC: the paper mentions a total of 24 properties, different combinations of which give the best results for different k values.
- Update the helper function aa_props and the PSeAAC class to accept the indices of the physicochemical properties the user wants to use.
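
Selecting properties by index, as proposed in the second point, could be sketched as follows. `select_props` is a hypothetical stand-in and does not reflect the actual signature of aa_props; the 20 x 24 table is filled with placeholder numbers, not real property values.

```python
import numpy as np


def select_props(table, prop_indices=None):
    """Return the columns of `table` named by `prop_indices` (all if None).

    `table` has shape (n_residues, n_props). Hypothetical helper for illustration.
    """
    table = np.asarray(table)
    if prop_indices is None:
        return table
    idx = np.asarray(prop_indices, dtype=int)
    if idx.ndim != 1 or idx.size == 0:
        raise ValueError("prop_indices must be a non-empty 1D sequence")
    if idx.min() < 0 or idx.max() >= table.shape[1]:
        raise IndexError("property index out of range")
    return table[:, idx]


# Placeholder table: 20 residues x 24 properties.
toy_table = np.arange(20 * 24, dtype=float).reshape(20, 24)
subset = select_props(toy_table, [0, 5, 23])
```

Validating the indices up front gives a clearer error than letting an out-of-range index surface from deep inside the encoder.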

satvshr · 2025-07-20T17:38:44Z

Was browsing Google to find the original PSeAAC paper by Chou (not publicly available) and found this. It is not installable via pip, but the download link can be found here, along with a user guide.

fkiraly

Question, why did you extend the specification of the class? With group_props etc.
Is this "feature creep"? I feel this should have been a separate pull request, if at all.

The docstring is also very hard to understand.

satvshr · 2025-07-22T06:09:55Z

Question, why did you extend the specification of the class? With group_props etc. Is this "feature creep"? I feel this should have been a separate pull request, if at all.

The algorithm I had implemented was wrong because we were assuming there were only 21 property groups to use, and those groups would be grouped into groups of 3. This is valid only for the best experiment from AptaNet, not for PSeAAC in general.

The docstring is also very hard to understand.

Forgot to edit the docstrings after the final draft, will do so.

fkiraly · 2025-07-22T06:59:35Z

The algorithm I had implemented was wrong because we were assuming there were only 21 property groups to use, and those groups would be grouped into groups of 3. This is valid only for the best experiment from AptaNet, not for PSeAAC in general.

Can you please explain that?

I would suggest we split this PR: turn this back into the mergeable almost-approved state where we had the 21 property groups; move the changes after that into a new PR. I think we need to discuss things like the API design and the abstractions here, I do not think the current ones are good choices.

The danger is that we have an ever-growing PR before we have merged anything. It is best practice not to re-scope or widen the scope of an issue or PR in the middle of it.

fkiraly

Thanks.

Since the tests do not run on CI, can you confirm they run locally?

satvshr · 2025-07-24T20:40:03Z

Thanks.

Since the tests do not run on CI, can you confirm they run locally?

Tests pass locally.

satvshr · 2025-07-24T21:05:28Z

FINALLY! 🎉

Added the pseaac encoding algorithm

e37135c

satvshr marked this pull request as draft July 7, 2025 07:55

fkiraly requested changes Jul 7, 2025

fkiraly assigned satvshr Jul 7, 2025

satvshr added 2 commits July 7, 2025 22:22

Made pseaac to a class and made the functions private, still working …

a5f01e0

…on tests and bug fixing

Made a few readability changes

3773a90

fkiraly assigned avinab-neogy and unassigned satvshr Jul 8, 2025

avinab-neogy removed their assignment Jul 8, 2025

satvshr self-assigned this Jul 8, 2025

satvshr added 2 commits July 8, 2025 12:55

Edited tests

9b9a3da

Added pytest to tests

2dfe0c7

satvshr mentioned this pull request Jul 8, 2025

AptaNet discrepancy between implementation and paper #34

Closed

satvshr added 3 commits July 9, 2025 01:25

Added numpy style docstrings and ruff formatting

1e182d3

Removed AptaNet from root

fc2f051

Added example

62f6c42

satvshr marked this pull request as ready for review July 9, 2025 07:21

satvshr requested a review from fkiraly July 9, 2025 07:21

fkiraly requested changes Jul 9, 2025

Made requested changes

1515efe

satvshr requested a review from fkiraly July 9, 2025 18:22

fkiraly reviewed Jul 9, 2025

Comment thread pyaptamer/AptaNet/_props.py Outdated

fkiraly reviewed Jul 9, 2025

Comment thread pyaptamer/AptaNet/pseaac.py Outdated

fkiraly reviewed Jul 9, 2025

Comment thread pyaptamer/AptaNet/pseaac.py Outdated

fkiraly reviewed Jul 9, 2025

Comment thread pyaptamer/AptaNet/utils.py Outdated

satvshr added 6 commits July 16, 2025 21:05

Merge branch 'main' into issue28

d24c4d7

Added requested changes

0cd72b7

Added requested changes

fabc7b4

Added info about prop groups in class docstring

32633d3

Removed init method description

6136c39

editing changes

88c0122

fkiraly reviewed Jul 18, 2025

fkiraly requested changes Jul 18, 2025

satvshr added 2 commits July 19, 2025 00:42

Made requested changes

b7a7349

Made requested changes

c14c0bb

satvshr requested a review from fkiraly July 18, 2025 19:15

fkiraly requested changes Jul 21, 2025

satvshr force-pushed the issue28 branch from 71cfd9a to c14c0bb Compare July 22, 2025 08:10

Added .vscode to .gitignore

d1075a7

satvshr requested a review from fkiraly July 22, 2025 08:12

Added metadata

19a9e98

fkiraly approved these changes Jul 24, 2025

satvshr requested a review from fkiraly July 24, 2025 21:00

fkiraly changed the title ~~Added the pseaac encoding algorithm~~ [ENH] PseAAC encoding algorithm Jul 24, 2025

fkiraly added the enhancement New feature or request label Jul 24, 2025

fkiraly merged commit 7fff054 into main Jul 24, 2025
2 of 4 checks passed

satvshr mentioned this pull request Jul 30, 2025

[ENH] PseAAC feature encoding algorithm #28

Closed

satvshr mentioned this pull request Aug 12, 2025

[ENH] AptaNet algorithm #30

Merged

3 participants
