[ENH] PseAAC encoding algorithm#29

Merged
fkiraly merged 37 commits into main from issue28
Jul 24, 2025

@satvshr
Collaborator

@satvshr satvshr commented Jul 7, 2025

Solves #28.
Merge before #13.
Implements the encoding algorithm, which takes the target protein's amino-acid string as input and outputs a feature vector of physicochemical properties based on the distances between the amino acids in that protein.

@satvshr satvshr marked this pull request as draft July 7, 2025 07:55
@satvshr
Collaborator Author

satvshr commented Jul 7, 2025

To do: Add tests comparing it to the implementation in the official AptaNet repository.

Contributor

@fkiraly fkiraly left a comment

Looks like a quite simple algorithm?

Can you kindly change this to a class, and move the many internal functions to private methods? Functions inside functions are not good style.

Further, there are also a number of loops, which look like they can be vectorized using numpy.

Finally, it looks like a lot of stuff gets precomputed every time, i.e., the prop_groups. I would separate this out from the algorithm, and think about caching the results.
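
A minimal sketch of the structure suggested above, not the PR's actual code. The class name `PseAACSketch`, the parameter `lambda_val`, and the one-property table `_PROP` (placeholder values) are all assumptions made for illustration. It shows helpers as private methods and a pairwise loop replaced by a numpy slice difference.

```python
import numpy as np

# Hypothetical single-property table with placeholder values (not real data).
_AA = "ACDEFGHIKLMNPQRSTVWY"
_PROP = np.linspace(-1.0, 1.0, len(_AA))
_INDEX = {aa: i for i, aa in enumerate(_AA)}


class PseAACSketch:
    """Illustrative skeleton: helpers are private methods, not nested functions."""

    def __init__(self, lambda_val=3):
        # Number of lags to compute; the sequence must be longer than this.
        self.lambda_val = lambda_val

    def _encode(self, seq):
        # Map each residue to its property value with one indexing call.
        idx = np.fromiter((_INDEX[aa] for aa in seq), dtype=int, count=len(seq))
        return _PROP[idx]

    def _theta(self, values, lag):
        # Vectorized replacement for a loop over residue pairs (i, i + lag).
        diff = values[lag:] - values[:-lag]
        return float(np.mean(diff ** 2))

    def correlations(self, seq):
        # One mean squared difference per lag 1..lambda_val.
        values = self._encode(seq)
        lags = range(1, self.lambda_val + 1)
        return np.array([self._theta(values, k) for k in lags])
```

Because `_PROP` is a module-level constant, it is built once at import time rather than on every call, which is the caching concern raised in the last point.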

@fkiraly
Contributor

fkiraly commented Jul 7, 2025

Regarding code structure: this should move into a folder for packaging. I think we are currently using pyaptamer and not src.

@satvshr
Collaborator Author

satvshr commented Jul 7, 2025

Looks like a quite simple algorithm?

Yup! Logically, it is not as complex as I initially thought it was.

Can you kindly change this to a class, and move the many internal functions to private methods? Functions inside functions are not good style.

Will get to it.

it looks like a lot of stuff gets precomputed every time, i.e., the prop_groups. I would separate this out from the algorithm, and think about caching the results.

prop_groups is not getting computed, but I think you are talking about the normalize_properties function, which uses fixed values and hence does not need to be recomputed? I thought that the actual physicochemical values for all the properties should also be mentioned, hence I introduced the function. Should I add 21 more variables for the normalized physicochemical values?

@fkiraly
Contributor

fkiraly commented Jul 7, 2025

prop_groups is not getting computed

I meant everything depending on prop_groups and only on that - since prop_groups are hard-coded, the normalized prop_groups will always yield the same result.

I thought that the actual physicochemical values for all the properties should also be mentioned, hence I introduced the function. Should I add 21 more variables for the normalized physicochemical values?

Yes, I think we should have the physicochemical values stored somewhere - the pattern I would personally use is to compute it once and hard-code the result. But then also add a test that the hard-coded results are equal to the computation from the original values, so the compute load for the "verification" is in the tests, and not every time the algorithm gets called (where it is unnecessary).

Plus, of course, ample comments, maybe a getter function whose docstring explains exactly what the values it returns are, with a pointer to the tests and the original physicochemical tables.
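
A rough sketch of the "compute once, hard-code, verify in tests" pattern described above. The names `RAW_P1`, `NORM_P1`, `normalize`, and `get_normalized_p1`, the three-letter table, and the choice of z-score normalization are all assumptions for illustration, not the PR's code or data.

```python
import statistics

# Hypothetical raw values for one property (illustrative numbers only).
RAW_P1 = {"A": 0.62, "C": 0.29, "D": -0.90}

# Normalized values, computed once offline and then hard-coded.
NORM_P1 = {"A": 0.9446, "C": 0.4391, "D": -1.3838}


def normalize(raw):
    """Standardize values to zero mean and unit population standard deviation."""
    mean = statistics.fmean(raw.values())
    std = statistics.pstdev(raw.values())
    return {aa: (value - mean) / std for aa, value in raw.items()}


def get_normalized_p1():
    """Return the hard-coded normalized values of RAW_P1.

    The values are checked against ``normalize(RAW_P1)`` in
    ``test_hardcoded_matches``.
    """
    return NORM_P1


def test_hardcoded_matches():
    # The verification cost lives in the test, not in every encoder call.
    computed = normalize(RAW_P1)
    for aa, expected in NORM_P1.items():
        assert abs(computed[aa] - expected) < 1e-3
```

The encoder would then only call the getter, while the test suite carries the recomputation.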

@fkiraly fkiraly assigned avinab-neogy and unassigned satvshr Jul 8, 2025
@avinab-neogy avinab-neogy removed their assignment Jul 8, 2025
@satvshr satvshr self-assigned this Jul 8, 2025
@satvshr satvshr marked this pull request as ready for review July 9, 2025 07:21
@satvshr satvshr requested a review from fkiraly July 9, 2025 07:21
Contributor

@fkiraly fkiraly left a comment

Great!

May I request some structural changes:

  • move the input from __init__ to vectorize - I assume that is possible?
  • move the checking whether a string is an amino-acid string to a utils module. Use the set of letters instead of the dict.
  • move the self.P1 etc to a separate private module where they stay hard-coded as constants.
  • _average_aa is not pythonic, this can be simplified
  • various methods look like they can be vectorized or replaced by numpy drop-ins.
  • where did you move the precomputations?
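
For the second point, a minimal sketch of a set-based check, assuming a hypothetical helper name `is_valid_aa` and the 20 standard one-letter codes as the alphabet; the PR's utils module may differ.

```python
# Hypothetical utils helper; the name and the 20-letter alphabet are assumptions.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_valid_aa(seq):
    """Return True if seq is a non-empty string of standard amino-acid letters."""
    # Subset test on the set of letters; no property dict is needed.
    return bool(seq) and set(seq) <= AMINO_ACIDS
```

A set expresses "allowed letters" directly, whereas checking against the keys of a property dict couples validation to the storage format of the properties.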

@satvshr
Collaborator Author

satvshr commented Jul 9, 2025

move the input from __init__ to vectorize - I assume that is possible?

Done.

move the checking whether a string is an amino-acid string to a utils module. Use the set of letters instead of the dict.

Done.

move the self.P1 etc to a separate private module where they stay hard-coded as constants.

Done.

_average_aa is not pythonic, this can be simplified

Done.

various methods look like they can be vectorized or replaced by numpy drop-ins.

Done.

where did you move the precomputations?

All the values with variable names P{i} are the raw values; variables named NP{i} are the normalized ones. I also moved the normalized values to the private file.

@satvshr satvshr requested a review from fkiraly July 9, 2025 18:22
Comment thread pyaptamer/AptaNet/_props.py Outdated
Comment thread pyaptamer/AptaNet/pseaac.py Outdated
Comment thread pyaptamer/AptaNet/pseaac.py Outdated
Comment thread pyaptamer/AptaNet/utils.py Outdated
Comment thread pyaptamer/pseaac/_features.py Outdated
3 consecutive properties. Specifically, the groups are arranged in order as follows:
Group 1 includes properties 1–3, Group 2 includes properties 4–6, and so on, up to
Group 7, which includes properties 19–21. the properties in order are:
- Hydrophobicity
Contributor

above you give numbers to the properties, that is good since it is specific. For consistency, I would suggest you use a numbered list instead of a bullet point list then.

Minor formatting suggestion at this point: ensure there are two newlines, i.e., an empty line, both before and after lists (bullet points, numbered, etc.), for example between the introductory text and the first item. The reason is that rst expects this.
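
As an illustration of that layout, a small sketch with a hypothetical function name and a placeholder second item, since only the first property name is visible in this thread:

```python
def docstring_list_example():
    """Show list formatting inside a docstring.

    The properties, in order, are:

    1. Hydrophobicity
    2. (second property, placeholder)

    Regular text resumes here, again after an empty line.
    """
```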

Collaborator Author

Done.

Comment thread pyaptamer/pseaac/_features.py Outdated
amino acid)
- 30 sequence-order correlation features based on physicochemical similarity between
residues.
- These 50 features are computed for each of 7 predefined property groups,
Contributor

Remove from bullet point and move to top level. This is not another set of features but explanation of how they sum up to everything.
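
The sum the docstring describes can be spelled out as below. The split of 50 into 20 composition features plus 30 correlation features is an assumption inferred from the 20 standard amino acids and the visible docstring lines; the constant names are made up.

```python
N_COMPOSITION = 20  # one feature per standard amino acid (assumed)
N_CORRELATION = 30  # sequence-order correlation features
N_GROUPS = 7        # predefined property groups

per_group = N_COMPOSITION + N_CORRELATION  # 50 features per group
total = per_group * N_GROUPS               # length of the full feature vector
```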

Collaborator Author

Done.

Contributor

@fkiraly fkiraly left a comment

only small docstring comments, above

@satvshr satvshr requested a review from fkiraly July 18, 2025 19:15
@satvshr
Collaborator Author

satvshr commented Jul 18, 2025

After 110 comments (most of them about docstrings, annoyingly) and with this PR almost merged, I propose some changes if we want to make PSeAAC a standalone algorithm:

  1. Move all "21 physicochemical properties" references out of PSeAAC: the paper mentions a total of 24 properties, and different combinations of them give the best results for different k values.
  2. Update the helper function aa_props and the PSeAAC class to accept the indices of the physicochemical properties the user wants to use.
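
A rough sketch of what point 2 could look like, assuming a hypothetical table `ALL_PROPS` with placeholder values and a stand-in helper `aa_props_sketch`; the real `aa_props` signature is not shown in this thread.

```python
# Hypothetical table: 24 properties, each mapping amino acid -> placeholder value.
AMINO_ACID_LETTERS = "ACDEFGHIKLMNPQRSTVWY"
ALL_PROPS = [{aa: float(p) for aa in AMINO_ACID_LETTERS} for p in range(24)]


def aa_props_sketch(prop_indices):
    """Return the property tables at the requested 0-based indices."""
    if not prop_indices:
        raise ValueError("prop_indices must not be empty")
    for i in prop_indices:
        if not 0 <= i < len(ALL_PROPS):
            raise ValueError(f"property index {i} out of range")
    # Preserve the caller's order so groupings stay under user control.
    return [ALL_PROPS[i] for i in prop_indices]
```

With this shape, the 21-property AptaNet setup would be one particular choice of indices rather than something baked into the encoder.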

@satvshr
Collaborator Author

satvshr commented Jul 20, 2025

Was browsing through Google to find the original PSeAAC paper by Chou (not publicly available) and found this. It is not installable via pip, but the download link can be found here, along with a user guide.

Contributor

@fkiraly fkiraly left a comment

Question, why did you extend the specification of the class? With group_props etc.
Is this "feature creep"? I feel this should have been a separate pull request, if at all.

It is also very hard to understand what the docstring means.

@satvshr
Collaborator Author

satvshr commented Jul 22, 2025

Question, why did you extend the specification of the class? With group_props etc. Is this "feature creep"? I feel this should have been a separate pull request, if at all.

The algorithm I had implemented was wrong because we were assuming there were only 21 property groups to use, and those groups would be grouped into groups of 3. This is valid only for the best experiment from AptaNet, not for PSeAAC in general.

It is also very hard to understand what the docstring means.

Forgot to edit the docstrings after the final draft, will do so.

@fkiraly
Contributor

fkiraly commented Jul 22, 2025

The algorithm I had implemented was wrong because we were assuming there were only 21 property groups to use, and those groups would be grouped into groups of 3. This is valid only for the best experiment from AptaNet, not for PSeAAC in general.

Can you please explain that?

I would suggest we split this PR: turn this back into the mergeable almost-approved state where we had the 21 property groups; move the changes after that into a new PR. I think we need to discuss things like the API design and the abstractions here, I do not think the current ones are good choices.

The danger is that we end up with an ever-growing PR before we have merged anything. It is best practice not to re-scope or widen the scope of an issue or PR in the middle of it.

@satvshr satvshr requested a review from fkiraly July 22, 2025 08:12
Contributor

@fkiraly fkiraly left a comment

Thanks.

Since the tests do not run on CI, can you confirm they run locally?

@satvshr
Collaborator Author

satvshr commented Jul 24, 2025

Thanks.

Since the tests do not run on CI, can you confirm they run locally?

Tests pass locally.

@satvshr satvshr requested a review from fkiraly July 24, 2025 21:00
@fkiraly fkiraly changed the title Added the pseaac encoding algorithm [ENH] PseAAC encoding algorithm Jul 24, 2025
@fkiraly fkiraly added the enhancement New feature or request label Jul 24, 2025
@fkiraly fkiraly merged commit 7fff054 into main Jul 24, 2025
2 of 4 checks passed
@satvshr
Collaborator Author

satvshr commented Jul 24, 2025

FINALLY! 🎉

Labels

enhancement New feature or request

3 participants