Conversation
|
To do: Add tests comparing it to the implementation in the official AptaNet repository. |
fkiraly
left a comment
There was a problem hiding this comment.
Looks like a quite simple algorithm?
Can you kindly change this to a class, and move the many internal functions to private methods? Functions inside functions are not good style.
Further, there are also a number of loops, which look like they can be vectorized using numpy.
Finally, it looks like a lot of stuff gets precomputed every time, i.e., the prop_groups. I would separate this out from the algorithm, and think about caching the results.
|
regarding code structure: this should move in a folder for packaging, I think currently we are using |
Yup! Logically, it is not as complex as I initially thought it was.
Will get to it.
|
I meant everything depending on
Yes, I think we should have the physiochemical values stored somewhere - the pattern I would personally use is compute it once and hard-code the result. But then also add a test that the hard-coded results are equal to the computation from the original values, so the compute load for the "verification" is in the tests, and not every time the algorithm gets called (where it is unnecessary). Plus, of course, ample comments, maybe a getter function whose docstrings explain what the values exactly are that it returns, with a pointer to the tests and the original physiochemical tables. |
fkiraly
left a comment
There was a problem hiding this comment.
Great!
May I request some structural changes:
- move the input from
__init__tovectorize- I assume that is possible? - move the checking whether a string is an amino-acid string to a
utilsmodule. Use the set of letters instead of the dict. - move the
self.P1etc to a separate private module where thes stay hard-coded as constants. _average_aais not pythonic, this can be simplified- various methods look like they can be vectorized or replaced by
numpydrop-ins. - where did you move the precomputations?
Done.
Done.
Done.
Done.
Done.
All the values with variable names |
| 3 consecutive properties. Specifically, the groups are arranged in order as follows: | ||
| Group 1 includes properties 1–3, Group 2 includes properties 4–6, and so on, up to | ||
| Group 7, which includes properties 19–21. the properties in order are: | ||
| - Hydrophobicity |
There was a problem hiding this comment.
above you give numbers to the properties, that is good since it is specific. For consistency, I would suggest you use a numbered list instead of a bullet point list then.
Minor formatting suggestin at this point: ensure before lists (bullet points, numbered etc) and after there are two newlinds, i.e., an empty line between the text and first item etc. The reason for this is, rst expects this.
| amino acid) | ||
| - 30 sequence-order correlation features based on physicochemical similarity between | ||
| residues. | ||
| - These 50 features are computed for each of 7 predefined property groups, |
There was a problem hiding this comment.
Remove from bullet point and move to top level. This is not another set of features but explanation of how they sum up to everything.
fkiraly
left a comment
There was a problem hiding this comment.
only small docstring comments, above
|
After 110 comments (most of them about docstrings annoyingly) and this PR almost being merged. I propose some changes if we want to make PSeAAC a standalone algorithm:
|
fkiraly
left a comment
There was a problem hiding this comment.
Question, why did you extend the specification of the class? With group_props etc.
Is this "feature creep"? I feel this should have been a separate pull request, if at all.
The docstring is also very hard to understand what it means.
The algorithm I had implemented was wrong because we were assuming there were only 21 property groups to use, and those groups would be grouped into groups of 3. This is valid only for the best experiment from
Forgot to edit the docstrings after the final draft, will do so. |
Can you please explain that? I would suggest we split this PR: turn this back into the mergeable almost-approved state where we had the 21 property groups; move the changes after that into a new PR. I think we need to discuss things like the API design and the abstractions here, I do not think the current ones are good choices. The danger is we have an ever-growing PR before we have merged anything. It is best practice to not re-scope or widen scope of issues or PR in the middle of it. |
fkiraly
left a comment
There was a problem hiding this comment.
Thanks.
Since the tests do not run on CI, can you confirm they run locally?
Tests pass locally. |
|
FINALLY! 🎉 |
Solves #28.
Merge before #13.
Implements the encoding algorithm that will take the string of the target protein as input and output a feature vector containing its physicochemical properties based on the distance between each amino acid in that protein.