Conversation
|
To do: Add tests comparing it to the implementation in the official AptaNet repository. |
fkiraly
left a comment
There was a problem hiding this comment.
Some questions for my understanding: what is this trained on?
Are there pre-trained weights? If so, where?
If we can also train on own data, how does that work?
…on tests and bug fixing
Nope, there are no pre-trained weights.
This is answered in the "Data collection" section of the original paper (page 11)
I assume you send aptamer and target sequence as X, and y will be a binary value (if it binds or not) |
Can you give a short summary in your own words, or say that you do not know? If there are no pretrained weights, then there is nothing this was trained on. |
There was a problem hiding this comment.
Great!
Request to change the signature of the class:
- move the
__init__args to the methodgenerate_final-vector - rename the latter to
transform - change the output to an 1D
np.ndarray
Also, usual quality requests:
- please ensure to add docstring examples
- please ensure to add module docstrings
Had a look again, and in the original repository they do have a
I don’t necessarily agree with that point, just because the weights weren’t saved or shared doesn’t imply that the model was never trained. The original AptaNet paper includes a results section, which strongly suggests that the model was trained, even if those weights aren’t available. |
|
I would like to provide some feedback about structure. I quickly skimmed through the original paper and found that AptaNet is essentially a multi-layer perceptron (MLP) with some pre-processing that includes PseAAC, random forest, etc. Personally, I think it would be better to have the pre-processing logic (currently in Also, I noticed the authors mention applying a neighborhood cleaning algorithm to address class imbalance. In their code, this corresponds to the following snippet between the random forest and the actual neural network: imblearn.under_sampling import NeighbourhoodCleaningRule
# ...
# apply random forest
ncr = NeighbourhoodCleaningRule()
x_resampled, y_resampled = ncr.fit_resample(x, y)
# apply MLP
# ...I may have missed it, but it doesn't appear to be included in our current implementation. Indeed, this is a step needed only when the dataset is skewed in favour of one class, so we could simply have an optional argument that applies or does not apply such step. |
There was a problem hiding this comment.
As promised, some comments on the deep neural network.
For multiple layers, I prefer using nn.Sequential as a container rather than hardcode them. This approach offers better decoupling of logic, readability, and reusability (https://github.com/FrancescoSaverioZuppichini/Pytorch-how-and-when-to-use-Module-Sequential-ModuleList-and-ModuleDict).
I would also suggest to make AptaNet architecture customizable with arguments for number of layers, dropout, etc.. Finally, I would argue that random forest and training should be outside the class where the architecture itself is defined, for instance directly in some example/tutorial notebook. Motivations: seperation of concerns and, in the context of training, align with PyTorch/torchvision style.
In particular, I think having feature extraction (random forest + SelectFromModel) here could be problematic. SelectFromModel transforms some features from shape n to m where m << n. However, m is unknown at priori. This means that we would have to "delay" initialization of fully-connected layers until m is know by inspecting the output from SelectFromModel. Currently this is hardcoded as input_dim=639 but this won't always work.
That said, below is an example of how the AptaNet deep neural network could be refactored.
import torch.nn as nn
from torch import Tensor
# Each AptaNet hidden layer has the same three components: (nn.Linear - Activation - AlphaDropout)
# Thus, we can simplify our code by having a function that returns a nn.Sequential container of
# them. This also helps in reducing code duplication btw!
def aptanet_layer(input_dim: int, output_dim: int, dropout: float) -> nn.Sequential:
"""Create a single AptaNet layer with AlphaDropout and ReLU activation."""
return nn.Sequential(
nn.Linear(input_dim, output_dim),
nn.ReLU(),
nn.AlphaDropout(dropout),
)
class AptaNet(nn.Module):
"""AptaNet deep neural network for classification."""
def __init__(
self,
n_layers: int,
input_dim: int,
hidden_dim: int,
output_dim: int,
dropout: float,
) -> None:
super().__init__()
assert n_layers > 0, "Number of hidden layers must be greater than 0."
self.model = self._init_model(n_layers, input_dim, hidden_dim, output_dim, dropout)
def _init_model(
self,
n_layers: int,
input_dim: int,
hidden_dim: int,
output_dim: int,
dropout: float,
) -> nn.Sequential:
"""Initialize AptaNet's deep neural network."""
model = [aptanet_layer(input_dim, hidden_dim, dropout)]
for _ in range(n_layers):
model.append(aptanet_layer(hidden_dim, hidden_dim, dropout))
model.append(nn.Linear(hidden_dim, output_dim))
model.append(nn.Sigmoid())
return nn.Sequential(*model)
def forward(self, x: Tensor) -> Tensor:
# thanks to nn.Sequential() we can now use the model directly, rather than applying each
# architectural component manually
return self.model(x)There was a problem hiding this comment.
agree with the principles, great ideas!
defaults should probably be what is currently hard-coded
Hmm the only preprocessing happening is in
Did not add it as it seemed optional, but I could certainly add a method giving that provision to users. Giving my 2 cents to it, I don't believe data preparation should be combined with the main algorithm, given it's not something we need to do before sending data through AptaNet, and is a part of data preprocessing in general.
I heavily disagree with this, given that we will be changing the architecture completely and the implementation will no longer be of AptaNet, but something completely different, if that makes sense. Other changes (especially the code block) are pretty interesting and eye-opening! I will definitely try integrating them into the PR. Thanks @NennoMP ! |
|
I see - pickling might be failing due to |
So.....what to do about it? |
|
Can you check why it is failing locally but not remove? E.g., discrepancies in versions |
Great guess! For future reference: In CI tests under "Install packages and dependencies" one can see all the packages installed in the testing env. |
There was a problem hiding this comment.
Great! Looks like it works now.
- signature: please make all the parameters in
AptaNetPipelineexplicit.pairs_to_featuresis not a public utility. - I would make
AptaNetClassifierpublic, and expose the classifier choice as an argclassifierinAptaNetPipeline. The default is the default ofAptaNetClassifier(or,None; and make sure you clone and do not overwrite__init__params) - docstring: please add a reference to the algorithm in the title, e.g., what algorithm is it? Reference the source prominently
- docstring: please avoid double newlines
- docstring: docstrings should make clear which component a parameter applies to.
Question: what implies python<3.13?
Only the classifier? The
I thought before and after every list we should add 2 newlines? Was that not what we discussed in the PSeAAC PR?
skorch requires versions <3.13. |
Should I do that even for |
Yes, but it should be a choice up to the user. Any sklearn compatible classifier should work.
You expose it as
Yes, you are right - I mean there are instances of three newlines throughout your docstrings. The max should be two, and I am surprised that the linting does not catch this. |
| self.pipeline_.fit(X, y) | ||
|
|
||
| def predict(self, X): | ||
| if not hasattr(self, "pipeline_"): |
There was a problem hiding this comment.
I would use the scikit-learn idiomatic check_is_fitted here
| @@ -0,0 +1,93 @@ | |||
| from itertools import product | |||
There was a problem hiding this comment.
if these are specifically to aptanet, move then to aptanet
| return kmer_freq | ||
|
|
||
|
|
||
| def pairs_to_features(X, k=4): |
There was a problem hiding this comment.
this looks like it should either be in aptanet or pseaac folder
There was a problem hiding this comment.
I thought we were keeping utility functions inside the utils directory? Why move it, and more importantly, to where (given its a utility function) and how to put it inside pseaac or aptanet (file name, sub folder name)?
|
As discussed in the daily, we will be keeping the utils only for aptanet in a private file inside the |
fkiraly
left a comment
There was a problem hiding this comment.
just some docstring remarks
Merge after #28.
Solves #13.
Adds AptaNet, a binary classification algorithm to predict if an aptamer will bind to the protein or not.