[ENH] AptaNet algorithm #30

Merged
fkiraly merged 92 commits into main from issue13 on Aug 14, 2025

@satvshr
Collaborator

@satvshr satvshr commented Jul 7, 2025

Merge after #28.
Solves #13.
Adds AptaNet, a binary classification algorithm that predicts whether an aptamer will bind to a given protein.

@satvshr satvshr marked this pull request as draft July 7, 2025 07:57
@satvshr
Collaborator Author

satvshr commented Jul 7, 2025

To do: Add tests comparing it to the implementation in the official AptaNet repository.

Contributor

@fkiraly fkiraly left a comment

Some questions for my understanding: what is this trained on?

Are there pre-trained weights? If so, where?
If we can also train on own data, how does that work?

@satvshr
Collaborator Author

satvshr commented Jul 7, 2025

Are there pre-trained weights? If so, where?

Nope, there are no pre-trained weights.

What is this trained on?

This is answered in the "Data collection" section of the original paper (page 11)

If we can also train on our own data, how does that work?

I assume you pass the aptamer and target sequences as X, and y is a binary label (whether the pair binds or not).

@fkiraly
Contributor

fkiraly commented Jul 9, 2025

What is this trained on?

This is answered in the "Data collection" section of the original paper (page 11)

Can you give a short summary in your own words, or say that you do not know?

If there are no pretrained weights, then there is nothing this was trained on.

Contributor

@fkiraly fkiraly left a comment

Great!

Request to change the signature of the class:

  • move the __init__ args to the method generate_final_vector
  • rename the latter to transform
  • change the output to a 1D np.ndarray
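
To make the requested signature concrete, here is a minimal sketch of a class with no __init__ arguments whose transform method takes the sequences and returns a 1D np.ndarray. The class name PairFeatureSketch and the toy features (nucleotide counts plus protein length) are assumptions for illustration only; they are not the features computed in this PR.

```python
import numpy as np


class PairFeatureSketch:
    """Hypothetical stand-in for the feature builder discussed above."""

    def transform(self, aptamer_sequence: str, protein_sequence: str) -> np.ndarray:
        """Return a flat (1D) feature vector for one aptamer-protein pair."""
        # Placeholder features, NOT the real AptaNet encoding:
        # counts of each nucleotide in the aptamer ...
        aptamer_part = [aptamer_sequence.count(base) for base in "ACGT"]
        # ... followed by the length of the protein sequence.
        protein_part = [len(protein_sequence)]
        return np.asarray(aptamer_part + protein_part, dtype=float)
```

With this shape of API the sequences are passed per call, for example PairFeatureSketch().transform("ACGTA", "MKV"), instead of being stored on the instance at construction time.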

Also, usual quality requests:

  • please make sure to add docstring examples
  • please make sure to add module docstrings

@satvshr
Collaborator Author

satvshr commented Jul 10, 2025

Can you give a short summary in your own words, or say that you do not know?

Had a look again, and in the original repository they do have a Dataset.csv file.

If there are no pretrained weights, then there is nothing this was trained on.

I don’t necessarily agree with that point. Just because the weights weren’t saved or shared doesn’t imply that the model was never trained. The original AptaNet paper includes a results section, which strongly suggests that the model was trained, even if those weights aren’t available.

@NennoMP
Collaborator

NennoMP commented Jul 10, 2025

I would like to provide some feedback about structure.
@fkiraly may have a different opinion, so take this with a grain of salt.

I quickly skimmed through the original paper and found that AptaNet is essentially a multi-layer perceptron (MLP) with some pre-processing that includes PseAAC, random forest, etc. Personally, I think it would be better to have the pre-processing logic (currently in aptanet.py) in separate classes and/or utility methods. I would also call AptaNet the class containing the architecture (currently defined in class MLP).

Also, I noticed the authors mention applying a neighborhood cleaning algorithm to address class imbalance. In their code, this corresponds to the following snippet between the random forest and the actual neural network:

from imblearn.under_sampling import NeighbourhoodCleaningRule
# ...
# apply random forest
ncr = NeighbourhoodCleaningRule()
x_resampled, y_resampled = ncr.fit_resample(x, y)
# apply MLP
# ...

I may have missed it, but it doesn't appear to be included in our current implementation. That said, this step is only needed when the dataset is skewed in favour of one class, so we could simply have an optional argument that turns it on or off.

Comment thread pyaptamer/AptaNet/neural_net.py Outdated
Collaborator

@NennoMP NennoMP Jul 10, 2025

As promised, some comments on the deep neural network.

For multiple layers, I prefer using nn.Sequential as a container rather than hardcoding them. This approach offers better decoupling of logic, readability, and reusability (https://github.com/FrancescoSaverioZuppichini/Pytorch-how-and-when-to-use-Module-Sequential-ModuleList-and-ModuleDict).

I would also suggest making the AptaNet architecture customizable, with arguments for the number of layers, dropout, etc. Finally, I would argue that the random forest and the training should live outside the class where the architecture itself is defined, for instance directly in an example/tutorial notebook. Motivations: separation of concerns and, in the context of training, alignment with PyTorch/torchvision style.

In particular, I think having feature extraction (random forest + SelectFromModel) here could be problematic. SelectFromModel reduces the features from dimension n to m, where m << n. However, m is unknown a priori. This means that we would have to "delay" the initialization of the fully-connected layers until m is known, by inspecting the output of SelectFromModel. Currently this is hardcoded as input_dim=639, but this won't always work.
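
One possible way to handle an input width m that is only known after feature selection is to let PyTorch infer it on the first forward pass with torch.nn.LazyLinear. The sketch below illustrates that idea only; it is not what this PR implements, and the function name build_lazy_mlp, the hidden width of 128, and the 37 input features are made-up values.

```python
import torch
from torch import nn


def build_lazy_mlp(hidden_dim: int = 128) -> nn.Sequential:
    """Build a small MLP whose input width is inferred at the first forward pass."""
    return nn.Sequential(
        nn.LazyLinear(hidden_dim),  # in_features is left unspecified here
        nn.ReLU(),
        nn.Linear(hidden_dim, 1),
        nn.Sigmoid(),
    )


net = build_lazy_mlp()
# 37 stands in for the unknown m produced by the feature selection step.
selected_features = torch.randn(4, 37)
probabilities = net(selected_features)  # first call materializes the lazy layer
```

After the first call, the first layer behaves like a regular nn.Linear with in_features equal to the width of the data it saw. An alternative with the same effect is to build the network inside fit, once the selector's output shape is known.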

That said, below is an example of how the AptaNet deep neural network could be refactored.

import torch.nn as nn
from torch import Tensor

# Each AptaNet hidden layer has the same three components: (nn.Linear - Activation - AlphaDropout)
# Thus, we can simplify our code by having a function that returns a nn.Sequential container of 
# them. This also helps in reducing code duplication btw!
def aptanet_layer(input_dim: int, output_dim: int, dropout: float) -> nn.Sequential:
   """Create a single AptaNet layer with AlphaDropout and ReLU activation."""
   return nn.Sequential(
       nn.Linear(input_dim, output_dim),
       nn.ReLU(),
       nn.AlphaDropout(dropout),
   )

class AptaNet(nn.Module):
   """AptaNet deep neural network for classification."""
   
   def __init__(
       self, 
       n_layers: int, 
       input_dim: int, 
       hidden_dim: int,
       output_dim: int,
       dropout: float,
   ) -> None:
       super().__init__()
       assert n_layers > 0, "Number of hidden layers must be greater than 0."
       self.model = self._init_model(n_layers, input_dim, hidden_dim, output_dim, dropout)

   def _init_model(
       self, 
       n_layers: int,
       input_dim: int, 
       hidden_dim: int, 
       output_dim: int, 
       dropout: float,
   ) -> nn.Sequential:
       """Initialize AptaNet's deep neural network."""
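       # The first block maps input_dim -> hidden_dim; the remaining n_layers - 1
       # blocks map hidden_dim -> hidden_dim, giving n_layers hidden blocks in total.
       model = [aptanet_layer(input_dim, hidden_dim, dropout)]
       for _ in range(n_layers - 1):
           model.append(aptanet_layer(hidden_dim, hidden_dim, dropout))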
       model.append(nn.Linear(hidden_dim, output_dim))
       model.append(nn.Sigmoid())
       return nn.Sequential(*model)
   
   def forward(self, x: Tensor) -> Tensor:
       # thanks to nn.Sequential() we can now use the model directly, rather than applying each 
       # architectural component manually
       return self.model(x)

Contributor

agree with the principles, great ideas!

defaults should probably be what is currently hard-coded

@satvshr
Collaborator Author

satvshr commented Jul 11, 2025

I think it would be better to have the pre-processing logic (currently in aptanet.py) in separate classes and/or utility methods.

Hmm, the only preprocessing happens in _generate_kmer_vecs, which on second thought should be moved to utils.py in the root, given it's a k-mer generating function that other algorithms may use. generate_final_vector can be renamed to preprocessing, since that is what the function does (it combines the k-mer frequency vector with the vector generated by the PseAAC encoding algorithm) before the vector is sent through the neural net. I can then delete neural_net.py and move everything into one file called aptanet.py. Sounds good, @NennoMP @fkiraly?
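
For readers unfamiliar with the k-mer frequency vector mentioned above, the following is a minimal, standard-library sketch of the idea. The function name kmer_frequencies, the single fixed k, the DNA alphabet, and the normalization by the total count are assumptions for illustration; the actual _generate_kmer_vecs in this PR may differ in all of these.

```python
from itertools import product


def kmer_frequencies(sequence: str, k: int = 2, alphabet: str = "ACGT") -> list[float]:
    """Return normalized counts of every possible k-mer, in a fixed order."""
    # Enumerate all len(alphabet) ** k possible k-mers so the vector
    # has the same length and ordering for every input sequence.
    kmers = ["".join(letters) for letters in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)

    # Slide a window of width k over the sequence and count each k-mer.
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        if window in counts:  # skip windows with unexpected characters
            counts[window] += 1

    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]
```

Because the output length depends only on k and the alphabet, vectors from different sequences can be concatenated with other encodings (such as the PseAAC vector) into one fixed-width feature vector.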

Indeed, this is a step needed only when the dataset is skewed in favour of one class, so we could simply have an optional argument that applies or does not apply such a step.

I did not add it as it seemed optional, but I could certainly add a method that gives users that option. My two cents: I don't believe data preparation should be combined with the main algorithm, given that it's not something we need to do before sending data through AptaNet and is part of data preprocessing in general.

I would also suggest to make AptaNet architecture customizable with arguments for the number of layers, dropout, etc..

I strongly disagree with this: we would be changing the architecture completely, and the implementation would no longer be AptaNet but something different, if that makes sense.

Other changes (especially the code block) are pretty interesting and eye-opening! I will definitely try integrating them into the PR. Thanks @NennoMP !
@fkiraly I would appreciate your take, given we do not agree on the above topics.

@fkiraly
Contributor

fkiraly commented Aug 10, 2025

I see - pickling might be failing due to torch objects - I am not sure why it does not fail on the remote?

@satvshr
Collaborator Author

satvshr commented Aug 10, 2025

I see - pickling might be failing due to torch objects - I am not sure why it does not fail on the remote?

So... what should we do about it?

@fkiraly
Contributor

fkiraly commented Aug 10, 2025

Can you check why it is failing locally but not on the remote? E.g., discrepancies in versions.

@satvshr
Collaborator Author

satvshr commented Aug 10, 2025

discrepancies in versions

Great guess! For future reference: in the CI tests, under "Install packages and dependencies", one can see all the packages installed in the testing env.
I had to update my skorch version, which solved the bugs. Now only a warning is thrown locally. I can suppress it, but it does not seem to be a big deal given that it is not an error:

pyaptamer/aptanet/tests/test_aptanet.py::test_sklearn_compatible_estimator[AptaNetFeaturesClassifier()-check_n_features_in_after_fitting]
 UserWarning: The least populated class in y has only 4 members, which is less than n_splits=5.
    warnings.warn(
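
To compare a local environment against the package list shown in CI, one option is to print the installed versions with the standard library. This is a small sketch; the helper name installed_versions and the package list are assumptions chosen for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Example package names; adjust to whatever the CI log lists.
    for name, ver in installed_versions(["skorch", "torch", "scikit-learn"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Diffing this output against the CI step's install log should surface mismatches such as an outdated skorch.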

Contributor

@fkiraly fkiraly left a comment

Great! Looks like it works now.

  • signature: please make all the parameters in AptaNetPipeline explicit. pairs_to_features is not a public utility.
  • I would make AptaNetClassifier public, and expose the classifier choice as an arg classifier in AptaNetPipeline. The default is the default of AptaNetClassifier (or, None; and make sure you clone and do not overwrite __init__ params)
  • docstring: please add a reference to the algorithm in the title, e.g., what algorithm is it? Reference the source prominently
  • docstring: please avoid double newlines
  • docstring: docstrings should make clear which component a parameter applies to.
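
The "clone and do not overwrite __init__ params" point can be sketched as follows. This is a hedged illustration of the scikit-learn convention, not the code in this PR: PipelineSketch is a hypothetical class, and LogisticRegression merely stands in for whatever the real default classifier would be.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression


class PipelineSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical estimator exposing the classifier choice as an argument."""

    def __init__(self, classifier=None):
        # Store the argument untouched; never replace None with a default here.
        self.classifier = classifier

    def fit(self, X, y):
        # Resolve the default at fit time (stand-in default, an assumption).
        base = LogisticRegression() if self.classifier is None else self.classifier
        # Fit a clone so the user's object stays unfitted and unmodified.
        self.classifier_ = clone(base)
        self.classifier_.fit(X, y)
        self.classes_ = self.classifier_.classes_
        return self

    def predict(self, X):
        return self.classifier_.predict(X)
```

Keeping self.classifier exactly as passed is what lets get_params, set_params, and clone round-trip the estimator, while the fitted copy lives in the trailing-underscore attribute classifier_.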

Question: what implies python<3.13?

@satvshr
Collaborator Author

satvshr commented Aug 11, 2025

  • expose the classifier choice as an arg classifier in AptaNetPipeline

Only the classifier? The AptaNetFeaturesClassifier (rename to AptaNetClassifier) contains the random forest classifier along with the AptaNetMLP, so do you want the random forest classifier as a "classifier choice"?
Edit: Only making the classifier as a choice, not the network.

  • docstring: please avoid double newlines

I thought before and after every list we should add 2 newlines? Was that not what we discussed in the PSeAAC PR?

Question: what implies python<3.13?

skorch requires Python versions <3.13.

@satvshr
Collaborator Author

satvshr commented Aug 12, 2025

  • signature: please make all the parameters in AptaNetPipeline explicit. pairs_to_features is not a public utility.

Should I do that even for AptaNetClassifier?

@fkiraly
Contributor

fkiraly commented Aug 12, 2025

so do you want the random forest classifier as a "classifier choice"? Edit: Only making the classifier as a choice, not the network.

Yes, but it should be a choice up to the user. Any sklearn compatible classifier should work.

Should I do that even for AptaNetClassifier?

You expose it as classifier. Its parameters will be explicit because it, in turn, accepts named parameters rather than kwargs, so that will not be necessary as long as AptaNetClassifier does the same.

I thought before and after every list we should add 2 newlines? #29 (comment)?

Yes, you are right - I mean there are instances of three newlines throughout your docstrings. The max should be two, and I am surprised that the linting does not catch this.
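
Since the linter apparently does not flag this, a small standalone check could look like the sketch below. It uses only the standard library; the function name and the exact rule (three or more consecutive newlines inside a docstring, allowing whitespace-only lines in between) are assumptions based on the comment above.

```python
import ast
import re

# Three or more newlines in a row, allowing whitespace-only lines in between.
_EXTRA_BLANK = re.compile(r"\n[ \t]*\n[ \t]*\n")


def docstrings_with_extra_blank_lines(source: str) -> list[str]:
    """Return names of modules, classes, and functions with offending docstrings."""
    offenders = []
    documented = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, documented):
            continue
        # clean=False keeps the raw docstring, including its blank lines.
        doc = ast.get_docstring(node, clean=False)
        if doc and _EXTRA_BLANK.search(doc):
            offenders.append(getattr(node, "name", "<module>"))
    return offenders
```

Running this over each file's source text would list the docstrings that need trimming.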

Comment thread pyaptamer/aptanet/_pipeline.py Outdated
        self.pipeline_.fit(X, y)

    def predict(self, X):
        if not hasattr(self, "pipeline_"):
Contributor

I would use the scikit-learn idiomatic check_is_fitted here
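
A minimal sketch of that idiom, replacing the hasattr test with check_is_fitted. The class FittedCheckSketch and the DummyClassifier inside it are placeholders for illustration; only the pipeline_ attribute name comes from the snippet under review.

```python
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.dummy import DummyClassifier
from sklearn.utils.validation import check_is_fitted


class FittedCheckSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical estimator showing the fitted-state check in predict."""

    def fit(self, X, y):
        # Placeholder inner model; the real code fits its own pipeline here.
        self.pipeline_ = DummyClassifier(strategy="most_frequent").fit(X, y)
        return self

    def predict(self, X):
        # Raises sklearn.exceptions.NotFittedError if fit was never called.
        check_is_fitted(self, "pipeline_")
        return self.pipeline_.predict(X)
```

Compared with a manual hasattr check, this raises the exception type that scikit-learn tooling and users already expect.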

@@ -0,0 +1,93 @@
from itertools import product
Contributor

if these are specific to aptanet, move them to aptanet

return kmer_freq


def pairs_to_features(X, k=4):
Contributor

this looks like it should either be in aptanet or pseaac folder

Collaborator Author

I thought we were keeping utility functions inside the utils directory? Why move it and, more importantly, where to (given it's a utility function)? And how should it be placed inside pseaac or aptanet (file name, subfolder name)?

Contributor

@fkiraly fkiraly left a comment

I left minimal comments

@satvshr
Collaborator Author

satvshr commented Aug 13, 2025

As discussed in the daily, we will keep the aptanet-only utils in a private file inside the utils folder for now. I added check_is_fitted as requested. I had removed it because it was not needed to pass the sklearn checks and I was afraid they would fail, but there are no issues and all tests pass.

@satvshr satvshr requested a review from fkiraly August 13, 2025 09:27
Comment thread pyaptamer/aptanet/_pipeline.py Outdated
Contributor

@fkiraly fkiraly left a comment

just some docstring remarks

@satvshr satvshr requested a review from fkiraly August 14, 2025 13:33
@fkiraly fkiraly merged commit f6f0e49 into main Aug 14, 2025
13 checks passed
@satvshr satvshr deleted the issue13 branch August 14, 2025 20:29
fkiraly pushed a commit that referenced this pull request Sep 2, 2025
Labels

enhancement New feature or request
