In this repository you may find a PyTorch implementation of Generative Adversarial Imputation Networks (GAIN) [1] for imputing missing iBAQ values in proteomics datasets.
Here are the main components you'll find in this repository:
- `.github/workflows`
  - contains the workflows that automate the repository's tests
- `Datasets`
  - directory of datasets with missing values from PRIDE that can be used for testing
- `GenerativeProteomics`
  - contains the core package source code
- `docs/source`
  - contains the sources for the documentation of our work (Read the Docs)
- `tests`
  - battery of unit tests to assess the model's functionality
- `use-case`
  - set of clear examples of how to use our model's functionality
  - includes examples of how to install and use the package, how to run the tests, and how to download and use a pre-trained model from Hugging Face
We have submitted a package to the Python Package Index (PyPI) for easy installation. You can install the package using the following command:

```bash
pip install GenerativeProteomics
```

This way, you can install the package and its dependencies in one go.
```python
from GenerativeProteomics import utils, Network, Params, Metrics, Data
import torch
import pandas as pd

# Load your dataset
dataset_path = "your_dataset.tsv"
dataset_df = utils.build_protein_matrix(dataset_path)  # use this function if your dataset is a TSV
# dataset_df = pd.read_csv(dataset_path)  # use this instead if your dataset is a CSV
dataset = dataset_df.values
missing_header = dataset_df.columns.tolist()

# Define your parameters
params = Params(
    input=dataset_path,
    output="imputed.csv",
    ref=None,
    output_folder=".",
    num_iterations=2001,
    batch_size=128,
    alpha=10,
    miss_rate=0.1,
    hint_rate=0.9,
    lr_D=0.001,
    lr_G=0.001,
    override=1,
    output_all=1,
)

# Define the model architecture
input_dim = dataset.shape[1]
h_dim = input_dim
net_G = torch.nn.Sequential(
    torch.nn.Linear(input_dim * 2, h_dim),
    torch.nn.Sigmoid(),
)
net_D = torch.nn.Sequential(
    torch.nn.Linear(input_dim * 2, h_dim),
    torch.nn.Sigmoid(),
)

# Set up the model and data
metrics = Metrics(params)
network = Network(hypers=params, net_G=net_G, net_D=net_D, metrics=metrics)
data = Data(dataset=dataset, miss_rate=0.2, hint_rate=0.9, ref=None)

# Run evaluation and training
network.evaluate(data=data, missing_header=missing_header)
network.train(data=data, missing_header=missing_header)
print("Final Matrix:\n", metrics.data_imputed)
```

For a more detailed explanation of how to use the model and all the functionality we have to offer, see the use-case directory.
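A note on the architecture above: the generator and discriminator each take `input_dim * 2` inputs because, in GAIN, the data vector is concatenated with a same-sized mask (for the generator) or hint vector (for the discriminator) [1]. A minimal sketch of that concatenation, illustrative only and not the package's internals:

```python
import torch

input_dim = 5
x = torch.rand(3, input_dim)  # a mini-batch of (normalized) data
mask = (torch.rand(3, input_dim) > 0.2).float()  # 1 = observed, 0 = missing

# GAIN feeds the generator the data (with noise filling the missing slots)
# concatenated with the mask, hence the Linear(input_dim * 2, ...) layers
noise = torch.rand(3, input_dim)
g_input = torch.cat([mask * x + (1 - mask) * noise, mask], dim=1)
print(g_input.shape)  # torch.Size([3, 10])
```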
If you prefer to use the code of the GenerativeProteomics model directly, you can access it in our GitHub repository and follow this sequence of commands:

- Clone this repository:

  ```bash
  git clone https://github.com/QuantitativeBiology/GenerativeProteomics/
  ```

- Create a Python environment (if you have conda installed):

  ```bash
  conda create -n proto python=3.10
  ```

- Activate the newly created environment:

  ```bash
  conda activate proto
  ```

- Install the necessary packages:

  ```bash
  pip install -r libraries.txt
  ```
If you just want to impute a general dataset, the most straightforward way to run GenerativeProteomics is:

```bash
python generativeproteomics.py -i /path/to/file_to_impute.csv
```
Running in this manner will result in two separate training phases.
- Evaluation run: a percentage of the values (10% by default) is concealed during the training phase and the dataset is then imputed. The RMSE is calculated using those hidden values as targets, and at the end of the training phase a `test_imputed.csv` file is created containing the original hidden values and the resulting imputation, giving you an estimate of the imputation accuracy.
- Imputation run: a proper training phase then takes place using the entire dataset. An `imputed.csv` file is created containing the imputed dataset.
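The evaluation run's accuracy estimate boils down to concealing some observed values, imputing them, and computing the RMSE over only those entries. A minimal NumPy sketch of that idea, using column-mean imputation as a stand-in for the model (this is not the package's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy complete matrix standing in for a dataset (hypothetical values)
data = rng.normal(size=(100, 5))

# Conceal 10% of the values, as in the evaluation run's default
miss_rate = 0.1
mask = rng.random(data.shape) < miss_rate  # True = concealed
observed = data.copy()
observed[mask] = np.nan

# Pretend imputation: fill missing entries with the column means
col_means = np.nanmean(observed, axis=0)
imputed = np.where(np.isnan(observed), col_means, observed)

# RMSE over the concealed entries only, with the hidden values as targets
rmse = np.sqrt(np.mean((imputed[mask] - data[mask]) ** 2))
print(round(float(rmse), 3))
```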
However, there are a few arguments you may want to change. You can set them in a parameters.json file (you may find an example in GenerativeProteomics/breast/parameters.json) or pass them directly on the command line.

Run with a parameters.json file:

```bash
python generativeproteomics.py --parameters /path/to/parameters.json
```

Run with command-line arguments:

```bash
python generativeproteomics.py -i /path/to/file_to_impute.csv -o imputed_name --ofolder ./results/ --it 2001
```
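For orientation, a parameters.json might look like the sketch below. The field names here are guesses mirroring the `Params` arguments shown in the Python example above; check the example file in GenerativeProteomics/breast/parameters.json for the actual schema:

```json
{
    "input": "/path/to/file_to_impute.csv",
    "output": "imputed",
    "output_folder": "./results/",
    "num_iterations": 2001,
    "batch_size": 128,
    "alpha": 10,
    "miss_rate": 0.1,
    "hint_rate": 0.9,
    "lr_D": 0.001,
    "lr_G": 0.001,
    "override": 1,
    "output_all": 1
}
```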
Instead of running the GenerativeProteomics model, you can always use other imputation methods. To do so, all you need is the --model flag.

Run the following command to use an alternative imputation method:

```bash
python generativeproteomics.py -i /path/to/file_to_impute.csv --model <name_of_model>
```
- `-i`: path to the file to impute
- `-o`: name of the imputed file
- `--ofolder`: path to the output folder
- `--it`: number of iterations to train the model
- `--miss`: percentage of values to be concealed during the evaluation run (from 0 to 1)
- `--outall`: set this argument to 1 if you want to output every metric
- `--override`: set this argument to 1 if you want to delete the previously created files when writing the new output
- `--model`: name of the imputation method to run (defaults to the GenerativeProteomics model)
If you want to test the efficacy of the code, you may provide a reference file containing a complete version of the dataset (without missing values):

```bash
python generativeproteomics.py -i /path/to/file_to_impute.csv --ref /path/to/complete_dataset.csv
```

Running this way will calculate the RMSE of the imputation relative to the complete dataset.
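Conceptually, this reference-based RMSE only looks at the positions that were missing in the input. A small self-contained sketch of that comparison, with toy in-memory DataFrames standing in for the CSV files (not the package's implementation):

```python
import numpy as np
import pandas as pd

# Toy stand-ins: the file with gaps, the complete reference, and an imputation
incomplete = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})
reference = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
imputed = pd.DataFrame({"a": [1.0, 2.5, 3.0], "b": [4.5, 5.0, 6.0]})

# RMSE only over the entries that were missing in the input
missing = incomplete.isna().to_numpy()
diff = imputed.to_numpy()[missing] - reference.to_numpy()[missing]
rmse = float(np.sqrt(np.mean(diff ** 2)))
print(rmse)  # 0.5
```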
In this repository you may find a folder named breast; inside it is a breast cancer diagnostic dataset [2] which you may use to try out the code.

- breast.csv: the complete dataset
- breastMissing_20.csv: the same dataset, but with 20% of its values removed
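If you want to build a similar test file from your own complete dataset, blanking out a random fraction of values is straightforward. A sketch using pandas (not a script shipped with the repository; the 20% rate mirrors breastMissing_20.csv):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy complete dataset standing in for breast.csv
df = pd.DataFrame(rng.normal(size=(100, 10)))

# Blank out roughly 20% of the values at random
mask = rng.random(df.shape) < 0.2
df_missing = df.mask(mask)

# df_missing.to_csv("myDatasetMissing_20.csv", index=False)
print(df_missing.isna().mean().mean())  # close to 0.2
```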
To simply impute breastMissing_20.csv, run:

```bash
python generativeproteomics.py -i ./breast/breastMissing_20.csv
```

If you want to compare the imputation with the original dataset, run:

```bash
python generativeproteomics.py -i ./breast/breastMissing_20.csv --ref ./breast/breast.csv
```

or

```bash
python generativeproteomics.py --parameters ./breast/parameters.json
```
If you want to go deeper into the analysis of every metric, either set --outall to 1 or run the code in an IPython console; that way you can access every variable you want in the metrics object, e.g. metrics.loss_D.
[1] J. Yoon, J. Jordon & M. van der Schaar (2018). GAIN: Missing Data Imputation using Generative Adversarial Nets.

[2] Breast Cancer Wisconsin (Diagnostic) dataset, UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic