A robust R machine learning pipeline for classification tasks and survival analysis
You can install the development version of pipeML from GitHub with:
# install.packages("pak")
pak::pkg_install("VeraPancaldiLab/pipeML")
pipeML is a robust R-based pipeline designed to streamline the training and testing of machine learning models for classification tasks. It is developed for fast, user-friendly deployment while maintaining the flexibility and complexity needed for rigorous, reliable implementation at every stage of the pipeline (Figure 1).
Figure 1. Machine learning pipeline.
- Stratified data split
- Iterative Boruta algorithm for feature selection
- Repeated k-fold cross validation (kCV)
- Hyperparameter tuning based on AUROC, AUPRC or Accuracy
- Stratified k-fold construction
- SHAP values implementation for feature importance
- Model stacking implementation based on GLM
- Visualization functions for ROC and PR curves
- 14 machine learning methods implemented:
  - Bagged CART
  - Random Forest (RF)
  - C5.0
  - Logistic regression (LG)
  - CART
  - Naive Bayes (NB)
  - Regularized Lasso
  - Ridge regression
  - Linear Discriminant Analysis (LDA)
  - Regularized Logistic Regression (Elastic net)
  - K-nearest neighbors (KNN)
  - Support vector machine with radial kernel (SVMr)
  - Support vector machine with linear kernel (SVMl)
  - Extreme Gradient Boosting (XGBoost)
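The stratified k-fold construction and repeated cross-validation listed among the features can be illustrated with a small base R sketch. This is not pipeML's internal code: `make_stratified_folds` is a hypothetical helper written only for this example, and the toy labels are made up. It shows the general idea of assigning samples to folds class by class so that every fold keeps the overall class ratio.

```r
# Minimal base R sketch of stratified k-fold assignment.
# `make_stratified_folds` is a hypothetical helper, not a pipeML function.
make_stratified_folds <- function(y, k = 5, seed = 123) {
  set.seed(seed)
  y <- as.factor(y)
  folds <- integer(length(y))
  for (cls in levels(y)) {
    idx <- which(y == cls)
    # Shuffle the indices of this class, then deal them out round-robin
    idx <- idx[sample.int(length(idx))]
    folds[idx] <- rep_len(seq_len(k), length(idx))
  }
  folds
}

# Toy labels: 30 responders ("R") and 70 non-responders ("NR")
y <- c(rep("R", 30), rep("NR", 70))

# Repeated k-fold: redo the assignment with a different seed per repetition
n_rep <- 3
repeated_folds <- lapply(seq_len(n_rep), function(r) {
  make_stratified_folds(y, k = 5, seed = 123 + r)
})

# Every fold holds 6 "R" and 14 "NR" samples, preserving the 30/70 ratio
table(fold = repeated_folds[[1]], class = y)
```

Dealing shuffled indices out round-robin within each class keeps fold sizes within one sample of each other per class, which matters when the positive class is small.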
These basic examples show how to use pipeML for different tasks. For a detailed tutorial, see Get started.
library(pipeML)
compute_features.training.ML(): This function trains machine learning models on a single dataset using repeated k-fold cross-validation. It supports feature selection via Boruta, optional model stacking, flexible hyperparameter tuning, and the construction of k-folds stratified by cohort when this information is available. Use it when you do not have a separate prediction dataset: models are trained and evaluated on different folds of the same dataset.
# 5-fold cross-validation repeated 10 times, tuned on AUROC
res_ml <- compute_features.training.ML(features_train, clinical$Response, "CR",
                                       metric = "AUROC", stack = FALSE, k_folds = 5,
                                       n_rep = 10, feature.selection = FALSE, seed = 123,
                                       file_name = "Test", ncores = 2, return = TRUE)
After training, predictions on new data can be computed using the compute_prediction() function. You can specify which metric to maximize when determining the optimal classification threshold. Supported values for maximize include: "Accuracy", "Precision", "Recall", "Specificity", "Sensitivity", "F1", and "MCC".
pred <- compute_prediction(res_ml, features_test, traitData_test$Response,
                           "CR", stack = FALSE, file.name = "Test",
                           maximize = "Accuracy", return = TRUE)
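To make the idea of choosing a classification threshold that maximizes a metric more concrete, here is a minimal base R sketch. It is an illustration under stated assumptions, not the way compute_prediction() is implemented: `best_threshold` is a hypothetical helper, it covers only "Accuracy" and "F1", and the probabilities and labels are invented toy data.

```r
# Hypothetical helper illustrating threshold selection; not pipeML code.
best_threshold <- function(prob, truth, positive, metric = c("Accuracy", "F1")) {
  metric <- match.arg(metric)
  is_pos <- truth == positive
  # Candidate thresholds: every distinct predicted probability
  thresholds <- sort(unique(prob))
  scores <- vapply(thresholds, function(t) {
    pred_pos <- prob >= t
    tp <- sum(pred_pos & is_pos)
    fp <- sum(pred_pos & !is_pos)
    fn <- sum(!pred_pos & is_pos)
    tn <- sum(!pred_pos & !is_pos)
    if (metric == "Accuracy") {
      (tp + tn) / length(truth)
    } else {
      # F1 = 2TP / (2TP + FP + FN); set to 0 when the denominator is 0
      if (2 * tp + fp + fn == 0) 0 else 2 * tp / (2 * tp + fp + fn)
    }
  }, numeric(1))
  best <- which.max(scores)  # on ties, the lowest threshold wins
  list(threshold = thresholds[best], score = scores[best])
}

# Toy predicted probabilities for the positive class "CR"
prob  <- c(0.10, 0.35, 0.40, 0.65, 0.80, 0.90)
truth <- c("NR", "NR", "CR", "NR", "CR", "CR")

res_thr <- best_threshold(prob, truth, positive = "CR", metric = "Accuracy")
res_thr$threshold  # 0.4
res_thr$score      # 5/6, about 0.833
```

In this toy data both 0.40 and 0.80 reach an accuracy of 5/6; the sketch keeps the lower one because `which.max` returns the first maximum. A real implementation may break ties differently.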
compute_features.ML(): This function trains on one dataset and evaluates on a separate test dataset when one is available. It automatically computes predictions on the provided test set using the trained model, combining the two functions above.
res <- compute_features.ML(tme_features_train[[i]], tme_features_test[[i]],
                           clinical = traitData, trait = "Response",
                           trait.positive = "R", metric = "AUROC", stack = FALSE,
                           k_folds = 2, n_rep = 1, feature.selection = FALSE,
                           seed = 123, LODO = TRUE, batch_id = "Cohort",
                           ncores = 2, maximize = "Accuracy", return = FALSE)
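The call above sets LODO together with batch_id = "Cohort". Assuming LODO refers to leave-one-dataset-out validation (this README does not spell the acronym out), the splitting logic might look like the base R sketch below. `lodo_splits` is a hypothetical helper and the small `clinical_toy` table is made up; pipeML's own fold construction may differ.

```r
# Assumption: LODO = leave-one-dataset-out, so each cohort is held out once.
# `lodo_splits` is a hypothetical helper, not a pipeML function.
lodo_splits <- function(batch) {
  batch <- as.character(batch)
  cohorts <- unique(batch)
  splits <- lapply(cohorts, function(b) {
    list(train = which(batch != b), test = which(batch == b))
  })
  names(splits) <- cohorts
  splits
}

# Toy clinical table with a response and a cohort label per sample
clinical_toy <- data.frame(
  Response = c("R", "NR", "R", "NR", "R", "NR", "R"),
  Cohort   = c("A", "A", "B", "B", "B", "C", "C")
)

splits <- lodo_splits(clinical_toy$Cohort)
names(splits)        # one split per cohort: "A" "B" "C"
splits[["B"]]$test   # rows 3, 4, 5 are held out together
splits[["B"]]$train  # rows 1, 2, 6, 7 are used for training
```

Holding out whole cohorts keeps samples from the same batch out of both training and validation at once, so the estimate reflects performance on an unseen dataset rather than on unseen samples from a familiar one.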
If you encounter any problems or have questions about the package, we encourage you to open an issue here. We’ll do our best to assist you!
pipeML was developed by Marcelo Hurtado under the supervision of Vera Pancaldi and is part of the Pancaldi team. Marcelo is currently the primary maintainer of this package.
If you use pipeML in a scientific publication, we would appreciate a citation:
Hurtado M, Pancaldi V (2025). pipeML: A robust R machine learning pipeline for classification tasks and survival analysis. R package version 0.0.1, https://github.com/VeraPancaldiLab/pipeML.