pipeML

A robust R machine learning pipeline for classification tasks and survival analysis

Installation

You can install the development version of pipeML from GitHub with:

# install.packages("pak")
pak::pkg_install("VeraPancaldiLab/pipeML")

Description

pipeML is an R-based pipeline that streamlines the training and testing of machine learning models for classification tasks. It is designed for fast, user-friendly deployment while keeping the flexibility needed for a rigorous, reliable implementation at every stage of the pipeline (Figure 1).

Figure 1. Machine learning pipeline.

Key features

  • Stratified data split
  • Iterative Boruta algorithm for feature selection
  • Repeated k-fold cross-validation (kCV)
  • Hyperparameter tuning based on AUROC, AUPRC or Accuracy
  • Stratified k-fold construction
  • SHAP values implementation for feature importance
  • Model stacking implementation based on GLM
  • Visualization functions for ROC and PR curves
  • 14 machine learning methods implemented
    • Bagged CART
    • Random Forest (RF)
    • C50
    • Logistic regression (LG)
    • CART
    • Naive Bayes (NB)
    • Regularized Lasso
    • Ridge regression
    • Linear Discriminant Analysis (LDA)
    • Regularized Logistic Regression (Elastic net)
    • K-nearest neighbors (KNN)
    • Support vector machine with radial kernel (SVMr)
    • Support vector machine with linear kernel (SVMl)
    • Extreme Gradient Boosting (XGBoost)
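
The stratified k-fold construction listed among the key features can be illustrated in base R. This is only a minimal sketch under stated assumptions, not pipeML's implementation: `stratified_folds()` is a hypothetical helper and the response labels are synthetic.

```r
# Hypothetical helper: assign each sample to one of k folds so that every
# fold receives (as nearly as possible) the same number of samples per class.
stratified_folds <- function(y, k, seed = 123) {
  set.seed(seed)
  folds <- integer(length(y))
  for (cls in unique(as.character(y))) {
    idx <- which(y == cls)
    # Shuffle the members of this class, then deal them out to folds in turn
    idx <- idx[sample.int(length(idx))]
    folds[idx] <- rep_len(seq_len(k), length(idx))
  }
  folds
}

# Synthetic, imbalanced response: 20 responders ("CR") and 30 non-responders ("NR")
y <- factor(c(rep("CR", 20), rep("NR", 30)))
folds <- stratified_folds(y, k = 5)

# Each of the 5 folds keeps the 2:3 class ratio of the full dataset
table(folds, y)
```

Dealing shuffled samples out class by class keeps the class ratio stable across folds, which matters most when the positive class is rare.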

General usage

The following basic examples show how to use pipeML for different tasks. For a detailed tutorial, see Get started.

library(pipeML)

compute_features.training.ML(): This function trains machine learning models on a single dataset using repeated k-fold cross-validation. It supports feature selection via Boruta, optional model stacking, flexible hyperparameter tuning, and the construction of k-folds stratified by cohort when that information is available. Use it when no separate prediction dataset is available, so that models are trained on different folds of the same dataset and performance is evaluated across them.

res_ml = compute_features.training.ML(features_train, clinical$Response, "CR",
                                      metric = "AUROC", stack = FALSE, k_folds = 5,
                                      n_rep = 10, feature.selection = FALSE, seed = 123,
                                      file_name = "Test", ncores = 2, return = TRUE)
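
To illustrate what repeated k-fold cross-validation scored by AUROC involves, the sketch below runs it with a plain logistic regression in base R. It is not pipeML's internal code: `auroc()` is a hypothetical helper, the data are simulated, and the folds are left unstratified for brevity.

```r
# Hypothetical helper: rank-based AUROC (equivalent to the Mann-Whitney U statistic)
auroc <- function(score, is_pos) {
  r <- rank(score)
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Simulated data: f1 carries signal, f2 is noise, y = 1 marks the positive class
set.seed(123)
n <- 100
f1 <- rnorm(n)
dat <- data.frame(f1 = f1, f2 = rnorm(n), y = as.integer(f1 + rnorm(n) > 0))

k_folds <- 5
n_rep <- 10
aucs <- numeric(0)
for (rep_i in seq_len(n_rep)) {
  # Draw a fresh random partition into k folds for every repetition
  folds <- sample(rep_len(seq_len(k_folds), n))
  for (f in seq_len(k_folds)) {
    test <- folds == f
    fit <- glm(y ~ f1 + f2, data = dat[!test, ], family = binomial)
    p <- predict(fit, newdata = dat[test, ], type = "response")
    aucs <- c(aucs, auroc(p, dat$y[test] == 1))
  }
}

# One AUROC per held-out fold: k_folds * n_rep values in total
length(aucs)
mean(aucs)
```

Averaging over repetitions reduces the dependence of the performance estimate on any single random partition of the data.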

After training, predictions on new data can be computed using the compute_prediction() function. You can specify which metric to maximize when determining the optimal classification threshold. Supported values for maximize include: “Accuracy”, “Precision”, “Recall”, “Specificity”, “Sensitivity”, “F1”, and “MCC”.

pred = compute_prediction(res_ml, features_test, traitData_test$Response,
                          "CR", stack = FALSE, file.name = "Test",
                          maximize = "Accuracy", return = TRUE)
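
The idea of choosing a classification threshold that maximizes a metric can be pictured as follows. This base R sketch assumes the metric is accuracy and treats every observed probability as a candidate cutoff; `best_threshold()` is a hypothetical helper, not a pipeML function, and the package may select the threshold differently.

```r
# Hypothetical helper: scan candidate cutoffs on predicted probabilities and
# keep the one with the highest accuracy (the first one in case of ties).
best_threshold <- function(prob, truth, positive) {
  cutoffs <- sort(unique(prob))
  acc <- vapply(cutoffs, function(th) {
    pred_pos <- prob >= th                 # predicted positive at this cutoff
    mean(pred_pos == (truth == positive))  # fraction of correct calls
  }, numeric(1))
  list(threshold = cutoffs[which.max(acc)], accuracy = max(acc))
}

# Toy predicted probabilities for the positive class and the true labels
prob  <- c(0.10, 0.35, 0.40, 0.65, 0.80, 0.90)
truth <- c("NR", "NR", "CR", "NR", "CR", "CR")

res_thr <- best_threshold(prob, truth, positive = "CR")
res_thr$threshold  # 0.4
res_thr$accuracy   # 5/6
```

Swapping the accuracy line for another metric (for example F1 or MCC) gives the same search with a different objective.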

compute_features.ML(): This function trains on one dataset and evaluates on a separate test dataset when one is available. It automatically computes predictions on the provided test set using the trained model, combining the two functions described above.

res = compute_features.ML(tme_features_train[[i]], tme_features_test[[i]],
                          clinical = traitData, trait = "Response",
                          trait.positive = "R", metric = "AUROC", stack = FALSE,
                          k_folds = 2, n_rep = 1, feature.selection = FALSE,
                          seed = 123, LODO = TRUE, batch_id = "Cohort",
                          ncores = 2, maximize = "Accuracy", return = FALSE)
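
The call above enables `LODO` together with `batch_id = "Cohort"`. Assuming LODO stands for leave-one-dataset-out (this README does not spell it out), the splitting scheme could look like the base R sketch below. The cohort labels are synthetic and `lodo_splits` is a hypothetical name; this is not pipeML's own code.

```r
# Synthetic cohort labels, one per sample (assumption: this mirrors the role
# of the "Cohort" column passed as batch_id)
cohort <- c("A", "A", "B", "B", "B", "C")

# One split per cohort: train on all other cohorts, test on the held-out one
lodo_splits <- lapply(unique(cohort), function(ch) {
  list(held_out = ch,
       train    = which(cohort != ch),
       test     = which(cohort == ch))
})

# Summarize each split
for (s in lodo_splits) {
  cat("held out:", s$held_out,
      "| train n =", length(s$train),
      "| test n =", length(s$test), "\n")
}
```

Holding out whole cohorts, rather than random samples, tests whether a model generalizes across batches instead of memorizing cohort-specific effects.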

Issues

If you encounter any problems or have questions about the package, we encourage you to open an issue here. We’ll do our best to assist you!

Authors

pipeML was developed by Marcelo Hurtado under the supervision of Vera Pancaldi and is part of the Pancaldi team. Currently, Marcelo is the primary maintainer of this package.

Citing pipeML

If you use pipeML in a scientific publication, we would appreciate a citation to the following:

Hurtado M, Pancaldi V (2025). pipeML: A robust R machine learning pipeline for classification tasks and survival analysis. R package version 0.0.1, https://github.com/VeraPancaldiLab/pipeML.
