pipeML

A robust R machine learning pipeline for classification tasks and survival analysis

Installation

You can install the development version of pipeML from GitHub with:

# install.packages("pak")
pak::pkg_install("VeraPancaldiLab/pipeML")

Description

pipeML is a robust R-based pipeline designed to streamline the training and testing of machine learning models for classification tasks. It is developed for fast, user-friendly deployment while maintaining the flexibility and complexity needed for rigorous, reliable implementation at every stage of the pipeline (Figure 1).

Figure 1. Machine learning pipeline.

Key features

Stratified data split
Iterative Boruta algorithm for feature selection
Repeated k-fold cross validation (kCV)
Hyperparameter tuning based on AUROC, AUPRC or Accuracy
Stratified k-fold construction
SHAP values implementation for feature importance
Model stacking implementation based on GLM
Visualization functions for ROC and PR curves
14 Machine Learning methods implemented
- Bagged CART
- Random Forest (RF)
- C50
- Logistic regression (LG)
- CART
- Naive Bayes (NB)
- Regularized Lasso
- Ridge regression
- Linear Discriminant Analysis (LDA)
- Regularized Logistic Regression (Elastic net)
- K-nearest neighbors (KNN)
- Support vector machine with radial kernel (SVMr)
- Support vector machine with linear kernel (SVMl)
- Extreme Gradient Boosting (XGBoost)
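
To make the stratified and repeated k-fold features listed above more concrete, here is a minimal base R sketch of the general idea. It is not pipeML's internal code: `make_stratified_folds()` and the toy labels are hypothetical, introduced only for illustration.

```r
# Hypothetical helper (not part of pipeML): assign each sample to one of k folds
# so that every class is spread as evenly as possible across the folds.
make_stratified_folds <- function(y, k) {
  folds <- integer(length(y))
  for (cls in unique(as.character(y))) {
    idx <- which(y == cls)
    # Shuffle the samples of this class, then deal fold labels 1..k in turn
    idx <- idx[sample.int(length(idx))]
    folds[idx] <- rep_len(seq_len(k), length(idx))
  }
  folds
}

set.seed(123)
y <- factor(rep(c("CR", "PD"), times = c(20, 30)))  # toy labels, assumed for the example

# Repeated k-fold: draw a fresh stratified assignment for each repetition
n_rep <- 3
k <- 5
fold_list <- lapply(seq_len(n_rep), function(r) make_stratified_folds(y, k))

# Class counts per fold for the first repetition
fold_table <- table(y, fold_list[[1]])
print(fold_table)
```

Each fold keeps the 20:30 class ratio of the full toy dataset, which is the point of stratification: performance estimates are not distorted by folds that happen to contain few samples of one class.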

General usage

These basic examples show how to use pipeML for different tasks. For a detailed tutorial, see Get started.

library(pipeML)

compute_features.training.ML(): This function trains machine learning models on a single dataset using repeated k-fold cross-validation. It supports feature selection via Boruta, optional model stacking, flexible hyperparameter tuning, and the construction of k-folds stratified by cohort when this information is available. Use it when you do not have a separate prediction dataset, so that models are trained and evaluated on different folds of the same dataset.

res_ml = compute_features.training.ML(features_train, clinical$Response, "CR",
                                      metric = "AUROC", stack = FALSE, k_folds = 5,
                                      n_rep = 10, feature.selection = FALSE, seed = 123,
                                      file_name = "Test", ncores = 2, return = TRUE)
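
The call above tunes on AUROC. For readers unfamiliar with the metric, the following base R sketch computes it from the rank-sum (Mann-Whitney) identity. The `auroc()` helper and the toy scores are hypothetical and are not part of pipeML, which may compute the metric differently.

```r
# Hypothetical helper (not part of pipeML): AUROC via the rank-sum identity,
# i.e. the probability that a random positive scores higher than a random negative.
auroc <- function(scores, labels, positive) {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # average ranks, so ties count as half
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy predicted probabilities and true labels, assumed for the example
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c("CR", "CR", "CR", "PD", "PD", "PD")

auroc_value <- auroc(scores, labels, positive = "CR")
print(auroc_value)
```

In this toy example 8 of the 9 positive-negative pairs are ranked correctly, so the AUROC is 8/9.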

After training, predictions on new data can be computed using the compute_prediction() function. You can specify which metric to maximize when determining the optimal classification threshold. Supported values for maximize include: “Accuracy”, “Precision”, “Recall”, “Specificity”, “Sensitivity”, “F1”, and “MCC”.

pred = compute_prediction(res_ml, features_test, traitData_test$Response,
                          "CR", stack = FALSE, file.name = "Test",
                          maximize = "Accuracy", return = TRUE)
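
As an illustration of what choosing a threshold by maximizing a metric can look like, here is a minimal base R sketch for accuracy. The `best_threshold()` helper and the toy data are hypothetical; this is not the implementation used by compute_prediction().

```r
# Hypothetical helper (not part of pipeML): scan the observed probabilities as
# candidate thresholds and keep the one with the highest accuracy.
best_threshold <- function(prob, labels, positive) {
  truth <- labels == positive
  candidates <- sort(unique(prob))
  acc <- vapply(candidates,
                function(t) mean((prob >= t) == truth),
                numeric(1))
  # which.max() returns the first maximum, so ties go to the lowest threshold
  list(threshold = candidates[which.max(acc)], accuracy = max(acc))
}

# Toy predicted probabilities and true labels, assumed for the example
prob <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c("CR", "CR", "CR", "PD", "PD", "PD")

best <- best_threshold(prob, labels, positive = "CR")
print(best)
```

Swapping the accuracy expression for another metric (for example F1 or MCC) changes which threshold is selected, which is why the choice of metric to maximize matters on imbalanced data.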

compute_features.ML(): This function is intended for training on one dataset and evaluating on a separate test dataset when one is available. It automatically computes predictions on the provided test set using the trained model, combining the two previous functions in a single call.

res = compute_features.ML(tme_features_train[[i]], tme_features_test[[i]],
                          clinical = traitData, trait = "Response",
                          trait.positive = "R", metric = "AUROC", stack = FALSE,
                          k_folds = 2, n_rep = 1, feature.selection = FALSE,
                          seed = 123, LODO = TRUE, batch_id = "Cohort",
                          ncores = 2, maximize = "Accuracy", return = FALSE)

Issues

If you encounter any problems or have questions about the package, we encourage you to open an issue here. We’ll do our best to assist you!

Authors

pipeML was developed by Marcelo Hurtado under the supervision of Vera Pancaldi and is part of the Pancaldi team. Marcelo is currently the primary maintainer of this package.

Citing pipeML

If you use pipeML in a scientific publication, we would appreciate a citation:

Hurtado M, Pancaldi V (2025). pipeML: A robust R machine learning pipeline for classification tasks and survival analysis. R package version 0.0.1, https://github.com/VeraPancaldiLab/pipeML.
