CMG-GUTS/maensembles

maensembles

Automatic machine learning ensembles for microbiota-based predictions


maensembles implements a pipeline for microbiome-based disease classification: it exhaustively evaluates combinations of data transformations and classifiers, then automatically assembles optimal ensembles from the best-performing configurations.

Generalisability is assessed via leave-one-dataset-out (LODO) cross-validation for multi-cohort tasks, and repeated stratified k-fold for single-cohort tasks.
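As an illustration of the LODO idea, the sketch below builds one split per study, holding out every sample of that study for testing. The function name `lodo_splits` and the toy study labels are hypothetical and not taken from the maensembles code; scikit-learn's `LeaveOneGroupOut` and `RepeatedStratifiedKFold` provide equivalent splitters, though whether the package uses them is not stated here.

```python
from collections import defaultdict


def lodo_splits(study_ids):
    """Yield (held_out_study, train_idx, test_idx), one split per study."""
    by_study = defaultdict(list)
    for idx, study in enumerate(study_ids):
        by_study[study].append(idx)
    for held_out in sorted(by_study):
        test_idx = by_study[held_out]
        # Training indices are all samples from every other study.
        train_idx = [
            i
            for study, idxs in sorted(by_study.items())
            if study != held_out
            for i in idxs
        ]
        yield held_out, train_idx, test_idx


# Hypothetical cohort labels: six samples drawn from three studies.
studies = ["A", "A", "B", "B", "B", "C"]
splits = list(lodo_splits(studies))
for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx, test_idx)
```

Because a whole study is held out at once, no sample from the test cohort can influence training, which is what makes the estimate a measure of cross-cohort generalisation.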


Pipeline

data-preparation/  →  config-sweep/  →  ensemble-sweep/  →  inference/

1 · Data preparation

data-preparation/

Prepare input features from raw MetaPhlAn4 taxonomic profiles:

| Script | Purpose |
| --- | --- |
| csv2study-tsv-na.py | Convert raw CSV profiles to per-study TSV format (non-aggregated, leaf-level features) |
| prepare_taxa_resolutions.py | Aggregate strain-level features to all taxonomic ranks (phylum → class → order → family → genus → species → strain) by summation |

python data-preparation/csv2study-tsv-na.py
python data-preparation/prepare_taxa_resolutions.py
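Aggregation by summation can be sketched as follows. This is a minimal illustration, not the code in prepare_taxa_resolutions.py: it assumes pipe-separated MetaPhlAn-style lineage strings with rank prefixes (`p__`, `c__`, ..., `t__`), and the function `aggregate` and the toy clade names are hypothetical.

```python
from collections import defaultdict

RANKS = ["phylum", "class", "order", "family", "genus", "species", "strain"]
# Assumed mapping from the first letter of a lineage segment to its rank.
PREFIX = {"p": "phylum", "c": "class", "o": "order", "f": "family",
          "g": "genus", "s": "species", "t": "strain"}


def aggregate(profile):
    """Sum leaf-level abundances into every ancestor clade, per rank.

    profile maps a full lineage string to its abundance; the result maps
    rank -> {truncated lineage: summed abundance}.
    """
    tables = {rank: defaultdict(float) for rank in RANKS}
    for lineage, abundance in profile.items():
        parts = lineage.split("|")
        for depth, part in enumerate(parts):
            rank = PREFIX.get(part[0])
            if rank is None:
                continue  # e.g. the kingdom segment, which is not tabulated
            tables[rank]["|".join(parts[: depth + 1])] += abundance
    return tables


# Toy strain-level profile with made-up clade names.
base = "k__Bacteria|p__PhylumA|c__ClassA|o__OrderA|f__FamilyA|g__GenusA"
profile = {
    base + "|s__Species1|t__Strain1": 2.0,
    base + "|s__Species2|t__Strain2": 3.0,
    "k__Bacteria|p__PhylumB|c__ClassB|o__OrderB|f__FamilyB|g__GenusB"
    "|s__Species3|t__Strain3": 5.0,
}
tables = aggregate(profile)
print(tables["genus"][base])
```

Keying each clade by its truncated lineage rather than its bare name keeps identically named clades under different parents apart.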

2 · Configuration sweep

config-sweep/

Exhaustive grid search over 44 data transforms × ~70 classifier variants for each prediction task. Folds are evaluated in parallel; results are saved incrementally so interrupted runs resume automatically.
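The resume behaviour can be pictured with a small sketch: one result file per fold, and a fold is skipped when its file already exists unless a redo is requested. Everything here is an assumption made for illustration (the `run_sweep` function, the column names, the placeholder transform and classifier names, and the sequential loop in place of the parallel fold evaluation the real scripts perform).

```python
import csv
import itertools
import tempfile
from pathlib import Path


def run_sweep(transforms, classifiers, n_folds, out_dir, evaluate, redo=False):
    """Evaluate every transform x classifier pair per fold; return folds computed."""
    results_dir = Path(out_dir) / "results"
    results_dir.mkdir(parents=True, exist_ok=True)
    configs = list(enumerate(itertools.product(transforms, classifiers)))
    computed = []
    for fold in range(n_folds):
        fold_file = results_dir / f"fold_{fold}.tsv"
        if fold_file.exists() and not redo:
            continue  # finished in an earlier run, so resume past it
        rows = [(cid, t, c, evaluate(t, c, fold)) for cid, (t, c) in configs]
        # Write to a temporary name first so a crash never leaves a
        # half-written file that would be mistaken for a finished fold.
        tmp_file = fold_file.with_suffix(".tmp")
        with tmp_file.open("w", newline="") as fh:
            writer = csv.writer(fh, delimiter="\t")
            writer.writerow(["config_id", "transform", "classifier", "score"])
            writer.writerows(rows)
        tmp_file.replace(fold_file)
        computed.append(fold)
    return computed


def fake_evaluate(transform, classifier, fold):
    """Stand-in for training and scoring one configuration on one fold."""
    return round(0.5 + 0.01 * fold, 3)


with tempfile.TemporaryDirectory() as tmp_dir:
    args = (["clr", "log"], ["rf", "lr"], 3, tmp_dir, fake_evaluate)
    first = run_sweep(*args)              # computes every fold
    second = run_sweep(*args)             # nothing left to do
    third = run_sweep(*args, redo=True)   # forced rerun
print(first, second, third)
```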

python config-sweep/run_crc_lampp_lodo_na_fast.py
python config-sweep/run_dmw_lampp_lodo_na_fast.py
python config-sweep/run_ghs_lampp_lodo_na_fast.py
python config-sweep/run_ibd_lampp_lodo_na_fast.py
python config-sweep/run_scz_lampp_kfold_na_fast.py

Output (per task, under experiments/<task>_lampp_lodo_na_fast/):

configs.tsv            config_id → transform + classifier
results/fold_<N>.tsv   per-fold metrics for every configuration
predictions/fold_<N>.tsv
results_all.tsv        metrics merged across all folds
predictions_all.tsv
summary.tsv            aggregated performance summary
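One way to picture the step from per-fold files to a summary is the sketch below, which pools per-fold scores by configuration and reports the mean and the worst fold. The column names `config_id` and `auc`, the function `summarise`, and the choice of summary statistics are assumptions; the actual columns of results_all.tsv and summary.tsv are not documented here.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarise(fold_tables):
    """Return (config_id, mean score, worst-fold score), best mean first."""
    scores = defaultdict(list)
    for text in fold_tables:
        for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
            scores[row["config_id"]].append(float(row["auc"]))
    summary = [(cid, statistics.mean(v), min(v)) for cid, v in scores.items()]
    return sorted(summary, key=lambda r: r[1], reverse=True)


# Two hypothetical per-fold result tables, inlined as TSV text.
fold_0 = "config_id\tauc\n0\t0.80\n1\t0.70\n"
fold_1 = "config_id\tauc\n0\t0.60\n1\t0.90\n"
summary = summarise([fold_0, fold_1])
for config_id, mean_auc, worst_auc in summary:
    print(config_id, round(mean_auc, 3), worst_auc)
```

Under LODO each fold is a cohort, so the worst-fold score shows how a configuration fares on its least favourable held-out study.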

Add --redo to force a full rerun from scratch.


3 · Ensemble sweep

ensemble-sweep/

Post-hoc search for the optimal ensemble: member selection strategy × aggregation function. Members are scored on held-out folds (inner validation) to prevent leakage into the ensemble evaluation.
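The search over member selection and aggregation can be sketched as below. This is a simplified illustration under stated assumptions: top-k selection and mean/median aggregation stand in for whatever strategies the package actually sweeps, accuracy at a 0.5 cut-off stands in for its metric, and all names and numbers are made up. The point it shows is that members are ranked using inner-validation predictions only, while the ensemble is scored on separate outer predictions.

```python
import itertools
import statistics


def accuracy(probs, labels):
    """Fraction of samples whose thresholded probability matches the label."""
    return sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels)) / len(labels)


def ensemble_sweep(inner_preds, inner_labels, outer_preds, outer_labels, ks=(1, 2, 3)):
    # Rank candidate members by inner-validation score only.
    ranked = sorted(
        inner_preds,
        key=lambda cid: accuracy(inner_preds[cid], inner_labels),
        reverse=True,
    )
    aggregators = {"mean": statistics.mean, "median": statistics.median}
    results = {}
    for k, (name, agg) in itertools.product(ks, aggregators.items()):
        members = ranked[:k]
        # Combine the members' outer predictions sample by sample.
        combined = [agg(sample) for sample in zip(*(outer_preds[m] for m in members))]
        results[(k, name)] = accuracy(combined, outer_labels)
    return ranked, results


# Hypothetical predicted probabilities from three configurations.
inner_labels = [1, 0, 1, 0]
inner_preds = {
    "cfg_a": [0.9, 0.2, 0.8, 0.1],
    "cfg_b": [0.6, 0.4, 0.4, 0.3],
    "cfg_c": [0.3, 0.7, 0.6, 0.8],
}
outer_labels = [1, 0, 1]
outer_preds = {
    "cfg_a": [0.8, 0.3, 0.4],
    "cfg_b": [0.7, 0.2, 0.7],
    "cfg_c": [0.2, 0.9, 0.9],
}
ranked, results = ensemble_sweep(inner_preds, inner_labels, outer_preds, outer_labels)
best = max(results, key=results.get)
print(ranked, best, results[best])
```

In this toy data the single best member misclassifies one outer sample, while averaging it with the second-ranked member corrects the error.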

python ensemble-sweep/run_crc_lampp_lodo_na_fast_ensembling.py
python ensemble-sweep/run_dmw_lampp_lodo_na_fast_ensembling.py
python ensemble-sweep/run_ghs_lampp_lodo_na_fast_ensembling.py
python ensemble-sweep/run_ibd_lampp_lodo_na_fast_ensembling.py
python ensemble-sweep/run_scz_lampp_kfold_na_fast_ensembling.py

The selected ensemble is written to:

experiments/<task>_lampp_lodo_na_fast/ensembling/ensemble_best_config.json

4 · Inference

inference/

Train the final ensemble on all available training cohorts and generate held-out test set predictions.

cd inference
python maensemble_predict_crc.py
python maensemble_predict_dmw.py
python maensemble_predict_ghs.py
python maensemble_predict_ibd.py
python maensemble_predict_scz.py

Each script produces a CSV with columns sample_id and prediction (posterior probability of class 1).
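A downstream consumer might turn these probabilities into class calls as in the sketch below. The inline CSV content, the function `call_classes`, and the 0.5 threshold are illustrative assumptions; only the two column names come from the description above.

```python
import csv
import io

# Hypothetical contents of a prediction file.
csv_text = "sample_id,prediction\nS1,0.91\nS2,0.12\nS3,0.50\n"


def call_classes(text, threshold=0.5):
    """Map each sample_id to 1 if its class-1 probability meets the threshold."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["sample_id"]: int(float(row["prediction"]) >= threshold)
        for row in reader
    }


calls = call_classes(csv_text)
print(calls)
```

The threshold is a choice for the user; a different operating point may suit tasks where false positives and false negatives carry unequal costs.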


Installation

uv pip install -e .
