GitHub - CMG-GUTS/maensembles

Automatic machine learning ensembles for microbiota-based predictions

maensembles implements a pipeline for microbiome-based disease classification: it exhaustively evaluates combinations of data transformations and classifiers, then automatically assembles optimal ensembles from the best-performing configurations.

Generalisability is assessed via leave-one-dataset-out (LODO) cross-validation for multi-cohort tasks, and via repeated stratified k-fold cross-validation for single-cohort tasks.
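To make the LODO idea concrete, here is a minimal standard-library sketch, not the repository's implementation. The function name `lodo_splits` and the toy cohort labels are assumptions made for illustration. Each cohort is held out once as the test set while all remaining cohorts form the training set.

```python
from collections import defaultdict


def lodo_splits(cohorts):
    """Yield (held_out_cohort, train_idx, test_idx) for leave-one-dataset-out CV.

    `cohorts` holds the cohort label of each sample, in sample order.
    """
    by_cohort = defaultdict(list)
    for idx, cohort in enumerate(cohorts):
        by_cohort[cohort].append(idx)
    for held_out in sorted(by_cohort):
        test_idx = by_cohort[held_out]
        # Every sample from every other cohort goes into the training set.
        train_idx = sorted(
            i for cohort, idxs in by_cohort.items() if cohort != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Toy labels (hypothetical): six samples from three cohorts.
cohorts = ["A", "A", "B", "C", "B", "C"]
splits = list(lodo_splits(cohorts))
for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx, test_idx)
```

For single-cohort tasks there is no cohort label to hold out, which is why a repeated stratified k-fold scheme is used there instead. scikit-learn's `RepeatedStratifiedKFold` is one common way to do that; whether the repository uses it is not stated here.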

Pipeline

data-preparation/  →  config-sweep/  →  ensemble-sweep/  →  inference/

1 · Data preparation

data-preparation/

Prepare input features from raw MetaPhlAn4 taxonomic profiles:

| Script | Purpose |
| --- | --- |
| `csv2study-tsv-na.py` | Convert raw CSV profiles to per-study TSV format (non-aggregated, leaf-level features) |
| `prepare_taxa_resolutions.py` | Aggregate strain-level features to all taxonomic ranks (phylum → class → order → family → genus → species → strain) by summation |

python data-preparation/csv2study-tsv-na.py
python data-preparation/prepare_taxa_resolutions.py
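The aggregation by summation can be sketched as follows. This is an illustration, not the code in `prepare_taxa_resolutions.py`. It assumes MetaPhlAn-style lineage strings (pipe-delimited, with rank prefixes such as `p__` and `t__`); the helper names and the toy taxa are hypothetical.

```python
from collections import defaultdict

# MetaPhlAn-style rank prefixes (assumed input format).
RANK_PREFIX = {
    "phylum": "p__", "class": "c__", "order": "o__", "family": "f__",
    "genus": "g__", "species": "s__", "strain": "t__",
}


def truncate_lineage(lineage, rank):
    """Cut a pipe-delimited lineage at `rank`; return None if it is not that deep."""
    parts = lineage.split("|")
    for i, part in enumerate(parts):
        if part.startswith(RANK_PREFIX[rank]):
            return "|".join(parts[: i + 1])
    return None


def aggregate_by_rank(leaf_abundances, rank):
    """Sum leaf-level abundances over all leaves sharing the same lineage at `rank`."""
    totals = defaultdict(float)
    for lineage, value in leaf_abundances.items():
        key = truncate_lineage(lineage, rank)
        if key is not None:
            totals[key] += value
    return dict(totals)


# Toy profile with made-up taxa names and abundances.
profile = {
    "k__Bacteria|p__P1|c__C1|o__O1|f__F1|g__G1|s__S1|t__SGB1": 10.0,
    "k__Bacteria|p__P1|c__C1|o__O1|f__F1|g__G1|s__S2|t__SGB2": 5.0,
    "k__Bacteria|p__P2|c__C2|o__O2|f__F2|g__G2|s__S3|t__SGB3": 2.5,
}
phylum_table = aggregate_by_rank(profile, "phylum")
species_table = aggregate_by_rank(profile, "species")
print(phylum_table)
```

Running the same function once per rank gives one feature table per taxonomic resolution, which is the shape of input the later sweep would consume under these assumptions.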

2 · Configuration sweep

config-sweep/

Exhaustive grid search over 44 data transforms × ~70 classifier variants for each prediction task. Folds are evaluated in parallel; results are saved incrementally so interrupted runs resume automatically.

python config-sweep/run_crc_lampp_lodo_na_fast.py
python config-sweep/run_dmw_lampp_lodo_na_fast.py
python config-sweep/run_ghs_lampp_lodo_na_fast.py
python config-sweep/run_ibd_lampp_lodo_na_fast.py
python config-sweep/run_scz_lampp_kfold_na_fast.py
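As an illustration of how such a grid might be enumerated, the sketch below builds one row per transform and classifier pair, mirroring the `config_id → transform + classifier` mapping. The transform and classifier names are placeholders, not the ones used in the repository.

```python
import itertools

# Placeholder names; the real sweep covers 44 transforms and ~70 classifier variants.
transforms = ["transform_a", "transform_b", "transform_c"]
classifiers = ["classifier_x", "classifier_y"]

# One configuration per (transform, classifier) pair, numbered in grid order.
configs = [
    {"config_id": config_id, "transform": transform, "classifier": classifier}
    for config_id, (transform, classifier) in enumerate(
        itertools.product(transforms, classifiers)
    )
]

for row in configs:
    print(row["config_id"], row["transform"], row["classifier"], sep="\t")
```

With the grid sizes quoted above, the same product yields roughly 44 × 70 ≈ 3,000 configurations per task, each of which is evaluated on every fold.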

Output (per task, under experiments/<task>_lampp_lodo_na_fast/):

configs.tsv                 config_id → transform + classifier
results/fold_<N>.tsv        per-fold metrics for every configuration
predictions/fold_<N>.tsv    per-fold predictions for every configuration
results_all.tsv             metrics merged across all folds
predictions_all.tsv         predictions merged across all folds
summary.tsv                 aggregated performance summary

Add --redo to force a full rerun from scratch.
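The resume behaviour can be pictured with the sketch below. It assumes that a fold counts as finished once its `results/fold_<N>.tsv` file exists and that folds are numbered from 0; the function `run_sweep`, its `redo` argument and the dummy evaluator are hypothetical stand-ins, not the repository's code.

```python
import tempfile
from pathlib import Path


def run_sweep(out_dir, n_folds, evaluate_fold, redo=False):
    """Evaluate folds, skipping those whose result file already exists."""
    results_dir = Path(out_dir) / "results"
    results_dir.mkdir(parents=True, exist_ok=True)
    computed = []
    for fold in range(n_folds):
        fold_file = results_dir / f"fold_{fold}.tsv"
        if fold_file.exists() and not redo:
            continue  # finished in an earlier run, nothing to do
        rows = evaluate_fold(fold)
        # Write to a temporary name first so a crash never leaves a
        # half-written file that looks like a finished fold.
        tmp_file = fold_file.with_name(fold_file.name + ".tmp")
        tmp_file.write_text("\n".join("\t".join(map(str, r)) for r in rows) + "\n")
        tmp_file.replace(fold_file)
        computed.append(fold)
    return computed


def dummy_evaluate(fold):
    # Stand-in for the real per-fold evaluation.
    return [("config_id", "metric"), (0, 0.5)]


with tempfile.TemporaryDirectory() as tmp_dir:
    first_run = run_sweep(tmp_dir, 3, dummy_evaluate)
    (Path(tmp_dir) / "results" / "fold_1.tsv").unlink()  # simulate an interrupted fold
    resumed_run = run_sweep(tmp_dir, 3, dummy_evaluate)
    forced_run = run_sweep(tmp_dir, 3, dummy_evaluate, redo=True)

print(first_run, resumed_run, forced_run)
```

In this sketch the second call recomputes only the missing fold, while `redo=True` ignores existing files, which is the behaviour `--redo` is described as providing.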

3 · Ensemble sweep

ensemble-sweep/

Post-hoc search for the optimal ensemble: member selection strategy × aggregation function. Members are scored on held-out folds (inner validation) to prevent leakage into the ensemble evaluation.

python ensemble-sweep/run_crc_lampp_lodo_na_fast_ensembling.py
python ensemble-sweep/run_dmw_lampp_lodo_na_fast_ensembling.py
python ensemble-sweep/run_ghs_lampp_lodo_na_fast_ensembling.py
python ensemble-sweep/run_ibd_lampp_lodo_na_fast_ensembling.py
python ensemble-sweep/run_scz_lampp_kfold_na_fast_ensembling.py
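The search space (member selection strategy × aggregation function) can be illustrated with a small standard-library sketch. Top-k selection and mean/median aggregation are used here as example strategies only; the repository's actual strategies, names and scores are not shown in this README, and all values below are made up.

```python
from statistics import mean, median


def select_top_k(inner_scores, k):
    """Member selection: keep the k configs with the best inner-validation score."""
    ranked = sorted(inner_scores, key=inner_scores.get, reverse=True)
    return ranked[:k]


# Example aggregation functions (assumed, not taken from the repository).
AGGREGATORS = {"mean": mean, "median": median}


def ensemble_predict(member_preds, members, aggregate):
    """Combine per-sample probabilities of the selected members into one prediction."""
    per_sample = zip(*(member_preds[m] for m in members))
    return [aggregate(values) for values in per_sample]


# Scores come from inner validation only, so the samples used to evaluate
# the ensemble never influence which members are chosen.
inner_scores = {"cfg_a": 0.81, "cfg_b": 0.77, "cfg_c": 0.85}
# Predictions of each config on the evaluation fold (two samples).
eval_preds = {
    "cfg_a": [0.25, 0.75],
    "cfg_b": [0.50, 0.50],
    "cfg_c": [0.75, 0.75],
}

members = select_top_k(inner_scores, k=2)
ensemble = ensemble_predict(eval_preds, members, AGGREGATORS["mean"])
print(members, ensemble)

# The sweep itself is then a loop over every (selection, aggregation) pair.
sweep = {
    (k, name): ensemble_predict(eval_preds, select_top_k(inner_scores, k), agg)
    for k in (1, 2, 3)
    for name, agg in AGGREGATORS.items()
}
```

Each entry of `sweep` would then be scored with the task metric, and the best-scoring pair kept as the selected ensemble.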

The selected ensemble is written to:

experiments/<task>_lampp_lodo_na_fast/ensembling/ensemble_best_config.json

4 · Inference

inference/

Train the final ensemble on all available training cohorts and generate held-out test set predictions.

cd inference
python maensemble_predict_crc.py
python maensemble_predict_dmw.py
python maensemble_predict_ghs.py
python maensemble_predict_ibd.py
python maensemble_predict_scz.py

Each script produces a CSV with columns sample_id and prediction (posterior probability of class 1).
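A downstream consumer could read such a file as sketched below. The sample IDs and probabilities are invented, and the 0.5 decision threshold is an assumption for illustration, not something the pipeline prescribes.

```python
import csv
import io


def load_predictions(handle):
    """Read sample_id → posterior probability of class 1 from a CSV handle."""
    reader = csv.DictReader(handle)
    return {row["sample_id"]: float(row["prediction"]) for row in reader}


def call_labels(probabilities, threshold=0.5):
    """Turn probabilities into 0/1 labels (threshold is an assumed default)."""
    return {sample_id: int(p >= threshold) for sample_id, p in probabilities.items()}


# In-memory stand-in for one of the output CSV files.
example_csv = io.StringIO("sample_id,prediction\nS1,0.91\nS2,0.12\n")
probabilities = load_predictions(example_csv)
labels = call_labels(probabilities)
print(labels)
```

To read a real output file, pass an open file handle (opened with `newline=""`, as the `csv` module recommends) instead of the `io.StringIO` object.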

Installation

uv pip install -e .
