Automatic machine learning ensembles for microbiota-based predictions
maensembles implements a pipeline for microbiome-based disease classification: it exhaustively evaluates combinations of data transformations and classifiers, then automatically assembles optimal ensembles from the best-performing configurations.
Generalisability is assessed via leave-one-dataset-out (LODO) cross-validation for multi-cohort tasks, and repeated stratified k-fold for single-cohort tasks.
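The LODO scheme described above can be sketched in a few lines: each study is held out once as the test set while all remaining studies form the training set. This is a minimal illustration using only the standard library, not the repository's implementation; the function name `lodo_splits` and the cohort labels are hypothetical.

```python
from collections import defaultdict


def lodo_splits(study_labels):
    """Yield (held_out_study, train_idx, test_idx), holding out each study once."""
    by_study = defaultdict(list)
    for idx, study in enumerate(study_labels):
        by_study[study].append(idx)
    for held_out in sorted(by_study):
        test_idx = by_study[held_out]
        # Training samples come from every study except the held-out one.
        train_idx = sorted(
            i for study, idxs in by_study.items() if study != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Hypothetical cohort label for each sample (assumption, for illustration only).
studies = ["cohortA", "cohortA", "cohortB", "cohortC", "cohortB"]
splits = list(lodo_splits(studies))
for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx, test_idx)
```

Because no sample from the held-out study ever appears in training, the resulting score reflects transfer to an unseen cohort rather than within-cohort fit.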
data-preparation/ → config-sweep/ → ensemble-sweep/ → inference/
data-preparation/
Prepare input features from raw MetaPhlAn4 taxonomic profiles:
| Script | Purpose |
|---|---|
| csv2study-tsv-na.py | Convert raw CSV profiles to per-study TSV format (non-aggregated, leaf-level features) |
| prepare_taxa_resolutions.py | Aggregate strain-level features to all taxonomic ranks (phylum → class → order → family → genus → species → strain) by summation |
python data-preparation/csv2study-tsv-na.py
python data-preparation/prepare_taxa_resolutions.py
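Aggregation by summation can be illustrated as follows. The sketch assumes MetaPhlAn-style pipe-delimited lineage strings with rank prefixes (`p__` through `t__`); the function name `aggregate_ranks` and the toy clade names are hypothetical, and the actual script may handle its input differently.

```python
from collections import defaultdict

# Rank prefixes from phylum down to strain (assumed lineage convention).
RANK_PREFIXES = ["p__", "c__", "o__", "f__", "g__", "s__", "t__"]


def aggregate_ranks(leaf_abundances):
    """Sum leaf-level abundances into every ancestor clade, grouped by rank."""
    totals = {prefix: defaultdict(float) for prefix in RANK_PREFIXES}
    for lineage, abundance in leaf_abundances.items():
        clades = lineage.split("|")
        for depth, clade in enumerate(clades):
            prefix = clade[:3]
            if prefix in totals:
                # Key by the lineage up to this rank so that equally named
                # clades under different parents stay distinct.
                totals[prefix]["|".join(clades[: depth + 1])] += abundance
    return {prefix: dict(values) for prefix, values in totals.items()}


# Toy profile: two strains of one species plus one strain from another genus.
base = "k__K|p__P1|c__C1|o__O1|f__F1"
profile = {
    base + "|g__G1|s__S1|t__T1": 1.5,
    base + "|g__G1|s__S1|t__T2": 2.5,
    base + "|g__G2|s__S2|t__T3": 4.0,
}
by_rank = aggregate_ranks(profile)
print(by_rank["g__"])
```

Each rank then yields its own feature table, with coarser ranks containing fewer, larger features.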
config-sweep/
Exhaustive grid search over 44 data transforms × ~70 classifier variants for each prediction task. Folds are evaluated in parallel; results are saved incrementally so interrupted runs resume automatically.
python config-sweep/run_crc_lampp_lodo_na_fast.py
python config-sweep/run_dmw_lampp_lodo_na_fast.py
python config-sweep/run_ghs_lampp_lodo_na_fast.py
python config-sweep/run_ibd_lampp_lodo_na_fast.py
python config-sweep/run_scz_lampp_kfold_na_fast.py
Output (per task, under experiments/<task>_lampp_lodo_na_fast/):
configs.tsv config_id → transform + classifier
results/fold_<N>.tsv per-fold metrics for every configuration
predictions/fold_<N>.tsv
results_all.tsv metrics merged across all folds
predictions_all.tsv
summary.tsv aggregated performance summary
Add --redo to force a full rerun from scratch.
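The grid construction and resume behaviour described in this section can be sketched as below. This is an assumption-laden illustration, not the sweep scripts themselves: it runs folds sequentially rather than in parallel, treats an existing `results/fold_<N>.tsv` as the marker of a finished fold, and uses hypothetical names throughout (`build_configs`, `run_sweep`, the `evaluate` callback, and the transform and classifier labels).

```python
import csv
import itertools
import tempfile
from pathlib import Path


def build_configs(transforms, classifiers):
    """Enumerate every transform x classifier pair with a stable config_id."""
    pairs = itertools.product(transforms, classifiers)
    return [
        {"config_id": i, "transform": t, "classifier": c}
        for i, (t, c) in enumerate(pairs)
    ]


def run_sweep(out_dir, configs, n_folds, evaluate, redo=False):
    """Evaluate all configs per fold, skipping folds whose results already exist."""
    results_dir = Path(out_dir) / "results"
    results_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for fold in range(n_folds):
        fold_path = results_dir / f"fold_{fold}.tsv"
        if fold_path.exists() and not redo:
            continue  # finished in an earlier run, so resume past it
        # Write to a temporary file first so an interrupted run never
        # leaves a partial fold file that looks complete.
        tmp_path = fold_path.with_name(fold_path.name + ".tmp")
        with tmp_path.open("w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t")
            writer.writerow(["config_id", "fold", "score"])
            for cfg in configs:
                writer.writerow([cfg["config_id"], fold, evaluate(cfg, fold)])
        tmp_path.replace(fold_path)
        executed.append(fold)
    return executed


# Hypothetical transform and classifier names with a dummy scoring function.
configs = build_configs(["clr", "log"], ["rf", "lasso"])
with tempfile.TemporaryDirectory() as tmp:
    first = run_sweep(tmp, configs, 3, lambda cfg, fold: 0.5)
    second = run_sweep(tmp, configs, 3, lambda cfg, fold: 0.5)
    third = run_sweep(tmp, configs, 3, lambda cfg, fold: 0.5, redo=True)
print(first, second, third)
```

The second call does no work because every fold file is already present, while `redo=True` mirrors the effect of a forced rerun.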
ensemble-sweep/
Post-hoc search for the optimal ensemble: member selection strategy × aggregation function. Members are scored on held-out folds (inner validation) to prevent leakage into the ensemble evaluation.
python ensemble-sweep/run_crc_lampp_lodo_na_fast_ensembling.py
python ensemble-sweep/run_dmw_lampp_lodo_na_fast_ensembling.py
python ensemble-sweep/run_ghs_lampp_lodo_na_fast_ensembling.py
python ensemble-sweep/run_ibd_lampp_lodo_na_fast_ensembling.py
python ensemble-sweep/run_scz_lampp_kfold_na_fast_ensembling.py
The selected ensemble is written to:
experiments/<task>_lampp_lodo_na_fast/ensembling/ensemble_best_config.json
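A search over member selection strategy × aggregation function might look like the sketch below. It assumes a single selection strategy (top-k by inner validation score), two aggregation functions (mean and median), and accuracy at a 0.5 threshold as the metric; the real sweep may use other strategies and metrics. All names and numbers are hypothetical toy values.

```python
from itertools import product
from statistics import mean, median

AGGREGATORS = {"mean": mean, "median": median}


def top_k(val_scores, k):
    """Select the k configurations with the highest inner validation score."""
    return sorted(val_scores, key=val_scores.get, reverse=True)[:k]


def ensemble_predict(members, predictions, aggregate):
    """Combine member probabilities sample by sample."""
    n_samples = len(predictions[members[0]])
    return [aggregate([predictions[m][i] for m in members]) for i in range(n_samples)]


def accuracy(probs, labels):
    hits = sum(int(p >= 0.5) == y for p, y in zip(probs, labels))
    return hits / len(labels)


def sweep_ensembles(val_scores, predictions, labels, ks):
    """Score every (k, aggregation) pair and return the best row plus all rows."""
    rows = []
    for k, (agg_name, agg) in product(ks, AGGREGATORS.items()):
        members = top_k(val_scores, k)
        probs = ensemble_predict(members, predictions, agg)
        rows.append({"k": k, "aggregation": agg_name, "members": members,
                     "score": accuracy(probs, labels)})
    return max(rows, key=lambda row: row["score"]), rows


# Inner validation scores drive member selection; the held-out predictions
# and labels are used only to score the assembled ensemble.
val_scores = {"cfg_a": 0.80, "cfg_b": 0.75, "cfg_c": 0.60}
predictions = {
    "cfg_a": [0.9, 0.4, 0.2, 0.6],
    "cfg_b": [0.7, 0.7, 0.4, 0.3],
    "cfg_c": [0.2, 0.9, 0.8, 0.1],
}
labels = [1, 1, 0, 0]
best, rows = sweep_ensembles(val_scores, predictions, labels, ks=[1, 2, 3])
print(best)
```

Keeping selection scores separate from the evaluation labels is what prevents the leakage mentioned above: members are never chosen using the samples on which the ensemble is judged.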
inference/
Train the final ensemble on all available training cohorts and generate held-out test set predictions.
cd inference
python maensemble_predict_crc.py
python maensemble_predict_dmw.py
python maensemble_predict_ghs.py
python maensemble_predict_ibd.py
python maensemble_predict_scz.py
Each script produces a CSV with columns sample_id and prediction (posterior probability of class 1).
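The output format can be illustrated with a short round trip: write a two-column CSV, read it back, and convert probabilities to class calls. The helper name `write_predictions`, the sample IDs, the six-decimal formatting, and the 0.5 decision threshold are all assumptions made for this sketch.

```python
import csv
import io


def write_predictions(handle, sample_ids, probabilities):
    """Write sample_id,prediction rows, checking each value is a probability."""
    writer = csv.writer(handle)
    writer.writerow(["sample_id", "prediction"])
    for sample_id, prob in zip(sample_ids, probabilities):
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"prediction for {sample_id} is not a probability")
        writer.writerow([sample_id, f"{prob:.6f}"])


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_predictions(buffer, ["S1", "S2"], [0.91, 0.08])
buffer.seek(0)
rows = list(csv.DictReader(buffer))

# Threshold the class-1 probability to obtain hard labels (0.5 is an assumption).
calls = {row["sample_id"]: int(float(row["prediction"]) >= 0.5) for row in rows}
print(calls)
```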
uv pip install -e .