Spectroscopy preprocessing using Bayesian Optimization
SpectoPrep provides a toolkit for optimizing spectroscopic data preprocessing pipelines using Bayesian optimization. It automatically discovers the optimal combination of preprocessing techniques and their parameters to improve model performance for spectroscopic data analysis.
- Pipeline Optimization: Automate the discovery of optimal preprocessing pipelines using Bayesian optimization
- Flexible Preprocessing: Choose from multiple preprocessing techniques (MSC, SNV, Savitzky-Golay, etc.)
- Cross-Validation Support: Group-based cross-validation methods for robust evaluation
- Configurable Pipeline Length: Control maximum preprocessing steps and allowed combinations
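The group-based cross-validation mentioned above keeps all samples from one group (for example, repeated scans of the same sample) on the same side of each split, so scores are not inflated by near-duplicate spectra. The sketch below illustrates the idea with NumPy only. The helper `group_shuffle_split` and its `test_fraction` parameter are hypothetical and are not SpectoPrep's API; how the package implements `cv_method="group_shuffle_split"` internally is not shown here.

```python
import numpy as np

def group_shuffle_split(groups, n_splits=3, test_fraction=0.25, random_state=21):
    """Yield (train_idx, test_idx) pairs in which no group is split.

    Hypothetical helper for illustration only, not part of SpectoPrep.
    """
    groups = np.asarray(groups)
    rng = np.random.default_rng(random_state)
    unique_groups = np.unique(groups)
    # Hold out at least one whole group in every split.
    n_test = max(1, int(round(test_fraction * len(unique_groups))))
    for _ in range(n_splits):
        test_groups = rng.choice(unique_groups, size=n_test, replace=False)
        test_mask = np.isin(groups, test_groups)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Six groups with four spectra each (made-up labels).
groups = np.repeat(np.arange(6), 4)
splits = list(group_shuffle_split(groups))

for train_idx, test_idx in splits:
    # Groups shared between train and test; this should be empty.
    shared = np.intersect1d(groups[train_idx], groups[test_idx])
    print(len(train_idx), len(test_idx), shared.size)
```

For real work, scikit-learn's `GroupShuffleSplit` provides a maintained implementation of the same idea.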
pip install spectoprep

from spectoprep.pipeline.optimizer import PipelineOptimizer
import numpy as np
# Prepare your data
X_train = np.array(...) # Your spectral data matrix
y_train = np.array(...) # Your target values
groups = np.array(...) # Optional group labels for cross-validation
# Initialize the optimizer
optimizer = PipelineOptimizer(
    X_train=X_train,
    y_train=y_train,
    X_test=None,
    y_test=None,
    preprocessing_steps=['msc', 'savgol', 'detrend', 'scaler', 'snv',
                         'robust_scaler', 'emsc', 'meancn'],
    cv_method="group_shuffle_split",
    n_splits=3,
    random_state=21,
    groups=groups,
    max_pipeline_length=2,
    allowed_preprocess_combinations=[1, 2]
)
# Run Bayesian optimization to find the best pipeline
best_params, best_pipeline = optimizer.bayesian_optimize(
    init_points=50,
    n_iter=1000
)
# Extract preprocessing steps without the final model
custom_preprocessing = []
for name, step in best_pipeline.steps[:-1]:
    custom_preprocessing.append((name, step))
# Print optimization summary
summary = optimizer.summarize_optimization()
print(f"Best pipeline configuration: {summary['best_pipeline']}")
print(f"Best RMSE: {summary['best_rmse']:.4f}")
# Make predictions with the optimized pipeline
predictions, rmse, r2 = optimizer.get_best_pipeline_predictions(best_pipeline)

Available preprocessing steps:
- msc: Multiplicative Scatter Correction
- savgol: Savitzky-Golay filtering
- detrend: Linear detrending
- scaler: Standard scaling
- snv: Standard Normal Variate
- robust_scaler: Robust scaling
- emsc: Extended Multiplicative Signal Correction
- meancn: Mean centering
- pca: Principal Component Analysis
- select_k_best: Feature selection
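To make two of the steps listed above concrete, here is a minimal NumPy sketch of Standard Normal Variate (SNV) and mean centering as they are commonly defined in chemometrics. The function names `snv` and `mean_center` and the choice of `ddof=1` are assumptions made for illustration; SpectoPrep's own implementations may differ in detail.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: scale each spectrum (row) on its own.

    Each row is centered on its own mean and divided by its own standard
    deviation, which removes per-sample offset and multiplicative scaling.
    """
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    # ddof=1 (sample standard deviation) is an assumption; some
    # implementations use ddof=0 instead.
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def mean_center(spectra):
    """Mean centering: subtract the mean spectrum (column-wise mean)."""
    spectra = np.asarray(spectra, dtype=float)
    return spectra - spectra.mean(axis=0, keepdims=True)

# Synthetic example: one underlying band shape observed with three
# different scatter levels (scale) and baseline offsets (made-up values).
base = np.sin(np.linspace(0.0, 3.0, 50))
scales = np.array([0.5, 1.0, 2.0])
offsets = np.array([0.1, -0.3, 0.7])
spectra = scales[:, None] * base + offsets[:, None]

corrected = snv(spectra)      # rows coincide once scale and offset are removed
centered = mean_center(spectra)  # every wavelength column now averages to zero
```

SNV works row by row and needs no reference spectrum, while mean centering works column by column and depends on the data set it is fitted on, which is why the two are usually treated as separate pipeline steps.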
For detailed documentation, visit spectoprep.readthedocs.io.
We welcome contributions! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
If you use this package in your research, please cite our paper:
@article{babatunde2025automated,
title={Automated Spectral Preprocessing via Bayesian Optimization for Chemometric Analysis of Milk Constituents},
author={Babatunde, Habeeb Abolaji and McDougal, Owen M and Andersen, Timothy},
journal={Foods},
volume={14},
number={17},
pages={2996},
year={2025},
publisher={MDPI}
}

Warning: This package is still under heavy development.