This directory contains scripts and workflows for applying the hybrid modeling pipeline to real bioprocess data.
```
real_data_hybrid_modeling/
├── data/                    # Place your real bioprocess data files here
├── inputs/                  # Preprocessed data files
├── outputs/                 # Results, models, and visualizations
├── logs/                    # Job logs
├── load_real_data.py        # Script to load and preprocess real data
├── run_real_data.py         # Main script for real data analysis
├── run_real_data_puhti.sh   # Batch script for Puhti
└── README.md                # This file
```
Your real bioprocess data should be in one of these formats:

- CSV File (Recommended)
  - Columns: `time`, `biomass` (or `X`), `substrate` (or `S`), `product` (or `P`)
  - Optional: `experiment_id`, `pH`, `temperature`, `DO` (dissolved oxygen)
  - Example:

    ```
    time,biomass,substrate,product,experiment_id
    0,0.2,10.0,0.0,exp1
    10,0.5,8.5,0.1,exp1
    20,1.2,6.0,0.3,exp1
    ...
    ```

- Excel File (.xlsx, .xls)
  - Same column structure as CSV
  - Can have multiple sheets (one per experiment)

- Multiple Files
  - One file per experiment
  - All files should have the same structure
- Time Series Data: Measurements over time for each experiment
- Minimum Features:
- Time (hours)
- Biomass concentration (X, cells/mL or g/L)
- Substrate concentration (S, g/L)
- Product concentration (P, g/L) - optional but recommended
- Multiple Experiments: At least 3-5 experiments for meaningful training
- Time Points: At least 10-20 time points per experiment
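To sanity-check that a file meets these requirements before running the full pipeline, here is a minimal sketch using pandas (the file name is a placeholder; adapt it to your data):

```python
import pandas as pd

REQUIRED = ["time", "biomass", "substrate"]  # 'product' is optional but recommended

df = pd.read_csv("data/my_experiments.csv")  # hypothetical file name

missing = [c for c in REQUIRED if c not in df.columns]
if missing:
    raise ValueError(f"Missing required columns: {missing}")

# At least 3-5 experiments with 10-20 time points each is recommended
if "experiment_id" in df.columns:
    counts = df.groupby("experiment_id")["time"].count()
    print(f"{len(counts)} experiments, "
          f"{counts.min()}-{counts.max()} time points per experiment")
else:
    print(f"Single experiment with {len(df)} time points")
```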
Place your data file(s) in the `data/` directory:

```bash
cd "/scratch/project_2010726/solution_data scientist/real_data_hybrid_modeling"

# Copy your data file here
cp /path/to/your/data.csv data/
```

Then edit `load_real_data.py` to match your data format:
```python
# Update column names to match your data
COLUMN_MAPPING = {
    'time': 'time',            # or 'Time', 't', etc.
    'biomass': 'biomass',      # or 'X', 'cell_density', etc.
    'substrate': 'substrate',  # or 'S', 'glucose', etc.
    'product': 'product'       # or 'P', 'protein', etc.
}
```
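If your file uses different names, such a mapping can be applied with a pandas rename. A hypothetical sketch (the inversion direction depends on how `load_real_data.py` actually consumes `COLUMN_MAPPING`, so treat this as illustrative only):

```python
import pandas as pd

# Hypothetical values: pipeline name -> column name in your file
COLUMN_MAPPING = {
    'time': 'Time',
    'biomass': 'X',
    'substrate': 'glucose',
    'product': 'P',
}

df = pd.read_csv("data/my_experiments.csv")  # hypothetical file name
# Invert the mapping so the DataFrame ends up with the pipeline's names
df = df.rename(columns={v: k for k, v in COLUMN_MAPPING.items()})
```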
Run the analysis. On Puhti (recommended):

```bash
sbatch run_real_data_puhti.sh
```

Locally:

```bash
python run_real_data.py
```

The pipeline will automatically:
- Load Data: Read from CSV/Excel files
- Clean Data:
  - Remove missing values
  - Handle outliers
  - Normalize units
- Prepare Sequences:
  - Create time series sequences for the LSTM (see the sketch after this list)
  - Handle multiple experiments
- Split Data:
  - Train/validation/test sets
  - Maintain temporal structure
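Sequence preparation is implemented inside `load_real_data.py`; purely as an illustration of the windowing idea (window length and grouping below are placeholder choices, not the pipeline's exact logic):

```python
import numpy as np

def make_sequences(series: np.ndarray, window: int = 5):
    """Split a (T, n_features) series into (past window, next step) pairs."""
    X, y = [], []
    for start in range(len(series) - window):
        X.append(series[start:start + window])  # window of past measurements
        y.append(series[start + window])        # next time point to predict
    return np.array(X), np.array(y)

def sequences_by_experiment(experiments: dict, window: int = 5):
    """Build windows per experiment so no sequence spans two experiments."""
    Xs, ys = [], []
    for exp_id, series in experiments.items():  # exp_id -> (T, 3) array
        X, y = make_sequences(series, window)
        Xs.append(X)
        ys.append(y)
    return np.concatenate(Xs), np.concatenate(ys)
```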
Edit `run_real_data.py` to customize:
- Mechanistic Parameters: Based on your cell line/process

  ```python
  mechanistic_params = {
      'mu_max': 0.5,   # Adjust based on your process
      'Ks': 0.1,       # Substrate saturation constant
      'Yxs': 0.5,      # Biomass yield
      'Yps': 0.3,      # Product yield
      'qp_max': 0.1    # Product formation rate
  }
  ```
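  These names suggest Monod-type kinetics. The actual equations live in the main pipeline; purely for orientation, one standard Monod formulation these parameters could plug into (an assumption, not the pipeline's verified model):

  ```python
  def monod_rates(X, S, p):
      """Assumed Monod-style rates; verify against the pipeline's model."""
      mu = p['mu_max'] * S / (p['Ks'] + S)       # specific growth rate
      qp = p['qp_max'] * S / (p['Ks'] + S)       # specific production rate
      dX = mu * X                                # biomass growth
      dS = -(mu / p['Yxs'] + qp / p['Yps']) * X  # substrate for growth + product
      dP = qp * X                                # product formation
      return dX, dS, dP
  ```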
- Model Architecture:

  ```python
  model = HybridModel(
      mechanistic_params=mechanistic_params,
      ml_input_dim=3,              # Increase if adding pH, T, etc.
      ml_hidden_dim=64,            # Adjust based on data complexity
      ml_num_layers=2,             # Deeper for complex patterns
      use_residual_learning=True
  )
  ```
- Training Parameters:

  ```python
  trainer = Trainer(
      model=model,
      learning_rate=0.001,  # Adjust learning rate
      weight_decay=1e-5
  )
  ```
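  A hypothetical training call, for orientation only (the method name and arguments below are assumptions, not the confirmed `Trainer` interface; the real entry point is `run_real_data.py`):

  ```python
  # Hypothetical: 'train' and its arguments are NOT the verified Trainer API;
  # see ../hybrid_modeling_pipeline/README.md for the real interface.
  history = trainer.train(train_data, val_data, num_epochs=100)
  ```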
After running, you'll find in `outputs/`:

- `training_history.png` - Training curves
- `predictions.png` - Model predictions vs observations
- `prediction_scatter.png` - Prediction accuracy
- `metrics_comparison.png` - Hybrid vs mechanistic comparison
- `final_model.pt` - Trained model
- `real_data_analysis_report.md` - Detailed analysis report
If you get errors about missing columns:
- Check that your column names match the mapping in `load_real_data.py`
- Ensure the required columns (time, biomass, substrate) are present
- Check for typos or extra spaces in column names
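A quick way to spot naming problems, assuming pandas (the file name is a placeholder):

```python
import pandas as pd

df = pd.read_csv("data/my_experiments.csv")  # hypothetical file name
print(list(df.columns))                      # inspect the raw names first

# Normalize away the usual culprits: stray spaces and inconsistent case
df.columns = df.columns.str.strip().str.lower()
```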
If model performance is poor:
- Check for outliers or missing values
- Ensure sufficient data (multiple experiments, enough time points)
- Verify data units are consistent
- Consider data normalization
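Two of these checks can be scripted quickly; a minimal sketch, assuming pandas and a simple 1.5 × IQR rule for flagging outliers (the threshold and file name are placeholder choices):

```python
import pandas as pd

df = pd.read_csv("data/my_experiments.csv")  # hypothetical file name

print(df.isna().sum())  # missing values per column

# Flag potential outliers column by column with a 1.5 * IQR rule
for col in ["biomass", "substrate", "product"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
    print(f"{col}: {mask.sum()} potential outliers")
```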
If you run out of memory:
- Reduce the batch size in `run_real_data.py`
- Use fewer experiments or time points
- Request more memory in the batch script
Example: a single CSV file containing multiple experiments:

```
time,biomass,substrate,product,experiment_id
0,0.2,10.0,0.0,exp1
10,0.5,8.5,0.1,exp1
20,1.2,6.0,0.3,exp1
0,0.15,12.0,0.0,exp2
10,0.4,9.0,0.08,exp2
...
```

Or, as multiple files (one per experiment):

```
data/
├── experiment1.csv
├── experiment2.csv
├── experiment3.csv
...
```

Each file:

```
time,biomass,substrate,product
0,0.2,10.0,0.0
10,0.5,8.5,0.1
...
```
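To combine multiple per-experiment files into the single-CSV layout shown above, a minimal sketch assuming pandas (the file pattern and output path are placeholders):

```python
import glob
import pandas as pd

frames = []
for path in sorted(glob.glob("data/experiment*.csv")):  # hypothetical pattern
    df = pd.read_csv(path)
    # Derive an experiment_id from the file name, e.g. 'experiment1'
    df["experiment_id"] = path.split("/")[-1].removesuffix(".csv")
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
combined.to_csv("data/combined.csv", index=False)  # hypothetical output name
```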
To get started:

- Prepare Your Data: Format according to the requirements above
- Test Data Loading: Run `load_real_data.py` to verify that your data loads correctly
- Run Analysis: Submit the batch job or run locally
- Review Results: Check outputs and analysis report
- Iterate: Adjust parameters and rerun as needed
For issues or questions:
- Check the main pipeline README: `../hybrid_modeling_pipeline/README.md`
- Review the example usage: `../hybrid_modeling_pipeline/example_usage.py`
- Check the logs in the `logs/` directory
Md Karim Uddin, PhD
PhD Veterinary Medicine | MEng Big Data Analytics
Postdoctoral Researcher, University of Helsinki
- GitHub: @mdkarimuddin
- LinkedIn: Md Karim Uddin, PhD
This project is licensed under the MIT License - see the LICENSE file for details.