Author: Sharath Sathish, University of York, UK
Deep reinforcement learning for autonomous control of a supercritical CO₂ (sCO₂) recuperated Brayton cycle recovering waste heat from steel industry electric arc furnace (EAF) and basic oxygen furnace (BOF) exhaust (200–1,200°C). Trained on a physics-faithful OpenModelica FMU via the FMPy interface on an NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified memory).
| Metric | Value |
|---|---|
| RL vs ZN-PID: Phase 0 (steady-state) | +30.3% cumulative reward |
| RL vs ZN-PID: Phase 1 (±30% load following) | +30.4% |
| RL vs ZN-PID: Phase 2 (ambient disturbance) | +39.0% |
| Phases 3–6 (severe transients) | PID wins; attributed to curriculum imbalance (<5% of training steps each) |
| Constraint violations (140 eval episodes) | 0 (RL and PID) |
| MLP surrogate: PPO vs PID tracking error | 18.5× lower (0.122 MW vs 2.259 MW) |
| MLP surrogate: GPU training throughput | 250,000 steps/s (470× faster than FMU) |
| FNO surrogate (PhysicsNeMo) | R² = 1.000, RMSE = 0.0010 (76,600 LHS trajectories) |
| TensorRT FP16 deployment | p99 = 0.046 ms (22× under 1 ms SLA) |
| Training hardware | NVIDIA DGX Spark, GB10 Grace Blackwell, 128 GB |
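The derived ratios in the table can be re-checked from the raw figures. The short sketch below recomputes them using only numbers quoted in this README (the 530 steps/s FMU-direct throughput comes from the feature list); the variable names are illustrative and not part of `sco2rl`.

```python
# Values copied from this README; variable names are illustrative only.
pid_error_mw = 2.259        # ZN-PID tracking error on the MLP surrogate
ppo_error_mw = 0.122        # PPO tracking error on the MLP surrogate
gpu_steps_per_s = 250_000   # MLP surrogate GPU training throughput
fmu_steps_per_s = 530       # FMU-direct PPO throughput (8 CPU workers)
p99_latency_ms = 0.046      # TensorRT FP16 p99 inference latency
sla_ms = 1.0                # latency budget

error_ratio = pid_error_mw / ppo_error_mw       # how much lower the PPO error is
speedup = gpu_steps_per_s / fmu_steps_per_s     # GPU surrogate vs FMU throughput
sla_headroom = sla_ms / p99_latency_ms          # margin under the 1 ms SLA

print(f"tracking error ratio: {error_ratio:.1f}x")
print(f"throughput speedup:   {speedup:.0f}x")
print(f"SLA headroom:         {sla_headroom:.1f}x")
```

The unrounded results are roughly 18.5×, 472×, and 21.7×, consistent with the rounded 18.5×, 470×, and 22× reported in the table.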
- Physics simulation: OpenModelica FMU (FMI 2.0 Co-Simulation) via FMPy with CoolProp Span-Wagner CO₂ EOS
- Gymnasium environment: 14-variable observation, 4-dim continuous action space, Lagrangian safety constraints
- 7-phase curriculum: steady-state → load following → ambient disturbance → EAF transients → load rejection → cold startup → emergency trip
- Dual training paths: FMU-direct PPO (SB3, 8 CPU workers, 530 steps/s) and MLP surrogate GPU path (1,024 parallel envs, 250K steps/s)
- Surrogate models: MLP step predictor (val_loss = 5×10⁻⁶) + NVIDIA PhysicsNeMo FNO (R² = 1.000)
- Deployment: PyTorch → ONNX → TensorRT FP16, p99 = 0.046 ms
- Paper: Full LaTeX manuscript (41 pages, 4 appendices) ready for arXiv submission
- Practitioner lessons: Five non-trivial infrastructure bugs documented with diagnosis, fix, and detection strategy
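The "Lagrangian safety constraints" item above can be illustrated with a generic dual-ascent update. This is a minimal sketch, assuming a single scalar constraint cost per episode; `LagrangeMultiplier`, `cost_limit`, and `lr` are hypothetical names chosen for illustration, and the formulation in `src/sco2rl/` may differ.

```python
from dataclasses import dataclass


@dataclass
class LagrangeMultiplier:
    """Scalar multiplier for one constraint (hypothetical helper, not the repo's API)."""

    cost_limit: float    # allowed mean constraint cost per episode (assumed)
    lr: float = 0.05     # dual-ascent step size (assumed)
    value: float = 0.0   # current multiplier, kept non-negative

    def penalized_reward(self, reward: float, cost: float) -> float:
        # Lagrangian relaxation: subtract the weighted constraint cost.
        return reward - self.value * cost

    def update(self, mean_episode_cost: float) -> float:
        # Dual ascent: grow the multiplier while the limit is exceeded,
        # shrink it (never below zero) when the policy is under the limit.
        step = self.lr * (mean_episode_cost - self.cost_limit)
        self.value = max(0.0, self.value + step)
        return self.value


lam = LagrangeMultiplier(cost_limit=0.0)
lam.update(mean_episode_cost=2.0)                      # violation: multiplier rises
shaped = lam.penalized_reward(reward=1.0, cost=2.0)    # reward after penalty
lam.update(mean_episode_cost=0.0)                      # at the limit: unchanged
print(lam.value, shaped)
```

The multiplier only grows while violations are observed, so a policy that already satisfies its constraints is left with the unpenalized reward.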
| Notebook | Description | Colab |
|---|---|---|
| 01_cycle_analysis | Open-loop FMU thermodynamic traces | |
| 02_reward_shaping | Reward component diagnostics | |
| 03_surrogate_validation | FNO surrogate V1 vs V2 fidelity analysis | |
| 04_policy_evaluation | RL vs PID evaluation across all phases | |
| 05_control_analysis | Step response, Bode plots, control metrics | |
All notebooks auto-detect Google Colab, clone the repo, and install requirements. No Google Drive connection needed.
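The auto-detection described above might look roughly like the following first notebook cell. This is a sketch under assumptions, not the notebooks' actual code: it relies on the common idiom that `google.colab` is already loaded in a Colab kernel, it assumes the requirements live in a `requirements.txt` file, and the repository URL and directory are left as placeholders because this README does not state them.

```python
import sys
from typing import List, Mapping


def in_colab(modules: Mapping[str, object] = sys.modules) -> bool:
    """Return True when the Colab runtime module is already loaded."""
    return "google.colab" in modules


def setup_commands(repo_url: str, repo_dir: str) -> List[str]:
    """Build the shell commands a first notebook cell could run on Colab.

    repo_url and repo_dir are supplied by the caller; requirements.txt is
    an assumed file name.
    """
    return [
        f"git clone --depth 1 {repo_url} {repo_dir}",
        f"pip install -r {repo_dir}/requirements.txt",
    ]


if in_colab():
    # Placeholders only: substitute the real repository URL and directory.
    for cmd in setup_commands("<REPO_URL>", "<REPO_DIR>"):
        print(cmd)
else:
    print("Local run: assuming the repo and requirements are already installed.")
```

Passing the module table as a parameter keeps the check easy to exercise outside Colab, for example `in_colab({"google.colab": object()})`.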
src/sco2rl/ Core library (environment, training, surrogate, deployment)
configs/ YAML configs (environment, curriculum, surrogate, training)
scripts/ CLI scripts (train, evaluate, export, figure generation)
notebooks/ Interactive analysis notebooks (Colab-ready, with inline outputs)
paper/ LaTeX manuscript (arXiv-compatible split .tex + .bib)
data/ Pre-computed report JSON files (tracked for Colab)
tests/ Unit tests
# Build image (compiles OpenModelica + CoolProp + ExternalMedia for ARM64)
docker build -t sco2-rl-automation:latest .
# Launch with GPU access
docker run --rm -it --gpus all -v $(pwd):/workspace --shm-size=64g sco2-rl-automation:latest
# Inside container: collect trajectories, train surrogate, train RL
python scripts/collect_trajectories.py --n-trajectories 100000
python scripts/train_mlp_surrogate.py
python scripts/train_ppo_mlp.py
python scripts/cross_validate_and_export.py
The paper is in paper/ with split section files for arXiv compatibility:
cd paper && pdflatex main && bibtex main && pdflatex main && pdflatex main
Pre-compiled PDF: paper/main.pdf (41 pages)
@article{sathish2026sco2rl,
title={Deep Reinforcement Learning for Autonomous Control of Supercritical CO$_2$ Brayton Cycles in Steel Industry Waste Heat Recovery},
author={Sathish, Sharath},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2026}
}
License: MIT