Open reproducibility kit for two companion Euro-Par 2026 Workshop papers on flexible, grid-responsive AI/HPC supercomputers — one on the upstream user-side contract (PECS), one on the downstream sub-second actuation controller (WHPC).
GridPilot is the downstream actuation layer: a three-tier predictive controller (per-GPU at 200 Hz, per-host at 1 Hz, per-cluster hourly) plus an out-of-band safety-island bypass, validated on a real 3× NVIDIA V100 SXM2 testbed. It is paired with the upstream contract (f-SLA): a five-tier ladder (T0 rigid, T1 hour, T2 day, T3 week, T4 elastic burst) under which job submitters declare deferrability and elasticity in exchange for proportional service credits, evaluated on the Marconi100 (M100) production trace against six European grids.
The two papers are:
| Paper | Folder | Focus |
|---|---|---|
| WHPC 2026 | papers/whpc2026/ |
GridPilot controller (attributed) |
| PECS 2026 | papers/pecs2026/ |
f-SLA contract (double-blind) |
Together they form a cross-layer flexibility programme: user-side incentives elicit flexibility (PECS) that the platform-level controller can dispatch deterministically at the facility meter (WHPC).
These are auto-extracted from on-disk experiment outputs into
papers/{whpc,pecs}2026/figs/results.tex by
scripts/figures/extract_paper_macros.py — there are no hard-coded
numbers in the LaTeX body text. The current real measurements are:
- WHPC E7 (FFR latency on V100): ~97 ms median end-to-end response across 90 trials (max ≤ 102 ms), ~7× faster than the 700 ms Nordic FFR budget. 90/90 trials pass.
-
WHPC E1 (power-cap calibration): best-efficiency operating point
$p_{\text{cap}}=150$ W,$f_{\text{sm}}=945$ MHz, within ±5 % on iterations-per-joule across matmul / inference / bursty. Per-workload power model LOOCV MAE 3.45 %; 980-node scaling envelope matches the published Marconi100 facility-power reference within +1.4 %. - WHPC E3 (AR(4) predictor MAE): ~4.7 W (inference), ~7.0 W (matmul), ~19.7 W (bursty) on the V100 testbed.
- WHPC E8 (PUE-aware controller sweep): 50 MW cooling-overhead drag closed across six European grids is 2.5–5.8 pp; envelope is widest on low-CI grids.
- PECS multi-country sweep: the f-SLA contract measurably shifts both CFE share and energy-weighted effective grid CI on the M100 trace; the relative lift is largest on the cleanest grids (SE, CH, FR) and the largest absolute avoided tonnage is on the dirtiest (PL, DE).
# From the workspace root (one level above gridpilot/):
git submodule update --init --recursive
python3 -m venv .venv && source .venv/bin/activate
pip3 install --upgrade pip setuptools wheel
pip3 install -r gridpilot/requirements.txt
PYTHONPATH=gridpilot/src pytest -q gridpilot/tests/ # expect 54 passed
bash gridpilot/scripts/run_all_experiments.shDependency files used by this repository:
requirements.txt(main GridPilot dependencies; required)raps/api_client/requirements.txt(optional; only if you use RAPS API client code)raps/pyproject.toml(optional; only if you develop/run the RAPS package itself)
If you work directly inside gridpilot/, use:
python3 -m venv .venv && source .venv/bin/activate
pip3 install --upgrade pip setuptools wheel
pip3 install -r requirements.txtAll subsequent commands assume the virtual environment is active. The single most useful entry point is:
bash scripts/run_all_experiments.shThis runs the M100 policy-matrix replay, the multi-country sweep,
regenerates every figure, extracts the headline-number macros, and
rebuilds both PDFs. Wall time on a 16-core / 64 GB workstation is
~75 minutes. For a fast paper rebuild from literature-anchored stub
data, append stub:
bash scripts/run_all_experiments.sh stub # ~2 minutesFor a step-by-step description of every stage (including how to
re-run a single stage), see docs/RUNBOOK.md.
Two ablation sweeps complement the multi-country headline above:
# Per-tier contribution sweep (T0..T5; ~35 min on 16 cores):
PYTHONPATH=gridpilot/src python3 \
gridpilot/scripts/multicountry/replay_single_tier_sweep.py \
--jobs gridpilot/data/traces/m100_real_jobs.parquet \
--output-dir gridpilot/data/m100/tier_sweep/ --force
# Contract-hyperparameter sensitivity sweep (~20 min):
PYTHONPATH=gridpilot/src python3 \
gridpilot/scripts/multicountry/replay_hyperparameter_sweep.py \
--jobs gridpilot/data/traces/m100_real_jobs.parquet \
--output-dir gridpilot/data/m100/hyper_sweep/ --force
# Composed 2x2 figure (~5 s):
PYTHONPATH=gridpilot/src python3 \
gridpilot/scripts/figures/fig_tier_and_hyper.py \
--tier-summary gridpilot/data/m100/tier_sweep/TIER_SUMMARY.csv \
--hyper-summary gridpilot/data/m100/hyper_sweep/HYPER_SUMMARY.csv \
--out gridpilot/figs/fig_tier_and_hyper.pdfProtocol: docs/TIER_AND_HYPER_SWEEPS.md.
Remote-run copy-paste sequence: ../EXPERIMENTS_REMOTE.md.
gridpilot/
├── README.md ← this file
├── LICENSE ← MIT (code, scripts, configs)
├── CONTRIBUTING.md ← how to extend the framework
├── LIMITATIONS.md ← scope caveats and lessons learned
├── requirements.txt ← Python environment specification
├── .gitignore
├── src/ ← controller + scheduler library
│ ├── controller/ ← Tier-1/2/3 PID/AR(4)/cluster controllers
│ ├── cooling/ ← four-component PUE model
│ ├── scheduler/ ← f-SLA ladder, M0–M3 mechanisms, dispatcher
│ └── integration/ ← RAPS adapter + ENTSO-E connector
├── scripts/ ← all run scripts (entry: run_all_experiments.sh)
│ ├── run_all_experiments.sh ← single end-to-end command
│ ├── m100/ ← M100 trace ETL, replays, ENTSO-E fetcher
│ ├── multicountry/ ← multi-country sweep driver + stub seeder
│ ├── v100/experiments/ ← V100 hardware-experiment drivers (E1–E7)
│ └── figures/ ← figure scripts (read CSVs → PDFs)
├── data/ ← bundled traces + telemetry (CC-BY 4.0)
│ ├── traces/m100_real_jobs.parquet
│ ├── m100/{policy_matrix,country_sweep}/ ← outputs of the replay drivers
│ └── v100_raw/ ← raw V100 measurement campaign telemetry
├── configs/ ← YAML configurations
│ ├── grids/{SE,CH,FR,IT,DE,PL}.yaml ← per-country CI configs
│ └── raps_systems/marconi100.yaml ← tracked fallback PUE anchor
├── raps/ ← bundled ExaDigiT/RAPS submodule (read-only)
│ └── config/marconi100.yaml ← cooling-model calibration anchor
├── figs/ ← rendered figure PDFs (auto-generated)
├── docs/ ← protocols + reproducibility documentation
│ ├── RUNBOOK.md ← step-by-step rerun guide
│ ├── ARCHITECTURE.md ← three-tier controller details
│ ├── DATASETS.md ← M100, ENTSO-E, RAPS dataset descriptions
│ ├── V100_MEASUREMENT_PROTOCOL.md ← V100 campaign protocol
│ ├── FSLA_PROTOCOL.md ← f-SLA contract specification
│ ├── COMPANION_PAPERS_MAP.md ← claim → script → artefact map
│ └── COUNTRY_SWEEP_PROTOCOL.md, POLICY_MATRIX_PROTOCOL.md
├── tests/ ← pytest suite (~30 s; 48 tests)
└── licenses/ ← third-party data licence acknowledgements
The release bundles a copy of the ExaDigiT/RAPS repository under
raps/ and consumes its canonical system configurations at
raps/config/<system>.yaml. For remote clones where the submodule is
not initialized, scripts/run_all_experiments.sh falls back to
configs/raps_systems/marconi100.yaml for the Marconi100 PUE anchor.
This is a lightweight integration
mode: we read RAPS configs to extract per-node power, node counts,
cooling efficiency, and country code, but we do not run the RAPS
simulation engine.
The bridge is src/integration/raps_config_adapter.py, which parses
any RAPS system YAML into a RAPSSystemConfig dataclass. Two
calibration cross-checks ship under scripts/raps_adapter/:
# Run from the gridpilot/ directory with the project's virtualenv active.
python3 scripts/raps_adapter/m100_calibration_check.py
python3 scripts/raps_adapter/frontier_calibration_check.py
# If you are one level above (workspace root), use:
python3 gridpilot/scripts/raps_adapter/m100_calibration_check.py
python3 gridpilot/scripts/raps_adapter/frontier_calibration_check.pyBoth emit JSON reporting the parsed geometry, per-node max power,
calibrated PUE, and the structural gap percentage against the RAPS
scalar cooling_efficiency. The Marconi100 cross-check reports a
~12 % structural gap (the four-component PUE model is calibrated to
the published anchor PUE 1.20 rather than the RAPS scalar cooling
efficiency 0.945); Frontier matches within ±2 %.
PYTHONPATH=src pytest -q48 tests in total, covering: f-SLA tier ladder + Dirichlet prior, M0–M3 anti-gaming mechanisms + SWFs + Jain, PUE-aware scheduler invariants, RAPS YAML adapter, ENTSO-E CI loader, scheduler error paths, CLI round-trip. All tests pass without network access in under 30 seconds.
The complete claim-to-artefact map for both companion papers, with
one-line jq / awk / python verification commands for every
headline number, is in
docs/COMPANION_PAPERS_MAP.md.
A reviewer can verify the paper–data correspondence by direct file
comparison without re-running anything.
The V100 hardware E1–E7 campaign is not re-runnable from this kit
alone — it requires the EPFL EcoCloud ecocloud-exp06 testbed (or a
comparable 3× V100 SXM2 testbed with NVML). The raw 100 Hz NVML
telemetry is shipped under data/v100_raw/ (CC-BY 4.0); for a fresh
measurement, follow
docs/V100_MEASUREMENT_PROTOCOL.md
(≤ 48 GPU-hours on the same testbed class).
We document scope caveats explicitly so reviewers and reproducers can
calibrate expectations. See
LIMITATIONS.md for the full discussion; four
lessons summarised:
- The 5 % closed-loop tracking threshold (E4) is a cascade-composition diagnostic, not a failure mode — the bursty 11.08 % residual is what the Tier-2 host predictor absorbs.
- The ~97 ms FFR latency (E7) is reproducible only with the deterministic safety-island bypass — Python-only implementations show p99 > 250 ms.
- Facility-meter accounting is the binding correctness criterion: a controller that ignores the four-component PUE under-delivers at the meter by 4–7 pp.
- Deferral alone is not enough on a small, lightly-loaded trace like the bundled M100 month — the literature-grounded T4 elastic- burst tier (CarbonScaler), spatial routing (GAIA), and price- proportional credits (Lechowicz) are the three SOTA-grounded extensions the kit ships or designs.
Two papers are in submission for Euro-Par 2026 Workshops; cite the companion most relevant to your use:
@inproceedings{constantinescu2026gridpilot,
title = {{GridPilot}: Real-Time Grid-Responsive Control for {AI} Supercomputers},
author = {Constantinescu, Denisa-Andreea and Atienza, David},
booktitle = {Euro-Par 2026 Workshops --- Women in HPC (WHPC) Session},
year = {2026},
publisher = {Springer LNCS},
note = {Companion paper to ``f-SLA: A User-Side Contract for
Truthful Workload Flexibility towards Carbon-Free
Supercomputing''}
}
@inproceedings{anonymous2026fsla,
title = {{f-SLA}: A User-Side Contract for Truthful Workload Flexibility
towards Carbon-Free Supercomputing},
author = {Anonymous},
booktitle = {Euro-Par 2026 Workshops --- Performance and Energy-efficient
Computing Systems (PECS)},
year = {2026},
publisher = {Springer LNCS},
note = {Double-blind submission; authorship restored at camera-ready}
}This work has been partially supported by the EPFL Solutions 4 Sustainability program ‘‘HeatingBits: renewable-supplied data centers integrating heating and cooling supply of local districts’’ and the UrbanTwin project (ETH Board Joint Initiatives for the Strategic Area Energy, Climate and Environmental Sustainability, and the Strategic Area Engagement and Dialogue with Society). The authors also thank the EcoCloud center of EPFL, in particular Dr. Xavier Ouvrard, for providing access to the V100 server node.
- Code (scripts, configs): MIT
- Figures and data: CC-BY 4.0
- Papers: subject to the Euro-Par 2026 Workshops publication agreement
- Third-party data (M100 PM100, ENTSO-E A75, RAPS configurations):
see
licenses/THIRD_PARTY.md.
For questions, issues, and contributions: see
CONTRIBUTING.md or email
denisa.constantinescu@epfl.ch.