This folder contains the GridForecast step: building machine-learning-ready time series datasets from GridExpand results and training MLP and Transformer models to forecast grid-level net demand.
The code is organized into three sub-steps:
- `0_preprocessing/`: convert the raw GridExpand/PostPowerflow HDF5 outputs into compact ML tables (`ts_train.h5`, `ts_test.h5`).
- `2_mlp/`: train an MLP baseline with Ray Tune hyperparameter optimization (HPO).
- `3_transformer/`: train a Transformer model (optionally with VMD feature decomposition) with Ray Tune HPO.
Important: several scripts contain absolute paths to the LRZ DSS filesystem (e.g. `/dss/...`). You must adjust those paths if you run elsewhere.
This step is designed to: (1) convert GridExpand/PostPowerflow outputs into ML-ready time series tables, and (2) train + evaluate forecasting models on those tables.
- Reads one HDF5 per grid from `Config.input_dir` (see `0_preprocessing/config.py`) and extracts a consistent set of time series and scalar features.
- Produces ML tables in a format consumable by both models:
  - `X` and `y` stored as `pandas.DataFrame` under the HDF5 keys `X` and `y`
  - MultiIndex rows: `(batch, hour)`, where `batch` is the grid id
- Creates an intermediate, inspection-friendly xarray file (`Data/all_data.h5`) with groups `X_ts`, `X_sclr`, `y_ts`, `y_sclr`.
- Performs a train/test split by grid id (not by time) to reduce leakage across grids; see the sketch after this list.
- Optional (thesis-specific): filters "small grids" and writes `ts_*_large_grids.h5`.
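A minimal sketch of the grid-id split idea, assuming `X` and `y` are the MultiIndex tables described above (variable names are illustrative; the notebook's actual implementation may differ):

```python
# Illustrative grid-id split: all hours of a grid land on the same side
# of the split, which avoids leakage across grids.
import numpy as np

rng = np.random.default_rng(42)
grid_ids = X.index.get_level_values("batch").unique().to_numpy()
rng.shuffle(grid_ids)

test_ids = set(grid_ids[: int(0.2 * len(grid_ids))])   # hold out 20% of grids
test_mask = X.index.get_level_values("batch").isin(test_ids)

X_train, X_test = X[~test_mask], X[test_mask]
y_train, y_test = y[~test_mask], y[test_mask]
```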
Limitations / assumptions:

- Input raw files must contain the expected group structure (e.g. `raw_data/...`, `urbs_out/reduced_data/...`); a quick check is sketched below.
- The produced features/targets must match the column names listed in `X_BASE_COLS` / `TARGET_COLS` in the training scripts.
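For example, a quick structural sanity check on one raw file could look like this (group names per the list above; the file name is hypothetical):

```python
# Verify that one raw grid file exposes the groups the notebook expects.
import h5py

EXPECTED = ["raw_data", "urbs_out/reduced_data"]  # per the assumption above

with h5py.File("grid_0001.h5", "r") as f:         # hypothetical file name
    for group in EXPECTED:
        assert group in f, f"missing group: {group}"
        print(group, "->", list(f[group])[:5])    # first few member names
```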
MLP (`2_mlp/`):

- Trains an MLP on per-hour rows (no windowing) with a preprocessing pipeline that can add:
  - `log1p` features (`LOG1P_COLS`)
  - seasonal sin/cos features (`TS_PERIODS`); sketched below
  - scaling via `StandardScaler`
- Supports single-target or two-target training (interpreted as active power P and reactive power Q).
- Implements normalized-error training via MAE/MAEx.
- Runs hyperparameter optimization (HPO) via Ray Tune + Optuna (TPE) + ASHA and writes results under `2_mlp/ray_tune/`.
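As a rough illustration of the `log1p` + seasonal-feature steps (column and period values are assumptions, not the actual `LOG1P_COLS`/`TS_PERIODS`):

```python
# Sketch of the preprocessing additions: log1p for heavy-tailed columns and
# sin/cos encodings of the hour index for each seasonal period.
import numpy as np
import pandas as pd

def add_features(X: pd.DataFrame,
                 log1p_cols=("demand_el",),                  # illustrative column
                 periods=(24, 168, 8760)) -> pd.DataFrame:   # day/week/year
    X = X.copy()
    for c in log1p_cols:
        X[f"log1p_{c}"] = np.log1p(X[c])
    hour = X.index.get_level_values("hour").to_numpy()
    for p in periods:
        X[f"sin_{p}"] = np.sin(2 * np.pi * hour / p)
        X[f"cos_{p}"] = np.cos(2 * np.pi * hour / p)
    return X
```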
Transformer (`3_transformer/`):

- Trains a Transformer encoder model on sliding windows of length `core_len + 2*pad_hours`.
- Supports two aggregation modes to produce aggregated-hour outputs:
  - `aggregation_mode='sum'`: trim the pad, then sum blocks of `agg_hours`
  - `aggregation_mode='conv'`: depthwise Conv1D aggregation with `conv_padding` context
- Supports optional feature augmentation via VMD (Variational Mode Decomposition) for selected columns; VMD can be computed and cached to disk (`VMD_APPROACH='write'`) or read from cache (`'read'`). A caching sketch follows this list.
- Supports multiple loss functions (e.g. `mae`, `mse`, `mae_maex`, `alpha_peak`).
- Provides evaluation helpers that can produce metrics and plots, plus integrated-gradients explainability.
- Runs HPO via Ray Tune + Optuna (TPE) + ASHA and writes results under `3_transformer/ray_tune/`.
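A hedged sketch of the write/read caching idea, assuming the `vmdpy` package (the repo's actual cache layout under `3_transformer/data/` may differ; all names here are hypothetical):

```python
# Compute VMD modes once and cache them to HDF5, or read them back.
import numpy as np
import pandas as pd
from vmdpy import VMD

VMD_APPROACH = "write"                        # 'write' or 'read', as in the repo

def vmd_modes(signal: np.ndarray, K: int = 5) -> np.ndarray:
    # alpha/tau/DC/init/tol are standard vmdpy arguments; values illustrative
    u, _, _ = VMD(signal, alpha=2000, tau=0.0, K=K, DC=0, init=1, tol=1e-7)
    return u                                  # (K, len(signal)) mode matrix

cache, key = "data/vmd_cache.h5", "demand_el/grid_17"   # hypothetical names
series = np.sin(np.arange(8760) / 24.0)                 # stand-in signal

if VMD_APPROACH == "write":
    pd.DataFrame(vmd_modes(series).T).to_hdf(cache, key=key)
else:
    modes = pd.read_hdf(cache, key=key).to_numpy().T    # (K, n_hours)
```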
General limitations:

- This repo is not packaged as a pip-installable library; the entrypoints are scripts and notebooks.
- The HPO scripts currently perform runtime `pip install` calls.
- Many defaults are tuned for LRZ GPU partitions and may need adjustment on other systems.
```
GridForecast/
  0_preprocessing/
    0.extract_ML_data.ipynb
    config.py
    Data/
      all_data.h5
      ts_train.h5
      ts_test.h5
      ts_train_large_grids.h5  # optional (created by the notebook)
      ts_test_large_grids.h5   # optional (created by the notebook)
  2_mlp/
    mlp_model.py
    evaluation.py
    run_mlp_models_HPO.py
    submit_asha_tpe.sbatch
    ray_tune/                  # Ray Tune experiment storage
    logs/
      error/
      out/                     # created by Slurm scripts
    models/                    # checkpoints / exported models (project-specific)
  3_transformer/
    transformer_model.py
    evaluation.py
    resource_report.py
    run_transformer_model_HPO.py
    run_transformer_model.ipynb
    submit_asha_tpe.sbatch
    data/                      # cached VMD results (HDF5) if VMD_APPROACH=read/write
    ray_tune/                  # Ray Tune experiment storage
    logs/
      error/
      out/                     # created by Slurm scripts
    models/
    runs/                      # TensorBoard logs if enabled
```
The preprocessing notebook reads one HDF5 file per grid from `config.input_dir` (see `0_preprocessing/config.py`).
Each file is expected to contain groups similar to:
- `raw_data/` (buildings, consumers, region, weather, net, ...)
- `urbs_in/` and `urbs_out/` (reduced tables such as `demand`, `eff_factor`, `supim`, `process`, `storage`, ...)
These files are produced by the upstream GridExpand pipeline (powerflow + urbs).
There is no single `environment.yml` in this folder. On LRZ/HPC the recommended approach is to use the provided Slurm container scripts:

- `2_mlp/submit_asha_tpe.sbatch`
- `3_transformer/submit_asha_tpe.sbatch`

If you run locally, note that the HPO scripts currently install packages at runtime via pip (see "Notes / Pitfalls").
The notebook `0_preprocessing/0.extract_ML_data.ipynb` generates:

- `Data/all_data.h5`
  - xarray datasets stored under the groups `X_ts`, `X_sclr`, `y_ts`, `y_sclr`
  - This file is an intermediate "wide" format useful for inspection.
- `Data/ts_train.h5` and `Data/ts_test.h5`
  - Each file stores two Pandas DataFrames under the HDF5 keys:
    - `X`: features
    - `y`: targets
Data format expectations (used by both MLP and Transformer):

- `X` and `y` are `pandas.DataFrame` with a `MultiIndex`:
  - level 0: `batch` (grid id)
  - level 1: `hour` (time index)
- Columns in `X` match the lists in each training script (e.g. `X_BASE_COLS`).
- Columns in `y` match `TARGET_COLS`, typically:
  - `demand_net_active_post`
  - `demand_net_reactive_post`
Optional small-grid filtering (thesis-specific) can create:

- `Data/ts_train_large_grids.h5`
- `Data/ts_test_large_grids.h5`
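For orientation, the produced tables can be loaded back like this (paths per the layout above):

```python
# Load the ML tables and check the index contract both models rely on.
import pandas as pd

path = "0_preprocessing/Data/ts_train.h5"
X = pd.read_hdf(path, key="X")
y = pd.read_hdf(path, key="y")

assert list(X.index.names) == ["batch", "hour"]   # MultiIndex contract
print(X.shape, y.columns.tolist())                # targets: P and Q columns
```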
Both `2_mlp/` and `3_transformer/` use Ray Tune for HPO. The main outputs are:

- `ray_tune/<experiment_name>/...`:
  - trial directories
  - checkpoints (via Ray Tune checkpointing)
  - TensorBoard event files (when enabled)

Additional project-specific outputs may appear in `models/` (e.g. exported `.pt` weights) and under `logs/`.
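To inspect finished trials afterwards, something along these lines can work (the `ExperimentAnalysis` API varies across Ray versions; the path and metric name here are illustrative, not taken from the run scripts):

```python
# Load a finished Ray Tune experiment and pull out the best trial config.
from ray.tune import ExperimentAnalysis

analysis = ExperimentAnalysis("2_mlp/ray_tune/my_experiment")   # hypothetical path
df = analysis.dataframe()                                       # one row per trial
best = analysis.get_best_config(metric="val_loss", mode="min")  # metric assumed
print(len(df), "trials; best config:", best)
```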
The preprocessing notebook (`0_preprocessing/0.extract_ML_data.ipynb`):

- Selects raw HDF5 files from `config.input_dir`.
- Extracts time series features (weather, demand components, PV production proxies, mobility features, heat/COP) and scalar grid/building descriptors.
- Writes an intermediate xarray dataset (`all_data.h5`).
- Splits the dataset into train/test by grid id.
- Converts the xarray datasets into a single multi-indexed table and saves `ts_train.h5` / `ts_test.h5` (a conversion sketch follows).
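The last step corresponds roughly to this (group and dimension names follow the README; the notebook's exact code differs):

```python
# Flatten the xarray time-series group into a (batch, hour)-indexed table.
import xarray as xr

ds = xr.open_dataset("Data/all_data.h5", group="X_ts")  # dims assumed: batch, hour
X = ds.to_dataframe()        # MultiIndex rows, one column per variable
X.to_hdf("Data/ts_train.h5", key="X")
```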
MLP (`2_mlp/`):

- `run_mlp_models_HPO.py` runs Ray Tune + Optuna (TPE) + ASHA.
- Model/training logic is implemented in `mlp_model.py`:
  - preprocessing: log1p, seasonal sin/cos features, `StandardScaler`
  - dual-target scaling: learns a scaler on apparent power $S=\sqrt{P^2+Q^2}$ and applies the same scale to both P and Q
  - loss: MAE/MAEx (normalized by per-grid mean absolute exchange); see the sketch below
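In sketch form (a minimal illustration of the two ideas above, not the actual `mlp_model.py` code):

```python
# Shared P/Q scale learned on apparent power S = sqrt(P^2 + Q^2), and an
# MAE loss normalized by the per-grid mean absolute exchange (MAEx).
import numpy as np
import torch

def fit_shared_scale(P: np.ndarray, Q: np.ndarray) -> float:
    S = np.sqrt(P**2 + Q**2)          # apparent power
    return float(S.std())             # one scale applied to both targets

def mae_maex(pred: torch.Tensor, target: torch.Tensor,
             maex: torch.Tensor) -> torch.Tensor:
    # maex: per-grid mean absolute exchange, broadcast over the batch dim
    return ((pred - target).abs() / maex).mean()
```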
Transformer (`3_transformer/`):

- `run_transformer_model_HPO.py` runs Ray Tune + Optuna (TPE) + ASHA.
- Core code is in `transformer_model.py`:
  - windowing: builds sequences of length `core_len + 2*pad_hours` (see the sketch after this list)
  - model: per-timestep MLP → positional encoding → transformer encoder → output head
  - aggregation: sum aggregation (classic) or depthwise Conv1D aggregation (`aggregation_mode='conv'`)
  - optional VMD: if enabled, computes or reads cached VMD modes for selected columns (`VMD_COLS`)
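A compact sketch of the windowing and `aggregation_mode='sum'` path (shapes follow the README; the real implementation lives in `transformer_model.py` and the parameter values here are illustrative):

```python
# Build sliding windows of core_len + 2*pad_hours, then trim the pad and
# sum blocks of agg_hours to get aggregated-hour outputs.
import torch

core_len, pad_hours, agg_hours = 24, 6, 4        # illustrative values
win_len = core_len + 2 * pad_hours

x = torch.randn(8760, 8)                         # (hours, features), one grid
windows = x.unfold(0, win_len, core_len)         # (n_windows, features, win_len)

y_hat = torch.randn(windows.shape[0], win_len)   # stand-in per-hour outputs
core = y_hat[:, pad_hours:-pad_hours]            # drop the padded context
y_agg = core.reshape(core.shape[0], -1, agg_hours).sum(dim=-1)
print(y_agg.shape)                               # (n_windows, core_len // agg_hours)
```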
Open and run the notebook: `0_preprocessing/0.extract_ML_data.ipynb`
Before running, update `Config.input_dir` in `0_preprocessing/config.py` to point to your PostPowerflow directory.
The final cells write:

- `0_preprocessing/Data/ts_train.h5`
- `0_preprocessing/Data/ts_test.h5`
On LRZ/HPC:
```bash
cd 2_mlp
sbatch submit_asha_tpe.sbatch
```

The script runs:

```bash
python run_mlp_models_HPO.py
```
On LRZ/HPC:
```bash
cd 3_transformer

# defaults: agg_hours=1, loss_type=alpha_peak
sbatch submit_asha_tpe.sbatch

# override defaults
sbatch submit_asha_tpe.sbatch 4 mae_maex
```

This runs:

```bash
python run_transformer_model_HPO.py --agg-hours <int> --loss-type <str>
```
The provided Slurm scripts run the code inside an NVIDIA PyTorch container. The upstream reference image is:
`nvcr.io/nvidia/pytorch:25.06-py3`
On LRZ, the `submit_asha_tpe.sbatch` scripts use `srun --container-image <path>.sqsh` (a SquashFS image). That `.sqsh` is typically built from the NGC image above.
Practical notes:
- If you already have a local `.sqsh` built from `nvcr.io/nvidia/pytorch:25.06-py3`, point `--container-image` to it.
- Make sure you mount:
  - the code folder into the container (the Slurm scripts mount it to `/workspace`)
  - the transformer VMD cache folder `3_transformer/data/` (the transformer Slurm script mounts it to `/data`)
- The scripts assume GPU access; verify that the partition + `--gres=gpu:1` matches your cluster.
Notes / Pitfalls:

- **Hard-coded paths**: HDF5 input paths (train/test) and Ray storage paths are hard-coded in the run scripts. Update them if your directory layout differs.
- **Runtime `pip install`**: `run_mlp_models_HPO.py` and `run_transformer_model_HPO.py` call `pip install ...` at runtime. This is convenient in notebooks/containers but can be undesirable on shared systems.
- **Large batch sizes / memory**: the MLP HPO search uses very large `batch_size` options (up to 524288). On GPUs this can OOM depending on feature count and precision.
- **Index naming assumptions**: evaluation utilities frequently assume the DataFrame index has levels named `batch` and `hour`. If you change index names, metrics that group by `batch` may break or silently behave differently.
- **Transformer VMD cache**: if `VMD_APPROACH='read'`, the transformer preprocessor expects cached VMD files under `3_transformer/data/`. The Slurm script mounts this directory into the container (see `submit_asha_tpe.sbatch`). If you run without that mount, VMD reads will fail and the code may fall back to recomputing.
- **Dual-target interpretation (P/Q)**: both models assume the first target is active power (P) and the second is reactive power (Q) when using two targets.
Customization:

- Change raw input location: `0_preprocessing/config.py`
- Change features/targets: the `X_BASE_COLS` / `TARGET_COLS` lists in:
  - `2_mlp/run_mlp_models_HPO.py`
  - `3_transformer/run_transformer_model_HPO.py`
  - `3_transformer/transformer_model.py` (Ray Tune helpers)
- Change Ray output location: `STORAGE_ROOT` in each `run_*_HPO.py`