PCZ-PPO: Per-Component Z-Normalization for PPO

When a reinforcement learning environment decomposes its reward into K heterogeneous components (landing bonus, fuel penalty, control cost, …), standard PPO normalizes the aggregate and lets the highest-magnitude component dominate the learning signal. PCZ-PPO applies z-normalization independently to each component before GAE, equalizing scales while preserving per-component priority through explicit weights.

Paper result: At 500k steps, PCZ-PPO's mean performance is approximately invariant to K while standard PPO degrades progressively. The growing PCZ/PPO gap is driven by PPO collapsing under reward heterogeneity, not by PCZ improving. All three K∈{4,6,8} comparisons survive Holm–Bonferroni correction at α=0.05.

What this repo gives you:

A drop-in TorchRL PPO variant (torchrl-pcz-ppo-running) — per-component z-normalization as a ~60-line preprocessing wrapper on the reward buffer, no architectural changes.
A systematic K-scaling sweep (K∈{2,4,6,8}) with pre-registered seeds, Holm–Bonferroni correction, BCa bootstrap CIs, IQM, and stochastic-dominance tests.
A 20-variant ablation suite (ZCA, symlog→z-norm, asymmetric clip, GRPO, multi-head critic, PopArt, …).
A data-driven paper pipeline: every number in the PDF is backed by a fragment in paper/generated/ rendered from results.csv — hand-typed numbers are forbidden and caught by pre-commit.

Install

git clone https://github.com/KookaS/pcz-ppo
cd pcz-ppo
uv sync --extra torchrl   # TorchRL + LunarLander/BipedalWalker/CartPole
uv sync --extra trading   # + financial trading environments

Requires Python ≥ 3.11 and uv.

Quickstart

# Train PCZ-PPO on LunarLander K=4 (~10 min on CPU)
uv run python -m core.train \
    --algorithm torchrl-pcz-ppo-running \
    --env lunarlander \
    --total-timesteps 500000

# Compare PCZ-PPO vs PPO across 5 seeds
uv run python -m core.compare \
    --algorithms torchrl-pcz-ppo-running torchrl-ppo \
    --env lunarlander \
    --seeds 42 43 44 45 46 \
    --total-timesteps 500000

# K=6 with canonical weights
uv run python -m core.train \
    --algorithm torchrl-pcz-ppo-running \
    --env lunarlander-k6 \
    --reward-component-weights 10.0,3.0,1.0,1.0,0.5,0.5 \
    --total-timesteps 500000

Supported Environments

`--env`	K	Components
`lunarlander`	4	landing, shaping, fuel_main, fuel_side
`lunarlander-k2`	2	landing, dense aggregate
`lunarlander-k6`	6	intermediate split
`lunarlander-k8`	8	fine-grained shaping split
`bipedalwalker`	3	shaping, energy, crash
`cartpole`	2	balance, center
`halfcheetah`	2	run, ctrl_cost
`trading-k4`	4	pnl_gain, pnl_loss, txn_cost, borrow_cost

All environments return info["reward_components"] as {str: float}. See docs/environment-api.md for the full table and component semantics.

Reproduce the Paper

The committed data/results.csv and data/metrics/*.parquet files are sufficient to rebuild all figures without a live MLflow instance.

# Rebuild stale figures and all generated/*.tex fragments
uv run python artifacts/pcz-ppo/paper/paper_build.py --build

# Also recompile the PDF (requires latexmk)
uv run python artifacts/pcz-ppo/paper/paper_build.py --build --pdf

# Run paper tests (formatter, fragment registry, headline-number regression)
uv run pytest artifacts/pcz-ppo/paper/tests/

# Check that committed fragments match current data (CI gate)
uv run python artifacts/pcz-ppo/paper/render_claims.py --check

If you have an MLflow instance with your own runs, export first:

uv run python -m core.plot.export_results \
    --tracking-uri http://127.0.0.1:5050 \
    --output artifacts/pcz-ppo/data/results.csv --append

uv run python -m core.plot.export_metrics \
    --tracking-uri http://127.0.0.1:5050 \
    --output-dir artifacts/pcz-ppo/data/metrics --append

See docs/REPRODUCE.md for full step-by-step instructions including expected wall-clock times.

Key Results (from paper)

At 500k steps on LunarLander (Holm–Bonferroni corrected, all three K survive at α=0.05):

K	n	PCZ-PPO	PPO	Δ	Welch p (Holm)	Var ratio
4	15	+159.8±34.6	+123.5±40.6	+36.3	0.042 (0.042)	1.38×
6	15	+136.7±34.8	+82.1±47.2	+54.6	0.0008 (0.003)	1.84×
8	10	+143.8±45.4	+18.0±163.4	+125.8	0.049 (0.049)	12.9×

At 4M steps on LunarLander K=8 with heterogeneous weights (10,1,1,1,1,1,0.5,0.5), n=10: PCZ-PPO +179.0±35.7 vs PPO +124.6±59.9 (Δ=+54.4, p=0.026, Cohen d=+1.10).

Sharp usage constraint: At K=8 with equal weights and 4M steps, PCZ-PPO collapses while PPO remains robust. Only with hand-designed, strictly-positive heterogeneous weights does PCZ-PPO win at high K / long horizon. See §Limitations in the paper.

Citation

@software{charrez2026pczppo,
  title   = {{PCZ-PPO}: Per-Component Z-Normalization for {PPO} with Heterogeneous Reward Components},
  author  = {Charrez, Olivier and Bussler, Maarten and Weber, Tobias},
  year    = {2026},
  url     = {https://github.com/KookaS/pcz-ppo},
  license = {Apache-2.0}
}

License

Code: Apache-2.0
Paper text and figures (artifacts/pcz-ppo/paper/): CC-BY-4.0

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.github/workflows		.github/workflows
core		core
data		data
docs		docs
experiments/E46.1_noise_amp/eval		experiments/E46.1_noise_amp/eval
llm_alignment/runs		llm_alignment/runs
paper		paper
scripts		scripts
tools		tools
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CITATION.cff		CITATION.cff
Dockerfile.train		Dockerfile.train
LICENSE		LICENSE
README.md		README.md
docker-compose.mlflow.yml		docker-compose.mlflow.yml
docker-compose.train.yml		docker-compose.train.yml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PCZ-PPO: Per-Component Z-Normalization for PPO

Install

Quickstart

Supported Environments

Reproduce the Paper

Key Results (from paper)

Citation

License

About

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PCZ-PPO: Per-Component Z-Normalization for PPO

Install

Quickstart

Supported Environments

Reproduce the Paper

Key Results (from paper)

Citation

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Uh oh!

Contributors

Uh oh!

Languages