Describe the bug
`gen_kl_error` stays between 0.15–0.65 (with spikes to 1.75) throughout a full GRPO training run on `circle_click` with `Qwen/Qwen3-VL-2B-Instruct`. This metric measures `KL(P_gen || P_train)` — the divergence between logprobs from vLLM inference and logprobs from the DTensor training forward pass on the same model weights and same token sequences.
Training still converges (95% val accuracy), but the noisy importance-sampling ratio may hurt sample efficiency on harder tasks.
Observed values across 62 training steps:
| Phase | Typical gen_kl_error | Range |
|---|---|---|
| Steps 1–20 (early) | ~0.45 | 0.32 – 0.64 |
| Steps 21–40 (mid) | ~0.30 | 0.16 – 1.75 (spike at step 22) |
| Steps 41–62 (late) | ~0.20 | 0.14 – 0.60 |
Despite the high KL error, reward and validation accuracy still converge (val accuracy reaches 0.95, avg reward ~1.0), so training is not broken — but the elevated `gen_kl_error` suggests the GRPO importance-sampling ratio (`exp(curr_logprobs - generation_logprobs)`) used in the clipped PG loss is computed from a noisy baseline, which may reduce sample efficiency or cause instability in harder tasks.
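For reference, the metric and the ratio it feeds can be sketched as below. This is an illustrative reconstruction from the description above, not the actual NeMo RL code; function and argument names are assumptions:

```python
import torch

def sampled_kl(gen_logprobs: torch.Tensor, train_logprobs: torch.Tensor,
               mask: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of KL(P_gen || P_train) over generated tokens.

    Tokens were sampled from P_gen, so the mean of
    (log P_gen - log P_train) over response tokens estimates the KL.
    `mask` selects the response tokens that count toward the estimate.
    """
    diff = (gen_logprobs - train_logprobs) * mask
    return diff.sum() / mask.sum()

def importance_ratio(curr_logprobs: torch.Tensor,
                     generation_logprobs: torch.Tensor) -> torch.Tensor:
    # Per-token ratio used in the clipped PG loss; any noise in
    # generation_logprobs propagates directly into this term.
    return torch.exp(curr_logprobs - generation_logprobs)
```

When the two backends agree exactly, `sampled_kl` is 0 and the ratio is 1 everywhere; a sustained 0.15–0.65 estimate means the ratios are systematically off before the policy has even updated.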
Steps/Code to reproduce bug
```bash
# 1. Clone and checkout PR #2092
git clone git@github.com:NVIDIA-NeMo/RL.git nemo-rl --recursive
cd nemo-rl
git fetch origin pull/2092/head:cmunley1/gym-vlm
git checkout cmunley1/gym-vlm
git submodule update --init --recursive

# 2. Set up venvs and generate data
uv venv
cd 3rdparty/Gym-workspace/Gym
uv venv --python 3.12 --allow-existing .venv
source .venv/bin/activate
SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv sync --active --extra dev
mkdir -p data/circle_click
python3 resources_servers/circle_click/generate_data.py --n 1000 --out data/circle_click/train.jsonl
python3 resources_servers/circle_click/generate_data.py --n 100 --out data/circle_click/validation.jsonl
deactivate
cd ../../..

# 3. Run GRPO training (4x H100, single node)
export CUDNN_HOME=.venv/lib/python3.12/site-packages/nvidia/cudnn
export LD_LIBRARY_PATH=".venv/lib/python3.12/site-packages/nvidia/cudnn/lib:${LD_LIBRARY_PATH:-}"
export TORCH_CUDA_ARCH_LIST="9.0"
uv run python examples/nemo_gym/run_grpo_nemo_gym.py \
    --config examples/nemo_gym/grpo_circle_click_qwen3vl2b.yaml \
    cluster.num_nodes=1 \
    cluster.gpus_per_node=4
```
Uses the default config `grpo_circle_click_qwen3vl2b.yaml` shipped with the PR — no overrides beyond node/GPU count.
Expected behavior
Ideally, `gen_kl_error` should be near zero (< 0.01). The consistently elevated values suggest a systematic numerical divergence between the two inference backends.
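For a rough sense of scale: a pure precision round-trip (fp32 logits cast through bf16 and back) produces per-token logprob gaps orders of magnitude smaller than the observed 0.15–0.65, which is why dtype noise alone seems unlikely to explain it. A toy sketch with synthetic logits (not the actual model or pipeline):

```python
import torch

# Same "weights" evaluated at two precisions: round-trip the fp32
# logits through bf16 to mimic a lower-precision inference backend.
torch.manual_seed(0)
logits_fp32 = torch.randn(4, 50)                       # pretend model outputs
logits_bf16 = logits_fp32.to(torch.bfloat16).float()   # bf16 round-trip

lp_train = torch.log_softmax(logits_fp32, dim=-1)
lp_gen = torch.log_softmax(logits_bf16, dim=-1)

# Gather logprobs of the "sampled" tokens and estimate the KL as the
# mean generation-minus-training logprob gap.
tokens = torch.randint(0, 50, (4, 1))
kl_est = (lp_gen.gather(-1, tokens) - lp_train.gather(-1, tokens)).mean()
print(f"per-token logprob gap from bf16 rounding: {kl_est.item():.6f}")
```

The printed gap is on the order of 1e-3, far below the reported range, consistent with the backends diverging for a structural reason (e.g. differing kernels or attention implementations) rather than rounding alone.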