
train/gen_kl_error persistently high (0.15–1.75) during GRPO training with NeMo Gym VLM support #2260

@shashank3959

Description

Describe the bug

gen_kl_error stays between 0.15–0.65 (with spikes to 1.75) throughout a full GRPO training run on circle_click with Qwen/Qwen3-VL-2B-Instruct. This metric measures KL(P_gen || P_train) — the divergence between logprobs from vLLM inference and logprobs from the DTensor training forward pass on the same model weights and same token sequences.

Training still converges (95% val accuracy), but the noisy importance-sampling ratio may hurt sample efficiency on harder tasks.
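For reference, here is a rough sketch of what a metric like this computes, assuming the standard per-token Monte Carlo estimator of forward KL; the exact NeMo RL implementation may differ in detail:

```python
import numpy as np

def gen_kl_error(gen_logprobs, train_logprobs):
    """Monte Carlo estimate of KL(P_gen || P_train) over sampled tokens.

    For tokens sampled from the generation policy, the expectation of
    (log p_gen - log p_train) is the forward KL. Hypothetical helper,
    not the actual NeMo RL function.
    """
    gen_lp = np.asarray(gen_logprobs, dtype=np.float64)
    train_lp = np.asarray(train_logprobs, dtype=np.float64)
    return float(np.mean(gen_lp - train_lp))

# Identical logprobs from both backends give exactly 0.0:
print(gen_kl_error([-1.2, -0.5], [-1.2, -0.5]))  # 0.0
```

With this estimator, a value of 0.45 means the two backends disagree by ~0.45 nats per token on average, which is far outside floating-point noise.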

Observed values across 62 training steps:

| Phase | Typical gen_kl_error | Range |
| --- | --- | --- |
| Steps 1–20 (early) | ~0.45 | 0.32 – 0.64 |
| Steps 21–40 (mid) | ~0.30 | 0.16 – 1.75 (spike at step 22) |
| Steps 41–62 (late) | ~0.20 | 0.14 – 0.60 |

Despite the elevated KL error, reward and validation accuracy still converge (val accuracy reaches 0.95, avg reward ~1.0), so training is not broken. However, the GRPO importance-sampling ratio `exp(curr_logprobs - generation_logprobs)` used in the clipped PG loss is then computed from a noisy baseline, which may reduce sample efficiency or cause instability on harder tasks.
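To make the concern concrete, the following is a generic PPO/GRPO-style clipped loss sketch (not the NeMo RL code; function name and signature are illustrative). When the two backends agree, the ratio is exactly 1 and the loss reduces to the negative mean advantage; when generation_logprobs are noisy, the ratio inherits that noise:

```python
import numpy as np

def clipped_pg_loss(curr_logprobs, gen_logprobs, advantages, eps=0.2):
    """Sketch of a PPO/GRPO clipped policy-gradient loss.

    ratio = exp(curr - gen) is the importance weight between the training
    forward pass and the inference engine that produced the rollout. Any
    systematic logprob mismatch (high gen_kl_error) distorts this ratio.
    """
    ratio = np.exp(np.asarray(curr_logprobs) - np.asarray(gen_logprobs))
    adv = np.asarray(advantages, dtype=np.float64)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return float(-np.mean(np.minimum(unclipped, clipped)))

# Both backends agree -> ratio == 1 -> loss is just -mean(advantage):
print(clipped_pg_loss([-1.0, -2.0], [-1.0, -2.0], [0.5, 1.0]))  # -0.75
```

Note that a logprob gap of ~0.45 nats inflates or deflates the ratio by a factor of exp(0.45) ≈ 1.57, well past a typical clip range of 0.2, so many tokens would be clipped for numerical rather than policy reasons.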

Steps/Code to reproduce bug

```bash
# 1. Clone and check out PR #2092
git clone git@github.com:NVIDIA-NeMo/RL.git nemo-rl --recursive
cd nemo-rl
git fetch origin pull/2092/head:cmunley1/gym-vlm
git checkout cmunley1/gym-vlm
git submodule update --init --recursive

# 2. Set up venvs and generate data
uv venv
cd 3rdparty/Gym-workspace/Gym
uv venv --python 3.12 --allow-existing .venv
source .venv/bin/activate
SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv sync --active --extra dev
mkdir -p data/circle_click
python3 resources_servers/circle_click/generate_data.py --n 1000 --out data/circle_click/train.jsonl
python3 resources_servers/circle_click/generate_data.py --n 100  --out data/circle_click/validation.jsonl
deactivate
cd ../../..

# 3. Run GRPO training (4x H100, single node)
export CUDNN_HOME=.venv/lib/python3.12/site-packages/nvidia/cudnn
export LD_LIBRARY_PATH=".venv/lib/python3.12/site-packages/nvidia/cudnn/lib:${LD_LIBRARY_PATH:-}"
export TORCH_CUDA_ARCH_LIST="9.0"

uv run python examples/nemo_gym/run_grpo_nemo_gym.py \
  --config examples/nemo_gym/grpo_circle_click_qwen3vl2b.yaml \
  cluster.num_nodes=1 \
  cluster.gpus_per_node=4
```

This uses the default config `grpo_circle_click_qwen3vl2b.yaml` shipped with the PR; no overrides beyond node/GPU count.

Expected behavior

Ideally, gen_kl_error should be near zero (< 0.01), since both backends evaluate the same weights on the same token sequences. The consistently elevated values suggest a systematic numerical divergence between the two inference backends.
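As a toy illustration of the baseline one would expect from numerics alone (this is not the NeMo RL computation; the vocab size and precision pair are arbitrary), evaluating the same logits through log-softmax at two different floating-point precisions produces a KL that is nonzero but tiny, typically orders of magnitude below the 0.15–0.65 observed here:

```python
import numpy as np

def log_softmax(logits, dtype):
    """Numerically stable log-softmax evaluated at the given precision."""
    x = np.asarray(logits, dtype=dtype)
    x = x - x.max()
    return (x - np.log(np.exp(x).sum())).astype(np.float64)

# Same logits, two precisions: a stand-in for two backends that share
# weights but differ in kernel/precision details.
rng = np.random.default_rng(0)
logits = rng.normal(size=32000) * 4.0   # vocab-sized logit vector
lp_hi = log_softmax(logits, np.float32)
lp_lo = log_softmax(logits, np.float16)
kl = float(np.sum(np.exp(lp_hi) * (lp_hi - lp_lo)))
print(f"KL from precision mismatch alone: {kl:.2e}")
```

If pure precision noise only accounts for values this small, the persistent 0.15+ readings point at a more structural mismatch (e.g. differing kernels, masking, or tokenization of the multimodal inputs) rather than rounding error.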
