Describe the bug
`gen_kl_error` stays between 0.15–0.65 (with spikes to 1.75) throughout a full GRPO training run on `circle_click` with `Qwen/Qwen3-VL-2B-Instruct`. This metric measures `KL(P_gen || P_train)` — the divergence between logprobs from vLLM inference and logprobs from the DTensor training forward pass on the same model weights and same token sequences.
Training still converges (95% val accuracy), but the noisy importance-sampling ratio may hurt sample efficiency on harder tasks.
Observed values across 62 training steps:
| Phase | Typical gen_kl_error | Range |
|---|---|---|
| Steps 1–20 (early) | ~0.45 | 0.32 – 0.64 |
| Steps 21–40 (mid) | ~0.30 | 0.16 – 1.75 (spike at step 22) |
| Steps 41–62 (late) | ~0.20 | 0.14 – 0.60 |
Despite the high KL error, reward and validation accuracy still converge (val accuracy reaches 0.95, avg reward ~1.0), so training is not broken — but the elevated `gen_kl_error` suggests the GRPO importance-sampling ratio (`exp(curr_logprobs - generation_logprobs)`) used in the clipped PG loss is computed from a noisy baseline, which may reduce sample efficiency or cause instability in harder tasks.
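For reference, the metric and the ratio it feeds can be sketched as below. This is an illustrative reconstruction from the description above, not the actual NeMo RL code; function and argument names are assumptions:

```python
import torch

def sampled_kl(gen_logprobs: torch.Tensor, train_logprobs: torch.Tensor,
               mask: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of KL(P_gen || P_train) over generated tokens.

    Tokens were sampled from P_gen, so the mean of
    (log P_gen - log P_train) over response tokens estimates the KL.
    `mask` selects the response tokens that count toward the estimate.
    """
    diff = (gen_logprobs - train_logprobs) * mask
    return diff.sum() / mask.sum()

def importance_ratio(curr_logprobs: torch.Tensor,
                     generation_logprobs: torch.Tensor) -> torch.Tensor:
    # Per-token ratio used in the clipped PG loss; any noise in
    # generation_logprobs propagates directly into this term.
    return torch.exp(curr_logprobs - generation_logprobs)
```

When the two backends agree exactly, `sampled_kl` is 0 and the ratio is 1 everywhere; a sustained 0.15–0.65 estimate means the ratios are systematically off before the policy has even updated.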
Steps/Code to reproduce bug
```bash
# 1. Clone and checkout PR #2092
git clone git@github.com:NVIDIA-NeMo/RL.git nemo-rl --recursive
cd nemo-rl
git fetch origin pull/2092/head:cmunley1/gym-vlm
git checkout cmunley1/gym-vlm
git submodule update --init --recursive

# 2. Set up venvs and generate data
uv venv
cd 3rdparty/Gym-workspace/Gym
uv venv --python 3.12 --allow-existing .venv
source .venv/bin/activate
SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv sync --active --extra dev
mkdir -p data/circle_click
python3 resources_servers/circle_click/generate_data.py --n 1000 --out data/circle_click/train.jsonl
python3 resources_servers/circle_click/generate_data.py --n 100 --out data/circle_click/validation.jsonl
deactivate
cd ../../..

# 3. Run GRPO training (4x H100, single node)
export CUDNN_HOME=.venv/lib/python3.12/site-packages/nvidia/cudnn
export LD_LIBRARY_PATH=".venv/lib/python3.12/site-packages/nvidia/cudnn/lib:${LD_LIBRARY_PATH:-}"
export TORCH_CUDA_ARCH_LIST="9.0"
uv run python examples/nemo_gym/run_grpo_nemo_gym.py \
    --config examples/nemo_gym/grpo_circle_click_qwen3vl2b.yaml \
    cluster.num_nodes=1 \
    cluster.gpus_per_node=4
```
Uses the default config `grpo_circle_click_qwen3vl2b.yaml` shipped with the PR — no overrides beyond node/GPU count.
Expected behavior
Ideally, `gen_kl_error` should be near zero (< 0.01). The consistently elevated values suggest a systematic numerical divergence between the two inference backends.
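For a rough sense of scale: a pure precision round-trip (fp32 logits cast through bf16 and back) produces per-token logprob gaps orders of magnitude smaller than the observed 0.15–0.65, which is why dtype noise alone seems unlikely to explain it. A toy sketch with synthetic logits (not the actual model or pipeline):

```python
import torch

# Same "weights" evaluated at two precisions: round-trip the fp32
# logits through bf16 to mimic a lower-precision inference backend.
torch.manual_seed(0)
logits_fp32 = torch.randn(4, 50)                       # pretend model outputs
logits_bf16 = logits_fp32.to(torch.bfloat16).float()   # bf16 round-trip

lp_train = torch.log_softmax(logits_fp32, dim=-1)
lp_gen = torch.log_softmax(logits_bf16, dim=-1)

# Gather logprobs of the "sampled" tokens and estimate the KL as the
# mean generation-minus-training logprob gap.
tokens = torch.randint(0, 50, (4, 1))
kl_est = (lp_gen.gather(-1, tokens) - lp_train.gather(-1, tokens)).mean()
print(f"per-token logprob gap from bf16 rounding: {kl_est.item():.6f}")
```

The printed gap is on the order of 1e-3, far below the reported range, consistent with the backends diverging for a structural reason (e.g. differing kernels or attention implementations) rather than rounding alone.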