THUDM
diff --git a/.doctrees/_examples_synced/geo3k_vlm/README.doctree b/.doctrees/_examples_synced/geo3k_vlm/README.doctree
9.82 KB
diff --git a/.doctrees/environment.pickle b/.doctrees/environment.pickle
4 KB
diff --git a/_examples_synced/geo3k_vlm/README.html b/_examples_synced/geo3k_vlm/README.html
Lines changed: 579 additions & 0 deletions
diff --git a/_sources/_examples_synced/geo3k_vlm/README.md b/_sources/_examples_synced/geo3k_vlm/README.md
Lines changed: 32 additions & 0 deletions
diff --git a/objects.inv b/objects.inv
329 Bytes
diff --git a/searchindex.js b/searchindex.js
Lines changed: 1 addition & 1 deletion
diff --git a/zh/.doctrees/_examples_synced/geo3k_vlm/README.doctree b/zh/.doctrees/_examples_synced/geo3k_vlm/README.doctree
9.81 KB
diff --git a/zh/.doctrees/environment.pickle b/zh/.doctrees/environment.pickle
4 KB
@@ -0,0 +1,32 @@
+# FSDP + VLM Single-Turn RL
+
+This example trains VLMs with FSDP on a single-turn reasoning task using GRPO on the [GEO3K dataset](https://huggingface.co/datasets/hiyouga/geometry3k). We use a [processed version](https://huggingface.co/datasets/chenhegu/geo3k_imgurl) of the dataset.
+
+<p align="center">
+  <img src="rewards.png" alt="Reward Plot" width="800">
+</p>
+
+## Reproduce
+
+```bash
+export WANDB_API_KEY=your_wandb_api_key
+
+SLIME_SCRIPT_MODEL_NAME=Qwen3-VL-2B-Instruct SLIME_SCRIPT_NUM_GPUS=8 python examples/geo3k_vlm/run_geo3k_vlm.py 2>&1 | tee run_simple.log
+```
+
+## Notes
+
+### Reward Model Configuration
+
+We experimented with three reward model configurations:
+1. A geo3k-specific RM with tolerance=0.05 (to handle rounding in ground truth labels)
+2. A geo3k-specific RM with tolerance=0.0 (strict matching)
+3. The default math RM
+
+All three performed similarly, so we use the default math RM for simplicity.
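
The tolerance-based configurations above can be pictured with a minimal sketch like the one below. It is not the verifier used in this example: the function name `tolerant_reward` is hypothetical, and treating `tolerance` as a relative tolerance (rather than an absolute one) is an assumption made only for illustration.

```python
import math


def tolerant_reward(prediction: str, ground_truth: str, tolerance: float = 0.05) -> float:
    """Hypothetical binary reward: 1.0 if the answers agree within `tolerance`.

    Assumption: `tolerance` is relative to the larger of the two magnitudes.
    With tolerance=0.0 this reduces to strict numeric equality.
    """
    try:
        pred, truth = float(prediction), float(ground_truth)
    except ValueError:
        # Non-numeric answers fall back to exact string comparison.
        return float(prediction.strip() == ground_truth.strip())
    return float(math.isclose(pred, truth, rel_tol=tolerance, abs_tol=0.0))


# A rounded ground-truth label is accepted with tolerance, rejected without it.
print(tolerant_reward("12.57", "12.6", tolerance=0.05))
print(tolerant_reward("12.57", "12.6", tolerance=0.0))
```

Because the sketch only ever returns 0.0 or 1.0, it stays in the binary-reward regime regardless of the tolerance chosen.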
+
+### Numerical Precision with Non-Binary Rewards
+
+Our initial geo3k-specific verifier produced "format scores" (**0 and 0.9**) instead of clean binary rewards. Under **fp32**, fractional values such as 0.9 cannot be represented exactly, so even when all samples in a group have the same reward, rounding in the mean can leave `reward - mean` slightly nonzero, which creates a spurious gradient signal.
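
The effect described above can be reproduced in isolation with a small sketch that emulates fp32 arithmetic using only the standard library. The helper names (`f32`, `fp32_mean`, `centered_rewards`) are hypothetical, and the left-to-right accumulation order is an assumption chosen for illustration; it is not a description of how PyTorch or slime reduces the mean.

```python
import struct


def f32(value: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def fp32_mean(values: list[float]) -> float:
    """Mean of `values`, accumulated left to right with fp32 rounding at each step."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return f32(total / len(values))


def centered_rewards(rewards: list[float]) -> list[float]:
    """Subtract the group mean from each reward, all in emulated fp32."""
    mean = fp32_mean(rewards)
    return [f32(f32(r) - mean) for r in rewards]


# Three identical rewards of 0.9: the ideal deviation is exactly zero.
print(centered_rewards([0.9, 0.9, 0.9]))
# Three identical binary rewards: every intermediate value is exact in fp32.
print(centered_rewards([1.0, 1.0, 1.0]))
```

Under these assumptions the 0.9 group leaves a tiny nonzero residue, while the 1.0 group centers to exactly zero. Whether a particular group size shows the residue depends on the accumulation order, so this should be read as an illustration of the mechanism rather than a prediction for any specific run.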
+
+We fixed this by switching to the default math RM with clean **binary 0/1 rewards**. If you encounter similar precision issues with non-binary rewards, you can change the reward tensor dtype from `torch.float` to `torch.float16` in `slime/ray/rollout.py` (`_post_process_rewards` method) to truncate precision artifacts.