Commit 9637698

deploy: 81d8cd6
1 parent fca3612 commit 9637698

12 files changed: +1219 -2 lines changed

9.82 KB
Binary file not shown.

.doctrees/environment.pickle

4 KB
Binary file not shown.

_examples_synced/geo3k_vlm/README.html

Lines changed: 579 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
# FSDP + VLM Single-Turn RL

This example trains VLMs with FSDP on a single-turn reasoning task using GRPO on the [GEO3K dataset](https://huggingface.co/datasets/hiyouga/geometry3k). We used the processed version [here](https://huggingface.co/datasets/chenhegu/geo3k_imgurl).

<p align="center">
  <img src="rewards.png" alt="Reward Plot" width="800">
</p>

## Reproduce
```bash
export WANDB_API_KEY=your_wandb_api_key

SLIME_SCRIPT_MODEL_NAME=Qwen3-VL-2B-Instruct SLIME_SCRIPT_NUM_GPUS=8 python examples/geo3k_vlm/run_geo3k_vlm.py 2>&1 | tee run_simple.log
```
## Notes

### Reward Model Configuration

We experimented with three reward model configurations:
1. A geo3k-specific RM with tolerance=0.05 (to handle rounding in ground truth labels)
2. A geo3k-specific RM with tolerance=0.0 (strict matching)
3. The default math RM

All three performed similarly, so we use the default math RM for simplicity.
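To make the tolerance setting concrete, here is a minimal sketch of a tolerance-based numeric check. It is not the geo3k-specific RM used in these experiments: the function name `tolerant_reward` is hypothetical, and treating `tolerance` as a relative tolerance is an assumption, since the notes do not say whether it is relative or absolute.

```python
import math


def tolerant_reward(prediction: str, ground_truth: str, tolerance: float = 0.05) -> float:
    """Hypothetical verifier: return 1.0 if the answers match, else 0.0.

    ASSUMPTION: `tolerance` is interpreted as a relative tolerance.
    With tolerance=0.0 this reduces to strict numeric equality.
    """
    try:
        pred = float(prediction)
        truth = float(ground_truth)
    except ValueError:
        # Non-numeric answers fall back to exact string comparison.
        return 1.0 if prediction.strip() == ground_truth.strip() else 0.0
    # math.isclose checks |pred - truth| <= tolerance * max(|pred|, |truth|).
    return 1.0 if math.isclose(pred, truth, rel_tol=tolerance, abs_tol=0.0) else 0.0


if __name__ == "__main__":
    # A rounded label passes with tolerance=0.05 but fails strict matching.
    print(tolerant_reward("3.14", "3.1416", tolerance=0.05))
    print(tolerant_reward("3.14", "3.1416", tolerance=0.0))
```

Note that this sketch returns only 0.0 or 1.0, so it stays within the binary-reward regime.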
### Numerical Precision with Non-Binary Rewards

Our initial geo3k-specific verifier produced "format scores" (**0 and 0.9**) instead of clean binary rewards. Under **fp32**, fractional values like 0.9 can't be represented exactly, so when all samples in a group have the same reward, `reward - mean` may not be exactly zero, which creates a spurious gradient signal.

We fixed this by switching to the default math RM with clean **binary 0/1 rewards**. If you encounter similar precision issues with non-binary rewards, you can change the reward tensor dtype from `torch.float` to `torch.float16` in `slime/ray/rollout.py` (`_post_process_rewards` method) to truncate precision artifacts.
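The rounding effect described in this section can be reproduced without a GPU. The sketch below emulates fp32 arithmetic with the standard library by rounding to fp32 after every operation. It assumes a simple left-to-right sum, which is not necessarily the reduction order PyTorch uses, and the helper names `to_fp32` and `mean_deviation_fp32` are hypothetical, so treat it as an illustration of the mechanism rather than a reproduction of the training code.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mean_deviation_fp32(reward: float, group_size: int) -> float:
    """Return `reward - mean` for a group of identical rewards, in emulated fp32.

    ASSUMPTION: the mean is a sequential sum followed by one division.
    """
    r = to_fp32(reward)
    total = 0.0
    for _ in range(group_size):
        total = to_fp32(total + r)  # round after every addition
    mean = to_fp32(total / group_size)
    return to_fp32(r - mean)


if __name__ == "__main__":
    for reward in (0.9, 1.0):
        # Group sizes where identical rewards do not cancel against their mean.
        bad = [n for n in range(1, 33) if mean_deviation_fp32(reward, n) != 0.0]
        print(f"reward={reward}: group sizes with nonzero deviation: {bad}")
```

With rewards of exactly 0 or 1, every partial sum is a small integer that fp32 represents exactly, so the deviation is zero for any realistic group size.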

objects.inv

329 Bytes
Binary file not shown.

searchindex.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.
9.81 KB
Binary file not shown.

zh/.doctrees/environment.pickle

4 KB
Binary file not shown.
