Training VLMs with the FSDP or Megatron backend on a single-turn reasoning task using GRPO on the GEO3K dataset. We use the processed version here.
Supported models:
- Qwen2.5-VL
- Qwen3-VL (Dense and MoE)
Note: Please make sure the cuDNN version in your environment is 9.16.0.29 to avoid the severe conv3d performance regression in torch 2.9 reported in pytorch/pytorch#168167. If it differs, reinstall cuDNN with:
```bash
pip install nvidia-cudnn-cu12==9.16.0.29
```
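To check the installed build against this requirement, a small standalone helper (hypothetical, not part of slime) can convert the release string into the integer layout that `torch.backends.cudnn.version()` reports for cuDNN 9+ (major*10000 + minor*100 + patch):

```python
# Hypothetical helper: map a cuDNN release string like "9.16.0.29" to the
# integer layout torch reports for cuDNN 9+ (major*10000 + minor*100 + patch).
def cudnn_version_int(release: str) -> int:
    major, minor, patch = (int(x) for x in release.split(".")[:3])
    return major * 10000 + minor * 100 + patch

required = cudnn_version_int("9.16.0.29")
print(required)  # 91600

# In a live environment you would then compare against the installed build:
# import torch
# assert torch.backends.cudnn.version() >= required
```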
The `geo3k_imgurl` dataset contains:
- `problem`: The math problem text (string)
- `answer`: The answer (string, e.g., `"270"`)
- `images`: Image data (list)
For SFT training, we need to wrap the `answer` field in `\boxed{}` format and build the `messages` field. You can use the following script:
```python
from datasets import load_dataset

ds = load_dataset("chenhegu/geo3k_imgurl", split="train")

def format_answer(answer: str) -> str:
    """Wrap the answer in \\boxed{} format."""
    return f"Answer: \\boxed{{{answer}}}"

def process_sample(sample):
    sample["messages"] = [
        {"role": "user", "content": sample["problem"]},
        {"role": "assistant", "content": format_answer(sample["answer"])},
    ]
    return sample

ds = ds.map(process_sample)
ds.to_parquet("/root/datasets/geo3k_imgurl/train_formatted.parquet")
```

```bash
export WANDB_API_KEY=your_wandb_api_key
```
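A quick standalone sanity check of the `process_sample` formatting logic (no dataset download needed; the function is restated here so the snippet runs on its own):

```python
# Restated from the SFT preprocessing script so this snippet is self-contained.
def process_sample(sample):
    sample["messages"] = [
        {"role": "user", "content": sample["problem"]},
        {"role": "assistant", "content": f"Answer: \\boxed{{{sample['answer']}}}"},
    ]
    return sample

# Dummy sample matching the geo3k_imgurl schema (values are made up).
out = process_sample({"problem": "Find x.", "answer": "270", "images": []})
print(out["messages"][1]["content"])  # Answer: \boxed{270}
```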
```bash
# Megatron backend (default -> Qwen3-VL-8B-Instruct + Megatron)
./examples/geo3k_vlm/run_geo3k_vlm.sh

# FSDP backend
SLIME_SCRIPT_TRAIN_BACKEND=fsdp ./examples/geo3k_vlm/run_geo3k_vlm.sh

# With a different model
SLIME_SCRIPT_MODEL_NAME=Qwen3-VL-4B-Instruct ./examples/geo3k_vlm/run_geo3k_vlm.sh

# SFT
./examples/geo3k_vlm/run_geo3k_vlm_sft.sh
```

| Environment Variable | Default | Description |
|---|---|---|
| `SLIME_SCRIPT_TRAIN_BACKEND` | `megatron` | Training backend (`megatron` or `fsdp`) |
| `SLIME_SCRIPT_MODEL_NAME` | `Qwen3-VL-8B-Instruct` | Model name |
| `SLIME_SCRIPT_DATASET_NAME` | `chenhegu/geo3k_imgurl` | HuggingFace dataset name |
| `SLIME_SCRIPT_NUM_GPUS` | `8` | Number of GPUs |
| `SLIME_SCRIPT_EXTERNAL_RAY` | `0` | Use an external Ray cluster (`1` to enable) |
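These variables can be combined on one command line. A sketch of how a launcher script can resolve them with the documented defaults (the actual `run_geo3k_vlm.sh` may differ):

```shell
# Fall back to the documented defaults when the variables are unset.
TRAIN_BACKEND=${SLIME_SCRIPT_TRAIN_BACKEND:-megatron}
MODEL_NAME=${SLIME_SCRIPT_MODEL_NAME:-Qwen3-VL-8B-Instruct}
NUM_GPUS=${SLIME_SCRIPT_NUM_GPUS:-8}
echo "backend=$TRAIN_BACKEND model=$MODEL_NAME gpus=$NUM_GPUS"
```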
Supported Qwen3-VL checkpoints:
- Qwen3-VL-2B-Instruct
- Qwen3-VL-4B-Instruct
- Qwen3-VL-8B-Instruct
- Qwen3-VL-30B-A3B-Instruct
- Qwen3-VL-235B-A22B-Instruct
- Qwen3-VL-2B-Thinking
- Qwen3-VL-4B-Thinking
- Qwen3-VL-8B-Thinking
- Qwen3-VL-30B-A3B-Thinking
- Qwen3-VL-235B-A22B-Thinking
We experimented with three reward model configurations:
- A geo3k-specific RM with tolerance=0.05 (to handle rounding in ground truth labels)
- A geo3k-specific RM with tolerance=0.0 (strict matching)
- The default math RM
All three performed similarly, so we use the default math RM for simplicity.
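The tolerance-based variant can be sketched as follows. `tolerant_reward` is an illustrative name, not slime's actual RM interface; it compares numeric answers within a relative tolerance and falls back to exact string matching for non-numeric answers:

```python
# Illustrative tolerance-based reward (not slime's actual RM API).
def tolerant_reward(pred: str, truth: str, tol: float = 0.05) -> float:
    try:
        p, t = float(pred), float(truth)
    except ValueError:
        # Non-numeric answers: exact string match.
        return float(pred.strip() == truth.strip())
    if t == 0:
        return float(abs(p) <= tol)
    # Relative error within tolerance counts as correct.
    return float(abs(p - t) / abs(t) <= tol)

print(tolerant_reward("269", "270"))       # 1.0 (within 5%)
print(tolerant_reward("300", "270", 0.0))  # 0.0 (strict matching)
```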
Our initial geo3k-specific verifier produced fractional "format scores" (0 and 0.9) instead of clean binary rewards. Under fp32, values like 0.9 cannot be represented exactly, so even when all samples in a group receive the same reward, reward - mean is not exactly zero, creating a spurious gradient signal.
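The representability issue is easy to reproduce with a plain fp32 round-trip, using only the standard library:

```python
import struct

# 0.9 has no exact binary representation; round-tripping it through fp32 shows
# the drift that can make "reward - mean" slightly nonzero when precisions mix.
def to_fp32(x: float) -> float:
    return struct.unpack("f", struct.pack("f", x))[0]

print(to_fp32(0.9) == 0.9)   # False: fp32 stores ~0.899999976
print(to_fp32(1.0) == 1.0)   # True: binary 0/1 rewards are exact
```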
We fixed this by switching to the default math RM with clean binary 0/1 rewards. If you encounter similar precision issues with non-binary rewards, you can change the reward tensor dtype from torch.float to torch.float16 in slime/ray/rollout.py (_post_process_rewards method) to truncate precision artifacts.
Blackwell GPUs do not currently support FlashAttention 3, so use `--sglang-mm-attention-backend sdpa` and `--attn-implementation flash_attention_2` instead.
