1. Install LLaMA-Factory

```bash
conda create -n llamafactory python=3.10 -y
conda activate llamafactory
cd LongVideo-R1
cd LLaMA-Factory
pip install -e ".[torch,metrics,deepspeed,liger-kernel,bitsandbytes]" --no-build-isolation
```
We provide the SFT data `longvideor1-sft-qwen2.5.json`, generated with Qwen2.5-VL-72B as the caption model and Qwen2.5-VL-32B as the video_qa model, as well as the SFT data `longvideor1-sft-qwen3.json`, generated with Qwen3-VL-32B as both the caption model and the video_qa model. You can download them from HuggingFace; we recommend using the data generated with Qwen3.
Put `longvideor1-sft-qwen3-llamafactory.json` in `LLaMA-Factory/data` and update the `file_name` field in the `dataset_info.json` file.
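The `dataset_info.json` entry might look like the following sketch. The key name and column mapping here are illustrative assumptions, not taken from the repo; check the fields LLaMA-Factory expects for your data format:

```json
{
  "longvideor1_sft_qwen3": {
    "file_name": "longvideor1-sft-qwen3-llamafactory.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "videos": "videos"
    }
  }
}
```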
Training command:

```bash
llamafactory-cli train examples/train_full/qwen3.yaml
```

After training, replace the `chat_template.jinja` file in the trained checkpoint folder with the standard `chat_template.jinja` to prevent the content within from being masked.
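A minimal sketch of that replacement step. The directories below are temporary stand-ins created only for illustration; in practice, point `BASE_MODEL_DIR` at the original base model and `CKPT_DIR` at your trained checkpoint:

```shell
# Placeholder directories for illustration; substitute your real paths.
BASE_MODEL_DIR=$(mktemp -d)   # stand-in for the base model directory
CKPT_DIR=$(mktemp -d)         # stand-in for the trained checkpoint
echo "standard template" > "${BASE_MODEL_DIR}/chat_template.jinja"
echo "trained template" > "${CKPT_DIR}/chat_template.jinja"

# Overwrite the checkpoint's template with the standard one.
cp "${BASE_MODEL_DIR}/chat_template.jinja" "${CKPT_DIR}/chat_template.jinja"
cat "${CKPT_DIR}/chat_template.jinja"
```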
2. Install verl-tool

```bash
conda create -n verl-tool python=3.10 -y
conda activate verl-tool
cd verl-tool
cd verl
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
cd ..
pip install -e ".[vllm]"
```
- Download CGBench from HuggingFace.
- Download the CGBench captions and RL data from HuggingFace.
- Example strategy: deploy the tool vision model (e.g., Qwen3-VL) on GPUs 6 and 7 using `vllm serve`.
- Place the policy/reference/other components on the remaining GPUs based on your training setup.
Example command:

```bash
PORT=9081
MODEL_PATH="path/to/Qwen3-VL-32B-Instruct"
GPU_PAIR="6,7"
echo "Starting GPU ${GPU_PAIR} vLLM serve (Port $PORT)..."
CUDA_VISIBLE_DEVICES=$GPU_PAIR vllm serve $MODEL_PATH \
    --tensor-parallel-size 2 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.8 \
    --host 127.0.0.1 \
    --port $PORT \
    --mm-processor-cache-gb 0 \
    --served-model-name Qwen3-VL-32B
```
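Once the server is up, you can sanity-check it through vLLM's OpenAI-compatible API. The host and port below match the example above; the `curl` line is commented out because it only works while the server is running:

```shell
PORT=9081
BASE_URL="http://127.0.0.1:${PORT}/v1"
echo "Querying ${BASE_URL}/models"
# Uncomment once the server from the previous step is running;
# the response should list the served model name (Qwen3-VL-32B):
# curl "${BASE_URL}/models"
```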
You need to edit:

- `verl-tool/verl_tool/servers/tools/utils/caption_videoqa_config.example.json`
- `verl-tool/verl_tool/servers/tool_init_config.example.json`

Replace the following with your actual values:

- `caption_dir` / `video_dir`
- `api_key` / `base_url` / `videoqa_model`
- `config_path`
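As an illustration, the filled-in caption/video-QA config might look like the sketch below. Only the field names come from the list above; the values are made-up examples, and the example files define the authoritative schema, so keep any other fields they contain:

```json
{
  "caption_dir": "/data/cgbench/captions",
  "video_dir": "/data/cgbench/videos",
  "api_key": "EMPTY",
  "base_url": "http://127.0.0.1:9081/v1",
  "videoqa_model": "Qwen3-VL-32B"
}
```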
Use the example script `verl-tool/examples/server_test.sh` to verify the tools are working.
If the test passes, you can start RL training.
We provide RL training data initialized with captions from Qwen2.5-VL-72B, and RL training data initialized with captions from Qwen3-VL-32B. You can download them from HuggingFace.
Edit `verl-tool/examples/train/get_caption/train_7b_videoqa.sh`, then run:

```bash
# run RL training
cd verl-tool
bash ./examples/train/get_caption/train_7b_videoqa.sh
```