1. Install LLaMA-Factory

```bash
conda create -n llamafactory python=3.10 -y
conda activate llamafactory
cd LongVideo-R1
cd LLaMA-Factory
pip install -e ".[torch,metrics,deepspeed,liger-kernel,bitsandbytes]" --no-build-isolation
```
We provide the SFT data `longvideor1-sft-qwen2.5.json`, generated with Qwen2.5-VL-72B as the caption model and Qwen2.5-VL-32B as the video_qa model, as well as the SFT data `longvideor1-sft-qwen3.json`, generated with Qwen3-VL-32B as both the caption model and the video_qa model. You can download them from HuggingFace; we recommend using the data generated with Qwen3.
Put `longvideor1-sft-qwen3-llamafactory.json` in `LLaMA-Factory/data` and update the `file_name` field in the `dataset_info.json` file.
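The `dataset_info.json` entry might look like the following sketch. The key name and column mapping here are illustrative assumptions, not taken from the repo; check the fields LLaMA-Factory expects for your data format:

```json
{
  "longvideor1_sft_qwen3": {
    "file_name": "longvideor1-sft-qwen3-llamafactory.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "videos": "videos"
    }
  }
}
```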
Training command:

```bash
llamafactory-cli train examples/train_full/qwen3.yaml
```

After training, replace the `chat_template.jinja` file in the trained checkpoint folder with the standard `chat_template.jinja` to prevent the content within from being masked.
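A minimal sketch of that replacement step. The directories below are temporary stand-ins created only for illustration; in practice, point `BASE_MODEL_DIR` at the original base model and `CKPT_DIR` at your trained checkpoint:

```shell
# Placeholder directories for illustration; substitute your real paths.
BASE_MODEL_DIR=$(mktemp -d)   # stand-in for the base model directory
CKPT_DIR=$(mktemp -d)         # stand-in for the trained checkpoint
echo "standard template" > "${BASE_MODEL_DIR}/chat_template.jinja"
echo "trained template" > "${CKPT_DIR}/chat_template.jinja"

# Overwrite the checkpoint's template with the standard one.
cp "${BASE_MODEL_DIR}/chat_template.jinja" "${CKPT_DIR}/chat_template.jinja"
cat "${CKPT_DIR}/chat_template.jinja"
```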
2. Install verl-tool

```bash
conda create -n verl-tool python=3.10 -y
conda activate verl-tool
cd verl-tool
cd verl
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
cd ..
pip install -e ".[vllm]"
```
- Download CGBench from HuggingFace.
- Download the CGBench captions and RL data from HuggingFace.
- Example strategy: deploy the tool vision model (e.g., Qwen3-VL) on GPUs 6 and 7 using `vllm serve`.
- Place the policy/reference/other components on the remaining GPUs based on your training setup.
Example command:

```bash
PORT=9081
MODEL_PATH="path/to/Qwen3-VL-32B-Instruct"
GPU_PAIR="6,7"
echo "Starting GPU ${GPU_PAIR} vLLM serve (Port $PORT)..."
CUDA_VISIBLE_DEVICES=$GPU_PAIR vllm serve $MODEL_PATH \
    --tensor-parallel-size 2 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.8 \
    --host 127.0.0.1 \
    --port $PORT \
    --mm-processor-cache-gb 0 \
    --served-model-name Qwen3-VL-32B
```
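Once the server is up, you can sanity-check it through vLLM's OpenAI-compatible API. The host and port below match the example above; the `curl` line is commented out because it only works while the server is running:

```shell
PORT=9081
BASE_URL="http://127.0.0.1:${PORT}/v1"
echo "Querying ${BASE_URL}/models"
# Uncomment once the server from the previous step is running;
# the response should list the served model name (Qwen3-VL-32B):
# curl "${BASE_URL}/models"
```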
You need to edit:

- `verl-tool/verl_tool/servers/tools/utils/caption_videoqa_config.example.json`
- `verl-tool/verl_tool/servers/tool_init_config.example.json`

Replace the following with your actual values:

- `caption_dir` / `video_dir`
- `api_key` / `base_url` / `videoqa_model`
- `config_path`
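As an illustration, the filled-in caption/video-QA config might look like the sketch below. Only the field names come from the list above; the values are made-up examples, and the example files define the authoritative schema, so keep any other fields they contain:

```json
{
  "caption_dir": "/data/cgbench/captions",
  "video_dir": "/data/cgbench/videos",
  "api_key": "EMPTY",
  "base_url": "http://127.0.0.1:9081/v1",
  "videoqa_model": "Qwen3-VL-32B"
}
```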
Use the example script `verl-tool/examples/server_test.sh` to verify the tools are working.
If the test passes, you can start RL training.
We provide RL training data initialized with captions from Qwen2.5-VL-72B, and RL training data initialized with captions from Qwen3-VL-32B. You can download them from HuggingFace.
Edit `verl-tool/examples/train/get_caption/train_7b_videoqa.sh`, then run:

```bash
# run RL training
cd verl-tool
bash ./examples/train/get_caption/train_7b_videoqa.sh
```