MedVLSynther: Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs
MedVLSynther is a rubric-guided generator-verifier framework that synthesizes high-quality multiple-choice VQA items directly from open biomedical literature by conditioning on figures, captions, and in-text references. Applying this pipeline to PubMed Central yields MedSynVQA: 13,087 audited questions over 14,803 images spanning 13 imaging modalities and 28 anatomical regions. Training open-weight LMMs with reinforcement learning using verifiable rewards improves accuracy across six medical VQA benchmarks, achieving averages of 55.85 (3B) and 58.15 (7B), with up to 77.57 on VQA-RAD and 67.76 on PathVQA, outperforming strong medical LMMs.
- Fully open stack — End-to-end release of code, data curation scripts, checkpoints, and evaluation to enable full reproduction and auditing.
- Automatic, open-sourced pipeline — A rubric-guided generator–verifier workflow turns figures + captions into exam-quality MCQs with minimal manual effort, and is designed for easy extension.
- Contamination analysis assurance — We audit potential train/test overlap at both text and image levels; under our protocol, we find no leakage between our training data and evaluation suites.
- Effective in practice — Training open-weight LMMs on our verified synthetic data yields consistent gains across standard medical VQA benchmarks.
- Python 3.10
- CUDA 12.1 or later
git clone
conda create -n medvlsynther python==3.10
conda activate medvlsynther
We use verl for GRPO and trl for SFT.
GRPO:
conda activate medvlsynther
git clone https://github.com/volcengine/verl.git
cd verl
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
trl:
conda create -n medvlsynther_sft python==3.10
conda activate medvlsynther_sft
# Install torch according to your own cuda version
pip install trl transformers
Because GLM-4.5V requires recent vLLM and transformers, we recommend using the SFT (TRL) environment for the entire synthesis pipeline.
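As a quick sanity check of the environment you plan to use for synthesis, you can print the installed versions of the key packages. This snippet is illustrative only; exact minimum versions are not pinned here:
import importlib.metadata as importlib_metadata

# Report installed versions of the key packages; GLM-4.5V needs a recent vLLM and transformers.
for pkg in ("torch", "transformers", "trl", "vllm"):
    try:
        print(pkg, importlib_metadata.version(pkg))
    except importlib_metadata.PackageNotFoundError:
        print(pkg, "not installed")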
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
# Load the model
model_name="MedVLSynther/MedVLSynther-7B-RL_13K"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)
# Example usage
messages_1 = [
{
"role": "system",
"content": "You will solve a problem/request. You should provide your thoughts within <think> </think> tags before providing the answer.\nWrite your final answer within <answer> </answer> tags.",
},
{
"role": "user",
"content": [
{
"type": "image",
"image": "assets/7bMMMU.png",
},
{"type": "text", "text": "This line of of myelinated axons in layer IV of visual cortex represents the axons of cells in the Choices: (A) Superior colliculus. (B) Lateral geniculate.(C) Retina. (D) Medial geniculate."},
],
}
]
messages_2 = [
{
"role": "system",
"content": "You will solve a problem/request. You should provide your thoughts within <think> </think> tags before providing the answer.\nWrite your final answer within <answer> </answer> tags.",
},
{
"role": "user",
"content": [
{
"type": "image",
"image": "assets/7bslake.png",
},
{"type": "text", "text": "Does the picture contain kidney? Choices: (A) Yes (B) No"},
],
}
]
# Preparation for inference
messages = messages_2
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=2048, temperature=0.6, top_p=0.95, do_sample=True)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
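Since the system prompt asks for the final answer inside <answer> </answer> tags, you can extract the predicted option letter from output_text with a small helper. This is an illustrative sketch, not part of the released code:
import re

def extract_answer_letter(text: str) -> str | None:
    """Return the option letter inside the last <answer> ... </answer> span, or None if absent."""
    spans = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    if not spans:
        return None
    letters = re.findall(r"[A-E]", spans[-1].upper())
    return letters[0] if letters else None

# `output_text` comes from the snippet above; output_text[0] is the decoded completion.
print(extract_answer_letter(output_text[0]))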
We release MedSynVQA and the subsets used in our paper. Each set targets medical vision–language QA and supports RLVR/SFT training.
| Dataset | Generator | Verifier | Modality | Description | Download |
|---|---|---|---|---|---|
| MedSynVQA | GLM-4.5V 108B | Qwen2.5-VL 72B | Image–Text | Full training set for medical VQA (used for RLVR). | 🤗 HF |
| MedSynVQA-10K | GLM-4.5V 108B | Qwen2.5-VL 72B | Image–Text | 10K-sample training subset for RLVR. | 🤗 HF |
| MedSynVQA-5K | GLM-4.5V 108B | Qwen2.5-VL 72B | Image–Text | 5K-sample training subset for RLVR. | 🤗 HF |
| MedSynVQA-2K | GLM-4.5V 108B | Qwen2.5-VL 72B | Image–Text | 2K-sample training subset for RLVR. | 🤗 HF |
| MedSynVQA-1K | GLM-4.5V 108B | Qwen2.5-VL 72B | Image–Text | 1K-sample training subset for RLVR. | 🤗 HF |
| MedSynVQA-5K-qwen-glm | Qwen2.5-VL 72B | GLM-4.5V 108B | Image–Text | 5K subset for the generator/verifier choice ablation (Qwen2.5-VL generator, GLM-4.5V verifier). | 🤗 HF |
| MedSynVQA-5K-internvl-glm | InternVL-3.5 38B | GLM-4.5V 108B | Image–Text | 5K subset for the generator choice ablation (InternVL-3.5 generator, GLM-4.5V verifier). | 🤗 HF |
| MedSynVQA-5K-glm-glm | GLM-4.5V 108B | GLM-4.5V 108B | Image–Text | 5K subset for the verifier choice ablation (GLM-4.5V generator, GLM-4.5V verifier). | 🤗 HF |
| MedSynVQA-5K-no-verify | GLM-4.5V 108B | N/A | Image–Text | 5K subset for verifier necessity ablation (no verification step). | 🤗 HF |
| MedSynVQA-5K-PMC-style | GLM-4.5V 108B | N/A | Image–Text | 5K subset generated with PMC-VQA–style prompts. | 🤗 HF |
| MedSynVQA-5K-SFT | GLM-4.5V 108B | N/A | Image–Text | 5K subset generated for SFT training. | 🤗 HF |
from datasets import load_dataset
# Load evaluation dataset
eval_dataset = load_dataset("UCSC-VLAA/MedVLThinker-Eval")
# Load training dataset
train_dataset = load_dataset("MedVLSynther/MedSynVQA-13K")
For dataset details and for using the synthesis pipeline, please refer to synthesis/README.md and data_process/README.md.
Dataset details and preparing your own data
All training datasets follow a unified format, identical to the one used by MedVLThinker:
{
"images": [PIL.Image], # List of images
"question": str, # Question text
"options": Dict[str, str], # Multiple choice options
"answer_label": str, # Correct answer label (A, B, C, D, E)
"answer": str, # Full answer text
"reasoning": str, # Chain-of-thought reasoning (optional)
"dataset_name": str, # Source dataset name
"dataset_index": int # Unique sample identifier
}
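As a reference for how these fields fit together, here is a minimal sketch that renders one sample as a multiple-choice prompt. It is illustrative only; the exact prompt template used in training may differ, and it assumes a default train split:
from datasets import load_dataset

def format_mcqa_prompt(sample: dict) -> str:
    """Render one unified-format sample as a multiple-choice question string (illustrative only)."""
    option_lines = [f"({label}) {text}" for label, text in sorted(sample["options"].items())]
    return f'{sample["question"]} Choices: {" ".join(option_lines)}'

# Assumes the dataset exposes a default "train" split.
train_dataset = load_dataset("MedVLSynther/MedSynVQA-13K", split="train")
sample = train_dataset[0]
print(format_mcqa_prompt(sample), "->", sample["answer_label"])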
For evaluation, please download MedVLThinker-Eval.
For training, please download the dataset you want to use from the table above, e.g., MedSynVQA:
hf download MedVLSynther/MedSynVQA-13K --repo-type=dataset
Prepare it for verl format:
python data_process/prep_to_hf_bytes.py \
--parquet_glob "data/MedSynVQA-13K/*.parquet" \
--out_dir data/MedSynVQA-13K_hf \
--num_proc 32 --strict_image --keep_first_k_images 6
python data_process/convert_verl_format.py \
--local_data_dir data/MedSynVQA-13K_hf \
--data_source MedSynVQA-13K \
--ability medical_mcqa \
--split train \
--output_dir data/MedSynVQA-13K_verl \
--num_proc 32
For GRPO training, please refer to the scripts in train/.
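For context, the reward in RLVR for this multiple-choice data can be computed by a simple rule-based check on the <answer> tags. The sketch below illustrates the idea and is not necessarily the exact reward function used in the released training configs:
import re

def mcqa_reward(model_output: str, ground_truth_label: str) -> float:
    """Rule-based verifiable reward: 1.0 if the letter inside <answer> tags matches the label, else 0.0."""
    span = re.search(r"<answer>(.*?)</answer>", model_output, flags=re.DOTALL)
    if span is None:
        return 0.0
    letters = re.findall(r"[A-E]", span.group(1).upper())
    return 1.0 if letters and letters[0] == ground_truth_label.strip().upper() else 0.0

print(mcqa_reward("<think>...</think><answer>B</answer>", "B"))  # 1.0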
After training, you need to convert verl checkpoints for inference:
python -m verl.model_merger merge --backend fsdp --local_dir /path/to/checkpoints/global_step_xxx/actor --target_dir /path/to/converted/checkpoints
For SFT training, run:
bash train/sft/train_commands.sh
For evaluation, please refer to eval/README.md.
| Model | Size | RL/SFT | Training Data | Download |
|---|---|---|---|---|
| RL Models | ||||
| MedVLSynther-3B-RL_1K | 3B | RL | MedSynVQA-1K | 🤗 HF |
| MedVLSynther-3B-RL_2K | 3B | RL | MedSynVQA-2K | 🤗 HF |
| MedVLSynther-3B-RL_5K | 3B | RL | MedSynVQA-5K | 🤗 HF |
| MedVLSynther-3B-RL_10K | 3B | RL | MedSynVQA-10K | 🤗 HF |
| MedVLSynther-3B-RL_13K | 3B | RL | MedSynVQA | 🤗 HF |
| MedVLSynther-3B-RL_5K_qwen-glm | 3B | RL | MedSynVQA-5K-qwen-glm | 🤗 HF |
| MedVLSynther-3B-RL_5K_internvl-glm | 3B | RL | MedSynVQA-5K-internvl-glm | 🤗 HF |
| MedVLSynther-3B-RL_5K_glm-glm | 3B | RL | MedSynVQA-5K-glm-glm | 🤗 HF |
| MedVLSynther-3B-RL_5K_no-verify | 3B | RL | MedSynVQA-5K-no-verify | 🤗 HF |
| MedVLSynther-3B-RL_5K_PMC-style | 3B | RL | MedSynVQA-5K-PMC-style | 🤗 HF |
| MedVLSynther-7B-RL_1K | 7B | RL | MedSynVQA-1K | 🤗 HF |
| MedVLSynther-7B-RL_2K | 7B | RL | MedSynVQA-2K | 🤗 HF |
| MedVLSynther-7B-RL_5K | 7B | RL | MedSynVQA-5K | 🤗 HF |
| MedVLSynther-7B-RL_10K | 7B | RL | MedSynVQA-10K | 🤗 HF |
| MedVLSynther-7B-RL_13K | 7B | RL | MedSynVQA | 🤗 HF |
| MedVLSynther-7B-RL_5K_qwen-glm | 7B | RL | MedSynVQA-5K-qwen-glm | 🤗 HF |
| MedVLSynther-7B-RL_5K_internvl-glm | 7B | RL | MedSynVQA-5K-internvl-glm | 🤗 HF |
| MedVLSynther-7B-RL_5K_glm-glm | 7B | RL | MedSynVQA-5K-glm-glm | 🤗 HF |
| MedVLSynther-7B-RL_5K_no-verify | 7B | RL | MedSynVQA-5K-no-verify | 🤗 HF |
| MedVLSynther-7B-RL_5K_PMC-style | 7B | RL | MedSynVQA-5K-PMC-style | 🤗 HF |
| SFT Models | ||||
| MedVLSynther-3B-SFT_5K | 3B | SFT | MedSynVQA-5K-SFT | 🤗 HF |
| MedVLSynther-7B-SFT_5K | 7B | SFT | MedSynVQA-5K-SFT | 🤗 HF |
Comparison with other methods.
| Model | PMC-VQA | MMMU | MedXpertQA-MM | PathVQA | SLAKE | VQA-RAD | Avg. |
|---|---|---|---|---|---|---|---|
| General LMM | |||||||
| Gemma 3 4B | 44.42 | 46.67 | 21.89 | 59.24 | 66.59 | 56.86 | 49.28 |
| Qwen2.5-VL-3B-Instruct | 44.77 | 44.12 | 20.69 | 61.96 | 61.30 | 62.01 | 49.14 |
| Qwen2.5-VL-7B-Instruct | 49.30 | 52.94 | 18.89 | 65.39 | 65.71 | 68.75 | 53.50 |
| Medical LMM | |||||||
| MedGemma 4B | 42.73 | 32.55 | 8.17 | 59.64 | 83.49 | 78.55 | 50.86 |
| MedGemma 27B | 36.75 | 35.88 | 12.13 | 62.09 | 77.40 | 72.67 | 49.49 |
| LLaVA-Med v1.5 Mistral 7B | 34.28 | 31.37 | 22.56 | 56.52 | 62.82 | 56.74 | 44.05 |
| HuatuoGPT-Vision-7B | 53.39 | 50.59 | 22.00 | 63.53 | 75.00 | 63.60 | 54.69 |
| MedVLThinker-3B | 47.32 | 52.16 | 22.90 | 62.28 | 63.38 | 71.08 | 53.19 |
| MedVLThinker-7B | 50.67 | 56.86 | 24.43 | 66.83 | 65.79 | 64.71 | 54.88 |
| MedVLSynther-3B | 50.23 | 52.35 | 21.40 | 62.82 | 74.76 | 73.53 | 55.85 |
| MedVLSynther-7B | 55.43 | 55.88 | 22.10 | 65.56 | 72.36 | 77.57 | 58.15 |
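The Avg. column is the unweighted mean of the six benchmark scores; for example, for the MedVLSynther-7B row:
# Unweighted mean over the six benchmarks, reproducing the Avg. column (MedVLSynther-7B row).
scores_7b = [55.43, 55.88, 22.10, 65.56, 72.36, 77.57]
print(round(sum(scores_7b) / len(scores_7b), 2))  # 58.15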
MedVLSynther/
├── analysis/ # Result analysis
├── assets/ # Assets for this project
├── contamination/ # Contamination analysis
├── data_process/ # Data preprocessing and preparation
├── eval/ # Evaluation scripts and benchmarks
├── synthesis/ # Data synthesis
├── train/ # Training scripts and configurations
└── README.md # This file
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- VERL for reinforcement learning framework
- vLLM for efficient inference
- GLM-V, Qwen-VL, and InternVL for SOTA LMMs
- BioMedica for curated biomedical literature
- Medical VQA dataset providers
If you find this work useful, please cite:
@article{MedVLSynther,
title={MedVLSynther: Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs},
author={Huang, Xiaoke and Wang, Ningsen and Liu, Hui and Tang, Xianfeng and Zhou, Yuyin},
journal={arXiv preprint arXiv:2510.25867},
year={2025}
}
@article{MedVLThinker,
title={MedVLThinker: Simple Baselines for Multimodal Medical Reasoning},
author={Huang, Xiaoke and Wu, Juncheng and Liu, Hui and Tang, Xianfeng and Zhou, Yuyin},
journal={arXiv preprint arXiv:2508.02669},
year={2025}
}