QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training


(Figure: QoQ-Med model overview)

This repository contains the code, model weights, and training pipeline for QoQ-Med (Qwen Omni-Reasoning on Medical Questions), a multimodal clinical foundation model with reasoning capabilities.

Paper: https://arxiv.org/abs/2506.00711

Model Weights

| Model | Weights | Avg. Val Accuracy |
| --- | --- | --- |
| QoQ-Med-VL-7B | 🤗 HuggingFace | 68.6% |
| QoQ-Med-VL-32B | 🤗 HuggingFace | 70.7% |

Quick Start

Use with Front-End Apps

Prefer a point-and-click experience? Community-maintained GGUF builds are already on the Hub. They load directly in desktop chat front-ends such as LM Studio, Ollama, and other llama.cpp-compatible apps: search for "QoQ-Med-VL-7B" or "QoQ-Med-VL-32B", click Download, and start chatting. No Python environment, GPU, or command-line setup is required.

| Model | Format | HuggingFace Link |
| --- | --- | --- |
| QoQ-Med-VL-7B | GGUF | mradermacher/QoQ-Med-VL-7B-GGUF |
| QoQ-Med-VL-7B-i1 | GGUF | mradermacher/QoQ-Med-VL-7B-i1-GGUF |
| QoQ-Med-VL-32B | GGUF | mradermacher/QoQ-Med-VL-32B-GGUF |
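If you would rather use a terminal, recent Ollama releases can also pull GGUF builds directly from the Hub by repository reference. This assumes current Ollama behavior; check the repository's model card for the available quantization tags:

ollama run hf.co/mradermacher/QoQ-Med-VL-7B-GGUF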

Installation

First, install the necessary dependencies (accelerate is required for device_map="auto" below):

pip install transformers qwen-vl-utils torch accelerate

Loading the Model

You can load the QoQ-Med model and its processor via the Hugging Face transformers package:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "ddvd233/QoQ-Med-VL-7B", 
    torch_dtype="auto", 
    device_map="auto"
)

processor = AutoProcessor.from_pretrained("ddvd233/QoQ-Med-VL-7B")

For better performance, enable FlashAttention 2 (requires the flash-attn package to be installed):

import torch

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "ddvd233/QoQ-Med-VL-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

Configuring Visual Token Range

You can adjust the visual token range to balance performance and computational cost:

# Each visual token corresponds to a 28x28-pixel patch, so these bounds
# keep every image between 256 and 1280 visual tokens.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28

processor = AutoProcessor.from_pretrained(
    "ddvd233/QoQ-Med-VL-7B",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)

Preparing Multimodal Input

Create a message with both image and text content:

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/medical/image.jpg",
            },
            {"type": "text", "text": "Describe this medical image."},
        ],
    }
]
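
The same message format also carries video. A minimal sketch following the standard Qwen2.5-VL conventions (the file path is a placeholder):

messages = [
    {
        "role": "user",
        "content": [
            # Placeholder path; process_vision_info loads and samples frames
            {"type": "video", "video": "path/to/your/ultrasound_clip.mp4"},
            {"type": "text", "text": "Describe this ultrasound clip."},
        ],
    }
]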

Processing the Input

Prepare the input for model inference:

from qwen_vl_utils import process_vision_info

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to(model.device)  # move tensors onto the same device as the model

Generating Output

Run inference and decode the output:

generated_ids = model.generate(**inputs, max_new_tokens=128)

generated_ids_trimmed = [
    out_ids[len(in_ids):] 
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed, 
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=False
)

print(output_text[0])

Time Series Support

Due to limitations of the transformers package, models loaded this way support only vision and text inputs. Our current approach to time-series (ECG) input involves substantial workarounds and is not easily portable; we are working on a cleaner solution and hope to release it in the near future.

Overview

QoQ-Med is the first open generalist clinical foundation model that jointly reasons across:

  • Medical images (2D/3D)
  • Time-series signals (ECG)
  • Text reports

The model is trained with our novel Domain-aware Relative Policy Optimization (DRPO), a reinforcement learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, addressing performance imbalance in heterogeneous clinical data.
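
As a rough intuition for how domain-aware scaling interacts with GRPO-style group normalization, here is a minimal sketch. The exact DRPO objective is specified in the paper; the names below (domain_freq, domain_difficulty) are illustrative assumptions, not the authors' implementation:

import numpy as np

def grpo_advantages(rewards):
    # GRPO-style group-relative advantages: normalize rewards within the
    # group of responses sampled for a single prompt.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def drpo_style_advantages(rewards, domain_freq, domain_difficulty):
    # Illustrative only: upweight groups from rare domains (low frequency)
    # and from domains the model currently finds hard, so abundant, easy
    # domains do not dominate the policy update.
    weight = domain_difficulty / domain_freq
    return weight * grpo_advantages(rewards)

# Example: 4 sampled answers to a dermatology question, where dermatology
# makes up 2% of the training mix and has an estimated difficulty of 0.7.
advantages = drpo_style_advantages(
    rewards=[1.0, 0.0, 1.0, 0.0],
    domain_freq=0.02,
    domain_difficulty=0.7,
)
print(advantages)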

Key Features

  • Multimodal Integration: Processes and reasons across 1D, 2D, and 3D clinical data
  • Domain-Aware Training: DRPO balances learning across 9 clinical domains
  • Enhanced Interpretability: Generates reasoning traces and highlights salient regions
  • State-of-the-Art Performance: Outperforms existing open-source clinical MLLMs

Clinical Domains

QoQ-Med spans multiple clinical specialties:

  • Cardiology (ECG, Chest X-ray)
  • Radiology (CT, MRI, Ultrasound)
  • Dermatology
  • Ophthalmology (Fundus)
  • Pathology
  • Mammography

Citations

If you find the project useful, please cite the following papers:

@inproceedings{dai2025climb,
  title={CLIMB: Data Foundations for Large Scale Multimodal Clinical Foundation Models},
  author={Dai, Wei and Chen, Peilin and Lu, Malinda and Li, Daniel and Wei, Haowen and Cui, Hejie and Liang, Paul Pu},
  booktitle={International Conference on Machine Learning},
  year={2025}
}
@article{dai2025qoq,
  title={QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training},
  author={Dai, Wei and Chen, Peilin and Ekbote, Chanakya and Liang, Paul Pu},
  journal={arXiv preprint arXiv:2506.00711},
  year={2025}
}

Important Note

This model is intended for research purposes only. It is not suitable for clinical deployment until it has undergone extensive real-world validation, such as clinical trials. This is a research preview, not a product approved by any regulatory agency.
