This repository contains the code, model weights, and training pipeline for QoQ-Med (Qwen Omni-Reasoning on Medical Questions), a multimodal clinical foundation model with reasoning capabilities.
Paper: https://arxiv.org/abs/2506.00711
| Model | Weights | Avg. Val Accuracy |
|---|---|---|
| QoQ-Med-VL-7B | 🤗 HuggingFace | 68.6% |
| QoQ-Med-VL-32B | 🤗 HuggingFace | 70.7% |
Prefer a point-and-click experience? Community-maintained GGUF builds are already on the Hugging Face Hub. They run locally in desktop chat front-ends such as LM Studio, Ollama, and other llama.cpp-compatible apps: search for "QoQ-Med-VL-7B" or "QoQ-Med-VL-32B", click Download, and start chatting. No Python environment, GPU, or command-line setup is required.
| Model | Format | HuggingFace Link |
|---|---|---|
| QoQ-Med-VL-7B | GGUF | mradermacher/QoQ-Med-VL-7B-GGUF |
| QoQ-Med-VL-7B-i1 | GGUF | mradermacher/QoQ-Med-VL-7B-i1-GGUF |
| QoQ-Med-VL-32B | GGUF | mradermacher/QoQ-Med-VL-32B-GGUF |
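If you would rather script the download (for example, to point a local llama.cpp build or another GGUF-compatible runtime at a specific quantization), a minimal sketch using `huggingface_hub` is shown below. The exact filename is an assumption for illustration; check the repository's file list for the quantization you actually want.

```python
from huggingface_hub import hf_hub_download

# The filename below is an assumption; browse the repo's "Files" tab
# to pick the quantization level you want.
gguf_path = hf_hub_download(
    repo_id="mradermacher/QoQ-Med-VL-7B-GGUF",
    filename="QoQ-Med-VL-7B.Q4_K_M.gguf",
)
print(gguf_path)  # local path you can load in any llama.cpp-compatible app
```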
First, ensure you have the necessary dependencies:
```bash
pip install transformers qwen-vl-utils torch
```

You can then load the QoQ-Med model and processor via the Hugging Face `transformers` package:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"ddvd233/QoQ-Med-VL-7B",
torch_dtype="auto",
device_map="auto"
)
processor = AutoProcessor.from_pretrained("ddvd233/QoQ-Med-VL-7B")
```

For better performance with flash attention:

```python
import torch
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"ddvd233/QoQ-Med-VL-7B",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto",
)
```

You can adjust the visual token range to balance performance and computational cost:

```python
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
"ddvd233/QoQ-Med-VL-7B",
min_pixels=min_pixels,
max_pixels=max_pixels
)
```

Create a message with both image and text content:

```python
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "path/to/your/medical/image.jpg",
},
{"type": "text", "text": "Describe this medical image."},
],
}
]
```

Prepare the input for model inference:

```python
from qwen_vl_utils import process_vision_info
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
```

Run inference and decode the output:

```python
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids):]
for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed,
skip_special_tokens=True,
clean_up_tokenization_spaces=False
)
print(output_text[0])
```

Due to limitations of the `transformers` package, models loaded this way only support vision and text inputs. Our current approach to handling the remaining modalities (e.g., ECG time series) relies on substantial workarounds and is not easily portable. We are working on a better solution and hope to release it in the near future.
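Within the supported vision and text inputs, batched inference follows the standard Qwen2.5-VL usage pattern. The sketch below reuses `model` and `processor` from above; the image paths and prompts are placeholders, and the pattern has not been separately verified against this repository.

```python
# Hedged sketch of batched inference (standard Qwen2.5-VL pattern);
# image paths and prompts are placeholders.
messages1 = [{"role": "user", "content": [
    {"type": "image", "image": "path/to/first/medical/image.jpg"},
    {"type": "text", "text": "Describe this medical image."},
]}]
messages2 = [{"role": "user", "content": [
    {"type": "image", "image": "path/to/second/medical/image.jpg"},
    {"type": "text", "text": "Are there any abnormal findings?"},
]}]
batch = [messages1, messages2]

texts = [
    processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
    for m in batch
]
image_inputs, video_inputs = process_vision_info(batch)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```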
QoQ-Med is the first open generalist clinical foundation model that jointly reasons across:
- Medical images (2D/3D)
- Time-series signals (ECG)
- Text reports
The model is trained with our novel Domain-aware Relative Policy Optimization (DRPO), a reinforcement learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, addressing performance imbalance in heterogeneous clinical data.
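For intuition only, here is a minimal sketch of the kind of hierarchical reward scaling DRPO performs on top of GRPO-style group normalization. The function, the inverse-frequency rarity weight, the difficulty term, and the example statistics are illustrative assumptions, not the paper's actual objective; see the paper for the full formulation and training details.

```python
import numpy as np

def drpo_scaled_advantages(rewards, domain, domain_freq, domain_difficulty, eps=1e-6):
    """Illustrative sketch only: GRPO-style group-normalized advantages,
    rescaled by assumed domain-rarity and difficulty weights."""
    r = np.asarray(rewards, dtype=np.float32)
    # GRPO baseline: normalize rewards within the group of rollouts
    # sampled for a single prompt.
    adv = (r - r.mean()) / (r.std() + eps)
    # Hierarchical scaling (assumed form): up-weight rare domains (inverse
    # frequency) and harder domains/modalities so abundant, easy data does
    # not dominate the policy update.
    rarity_w = 1.0 / (domain_freq[domain] + eps)
    difficulty_w = domain_difficulty[domain]
    return adv * rarity_w * difficulty_w

# Hypothetical per-domain statistics, for illustration only.
domain_freq = {"ecg": 0.05, "chest_xray": 0.40, "dermatology": 0.10}
domain_difficulty = {"ecg": 0.7, "chest_xray": 0.3, "dermatology": 0.5}
print(drpo_scaled_advantages([1.0, 0.0, 0.0, 1.0], "ecg", domain_freq, domain_difficulty))
```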
- Multimodal Integration: Processes and reasons across 1D, 2D, and 3D clinical data
- Domain-Aware Training: DRPO balances learning across 9 clinical domains
- Enhanced Interpretability: Generates reasoning traces and highlights salient regions
- State-of-the-Art Performance: Outperforms existing open-source clinical MLLMs
QoQ-Med spans multiple clinical specialties:
- Cardiology (ECG, Chest X-ray)
- Radiology (CT, MRI, Ultrasound)
- Dermatology
- Ophthalmology (Fundus)
- Pathology
- Mammography
If you find the project useful, please cite the following papers:
```bibtex
@article{dai2025climb,
  title={Climb: Data foundations for large scale multimodal clinical foundation models},
  author={Dai, Wei and Chen, Peilin and Lu, Malinda and Li, Daniel and Wei, Haowen and Cui, Hejie and Liang, Paul Pu},
  journal={International Conference on Machine Learning},
  year={2025}
}

@article{dai2025qoq,
  title={QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training},
  author={Dai, Wei and Chen, Peilin and Ekbote, Chanakya and Liang, Paul Pu},
  journal={arXiv preprint arXiv:2506.00711},
  year={2025}
}
```
This model is intended for research purposes only. Until it has undergone extensive real-world validation (such as clinical trials), it is not suitable for clinical deployment. This is a research preview, not a product approved by any regulatory agency.
