feat: add LLaVA-OneVision2 chat model wrapper by yiyexy · Pull Request #1337 · EvolvingLMMs-Lab/lmms-eval

yiyexy · 2026-05-19T08:55:33Z

Summary

Add a chat-style inference wrapper for LLaVA-OneVision2, registered as llava_onevision2 (chat class llava_onevision2_chat). Targets the released checkpoint lmms-lab-encoder/LLaVA-OneVision2-8B-Instruct.

The model is loaded via AutoModelForImageTextToText.from_pretrained(..., trust_remote_code=True) so that the checkpoint's bundled remote code (modeling_llava_onevision2.py, processing_llava_onevision2.py, video_processing_llava_onevision2.py) is used. This preserves patch_positions, the RoPE block layout, frame sampling + smart_resize, and per-frame timestamp expansion exactly as during training.
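As a rough illustration of the loading path described above, the sketch below separates the keyword arguments from the actual load call. The helper names build_load_kwargs and load_model_and_processor are hypothetical and are not taken from the wrapper; only AutoModelForImageTextToText, AutoProcessor, and trust_remote_code=True come from the description and from transformers itself.

```python
from typing import Any, Dict

PRETRAINED = "lmms-lab-encoder/LLaVA-OneVision2-8B-Instruct"


def build_load_kwargs(device_map: str = "auto", attn_implementation: str = "flash_attention_2") -> Dict[str, Any]:
    """Keyword arguments for the model loader (hypothetical helper, for illustration only)."""
    return {
        # Required so the checkpoint's bundled modeling/processing code is executed.
        "trust_remote_code": True,
        "device_map": device_map,
        "attn_implementation": attn_implementation,
    }


def load_model_and_processor(pretrained: str = PRETRAINED, **overrides: Any):
    """Load the model and its processor; downloads the checkpoint on first use."""
    # Imported lazily so this sketch can be imported without transformers installed.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model = AutoModelForImageTextToText.from_pretrained(pretrained, **build_load_kwargs(**overrides))
    processor = AutoProcessor.from_pretrained(pretrained, trust_remote_code=True)
    return model.eval(), processor
```

Calling load_model_and_processor(device_map="balanced") would shard the model across visible GPUs; whether the wrapper passes exactly these arguments is an assumption.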

Changes

- New lmms_eval/models/chat/llava_onevision2.py (the wrapper):
  - Video frames are pre-fetched via qwen_vl_utils.fetch_video (soft dep via optional_import) with fps / min_pixels / max_pixels / max_frames knobs.
  - Builds a per-frame chat content list of <t seconds> text + image PIL pairs (timestamp_decimals configurable).
  - Feeds PIL frames via images=... (not videos=...) to take the image-processor branch the model was trained on.
  - Supports multi-GPU sharding (device_map=auto/balanced/...) and logs throughput via log_metrics().
- Modified lmms_eval/models/__init__.py: register llava_onevision2 in AVAILABLE_CHAT_TEMPLATE_MODELS.
- New examples/models/llava_onevision2.sh: accelerate launch example (MLVU-dev best config).
- Modified docs/advanced/throughput_metrics.md: listed llava_onevision2 as a backend logging throughput metrics.
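The per-frame content list described above can be sketched as follows. This is a minimal illustration, not the wrapper's code: the function names, the {"type": ...} dictionary keys, and the rule that frame i sits at i / fps seconds are all assumptions. Only the "<t seconds>" text, the text + image pairing, and the timestamp_decimals knob come from the description.

```python
from typing import Any, Dict, List, Sequence, Tuple


def build_frame_content(frames: Sequence[Any], fps: float, timestamp_decimals: int = 1) -> List[Dict[str, Any]]:
    """Interleave a timestamp text entry with each frame (hypothetical helper)."""
    content: List[Dict[str, Any]] = []
    for index, frame in enumerate(frames):
        # Assumption: frames were sampled uniformly at `fps`, so frame i is at i / fps seconds.
        seconds = index / fps
        content.append({"type": "text", "text": f"<{seconds:.{timestamp_decimals}f} seconds>"})
        content.append({"type": "image", "image": frame})
    return content


def split_text_and_images(content: Sequence[Dict[str, Any]]) -> Tuple[List[str], List[Any]]:
    """Collect the frames separately so they can be passed as images=..., not videos=..."""
    texts = [item["text"] for item in content if item["type"] == "text"]
    images = [item["image"] for item in content if item["type"] == "image"]
    return texts, images


# Strings stand in for PIL images so the sketch runs without any video decoding.
content = build_frame_content(["frame0", "frame1", "frame2"], fps=2.0, timestamp_decimals=1)
texts, images = split_text_and_images(content)
```

Here texts holds one timestamp string per frame and images holds the frames in the same order, which is the shape an images=... call would need.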

Usage

pip install qwen-vl-utils

accelerate launch --num_processes=8 -m lmms_eval \
    --model=llava_onevision2 \
    --model_args=pretrained=lmms-lab-encoder/LLaVA-OneVision2-8B-Instruct,attn_implementation=flash_attention_2,messages_format=timestamp,fps=1,max_num_frames=384,min_pixels=102400,max_pixels=102400 \
    --tasks=mlvu_dev \
    --batch_size=1
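To make the --model_args string above easier to read, here is a simplified sketch of how a comma-separated key=value string can be turned into keyword arguments. The function parse_model_args is hypothetical and is not lmms-eval's actual parser; the type-coercion rules are an assumption.

```python
from typing import Dict, Union

Value = Union[int, float, str]


def _coerce(raw: str) -> Value:
    """Convert numeric-looking values to int or float; leave everything else as a string."""
    for caster in (int, float):
        try:
            return caster(raw)
        except ValueError:
            continue
    return raw


def parse_model_args(arg_string: str) -> Dict[str, Value]:
    """Split 'k1=v1,k2=v2' into a dict (illustrative, not the harness's own parser)."""
    parsed: Dict[str, Value] = {}
    for pair in arg_string.split(","):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        parsed[key.strip()] = _coerce(value.strip())
    return parsed


model_args = parse_model_args(
    "pretrained=lmms-lab-encoder/LLaVA-OneVision2-8B-Instruct,"
    "attn_implementation=flash_attention_2,messages_format=timestamp,"
    "fps=1,max_num_frames=384,min_pixels=102400,max_pixels=102400"
)
```

With min_pixels equal to max_pixels, every frame is resized toward the same pixel budget of 102400.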

Checklist

pre-commit run passes (black --line-length=240, isort)
Smoke-tested locally on the target checkpoint
Soft-imports qwen_vl_utils (clear install hint if missing)
No hardcoded local paths / internal env vars
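The soft-import item in the checklist can be illustrated with a small standard-library sketch. The helpers soft_import and require are hypothetical stand-ins; the signature and behavior of lmms-eval's own optional_import helper are not shown in this PR description and may differ.

```python
import importlib
from types import ModuleType
from typing import Optional, Tuple


def soft_import(module_name: str, install_hint: str) -> Tuple[Optional[ModuleType], Optional[str]]:
    """Try to import a module; return (module, None) or (None, error message)."""
    try:
        return importlib.import_module(module_name), None
    except ImportError:
        return None, f"'{module_name}' is not installed; {install_hint}"


def require(module: Optional[ModuleType], error: Optional[str]) -> ModuleType:
    """Raise only when the optional dependency is actually needed."""
    if module is None:
        raise ImportError(error)
    return module


# Importing at module load never fails; the hint surfaces only when video input is used.
qwen_vl_utils, qwen_vl_utils_error = soft_import("qwen_vl_utils", "run: pip install qwen-vl-utils")
```

Deferring the failure to the call site means image-only tasks would still run without qwen-vl-utils, assuming the wrapper guards only the video path.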

Register llava_onevision2_chat (key: llava_onevision2) targeting the released checkpoint lmms-lab-encoder/LLaVA-OneVision2-8B-Instruct. The wrapper loads via AutoModelForImageTextToText with trust_remote_code so the bundled processing pipeline (patch_positions, RoPE block layout, frame sampling + smart_resize, per-frame timestamp expansion) is used exactly as during training.
- New: lmms_eval/models/chat/llava_onevision2.py
- Register in lmms_eval/models/__init__.py
- Example launch script: examples/models/llava_onevision2.sh
- Documented under docs/advanced/throughput_metrics.md as a backend that logs throughput via log_metrics().

kcz358 approved these changes May 19, 2026

kcz358 merged commit 7108c2c into EvolvingLMMs-Lab:main May 19, 2026
1 check failed

yiyexy mentioned this pull request May 20, 2026

fix(llava_onevision2): forward static images to image_processor #1344

Merged

2 tasks

kcz358 merged 1 commit into EvolvingLMMs-Lab:main from yiyexy:feat/llava-onevision2-model
