
feat: add LLaVA-OneVision2 chat model wrapper #1337

Merged
kcz358 merged 1 commit into EvolvingLMMs-Lab:main from yiyexy:feat/llava-onevision2-model on May 19, 2026

@yiyexy yiyexy commented May 19, 2026

Summary

Add a chat-style inference wrapper for LLaVA-OneVision2, registered as llava_onevision2 (chat class llava_onevision2_chat). Targets the released checkpoint lmms-lab-encoder/LLaVA-OneVision2-8B-Instruct.

The model is loaded via AutoModelForImageTextToText.from_pretrained(..., trust_remote_code=True) so that the checkpoint's bundled remote code (modeling_llava_onevision2.py, processing_llava_onevision2.py, video_processing_llava_onevision2.py) is used. This preserves patch_positions, the RoPE block layout, frame sampling + smart_resize, and per-frame timestamp expansion exactly as during training.
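
A minimal sketch of the loading path described above, not the wrapper's actual code. `build_load_kwargs` and `load_model` are hypothetical helper names, and loading the processor through `AutoProcessor` with `trust_remote_code=True` is an assumption based on the bundled processing file; only the `AutoModelForImageTextToText.from_pretrained(..., trust_remote_code=True)` call is taken from this PR.

```python
from typing import Any, Dict, Optional


def build_load_kwargs(attn_implementation: Optional[str] = None, device_map: Optional[str] = None) -> Dict[str, Any]:
    """Collect from_pretrained keyword arguments (hypothetical helper)."""
    # trust_remote_code=True lets transformers run the modeling and
    # processing files shipped inside the checkpoint repository.
    kwargs: Dict[str, Any] = {"trust_remote_code": True}
    if attn_implementation is not None:
        kwargs["attn_implementation"] = attn_implementation
    if device_map is not None:
        kwargs["device_map"] = device_map
    return kwargs


def load_model(pretrained: str, **options: Any):
    """Load model and processor; downloads the checkpoint on first use."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model = AutoModelForImageTextToText.from_pretrained(pretrained, **build_load_kwargs(**options))
    # Assumption: the processor is also resolved from the bundled remote code.
    processor = AutoProcessor.from_pretrained(pretrained, trust_remote_code=True)
    return model, processor
```

Calling `load_model("lmms-lab-encoder/LLaVA-OneVision2-8B-Instruct", attn_implementation="flash_attention_2")` would mirror the arguments in the usage command, assuming flash-attn is installed.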

Changes

  • New lmms_eval/models/chat/llava_onevision2.py — the wrapper:
    • Video frames are pre-fetched via qwen_vl_utils.fetch_video (soft dep via optional_import) with fps / min_pixels / max_pixels / max_frames knobs.
    • Builds a chat content list that pairs each frame's <t seconds> timestamp text with its PIL image (timestamp_decimals sets the number of decimal places).
    • Feeds PIL frames via images=... (not videos=...) to take the image-processor branch the model was trained on.
    • Supports multi-GPU sharding (device_map=auto/balanced/...) and logs throughput via log_metrics().
  • Modified lmms_eval/models/__init__.py — register llava_onevision2 in AVAILABLE_CHAT_TEMPLATE_MODELS.
  • New examples/models/llava_onevision2.sh — accelerate launch example (MLVU-dev best config).
  • Modified docs/advanced/throughput_metrics.md — listed llava_onevision2 as a backend logging throughput metrics.
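
The per-frame content list from the wrapper bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in llava_onevision2.py: `build_frame_content` and `collect_images` are hypothetical names, the exact timestamp string (`<0.5 seconds>`) is a guess at the `<t seconds>` format, and timestamps are derived from the frame index on the assumption that frames are evenly sampled at `fps`.

```python
from typing import Any, Dict, List, Sequence


def build_frame_content(frames: Sequence[Any], fps: float, timestamp_decimals: int = 1) -> List[Dict[str, Any]]:
    """Interleave one timestamp text item with each frame (illustrative only)."""
    content: List[Dict[str, Any]] = []
    for index, frame in enumerate(frames):
        # Assumption: frame i was sampled at i / fps seconds.
        seconds = index / fps
        content.append({"type": "text", "text": f"<{seconds:.{timestamp_decimals}f} seconds>"})
        content.append({"type": "image", "image": frame})
    return content


def collect_images(content: List[Dict[str, Any]]) -> List[Any]:
    """Pull the frames back out, in order, for the processor's images argument."""
    return [item["image"] for item in content if item["type"] == "image"]
```

In the real wrapper the frames would be PIL images returned by qwen_vl_utils.fetch_video, and the collected list would be passed to the processor as images=... rather than videos=..., as the bullet above describes.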

Usage

pip install qwen-vl-utils

accelerate launch --num_processes=8 -m lmms_eval \
    --model=llava_onevision2 \
    --model_args=pretrained=lmms-lab-encoder/LLaVA-OneVision2-8B-Instruct,attn_implementation=flash_attention_2,messages_format=timestamp,fps=1,max_num_frames=384,min_pixels=102400,max_pixels=102400 \
    --tasks=mlvu_dev \
    --batch_size=1

Checklist

  • pre-commit run passes (black --line-length=240, isort)
  • Smoke-tested locally on the target checkpoint
  • Soft-imports qwen_vl_utils (clear install hint if missing)
  • No hardcoded local paths / internal env vars
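
The soft-import item in the checklist could look roughly like the sketch below. It is a stand-in written with the standard library only; `soft_import` and `require` are hypothetical names and do not reproduce the signature of lmms_eval's optional_import helper.

```python
import importlib
from types import ModuleType
from typing import Optional, Tuple


def soft_import(module_name: str, install_hint: str) -> Tuple[Optional[ModuleType], Optional[str]]:
    """Try to import a module; return (module, None) or (None, error message)."""
    try:
        return importlib.import_module(module_name), None
    except ImportError:
        return None, f"{module_name} is not installed. Install it with: {install_hint}"


def require(module: Optional[ModuleType], error: Optional[str]) -> ModuleType:
    """Raise the stored install hint only when the dependency is actually needed."""
    if module is None:
        raise ImportError(error)
    return module


# Importing at module load never fails; the hint surfaces on first use.
qwen_vl_utils, qwen_vl_utils_error = soft_import("qwen_vl_utils", "pip install qwen-vl-utils")
```

Deferring the error this way keeps the model registry importable for users who never run a video task, while still pointing at `pip install qwen-vl-utils` once video frames are requested.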

Register llava_onevision2_chat (key: llava_onevision2) targeting the
released checkpoint lmms-lab-encoder/LLaVA-OneVision2-8B-Instruct.
The wrapper loads via AutoModelForImageTextToText with trust_remote_code
so the bundled processing pipeline (patch_positions, RoPE block layout,
frame sampling + smart_resize, per-frame timestamp expansion) is used
exactly as during training.

- New: lmms_eval/models/chat/llava_onevision2.py
- Register in lmms_eval/models/__init__.py
- Example launch script: examples/models/llava_onevision2.sh
- Documented under docs/advanced/throughput_metrics.md as a backend
  that logs throughput via log_metrics().
@kcz358 kcz358 merged commit 7108c2c into EvolvingLMMs-Lab:main May 19, 2026
1 check failed