feat: add LLaVA-OneVision2 chat model wrapper #1337
Merged
Register llava_onevision2_chat (key: llava_onevision2) targeting the released checkpoint lmms-lab-encoder/LLaVA-OneVision2-8B-Instruct. The wrapper loads the model via AutoModelForImageTextToText with trust_remote_code so the bundled processing pipeline (patch_positions, RoPE block layout, frame sampling + smart_resize, per-frame timestamp expansion) is used exactly as during training.

- New: lmms_eval/models/chat/llava_onevision2.py
- Register in lmms_eval/models/__init__.py
- Example launch script: examples/models/llava_onevision2.sh
- Documented under docs/advanced/throughput_metrics.md as a backend that logs throughput via log_metrics().
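The loading path described above can be sketched roughly as follows. This is a minimal illustration, not the wrapper's actual code: the helper names `build_load_kwargs` and `load_model_and_processor`, the default `device_map="auto"`, and the use of `AutoProcessor` for the processor are assumptions made for the example.

```python
def build_load_kwargs(attn_implementation="flash_attention_2", device_map="auto"):
    """Collect keyword arguments for from_pretrained (illustrative defaults)."""
    return {
        # Lets transformers run the modeling/processing code shipped with the checkpoint.
        "trust_remote_code": True,
        "attn_implementation": attn_implementation,
        "device_map": device_map,
    }


def load_model_and_processor(
    pretrained="lmms-lab-encoder/LLaVA-OneVision2-8B-Instruct", **overrides
):
    """Hypothetical loader; downloads the checkpoint when called."""
    # Imported lazily so this sketch can be imported without transformers installed.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    kwargs = {**build_load_kwargs(), **overrides}
    model = AutoModelForImageTextToText.from_pretrained(pretrained, **kwargs)
    # Assumption: the processor is loaded from the same repo with remote code enabled.
    processor = AutoProcessor.from_pretrained(pretrained, trust_remote_code=True)
    return model.eval(), processor
```

Keeping `trust_remote_code=True` fixed in the helper reflects the point made above: without it, the checkpoint's bundled processing code would not be the code that runs.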
kcz358
approved these changes
May 19, 2026
Summary

Add a chat-style inference wrapper for LLaVA-OneVision2, registered as `llava_onevision2` (chat class `llava_onevision2_chat`). It targets the released checkpoint `lmms-lab-encoder/LLaVA-OneVision2-8B-Instruct`.

The model is loaded via `AutoModelForImageTextToText.from_pretrained(..., trust_remote_code=True)` so that the bundled remote code (`modeling_llava_onevision2.py`, `processing_llava_onevision2.py`, `video_processing_llava_onevision2.py`) is used, preserving `patch_positions`, the RoPE block layout, frame sampling + `smart_resize`, and per-frame timestamp expansion exactly as during training.

Changes
- `lmms_eval/models/chat/llava_onevision2.py`: the wrapper.
  - Decodes video with `qwen_vl_utils.fetch_video` (soft dependency via `optional_import`), with `fps` / `min_pixels` / `max_pixels` / `max_frames` knobs.
  - Builds `<t seconds>` text + `image` PIL pairs (`timestamp_decimals` configurable).
  - Passes `images=...` (not `videos=...`) to take the image-processor branch the model was trained on.
  - Accepts `device_map=auto/balanced/...` and logs throughput via `log_metrics()`.
- `lmms_eval/models/__init__.py`: registers `llava_onevision2` in `AVAILABLE_CHAT_TEMPLATE_MODELS`.
- `examples/models/llava_onevision2.sh`: accelerate launch example (MLVU-dev best config).
- `docs/advanced/throughput_metrics.md`: lists `llava_onevision2` as a backend logging throughput metrics.

Usage
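Video inputs rely on the soft dependency `qwen_vl_utils`, so it can help to confirm that the package is importable before launching a run. The guard below is a minimal sketch of that idea; `require_optional` is a hypothetical helper written for this example and is not the `optional_import` utility the wrapper uses.

```python
import importlib


def require_optional(module_name, install_hint):
    """Import module_name, or raise ImportError carrying an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for video inputs. "
            f"Install it with: {install_hint}"
        ) from exc
```

Calling `require_optional("qwen_vl_utils", "pip install qwen-vl-utils")` at startup surfaces a missing install immediately instead of partway through an evaluation.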
    pip install qwen-vl-utils

    accelerate launch --num_processes=8 -m lmms_eval \
        --model=llava_onevision2 \
        --model_args=pretrained=lmms-lab-encoder/LLaVA-OneVision2-8B-Instruct,attn_implementation=flash_attention_2,messages_format=timestamp,fps=1,max_num_frames=384,min_pixels=102400,max_pixels=102400 \
        --tasks=mlvu_dev \
        --batch_size=1

Checklist

- `pre-commit run` passes (`black --line-length=240`, `isort`)
- `qwen_vl_utils` is a soft dependency (clear install hint if missing)
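The `<t seconds>` text + image pairing used by `messages_format=timestamp` can be pictured with a short sketch. Everything here is an assumption for illustration: the helper name `build_timestamp_content`, the chat-content dictionary layout, and the rule that frame `i` sits at `i / fps` seconds (the real sampling in `qwen_vl_utils.fetch_video` may place frames differently).

```python
def build_timestamp_content(frames, fps=1.0, timestamp_decimals=1):
    """Interleave a '<t seconds>' text entry before each frame.

    frames: sequence of frame objects (e.g. PIL images), assumed evenly
    spaced at `fps` frames per second starting from t = 0.
    """
    content = []
    for index, frame in enumerate(frames):
        seconds = index / fps  # assumed timestamp of this frame
        label = f"<{seconds:.{timestamp_decimals}f} seconds>"
        content.append({"type": "text", "text": label})
        content.append({"type": "image", "image": frame})
    return content
```

With three frames at `fps=2.0` and `timestamp_decimals=1`, the text entries read `<0.0 seconds>`, `<0.5 seconds>`, and `<1.0 seconds>`, each followed by its frame. Because every frame is emitted as an image entry, a list shaped like this would go through the processor's `images=...` path and not `videos=...`.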