feat: integrate six traceable benchmarks with unified smoke test by Luodian · Pull Request #1202 · EvolvingLMMs-Lab/lmms-eval

Luodian · 2026-02-24T12:10:39Z

Summary

- Add six benchmark tasks: repcount, countix, ovr_kinetics, ssv2, vggsound, av_asr
- Add compatibility aliases: anet_qa, mmmu_a, egosch_a
- Update the smoke strategy by modality: non-audio tasks run with Gemini 3 Flash, audio tasks with an OpenRouter omni-capable model
- Run video smoke tests with video_fps=1 to increase temporal coverage
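To make the effect of `video_fps=1` together with `max_frames_num=64` concrete, here is a minimal sketch of fps-based frame sampling with a frame cap. It is an illustration only: `sample_frame_indices` is a hypothetical helper, not the loader used by lmms-eval, and the uniform-subsampling fallback is an assumption about how a cap might be applied.

```python
def sample_frame_indices(
    total_frames: int,
    native_fps: float,
    video_fps: float = 1.0,
    max_frames_num: int = 64,
) -> list[int]:
    """Pick frame indices at roughly `video_fps` frames per second.

    Hypothetical helper: if the fps-based selection exceeds `max_frames_num`,
    it is uniformly subsampled down to the cap.
    """
    if total_frames <= 0 or native_fps <= 0 or video_fps <= 0:
        return []

    # Number of source frames between two sampled frames.
    step = max(native_fps / video_fps, 1.0)
    count = max(int(total_frames / step), 1)
    indices = [min(int(i * step), total_frames - 1) for i in range(count)]

    if len(indices) > max_frames_num:
        # Too many frames for the cap: keep an evenly spaced subset.
        stride = len(indices) / max_frames_num
        indices = [indices[int(i * stride)] for i in range(max_frames_num)]
    return indices


if __name__ == "__main__":
    # 10 s clip at 30 fps: one frame per second, 10 frames in total.
    print(sample_frame_indices(300, 30.0))
    # 5 min clip at 30 fps: 300 candidates, capped at 64.
    print(len(sample_frame_indices(9000, 30.0)))
```

Under these assumptions, short clips are covered at one frame per second, while clips longer than 64 seconds fall back to evenly spaced frames across the whole video.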

Smoke Tests (1 FPS)

1) Non-audio tasks with Gemini 3 Flash

export OPENAI_API_KEY="$OPENROUTER_API_KEY"
export OPENAI_API_BASE="https://openrouter.ai/api/v1"
uv run python -m lmms_eval eval \
  --model openai_compatible \
  --model_args "model_version=google/gemini-3-flash-preview,video_fps=1,max_frames_num=64,num_concurrent=1,max_retries=2,timeout=60" \
  --tasks repcount,countix,ovr_kinetics,ssv2 \
  --limit 1 \
  --output_path outputs/smoke_gemini3_flash_non_audio \
  --log_samples --process_with_media --verbosity INFO --force_simple

2) Audio tasks with OpenRouter omni-capable model

export OPENAI_API_KEY="$OPENROUTER_API_KEY"
export OPENAI_API_BASE="https://openrouter.ai/api/v1"
uv run python -m lmms_eval eval \
  --model openai_compatible \
  --model_args "model_version=google/gemini-2.5-flash,video_fps=1,max_frames_num=64,num_concurrent=1,max_retries=2,timeout=60" \
  --tasks vggsound,av_asr \
  --limit 1 \
  --output_path outputs/smoke_openrouter_omni_audio \
  --log_samples --process_with_media --verbosity INFO --force_simple
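The two commands above differ only in model version, task list, and output path, so they can be driven from one small script. The sketch below assembles the same invocations in Python; `SMOKE_RUNS`, `build_command`, and `run_all` are hypothetical names for illustration, and only flags that appear in the commands above are used. It defaults to a dry run that prints the commands instead of executing them.

```python
import os
import subprocess

# output path -> (model_version, comma-separated tasks), copied from the commands above.
SMOKE_RUNS = {
    "outputs/smoke_gemini3_flash_non_audio": (
        "google/gemini-3-flash-preview",
        "repcount,countix,ovr_kinetics,ssv2",
    ),
    "outputs/smoke_openrouter_omni_audio": (
        "google/gemini-2.5-flash",
        "vggsound,av_asr",
    ),
}


def build_command(model_version: str, tasks: str, output_path: str) -> list[str]:
    """Assemble one smoke-test invocation as an argument list."""
    model_args = ",".join([
        f"model_version={model_version}",
        "video_fps=1",
        "max_frames_num=64",
        "num_concurrent=1",
        "max_retries=2",
        "timeout=60",
    ])
    return [
        "uv", "run", "python", "-m", "lmms_eval", "eval",
        "--model", "openai_compatible",
        "--model_args", model_args,
        "--tasks", tasks,
        "--limit", "1",
        "--output_path", output_path,
        "--log_samples", "--process_with_media",
        "--verbosity", "INFO", "--force_simple",
    ]


def run_all(dry_run: bool = True) -> None:
    """Run (or just print) every smoke command with the OpenRouter environment set."""
    env = dict(os.environ)
    env["OPENAI_API_KEY"] = env.get("OPENROUTER_API_KEY", "")
    env["OPENAI_API_BASE"] = "https://openrouter.ai/api/v1"
    for output_path, (model_version, tasks) in SMOKE_RUNS.items():
        cmd = build_command(model_version, tasks, output_path)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Passing `dry_run=False` would execute the runs for real, which requires `OPENROUTER_API_KEY` to be set and `uv` to be installed.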

Final Table (smoke)

Benchmark	Model	Metric	Value
repcount	`google/gemini-3-flash-preview`	mae_norm	0.9804
repcount	`google/gemini-3-flash-preview`	obo	0.0000
countix	`google/gemini-3-flash-preview`	mae_norm	2.2581
countix	`google/gemini-3-flash-preview`	obo	0.0000
ovr_kinetics	`google/gemini-3-flash-preview`	mae	8.0000
ovr_kinetics	`google/gemini-3-flash-preview`	obo	0.0000
ssv2	`google/gemini-3-flash-preview`	acc	0.0000
vggsound	`google/gemini-2.5-flash`	acc	0.0000
av_asr	`google/gemini-2.5-flash`	wer	1050.0000
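For readers unfamiliar with the metric names, the sketch below shows one common way to compute them for a single sample. These are assumptions based on the usual definitions in repetition counting (off-by-one accuracy, normalized absolute error) and speech recognition (word error rate); the task implementations in this PR may differ, and reporting WER as a percentage is also an assumption. Because the smoke runs use `--limit 1`, each table value comes from one sample.

```python
def obo(pred: int, gt: int) -> float:
    """Off-by-one accuracy: 1.0 if the predicted count is within 1 of the ground truth."""
    return 1.0 if abs(pred - gt) <= 1 else 0.0


def mae_norm(pred: int, gt: int) -> float:
    """Absolute count error normalized by the ground-truth count."""
    if gt <= 0:
        raise ValueError("ground-truth count must be positive")
    return abs(pred - gt) / gt


def wer_percent(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length * 100."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Standard dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(obo(5, 4), obo(5, 8))   # within one vs. off by three
    print(mae_norm(1, 4))         # 3 / 4
    # A hypothesis much longer than the reference pushes WER above 100.
    print(wer_percent("yes", "yes it is indeed so"))
```

Under this definition WER is unbounded above, so a value far over 100 on a single sample can occur when the model output is much longer than the reference transcript.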


* refactor: remove dead read_video_pyav_pil and deduplicate _resize_image in load_video
* refactor: rename read_video_pyav -> read_video, remove dead code
  - Rename read_video_pyav to read_video in load_video.py with backward-compat alias
  - Delete _resize_image and read_video_pyav_base64 dead functions
  - Update all 12 caller files to use read_video directly
  - Inline base64 encoding logic in qwen2_5_omni.py (was read_video_pyav_base64)
  - Fix missing import in vila.py (latent bug)
  - Remove use_custom_video_loader dead code from 5 models that declared but never checked it (qwen2_5_vl, qwen3_vl, qwen3_omni, llava_onevision1_5, huggingface)
* docs: rewrite Section 7.1 to document read_video backends, remove dead Section 7.2
* feat: unified CLI with subcommand dispatch and interactive wizard
  Add lmms_eval/cli/ package with subcommand-based architecture:
  - eval - run evaluation (wizard mode when no args)
  - tasks - list/groups/subtasks/tags browser
  - models - list backends with optional --aliases
  - ui - launch Web UI
  - serve - start HTTP eval server
  - power - statistical power analysis
  - version - version and environment info
  - tui - terminal UI (textual)
  Full backward compat: lmms-eval --model X --tasks Y still works.
  Entrypoint rewired through cli.dispatch:main in pyproject.toml.
* docs: add external usage guide for CLI and library access
  Add docs/external_usage.md covering CLI subcommands (tasks, models, eval wizard, ui, serve, power, version) and Python library usage (TaskManager, datasets, evaluator, metrics). Update docs index link. Polish v0.7 release notes for consistency.
* feat(tasks): add six benchmark tasks and unified smoke report
* fix(smoke): enable audio payloads for openrouter omni runs
* fix(smoke): use 1fps video sampling for api smoke runs
* fix(multimodal): correct audio routing and video fps sampling
* test(cli): add dispatch and task pipeline coverage
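The subcommand layout with a backward-compatible fallback described in the CLI commit can be sketched with `argparse`. This is not the code in `lmms_eval/cli/`: `normalize_argv`, `build_parser`, and `dispatch` are hypothetical names, and only the subcommand names and the `--model`, `--tasks`, and `--aliases` flags are taken from the commit message. The idea is that a legacy flag-only invocation is rewritten to the `eval` subcommand before parsing.

```python
import argparse

# Subcommand names as listed in the commit message.
SUBCOMMANDS = ("eval", "tasks", "models", "ui", "serve", "power", "version", "tui")


def normalize_argv(argv: list[str]) -> list[str]:
    """Route legacy flag-only invocations (e.g. `--model X --tasks Y`) to `eval`."""
    if argv and argv[0].startswith("-") and argv[0] not in ("-h", "--help"):
        return ["eval", *argv]
    return list(argv)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="lmms-eval")
    sub = parser.add_subparsers(dest="command")

    eval_parser = sub.add_parser("eval", help="run evaluation")
    eval_parser.add_argument("--model")
    eval_parser.add_argument("--tasks")

    models_parser = sub.add_parser("models", help="list backends")
    models_parser.add_argument("--aliases", action="store_true")

    # Remaining subcommands are registered without options in this sketch.
    for name in SUBCOMMANDS:
        if name not in ("eval", "models"):
            sub.add_parser(name)
    return parser


def dispatch(argv: list[str]) -> argparse.Namespace:
    """Parse argv after applying the backward-compat rewrite."""
    return build_parser().parse_args(normalize_argv(argv))


if __name__ == "__main__":
    legacy = dispatch(["--model", "X", "--tasks", "Y"])
    print(legacy.command, legacy.model, legacy.tasks)
    print(dispatch(["models", "--aliases"]).aliases)
```

Rewriting the argument list up front keeps the parser itself purely subcommand-based, so the legacy form needs no separate code path.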

Luodian added 10 commits February 24, 2026 00:53

refactor: remove dead read_video_pyav_pil and deduplicate _resize_image in load_video

b989747

docs: rewrite Section 7.1 to document read_video backends, remove dead Section 7.2

fd7dc3a

feat(tasks): add six benchmark tasks and unified smoke report

416d529

fix(smoke): enable audio payloads for openrouter omni runs

8b9ddc4

fix(smoke): use 1fps video sampling for api smoke runs

397216d

fix(multimodal): correct audio routing and video fps sampling

f91692b

test(cli): add dispatch and task pipeline coverage

29b1902

Luodian merged commit e136fa6 into dev-v0d7 Feb 24, 2026
2 checks passed

Luodian merged 10 commits into dev-v0d7 from feat/new-benchmarks-smoke
