feat: integrate six traceable benchmarks with unified smoke test #1202

Merged
Luodian merged 10 commits into dev-v0d7 from feat/new-benchmarks-smoke on Feb 24, 2026

Luodian commented on Feb 24, 2026

Summary

  • add six benchmark tasks: repcount, countix, ovr_kinetics, ssv2, vggsound, av_asr
  • add compatibility aliases: anet_qa, mmmu_a, egosch_a
  • update smoke strategy by modality: non-audio tasks run on Gemini 3 Flash, audio tasks on an omni-capable model (both through OpenRouter)
  • run video smoke with video_fps=1 to increase temporal coverage

Smoke Tests (1 FPS)

1) Non-audio tasks with Gemini 3 Flash

export OPENAI_API_KEY="$OPENROUTER_API_KEY"
export OPENAI_API_BASE="https://openrouter.ai/api/v1"
uv run python -m lmms_eval eval \
  --model openai_compatible \
  --model_args "model_version=google/gemini-3-flash-preview,video_fps=1,max_frames_num=64,num_concurrent=1,max_retries=2,timeout=60" \
  --tasks repcount,countix,ovr_kinetics,ssv2 \
  --limit 1 \
  --output_path outputs/smoke_gemini3_flash_non_audio \
  --log_samples --process_with_media --verbosity INFO --force_simple

2) Audio tasks with OpenRouter omni-capable model

export OPENAI_API_KEY="$OPENROUTER_API_KEY"
export OPENAI_API_BASE="https://openrouter.ai/api/v1"
uv run python -m lmms_eval eval \
  --model openai_compatible \
  --model_args "model_version=google/gemini-2.5-flash,video_fps=1,max_frames_num=64,num_concurrent=1,max_retries=2,timeout=60" \
  --tasks vggsound,av_asr \
  --limit 1 \
  --output_path outputs/smoke_openrouter_omni_audio \
  --log_samples --process_with_media --verbosity INFO --force_simple
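
Both commands pass video_fps=1 and max_frames_num=64. A minimal sketch of how those two settings could interact is below: sample about one frame per second, then thin uniformly if the clip would exceed the frame cap. This is an illustration only, not the sampler lmms_eval uses; the function name and the `total_frames` and `native_fps` parameters are hypothetical.

```python
def sample_frame_indices(
    total_frames: int,
    native_fps: float,
    video_fps: float = 1.0,
    max_frames_num: int = 64,
) -> list[int]:
    """Pick frame indices at roughly `video_fps`, capped at `max_frames_num`."""
    if total_frames <= 0 or native_fps <= 0 or video_fps <= 0 or max_frames_num <= 0:
        return []

    # Number of native frames between two sampled frames (never less than 1).
    step = max(native_fps / video_fps, 1.0)

    indices = []
    i = 0
    while int(i * step) < total_frames:
        indices.append(int(i * step))
        i += 1

    if len(indices) <= max_frames_num:
        return indices
    if max_frames_num == 1:
        return indices[:1]

    # Too many samples: thin them uniformly so the whole clip stays covered.
    last = len(indices) - 1
    return [indices[round(k * last / (max_frames_num - 1))] for k in range(max_frames_num)]


# A 10-second clip at 30 fps yields 10 frames, one per second.
short_clip = sample_frame_indices(total_frames=300, native_fps=30.0)

# A 5-minute clip would yield 300 frames, so it is thinned to the 64-frame cap.
long_clip = sample_frame_indices(total_frames=9000, native_fps=30.0)
```

Under this scheme, clips longer than 64 seconds are effectively sampled below 1 FPS, so the frame cap rather than video_fps limits temporal coverage on long videos.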

Final Table (smoke)

| Benchmark | Model | Metric | Value |
|---|---|---|---|
| repcount | google/gemini-3-flash-preview | mae_norm | 0.9804 |
| repcount | google/gemini-3-flash-preview | obo | 0.0000 |
| countix | google/gemini-3-flash-preview | mae_norm | 2.2581 |
| countix | google/gemini-3-flash-preview | obo | 0.0000 |
| ovr_kinetics | google/gemini-3-flash-preview | mae | 8.0000 |
| ovr_kinetics | google/gemini-3-flash-preview | obo | 0.0000 |
| ssv2 | google/gemini-3-flash-preview | acc | 0.0000 |
| vggsound | google/gemini-2.5-flash | acc | 0.0000 |
| av_asr | google/gemini-2.5-flash | wer | 1050.0000 |
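
The metric names in the table are not defined in this PR. Assuming they follow the usual definitions (obo as off-by-one count accuracy, mae_norm as absolute error divided by the ground-truth count, wer as word-level edit distance over reference length, scaled here to percent), a minimal sketch looks like this. The function names are illustrative and are not taken from lmms_eval.

```python
def off_by_one_accuracy(preds: list[int], targets: list[int]) -> float:
    """Fraction of samples whose predicted count is within 1 of the target."""
    hits = [abs(p - t) <= 1 for p, t in zip(preds, targets)]
    return sum(hits) / len(hits)


def normalized_mae(preds: list[int], targets: list[int]) -> float:
    """Mean of |pred - target| / target; assumes every target is positive."""
    errors = [abs(p - t) / t for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance over reference length, in percent."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


obo = off_by_one_accuracy([5, 9], [4, 3])        # one of two predictions is within 1
mae_norm = normalized_mae([6, 2], [4, 4])        # (2/4 + 2/4) / 2
wer = word_error_rate("go home", "go back to your home now")  # 4 insertions over 2 words
```

Under these definitions WER has no upper bound, because inserted words are charged against the reference length, so a long hypothesis for a short reference can score far above 100. With --limit 1, each value in the table reflects a single sample.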

Luodian merged commit e136fa6 into dev-v0d7 on Feb 24, 2026
2 checks passed
Luodian added a commit that referenced this pull request on Feb 28, 2026
* refactor: remove dead read_video_pyav_pil and deduplicate _resize_image in load_video

* refactor: rename read_video_pyav -> read_video, remove dead code

- Rename read_video_pyav to read_video in load_video.py with backward-compat alias
- Delete _resize_image and read_video_pyav_base64 dead functions
- Update all 12 caller files to use read_video directly
- Inline base64 encoding logic in qwen2_5_omni.py (was read_video_pyav_base64)
- Fix missing import in vila.py (latent bug)
- Remove use_custom_video_loader dead code from 5 models that declared but never checked it (qwen2_5_vl, qwen3_vl, qwen3_omni, llava_onevision1_5, huggingface)
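
A rename with a backward-compat alias, as described in the first bullet, can be sketched as follows. The body of `read_video` here is a stand-in and its parameters are hypothetical; only the two function names come from the commit message. The warning wrapper is an optional variant that the commit does not mention.

```python
import warnings


def read_video(video_path: str, num_frames: int = 8) -> list[str]:
    """Stand-in body: a real loader would decode frames from the file."""
    return [f"{video_path}#frame{i}" for i in range(num_frames)]


# Plain alias: existing `from ... import read_video_pyav` imports keep working
# and resolve to exactly the same function object.
read_video_pyav = read_video


def read_video_pyav_with_warning(*args, **kwargs):
    """Optional variant (an assumption): forward the call but flag the old name."""
    warnings.warn(
        "read_video_pyav is deprecated; use read_video instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return read_video(*args, **kwargs)
```

The plain alias costs nothing at call time and is invisible to callers. The wrapper trades that for a visible nudge toward the new name.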

* docs: rewrite Section 7.1 to document read_video backends, remove dead Section 7.2

* feat: unified CLI with subcommand dispatch and interactive wizard

Add lmms_eval/cli/ package with subcommand-based architecture:
  eval    - run evaluation (wizard mode when no args)
  tasks   - list/groups/subtasks/tags browser
  models  - list backends with optional --aliases
  ui      - launch Web UI
  serve   - start HTTP eval server
  power   - statistical power analysis
  version - version and environment info
  tui     - terminal UI (textual)

Full backward compat: lmms-eval --model X --tasks Y still works.
Entrypoint rewired through cli.dispatch:main in pyproject.toml.
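
One way to keep the flag-only invocation working alongside subcommands is to rewrite the argument list before parsing. The sketch below uses argparse and takes the subcommand names from the list above. The helper names and the routing rule are assumptions, not the contents of lmms_eval/cli/dispatch.

```python
import argparse

SUBCOMMANDS = ("eval", "tasks", "models", "ui", "serve", "power", "version", "tui")


def normalize_argv(argv: list[str]) -> list[str]:
    """Route legacy flag-style invocations to the `eval` subcommand."""
    if argv and argv[0] in SUBCOMMANDS:
        return argv
    # Anything else (for example `--model X --tasks Y`) is treated as `eval`.
    return ["eval", *argv]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="lmms-eval")
    subparsers = parser.add_subparsers(dest="command", required=True)
    eval_parser = subparsers.add_parser("eval")
    eval_parser.add_argument("--model")
    eval_parser.add_argument("--tasks")
    for name in SUBCOMMANDS[1:]:
        subparsers.add_parser(name)
    return parser


def parse(argv: list[str]) -> argparse.Namespace:
    return build_parser().parse_args(normalize_argv(argv))


legacy = parse(["--model", "X", "--tasks", "Y"])  # old style, no subcommand
modern = parse(["eval", "--tasks", "ssv2"])       # new style
```

Both calls end up in the same `eval` parser, so the old and new spellings share one code path after dispatch.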

* docs: add external usage guide for CLI and library access

Add docs/external_usage.md covering CLI subcommands (tasks, models,
eval wizard, ui, serve, power, version) and Python library usage
(TaskManager, datasets, evaluator, metrics). Update docs index link.
Polish v0.7 release notes for consistency.

* feat(tasks): add six benchmark tasks and unified smoke report

* fix(smoke): enable audio payloads for openrouter omni runs

* fix(smoke): use 1fps video sampling for api smoke runs

* fix(multimodal): correct audio routing and video fps sampling

* test(cli): add dispatch and task pipeline coverage