Skip to content

[Feature] v0.7: Operational simplicity and pipeline maturity for users (human/agents)#1210

Merged
Luodian merged 1725 commits into
mainfrom
dev-v0d7
Feb 28, 2026
Merged

[Feature] v0.7: Operational simplicity and pipeline maturity for users (human/agents)#1210
Luodian merged 1725 commits into
mainfrom
dev-v0d7

Conversation

@Luodian
Copy link
Copy Markdown
Contributor

@Luodian Luodian commented Feb 27, 2026

Description

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • New benchmark/task
  • New model integration
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)

Changes Made

Testing

  • Tested locally with: python -m lmms_eval --model <model> --tasks <task> --limit 8
  • Ran pre-commit: pre-commit run --all-files
  • Added/updated tests (if applicable)

Additional Notes

oscarqjh and others added 30 commits November 3, 2025 17:23
* feat(mindcube): Add YAML configurations and utility functions for MindCube tasks

* refactor(mindcube): Enhance docstrings and improve code readability in utils.py

* feat(mindcube): Introduce _default_template_yaml and refactor task YAML files to include shared configurations
Change doc_to_visual function() for Karpathy test to coco_doc_to_visual_karpathy()
The previous code uses an unsafe way to remove the image key from the sample set. Yet, since the same image is used multiple times for eval the script fails, specifically when doing a distributed eval across several GPUs. Instead the new function remove just from the copy not from the original dict.
* add qwen3vl huggingface (no slang/vllm)

* fix handle batch_size > 1 with padding_side='left'

---------

Co-authored-by: ardalan.mehrani <ardalan.mehrani@bytedance.com>
* add UEval

* chore: apply pre-commit fixes

---------

Co-authored-by: LB <libo81501@gmail.com>
* add MME-SCI

* update config and path

* update utils.py
* Add SciVideoBench benchmark to lmms-eval

* style: apply black/isort formatting fixes

* fix: update scivideobench HF integration
Co-authored-by: ardalan.mehrani <ardalan.mehrani@bytedance.com>
* Add try catch for longvila

* Fix gqa doc id key error for more robustness

* Revise PR template

* Lint
* add MME-SCI

* update config and path

* update utils.py

* update utils.py

* add MME-SCI
add an OpenCompass version of MMstar with official Qwen3 prompt template
* Fix simple_parse_args_strings Function in utils.py

* Apply black

* Separate _smart_comma_split function

* Apply black
…icial results (#912)

* Update VideoMME with Qwen3VL prompt

* update Qwen3VL to better handle qwen-vl-utils params
Handle empty response lists by returning an empty string.
* Add reasoning utils

* Add charxiv reasoning version

* Add reasoning tasks for images

* Add text reasoning task

* Lint
* update mmmu with qwen3vl prompts

* update mmmu and mmmu pro with qwen3vl prompts

---------

Co-authored-by: ardalan.mehrani <ardalan.mehrani@bytedance.com>
* Fix tqdm bar for qwen25 vl when batch responding

* Fix qwen3 vl batch tqdm processing issue

* Lint
* Add bagel lmms-engine version for better api to transformers

* Allow bagel to load from chat messages
* Checkout patch for gedit

* Remove redundant mllm tools

* Revise and add gedit bench

* Fix bagel lmms-engine on multi rank

---------

Co-authored-by: KemingWu <wukemingcqu@gmail.com>
oscarqjh and others added 10 commits February 27, 2026 12:23
* docs: added dedicated changelogs folder

* style: fix linting
* refactor(models/chat): improve async_openai code structure and readability (#1102)

* refactor(models/chat): extract prepare_messages method

* refactor(models/chat): refactor async concurrency control and add docstrings

- Extract _AdaptiveConcurrencyTracker for cleaner state management
- Split generate_until's run() into focused helper methods
- Add comprehensive docstrings to all new methods
- Simplify run() from 130 lines to 8 lines
- Update async_openai_qwen3_vl.py with class docstring

* style: auto-fix lint (black + isort)

* refactor: replace async_openai_qwen3_vl class with message_format parameter

- Add message_format param to AsyncOpenAIChat (default='openai', supports 'qwen3_vl')
- Extract _build_video_kwargs() to eliminate DRY violation
- Remove separate async_openai_qwen3_vl.py and its registry entry
- Fix missing f-string prefix in tool response formatting
- Fix duplicate .gitignore entry

* refactor(models/chat): add message_format parameter to support qwen3_vl

- Add message_format parameter to AsyncOpenAIChat
- Support both 'default' and 'qwen3_vl' message formats
- Remove async_openai_qwen3_vl.py (no longer needed)
- Unregister async_openai_qwen3_vl from model registry
- Fix string formatting for tool call tags

* fix tool response tag format

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Bo Li <drluodian@gmail.com>

* docs: add MMMU eval discrepancy report and TLDR FP definitions

* fix(ci): make lint workflow fork-PR safe

* feat(tasks): add MMStar reasoning task

* refactor(tasks): merge cn and en reasoning into unified structure

- Combine cn_reasoning and en_reasoning into single reasoning directory
- Share common template yaml across both cn and en reasoning tasks
- Unified utils.py handles cn/en via DATASET_NAME environment variable
- Keep separate group files for mmbench_cn_reasoning and mmbench_en_reasoning

* refactor(tasks): unify cn and en reasoning with single group

- Remove environment variable dependency
- Add separate doc_to_text/doc_to_messages for cn and en in utils.py
- Template yaml shared, specific functions defined in task yaml
- Single mmbench_reasoning group containing both cn and en dev tasks
- Unified process_results without data_source distinction

* fix(tasks): add dataset_name to reasoning task configs

* feat(tasks): add test split for mmbench reasoning tasks

- Add mmbench_cn_test_reasoning and mmbench_en_test_reasoning
- Add test_split to dev reasoning configs
- Update mmbench_reasoning group to include all four tasks

* feat(tasks): add MME-RealWorld reasoning tasks

- Add mme_realworld_reasoning (en) and mme_realworld_cn_reasoning (cn)
- Include doc_to_messages for both languages with reasoning prompts
- Support accuracy and format scoring metrics

* feat(tasks): add SEED-Bench reasoning tasks

- Add seedbench_reasoning with doc_to_messages for reasoning format
- Add seedbench_2_plus_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics for both benchmarks

* feat(tasks): add CV-Bench reasoning tasks

- Add cv_bench_reasoning, cv_bench_2d_reasoning, cv_bench_3d_reasoning
- Include doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* fix(reasoning): improve mcq matching with normalize comparison

- Apply parse_mcq to ground_truth for consistency
- Use case-insensitive comparison for MCQ answers
- Strip whitespace for more robust matching

* feat(tasks): add OCR-Bench reasoning task

- Add ocrbench_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add ChartQA reasoning task

- Add chartqa_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add InfoVQA reasoning task

- Add infovqa_val_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add CountBenchQA reasoning task

- Add countbenchqa_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add CountBenchQA benchmark

- Add countbenchqa task config and utils
- Add countbenchqa_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add VStar-Bench reasoning tasks

- Add vstar_bench_reasoning with doc_to_messages for reasoning format
- Add vstar_bench_direct_attributes_reasoning
- Add vstar_bench_relative_position_reasoning
- Support accuracy and format scoring metrics

* feat(tasks): add PixMo-Count benchmark

- Add pixmo_count task config and utils
- Add pixmo_count_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(models): add system_prompt_file support to AsyncOpenAIChat

- Allow loading system prompt from file via system_prompt_file parameter
- Add _apply_system_prompt method to inject system prompt into messages
- Apply system prompt before generation in generate_until

* style: auto-fix lint (black + isort)

* refactor(reasoning): extract acc_score computation to separate function

Extracted accuracy reward logic from compute_score into acc_reward function
for better separation of concerns.

* Fix async oai rebase error

* Lint

* refactor(reasoning): add model-side system_prompt support and deduplicate reasoning task utils

- Add _resolve_system_prompt() and _apply_system_prompt() to base lmms class
  for model-side system prompt injection (supports file paths and literal strings)
- Add factory functions make_reasoning_doc_to_messages() and
  make_reasoning_process_results() to reasoning_utils.py, eliminating ~400 lines
  of copy-paste across 12 reasoning task modules
- Update AsyncOpenAIChat: replace system_prompt_file with system_prompt using
  base class utilities, remove duplicate _apply_system_prompt method
- Wire up HuggingFace chat model to inject system_prompt into messages during
  generation (opt-in only, default None to avoid overwriting task-level prompts)
- Fix infovqa reasoning: anls(ground_truth, results) -> anls(ground_truth, [extracted])
- Fix mmbench reasoning: cache YAML parsing with @lru_cache instead of per-sample I/O
- Fix format_reward() to also match <analysis>...</analysis> tag pattern
- Expand --reasoning_tags default to include <analysis> tags

* fix(ci): restore task_input_specs/redundancy_refactor.yaml deleted by 418bfe6

* fix: remove duplicate --reasoning_tags CLI argument

* docs: restore docs/README.md from dev-v0d7

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Bo Li <drluodian@gmail.com>
Replace hand-rolled doc_to_messages and process_results in 10
reasoning task utils with make_reasoning_doc_to_messages and
make_reasoning_process_results from _task_utils/reasoning_utils.py.

Files: ai2d, chartqa, charxiv, logicvista, mathvision,
olympiadbench_mimo, phyx, realworldqa, seedbench, seedbench_2_plus.

All YAML-referenced function names, metric keys, and scoring logic
preserved exactly.
* feat(models): add Phi4 multimodal backend

Added Phi4 multimodal model support with chat template interface.

* style: auto-fix lint (black + isort)

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Move developer workflow knowledge (smoke tests, full run recipes, debug workflow, edit-routing) from AGENTS.md into skills/lmms-eval-guide/references/workflows.md. Update SKILL.md routing matrix. Add AGENTS.md to .gitignore.
SKILL.md: add --config, --reasoning_tags, video backend env vars, MINERVA Lance vars, upgrade guide from v0.6, expanded triggers and routing matrix.

models.md: add message_format parameter, async_openai breaking changes, NanoVLM/Async HF model references, v0.6->v0.7 migration table.

tasks.md: add reasoning_tags YAML field, safety_redteam task group, v0.7 task domain listing across 9 domains.

workflows.md: add YAML config-driven evaluation workflow, reasoning tag stripping usage, video decode backend selection guide, efficiency metrics output reference.

lmms-eval-0.7.md §6: add workflows.md to skill file listing, expanded agent dispatch routing.
* refactor: remove dead read_video_pyav_pil and deduplicate _resize_image in load_video

* refactor: rename read_video_pyav -> read_video, remove dead code

- Rename read_video_pyav to read_video in load_video.py with backward-compat alias
- Delete _resize_image and read_video_pyav_base64 dead functions
- Update all 12 caller files to use read_video directly
- Inline base64 encoding logic in qwen2_5_omni.py (was read_video_pyav_base64)
- Fix missing import in vila.py (latent bug)
- Remove use_custom_video_loader dead code from 5 models that declared but never checked it (qwen2_5_vl, qwen3_vl, qwen3_omni, llava_onevision1_5, huggingface)

* docs: rewrite Section 7.1 to document read_video backends, remove dead Section 7.2

* docs: add external usage guide for CLI and library access

Add docs/external_usage.md covering CLI subcommands (tasks, models,
eval wizard, ui, serve, power, version) and Python library usage
(TaskManager, datasets, evaluator, metrics). Update docs index link.
Polish v0.7 release notes for consistency.

* feat: add MCP server for AI agent integration

Add Model Context Protocol (MCP) server that lets AI agents
programmatically discover tasks/models, run evaluations, and
inspect results. Uses direct JobScheduler (subprocess-based eval)
with two-tier lazy imports to avoid loading torch for discovery.

New files:
- lmms_eval/mcp/schemas.py: Pydantic response models
- lmms_eval/mcp/tools.py: 8 MCP tools (list_tasks, get_task_info,
  list_models, get_model_info, evaluate, get_run_status,
  get_run_result, cancel_run)
- lmms_eval/mcp/server.py: FastMCP instance + JobScheduler lifespan
- lmms_eval/cli/mcp_cmd.py: CLI subcommand handler

Modified:
- dispatch.py: add 'mcp' subcommand + banner entry
- pyproject.toml: add [mcp] optional extra, lmms-eval-mcp script
- mcp/__init__.py: simplify to import-guarded main() entry point

Usage: lmms-eval mcp [--transport stdio|sse]
   or: pip install 'lmms_eval[mcp]' && lmms-eval-mcp
@Luodian Luodian changed the title [Feature] v0.7 Update: Re-re-engineering on accelerating multimodal agentic evaluation [Feature] v0.7: Operational simplicity and pipeline maturity for users (human/agents) Feb 28, 2026
Reorder sections: new benchmarks/models first, then infrastructure
(video I/O, Lance, safety, efficiency, agent workflows), then
developer-facing changes (YAML config, reasoning tags, async_openai,
JSONL, bug fixes).

Rename 5 section titles per editorial direction:
- Lance-Backed Video Mode (drop 'for MINERVA')
- Better One-Line Evaluation Support (add 'Support')
- Support customized message_format in async_openai (was Async OpenAI: Refactored...)
- Safety and Red-Teaming Baseline (drop 'JailbreakBench')
- Efficiency Metrics Coverage (drop 'and TTFT Backend')

Update all TOC anchors, cross-references, and CHANGELOG links to
match renamed headings.
Previously sections used non-sequential numbering (1,2,3,4,9,10,11,
5,6,7,8,12) to preserve original IDs after reordering. Renumber all
sections, TOC entries, cross-references, and CHANGELOG anchor links
to sequential 1-12.
Document the new generate_until_agentic output type: iterative
tool-call loop with deterministic Python simulators, max_agentic_steps
budget, trace-level metrics (step validity, state progress, termination
quality), and two seed tasks (vending_bench2_seed, tau2_bench_telecom_seed).

Renumber old §8-12 to §9-13 and fix all subsection numbering (### N.M)
to match parent section numbers.
- vending_bench2_seed -> vending_bench2
- tau2_bench_telecom_seed -> tau2_bench_telecom

Renames YAML configs, data JSONL files, task names, READMEs,
and all references in release notes and CHANGELOG.
Luodian and others added 7 commits March 1, 2026 00:19
Test splits lack ground truth answers. Changed to submission metric
to save predictions, extracting answers from <answer> tags.
- Added infovqa_test_reasoning.yaml with submission metric
- Test split extracts answers from <answer> tags for submission
- Created infovqa_reasoning.yaml group with val + test
- Added docvqa_test_reasoning.yaml with submission metric
- Test split extracts answers from <answer> tags for submission
- Created docvqa_reasoning.yaml group with val + test
@Luodian Luodian merged commit d8fcb9c into main Feb 28, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.