[Feat] v0.5 Release Pack by Luodian · Pull Request #846 · EvolvingLMMs-Lab/lmms-eval

Luodian · 2025-10-03T06:16:17Z

Before you open a pull-request, please check if a similar issue already exists or has been closed before.

When you open a pull-request, please be sure to include the following

A descriptive title: [xxx] XXXX
A detailed description

If you encounter lint warnings, you can use the following commands to reformat the code.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Thank you for your contributions!

* add scibench task (full ) and change medqa
* run precommit

---------
Co-authored-by: pbcong <congphamba2005@gmail.com>

* add csbench
* run precommit

---------
Co-authored-by: pbcong <congphamba2005@gmail.com>

* [fix] batch size in openai compatible endpoint (#835)
* more
* more
* more
* more
* more
* more
* more
* more
* more
* more
* more
* more
* more
* more
* [Feature] Add WenetSpeech Dataset
* add lmms-eval-0.5 doc's 1st draft
* remove unneccessary parts in lmms-eval-0.5.md

---------
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>

…modal Expansion**, detailing significant new features including: * A comprehensive **audio evaluation suite** (Step2 Audio Paralinguistic, VoiceBench, WenetSpeech). * A production-ready **response caching system**. * Integration of **five new models** (e.g., GPT-4o Audio Preview, Gemma-3). * Addition of **numerous new benchmarks** across vision, coding, and STEM domains. * Support for the **Model Context Protocol (MCP)** and improvements to **Async OpenAI integration**.

…ltimodal Expansion** release, updating the `README.md` and refining the `v0.5` release notes with improved structure and reproducibility validation for new benchmarks.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

chatgpt-codex-connector · 2025-10-03T06:19:18Z

+    if unit_prob != unit:
+        pred = cal_not(parse_not(pred))
+        ans = cal_not((ans, unit))
+        if len(ans) > 1:
+            ans = ans[0]


Preserve full numeric answers during scientific notation handling

In `scibench_process_results`, when a scientific-notation unit is detected, the code calls `cal_not` and then immediately does `if len(ans) > 1: ans = ans[0]`. Because `ans` is a string after conversion, `len(ans) > 1` is almost always true and the value is truncated to a single character (e.g., `"1234.0"` becomes `"1"`). This causes nearly every converted ground-truth answer to be incorrect and evaluation to fail even when predictions are accurate. The check should distinguish between tuple vs string instead of slicing the string to its first character.
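A minimal sketch of the truncation described in this comment, and of one type-aware alternative. It does not call the real `cal_not`; it only assumes, as the comment states, that the converted answer can arrive as a string. The helper name `take_first_if_sequence` is hypothetical and not part of the PR.

```python
def take_first_if_sequence(ans):
    """Return the first element only when ans is a tuple or list.

    Hypothetical helper for illustration: a string is returned whole
    instead of being sliced down to its first character.
    """
    if isinstance(ans, (tuple, list)) and len(ans) > 1:
        return ans[0]
    return ans


# What the reviewed length check does to a string result (assumed input).
converted = "1234.0"
buggy = converted[0] if len(converted) > 1 else converted

# The type-aware check leaves strings intact and still unpacks tuples.
fixed_from_str = take_first_if_sequence("1234.0")
fixed_from_tuple = take_first_if_sequence(("1234.0", "m"))

print(buggy, fixed_from_str, fixed_from_tuple)
```

Checking the type rather than the length keeps both call shapes working, which matters if the conversion sometimes returns a `(value, unit)` pair and sometimes a bare string.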


…Eval v0.5 release notes, changing '†' to '+-'.

Updated metrics and model integration details in the documentation.

Corrected the name of the model 'GPT-4o Audio' to 'GPT-4o Audio Preview' in the announcement section.

* add scibench task (full) and change medqa (#840)
  * add scibench task (full ) and change medqa
  * run precommit

  ---------
  Co-authored-by: pbcong <congphamba2005@gmail.com>
* add csbench (#841)
  * add csbench
  * run precommit

  ---------
  Co-authored-by: pbcong <congphamba2005@gmail.com>
* fix linting (#842)
* [Feature] Add WenetSpeech Dataset (#837)
  * [fix] batch size in openai compatible endpoint (#835)
  * more
  * more
  * more
  * more
  * more
  * more
  * more
  * more
  * more
  * more
  * more
  * more
  * more
  * more
  * [Feature] Add WenetSpeech Dataset
  * add lmms-eval-0.5 doc's 1st draft
  * remove unneccessary parts in lmms-eval-0.5.md

  ---------
  Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
* This commit documents the official release of **LMMS-Eval v0.5: Multimodal Expansion**, detailing significant new features including:
  * A comprehensive **audio evaluation suite** (Step2 Audio Paralinguistic, VoiceBench, WenetSpeech).
  * A production-ready **response caching system**.
  * Integration of **five new models** (e.g., GPT-4o Audio Preview, Gemma-3).
  * Addition of **numerous new benchmarks** across vision, coding, and STEM domains.
  * Support for the **Model Context Protocol (MCP)** and improvements to **Async OpenAI integration**.
* This commit formally announces and documents the **LMMS-Eval v0.5: Multimodal Expansion** release, updating the `README.md` and refining the `v0.5` release notes with improved structure and reproducibility validation for new benchmarks.
* Updates the status legend for reproducibility validation in the LMMS-Eval v0.5 release notes, changing '†' to '+-'.
* Revise metrics and model integration in lmms-eval doc

  Updated metrics and model integration details in the documentation.
* Fix model name in LMMs-Eval v0.5 announcement

  Corrected the name of the model 'GPT-4o Audio' to 'GPT-4o Audio Preview' in the announcement section.

---------
Co-authored-by: Do Duc Anh (Erwin) <104162175+KelvinDo183@users.noreply.github.com>
Co-authored-by: pbcong <congphamba2005@gmail.com>
Co-authored-by: Cong <101887866+pbcong@users.noreply.github.com>
Co-authored-by: JAM_Yichen <110095482+YichenG170@users.noreply.github.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>


KelvinDo183 and others added 6 commits October 3, 2025 11:16

add scibench task (full) and change medqa (#840)

5f5a82c

* add scibench task (full ) and change medqa
* run precommit

---------
Co-authored-by: pbcong <congphamba2005@gmail.com>

add csbench (#841)

48e8b59

* add csbench
* run precommit

---------
Co-authored-by: pbcong <congphamba2005@gmail.com>

fix linting (#842)

f07f7a5

This commit formally announces and documents the **LMMS-Eval v0.5: Mu…

1f5f9e2

…ltimodal Expansion** release, updating the `README.md` and refining the `v0.5` release notes with improved structure and reproducibility validation for new benchmarks.

Luodian requested review from KelvinDo183 and kcz358 October 3, 2025 06:16

Luodian assigned KelvinDo183 Oct 3, 2025

chatgpt-codex-connector Bot reviewed Oct 3, 2025

Updates the status legend for reproducibility validation in the LMMS-…

f1ef460

…Eval v0.5 release notes, changing '†' to '+-'.

kcz358 approved these changes Oct 6, 2025

YichenG170 added 2 commits October 6, 2025 13:45

Revise metrics and model integration in lmms-eval doc

0b79944

Updated metrics and model integration details in the documentation.

Fix model name in LMMs-Eval v0.5 announcement

8488f78

Corrected the name of the model 'GPT-4o Audio' to 'GPT-4o Audio Preview' in the announcement section.

Luodian merged commit 8f142bc into main Oct 7, 2025
2 checks passed


Luodian merged 9 commits into main from dev/v0d5


5 participants
