[Task] Report sub category score for 3DSRBench and Viewspatial #1285

Merged
kcz358 merged 7 commits into EvolvingLMMs-Lab:main from oscarqjh:sub-metrics-update
Apr 7, 2026

oscarqjh commented Apr 6, 2026

Updated the 3DSRBench and Viewspatial benchmarks to report sub-category metric scores.

test run (3DSRBench):
image

test run (Viewspatial):
image
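
For readers without the screenshots, the idea of reporting sub-category scores alongside the overall score can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `category` and `correct` field names and the `aggregate_by_category` helper are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_by_category(results):
    """Return accuracy per sub-category plus an overall accuracy.

    `results` is a list of dicts with the hypothetical keys
    'category' (str) and 'correct' (bool).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["correct"])

    # One score per sub-category.
    scores = {cat: hits[cat] / totals[cat] for cat in totals}
    # Overall accuracy over all samples; 0.0 when there are no samples.
    scores["overall"] = sum(hits.values()) / sum(totals.values()) if totals else 0.0
    return scores
```

Each key of the returned dict would then be reported as its own metric row.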

oscarqjh commented Apr 6, 2026

@PeterWangyi @kcz358

oscarqjh commented Apr 6, 2026

Added sub-category metrics for Embspatial as well:
image

oscarqjh commented Apr 7, 2026

Also added vsibench_debiased by frames:
image

oscarqjh commented Apr 7, 2026

Added sub-category metrics for Sparbench as well:
image

kcz358 merged commit 20e1c96 into EvolvingLMMs-Lab:main on Apr 7, 2026
3 checks passed
kcz358 mentioned this pull request on Apr 9, 2026
11 tasks
mwxely added a commit that referenced this pull request Apr 9, 2026
Per review feedback (ref: PR #1285 pattern):
- Report relevance/logic group-type scores separately
- Report per-level (1/2/3) scores separately
- Refactor aggregate logic into _compute_all_subscores helper
- process_results returns same entry under all 6 metric keys
- Detailed second_head/third_head breakdowns still logged
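
The pattern described in the list above (one entry returned under every metric key, with a shared helper doing the aggregation) might look roughly like the sketch below. Only the names `process_results` and `_compute_all_subscores` come from the commit message; the metric key names, the `group_type`/`level`/`answer` fields, the exact-match scoring, and `make_aggregator` are assumptions for illustration.

```python
# Hypothetical metric names; the commit message only says there are 6 keys.
METRIC_KEYS = ["overall", "relevance", "logic", "level_1", "level_2", "level_3"]


def process_results(doc, results):
    """Build one entry per sample and return it under every metric key."""
    entry = {
        "group_type": doc["group_type"],  # assumed field: 'relevance' or 'logic'
        "level": doc["level"],            # assumed field: 1, 2 or 3
        "score": float(results[0].strip() == doc["answer"]),  # placeholder scoring
    }
    return {key: entry for key in METRIC_KEYS}


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def _compute_all_subscores(entries):
    """Compute every sub-score in one pass over the collected entries."""
    sub = {"overall": _mean([e["score"] for e in entries])}
    for group_type in ("relevance", "logic"):
        sub[group_type] = _mean(
            [e["score"] for e in entries if e["group_type"] == group_type]
        )
    for level in (1, 2, 3):
        sub[f"level_{level}"] = _mean(
            [e["score"] for e in entries if e["level"] == level]
        )
    return sub


def make_aggregator(key):
    """Return an aggregation function that reports a single sub-score."""
    def aggregate(entries):
        return _compute_all_subscores(entries)[key]
    return aggregate
```

Because every metric key receives the same list of entries, each aggregator can filter for its own slice without `process_results` needing to know which sub-scores exist.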
kcz358 pushed a commit that referenced this pull request Apr 9, 2026
* feat(videomme_v2): add task config and default template

- Dataset: MME-Benchmarks/Video-MME-v2 (800 videos, 3200 questions)
- 8-option MCQ (A-H) with grouped non-linear scoring
- Generation config: max_new_tokens=64, temperature=0

* feat(videomme_v2): add scoring, prompts, and evaluation logic

- Grouped non-linear scoring: relevance (quadratic) + logic (chain-based)
- 3 group structures: [1,2,3,4], [1,[2,3],4], [[1,2],3,4]
- Answer extraction with 11 prefix patterns (A-H range)
- Per-level, per-category, per-group-type breakdown reporting
- Prompt aligned with official INSTRUCT_PROMPT
- Verified against VLMEvalKit implementation
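
As a rough illustration of two of the bullets above, the sketch below shows prefix-based answer extraction restricted to A-H and one plausible reading of "quadratic" relevance scoring. The prefixes listed are hypothetical (the commit mentions 11 patterns but does not list them), and the squared-fraction formula is an assumption, not the verified implementation.

```python
import re

# Hypothetical prefixes; the actual 11 patterns are not listed in this PR page.
_PREFIXES = ["The answer is", "Answer:", "The correct option is", "Option"]


def extract_letter(response):
    """Return the first standalone option letter A-H, or None if absent."""
    text = response.strip()
    for prefix in _PREFIXES:
        if text.lower().startswith(prefix.lower()):
            text = text[len(prefix):].strip()
            break
    match = re.search(r"\b([A-H])\b", text)
    return match.group(1) if match else None


def relevance_score(num_correct, group_size=4):
    """Assumed quadratic scoring: squared fraction of correct answers in a group."""
    return (num_correct / group_size) ** 2
```

Under this assumption, a group with two of four questions correct scores 0.25 rather than 0.5, so partial credit grows non-linearly.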

* feat(videomme_v2): add subtitle variant (concatenated mode)

- Load word-level JSONL subtitles and prepend to prompt
- Graceful fallback when subtitle file is missing
- Task: videomme_v2_w_subtitle
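
The subtitle handling described above could be sketched as follows. This is an illustration under assumptions: the `word` field name, the `Subtitles:` prompt prefix, and both helper names are hypothetical and not taken from the PR.

```python
import json
from pathlib import Path


def load_subtitle_text(path):
    """Read a word-level JSONL file and join the words; return '' if missing."""
    p = Path(path)
    if not p.is_file():
        # Graceful fallback: no subtitle file means no subtitle text.
        return ""
    words = []
    with p.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            word = record.get("word")  # hypothetical field name
            if word:
                words.append(word)
    return " ".join(words)


def build_prompt(question, subtitle_path):
    """Prepend subtitles to the question when they are available."""
    subtitles = load_subtitle_text(subtitle_path)
    if not subtitles:
        return question
    return f"Subtitles: {subtitles}\n{question}"
```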

* feat(videomme_v2): add reasoning mode variant

- Chain-of-thought prompt requiring Final Answer: <letter> format
- max_new_tokens=4096 for reasoning space
- Task: videomme_v2_reasoning
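
Parsing a response in the `Final Answer: <letter>` format might be done along these lines. The regular expression and the choice to take the last occurrence are assumptions for illustration, not the extraction logic from the PR.

```python
import re

# Case-insensitive marker, but the option letter itself must be uppercase A-H.
_FINAL_ANSWER = re.compile(r"(?i:final answer):\s*\(?([A-H])\b")


def extract_final_answer(response):
    """Return the letter from the last 'Final Answer:' marker, or None."""
    matches = _FINAL_ANSWER.findall(response)
    return matches[-1] if matches else None
```

Taking the last match means a model that revises its answer mid-reasoning is scored on its final statement.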

* fix(videomme_v2): report sub-category scores as separate metrics

Per review feedback (ref: PR #1285 pattern):
- Report relevance/logic group-type scores separately
- Report per-level (1/2/3) scores separately
- Refactor aggregate logic into _compute_all_subscores helper
- process_results returns same entry under all 6 metric keys
- Detailed second_head/third_head breakdowns still logged

---------

Co-authored-by: mwxely <mwxely@users.noreply.github.com>

3 participants