feat: add JumpScore evaluation task by mathCrazyy · Pull Request #1329 · EvolvingLMMs-Lab/lmms-eval

mathCrazyy · 2026-05-11T07:18:54Z

Summary

Adds JumpScore, a new video understanding benchmark that evaluates a model's ability to temporally localize jump rope events in video. The task uses a multi-turn conversation format: the model first answers a jump count question, then predicts the start timestamps of each jump. Evaluation is based on mean Average Precision (mAP) computed over multiple time tolerances (0.1 s, 0.2 s, 0.3 s).
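As a rough illustration of the metric described above, the sketch below computes AP for one video at a given tolerance and averages it over the three tolerances. It is not the code from this PR: the function names, the greedy nearest-match rule, and the choice to treat the order of the predictions as their ranking are all assumptions made for the example.

```python
from typing import Sequence

TOLERANCES = (0.1, 0.2, 0.3)  # seconds, as listed in the PR summary


def average_precision(pred: Sequence[float], gt: Sequence[float], tol: float) -> float:
    """AP for one video (hypothetical helper).

    Assumption: predictions carry no confidence score, so their order
    in `pred` is used as the ranking.
    """
    matched = set()  # indices of ground-truth jumps already claimed
    hits = 0
    precision_sum = 0.0
    for rank, p in enumerate(pred, start=1):
        # Unmatched ground-truth starts within the tolerance, closest first.
        candidates = [
            (abs(p - g), i)
            for i, g in enumerate(gt)
            if i not in matched and abs(p - g) <= tol
        ]
        if candidates:
            _, best = min(candidates)
            matched.add(best)
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / len(gt) if gt else 0.0


def mean_ap(pred: Sequence[float], gt: Sequence[float],
            tolerances: Sequence[float] = TOLERANCES) -> float:
    """Mean of the per-tolerance AP values."""
    return sum(average_precision(pred, gt, t) for t in tolerances) / len(tolerances)
```

With ground truth `[1.0, 2.0, 3.0]` and predictions `[1.05, 2.25, 3.5]`, only the first prediction matches at 0.1 s and 0.2 s, while the second also matches at 0.3 s, so the score rises as the tolerance loosens.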
This PR also includes two bug fixes that were discovered during integration:

- mmmu/utils.py initialized the OpenAI judge server at module import time, causing CI failures when OPENAI_API_KEY is not set.
- The CI workflow task-input-ab.yml had no fallback when the BASE snapshot fails due to pre-existing import-time errors in the base revision, blocking all PRs.

In scope

- New task: lmms_eval/tasks/jump_rope/ (jumpscore.yaml + utils.py)
  - Multi-turn video QA evaluation (jump count → timestamp prediction)
  - mAP metric with tolerances [0.1, 0.2, 0.3] seconds
  - Lazy HF dataset download via _get_cache_dir() (on first use only)
- Bug fix (mmmu/utils.py): OpenAI judge server initialization moved to _get_judge_server() (lazy, on first call)
- CI fix (.github/workflows/task-input-ab.yml): BASE snapshot step now uses continue-on-error: true; compare step gracefully skips diff when base.json is absent

Out of scope

- No changes to existing task logic, metrics, or prompts
- No changes to the core lmms_eval framework (evaluator.py, api/, etc.)
- No model or dataset changes
- The mmmu fix does not change scoring behavior; it only defers initialization

Validation

- Task YAML loads without error in a standard lmms-eval environment
- jumpscore_process_results and jumpscore_aggregate_results verified locally against synthetic predictions
- mmmu/utils.py imports successfully without OPENAI_API_KEY set; server is initialized correctly when evaluation is actually run
- CI task-input-ab HEAD snapshot passes; BASE snapshot failure is surfaced as a warning rather than a hard failure

Risk / Compatibility

Low risk. The new jump_rope task is entirely additive and does not touch any shared utility.
The mmmu fix is behavior-neutral: evaluation results are identical; only the timing of server initialization changes.
The CI workflow change is conservative: it never suppresses a real regression (mismatch between BASE and HEAD snapshots); it only skips comparison when BASE cannot be captured at all.
No existing tests are affected.

Type of Change

New feature (new evaluation task)
Bug fix (mmmu import-time crash without OPENAI_API_KEY)
CI / tooling fix (task-input-ab workflow resilience)
Breaking change
Documentation update

…dule import The judge server was initialized at module import time, causing OpenAI API errors in CI environments where OPENAI_API_KEY is not set. Now the server is created on first use via _get_judge_server() instead.

…or on module import" This reverts commit 18dd0c3.

…wnload snapshot_download was called at module level, causing CI to fail when loading task configs without HF credentials. Moved to _get_cache_dir() which is called on first actual use, following the same pattern as other tasks (e.g. vbvr/utils.py).

…dule import The judge server was initialized at module level, causing an OpenAIError in CI environments where OPENAI_API_KEY is not set. Replaced the top-level initialization with _get_judge_server(), which creates the server on first actual use, consistent with how jump_rope/utils.py handles its HF download.

The BASE worktree may contain pre-existing import-time errors (e.g. module-level OpenAI client init requiring OPENAI_API_KEY, or network calls at import time). These cause the BASE capture step to fail, blocking all PRs even when the PR itself introduces no regression. Changes: - Add continue-on-error: true to 'Capture BASE snapshot' step - Update 'Compare snapshots' to skip diff when base.json is absent, printing a clear warning instead of failing the workflow
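The compare-step behavior described in this commit message can be sketched in Python as below. This is an assumption-laden illustration: the real step lives in the workflow file and may well be shell, and the function name, the file names other than base.json, and the exact messages are invented here. The `::warning::` prefix is the standard GitHub Actions workflow command for surfacing a warning annotation.

```python
import json
from pathlib import Path


def compare_snapshots(base_path: Path, head_path: Path) -> int:
    """Return an exit code: 0 when snapshots match or BASE is absent, 1 on mismatch."""
    if not base_path.exists():
        # BASE capture failed (the step ran with continue-on-error: true),
        # so there is nothing to diff against; warn instead of failing.
        print(f"::warning::{base_path} is missing, skipping snapshot comparison")
        return 0
    base = json.loads(base_path.read_text())
    head = json.loads(head_path.read_text())
    if base != head:
        print("Snapshot mismatch between BASE and HEAD")
        return 1
    print("Snapshots match")
    return 0
```

A real mismatch still fails the job; only the case where BASE could not be captured at all is downgraded to a warning.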

…or on module import" This reverts commit 917a3ed.

kcz358 · 2026-05-11T10:37:29Z

+_JUMPSCORE_CACHE_DIR: Optional[str] = None
+
+
+def _get_cache_dir() -> str:
+    """Return the local HF snapshot directory, downloading on first call."""
+    global _JUMPSCORE_CACHE_DIR
+    if _JUMPSCORE_CACHE_DIR is None:
+        _JUMPSCORE_CACHE_DIR = snapshot_download(
+            repo_id=_load_dataset_path(),
+            repo_type="dataset",
+            local_dir_use_symlinks=False,
+        )
+    return _JUMPSCORE_CACHE_DIR


It would be better to follow the other video dataset configs and put this into the YAML.

kcz358 · 2026-05-11T10:39:00Z

+    for tolerance in sorted(ap_per_tolerance_combined.keys()):
+        ap_list = ap_per_tolerance_combined[tolerance]
+        mean_ap = sum(ap_list) / len(ap_list) if ap_list else 0.0
+        eval_logger.info(f"[JumpScore] AP@{tolerance}s: {mean_ap:.4f}")
+
+    eval_logger.info(f"[JumpScore] mAP: {mean_map:.4f}")


You can add mAP to the metric list so that the result is logged in results.json.
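One way to act on this suggestion is to have the aggregation function return the mAP value and not just log it, assuming the harness records whatever the aggregation function of a listed metric returns. The sketch below is not the PR's `jumpscore_aggregate_results`: the function name and the per-sample input shape (one dict per sample mapping tolerance to AP) are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def aggregate_map(results: List[Dict[float, float]]) -> float:
    """Average AP per tolerance across samples, then average across tolerances.

    `results` is assumed to hold one dict per sample, mapping each
    tolerance in seconds to that sample's AP at that tolerance.
    """
    ap_per_tolerance = defaultdict(list)
    for sample in results:
        for tolerance, ap in sample.items():
            ap_per_tolerance[tolerance].append(ap)
    per_tolerance_means = [mean(aps) for _, aps in sorted(ap_per_tolerance.items())]
    # Returning the value (not only logging it) lets the caller store it.
    return mean(per_tolerance_means) if per_tolerance_means else 0.0
```

The per-tolerance log lines from the diff above could stay as they are; the only change this sketch illustrates is that the final number is handed back to the caller.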


mathCrazyy added 6 commits May 11, 2026 15:09

feat: add jump rope evaluation task

455d699

Revert "fix(mmmu): lazy-load judge server to avoid OpenAI API key err…

e4c6438

…or on module import" This reverts commit 18dd0c3.

mathCrazyy changed the title from "feat: add jump rope evaluation task" to "feat: add JumpScore evaluation task" on May 11, 2026

mathCrazyy added 2 commits May 11, 2026 16:27

refactor(jump_rope): rename task directory from jump_rope to jumpscore

1f26f50

Revert "fix(mmmu): lazy-load judge server to avoid OpenAI API key err…

191ff52

…or on module import" This reverts commit 917a3ed.

kcz358 reviewed May 11, 2026


mathCrazyy added 3 commits May 11, 2026 18:41

Revert "ci(task-input-ab): gracefully skip comparison when BASE snaps…

4ecc683

…hot fails" This reverts commit 86f7f9a.

fix(jumpscore): configure video cache in yaml

ac2becf

fix(jumpscore): expose map metric

c8ccfc5

kcz358 approved these changes May 11, 2026


kcz358 merged commit 4510f3e into EvolvingLMMs-Lab:main May 11, 2026
2 of 3 checks passed

kcz358 merged 11 commits into EvolvingLMMs-Lab:main from mathCrazyy:main
