feat: add JumpScore evaluation task #1329

Merged
kcz358 merged 11 commits into EvolvingLMMs-Lab:main from mathCrazyy:main
May 11, 2026

Contributor

mathCrazyy commented May 11, 2026

Summary

Adds JumpScore, a new video understanding benchmark that evaluates a model's ability to temporally localize jump rope events in video. The task uses a multi-turn conversation format: the model first answers a jump count question, then predicts the start timestamps of each jump. Evaluation is based on mean Average Precision (mAP) computed over multiple time tolerances (0.1 s, 0.2 s, 0.3 s).
This PR also includes two bug fixes that were discovered during integration:

  • mmmu/utils.py initialized the OpenAI judge server at module import time, causing CI failures when OPENAI_API_KEY is not set.
  • The CI workflow task-input-ab.yml had no fallback for a BASE snapshot that fails because of pre-existing import-time errors in the base revision, so one such error blocked all PRs.

In scope

  • New task: lmms_eval/tasks/jump_rope/ (jumpscore.yaml + utils.py)
    • Multi-turn video QA evaluation (jump count → timestamp prediction)
    • mAP metric with tolerances [0.1, 0.2, 0.3] seconds
    • Lazy HF dataset download via _get_cache_dir() (on first use only)
  • Bug fix: mmmu/utils.py — OpenAI judge server initialization moved to _get_judge_server() (lazy, on first call)
  • CI fix: .github/workflows/task-input-ab.yml — BASE snapshot step now uses continue-on-error: true; compare step gracefully skips diff when base.json is absent
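
As a rough illustration of the mAP metric listed above, the sketch below scores predicted jump timestamps against ground truth at each tolerance. It is not the code from utils.py: the function names, the greedy nearest-match rule, and the use of prediction order as the ranking are all assumptions made for this example.

```python
from typing import Sequence


def average_precision_at_tolerance(
    pred: Sequence[float], gt: Sequence[float], tol: float
) -> float:
    """AP for one video (hypothetical helper, not from the PR).

    A prediction counts as a hit if it lies within `tol` seconds of a
    ground-truth timestamp that has not been matched yet.
    """
    if not gt:
        return 0.0
    matched = [False] * len(gt)
    hits = 0
    precision_sum = 0.0
    # Assumption: predictions are ranked in the order they are given.
    for rank, t in enumerate(pred, start=1):
        best_idx, best_dist = None, None
        for i, g in enumerate(gt):
            dist = abs(t - g)
            if matched[i] or dist > tol:
                continue
            if best_dist is None or dist < best_dist:
                best_idx, best_dist = i, dist
        if best_idx is not None:
            matched[best_idx] = True
            hits += 1
            precision_sum += hits / rank  # precision at this hit
    return precision_sum / len(gt)


def mean_average_precision(
    pred: Sequence[float],
    gt: Sequence[float],
    tolerances: Sequence[float] = (0.1, 0.2, 0.3),
) -> float:
    """Average the per-tolerance AP values into a single mAP."""
    aps = [average_precision_at_tolerance(pred, gt, tol) for tol in tolerances]
    return sum(aps) / len(aps)
```

With predictions [1.0, 2.25, 3.0] against ground truth [1.0, 2.0, 3.0], the middle prediction is 0.25 s off, so it only counts as a hit at the 0.3 s tolerance.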

Out of scope

  • No changes to existing task logic, metrics, or prompts
  • No changes to the core lmms_eval framework (evaluator.py, api/, etc.)
  • No model or dataset changes
  • The mmmu fix does not change scoring behavior — only defers initialization

Validation

  • Task YAML loads without error in a standard lmms-eval environment
  • jumpscore_process_results and jumpscore_aggregate_results verified locally against synthetic predictions
  • mmmu/utils.py imports successfully without OPENAI_API_KEY set; server is initialized correctly when evaluation is actually run
  • CI task-input-ab HEAD snapshot passes; BASE snapshot failure is surfaced as a warning rather than a hard failure

Risk / Compatibility

  • Low risk. The new jump_rope task is entirely additive and does not touch any shared utility.
  • The mmmu fix is behavior-neutral: evaluation results are identical; only the timing of server initialization changes.
  • The CI workflow change is conservative: it never suppresses a real regression (mismatch between BASE and HEAD snapshots); it only skips comparison when BASE cannot be captured at all.
  • No existing tests are affected.

Type of Change

  • New feature (new evaluation task)
  • Bug fix (mmmu import-time crash without OPENAI_API_KEY)
  • CI / tooling fix (task-input-ab workflow resilience)
  • Breaking change
  • Documentation update

…dule import

The judge server was initialized at module import time, causing
OpenAI API errors in CI environments where OPENAI_API_KEY is not set.
Now the server is created on first use via _get_judge_server() instead.

…wnload

snapshot_download was called at module level, causing CI to fail when
loading task configs without HF credentials. Moved to _get_cache_dir()
which is called on first actual use, following the same pattern as
other tasks (e.g. vbvr/utils.py).

…dule import

The judge server was initialized at module level, causing an OpenAIError
in CI environments where OPENAI_API_KEY is not set. Replaced the top-level
initialization with _get_judge_server(), which creates the server on first
actual use, consistent with how jump_rope/utils.py handles its HF download.

The BASE worktree may contain pre-existing import-time errors (e.g.
module-level OpenAI client init requiring OPENAI_API_KEY, or network
calls at import time). These cause the BASE capture step to fail, blocking
all PRs even when the PR itself introduces no regression.

Changes:
- Add continue-on-error: true to 'Capture BASE snapshot' step
- Update 'Compare snapshots' to skip diff when base.json is absent,
  printing a clear warning instead of failing the workflow
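
The compare behavior described in these changes can be sketched in Python. The real logic lives in the workflow steps, which are not shown in this conversation; the function name, return values, and file layout below are assumptions made for the example.

```python
import json
from pathlib import Path


def compare_snapshots(base_path: Path, head_path: Path) -> str:
    """Compare two JSON snapshots (hypothetical helper, not from the PR).

    Returns "skipped" when the BASE snapshot was never captured,
    "match" when both snapshots are equal, and "mismatch" otherwise.
    """
    if not base_path.exists():
        # BASE capture failed earlier, so there is nothing to diff.
        print(f"WARNING: {base_path} is missing; skipping comparison")
        return "skipped"
    base = json.loads(base_path.read_text())
    head = json.loads(head_path.read_text())
    # A real difference between BASE and HEAD is still reported.
    return "match" if base == head else "mismatch"
```

A caller would fail the job only on "mismatch", so a missing base.json produces a warning while a genuine regression still blocks the PR.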
mathCrazyy changed the title from "feat: add jump rope evaluation task" to "feat: add JumpScore evaluation task" May 11, 2026
Comment thread lmms_eval/tasks/jumpscore/utils.py Outdated
Comment on lines +22 to +34
_JUMPSCORE_CACHE_DIR: Optional[str] = None


def _get_cache_dir() -> str:
    """Return the local HF snapshot directory, downloading on first call."""
    global _JUMPSCORE_CACHE_DIR
    if _JUMPSCORE_CACHE_DIR is None:
        _JUMPSCORE_CACHE_DIR = snapshot_download(
            repo_id=_load_dataset_path(),
            repo_type="dataset",
            local_dir_use_symlinks=False,
        )
    return _JUMPSCORE_CACHE_DIR
Collaborator

It would be better to follow the other video dataset configs and put this into the YAML.

Comment on lines +347 to +352
for tolerance in sorted(ap_per_tolerance_combined.keys()):
    ap_list = ap_per_tolerance_combined[tolerance]
    mean_ap = sum(ap_list) / len(ap_list) if ap_list else 0.0
    eval_logger.info(f"[JumpScore] AP@{tolerance}s: {mean_ap:.4f}")

eval_logger.info(f"[JumpScore] mAP: {mean_map:.4f}")
Collaborator

You could put the mAP into the metric list so that the results are logged to results.json.
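
One way to act on this suggestion is to return the values instead of only logging them. The sketch below is an assumption about how that could look, not the change that was merged: the function name and the metric key format are hypothetical, and how the returned values get registered in the task's metric list is not shown.

```python
from typing import Dict, List


def summarize_ap(ap_per_tolerance: Dict[float, List[float]]) -> Dict[str, float]:
    """Return per-tolerance mean AP plus the overall mAP (hypothetical helper)."""
    summary: Dict[str, float] = {}
    for tolerance in sorted(ap_per_tolerance):
        ap_list = ap_per_tolerance[tolerance]
        summary[f"AP@{tolerance}s"] = sum(ap_list) / len(ap_list) if ap_list else 0.0
    # mAP is the mean over tolerances, computed before it is added to the dict.
    mean_map = sum(summary.values()) / len(summary) if summary else 0.0
    summary["mAP"] = mean_map
    return summary
```

Each key in the returned dictionary could then be exposed as its own metric so that it appears alongside the other scores in the results file.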

@kcz358 kcz358 merged commit 4510f3e into EvolvingLMMs-Lab:main May 11, 2026
2 of 3 checks passed