Skip to content

fix(jumpscore): align message format and video lookup#1330

Merged
kcz358 merged 16 commits into
EvolvingLMMs-Lab:mainfrom
mathCrazyy:main
May 15, 2026
Merged

fix(jumpscore): align message format and video lookup#1330
kcz358 merged 16 commits into
EvolvingLMMs-Lab:mainfrom
mathCrazyy:main

Conversation

@mathCrazyy
Copy link
Copy Markdown
Contributor

@mathCrazyy mathCrazyy commented May 12, 2026

Summary

  • Remove legacy count QA context from doc_to_messages so evaluation input matches the intended timestamp-only prompt.

In scope

  • Update lmms_eval/tasks/jumpscore/utils.py.
  • Change JumpScore chat message construction from multi-turn count QA context to a single-turn timestamp query.

Out of scope

  • No model implementation changes.
  • No metric or scoring logic changes.
  • No dataset content changes.
  • No changes to prompts outside JumpScore.
  • No changes to other tasks.

Validation

  • Confirmed the committed diff only touches lmms_eval/tasks/jumpscore/utils.py.
  • Verified the message construction now produces one user turn containing the video and timestamp question.
  • Verified the video lookup includes existing cache paths plus HF snapshot fallback paths.
  • Pushed commit cf7f49f to origin/main.

Risk / Compatibility

  • Low risk for model code because this only changes JumpScore task input construction.
  • Expected behavior change for JumpScore evaluation prompts: legacy count QA history is no longer included.
  • Compatible with existing cache layouts; adds support for HF snapshot cache layout as a fallback.
  • Results may differ from prior JumpScore runs because the evaluation input format is now aligned to the single-turn protocol.

Type of Change

  • Bug fix (non-breaking change)
  • Evaluation/task configuration alignment

mathCrazyy added 13 commits May 11, 2026 15:09
…dule import

The judge server was initialized at module import time, causing
OpenAI API errors in CI environments where OPENAI_API_KEY is not set.
Now the server is created on first use via _get_judge_server() instead.
…wnload

snapshot_download was called at module level, causing CI to fail when
loading task configs without HF credentials. Moved to _get_cache_dir()
which is called on first actual use, following the same pattern as
other tasks (e.g. vbvr/utils.py).
…dule import

The judge server was initialized at module level, causing an OpenAIError
in CI environments where OPENAI_API_KEY is not set. Replaced the top-level
initialization with _get_judge_server(), which creates the server on first
actual use, consistent with how jump_rope/utils.py handles its HF download.
The BASE worktree may contain pre-existing import-time errors (e.g.
module-level OpenAI client init requiring OPENAI_API_KEY, or network
calls at import time). These cause the BASE capture step to fail, blocking
all PRs even when the PR itself introduces no regression.

Changes:
- Add continue-on-error: true to 'Capture BASE snapshot' step
- Update 'Compare snapshots' to skip diff when base.json is absent,
  printing a clear warning instead of failing the workflow
Comment thread lmms_eval/tasks/jumpscore/utils.py Outdated
@kcz358
Copy link
Copy Markdown
Collaborator

kcz358 commented May 14, 2026

JumpScore does not zip the data yet. Will it be zipped later? If this is the case, I will merge this PR. Thanks!

@mathCrazyy
Copy link
Copy Markdown
Contributor Author

JumpScore does not zip the data yet. Will it be zipped later? If this is the case, I will merge this PR. Thanks!

Thank you for the review!
The data has been updated to zip format, and I’ve already adapted the code to support it.
Feel free to merge this PR when you’re ready. Thanks!

@kcz358 kcz358 merged commit a1ba778 into EvolvingLMMs-Lab:main May 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants