
fix(api/task): handle None generation responses in process_results#1311

Merged
kcz358 merged 1 commit into
EvolvingLMMs-Lab:mainfrom
dankit:fix/none-generation-response-postprocess
May 6, 2026

Conversation

@dankit
Contributor

@dankit dankit commented Apr 26, 2026

Summary

  • Prevent Task.process_results from crashing when a generated response slot in resps is None. This can occur when the token budget is insufficient and thinking generation has not finished, leaving the response empty. Currently, even if 99 of 100 examples succeed, the entire eval is discarded.
  • Treat missing generation responses as empty strings so postprocessing can finish and users can inspect degraded eval outputs, token counts on empty responses, and failure context. Without this change the failure is not observable, even with verbose output.
  • This is distinct from prior fixes for OpenAI-compatible message.content normalization and empty results = [] handling; this PR covers resps = [None] / resps = [[None]].
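
For reviewers who want to see the failure in isolation, here is a minimal sketch of the crash. `strip_all` is a hypothetical stand-in for the postprocessing step, not the actual code in lmms_eval/api/task.py; it only assumes that postprocessing calls `.strip()` on each response.

```python
def strip_all(resps):
    """Naive postprocessing: assumes every response slot is a string."""
    return [r.strip() for r in resps]


# A normal string response is stripped as expected.
ok = strip_all(["  answer  "])

# A None slot has no .strip() method, so the whole call aborts.
try:
    strip_all([None])
    crashed = False
except AttributeError:
    crashed = True
```

A single None slot is enough to raise, which matches the "one bad sample discards the run" behavior described above.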

In scope

  • Updates lmms_eval/api/task.py so generated outputs are checked for None before calling .strip().
  • Applies to generate_until, generate_visual_cot, and the later generate_until path.
  • Preserves existing behavior for normal string responses.
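
The guard described in this list can be sketched roughly as follows. The helper names `safe_strip` and `normalize_resps` are hypothetical and chosen for illustration; the merged change may inline the check instead. The sketch assumes the two shapes named in the summary, `resps = [None]` and `resps = [[None]]`.

```python
from typing import List, Optional, Union


def safe_strip(resp: Optional[str]) -> str:
    """Return an empty string for a missing response, else the stripped text."""
    return "" if resp is None else resp.strip()


def normalize_resps(
    resps: List[Union[Optional[str], List[Optional[str]]]],
) -> List[Union[str, List[str]]]:
    """Apply safe_strip to flat ([None]) and nested ([[None]]) response lists."""
    normalized = []
    for item in resps:
        if isinstance(item, list):
            # Nested shape: one inner list of responses per request.
            normalized.append([safe_strip(r) for r in item])
        else:
            # Flat shape: a single response (or None) per slot.
            normalized.append(safe_strip(item))
    return normalized


flat = normalize_resps([None, "  ok  "])
nested = normalize_resps([[None], ["  ok  "]])
```

Normal strings go through the same `.strip()` call as before, so only the None case changes behavior.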

Out of scope

  • Does not change model generation, token budgeting, scoring logic, metrics, or sample logging.
  • Does not introduce new logging or warning behavior in order to keep the fix minimal and non-disruptive.

Validation

  • IFEval run with a missing generation response; also ran the MMMU val set, mmstar, and screenspot_v2. | sample size: N=full run | key metrics: postprocessing completes; sample outputs/token counts remain inspectable even when results are degraded | result: pass
  • uv run pre-commit run --all-files | sample size: N=all files | key metrics: Python formatting via black and import ordering via isort | result: pass

Risk / Compatibility

  • Low risk: normal string responses keep the same .strip() behavior, and only None generation responses are treated as empty responses instead of aborting postprocessing.
  • This could be perceived as making a missing response less visible; I kept the change minimal and non-disruptive, but can add explicit logging if maintainers prefer stronger observability.
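
If maintainers do prefer stronger observability, one possible shape for the optional logging is sketched below. This is not part of this PR. It uses the standard-library `logging` module as an assumption (the project may use a different logger), and `safe_strip_logged` and its `doc_id` parameter are hypothetical names.

```python
import logging
from typing import Optional

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("none_response_sketch")


def safe_strip_logged(resp: Optional[str], doc_id: int) -> str:
    """Like a None-safe strip, but warns whenever a response is missing."""
    if resp is None:
        logger.warning(
            "Missing generation response for doc_id=%s; treating it as an empty string",
            doc_id,
        )
        return ""
    return resp.strip()


empty = safe_strip_logged(None, doc_id=3)
kept = safe_strip_logged("  answer  ", doc_id=4)
```

The return values are identical to the silent version, so scoring would be unaffected; the only difference is one warning per missing response.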

P.S. I know the guidelines say to open an issue before a PR when the bug is new, but this is in the same boat as #1218, and I don't think it's purely an API-level problem. There are probably a few different ways to look at this.

Type of Change

  • Bug fix (non-breaking change)
  • New feature
  • New benchmark/task
  • New model integration
  • Breaking change
  • Documentation update
  • Refactoring (no functional changes)

@kcz358 kcz358 merged commit a31a7de into EvolvingLMMs-Lab:main May 6, 2026
1 of 2 checks passed