fix(api/task): handle None generation responses in process_results #1311
Merged
kcz358 merged 1 commit on May 6, 2026
kcz358
approved these changes
May 6, 2026
Summary
- Prevents `Task.process_results` from crashing when a generated response slot in `resps` is `None`. This can occur when the token budget is insufficient and thinking generation has not completed, resulting in an empty response. Even if 99/100 examples succeed, the entire eval is currently discarded.
- […] `message.content` normalization and empty `results = []` handling; this PR covers `resps = [None]` / `resps = [[None]]`.
In scope
- […] `lmms_eval/api/task.py` so generated outputs are checked for `None` before calling `.strip()`.
- […] `generate_until`, `generate_visual_cot`, and the later `generate_until`.
Out of scope
Validation
- […] `N=full run` | key metrics: postprocessing completes; sample outputs/token counts remain inspectable even when results are degraded | result: pass
- `uv run pre-commit run --all-files` | sample size: `N=all files` | key metrics: Python formatting via black and import ordering via isort | result: pass
Risk / Compatibility
- […] `.strip()` behavior, and only `None` generation responses are treated as empty responses instead of aborting postprocessing.

P.S. I know it says to create an issue first before opening a PR if the bug is new, but this is in the same boat as #1218, and I don't think it's quite just at the API level? There's probably a few different ways to look at this.
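For readers who want to see the shape of the guard, here is a minimal sketch of checking for `None` before calling `.strip()`. It is not the code from `lmms_eval/api/task.py`: the helper names `safe_strip` and `normalize_resps` are hypothetical, and the sketch assumes that a `None` slot should simply become an empty string, for both the flat `resps = [None]` and nested `resps = [[None]]` shapes mentioned in the summary.

```python
from typing import Any, List


def safe_strip(resp: Any) -> str:
    """Return a stripped string, treating None as an empty response.

    Hypothetical helper for illustration; not part of lmms_eval.
    """
    if resp is None:
        return ""
    return str(resp).strip()


def normalize_resps(resps: List[Any]) -> List[Any]:
    """Apply safe_strip to flat ([None]) and nested ([[None]]) response lists."""
    normalized = []
    for item in resps:
        if isinstance(item, (list, tuple)):
            # Nested shape: strip every inner response slot.
            normalized.append([safe_strip(sub) for sub in item])
        else:
            # Flat shape: strip the slot directly.
            normalized.append(safe_strip(item))
    return normalized


print(normalize_resps([None]))                # ['']
print(normalize_resps([[None]]))              # [['']]
print(normalize_resps(["  answer  ", None]))  # ['answer', '']
```

With a guard of this kind, a single empty generation scores as an empty answer rather than raising `AttributeError` and aborting postprocessing for the whole run.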
Type of Change