(eval) Support simple evaluation #39

Merged
Randomizez merged 4 commits into dev from support-evaluation on Mar 18, 2026

@Randomizez Randomizez commented Mar 17, 2026

Collaborator
  • support AIME25, GPQADiamond, HMMT25, IFBench, MMLUPro
  • add evaluation results of step-3.5-flash-sft
  • bump version of torch to 2.10; vllm to 0.17.1

@Randomizez Randomizez requested a review from brian14708 March 17, 2026 03:00
@Randomizez Randomizez requested a review from a team March 17, 2026 03:24

@brian14708 brian14708 left a comment

Collaborator

Review Summary

Overall this is a well-structured eval framework addition. A few items to address before merge:

Breaking Changes (please confirm scope)

  1. VLLMDeployConfig.get_sampling_params() removed along with temperature, top_p, top_k, eos_ids, stop_strings, vllm_sampling_params fields. Any downstream code calling these will break. Please grep internal repos to confirm no other consumers.

  2. vllm_client.completion() return type changed from CompletionResponse to dict[str, Any]. Same concern.

Code Issues

  1. HMMT25/benchmark.py: Remove the redundant pass statement.

  2. IFBenchBenchmark._load_records() duplicates the parent JsonlChatBenchmark shuffle/downsample logic (same seed, same slicing). If the parent ever changes strategy, these will silently diverge. Consider calling super()._load_records() or extracting the shared logic.

  3. SimpleChatGeneratable context budget silent truncation: When remaining_context < sampling_params.max_tokens, resolved_max_tokens is silently clamped. For eval correctness this should at least emit a logger.warning so users know their generation budget was reduced.

  4. _is_correct signature inconsistency across AIME25/GPQA/HMMT: They take (result, answer) but _build_metric.is_success_fn expects Callable[[Generated], bool]. The lambda adapter works but is redundant — consider having _is_correct pull gold answer from result.case.benchmark.context internally.
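
For item 3, the warning could look roughly like the sketch below. The helper name `resolve_max_tokens` and its parameters are hypothetical stand-ins; only `remaining_context`, `max_tokens`, and `resolved_max_tokens` are names from the PR, and the real logic lives inside `SimpleChatGeneratable`.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_max_tokens(requested_max_tokens: int, remaining_context: int) -> int:
    """Clamp the generation budget to the remaining context, warning when it shrinks.

    Hypothetical helper for illustration, not the PR's actual function.
    """
    resolved_max_tokens = min(requested_max_tokens, remaining_context)
    if resolved_max_tokens < requested_max_tokens:
        # Make the reduced budget visible instead of clamping silently.
        logger.warning(
            "max_tokens clamped from %d to %d (remaining context: %d)",
            requested_max_tokens,
            resolved_max_tokens,
            remaining_context,
        )
    return resolved_max_tokens
```

Logging at warning level keeps the clamp visible in default eval logs without failing the run.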

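For item 2, one way to keep the sampling strategy in a single place is sketched below. Both classes are simplified stand-ins: the constructor arguments, the `_read_raw_records` hook, and the post-processing step are assumptions for illustration, not the PR's actual code.

```python
import random
from typing import Any, Dict, List, Optional


class JsonlChatBenchmark:
    """Stand-in for the parent class; fields and hooks here are assumptions."""

    def __init__(self, seed: int = 0, max_samples: Optional[int] = None) -> None:
        self.seed = seed
        self.max_samples = max_samples

    def _read_raw_records(self) -> List[Dict[str, Any]]:
        # Hypothetical hook: subclasses supply their own raw records.
        raise NotImplementedError

    def _load_records(self) -> List[Dict[str, Any]]:
        records = list(self._read_raw_records())
        # The only place that owns the shuffle/downsample strategy.
        random.Random(self.seed).shuffle(records)
        if self.max_samples is not None:
            records = records[: self.max_samples]
        return records


class IFBenchBenchmark(JsonlChatBenchmark):
    def _read_raw_records(self) -> List[Dict[str, Any]]:
        # Placeholder data; the real benchmark would read IFBench JSONL files.
        return [{"id": i, "prompt": f"prompt {i}"} for i in range(10)]

    def _load_records(self) -> List[Dict[str, Any]]:
        # Reuse the parent's sampling, then do IFBench-specific work only.
        records = super()._load_records()
        return [{**r, "benchmark": "IFBench"} for r in records]
```

With this shape, a change to the parent's seed handling or slicing applies to IFBench automatically.
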
Infrastructure Concerns

  1. vllm_router: TCPConnector(limit=0, limit_per_host=0) + all timeouts disabled. Fine for eval, but if the router is shared with non-eval workloads, unlimited connections risk fd exhaustion. Please confirm scope or add a config knob.

  2. pyproject.toml: transformers upper bound removed entirely. Major version bumps could break tokenizer APIs — consider keeping <6.0 or similar.
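
For item 1, a config knob could be as small as the sketch below. The `RouterConnectionConfig` class and its field names are hypothetical; the dicts it builds are meant to be passed to `aiohttp.TCPConnector(**kwargs)` and `aiohttp.ClientTimeout(**kwargs)`. The sketch deliberately does not import aiohttp, and the shared-profile values are aiohttp's documented defaults (100 connections, 300 s total timeout), not values from the PR.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class RouterConnectionConfig:
    """Hypothetical knob for router connection limits and timeouts."""

    max_connections: int = 100  # aiohttp default for TCPConnector(limit=...)
    max_connections_per_host: int = 0  # 0 means unlimited in aiohttp
    total_timeout_s: Optional[float] = 300.0  # None disables the total timeout

    def connector_kwargs(self) -> Dict[str, Any]:
        # Intended use: aiohttp.TCPConnector(**cfg.connector_kwargs())
        return {
            "limit": self.max_connections,
            "limit_per_host": self.max_connections_per_host,
        }

    def timeout_kwargs(self) -> Dict[str, Any]:
        # Intended use: aiohttp.ClientTimeout(**cfg.timeout_kwargs())
        return {"total": self.total_timeout_s}


# Eval profile matching the PR's behaviour: unlimited connections, no timeout.
EVAL_PROFILE = RouterConnectionConfig(
    max_connections=0, max_connections_per_host=0, total_timeout_s=None
)
# Conservative profile for a router shared with non-eval workloads.
SHARED_PROFILE = RouterConnectionConfig()
```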

Nits

  • GenerationController: If a callback calls submit_with_callback, it will deadlock on _state_lock since _callback_loop already holds it. Worth documenting this constraint.
  • Worker exception wrapping (RuntimeError(f"{type(e).__name__}: {e}")) loses the original traceback — consider including traceback.format_exc() in the message.
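
The re-entrancy constraint in the first nit can be shown in isolation. This sketch assumes `_state_lock` is a plain `threading.Lock` (the PR does not say); the lock objects and the `try_reenter` helper are illustrative only. It uses a non-blocking acquire so the demo cannot hang.

```python
import threading

state_lock = threading.Lock()  # assumption: _state_lock is a non-reentrant Lock
reentrant_lock = threading.RLock()  # one possible alternative


def try_reenter(lock) -> bool:
    """Return True if the thread holding `lock` can acquire it a second time."""
    with lock:
        # blocking=False avoids hanging the demo; a blocking acquire on a
        # plain Lock would deadlock at this point.
        acquired = lock.acquire(blocking=False)
        if acquired:
            lock.release()
        return acquired


plain_ok = try_reenter(state_lock)
rlock_ok = try_reenter(reentrant_lock)
```

Switching to an `RLock` is only one option; running callbacks after releasing the lock would avoid the problem as well.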

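For the traceback nit, a minimal sketch of a wrapper that keeps the original traceback. The `run_worker_task` and `failing_task` names are hypothetical; only the `RuntimeError(f"{type(e).__name__}: {e}")` pattern and `traceback.format_exc()` come from the review.

```python
import traceback


def run_worker_task(task):
    """Hypothetical worker wrapper; `task` is any zero-argument callable."""
    try:
        return task()
    except Exception as e:
        # Embed the formatted traceback so it survives even if only the
        # message crosses a thread or process boundary; `from e` also
        # keeps the original exception as __cause__.
        raise RuntimeError(
            f"{type(e).__name__}: {e}\n{traceback.format_exc()}"
        ) from e


def failing_task():
    raise ValueError("bad sample")


wrapped = None
try:
    run_worker_task(failing_task)
except RuntimeError as err:
    wrapped = err
```
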
@Randomizez

Collaborator Author

Makes sense. transformers has to be < 5.0 for apply_chat_template() to work correctly.

@Randomizez Randomizez marked this pull request as draft March 17, 2026 14:21
@brian14708

Collaborator

LGTM

Comment thread steptronoss/generation/async_generation.py
Comment thread playground/eval/benchmarks/AIME25/benchmark.py Outdated
Comment thread steptronoss/generation/async_generation.py
@Randomizez Randomizez marked this pull request as ready for review March 18, 2026 09:17
@Randomizez Randomizez merged commit 471c676 into dev Mar 18, 2026
7 checks passed
3 participants