(eval) Support simple evaluation by Randomizez · Pull Request #39 · stepfun-ai/SteptronOss

Randomizez · 2026-03-17T03:00:01Z

support AIME25, GPQADiamond, HMMT25, IFBench, MMLUPro
add evaluation results of step-3.5-flash-sft
bump version of torch to 2.10; vllm to 0.17.1

brian14708

Review Summary

Overall this is a well-structured eval framework addition. A few items to address before merge:

Breaking Changes (please confirm scope)

VLLMDeployConfig.get_sampling_params() removed along with temperature, top_p, top_k, eos_ids, stop_strings, vllm_sampling_params fields. Any downstream code calling these will break. Please grep internal repos to confirm no other consumers.
vllm_client.completion() return type changed from CompletionResponse to dict[str, Any]. Same concern.
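
One way to run the requested check is a small scan over a checkout of each downstream repo. This is a minimal sketch, not part of the PR: the helper name `find_removed_usages` is hypothetical, and the generic field names (`temperature`, `top_p`, `top_k`) are deliberately left out of the pattern because they would match far too much unrelated code.

```python
import re
from pathlib import Path

# Names taken from the review comment above; the scan itself is only a heuristic
# and will also match unrelated attributes that happen to share a name.
REMOVED_NAMES = (
    "get_sampling_params",
    "vllm_sampling_params",
    "eos_ids",
    "stop_strings",
)

PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, REMOVED_NAMES)) + r")\b")


def find_removed_usages(root):
    """Return (path, line_number, line) for every .py line mentioning a removed name."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Each hit still needs a manual look to confirm that it refers to VLLMDeployConfig rather than another object with the same attribute name.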

Code Issues

HMMT25/benchmark.py: Remove the redundant pass statement.
IFBenchBenchmark._load_records() duplicates the parent JsonlChatBenchmark shuffle/downsample logic (same seed, same slicing). If the parent ever changes strategy, these will silently diverge. Consider calling super()._load_records() or extracting the shared logic.
SimpleChatGeneratable context budget silent truncation: When remaining_context < sampling_params.max_tokens, resolved_max_tokens is silently clamped. For eval correctness this should at least emit a logger.warning so users know their generation budget was reduced.
_is_correct signature inconsistency across AIME25/GPQA/HMMT: They take (result, answer) but _build_metric.is_success_fn expects Callable[[Generated], bool]. The lambda adapter works but is redundant — consider having _is_correct pull gold answer from result.case.benchmark.context internally.
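
For the context-budget item, the warning could look roughly like the sketch below. The function name `resolve_max_tokens`, its parameters, and the logger name are assumptions made for illustration; the actual SimpleChatGeneratable code is not shown in this thread.

```python
import logging

logger = logging.getLogger("eval.generation")  # hypothetical logger name


def resolve_max_tokens(requested_max_tokens, context_length, prompt_tokens):
    """Clamp the generation budget to the remaining context, warning when it shrinks."""
    remaining_context = context_length - prompt_tokens
    if remaining_context <= 0:
        # Nothing can be generated at all; fail loudly instead of clamping to zero.
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, context length is {context_length}"
        )
    if remaining_context < requested_max_tokens:
        logger.warning(
            "max_tokens reduced from %d to %d (prompt=%d tokens, context=%d tokens)",
            requested_max_tokens,
            remaining_context,
            prompt_tokens,
            context_length,
        )
        return remaining_context
    return requested_max_tokens
```

With this shape, a reduced budget shows up in the eval logs next to the affected sample instead of only surfacing as a truncated answer.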

Infrastructure Concerns

vllm_router: TCPConnector(limit=0, limit_per_host=0) + all timeouts disabled. Fine for eval, but if the router is shared with non-eval workloads, unlimited connections risk fd exhaustion. Please confirm scope or add a config knob.
pyproject.toml: transformers upper bound removed entirely. Major version bumps could break tokenizer APIs — consider keeping <6.0 or similar.
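
A config knob for the router could be as small as the sketch below. The class and field names are hypothetical and the non-default numbers are placeholders, not recommendations. The only library facts relied on are that `aiohttp.TCPConnector` treats `limit=0` and `limit_per_host=0` as "no limit" and that `aiohttp.ClientTimeout(total=None)` disables the total timeout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouterConnectionConfig:
    """Hypothetical knob; defaults reproduce the unlimited behaviour described in the review."""

    max_connections: int = 0                 # 0 means unlimited for aiohttp.TCPConnector
    max_connections_per_host: int = 0        # 0 means unlimited per host
    total_timeout_s: Optional[float] = None  # None disables the total timeout

    def connector_kwargs(self):
        """Keyword arguments intended for aiohttp.TCPConnector(...)."""
        return {
            "limit": self.max_connections,
            "limit_per_host": self.max_connections_per_host,
        }

    def timeout_kwargs(self):
        """Keyword arguments intended for aiohttp.ClientTimeout(...)."""
        return {"total": self.total_timeout_s}


# Eval-only deployments keep today's behaviour; a shared router opts into bounds.
EVAL_DEFAULTS = RouterConnectionConfig()
SHARED_ROUTER = RouterConnectionConfig(
    max_connections=512,          # placeholder value
    max_connections_per_host=64,  # placeholder value
    total_timeout_s=600.0,        # placeholder value
)
```

The router would then build its session with `aiohttp.TCPConnector(**cfg.connector_kwargs())` and `aiohttp.ClientTimeout(**cfg.timeout_kwargs())`, so eval runs stay unbounded while shared deployments cap file-descriptor usage.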

Nits

GenerationController: If a callback calls submit_with_callback, it will deadlock on _state_lock since _callback_loop already holds it. Worth documenting this constraint.
Worker exception wrapping (RuntimeError(f"{type(e).__name__}: {e}")) loses the original traceback — consider including traceback.format_exc() in the message.
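
The traceback suggestion can be sketched as follows. `run_wrapped` is a hypothetical helper standing in for the worker's wrapping code, which is not shown in this thread; only the RuntimeError message format is taken from the comment above.

```python
import traceback


def run_wrapped(fn, *args):
    """Run fn; on failure raise RuntimeError whose message carries the original traceback."""
    try:
        return fn(*args)
    except Exception as e:
        # `from e` keeps the exception chain while still in the same process.
        # The formatted traceback in the message survives even if only the
        # string form of the error is passed on to the caller.
        raise RuntimeError(
            f"{type(e).__name__}: {e}\n{traceback.format_exc()}"
        ) from e
```

The first line of the message keeps the existing `TypeName: message` format, so anything that parses it today would still work, and the traceback follows on the lines after it.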

Randomizez · 2026-03-17T14:21:42Z

Makes sense. transformers has to be < 5.0 for apply_chat_template() to work correctly.

brian14708 · 2026-03-17T15:30:50Z

LGTM

Randomizez requested a review from brian14708 March 17, 2026 03:00

feat: evaluation

3337612

Randomizez force-pushed the support-evaluation branch from 8ef61f1 to 3337612 Compare March 17, 2026 03:24

Randomizez requested a review from a team March 17, 2026 03:24

update lock file

f777306

brian14708 reviewed Mar 17, 2026

Randomizez marked this pull request as draft March 17, 2026 14:21

brian14708 approved these changes Mar 17, 2026

Comment thread steptronoss/generation/async_generation.py

Comment thread playground/eval/benchmarks/AIME25/benchmark.py Outdated

Comment thread steptronoss/generation/async_generation.py

ZRandomize added 2 commits March 18, 2026 17:03

handle transformers>=5.0 issues

518335f

modify according to review

30f012a

Randomizez marked this pull request as ready for review March 18, 2026 09:17

Randomizez merged commit 471c676 into dev Mar 18, 2026
7 checks passed

Randomizez merged 4 commits into dev from support-evaluation

3 participants
