(eval) Support simple evaluation#39
Conversation
8ef61f1 to
3337612
Compare
brian14708
left a comment
There was a problem hiding this comment.
Review Summary
Overall this is a well-structured eval framework addition. A few items to address before merge:
Breaking Changes (please confirm scope)
-
VLLMDeployConfig.get_sampling_params()removed along withtemperature,top_p,top_k,eos_ids,stop_strings,vllm_sampling_paramsfields. Any downstream code calling these will break. Please grep internal repos to confirm no other consumers. -
vllm_client.completion()return type changed fromCompletionResponsetodict[str, Any]. Same concern.
Code Issues
-
HMMT25/benchmark.py: Remove the redundantpassstatement. -
IFBenchBenchmark._load_records()duplicates the parentJsonlChatBenchmarkshuffle/downsample logic (same seed, same slicing). If the parent ever changes strategy, these will silently diverge. Consider callingsuper()._load_records()or extracting the shared logic. -
SimpleChatGeneratablecontext budget silent truncation: Whenremaining_context < sampling_params.max_tokens,resolved_max_tokensis silently clamped. For eval correctness this should at least emit alogger.warningso users know their generation budget was reduced. -
_is_correctsignature inconsistency across AIME25/GPQA/HMMT: They take(result, answer)but_build_metric.is_success_fnexpectsCallable[[Generated], bool]. The lambda adapter works but is redundant — consider having_is_correctpull gold answer fromresult.case.benchmark.contextinternally.
Infrastructure Concerns
-
vllm_router:
TCPConnector(limit=0, limit_per_host=0)+ all timeouts disabled. Fine for eval, but if the router is shared with non-eval workloads, unlimited connections risk fd exhaustion. Please confirm scope or add a config knob. -
pyproject.toml:transformersupper bound removed entirely. Major version bumps could break tokenizer APIs — consider keeping<6.0or similar.
Nits
GenerationController: If a callback callssubmit_with_callback, it will deadlock on_state_locksince_callback_loopalready holds it. Worth documenting this constraint.- Worker exception wrapping (
RuntimeError(f"{type(e).__name__}: {e}")) loses the original traceback — consider includingtraceback.format_exc()in the message.
|
make sense, transformers has to be < 5.0 for correct apply_chat_template() |
|
LGTM |
Uh oh!
There was an error while loading. Please reload this page.