
Commit f67a1c2

[evals] Rewrite evals.py, drop legacy scaffolding, levanter->function
Wires the new typed API into the user-facing layer and cleans up the pre-OpenAI-HTTP scaffolding that no longer has callers.

experiments/evals/evals.py
- Every helper now builds typed (ModelDeployment, LmEvalRun|HarborRun) pairs; step runners use @Remote(resources=...) v2-Fray wrappers instead of the v1 launch_evaluate_with_ray path.
- The engine_kwargs param is split into explicit deployment_kwargs (vLLM server flags) + extra_model_args (lm-eval client knobs) + batch_size.
- evaluate_harbor supports external-API mode (model_path is None) by building a RunningModel with LITELLM_PROVIDER_URL.
- default_eval stays on the Levanter shim until #4828 lands (step 10).

experiments/evals/engine_configs.py
- DEFAULT_LM_EVAL_MODEL_KWARGS is split into DEFAULT_VLLM_DEPLOYMENT_KWARGS + DEFAULT_LM_EVAL_EXTRA_MODEL_ARGS, matching the new API shape.

Callers updated
- run_base_model_evals.py, exp_evalchemy_eval.py, exp_evalchemy_eval_reproduce_openthoughts.py: engine_kwargs dicts split into deployment_kwargs + extra_model_args + batch_size args.

Levanter evaluator
- The LevanterLmEvalEvaluator class is collapsed into a run_levanter_lm_eval() function, with no Evaluator ABC / ModelConfig coupling. Scheduled for full deletion in step 10 (gated on #4828).
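As a rough sketch of the helper shape described above: ModelDeployment, LmEvalRun, engine_kwargs, extra_model_args, deployment_kwargs, and batch_size come from this commit's description, but the exact field layouts, defaults, and the helper name build_lm_eval_pair are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class ModelDeployment:
    """Server-side half of the pair: what to serve and the vLLM server flags."""
    model_path: str
    engine_kwargs: dict = field(default_factory=dict)

@dataclass
class LmEvalRun:
    """Client-side half: lm-eval tasks plus per-request knobs."""
    tasks: list
    extra_model_args: tuple = ()
    batch_size: int = 8

def build_lm_eval_pair(model_path, tasks, deployment_kwargs=None,
                       extra_model_args=(), batch_size=8):
    """Build the typed (ModelDeployment, LmEvalRun) pair a helper returns,
    keeping server flags and client knobs in their separate halves."""
    deployment = ModelDeployment(model_path, dict(deployment_kwargs or {}))
    run = LmEvalRun(list(tasks), tuple(extra_model_args), batch_size)
    return deployment, run

deployment, run = build_lm_eval_pair(
    "gs://bucket/checkpoints/step-1000",  # hypothetical checkpoint path
    ["mmlu"],
    deployment_kwargs={"max_model_len": 4096},
    extra_model_args=("max_gen_toks=4096",),
)
```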
Deleted
- marin.evaluation.run (draccus CLI, evaluate(config), EVALUATORS dict, _impute_model_config, _to_v1_resource_config adapter, _normalize_model_path)
- marin.evaluation.evaluators.evaluator (Evaluator ABC, ModelConfig, Dependency, v1-Fray launch_evaluate_with_ray free function)
- marin.evaluation.evaluators.simple_evaluator (the "debug" mapping was only reachable via the deleted run.py:main)
- marin.evaluation.evaluators.levanter_tpu_evaluator (base class for the removed LevanterLmEvalEvaluator)
- marin.evaluation.evaluation_config.EvaluationConfig
- marin.inference.vllm_server.resolve_model_name_or_path / _maybe_enable_streaming (legacy ModelConfig shims)

Tests
- tests/evals/test_lm_eval.py: @tpu_ci guard updated to the new helper signature; test_lm_eval_harness_levanter dropped (duplicate Tier C coverage; default_eval still exercises the Levanter path via other experiments).
- tests/evals/test_evals_helpers.py: new migration tests covering: each helper builds the expected (ModelDeployment, run-config) pair; Harbor external-API vs local-vLLM modes; parameterized step-suffix extraction.

Known out-of-scope issue, flagged for separate handoff: the evalchemy commit pin 010412c set on main in PR #3690 is not reachable on either teetone/evalchemy or mlfoundations/evalchemy, so `git checkout 010412c` fails at runtime. Not introduced by this PR.
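The Harbor external-API vs local-vLLM split mentioned above could branch roughly as follows. RunningModel, LITELLM_PROVIDER_URL, and the model_path-is-None trigger come from the commit message; the field names, the helper name resolve_running_model, the localhost port, and the fallback URL are illustrative assumptions.

```python
import os
from dataclasses import dataclass

@dataclass
class RunningModel:
    """Minimal stand-in: where the eval client should send requests."""
    base_url: str
    model_name: str

def resolve_running_model(model_path, model_name):
    """External-API mode when model_path is None; local vLLM otherwise."""
    if model_path is None:
        # External-API mode: point the client at the provider URL from the
        # environment (fallback URL here is purely illustrative).
        base_url = os.environ.get("LITELLM_PROVIDER_URL",
                                  "https://api.example.com/v1")
        return RunningModel(base_url=base_url, model_name=model_name)
    # Local mode: a vLLM server would be deployed for model_path
    # (port 8000 is an assumption, not the repo's actual choice).
    return RunningModel(base_url="http://localhost:8000/v1",
                        model_name=model_path)
```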
1 parent 6db4c8d commit f67a1c2

File tree

13 files changed: +762 -1313 lines

experiments/evals/engine_configs.py (11 additions, 3 deletions)
@@ -1,8 +1,16 @@
 # Copyright The Marin Authors
 # SPDX-License-Identifier: Apache-2.0
 
-"""Engine configuration for vLLM used for evals."""
+"""Engine + run defaults for vLLM-backed evals.
 
-DEFAULT_VLLM_ENGINE_KWARGS = {"max_model_len": 4096}
+Splits today's single `DEFAULT_LM_EVAL_MODEL_KWARGS` bag into two halves that
+target the post-#4827 types:
 
-DEFAULT_LM_EVAL_MODEL_KWARGS = {**DEFAULT_VLLM_ENGINE_KWARGS, "max_gen_toks": 4096}
+- `DEFAULT_VLLM_DEPLOYMENT_KWARGS`: vLLM server flags. Feeds `ModelDeployment.engine_kwargs`.
+- `DEFAULT_LM_EVAL_EXTRA_MODEL_ARGS`: per-request / lm-eval client knobs.
+  Feeds `LmEvalRun.extra_model_args` as pre-formatted `k=v` strings.
+"""
+
+DEFAULT_VLLM_DEPLOYMENT_KWARGS: dict = {"max_model_len": 4096}
+
+DEFAULT_LM_EVAL_EXTRA_MODEL_ARGS: tuple[str, ...] = ("max_gen_toks=4096",)
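The new docstring says extra_model_args carries pre-formatted `k=v` strings rather than a kwargs dict. A minimal sketch of producing that shape from a dict; the helper name to_extra_model_args and the sorted ordering are assumptions, not code from this commit.

```python
def to_extra_model_args(kwargs: dict) -> tuple[str, ...]:
    """Format a kwargs dict as the pre-formatted `k=v` strings the
    lm-eval client side consumes (sorted for a stable ordering)."""
    return tuple(f"{k}={v}" for k, v in sorted(kwargs.items()))

# The default tuple in the diff above matches formatting this single entry:
assert to_extra_model_args({"max_gen_toks": 4096}) == ("max_gen_toks=4096",)
```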
