Commit f67a1c2
[evals] Rewrite evals.py, drop legacy scaffolding, levanter->function
Wires the new typed API into the user-facing layer and cleans up the
pre-OpenAI-HTTP scaffolding that no longer has callers.
experiments/evals/evals.py
- Every helper now builds typed (ModelDeployment, LmEvalRun|HarborRun)
pairs; step runners use @Remote(resources=...) v2-Fray wrappers instead
of the v1 launch_evaluate_with_ray path.
- engine_kwargs param split into explicit deployment_kwargs (vLLM server
flags) + extra_model_args (lm-eval client knobs) + batch_size.
- evaluate_harbor supports external-API mode (model_path is None) by
building a RunningModel with LITELLM_PROVIDER_URL.
- default_eval stays on Levanter shim until #4828 lands (step 10).
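The new helper shape can be sketched as follows. This is an illustrative reconstruction, not the real code: the `ModelDeployment` / `LmEvalRun` field names and the `evaluate_lm_eval` signature are assumptions based on the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ModelDeployment:
    """Server side: where the model lives and the vLLM flags it is served with."""
    model_path: str
    deployment_kwargs: dict = field(default_factory=dict)


@dataclass
class LmEvalRun:
    """Client side: what lm-eval runs against the deployment."""
    tasks: tuple
    extra_model_args: dict = field(default_factory=dict)
    batch_size: int = 8


def evaluate_lm_eval(model_path, tasks, deployment_kwargs=None,
                     extra_model_args=None, batch_size=8):
    """Build the typed (ModelDeployment, LmEvalRun) pair a step runner consumes."""
    deployment = ModelDeployment(model_path, dict(deployment_kwargs or {}))
    run = LmEvalRun(tuple(tasks), dict(extra_model_args or {}), batch_size)
    return deployment, run
```

The point of the pair is that server flags and client knobs can no longer be mixed in one untyped dict.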
experiments/evals/engine_configs.py
- DEFAULT_LM_EVAL_MODEL_KWARGS split into DEFAULT_VLLM_DEPLOYMENT_KWARGS
+ DEFAULT_LM_EVAL_EXTRA_MODEL_ARGS, matching the new API shape.
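Roughly, the split looks like this; the concrete flag names and values below are placeholders, not the repo's actual defaults.

```python
# Server-side flags handed to the vLLM deployment (placeholder values).
DEFAULT_VLLM_DEPLOYMENT_KWARGS = {
    "tensor_parallel_size": 1,
    "max_model_len": 4096,
}

# Client-side knobs handed to the lm-eval model args (placeholder values).
DEFAULT_LM_EVAL_EXTRA_MODEL_ARGS = {
    "num_concurrent": 8,
}
```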
Callers updated
- run_base_model_evals.py, exp_evalchemy_eval.py,
exp_evalchemy_eval_reproduce_openthoughts.py: engine_kwargs dicts
split into deployment_kwargs + extra_model_args + batch_size args.
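A hypothetical shim for the caller migration, assuming a fixed set of server-side keys (the key list here is illustrative; each caller in this PR was split by hand):

```python
def split_engine_kwargs(engine_kwargs,
                        server_keys=("tensor_parallel_size", "max_model_len")):
    """Split a legacy engine_kwargs dict into
    (deployment_kwargs, extra_model_args, batch_size)."""
    remaining = dict(engine_kwargs)
    batch_size = remaining.pop("batch_size", 8)
    # Keys known to be vLLM server flags move to deployment_kwargs;
    # everything else stays as lm-eval client model args.
    deployment_kwargs = {k: remaining.pop(k) for k in list(remaining)
                         if k in server_keys}
    return deployment_kwargs, remaining, batch_size
```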
Levanter evaluator
- LevanterLmEvalEvaluator class collapsed into a run_levanter_lm_eval()
function. No Evaluator ABC / ModelConfig coupling. Scheduled for full
deletion in step 10 (gated on #4828).
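The class-to-function collapse, in shape only. The body below is a placeholder and the `run_levanter_lm_eval` signature is an assumption; the real function drives the lm-eval harness.

```python
# Before (coupled to the deleted Evaluator ABC / ModelConfig):
#   class LevanterLmEvalEvaluator(LevanterTpuEvaluator):
#       def evaluate(self, model_config: ModelConfig, tasks): ...
#
# After: a free function over plain arguments, no ABC, no ModelConfig.
def run_levanter_lm_eval(checkpoint_path: str, tasks: list[str]) -> dict:
    """Run the lm-eval harness against a Levanter checkpoint directly."""
    # Placeholder body: the real function invokes the harness and returns scores.
    return {"checkpoint": checkpoint_path, "tasks": list(tasks)}
```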
Deleted
- marin.evaluation.run (draccus CLI, evaluate(config), EVALUATORS dict,
_impute_model_config, _to_v1_resource_config adapter, _normalize_model_path)
- marin.evaluation.evaluators.evaluator (Evaluator ABC, ModelConfig,
Dependency, v1-Fray launch_evaluate_with_ray free function)
- marin.evaluation.evaluators.simple_evaluator (the "debug" mapping
was only reachable via the deleted run.py:main)
- marin.evaluation.evaluators.levanter_tpu_evaluator (base class for
the removed LevanterLmEvalEvaluator)
- marin.evaluation.evaluation_config.EvaluationConfig
- marin.inference.vllm_server.resolve_model_name_or_path /
_maybe_enable_streaming (legacy ModelConfig shims)
Tests
- tests/evals/test_lm_eval.py: @tpu_ci guard updated to the new helper
signature; test_lm_eval_harness_levanter dropped (duplicate Tier C —
default_eval still exercises the Levanter path via other experiments).
- tests/evals/test_evals_helpers.py: new migration tests — each helper
builds the expected (ModelDeployment, run-config) pair, Harbor
external-API vs local-vLLM modes, parameterized step-suffix extraction.
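The step-suffix extraction those parameterized tests cover is roughly this; the helper name and regex are assumptions, not the code under test.

```python
import re


def extract_step_suffix(checkpoint_path: str):
    """Return the trailing step number of a checkpoint path, or None."""
    match = re.search(r"step[-_](\d+)/?$", checkpoint_path)
    return int(match.group(1)) if match else None
```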
Known out-of-scope issue, flagged for separate handoff: the evalchemy
commit pin 010412c set on main in PR #3690 is not reachable on either
teetone/evalchemy or mlfoundations/evalchemy, so `git checkout 010412c`
at runtime fails. Not introduced by this PR.
File tree (13 files changed, +762, -1313):
- experiments/evals
- lib/marin/src/marin/evaluation
  - evaluators
- tests/evals