Scores model responses against ground-truth answers after an accuracy benchmark run; extracts answers from raw response text and supports LiveCodeBench code-execution evaluation via an external sandboxed server.
evaluation/ scores model responses against ground-truth answers for accuracy benchmarks.
It is invoked after a benchmark run that collected responses (i.e. --mode acc or --mode both).
Today that orchestration happens in commands/benchmark/execute.py, not in metrics/.
- Extract model answers from raw response text
- Score extracted answers against ground truth
- Support LiveCodeBench code execution evaluation (requires external server)
QueryResult.response_output (raw response text)
|
v
extractor.py --> extracted answer string
|
v
scoring.py --> correct / incorrect (per sample)
|
v
accuracy summary written into benchmark results
| File | Purpose |
|---|---|
| extractor.py | Extracts model answer from raw text (regex, boxed-answer parsing) |
| scoring.py | Compares extracted answer to ground truth label |
| livecodebench/ | LiveCodeBench-specific code execution pipeline |
LiveCodeBench requires a sandboxed code execution server. The livecodebench/ subdirectory
contains the server implementation and a Dockerfile. See
src/inference_endpoint/evaluation/livecodebench/README.md for setup
instructions.
Files:
- _server.py: FastAPI server that executes submitted code
- lcb_serve.py: Server management utilities
- generate.py: Response generation utilities
- run_lcb_tests.py: Test runner for LCB evaluation
- lcb_serve.dockerfile: Docker image for the execution server
The scorer registry in evaluation/scoring.py currently includes:
| Method | Description |
|---|---|
| pass_at_1 | Exact-match style scoring; also used by the LiveCodeBench path |
| string_match | Whitespace-trimmed string equality |
| rouge | ROUGE-based text generation scoring |
| code_bench_scorer | LiveCodeBench code-execution scoring |
| shopify_category_f1 | Shopify category F1 evaluation |
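A name-keyed registry of this kind could look roughly like the sketch below. The real registry in evaluation/scoring.py may be structured differently; `SCORERS`, `register`, and `get_scorer` are hypothetical names, and only `string_match` is implemented, following its description in the table (whitespace-trimmed string equality).

```python
from typing import Callable, Optional

Scorer = Callable[[Optional[str], str], bool]

# Hypothetical registry; the real one lives in evaluation/scoring.py.
SCORERS: dict[str, Scorer] = {}


def register(name: str) -> Callable[[Scorer], Scorer]:
    """Decorator that files a scorer under the name used in eval_method."""
    def decorator(fn: Scorer) -> Scorer:
        SCORERS[name] = fn
        return fn
    return decorator


@register("string_match")
def string_match(answer: Optional[str], truth: str) -> bool:
    """Whitespace-trimmed string equality; a missing answer is incorrect."""
    if answer is None:
        return False
    return answer.strip() == truth.strip()


def get_scorer(name: str) -> Scorer:
    """Look up a scorer, failing loudly on an unknown method name."""
    try:
        return SCORERS[name]
    except KeyError:
        known = ", ".join(sorted(SCORERS))
        raise ValueError(f"unknown eval_method {name!r}; known: {known}") from None


scorer = get_scorer("string_match")
print(scorer("  B\n", "B"))  # True
print(scorer("b", "B"))      # False: comparison is case-sensitive in this sketch
```

Whether the real `string_match` is case-sensitive is not stated here; the sketch assumes it is.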
The scoring configuration used by benchmark execution is specified per accuracy dataset under
datasets[].accuracy_config, including accuracy_config.eval_method,
accuracy_config.extractor, and optional accuracy_config.ground_truth.
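As an illustration of the shape this implies, the sketch below writes one dataset entry as a Python dict and checks it. Only the key names `datasets`, `accuracy_config`, `eval_method`, `extractor`, and `ground_truth` come from the text above; the `name` key, every value, and the `validate_accuracy_config` helper are hypothetical, and the check assumes that `extractor` is required because only `ground_truth` is described as optional.

```python
# Method names taken from the scorer table above.
KNOWN_EVAL_METHODS = {
    "pass_at_1",
    "string_match",
    "rouge",
    "code_bench_scorer",
    "shopify_category_f1",
}

# Placeholder config: every value below is made up for illustration.
config = {
    "datasets": [
        {
            "name": "example_accuracy_dataset",  # hypothetical key and value
            "accuracy_config": {
                "eval_method": "string_match",
                "extractor": "example_extractor",  # hypothetical extractor name
                "ground_truth": "answer",  # optional; hypothetical value
            },
        }
    ]
}


def validate_accuracy_config(cfg: dict) -> list[str]:
    """Return readable problems; an empty list means nothing was flagged."""
    problems = []
    for i, dataset in enumerate(cfg.get("datasets", [])):
        acc = dataset.get("accuracy_config")
        if acc is None:
            continue  # dataset without accuracy scoring
        method = acc.get("eval_method")
        if method not in KNOWN_EVAL_METHODS:
            problems.append(f"datasets[{i}]: unknown eval_method {method!r}")
        if not acc.get("extractor"):
            problems.append(f"datasets[{i}]: accuracy_config.extractor is missing")
    return problems


print(validate_accuracy_config(config))  # []
```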
Extraction is separate from scoring
Model responses for tasks like GPQA often embed the answer in verbose reasoning text. Extraction (finding the answer in the text) and scoring (comparing the answer) are separate concerns. Different datasets may share a scoring method but require different extraction logic.
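To make the separation concrete, here is a minimal sketch of two extractors whose outputs could feed the same scorer. Neither function is taken from extractor.py: `extract_boxed`, `extract_choice`, and the regular expression are assumptions chosen to illustrate boxed-answer parsing and regex extraction.

```python
import re
from typing import Optional

# Hypothetical pattern for multiple-choice answers such as "The answer is (C)".
_CHOICE_RE = re.compile(r"answer\s*(?:is|:)\s*\(?([A-D])\b", re.IGNORECASE)

_BOXED = "\\boxed{"


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last boxed expression, honoring nested braces."""
    start = text.rfind(_BOXED)
    if start == -1:
        return None
    begin = start + len(_BOXED)
    depth = 1
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i].strip()
    return None  # unbalanced braces: treat as no answer found


def extract_choice(text: str) -> Optional[str]:
    """Return the last stated choice letter, uppercased, or None."""
    matches = _CHOICE_RE.findall(text)
    return matches[-1].upper() if matches else None


print(extract_boxed(r"so the result is \boxed{\frac{1}{2}} as shown"))
print(extract_choice("Long reasoning... so the answer is (c)."))
```

Both functions return a plain string or `None`, so a single comparison-based scorer can consume either one.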
LiveCodeBench requires an external service
Code execution cannot be done safely in-process, so the evaluation server runs in a Docker container with resource limits. This is a deliberate architecture choice, not a shortcut, and is documented prominently in the dataset README.
| Component | Role |
|---|---|
| commands/benchmark/execute.py | Builds scorer/extractor configs and runs scoring |
| dataset_manager/predefined/ | Provides ground truth labels alongside prompts |
| evaluation/livecodebench/ | Provides external execution path for LiveCodeBench |
| results.json / benchmark reports | Receives computed accuracy summary during finalization |