Cornetto is a benchmark for evaluating LLM-driven network configuration repair. Given a set of faulty router configurations and a formal specification of correct network behaviour (written as a set of data-plane predicates), a model must produce a corrected configuration. Correctness is verified with Batfish.
The benchmark supports four pipeline modes — zero-shot, repair (retry on failure), retrieval-augmented, and fully agentic — and covers 231 misconfiguration scenarios across small (<50 nodes), medium (50–100 nodes), and large (100+ nodes) network topologies.
```bash
git clone https://github.com/iprotogeros/cornetto-repo.git
cd cornetto-repo
pip install -e .   # installs the package + all dependencies
```

Or install dependencies only:

```bash
pip install -r requirements.txt
```

**Note:** `vllm` requires a CUDA-capable GPU. If you are using API-only providers (OpenAI, Anthropic, Google, Mistral) you can omit it; vllm is not a required dependency, so `pip install -e .` will still succeed without it.

**Editable install required:** the `cornetto` CLI entry point and `python -m src.benchmark.zero_shot` both resolve configs relative to the repo root, so always run from within the cloned directory.
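Run from the repo root, a quick sanity check that the editable install took effect is to verify the benchmark module resolves (module path taken from the `python -m src.benchmark.zero_shot` equivalence above):

```python
# Sanity check (run from the repo root): the benchmark entry module should be
# importable after `pip install -e .`.
import importlib.util

spec = importlib.util.find_spec("src.benchmark.zero_shot")
print("cornetto modules resolve" if spec else "check your install / working directory")
```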
Copy `.env.example` to `.env` and add your API keys:

```bash
cp .env.example .env   # then edit .env
```

```
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
MISTRAL_API_KEY=...
```
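To confirm the keys are actually visible to Python, here is a minimal sketch assuming the common python-dotenv pattern (Cornetto may load `.env` through a different mechanism; adjust accordingly):

```python
# Hedged sanity check: list which provider keys load from .env.
# Assumes python-dotenv; adapt if the repo uses another loader.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY", "MISTRAL_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")
```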
You also need a running Batfish server. Cornetto uses a customised Batfish container that extends the standard engine with the ability to extract per-flow network forwarding behaviour, which is then parsed by the Config2Spec algorithm implemented in `src/utils/evaluation_utils/` to generate and verify the formal specifications. The standard `batfish/batfish` image does not support these extensions.
Pull and start the container with the provided script:

```bash
bash scripts/batfish_setup/pull_and_run_container.sh
```

This stops any existing container named `batfish`, pulls `iprotogeros/batfish-allione:forwarding-analysis-0.1`, and starts it on port 9996.
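To verify the coordinator is reachable before launching a run, a minimal pybatfish check works (a sketch assuming pybatfish is installed with the other dependencies; its default `Session` already targets port 9996):

```python
# Reachability check for the Batfish coordinator started above.
from pybatfish.client.session import Session

bf = Session(host="localhost")  # pybatfish defaults to port 9996
print("Batfish is up; existing networks:", bf.list_networks())
```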
The dataset (231 scenarios) is hosted on HuggingFace as a public dataset. It contains everything needed to run the benchmark: faulty router configurations, formal specifications, and network topology. Ground-truth artifacts (correct configurations, fault annotations, route/forwarding diffs) are intentionally excluded to prevent contamination of evaluation results; scenario-003 is the only exception (see Dataset format).
Download the dataset once after cloning:

```bash
python download_dataset.py
```

This saves the dataset to `dataset/main_dataset/`. The script accepts `--token` for private access and `--local-dir` to change the destination.
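Once downloaded, a quick check that the layout matches the format described below (see Dataset format):

```python
# Confirm the expected scenario layout (paths from the Dataset format section).
from pathlib import Path

scenarios = sorted(Path("dataset/main_dataset").glob("scenario-*"))
print(f"{len(scenarios)} scenarios downloaded")  # 231 expected
configs = list(scenarios[0].glob("final_configs/configs/*.cfg"))
print(f"{scenarios[0].name}: {len(configs)} faulty router configs")
```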
**Features that require ground truth:** the LLM-as-judge diagnosis scorer (`diagnosis_judge.enabled: True`) compares model outputs against `fault_metrics.csv`, and the `oracle`/`random_with_oracle` context-sampling modes use the correct (initial) configurations. These artifacts are not part of the public dataset.
```bash
# Zero-shot inference with gpt-5.4-mini (public dataset, all scenarios)
cornetto model.provider=openai model.model_name=gpt-5.4-mini-2026-03-17

# Agentic pipeline with Claude (public dataset)
cornetto --config-name agent_config \
    model.provider=anthropic \
    model.model_name=claude-opus-4-5-20251101

# Private evaluation on scenario-003 with oracle context and LLM-as-judge
# (requires ground-truth artifacts — see Dataset section)
cornetto --config-name private_eval model.provider=openai model.model_name=gpt-5.4-mini-2026-03-17
```

`cornetto` is equivalent to `python -m src.benchmark.zero_shot`. Results are written to `results_main/` (path printed at startup).
Three configs are provided in `configs/`:

| Config | Use case |
|---|---|
| `zero_shot.yaml` | Default — public dataset, all pipeline modes except agentic |
| `agent_config.yaml` | Agentic pipeline (`pipeline_mode: agentic`) |
| `private_eval.yaml` | Full evaluation on scenario-003 with oracle context and LLM-as-judge (requires ground-truth artifacts) |
Set `pipeline_mode` in the config or on the command line:

| Mode | Description |
|---|---|
| `naive` | Single-pass zero-shot generation |
| `repair` | Retry on verification failure (can warm-start from a previous naive run) |
| `retrieval` | Retrieval-augmented generation |
| `agentic` | Fully agentic loop with tool access (Batfish verification, config inspection) |

`naive`, `repair`, and `retrieval` use `configs/zero_shot.yaml`. `agentic` uses `configs/agent_config.yaml`.
| Key | Default | Description |
|---|---|---|
| `model.provider` | `zai` | API provider: `openai`, `anthropic`, `google`, `mistral`, `hf`, `ollama` |
| `model.model_name` | `glm-4.7` | Model identifier as expected by the provider |
| `model.max_tokens` | `40000` | Maximum output tokens |
| `model.temperature` | `1.0` | Sampling temperature |
| `model.context_sampling` | `random` | Which router configs to include: `random` (default), `oracle` / `random_with_oracle` (require `initial_configs` — not in public dataset) |
| `model.max_context_tokens` | `full` | Token budget for configs in the prompt (`full` or an integer) |
| `model.batch_api` | `False` | Use batch API (OpenAI/Anthropic) for offline jobs |
| `model.few_shot_mode` | `False` | Enable few-shot examples in the prompt |
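The dotted override syntax in the quick-start commands suggests a Hydra/OmegaConf config. As an illustration only, here is how the `model.*` keys above would nest (reconstructed from the table, not copied from `configs/zero_shot.yaml`):

```python
# Illustrative only: how the dotted model.* overrides nest. Reconstructed from
# the table above; the real configs/zero_shot.yaml may differ in layout.
from omegaconf import OmegaConf

defaults = OmegaConf.create({
    "model": {
        "provider": "zai",
        "model_name": "glm-4.7",
        "max_tokens": 40000,
        "temperature": 1.0,
        "context_sampling": "random",
        "max_context_tokens": "full",
        "batch_api": False,
        "few_shot_mode": False,
    }
})
# A command-line override like model.provider=openai is equivalent to:
overridden = OmegaConf.merge(defaults, OmegaConf.from_dotlist(["model.provider=openai"]))
print(OmegaConf.to_yaml(overridden))
```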
| Key | Default | Description |
|---|---|---|
| `agentic.max_steps` | `30` | Maximum tool-call steps per scenario |
| `agentic.verification_mode` | `per_step` | When to run Batfish: `per_step` or `post_hoc` |
| `agentic.context_prefill` | `True` | Pre-populate agent context with network summary |
| Key | Description |
|---|---|
| `naive_results_dir` | Path to a prior naive run; successful tasks are skipped |
| `naive_success_threshold` | Score threshold for considering a prior result successful (default `1.0`) |
| `naive_retry_parse_failures` | Re-run tasks that had parse errors (default `true`) |
When `diagnosis_judge.enabled: True`, a panel of LLM judges scores the model's fault diagnosis against the ground truth in `fault_metrics.csv`. This feature requires ground-truth artifacts not included in the public dataset. Set `diagnosis_judge.enabled: False` (or omit it) when running against the public dataset.
Set `model.use_vllm: True` and `model.num_gpus: N` to serve a HuggingFace model locally via vLLM.
After downloading, each scenario lives in `dataset/main_dataset/scenario-NNN/`. The public dataset includes only the artifacts needed to run the benchmark without leaking the ground-truth repair:

```
scenario-NNN/
├── final_configs/configs/        # faulty router configs (.cfg) presented to the model
├── specifications/
│   └── specifications_*.csv      # Batfish predicates at multiple sampling rates
└── data_and_metrics/
    └── topology.json             # network graph
```

scenario-003 is an exception: it is a complete example scenario (BGP ASN mismatch, small topology) that also includes `initial_configs/` (correct configs), `data_and_metrics/fault_metrics.csv`, route/forwarding diffs, and fault annotations. See `scenario-003/EXAMPLE.md`.
The specifications CSV has columns `predicate_type`, `node`, `prefix`, `waypoint`, `num_routes`, `sources`. Rows whose `sources` value contains `broken` are the predicates that fail under the faulty configuration; the model's fix is evaluated by checking how many it restores.
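For example, a short pandas sketch that counts the broken predicates in the example scenario (file pattern from the tree above; the sampling-rate suffix varies, so we glob):

```python
# Count the broken predicates in scenario-003's first specification file.
from pathlib import Path
import pandas as pd

spec_files = sorted(Path("dataset/main_dataset/scenario-003/specifications").glob("specifications_*.csv"))
spec = pd.read_csv(spec_files[0])
broken = spec[spec["sources"].astype(str).str.contains("broken")]
print(f"{len(broken)} of {len(spec)} predicates break under the faulty configs")
print(broken["predicate_type"].value_counts())
```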
**Public dataset limitations:** `oracle` and `random_with_oracle` context-sampling require `initial_configs/` (correct configurations), and the LLM-as-judge scorer requires `fault_metrics.csv`. Neither is included in the public dataset. Use `context_sampling: random` and `diagnosis_judge.enabled: False` when running against the public dataset.
If you use Cornetto in your research, please cite:

```bibtex
@misc{protogeros2026benchmarkingllmdrivennetworkconfiguration,
  title={Benchmarking LLM-Driven Network Configuration Repair},
  author={Ioannis Protogeros and Rufat Asadli and Benjamin Hoffman and Laurent Vanbever},
  year={2026},
  eprint={2604.22513},
  archivePrefix={arXiv},
  primaryClass={cs.NI},
  url={https://arxiv.org/abs/2604.22513},
}
```