A reproducible text-to-SQL benchmark that quantifies the value of giving an LLM a small hand-authored semantic-layer markdown alongside the raw schema DDL.
Does adding a small (~9 KB) hand-authored "semantic layer" markdown — describing business conventions, calculated metrics, default fact tables, and snapshot-data caveats — measurably improve text-to-SQL accuracy compared to giving the model only the raw DDL?
Yes — across three top-tier frontier models (Claude Opus 4.7, Claude Sonnet 4.6, OpenAI GPT-5.4) the semantic layer adds +17 to +23 percentage points of analytical accuracy on a 100-question Contoso retail benchmark, with paired McNemar exact p < 0.002 in every case.
With the semantic layer, all three models converge to ~ 68 % accuracy and are statistically indistinguishable. Without it, all three sit at ~ 46–51 % and are also statistically indistinguishable. Whether the model has the semantic layer matters more than which model it is.
The 68 % figure is a floor, not a ceiling. The semantic layer in this repo was hand-authored once to demonstrate the presence of the effect, not iterated to maximize it. Adding named measures for percentile / corr / threshold-rank metrics, worked examples for ambiguous business terms, and a handful of anti-patterns from observed failures will push the +Sem cluster meaningfully higher. We left those gains on the table to keep the comparison fair. Swap in your own MD and rerun — see "Adding a new model" below; the same workflow applies to swapping the MD.
See REPORT.md for the full study.
For each model under test there are two configurations:
| Config | Prompt contents |
|---|---|
| raw | system prompt + `SHOW CREATE TABLE` DDL for `retail.*` + the question |
| +Sem | same as raw, but with `fixtures/semantic_layer.md` prepended |
A single execution pipeline:
- Reference SQL (from the benchmark JSON) is run against ClickHouse to produce the ground-truth rows.
- Each model under test is given the prompt above and produces SQL.
- The candidate SQL is executed against ClickHouse.
- A judge (Claude Sonnet 4.6, different from any model under test) compares candidate rows vs reference rows and emits two verdicts:
  - strict — row-level match
  - analytical — OLAP-aware (accepts grouped breakdowns that aggregate to the reference scalar, decimal-vs-percentage unit equivalents, etc.)
- Aggregate accuracy + 95 % Wilson CI + paired McNemar exact tests across models.
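The statistics in the last step can be sketched directly: a Wilson score interval for each system's accuracy, and an exact two-sided McNemar test on the discordant pairs between two systems. This is a standalone illustration; the repo's scoring scripts may differ in detail.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value on the discordant pairs.

    b = questions only system A got right, c = questions only system B got
    right; under H0 each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For instance, `wilson_ci(68, 100)` gives roughly (0.583, 0.763), and a 2-vs-17 discordant split yields p ≈ 0.0007.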
- Python 3.11
- `uv`
- `claude` CLI (Anthropic Claude Code, ≥ 2.x), authenticated
- `codex` CLI (OpenAI Codex), authenticated — only if you want GPT-5.4 results
- ClickHouse instance with the Contoso retail dataset
```sh
cp .env.example .env
# edit .env with your CLICKHOUSE_* values
uv sync
uv run --env-file .env python scripts/ping_clickhouse.py   # expect "ClickHouse OK"
uv run --env-file .env python scripts/fetch_schema.py      # writes fixtures/retail_schema.sql
```

```sh
RUN_ID=$(date -u +%Y%m%d-%H%M%S)

# 1. Reference (executes the benchmark's expected SQL once)
uv run --env-file .env python scripts/run_reference.py --run-id "$RUN_ID"

# 2. The systems under test (parallelizable — each writes its own dir)
uv run --env-file .env python scripts/run_claude.py --run-id "$RUN_ID" --model claude-opus-4-7
uv run --env-file .env python scripts/run_claude.py --run-id "$RUN_ID" --model claude-opus-4-7 --semantic
uv run --env-file .env python scripts/run_claude.py --run-id "$RUN_ID" --model claude-sonnet-4-6
uv run --env-file .env python scripts/run_claude.py --run-id "$RUN_ID" --model claude-sonnet-4-6 --semantic
uv run --env-file .env python scripts/run_codex.py --run-id "$RUN_ID" --model gpt-5.4
uv run --env-file .env python scripts/run_codex.py --run-id "$RUN_ID" --model gpt-5.4 --semantic

# 3. Score every (system, question) pair with the judge
uv run --env-file .env python scripts/score_run.py --run-id "$RUN_ID"

# 4. Print the report — auto-discovers all systems with scoring data
uv run --env-file .env python scripts/report_run.py --run-id "$RUN_ID"
```

Runs are resumable — each per-question JSON file is written atomically and skipped on re-run. A full 6-system run takes ~25–35 min wall-clock when launched in parallel.
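The resumability rests on a write-temp-then-rename pattern: a crash can never leave a half-written JSON behind, and completed questions are skipped. A minimal sketch of that pattern (the actual runner code may differ):

```python
import json
import os
import tempfile
from pathlib import Path

def write_result_atomic(path: Path, result: dict) -> None:
    """Write the per-question JSON so a crash never leaves a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(result, f)
        os.replace(tmp, path)  # atomic on POSIX: readers see old file or new, never partial
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise

def run_question(out_dir: Path, qid: str) -> None:
    out = out_dir / f"{qid}.json"
    if out.exists():       # resume: completed questions are skipped on re-run
        return
    result = {"qid": qid}  # ... generate SQL, execute, collect tokens/latency ...
    write_result_atomic(out, result)
```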
## Adding a new model

```sh
uv run --env-file .env python scripts/run_claude.py --run-id "$RUN_ID" --model NEW_MODEL_ID
uv run --env-file .env python scripts/run_claude.py --run-id "$RUN_ID" --model NEW_MODEL_ID --semantic
uv run --env-file .env python scripts/score_run.py --run-id "$RUN_ID"
uv run --env-file .env python scripts/report_run.py --run-id "$RUN_ID"
```

The reporter auto-discovers any `runs/<RUN_ID>/scoring/<system>/` directory that has verdict files in it, so new models appear in the headline table and the McNemar matrix without code changes.
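The discovery logic amounts to a directory scan; a sketch of how it plausibly works, assuming verdict files follow the `q*.json` naming shown in the run layout:

```python
from pathlib import Path

def discover_systems(run_id: str, runs_root: Path = Path("runs")) -> list[str]:
    """Any scoring/<system>/ dir containing verdict JSONs counts as a system."""
    scoring = runs_root / run_id / "scoring"
    if not scoring.is_dir():
        return []
    return sorted(
        d.name for d in scoring.iterdir()
        if d.is_dir() and any(d.glob("q*.json"))
    )
```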
All systems are run at medium reasoning effort by default:

- Claude (`claude -p`): `--effort medium`
- Codex (`codex exec`): `-c model_reasoning_effort="medium"`

This is set on each runner so accuracy / latency / token comparisons are not confounded by some models thinking more than others.
```
runs/<RUN_ID>/
├── reference/q001.json … q100.json
├── claude_claude_opus_4_7_raw/q001.json …
├── claude_claude_opus_4_7_semantic/q001.json …
├── claude_claude_sonnet_4_6_raw/q001.json …
├── claude_claude_sonnet_4_6_semantic/q001.json …
├── codex_gpt_5_4_raw/q001.json …
├── codex_gpt_5_4_semantic/q001.json …
└── scoring/
    ├── claude_claude_opus_4_7_raw/q001.json …   # {verdict, analytical_verdict, reason, judge_tokens, ...}
    ├── claude_claude_opus_4_7_semantic/q001.json …
    └── …
```
Each per-question JSON contains the generated SQL, the executed rows, latency, input/output token counts, and any error.
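Because everything is per-question JSON on disk, ad-hoc analysis needs no framework. A sketch that aggregates latency and token usage for one system directory — the field names (`latency_s`, `input_tokens`, `output_tokens`, `error`) are assumptions, not the repo's confirmed schema:

```python
import json
from pathlib import Path

def summarize(system_dir: Path) -> dict:
    """Mean latency and total token usage across a system's per-question JSONs."""
    lat, tok_in, tok_out, errors = [], 0, 0, 0
    for f in sorted(system_dir.glob("q*.json")):
        r = json.loads(f.read_text())
        lat.append(r.get("latency_s", 0.0))       # assumed field names
        tok_in += r.get("input_tokens", 0)
        tok_out += r.get("output_tokens", 0)
        errors += 1 if r.get("error") else 0
    return {
        "n": len(lat),
        "mean_latency_s": sum(lat) / len(lat) if lat else 0.0,
        "input_tokens": tok_in,
        "output_tokens": tok_out,
        "errors": errors,
    }
```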
100 hand-authored natural-language analyst questions across 5 difficulty tiers (20 per tier: simple → medium → complex → advanced → expert). Each has a reference SQL that executes against the Contoso `retail.*` schema in ClickHouse. See `text_to_sql_benchmark.json`.
| Tier | What it tests |
|---|---|
| simple | Single table, COUNT/SUM, basic filter |
| medium | 2–3 table joins, GROUP BY, HAVING, date functions |
| complex | Window functions, MAX(date) per group, snapshot semantics |
| advanced | Multiple CTEs, scenario pivots, cohort/LTV filters, MAPE |
| expert | Ambiguous business terms, recursive CTEs, business-context inference |
`fixtures/semantic_layer.md` (~9 KB / ~2.2 K tokens) is a hand-authored analyst guide covering:

- Which fact table to default to (online vs all-channel sales)
- Snapshot-data caveat for `FactInventory` (use `MAX(DateKey)` per (product, store))
- String-boolean enum values (`IsWorkDay = 'WorkDay'/'WeekEnd'`, etc.)
- Implicit defaults (`PromotionKey = 1` = "No Discount")
- Calculated metric formulas (margin %, return rate, same-store growth, inventory turnover, recency segments)
- Join paths and customer-vs-store geography conventions
- Time anchoring (dataset latest date is 2009-12-31)
- 13 explicit agent-steering rules
Swap this file to test your own semantic-layer authoring against this benchmark.
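The `FactInventory` caveat illustrates why such a file helps: inventory rows are daily snapshots, so aggregating over all dates multiply-counts stock, and nothing in the DDL says so. The selection the semantic layer steers models toward, sketched in Python over in-memory rows (column names are assumptions, not verified against the Contoso schema):

```python
# Illustrative ClickHouse shape the rule points at (column names assumed):
#   SELECT ProductKey, StoreKey, argMax(OnHandQuantity, DateKey)
#   FROM retail.FactInventory
#   GROUP BY ProductKey, StoreKey
def latest_snapshot(rows: list[dict]) -> list[dict]:
    """Keep each (product, store) group's row with the greatest DateKey."""
    latest: dict[tuple, dict] = {}
    for r in rows:
        key = (r["ProductKey"], r["StoreKey"])
        if key not in latest or r["DateKey"] > latest[key]["DateKey"]:
            latest[key] = r
    return list(latest.values())
```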
See REPORT.md for full methodology, the 6-way result table,
within-model McNemar tests, the cross-model McNemar matrix, cost / latency
comparison, and limitations.
MIT.