[FEATURE] [Benchmarks] Pillar 1 Decision Intelligence #571



## Scope

This sub-issue covers six tracks that validate Semantica's core decision-reasoning capabilities. Each track is independent. One contributor owns the entire sub-issue, and all benchmark code lives in `benchmarks/decision_intelligence/`.

- Track 1.1 — Precedent Search Quality: does `find_precedents()` return the right prior decisions at scale?
- Track 1.2 — Causal Chain Traversal: does `get_causal_chain()` correctly identify cause→effect relationships fast enough for production?
- Track 1.3 — Decision Influence Analysis: does `analyze_decision_influence()` propagate compliance changes without silent false positives?
- Track 1.4 — Policy Compliance Classification: does `PolicyEngine` catch every policy violation? False negatives are illegal decisions passing unchecked.
- Track 30 — Decision Influence Accuracy (lightweight): the same logic as 1.3, run against the committed fixture alone with no latency assertions.
- Track 31 — Policy Engine Compliance (lightweight): the same logic as 1.4 run against CUAD, LEDGAR, and TREC CT 2022, no latency assertions.

**Do not touch any file outside `benchmarks/decision_intelligence/` and `datasets/decision_intelligence/`.**

---

## Track 1.1 — Precedent Search Quality

**Why it matters:** If `find_precedents()` returns irrelevant prior decisions, every downstream policy check and audit trail is built on noise. This is the most critical differentiator Semantica has over plain RAG. Scale is equally important — the benchmark must validate that quality holds at 1k, 10k, and 100k decisions.

**APIs under test:** `ContextGraph.find_precedents()` · `AgentContext.find_precedents()` · `ContextRetriever.find_precedents_hybrid()` · `DecisionQuery.find_precedents_hybrid()` · `decision_methods.multi_hop_query()`

Datasets:

- **German Credit** — 1,000 structured lending decisions annotated with ground-truth precedent pairs by domain experts. Covers income, employment, loan purpose, credit history. Used to compute MRR on structured decision retrieval. License: CC BY 4.0. Citation: Hofmann, UCI ML Repository 1994. Download: [archive.ics.uci.edu/dataset/144](https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data). Path: `datasets/decision_intelligence/german_credit/`

- **CUAD** — 510 commercial contracts with 41 clause-type annotations (non-compete, indemnification, termination rights, etc.). Provides nDCG@10 and graph-lift ground truth for precedent retrieval. License: CC BY 4.0. Citation: Hendrycks et al., arxiv:2103.06268. Download: [atticusprojectai.org/cuad](https://www.atticusprojectai.org/cuad). Path: `datasets/decision_intelligence/cuad/`

- **Precedent Scale Fixture** — synthetic decision graphs at 1k, 10k, and 100k. Used exclusively for P95 latency measurement — no recall or precision scoring. Generated once and committed to `fixtures/decision_scale/{1k,10k,100k}.json`. Files over 5 MB go through git-lfs.

Metrics:

- `MRR = (1/|Q|) × Σ_q (1 / rank_q)` — Mean Reciprocal Rank; rank_q is the position of the first relevant result for query q. Evaluated on German Credit.
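- `nDCG@10 = Σ_{i=1..10} (rel_i / log2(i+1)) / ideal_DCG` — Normalized Discounted Cumulative Gain at rank 10; rel_i is the graded relevance of the result at rank i, and ideal_DCG is the same sum over the ideal ordering. Evaluated on CUAD.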
- `Graph lift = nDCG@10(graph-assisted) − nDCG@10(BM25-flat)` — absolute improvement of graph-augmented retrieval over flat BM25. Positive lift proves the graph adds value.
- `P95 latency at 1k / 10k / 100k decisions` — 95th percentile wall time for a `find_precedents()` call at each scale, measured against the scale fixtures.
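
The MRR and nDCG@10 formulas above can be sketched in plain Python as follows. This is a minimal illustration, not Semantica's scoring code: the function names and the input shapes (per-query relevance flags in rank order, graded gains in rank order) are assumptions made for the example.

```python
import math
from typing import Optional, Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """MRR over queries; each inner sequence holds 0/1 relevance flags in rank order."""
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        # Reciprocal rank of the first relevant result; 0 when nothing relevant is returned.
        total += next((1.0 / (i + 1) for i, rel in enumerate(flags) if rel), 0.0)
    return total / len(ranked_relevance)


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    # Ranks are 1-indexed, so the discount at rank i is log2(i + 1).
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains: Sequence[float],
              ideal_gains: Optional[Sequence[float]] = None,
              k: int = 10) -> float:
    """nDCG@k; ideal_gains defaults to the returned gains (an assumption of this sketch)."""
    pool = gains if ideal_gains is None else ideal_gains
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def graph_lift(ndcg_graph: float, ndcg_bm25: float) -> float:
    # Absolute improvement of graph-assisted retrieval over the flat BM25 run.
    return ndcg_graph - ndcg_bm25
```

When the full set of relevance judgments for a query is available, passing it as `ideal_gains` avoids overstating nDCG for result lists that miss relevant items entirely.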

Thresholds:

- MRR on German Credit ≥ 0.70 — [DPR baseline on CUAD, arxiv:2103.06268](https://arxiv.org/abs/2103.06268)
- nDCG@10 on CUAD ≥ 0.65 — BM25 CUAD baseline, Guo et al. 2022
- Graph lift over BM25 ≥ 0.05 — [DPR vs BM25 on NQ, arxiv:2004.04906](https://arxiv.org/abs/2004.04906)
- P95 latency at 10k decisions < 100 ms — Semantica production SLA 2026-Q1
- P95 latency at 100k decisions < 500 ms — Semantica production SLA 2026-Q1
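
One way to obtain the P95 figures that these latency thresholds are checked against is a nearest-rank percentile over repeated timed calls. The sketch below is an assumption about methodology (warm-up count, number of runs, nearest-rank definition); the helper names are hypothetical and `call` stands in for a `find_precedents()` invocation against a scale fixture.

```python
import time
from typing import Callable, List


def p95(samples_ms: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    if not samples_ms:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) computed with integers to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def measure_p95_ms(call: Callable[[], object], runs: int = 200, warmup: int = 10) -> float:
    """Time `call` repeatedly and return the P95 wall time in milliseconds."""
    for _ in range(warmup):
        call()  # warm caches so cold-start cost does not dominate the tail
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return p95(samples)
```

A test would then compare the returned value with the relevant threshold constant, for example `assert measure_p95_ms(query) < 100` at the 10k scale.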

---

## Track 1.2 — Causal Chain Traversal

**Why it matters:** `get_causal_chain()` depth accuracy determines whether root-cause attribution in high-stakes decisions (credit, healthcare, legal) is trustworthy. Without depth-vs-latency testing, a correct but slow chain analysis fails its production SLA before it is ever used.

**APIs under test:** `ContextGraph.get_causal_chain()` · `CausalChainAnalyzer.get_causal_chain()` · `AgentContext.get_causal_chain()` · `decision_methods.get_causal_chain()`

Datasets:

- **ATOMIC 2020** — 1.33 million commonsense causal triples covering 23 If-Then relation types (xIntent, xCause, xEffect, xNeed, xWant). Use a 500-pair subset of cause→effect pairs only. License: CC BY 4.0. Citation: Hwang et al., AAAI 2021. Download: [allenai.org/data/atomic-2020](https://allenai.org/data/atomic-2020). Path: `datasets/decision_intelligence/atomic_subset/`

- **e-CARE** — 21,324 causal Q&A records with free-text explanation annotations for both the correct cause and why it holds. Unlike ATOMIC, e-CARE includes annotated explanations enabling evaluation of explanation completeness alongside causal direction. License: research open. Citation: Du et al., ACL 2022, arxiv:2205.02593. Path: `datasets/decision_intelligence/ecare/`

- **CausalBench** — four task dimensions: cause→effect, effect→cause, both directions with intervention (counterfactual held-out pairs). Nineteen published LLM baselines across GPT-4, Claude, and open-source models. Tests both direction accuracy and intervention accuracy. Citation: NeurIPS 2024. Path: `datasets/decision_intelligence/causalbench/`

- **Depth-vs-latency fixture** — synthetic causal chain benchmarks at depth {3, 5, 8, 10} × graph size {500, 1k, 5k, 10k}. Exclusively for chain P95 latency sweep. Generated once and committed to `fixtures/causal_depth_latency/`.
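
The depth × graph-size sweep described for the latency fixture can be pictured as a 16-cell grid. The sketch below is illustrative only: the parent-map representation and the helper names are assumptions, and the real fixture layout under `fixtures/causal_depth_latency/` may differ.

```python
from itertools import product
from typing import Dict, List, Optional

DEPTHS = (3, 5, 8, 10)
GRAPH_SIZES = (500, 1_000, 5_000, 10_000)


def build_chain_graph(depth: int, size: int) -> Dict[int, Optional[int]]:
    """Parent map where nodes 0..depth form one causal chain (0 is the root cause).

    All remaining nodes are unrelated filler so the graph reaches `size` nodes.
    """
    if size <= depth:
        raise ValueError("graph must have more nodes than the chain is deep")
    parents: Dict[int, Optional[int]] = {node: None for node in range(size)}
    for node in range(1, depth + 1):
        parents[node] = node - 1
    return parents


def causal_chain(parents: Dict[int, Optional[int]], node: int) -> List[int]:
    """Walk parent links from `node` back to the root cause."""
    chain = []
    current = parents[node]
    while current is not None:
        chain.append(current)
        current = parents[current]
    return chain


# Every (depth, graph size) cell the latency sweep would visit.
SWEEP = list(product(DEPTHS, GRAPH_SIZES))
```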

Metrics:

- `Causal recall = |retrieved_ancestors ∩ gold_ancestors| / |gold_ancestors|` — fraction of ground-truth causal ancestors recovered.
- `Causal precision = |retrieved_ancestors ∩ gold_ancestors| / |retrieved_ancestors|` — fraction of retrieved ancestors that are correct.
- `Direction accuracy = |correct_direction_pairs| / |total_pairs|` — fraction of cause→effect pairs where direction is identified correctly.
- `Intervention accuracy = |correct_counterfactual_tests| / |total_withheld_tests|` — fraction of counterfactual held-out pairs correctly classified.
- `Chain P95 latency at depth D, graph size N` — 95th percentile latency for traversing a chain of depth D on a graph of N nodes.
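
The set-based recall and precision definitions above, together with direction accuracy, can be sketched as below. The input shapes are assumptions for illustration: ancestors are node-ID sets, and direction predictions are `(cause, effect)` tuples aligned index-by-index with the gold pairs.

```python
from typing import Sequence, Set, Tuple


def causal_recall_precision(retrieved: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Return (recall, precision) of retrieved causal ancestors against the gold set."""
    hits = len(retrieved & gold)
    recall = hits / len(gold) if gold else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


def direction_accuracy(predicted: Sequence[Tuple[str, str]],
                       gold: Sequence[Tuple[str, str]]) -> float:
    """Fraction of pairs whose predicted (cause, effect) order matches the gold order."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)
```

A reversed pair such as `("smoke", "fire")` against gold `("fire", "smoke")` counts as incorrect, which is what separates direction accuracy from plain pair detection.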

Thresholds:

- Causal direction accuracy ≥ 0.72 — CausalBench LLM median, NeurIPS 2024
- Recall on ATOMIC 2020 subset ≥ 0.80 — KG-RAG literature baseline
- Precision on ATOMIC 2020 subset ≥ 0.85 — KG-RAG literature baseline
- Intervention accuracy on CausalBench ≥ 0.60 — CausalBench weakest published LLM baseline, NeurIPS 2024
- Chain P95 at depth=10, 10k graph < 500 ms — Semantica production SLA 2026-Q1

---

## Track 1.3 — Decision Influence Analysis

**Why it matters:** `analyze_decision_influence()` and `trace_decision_causality()` propagate compliance changes across a decision graph. Silent false positives here mean incorrect policy rollbacks; missed downstream decisions mean compliance changes fail to propagate.

**APIs under test:** `ContextGraph.analyze_decision_influence()` · `ContextGraph.trace_decision_causality()` · `AgentContext.analyze_decision_influence()` · `DecisionQuery.analyze_decision_influence()`

Datasets:

- **Decision influence ground-truth fixture** — 500 synthetic decisions with fully annotated causal edges and 50 influence queries, each with the expected set of downstream decisions that should be flagged at depth 3 and depth 5. Generate it locally once and commit it to `fixtures/decision_influence_ground_truth.json`; do not regenerate it afterwards.

Metrics:

- `Influence recall@D = |found_influenced ∩ gold_influenced| / |gold_influenced|` — fraction of gold downstream decisions found at search depth D. Reported at D=3 and D=5.
- `Spurious rate = |found_influenced \ gold_influenced| / |found_influenced|` — fraction of returned decisions not in the gold set (false positives).
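
To make the two metrics concrete, the sketch below scores a depth-limited traversal against a gold set. The breadth-first search is only a stand-in for `analyze_decision_influence()`, and the adjacency-dict graph format is an assumption rather than the committed fixture's schema.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_within_depth(edges: Dict[str, List[str]], source: str, max_depth: int) -> Set[str]:
    """Decisions reachable from `source` in at most `max_depth` hops (stand-in traversal)."""
    seen = {source}
    found: Set[str] = set()
    queue = deque([(source, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand past the requested search depth
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                found.add(child)
                queue.append((child, depth + 1))
    return found


def influence_recall(found: Set[str], gold: Set[str]) -> float:
    # An empty gold set is treated as fully recalled; that convention is an assumption.
    return len(found & gold) / len(gold) if gold else 1.0


def spurious_rate(found: Set[str], gold: Set[str]) -> float:
    # Share of returned decisions that the gold set does not contain.
    return len(found - gold) / len(found) if found else 0.0
```

Reporting recall at both D=3 and D=5 means running the traversal twice per query and comparing each result with the gold set annotated for that depth.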

Thresholds:

- Influence recall@3 ≥ 0.85 — internal ground-truth fixture
- Influence recall@5 ≥ 0.75 — internal ground-truth fixture
- Spurious rate ≤ 0.10 — Semantica production SLA

---

## Track 1.4 — Policy Compliance Classification

**Why it matters:** `PolicyEngine.get_applicable_policies()` and `check_decision_compliance()` are high-stakes. A missed violation (false negative) is worse than a false alarm — it means an illegal decision passes unchecked.

**APIs under test:** `PolicyEngine.get_applicable_policies()` · `PolicyEngine.check_decision_compliance()` · `ContextGraph.find_applicable_policies()`

Datasets:

- **CUAD** — 510 contracts, 41 clause types covering non-compete, indemnification, termination rights, limitation of liability, and 37 other clause categories. For compliance testing, evaluate whether PolicyEngine correctly identifies which clauses apply to a given decision context. License: CC BY 4.0. Citation: Hendrycks et al., arxiv:2103.06268. Path: `datasets/decision_intelligence/cuad/`

- **LEDGAR** — 60,000 legal provisions from SEC filings, each labelled with multi-label compliance tags across 100 categories. Used for clause-level F1 evaluation. License: research open. Citation: Tuggener et al., LREC 2020. Download: [metatext.io/datasets/ledgar](https://metatext.io/datasets/ledgar). Path: `datasets/decision_intelligence/ledgar/`

- **TREC Clinical Trials 2022** — 75 patient topics with ground-truth trial eligibility labels. Tests compliance classification on structured patient-trial matching. Citation: TREC 2022. Download: [trec.nist.gov](https://trec.nist.gov). Path: `datasets/decision_intelligence/trec_ct_2022/`

Metrics:

- `Compliance accuracy = (TP + TN) / total` — fraction of compliance decisions classified correctly.
- `False negative rate (FNR) = FN / (FN + TP)` — fraction of true violations missed. **Hard gate metric.** Missed violations = illegal decisions pass unchecked.
- `False positive rate = FP / (FP + TN)` — fraction of compliant decisions incorrectly flagged.
- `Clause-level F1 = 2 × precision × recall / (precision + recall)` — harmonic mean of clause detection precision and recall on LEDGAR.
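
All four metrics follow from a single confusion matrix in which the positive class is "violation". The sketch below is a minimal illustration; the `Confusion` class is hypothetical and not part of Semantica.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts with 'violation' as the positive class."""
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0

    @property
    def false_negative_rate(self) -> float:
        # Share of true violations that were missed (the hard-gate metric).
        positives = self.fn + self.tp
        return self.fn / positives if positives else 0.0

    @property
    def false_positive_rate(self) -> float:
        negatives = self.fp + self.tn
        return self.fp / negatives if negatives else 0.0

    @property
    def f1(self) -> float:
        precision = self.tp / (self.tp + self.fp) if (self.tp + self.fp) else 0.0
        recall = self.tp / (self.tp + self.fn) if (self.tp + self.fn) else 0.0
        denom = precision + recall
        return 2 * precision * recall / denom if denom else 0.0
```

For example, `Confusion(tp=90, fp=10, tn=95, fn=5)` gives an accuracy of 0.925 but a false negative rate of 5/95 ≈ 0.053, which would fail a 0.05 gate even though accuracy clears 0.88.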

Thresholds:

- Compliance accuracy ≥ 0.88 — [CUAD clause detection baseline, arxiv:2103.06268](https://arxiv.org/abs/2103.06268)
- **False negative rate ≤ 0.05 — HARD GATE.** Internal compliance SLA. Any value above 0.05 fails the test immediately with an explicit error message naming the regulatory implication.
- Clause-level F1 ≥ 0.75 — LEDGAR multi-label baseline, Tuggener et al. 2020

---

## Track 30 — Decision Influence Accuracy (lightweight)

**Why it matters:** The full Track 1.3 may require scale fixtures. Track 30 runs only the recall and spurious-rate metrics against the committed ground-truth fixture, with no latency assertions and no large downloads.

**APIs under test:** `ContextGraph.analyze_decision_influence()` · `ContextGraph.trace_decision_causality()`

Datasets: same as Track 1.3 — `fixtures/decision_influence_ground_truth.json`.

Thresholds: same as Track 1.3 — recall@3 ≥ 0.85, recall@5 ≥ 0.75, spurious rate ≤ 0.10.

Implementation note: Track 30 shares test functions with Track 1.3 via `conftest.py`. The only difference is that Track 30 skips the scale latency assertions.

---

## Track 31 — Policy Engine Compliance (lightweight)

**Why it matters:** False negatives in compliance are the most dangerous production failure. Track 31 runs the same compliance classification test as Track 1.4 independently — classification regressions are caught even when the retrieval track is skipped.

**APIs under test:** `PolicyEngine.get_applicable_policies()` · `PolicyEngine.check_decision_compliance()`

Datasets: CUAD (510 contracts), LEDGAR (60k provisions), TREC CT 2022 (75 topics) — same as Track 1.4.

Thresholds: same as Track 1.4 — compliance accuracy ≥ 0.88, FNR ≤ 0.05 (hard gate), clause F1 ≥ 0.75.

---

## Fixtures to generate

Write these scripts, run them locally once, and commit the output. Do not regenerate fixtures after committing.

- `scripts/generate_decision_influence_gt.py` — generates `fixtures/decision_influence_ground_truth.json`. 500 synthetic decisions, 50 influence queries with known downstream sets.
- `scripts/generate_precedent_scale.py` — generates `fixtures/decision_scale/{1k,10k,100k}.json`. Route 10k and 100k through git-lfs (files exceed 5 MB).
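
A generator script of this kind would typically use a fixed seed so that a run is reproducible before the output is committed. The sketch below is a hypothetical shape for such a script: the JSON schema, ID format, seed value, and function names are all assumptions, not the committed fixture format.

```python
import json
import random
from typing import Dict, List


def generate_decision_graph(n_decisions: int, max_children: int = 3, seed: int = 42) -> Dict[str, List[str]]:
    """Seeded synthetic decision DAG: each decision points only to later decisions."""
    rng = random.Random(seed)
    ids = [f"d{i:06d}" for i in range(n_decisions)]
    edges: Dict[str, List[str]] = {}
    for i, src in enumerate(ids):
        remaining = n_decisions - (i + 1)
        k = min(remaining, rng.randint(0, max_children))
        # Sampling from a range avoids copying the ID list for every node.
        targets = rng.sample(range(i + 1, n_decisions), k)
        edges[src] = sorted(ids[t] for t in targets)
    return edges


def to_fixture_json(edges: Dict[str, List[str]], seed: int) -> str:
    # sort_keys keeps the serialized file stable across runs for clean diffs.
    return json.dumps({"seed": seed, "edges": edges}, sort_keys=True, indent=2)
```

Because edges only point forward, the graph is acyclic by construction. Python does not guarantee identical `random` output across interpreter versions, which is one more reason to commit the generated file rather than regenerate it.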

---

## Files to create

```text
benchmarks/decision_intelligence/
  conftest.py
  test_precedent_search.py
  test_causal_chain.py
  test_decision_influence.py
  test_policy_compliance.py

scripts/
  generate_decision_influence_gt.py
  generate_precedent_scale.py

datasets/decision_intelligence/
  README.md
```

`conftest.py` — define all threshold constants at the top as `THRESHOLD_<METRIC>_<DATASET> = <value>`. Implement `session`-scoped pytest fixtures for each dataset. Do not load datasets inside test functions.

`test_precedent_search.py` — one function per metric: `test_mrr_german_credit`, `test_ndcg10_cuad`, `test_graph_lift_over_bm25`, `test_p95_latency_10k`, `test_p95_latency_100k`.

`test_causal_chain.py` — `test_direction_accuracy_causalbench`, `test_recall_atomic`, `test_precision_atomic`, `test_intervention_accuracy`, `test_chain_p95_latency_depth10`.

`test_decision_influence.py` — `test_influence_recall_at_3`, `test_influence_recall_at_5`, `test_spurious_rate`. Tracks 1.3 and 30 share these functions; Track 30 skips latency.

`test_policy_compliance.py` — `test_compliance_accuracy_cuad`, `test_false_negative_rate_hard_gate`, `test_clause_f1_ledgar`. Tracks 1.4 and 31 share these functions.

`datasets/decision_intelligence/README.md` — one section per dataset: source URL, download command, expected directory structure, license.

---

## Key implementation patterns

```python
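# conftest.py — threshold constants and session fixtures
import pytest

# load_german_credit and load_cuad are project-local loader helpers (not shown here).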
THRESHOLD_MRR_GERMAN_CREDIT = 0.70
THRESHOLD_NDCG10_CUAD = 0.65
THRESHOLD_GRAPH_LIFT = 0.05
THRESHOLD_P95_10K_MS = 100
THRESHOLD_P95_100K_MS = 500
THRESHOLD_DIRECTION_ACCURACY = 0.72
THRESHOLD_ATOMIC_RECALL = 0.80
THRESHOLD_ATOMIC_PRECISION = 0.85
THRESHOLD_INTERVENTION_ACCURACY = 0.60
THRESHOLD_CHAIN_P95_MS = 500
THRESHOLD_INFLUENCE_RECALL_3 = 0.85
THRESHOLD_INFLUENCE_RECALL_5 = 0.75
THRESHOLD_SPURIOUS_RATE = 0.10
THRESHOLD_COMPLIANCE_ACCURACY = 0.88
THRESHOLD_FALSE_NEGATIVE_RATE = 0.05   # HARD GATE
THRESHOLD_CLAUSE_F1 = 0.75

@pytest.fixture(scope="session")
def german_credit_dataset():
    return load_german_credit("datasets/decision_intelligence/german_credit/")

@pytest.fixture(scope="session")
def cuad_dataset():
    return load_cuad("datasets/decision_intelligence/cuad/")
```

```python
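# test_policy_compliance.py — hard gate with explicit message
# Assumes conftest.py's directory is importable (pytest's default "prepend" import mode);
# evaluate_compliance is a project-local helper not shown here.
from conftest import THRESHOLD_FALSE_NEGATIVE_RATE
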
def test_false_negative_rate_hard_gate(cuad_dataset, policy_engine):
    results = evaluate_compliance(policy_engine, cuad_dataset)
    fnr = results["false_negative_rate"]
    assert fnr <= THRESHOLD_FALSE_NEGATIVE_RATE, (
        f"Policy FNR {fnr:.3f} exceeds hard gate {THRESHOLD_FALSE_NEGATIVE_RATE}. "
        f"Missed violations = illegal decisions pass unchecked."
    )
```

```python
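# test_precedent_search.py — one function per metric
# Assumes conftest.py's directory is importable (pytest's default "prepend" import mode);
# evaluate_ndcg10_bm25 and evaluate_ndcg10_graph are project-local helpers not shown here.
from conftest import THRESHOLD_GRAPH_LIFT
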
def test_graph_lift_over_bm25(cuad_dataset, context_retriever):
    bm25_ndcg = evaluate_ndcg10_bm25(cuad_dataset)
    graph_ndcg = evaluate_ndcg10_graph(context_retriever, cuad_dataset)
    lift = graph_ndcg - bm25_ndcg
    assert lift >= THRESHOLD_GRAPH_LIFT, (
        f"Graph lift {lift:.4f} < {THRESHOLD_GRAPH_LIFT}. "
        f"Graph-augmented retrieval must outperform flat BM25."
    )
```

---

## Notes

- TACRED (106k sentences, LDC license) — do not include here; it belongs to Sub-issue 4 gated with `TACRED_AVAILABLE=1`.
- The `detect_duplicates()` regression guard belongs to Sub-issue 4.
- Tracks 30 and 31 are lightweight versions of 1.3 and 1.4. They share all fixtures and threshold constants via `conftest.py`.

---

## Checklist

- [ ] `datasets/decision_intelligence/german_credit/` downloaded
- [ ] `datasets/decision_intelligence/cuad/` downloaded
- [ ] `datasets/decision_intelligence/ledgar/` downloaded
- [ ] `datasets/decision_intelligence/trec_ct_2022/` downloaded
- [ ] `datasets/decision_intelligence/atomic_subset/` downloaded (500-pair subset)
- [ ] `datasets/decision_intelligence/ecare/` downloaded
- [ ] `datasets/decision_intelligence/causalbench/` downloaded
- [ ] `fixtures/decision_influence_ground_truth.json` generated and committed
- [ ] `fixtures/decision_scale/{1k,10k,100k}.json` generated and committed (10k, 100k via git-lfs)
- [ ] `conftest.py` — all 16 threshold constants defined, all dataset fixtures implemented
- [ ] `test_precedent_search.py` — MRR, nDCG@10, graph lift, P95 at 10k and 100k
- [ ] `test_causal_chain.py` — direction accuracy, recall, precision, intervention accuracy, depth-latency sweep
- [ ] `test_decision_influence.py` — recall@3, recall@5, spurious rate (Tracks 1.3 + 30)
- [ ] `test_policy_compliance.py` — accuracy, FNR hard gate, clause F1 (Tracks 1.4 + 31)
- [ ] All tests pass: `pytest benchmarks/decision_intelligence/ -p no:langsmith`
- [ ] No test requires a real LLM (no `SEMANTICA_REAL_LLM` dependency)
- [ ] `datasets/decision_intelligence/README.md` written
- [ ] PR title: `benchmarks: decision intelligence tracks 1.1–1.4, 30–31`

