
[FEATURE] [Benchmarks] Pillar 1 Decision Intelligence #571

@KaifAhmad1


Scope

Six tracks that validate Semantica's core decision reasoning capabilities. Each track is independent; the entire sub-issue is owned by one contributor and lives entirely in benchmarks/decision_intelligence/.

  • Track 1.1 — Precedent Search Quality: does find_precedents() return the right prior decisions at scale?
  • Track 1.2 — Causal Chain Traversal: does get_causal_chain() correctly identify cause→effect relationships fast enough for production?
  • Track 1.3 — Decision Influence Analysis: does analyze_decision_influence() propagate compliance changes without silent false positives?
  • Track 1.4 — Policy Compliance Classification: does PolicyEngine catch every policy violation? False negatives are illegal decisions passing unchecked.
  • Track 30 — Decision Influence (lightweight): the same logic as 1.3 run against the committed fixture alone, no latency assertions.
  • Track 31 — Policy Engine Compliance (lightweight): the same logic as 1.4 run against CUAD, LEDGAR, and TREC CT 2022, no latency assertions.

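Do not touch any file outside benchmarks/decision_intelligence/ and datasets/decision_intelligence/, other than the generator scripts and fixtures listed under "Fixtures to generate" and "Files to create".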


Track 1.1 — Precedent Search Quality

Why it matters: If find_precedents() returns irrelevant prior decisions, every downstream policy check and audit trail is built on noise. This is the most critical differentiator Semantica has over plain RAG. Scale is equally important — the benchmark must validate that quality holds at 1k, 10k, and 100k decisions.

APIs under test: ContextGraph.find_precedents() · AgentContext.find_precedents() · ContextRetriever.find_precedents_hybrid() · DecisionQuery.find_precedents_hybrid() · decision_methods.multi_hop_query()

Datasets:

  • German Credit — 1,000 structured lending decisions annotated with ground-truth precedent pairs by domain experts. Covers income, employment, loan purpose, credit history. Used to compute MRR on structured decision retrieval. License: CC BY 4.0. Citation: Hofmann, UCI ML Repository 1994. Download: archive.ics.uci.edu/dataset/144. Path: datasets/decision_intelligence/german_credit/

  • CUAD — 510 commercial contracts with 41 clause-type annotations (non-compete, indemnification, termination rights, etc.). Provides nDCG@10 and graph-lift ground truth for precedent retrieval. License: CC BY 4.0. Citation: Hendrycks et al., arxiv:2103.06268. Download: atticusprojectai.org/cuad. Path: datasets/decision_intelligence/cuad/

  • Precedent Scale Fixture — synthetic decision graphs at 1k, 10k, and 100k. Used exclusively for P95 latency measurement — no recall or precision scoring. Generated once and committed to fixtures/decision_scale/{1k,10k,100k}.json. Files over 5 MB go through git-lfs.

Metrics:

  • MRR = (1/|Q|) × Σ_q (1 / rank_q) — Mean Reciprocal Rank; rank_q is the position of the first relevant result for query q. Evaluated on German Credit.
  • nDCG@10 = Σ_i (rel_i / log2(i+1)) / ideal_DCG — Normalized Discounted Cumulative Gain at rank 10. Evaluated on CUAD.
  • Graph lift = nDCG@10(graph-assisted) − nDCG@10(BM25-flat) — absolute improvement of graph-augmented retrieval over flat BM25. Positive lift proves the graph adds value.
  • P95 latency at 1k / 10k / 100k decisions — 95th percentile wall time for a find_precedents() call at each scale, measured against the scale fixtures.
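
As a rough illustration of the two ranking metrics above, the sketch below computes MRR and nDCG@10 from plain Python lists. The function names and input shapes are assumptions made for illustration and are not part of the Semantica API. Ranks are 1-based, a query with no relevant result contributes 0 to MRR, and the ideal DCG is taken over all judged relevance grades for the query, with the sum running over ranks 1 to 10.

```python
from math import log2
from typing import Optional, Sequence


def mean_reciprocal_rank(first_relevant_ranks: Sequence[Optional[int]]) -> float:
    """MRR over queries. Each entry is the 1-based rank of the first
    relevant result for one query, or None if nothing relevant was returned."""
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank is not None)
    return total / len(first_relevant_ranks)


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k positions (1-based index i)."""
    return sum(rel / log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances: Sequence[float],
              judged_relevances: Sequence[float],
              k: int = 10) -> float:
    """nDCG@k: DCG of the returned ranking divided by the DCG of the best
    possible ordering of all judged relevance grades for the query."""
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # assumption: a query with no relevant items scores 0
    return dcg_at_k(ranked_relevances, k) / ideal


def graph_lift(ndcg_graph: float, ndcg_bm25: float) -> float:
    """Absolute nDCG@10 improvement of graph-assisted retrieval over flat BM25."""
    return ndcg_graph - ndcg_bm25
```

With these definitions, a ranking that places the only relevant item second scores 1 / log2(3) on nDCG@10, and three queries with first-relevant ranks 1, 2, and none yield an MRR of 0.5.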

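Thresholds (values taken from the conftest.py constants under "Key implementation patterns"; no limit is defined there for the 1k fixture):

  • MRR on German Credit ≥ 0.70
  • nDCG@10 on CUAD ≥ 0.65
  • Graph lift over flat BM25 ≥ 0.05
  • P95 latency at 10k decisions: 100 ms limit
  • P95 latency at 100k decisions: 500 ms limit
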

Track 1.2 — Causal Chain Traversal

Why it matters: get_causal_chain() depth accuracy determines whether root-cause attribution in high-stakes decisions (credit, healthcare, legal) is trustworthy. Without depth-vs-latency testing, a correct but slow chain analysis fails its production SLA before it is ever used.

APIs under test: ContextGraph.get_causal_chain() · CausalChainAnalyzer.get_causal_chain() · AgentContext.get_causal_chain() · decision_methods.get_causal_chain()

Datasets:

  • ATOMIC 2020 — 1.33 million commonsense causal triples covering 23 relation types (for example xIntent, xReason, xEffect, xNeed, xWant). Use a 500-pair subset of cause→effect pairs only. License: CC BY 4.0. Citation: Hwang et al., AAAI 2021. Download: allenai.org/data/atomic-2020. Path: datasets/decision_intelligence/atomic_subset/

  • e-CARE — 21,324 causal Q&A records with free-text explanation annotations for both the correct cause and why it holds. Unlike ATOMIC, e-CARE includes annotated explanations enabling evaluation of explanation completeness alongside causal direction. License: research open. Citation: Du et al., ACL 2022, arxiv:2205.02593. Path: datasets/decision_intelligence/ecare/

  • CausalBench — four task dimensions: cause→effect, effect→cause, and each of those two directions with intervention (counterfactual held-out pairs). Ships with nineteen published LLM baselines spanning GPT-4, Claude, and open-source models. Tests both direction accuracy and intervention accuracy. Citation: NeurIPS 2024. Path: datasets/decision_intelligence/causalbench/

  • Depth-vs-latency fixture — synthetic causal chain benchmarks at depth {3, 5, 8, 10} × graph size {500, 1k, 5k, 10k}. Exclusively for chain P95 latency sweep. Generated once and committed to fixtures/causal_depth_latency/.

Metrics:

  • Causal recall = |retrieved_ancestors ∩ gold_ancestors| / |gold_ancestors| — fraction of ground-truth causal ancestors recovered.
  • Causal precision = |retrieved_ancestors ∩ gold_ancestors| / |retrieved_ancestors| — fraction of retrieved ancestors that are correct.
  • Direction accuracy = |correct_direction_pairs| / |total_pairs| — fraction of cause→effect pairs where direction is identified correctly.
  • Intervention accuracy = |correct_counterfactual_tests| / |total_withheld_tests| — fraction of counterfactual held-out pairs correctly classified.
  • Chain P95 latency at depth D, graph size N — 95th percentile latency for traversing a chain of depth D on a graph of N nodes.
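
The set-based metrics above reduce to a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, ancestors are represented as string IDs, and causal pairs as (cause, effect) tuples. Returning 0.0 for an empty gold or retrieved set is an assumption; the issue does not say how empty sets should be handled.

```python
from typing import Iterable, Sequence, Tuple


def causal_recall_precision(retrieved: Iterable[str],
                            gold: Iterable[str]) -> Tuple[float, float]:
    """Return (recall, precision) of retrieved causal ancestors against gold."""
    retrieved_set, gold_set = set(retrieved), set(gold)
    hits = len(retrieved_set & gold_set)
    recall = hits / len(gold_set) if gold_set else 0.0
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    return recall, precision


def direction_accuracy(predicted: Sequence[Tuple[str, str]],
                       gold: Sequence[Tuple[str, str]]) -> float:
    """Fraction of pairs whose predicted (cause, effect) order matches gold.
    The two sequences must be aligned pair by pair."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and aligned")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)
```

Intervention accuracy has the same shape as direction accuracy: a count of correctly classified held-out tests divided by the total number of held-out tests.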

Thresholds:

  • Causal direction accuracy ≥ 0.72 — CausalBench LLM median, NeurIPS 2024
  • Recall on ATOMIC 2020 subset ≥ 0.80 — KG-RAG literature baseline
  • Precision on ATOMIC 2020 subset ≥ 0.85 — KG-RAG literature baseline
  • Intervention accuracy on CausalBench ≥ 0.60 — CausalBench weakest published LLM baseline, NeurIPS 2024
  • Chain P95 at depth=10, 10k graph < 500 ms — Semantica production SLA 2026-Q1
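
For the latency gate, one possible way to obtain a P95 figure is to time repeated calls and take a nearest-rank percentile. The sketch below assumes that percentile definition, a hypothetical `measure_p95_ms` helper, and a zero-argument callable wrapping the API under test; the issue itself does not fix the number of runs or the percentile method.

```python
import time
from typing import Callable, List


def p95(samples_ms: List[float]) -> float:
    """Nearest-rank 95th percentile of a list of latency samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def measure_p95_ms(call: Callable[[], object], runs: int = 100, warmup: int = 5) -> float:
    """Time `call` repeatedly and return the P95 wall time in milliseconds.
    Warm-up iterations are executed first and not recorded."""
    for _ in range(warmup):
        call()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return p95(samples)
```

A test could then wrap a single traversal in a lambda and compare the result against THRESHOLD_CHAIN_P95_MS.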

Track 1.3 — Decision Influence Analysis

Why it matters: analyze_decision_influence() and trace_decision_causality() propagate compliance changes across a decision graph. Silent false positives here mean incorrect policy rollbacks; missed downstream decisions mean compliance changes fail to propagate.

APIs under test: ContextGraph.analyze_decision_influence() · ContextGraph.trace_decision_causality() · AgentContext.analyze_decision_influence() · DecisionQuery.analyze_decision_influence()

Datasets:

  • Decision influence ground-truth fixture — 500 synthetic decisions with fully annotated causal edges, plus 50 influence queries, each paired with the expected set of downstream decisions that should be flagged at depth 3 and at depth 5. Generate it once locally and commit it to fixtures/decision_influence_ground_truth.json; do not regenerate it after committing.
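
One way the expected downstream sets at depth 3 and depth 5 could be derived from annotated causal edges is a breadth-first traversal bounded by depth. The sketch below assumes an adjacency-list representation (decision ID to list of directly influenced decision IDs); that shape and the function name are illustrative and not the fixture's actual schema.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_within_depth(edges: Dict[str, List[str]],
                            start: str,
                            max_depth: int) -> Set[str]:
    """Return every decision reachable from `start` in at most `max_depth`
    causal hops, excluding `start` itself."""
    seen = {start}
    found: Set[str] = set()
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                found.add(child)
                frontier.append((child, depth + 1))
    return found
```

Because visited nodes are tracked, the traversal terminates even if the synthetic graph happens to contain a cycle.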

Metrics:

  • Influence recall@D = |found_influenced ∩ gold_influenced| / |gold_influenced| — fraction of gold downstream decisions found at search depth D. Reported at D=3 and D=5.
  • Spurious rate = |found_influenced \ gold_influenced| / |found_influenced| — fraction of returned decisions not in the gold set (false positives).
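
The two influence metrics can be sketched per query and then averaged over the influence queries. Macro-averaging across queries is an assumption here, since the issue does not state how per-query values are aggregated, and the function names are hypothetical. Note that the spurious rate of a query equals 1 minus its precision.

```python
from typing import List, Set, Tuple


def influence_recall(found: Set[str], gold: Set[str]) -> float:
    """Fraction of gold downstream decisions that were found."""
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(found & gold) / len(gold)


def spurious_rate(found: Set[str], gold: Set[str]) -> float:
    """Fraction of returned decisions that are not in the gold set."""
    if not found:
        return 0.0  # assumption: an empty result has no false positives
    return len(found - gold) / len(found)


def macro_average(queries: List[Tuple[Set[str], Set[str]]]) -> Tuple[float, float]:
    """Average recall and spurious rate over (found, gold) pairs."""
    if not queries:
        raise ValueError("need at least one query")
    recalls = [influence_recall(found, gold) for found, gold in queries]
    spurious = [spurious_rate(found, gold) for found, gold in queries]
    return sum(recalls) / len(queries), sum(spurious) / len(queries)
```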

Thresholds:

  • Influence recall@3 ≥ 0.85 — internal ground-truth fixture
  • Influence recall@5 ≥ 0.75 — internal ground-truth fixture
  • Spurious rate ≤ 0.10 — Semantica production SLA

Track 1.4 — Policy Compliance Classification

Why it matters: PolicyEngine.get_applicable_policies() and check_decision_compliance() are high-stakes. A missed violation (false negative) is worse than a false alarm — it means an illegal decision passes unchecked.

APIs under test: PolicyEngine.get_applicable_policies() · PolicyEngine.check_decision_compliance() · ContextGraph.find_applicable_policies()

Datasets:

  • CUAD — 510 contracts, 41 clause types covering non-compete, indemnification, termination rights, limitation of liability, and 37 other clause categories. For compliance testing, evaluate whether PolicyEngine correctly identifies which clauses apply to a given decision context. License: CC BY 4.0. Citation: Hendrycks et al., arxiv:2103.06268. Path: datasets/decision_intelligence/cuad/

  • LEDGAR — 60,000 legal provisions from SEC filings, each labelled with multi-label compliance tags across 100 categories. Used for clause-level F1 evaluation. License: research open. Citation: Tuggener et al., LREC 2020. Download: metatext.io/datasets/ledgar. Path: datasets/decision_intelligence/ledgar/

  • TREC Clinical Trials 2022 — 75 patient topics with ground-truth trial eligibility labels. Tests compliance classification on structured patient-trial matching. Citation: TREC 2022. Download: trec.nist.gov. Path: datasets/decision_intelligence/trec_ct_2022/

Metrics:

  • Compliance accuracy = (TP + TN) / total — fraction of compliance decisions classified correctly.
  • False negative rate (FNR) = FN / (FN + TP) — fraction of true violations missed. Hard gate metric. Missed violations = illegal decisions pass unchecked.
  • False positive rate = FP / (FP + TN) — fraction of compliant decisions incorrectly flagged.
  • Clause-level F1 = 2 × precision × recall / (precision + recall) — harmonic mean of clause detection precision and recall on LEDGAR.
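
The four classification metrics above all derive from one confusion matrix. The sketch below assumes binary labels where True means "violation" (the positive class) and uses hypothetical names. It computes a single binary F1; how clause-level F1 is averaged across LEDGAR's categories (micro or macro) is not specified in this issue and is left out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Confusion:
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0

    @property
    def false_negative_rate(self) -> float:
        positives = self.fn + self.tp
        return self.fn / positives if positives else 0.0

    @property
    def false_positive_rate(self) -> float:
        negatives = self.fp + self.tn
        return self.fp / negatives if negatives else 0.0

    @property
    def f1(self) -> float:
        # 2PR / (P + R) simplifies to 2TP / (2TP + FP + FN)
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0


def confusion_from_labels(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Confusion:
    """Count TP/FP/TN/FN for aligned gold and predicted violation labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must be aligned")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return Confusion(tp=tp, fp=fp, tn=tn, fn=fn)
```

With three true violations of which one is missed, the false negative rate is 1/3, which would fail the 0.05 hard gate regardless of overall accuracy.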

Thresholds:

  • Compliance accuracy ≥ 0.88 — CUAD clause detection baseline, arxiv:2103.06268
  • False negative rate ≤ 0.05 — HARD GATE. Internal compliance SLA. Any value above 0.05 fails the test immediately with an explicit error message naming the regulatory implication.
  • Clause-level F1 ≥ 0.75 — LEDGAR multi-label baseline, Tuggener et al. 2020

Track 30 — Decision Influence Accuracy (lightweight)

Why it matters: The full Track 1.3 may require scale fixtures. Track 30 runs only the recall and spurious-rate metrics against the committed ground-truth fixture, with no latency assertions and no large downloads.

APIs under test: ContextGraph.analyze_decision_influence() · ContextGraph.trace_decision_causality()

Datasets: same as Track 1.3 — fixtures/decision_influence_ground_truth.json.

Thresholds: same as Track 1.3 — recall@3 ≥ 0.85, recall@5 ≥ 0.75, spurious rate ≤ 0.10.

Implementation note: Track 30 shares test functions with Track 1.3 via conftest.py. The only difference is that Track 30 skips the scale latency assertions.
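
One possible way to let both tracks share test functions while only the full track asserts on latency is an environment-variable switch. The variable name SEMANTICA_BENCH_LIGHTWEIGHT and the helper below are hypothetical; the issue does not say how the lightweight track is selected.

```python
import os
from typing import Mapping, Optional

# Hypothetical switch: set to "1" to run the lightweight track (Track 30).
LIGHTWEIGHT_ENV_VAR = "SEMANTICA_BENCH_LIGHTWEIGHT"


def latency_assertions_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return False when the lightweight track is selected, True otherwise.
    `environ` defaults to the process environment and can be injected in tests."""
    env = os.environ if environ is None else environ
    return env.get(LIGHTWEIGHT_ENV_VAR, "0") != "1"
```

In a test module, the result could be passed as the condition of `pytest.mark.skipif(not latency_assertions_enabled(), reason="lightweight track")` on any latency test, leaving the recall and spurious-rate tests untouched.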


Track 31 — Policy Engine Compliance (lightweight)

Why it matters: False negatives in compliance are the most dangerous production failure. Track 31 runs the same compliance classification test as Track 1.4 independently — classification regressions are caught even when the retrieval track is skipped.

APIs under test: PolicyEngine.get_applicable_policies() · PolicyEngine.check_decision_compliance()

Datasets: CUAD (510 contracts), LEDGAR (60k provisions), TREC CT 2022 (75 topics) — same as Track 1.4.

Thresholds: same as Track 1.4 — compliance accuracy ≥ 0.88, FNR ≤ 0.05 (hard gate), clause F1 ≥ 0.75.


Fixtures to generate

Write these scripts, run each one once locally, and commit the output. Do not regenerate fixtures after committing.

  • scripts/generate_decision_influence_gt.py — generates fixtures/decision_influence_ground_truth.json. 500 synthetic decisions, 50 influence queries with known downstream sets.
  • scripts/generate_precedent_scale.py — generates fixtures/decision_scale/{1k,10k,100k}.json. Route 10k and 100k through git-lfs (files exceed 5 MB).
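
Since the fixtures are generated once and committed, a seeded generator keeps the output reproducible if the script ever has to be audited. The sketch below is a minimal illustration under assumed names: the JSON keys ("decisions", "causal_edges", "cause", "effect"), the seed, and the parent-sampling scheme are all hypothetical and not the schema the real scripts must use.

```python
import json
import random
from typing import Dict, List


def generate_decision_graph(n_decisions: int,
                            max_parents: int = 3,
                            seed: int = 42) -> Dict[str, List[dict]]:
    """Build a synthetic acyclic decision graph. Each decision after the first
    gets 1..max_parents causes drawn from earlier decisions, so every edge
    points from a lower index to a higher one."""
    rng = random.Random(seed)  # local RNG: no dependence on global state
    decisions = [{"id": f"d{i}"} for i in range(n_decisions)]
    edges = []
    for i in range(1, n_decisions):
        k = rng.randint(1, min(max_parents, i))
        for parent in sorted(rng.sample(range(i), k)):
            edges.append({"cause": f"d{parent}", "effect": f"d{i}"})
    return {"decisions": decisions, "causal_edges": edges}


def to_json(graph: Dict[str, List[dict]]) -> str:
    """Serialize with sorted keys and compact separators for stable diffs."""
    return json.dumps(graph, sort_keys=True, separators=(",", ":"))
```

Calling `generate_decision_graph(1000)` twice yields identical graphs, so the committed file can be checked against the script byte for byte.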

Files to create

benchmarks/decision_intelligence/
  conftest.py
  test_precedent_search.py
  test_causal_chain.py
  test_decision_influence.py
  test_policy_compliance.py

scripts/
  generate_decision_influence_gt.py
  generate_precedent_scale.py

datasets/decision_intelligence/
  README.md

conftest.py — define all threshold constants at the top as THRESHOLD_<METRIC>_<DATASET> = <value>. Implement session-scoped pytest fixtures for each dataset. Do not load datasets inside test functions.

test_precedent_search.py — one function per metric: test_mrr_german_credit, test_ndcg10_cuad, test_graph_lift_over_bm25, test_p95_latency_10k, test_p95_latency_100k.

test_causal_chain.py: test_direction_accuracy_causalbench, test_recall_atomic, test_precision_atomic, test_intervention_accuracy, test_chain_p95_latency_depth10.

test_decision_influence.py: test_influence_recall_at_3, test_influence_recall_at_5, test_spurious_rate. Tracks 1.3 and 30 share these functions; Track 30 skips latency.

test_policy_compliance.py: test_compliance_accuracy_cuad, test_false_negative_rate_hard_gate, test_clause_f1_ledgar. Tracks 1.4 and 31 share these functions.

datasets/decision_intelligence/README.md — one section per dataset: source URL, download command, expected directory structure, license.


Key implementation patterns

# conftest.py — threshold constants and session fixtures
import pytest

THRESHOLD_MRR_GERMAN_CREDIT = 0.70
THRESHOLD_NDCG10_CUAD = 0.65
THRESHOLD_GRAPH_LIFT = 0.05
THRESHOLD_P95_10K_MS = 100
THRESHOLD_P95_100K_MS = 500
THRESHOLD_DIRECTION_ACCURACY = 0.72
THRESHOLD_ATOMIC_RECALL = 0.80
THRESHOLD_ATOMIC_PRECISION = 0.85
THRESHOLD_INTERVENTION_ACCURACY = 0.60
THRESHOLD_CHAIN_P95_MS = 500
THRESHOLD_INFLUENCE_RECALL_3 = 0.85
THRESHOLD_INFLUENCE_RECALL_5 = 0.75
THRESHOLD_SPURIOUS_RATE = 0.10
THRESHOLD_COMPLIANCE_ACCURACY = 0.88
THRESHOLD_FALSE_NEGATIVE_RATE = 0.05   # HARD GATE
THRESHOLD_CLAUSE_F1 = 0.75

# load_german_credit and load_cuad are project-specific loader helpers (not shown here).
@pytest.fixture(scope="session")
def german_credit_dataset():
    return load_german_credit("datasets/decision_intelligence/german_credit/")

@pytest.fixture(scope="session")
def cuad_dataset():
    return load_cuad("datasets/decision_intelligence/cuad/")

# test_policy_compliance.py — hard gate with explicit message
def test_false_negative_rate_hard_gate(cuad_dataset, policy_engine):
    results = evaluate_compliance(policy_engine, cuad_dataset)
    fnr = results["false_negative_rate"]
    assert fnr <= THRESHOLD_FALSE_NEGATIVE_RATE, (
        f"Policy FNR {fnr:.3f} exceeds hard gate {THRESHOLD_FALSE_NEGATIVE_RATE}. "
        f"Missed violations = illegal decisions pass unchecked."
    )

# test_precedent_search.py — one function per metric
def test_graph_lift_over_bm25(cuad_dataset, context_retriever):
    bm25_ndcg = evaluate_ndcg10_bm25(cuad_dataset)
    graph_ndcg = evaluate_ndcg10_graph(context_retriever, cuad_dataset)
    lift = graph_ndcg - bm25_ndcg
    assert lift >= THRESHOLD_GRAPH_LIFT, (
        f"Graph lift {lift:.4f} < {THRESHOLD_GRAPH_LIFT}. "
        f"Graph-augmented retrieval must outperform flat BM25."
    )

Notes

  • TACRED (106k sentences, LDC license) — do not include here; it belongs to Sub-issue 4 gated with TACRED_AVAILABLE=1.
  • The detect_duplicates() regression guard belongs to Sub-issue 4.
  • Tracks 30 and 31 are lightweight versions of 1.3 and 1.4. They share all fixtures and threshold constants via conftest.py.

Checklist

  • datasets/decision_intelligence/german_credit/ downloaded
  • datasets/decision_intelligence/cuad/ downloaded
  • datasets/decision_intelligence/ledgar/ downloaded
  • datasets/decision_intelligence/trec_ct_2022/ downloaded
  • datasets/decision_intelligence/atomic_subset/ downloaded (500-pair subset)
  • datasets/decision_intelligence/ecare/ downloaded
  • datasets/decision_intelligence/causalbench/ downloaded
  • fixtures/decision_influence_ground_truth.json generated and committed
  • fixtures/decision_scale/{1k,10k,100k}.json generated and committed (10k, 100k via git-lfs)
  • conftest.py — all 16 threshold constants defined, all dataset fixtures implemented
  • test_precedent_search.py — MRR, nDCG@10, graph lift, P95 at 10k and 100k
  • test_causal_chain.py — direction accuracy, recall, precision, intervention accuracy, depth-latency sweep
  • test_decision_influence.py — recall@3, recall@5, spurious rate (Tracks 1.3 + 30)
  • test_policy_compliance.py — accuracy, FNR hard gate, clause F1 (Tracks 1.4 + 31)
  • All tests pass: pytest benchmarks/decision_intelligence/ -p no:langsmith
  • No test requires a real LLM (no SEMANTICA_REAL_LLM dependency)
  • datasets/decision_intelligence/README.md written
  • PR title: benchmarks: decision intelligence tracks 1.1–1.4, 30–31
