Scope
Six tracks that validate Semantica's core decision reasoning capabilities. Each track is independent; the entire sub-issue is owned by one contributor and lives entirely in benchmarks/decision_intelligence/.
Track 1.1 — Precedent Search Quality: does find_precedents() return the right prior decisions at scale?
Track 1.2 — Causal Chain Traversal: does get_causal_chain() correctly identify cause→effect relationships fast enough for production?
Track 1.3 — Decision Influence Analysis: does analyze_decision_influence() propagate compliance changes without silent false positives?
Track 1.4 — Policy Compliance Classification: does PolicyEngine catch every policy violation? False negatives are illegal decisions passing unchecked.
Track 30 — Decision Influence Accuracy (lightweight): the same logic as 1.3, run against the committed fixture alone, with no latency assertions.
Track 31 — Policy Engine Compliance (lightweight): the same logic as 1.4 run against CUAD, LEDGAR, and TREC CT 2022, no latency assertions.
Do not touch any file outside benchmarks/decision_intelligence/ and datasets/decision_intelligence/.
Track 1.1 — Precedent Search Quality
Why it matters: If find_precedents() returns irrelevant prior decisions, every downstream policy check and audit trail is built on noise. This is the most critical differentiator Semantica has over plain RAG. Scale is equally important — the benchmark must validate that quality holds at 1k, 10k, and 100k decisions.
Datasets:
German Credit — 1,000 structured lending decisions annotated with ground-truth precedent pairs by domain experts. Covers income, employment, loan purpose, credit history. Used to compute MRR on structured decision retrieval. License: CC BY 4.0. Citation: Hofmann, UCI ML Repository 1994. Download: archive.ics.uci.edu/dataset/144. Path: datasets/decision_intelligence/german_credit/
CUAD — 510 commercial contracts with 41 clause-type annotations (non-compete, indemnification, termination rights, etc.). Provides nDCG@10 and graph-lift ground truth for precedent retrieval. License: CC BY 4.0. Citation: Hendrycks et al., arxiv:2103.06268. Download: atticusprojectai.org/cuad. Path: datasets/decision_intelligence/cuad/
Precedent Scale Fixture — synthetic decision graphs at 1k, 10k, and 100k. Used exclusively for P95 latency measurement — no recall or precision scoring. Generated once and committed to fixtures/decision_scale/{1k,10k,100k}.json. Files over 5 MB go through git-lfs.
Metrics:
MRR = (1/|Q|) × Σ_q (1 / rank_q) — Mean Reciprocal Rank; rank_q is the position of the first relevant result for query q. Evaluated on German Credit.
nDCG@10 = Σ_i (rel_i / log2(i+1)) / ideal_DCG — Normalized Discounted Cumulative Gain at rank 10. Evaluated on CUAD.
Graph lift = nDCG@10(graph-assisted) − nDCG@10(BM25-flat) — absolute improvement of graph-augmented retrieval over flat BM25. Positive lift proves the graph adds value.
P95 latency at 1k / 10k / 100k decisions — 95th percentile wall time for a find_precedents() call at each scale, measured against the scale fixtures.
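The three quality metrics above can be sketched in a few lines of Python. This is a minimal illustration, not Semantica's implementation: the function names are hypothetical, and the ideal DCG is computed by re-sorting the same relevance list, which assumes every relevant item for the query appears in that list.

```python
import math
from typing import Optional, Sequence


def mean_reciprocal_rank(first_relevant_ranks: Sequence[Optional[int]]) -> float:
    """MRR = (1/|Q|) * sum(1 / rank_q); a query with no relevant hit (None) adds 0."""
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank is not None)
    return total / len(first_relevant_ranks)


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG@k = sum(rel_i / log2(i + 1)) over 1-based positions i <= k."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Normalise DCG@k by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def graph_lift(ndcg_graph: float, ndcg_bm25: float) -> float:
    """Absolute nDCG@10 improvement of graph-assisted retrieval over flat BM25."""
    return ndcg_graph - ndcg_bm25
```

For example, `mean_reciprocal_rank([1, 2, None])` averages 1, 0.5, and 0 over three queries.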
Thresholds:
P95 latency at 10k decisions < 100 ms — Semantica production SLA 2026-Q1
P95 latency at 100k decisions < 500 ms — Semantica production SLA 2026-Q1
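One way to measure a P95 latency against these limits is sketched below. It assumes a nearest-rank percentile and a fixed number of timed runs after a short warm-up. The helper names, run counts, and the callable passed in are illustrative choices, not part of the benchmark specification.

```python
import math
import time
from typing import Callable, List


def percentile_nearest_rank(samples: List[float], pct: float) -> float:
    """Return the smallest sample such that at least pct% of samples are <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def measure_p95_ms(call: Callable[[], object], runs: int = 200, warmup: int = 10) -> float:
    """Time `call` repeatedly and return the 95th percentile wall time in milliseconds."""
    for _ in range(warmup):
        call()  # warm caches so the first timed runs are not outliers
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        timings.append((time.perf_counter() - start) * 1000.0)
    return percentile_nearest_rank(timings, 95)
```

A latency test would then wrap the call under test in a zero-argument callable and compare the returned value with the 100 ms or 500 ms limit.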
Track 1.2 — Causal Chain Traversal
Why it matters: get_causal_chain() depth accuracy determines whether root-cause attribution in high-stakes decisions (credit, healthcare, legal) is trustworthy. Without depth-vs-latency testing, a correct but slow chain analysis fails its production SLA before it is ever used.
APIs under test: ContextGraph.get_causal_chain() · CausalChainAnalyzer.get_causal_chain() · AgentContext.get_causal_chain() · decision_methods.get_causal_chain()
Datasets:
ATOMIC 2020 — 1.33 million commonsense causal triples covering 23 If-Then relation types (xIntent, xCause, xEffect, xNeed, xWant). Use a 500-pair subset of cause→effect pairs only. License: CC BY 4.0. Citation: Hwang et al., AAAI 2021. Download: allenai.org/data/atomic-2020. Path: datasets/decision_intelligence/atomic_subset/
e-CARE — 21,324 causal Q&A records with free-text explanation annotations for both the correct cause and why it holds. Unlike ATOMIC, e-CARE includes annotated explanations enabling evaluation of explanation completeness alongside causal direction. License: research open. Citation: Du et al., ACL 2022, arxiv:2205.02593. Path: datasets/decision_intelligence/ecare/
CausalBench — four task dimensions: cause→effect, effect→cause, both directions with intervention (counterfactual held-out pairs). Nineteen published LLM baselines across GPT-4, Claude, and open-source models. Tests both direction accuracy and intervention accuracy. Citation: NeurIPS 2024. Path: datasets/decision_intelligence/causalbench/
Depth-vs-latency fixture — synthetic causal chain benchmarks at depth {3, 5, 8, 10} × graph size {500, 1k, 5k, 10k}. Exclusively for chain P95 latency sweep. Generated once and committed to fixtures/causal_depth_latency/.
Thresholds:
Recall on ATOMIC 2020 subset ≥ 0.80 — KG-RAG literature baseline
Precision on ATOMIC 2020 subset ≥ 0.85 — KG-RAG literature baseline
Intervention accuracy on CausalBench ≥ 0.60 — CausalBench weakest published LLM baseline, NeurIPS 2024
Chain P95 at depth=10, 10k graph < 500 ms — Semantica production SLA 2026-Q1
Track 1.3 — Decision Influence Analysis
Why it matters: analyze_decision_influence() and trace_decision_causality() propagate compliance changes across a decision graph. Silent false positives here mean incorrect policy rollbacks; missed downstream decisions mean compliance changes fail to propagate.
APIs under test: ContextGraph.analyze_decision_influence() · ContextGraph.trace_decision_causality() · AgentContext.analyze_decision_influence() · DecisionQuery.analyze_decision_influence()
Datasets:
Decision influence ground-truth fixture — 500 synthetic decisions with fully annotated causal edges and 50 influence queries, each with the expected set of downstream decisions that should be flagged at depth 3 and depth 5. Generate it locally once and commit it to fixtures/decision_influence_ground_truth.json; do not regenerate it afterwards.
Metrics:
Influence recall@D = |found_influenced ∩ gold_influenced| / |gold_influenced| — fraction of gold downstream decisions found at search depth D. Reported at D=3 and D=5.
Spurious rate = |found_influenced \ gold_influenced| / |found_influenced| — fraction of returned decisions not in the gold set (false positives).
Thresholds:
Influence recall@3 ≥ 0.85
Influence recall@5 ≥ 0.75
Spurious rate ≤ 0.10
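Both influence metrics reduce to set arithmetic over decision identifiers. The sketch below assumes each query yields a set of found IDs and a set of gold IDs at a given depth. The function names and the choice to report a spurious rate of 0 for an empty result are assumptions made for illustration.

```python
from typing import Hashable, Set


def influence_recall(found: Set[Hashable], gold: Set[Hashable]) -> float:
    """|found ∩ gold| / |gold|: share of gold downstream decisions that were found."""
    if not gold:
        raise ValueError("gold set must be non-empty")
    return len(found & gold) / len(gold)


def spurious_rate(found: Set[Hashable], gold: Set[Hashable]) -> float:
    """|found \\ gold| / |found|: share of returned decisions not in the gold set."""
    if not found:
        return 0.0  # nothing returned, so nothing spurious (assumed convention)
    return len(found - gold) / len(found)
```

With `found = {"d1", "d2", "d9"}` and `gold = {"d1", "d2", "d3", "d4"}`, recall is 2/4 and the spurious rate is 1/3.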
Track 1.4 — Policy Compliance Classification
Why it matters: PolicyEngine.get_applicable_policies() and check_decision_compliance() are high-stakes. A missed violation (false negative) is worse than a false alarm because it means an illegal decision passes unchecked.
APIs under test: PolicyEngine.get_applicable_policies() · PolicyEngine.check_decision_compliance() · ContextGraph.find_applicable_policies()
Datasets:
CUAD — 510 contracts, 41 clause types covering non-compete, indemnification, termination rights, limitation of liability, and 37 other clause categories. For compliance testing, evaluate whether PolicyEngine correctly identifies which clauses apply to a given decision context. License: CC BY 4.0. Citation: Hendrycks et al., arxiv:2103.06268. Path: datasets/decision_intelligence/cuad/
LEDGAR — 60,000 legal provisions from SEC filings, each labelled with multi-label compliance tags across 100 categories. Used for clause-level F1 evaluation. License: research open. Citation: Tuggener et al., LREC 2020. Download: metatext.io/datasets/ledgar. Path: datasets/decision_intelligence/ledgar/
Thresholds:
Compliance accuracy ≥ 0.88
False negative rate ≤ 0.05 — HARD GATE. Internal compliance SLA. Any value above 0.05 fails the test immediately with an explicit error message naming the regulatory implication.
Clause-level F1 ≥ 0.75 — LEDGAR multi-label baseline, Tuggener et al. 2020
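The hard gate could be expressed as an assertion helper that fails loudly and spells out what a breach means. This is a sketch under assumptions: the helper name, its signature, and the wording of the message are hypothetical. Only the 0.05 limit and the requirement for an explicit message come from the thresholds above.

```python
def false_negative_rate(fn: int, tp: int) -> float:
    """FNR = FN / (FN + TP): share of true violations that were missed."""
    if fn + tp == 0:
        raise ValueError("no true violations in the evaluation set")
    return fn / (fn + tp)


def assert_fnr_hard_gate(fn: int, tp: int, threshold: float = 0.05) -> float:
    """Raise AssertionError with an explicit message when FNR exceeds the gate."""
    fnr = false_negative_rate(fn, tp)
    if fnr > threshold:
        raise AssertionError(
            f"HARD GATE FAILED: FNR {fnr:.3f} exceeds {threshold:.2f}; "
            f"{fn} of {fn + tp} true violations were missed, "
            "so non-compliant decisions would pass unchecked."
        )
    return fnr
```

Because the helper raises instead of returning a flag, a test that calls it stops at the first breach and the message appears directly in the pytest report.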
Track 30 — Decision Influence Accuracy (lightweight)
Why it matters: The full Track 1.3 may require scale fixtures; Track 30 runs only the precision/recall metrics against the committed ground-truth fixture, with no latency assertions and no large downloads.
APIs under test: ContextGraph.analyze_decision_influence() · ContextGraph.trace_decision_causality()
Datasets: same as Track 1.3 — fixtures/decision_influence_ground_truth.json.
Thresholds: same as Track 1.3 — recall@3 ≥ 0.85, recall@5 ≥ 0.75, spurious rate ≤ 0.10.
Implementation note: Track 30 shares test functions with Track 1.3 via conftest.py. The only difference is that Track 30 skips the scale latency assertions.
Track 31 — Policy Engine Compliance (lightweight)
Why it matters: False negatives in compliance are the most dangerous production failure. Track 31 runs the same compliance classification test as Track 1.4 independently — classification regressions are caught even when the retrieval track is skipped.
APIs under test: PolicyEngine.get_applicable_policies() · PolicyEngine.check_decision_compliance()
Datasets: CUAD (510 contracts), LEDGAR (60k provisions), TREC CT 2022 (75 topics) — same as Track 1.4.
Thresholds: same as Track 1.4 — compliance accuracy ≥ 0.88, FNR ≤ 0.05 (hard gate), clause F1 ≥ 0.75.
Fixtures to generate
Write these scripts once, run them locally, and commit the output. Do not regenerate fixtures after committing.
scripts/generate_decision_influence_gt.py — generates fixtures/decision_influence_ground_truth.json. 500 synthetic decisions, 50 influence queries with known downstream sets.
scripts/generate_precedent_scale.py — generates fixtures/decision_scale/{1k,10k,100k}.json. Route 10k and 100k through git-lfs (files exceed 5 MB).
Track 1.1 — Precedent Search Quality
APIs under test: ContextGraph.find_precedents() · AgentContext.find_precedents() · ContextRetriever.find_precedents_hybrid() · DecisionQuery.find_precedents_hybrid() · decision_methods.multi_hop_query()
Track 1.2 — Causal Chain Traversal
Metrics:
Causal recall = |retrieved_ancestors ∩ gold_ancestors| / |gold_ancestors| — fraction of ground-truth causal ancestors recovered.
Causal precision = |retrieved_ancestors ∩ gold_ancestors| / |retrieved_ancestors| — fraction of retrieved ancestors that are correct.
Direction accuracy = |correct_direction_pairs| / |total_pairs| — fraction of cause→effect pairs where direction is identified correctly.
Intervention accuracy = |correct_counterfactual_tests| / |total_withheld_tests| — fraction of counterfactual held-out pairs correctly classified.
Chain P95 latency at depth D, graph size N — 95th percentile latency for traversing a chain of depth D on a graph of N nodes.
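The causal metrics can be computed from plain sets of node IDs and (cause, effect) pairs. The sketch below is one possible shape. It assumes ancestors are identified by string IDs and that a predicted pair counts as correct only when the same ordered pair is in the gold set. The names are illustrative and not taken from the codebase.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (cause, effect)


def causal_precision_recall(retrieved: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of retrieved causal ancestors against gold."""
    hits = len(retrieved & gold)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def direction_accuracy(predicted: Iterable[Pair], gold: Set[Pair]) -> float:
    """Share of predicted (cause, effect) pairs whose direction matches gold."""
    pairs = list(predicted)
    if not pairs:
        raise ValueError("no predicted pairs")
    correct = sum(1 for pair in pairs if pair in gold)
    return correct / len(pairs)
```

A reversed pair such as `("smoke", "fire")` counts as wrong when the gold pair is `("fire", "smoke")`, which is the error direction accuracy is meant to expose.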
Track 1.4 — Policy Compliance Classification
Datasets (in addition to CUAD and LEDGAR):
TREC Clinical Trials 2022 — 75 patient topics with ground-truth trial eligibility labels. Tests compliance classification on structured patient-trial matching. Citation: TREC 2022. Download: trec.nist.gov. Path: datasets/decision_intelligence/trec_ct_2022/
Metrics:
Compliance accuracy = (TP + TN) / total — fraction of compliance decisions classified correctly.
False negative rate (FNR) = FN / (FN + TP) — fraction of true violations missed. Hard gate metric: a missed violation means an illegal decision passes unchecked.
False positive rate = FP / (FP + TN) — fraction of compliant decisions incorrectly flagged.
Clause-level F1 = 2 × precision × recall / (precision + recall) — harmonic mean of clause detection precision and recall on LEDGAR.
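All four compliance metrics follow from one confusion matrix. The sketch below assumes boolean labels where True means "violation" (the positive class). The `Confusion` dataclass and function names are hypothetical helpers for illustration, not existing Semantica types.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Confusion:
    tp: int
    fp: int
    tn: int
    fn: int


def confusion(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Confusion:
    """Count outcomes; True is the positive class (a policy violation)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences differ in length")
    pairs = list(zip(y_true, y_pred))
    return Confusion(
        tp=sum(1 for t, p in pairs if t and p),
        fp=sum(1 for t, p in pairs if not t and p),
        tn=sum(1 for t, p in pairs if not t and not p),
        fn=sum(1 for t, p in pairs if t and not p),
    )


def accuracy(c: Confusion) -> float:
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def fnr(c: Confusion) -> float:
    return c.fn / (c.fn + c.tp)


def fpr(c: Confusion) -> float:
    return c.fp / (c.fp + c.tn)


def f1(c: Confusion) -> float:
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    return 2 * precision * recall / (precision + recall)
```

The ratio helpers divide without guarding against empty classes, so an evaluation set needs at least one violation and one compliant decision.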
Files to create
conftest.py — define all threshold constants at the top as THRESHOLD_<METRIC>_<DATASET> = <value>. Implement session-scoped pytest fixtures for each dataset. Do not load datasets inside test functions.
test_precedent_search.py — one function per metric: test_mrr_german_credit, test_ndcg10_cuad, test_graph_lift_over_bm25, test_p95_latency_10k, test_p95_latency_100k.
test_causal_chain.py — test_direction_accuracy_causalbench, test_recall_atomic, test_precision_atomic, test_intervention_accuracy, test_chain_p95_latency_depth10.
test_decision_influence.py — test_influence_recall_at_3, test_influence_recall_at_5, test_spurious_rate. Tracks 1.3 and 30 share these functions; Track 30 skips latency.
test_policy_compliance.py — test_compliance_accuracy_cuad, test_false_negative_rate_hard_gate, test_clause_f1_ledgar. Tracks 1.4 and 31 share these functions.
datasets/decision_intelligence/README.md — one section per dataset: source URL, download command, expected directory structure, license.
Key implementation patterns
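A conftest.py following the conventions described for it might start as sketched below. Treat it as an assumption-laden outline: the constant names apply the THRESHOLD_<METRIC>_<DATASET> pattern to a few of the thresholds stated in this issue, while the fixtures directory location, the `load_json_fixture` helper, and the skip-on-missing behaviour are illustrative choices, not confirmed project code.

```python
import json
from pathlib import Path

import pytest

# Threshold constants at the top of the file, named THRESHOLD_<METRIC>_<DATASET>.
# Values are copied from the thresholds in this issue; the names are hypothetical.
THRESHOLD_RECALL_ATOMIC = 0.80
THRESHOLD_PRECISION_ATOMIC = 0.85
THRESHOLD_INTERVENTION_ACC_CAUSALBENCH = 0.60
THRESHOLD_CLAUSE_F1_LEDGAR = 0.75

# Assumed location, relative to the repository root where pytest is invoked.
FIXTURES_DIR = Path("benchmarks/decision_intelligence/fixtures")


def load_json_fixture(path: Path):
    """Read one committed JSON fixture from disk."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


@pytest.fixture(scope="session")
def decision_influence_ground_truth():
    """Load the ground-truth fixture once per test session, not per test."""
    path = FIXTURES_DIR / "decision_influence_ground_truth.json"
    if not path.exists():
        pytest.skip(f"fixture missing: {path}")
    return load_json_fixture(path)
```

Session scope means the JSON file is parsed once and shared by every test that requests the fixture, which keeps dataset loading out of the test functions themselves.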
Notes
[…] TACRED_AVAILABLE=1.
[…] detect_duplicates() regression guard belongs to Sub-issue 4.
[…] conftest.py.
Checklist
datasets/decision_intelligence/german_credit/ downloaded
datasets/decision_intelligence/cuad/ downloaded
datasets/decision_intelligence/ledgar/ downloaded
datasets/decision_intelligence/trec_ct_2022/ downloaded
datasets/decision_intelligence/atomic_subset/ downloaded (500-pair subset)
datasets/decision_intelligence/ecare/ downloaded
datasets/decision_intelligence/causalbench/ downloaded
fixtures/decision_influence_ground_truth.json generated and committed
fixtures/decision_scale/{1k,10k,100k}.json generated and committed (10k, 100k via git-lfs)
conftest.py — all 15 threshold constants defined, all dataset fixtures implemented
test_precedent_search.py — MRR, nDCG@10, graph lift, P95 at 10k and 100k
test_causal_chain.py — direction accuracy, recall, precision, intervention accuracy, depth-latency sweep
test_decision_influence.py — recall@3, recall@5, spurious rate (Tracks 1.3 + 30)
test_policy_compliance.py — accuracy, FNR hard gate, clause F1 (Tracks 1.4 + 31)
pytest benchmarks/decision_intelligence/ -p no:langsmith […] SEMANTICA_REAL_LLM dependency)
datasets/decision_intelligence/README.md written
[…] benchmarks: decision intelligence tracks 1.1–1.4, 30–31