[Epic] Real-World Benchmark Suite: All Semantica Modules (#570)


## What this epic covers

This epic delivers the full benchmark suite defined in `docs/benchmarks/real_world_benchmarks.md`.

- 35 evaluation tracks across every Semantica module
- 50+ public datasets with published baselines — no synthetic-only fixtures, no uncited thresholds
- Every track calls real Semantica module APIs against real downloaded datasets — not hardcoded expected outputs
- Every track is owned by exactly one contributor via a dedicated issue
- Issues are designed to be fully independent so contributors never block each other

The suite proves Semantica's core claims with real data:

- **Decision Intelligence** — precedent retrieval, causal analysis, and policy compliance outperform plain RAG
- **Temporal & Bitemporal Reasoning** — stale facts are never injected; audit trails are gap-free
- **Explainability & Provenance** — W3C PROV-O conformance; every decision traceable to evidence
- **Memory & Context Management** — agent memory holds precision at 1M and 10M token scale
- **Structural Intelligence** — entity resolution, KG completion, and embedding quality anchored to public SOTA
- **New Module Tracks** — document parsing, chunking, vector store, and ingest pipeline benchmarked for the first time

**Benchmarks are run locally only.** They are not wired into any CI pipeline — the suite is too large and too slow for automated runs. Contributors run them manually before opening a PR.

---

## Modules under test

Every benchmark calls a real Semantica API. The lists below map each pillar to its primary APIs so reviewers can verify that coverage is genuine.

### Decision Intelligence

- `ContextGraph.find_precedents()` · `AgentContext.find_precedents()` · `ContextRetriever.find_precedents_hybrid()`
- `ContextGraph.get_causal_chain()` · `CausalChainAnalyzer.get_causal_chain()`
- `ContextGraph.analyze_decision_influence()` · `ContextGraph.trace_decision_causality()`
- `PolicyEngine.get_applicable_policies()` · `PolicyEngine.check_decision_compliance()`

### Temporal & Provenance

- `TemporalGraphRetriever` · `ContextRetriever` (with `valid_from`/`valid_until` filters)
- `TemporalVersionManager` · `TemporalGraphQuery` · `validate_temporal_consistency()`
- `TemporalPatternDetector` · `kg.temporal_query.TemporalPatternDetector`
- `AgentContext.trace_decision_explainability()` · `DecisionContext.explain_decision()`
- `kg.provenance_tracker.ProvenanceTracker` · `kg.kg_provenance.GraphBuilderWithProvenance`
- `context.context_provenance.ContextManagerWithProvenance`

### Memory & Context Management

- `AgentMemory.store()` · `AgentMemory.get_memory()` · `AgentMemory.retrieve()` · `AgentMemory._prune_short_term_memory()`
- `ContextRetriever.hybrid_search()` (alpha sweep across {0.0 … 1.0})
- `ContextRetriever.multi_hop_context_assembly()` · `AgentContext.multi_hop_context_query()`
- `AgentContext` (multi-turn consistency) · `ContextGraph` metric node storage
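
The alpha sweep on `ContextRetriever.hybrid_search()` is judged on nDCG@10 (see the Memory & Context Management hard gate under "Issues and their scope"). The following is a minimal sketch of that metric and gate, assuming graded relevance labels are already available in ranked order for each query. `ndcg_at_k` and `passes_alpha_gate` are hypothetical helper names, not Semantica APIs, and the ideal ranking is computed from the retrieved list only, which is a simplification.

```python
import math
from typing import Mapping, Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (ranks are 1-based)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalised by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def passes_alpha_gate(ndcg_by_alpha: Mapping[float, float]) -> bool:
    """Hybrid (alpha=0.5) must strictly beat both pure modes (alpha=0.0 and 1.0)."""
    mid = ndcg_by_alpha[0.5]
    return mid > ndcg_by_alpha[0.0] and mid > ndcg_by_alpha[1.0]
```

In a real track the mean nDCG@10 per alpha would come from running the retriever on a downloaded dataset; the helpers only cover the arithmetic.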

### Structural Intelligence

- `SimilarityCalculator.calculate_similarity()` · `ClusterBuilder.build_clusters()` · `EntityMerger.merge_entities()`
- `LinkPredictor.score_link()` · `LinkPredictor` (MRR, Hits@k)
- `SemanticExtractor` (relation extraction) · `EntityExtractor`
- `EmbeddingGenerator.generate_embeddings()` (batch mode)
- `ConflictResolver.detect_conflicts()` · `ConflictResolver.resolve_conflict()`
- `Reasoner.infer_facts()` · `ontology.Reasoner`
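
For the `LinkPredictor` metrics, MRR and Hits@k are both functions of the rank assigned to the true entity for each test triple. The sketch below assumes those 1-based ranks have already been derived from the predictor's scores; how that is done with `LinkPredictor.score_link()` is not specified here, and the function names are hypothetical.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all test triples (ranks are 1-based)."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of test triples whose true entity is ranked within the top k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)
```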

### New Module Tracks

- `parse.DocumentParser` · `parse.PDFParser` · `parse.TableExtractor`
- `split.Splitter` · `split.SemanticChunker` · `split.RecursiveCharacterSplitter` · `split.TokenSplitter`
- `vector_store.VectorStore.search()` · `vector_store.VectorStore.filtered_search()`
- `ingest.IngestPipeline.run()` · `ingest.IngestPipeline.run_batch()`

---

## Architecture rules — read before touching a single file

**Rule 1 — One directory per issue.**
Never create or edit files outside your assigned `benchmarks/<dir>/` and `datasets/<dir>/` folders, apart from adding your own files to the shared `fixtures/` and `scripts/` directories (see the directory isolation map).
The only other shared files are `.gitattributes` and `benchmarks/benchmarks_runner.py`, and only the Infrastructure & Module Tracks issue may touch them.

**Rule 2 — Thresholds live in `conftest.py`.**
Never hardcode a float inside a test function.
Every threshold constant must be defined at the top of `conftest.py` with the form `THRESHOLD_<METRIC>_<DATASET> = <value>`.
No PR will be merged with inline numbers like `assert score >= 0.70`.

**Rule 3 — Real LLM tests are skipped by default.**
Any test that requires a real language model must be gated so it does not run unless explicitly enabled:

```python
import os
import pytest

# Skip unless SEMANTICA_REAL_LLM is set to a non-empty value.
real_llm = pytest.mark.skipif(
    not os.getenv("SEMANTICA_REAL_LLM"),
    reason="requires real LLM — set SEMANTICA_REAL_LLM=1 to enable",
)
```

Without this gate the test would fail on any machine that does not have an LLM available.
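
Note that `not os.getenv("SEMANTICA_REAL_LLM")` treats any non-empty value as enabled, so `SEMANTICA_REAL_LLM=0` would also run the tests. If stricter parsing is wanted, a helper along these lines could be used in the `skipif` condition instead. `env_flag` and its set of "off" values are assumptions for illustration, not an existing Semantica utility.

```python
import os

# Values treated as "off"; anything else that is non-empty counts as "on".
_OFF_VALUES = {"", "0", "false", "no", "off"}


def env_flag(name: str) -> bool:
    """Return True only when the environment variable is set to an 'on' value."""
    return os.getenv(name, "").strip().lower() not in _OFF_VALUES
```

The gate condition would then read `not env_flag("SEMANTICA_REAL_LLM")`.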

**Rule 4 — Files over 5 MB go through git-lfs.**
The Infrastructure & Module Tracks issue sets up `.gitattributes` before any large file is committed.
Committing a raw binary over 5 MB to regular git history will be rejected at review.
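
Before committing, a contributor could scan their dataset directory for files that cross the limit. The sketch below assumes "5 MB" means 5 MiB and does not check whether a file is already tracked by git-lfs; `files_over_limit` is a hypothetical helper, not part of the suite.

```python
from pathlib import Path
from typing import List, Union

# Assumption: the 5 MB limit is interpreted as 5 * 1024 * 1024 bytes.
LFS_LIMIT_BYTES = 5 * 1024 * 1024


def files_over_limit(
    root: Union[str, Path], limit: int = LFS_LIMIT_BYTES
) -> List[Path]:
    """Return all regular files under root whose size exceeds limit, sorted by path."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_size > limit
    )
```

For example, `files_over_limit("datasets/<dir>")` lists the candidates that should match a git-lfs pattern in `.gitattributes`.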

**Rule 5 — Quality metrics must come from real public datasets.**
Every quality threshold (F1, MRR, recall, precision, accuracy) must be measured by calling a real Semantica module API against a real downloaded public dataset. The threshold value must have a published baseline citation.
Synthetic fixtures (`fixtures/*.json`) are permitted only for latency and scale tests (P95 latency at N entities, throughput at N docs/s) — never as the primary signal for a quality metric.
A test that computes F1 entirely from a synthetic JSON file will be rejected at review regardless of the score.

---

## How to run benchmarks

Benchmarks are a local tool. Run your sub-issue's tests before opening a PR:

```bash
pytest benchmarks/<dir>/ -p no:langsmith -v
```

To run a specific track:

```bash
pytest benchmarks/decision_intelligence/test_precedent_search.py -v
```

To enable real-LLM tests (requires `SEMANTICA_REAL_LLM` env var):

```bash
SEMANTICA_REAL_LLM=1 pytest benchmarks/<dir>/ -v
```

To enable the BEAM 10M test (requires the large dataset downloaded):

```bash
BEAM_10M=1 pytest benchmarks/memory_context/test_agent_memory_beam10m.py -v
```

---

## Issues and their scope

**Merge order:** the git-lfs and JSON schema validation setup from the Infrastructure & Module Tracks issue should land before any other issue merges, because every other issue depends on it. The rest of that issue can proceed in parallel with the others.

- **Decision Intelligence, Tracks 1.1–1.4, 30–31 (#571)**
  - Precedent search quality, causal chain traversal, decision influence analysis, policy compliance classification
  - Datasets: German Credit, CUAD, LEDGAR, TREC CT 2022, ATOMIC 2020, e-CARE, CausalBench
  - Hard gate: policy false negative rate ≤ 0.05
  - Owner directory: `benchmarks/decision_intelligence/`

- **Temporal, Bitemporal & Provenance (#572)**
  - Temporal validity, bitemporal revision integrity, pattern detection, decision explainability, provenance lineage, cross-module continuity
  - Datasets: MultiTQ, CronQuestions, TGB 2.0 tkgl-icews, TGB 2.0 tkgl-wikidata, ICEWS, FEVER, ERASER, W3C PROV-O suite
  - Hard gates: gap count = 0, overlap count = 0, PROV-O SPARQL violations = 0
  - Owner directory: `benchmarks/temporal_provenance/`

- **Memory & Context Management (#573)**
  - Agent memory persistence, hybrid retrieval alpha sweep, multi-hop context assembly, agentic semantic consistency
  - Datasets: LoCoMo, LongMemEval, BEAM 1M, BEAM 10M (optional, large download), MemoryArena, MSC, GrailQA, CWQ, HotpotQA, 2WikiMultiHop, MuSiQue, MetaQA, τ-bench
  - Hard gate: alpha=0.5 must outperform alpha=0.0 AND alpha=1.0 on nDCG@10
  - Owner directory: `benchmarks/memory_context/`

- **Structural Intelligence (#574)**
  - Entity resolution, KG completion, semantic extraction, embedding quality, conflict resolution, ontology reasoning
  - Datasets: DBLP-ACM, DBLP-Scholar, Abt-Buy, Amazon-Google, Walmart-Amazon dirty, WDC Products, FB15k-237, WN18RR, ogbl-biokg, CoDEx-S/M/L, Wikidata5M, SemEval 2010, NYT10, DocRED, Re-DocRED, REBEL, BEIR, MTEB, STS-Benchmark, WikiContradict, CONFLICTBANK, W3C OWL-RL
  - Regression guard: `detect_duplicates()` must never be called from the benchmark pipeline
  - Owner directory: `benchmarks/structural_intelligence/`

- **Infrastructure, New Module Tracks & Dataset Registry (#575)**
  - Document parsing, chunking quality, vector store performance, ingest pipeline throughput
  - Infrastructure: git-lfs setup, JSON schema validation, `benchmarks_runner.py` discovery, dataset registry
  - Datasets: OmniDocBench, SIFT1M, NeurIPS'23 Filtered Search, DEEP1B, synthetic ingest corpus
  - Owner directories: `benchmarks/module_tracks/` · `benchmarks/infrastructure/`
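
Two of the hard gates above reduce to simple counting once the module outputs are collected. The sketch below is illustrative only. It assumes validity periods are half-open `(valid_from, valid_until)` pairs of comparable integer timestamps for a single fact, and that "positive" in the policy gate means a decision that violates a policy. Both function names are hypothetical.

```python
from typing import Sequence, Tuple

Interval = Tuple[int, int]  # half-open: [valid_from, valid_until)


def count_gaps_and_overlaps(intervals: Sequence[Interval]) -> Tuple[int, int]:
    """Count gaps and overlaps between consecutive validity periods."""
    ordered = sorted(intervals)
    if not ordered:
        return 0, 0
    gaps = overlaps = 0
    max_end = ordered[0][1]  # furthest point covered so far
    for start, end in ordered[1:]:
        if start > max_end:
            gaps += 1
        elif start < max_end:
            overlaps += 1
        max_end = max(max_end, end)
    return gaps, overlaps


def false_negative_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """FN / (FN + TP), where 1 marks a policy violation."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    return fn / (fn + tp) if (fn + tp) else 0.0
```

A track would compare these values against constants defined in `conftest.py`, in line with Rule 2, rather than against inline numbers.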

---

## Directory isolation map

Every issue owns exactly the directories below — and nothing else.

- `benchmarks/decision_intelligence/` — Decision Intelligence
- `benchmarks/temporal_provenance/` — Temporal, Bitemporal & Provenance
- `benchmarks/memory_context/` — Memory & Context Management
- `benchmarks/structural_intelligence/` — Structural Intelligence
- `benchmarks/module_tracks/` — Infrastructure & Module Tracks
- `benchmarks/infrastructure/` — Infrastructure & Module Tracks
- `benchmarks/context_graph_effectiveness/` — existing, do not move or modify

Dataset directories follow the same pattern:

- `datasets/decision_intelligence/` — Decision Intelligence
- `datasets/temporal_provenance/` — Temporal, Bitemporal & Provenance
- `datasets/memory_context/` — Memory & Context Management
- `datasets/structural_intelligence/` — Structural Intelligence
- `datasets/module_tracks/` — Infrastructure & Module Tracks

Shared directories — each issue commits only its own files:

- `fixtures/` — each issue adds its own fixtures; never overwrite another issue's files
- `scripts/` — each issue adds its own generator scripts
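
The isolation map can be turned into a quick self-check on a PR's changed paths. The sketch below is one way to do it, under stated assumptions: the dictionary keys are hypothetical labels for the five issues, the path lists are copied from the map and from Rule 1, and the check only confirms that a path sits in an allowed location, not that a file in `fixtures/` or `scripts/` belongs to the issue.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

# Directories (and, for infrastructure, the two shared files) each issue may touch.
OWNED: Dict[str, List[str]] = {
    "decision_intelligence": [
        "benchmarks/decision_intelligence", "datasets/decision_intelligence"],
    "temporal_provenance": [
        "benchmarks/temporal_provenance", "datasets/temporal_provenance"],
    "memory_context": [
        "benchmarks/memory_context", "datasets/memory_context"],
    "structural_intelligence": [
        "benchmarks/structural_intelligence", "datasets/structural_intelligence"],
    "infrastructure": [
        "benchmarks/module_tracks", "benchmarks/infrastructure",
        "datasets/module_tracks", ".gitattributes",
        "benchmarks/benchmarks_runner.py"],
}
SHARED = ["fixtures", "scripts"]


def _inside(path: PurePosixPath, root: str) -> bool:
    """True if path equals root or lies somewhere beneath it."""
    root_path = PurePosixPath(root)
    return path == root_path or root_path in path.parents


def paths_outside_scope(issue: str, changed: Iterable[str]) -> List[str]:
    """Return the changed paths that the given issue is not allowed to touch."""
    allowed = OWNED[issue] + SHARED
    return [
        p for p in changed
        if not any(_inside(PurePosixPath(p), root) for root in allowed)
    ]
```

The list of changed paths could come from `git diff --name-only` against the base branch; an empty result means the change set stays within the map.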

---

## Overall checklist

- [ ] Infrastructure & Module Tracks: `.gitattributes` with git-lfs patterns, JSON schema files, `benchmarks_runner.py` updated
- [ ] Decision Intelligence merged — tracks 1.1–1.4, 30–31
- [ ] Temporal, Bitemporal & Provenance merged — tracks 2.1–2.3, 3.1–3.3
- [ ] Memory & Context Management merged — tracks 4.1–4.4, 26
- [ ] Structural Intelligence merged — tracks 5.1–5.5, 27–29
- [ ] Infrastructure & Module Tracks merged — document parsing, chunking, vector store, ingest pipeline (tracks 32–35)
- [ ] All tests pass locally: `pytest benchmarks/ -p no:langsmith`
- [ ] `CHANGELOG.md` entry added under the release that ships the suite
- [ ] `docs/benchmarks/real_world_benchmarks.md` version bumped to 3.0

