[FEATURE] [Benchmarks] Pillar 2 & 3 Temporal, Bitemporal & Provenance #572

## Scope

Six tracks that validate Semantica's temporal reasoning and W3C-conformant audit trail correctness. All six tracks live in `benchmarks/temporal_provenance/`.

- Track 2.1 — Temporal Validity & Stale Context Prevention: does `TemporalGraphRetriever` return only facts valid at query time? Stale facts in LLM context produce confidently-wrong answers — a patient safety or compliance risk.
- Track 2.2 — Bitemporal Revision Integrity: does `TemporalVersionManager` maintain gap-free, non-overlapping revision histories under concurrent write load? Gaps are invisible without explicit tests.
- Track 2.3 — Temporal Pattern Detection: does `TemporalPatternDetector` surface trends, anomalies, and regime changes from time-series graph data?
- Track 3.1 — Decision Explainability: does `trace_decision_explainability()` surface the full reasoning chain — evidence nodes, policy references, precedent links? Incomplete explanations fail auditors.
- Track 3.2 — Provenance Lineage Integrity (W3C PROV-O): is every hop in the lineage checksum-protected and W3C-conformant? A chain with gaps is legally inadmissible as an audit trail.
- Track 3.3 — Cross-module Provenance Continuity: does the provenance chain stay unbroken as a document moves through `ingest → embedding → KG → audit_log`?

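**Do not touch any file outside `benchmarks/temporal_provenance/` and `datasets/temporal_provenance/`, apart from the fixture generator script and fixture files listed under "Fixtures to generate" and "Files to create".**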

---

## Track 2.1 — Temporal Validity & Stale Context Prevention

**Why it matters:** If `TemporalGraphRetriever` injects stale facts into an LLM context, the model generates outdated answers with false confidence. In healthcare or legal domains, this is a direct patient safety or compliance risk.

**APIs under test:** `TemporalGraphRetriever` · `ContextRetriever` (temporal filters: `valid_from`, `valid_until`) · `ContextGraph.add_node()` with `valid_from`/`valid_until` parameters

Datasets:

- **MultiTQ** — 7,000 multi-temporal Q&A pairs designed specifically for retrieval-before-answer systems. Each question requires reasoning over multiple simultaneous temporal constraints (e.g., "who held role X during year Y when Z was also true"). Directly exposes retrievers that ignore `valid_from`/`valid_until` boundaries. License: research open. Citation: Chen et al., EMNLP 2023, [arxiv:2310.01253](https://arxiv.org/abs/2310.01253). Path: `datasets/temporal_provenance/multitq/`

- **CronQuestions** — 410,357 temporal Q&A pairs covering the full range of temporal reasoning: role holders during a year, last events before a timestamp, concurrent events. Largest temporal QA benchmark available; provides MRR signal at scale. License: research open. Citation: Saxena et al., ACL 2021. Download: [github.com/apoorvumang/CronKGQA](https://github.com/apoorvumang/CronKGQA). Path: `datasets/temporal_provenance/cronquestions/`

- **TGB 2.0 tkgl-icews** — 9.8 million political event quadruples (subject, relation, object, timestamp) from the Integrated Crisis Early Warning System. Tests time-aware MRR: how well does the retriever rank temporally-valid facts above stale ones? Requires **git-lfs** — exceeds 5 MB. License: research open. Citation: TGB 2.0, NeurIPS 2024. Download: [tgb.complexdatalab.com](https://tgb.complexdatalab.com/). Path: `datasets/temporal_provenance/tkgl_icews/`

- **TGB 2.0 tkgl-wikidata** — 1.5 million general temporal KG quadruples from Wikidata, covering a wide variety of entity types and relation types. Closer to Semantica's ontology than ICEWS; shared with Track 2.3. Requires **git-lfs**. License: CC BY 4.0. Citation: TGB 2.0, NeurIPS 2024. Path: `datasets/temporal_provenance/tkgl_wikidata/`

> WebQSP removed: a SIGIR 2022 audit found ~52% stale ground truth, making Precision@k unreliable. TimeQA (150 Q&A) removed: with only 150 items the 95% confidence interval is roughly ±8%, too wide for a meaningful pass/fail gate.

Metrics:

- `Stale injection rate = |retrieved_facts where valid_until < query_time| / |total_retrieved|` — fraction of retrieved facts that had already expired at query time.
- `Future injection rate = |retrieved_facts where valid_from > query_time| / |total_retrieved|` — fraction of retrieved facts not yet valid at query time.
- `Temporal precision@5 = |correct_temporal_facts in top-5| / 5` — fraction of top-5 results that are temporally valid and factually correct.
- `MRR (time-aware) = (1/|Q|) × Σ_q (1 / rank_q)` — Mean Reciprocal Rank where only temporally-valid facts count as relevant. Evaluated on tkgl-icews.
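
The formulas above can be sketched in plain Python. This is a minimal illustration only: `RetrievedFact` is a hypothetical record type and not part of Semantica's API, an open-ended `valid_until` is assumed to be stored as `None`, and an empty retrieval is assumed to score 0.0.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class RetrievedFact:
    """Hypothetical record type used only for this sketch."""
    fact_id: str
    valid_from: datetime
    valid_until: Optional[datetime]  # None means the fact is still valid


def stale_injection_rate(facts: Sequence[RetrievedFact], query_time: datetime) -> float:
    """Fraction of retrieved facts whose validity ended before query_time."""
    if not facts:
        return 0.0  # assumption: nothing retrieved means nothing stale
    stale = sum(
        1 for f in facts
        if f.valid_until is not None and f.valid_until < query_time
    )
    return stale / len(facts)


def future_injection_rate(facts: Sequence[RetrievedFact], query_time: datetime) -> float:
    """Fraction of retrieved facts that only become valid after query_time."""
    if not facts:
        return 0.0
    future = sum(1 for f in facts if f.valid_from > query_time)
    return future / len(facts)


def time_aware_mrr(first_valid_ranks: Sequence[Optional[int]]) -> float:
    """Mean reciprocal rank over queries.

    Each entry is the 1-based rank of the first temporally valid relevant
    fact for one query, or None if no such fact was retrieved (scores 0).
    """
    if not first_valid_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_valid_ranks if rank is not None)
    return total / len(first_valid_ranks)
```

A test such as `test_stale_injection_rate` could then compare the returned value against the threshold constant held in `conftest.py`.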

Thresholds:

- Stale injection rate < 0.05 — Semantica production SLA 2026-Q1
- Future injection rate < 0.05 — Semantica production SLA 2026-Q1
- Temporal precision@5 ≥ 0.80 — TempQA framework 2024
- MRR on tkgl-icews ≥ 0.45 — [RE-GCN baseline, TGB 2.0 NeurIPS 2024](https://tgb.complexdatalab.com/)

---

## Track 2.2 — Bitemporal Revision Integrity

**Why it matters:** A bitemporal graph tracks both valid time (when a fact was true in the world) and transaction time (when it was recorded). Gaps or overlaps in either dimension corrupt audit trails and are completely invisible without explicit correctness tests.

**APIs under test:** `TemporalVersionManager` · `kg.temporal_query.TemporalVersionManager` · `TemporalGraphQuery` · `validate_temporal_consistency()`

Datasets:

- **Synthetic bitemporal stress corpus** — 50,000 revision events with zero injected gaps or overlaps. Every integrity gate must pass on this clean fixture as-is, and each detector must also be checked against a copy with a deliberately injected error: if a gap is injected and the test still passes, the detector is broken. Generated once; committed to `fixtures/bitemporal_stress.json`.

- **Concurrent-write fixture** — generated at test time, not from a file. Fifty threads each write the same entity simultaneously. Expected output: zero overlapping revision windows in the final revision list.

- **Wikidata revision history subset** — 10k entity edit pairs with `valid_from` / `transaction_time` metadata. Tests the bitemporal manager on real revision patterns extracted from Wikidata history dumps. Committed via git-lfs.

Metrics:

- `Revision monotonicity = |pairs where valid_until[v] == valid_from[v+1]| / |total_consecutive_pairs|` — fraction of consecutive revision pairs with exactly-contiguous windows. Must equal 1.0.
- `Temporal gap count = |pairs where valid_until[v] < valid_from[v+1]|` — count of revision pairs with a gap between them. Must be exactly 0.
- `Overlap count = |pairs where valid_until[v] > valid_from[v+1]|` — count of revision pairs where windows overlap. Must be exactly 0.
- `Concurrent write safety` — count of overlapping revision windows observed after 50 threads simultaneously write the same entity. Must be 0.
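
As a rough sketch of how the three window metrics could be computed, the helpers below operate on plain `(valid_from, valid_until)` tuples for a single entity. The tuple shape is an assumption made for illustration; the real helpers would need to read these fields from whatever revision objects `TemporalVersionManager` returns. Windows are sorted by `valid_from` before consecutive pairs are compared.

```python
from typing import List, Sequence, Tuple

# Assumed shape: (valid_from, valid_until) with comparable values
# (ints here; datetimes would work the same way).
Window = Tuple[int, int]


def _consecutive_pairs(windows: Sequence[Window]) -> List[Tuple[Window, Window]]:
    """Sort by valid_from and pair each revision with its successor."""
    ordered = sorted(windows, key=lambda w: w[0])
    return list(zip(ordered, ordered[1:]))


def count_gaps(windows: Sequence[Window]) -> int:
    """Pairs where valid_until[v] < valid_from[v+1]."""
    return sum(1 for prev, nxt in _consecutive_pairs(windows) if prev[1] < nxt[0])


def count_overlapping_windows(windows: Sequence[Window]) -> int:
    """Pairs where valid_until[v] > valid_from[v+1]."""
    return sum(1 for prev, nxt in _consecutive_pairs(windows) if prev[1] > nxt[0])


def revision_monotonicity(windows: Sequence[Window]) -> float:
    """Fraction of consecutive pairs that are exactly contiguous."""
    pairs = _consecutive_pairs(windows)
    if not pairs:
        return 1.0  # assumption: zero or one revision is trivially monotonic
    contiguous = sum(1 for prev, nxt in pairs if prev[1] == nxt[0])
    return contiguous / len(pairs)
```

Because every consecutive pair is exactly one of contiguous, gapped, or overlapping, a monotonicity of 1.0 implies both counts are zero; keeping three separate assertions still makes failures easier to diagnose.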

Thresholds (binary gates — any non-zero value fails the test immediately):

- Temporal gap count = 0 — [W3C PROV-DM §4.4 monotonicity invariant](https://www.w3.org/TR/prov-dm/)
- Overlap count = 0 — [W3C PROV-DM §4.4](https://www.w3.org/TR/prov-dm/)
- Revision monotonicity = 1.0 — binary correctness
- Concurrent write safety (50 threads) = 0 overlapping windows — internal SLA

---

## Track 2.3 — Temporal Pattern Detection

**Why it matters:** `TemporalPatternDetector` surfaces trends, periodicity, and regime changes in time-series graph data. Poor recall means anomalies are missed; poor precision causes false alerts that waste investigator time; slow detection breaches the real-time SLA.

**APIs under test:** `kg.temporal_query.TemporalPatternDetector` · `TemporalGraphQuery`

Datasets:

- **ICEWS** — Integrated Crisis Early Warning System; political event dataset covering global events from 1995 to present, encoded as subject–predicate–object–time quadruples. Used to evaluate anomaly detection: do detected anomaly clusters align with known geopolitical event clusters? License: public domain. Path: `datasets/temporal_provenance/icews/`

- **TGB 2.0 tkgl-wikidata** — shared with Track 2.1. Used for periodicity and drift detection across multiple entity types.

- **Synthetic temporal anomaly corpus** — 10,000 events with injected known patterns: spikes (sudden frequency increase), drift (gradual distribution shift), periodicity (weekly/monthly cycles), and normal background. Every event has a ground-truth pattern label. Generated once; committed to `fixtures/temporal_patterns.json`.

Metrics:

- `Pattern recall = |detected_patterns ∩ gold_patterns| / |gold_patterns|` — fraction of known patterns in the fixture that are detected.
- `Pattern precision = |detected_patterns ∩ gold_patterns| / |detected_patterns|` — fraction of detected patterns that are real.
- `Anomaly F1 = 2 × precision × recall / (precision + recall)` — harmonic mean on the ICEWS anomaly detection task.
- `Detection latency P95 at 1M event window` — 95th percentile wall time to run pattern detection on a 1-million-event window.
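
A minimal sketch of these metrics follows. It assumes each pattern can be reduced to a hashable identifier (for example a `(label, window_id)` pair) so that set intersection applies, and it uses the nearest-rank definition of the 95th percentile. Both choices are illustrative assumptions, not something the fixture format dictates.

```python
from typing import Hashable, Sequence, Set, Tuple


def precision_recall_f1(
    detected: Set[Hashable], gold: Set[Hashable]
) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for detected vs. gold patterns."""
    true_positives = len(detected & gold)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of latency samples (same unit as input)."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) computed with integers to avoid float rounding issues.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

For the latency gate, the samples would be wall-clock durations of repeated detection runs over the 1M-event window, measured for instance with `time.perf_counter()`.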

Thresholds:

- Pattern recall ≥ 0.75 — internal synthetic corpus
- Anomaly F1 ≥ 0.70 — ICEWS event detection literature, Ward et al. 2013
- Detection latency P95 < 1s at 1M events — Semantica production SLA

---

## Track 3.1 — Decision Explainability

**Why it matters:** `trace_decision_explainability()` and `explain_decision()` must surface the full reasoning chain — evidence nodes, policy references, precedent links. Incomplete explanations fail auditors and regulators who are legally required to understand why a decision was made.

**APIs under test:** `AgentContext.trace_decision_explainability()` · `DecisionContext.explain_decision()` · `ContextRetriever.explainable_retrieval()`

Datasets:

- **German Credit** — 1,000 lending decisions with human-annotated feature importance (top 3 factors per decision) and reason codes. Verifies that `explain_decision()` surfaces the same top factors as human annotators. Shared dataset; load from `datasets/decision_intelligence/german_credit/` if Sub-issue 1 has merged, otherwise download to `datasets/temporal_provenance/german_credit/`. License: CC BY 4.0.

- **IBM HR Attrition** — 1,470 employee records with 35 feature attributes and binary attrition labels. Domain experts have annotated the top decision factors for a 200-record subset. License: public domain (IBM). Download: [kaggle IBM HR dataset](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset). Path: `datasets/temporal_provenance/ibm_hr/`

- **ERASER benchmark** — 56,000 instances across 7 NLP tasks (Evidence Inference, BoolQ, Movie Reviews, FEVER, MultiRC, CoS-E, e-SNLI) with human-highlighted rationale annotations at token level. Used to evaluate `trace_decision_explainability()` rationale F1: do the explanation tokens overlap with human-highlighted reasoning tokens? License: research open. Citation: DeYoung et al., ACL 2020. Download: [eraser-benchmark.github.io](https://eraser-benchmark.github.io/). Path: `datasets/temporal_provenance/eraser/`

Metrics:

- `Explanation completeness = |gold_factors ∩ explained_factors| / |gold_factors|` — fraction of ground-truth decision factors surfaced by `explain_decision()`.
- `Explanation precision = |gold_factors ∩ explained_factors| / |explained_factors|` — fraction of surfaced factors in the gold set.
- `Rationale F1` — token-level F1 between explanation tokens and ERASER human-annotated rationale tokens. Harmonic mean of token recall and token precision.
- `Citation groundedness = |explanations citing a retrievable KG node| / |total explanations|` — fraction of explanation steps that cite a node actually present in the knowledge graph.
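
One possible reading of the token-level and groundedness metrics is sketched below. It assumes rationales are represented as 0/1 masks aligned to the same tokenization, and that each explanation step carries at most one cited node id. Both representations are hypothetical and may differ from what the Semantica APIs actually return.

```python
from typing import Optional, Sequence, Set


def rationale_f1(predicted_mask: Sequence[int], gold_mask: Sequence[int]) -> float:
    """Token-level F1 between two 0/1 masks over the same token sequence."""
    if len(predicted_mask) != len(gold_mask):
        raise ValueError("masks must cover the same tokens")
    true_positives = sum(1 for p, g in zip(predicted_mask, gold_mask) if p and g)
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(1 for p in predicted_mask if p)
    recall = true_positives / sum(1 for g in gold_mask if g)
    return 2 * precision * recall / (precision + recall)


def citation_groundedness(
    cited_node_ids: Sequence[Optional[str]], kg_node_ids: Set[str]
) -> float:
    """Fraction of explanation steps whose cited node exists in the KG.

    A step with no citation (None) counts as ungrounded.
    """
    if not cited_node_ids:
        return 0.0  # assumption: an empty explanation is not grounded
    grounded = sum(1 for node_id in cited_node_ids if node_id in kg_node_ids)
    return grounded / len(cited_node_ids)
```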

Thresholds:

- Explanation completeness ≥ 0.85 — [ERASER benchmark, DeYoung et al. ACL 2020](https://arxiv.org/abs/2004.06871)
- Rationale F1 ≥ 0.65 — ERASER multi-task baseline
- Citation groundedness ≥ 0.90 — [RAGAS faithfulness, arxiv:2309.15217](https://arxiv.org/abs/2309.15217)

---

## Track 3.2 — Provenance Lineage Integrity (W3C PROV-O)

**Why it matters:** A provenance chain with gaps or incorrect parent pointers is legally inadmissible as an audit trail. Every hop must be verifiable and checksum-protected. The W3C PROV-O SPARQL integrity constraint queries must return zero results — any non-zero result means a PROV-O constraint is violated and the test fails immediately.

**APIs under test:** `kg.provenance_tracker.ProvenanceTracker` · `kg.kg_provenance.GraphBuilderWithProvenance` · `context.context_provenance.ContextManagerWithProvenance`

Datasets:

- **FEVER** — 185,445 claim–evidence pairs where each claim links to one or more Wikipedia revision sources. Provenance chain: claim → evidence sentences → Wikipedia article → Wikipedia revision. Tests that `ProvenanceTracker` correctly links every evidence node to its source. License: CC BY 4.0. Citation: Thorne et al., NAACL 2018. Download: [fever.ai](https://fever.ai/). Path: `datasets/temporal_provenance/fever/`

- **W3C PROV-O conformance test suite** — 57 positive entailment tests (these triples must be inferred from the provenance graph) and 38 negative tests (these inferences must not be made). All 95 tests must pass. License: W3C document. Download: [w3.org/TR/prov-o](https://www.w3.org/TR/prov-o/). Path: `datasets/temporal_provenance/w3c_prov/`

- **Synthetic 10k-node provenance chain** — 4-hop lineage graph with 10,000 nodes and gold ancestor sets for each leaf node. Used to verify `ProvenanceTracker` completeness at depth 4. Generated once; committed to `fixtures/provenance_chain_10k.json`.

- **Concurrent-write provenance fixture** — 50 threads writing overlapping provenance entries simultaneously. Generated at test time. Validates PROV-O monotonicity under concurrency.

Metrics:

- `Lineage completeness at depth D = |retrieved_ancestors ∩ gold_ancestors| / |gold_ancestors|` — fraction of gold ancestors recovered at depth D. Evaluated at D=4.
- `Checksum integrity = |nodes where stored_hash == recomputed_hash| / |total_nodes|` — fraction of provenance nodes with intact hash values.
- `W3C PROV-O SPARQL violations = count of integrity constraint queries returning non-empty result` — hard gate; any non-zero value means a PROV-O constraint is violated.
- `Revision monotonicity violations = gap_count + overlap_count` — combined count of bitemporal violations. Must be 0.
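
The lineage and checksum metrics might be computed along these lines. This is a sketch under stated assumptions: provenance is modelled as a child-to-parents mapping, each node is a dict with hypothetical `payload` and `stored_hash` keys, and the hash is SHA-256 over canonical JSON. The source does not specify the hash algorithm or node layout, so treat all three as placeholders.

```python
import hashlib
import json
from collections import deque
from typing import Any, Dict, List, Mapping, Sequence, Set


def node_hash(payload: Mapping[str, Any]) -> str:
    """SHA-256 over a canonical JSON encoding (assumed hashing scheme)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def checksum_integrity(nodes: Sequence[Mapping[str, Any]]) -> float:
    """Fraction of nodes whose stored hash matches the recomputed hash."""
    if not nodes:
        return 1.0  # assumption: an empty graph has nothing to corrupt
    intact = sum(1 for n in nodes if n["stored_hash"] == node_hash(n["payload"]))
    return intact / len(nodes)


def ancestors_within(
    parents: Mapping[str, List[str]], node: str, depth: int
) -> Set[str]:
    """Breadth-first walk over parent links, at most `depth` hops up."""
    seen: Set[str] = set()
    frontier = deque([(node, 0)])
    while frontier:
        current, hops = frontier.popleft()
        if hops == depth:
            continue  # do not expand beyond the requested depth
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                frontier.append((parent, hops + 1))
    return seen


def lineage_completeness(retrieved: Set[str], gold: Set[str]) -> float:
    """|retrieved ∩ gold| / |gold|."""
    if not gold:
        return 1.0
    return len(retrieved & gold) / len(gold)
```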

Thresholds (binary gates):

- Lineage completeness at 4-hop = 1.0 — binary correctness, [W3C PROV-DM §4.4](https://www.w3.org/TR/prov-dm/)
- Checksum integrity = 1.0 — binary correctness
- **W3C PROV-O SPARQL violations = 0 — HARD GATE.** [W3C PROV-O specification](https://www.w3.org/TR/prov-o/). Any non-zero result fails the test immediately.
- Revision monotonicity violations = 0 — [W3C PROV-DM §4.4](https://www.w3.org/TR/prov-dm/)

---

## Track 3.3 — Cross-module Provenance Continuity

**Why it matters:** Every Semantica module (`ingest`, `semantic_extract`, `kg`, `deduplication`) appends provenance records. If the chain breaks between modules, the audit trail is incomplete and the full lineage cannot be reconstructed — the ingest source of a KG node becomes untraceable.

**APIs under test:** `IngestProvenanceMixin` → `EmbeddingGeneratorWithProvenance` → `GraphBuilderWithProvenance` → `AlgorithmTrackerWithProvenance` → `ProvenanceTracker.export_audit_log()`

Datasets:

- **End-to-end provenance pipeline fixture** — 1,000 documents traced from raw file through the full pipeline: ingest → embedding → KG node. Every inter-module link is recorded. Generated once; committed to `fixtures/e2e_provenance.json`.

Metrics:

- `Chain continuity = |documents where every inter-module link resolves| / |total_documents|` — fraction of documents for which the full ingest-to-KG provenance chain is intact.
- `Orphan rate = |KG nodes with no traceable ingest source| / |total_KG_nodes|` — fraction of graph nodes with no provenance record. Must be 0.
- `Audit log completeness = |pipeline stages present in exported log| / |expected_stages|` — fraction of pipeline stages present in the exported audit log. Must be 1.0.
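
To make the three continuity metrics concrete, here is a small sketch over a hypothetical in-memory layout: every provenance record has a `stage` name and a single `parent` id (or `None` at the ingest root). The stage names follow the `ingest → embedding → KG node` pipeline in the fixture description; the record layout itself is an assumption, not the format of `fixtures/e2e_provenance.json`.

```python
from typing import Dict, List, Mapping, Optional, Sequence

# Stage order assumed from the fixture description; adjust if the real
# pipeline records more stages.
EXPECTED_STAGES = ("ingest", "embedding", "kg")

Record = Mapping[str, Optional[str]]  # {"stage": ..., "parent": ...}


def trace_to_source(records: Mapping[str, Record], node_id: str) -> Optional[List[str]]:
    """Follow parent pointers upward; None if a link dangles or cycles."""
    stages: List[str] = []
    seen = set()
    current: Optional[str] = node_id
    while current is not None:
        if current in seen or current not in records:
            return None
        seen.add(current)
        stages.append(records[current]["stage"])
        current = records[current]["parent"]
    return stages


def chain_continuity(records: Mapping[str, Record], kg_node_by_doc: Mapping[str, str]) -> float:
    """Fraction of documents whose KG node traces back through every stage."""
    if not kg_node_by_doc:
        return 1.0
    expected = list(reversed(EXPECTED_STAGES))
    intact = sum(
        1 for kg_id in kg_node_by_doc.values()
        if trace_to_source(records, kg_id) == expected
    )
    return intact / len(kg_node_by_doc)


def orphan_rate(records: Mapping[str, Record]) -> float:
    """Fraction of KG nodes that cannot be traced to an ingest record."""
    kg_ids = [i for i, r in records.items() if r["stage"] == "kg"]
    if not kg_ids:
        return 0.0
    orphans = 0
    for kg_id in kg_ids:
        stages = trace_to_source(records, kg_id)
        if stages is None or stages[-1] != "ingest":
            orphans += 1
    return orphans / len(kg_ids)


def audit_log_completeness(logged_stages: Sequence[str]) -> float:
    """Fraction of expected pipeline stages present in the exported log."""
    return len(set(logged_stages) & set(EXPECTED_STAGES)) / len(EXPECTED_STAGES)
```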

Thresholds (binary gates):

- Chain continuity = 1.0 — binary correctness
- Orphan rate = 0 — internal SLA
- Audit log completeness = 1.0 — binary correctness

---

## Fixtures to generate

Generate once locally, commit the output. Do not regenerate fixtures after committing.

- `scripts/generate_temporal_provenance_fixtures.py` — generates all four fixtures below.
- `fixtures/bitemporal_stress.json` — 50k revision events, zero injected gaps or overlaps.
- `fixtures/temporal_patterns.json` — 10k events with spike/drift/periodicity/normal labels.
- `fixtures/provenance_chain_10k.json` — 4-hop lineage graph, 10k nodes, gold ancestor sets.
- `fixtures/e2e_provenance.json` — 1k documents with full pipeline provenance trace.
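
As an illustration of the "generate once" requirement, the bitemporal stress generator could be seeded so that reruns are byte-identical. The sketch below is one way to do that; the field names, the entity count, and the seed value are all assumptions, since the issue does not define the JSON schema.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def generate_bitemporal_stress(
    n_events: int = 50_000, n_entities: int = 500, seed: int = 0
) -> List[Dict[str, object]]:
    """Build revision events whose windows are contiguous per entity.

    Each new revision starts exactly where the previous revision of the
    same entity ended, so the corpus contains no gaps and no overlaps.
    """
    rng = random.Random(seed)  # local RNG keeps the output reproducible
    start = datetime(2020, 1, 1, tzinfo=timezone.utc)
    cursor = {f"entity_{i}": start for i in range(n_entities)}
    version = {entity_id: 0 for entity_id in cursor}

    events: List[Dict[str, object]] = []
    for _ in range(n_events):
        entity_id = f"entity_{rng.randrange(n_entities)}"
        valid_from = cursor[entity_id]
        valid_until = valid_from + timedelta(minutes=rng.randint(1, 10_000))
        events.append({
            "entity_id": entity_id,
            "version": version[entity_id],
            "valid_from": valid_from.isoformat(),
            "valid_until": valid_until.isoformat(),
        })
        cursor[entity_id] = valid_until
        version[entity_id] += 1
    return events
```

The script would then write the result with `json.dump(events, f)` to `fixtures/bitemporal_stress.json`. Using a fixed seed means the committed file can be verified later without being regenerated in place.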

---

## Files to create

```text
benchmarks/temporal_provenance/
  conftest.py
  test_temporal_validity.py
  test_bitemporal_integrity.py
  test_temporal_patterns.py
  test_decision_explainability.py
  test_provenance_lineage.py
  test_provenance_continuity.py

scripts/
  generate_temporal_provenance_fixtures.py

datasets/temporal_provenance/
  README.md
```

`conftest.py` — all threshold constants at the top; `session`-scoped fixtures for all datasets; shared fixture loaders for all four JSON fixture files.

`test_temporal_validity.py` — `test_stale_injection_rate`, `test_future_injection_rate`, `test_temporal_precision_at_5`, `test_mrr_tkgl_icews`.

`test_bitemporal_integrity.py` — `test_gap_count_zero`, `test_overlap_count_zero`, `test_revision_monotonicity`, `test_concurrent_write_no_overlap`.

`test_temporal_patterns.py` — `test_pattern_recall`, `test_anomaly_f1_icews`, `test_detection_latency_p95`.

`test_decision_explainability.py` — `test_explanation_completeness`, `test_rationale_f1_eraser`, `test_citation_groundedness`.

`test_provenance_lineage.py` — `test_lineage_completeness_4hop`, `test_checksum_integrity`, `test_prov_o_sparql_violations`, `test_revision_monotonicity_violations`.

`test_provenance_continuity.py` — `test_chain_continuity`, `test_orphan_rate`, `test_audit_log_completeness`.

---

## Key implementation patterns

W3C PROV-O SPARQL test — exports provenance graph as RDF Turtle and runs all integrity constraint queries:

```python
def test_prov_o_sparql_violations(provenance_tracker):
    rdf_graph = provenance_tracker.export_rdf(format="turtle")
    violations = []
    for name, query in W3C_PROV_O_INTEGRITY_QUERIES.items():
        if len(rdf_graph.query(query)) > 0:
            violations.append(name)
    assert not violations, (
        f"PROV-O violations: {violations}. "
        f"All W3C PROV-O integrity constraints must return zero results."
    )
```

50-thread concurrent write test — generates fixture at test time:

```python
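import random
import threading

# TemporalVersionManager and count_overlapping_windows are assumed to be
# importable in this test module (for example via conftest.py).

N_THREADS = 50

def test_concurrent_write_no_overlap():
    manager = TemporalVersionManager()
    errors = []
    # The barrier releases all writers together so the writes really contend.
    barrier = threading.Barrier(N_THREADS)

    def write_entity():
        try:
            barrier.wait(timeout=10)
            manager.create_revision("entity_1", data={"v": random.random()})
        except Exception as e:
            # Collect instead of raising: exceptions do not propagate from threads.
            errors.append(e)

    threads = [threading.Thread(target=write_entity) for _ in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert not errors, f"Concurrent write raised exceptions: {errors}"
    overlaps = count_overlapping_windows(manager.get_all_revisions("entity_1"))
    assert overlaps == 0, (
        f"{overlaps} overlapping revision windows under {N_THREADS}-thread concurrent write."
    )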
```

Binary gate pattern — use `== 0` not `<= threshold` for integrity counts:

```python
def test_gap_count_zero(bitemporal_stress_fixture, temporal_manager):
    revisions = temporal_manager.load(bitemporal_stress_fixture)
    gaps = count_gaps(revisions)
    assert gaps == 0, (
        f"Found {gaps} temporal gaps. W3C PROV-DM §4.4 requires exactly zero."
    )
```

---

## Checklist

- [ ] `datasets/temporal_provenance/multitq/` downloaded
- [ ] `datasets/temporal_provenance/cronquestions/` downloaded
- [ ] `datasets/temporal_provenance/tkgl_icews/` downloaded and added to git-lfs
- [ ] `datasets/temporal_provenance/tkgl_wikidata/` downloaded and added to git-lfs
- [ ] `datasets/temporal_provenance/icews/` downloaded
- [ ] `datasets/temporal_provenance/fever/` downloaded
- [ ] `datasets/temporal_provenance/w3c_prov/` downloaded
- [ ] `datasets/temporal_provenance/eraser/` downloaded
- [ ] `datasets/temporal_provenance/ibm_hr/` downloaded
- [ ] `fixtures/bitemporal_stress.json` generated and committed (50k events, zero injected gaps)
- [ ] `fixtures/temporal_patterns.json` generated and committed (10k events, ground-truth labels)
- [ ] `fixtures/provenance_chain_10k.json` generated and committed (4-hop, 10k nodes)
- [ ] `fixtures/e2e_provenance.json` generated and committed (1k documents, full pipeline trace)
- [ ] `conftest.py` — all threshold constants, all dataset loaders
- [ ] `test_temporal_validity.py` — stale rate, future rate, precision@5, MRR on tkgl-icews
- [ ] `test_bitemporal_integrity.py` — gap=0, overlap=0, monotonicity=1.0, concurrent write safe
- [ ] `test_temporal_patterns.py` — pattern recall ≥ 0.75, anomaly F1 ≥ 0.70, latency < 1s
- [ ] `test_decision_explainability.py` — completeness ≥ 0.85, rationale F1 ≥ 0.65, citation groundedness ≥ 0.90
- [ ] `test_provenance_lineage.py` — PROV-O SPARQL = 0, checksum = 1.0, lineage = 1.0, monotonicity violations = 0
- [ ] `test_provenance_continuity.py` — chain continuity = 1.0, orphan rate = 0, audit log = 1.0
- [ ] All tests pass: `pytest benchmarks/temporal_provenance/ -p no:langsmith`
- [ ] No test requires a real LLM (no `SEMANTICA_REAL_LLM` dependency)
- [ ] `datasets/temporal_provenance/README.md` written
- [ ] PR title: `benchmarks: temporal and provenance tracks 2.1–2.3, 3.1–3.3`

