Scope
Six tracks that validate Semantica's temporal reasoning and W3C-conformant audit trail correctness. All six tracks live in benchmarks/temporal_provenance/.
Track 2.1 — Temporal Validity & Stale Context Prevention: does TemporalGraphRetriever return only facts valid at query time? Stale facts in LLM context produce confidently-wrong answers — a patient safety or compliance risk.
Track 2.2 — Bitemporal Revision Integrity: does TemporalVersionManager maintain gap-free, non-overlapping revision histories under concurrent write load? Gaps are invisible without explicit tests.
Track 2.3 — Temporal Pattern Detection: does TemporalPatternDetector surface trends, anomalies, and regime changes from time-series graph data?
Track 3.1 — Decision Explainability: does trace_decision_explainability() surface the full reasoning chain — evidence nodes, policy references, precedent links? Incomplete explanations fail auditors.
Track 3.2 — Provenance Lineage Integrity (W3C PROV-O): is every hop in the lineage checksum-protected and W3C-conformant? A chain with gaps is legally inadmissible as an audit trail.
Track 3.3 — Cross-module Provenance Continuity: does the provenance chain stay unbroken as a document moves through ingest → embedding → KG → audit_log?
Do not touch any file outside benchmarks/temporal_provenance/ and datasets/temporal_provenance/.
Track 2.1 — Temporal Validity & Stale Context Prevention
Why it matters: If TemporalGraphRetriever injects stale facts into an LLM context, the model generates outdated answers with false confidence. In healthcare or legal domains, this is a direct patient safety or compliance risk.
APIs under test: TemporalGraphRetriever · ContextRetriever (temporal filters: valid_from, valid_until) · ContextGraph.add_node() with valid_from/valid_until parameters
Datasets:
MultiTQ — 7,000 multi-temporal Q&A pairs designed specifically for retrieval-before-answer systems. Each question requires reasoning over multiple simultaneous temporal constraints (e.g., "who held role X during year Y when Z was also true"). Directly exposes retrievers that ignore valid_from/valid_until boundaries. License: research open. Citation: Chen et al., EMNLP 2023, arxiv:2310.01253. Path: datasets/temporal_provenance/multitq/
CronQuestions — 410,357 temporal Q&A pairs covering the full range of temporal reasoning: role holders during a year, last events before a timestamp, concurrent events. Largest temporal QA benchmark available; provides MRR signal at scale. License: research open. Citation: Saxena et al., ACL 2021. Download: github.com/apoorvumang/CronKGQA. Path: datasets/temporal_provenance/cronquestions/
TGB 2.0 tkgl-icews — 9.8 million political event quadruples (subject, relation, object, timestamp) from the Integrated Crisis Early Warning System. Tests time-aware MRR: how well does the retriever rank temporally-valid facts above stale ones? Requires git-lfs — exceeds 5 MB. License: research open. Citation: TGB 2.0, NeurIPS 2024. Download: tgb.complexdatalab.com. Path: datasets/temporal_provenance/tkgl_icews/
TGB 2.0 tkgl-wikidata — 1.5 million general temporal KG quadruples from Wikidata, covering a wide variety of entity types and relation types. Closer to Semantica's ontology than ICEWS; shared with Track 2.3. Requires git-lfs. License: CC BY 4.0. Citation: TGB 2.0, NeurIPS 2024. Path: datasets/temporal_provenance/tkgl_wikidata/
WebQSP removed: a SIGIR 2022 audit found ~52% stale ground truth, making Precision@k unreliable. TimeQA (150 Q&A pairs) removed: at that sample size the 95% confidence interval is about ±8%, so a single wrong answer can flip the pass/fail result.
Metrics:
Stale injection rate = |retrieved_facts where valid_until < query_time| / |total_retrieved| — fraction of retrieved facts that had already expired at query time.
Future injection rate = |retrieved_facts where valid_from > query_time| / |total_retrieved| — fraction of retrieved facts not yet valid at query time.
Temporal precision@5 = |correct_temporal_facts in top-5| / 5 — fraction of top-5 results that are temporally valid and factually correct.
MRR (time-aware) = (1/|Q|) × Σ_q (1 / rank_q) — Mean Reciprocal Rank where only temporally-valid facts count as relevant. Evaluated on tkgl-icews.
Thresholds:
Stale injection rate < 0.05 — Semantica production SLA 2026-Q1
Future injection rate < 0.05 — Semantica production SLA 2026-Q1
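The Track 2.1 metrics can be sketched in a few lines. This is a minimal illustration, not Semantica code: the Fact dataclass, the function names, and the convention that valid_until = None means "still valid" are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Sequence, Set


@dataclass(frozen=True)
class Fact:
    """Hypothetical retrieved fact; not a Semantica type."""
    fact_id: str
    valid_from: datetime
    valid_until: Optional[datetime]  # assumption: None means "still valid"


def stale_injection_rate(retrieved: Sequence[Fact], query_time: datetime) -> float:
    """Fraction of retrieved facts whose validity ended before query_time."""
    if not retrieved:
        return 0.0
    stale = sum(
        1 for f in retrieved
        if f.valid_until is not None and f.valid_until < query_time
    )
    return stale / len(retrieved)


def future_injection_rate(retrieved: Sequence[Fact], query_time: datetime) -> float:
    """Fraction of retrieved facts that only become valid after query_time."""
    if not retrieved:
        return 0.0
    future = sum(1 for f in retrieved if f.valid_from > query_time)
    return future / len(retrieved)


def time_aware_mrr(rankings: List[List[str]], relevant: List[Set[str]]) -> float:
    """Mean reciprocal rank; only temporally valid gold ids count as relevant."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked_ids, gold_ids in zip(rankings, relevant):
        for rank, fact_id in enumerate(ranked_ids, start=1):
            if fact_id in gold_ids:
                total += 1.0 / rank
                break  # only the first relevant hit contributes
    return total / len(rankings)
```

A test would then compare each rate against the 0.05 threshold listed above, for example `assert stale_injection_rate(facts, query_time) < 0.05`.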
Track 2.2 — Bitemporal Revision Integrity
Why it matters: A bitemporal graph tracks both valid time (when a fact was true in the world) and transaction time (when it was recorded). Gaps or overlaps in either dimension corrupt audit trails and are completely invisible without explicit correctness tests.
APIs under test: TemporalVersionManager · kg.temporal_query.TemporalVersionManager · TemporalGraphQuery · validate_temporal_consistency()
Datasets:
Synthetic bitemporal stress corpus — 50,000 revision events with zero injected gaps or overlaps. Because the fixture is clean, every integrity check must pass on it unmodified, and must fail as soon as a single error is introduced; if a gap is injected and the test still passes, the detector is broken. Generated once; committed to fixtures/bitemporal_stress.json.
Concurrent-write fixture — generated at test time, not from a file. Fifty threads each write the same entity simultaneously. Expected output: zero overlapping revision windows in the final revision list.
Wikidata revision history subset — 10k entity edit pairs with valid_from / transaction_time metadata. Tests the bitemporal manager on real revision patterns extracted from Wikidata history dumps. Committed via git-lfs.
Metrics:
Revision monotonicity = |pairs where valid_until[v] == valid_from[v+1]| / |total_consecutive_pairs| — fraction of consecutive revision pairs with exactly-contiguous windows. Must equal 1.0.
Temporal gap count = |pairs where valid_until[v] < valid_from[v+1]| — count of revision pairs with a gap between them. Must be exactly 0.
Overlap count = |pairs where valid_until[v] > valid_from[v+1]| — count of revision pairs where windows overlap. Must be exactly 0.
Concurrent write safety — count of overlapping revision windows observed after 50 threads simultaneously write the same entity. Must be 0.
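The three window metrics above reduce to one pass over consecutive revision pairs. The sketch below assumes the revisions are already in version order and that each window is a (valid_from, valid_until) pair of comparable timestamps; the function name and the returned dictionary keys are illustrative, not part of TemporalVersionManager.

```python
from typing import Dict, List, Tuple

# Assumption: a window is (valid_from, valid_until) with comparable timestamps.
Window = Tuple[int, int]


def revision_integrity(windows: List[Window]) -> Dict[str, float]:
    """Count gaps, overlaps, and contiguous pairs across consecutive revisions.

    `windows` must be in version order (v, v+1, ...).
    """
    gaps = overlaps = contiguous = 0
    for (_, until), (next_from, _) in zip(windows, windows[1:]):
        if until < next_from:
            gaps += 1          # hole between revision v and v+1
        elif until > next_from:
            overlaps += 1      # both revisions claim the same interval
        else:
            contiguous += 1    # exactly adjacent windows
    pairs = len(windows) - 1
    monotonicity = contiguous / pairs if pairs > 0 else 1.0
    return {
        "gap_count": gaps,
        "overlap_count": overlaps,
        "revision_monotonicity": monotonicity,
    }


report = revision_integrity([(0, 10), (10, 20), (20, 30)])
# Binary gates: compare with exact equality, never with a tolerance.
assert report["gap_count"] == 0
assert report["overlap_count"] == 0
assert report["revision_monotonicity"] == 1.0
```

Exact equality is deliberate here: a gate written as `<= threshold` would let a single gap through whenever the threshold is loosened.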
Thresholds (binary gates — any non-zero value fails the test immediately):
Track 2.3 — Temporal Pattern Detection
Why it matters: TemporalPatternDetector surfaces trends, periodicity, and regime changes in time-series graph data. Poor recall means anomalies are missed; poor precision causes false alerts that waste investigator time; slow detection misses the real-time SLA.
APIs under test: kg.temporal_query.TemporalPatternDetector · TemporalGraphQuery
Datasets:
ICEWS — Integrated Crisis Early Warning System; political event dataset covering global events from 1995 to present, encoded as subject–predicate–object–time quadruples. Used to evaluate anomaly detection: do detected anomaly clusters align with known geopolitical event clusters? License: public domain. Path: datasets/temporal_provenance/icews/
TGB 2.0 tkgl-wikidata — shared with Track 2.1. Used for periodicity and drift detection across multiple entity types.
Synthetic temporal anomaly corpus — 10,000 events with injected known patterns: spikes (sudden frequency increase), drift (gradual distribution shift), periodicity (weekly/monthly cycles), and normal background. Every event has a ground-truth pattern label. Generated once; committed to fixtures/temporal_patterns.json.
Metrics:
Pattern recall = |detected_patterns ∩ gold_patterns| / |gold_patterns| — fraction of known patterns in the fixture that are detected.
Pattern precision = |detected_patterns ∩ gold_patterns| / |detected_patterns| — fraction of detected patterns that are real.
Anomaly F1 = 2 × precision × recall / (precision + recall) — harmonic mean on the ICEWS anomaly detection task.
Detection latency P95 at 1M event window — 95th percentile wall time to run pattern detection on a 1-million-event window.
Thresholds:
Pattern recall ≥ 0.75 — internal synthetic corpus
Anomaly F1 ≥ 0.70 — ICEWS event detection literature, Ward et al. 2013
Detection latency P95 < 1s at 1M events — Semantica production SLA
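A possible way to compute the Track 2.3 metrics is shown below. It treats detected and gold patterns as sets of hashable labels and uses a nearest-rank 95th percentile for latency; both choices, and all names, are assumptions for illustration rather than the TemporalPatternDetector interface.

```python
from typing import Hashable, Sequence, Set, Tuple


def precision_recall_f1(
    detected: Set[Hashable], gold: Set[Hashable]
) -> Tuple[float, float, float]:
    """Set-based precision, recall, and their harmonic mean."""
    hits = len(detected & gold)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of wall-time samples (seconds)."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

With these helpers, the gates above would read `recall >= 0.75`, `f1 >= 0.70`, and `p95(latencies) < 1.0`.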
Track 3.1 — Decision Explainability
Why it matters: trace_decision_explainability() and explain_decision() must surface the full reasoning chain — evidence nodes, policy references, precedent links. Incomplete explanations fail auditors and regulators who are legally required to understand why a decision was made.
APIs under test: AgentContext.trace_decision_explainability() · DecisionContext.explain_decision() · ContextRetriever.explainable_retrieval()
Datasets:
German Credit — 1,000 lending decisions with human-annotated feature importance (top 3 factors per decision) and reason codes. Verifies that explain_decision() surfaces the same top factors as human annotators. Shared dataset; load from datasets/decision_intelligence/german_credit/ if Sub-issue 1 has merged, otherwise download to datasets/temporal_provenance/german_credit/. License: CC BY 4.0.
IBM HR Attrition — 1,470 employee records with 35 feature attributes and binary attrition labels. Domain experts have annotated the top decision factors for a 200-record subset. License: public domain (IBM). Download: kaggle IBM HR dataset. Path: datasets/temporal_provenance/ibm_hr/
ERASER benchmark — 56,000 instances across 7 NLP tasks (Evidence Inference, BoolQ, Movie Reviews, FEVER, MultiRC, CoS-E, e-SNLI) with human-highlighted rationale annotations at token level. Used to evaluate trace_decision_explainability() rationale F1: do the explanation tokens overlap with human-highlighted reasoning tokens? License: research open. Citation: DeYoung et al., ACL 2020. Download: eraser-benchmark.github.io. Path: datasets/temporal_provenance/eraser/
Metrics:
Explanation completeness = |gold_factors ∩ explained_factors| / |gold_factors| — fraction of ground-truth decision factors surfaced by explain_decision().
Explanation precision = |gold_factors ∩ explained_factors| / |explained_factors| — fraction of surfaced factors in the gold set.
Rationale F1 — token-level F1 between explanation tokens and ERASER human-annotated rationale tokens. Harmonic mean of token recall and token precision.
Citation groundedness = |explanations citing a retrievable KG node| / |total explanations| — fraction of explanation steps that cite a node actually present in the knowledge graph.
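The explainability metrics are all overlap ratios. The sketch below assumes decision factors are plain strings and that rationales are represented as sets of token positions, which is one common reading of token-level F1; the function names are hypothetical and do not mirror explain_decision() output.

```python
from typing import Set, Tuple


def factor_overlap(gold: Set[str], explained: Set[str]) -> Tuple[float, float]:
    """Return (explanation completeness, explanation precision)."""
    shared = len(gold & explained)
    completeness = shared / len(gold) if gold else 1.0
    precision = shared / len(explained) if explained else 0.0
    return completeness, precision


def rationale_token_f1(predicted: Set[int], gold: Set[int]) -> float:
    """Token-level F1 over token positions marked as rationale."""
    shared = len(predicted & gold)
    if shared == 0:
        return 0.0
    precision = shared / len(predicted)
    recall = shared / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For a German Credit record annotated with three gold factors, an explanation that surfaces two of them plus one unrelated factor scores 2/3 on both completeness and precision.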
Track 3.2 — Provenance Lineage Integrity (W3C PROV-O)
Why it matters: A provenance chain with gaps or incorrect parent pointers is legally inadmissible as an audit trail. Every hop must be verifiable and checksum-protected. The W3C PROV-O SPARQL integrity constraint queries must return zero results — any non-zero result means a PROV-O constraint is violated and the test fails immediately.
APIs under test: kg.provenance_tracker.ProvenanceTracker · kg.kg_provenance.GraphBuilderWithProvenance · context.context_provenance.ContextManagerWithProvenance
Datasets:
FEVER — 185,445 claim–evidence pairs where each claim links to one or more Wikipedia revision sources. Provenance chain: claim → evidence sentences → Wikipedia article → Wikipedia revision. Tests that ProvenanceTracker correctly links every evidence node to its source. License: CC BY 4.0. Citation: Thorne et al., NAACL 2018. Download: fever.ai. Path: datasets/temporal_provenance/fever/
W3C PROV-O conformance test suite — 57 positive entailment tests (these triples must be inferred from the provenance graph) and 38 negative tests (these inferences must not be made). All 95 tests must pass. License: W3C document. Download: w3.org/TR/prov-o. Path: datasets/temporal_provenance/w3c_prov/
Synthetic 10k-node provenance chain — 4-hop lineage graph with 10,000 nodes and gold ancestor sets for each leaf node. Used to verify ProvenanceTracker completeness at depth 4. Generated once; committed to fixtures/provenance_chain_10k.json.
Concurrent-write provenance fixture — 50 threads writing overlapping provenance entries simultaneously. Generated at test time. Validates PROV-O monotonicity under concurrency.
Metrics:
Lineage completeness at depth D = |retrieved_ancestors ∩ gold_ancestors| / |gold_ancestors| — fraction of gold ancestors recovered at depth D. Evaluated at D=4.
Checksum integrity = |nodes where stored_hash == recomputed_hash| / |total_nodes| — fraction of provenance nodes with intact hash values.
W3C PROV-O SPARQL violations = count of integrity constraint queries returning non-empty result — hard gate; any non-zero value means a PROV-O constraint is violated.
Revision monotonicity violations = gap_count + overlap_count — combined count of bitemporal violations. Must be 0.
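Checksum integrity and lineage completeness can be illustrated with the standard library alone. The sketch assumes each provenance node carries a JSON-serialisable payload and a stored SHA-256 hex digest, and that lineage is a child-to-parents mapping; the real ProvenanceTracker storage format may differ.

```python
import hashlib
import json
from typing import Dict, List, Sequence, Set


def node_hash(payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no whitespace)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def checksum_integrity(nodes: Sequence[dict]) -> float:
    """Fraction of nodes whose stored hash matches the recomputed hash."""
    if not nodes:
        return 1.0
    intact = sum(1 for n in nodes if n["stored_hash"] == node_hash(n["payload"]))
    return intact / len(nodes)


def ancestors(parents: Dict[str, List[str]], leaf: str, max_depth: int) -> Set[str]:
    """Collect every ancestor reachable from `leaf` within `max_depth` hops."""
    seen: Set[str] = set()
    frontier = {leaf}
    for _ in range(max_depth):
        frontier = {p for node in frontier for p in parents.get(node, ())} - seen
        if not frontier:
            break
        seen |= frontier
    return seen


def lineage_completeness(retrieved: Set[str], gold: Set[str]) -> float:
    """Fraction of gold ancestors that were actually recovered."""
    return len(retrieved & gold) / len(gold) if gold else 1.0
```

Canonical serialisation matters: hashing a dict without sorted keys could report a mismatch for a node that was never tampered with.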
Track 3.3 — Cross-module Provenance Continuity
Why it matters: Every Semantica module (ingest, semantic_extract, kg, deduplication) appends provenance records. If the chain breaks between modules, the audit trail is incomplete and the full lineage cannot be reconstructed — the ingest source of a KG node becomes untraceable.
APIs under test: IngestProvenanceMixin → EmbeddingGeneratorWithProvenance → GraphBuilderWithProvenance → AlgorithmTrackerWithProvenance → ProvenanceTracker.export_audit_log()
Datasets:
End-to-end provenance pipeline fixture — 1,000 documents traced from raw file through the full pipeline: ingest → embedding → KG node. Every inter-module link is recorded. Generated once; committed to fixtures/e2e_provenance.json.
Metrics:
Chain continuity = |documents where every inter-module link resolves| / |total_documents| — fraction of documents for which the full ingest-to-KG provenance chain is intact.
Orphan rate = |KG nodes with no traceable ingest source| / |total_KG_nodes| — fraction of graph nodes with no provenance record. Must be 0.
Audit log completeness = |pipeline stages present in exported log| / |expected_stages| — fraction of pipeline stages present in the exported audit log. Must be 1.0.
Thresholds (binary gates):
Chain continuity = 1.0 — binary correctness
Orphan rate = 0 — internal SLA
Audit log completeness = 1.0 — binary correctness
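The Track 3.3 continuity metrics can be sketched as follows. The trace layout (a list of links with stage, id, and parent_id keys) and the stage names are assumptions chosen to mirror the ingest → embedding → KG description; the actual shape of fixtures/e2e_provenance.json is not specified here.

```python
from typing import Dict, Iterable, List, Optional, Sequence, Set

# Assumed stage names, following the ingest → embedding → KG pipeline.
EXPECTED_STAGES = ("ingest", "embedding", "kg")

Link = Dict[str, Optional[str]]  # keys: "stage", "id", "parent_id"


def chain_resolves(trace: List[Link], stages: Sequence[str] = EXPECTED_STAGES) -> bool:
    """True if every stage is present and points at the previous stage's id."""
    by_stage = {link["stage"]: link for link in trace}
    previous_id = None
    for stage in stages:
        link = by_stage.get(stage)
        if link is None or link["parent_id"] != previous_id:
            return False
        previous_id = link["id"]
    return True


def chain_continuity(traces: Sequence[List[Link]]) -> float:
    """Fraction of documents whose full chain resolves."""
    if not traces:
        return 1.0
    return sum(1 for t in traces if chain_resolves(t)) / len(traces)


def orphan_rate(kg_nodes: Sequence[str], traced_nodes: Set[str]) -> float:
    """Fraction of KG nodes with no traceable ingest source."""
    if not kg_nodes:
        return 0.0
    return sum(1 for n in kg_nodes if n not in traced_nodes) / len(kg_nodes)


def audit_log_completeness(present: Iterable[str], expected: Sequence[str]) -> float:
    """Fraction of expected pipeline stages found in the exported log."""
    return len(set(present) & set(expected)) / len(expected)
```

All three values are binary gates, so a test would assert `chain_continuity(traces) == 1.0` and `orphan_rate(nodes, traced) == 0` rather than comparing against a tolerance.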
Fixtures to generate
Generate once locally, commit the output. Do not regenerate fixtures after committing.
scripts/generate_temporal_provenance_fixtures.py — generates all four fixtures below.
fixtures/bitemporal_stress.json — 50k revision events, zero injected gaps or overlaps.
fixtures/temporal_patterns.json — 10k events with spike/drift/periodicity/normal labels.
fixtures/provenance_chain_10k.json — 4-hop lineage graph, 10k nodes, gold ancestor sets.
fixtures/e2e_provenance.json — 1k documents with full pipeline provenance trace.
Files to create
conftest.py — all threshold constants at the top; session-scoped fixtures for all datasets; shared fixture loaders for all four JSON fixture files.
test_temporal_validity.py — test_stale_injection_rate, test_future_injection_rate, test_temporal_precision_at_5, test_mrr_tkgl_icews.
test_bitemporal_integrity.py — test_gap_count_zero, test_overlap_count_zero, test_revision_monotonicity, test_concurrent_write_no_overlap.
test_temporal_patterns.py — test_pattern_recall, test_anomaly_f1_icews, test_detection_latency_p95.
test_decision_explainability.py — test_explanation_completeness, test_rationale_f1_eraser, test_citation_groundedness.
test_provenance_lineage.py — test_lineage_completeness_4hop, test_checksum_integrity, test_prov_o_sparql_violations, test_revision_monotonicity_violations.
test_provenance_continuity.py — test_chain_continuity, test_orphan_rate, test_audit_log_completeness.
Key implementation patterns
W3C PROV-O SPARQL test — exports provenance graph as RDF Turtle and runs all integrity constraint queries:
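One way such a check could look, assuming the third-party rdflib package is available and the provenance graph has already been exported as Turtle text. The two queries are simple illustrative constraints (an entity derived from itself, an activity that ends before it starts), not the official W3C query set, and count_prov_violations is a hypothetical helper name.

```python
try:
    from rdflib import Graph  # third-party; assumed installed in the benchmark env
except ImportError:  # keep the sketch importable when rdflib is absent
    Graph = None

PREFIXES = "PREFIX prov: <http://www.w3.org/ns/prov#>\n"

# Illustrative integrity constraints only; NOT the official W3C set.
# Each query selects offending resources, so an empty result means "no violation".
CONSTRAINT_QUERIES = {
    "entity_derived_from_itself": PREFIXES + """
        SELECT ?e WHERE { ?e prov:wasDerivedFrom ?e }
    """,
    "activity_ends_before_start": PREFIXES + """
        SELECT ?a WHERE {
            ?a prov:startedAtTime ?start ;
               prov:endedAtTime ?end .
            FILTER (?end < ?start)
        }
    """,
}


def count_prov_violations(turtle_text: str) -> int:
    """Count constraint queries that return a non-empty result."""
    if Graph is None:
        raise RuntimeError("rdflib is required to run the SPARQL checks")
    graph = Graph()
    graph.parse(data=turtle_text, format="turtle")
    return sum(
        1 for query in CONSTRAINT_QUERIES.values()
        if len(list(graph.query(query))) > 0
    )
```

The return value matches the metric defined for Track 3.2: the number of constraint queries with a non-empty result, gated with `== 0`.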
50-thread concurrent write test — generates fixture at test time:
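A minimal sketch of the concurrent-write pattern, using only the standard library. InMemoryRevisionStore is a hypothetical stand-in for TemporalVersionManager, written only so that the example runs; the real test would call the Semantica API in its place. A barrier releases all 50 threads at once so the writes genuinely contend.

```python
import threading
from typing import List, Optional, Tuple

# (valid_from, valid_until, value); valid_until is None for the open revision.
Revision = Tuple[int, Optional[int], str]


class InMemoryRevisionStore:
    """Hypothetical stand-in for TemporalVersionManager; NOT the Semantica API."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._clock = 0
        self.revisions: List[Revision] = []

    def write(self, value: str) -> None:
        # Close the open window and open the next one atomically.
        with self._lock:
            self._clock += 1
            now = self._clock
            if self.revisions:
                start, _, old_value = self.revisions[-1]
                self.revisions[-1] = (start, now, old_value)
            self.revisions.append((now, None, value))


def run_concurrent_writes(store: InMemoryRevisionStore, n_threads: int = 50) -> None:
    """Start n_threads writers that all hit the same entity at the same moment."""
    barrier = threading.Barrier(n_threads)

    def worker(index: int) -> None:
        barrier.wait()  # hold every thread until all are ready
        store.write(f"value-{index}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


def count_overlaps(revisions: List[Revision]) -> int:
    """Count consecutive pairs whose windows overlap (or were never closed)."""
    return sum(
        1
        for (_, until, _), (next_from, _, _) in zip(revisions, revisions[1:])
        if until is None or until > next_from
    )


store = InMemoryRevisionStore()
run_concurrent_writes(store)
assert count_overlaps(store.revisions) == 0  # binary gate
```

Removing the lock from write() is a quick way to confirm the test can fail: with unsynchronised writers, windows may be left open or closed out of order.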
Binary gate pattern — use == 0, not <= threshold, for integrity counts.
Checklist
datasets/temporal_provenance/multitq/ downloaded
datasets/temporal_provenance/cronquestions/ downloaded
datasets/temporal_provenance/tkgl_icews/ downloaded and added to git-lfs
datasets/temporal_provenance/tkgl_wikidata/ downloaded and added to git-lfs
datasets/temporal_provenance/icews/ downloaded
datasets/temporal_provenance/fever/ downloaded
datasets/temporal_provenance/w3c_prov/ downloaded
datasets/temporal_provenance/eraser/ downloaded
datasets/temporal_provenance/ibm_hr/ downloaded
fixtures/bitemporal_stress.json generated and committed (50k events, zero injected gaps)
fixtures/temporal_patterns.json generated and committed (10k events, ground-truth labels)
fixtures/provenance_chain_10k.json generated and committed (4-hop, 10k nodes)
fixtures/e2e_provenance.json generated and committed (1k documents, full pipeline trace)
conftest.py — all threshold constants, all dataset loaders
test_temporal_validity.py — stale rate, future rate, precision@5, MRR on tkgl-icews
test_bitemporal_integrity.py — gap=0, overlap=0, monotonicity=1.0, concurrent write safe
test_temporal_patterns.py — pattern recall ≥ 0.75, anomaly F1 ≥ 0.70, latency < 1s
test_decision_explainability.py — completeness ≥ 0.85, rationale F1, citation groundedness
test_provenance_lineage.py — PROV-O SPARQL = 0, checksum = 1.0, lineage = 1.0
test_provenance_continuity.py — chain continuity = 1.0, orphan rate = 0, audit log = 1.0
pytest benchmarks/temporal_provenance/ -p no:langsmith (… SEMANTICA_REAL_LLM dependency)
datasets/temporal_provenance/README.md written
benchmarks: temporal and provenance tracks 2.1–2.3, 3.1–3.3