[Epic] Real-World Benchmark Suite All Semantica Modules #570

@KaifAhmad1

What this epic covers

This epic delivers the full benchmark suite defined in docs/benchmarks/real_world_benchmarks.md.

  • 35 evaluation tracks across every Semantica module
  • 50+ public datasets with published baselines — no synthetic-only fixtures, no uncited thresholds
  • Every track calls real Semantica module APIs against real downloaded datasets — not hardcoded expected outputs
  • Every track is owned by exactly one contributor via a dedicated issue
  • Issues are designed to be fully independent so contributors never block each other

The suite proves Semantica's core claims with real data:

  • Decision Intelligence — precedent retrieval, causal analysis, and policy compliance outperform plain RAG
  • Temporal & Bitemporal Reasoning — stale facts are never injected; audit trails are gap-free
  • Explainability & Provenance — W3C PROV-O conformance; every decision traceable to evidence
  • Memory & Context Management — agent memory holds precision at 1M and 10M token scale
  • Structural Intelligence — entity resolution, KG completion, and embedding quality anchored to public SOTA
  • New Module Tracks — document parsing, chunking, vector store, and ingest pipeline benchmarked for the first time

Benchmarks are run locally only. They are not wired into any CI pipeline — the suite is too large and too slow for automated runs. Contributors run them manually before opening a PR.


Modules under test

Every benchmark calls a real Semantica API. The lists below map each pillar to its primary APIs so reviewers can verify coverage is genuine.

Decision Intelligence

  • ContextGraph.find_precedents() · AgentContext.find_precedents() · ContextRetriever.find_precedents_hybrid()
  • ContextGraph.get_causal_chain() · CausalChainAnalyzer.get_causal_chain()
  • ContextGraph.analyze_decision_influence() · ContextGraph.trace_decision_causality()
  • PolicyEngine.get_applicable_policies() · PolicyEngine.check_decision_compliance()
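
The Decision Intelligence issue gates policy compliance on a false negative rate of at most 0.05. A minimal sketch of how that rate could be computed from boolean labels follows. The function name and the convention that True marks a policy violation are assumptions for illustration, not part of the Semantica API.

```python
from typing import Sequence


def false_negative_rate(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Return FN / (FN + TP), where True marks a policy violation (assumed convention)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # A false negative is a real violation that the classifier let through.
    false_negatives = sum(1 for g, p in zip(gold, predicted) if g and not p)
    true_positives = sum(1 for g, p in zip(gold, predicted) if g and p)
    positives = false_negatives + true_positives
    if positives == 0:
        raise ValueError("no positive examples; the rate is undefined")
    return false_negatives / positives
```

In a real track the predicted labels would come from PolicyEngine.check_decision_compliance() and the gold labels from the downloaded dataset; how those results map to booleans is left open here.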

Temporal & Provenance

  • TemporalGraphRetriever · ContextRetriever (with valid_from/valid_until filters)
  • TemporalVersionManager · TemporalGraphQuery · validate_temporal_consistency()
  • TemporalPatternDetector · kg.temporal_query.TemporalPatternDetector
  • AgentContext.trace_decision_explainability() · DecisionContext.explain_decision()
  • kg.provenance_tracker.ProvenanceTracker · kg.kg_provenance.GraphBuilderWithProvenance
  • context.context_provenance.ContextManagerWithProvenance
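
The temporal hard gates require a gap count of 0 and an overlap count of 0. The sketch below shows one way those counts could be derived for the version history of a single fact. It assumes half-open validity intervals [valid_from, valid_until) and uses a hypothetical helper name; the actual checks in validate_temporal_consistency() may differ.

```python
from datetime import date
from typing import List, Tuple

# Assumed representation: half-open interval [valid_from, valid_until).
Interval = Tuple[date, date]


def count_gaps_and_overlaps(intervals: List[Interval]) -> Tuple[int, int]:
    """Count gaps and overlaps across the validity intervals of one fact."""
    if not intervals:
        return 0, 0
    ordered = sorted(intervals)
    gaps = overlaps = 0
    covered_until = ordered[0][1]
    for valid_from, valid_until in ordered[1:]:
        if valid_from > covered_until:
            gaps += 1      # nothing was valid between the two versions
        elif valid_from < covered_until:
            overlaps += 1  # two versions claim the same instant
        covered_until = max(covered_until, valid_until)
    return gaps, overlaps
```

A history passes both gates when the function returns (0, 0).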

Memory & Context Management

  • AgentMemory.store() · AgentMemory.get_memory() · AgentMemory.retrieve() · AgentMemory._prune_short_term_memory()
  • ContextRetriever.hybrid_search() (alpha sweep across {0.0 … 1.0})
  • ContextRetriever.multi_hop_context_assembly() · AgentContext.multi_hop_context_query()
  • AgentContext (multi-turn consistency) · ContextGraph metric node storage
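
The alpha sweep is judged on nDCG@10, with alpha=0.5 required to beat both endpoints. A minimal sketch of the metric and the gate follows, assuming graded relevance labels listed in ranked order and a linear gain. Both function names are hypothetical, and the ideal ranking is computed only from the labels passed in, which is a simplification when relevant items are missing from the retrieved list.

```python
import math
from typing import Dict, Sequence


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """nDCG@k for one ranked list of relevance labels (linear gain)."""
    def dcg(rels: Sequence[float]) -> float:
        # Rank 1 is discounted by log2(2), rank 2 by log2(3), and so on.
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def alpha_gate_passes(ndcg_by_alpha: Dict[float, float]) -> bool:
    """True when alpha=0.5 strictly beats both alpha=0.0 and alpha=1.0."""
    mid = ndcg_by_alpha[0.5]
    return mid > ndcg_by_alpha[0.0] and mid > ndcg_by_alpha[1.0]
```

The dictionary passed to alpha_gate_passes would hold the mean nDCG@10 per alpha value, averaged over all queries in the dataset.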

Structural Intelligence

  • SimilarityCalculator.calculate_similarity() · ClusterBuilder.build_clusters() · EntityMerger.merge_entities()
  • LinkPredictor.score_link() · LinkPredictor (MRR, Hits@k)
  • SemanticExtractor (relation extraction) · EntityExtractor
  • EmbeddingGenerator.generate_embeddings() (batch mode)
  • ConflictResolver.detect_conflicts() · ConflictResolver.resolve_conflict()
  • Reasoner.infer_facts() · ontology.Reasoner
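
MRR and Hits@k for link prediction are both functions of the rank at which the true entity appears for each query. The sketch below uses the standard definitions and takes 1-based ranks as input; how those ranks are obtained from LinkPredictor.score_link() is not shown, and the function names are hypothetical.

```python
from typing import Sequence


def _check_ranks(ranks: Sequence[int]) -> None:
    if not ranks:
        raise ValueError("ranks must not be empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Mean of 1/rank over all queries."""
    _check_ranks(ranks)
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose true entity is ranked in the top k."""
    _check_ranks(ranks)
    return sum(1 for r in ranks if r <= k) / len(ranks)
```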

New Module Tracks

  • parse.DocumentParser · parse.PDFParser · parse.TableExtractor
  • split.Splitter · split.SemanticChunker · split.RecursiveCharacterSplitter · split.TokenSplitter
  • vector_store.VectorStore.search() · vector_store.VectorStore.filtered_search()
  • ingest.IngestPipeline.run() · ingest.IngestPipeline.run_batch()

Architecture rules — read before touching a single file

Rule 1 — One directory per issue.
Never create or edit files outside your assigned benchmarks/<dir>/ and datasets/<dir>/ folders.
The only shared files you may touch are .gitattributes (Infrastructure & Module Tracks issue only) and benchmarks/benchmarks_runner.py (Infrastructure & Module Tracks issue only).

Rule 2 — Thresholds live in conftest.py.
Never hardcode a float inside a test function.
Every threshold constant must be defined at the top of conftest.py with the form THRESHOLD_<METRIC>_<DATASET> = <value>.
No PR will be merged with inline numbers like assert score >= 0.70.
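
A sketch of what Rule 2 could look like in practice. The constant names follow the required THRESHOLD_<METRIC>_<DATASET> form, but the specific names, the 0.0 placeholder values, and the assert_meets_threshold helper are illustrative assumptions; real values must come from a cited baseline.

```python
# conftest.py (sketch): thresholds are module-level constants at the top of the file.
# The values below are placeholders, NOT published baselines.
THRESHOLD_F1_DBLP_ACM = 0.0    # replace with a cited baseline value
THRESHOLD_MRR_FB15K237 = 0.0   # replace with a cited baseline value


def assert_meets_threshold(score: float, threshold: float, name: str) -> None:
    """Hypothetical helper: fail with a message that names the threshold constant."""
    assert score >= threshold, f"{name}: {score:.4f} is below {threshold:.4f}"
```

A test function would then call assert_meets_threshold(score, THRESHOLD_F1_DBLP_ACM, "THRESHOLD_F1_DBLP_ACM") so that no float literal appears in the test body.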

Rule 3 — Real LLM tests are skipped by default.
Any test that requires a real language model must be gated so it does not run unless explicitly enabled:

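import os
import pytest

# Skip the decorated test unless SEMANTICA_REAL_LLM is set to a non-empty value.
real_llm = pytest.mark.skipif(
    not os.getenv("SEMANTICA_REAL_LLM"),
    reason="requires real LLM — set SEMANTICA_REAL_LLM=1 to enable"
)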

Without this gate the test would fail on any machine that does not have an LLM available.
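
One detail of the gate above: os.getenv returns a string, and any non-empty string is truthy, so SEMANTICA_REAL_LLM=0 also enables the tests. If that matters, a stricter check could look like the sketch below. The helper name and the accepted values are assumptions, not an existing Semantica utility.

```python
import os

# Assumed set of values that count as "enabled".
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag_enabled(name: str) -> bool:
    """Return True only when the variable holds an explicit truthy value."""
    return os.getenv(name, "").strip().lower() in _TRUTHY
```

The skip condition would then become not env_flag_enabled("SEMANTICA_REAL_LLM"), and the same helper would work for the BEAM_10M flag.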

Rule 4 — Files over 5 MB go through git-lfs.
The Infrastructure & Module Tracks issue sets up .gitattributes before any large file is committed.
Committing a raw binary over 5 MB to regular git history will be rejected at review.
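
Before committing, a contributor can scan their own directories for files that cross the limit. The sketch below is a hypothetical helper, not part of the repository. It assumes "5 MB" means 5 MiB and does not check whether a file is already tracked by git-lfs.

```python
import os
from typing import Iterator, Tuple

LFS_LIMIT_BYTES = 5 * 1024 * 1024  # assumption: 5 MB is read as 5 MiB


def files_over_limit(root: str, limit: int = LFS_LIMIT_BYTES) -> Iterator[Tuple[str, int]]:
    """Yield (path, size_in_bytes) for every regular file under root larger than limit."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.isfile(path) and not os.path.islink(path):
                size = os.path.getsize(path)
                if size > limit:
                    yield path, size
```

Calling list(files_over_limit("datasets/<dir>")) on an assigned dataset directory returns the files that need a git-lfs pattern in .gitattributes.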

Rule 5 — Quality metrics must come from real public datasets.
Every quality threshold (F1, MRR, recall, precision, accuracy) must be measured by calling a real Semantica module API against a real downloaded public dataset. The threshold value must have a published baseline citation.
Synthetic fixtures (fixtures/*.json) are permitted only for latency and scale tests (P95 latency at N entities, throughput at N docs/s) — never as the primary signal for a quality metric.
A test that computes F1 entirely from a synthetic JSON file will be rejected at review regardless of the score.
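
For the latency tests that Rule 5 allows on synthetic fixtures, P95 can be computed from raw per-call timings. The sketch below uses the nearest-rank definition, which is one of several common percentile conventions; the function name is hypothetical.

```python
from typing import Sequence


def p95_latency(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of the latency samples."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

The returned value is always one of the observed samples, so it is directly comparable to a latency threshold constant in conftest.py.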


How to run benchmarks

Benchmarks are a local tool. Run your sub-issue's tests before opening a PR:

pytest benchmarks/<dir>/ -p no:langsmith -v

To run a specific track:

pytest benchmarks/decision_intelligence/test_precedent_search.py -v

To enable real-LLM tests (requires SEMANTICA_REAL_LLM env var):

SEMANTICA_REAL_LLM=1 pytest benchmarks/<dir>/ -v

To enable the BEAM 10M test (requires the large dataset downloaded):

BEAM_10M=1 pytest benchmarks/memory_context/test_agent_memory_beam10m.py -v

Issues and their scope

The Infrastructure & Module Tracks issue should land its git-lfs setup and JSON schema validation before the other issues merge, because every other issue depends on both.

  • Decision Intelligence (Tracks 1.1–1.4, 30–31)

    • Precedent search quality, causal chain traversal, decision influence analysis, policy compliance classification
    • Datasets: German Credit, CUAD, LEDGAR, TREC CT 2022, ATOMIC 2020, e-CARE, CausalBench
    • Hard gate: policy false negative rate ≤ 0.05
    • Owner directory: benchmarks/decision_intelligence/
  • Temporal, Bitemporal & Provenance (issue #572: [FEATURE] [Benchmarks] Pillar 2 & 3 Temporal, Bitemporal & Provenance)

    • Temporal validity, bitemporal revision integrity, pattern detection, decision explainability, provenance lineage, cross-module continuity
    • Datasets: MultiTQ, CronQuestions, TGB 2.0 tkgl-icews, TGB 2.0 tkgl-wikidata, ICEWS, FEVER, ERASER, W3C PROV-O suite
    • Hard gates: gap count = 0, overlap count = 0, PROV-O SPARQL violations = 0
    • Owner directory: benchmarks/temporal_provenance/
  • Memory & Context Management (issue #573: [FEATURE] Pillar 4 Memory & Context Management)

    • Agent memory persistence, hybrid retrieval alpha sweep, multi-hop context assembly, agentic semantic consistency
    • Datasets: LoCoMo, LongMemEval, BEAM 1M, BEAM 10M (optional, large download), MemoryArena, MSC, GrailQA, CWQ, HotpotQA, 2WikiMultiHop, MuSiQue, MetaQA, τ-bench
    • Hard gate: alpha=0.5 must outperform alpha=0.0 AND alpha=1.0 on nDCG@10
    • Owner directory: benchmarks/memory_context/
  • Structural Intelligence (issue #574: [FEATURE] Pillar 5 Structural Intelligence)

    • Entity resolution, KG completion, semantic extraction, embedding quality, conflict resolution, ontology reasoning
    • Datasets: DBLP-ACM, DBLP-Scholar, Abt-Buy, Amazon-Google, Walmart-Amazon dirty, WDC Products, FB15k-237, WN18RR, ogbl-biokg, CoDEx-S/M/L, Wikidata5M, SemEval 2010, NYT10, DocRED, Re-DocRED, REBEL, BEIR, MTEB, STS-Benchmark, WikiContradict, CONFLICTBANK, W3C OWL-RL
    • Regression guard: detect_duplicates() must never be called from the benchmark pipeline
    • Owner directory: benchmarks/structural_intelligence/
  • Infrastructure, New Module Tracks & Dataset Registry (issue #575: [FEATURE] Infrastructure, New Module Tracks & Dataset Registry)

    • Document parsing, chunking quality, vector store performance, ingest pipeline throughput
    • Infrastructure: git-lfs setup, JSON schema validation, benchmarks_runner.py discovery, dataset registry
    • Datasets: OmniDocBench, SIFT1M, NeurIPS'23 Filtered Search, DEEP1B, synthetic ingest corpus
    • Owner directories: benchmarks/module_tracks/ · benchmarks/infrastructure/

Directory isolation map

Every issue owns exactly the directories below — and nothing else.

  • benchmarks/decision_intelligence/ — Decision Intelligence
  • benchmarks/temporal_provenance/ — Temporal, Bitemporal & Provenance
  • benchmarks/memory_context/ — Memory & Context Management
  • benchmarks/structural_intelligence/ — Structural Intelligence
  • benchmarks/module_tracks/ — Infrastructure & Module Tracks
  • benchmarks/infrastructure/ — Infrastructure & Module Tracks
  • benchmarks/context_graph_effectiveness/ — existing, do not move or modify

Dataset directories follow the same pattern:

  • datasets/decision_intelligence/ — Decision Intelligence
  • datasets/temporal_provenance/ — Temporal, Bitemporal & Provenance
  • datasets/memory_context/ — Memory & Context Management
  • datasets/structural_intelligence/ — Structural Intelligence
  • datasets/module_tracks/ — Infrastructure & Module Tracks

Shared directories — each issue commits only its own files:

  • fixtures/ — each issue adds its own fixtures; never overwrite another issue's files
  • scripts/ — each issue adds its own generator scripts

Overall checklist

  • Infrastructure & Module Tracks: .gitattributes with git-lfs patterns, JSON schema files, benchmarks_runner.py updated
  • Decision Intelligence merged — tracks 1.1–1.4, 30–31
  • Temporal, Bitemporal & Provenance merged — tracks 2.1–2.3, 3.1–3.3
  • Memory & Context Management merged — tracks 4.1–4.4, 26
  • Structural Intelligence merged — tracks 5.1–5.5, 27–29
  • Infrastructure & Module Tracks merged — document parsing, chunking, vector store, ingest pipeline (tracks 32–35)
  • All tests pass locally: pytest benchmarks/ -p no:langsmith
  • CHANGELOG.md entry added under the release that ships the suite
  • docs/benchmarks/real_world_benchmarks.md version bumped to 3.0
