Temporal & Bitemporal Reasoning — stale facts are never injected; audit trails are gap-free
Explainability & Provenance — W3C PROV-O conformance; every decision traceable to evidence
Memory & Context Management — agent memory holds precision at 1M and 10M token scale
Structural Intelligence — entity resolution, KG completion, and embedding quality anchored to public SOTA
New Module Tracks — document parsing, chunking, vector store, and ingest pipeline benchmarked for the first time
Architecture rules — read before touching a single file
Rule 1 — One directory per issue.
Never create or edit files outside your assigned benchmarks/<dir>/ and datasets/<dir>/ folders.
The only shared files you may touch are .gitattributes and benchmarks/benchmarks_runner.py, and only from the Infrastructure & Module Tracks issue.
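As a rough illustration of Rule 1, the sketch below filters a list of changed paths down to those that fall outside an issue's own directories. The helper name `paths_outside_scope` and the example file names are hypothetical. The `fixtures/` and `scripts/` prefixes are treated as allowed because this epic lists them as shared directories. The special cases for .gitattributes and benchmarks/benchmarks_runner.py are not modelled.

```python
def paths_outside_scope(changed_paths, issue_dir,
                        shared_prefixes=("fixtures/", "scripts/")):
    """Return the changed paths that fall outside the issue's own directories.

    Hypothetical helper for illustration; it is not part of Semantica.
    """
    allowed = (
        f"benchmarks/{issue_dir}/",
        f"datasets/{issue_dir}/",
    ) + tuple(shared_prefixes)
    # str.startswith accepts a tuple of prefixes.
    return [p for p in changed_paths if not p.startswith(allowed)]


# Example file names are made up.
changed = [
    "benchmarks/memory_context/test_recall.py",
    "datasets/memory_context/README.md",
    "benchmarks/structural_intelligence/test_er.py",
]
print(paths_outside_scope(changed, "memory_context"))
# → ['benchmarks/structural_intelligence/test_er.py']
```

An empty result means every changed path stays inside the issue's own directories or the shared ones.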
Rule 2 — Thresholds live in conftest.py.
Never hardcode a float inside a test function.
Every threshold constant must be defined at the top of conftest.py with the form THRESHOLD_<METRIC>_<DATASET> = <value>.
No PR will be merged with inline numbers like assert score >= 0.70.
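A minimal sketch of what Rule 2 asks for, assuming placeholder metric and dataset names. The constants, their values, and both helper functions are hypothetical and do not reflect real baselines or any Semantica API. The regular expression is one possible reading of the THRESHOLD_<METRIC>_<DATASET> form.

```python
import re

# Hypothetical constants as they might appear at the top of conftest.py.
# The values are placeholders, not published baselines.
THRESHOLD_F1_DATASET_A = 0.70
THRESHOLD_MRR_DATASET_B = 0.30

# One possible reading of THRESHOLD_<METRIC>_<DATASET>: upper-case,
# a metric part, an underscore, then a dataset part.
_NAME = re.compile(r"^THRESHOLD_[A-Z0-9]+_[A-Z0-9_]+$")


def threshold_names(namespace):
    """Return the names in `namespace` that follow the naming form."""
    return sorted(name for name in namespace if _NAME.match(name))


def meets_threshold(score, threshold):
    """Compare a measured score against a named threshold constant."""
    return score >= threshold
```

A test body would then read `assert meets_threshold(score, THRESHOLD_F1_DATASET_A)` rather than comparing against an inline float.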
Rule 3 — Real LLM tests are skipped by default.
Any test that requires a real language model must be gated so it does not run unless explicitly enabled:
import os
import pytest

# Skip unless SEMANTICA_REAL_LLM is set to a non-empty value.
real_llm = pytest.mark.skipif(
    not os.getenv("SEMANTICA_REAL_LLM"),
    reason="requires real LLM — set SEMANTICA_REAL_LLM=1 to enable",
)
Without this gate the test would fail on any machine that does not have an LLM available.
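The gate's condition depends only on whether the environment variable is a non-empty string. The sketch below mirrors that condition in a plain function so the behaviour can be checked without pytest. The function name `real_llm_skipped` is hypothetical.

```python
def real_llm_skipped(environ):
    """Mirror of the skipif condition: True means the test is skipped.

    `environ` is any mapping shaped like os.environ.
    """
    return not environ.get("SEMANTICA_REAL_LLM")


print(real_llm_skipped({}))                             # → True
print(real_llm_skipped({"SEMANTICA_REAL_LLM": ""}))     # → True
print(real_llm_skipped({"SEMANTICA_REAL_LLM": "1"}))    # → False
print(real_llm_skipped({"SEMANTICA_REAL_LLM": "0"}))    # → False
```

Note the last case: any non-empty value, including "0", enables the real-LLM tests. Unset the variable to disable them.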
Rule 4 — Files over 5 MB go through git-lfs.
The Infrastructure & Module Tracks issue sets up .gitattributes before any large file is committed.
A PR that commits a raw binary over 5 MB to regular git history will be rejected at review.
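One way to catch oversized files before committing is to scan the working tree. The sketch below is a hypothetical helper, not part of the repository. It assumes "5 MB" means 5 × 1024 × 1024 bytes and skips the .git directory. It does not check whether a file is already tracked by git-lfs.

```python
import os

LFS_LIMIT_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" read as 5 MiB


def files_over_limit(root, limit_bytes=LFS_LIMIT_BYTES):
    """Return sorted (path, size) pairs for files larger than `limit_bytes`."""
    oversized = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                oversized.append((path, size))
    return sorted(oversized)
```

Calling `files_over_limit("datasets")` from the repository root would list candidates that should go through git-lfs.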
Rule 5 — Quality metrics must come from real public datasets.
Every quality threshold (F1, MRR, recall, precision, accuracy) must be measured by calling a real Semantica module API against a real downloaded public dataset. The threshold value must have a published baseline citation.
Synthetic fixtures (fixtures/*.json) are permitted only for latency and scale tests (P95 latency at N entities, throughput at N docs/s) — never as the primary signal for a quality metric.
A test that computes F1 entirely from a synthetic JSON file will be rejected at review regardless of the score.
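For reference, the arithmetic behind a set-based F1 score can be sketched as follows. The function name is hypothetical and the toy inputs only show the calculation. Under Rule 5, a real benchmark would take the predictions from a Semantica API call and the gold labels from a downloaded public dataset.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for one prediction set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy values for illustration only: 3 of 4 predictions are correct,
# and 3 of 6 gold items are found.
p, r, f1 = precision_recall_f1({"a", "b", "c", "d"},
                               {"b", "c", "d", "e", "f", "g"})
print(p, r, round(f1, 2))  # → 0.75 0.5 0.6
```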
What this epic covers
This epic delivers the full benchmark suite defined in docs/benchmarks/real_world_benchmarks.md. The suite proves Semantica's core claims with real data.
Benchmarks are run locally only. They are not wired into any CI pipeline — the suite is too large and too slow for automated runs. Contributors run them manually before opening a PR.
Modules under test
Every benchmark calls a real Semantica API. The table below maps each pillar to its primary APIs so reviewers can verify coverage is genuine.
Decision Intelligence
ContextGraph.find_precedents() · AgentContext.find_precedents() · ContextRetriever.find_precedents_hybrid()
ContextGraph.get_causal_chain() · CausalChainAnalyzer.get_causal_chain()
ContextGraph.analyze_decision_influence() · ContextGraph.trace_decision_causality()
PolicyEngine.get_applicable_policies() · PolicyEngine.check_decision_compliance()
Temporal & Provenance
TemporalGraphRetriever · ContextRetriever (with valid_from / valid_until filters)
TemporalVersionManager · TemporalGraphQuery · validate_temporal_consistency()
TemporalPatternDetector · kg.temporal_query.TemporalPatternDetector
AgentContext.trace_decision_explainability() · DecisionContext.explain_decision()
kg.provenance_tracker.ProvenanceTracker · kg.kg_provenance.GraphBuilderWithProvenance
context.context_provenance.ContextManagerWithProvenance
Memory & Context Management
AgentMemory.store() · AgentMemory.get_memory() · AgentMemory.retrieve() · AgentMemory._prune_short_term_memory()
ContextRetriever.hybrid_search() (alpha sweep across {0.0 … 1.0})
ContextRetriever.multi_hop_context_assembly() · AgentContext.multi_hop_context_query()
AgentContext (multi-turn consistency) · ContextGraph metric node storage
Structural Intelligence
SimilarityCalculator.calculate_similarity() · ClusterBuilder.build_clusters() · EntityMerger.merge_entities()
LinkPredictor.score_link() · LinkPredictor (MRR, Hits@k)
SemanticExtractor (relation extraction) · EntityExtractor
EmbeddingGenerator.generate_embeddings() (batch mode)
ConflictResolver.detect_conflicts() · ConflictResolver.resolve_conflict()
Reasoner.infer_facts() · ontology.Reasoner
New Module Tracks
parse.DocumentParser · parse.PDFParser · parse.TableExtractor
split.Splitter · split.SemanticChunker · split.RecursiveCharacterSplitter · split.TokenSplitter
vector_store.VectorStore.search() · vector_store.VectorStore.filtered_search()
ingest.IngestPipeline.run() · ingest.IngestPipeline.run_batch()
How to run benchmarks
Benchmarks are a local tool. Run your sub-issue's tests before opening a PR:
To run a specific track:
To enable real-LLM tests (requires the SEMANTICA_REAL_LLM env var):
To enable the BEAM 10M test (requires the large dataset downloaded):
Issues and their scope
The Infrastructure & Module Tracks issue should be at least partially completed before the others merge — it sets up git-lfs and JSON schema validation that all other issues depend on.
Decision Intelligence (Tracks 1.1–1.4, 30–31)
benchmarks/decision_intelligence/
Temporal, Bitemporal & Provenance ([FEATURE] [Benchmarks] Pillar 2 & 3 Temporal, Bitemporal & Provenance #572)
benchmarks/temporal_provenance/
Memory & Context Management ([FEATURE] Pillar 4 Memory & Context Management #573)
benchmarks/memory_context/
Structural Intelligence ([FEATURE] Pillar 5 Structural Intelligence #574)
detect_duplicates() must never be called from the benchmark pipeline
benchmarks/structural_intelligence/
Infrastructure, New Module Tracks & Dataset Registry ([FEATURE] Infrastructure, New Module Tracks & Dataset Registry #575)
benchmarks_runner.py discovery, dataset registry
benchmarks/module_tracks/ · benchmarks/infrastructure/
Directory isolation map
Every issue owns exactly the directories below — and nothing else.
benchmarks/decision_intelligence/: Decision Intelligence
benchmarks/temporal_provenance/: Temporal, Bitemporal & Provenance
benchmarks/memory_context/: Memory & Context Management
benchmarks/structural_intelligence/: Structural Intelligence
benchmarks/module_tracks/: Infrastructure & Module Tracks
benchmarks/infrastructure/: Infrastructure & Module Tracks
benchmarks/context_graph_effectiveness/: existing, do not move or modify
Dataset directories follow the same pattern:
datasets/decision_intelligence/: Decision Intelligence
datasets/temporal_provenance/: Temporal, Bitemporal & Provenance
datasets/memory_context/: Memory & Context Management
datasets/structural_intelligence/: Structural Intelligence
datasets/module_tracks/: Infrastructure & Module Tracks
Shared directories (each issue commits only its own files):
fixtures/: each issue adds its own fixtures; never overwrite another issue's files
scripts/: each issue adds its own generator scripts
Overall checklist
.gitattributes with git-lfs patterns, JSON schema files, benchmarks_runner.py updated
pytest benchmarks/ -p no:langsmith
CHANGELOG.md entry added under the release that ships the suite
docs/benchmarks/real_world_benchmarks.md version bumped to 3.0