Status: Core data pipelines and a multi-agent system are partially implemented. The immediate focus is on integrating these components and formalizing the agent workflows.
Audience: Developers, MLEs, platform/SRE, security, product
Goal: Maintain a living knowledge graph about Retrieval‑Augmented Generation (RAG) that ingests → audits → grafts → prunes to stay current, trustworthy, and useful for downstream RAG systems. This README.md describes the target architecture, with notes indicating the current implementation status.
KG-RAG is a project to build a self-updating knowledge graph for the RAG domain. The project has evolved to center around a Multi-Agent System responsible for query understanding, retrieval, and governance. The long-term vision is a system that:
- Continuously ingests RAG‑domain sources (papers, blogs, release notes).
- Extracts claims/entities/relations with provenance into a Neo4j knowledge graph.
- Embeds content for vector-based retrieval from a Qdrant vector store.
- Uses a multi-agent system to govern the knowledge base, orchestrate complex queries, and ensure data quality.
- Exposes a hybrid retriever (vector + graph) powered by specialized agents for advanced RAG applications.
Core design principles: idempotency, provenance, policy‑driven change, observability, safety defaults, modular agent-based architecture.
Current Implementation Status (as of Aug 14, 2025):
The project currently consists of several disconnected components. The vector ingestion pipeline and the user-facing API/GUI are functional; the core knowledge graph population pipeline is still in early development, and the multi-agent query system, while runnable from the command line, is not yet integrated with the API.
- ✅ Vector Ingestion Pipeline (Functional): A `LlamaIndex`-based pipeline (`ingest/`) can fetch documents from sources, chunk them, generate embeddings, and store them in a Qdrant vector store.
- ✅ API & GUI (Partially Implemented): A `FastAPI` server exposes functional `/search` and `/ingest` endpoints. A simple `tkinter` GUI application (`gui/app.py`) provides a user interface for the search functionality. The `/admin` endpoints are scaffolds.
- ✅ Multi-Agent System (Implemented): The query-side multi-agent system follows a Plan-and-Execute model. A top-level `ComplexQueryAgent` orchestrates the full workflow: a `PlanningAgent` creates a plan, a `RoutingAgent` selects tools, an `OrchestratorAgent` executes the plan, and `Synthesis` and `Validation` agents produce the final answer. This system is runnable from the command line.
- ✅ Knowledge Graph (KG) Module (Implemented, Not Integrated): Robust, tested modules for connecting to Neo4j and performing idempotent upserts for graph nodes (`kg/upsert.py`). It is not yet used by the ingestion or agent systems.
- 🔶 Natural Language Processing (NLP) Module (Scaffolded): Basic, unused modules for sentence splitting (`nlp/claims.py`) and entity extraction (`nlp/entities.py`).
The diagram below illustrates the three core workflows. The "Vector Ingestion" pipeline is functional, the "Multi-Agent Query" system is implemented but not yet wired to the API, and the "KG Population" pipeline remains aspirational.
```mermaid
flowchart LR
%% Styles
classDef implemented fill:#e7fff3,stroke:#10b981,stroke-width:1px,color:#064e3b;
classDef partial fill:#fffbe6,stroke:#f59e0b,stroke-width:1px,color:#7c2d12;
classDef scaffold fill:#f3f4f6,stroke:#6b7280,stroke-width:1px,color:#1f2937,stroke-dasharray: 5 5;
classDef sources fill:#eaf2ff,stroke:#4677f5,stroke-width:1px,color:#0b2b6a;
subgraph "Sources"
direction LR
A1["RSS/Blogs"]:::sources
A2["arXiv"]:::sources
end
subgraph "Pipeline 1: Vector Ingestion (Functional)"
direction TB
B1["ingest.fetch"] --> B2["ingest.normalize"] --> B3["ingest.builder<br>(LlamaIndex)"]
B3 --> C1[(Qdrant<br>Vectors)]
end
subgraph "Pipeline 2: KG Population (Aspirational)"
direction TB
D1["nlp.claims<br>(spaCy)"] -.-> D2["kg.upsert"]
D2 -.-> C2[(Neo4j<br>Graph)]
end
subgraph "Pipeline 3: Multi-Agent Query (Implemented)"
direction TB
E1["User Query"] --> E2["ComplexQueryAgent"]
E2 --> E3["Orchestrator"]
E3 --> E4{"Planning/Routing"}
E3 --> E5["Tool Execution"]
E5 --> C1
E5 --> C2
E2 --> E6["Synthesis"]
E2 --> E7["Validation"]
end
%% Apply classes
class A1,A2 sources;
class B1,B2,B3,C1 implemented;
class D1,D2,C2 scaffold;
class E1,E2,E3,E4,E5,E6,E7 implemented;
```
- Who calls ingest? Cron/agent (batch) or an operator via `/admin` endpoints.
- Granularity: per-source (feed, site, arXiv query).
- Config knobs: `INGEST_SOURCES`, `INGEST_RATE_LIMIT_QPS`, `MAX_FETCH_BYTES`, `RETRY_BACKOFF`.
- Metrics: `ingest_jobs_total`, `ingest_sources_active`, `ingest_errors_total`.
Status: Implemented. The current implementation exists in `ingest/fetch.py` and supports RSS and arXiv sources. The details below describe the target state.
- Input: URL or query spec (RSS feed URL, arXiv query, static docs list).
- Process (sketched below): HTTP GET with timeouts, size caps, redirects ≤ N, user-agent override; content-type allowlist (`text/html`, `application/xml`, `application/json`, `text/markdown`).
- Output: `(url, fetched_at, content_bytes, content_type, etag/last_modified?)`.
- Failures & retries: Network errors → exponential backoff (jitter); 4xx (except 429) are terminal; 5xx retried.
- Side-effects: Dedupe by URL + `ETag`/`Last-Modified` (skip unchanged).
- Metrics: `fetch_ok_total`, `fetch_4xx_total`, `fetch_5xx_total`, p95 latency.
- Config: `HTTP_TIMEOUT_SEC`, `MAX_REDIRECTS`, `ALLOWED_SCHEMES=["https"]`.
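A minimal sketch of this fetch behavior, assuming `httpx`; the function name `fetch_one`, the retry constants, and the allowlist literal are illustrative, not the actual `ingest/fetch.py` API:

```python
import random
import time

import httpx

ALLOWED_TYPES = {"text/html", "application/xml", "application/json", "text/markdown"}

def fetch_one(url: str, timeout: float = 10.0, max_redirects: int = 3,
              max_attempts: int = 4) -> httpx.Response | None:
    headers = {"User-Agent": "kg-rag-ingest/0.1"}  # illustrative override
    for attempt in range(max_attempts):
        try:
            with httpx.Client(follow_redirects=True, max_redirects=max_redirects,
                              timeout=timeout, headers=headers) as client:
                resp = client.get(url)
        except httpx.TransportError:
            time.sleep(2 ** attempt + random.random())  # backoff with jitter
            continue
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(2 ** attempt + random.random())  # retryable statuses
            continue
        if resp.status_code >= 400:
            return None                                 # other 4xx are terminal
        ctype = resp.headers.get("content-type", "").split(";")[0].strip()
        return resp if ctype in ALLOWED_TYPES else None  # content-type allowlist
    return None
```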
Status: Implemented. Normalization using `trafilatura` is in `ingest/normalize.py`. Chunking is handled by LlamaIndex's `SentenceSplitter` in `ingest/builder.py`, which is the primary pipeline.
- Input: Raw HTML/JSON/MD.
- Process (sketched below): HTML→text (readability rules, boilerplate removal, code-block preservation); Markdown normalization; sentence segmentation; chunking ~800–1200 tokens (sentence-aware).
- Output: `NormalizedDoc {source_url, source_hash, text, chunks[]}` with `source_hash = sha256(canonicalized_text)`.
- Failures: Malformed docs → skip with reason; record dead-letter.
- Metrics: `normalize_docs_total`, `normalize_skipped_total`, chunk length distribution.
- Config: `MAX_CHUNK_TOKENS`, `CHUNK_OVERLAP_TOKENS`.
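A sketch of normalize-then-chunk under the settings above. `trafilatura.extract` and LlamaIndex's `SentenceSplitter` are the tools this README names, but the exact wiring in `ingest/normalize.py` and `ingest/builder.py` may differ:

```python
import hashlib

import trafilatura
from llama_index.core.node_parser import SentenceSplitter

def normalize_and_chunk(raw_html: str, source_url: str,
                        chunk_size: int = 1024, chunk_overlap: int = 128) -> dict:
    # HTML -> clean text; trafilatura handles boilerplate removal.
    text = trafilatura.extract(raw_html) or ""
    source_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Sentence-aware chunking; sizes follow MAX_CHUNK_TOKENS / CHUNK_OVERLAP_TOKENS.
    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return {"source_url": source_url, "source_hash": source_hash,
            "text": text, "chunks": splitter.split_text(text)}
```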
Status: Scaffolded, Not Integrated. The `nlp` module contains basic implementations for claim and entity extraction. However, it is not currently used by the functional LlamaIndex ingestion pipeline. The descriptions below outline the target functionality for the future integrated KG pipeline.
- Input: `chunks[]`, `source_url`, `source_hash`.
- Process: Claim mining (sentence-level candidates, heuristic/LLM filter); entity linking → `Concept(name)`, `Author(handle/name)`; tuples `(claim_text, concepts[], author?, source_url, source_hash, ingested_at)`.
- Output: `ExtractedClaim[]`.
- IDs: `claim_id = sha256(normalize(claim_text) + source_hash)` (stable; sketched below).
- Metrics: `extract_claims_total`, `extract_concepts_total`, `extract_yield_per_doc`.
- Config: `MIN_CLAIM_LEN`, `MAX_CLAIM_LEN`, `ENTITY_CONFIDENCE_MIN`.
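The ID scheme is the most fully specified part of this contract; a sketch, with a hypothetical `normalize_claim` canonicalization (lowercase, collapsed whitespace):

```python
import hashlib

def normalize_claim(text: str) -> str:
    # Hypothetical canonicalization; the real normalize() may differ.
    return " ".join(text.lower().split())

def make_claim_id(claim_text: str, source_hash: str) -> str:
    # claim_id = sha256(normalize(claim_text) + source_hash): stable across re-ingestion.
    payload = normalize_claim(claim_text) + source_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```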
Status: Implemented, Not Integrated. The `kg` module contains robust, tested functions for idempotently upserting data into Neo4j. However, this module is not currently called by any ingestion pipeline.
- Input: `ExtractedClaim[]`.
- Process (transactional; sketched below): `MERGE` `Source`, `Claim`, `Concept`(s), `Author`; `MERGE` relations `CITED_IN`, `ABOUT`, `WRITTEN_BY`; set defaults `visibility='active'`, `score=0.0`.
- Output: Graph writes (idempotent).
- Constraints: Unique on `Claim.claim_id`, `Concept.name`, `Author.handle`, `Source.url`.
- Failures: Constraint conflicts handled by `MERGE`; transient errors → retry tx.
- Metrics: `kg_upserts_total`, `kg_relations_created_total`, `kg_tx_retries_total`.
- Config: `NEO4J_URI`, `NEO4J_DB`, `WRITE_BATCH_SIZE`.
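A sketch of one idempotent upsert transaction consistent with the MERGE semantics above; the tested implementation lives in `kg/upsert.py`, and this is not its actual API:

```python
from neo4j import GraphDatabase

UPSERT_CLAIM = """
MERGE (s:Source {url: $source_url})
  ON CREATE SET s.hash = $source_hash, s.retrieved_at = datetime()
MERGE (c:Claim {claim_id: $claim_id})
  ON CREATE SET c.text = $text, c.source_hash = $source_hash,
                c.ingested_at = datetime(), c.visibility = 'active', c.score = 0.0
MERGE (c)-[:CITED_IN]->(s)
FOREACH (name IN $concepts |
  MERGE (k:Concept {name: name})
  MERGE (c)-[:ABOUT]->(k))
"""

def upsert_claim(driver, claim: dict) -> None:
    # MERGE makes re-runs no-ops; transient failures can simply retry the transaction.
    with driver.session() as session:
        session.execute_write(lambda tx: tx.run(UPSERT_CLAIM, **claim).consume())

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
```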
Status: Implemented. This entire process is handled by the LlamaIndex pipeline within `ingest/builder.py` (`KnowledgeGraphBuilder`). The standalone `embed` module is not directly used by this pipeline.
- Input: Text nodes (chunks) from the `SentenceSplitter`.
- Process (sketched below):
  - The `HuggingFaceEmbedding` model (sentence-transformers) generates a vector for each text node.
  - The node (with its text and metadata) and its corresponding vector are upserted into the Qdrant collection specified in the configuration.
- Output: An indexed `VectorStoreIndex` in Qdrant.
- Note: The `claim_id` assigned to the node serves as the point ID in Qdrant.
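A sketch of the LlamaIndex wiring this section describes; the collection name, data path, and embedding model are assumptions, and `KnowledgeGraphBuilder` may configure them differently:

```python
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="claims")  # assumed name
storage_context = StorageContext.from_defaults(vector_store=vector_store)

docs = SimpleDirectoryReader("data/normalized").load_data()  # assumed path
index = VectorStoreIndex.from_documents(
    docs,
    storage_context=storage_context,
    embed_model=HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),  # assumed model
)
```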
Status: Aspirational. The agent skeletons in `src/agents` (`auditor.py`, `curator.py`) represent a future system for automated graph maintenance, distinct from the implemented query-focused multi-agent system in `src/multi_agent_system`. The description below is the target design for this governance loop.
- Trigger: Scheduled (hourly audit, daily prune) or manual via `/admin`.
- Audit: The `Auditor` agent would sample candidate claim pairs from the KG and use an NLI service to score them for entailment or contradiction.
- Graft & Prune: Based on audit results and configured policies, the `Curator` agent would perform graft (merge) and prune (shadow/delete) operations.
- Outputs: Updated KG; counters for merges/shadows/deletes.
- Config: Policies in `configs/*_policies.yaml`.
Status: Partially Implemented. The core RAG and query-answering logic is handled by a multi-agent system. The system uses a manual, sequential orchestration pattern. It is not yet connected to the API server.
- Input: User query (string).
- Orchestration Flow (sketched in code after this list):
  1. Planning: A `PlanningAgent` decomposes the input query into a series of steps.
  2. Routing: A `RoutingAgent` determines the appropriate retrieval agent for each step.
  3. Retrieval: The `Orchestrator` dynamically loads and executes the chosen retrieval agent (e.g., `VectorDatabaseRetrievalAgent`, `KnowledgeGraphRetrievalAgent`, `WebSearchRetrievalAgent`) to gather context.
  4. Synthesis: A `SynthesisAgent` combines the retrieved context into a draft answer.
  5. Validation: A `ValidationAgent` checks the draft answer for consistency against the retrieved context.
- Output: A final answer string, appended with a validation status.
- Metrics: `retrieval_tasks_total`, `synthesis_latency_ms`, `validation_passes_total`.
- Config: Agent configurations are loaded from `AGENTS.md` files within each module.
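A condensed sketch of this Plan-and-Execute flow. The class and method names mirror the agents listed above, but the actual signatures in `src/multi_agent_system` may differ:

```python
class OrchestratorAgent:
    """Executes the 'execute' half of Plan-and-Execute."""

    def __init__(self, planner, router, tools: dict):
        self.planner, self.router, self.tools = planner, router, tools

    def execute(self, query: str) -> list[str]:
        results = []
        for step in self.planner.plan(query):            # PlanningAgent: decompose query
            tool_name = self.router.route(step)          # RoutingAgent: pick a retriever
            results.append(self.tools[tool_name](step))  # e.g. VectorDatabaseRetrievalAgent
        return results

class ComplexQueryAgent:
    """Top-level entry point: orchestrate, synthesize, validate."""

    def __init__(self, orchestrator, synthesizer, validator):
        self.orchestrator, self.synthesizer, self.validator = orchestrator, synthesizer, validator

    def run(self, query: str) -> str:
        context = self.orchestrator.execute(query)            # plan + gather context
        draft = self.synthesizer.synthesize(query, context)   # SynthesisAgent
        verdict = self.validator.validate(draft, context)     # ValidationAgent
        return f"{draft}\n\n[validation: {verdict}]"
```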
- `source_hash = sha256(canonical_text)`; `claim_id = sha256(normalize(sentence) + source_hash)`.
- Every Claim must have `(:Claim)-[:CITED_IN]->(:Source)`; lineage via `DERIVED_FROM`.
- Re-ingestion is a no-op unless text changed (MERGE semantics; selective re-embed).
- Cap in‑flight fetches/embeddings; queue retries; degrade governance if backlog grows.
- Dead‑letter store for failed sources; operator review.
- Circuit breakers: pause audit/graft/prune on anomalous merge/delete spikes.
- Structured logs with `request_id`, `source_url`, `claim_id`, stage.
- Metrics: counters/histograms listed above; Prometheus export.
- Tracing: spans for `fetch`, `normalize`, `extract`, `upsert`, `embed`, `qdrant.search`, `neo4j.query`, `rank`, `compose`.
- Fetch+normalize ≤ 200 ms/doc (excl. network)
- Embed ≤ 15 ms/claim (CPU) or ≤ 3 ms (GPU)
- Qdrant top‑k ≤ 40 ms; KG expansion ≤ 80 ms
- End‑to‑end `/search` ≤ 300 ms
Note: This table reflects the current implementation status of the core components.
| Component | Status | Language | Key Deps | Responsibilities | Inputs | Outputs |
|---|---|---|---|---|---|---|
| Fetchers | Implemented | Python | httpx, feedparser | Rate‑limited fetching (RSS, arXiv) | Source URLs | Raw HTML/JSON |
| Normalizer | Implemented | Python | trafilatura | Clean and de‑noise HTML | HTML | Clean text |
| Ingestion Pipeline | Implemented | Python | llama-index | Load, chunk, embed, and store documents into Qdrant | Clean text files | Vectors in Qdrant |
| Extractors | Scaffolded | Python | spaCy | Claim & entity extraction | Chunks | Tuples for KG |
| KG Upsert | Implemented | Python | neo4j | Idempotent node/edge upserts w/ provenance | Tuples | Nodes/Edges |
| ComplexQueryAgent | Implemented | Python | langchain-core | End-to-end query processing (plan, execute, synthesize) | User Query | Final Answer |
| OrchestratorAgent | Implemented | Python | langchain-core | Executes a plan of tool calls to gather information | Plan | Execution Results |
| Planning/Routing | Implemented | Python | langchain-core | Generate plan; select tools for execution | Query | Plan / Tool Name |
| Synthesis/Validation | Implemented | Python | langchain-core | Synthesize and validate the final answer | Retrieved Context | Final Answer |
| API (Search/Ingest) | Implemented | Python | FastAPI | REST surface for /search and /ingest | Requests | JSON responses |
| API (Admin) | Scaffolded | Python | FastAPI | Placeholder REST surface for /admin | Requests | JSON responses |
| RAG Retriever | Implemented | Python | llama-index | Retrieve claims from vector store for the API | Query | Ranked Claims |
| RAG Answerer | Implemented | Python | langchain-core | Compose a final answer from claims | Claims | Answer string |
The core of this project's query processing is a Multi-Agent System, located in src/multi_agent_system. It is designed as a modular, stateful pipeline of specialized agents that work together to answer a user's query, following a Plan-and-Execute model.
The system is orchestrated by the ComplexQueryAgent (complex_query/agent.py), which serves as the main entry point. It manages the end-to-end workflow, ensuring a clear separation of concerns between the different phases of query processing.
The agent workflow is as follows:
ComplexQueryAgent: Receives the user query and initiates the process.OrchestratorAgent: Called by theComplexQueryAgent, this agent is responsible for the "execute" portion of the strategy. It first calls thePlanningAgentto generate a multi-step plan. Then, for each step, it uses theRoutingAgentto select the appropriate tool and executes it to gather context. It returns the collected results to theComplexQueryAgent.SynthesisAgent: TheComplexQueryAgentthen passes the collected context to theSynthesisAgent, which aggregates the information and composes a draft answer.ValidationAgent: Finally, theComplexQueryAgentsends the draft answer and the original context to theValidationAgent. This agent performs a final check to ensure the answer is consistent and supported by the retrieved context.ComplexQueryAgent: The top-level agent formats and returns the final, validated response to the user.
Each agent is configured via its own AGENTS.md file, which is loaded at runtime by the loader.py module. This allows for decentralized and modular configuration of prompts, models, and other parameters for each agent.
Note: This diagram shows the primary modules and their relationships.
```mermaid
flowchart TB
%% Define styles
classDef implemented fill:#e7fff3,stroke:#10b981;
classDef partial fill:#fffbe6,stroke:#f59e0b;
classDef scaffold fill:#f3f4f6,stroke:#6b7280,stroke-dasharray: 5 5;
subgraph "Core Services"
C_CFG[core.config]
C_IDS[core.ids]
C_LOG[core.logging]
end
subgraph "Data Stores"
V_QDR[(Qdrant)]
K_NEO[(Neo4j)]
end
subgraph "Vector Pipeline (Functional)"
I_FET[ingest.fetch] --> I_NORM[ingest.normalize]
I_NORM --> I_BUILD["ingest.builder<br>(LlamaIndex)"]
I_BUILD --> V_QDR
end
subgraph "KG Pipeline (Aspirational)"
N_CLM[nlp.claims]:::scaffold -.-> K_UPS[kg.upsert]
K_UPS -.-> K_NEO
end
subgraph "Multi-Agent System (Implemented)"
MAS_CQ[complex_query.agent]:::implemented
MAS_ORCH[orchestrator.agent]:::implemented
MAS_PLAN[planning.agent]:::implemented
MAS_ROUTE[routing.agent]:::implemented
MAS_SYNTH[synthesis.agent]:::implemented
MAS_VAL[validation.agent]:::implemented
MAS_CQ --> MAS_ORCH
MAS_CQ --> MAS_SYNTH
MAS_CQ --> MAS_VAL
MAS_ORCH --> MAS_PLAN
MAS_ORCH --> MAS_ROUTE
end
subgraph "RAG (Implemented)"
RAG_RET[rag.retriever] --> RAG_ANS[rag.answerer]
RAG_RET --> V_QDR
end
subgraph "API (Partially Implemented)"
A_SRV[api.server]:::partial
A_SRV --> RAG_ANS
A_SRV --> I_BUILD
end
%% Apply classes to implemented modules
class C_CFG,C_IDS,C_LOG,I_FET,I_NORM,I_BUILD,V_QDR,K_NEO,K_UPS,RAG_RET,RAG_ANS implemented;
```
```text
kg_rag/
├── AGENTS.md # Agent contract (acceptance gates)
├── README.md # THIS DOCUMENT
├── docker-compose.yml # Neo4j + Qdrant
├── pyproject.toml # Ruff/mypy/pytest config
├── .env.example
├── .pre-commit-config.yaml
├── .github/
│ └── workflows/agents-validate.yml
├── configs/
│ ├── schema.yaml # Labels/rels/props/indexes
│ ├── prune_policies.yaml # Thresholds & decay
│ └── graft_policies.yaml # NLI & canonicalization
├── scripts/
│ ├── bootstrap.sh # first‑run: env, DBs
│ ├── run_audit_cycle.sh # auditor→graft→prune
│ └── backfill_embeddings.py
├── src/
│ ├── core/
│ │ ├── config.py
│ │ ├── logging.py
│ │ └── ids.py
│ ├── ingest/
│ │ ├── fetch.py
│ │ ├── normalize.py
│ │ └── builder.py
│ ├── nlp/
│ │ ├── claims.py
│ │ ├── entities.py
│ │ └── contradictions.py
│ ├── embed/
│ │ ├── encoder.py
│ │ └── qdrant.py
│ ├── kg/
│ │ ├── neo.py
│ │ ├── schema.py
│ │ ├── upsert.py
│ │ ├── queries/
│ │ │ └── *.cypher
│ ├── rag/
│ │ ├── retriever.py
│ │ └── answerer.py
│ ├── multi_agent_system/
│ │ ├── orchestrator/
│ │ ├── react/
│ │ ├── synthesis/
│ │ └── ...
│ ├── agents/
│ │ ├── auditor.py
│ │ ├── curator.py
│ │ └── scheduler.py
│ └── api/
│ └── server.py
└── tests/
    └── test_*.py
```
This diagram shows the currently implemented, LlamaIndex-based ingestion pipeline that populates the Qdrant vector store.
```mermaid
sequenceDiagram
autonumber
participant User as User/Script
participant Fetch as ingest.fetch
participant Norm as ingest.normalize
participant Builder as ingest.builder.KnowledgeGraphBuilder
User->>Fetch: fetch_arxiv_papers(query)
Fetch-->>User: list[Document]
User->>Norm: normalize_documents(docs)
Norm-->>User: list[Document]
User->>Builder: build_from_directory(path)
Builder->>Builder: SimpleDirectoryReader.load_data()
Builder->>Builder: SentenceSplitter.get_nodes()
Builder->>Builder: HuggingFaceEmbedding.get_text_embedding()
Builder->>Qdrant: Upsert nodes and vectors
Qdrant-->>Builder: Index ready
Builder-->>User: VectorStoreIndex
```
This diagram illustrates the target workflow for populating the Neo4j knowledge graph, which is not yet implemented.
```mermaid
sequenceDiagram
autonumber
participant Ingest as Ingestion Pipeline
participant NLP as NLP Module
participant KG as KG Module (upsert.py)
participant Neo4j as Neo4j Database
Ingest->>NLP: Text chunks
NLP->>NLP: extract_claims()
NLP->>NLP: extract_entities()
NLP-->>KG: (Claim, Entities, Source) tuples
KG->>Neo4j: MERGE (s:Source), (c:Claim), ...
Neo4j-->>KG: Upsert complete
```
Status: Aspirational. This describes the target workflow for the governance agents, which are currently unimplemented.
```mermaid
sequenceDiagram
autonumber
participant CR as Cron/Scheduler
participant AU as Auditor
participant NLI as NLI Service
participant KG as Neo4j
CR->>AU: Start audit cycle
AU->>KG: Sample candidate pairs (near‑duplicate, conflicting)
AU->>NLI: Score entailment/contradiction
NLI-->>AU: {entails, contradicts}
AU->>KG: Propose actions (graft, keep_both, shadow, delete)
AU->>KG: Execute grafts (merge canonical, DERIVED_FROM edges)
AU->>KG: Prune per policy (score thresholds)
KG-->>CR: Summary (merged, shadowed, deleted)
```
Status: Implemented. This diagram shows the implemented query path exposed by the `/search` endpoint.
```mermaid
sequenceDiagram
autonumber
participant User as User/GUI
participant API as FastAPI Server
participant Retriever as rag.retriever.LlamaIndexRetriever
participant Answerer as rag.answerer.Answerer
participant Qdrant as Qdrant
participant LLM as OllamaLLM
User->>API: GET /search?q=...
API->>Retriever: retrieve(q)
Retriever->>Qdrant: vector_search(query_vector)
Qdrant-->>Retriever: Search results
Retriever-->>API: list[Claim]
API->>Answerer: compose_answer(q, claims)
Answerer->>LLM: generate_response(prompt)
LLM-->>Answerer: Answer text
Answerer-->>API: Final answer
API-->>User: JSON {answer, hits}
```
```mermaid
sequenceDiagram
autonumber
participant Dev as Dev/Codex
participant GH as GitHub Actions
participant Repo as AGENTS.md
Dev->>Repo: Propose diff
Dev->>GH: Open PR
GH->>Repo: Validate front‑matter (schema)
GH->>GH: Run fmt→lint→types→tests
GH-->>Dev: Pass/Fail report
Dev->>GH: Fix & re‑push until green
GH-->>Dev: Merge allowed
```
Status: Partially Implemented. The bootstrap script handles environment setup, but database initialization is currently a manual step.
```mermaid
sequenceDiagram
autonumber
participant User as User
participant Script as scripts/bootstrap.sh
participant Python as Python script
User->>Script: ./scripts/bootstrap.sh
Script->>Script: Copy .env.example to .env if needed
Script->>Python: "from src.core.config import settings"
Python-->>Script: Load and print settings
Script-->>User: "Bootstrap complete."
```
```mermaid
sequenceDiagram
autonumber
participant RET as Retriever
participant ANS as Answerer
participant LLM as LLM
RET-->>ANS: contexts (claims+citations)
ANS->>ANS: build prompt (cite claim_ids/source_urls)
ANS->>ANS: enforce min supporting claims
ANS->>LLM: call(model, prompt)
LLM-->>ANS: draft answer
ANS->>ANS: strip user tool directives / sanitize
ANS-->>RET: answer + citations
```
```mermaid
sequenceDiagram
autonumber
participant Op as Operator
participant API as /admin/policies
participant Neo as KG
Op->>API: PUT policies?dry=true
API->>Neo: validate against schema
Neo-->>API: diff + impact preview
Op->>API: PUT policies (apply)
API->>Neo: persist new Policy nodes
Neo-->>API: version stamped
```
| Label | Required Props | Optional Props | Types / Domains | Defaults |
|---|---|---|---|---|
| Claim | claim_id, text, source_hash, ingested_at | updated_at, score, visibility, topics[], nli | claim_id: string (stable hash), text: string, score: float∈[0,1], visibility: enum{active,shadow,deleted} | visibility=active, score=0.0 |
| Concept | name | aliases[] | name: string, aliases: string[] | — |
| Author | handle | name | handle: string, name: string | — |
| Source | url, hash, retrieved_at | publisher, type | url: uri, hash: sha256, type: enum{arxiv,blog,doc,release} | — |
| Policy (meta) | key, value | updated_at | free-form policy overrides (stored in graph for audit) | — |
Invariant: No orphan Claim nodes: every Claim must have (:Claim)-[:CITED_IN]->(:Source).
| Relation | From → To | Cardinality | Notes |
|---|---|---|---|
| ABOUT | Claim → Concept | many→many | Concepts act as topical anchors |
| WRITTEN_BY | Claim → Author | many→1 | Normalize authors to handle |
| CITED_IN | Claim → Source | 1→1..n | Provenance; may duplicate across claims |
| SUPPORTS / CONTRADICTS | Claim ↔ Claim | many↔many | Derived from NLI + heuristics |
| DERIVED_FROM | Claim → (Claim, Source) | many→many | Lineage preserved through grafts |
Graph invariants:
- `SUPPORTS` and `CONTRADICTS` have undirected semantics; store as a single directed edge with `bidirectional=true`, or maintain symmetrical pairs for simpler querying.
- `CONTRADICTS` edges must not connect a claim to itself.
- `visibility='deleted'` nodes are tombstones kept ≤ 7 days (see retention).
```cypher
CREATE CONSTRAINT claim_id IF NOT EXISTS FOR (c:Claim) REQUIRE c.claim_id IS UNIQUE;
CREATE INDEX claim_score IF NOT EXISTS FOR (c:Claim) ON (c.score);
CREATE INDEX claim_visibility IF NOT EXISTS FOR (c:Claim) ON (c.visibility);
CREATE CONSTRAINT concept_name IF NOT EXISTS FOR (k:Concept) REQUIRE k.name IS UNIQUE;
CREATE INDEX source_url IF NOT EXISTS FOR (s:Source) ON (s.url);
```

- Shadowing: `visibility='shadow'` excludes from top‑k but keeps for lineage.
- Tombstone: `visibility='deleted'` with `deleted_at`; weekly job purges tombstones older than `retention_days`.
```cypher
// Orphan claims (should be 0)
MATCH (c:Claim) WHERE NOT (c)-[:CITED_IN]->(:Source) RETURN count(c);
// Self contradictions (should be 0)
MATCH (c:Claim)-[:CONTRADICTS]-(c) RETURN count(c);
// Duplicate claims by text+source (should be 0)
MATCH (c1:Claim),(c2:Claim)
WHERE c1.claim_id <> c2.claim_id AND c1.text=c2.text AND c1.source_hash=c2.source_hash
RETURN count(*) LIMIT 1;
```

Example: a single claim with its outgoing edges:

```cypher
(:Claim {text: "Hybrid retrieval combines dense and graph hops"})
  -[:ABOUT]->(:Concept {name: "Hybrid Retrieval"})
  -[:CITED_IN]->(:Source {url: "https://blog.example/rag-hybrid"})
  -[:SUPPORTS]->(:Claim {text: "Graph expansion increases recall@10 by 8–15% on tech QA"})
```
We retain the multiplicative core but add smoothing and weights:
score = (ε + F)^{w_f} × (ε + T)^{w_t} × (ε + U)^{w_u}
- F Freshness, T Trust, U Utility ∈ [0,1]; ε=0.05 prevents zero‑kill.
- Default weights: w_f=1.0, w_t=1.2, w_u=0.8 (trust slightly emphasized).
Exponential decay with half‑life:
F = 0.5^( age_days / half_life_days )
- Domain default: `half_life_days=270` (RAG techniques age slower than product releases).
Combine source & author reputation and graph support:
T = 0.5·R_source + 0.3·R_author + 0.2·S_graph
- `R_source` ∈ [0,1] via allowlist (e.g., arXiv=0.9, random blog=0.4).
- `R_author` from a simple prior (citations, history) ∈ [0,1].
- `S_graph` = normalized in‑support minus in‑contradict degree (sigmoid‑scaled).
Behavioral signals:
U = σ( a·clicks + b·answers + c·feedback )
- σ is logistic; defaults: a=0.02, b=0.03, c=0.1 (thumbs have stronger lift).
```yaml
freshness: { half_life_days: 270 }
trust: { min_provenance_weight: 0.35 }
utility: { min_clicks: 1, min_answers_served: 1 }
thresholds: { delete_score: 0.18, downgrade_score: 0.32 }
contradictions:
  keep_both_if_recent_days: 30
  nli: { entail: 0.78, contradict: 0.78, hysteresis: 0.05 }
```

- Entailment ≥ threshold: canonicalize newer/higher‑trust as winner; attach `:DERIVED_FROM` to keep lineage.
- Contradiction ≥ threshold: keep both; prefer the one with higher score in retrieval; open a debate window before pruning.
- Uncertain: keep both; schedule re‑check after Δt or on new evidence.
```python
if score < delete_score:
    delete(tombstone=True)
elif score < downgrade_score:
    set_visibility('shadow')
else:
    set_visibility('active')
```
- Paper posted 180 days ago ⇒ F = 0.5^(180/270) ≈ 0.63.
- Trusted source 0.9, author 0.6, support 0.4 ⇒ T = 0.5·0.9 + 0.3·0.6 + 0.2·0.4 = 0.71.
- Utility from usage ⇒ U ≈ 0.55.
- Score ≈ (0.05+0.63)^1.0 × (0.05+0.71)^1.2 × (0.05+0.55)^0.8 ≈ 0.33 → active (just above downgrade_score=0.32); reproduced in the sketch below.
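A sketch of the smoothed multiplicative score with the default weights; the last lines reproduce the worked example above:

```python
def freshness(age_days: float, half_life_days: float = 270.0) -> float:
    return 0.5 ** (age_days / half_life_days)

def trust(r_source: float, r_author: float, s_graph: float) -> float:
    return 0.5 * r_source + 0.3 * r_author + 0.2 * s_graph

def claim_score(f: float, t: float, u: float, w_f: float = 1.0, w_t: float = 1.2,
                w_u: float = 0.8, eps: float = 0.05) -> float:
    return ((eps + f) ** w_f) * ((eps + t) ** w_t) * ((eps + u) ** w_u)

f = freshness(180)        # ≈ 0.63
t = trust(0.9, 0.6, 0.4)  # = 0.71
u = 0.55
print(round(claim_score(f, t, u), 2))  # ≈ 0.33 -> above downgrade_score=0.32, stays active
```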
- AuthN: Bearer JWT (OIDC).
- Roles: `reader` (search only), `operator` (trigger audit/graft/prune), `admin` (policy edits).
- All admin/operator routes require role claim `role ∈ {operator, admin}`.
- `GET /search` (see the route sketch after this list)
  - Query: `q` (str, ≥3), `k` (int, ≤50, default 20), `expand` (bool), `include` (enum: active|shadow|all, default active), `since` (ISO8601).
  - 200 Response: `{"q":"...","hits":[{"claim_id":"...","text":"...","score":0.61,"provenance":{"source_url":"...","publisher":"..."},"neighbors":[{"claim_id":"...","rel":"SUPPORTS"}]}]}`
  - Errors: 400 invalid query, 429 rate limit, 500 server.
- `POST /admin/audit` (operator+): run contradiction/drift audit (async job id)
- `POST /admin/graft` (operator+): execute graft on queued candidates
- `POST /admin/prune` (operator+): apply prune/shadow per thresholds
- `GET /admin/policies` (admin): current YAML snapshot
- `PUT /admin/policies` (admin): validate + update policies (dry‑run with `?dry=true`)
- `GET /healthz`: liveness probe
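A sketch of the `/search` contract as a FastAPI route; the `hybrid_search` helper is a hypothetical stand-in for the retriever wired up in `api/server.py`:

```python
from fastapi import FastAPI, HTTPException, Query

app = FastAPI()

def hybrid_search(q: str, k: int, include: str) -> list[dict]:
    # Hypothetical stand-in for the vector + graph retriever.
    return []

@app.get("/search")
def search(
    q: str = Query(min_length=3),
    k: int = Query(default=20, le=50),
    include: str = Query(default="active", pattern="^(active|shadow|all)$"),
):
    try:
        hits = hybrid_search(q, k=k, include=include)
    except ValueError:
        raise HTTPException(status_code=400, detail="invalid query")
    return {"q": q, "hits": hits}  # hits carry claim_id, text, score, provenance, neighbors
```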
- Prefix future breaking API as `/v2/...`; keep `/v1` stable for ≥ 6 months.
- Cursor pagination for `/search` with `next_cursor` token; bounded `k ≤ 50`.
- Global: 60 req/min per IP; burst 120; admin endpoints 10/min.
```bash
curl 'http://localhost:8080/search?q=hybrid%20retrieval&k=10'
curl -X POST -H 'Authorization: Bearer $TOKEN' \
'http://localhost:8080/admin/prune'
curl -X PUT -H 'Authorization: Bearer $TOKEN' -H 'Content-Type: application/yaml' \
  --data-binary @configs/prune_policies.yaml 'http://localhost:8080/admin/policies?dry=true'
```

```mermaid
flowchart TD
U[User: Researcher] -->|asks question| S[Search Box]
S --> R[Hybrid Retrieve]
R --> C[Results List<br/>Claim + Snippet + Score]
C --> P[Provenance Panel<br/>Sources, Authors, Edges]
P --> T[Trace Graph<br/>Neighborhood Explorer]
T --> F[Feedback<br/>👍/👎 relevance]
```
Screen anatomy (textual)
- Left: query + filters (topic, recency)
- Middle: ranked claims, per‑claim score bars, badges (fresh/trust)
- Right: provenance & mini‑graph; click to expand full graph overlay
- Footer: feedback widgets log utility signals
```mermaid
flowchart TD
OP[Operator] --> D[Dashboard]
D --> M[Metrics:<br>ingest rate, merges, deletes, shadowed]
D --> Q[Quality:<br>contradictions over time, eval scores]
D --> A[Actions:<br>Review candidates<br>(Graft/Prune overrides)]
A -->|Approve| Pipeline[Execute & Log]
```
Admin console panels
- Metrics: ingestion lag, audit throughput, %shadowed, delete rate, MRR@10, nDCG@10
- Candidates: sortable table with diff view (A vs B claim text, scores, NLI probs)
- Policy Editor: YAML with live schema validation & dry‑run preview
```bash
# 1. Set up environment variables
cp .env.example .env
# 2. Start Docker containers for databases
docker compose up -d
# 3. Run the bootstrap script to verify environment config
./scripts/bootstrap.sh
# 4. Run tests to ensure everything is working
pytest -q
# 5. Start the API server
uvicorn src.api.server:app --reload --port 8080
```

Note: The `bootstrap.sh` script only validates your `.env` file. It does not initialize database schemas, constraints, or collections; these must be managed manually or through future migration scripts.
For interactive querying, a simple desktop GUI is provided.
- Technology: `tkinter` (standard Python library).
- Functionality: Allows a user to enter a natural language query and view a list of retrieved claims with their scores and sources.
- How to run: `python -m src.gui.app`
- Containers: separate services for API, auditor, curator; Neo4j & Qdrant as managed or stateful sets
- K8s: HPA on API; CronJobs for audit/graft/prune
- Secrets: Kubernetes Secrets + sealed‑secrets; never in repo
- Backups: nightly Neo4j dump; Qdrant snapshot; verify restore
- Logs: structured JSON (`request_id`, `claim_id`)
- Metrics: Prometheus exporters
  - `ingest_docs_total`, `audit_pairs_scored_total`, `graft_merges_total`, `prune_deletes_total`
  - Retrieval quality: `mrr_10`, `ndcg_10`, `hit_rate_5`
- Tracing: OpenTelemetry (API → retriever → stores)
- API: p95 search < 300 ms (warm caches)
- Freshness: 95% of eligible sources ingested < 24h
- Consistency: 99% idempotent upserts (no dup keys)
- AuthN/Z: admin endpoints require service account & RBAC; user tokens for write ops
- Provenance: every claim cites Source; no orphan claims
- PII: domain is technical content; if extended, add PII scanners and data handling rules
- Supply chain: pin images; continuously scan deps (Dependabot/Snyk)
`ruff format` → `ruff check` → `mypy` → `pytest`
- Unit: ids, scoring, normalize, NLI wrapper
- Integration: Neo4j upsert/merge/prune; Qdrant search
- E2E: ingest sample doc → query returns expected claims with provenance
- Curate `data/eval/qa.jsonl` with questions about RAG
- Compute MRR@k, nDCG@k, recall@k on claims (not documents); an MRR sketch follows this list
- Regression guard: fail PR if quality drops > ε
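A sketch of the MRR@k half of that harness, assuming `qa.jsonl` records carry `question` and `relevant_claim_ids` fields (the real schema may differ):

```python
import json

def mrr_at_k(records: list[dict], retrieve, k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant claim in the top-k."""
    total = 0.0
    for rec in records:
        relevant = set(rec["relevant_claim_ids"])
        ranked = retrieve(rec["question"], k=k)  # returns claim_ids, best first
        total += next((1.0 / rank for rank, cid in enumerate(ranked, start=1)
                       if cid in relevant), 0.0)
    return total / len(records)

with open("data/eval/qa.jsonl") as fh:
    records = [json.loads(line) for line in fh]
```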
```env
NEO4J_URI=bolt://neo4j:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=...
QDRANT_URL=http://qdrant:6333
EMBED_MODEL=text-embedding-3-large (or local)
NLI_MODEL=roberta-large-mnli (or local)
MAX_CHUNK_TOKENS=1000
PRUNE_DELETE_SCORE=0.18
PRUNE_DOWNGRADE_SCORE=0.32
FRESHNESS_HALF_LIFE_DAYS=270
```
- `configs/prune_policies.yaml`
- `configs/graft_policies.yaml`
- Sources: add `ingest/*.py` with a `fetch()` generator
- Extractors: swap to specialized RAG ontologies (e.g., pipelines, evaluators)
- Embeddings: switch models transparently via `embed/encoder.py`
- Storage: Milvus/Weaviate adapters; add GQL API
- LLM: local vLLM or hosted; add citation‑strict prompt templates
Currently, the system does not implement caching for LLM responses. A valuable future enhancement would be to integrate a caching layer like gpt-cache. This would reduce latency and costs for repeated queries or NLI scoring tasks. The caching mechanism could be added to the OllamaLLM client in src/nlp/llm.py.
- Pause prune (`PRUNE_ENABLED=false`)
- Inspect recent score changes; verify policy edits
- Restore from snapshot if needed; re‑run audit in dry‑run mode
- Raise debate window to 90 days
- Switch NLI to higher‑capacity model
- Manually review top contradictions in Admin console
- Apache‑2.0 (templates, code)
- Cite sources when surfacing claims; store publisher metadata
- Admin UI (Next.js) for dashboards & candidate review
- Temporal edges & time‑aware retrieval
- Automatic excerpt selection for answer composer
- Graph‑aware reranking (learning‑to‑rank)
```bash
cp .env.example .env
docker compose up -d
./scripts/bootstrap.sh
pytest -q
uvicorn src.api.server:app --reload
```
- Use the `/ingest` endpoint or the GUI to add data.
- Use the `/search` endpoint or the GUI to query.
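For example, exercising both endpoints from Python (the `/ingest` payload shape is an assumption; the `/search` response shape follows the sequence diagram above):

```python
import httpx

BASE = "http://localhost:8080"

# Trigger ingestion for a source (payload shape is an assumption).
httpx.post(f"{BASE}/ingest", json={"source": "https://blog.example/rag-post"})

# Query the index.
resp = httpx.get(f"{BASE}/search", params={"q": "hybrid retrieval", "k": 10})
print(resp.json()["answer"])
```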
- Collection: cosine (or dot) distance; payload: `{claim_id, topics[], trust, freshness}` for re‑rank.
- HNSW: m=32, ef_construct=128, query ef_search=128 (tune vs recall); see the sketch below.
- Quantization: enable scalar/product quantization for memory; verify recall on eval set before enabling in prod.
- Filters: pre‑filter by `visibility='active'` and topic when available.
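A sketch of these settings with `qdrant-client`; the collection name and vector size are assumptions:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="claims",  # assumed name
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
    hnsw_config=models.HnswConfigDiff(m=32, ef_construct=128),
)

hits = client.search(
    collection_name="claims",
    query_vector=[0.0] * 384,  # replace with a real query embedding
    limit=20,
    search_params=models.SearchParams(hnsw_ef=128),
    query_filter=models.Filter(must=[
        models.FieldCondition(key="visibility", match=models.MatchValue(value="active")),
    ]),
)
```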
- Prefer index‑backed lookups; avoid label scans on hot paths.
- Use subqueries and `LIMIT` early; batch writes with `apoc.periodic.iterate` (sketched below).
- Keep `:Claim` degree bounded; archive very high‑degree hubs to summary nodes.
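A sketch of batching writes through `apoc.periodic.iterate` from Python (requires the APOC plugin; the score-decay update itself is illustrative):

```python
from neo4j import GraphDatabase

BATCHED_UPDATE = """
CALL apoc.periodic.iterate(
  "MATCH (c:Claim) WHERE c.visibility = 'active' RETURN c",
  "SET c.score = c.score * $decay",
  {batchSize: $batch_size, parallel: false, params: {decay: $decay}}
)
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Commits every batch_size rows instead of one giant transaction.
    session.run(BATCHED_UPDATE, decay=0.99, batch_size=1000).consume()
```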
| Stage | p95 Target |
|---|---|
| Vector search (Qdrant) | ≤ 40 ms |
| Graph expansion (Neo4j) | ≤ 80 ms |
| Re‑rank + compose | ≤ 100 ms |
| End‑to‑end /search | ≤ 300 ms |
- Spoofing: JWT validation, clock skew tolerance ±60s, key rotation.
- Tampering: signed container images; policy updates gated by admin role + audit log.
- Repudiation: structured request IDs; append‑only policy change log in graph.
- Information Disclosure: redact secrets; fetchers sandboxed (no filesystem writes); allowlist egress.
- DoS: rate limits; circuit breakers on external calls; bounded fan‑out.
- Elevation: RBAC; no shell‑exec in request path; least‑privilege DB users.
- Fuzz `/search` (unicode, pathological queries).
- SSRF‑hardening in fetchers: enforce scheme allowlist `https`, max size, `content-type` checks.
- Prompt‑injection hardening in answer composer (see §23).
- Store `Source.license` (e.g., CC‑BY‑4.0, All‑Rights‑Reserved).
- Respect `robots.txt` and usage terms; keep publisher metadata.
- Downstream responses must return citations and observe license terms; optionally suppress excerpt text for non‑redistributable sources.
```text
/migrations
├─ 0001_init.cypher
├─ 0002_add_claim_visibility.cypher
└─ 0003_index_score.cypher
```
- Each file is idempotent; include `IF NOT EXISTS` guards.
- Store applied migrations in `(:_Migration {id, checksum, applied_at})`.
```bash
python -m scripts.migrate up    # apply pending
python -m scripts.migrate down  # rollback last (when reversible)
```
- Neo4j: nightly dump; weekly offsite; verify restore monthly.
- Qdrant: snapshot API nightly; co‑version with Neo dump.
- RPO/RTO: RPO ≤ 24h, RTO ≤ 2h (tune per tier).
- Grafana panels: ingest lag, audit throughput, merge/prune counts, shadow ratio, quality (MRR@10, nDCG@10).
- Alerts:
  - `ingest_lag_seconds` > 86400 for 30m.
  - `contradictions_rate` spike > 3× baseline.
  - `delete_spike` > 5% of claims/day.
- Always include only retrieved claims; no unsupported assertions.
- Template enforces: “Cite claim_ids and source_urls in every paragraph.”
- Refuse answers when `supporting_claims < min_k`.
- Strip/escape user‑provided instructions in content (no tool directives leaking into prompt).
- Seed sources: 5 canonical RAG blog posts, 10 arXiv abstracts, 3 library docs pages.
- Script: `scripts/bootstrap.sh` ingests seed and runs one Audit→Graft→Prune cycle (target behavior; today it only validates the environment).
- Sample queries: "hybrid retrieval", "reranking vs hybrid", "graph‑aware retrieval benefits".
- Graft: merging near‑duplicate or entailing claims into a canonical node with lineage preserved.
- Shadow: visible in lineage but excluded from top‑k retrieval.
- Debate window: time before resolving contradictions to avoid premature pruning.
- MRR / nDCG: retrieval metrics used in our eval harness.