LJPearson176/Kg_rag

KG‑RAG: Self‑Regulating Knowledge Graph for the RAG Domain

Status: Core data pipelines and a multi-agent system are partially implemented. The immediate focus is on integrating these components and formalizing the agent workflows.

Audience: Developers, MLEs, platform/SRE, security, product

Goal: Maintain a living knowledge graph about Retrieval‑Augmented Generation (RAG) that ingests → audits → grafts → prunes to stay current, trustworthy, and useful for downstream RAG systems. This README.md describes the target architecture, with notes indicating the current implementation status.


0. Executive Summary

KG-RAG is a project to build a self-updating knowledge graph for the RAG domain. The project has evolved to center around a Multi-Agent System responsible for query understanding, retrieval, and governance. The long-term vision is a system that:

  • Continuously ingests RAG‑domain sources (papers, blogs, release notes).
  • Extracts claims/entities/relations with provenance into a Neo4j knowledge graph.
  • Embeds content for vector-based retrieval from a Qdrant vector store.
  • Uses a multi-agent system to govern the knowledge base, orchestrate complex queries, and ensure data quality.
  • Exposes a hybrid retriever (vector + graph) powered by specialized agents for advanced RAG applications.

Core design principles: idempotency, provenance, policy‑driven change, observability, safety defaults, modular agent-based architecture.

Current Implementation Status (as of Aug 14, 2025):

The project currently consists of several loosely connected components. The vector ingestion pipeline, the user-facing API/GUI, and the query-side multi-agent system are functional, but the knowledge graph population pipeline is still in early development, and the components are not yet integrated with one another.

  • Vector Ingestion Pipeline (Functional): A LlamaIndex-based pipeline (ingest/) can fetch documents from sources, chunk them, generate embeddings, and store them in a Qdrant vector store.
  • API & GUI (Partially Implemented): A FastAPI server exposes functional /search and /ingest endpoints. A simple tkinter GUI application (gui/app.py) provides a user interface for the search functionality. The /admin endpoints are scaffolds.
  • Multi-Agent System (Implemented): The query-side multi-agent system follows a Plan-and-Execute model. A top-level ComplexQueryAgent orchestrates the full workflow: a PlanningAgent creates a plan, a RoutingAgent selects tools, an OrchestratorAgent executes the plan, and Synthesis and Validation agents produce the final answer. This system is runnable from the command line.
  • Knowledge Graph (KG) Module (Implemented, Not Integrated): Robust, tested modules for connecting to Neo4j and performing idempotent upserts for graph nodes (kg/upsert.py). It is not yet used by the ingestion or agent systems.
  • Natural Language Processing (NLP) Module (Scaffolded): Basic, unused modules for sentence splitting (nlp/claims.py) and entity extraction (nlp/entities.py).

1. Architecture Overview

1.1 High‑Level Dataflow

The diagram below illustrates the three core workflows. The "Vector Ingestion" pipeline is functional, the "Multi-Agent Query" system is implemented but not yet wired into the API, and the "KG Population" pipeline remains aspirational.

flowchart LR
    %% Styles
    classDef implemented fill:#e7fff3,stroke:#10b981,stroke-width:1px,color:#064e3b;
    classDef partial fill:#fffbe6,stroke:#f59e0b,stroke-width:1px,color:#7c2d12;
    classDef scaffold fill:#f3f4f6,stroke:#6b7280,stroke-width:1px,color:#1f2937,stroke-dasharray: 5 5;
    classDef sources fill:#eaf2ff,stroke:#4677f5,stroke-width:1px,color:#0b2b6a;

    subgraph "Sources"
        direction LR
        A1["RSS/Blogs"]:::sources
        A2["arXiv"]:::sources
    end

    subgraph "Pipeline 1: Vector Ingestion (Functional)"
        direction TB
        B1["ingest.fetch"] --> B2["ingest.normalize"] --> B3["ingest.builder<br>(LlamaIndex)"]
        B3 --> C1[(Qdrant<br>Vectors)]
    end

    subgraph "Pipeline 2: KG Population (Aspirational)"
        direction TB
        D1["nlp.claims<br>(spaCy)"] -.-> D2["kg.upsert"]
        D2 -.-> C2[(Neo4j<br>Graph)]
    end

    subgraph "Pipeline 3: Multi-Agent Query (Implemented)"
        direction TB
        E1["User Query"] --> E2["ComplexQueryAgent"]
        E2 --> E3["Orchestrator"]
        E3 --> E4{"Planning/Routing"}
        E3 --> E5["Tool Execution"]
        E5 --> C1
        E5 --> C2
        E2 --> E6["Synthesis"]
        E2 --> E7["Validation"]
    end

    %% Apply classes
    class A1,A2 sources;
    class B1,B2,B3,C1 implemented;
    class D1,D2,C2 scaffold;
    class E1,E2,E3,E4,E5,E6,E7 implemented;

1.1.1 Triggers & Scheduling

  • Who calls ingest? Cron/agent (batch) or an operator via /admin endpoints.
  • Granularity: per-source (feed, site, arXiv query).
  • Config knobs: INGEST_SOURCES, INGEST_RATE_LIMIT_QPS, MAX_FETCH_BYTES, RETRY_BACKOFF.
  • Metrics: ingest_jobs_total, ingest_sources_active, ingest_errors_total.

1.1.2 Fetchers (ingest/fetch.py)

Status: Implemented. The fetcher lives in ingest/fetch.py and supports RSS and arXiv sources; the bullets below describe the full target behavior, which the current implementation only partially covers.

  • Input: URL or query spec (RSS feed URL, arXiv query, static docs list).
  • Process: HTTP GET with timeouts, size caps, redirects ≤ N, user-agent override; content-type allowlist (text/html, application/xml, application/json, text/markdown).
  • Output: (url, fetched_at, content_bytes, content_type, etag/last_modified?).
  • Failures & retries: network errors → exponential backoff (jitter), 4xx (except 429) are terminal, 5xx retried.
  • Side-effects: dedupe by URL + ETag/Last-Modified (skip unchanged).
  • Metrics: fetch_ok_total, fetch_4xx_total, fetch_5xx_total, p95 latency.
  • Config: HTTP_TIMEOUT_SEC, MAX_REDIRECTS, ALLOWED_SCHEMES=["https"].
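The retry rules above can be captured in two small helpers. This is an illustrative sketch, not the actual ingest/fetch.py API:

```python
import random
from typing import Optional

def is_retryable(status_code: Optional[int]) -> bool:
    """None signals a network error (no HTTP response received)."""
    if status_code is None or status_code == 429:
        return True                    # network errors and 429 are retried
    if 400 <= status_code < 500:
        return False                   # other 4xx are terminal
    return status_code >= 500          # 5xx are retried

def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: uniform over [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

The `base` and `cap` values are placeholders; in the target design they would come from `RETRY_BACKOFF` config.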

1.1.3 Normalize & Chunk (ingest/normalize.py and ingest/builder.py)

Status: Implemented. Normalization using trafilatura is in ingest/normalize.py. Chunking is handled by LlamaIndex's SentenceSplitter in ingest/builder.py, which is the primary pipeline.

  • Input: raw HTML/JSON/MD.
  • Process: HTML→text (readability rules, boilerplate removal, code-block preservation); Markdown normalization; sentence segmentation; chunking ~800–1200 tokens (sentence-aware).
  • Output: NormalizedDoc {source_url, source_hash, text, chunks[]} with source_hash = sha256(canonicalized_text).
  • Failures: malformed docs → skip with reason; record dead-letter.
  • Metrics: normalize_docs_total, normalize_skipped_total, chunk length distribution.
  • Config: MAX_CHUNK_TOKENS, CHUNK_OVERLAP_TOKENS.
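For intuition, a minimal sentence-aware chunker might look like the following. The actual pipeline uses LlamaIndex's SentenceSplitter; here whitespace word counts stand in for real token counts:

```python
from typing import List

def chunk_sentences(sentences: List[str], max_tokens: int = 1000,
                    overlap_sentences: int = 1) -> List[str]:
    """Group sentences into chunks of at most max_tokens (approximated by
    word count), carrying trailing sentences forward as overlap."""
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences else []
            count = sum(len(s.split()) for s in current)
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```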

1.1.4 Extraction (nlp/claims.py, nlp/entities.py)

Status: Scaffolded, Not Integrated. The nlp module contains basic implementations for claim and entity extraction. However, it is not currently used by the functional LlamaIndex ingestion pipeline. The descriptions below outline the target functionality for the future integrated KG pipeline.

  • Input: chunks[], source_url, source_hash.
  • Process: claim mining (sentence-level candidates, heuristic/LLM filter); entity linking → Concept(name), Author(handle/name); tuples (claim_text, concepts[], author?, source_url, source_hash, ingested_at).
  • Output: ExtractedClaim[].
  • IDs: claim_id = sha256(normalize(claim_text) + source_hash) (stable).
  • Metrics: extract_claims_total, extract_concepts_total, extract_yield_per_doc.
  • Config: MIN_CLAIM_LEN, MAX_CLAIM_LEN, ENTITY_CONFIDENCE_MIN.
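The stable-ID scheme is straightforward to sketch. The project's actual normalize() is not shown in this README, so NFKC + lowercase + whitespace collapse below is an assumed placeholder:

```python
import hashlib
import re
import unicodedata

def normalize(text: str) -> str:
    """Assumed normalization: unicode NFKC, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def claim_id(claim_text: str, source_hash: str) -> str:
    """claim_id = sha256(normalize(claim_text) + source_hash), hex-encoded."""
    payload = (normalize(claim_text) + source_hash).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Because normalization absorbs incidental whitespace and casing differences, re-ingesting an unchanged sentence from the same source yields the same ID, which is what makes the downstream MERGE idempotent.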

1.1.5 KG Upsert (kg/upsert.py, kg/schema.py)

Status: Implemented, Not Integrated. The kg module contains robust, tested functions for idempotently upserting data into Neo4j. However, this module is not currently called by any ingestion pipeline.

  • Input: ExtractedClaim[].
  • Process (transactional): MERGE Source, Claim, Concept(s), Author; MERGE relations CITED_IN, ABOUT, WRITTEN_BY; set defaults visibility='active', score=0.0.
  • Output: graph writes (idempotent).
  • Constraints: unique on Claim.claim_id, Concept.name, Author.handle, Source.url.
  • Failures: constraint conflicts handled by MERGE; transient errors → retry tx.
  • Metrics: kg_upserts_total, kg_relations_created_total, kg_tx_retries_total.
  • Config: NEO4J_URI, NEO4J_DB, WRITE_BATCH_SIZE.
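A hypothetical sketch of how a single claim's idempotent upsert might be expressed; the real kg/upsert.py API may differ, and the dataclass and parameter builder here are illustrative:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ExtractedClaim:
    claim_id: str
    text: str
    source_url: str
    source_hash: str
    ingested_at: str
    concepts: List[str] = field(default_factory=list)
    author: Optional[str] = None

# Parametrized Cypher: MERGE makes every statement idempotent, and defaults
# are only set ON CREATE so re-ingestion does not clobber later updates.
UPSERT_CLAIM_CYPHER = """
MERGE (s:Source {url: $source_url})
MERGE (c:Claim {claim_id: $claim_id})
  ON CREATE SET c.text = $text, c.source_hash = $source_hash,
                c.ingested_at = $ingested_at,
                c.visibility = 'active', c.score = 0.0
MERGE (c)-[:CITED_IN]->(s)
FOREACH (name IN $concepts |
  MERGE (k:Concept {name: name})
  MERGE (c)-[:ABOUT]->(k))
"""

def upsert_params(claim: ExtractedClaim) -> Dict[str, object]:
    return {
        "source_url": claim.source_url,
        "claim_id": claim.claim_id,
        "text": claim.text,
        "source_hash": claim.source_hash,
        "ingested_at": claim.ingested_at,
        "concepts": claim.concepts,
    }
```

With the official neo4j driver this would run as something like `session.execute_write(lambda tx: tx.run(UPSERT_CLAIM_CYPHER, **upsert_params(claim)))`.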

1.1.6 Embedding & Vector Upsert (ingest/builder.py)

Status: Implemented. This entire process is handled by the LlamaIndex pipeline within ingest/builder.py:KnowledgeGraphBuilder. The standalone embed module is not directly used by this pipeline.

  • Input: Text nodes (chunks) from the SentenceSplitter.
  • Process:
    1. The HuggingFaceEmbedding model (sentence-transformers) is used to generate a vector for each text node.
    2. The node (with its text and metadata) and its corresponding vector are upserted into the Qdrant collection specified in the configuration.
  • Output: An indexed VectorStoreIndex in Qdrant.
  • Note: The claim_id assigned to the node serves as the point ID in Qdrant.
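One practical wrinkle: Qdrant point IDs must be unsigned integers or UUIDs, so a 64-character sha256 digest cannot be used verbatim. One common workaround (an assumption here; the builder's actual mapping is not shown) is to derive a UUID from the first 128 bits of the claim_id:

```python
import uuid

def point_id_from_claim_id(claim_id_hex: str) -> str:
    """Derive a deterministic, Qdrant-compatible UUID from a sha256 hex digest
    by taking its first 128 bits. Collisions are negligible at this width."""
    return str(uuid.UUID(hex=claim_id_hex[:32]))
```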

1.1.7 Governance Agents (Aspirational) (agents/*)

Status: Aspirational. The agent skeletons in src/agents (auditor.py, curator.py) represent a planned system for automated graph maintenance, distinct from the implemented query-focused multi-agent system in src/multi_agent_system. The description below is the target design for this future governance loop.

  • Trigger: Scheduled (hourly audit, daily prune) or manual via /admin.
  • Audit: The Auditor agent would sample candidate claim pairs from the KG and use an NLI service to score them for entailment or contradiction.
  • Graft & Prune: Based on audit results and configured policies, the Curator agent would perform graft (merge) and prune (shadow/delete) operations.
  • Outputs: Updated KG; counters for merges/shadows/deletes.
  • Config: Policies in configs/*_policies.yaml.

1.1.8 Serving Path (Multi-Agent System) (src/multi_agent_system/)

Status: Implemented (not yet connected to the API server). The core RAG and query-answering logic is handled by a multi-agent system that uses a manual, sequential orchestration pattern.

  • Input: User query (string).
  • Orchestration Flow:
    1. Planning: A PlanningAgent decomposes the input query into a series of steps.
    2. Routing: A RoutingAgent determines the appropriate retrieval agent for each step.
    3. Retrieval: The Orchestrator dynamically loads and executes the chosen retrieval agent (e.g., VectorDatabaseRetrievalAgent, KnowledgeGraphRetrievalAgent, WebSearchRetrievalAgent) to gather context.
    4. Synthesis: A SynthesisAgent combines the retrieved context into a draft answer.
    5. Validation: A ValidationAgent checks the draft answer for consistency against the retrieved context.
  • Output: A final answer string, appended with a validation status.
  • Metrics: retrieval_tasks_total, synthesis_latency_ms, validation_passes_total.
  • Config: Agent configurations are loaded from AGENTS.md files within each module.
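The five-step flow above can be sketched as a plain sequential function with stub agents injected as callables. The real agents in src/multi_agent_system are classes, so the names and signatures here are illustrative only:

```python
from typing import Callable, Dict, List

def run_complex_query(query: str,
                      plan: Callable[[str], List[str]],
                      route: Callable[[str], str],
                      tools: Dict[str, Callable[[str], str]],
                      synthesize: Callable[[str, List[str]], str],
                      validate: Callable[[str, List[str]], bool]) -> str:
    steps = plan(query)                    # 1. PlanningAgent decomposes the query
    context: List[str] = []
    for step in steps:                     # 2-3. RoutingAgent picks a tool, which is executed
        tool_name = route(step)
        context.append(tools[tool_name](step))
    draft = synthesize(query, context)     # 4. SynthesisAgent drafts an answer
    ok = validate(draft, context)          # 5. ValidationAgent checks consistency
    return f"{draft}\n[validation: {'passed' if ok else 'failed'}]"
```

Because every dependency is injected, the whole pipeline can be exercised with lambdas in a unit test, without any LLM or database.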

1.1.9 IDs, Provenance & Idempotency

  • source_hash = sha256(canonical_text); claim_id = sha256(normalize(sentence) + source_hash).
  • Every Claim must have (:Claim)-[:CITED_IN]->(:Source); lineage via DERIVED_FROM.
  • Re-ingestion is a no-op unless text changed (MERGE semantics; selective re-embed).

1.1.10 Backpressure & Failure Handling

  • Cap in‑flight fetches/embeddings; queue retries; degrade governance if backlog grows.
  • Dead‑letter store for failed sources; operator review.
  • Circuit breakers: pause audit/graft/prune on anomalous merge/delete spikes.

1.1.11 Observability (per stage)

  • Structured logs with request_id, source_url, claim_id, stage.
  • Metrics: counters/histograms listed above; Prometheus export.
  • Tracing: spans for fetch, normalize, extract, upsert, embed, qdrant.search, neo4j.query, rank, compose.

1.1.12 Performance Targets (p95)

  • Fetch+normalize ≤ 200 ms/doc (excl. network)
  • Embed ≤ 15 ms/claim (CPU) or ≤ 3 ms (GPU)
  • Qdrant top‑k ≤ 40 ms; KG expansion ≤ 80 ms
  • End‑to‑end /search ≤ 300 ms

1.2 Component Responsibility Matrix

Note: This table reflects the current implementation status of the core components.

| Component | Status | Language | Key Deps | Responsibilities | Inputs | Outputs |
|---|---|---|---|---|---|---|
| Fetchers | Implemented | Python | httpx, feedparser | Rate‑limited fetching (RSS, arXiv) | Source URLs | Raw HTML/JSON |
| Normalizer | Implemented | Python | trafilatura | Clean and de‑noise HTML | HTML | Clean text |
| Ingestion Pipeline | Implemented | Python | llama-index | Load, chunk, embed, and store documents into Qdrant | Clean text files | Vectors in Qdrant |
| Extractors | Scaffolded | Python | spaCy | Claim & entity extraction | Chunks | Tuples for KG |
| KG Upsert | Implemented | Python | neo4j | Idempotent node/edge upserts w/ provenance | Tuples | Nodes/Edges |
| ComplexQueryAgent | Implemented | Python | langchain-core | End-to-end query processing (plan, execute, synthesize) | User Query | Final Answer |
| OrchestratorAgent | Implemented | Python | langchain-core | Executes a plan of tool calls to gather information | Plan | Execution Results |
| Planning/Routing | Implemented | Python | langchain-core | Generate plan; select tools for execution | Query | Plan / Tool Name |
| Synthesis/Validation | Implemented | Python | langchain-core | Synthesize and validate the final answer | Retrieved Context | Final Answer |
| API (Search/Ingest) | Implemented | Python | FastAPI | REST surface for /search and /ingest | Requests | JSON responses |
| API (Admin) | Scaffolded | Python | FastAPI | Placeholder REST surface for /admin | Requests | JSON responses |
| RAG Retriever | Implemented | Python | llama-index | Retrieve claims from vector store for the API | Query | Ranked Claims |
| RAG Answerer | Implemented | Python | langchain-core | Compose a final answer from claims | Claims | Answer string |

1.3 Multi-Agent System Architecture

The core of this project's query processing is a Multi-Agent System, located in src/multi_agent_system. It is designed as a modular, stateful pipeline of specialized agents that work together to answer a user's query, following a Plan-and-Execute model.

The system is orchestrated by the ComplexQueryAgent (complex_query/agent.py), which serves as the main entry point. It manages the end-to-end workflow, ensuring a clear separation of concerns between the different phases of query processing.

The agent workflow is as follows:

  1. ComplexQueryAgent: Receives the user query and initiates the process.
  2. OrchestratorAgent: Called by the ComplexQueryAgent, this agent is responsible for the "execute" portion of the strategy. It first calls the PlanningAgent to generate a multi-step plan. Then, for each step, it uses the RoutingAgent to select the appropriate tool and executes it to gather context. It returns the collected results to the ComplexQueryAgent.
  3. SynthesisAgent: The ComplexQueryAgent then passes the collected context to the SynthesisAgent, which aggregates the information and composes a draft answer.
  4. ValidationAgent: Finally, the ComplexQueryAgent sends the draft answer and the original context to the ValidationAgent. This agent performs a final check to ensure the answer is consistent and supported by the retrieved context.
  5. ComplexQueryAgent: The top-level agent formats and returns the final, validated response to the user.

Each agent is configured via its own AGENTS.md file, which is loaded at runtime by the loader.py module. This allows for decentralized and modular configuration of prompts, models, and other parameters for each agent.
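The exact AGENTS.md format is not documented in this README, so the sketch below assumes a YAML-style front-matter block with flat key: value pairs; the real loader.py may parse something richer:

```python
def parse_front_matter(markdown: str) -> dict:
    """Extract flat key: value pairs from a leading '---' front-matter block.
    Returns {} if the document has no front matter."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    config = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break                       # end of front matter
        if ":" in line:
            key, _, value = line.partition(":")
            config[key.strip()] = value.strip()
    return config
```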

1.4 Module Component Map

Note: This diagram shows the primary modules and their relationships.

flowchart TB
    %% Define styles
    classDef implemented fill:#e7fff3,stroke:#10b981;
    classDef partial fill:#fffbe6,stroke:#f59e0b;
    classDef scaffold fill:#f3f4f6,stroke:#6b7280,stroke-dasharray: 5 5;

    subgraph "Core Services"
        C_CFG[core.config]
        C_IDS[core.ids]
        C_LOG[core.logging]
    end

    subgraph "Data Stores"
        V_QDR[(Qdrant)]
        K_NEO[(Neo4j)]
    end

    subgraph "Vector Pipeline (Functional)"
        I_FET[ingest.fetch] --> I_NORM[ingest.normalize]
        I_NORM --> I_BUILD["ingest.builder<br>(LlamaIndex)"]
        I_BUILD --> V_QDR
    end

    subgraph "KG Pipeline (Aspirational)"
        N_CLM[nlp.claims]:::scaffold -.-> K_UPS[kg.upsert]
        K_UPS -.-> K_NEO
    end

    subgraph "Multi-Agent System (Implemented)"
        MAS_CQ[complex_query.agent]:::implemented
        MAS_ORCH[orchestrator.agent]:::implemented
        MAS_PLAN[planning.agent]:::implemented
        MAS_ROUTE[routing.agent]:::implemented
        MAS_SYNTH[synthesis.agent]:::implemented
        MAS_VAL[validation.agent]:::implemented

        MAS_CQ --> MAS_ORCH
        MAS_CQ --> MAS_SYNTH
        MAS_CQ --> MAS_VAL
        MAS_ORCH --> MAS_PLAN
        MAS_ORCH --> MAS_ROUTE
    end

    subgraph "RAG (Implemented)"
        RAG_RET[rag.retriever] --> RAG_ANS[rag.answerer]
        RAG_RET --> V_QDR
    end

    subgraph "API (Partially Implemented)"
        A_SRV[api.server]:::partial
        A_SRV --> RAG_ANS
        A_SRV --> I_BUILD
    end

    %% Apply classes to implemented modules
    class C_CFG,C_IDS,C_LOG,I_FET,I_NORM,I_BUILD,V_QDR,K_NEO,K_UPS,RAG_RET,RAG_ANS implemented;

2. File Hierarchy (reference)

kg_rag/
├── AGENTS.md                        # Agent contract (acceptance gates)
├── README.md                        # THIS DOCUMENT
├── docker-compose.yml               # Neo4j + Qdrant
├── pyproject.toml                   # Ruff/mypy/pytest config
├── .env.example
├── .pre-commit-config.yaml
├── .github/
│   └── workflows/agents-validate.yml
├── configs/
│   ├── schema.yaml                  # Labels/rels/props/indexes
│   ├── prune_policies.yaml          # Thresholds & decay
│   └── graft_policies.yaml          # NLI & canonicalization
├── scripts/
│   ├── bootstrap.sh                 # first‑run: env, DBs
│   ├── run_audit_cycle.sh           # auditor→graft→prune
│   └── backfill_embeddings.py
├── src/
│   ├── core/
│   │   ├── config.py
│   │   ├── logging.py
│   │   └── ids.py
│   ├── ingest/
│   │   ├── fetch.py
│   │   ├── normalize.py
│   │   └── builder.py
│   ├── nlp/
│   │   ├── claims.py
│   │   ├── entities.py
│   │   └── contradictions.py
│   ├── embed/
│   │   ├── encoder.py
│   │   └── qdrant.py
│   ├── kg/
│   │   ├── neo.py
│   │   ├── schema.py
│   │   ├── upsert.py
│   │   └── queries/
│   │   │   └── *.cypher
│   ├── rag/
│   │   ├── retriever.py
│   │   └── answerer.py
│   ├── multi_agent_system/
│   │   ├── orchestrator/
│   │   ├── react/
│   │   ├── synthesis/
│   │   └── ...
│   ├── agents/
│   │   ├── auditor.py
│   │   ├── curator.py
│   │   └── scheduler.py
│   └── api/
│       └── server.py
└── tests/
    └── test_*.py

3. Sequence Diagrams

3.1 Vector Ingestion Pipeline (Implemented)

This diagram shows the currently implemented, LlamaIndex-based ingestion pipeline that populates the Qdrant vector store.

sequenceDiagram
    autonumber
    participant User as User/Script
    participant Fetch as ingest.fetch
    participant Norm as ingest.normalize
    participant Builder as ingest.builder.KnowledgeGraphBuilder

    User->>Fetch: fetch_arxiv_papers(query)
    Fetch-->>User: list[Document]
    User->>Norm: normalize_documents(docs)
    Norm-->>User: list[Document]
    User->>Builder: build_from_directory(path)

    Builder->>Builder: SimpleDirectoryReader.load_data()
    Builder->>Builder: SentenceSplitter.get_nodes()
    Builder->>Builder: HuggingFaceEmbedding.get_text_embedding()
    Builder->>Qdrant: Upsert nodes and vectors
    Qdrant-->>Builder: Index ready
    Builder-->>User: VectorStoreIndex

3.1a Knowledge Graph Pipeline (Aspirational)

This diagram illustrates the target workflow for populating the Neo4j knowledge graph, which is not yet implemented.

sequenceDiagram
    autonumber
    participant Ingest as Ingestion Pipeline
    participant NLP as NLP Module
    participant KG as KG Module (upsert.py)
    participant Neo4j as Neo4j Database

    Ingest->>NLP: Text chunks
    NLP->>NLP: extract_claims()
    NLP->>NLP: extract_entities()
    NLP-->>KG: (Claim, Entities, Source) tuples
    KG->>Neo4j: MERGE (s:Source), (c:Claim), ...
    Neo4j-->>KG: Upsert complete

3.2 Audit → Graft → Prune Cycle

Status: Aspirational. This describes the target workflow for the governance agents, which are currently unimplemented.

sequenceDiagram
  autonumber
  participant CR as Cron/Scheduler
  participant AU as Auditor
  participant NLI as NLI Service
  participant KG as Neo4j

  CR->>AU: Start audit cycle
  AU->>KG: Sample candidate pairs (near‑duplicate, conflicting)
  AU->>NLI: Score entailment/contradiction
  NLI-->>AU: {entails, contradicts}
  AU->>KG: Propose actions (graft, keep_both, shadow, delete)
  AU->>KG: Execute grafts (merge canonical, DERIVED_FROM edges)
  AU->>KG: Prune per policy (score thresholds)
  KG-->>CR: Summary (merged, shadowed, deleted)

3.3 API Query Path (Implemented)

Status: Implemented. This diagram shows the implemented query path exposed by the /search endpoint.

sequenceDiagram
    autonumber
    participant User as User/GUI
    participant API as FastAPI Server
    participant Retriever as rag.retriever.LlamaIndexRetriever
    participant Answerer as rag.answerer.Answerer
    participant Qdrant as Qdrant
    participant LLM as OllamaLLM

    User->>API: GET /search?q=...
    API->>Retriever: retrieve(q)
    Retriever->>Qdrant: vector_search(query_vector)
    Qdrant-->>Retriever: Search results
    Retriever-->>API: list[Claim]
    API->>Answerer: compose_answer(q, claims)
    Answerer->>LLM: generate_response(prompt)
    LLM-->>Answerer: Answer text
    Answerer-->>API: Final answer
    API-->>User: JSON {answer, hits}

3.4 Agent Governance (Acceptance Gates)

sequenceDiagram
  autonumber
  participant Dev as Dev/Codex
  participant GH as GitHub Actions
  participant Repo as AGENTS.md

  Dev->>Repo: Propose diff
  Dev->>GH: Open PR
  GH->>Repo: Validate front‑matter (schema)
  GH->>GH: Run fmt→lint→types→tests
  GH-->>Dev: Pass/Fail report
  Dev->>GH: Fix & re‑push until green
  GH-->>Dev: Merge allowed

3.5 Bootstrap & Migration

Status: Partially Implemented. The bootstrap script handles environment setup, but database initialization is currently a manual step.

sequenceDiagram
    autonumber
    participant User as User
    participant Script as scripts/bootstrap.sh
    participant Python as Python script

    User->>Script: ./scripts/bootstrap.sh
    Script->>Script: Copy .env.example to .env if needed
    Script->>Python: "from src.core.config import settings"
    Python-->>Script: Load and print settings
    Script-->>User: "Bootstrap complete."

3.6 Answer Composer + Guardrails

sequenceDiagram
  autonumber
  participant RET as Retriever
  participant ANS as Answerer
  participant LLM as LLM

  RET-->>ANS: contexts (claims+citations)
  ANS->>ANS: build prompt (cite claim_ids/source_urls)
  ANS->>ANS: enforce min supporting claims
  ANS->>LLM: call(model, prompt)
  LLM-->>ANS: draft answer
  ANS->>ANS: strip user tool directives / sanitize
  ANS-->>RET: answer + citations

3.7 Policy Update (Dry‑Run & Apply)

sequenceDiagram
  autonumber
  participant Op as Operator
  participant API as /admin/policies
  participant Neo as KG

  Op->>API: PUT policies?dry=true
  API->>Neo: validate against schema
  Neo-->>API: diff + impact preview
  Op->>API: PUT policies (apply)
  API->>Neo: persist new Policy nodes
  Neo-->>API: version stamped

4. Data Model & Indexing

4.1 Node Types (typed schema)

| Label | Required Props | Optional Props | Types / Domains | Defaults |
|---|---|---|---|---|
| Claim | claim_id, text, source_hash, ingested_at | updated_at, score, visibility, topics[], nli | claim_id: string (stable hash), text: string, score: float∈[0,1], visibility: enum{active,shadow,deleted} | visibility=active, score=0.0 |
| Concept | name | aliases[] | name: string, aliases: string[] | |
| Author | handle | name | handle: string, name: string | |
| Source | url, hash, retrieved_at | publisher, type | url: uri, hash: sha256, type: enum{arxiv,blog,doc,release} | |
| Policy (meta) | key, value | updated_at | free-form policy overrides (stored in graph for audit) | |

Invariant: No orphan Claim nodes: every Claim must have (:Claim)-[:CITED_IN]->(:Source).

4.2 Relationship Types & Hints

| Relation | From → To | Cardinality | Notes |
|---|---|---|---|
| ABOUT | Claim → Concept | many→many | Concepts act as topical anchors |
| WRITTEN_BY | Claim → Author | many→1 | Normalize authors to handle |
| CITED_IN | Claim → Source | 1→1..n | Provenance; may duplicate across claims |
| SUPPORTS / CONTRADICTS | Claim ↔ Claim | many↔many | Derived from NLI + heuristics |
| DERIVED_FROM | Claim → (Claim, Source) | many→many | |

Graph invariants

  • SUPPORTS and CONTRADICTS have undirected semantics; store each as a single directed edge with bidirectional=true, or maintain symmetric pairs for simpler querying.
  • CONTRADICTS edges must not connect a claim to itself.
  • visibility='deleted' nodes are tombstones kept ≤ 7 days (see retention).

4.3 Indexes & Constraints (Neo4j 5.x)

CREATE CONSTRAINT claim_id IF NOT EXISTS FOR (c:Claim) REQUIRE c.claim_id IS UNIQUE;
CREATE INDEX claim_score IF NOT EXISTS FOR (c:Claim) ON (c.score);
CREATE INDEX claim_visibility IF NOT EXISTS FOR (c:Claim) ON (c.visibility);
CREATE CONSTRAINT concept_name IF NOT EXISTS FOR (k:Concept) REQUIRE k.name IS UNIQUE;
CREATE INDEX source_url IF NOT EXISTS FOR (s:Source) ON (s.url);

4.4 Retention & Soft‑Delete

  • Shadowing: visibility='shadow' excludes from top‑k but keeps for lineage.
  • Tombstone: visibility='deleted' with deleted_at; weekly job purges tombstones older than retention_days.

4.5 Validation Queries (quality gates)

// Orphan claims (should be 0)
MATCH (c:Claim) WHERE NOT (c)-[:CITED_IN]->(:Source) RETURN count(c);

// Self contradictions (should be 0)
MATCH (c:Claim)-[:CONTRADICTS]-(c) RETURN count(c);

// Duplicate claims by text+source (should be 0)
MATCH (c1:Claim),(c2:Claim)
WHERE c1.claim_id <> c2.claim_id AND c1.text=c2.text AND c1.source_hash=c2.source_hash
RETURN count(*);

4.6 Worked Subgraph Example

(:Claim{text:"Hybrid retrieval combines dense and graph hops"})-[:ABOUT]->(:Concept{name:"Hybrid Retrieval"})
  \-[:CITED_IN]->(:Source{url:"https://blog.example/rag-hybrid"})
  \-[:SUPPORTS]->(:Claim{text:"Graph expansion increases recall@10 by 8–15% on tech QA"})

5. Scoring & Policies

5.1 Composite Score

We retain the multiplicative core but add smoothing and weights:

score = (ε + F)^{w_f} × (ε + T)^{w_t} × (ε + U)^{w_u}

  • F Freshness, T Trust, U Utility ∈ [0,1]; ε=0.05 prevents zero‑kill.
  • Default weights: w_f=1.0, w_t=1.2, w_u=0.8 (trust slightly emphasized).

5.2 Freshness

Exponential decay with half‑life:

F = 0.5^( age_days / half_life_days )

  • Domain default: half_life_days=270 (RAG techniques age slower than product releases).

5.3 Trust

Combine source & author reputation and graph support:

T = 0.5·R_source + 0.3·R_author + 0.2·S_graph

  • R_source ∈ [0,1] via allowlist (e.g., arXiv=0.9, random blog=0.4).
  • R_author from a simple prior (citations, history) ∈ [0,1].
  • S_graph = normalized in‑support minus in‑contradict degree (sigmoid‑scaled).

5.4 Utility

Behavioral signals:

U = σ( a·clicks + b·answers + c·feedback )

  • σ is logistic; defaults: a=0.02, b=0.03, c=0.1 (thumbs have stronger lift).
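The formulas in 5.1–5.4 translate directly into a few lines of Python, using the default weights and ε from above. This is a sketch for intuition, not the project's actual scoring module:

```python
import math

EPS = 0.05  # smoothing term that prevents any single zero factor from killing the score

def freshness(age_days: float, half_life_days: float = 270.0) -> float:
    """F = 0.5^(age_days / half_life_days)"""
    return 0.5 ** (age_days / half_life_days)

def trust(r_source: float, r_author: float, s_graph: float) -> float:
    """T = 0.5*R_source + 0.3*R_author + 0.2*S_graph"""
    return 0.5 * r_source + 0.3 * r_author + 0.2 * s_graph

def utility(clicks: int, answers: int, feedback: float,
            a: float = 0.02, b: float = 0.03, c: float = 0.1) -> float:
    """U = sigmoid(a*clicks + b*answers + c*feedback)"""
    return 1.0 / (1.0 + math.exp(-(a * clicks + b * answers + c * feedback)))

def composite_score(f: float, t: float, u: float,
                    w_f: float = 1.0, w_t: float = 1.2, w_u: float = 0.8) -> float:
    """score = (eps+F)^w_f * (eps+T)^w_t * (eps+U)^w_u"""
    return (EPS + f) ** w_f * (EPS + t) ** w_t * (EPS + u) ** w_u
```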

5.5 Policies (YAML‑driven)

freshness: { half_life_days: 270 }
trust: { min_provenance_weight: 0.35 }
utility: { min_clicks: 1, min_answers_served: 1 }
thresholds: { delete_score: 0.18, downgrade_score: 0.32 }
contradictions:
  keep_both_if_recent_days: 30
  nli: { entail: 0.78, contradict: 0.78, hysteresis: 0.05 }

5.6 Contradiction Resolution

  • Entailment ≥ threshold: canonicalize newer/higher‑trust as winner; attach :DERIVED_FROM to keep lineage.
  • Contradiction ≥ threshold: keep both; prefer the one with higher score in retrieval; open debate window before pruning.
  • Uncertain: keep both; schedule re‑check after Δt or on new evidence.

5.7 Prune & Shadow Logic (pseudo)

if score < delete_score: delete(tombstone=True)
elif score < downgrade_score: set_visibility('shadow')
else: set_visibility('active')
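A runnable version of the pseudocode, with the default thresholds from the policy YAML in 5.5:

```python
def visibility_for(score: float,
                   delete_score: float = 0.18,
                   downgrade_score: float = 0.32) -> str:
    """Map a composite score to a visibility state per the prune policy."""
    if score < delete_score:
        return "deleted"   # tombstoned; purged after the retention window
    if score < downgrade_score:
        return "shadow"    # excluded from top-k but kept for lineage
    return "active"
```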

5.8 Numeric Example

  • Paper posted 180 days ago ⇒ F = 0.5^(180/270) ≈ 0.63.
  • Trusted source 0.9, author 0.6, support 0.4 ⇒ T = 0.5·0.9 + 0.3·0.6 + 0.2·0.4 = 0.71.
  • Utility from usage ⇒ U ≈ 0.55.
  • Score ≈ (0.05+0.63)^1.0 × (0.05+0.71)^1.2 × (0.05+0.55)^0.8 ≈ 0.33 → active (just above the downgrade threshold).

6. API Surface (FastAPI)

6.1 Authentication & Roles

  • AuthN: Bearer JWT (OIDC).
  • Roles: reader (search only), operator (trigger audit/graft/prune), admin (policy edits).
  • All admin/operator routes require a role claim in {operator, admin}.

6.2 Endpoints

  • GET /search

    • Query: q (str, ≥3), k (int, ≤50, default 20), expand (bool), include (enum: active|shadow|all, default active), since (ISO8601).

    • 200 Response:

      {"q":"...","hits":[
        {"claim_id":"...","text":"...","score":0.61,
         "provenance": {"source_url":"...","publisher":"..."},
         "neighbors": [{"claim_id":"...","rel":"SUPPORTS"}]
        }]}
    • Errors: 400 invalid query, 429 rate limit, 500 server.

  • POST /admin/audit (operator+): run contradiction/drift audit (async job id)

  • POST /admin/graft (operator+): execute graft on queued candidates

  • POST /admin/prune (operator+): apply prune/shadow per thresholds

  • GET /admin/policies (admin): current YAML snapshot

  • PUT /admin/policies (admin): validate+update policies (dry‑run with ?dry=true)

  • GET /healthz: liveness probe

6.3 Versioning & Stability

  • Prefix future breaking API as /v2/...; keep /v1 stable for ≥ 6 months.

6.4 Pagination & Limits

  • Cursor pagination for /search with next_cursor token; bounded k≤50.

6.5 Rate Limiting

  • Global: 60 req/min per IP; burst 120; admin endpoints 10/min.
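A single-process token-bucket sketch of these limits (60 req/min with a burst of 120). A real deployment would enforce this in a gateway or shared store; the class here is illustrative only:

```python
import time

class TokenBucket:
    """Token bucket: refills at rate_per_min/60 tokens per second up to `burst`,
    and each allowed request consumes one token."""

    def __init__(self, rate_per_min: float = 60.0, burst: float = 120.0):
        self.rate = rate_per_min / 60.0   # tokens per second
        self.capacity = burst
        self.tokens = burst               # start full so bursts are allowed
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```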

6.6 Curl Examples

curl 'http://localhost:8080/search?q=hybrid%20retrieval&k=10'

curl -X POST -H "Authorization: Bearer $TOKEN" \
  'http://localhost:8080/admin/prune'

curl -X PUT -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/yaml' \
  --data-binary @configs/prune_policies.yaml 'http://localhost:8080/admin/policies?dry=true'

7. Real‑World UX Diagrams

7.1 Researcher UX (Search & Traceability)

flowchart TD
  U[User: Researcher] -->|asks question| S[Search Box]
  S --> R[Hybrid Retrieve]
  R --> C[Results List<br/>Claim + Snippet + Score]
  C --> P[Provenance Panel<br/>Sources, Authors, Edges]
  P --> T[Trace Graph<br/>Neighborhood Explorer]
  T --> F[Feedback<br/>👍/👎 relevance]


Screen anatomy (textual)

  • Left: query + filters (topic, recency)
  • Middle: ranked claims, per‑claim score bars, badges (fresh/trust)
  • Right: provenance & mini‑graph; click to expand full graph overlay
  • Footer: feedback widgets log utility signals

7.2 Operator UX (Govern & Observe)

flowchart TD
  OP[Operator] --> D[Dashboard]
  D --> M[Metrics:<br>ingest rate, merges, deletes, shadowed]
  D --> Q[Quality:<br>contradictions over time, eval scores]
  D --> A[Actions:<br>Review candidates<br>&#40;Graft/Prune overrides&#41;]
  A -->|Approve| Pipeline[Execute & Log]

Admin console panels

  • Metrics: ingestion lag, audit throughput, %shadowed, delete rate, MRR@10, nDCG@10
  • Candidates: sortable table with diff view (A vs B claim text, scores, NLI probs)
  • Policy Editor: YAML with live schema validation & dry‑run preview

8. Operations & Deployment

8.1 Local

# 1. Set up environment variables
cp .env.example .env

# 2. Start Docker containers for databases
docker compose up -d

# 3. Run the bootstrap script to verify environment config
./scripts/bootstrap.sh

# 4. Run tests to ensure everything is working
pytest -q

# 5. Start the API server
uvicorn src.api.server:app --reload --port 8080

Note: The bootstrap.sh script only validates your .env file. It does not initialize database schemas, constraints, or collections. These must be managed manually or through future migration scripts.

8.2 Local GUI Application

For interactive querying, a simple desktop GUI is provided.

  • Technology: tkinter (standard Python library).

  • Functionality: Allows a user to enter a natural language query and view a list of retrieved claims with their scores and sources.

  • How to run:

    python -m src.gui.app
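A minimal sketch of what such a tkinter query window could look like is below. The `/search` endpoint path, port, and JSON response shape (`{"claims": [...]}` with `score`, `text`, `source_url` fields) are assumptions, not the actual contract of `src/gui/app.py`.

```python
# Hedged sketch of a tkinter search GUI; endpoint URL and response shape
# are assumptions for illustration.
import json
import urllib.parse
import urllib.request

API_URL = "http://localhost:8080/search"  # assumed endpoint and port

def format_results(claims: list[dict]) -> str:
    """Render retrieved claims as display text: score, claim text, source."""
    lines = []
    for c in claims:
        lines.append(f"[{c.get('score', 0.0):.2f}] {c.get('text', '')}")
        lines.append(f"    source: {c.get('source_url', 'unknown')}")
    return "\n".join(lines)

def main() -> None:
    import tkinter as tk  # stdlib; imported lazily so headless use still works

    root = tk.Tk()
    root.title("KG-RAG Search")
    entry = tk.Entry(root, width=60)
    entry.pack(padx=8, pady=4)
    output = tk.Text(root, width=80, height=20)
    output.pack(padx=8, pady=4)

    def run_query() -> None:
        url = API_URL + "?q=" + urllib.parse.quote(entry.get())
        with urllib.request.urlopen(url) as resp:  # assumed {"claims": [...]}
            claims = json.load(resp).get("claims", [])
        output.delete("1.0", tk.END)
        output.insert(tk.END, format_results(claims))

    tk.Button(root, text="Search", command=run_query).pack(pady=4)
    root.mainloop()

if __name__ == "__main__":
    main()
```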

8.3 Production (guidance)

  • Containers: separate services for API, auditor, curator; Neo4j & Qdrant as managed or stateful sets
  • K8s: HPA on API; CronJobs for audit/graft/prune
  • Secrets: Kubernetes Secrets + sealed‑secrets; never in repo
  • Backups: nightly Neo4j dump; Qdrant snapshot; verify restore

8.4 Observability

  • Logs: structured JSON (request_id, claim_id)

  • Metrics: Prometheus exporters

    • ingest_docs_total, audit_pairs_scored_total, graft_merges_total, prune_deletes_total
    • Retrieval quality: mrr_10, ndcg_10, hit_rate_5
  • Tracing: OpenTelemetry (API → retriever → stores)
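The structured-logging bullet above can be sketched with the stdlib `logging` module: a JSON formatter that carries the `request_id` and `claim_id` fields end to end. The field set is taken from the bullet; the logger name and setup are illustrative.

```python
# Sketch of structured JSON logging with request_id/claim_id fields, using
# only the stdlib; logger name and handler wiring are assumptions.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Extra fields passed via `extra={...}` land on the record:
            "request_id": getattr(record, "request_id", None),
            "claim_id": getattr(record, "claim_id", None),
        }
        return json.dumps(payload)

def make_logger(name: str = "kg_rag") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

# Usage: make_logger().info("merged claim",
#                           extra={"request_id": "r1", "claim_id": "c9"})
```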

8.5 SLOs (suggested)

  • API: p95 search < 300 ms (warm caches)
  • Freshness: 95% of eligible sources ingested < 24h
  • Consistency: 99% idempotent upserts (no dup keys)

9. Security & Compliance

  • AuthN/Z: admin endpoints require service account & RBAC; user tokens for write ops
  • Provenance: every claim cites Source; no orphan claims
  • PII: domain is technical content; if extended, add PII scanners and data handling rules
  • Supply chain: pin images; continuously scan deps (Dependabot/Snyk)

10. Testing & Evaluation

10.1 CI Gates (from AGENTS.md)

  • ruff format → ruff check → mypy → pytest

10.2 Unit/Integration/E2E

  • Unit: ids, scoring, normalize, NLI wrapper
  • Integration: Neo4j upsert/merge/prune; Qdrant search
  • E2E: ingest sample doc → query returns expected claims with provenance

10.3 Retrieval Eval Harness

  • Curate data/eval/qa.jsonl with questions about RAG
  • Compute MRR@k, nDCG@k, recall@k on claims (not documents)
  • Regression guard: fail PR if quality drops > ε
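The metrics named above can be computed directly over claim-level rankings; a minimal sketch (standard formulas, with `ranking` as claim_ids ordered by score and `relevant` as the gold set for one question):

```python
# Claim-level retrieval metrics: MRR@k, recall@k, nDCG@k (binary relevance).
import math

def mrr_at_k(ranking: list[str], relevant: set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant claim within the top k."""
    for i, cid in enumerate(ranking[:k], start=1):
        if cid in relevant:
            return 1.0 / i
    return 0.0

def recall_at_k(ranking: list[str], relevant: set[str], k: int = 10) -> float:
    if not relevant:
        return 0.0
    hits = sum(1 for cid in ranking[:k] if cid in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranking: list[str], relevant: set[str], k: int = 10) -> float:
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, cid in enumerate(ranking[:k], start=1)
        if cid in relevant
    )
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Averaging these per-question scores over `data/eval/qa.jsonl` yields the aggregate numbers the regression guard compares against.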

11. Configuration

11.1 Environment

NEO4J_URI=bolt://neo4j:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=...
QDRANT_URL=http://qdrant:6333
EMBED_MODEL=text-embedding-3-large  # or a local model
NLI_MODEL=roberta-large-mnli        # or a local model
MAX_CHUNK_TOKENS=1000
PRUNE_DELETE_SCORE=0.18
PRUNE_DOWNGRADE_SCORE=0.32
FRESHNESS_HALF_LIFE_DAYS=270
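`FRESHNESS_HALF_LIFE_DAYS` is naturally applied as an exponential decay in which a claim's freshness weight halves every 270 days. The exact scoring formula in the codebase may differ; this is an illustrative sketch:

```python
# Illustrative freshness decay: weight halves every `half_life_days` days.
def freshness(age_days: float, half_life_days: float = 270.0) -> float:
    return 0.5 ** (age_days / half_life_days)

# freshness(0) == 1.0, freshness(270) == 0.5, freshness(540) == 0.25
```

Claims whose combined score falls below `PRUNE_DOWNGRADE_SCORE` or `PRUNE_DELETE_SCORE` would then become candidates for shadowing or deletion, respectively.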

11.2 Policies (editable at runtime)

  • configs/prune_policies.yaml
  • configs/graft_policies.yaml

12. Extension Points

  • Sources: add ingest/*.py with a fetch() generator
  • Extractors: swap to specialized RAG ontologies (e.g., pipelines, evaluators)
  • Embeddings: switch models transparently via embed/encoder.py
  • Storage: Milvus/Weaviate adapters; add GQL API
  • LLM: local vLLM or hosted; add citation‑strict prompt templates

12.1 Caching

Currently, the system does not cache LLM responses. A valuable future enhancement would be to integrate a caching layer such as GPTCache, which would reduce latency and cost for repeated queries and NLI scoring tasks. The caching mechanism could be added to the OllamaLLM client in src/nlp/llm.py.
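A minimal sketch of such a layer, as a thin wrapper that could sit in front of an OllamaLLM-style client. The `complete()` method name and client interface are assumptions about src/nlp/llm.py, not its actual API:

```python
# In-memory response cache keyed by a hash of the prompt; the wrapped client
# interface (complete()) is an assumption for illustration.
import hashlib

class CachingLLM:
    def __init__(self, client) -> None:
        self._client = client
        self._cache: dict[str, str] = {}

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in self._cache:  # only call the backend on a cache miss
            self._cache[key] = self._client.complete(prompt)
        return self._cache[key]
```

A production version would bound the cache (LRU or TTL eviction) and possibly persist it, which is what a library like GPTCache provides out of the box.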


13. Runbooks

13.1 High Delete Spike

  1. Pause prune (PRUNE_ENABLED=false)
  2. Inspect recent score changes; verify policy edits
  3. Restore from snapshot if needed; re‑run audit in dry‑run mode

13.2 Contradiction Storm

  1. Raise debate window to 90 days
  2. Switch NLI to higher‑capacity model
  3. Manually review top contradictions in Admin console

14. License & Attribution

  • Apache‑2.0 (templates, code)
  • Cite sources when surfacing claims; store publisher metadata

15. Roadmap

  • Admin UI (Next.js) for dashboards & candidate review
  • Temporal edges & time‑aware retrieval
  • Automatic excerpt selection for answer composer
  • Graph‑aware reranking (learning‑to‑rank)

16. Quick Start Checklist

  • cp .env.example .env
  • docker compose up -d
  • ./scripts/bootstrap.sh
  • pytest -q
  • uvicorn src.api.server:app --reload
  • Use the /ingest endpoint or the GUI to add data.
  • Use the /search endpoint or the GUI to query.

17. Performance Tuning & Latency Budget

17.1 Qdrant

  • Collection: cosine (or dot) distance; payload: {claim_id, topics[], trust, freshness} for re‑rank.
  • HNSW: m=32, ef_construct=128, query ef_search=128 (tune vs recall).
  • Quantization: enable scalar/product quantization for memory; verify recall on eval set before enabling in prod.
  • Filters: pre‑filter by visibility='active' and topic when available.
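The payload fields listed above (`trust`, `freshness`) can feed a simple blended re-rank after vector search. The weights below are illustrative assumptions, not tuned values:

```python
# Blend the raw vector score with trust/freshness from the Qdrant payload.
# Weights are illustrative; tune against the eval set.
def rerank(hits: list[dict], w_vec: float = 0.6, w_trust: float = 0.25,
           w_fresh: float = 0.15) -> list[dict]:
    def blended(h: dict) -> float:
        p = h.get("payload", {})
        return (w_vec * h.get("score", 0.0)
                + w_trust * p.get("trust", 0.0)
                + w_fresh * p.get("freshness", 0.0))
    return sorted(hits, key=blended, reverse=True)
```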

17.2 Neo4j

  • Prefer index‑backed lookups; avoid label scans on hot paths.
  • Use subqueries and LIMIT early; batch writes with apoc.periodic.iterate.
  • Keep :Claim degree bounded; archive very high‑degree hubs to summary nodes.
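The batching advice can be sketched as a plain helper that chunks rows before handing each batch to a write transaction (whether via `apoc.periodic.iterate` or driver-side loops). The batch size of 1000 is illustrative:

```python
# Chunk write payloads so each Neo4j transaction stays small and bounded.
from typing import Iterable, Iterator

def batches(rows: Iterable, size: int = 1000) -> Iterator[list]:
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```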

17.3 Latency Budget (target)

| Stage | p95 Target |
| --- | --- |
| Vector search (Qdrant) | ≤ 40 ms |
| Graph expansion (Neo4j) | ≤ 80 ms |
| Re‑rank + compose | ≤ 100 ms |
| End‑to‑end /search | ≤ 300 ms |

18. Threat Model & Security Testing

18.1 STRIDE Snapshot

  • Spoofing: JWT validation, clock skew tolerance ±60s, key rotation.
  • Tampering: signed container images; policy updates gated by admin role + audit log.
  • Repudiation: structured request IDs; append‑only policy change log in graph.
  • Information Disclosure: redact secrets; fetchers sandboxed (no filesystem writes); allowlist egress.
  • DoS: rate limits; circuit breakers on external calls; bounded fan‑out.
  • Elevation: RBAC; no shell‑exec in request path; least‑privilege DB users.

18.2 Security Tests

  • Fuzz /search (unicode, pathological queries).
  • SSRF‑hardening in fetchers: enforce scheme allowlist https, max size, content-type checks.
  • Prompt‑injection hardening in answer composer (see §23).
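The fetcher-side SSRF hardening above (scheme allowlist, content-type check, size cap) can be sketched as a single gate function. The allowed content types and the 10 MiB cap are illustrative assumptions:

```python
# Gate a fetch on scheme, content type, and size. Allowlists and the size
# cap are illustrative assumptions, not the project's actual policy.
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"https"}
ALLOWED_CONTENT_TYPES = {"text/html", "application/pdf", "text/plain"}
MAX_BYTES = 10 * 1024 * 1024  # 10 MiB cap (assumed)

def fetch_allowed(url: str, content_type: str, size_bytes: int) -> bool:
    if urlparse(url).scheme not in ALLOWED_SCHEMES:
        return False
    # Strip parameters like "; charset=utf-8" before matching the media type.
    if content_type.split(";")[0].strip() not in ALLOWED_CONTENT_TYPES:
        return False
    return size_bytes <= MAX_BYTES
```

In practice this check belongs before the body is read (via the response headers), so oversized or disallowed content is never downloaded.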

19. Data Governance & Licensing

  • Store Source.license (e.g., CC‑BY‑4.0, All‑Rights‑Reserved).
  • Respect robots.txt and usage terms; keep publisher metadata.
  • Downstream responses must return citations and observe license terms; optionally suppress excerpt text for non‑redistributable sources.

20. Migrations & Versioning (Graph)

20.1 Migration Files

/migrations
  ├─ 0001_init.cypher
  ├─ 0002_add_claim_visibility.cypher
  └─ 0003_index_score.cypher

  • Each file is idempotent; include IF NOT EXISTS guards.
  • Store applied migrations in (:_Migration {id, checksum, applied_at}).
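The migration ledger can be sketched in two small pieces: a checksum over each file's text (stored on the `:_Migration` node) and a function that diffs discovered migrations against the applied set. The internals of `scripts/migrate` are assumptions here:

```python
# Checksum + pending-set logic for a migration ledger; how scripts/migrate
# discovers files and writes (:_Migration) nodes is assumed, not confirmed.
import hashlib

def checksum(cypher_text: str) -> str:
    """Stable digest stored alongside the migration id to detect edits."""
    return hashlib.sha256(cypher_text.encode()).hexdigest()

def pending(migrations: dict[str, str], applied: set[str]) -> list[str]:
    """Return migration ids (e.g. '0001_init') not yet applied, in order."""
    return sorted(mid for mid in migrations if mid not in applied)
```

Comparing the stored checksum against the current file's checksum also catches the case where an already-applied migration file was edited after the fact.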

20.2 CLI

python -m scripts.migrate up    # apply pending
python -m scripts.migrate down  # rollback last (when reversible)

21. Disaster Recovery & Backups

  • Neo4j: nightly dump; weekly offsite; verify restore monthly.
  • Qdrant: snapshot API nightly; co‑version with the Neo4j dump.
  • RPO/RTO: RPO ≤ 24h, RTO ≤ 2h (tune per tier).

22. Observability Dashboards & Alerts

  • Grafana panels: ingest lag, audit throughput, merge/prune counts, shadow ratio, quality (MRR@10, nDCG@10).

  • Alerts:

    • ingest_lag_seconds > 86400 for 30m.
    • contradictions_rate spike > 3× baseline.
    • delete_spike > 5% of claims/day.

23. Answer Composer & Prompt Guardrails

  • Always include only retrieved claims; no unsupported assertions.
  • Template enforces: “Cite claim_ids and source_urls in every paragraph.”
  • Refuse answers when supporting_claims < min_k.
  • Strip/escape user‑provided instructions in content (no tool directives leaking into prompt).
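The refusal guardrail above can be sketched as a composer that answers only when at least `min_k` supporting claims were retrieved and attaches their citations. The `min_k` default and claim shape are assumptions:

```python
# Refuse when support is thin; otherwise compose from retrieved claims only,
# carrying claim_ids and source_urls as citations. min_k is an assumed default.
def compose_answer(claims: list[dict], min_k: int = 2) -> dict:
    if len(claims) < min_k:
        return {
            "answer": None,
            "refusal": f"Insufficient support: {len(claims)} < {min_k} claims",
        }
    citations = [{"claim_id": c["claim_id"], "source_url": c["source_url"]}
                 for c in claims]
    return {"answer": " ".join(c["text"] for c in claims),
            "citations": citations}
```

A real composer would pass the claims to an LLM with a citation-strict template rather than concatenating them, but the refusal check and citation plumbing stay the same.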

24. Demo & Seed Data

  • Seed sources: 5 canonical RAG blog posts, 10 arXiv abstracts, 3 library docs pages.
  • Script: scripts/bootstrap.sh ingests seed and runs one Audit→Graft→Prune cycle.
  • Sample queries: "hybrid retrieval", "reranking vs hybrid", "graph‑aware retrieval benefits".

25. FAQ & Glossary

  • Graft: merging near‑duplicate or entailing claims into a canonical node with lineage preserved.
  • Shadow: visible in lineage but excluded from top‑k retrieval.
  • Debate window: time before resolving contradictions to avoid premature pruning.
  • MRR / nDCG: retrieval metrics used in our eval harness.
