LJPearson176/Kg_rag

KG‑RAG: Self‑Regulating Knowledge Graph for the RAG Domain

Status: Core data pipelines and a multi-agent system are partially implemented. The immediate focus is on integrating these components and formalizing the agent workflows.

Audience: Developers, MLEs, platform/SRE, security, product

Goal: Maintain a living knowledge graph about Retrieval‑Augmented Generation (RAG) that ingests → audits → grafts → prunes to stay current, trustworthy, and useful for downstream RAG systems. This README.md describes the target architecture, with notes indicating the current implementation status.


0. Executive Summary

KG-RAG is a project to build a self-updating knowledge graph for the RAG domain. The project has evolved to center around a Multi-Agent System responsible for query understanding, retrieval, and governance. The long-term vision is a system that:

  • Continuously ingests RAG‑domain sources (papers, blogs, release notes).
  • Extracts claims/entities/relations with provenance into a Neo4j knowledge graph.
  • Embeds content for vector-based retrieval from a Qdrant vector store.
  • Uses a multi-agent system to govern the knowledge base, orchestrate complex queries, and ensure data quality.
  • Exposes a hybrid retriever (vector + graph) powered by specialized agents for advanced RAG applications.

Core design principles: idempotency, provenance, policy‑driven change, observability, safety defaults, modular agent-based architecture.

Current Implementation Status (as of Aug 14, 2025):

The project currently consists of several loosely connected components. The vector ingestion pipeline, the user-facing API/GUI, and the query-side multi-agent system are functional, but the knowledge graph population pipeline is still in early development, and the components are not yet integrated with one another.

  • Vector Ingestion Pipeline (Functional): A LlamaIndex-based pipeline (ingest/) can fetch documents from sources, chunk them, generate embeddings, and store them in a Qdrant vector store.
  • API & GUI (Partially Implemented): A FastAPI server exposes functional /search and /ingest endpoints. A simple tkinter GUI application (gui/app.py) provides a user interface for the search functionality. The /admin endpoints are scaffolds.
  • Multi-Agent System (Implemented): The query-side multi-agent system follows a Plan-and-Execute model. A top-level ComplexQueryAgent orchestrates the full workflow: a PlanningAgent creates a plan, a RoutingAgent selects tools, an OrchestratorAgent executes the plan, and Synthesis and Validation agents produce the final answer. This system is runnable from the command line.
  • Knowledge Graph (KG) Module (Implemented, Not Integrated): Robust, tested modules for connecting to Neo4j and performing idempotent upserts for graph nodes (kg/upsert.py). It is not yet used by the ingestion or agent systems.
  • Natural Language Processing (NLP) Module (Scaffolded): Basic, unused modules for sentence splitting (nlp/claims.py) and entity extraction (nlp/entities.py).

1. Architecture Overview

1.1 High‑Level Dataflow

The diagram below illustrates the three core workflows. The "Vector Ingestion" pipeline is functional, the "Multi-Agent Query" system is implemented but not yet wired into the API, and the "KG Population" pipeline remains aspirational.

flowchart LR
    %% Styles
    classDef implemented fill:#e7fff3,stroke:#10b981,stroke-width:1px,color:#064e3b;
    classDef partial fill:#fffbe6,stroke:#f59e0b,stroke-width:1px,color:#7c2d12;
    classDef scaffold fill:#f3f4f6,stroke:#6b7280,stroke-width:1px,color:#1f2937,stroke-dasharray: 5 5;
    classDef sources fill:#eaf2ff,stroke:#4677f5,stroke-width:1px,color:#0b2b6a;

    subgraph "Sources"
        direction LR
        A1["RSS/Blogs"]:::sources
        A2["arXiv"]:::sources
    end

    subgraph "Pipeline 1: Vector Ingestion (Functional)"
        direction TB
        B1["ingest.fetch"] --> B2["ingest.normalize"] --> B3["ingest.builder<br>(LlamaIndex)"]
        B3 --> C1[(Qdrant<br>Vectors)]
    end

    subgraph "Pipeline 2: KG Population (Aspirational)"
        direction TB
        D1["nlp.claims<br>(spaCy)"] -.-> D2["kg.upsert"]
        D2 -.-> C2[(Neo4j<br>Graph)]
    end

    subgraph "Pipeline 3: Multi-Agent Query (Implemented)"
        direction TB
        E1["User Query"] --> E2["ComplexQueryAgent"]
        E2 --> E3["Orchestrator"]
        E3 --> E4{"Planning/Routing"}
        E3 --> E5["Tool Execution"]
        E5 --> C1
        E5 --> C2
        E2 --> E6["Synthesis"]
        E2 --> E7["Validation"]
    end

    %% Apply classes
    class A1,A2 sources;
    class B1,B2,B3,C1 implemented;
    class D1,D2,C2 scaffold;
    class E1,E2,E3,E4,E5,E6,E7 implemented;

1.1.1 Triggers & Scheduling

  • Who calls ingest? Cron/agent (batch) or an operator via /admin endpoints.
  • Granularity: per-source (feed, site, arXiv query).
  • Config knobs: INGEST_SOURCES, INGEST_RATE_LIMIT_QPS, MAX_FETCH_BYTES, RETRY_BACKOFF.
  • Metrics: ingest_jobs_total, ingest_sources_active, ingest_errors_total.

1.1.2 Fetchers (ingest/fetch.py)

Status: Implemented. The fetcher lives in ingest/fetch.py and supports RSS and arXiv sources; the bullets below describe the full target behavior, which the current implementation only partially covers.

  • Input: URL or query spec (RSS feed URL, arXiv query, static docs list).
  • Process: HTTP GET with timeouts, size caps, redirects ≤ N, user-agent override; content-type allowlist (text/html, application/xml, application/json, text/markdown).
  • Output: (url, fetched_at, content_bytes, content_type, etag/last_modified?).
  • Failures & retries: network errors → exponential backoff (jitter), 4xx (except 429) are terminal, 5xx retried.
  • Side-effects: dedupe by URL + ETag/Last-Modified (skip unchanged).
  • Metrics: fetch_ok_total, fetch_4xx_total, fetch_5xx_total, p95 latency.
  • Config: HTTP_TIMEOUT_SEC, MAX_REDIRECTS, ALLOWED_SCHEMES=["https"].
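The retry rules above can be captured in two small helpers. This is an illustrative sketch, not the actual ingest/fetch.py API:

```python
import random
from typing import Optional

def is_retryable(status_code: Optional[int]) -> bool:
    """None signals a network error (no HTTP response received)."""
    if status_code is None or status_code == 429:
        return True                    # network errors and 429 are retried
    if 400 <= status_code < 500:
        return False                   # other 4xx are terminal
    return status_code >= 500          # 5xx are retried

def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: uniform over [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

The `base` and `cap` values are placeholders; in the target design they would come from `RETRY_BACKOFF` config.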

1.1.3 Normalize & Chunk (ingest/normalize.py and ingest/builder.py)

Status: Implemented. Normalization using trafilatura is in ingest/normalize.py. Chunking is handled by LlamaIndex's SentenceSplitter in ingest/builder.py, which is the primary pipeline.

  • Input: raw HTML/JSON/MD.
  • Process: HTML→text (readability rules, boilerplate removal, code-block preservation); Markdown normalization; sentence segmentation; chunking ~800–1200 tokens (sentence-aware).
  • Output: NormalizedDoc {source_url, source_hash, text, chunks[]} with source_hash = sha256(canonicalized_text).
  • Failures: malformed docs → skip with reason; record dead-letter.
  • Metrics: normalize_docs_total, normalize_skipped_total, chunk length distribution.
  • Config: MAX_CHUNK_TOKENS, CHUNK_OVERLAP_TOKENS.
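For intuition, a minimal sentence-aware chunker might look like the following. The actual pipeline uses LlamaIndex's SentenceSplitter; here whitespace word counts stand in for real token counts:

```python
from typing import List

def chunk_sentences(sentences: List[str], max_tokens: int = 1000,
                    overlap_sentences: int = 1) -> List[str]:
    """Group sentences into chunks of at most max_tokens (approximated by
    word count), carrying trailing sentences forward as overlap."""
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences else []
            count = sum(len(s.split()) for s in current)
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```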

1.1.4 Extraction (nlp/claims.py, nlp/entities.py)

Status: Scaffolded, Not Integrated. The nlp module contains basic implementations for claim and entity extraction. However, it is not currently used by the functional LlamaIndex ingestion pipeline. The descriptions below outline the target functionality for the future integrated KG pipeline.

  • Input: chunks[], source_url, source_hash.
  • Process: claim mining (sentence-level candidates, heuristic/LLM filter); entity linking → Concept(name), Author(handle/name); tuples (claim_text, concepts[], author?, source_url, source_hash, ingested_at).
  • Output: ExtractedClaim[].
  • IDs: claim_id = sha256(normalize(claim_text) + source_hash) (stable).
  • Metrics: extract_claims_total, extract_concepts_total, extract_yield_per_doc.
  • Config: MIN_CLAIM_LEN, MAX_CLAIM_LEN, ENTITY_CONFIDENCE_MIN.
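The stable-ID scheme is straightforward to sketch. The project's actual normalize() is not shown in this README, so NFKC + lowercase + whitespace collapse below is an assumed placeholder:

```python
import hashlib
import re
import unicodedata

def normalize(text: str) -> str:
    """Assumed normalization: unicode NFKC, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def claim_id(claim_text: str, source_hash: str) -> str:
    """claim_id = sha256(normalize(claim_text) + source_hash), hex-encoded."""
    payload = (normalize(claim_text) + source_hash).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Because normalization absorbs incidental whitespace and casing differences, re-ingesting an unchanged sentence from the same source yields the same ID, which is what makes the downstream MERGE idempotent.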

1.1.5 KG Upsert (kg/upsert.py, kg/schema.py)

Status: Implemented, Not Integrated. The kg module contains robust, tested functions for idempotently upserting data into Neo4j. However, this module is not currently called by any ingestion pipeline.

  • Input: ExtractedClaim[].
  • Process (transactional): MERGE Source, Claim, Concept(s), Author; MERGE relations CITED_IN, ABOUT, WRITTEN_BY; set defaults visibility='active', score=0.0.
  • Output: graph writes (idempotent).
  • Constraints: unique on Claim.claim_id, Concept.name, Author.handle, Source.url.
  • Failures: constraint conflicts handled by MERGE; transient errors → retry tx.
  • Metrics: kg_upserts_total, kg_relations_created_total, kg_tx_retries_total.
  • Config: NEO4J_URI, NEO4J_DB, WRITE_BATCH_SIZE.
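A hypothetical sketch of how a single claim's idempotent upsert might be expressed; the real kg/upsert.py API may differ, and the dataclass and parameter builder here are illustrative:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ExtractedClaim:
    claim_id: str
    text: str
    source_url: str
    source_hash: str
    ingested_at: str
    concepts: List[str] = field(default_factory=list)
    author: Optional[str] = None

# Parametrized Cypher: MERGE makes every statement idempotent, and defaults
# are only set ON CREATE so re-ingestion does not clobber later updates.
UPSERT_CLAIM_CYPHER = """
MERGE (s:Source {url: $source_url})
MERGE (c:Claim {claim_id: $claim_id})
  ON CREATE SET c.text = $text, c.source_hash = $source_hash,
                c.ingested_at = $ingested_at,
                c.visibility = 'active', c.score = 0.0
MERGE (c)-[:CITED_IN]->(s)
FOREACH (name IN $concepts |
  MERGE (k:Concept {name: name})
  MERGE (c)-[:ABOUT]->(k))
"""

def upsert_params(claim: ExtractedClaim) -> Dict[str, object]:
    return {
        "source_url": claim.source_url,
        "claim_id": claim.claim_id,
        "text": claim.text,
        "source_hash": claim.source_hash,
        "ingested_at": claim.ingested_at,
        "concepts": claim.concepts,
    }
```

With the official neo4j driver this would run as something like `session.execute_write(lambda tx: tx.run(UPSERT_CLAIM_CYPHER, **upsert_params(claim)))`.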

1.1.6 Embedding & Vector Upsert (ingest/builder.py)

Status: Implemented. This entire process is handled by the LlamaIndex pipeline within ingest/builder.py:KnowledgeGraphBuilder. The standalone embed module is not directly used by this pipeline.

  • Input: Text nodes (chunks) from the SentenceSplitter.
  • Process:
    1. The HuggingFaceEmbedding model (sentence-transformers) is used to generate a vector for each text node.
    2. The node (with its text and metadata) and its corresponding vector are upserted into the Qdrant collection specified in the configuration.
  • Output: An indexed VectorStoreIndex in Qdrant.
  • Note: The claim_id assigned to the node serves as the point ID in Qdrant.
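One practical wrinkle: Qdrant point IDs must be unsigned integers or UUIDs, so a 64-character sha256 digest cannot be used verbatim. One common workaround (an assumption here; the builder's actual mapping is not shown) is to derive a UUID from the first 128 bits of the claim_id:

```python
import uuid

def point_id_from_claim_id(claim_id_hex: str) -> str:
    """Derive a deterministic, Qdrant-compatible UUID from a sha256 hex digest
    by taking its first 128 bits. Collisions are negligible at this width."""
    return str(uuid.UUID(hex=claim_id_hex[:32]))
```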

1.1.7 Governance Agents (Aspirational) (agents/*)

Status: Aspirational. The agent skeletons in src/agents (auditor.py, curator.py) represent a planned system for automated graph maintenance, distinct from the implemented query-focused multi-agent system in src/multi_agent_system. The description below is the target design for this future governance loop.

  • Trigger: Scheduled (hourly audit, daily prune) or manual via /admin.
  • Audit: The Auditor agent would sample candidate claim pairs from the KG and use an NLI service to score them for entailment or contradiction.
  • Graft & Prune: Based on audit results and configured policies, the Curator agent would perform graft (merge) and prune (shadow/delete) operations.
  • Outputs: Updated KG; counters for merges/shadows/deletes.
  • Config: Policies in configs/*_policies.yaml.

1.1.8 Serving Path (Multi-Agent System) (src/multi_agent_system/)

Status: Implemented (not yet connected to the API server). The core RAG and query-answering logic is handled by a multi-agent system that uses a manual, sequential orchestration pattern.

  • Input: User query (string).
  • Orchestration Flow:
    1. Planning: A PlanningAgent decomposes the input query into a series of steps.
    2. Routing: A RoutingAgent determines the appropriate retrieval agent for each step.
    3. Retrieval: The Orchestrator dynamically loads and executes the chosen retrieval agent (e.g., VectorDatabaseRetrievalAgent, KnowledgeGraphRetrievalAgent, WebSearchRetrievalAgent) to gather context.
    4. Synthesis: A SynthesisAgent combines the retrieved context into a draft answer.
    5. Validation: A ValidationAgent checks the draft answer for consistency against the retrieved context.
  • Output: A final answer string, appended with a validation status.
  • Metrics: retrieval_tasks_total, synthesis_latency_ms, validation_passes_total.
  • Config: Agent configurations are loaded from AGENTS.md files within each module.
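The five-step flow above can be sketched as a plain sequential function with stub agents injected as callables. The real agents in src/multi_agent_system are classes, so the names and signatures here are illustrative only:

```python
from typing import Callable, Dict, List

def run_complex_query(query: str,
                      plan: Callable[[str], List[str]],
                      route: Callable[[str], str],
                      tools: Dict[str, Callable[[str], str]],
                      synthesize: Callable[[str, List[str]], str],
                      validate: Callable[[str, List[str]], bool]) -> str:
    steps = plan(query)                    # 1. PlanningAgent decomposes the query
    context: List[str] = []
    for step in steps:                     # 2-3. RoutingAgent picks a tool, which is executed
        tool_name = route(step)
        context.append(tools[tool_name](step))
    draft = synthesize(query, context)     # 4. SynthesisAgent drafts an answer
    ok = validate(draft, context)          # 5. ValidationAgent checks consistency
    return f"{draft}\n[validation: {'passed' if ok else 'failed'}]"
```

Because every dependency is injected, the whole pipeline can be exercised with lambdas in a unit test, without any LLM or database.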

1.1.9 IDs, Provenance & Idempotency

  • source_hash = sha256(canonical_text); claim_id = sha256(normalize(sentence) + source_hash).
  • Every Claim must have (:Claim)-[:CITED_IN]->(:Source); lineage via DERIVED_FROM.
  • Re-ingestion is a no-op unless text changed (MERGE semantics; selective re-embed).

1.1.10 Backpressure & Failure Handling

  • Cap in‑flight fetches/embeddings; queue retries; degrade governance if backlog grows.
  • Dead‑letter store for failed sources; operator review.
  • Circuit breakers: pause audit/graft/prune on anomalous merge/delete spikes.

1.1.11 Observability (per stage)

  • Structured logs with request_id, source_url, claim_id, stage.
  • Metrics: counters/histograms listed above; Prometheus export.
  • Tracing: spans for fetch, normalize, extract, upsert, embed, qdrant.search, neo4j.query, rank, compose.

1.1.12 Performance Targets (p95)

  • Fetch+normalize ≤ 200 ms/doc (excl. network)
  • Embed ≤ 15 ms/claim (CPU) or ≤ 3 ms (GPU)
  • Qdrant top‑k ≤ 40 ms; KG expansion ≤ 80 ms
  • End‑to‑end /search ≤ 300 ms

1.2 Component Responsibility Matrix

Note: This table reflects the current implementation status of the core components.

| Component | Status | Language | Key Deps | Responsibilities | Inputs | Outputs |
|---|---|---|---|---|---|---|
| Fetchers | Implemented | Python | httpx, feedparser | Rate‑limited fetching (RSS, arXiv) | Source URLs | Raw HTML/JSON |
| Normalizer | Implemented | Python | trafilatura | Clean and de‑noise HTML | HTML | Clean text |
| Ingestion Pipeline | Implemented | Python | llama-index | Load, chunk, embed, and store documents into Qdrant | Clean text files | Vectors in Qdrant |
| Extractors | Scaffolded | Python | spaCy | Claim & entity extraction | Chunks | Tuples for KG |
| KG Upsert | Implemented | Python | neo4j | Idempotent node/edge upserts w/ provenance | Tuples | Nodes/Edges |
| ComplexQueryAgent | Implemented | Python | langchain-core | End-to-end query processing (plan, execute, synthesize) | User Query | Final Answer |
| OrchestratorAgent | Implemented | Python | langchain-core | Executes a plan of tool calls to gather information | Plan | Execution Results |
| Planning/Routing | Implemented | Python | langchain-core | Generate plan; select tools for execution | Query | Plan / Tool Name |
| Synthesis/Validation | Implemented | Python | langchain-core | Synthesize and validate the final answer | Retrieved Context | Final Answer |
| API (Search/Ingest) | Implemented | Python | FastAPI | REST surface for /search and /ingest | Requests | JSON responses |
| API (Admin) | Scaffolded | Python | FastAPI | Placeholder REST surface for /admin | Requests | JSON responses |
| RAG Retriever | Implemented | Python | llama-index | Retrieve claims from vector store for the API | Query | Ranked Claims |
| RAG Answerer | Implemented | Python | langchain-core | Compose a final answer from claims | Claims | Answer string |

1.3 Multi-Agent System Architecture

The core of this project's query processing is a Multi-Agent System, located in src/multi_agent_system. It is designed as a modular, stateful pipeline of specialized agents that work together to answer a user's query, following a Plan-and-Execute model.

The system is orchestrated by the ComplexQueryAgent (complex_query/agent.py), which serves as the main entry point. It manages the end-to-end workflow, ensuring a clear separation of concerns between the different phases of query processing.

The agent workflow is as follows:

  1. ComplexQueryAgent: Receives the user query and initiates the process.
  2. OrchestratorAgent: Called by the ComplexQueryAgent, this agent is responsible for the "execute" portion of the strategy. It first calls the PlanningAgent to generate a multi-step plan. Then, for each step, it uses the RoutingAgent to select the appropriate tool and executes it to gather context. It returns the collected results to the ComplexQueryAgent.
  3. SynthesisAgent: The ComplexQueryAgent then passes the collected context to the SynthesisAgent, which aggregates the information and composes a draft answer.
  4. ValidationAgent: Finally, the ComplexQueryAgent sends the draft answer and the original context to the ValidationAgent. This agent performs a final check to ensure the answer is consistent and supported by the retrieved context.
  5. ComplexQueryAgent: The top-level agent formats and returns the final, validated response to the user.

Each agent is configured via its own AGENTS.md file, which is loaded at runtime by the loader.py module. This allows for decentralized and modular configuration of prompts, models, and other parameters for each agent.
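The exact AGENTS.md format is not documented in this README, so the sketch below assumes a YAML-style front-matter block with flat key: value pairs; the real loader.py may parse something richer:

```python
def parse_front_matter(markdown: str) -> dict:
    """Extract flat key: value pairs from a leading '---' front-matter block.
    Returns {} if the document has no front matter."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    config = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break                       # end of front matter
        if ":" in line:
            key, _, value = line.partition(":")
            config[key.strip()] = value.strip()
    return config
```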

1.4 Module Component Map

Note: This diagram shows the primary modules and their relationships.

flowchart TB
    %% Define styles
    classDef implemented fill:#e7fff3,stroke:#10b981;
    classDef partial fill:#fffbe6,stroke:#f59e0b;
    classDef scaffold fill:#f3f4f6,stroke:#6b7280,stroke-dasharray: 5 5;

    subgraph "Core Services"
        C_CFG[core.config]
        C_IDS[core.ids]
        C_LOG[core.logging]
    end

    subgraph "Data Stores"
        V_QDR[(Qdrant)]
        K_NEO[(Neo4j)]
    end

    subgraph "Vector Pipeline (Functional)"
        I_FET[ingest.fetch] --> I_NORM[ingest.normalize]
        I_NORM --> I_BUILD["ingest.builder<br>(LlamaIndex)"]
        I_BUILD --> V_QDR
    end

    subgraph "KG Pipeline (Aspirational)"
        N_CLM[nlp.claims]:::scaffold -.-> K_UPS[kg.upsert]
        K_UPS -.-> K_NEO
    end

    subgraph "Multi-Agent System (Implemented)"
        MAS_CQ[complex_query.agent]:::implemented
        MAS_ORCH[orchestrator.agent]:::implemented
        MAS_PLAN[planning.agent]:::implemented
        MAS_ROUTE[routing.agent]:::implemented
        MAS_SYNTH[synthesis.agent]:::implemented
        MAS_VAL[validation.agent]:::implemented

        MAS_CQ --> MAS_ORCH
        MAS_CQ --> MAS_SYNTH
        MAS_CQ --> MAS_VAL
        MAS_ORCH --> MAS_PLAN
        MAS_ORCH --> MAS_ROUTE
    end

    subgraph "RAG (Implemented)"
        RAG_RET[rag.retriever] --> RAG_ANS[rag.answerer]
        RAG_RET --> V_QDR
    end

    subgraph "API (Partially Implemented)"
        A_SRV[api.server]:::partial
        A_SRV --> RAG_ANS
        A_SRV --> I_BUILD
    end

    %% Apply classes to implemented modules
    class C_CFG,C_IDS,C_LOG,I_FET,I_NORM,I_BUILD,V_QDR,K_NEO,K_UPS,RAG_RET,RAG_ANS implemented;

2. File Hierarchy (reference)

kg_rag/
├── AGENTS.md                        # Agent contract (acceptance gates)
├── README.md                        # THIS DOCUMENT
├── docker-compose.yml               # Neo4j + Qdrant
├── pyproject.toml                   # Ruff/mypy/pytest config
├── .env.example
├── .pre-commit-config.yaml
├── .github/
│   └── workflows/agents-validate.yml
├── configs/
│   ├── schema.yaml                  # Labels/rels/props/indexes
│   ├── prune_policies.yaml          # Thresholds & decay
│   └── graft_policies.yaml          # NLI & canonicalization
├── scripts/
│   ├── bootstrap.sh                 # first‑run: env, DBs
│   ├── run_audit_cycle.sh           # auditor→graft→prune
│   └── backfill_embeddings.py
├── src/
│   ├── core/
│   │   ├── config.py
│   │   ├── logging.py
│   │   └── ids.py
│   ├── ingest/
│   │   ├── fetch.py
│   │   ├── normalize.py
│   │   └── builder.py
│   ├── nlp/
│   │   ├── claims.py
│   │   ├── entities.py
│   │   └── contradictions.py
│   ├── embed/
│   │   ├── encoder.py
│   │   └── qdrant.py
│   ├── kg/
│   │   ├── neo.py
│   │   ├── schema.py
│   │   ├── upsert.py
│   │   └── queries/
│   │   │   └── *.cypher
│   ├── rag/
│   │   ├── retriever.py
│   │   └── answerer.py
│   ├── multi_agent_system/
│   │   ├── orchestrator/
│   │   ├── react/
│   │   ├── synthesis/
│   │   └── ...
│   ├── agents/
│   │   ├── auditor.py
│   │   ├── curator.py
│   │   └── scheduler.py
│   └── api/
│       └── server.py
└── tests/
    └── test_*.py

3. Sequence Diagrams

3.1 Vector Ingestion Pipeline (Implemented)

This diagram shows the currently implemented, LlamaIndex-based ingestion pipeline that populates the Qdrant vector store.

sequenceDiagram
    autonumber
    participant User as User/Script
    participant Fetch as ingest.fetch
    participant Norm as ingest.normalize
    participant Builder as ingest.builder.KnowledgeGraphBuilder

    User->>Fetch: fetch_arxiv_papers(query)
    Fetch-->>User: list[Document]
    User->>Norm: normalize_documents(docs)
    Norm-->>User: list[Document]
    User->>Builder: build_from_directory(path)

    Builder->>Builder: SimpleDirectoryReader.load_data()
    Builder->>Builder: SentenceSplitter.get_nodes()
    Builder->>Builder: HuggingFaceEmbedding.get_text_embedding()
    Builder->>Qdrant: Upsert nodes and vectors
    Qdrant-->>Builder: Index ready
    Builder-->>User: VectorStoreIndex

3.1a Knowledge Graph Pipeline (Aspirational)

This diagram illustrates the target workflow for populating the Neo4j knowledge graph, which is not yet implemented.

sequenceDiagram
    autonumber
    participant Ingest as Ingestion Pipeline
    participant NLP as NLP Module
    participant KG as KG Module (upsert.py)
    participant Neo4j as Neo4j Database

    Ingest->>NLP: Text chunks
    NLP->>NLP: extract_claims()
    NLP->>NLP: extract_entities()
    NLP-->>KG: (Claim, Entities, Source) tuples
    KG->>Neo4j: MERGE (s:Source), (c:Claim), ...
    Neo4j-->>KG: Upsert complete

3.2 Audit → Graft → Prune Cycle

Status: Aspirational. This describes the target workflow for the governance agents, which are currently unimplemented.

sequenceDiagram
  autonumber
  participant CR as Cron/Scheduler
  participant AU as Auditor
  participant NLI as NLI Service
  participant KG as Neo4j

  CR->>AU: Start audit cycle
  AU->>KG: Sample candidate pairs (near‑duplicate, conflicting)
  AU->>NLI: Score entailment/contradiction
  NLI-->>AU: {entails, contradicts}
  AU->>KG: Propose actions (graft, keep_both, shadow, delete)
  AU->>KG: Execute grafts (merge canonical, DERIVED_FROM edges)
  AU->>KG: Prune per policy (score thresholds)
  KG-->>CR: Summary (merged, shadowed, deleted)

3.3 API Query Path (Implemented)

Status: Implemented. This diagram shows the implemented query path exposed by the /search endpoint.

sequenceDiagram
    autonumber
    participant User as User/GUI
    participant API as FastAPI Server
    participant Retriever as rag.retriever.LlamaIndexRetriever
    participant Answerer as rag.answerer.Answerer
    participant Qdrant as Qdrant
    participant LLM as OllamaLLM

    User->>API: GET /search?q=...
    API->>Retriever: retrieve(q)
    Retriever->>Qdrant: vector_search(query_vector)
    Qdrant-->>Retriever: Search results
    Retriever-->>API: list[Claim]
    API->>Answerer: compose_answer(q, claims)
    Answerer->>LLM: generate_response(prompt)
    LLM-->>Answerer: Answer text
    Answerer-->>API: Final answer
    API-->>User: JSON {answer, hits}

3.4 Agent Governance (Acceptance Gates)

sequenceDiagram
  autonumber
  participant Dev as Dev/Codex
  participant GH as GitHub Actions
  participant Repo as AGENTS.md

  Dev->>Repo: Propose diff
  Dev->>GH: Open PR
  GH->>Repo: Validate front‑matter (schema)
  GH->>GH: Run fmt→lint→types→tests
  GH-->>Dev: Pass/Fail report
  Dev->>GH: Fix & re‑push until green
  GH-->>Dev: Merge allowed

3.5 Bootstrap & Migration

Status: Partially Implemented. The bootstrap script handles environment setup, but database initialization is currently a manual step.

sequenceDiagram
    autonumber
    participant User as User
    participant Script as scripts/bootstrap.sh
    participant Python as Python script

    User->>Script: ./scripts/bootstrap.sh
    Script->>Script: Copy .env.example to .env if needed
    Script->>Python: "from src.core.config import settings"
    Python-->>Script: Load and print settings
    Script-->>User: "Bootstrap complete."

3.6 Answer Composer + Guardrails

sequenceDiagram
  autonumber
  participant RET as Retriever
  participant ANS as Answerer
  participant LLM as LLM

  RET-->>ANS: contexts (claims+citations)
  ANS->>ANS: build prompt (cite claim_ids/source_urls)
  ANS->>ANS: enforce min supporting claims
  ANS->>LLM: call(model, prompt)
  LLM-->>ANS: draft answer
  ANS->>ANS: strip user tool directives / sanitize
  ANS-->>RET: answer + citations

3.7 Policy Update (Dry‑Run & Apply)

sequenceDiagram
  autonumber
  participant Op as Operator
  participant API as /admin/policies
  participant Neo as KG

  Op->>API: PUT policies?dry=true
  API->>Neo: validate against schema
  Neo-->>API: diff + impact preview
  Op->>API: PUT policies (apply)
  API->>Neo: persist new Policy nodes
  Neo-->>API: version stamped

4. Data Model & Indexing

4.1 Node Types (typed schema)

| Label | Required Props | Optional Props | Types / Domains | Defaults |
|---|---|---|---|---|
| Claim | claim_id, text, source_hash, ingested_at | updated_at, score, visibility, topics[], nli | claim_id: string (stable hash), text: string, score: float∈[0,1], visibility: enum{active,shadow,deleted} | visibility=active, score=0.0 |
| Concept | name | aliases[] | name: string, aliases: string[] | |
| Author | handle | name | handle: string, name: string | |
| Source | url, hash, retrieved_at | publisher, type | url: uri, hash: sha256, type: enum{arxiv,blog,doc,release} | |
| Policy (meta) | key, value | updated_at | free-form policy overrides (stored in graph for audit) | |

Invariant: No orphan Claim nodes: every Claim must have (:Claim)-[:CITED_IN]->(:Source).

4.2 Relationship Types & Hints

| Relation | From → To | Cardinality | Notes |
|---|---|---|---|
| ABOUT | Claim → Concept | many→many | Concepts act as topical anchors |
| WRITTEN_BY | Claim → Author | many→1 | Normalize authors to handle |
| CITED_IN | Claim → Source | 1→1..n | Provenance; may duplicate across claims |
| SUPPORTS / CONTRADICTS | Claim ↔ Claim | many↔many | Derived from NLI + heuristics |
| DERIVED_FROM | Claim → (Claim, Source) | many→many | |

Graph invariants

  • SUPPORTS and CONTRADICTS have undirected semantics; store each as a single directed edge with bidirectional=true, or maintain symmetric pairs for simpler querying.
  • CONTRADICTS edges must not connect a claim to itself.
  • visibility='deleted' nodes are tombstones kept ≤ 7 days (see retention).

4.3 Indexes & Constraints (Neo4j 5.x)

CREATE CONSTRAINT claim_id IF NOT EXISTS FOR (c:Claim) REQUIRE c.claim_id IS UNIQUE;
CREATE INDEX claim_score IF NOT EXISTS FOR (c:Claim) ON (c.score);
CREATE INDEX claim_visibility IF NOT EXISTS FOR (c:Claim) ON (c.visibility);
CREATE CONSTRAINT concept_name IF NOT EXISTS FOR (k:Concept) REQUIRE k.name IS UNIQUE;
CREATE INDEX source_url IF NOT EXISTS FOR (s:Source) ON (s.url);

4.4 Retention & Soft‑Delete

  • Shadowing: visibility='shadow' excludes from top‑k but keeps for lineage.
  • Tombstone: visibility='deleted' with deleted_at; weekly job purges tombstones older than retention_days.

4.5 Validation Queries (quality gates)

// Orphan claims (should be 0)
MATCH (c:Claim) WHERE NOT (c)-[:CITED_IN]->(:Source) RETURN count(c);

// Self contradictions (should be 0)
MATCH (c:Claim)-[:CONTRADICTS]-(c) RETURN count(c);

// Duplicate claims by text+source (should be 0)
MATCH (c1:Claim),(c2:Claim)
WHERE c1.claim_id <> c2.claim_id AND c1.text=c2.text AND c1.source_hash=c2.source_hash
RETURN count(*);

4.6 Worked Subgraph Example

(:Claim{text:"Hybrid retrieval combines dense and graph hops"})-[:ABOUT]->(:Concept{name:"Hybrid Retrieval"})
  \-[:CITED_IN]->(:Source{url:"https://blog.example/rag-hybrid"})
  \-[:SUPPORTS]->(:Claim{text:"Graph expansion increases recall@10 by 8–15% on tech QA"})

5. Scoring & Policies

5.1 Composite Score

We retain the multiplicative core but add smoothing and weights:

score = (ε + F)^{w_f} × (ε + T)^{w_t} × (ε + U)^{w_u}

  • F Freshness, T Trust, U Utility ∈ [0,1]; ε=0.05 prevents zero‑kill.
  • Default weights: w_f=1.0, w_t=1.2, w_u=0.8 (trust slightly emphasized).

5.2 Freshness

Exponential decay with half‑life:

F = 0.5^( age_days / half_life_days )

  • Domain default: half_life_days=270 (RAG techniques age slower than product releases).

5.3 Trust

Combine source & author reputation and graph support:

T = 0.5·R_source + 0.3·R_author + 0.2·S_graph

  • R_source ∈ [0,1] via allowlist (e.g., arXiv=0.9, random blog=0.4).
  • R_author from a simple prior (citations, history) ∈ [0,1].
  • S_graph = normalized in‑support minus in‑contradict degree (sigmoid‑scaled).

5.4 Utility

Behavioral signals:

U = σ( a·clicks + b·answers + c·feedback )

  • σ is logistic; defaults: a=0.02, b=0.03, c=0.1 (thumbs have stronger lift).
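The formulas in 5.1–5.4 translate directly into a few lines of Python, using the default weights and ε from above. This is a sketch for intuition, not the project's actual scoring module:

```python
import math

EPS = 0.05  # smoothing term that prevents any single zero factor from killing the score

def freshness(age_days: float, half_life_days: float = 270.0) -> float:
    """F = 0.5^(age_days / half_life_days)"""
    return 0.5 ** (age_days / half_life_days)

def trust(r_source: float, r_author: float, s_graph: float) -> float:
    """T = 0.5*R_source + 0.3*R_author + 0.2*S_graph"""
    return 0.5 * r_source + 0.3 * r_author + 0.2 * s_graph

def utility(clicks: int, answers: int, feedback: float,
            a: float = 0.02, b: float = 0.03, c: float = 0.1) -> float:
    """U = sigmoid(a*clicks + b*answers + c*feedback)"""
    return 1.0 / (1.0 + math.exp(-(a * clicks + b * answers + c * feedback)))

def composite_score(f: float, t: float, u: float,
                    w_f: float = 1.0, w_t: float = 1.2, w_u: float = 0.8) -> float:
    """score = (eps+F)^w_f * (eps+T)^w_t * (eps+U)^w_u"""
    return (EPS + f) ** w_f * (EPS + t) ** w_t * (EPS + u) ** w_u
```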

5.5 Policies (YAML‑driven)

freshness: { half_life_days: 270 }
trust: { min_provenance_weight: 0.35 }
utility: { min_clicks: 1, min_answers_served: 1 }
thresholds: { delete_score: 0.18, downgrade_score: 0.32 }
contradictions:
  keep_both_if_recent_days: 30
  nli: { entail: 0.78, contradict: 0.78, hysteresis: 0.05 }

5.6 Contradiction Resolution

  • Entailment ≥ threshold: canonicalize newer/higher‑trust as winner; attach :DERIVED_FROM to keep lineage.
  • Contradiction ≥ threshold: keep both; prefer the one with higher score in retrieval; open debate window before pruning.
  • Uncertain: keep both; schedule re‑check after Δt or on new evidence.

5.7 Prune & Shadow Logic (pseudo)

if score < delete_score: delete(tombstone=True)
elif score < downgrade_score: set_visibility('shadow')
else: set_visibility('active')
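A runnable version of the pseudocode, with the default thresholds from the policy YAML in 5.5:

```python
def visibility_for(score: float,
                   delete_score: float = 0.18,
                   downgrade_score: float = 0.32) -> str:
    """Map a composite score to a visibility state per the prune policy."""
    if score < delete_score:
        return "deleted"   # tombstoned; purged after the retention window
    if score < downgrade_score:
        return "shadow"    # excluded from top-k but kept for lineage
    return "active"
```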

5.8 Numeric Example

  • Paper posted 180 days ago ⇒ F = 0.5^(180/270) ≈ 0.63.
  • Trusted source 0.9, author 0.6, support 0.4 ⇒ T = 0.5·0.9 + 0.3·0.6 + 0.2·0.4 = 0.71.
  • Utility from usage ⇒ U ≈ 0.55.
  • Score ≈ (0.05+0.63)^1.0 × (0.05+0.71)^1.2 × (0.05+0.55)^0.8 ≈ 0.33 → active (just above the downgrade threshold).

6. API Surface (FastAPI)

6.1 Authentication & Roles

  • AuthN: Bearer JWT (OIDC).
  • Roles: reader (search only), operator (trigger audit/graft/prune), admin (policy edits).
  • All admin/operator routes require a role claim in {operator, admin}.

6.2 Endpoints

  • GET /search

    • Query: q (str, ≥3), k (int, ≤50, default 20), expand (bool), include (enum: active|shadow|all, default active), since (ISO8601).

    • 200 Response:

      {"q":"...","hits":[
        {"claim_id":"...","text":"...","score":0.61,
         "provenance": {"source_url":"...","publisher":"..."},
         "neighbors": [{"claim_id":"...","rel":"SUPPORTS"}]
        }]}
    • Errors: 400 invalid query, 429 rate limit, 500 server.

  • POST /admin/audit (operator+): run contradiction/drift audit (async job id)

  • POST /admin/graft (operator+): execute graft on queued candidates

  • POST /admin/prune (operator+): apply prune/shadow per thresholds

  • GET /admin/policies (admin): current YAML snapshot

  • PUT /admin/policies (admin): validate+update policies (dry‑run with ?dry=true)

  • GET /healthz: liveness probe

6.3 Versioning & Stability

  • Prefix future breaking API as /v2/...; keep /v1 stable for ≥ 6 months.

6.4 Pagination & Limits

  • Cursor pagination for /search with next_cursor token; bounded k≤50.

6.5 Rate Limiting

  • Global: 60 req/min per IP; burst 120; admin endpoints 10/min.
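A single-process token-bucket sketch of these limits (60 req/min with a burst of 120). A real deployment would enforce this in a gateway or shared store; the class here is illustrative only:

```python
import time

class TokenBucket:
    """Token bucket: refills at rate_per_min/60 tokens per second up to `burst`,
    and each allowed request consumes one token."""

    def __init__(self, rate_per_min: float = 60.0, burst: float = 120.0):
        self.rate = rate_per_min / 60.0   # tokens per second
        self.capacity = burst
        self.tokens = burst               # start full so bursts are allowed
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```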

6.6 Curl Examples

curl 'http://localhost:8080/search?q=hybrid%20retrieval&k=10'

curl -X POST -H "Authorization: Bearer $TOKEN" \
  'http://localhost:8080/admin/prune'

curl -X PUT -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/yaml' \
  --data-binary @configs/prune_policies.yaml 'http://localhost:8080/admin/policies?dry=true'

7. Real‑World UX Diagrams

7.1 Researcher UX (Search & Traceability)

flowchart TD
  U[User: Researcher] -->|asks question| S[Search Box]
  S --> R[Hybrid Retrieve]
  R --> C[Results List<br/>Claim + Snippet + Score]
  C --> P[Provenance Panel<br/>Sources, Authors, Edges]
  P --> T[Trace Graph<br/>Neighborhood Explorer]
  T --> F[Feedback<br/>👍/👎 relevance]


Screen anatomy (textual)

  • Left: query + filters (topic, recency)
  • Middle: ranked claims, per‑claim score bars, badges (fresh/trust)
  • Right: provenance & mini‑graph; click to expand full graph overlay
  • Footer: feedback widgets log utility signals

7.2 Operator UX (Govern & Observe)

flowchart TD
  OP[Operator] --> D[Dashboard]
  D --> M[Metrics:<br>ingest rate, merges, deletes, shadowed]
  D --> Q[Quality:<br>contradictions over time, eval scores]
  D --> A[Actions:<br>Review candidates<br>&#40;Graft/Prune overrides&#41;]
  A -->|Approve| Pipeline[Execute & Log]

Admin console panels

  • Metrics: ingestion lag, audit throughput, %shadowed, delete rate, MRR@10, nDCG@10
  • Candidates: sortable table with diff view (A vs B claim text, scores, NLI probs)
  • Policy Editor: YAML with live schema validation & dry‑run preview

8. Operations & Deployment

8.1 Local

# 1. Set up environment variables
cp .env.example .env

# 2. Start Docker containers for databases
docker compose up -d

# 3. Run the bootstrap script to verify environment config
./scripts/bootstrap.sh

# 4. Run tests to ensure everything is working
pytest -q

# 5. Start the API server
uvicorn src.api.server:app --reload --port 8080

Note: The bootstrap.sh script only validates your .env file. It does not initialize database schemas, constraints, or collections. These must be managed manually or through future migration scripts.

8.2 Local GUI Application

For interactive querying, a simple desktop GUI is provided.

  • Technology: tkinter (standard Python library).

  • Functionality: Allows a user to enter a natural language query and view a list of retrieved claims with their scores and sources.

  • How to run:

    python -m src.gui.app
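A minimal sketch of what such a tkinter query window could look like is below. The `/search` endpoint path, port, and JSON response shape (`{"claims": [...]}` with `score`, `text`, `source_url` fields) are assumptions, not the actual contract of `src/gui/app.py`.

```python
# Hedged sketch of a tkinter search GUI; endpoint URL and response shape
# are assumptions for illustration.
import json
import urllib.parse
import urllib.request

API_URL = "http://localhost:8080/search"  # assumed endpoint and port

def format_results(claims: list[dict]) -> str:
    """Render retrieved claims as display text: score, claim text, source."""
    lines = []
    for c in claims:
        lines.append(f"[{c.get('score', 0.0):.2f}] {c.get('text', '')}")
        lines.append(f"    source: {c.get('source_url', 'unknown')}")
    return "\n".join(lines)

def main() -> None:
    import tkinter as tk  # stdlib; imported lazily so headless use still works

    root = tk.Tk()
    root.title("KG-RAG Search")
    entry = tk.Entry(root, width=60)
    entry.pack(padx=8, pady=4)
    output = tk.Text(root, width=80, height=20)
    output.pack(padx=8, pady=4)

    def run_query() -> None:
        url = API_URL + "?q=" + urllib.parse.quote(entry.get())
        with urllib.request.urlopen(url) as resp:  # assumed {"claims": [...]}
            claims = json.load(resp).get("claims", [])
        output.delete("1.0", tk.END)
        output.insert(tk.END, format_results(claims))

    tk.Button(root, text="Search", command=run_query).pack(pady=4)
    root.mainloop()

if __name__ == "__main__":
    main()
```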

8.3 Production (guidance)

  • Containers: separate services for API, auditor, curator; Neo4j & Qdrant as managed or stateful sets
  • K8s: HPA on API; CronJobs for audit/graft/prune
  • Secrets: Kubernetes Secrets + sealed‑secrets; never in repo
  • Backups: nightly Neo4j dump; Qdrant snapshot; verify restore

8.4 Observability

  • Logs: structured JSON (request_id, claim_id)

  • Metrics: Prometheus exporters

    • ingest_docs_total, audit_pairs_scored_total, graft_merges_total, prune_deletes_total
    • Retrieval quality: mrr_10, ndcg_10, hit_rate_5
  • Tracing: OpenTelemetry (API → retriever → stores)
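The structured-logging bullet above can be sketched with the stdlib `logging` module: a JSON formatter that carries the `request_id` and `claim_id` fields end to end. The field set is taken from the bullet; the logger name and setup are illustrative.

```python
# Sketch of structured JSON logging with request_id/claim_id fields, using
# only the stdlib; logger name and handler wiring are assumptions.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Extra fields passed via `extra={...}` land on the record:
            "request_id": getattr(record, "request_id", None),
            "claim_id": getattr(record, "claim_id", None),
        }
        return json.dumps(payload)

def make_logger(name: str = "kg_rag") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

# Usage: make_logger().info("merged claim",
#                           extra={"request_id": "r1", "claim_id": "c9"})
```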

8.5 SLOs (suggested)

  • API: p95 search < 300 ms (warm caches)
  • Freshness: 95% of eligible sources ingested < 24h
  • Consistency: 99% idempotent upserts (no dup keys)

9. Security & Compliance

  • AuthN/Z: admin endpoints require service account & RBAC; user tokens for write ops
  • Provenance: every claim cites Source; no orphan claims
  • PII: domain is technical content; if extended, add PII scanners and data handling rules
  • Supply chain: pin images; continuously scan deps (Dependabot/Snyk)

10. Testing & Evaluation

10.1 CI Gates (from AGENTS.md)

  • ruff format → ruff check → mypy → pytest

10.2 Unit/Integration/E2E

  • Unit: ids, scoring, normalize, NLI wrapper
  • Integration: Neo4j upsert/merge/prune; Qdrant search
  • E2E: ingest sample doc → query returns expected claims with provenance

10.3 Retrieval Eval Harness

  • Curate data/eval/qa.jsonl with questions about RAG
  • Compute MRR@k, nDCG@k, recall@k on claims (not documents)
  • Regression guard: fail PR if quality drops > ε
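The metrics named above can be computed directly over claim-level rankings; a minimal sketch (standard formulas, with `ranking` as claim_ids ordered by score and `relevant` as the gold set for one question):

```python
# Claim-level retrieval metrics: MRR@k, recall@k, nDCG@k (binary relevance).
import math

def mrr_at_k(ranking: list[str], relevant: set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant claim within the top k."""
    for i, cid in enumerate(ranking[:k], start=1):
        if cid in relevant:
            return 1.0 / i
    return 0.0

def recall_at_k(ranking: list[str], relevant: set[str], k: int = 10) -> float:
    if not relevant:
        return 0.0
    hits = sum(1 for cid in ranking[:k] if cid in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranking: list[str], relevant: set[str], k: int = 10) -> float:
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, cid in enumerate(ranking[:k], start=1)
        if cid in relevant
    )
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Averaging these per-question scores over `data/eval/qa.jsonl` yields the aggregate numbers the regression guard compares against.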

11. Configuration

11.1 Environment

NEO4J_URI=bolt://neo4j:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=...
QDRANT_URL=http://qdrant:6333
EMBED_MODEL=text-embedding-3-large  # or a local model
NLI_MODEL=roberta-large-mnli        # or a local model
MAX_CHUNK_TOKENS=1000
PRUNE_DELETE_SCORE=0.18
PRUNE_DOWNGRADE_SCORE=0.32
FRESHNESS_HALF_LIFE_DAYS=270
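`FRESHNESS_HALF_LIFE_DAYS` is naturally applied as an exponential decay in which a claim's freshness weight halves every 270 days. The exact scoring formula in the codebase may differ; this is an illustrative sketch:

```python
# Illustrative freshness decay: weight halves every `half_life_days` days.
def freshness(age_days: float, half_life_days: float = 270.0) -> float:
    return 0.5 ** (age_days / half_life_days)

# freshness(0) == 1.0, freshness(270) == 0.5, freshness(540) == 0.25
```

Claims whose combined score falls below `PRUNE_DOWNGRADE_SCORE` or `PRUNE_DELETE_SCORE` would then become candidates for shadowing or deletion, respectively.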

11.2 Policies (editable at runtime)

  • configs/prune_policies.yaml
  • configs/graft_policies.yaml

12. Extension Points

  • Sources: add ingest/*.py with a fetch() generator
  • Extractors: swap to specialized RAG ontologies (e.g., pipelines, evaluators)
  • Embeddings: switch models transparently via embed/encoder.py
  • Storage: Milvus/Weaviate adapters; add GQL API
  • LLM: local vLLM or hosted; add citation‑strict prompt templates

12.1 Caching

Currently, the system does not cache LLM responses. A valuable future enhancement would be to integrate a caching layer such as GPTCache, which would reduce latency and cost for repeated queries and NLI scoring tasks. The caching mechanism could be added to the OllamaLLM client in src/nlp/llm.py.
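A minimal sketch of such a layer, as a thin wrapper that could sit in front of an OllamaLLM-style client. The `complete()` method name and client interface are assumptions about src/nlp/llm.py, not its actual API:

```python
# In-memory response cache keyed by a hash of the prompt; the wrapped client
# interface (complete()) is an assumption for illustration.
import hashlib

class CachingLLM:
    def __init__(self, client) -> None:
        self._client = client
        self._cache: dict[str, str] = {}

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in self._cache:  # only call the backend on a cache miss
            self._cache[key] = self._client.complete(prompt)
        return self._cache[key]
```

A production version would bound the cache (LRU or TTL eviction) and possibly persist it, which is what a library like GPTCache provides out of the box.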


13. Runbooks

13.1 High Delete Spike

  1. Pause prune (PRUNE_ENABLED=false)
  2. Inspect recent score changes; verify policy edits
  3. Restore from snapshot if needed; re‑run audit in dry‑run mode

13.2 Contradiction Storm

  1. Raise debate window to 90 days
  2. Switch NLI to higher‑capacity model
  3. Manually review top contradictions in Admin console

14. License & Attribution

  • Apache‑2.0 (templates, code)
  • Cite sources when surfacing claims; store publisher metadata

15. Roadmap

  • Admin UI (Next.js) for dashboards & candidate review
  • Temporal edges & time‑aware retrieval
  • Automatic excerpt selection for answer composer
  • Graph‑aware reranking (learning‑to‑rank)

16. Quick Start Checklist

  • cp .env.example .env
  • docker compose up -d
  • ./scripts/bootstrap.sh
  • pytest -q
  • uvicorn src.api.server:app --reload
  • Use the /ingest endpoint or the GUI to add data.
  • Use the /search endpoint or the GUI to query.

17. Performance Tuning & Latency Budget

17.1 Qdrant

  • Collection: cosine (or dot) distance; payload: {claim_id, topics[], trust, freshness} for re‑rank.
  • HNSW: m=32, ef_construct=128, query ef_search=128 (tune vs recall).
  • Quantization: enable scalar/product quantization for memory; verify recall on eval set before enabling in prod.
  • Filters: pre‑filter by visibility='active' and topic when available.
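The payload fields listed above (`trust`, `freshness`) can feed a simple blended re-rank after vector search. The weights below are illustrative assumptions, not tuned values:

```python
# Blend the raw vector score with trust/freshness from the Qdrant payload.
# Weights are illustrative; tune against the eval set.
def rerank(hits: list[dict], w_vec: float = 0.6, w_trust: float = 0.25,
           w_fresh: float = 0.15) -> list[dict]:
    def blended(h: dict) -> float:
        p = h.get("payload", {})
        return (w_vec * h.get("score", 0.0)
                + w_trust * p.get("trust", 0.0)
                + w_fresh * p.get("freshness", 0.0))
    return sorted(hits, key=blended, reverse=True)
```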

17.2 Neo4j

  • Prefer index‑backed lookups; avoid label scans on hot paths.
  • Use subqueries and LIMIT early; batch writes with apoc.periodic.iterate.
  • Keep :Claim degree bounded; archive very high‑degree hubs to summary nodes.
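The batching advice can be sketched as a plain helper that chunks rows before handing each batch to a write transaction (whether via `apoc.periodic.iterate` or driver-side loops). The batch size of 1000 is illustrative:

```python
# Chunk write payloads so each Neo4j transaction stays small and bounded.
from typing import Iterable, Iterator

def batches(rows: Iterable, size: int = 1000) -> Iterator[list]:
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```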

17.3 Latency Budget (target)

| Stage | p95 Target |
| --- | --- |
| Vector search (Qdrant) | ≤ 40 ms |
| Graph expansion (Neo4j) | ≤ 80 ms |
| Re‑rank + compose | ≤ 100 ms |
| End‑to‑end /search | ≤ 300 ms |

18. Threat Model & Security Testing

18.1 STRIDE Snapshot

  • Spoofing: JWT validation, clock skew tolerance ±60s, key rotation.
  • Tampering: signed container images; policy updates gated by admin role + audit log.
  • Repudiation: structured request IDs; append‑only policy change log in graph.
  • Information Disclosure: redact secrets; fetchers sandboxed (no filesystem writes); allowlist egress.
  • DoS: rate limits; circuit breakers on external calls; bounded fan‑out.
  • Elevation: RBAC; no shell‑exec in request path; least‑privilege DB users.

18.2 Security Tests

  • Fuzz /search (unicode, pathological queries).
  • SSRF‑hardening in fetchers: enforce scheme allowlist https, max size, content-type checks.
  • Prompt‑injection hardening in answer composer (see §23).
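The fetcher-side SSRF hardening above (scheme allowlist, content-type check, size cap) can be sketched as a single gate function. The allowed content types and the 10 MiB cap are illustrative assumptions:

```python
# Gate a fetch on scheme, content type, and size. Allowlists and the size
# cap are illustrative assumptions, not the project's actual policy.
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"https"}
ALLOWED_CONTENT_TYPES = {"text/html", "application/pdf", "text/plain"}
MAX_BYTES = 10 * 1024 * 1024  # 10 MiB cap (assumed)

def fetch_allowed(url: str, content_type: str, size_bytes: int) -> bool:
    if urlparse(url).scheme not in ALLOWED_SCHEMES:
        return False
    # Strip parameters like "; charset=utf-8" before matching the media type.
    if content_type.split(";")[0].strip() not in ALLOWED_CONTENT_TYPES:
        return False
    return size_bytes <= MAX_BYTES
```

In practice this check belongs before the body is read (via the response headers), so oversized or disallowed content is never downloaded.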

19. Data Governance & Licensing

  • Store Source.license (e.g., CC‑BY‑4.0, All‑Rights‑Reserved).
  • Respect robots.txt and usage terms; keep publisher metadata.
  • Downstream responses must return citations and observe license terms; optionally suppress excerpt text for non‑redistributable sources.

20. Migrations & Versioning (Graph)

20.1 Migration Files

/migrations
  ├─ 0001_init.cypher
  ├─ 0002_add_claim_visibility.cypher
  └─ 0003_index_score.cypher

  • Each file is idempotent; include IF NOT EXISTS guards.
  • Store applied migrations in (:_Migration {id, checksum, applied_at}).
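The migration ledger can be sketched in two small pieces: a checksum over each file's text (stored on the `:_Migration` node) and a function that diffs discovered migrations against the applied set. The internals of `scripts/migrate` are assumptions here:

```python
# Checksum + pending-set logic for a migration ledger; how scripts/migrate
# discovers files and writes (:_Migration) nodes is assumed, not confirmed.
import hashlib

def checksum(cypher_text: str) -> str:
    """Stable digest stored alongside the migration id to detect edits."""
    return hashlib.sha256(cypher_text.encode()).hexdigest()

def pending(migrations: dict[str, str], applied: set[str]) -> list[str]:
    """Return migration ids (e.g. '0001_init') not yet applied, in order."""
    return sorted(mid for mid in migrations if mid not in applied)
```

Comparing the stored checksum against the current file's checksum also catches the case where an already-applied migration file was edited after the fact.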

20.2 CLI

python -m scripts.migrate up    # apply pending
python -m scripts.migrate down  # rollback last (when reversible)

21. Disaster Recovery & Backups

  • Neo4j: nightly dump; weekly offsite; verify restore monthly.
  • Qdrant: snapshot API nightly; co‑version with the Neo4j dump.
  • RPO/RTO: RPO ≤ 24h, RTO ≤ 2h (tune per tier).

22. Observability Dashboards & Alerts

  • Grafana panels: ingest lag, audit throughput, merge/prune counts, shadow ratio, quality (MRR@10, nDCG@10).

  • Alerts:

    • ingest_lag_seconds > 86400 for 30m.
    • contradictions_rate spike > 3× baseline.
    • delete_spike > 5% of claims/day.

23. Answer Composer & Prompt Guardrails

  • Always include only retrieved claims; no unsupported assertions.
  • Template enforces: “Cite claim_ids and source_urls in every paragraph.”
  • Refuse answers when supporting_claims < min_k.
  • Strip/escape user‑provided instructions in content (no tool directives leaking into prompt).
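The refusal guardrail above can be sketched as a composer that answers only when at least `min_k` supporting claims were retrieved and attaches their citations. The `min_k` default and claim shape are assumptions:

```python
# Refuse when support is thin; otherwise compose from retrieved claims only,
# carrying claim_ids and source_urls as citations. min_k is an assumed default.
def compose_answer(claims: list[dict], min_k: int = 2) -> dict:
    if len(claims) < min_k:
        return {
            "answer": None,
            "refusal": f"Insufficient support: {len(claims)} < {min_k} claims",
        }
    citations = [{"claim_id": c["claim_id"], "source_url": c["source_url"]}
                 for c in claims]
    return {"answer": " ".join(c["text"] for c in claims),
            "citations": citations}
```

A real composer would pass the claims to an LLM with a citation-strict template rather than concatenating them, but the refusal check and citation plumbing stay the same.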

24. Demo & Seed Data

  • Seed sources: 5 canonical RAG blog posts, 10 arXiv abstracts, 3 library docs pages.
  • Script: scripts/bootstrap.sh ingests seed and runs one Audit→Graft→Prune cycle.
  • Sample queries: "hybrid retrieval", "reranking vs hybrid", "graph‑aware retrieval benefits".

25. FAQ & Glossary

  • Graft: merging near‑duplicate or entailing claims into a canonical node with lineage preserved.
  • Shadow: visible in lineage but excluded from top‑k retrieval.
  • Debate window: time before resolving contradictions to avoid premature pruning.
  • MRR / nDCG: retrieval metrics used in our eval harness.
