Starter PGVector/GraphRAG Query System

A pilot project using retrieval-augmented generation to answer questions about animal health. The application is fully containerized and generates paragraph answers with citations in response to user queries, backed by PostgreSQL with the pgvector extension.

Overview

Demos: real-time document ingestion; answering a single question (no cache).

This project implements a full-stack RAG (Retrieval-Augmented Generation) system enhanced with graph-based knowledge representation. The system processes documents, builds a knowledge graph of entities and their relationships, and provides comprehensive answers to user queries with proper citations.

Key Features

  • Document ingestion and chunking
  • Vector embeddings with OpenAI
  • Entity extraction and relationship mapping
  • Knowledge graph construction
  • GraphRAG-enhanced retrieval
  • Paragraph generation with citations
  • Interactive web interface
  • Query memory with similarity matching
  • User feedback and favorites
  • Document-grounded conversation threads

Architecture

The application is fully containerized using Docker and consists of the following components:

1. Database (PostgreSQL with pgvector)

  • Stores document chunks and their embeddings
  • Enables vector similarity search

2. Ingestion Service

  • Processes documents (PDF, DOCX, TXT)
  • Chunks documents and generates embeddings
  • Stores data in PostgreSQL

3. GraphRAG Processor

  • Builds a knowledge graph from document chunks
  • Extracts entities and relationships
  • Generates community summaries
  • Outputs graph data for enhanced retrieval

4. API Service

  • Processes user queries
  • Performs vector similarity search
  • Enhances retrieval with graph data
  • Generates comprehensive answers
  • Caches results for similar future queries
  • Supports conversation threads and feedback

5. Frontend

  • Provides a user-friendly interface
  • Displays answers with citations
  • Shows relevant chunks, entities, and community insights
  • Features for feedback, favorites, and conversation threads
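
One quick way to sanity-check this layout is to ask Docker Compose for the service list; the service names in the comment below are assumptions based on the components above, not the actual definitions in docker-compose.yml:

# List the services defined in docker-compose.yml
docker-compose config --services
# Typical output would name one service per component, e.g.:
#   db, ingestion-service, graphrag-processor, api-service, frontend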

Getting Started

Prerequisites

  • Docker and Docker Compose
  • OpenAI API key

Installation

  1. Clone this repository:

    git clone <repository-url>
    cd containerized-rag-starter-kit 
    
  2. Create a .env file in the root directory with your OpenAI API key:

    OPENAI_API_KEY=your_openai_api_key_here
    
  3. Build and start the containers:

    docker-compose up -d
    
  4. The application will be available at http://localhost:8080
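
To confirm the stack came up cleanly, you can check container status and tail the logs; the service name in the last command is an assumption (check docker-compose.yml for the actual names):

# Check that all containers are running
docker-compose ps

# Follow logs for a specific service (name is an assumption)
docker-compose logs -f ingestion-service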

Development

Frontend

  1. Go to /frontend

  2. Run npm install

  3. Run npm run dev

  4. Open a browser to http://localhost:5173/

FAQ

Database Backup and Restore

The system includes scripts for database backup and restore operations:

# Create a manual backup (default location: ./backups)
./scripts/backup_db.sh [backup_directory]

# Restore from a backup
./scripts/restore_db.sh path/to/backup_file.sql.gz

# Setup scheduled backups with rotation (keeps last 7 by default)
./scripts/scheduled_backup.sh [backup_directory] [retention_count]

For detailed information, see Database Backup and Restore.
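
For example, a nightly rotation that keeps the last 14 backups in a custom directory (the path here is hypothetical):

# Keep the last 14 backups under /srv/backups
./scripts/scheduled_backup.sh /srv/backups 14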

Usage

Adding Documents

Manual Addition

  1. Place your documents (PDF, DOCX, TXT, images) in the data directory

  2. The ingestion service will automatically process them. Run ./scripts/check_ingestion.sh to check ingestion progress from the command line (see the example after the note below)

  3. The GraphRAG processor will build the knowledge graph

Note: The system supports scanned PDFs and images through OCR processing
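
For example, to add a single document and watch it being ingested (the filename is hypothetical):

# Drop a document into the watched directory
cp ~/papers/east-coast-fever-review.pdf ./data/

# Check ingestion progress from the command line
./scripts/check_ingestion.sh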

Bulk Import from Zotero

The project includes two scripts for importing documents in bulk from Zotero storage:

Basic Import
  1. Run the basic import script:

    ./scripts/import_documents.sh /home/mu/Zotero/storage
    
  2. The script will:

    • Find all PDF, DOCX, and TXT files in the Zotero storage directory and its subdirectories
    • Copy them to the data directory (skipping any duplicates by filename)
    • Report how many new documents were added

Advanced Import with Metadata

For a more sophisticated import that preserves folder structure information:

  1. Run the advanced import script:

    ./scripts/import_with_metadata.py /home/mu/Zotero/storage
    
  2. The advanced script offers additional features:

    • Content-based deduplication (using file hashes)
    • Preserves source folder information in filenames
    • Creates a JSON metadata file with original paths and other information
    • Provides more options (run with --help to see all options)

    Additional options:

    # Preserve directory structure in target
    ./scripts/import_with_metadata.py --preserve-structure
    
    # Specify a custom target directory
    ./scripts/import_with_metadata.py --target-dir ./custom_data_dir
    
Import with OCR Support

For handling scanned documents and images:

  1. Run the OCR-enabled import script:

    ./scripts/import_with_ocr.py /home/mu/Zotero/storage
    
  2. This script provides all the features of the metadata script plus:

    • Automatic detection of scanned PDFs and images
    • OCR processing to make non-searchable documents searchable
    • Parallel processing for faster imports
    • Integration with the OCR service

    Additional options:

    # Force OCR processing for all documents
    ./scripts/import_with_ocr.py --force-ocr
    
    # Disable OCR processing
    ./scripts/import_with_ocr.py --no-ocr
    
    # Adjust processing threads
    ./scripts/import_with_ocr.py --threads 8
    
  3. The ingestion service will automatically process the imported documents

OCR Note: The system includes two OCR solutions:

  • Built-in OCR in the ingestion service for direct processing
  • Dedicated OCR service for more advanced preprocessing during import

Querying

  1. Open the frontend at http://localhost:8080
  2. Enter your query in the search box
  3. View the generated answer with citations
  4. Explore the relevant chunks

Technical Details

Vector Storage and Search

The system uses PostgreSQL with the pgvector extension to store and search vector embeddings, enabling efficient similarity search for relevant document chunks.
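
As an illustration, a pgvector similarity search orders rows by distance to a query embedding. The connection details, table, and column names below are assumptions (the real schema lives in db/), and a real query vector would have the full embedding dimensionality rather than three components:

# Hypothetical nearest-neighbour query against the chunks table
psql -h localhost -U postgres -d rag <<'SQL'
SELECT id, content
FROM document_chunks
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector  -- cosine distance operator
LIMIT 5;
SQL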

Knowledge Graph Construction

The GraphRAG processor extracts entities from document chunks using spaCy and builds a knowledge graph representing relationships between entities and chunks.

Query Processing

When a user submits a query, the following steps run (an illustrative request is sketched after the list):

  1. The query is embedded using OpenAI
  2. The system checks memory for similar previous queries
  3. If not found in memory, relevant chunks are retrieved using vector search
  4. The knowledge graph is queried for related entities and communities
  5. A comprehensive answer is generated with citations
  6. The result is stored in memory for future similar queries
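
A request against the API might look like the following; the endpoint path and payload shape are assumptions for illustration, not the documented contract (only /health is documented under Endpoints below):

# Hypothetical query request (path and JSON fields are assumptions)
curl -s -X POST http://localhost:8001/query \
  -H 'Content-Type: application/json' \
  -d '{"query": "What are the main vectors of East Coast fever?"}'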

Memory and Conversation Features

The system includes several enhanced features:

  1. Query Memory: The system remembers previous queries and can instantly retrieve answers for similar questions without regenerating them.

  2. User Feedback: Users can rate answers on a 5-star scale and provide text feedback.

  3. Favorites: Users can bookmark particularly useful answers for quick reference later.

  4. Conversation Threads: Users can start a conversation thread from any query, enabling follow-up questions with document-grounded responses.

  5. Enhanced Retrieval: Conversation threads can optionally include document retrieval to ground responses in the knowledge base.

Development

Project Structure

containerized-rag-starter-kit/
├── data/                  # Document storage directory
├── db/                    # Database configuration
├── docs/                  # Documentation files
├── ingestion_service/     # Document processing service
├── graphrag_processor/    # Knowledge graph generator
├── api_service/           # Query processing API
├── frontend/              # Vue.js web interface
├── scripts/               # Utility scripts for backup, import, etc.
└── docker-compose.yml     # Container orchestration

Customization

  • Modify the chunking parameters in ingestion_service/app.py
  • Adjust the graph processing interval in graphrag_processor/app.py
  • Change the UI appearance in frontend/src/assets/main.css

ILRI deployment

This setup runs only the database, ingestion service, API service, and GROBID. The frontend runs outside Docker with its own NGINX and optional systemd unit.

Services (ILRI compose)

  • Database: db-ilri (PostgreSQL + pgvector) on port 5434
  • GROBID: grobid-ilri on port 8070
  • API: api-service-ilri on port 8001
  • Ingestion: ingestion-service-ilri on port 5051
  • Frontend is not run in Docker for ILRI; run it outside as described below

Compose file: docker-compose.ilri.yml

Start ILRI stack (no frontend in Docker)

# Expects OPENAI_API_KEY in environment (and optionally ANTHROPIC_API_KEY)
export OPENAI_API_KEY=...  # required
export ANTHROPIC_API_KEY=...  # optional

./run-ilri-instance.sh

# Check status / logs
docker compose -f docker-compose.ilri.yml ps
docker compose -f docker-compose.ilri.yml logs -f

GROBID integration (required for ingestion)

  • Ingestion uses grobid-client with config mounted at /app/grobid_config.json
  • ILRI-specific config points to the in-network GROBID service: ingestion_service/grobid_config.ilri.json
  • The ILRI compose mounts this file and ensures ingestion-service-ilri depends on grobid-ilri (a quick reachability check is shown below)
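
To verify that GROBID is reachable before running ingestion, you can hit its standard liveness endpoint:

# GROBID answers "true" on its isalive endpoint when ready
curl http://localhost:8070/api/isalive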

Frontend outside Docker

You have two options:

  1. Development (Vite dev server)
  • Run:
    cd frontend
    npm install
    npm run dev  # defaults to http://localhost:5173
  • Update the NGINX reverse proxy to route the site to Vite and /api to the ILRI API:
    • File: deployment/nginx/containerized-rag
      • Change the frontend proxy target to http://localhost:5173
      • Change the API proxy target to http://localhost:8001/
  2. Production (build and serve via system NGINX)
  • Build static assets:
    cd frontend
    npm install
    npm run build  # outputs to ./dist
  • Configure your NGINX server block to serve frontend/dist as the site root and proxy /api/ to http://localhost:8001/. Example:
    server {
      listen 80;
      server_name ilri.example.org;
    
      root /home/muhia/src/containerized-rag-starter-kit/frontend/dist;
      index index.html;
    
      location /api/ {
        proxy_pass http://localhost:8001/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s;
      }
    
      location / {
        try_files $uri $uri/ /index.html;
      }
    }
  • Optionally manage NGINX with a systemd unit, or fold these steps into your own deployment routine.

ILRI branding (frontend)

Update the name and branding in the frontend:

  • Change the page title in frontend/index.html:
    <title>AI Search</title>
    to, e.g., <title>ILRI AI Search</title>
  • Change the nav brand in frontend/src/App.vue:
    <h1>AI Search</h1>
    to, e.g., <h1>ILRI AI Search</h1>
  • Add ILRI logo and colors in frontend/src/assets/main.css or component styles as desired.

Dev proxy (optional)

If using Vite dev server, point /api proxy to the ILRI API port:

  • File: frontend/vite.config.js
    import { defineConfig } from 'vite'

    export default defineConfig({
      server: {
        proxy: {
          '/api': { target: 'http://localhost:8001', changeOrigin: true, rewrite: p => p.replace(/^\/api/, '') }
        }
      }
    })

Endpoints

  • Frontend: via external NGINX (per config above)
  • API: http://localhost:8001
  • Health: http://localhost:8001/health
  • GROBID UI: http://localhost:8070
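
A quick smoke test once the stack is up:

# The API exposes a health endpoint
curl http://localhost:8001/health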

License

MIT License
