Emerging Topics RAG — Retrieval-Augmented Generation System

A Dockerized Retrieval-Augmented Generation (RAG) system optimized for CPU-only environments and capable of indexing and querying large-scale document collections (100k+ documents). This system integrates a modular API, local LLM serving via Ollama, and optional evaluation via RAGAS metrics. Ideal for research or educational deployment, especially in resource-constrained setups.

Table of Contents

  • Features
  • Prerequisites
  • Getting Started
  • API Endpoints
  • Docker Setup
  • Testing
  • Methodology & Findings
  • Project Structure
  • Use Cases
  • Limitations & Challenges
  • Contributing
  • License

Features

  • Upload and index large-scale document collections (100k+ documents, ~5k characters each)
  • Perform semantic search with contextual answer generation
  • CPU-only compatible (≤16GB RAM, no GPU needed)
  • Modular microservices: FastAPI, embedding service, LLM wrapper, ChromaDB
  • Local LLM inference via Ollama (e.g., Mistral, LLaMA 2)
  • RAGAS-ready pipeline for evaluating answer quality and context precision
  • Designed for extensibility, benchmarking, and privacy-preserving applications

Prerequisites

  • Docker v20.10+
  • Docker Compose v1.27+
  • CPU-only machine (≥8GB RAM recommended)
  • (Optional) OPENAI_API_KEY set for metric computation:
export OPENAI_API_KEY=your_key

Getting Started

Clone the repository with submodules:

git clone --recurse-submodules https://github.com/ckranon/emerging-topics-rag.git
cd emerging-topics-rag/rag-api

RAG API Submodule

Located in rag-api/, the core RAG pipeline includes:

  • api/ — FastAPI endpoints for document upload and generation
  • embedding/ — Embedding server using SentenceTransformers
  • ollama/ — Local LLM runner using Ollama
  • vector_store/ — Persistent ChromaDB vector index
  • test_api.py — Basic integration test script; reports the average response time.
  • compute_metrics.py — Computes RAGAS metrics from the results generated by test_api.py.

API Endpoints

GET /

Health check:

curl http://localhost:8000/

Response:

{"message":"RAG API is running successfully"}

POST /upload

Uploads documents and indexes them into the vector store.

curl -X POST http://localhost:8000/upload \
  -H "Content-Type: application/json" \
  -d '{"texts":["Document 1 text...", "Document 2 text..."]}'

Response:

{"message":"Vector index successfully created","nodes_count":123}

POST /generate

Generates an answer based on user query and retrieved document context.

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"new_message":{"role":"user","content":"What is the capital of France?"}}'

Response:

{
  "generated_text": "The capital of France is Paris.",
  "contexts": ["Paris is the capital of France. It is known for the Eiffel Tower."]
}
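
The same endpoint can be called from Python. A minimal sketch using requests, reusing the question from the curl example above:

import requests

payload = {"new_message": {"role": "user", "content": "What is the capital of France?"}}
response = requests.post("http://localhost:8000/generate", json=payload)
response.raise_for_status()

result = response.json()
print(result["generated_text"])      # the model's answer
for context in result["contexts"]:   # retrieved passages used as context
    print("-", context)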

Docker Setup

To build and run all services:

docker-compose up --build

Services launched:

  • api — FastAPI service for user interaction
  • embedding — Generates document embeddings
  • ollama — Runs a local LLM via start.sh (the model can be changed there)

Testing

Run API tests:

python test_api.py

Run RAGAS-based metric evaluation:

export OPENAI_API_KEY=your_key
python compute_metrics.py

⚠️ Due to runtime and API constraints, metric computation may time out.
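
For orientation, this is roughly how a RAGAS evaluation over stored question/answer/context triples is wired up with the ragas and datasets packages; the column names, metric choice, and sample row are illustrative and not necessarily what compute_metrics.py does. RAGAS uses an OpenAI model by default, which is why OPENAI_API_KEY must be set:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Toy sample; in practice these rows come from the saved test_api.py outputs.
rows = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France. It is known for the Eiffel Tower."]],
}

dataset = Dataset.from_dict(rows)
scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(scores)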


Methodology & Findings

Chunking Strategy Evaluation

We compared:

  • Semantic Chunking — Splitting based on semantic boundaries (embedding similarity)
  • Sentence Window Chunking — Fixed-size overlapping windows

Result: Inconclusive; compute_metrics.py timed out before producing scores.
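
To make the comparison concrete, a minimal sketch of both strategies over a toy list of sentences, assuming SentenceTransformers is installed; the model name, window size, and similarity threshold are illustrative choices:

from sentence_transformers import SentenceTransformer, util

def sentence_window_chunks(sentences, window=3, stride=1):
    # Fixed-size overlapping windows of consecutive sentences.
    return [" ".join(sentences[i:i + window])
            for i in range(0, max(len(sentences) - window + 1, 1), stride)]

def semantic_chunks(sentences, model, threshold=0.55):
    # Start a new chunk whenever adjacent sentences drop below a similarity threshold.
    embeddings = model.encode(sentences, convert_to_tensor=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if util.cos_sim(embeddings[i - 1], embeddings[i]).item() < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "Paris is the capital of France.",
    "It is known for the Eiffel Tower.",
    "ChromaDB stores embeddings on disk.",
]
print(sentence_window_chunks(sentences, window=2))
print(semantic_chunks(sentences, model))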


Model Comparison

We explored different LLMs:

  • DeepSeek-R1:1.5b (reasoning-focused, open-weight)
  • Qwen2.5:0.5b (BASELINE)

Result: Inconclusive; compute_metrics.py timed out before producing scores.


Inference Backends

We tested:

  • Ollama — Seamless local inference with minimal setup
  • Hugging Face TGI — Scalable backend for multi-GPU serving

Result: Ollama replaced TGI because TGI was unable to pull the baseline models.



Vector Stores

Instead of using Hugging Face TGI, we implemented persistent storage using ChromaDB (the chroma.db index under vector_store/).
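
Persistence means the index survives container restarts. A minimal sketch of how a persistent Chroma collection is used with the chromadb client; the path, collection name, and toy embedding are illustrative, and in this project embeddings come from the embedding service rather than being hard-coded:

import chromadb

# Data is written to disk at the given path and reloaded on restart.
client = chromadb.PersistentClient(path="rag-api/vector_store/chroma.db")
collection = client.get_or_create_collection(name="documents")

collection.add(
    ids=["doc-1"],
    documents=["Paris is the capital of France."],
    embeddings=[[0.1, 0.2, 0.3]],  # toy vector; normally produced by the embedding service
)

results = collection.query(query_embeddings=[[0.1, 0.2, 0.3]], n_results=1)
print(results["documents"])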


Evaluation Issues

Although the pipeline stores generation outputs for downstream evaluation, RAGAS metric computation consistently timed out during execution due to:

  • API response delays from OpenAI

As a result, we deliver a baseline model with qualitative insights only and no definitive RAGAS scores.


Project Structure

emerging-topics-rag/
├── .gitignore
├── README.md               # This file
├── compute_metrics.py      # Metric computation using RAGAS (OpenAI required)
└── rag-api/
    ├── api/
    │   ├── api_rag.py
    │   ├── Dockerfile
    │   └── requirements.txt
    ├── embedding/
    │   ├── embed_server.py
    │   ├── Dockerfile
    │   └── requirements.txt
    ├── ollama/
    │   ├── start.sh
    │   └── Dockerfile
    ├── vector_store/
    │   └── chroma.db       # Persistent ChromaDB index
    ├── docker-compose.yaml
    └── test_api.py

Use Cases

  • Research Prototypes — Test chunking and RAG strategies
  • Private Knowledge Retrieval — Deploy local document Q&A systems
  • Teaching Tool — Understand full-stack RAG pipelines
  • Baseline Model Benchmarks — Evaluate low-resource model performance

Limitations & Challenges

  • RAGAS Metrics Unavailable — Due to OpenAI API timeout issues
  • No GPU Support — CPU-only by design; not optimized for high-scale workloads
  • Manual Chunking Trade-offs — Semantic methods improve results but increase complexity
  • Ollama Model Limitation — Must manually ensure models are pulled and accessible

Contributing

We welcome contributions!

  1. Fork the repository
  2. Create a new feature branch
  3. Commit your changes
  4. Open a pull request

If you find a bug or have a feature request, feel free to open an Issue.


License

This project is licensed under the MIT License. You are free to use, modify, and distribute the code for academic or commercial purposes.

