A Dockerized Retrieval-Augmented Generation (RAG) system optimized for CPU-only environments and capable of indexing and querying large-scale document collections (100k+ documents). This system integrates a modular API, local LLM serving via Ollama, and optional evaluation via RAGAS metrics. Ideal for research or educational deployment, especially in resource-constrained setups.
- Upload and index large-scale document collections (>100k documents, ~5k characters each)
- Perform semantic search with contextual answer generation
- CPU-only compatible (≤16GB RAM, no GPU needed)
- Modular microservices: FastAPI, embedding service, LLM wrapper, ChromaDB
- Local LLM inference via Ollama (e.g., Mistral, LLaMA 2)
- RAGAS-ready pipeline for evaluating answer quality and context precision
- Designed for extensibility, benchmarking, and privacy-preserving applications
- Docker v20.10+
- Docker Compose v1.27+
- CPU-only machine (≥8GB RAM recommended)
- (Optional) `OPENAI_API_KEY` set for metric computation:

```bash
export OPENAI_API_KEY=your_key
```

Clone the repository and move into the API directory:

```bash
git clone https://github.com/ckranon/emerging-topics-rag.git
cd emerging-topics-rag/rag-api
```

Located in `rag-api/`, the core RAG pipeline includes:
- `api/` — FastAPI endpoints for document upload and generation
- `embedding/` — Embedding server using SentenceTransformers (see the sketch below)
- `ollama/` — Local LLM runner using Ollama
- `vector_store/` — Persistent ChromaDB vector index
- `test_api.py` — Basic integration test script; reports average response time
- `compute_metrics.py` — Computes RAGAS metrics from the results generated by `test_api.py`
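To give a feel for how the embedding service fits into the pipeline, here is a minimal, illustrative sketch of what an endpoint like `embedding/embed_server.py` could look like; the route name, model choice, and response shape are assumptions and may differ from the actual implementation.

```python
# Hypothetical sketch of an embedding endpoint; the real embed_server.py may differ.
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
# Small CPU-friendly model; the repository may use a different one.
model = SentenceTransformer("all-MiniLM-L6-v2")

class EmbedRequest(BaseModel):
    texts: list[str]

@app.post("/embed")
def embed(req: EmbedRequest):
    # Encode each text into a dense vector and return plain lists for JSON serialization.
    vectors = model.encode(req.texts, convert_to_numpy=True)
    return {"embeddings": [v.tolist() for v in vectors]}
```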
Health check:

```bash
curl http://localhost:8000/
```

Response:

```json
{"message":"RAG API is running successfully"}
```
`POST /upload` uploads documents and indexes them into the vector store.

```bash
curl -X POST http://localhost:8000/upload \
  -H "Content-Type: application/json" \
  -d '{"texts":["Document 1 text...", "Document 2 text..."]}'
```

Response:

```json
{"message":"Vector index successfully created","nodes_count":123}
```
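The same upload can be scripted from Python. This is a minimal sketch using `requests`; the endpoint and payload shape are taken from the curl example above, and the timeout value is only a suggestion.

```python
import requests

# Index a small batch of documents through the /upload endpoint.
resp = requests.post(
    "http://localhost:8000/upload",
    json={"texts": ["Document 1 text...", "Document 2 text..."]},
    timeout=300,  # indexing large batches can take a while on CPU
)
resp.raise_for_status()
print(resp.json())  # e.g. {"message": "...", "nodes_count": 123}
```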
`POST /generate` generates an answer based on the user query and retrieved document context.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"new_message":{"role":"user","content":"What is the capital of France?"}}'
```

Response:

```json
{
  "generated_text": "The capital of France is Paris.",
  "contexts": ["Paris is the capital of France. It is known for the Eiffel Tower."]
}
```
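Generation can be scripted the same way. A minimal `requests` sketch mirroring the curl call above; field names follow the example response and are otherwise assumptions.

```python
import requests

# Ask a question; the API returns the generated answer plus the retrieved contexts.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"new_message": {"role": "user", "content": "What is the capital of France?"}},
    timeout=120,  # CPU-only generation can be slow
)
resp.raise_for_status()
result = resp.json()
print(result["generated_text"])
for ctx in result["contexts"]:
    print("-", ctx)
```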
To build and run all services:

```bash
docker-compose up --build
```

Services launched:

- `api` — FastAPI service for user interaction
- `embedding` — Generates document embeddings
- `ollama` — Runs a local LLM via `start.sh` (edit this script to change the model)
Run the integration test:

```bash
python test_api.py
```

Then compute RAGAS metrics (requires `OPENAI_API_KEY`):

```bash
export OPENAI_API_KEY=your_key
python compute_metrics.py
```
⚠️ Due to runtime and API constraints, metric computation may time out.
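For reference, RAGAS evaluation over stored generation results looks roughly like the sketch below. This is not the repository's `compute_metrics.py`; the metric selection, column names, and data are illustrative, and column requirements vary between RAGAS versions.

```python
# Hypothetical sketch of a RAGAS evaluation run; compute_metrics.py may differ.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# Example results as produced by the RAG pipeline (e.g. collected by test_api.py).
records = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
    # Some RAGAS versions require a reference answer for context_precision.
    "ground_truth": ["Paris is the capital of France."],
}

dataset = Dataset.from_dict(records)
# evaluate() calls the OpenAI API under the hood, so OPENAI_API_KEY must be set.
scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(scores)
```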
We compared two chunking strategies:
- Semantic Chunking — Splitting based on semantic boundaries (embedding similarity)
- Sentence Window Chunking — Fixed-size overlapping windows
Result:
Inconclusive; `compute_metrics.py` timed out.
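To make the comparison concrete, the following is an illustrative sketch of sentence-window chunking (fixed-size, overlapping windows). It is not the repository's implementation; semantic chunking would instead split where the embedding similarity between adjacent sentences drops.

```python
def sentence_window_chunks(sentences: list[str], window: int = 3, overlap: int = 1) -> list[str]:
    """Group sentences into fixed-size windows that overlap by `overlap` sentences."""
    step = max(window - overlap, 1)
    chunks = []
    for start in range(0, len(sentences), step):
        chunk = " ".join(sentences[start:start + window])
        if chunk:
            chunks.append(chunk)
        if start + window >= len(sentences):
            break
    return chunks

# Example: 5 sentences, windows of 3 with a 1-sentence overlap.
sents = ["S1.", "S2.", "S3.", "S4.", "S5."]
print(sentence_window_chunks(sents))  # ['S1. S2. S3.', 'S3. S4. S5.']
```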
We explored different LLMs:
- DeepSeek-R1:1.5b (reasoning-focused, open-weight)
- Qwen2.5:0.5b (BASELINE)
Result:
Inconclusive; `compute_metrics.py` timed out.
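Either model can be queried directly through Ollama's HTTP API once it has been pulled. A minimal sketch, assuming the default Ollama port 11434 and that the model tag is already available locally:

```python
import requests

# Query a locally served model through Ollama's generate endpoint.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:0.5b",  # or "deepseek-r1:1.5b"
        "prompt": "What is the capital of France?",
        "stream": False,  # return a single JSON object instead of a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```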
We compared two LLM serving backends:
- Ollama — Seamless local inference with minimal setup
- Hugging Face TGI — Scalable backend for multi-GPU serving
Result: Ollama replaced TGI because TGI could not pull the baseline models.
Instead of relying on Hugging Face TGI, we implemented persistent storage using ChromaDB (`chroma.db`).
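For context, persisting and querying a Chroma collection on disk looks roughly like the sketch below (using the `chromadb` client). The path, collection name, and placeholder embeddings are assumptions; the repository's `vector_store/` setup may differ.

```python
import chromadb

# Open (or create) a persistent index stored on disk.
client = chromadb.PersistentClient(path="rag-api/vector_store/chroma.db")
collection = client.get_or_create_collection("documents")

# Add documents with precomputed embeddings (placeholder vectors here).
collection.add(
    ids=["doc-1", "doc-2"],
    documents=["Document 1 text...", "Document 2 text..."],
    embeddings=[[0.1, 0.2, 0.3], [0.2, 0.1, 0.0]],
)

# Query by embedding; returns the closest stored documents.
results = collection.query(query_embeddings=[[0.1, 0.2, 0.3]], n_results=1)
print(results["documents"])
```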
Although the pipeline stores generation outputs for downstream evaluation, RAGAS metric computation consistently timed out during execution due to:
- API response delays from OpenAI
As a result, we deliver a baseline model with only qualitative improvement insights and no definitive RAGAS scores.
```
emerging-topics-rag/
├── .gitignore
├── README.md                  # This file
├── compute_metrics.py         # Metric computation using RAGAS (OpenAI required)
└── rag-api/
    ├── api/
    │   ├── api_rag.py
    │   ├── Dockerfile
    │   └── requirements.txt
    ├── embedding/
    │   ├── embed_server.py
    │   ├── Dockerfile
    │   └── requirements.txt
    ├── ollama/
    │   ├── start.sh
    │   └── Dockerfile
    ├── vector_store/
    │   └── chroma.db          # Persistent ChromaDB index
    ├── docker-compose.yaml
    └── test_api.py
```
- Research Prototypes — Test chunking and RAG strategies
- Private Knowledge Retrieval — Deploy local document Q&A systems
- Teaching Tool — Understand full-stack RAG pipelines
- Baseline Model Benchmarks — Evaluate low-resource model performance
- RAGAS Metrics Unavailable — Due to OpenAI API timeout issues
- No GPU Support — CPU-only by design; not optimized for high-scale workloads
- Manual Chunking Trade-offs — Semantic methods improve results but increase complexity
- Ollama Model Limitation — Must manually ensure models are pulled and accessible
We welcome contributions!
- Fork the repository
- Create a new feature branch
- Commit your changes
- Open a pull request
If you find a bug or have a feature request, feel free to open an Issue.
This project is licensed under the MIT License. You are free to use, modify, and distribute the code for academic or commercial purposes.