RAG-based Q&A retrieval service for ACCESS-CI. Provides semantic search over human-verified Q&A pairs from Argilla.
┌─────────────────┐     ┌─────────────────┐
│  access-agent   │     │   MCP Server    │
│   (LangGraph)   │     │  (TypeScript)   │
└────────┬────────┘     └────────┬────────┘
         │                       │
         │     HTTP/REST API     │
         └───────────┬───────────┘
                     ▼
         ┌───────────────────────┐
         │   QA Service (this)   │
         │       (FastAPI)       │
         └───────────┬───────────┘
                     │
    ┌────────────────┼────────────────┐
    ▼                ▼                ▼
┌─────────┐    ┌────────────┐    ┌─────────┐
│ pgvector│    │   Redis    │    │ Argilla │
│  (Q&A)  │    │ (citations)│    │ (source)│
└─────────┘    └────────────┘    └─────────┘

- Semantic search over Q&A pairs using sentence-transformers embeddings
- Performance optimized:
  - HNSW index (15x faster than IVFFlat)
  - Query-level caching (90%+ reduction for repeated queries)
  - Pre-loaded embedding model (no cold start)
  - Batch embedding generation
- Citation validation via Redis registry
- Argilla integration for syncing human-verified Q&A pairs
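
The query-level caching listed above can be illustrated with a small time-to-live cache. This is only a sketch: the class name `QueryCache`, the key normalization, and the in-process dictionary are assumptions made for illustration, and the service may well store its cache elsewhere (for example in Redis). The default TTL mirrors the `CACHE_TTL_SECONDS` default from the configuration table.

```python
import hashlib
import time
from typing import Callable, Optional


class QueryCache:
    """Hypothetical in-memory TTL cache keyed by a hash of the normalized query."""

    def __init__(
        self,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict = {}  # key -> (expires_at, results)

    @staticmethod
    def make_key(query: str) -> str:
        # Lowercase and collapse whitespace so trivially different
        # spellings of the same question share one cache entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str) -> Optional[list]:
        key = self.make_key(query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, results = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._entries[key]
            return None
        return results

    def put(self, query: str, results: list) -> None:
        expires_at = self._clock() + self.ttl_seconds
        self._entries[self.make_key(query)] = (expires_at, results)
```

A search handler would call `get` first and only embed the query and hit the database on a miss, then `put` the results. Clearing such a cache after a data reload corresponds to what `/admin/clear-cache` is for.
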
docker-compose up -d

This starts:
- PostgreSQL with pgvector extension (port 5433)
- Redis (port 6380)
# Create virtual environment
python -m venv .venv
source .venv/bin/activate
# Install package
pip install -e ".[dev]"

cp .env.example .env
# Edit .env with your settings

# Development mode
uvicorn src.access_qa_service.main:app --reload --port 8001
# Or use the CLI
python -m access_qa_service.main

# Health check
curl http://localhost:8001/health
# Search (after loading data)
curl -X POST http://localhost:8001/search \
-H "Content-Type: application/json" \
-d '{"query": "What GPUs does Delta have?"}'
# Get stats
curl http://localhost:8001/admin/stats

| Method | Path | Description |
|---|---|---|
| POST | /search | Semantic search for matching Q&A |
| POST | /search/by-domain | Search filtered by domain |
| POST | /citations/validate | Batch validate citation markers |
| GET | /admin/stats | Service health and stats |
| POST | /admin/sync | Trigger Argilla sync (auth required) |
| POST | /admin/bulk-load | Direct Q&A upload (auth required) |
| POST | /admin/load-jsonl | Load from JSONL file (auth required) |
| POST | /admin/clear-cache | Clear query cache (auth required) |

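
The citation markers that `/citations/validate` checks appear in this README's sample answers in the form `<<SRC:compute-resources:delta.ncsa.access-ci.org>>`. The sketch below pulls such markers out of an answer locally, for example before sending them for validation. The grammar `<<SRC:domain:entity>>` is an assumption inferred from those two samples, and the function names are hypothetical; the service's own parser may differ.

```python
import re

# Assumed marker grammar, inferred from the README samples:
#   <<SRC:domain:entity>>
# where domain contains no colon and neither part contains whitespace.
CITATION_RE = re.compile(r"<<SRC:([^:<>\s]+):([^<>\s]+)>>")


def extract_citations(answer: str) -> list:
    """Return a (domain, entity) tuple for every marker found in the answer."""
    return CITATION_RE.findall(answer)


def strip_citations(answer: str) -> str:
    """Remove the markers and tidy the whitespace they leave behind."""
    return " ".join(CITATION_RE.sub("", answer).split())


if __name__ == "__main__":
    sample = (
        "Delta has NVIDIA A100 GPUs. "
        "<<SRC:compute-resources:delta.ncsa.access-ci.org>>"
    )
    print(extract_citations(sample))
    print(strip_citations(sample))
```
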
curl -X POST http://localhost:8001/admin/sync \
-H "Authorization: Bearer $ADMIN_TOKEN"

Create a file qa_pairs.jsonl:

{"question": "What GPUs does Delta have?", "answer": "Delta has NVIDIA A100 GPUs. <<SRC:compute-resources:delta.ncsa.access-ci.org>>"}
{"question": "What is the network fabric on PNRP?", "answer": "PNRP uses GigaIO SuperNODE fabric. <<SRC:compute-resources:pnrp.access-ci.org>>"}

Upload it:

curl -X POST http://localhost:8001/admin/load-jsonl \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-F "file=@qa_pairs.jsonl"

Expected latency after optimizations:
| Metric | Value |
|---|---|
| DB Query (HNSW) | ~25ms |
| Cache Hit | ~5ms |
| P50 End-to-End | ~50ms |
| P95 End-to-End | ~150ms |
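
To compare a local deployment against the P50 and P95 figures above, per-request latencies can be reduced to percentiles. The sketch below uses the nearest-rank definition, which is one common convention and not necessarily the one behind the table; the function names are hypothetical, and the latency samples must come from your own measurements.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # 1-based rank; multiplying before dividing avoids rounding
    # surprises for whole-number percentages.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(latencies_ms: list) -> dict:
    """Report the two end-to-end percentiles used in the table."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
    }
```

Collecting the samples (for instance by timing repeated `POST /search` calls) is left out here, since it depends on the client and the loaded data.
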
# Run tests
pytest
# Type checking
mypy src
# Linting
ruff check src
ruff format src

| Variable | Default | Description |
|---|---|---|
| DATABASE_URL | postgres://localhost:5433/qa_service | PostgreSQL connection |
| REDIS_URL | redis://localhost:6380/0 | Redis connection |
| EMBEDDING_MODEL | sentence-transformers/all-MiniLM-L6-v2 | Embedding model |
| RAG_SIMILARITY_THRESHOLD | 0.85 | Minimum similarity for matches |
| RAG_TOP_K | 3 | Max results to return |
| CACHE_TTL_SECONDS | 3600 | Query cache TTL |
| ADMIN_TOKEN | (empty) | Admin endpoint auth token |
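
As an illustration of how these variables fit together, the sketch below reads them from the environment with the defaults from the table and applies `RAG_SIMILARITY_THRESHOLD` and `RAG_TOP_K` to a list of scored candidates. The `Settings` class, `load_settings`, and `select_matches` are hypothetical names, not the service's actual code, and treating the threshold as inclusive is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    embedding_model: str
    similarity_threshold: float
    top_k: int
    cache_ttl_seconds: int
    admin_token: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, using the documented defaults."""
    return Settings(
        database_url=env.get("DATABASE_URL", "postgres://localhost:5433/qa_service"),
        redis_url=env.get("REDIS_URL", "redis://localhost:6380/0"),
        embedding_model=env.get(
            "EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"
        ),
        similarity_threshold=float(env.get("RAG_SIMILARITY_THRESHOLD", "0.85")),
        top_k=int(env.get("RAG_TOP_K", "3")),
        cache_ttl_seconds=int(env.get("CACHE_TTL_SECONDS", "3600")),
        admin_token=env.get("ADMIN_TOKEN", ""),
    )


def select_matches(scored: list, settings: Settings) -> list:
    """Keep (item, similarity) pairs at or above the threshold (assumed
    inclusive), best first, capped at top_k results."""
    kept = [pair for pair in scored if pair[1] >= settings.similarity_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[: settings.top_k]
```

With the defaults, a candidate list scored 0.9, 0.7, and 0.6 yields only the 0.9 match; lowering `RAG_SIMILARITY_THRESHOLD` in `.env` admits more candidates, up to `RAG_TOP_K`.
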