"If you can't measure it, you can't improve it — you're flying blind."
This is the first of four projects on building production-grade AI systems. I started with observability on purpose: before optimizing a RAG system's cost, latency, or answer quality, you have to be able to see them. So the very first thing I built into this system was the instrument panel — every search emits a full trace with token cost, retrieval scores, and a grounding signal. The full multi-tenant SaaS below is the substrate; measuring it is the thesis.
A multi-tenant RAG SaaS for asking natural-language questions about dense, multi-hour podcasts and getting synthesized answers with exact, clickable timestamp citations — built and operated like production software, not a demo.
Why this repo exists. Most RAG projects stop at a notebook and a Streamlit demo. This one is the other 80%: real multi-tenancy, an async job queue off the request path, server-side JWT auth, internal usage metering, first-class observability, a strict-typed codebase, and CI gating every change. It's the work of a backend engineer learning to build AI systems — and treating "AI system" as a synonym for "system."
flowchart TB
B(["Browser"])
subgraph vercel ["Vercel"]
FE["Next.js + Clerk<br/>sign-in · paste link · poll job · search<br/>identity injected server-side"]
end
subgraph railway ["Railway — Python backend"]
WEB["web · FastAPI<br/>verifies Clerk JWT (RS256 · JWKS · fail-closed)<br/>/search · /ingest · /jobs · /me/episodes"]
WK["worker · Celery<br/>ingest pipeline<br/>fetch → chunk → embed"]
RD[("Redis<br/>broker · rate limits")]
CH[("Chroma server<br/>one collection per tenant")]
end
subgraph managed ["Managed services"]
PG[("Supabase Postgres<br/>users · jobs · usage_ledger")]
LF["Langfuse<br/>per-search traces · scores · sessions"]
end
B -- "HTTPS" --> FE
FE -- "/api/* · Authorization: Bearer Clerk JWT" --> WEB
WEB -- "enqueue job" --> RD
RD -- "deliver" --> WK
WEB -- "users · jobs" --> PG
WK -- "status · usage" --> PG
WEB -- "vector search" --> CH
WK -- "upsert chunks" --> CH
WEB -. "traces" .-> LF
WK -. "traces" .-> LF
A user signs in, pastes their own YouTube link, the backend processes it on demand into their own per-user vector collection, and they search it to get a grounded answer with clickable timestamped citations. Ingestion is a background job — it never blocks the request path.
| Layer | Tech |
|---|---|
| Frontend | Next.js 15 (App Router) · React 19 · Clerk · Tailwind 4 · TypeScript — on Vercel |
| API | FastAPI · Pydantic · uvicorn |
| Ingest queue | Celery + Redis worker (fetch → chunk → embed, off the request path) |
| Vector DB | ChromaDB (standalone server; one collection per tenant) |
| Embeddings | OpenAI text-embedding-3-small |
| LLM synthesis | OpenAI gpt-4o-mini (grounded, citation-only answers) |
| Persistence | Postgres · SQLAlchemy 2.0 (typed) · Alembic migrations |
| Auth | Clerk (server-side JWT verification in FastAPI — RS256-pinned, JWKS, fail-closed) |
| Observability | Langfuse (per-search traces, retrieval scores, sessions) |
| Tooling | uv · pytest (377 tests) · mypy --strict (no Any) · GitHub Actions CI |
- Sign in (Clerk) and paste a YouTube URL.
- Real on-demand ingest. A Celery worker fetches captions, splits them into
~400-token chunks with ~50-token overlap, embeds them, and upserts into your
Chroma collection — reporting live phase/progress (
FETCH → PARSE → CHUNK → EMBED → READY) that the UI polls. - Ask a question. Your query is embedded, run as a vector similarity search against your collection, and the retrieved chunks (with their metadata) are passed to the LLM.
- Get a grounded answer with inline citations like
[How To Talk To Users — 01:14:30], where each timestamp is clickable and deep-links to that exact moment on YouTube. The timestamp comes straight from the chunk'sstart_timemetadata — the chunking pipeline is what makes citations possible, so it's never dropped during embedding.
The answer is grounded: if retrieval doesn't clear a similarity floor, the system says so rather than hallucinating. Six distinct result states (answer, low-confidence, no-match, empty-index, error, …) are rendered explicitly.
If you don't meter it, you can't improve it. The first capability I built into this system isn't a feature users see — it's the one I need to know whether everything else is working. An AI system whose cost, latency, and answer quality you can't observe is a system you can only guess about.
Every search() call emits one structured trace (via Langfuse),
so each question is fully accountable end-to-end:
- A
retrievalspan — embedding generation + the vector query, recording the retrieved chunk IDs and their similarity scores. This is where you see what the system actually pulled before the LLM ever runs. - An LLM
generation— model, token usage, and dollar cost, captured automatically by thelangfuse.openaidrop-in client. No manual accounting; the cost of every answer is just there. - The trace root — query → flattened result (status, answer, best_score, citations, latency).
Crucially, the trace doesn't just store JSON — it emits queryable scores, so "what's my grounding rate?" or "what's p95 retrieval quality?" are dashboard queries, not log greps:
| Score | Type | What it tells you |
|---|---|---|
status |
categorical | answer / low_confidence / no_match / … — the outcome mix |
best_score |
numeric | top retrieval similarity — is the index actually relevant? |
citation_count |
numeric | how grounded each answer is |
grounded |
boolean | did we answer from sources, or decline? → grounding rate |
injection_suspected |
boolean | prompt-injection guard fired on the query |
Failures are first-class too: real exceptions raise the trace to level=ERROR
(with the true cause, while the user-facing reason stays sanitized); an empty
query is level=WARNING (validation, not failure) — so the error views show real
problems, not noise. Traces carry service.name, environment (dev/test),
and a stable session_id so a visitor's searches group together.
It's a clean seam, zero-cost when off. Everything routes through
rag/observability.py (is_enabled(), trace_context(), record_score(),
re-exported @observe), gated on a configured client. With no LANGFUSE_* env
set, @observe is a passthrough and the app — and the entire 377-test suite —
behaves identically. Observability you can turn on in prod and off in tests,
without an if statement in the business logic.
Search trace
├─ retrieval span embed query · vector search · chunk ids + scores
└─ llm generation gpt-4o-mini · tokens · cost (auto-captured)
scores: status · best_score · citation_count · grounded · injection_suspected
The features below are where "RAG demo" becomes "production system":
- Multi-tenancy with hard isolation. Each user gets their own Chroma
collection (
tenant_{id}, validated inrag/security.py). A second account cannot see the first's episodes — enforced at the data layer, not just the UI, and covered by an isolation test. - Async ingest that survives restarts. Celery with
acks_late+ worker-fork-safe SQLAlchemy engine reset (worker_process_init → db.reset_engine()), an idempotent task (terminal-state guard + deterministic chunk IDs), so a redelivered job after a deploy/OOM never double-spends or double-writes. - Real server-side auth. The backend doesn't trust the proxy — it verifies the
Clerk session JWT itself: RS256 pinned, JWKS-fetched,
iss/exp/azpvalidated, fail-closed when unconfigured. A dev-only header seam keeps the test suite hermetic without weakening prod. - Usage metering. Every job records its embedding token spend + estimated cost
into an append-only
usage_ledger(cost accounting kept; the external billing provider was removed for this open-source release — see Scope below). - Two-phase deletes with real undo. Removing an episode soft-deletes first (instant, reversible — the UI shows an undo toast), then purges the vectors when the window closes; a server-side sweep finishes the job for clients that vanish mid-window. Spend records survive deletion, and a purge can never race an in-flight re-ingest of the same video.
- Observability built in. Full Langfuse tracing of every search — see the Observability section above (the thesis of this project).
- Strict-typed, fully tested.
mypy --strictclean with noAny; 377 tests (unit + integration + retrieval evals); CI blocks every red PR. - Security posture. CORS locked to the frontend origin, per-user Redis-backed
rate limits on
/ingestand/search, SSRF/abuse review on the user-supplied URL (host allowlist), and prompt-injection guards on the retrieval path. SeeSECURITY.md.
Bring your own keys. This repo is self-host only — there is no shared hosted instance. You supply your own OpenAI / Clerk / (optional) Supadata keys. Secrets live only in gitignored
.envfiles, never in the repo.
The entire app — Postgres, Redis, Chroma, the FastAPI web tier, the Celery ingest worker, the Next.js frontend, and a fully-initialized Langfuse (the observability showcase) — comes up with one command:
cp .env.example .env # then set OPENAI_API_KEY (the only required secret)
docker compose up -d --build
docker compose run --rm web alembic upgrade head # apply DB migrations (once)
open http://localhost:3000 # the app
open http://localhost:3001 # Langfuse (traces) — login below
curl http://localhost:8000/health # {"status":"ok",...}The containers wire themselves together over service DNS; nothing else in
.env is required — tracing works immediately against the bundled Langfuse
(see Observability for the UI
login and how to point at Langfuse Cloud instead). Host ports: app on
:3000, Langfuse on :3001, API on :8000, Postgres on :55432,
Redis on :6380, Chroma on :8001 (offset from the usual defaults so they
don't collide with other local stacks). First boot pulls the Langfuse images
(ClickHouse, MinIO, etc.) — give it a couple of minutes.
Sign-in needs Clerk keys in .env (free — see "Auth for local dev" below).
Without them the frontend still builds and runs; auth is simply inert.
Working on the frontend itself? Run it natively with hot reload instead of the container (both bind :3000, so stop one or the other):
docker compose stop frontend
cd web-next
cp .env.local.example .env.local # native dev reads THIS, not the root .env
npm install && npm run dev # http://localhost:3000A fresh stack has an empty index — sign in and paste a YouTube link (or use the dev-trust header below) to ingest your first episode.
Two options, depending on how much of the product you want to exercise:
-
Real Clerk flow (the full UI). Create a free Clerk app, then in
.envset its keys (NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY,CLERK_SECRET_KEY) andCLERK_JWT_ISSUER(your Clerk Frontend API origin — the backend verifies session JWTs against its JWKS). Then rebuild:docker compose up -d --build— a plain restart is NOT enough for the frontend, because the publishable key is inlined into the client bundle at image build time. (Running the frontend natively instead? Its keys go inweb-next/.env.local.) -
Dev-trust header (API only). Set
AUTH_DEV_TRUST_HEADER=1in.envand the backend accepts anX-User-Id: <anything>header as the identity:curl -X POST localhost:8000/ingest -H 'X-User-Id: dev' \ -H 'Content-Type: application/json' \ -d '{"youtube_url": "https://www.youtube.com/watch?v=..."}'
Dev-only. It bypasses JWT verification entirely — never set it anywhere reachable from the internet. Default: off.
The compose stack bundles a full self-hosted Langfuse (the langfuse-*
services) pre-initialized with a local project and API keys, and the app
defaults to it — so the trace stream described
above flows the moment the stack
is up. Run a search, then open the instrument panel:
- UI: http://localhost:3001 — login
dev@pdse.local/pdse-local-dev(setLANGFUSE_UI_PORTin.envif 3001 is taken on your machine). - One trace per search: retrieval span with chunk IDs + similarity scores,
LLM generation with token usage and dollar cost, the queryable scores
(
status,best_score,citation_count,grounded), and per-visitor session grouping viaX-Session-Id. Both the web tier (search traces) and the worker (ingest embedding calls) are wired.
The bundled instance's credentials are fixed local-dev values bound to loopback — fine for local dev, not a deployment recipe (for that, use Langfuse's self-hosting guide).
Prefer your own backend? Your .env overrides the bundled defaults:
- Langfuse Cloud: create a project at
cloud.langfuse.com, then set
LANGFUSE_PUBLIC_KEY/LANGFUSE_SECRET_KEY/LANGFUSE_BASE_URL=https://cloud.langfuse.com(a cloud URL is reachable from containers and native processes alike). - An existing self-hosted instance on this machine (UI on, say,
:3001): containers can't see your host'slocalhost, so set bothLANGFUSE_BASE_URL=http://localhost:3001(native processes) andLANGFUSE_BASE_URL_DOCKER=http://host.docker.internal:3001(containers —host.docker.internalis native on Docker Desktop; the compose file adds thehost-gatewaymapping so it works on Linux too).
Native (non-compose) runs have no bundled default: with no LANGFUSE_*
set, the observability seam (src/rag/observability.py) makes every tracing
call a no-op — same code path, nothing emitted, no latency added.
Dependency manager is uv. Tests and the type
gate are fully hermetic (no services needed; DB tests self-skip without
Postgres):
uv sync
uv run pytest # unit + integration + retrieval evals
uv run mypy src # strict, no Anyuv run pytest never touches your real database, with or without the compose
stack running. The DB-backed tests wipe tables between tests, so pytest
rewrites DATABASE_URL to a dedicated test database — it derives <name>_test
from the configured URL (default: pdse_test) and creates it on demand on the
local server (see tests/_db_isolation.py). To point tests somewhere else, set
TEST_DATABASE_URL; its database name must end in _test, anything else fails
the run loudly. If the schema in pdse_test ever drifts after a model change,
just drop it (docker compose exec postgres psql -U pdse -d postgres -c 'DROP DATABASE pdse_test') — the next run recreates it.
To iterate on the backend natively against the compose infra, the .env
written from .env.example already points at the host-mapped ports
(Postgres :55432, Redis :6380; set CHROMA_SERVER_URL=http://localhost:8001
so a native web/worker shares the compose Chroma — see .env.example):
uv run uvicorn server:app --app-dir src --reload # http://127.0.0.1:8000
# Ingest worker (separate terminal). macOS note: Apple's ObjC runtime aborts
# forked workers — use --pool=solo locally (the Linux container uses prefork).
uv run celery -A worker.celery_app worker --loglevel=info --pool=soloThe pure RAG path needs none of that: uv run python src/ingest.py and the
search pipeline work with just OPENAI_API_KEY and a local on-disk Chroma
(leave CHROMA_SERVER_URL unset) — the Postgres/Redis/Clerk machinery only
exists for the multi-tenant SaaS flow.
Full deployment runbook (Railway + Vercel + Supabase, the Chroma server-mode
gotcha, the captions backends): DEPLOY.md.
src/
server.py FastAPI app: /search /ingest /jobs/{id} /me/episodes (+removal trio) /health
worker.py Celery app + run_ingest task (fetch → chunk → embed)
ingest.py Offline batch pipeline (the original build; reused by the worker)
rag/ Pure logic — no framework coupling:
pipeline.py retrieve → ground → synthesize
transcript.py caption/metadata fallback chain (resilient to YouTube blocks)
chunker.py segment-aggregating ~400-token splitter w/ timestamp metadata
embeddings.py OpenAI embeddings + token/cost accounting
vectorstore.py Chroma client/server-mode, per-tenant collections
auth.py Clerk JWT verification (RS256, JWKS, fail-closed)
security.py tenant-id validation + prompt-injection guards
repository.py typed data layer (users · jobs · usage_ledger)
models_db.py SQLAlchemy 2.0 models
observability.py Langfuse seam (zero-cost when off)
ratelimit.py per-user Redis rate limits
tests/ 377 tests (unit · integration · retrieval evals)
web-next/ Next.js + Clerk frontend
alembic/ migrations
docker-compose.yml full local stack: postgres · redis · chroma · web · worker ·
frontend · bundled Langfuse (6 langfuse-* services)
MIT.
