Podcast Deep Search

Project 1 of 4 · Production AI series · Lens: Observability

"If you can't measure it, you can't improve it — you're flying blind."

This is the first of four projects on building production-grade AI systems. I started with observability on purpose: before optimizing a RAG system's cost, latency, or answer quality, you have to be able to see them. So the very first thing I built into this system was the instrument panel — every search emits a full trace with token cost, retrieval scores, and a grounding signal. The full multi-tenant SaaS below is the substrate; measuring it is the thesis.

A multi-tenant RAG SaaS for asking natural-language questions about dense, multi-hour podcasts and getting synthesized answers with exact, clickable timestamp citations — built and operated like production software, not a demo.

Why this repo exists. Most RAG projects stop at a notebook and a Streamlit demo. This one is the other 80%: real multi-tenancy, an async job queue off the request path, server-side JWT auth, internal usage metering, first-class observability, a strict-typed codebase, and CI gating every change. It's the work of a backend engineer learning to build AI systems — and treating "AI system" as a synonym for "system."

Architecture

```mermaid
flowchart TB
    B(["Browser"])

    subgraph vercel ["Vercel"]
        FE["Next.js + Clerk<br/>sign-in · paste link · poll job · search<br/>identity injected server-side"]
    end

    subgraph railway ["Railway — Python backend"]
        WEB["web · FastAPI<br/>verifies Clerk JWT (RS256 · JWKS · fail-closed)<br/>/search · /ingest · /jobs · /me/episodes"]
        WK["worker · Celery<br/>ingest pipeline<br/>fetch → chunk → embed"]
        RD[("Redis<br/>broker · rate limits")]
        CH[("Chroma server<br/>one collection per tenant")]
    end

    subgraph managed ["Managed services"]
        PG[("Supabase Postgres<br/>users · jobs · usage_ledger")]
        LF["Langfuse<br/>per-search traces · scores · sessions"]
    end

    B -- "HTTPS" --> FE
    FE -- "/api/* · Authorization: Bearer Clerk JWT" --> WEB
    WEB -- "enqueue job" --> RD
    RD -- "deliver" --> WK
    WEB -- "users · jobs" --> PG
    WK -- "status · usage" --> PG
    WEB -- "vector search" --> CH
    WK -- "upsert chunks" --> CH
    WEB -. "traces" .-> LF
    WK -. "traces" .-> LF
```

A user signs in, pastes their own YouTube link, the backend processes it on demand into their own per-user vector collection, and they search it to get a grounded answer with clickable timestamped citations. Ingestion is a background job — it never blocks the request path.
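As a rough illustration of the per-user collection idea, the sketch below maps a verified user id to a `tenant_{id}` collection name and rejects anything that does not look like an id. The function name `collection_name` and the exact character rule are assumptions made for this example; the real validation lives in rag/security.py and may differ.

```python
import re

# Assumed rule for this sketch: 1-48 characters of letters, digits, "_" or "-".
_TENANT_ID = re.compile(r"[A-Za-z0-9_-]{1,48}")


def collection_name(tenant_id: str) -> str:
    """Map a verified user id to its private collection name, or raise."""
    if not _TENANT_ID.fullmatch(tenant_id):
        # Refusing early keeps a crafted id from addressing another collection.
        raise ValueError("invalid tenant id")
    return f"tenant_{tenant_id}"
```

Because the name is derived only from the authenticated identity, a request has no parameter through which it could name another user's collection.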

Stack

| Layer | Tech |
| --- | --- |
| Frontend | Next.js 15 (App Router) · React 19 · Clerk · Tailwind 4 · TypeScript, on Vercel |
| API | FastAPI · Pydantic · uvicorn |
| Ingest queue | Celery + Redis worker (fetch → chunk → embed, off the request path) |
| Vector DB | ChromaDB (standalone server; one collection per tenant) |
| Embeddings | OpenAI `text-embedding-3-small` |
| LLM synthesis | OpenAI `gpt-4o-mini` (grounded, citation-only answers) |
| Persistence | Postgres · SQLAlchemy 2.0 (typed) · Alembic migrations |
| Auth | Clerk (server-side JWT verification in FastAPI: RS256-pinned, JWKS, fail-closed) |
| Observability | Langfuse (per-search traces, retrieval scores, sessions) |
| Tooling | `uv` · `pytest` (377 tests) · `mypy --strict` (no `Any`) · GitHub Actions CI |

What it does

Sign in (Clerk) and paste a YouTube URL.
Real on-demand ingest. A Celery worker fetches captions, splits them into ~400-token chunks with ~50-token overlap, embeds them, and upserts into your Chroma collection — reporting live phase/progress (FETCH → PARSE → CHUNK → EMBED → READY) that the UI polls.
Ask a question. Your query is embedded, run as a vector similarity search against your collection, and the retrieved chunks (with their metadata) are passed to the LLM.
Get a grounded answer with inline citations like [How To Talk To Users — 01:14:30], where each timestamp is clickable and deep-links to that exact moment on YouTube. The timestamp comes straight from the chunk's start_time metadata. The chunking pipeline is what makes citations possible, so that metadata is never dropped during embedding.
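To make the link between chunking and citations concrete, here is a minimal sketch of a segment-aggregating splitter that carries start_time through, plus the two helpers a citation needs. Everything here is illustrative: the class and function names are hypothetical, whitespace word counts stand in for real token counts, and the `&t=<seconds>s` URL form is the standard YouTube timestamp parameter rather than something taken from this repo.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    text: str
    start_time: float  # seconds from the start of the episode


@dataclass(frozen=True)
class Chunk:
    text: str
    start_time: float  # inherited from the first segment in the chunk


def n_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for tokenizer counts.
    return len(text.split())


def _close(window: List[Segment]) -> Chunk:
    return Chunk(" ".join(s.text for s in window), window[0].start_time)


def _overlap_tail(window: List[Segment], overlap_tokens: int) -> List[Segment]:
    # Keep trailing segments until roughly overlap_tokens are covered.
    # The first segment is never kept, so every new chunk makes progress.
    tail: List[Segment] = []
    count = 0
    for seg in reversed(window[1:]):
        if count >= overlap_tokens:
            break
        tail.insert(0, seg)
        count += n_tokens(seg.text)
    return tail


def chunk_segments(segments: List[Segment], max_tokens: int = 400,
                   overlap_tokens: int = 50) -> List[Chunk]:
    chunks: List[Chunk] = []
    window: List[Segment] = []
    for seg in segments:
        size = sum(n_tokens(s.text) for s in window)
        if window and size + n_tokens(seg.text) > max_tokens:
            chunks.append(_close(window))
            window = _overlap_tail(window, overlap_tokens)
        window.append(seg)
    if window:
        chunks.append(_close(window))
    return chunks


def timestamp_label(seconds: float) -> str:
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def deep_link(video_id: str, seconds: float) -> str:
    return f"https://www.youtube.com/watch?v={video_id}&t={int(seconds)}s"
```

Overlap here is whole caption segments rather than an exact token count, which keeps every chunk boundary aligned with a real timestamp.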

The answer is grounded: if retrieval doesn't clear a similarity floor, the system says so rather than hallucinating. Six distinct result states (answer, low-confidence, no-match, empty-index, error, …) are rendered explicitly.
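The floor check can be pictured as a small pure function that runs before the LLM is called. This is a sketch under assumptions: the two thresholds and the way `low_confidence` is separated from `no_match` are invented for illustration, and only four of the six states are covered.

```python
from typing import Sequence

SIMILARITY_FLOOR = 0.30  # assumed value, not taken from the repo
CONFIDENT_SCORE = 0.50   # assumed value, not taken from the repo


def decide_status(scores: Sequence[float], indexed_chunks: int) -> str:
    """Pick a result state from retrieval scores alone."""
    if indexed_chunks == 0:
        return "empty_index"      # nothing ingested yet
    best = max(scores, default=0.0)
    if best < SIMILARITY_FLOOR:
        return "no_match"         # decline rather than guess
    if best < CONFIDENT_SCORE:
        return "low_confidence"   # answer, but flag it
    return "answer"
```

Deciding the state from scores first means the "decline" path never depends on the model choosing to admit it has no evidence.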

Observability — the thesis of this project

If you don't meter it, you can't improve it. The first capability I built into this system isn't a feature users see — it's the one I need to know whether everything else is working. An AI system whose cost, latency, and answer quality you can't observe is a system you can only guess about.

Every search() call emits one structured trace (via Langfuse), so each question is fully accountable end-to-end:

A retrieval span — embedding generation + the vector query, recording the retrieved chunk IDs and their similarity scores. This is where you see what the system actually pulled before the LLM ever runs.
An LLM generation — model, token usage, and dollar cost, captured automatically by the langfuse.openai drop-in client. No manual accounting; the cost of every answer is just there.
The trace root — query → flattened result (status, answer, best_score, citations, latency).

Crucially, the trace doesn't just store JSON — it emits queryable scores, so "what's my grounding rate?" or "what's p95 retrieval quality?" are dashboard queries, not log greps:

| Score | Type | What it tells you |
| --- | --- | --- |
| `status` | categorical | answer / low_confidence / no_match / …: the outcome mix |
| `best_score` | numeric | top retrieval similarity: is the index actually relevant? |
| `citation_count` | numeric | how grounded each answer is |
| `grounded` | boolean | did we answer from sources, or decline? → grounding rate |
| `injection_suspected` | boolean | prompt-injection guard fired on the query |
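Once scores are structured, the dashboard questions reduce to simple aggregations. The sketch below computes them locally over a list of score rows to show the arithmetic. The row shape (a dict per search) and the function names are assumptions, and in practice the Langfuse UI would run the equivalent queries.

```python
from collections import Counter
from statistics import quantiles
from typing import Dict, List


def grounding_rate(rows: List[dict]) -> float:
    """Share of searches that were answered from sources."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r["grounded"]) / len(rows)


def status_mix(rows: List[dict]) -> Dict[str, int]:
    """Count of each outcome state."""
    return dict(Counter(r["status"] for r in rows))


def p95_best_score(rows: List[dict]) -> float:
    """95th percentile of top retrieval similarity (needs two or more rows)."""
    scores = [r["best_score"] for r in rows]
    return quantiles(scores, n=20, method="inclusive")[-1]
```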

Failures are first-class too: real exceptions raise the trace to level=ERROR (with the true cause, while the user-facing reason stays sanitized); an empty query is level=WARNING (validation, not failure) — so the error views show real problems, not noise. Traces carry service.name, environment (dev/test), and a stable session_id so a visitor's searches group together.
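One way to picture the level rules above is a single function that returns the trace level, the internal cause, and the sanitized reason shown to the user. The function name, the tuple shape, and the message strings are hypothetical; `DEFAULT`, `WARNING`, and `ERROR` are Langfuse's observation level names.

```python
from typing import Optional, Tuple

GENERIC_REASON = "Something went wrong. Please try again."  # assumed wording


def classify(query: str,
             error: Optional[BaseException]) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (trace_level, internal_cause, user_facing_reason)."""
    if not query.strip():
        # Validation problem, not a system failure.
        return ("WARNING", "empty query", "Please enter a question.")
    if error is not None:
        # The trace keeps the true cause; the user sees a generic reason.
        return ("ERROR", f"{type(error).__name__}: {error}", GENERIC_REASON)
    return ("DEFAULT", None, None)
```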

It's a clean seam, zero-cost when off. Everything routes through rag/observability.py (is_enabled(), trace_context(), record_score(), re-exported @observe), gated on a configured client. With no LANGFUSE_* env set, @observe is a passthrough and the app — and the entire 377-test suite — behaves identically. Observability you can turn on in prod and off in tests, without an if statement in the business logic.

```
Search trace
├─ retrieval span        embed query · vector search · chunk ids + scores
└─ llm generation        gpt-4o-mini · tokens · cost  (auto-captured)
   scores: status · best_score · citation_count · grounded · injection_suspected
```
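The "zero-cost when off" seam can be sketched without Langfuse at all. In the toy version below, an in-memory event list stands in for the real client. The class name `Observability`, the `events` list, and the choice of which env vars gate the seam are assumptions; only the method names `is_enabled`, `record_score`, and `observe` echo the ones listed above.

```python
import functools
from typing import Any, Callable, List, Mapping, Tuple

_REQUIRED = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")  # assumed gate


class Observability:
    def __init__(self, env: Mapping[str, str]) -> None:
        self._enabled = all(env.get(key) for key in _REQUIRED)
        self.events: List[Tuple[str, Any]] = []  # stand-in for a real client

    def is_enabled(self) -> bool:
        return self._enabled

    def record_score(self, name: str, value: Any) -> None:
        if self._enabled:
            self.events.append((name, value))

    def observe(self, fn: Callable) -> Callable:
        if not self._enabled:
            return fn  # passthrough: the original function, no wrapper

        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            self.events.append(("span", fn.__name__))
            return fn(*args, **kwargs)

        return wrapper
```

An application would build one instance from `os.environ` at import time, so business code calls `observe` and `record_score` unconditionally and never branches on configuration.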

Engineering depth

The features below are where "RAG demo" becomes "production system":

Multi-tenancy with hard isolation. Each user gets their own Chroma collection (tenant_{id}, validated in rag/security.py). A second account cannot see the first's episodes — enforced at the data layer, not just the UI, and covered by an isolation test.
Async ingest that survives restarts. Celery runs with acks_late and a fork-safe SQLAlchemy engine reset (worker_process_init → db.reset_engine()), and the task is idempotent (terminal-state guard + deterministic chunk IDs), so a job redelivered after a deploy or OOM kill never double-spends or double-writes.
Real server-side auth. The backend doesn't trust the proxy — it verifies the Clerk session JWT itself: RS256 pinned, JWKS-fetched, iss/exp/azp validated, fail-closed when unconfigured. A dev-only header seam keeps the test suite hermetic without weakening prod.
Usage metering. Every job records its embedding token spend + estimated cost into an append-only usage_ledger (cost accounting kept; the external billing provider was removed for this open-source release — see Scope below).
Two-phase deletes with real undo. Removing an episode soft-deletes first (instant, reversible — the UI shows an undo toast), then purges the vectors when the window closes; a server-side sweep finishes the job for clients that vanish mid-window. Spend records survive deletion, and a purge can never race an in-flight re-ingest of the same video.
Observability built in. Full Langfuse tracing of every search — see the Observability section above (the thesis of this project).
Strict-typed, fully tested. mypy --strict clean with no Any; 377 tests (unit + integration + retrieval evals); CI blocks every red PR.
Security posture. CORS locked to the frontend origin, per-user Redis-backed rate limits on /ingest and /search, SSRF/abuse review on the user-supplied URL (host allowlist), and prompt-injection guards on the retrieval path. See SECURITY.md.
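The idempotency claim in the async-ingest item rests on two small ideas, which this sketch isolates using plain dicts in place of Chroma and Postgres. The function names, the `FAILED` state, the id recipe, and the job/ledger shapes are all assumptions for illustration.

```python
import hashlib
from typing import Dict, List

TERMINAL = {"READY", "FAILED"}  # FAILED is an assumed terminal state


def chunk_id(tenant_id: str, video_id: str, index: int) -> str:
    """Same inputs always give the same id, so a retry overwrites in place."""
    raw = f"{tenant_id}:{video_id}:{index}".encode()
    return hashlib.sha256(raw).hexdigest()[:32]


def run_ingest(job: dict, chunks: List[str],
               store: Dict[str, str], ledger: List[dict]) -> bool:
    """Return True if work was done, False if the job was already finished."""
    if job["status"] in TERMINAL:
        return False  # redelivered after completion: do nothing
    for i, text in enumerate(chunks):
        # Upsert keyed by a deterministic id: repeats replace, never add.
        store[chunk_id(job["tenant_id"], job["video_id"], i)] = text
    # In a real system these two writes would share one DB transaction.
    ledger.append({"job_id": job["id"], "chunks": len(chunks)})
    job["status"] = "READY"
    return True
```

A worker killed partway through the loop leaves some vectors written and the job non-terminal; the redelivered run rewrites the same ids and records spend exactly once.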

Run it locally

Bring your own keys. This repo is self-host only — there is no shared hosted instance. You supply your own OpenAI / Clerk / (optional) Supadata keys. Secrets live only in gitignored .env files, never in the repo.

Quick start (Docker Compose)

The entire app — Postgres, Redis, Chroma, the FastAPI web tier, the Celery ingest worker, the Next.js frontend, and a fully-initialized Langfuse (the observability showcase) — comes up with one command:

```
cp .env.example .env       # then set OPENAI_API_KEY (the only required secret)
docker compose up -d --build
docker compose run --rm web alembic upgrade head   # apply DB migrations (once)
open http://localhost:3000                         # the app
open http://localhost:3001                         # Langfuse (traces) — login below
curl http://localhost:8000/health                  # {"status":"ok",...}
```

The containers wire themselves together over service DNS; nothing else in .env is required — tracing works immediately against the bundled Langfuse (see Observability for the UI login and how to point at Langfuse Cloud instead). Host ports: app on :3000, Langfuse on :3001, API on :8000, Postgres on :55432, Redis on :6380, Chroma on :8001 (offset from the usual defaults so they don't collide with other local stacks). First boot pulls the Langfuse images (ClickHouse, MinIO, etc.) — give it a couple of minutes.

Sign-in needs Clerk keys in .env (free — see "Auth for local dev" below). Without them the frontend still builds and runs; auth is simply inert.

Working on the frontend itself? Run it natively with hot reload instead of the container (both bind :3000, so stop one or the other):

```
docker compose stop frontend
cd web-next
cp .env.local.example .env.local   # native dev reads THIS, not the root .env
npm install && npm run dev          # http://localhost:3000
```

A fresh stack has an empty index — sign in and paste a YouTube link (or use the dev-trust header below) to ingest your first episode.

Auth for local dev

Two options, depending on how much of the product you want to exercise:

Real Clerk flow (the full UI). Create a free Clerk app, then in .env set its keys (NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY, CLERK_SECRET_KEY) and CLERK_JWT_ISSUER (your Clerk Frontend API origin — the backend verifies session JWTs against its JWKS). Then rebuild: docker compose up -d --build — a plain restart is NOT enough for the frontend, because the publishable key is inlined into the client bundle at image build time. (Running the frontend natively instead? Its keys go in web-next/.env.local.)
Dev-trust header (API only). Set AUTH_DEV_TRUST_HEADER=1 in .env and the backend accepts an X-User-Id: <anything> header as the identity:
```
curl -X POST localhost:8000/ingest -H 'X-User-Id: dev' \
     -H 'Content-Type: application/json' \
     -d '{"youtube_url": "https://www.youtube.com/watch?v=..."}'
```
Dev-only. It bypasses JWT verification entirely — never set it anywhere reachable from the internet. Default: off.
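The interaction between the two options can be sketched as one identity-resolution function. This is a hypothetical shape, not the code in rag/auth.py: `resolve_user_id`, `AuthError`, and the injected `verify_jwt` callable are names invented here, and header lookup is exact-match for brevity where a real framework would be case-insensitive.

```python
from typing import Callable, Mapping, Optional


class AuthError(Exception):
    pass


def resolve_user_id(headers: Mapping[str, str], env: Mapping[str, str],
                    verify_jwt: Optional[Callable[[str], str]]) -> str:
    # Dev seam: only honoured when the flag is explicitly set to "1".
    if env.get("AUTH_DEV_TRUST_HEADER") == "1":
        dev_id = headers.get("X-User-Id")
        if dev_id:
            return dev_id
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        raise AuthError("missing bearer token")
    if verify_jwt is None:
        # Fail closed: no verifier configured means nobody gets in.
        raise AuthError("auth not configured")
    return verify_jwt(auth[len("Bearer "):])
```

With the flag unset, an X-User-Id header is simply ignored, which is what keeps the seam from weakening production.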

Observability (Langfuse) — bundled, zero config

The compose stack bundles a full self-hosted Langfuse (the langfuse-* services) pre-initialized with a local project and API keys, and the app defaults to it — so the trace stream described above flows the moment the stack is up. Run a search, then open the instrument panel:

UI: http://localhost:3001 — login dev@pdse.local / pdse-local-dev (set LANGFUSE_UI_PORT in .env if 3001 is taken on your machine).
One trace per search: retrieval span with chunk IDs + similarity scores, LLM generation with token usage and dollar cost, the queryable scores (status, best_score, citation_count, grounded), and per-visitor session grouping via X-Session-Id. Both the web tier (search traces) and the worker (ingest embedding calls) are wired.

The bundled instance's credentials are fixed local-dev values bound to loopback — fine for local dev, not a deployment recipe (for that, use Langfuse's self-hosting guide).

Prefer your own backend? Your .env overrides the bundled defaults:

Langfuse Cloud: create a project at cloud.langfuse.com, then set LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_BASE_URL=https://cloud.langfuse.com (a cloud URL is reachable from containers and native processes alike).
An existing self-hosted instance on this machine (UI on, say, :3001): containers can't see your host's localhost, so set both LANGFUSE_BASE_URL=http://localhost:3001 (native processes) and LANGFUSE_BASE_URL_DOCKER=http://host.docker.internal:3001 (containers — host.docker.internal is native on Docker Desktop; the compose file adds the host-gateway mapping so it works on Linux too).

Native (non-compose) runs have no bundled default: with no LANGFUSE_* set, the observability seam (src/rag/observability.py) makes every tracing call a no-op — same code path, nothing emitted, no latency added.

Native development (uv)

The dependency manager is uv. The tests and the type gate are fully hermetic (no services needed; DB tests self-skip without Postgres):

```
uv sync
uv run pytest            # unit + integration + retrieval evals
uv run mypy src          # strict, no Any
```

uv run pytest never touches your real database, with or without the compose stack running. The DB-backed tests wipe tables between tests, so pytest rewrites DATABASE_URL to a dedicated test database: it derives <name>_test from the configured URL (default: pdse_test) and creates it on demand on the local server (see tests/_db_isolation.py). To point tests somewhere else, set TEST_DATABASE_URL; its database name must end in _test, and anything else fails the run loudly. If the schema in pdse_test ever drifts after a model change, just drop it (docker compose exec postgres psql -U pdse -d postgres -c 'DROP DATABASE pdse_test'); the next run recreates it.
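The URL rewriting described above can be approximated with the standard library alone. This is a sketch of the rule, not the contents of tests/_db_isolation.py: the function names are hypothetical, the example URL is made up, and leaving a name that already ends in _test untouched is an assumption.

```python
from typing import Mapping
from urllib.parse import urlsplit, urlunsplit


def derive_test_url(database_url: str) -> str:
    """Swap the database name in the URL for <name>_test."""
    parts = urlsplit(database_url)
    name = parts.path.lstrip("/")
    if not name:
        raise ValueError("DATABASE_URL has no database name")
    if not name.endswith("_test"):
        name = f"{name}_test"
    return urlunsplit(parts._replace(path=f"/{name}"))


def resolve_test_url(env: Mapping[str, str]) -> str:
    override = env.get("TEST_DATABASE_URL")
    if override:
        # Guard rail: refuse to wipe tables in anything not named *_test.
        if not urlsplit(override).path.lstrip("/").endswith("_test"):
            raise RuntimeError("TEST_DATABASE_URL must name a *_test database")
        return override
    return derive_test_url(env["DATABASE_URL"])
```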

To iterate on the backend natively against the compose infra, the .env written from .env.example already points at the host-mapped ports (Postgres :55432, Redis :6380; set CHROMA_SERVER_URL=http://localhost:8001 so a native web/worker shares the compose Chroma — see .env.example):

```
uv run uvicorn server:app --app-dir src --reload   # http://127.0.0.1:8000

# Ingest worker (separate terminal). macOS note: Apple's ObjC runtime aborts
# forked workers — use --pool=solo locally (the Linux container uses prefork).
uv run celery -A worker.celery_app worker --loglevel=info --pool=solo
```

The pure RAG path needs none of that: uv run python src/ingest.py and the search pipeline work with just OPENAI_API_KEY and a local on-disk Chroma (leave CHROMA_SERVER_URL unset) — the Postgres/Redis/Clerk machinery only exists for the multi-tenant SaaS flow.

Full deployment runbook (Railway + Vercel + Supabase, the Chroma server-mode gotcha, the captions backends): DEPLOY.md.

Repo map

```
src/
  server.py        FastAPI app: /search /ingest /jobs/{id} /me/episodes (+removal trio) /health
  worker.py        Celery app + run_ingest task (fetch → chunk → embed)
  ingest.py        Offline batch pipeline (the original build; reused by the worker)
  rag/             Pure logic — no framework coupling:
    pipeline.py      retrieve → ground → synthesize
    transcript.py    caption/metadata fallback chain (resilient to YouTube blocks)
    chunker.py       segment-aggregating ~400-token splitter w/ timestamp metadata
    embeddings.py    OpenAI embeddings + token/cost accounting
    vectorstore.py   Chroma client/server-mode, per-tenant collections
    auth.py          Clerk JWT verification (RS256, JWKS, fail-closed)
    security.py      tenant-id validation + prompt-injection guards
    repository.py    typed data layer (users · jobs · usage_ledger)
    models_db.py     SQLAlchemy 2.0 models
    observability.py Langfuse seam (zero-cost when off)
    ratelimit.py     per-user Redis rate limits
tests/             377 tests (unit · integration · retrieval evals)
web-next/          Next.js + Clerk frontend
alembic/           migrations
docker-compose.yml full local stack: postgres · redis · chroma · web · worker ·
                   frontend · bundled Langfuse (6 langfuse-* services)
```

License

MIT.
