Deploy

PDSE ships as a SaaS: Next.js on Vercel → proxies to this Python backend on Railway. This doc covers the Railway backend (FastAPI web + Celery worker + Redis + Postgres + a persistent volume for the vector DB). The Next.js front end deploys separately on Vercel and is out of scope here.

Status: config + runbook only. No deploy has been run — the Railway CLI is installed but not logged in. The owner must run railway login (interactive, opens a browser) before anything below touches Railway.


Architecture

                        ┌────────────────────────────────────────────┐
   Browser ── HTTPS ──▶ │ Vercel (Next.js)  ── proxy ──▶ Railway      │
                        │                                  backend     │
                        └────────────────────────────────────────────┘
                                                  │
        ┌─────────────────────────────────────────┼───────────────────────────┐
        │ Railway project                          ▼                           │
        │                                                                       │
        │  ┌──────────────────┐   enqueue job   ┌──────────────────┐           │
        │  │ web (FastAPI)    │ ───────────────▶│ worker (Celery)  │  (T1)     │
        │  │ uvicorn          │   via Redis     │ celery worker    │           │
        │  │ /search /health  │◀─── job status ─│ ingest pipeline  │           │
        │  └────────┬─────────┘                 └────────┬─────────┘           │
        │           │ HttpClient (CHROMA_SERVER_URL)      │                     │
        │           └─────────────────┬───────────────────┘                     │
        │                             ▼                                         │
        │   ┌───────────────────────────────────────────────────┐             │
        │   │ chroma  (chromadb/chroma image) — own service      │             │
        │   │ HTTP :8000     Volume → its data dir (the vectors) │             │
        │   └───────────────────────────────────────────────────┘             │
        │                                                                       │
        │   ┌──────────────┐   ┌──────────────┐                                │
        │   │ Redis plugin │   │ Postgres     │  (auth/accounts, usage meter)  │
        │   │ broker+result│   │ plugin       │                                │
        │   └──────────────┘   └──────────────┘                                │
        └───────────────────────────────────────────────────────────────────┘

Both web and worker are built from the same Dockerfile in this repo; they differ only by start command (see below). The Celery app lives at src/worker.py (T1, built), so both services are deployable.


Services

| Service | Built from | Start command | Notes |
|---|---|---|---|
| web | Dockerfile | uvicorn server:app --app-dir src --host 0.0.0.0 --port $PORT | Healthcheck GET /health. railway.json is its config. |
| worker | Dockerfile | celery -A worker.celery_app worker --loglevel=info --concurrency=2 | T1 (built). src/worker.py defines celery_app. Set PYTHONPATH=src. No healthcheck. Linux prefork is fine. |
| chroma | chromadb/chroma image | (image default; serves HTTP on :8000) | Standalone vector DB. Own Railway service from the official image. Attach the persistent volume HERE (its data dir). web + worker connect via CHROMA_SERVER_URL (see "Vector DB: standalone Chroma server" and "Environment variables"). |
| Redis | Railway plugin | n/a | Celery broker + result backend. Injects REDIS_URL. |
| Postgres | Railway plugin | n/a | Accounts/auth + usage metering. Injects DATABASE_URL. |

The worker is built: src/worker.py exposes celery_app + the worker.run_ingest task. No Dockerfile change vs web; only the start command differs. worker.py wires the worker_process_init signal to db.reset_engine() so each forked prefork child gets its own SQLAlchemy connection pool (the SQLAlchemy + fork footgun). On Linux (the deploy target) prefork is correct and fast.

Local macOS dev only: Apple's Objective-C runtime aborts (SIGABRT) if a process forks after touching ObjC-backed libs, which yt-dlp/SSL do. Run the worker with OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES or --pool=solo when developing on macOS. This does NOT affect the Linux container.
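
The fork footgun above can be illustrated without SQLAlchemy or Celery. The following is a minimal stdlib sketch, not the code in src/worker.py or db.py: `EngineHolder` and its `factory` argument are hypothetical names, and a plain `object()` stands in for a real engine. It only shows why a reset hook run in each freshly forked child gives that child its own pool instead of the one inherited from the parent.

```python
class EngineHolder:
    """Lazily builds one engine per holder and can drop it on demand."""

    def __init__(self, factory):
        # Hypothetical: in real code the factory would build a SQLAlchemy engine.
        self._factory = factory
        self._engine = None

    def get(self):
        # Build on first use, then reuse for the lifetime of this process.
        if self._engine is None:
            self._engine = self._factory()
        return self._engine

    def reset(self):
        # What a reset_engine()-style hook does in a forked child: forget the
        # inherited engine so the next get() builds a fresh connection pool.
        self._engine = None


holder = EngineHolder(factory=object)  # object() stands in for an engine

parent_engine = holder.get()
same_again = holder.get()      # reused within one process
holder.reset()                 # simulates the hook firing in a forked child
child_engine = holder.get()    # a new engine, not the inherited one
```

In the real worker the reset is triggered by Celery's worker_process_init signal, as described above; this sketch calls `reset()` by hand to keep it dependency-free.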

Vector DB: standalone Chroma server (critical — read this)

web and worker are separate containers with separate disks. If each opened its own on-disk Chroma (the old PersistentClient(path="data/chroma") default), the worker would ingest into ITS disk and the web tier would search ITS own empty disk — the live index_empty bug (ingest reports chunks_indexed=1, /search returns index_empty). The fix: run Chroma as one standalone service that both connect to over the private network.

  • Add a chroma service from the official chromadb/chroma image. It serves the Chroma HTTP API on :8000 by default. Pin the image to a version compatible with the client we ship (chromadb 1.5.x — see pyproject.toml). HttpClient validates the tenant/database at connect time against the server's API; a large client↔server version skew surfaces as Could not connect to tenant default_tenant / 404 / 422 at startup (chroma-core/chroma #3410, #1392). Same major.minor is the safe bet.
  • Startup ordering: unlike the old on-disk client, HttpClient does a real network round-trip when VectorStore is first constructed (at the lifespan log line and on the first /health or /search request). If chroma isn't reachable yet, web and worker raise at first use; Railway's restart-on-failure recovers once chroma is up. The client never caches a failed connection, so a transient blip self-heals on the next request with no manual restart.
  • Attach the persistent volume to the chroma service (mounted at the image's data dir — chromadb/chroma defaults to /data; set IS_PERSISTENT=1 / PERSIST_DIRECTORY per the image docs). The vectors live HERE now. This is the only service that needs a volume.
  • Set CHROMA_SERVER_URL on BOTH web and worker to the service's private URL: CHROMA_SERVER_URL=http://chroma.railway.internal:8000 (Railway private networking; plain http, no TLS). VectorStore (src/rag/vectorstore.py) then opens a chromadb.HttpClient against it instead of local disk — same collections, same cosine config, same per-tenant isolation, just over the wire.
  • web + worker no longer need the /app/data/chroma volume — remove the old per-service volume. With CHROMA_SERVER_URL set, the data/chroma path is ignored entirely; nothing is written to the web/worker container disk.
  • Populate the index by running ingest against the SAME server: a user /ingest job (the worker, with CHROMA_SERVER_URL set) writes to the chroma service, and the web tier's /search reads it back immediately.

Local dev: the repo's docker-compose.yml runs this whole topology (postgres + redis + chroma + web + worker) with the containers' CHROMA_SERVER_URL pre-wired to the chroma service — see README "Run it locally". For tests / python src/ingest.py / the pure RAG path: leave CHROMA_SERVER_URL unset. VectorStore falls back to the on-disk PersistentClient at data/chroma exactly as before — no server required. A set-but-blank value is treated as unset. A configured-but-unparseable URL raises (fail loud) rather than silently reading an empty local disk.
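
The selection rules above (set ⇒ HTTP server, unset or blank ⇒ local disk, unparseable ⇒ fail loud) can be sketched as a small pure function. This is an illustration under assumptions, not the code in src/rag/vectorstore.py: `resolve_chroma_target` is a hypothetical name, and defaulting a missing port to 8000 is an assumption based on the image's default port.

```python
import os
from urllib.parse import urlparse


def resolve_chroma_target(env=None, default_path="data/chroma"):
    """Decide which Chroma client to open, based on CHROMA_SERVER_URL."""
    env = os.environ if env is None else env
    raw = (env.get("CHROMA_SERVER_URL") or "").strip()

    # Unset or set-but-blank: fall back to the on-disk client (dev/tests).
    if not raw:
        return ("persistent", default_path)

    parsed = urlparse(raw)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        # Fail loud instead of silently reading an empty local disk.
        raise ValueError(f"CHROMA_SERVER_URL is set but unusable: {raw!r}")

    # Assumption: no explicit port means Chroma's default HTTP port.
    port = parsed.port or 8000
    use_ssl = parsed.scheme == "https"
    return ("http", parsed.hostname, port, use_ssl)


# The caller would then open the matching client, for example:
#   chromadb.HttpClient(host=host, port=port, ssl=use_ssl)
#   chromadb.PersistentClient(path=default_path)
target = resolve_chroma_target(
    {"CHROMA_SERVER_URL": "http://chroma.railway.internal:8000"}
)
```

Keeping the decision separate from client construction makes the three cases easy to unit-test without a running server.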

Captions + metadata backends (critical for the worker on Railway)

YouTube blocks unauthenticated extraction from datacenter IPs — the "Sign in to confirm you're not a bot" wall. This is an IP-reputation block that cookies, proxies, and PO-tokens don't reliably beat. yt-dlp and youtube-transcript-api both hit YouTube's timedtext endpoints, so both fail from Railway where a residential IP succeeds. The ingest worker therefore can't rely on yt-dlp in prod.

src/rag/transcript.py runs an env-driven fallback chain that prefers backends whose egress is from clean infrastructure:

fetch_segments(url)  ── captions ──▶
   1. youtube-transcript-api (free, no key)  ── IP-blocked on Railway ─▶ fall through
   2. Supadata  (SUPADATA_API_KEY)           ── reliable from any IP ──▶ timed segments
   3. yt-dlp    (local dev only)             ── blocked on Railway

fetch_metadata(url) ── title ──▶
   1. YouTube Data API v3 (YOUTUBE_DATA_API) ── 200 from any IP ───────▶ title
   2. yt-dlp --dump-json (local dev only)    ── blocked on Railway
  • SUPADATA_API_KEY (https://supadata.ai) is the path that actually works from Railway — Supadata owns the YouTube IP battle and returns timed segments (it converts ms offsets to seconds for citations). Long videos (>20 min — every podcast) come back as an async jobId the worker polls. Without this key the worker can only fall to yt-dlp, which YouTube blocks from the datacenter — so set it on the worker service.
  • YOUTUBE_DATA_API (Google Cloud console → enable "YouTube Data API v3") fetches the video title from any IP, free 10k units/day. Set it on worker.
  • Both are optional for local dev (unset ⇒ the chain falls back to yt-dlp, which works from a residential IP — the test suite and python src/ingest.py behave unchanged with no keys). Lazy + env-gated: no API is called unless its key is set.
  • The yt-dlp anti-bot env (YT_DLP_COOKIES_B64 / YT_DLP_COOKIES_FILE / YT_DLP_PROXY) remains as the local fallback's plumbing; in prod prefer the API backends above over fighting the bot wall with cookies.
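
The env-gated fallback chain described in this section can be sketched as follows. This is a simplified illustration, not the contents of src/rag/transcript.py: `run_chain`, `BackendUnavailable`, and the three stub functions are hypothetical, and the stubs only mimic the behavior this doc describes (blocked from a datacenter IP, or returning segments with millisecond offsets converted to seconds).

```python
import os


class BackendUnavailable(Exception):
    """A backend could not serve the request (blocked IP, upstream error)."""


def run_chain(url, backends, env=None):
    """Try each (name, required_env_key, fetch) backend in order.

    A backend whose key is unset is skipped without being called (env-gated);
    a backend that raises BackendUnavailable falls through to the next one.
    """
    env = os.environ if env is None else env
    errors = []
    for name, required_key, fetch in backends:
        if required_key and not env.get(required_key):
            errors.append(f"{name}: skipped, {required_key} unset")
            continue
        try:
            return name, fetch(url)
        except BackendUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise BackendUnavailable("; ".join(errors))


# Stub backends standing in for the real clients (hypothetical behavior).
def transcript_api_stub(url):
    raise BackendUnavailable("IP-blocked")


def supadata_stub(url):
    offset_ms = 1500  # assumed: upstream offsets arrive in milliseconds
    return [{"text": "hello", "start": offset_ms / 1000}]


def yt_dlp_stub(url):
    raise BackendUnavailable("bot wall")


CAPTION_CHAIN = [
    ("youtube-transcript-api", None, transcript_api_stub),
    ("supadata", "SUPADATA_API_KEY", supadata_stub),
    ("yt-dlp", None, yt_dlp_stub),
]

name, segments = run_chain(
    "https://example.invalid/watch", CAPTION_CHAIN,
    env={"SUPADATA_API_KEY": "dummy"},
)
```

With the key set, the chain stops at the Supadata stub; with it unset, every backend is exhausted and the collected reasons are raised together, which mirrors why the key matters on the worker service.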

Environment variables

Set per-service in the Railway dashboard (or railway variables). Plugin URLs (REDIS_URL, DATABASE_URL) are injected automatically when you add the plugin.

Var web worker Source / notes
OPENAI_API_KEY Embeddings + LLM synthesis. Required.
PORT Injected by Railway; uvicorn binds it. Do not hardcode.
REDIS_URL From the Redis plugin. Web enqueues, worker consumes.
CELERY_BROKER_URL T1. Set to ${{Redis.REDIS_URL}} (broker).
CELERY_RESULT_BACKEND T1. Set to ${{Redis.REDIS_URL}} (results), or a Postgres URL.
DATABASE_URL From the Postgres plugin. Accounts/auth + usage metering.
CHROMA_SERVER_URL Required in prod. Points web + worker at the standalone chroma service so they share ONE index. http://chroma.railway.internal:8000 (private network, http). Unset ⇒ local on-disk PersistentClient (dev/tests only; broken across two containers).
SUPADATA_API_KEY Required on worker in prod. Reliable captions from a datacenter IP (yt-dlp/timedtext are bot-walled on Railway). Unset ⇒ worker can only fall to yt-dlp, which YouTube blocks. See "Captions + metadata backends".
YOUTUBE_DATA_API Recommended on worker. YouTube Data API v3 key — fetches the video title from any IP (free 10k/day). Unset ⇒ falls back to yt-dlp --dump-json (bot-walled on Railway).
SIM_FLOOR ◻︎ ◻︎ Optional retrieval-gate override (default 0.35).
TOP_K ◻︎ ◻︎ Optional (default 5).
LLM_MODEL ◻︎ ◻︎ Optional (default gpt-4o-mini).
EMBED_MODEL ◻︎ ◻︎ Optional (default text-embedding-3-small).
LANGFUSE_PUBLIC_KEY ◻︎ ◻︎ Optional tracing. Unset → tracing off, zero cost.
LANGFUSE_SECRET_KEY ◻︎ ◻︎ Optional tracing.
LANGFUSE_BASE_URL ◻︎ ◻︎ Optional tracing collector URL.
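
As a rough illustration of how the optional tuning variables and their defaults from the table could be read, here is a minimal sketch. `RetrievalSettings` and `load_settings` are hypothetical names, and treating a blank value as unset is an assumption carried over from how this doc describes CHROMA_SERVER_URL; the backend's real config loader may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalSettings:
    sim_floor: float
    top_k: int
    llm_model: str
    embed_model: str


def load_settings(env=None):
    """Read optional overrides, falling back to the documented defaults."""
    env = os.environ if env is None else env

    def get(name, default):
        # Assumption: a blank value counts as unset.
        value = (env.get(name) or "").strip()
        return value or default

    return RetrievalSettings(
        sim_floor=float(get("SIM_FLOOR", "0.35")),
        top_k=int(get("TOP_K", "5")),
        llm_model=get("LLM_MODEL", "gpt-4o-mini"),
        embed_model=get("EMBED_MODEL", "text-embedding-3-small"),
    )


defaults = load_settings({})
tuned = load_settings({"TOP_K": "8", "SIM_FLOOR": "0.5"})
```

Parsing into typed fields up front means a malformed override (for example TOP_K=abc) fails at startup rather than at the first search.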

Security (T3 — auth, CORS, rate limiting). The web tier verifies Clerk session JWTs on the request path; the worker does no auth (it consumes jobs the web tier already authorized):

Var web worker Notes
CLERK_JWT_ISSUER Clerk Frontend API origin. JWKS = ${ISSUER}/.well-known/jwks.json. Required — unset ⇒ every Bearer token rejected (fail closed). e.g. https://your-app.clerk.accounts.dev.
ALLOWED_ORIGINS Comma-separated CORS allowlist + Clerk azp allowlist. Set to your Vercel frontend origin in prod (e.g. https://<app>.vercel.app). NEVER *. Defaults to http://localhost:3000.
AUTH_DEV_TRUST_HEADER ◻︎ Dev only. 1 ⇒ also trust an X-User-Id header as identity. MUST be 0/unset in prod (else trivial impersonation). Default OFF.
RATE_LIMIT_INGEST_PER_HOUR ◻︎ Per-user ingest cap (default 10). Redis-backed; fails OPEN if Redis is down.
RATE_LIMIT_SEARCH_PER_MINUTE ◻︎ Per-user search cap for signed-in users (default 30).
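
The "Redis-backed; fails OPEN if Redis is down" behavior can be sketched like this. The fixed-window counting scheme, the key format, and the class names are assumptions made for illustration; this doc does not say which algorithm the backend uses. The client only needs `incr` and `expire`, which redis-py provides, and two in-memory fakes stand in for it here.

```python
import time


class FixedWindowLimiter:
    """Per-user counter in a fixed time window; allows traffic if the store fails."""

    def __init__(self, client, limit, window_seconds):
        self.client = client
        self.limit = limit
        self.window_seconds = window_seconds

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        window = int(now // self.window_seconds)
        key = f"rl:{user_id}:{window}"  # hypothetical key format
        try:
            count = self.client.incr(key)
            if count == 1:
                # First hit in this window: let the key expire on its own.
                self.client.expire(key, self.window_seconds)
        except Exception:
            # Fail open: a broken limiter store must not take the API down.
            # Real code would catch the client's connection error specifically.
            return True
        return count <= self.limit


class FakeRedis:
    """In-memory stand-in exposing the two calls the limiter uses."""

    def __init__(self):
        self.counts = {}

    def incr(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key, seconds):
        pass  # expiry is not simulated in this sketch


class DownRedis:
    def incr(self, key):
        raise ConnectionError("redis unreachable")


limiter = FixedWindowLimiter(FakeRedis(), limit=2, window_seconds=3600)
results = [limiter.allow("user_1", now=0) for _ in range(3)]
```

Failing open trades strict enforcement for availability: during a Redis outage the caps are not applied, which is the behavior the table documents.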

Usage metering (internal). The worker records each successful job's embedding tokens + estimated cost to the usage_ledger table (and on the jobs row); no external billing provider is involved and no extra env is required.

Auth (Clerk — Vercel front end). The front end signs requests with the Clerk session JWT (Authorization: Bearer …); the backend (CLERK_JWT_ISSUER, above) verifies it server-side and derives the user from the token sub:

| Var | Used by | Notes |
|---|---|---|
| CLERK_SECRET_KEY | Vercel | Server-side Clerk SDK (auth(), getToken()). |
| NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY | Vercel | Clerk client. |
| CLERK_JWT_ISSUER | web (backend) | JWKS issuer for backend verification (see Security table above). |

✅ required · ◻︎ optional · — not applicable

Identity flow (T3): browser → Vercel Route Handler / server component (web-next/src/lib/backend.ts) attaches the Clerk session JWT as Authorization: Bearer <jwt> → FastAPI require_user/optional_user verifies it (RS256 vs cached JWKS, iss/exp/azp) and resolves sub to the internal user. The browser never sends the identity itself.
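
The claim checks in the identity flow (iss, exp, azp, then sub) can be sketched as below. This deliberately leaves out the RS256 signature verification against the cached JWKS, which must happen first and needs a JWT library; the function only inspects claims that are assumed to be already signature-verified. `check_claims` and `parse_allowed_origins` are hypothetical names, and accepting a token with no azp claim is an assumption, not confirmed behavior of require_user.

```python
import time


def parse_allowed_origins(raw):
    """Split a comma-separated ALLOWED_ORIGINS value into a set of origins."""
    return {item.strip() for item in (raw or "").split(",") if item.strip()}


def check_claims(claims, issuer, allowed_origins, now=None):
    """Validate already signature-verified claims; return the subject."""
    now = time.time() if now is None else now

    if not issuer:
        # Fail closed: with no issuer configured, every token is rejected.
        raise PermissionError("CLERK_JWT_ISSUER unset")
    if claims.get("iss") != issuer:
        raise PermissionError("unexpected issuer")

    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        raise PermissionError("token expired or missing exp")

    azp = claims.get("azp")
    # Assumption: azp is only enforced when the token carries one.
    if azp is not None and azp not in allowed_origins:
        raise PermissionError("azp not in allowlist")

    sub = claims.get("sub")
    if not sub:
        raise PermissionError("missing sub")
    return sub


origins = parse_allowed_origins("http://localhost:3000, https://app.example")
claims = {
    "iss": "https://your-app.clerk.accounts.dev",
    "exp": 2_000,
    "azp": "http://localhost:3000",
    "sub": "user_123",
}
subject = check_claims(claims, "https://your-app.clerk.accounts.dev", origins, now=1_000)
```

The returned sub is what the backend would then resolve to its internal user; the browser never supplies that identity directly.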


Deploy runbook (owner runs this)

Prereqs: Railway CLI v5+ (installed), a Railway account.

# 1. Authenticate (INTERACTIVE — opens a browser). Owner must run this; an
#    automated agent cannot. Everything below depends on it.
railway login

# 2. Create / link the project (run from the repo root).
railway init            # or: railway link   (to attach to an existing project)

# 3. Add the managed plugins.
railway add --plugin redis
railway add --plugin postgres

# 4. Set secrets on the web service (repeat --set per var; see the table above).
railway variables --set "OPENAI_API_KEY=sk-..."
railway variables --set "CHROMA_SERVER_URL=http://chroma.railway.internal:8000"
#    REDIS_URL / DATABASE_URL are injected by the plugins automatically.

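# 5. Add the chroma service (official chromadb/chroma image) and attach the
#    persistent volume to IT, mounted at the image's data dir (/data by default;
#    see "Vector DB: standalone Chroma server" for WHY). web and worker get no
#    volume. Set the mount in the dashboard (chroma service → Settings → Volumes)
#    or, with the chroma service selected:
railway volume add --mount-path /data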

# 6. Deploy the web service (uses railway.json: Dockerfile build + /health check).
railway up

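# 7. Populate the index in the chroma service. In prod this goes through the
#    worker (step 8): a user /ingest job writes to chroma and /search reads it
#    back. The one-off below only reaches chroma if CHROMA_SERVER_URL resolves
#    from wherever the command runs.
railway run uv run python src/ingest.py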

# 8. (T1) Add the worker service from the SAME repo/image, override its start
#    command to the Celery command in the Services table, and give it the same
#    OPENAI_API_KEY / REDIS_URL / DATABASE_URL / CHROMA_SERVER_URL, plus
#    SUPADATA_API_KEY (and YOUTUBE_DATA_API) for captions and titles.

railway.json configures the web service (Dockerfile builder, start command, /health healthcheck, restart-on-failure). The worker is a second Railway service pointed at this same repo with the start command overridden.

Verify after deploy

curl -fsS "https://<your-web-service>.up.railway.app/health"
# → {"status":"ok","episodes_indexed":N,"chunks_indexed":M}

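If chunks_indexed is 0, either CHROMA_SERVER_URL is unset on web or worker (so each reads its own empty local disk) or nothing has been ingested into the chroma service yet (see "Vector DB: standalone Chroma server").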
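
For a scripted post-deploy check, the /health payload shown above can be classified with a few lines of Python. This is a minimal sketch: `diagnose_health` and its return labels are hypothetical, and the only fields it relies on are the three that appear in the sample response (status, episodes_indexed, chunks_indexed).

```python
import json
import urllib.request


def diagnose_health(body):
    """Classify a /health response body as unhealthy, index_empty, or ready."""
    payload = json.loads(body)
    if payload.get("status") != "ok":
        return "unhealthy"
    if payload.get("chunks_indexed", 0) == 0:
        # Service is up but searches would hit an empty index.
        return "index_empty"
    return "ready"


def fetch_health(base_url, timeout=10):
    """Fetch /health from a deployed web service (not called in this sketch)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


sample = '{"status":"ok","episodes_indexed":3,"chunks_indexed":120}'
state = diagnose_health(sample)
```

In practice the body would come from `fetch_health("https://<your-web-service>.up.railway.app")`, with the placeholder replaced by the real service URL.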


Local container check (no Railway needed)

Proves the image builds and the web role serves /health:

docker build -t pdse-web .
docker run --rm -e PORT=8000 -e OPENAI_API_KEY=dummy -p 8000:8000 pdse-web
# in another shell:
curl -fsS http://127.0.0.1:8000/health

/health works without a real key or index (it returns counts, runs no search).


Benchmark (latency baseline)

See benchmarks/README.md. One command:

uv run python benchmarks/bench_search.py

Times search() over the golden queries against data/chroma/, writes p50/p95 to benchmarks/baseline.json. Skips cleanly (exit 0) with no key/index.
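
For readers unfamiliar with the p50/p95 figures, here is one way such percentiles can be computed from a list of latencies. It uses the nearest-rank method, which is an assumption: this doc does not state which method benchmarks/bench_search.py uses, and `percentile` is a hypothetical helper, not a function from that script.

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it.

    `p` is an integer from 1 to 100.
    """
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding float rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


latencies_ms = [30, 10, 20, 50, 40]
summary = {
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
}
```

With only a handful of golden queries, p95 is close to the maximum, so a single slow query can dominate the baseline.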