Skip to content

v2.0.0: cleanup release — architecture, telemetry, tenancy, gates#29

Open
dmichael-fastly wants to merge 9 commits into
mainfrom
refactor/cleanup
Open

v2.0.0: cleanup release — architecture, telemetry, tenancy, gates#29
dmichael-fastly wants to merge 9 commits into
mainfrom
refactor/cleanup

Conversation

@dmichael-fastly

@dmichael-fastly dmichael-fastly commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Summary

Architectural cleanup release. The post-v1.2.0 perf branch closed the worst read-path latency by stacking remediation on top of an architecture that wasn't designed for the workload; this PR pays that down. The branch is squashed into a single commit (6e29655) that's tag-ready as v2.0.0, plus follow-up commits from the user (d237e9f, 298797b) and a ruff lint fix (dd17b90).

See CHANGELOG.md [2.0.0] entry for the full annotated diff — this body is the PR-review companion (status of gates, test surface, what to watch for in review, follow-up items).

What landed

  • Architectural carves: every backend file > 1500 LOC carved into per-concern packages (iceberg/, cron/ jobs, metadata/, tunnel/, share_db/, routers/admin/, core/rollups/); pre-split public surface preserved via package __init__.py re-exports. Top-5 backend now ≤ 1461 LOC; no backend file > 1500.
  • Telemetry on OTel + structlog: replaces the four fragmented custom surfaces. Console exporter ships by default; production backends are deploy-config.
  • RequestContext replaces AnalyticsDeps: tenancy enforced at context construction; routes never parse service_id from a path param.
  • Composite-endpoint hard cutover: dashboard/bundle, security/bundle, network/bundle ship with the frontend swap. Granular endpoints deleted (see Breaking).
  • Frontend large-file splits: ProvisionWizard (3582 → wizard/steps/), app/logs/page.tsx (2136 → _sections/ + _state.ts), and 14 more. No frontend file > 499 LOC.
  • Coverage gates: backend --cov-fail-under 78 → 85 (actual 85.05 %); frontend coverage.thresholds.lines 44 → 58 (actual 61.66 %).
  • mypy: ignore_errors list 36 → 0; per-module strict block grew 8 → 19 modules.
  • Library swaps: aiodns + tenacity + pydantic-settings + cachetools + orjson + rich + typer + nuqs (Next.js URL state) + argon2id (passcode hashing).
  • Trust topology asserted in tests (Caddy + compose + middleware order); sql_validator rejects NUL bytes; ruff T201 (no print() in production code, with per-file ignores for scripts/*, the provision CLI, and tests/*).
  • Followups in subsequent commits: SQL hardening (escape_sql_literal at read_parquet() + glob() sites), accessibility, stable React keys, request timeouts (d237e9f); VCL macro heredoc support (298797b).

Verification

  • make ci is green on the current tip (dd17b90): 3972 pytest, 444 vitest, 9 VCL test all pass; lint + format + mypy + osv + secret-scan all pass.
  • The 4 "failing tests" I flagged in my initial PR body turned out to be pytest-randomly + xdist ordering flakes — every test passes deterministic + in isolation. No code change was needed for them.
  • The one real issue from the squash was a T201 print() violation in tests/test_performance_smoke.py (legitimate benchmark prints); fixed in dd17b90 by adding tests/* to the per-file-ignores list (same treatment as scripts/*).
  • Prod-deployed + smoke-tested via the post-restart pipeline:
    • Admin (SSH-tunneled localhost:3001): 18/18 endpoints green (bootstrap, sync-status, log-extents, every /api/admin/*)
    • Analyst (Fastly URL + /share-login session): 9/9 pages rendered with real data — Dashboard, Performance, Origin, Security, Charts, Insights, Network, Sessions, Query
    • Gating: /admin redirects analyst to /dashboard; /api/admin/* returns 401 for analyst session; /api/sync-status returns 401 (admin-only)
  • 593 files changed in the squash, +82,649 / −37,156 LOC

Breaking changes

  • Composite-endpoint cutover: granular per-card endpoints (/api/dashboard/aggregates, /api/dashboard/raw, /api/dashboard/top_n, etc.) are deleted; callers must use the composite (/api/dashboard/bundle).
  • AnalyticsDepsRequestContext: alias removed.
  • is_cached / _is_cached alias on BaseResponse removed (is_cached is canonical).
  • SSH-to-localhost.run analyst sharing removed; production has always been direct-mode against the Fastly+Caddy public URL.
  • Error contract (from v2.0.1 hardening): raise_internal(logger, exc, code, status) replaces raise HTTPException(detail={"error": str(e)}) at except sites — detail is now {"error": <code>, "error_id": <8-hex>}. Upstream messages are scrubbed; operators correlate via the server log keyed on error_id.

What was NOT done in this PR (deferred)

  • v2.0.0 tag — version is at 2.0.0 in pyproject.toml / frontend/package.json / backend/main.py / uv.lock; the tag itself is not yet pushed (per "do not merge anything yet"). The user has already drafted a v2.0.1 commit message (d237e9f) but the actual version strings still read 2.0.0 — needs a decision: ship as 2.0.0 (with the hardening folded in) or bump to 2.0.1 before tagging.
  • pending-docs/best_practices_audit.md — left untracked in the working tree (your WIP audit doc, not part of this PR).

Test plan

  • make ci green
  • Prod smoke: admin + analyst paths verified
  • Tag decision: v2.0.0 vs v2.0.1 (and whether to bump the version strings)
  • After tag merge: re-smoke prod once more via ~/restart.sh + the [prod-verify-paths] memory pattern

🤖 Generated with Claude Code

dmichael-fastly and others added 9 commits June 12, 2026 18:13
Squash of the refactor/cleanup branch — architectural cleanup
release. Tag this commit as v2.0.0.

== Architecture ==

Carved every backend file > 1500 LOC into a per-concern package; all
packages preserve the pre-split public surface via __init__.py
re-exports so import paths stay stable.

* core/iceberg.py (4232)   -> iceberg/{view, catalog, warehouse,
                              manifest, fs, _core, buffer, ddl,
                              snapshot_cache, dedup, ...}. Custom
                              FosFsspecFileIO + CachedFosS3FileSystem
                              subclasses retire 5 of 6 historical
                              s3fs monkeypatches.
* scheduler.py (2843)      -> cron/{scheduler, decorators,
                              jobs/{sync, commit, compaction, optimize,
                              expire, metadata, gap_heal,
                              rollup_compact_daily}}. Phase 6 picks
                              separate-pool isolation based on Phase 1
                              thread-wait telemetry.
* core/metadata_db.py(3168)-> core/metadata/{base, alerts, views,
                              ingest_log, cron_log, asn_cache,
                              usage_log, reconciliation, state};
                              metadata_db.py is a backward-compat shim.
* utils/tunnel.py (1022)   -> tunnel/{manager, session, rate_limiter,
                              state, fingerprint}. SSH-to-localhost.run
                              path DELETED (~400 lines). Direct-mode
                              only.
* core/share_db.py (1312)  -> share_db/{connection, schema, invites,
                              sessions, audit, passcode, tos, settings,
                              validation}. argon2id replaces scrypt for
                              new passcodes; scrypt verify branch stays
                              for transparent rehash-on-login.
* routers/admin.py (1650)  -> admin/{pop_locations, ingest, trees,
                              downloads, sync_status, compaction,
                              health, log_accounting, iceberg,
                              bot_sources} + _helpers / _dir_size /
                              _router. admin_usage sidecar still
                              attaches to the shared router.
* core/rollups.py (2045)   -> rollups/{_common, time_series, sessions,
                              hour_bundles, day_bundles, recompute,
                              wellknown_bots}. 41 re-exports.

Other architecture changes:

* RequestContext replaces AnalyticsDeps; tenancy enforced at context
  construction; routes never parse service_id from a path param.
* Composite endpoints hard cutover: dashboard/bundle + security/bundle
  + network/bundle ship with the frontend swap. Granular per-card
  endpoints deleted. _meta_con parallel path dropped. is_cached /
  _is_cached alias collapsed. AnalyticsDeps = RequestContext shim
  removed.
* Top-5 backend files now <= 1461 LOC; no backend file > 1500.
* Top frontend files all < 500 LOC (ProvisionWizard 3582 ->
  wizard/steps/*; app/logs/page.tsx 2136 -> _sections/* + _state.ts).

== Telemetry, dependencies ==

* OpenTelemetry (api/sdk + fastapi/botocore/aiohttp instrumentors)
  replaces the four fragmented custom telemetry surfaces. Console
  exporter ships by default.
* structlog wires trace_id + span_id into structured log output.
* process_context_scope + _ACTIVE_CONTEXTS mirror KEPT at
  backend/utils/telemetry.py. OTel context propagation uses Python
  ContextVars under the hood and inherits the same cross-thread
  limitation (fsspec iothread, pyiceberg ThreadPoolExecutor) the
  manual mirror was built to solve.
* aiodns + asyncio.gather + bulk-transaction sqlite writes in
  utils/rdns_cache.py replace the serial-blocking
  socket.gethostbyaddr loop.
* tenacity decorator retry for Fastly/NGWAF/SQLite-WAL-busy paths.
* pydantic-settings centralises env-var reads + boot validation.
* cachetools replaces in-process bounded_cache / rdns_cache /
  ngwaf_bot_cache.
* Structured .tf.json generation replaces f-string HCL + the
  _hcl_escape regex; eliminates a custom-HCL injection vector.
* orjson via FastAPI ORJSONResponse for composite payloads.
* rich + typer for the provision CLI.
* httpx everywhere except telemetry_proxy.py (aiohttp stays for
  the proxy server role).
* nuqs as the URL state source on the frontend.

== Trust topology, security hardening ==

* Middleware order asserted at boot AND in tests; one-line INVARIANT
  markers replace the paragraph-long comments in main.py.
* @pytest.mark.security_regression marker + monotonic-count CI gate
  (floor: 24).
* Trust-topology snapshot tests pin Caddy @from_fastly matcher, XFF
  forwarding, /share-login rate-limit, and the backend
  --forwarded-allow-ips=127.0.0.1 flags.
* sql_validator rejects NUL bytes before any cost.
* ruff T201 (print-detection) lint rule enforced in production code.

== Frontend ==

* RSC/CSR boundary in app/_routing.md. Hidden-Plotly + hidden-MapLibre
  + setTimeout warm-up hacks dropped; replaced with modulepreload +
  the styledata-event swap pattern.
* Live Query Monitor: live-first sort, peak-memory column, keyboard
  shortcuts, URL-persisted filters, per-run inline expand for xN
  cron-grouped rows, >=30s stuck-query pulse, copy-SQL.
* Operations Overview cards on the admin landing page surface ingest
  gap + live query activity + slow-query count.

== Quality gates ==

* Backend coverage gate --cov-fail-under 78 -> 85 (actual 85.05 %).
* Frontend coverage gate coverage.thresholds.lines 44 -> 58
  (actual 61.66 %).
* tool.mypy.overrides ignore_errors list: 36 modules -> 0.
* mypy per-module strict block: 19 modules opted in
  (disallow_untyped_defs + disallow_incomplete_defs +
  check_untyped_defs + warn_return_any + warn_unused_ignores).
* Load-harness CI step: scripts/emit_perf_latest.py runs a 100K-row
  synthetic DuckDB workload; scripts/perf_gate.sh fails on >50%
  regression vs tests/perf/baseline.json.

== Operations, portability ==

* VM-agnostic deploy runbooks at docs/deploy/{aws_ec2, azure_vm,
  gce, generic_linux}.md. Storage stays Fastly Object Storage
  (S3-compatible).
* scripts/refresh_fastly_cidrs.py pulls api.fastly.com/public-ip-list
  and rewrites the Caddy @from_fastly block.

== Version ==

pyproject.toml + frontend/package.json + backend/main.py FastAPI app
+ uv.lock all bumped to 2.0.0. CHANGELOG.md gains a [2.0.0] entry.
README.md updated to drop the removed SSH-tunnel sharing mode.
AGENTS.md updated to reflect the post-split admin/ + rollups/
package paths.

== Breaking ==

* Composite-endpoint cutover: granular per-card endpoints are
  deleted. Callers use the composite (/api/dashboard/bundle, etc.).
* AnalyticsDeps alias for RequestContext is removed.
* is_cached / _is_cached alias on BaseResponse is removed.
* SSH-to-localhost.run analyst sharing is removed; production has
  always been direct-mode against the Fastly+Caddy public URL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backend:
- escape_sql_literal at read_parquet() and glob() sites that
  interpolate computed paths
- quote identifiers in DROP TABLE IF EXISTS (11 sites) so temp-table
  cleanup tolerates reserved keywords
- session_scoring._cached: clear _inflight on the cache-hit path too,
  not only on producer-path teardown
- iceberg/buffer.tombstone_buffer_files: log + skip on marker write
  failure (the immediate-unlink fallback re-opened the in-flight-query
  race the grace window exists to close)

Frontend:
- replace key={i} on dynamic lists in DebugPanel, CronLiveLog, network
  metro leaderboard, query toolbar, custom-field drawer
- useSSE attaches a monotonic _id to each line for stable keys in
  append-only feeds
- a11y: FieldGroups and FileBrowser disclosure widgets are real
  buttons with aria-expanded; SSEModal uses base-ui Dialog render
  prop instead of a non-keyboard div wrapper; per-row "view audit
  logs" gets an aria-label including the row's email
- fetchWithTimeout helper (30s default; heartbeat tightens to 10s)
  applied to share-login, acknowledge, and useAnalystHeartbeat so
  hung requests surface as errors instead of infinite spinners

Infra:
- caddy/Dockerfile: USER caddy (Caddy is the only externally-facing
  socket, base image ships the user, no <1024 port bind)

Tests:
- regression: _cached must clear _inflight on cache hits
- regression: tombstone failure must NOT fall back to immediate
  unlink (rewrites the prior test that locked in the wrong contract)
- contract: raise_internal coverage already added in v2.0; usage
  operations test updated to assert generic code + error_id instead
  of the leaked upstream status/body

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior macro-content validation rejected ALL `;`, `{`, `}` after the
Fastly `\{` / `\}` unescape, which incorrectly killed legitimate
heredoc patterns like `strftime(\{"format"\}, time.start)` — the
custom-field VCL-lint endpoint started returning `valid: false` for
the built-in `time.start` format.

Split BEFORE unescape and reject only `;` plus UNESCAPED `{` / `}`
(`(?<!\\)[{}]` lookbehind). The original VCL-injection payload from
the audit remains blocked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The v2.0 ruff lint enforces T201 (no `print()` calls) to keep
production code on logger/structlog. CLI tools and scripts are
already exempted via per-file-ignores; this adds the same exemption
for tests/, which legitimately print benchmark timings and debug
dumps (e.g. `tests/test_performance_smoke.py` reports the wall-clock
of the 1M-row aggregate + insights runs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`caddy:2-alpine` stopped shipping a `caddy` user/group sometime after
early 2026, so the bare `USER caddy` at the end of the Dockerfile errors
on container start with "unable to find user caddy: no matching entries
in passwd file". Recreate the user/group explicitly.

The official caddy:2-alpine binary has CAP_NET_BIND_SERVICE set as a
filecap so a non-root user can bind :80/:443. The COPY from the builder
stage replaces /usr/bin/caddy with our custom-built binary, which
wipes that filecap. Re-apply with setcap so the non-root caddy user can
still bind :80 (which the Caddyfile listens on — the prior comment
claiming :8080 was stale).

Verified on the VM with a one-shot container test before deploying.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Versioning stays at 2.0.0 — the SQL hardening, stable React keys,
a11y, fetchWithTimeout, error-contract scrub, and Caddy USER caddy
work that ``d237e9f`` carried under a "v2.0.1" message are part of
the v2.0.0 release. CHANGELOG sections updated:

- Reliability, perf: session_scoring._cached _inflight cleanup,
  iceberg/buffer tombstone marker-failure handling, DROP TABLE IF
  EXISTS identifier quoting
- Trust topology, middleware: raise_internal scrubs upstream error
  messages, escape_sql_literal at read_parquet()/glob(), Caddy
  container drops to USER caddy
- Frontend: stable React keys + useSSE monotonic _id, a11y pass
  (aria-expanded, base-ui Dialog, per-row aria-label),
  fetchWithTimeout (30s / 10s heartbeat)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Caddyfile: drop 'unsafe-eval' from CSP script-src (no eval / new
  Function / string-form timers across frontend code; grep-verified)
- backend/deps: shared ServiceId Annotated[str, Path(pattern=...)]
  type; session_scoring_admin migrates all 17 path-param sites so
  malformed ids 422 at the FastAPI boundary
- core/duckdb.get_connection + utils/rdns_cache._run_async_resolve:
  docstring notes that callers must wrap with asyncio.to_thread when
  called from an async handler
- CustomFieldDrawer: surface VCL validation API failures via inline
  hint (previously console.error only); document the existing
  exhaustive-deps disable
- frontend Dockerfile: comment noting openapi.json must stay free of
  embedded credentials / example tokens
- ReportLayout becomes generic over TData; security/page.tsx threads
  components['schemas']['SecurityAggregatesResponse'] through, so the
  three section files drop the (data as any) blanket casts and gain
  inline narrowing for the dict-of-unknown shapes the schema can't
  pin (ngwaf bots timeseries, ipv6_adoption rows)

Tests:
- pin that the ServiceId pattern returns 422 on a service_id with
  characters outside [A-Za-z0-9_-]
- openapi.json regenerated to pick up the new Path pattern field

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
backend/Dockerfile:
- add UID/GID 1000 `app` user + USER directive so a foothold in the
  FastAPI process can't escape its mounts. Pre-create + chown the
  bind-mount targets so docker's root-owned default dirs don't
  shadow the chown.

router_utils.raise_internal:
- annotate as -> NoReturn so mypy treats it as a control-flow
  terminator at the call site (the function unconditionally raises).
  Without this, every caller that puts raise_internal at the end of
  an except branch trips a "Missing return statement" error.

Exception-leak follow-up to v2.0.1 (additional sites the first sweep
missed where the original message genuinely leaked internals):
- session_scoring_admin: rotate_aes_key + update_enforce_status_code
  failures
- admin/sync_status: catch-all 500 branch (DBBusyError 503 path keeps
  its informative message since "busy" is the actionable signal)
- services/audit + services/cron: read / delete handlers
- intentionally untouched (str(e) is the right shape): parse_period
  ValueErrors in provision, SQL-validator PermissionErrors in query,
  share_admin passcode validation, label-create / -update validation
  in session_scoring, "busy" RuntimeErrors in admin/ingest +
  admin/iceberg (the body is the user-actionable state)

Tests updated to pin the new ``raise_internal`` shape (generic code
+ error_id, no exception-string leak) and to assert the leak
specifically does NOT travel through the wire.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
== fastly CDN auth: replace spoofable Fastly-FF check with shield-auth secret ==

The FOS CDN VCL was gating its 401/penaltybox/Client-IP logic on
`fastly.ff.visits_this_service == 0` to mean "this is the edge, not a
shield hop". That variable is derived from the `Fastly-FF` request
header, which is client-controllable — an attacker who sets
`Fastly-FF: <service-id>` makes the variable non-zero on the edge and
skips every gate.

The simple fix (remove the FF check) breaks shielding because
`querystring.filter_except` strips the `key` query param before the
shield-hop bereq, so the shield POP would 401 on missing auth.

Applied fix mirrors the existing session-scoring VCL pattern:
1. `load_vcl()` substitutes a fresh random secret into a placeholder.
   Edge + shield run the same compiled VCL so both see the same
   constant by construction.
2. `vcl_recv` strips client-supplied `X-Edge-CDN-Auth` that doesn't
   match — prevents a spoofer satisfying the shield-detection check.
3. Auth/penaltybox/Client-IP gates now check
   `req.http.X-Edge-CDN-Auth != "<secret>"` (unspoofable — the secret
   is compiled into the VCL and never leaves it) instead of the FF
   variable.
4. `miss_pass` stamps `bereq.http.X-Edge-CDN-Auth = "<secret>"` on
   every outgoing bereq. The shield matches it and skips edge-only
   gates; the actual FOS origin ignores it.

Regression coverage in tests/utils/test_fastly_utils.py.

The remaining `fastly.ff.visits_this_service > 1` reference in
vcl_recv is the SWR-on-shield tweak (a perf knob, not a security
gate) and is left as-is.

== views/alerts: remove O(N) cross-tenant scan fallback ==

`get_view_by_id` / `delete_view` / `get_alert_by_id` / `toggle_alert` /
`delete_alert` had a `service_id_hint=None` fallback that, when no
hint was given, iterated every per-service metadata DB looking for the
target id. An authenticated caller could trigger an O(N) workload
across the whole tenant set by requesting any non-existent id.

Make `service_id` required at all five entry points. The router-level
DELETE handlers raise 400 `service_id_required` if the caller didn't
supply one. Audit flagged only the views variant; alerts was
structurally identical and got the same fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant