v2.0.0: cleanup release — architecture, telemetry, tenancy, gates#29
Open
dmichael-fastly wants to merge 9 commits into
Open
v2.0.0: cleanup release — architecture, telemetry, tenancy, gates#29dmichael-fastly wants to merge 9 commits into
dmichael-fastly wants to merge 9 commits into
Conversation
Squash of the refactor/cleanup branch — architectural cleanup
release. Tag this commit as v2.0.0.
== Architecture ==
Carved every backend file > 1500 LOC into a per-concern package; all
packages preserve the pre-split public surface via __init__.py
re-exports so import paths stay stable.
* core/iceberg.py (4232) -> iceberg/{view, catalog, warehouse,
manifest, fs, _core, buffer, ddl,
snapshot_cache, dedup, ...}. Custom
FosFsspecFileIO + CachedFosS3FileSystem
subclasses retire 5 of 6 historical
s3fs monkeypatches.
* scheduler.py (2843) -> cron/{scheduler, decorators,
jobs/{sync, commit, compaction, optimize,
expire, metadata, gap_heal,
rollup_compact_daily}}. Phase 6 picks
separate-pool isolation based on Phase 1
thread-wait telemetry.
* core/metadata_db.py(3168)-> core/metadata/{base, alerts, views,
ingest_log, cron_log, asn_cache,
usage_log, reconciliation, state};
metadata_db.py is a backward-compat shim.
* utils/tunnel.py (1022) -> tunnel/{manager, session, rate_limiter,
state, fingerprint}. SSH-to-localhost.run
path DELETED (~400 lines). Direct-mode
only.
* core/share_db.py (1312) -> share_db/{connection, schema, invites,
sessions, audit, passcode, tos, settings,
validation}. argon2id replaces scrypt for
new passcodes; scrypt verify branch stays
for transparent rehash-on-login.
* routers/admin.py (1650) -> admin/{pop_locations, ingest, trees,
downloads, sync_status, compaction,
health, log_accounting, iceberg,
bot_sources} + _helpers / _dir_size /
_router. admin_usage sidecar still
attaches to the shared router.
* core/rollups.py (2045) -> rollups/{_common, time_series, sessions,
hour_bundles, day_bundles, recompute,
wellknown_bots}. 41 re-exports.
Other architecture changes:
* RequestContext replaces AnalyticsDeps; tenancy enforced at context
construction; routes never parse service_id from a path param.
* Composite endpoints hard cutover: dashboard/bundle + security/bundle
+ network/bundle ship with the frontend swap. Granular per-card
endpoints deleted. _meta_con parallel path dropped. is_cached /
_is_cached alias collapsed. AnalyticsDeps = RequestContext shim
removed.
* Top-5 backend files now <= 1461 LOC; no backend file > 1500.
* Top frontend files all < 500 LOC (ProvisionWizard 3582 ->
wizard/steps/*; app/logs/page.tsx 2136 -> _sections/* + _state.ts).
== Telemetry, dependencies ==
* OpenTelemetry (api/sdk + fastapi/botocore/aiohttp instrumentors)
replaces the four fragmented custom telemetry surfaces. Console
exporter ships by default.
* structlog wires trace_id + span_id into structured log output.
* process_context_scope + _ACTIVE_CONTEXTS mirror KEPT at
backend/utils/telemetry.py. OTel context propagation uses Python
ContextVars under the hood and inherits the same cross-thread
limitation (fsspec iothread, pyiceberg ThreadPoolExecutor) the
manual mirror was built to solve.
* aiodns + asyncio.gather + bulk-transaction sqlite writes in
utils/rdns_cache.py replace the serial-blocking
socket.gethostbyaddr loop.
* tenacity decorator retry for Fastly/NGWAF/SQLite-WAL-busy paths.
* pydantic-settings centralises env-var reads + boot validation.
* cachetools replaces in-process bounded_cache / rdns_cache /
ngwaf_bot_cache.
* Structured .tf.json generation replaces f-string HCL + the
_hcl_escape regex; eliminates a custom-HCL injection vector.
* orjson via FastAPI ORJSONResponse for composite payloads.
* rich + typer for the provision CLI.
* httpx everywhere except telemetry_proxy.py (aiohttp stays for
the proxy server role).
* nuqs as the URL state source on the frontend.
== Trust topology, security hardening ==
* Middleware order asserted at boot AND in tests; one-line INVARIANT
markers replace the paragraph-long comments in main.py.
* @pytest.mark.security_regression marker + monotonic-count CI gate
(floor: 24).
* Trust-topology snapshot tests pin Caddy @from_fastly matcher, XFF
forwarding, /share-login rate-limit, and the backend
--forwarded-allow-ips=127.0.0.1 flags.
* sql_validator rejects NUL bytes before any cost.
* ruff T201 (print-detection) lint rule enforced in production code.
== Frontend ==
* RSC/CSR boundary in app/_routing.md. Hidden-Plotly + hidden-MapLibre
+ setTimeout warm-up hacks dropped; replaced with modulepreload +
the styledata-event swap pattern.
* Live Query Monitor: live-first sort, peak-memory column, keyboard
shortcuts, URL-persisted filters, per-run inline expand for xN
cron-grouped rows, >=30s stuck-query pulse, copy-SQL.
* Operations Overview cards on the admin landing page surface ingest
gap + live query activity + slow-query count.
== Quality gates ==
* Backend coverage gate --cov-fail-under 78 -> 85 (actual 85.05 %).
* Frontend coverage gate coverage.thresholds.lines 44 -> 58
(actual 61.66 %).
* tool.mypy.overrides ignore_errors list: 36 modules -> 0.
* mypy per-module strict block: 19 modules opted in
(disallow_untyped_defs + disallow_incomplete_defs +
check_untyped_defs + warn_return_any + warn_unused_ignores).
* Load-harness CI step: scripts/emit_perf_latest.py runs a 100K-row
synthetic DuckDB workload; scripts/perf_gate.sh fails on >50%
regression vs tests/perf/baseline.json.
== Operations, portability ==
* VM-agnostic deploy runbooks at docs/deploy/{aws_ec2, azure_vm,
gce, generic_linux}.md. Storage stays Fastly Object Storage
(S3-compatible).
* scripts/refresh_fastly_cidrs.py pulls api.fastly.com/public-ip-list
and rewrites the Caddy @from_fastly block.
== Version ==
pyproject.toml + frontend/package.json + backend/main.py FastAPI app
+ uv.lock all bumped to 2.0.0. CHANGELOG.md gains a [2.0.0] entry.
README.md updated to drop the removed SSH-tunnel sharing mode.
AGENTS.md updated to reflect the post-split admin/ + rollups/
package paths.
== Breaking ==
* Composite-endpoint cutover: granular per-card endpoints are
deleted. Callers use the composite (/api/dashboard/bundle, etc.).
* AnalyticsDeps alias for RequestContext is removed.
* is_cached / _is_cached alias on BaseResponse is removed.
* SSH-to-localhost.run analyst sharing is removed; production has
always been direct-mode against the Fastly+Caddy public URL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backend:
- escape_sql_literal at read_parquet() and glob() sites that
interpolate computed paths
- quote identifiers in DROP TABLE IF EXISTS (11 sites) so temp-table
cleanup tolerates reserved keywords
- session_scoring._cached: clear _inflight on the cache-hit path too,
not only on producer-path teardown
- iceberg/buffer.tombstone_buffer_files: log + skip on marker write
failure (the immediate-unlink fallback re-opened the in-flight-query
race the grace window exists to close)
Frontend:
- replace key={i} on dynamic lists in DebugPanel, CronLiveLog, network
metro leaderboard, query toolbar, custom-field drawer
- useSSE attaches a monotonic _id to each line for stable keys in
append-only feeds
- a11y: FieldGroups and FileBrowser disclosure widgets are real
buttons with aria-expanded; SSEModal uses base-ui Dialog render
prop instead of a non-keyboard div wrapper; per-row "view audit
logs" gets an aria-label including the row's email
- fetchWithTimeout helper (30s default; heartbeat tightens to 10s)
applied to share-login, acknowledge, and useAnalystHeartbeat so
hung requests surface as errors instead of infinite spinners
Infra:
- caddy/Dockerfile: USER caddy (Caddy is the only externally-facing
socket, base image ships the user, no <1024 port bind)
Tests:
- regression: _cached must clear _inflight on cache hits
- regression: tombstone failure must NOT fall back to immediate
unlink (rewrites the prior test that locked in the wrong contract)
- contract: raise_internal coverage already added in v2.0; usage
operations test updated to assert generic code + error_id instead
of the leaked upstream status/body
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior macro-content validation rejected ALL `;`, `{`, `}` after the
Fastly `\{` / `\}` unescape, which incorrectly killed legitimate
heredoc patterns like `strftime(\{"format"\}, time.start)` — the
custom-field VCL-lint endpoint started returning `valid: false` for
the built-in `time.start` format.
Split BEFORE unescape and reject only `;` plus UNESCAPED `{` / `}`
(`(?<!\\)[{}]` lookbehind). The original VCL-injection payload from
the audit remains blocked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The v2.0 ruff lint enforces T201 (no `print()` calls) to keep production code on logger/structlog. CLI tools and scripts are already exempted via per-file-ignores; this adds the same exemption for tests/, which legitimately print benchmark timings and debug dumps (e.g. `tests/test_performance_smoke.py` reports the wall-clock of the 1M-row aggregate + insights runs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`caddy:2-alpine` stopped shipping a `caddy` user/group sometime after early 2026, so the bare `USER caddy` at the end of the Dockerfile errors on container start with "unable to find user caddy: no matching entries in passwd file". Recreate the user/group explicitly. The official caddy:2-alpine binary has CAP_NET_BIND_SERVICE set as a filecap so a non-root user can bind :80/:443. The COPY from the builder stage replaces /usr/bin/caddy with our custom-built binary, which wipes that filecap. Re-apply with setcap so the non-root caddy user can still bind :80 (which the Caddyfile listens on — the prior comment claiming :8080 was stale). Verified on the VM with a one-shot container test before deploying. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Versioning stays at 2.0.0 — the SQL hardening, stable React keys, a11y, fetchWithTimeout, error-contract scrub, and Caddy USER caddy work that ``d237e9f`` carried under a "v2.0.1" message are part of the v2.0.0 release. CHANGELOG sections updated: - Reliability, perf: session_scoring._cached _inflight cleanup, iceberg/buffer tombstone marker-failure handling, DROP TABLE IF EXISTS identifier quoting - Trust topology, middleware: raise_internal scrubs upstream error messages, escape_sql_literal at read_parquet()/glob(), Caddy container drops to USER caddy - Frontend: stable React keys + useSSE monotonic _id, a11y pass (aria-expanded, base-ui Dialog, per-row aria-label), fetchWithTimeout (30s / 10s heartbeat) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Caddyfile: drop 'unsafe-eval' from CSP script-src (no eval / new Function / string-form timers across frontend code; grep-verified) - backend/deps: shared ServiceId Annotated[str, Path(pattern=...)] type; session_scoring_admin migrates all 17 path-param sites so malformed ids 422 at the FastAPI boundary - core/duckdb.get_connection + utils/rdns_cache._run_async_resolve: docstring notes that callers must wrap with asyncio.to_thread when called from an async handler - CustomFieldDrawer: surface VCL validation API failures via inline hint (previously console.error only); document the existing exhaustive-deps disable - frontend Dockerfile: comment noting openapi.json must stay free of embedded credentials / example tokens - ReportLayout becomes generic over TData; security/page.tsx threads components['schemas']['SecurityAggregatesResponse'] through, so the three section files drop the (data as any) blanket casts and gain inline narrowing for the dict-of-unknown shapes the schema can't pin (ngwaf bots timeseries, ipv6_adoption rows) Tests: - pin that the ServiceId pattern returns 422 on a service_id with characters outside [A-Za-z0-9_-] - openapi.json regenerated to pick up the new Path pattern field Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
backend/Dockerfile: - add UID/GID 1000 `app` user + USER directive so a foothold in the FastAPI process can't escape its mounts. Pre-create + chown the bind-mount targets so docker's root-owned default dirs don't shadow the chown. router_utils.raise_internal: - annotate as -> NoReturn so mypy treats it as a control-flow terminator at the call site (the function unconditionally raises). Without this, every caller that puts raise_internal at the end of an except branch trips a "Missing return statement" error. Exception-leak follow-up to v2.0.1 (additional sites the first sweep missed where the original message genuinely leaked internals): - session_scoring_admin: rotate_aes_key + update_enforce_status_code failures - admin/sync_status: catch-all 500 branch (DBBusyError 503 path keeps its informative message since "busy" is the actionable signal) - services/audit + services/cron: read / delete handlers - intentionally untouched (str(e) is the right shape): parse_period ValueErrors in provision, SQL-validator PermissionErrors in query, share_admin passcode validation, label-create / -update validation in session_scoring, "busy" RuntimeErrors in admin/ingest + admin/iceberg (the body is the user-actionable state) Tests updated to pin the new ``raise_internal`` shape (generic code + error_id, no exception-string leak) and to assert the leak specifically does NOT travel through the wire. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
== fastly CDN auth: replace spoofable Fastly-FF check with shield-auth secret == The FOS CDN VCL was gating its 401/penaltybox/Client-IP logic on `fastly.ff.visits_this_service == 0` to mean "this is the edge, not a shield hop". That variable is derived from the `Fastly-FF` request header, which is client-controllable — an attacker who sets `Fastly-FF: <service-id>` makes the variable non-zero on the edge and skips every gate. The simple fix (remove the FF check) breaks shielding because `querystring.filter_except` strips the `key` query param before the shield-hop bereq, so the shield POP would 401 on missing auth. Applied fix mirrors the existing session-scoring VCL pattern: 1. `load_vcl()` substitutes a fresh random secret into a placeholder. Edge + shield run the same compiled VCL so both see the same constant by construction. 2. `vcl_recv` strips client-supplied `X-Edge-CDN-Auth` that doesn't match — prevents a spoofer satisfying the shield-detection check. 3. Auth/penaltybox/Client-IP gates now check `req.http.X-Edge-CDN-Auth != "<secret>"` (unspoofable — the secret is compiled into the VCL and never leaves it) instead of the FF variable. 4. `miss_pass` stamps `bereq.http.X-Edge-CDN-Auth = "<secret>"` on every outgoing bereq. The shield matches it and skips edge-only gates; the actual FOS origin ignores it. Regression coverage in tests/utils/test_fastly_utils.py. The remaining `fastly.ff.visits_this_service > 1` reference in vcl_recv is the SWR-on-shield tweak (a perf knob, not a security gate) and is left as-is. == views/alerts: remove O(N) cross-tenant scan fallback == `get_view_by_id` / `delete_view` / `get_alert_by_id` / `toggle_alert` / `delete_alert` had a `service_id_hint=None` fallback that, when no hint was given, iterated every per-service metadata DB looking for the target id. An authenticated caller could trigger an O(N) workload across the whole tenant set by requesting any non-existent id. Make `service_id` required at all five entry points. The router-level DELETE handlers raise 400 `service_id_required` if the caller didn't supply one. Audit flagged only the views variant; alerts was structurally identical and got the same fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Architectural cleanup release. The post-
v1.2.0perf branch closed the worst read-path latency by stacking remediation on top of an architecture that wasn't designed for the workload; this PR pays that down. The branch is squashed into a single commit (6e29655) that's tag-ready asv2.0.0, plus follow-up commits from the user (d237e9f,298797b) and a ruff lint fix (dd17b90).See
CHANGELOG.md[2.0.0]entry for the full annotated diff — this body is the PR-review companion (status of gates, test surface, what to watch for in review, follow-up items).What landed
iceberg/,cron/jobs,metadata/,tunnel/,share_db/,routers/admin/,core/rollups/); pre-split public surface preserved via package__init__.pyre-exports. Top-5 backend now ≤ 1461 LOC; no backend file > 1500.RequestContextreplacesAnalyticsDeps: tenancy enforced at context construction; routes never parseservice_idfrom a path param.dashboard/bundle,security/bundle,network/bundleship with the frontend swap. Granular endpoints deleted (see Breaking).ProvisionWizard(3582 → wizard/steps/),app/logs/page.tsx(2136 → _sections/ + _state.ts), and 14 more. No frontend file > 499 LOC.--cov-fail-under78 → 85 (actual 85.05 %); frontendcoverage.thresholds.lines44 → 58 (actual 61.66 %).ignore_errorslist 36 → 0; per-module strict block grew 8 → 19 modules.sql_validatorrejects NUL bytes; ruff T201 (noprint()in production code, with per-file ignores forscripts/*, the provision CLI, andtests/*).escape_sql_literalatread_parquet()+glob()sites), accessibility, stable React keys, request timeouts (d237e9f); VCL macro heredoc support (298797b).Verification
make ciis green on the current tip (dd17b90): 3972 pytest, 444 vitest, 9 VCL test all pass; lint + format + mypy + osv + secret-scan all pass.T201 print()violation intests/test_performance_smoke.py(legitimate benchmark prints); fixed indd17b90by addingtests/*to the per-file-ignores list (same treatment asscripts/*).bootstrap,sync-status,log-extents, every/api/admin/*)/share-loginsession): 9/9 pages rendered with real data — Dashboard, Performance, Origin, Security, Charts, Insights, Network, Sessions, Query/adminredirects analyst to/dashboard;/api/admin/*returns 401 for analyst session;/api/sync-statusreturns 401 (admin-only)Breaking changes
/api/dashboard/aggregates,/api/dashboard/raw,/api/dashboard/top_n, etc.) are deleted; callers must use the composite (/api/dashboard/bundle).AnalyticsDeps→RequestContext: alias removed.is_cached / _is_cachedalias onBaseResponseremoved (is_cachedis canonical).v2.0.1hardening):raise_internal(logger, exc, code, status)replacesraise HTTPException(detail={"error": str(e)})at except sites —detailis now{"error": <code>, "error_id": <8-hex>}. Upstream messages are scrubbed; operators correlate via the server log keyed onerror_id.What was NOT done in this PR (deferred)
v2.0.0tag — version is at2.0.0inpyproject.toml/frontend/package.json/backend/main.py/uv.lock; the tag itself is not yet pushed (per "do not merge anything yet"). The user has already drafted av2.0.1commit message (d237e9f) but the actual version strings still read2.0.0— needs a decision: ship as2.0.0(with the hardening folded in) or bump to2.0.1before tagging.pending-docs/best_practices_audit.md— left untracked in the working tree (your WIP audit doc, not part of this PR).Test plan
make cigreenv2.0.0vsv2.0.1(and whether to bump the version strings)~/restart.sh+ the[prod-verify-paths]memory pattern🤖 Generated with Claude Code