feat(data-warehouse): fail webhook source runs on persistent delivery errors#61195
feat(data-warehouse): fail webhook source runs on persistent delivery errors#61195Gilbert09 wants to merge 13 commits into
Conversation
|
Hey @Gilbert09! 👋 It looks like your git author email on this PR isn't your
You can fix it for this repo with: git config user.email "you@posthog.com"Or set it globally with |
Prompt To Fix All With AIFix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
posthog/temporal/data_imports/sources/common/webhook_s3.py:124
The ClickHouse query is wrapped with plain `asgiref.sync_to_async`, which defaults to `thread_sensitive=True`. In a Temporal activity context, this serialises every concurrent call to `sync_execute` through a single shared thread — exactly the bottleneck that `database_sync_to_async_pool` was introduced to avoid. The file already imports `database_sync_to_async_pool` and uses it correctly for the Django ORM fetch a few lines above. Using bare `sync_to_async` here is inconsistent and, under concurrent activity workers, would cause calls to queue behind each other rather than run in parallel.
```suggestion
rows = await database_sync_to_async_pool(sync_execute)(
```
Reviews (1): Last reviewed commit: "feat(data-warehouse): fail webhook sourc..." | Re-trigger Greptile |
ClickHouse migration SQL per cloud environment
|
PR overviewAll previously flagged issues have been addressed. No open security concerns remain on this pull request. Security reviewNo open security issues remain on this pull request. Fixed/addressed: 6 · PR risk: 0/10 |
|
Applied the This PR pairs a ClickHouse migration (the
So there's no service-depends-on-unmigrated-schema hazard here — the check is a path-only false positive. (The label's docs frame it for DB-noop migrations; flagging the reasoning explicitly since this one does create tables.) |
There was a problem hiding this comment.
Supply Chain Security Review
⚠️ Dockerfile.recording-rasterizer — floating base image and dependency tags trade reproducibility for freshness
The Dockerfile intentionally moves from pinned versions to floating tags (node:24-bookworm-slim, chrome-headless-shell@stable, unpinned ffmpeg) paired with a weekly scheduled rebuild via DEPS_CACHE_BUST. This is a documented trade-off: weekly rebuilds auto-ingest security patches, but any individual build is no longer bit-for-bit reproducible and could pull a compromised upstream artifact without notice.
The mitigation (scheduled rebuild + cache-bust) is reasonable, but consider adding digest pinning for the base image and rotating it via Renovate/Dependabot to get both reproducibility and freshness.
Tag @mendral-app with feedback or questions. View session
Review: PR #61195 — feat(data-warehouse): fail webhook source runs on persistent delivery errorsGoal achieved: partial — the feedback loop is built end-to-end and works for the high-volume case, but recovery after fixing a secret is fragile for low-volume sources (see findings). Findings
VerdictSolid, well-scoped change with genuinely thorough tests on both sides of the boundary and a sensible fail-open design. The one thing I'd want addressed before merge is the re-enable-with-stale-failures behavior — for a low-volume source it can make the source un-recoverable until the failure rows age out of the window, which undercuts the stated recovery story. The unthrottled failure writes are the other worth-fixing item (cost/write-amplification, and trivially fixable given a success throttle already exists). Everything else is nit-level. |
a047ec2 to
c3c5747
Compare
|
Thanks for the thorough review @danielcarletti — going through each finding: [should-fix] re-enable with stale failures — Valid edge case for low-volume sources: after a non-retryable disable, if the user fixes the secret and re-enables but no fresh delivery lands before the next run, the stale ≥3 failures can still sit inside the lookback window ( [should-fix] failures never throttled — This is intentional and explicitly tested ( [nit] unused SQL constants / misleading "used by tests" comment — Addressed: corrected the comment so it no longer claims test usage (the tests exercise [nit] deriveRecord http-response extraction — Noted; a [nit] lastEmitBySource never evicted — Noted; entries are tiny and bounded by the distinct-source count, so not a real concern, but a periodic prune older than [nit] duplicate user-facing strings — Noted; the |
Rebase + CI status (automated)Resolved the merge conflict with master and brought the branch up to date. The conflict was in two files, both resolved preserving intent:
Backend CI is green (migration topology + schema snapshot + everything else passes). The only remaining red checks are two pre-existing master-wide failures, both unrelated to this PR (this branch touches neither file), and both currently red on
Will keep the branch updated as master's geoip fix lands. |
Update — sole blocker is a repo-wide GeoIP snapshot race on masterThis PR is otherwise complete: conflict resolved, branch mergeable with no leaks, all review threads resolved, and all PR-specific CI is green (Backend CI incl. the migration topology + The only remaining red checks are the
A separate |
ade0230 to
82e97fb
Compare
|
Heads up on CI: the Node.js Tests 3/3 shard has intermittently failed on |
e58a319 to
54709a0
Compare
|
Size Change: -15.7 kB (-0.02%) Total Size: 64.5 MB 📦 View Changed
ℹ️ View Unchanged
|
d9538af to
2c3bc02
Compare
Query snapshots: Backend query snapshots updatedChanges: 1 snapshots (1 modified, 0 added, 0 deleted) What this means:
Next steps:
|
cdaf230 to
463d285
Compare
Query snapshots: Backend query snapshots updatedChanges: 1 snapshots (1 modified, 0 added, 0 deleted) What this means:
Next steps:
|
Query snapshots: Backend query snapshots updatedChanges: 1 snapshots (1 modified, 0 added, 0 deleted) What this means:
Next steps:
|
… errors Webhook sources (e.g. Stripe) ingest via a hog function endpoint. When the signing secret is wrong the hog function returns a 4xx to the sender, no data lands in S3, and the import pipeline silently completes with 0 rows — the user is never told to fix their secret. This adds a feedback loop: the CDP node emits each warehouse_source_webhook delivery outcome to a new ClickHouse table, and the import pipeline reads recent outcomes per source. Persistent non-retryable rejections (>=3 consecutive 400/401/403 with no success since) now fail the run with a NonRetryableException, which disables the schema and surfaces a friendly error in the UI. The lookback window tracks the schema's own sync cadence plus a buffer.
…ry and stub it in S3 tests Run the warehouse_webhook_delivery_status ClickHouse query through database_sync_to_async_pool (thread_sensitive=False) instead of bare sync_to_async, matching the Django ORM fetches in the same file and avoiding serialising concurrent activity workers onto a single thread. Existing WebhookSourceManager get_items tests now exercise the new fail-fast delivery check, which queries Postgres + ClickHouse. Stub that check in those tests (it has dedicated coverage in test_webhook_source_manager.py) so they stay focused on S3 read/transform behaviour. Generated-By: PostHog Code Task-Id: c247d0fa-1196-4dd2-9265-80cad1088dd2
The persistent-delivery-failure check is an advisory health signal, not a correctness gate. If the warehouse_webhook_delivery_status query errors — a transient ClickHouse hiccup, or the table not yet existing during a deploy — swallow it and continue rather than failing the user's import run. This also makes the import worker safe to deploy ahead of the ClickHouse migration. Generated-By: PostHog Code Task-Id: c247d0fa-1196-4dd2-9265-80cad1088dd2
…comment The INSERT helper is not referenced by any test; correct the misleading "used by tests" comment to describe it as a bypass-Kafka producer helper kept for parity with the hog_invocation_results table family. Generated-By: PostHog Code Task-Id: 77e746c7-ae01-4e1a-9621-076bb86deb00
Generated-By: PostHog Code Task-Id: 77cc7f1c-5848-4065-9d39-13b1fb1a4afe
master relocated ingestion-outputs from src/ingestion/outputs/ to src/common/outputs/. Point the warehouse webhook status service (and its test) at the new ~/common/outputs/ingestion-outputs path so the nodejs build compiles. Generated-By: PostHog Code Task-Id: fb7cf23b-91d0-499e-ab76-df9a61bd4ea4
0325284 to
600f547
Compare
…ils move The cdp utils modules (logger, posthog, json-parse) moved to ~/common/utils. Point the new warehouse-webhook-status service and its test at the new paths so the Node.js build compiles. Generated-By: PostHog Code Task-Id: c2c97120-82b6-4827-8f17-aac874fcdc09
Query snapshots: Backend query snapshots updatedChanges: 1 snapshots (1 modified, 0 added, 0 deleted) What this means:
Next steps:
|
CI status: one required check blocked by a pre-existing master failureThis branch is now conflict-free and rebased onto the latest
This failure is not caused by this PR. It reproduces identically on recent Root cause (for whoever picks up the master fix): the test is order-dependent. I re-ran the failed shard once to rule out flakiness — it fails deterministically, matching |
Generated-By: PostHog Code Task-Id: c2c97120-82b6-4827-8f17-aac874fcdc09
|
🎭 Playwright report · View test results →
These issues are not necessarily caused by your changes. |
Problem
Data warehouse webhook sources (e.g. Stripe) ingest via a hog function endpoint: an external service POSTs to the endpoint, the hog function validates it (signature / signing-secret check) and, on success, funnels the payload through Kafka to S3 for the import pipeline to load.
When the signing secret is wrong, the hog function returns an HTTP 4xx (e.g.
400 Bad signature) straight back to the sender. No data lands in S3, so the scheduled import pipeline keeps running, reads 0 rows, and completes "successfully" with 0 rows synced. There is no feedback path from the Node.js hog-function rejection back to the Python pipeline, so the user is never told their secret is broken — the source just quietly imports nothing.Changes
Adds a feedback loop so persistent, human-actionable webhook delivery failures fail the run with a meaningful, non-retryable error (which the existing machinery surfaces in the UI and uses to disable the schema).
warehouse_webhook_delivery_statustable family (data + Kafka engine + MV + distributed alias) on the AUX cluster via the warpstream-cyclotron named collection, mirroringhog_invocation_results. ClickHouse auto-ingests from Kafka, so there is no new consumer process. The sort key(team_id, source_id, schema_id, timestamp)keeps the per-source windowed read a small range scan; rows TTL after 7 days.WarehouseWebhookStatusServicederives a delivery outcome (team_id, source_id, schema_id, http_status, ok, reason, timestamp) from eachwarehouse_source_webhookinvocation's HTTP response and produces it to a new Kafka topic. It is wired into the existingInvocationResultsServicecomposite, so it receives and flushes alongside the other result sinks with no change to the webhook request path. Steady-state successes are throttled per source; failures and the first success after a failure (the recovery transition) are never throttled.WebhookSourceManagernow queries recent delivery outcomes before reading S3 and raises aNonRetryableExceptionwhen deliveries are persistently failing (≥3 consecutive400/401/403with no success since).404,429and5xxare excluded — they aren't user-actionable signing-secret problems. The lookback window is derived from the schema's ownsync_frequency_intervalplus a buffer (the schedule jitter is a one-time start-time offset, so inter-run spacing stays ~one interval). The raised error phrase is registered inAny_Source_Errors, so the existing workflow logic disables the schema and stores a friendlylatest_error. Recovery is the standard manual re-enable, identical to every other non-retryable error.Granularity note: a bad signing secret is rejected before per-event schema mapping, so those outcomes carry an empty
schema_id(source-level); the per-schema pipeline check matches both its own schema's rows and source-level rows.How did you test this code?
I'm an agent (Claude Code). I wrote and ran the following automated tests locally; I did not perform manual end-to-end testing against a live Stripe webhook.
warehouse-webhook-status.service.test.tscovering record derivation (string/object reason, truncation, 2xx/4xx/5xx, missing source_id, non-warehouse functions, queued/no-response), per-source success throttling, the recovery transition, and flush/error-swallowing; plus updates toinvocation-results.service.test.tsfor the new sub-service. tsc + eslint clean.test_webhook_source_manager.py— parameterized classification logic (threshold, recovery, transient 5xx/429/404 exclusion, reason fallback), lookback-window math, the raise / no-raise paths, thatget_itemsshort-circuits before reading S3, and a drift guard that the raised phrase stays registered inAny_Source_Errors. ruff clean.cdp-source-webhooks.consumer.test.tsis an integration test requiring the running stack and was not executed in my sandbox (no Postgres) — worth a run in CI.Automatic notifications
🤖 Agent context
Authored with Claude Code. Planning explored both sides of the boundary (CDP hog-function webhook handling in Node, the temporal import pipeline in Python) and settled the key design fork — how the failure signal travels from Node back to the pipeline — on ClickHouse auto-ingest over a Postgres-via-consumer approach (avoids a new k8s pod; the per-source windowed query is cheap with the right sort key) and over reusing hog-function monitoring data (which would conflate deliberate 4xx rejections with hog runtime failures and risk the hog-watcher auto-disabling the function).
Notable decisions: hooking the new service into the
InvocationResultsServicecomposite rather than the HTTP handler (the result already carries the function type, inputs and HTTP response, so no request-path change); keying outcomes at source level with an optional schema id because signature failures happen before schema mapping; and deriving the lookback window from the schema's sync cadence after confirming the schedule jitter is a one-time offset, not per-run.