Skip to content

feat(webhooks-oss): WAIT_FOR_WEBHOOK end-to-end — port + persistence across 5 backings#1106

Open
nthmost-orkes wants to merge 80 commits into
mainfrom
feat/webhooks-from-orkes-split
Open

feat(webhooks-oss): WAIT_FOR_WEBHOOK end-to-end — port + persistence across 5 backings#1106
nthmost-orkes wants to merge 80 commits into
mainfrom
feat/webhooks-from-orkes-split

Conversation

@nthmost-orkes

@nthmost-orkes nthmost-orkes commented May 19, 2026

Copy link
Copy Markdown
Contributor

Summary

Ships WAIT_FOR_WEBHOOK end-to-end on OSS conductor: a faithful port of webhooks-oss/ from orkes-io/orkes-conductor#3612, plus the in-memory + 5 persistent backings needed to actually run it in any OSS deployment shape, plus retention of incoming_webhook_event rows. Closes #888.

Originally landed as two PRs (#1106 port + #1110 persistence); per leadership alignment they were folded into this one. Same commits, single review thread. #1110 auto-closed when its branch was fast-forwarded into this one.

User impact: First-time users can configure WAIT_FOR_WEBHOOK workflows, receive verified inbound events (HMAC/Slack/Stripe/Twitter/SendGrid/SignatureBased/HeaderBased), and have them dispatch into running workflows — on any supported OSS persistence shape (in-memory single-node, postgres, mysql, sqlite, redis, cassandra), not just dev mode.

Review-round changes

Six commits added after initial filing, addressing Viren's review:

Commit Change Why
893a990 Delete Tag.java from common Tags are enterprise RBAC — TagsService/AutomaticTagService don't exist in OSS; the class had no callers other than WebhookConfig.tags
b986d65 Drop List<Tag> tags from WebhookConfig Field was always null in OSS (nothing ever sets it); dead API surface that would mislead users
762e4c2 Delete EventMessage.java from common Ported but never wired — IncomingWebhookService has no addEventMessage calls and OSS ExecutionDAO has no such method. Decision (per Viren): DLQ doesn't fit Conductor's design; timeoutSeconds on WAIT_FOR_WEBHOOK tasks is the correct negative-path mechanism
92e122d Delete WebhookExecutionHistory + recordHistory() + lastRunWorkflowIdSize Both Orkes and OSS carried // TODO Remove this on this field. Enterprise-only execution log embedded in config (anti-pattern); only implemented in webhooks-enterprise, not webhooks-oss
237701e Drop WebhookConfig.getWorkflowNames() Zero callers anywhere in OSS — duplicated getReceiverWorkflowNamesToVersions().keySet() with @JsonIgnore
a05447c Map NonTransientException400 in ApplicationExceptionMapper Signature verification failures were returning 500, which tells external senders (Stripe, GitHub, Slack) to retry. NonTransientException semantically means "this won't work no matter how many retries" — correct HTTP mapping is 400 Bad Request
5cdd389 Add Webhook.cancel() + remove(TaskModel, int) on all 6 backings Ports orkes-io/orkes-conductor#3663: Webhook.start() puts a taskId into the correlation hash set but there was no symmetric removal when a workflow was terminated before a matching event arrived — set grew without bound. cancel() calls webhookTaskService.remove(TaskModel, int) (new overload on all 6 backings); WebhookTaskHashing.computeHashIfPresent() handles tasks cancelled before IN_PROGRESS as a safe no-op.

Orkes parity

Mirrors the layout from orkes-io/orkes-conductor#3612, which performed the webhooks/webhooks-oss/ + webhooks-enterprise/ split per the precedent set by scheduler in orkes-io/orkes-conductor#3529 and #1064.

Source files mirrored:

  • orkes-conductor/webhooks-oss/src/main/java/io/orkes/conductor/webhook/**
  • 9 supporting classes in orkes-conductor/common/ and orkes-conductor/core/
  • 5 backend DAO + TaskService impls under orkes-conductor/{postgres,mysql,redis,cassandra}-persistence/

All io.orkes.conductor.* packages translated to org.conductoross.conductor.* per repo convention.

OSS-side additions (no Orkes parallel)

Required because Orkes' equivalents are tenant-aware or rely on infrastructure that doesn't exist in OSS:

  • InMemoryWebhookDAO + InMemoryWebhookTaskService@ConditionalOnMissingBean defaults for the single-node case. No Orkes parallel because Orkes always runs against persistent storage.
  • WebhookConfigService + WebhookConfigResource — tenant-free CRUD at POST/PUT/GET/DELETE /api/metadata/webhook. OSS equivalent of the enterprise resource that uses AuditUtils + OrkesAuthentication.
  • IncomingWebhookResource — tenant-free port at POST/GET /api/webhook/{id}. Drops orgId switching.
  • WebhookWorker — OSS port of the enterprise worker. Plain @Component with @PostConstruct/@PreDestroy, no LifecycleAwareComponent, no MetricsCollector/Monitors, no OrkesRequestContext, no ExtendedEventExecution audit bookkeeping. Failures log instead of persist.
  • WebhookTaskHashing (in core) — hash computation shared across all 6 WebhookTaskService impls. Orkes carries the body inline in every impl; OSS lifts it once. Behavioral parity across backings is enforced by call, not by 5 hand-maintained copies. Also fixes a latent NPE (matches validated before refname iteration removal).
  • WebhookMatcherComputer (in core) — same pattern for the WorkflowDef → matchers transformation. All 6 DAO impls call it; each getMatchers() is a one-liner. Net –166 LoC vs. each impl carrying its own copy.
  • Replay-protection SPIWebhookVerifier.dedupKey(...) (default no-op) + WebhookDAO.tryRecordSignature(webhookId, signature, ttl) (default no-op). IncomingWebhookService rejects signature replays inside a 5-minute window. Backings that don't implement dedup retain prior behavior. No Orkes parallel; Orkes has no replay protection.

Backings landed

Backend DAO + service Schema Retention
In-memory InMemoryWebhookDAO + InMemoryWebhookTaskService ConcurrentHashMap (default; single-node) n/a
Postgres PostgresWebhookDAO + PostgresWebhookTaskService V16__webhook.sql (4 tables) PostgresWebhookCleanupJobWITH ... DELETE ... RETURNING
MySQL MySQLWebhookDAO + MySQLWebhookTaskService V10__webhook.sql (4 tables, InnoDB/utf8mb4) MySQLWebhookCleanupJobDELETE ... LIMIT N
SQLite SqliteWebhookDAO + SqliteWebhookTaskService V4__webhook.sql (4 tables) SqliteWebhookCleanupJobDELETE WHERE rowid IN (...)
Redis RedisWebhookDAO + RedisWebhookTaskService hash maps + sets via JedisProxy RedisWebhookCleanupJobHGETALL + parse + HDEL walker
Cassandra CassandraWebhookDAO + CassandraWebhookTaskService ensureTables() on construction Table-level default_time_to_live (no separate job)

Shared design across backings

  • Matcher recomputation on read. Every backing persists only the target-workflow versions snapshot (taken at config-create time so any expression evaluation is preserved). The actual matches criteria are recomputed from MetadataDAO on every getMatchers() call. WorkflowDef updates take effect without re-registering the webhook. Regression covered per backing by getMatchers_recomputesFromMetadataDAO_onEachCall_noStaleCache.
  • Bean naming. All WebhookDAO beans register as webhookDAO and all WebhookTaskService beans as webhookTaskService. The in-memory impls use @ConditionalOnMissingBean(name="webhookDAO"/"webhookTaskService") — whichever persistence module is on the classpath wins; with none, in-memory provides the default.
  • Cleanup property surface (cross-backing):
    • conductor.webhooks.cleanup.enabled (default true)
    • conductor.webhooks.cleanup.cron (default hourly "0 0 * * * *")
    • conductor.webhooks.cleanup.retention-duration (default PT168H / 7 days)
    • conductor.webhooks.cleanup.batch-size (default 1000) — SQL backings only
    • conductor.webhooks.cleanup.max-runtime (default PT60S) — SQL backings only
  • Operational worker properties. conductor.webhook.worker.{threadCount,pollingInterval,pollBatchSize} via WebhookWorkerProperties.

Design departures from Orkes' source impls

  1. No org_id anywhere — tables, queries, code paths. Orkes' schema bleeds org_id into 4 tables and dozens of WHERE clauses; OSS is tenant-free.
  2. Matcher recomputation on read — Orkes' PostgresWebhookDAO persists pre-computed matchers and runs a scheduled refreshMatchers executor to handle WorkflowDef updates. OSS drops the cache, recomputes on every read.
  3. No ConductorLimitsDAO quota enforcement — Orkes enterprise-only.
  4. NotFoundException instead of Preconditions.checkNotNull for missing-id errors.
  5. Schema renamewebhook_matchers (Orkes) → webhook_target_workflows (OSS), since OSS persists target workflows, not pre-computed matchers.
  6. Cleanup pattern. Orkes' WebhookCleanupJob implements an Orkes-internal DataCleaner interface. OSS uses Spring @Scheduled directly with the property surface above.

Code rewrites beyond mechanical namespace translation

Port-side (webhooks-oss)

# Where Change Why
1 IncomingWebhookService, TimeBasedUUIDGenerator ApplicationExceptionNonTransientException OSS retired ApplicationException
2 TimeBasedUUIDGenerator Dropped generate(long), getOrgId(String), OrkesRequestContext lookups OrkesRequestContext is Orkes-only tenant context
3 IncomingWebhookService.handleWebhook Throws NotFoundException on unknown webhook id instead of returning null. Dropped ExecutionDAOFacade dep. 4xx contract correctness + dead code
4 WebhookConfigService.updateWebhook Throws NotFoundException if existing config not found Was NPE/500
5 WebhookConfigService.getWebhooks + WebhookConfigResource.getWebhook toBuilder().secretValue(SECRET).build() instead of in-place setSecretValue("***") In-memory DAO returns shared refs — in-place mutation corrupted persisted secret. Found via smoke.
6 IncomingWebhookResource, WebhookConfigResource /webhook/api/webhook, /metadata/webhook/api/metadata/webhook. All @PathVariable use explicit ("id"). OSS convention + Spring requires explicit names without -parameters. Found via smoke.
7 WebhookTaskMapper Package → org.conductoross.conductor.webhook.tasks.mapper AI module precedent
8 WebhookConfigResource, WebhookConfigService Drop AuditUtils, FeatureFlags, all @PreAuthorize/@PostFilter, OrkesAuthentication.getAuthenticatedUser. Tags (Tag.java, WebhookConfig.tags) fully removed — enterprise RBAC with no OSS backing. Enterprise-only
9 webhooks-oss/build.gradle spring-boot-starter-web + spring-boot-autoconfigure upgraded compileOnlyimplementation. Stripe/SendGrid pinned via revStripe/revSendgrid. Spring web needed at runtime; AGENTS.md pinning convention
10 StripeVerifier.verify Null-guard on event.getApiVersion() before .split(...) NPE on Stripe-shaped bodies that omit api_version. Orkes has identical code — needs mirror fix to keep parity.
11 SlackVerifier Full rewrite: implements Slack's X-Slack-Signature: v0=<hmac_sha256(timestamp:body)> + X-Slack-Request-Timestamp with 5-minute replay window and constant-time compare. Orkes' impl returns ErrorList.empty() once urlVerified=true — effectively unauthenticated post-handshake. OSS does real signing-secret verification. Orkes is genuinely vulnerable here; mirror-fix candidate.
12 ApplicationExceptionMapper NonTransientException400 Bad Request Was falling through to 500, causing external senders (Stripe, GitHub, Slack) to retry rejected events. "Non-transient" semantically is a client error.

Persistence-side

  • core/WebhookTaskHashing.java validates matches before computing the hash. Fixes a latent NPE where removeIterationFromTaskRefName(null) would mask the missing-matches NonTransientException. Regression: InMemoryWebhookTaskServiceTest.put_missingMatchesField_throws.
  • core/WebhookMatcherComputer.java centralizes the WorkflowDef → matchers loop. All 6 DAO impls + InMemory call it. Behavior unchanged from inline; eliminates 5 hand-maintained copies.
  • PostgresWebhookDAOTest + siblings set spring.flyway.ignore-migration-patterns=*:missing to tolerate an orphan V10.1 in CI's shared testcontainers postgres (another test class applies V10.1 via experimental-queue-notify).
  • CassandraWebhookDAO keeps a local objectMapper + toJson/readValue because CassandraBaseDAO's are package-private. Same workaround CassandraSchedulerDAO uses.

Tests

146 new test methods across 24 test classes + 1 property-style class (200 random reps per run), 0 failures.

Port (webhooks-oss): 78 tests

  • Verifiers (HMAC/Slack/Stripe/Twitter/SendGrid/SignatureBased/HeaderBased) — lifted from Orkes
  • TargetWorkflowCollector, idempotency substitution
  • WebhookConfigService (9) — CRUD, conflict, secret sanitization (mutation regression), NotFound on update
  • WebhookConfigResource (8) — validation, 404s, secret sanitization
  • InMemoryWebhookDAO (9) — CRUD + matcher computation against mocked MetadataDAO
  • InMemoryWebhookTaskService (6) — hash semantics, bucketing, missing-matches throws
  • IncomingWebhookResource (2) — delegation
  • Webhook system task (1) — put() registration
  • WebhookWorker (9) — handleMessage paths, pollAndExecute
  • StripeVerifier (3) — null api_version accepted, bad signature rejected, missing header rejected
  • WebhooksOssEndToEndTest (4) — full bean graph: register → receive → enqueue → worker dispatch → workflow start / WAIT_FOR_WEBHOOK completion
  • WebhookHashContractPropertyTest (2 × 100 reps) — property-style: hash-site agreement across InMemoryWebhookTaskService and WebhookHashingService.computeJsonHash

Persistence: 68 tests (most testcontainer-based)

Test class Count Local pass
PostgresWebhookDAOTest 8 (needs Docker)
PostgresWebhookTaskServiceTest 6 (needs Docker)
PostgresWebhookCleanupJobTest 4 (needs Docker)
MySQLWebhookDAOTest 6 (needs Docker)
MySQLWebhookTaskServiceTest 6 (needs Docker)
MySQLWebhookCleanupJobTest 3 (needs Docker)
SqliteWebhookDAOTest 7
SqliteWebhookTaskServiceTest 6
SqliteWebhookCleanupJobTest 3
RedisWebhookDAOTest 7 (needs Docker)
RedisWebhookTaskServiceTest 6 (needs Docker)
RedisWebhookCleanupJobTest 4 (needs Docker)
CassandraWebhookDAOTest 7 (needs Docker)
CassandraWebhookTaskServiceTest 6 (needs Docker)

Smoke validation

Single-node (against :conductor-server-lite:bootRun)

The reproducible harness at nthmost-orkes/webhook-emitter/conductor-smoke registers a workflow, registers a webhook, starts the workflow, fires a signed event, polls until COMPLETED. Result:

  → HMAC_BASED ... PASS (2.23s)
  → SIGNATURE_BASED ... PASS (2.23s)
  → HEADER_BASED ... PASS (1.96s)
  → SLACK_BASED ... PASS (1.95s)
  → STRIPE ... PASS (1.97s)
  → TWITTER ... PASS (2.24s)
summary: 6/6 verifiers passed

Per-backend matrix (docker-compose + smoke)

webhook-emitter/conductor-smoke/matrix.sh brings up each backing's docker-compose stack against this branch's conductor:server image, then runs the 6-verifier smoke against it.

Backing Result Notes
Postgres ✅ 6/6 PASS full webhook flow through real Postgres DAO
Redis ✅ 6/6 PASS full webhook flow through real Redis DAO
SQLite ✅ 6/6 PASS server's intrinsic default — local :conductor-server-lite:bootRun resolves conductor.db.type=sqlite from application.properties, so the local bootRun smoke is the sqlite functional pass. Also covered by :conductor-sqlite-persistence:test (16/16 green locally).
MySQL ⚠️ docker bring-up fails pre-existing flyway bean wiring issue at Spring startup — fails before webhook code loads; unrelated to this PR. Testcontainer tests will validate via MySQLWebhookDAOTest etc.
Cassandra ⚠️ docker bring-up fails missing com.codahale.metrics.JmxReporter transitive dep in server classpath; fails on existing cassandraMetadataDAO bean, not webhook code. Testcontainer tests will validate via CassandraWebhookDAOTest etc.

The mysql/cassandra docker failures are upstream of any WebhookDAO code path — both fail at Spring bootstrap of pre-existing beans (mySqlMetadataDAO / cassandraMetadataDAO). Worth filing separately against conductor-oss/conductor's docker image setup, but they don't gate this PR.

Full battery (conductor-server-lite SQLite, loki)

8/8 harnesses passed against this branch's JAR on a remote host (loki, Ubuntu 24.04, Java 21, SQLite persistence):

Harness Result Key metric
smoke.py — verifier matrix ✅ 6/6 HMAC/Sig/Header/Slack/Stripe/Twitter all COMPLETED
negative_smoke.py — failure modes ✅ 5/5 timeout→TIMED_OUT, bad_sig→400, unknown→404, no_match→RUNNING, replay→400 second fire
gap_smoke.py — coverage gaps ✅ 3/3 workflowsToStart auto-starts workflow, HTTP task returns 200, SENDGRID ECDSA verified
composite_smoke.py — all system tasks ✅ 14/14 checks SET_VARIABLE, JSON_JQ_TRANSFORM, SWITCH, FORK_JOIN, DO_WHILE, SUB_WORKFLOW, FORK_JOIN_DYNAMIC, JOIN, INLINE, TERMINATE all COMPLETED
multi_verifier_smoke.py — 3 configs, 1 wf ✅ 3/3 HMAC + Signature + Header, each fires fresh instance
bucket_smoke.py — N=20 fan-out ✅ 20/20 all completed from single event, median 0.47s
replay_smoke.py — observation ✅ exits 0 second fire rejected (400), Scenario B behavior documented
load_smoke.py — 100 wf, concurrency=10 ✅ 100/100 8.56/s throughput, p50=1019ms, p95=1276ms, 0 lost

Verified

  • ./gradlew :conductor-webhooks-oss:test — 78/78 unit + 200/200 property reps green
  • ./gradlew :conductor-sqlite-persistence:test --tests '*Webhook*' — 16/16 green
  • ./gradlew :conductor-{core,webhooks-oss,postgres,mysql,sqlite,redis,cassandra}-persistence:compileJava — clean
  • ./gradlew :conductor-server:compileJava — clean (full server tree)
  • ./gradlew :conductor-*:spotlessCheck (all touched modules) — clean
  • Local bootRun smoke against :conductor-server-lite — full webhook → workflow chain succeeded
  • Per-backend matrix smoke: postgres + redis 6/6
  • Full 8/8 battery on remote host (loki, SQLite) — see table above

What's NOT here (and why)

  • DLQ / EventMessage: decided against. Orkes ships EventMessage/EventMessageService as an enterprise-only observability layer (Redis-backed, org-scoped). It doesn't belong in OSS, and the decision from review is that timeoutSeconds on WAIT_FOR_WEBHOOK tasks is the correct negative-path mechanism — consistent with Conductor's design philosophy. EventMessage.java was removed.

Known follow-ups (out of scope, file as separate PRs)

  • Fix mysql-persistence docker image: flyway bean wiring at startup (existing issue, surfaces under matrix harness)
  • Fix cassandra-persistence docker image: missing com.codahale.metrics.dropwizard-metrics-jmx transitive dep
  • Docs page with verified curl examples for /api/metadata/webhook and /api/webhook/{id}

Test plan

  • CI green
  • CI green on all 5 testcontainer-based persistence test suites
  • Smoke: POST /api/webhook/{id} → verifier runs → WebhookTaskMapper dispatches → workflow completes
  • Verify conductor.webhook.worker.* properties bind via WebhookWorkerProperties
  • Verify conductor.webhooks.cleanup.* properties bind via Spring environment

Stats

91 files changed, ~9,600 LoC added net, 23 commits. Originally split as #1106 (port, ~5.5K LoC, 54 files) + #1110 (persistence, ~4.3K LoC, 39 files); folded per leadership alignment. 6 additional cleanup commits from review round.

nthmost-orkes and others added 14 commits May 19, 2026 12:17
Lifts the new webhooks-oss/ module created by orkes-io/orkes-conductor#3612
(scheduler-style OSS/enterprise split) into conductor-oss. All io.orkes.conductor.*
packages translated to org.conductoross.conductor.* per repo convention.

Replaces the prior webhook-task/ module (deleted on main, only stale build/
remnants existed) with a structurally faithful copy of webhooks-oss from
orkes-conductor.

Ported

- 17 main + 7 test files comprising the webhooks-oss module
  (verifiers, hashing, IncomingWebhookService, WebhookTaskMapper,
  WebhookWorkerProperties, etc.)
- 10 supporting classes (EventMessage, ErrorList, Tag, WebhookConfig,
  IncomingWebhookEvent, WebhookExecutionHistory, WebhookDAO,
  WebhookTaskService, TargetWorkflowCollector, TimeBasedUUIDGenerator)
  into common/ and core/

Code rewrites beyond mechanical namespace translation

1. ApplicationException -> NonTransientException (2 sites in
   IncomingWebhookService, 3 sites in TimeBasedUUIDGenerator).
   OSS retired ApplicationException.

2. TimeBasedUUIDGenerator stripped of multi-tenant OrkesRequestContext
   lookups. generate() still produces real time-based UUIDs via log4j
   UuidUtil. generate(long) (test-only, unused in OSS) falls back to
   UUID.randomUUID() with TODO. getOrgId() returns "_".

3. EventMessage: removed orgId field and List<ExtendedEventExecution>
   eventExecutions field (transitive class not ported; orgId is tenant
   context).

4. IncomingWebhookService: replaced two executionDAOFacade.addEventMessage()
   calls with log.warn(). OSS's ExecutionDAOFacade has no addEventMessage
   (Orkes-only audit feature). Rejected events still observable via logs;
   audit-table persistence is a future enhancement.

5. Deleted 3 postgres DAO tests (AbstractTestDAO, PostgresDAOTestUtil,
   PostgresWebhookTaskServiceTest). They exercise a PostgresWebhookTaskService
   impl that has not been ported yet.

6. Inlined FeatureFlags.SECURITY as literal "conductor.security.enabled"
   in 5 verifier tests. Avoided porting a 20-line constants class for one
   string reference.

7. TargetWorkflowCollectorTest: org.apache.groovy.util.Maps.of(...) ->
   java.util.Map.of(...). Same semantics.

8. WebhookTaskMapper relocated from com.netflix.conductor.core.execution.mapper
   to org.conductoross.conductor.tasks.webhook (per repo namespace
   convention). Added explicit imports for TaskMapper and TaskMapperContext.

9. webhooks-oss/build.gradle adds commons-lang3, spring-web (test), and
   spring-boot-starter-web (compileOnly) for the at-compile-time imports.

Wiring

- settings.gradle: include 'webhooks-oss'
- server/build.gradle: implementation project(':conductor-webhooks-oss')

Verified

- ./gradlew :conductor-webhooks-oss:compileJava + compileTestJava: clean
- ./gradlew :conductor-webhooks-oss:test: all green
- ./gradlew :conductor-server:compileJava: clean (full server tree)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d webhook

Adds the runtime pieces needed to receive a webhook over HTTP and store it
in the queue for later dispatch. The dispatch worker and config-CRUD API
land in the next iteration.

Added

- InMemoryWebhookDAO: lifted from the prior feat/webhook-foundation
  attempt, extended to satisfy the richer WebhookDAO interface ported
  from orkes. Annotated @component @ConditionalOnMissingBean(name=
  "webhookDAO") so a persistent impl can override.

- InMemoryWebhookTaskService: lifted similarly, with @component
  @ConditionalOnMissingBean(name="webhookTaskService").

- IncomingWebhookResource: tenant-free OSS port of the
  webhooks-enterprise REST controller. POST /webhook/{id} and GET
  /webhook/{id}. ~60 lines.

Code rewrite

- IncomingWebhookService: swapped constructor-injected TimeBasedUUIDGenerator
  for IDGenerator (already a @bean in ConductorCoreConfiguration). Removes
  the @ConditionalOnProperty wiring trap and avoids needing
  conductor.id.generator=time_based. Webhook event IDs are now random UUIDs
  rather than time-based UUIDs, which is fine for the queue path that uses
  them.

Matchers note

InMemoryWebhookDAO.createMatchers/getMatchers/removeMatchers are no-ops.
Orkes' impl computes matchers by reading WorkflowDefs and extracting
inputParameters.matches from WAIT_FOR_WEBHOOK tasks, which requires
MetadataDAO injection. Inert until a worker consumes them, so deferred
with an inline comment.

Verified

- :conductor-webhooks-oss:compileJava + compileTestJava: clean
- :conductor-webhooks-oss:test: all green
- :conductor-server:compileJava: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…CRUD

Adds the management surface so webhook configs can be registered before
they receive events. Combined with the previous commit's REST receive
endpoint, this is the full register-then-receive arc.

Added

- WebhookConfigService: tenant-free OSS port of orkes' enterprise
  service. CRUD over WebhookDAO, calls TargetWorkflowCollector to
  populate matchers, sanitizes secrets on read. No audit logging
  (AuditUtils is enterprise-only); no FeatureFlags conditional
  (always enabled in OSS).

- WebhookConfigResource: REST controller mounted at
  /api/metadata/webhook (mirrors orkes path). POST/PUT/GET/DELETE +
  GET-all. No security annotations (@PreAuthorize / @PostFilter are
  enterprise-only); no tag endpoints (TagsService is enterprise).
  Validation rules preserved.

Code rewrites vs orkes

- ApplicationException(CONFLICT, ...) -> ConflictException
- ApplicationException(NOT_FOUND, ...) -> NotFoundException
- ApplicationException(INVALID_INPUT, ...) -> NonTransientException
- TimeBasedUUIDGenerator -> IDGenerator (consistent with prior commit)
- Dropped AuditUtils, FeatureFlags.WEBHOOKS conditional, all security
  annotations, all tag endpoints
- Dropped createdBy = OrkesAuthentication.getAuthenticatedUser().getId()
  since OSS has no authenticated-user concept here

Verified

- :conductor-webhooks-oss:compileJava + compileTestJava: clean
- :conductor-webhooks-oss:test: all green
- :conductor-server:compileJava: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…atch

Completes the end-to-end webhook arc. An OSS server can now:

1. Register a webhook config (POST /api/metadata/webhook)
2. Receive an incoming webhook (POST /webhook/{id})
3. Verify the signature
4. Store the event + enqueue
5. The worker polls the queue, matches against waiting WAIT_FOR_WEBHOOK
   tasks via the hash service, completes those tasks, and starts any
   workflows declared in workflowsToStart.

Added

- WebhookWorker: OSS port of the Orkes Enterprise WebhookWorker. Plain
  @component with @PostConstruct-started ScheduledExecutorService and
  @PreDestroy shutdown (no LifecycleAwareComponent base, no
  MetricsCollector / Monitors, no OrkesRequestContext orgId switching,
  no ExtendedEventExecution audit bookkeeping). Failures log instead
  of persist. ~250 lines.

Changes

- InMemoryWebhookDAO.createMatchers now actually computes matchers:
  injects MetadataDAO, reads WorkflowDef by name+version, finds
  WAIT_FOR_WEBHOOK / WAIT tasks, extracts inputParameters.matches and
  keys them by workflowName;version;taskRef. Mirrors orkes'
  PostgresWebhookDAO.createMatchers minus the orgId prefix and
  workflowDefToUpdateMap cache.

Code rewrites vs orkes WebhookWorker

- LifecycleAwareComponent -> plain @component with @PostConstruct /
  @PreDestroy
- MetricsCollector, Monitors.error -> log statements
- OrkesRequestContext.setOrgId per message -> dropped
- recordEventExecution(...) (writes ExtendedEventExecution audit
  records via ExecutionDAOFacade.updateEventExecution) -> log
  statements. ExtendedEventExecution / ExtendedEventHandler are not
  ported to OSS.
- executionDAOFacade.addEventMessage(...) -> dropped (method doesn't
  exist in OSS; same call previously stubbed in IncomingWebhookService)
- orkesRedisExecutionDAO.getTask(taskId) -> executionDAOFacade.getTaskModel(taskId)
- WorkflowConsistency.DURABLE -> dropped (enum doesn't exist in OSS)
- ApplicationException -> dropped (we already log + throw via log line)

Verified

- :conductor-webhooks-oss:compileJava + compileTestJava: clean
- :conductor-webhooks-oss:test: green
- :conductor-server:compileJava: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds unit-test coverage for the four OSS-original classes added in
prior commits on this branch. None of these had tests previously.

Added

- WebhookConfigServiceTest (8 tests): id generation, conflict on
  existing id, fresh-id passthrough, secret sanitization on read,
  redacted-secret preservation on update, real-secret overwrite,
  TargetWorkflowCollector delegation, removeWebhook ordering.

- InMemoryWebhookDAOTest (8 tests): CRUD for configs and events,
  plus the matchers computation logic added in the previous commit:
  null override stores empty, missing WorkflowDef skipped, real
  WAIT_FOR_WEBHOOK task with matches stored under
  workflowName;version;taskRef, task without matches skipped,
  non-webhook task skipped, removeMatchers drops.

- WebhookConfigResourceTest (8 tests): validation (missing all three
  targets, HEADER_BASED without headers), valid delegation, path-id
  override on update, 404 on missing get/delete, secret sanitization
  on get.

- WebhookWorkerTest (6 tests): event-not-found and config-not-found
  early returns, workflowsToStart invokes WorkflowExecutor with the
  right StartWorkflowInput (name, version, input merge, event name,
  createdBy), non-integer version skipped, matcher hit completes the
  waiting WAIT_FOR_WEBHOOK task and removes it from the task service,
  terminal task is not re-completed.

Change

- WebhookWorker.handleMessage made package-private so the test can
  invoke it directly without driving the ScheduledExecutorService.

Verified

- :conductor-webhooks-oss:test passes 62 tests, 0 failures

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the last untested link in the integration chain. When a workflow
reaches a WAIT_FOR_WEBHOOK task, the Webhook WorkflowSystemTask's
start() must register the task with WebhookTaskService so the worker
can find it via hash later.

- WebhookTest (1 test): verifies start() calls
  webhookTaskService.put(taskModel, workflow.getWorkflowVersion()) and
  sets status to IN_PROGRESS.

Verified: :conductor-webhooks-oss:test 63/63 pass.
Exercises the full webhooks-oss bean graph end-to-end with only the
WorkflowExecutor, QueueDAO, ExecutionDAOFacade, MetadataDAO, and
ParametersUtils mocked. Validates the integration story without
needing a full @SpringBootTest (which would require a circular dep
on test-harness).

Added

- WebhooksOssEndToEndTest (4 tests):
  - register_then_receive_storesEventAndPushesQueue: full POST path
    from REST controller through config DAO → IncomingWebhookService
    → verification → event store → queue push.
  - receive_unknownWebhookId_returnsNullAndNoEnqueue: idempotent-safe
    drop when no config is registered.
  - register_workflow_completion_path_end_to_end: the load-bearing
    one. Wires WorkflowDef → matchers computation → system-task
    put() → receive → worker.handleMessage → workflowExecutor.updateTask
    with status=COMPLETED. Validates the full task-completion chain.
  - receive_failedVerification_throwsAndNoEnqueue: verifier rejects,
    NonTransientException thrown, no queue push.

Wires real instances of InMemoryWebhookDAO, InMemoryWebhookTaskService,
TargetWorkflowCollector, WebhookConfigService, IncomingWebhookService,
IncomingWebhookResource, WebhookHashingService, WebhookWorker, and
Webhook. Uses a stub verifier that rejects bodies starting with
"REJECT" to drive the failure path deterministically.

Total tests on this branch now 67 across 9 test classes, 0 failures.
… server

Started server-lite, registered a webhook config, delivered an HMAC-signed
event, watched it dispatch into a workflow that completed. Found and fixed
four issues along the way.

Bug 1: REST routes missing /api prefix

IncomingWebhookResource mounted at /webhook; OSS convention is /api/...
(see RequestMappingConstants.API_PREFIX). Without the prefix, requests
fell through to the static resource handler and returned 500. Fixed:
/webhook -> /api/webhook.

Bug 2: @PathVariable name inference

Bare @PathVariable String id failed at runtime with "parameter name
information not available via reflection. Ensure that the compiler
uses the '-parameters' flag." OSS controllers explicitly name each
binding (see e.g. MetadataResource). Fixed: all @PathVariable in
IncomingWebhookResource and WebhookConfigResource now use explicit
("id") names.

Bug 3: Mutation of stored WebhookConfig on read

WebhookConfigService.getWebhooks() and WebhookConfigResource.getWebhook()
mutated the returned config's secretValue in place to redact it. With
the in-memory DAO that returns shared references, this corrupted the
persisted secret. Next webhook delivery read secretValue=*** and HMAC
verification failed with "Illegal base64 character 2a" trying to decode
the *** placeholder. Fix: WebhookConfig gained toBuilder=true; both
sites now clone-then-redact instead of mutating. Added regression
assertions in WebhookConfigServiceTest and WebhookConfigResourceTest.

Bug 4: server-lite did not depend on webhooks-oss

Only server/build.gradle pulled in :conductor-webhooks-oss. server-lite
is the lighter local-dev target; users running it via :conductor-server-lite:bootRun
would not get the webhooks module loaded at all. Fixed: added the dep
alongside the other system tasks (http-task, json-jq-task, kafka).

Verified end-to-end against running server

1. POST /api/metadata/webhook -> 200, returns config with generated id
2. GET /api/metadata/webhook/{id} -> 200, secret redacted to ***
3. POST /api/metadata/workflow -> 200, register wf-smoke v1
4. POST /api/webhook/{id} with HMAC-SHA256 signed body -> 200
5. WebhookWorker dequeued the event, invoked WorkflowExecutor.startWorkflow,
   wf-smoke v1 ran to COMPLETED in 655ms

Tests: 68 across 9 classes, 0 failures.
…ct bugs, dead code

Adversarial review surfaced three blockers before this PR can flip from
draft to reviewable. All addressed in this commit.

License headers — would have failed spotlessJavaCheck on first CI build

42 files reformatted by spotlessApply. Every "Copyright 2022 Orkes, Inc."
+ "Orkes Enterprise License" header now Apache 2.0 / Conductor Authors.
TimeBasedUUIDGenerator had no header at all — fixed. Import ordering
normalized across the affected modules per repo Spotless config.

WebhookConfigService.updateWebhook NPE on missing id

PUT /api/metadata/webhook/{id} with an id that doesn't exist would
dereference null and 500 instead of 404. Added an explicit
NotFoundException at the top of the method, matching the pattern in
WebhookConfigResource.getWebhook + deleteWebhook. Regression test added.

IncomingWebhookService.handleWebhook 200/null masquerade for unknown id

POST /api/webhook/{id} to an unregistered webhook id returned HTTP 200
with empty body — looked successful to the caller. Now throws
NotFoundException so the global exception mapper returns 404. Updated
WebhooksOssEndToEndTest accordingly.

Dead code

- IncomingWebhookService no longer builds EventMessage objects that
  were discarded immediately (left over from the addEventMessage ->
  log.warn rewrite). 28 lines removed.
- Dropped unused ExecutionDAOFacade dependency — constructor down from
  5 to 4 args.
- Dropped unused EventMessage / DEAD_LETTER_QUEUE imports.

Verified

- :conductor-webhooks-oss:compileJava + compileTestJava: clean
- :conductor-webhooks-oss:spotlessCheck: clean
- :conductor-webhooks-oss:test: 69 tests / 0 failures
…dversarial review

Addresses the should-fix items surfaced by the pre-review adversarial pass.

WebhookTaskMapper relocated for OSS convention

org.conductoross.conductor.tasks.webhook.WebhookTaskMapper ->
org.conductoross.conductor.webhook.tasks.mapper.WebhookTaskMapper

Matches the AI module's precedent (org.conductoross.conductor.ai.tasks.mapper.*)
and collocates the mapper with the rest of the webhook code instead of
splitting it across two top-level packages.

Dead code in TimeBasedUUIDGenerator

Dropped generate(long) and getOrgId(String) — both had TODO stubs with
no OSS callers (verified by grep). Net: 16 lines removed, generate()
gets an @OverRide annotation, three "OSS-OrkesRequestContext-removed"
TODOs gone. The remaining methods (generate() and getDate(String)) are
the only ones reachable in OSS.

Build.gradle hygiene

- compileOnly 'spring-boot-starter-web' -> implementation: the module's
  REST controllers use @RestController/@RequestMapping etc. at runtime,
  not just compile time. compileOnly only worked because :conductor-server
  happens to pull starter-web transitively; module would be non-functional
  if consumed standalone.
- Same upgrade for spring-boot-autoconfigure (used by
  @ConditionalOnMissingBean).
- stripe-java 29.4.0 and sendgrid-java 4.10.2 hardcoded -> bound to new
  revStripe and revSendgrid in dependencies.gradle. Added a PINNED
  comment explaining they're webhook-verifier-specific clients with no
  shared rev.
- Dropped now-orphan test deps: spring-web (no longer needed after
  IncomingWebhookService dropped HttpHeaders test usage indirectly),
  postgres-persistence, testcontainers (postgresql + base), HikariCP,
  duplicate spring-retry. The DAO tests these supported moved to
  postgres-persistence in the parent split PR.

WebhookWorker.recordHistory off-by-one

if (hist.size() > lastRunWorkflowIdSize) -> >=. Previous form let the
list grow to N+1 before trimming, contradicting the property name.

Missing test coverage

- IncomingWebhookResourceTest (2 tests): delegation + return value
  passthrough for both handleWebhook and handlePing endpoints.
- InMemoryWebhookTaskServiceTest (6 tests): put/get hash semantics,
  missing matches throws, bucketing of multiple tasks at the same hash,
  remove preserves vs drops bucket, unknown hash returns empty.

WebhookConfigServiceTest extended with the NotFoundException-on-missing-id
regression (mirrors the contract fix in round 1).

Verified

- :conductor-webhooks-oss:compileJava + compileTestJava clean
- :conductor-webhooks-oss:spotlessCheck clean
- :conductor-webhooks-oss:test 76 tests / 0 failures
- :conductor-server:compileJava clean
…utation

Two production-impacting bugs from the adversarial review.

WebhookWorker.pollAndExecute no longer acks on failure

Previously a try/finally acked unconditionally, so any exception in
handleMessage (DB blip, OOM, NPE) silently dropped the webhook event.
The TODO comment "retries are not yet modelled" was not a fix.

Now: ack only after handleMessage returns normally. On failure, log
and let the queue's unack timeout redeliver. Poison messages still
get retried, but they hit the underlying QueueDAO impl's retry policy
(e.g. postgres queue's max-retry → dead-letter) rather than vanishing.

InMemoryWebhookDAO.createMatchers no longer caches stale criteria

Previously createMatchers walked the WorkflowDefs at registration
time and stored the extracted `matches` criteria. If a workflow def
was later updated to change its WAIT_FOR_WEBHOOK task's matches
inputParameter, getMatchers kept returning the old criteria until
someone re-PUT the webhook config.

Now: createMatchers stores only the receiverWorkflowNamesToVersions
*targets* (which DO need to be captured at registration time so the
expression-based override is preserved). getMatchers looks up the
WorkflowDefs from MetadataDAO and extracts matches on every call.
Slightly more expensive (one MetadataDAO read per workflow per webhook
event), but always correct.

Tests

- InMemoryWebhookDAOTest gains getMatchers_reflectsWorkflowDefUpdates_noStaleCache:
  register matchers, assert criteria, swap the mock to return an updated
  WorkflowDef, assert getMatchers returns the new criteria without
  re-running createMatchers. Pins the regression.

- 77 tests / 0 failures.
… pollAndExecute

Per AGENTS.md preference: use real implementations over mocks. The prior
WebhookWorkerTest was 100% mocks. Rewrote to wire real instances of
InMemoryWebhookDAO, InMemoryWebhookTaskService, WebhookHashingService, and
TargetWorkflowCollector. Only the deep infra is still mocked: QueueDAO,
WorkflowExecutor, ExecutionDAOFacade, MetadataDAO, ParametersUtils.

The full register→receive→dispatch flow remains covered by
WebhooksOssEndToEndTest. This class now focuses on worker-internal
semantics that don't surface end-to-end.

Added pollAndExecute coverage

Made pollAndExecute() package-private (was private — same change pattern
already applied to handleMessage for the same reason). Four new tests:

- pollAndExecute_emptyBatch_noop: empty pop, no ack called.
- pollAndExecute_success_acks: happy path acks the message after
  handleMessage returns cleanly.
- pollAndExecute_handleMessageThrows_doesNotAck: REGRESSION TEST. Prior
  impl had a try/finally that acked even on failure, silently dropping
  webhook events on any exception. Test uses an anonymous InMemoryWebhookDAO
  subclass that throws from getWebhookEvent to simulate a DB blip.
- pollAndExecute_mixedBatch_acksOnlySuccesses: batch with one good +
  one poison message. Only the good one gets acked; the poison one is
  left for the queue's unack timeout to redeliver.

Existing handleMessage tests rewired with real beans

- handleMessage_eventNotFound_returnsEarly: stores nothing, calls
  handleMessage, expects no executor invocation.
- handleMessage_configNotFound_returnsEarly: stores event but not config;
  asserts the event is NOT removed (so retry can happen).
- handleMessage_workflowsToStart_invokesExecutor: registers config with
  workflowsToStart, asserts StartWorkflowInput contents (name, version,
  input merge, event name) and that the event is removed after success.
- handleMessage_workflowsToStart_nonIntegerVersion_skipped: version is a
  string, no executor call.
- handleMessage_matcherHit_completesWaitingTask: registers config + uses
  real WebhookTaskService.put to register a waiting task; asserts that
  webhookTaskService.get returns empty after handleMessage completes.
- handleMessage_matcherHit_terminalTask_skipped: pre-COMPLETED task is
  not re-completed.

Total: 81 tests across 11 classes, 0 failures.
Per project CLAUDE.md, new user-facing surfaces need source-derived docs.
The webhooks-oss module added /api/metadata/webhook (config CRUD) and
/api/webhook/{id} (event receive) without any documentation; users had no
on-ramp.

Adds docs/.../systemtasks/wait-for-webhook-task.md following the existing
system task doc pattern (wait-task.md template). Contents:

- Task type + how the dispatch chain works (mapper, hash, worker)
- inputParameters table (matches)
- WebhookConfig registration: endpoint, body, verifier types table
  (HMAC_BASED, SIGNATURE_BASED, HEADER_BASED, SLACK_BASED, STRIPE,
  TWITTER, SENDGRID — pulled from the actual Verifier enum)
- Delivery endpoint: status codes, response semantics
- Verified end-to-end curl example: same flow exercised during the
  smoke test (register config, register workflow def, deliver HMAC-signed
  event)
- conductor.webhook.worker.* properties table (pulled from
  WebhookWorkerProperties.java)
- Failure semantics: verification failure / worker dispatch failure /
  rejected event logging — matches the round-3 ack-only-on-success
  worker change.

Registered in mkdocs.yml nav alongside wait-task.md.
nthmost-orkes added a commit that referenced this pull request May 19, 2026
…ntion coverage

All 5 OSS persistence backends now have a retention mechanism for
incoming_webhook_event:

Cassandra (CassandraWebhookDAO)
- Adds default_time_to_live to incoming_webhook_event at CREATE TABLE
  time. Cassandra auto-expires rows after the configured window — no
  scheduled job needed, no tombstone churn beyond what the SSTable
  layer already manages.
- Default TTL: 7 days (matches the SQL cleanup-job default).
- Configurable via conductor.webhooks.cleanup.retention-duration on
  the CassandraConfiguration @bean wiring (Duration → toSeconds for
  default_time_to_live).
- Caveat documented inline: CREATE TABLE IF NOT EXISTS won't update an
  existing table's TTL; operators changing retention on an existing
  deployment need to ALTER TABLE manually.
- The other 3 tables (webhook, webhook_target_workflows,
  webhook_hash_to_taskid) don't get TTL — they're durable config /
  task-routing state, not auditable events.

Redis (RedisWebhookCleanupJob)
- @component @conditional(AnyRedisCondition.class)
  @ConditionalOnProperty(conductor.webhooks.cleanup.enabled,
  matchIfMissing=true).
- @scheduled walker on the same cron as the SQL backings: HGETALL on
  the WEBHOOK_EVENT hash, parse each value as IncomingWebhookEvent,
  HDEL any whose timeStamp is older than retentionDuration.
- Cheap at low cardinality; documented that high-volume deployments
  should migrate to per-key TTL storage (HEXPIRE landed in Redis 7.4
  but isn't ubiquitous yet).
- 4 tests with testcontainers Redis: deletes old, keeps recent, empty
  hash no-op, all-recent kept, timeStamp=0 fixtures preserved.

Verified locally: full compile clean across cassandra/redis modules
and server. Tests not run locally — both need Docker. CI will validate.

Deferred items now all addressed:
- ~~Worker DLQ semantics~~ (in #1106)
- ~~Matcher cache staleness~~ (in #1106 + #1110)
- ~~Postgres impls~~ (#1110)
- ~~MySQL/SQLite/Redis/Cassandra adapters~~ (#1110)
- ~~WebhookCleanupJob (postgres)~~ (#1110)
- ~~MySQL + SQLite cleanup~~ (#1110)
- ~~Cassandra TTL + Redis cleanup~~ (this commit)

Remaining open: EventMessage audit-table persistence (the addEventMessage
→ log.warn debt) — separate, larger architectural decision.
@nthmost-orkes nthmost-orkes requested a review from v1r3n May 20, 2026 00:02
…ment

Random (workflow, version, taskRef, matches, body) tuples assert the two
hash sites in webhooks-oss canonicalize identically: InMemoryWebhookTaskService
(registration) and WebhookHashingService.computeJsonHash (inbound). Catches
silent divergence that would cause inbound events to miss registered tasks.

100 reps per direction (match / no-match). Seed defaults to System.nanoTime();
on failure the TestWatcher prints -Dwebhook.hash.seed=<n> for reproducibility.
nthmost-orkes pushed a commit to nthmost-orkes/webhook-emitter that referenced this pull request May 21, 2026
… all verifiers

For each of HMAC_BASED, SIGNATURE_BASED, HEADER_BASED, SLACK_BASED, STRIPE,
TWITTER: register a workflow def with a WAIT_FOR_WEBHOOK task, register a
webhook config with a fresh secret, start the workflow, fire a signed event
via the emitter, and poll until COMPLETED.

Verified 6/6 PASS against conductor-server-lite running on loki with this
PR's webhooks-oss module (conductor-oss/conductor#1106). Per-verifier
config quirks (HEADER_BASED headers map, SLACK_BASED urlVerified, STRIPE
api_version, TWITTER header name) handled per-case from source.

Not in conductor's CI — too slow/flaky for the main test suite. This is
the manual / on-demand counterpart to the in-process tests in webhooks-oss.
@nthmost-orkes

Copy link
Copy Markdown
Contributor Author

Re-validated end-to-end: built conductor at 84ea71859 and ran the smoke harness at nthmost-orkes/webhook-emitter@16fd7e7 against it — 6/6 verifiers PASS (HMAC, signature, header, slack, stripe, twitter). Each fires a real signed event → conductor verifier → WAIT_FOR_WEBHOOK task completes → workflow reaches COMPLETED.

Stripe events whose body omits api_version (older API payloads, synthetic
test payloads, and apparently some real Stripe deliveries) NPE on
rawApiVersion.split(...). The signature has already been verified at this
point, and both branches of the apiVersion check ultimately return
ErrorList.empty(), so a missing api_version is treated as "old api / no
deserializer support" — same as the dated pre-deserializer path.

Surfaced by the smoke harness firing synthetic Stripe-shaped bodies; saw
500 NPE in the verifier path. Adds StripeVerifierTest (3 cases): null
api_version accepted, bad signature rejected, missing header rejected.

Orkes has identical code in io.orkes.conductor.webhook.verifier.StripeVerifier
— mirror fix needed upstream to keep parity.
nthmost-orkes pushed a commit to nthmost-orkes/webhook-emitter that referenced this pull request May 22, 2026
Stacks N workflow instances on the same WAIT_FOR_WEBHOOK matcher, fires a
single signed event, asserts all N reach COMPLETED. Exercises the
InMemoryWebhookTaskService bucket logic from the queue/worker level —
the unit test multiple_tasks_sameHash_bucketed pins it in-process; this
proves it through the real HTTP + dispatch path.

Verified against conductor-oss/conductor#1106 @ 38c40c5c3:
  N=20  → 20/20 in 0.67s   (median 0.26s, max 0.67s)
  N=100 → 100/100 in 3.01s (median 2.19s, max 3.01s)

HMAC_BASED only — per-verifier cross-cut stays in smoke.py.
nthmost-orkes pushed a commit to nthmost-orkes/webhook-emitter that referenced this pull request May 22, 2026
Three new conductor-smoke scripts and one new emitter endpoint:

  conductor-smoke/multi_verifier_smoke.py
    Three webhook configs (HMAC, signature, header) all bound to the same
    workflow as receivers. Each fire spawns a fresh workflow instance that
    completes. Exercises matcher recomputation over overlapping receivers.

  conductor-smoke/replay_smoke.py
    Fires the same signed event twice — documentation-mode for current
    behavior. Confirms: no inbound dedup, terminal-task-skip prevents
    double-dispatch errors, bucket dispatches to all matching tasks.

  conductor-smoke/dual_emitter_smoke.py
    Two emitters on different hosts fire identically-signed events at the
    same conductor at the same instant (threading.Event barrier). Tests
    cross-network body fidelity and worker behavior under concurrent ingress.

  scenarios.py / POST /scenarios/wait-for-webhook
    Server-side bundle of register-workflow → register-webhook → start →
    sign → fire → poll. Returns ok/final-status/per-phase timings. Lets
    end users drive a full smoke from one curl, no Python driver needed.
    Mounted optionally via include_router — fails open if scenarios.py
    can't import.

Verified against conductor-oss/conductor#1106 @ 38c40c5c3:
  - multi_verifier:    3/3 PASS
  - replay:            behaves as expected (no dedup, terminal skip)
  - dual_emitter:      PASS (mac@localhost:8767 + loki:8766 → loki:7001)
  - /scenarios call:   200 OK, COMPLETED in 2.45s including 2.02s poll
@nthmost-orkes nthmost-orkes marked this pull request as ready for review May 25, 2026 06:03
…ervice

Production multi-node backing for the WebhookDAO + WebhookTaskService
interfaces. Designed clean for OSS (no orgId anywhere in the schema or
queries) and applies the matcher-cache fix from the parent PR's Round 3:
target workflows are persisted, match criteria are recomputed from
MetadataDAO on every getMatchers() call so WorkflowDef updates take
effect without re-registering the webhook.

Stacked PR — base is feat/webhooks-from-orkes-split. Don't merge until
the parent (#1106) lands. After parent merges this rebases onto main.

Added

- V16__webhook.sql: four tables — webhook, incoming_webhook_event,
  webhook_target_workflows, webhook_hash_to_taskid. Renamed
  webhook_matchers -> webhook_target_workflows to reflect that we
  store target workflow ids (override snapshot), NOT pre-computed
  matchers. Index on incoming_webhook_event(created_on) for the
  cleanup follow-up.
- org.conductoross.conductor.postgres.dao.PostgresWebhookDAO:
  9 interface methods + private computeMatchers helper. Extends
  PostgresBaseDAO. ON CONFLICT DO UPDATE for upserts.
- org.conductoross.conductor.postgres.dao.PostgresWebhookTaskService:
  3 interface methods (put/get/remove). ON CONFLICT DO NOTHING on
  put for idempotency.
- PostgresWebhookDAOTest (8 tests with testcontainers + @MockBean MetadataDAO),
  PostgresWebhookTaskServiceTest (6 tests with testcontainers).
  Critical regression test: getMatchers_recomputesFromMetadataDAO_onEachCall_noStaleCache.

Shared hash extraction

- WebhookTaskHashing helper added to core
  (org.conductoross.conductor.service.webhook). Static methods that
  both InMemory and Postgres impls use, so tasks registered via one
  backing are findable by hash via any other. Validates matches
  before composing the hash (regression fix in this commit: NPE in
  removeIterationFromTaskRefName would mask the missing-matches
  exception).
- InMemoryWebhookTaskService refactored to delegate to the helper
  (-50 LoC). Test surface unchanged.

PostgresConfiguration wiring

- Two new @bean methods with @dependsOn({"flywayForPrimaryDb"}).
  Bean names "webhookDAO" and "webhookTaskService" match the
  @ConditionalOnMissingBean(name=...) on the InMemory impls, so the
  postgres beans take precedence when this module is on the classpath.
  Always-on for now (no @ConditionalOnProperty); the whole
  postgres-persistence module is gated by conductor.db.type=postgres
  at the upstream configuration.

OSS-side design changes vs orkes' PostgresWebhookDAO

- All org_id columns and WHERE clauses dropped. Tables are tenant-free.
- ConductorLimitsDAO/LimitUtils quota check dropped (Orkes-only).
- Preconditions.checkNotNull replaced with NotFoundException where
  appropriate; nulls otherwise propagate per OSS convention.
- Scheduled refreshMatchers executor dropped — matcher staleness is
  fixed by design (recompute on read).
- workflowDefToUpdateMap cache dropped — same reason.
- The "matches" persistence (webhook_matchers table in orkes)
  replaced with "targets" persistence (webhook_target_workflows).
  Different shape, same external contract.

Verified locally

- :conductor-postgres-persistence:compileJava + compileTestJava clean
- :conductor-postgres-persistence:spotlessCheck clean
- :conductor-webhooks-oss:test green (81 tests, hashing refactor
  doesn't break anything)
- :conductor-server:compileJava clean

Note: postgres tests need Docker daemon for testcontainers; not run
locally. CI will validate.
Closes #1142

V5 seeded webhook_cleanup_lease.expires_at with a text literal
('1970-01-01T00:00:00'). The xerial sqlite-jdbc driver stores
java.sql.Timestamp as INTEGER (epoch millis), and SQLite's type-
comparison rule says INTEGER < TEXT — so SqliteWebhookCleanupJob's
"expires_at < ?" check (with ? bound as a Timestamp) was always
false on a freshly migrated DB. The cleanup job would log "lease
held elsewhere" forever and never delete an incoming_webhook_event
row.

V5 hasn't shipped to main yet, so fix the seed value in place
rather than adding a patch migration. Use INTEGER 0 so the seed's
storage class matches what the cleanup job writes via setTimestamp().

The test's @before lease reset (added earlier in this PR for between-
test isolation, since flyway.clean() leaks rows on the shared sqlite
file) remains necessary — it covers a different concern. Trim its
comments to drop the now-stale "V5 seed is broken" wording.

Postgres and MySQL aren't affected: their native TIMESTAMP types parse
the text literal as a real timestamp. No Orkes parallel — sqlite-
persistence and the cleanup-lease pattern are both OSS-side additions.
v1r3n
v1r3n previously requested changes May 31, 2026

@v1r3n v1r3n left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

Tag.java was introduced by the webhooks-oss port but serves no purpose in OSS:
tags in Orkes power TagsService/AutomaticTagService-based access control, neither
of which exists here. The class had no callers other than WebhookConfig.tags.
In Orkes, WebhookConfig.tags is populated at query-time by TagsService from
a separate enterprise tags table. OSS has no TagsService, so the field was
always null. Dead API surface that would mislead users.
EventMessage was added to common to mirror Orkes' webhook event audit trail,
but OSS has no ExecutionDAO.addEventMessage() and IncomingWebhookService
never references it. Zero callers anywhere in OSS — pure dead code.

The event audit feature (DLQ persistence for rejected/unmatched webhook events)
requires a separate architectural decision before it belongs here.
…ODO Remove

WebhookExecutionHistory embeds a rolling execution log inside WebhookConfig
and is only implemented in orkes-conductor/webhooks-enterprise. Both the Orkes
and OSS models carried a '// TODO Remove this' comment. The recordHistory()
call also held the only webhookDAO.createWebhook() write-back after dispatch,
but urlVerified is already persisted by IncomingWebhookService before the event
is enqueued, so that write was redundant. Drops lastRunWorkflowIdSize property
and its associated test.
…getter

Method was @JsonIgnore and only extracted keys from receiverWorkflowNamesToVersions,
which is already accessible directly. No call site exists anywhere in OSS.
@nthmost-orkes nthmost-orkes requested a review from v1r3n June 1, 2026 17:29
NonTransientException previously fell through to the default 500, which
told webhook senders (Stripe, GitHub, Slack) that the server was broken
and to retry. Signature verification failures are client errors — the
sender's signature is wrong — and must return 4xx so senders stop
retrying.

NonTransientException by semantics ("this won't work no matter how many
times you retry") maps cleanly to 400 Bad Request.

Verified: bad_signature and replay cases in negative_smoke.py now return
400 instead of 500, 5/5 negative cases passing on enki.

@chrishagglund-ship-it chrishagglund-ship-it left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't provide meaningful review of the code quality. Looks great to me on a quick scan over 11k lines. Claude thinks it's pretty good too, noting a few risky things but mentioning that they're documented.

I ran through an example scenario with the emitter as well as some manual testing with it and a manually defined wf, stuff worked.

@nthmost-orkes

Copy link
Copy Markdown
Contributor Author

I can't provide meaningful review of the code quality. Looks great to me on a quick scan over 11k lines. Claude thinks it's pretty good too, noting a few risky things but mentioning that they're documented.

I ran through an example scenario with the emitter as well as some manual testing with it and a manually defined wf, stuff worked.

Thanks, I appreciate you just trying the stuff out, for sure!

…growth

Ports orkes-io/orkes-conductor#3663. Webhook.start() does a put() into the
WAIT_FOR_WEBHOOK hash set; the only removal path was WebhookWorker.handleEvent
firing SREM on a matching event. Workflows terminated before a match arrived
left their taskId in the set indefinitely — Redis memory leak / Postgres row
accumulation in webhook_hash_to_taskid.

Fix: add Webhook.cancel() that calls webhookTaskService.remove(TaskModel, int)
on all six backings (postgres, mysql, redis, cassandra, sqlite, in-memory).
WebhookTaskHashing.computeHashIfPresent() handles tasks cancelled before
IN_PROGRESS (null/missing matches) as a safe no-op instead of throwing.
Cleanup failures are caught and logged so the terminate path is never blocked.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EPIC: Implement WAIT_FOR_WEBHOOK task type

3 participants