concord-server: add wait condition for external process for suspend/resume by benbroadaway · Pull Request #1334 · walmartlabs/concord

benbroadaway · 2026-05-25T22:15:01Z

Adds an alternative, more reliable, option to standard suspend/resume API.

Current issue: The standard suspend/resume pattern mostly works fine for one-to-one resumeEvent-to-externalWork relationships. Essentially, if someone uses a task that straightforwardly suspends and resumes from an external service callback, it works fine. But if they try to get too fancy with multiple resume events invoked at the same time (e.g. two threads calling the task which suspends, then there's a race condition if the two callbacks hit at the same time. This issue can be worked around by retrying the resume callback, but that adds noise and there are times when the resume-then-suspend-for-the-loser-to-retry situation takes longer than an external service will reasonably willing to retry. Further, there's no way to "timeout" any of the suspend/resume attempts--processes hit with the race condition can remain suspended indefinitely.

Solution: Add a new process wait condition which indicates the process is waiting for one or more specific external callbacks to be made. Very similar pattern overall, but with a little more control for the resuming side of things.

There's some extra overhead to implementing this pattern, so it's advisable to use the "standard" suspend/resume pattern unless necessary.

The pattern:

Concord process tells to do work and give's it a standard callback URL to call when it's done. The url contains a unique event id which for that specific remote work context (this is not the resume event ID for the process)
a. This can be repeated with another unique event id for each callback to be made.
Process creates an EXTERNAL_EVENT wait condition for each remote work event id, and a shared resume event id for the current thread of the process.
a. This can be repeated across threads within a single process. Very importantly, each thread should use its own resume event ID AND one use one per thread/suspend call.
b. Each wait condition can be customized with an expiresAt datetime value to ensure the process resumes in a timely manner even if the remote job is "lost"
Process suspends with the resume event ID(s).
Watchdog watches the wait conditions to check for expiration or 100% non-waiting for each wait condition with the same resume event ID for a given process ID.
External service(s) clear the wait conditions one at a time when they are done. They can optionally send variables which will eventually be added to the process state. Until then, they persist in the wait condition. (This is one reason a max-size is important).
When all of the wait conditions are satisfied for a given resume event ID, then the watchdog resumes the process execution with variables delivered under a name specified in the wait condition's saveAs field.
If multiple threads suspended for external event conditions, then the process may re-suspend to wait for the remaining thread's conditions to be met. If multiple thread's conditions were all met within the same watchdog run, then all threads will resume at the same time (no race condition).

Some things I'm not sure about yet:

Naming is currently WaitType.EXTERNAL_EVENT. Makes sense, but that's the same name as an audit log event type. Too close to home?
expiresAt default value. Currently set to 2 hours if not specified. Should default be configurable through server config or policy? Same for max value for the field (e.g. don't expire 3 allow expirations (Currently set to 2 days).
process_wait_conditions churn/bloat. This isn't a super-frequently called API, but just need to keep in mind what it does. Write (and rewrites for *N wait conditions) wait conditions, then re-write the conditions each time they're cleared by a callback which adds the resume vars. Probably wise to set a max-size for resume vars, and max number of waits allowed (the latter may be a little more complex than it initially seems).
Regardless of size, this adds more quantity of wait conditions overall, and will extend the run time of the watchdog process for handling other wait conditions. Good news is that it doesn't add update queries when nothing changes to the waits between runs and doesn't require extra mutual-exclusivity queries like the PROCESS_COMPLETION waits. Mostly more listing calls added.
cleaner way to indicate a wait condition expired when the process resumes?
There may be some idempotency concerns. (network error-induced retry on create wait, but the first one actually succeeded, just didn't get the response delivered)

…esume

…tion

…rocessResources

ibodrov

LGTM. Any plans to update ProcessWaitList to render EXTERNAL_EVENTs in the UI?

ibodrov

Some findings.

ibodrov · 2026-05-26T02:14:16Z

+        // collect to grouping by process resume event. All waits for the same resume
+        // event must be completed before we can process them together and resume.
+        var byResumeEvent = waits.stream()
+                .collect(groupingBy(wait -> wait.waitCondition().resumeEvent()));


I think we need to group by instanceId + resumeEvent instead of just resumeEvent?

There shouldn't be more than one instanceId in the process_wait_conditions table, right? Each row contains all of the conditions for the given instanceId. Otherwise we'd have some wacky side effects with the existing wait processing. All wait conditions for a single process are handled within the same batch item. Which is why I probably need to size limit for the dynamic-length fields (e.g. variables, externalEvent, reason, saveAs to keep one process from blowing out memory if it has too many conditions or conditions that are too large (or both).

ibodrov · 2026-05-26T02:22:12Z

+            return Map.of();
+        }
+
+        var variables = isExpired(condition) ? Map.of("isExpired", true) : condition.variables();


Do we need to check if waiting=false here? Might be unlucky race if someone calls clearExternalWaitCondition just before expiresAt and picked up by ProcessWaitWatchdog right after expiresAt.

Or maybe we do save variables unconditionally but also add isExpired=true to the map?

Very good point. Makes me think just remove the injected isExpired=true from the resume variables.

If the race condition you described hits, then the result sent back to the process would be valid=true + <valid callback-provided variables>, so whatever handles the result will still have to add extra logic to suss out whether they got back what they expected when it appears it expired.

benbroadaway added 7 commits May 6, 2026 08:04

concord-server: add wait condition for external process for suspend/r…

c8a3ca5

…esume

Merge remote-tracking branch 'oss/master' into bb/external-wait-condi…

eb56a82

…tion

single resume event per thread, multiple external events

49583c9

nested resuem vars support, max depth, add task

a336096

add IT

fd64b92

return error result when events expire, santize condition values in P…

31bd93c

…rocessResources

fix typo

d552fa0

benbroadaway requested review from brig and ibodrov May 25, 2026 22:15

mutable base args

dc59f4d

ibodrov previously approved these changes May 26, 2026

View reviewed changes

ibodrov reviewed May 26, 2026

View reviewed changes

benbroadaway added 4 commits May 26, 2026 08:35

add waitConditions IT flow

4e19e2e

remove 'isExpired' variable injection

65ac4f2

tidy wait task

3c626fd

console2: Add EXTERNAL_EVENT support

10eced6

benbroadaway dismissed ibodrov’s stale review via 10eced6 May 26, 2026 14:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

concord-server: add wait condition for external process for suspend/resume#1334

concord-server: add wait condition for external process for suspend/resume#1334
benbroadaway wants to merge 12 commits into
masterfrom
bb/external-wait-condition

benbroadaway commented May 25, 2026

Uh oh!

ibodrov left a comment

Uh oh!

ibodrov left a comment

Uh oh!

ibodrov May 26, 2026

Uh oh!

benbroadaway May 26, 2026

Uh oh!

Uh oh!

ibodrov May 26, 2026

Uh oh!

benbroadaway May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

Conversation

benbroadaway commented May 25, 2026

Uh oh!

ibodrov left a comment

Choose a reason for hiding this comment

Uh oh!

ibodrov left a comment

Choose a reason for hiding this comment

Uh oh!

ibodrov May 26, 2026

Choose a reason for hiding this comment

Uh oh!

benbroadaway May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ibodrov May 26, 2026

Choose a reason for hiding this comment

Uh oh!

benbroadaway May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants