Skip to content

concord-server: add wait condition for external process for suspend/resume#1334

Draft
benbroadaway wants to merge 12 commits into
masterfrom
bb/external-wait-condition
Draft

concord-server: add wait condition for external process for suspend/resume#1334
benbroadaway wants to merge 12 commits into
masterfrom
bb/external-wait-condition

Conversation

@benbroadaway
Copy link
Copy Markdown
Collaborator

Adds an alternative, more reliable, option to standard suspend/resume API.

Current issue: The standard suspend/resume pattern mostly works fine for one-to-one resumeEvent-to-externalWork relationships. Essentially, if someone uses a task that straightforwardly suspends and resumes from an external service callback, it works fine. But if they try to get too fancy with multiple resume events invoked at the same time (e.g. two threads calling the task which suspends, then there's a race condition if the two callbacks hit at the same time. This issue can be worked around by retrying the resume callback, but that adds noise and there are times when the resume-then-suspend-for-the-loser-to-retry situation takes longer than an external service will reasonably willing to retry. Further, there's no way to "timeout" any of the suspend/resume attempts--processes hit with the race condition can remain suspended indefinitely.

Solution: Add a new process wait condition which indicates the process is waiting for one or more specific external callbacks to be made. Very similar pattern overall, but with a little more control for the resuming side of things.

There's some extra overhead to implementing this pattern, so it's advisable to use the "standard" suspend/resume pattern unless necessary.

The pattern:

  1. Concord process tells to do work and give's it a standard callback URL to call when it's done. The url contains a unique event id which for that specific remote work context (this is not the resume event ID for the process)
    a. This can be repeated with another unique event id for each callback to be made.
  2. Process creates an EXTERNAL_EVENT wait condition for each remote work event id, and a shared resume event id for the current thread of the process.
    a. This can be repeated across threads within a single process. Very importantly, each thread should use its own resume event ID AND one use one per thread/suspend call.
    b. Each wait condition can be customized with an expiresAt datetime value to ensure the process resumes in a timely manner even if the remote job is "lost"
  3. Process suspends with the resume event ID(s).
  4. Watchdog watches the wait conditions to check for expiration or 100% non-waiting for each wait condition with the same resume event ID for a given process ID.
  5. External service(s) clear the wait conditions one at a time when they are done. They can optionally send variables which will eventually be added to the process state. Until then, they persist in the wait condition. (This is one reason a max-size is important).
  6. When all of the wait conditions are satisfied for a given resume event ID, then the watchdog resumes the process execution with variables delivered under a name specified in the wait condition's saveAs field.
  7. If multiple threads suspended for external event conditions, then the process may re-suspend to wait for the remaining thread's conditions to be met. If multiple thread's conditions were all met within the same watchdog run, then all threads will resume at the same time (no race condition).

Some things I'm not sure about yet:

  • Naming is currently WaitType.EXTERNAL_EVENT. Makes sense, but that's the same name as an audit log event type. Too close to home?
  • expiresAt default value. Currently set to 2 hours if not specified. Should default be configurable through server config or policy? Same for max value for the field (e.g. don't expire 3 allow expirations (Currently set to 2 days).
  • process_wait_conditions churn/bloat. This isn't a super-frequently called API, but just need to keep in mind what it does. Write (and rewrites for *N wait conditions) wait conditions, then re-write the conditions each time they're cleared by a callback which adds the resume vars. Probably wise to set a max-size for resume vars, and max number of waits allowed (the latter may be a little more complex than it initially seems).
  • Regardless of size, this adds more quantity of wait conditions overall, and will extend the run time of the watchdog process for handling other wait conditions. Good news is that it doesn't add update queries when nothing changes to the waits between runs and doesn't require extra mutual-exclusivity queries like the PROCESS_COMPLETION waits. Mostly more listing calls added.
  • cleaner way to indicate a wait condition expired when the process resumes?
  • There may be some idempotency concerns. (network error-induced retry on create wait, but the first one actually succeeded, just didn't get the response delivered)

@benbroadaway benbroadaway requested review from brig and ibodrov May 25, 2026 22:15
ibodrov
ibodrov previously approved these changes May 26, 2026
Copy link
Copy Markdown
Collaborator

@ibodrov ibodrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Any plans to update ProcessWaitList to render EXTERNAL_EVENTs in the UI?

Copy link
Copy Markdown
Collaborator

@ibodrov ibodrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some findings.

// collect to grouping by process resume event. All waits for the same resume
// event must be completed before we can process them together and resume.
var byResumeEvent = waits.stream()
.collect(groupingBy(wait -> wait.waitCondition().resumeEvent()));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to group by instanceId + resumeEvent instead of just resumeEvent?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There shouldn't be more than one instanceId in the process_wait_conditions table, right? Each row contains all of the conditions for the given instanceId. Otherwise we'd have some wacky side effects with the existing wait processing. All wait conditions for a single process are handled within the same batch item. Which is why I probably need to size limit for the dynamic-length fields (e.g. variables, externalEvent, reason, saveAs to keep one process from blowing out memory if it has too many conditions or conditions that are too large (or both).

return Map.of();
}

var variables = isExpired(condition) ? Map.of("isExpired", true) : condition.variables();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check if waiting=false here? Might be unlucky race if someone calls clearExternalWaitCondition just before expiresAt and picked up by ProcessWaitWatchdog right after expiresAt.

Or maybe we do save variables unconditionally but also add isExpired=true to the map?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point. Makes me think just remove the injected isExpired=true from the resume variables.

If the race condition you described hits, then the result sent back to the process would be valid=true + <valid callback-provided variables>, so whatever handles the result will still have to add extra logic to suss out whether they got back what they expected when it appears it expired.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants