concord-server: add wait condition for external process for suspend/resume#1334
concord-server: add wait condition for external process for suspend/resume#1334benbroadaway wants to merge 12 commits into
Conversation
ibodrov
left a comment
There was a problem hiding this comment.
LGTM. Any plans to update ProcessWaitList to render EXTERNAL_EVENTs in the UI?
| // collect to grouping by process resume event. All waits for the same resume | ||
| // event must be completed before we can process them together and resume. | ||
| var byResumeEvent = waits.stream() | ||
| .collect(groupingBy(wait -> wait.waitCondition().resumeEvent())); |
There was a problem hiding this comment.
I think we need to group by instanceId + resumeEvent instead of just resumeEvent?
There was a problem hiding this comment.
There shouldn't be more than one instanceId in the process_wait_conditions table, right? Each row contains all of the conditions for the given instanceId. Otherwise we'd have some wacky side effects with the existing wait processing. All wait conditions for a single process are handled within the same batch item. Which is why I probably need to size limit for the dynamic-length fields (e.g. variables, externalEvent, reason, saveAs to keep one process from blowing out memory if it has too many conditions or conditions that are too large (or both).
| return Map.of(); | ||
| } | ||
|
|
||
| var variables = isExpired(condition) ? Map.of("isExpired", true) : condition.variables(); |
There was a problem hiding this comment.
Do we need to check if waiting=false here? Might be unlucky race if someone calls clearExternalWaitCondition just before expiresAt and picked up by ProcessWaitWatchdog right after expiresAt.
Or maybe we do save variables unconditionally but also add isExpired=true to the map?
There was a problem hiding this comment.
Very good point. Makes me think just remove the injected isExpired=true from the resume variables.
If the race condition you described hits, then the result sent back to the process would be valid=true + <valid callback-provided variables>, so whatever handles the result will still have to add extra logic to suss out whether they got back what they expected when it appears it expired.
Adds an alternative, more reliable, option to standard suspend/resume API.
Current issue: The standard suspend/resume pattern mostly works fine for one-to-one resumeEvent-to-externalWork relationships. Essentially, if someone uses a task that straightforwardly suspends and resumes from an external service callback, it works fine. But if they try to get too fancy with multiple resume events invoked at the same time (e.g. two threads calling the task which suspends, then there's a race condition if the two callbacks hit at the same time. This issue can be worked around by retrying the resume callback, but that adds noise and there are times when the resume-then-suspend-for-the-loser-to-retry situation takes longer than an external service will reasonably willing to retry. Further, there's no way to "timeout" any of the suspend/resume attempts--processes hit with the race condition can remain suspended indefinitely.
Solution: Add a new process wait condition which indicates the process is waiting for one or more specific external callbacks to be made. Very similar pattern overall, but with a little more control for the resuming side of things.
There's some extra overhead to implementing this pattern, so it's advisable to use the "standard" suspend/resume pattern unless necessary.
The pattern:
a. This can be repeated with another unique event id for each callback to be made.
EXTERNAL_EVENTwait condition for each remote work event id, and a shared resume event id for the current thread of the process.a. This can be repeated across threads within a single process. Very importantly, each thread should use its own resume event ID AND one use one per thread/suspend call.
b. Each wait condition can be customized with an
expiresAtdatetime value to ensure the process resumes in a timely manner even if the remote job is "lost"saveAsfield.Some things I'm not sure about yet:
WaitType.EXTERNAL_EVENT. Makes sense, but that's the same name as an audit log event type. Too close to home?expiresAtdefault value. Currently set to 2 hours if not specified. Should default be configurable through server config or policy? Same for max value for the field (e.g. don't expire 3 allow expirations (Currently set to 2 days).process_wait_conditionschurn/bloat. This isn't a super-frequently called API, but just need to keep in mind what it does. Write (and rewrites for *N wait conditions) wait conditions, then re-write the conditions each time they're cleared by a callback which adds the resume vars. Probably wise to set a max-size for resume vars, and max number of waits allowed (the latter may be a little more complex than it initially seems).PROCESS_COMPLETIONwaits. Mostly more listing calls added.