fix(scheduler): run scheduled jobs in deterministic order by ferntheplant · Pull Request #65 · get-convex/convex-test

ferntheplant · 2026-01-30T17:41:49Z

I was running into flaky tests when using the test backend with the Workflow and Workpool components. I spent a while banging my head against a wall trying to figure out why, until I asked an LLM to look at the convex-test and Workpool source code. The following is mostly AI-generated, but I can confirm it works on my large test suite, including complex nested workflows, enqueued workpool actions, and other scheduled functions. I've included the notes that it generated to provide more context.

- Replace per-job setTimeout with a single queue drained by (scheduledTime, insertionOrder) so workpool main→updateRunStatus ordering is preserved and generation mismatch is avoided
- Add drainInProgress lock so only one drain runs at a time; skip job if already inProgress
- Initialize scheduledJobQueue, nextDrainTimerId, scheduledJobInsertionCounter, drainInProgress in convexTest()

Convex-test + Workpool/Workflow Debugging Summary

Summary of issues encountered when testing long workflows that use workpool with the convex-test backend, and the changes made to convex-test to fix them.

Context

convex-test runs a fake Convex backend in-process (no bundling, no serverless). It sets a global Convex and implements the same syscall interface the real backend uses; your backend code runs in the same Node/Vitest process.
Workpool uses a loop with a generation counter for optimistic concurrency: main(generation, segment) runs, increments generation, does work, then schedules updateRunStatus(newGeneration, segment) with runAfter(0, ...). When updateRunStatus runs, it expects state.generation === generation; otherwise it throws generation mismatch: X !== Y.
Workflow schedules steps; workpool runs jobs. Both can schedule mutations with runAfter(0, ...).
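
To make the generation check concrete, here is a minimal sketch of the optimistic-concurrency pattern described above. It is a simplified, hypothetical model and not the Workpool source: the `LoopState` type, the function bodies, and the order of the two numbers in the error message are assumptions made for illustration.

```typescript
// Hypothetical, simplified model of the generation check; not Workpool's real code.
interface LoopState {
  generation: number;
}

// Models main(): bumps the generation and returns the value that would be
// passed to updateRunStatus via runAfter(0, ...).
function main(state: LoopState): number {
  state.generation += 1;
  return state.generation;
}

// Models updateRunStatus(): only valid if nothing advanced the generation
// since the main() call that scheduled it.
function updateRunStatus(state: LoopState, generation: number): void {
  if (state.generation !== generation) {
    // Assumed message layout: current state first, expected value second.
    throw new Error(`generation mismatch: ${state.generation} !== ${generation}`);
  }
}

const state: LoopState = { generation: 4 };

// Expected order: main, then its own updateRunStatus.
const next = main(state); // generation is now 5
updateRunStatus(state, next); // passes

// Problem order: another main runs before the pending updateRunStatus.
const stale = main(state); // generation is now 6
main(state); // generation is now 7
let mismatch = "";
try {
  updateRunStatus(state, stale);
} catch (e) {
  mismatch = (e as Error).message;
}
console.log(mismatch);
```

The second half shows why ordering matters: the check itself is correct, but it only holds if the scheduler runs `updateRunStatus` before any later `main`.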

Issue 1: Generation mismatch and workpool spin

Symptoms

Intermittent Error: generation mismatch: 12 !== 4 when running scheduled function loop:updateRunStatus.
This was followed by the log [complete] … work is done, but its work is gone, with workpool reporting running: 1 indefinitely until finishAllScheduledFunctions hit its iteration limit.

Root cause

In the original convex-test implementation, every runAfter(0, ...) became its own setTimeout(callback, 0). All such callbacks (from workflow, workpool main, workpool updateRunStatus, kick, etc.) went into the same timer queue and ran in event-loop order, not in the order workpool expects.

So:

main(4) runs, commits, schedules updateRunStatus(5) with setTimeout(..., 0).
Before that callback runs, something else (e.g. a job completing → complete() → kick() → main(5) with runAfter(0)) also schedules with setTimeout(0).
If main(5)’s callback runs first, generation advances to 6, 7, … When updateRunStatus(5) finally runs, it sees state.generation === 12 → generation mismatch. The loop then gets into an inconsistent state (e.g. run status still says “running” but the work doc is gone), leading to the “work is gone” log and the spin.

Workpool alone controls the generation number; the bug was ordering: other workpool callbacks (e.g. main from kick) were running before the updateRunStatus that belonged to the previous main.

Fix: Queue-based scheduler with deterministic order

Replaced per-job setTimeout with a single queue of scheduled jobs (scheduledJobQueue) and a single drain driven by one timer.
On 1.0/schedule: push a ScheduledJobEntry (scheduledTime, insertionOrder, componentPath, functionPath, args, jobId, name) onto the queue and call scheduleDrain().
scheduleDrain(): if the queue is non-empty, set one setTimeout(drainScheduledJobs, delay) where delay = max(0, nextDue - now).
drainScheduledJobs(): in a loop, take all jobs with scheduledTime <= now, sort by (scheduledTime, insertionOrder), remove them from the queue, and run each with runOneScheduledJob. Then call scheduleDrain() again for any remaining jobs.

Effect: “Run now” jobs run in insertion order. So when workpool’s main(4) schedules updateRunStatus(5) with runAfter(0), that job is the next in line and runs before any later runAfter(0) (e.g. main(5) from kick), eliminating the generation mismatch.
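
The ordering rule above can be sketched as a small queue that hands out due jobs sorted by (scheduledTime, insertionOrder). This is an illustrative sketch only: the field names follow the description, while the `JobQueue` class and its `schedule` and `takeDue` methods are hypothetical and do not mirror the actual convex-test code.

```typescript
// Illustrative sketch; field names follow the PR description, the class is hypothetical.
interface ScheduledJobEntry {
  scheduledTime: number; // ms timestamp at which the job becomes due
  insertionOrder: number; // monotonically increasing tie-breaker
  name: string;
}

class JobQueue {
  private queue: ScheduledJobEntry[] = [];
  private insertionCounter = 0;

  schedule(name: string, scheduledTime: number): void {
    this.queue.push({
      scheduledTime,
      insertionOrder: this.insertionCounter++,
      name,
    });
  }

  // Removes and returns every job due at `now`,
  // ordered by (scheduledTime, insertionOrder).
  takeDue(now: number): ScheduledJobEntry[] {
    const due = this.queue.filter((j) => j.scheduledTime <= now);
    this.queue = this.queue.filter((j) => j.scheduledTime > now);
    return due.sort(
      (a, b) =>
        a.scheduledTime - b.scheduledTime || a.insertionOrder - b.insertionOrder,
    );
  }
}

const q = new JobQueue();
q.schedule("updateRunStatus(5)", 0); // scheduled first by main(4)
q.schedule("main(5)", 0); // scheduled later by kick()
q.schedule("recovery", 1000); // not due yet
const order = q.takeDue(0).map((j) => j.name);
console.log(order);
```

Because both "run now" jobs share the same scheduledTime, the insertion counter alone decides their order, so `updateRunStatus(5)` comes out ahead of `main(5)`.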

Issue 2: “Unexpected scheduled function state when starting it: inProgress”

Symptoms

Test failed with: convexTest invariant error: Unexpected scheduled function state when starting it: inProgress.
This indicated that we were trying to “start” a job that was already marked inProgress (i.e. the same job was being run twice).

Root cause

Two drains could run at the same time:

Drain 1 removes job A from the queue and runs runOneScheduledJob(A) (sets A to inProgress, then await withAuth().fun(A)).
While Drain 1 is awaiting, the event loop runs; a timer set by scheduleDrain() (e.g. from a job scheduling more work) fires and Drain 2 starts.
Drain 2 sees job B in the queue, removes B, and runs B (sets B to inProgress, then awaits).
Drain 1 resumes; its due list was computed earlier and still includes B. Drain 1 then runs B again → job is already inProgress → invariant.

So the same job could be executed by two concurrent drains.

Fix: Single drain at a time + defensive skip

Added drainInProgress on the Convex global. At the start of drainScheduledJobs(), if drainInProgress is true, return immediately. Set it to true for the duration of the drain and clear it in a finally before calling scheduleDrain(). Only one drain runs at a time; a timer that fires while a drain is in progress does nothing, and the current drain will call scheduleDrain() when it finishes.
In runOneScheduledJob, if the job is already inProgress when we’re about to start it (shouldn’t happen with the lock), treat it as a duplicate and return without running or throwing, so we don’t run the same job twice.
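
The single-drain guard can be sketched roughly as follows. This is a hypothetical, stripped-down model: the `events` log and the `pending` array are illustration-only, and the real code would also call scheduleDrain() after clearing the flag, which is omitted here.

```typescript
// Hypothetical sketch of a reentrancy guard; not the actual convex-test code.
type Job = () => Promise<void>;

const events: string[] = [];
const pending: Job[] = [];
let drainInProgress = false;

async function drainScheduledJobs(): Promise<void> {
  if (drainInProgress) {
    events.push("skipped"); // a second timer fired mid-drain: do nothing
    return;
  }
  drainInProgress = true;
  events.push("started");
  try {
    // Re-check the queue on every pass so jobs added mid-drain are picked up.
    while (pending.length > 0) {
      const job = pending.shift()!;
      await job();
    }
  } finally {
    drainInProgress = false; // cleared even if a job throws
    events.push("finished");
  }
}

pending.push(async () => {
  events.push("job A");
});
void drainScheduledJobs(); // runs synchronously up to its first await
void drainScheduledJobs(); // sees the flag and returns immediately
console.log(events);
```

The second call arrives while the first is suspended on `await job()`, which is the situation described in the root cause; with the flag in place it returns without touching the queue.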

Code changes in convex-test (summary)

Types and global state
- ScheduledJobEntry (scheduledTime, insertionOrder, componentPath, functionPath, parsedArgs, jobId, name).
- ConvexGlobal: scheduledJobQueue, nextDrainTimerId, scheduledJobInsertionCounter, drainInProgress.
scheduleDrain()
- Clears any existing drain timer; if the queue is non-empty, sets a single setTimeout(drainScheduledJobs, delay).
drainScheduledJobs()
- If drainInProgress, return. Set drainInProgress = true, then in a loop: collect due jobs (scheduledTime ≤ now), sort by (scheduledTime, insertionOrder), remove from queue, run each with runOneScheduledJob. In finally, set drainInProgress = false and call scheduleDrain().
runOneScheduledJob(job)
- Same behavior as before (runInComponent to set inProgress, run function, set success/failed, jobFinished). If the job is already inProgress when starting, skip (return without running).
1.0/schedule handler
- Push one entry onto scheduledJobQueue (with incremented scheduledJobInsertionCounter), then call scheduleDrain(); no per-job setTimeout.
convexTest()
- Initialize scheduledJobQueue: [], nextDrainTimerId: null, scheduledJobInsertionCounter: 0, drainInProgress: false.
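
As a rough illustration of the scheduleDrain() behavior summarized above, the sketch below computes delay = max(0, nextDue - now) and keeps at most one pending timer. Module-level variables stand in for the fields added to the Convex global, and the `QueuedJob` type, the `computeDrainDelay` helper, and the `drain` parameter are assumptions made for this example.

```typescript
// Sketch under assumptions: module-level variables stand in for the Convex global.
interface QueuedJob {
  scheduledTime: number; // ms timestamp at which the job becomes due
}

const scheduledJobQueue: QueuedJob[] = [];
let nextDrainTimerId: ReturnType<typeof setTimeout> | null = null;

// Hypothetical helper: delay = max(0, nextDue - now), or null when nothing is queued.
function computeDrainDelay(queue: QueuedJob[], now: number): number | null {
  if (queue.length === 0) return null;
  const nextDue = Math.min(...queue.map((j) => j.scheduledTime));
  return Math.max(0, nextDue - now);
}

function scheduleDrain(drain: () => void, now: number = Date.now()): void {
  // Clear any existing drain timer so at most one is ever pending.
  if (nextDrainTimerId !== null) {
    clearTimeout(nextDrainTimerId);
    nextDrainTimerId = null;
  }
  const delay = computeDrainDelay(scheduledJobQueue, now);
  if (delay === null) return; // empty queue: nothing to schedule
  nextDrainTimerId = setTimeout(() => {
    nextDrainTimerId = null;
    drain();
  }, delay);
}

console.log(computeDrainDelay([{ scheduledTime: 1500 }, { scheduledTime: 1200 }], 1000));
```

Keeping the delay computation in a pure function makes it easy to check separately from the timer handling.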

Other learnings

mergeModules for workpool only: Workflow runs your code via a function handle (component path in the handle), so your handlers are loaded from the root app’s modules. Workpool’s executor runs inside the workpool component and must resolve your job handlers from that component’s module map, so the workpool component’s modules need to include your app’s code (e.g. via a merged module map). Workflow’s registration doesn’t need that merge.
Multiple namespaces: Registering workflow and workpool multiple times (e.g. "workflow", "transactWorkflow", "workflow/workpool", etc.) is supported; each component path gets its own DatabaseFake and module cache, and reference resolution uses the path from the API so tests work as expected.

Why `finishAllScheduledFunctions` may need a higher `maxIterations`

For a workflow with ~7 steps you might expect ~14 scheduled function runs (e.g. one per step + one per step complete). In practice, finishAllScheduledFunctions(maxIterations) can hit the limit with the default 100 and require 200 (or more) even for “small” workflows.

How it works: Each iteration does (1) advanceTimers() (e.g. vi.runAllTimers()), then (2) waitForInProgressScheduledFunctions() until no jobs are in progress. So one iteration = one timer advance + wait for that batch of work to finish.

Why the count is higher than “steps × 2”:

One iteration ≠ one scheduled function. With the queue-based scheduler there is one timer per “next due time”. Advancing timers runs one drain, which can run several jobs (e.g. main then updateRunStatus). So one iteration can run 1–N jobs. The number of iterations is roughly the number of drains (timer fires), not the number of scheduled function executions.
Workpool adds many scheduled calls. Besides workflow step run + step complete, workpool’s loop runs main + updateRunStatus per “tick”, and there can be many ticks per step (pending start, completion, cancellation, recovery checks, etc.). So 7 workflow steps can trigger many more than 14 scheduled runs (e.g. dozens of workpool loop ticks).
runAt(future) creates more “waves”. When jobs use runAt(segmentTime) or recovery intervals, each distinct time gets its own timer. So you get one iteration per such time. Many segments/recovery times ⇒ many iterations.

So needing maxIterations around 200 for a 7-step workflow is expected: the real number of “advance + wait” cycles is driven by workpool loop ticks and distinct scheduled times, not just “steps × 2”. Bumping to 200 (or a bit more) for workflow+workpool tests is reasonable; if you still hit the limit, check for unexpectedly many loop ticks or recursive scheduling.
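
The relationship between iterations and jobs can be illustrated with a toy model. This is not convex-test's implementation: the `TimedJob` type, the `countIterations` function, and the error message are hypothetical, and the model assumes each iteration advances to the next distinct due time and drains everything due at that time.

```typescript
// Toy model, not convex-test's implementation.
interface TimedJob {
  at: number; // due time in ms
  spawn?: TimedJob[]; // jobs this job schedules when it runs
}

function countIterations(
  initial: TimedJob[],
  maxIterations: number,
): { iterations: number; jobsRun: number } {
  let queue = [...initial];
  let iterations = 0;
  let jobsRun = 0;
  while (queue.length > 0) {
    if (iterations >= maxIterations) {
      throw new Error(`still had scheduled jobs after ${maxIterations} iterations`);
    }
    iterations++;
    // "Advance timers" to the next due time.
    const now = Math.min(...queue.map((j) => j.at));
    // One drain: keep running due jobs, including ones that earlier jobs
    // in the same drain scheduled for `now`.
    let due = queue.filter((j) => j.at <= now);
    while (due.length > 0) {
      queue = queue.filter((j) => j.at > now);
      for (const job of due) {
        jobsRun++;
        queue.push(...(job.spawn ?? []));
      }
      due = queue.filter((j) => j.at <= now);
    }
  }
  return { iterations, jobsRun };
}

const modelResult = countIterations(
  [
    { at: 0, spawn: [{ at: 0 }, { at: 100 }] },
    { at: 0 },
    { at: 100 },
    { at: 250 },
  ],
  10,
);
console.log(modelResult);
```

In this model six jobs finish in three iterations, one per distinct due time (0, 100, 250), which is why the iteration count tracks the number of drains rather than the number of scheduled function executions.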

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Summary by CodeRabbit

New Features
- Implemented deterministic scheduling for queued jobs with controlled draining mechanism.
- Added optional maxIterations parameter to finishAllScheduledFunctions (default increased to 500).
Bug Fixes
- Enhanced error handling in action execution to ensure proper cleanup.


coderabbitai · 2026-01-30T17:42:05Z

Walkthrough

The change introduces a deterministic, queue-based scheduling mechanism for scheduled jobs with configurable drain limits, improved error handling via try/finally blocks, and updated public API signatures to support optional iteration parameters.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Scheduled Job Queue & Drain Mechanism `index.ts` | Introduced `scheduleDrain`, `drainScheduledJobs`, `runOneScheduledJob` functions and `ScheduledJobEntry` type to implement queue-based job scheduling with ordered processing by scheduled time and insertion order. Extended `ConvexGlobal` and related types to track `scheduledJobQueue`, timer ID, insertion counter, and `drainInProgress` flag. |
| Error Handling & Control Flow `index.ts` | Enhanced `withAuth().runInComponent` action execution with try/finally to ensure `finishAction()` is called even on errors. Modified action invocation path to properly return results after try/finally handling. |
| Public API & Configuration `index.ts` | Updated `finishAllScheduledFunctions` signature in `TestConvexForDataModel<DataModel>` and `TestConvexForDataModelAndIdentity` interfaces to accept optional `maxIterations` parameter. Changed default `maxIterations` from 100 to 500 in draining logic. |

Sequence Diagram

sequenceDiagram
    participant Test as Test Framework
    participant Queue as Job Queue
    participant Scheduler as Scheduler/Timer
    participant Job as Job Executor
    participant Backend as Backend State

    Test->>Queue: schedule(job)
    activate Queue
    Queue->>Queue: Push to scheduledJobQueue<br/>(with scheduledTime, order)
    deactivate Queue

    Test->>Scheduler: advanceTimers()
    activate Scheduler
    Scheduler->>Scheduler: scheduleDrain()<br/>(compute delay to next due job)
    deactivate Scheduler

    Scheduler->>Job: Timer fires
    activate Job
    Job->>Job: drainScheduledJobs()
    Job->>Job: Set drainInProgress = true
    loop Process all due jobs in order
        Job->>Job: runOneScheduledJob()
        Job->>Job: pending → inProgress
        Job->>Backend: Update job state
        Job->>Job: Execute job function
        Job->>Job: inProgress → success/failed
        Job->>Backend: Update result
    end
    Job->>Job: Set drainInProgress = false
    deactivate Job

    Test->>Test: Verify job results

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Behold! The queue's now orderly and grand,
With timers set by drain's steady hand,
Jobs scheduled with grace, no chaos in sight,
Five-hundred iterations? Now that's quite right!
Errors caught safely—try, finally, done—
The warren's test framework is becoming more fun! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(scheduler): run scheduled jobs in deterministic order' directly describes the main objective of the PR - introducing a deterministic queue-and-drain scheduler to fix flaky tests by ensuring jobs run in consistent order rather than in event-loop order. |


Commit 1de3eda: fix(scheduler): run scheduled jobs in deterministic order and prevent concurrent drains

ferntheplant wants to merge 1 commit into get-convex:main from ferntheplant:fix-scheduled-jobs
