fix(scheduler): run scheduled jobs in deterministic order #65

Open
ferntheplant wants to merge 1 commit into get-convex:main from ferntheplant:fix-scheduled-jobs

@ferntheplant ferntheplant commented Jan 30, 2026

I was running into flaky tests when using the test backend with the Workflow and Workpool components. I spent a while banging my head against a wall trying to figure out why, until I asked an LLM to look at the convex-test and Workpool source code. The following is mostly AI-generated, but I can confirm it works on my large test suite, including complex nested workflows, enqueued workpool actions, and other scheduled functions. I've included the notes it generated to provide more context.

  • Replace per-job setTimeout with a single queue drained by (scheduledTime, insertionOrder)
    so workpool main→updateRunStatus ordering is preserved and generation mismatch is avoided
  • Add drainInProgress lock so only one drain runs at a time; skip job if already inProgress
  • Initialize scheduledJobQueue, nextDrainTimerId, scheduledJobInsertionCounter, drainInProgress
    in convexTest()

Convex-test + Workpool/Workflow Debugging Summary

Summary of issues encountered when testing long workflows that use workpool with the convex-test backend, and the changes made to convex-test to fix them.

Context

  • convex-test runs a fake Convex backend in-process (no bundling, no serverless). It sets a global Convex and implements the same syscall interface the real backend uses; your backend code runs in the same Node/Vitest process.
  • Workpool uses a loop with a generation counter for optimistic concurrency: main(generation, segment) runs, increments generation, does work, then schedules updateRunStatus(newGeneration, segment) with runAfter(0, ...). When updateRunStatus runs, it expects state.generation === generation; otherwise it throws generation mismatch: X !== Y.
  • Workflow schedules steps; workpool runs jobs. Both can schedule mutations with runAfter(0, ...).
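
To make the generation check concrete, here is a minimal sketch of the handshake described above. It is not taken from the Workpool source: `LoopState`, the function bodies, and the order of the two numbers in the error message are all assumptions for illustration.

```typescript
// Hypothetical, simplified model of workpool's generation handshake.
type LoopState = { generation: number };

// main(generation) bumps the stored generation and returns the value that
// the follow-up updateRunStatus call is scheduled with.
function main(state: LoopState, generation: number): number {
  state.generation = generation + 1;
  return state.generation;
}

// updateRunStatus only succeeds if nothing advanced the generation in between.
function updateRunStatus(state: LoopState, generation: number): void {
  if (state.generation !== generation) {
    throw new Error(`generation mismatch: ${state.generation} !== ${generation}`);
  }
}

// Expected order: main(4), then its own updateRunStatus(5). No error.
const ordered: LoopState = { generation: 4 };
updateRunStatus(ordered, main(ordered, 4));

// Interleaved order: a second main runs before the pending updateRunStatus(5).
const interleaved: LoopState = { generation: 4 };
const pendingGeneration = main(interleaved, 4);
main(interleaved, 5);
let mismatch = "";
try {
  updateRunStatus(interleaved, pendingGeneration);
} catch (error) {
  mismatch = (error as Error).message;
}
```

In this sketch the interleaved run ends with `mismatch` set to a generation mismatch message, which is the failure mode the rest of these notes trace back to scheduling order.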

Issue 1: Generation mismatch and workpool spin

Symptoms

  • Intermittent Error: generation mismatch: 12 !== 4 when running scheduled function loop:updateRunStatus.
  • Followed by the log [complete] … work is done, but its work is gone, with workpool then reporting running: 1 indefinitely until finishAllScheduledFunctions hit its iteration limit.

Root cause

In the original convex-test implementation, every runAfter(0, ...) became its own setTimeout(callback, 0). All such callbacks (from workflow, workpool main, workpool updateRunStatus, kick, etc.) went into the same timer queue and ran in event-loop order, not in the order workpool expects.

So:

  1. main(4) runs, commits, schedules updateRunStatus(5) with setTimeout(..., 0).
  2. Before that callback runs, something else (e.g. a job completing → complete() → kick() → main(5) with runAfter(0)) also schedules with setTimeout(0).
  3. If main(5)’s callback runs first, generation advances to 6, 7, … When updateRunStatus(5) finally runs, it sees state.generation === 12 → generation mismatch. The loop then gets into an inconsistent state (e.g. run status still says “running” but the work doc is gone), leading to the “work is gone” log and the spin.

Workpool alone controls the generation number; the bug was ordering: other workpool callbacks (e.g. main from kick) were running before the updateRunStatus that belonged to the previous main.

Fix: Queue-based scheduler with deterministic order

  • Replaced per-job setTimeout with a single queue of scheduled jobs (scheduledJobQueue) and a single drain driven by one timer.
  • On 1.0/schedule: push a ScheduledJobEntry (scheduledTime, insertionOrder, componentPath, functionPath, args, jobId, name) onto the queue and call scheduleDrain().
  • scheduleDrain(): if the queue is non-empty, set one setTimeout(drainScheduledJobs, delay) where delay = max(0, nextDue - now).
  • drainScheduledJobs(): in a loop, take all jobs with scheduledTime <= now, sort by (scheduledTime, insertionOrder), remove them from the queue, and run each with runOneScheduledJob. Then call scheduleDrain() again for any remaining jobs.

Effect: “Run now” jobs run in insertion order. So when workpool’s main(4) schedules updateRunStatus(5) with runAfter(0), that job is the next in line and runs before any later runAfter(0) (e.g. main(5) from kick), eliminating the generation mismatch.
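
The ordering rule can be sketched in isolation. This is an illustrative model only, not the code in this PR: `ScheduledJobQueue`, `schedule`, and `takeDue` are hypothetical names, and the entry carries just the fields needed to show the sort key.

```typescript
// Hypothetical sketch of the (scheduledTime, insertionOrder) ordering.
type ScheduledJobEntry = {
  scheduledTime: number;
  insertionOrder: number;
  name: string;
};

class ScheduledJobQueue {
  private entries: ScheduledJobEntry[] = [];
  private insertionCounter = 0;

  // Record the job together with a monotonically increasing insertion index.
  schedule(name: string, scheduledTime: number): void {
    this.entries.push({
      scheduledTime,
      insertionOrder: this.insertionCounter++,
      name,
    });
  }

  // Remove and return every job due at `now`, earliest time first,
  // breaking ties by insertion order.
  takeDue(now: number): ScheduledJobEntry[] {
    const due = this.entries.filter((job) => job.scheduledTime <= now);
    this.entries = this.entries.filter((job) => job.scheduledTime > now);
    return due.sort(
      (a, b) =>
        a.scheduledTime - b.scheduledTime || a.insertionOrder - b.insertionOrder,
    );
  }
}

const jobQueue = new ScheduledJobQueue();
jobQueue.schedule("updateRunStatus(5)", 0); // scheduled first by main(4)
jobQueue.schedule("recovery", 50); // a future job, not due yet
jobQueue.schedule("main(5)", 0); // scheduled later, e.g. from kick
const dueNames = jobQueue.takeDue(0).map((job) => job.name);
```

Here `dueNames` lists `updateRunStatus(5)` before `main(5)` because both share a scheduled time and the tie is broken by insertion order; the `recovery` entry stays queued until a later drain.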

Issue 2: “Unexpected scheduled function state when starting it: inProgress”

Symptoms

  • Test failed with: convexTest invariant error: Unexpected scheduled function state when starting it: inProgress.
  • Indicated we were trying to “start” a job that was already marked inProgress (i.e. the same job was being run twice).

Root cause

Two drains could run at the same time:

  1. Drain 1 removes job A from the queue and runs runOneScheduledJob(A) (sets A to inProgress, then await withAuth().fun(A)).
  2. While Drain 1 is awaiting, the event loop runs; a timer set by scheduleDrain() (e.g. from a job scheduling more work) fires and Drain 2 starts.
  3. Drain 2 sees job B in the queue, removes B, and runs B (sets B to inProgress, then awaits).
  4. Drain 1 resumes; its due list was computed earlier and still includes B. Drain 1 then runs B again → job is already inProgress → invariant.

So the same job could be executed by two concurrent drains.

Fix: Single drain at a time + defensive skip

  • Added drainInProgress on the Convex global. At the start of drainScheduledJobs(), if drainInProgress is true, return immediately. Set it to true for the duration of the drain and clear it in a finally before calling scheduleDrain(). Only one drain runs at a time; a timer that fires while a drain is in progress does nothing, and the current drain will call scheduleDrain() when it finishes.
  • In runOneScheduledJob, if the job is already inProgress when we’re about to start it (shouldn’t happen with the lock), treat it as a duplicate and return without running or throwing, so we don’t run the same job twice.
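
A minimal sketch of the lock plus defensive skip follows. It assumes a simplified job shape and omits the scheduleDrain() call and the timer entirely; `DrainJob`, `drainQueue`, `runLog`, and `demoConcurrentDrains` are hypothetical names, not the PR's code.

```typescript
// Hypothetical sketch: one drain at a time, and never start a job twice.
type DrainJob = { id: string; state: "pending" | "inProgress" | "success" };

const drainQueue: DrainJob[] = [];
const runLog: string[] = [];
let drainInProgress = false;

async function runOneScheduledJob(job: DrainJob): Promise<void> {
  // Defensive skip: a job that is no longer pending was already picked up.
  if (job.state !== "pending") return;
  job.state = "inProgress";
  await Promise.resolve(); // stand-in for awaiting the scheduled function
  runLog.push(job.id);
  job.state = "success";
}

async function drainScheduledJobs(): Promise<void> {
  // A drain that starts while another is running does nothing.
  if (drainInProgress) return;
  drainInProgress = true;
  try {
    while (drainQueue.length > 0) {
      const job = drainQueue.shift()!;
      await runOneScheduledJob(job);
    }
  } finally {
    drainInProgress = false;
  }
}

// Two overlapping drain attempts: only the first one processes the queue.
async function demoConcurrentDrains(): Promise<string[]> {
  drainQueue.push({ id: "A", state: "pending" }, { id: "B", state: "pending" });
  await Promise.all([drainScheduledJobs(), drainScheduledJobs()]);
  return runLog;
}
```

The flag is set synchronously before the first await, so the second call sees it and returns immediately. That is why each job appears in `runLog` once.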

Code changes in convex-test (summary)

  1. Types and global state

    • ScheduledJobEntry (scheduledTime, insertionOrder, componentPath, functionPath, parsedArgs, jobId, name).
    • ConvexGlobal: scheduledJobQueue, nextDrainTimerId, scheduledJobInsertionCounter, drainInProgress.
  2. scheduleDrain()

    • Clears any existing drain timer; if the queue is non-empty, sets a single setTimeout(drainScheduledJobs, delay).
  3. drainScheduledJobs()

    • If drainInProgress, return. Set drainInProgress = true, then in a loop: collect due jobs (scheduledTime ≤ now), sort by (scheduledTime, insertionOrder), remove from queue, run each with runOneScheduledJob. In finally, set drainInProgress = false and call scheduleDrain().
  4. runOneScheduledJob(job)

    • Same behavior as before (runInComponent to set inProgress, run function, set success/failed, jobFinished). If the job is already inProgress when starting, skip (return without running).
  5. 1.0/schedule handler

    • Push one entry onto scheduledJobQueue (with incremented scheduledJobInsertionCounter), then call scheduleDrain(); no per-job setTimeout.
  6. convexTest()

    • Initialize scheduledJobQueue: [], nextDrainTimerId: null, scheduledJobInsertionCounter: 0, drainInProgress: false.

Other learnings

  • mergeModules for workpool only: Workflow runs your code via a function handle (component path in the handle), so your handlers are loaded from the root app’s modules. Workpool’s executor runs inside the workpool component and must resolve your job handlers from that component’s module map, so the workpool component’s modules need to include your app’s code (e.g. via a merged module map). Workflow’s registration doesn’t need that merge.
  • Multiple namespaces: Registering workflow and workpool multiple times (e.g. "workflow", "transactWorkflow", "workflow/workpool", etc.) is supported; each component path gets its own DatabaseFake and module cache, and reference resolution uses the path from the API so tests work as expected.

Why finishAllScheduledFunctions may need a higher maxIterations

For a workflow with ~7 steps you might expect ~14 scheduled function runs (e.g. one per step + one per step complete). In practice, finishAllScheduledFunctions(maxIterations) can hit the limit with the default 100 and require 200 (or more) even for “small” workflows.

How it works: Each iteration does (1) advanceTimers() (e.g. vi.runAllTimers()), then (2) waitForInProgressScheduledFunctions() until no jobs are in progress. So one iteration = one timer advance + wait for that batch of work to finish.

Why the count is higher than “steps × 2”:

  1. One iteration ≠ one scheduled function. With the queue-based scheduler there is one timer per “next due time”. Advancing timers runs one drain, which can run several jobs (e.g. main then updateRunStatus). So one iteration can run 1–N jobs. The number of iterations is roughly the number of drains (timer fires), not the number of scheduled function executions.

  2. Workpool adds many scheduled calls. Besides workflow step run + step complete, workpool’s loop runs main + updateRunStatus per “tick”, and there can be many ticks per step (pending start, completion, cancellation, recovery checks, etc.). So 7 workflow steps can trigger many more than 14 scheduled runs (e.g. dozens of workpool loop ticks).

  3. runAt(future) creates more “waves”. When jobs use runAt(segmentTime) or recovery intervals, each distinct time gets its own timer. So you get one iteration per such time. Many segments/recovery times ⇒ many iterations.

So needing maxIterations around 200 for a 7-step workflow is expected: the real number of “advance + wait” cycles is driven by workpool loop ticks and distinct scheduled times, not just “steps × 2”. Bumping to 200 (or a bit more) for workflow+workpool tests is reasonable; if you still hit the limit, check for unexpectedly many loop ticks or recursive scheduling.
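
The relationship between iterations and jobs can be modeled with a toy simulation. This is a deliberately simplified sketch, not convex-test's implementation: it assumes one iteration drains every job sharing the earliest due time, and it ignores jobs that schedule further work. `simulateIterations` is a hypothetical helper.

```typescript
// Hypothetical model: one iteration = one timer advance + one drain.
function simulateIterations(
  dueTimes: number[],
  maxIterations: number,
): { iterations: number; jobsRun: number } {
  const pending = [...dueTimes].sort((a, b) => a - b);
  let iterations = 0;
  let jobsRun = 0;
  while (pending.length > 0) {
    if (iterations >= maxIterations) {
      throw new Error("too many iterations");
    }
    iterations++;
    // Advance the fake clock to the next due time and drain everything due.
    const now = pending[0];
    while (pending.length > 0 && pending[0] <= now) {
      pending.shift();
      jobsRun++;
    }
  }
  return { iterations, jobsRun };
}

// Six jobs spread over three distinct due times.
const simulated = simulateIterations([0, 0, 0, 10, 10, 20], 100);
```

In this model the six jobs finish in three iterations, one per distinct due time, so the iteration count tracks the number of distinct scheduled times rather than the number of jobs.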


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Summary by CodeRabbit

  • New Features

    • Implemented deterministic scheduling for queued jobs with controlled draining mechanism.
    • Added optional maxIterations parameter to finishAllScheduledFunctions (default increased to 500).
  • Bug Fixes

    • Enhanced error handling in action execution to ensure proper cleanup.

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

The change introduces a deterministic, queue-based scheduling mechanism for scheduled jobs with configurable drain limits, improved error handling via try/finally blocks, and updated public API signatures to support optional iteration parameters.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Scheduled Job Queue & Drain Mechanism (index.ts) | Introduced scheduleDrain, drainScheduledJobs, runOneScheduledJob functions and ScheduledJobEntry type to implement queue-based job scheduling with ordered processing by scheduled time and insertion order. Extended ConvexGlobal and related types to track scheduledJobQueue, timer ID, insertion counter, and drainInProgress flag. |
| Error Handling & Control Flow (index.ts) | Enhanced withAuth().runInComponent action execution with try/finally to ensure finishAction() is called even on errors. Modified action invocation path to properly return results after try/finally handling. |
| Public API & Configuration (index.ts) | Updated finishAllScheduledFunctions signature in TestConvexForDataModel<DataModel> and TestConvexForDataModelAndIdentity interfaces to accept optional maxIterations parameter. Changed default maxIterations from 100 to 500 in draining logic. |

Sequence Diagram

sequenceDiagram
    participant Test as Test Framework
    participant Queue as Job Queue
    participant Scheduler as Scheduler/Timer
    participant Job as Job Executor
    participant Backend as Backend State

    Test->>Queue: schedule(job)
    activate Queue
    Queue->>Queue: Push to scheduledJobQueue<br/>(with scheduledTime, order)
    deactivate Queue

    Test->>Scheduler: advanceTimers()
    activate Scheduler
    Scheduler->>Scheduler: scheduleDrain()<br/>(compute delay to next due job)
    deactivate Scheduler

    Scheduler->>Job: Timer fires
    activate Job
    Job->>Job: drainScheduledJobs()
    Job->>Job: Set drainInProgress = true
    loop Process all due jobs in order
        Job->>Job: runOneScheduledJob()
        Job->>Job: pending → inProgress
        Job->>Backend: Update job state
        Job->>Job: Execute job function
        Job->>Job: inProgress → success/failed
        Job->>Backend: Update result
    end
    Job->>Job: Set drainInProgress = false
    deactivate Job

    Test->>Test: Verify job results

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Behold! The queue's now orderly and grand,
With timers set by drain's steady hand,
Jobs scheduled with grace, no chaos in sight,
Five-hundred iterations? Now that's quite right!
Errors caught safely—try, finally, done—
The warren's test framework is becoming more fun! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(scheduler): run scheduled jobs in deterministic order' directly describes the main objective of the PR - introducing a deterministic queue-and-drain scheduler to fix flaky tests by ensuring jobs run in consistent order rather than in event-loop order. |
