fix(runs): serialize run rollback to prevent state overwrites by mvanhorn · Pull Request #2598 · bytedance/deer-flow

mvanhorn · 2026-04-27T03:32:41Z

Closes #2505 (P1)

A run could be marked interrupted before its worker had actually finished cleanup. A newer run for the same thread could start and write fresh checkpoints, only for the old worker's rollback to restore an older snapshot over the new state.

The closed RFC PR #2504 documented the fix design; this PR is the implementation.

Change

Two pieces close the race.

1. New intermediate states

backend/packages/harness/deerflow/runtime/runs/schemas.py:

class RunStatus(StrEnum):
    pending = "pending"
    running = "running"
    cancelling = "cancelling"      # NEW: cancel signal sent, worker still cleaning up
    rolling_back = "rolling_back"  # NEW: worker finished, rollback in progress
    success = "success"
    error = "error"
    timeout = "timeout"
    interrupted = "interrupted"

cancel() now transitions to cancelling (not directly to interrupted). The worker is responsible for the subsequent progression:

running → cancelling → rolling_back → interrupted   (action=rollback, success)
running → cancelling → rolling_back → error         (action=rollback, rollback failed)
running → cancelling → interrupted                  (action=interrupt)

2. `create_or_reject` awaits cleanup

For interrupt/rollback multitask strategies, the manager now waits for the cancelled workers' tasks to finish before inserting the new run record. This is the serialization point that closes the race - an old worker can no longer write rollback state on top of a new run because the new run hasn't been created yet.

The wait happens with the manager lock released, so other threads' operations are not blocked. Re-entry is safe because the new run gets a fresh run_id keying.

has_inflight and the internal inflight check now include cancelling and rolling_back, so a third concurrent request for the same thread also blocks until cleanup completes.

Tests

backend/tests/test_run_manager.py adds three regression tests under # --- #2505: rollback-serialization regression tests ---:

test_has_inflight_includes_cancelling_and_rolling_back - the new states count as inflight.
test_create_or_reject_awaits_cancelled_workers_before_creating - the race-condition guard. Spawns a worker task that performs simulated rollback cleanup, calls create_or_reject, and asserts the worker finished BEFORE the new run was created (assert order: [\"worker_done\", \"new_created\"]).
test_create_or_reject_skips_already_cancelling_runs - re-cancelling a cancelling run does not stomp the original abort_action.

The existing test_cancel was updated to assert the intermediate cancelling state; the docstring now explains that the final state is set by the worker per the new state machine.

Adversarial verification

I confirmed the new race test catches the regression by temporarily commenting out the await asyncio.gather(*tasks_to_await, ...) line. The race test fails with:

AssertionError: assert ['new_created'] == ['worker_done', 'new_created']

Restoring the line returns all 14 manager tests to green, plus the 7 existing worker-rollback tests.

Verification

============================== 21 passed in 0.12s ==============================

cd backend && PYTHONPATH=. uv run pytest tests/test_run_manager.py tests/test_run_worker_rollback.py -v - 21 passed
cd backend && uvx ruff check packages/harness/deerflow/runtime/runs/ tests/test_run_manager.py - clean

🤖 Built with assistance from Claude Code.

Closes bytedance#2505. A run could be marked `interrupted` before its worker had finished cleanup, allowing a newer run for the same thread to start and write fresh checkpoints. The old worker would then complete its rollback, restoring an older snapshot over the new state. Two changes close the race: 1. Two new RunStatus values, `cancelling` and `rolling_back`, sit between `running` and the terminal states. `cancel()` now transitions to `cancelling` (not `interrupted`); the worker is responsible for `rolling_back -> interrupted` (or `-> error` on rollback failure). 2. `create_or_reject` for `interrupt`/`rollback` strategies now awaits the cancelled workers' tasks before creating the new run. `has_inflight` and the inflight check both include the new states, so a third concurrent request also blocks until cleanup completes. Three new tests cover the new behavior: - `has_inflight` includes `cancelling` and `rolling_back` - `create_or_reject` awaits cancelled workers before creating (the regression test for the race) - `create_or_reject` does not stomp `abort_action` on already- cancelling runs The existing `test_cancel` was updated to assert the intermediate `cancelling` state; the worker tests for actual rollback behavior remain green.

Copilot

Pull request overview

Implements the rollback/interrupt serialization design for #2505 by introducing intermediate run states and ensuring new runs for the same thread don’t start until cancelled workers have finished cleanup/rollback, preventing checkpoint state overwrites.

Changes:

Added cancelling and rolling_back run statuses and treated them as in-flight.
Updated RunManager.create_or_reject() to cancel inflight runs and await their worker tasks before inserting a new run (for interrupt/rollback strategies).
Updated rollback paths in run_agent() to mark rolling_back before performing rollback and to set terminal status afterward; added regression tests.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
backend/packages/harness/deerflow/runtime/runs/schemas.py	Adds `cancelling` / `rolling_back` to the RunStatus enum.
backend/packages/harness/deerflow/runtime/runs/manager.py	Defines inflight statuses, updates cancellation semantics, and serializes new run creation behind cancelled worker completion.
backend/packages/harness/deerflow/runtime/runs/worker.py	Marks `rolling_back` before rollback work and sets final status after rollback completes/fails.
backend/tests/test_run_manager.py	Updates cancel test expectations and adds regression tests covering inflight detection and create-or-reject serialization behavior.

+                # Mark rolling_back BEFORE the rollback work so create_or_reject
+                # for the same thread sees this run as still inflight and
+                # blocks until rollback completes (#2505).
+                await run_manager.set_status(run_id, RunStatus.rolling_back)


+        # Hold the lock only for the inflight check + cancel-signal phase. We
+        # then release it to await the worker tasks (so other operations on
+        # other threads aren't blocked), and reacquire it to insert the new
+        # record. The new record is created with a fresh run_id and is keyed
+        # by run_id, so the brief unlocked window cannot collide.


+                    if r.status in (RunStatus.pending, RunStatus.running):
+                        r.abort_action = multitask_strategy
+                        r.abort_event.set()
+                        if r.task is not None and not r.task.done():
+                            r.task.cancel()
+                        r.status = RunStatus.cancelling
+                        r.updated_at = now
                    if r.task is not None and not r.task.done():


            if record.status not in (RunStatus.pending, RunStatus.running):
                return False
            record.abort_action = action
            record.abort_event.set()
            if record.task is not None and not record.task.done():
                record.task.cancel()
-            record.status = RunStatus.interrupted
+            record.status = RunStatus.cancelling
            record.updated_at = _now_iso()


WillemJiang requested a review from Copilot May 1, 2026 22:41

Copilot started reviewing on behalf of WillemJiang May 1, 2026 22:42 View session

Copilot AI reviewed May 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(runs): serialize run rollback to prevent state overwrites#2598

fix(runs): serialize run rollback to prevent state overwrites#2598
mvanhorn wants to merge 1 commit intobytedance:mainfrom
mvanhorn:fix/2505-run-rollback-serialization

mvanhorn commented Apr 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mvanhorn commented Apr 27, 2026

Closes #2505 (P1)

Change

1. New intermediate states

2. create_or_reject awaits cleanup

Tests

Adversarial verification

Verification

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

2. `create_or_reject` awaits cleanup