Fix orphaned running executions by vcastellm · Pull Request #1957 · dkron-io/dkron

vcastellm · 2026-03-29T16:14:53Z

This pull request introduces a robust mechanism for cleaning up stale running executions in Dkron, ensuring that executions which are no longer active on any node are marked as failed and do not block new job runs. The logic for detecting and cleaning up these "orphaned" executions has been refactored, centralized, and is now invoked both during job scheduling and on leader startup. Comprehensive tests have been added to verify this behavior.

Refactoring and Centralization of Stale Execution Cleanup:

- Extracted and centralized the stale execution cleanup logic into a new Agent.cleanupStaleRunningExecutions method, which marks executions as failed if they are no longer active and have exceeded the stale threshold. This replaces the previous inline logic in Job.isRunnable. (dkron/agent.go, dkron/job.go)
- Added a helper function activeExecutionKeys to efficiently track currently active executions by key. (dkron/agent.go)

Leadership Startup Reconciliation:

On leadership establishment, added a reconciliation step (Agent.reconcileRunningExecutionOrphans) that iterates over all jobs and cleans up any orphaned running executions in persistent storage, ensuring a consistent state after leader changes. (dkron/leader.go)

Testing Improvements:

Added new tests in dkron/leader_test.go to verify that stale executions are properly cleaned up and that recent (non-stale) executions are left untouched during reconciliation. (dkron/leader_test.go)

Code Quality:

Minor import reordering for consistency. (dkron/agent.go)

Summary by CodeRabbit

Release Notes

Bug Fixes
- Improved handling of stale execution records marked as running beyond timeout thresholds; cleanup now occurs during leader startup.
Tests
- Added test coverage for running execution orphan reconciliation.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>

coderabbitai · 2026-03-29T16:15:07Z

Walkthrough

The changes introduce a stale execution cleanup mechanism for distributed job execution. Helper functions are added to the agent to manage execution state and removal via Raft. The job's cleanup logic is refactored to delegate to these helpers. Leader startup now initiates a reconciliation process to identify and clean up orphaned running executions in storage. Tests validate the reconciliation behavior.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Agent Helpers `dkron/agent.go` | Reordered imports. Added `activeExecutionKeys` to build a set of active execution keys, `markExecutionDone` to persist execution state via Raft, and `cleanupStaleRunningExecutions` to identify and mark stale running executions as failed. |
| Job Cleanup Refactoring `dkron/job.go` | Refactored `isRunnable` to delegate stale execution cleanup to `cleanupStaleRunningExecutions` instead of inline logic. Retained concurrency blocking logic for non-stale, non-active executions. Minor whitespace adjustments in `validateMemoryLimit`. |
| Leader Reconciliation `dkron/leader.go` | Added startup reconciliation during `establishLeadership`: queries active executions and invokes `reconcileRunningExecutionOrphans` to clean up stale running executions in storage. Non-fatal errors are logged and startup continues. |
| Test Coverage `dkron/leader_test.go` | Added test helper `startTestLeaderAgent` and two test cases validating reconciliation behavior: one confirming stale executions are marked failed, another confirming recent executions are retained. |

Sequence Diagram

sequenceDiagram
    participant Leader as Leader (Startup)
    participant Agent as Agent
    participant JobStore as Job Store
    participant RaftLog as Raft

    Leader->>Agent: establishLeadership()
    Agent->>Agent: GetActiveExecutions()
    Agent->>Leader: activeExecutionKeys

    loop For each job
        Agent->>JobStore: Get running executions
        JobStore-->>Agent: runningExecs
        Agent->>Agent: filterStaleExecutions(activeKeys, runningExecs)
        
        alt Stale execution detected
            Agent->>Agent: markExecutionDone(execution)
            Agent->>RaftLog: RaftApply(ExecutionDoneRequest)
            RaftLog-->>Agent: error or success
        else Recent execution
            Agent->>Agent: Log and continue
        end
    end

    Agent->>Leader: Reconciliation complete

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A tale of stale executions lost,
No more orphans left to cost,
The leader wakes and sweeps the store,
With Raft it marks what ran before,
Clean slate for jobs, forever more! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Fix orphaned running executions' directly and clearly summarizes the main purpose of the changeset, which is to address orphaned executions that block job runs. |
| Description check | ✅ Passed | The description comprehensively covers the proposed changes with clear sections on refactoring, leadership reconciliation, testing, and code quality; however, it does not explicitly specify the type of change or include the template's 'Types of changes' section. |

coderabbitai

🧹 Nitpick comments (2)

dkron/agent.go (1)
897-913: Consider adding an IsLeader guard or documenting the precondition.

Per coding guidelines, Raft log applications should be preceded by an a.IsLeader() check. This function applies an ExecutionDoneType command via a.RaftApply, but doesn't verify leadership. While the current call sites (leader startup reconciliation and isRunnable during scheduling) run on the leader, adding a defensive check or documenting the precondition would prevent misuse.
💡 Optional: Add defensive leadership check
 func (a *Agent) markExecutionDone(execution *Execution) error {
+	if !a.IsLeader() {
+		return errors.New("not leader")
+	}
+
 	execDoneReq := &typesv1.ExecutionDoneRequest{
 		Execution: execution.ToProto(),
 	}
Based on learnings: "Check a.IsLeader() before performing leader-only operations like job scheduling or applying Raft logs."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dkron/agent.go` around lines 897 - 913, markExecutionDone calls a.RaftApply
with an ExecutionDoneType command but lacks a leadership guard; add a defensive
check at the top of markExecutionDone to return an error (or no-op) if
a.IsLeader() is false, or document the precondition clearly. Specifically, in
the markExecutionDone function, call a.IsLeader() before creating/encoding the
typesv1.ExecutionDoneRequest and invoking a.RaftApply (referencing
markExecutionDone, a.IsLeader, a.RaftApply, and ExecutionDoneType) and ensure
the function returns a clear error when not leader so callers cannot
accidentally apply Raft entries from non-leaders.
dkron/leader.go (1)
246-265: Consider continuing past the first error to avoid partial reconciliation.

The function iterates through all jobs and calls cleanupStaleRunningExecutions for each. If an error occurs mid-iteration, it returns immediately, leaving some jobs reconciled and others not. This creates an inconsistent state.

Consider either:

- Collecting errors and continuing (best-effort for all jobs)
- Documenting that partial reconciliation is acceptable

Additionally, time.Since(exec.StartedAt) on line 259 uses local time implicitly, while cleanupStaleRunningExecutions uses time.Now().UTC() for runningFor calculation. This is a minor inconsistency for logging purposes only (not affecting logic), but using .UTC() would be more consistent.
💡 Minor consistency fix for time calculation
 		for _, exec := range runningExecs {
 			a.logger.WithFields(map[string]interface{}{
 				"job":         job.Name,
 				"execution":   exec.Key(),
 				"node":        exec.NodeName,
 				"started_at":  exec.StartedAt,
-				"running_for": time.Since(exec.StartedAt).String(),
+				"running_for": time.Now().UTC().Sub(exec.StartedAt).String(),
 			}).Info("leader: Leaving running execution in storage during startup reconciliation")
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dkron/leader.go` around lines 246 - 265, reconcileRunningExecutionOrphans
currently returns on the first error from cleanupStaleRunningExecutions which
causes partial reconciliation; change it to continue iterating all jobs while
collecting errors (e.g., append errors to a slice and at the end return either
nil or a combined error) so the loop is best-effort across all jobs, and also
make the logged running duration consistent by calculating running_for using UTC
(use exec.StartedAt.UTC() or convert the reference time to UTC to match
cleanupStaleRunningExecutions); reference functions/values:
reconcileRunningExecutionOrphans, cleanupStaleRunningExecutions, runningExecs,
exec.StartedAt, and the "running_for" log field.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0d62828d-008d-415c-ad5e-42d50273c9dc

📥 Commits

Reviewing files that changed from the base of the PR and between a666cf1 and c0d1d7f.

📒 Files selected for processing (4)

dkron/agent.go
dkron/job.go
dkron/leader.go
dkron/leader_test.go

vcastellm and others added 2 commits March 29, 2026 18:13

Reuse stale execution cleanup logic for concurrency checks

15d60f8

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>

Reconcile orphaned running executions when leadership starts

c0d1d7f

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>

coderabbitai bot reviewed Mar 29, 2026

vcastellm merged commit 18c9d7e into main Mar 29, 2026
3 checks passed

vcastellm deleted the fix/orphaned-running-executions branch March 29, 2026 16:20

vcastellm mentioned this pull request Mar 29, 2026

Jobs stuck in "Running" state, how to properly upgrade? #1950

Closed

fopina mentioned this pull request Mar 30, 2026

Upstream tag v4.1.0 (revision 758e62c0) fopina/dkron#52

Open
