Implement slow-start budget for orphaned shard claims and enhance related logging by benjaminpetit · Pull Request #9943 · dotnet/orleans

benjaminpetit · 2026-02-20T16:57:20Z

This pull request introduces a "slow-start" mechanism for orphaned job shard claiming, designed to prevent silos from overwhelming themselves by claiming too many shards immediately after startup, especially during disaster recovery scenarios. The changes add configurable limits and ramp-up logic, integrate overload detection to pause claims when needed, and update logging and validation accordingly.

Slow-start shard claiming mechanism:

Added SlowStartInitialBudget, SlowStartMaxBudget, and SlowStartRampUpDuration configuration options to DurableJobsOptions, allowing control over how many orphaned shards a silo may claim immediately after startup and how this budget increases over time.
Implemented ramp-up logic in LocalDurableJobManager to compute the current claim budget, track claimed shards, and respect overload detection by pausing claims when overloaded.
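
To make the ramp-up idea concrete, here is a minimal sketch of how a claim budget could grow from an initial value to a maximum over a ramp-up window. It is not the code from this PR: the linear interpolation, the `SlowStartSettings` record, and the `SlowStartBudget.Compute` helper are all assumptions made for illustration. Only the three option names come from the description above.

```csharp
using System;

// Hypothetical container mirroring the three options named in the PR description.
public sealed record SlowStartSettings(int InitialBudget, int MaxBudget, TimeSpan RampUpDuration);

public static class SlowStartBudget
{
    // Returns how many more orphaned shards may be claimed right now.
    // Assumption: the total budget ramps linearly from InitialBudget to MaxBudget.
    public static int Compute(SlowStartSettings settings, TimeSpan elapsedSinceStart, int alreadyClaimed)
    {
        int total;
        if (settings.RampUpDuration <= TimeSpan.Zero || elapsedSinceStart >= settings.RampUpDuration)
        {
            // Ramp-up finished (or disabled): the full budget is available.
            total = settings.MaxBudget;
        }
        else if (elapsedSinceStart <= TimeSpan.Zero)
        {
            total = settings.InitialBudget;
        }
        else
        {
            // Fraction of the ramp-up window that has elapsed, in [0, 1).
            double fraction = (double)elapsedSinceStart.Ticks / settings.RampUpDuration.Ticks;
            total = settings.InitialBudget + (int)((settings.MaxBudget - settings.InitialBudget) * fraction);
        }

        // Shards claimed earlier count against the budget; never return a negative value.
        return Math.Max(0, total - alreadyClaimed);
    }
}
```

Under these assumptions, a silo with an initial budget of 2 and a maximum of 10 over a ten-minute ramp would be allowed 6 claims in total at the five-minute mark, minus whatever it has already claimed.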

API and implementation updates:

Modified AssignJobShardsAsync in JobShardManager and its implementations to accept a maxNewClaims parameter, enforcing the slow-start budget during shard assignment.
Updated tests to use the new maxNewClaims parameter, ensuring compatibility and correctness.
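
The budget enforcement described above can be pictured as a loop that stops adopting orphaned shards once the cap is reached. The sketch below is a simplified, in-memory illustration only: `ShardClaimSketch`, `ClaimOrphaned`, and the string shard IDs are hypothetical, and the real `AssignJobShardsAsync` signature is not reproduced here.

```csharp
using System;
using System.Collections.Generic;

public static class ShardClaimSketch
{
    // Adopts orphaned shards in order until maxNewClaims is reached.
    // Hypothetical helper: shard IDs are plain strings for illustration.
    public static List<string> ClaimOrphaned(IEnumerable<string> orphanedShardIds, int maxNewClaims)
    {
        var adopted = new List<string>();
        foreach (var shardId in orphanedShardIds)
        {
            // Respect the budget: stop claiming once it is exhausted.
            if (adopted.Count >= maxNewClaims)
            {
                break;
            }

            adopted.Add(shardId);
        }

        return adopted;
    }
}
```

Passing `int.MaxValue` as `maxNewClaims` makes the cap effectively unlimited, which is how the updated tests in this PR keep their previous behavior.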

Validation and logging enhancements:

Added configuration validation for the new slow-start options to prevent misconfiguration.
Introduced new log messages for shard claim budget, orphaned shard claims, and overload pauses to improve observability.
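
The description does not list the validation rules, so the following is only a guess at the kind of checks that would catch misconfiguration. The three rules (non-negative initial budget, maximum not below initial, non-negative ramp-up duration) and the `SlowStartValidation` helper are assumptions, not the validator added in this PR.

```csharp
using System;

public static class SlowStartValidation
{
    // Throws if the three slow-start values are mutually inconsistent.
    // The rules below are assumed for illustration; the PR's actual rules may differ.
    public static void Validate(int initialBudget, int maxBudget, TimeSpan rampUpDuration)
    {
        if (initialBudget < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(initialBudget), "Initial budget must not be negative.");
        }

        if (maxBudget < initialBudget)
        {
            throw new ArgumentOutOfRangeException(nameof(maxBudget), "Max budget must be at least the initial budget.");
        }

        if (rampUpDuration < TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(nameof(rampUpDuration), "Ramp-up duration must not be negative.");
        }
    }
}
```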

Integration with overload detection:

Integrated IOverloadDetector in LocalDurableJobManager to pause new shard claims when the silo is overloaded, further protecting system stability.
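
One way to read "pause new shard claims when overloaded" is that the claim budget drops to zero while the overload signal is raised. The sketch below assumes exactly that. `IOverloadSignal`, `FixedOverloadSignal`, and `OverloadGate` are hypothetical stand-ins written for this example; they do not reproduce the Orleans `IOverloadDetector` contract.

```csharp
using System;

// Hypothetical overload signal, used here instead of the real Orleans interface.
public interface IOverloadSignal
{
    bool IsOverloaded { get; }
}

// Trivial implementation that always reports the value it was constructed with.
public sealed class FixedOverloadSignal : IOverloadSignal
{
    public FixedOverloadSignal(bool isOverloaded) => IsOverloaded = isOverloaded;

    public bool IsOverloaded { get; }
}

public static class OverloadGate
{
    // Assumption: an overloaded silo claims nothing; otherwise the ramp budget applies.
    public static int EffectiveBudget(int rampBudget, IOverloadSignal signal)
        => signal.IsOverloaded ? 0 : Math.Max(0, rampBudget);
}
```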

These changes collectively improve the robustness and resilience of the job shard assignment process during silo startup and recovery scenarios.

Copilot

Pull request overview

This pull request implements a "slow-start" mechanism for orphaned job shard claiming in Orleans' Durable Jobs feature to prevent silos from overwhelming themselves during startup and disaster recovery scenarios. The implementation adds configurable limits that ramp up over time, integrates with overload detection to pause claims when needed, and includes comprehensive logging for observability.

Changes:

Added three new configuration options (SlowStartInitialBudget, SlowStartMaxBudget, SlowStartRampUpDuration) with validation to control the slow-start behavior
Modified AssignJobShardsAsync API across JobShardManager implementations to accept a maxNewClaims parameter that enforces the budget
Implemented ramp-up logic in LocalDurableJobManager that computes the current claim budget, tracks claimed shards, and respects overload detection

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `src/Orleans.DurableJobs/Hosting/DurableJobsOptions.cs` | Added three new slow-start configuration properties with XML documentation and validation logic |
| `src/Orleans.DurableJobs/LocalDurableJobManager.cs` | Implemented slow-start state tracking, budget computation with overload integration, and orphaned shard claim counting |
| `src/Orleans.DurableJobs/LocalDurableJobManager.Log.cs` | Added three new log messages for claim budget, orphaned claims, and overload pauses |
| `src/Orleans.DurableJobs/JobShardManager.cs` | Updated `AssignJobShardsAsync` signature to include `maxNewClaims` parameter and implemented budget enforcement in InMemoryJobShardManager |
| `src/Azure/Orleans.DurableJobs.AzureStorage/AzureStorageJobShardManager.cs` | Implemented budget enforcement for the Azure Storage provider |
| `test/Tester/DurableJobs/JobShardManagerTestsRunner.cs` | Added three comprehensive tests for slow-start behavior and updated all existing test calls with `int.MaxValue` |
| `test/Tester/DurableJobs/InMemoryJobShardManagerTests.cs` | Added test method delegates for the three new slow-start tests |
| `test/Extensions/TesterAzureUtils/DurableJobs/AzureStorageJobShardManagerTests.cs` | Added test method delegates for the three new slow-start tests with appropriate test categories |
| `test/Extensions/TesterAzureUtils/DurableJobs/AzureStorageJobShardBatchingTests.cs` | Updated all test calls to use `maxNewClaims: int.MaxValue` for unlimited claiming |

ReubenBond · 2026-02-20T17:02:23Z

src/Orleans.DurableJobs/JobShardManager.cs

-                        // If adopted from dead silo, increment adopted count
-                        if (isFromDeadSilo)
+                        // Respect the slow-start budget: skip claiming if we've exhausted the budget
+                        if (stolenShards.Count >= maxNewClaims)


IMO we should use 'Adopted' instead of stolen. Adopted vs Orphaned/Abandoned match, whereas Stolen vs Orphaned/Abandoned don't match well

I think I changed the terminology in one of the previous PRs already

ReubenBond · 2026-02-20T17:03:35Z

src/Orleans.DurableJobs/LocalDurableJobManager.cs

    {
        LogStarting(_logger);

+        _startTimestamp = Stopwatch.GetTimestamp();


Not a big deal, but we should inject & use TimeProvider here instead of Stopwatch
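
As a rough illustration of this suggestion, the sketch below injects a `TimeProvider` and uses its `GetTimestamp` and `GetElapsedTime` methods (available in .NET 8 and later) in place of the static `Stopwatch` call. The `StartupClock` class is hypothetical and only shows the pattern; it is not how `LocalDurableJobManager` is structured.

```csharp
using System;

// Hypothetical helper showing elapsed-time tracking through an injected TimeProvider.
public sealed class StartupClock
{
    private readonly TimeProvider _timeProvider;
    private long _startTimestamp;

    public StartupClock(TimeProvider timeProvider) => _timeProvider = timeProvider;

    // Records the start point, analogous to Stopwatch.GetTimestamp().
    public void Start() => _startTimestamp = _timeProvider.GetTimestamp();

    // Time elapsed since Start() according to the injected provider.
    public TimeSpan Elapsed => _timeProvider.GetElapsedTime(_startTimestamp);
}
```

In production code the instance would typically be `TimeProvider.System`, while tests can pass a controllable provider to advance time deterministically.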

Copilot AI review requested due to automatic review settings February 20, 2026 16:57

Copilot started reviewing on behalf of benjaminpetit February 20, 2026 16:57

benjaminpetit mentioned this pull request Feb 20, 2026

Durable Jobs follow-up #9750

Open

14 tasks

Copilot AI reviewed Feb 20, 2026

ReubenBond reviewed Feb 20, 2026

benjaminpetit force-pushed the feature/slow-shard-assignement branch from 222250d to 6dc83a8 on February 20, 2026 17:33

Implement slow-start budget for orphaned shard claims and enhance rel…

fc150cf

…ated logging

benjaminpetit force-pushed the feature/slow-shard-assignement branch from 6dc83a8 to fc150cf on February 20, 2026 17:43

Fix DurableJobs test shard assignment signature

d39aade

benjaminpetit wants to merge 2 commits into dotnet:main from benjaminpetit:feature/slow-shard-assignement