Implement slow-start budget for orphaned shard claims and enhance related logging#9943

Open
benjaminpetit wants to merge 2 commits intodotnet:mainfrom
benjaminpetit:feature/slow-shard-assignement

@benjaminpetit
Member

@benjaminpetit benjaminpetit commented Feb 20, 2026

This pull request introduces a "slow-start" mechanism for orphaned job shard claiming, designed to prevent silos from overwhelming themselves by claiming too many shards immediately after startup, especially during disaster recovery scenarios. The changes add configurable limits and ramp-up logic, integrate overload detection to pause claims when needed, and update logging and validation accordingly.

Slow-start shard claiming mechanism:

  • Added SlowStartInitialBudget, SlowStartMaxBudget, and SlowStartRampUpDuration configuration options to DurableJobsOptions, allowing control over how many orphaned shards a silo may claim immediately after startup and how this budget increases over time.
  • Implemented ramp-up logic in LocalDurableJobManager to compute the current claim budget, track claimed shards, and respect overload detection by pausing claims when overloaded.
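
As a rough illustration of the ramp-up described above, the sketch below computes a claim budget from the three options. The `SlowStartSettings` record, the `SlowStartBudget` helper, and the linear ramp are assumptions made for this example; the PR description does not state the actual curve or type layout used by LocalDurableJobManager.

```csharp
using System;

// Hypothetical stand-in for the three options named in the PR description.
// The real DurableJobsOptions type and its defaults are not shown here.
public sealed record SlowStartSettings(int InitialBudget, int MaxBudget, TimeSpan RampUpDuration);

public static class SlowStartBudget
{
    // Assumed linear ramp: InitialBudget at startup, MaxBudget once
    // RampUpDuration has elapsed, interpolated in between.
    public static int Compute(SlowStartSettings settings, TimeSpan elapsedSinceStart)
    {
        if (settings.RampUpDuration <= TimeSpan.Zero || elapsedSinceStart >= settings.RampUpDuration)
        {
            return settings.MaxBudget;
        }

        if (elapsedSinceStart <= TimeSpan.Zero)
        {
            return settings.InitialBudget;
        }

        double fraction = elapsedSinceStart.Ticks / (double)settings.RampUpDuration.Ticks;
        return settings.InitialBudget + (int)((settings.MaxBudget - settings.InitialBudget) * fraction);
    }

    // Claims still allowed after subtracting shards already claimed; never negative.
    public static int Remaining(int budget, int alreadyClaimed) => Math.Max(0, budget - alreadyClaimed);
}
```

With an initial budget of 2, a maximum of 10, and a 60-second ramp, this sketch would allow 2 claims at startup, 6 after 30 seconds, and 10 from 60 seconds onward.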

API and implementation updates:

  • Modified AssignJobShardsAsync in JobShardManager and its implementations to accept a maxNewClaims parameter, enforcing the slow-start budget during shard assignment.
  • Updated tests to use the new maxNewClaims parameter, ensuring compatibility and correctness.
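
The effect of a maxNewClaims cap can be sketched in isolation as follows. `ShardInfo` and `ShardClaiming.ClaimOrphaned` are hypothetical names invented for this example; the real AssignJobShardsAsync signature and its storage interactions are not reproduced here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shard descriptor, used only for this illustration.
public sealed record ShardInfo(string Id, bool OwnerIsDead);

public static class ShardClaiming
{
    // Claims shards whose owner is dead, stopping once maxNewClaims is reached.
    // Passing int.MaxValue effectively disables the cap, as the updated tests do.
    public static List<ShardInfo> ClaimOrphaned(IEnumerable<ShardInfo> candidates, int maxNewClaims)
    {
        var claimed = new List<ShardInfo>();
        foreach (var shard in candidates)
        {
            if (!shard.OwnerIsDead)
            {
                continue; // still owned by a live silo, not a candidate
            }

            if (claimed.Count >= maxNewClaims)
            {
                break; // slow-start budget exhausted for this pass
            }

            claimed.Add(shard);
        }

        return claimed;
    }
}
```

Shards left unclaimed in one pass remain orphaned and can be picked up on a later pass, once the budget has grown, or by another silo.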

Validation and logging enhancements:

  • Added configuration validation for the new slow-start options to prevent misconfiguration.
  • Introduced new log messages for shard claim budget, orphaned shard claims, and overload pauses to improve observability.
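
A minimal sketch of what validating the slow-start options might involve is shown below. The specific rules (non-negative initial budget, maximum not below initial, non-negative ramp-up duration) and the use of plain argument exceptions are assumptions for illustration; the PR description does not list the actual checks or the exception type Orleans raises.

```csharp
using System;

public static class SlowStartValidation
{
    // Assumed rules; the real validator may check different or additional constraints.
    public static void Validate(int initialBudget, int maxBudget, TimeSpan rampUpDuration)
    {
        if (initialBudget < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(initialBudget), "Initial budget must not be negative.");
        }

        if (maxBudget < initialBudget)
        {
            throw new ArgumentException("Maximum budget must be at least the initial budget.", nameof(maxBudget));
        }

        if (rampUpDuration < TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(nameof(rampUpDuration), "Ramp-up duration must not be negative.");
        }
    }
}
```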

Integration with overload detection:

  • Integrated IOverloadDetector in LocalDurableJobManager to pause new shard claims when the silo is overloaded, further protecting system stability.
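
One way to picture the overload pause is a gate that drops the budget to zero while the silo reports overload. `ILoadSignal`, `FixedLoadSignal`, and `ClaimGate` are hypothetical types defined for this example only; they are not the Orleans IOverloadDetector API, whose exact shape is not shown in this PR description.

```csharp
using System;

// Hypothetical overload signal, standing in for an overload detector.
public interface ILoadSignal
{
    bool IsOverloaded { get; }
}

// Trivial implementation with a settable flag, convenient for trying the gate out.
public sealed class FixedLoadSignal : ILoadSignal
{
    public bool IsOverloaded { get; set; }
}

public sealed class ClaimGate
{
    private readonly ILoadSignal _load;

    public ClaimGate(ILoadSignal load) => _load = load ?? throw new ArgumentNullException(nameof(load));

    // Number of passes in which claiming was paused; a real manager would log instead.
    public int PausedPasses { get; private set; }

    // Returns zero while overloaded, otherwise the budget unchanged.
    public int EffectiveBudget(int budget)
    {
        if (_load.IsOverloaded)
        {
            PausedPasses++;
            return 0;
        }

        return budget;
    }
}
```

Returning a zero budget rather than skipping the assignment pass entirely means the caller's normal path still runs with no new claims allowed.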

These changes collectively improve the robustness and resilience of the job shard assignment process during silo startup and recovery scenarios.

Copilot AI review requested due to automatic review settings February 20, 2026 16:57
@benjaminpetit benjaminpetit mentioned this pull request Feb 20, 2026
14 tasks
Contributor

Copilot AI left a comment

Pull request overview

This pull request implements a "slow-start" mechanism for orphaned job shard claiming in Orleans' Durable Jobs feature to prevent silos from overwhelming themselves during startup and disaster recovery scenarios. The implementation adds configurable limits that ramp up over time, integrates with overload detection to pause claims when needed, and includes comprehensive logging for observability.

Changes:

  • Added three new configuration options (SlowStartInitialBudget, SlowStartMaxBudget, SlowStartRampUpDuration) with validation to control the slow-start behavior
  • Modified AssignJobShardsAsync API across JobShardManager implementations to accept a maxNewClaims parameter that enforces the budget
  • Implemented ramp-up logic in LocalDurableJobManager that computes the current claim budget, tracks claimed shards, and respects overload detection

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/Orleans.DurableJobs/Hosting/DurableJobsOptions.cs | Added three new slow-start configuration properties with XML documentation and validation logic |
| src/Orleans.DurableJobs/LocalDurableJobManager.cs | Implemented slow-start state tracking, budget computation with overload integration, and orphaned shard claim counting |
| src/Orleans.DurableJobs/LocalDurableJobManager.Log.cs | Added three new log messages for claim budget, orphaned claims, and overload pauses |
| src/Orleans.DurableJobs/JobShardManager.cs | Updated AssignJobShardsAsync signature to include maxNewClaims parameter and implemented budget enforcement in InMemoryJobShardManager |
| src/Azure/Orleans.DurableJobs.AzureStorage/AzureStorageJobShardManager.cs | Implemented budget enforcement for the Azure Storage provider |
| test/Tester/DurableJobs/JobShardManagerTestsRunner.cs | Added three comprehensive tests for slow-start behavior and updated all existing test calls to pass int.MaxValue |
| test/Tester/DurableJobs/InMemoryJobShardManagerTests.cs | Added test method delegates for the three new slow-start tests |
| test/Extensions/TesterAzureUtils/DurableJobs/AzureStorageJobShardManagerTests.cs | Added test method delegates for the three new slow-start tests with appropriate test categories |
| test/Extensions/TesterAzureUtils/DurableJobs/AzureStorageJobShardBatchingTests.cs | Updated all test calls to use maxNewClaims: int.MaxValue for unlimited claiming |

// If adopted from dead silo, increment adopted count
if (isFromDeadSilo)
// Respect the slow-start budget: skip claiming if we've exhausted the budget
if (stolenShards.Count >= maxNewClaims)
Member

IMO we should use 'Adopted' instead of 'Stolen'. 'Adopted' pairs well with 'Orphaned'/'Abandoned', whereas 'Stolen' does not.

Member

I think I changed the terminology in one of the previous PRs already

{
LogStarting(_logger);

_startTimestamp = Stopwatch.GetTimestamp();
Member

Not a big deal, but we should inject & use TimeProvider here instead of Stopwatch
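
For reference, a minimal sketch of the TimeProvider-based alternative might look like the following. `StartupClock` is a hypothetical helper written for this example and is not part of the PR; `TimeProvider.GetTimestamp` and `TimeProvider.GetElapsedTime` are the .NET 8 APIs that mirror the Stopwatch timestamp pattern.

```csharp
using System;

// Hypothetical helper showing an injected TimeProvider in place of Stopwatch.GetTimestamp().
public sealed class StartupClock
{
    private readonly TimeProvider _timeProvider;
    private long _startTimestamp;

    public StartupClock(TimeProvider timeProvider)
    {
        _timeProvider = timeProvider ?? throw new ArgumentNullException(nameof(timeProvider));
    }

    // Records the moment the manager starts.
    public void MarkStarted() => _startTimestamp = _timeProvider.GetTimestamp();

    // Time elapsed since MarkStarted, measured through the injected provider.
    public TimeSpan ElapsedSinceStart => _timeProvider.GetElapsedTime(_startTimestamp);
}
```

In production the constructor would receive `TimeProvider.System`; a test could pass a controllable provider to advance time deterministically and exercise the ramp-up without waiting.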

@benjaminpetit benjaminpetit force-pushed the feature/slow-shard-assignement branch from 222250d to 6dc83a8 Compare February 20, 2026 17:33
@benjaminpetit benjaminpetit force-pushed the feature/slow-shard-assignement branch from 6dc83a8 to fc150cf Compare February 20, 2026 17:43