Implement slow-start budget for orphaned shard claims and enhance related logging #9943
benjaminpetit wants to merge 2 commits into dotnet:main
Pull request overview
This pull request implements a "slow-start" mechanism for orphaned job shard claiming in Orleans' Durable Jobs feature to prevent silos from overwhelming themselves during startup and disaster recovery scenarios. The implementation adds configurable limits that ramp up over time, integrates with overload detection to pause claims when needed, and includes comprehensive logging for observability.
Changes:
- Added three new configuration options (`SlowStartInitialBudget`, `SlowStartMaxBudget`, `SlowStartRampUpDuration`) with validation to control the slow-start behavior
- Modified the `AssignJobShardsAsync` API across `JobShardManager` implementations to accept a `maxNewClaims` parameter that enforces the budget
- Implemented ramp-up logic in `LocalDurableJobManager` that computes the current claim budget, tracks claimed shards, and respects overload detection
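The ramp-up described above could be sketched roughly as follows. This is only an illustration: the conversation does not show the actual formula, so the linear interpolation, the `SlowStartBudget` class, and both method names are assumptions. Only the three option names and the pause-on-overload behavior come from the PR.

```csharp
using System;

// Hypothetical helper, not the PR's code: the real computation lives in
// LocalDurableJobManager and may use a different curve.
public static class SlowStartBudget
{
    // Assumed linear ramp from initialBudget to maxBudget over rampUpDuration.
    public static int Compute(
        int initialBudget,
        int maxBudget,
        TimeSpan rampUpDuration,
        TimeSpan elapsedSinceStart,
        bool isOverloaded)
    {
        // The PR pauses new claims while the silo reports overload.
        if (isOverloaded) return 0;

        // Ramp finished (or disabled): the full budget applies.
        if (rampUpDuration <= TimeSpan.Zero || elapsedSinceStart >= rampUpDuration)
            return maxBudget;

        if (elapsedSinceStart <= TimeSpan.Zero) return initialBudget;

        double fraction = elapsedSinceStart.TotalMilliseconds / rampUpDuration.TotalMilliseconds;
        return initialBudget + (int)((maxBudget - initialBudget) * fraction);
    }

    // Assumed derivation of maxNewClaims: budget minus shards already claimed.
    public static int RemainingClaims(int currentBudget, int alreadyClaimed)
        => Math.Max(0, currentBudget - alreadyClaimed);
}
```

Under these assumptions, the value passed as `maxNewClaims` on each assignment pass would be `RemainingClaims(Compute(...), claimedSoFar)`, which drops to zero as soon as the silo is overloaded.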
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `src/Orleans.DurableJobs/Hosting/DurableJobsOptions.cs` | Added three new slow-start configuration properties with XML documentation and validation logic |
| `src/Orleans.DurableJobs/LocalDurableJobManager.cs` | Implemented slow-start state tracking, budget computation with overload integration, and orphaned shard claim counting |
| `src/Orleans.DurableJobs/LocalDurableJobManager.Log.cs` | Added three new log messages for claim budget, orphaned claims, and overload pauses |
| `src/Orleans.DurableJobs/JobShardManager.cs` | Updated the `AssignJobShardsAsync` signature to include the `maxNewClaims` parameter and implemented budget enforcement in `InMemoryJobShardManager` |
| `src/Azure/Orleans.DurableJobs.AzureStorage/AzureStorageJobShardManager.cs` | Implemented budget enforcement for the Azure Storage provider |
| `test/Tester/DurableJobs/JobShardManagerTestsRunner.cs` | Added three comprehensive tests for slow-start behavior and updated all existing test calls with `int.MaxValue` |
| `test/Tester/DurableJobs/InMemoryJobShardManagerTests.cs` | Added test method delegates for the three new slow-start tests |
| `test/Extensions/TesterAzureUtils/DurableJobs/AzureStorageJobShardManagerTests.cs` | Added test method delegates for the three new slow-start tests with appropriate test categories |
| `test/Extensions/TesterAzureUtils/DurableJobs/AzureStorageJobShardBatchingTests.cs` | Updated all test calls to use `maxNewClaims: int.MaxValue` for unlimited claiming |
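As a rough picture of what "budget enforcement" means for a shard manager, the sketch below caps the number of orphaned shards adopted in one pass. The `ShardClaimSketch` class, the `SelectShardsToAdopt` method, and the use of plain string shard IDs are all hypothetical. The real `AssignJobShardsAsync` implementations are asynchronous and work against provider storage, which this sketch leaves out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not the InMemoryJobShardManager code.
public static class ShardClaimSketch
{
    // Returns the orphaned shard IDs to adopt, never more than maxNewClaims.
    public static List<string> SelectShardsToAdopt(
        IEnumerable<string> orphanedShardIds,
        int maxNewClaims)
    {
        var adopted = new List<string>();
        foreach (var shardId in orphanedShardIds)
        {
            // Respect the slow-start budget: stop once it is exhausted.
            if (adopted.Count >= maxNewClaims) break;
            adopted.Add(shardId);
        }
        return adopted;
    }
}
```

Passing `int.MaxValue`, as the updated tests do, makes the cap unreachable in practice, so every orphaned shard is eligible.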
    // If adopted from dead silo, increment adopted count
    if (isFromDeadSilo)
    // Respect the slow-start budget: skip claiming if we've exhausted the budget
    if (stolenShards.Count >= maxNewClaims)
IMO we should use 'Adopted' instead of 'Stolen'. 'Adopted' pairs naturally with 'Orphaned'/'Abandoned', whereas 'Stolen' does not.
I think I changed the terminology in one of the previous PRs already
    {
        LogStarting(_logger);

        _startTimestamp = Stopwatch.GetTimestamp();
Not a big deal, but we should inject & use TimeProvider here instead of Stopwatch
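A minimal sketch of that suggestion, assuming .NET 8 or later where `TimeProvider` is available. The `StartupClock` class and its members are hypothetical names for illustration. In the PR the timestamp is a field on `LocalDurableJobManager`.

```csharp
using System;

// Hypothetical wrapper showing TimeProvider in place of Stopwatch.
public sealed class StartupClock
{
    private readonly TimeProvider _timeProvider;
    private long _startTimestamp;

    // Injecting TimeProvider lets tests substitute a controllable clock.
    public StartupClock(TimeProvider timeProvider)
    {
        _timeProvider = timeProvider ?? throw new ArgumentNullException(nameof(timeProvider));
    }

    // Replaces: _startTimestamp = Stopwatch.GetTimestamp();
    public void Start() => _startTimestamp = _timeProvider.GetTimestamp();

    // Elapsed time since Start(), measured by the injected provider.
    public TimeSpan Elapsed => _timeProvider.GetElapsedTime(_startTimestamp);
}
```

In production, `TimeProvider.System` can be passed in. A test could pass a fake provider and advance it to exercise the ramp-up without waiting in real time.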
222250d to 6dc83a8
6dc83a8 to fc150cf
This pull request introduces a "slow-start" mechanism for orphaned job shard claiming, designed to prevent silos from overwhelming themselves by claiming too many shards immediately after startup, especially during disaster recovery scenarios. The changes add configurable limits and ramp-up logic, integrate overload detection to pause claims when needed, and update logging and validation accordingly.
Slow-start shard claiming mechanism:
- Added `SlowStartInitialBudget`, `SlowStartMaxBudget`, and `SlowStartRampUpDuration` configuration options to `DurableJobsOptions`, allowing control over how many orphaned shards a silo may claim immediately after startup and how this budget increases over time.
- Updated `LocalDurableJobManager` to compute the current claim budget, track claimed shards, and respect overload detection by pausing claims when overloaded.

API and implementation updates:
- Updated `AssignJobShardsAsync` in `JobShardManager` and its implementations to accept a `maxNewClaims` parameter, enforcing the slow-start budget during shard assignment.
- Updated existing test calls to pass the `maxNewClaims` parameter, ensuring compatibility and correctness.

Validation and logging enhancements:

Integration with overload detection:
- Integrated `IOverloadDetector` in `LocalDurableJobManager` to pause new shard claims when the silo is overloaded, further protecting system stability.

These changes collectively improve the robustness and resilience of the job shard assignment process during silo startup and recovery scenarios.