Feature/stale job monitor #5558
Open
brendankowitz wants to merge 15 commits into
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@ Coverage Diff @@
## main #5558 +/- ##
=======================================
Coverage ? 77.05%
=======================================
Files ? 1001
Lines ? 36731
Branches ? 5548
=======================================
Hits ? 28303
Misses ? 7078
Partials ? 1350
Spec for a StaleJobWatchdog that emits a fhir_oldest_queued_job_age_seconds Prometheus gauge per queue type when no jobs are running.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6-task plan: notification, metric handler, watchdog, WatchdogsBackgroundService wiring, DI registration, integration test.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Previously, a single running job in any queue masked staleness in every other queue, defeating per-queue-type alerting. ComputeQueueAges now evaluates the running check per queue.
- StaleJobMetricHandler swapped from a per-key-updated ConcurrentDictionary to a volatile reference swap so ObservableGauge scrapes never observe a partial multi-queue update.
- Added logic test asserting a running job in one queue does not suppress another queue's staleness.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
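The two fixes in this commit can be sketched as follows. This is only an illustration under assumptions: JobRow, JobState, QueueAgeSketch, and QueueAgeSnapshotHolder are hypothetical names, and the signature given to ComputeQueueAges is a guess, since the PR's real types are not shown in this conversation. At this point in the history the age is reported only when the queue itself has no running job, so the running check is evaluated inside each queue's group.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape; the PR's actual job model is not shown here.
public enum JobState { Created, Running }

public sealed record JobRow(string QueueType, JobState State, DateTimeOffset CreateDate);

public static class QueueAgeSketch
{
    // For each queue type, returns the age in seconds of its oldest pending job,
    // or 0 when that same queue has a running job or nothing pending.
    // A running job in one queue never affects another queue's result.
    public static IReadOnlyDictionary<string, long> ComputeQueueAges(
        IEnumerable<JobRow> jobs, DateTimeOffset now)
    {
        return jobs
            .GroupBy(j => j.QueueType)
            .ToDictionary(
                g => g.Key,
                g =>
                {
                    bool anyRunning = g.Any(j => j.State == JobState.Running);
                    var pending = g.Where(j => j.State == JobState.Created).ToList();
                    if (anyRunning || pending.Count == 0)
                    {
                        return 0L;
                    }

                    return (long)(now - pending.Min(j => j.CreateDate)).TotalSeconds;
                });
    }
}

// Holds the latest result behind a single volatile reference. A reader sees
// either the previous dictionary or the new one, never a mix of both.
public sealed class QueueAgeSnapshotHolder
{
    private volatile IReadOnlyDictionary<string, long> _ages = new Dictionary<string, long>();

    public void Publish(IReadOnlyDictionary<string, long> ages) => _ages = ages;

    public IReadOnlyDictionary<string, long> Read() => _ages;
}
```

The writer builds a complete dictionary first and publishes it with one assignment, which is what makes the swap safe for a concurrent gauge scrape; the published dictionary must not be mutated afterwards.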
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an ObservableGauge<long> named Jobs.QueueDepth to the existing StaleJobMetricHandler, using the same per-tick SQL result set already fetched by StaleJobWatchdog. Reports pending (Created) and running job counts per QueueType via queue_type and state tags, complementing the existing Jobs.OldestQueuedAgeSeconds metric for full active-queue observability. ADR 2605 amended with the depth metric decision.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
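A depth gauge of this shape could be registered roughly as below. Treat it as a sketch, not the PR's handler: the class name QueueDepthGaugeSketch, the Publish method, and the tag values "created" and "running" are assumptions; only the instrument name Jobs.QueueDepth and the tag keys queue_type and state come from the commit message.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public sealed class QueueDepthGaugeSketch
{
    // Latest per-queue counts, replaced as a whole on every watchdog tick.
    private volatile IReadOnlyList<(string QueueType, string State, long Count)> _depths =
        Array.Empty<(string QueueType, string State, long Count)>();

    public QueueDepthGaugeSketch(Meter meter)
    {
        // The callback runs whenever a listener or exporter collects the instrument.
        meter.CreateObservableGauge<long>(
            "Jobs.QueueDepth",
            Observe,
            description: "Pending and running job counts per queue type.");
    }

    // Hypothetical entry point; in the PR the data arrives via a MediatR notification.
    public void Publish(IReadOnlyList<(string QueueType, string State, long Count)> depths)
        => _depths = depths;

    private IEnumerable<Measurement<long>> Observe()
    {
        // foreach reads the field once, so one scrape uses one consistent snapshot.
        foreach (var (queueType, state, count) in _depths)
        {
            yield return new Measurement<long>(
                count,
                new KeyValuePair<string, object?>("queue_type", queueType),
                new KeyValuePair<string, object?>("state", state));
        }
    }
}
```

Each (queue_type, state) pair becomes its own time series, so pending and running counts can be charted or alerted on independently.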
The watchdog reports both stale-queue age and queue depth metrics, so JobMonitorWatchdog more accurately describes its broader monitoring role. Renames the watchdog and its companion notification and metric handler, moves StaleJobMetricHandler into the Logging/Metrics/Handlers folder to match the post-rebase main layout (PR #5555 moved metric handlers there), and updates the Features/Operations/StaleJob folder to Features/Operations/JobMonitor.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Mark _now as readonly in JobMonitorWatchdogLogicTests
- Replace ContainsKey+indexer with TryGetValue in SqlServerWatchdogTests to avoid double dictionary lookups
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace per-queue GetActiveJobs calls with a single aggregate query over dbo.JobQueue (Status, MIN(CreateDate), COUNT per queue type); no more varchar(max) Definition/Result transfer every tick
- Suppress stale gauge snapshots: gauges emit nothing when no notification has arrived within 300s, instead of re-reporting frozen values when the watchdog stops or the lease moves
- Merge handler state into a single immutable Snapshot record, eliminating cross-gauge tearing
- Rename Jobs.OldestQueuedAgeSeconds to Jobs.OldestQueuedAge (unit belongs in unit:, not the name) and report the oldest pending job age unconditionally; the stalled judgment (age >= 600s and no running jobs) moves to the warning log and alert expression
- Clamp negative ages from clock skew
- Add CoreFeatureConfiguration.EnableJobMonitor flag (default true)
- Add error logging when a tick fails to refresh metrics
- Tests: non-empty-queue and cross-queue-isolation integration tests, age-gauge MeterListener test, staleness and replace-not-merge tests, clamp test
- Document metric conventions and updated semantics in ADR-2605
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
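The snapshot-freshness rule, the skew clamp, and the stalled judgment from this commit can be sketched together. Everything named here (QueueSnapshot, SnapshotGateSketch, ManualClock, the method names) is hypothetical; the 300s and 600s thresholds are taken from the commit message, while treating exactly 300s as still fresh is an assumption.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// One immutable value holding everything a scrape needs, so gauges cannot tear.
public sealed record QueueSnapshot(
    DateTimeOffset ReceivedAt,
    IReadOnlyDictionary<string, long> OldestAgeSeconds);

public sealed class SnapshotGateSketch
{
    private static readonly TimeSpan MaxSnapshotAge = TimeSpan.FromSeconds(300);
    private static readonly IReadOnlyDictionary<string, long> Empty = new Dictionary<string, long>();

    private readonly TimeProvider _clock;
    private volatile QueueSnapshot? _latest;

    public SnapshotGateSketch(TimeProvider clock) => _clock = clock;

    // Replaces the whole snapshot; nothing from the previous one is merged in.
    public void Publish(IReadOnlyDictionary<string, long> ages)
        => _latest = new QueueSnapshot(_clock.GetUtcNow(), ages);

    // Returns nothing once the snapshot is older than 300s, so a stopped
    // watchdog does not keep re-reporting frozen values.
    public IReadOnlyDictionary<string, long> Observe()
    {
        QueueSnapshot? snapshot = _latest;
        if (snapshot is null || _clock.GetUtcNow() - snapshot.ReceivedAt > MaxSnapshotAge)
        {
            return Empty;
        }

        return snapshot.OldestAgeSeconds;
    }

    // Clock skew between database and host can make a job look like it was
    // created in the future; clamp such ages to zero.
    public static long ClampAge(DateTimeOffset now, DateTimeOffset oldestCreateDate)
        => Math.Max(0L, (long)(now - oldestCreateDate).TotalSeconds);

    // The stalled judgment used for the warning log and alert expression.
    public static bool IsStalled(long oldestAgeSeconds, long runningCount)
        => oldestAgeSeconds >= 600 && runningCount == 0;
}

// Minimal controllable clock for trying the sketch out.
public sealed class ManualClock : TimeProvider
{
    public DateTimeOffset Now { get; set; }

    public override DateTimeOffset GetUtcNow() => Now;
}
```

Passing a TimeProvider rather than reading the system clock directly lets a test advance time past the 300s window without sleeping.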
- Use ClockResolver.TimeProvider instead of DateTime.UtcNow in RunWorkAsync
so the watchdog age computation shares one controllable clock with the handler
- Guard OperationCanceledException before the general catch to avoid logging
graceful shutdown as a monitor failure
- Rename log placeholders to {OldestJobAgeSecs} so structured log queries work
- Fix XML doc on StaleQueueWarningThresholdSeconds to include zero-running requirement
- Fix catch-block comment: FhirTimer catches the rethrow, not the other way around
- Scope MeterListener in logic tests to instrument.Meter.Scope to prevent
cross-test leakage in parallel xUnit execution
- Tighten empty-queue assertion to Assert.Equal(0, age); tighten age range to
Assert.InRange(exportAge, 0, 120) so the assertion can detect regressions
- Fix ADR alert expression boundary to >= 600 (matches code) and update clock
reference to ClockResolver.TimeProvider
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
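The catch ordering described in the second bullet might look like the sketch below. TickSketch and RunTickAsync are hypothetical names, and the logging callback stands in for whatever logger the watchdog uses. Restricting the first catch with a filter on the token is one possible design, assumed here so that only a requested shutdown is treated as graceful.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TickSketch
{
    // Runs one monitor tick. Cancellation requested through the token is
    // rethrown without logging; any other failure is logged and rethrown
    // so the caller (the timer loop) can decide what to do next.
    public static async Task RunTickAsync(
        Func<CancellationToken, Task> refreshMetrics,
        Action<Exception> logError,
        CancellationToken cancellationToken)
    {
        try
        {
            await refreshMetrics(cancellationToken);
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // Graceful shutdown: not a monitor failure, so nothing is logged.
            throw;
        }
        catch (Exception ex)
        {
            logError(ex);
            throw;
        }
    }
}
```

Because catch clauses are evaluated in order, the specific handler has to come first; placed after the general one it would never run.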
Description
This pull request introduces a stale job monitor for SQL-backed async job queues. The monitor reports the age of the oldest queued job per queue type so stalled queues can be detected before customers observe delayed operations.
Key changes
Stale job monitor
- Added StaleJobWatchdog to query active jobs for each QueueType, compute the oldest queued job age per queue, log stale queues, and publish StaleJobMetricsNotification.
- Added StaleJobMetricsNotification and StaleJobMetricHandler to expose the latest queue-age snapshot through the FhirServer meter as Jobs.OldestQueuedAgeSeconds with a queue_type tag.
Dependency injection and background service integration
- Registered StaleJobWatchdog as a singleton in SQL Server service registration.
- Registered StaleJobMetricHandler as a singleton MediatR notification handler so the observable gauge reads a stable metric snapshot.
- Updated WatchdogsBackgroundService to start StaleJobWatchdog with the existing SQL watchdogs.
Testing and documentation
- docs/arch/adr-2605-stale-job-monitor.md.
Related issues
Addresses AB#164461.
Testing
- dotnet build .\src\Microsoft.Health.Fhir.Core\Microsoft.Health.Fhir.Core.csproj -c Release -f net8.0 --no-restore
- dotnet build .\src\Microsoft.Health.Fhir.Core\Microsoft.Health.Fhir.Core.csproj -c Release -f net9.0 --no-restore
- dotnet build .\src\Microsoft.Health.Fhir.SqlServer\Microsoft.Health.Fhir.SqlServer.csproj -c Release -f net8.0 --no-restore
- dotnet build .\src\Microsoft.Health.Fhir.SqlServer\Microsoft.Health.Fhir.SqlServer.csproj -c Release -f net9.0 --no-restore
FHIR Team Checklist
- New Feature.
- Azure Healthcare APIs.
- No-PaaS-breaking-change.
- docs/arch/adr-2605-stale-job-monitor.md.
Semver Change
Feature