Feature/stale job monitor #5558

Open
brendankowitz wants to merge 15 commits into main from feature/stale-job-monitor
@brendankowitz brendankowitz commented May 7, 2026

Member

Description

This pull request introduces a stale job monitor for SQL-backed async job queues. The monitor reports the age of the oldest queued job per queue type so stalled queues can be detected before customers observe delayed operations.

Key changes

Stale job monitor

  • Added StaleJobWatchdog to query active jobs for each QueueType, compute the oldest queued job age per queue, log stale queues, and publish StaleJobMetricsNotification.
  • Added StaleJobMetricsNotification and StaleJobMetricHandler to expose the latest queue-age snapshot through the FhirServer meter as Jobs.OldestQueuedAgeSeconds with a queue_type tag.
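
The queue-age computation described above can be illustrated with a small Python sketch (the PR itself is C#, so this is only an illustration of the idea, not the actual implementation). The `Job` tuple and the queue type names are hypothetical stand-ins, and reporting 0 for a queue with no queued jobs is an assumption based on the empty-queue test mentioned in this PR.

```python
from collections import namedtuple
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in for an active job row; not a type from the PR.
Job = namedtuple("Job", ["queue_type", "status", "create_date"])


def oldest_queued_age_seconds(jobs, queue_types, now):
    """Return {queue_type: age in seconds of the oldest queued job, or 0 if none}."""
    # Every known queue type is reported, even when it has no jobs.
    ages = {qt: 0.0 for qt in queue_types}
    for job in jobs:
        if job.status != "Created":  # only jobs still waiting in the queue count
            continue
        age = (now - job.create_date).total_seconds()
        ages[job.queue_type] = max(ages[job.queue_type], age)
    return ages


now = datetime(2026, 5, 7, 12, 0, tzinfo=timezone.utc)
jobs = [
    Job("Export", "Created", now - timedelta(seconds=900)),
    Job("Export", "Created", now - timedelta(seconds=30)),
    Job("Import", "Running", now - timedelta(seconds=5000)),
]
result = oldest_queued_age_seconds(jobs, ["Export", "Import", "BulkDelete"], now)
```

Here `result` maps `Export` to 900 seconds and the other two queue types to 0, so a snapshot always carries one entry per queue type.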

Dependency injection and background service integration

  • Registered StaleJobWatchdog as a singleton in SQL Server service registration.
  • Re-registered StaleJobMetricHandler as a singleton MediatR notification handler so the observable gauge reads a stable metric snapshot.
  • Updated WatchdogsBackgroundService to start StaleJobWatchdog with the existing SQL watchdogs.

Testing and documentation

  • Added logic tests for queue age computation and metric snapshot updates.
  • Added a SQL watchdog integration test that verifies notifications include all queue types when the queue is empty.
  • Added ADR documentation at docs/arch/adr-2605-stale-job-monitor.md.

Related issues

Addresses AB#164461.

Testing

  • dotnet build .\src\Microsoft.Health.Fhir.Core\Microsoft.Health.Fhir.Core.csproj -c Release -f net8.0 --no-restore
  • dotnet build .\src\Microsoft.Health.Fhir.Core\Microsoft.Health.Fhir.Core.csproj -c Release -f net9.0 --no-restore
  • dotnet build .\src\Microsoft.Health.Fhir.SqlServer\Microsoft.Health.Fhir.SqlServer.csproj -c Release -f net8.0 --no-restore
  • dotnet build .\src\Microsoft.Health.Fhir.SqlServer\Microsoft.Health.Fhir.SqlServer.csproj -c Release -f net9.0 --no-restore

FHIR Team Checklist

  • Title is succinct and less than 65 characters.
  • Milestone added for the sprint that it is merged.
  • Tagged with the type of update: New Feature.
  • Tagged with release area: Azure Healthcare APIs.
  • Tagged with PaaS compatibility: No-PaaS-breaking-change.
  • ADR included: docs/arch/adr-2605-stale-job-monitor.md.
  • CI is green before merge.
  • Reviewed squash-merge requirements.

Semver Change

Feature

@brendankowitz brendankowitz added the New Feature, Azure Healthcare APIs, No-PaaS-breaking-change, and ADR-Included labels May 7, 2026
@brendankowitz brendankowitz added this to the FY26\Q4\2Wk\2Wk23 milestone May 7, 2026
@brendankowitz brendankowitz force-pushed the feature/stale-job-monitor branch from b3a52ff to 98da39a Compare May 19, 2026 19:26
codecov-commenter commented May 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@7769ecc). Learn more about missing BASE report.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #5558   +/-   ##
=======================================
  Coverage        ?   77.05%           
=======================================
  Files           ?     1001           
  Lines           ?    36731           
  Branches        ?     5548           
=======================================
  Hits            ?    28303           
  Misses          ?     7078           
  Partials        ?     1350           
@brendankowitz brendankowitz marked this pull request as ready for review May 19, 2026 21:21
@brendankowitz brendankowitz requested a review from a team as a code owner May 19, 2026 21:21
@brendankowitz brendankowitz force-pushed the feature/stale-job-monitor branch from 4487a78 to 230fbb1 Compare May 19, 2026 21:24
brendankowitz and others added 14 commits June 12, 2026 13:47
Spec for a StaleJobWatchdog that emits fhir_oldest_queued_job_age_seconds
Prometheus gauge per queue type when no jobs are running.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6-task plan: notification, metric handler, watchdog, WatchdogsBackgroundService
wiring, DI registration, integration test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Previously, a single running job in any queue masked staleness in every
  other queue, defeating per-queue-type alerting. ComputeQueueAges now
  evaluates the running check per queue.
- StaleJobMetricHandler swapped from a per-key-updated ConcurrentDictionary
  to a volatile reference swap so ObservableGauge scrapes never observe a
  partial multi-queue update.
- Added logic test asserting a running job in one queue does not suppress
  another queue's staleness.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
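
The two fixes in the commit above can be sketched in Python (the PR is C#, so this is an illustration under stated assumptions, not the actual code). The job tuples, `SnapshotHolder`, and the choice to report 0 for a queue that has a running job are hypothetical; `compute_queue_ages` only mirrors the name of `ComputeQueueAges`.

```python
# Hypothetical job rows: (queue_type, status, created_at_epoch_seconds).
def compute_queue_ages(jobs, queue_types, now):
    """Oldest pending age per queue; the running check is scoped to each queue."""
    ages = {}
    for qt in queue_types:
        in_queue = [j for j in jobs if j[0] == qt]
        # Only running jobs in THIS queue suppress its staleness.
        has_running = any(status == "Running" for _, status, _ in in_queue)
        pending = [created for _, status, created in in_queue if status == "Created"]
        if has_running or not pending:
            ages[qt] = 0.0
        else:
            ages[qt] = float(now - min(pending))
    return ages


class SnapshotHolder:
    """Publishes a complete snapshot by swapping one reference."""

    def __init__(self):
        self._snapshot = {}

    def publish(self, ages):
        # Build the full copy first, then replace the reference in one assignment,
        # so a reader sees either the old snapshot or the new one, never a mix.
        self._snapshot = dict(ages)

    def read(self):
        return self._snapshot


jobs = [("Import", "Running", 100), ("Export", "Created", 400)]
holder = SnapshotHolder()
holder.publish(compute_queue_ages(jobs, ["Import", "Export"], now=1000))
snapshot = holder.read()
```

In this example the running Import job leaves the Export queue's 600-second age visible, which is the behavior the added logic test asserts.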
Add an ObservableGauge<long> named Jobs.QueueDepth to the existing
StaleJobMetricHandler, using the same per-tick SQL result set already
fetched by StaleJobWatchdog. Reports pending (Created) and running job
counts per QueueType via queue_type and state tags, complementing the
existing Jobs.OldestQueuedAgeSeconds metric for full active-queue
observability. ADR 2605 amended with the depth metric decision.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
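
As a rough Python illustration of the depth metric described above (the PR is C#), the sketch below turns job rows into one measurement per queue type and state. The `pending` and `running` tag values, the emission of zero counts, and the function name are assumptions; only the `queue_type` and `state` tag names and the `Created` status come from the commit message.

```python
from collections import Counter


def queue_depth_measurements(jobs, queue_types):
    """Return [(count, tags)] with one entry per queue type and state."""
    counts = Counter(jobs)  # jobs are hypothetical (queue_type, status) pairs
    measurements = []
    for qt in queue_types:
        for state, status in (("pending", "Created"), ("running", "Running")):
            # Counter returns 0 for missing keys, so empty states are still reported.
            measurements.append((counts[(qt, status)], {"queue_type": qt, "state": state}))
    return measurements


jobs = [
    ("Export", "Created"),
    ("Export", "Created"),
    ("Export", "Running"),
    ("Import", "Running"),
]
depth = queue_depth_measurements(jobs, ["Export", "Import"])
```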
The watchdog reports both stale-queue age and queue depth metrics, so
JobMonitorWatchdog more accurately describes its broader monitoring role.

Renames the watchdog and its companion notification and metric handler,
moves StaleJobMetricHandler into the Logging/Metrics/Handlers folder to
match the post-rebase main layout (PR #5555 moved metric handlers there),
and updates the Features/Operations/StaleJob folder to
Features/Operations/JobMonitor.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Mark _now as readonly in JobMonitorWatchdogLogicTests
- Replace ContainsKey+indexer with TryGetValue in SqlServerWatchdogTests
  to avoid double dictionary lookups

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace per-queue GetActiveJobs calls with a single aggregate query
  over dbo.JobQueue (Status, MIN(CreateDate), COUNT per queue type);
  no more varchar(max) Definition/Result transfer every tick
- Suppress stale gauge snapshots: gauges emit nothing when no
  notification has arrived within 300s, instead of re-reporting
  frozen values when the watchdog stops or the lease moves
- Merge handler state into a single immutable Snapshot record,
  eliminating cross-gauge tearing
- Rename Jobs.OldestQueuedAgeSeconds to Jobs.OldestQueuedAge (unit
  belongs in unit:, not the name) and report the oldest pending job
  age unconditionally; the stalled judgment (age >= 600s and no
  running jobs) moves to the warning log and alert expression
- Clamp negative ages from clock skew
- Add CoreFeatureConfiguration.EnableJobMonitor flag (default true)
- Add error logging when a tick fails to refresh metrics
- Tests: non-empty-queue and cross-queue-isolation integration
  tests, age-gauge MeterListener test, staleness and
  replace-not-merge tests, clamp test
- Document metric conventions and updated semantics in ADR-2605

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
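
Several of the behaviors listed in the commit above (stale-snapshot suppression, the immutable snapshot, negative-age clamping, and the stalled judgment) can be sketched together in Python. This is an illustration only: the PR is C#, all names are hypothetical, and treating exactly 300 seconds as still fresh is an assumption. The 300-second and 600-second values come from the commit message.

```python
from dataclasses import dataclass

SNAPSHOT_MAX_AGE_SECONDS = 300  # gauges go silent after this long without a notification
STALLED_AGE_SECONDS = 600       # stalled means age >= 600s and no running jobs


@dataclass(frozen=True)
class Snapshot:
    taken_at: float    # epoch seconds when the watchdog published
    oldest_age: tuple  # ((queue_type, age_seconds), ...)


def clamp_age(now, created_at):
    """Clock skew can make a job look newer than 'now'; never report a negative age."""
    return max(0.0, now - created_at)


def observe_ages(snapshot, now):
    """Return gauge measurements, or nothing when the snapshot is missing or too old."""
    if snapshot is None or now - snapshot.taken_at > SNAPSHOT_MAX_AGE_SECONDS:
        return []
    return list(snapshot.oldest_age)


def is_stalled(age_seconds, running_count):
    return age_seconds >= STALLED_AGE_SECONDS and running_count == 0


snap = Snapshot(
    taken_at=1000.0,
    oldest_age=(
        ("Export", clamp_age(1000.0, 350.0)),
        ("Import", clamp_age(1000.0, 1004.0)),  # created "in the future": clamped to 0
    ),
)
fresh = observe_ages(snap, now=1100.0)    # 100s old: values are reported
expired = observe_ages(snap, now=1400.0)  # 400s old: nothing is reported
```

Keeping the stalled judgment in a separate function reflects the commit's split: the gauge reports the raw age unconditionally, while the threshold is applied by the warning log and the alert expression.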
@brendankowitz brendankowitz force-pushed the feature/stale-job-monitor branch from 230fbb1 to 735ac63 Compare June 12, 2026 22:10
- Use ClockResolver.TimeProvider instead of DateTime.UtcNow in RunWorkAsync
  so the watchdog age computation shares one controllable clock with the handler
- Guard OperationCanceledException before the general catch to avoid logging
  graceful shutdown as a monitor failure
- Rename log placeholders to {OldestJobAgeSecs} so structured log queries work
- Fix XML doc on StaleQueueWarningThresholdSeconds to include zero-running requirement
- Fix catch-block comment: FhirTimer catches the rethrow, not the other way around
- Scope MeterListener in logic tests to instrument.Meter.Scope to prevent
  cross-test leakage in parallel xUnit execution
- Tighten empty-queue assertion to Assert.Equal(0, age); tighten age range to
  Assert.InRange(exportAge, 0, 120) so the assertion can detect regressions
- Fix ADR alert expression boundary to >= 600 (matches code) and update clock
  reference to ClockResolver.TimeProvider

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
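
The cancellation guard mentioned in the commit above follows a common handler-ordering pattern, sketched here in Python (the PR is C#). `OperationCancelled` is a hypothetical stand-in for .NET's `OperationCanceledException`, and the list-based log and `run_tick` are illustrative only.

```python
class OperationCancelled(Exception):
    """Hypothetical stand-in for .NET's OperationCanceledException."""


def run_tick(work, log):
    try:
        work()
    except OperationCancelled:
        # Graceful shutdown: propagate without recording a monitor failure.
        raise
    except Exception as exc:
        # Any other failure is logged, then rethrown for the caller to handle.
        log.append(f"job monitor tick failed: {exc}")
        raise


def cancelled():
    raise OperationCancelled()


def broken():
    raise ValueError("boom")


log = []
outcomes = []
for work in (cancelled, broken):
    try:
        run_tick(work, log)
    except Exception as exc:
        outcomes.append(type(exc).__name__)
```

Both exceptions still reach the caller, but only the genuine failure produces a log entry, because the more specific handler is listed first.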