
[Incident] Elevated Backend Latency — Cold-Path JVM/Hibernate Warm-up After Idle Period #68

@yortch

Description


Summary

Azure Monitor alert alert-latency-sre-three-rivers fired at 2026-03-24T21:35:53 UTC on Container App ca-banking-demo-backend. Backend response time reached 3120ms average (4681ms max) when 4 requests hit the /api/cards endpoint after 3.5+ hours of zero traffic. No errors, no restarts, no resource starvation: this is a JVM cold-path latency issue caused by Hibernate/Spring framework re-initialization after an extended idle period.

Severity: Sev3 | Status: Resolved (self-recovering) | Duration: ~5 minutes


Impact

  • 4 user requests experienced degraded response times (3120ms avg, 4681ms max) vs expected <500ms baseline
  • Affected endpoint: GET /api/cards (card listing — landing page flow)
  • No data loss or errors — requests completed successfully, just slowly
  • Impact limited to first requests after idle; subsequent requests at normal speed

Timeline

| Time (UTC) | Event |
| --- | --- |
| ~18:00 | Last observed traffic before idle period |
| 18:00–21:25 | Zero traffic — 3.5 hours of complete idle |
| 21:27–21:31 | First requests arrive at GET /api/cards |
| 21:30 | ResponseTime spike: avg 3120ms, max 4681ms (4 requests) |
| 21:31:27 | Log: CreditCardController: GET /api/cards via nio-8080-exec-5 |
| 21:31:33 | Hibernate SQL generated for credit_card table |
| 21:31:51 | Second request via nio-8080-exec-1, service fetching from H2 |
| 21:31:57 | Hibernate SQL re-executed for credit_card table |
| 21:35:53 | Alert fired: alert-latency-sre-three-rivers |
| 21:36+ | No further requests; latency spike self-resolved |

Evidence

Metrics (Azure Monitor)

| Metric | Value at 21:30 | Baseline | Assessment |
| --- | --- | --- | --- |
| ResponseTime (avg) | 3120ms | <500ms | ELEVATED — 6x above expected |
| ResponseTime (max) | 4681ms | <500ms | ELEVATED |
| Requests | 4 | 0 (idle) | First traffic after 3.5h |
| CPU % | 0.1% | 0% | Normal — no starvation |
| Memory % | 24% | 24% | Normal — flat |
| Replicas | 1 | 1 | Stable — no scale events |
| RestartCount | 0 | 0 | No restarts |
| JvmGcDuration | 0ms | 0ms | No GC pressure |

Logs (Log Analytics)

Console logs at incident time show normal execution — no errors, exceptions, or warnings:

21:31:27 INFO CreditCardController: GET /api/cards - cardType: null, noAnnualFee: null
21:31:27 INFO CreditCardService: Fetching all credit cards from H2 database
21:31:33 Hibernate: select cc1_0.id, ... from credit_card cc1_0
21:31:51 INFO CreditCardController: GET /api/cards - cardType: null, noAnnualFee: null
21:31:51 INFO CreditCardService: Fetching all credit cards from H2 database
21:31:57 Hibernate: select cc1_0.id, ... from credit_card cc1_0

System logs: No restarts, crashes, OOM events, or health probe failures in the 6-hour window.

Configuration

| Setting | Value | Notes |
| --- | --- | --- |
| Image | crbankingdemooqaqx.azurecr.io/three-rivers-bank/backend-banking-demo:azd-deploy-1774199736 | Unchanged |
| CPU/Memory | 0.5 vCPU / 1GB | Standard allocation |
| minReplicas | 1 | Prevents scale-to-zero |
| maxReplicas | 3 | |
| targetPort | 8080 | Correct |
| BIAN_API_BASE_URL | https://virtserver.swaggerhub.com/B154/BIAN/CreditCard/13.0.0 | Correct |
| SPRING_PROFILES_ACTIVE | production | Correct |

Root Cause

JVM/Spring Boot cold-path latency after extended idle period.

The request flow for GET /api/cards:

  1. CreditCardController.getAllCards() — REST controller
  2. CreditCardService.getAllCreditCards() — calls creditCardRepository.findAll() on H2
  3. Hibernate generates and executes SELECT ... FROM credit_card
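Under standard Spring Data JPA assumptions, that flow presumably maps onto shapes like the following sketch. Only the class and method names come from the logs above; the entity, request parameters, and wiring are illustrative:

```java
// Sketch of the request path, assuming Spring Boot + Spring Data JPA.
// Class/method names are from the logs; everything else is illustrative.
@RestController
class CreditCardController {
    private final CreditCardService service;
    CreditCardController(CreditCardService service) { this.service = service; }

    @GetMapping("/api/cards")
    List<CreditCard> getAllCards(@RequestParam(required = false) String cardType,
                                 @RequestParam(required = false) Boolean noAnnualFee) {
        return service.getAllCreditCards();
    }
}

@Service
class CreditCardService {
    private final CreditCardRepository creditCardRepository;
    CreditCardService(CreditCardRepository repo) { this.creditCardRepository = repo; }

    List<CreditCard> getAllCreditCards() {
        return creditCardRepository.findAll(); // Hibernate: select ... from credit_card
    }
}

// Spring Data generates the findAll() implementation at runtime
interface CreditCardRepository extends JpaRepository<CreditCard, Long> {}
```

On the first request after idle, every layer of this stack pays its cold-path cost at once, which is why the latency is front-loaded onto a handful of requests.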

Why it was slow: After 3.5 hours of zero traffic, multiple cold-path costs stack up on the first request:

| Cold-Path Factor | Estimated Contribution | Explanation |
| --- | --- | --- |
| Hibernate query plan cache cold | ~1500ms | JPA query compilation on first findAll() call |
| H2 connection pool stale/reconnect | ~500ms | Datasource reconnection after idle timeout |
| Spring servlet thread pool cold | ~300ms | nio-8080-exec thread initialization |
| JVM JIT deoptimization | ~500ms | Hot code paths decompiled during idle |
| Container Apps platform routing | ~300ms | Envoy sidecar warm-up after idle |

These estimates sum to ~3100ms, consistent with the observed 3120ms average.

Why CPU/Memory remained low: The bottleneck is I/O wait and initialization overhead, not compute. The JVM is spending time on class loading, connection establishment, and query plan compilation — none of which consume significant CPU.

Contributing factors in application.yml:

  • spring.jpa.show-sql: true — unnecessary overhead in production (line 20)
  • spring.cache.type: simple — no TTL, no warmup mechanism (line 28)
  • No health check warmup configured to keep JVM hot
  • No Application Insights configured — limits observability

Remediation

Immediate (No Code Change)

  1. Send a warmup request to verify the app is responsive and latency normalizes:
    curl -s https://ca-banking-demo-backend.greenpebble-7a243cbc.eastus2.azurecontainerapps.io/api/cards | head -c 200
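To confirm recovery quantitatively rather than eyeballing the output, the same request can be timed against the <500ms baseline from the metrics table (URL from the alert; the 0.5s threshold is an assumption based on that baseline):

```shell
# Time the warmup request and compare it against the 500ms baseline.
URL="https://ca-banking-demo-backend.greenpebble-7a243cbc.eastus2.azurecontainerapps.io/api/cards"
t=$(curl -s -o /dev/null -w '%{time_total}' "$URL")
# awk exits 0 when t < 0.5s, i.e. latency is back at baseline
if awk -v t="$t" 'BEGIN { exit !(t < 0.5) }'; then
  echo "warm: ${t}s"
else
  echo "still cold: ${t}s"
fi
```

Running this twice in a row should show the cold-path effect directly: a multi-second first call, then a sub-baseline second call.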

Short-Term (Configuration)

  1. Add a scheduled health probe/warmup — Configure a periodic keep-alive that hits /api/cards every 5 minutes to prevent the JVM from going cold. This can be done via:

    • Azure Logic App / Timer trigger
    • Container Apps built-in health probes with a custom warmup path
    • A @Scheduled Spring Bean that calls getAllCreditCards() periodically
  2. Disable show-sql in production in application.yml:

    spring:
      jpa:
        show-sql: false
  3. Tune H2 connection pool keep-alive to prevent stale connections during idle:

    spring:
      datasource:
        hikari:
          connection-test-query: SELECT 1
          keepalive-time: 300000  # 5 minutes
          idle-timeout: 600000    # 10 minutes
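A minimal sketch of the @Scheduled warmup bean from option 1, assuming @EnableScheduling is present on the application class; the bean name is hypothetical, and the service name mirrors the root-cause section:

```java
// Hypothetical warmup bean, assuming Spring Boot with @EnableScheduling.
// Calling through the service keeps the Hibernate query plan cache, Hikari
// connections, and JIT-compiled hot paths warm during idle periods.
@Component
class CardWarmupScheduler {
    private static final Logger log = LoggerFactory.getLogger(CardWarmupScheduler.class);
    private final CreditCardService creditCardService;

    CardWarmupScheduler(CreditCardService creditCardService) {
        this.creditCardService = creditCardService;
    }

    // Every 5 minutes, matching the hikari keepalive-time above
    @Scheduled(fixedRate = 300_000)
    void warmup() {
        long start = System.nanoTime();
        int count = creditCardService.getAllCreditCards().size();
        log.debug("Warmup: {} cards in {} ms", count,
                  (System.nanoTime() - start) / 1_000_000);
    }
}
```

An in-process scheduler avoids the extra Azure resource a Logic App would require, at the cost of warming only replicas that are already running.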

Long-Term (Architecture)

  1. Add Application Insights for full APM tracing (request duration, dependency calls, JVM metrics)
  2. Consider increasing minReplicas to 2 during business hours for redundancy
  3. Implement a startup warmup controller that pre-loads Hibernate caches and verifies H2 connectivity on container start
  4. Tune alert threshold — the current alert fires on any elevated response time; consider adding a request-volume qualifier (e.g., only alert when response time > 3s AND request count > 10)
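Long-term item 3 could be sketched as an ApplicationRunner. This is a hypothetical bean, assuming the same CreditCardService as above:

```java
// Hypothetical startup warmup, assuming Spring Boot. Runs once after the
// application context starts: exercises the repository so Hibernate compiles
// its query plan and Hikari opens H2 connections before real traffic arrives.
@Component
class StartupWarmup implements ApplicationRunner {
    private static final Logger log = LoggerFactory.getLogger(StartupWarmup.class);
    private final CreditCardService creditCardService;

    StartupWarmup(CreditCardService creditCardService) {
        this.creditCardService = creditCardService;
    }

    @Override
    public void run(ApplicationArguments args) {
        int count = creditCardService.getAllCreditCards().size();
        log.info("Startup warmup complete: {} cards pre-loaded", count);
    }
}
```

Note this only removes the cold start at container boot; the periodic @Scheduled warmup is still needed to cover JIT deoptimization and pool staleness during long idle windows.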

Action Items

  • Add @Scheduled warmup bean to periodically call getAllCreditCards() during idle periods
  • Set spring.jpa.show-sql: false for production profile
  • Configure HikariCP keep-alive settings to prevent stale H2 connections
  • Add Application Insights to rg-banking-demo for APM tracing
  • Review alert threshold to include minimum request volume qualifier
  • Consider startup warmup controller for container initialization

Detected by Azure SRE Agent | Alert: alert-latency-sre-three-rivers | Resource: ca-banking-demo-backend

This issue was created by sre-sre-three-rivers-bswqe--b2b14894
