
[Incident] Elevated Backend Latency — Cold-Path JVM/Hibernate Warm-up After Idle Period #68

@yortch

Description


Summary

Azure Monitor alert alert-latency-sre-three-rivers fired at 2026-03-24T21:35:53 UTC on Container App ca-banking-demo-backend. Backend response time reached 3120ms average (4681ms max) when 4 requests hit the /api/cards endpoint after 3.5+ hours of zero traffic. No errors, no restarts, no resource starvation: this is a JVM cold-path latency issue caused by Hibernate/Spring framework re-initialization after an extended idle period.

Severity: Sev3 | Status: Resolved (self-recovering) | Duration: ~5 minutes


Impact

  • 4 user requests experienced degraded response times (3120ms avg, 4681ms max) vs expected <500ms baseline
  • Affected endpoint: GET /api/cards (card listing — landing page flow)
  • No data loss or errors — requests completed successfully, just slowly
  • Impact limited to first requests after idle; subsequent requests at normal speed

Timeline

| Time (UTC) | Event |
| --- | --- |
| ~18:00 | Last observed traffic before idle period |
| 18:00–21:25 | Zero traffic — 3.5 hours of complete idle |
| 21:27–21:31 | First requests arrive at GET /api/cards |
| 21:30 | ResponseTime spike: avg 3120ms, max 4681ms (4 requests) |
| 21:31:27 | Log: CreditCardController: GET /api/cards via nio-8080-exec-5 |
| 21:31:33 | Hibernate SQL generated for credit_card table |
| 21:31:51 | Second request via nio-8080-exec-1, service fetching from H2 |
| 21:31:57 | Hibernate SQL re-executed for credit_card table |
| 21:35:53 | Alert fired: alert-latency-sre-three-rivers |
| 21:36+ | No further requests; latency spike self-resolved |

Evidence

Metrics (Azure Monitor)

| Metric | Value at 21:30 | Baseline | Assessment |
| --- | --- | --- | --- |
| ResponseTime (avg) | 3120ms | <500ms | ELEVATED — 6x above expected |
| ResponseTime (max) | 4681ms | <500ms | ELEVATED |
| Requests | 4 | 0 (idle) | First traffic after 3.5h |
| CPU % | 0.1% | 0% | Normal — no starvation |
| Memory % | 24% | 24% | Normal — flat |
| Replicas | 1 | 1 | Stable — no scale events |
| RestartCount | 0 | 0 | No restarts |
| JvmGcDuration | 0ms | 0ms | No GC pressure |

Logs (Log Analytics)

Console logs at incident time show normal execution — no errors, exceptions, or warnings:

21:31:27 INFO CreditCardController: GET /api/cards - cardType: null, noAnnualFee: null
21:31:27 INFO CreditCardService: Fetching all credit cards from H2 database
21:31:33 Hibernate: select cc1_0.id, ... from credit_card cc1_0
21:31:51 INFO CreditCardController: GET /api/cards - cardType: null, noAnnualFee: null
21:31:51 INFO CreditCardService: Fetching all credit cards from H2 database
21:31:57 Hibernate: select cc1_0.id, ... from credit_card cc1_0

System logs: No restarts, crashes, OOM events, or health probe failures in the 6-hour window.

Configuration

| Setting | Value | Notes |
| --- | --- | --- |
| Image | crbankingdemooqaqx.azurecr.io/three-rivers-bank/backend-banking-demo:azd-deploy-1774199736 | Unchanged |
| CPU/Memory | 0.5 vCPU / 1GB | Standard allocation |
| minReplicas | 1 | Prevents scale-to-zero |
| maxReplicas | 3 | |
| targetPort | 8080 | Correct |
| BIAN_API_BASE_URL | https://virtserver.swaggerhub.com/B154/BIAN/CreditCard/13.0.0 | Correct |
| SPRING_PROFILES_ACTIVE | production | Correct |

Root Cause

JVM/Spring Boot cold-path latency after extended idle period.

The request flow for GET /api/cards:

  1. CreditCardController.getAllCards() — REST controller
  2. CreditCardService.getAllCreditCards() — calls creditCardRepository.findAll() on H2
  3. Hibernate generates and executes SELECT ... FROM credit_card
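Under standard Spring Data JPA assumptions, that flow presumably maps onto shapes like the following sketch. Only the class and method names come from the logs above; the entity, request parameters, and wiring are illustrative:

```java
// Sketch of the request path, assuming Spring Boot + Spring Data JPA.
// Class/method names are from the logs; everything else is illustrative.
@RestController
class CreditCardController {
    private final CreditCardService service;
    CreditCardController(CreditCardService service) { this.service = service; }

    @GetMapping("/api/cards")
    List<CreditCard> getAllCards(@RequestParam(required = false) String cardType,
                                 @RequestParam(required = false) Boolean noAnnualFee) {
        return service.getAllCreditCards();
    }
}

@Service
class CreditCardService {
    private final CreditCardRepository creditCardRepository;
    CreditCardService(CreditCardRepository repo) { this.creditCardRepository = repo; }

    List<CreditCard> getAllCreditCards() {
        return creditCardRepository.findAll(); // Hibernate: select ... from credit_card
    }
}

// Spring Data generates the findAll() implementation at runtime
interface CreditCardRepository extends JpaRepository<CreditCard, Long> {}
```

On the first request after idle, every layer of this stack pays its cold-path cost at once, which is why the latency is front-loaded onto a handful of requests.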

Why it was slow: After 3.5 hours of zero traffic, multiple cold-path costs stack up on the first request:

| Cold-Path Factor | Estimated Contribution | Explanation |
| --- | --- | --- |
| Hibernate query plan cache cold | ~1500ms | JPA query compilation on first findAll() call |
| H2 connection pool stale/reconnect | ~500ms | Datasource reconnection after idle timeout |
| Spring servlet thread pool cold | ~300ms | nio-8080-exec thread initialization |
| JVM JIT deoptimization | ~500ms | Hot code paths decompiled during idle |
| Container Apps platform routing | ~300ms | Envoy sidecar warm-up after idle |

These estimates sum to ~3100ms, consistent with the observed 3120ms average.

Why CPU/Memory remained low: The bottleneck is I/O wait and initialization overhead, not compute. The JVM is spending time on class loading, connection establishment, and query plan compilation — none of which consume significant CPU.

Contributing factors in application.yml:

  • spring.jpa.show-sql: true — unnecessary overhead in production (line 20)
  • spring.cache.type: simple — no TTL, no warmup mechanism (line 28)
  • No health check warmup configured to keep JVM hot
  • No Application Insights configured — limits observability

Remediation

Immediate (No Code Change)

  1. Send a warmup request to verify the app is responsive and latency normalizes:
    curl -s https://ca-banking-demo-backend.greenpebble-7a243cbc.eastus2.azurecontainerapps.io/api/cards | head -c 200
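To confirm recovery quantitatively rather than eyeballing the output, the same request can be timed against the <500ms baseline from the metrics table (URL from the alert; the 0.5s threshold is an assumption based on that baseline):

```shell
# Time the warmup request and compare it against the 500ms baseline.
URL="https://ca-banking-demo-backend.greenpebble-7a243cbc.eastus2.azurecontainerapps.io/api/cards"
t=$(curl -s -o /dev/null -w '%{time_total}' "$URL")
# awk exits 0 when t < 0.5s, i.e. latency is back at baseline
if awk -v t="$t" 'BEGIN { exit !(t < 0.5) }'; then
  echo "warm: ${t}s"
else
  echo "still cold: ${t}s"
fi
```

Running this twice in a row should show the cold-path effect directly: a multi-second first call, then a sub-baseline second call.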

Short-Term (Configuration)

  1. Add a scheduled health probe/warmup — Configure a periodic keep-alive that hits /api/cards every 5 minutes to prevent the JVM from going cold. This can be done via:

    • Azure Logic App / Timer trigger
    • Container Apps built-in health probes with a custom warmup path
    • A @Scheduled Spring Bean that calls getAllCreditCards() periodically
  2. Disable show-sql in production in application.yml:

    spring:
      jpa:
        show-sql: false
  3. Tune H2 connection pool keep-alive to prevent stale connections during idle:

    spring:
      datasource:
        hikari:
          connection-test-query: SELECT 1
          keepalive-time: 300000  # 5 minutes
          idle-timeout: 600000    # 10 minutes
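A minimal sketch of the @Scheduled warmup bean from option 1, assuming @EnableScheduling is present on the application class; the bean name is hypothetical, and the service name mirrors the root-cause section:

```java
// Hypothetical warmup bean, assuming Spring Boot with @EnableScheduling.
// Calling through the service keeps the Hibernate query plan cache, Hikari
// connections, and JIT-compiled hot paths warm during idle periods.
@Component
class CardWarmupScheduler {
    private static final Logger log = LoggerFactory.getLogger(CardWarmupScheduler.class);
    private final CreditCardService creditCardService;

    CardWarmupScheduler(CreditCardService creditCardService) {
        this.creditCardService = creditCardService;
    }

    // Every 5 minutes, matching the hikari keepalive-time above
    @Scheduled(fixedRate = 300_000)
    void warmup() {
        long start = System.nanoTime();
        int count = creditCardService.getAllCreditCards().size();
        log.debug("Warmup: {} cards in {} ms", count,
                  (System.nanoTime() - start) / 1_000_000);
    }
}
```

An in-process scheduler avoids the extra Azure resource a Logic App would require, at the cost of warming only replicas that are already running.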

Long-Term (Architecture)

  1. Add Application Insights for full APM tracing (request duration, dependency calls, JVM metrics)
  2. Consider increasing minReplicas to 2 during business hours for redundancy
  3. Implement a startup warmup controller that pre-loads Hibernate caches and verifies H2 connectivity on container start
  4. Tune alert threshold — the current alert fires on any elevated response time; consider adding a request-volume qualifier (e.g., only alert when response time > 3s AND request count > 10)
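Long-term item 3 could be sketched as an ApplicationRunner. This is a hypothetical bean, assuming the same CreditCardService as above:

```java
// Hypothetical startup warmup, assuming Spring Boot. Runs once after the
// application context starts: exercises the repository so Hibernate compiles
// its query plan and Hikari opens H2 connections before real traffic arrives.
@Component
class StartupWarmup implements ApplicationRunner {
    private static final Logger log = LoggerFactory.getLogger(StartupWarmup.class);
    private final CreditCardService creditCardService;

    StartupWarmup(CreditCardService creditCardService) {
        this.creditCardService = creditCardService;
    }

    @Override
    public void run(ApplicationArguments args) {
        int count = creditCardService.getAllCreditCards().size();
        log.info("Startup warmup complete: {} cards pre-loaded", count);
    }
}
```

Note this only removes the cold start at container boot; the periodic @Scheduled warmup is still needed to cover JIT deoptimization and pool staleness during long idle windows.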

Action Items

  • Add @Scheduled warmup bean to periodically call getAllCreditCards() during idle periods
  • Set spring.jpa.show-sql: false for production profile
  • Configure HikariCP keep-alive settings to prevent stale H2 connections
  • Add Application Insights to rg-banking-demo for APM tracing
  • Review alert threshold to include minimum request volume qualifier
  • Consider startup warmup controller for container initialization

Detected by Azure SRE Agent | Alert: alert-latency-sre-three-rivers | Resource: ca-banking-demo-backend

This issue was created by sre-sre-three-rivers-bswqe--b2b14894
