
[Incident] Recurring Thread.sleep chaos injection causing elevated backend latency (3rd occurrence) #77

@yortch

Summary

Azure Monitor alert alert-latency-sre-three-rivers fired at 2026-03-25T18:35:00Z for elevated average response time on ca-banking-demo-backend. The root cause is a recurring Thread.sleep chaos injection in the service layer. This is the 3rd occurrence of this exact pattern (previous: Issue #71, Issue #73).

Impact

  • Affected endpoint: GET /api/cards — the primary card listing API used by the frontend
  • Impact scope: Every call to the homepage card listing experiences random 0–9 second delays
  • User experience: Significantly degraded — page load times unpredictable (0s to 9s added latency)
  • Severity: Sev3 (availability not affected, but performance severely degraded)
  • Traffic: Low volume (9 requests at 18:00, 2 at 18:30) — sparse traffic amplifies the 5-min average latency metric

Timeline (UTC)

Time | Event
--- | ---
Unknown | Thread.sleep chaos code re-introduced in CreditCardService.java line 36
18:00:57 | GET /api/cards request logged (slow response expected)
18:01:28–18:02:01 | 5x WARN logs for /api/cards/compare route mismatch (unrelated)
18:30:56 | GET /api/cards requests resume
18:35:00 | Alert fired: alert-latency-sre-three-rivers
18:40:00 | Investigation started by SRE Agent

Evidence

1. Source Code — Root Cause Line

File: backend/src/main/java/com/threeriversbank/service/CreditCardService.java#L36

@Transactional(readOnly = true)
public List<CreditCardDto> getAllCreditCards() {
    log.info("Fetching all credit cards from H2 database");
    try { Thread.sleep((long)(Math.random() * 9000)); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    return creditCardRepository.findAll().stream()
            .map(this::convertToDto)
            .collect(Collectors.toList());
}

This injects a random delay of 0–9000ms on every call to getAllCreditCards(), which backs the GET /api/cards endpoint.

2. Resource Metrics — No Resource Starvation

Metric | Value | Threshold | Status
--- | --- | --- | ---
CPU (avg) | ~1.1–1.6 millicores | 500m allocation | ✅ Normal (< 1%)
Memory (avg) | ~252 MB | 1024 MB allocation | ✅ Normal (~25%)
Replicas | 2 | min: 2, max: 3 | ✅ Normal
Restart Count | 0 | | ✅ No restarts

CPU and memory are both well within allocation, confirming this is not resource starvation.

3. Request Metrics (18:00–18:40 UTC, 5min intervals)

Time | Requests
--- | ---
18:00 | 9
18:05–18:25 | 0
18:30 | 2
18:35 | 0

Sparse traffic means each slow request heavily skews the 5-minute average, crossing the alert threshold.
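The effect of request volume on the windowed average can be sketched with basic statistics. This is a rough model, not a measurement: it assumes the injected delays are independent and continuously uniform over 0–9000 ms, and it ignores baseline latency and traffic to other endpoints. The class and method names are illustrative. Under those assumptions each call gains about 4.5 s on average, and the fewer requests a window contains, the further its average can swing from that figure.

```java
public class InjectedDelayStats {

    // Upper bound of the injected delay, taken from Thread.sleep((long)(Math.random() * 9000)).
    static final double MAX_DELAY_MS = 9000.0;

    // Mean of a uniform delay on [0, MAX_DELAY_MS): half the range.
    static double expectedDelayMs() {
        return MAX_DELAY_MS / 2.0;
    }

    // Standard deviation of the average over n independent uniform delays:
    // range / sqrt(12 * n). Assumes independence between requests.
    static double stdDevOfWindowAverageMs(int requests) {
        if (requests <= 0) {
            throw new IllegalArgumentException("requests must be positive");
        }
        return MAX_DELAY_MS / Math.sqrt(12.0 * requests);
    }

    public static void main(String[] args) {
        // 2 and 9 are the request counts observed at 18:30 and 18:00; 100 is a hypothetical busy window.
        for (int n : new int[] {2, 9, 100}) {
            System.out.printf("requests=%d expected added delay=%.0f ms, std dev of window average=%.0f ms%n",
                    n, expectedDelayMs(), stdDevOfWindowAverageMs(n));
        }
    }
}
```

With only 2 requests in a window, the spread of the average is several times larger than it would be in a busy window, which is consistent with the alert firing on the 18:30 data points.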

4. Console Logs

  • No exceptions or errors related to latency — all requests return 200 OK
  • 5x WARN for MethodArgumentTypeMismatchException on /api/cards/compare (frontend routing issue, unrelated to latency)
  • Normal "Fetching all credit cards from H2 database" log messages present

5. Container App Configuration

  • Image: crbankingdemooqaqx.azurecr.io/three-rivers-bank/backend-banking-demo:azd-deploy-1774443429
  • Revision: ca-banking-demo-backend--azd-1774443552
  • Status: Running
  • Scale: 2–3 replicas

Root Cause

Confirmed: Thread.sleep((long)(Math.random() * 9000)) injected at line 36 of CreditCardService.java introduces a random blocking delay of 0–9 seconds on every call to GET /api/cards.

This is the same chaos injection found in Issue #71 and Issue #73. Either it was re-introduced after the previous fix, or it was never fully removed from the deployed code.

Remediation

Immediate Fix

Remove the Thread.sleep line from CreditCardService.java:

@Transactional(readOnly = true)
public List<CreditCardDto> getAllCreditCards() {
    log.info("Fetching all credit cards from H2 database");
-   try { Thread.sleep((long)(Math.random() * 9000)); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    return creditCardRepository.findAll().stream()
            .map(this::convertToDto)
            .collect(Collectors.toList());
}

Preventive Measures

  1. Add a CI check (grep/regex scan) to block Thread.sleep in production source paths
  2. Add a unit test that asserts getAllCreditCards() completes within 1 second
  3. Code review gate: Flag any Thread.sleep in service layer as a blocking review comment
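The CI check in item 1 could look roughly like the following. This is a minimal sketch, not the project's actual pipeline step: the class name ThreadSleepScanner is hypothetical, the default path is taken from the file location cited in the Evidence section, and the regex is a plain text match, so it also flags Thread.sleep inside comments or string literals.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ThreadSleepScanner {

    // Matches "Thread.sleep(" allowing whitespace around the dot and before the parenthesis.
    private static final Pattern SLEEP_CALL = Pattern.compile("\\bThread\\s*\\.\\s*sleep\\s*\\(");

    // Returns one "path:line: source" entry per matching line in .java files under root.
    static List<String> findViolations(Path root) throws IOException {
        List<Path> javaFiles;
        try (Stream<Path> paths = Files.walk(root)) {
            javaFiles = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".java"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        List<String> violations = new ArrayList<>();
        for (Path file : javaFiles) {
            List<String> lines = Files.readAllLines(file); // reads as UTF-8
            for (int i = 0; i < lines.size(); i++) {
                if (SLEEP_CALL.matcher(lines.get(i)).find()) {
                    violations.add(file + ":" + (i + 1) + ": " + lines.get(i).trim());
                }
            }
        }
        return violations;
    }

    public static void main(String[] args) throws IOException {
        // Default to the production source path; test sources are deliberately not scanned.
        Path root = Path.of(args.length > 0 ? args[0] : "backend/src/main/java");
        List<String> violations = findViolations(root);
        violations.forEach(System.err::println);
        if (!violations.isEmpty()) {
            System.exit(1); // non-zero exit fails the CI step
        }
    }
}
```

Scanning only the production source path keeps legitimate sleeps in test code from failing the build. A grep one-liner would give the same result; the Java form is shown only so it can run with the toolchain the backend already uses.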

Action Items

  • Remove Thread.sleep from CreditCardService.java line 36
  • Deploy updated image to ca-banking-demo-backend
  • Add CI pipeline check to prevent Thread.sleep in production code paths
  • Add performance unit test for getAllCreditCards() (< 1s threshold)
  • Investigate why this chaos injection keeps recurring (3rd time)
  • Fix unrelated /api/cards/compare route mismatch (separate issue)
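The performance test from the action items above can be sketched as a small latency-budget helper. The helper name LatencyBudget and the stand-in task in main are hypothetical; a real test would call getAllCreditCards() on the service under test. Because the injected delay is random over 0–9 s, a single call can finish under 1 s by chance (roughly one time in nine), so the sketch repeats the call several times.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.Callable;

public class LatencyBudget {

    // Runs the task once and returns the wall-clock time it took.
    static Duration measure(Callable<?> task) throws Exception {
        long start = System.nanoTime();
        task.call();
        return Duration.ofNanos(System.nanoTime() - start);
    }

    // Runs the task the given number of times and fails on the first run that exceeds the budget.
    static void assertEachRunWithin(Duration budget, int runs, Callable<?> task) throws Exception {
        for (int run = 1; run <= runs; run++) {
            Duration elapsed = measure(task);
            if (elapsed.compareTo(budget) > 0) {
                throw new AssertionError("Run " + run + " took " + elapsed.toMillis()
                        + " ms, budget is " + budget.toMillis() + " ms");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for creditCardService.getAllCreditCards(); replace with the real call in a test.
        Callable<List<String>> fetchCards = () -> List.of("card-1", "card-2");
        assertEachRunWithin(Duration.ofSeconds(1), 5, fetchCards);
        System.out.println("All runs completed within budget");
    }
}
```

In a JUnit 5 test, Assertions.assertTimeout(Duration, Executable) expresses the same budget check; the repetition would still be needed to catch a random delay reliably.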

Related Issues

  • Issue #71 (previous occurrence)
  • Issue #73 (previous occurrence)

Detected by Azure SRE Agent | Alert ID: f6ab46ac-9e7d-48f8-9f5d-6e077903f000

This issue was created by sre-sre-three-rivers-bswqe--b2b14894
Tracked by the SRE agent here
