Elevated Backend Latency — Thread.sleep Chaos Injection in CreditCardService (Recurring) #75

@yortch

Summary

Azure Monitor alert alert-latency-sre-three-rivers fired at 2026-03-25T17:33:54Z (Sev3) on ca-banking-demo-backend, indicating elevated average response time. The root cause is a recurring chaos injection: Thread.sleep((long)(Math.random() * 9000)) at line 36 of CreditCardService.java, which adds 0–9 seconds of random latency to every GET /api/cards request.

This is the third occurrence of this pattern (previously tracked in #71 and #73).

Impact

  • Affected endpoint: GET /api/cards — the primary card catalog API
  • Latency range: 0–9,000ms random delay per request (avg ~4.5s)
  • User experience: Card listing and comparison pages load slowly or appear to hang
  • Severity: Sev3 (service degraded but functional; all requests return 200)

Timeline (UTC)

Time | Event
2026-03-25 12:59:19 | Latest revision ca-banking-demo-backend--azd-1774443552 deployed
2026-03-25 17:01:01 | GET /api/cards request logged (with Thread.sleep delay)
2026-03-25 17:30:46 | Two more GET /api/cards requests logged
2026-03-25 17:33:54 | Alert fired: avg response time elevated
2026-03-25 17:39:00 | SRE Agent investigation started

Evidence

Metrics (17:00–17:40 UTC)

Metric | Values | Assessment
Requests | 1 (17:00), 2 (17:30), 0 elsewhere | Sparse; each slow request dominates the average
CPU | ~1M nanocores avg (0.2% of 0.5 vCPU) | Normal; no resource starvation
Memory | ~263 MB avg (24.5% of 1 GiB) | Normal; no memory pressure
Restarts | 0 | No OOM or crash events
Replicas | 2 active | Scaling healthy

Console Logs

Two GET /api/cards invocations observed, zero errors:

2026-03-25T17:01:01.875Z  INFO  CreditCardService : Fetching all credit cards from H2 database
2026-03-25T17:30:46.587Z  INFO  CreditCardService : Fetching all credit cards from H2 database

No exceptions, no 5xx responses — consistent with artificial delay (not errors).

Root Cause

File: backend/src/main/java/com/threeriversbank/service/CreditCardService.java#L36

@Transactional(readOnly = true)
public List<CreditCardDto> getAllCreditCards() {
    log.info("Fetching all credit cards from H2 database");
    try { Thread.sleep((long)(Math.random() * 9000)); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    return creditCardRepository.findAll().stream()
            .map(this::convertToDto)
            .collect(Collectors.toList());
}

Thread.sleep((long)(Math.random() * 9000)) injects a random 0–9 second delay on every call to getAllCreditCards(). This is a chaos engineering artifact that was not reverted after testing.

Why CPU/Memory look normal

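The Thread.sleep call blocks the request thread without consuming CPU cycles. The thread is simply parked for the duration of the sleep, so CPU and memory metrics stay flat even while response times climb. Flat resource metrics alongside high latency point away from resource starvation and toward an explicit wait.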
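To illustrate the point, the sketch below compares wall-clock time with thread CPU time across a single sleep. It is a standalone demo rather than code from the service: the class name SleepCost is hypothetical, and it assumes the JVM supports per-thread CPU timing through the JDK's ThreadMXBean (where it does not, the CPU figure is reported as -1).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical demo class; not part of the banking backend.
final class SleepCost {

    /** Returns {wallNanos, cpuNanos} for one sleep; cpuNanos is -1 if unavailable. */
    static long[] measure(long sleepMillis) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = threads.isCurrentThreadCpuTimeSupported();

        long cpuBefore = cpuSupported ? threads.getCurrentThreadCpuTime() : -1L;
        long wallBefore = System.nanoTime();
        try {
            Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag, as the service code does.
            Thread.currentThread().interrupt();
        }
        long wallNanos = System.nanoTime() - wallBefore;
        long cpuAfter = cpuSupported ? threads.getCurrentThreadCpuTime() : -1L;

        // getCurrentThreadCpuTime() returns -1 when CPU timing is disabled.
        long cpuNanos = (cpuBefore < 0 || cpuAfter < 0) ? -1L : cpuAfter - cpuBefore;
        return new long[] { wallNanos, cpuNanos };
    }

    public static void main(String[] args) {
        long[] result = measure(500);
        System.out.println("wall-clock nanos: " + result[0]);
        System.out.println("thread CPU nanos: " + result[1]);
    }
}
```

On a typical JVM the wall-clock figure should be close to the requested sleep, while the CPU figure should be a small fraction of it, which matches the flat CPU metric seen during the incident.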

Why sparse traffic amplifies the alert

With only 1–2 requests per 5-minute window, a single 8-second sleep dominates the average response time, easily exceeding the alert threshold.
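The effect of sparse traffic can be sketched with a small simulation. The code below draws delays with the same expression as the injected line and averages them over a window; the class name WindowAverage, the seed parameter, and the window sizes are assumptions chosen for illustration, and the actual alert threshold is not modeled because it is not stated in this report.

```java
import java.util.Random;

// Hypothetical demo class; models only the injected delay, not real request time.
final class WindowAverage {

    /** Mean injected delay in ms over a window of `requests` calls. */
    static double meanDelayMillis(int requests, long seed) {
        if (requests <= 0) {
            throw new IllegalArgumentException("requests must be positive");
        }
        Random rng = new Random(seed);
        double total = 0;
        for (int i = 0; i < requests; i++) {
            // Same shape as the injected line: uniform delay in [0, 9000) ms.
            total += (long) (rng.nextDouble() * 9000);
        }
        return total / requests;
    }

    public static void main(String[] args) {
        // A one-request window can land anywhere between 0 and 9 seconds.
        for (long seed = 1; seed <= 5; seed++) {
            System.out.println("1 request, seed " + seed + ": "
                    + meanDelayMillis(1, seed) + " ms");
        }
        // A busy window settles near the expected value of about 4.5 seconds.
        System.out.println("100000 requests: " + meanDelayMillis(100_000, 42L) + " ms");
    }
}
```

With one or two requests per window the average is effectively a single random draw, so it can swing close to 9 seconds; with many requests it converges toward roughly 4.5 seconds.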

Remediation

Immediate Fix

Remove the Thread.sleep line from CreditCardService.java:

@Transactional(readOnly = true)
public List<CreditCardDto> getAllCreditCards() {
    log.info("Fetching all credit cards from H2 database");
    return creditCardRepository.findAll().stream()
            .map(this::convertToDto)
            .collect(Collectors.toList());
}

Preventive Measures

  1. Add a CI/CD check: Lint or grep for Thread.sleep in non-test source files to prevent chaos code from reaching production
  2. Feature flag chaos injections: Use a runtime-configurable flag (e.g., env var CHAOS_ENABLED=false) instead of hardcoded delays
  3. Pre-merge code review gate: Flag any Thread.sleep in service layer code during PR review
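The feature-flag idea in item 2 could look roughly like the sketch below. It is one possible shape, not the project's implementation: the ChaosDelay class and its method names are hypothetical, CHAOS_ENABLED is the example variable name from the list above, and the environment is passed in as a map so the behavior can be checked without changing the real process environment.

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper; names are illustrative and not taken from the codebase.
final class ChaosDelay {

    /** Chaos is off unless CHAOS_ENABLED is explicitly set to "true". */
    static boolean isEnabled(Map<String, String> env) {
        return Boolean.parseBoolean(env.getOrDefault("CHAOS_ENABLED", "false"));
    }

    /** Sleeps for a random time below maxMillis when enabled; returns the delay chosen. */
    static long maybeDelay(Map<String, String> env, long maxMillis) {
        if (!isEnabled(env) || maxMillis <= 0) {
            return 0L; // default path: no artificial latency
        }
        long delay = ThreadLocalRandom.current().nextLong(maxMillis);
        try {
            Thread.sleep(delay);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return delay;
    }

    public static void main(String[] args) {
        // Reads the real environment; prints 0 unless CHAOS_ENABLED=true is set.
        System.out.println("delayed by " + maybeDelay(System.getenv(), 9000) + " ms");
    }
}
```

Defaulting to disabled means a forgotten flag leaves production unaffected, which is the opposite failure mode from a hardcoded sleep that has to be remembered and removed.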

Action Items

  • Remove Thread.sleep from CreditCardService.java:36
  • Deploy updated backend image
  • Add CI check to prevent Thread.sleep in production code paths
  • Consider implementing chaos injection via feature flags for future testing
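As a starting point for the CI check, the sketch below scans a source tree for Thread.sleep calls and exits non-zero when any are found. The SleepScanner class is hypothetical, the default root backend/src/main/java is inferred from the file path cited under Root Cause, and the match is a plain substring test, so occurrences inside comments or string literals are flagged too.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical CI helper; a naive text scan, not a real Java parser.
final class SleepScanner {

    /** Returns "file:line" for every line that contains a Thread.sleep call. */
    static List<String> scanLines(String fileName, List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains("Thread.sleep(")) {
                hits.add(fileName + ":" + (i + 1)); // 1-based line numbers
            }
        }
        return hits;
    }

    /** Walks root and scans every .java file beneath it. */
    static List<String> scanTree(Path root) {
        List<String> hits = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(root)) {
            List<Path> javaFiles = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".java"))
                    .sorted()
                    .collect(Collectors.toList());
            for (Path file : javaFiles) {
                hits.addAll(scanLines(file.toString(), Files.readAllLines(file)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Pointing at src/main keeps test sources out of scope.
        Path root = Paths.get(args.length > 0 ? args[0] : "backend/src/main/java");
        List<String> hits = scanTree(root);
        hits.forEach(h -> System.err.println("Thread.sleep found at " + h));
        if (!hits.isEmpty()) {
            System.exit(1); // non-zero exit fails the CI step
        }
    }
}
```

Run against the current revision, a check like this would be expected to report CreditCardService.java line 36 and fail the build until the line is removed.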

Related Issues


This issue was created by sre-sre-three-rivers-bswqe--b2b14894
Tracked by the SRE agent here