Description
Problem
Prometheus recorded no data during our 5-minute API outage on November 5th, 2025, from 12:01 PM to 12:06 PM IST. We couldn't analyze the incident because our primary monitoring system was blind during the exact window we needed it most.
Context:
When: November 5th, 2025, 12:01 PM to 12:06 PM IST
What: API outage caused by request spike
Impact: No metrics available to determine:
Request volume/patterns
Endpoint breakdown (SDK vs admin)
Questions to Answer
Why did Prometheus stop collecting data during this period?
Was it overwhelmed by the request spike?
Was the scrape endpoint timing out?
Did Prometheus itself become unhealthy?
Is there a retention/storage issue?
Are we hitting cardinality limits?
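As a starting point for these questions, here is a minimal sketch that builds Prometheus HTTP API `query_range` URLs for the outage window. It relies on the metrics Prometheus records automatically for every scrape (`up`, `scrape_duration_seconds`, `scrape_samples_scraped`) plus `prometheus_tsdb_head_series`, which is only available if Prometheus scrapes itself. The `job="api"` label, the base URL, the 10-minute padding, and the 15s step are assumptions for illustration and need to be replaced with the values from our setup.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# IST is UTC+05:30 with no daylight saving time.
IST = timezone(timedelta(hours=5, minutes=30))

# Outage window taken from this issue.
OUTAGE_START = datetime(2025, 11, 5, 12, 1, tzinfo=IST)
OUTAGE_END = datetime(2025, 11, 5, 12, 6, tzinfo=IST)

# Assumption: pad the window so the edges of the gap are visible.
PADDING = timedelta(minutes=10)

# One PromQL expression per question. job="api" is a hypothetical label.
DIAGNOSTIC_QUERIES = {
    # 0 means the scrape failed; no samples at all means nothing was recorded.
    "target_up": 'up{job="api"}',
    # Compare against the configured scrape_timeout.
    "scrape_duration": 'scrape_duration_seconds{job="api"}',
    # Samples exposed per scrape, a rough proxy for cardinality growth.
    "samples_scraped": 'scrape_samples_scraped{job="api"}',
    # Active series in the TSDB head (requires Prometheus to scrape itself).
    "head_series": "prometheus_tsdb_head_series",
}


def query_range_url(base_url, promql, start, end, step="15s"):
    """Build a /api/v1/query_range URL for one PromQL expression."""
    params = {
        "query": promql,
        "start": f"{start.timestamp():.0f}",  # Unix seconds
        "end": f"{end.timestamp():.0f}",
        "step": step,
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


def diagnostic_urls(base_url):
    """Return one padded-window query URL per diagnostic question."""
    start = OUTAGE_START - PADDING
    end = OUTAGE_END + PADDING
    return {
        name: query_range_url(base_url, promql, start, end)
        for name, promql in DIAGNOSTIC_QUERIES.items()
    }


if __name__ == "__main__":
    # Assumption: Prometheus is reachable at this address.
    for name, url in diagnostic_urls("http://localhost:9090").items():
        print(f"{name}: {url}")
```

How to read the results: if `up` has samples with value 0 in the window, Prometheus was running but the scrape failed, which points at the target or a scrape timeout. If `up` and `prometheus_tsdb_head_series` both have no samples at all, Prometheus itself was likely down or unable to write.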
Success Criteria
Root cause identified and a fix implemented so that Prometheus keeps collecting metrics during high-load incidents.