[SaaS]Prometheus Missing Data During API Outage

Problem
Prometheus had zero data during our 5-minute API outage on November 5th, 2025 from 12:01PM - 12:06PM IST. We couldn't analyze the incident because our primary monitoring system was blind during the exact window we needed it most.

Context:
When: November 5th, 2025 12:01PM - 12:06PM IST
What: API outage caused by request spike
Impact: No metrics available to determine request volume/patterns or the endpoint breakdown (SDK vs admin).

Questions to Answer

Why did Prometheus stop collecting data during this period?
Was it overwhelmed by the request spike?
Was the scrape endpoint timing out?
Did Prometheus itself become unhealthy?
Is there a retention/storage issue?
Are we hitting cardinality limits?
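
A possible starting point for the questions above, as a rough sketch rather than a confirmed procedure: pull the raw samples of Prometheus's built-in per-target series around the window and look for holes. The sketch assumes "IST" means India Standard Time (UTC+05:30), a 15-second scrape interval, and a hypothetical server address `http://prometheus.example.internal:9090`; none of these are stated in this issue. The helper names `raw_samples_url` and `find_gaps` are made up for illustration. It uses an instant query with a range selector (e.g. `up[30m]`) against `/api/v1/query`, which returns raw stored samples rather than step-aligned values.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Assumption: "IST" in this issue means India Standard Time (UTC+05:30).
IST = timezone(timedelta(hours=5, minutes=30))
WINDOW_START = datetime(2025, 11, 5, 12, 1, tzinfo=IST)
WINDOW_END = datetime(2025, 11, 5, 12, 6, tzinfo=IST)

# Series that Prometheus records on its own, independent of our app metrics.
DIAGNOSTIC_SERIES = [
    "up",                           # 1 if the scrape succeeded, 0 if it failed
    "scrape_duration_seconds",      # how long each scrape took
    "scrape_samples_scraped",       # samples exposed by the target per scrape
    "prometheus_tsdb_head_series",  # active series count (needs self-scraping)
]


def raw_samples_url(base_url, series, end, lookback_minutes=30):
    """Build an instant-query URL that returns raw samples for `series`
    over the `lookback_minutes` before `end`."""
    params = urlencode({
        "query": f"{series}[{lookback_minutes}m]",
        "time": end.timestamp(),
    })
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def find_gaps(timestamps, scrape_interval=15.0, tolerance=1.5):
    """Return (last_seen, next_seen) pairs where consecutive samples are
    further apart than `scrape_interval * tolerance` seconds."""
    ts = sorted(timestamps)
    limit = scrape_interval * tolerance
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > limit]


if __name__ == "__main__":
    # Hypothetical address; replace with the real Prometheus server.
    base = "http://prometheus.example.internal:9090"
    query_end = WINDOW_END + timedelta(minutes=10)
    for name in DIAGNOSTIC_SERIES:
        print(raw_samples_url(base, name, query_end))
```

How to read the results, roughly: if `up` has samples with value 0 inside the window, Prometheus was running but the scrapes failed (for example, the scrape endpoint timed out or a sample limit was exceeded). If `up` has no samples at all in the window, including for Prometheus's own self-scrape job if one is configured, that points toward Prometheus itself being unhealthy or restarted. Feeding the sample timestamps from each response into `find_gaps` gives the exact boundaries of the hole.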


Success Criteria
Root cause identified + fix implemented to ensure Prometheus remains operational during high-load incidents.

[SaaS]Prometheus Missing Data During API Outage #6250

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[SaaS]Prometheus Missing Data During API Outage #6250

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions