[SaaS] Prometheus Missing Data During API Outage #6250

@gagantrivedi

Problem
Prometheus had zero data during our 5-minute API outage on November 5th, 2025, from 12:01 PM to 12:06 PM IST. We couldn't analyze the incident because our primary monitoring system was blind during the exact window when we needed it most.

Context:
When: November 5th, 2025, 12:01 PM to 12:06 PM IST
What: API outage caused by a request spike
Impact: No metrics available to determine:

Request volume/patterns
Endpoint breakdown (SDK vs admin)

Questions to Answer

Why did Prometheus stop collecting data during this period?
Was it overwhelmed by the request spike?
Was the scrape endpoint timing out?
Did Prometheus itself become unhealthy?
Is there a retention/storage issue?
Are we hitting cardinality limits?
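
As a starting point for the questions above, here is a minimal sketch of how the raw scrape samples around the outage window could be pulled and checked for gaps. It assumes the Prometheus HTTP API is reachable; the base URL `http://prometheus.example:9090`, the 15-second scrape interval, and the helper names `raw_samples_url` and `find_gaps` are hypothetical and not taken from our setup. `up`, `scrape_duration_seconds`, and `scrape_samples_scraped` are the per-target series Prometheus records on every scrape. An instant query with a range selector is used so that raw sample timestamps come back rather than step-aligned values.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

IST = timezone(timedelta(hours=5, minutes=30))

# Outage window as reported in this issue.
WINDOW_START = datetime(2025, 11, 5, 12, 1, tzinfo=IST)
WINDOW_END = datetime(2025, 11, 5, 12, 6, tzinfo=IST)

# Series that Prometheus generates for every scrape of every target.
SCRAPE_METRICS = ["up", "scrape_duration_seconds", "scrape_samples_scraped"]


def raw_samples_url(base_url, metric, start, end, padding=timedelta(minutes=5)):
    """Build a /api/v1/query URL that returns raw samples for `metric`.

    The range selector covers the window plus `padding` on both sides,
    evaluated at `end + padding`.
    """
    span_seconds = int((end - start + 2 * padding).total_seconds())
    params = {
        "query": f"{metric}[{span_seconds}s]",
        "time": (end + padding).timestamp(),
    }
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (previous, next) timestamp pairs spaced further apart than
    `expected_interval * tolerance`, i.e. places where scrapes are missing."""
    ordered = sorted(timestamps)
    return [
        (prev, nxt)
        for prev, nxt in zip(ordered, ordered[1:])
        if nxt - prev > expected_interval * tolerance
    ]


if __name__ == "__main__":
    # Hypothetical base URL; replace with the real Prometheus address.
    for metric in SCRAPE_METRICS:
        print(raw_samples_url("http://prometheus.example:9090", metric,
                              WINDOW_START, WINDOW_END))
```

Feeding the sample timestamps from each response into `find_gaps` with the configured scrape interval would show whether samples are absent altogether, which points at Prometheus or its scrape loop, or present with `up` equal to 0 or a high `scrape_duration_seconds`, which points at the scrape endpoint failing or timing out.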

Success Criteria
Root cause identified and a fix implemented so that Prometheus keeps collecting data during high-load incidents.
