
feat: implement real-time prometheus metrics for experiment pods #791

Open
mihir-dixit2k27 wants to merge 2 commits into litmuschaos:master from mihir-dixit2k27:feat/prometheus-metrics-pod-delete


@mihir-dixit2k27

This PR implements real-time Prometheus metrics for chaos experiment pods, addressing the existing TODO in pkg/telemetry/otel.go.

Currently, observability is limited to post-experiment ChaosResult CRDs. This change lets SREs monitor chaos injection status and impact (e.g., litmuschaos_experiment_injection_count) in real time via Grafana while the experiment is running.

Key Changes:

  • Telemetry: Added a Prometheus exporter in pkg/telemetry/otel.go that initializes an HTTP server on port 8080.
  • Instrumentation: Instrumented the pod-delete experiment (chaoslib/litmus/pod-delete/lib/pod-delete.go) as the pilot implementation.
  • Dependencies: Updated go.mod and go.sum to include the OpenTelemetry Prometheus exporter.
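To illustrate the shape of what a metrics endpoint on port 8080 serves, here is a minimal standard-library sketch. It is not the code in this PR: the PR uses the OpenTelemetry Prometheus exporter, whereas this stand-in writes the Prometheus text exposition format by hand. The names injectionCount, metricsHandler, newMetricsMux, and scrapeOnce are hypothetical, and the /metrics path is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// injectionCount is a hypothetical stand-in for the counter behind
// litmuschaos_experiment_injection_count.
var injectionCount atomic.Int64

// metricsHandler renders the counter in the Prometheus text exposition format.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintln(w, "# TYPE litmuschaos_experiment_injection_count counter")
	fmt.Fprintf(w, "litmuschaos_experiment_injection_count %d\n", injectionCount.Load())
}

// newMetricsMux registers the handler on /metrics (the path is an assumption).
func newMetricsMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", metricsHandler)
	return mux
}

// scrapeOnce performs an in-process scrape so the sketch runs without
// binding a real port.
func scrapeOnce() string {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	newMetricsMux().ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	injectionCount.Add(1) // would be incremented once per injected pod deletion
	fmt.Print(scrapeOnce())
	// Inside an experiment pod the mux would instead be served with
	// http.ListenAndServe(":8080", newMetricsMux()).
}
```

Keeping the handler separate from the listener makes the exposition output checkable in a unit test without opening a socket.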

I have included a 5-second graceful shutdown delay at the end of the experiment logic. This ensures that Prometheus has a sufficient window to scrape the final "Verdict" and "Status" metrics before the pod is evicted and the process terminates.

Signed-off-by: Mihir Dixit <dixitmihir1@gmail.com>
@mihir-dixit2k27
Author

Thanks for the review!

  1. Go Version & Compatibility: I have updated go.mod to use Go 1.22.0 and run go mod tidy to ensure compatibility with the rest of the stack (Chaos Operator/Runner).

  2. Verification (metrics output): Since a full Prometheus UI isn't available in this local environment, I validated the exporter by calling InitMetrics and querying the localhost endpoint.

Raw Output Verification: The screenshots below confirm the endpoint is active and serving OpenTelemetry-compliant metrics.

Runtime & Process Metrics:
image
image

OpenTelemetry Metadata (Target Info):
image

Key Metrics Observed:

Runtime Stats: go_goroutines and go_memstats_* are being collected, confirming that the Go runtime hook is active.

OTel Identification: The target_info metric (visible in the last screenshot) carries telemetry_sdk_name="opentelemetry", which indicates that the OTel SDK is registered.
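The same check can be made repeatable without screenshots. The sketch below extracts metric names from Prometheus text-format output, so a test can assert that specific metrics are present. The helper metricNames is hypothetical, and the sample payload in main is made up for illustration; it is not the actual output captured for this PR.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// metricNames returns the set of metric names that have at least one sample
// line in Prometheus text-format output. Comment lines (# HELP, # TYPE) and
// blank lines are skipped.
func metricNames(exposition string) map[string]bool {
	names := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' (labels) or space (value).
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			end = len(line)
		}
		names[line[:end]] = true
	}
	return names
}

func main() {
	// Illustrative payload only; the values are invented.
	sample := "# TYPE go_goroutines gauge\n" +
		"go_goroutines 7\n" +
		"target_info{telemetry_sdk_name=\"opentelemetry\"} 1\n"

	got := metricNames(sample)
	for _, name := range []string{"go_goroutines", "target_info", "litmuschaos_experiment_injection_count"} {
		fmt.Printf("%s present: %v\n", name, got[name])
	}
}
```

Pointing the helper at the body returned by the localhost endpoint would also show whether the experiment-level metric (litmuschaos_experiment_injection_count) is exported alongside the runtime and target_info metrics.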

Ready for the next round of review!
