
feat: add Prometheus Pushgateway support for CLI apps #3176

Open
coolwednesday wants to merge 3 commits into gofr-dev:development from coolwednesday:feature/metrics-pushgateway-cli

Conversation

@coolwednesday
Member

@coolwednesday coolwednesday commented Mar 17, 2026

Summary

CLI applications are short-lived — they exit before Prometheus can scrape /metrics. This PR adds push-based metrics export via Prometheus Pushgateway for GoFr CLI apps with cumulative counters across runs.

Closes #2232

Problem

Each CLI run starts counters at 0. A plain Pushgateway overwrites on each push, so counters never accumulate (run1=1, run2=1, run3=1 instead of 1, 2, 3). Gauges like last_success_timestamp must not be summed.

Solution: Read-Modify-Write

Instead of using an aggregation gateway, we implement read-modify-write on the standard Pushgateway:

CLI Run N:
  1. GET /metrics from Pushgateway (fetch existing values)
  2. Gather local metrics from this run
  3. Merge:
     - Counters/Histograms → sum existing + local
     - Gauges → use local value (latest wins)
  4. PUT merged result back to Pushgateway
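The merge step above can be sketched in a few lines of Go. This is a simplified illustration, not the PR's mergeMetrics(): the Kind and Sample types are hypothetical stand-ins for the Prometheus metric families, histograms are left out, and carrying over series that the current run did not touch is an assumption made here because a PUT replaces the whole group.

```go
package main

import "fmt"

// Kind is a simplified stand-in for the Prometheus metric type (hypothetical).
type Kind int

const (
	Counter Kind = iota
	Gauge
)

// Sample is a hypothetical, flattened view of one metric series.
type Sample struct {
	Kind  Kind
	Value float64
}

// merge combines what the Pushgateway already holds with this run's values.
// Map keys identify a series (metric name plus app-defined labels).
func merge(existing, local map[string]Sample) map[string]Sample {
	out := make(map[string]Sample, len(existing)+len(local))
	for k, s := range existing {
		out[k] = s // assumption: series untouched by this run are carried over
	}
	for k, s := range local {
		if prev, ok := out[k]; ok && s.Kind == Counter && prev.Kind == Counter {
			s.Value += prev.Value // counters accumulate across runs
		}
		out[k] = s // gauges: the local (latest) value wins
	}
	return out
}

func main() {
	existing := map[string]Sample{
		"app_cmd_success":                {Kind: Counter, Value: 2},
		"app_cmd_last_success_timestamp": {Kind: Gauge, Value: 1000},
	}
	local := map[string]Sample{
		"app_cmd_success":                {Kind: Counter, Value: 1},
		"app_cmd_last_success_timestamp": {Kind: Gauge, Value: 2000},
	}
	merged := merge(existing, local)
	fmt.Println(merged["app_cmd_success"].Value)                // counter: 2 + 1
	fmt.Println(merged["app_cmd_last_success_timestamp"].Value) // gauge: local value
}
```

A real histogram merge would additionally sum the per-bucket counts, the sample sum, and the sample count for series with identical bucket boundaries.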

Why not an aggregation gateway?

  • Zapier prom-aggregation-gateway: Sums ALL metric types including gauges — last_success_timestamp would produce nonsensical values (timestamp1 + timestamp2)
  • Prometheus Gravel Gateway: Supports per-type aggregation via clearmode label, but dormant since Nov 2023, single maintainer, no prebuilt Docker images, only ~117 stars

Read-modify-write gives correct semantics with zero additional infrastructure. The trade-off is a small race window when concurrent CLI runs overlap, but this is unlikely for CLI workloads (worst case: one lost increment).

What's included

  • Read-modify-write Pushgateway client (pkg/gofr/metrics/exporters/pushgateway.go):
    • Custom HTTP client using expfmt for Prometheus text format encoding/decoding
    • mergeMetrics() — sums counters/histograms, replaces gauges
    • labelKey() — matches metrics by app-defined labels only, filtering out Pushgateway-injected (job, instance) and OTel scope labels
  • Auto CLI metrics in cmd.go:
    • app_cmd_duration_seconds (histogram with CLI-appropriate buckets)
    • app_cmd_success (counter, cumulative across runs)
    • app_cmd_failures (counter, cumulative across runs)
    • app_cmd_last_success_timestamp (gauge, latest value)
  • CLI shutdown path in run.go: Calls Shutdown() after cmd.Run() to flush metrics
  • Config-driven: Set METRICS_PUSH_GATEWAY_URL env var to enable (CLI only)
  • Enriched CMD Metrics dashboard merged into the main GoFr Application Services Monitoring dashboard:
    • Health Overview: jobs tracked, last push age, total successes/failures, success rate
    • Per-Job Breakdown: table with merge transform (successes, failures, p95 duration, last success per job×command)
    • Duration Analysis: bar chart with p50/p90/p95/p99 percentiles, gauge panel with thresholds
    • $job and $command template variables for CLI filtering
  • Pushgateway added to http-server docker setup: docker-compose service + Prometheus scrape config with honor_labels: true
  • sample-cmd README expanded with setup instructions and GitHub links to shared docker/dashboard setup
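To make the label matching described for labelKey() concrete, here is a minimal sketch of how such a key could be built. It is not the code from pushgateway.go: the function signature, the key format, and the exact set of filtered labels (in particular the otel_scope_* names) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// injected lists labels assumed to come from the Pushgateway or the OTel
// exporter rather than from the application (the set is an assumption).
var injected = map[string]bool{
	"job":                true,
	"instance":           true,
	"otel_scope_name":    true,
	"otel_scope_version": true,
}

// labelKey builds a stable identity for a series from its name and its
// app-defined labels only, so a locally gathered series matches the copy
// read back from the Pushgateway even though the latter carries extra labels.
func labelKey(name string, labels map[string]string) string {
	parts := make([]string, 0, len(labels))
	for k, v := range labels {
		if injected[k] {
			continue // skip labels the application did not define
		}
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts) // map iteration order is random; sort for stability
	return name + "{" + strings.Join(parts, ",") + "}"
}

func main() {
	fromGateway := map[string]string{"command": "hello", "job": "sample-cmd", "instance": ""}
	fromLocal := map[string]string{"command": "hello"}
	fmt.Println(labelKey("app_cmd_success", fromGateway) == labelKey("app_cmd_success", fromLocal))
}
```

Without the filtering, the series read back from the Pushgateway would never match its local counterpart, and the counters would be pushed side by side instead of being summed.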

Design decisions

  • Pushgateway is wired in NewCMD() only — HTTP apps continue using pull-based scraping
  • Container owns the pushgateway and flushes on Close()
  • Dropped prometheus/push dependency — raw HTTP with expfmt gives full control over the read-modify-write cycle
  • Uses dedicated AppRegistry (not DefaultGatherer) to avoid pushing Go runtime metrics
  • Dashboard uses ${DataSource} variable and collapsed row — non-intrusive for HTTP-only users

Test plan

  • go build ./... compiles
  • go test ./pkg/gofr/metrics/exporters/ — 18 tests covering all merge logic, label filtering, error paths
  • go test ./pkg/gofr/container/ ./pkg/gofr/ — existing tests pass
  • golangci-lint run clean
  • go vet -race clean on our packages
  • Docker smoke test: run hello×6, fail×4, batch×1, progress×1 → counters accumulated correctly, gauge shows latest timestamp, histogram buckets merged
  • Grafana dashboard verified: all panels populated, merge transform working, otel labels hidden

CLI apps are short-lived and exit before Prometheus can scrape /metrics.
This adds push-based metrics export via Pushgateway, configured through
METRICS_PUSH_GATEWAY_URL env var, along with auto CLI metrics tracking
(duration, success/error counters) and observability infrastructure.

Closes gofr-dev#2232
Member

@Umang01-hash Umang01-hash left a comment

  1. Issue #2232 explicitly listed "Support cleanup (optional) so old metrics don't pile up" as a requirement. Every CronJob run permanently adds a job group to the Pushgateway. Please add a Delete(ctx context.Context) error method on PushGateway using pusher.DeleteContext(ctx), plus a METRICS_PUSH_GATEWAY_DELETE_ON_FINISH=true env var to opt in.

  2. All apps without APP_NAME set push under the same job group and silently overwrite each other. Change the fallback to filepath.Base(os.Args[0]) or add a dedicated METRICS_PUSH_GATEWAY_JOB env var override.

  3. Current max bucket is 60s. Cron buckets extend to 3600s. A 5-minute batch job falls into +Inf only. Align upper boundary with app_cron_duration_seconds.

  4. Metric naming inconsistency with cron:
    app_cmd_errors_total → app_cmd_failures (match cron's _failures)
    app_cmd_success_total → app_cmd_success (match cron's no-_total)
    Add app_cmd_total (match cron's app_cron_job_total)

  5. Move metricServer.Shutdown(ctx) before container.Close() in Shutdown() so the Prometheus scrape endpoint stops accepting requests before the OTel meter provider is shut down.
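As an illustration of point 2, job-name resolution could look roughly like the sketch below. The review offers the binary-name fallback and the METRICS_PUSH_GATEWAY_JOB override as alternatives; combining them, and the precedence order used here, are assumptions, and jobName is a hypothetical helper rather than code from the PR.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// jobName picks the Pushgateway job label. Assumed precedence:
// explicit override, then APP_NAME, then the name of the running binary.
func jobName(getenv func(string) string, argv0 string) string {
	if v := getenv("METRICS_PUSH_GATEWAY_JOB"); v != "" {
		return v
	}
	if v := getenv("APP_NAME"); v != "" {
		return v
	}
	// Two different binaries without APP_NAME now land in different
	// job groups instead of silently overwriting each other.
	return filepath.Base(argv0)
}

func main() {
	fmt.Println(jobName(os.Getenv, os.Args[0]))
}
```

Passing getenv as a parameter keeps the function easy to test without mutating the process environment.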

@coolwednesday
Member Author

Regarding Comment 1 (Delete support / METRICS_PUSH_GATEWAY_DELETE_ON_FINISH):

The Pushgateway documentation explicitly states that the Pushgateway is designed as a metric cache — the standard recommendation is to not delete pushed metrics, and instead use job and instance labels to distinguish runs.

If you push and immediately delete, Prometheus may not have scraped yet (typical scrape interval is 15–30s), and the metrics are lost forever. There's no reliable way for the CLI to know whether Prometheus has completed its scrape before issuing a delete.

For users who need cleanup of stale metrics, this is best handled at the Pushgateway operational level (e.g., external cron jobs that prune old job groups via the Pushgateway's DELETE API; the Pushgateway itself has no built-in TTL), not from the framework level. Baking delete into the framework adds a footgun that's hard to use safely by default.

This can always be revisited in a follow-up if users explicitly request it, but for v1 the "push and leave" approach is the correct and safe default.

@coolwednesday
Member Author

Regarding Comment 5 (Shutdown order — move metricServer.Shutdown before container.Close):

The current shutdown order is actually correct:

httpServer.Shutdown → grpcServer.Shutdown → container.Close() → metricServer.Shutdown

The /metrics HTTP endpoint should stay alive as long as possible so Prometheus can scrape final metrics. Shutting it down earlier would mean Prometheus misses the last scrape window.

For the Pushgateway path specifically, the push happens inside container.Close() before the meter provider shuts down — which is the right sequence (push metrics first, then tear down the provider).

@coolwednesday
Member Author

Regarding Comment 8 (Factory.go test coverage):

The new pushgateway wiring in factory.go is 4 lines of config-read + constructor call. The core logic (NewPushGateway, Push) is already covered in pushgateway_test.go. Writing a proper test for the factory wiring requires heavy config mocking for minimal additional coverage. Deferring this to a follow-up PR.

- Replace basic CMD Metrics panels with enriched CLI dashboard
  (health overview, job status table, duration bar chart with p50-p99)
- Add pushgateway service to http-server docker-compose
- Add pushgateway scrape config with honor_labels
- Add $job and $command template variables for CLI filtering
- Expand sample-cmd README with setup instructions
@coolwednesday
Member Author

Here is a screenshot of the CLI dashboard:
[Screenshot of the CLI dashboard, 2026-03-23]

Development

Successfully merging this pull request may close these issues.

Support emitting metrics from CLI applications

2 participants