Skip to content

prom push-gateway support#1494

Merged
akshaydeo merged 1 commit intomainfrom
01-30-prom_push-gateway_support
Feb 12, 2026
Merged

prom push-gateway support#1494
akshaydeo merged 1 commit intomainfrom
01-30-prom_push-gateway_support

Conversation

@akshaydeo
Copy link
Contributor

@akshaydeo akshaydeo commented Jan 30, 2026

Add push-based metrics support for multi-node deployments

This PR adds two key observability enhancements for multi-node Bifrost deployments:

  1. OTLP Metrics Push: Extends the OpenTelemetry plugin to support push-based metrics export via OTLP protocol
  2. Prometheus Push Gateway: Adds a new Prometheus plugin for pushing metrics to a Prometheus Push Gateway

Changes

  • Added metrics push capability to the OTEL plugin with configurable endpoint and push interval
  • Created a new Prometheus plugin for Push Gateway integration with basic auth support
  • Added UI components for configuring both push mechanisms
  • Updated documentation with detailed setup instructions for both approaches
  • Added dependency updates for OpenTelemetry SDK metrics packages

Type of change

  • Feature
  • Documentation

Affected areas

  • Core (Go)
  • Plugins
  • UI (Next.js)
  • Docs

How to test

Testing OTEL Metrics Push

# Configure OTEL plugin with metrics_enabled: true
curl -X PUT http://localhost:8080/api/plugins/otel -H "Content-Type: application/json" -d '{
  "enabled": true,
  "config": {
    "service_name": "bifrost",
    "collector_url": "http://otel-collector:4318/v1/traces",
    "protocol": "http",
    "metrics_enabled": true,
    "metrics_endpoint": "http://otel-collector:4318/v1/metrics",
    "metrics_push_interval": 15
  }
}'

# Verify metrics are being pushed to your OTEL collector

Testing Prometheus Push Gateway

# Start a Push Gateway
docker run -d -p 9091:9091 prom/pushgateway

# Configure Prometheus plugin
curl -X PUT http://localhost:8080/api/plugins/prometheus -H "Content-Type: application/json" -d '{
  "enabled": true,
  "config": {
    "push_gateway_url": "http://localhost:9091",
    "job_name": "bifrost",
    "push_interval": 15
  }
}'

# Verify metrics in Push Gateway UI at http://localhost:9091

Breaking changes

  • No

Security considerations

  • Basic authentication support added for Push Gateway connections
  • TLS support maintained for secure metrics transmission

Copy link
Contributor Author

akshaydeo commented Jan 30, 2026

This stack of pull requests is managed by Graphite. Learn more about stacking.

@github-actions
Copy link
Contributor

🧪 Test Suite Available

This PR can be tested by a repository admin.

Run tests for PR #1494

@akshaydeo akshaydeo marked this pull request as ready for review January 30, 2026 16:22
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 30, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds OTEL metrics push support (OTLP HTTP/gRPC) with TLS/CA and insecure options, a Prometheus Pushgateway push plugin, JSON schema and UI updates for OTEL TLS and metrics push, multiple go.mod dependency bumps, documentation for OTEL/Prometheus, and server wiring to load/start the Prometheus push plugin.

Changes

Cohort / File(s) Summary
Documentation
docs/features/observability/otel.mdx, docs/features/observability/prometheus.mdx
Adds "Metrics Push (Cluster Mode)" and a Prometheus guide covering push vs pull, Pushgateway config/examples, metrics catalog, collector snippet, and troubleshooting.
OTEL plugin (core)
plugins/otel/metrics.go, plugins/otel/main.go, plugins/otel/grpc.go, plugins/otel/http.go, plugins/otel/version
New MetricsConfig and MetricsExporter (lifecycle, record APIs, shutdown), OTLP HTTP/gRPC exporter creation with CA path validation and insecure/system-root TLS options, config fields added (insecure, tls_ca_cert, metrics_enabled, metrics_endpoint, metrics_push_interval), and version bump to 2.0.0.
Prometheus Pushgateway plugin
plugins/prometheus/main.go, plugins/prometheus/go.mod, plugins/prometheus/version
New PrometheusPushPlugin implementing Pushgateway config (push_gateway_url, job_name, instance_id, push_interval, basic_auth), registry integration, periodic push loop, Start/Cleanup lifecycle, and version file.
Transport & server wiring
transports/bifrost-http/server/plugins.go, transports/config.schema.json, transports/go.mod
Adds builtin prometheus plugin loading/start during bootstrap, extends transports JSON schema with OTEL metrics fields and a prometheus plugin branch, and updates transports go.mod with module require/replace and dependency bumps.
UI — Observability
ui/app/workspace/observability/fragments/otelFormFragment.tsx, ui/app/workspace/observability/fragments/prometheusFormFragment.tsx, ui/app/workspace/observability/views/plugins/prometheusView.tsx, ui/app/workspace/observability/views/observabilityView.tsx, ui/app/workspace/observability/views/plugins/otelView.tsx
OTEL form: TLS CA path, insecure toggle, metrics push enable/endpoint/interval with cross-field validation. New Prometheus form/view for pull and Pushgateway config, basic auth, instance id, save/reset flows, and integration into Observability UI.
Types & validation
ui/lib/types/schemas.ts
Adds OTEL TLS/metrics fields and protocol-aware URL/host:port validation; introduces prometheusConfigSchema and prometheusFormSchema with cross-field checks and exported types.
Dependency & module updates
plugins/otel/go.mod, framework/go.mod, transports/go.mod, plugins/*/go.mod, plugins/prometheus/go.mod
Adds/updates OpenTelemetry SDK/metric (v1.39.0), otlp proto (v1.9.0), google genproto bump, Prometheus client bumps, and introduces prometheus plugin module with replace directive; multiple go.mod indirect updates across plugins.
Server integration
transports/bifrost-http/server/plugins.go
During bootstrap, retrieves telemetry plugin registry, instantiates Prometheus push plugin with that registry, validates registry presence, and starts the push loop.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant OtelPlugin as OtelPlugin
    participant MetricsExp as MetricsExporter
    participant OTLP as OTLP Exporter
    participant Collector as OTLP Collector

    App->>OtelPlugin: Load config (metrics_enabled, endpoint, protocol, interval)
    OtelPlugin->>MetricsExp: NewMetricsExporter(ctx, config)
    MetricsExp->>MetricsExp: build resource & meter
    alt protocol == grpc
        MetricsExp->>OTLP: createGRPCExporter(endpoint, headers, TLS/insecure)
    else
        MetricsExp->>OTLP: createHTTPExporter(endpoint, headers, TLS/insecure)
    end
    OTLP-->>MetricsExp: exporter ready
    loop every PushInterval
        App->>MetricsExp: Record* calls (e.g., RecordHTTPRequest)
        MetricsExp->>OTLP: export metrics
        OTLP->>Collector: push OTLP metrics
    end
    App->>MetricsExp: Shutdown()
    MetricsExp->>OTLP: Shutdown exporter
Loading
sequenceDiagram
    participant App as Application
    participant PromPlugin as PrometheusPushPlugin
    participant Registry as Prometheus Registry
    participant PushGW as Pushgateway

    App->>PromPlugin: Init(config)
    App->>PromPlugin: SetRegistry(Registry)
    PromPlugin->>PromPlugin: Start() -> spawn pushLoop
    loop every PushInterval
        PromPlugin->>Registry: gather metrics
        PromPlugin->>PushGW: push metrics (job, instance grouping, basic auth)
        PushGW-->>PromPlugin: ACK / error
    end
    App->>PromPlugin: Cleanup()
    PromPlugin->>PushGW: final push
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped to the collector with metrics in tow,
CA checks done, intervals set, exporters all aglow,
Pushgateway listens while registries sing,
Counters and histograms take flight on a spring,
Tiny hops, big observability spring!

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 56.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'prom push-gateway support' is concise and directly refers to a real, significant part of the changeset—the addition of Prometheus Push Gateway plugin support. However, it does not capture the equally important OTLP Metrics Push feature that is also a primary focus of this PR.
Description check ✅ Passed The PR description is comprehensive and well-structured, covering the main objectives, changes, affected areas, testing instructions, breaking changes, and security considerations. All critical information is present and clearly articulated.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch 01-30-prom_push-gateway_support

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/features/observability/otel.mdx (1)

420-426: ⚠️ Potential issue | 🟡 Minor

Minor inconsistency in Grafana access instructions.

The Docker Compose maps Grafana to port 4000 ("4000:3000" at line 270), but the instructions say to access it at http://localhost:3000. This should be http://localhost:4000.

📝 Suggested fix
-Access Grafana at `http://localhost:3000` (default credentials: admin/admin)
+Access Grafana at `http://localhost:4000` (default credentials: admin/admin)
🤖 Fix all issues with AI agents
In `@plugins/otel/metrics.go`:
- Around line 24-32: Add an Insecure bool to the MetricsConfig struct and update
createGRPCExporter, createHTTPExporter, and the main.go config wiring: when
TLSCACert is provided load and use the custom CA; when TLSCACert is empty and
Insecure is false explicitly use system root CAs (for gRPC call
credentials.NewClientTLSFromCert(nil, "") and for HTTP set a tls.Config that
uses system roots); only use insecure.NewCredentials() (gRPC) or skip TLS (HTTP)
when Insecure is true; wire MetricsConfig.Insecure through main.go so the flag
is respected. Ensure all three locations (MetricsConfig, createGRPCExporter,
createHTTPExporter) reference the new field and follow the TLS priority: custom
CA > system roots > explicit Insecure.

In `@plugins/prometheus/go.mod`:
- Around line 5-8: Update the Prometheus client dependency in go.mod by changing
the require entry for github.com/prometheus/client_golang from v1.23.0 to
v1.23.2, then refresh modules (e.g., run go get
github.com/prometheus/client_golang@v1.23.2 and go mod tidy) so the lockfile and
transitive deps are updated; target the require line referencing
github.com/prometheus/client_golang.

In `@transports/config.schema.json`:
- Around line 1023-1047: Add a conditional requirement so that when
metrics_enabled is true the metrics_endpoint field is required: update the JSON
Schema around the metrics_* properties to include an "if" with properties {
"metrics_enabled": { "const": true } } and a corresponding "then" that sets
"required": ["metrics_endpoint"]; keep the existing oneOf for metrics_endpoint
(or optionally tighten it to enforce URI vs host:port). This change ensures
metrics_enabled and metrics_endpoint (the symbols to edit) are validated
together in the transports/config.schema.json schema.
🧹 Nitpick comments (8)
plugins/prometheus/main.go (3)

37-41: Consider supporting environment variable resolution for BasicAuth credentials.

The OTEL plugin supports env. prefix for header values to load secrets from environment variables. For consistency and to avoid hardcoding credentials in config files, consider adding similar support for BasicAuth.

Example approach
// In SetRegistry or Init, resolve env vars:
if p.config.BasicAuth != nil {
    if newValue, ok := strings.CutPrefix(p.config.BasicAuth.Username, "env."); ok {
        p.config.BasicAuth.Username = os.Getenv(newValue)
    }
    if newValue, ok := strings.CutPrefix(p.config.BasicAuth.Password, "env."); ok {
        p.config.BasicAuth.Password = os.Getenv(newValue)
    }
}

65-102: Accept parent context for lifecycle coordination.

The Init function creates its own context.Background() instead of accepting a parent context. This differs from the OTEL plugin pattern (see plugins/otel/main.go:85) and may cause issues with graceful shutdown coordination if the server context is cancelled.

Proposed fix
-func Init(config *Config, logger schemas.Logger) (*PrometheusPushPlugin, error) {
+func Init(ctx context.Context, config *Config, logger schemas.Logger) (*PrometheusPushPlugin, error) {
 	if config == nil {
 		return nil, fmt.Errorf("config is required")
 	}
 	// ... validation ...
 
-	ctx, cancel := context.WithCancel(context.Background())
+	pluginCtx, cancel := context.WithCancel(ctx)
 
 	plugin := &PrometheusPushPlugin{
 		config: config,
 		logger: logger,
-		ctx:    ctx,
+		ctx:    pluginCtx,
 		cancel: cancel,
 	}
 
 	return plugin, nil
 }

196-209: Consider resetting started flag in Cleanup for completeness.

After Cleanup, the started flag remains true, preventing the plugin from being restarted. While plugins typically aren't restarted, resetting the flag would make the lifecycle cleaner and prevent confusing behavior if Start is called after Cleanup.

Optional fix
 func (p *PrometheusPushPlugin) Cleanup() error {
 	p.mu.Lock()
 	if !p.started {
 		p.mu.Unlock()
 		return nil
 	}
+	p.started = false
 	p.mu.Unlock()
 
 	p.cancel()
 	p.wg.Wait()
 
 	p.logger.Info("prometheus push gateway plugin stopped")
 	return nil
 }
ui/app/workspace/observability/views/plugins/otelView.tsx (1)

16-16: Remove unused baseUrl variable.

The baseUrl variable was likely used for the removed metrics endpoint display. It's now unused and should be removed.

Proposed fix
 	const [updatePlugin, { isLoading: isUpdatingPlugin }] = useUpdatePluginMutation();
-	const baseUrl = `${window.location.protocol}//${window.location.host}`;
ui/app/workspace/observability/views/plugins/prometheusView.tsx (1)

31-70: Promise wrapper is unnecessary with RTK Query.

The handlePrometheusConfigSave function wraps updatePlugin in a manual Promise, but .unwrap() already returns a Promise. The wrapper adds complexity without benefit.

♻️ Simplified implementation
-  const handlePrometheusConfigSave = (config: PrometheusFormSchema): Promise<void> => {
-    return new Promise((resolve, reject) => {
-      // Transform the form data to the backend config format
-      const backendConfig: PrometheusConfig = {
-        push_gateway_url: config.prometheus_config.push_gateway_url,
-        job_name: config.prometheus_config.job_name,
-        instance_id: config.prometheus_config.instance_id || undefined,
-        push_interval: config.prometheus_config.push_interval,
-      }
-
-      // Add basic auth if enabled
-      if (config.prometheus_config.basic_auth_enabled &&
-          config.prometheus_config.basic_auth_username &&
-          config.prometheus_config.basic_auth_password) {
-        backendConfig.basic_auth = {
-          username: config.prometheus_config.basic_auth_username,
-          password: config.prometheus_config.basic_auth_password,
-        }
-      }
-
-      updatePlugin({
-        name: "prometheus",
-        data: {
-          enabled: config.enabled,
-          config: backendConfig,
-        },
-      })
-        .unwrap()
-        .then(() => {
-          resolve()
-          toast.success("Prometheus configuration updated successfully")
-        })
-        .catch((err) => {
-          toast.error("Failed to update Prometheus configuration", {
-            description: getErrorMessage(err),
-          })
-          reject(err)
-        })
-    })
-  }
+  const handlePrometheusConfigSave = async (config: PrometheusFormSchema): Promise<void> => {
+    // Transform the form data to the backend config format
+    const backendConfig: PrometheusConfig = {
+      push_gateway_url: config.prometheus_config.push_gateway_url,
+      job_name: config.prometheus_config.job_name,
+      instance_id: config.prometheus_config.instance_id || undefined,
+      push_interval: config.prometheus_config.push_interval,
+    }
+
+    // Add basic auth if enabled
+    if (config.prometheus_config.basic_auth_enabled &&
+        config.prometheus_config.basic_auth_username &&
+        config.prometheus_config.basic_auth_password) {
+      backendConfig.basic_auth = {
+        username: config.prometheus_config.basic_auth_username,
+        password: config.prometheus_config.basic_auth_password,
+      }
+    }
+
+    try {
+      await updatePlugin({
+        name: "prometheus",
+        data: {
+          enabled: config.enabled,
+          config: backendConfig,
+        },
+      }).unwrap()
+      toast.success("Prometheus configuration updated successfully")
+    } catch (err) {
+      toast.error("Failed to update Prometheus configuration", {
+        description: getErrorMessage(err),
+      })
+      throw err
+    }
+  }
ui/app/workspace/observability/fragments/otelFormFragment.tsx (1)

63-78: Potential redundant validation triggers in useEffects.

The two useEffect hooks have overlapping concerns. The first hook (lines 64-71) already handles metricsEnabled in its dependency array and triggers validation when it changes. The second hook (lines 73-78) duplicates this behavior.

Additionally, the second useEffect is missing form in its dependency array (ESLint would flag this), and initialConfig check is missing which could cause validation on mount before config is loaded.

♻️ Consider consolidating validation logic
-	useEffect(() => {
-		if (initialConfig === undefined || initialConfig === null || (initialConfig.enabled ?? false) === false) return;
-		form.trigger("otel_config.collector_url");
-		// Also re-validate metrics_endpoint when protocol changes
-		if (metricsEnabled) {
-			form.trigger("otel_config.metrics_endpoint");
-		}
-	}, [protocol, form, metricsEnabled]);
-
-	// Re-run validation on metrics_endpoint when metrics_enabled changes
-	useEffect(() => {
-		if (metricsEnabled) {
-			form.trigger("otel_config.metrics_endpoint");
-		}
-	}, [metricsEnabled, form]);
+	useEffect(() => {
+		if (initialConfig === undefined || initialConfig === null || (initialConfig.enabled ?? false) === false) return;
+		form.trigger("otel_config.collector_url");
+		if (metricsEnabled) {
+			form.trigger("otel_config.metrics_endpoint");
+		}
+	}, [protocol, form, metricsEnabled, initialConfig]);
ui/app/workspace/observability/fragments/prometheusFormFragment.tsx (1)

325-334: Switch disabled when form is invalid may confuse users.

The "Enable Push Gateway" switch is disabled when !form.formState.isValid. Since the form requires push_gateway_url when enabled, this creates a UX issue: users must fill in the URL before they can enable the feature, but there's no visual indication that filling in the URL is the prerequisite.

Consider either:

  1. Allowing the switch to be toggled (let validation show the error)
  2. Adding helper text explaining the prerequisite
💡 Option: Allow toggle, let validation guide
                 <Switch
                   checked={form.watch("enabled")}
                   onCheckedChange={field.onChange}
-                  disabled={!hasPrometheusAccess || isLoading || !form.formState.isValid}
+                  disabled={!hasPrometheusAccess || isLoading}
                 />
plugins/otel/metrics.go (1)

212-223: Set a minimum TLS version for custom CA connections.

When a custom CA is used, the TLS config doesn’t set MinVersion. Consider pinning to TLS 1.2+ (or 1.3 if compatible) to avoid legacy protocol negotiation.

🔐 Harden TLS defaults
-		tlsConfig := &tls.Config{RootCAs: caCertPool}
+		tlsConfig := &tls.Config{
+			RootCAs:    caCertPool,
+			MinVersion: tls.VersionTLS12,
+		}

@akshaydeo akshaydeo force-pushed the 01-30-prom_push-gateway_support branch from f046939 to a434552 Compare January 30, 2026 16:52
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@docs/features/observability/otel.mdx`:
- Around line 744-749: The docs state metrics_push_interval range 1-300 but
runtime validation only requires >0; either enforce the upper bound in the
config validation (update the validator/schema that checks metrics_push_interval
— e.g., the config parsing/validateConfig or MetricsOptions schema — to require
an integer between 1 and 300 inclusive and add/adjust unit tests), or change the
docs line for metrics_push_interval to indicate a “recommended range (1–300)” to
match current behavior; pick the intended behavior and make the corresponding
code or docs change so they stay consistent.

In `@plugins/otel/metrics.go`:
- Around line 212-222: createHTTPExporter reads config.TLSCACert directly (path
traversal risk) and builds a tls.Config without a MinVersion; validate and
constrain the TLSCACert path (e.g., require it to be under an allowed directory
or be an absolute path checked against a whitelist) before calling os.ReadFile
in createHTTPExporter, and add a minimum TLS version to the created tls.Config
(set tls.Config.MinVersion = tls.VersionTLS12 or higher) so opts uses a secure
minimum; update references in the code around config.TLSCACert,
createHTTPExporter, and tls.Config to implement these checks and the MinVersion
setting.

In `@plugins/prometheus/main.go`:
- Around line 196-209: The Cleanup method has a TOCTOU race: it checks p.started
under p.mu then unlocks before calling p.cancel(), allowing Start to run
concurrently; fix by holding p.mu until after you set p.started = false and call
p.cancel() (or at minimum set p.started = false while holding the lock) so the
sequence in Cleanup (acquire p.mu, verify started, set p.started = false,
release p.mu, then call p.cancel() and p.wg.Wait()) is atomic with respect to
Start; update the Cleanup implementation referencing p.mu, p.started, and
p.cancel to perform the state transition under lock.

In `@ui/app/workspace/observability/fragments/otelFormFragment.tsx`:
- Around line 282-289: The numeric input currently forces 15 when cleared
because onChange uses parseInt(...) || 15; change the Controller/field handling
to map empty strings to null and numbers otherwise: render the input with
value={field.value ?? ''} so an unset value shows as empty, and update onChange
to call field.onChange(e.target.value === '' ? null : Number(e.target.value));
alternatively, if using register, use setValueAs: (v) => (v === '' ? null :
Number(v)). Update the Input handling around field.onChange/field.value in
otelFormFragment so form state uses null for empty numeric values instead of
undefined or a default.
🧹 Nitpick comments (1)
plugins/otel/main.go (1)

149-165: Consider enforcing the documented push interval bounds.
UI/docs indicate 1–300 seconds, but the backend only guards <= 0. If the upper bound is intentional, enforce it here (and optionally in ValidateConfig) for consistent behavior.

✅ Proposed guard (if 1–300 is intended)
 	pushInterval := config.MetricsPushInterval
 	if pushInterval <= 0 {
 		pushInterval = 15 // default 15 seconds
 	}
+	if pushInterval > 300 {
+		return nil, fmt.Errorf("metrics_push_interval must be between 1 and 300 seconds")
+	}

@akshaydeo akshaydeo force-pushed the 01-30-prom_push-gateway_support branch from a434552 to d0191d7 Compare January 30, 2026 17:32
@akshaydeo akshaydeo force-pushed the 01-30-prom_push-gateway_support branch from d0191d7 to cad26b7 Compare January 30, 2026 17:49
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@plugins/otel/grpc.go`:
- Around line 25-48: Update the gRPC and HTTP constructors to accept an insecure
bool and honor TLS priority: modify NewOtelClientGRPC (and NewOtelClientHTTP)
signatures to add an insecure parameter, then implement logic: if tlsCACert !=
"" validateCACertPath and load a tls.Config with RootCAs from the file and
MinVersion TLS1.2 (as existing), else if insecure == false use system roots /
default TLS (i.e. create TLS credentials that use system cert pool or nil
tls.Config so system roots are used), otherwise when insecure == true use
insecure.NewCredentials() (for gRPC) or disable TLS for HTTP; also update all
call sites (main.go) to pass config.Insecure into
NewOtelClientGRPC/NewOtelClientHTTP so the constructors can honor
config.Insecure. Ensure validateCACertPath, NewOtelClientGRPC, and
NewOtelClientHTTP are the referenced symbols when making changes.

In `@ui/app/workspace/observability/fragments/prometheusFormFragment.tsx`:
- Around line 83-89: The handleCopyEndpoint function should guard against
clipboard API failures: wrap the call to
navigator.clipboard.writeText(metricsEndpoint) in a try-catch (or use .catch)
and await the promise so rejections are handled; on success call setCopied(true)
and schedule setCopied(false) as before, and on error log/report the error (or
show a user-facing fallback/notification) without setting copied; reference
handleCopyEndpoint, metricsEndpoint, setCopied, and
navigator.clipboard.writeText in your change.
- Around line 185-192: The numeric Input currently forces a fallback to 15 on
clear because onChange uses field.onChange(parseInt(e.target.value) || 15);
update the onChange handler (the Input using {...field}) to detect an empty
string and call field.onChange(null) when e.target.value === '' and otherwise
call field.onChange(parseInt(e.target.value, 10)); also ensure the
validation/schema for push_interval (where push_interval is defined) accepts
null as a valid value so empty clears are allowed instead of being coerced to
15.

In `@ui/lib/types/schemas.ts`:
- Around line 652-721: The form transformation in PrometheusView.tsx is checking
a non-existent flag basic_auth_enabled and thus drops credentials; update the
transform that builds backendConfig so it instead checks that both
config.prometheus_config.basic_auth_username and
config.prometheus_config.basic_auth_password are present and non-empty (trimmed)
before assigning backendConfig.basic_auth with username/password, ensuring
basic_auth is only added when both credentials exist.
🧹 Nitpick comments (3)
ui/app/workspace/observability/views/plugins/prometheusView.tsx (1)

31-69: Optional: simplify onSave by using async/await instead of manual Promise wrapping.

This reduces nesting and preserves the same success/error behavior.

♻️ Suggested refactor
-  const handlePrometheusConfigSave = (config: PrometheusFormSchema): Promise<void> => {
-    return new Promise((resolve, reject) => {
+  const handlePrometheusConfigSave = async (config: PrometheusFormSchema): Promise<void> => {
       // Transform the form data to the backend config format
       const backendConfig: PrometheusConfig = {
         push_gateway_url: config.prometheus_config.push_gateway_url,
         job_name: config.prometheus_config.job_name,
         instance_id: config.prometheus_config.instance_id || undefined,
         push_interval: config.prometheus_config.push_interval,
       }
@@
-      updatePlugin({
-        name: "prometheus",
-        data: {
-          enabled: config.enabled,
-          config: backendConfig,
-        },
-      })
-        .unwrap()
-        .then(() => {
-          resolve()
-          toast.success("Prometheus configuration updated successfully")
-        })
-        .catch((err) => {
-          toast.error("Failed to update Prometheus configuration", {
-            description: getErrorMessage(err),
-          })
-          reject(err)
-        })
-    })
-  }
+      try {
+        await updatePlugin({
+          name: "prometheus",
+          data: {
+            enabled: config.enabled,
+            config: backendConfig,
+          },
+        }).unwrap()
+        toast.success("Prometheus configuration updated successfully")
+      } catch (err) {
+        toast.error("Failed to update Prometheus configuration", {
+          description: getErrorMessage(err),
+        })
+        throw err
+      }
+  }
plugins/prometheus/main.go (2)

120-122: Consider warning when BasicAuth is used without HTTPS.

Sending credentials over HTTP exposes them to interception. Consider logging a warning or validating the URL scheme:

🛡️ Suggested enhancement
 if p.config.BasicAuth != nil && p.config.BasicAuth.Username != "" {
+	if !strings.HasPrefix(p.config.PushGatewayURL, "https://") {
+		// Log warning but don't block - user may have TLS termination at proxy
+		p.logger.Warn("basic auth configured but push_gateway_url is not HTTPS - credentials may be sent in cleartext")
+	}
 	pusher = pusher.BasicAuth(p.config.BasicAuth.Username, p.config.BasicAuth.Password)
 }

Note: This would require moving the check to SetRegistry or adding a logger field access, or doing validation in Init.


63-102: Accept parent context parameter for proper shutdown signal propagation.

The prometheus plugin should receive a context parameter like other plugins (logging, governance, otel, semanticcache) in the same initialization flow. Currently it creates an isolated context.Background() at line 92, which means server-wide shutdown signals won't propagate to this plugin's context. The context is already available at the call site in loadBuiltinPlugin().

Required changes:

  1. Add ctx context.Context as the first parameter to Init()
  2. Use pluginCtx, cancel := context.WithCancel(ctx) instead of context.WithCancel(context.Background())
  3. Update the caller at transports/bifrost-http/server/plugins.go:109 to pass ctx

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/otel/http.go (1)

33-45: ⚠️ Potential issue | 🟡 Minor

Use a concrete error message when AppendCertsFromPEM fails

Line 43-45: When AppendCertsFromPEM returns false, the code wraps err (the nil result from the successful os.ReadFile call) with %w. This produces a malformed error message with a formatting artifact. Use a concrete error message that describes the actual problem instead.

Proposed fix
-	if !caCertPool.AppendCertsFromPEM(caCert) {
-		return nil, fmt.Errorf("fail to add provided CA cert: %w", err)
-	}
+	if !caCertPool.AppendCertsFromPEM(caCert) {
+		return nil, fmt.Errorf("fail to add provided CA cert: invalid PEM or no certs found")
+	}
🧹 Nitpick comments (3)
transports/config.schema.json (1)

1150-1164: Add additionalProperties: false to basic_auth object.

The basic_auth object schema is missing additionalProperties: false, which would allow unexpected fields to pass validation. This is inconsistent with other nested objects in the schema (e.g., config at line 1169).

Proposed fix
                     "basic_auth": {
                       "type": "object",
                       "description": "Basic authentication credentials for the Push Gateway",
                       "properties": {
                         "username": {
                           "type": "string",
                           "description": "Username for basic authentication"
                         },
                         "password": {
                           "type": "string",
                           "description": "Password for basic authentication"
                         }
                       },
-                      "required": ["username", "password"]
+                      "required": ["username", "password"],
+                      "additionalProperties": false
                     }
ui/app/workspace/observability/views/plugins/otelView.tsx (1)

16-16: Remove unused baseUrl variable.

The baseUrl variable is no longer used after removing the readonly Metrics endpoint display block.

Proposed fix
 	const [updatePlugin, { isLoading: isUpdatingPlugin }] = useUpdatePluginMutation();
-	const baseUrl = `${window.location.protocol}//${window.location.host}`;
plugins/otel/metrics.go (1)

248-266: Consider using validated/cleaned path for file reads.

While validateCACertPath properly validates the path, the subsequent os.ReadFile calls (lines 254, 296) still use config.TLSCACert directly rather than the cleaned path. For extra safety and to better satisfy static analysis tools:

♻️ Suggested improvement

Modify validateCACertPath to return the cleaned path:

-func validateCACertPath(certPath string) error {
+func validateCACertPath(certPath string) (string, error) {
 	if certPath == "" {
-		return nil
+		return "", nil
 	}
 
 	cleanPath := filepath.Clean(certPath)
 	// ... validation logic ...
 
-	return nil
+	return cleanPath, nil
 }

Then use the returned path:

-	if err := validateCACertPath(config.TLSCACert); err != nil {
+	cleanPath, err := validateCACertPath(config.TLSCACert)
+	if err != nil {
 		return nil, err
 	}
-	caCert, err := os.ReadFile(config.TLSCACert)
+	caCert, err := os.ReadFile(cleanPath)

Also applies to: 290-309

@akshaydeo akshaydeo force-pushed the 01-30-prom_push-gateway_support branch from cad26b7 to 4574632 Compare January 30, 2026 18:55
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/otel/http.go (1)

43-46: ⚠️ Potential issue | 🔴 Critical

Fix nil-wrapped error on CA append failure.

fmt.Errorf("...: %w", err) will return nil when err is nil, so CA parse failures can incorrectly return (nil, nil) and cause downstream panics.

🐛 Proposed fix
-	if !caCertPool.AppendCertsFromPEM(caCert) {
-		return nil, fmt.Errorf("fail to add provided CA cert: %w", err)
-	}
+	if !caCertPool.AppendCertsFromPEM(caCert) {
+		return nil, fmt.Errorf("fail to add provided CA cert")
+	}
🤖 Fix all issues with AI agents
In `@docs/features/observability/otel.mdx`:
- Around line 755-791: Update the gRPC example JSON so the collector_url uses
bare host:port (e.g., change the value of "collector_url" when "protocol":
"grpc" to "localhost:4317" instead of "http://localhost:4317"), and add a short
note in the otel.mdx text explaining that HTTP protocol examples require full
URLs with paths (e.g., "http://host:4318/v1/traces") while gRPC expects a
host:port string; reference the keys "collector_url" and "protocol" and the
implementation functions grpc.NewClient() vs http.NewRequest() to clarify the
expected formats.

In `@plugins/prometheus/main.go`:
- Around line 105-124: The SetRegistry method currently assigns p.registry and
then builds a pusher using Gatherer(registry) without validating registry; this
can lead to a nil gatherer and a panic later. Add an explicit nil check at the
start of PrometheusPushPlugin.SetRegistry (before assigning p.registry or
calling push.New/Gatherer): if registry == nil return a descriptive error (e.g.,
"nil registry") so you fail fast; keep the existing logic for pusher creation
(push.New(...).Gatherer(registry).Grouping(...)) and BasicAuth setup unchanged
after the nil guard.

In `@ui/lib/types/schemas.ts`:
- Around line 652-721: PrometheusView.tsx is checking a non-existent flag
basic_auth_enabled so basic auth never gets applied; update the conditional that
decides to attach credentials (where it references
config.prometheus_config.basic_auth_enabled) to instead check that both
config.prometheus_config.basic_auth_username and
config.prometheus_config.basic_auth_password are non-empty (use .trim() checks)
before applying basic auth, ensuring you only enable credentials when both
username and password are present.
🧹 Nitpick comments (2)
plugins/otel/metrics.go (2)

74-88: Silent metric creation failures may hide configuration issues.

When metric creation fails, the error is logged but subsequent Add calls silently do nothing (line 85-87). This is reasonable for resilience but could mask misconfiguration. Consider adding a health check method or startup validation to surface these errors earlier.


505-522: Consider using a struct for BuildBifrostAttributes parameters.

The function has 14 parameters, making call sites verbose and error-prone. A struct would improve readability and allow optional fields.

♻️ Example refactor
type BifrostAttributeParams struct {
    Provider        string
    Model           string
    Method          string
    VirtualKeyID    string
    VirtualKeyName  string
    SelectedKeyID   string
    SelectedKeyName string
    NumberOfRetries int
    FallbackIndex   int
    TeamID          string
    TeamName        string
    CustomerID      string
    CustomerName    string
}

func BuildBifrostAttributes(p BifrostAttributeParams) []attribute.KeyValue {
    // ...
}

@akshaydeo akshaydeo force-pushed the 01-30-prom_push-gateway_support branch from 4574632 to 24c2b4a Compare January 31, 2026 07:08
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@plugins/otel/grpc.go`:
- Around line 29-34: validateCACertPath currently uses os.Stat which follows
symlinks and can be abused for symlink-based path traversal; update
validateCACertPath to use os.Lstat and explicitly check and reject symlinks
(e.g., if FileMode&os.ModeSymlink != 0) before proceeding with
filepath.Clean/IsAbs checks so that the tlsCACert passed into the cert-reading
logic cannot be a symlink to sensitive files.

In `@plugins/otel/http.go`:
- Around line 44-46: The error returned when
caCertPool.AppendCertsFromPEM(caCert) fails incorrectly wraps a nil variable
`err`; update the fmt.Errorf call in the failing branch to not use %w or `err`
(e.g. return a plain error message like "fail to add provided CA cert" or
include contextual details without wrapping a non-existent error). Locate the
check around caCertPool.AppendCertsFromPEM and replace the fmt.Errorf("fail to
add provided CA cert: %w", err) with a non-wrapping error; mirror the pattern
used in grpc.go's "fail to parse provided CA cert" message if helpful.

In `@ui/lib/types/schemas.ts`:
- Around line 544-546: The frontend schema for metrics_push_interval allows
nulls and non-integer numbers which mismatches the backend's non-nullable int
contract; update the metrics_push_interval zod entry (metrics_push_interval) to
remove .nullable() and enforce integers by adding .int(), e.g. change the chain
to z.number().int().min(1).max(300).default(15) so the frontend only emits
non-null integer seconds compatible with the backend MetricsPushInterval
(plugins/otel/main.go).

@akshaydeo akshaydeo force-pushed the 01-30-prom_push-gateway_support branch from 24c2b4a to e748c34 Compare January 31, 2026 11:29
@akshaydeo akshaydeo force-pushed the 01-30-prom_push-gateway_support branch from e748c34 to 36ae04b Compare January 31, 2026 11:45
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@plugins/otel/go.mod`:
- Around line 8-13: Update the metric export error handling introduced by the
otlpmetricgrpc/otlpmetrichttp v1.39.0 change: locate where your metric
reader/export pipeline (e.g., PeriodicReader, any Shutdown or ForceFlush calls,
and where exporters are invoked) calls the OTLP exporters and explicitly check
their returned error values (including partial export errors) instead of relying
solely on otel.ErrorHandler; handle or propagate returned errors from
otlpmetricgrpc and otlpmetrichttp (e.g., detect partial export errors and
retry/log/aggregate as appropriate) in the PeriodicReader shutdown/force-flush
paths and any immediate export call sites so failures are not silently ignored.

In `@plugins/otel/http.go`:
- Around line 33-57: The current logic leaves transport.TLSClientConfig nil when
insecureMode is true, but Go's defaults still verify TLS; update the branch
handling insecureMode to explicitly set transport.TLSClientConfig =
&tls.Config{InsecureSkipVerify: true, MinVersion: tls.VersionTLS12} (or, if you
want insecure to override any custom CA, ensure the insecureMode check runs
before/after tlsCACert logic as appropriate). Modify the code around tlsCACert,
insecureMode, and transport.TLSClientConfig (and keep validateCACertPath
behavior for tlsCACert) so that when insecureMode is enabled the TLS config
explicitly disables verification.
🧹 Nitpick comments (1)
plugins/otel/metrics.go (1)

184-186: Consider implications of setting global meter provider.

otel.SetMeterProvider(provider) sets the global provider, which affects all OTEL instrumentation in the process. This is appropriate for Bifrost's single-service model, but worth noting if the codebase ever supports multiple OTEL configurations or plugin isolation.

@akshaydeo akshaydeo force-pushed the 01-30-prom_push-gateway_support branch from 36ae04b to 54b19f9 Compare February 9, 2026 04:56
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@plugins/litellmcompat/go.mod`:
- Around line 110-111: Update the vulnerable OTEL SDK entries by bumping
go.opentelemetry.io/otel/sdk and go.opentelemetry.io/otel/sdk/metric from
v1.39.0 to v1.40.0 in go.mod, then resolve modules and lock changes (e.g., run
go get go.opentelemetry.io/otel/sdk@v1.40.0 and go get
go.opentelemetry.io/otel/sdk/metric@v1.40.0 followed by go mod tidy) and run
tests/build to ensure nothing breaks; target the module paths
go.opentelemetry.io/otel/sdk and go.opentelemetry.io/otel/sdk/metric when making
the change.

In `@plugins/prometheus/main.go`:
- Around line 104-129: Set a bounded HTTP client timeout on the Prometheus
pusher to avoid blocking on network stalls: in SetRegistry
(PrometheusPushPlugin) after building the pusher variable, create an http.Client
with a Timeout (use a configured value like p.config.PushTimeout or a sensible
default, e.g. 10s) and call pusher.Client(yourClient) before assigning p.pusher
so all subsequent pusher.Push() calls (including the final cleanup push and
pushLoop pushes) use the timeout-bound client.

@akshaydeo akshaydeo force-pushed the 01-30-prom_push-gateway_support branch 2 times, most recently from ae802ad to f3d4f0a Compare February 11, 2026 10:18
@akshaydeo akshaydeo mentioned this pull request Feb 11, 2026
9 tasks
@akshaydeo akshaydeo force-pushed the 01-30-prom_push-gateway_support branch from f3d4f0a to 5ee5d40 Compare February 12, 2026 08:11
@akshaydeo akshaydeo force-pushed the 01-30-prom_push-gateway_support branch 2 times, most recently from a889237 to ee8c701 Compare February 12, 2026 09:40
@akshaydeo akshaydeo force-pushed the 01-30-prom_push-gateway_support branch from ee8c701 to b214fcc Compare February 12, 2026 09:57
@akshaydeo akshaydeo merged commit 66364f6 into main Feb 12, 2026
8 of 9 checks passed
@akshaydeo akshaydeo deleted the 01-30-prom_push-gateway_support branch February 12, 2026 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant