Conversation
🧪 Test Suite AvailableThis PR can be tested by a repository admin. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds OTEL metrics push support (OTLP HTTP/gRPC) with TLS/CA and insecure options, a Prometheus Pushgateway push plugin, JSON schema and UI updates for OTEL TLS and metrics push, multiple go.mod dependency bumps, documentation for OTEL/Prometheus, and server wiring to load/start the Prometheus push plugin. Changes
Sequence Diagram(s)sequenceDiagram
participant App as Application
participant OtelPlugin as OtelPlugin
participant MetricsExp as MetricsExporter
participant OTLP as OTLP Exporter
participant Collector as OTLP Collector
App->>OtelPlugin: Load config (metrics_enabled, endpoint, protocol, interval)
OtelPlugin->>MetricsExp: NewMetricsExporter(ctx, config)
MetricsExp->>MetricsExp: build resource & meter
alt protocol == grpc
MetricsExp->>OTLP: createGRPCExporter(endpoint, headers, TLS/insecure)
else
MetricsExp->>OTLP: createHTTPExporter(endpoint, headers, TLS/insecure)
end
OTLP-->>MetricsExp: exporter ready
loop every PushInterval
App->>MetricsExp: Record* calls (e.g., RecordHTTPRequest)
MetricsExp->>OTLP: export metrics
OTLP->>Collector: push OTLP metrics
end
App->>MetricsExp: Shutdown()
MetricsExp->>OTLP: Shutdown exporter
sequenceDiagram
participant App as Application
participant PromPlugin as PrometheusPushPlugin
participant Registry as Prometheus Registry
participant PushGW as Pushgateway
App->>PromPlugin: Init(config)
App->>PromPlugin: SetRegistry(Registry)
PromPlugin->>PromPlugin: Start() -> spawn pushLoop
loop every PushInterval
PromPlugin->>Registry: gather metrics
PromPlugin->>PushGW: push metrics (job, instance grouping, basic auth)
PushGW-->>PromPlugin: ACK / error
end
App->>PromPlugin: Cleanup()
PromPlugin->>PushGW: final push
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/features/observability/otel.mdx (1)
420-426:⚠️ Potential issue | 🟡 MinorMinor inconsistency in Grafana access instructions.
The Docker Compose maps Grafana to port 4000 (
"4000:3000"at line 270), but the instructions say to access it athttp://localhost:3000. This should behttp://localhost:4000.📝 Suggested fix
-Access Grafana at `http://localhost:3000` (default credentials: admin/admin) +Access Grafana at `http://localhost:4000` (default credentials: admin/admin)
🤖 Fix all issues with AI agents
In `@plugins/otel/metrics.go`:
- Around line 24-32: Add an Insecure bool to the MetricsConfig struct and update
createGRPCExporter, createHTTPExporter, and the main.go config wiring: when
TLSCACert is provided load and use the custom CA; when TLSCACert is empty and
Insecure is false explicitly use system root CAs (for gRPC call
credentials.NewClientTLSFromCert(nil, "") and for HTTP set a tls.Config that
uses system roots); only use insecure.NewCredentials() (gRPC) or skip TLS (HTTP)
when Insecure is true; wire MetricsConfig.Insecure through main.go so the flag
is respected. Ensure all three locations (MetricsConfig, createGRPCExporter,
createHTTPExporter) reference the new field and follow the TLS priority: custom
CA > system roots > explicit Insecure.
In `@plugins/prometheus/go.mod`:
- Around line 5-8: Update the Prometheus client dependency in go.mod by changing
the require entry for github.com/prometheus/client_golang from v1.23.0 to
v1.23.2, then refresh modules (e.g., run go get
github.com/prometheus/client_golang@v1.23.2 and go mod tidy) so the lockfile and
transitive deps are updated; target the require line referencing
github.com/prometheus/client_golang.
In `@transports/config.schema.json`:
- Around line 1023-1047: Add a conditional requirement so that when
metrics_enabled is true the metrics_endpoint field is required: update the JSON
Schema around the metrics_* properties to include an "if" with properties {
"metrics_enabled": { "const": true } } and a corresponding "then" that sets
"required": ["metrics_endpoint"]; keep the existing oneOf for metrics_endpoint
(or optionally tighten it to enforce URI vs host:port). This change ensures
metrics_enabled and metrics_endpoint (the symbols to edit) are validated
together in the transports/config.schema.json schema.
🧹 Nitpick comments (8)
plugins/prometheus/main.go (3)
37-41: Consider supporting environment variable resolution for BasicAuth credentials.The OTEL plugin supports
env.prefix for header values to load secrets from environment variables. For consistency and to avoid hardcoding credentials in config files, consider adding similar support for BasicAuth.Example approach
// In SetRegistry or Init, resolve env vars: if p.config.BasicAuth != nil { if newValue, ok := strings.CutPrefix(p.config.BasicAuth.Username, "env."); ok { p.config.BasicAuth.Username = os.Getenv(newValue) } if newValue, ok := strings.CutPrefix(p.config.BasicAuth.Password, "env."); ok { p.config.BasicAuth.Password = os.Getenv(newValue) } }
65-102: Accept parent context for lifecycle coordination.The Init function creates its own
context.Background()instead of accepting a parent context. This differs from the OTEL plugin pattern (seeplugins/otel/main.go:85) and may cause issues with graceful shutdown coordination if the server context is cancelled.Proposed fix
-func Init(config *Config, logger schemas.Logger) (*PrometheusPushPlugin, error) { +func Init(ctx context.Context, config *Config, logger schemas.Logger) (*PrometheusPushPlugin, error) { if config == nil { return nil, fmt.Errorf("config is required") } // ... validation ... - ctx, cancel := context.WithCancel(context.Background()) + pluginCtx, cancel := context.WithCancel(ctx) plugin := &PrometheusPushPlugin{ config: config, logger: logger, - ctx: ctx, + ctx: pluginCtx, cancel: cancel, } return plugin, nil }
196-209: Consider resettingstartedflag in Cleanup for completeness.After Cleanup, the
startedflag remainstrue, preventing the plugin from being restarted. While plugins typically aren't restarted, resetting the flag would make the lifecycle cleaner and prevent confusing behavior if Start is called after Cleanup.Optional fix
func (p *PrometheusPushPlugin) Cleanup() error { p.mu.Lock() if !p.started { p.mu.Unlock() return nil } + p.started = false p.mu.Unlock() p.cancel() p.wg.Wait() p.logger.Info("prometheus push gateway plugin stopped") return nil }ui/app/workspace/observability/views/plugins/otelView.tsx (1)
16-16: Remove unusedbaseUrlvariable.The
baseUrlvariable was likely used for the removed metrics endpoint display. It's now unused and should be removed.Proposed fix
const [updatePlugin, { isLoading: isUpdatingPlugin }] = useUpdatePluginMutation(); - const baseUrl = `${window.location.protocol}//${window.location.host}`;ui/app/workspace/observability/views/plugins/prometheusView.tsx (1)
31-70: Promise wrapper is unnecessary with RTK Query.The
handlePrometheusConfigSavefunction wrapsupdatePluginin a manual Promise, but.unwrap()already returns a Promise. The wrapper adds complexity without benefit.♻️ Simplified implementation
- const handlePrometheusConfigSave = (config: PrometheusFormSchema): Promise<void> => { - return new Promise((resolve, reject) => { - // Transform the form data to the backend config format - const backendConfig: PrometheusConfig = { - push_gateway_url: config.prometheus_config.push_gateway_url, - job_name: config.prometheus_config.job_name, - instance_id: config.prometheus_config.instance_id || undefined, - push_interval: config.prometheus_config.push_interval, - } - - // Add basic auth if enabled - if (config.prometheus_config.basic_auth_enabled && - config.prometheus_config.basic_auth_username && - config.prometheus_config.basic_auth_password) { - backendConfig.basic_auth = { - username: config.prometheus_config.basic_auth_username, - password: config.prometheus_config.basic_auth_password, - } - } - - updatePlugin({ - name: "prometheus", - data: { - enabled: config.enabled, - config: backendConfig, - }, - }) - .unwrap() - .then(() => { - resolve() - toast.success("Prometheus configuration updated successfully") - }) - .catch((err) => { - toast.error("Failed to update Prometheus configuration", { - description: getErrorMessage(err), - }) - reject(err) - }) - }) - } + const handlePrometheusConfigSave = async (config: PrometheusFormSchema): Promise<void> => { + // Transform the form data to the backend config format + const backendConfig: PrometheusConfig = { + push_gateway_url: config.prometheus_config.push_gateway_url, + job_name: config.prometheus_config.job_name, + instance_id: config.prometheus_config.instance_id || undefined, + push_interval: config.prometheus_config.push_interval, + } + + // Add basic auth if enabled + if (config.prometheus_config.basic_auth_enabled && + config.prometheus_config.basic_auth_username && + config.prometheus_config.basic_auth_password) { + backendConfig.basic_auth = { + username: config.prometheus_config.basic_auth_username, + password: config.prometheus_config.basic_auth_password, + } + } + + try { + await updatePlugin({ + name: "prometheus", + data: { + enabled: config.enabled, + config: backendConfig, + }, + }).unwrap() + toast.success("Prometheus configuration updated successfully") + } catch (err) { + toast.error("Failed to update Prometheus configuration", { + description: getErrorMessage(err), + }) + throw err + } + }ui/app/workspace/observability/fragments/otelFormFragment.tsx (1)
63-78: Potential redundant validation triggers in useEffects.The two
useEffecthooks have overlapping concerns. The first hook (lines 64-71) already handlesmetricsEnabledin its dependency array and triggers validation when it changes. The second hook (lines 73-78) duplicates this behavior.Additionally, the second
useEffectis missingformin its dependency array (ESLint would flag this), andinitialConfigcheck is missing which could cause validation on mount before config is loaded.♻️ Consider consolidating validation logic
- useEffect(() => { - if (initialConfig === undefined || initialConfig === null || (initialConfig.enabled ?? false) === false) return; - form.trigger("otel_config.collector_url"); - // Also re-validate metrics_endpoint when protocol changes - if (metricsEnabled) { - form.trigger("otel_config.metrics_endpoint"); - } - }, [protocol, form, metricsEnabled]); - - // Re-run validation on metrics_endpoint when metrics_enabled changes - useEffect(() => { - if (metricsEnabled) { - form.trigger("otel_config.metrics_endpoint"); - } - }, [metricsEnabled, form]); + useEffect(() => { + if (initialConfig === undefined || initialConfig === null || (initialConfig.enabled ?? false) === false) return; + form.trigger("otel_config.collector_url"); + if (metricsEnabled) { + form.trigger("otel_config.metrics_endpoint"); + } + }, [protocol, form, metricsEnabled, initialConfig]);ui/app/workspace/observability/fragments/prometheusFormFragment.tsx (1)
325-334: Switch disabled when form is invalid may confuse users.The "Enable Push Gateway" switch is disabled when
!form.formState.isValid. Since the form requirespush_gateway_urlwhen enabled, this creates a UX issue: users must fill in the URL before they can enable the feature, but there's no visual indication that filling in the URL is the prerequisite.Consider either:
- Allowing the switch to be toggled (let validation show the error)
- Adding helper text explaining the prerequisite
💡 Option: Allow toggle, let validation guide
<Switch checked={form.watch("enabled")} onCheckedChange={field.onChange} - disabled={!hasPrometheusAccess || isLoading || !form.formState.isValid} + disabled={!hasPrometheusAccess || isLoading} />plugins/otel/metrics.go (1)
212-223: Set a minimum TLS version for custom CA connections.When a custom CA is used, the TLS config doesn’t set
MinVersion. Consider pinning to TLS 1.2+ (or 1.3 if compatible) to avoid legacy protocol negotiation.🔐 Harden TLS defaults
- tlsConfig := &tls.Config{RootCAs: caCertPool} + tlsConfig := &tls.Config{ + RootCAs: caCertPool, + MinVersion: tls.VersionTLS12, + }
f046939 to
a434552
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@docs/features/observability/otel.mdx`:
- Around line 744-749: The docs state metrics_push_interval range 1-300 but
runtime validation only requires >0; either enforce the upper bound in the
config validation (update the validator/schema that checks metrics_push_interval
— e.g., the config parsing/validateConfig or MetricsOptions schema — to require
an integer between 1 and 300 inclusive and add/adjust unit tests), or change the
docs line for metrics_push_interval to indicate a “recommended range (1–300)” to
match current behavior; pick the intended behavior and make the corresponding
code or docs change so they stay consistent.
In `@plugins/otel/metrics.go`:
- Around line 212-222: createHTTPExporter reads config.TLSCACert directly (path
traversal risk) and builds a tls.Config without a MinVersion; validate and
constrain the TLSCACert path (e.g., require it to be under an allowed directory
or be an absolute path checked against a whitelist) before calling os.ReadFile
in createHTTPExporter, and add a minimum TLS version to the created tls.Config
(set tls.Config.MinVersion = tls.VersionTLS12 or higher) so opts uses a secure
minimum; update references in the code around config.TLSCACert,
createHTTPExporter, and tls.Config to implement these checks and the MinVersion
setting.
In `@plugins/prometheus/main.go`:
- Around line 196-209: The Cleanup method has a TOCTOU race: it checks p.started
under p.mu then unlocks before calling p.cancel(), allowing Start to run
concurrently; fix by holding p.mu until after you set p.started = false and call
p.cancel() (or at minimum set p.started = false while holding the lock) so the
sequence in Cleanup (acquire p.mu, verify started, set p.started = false,
release p.mu, then call p.cancel() and p.wg.Wait()) is atomic with respect to
Start; update the Cleanup implementation referencing p.mu, p.started, and
p.cancel to perform the state transition under lock.
In `@ui/app/workspace/observability/fragments/otelFormFragment.tsx`:
- Around line 282-289: The numeric input currently forces 15 when cleared
because onChange uses parseInt(...) || 15; change the Controller/field handling
to map empty strings to null and numbers otherwise: render the input with
value={field.value ?? ''} so an unset value shows as empty, and update onChange
to call field.onChange(e.target.value === '' ? null : Number(e.target.value));
alternatively, if using register, use setValueAs: (v) => (v === '' ? null :
Number(v)). Update the Input handling around field.onChange/field.value in
otelFormFragment so form state uses null for empty numeric values instead of
undefined or a default.
🧹 Nitpick comments (1)
plugins/otel/main.go (1)
149-165: Consider enforcing the documented push interval bounds.
UI/docs indicate 1–300 seconds, but the backend only guards<= 0. If the upper bound is intentional, enforce it here (and optionally inValidateConfig) for consistent behavior.✅ Proposed guard (if 1–300 is intended)
pushInterval := config.MetricsPushInterval if pushInterval <= 0 { pushInterval = 15 // default 15 seconds } + if pushInterval > 300 { + return nil, fmt.Errorf("metrics_push_interval must be between 1 and 300 seconds") + }
a434552 to
d0191d7
Compare
d0191d7 to
cad26b7
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@plugins/otel/grpc.go`:
- Around line 25-48: Update the gRPC and HTTP constructors to accept an insecure
bool and honor TLS priority: modify NewOtelClientGRPC (and NewOtelClientHTTP)
signatures to add an insecure parameter, then implement logic: if tlsCACert !=
"" validateCACertPath and load a tls.Config with RootCAs from the file and
MinVersion TLS1.2 (as existing), else if insecure == false use system roots /
default TLS (i.e. create TLS credentials that use system cert pool or nil
tls.Config so system roots are used), otherwise when insecure == true use
insecure.NewCredentials() (for gRPC) or disable TLS for HTTP; also update all
call sites (main.go) to pass config.Insecure into
NewOtelClientGRPC/NewOtelClientHTTP so the constructors can honor
config.Insecure. Ensure validateCACertPath, NewOtelClientGRPC, and
NewOtelClientHTTP are the referenced symbols when making changes.
In `@ui/app/workspace/observability/fragments/prometheusFormFragment.tsx`:
- Around line 83-89: The handleCopyEndpoint function should guard against
clipboard API failures: wrap the call to
navigator.clipboard.writeText(metricsEndpoint) in a try-catch (or use .catch)
and await the promise so rejections are handled; on success call setCopied(true)
and schedule setCopied(false) as before, and on error log/report the error (or
show a user-facing fallback/notification) without setting copied; reference
handleCopyEndpoint, metricsEndpoint, setCopied, and
navigator.clipboard.writeText in your change.
- Around line 185-192: The numeric Input currently forces a fallback to 15 on
clear because onChange uses field.onChange(parseInt(e.target.value) || 15);
update the onChange handler (the Input using {...field}) to detect an empty
string and call field.onChange(null) when e.target.value === '' and otherwise
call field.onChange(parseInt(e.target.value, 10)); also ensure the
validation/schema for push_interval (where push_interval is defined) accepts
null as a valid value so empty clears are allowed instead of being coerced to
15.
In `@ui/lib/types/schemas.ts`:
- Around line 652-721: The form transformation in PrometheusView.tsx is checking
a non-existent flag basic_auth_enabled and thus drops credentials; update the
transform that builds backendConfig so it instead checks that both
config.prometheus_config.basic_auth_username and
config.prometheus_config.basic_auth_password are present and non-empty (trimmed)
before assigning backendConfig.basic_auth with username/password, ensuring
basic_auth is only added when both credentials exist.
🧹 Nitpick comments (3)
ui/app/workspace/observability/views/plugins/prometheusView.tsx (1)
31-69: Optional: simplifyonSaveby using async/await instead of manual Promise wrapping.This reduces nesting and preserves the same success/error behavior.
♻️ Suggested refactor
- const handlePrometheusConfigSave = (config: PrometheusFormSchema): Promise<void> => { - return new Promise((resolve, reject) => { + const handlePrometheusConfigSave = async (config: PrometheusFormSchema): Promise<void> => { // Transform the form data to the backend config format const backendConfig: PrometheusConfig = { push_gateway_url: config.prometheus_config.push_gateway_url, job_name: config.prometheus_config.job_name, instance_id: config.prometheus_config.instance_id || undefined, push_interval: config.prometheus_config.push_interval, } @@ - updatePlugin({ - name: "prometheus", - data: { - enabled: config.enabled, - config: backendConfig, - }, - }) - .unwrap() - .then(() => { - resolve() - toast.success("Prometheus configuration updated successfully") - }) - .catch((err) => { - toast.error("Failed to update Prometheus configuration", { - description: getErrorMessage(err), - }) - reject(err) - }) - }) - } + try { + await updatePlugin({ + name: "prometheus", + data: { + enabled: config.enabled, + config: backendConfig, + }, + }).unwrap() + toast.success("Prometheus configuration updated successfully") + } catch (err) { + toast.error("Failed to update Prometheus configuration", { + description: getErrorMessage(err), + }) + throw err + } + }plugins/prometheus/main.go (2)
120-122: Consider warning when BasicAuth is used without HTTPS.Sending credentials over HTTP exposes them to interception. Consider logging a warning or validating the URL scheme:
🛡️ Suggested enhancement
if p.config.BasicAuth != nil && p.config.BasicAuth.Username != "" { + if !strings.HasPrefix(p.config.PushGatewayURL, "https://") { + // Log warning but don't block - user may have TLS termination at proxy + p.logger.Warn("basic auth configured but push_gateway_url is not HTTPS - credentials may be sent in cleartext") + } pusher = pusher.BasicAuth(p.config.BasicAuth.Username, p.config.BasicAuth.Password) }Note: This would require moving the check to
SetRegistryor adding a logger field access, or doing validation inInit.
63-102: Accept parent context parameter for proper shutdown signal propagation.The prometheus plugin should receive a context parameter like other plugins (logging, governance, otel, semanticcache) in the same initialization flow. Currently it creates an isolated
context.Background()at line 92, which means server-wide shutdown signals won't propagate to this plugin's context. The context is already available at the call site inloadBuiltinPlugin().Required changes:
- Add
ctx context.Contextas the first parameter toInit()- Use
pluginCtx, cancel := context.WithCancel(ctx)instead ofcontext.WithCancel(context.Background())- Update the caller at
transports/bifrost-http/server/plugins.go:109to passctx
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugins/otel/http.go (1)
33-45:⚠️ Potential issue | 🟡 MinorUse a concrete error message when
AppendCertsFromPEMfailsLine 43-45: When
AppendCertsFromPEMreturns false, the code wrapserr(the nil result from the successfulos.ReadFilecall) with%w. This produces a malformed error message with a formatting artifact. Use a concrete error message that describes the actual problem instead.Proposed fix
- if !caCertPool.AppendCertsFromPEM(caCert) { - return nil, fmt.Errorf("fail to add provided CA cert: %w", err) - } + if !caCertPool.AppendCertsFromPEM(caCert) { + return nil, fmt.Errorf("fail to add provided CA cert: invalid PEM or no certs found") + }
🧹 Nitpick comments (3)
transports/config.schema.json (1)
1150-1164: AddadditionalProperties: falsetobasic_authobject.The
basic_authobject schema is missingadditionalProperties: false, which would allow unexpected fields to pass validation. This is inconsistent with other nested objects in the schema (e.g.,configat line 1169).Proposed fix
"basic_auth": { "type": "object", "description": "Basic authentication credentials for the Push Gateway", "properties": { "username": { "type": "string", "description": "Username for basic authentication" }, "password": { "type": "string", "description": "Password for basic authentication" } }, - "required": ["username", "password"] + "required": ["username", "password"], + "additionalProperties": false }ui/app/workspace/observability/views/plugins/otelView.tsx (1)
16-16: Remove unusedbaseUrlvariable.The
baseUrlvariable is no longer used after removing the readonly Metrics endpoint display block.Proposed fix
const [updatePlugin, { isLoading: isUpdatingPlugin }] = useUpdatePluginMutation(); - const baseUrl = `${window.location.protocol}//${window.location.host}`;plugins/otel/metrics.go (1)
248-266: Consider using validated/cleaned path for file reads.While
validateCACertPathproperly validates the path, the subsequentos.ReadFilecalls (lines 254, 296) still useconfig.TLSCACertdirectly rather than the cleaned path. For extra safety and to better satisfy static analysis tools:♻️ Suggested improvement
Modify
validateCACertPathto return the cleaned path:-func validateCACertPath(certPath string) error { +func validateCACertPath(certPath string) (string, error) { if certPath == "" { - return nil + return "", nil } cleanPath := filepath.Clean(certPath) // ... validation logic ... - return nil + return cleanPath, nil }Then use the returned path:
- if err := validateCACertPath(config.TLSCACert); err != nil { + cleanPath, err := validateCACertPath(config.TLSCACert) + if err != nil { return nil, err } - caCert, err := os.ReadFile(config.TLSCACert) + caCert, err := os.ReadFile(cleanPath)Also applies to: 290-309
cad26b7 to
4574632
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugins/otel/http.go (1)
43-46:⚠️ Potential issue | 🔴 CriticalFix nil-wrapped error on CA append failure.
fmt.Errorf("...: %w", err)will returnnilwhenerris nil, so CA parse failures can incorrectly return(nil, nil)and cause downstream panics.🐛 Proposed fix
- if !caCertPool.AppendCertsFromPEM(caCert) { - return nil, fmt.Errorf("fail to add provided CA cert: %w", err) - } + if !caCertPool.AppendCertsFromPEM(caCert) { + return nil, fmt.Errorf("fail to add provided CA cert") + }
🤖 Fix all issues with AI agents
In `@docs/features/observability/otel.mdx`:
- Around line 755-791: Update the gRPC example JSON so the collector_url uses
bare host:port (e.g., change the value of "collector_url" when "protocol":
"grpc" to "localhost:4317" instead of "http://localhost:4317"), and add a short
note in the otel.mdx text explaining that HTTP protocol examples require full
URLs with paths (e.g., "http://host:4318/v1/traces") while gRPC expects a
host:port string; reference the keys "collector_url" and "protocol" and the
implementation functions grpc.NewClient() vs http.NewRequest() to clarify the
expected formats.
In `@plugins/prometheus/main.go`:
- Around line 105-124: The SetRegistry method currently assigns p.registry and
then builds a pusher using Gatherer(registry) without validating registry; this
can lead to a nil gatherer and a panic later. Add an explicit nil check at the
start of PrometheusPushPlugin.SetRegistry (before assigning p.registry or
calling push.New/Gatherer): if registry == nil return a descriptive error (e.g.,
"nil registry") so you fail fast; keep the existing logic for pusher creation
(push.New(...).Gatherer(registry).Grouping(...)) and BasicAuth setup unchanged
after the nil guard.
In `@ui/lib/types/schemas.ts`:
- Around line 652-721: PrometheusView.tsx is checking a non-existent flag
basic_auth_enabled so basic auth never gets applied; update the conditional that
decides to attach credentials (where it references
config.prometheus_config.basic_auth_enabled) to instead check that both
config.prometheus_config.basic_auth_username and
config.prometheus_config.basic_auth_password are non-empty (use .trim() checks)
before applying basic auth, ensuring you only enable credentials when both
username and password are present.
🧹 Nitpick comments (2)
plugins/otel/metrics.go (2)
74-88: Silent metric creation failures may hide configuration issues.When metric creation fails, the error is logged but subsequent
Addcalls silently do nothing (line 85-87). This is reasonable for resilience but could mask misconfiguration. Consider adding a health check method or startup validation to surface these errors earlier.
505-522: Consider using a struct forBuildBifrostAttributesparameters.The function has 14 parameters, making call sites verbose and error-prone. A struct would improve readability and allow optional fields.
♻️ Example refactor
type BifrostAttributeParams struct { Provider string Model string Method string VirtualKeyID string VirtualKeyName string SelectedKeyID string SelectedKeyName string NumberOfRetries int FallbackIndex int TeamID string TeamName string CustomerID string CustomerName string } func BuildBifrostAttributes(p BifrostAttributeParams) []attribute.KeyValue { // ... }
4574632 to
24c2b4a
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@plugins/otel/grpc.go`:
- Around line 29-34: validateCACertPath currently uses os.Stat which follows
symlinks and can be abused for symlink-based path traversal; update
validateCACertPath to use os.Lstat and explicitly check and reject symlinks
(e.g., if FileMode&os.ModeSymlink != 0) before proceeding with
filepath.Clean/IsAbs checks so that the tlsCACert passed into the cert-reading
logic cannot be a symlink to sensitive files.
In `@plugins/otel/http.go`:
- Around line 44-46: The error returned when
caCertPool.AppendCertsFromPEM(caCert) fails incorrectly wraps a nil variable
`err`; update the fmt.Errorf call in the failing branch to not use %w or `err`
(e.g. return a plain error message like "fail to add provided CA cert" or
include contextual details without wrapping a non-existent error). Locate the
check around caCertPool.AppendCertsFromPEM and replace the fmt.Errorf("fail to
add provided CA cert: %w", err) with a non-wrapping error; mirror the pattern
used in grpc.go's "fail to parse provided CA cert" message if helpful.
In `@ui/lib/types/schemas.ts`:
- Around line 544-546: The frontend schema for metrics_push_interval allows
nulls and non-integer numbers which mismatches the backend's non-nullable int
contract; update the metrics_push_interval zod entry (metrics_push_interval) to
remove .nullable() and enforce integers by adding .int(), e.g. change the chain
to z.number().int().min(1).max(300).default(15) so the frontend only emits
non-null integer seconds compatible with the backend MetricsPushInterval
(plugins/otel/main.go).
24c2b4a to
e748c34
Compare
e748c34 to
36ae04b
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@plugins/otel/go.mod`:
- Around line 8-13: Update the metric export error handling introduced by the
otlpmetricgrpc/otlpmetrichttp v1.39.0 change: locate where your metric
reader/export pipeline (e.g., PeriodicReader, any Shutdown or ForceFlush calls,
and where exporters are invoked) calls the OTLP exporters and explicitly check
their returned error values (including partial export errors) instead of relying
solely on otel.ErrorHandler; handle or propagate returned errors from
otlpmetricgrpc and otlpmetrichttp (e.g., detect partial export errors and
retry/log/aggregate as appropriate) in the PeriodicReader shutdown/force-flush
paths and any immediate export call sites so failures are not silently ignored.
In `@plugins/otel/http.go`:
- Around line 33-57: The current logic leaves transport.TLSClientConfig nil when
insecureMode is true, but Go's defaults still verify TLS; update the branch
handling insecureMode to explicitly set transport.TLSClientConfig =
&tls.Config{InsecureSkipVerify: true, MinVersion: tls.VersionTLS12} (or, if you
want insecure to override any custom CA, ensure the insecureMode check runs
before/after tlsCACert logic as appropriate). Modify the code around tlsCACert,
insecureMode, and transport.TLSClientConfig (and keep validateCACertPath
behavior for tlsCACert) so that when insecureMode is enabled the TLS config
explicitly disables verification.
🧹 Nitpick comments (1)
plugins/otel/metrics.go (1)
184-186: Consider implications of setting global meter provider.
otel.SetMeterProvider(provider)sets the global provider, which affects all OTEL instrumentation in the process. This is appropriate for Bifrost's single-service model, but worth noting if the codebase ever supports multiple OTEL configurations or plugin isolation.
36ae04b to
54b19f9
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@plugins/litellmcompat/go.mod`:
- Around line 110-111: Update the vulnerable OTEL SDK entries by bumping
go.opentelemetry.io/otel/sdk and go.opentelemetry.io/otel/sdk/metric from
v1.39.0 to v1.40.0 in go.mod, then resolve modules and lock changes (e.g., run
go get go.opentelemetry.io/otel/sdk@v1.40.0 and go get
go.opentelemetry.io/otel/sdk/metric@v1.40.0 followed by go mod tidy) and run
tests/build to ensure nothing breaks; target the module paths
go.opentelemetry.io/otel/sdk and go.opentelemetry.io/otel/sdk/metric when making
the change.
In `@plugins/prometheus/main.go`:
- Around line 104-129: Set a bounded HTTP client timeout on the Prometheus
pusher to avoid blocking on network stalls: in SetRegistry
(PrometheusPushPlugin) after building the pusher variable, create an http.Client
with a Timeout (use a configured value like p.config.PushTimeout or a sensible
default, e.g. 10s) and call pusher.Client(yourClient) before assigning p.pusher
so all subsequent pusher.Push() calls (including the final cleanup push and
pushLoop pushes) use the timeout-bound client.
ae802ad to
f3d4f0a
Compare
f3d4f0a to
5ee5d40
Compare
a889237 to
ee8c701
Compare
ee8c701 to
b214fcc
Compare

Add push-based metrics support for multi-node deployments
This PR adds two key observability enhancements for multi-node Bifrost deployments:
Changes
Type of change
Affected areas
How to test
Testing OTEL Metrics Push
Testing Prometheus Push Gateway
Breaking changes
Security considerations