gcp: support sidecar containers + addons/gcp/otel-sidecar #245

Open
robbiet480 wants to merge 1 commit into fleetdm:main from CampusTech:feat-otel-sidecar-support

Conversation

robbiet480 commented May 29, 2026

Summary

Adds two inputs to the gcp module so callers can colocate sidecars (typically OpenTelemetry collectors) with the fleet-api Cloud Run service, plus a new addons/gcp/otel-sidecar addon that emits the matching Fleet env vars.

  • var.sidecar_containers (list of container objects) — wires arbitrary sidecars into the Cloud Run service. Matches the upstream cloud-run module's container schema. Useful for any collector exposing an OTLP receiver on a localhost port (Datadog DDOT, otelcol-contrib, Grafana Alloy, Honeycomb agent, …).
  • var.service_only_env_vars (map) — env vars applied to the Cloud Run service but not the migration job. Needed for vars that depend on a sidecar being present, e.g. OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 — the migration job (a Cloud Run Job) has no sidecars, so it would log exporter retry noise during each migration if those vars leaked in.
  • addons/gcp/otel-sidecar/ — emits the FLEET_LOGGING_TRACING_ENABLED, FLEET_LOGGING_OTEL_LOGS_ENABLED, OTEL_EXPORTER_OTLP_*, and OTEL_RESOURCE_ATTRIBUTES env vars in a {universal, service_only} split so callers route them correctly. Doesn't ship a sidecar container itself — the README has copy-pasteable examples for OpenTelemetry Collector contrib and Datadog DDOT.
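As a rough illustration of how the two inputs fit together, a caller's module block might look like the sketch below. The module source path, the collector image tag, and the `container_name` / `container_image` field names are assumptions (the field names are modeled on the upstream cloud-run v2 module's container schema); the addon README remains the reference for real values.

```hcl
# Sketch only, not the module's documented interface.
module "fleet" {
  source = "./gcp" # placeholder path; point this at wherever the gcp module lives

  # ... all existing required inputs stay as they are today ...

  # One OTLP-capable collector colocated with the fleet-api service.
  sidecar_containers = [
    {
      container_name  = "otel-collector"                              # assumed field name
      container_image = "otel/opentelemetry-collector-contrib:latest" # assumed image and tag
      # Deliberately no ports entry: only the main container may expose
      # the ingress port on a Cloud Run service.
    }
  ]

  # Applied to the Cloud Run service only, never to the migration job.
  service_only_env_vars = {
    OTEL_EXPORTER_OTLP_ENDPOINT = "http://localhost:4317"
  }
}
```

Leaving both inputs out keeps the defaults (`[]` and `{}`), which is the backwards-compatible path.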

Follows the addons/gcp/* convention established by #223 (okta-conditional-access) and #231 (pubsub-to-bigquery).

Why

Fleet's server has built-in OpenTelemetry exporters for traces, metrics, and logs (otlptracegrpc, otlpmetricgrpc, otlploggrpc — all gRPC). The pattern that actually works on Cloud Run is: run an OTel collector as a sidecar in the same instance, point Fleet at localhost:4317, let the collector forward to whatever backend.

Today the gcp module only renders a single container per service, so there's no clean way to express this. Direct OTLP intake from Fleet to a SaaS backend isn't a workable substitute — Datadog's direct intake is HTTP-only and requires delta temporality, neither of which Fleet supports without source patches.

Comparable prior art: addons/xrays-sidecar does the same job for AWS X-Ray on ECS. This PR fills the GCP gap.

Backwards compatibility

Both new inputs default to empty ([] and {}). With defaults, the rendered output is functionally identical to today's: the only field added is container_name = "fleet" on the main container, which forces one revision swap on the first deploy after upgrade but changes no behavior.

The migration job is untouched. The fleet-service Cloud Run service gets the new container name + a depends_on only when callers actually pass sidecars.
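The service/job routing can be pictured with a small `locals` sketch. This is not the module's actual code: the local names are hypothetical, and which variable lands in which map is chosen here purely for illustration.

```hcl
# Hypothetical locals illustrating the routing, not the module's real implementation.
locals {
  # Safe for both the Cloud Run service and the migration job (illustrative choice).
  universal_env_vars = {
    FLEET_LOGGING_TRACING_ENABLED = "true"
  }

  # Only meaningful when a sidecar is listening on localhost.
  service_only_env_vars = {
    OTEL_EXPORTER_OTLP_ENDPOINT = "http://localhost:4317"
  }

  # The service sees both maps; on a key collision the later argument wins.
  service_env = merge(local.universal_env_vars, local.service_only_env_vars)

  # The migration job sees only the universal map, so it never tries to
  # export to a listener that does not exist.
  job_env = local.universal_env_vars
}
```

With `service_only_env_vars = {}`, `merge()` returns the universal map unchanged, which is why the empty default leaves the rendered environment as it is today.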

Prerequisite

Multi-container Cloud Run services require GoogleCloudPlatform/terraform-google-cloud-run#450 (open, mergeable, awaiting maintainer review) to land. That PR fixes a lookup() != {} skip in the v2 module's dynamic "ports" block that prevents sidecars from opting out of the ingress port. Until it merges, attempting terraform apply with non-empty sidecar_containers fails with:

Error 400: Revision template should contain exactly one container with an exposed port.

I've called this out clearly in the addon README and the var.sidecar_containers description. Callers can vendor the patched module locally as a workaround — that's how the production deployment this PR is extracted from runs today.
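For readers unfamiliar with the constraint behind that error, the sketch below shows the shape Cloud Run accepts, written directly against the `google_cloud_run_v2_service` resource rather than the upstream module. The service name, region, images, and port number are placeholders, and this is not the code the gcp module renders.

```hcl
# Minimal multi-container sketch; all names and images are placeholders.
resource "google_cloud_run_v2_service" "example" {
  name     = "fleet-api-example"
  location = "us-central1"

  template {
    # Main container: the only one that declares a ports block.
    containers {
      name  = "fleet"
      image = "fleetdm/fleet:latest"

      ports {
        container_port = 8080
      }

      # Start the main container only after the sidecar has started.
      depends_on = ["otel-collector"]

      env {
        name  = "OTEL_EXPORTER_OTLP_ENDPOINT"
        value = "http://localhost:4317"
      }
    }

    # Sidecar: no ports block at all. Declaring one here is what triggers
    # "exactly one container with an exposed port".
    containers {
      name  = "otel-collector"
      image = "otel/opentelemetry-collector-contrib:latest"
    }
  }
}
```

The upstream fix matters because a wrapper module has to be able to omit the `ports` block for sidecars, as the second `containers` block does here.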

This PR is fine to merge before #450; the new inputs default to empty so nothing breaks for existing users, and the new code path becomes useful once #450 ships.

Validation

  • terraform fmt -recursive + terraform validate clean for both gcp/ and addons/gcp/otel-sidecar/.
  • The runtime behavior — Fleet's OTel SDK shipping traces, metrics, and logs through a DDOT sidecar (gcr.io/datadoghq/agent:latest-full + DD_OTELCOLLECTOR_ENABLED=true) to Datadog us5 — is live in production at Campus today, on both a primary fleet-api service and a secondary bulk service. Verified all three signal types land in Datadog APM Traces, Metrics Explorer, and Log Explorer.

Test plan

  • On a clean GCP project, set sidecar_containers = [] and service_only_env_vars = {} → confirm no infrastructure change from current main.
  • After GoogleCloudPlatform/terraform-google-cloud-run#450 ("fix: permit optional ports for sidecars so we can run sidecars") lands, add a sidecar per the addon README's "OpenTelemetry Collector (contrib)" example → confirm Fleet's /healthz stays passing and OTel data reaches the collector (e.g. via the debug exporter).
  • Confirm the migration job (Cloud Run Job) does not receive OTEL_EXPORTER_OTLP_ENDPOINT after wiring service_only_env_vars.

robbiet480 marked this pull request as ready for review May 29, 2026 01:44
robbiet480 requested review from a team and ddribeiro as code owners May 29, 2026 01:44
Adds two inputs to the gcp module so callers can wire OpenTelemetry
collectors (or any other sidecar) into the fleet-api Cloud Run service:

- sidecar_containers: list of arbitrary sidecar containers to colocate
  with Fleet. Lets users plug in DDOT, otelcol-contrib, Grafana Alloy,
  or any other observability agent that exposes an OTLP receiver on a
  localhost port.

- service_only_env_vars: extra env vars applied only to the Cloud Run
  service but not the migration job. The migration job runs as a Cloud
  Run Job without sidecars, so things like OTEL_EXPORTER_OTLP_ENDPOINT
  would point to a nonexistent listener during job execution.

Behavior is unchanged when both inputs use their empty defaults —
existing deployments don't need to update tfvars.

The new addons/gcp/otel-sidecar addon emits the corresponding Fleet env
vars (FLEET_LOGGING_TRACING_ENABLED, OTEL_EXPORTER_OTLP_*, etc.) in a
{universal, service_only} split so callers route them correctly. It
doesn't ship a sidecar container itself; the README has copy-pasteable
examples for OpenTelemetry Collector contrib and Datadog DDOT.

Multi-container Cloud Run services depend on
GoogleCloudPlatform/cloud-run/google//modules/v2 accepting sidecars
without a ports block, tracked in
GoogleCloudPlatform/terraform-google-cloud-run#450. Until that lands
upstream, callers using sidecar_containers will hit "exactly one
container with an exposed port" — documented in the addon README.
robbiet480 force-pushed the feat-otel-sidecar-support branch from 443b3b0 to 79f7005 May 29, 2026 01:55
robbiet480 changed the title from "gcp: support sidecar containers + cloud-run-otel-sidecar addon" to "gcp: support sidecar containers + addons/gcp/otel-sidecar" May 29, 2026
BCTBB (Contributor) commented Jun 4, 2026

@robbiet480 Thank you for your contribution! We'll schedule the review of your proposed changes for next sprint.

erikngo commented Jun 5, 2026

@BCTBB this is quite an important fix. The official Terraform module currently blocks the sidecar pattern for Cloud Run, so for now we have to maintain a forked module to work around it.
