159 changes: 159 additions & 0 deletions addons/gcp/otel-sidecar/README.md
@@ -0,0 +1,159 @@
# GCP OpenTelemetry Sidecar — Cloud Run

Wires Fleet's built-in OpenTelemetry SDK to a sidecar collector running alongside `fleet-api` on Cloud Run. Ships traces, metrics, and logs through whatever collector you mount (Datadog DDOT, the upstream OpenTelemetry Collector, Grafana Alloy, Honeycomb agent, ...).

This addon emits **environment variables only**. You provide the sidecar container yourself via the GCP module's `sidecar_containers` input — examples below.

## Why

Fleet emits OTLP/gRPC for all three signal types (`otlptracegrpc`, `otlpmetricgrpc`, `otlploggrpc`). You can ship that data through any OTel-compatible backend by running a collector as a sidecar in the same Cloud Run instance and pointing Fleet at `localhost:4317`.

Compared to Fleet's other observability paths:

| Path | What it covers | Where it goes |
| ---- | -------------- | ------------- |
| `addons/logging-destination-*` | osquery scheduled query / status / activity audit logs | SaaS log backends |
| **This addon** | **Fleet *server* traces, metrics, logs** (everything except osquery results) | Any OTel-compatible backend |
| `addons/xrays-sidecar` | Fleet server traces (AWS-only) | AWS X-Ray |

## Prerequisites

The GCP module must accept multi-container Cloud Run services, which requires the upstream `GoogleCloudPlatform/cloud-run/google//modules/v2` module to support sidecars without a `ports` block. That fix is being tracked in [GoogleCloudPlatform/terraform-google-cloud-run#450](https://github.com/GoogleCloudPlatform/terraform-google-cloud-run/pull/450) — until it merges, attempting to deploy with `var.sidecar_containers` will fail with:

```
Error 400: Revision template should contain exactly one container with an exposed port.
```

Until upstream merges, you can vendor the patched module locally; this addon's outputs are unaffected by where the cloud-run module comes from.

## Usage

```hcl
module "fleet_otel" {
  source = "github.com/fleetdm/fleet-terraform//addons/gcp/otel-sidecar?ref=main"

  service_name           = "fleet"
  service_version        = "v4.85.0" # match var.fleet_config.image_tag
  deployment_environment = "prod"

  extra_resource_attributes = {
    "team"   = "sre"
    "region" = "us-central1"
  }
}

module "fleet" {
  source = "github.com/fleetdm/fleet-terraform//gcp?ref=main"
  # ... other inputs ...

  fleet_config = merge(var.fleet_config, {
    extra_env_vars = merge(
      coalesce(var.fleet_config.extra_env_vars, {}),
      module.fleet_otel.fleet_extra_environment_variables.universal,
    )
  })

  service_only_env_vars = module.fleet_otel.fleet_extra_environment_variables.service_only

  sidecar_containers = [
    # Pick one of the example sidecar definitions below.
  ]
}
```

The split between `universal` and `service_only` env vars exists because Fleet's migration job (a Cloud Run Job, not a Service) runs without sidecars. The `OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317` var would point to a non-existent listener during migrations, causing exporter retry noise in the job logs. Wire `universal` into `extra_env_vars` (visible to both) and `service_only` into `service_only_env_vars` (Service only).
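
For the inputs in the Usage example, the output should evaluate to roughly the following. This was worked out by hand from `main.tf` and `outputs.tf` in this change, not captured from a real `terraform apply`, and the local name `expected_fleet_otel_env` is purely illustrative.

```hcl
# Hand-derived sketch of module.fleet_otel.fleet_extra_environment_variables
# for the Usage inputs above. `expected_fleet_otel_env` is a hypothetical name.
locals {
  expected_fleet_otel_env = {
    universal = {
      FLEET_LOGGING_TRACING_ENABLED   = "true"
      FLEET_LOGGING_OTEL_LOGS_ENABLED = "true" # enable_otel_logs defaults to true
      OTEL_SERVICE_NAME               = "fleet"
      # Fixed attributes first, then extra_resource_attributes in key order
      # (Terraform iterates maps in lexical key order: region, then team).
      OTEL_RESOURCE_ATTRIBUTES = "service.name=fleet,deployment.environment=prod,service.version=v4.85.0,region=us-central1,team=sre"
    }
    service_only = {
      OTEL_EXPORTER_OTLP_ENDPOINT = "http://localhost:4317" # otlp_endpoint default
      OTEL_EXPORTER_OTLP_PROTOCOL = "grpc"
      OTEL_EXPORTER_OTLP_INSECURE = "true"
    }
  }
}
```

Only the `universal` map reaches the migration job; the three exporter variables stay on the Cloud Run service.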

## Sidecar examples

Each example is a single object you drop into `module.fleet.sidecar_containers = [ ... ]`. Cloud Run requires:

1. Every sidecar declared in another container's `depends_on_container` must have its own `startup_probe`. The probe targets the OTLP gRPC port — once it accepts connections, the collector is ready.
2. Only one container per service may own the ingress port. Set `ports = { name = "", container_port = 0 }` on sidecars to opt out (sentinel honored by the cloud-run v2 module once [PR #450](https://github.com/GoogleCloudPlatform/terraform-google-cloud-run/pull/450) lands).

### OpenTelemetry Collector (contrib) — vendor-neutral

```hcl
sidecar_containers = [
  {
    container_name  = "otel-collector"
    container_image = "otel/opentelemetry-collector-contrib:latest"
    container_args  = ["--config=/etc/otelcol/config.yaml"]

    # You need to mount a config.yaml. Store it as a Secret Manager secret
    # and mount via volume_mounts. See addons/gcp/otel-sidecar/examples/
    # for a config that exports to Datadog, Honeycomb, or Grafana Cloud.
    volume_mounts = [{
      name       = "otel-config"
      mount_path = "/etc/otelcol"
    }]

    resources = {
      limits = { cpu = "500m", memory = "256Mi" }
    }
    ports = { name = "", container_port = 0 }
    startup_probe = {
      tcp_socket            = { port = 4317 }
      initial_delay_seconds = 5
      period_seconds        = 5
      timeout_seconds       = 2
      failure_threshold     = 30
    }
  }
]
```
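
The collector example above expects its `config.yaml` to live in Secret Manager. A minimal sketch of creating that secret follows. The secret ID `fleet-otel-collector-config`, the local file `otel-config.yaml`, and `var.project_id` are all assumptions about the caller's configuration, not names this addon defines.

```hcl
# Hypothetical names throughout; adjust to your own configuration.
resource "google_secret_manager_secret" "otel_config" {
  project   = var.project_id # assumes the caller already declares project_id
  secret_id = "fleet-otel-collector-config"

  # google provider 5.x syntax; 4.x used `automatic = true` instead.
  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "otel_config" {
  secret = google_secret_manager_secret.otel_config.id
  # Assumed path to a collector config kept next to this Terraform code.
  secret_data = file("${path.module}/otel-config.yaml")
}
```

Two caveats, judging only from the files in this change: the IAM grant in `gcp/byo-project/iam.tf` iterates `env_secret_vars`, so a secret that is only volume-mounted would need its own `roles/secretmanager.secretAccessor` grant; and the `sidecar_containers` type in `gcp/byo-project/variables.tf` does not declare `volume_mounts`, so Terraform would likely discard that attribute unless the type is extended.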

### Datadog DDOT collector — Datadog backend

DDOT is the Datadog Agent in OTel-collector mode (`DD_OTELCOLLECTOR_ENABLED=true`). Datadog documents it for Linux, Kubernetes, and EKS Fargate. The Cloud Run install path isn't covered by Datadog's docs, but it works in practice (verified in production, shipping all three signal types to the `us5` site).

Use `agent:latest-full`, not `agent:latest` — the `-full` variant bundles the DDOT subprocess.

```hcl
sidecar_containers = [
  {
    container_name  = "datadog-agent"
    container_image = "gcr.io/datadoghq/agent:latest-full"

    env_vars = {
      DD_SITE                  = "us5.datadoghq.com" # or datadoghq.com, datadoghq.eu, etc.
      DD_SERVICE               = "fleet"
      DD_ENV                   = "prod"
      DD_VERSION               = "v4.85.0"
      DD_OTELCOLLECTOR_ENABLED = "true"
      DD_LOGS_ENABLED          = "true"
      DD_HOSTNAME              = "fleet-api-cloudrun" # Cloud Run instances are ephemeral; pin to a logical hostname
    }
    env_secret_vars = {
      DD_API_KEY = {
        secret  = google_secret_manager_secret.datadog_api_key.secret_id
        version = "latest"
      }
    }

    resources = {
      limits = { cpu = "1", memory = "512Mi" }
    }
    ports = { name = "", container_port = 0 }
    startup_probe = {
      tcp_socket            = { port = 4317 }
      initial_delay_seconds = 5
      period_seconds        = 5
      timeout_seconds       = 2
      failure_threshold     = 30
    }
  }
]
```

Set `enable_otel_logs = false` on this addon if you're using `gcr.io/datadoghq/serverless-init` instead — its OTLP logs pipeline is broken ([datadog-agent#34097](https://github.com/DataDog/datadog-agent/issues/34097)). DDOT's is fine.
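
As a sketch, switching the logs exporter off for a `serverless-init` sidecar only needs the `enable_otel_logs` input declared in this addon's `variables.tf`. The other inputs are left at their defaults here, which is an assumption about your setup.

```hcl
module "fleet_otel" {
  source = "github.com/fleetdm/fleet-terraform//addons/gcp/otel-sidecar?ref=main"

  # Emits FLEET_LOGGING_OTEL_LOGS_ENABLED = "false" in the `universal` map;
  # traces and metrics still go to the sidecar's OTLP endpoint.
  enable_otel_logs = false
}
```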

### Grafana Alloy — Grafana Cloud / self-hosted Grafana stack

Similar pattern: the image is `grafana/alloy:latest`, and the args point at a config file mounted via `volume_mounts`. The config uses the Alloy configuration syntax (formerly called River) with an `otelcol.receiver.otlp` block.
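
A sketch of what that sidecar object might look like, mirroring the collector example. The mount path `/etc/alloy`, the file name `config.alloy`, the volume name, and the resource limits are assumptions. It also assumes the mounted config's `otelcol.receiver.otlp` block listens for gRPC on port 4317, since the startup probe targets that port.

```hcl
sidecar_containers = [
  {
    container_name  = "alloy"
    container_image = "grafana/alloy:latest"
    # `run <config-path>` is the Alloy CLI's run command; the path is assumed.
    container_args = ["run", "/etc/alloy/config.alloy"]

    # Hypothetical volume holding the Alloy config file.
    volume_mounts = [{
      name       = "alloy-config"
      mount_path = "/etc/alloy"
    }]

    resources = {
      limits = { cpu = "500m", memory = "256Mi" } # assumed sizing
    }
    ports = { name = "", container_port = 0 }
    startup_probe = {
      tcp_socket            = { port = 4317 }
      initial_delay_seconds = 5
      period_seconds        = 5
      timeout_seconds       = 2
      failure_threshold     = 30
    }
  }
]
```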

## Notes

- Fleet sets the `service.name` resource attribute to `fleet` internally via `semconv.ServiceName("fleet")`. This addon also sets `OTEL_SERVICE_NAME` as an env var so callers can override it.
- The Go OTel SDK defaults to TLS for OTLP gRPC. Talking plaintext to a localhost sidecar requires `OTEL_EXPORTER_OTLP_INSECURE=true` (this addon sets it).
- `FLEET_LOGGING_OTEL_LOGS_ENABLED=true` requires `FLEET_LOGGING_TRACING_ENABLED=true` (Fleet validates this on startup — see `cmd/fleet/serve.go`).
- The migration job runs only once per `image_tag` change. Even without the env-var split, the noise window is brief, but the split is correct hygiene.
18 changes: 18 additions & 0 deletions addons/gcp/otel-sidecar/main.tf
@@ -0,0 +1,18 @@
# This addon outputs the Fleet env vars needed to enable OpenTelemetry export
# to an OTLP gRPC receiver on localhost:4317. It does not provision a sidecar
# container itself — callers wire their preferred collector (DDOT,
# otelcol-contrib, Grafana Alloy, Honeycomb agent, ...) via the gcp module's
# var.sidecar_containers input.
#
# See the README for example sidecar definitions.

locals {
  resource_attributes = join(",", concat(
    [
      "service.name=${var.service_name}",
      "deployment.environment=${var.deployment_environment}",
    ],
    var.service_version != null ? ["service.version=${var.service_version}"] : [],
    [for k, v in var.extra_resource_attributes : "${k}=${v}"],
  ))
}
22 changes: 22 additions & 0 deletions addons/gcp/otel-sidecar/outputs.tf
@@ -0,0 +1,22 @@
output "fleet_extra_environment_variables" {
  description = "Env vars for Fleet, split into two maps. Pass `universal` to module.fleet.fleet_config.extra_env_vars (seen by both the migration job and the Cloud Run service) and `service_only` to module.fleet.service_only_env_vars (Cloud Run service only). The OTLP endpoint var is service-only since the migration job runs without sidecars."
  value = {
    # Universal across job + service: enables Fleet's OTel SDK init. Safe
    # because if no collector is reachable, the SDK silently retries with
    # exponential backoff, with no Fleet-side error.
    universal = {
      FLEET_LOGGING_TRACING_ENABLED   = "true"
      FLEET_LOGGING_OTEL_LOGS_ENABLED = var.enable_otel_logs ? "true" : "false"
      OTEL_SERVICE_NAME               = var.service_name
      OTEL_RESOURCE_ATTRIBUTES        = local.resource_attributes
    }
    # Service-only: depends on a sidecar listening at this address. The
    # migration job runs without sidecars, so it must not see these.
    service_only = {
      OTEL_EXPORTER_OTLP_ENDPOINT = var.otlp_endpoint
      OTEL_EXPORTER_OTLP_PROTOCOL = "grpc"
      # The Go OTel SDK defaults to TLS; localhost is plaintext.
      OTEL_EXPORTER_OTLP_INSECURE = "true"
    }
  }
}
35 changes: 35 additions & 0 deletions addons/gcp/otel-sidecar/variables.tf
@@ -0,0 +1,35 @@
variable "service_name" {
  description = "OTel service.name resource attribute. Surfaces in backends as the service tag/dimension."
  type        = string
  default     = "fleet"
}

variable "service_version" {
  description = "OTel service.version resource attribute. Recommended: set to the Fleet image tag (e.g. \"v4.85.0\") so per-release attribution works in the backend."
  type        = string
  default     = null
}

variable "deployment_environment" {
  description = "OTel deployment.environment resource attribute (e.g. \"prod\", \"staging\")."
  type        = string
  default     = "prod"
}

variable "extra_resource_attributes" {
  description = "Additional OTel resource attributes merged into OTEL_RESOURCE_ATTRIBUTES. Useful for custom tags like {team = \"sre\", region = \"us-central1\"}."
  type        = map(string)
  default     = {}
}

variable "otlp_endpoint" {
  description = "OTLP gRPC endpoint Fleet's OTel SDK exporters target. Defaults to the localhost sidecar pattern; override only if you're routing through a different listener (e.g. a Unix socket or an inline DDOT subprocess port)."
  type        = string
  default     = "http://localhost:4317"
}

variable "enable_otel_logs" {
  description = "Enable Fleet's OTel logs exporter (FLEET_LOGGING_OTEL_LOGS_ENABLED). Logs ship to the same OTLP endpoint as traces and metrics. Set to false if your collector doesn't accept OTLP logs (notably gcr.io/datadoghq/serverless-init; see datadog-agent#34097)."
  type        = bool
  default     = true
}
3 changes: 3 additions & 0 deletions addons/gcp/otel-sidecar/versions.tf
@@ -0,0 +1,3 @@
terraform {
  required_version = ">= 1.5"
}
19 changes: 15 additions & 4 deletions gcp/byo-project/cloud_run.tf
@@ -37,9 +37,18 @@ locals {
FLEET_S3_SOFTWARE_INSTALLERS_REGION = var.region
})

# fleet_env_vars is the safe baseline that works in any execution context.
# The migration job (Cloud Run Job, no sidecars) uses it directly. The
# Cloud Run service layers var.service_only_env_vars on top — that's where
# things like OTel exporter endpoints live, since the job can't reach a
# localhost OTLP receiver that only exists in the service context.
fleet_service_env_vars = merge(local.fleet_env_vars, var.service_only_env_vars)

fleet_vpc_network_id = module.vpc.network_id
# Use the direct construction for the subnet ID key as discussed
fleet_vpc_subnet_id = "fleet-subnet"

sidecar_container_names = [for c in var.sidecar_containers : c.container_name]
}

module "fleet-service" {
@@ -73,9 +82,11 @@ module "fleet-service" {
max_instance_count = var.fleet_config.max_instance_count
}

-containers = [
+containers = concat([
{
-container_image = local.fleet_image_tag
+container_name = "fleet"
+container_image = local.fleet_image_tag
+depends_on_container = local.sidecar_container_names
ports = {
name = var.fleet_config.use_h2c ? "h2c" : "http1"
container_port = 8080
@@ -112,10 +123,10 @@ module "fleet-service" {
limits = local.fleet_resources_limits
}

-env_vars = local.fleet_env_vars
+env_vars = local.fleet_service_env_vars
env_secret_vars = local.fleet_secrets_env_vars
}
-]
+], var.sidecar_containers)
}

# --- Cloud Run Job (Migrations) ---
15 changes: 15 additions & 0 deletions gcp/byo-project/iam.tf
@@ -48,3 +48,18 @@ resource "google_secret_manager_secret_iam_member" "fleet_run_sa_private_key_sec
depends_on = [google_secret_manager_secret.private_key]
}

# Sidecar containers may reference their own secrets (e.g. an observability
# agent's API key). The Cloud Run SA runs every container in the service, so
# it needs accessor on each sidecar secret too. Iterating var.sidecar_containers
# keeps this grant in sync with whatever sidecars callers wire in.
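resource "google_secret_manager_secret_iam_member" "fleet_run_sa_sidecar_secret_access" {
  # Key each grant by "<container_name>/<env var name>" so two sidecars that
  # use the same env var name (e.g. both set DD_API_KEY) don't overwrite each
  # other in the merged map.
  for_each = merge([
    for c in var.sidecar_containers : {
      for name, ref in c.env_secret_vars : "${c.container_name}/${name}" => ref
    }
  ]...)

  project   = var.project_id
  secret_id = each.value.secret
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${google_service_account.fleet_run_sa.email}"
}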

69 changes: 69 additions & 0 deletions gcp/byo-project/variables.tf
@@ -88,6 +88,75 @@ variable "vpc_config" {
}

}

variable "sidecar_containers" {
  description = <<-EOT
    Optional sidecar containers to run alongside Fleet in the fleet-api Cloud
    Run service. Shape matches the cloud-run module's container object.

    Useful for OpenTelemetry collectors (otelcol-contrib, Datadog DDOT, Grafana
    Alloy, etc.) that expose an OTLP receiver Fleet can ship traces, metrics,
    and logs into via its built-in OTel SDK exporters.

    Requirements imposed by Cloud Run:

    - Each sidecar must declare a startup_probe. Cloud Run rejects any
      depends_on reference to a container without one.
    - Only one container per service may own the ingress port. Set
      ports = { name = "", container_port = 0 } on sidecars to opt out
      (requires the cloud-run v2 module to accept container_port = 0 as a
      "no exposed port" sentinel; see
      https://github.com/GoogleCloudPlatform/terraform-google-cloud-run/pull/450).
  EOT
  type = list(object({
    container_name       = string
    container_image      = string
    container_args       = optional(list(string))
    container_command    = optional(list(string))
    depends_on_container = optional(list(string))
    env_vars             = optional(map(string), {})
    env_secret_vars = optional(map(object({
      secret  = string
      version = string
    })), {})
    ports = optional(object({
      name           = optional(string)
      container_port = optional(number)
    }))
    resources = optional(object({
      limits = optional(object({
        cpu    = optional(string)
        memory = optional(string)
      }))
      cpu_idle          = optional(bool, true)
      startup_cpu_boost = optional(bool, false)
    }), {})
    startup_probe = optional(object({
      failure_threshold     = optional(number)
      initial_delay_seconds = optional(number)
      timeout_seconds       = optional(number)
      period_seconds        = optional(number)
      tcp_socket = optional(object({
        port = optional(number)
      }))
    }))
  }))
  default = []
}

variable "service_only_env_vars" {
  description = <<-EOT
    Extra env vars applied only to the fleet-api Cloud Run service, not the
    migration job. Use this for vars that depend on a sidecar container being
    present, such as OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317. The
    migration job runs as a Cloud Run Job with no sidecars, so localhost:4317
    has no listener and the OTel exporter would log retry errors during the
    brief job run.
  EOT
  type        = map(string)
  default     = {}
}

variable "fleet_config" {
type = object({
installers_bucket_name = string