Commit 737d527

motatoes and claude authored
Ship platform logs to Axiom via Vector (#240)
* ship platform logs to Axiom via Vector with cross-service request_id

  Adds internal/obslog: JSON slog with a stable host envelope (service, service_id, cell_id, region, hostname, host_ip, version) and per-request fields (request_id, sandbox_id, worker_id) propagated via context. The package installs itself as slog.Default AND redirects stdlib log.Printf through slog so existing log call sites emit JSON automatically.

  Wires obslog into both control plane (cmd/server, internal/controlplane, internal/api/router) and worker (cmd/worker, internal/worker), replacing echo's middleware.Logger with obslog.EchoMiddleware.

  internal/proxy forwards X-Request-Id in the Director so worker log lines share the same id as the control plane line for proxied requests.

  internal/config adds OPENSANDBOX_CELL_ID and OPENSANDBOX_HOST_IP; CellID defaults to "<region>-default" when unset.

  deploy/vector/ ships Vector configs that read journald (worker, dev-host) or docker_logs (control-plane), parse JSON, enrich non-JSON lines (kernel, systemd) with the host envelope from env, and forward to a NEW Axiom dataset oc-platform-logs — kept separate from the existing customer oc-sandbox-logs dataset for cost / retention / blast-radius reasons. install.sh accepts roles {worker, control-plane, dev-host}.

  deploy/azure/deploy-azure-dev.sh: extend rsync excludes so local dev-env-secrets-* and dev-vector-token-* files don't end up on the VM.

  .gitignore: cover deploy/azure/.dev-vector-token-*.

  Tests in internal/obslog cover the host envelope, context propagation, the Echo middleware round-trip (including a forwarded X-Request-Id appearing on both handler and access-log lines), and the link-local filter in detectHostIP.

  Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ship Vector via Packer + deploy-server: prod gets logs without a follow-up PR

  Wires Vector install into both deploy paths so merging this PR doesn't leave a "Vector still needs deploying" gap.

  - deploy/vector/populate-vector-env.{sh,service}: systemd oneshot that runs Before=vector.service. Reads SECRETS_VAULT_NAME from worker.env (or server.env), fetches the platform-logs ingest token from Azure Key Vault via the VM's managed identity (IMDS → AAD → KV REST), and writes /etc/opensandbox/vector.env so Vector picks it up. Token never appears in any image or workflow artifact. The script exits 0 on any failure path (no IMDS, no vault, missing secret) so a logging credential problem doesn't break the worker boot.
  - deploy/vector/install.sh: installs Vector + drops the role config + installs the populator unit + wires the systemd drop-in with the env files (worker.env, server.env, vector.env). Idempotent. PACKER_BUILD=1 skips the systemctl start so the AMI capture doesn't trip over a started service.
  - deploy/vector/control-plane.yaml: rewrote from docker_logs to journald. Current prod CP runs the server binary directly under systemd (per deploy-server.yml), not in Docker.
  - deploy/packer/worker-ami.pkr.hcl: new provisioner step that extracts /tmp/packer-vector-ctx.tar.gz (created by CI in build-worker-ami.yml) and runs install.sh worker with PACKER_BUILD=1. New worker AMIs come out with Vector pre-installed and pre-configured; populator runs at first boot to fetch the token.
  - .github/workflows/build-worker-ami.yml: pre-tars deploy/vector/ to /tmp/packer-vector-ctx.tar.gz before invoking Packer.
  - .github/workflows/deploy-server.yml: bundles deploy/vector/ as bin/vector-deploy.tar.gz, uploads to blob storage alongside the server binary, and runs install.sh control-plane on each CP host via az vm run-command (idempotent — refreshes the config on every deploy).

  Operator prerequisite: create a `shared-axiom-platform-ingest-token` secret in each prod KV. Documented in the PR description's env-vars checklist.

  Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* vector: rename CELL_ID/HOST_IP to OPENCOMPUTER_*; fetch dataset from KV too

  Two reviewer-requested changes:

  - Rename OPENSANDBOX_CELL_ID → OPENCOMPUTER_CELL_ID and OPENSANDBOX_HOST_IP → OPENCOMPUTER_HOST_IP. New fields use the product-named prefix; existing OPENSANDBOX_* fields untouched. Touch: config.go (env var read), the three vector configs (VRL refs), populate-vector-env.sh (env file write), install.sh (auto-detect), vector.env.example.
  - AXIOM_PLATFORM_DATASET is now also fetched from Key Vault as `shared-axiom-platform-dataset` alongside the token. No default value baked into the configs — a missing dataset secret fails Vector healthcheck (loud) instead of silently shipping to a presumed default. populate-vector-env.sh fetches both secrets in one pass with a shared IMDS-acquired AAD token.

  Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* keyvault: register shared-axiom-platform-{ingest-token,dataset}

  Vector reads these from /etc/opensandbox/vector.env via the dedicated populate-vector-env.service, but secretMapping is the documented source of truth for "what shared-* secrets does this deployment need in KV". Adding the entries so:

  - operators have one place to audit required KV secrets
  - the Go binary loads them into its own env at startup (side-effect), so future admin endpoints / health views can surface platform-stream config without a separate KV fetch

  Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent da5220c commit 737d527
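
The commit message says internal/proxy forwards X-Request-Id in the Director so worker and control-plane lines share one id. The internal/proxy diff is not shown on this page, so the following is only a minimal sketch of that idea using the standard library's httputil.ReverseProxy. The context key, the helper names, and the placeholder addresses are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// requestIDKey is a hypothetical context key; the real obslog package may
// store the id differently.
type requestIDKey struct{}

// withRequestID returns a copy of ctx that carries id.
func withRequestID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, requestIDKey{}, id)
}

// newWorkerProxy wraps the stock single-host Director so the outbound
// request carries the id found on the inbound request's context.
func newWorkerProxy(target *url.URL) *httputil.ReverseProxy {
	p := httputil.NewSingleHostReverseProxy(target)
	base := p.Director
	p.Director = func(r *http.Request) {
		base(r)
		if id, ok := r.Context().Value(requestIDKey{}).(string); ok && id != "" {
			r.Header.Set("X-Request-Id", id)
		}
	}
	return p
}

// forwardedID runs the Director once and reports the header a worker would
// receive. It performs no network I/O.
func forwardedID(id string) string {
	target, _ := url.Parse("http://worker.internal:8080") // placeholder address
	req, _ := http.NewRequest(http.MethodGet, "http://cp.internal/sandboxes/abc", nil)
	req = req.WithContext(withRequestID(req.Context(), id))
	newWorkerProxy(target).Director(req)
	return req.Header.Get("X-Request-Id")
}

func main() {
	fmt.Println(forwardedID("req-123"))
}
```

Wrapping the existing Director rather than replacing it keeps the default URL rewriting intact, which is one plausible reason to add the header there instead of in a separate middleware.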

24 files changed

Lines changed: 1192 additions & 9 deletions
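
Two small behaviours from the commit message, the link-local filter in detectHostIP and the "<region>-default" fallback for CellID, can be sketched as below. Neither internal/config nor detectHostIP appears on this page, so the function names, the IPv4-only restriction, and the loopback check are assumptions, not the author's implementation.

```go
package main

import (
	"fmt"
	"net"
)

// pickHostIP returns the first candidate that parses as an IPv4 address and
// is neither loopback nor link-local (169.254.0.0/16). It returns "" when no
// candidate qualifies. Restricting to IPv4 is an assumption of this sketch.
func pickHostIP(candidates []string) string {
	for _, c := range candidates {
		ip := net.ParseIP(c)
		if ip == nil || ip.To4() == nil {
			continue
		}
		if ip.IsLoopback() || ip.IsLinkLocalUnicast() {
			continue
		}
		return ip.String()
	}
	return ""
}

// defaultCellID applies the "<region>-default" fallback described in the
// commit message when no cell id is configured.
func defaultCellID(cellID, region string) string {
	if cellID != "" {
		return cellID
	}
	return region + "-default"
}

func main() {
	fmt.Println(pickHostIP([]string{"127.0.0.1", "169.254.10.1", "10.0.0.5"}))
	fmt.Println(defaultCellID("", "westus2"))
}
```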

.github/workflows/build-worker-ami.yml

Lines changed: 8 additions & 0 deletions
@@ -94,6 +94,14 @@ jobs:
             deploy/ec2/build-rootfs-docker.sh \
             scripts/claude-agent-wrapper/

+      - name: Package Vector configs
+        # Bundled separately so Packer's file provisioner has a known
+        # tarball path (it can't reliably upload an arbitrary directory).
+        # install.sh on the builder extracts and installs Vector + the
+        # KV-token populator into the AMI.
+        run: |
+          tar czf /tmp/packer-vector-ctx.tar.gz -C deploy vector
+
       - name: Azure Login
         uses: azure/login@v2
         with:

.github/workflows/deploy-server.yml

Lines changed: 32 additions & 0 deletions
@@ -50,6 +50,13 @@ jobs:
       - name: Package web assets
         run: tar czf bin/web-dist.tar.gz -C web dist

+      - name: Package Vector configs
+        # Bundle deploy/vector/ for distribution to control-plane hosts.
+        # install.sh is idempotent: skips Vector install if already present,
+        # always refreshes config + restarts. Workers get the same files via
+        # the Packer image bake (build-worker-ami.yml).
+        run: tar czf bin/vector-deploy.tar.gz -C deploy vector
+
       - name: Azure Login
         uses: azure/login@v2
         with:
@@ -75,6 +82,14 @@ jobs:
             --file bin/web-dist.tar.gz \
             --overwrite true -o none

+          az storage blob upload \
+            --account-name ${{ secrets.AZURE_STORAGE_ACCOUNT }} \
+            --account-key "${{ secrets.AZURE_STORAGE_KEY }}" \
+            --container-name checkpoints \
+            --name deploy/vector-deploy.tar.gz \
+            --file bin/vector-deploy.tar.gz \
+            --overwrite true -o none
+
           echo "Artifacts uploaded to blob storage"

       - name: Discover control planes
@@ -120,6 +135,12 @@ jobs:
                 --container-name checkpoints --name deploy/web-dist.tar.gz \
                 --file /tmp/web-dist.tar.gz --overwrite true -o none

+              az storage blob download \
+                --account-name ${{ secrets.AZURE_STORAGE_ACCOUNT }} \
+                --account-key '${{ secrets.AZURE_STORAGE_KEY }}' \
+                --container-name checkpoints --name deploy/vector-deploy.tar.gz \
+                --file /tmp/vector-deploy.tar.gz --overwrite true -o none
+
               # Stop, replace binary, extract frontend
               systemctl stop opensandbox-server
               sleep 1
@@ -136,6 +157,17 @@ jobs:
               fi

               systemctl start opensandbox-server
+
+              # Install/refresh Vector + the KV token populator. install.sh
+              # is idempotent (skips Vector install if already present,
+              # always refreshes config + reloads). populate-vector-env
+              # fetches AXIOM_PLATFORM_TOKEN from KV via the VM's managed
+              # identity at every boot (and on this restart).
+              mkdir -p /tmp/vector-deploy
+              tar xzf /tmp/vector-deploy.tar.gz -C /tmp/vector-deploy
+              bash /tmp/vector-deploy/vector/install.sh control-plane
+              rm -rf /tmp/vector-deploy /tmp/vector-deploy.tar.gz
+
               echo 'Deploy complete'
             " -o none

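
The populator described above fetches the ingest token through the VM's managed identity (IMDS → AAD → KV REST). The real populate-vector-env.sh is a shell script that is not shown on this page; the sketch below only illustrates the two HTTP requests that such a flow typically makes, written in Go to match the rest of the repository. The function names, the vault name `example-vault`, and the chosen API versions are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// imdsTokenEndpoint is the Azure Instance Metadata Service token endpoint,
// reachable only from inside the VM.
const imdsTokenEndpoint = "http://169.254.169.254/metadata/identity/oauth2/token"

// imdsTokenURL builds the URL that asks IMDS for an AAD token scoped to Key Vault.
func imdsTokenURL() string {
	q := url.Values{}
	q.Set("api-version", "2018-02-01")
	q.Set("resource", "https://vault.azure.net")
	return imdsTokenEndpoint + "?" + q.Encode()
}

// imdsTokenRequest returns the request for step 1. IMDS rejects calls that
// lack the "Metadata: true" header.
func imdsTokenRequest() (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, imdsTokenURL(), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Metadata", "true")
	return req, nil
}

// secretURL builds the Key Vault REST URL for the latest version of a secret.
func secretURL(vault, name string) string {
	return fmt.Sprintf("https://%s.vault.azure.net/secrets/%s?api-version=7.4",
		vault, url.PathEscape(name))
}

// secretRequest returns the request for step 2, authorised with the bearer
// token obtained from IMDS.
func secretRequest(vault, name, bearer string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, secretURL(vault, name), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+bearer)
	return req, nil
}

func main() {
	// Only print the requests; this sketch deliberately performs no network I/O.
	fmt.Println(imdsTokenURL())
	fmt.Println(secretURL("example-vault", "shared-axiom-platform-ingest-token"))
}
```

Because one AAD token is valid for every secret in the vault, the same bearer value could be reused for both `shared-axiom-platform-ingest-token` and `shared-axiom-platform-dataset`, which matches the "one pass" behaviour the commit message describes.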

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -65,4 +65,5 @@ delme
 deploy/azure/.dev-env-state-*
 deploy/azure/.dev-env-secrets*
 !deploy/azure/.dev-env-secrets.example
+deploy/azure/.dev-vector-token-*
 /server

cmd/server/main.go

Lines changed: 17 additions & 0 deletions
@@ -5,6 +5,7 @@ import (
 	"encoding/base64"
 	"fmt"
 	"log"
+	"log/slog"
 	"os"
 	"strconv"
 	"strings"
@@ -22,6 +23,7 @@ import (
 	"github.com/opensandbox/opensandbox/internal/controlplane"
 	"github.com/opensandbox/opensandbox/internal/crypto"
 	"github.com/opensandbox/opensandbox/internal/db"
+	"github.com/opensandbox/opensandbox/internal/obslog"
 	"github.com/opensandbox/opensandbox/internal/observability"
 	"github.com/opensandbox/opensandbox/internal/proxy"
 	"github.com/opensandbox/opensandbox/internal/sandbox"
@@ -42,6 +44,21 @@ func main() {
 		log.Fatalf("failed to load config: %v", err)
 	}

+	// Structured logging (JSON to stdout, host envelope baked in). Installs
+	// itself as slog.Default AND redirects stdlib log.Printf through slog so
+	// existing log call sites emit JSON automatically. Vector reads stdout
+	// from the Docker logging driver and ships to Axiom.
+	cpHostname, _ := os.Hostname()
+	obslog.Init(obslog.HostFields{
+		Service:   obslog.ServiceControlPlane,
+		ServiceID: cpHostname, // hostname distinguishes HA replicas
+		CellID:    cfg.CellID,
+		Region:    cfg.Region,
+		Hostname:  cpHostname,
+		HostIP:    cfg.HostIP,
+		Version:   ServerVersion,
+	}, slog.LevelInfo)
+
 	// Sentry error reporting — no-op if OPENSANDBOX_SENTRY_DSN is unset.
 	flushSentry := observability.Init(cfg, "control-plane", ServerVersion)
 	defer flushSentry()
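
The obslog.Init call above is described as installing a JSON slog logger with a host envelope and redirecting stdlib log.Printf through it. The obslog source is not part of this page, so the following is a minimal sketch of how that could be done with the standard library alone. `initLogging`, `legacyLine`, and the local `HostFields` type are stand-ins, not the real package API.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"log/slog"
)

// HostFields mirrors the field names used in the diff; this local type is a
// stand-in for obslog.HostFields.
type HostFields struct {
	Service, ServiceID, CellID, Region, Hostname, HostIP, Version string
}

// initLogging is a hypothetical equivalent of obslog.Init. It attaches the
// host envelope to every record and installs the logger as the default.
func initLogging(w io.Writer, h HostFields, level slog.Level) {
	handler := slog.NewJSONHandler(w, &slog.HandlerOptions{Level: level})
	// slog.SetDefault also routes the stdlib log package's default logger
	// through this handler, so log.Printf call sites emit JSON too.
	slog.SetDefault(slog.New(handler).With(
		"service", h.Service,
		"service_id", h.ServiceID,
		"cell_id", h.CellID,
		"region", h.Region,
		"hostname", h.Hostname,
		"host_ip", h.HostIP,
		"version", h.Version,
	))
}

// legacyLine logs once through log.Printf and returns the decoded JSON line.
func legacyLine(h HostFields) map[string]any {
	var buf bytes.Buffer
	initLogging(&buf, h, slog.LevelInfo)
	log.Printf("legacy call site")
	var out map[string]any
	_ = json.Unmarshal(buf.Bytes(), &out)
	return out
}

func main() {
	line := legacyLine(HostFields{Service: "control-plane", Region: "westus2"})
	fmt.Println(line["service"], line["region"], line["msg"])
}
```

This relies on the documented behaviour of slog.SetDefault (Go 1.21 and later), which is why no separate log.SetOutput call appears in the sketch.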

cmd/worker/main.go

Lines changed: 17 additions & 0 deletions
@@ -5,6 +5,7 @@ import (
 	"fmt"
 	"io"
 	"log"
+	"log/slog"
 	"os"
 	"os/signal"
 	"path/filepath"
@@ -19,6 +20,7 @@ import (
 	"github.com/opensandbox/opensandbox/internal/db"
 	"github.com/opensandbox/opensandbox/internal/metrics"
 	"github.com/opensandbox/opensandbox/internal/observability"
+	"github.com/opensandbox/opensandbox/internal/obslog"
 	"github.com/opensandbox/opensandbox/internal/proxy"
 	qm "github.com/opensandbox/opensandbox/internal/qemu"
 	"github.com/opensandbox/opensandbox/internal/sandbox"
@@ -66,6 +68,21 @@ func main() {
 		log.Fatalf("failed to load config: %v", err)
 	}

+	// Structured logging (JSON to stdout/journald, host envelope baked in).
+	// Installs itself as slog.Default AND redirects stdlib log.Printf through
+	// slog so existing log call sites emit JSON automatically. Vector on the
+	// host reads journald and ships to Axiom.
+	workerHostname, _ := os.Hostname()
+	obslog.Init(obslog.HostFields{
+		Service:   obslog.ServiceWorker,
+		ServiceID: cfg.WorkerID,
+		CellID:    cfg.CellID,
+		Region:    cfg.Region,
+		Hostname:  workerHostname,
+		HostIP:    cfg.HostIP,
+		Version:   WorkerVersion,
+	}, slog.LevelInfo)
+
 	// Sentry error reporting — no-op if OPENSANDBOX_SENTRY_DSN is unset.
 	flushSentry := observability.Init(cfg, "worker", WorkerVersion)
 	defer flushSentry()
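
Besides the host envelope, the commit message mentions per-request fields (request_id, sandbox_id, worker_id) propagated via context. One common way to do this with log/slog is a wrapping handler that copies values from the context onto each record. The sketch below shows that pattern under stated assumptions: `RequestFields`, `WithFields`, and `ctxHandler` are illustrative names, and the real obslog package may be structured differently.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

// RequestFields holds the per-request values named in the commit message.
type RequestFields struct{ RequestID, SandboxID, WorkerID string }

type fieldsKey struct{}

// WithFields returns a copy of ctx that carries f.
func WithFields(ctx context.Context, f RequestFields) context.Context {
	return context.WithValue(ctx, fieldsKey{}, f)
}

// ctxHandler wraps another handler and adds the non-empty per-request
// fields found on the context to every record.
type ctxHandler struct{ slog.Handler }

func (h ctxHandler) Handle(ctx context.Context, r slog.Record) error {
	if f, ok := ctx.Value(fieldsKey{}).(RequestFields); ok {
		r = r.Clone() // avoid mutating state shared with other copies
		for _, kv := range [][2]string{
			{"request_id", f.RequestID},
			{"sandbox_id", f.SandboxID},
			{"worker_id", f.WorkerID},
		} {
			if kv[1] != "" {
				r.AddAttrs(slog.String(kv[0], kv[1]))
			}
		}
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the result so the context lookup survives
// calls such as logger.With(...).
func (h ctxHandler) WithAttrs(a []slog.Attr) slog.Handler {
	return ctxHandler{h.Handler.WithAttrs(a)}
}

func (h ctxHandler) WithGroup(name string) slog.Handler {
	return ctxHandler{h.Handler.WithGroup(name)}
}

// loggedFields logs one line with f on the context and returns the decoded JSON.
func loggedFields(f RequestFields) map[string]any {
	var buf bytes.Buffer
	logger := slog.New(ctxHandler{slog.NewJSONHandler(&buf, nil)})
	logger.InfoContext(WithFields(context.Background(), f), "handled")
	var out map[string]any
	_ = json.Unmarshal(buf.Bytes(), &out)
	return out
}

func main() {
	out := loggedFields(RequestFields{RequestID: "req-42", SandboxID: "sb-1"})
	fmt.Println(out["request_id"], out["sandbox_id"])
}
```

Call sites must use the context-aware methods (InfoContext, ErrorContext, and so on) for the fields to appear; a plain logger.Info passes a background context and the handler finds nothing to add.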

deploy/azure/deploy-azure-dev.sh

Lines changed: 6 additions & 2 deletions
@@ -125,7 +125,9 @@ SETUP_DISK
     log "Syncing codebase..."
     rsync -az --progress \
         --exclude '.git' --exclude 'bin/' --exclude 'node_modules/' \
-        --exclude '.dev-env-state*' --exclude '*.ext4' --exclude 'vendor/' \
+        --exclude '.dev-env-state*' --exclude '.dev-env-secrets*' \
+        --exclude '.dev-vector-token-*' \
+        --exclude '*.ext4' --exclude 'vendor/' \
         "$REPO_ROOT/" "${AZURE_ADMIN_USER}@${VM_PUBLIC_IP}:~/opensandbox/"

     log "Running host setup..."
@@ -190,7 +192,9 @@ cmd_deploy() {
     log "Syncing code..."
     rsync -az --progress \
         --exclude '.git' --exclude 'bin/' --exclude 'node_modules/' \
-        --exclude '.dev-env-state*' --exclude '*.ext4' --exclude 'vendor/' \
+        --exclude '.dev-env-state*' --exclude '.dev-env-secrets*' \
+        --exclude '.dev-vector-token-*' \
+        --exclude '*.ext4' --exclude 'vendor/' \
         "$REPO_ROOT/" "${AZURE_ADMIN_USER}@${VM_PUBLIC_IP}:~/opensandbox/"

     # Build binaries on instance

deploy/packer/worker-ami.pkr.hcl

Lines changed: 30 additions & 0 deletions
@@ -320,6 +320,36 @@ build {
     ]
   }

+  # 4.5. Install Vector + the KV-token-populator for platform-logs shipping.
+  #
+  # Vector is enabled but NOT started (Packer captures the image before
+  # systemd has run). At first boot:
+  #   1. cloud-init writes /etc/opensandbox/worker.env
+  #   2. populate-vector-env.service fires (Before=vector.service), reads
+  #      SECRETS_VAULT_NAME from worker.env, fetches AXIOM_PLATFORM_TOKEN
+  #      from KV via the VM's managed identity, writes /etc/opensandbox/vector.env
+  #   3. vector.service starts, reads both env files, ships to oc-platform-logs
+  #
+  # PACKER_BUILD=1 tells install.sh to skip `systemctl start` — systemd in a
+  # baking image is offline.
+  #
+  # CI is expected to pre-tar deploy/vector/ at /tmp/packer-vector-ctx.tar.gz
+  # (see .github/workflows/build-worker-ami.yml).
+  provisioner "file" {
+    source      = "/tmp/packer-vector-ctx.tar.gz"
+    destination = "/tmp/vector-ctx.tar.gz"
+  }
+  provisioner "shell" {
+    execute_command  = "chmod +x {{ .Path }}; {{ .Vars }} sudo -E bash '{{ .Path }}'"
+    environment_vars = ["PACKER_BUILD=1"]
+    inline = [
+      "mkdir -p /tmp/vector-ctx",
+      "tar xzf /tmp/vector-ctx.tar.gz -C /tmp/vector-ctx",
+      "cd /tmp/vector-ctx/vector && PACKER_BUILD=1 bash install.sh worker",
+      "rm -rf /tmp/vector-ctx /tmp/vector-ctx.tar.gz",
+    ]
+  }
+
   # 4b. Archive base image to blob storage keyed by goldenVersion so that old
   # checkpoints referencing this base can be rebased even after workers roll.
   provisioner "shell" {

deploy/vector/control-plane.yaml

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+# Vector config for the control-plane host (Azure prod: systemd binary,
+# not Docker). Reads opensandbox-server.service from journald, parses JSON
+# log lines (emitted by internal/obslog), enriches non-JSON lines with the
+# host envelope from env, and ships to oc-platform-logs.
+#
+# Install path: /etc/vector/vector.yaml
+# Env file:     /etc/opensandbox/vector.env (populated by
+#               populate-vector-env.service from Key Vault at boot)
+
+data_dir: /var/lib/vector
+
+sources:
+  control_plane_journald:
+    type: journald
+    include_units:
+      - opensandbox-server.service
+    # Default current_boot_only=true is fine — Vector tracks its own cursor
+    # in data_dir, so restarts resume cleanly without replaying the journal.
+
+transforms:
+  control_plane_parse:
+    type: remap
+    inputs:
+      - control_plane_journald
+    source: |
+      parsed, err = parse_json(.message)
+      if err == null && is_object(parsed) {
+        . = merge(., object!(parsed))
+        # Drop stringified duplicate; non-JSON lines (panics, systemd boot)
+        # keep .message as the readable content.
+        del(.message)
+      }
+      if !exists(.service)    { .service = "control-plane" }
+      if !exists(.service_id) { .service_id, _ = get_hostname() }
+      if !exists(.cell_id)    { .cell_id = "${OPENCOMPUTER_CELL_ID:-unknown}" }
+      if !exists(.region)     { .region = "${OPENSANDBOX_REGION:-unknown}" }
+      if !exists(.host_ip)    { .host_ip = "${OPENCOMPUTER_HOST_IP:-unknown}" }
+      if !exists(.hostname)   { .hostname, _ = get_hostname() }
+
+sinks:
+  axiom_platform:
+    type: axiom
+    inputs:
+      - control_plane_parse
+    dataset: "${AXIOM_PLATFORM_DATASET}"
+    token: "${AXIOM_PLATFORM_TOKEN}"
+    buffer:
+      type: disk
+      max_size: 268435488
+      when_full: drop_newest
+    healthcheck:
+      enabled: true
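
The remap transform above does two things: it lifts JSON log lines to top-level fields and drops the stringified copy, then fills in any envelope field that is still missing. For readers less familiar with VRL, the same logic can be sketched in Go as follows. This is an illustrative model of the transform, not code from the repository; the `enrich` function and its `defaults` map are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// enrich models the VRL transform: if event["message"] holds a JSON object,
// its keys are merged into the event (overriding existing keys) and the
// message is removed. Missing envelope fields are then filled from defaults.
func enrich(event map[string]any, defaults map[string]string) map[string]any {
	if msg, ok := event["message"].(string); ok {
		var parsed map[string]any
		// Unmarshal into a map fails for non-object JSON and plain text, and
		// leaves parsed nil for a JSON null, mirroring the is_object check.
		if err := json.Unmarshal([]byte(msg), &parsed); err == nil && parsed != nil {
			for k, v := range parsed {
				event[k] = v
			}
			delete(event, "message")
		}
	}
	for k, v := range defaults {
		if _, exists := event[k]; !exists {
			event[k] = v
		}
	}
	return event
}

func main() {
	defaults := map[string]string{"service": "control-plane", "region": "unknown"}

	structured := enrich(map[string]any{
		"message": `{"msg":"sandbox created","service":"worker","request_id":"req-1"}`,
	}, defaults)
	fmt.Println(structured["service"], structured["request_id"], structured["region"])

	plain := enrich(map[string]any{"message": "panic: runtime error"}, defaults)
	fmt.Println(plain["service"], plain["message"])
}
```

The ordering matters: defaults are applied only after the merge, so a `service` value emitted by the application always wins over the config's fallback.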

deploy/vector/dev-host.yaml

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
+# Vector config for a dev host that runs BOTH opensandbox-server AND
+# opensandbox-worker as systemd services on the same VM (the layout produced
+# by deploy/azure/deploy-azure-dev.sh).
+#
+# Production hosts use worker.yaml or control-plane.yaml — one role per host.
+# This config exists so the request-id join can be validated end-to-end on a
+# single dev VM before standing up a multi-VM dev cluster.
+#
+# The journald source captures both units in one stream; the parse transform
+# tags `service` based on which unit emitted each line.
+#
+# Install path: /etc/vector/vector.yaml
+# Env file:     /etc/opensandbox/vector.env (provides AXIOM_PLATFORM_TOKEN)
+
+data_dir: /var/lib/vector
+
+sources:
+  dev_journald:
+    type: journald
+    include_units:
+      - opensandbox-server.service
+      - opensandbox-worker.service
+    # `current_boot_only: false` is rejected on systemd 250-257 (Vector
+    # limitation). Default (true) is fine — Vector tracks its own cursor in
+    # data_dir, so restarts pick up where they left off rather than replaying.
+
+transforms:
+  dev_parse:
+    type: remap
+    inputs:
+      - dev_journald
+    source: |
+      parsed, err = parse_json(.message)
+      if err == null && is_object(parsed) {
+        . = merge(., object!(parsed))
+        # The structured fields (msg, level, request_id, ...) are now at
+        # root. Drop the stringified copy so events don't duplicate the
+        # payload in Axiom. Non-JSON lines (panics, kernel) fall through
+        # this branch and keep .message as the readable content.
+        del(.message)
+      }
+      # Tag service based on the systemd unit. obslog.Init already sets
+      # `service` inside JSON log lines, so this only fires for non-JSON
+      # output (panics, plain stderr, systemd boot messages).
+      if !exists(.service) {
+        unit = string(.SYSTEMD_UNIT) ?? ""
+        if unit == "opensandbox-server.service" {
+          .service = "control-plane"
+        } else if unit == "opensandbox-worker.service" {
+          .service = "worker"
+        } else {
+          .service = "unknown"
+        }
+      }
+      if !exists(.service_id) {
+        # Worker exports OPENSANDBOX_WORKER_ID; control plane has no
+        # equivalent on a binary-deployed dev host — fall back to hostname.
+        unit = string(.SYSTEMD_UNIT) ?? ""
+        if unit == "opensandbox-worker.service" {
+          .service_id = "${OPENSANDBOX_WORKER_ID:-unknown}"
+        } else {
+          .service_id, _ = get_hostname()
+        }
+      }
+      if !exists(.cell_id)  { .cell_id = "${OPENCOMPUTER_CELL_ID:-unknown}" }
+      if !exists(.region)   { .region = "${OPENSANDBOX_REGION:-unknown}" }
+      if !exists(.host_ip)  { .host_ip = "${OPENCOMPUTER_HOST_IP:-unknown}" }
+      if !exists(.hostname) { .hostname, _ = get_hostname() }
+      # Stamp every event with environment so dev traffic is filterable inside
+      # a shared dataset. Only the dev-host config sets this — production
+      # worker.yaml / control-plane.yaml deliberately omit it so prod events
+      # don't claim the dev tag.
+      .environment = "dev"
+
+sinks:
+  axiom_platform:
+    type: axiom
+    inputs:
+      - dev_parse
+    dataset: "${AXIOM_PLATFORM_DATASET}"
+    token: "${AXIOM_PLATFORM_TOKEN}"
+    buffer:
+      type: disk
+      max_size: 268435488
+      when_full: drop_newest
+    healthcheck:
+      enabled: true
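
The dev-host transform differs from the control-plane one mainly in how it derives `service` and `service_id` from the emitting systemd unit when a line is not JSON. As a rough model of that branching, again in Go and with hypothetical function names, the fallback rules could be written as:

```go
package main

import "fmt"

// serviceForUnit maps a systemd unit name to the service tag used in the
// dev-host config. Unrecognised units are tagged "unknown".
func serviceForUnit(unit string) string {
	switch unit {
	case "opensandbox-server.service":
		return "control-plane"
	case "opensandbox-worker.service":
		return "worker"
	default:
		return "unknown"
	}
}

// serviceIDFor mirrors the service_id fallback: workers use the configured
// worker id (or "unknown" when it is unset or empty, like the
// ${OPENSANDBOX_WORKER_ID:-unknown} expansion); everything else uses the hostname.
func serviceIDFor(unit, workerID, hostname string) string {
	if unit == "opensandbox-worker.service" {
		if workerID == "" {
			return "unknown"
		}
		return workerID
	}
	return hostname
}

func main() {
	fmt.Println(serviceForUnit("opensandbox-worker.service"),
		serviceIDFor("opensandbox-worker.service", "w-01", "dev-vm"))
	fmt.Println(serviceForUnit("opensandbox-server.service"),
		serviceIDFor("opensandbox-server.service", "w-01", "dev-vm"))
}
```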
