
Add OpenTelemetry tracing provider to the container (disabled by default) #37227

Merged: onurkaracali merged 8 commits into master from onur/opentelemetry-sdk-tracing-provider on Jun 17, 2026

@onurkaracali

Contributor

What

Introduces the no-op-by-default infrastructure for native OpenTelemetry (OTel) tracing in the container. Nothing is traced until explicitly enabled via the opentelemetry-sdk feature flag — safe to ship disabled and roll out gradually.

This is the foundation PR; request instrumentation (server spans in the Jetty/jdisc layer, context propagation, child spans, RPC propagation to content nodes) follows separately.

Changes

  • container-opentelemetry — new Felix bundle embedding the OTel SDK + OTLP/HTTP exporter (jdk sender; no okhttp/kotlin pulled in). Exports the io.opentelemetry.api/context/sdk/exporter.otlp packages. Pre-installed into the container.
  • OpenTelemetryProvider (container-disc) — Provider<OpenTelemetry> registered in all container types via ContainerCluster. Hands out OpenTelemetry.noop() when disabled; builds the real SDK (OTLP/HTTP, batch span processor, parent-based ratio sampling, W3C trace-context propagation) only when enabled. Builds the OTel Resource from the resource-attribute map.
  • telemetry.def: enabled / endpoint / samplingRatio plus a resourceAttribute{} map. No defaults on the scalar fields, so config is always exactly what the model supplies (no silent fallback endpoint).
  • ContainerCluster — fills the resource-attribute map from deployment identity available in the model: application, tenant, cluster.type, cluster.id.
  • opentelemetry-sdk feature flag (JSON: enabled/endpoint/samplingRatio), threaded through ModelContext -> ModelContextImpl -> ContainerCluster.getConfig. Takes effect at redeployment.
  • Dependency enforcer allow-list and config-model-api abi-spec updated accordingly.
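
For readers unfamiliar with parent-based ratio sampling, here is a minimal stdlib-only sketch of the decision. It is a simplified model and not the OTel SDK's implementation: `ParentBasedRatioSketch` and its methods are hypothetical names, and the trace id is reduced to a single `long` for illustration.

```java
import java.util.Optional;

/** Hypothetical model of parent-based ratio sampling; not the OTel SDK code. */
final class ParentBasedRatioSketch {

    /** Root decision: a pure function of the trace id, so every node reaches the same answer. */
    static boolean ratioDecision(long traceIdLowBits, double ratio) {
        if (ratio <= 0.0) return false;   // sample nothing
        if (ratio >= 1.0) return true;    // sample everything
        long bound = (long) (ratio * Long.MAX_VALUE);
        // Clear the sign bit so the id maps into [0, Long.MAX_VALUE].
        return (traceIdLowBits & Long.MAX_VALUE) < bound;
    }

    /**
     * If a parent span exists, follow its sampled flag; otherwise this is a
     * root span and the ratio decision applies.
     */
    static boolean shouldSample(Optional<Boolean> parentSampled, long traceIdLowBits, double ratio) {
        return parentSampled.orElseGet(() -> ratioDecision(traceIdLowBits, ratio));
    }
}
```

The point of the parent-based wrapper is that samplingRatio only governs traces that start in this container; a request that arrives with a sampled parent context stays sampled regardless of the local ratio.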

Behavior

  • Disabled (default): OpenTelemetry.noop() — no SDK constructed, no exporter threads, no connections, no telemetry.
  • Enabled: real SDK built from flag config; flushes and shuts down cleanly on reconfiguration (deconstruct()).
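
The two behaviors above can be modelled with a small stdlib-only sketch. All types here (`TracingHandle`, `SdkHandle`, `TracingProviderSketch`) are hypothetical stand-ins for the OTel API, the real SDK, and OpenTelemetryProvider; the sketch only illustrates the shape of the lifecycle and is not the code in this PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the OpenTelemetry API type handed to components. */
interface TracingHandle {
    /** Shared no-op instance: holds no state and does nothing on shutdown. */
    TracingHandle NOOP = new TracingHandle() {
        @Override public boolean isNoop() { return true; }
        @Override public void shutdown() { }
    };

    boolean isNoop();

    void shutdown();
}

/** Hypothetical stand-in for a real SDK instance; records lifecycle events. */
final class SdkHandle implements TracingHandle {
    final List<String> events = new ArrayList<>();

    SdkHandle(String endpoint, double samplingRatio) {
        events.add("built");
    }

    @Override public boolean isNoop() { return false; }

    @Override public void shutdown() {
        events.add("flush");      // export pending spans first
        events.add("shutdown");   // then release exporter resources
    }
}

/** Provider: constructs the SDK stand-in only when enabled. */
final class TracingProviderSketch {
    private final TracingHandle handle;

    TracingProviderSketch(boolean enabled, String endpoint, double samplingRatio) {
        this.handle = enabled ? new SdkHandle(endpoint, samplingRatio) : TracingHandle.NOOP;
    }

    TracingHandle get() { return handle; }

    /** Called on reconfiguration; a no-op for the disabled case. */
    void deconstruct() { handle.shutdown(); }
}
```

In the disabled case nothing is constructed at all, which is what makes it safe to register the provider in every container type.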

Notes / follow-ups

  • service.instance.id (per-node identity) is intentionally not set — it's a metrics-centric attribute; for traces, per-node attribution (if needed) is better added at runtime as host.name or enriched by the Alloy agent.
  • cluster.type is currently "container" for all container clusters (the model has no stored ClusterSpec.Type here).
  • Blank-endpoint-when-enabled guard in the Provider is a known small follow-up.
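
One possible shape for the blank-endpoint guard, as a sketch only: the PR does not specify how the follow-up will be implemented, and `EndpointGuardSketch.shouldBuildSdk` is a hypothetical helper, not an existing method.

```java
/** Hypothetical guard: decide whether the real SDK should be built at all. */
final class EndpointGuardSketch {

    /**
     * True only when tracing is enabled and the endpoint is usable.
     * A null, empty, or whitespace-only endpoint falls back to the no-op path.
     */
    static boolean shouldBuildSdk(boolean enabled, String endpoint) {
        return enabled && endpoint != null && !endpoint.isBlank();
    }
}
```

Falling back to the no-op path on a blank endpoint would keep a misconfigured flag from constructing an exporter with nowhere to send spans.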

🤖 Generated with Claude Code

…ult)

Introduces the no-op-by-default infrastructure for native OpenTelemetry
(OTel) tracing in the container. No tracing happens until explicitly
enabled via the opentelemetry-sdk feature flag.

- container-opentelemetry: new Felix bundle embedding the OTel SDK + OTLP
  HTTP exporter (jdk sender, no okhttp/kotlin); exports the api/context/sdk
  packages. Pre-installed into the container.
- OpenTelemetryProvider (container-disc): Provider<OpenTelemetry> registered
  in all container types. Hands out OpenTelemetry.noop() when disabled;
  builds the real SDK (OTLP/HTTP, batching, parent-based ratio sampling,
  W3C propagation) only when enabled. Builds the OTel Resource from the
  resource-attribute map.
- telemetry.def: enabled / endpoint / samplingRatio + a resourceAttribute
  map. ContainerCluster fills the map with deployment identity available in
  the model (application, tenant, cluster.type, cluster.id).
- opentelemetry-sdk JSON feature flag (enabled/endpoint/samplingRatio),
  threaded through ModelContext -> ContainerCluster.getConfig. Takes effect
  at redeployment.
- dependency-enforcer + abi-spec updated accordingly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@onurkaracali onurkaracali requested review from bjorncs and hmusum June 17, 2026 08:39
onurkaracali and others added 3 commits June 17, 2026 10:56
…void null endpoint

- container-dev: add the OpenTelemetry SDK + OTLP exporter artifacts so the in-process
  container test harness (StandaloneContainerApplication) can load OpenTelemetryProvider,
  which is now registered in every container. In production these come from the
  pre-installed container-opentelemetry bundle via OSGi; container-dev is the flat
  test classpath, mirroring vespa-3party-bundles / container-onnxruntime.
- Never hand back a null endpoint when disabled: the telemetry.def endpoint field is
  mandatory, so the generated config builder rejects null. OpenTelemetryConfiguration.disabled()
  and OpenTelemetrySettings now use an empty string (still no localhost default, so nothing
  is mistakenly sent).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The SDK must not leak onto the 3rd-party container classpath (container-dependencies-enforcer).
Keep the public surface to the OTel API only:

- Split OpenTelemetrySdkBuilder out of OpenTelemetryProvider. The provider now references only the OTel
  API; the SDK-building class is loaded lazily, only when tracing is enabled. So the in-process container
  test harness needs only the API to instantiate the disabled, no-op provider.
- container-dev: depend on opentelemetry-api only (was sdk + exporter). The SDK is supplied at runtime by
  the pre-installed container-opentelemetry bundle.
- container-dependencies-enforcer: whitelist opentelemetry-api/context/common (provided). These are
  genuinely provided at runtime by the bundle.
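
A stdlib-only sketch of the lazy split described above. `HeavySdkBuilder` is a hypothetical stand-in for OpenTelemetrySdkBuilder. The sketch demonstrates lazy class initialization, which the Java language guarantees; when a class is actually loaded and linked is JVM-dependent, so treat this as an illustration of the idea rather than proof of the classloading behavior in the container.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical model of a provider that touches the SDK builder only when enabled. */
final class LazyLoadingSketch {

    /** Set by HeavySdkBuilder's static initializer, so callers can observe when it runs. */
    static final AtomicBoolean sdkBuilderInitialized = new AtomicBoolean(false);

    /** Stand-in for the class that would reference SDK and exporter types. */
    static final class HeavySdkBuilder {
        static {
            sdkBuilderInitialized.set(true);
        }

        static String build(String endpoint) {
            return "sdk(" + endpoint + ")";
        }
    }

    /** Provider side: the only reference to HeavySdkBuilder sits on the enabled branch. */
    static String create(boolean enabled, String endpoint) {
        return enabled ? HeavySdkBuilder.build(endpoint) : "noop";
    }
}
```

Calling `create(false, "")` returns `"noop"` without running `HeavySdkBuilder`'s static initializer; the initializer runs on the first `create(true, ...)` call.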

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…onal

The bundle failed to resolve in a real container (system tests): bnd had generated mandatory
Import-Package entries for packages the container does not provide (io.grpc, guava, the OTel
incubator/autoconfigure SPI, jspecify, sun.misc). The earlier assumption that bnd marks these
optional automatically was wrong.

Everything the bundle actually uses at runtime is embedded (private) in the jar. Add an explicit
Import-Package keeping only com.fasterxml.jackson.core mandatory (provided by the container) and
marking all other computed imports resolution:=optional, so the bundle resolves without the
unavailable deps it never exercises.
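
To check a result like this, one can list which Import-Package clauses remain mandatory. The sketch below is a simplified, hypothetical helper (`ImportPackageSketch`), not a full OSGi header parser: it assumes one package per clause and the unquoted directive spelling `resolution:=optional`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for inspecting an Import-Package header value. */
final class ImportPackageSketch {

    /** Splits a header value on commas that are not inside double quotes. */
    static List<String> clauses(String header) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean quoted = false;
        for (char c : header.toCharArray()) {
            if (c == '"') quoted = !quoted;   // version ranges like "[2.15,3)" contain commas
            if (c == ',' && !quoted) {
                out.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) out.add(current.toString().trim());
        return out;
    }

    /** Package names of clauses that are not marked resolution:=optional. */
    static List<String> mandatoryImports(String header) {
        List<String> out = new ArrayList<>();
        for (String clause : clauses(header)) {
            if (clause.isEmpty() || clause.contains("resolution:=optional")) continue;
            out.add(clause.split(";", 2)[0].trim());   // drop attributes and directives
        }
        return out;
    }
}
```

The header value itself can be read from a built jar with `java.util.jar.JarFile#getManifest()` and `getMainAttributes().getValue("Import-Package")`.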

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread configdefinitions/src/vespa/telemetry.def
Comment thread container-dependencies-enforcer/pom.xml Outdated
Comment on lines +90 to +93
<include>io.opentelemetry:opentelemetry-api:${opentelemetry.vespa.version}:provided</include>
<include>io.opentelemetry:opentelemetry-context:${opentelemetry.vespa.version}:provided</include>
<include>io.opentelemetry:opentelemetry-common:${opentelemetry.vespa.version}:provided</include>

Member

We should not expose otel libraries unless it's strictly required by applications.

Comment thread container-dev/pom.xml
<version>${project.version}</version>
<type>pom</type>
</dependency>
<!-- OpenTelemetry API only: this is what the in-process container test harness needs to instantiate the

Contributor Author

Add exclusion to container

onurkaracali and others added 4 commits June 17, 2026 13:28
…ontainer classpath

- configdefinitions: export the generated ai.vespa.telemetry package (@ExportPackage package-info)
  so container-disc can import TelemetryConfig at runtime in a real container (OSGi).
- container: exclude io.opentelemetry:* from the container-dev dependency so the public 3rd-party
  container classpath does not expose OpenTelemetry. It stays internal to the platform (container-disc),
  provided at runtime by the container-opentelemetry bundle.
- container-dependencies-enforcer: drop the OpenTelemetry whitelist entries (no longer exposed).

container-dev keeps opentelemetry-api so the in-process container test harness still resolves the
no-op OpenTelemetryProvider.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…endencies

The maven-bundle-plugin did not reliably embed deep transitive OTel modules across build
environments (e.g. opentelemetry-common, 4 levels deep via sdk->api->context->common). When a
module was not embedded, bnd emitted a mandatory versioned Import-Package for it, so the bundle
failed OSGi resolution in the real container (system tests), cascading to container-disc and
standalone-container.

Declare all 12 OTel modules directly so Embed-Dependency embeds them as first-level deps instead
of relying on Embed-Transitive. Same workaround pattern as container-apache-http-client-bundle.
Verified: the built bundle imports only com.fasterxml.jackson.core and no io.opentelemetry packages.

Also keep Import-Package limited to com.fasterxml.jackson.core (everything else is embedded).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t bundle

Inlined upstream OpenTelemetry MR-JARs carry their own OSGI-INF/MANIFEST.MF
(notably under META-INF/versions/9/OSGI-INF/), which survive into the
assembled fat bundle. OSGi resolvers only read META-INF/MANIFEST.MF, so the
extra files are inert noise that can mislead anyone inspecting the bundle.

Replace the built-in jar-with-dependencies descriptorRef with a custom
descriptor that mirrors the built-in but excludes OSGI-INF/MANIFEST.MF and
META-INF/versions/*/OSGI-INF/MANIFEST.MF when unpacking dependencies.
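
The two exclusion patterns can be modelled as a predicate over jar entry names. This is a sketch for illustration only: `ManifestExclusionSketch` is a hypothetical name, and the regex models just the two paths named above, not the assembly plugin's own pattern matching.

```java
import java.util.regex.Pattern;

/** Hypothetical predicate mirroring the two excluded manifest paths. */
final class ManifestExclusionSketch {

    // Matches OSGI-INF/MANIFEST.MF at the root, or under META-INF/versions/<n>/.
    private static final Pattern EXCLUDED =
            Pattern.compile("(META-INF/versions/[^/]+/)?OSGI-INF/MANIFEST\\.MF");

    /** True when the entry is one of the inert upstream OSGi manifests. */
    static boolean isExcluded(String entryName) {
        return EXCLUDED.matcher(entryName).matches();
    }
}
```

The bundle's own META-INF/MANIFEST.MF does not match, so the manifest that OSGi resolvers read is left untouched.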
@onurkaracali onurkaracali merged commit 4fc286b into master Jun 17, 2026
3 checks passed
@onurkaracali onurkaracali deleted the onur/opentelemetry-sdk-tracing-provider branch June 17, 2026 12:42