
OTEP: Stable by Default#4813

Open
austinlparker wants to merge 8 commits into open-telemetry:main from austinlparker:ap/stabilityOtep

Conversation

@austinlparker
Member

Summary

This OTEP proposes that OpenTelemetry distributions enable only stable components by default, decouple instrumentation stability from semantic convention stability, and establish expanded stability criteria.

Key Proposals

  • Stable by default: Distributions should only enable stable components; experimental requires explicit opt-in
  • Decouple instrumentation/semconv stability: Allow instrumentation to stabilize independently when API surface is stable, even if semantic conventions are still experimental
  • Expanded stability criteria: Stability means more than API compatibility—includes documentation, benchmarks, tested integrations
  • Unified component metadata: Extend Collector's metadata.yaml pattern to instrumentation libraries
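For context, the Collector's `metadata.yaml` already declares per-signal stability with fields along these lines. The component name below is a placeholder, and the commented-out `telemetry` section is a hypothetical sketch of what an extension for instrumentation libraries might add, not an existing field:

```yaml
type: samplereceiver          # placeholder component name
status:
  class: receiver             # receiver / processor / exporter / connector / extension
  stability:                  # stability level declared per signal
    beta: [traces]
    alpha: [metrics, logs]
# Hypothetical extension for instrumentation libraries:
# telemetry:
#   semconv_stability: experimental   # stability of the emitted telemetry shape
```

Extending one schema like this, rather than inventing a parallel format per SIG, is what would let registries and tooling surface stability consistently.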

Motivation

Community feedback consistently identifies pain points that this OTEP addresses:

  • Experimental features breaking production deployments
  • Semantic convention changes destroying dashboards
  • Instrumentation libraries stuck on pre-release due to experimental semconv dependencies
  • "Batteries not included" defaults that overwhelm newcomers

Related

Test plan

  • Review by Governance Committee
  • Review by affected SIGs (Collector, SDK maintainers)
  • Community feedback period

Proposes that OpenTelemetry distributions enable only stable components
by default, decouple instrumentation stability from semantic convention
stability, and establish expanded stability criteria.

Key proposals:
- Stable by default: distributions should only enable stable components
- Decouple instrumentation/semconv stability: let instrumentation stabilize
  independently when API surface is stable
- Expanded stability criteria: docs, benchmarks, tested integrations
- Unified component metadata schema extending Collector's metadata.yaml

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@austinlparker austinlparker requested review from a team as code owners December 30, 2025 15:43
Contributor

@jsuereth jsuereth left a comment


I actually think the work detailed in this OTEP is large and sweeping, and likely needs to be divided up.

I suggest you keep the goals of the OTEP in this, and set up workstreams / requirements that can be tackled in further OTEPs. I.e., this is bigger than one person or one "design".

E.g.

  • Enabling experimental features - A workstream we can ask the configuration SIG to drive.
  • Federated Schema and declaring stability of a schema independently of semantic conventions - You can give this to Weaver / Semconv Tooling SIG (@lmolkova has an OTEP already to continue making progress here)
  • Distributions / Releasing - A workstream around defining what a distribution is, and gate-keeping its default features to those that are stable
  • Profiling - A workstream around profiling overhead and providing features / capabilities / infrastructure to allow Maintainers to set up these tests if they don't have them and meet the requirements listed here.

That's just top of mind, but I think we could refactor this OTEP to call out each workstream and find owners for those.

@pellared
Member

pellared commented Jan 7, 2026

PTAL @open-telemetry/dotnet-instrumentation-maintainers

reyang previously requested changes Jan 7, 2026
Member

@reyang reyang left a comment


Before going into the details, I have the same question as what @cijothomas mentioned here.

"This OTEP proposes that OpenTelemetry distributions enable only stable components by default" - what does this mean? If let's say Company XYZ released their distribution of OpenTelemetry Java SDK, and they included an unstable component, what would the OpenTelemetry community do? - Do we send attorneys to them?

@austinlparker
Member Author

Hi all - thank you for the robust discussion on this proposal, as well as the feedback during the most recent GC/TC meeting. I've gone through and rewritten the entire thing to focus more on identifying specific workstreams and their potential owners, in order to make this a more digestible set of changes. I will be resolving all comments on this PR; please re-review at your leisure.

@austinlparker
Member Author

Before going into the details, I have the same question as what @cijothomas mentioned here.

"This OTEP proposes that OpenTelemetry distributions enable only stable components by default" - what does this mean? If let's say Company XYZ released their distribution of OpenTelemetry Java SDK, and they included an unstable component, what would the OpenTelemetry community do? - Do we send attorneys to them?

I do want to be specific to this point. We, as a project, obviously have no recourse against people who do not respect the project rules as written. That said -

  • Providing a policy and having a project consensus that we will follow this policy means that we can work with CNCF to provide guarantees around that policy. For example, we could create a 'Certified OpenTelemetry Compatible' trademark (or something) and have adherence to the policy be a requirement of it. If someone used this improperly, the CNCF would be able to take legal action.
  • Generally, open source works on handshakes and rep :) As long as our users and the community view us as a neutral arbiter and implementor of the standards we create, our word carries weight. If company XYZ uses unstable components (or in some way releases software that is not compliant with our guidance) we are under no obligation to advertise their software on our website, in the registry, etc. So while we cannot 'force' compliance, we can certainly withhold promotion of derivatives that do not respect our policies, processes, etc.


The work is complete when we have a documented mechanism for enabling experimental features—whether through environment variables, configuration, or programmatic API—along with clear guidance on what "experimental" means and what users are opting into. Experimental features should be disabled by default with clear logging when enabled. Where possible, the design should align with existing patterns like Collector feature gates and `OTEL_SEMCONV_STABILITY_OPT_IN`.
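As a rough illustration of the environment-variable flavor of this mechanism, the sketch below gates features behind an explicit comma-separated opt-in list and logs on activation. The variable name `OTEL_EXPERIMENTAL_FEATURES` and the feature tokens are hypothetical; `OTEL_SEMCONV_STABILITY_OPT_IN` is simply the existing precedent for the comma-separated pattern.

```python
import logging
import os

logger = logging.getLogger(__name__)

# Hypothetical variable name; the real name would be decided by the owning SIG.
_OPT_IN_VAR = "OTEL_EXPERIMENTAL_FEATURES"


def experimental_enabled(feature: str) -> bool:
    """Return True only if the user explicitly opted in to `feature`."""
    opted_in = {
        token.strip()
        for token in os.environ.get(_OPT_IN_VAR, "").split(",")
        if token.strip()
    }
    enabled = feature in opted_in
    if enabled:
        # Per the proposal, enabling an experimental feature should be
        # clearly logged so operators can audit what is non-stable.
        logger.warning("Experimental feature %r enabled via %s", feature, _OPT_IN_VAR)
    return enabled


# Disabled by default, with no opt-in present:
assert not experimental_enabled("exemplars")

# Enabled only after explicit opt-in:
os.environ[_OPT_IN_VAR] = "exemplars,profiles"
assert experimental_enabled("exemplars")
```

The same predicate could back a declarative-configuration key or a programmatic API; the environment variable is just the lowest-common-denominator entry point across SDKs and the Collector.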

The Configuration SIG is the natural owner for this work.
Member


cc @open-telemetry/configuration-maintainers

Member


The config SIG was initialized as a project with a particular scope, and as such is planning on shutting down upon stabilizing the specification. It can recharter / restart, but with different goals and potentially different people.


This workstream should enable instrumentation stability to be assessed independently from semantic convention stability, with clear mechanisms for communicating telemetry stability to users. Instrumentation libraries should be able to declare API stability separately from telemetry output stability. Schema URLs should be populated consistently across instrumentations, enabling downstream tooling. Migration pathways should be documented when instrumentation stabilizes before its semantic conventions. Breaking changes to telemetry output should be treated as breaking changes, requiring major version bumps.

The Semantic Conventions SIG and Weaver maintainers are the natural owners. Related work includes the [OTEP on federated semantic conventions](https://github.com/open-telemetry/opentelemetry-specification/pull/4815).
Member


cc @open-telemetry/weaver-maintainers

Contributor


Draft proposal to put all the pieces together for this workstream - #4906


This guidance should be aspirational rather than a set of blocking requirements. Components can be stable without meeting every criterion. Requiring extensive benchmarks and documentation for every component would worsen the "stuck on pre-release" problem, not improve it. The goal is to help maintainers understand what production users need without creating barriers to stabilization.

The End User SIG and Communications SIG should own this work.
Contributor


I love our End User + Communications SIG - but is this the right owner?

I think examples of this are crafting the Collector resiliency documentation, but the key questions to ask here involve core architectural decisions around which architectures OTEL components support, and making sure our releases fit into that cohesive whole.

In lieu of a better SIG, I'd suggest this belongs to the TC (today, by charter), and we should step up what we offer here.


## Open Questions

Who will own each workstream? Should ownership be assigned before this OTEP is approved, or can workstreams proceed as volunteers emerge?
Contributor


I'd suggest each workstream is either adopted as the roadmap of an existing SIG (when that's the owner) or becomes a new project in the governance model, with a dedicated project owner, to make sure this succeeds.

@austinlparker
Member Author

@jsuereth just to clarify - the intent of stability here is as follows:

  • an instrumentation library should declare stability if the instrumentation code itself is performant, stable, and safe for production use even if it depends on unstable semconv
  • if an instrumentation library updates any of its api contracts (or telemetry contracts) in a breaking way then this should be considered a breaking change requiring a major version update

is this unclear from the current wording? i can add this more explicit guidance.

Co-authored-by: Cijo Thomas <cithomas@microsoft.com>
@austinlparker
Member Author

I'll be at the 2/10 spec call to discuss this OTEP further synchronously with maintainers.

Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
@jsuereth
Contributor

  • a major version update

is this unclear from the current wording? i can add this more explicit guidance.

As stated in the spec call - I think we agree at a high level, but the way this is worded appears to give license to something we do not want.

Stable instrumentation SHOULD NOT break output telemetry. This means, in your wording, you should not say the output telemetry is unstable. Just that the output telemetry is not in "global" semconv, but maintained locally instead.

if an instrumentation library updates any of its api contracts (or telemetry contracts) in a breaking way then this should be considered a breaking change requiring a major version update

I do not get this when reading the current proposal. This is my major beef with it, and I'd like to reword/rephrase so it's clear our stance on this.

Reframe instrumentation stability around production readiness of code
rather than separating API stability from telemetry output stability.
Trim workstream sections to focus on problems and outcomes, leaving
solution details to the workstreams themselves.
Member

@mx-psi mx-psi left a comment


This looks good to me, I would like to see approvals from mentioned SIGs before approving myself though

Contributor

@jsuereth jsuereth left a comment


Thanks for fixing the wording here!

I forgot to come back and re-review; I was working on specific proposals w/ @lmolkova for the Federated Semconv piece, but this now looks good to me.

I think we'll need to kick off specific projects for areas which don't have SIG owners. Ideally, I think we should get a "single person who feels responsible" for a workstream, but we can sort out that detail later.


OpenTelemetry has grown into a massive ecosystem supporting four telemetry signals across a dozen programming languages. This growth has come with complexity that creates real barriers to production adoption.

Community feedback consistently identifies several pain points. Experimental features break production deployments—users report configuration breaking between minor versions, silent failures in telemetry pipelines, and unexpected performance regressions that only appear at scale. As one practitioner noted: "The silent failure policy of OTEL makes flames shoot out of the top of my head."
Member


Suggested change
Community feedback consistently identifies several pain points. Experimental features break production deployments—users report configuration breaking between minor versions, silent failures in telemetry pipelines, and unexpected performance regressions that only appear at scale. As one practitioner noted: "The silent failure policy of OTEL makes flames shoot out of the top of my head."
Community feedback consistently identifies several pain points. Experimental features break production deployments—users report configuration breaking between minor versions, silent failures in telemetry pipelines, and unexpected performance regressions that only appear at scale.


Semantic convention changes destroy existing dashboards. When conventions change, users must update instrumentation across their entire infrastructure while simultaneously updating dashboards, alerts, and downstream tooling. Organizations report significant resistance from developers asked to coordinate these changes.

Many instrumentation libraries are stuck on pre-release because they depend on experimental semantic conventions, even when the instrumentation API surface itself is mature and battle-tested. The "batteries not included" philosophy means users must assemble many components before achieving basic functionality. Documentation assumes expertise, and newcomers describe the experience as "overwhelming" with "no discoverability." Auto-instrumentation can add significant resource consumption that only becomes apparent at scale, with reports of "four times the CPU usage" compared to simpler alternatives. Users evaluating OpenTelemetry for production deployment need confidence in CVE response timelines, dependency hygiene, and supply chain security—areas where commitments are not well documented.
Member


Suggested change
Many instrumentation libraries are stuck on pre-release because they depend on experimental semantic conventions, even when the instrumentation API surface itself is mature and battle-tested. The "batteries not included" philosophy means users must assemble many components before achieving basic functionality. Documentation assumes expertise, and newcomers describe the experience as "overwhelming" with "no discoverability." Auto-instrumentation can add significant resource consumption that only becomes apparent at scale, with reports of "four times the CPU usage" compared to simpler alternatives. Users evaluating OpenTelemetry for production deployment need confidence in CVE response timelines, dependency hygiene, and supply chain security—areas where commitments are not well documented.
Many instrumentation libraries are stuck on pre-release because they depend on experimental semantic conventions, even when the instrumentation API surface itself is mature and battle-tested. The "batteries not included" philosophy means users must assemble many components before achieving basic functionality. Documentation assumes expertise, and newcomers describe the experience as "overwhelming" with "no discoverability." Auto-instrumentation can add significant resource consumption that only becomes apparent at scale. Users evaluating OpenTelemetry for production deployment need confidence in CVE response timelines, dependency hygiene, and supply chain security—areas where commitments are not well documented.



These all stem from the same problem: OpenTelemetry's default configuration prioritizes feature completeness over production readiness. This OTEP establishes the goals and workstreams needed to address this.
Member


These all stem from the same problem: OpenTelemetry's default configuration prioritizes feature completeness over production readiness.

this seems a bit of an oversimplification given some of the examples above


This workstream should result in a consistent pattern for experimental feature opt-in that works across SDKs, the Collector, and instrumentation libraries.

The Configuration SIG is the natural owner for this work.
Member



- Stability information should be visible and consistent. Users should be able to easily determine the stability status of any component before adopting it, and this information should be presented consistently across all OpenTelemetry projects.

- Instrumentation should be able to stabilize based on production readiness. The bar for a stable instrumentation library should be whether the instrumentation code itself is production-ready, not whether the semantic conventions it depends on have been finalized. However, once an instrumentation library stabilizes, any breaking change to its telemetry output must be treated as a breaking change requiring a major version bump.
Member


However, once an instrumentation library stabilizes, any breaking change to its telemetry output must be treated as a breaking change requiring a major version bump.

This seems like an unreasonable burden to place on things like auto instrumentation. Consider the example where an http client library is directly instrumented using OpenTelemetry APIs, and it is using the currently stable semantic conventions for http client calls. All auto instrumentation needs to do to enable capturing that telemetry is to subscribe to it (ActivitySource or Meter in dotnet, for example). The instrumentation version is directly coupled to the version of the http client library, and completely outside the control of auto instrumentation.

  • Does this mean that there is an expectation that auto instrumentation implementations need to perform proactive testing to detect changes in the telemetry output for new library versions?
  • Does auto instrumentation need a new major version whenever we want to support a new major version of 3rd party library that is natively instrumented?
  • Will library authors consistently do a major version bump if the telemetry signal changes?
  • Do we need something in this proposal specifically for auto instrumentation to call out how default instrumentations need to be managed?

