Conversation
Proposes that OpenTelemetry distributions enable only stable components by default, decouple instrumentation stability from semantic convention stability, and establish expanded stability criteria.

Key proposals:
- Stable by default: distributions should only enable stable components
- Decouple instrumentation/semconv stability: let instrumentation stabilize independently when the API surface is stable
- Expanded stability criteria: docs, benchmarks, tested integrations
- Unified component metadata schema extending the Collector's `metadata.yaml`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
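For reference, the Collector `metadata.yaml` pattern referenced above looks roughly like this (a sketch modeled on Collector contrib components; exact fields vary by component):

```yaml
# Sketch of a Collector-style metadata.yaml status block; the proposal
# would extend this pattern to instrumentation libraries.
type: filelog
status:
  class: receiver
  stability:
    beta: [logs]
```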
jsuereth
left a comment
I actually think the work detailed in this OTEP is large, sweeping and likely needs to be divided up.
I suggest you keep the goals of the OTEP in this, and set up workstreams / requirements that can be tackled in further OTEPs. I.e. This is bigger than one person or one "design".
E.g.
- Enabling experimental features - A workstream we can ask the Configuration SIG to drive.
- Federated Schema and declaring stability of a schema independently of semantic conventions - You can give this to Weaver / Semconv Tooling SIG (@lmolkova has an OTEP already to continue making progress here)
- Distributions / Releasing - A workstream around defining what a distribution is, and gate-keeping its default features to those that are stable
- Profiling - A workstream around profiling overhead and providing features / capabilities / infrastructure to allow Maintainers to set up these tests if they don't have them and meet the requirements listed here.
That's just top of mind, but I think we could refactor this OTEP to call out each workstream and find owners for those.
PTAL @open-telemetry/dotnet-instrumentation-maintainers
reyang
left a comment
Before going into the details, I have the same question that @cijothomas mentioned here.
"This OTEP proposes that OpenTelemetry distributions enable only stable components by default" - what does this mean? Say Company XYZ released their distribution of the OpenTelemetry Java SDK and included an unstable component - what would the OpenTelemetry community do? Do we send attorneys to them?
Hi all - thank you for the robust discussion on this proposal, as well as the feedback from the discussion during the most recent GC/TC meeting. I've gone through and rewritten the entire thing to focus more on identifying specific workstreams and their potential owners, in order to make this a more digestible set of changes. I will be resolving all comments on this PR; please re-review at your leisure.
I do want to be specific on this point. We, as a project, obviously have no recourse against people who do not respect the project rules as written. That said -
> The work is complete when we have a documented mechanism for enabling experimental features—whether through environment variables, configuration, or programmatic API—along with clear guidance on what "experimental" means and what users are opting into. Experimental features should be disabled by default with clear logging when enabled. Where possible, the design should align with existing patterns like Collector feature gates and `OTEL_SEMCONV_STABILITY_OPT_IN`.
>
> The Configuration SIG is the natural owner for this work.
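Such an opt-in mechanism could, for instance, mirror the comma-separated pattern of `OTEL_SEMCONV_STABILITY_OPT_IN`. A minimal sketch (the variable name `OTEL_EXPERIMENTAL_FEATURES` and the helper names are hypothetical, not part of any spec):

```python
import logging
import os

logger = logging.getLogger("opentelemetry.experimental")


def experimental_opt_ins(env_var: str = "OTEL_EXPERIMENTAL_FEATURES") -> set:
    """Parse a comma-separated opt-in list, mirroring how
    OTEL_SEMCONV_STABILITY_OPT_IN values are parsed today.
    The variable name is illustrative only."""
    raw = os.environ.get(env_var, "")
    return {item.strip() for item in raw.split(",") if item.strip()}


def feature_enabled(name: str) -> bool:
    enabled = name in experimental_opt_ins()
    if enabled:
        # Experimental features should log clearly when enabled.
        logger.warning("Experimental feature %r enabled; behavior may change.", name)
    return enabled


os.environ["OTEL_EXPERIMENTAL_FEATURES"] = "exemplars, events"
print(feature_enabled("exemplars"))  # True
print(feature_enabled("profiling"))  # False
```

A real design would also need per-component granularity and alignment with Collector feature gates, which is exactly what this workstream would decide.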
cc @open-telemetry/configuration-maintainers
The config SIG was initialized as a project with a particular scope, and as such is planning on shutting down upon stabilizing the specification. It can recharter / restart, but with different goals and potentially different people.
> This workstream should enable instrumentation stability to be assessed independently from semantic convention stability, with clear mechanisms for communicating telemetry stability to users. Instrumentation libraries should be able to declare API stability separately from telemetry output stability. Schema URLs should be populated consistently across instrumentations, enabling downstream tooling. Migration pathways should be documented when instrumentation stabilizes before its semantic conventions. Breaking changes to telemetry output should be treated as breaking changes, requiring major version bumps.
>
> The Semantic Conventions SIG and Weaver maintainers are the natural owners. Related work includes the [OTEP on federated semantic conventions](https://github.com/open-telemetry/opentelemetry-specification/pull/4815).
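One way such decoupled stability declarations could surface to tooling is a per-instrumentation record; a hypothetical sketch (the type and field names are illustrative, not a defined OpenTelemetry schema):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstrumentationStatus:
    # Hypothetical per-instrumentation stability record; field names
    # are illustrative, not a defined OpenTelemetry schema.
    name: str
    code_stability: str     # production readiness of the instrumentation code
    schema_url: str         # schema URL emitted with the telemetry
    semconv_stability: str  # stability of the conventions it emits


http_client = InstrumentationStatus(
    name="http-client",
    code_stability="stable",
    schema_url="https://opentelemetry.io/schemas/1.26.0",
    semconv_stability="experimental",
)
# The code can be production-ready even while the conventions it emits evolve.
print(http_client.code_stability, http_client.semconv_stability)
```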
cc @open-telemetry/weaver-maintainers
Draft proposal to put all the pieces together for this workstream - #4906
> This guidance should be aspirational rather than a set of blocking requirements. Components can be stable without meeting every criterion. Requiring extensive benchmarks and documentation for every component would worsen the "stuck on pre-release" problem, not improve it. The goal is to help maintainers understand what production users need without creating barriers to stabilization.
>
> The End User SIG and Communications SIG should own this work.
I love our End User + Communications SIG - but is this the right owner?
I think examples of this are crafting the Collector resiliency documentation, but the key questions here involve core architectural decisions about which architectures OTel components support, and making sure our releases fit into that cohesive whole.
In lieu of a better SIG, I'd suggest this belongs to the TC (today, by charter), and we should step up what we offer here.
> ## Open Questions
>
> Who will own each workstream? Should ownership be assigned before this OTEP is approved, or can workstreams proceed as volunteers emerge?
I'd suggest each workstream is either adopted as the roadmap of an existing SIG (when that's the owner) or becomes a new project in the governance model, with a dedicated project owner, to make sure this succeeds.
@jsuereth just to clarify - the intent of stability here is as follows:
Is this unclear from the current wording? I can add more explicit guidance.
Co-authored-by: Cijo Thomas <cithomas@microsoft.com>
I'll be at the 2/10 spec call to discuss this OTEP further synchronously with maintainers.
Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
As stated in the spec call - I think we agree at a high level, but the way this is worded appears to give license to something we do not want. Stable instrumentation SHOULD NOT break output telemetry. This means, in your wording, you should not say the output telemetry is unstable - just that the output telemetry is not in "global" semconv, but maintained locally instead.
I do not get this when reading the current proposal. This is my major beef with it, and I'd like to reword/rephrase so our stance on this is clear.
Reframe instrumentation stability around production readiness of code rather than separating API stability from telemetry output stability. Trim workstream sections to focus on problems and outcomes, leaving solution details to the workstreams themselves.
mx-psi
left a comment
This looks good to me; I would like to see approvals from the mentioned SIGs before approving myself, though.
jsuereth
left a comment
Thanks for fixing the wording here!
I forgot to come back and re-review - I was working on specific proposals w/ @lmolkova for the Federated Semconv piece - but this now looks good to me.
I think we'll need to kick off specific projects for areas which don't have SIG owners. Ideally, I think we should get a "single person who feels responsible" for each workstream, but we can sort out that detail later.
> OpenTelemetry has grown into a massive ecosystem supporting four telemetry signals across a dozen programming languages. This growth has come with complexity that creates real barriers to production adoption.
> Community feedback consistently identifies several pain points. Experimental features break production deployments—users report configuration breaking between minor versions, silent failures in telemetry pipelines, and unexpected performance regressions that only appear at scale. As one practitioner noted: "The silent failure policy of OTEL makes flames shoot out of the top of my head."
Suggested change (dropping the quotation):

> Community feedback consistently identifies several pain points. Experimental features break production deployments—users report configuration breaking between minor versions, silent failures in telemetry pipelines, and unexpected performance regressions that only appear at scale.
> Semantic convention changes destroy existing dashboards. When conventions change, users must update instrumentation across their entire infrastructure while simultaneously updating dashboards, alerts, and downstream tooling. Organizations report significant resistance from developers asked to coordinate these changes.
> Many instrumentation libraries are stuck on pre-release because they depend on experimental semantic conventions, even when the instrumentation API surface itself is mature and battle-tested. The "batteries not included" philosophy means users must assemble many components before achieving basic functionality. Documentation assumes expertise, and newcomers describe the experience as "overwhelming" with "no discoverability." Auto-instrumentation can add significant resource consumption that only becomes apparent at scale, with reports of "four times the CPU usage" compared to simpler alternatives. Users evaluating OpenTelemetry for production deployment need confidence in CVE response timelines, dependency hygiene, and supply chain security—areas where commitments are not well documented.
Suggested change (dropping the "four times the CPU usage" report):

> Many instrumentation libraries are stuck on pre-release because they depend on experimental semantic conventions, even when the instrumentation API surface itself is mature and battle-tested. The "batteries not included" philosophy means users must assemble many components before achieving basic functionality. Documentation assumes expertise, and newcomers describe the experience as "overwhelming" with "no discoverability." Auto-instrumentation can add significant resource consumption that only becomes apparent at scale. Users evaluating OpenTelemetry for production deployment need confidence in CVE response timelines, dependency hygiene, and supply chain security—areas where commitments are not well documented.
> These all stem from the same problem: OpenTelemetry's default configuration prioritizes feature completeness over production readiness. This OTEP establishes the goals and workstreams needed to address this.
> These all stem from the same problem: OpenTelemetry's default configuration prioritizes feature completeness over production readiness.

This seems like a bit of an oversimplification given some of the examples above.
> This workstream should result in a consistent pattern for experimental feature opt-in that works across SDKs, the Collector, and instrumentation libraries.
> The Configuration SIG is the natural owner for this work.
Configuration SIG doesn't exist anymore:
> - Stability information should be visible and consistent. Users should be able to easily determine the stability status of any component before adopting it, and this information should be presented consistently across all OpenTelemetry projects.
> - Instrumentation should be able to stabilize based on production readiness. The bar for a stable instrumentation library should be whether the instrumentation code itself is production-ready, not whether the semantic conventions it depends on have been finalized. However, once an instrumentation library stabilizes, any breaking change to its telemetry output must be treated as a breaking change requiring a major version bump.
> However, once an instrumentation library stabilizes, any breaking change to its telemetry output must be treated as a breaking change requiring a major version bump.
This seems like an unreasonable burden to place on things like auto instrumentation. Consider the example where an http client library is directly instrumented using OpenTelemetry APIs and is using the currently stable semantic conventions for http client calls. All auto instrumentation needs to do to enable capturing that telemetry is to subscribe to it (ActivitySource or Meter in dotnet, for example). The instrumentation version is directly coupled to the version of the http client library, and completely outside the control of auto instrumentation.
- Does this mean that there is an expectation that auto instrumentation implementations need to perform proactive testing to detect changes in the telemetry output for new library versions?
- Does auto instrumentation need a new major version whenever we want to support a new major version of 3rd party library that is natively instrumented?
- Will library authors consistently do a major version bump if the telemetry signal changes?
- Do we need something in this proposal specifically for auto instrumentation to call out how default instrumentations need to be managed?
Summary
This OTEP proposes that OpenTelemetry distributions enable only stable components by default, decouple instrumentation stability from semantic convention stability, and establish expanded stability criteria.
Key Proposals
- Extend the Collector's `metadata.yaml` pattern to instrumentation libraries

Motivation
Community feedback consistently identifies pain points that this OTEP addresses:
Related
Test plan