|
| 1 | +# Semantic conventions migrations in the Collector |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The OpenTelemetry Collector components emit telemetry that often conforms to semantic conventions. |
| 6 | +Semantic conventions have [varying levels of stability][1] and often have an SDK-focused migration |
| 7 | +guide. |
| 8 | + |
| 9 | +This RFC defines how Collector components should handle, in a Collector-native way, the |
| 10 | +migration of the semantic conventions they emit to a stable version. |
| 11 | + |
| 12 | +## Scope and goals |
| 13 | + |
| 14 | +This RFC provides general guidelines for semantic convention-mandated migrations of telemetry created by Collector components (usually receivers) and output into the Collector's pipeline. It explicitly does not attempt to cover: |
| 15 | +- telemetry created by an application and forwarded by a Collector receiver; |
| 16 | +- internal telemetry of Collector components; |
| 17 | +- guidelines for the migration of specific semantic conventions. |
| 18 | + |
| 19 | +The migration mechanism should have the following characteristics: |
| 20 | + |
| 21 | +1. **Collector native**: the mechanism should work in a similar way to other Collector migrations |
| 22 | + and should feel natural and intuitive to users. |
| 23 | +2. **Simple**: a user should have to make a small number of changes to their Collector deployment to |
| 24 | + migrate to a new set of conventions. |
| 25 | +3. **Easy to understand**: It should be easy to understand how to migrate a particular set of |
| 26 | + conventions. |
| 27 | +4. **Flexible (double publish)**: The mechanism should allow you to 'double publish' v0 and v1 |
| 28 | + conventions. |
| 29 | +5. **Flexible (other conventions)**: The mechanism should still allow for evolution of other |
| 30 | + semantic conventions that are not being migrated. |
| 31 | + |
| 32 | +## Background |
| 33 | + |
| 34 | +### Setup |
| 35 | + |
| 36 | +We want to write guidance for when we have a component that emits telemetry from a common |
| 37 | +`area` that is undergoing a migration mandated by the Semantic Conventions SIG. In the rest of this |
| 38 | +document we refer to the **v0** conventions and the **v1** conventions, which are the conventions |
| 39 | +in this area before and after the migration. |
| 40 | + |
| 41 | +When the semantic conventions are specific to a component, we use: |
| 42 | +- `kind` to refer to the component kind (receiver, exporter...) |
| 43 | +- `id` to refer to the component id (e.g. `hostmetrics`) |
| 44 | + |
| 45 | +### What does the semconv spec say? |
| 46 | + |
| 47 | +The semantic conventions specification defines an environment variable named |
| 48 | +`OTEL_SEMCONV_STABILITY_OPT_IN` that, for each area, takes two possible values: |
| 49 | +1. One value representing the new semantic conventions (e.g. `http`, `gen_ai_latest_experimental`) |
| 50 | +2. Once mature enough, a second value ending in `/dup` that emits both the old conventions and the |
| 51 | + new ones. |
| 52 | + |
| 53 | +This is not specified in a generic way, but it is a consistent pattern across all semantic |
| 54 | +conventions areas that are being actively worked on: |
| 55 | + |
| 56 | +<details> |
| 57 | + |
| 58 | +<summary> Example 1: HTTP compatibility warning </summary> |
| 59 | + |
| 60 | +Taken from [semconv v1.38.0][2]: |
| 61 | + |
| 62 | +> **Warning** |
| 63 | +> Existing HTTP instrumentations that are using |
| 64 | +> [v1.20.0 of this document](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.20.0/specification/trace/semantic_conventions/http.md) |
| 65 | +> (or prior): |
| 66 | +> |
| 67 | +> * SHOULD NOT change the version of the HTTP or networking conventions that they emit |
| 68 | +> until the HTTP semantic conventions are marked stable (HTTP stabilization will |
| 69 | +> include stabilization of a core set of networking conventions which are also used |
| 70 | +> in HTTP instrumentations). Conventions include, but are not limited to, attributes, |
| 71 | +> metric and span names, and unit of measure. |
| 72 | +> * SHOULD introduce an environment variable `OTEL_SEMCONV_STABILITY_OPT_IN` |
| 73 | +> in the existing major version which is a comma-separated list of values. |
| 74 | +> The only values defined so far are: |
| 75 | +> * `http` - emit the new, stable HTTP and networking conventions, |
| 76 | +> and stop emitting the old experimental HTTP and networking conventions |
| 77 | +> that the instrumentation emitted previously. |
| 78 | +> * `http/dup` - emit both the old and the stable HTTP and networking conventions, |
| 79 | +> allowing for a seamless transition. |
| 80 | +> * The default behavior (in the absence of one of these values) is to continue |
| 81 | +> emitting whatever version of the old experimental HTTP and networking conventions |
| 82 | +> the instrumentation was emitting previously. |
| 83 | +> * Note: `http/dup` has higher precedence than `http` in case both values are present |
| 84 | +> * SHOULD maintain (security patching at a minimum) the existing major version |
| 85 | +> for at least six months after it starts emitting both sets of conventions. |
| 86 | +> * SHOULD drop the environment variable in the next major version (stable |
| 87 | +> next major version SHOULD NOT be released prior to October 1, 2023). |
| 88 | +
|
| 89 | +</details> |
| 90 | + |
| 91 | +<details> |
| 92 | + |
| 93 | +<summary> Example 2: GenAI compatibility warning </summary> |
| 94 | + |
| 95 | +From [semconv v1.38.0][3]: |
| 96 | + |
| 97 | +> [!Warning] |
| 98 | +> |
| 99 | +> Existing GenAI instrumentations that are using |
| 100 | +> [v1.36.0 of this document](https://github.com/open-telemetry/semantic-conventions/blob/v1.36.0/docs/gen-ai/README.md) |
| 101 | +> (or prior): |
| 102 | +> |
| 103 | +> * SHOULD NOT change the version of the GenAI conventions that they emit by default. |
| 104 | +> Conventions include, but are not limited to, attributes, metric, span and event names, |
| 105 | +> span kind and unit of measure. |
| 106 | +> * SHOULD introduce an environment variable `OTEL_SEMCONV_STABILITY_OPT_IN` |
| 107 | +> as a comma-separated list of category-specific values. The list of values |
| 108 | +> includes: |
| 109 | +> * `gen_ai_latest_experimental` - emit the latest experimental version of |
| 110 | +> GenAI conventions (supported by the instrumentation) and do not emit the |
| 111 | +> old one (v1.36.0 or prior). |
| 112 | +> * The default behavior is to continue emitting whatever version of the GenAI |
| 113 | +> conventions the instrumentation was emitting (1.36.0 or prior). |
| 114 | +> |
| 115 | +> This transition plan will be updated to include stable version before the |
| 116 | +> GenAI conventions are marked as stable. |
| 117 | +
|
| 118 | +</details> |
| 119 | + |
| 120 | +<details> |
| 121 | + |
| 122 | +<summary> Example 3: K8s compatibility warning </summary> |
| 123 | + |
| 124 | +> From [semconv v1.38.0][3]: |
| 125 | +
|
| 126 | +> When existing K8s instrumentations published by OpenTelemetry are |
| 127 | +> updated to the stable K8s semantic conventions, they: |
| 128 | +> |
| 129 | +> - SHOULD introduce an environment variable `OTEL_SEMCONV_STABILITY_OPT_IN` in |
| 130 | +> their existing major version, which accepts: |
| 131 | +> - `k8s` - emit the stable k8s conventions, and stop emitting |
| 132 | +> the old k8s conventions that the instrumentation emitted previously. |
| 133 | +> - `k8s/dup` - emit both the old and the stable k8s conventions, |
| 134 | +> allowing for a phased rollout of the stable semantic conventions. |
| 135 | +> - The default behavior (in the absence of one of these values) is to continue |
| 136 | +> emitting whatever version of the old k8s conventions the |
| 137 | +> instrumentation was emitting previously. |
| 138 | +> - Need to maintain (security patching at a minimum) their existing major version |
| 139 | +> for at least six months after it starts emitting both sets of conventions. |
| 140 | +> - May drop the environment variable in their next major version and emit only |
| 141 | +> the stable k8s conventions. |
| 142 | +
|
| 143 | +> Specifically for the Opentelemetry Collector: |
| 144 | +
|
| 145 | +> The transition will happen through two different feature gates. |
| 146 | +> One for enabling the new schema called `semconv.k8s.enableStable`, |
| 147 | +> and one for disabling the old schema called `semconv.k8s.disableLegacy`. Then: |
| 148 | +
|
| 149 | +> - On alpha the old schema is enabled by default (`semconv.k8s.disableLegacy` defaults to false), |
| 150 | +> while the new schema is disabled by default (`semconv.k8s.enableStable` defaults to false). |
| 151 | +> - On beta/stable the old schema is disabled by default (`semconv.k8s.disableLegacy` defaults to true), |
| 152 | +> while the new is enabled by default (`semconv.k8s.enableStable` defaults to true). |
| 153 | +> - It is an error to disable both schemas |
| 154 | +> - Both schemas can be enabled with `--feature-gates=-semconv.k8s.disableLegacy,+semconv.k8s.enableStable`. |
| 155 | +
|
| 156 | +</details> |
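
The opt-in pattern quoted in the examples above can be sketched in Go as follows. This is a minimal illustration, not code from any SDK: the `Mode` type and the `resolveOptIn` function are hypothetical names, trimming whitespace around list items is an assumption, and applying the "`/dup` has higher precedence" rule to every area generalizes a note that the quoted text only states for `http`.

```go
package main

import (
	"fmt"
	"strings"
)

// Mode describes which convention versions are emitted for one area.
// Hypothetical type, introduced only for this sketch.
type Mode int

const (
	ModeOld Mode = iota // default: keep emitting the old conventions
	ModeNew             // "<area>": emit only the new conventions
	ModeDup             // "<area>/dup": emit both the old and the new conventions
)

// resolveOptIn interprets a value of OTEL_SEMCONV_STABILITY_OPT_IN
// (a comma-separated list) for a single area. "<area>/dup" wins over
// "<area>" when both are present; values for other areas are ignored.
func resolveOptIn(value, area string) Mode {
	mode := ModeOld
	for _, item := range strings.Split(value, ",") {
		switch strings.TrimSpace(item) { // trimming is an assumption
		case area + "/dup":
			return ModeDup
		case area:
			mode = ModeNew
		}
	}
	return mode
}

func main() {
	value := "http,http/dup,k8s"
	fmt.Println("http emits both:", resolveOptIn(value, "http") == ModeDup)
	fmt.Println("k8s emits new only:", resolveOptIn(value, "k8s") == ModeNew)
}
```

Note that the variable carries no notion of a default that changes over time, which is the gap the feature gate stages in the proposed mechanism are meant to fill.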
| 157 | + |
| 158 | +## Proposed mechanism |
| 159 | + |
| 160 | +Suppose the `<id>` (e.g. `hostmetrics`) `<kind>` (e.g. `receiver`) component is migrating from v0 to |
| 161 | +v1 semantic conventions on the area `<area>` (e.g. `process`). The semantic conventions specification |
| 162 | +defines the set of conventions that are in scope for a particular migration. |
| 163 | + |
| 164 | +To support this migration, the component defines two feature gates: `<kind>.<id>.EmitV1<Area>Conventions` (e.g. |
| 165 | +`receiver.hostmetrics.EmitV1ProcessConventions`) and `<kind>.<id>.DontEmitV0<Area>Conventions` |
| 166 | +(e.g. `receiver.hostmetrics.DontEmitV0ProcessConventions`). These feature gates work as follows: |
| 167 | + |
| 168 | +| `<kind>.<id>.EmitV1<Area>Conventions` status | `<kind>.<id>.DontEmitV0<Area>Conventions` status | Resulting behavior | |
| 169 | +|-----------------------------------------------|-------------------------------------------------------|-----------------------------------------------------------| |
| 170 | +| Disabled | Disabled | Emit telemetry under the 'v0' conventions | |
| 171 | +| Disabled | Enabled | Error at startup since this would not emit any telemetry | |
| 172 | +| Enabled | Disabled | Emit telemetry under both the v0 and the v1 conventions | |
| 173 | +| Enabled | Enabled | Emit telemetry under the v1 conventions | |
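
The naming scheme and the table above can be sketched in plain Go. This is only an illustration under stated assumptions: `gateIDs`, `Emission` and `resolve` are hypothetical names, the gate states are modeled as plain booleans rather than through the Collector's feature gate registry, and capitalizing only the first letter of the area is a guess at how `<Area>` is derived.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// gateIDs builds the two feature gate identifiers for a component,
// e.g. receiver.hostmetrics.EmitV1ProcessConventions.
func gateIDs(kind, id, area string) (emitV1, dontEmitV0 string) {
	title := area
	if area != "" {
		title = strings.ToUpper(area[:1]) + area[1:] // assumed capitalization rule
	}
	prefix := kind + "." + id + "."
	return prefix + "EmitV1" + title + "Conventions",
		prefix + "DontEmitV0" + title + "Conventions"
}

// Emission records which convention versions a component should emit.
type Emission struct {
	V0, V1 bool
}

// errNoTelemetry corresponds to the "error at startup" row of the table.
var errNoTelemetry = errors.New("v0 is disabled and v1 is not enabled: no telemetry would be emitted")

// resolve maps the state of the two gates to the resulting behavior.
func resolve(emitV1Enabled, dontEmitV0Enabled bool) (Emission, error) {
	if !emitV1Enabled && dontEmitV0Enabled {
		return Emission{}, errNoTelemetry
	}
	return Emission{V0: !dontEmitV0Enabled, V1: emitV1Enabled}, nil
}

func main() {
	emitV1, dontEmitV0 := gateIDs("receiver", "hostmetrics", "process")
	fmt.Println(emitV1)
	fmt.Println(dontEmitV0)

	// Walk through all four rows of the table.
	for _, row := range [][2]bool{{false, false}, {false, true}, {true, false}, {true, true}} {
		e, err := resolve(row[0], row[1])
		fmt.Printf("EmitV1=%t DontEmitV0=%t -> %+v err=%v\n", row[0], row[1], e, err)
	}
}
```

A component following this sketch would call something like `resolve` once at startup and fail fast on the error case, rather than silently emitting nothing.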
| 174 | + |
| 175 | +Both feature gates evolve at the same pace through the feature gate stages, so that the progression |
| 176 | +is as follows: |
| 177 | +1. Initially both are at **alpha** stage (disabled by default). This means that the default behavior |
| 178 | + is to emit only the 'v0' conventions. Users can opt in to emit the v1 conventions alongside the |
| 179 | + v0 conventions or to emit only the v1 conventions. A warning message must be logged by the component at startup indicating the upcoming change. |
| 180 | +2. When a semantic conventions release marks these conventions as stable, the feature gates are promoted to the |
| 181 | + **beta** stage on the same Collector release. The new default behavior is therefore to emit only the |
| 182 | + 'v1' conventions. Users can opt out, either emitting the v1 conventions alongside the v0 conventions or |
| 183 | + emitting only the v0 conventions. |
| 184 | +3. After 4 minor releases, the feature gates are promoted to the **stable** stage. At this point users |
| 185 | + can only use the v1 conventions. |
| 186 | +4. After 4 additional minor releases, the feature gates are removed. |
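
The progression above can be sketched as a small model of how a gate's stage determines its effective state. `Stage` and `effective` are hypothetical names for this sketch and not the Collector's feature gate API. The model assumes that alpha gates default to disabled, beta gates default to enabled, and stable gates are always enabled and reject an attempt to disable them.

```go
package main

import (
	"errors"
	"fmt"
)

// Stage is a simplified model of a feature gate stage.
type Stage int

const (
	StageAlpha  Stage = iota // disabled by default, user may enable
	StageBeta                // enabled by default, user may disable
	StageStable              // always enabled, cannot be disabled
)

// effective returns whether a gate is enabled, given its stage and an
// optional user override (nil means "not set by the user").
func effective(stage Stage, override *bool) (bool, error) {
	switch stage {
	case StageAlpha:
		if override != nil {
			return *override, nil
		}
		return false, nil
	case StageBeta:
		if override != nil {
			return *override, nil
		}
		return true, nil
	case StageStable:
		if override != nil && !*override {
			return true, errors.New("a stable feature gate cannot be disabled")
		}
		return true, nil
	}
	return false, fmt.Errorf("unknown stage %d", stage)
}

func main() {
	off := false
	// Both gates share a stage, so the defaults are (false, false) at alpha
	// (v0 only) and (true, true) at beta (v1 only).
	for _, s := range []Stage{StageAlpha, StageBeta, StageStable} {
		def, _ := effective(s, nil)
		_, err := effective(s, &off)
		fmt.Printf("stage=%d default=%t disabling allowed=%t\n", s, def, err == nil)
	}
}
```

Because both gates move through the stages together, a beta-stage user who sets only `DontEmitV0` to disabled lands in the double-publish row of the table, which is the opt-out path described in step 2.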
| 187 | + |
| 188 | +This mechanism does not cover any sort of transition for experimental semantic conventions. These |
| 189 | +presumably would be covered by separate feature gates or some other mechanism. |
| 190 | + |
| 191 | +## Alternative mechanisms |
| 192 | + |
| 193 | +There are some other possibilities: |
| 194 | + |
| 195 | +### Environment variable |
| 196 | + |
| 197 | +We could just use the `OTEL_SEMCONV_STABILITY_OPT_IN` mechanism. However, this does not feel |
| 198 | +"Collector native": Collector users expect experimental features to be controlled via feature gates |
| 199 | +and as such this could be a surprising mechanism. In particular, users would expect that they are |
| 200 | +able to 'roll back' to the previous behavior even after a Collector upgrade, something that the |
| 201 | +environment variable mechanism explicitly does not support. |
| 202 | + |
| 203 | +### More granular feature gate pairs |
| 204 | + |
| 205 | +The granularity of the feature gates described could be changed: we could have a pair per convention |
| 206 | +or even a pair for the whole Collector. I argue 'per component' strikes the right balance between |
| 207 | +simplicity and flexibility: |
| 208 | +- per convention would lead to dozens of feature gates on some of the areas we want to stabilize. It |
| 209 | + would also be unclear how these interact on edge cases (semantic conventions may only make sense |
| 210 | + holistically) |
| 211 | +- a single pair of feature gates would effectively be forever unstable and would not be flexible |
| 212 | + enough to allow people to migrate on a per dashboard basis |
| 213 | + |
| 214 | +### Meta feature gate |
| 215 | + |
| 216 | +We could have both a feature gate pair per component and a meta target feature gate pair that allows |
| 217 | +you to enable/disable all v1 conventions at the same time. This is effectively a superset of the |
| 218 | +proposed mechanism, so I argue we can postpone this for later: if users ask for it, we can always |
| 219 | +add it in the future. |
| 220 | + |
| 221 | +## Open questions and future possibilities |
| 222 | + |
| 223 | +This document does not cover how to deal with experimental semantic conventions after the 'big' |
| 224 | +migration has been completed in one particular area. What to do here in part depends on the |
| 225 | +[stabilization changes][4]. Quoting the blog post: |
| 226 | +> Instrumentation stability should be decoupled from semantic convention stability. We have a lot of |
| 227 | +> stable instrumentation that is safe to run in production, but has data that may change in the |
| 228 | +> future. Users have told us that conflating these two levels of stability is confusing and limits |
| 229 | +> their options. |
| 230 | +
|
| 231 | +How to deal with these remains an open question that should be tackled in OTEPs first. |
| 232 | + |
| 233 | +As mentioned above, the 'Meta feature gate' remains a possibility even when adopting this mechanism. |
| 234 | + |
| 235 | +[1]: https://opentelemetry.io/docs/specs/semconv/general/semantic-convention-groups/#group-stability |
| 236 | +[2]: https://github.com/open-telemetry/semantic-conventions/blob/v1.38.0/docs/http/README.md |
| 237 | +[3]: https://github.com/open-telemetry/semantic-conventions/blob/v1.38.0/docs/gen-ai/README.md |
| 238 | +[4]: https://opentelemetry.io/blog/2025/stability-proposal-announcement/ |