title Tracing redesign
authors
@squakez
reviewers
@zbendhiba
@davsclaus
approvers
@zbendhiba
@davsclaus
creation-date 2025-01-08
last-updated 2026-05-04
status implemented
see-also
replaces
superseded-by

Summary

Tracing and telemetry are a pillar of application observability, especially when applications are deployed in cloud environments or, more generally, in distributed systems. Over the last few years we have observed a growing demand for telemetry components, in particular the CNCF project OpenTelemetry, when running Camel applications in cloud environments.

Motivation

The growing usage of these components is also raising questions and exposing potential flaws in the current design of this feature. Recently we had to work on several issues to enhance the component or fix failing behaviors, and the code has become increasingly difficult to maintain. Most of the time we had to change the implementation, the abstraction and even the core dependencies to make things work: this is a symptom that we probably need a new design to simplify long-term maintenance.

Goals

The goal of this proposal is to analyze the current design and the challenges we are facing, and to provide an alternative design that is simpler to maintain in the long term.

Context

The Camel framework originally had an abstract component, camel-tracing, whose goal was to define a generic tracing lifecycle to be implemented concretely by specific technologies, such as camel-opentelemetry. The abstraction should take care of generic concepts (like when to create a new span according to the Camel eventing model). The implementation should take care of instantiating unique traces and providing the mechanics required to pull/push such traces to a trace collector system for later inspection.

Any user who wants to enable the tracing feature is required to include the component dependency and any specific configuration. The framework then takes care of wiring the Camel activity to a collection of traces.

I’ve performed a deep analysis over the last weeks, trying to figure out which major problems we need to tackle, and I came to the conclusion that the current design requires some review in order to set the base for stronger long-term maintenance. The following is a list of points that I think require attention when planning any future development.

Unclear tracing scope specification

We do not have a clear specification of what a trace or a span represents from the Camel point of view. We mostly think of it as a generic unit of work, without a clear definition of how it is bound to any Camel resource. There is no documentation about this, so users have to work out intuitively how a trace maps to the Camel domain model.

Implementation details slipped in the abstraction

In the past we introduced certain developments that required the abstraction to be aware of implementation details, such as AutoCloseable OpenTelemetry scopes. We also have developments that lack the required abstraction, which makes them specific to one implementation (for example, OpenTelemetry processor traces).

Ad hoc "side" features implementations

The implementations we use offer their own way to expose certain "side" features, for example setting the trace IDs into the MDC. However, we also have our own implementation, which is either conflicting or not working properly, as it relies on context propagation, which is generally part of the tracing/telemetry implementation.

Inconsistent context storage

The abstraction (in camel-tracing) maintains a stack-based structure for the created spans, which is stored in the Exchange. The data structure also maintains the hierarchical relationship between the different spans created during an Exchange execution. However, the implementation we have in camel-opentelemetry mixes this mechanism with its own storage mechanism, which is based on the Java ThreadLocal context. Additionally, we have implemented a context propagation mechanism based, again, on adding information to the Exchange headers. This creates inconsistency because it is hard to keep the Exchange and the implementation-specific mechanism synchronized. Moreover, we cannot really rely on this mechanism, as Camel cannot guarantee that the thread that started an action is the same one that will close it (more on this point in the new design proposal).

Async exchange boundaries

With the current design, Camel creates a new trace when it creates an Exchange and later adds a span for each process. However, when we create an asynchronous Exchange (e.g., with the wiretap EIP), it is considered part of the original Exchange, and so is the entire execution of the new Exchange. The result in the trace collector tool is that the new Exchange overflows the execution of the source Exchange.

Proposal

Before digging into the new design, we need to make an important consideration about how Camel works and how the major telemetry component we want to consider (OpenTelemetry) would require certain transformations. As mentioned in the "Inconsistent context storage" section, OpenTelemetry works on the assumption that any application can easily propagate the context through its threading model. This is not the case for Camel, above all because the system is heavily event-based for performance reasons. Mechanisms like ThreadLocal are a real limitation in our case, as they would require the thread that executes a given process to be wrapped by the OpenTelemetry logic (which is: create span, execute, close span, all on the same thread).
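The ThreadLocal limitation can be illustrated with plain Java. The following is a minimal sketch, not Camel or OpenTelemetry code: `ThreadLocalDemo` and `CURRENT_SPAN` are hypothetical names standing in for a tracing library's thread-bound "current span" holder.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo class: shows that thread-bound state does not follow the work.
final class ThreadLocalDemo {
    // Stand-in for the "current span" slot of a tracing library.
    static final ThreadLocal<String> CURRENT_SPAN = new ThreadLocal<>();

    /** Sets a span on the calling thread, then reads the slot from a worker thread. */
    static String readFromWorkerThread(String spanId) {
        CURRENT_SPAN.set(spanId);
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Callable<String> readCurrent = CURRENT_SPAN::get;
            return worker.submit(readCurrent).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdown();
            CURRENT_SPAN.remove();
        }
    }
}
```

Calling `ThreadLocalDemo.readFromWorkerThread("span-1")` returns `null`: the worker thread has its own empty slot, so a span set by the thread that started an action is invisible to the thread that continues it.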

The new design should not change how the core of the application works. We must be implementation agnostic, so the design should be flexible enough to adapt to any future implementation and avoid any important future refactoring.

I advocate going back to the roots of the original abstract component, first of all by defining what a trace means for Camel (tracing structure). Then we should provide a clear and flexible lifecycle for the traces (creation, activation, …): this is probably the abstract part we will need to delegate to implementation-specific components. In order to avoid consistency problems, we should exclusively use the Exchange as the means to store and define the hierarchy of spans (tracing storage). Any required activation/deactivation of a span during the lifecycle of the application must be done via the abstract lifecycle methods. Ideally we should also provide a simple, basic implementation that works as a mocking system to prove the abstraction is solid.

Tracing structure

Each new Exchange starts a new Trace. For each event during the execution of the Route, a separate Span is then created, whose goal is to capture the execution of each component or step.

During the execution of a route, a new Exchange can be created for each asynchronous event spun off from the main process. In such a case a new Span with a different Exchange ID is created. However, the Span still belongs to the same main Trace in order to correctly keep the trail of the execution.
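As an illustration of this structure, here is a minimal sketch of a span value type in which a child span keeps the trace ID of its parent even when it belongs to a different Exchange. The `SpanInfo` record and its field names are hypothetical, and random UUIDs are used as identifiers only for the sake of the example.

```java
import java.util.UUID;

// Hypothetical value type: the names are illustrative, not the camel-telemetry API.
record SpanInfo(String traceId, String spanId, String parentSpanId, String exchangeId) {

    /** Starts a new trace: the root span has no parent. */
    static SpanInfo root(String exchangeId) {
        return new SpanInfo(newId(), newId(), null, exchangeId);
    }

    /** Creates a child span in the same trace, possibly for another Exchange. */
    SpanInfo child(String childExchangeId) {
        return new SpanInfo(traceId, newId(), spanId, childExchangeId);
    }

    private static String newId() {
        return UUID.randomUUID().toString();
    }
}
```

With `SpanInfo main = SpanInfo.root("exchange-A")` and `SpanInfo wireTap = main.child("exchange-B")`, the two spans carry different Exchange IDs but share the same `traceId`, and `wireTap.parentSpanId()` points back to `main.spanId()`.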

Tracing lifecycle

The camel-tracing component should be the one in charge of managing the trace lifecycle. Any implementation-specific behavior has to adapt to this lifecycle, typically by implementing the required logic in the abstract methods exposed by the component. At this stage of the design, we can identify those functions as:

  • Span creation

  • Span activation

  • Span deactivation

  • Span closure

The creation method is in charge of creating a new root trace or a new span within an existing trace. The activation method tells the tracing system that a given span is the active one at a given moment. The deactivation method turns a given span off. Finally, the closure method finalizes a given span and, where applicable, the trace.

The above definition may feel redundant, as at this moment we probably need only a creation/activation method and a deactivation/closure method. However, keeping the four steps separate gives the abstraction more flexibility to meet the future requirements of any tracing technology.

This design is very similar to the original component design. However, we need to remove the implementation-specific details from the abstraction entirely. It is also important that we rely entirely on the component storage to retrieve the current span and perform the required action on it. With this proposal we will also need to remove from the core components certain logic we introduced in the past to support some features (e.g., the ExchangeAsyncProcessingStartedEvent implementation). This would improve component decoupling and provide higher cohesion.
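The four lifecycle steps could be expressed as a small contract. This is a sketch under assumptions, not the actual component API: `SpanLifecycle`, `RecordingLifecycle` and `LifecycleDriver` are hypothetical names. The driver calls the four steps around a single unit of work for brevity, whereas in Camel they would be triggered by separate events, possibly on different threads.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical contract: method names mirror the four lifecycle steps listed above.
interface SpanLifecycle<S> {
    S create(String name, S parent); // parent == null starts a new root trace
    void activate(S span);
    void deactivate(S span);
    void close(S span);
}

// Trivial implementation that only records the order of the calls.
final class RecordingLifecycle implements SpanLifecycle<String> {
    final List<String> calls = new ArrayList<>();

    @Override public String create(String name, String parent) {
        calls.add("create:" + name);
        return parent == null ? name : parent + "/" + name;
    }
    @Override public void activate(String span) { calls.add("activate:" + span); }
    @Override public void deactivate(String span) { calls.add("deactivate:" + span); }
    @Override public void close(String span) { calls.add("close:" + span); }
}

// The abstraction drives the lifecycle; implementations only fill in the four methods.
final class LifecycleDriver {
    static <S> void trace(SpanLifecycle<S> lifecycle, String name, S parent, Runnable work) {
        S span = lifecycle.create(name, parent);
        lifecycle.activate(span);
        try {
            work.run();
        } finally {
            lifecycle.deactivate(span);
            lifecycle.close(span);
        }
    }
}
```

An implementation that needs only two steps can leave `activate` and `deactivate` empty, which is the flexibility the four-method split is meant to preserve.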

Besides the span lifecycle we will need to consider a few more aspects:

  • Span decoration

  • Context propagation

Span decoration is a Camel-specific way of decorating the different components we handle with specific trace information. As an example, when you’re using the Kafka component, the trace automatically includes useful information such as the offset or the partition. We already have this mechanism in place and we should make sure to have clear documentation about this particular feature.

Context propagation is a way to correlate distributed traces with each other. It works by reading a traceparent header on the Exchange and using it to correlate to a chain of distributed requests. It’s important to notice that the specific propagation mechanism belongs to the implementation, so we will need to provide the required level of abstraction in the component (see the "Context propagation" chapter).

Tracing storage

The Exchange stack storage already exists and it may suffice for the goals of this proposal. Again, we need to remove the implementation-specific details from the abstraction and make sure, by design, that no implementation detail slips in in the future. One concern may be the correct handling of span opening and closure, which may differ according to the specifics of each implementation. However, if the lifecycle we have in place takes care of consistency, this should not be a problem: each implementation is in charge of doing what is needed when each lifecycle method is called. The Exchange stack storage can be used to store a span wrapper and maintain its state: this is something already available.
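The idea of a per-Exchange span stack can be sketched with plain collections. In this example a `Map<String, Object>` stands in for the Exchange properties, and `ExchangeSpanStorage` and its property key are hypothetical names, not the existing camel-tracing classes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Hypothetical helper: the Map parameter stands in for the properties of a Camel Exchange.
final class ExchangeSpanStorage {
    // Hypothetical property key, chosen for this example only.
    static final String KEY = "example.activeSpans";

    /** Pushes a span on the per-exchange stack, creating the stack on first use. */
    @SuppressWarnings("unchecked")
    static void push(Map<String, Object> exchangeProperties, String span) {
        Deque<String> stack = (Deque<String>) exchangeProperties
                .computeIfAbsent(KEY, k -> new ArrayDeque<String>());
        stack.push(span);
    }

    /** Returns the current (innermost) span, or null when none is open. */
    @SuppressWarnings("unchecked")
    static String peek(Map<String, Object> exchangeProperties) {
        Deque<String> stack = (Deque<String>) exchangeProperties.get(KEY);
        return stack == null ? null : stack.peek();
    }

    /** Removes and returns the innermost span; its parent becomes current again. */
    @SuppressWarnings("unchecked")
    static String pop(Map<String, Object> exchangeProperties) {
        Deque<String> stack = (Deque<String>) exchangeProperties.get(KEY);
        return stack == null ? null : stack.poll();
    }
}
```

Because the stack travels with the Exchange rather than with a thread, the current span can be retrieved from whichever thread handles the next lifecycle event.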

Thread local scope management

Certain implementations (e.g., OpenTelemetry) may leverage the Java ThreadLocal API. This is an implementation detail that we don’t want to manage directly in the abstraction; it has to be managed in each concrete implementation. In the specific case of OpenTelemetry, this is materialized by the concept of a Scope, and the Scope must be opened and closed within the same Thread to maintain consistency and avoid leakage. We will therefore implement it exclusively around the processor execution, via an InterceptStrategy. This means that the OpenTelemetry context is only accessible within a custom Processor execution (for example, to add a custom Span to your trace execution).
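The "open and close on the same thread" rule can be sketched without any tracing library. In the example below, `ThreadBoundScope` is a hypothetical stand-in for a thread-bound scope, and `ScopedProcessor` is a hypothetical wrapper playing the role that an InterceptStrategy-installed processor would play; a `Consumer` stands in for a Camel Processor.

```java
import java.util.function.Consumer;

// Hypothetical stand-in for a thread-bound tracing scope.
final class ThreadBoundScope implements AutoCloseable {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();
    private final String previous;
    private final Thread owner = Thread.currentThread();

    private ThreadBoundScope(String span) {
        this.previous = CURRENT.get();
        CURRENT.set(span);
    }

    static ThreadBoundScope open(String span) { return new ThreadBoundScope(span); }

    static String current() { return CURRENT.get(); }

    @Override
    public void close() {
        // Closing from another thread would restore the wrong thread's state.
        if (Thread.currentThread() != owner) {
            throw new IllegalStateException("scope closed on a different thread");
        }
        if (previous == null) CURRENT.remove(); else CURRENT.set(previous);
    }
}

// Hypothetical wrapper: the scope lives exactly as long as the delegate call.
final class ScopedProcessor {
    static <T> Consumer<T> wrap(String span, Consumer<T> delegate) {
        return body -> {
            try (ThreadBoundScope scope = ThreadBoundScope.open(span)) {
                delegate.accept(body);
            }
        };
    }
}
```

Inside the delegate, `ThreadBoundScope.current()` returns the span; once the wrapped call returns, the thread's slot is restored, so nothing leaks to whatever work the thread picks up next.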

Context propagation

Each telemetry implementation may provide a different way of managing context propagation. However, from the Camel perspective, this is managed consistently across the different implementations by consuming upstream headers and generating downstream headers accordingly.

┌───────────────────────┐     ┌───────────────────────┐     ┌───────────────────────┐
│   Upstream Component  │ ──> │        Camel          │ ──> │   Third-Party System  │
│ (sends traceparent)   │     │ consumes & injects    │     │ (receives traceparent)│
└───────────────────────┘     └───────────────────────┘     └───────────────────────┘
            traceparent ───────────────────────────────────────────────────────────▶

For this reason, each telemetry implementation needs to extract the expected header before generating a new trace. This typically happens during Trace/Span creation (i.e., in the create() method) if no parent span exists yet. In a similar fashion, the telemetry component must implement an inject() method in order to pass the origin of the trace to any downstream system. This mechanism also lets different Camel systems properly exchange traces among themselves.

It is therefore important that any upstream component (Java libraries, agents, …) properly implements the given context propagation specification. The de facto standard is the W3C Trace Context specification, which is the one used in OpenTelemetry-based telemetry components. Camel OpenTelemetry telemetry components must be compliant with the standard.
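To make the extract/inject pair concrete, here is a minimal sketch that parses and writes a W3C `traceparent` header (version `00` only) on a plain header map. The `TraceParent` record and its method signatures are hypothetical; only the header name and its `version-traceid-parentid-flags` layout come from the W3C Trace Context specification.

```java
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the Map parameters stand in for Exchange headers.
record TraceParent(String traceId, String parentId, String flags) {
    static final String HEADER = "traceparent";
    // Version 00 layout: 32 hex chars of trace id, 16 of parent id, 2 of flags.
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    /** Reads the upstream context; empty means "start a new root trace". */
    static Optional<TraceParent> extract(Map<String, String> headers) {
        String value = headers.get(HEADER);
        if (value == null) return Optional.empty();
        Matcher m = FORMAT.matcher(value.trim());
        if (!m.matches()) return Optional.empty();
        // All-zero ids are invalid per the specification.
        if (m.group(1).chars().allMatch(c -> c == '0')
                || m.group(2).chars().allMatch(c -> c == '0')) {
            return Optional.empty();
        }
        return Optional.of(new TraceParent(m.group(1), m.group(2), m.group(3)));
    }

    /** Writes the header for the downstream call, with the current span as parent. */
    void inject(Map<String, String> headers, String currentSpanId) {
        headers.put(HEADER, "00-" + traceId + "-" + currentSpanId + "-" + flags);
    }
}
```

The trace ID is carried over unchanged while the parent ID is replaced by the ID of the span that makes the downstream call, which is what lets the collector stitch the distributed requests into one trace.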

Tracing simple implementation (mock)

If we move most of the logic into the abstraction, writing a simple implementation should be straightforward. We expect this implementation to implement the abstract methods listed in the "Tracing lifecycle" section, which can be as simple as generating UUIDs and storing the trace information in MDC variables so that it appears in the application log. No push/pull to any collector is expected; this implementation serves more as a way to debug the abstraction, making sure that no implementation-specific detail is the cause of faulty behavior.
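A mock of this kind could look roughly like the sketch below. `MockTracer` is a hypothetical name, the `mdc` map is a stand-in for the logging framework's MDC, and the `trace_id` and `span_id` keys are chosen for this example only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical mock tracer: ids are random UUIDs and nothing leaves the process.
final class MockTracer {
    // Stand-in for a logging MDC; a real implementation would use the logging framework's own.
    final Map<String, String> mdc = new HashMap<>();

    /** Starts a span; a null traceId starts a new trace. Returns {traceId, spanId}. */
    String[] start(String traceId) {
        String trace = traceId != null ? traceId : UUID.randomUUID().toString();
        String span = UUID.randomUUID().toString();
        mdc.put("trace_id", trace);
        mdc.put("span_id", span);
        return new String[] {trace, span};
    }

    /** Ends the span and clears the logging context. */
    void end() {
        mdc.remove("trace_id");
        mdc.remove("span_id");
    }
}
```

Since the identifiers only end up in the logging context, any inconsistency observed with such a mock points at the abstraction rather than at a collector or an exporter.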

Tracing specific implementations

The feature-specific implementations should therefore be limited to the implementation of the abstract methods, as in the simple implementation. With this approach we limit the maintenance of each specific technology to the bare minimum. With this proposal we will need to substantially reduce the code in the existing implementations (camel-opentelemetry).

Development

This design proposal may introduce certain breaking changes, which is why we must clarify the scope and plan the work in order to avoid introducing breaking changes in any non-major version. We may work by adding a new abstract component compliant with this new specification and, once the new development is stable enough, deprecate the older camel-tracing and let users replace it with the newer one.

Below we keep track of the development iterations until completion:

Abstract camel-telemetry component (2025-01-28)

Developed a first draft of the component which covers the specification in this document. We have a base set of tests covering the main features and a mock tracing implementation used to validate such test scenarios.

camel-telemetry-dev component (2025-02-11)

Developed a concrete mock/debugging component implementing the camel-telemetry specification that can be used for development purposes.

camel-opentelemetry2 component (2025-02-24)

Developed a concrete OpenTelemetry component implementing the camel-telemetry specification. This component will eventually replace the camel-opentelemetry component.

camel-micrometer-observability component (2025-08-21)

Developed a concrete Micrometer Observability component implementing the camel-telemetry specification. This component will eventually replace the camel-observation component.

Inclusion of generated traces headers (2025-10-21)

A new feature is available to add TRACE_ID and SPAN_ID Exchange headers. This is very useful in conjunction with MDC header usage. From now on, the generated telemetry trace identifiers are set consistently and can be consumed through the MDC mechanism.

Design clarification about the adoption of W3C trace context (2025-11-07)

Added a chapter to clarify Camel's adoption of, and compliance with, the W3C Trace Context specification for trace context propagation.

Deprecation of older camel-tracing components (2026-03-10)

Added a deprecation notice for camel-tracing and related components. Also identified the custom logic these components previously required in the core dependencies, and deprecated that part for future removal.

Removal of explicit camel-opentelemetry2 Scope wrapping to avoid leakage (2026-05-04)

As described in https://issues.apache.org/jira/browse/CAMEL-23380, any version before 4.21 suffered from potential leaks. This was due to the assumption that it was fine to deal with a "dirty" context (a context that could be reused by asynchronous threads). Although the final solution was consistent, we realized that, in the long run, the leak could cause disruptions.

In order to prevent this problem we needed to rethink the implementation details of camel-opentelemetry2 and remove the explicit Scope management that, in asynchronous executions, opened the Scope in one thread and closed it in another (what we had called a "dirty" context). We are now removing this explicit management and moving this part exclusively into the custom Camel Processors. There, Camel takes care of opening the OpenTelemetry scope and closing it within the same thread.

This means that from now on we get rid of the leak, but the end user or any third-party dependency can only access the OpenTelemetry context within the boundary of a Processor execution.