
Commit 4df5a87

docs(tracing): OpenTelemetry tracing documentation for Astarte
Add a `TRACING.md` documentation file to reflect the actual OpenTelemetry implementation introduced across the Astarte platform. The documentation covers:

- **What/Why:** The holistic view of HTTP -> AMQP -> RPC -> DB tracing.
- **Propagation methods:** How context is injected/extracted in AMQP headers and how it is passed as part of the payload in GenServer calls.
- **Instrumentation details:** How Phoenix APIs are instrumented, how spans are enriched with Astarte domain attributes (`astarte.realm`, etc.), and how Xandra's native `:telemetry` events are used to trace Cassandra queries.
- **Local development:** Step-by-step instructions for using the new `docker-compose.tracing.yml` Jaeger profile to easily inspect traces.

These improvements ensure that developers have clear, accurate, and up-to-date guidance for working with distributed tracing in Astarte.

Signed-off-by: Davide Briani <davide.briani@secomind.com>
1 parent d9d4504

1 file changed: TRACING.md (67 additions, 0 deletions)
# Distributed Tracing in Astarte
Distributed tracing allows you to track requests as they flow through the various services of the Astarte IoT platform. This provides a holistic view of the system's performance and helps identify issues that span multiple service boundaries.
Astarte uses [OpenTelemetry](https://opentelemetry.io/) (OTEL) for distributed tracing. The implementation is integrated into the Elixir-based services using the official OpenTelemetry Erlang/Elixir SDKs. It captures spans for HTTP requests, AMQP messages, internal RPC calls, and database queries, providing a complete trace of data as it moves through the platform.
## Why
Tracing is invaluable for several reasons:
- **Identifying Bottlenecks**: Pinpoint exactly which service or database query is slowing down a request.
- **Context Across Service Boundaries**: Astarte is a distributed system where a single logical operation might start with an HTTP request, trigger an AMQP message, and result in multiple database operations. Tracing preserves the context across these boundaries (HTTP -> AMQP -> RPC -> DB).
- **Debugging Messaging & RPC**: Track the lifecycle of messages as they are produced and consumed across different exchanges/queues and GenServer boundaries.
- **Performance Benchmarking**: Collect data on how different parts of the system perform under various loads to guide optimization efforts.
## How
### W3C Context Propagation
Astarte follows the [W3C Trace Context](https://www.w3.org/TR/trace-context/) specification for propagating trace information. This ensures compatibility with other tools and services that follow the same standard.
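Concretely, the W3C format encodes the trace ID, parent span ID, and trace flags in a single `traceparent` value. The canonical example from the specification:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

Here `00` is the version, followed by the 16-byte trace ID, the 8-byte parent span ID, and the `01` flag marking the trace as sampled.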
### AMQP Propagation
Unlike HTTP, AMQP has no standardized mechanism for propagating trace context (there is no equivalent of the `traceparent` request header), so Astarte injects and extracts it manually via AMQP message headers.
- **Injection**: When a service publishes an AMQP message (e.g., in `Astarte.Events.AMQPEvents.Producer`), it injects the current trace context into the AMQP message headers using `:otel_propagator_text_map.inject/1`.
- **Extraction**: When a service consumes an AMQP message (e.g., in `Astarte.TriggerEngine.AMQPConsumer.AMQPMessageConsumer`), it retrieves the trace context from the headers and links the new spans to the existing trace using `:otel_propagator_text_map.extract/1` and `OpenTelemetry.Ctx.attach/1`.
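As a rough sketch of this flow (not the actual Astarte code; it assumes the `amqp` library and `opentelemetry_api`, and the variable names are illustrative):

```elixir
# Producer side: inject the current trace context into a header list.
# inject/1 adds entries such as "traceparent" to the carrier.
otel_headers = :otel_propagator_text_map.inject([])

amqp_headers = Enum.map(otel_headers, fn {k, v} -> {k, :longstr, v} end)
AMQP.Basic.publish(channel, exchange, routing_key, payload, headers: amqp_headers)

# Consumer side: rebuild the context from the received headers and attach it,
# so spans created while handling the message join the producer's trace.
carrier = Enum.map(received_headers, fn {k, _type, v} -> {k, v} end)
ctx = :otel_propagator_text_map.extract(carrier)
_token = OpenTelemetry.Ctx.attach(ctx)
```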
### RPC Propagation
For internal GenServer and cross-node communication (e.g., `astarte_rpc` and Data Updater Plant calls), the trace context is passed directly within the message payload. The caller appends `OpenTelemetry.Ctx.get_current()` to the GenServer call, and the receiving server extracts and attaches it before processing the request.
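A minimal sketch of this pattern (the module, message shape, and `process/1` helper are illustrative, not the actual Astarte RPC code):

```elixir
# Caller: bundle the current OTel context with the request.
ctx = OpenTelemetry.Ctx.get_current()
GenServer.call(server, {:handle_request, payload, ctx})

# Callee: attach the caller's context before processing, detach afterwards
# so the process's context does not leak into unrelated requests.
def handle_call({:handle_request, payload, ctx}, _from, state) do
  token = OpenTelemetry.Ctx.attach(ctx)

  try do
    {:reply, process(payload), state}
  after
    OpenTelemetry.Ctx.detach(token)
  end
end
```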
### Instrumentation
- **Phoenix Instrumentation**: Web-facing services use `opentelemetry_phoenix` to automatically create spans for incoming HTTP requests.
- **Metadata Enrichment**: To make traces searchable by business entities, plugs across Astarte's APIs (AppEngine, Realm Management, Pairing) enrich the active span with domain-specific OpenTelemetry attributes. Using `OpenTelemetry.Tracer.set_attribute/2`, traces are tagged with `astarte.realm`, `astarte.device_id`, `astarte.hw_id`, and `astarte.interface_name`.
- **Cassandra Instrumentation**: Astarte uses a custom `XandraTracing` module to trace Cassandra queries. It natively hooks into Xandra's `:telemetry` events (`[:xandra, :execute_query, :start]`, etc.) to emit `db.execute_query` spans, enriching them with `db.system`, `db.user`, and the actual `db.statement`.
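To illustrate the enrichment pattern, a hypothetical plug (not the actual Astarte module; the module name and the `realm_name` path parameter are assumptions) might look like:

```elixir
defmodule MyAppWeb.Plug.TraceMetadata do
  @behaviour Plug
  # set_attribute/2 is a macro, so the tracer must be required.
  require OpenTelemetry.Tracer

  def init(opts), do: opts

  def call(%Plug.Conn{} = conn, _opts) do
    # Tag the active span so traces can later be searched by realm.
    case conn.path_params["realm_name"] do
      nil -> :ok
      realm -> OpenTelemetry.Tracer.set_attribute(:"astarte.realm", realm)
    end

    conn
  end
end
```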
### Opt-in Configuration
Tracing is designed to be opt-in. By default, if no exporter is configured, the overhead is minimal as spans are not exported. To enable tracing, configure an OTLP exporter endpoint.
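For example, exporting can be turned on with the standard OpenTelemetry SDK environment variables (the endpoint and service name below are illustrative):

```sh
export OTEL_TRACES_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_RESOURCE_ATTRIBUTES="service.name=astarte-appengine-api"
```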
## Local Development
To enable tracing during local development, a pre-configured Jaeger instance is provided via Docker Compose.
The `docker-compose.tracing.yml` file sets up the Jaeger service and injects the required `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_RESOURCE_ATTRIBUTES` environment variables into all Astarte microservices.
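An illustrative sketch of what such a file provides (the actual `docker-compose.tracing.yml` in the repository is authoritative):

```yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
    ports:
      - "16686:16686" # Jaeger UI
      - "4317:4317"   # OTLP gRPC ingest
```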
To start the tracing-enabled environment, run:
```sh
docker-compose -f docker-compose.yml -f docker-compose.tracing.yml up
```
Once your services are running:
1. Perform some actions in Astarte (e.g., make an API request).
2. Open the Jaeger UI at [http://localhost:16686](http://localhost:16686).
3. Select a service (e.g., `astarte-appengine-api`) from the dropdown and click "Find Traces".
4. You can also search by tags like `astarte.realm=myrealm` to find specific traces.
## Production
In a production environment managed by the Astarte Kubernetes Operator, enabling OpenTelemetry involves:
1. **Configuring the OTEL Collector**: It's recommended to run an OpenTelemetry Collector as a sidecar or a standalone deployment to aggregate and forward traces to your backend (e.g., Jaeger, Honeycomb, Grafana Tempo).
2. **Setting Environment Variables**: Configure the `OTEL_EXPORTER_OTLP_ENDPOINT` for Astarte services to point to the collector.
3. **Operator Configuration**: Check the [Astarte Operator documentation](https://github.com/astarte-platform/astarte-kubernetes-operator) for the latest instructions on enabling distributed tracing in production via the Astarte Custom Resource (CR).
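For step 1, a minimal collector configuration might look like the following sketch (the backend endpoint and exporter name are illustrative):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```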
