This guide explains how to implement and configure distributed tracing for the API Platform Gateway components.
The default tracing services included in the Docker Compose configuration are demonstration services designed to showcase how you can observe distributed traces across gateway components in a centralized setup. These services provide a reference implementation that you can use out-of-the-box for development, testing, or as a starting point for your production tracing strategy.
Important: You are free to choose any tracing or observability strategy that suits your environment and requirements. The provided setup is just one of many possible configurations.
The default tracing stack consists of:
- OpenTelemetry (OTLP) Collector: Receives, processes, and exports trace data from gateway components
- Jaeger: Stores and visualizes distributed traces with a web UI for trace exploration and analysis
Trace data flows through the stack as follows:
- Gateway components (gateway-controller, policy-engine, router) are configured to export traces via OTLP (OpenTelemetry Protocol)
- Components send trace spans to the OpenTelemetry Collector via gRPC (port 4317) or HTTP (port 4318); a quick smoke test of the HTTP endpoint is sketched after this list
- The OTLP Collector processes traces (batching, adding resource attributes, etc.)
- The OTLP Collector forwards traces to Jaeger for storage and visualization
- Users can view and analyze traces through the Jaeger UI
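If you want to confirm the collector is reachable over OTLP/HTTP, one minimal smoke test (assuming port 4318 is published to the host in your Docker Compose file) is to POST an empty trace payload to the standard /v1/traces path; an HTTP 200 with an empty or partialSuccess body indicates the receiver is up:

curl -sS -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'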
Distributed tracing tracks a request as it flows through multiple components:
- Trace: Represents the entire journey of a request through the system
- Span: Represents a single operation within a trace (e.g., policy execution, upstream call)
- Context Propagation: Traces are correlated across components using trace IDs and span IDs in headers
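For example, the W3C traceparent header (the same value used in the propagation example later in this guide) packs these identifiers into a single line:

traceparent: <version>-<trace-id>-<parent-span-id>-<trace-flags>
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

Here the version is 00, the trace ID is 32 hex characters, the parent span ID is 16 hex characters, and the trace-flags value 01 marks the request as sampled.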
You need to enable tracing in the gateway configuration file and point it to your OTLP collector endpoint.
The tracing configuration is located in gateway/configs/config.toml:
[tracing]
enabled = true # Set to true to enable tracing
endpoint = "otel-collector:4317" # OTLP collector gRPC endpoint
service_version = "0.2.0" # Service version
batch_timeout = "1s" # Batch timeout for exporting spans
max_export_batch_size = 512 # Maximum spans per batch
sampling_rate = 1.0 # Sample rate (1.0 = 100%, 0.5 = 50%)

The tracing services included in the Docker Compose file (Jaeger and the OpenTelemetry Collector) are provided as demonstration services to show one possible way to collect and visualize traces. You can use them as-is for development and testing, or replace them with your own tracing solution.
The gateway uses Docker Compose profiles to optionally enable these demonstration tracing services.
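As a sketch of how this looks in docker-compose.yaml (the image tags, ports, and network name are illustrative and may differ from the shipped file), the demonstration services carry a profiles entry so they start only when the tracing profile is requested:

otel-collector:
  image: otel/opentelemetry-collector:latest
  profiles: ["tracing"]
  ports:
    - "4317:4317"   # OTLP gRPC
    - "4318:4318"   # OTLP HTTP
  networks:
    - gateway-network

jaeger:
  image: jaegertracing/all-in-one:latest
  profiles: ["tracing"]
  ports:
    - "16686:16686" # Jaeger UI
  networks:
    - gateway-network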
To start the gateway with the demonstration tracing services enabled:
docker compose --profile tracing up -d

This starts:
- Core gateway services (gateway-controller, policy-engine, router) - which export traces to OTLP collector
- OpenTelemetry Collector - receives and processes traces
- Jaeger - stores and visualizes traces
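To confirm that the demonstration services actually came up, you can list the services in the profile (service names here assume the default compose file):

docker compose --profile tracing ps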
To run only the core gateway services without the demonstration tracing stack:
docker compose up -d

Note: If tracing is enabled in the configuration but the OTLP collector is not running, components will log warnings about failed trace exports. To completely disable tracing, set enabled = false in the configuration.
To stop all services including the tracing stack:
docker compose --profile tracing down

Note: Jaeger stores traces in memory by default. Stopping the service will lose all trace data. For persistent storage, configure Jaeger with a backend database (see the Jaeger documentation).
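As one illustration of persistent storage, the all-in-one Jaeger image can be switched from in-memory storage to its embedded Badger backend; the volume name and paths below are assumptions rather than part of the default setup, and the named volume must also be declared under the top-level volumes key:

jaeger:
  image: jaegertracing/all-in-one:latest
  profiles: ["tracing"]
  environment:
    - SPAN_STORAGE_TYPE=badger
    - BADGER_EPHEMERAL=false
    - BADGER_DIRECTORY_VALUE=/badger/data
    - BADGER_DIRECTORY_KEY=/badger/key
  volumes:
    - jaeger-badger:/badger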
Once you've started the gateway with the tracing profile, follow these steps to view distributed traces:
Open your browser and navigate to:
http://localhost:16686
The Jaeger UI provides several ways to search for traces:
- Select a Service from the dropdown:
  - policy-engine - View traces from the policy engine
  - router - View traces from the Envoy router
- Select an Operation (optional):
  - Choose "all" to see all operations
  - Or select a specific operation (e.g., a specific policy execution)
- Adjust the Lookback time range:
  - Default: Last 1 hour
  - Options: 5m, 15m, 1h, 6h, 12h, 1d, 2d, Custom
- Add Filters (optional):
  - Tags: Filter by specific tag values (e.g., http.status_code=500)
  - Min/Max Duration: Filter by trace duration
  - Limit Results: Control the number of traces returned (default: 20)
- Click Find Traces
Click on any trace in the results to view detailed information:
- Visual timeline showing all spans in the trace
- Duration bars showing relative time spent in each operation
- Parent-child relationships between spans
- Color coding by service
Click on any span to see:
- Operation name: What operation was performed
- Duration: How long it took
- Tags: Metadata about the operation (HTTP method, status code, etc.)
- Logs: Events logged during the span (errors, warnings, etc.)
- Process: Service name, version, and host information
Finding Slow Requests:
- Set Min Duration filter (e.g., 1000ms)
- Click Find Traces
- Examine spans to identify bottlenecks
Debugging Errors:
- Filter by tag: error=true or http.status_code=500
- Click on error traces
- Examine span logs and tags for error details
Understanding Request Flow:
- Search for a specific trace ID (from logs or headers)
- View the complete request path through all components
- Identify which component handled which part of the request
You can compare multiple traces to identify patterns:
- Select multiple traces using checkboxes
- Click Compare Traces button
- View side-by-side comparison of trace structure and timings
View how services interact:
- Click Dependencies in the top navigation
- Select time range
- View graph showing service-to-service communication patterns
To reduce trace volume in high-traffic environments, adjust the sampling rate:
[tracing]
sampling_rate = 0.1 # Sample 10% of requests

Sampling strategies:
- 1.0 (100%): Sample all requests - recommended for development and low-traffic environments
- 0.5 (50%): Sample half of requests - moderate traffic
- 0.1 (10%): Sample 10% of requests - high traffic
- 0.01 (1%): Sample 1% of requests - very high traffic
Note: Lower sampling rates reduce overhead but may miss important traces.
Customize service names for better identification:
[policy_engine]
service_name = "policy-engine-prod-us-east-1"

Optimize batch settings for your environment:
[tracing]
batch_timeout = "5s" # Wait up to 5s before exporting
max_export_batch_size = 1024 # Export up to 1024 spans per batch

Lower timeout: faster trace visibility, more network overhead. Higher timeout: better batching efficiency, slower trace visibility.
While the default setup uses Jaeger, the gateway components use OpenTelemetry and can export to any OTLP-compatible backend.
Moesif provides API analytics and monitoring with support for OpenTelemetry traces. It treats each HTTP request/response span as an API event for detailed analytics.
No additional Docker services required - Moesif is a cloud-based SaaS platform. You only need to configure the OTLP Collector to export traces to Moesif's API.
Update the OTLP Collector configuration (gateway/observability/otel-collector/config.yaml) to export to Moesif:
exporters:
# Export to Moesif
otlphttp:
endpoint: https://api.moesif.net/v1/traces
headers:
X-Moesif-Application-Id: 'your-moesif-application-id'
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
      exporters: [otlphttp] # Send to Moesif

Important notes:
- The endpoint uses HTTPS (not HTTP)
- Use the otlphttp exporter (not otlp, which uses gRPC)
- The X-Moesif-Application-Id header is required for authentication
To obtain your Moesif Application ID:
- Sign up for a Moesif account at moesif.com
- Log in to your Moesif dashboard
- Navigate to Settings → Installation or API Keys
- Locate the Collector Application ID field
- Copy your unique Application ID
For better security, use environment variables for the Application ID:
exporters:
otlphttp:
endpoint: https://api.moesif.net/v1/traces
headers:
      X-Moesif-Application-Id: '${MOESIF_APPLICATION_ID}'

Update docker-compose.yaml to pass the environment variable:
otel-collector:
image: otel/opentelemetry-collector:latest
environment:
- MOESIF_APPLICATION_ID=${MOESIF_APPLICATION_ID}
    # ... rest of configuration

Set the environment variable before starting:
export MOESIF_APPLICATION_ID=your-moesif-application-id
docker compose --profile tracing up -d

After configuring and starting the gateway:
- Navigate to moesif.com and log in
- Go to Events → Live Event Log to see incoming API events
- View API analytics, user behavior, and performance metrics
- Use Time Series to analyze API usage trends
- Set up Alerts for error rates, latency, or custom conditions

Moesif capabilities include:
- API Analytics: Request volume, response times, error rates
- User Tracking: Identify and track API users across requests
- Error Analysis: Detailed error tracking with request/response bodies
- Behavioral Cohorts: Group users by API usage patterns
- Custom Dashboards: Build visualizations for your specific KPIs
- Alerting: Get notified of anomalies or threshold breaches
You can send traces to both Jaeger (for development) and Moesif (for analytics):
exporters:
# Local Jaeger for development
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
# Moesif for analytics
otlphttp/moesif:
endpoint: https://api.moesif.net/v1/traces
headers:
X-Moesif-Application-Id: '${MOESIF_APPLICATION_ID}'
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
      exporters: [otlp/jaeger, otlphttp/moesif]

To replace Jaeger with Zipkin, add a Zipkin service to the Docker Compose file:
zipkin:
image: openzipkin/zipkin:latest
ports:
- "9411:9411"
networks:
    - gateway-network

Update the OTLP Collector configuration to export to Zipkin:
exporters:
zipkin:
    endpoint: http://zipkin:9411/api/v2/spans

Access the Zipkin UI at http://localhost:9411.
For Grafana Tempo, a tracing backend that pairs with Grafana for visualization:
tempo:
image: grafana/tempo:latest
command: ["-config.file=/etc/tempo.yaml"]
volumes:
- ./observability/tempo/tempo.yaml:/etc/tempo.yaml
- tempo-data:/tmp/tempo
ports:
- "3200:3200" # Tempo HTTP
- "4317:4317" # OTLP gRPC
networks:
- gateway-network
grafana:
image: grafana/grafana:latest
environment:
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
ports:
- "3000:3000"
volumes:
- ./observability/grafana/datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
networks:
- gateway-network
depends_on:
    - tempo

Configure the gateway to send traces directly to Tempo:
[tracing]
endpoint = "tempo:4317"

Configure the OTLP Collector to export to AWS X-Ray:
exporters:
awsxray:
region: us-east-1
    no_verify_ssl: false

Or use the AWS Distro for OpenTelemetry (ADOT) Collector:
otel-collector:
image: public.ecr.aws/aws-observability/aws-otel-collector:latest
command: ["--config=/etc/otel-collector-config.yaml"]
environment:
    - AWS_REGION=us-east-1

Configure the OTLP Collector to export to Google Cloud:
exporters:
googlecloud:
project: your-gcp-project-id
    use_insecure: false

Use the Azure Monitor exporter:
exporters:
azuremonitor:
    instrumentation_key: "your-instrumentation-key"

Configure the OTLP Collector to export to Datadog:
exporters:
datadog:
api:
key: ${DD_API_KEY}
    site: datadoghq.com

Or send traces to the Datadog Agent directly:
datadog-agent:
image: datadog/agent:latest
environment:
- DD_API_KEY=${DD_API_KEY}
- DD_APM_ENABLED=true
- DD_APM_NON_LOCAL_TRAFFIC=true
- DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_GRPC_ENDPOINT=0.0.0.0:4317
ports:
- "4317:4317"
networks:
    - gateway-network

Update the gateway configuration:
[tracing]
endpoint = "datadog-agent:4317"

Configure the OTLP Collector to export to New Relic:
exporters:
otlphttp:
endpoint: https://otlp.nr-data.net:4317
headers:
      api-key: ${NEW_RELIC_LICENSE_KEY}

For Honeycomb:

exporters:
otlp:
endpoint: api.honeycomb.io:443
headers:
      x-honeycomb-team: ${HONEYCOMB_API_KEY}

For Lightstep:

exporters:
otlp:
endpoint: ingest.lightstep.com:443
headers:
      lightstep-access-token: ${LIGHTSTEP_ACCESS_TOKEN}

If you are using a service mesh such as Istio or Linkerd:
Istio automatically generates traces for service-to-service communication. Configure gateway components to propagate trace context:
[tracing]
enabled = true
endpoint = "istio-telemetry.istio-system:4317"

Linkerd integrates with Jaeger via OpenTelemetry:
[tracing]
enabled = true
endpoint = "linkerd-collector.linkerd:4317"

The OTLP Collector configuration is located at:
gateway/observability/otel-collector/config.yaml
The configuration consists of three main sections:
Receivers define how traces are received:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
        endpoint: 0.0.0.0:4318

Processors transform and enrich traces:
processors:
# Batch spans for efficiency
batch:
timeout: 1s
send_batch_size: 1024
# Add resource attributes
resource:
attributes:
- key: environment
value: production
action: upsert
- key: cluster
value: us-west-2
action: upsert
# Memory limiter to prevent OOM
memory_limiter:
check_interval: 1s
limit_mib: 512
# Sampling processor
probabilistic_sampler:
    sampling_percentage: 10 # Sample 10% of traces

Exporters define where traces are sent:
exporters:
# Send to Jaeger
otlp:
endpoint: jaeger:4317
tls:
insecure: true
# Debug output to console
debug:
verbosity: detailed
sampling_initial: 5
    sampling_thereafter: 200

The service section connects receivers, processors, and exporters into pipelines:
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
      exporters: [otlp, debug]

To send traces to multiple backends simultaneously:
exporters:
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
datadog:
api:
key: ${DD_API_KEY}
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, otlp/tempo, datadog]

Keep all error traces but sample successful traces:
processors:
tail_sampling:
policies:
- name: error-traces
type: status_code
status_code:
status_codes: [ERROR]
- name: slow-traces
type: latency
latency:
threshold_ms: 1000
- name: probabilistic
type: probabilistic
probabilistic:
        sampling_percentage: 10

The gateway components automatically propagate trace context using the standard W3C Trace Context headers:
- traceparent: Contains the trace ID, span ID, and sampling decision
- tracestate: Contains vendor-specific trace information
When making requests to the gateway, you can:
- Let the gateway create a new trace (default)
- Propagate your own trace context by including trace headers:
curl http://localhost:8080/weather/v1.0/us/seattle \
-H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"This allows you to trace requests across your entire system, including services before and after the gateway.
In development:
- Use a 100% sampling rate (sampling_rate = 1.0)
- Enable debug output in the OTLP collector
- Use Jaeger for quick trace visualization
- Keep trace data for 1-7 days

In production:
- Use managed services (Datadog, New Relic, etc.) to reduce operational overhead
- Implement appropriate sampling (1-10% depending on traffic volume)
- Enable TLS for OTLP connections
- Set resource limits on OTLP collector
- Monitor collector health and performance (see the health check sketch after this list)
- Implement trace retention policies based on compliance and storage costs
- Use tail-based sampling to keep important traces (errors, slow requests)
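One lightweight way to monitor collector health, assuming your collector distribution bundles the standard health_check extension (port 13133 is its conventional default):

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp]

A liveness probe or a simple curl against http://otel-collector:13133/ then reports whether the collector is up.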

For security and compliance:
- Enable TLS for trace transmission
- Sanitize sensitive data from trace attributes (see the processor sketch after this list)
- Implement proper access controls for trace viewing
- Regularly audit who accesses trace data
- Consider data residency requirements
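As an illustration of sanitizing trace attributes, the collector's attributes processor can delete or hash span attributes before export; the attribute keys below are examples for illustration, not keys the gateway is known to emit:

processors:
  attributes/scrub:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email
        action: hash

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes/scrub, batch, resource]
      exporters: [otlp]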

For performance:
- Use appropriate sampling rates to balance visibility and overhead
- Configure batch settings to optimize network usage
- Monitor gateway component overhead from tracing
- Use asynchronous trace export (default with OTLP)
- Consider using tail-based sampling for high-volume environments
Choose sampling based on traffic volume:
| Traffic Volume | Sampling Rate | Use Case |
|---|---|---|
| < 100 req/s | 100% (1.0) | Full visibility, low overhead |
| 100-1000 req/s | 10-50% (0.1-0.5) | Balanced visibility and cost |
| 1000-10000 req/s | 1-10% (0.01-0.1) | Cost-effective, statistical sampling |
| > 10000 req/s | 0.1-1% (0.001-0.01) | Minimal overhead, error sampling |
Note: Always use 100% sampling for errors using tail-based sampling.
If no traces appear in Jaeger:
1. Verify tracing is enabled in the configuration:
cat gateway/configs/config.toml | grep -A5 "tracing"

Ensure enabled = true.
2. Check OTLP Collector is running:
docker ps | grep otel-collector

3. View OTLP Collector logs:
docker logs otel-collector

Look for connection errors or export failures.
4. Check Jaeger is running:
docker ps | grep jaeger
curl http://localhost:16686/

5. Verify network connectivity:
docker exec policy-engine ping otel-collector
docker exec otel-collector ping jaeger

(Note: the official collector image is minimal and may not include a shell or ping; if the second command fails for that reason, check connectivity from another container instead.)

6. Check gateway component logs for trace export errors:
docker logs policy-engine | grep -i trace
docker logs gateway-controller | grep -i trace

If traces appear but are missing or incomplete:
1. Check the sampling rate - ensure it is not too low
2. Verify all components are configured to export traces
3. Check for trace context propagation issues - ensure headers are preserved
4. Look for timeout errors in the OTLP collector logs

If tracing adds too much overhead:
1. Reduce the sampling rate:
[tracing]
sampling_rate = 0.1 # Reduce from 1.0 to 0.1

2. Increase batch size:
[tracing]
batch_timeout = "5s"
max_export_batch_size = 2048

3. Use tail-based sampling in the OTLP collector to sample only important traces

If span timestamps or ordering look wrong:
- Ensure system clocks are synchronized across all containers (use NTP)
- Check for clock skew in trace timeline view
- Verify trace context propagation is working correctly

If the Jaeger UI is not accessible:
1. Verify Jaeger is running:
docker ps | grep jaeger

2. Check Jaeger logs:
docker logs jaeger

3. Ensure port 16686 is not blocked:
curl http://localhost:16686/

To completely disable tracing:
- Update the configuration in gateway/configs/config.toml:
[policy_engine.tracing]
enabled = false
[tracing]
enabled = false

- Restart gateway services:
docker compose restart gateway-controller policy-engine router

Note: The router (Envoy) tracing is controlled by the gateway-controller configuration and will be disabled when the configuration is updated.
Traces and logs work together for comprehensive observability:
- Trace ID in Logs: Gateway components include trace IDs in log entries
- Find Trace from Log: Copy trace ID from log entry and search in Jaeger
- Find Logs from Trace: Copy trace ID from Jaeger and search in log viewer
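For example, with the demonstration setup you could grep component logs for a trace ID copied from Jaeger (the ID below is the one from the sample log entry that follows):

docker logs policy-engine 2>&1 | grep 0af7651916cd43dd8448eb211c80319c
docker logs gateway-controller 2>&1 | grep 0af7651916cd43dd8448eb211c80319c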
Example log entry with trace ID:
{
"level": "info",
"ts": "2025-12-19T10:30:45.456Z",
"msg": "Policy executed",
"trace_id": "0af7651916cd43dd8448eb211c80319c",
"span_id": "b7ad6b7169203331",
"policy": "modify-headers"
}

Enable both logging and tracing profiles:
docker compose --profile logging --profile tracing up -d

This provides complete observability:
- Traces: Request flow and performance
- Logs: Detailed event information and debugging