This guide explains how to implement and configure distributed tracing for the API Platform Gateway components.
The default tracing services included in the Docker Compose configuration are demonstration services designed to showcase how you can observe distributed traces across gateway components in a centralized setup. These services provide a reference implementation that you can use out-of-the-box for development, testing, or as a starting point for your production tracing strategy.
Important: You are free to choose any tracing or observability strategy that suits your environment and requirements. The provided setup is just one of many possible configurations.
The default tracing stack consists of:
- OpenTelemetry (OTLP) Collector: Receives, processes, and exports trace data from gateway components
- Jaeger: Stores and visualizes distributed traces with a web UI for trace exploration and analysis
Trace data flows through the stack as follows:
- Gateway components (gateway-controller, policy-engine, router) are configured to export traces via OTLP (OpenTelemetry Protocol)
- Components send trace spans to the OpenTelemetry Collector via gRPC (port 4317) or HTTP (port 4318); a quick smoke test of the HTTP endpoint is sketched after this list
- The OTLP Collector processes traces (batching, adding resource attributes, etc.)
- The OTLP Collector forwards traces to Jaeger for storage and visualization
- Users can view and analyze traces through the Jaeger UI
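If you want to confirm the collector is reachable over OTLP/HTTP, one minimal smoke test (assuming port 4318 is published to the host in your Docker Compose file) is to POST an empty trace payload to the standard /v1/traces path; an HTTP 200 with an empty or partialSuccess body indicates the receiver is up:

curl -sS -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'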
Distributed tracing tracks a request as it flows through multiple components:
- Trace: Represents the entire journey of a request through the system
- Span: Represents a single operation within a trace (e.g., policy execution, upstream call)
- Context Propagation: Traces are correlated across components using trace IDs and span IDs in headers
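For example, the W3C traceparent header (the same value used in the propagation example later in this guide) packs these identifiers into a single line:

traceparent: <version>-<trace-id>-<parent-span-id>-<trace-flags>
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

Here the version is 00, the trace ID is 32 hex characters, the parent span ID is 16 hex characters, and the trace-flags value 01 marks the request as sampled.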
You need to enable tracing in the gateway configuration file and point it to your OTLP collector endpoint.
The tracing configuration is located in gateway/configs/config.toml:
[tracing]
enabled = true # Set to true to enable tracing
endpoint = "otel-collector:4317" # OTLP collector gRPC endpoint
service_version = "0.2.0" # Service version
batch_timeout = "1s" # Batch timeout for exporting spans
max_export_batch_size = 512 # Maximum spans per batch
sampling_rate = 1.0 # Sample rate (1.0 = 100%, 0.5 = 50%)

The tracing services included in the Docker Compose file (Jaeger and the OpenTelemetry Collector) are provided as demonstration services to show one possible way to collect and visualize traces. You can use them as-is for development and testing, or replace them with your own tracing solution.
The gateway uses Docker Compose profiles to optionally enable these demonstration tracing services.
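As a sketch of how this looks in docker-compose.yaml (the image tags, ports, and network name are illustrative and may differ from the shipped file), the demonstration services carry a profiles entry so they start only when the tracing profile is requested:

otel-collector:
  image: otel/opentelemetry-collector:latest
  profiles: ["tracing"]
  ports:
    - "4317:4317"   # OTLP gRPC
    - "4318:4318"   # OTLP HTTP
  networks:
    - gateway-network

jaeger:
  image: jaegertracing/all-in-one:latest
  profiles: ["tracing"]
  ports:
    - "16686:16686" # Jaeger UI
  networks:
    - gateway-network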
To start the gateway with the demonstration tracing services enabled:
docker compose --profile tracing up -d

This starts:
- Core gateway services (gateway-controller, policy-engine, router) - which export traces to OTLP collector
- OpenTelemetry Collector - receives and processes traces
- Jaeger - stores and visualizes traces
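To confirm that the demonstration services actually came up, you can list the services in the profile (service names here assume the default compose file):

docker compose --profile tracing ps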
To run only the core gateway services without the demonstration tracing stack:
docker compose up -d

Note: If tracing is enabled in the configuration but the OTLP collector is not running, components will log warnings about failed trace exports. To completely disable tracing, set enabled = false in the configuration.
To stop all services including the tracing stack:
docker compose --profile tracing down

Note: Jaeger stores traces in memory by default. Stopping the service will lose all trace data. For persistent storage, configure Jaeger with a backend database (see the Jaeger documentation).
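As one illustration of persistent storage, the all-in-one Jaeger image can be switched from in-memory storage to its embedded Badger backend; the volume name and paths below are assumptions rather than part of the default setup, and the named volume must also be declared under the top-level volumes key:

jaeger:
  image: jaegertracing/all-in-one:latest
  profiles: ["tracing"]
  environment:
    - SPAN_STORAGE_TYPE=badger
    - BADGER_EPHEMERAL=false
    - BADGER_DIRECTORY_VALUE=/badger/data
    - BADGER_DIRECTORY_KEY=/badger/key
  volumes:
    - jaeger-badger:/badger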
Once you've started the gateway with the tracing profile, follow these steps to view distributed traces:
Open your browser and navigate to:
http://localhost:16686
The Jaeger UI provides several ways to search for traces:
- Select a Service from the dropdown:
  - policy-engine - View traces from the policy engine
  - router - View traces from the Envoy router
- Select an Operation (optional):
  - Choose "all" to see all operations
  - Or select a specific operation (e.g., a specific policy execution)
- Adjust the Lookback time range:
  - Default: Last 1 hour
  - Options: 5m, 15m, 1h, 6h, 12h, 1d, 2d, Custom
- Add Filters (optional):
  - Tags: Filter by specific tag values (e.g., http.status_code=500)
  - Min/Max Duration: Filter by trace duration
  - Limit Results: Control the number of traces returned (default: 20)
- Click Find Traces
Click on any trace in the results to view detailed information:
- Visual timeline showing all spans in the trace
- Duration bars showing relative time spent in each operation
- Parent-child relationships between spans
- Color coding by service
Click on any span to see:
- Operation name: What operation was performed
- Duration: How long it took
- Tags: Metadata about the operation (HTTP method, status code, etc.)
- Logs: Events logged during the span (errors, warnings, etc.)
- Process: Service name, version, and host information
Finding Slow Requests:
- Set Min Duration filter (e.g., 1000ms)
- Click Find Traces
- Examine spans to identify bottlenecks
Debugging Errors:
- Filter by tag: error=true or http.status_code=500
- Click on error traces
- Examine span logs and tags for error details
Understanding Request Flow:
- Search for a specific trace ID (from logs or headers)
- View the complete request path through all components
- Identify which component handled which part of the request
You can compare multiple traces to identify patterns:
- Select multiple traces using checkboxes
- Click Compare Traces button
- View side-by-side comparison of trace structure and timings
View how services interact:
- Click Dependencies in the top navigation
- Select time range
- View graph showing service-to-service communication patterns
To reduce trace volume in high-traffic environments, adjust the sampling rate:
[tracing]
sampling_rate = 0.1 # Sample 10% of requests

Sampling strategies:
- 1.0 (100%): Sample all requests - recommended for development and low-traffic environments
- 0.5 (50%): Sample half of requests - moderate traffic
- 0.1 (10%): Sample 10% of requests - high traffic
- 0.01 (1%): Sample 1% of requests - very high traffic
Note: Lower sampling rates reduce overhead but may miss important traces.
Customize service names for better identification:
[policy_engine]
service_name = "policy-engine-prod-us-east-1"

Optimize batch settings for your environment:
[tracing]
batch_timeout = "5s" # Wait up to 5s before exporting
max_export_batch_size = 1024 # Export up to 1024 spans per batch

Lower timeout: faster trace visibility, more network overhead. Higher timeout: better batching efficiency, slower trace visibility.
While the default setup uses Jaeger, the gateway components use OpenTelemetry and can export to any OTLP-compatible backend.
Moesif provides API analytics and monitoring with support for OpenTelemetry traces. It treats each HTTP request/response span as an API event for detailed analytics.
No additional Docker services required - Moesif is a cloud-based SaaS platform. You only need to configure the OTLP Collector to export traces to Moesif's API.
Update the OTLP Collector configuration (gateway/observability/otel-collector/config.yaml) to export to Moesif:
exporters:
# Export to Moesif
otlphttp:
endpoint: https://api.moesif.net/v1/traces
headers:
X-Moesif-Application-Id: 'your-moesif-application-id'
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
      exporters: [otlphttp] # Send to Moesif

Important notes:
- The endpoint uses HTTPS (not HTTP)
- Use the otlphttp exporter (not otlp, which uses gRPC)
- The X-Moesif-Application-Id header is required for authentication
To obtain your Moesif Application ID:
- Sign up for a Moesif account at moesif.com
- Log in to your Moesif dashboard
- Navigate to Settings → Installation or API Keys
- Locate the Collector Application ID field
- Copy your unique Application ID
For better security, use environment variables for the Application ID:
exporters:
otlphttp:
endpoint: https://api.moesif.net/v1/traces
headers:
      X-Moesif-Application-Id: '${MOESIF_APPLICATION_ID}'

Update docker-compose.yaml to pass the environment variable:
otel-collector:
image: otel/opentelemetry-collector:latest
environment:
- MOESIF_APPLICATION_ID=${MOESIF_APPLICATION_ID}
    # ... rest of configuration

Set the environment variable before starting:
export MOESIF_APPLICATION_ID=your-moesif-application-id
docker compose --profile tracing up -d

After configuring and starting the gateway:
- Navigate to moesif.com and log in
- Go to Events → Live Event Log to see incoming API events
- View API analytics, user behavior, and performance metrics
- Use Time Series to analyze API usage trends
- Set up Alerts for error rates, latency, or custom conditions

Moesif capabilities include:
- API Analytics: Request volume, response times, error rates
- User Tracking: Identify and track API users across requests
- Error Analysis: Detailed error tracking with request/response bodies
- Behavioral Cohorts: Group users by API usage patterns
- Custom Dashboards: Build visualizations for your specific KPIs
- Alerting: Get notified of anomalies or threshold breaches
You can send traces to both Jaeger (for development) and Moesif (for analytics):
exporters:
# Local Jaeger for development
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
# Moesif for analytics
otlphttp/moesif:
endpoint: https://api.moesif.net/v1/traces
headers:
X-Moesif-Application-Id: '${MOESIF_APPLICATION_ID}'
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
      exporters: [otlp/jaeger, otlphttp/moesif]

To replace Jaeger with Zipkin, add a Zipkin service to the Docker Compose file:
zipkin:
image: openzipkin/zipkin:latest
ports:
- "9411:9411"
networks:
    - gateway-network

Update the OTLP Collector configuration to export to Zipkin:
exporters:
zipkin:
    endpoint: http://zipkin:9411/api/v2/spans

Access the Zipkin UI at http://localhost:9411.
For Grafana Tempo, a tracing backend that pairs with Grafana for visualization:
tempo:
image: grafana/tempo:latest
command: ["-config.file=/etc/tempo.yaml"]
volumes:
- ./observability/tempo/tempo.yaml:/etc/tempo.yaml
- tempo-data:/tmp/tempo
ports:
- "3200:3200" # Tempo HTTP
- "4317:4317" # OTLP gRPC
networks:
- gateway-network
grafana:
image: grafana/grafana:latest
environment:
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
ports:
- "3000:3000"
volumes:
- ./observability/grafana/datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
networks:
- gateway-network
depends_on:
    - tempo

Configure the gateway to send traces directly to Tempo:
[tracing]
endpoint = "tempo:4317"

Configure the OTLP Collector to export to AWS X-Ray:
exporters:
awsxray:
region: us-east-1
    no_verify_ssl: false

Or use the AWS Distro for OpenTelemetry (ADOT) Collector:
otel-collector:
image: public.ecr.aws/aws-observability/aws-otel-collector:latest
command: ["--config=/etc/otel-collector-config.yaml"]
environment:
    - AWS_REGION=us-east-1

Configure the OTLP Collector to export to Google Cloud:
exporters:
googlecloud:
project: your-gcp-project-id
    use_insecure: false

Use the Azure Monitor exporter:
exporters:
azuremonitor:
    instrumentation_key: "your-instrumentation-key"

Configure the OTLP Collector to export to Datadog:
exporters:
datadog:
api:
key: ${DD_API_KEY}
    site: datadoghq.com

Or send traces to the Datadog Agent directly:
datadog-agent:
image: datadog/agent:latest
environment:
- DD_API_KEY=${DD_API_KEY}
- DD_APM_ENABLED=true
- DD_APM_NON_LOCAL_TRAFFIC=true
- DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_GRPC_ENDPOINT=0.0.0.0:4317
ports:
- "4317:4317"
networks:
    - gateway-network

Update the gateway configuration:
[tracing]
endpoint = "datadog-agent:4317"

Configure the OTLP Collector to export to New Relic:
exporters:
otlphttp:
endpoint: https://otlp.nr-data.net:4317
headers:
      api-key: ${NEW_RELIC_LICENSE_KEY}

For Honeycomb:

exporters:
otlp:
endpoint: api.honeycomb.io:443
headers:
      x-honeycomb-team: ${HONEYCOMB_API_KEY}

For Lightstep:

exporters:
otlp:
endpoint: ingest.lightstep.com:443
headers:
      lightstep-access-token: ${LIGHTSTEP_ACCESS_TOKEN}

If you are using a service mesh such as Istio or Linkerd:
Istio automatically generates traces for service-to-service communication. Configure gateway components to propagate trace context:
[tracing]
enabled = true
endpoint = "istio-telemetry.istio-system:4317"

Linkerd integrates with Jaeger via OpenTelemetry:
[tracing]
enabled = true
endpoint = "linkerd-collector.linkerd:4317"

The OTLP Collector configuration is located at:
gateway/observability/otel-collector/config.yaml
The configuration consists of three main sections:
Receivers define how traces are received:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
        endpoint: 0.0.0.0:4318

Processors transform and enrich traces:
processors:
# Batch spans for efficiency
batch:
timeout: 1s
send_batch_size: 1024
# Add resource attributes
resource:
attributes:
- key: environment
value: production
action: upsert
- key: cluster
value: us-west-2
action: upsert
# Memory limiter to prevent OOM
memory_limiter:
check_interval: 1s
limit_mib: 512
# Sampling processor
probabilistic_sampler:
    sampling_percentage: 10 # Sample 10% of traces

Exporters define where traces are sent:
exporters:
# Send to Jaeger
otlp:
endpoint: jaeger:4317
tls:
insecure: true
# Debug output to console
debug:
verbosity: detailed
sampling_initial: 5
    sampling_thereafter: 200

The service section connects receivers, processors, and exporters into pipelines:
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
      exporters: [otlp, debug]

To send traces to multiple backends simultaneously:
exporters:
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
datadog:
api:
key: ${DD_API_KEY}
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, otlp/tempo, datadog]

Keep all error traces but sample successful traces:
processors:
tail_sampling:
policies:
- name: error-traces
type: status_code
status_code:
status_codes: [ERROR]
- name: slow-traces
type: latency
latency:
threshold_ms: 1000
- name: probabilistic
type: probabilistic
probabilistic:
        sampling_percentage: 10

The gateway components automatically propagate trace context using the standard W3C Trace Context headers:
- traceparent: Contains the trace ID, span ID, and sampling decision
- tracestate: Contains vendor-specific trace information
When making requests to the gateway, you can:
- Let the gateway create a new trace (default)
- Propagate your own trace context by including trace headers:
curl http://localhost:8080/weather/v1.0/us/seattle \
-H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"This allows you to trace requests across your entire system, including services before and after the gateway.
In development:
- Use a 100% sampling rate (sampling_rate = 1.0)
- Enable debug output in the OTLP collector
- Use Jaeger for quick trace visualization
- Keep trace data for 1-7 days

In production:
- Use managed services (Datadog, New Relic, etc.) to reduce operational overhead
- Implement appropriate sampling (1-10% depending on traffic volume)
- Enable TLS for OTLP connections
- Set resource limits on OTLP collector
- Monitor collector health and performance (see the health check sketch after this list)
- Implement trace retention policies based on compliance and storage costs
- Use tail-based sampling to keep important traces (errors, slow requests)
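One lightweight way to monitor collector health, assuming your collector distribution bundles the standard health_check extension (port 13133 is its conventional default):

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp]

A liveness probe or a simple curl against http://otel-collector:13133/ then reports whether the collector is up.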

For security and compliance:
- Enable TLS for trace transmission
- Sanitize sensitive data from trace attributes (see the processor sketch after this list)
- Implement proper access controls for trace viewing
- Regularly audit who accesses trace data
- Consider data residency requirements
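As an illustration of sanitizing trace attributes, the collector's attributes processor can delete or hash span attributes before export; the attribute keys below are examples for illustration, not keys the gateway is known to emit:

processors:
  attributes/scrub:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email
        action: hash

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes/scrub, batch, resource]
      exporters: [otlp]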

For performance:
- Use appropriate sampling rates to balance visibility and overhead
- Configure batch settings to optimize network usage
- Monitor gateway component overhead from tracing
- Use asynchronous trace export (default with OTLP)
- Consider using tail-based sampling for high-volume environments
Choose sampling based on traffic volume:
| Traffic Volume | Sampling Rate | Use Case |
|---|---|---|
| < 100 req/s | 100% (1.0) | Full visibility, low overhead |
| 100-1000 req/s | 10-50% (0.1-0.5) | Balanced visibility and cost |
| 1000-10000 req/s | 1-10% (0.01-0.1) | Cost-effective, statistical sampling |
| > 10000 req/s | 0.1-1% (0.001-0.01) | Minimal overhead, error sampling |
Note: Always use 100% sampling for errors using tail-based sampling.
If no traces appear in Jaeger:
1. Verify tracing is enabled in the configuration:
cat gateway/configs/config.toml | grep -A5 "tracing"

Ensure enabled = true.
2. Check OTLP Collector is running:
docker ps | grep otel-collector

3. View OTLP Collector logs:
docker logs otel-collector

Look for connection errors or export failures.
4. Check Jaeger is running:
docker ps | grep jaeger
curl http://localhost:16686/

5. Verify network connectivity:
docker exec policy-engine ping otel-collector
docker exec otel-collector ping jaeger

(Note: the official collector image is minimal and may not include a shell or ping; if the second command fails for that reason, check connectivity from another container instead.)

6. Check gateway component logs for trace export errors:
docker logs policy-engine | grep -i trace
docker logs gateway-controller | grep -i trace

If traces appear but are missing or incomplete:
1. Check the sampling rate - ensure it is not too low
2. Verify all components are configured to export traces
3. Check for trace context propagation issues - ensure headers are preserved
4. Look for timeout errors in the OTLP collector logs

If tracing adds too much overhead:
1. Reduce the sampling rate:
[tracing]
sampling_rate = 0.1 # Reduce from 1.0 to 0.1

2. Increase batch size:
[tracing]
batch_timeout = "5s"
max_export_batch_size = 2048

3. Use tail-based sampling in the OTLP collector to sample only important traces

If span timestamps or ordering look wrong:
- Ensure system clocks are synchronized across all containers (use NTP)
- Check for clock skew in trace timeline view
- Verify trace context propagation is working correctly

If the Jaeger UI is not accessible:
1. Verify Jaeger is running:
docker ps | grep jaeger

2. Check Jaeger logs:
docker logs jaeger

3. Ensure port 16686 is not blocked:
curl http://localhost:16686/

To completely disable tracing:
- Update the configuration in gateway/configs/config.toml:
[policy_engine.tracing]
enabled = false
[tracing]
enabled = false

- Restart gateway services:
docker compose restart gateway-controller policy-engine router

Note: The router (Envoy) tracing is controlled by the gateway-controller configuration and will be disabled when the configuration is updated.
Traces and logs work together for comprehensive observability:
- Trace ID in Logs: Gateway components include trace IDs in log entries
- Find Trace from Log: Copy trace ID from log entry and search in Jaeger
- Find Logs from Trace: Copy trace ID from Jaeger and search in log viewer
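For example, with the demonstration setup you could grep component logs for a trace ID copied from Jaeger (the ID below is the one from the sample log entry that follows):

docker logs policy-engine 2>&1 | grep 0af7651916cd43dd8448eb211c80319c
docker logs gateway-controller 2>&1 | grep 0af7651916cd43dd8448eb211c80319c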
Example log entry with trace ID:
{
"level": "info",
"ts": "2025-12-19T10:30:45.456Z",
"msg": "Policy executed",
"trace_id": "0af7651916cd43dd8448eb211c80319c",
"span_id": "b7ad6b7169203331",
"policy": "modify-headers"
}

Enable both logging and tracing profiles:
docker compose --profile logging --profile tracing up -d

This provides complete observability:
- Traces: Request flow and performance
- Logs: Detailed event information and debugging