Skip to content

DefaultMessageManager outcome logs are emitted outside the telemetry Activity scope (no trace correlation) #337

@JobaDiniz

Description

@JobaDiniz

Description

Three LogError calls in DefaultMessageManager.InvokeHandler are emitted after the telemetry Activity opened by HandlerInvoker.InvokeAsync has already been disposed. As a result, those log records carry no trace context: in any observability platform that ingests logs via OTLP (Datadog, etc.), SpanId and TraceId are zeroed out, and operators cannot click from the log to the corresponding handler trace.

Steps to reproduce

  1. Register an SQS poller and a handler with AddAWSMessageBus.
  2. Wire OpenTelemetry tracing via AWS.Messaging.Telemetry.OpenTelemetry (tracing.AddAWSMessagingInstrumentation()), and ship logs through an OTLP exporter.
  3. Have the handler return MessageProcessStatus.Failed() for a message (or throw an exception that escapes HandlerInvoker, or fault its Task).
  4. Observe the "Message handling completed unsuccessfully for message ID {MessageId}" (or sibling) log record in the log backend.

Expected behavior

The outcome log records should sit inside the same Activity that wraps the handler invocation, so they share TraceId / SpanId with the handler span and can be correlated with it.

Actual behavior

The records have SpanId: "0000000000000000" and TraceFlags: "None". Example (sanitized) record from a Datadog log ingested via OTLP:

{
  "Attributes": {
    "MessageId": "<message-id>",
    "{OriginalFormat}": "Message handling completed unsuccessfully for message ID {MessageId}"
  },
  "CategoryName": "AWS.Messaging.Services.DefaultMessageManager",
  "SeverityText": "Error",
  "SpanId": "0000000000000000",
  "TraceFlags": "None"
}

The MessageId attribute is the only correlator available; nothing ties the record to the handler trace.

Root cause

The Activity is opened inside HandlerInvoker.InvokeAsync and ends when its using block exits:

HandlerInvoker.cs#L45

using (var trace = _telemetryFactory.Trace("Processing message", messageEnvelope))
{
    // ... handler invocation ...
}

DefaultMessageManager.InvokeHandler calls that method, awaits the returned task, and then decides whether to log the outcome:

DefaultMessageManager.cs#L167-L208

private async Task<bool> InvokeHandler(MessageEnvelope messageEnvelope, SubscriberMapping subscriberMapping, CancellationToken cancelToken)
{
    var isSuccessful = false;
    var handlerTask = _handlerInvoker.InvokeAsync(messageEnvelope, subscriberMapping, cancelToken);
    try
    {
        await handlerTask;
    }
    catch (InvalidMessageHandlerSignatureException) { throw; }
    catch (AWSMessagingException) { /* swallowed */ }
    catch (Exception ex)
    {
        _logger.LogError(ex, "An exception has been thrown from handler '{HandlerType}' ...", ...); // L185
    }

    _inFlightMessageMetadata.Remove(messageEnvelope, out _);

    if (handlerTask.IsCompletedSuccessfully)
    {
        if (handlerTask.Result.IsSuccess) { /* delete */ }
        else
        {
            _logger.LogError("Message handling completed unsuccessfully for message ID {MessageId}", ...); // L200
            await _sqsMessageCommunication.ReportMessageFailureAsync(messageEnvelope);
        }
    }
    else if (handlerTask.IsFaulted)
    {
        _logger.LogError(handlerTask.Exception, "An exception has been thrown from handler '{HandlerType}' ...", ...); // L206
        await _sqsMessageCommunication.ReportMessageFailureAsync(messageEnvelope);
    }

    return isSuccessful;
}

By the time any of L185, L200, or L206 execute, HandlerInvoker.InvokeAsync has returned and the Activity has been disposed, so Activity.Current is no longer the handler activity (it is null or the poller's parent activity).

The same problem applies to all three call sites:

  • DefaultMessageManager.cs:185 — handler exception that escapes HandlerInvoker and is not InvalidMessageHandlerSignatureException or AWSMessagingException.
  • DefaultMessageManager.cs:200 — handler returned MessageProcessStatus.Failed().
  • DefaultMessageManager.cs:206handlerTask.IsFaulted branch.

Impact

  • Outcome / failure logs cannot be correlated with the handler trace in any platform.
  • For users who rely on click-through from log to trace as their primary debugging workflow, the most operationally interesting records (failures, faulted tasks) are exactly the ones with no trace context.
  • The MessageId attribute is the only correlator, which forces a manual second query to find the related trace.

Environment

  • AWS.Messaging version: 1.3.0
  • AWS.Messaging.Telemetry.OpenTelemetry version: 1.0.0
  • .NET 10
  • Logs and traces exported via OTLP (Datadog backend in our case, but the issue is in the producer side and is platform-independent).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions