AWS Lambda: Register application as an internal extension to allow post-response OpenTelemetry flushing #53464

@hamburml

Describe the bug

When running Quarkus with OpenTelemetry on AWS Lambda, telemetry can be lost when the function completes very quickly.

With quarkus.otel.simple=true, batching is disabled for traces and logs, but telemetry data is sometimes still not flushed before the Lambda execution environment is frozen. AWS Lambda may freeze the environment immediately after the response is returned, so some metrics are never exported.

This is especially visible for short-lived invocations. In some cases, telemetry from a previous invocation is only exported later when the same Lambda execution environment is reused.

Expected behavior

AWS Lambda supports internal and external extensions. Registering the application as an internal extension gives the runtime up to 500 ms of additional time before the environment is terminated. See https://docs.aws.amazon.com/lambda/latest/dg/runtimes-extensions-api.html
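
For illustration, here is a minimal sketch of the two Extensions API requests an internal extension would send, built with the JDK HTTP client. The paths and header names follow the AWS documentation linked above. The class name, method names, and the extension name used in the usage note are hypothetical and are not taken from the PR.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Sketch only: builds the Lambda Extensions API requests, it does not send them. */
final class LambdaExtensionRequests {
    // Path prefix of the Extensions API version described in the AWS docs.
    static final String API_PREFIX = "/2020-01-01/extension";

    /** POST /register: announces the extension and subscribes to INVOKE events. */
    static HttpRequest register(String runtimeApi, String extensionName) {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://" + runtimeApi + API_PREFIX + "/register"))
                .header("Lambda-Extension-Name", extensionName)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"events\":[\"INVOKE\"]}"))
                .build();
    }

    /** GET /event/next: asks for the next event, which also signals that work for the current one is done. */
    static HttpRequest nextEvent(String runtimeApi, String extensionId) {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://" + runtimeApi + API_PREFIX + "/event/next"))
                .header("Lambda-Extension-Identifier", extensionId)
                .GET()
                .build();
    }
}
```

Inside Lambda, `runtimeApi` would come from the `AWS_LAMBDA_RUNTIME_API` environment variable, and `extensionId` from the `Lambda-Extension-Identifier` response header of the register call. As far as the AWS documentation describes it, the invoke phase only ends once the runtime and all registered extensions have requested the next event, so delaying the `/event/next` call until the flush is done is what would create the post-response window.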

Quarkus should use that extra time to flush all pending OpenTelemetry data after the response has been sent, including metrics, so that telemetry is exported reliably even for fast Lambda invocations.

As an alternative, Quarkus could flush synchronously before returning the response, but that would increase request latency and is therefore less desirable.
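
The post-response flush could be bounded by a timeout so that it never exceeds the extra time Lambda grants. The following is a rough sketch using only the JDK. `PostResponseFlusher` and `flushAll` are hypothetical names, and each flush action is assumed to wrap something like `forceFlush()` on the OpenTelemetry SDK tracer, logger, and meter providers. How Quarkus would actually wire this in is an open question in this issue.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Sketch only: runs several flush actions concurrently under one shared deadline. */
final class PostResponseFlusher {
    private final List<Supplier<CompletableFuture<Void>>> flushActions;

    PostResponseFlusher(List<Supplier<CompletableFuture<Void>>> flushActions) {
        this.flushActions = List.copyOf(flushActions);
    }

    /** Starts every flush, then waits; returns true only if all finished within the timeout. */
    boolean flushAll(Duration timeout) {
        CompletableFuture<?>[] pending = flushActions.stream()
                .map(Supplier::get)
                .toArray(CompletableFuture[]::new);
        try {
            CompletableFuture.allOf(pending).get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            // A slow or failed exporter must not hold the environment past its budget.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller would invoke `flushAll` after the response has been handed to the Lambda runtime and before requesting the next extension event, with a timeout somewhat below the 500 ms mentioned above. The exact default is one of the values that still needs to be decided.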

Actual behavior

OpenTelemetry data is not always flushed before the Lambda environment is frozen.

As a result:

- metrics are sometimes missing for fast invocations
- telemetry may be delayed and only appear on a later invocation when the Lambda environment is reused

This is how it sometimes looks in our tracing tool when the execution environment is reused:

Image

How to Reproduce?

1. Create an AWS Lambda function using Quarkus.
2. Add the OpenTelemetry extension and enable telemetry export.
3. Set quarkus.otel.simple=true.
4. Record traces, logs, and metrics during the invocation.
5. Invoke the function with a short-running request.
6. Check the exported telemetry.

Output of uname -a or ver

No response

Output of java -version

No response

Quarkus version or git rev

No response

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

@brunobat We talked about this a while ago; sorry it took me so long to open the issue. I also have a draft PR, #53465, which still needs the OpenTelemetry changes, and I am unsure how to add them. I was also not sure about default values for timeouts and similar settings. Advice and changes are welcome. Thanks!
