
Long running SnapStart Lambda runs out of ephemeral storage #876

Open
@steven-aerts


Describe the bug

AWS SnapStart has a fixed ephemeral storage size of 512 MB in /tmp. We see long-running Lambdas using the aws-crt-java library running out of ephemeral storage space:

Unable to unpack AWS CRT lib: java.io.IOException: No space left on device
java.io.IOException: No space left on device
at java.base/java.io.FileOutputStream.writeBytes(Native Method)
at java.base/java.io.FileOutputStream.write(Unknown Source)
at software.amazon.awssdk.crt.internal.ExtractLib.extractLibrary(ExtractLib.java:63)
at software.amazon.awssdk.crt.CRT.extractAndLoadLibrary(CRT.java:310)
at software.amazon.awssdk.crt.CRT.loadLibraryFromJar(CRT.java:330)
at software.amazon.awssdk.crt.CRT.<clinit>(CRT.java:50)
at software.amazon.awssdk.crt.CrtResource.<clinit>(CrtResource.java:104)
at software.amazon.awssdk.http.crt.AwsCrtHttpClientBase.<init>(AwsCrtHttpClientBase.java:77)

The reason for that is that SnapStart does not honor the deleteOnExit() call necessary to clean up the shared objects which the aws-crt-java library extracts into the ephemeral storage.

Every time AWS SnapStart re-initializes the Lambda, another copy of the shared library leaks, chipping away roughly 2 MB of the ephemeral storage.
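For illustration, the sketch below shows the general shape of the pattern described above. It is a simplified stand-in, not the actual ExtractLib code; the class name is made up, and it only demonstrates why relying on deleteOnExit() leaks under SnapStart:

```java
// Simplified illustration (not the actual ExtractLib implementation) of the
// extraction pattern the issue describes: the native library is copied into /tmp
// and cleanup relies on File.deleteOnExit(), i.e. on a JVM shutdown hook that a
// SnapStart re-initialization never runs, so each restore leaves another ~2 MB
// copy behind.
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

final class DeleteOnExitPattern {

    static void extractAndLoad(String resourceName) throws IOException {
        File tmp = File.createTempFile("aws-crt-", ".so");
        tmp.deleteOnExit(); // only removed on a normal JVM exit, which SnapStart skips
        try (InputStream in = DeleteOnExitPattern.class
                .getResourceAsStream("/" + resourceName)) {
            if (in == null) {
                throw new IOException("Missing native library: " + resourceName);
            }
            Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        }
        System.load(tmp.getAbsolutePath());
    }

    private DeleteOnExitPattern() {}
}
```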

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

The library does not leak extracted shared objects when AWS SnapStart re-initializes the function.

Current Behavior

A typical example run showing how INIT_REPORT events and timeouts gradually fill the ephemeral storage:

Step 1: A typical request in SnapStart takes 12 ms:

01:25:01.151 | START RequestId: 0axxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx9e Version: 61
01:25:01.163 | END RequestId: 0axxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx9e
01:25:01.163 | REPORT RequestId: 0axxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx9e Duration: 12.18 ms Billed Duration: 13 ms Memory Size: 1024 MB Max Memory Used: 288 MB

Step 2: After it finishes, nothing happens for 96 seconds.
We are then greeted with a trace from our Java constructor, telling us that the Lambda has been re-initialized, something we normally do not see for a Lambda.
AWS Support told us that `INIT_REPORT` points to a software update or runtime optimization step re-initializing/re-snapshotting the SnapStart environment.

01:26:53.004 | [main] INFO  c.t.w.c.j.Lambda - Discovered region for bucket xxxxxxxxx: us-east-1

Step 3: The INIT_REPORT step takes 6 seconds, which is the timeout period of this Lambda:

01:26:48.441 | INIT_REPORT Init Duration: 6006.06 ms
01:26:48.441 | START RequestId: 6dxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxd9 Version: 61

Step 4: This gives the Lambda 34 ms to wake up before it times out again (by the way, INIT_REPORT takes 2 s longer than the billed timeout):

01:26:48.475 | 01:26:48.474Z 6dxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxd9 Task timed out after 6.04 seconds
01:26:48.475 | END RequestId: 6dxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxd9
01:26:48.475 | REPORT RequestId: 6dxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxd9 Duration: 6039.30 ms Billed Duration: 6000 ms Memory Size: 1024 MB Max Memory Used: 150 MB

Step 5: The above failure is repeated twice, hitting the final error condition:

01:38:11.064 | REPORT RequestId: 24xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx73 Duration: 6025.82 ms Billed Duration: 6000 ms Memory Size: 1024 MB Max Memory Used: 147 MB
01:38:13.781 | Unable to unpack AWS CRT lib: java.io.IOException: No space left on device
01:38:13.903 | INIT_REPORT Init Duration: 2809.79 ms Phase: invoke Status: error Error Type: Runtime.BadFunctionCode
01:38:13.904 | START RequestId: 69xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx9e Version: 61
01:38:13.904 | Unknown application error occurred Runtime.BadFunctionCode

Reproduction Steps

We identified two scenarios in which AWS SnapStart might re-initialize a Lambda (detectable via the INIT_REPORT trace in the Lambda logs):

  • the Lambda execution times out (visible in Step 4 in the traces above)
  • AWS SnapStart decides on its own to restart the Lambda, either for an upgrade or for an optimization that takes a new snapshot (visible in Step 2 in the traces above)

To reproduce the scenario more rapidly, you can implement a SnapStart Lambda with a sporadic timeout that emulates this behavior; a minimal sketch of such a handler follows.
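This sketch assumes the standard aws-lambda-java-core RequestHandler interface and the CRT-based SDK HTTP client; the class name and the 1-in-10 timeout probability are made up for illustration:

```java
// Hypothetical reproduction handler: a SnapStart-enabled Lambda that uses the
// CRT-based HTTP client and sporadically sleeps past its configured timeout,
// forcing SnapStart to re-initialize the function and extract the .so again.
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.crt.AwsCrtHttpClient;

public class LeakReproHandler implements RequestHandler<Object, String> {

    // Building the CRT-based HTTP client during (re-)initialization forces the
    // native library to be extracted into /tmp, as in the stack trace above.
    private final SdkHttpClient crtClient = AwsCrtHttpClient.builder().build();

    @Override
    public String handleRequest(Object input, Context context) {
        // Roughly 1 in 10 invocations deliberately sleeps past the function
        // timeout (configure e.g. 6 seconds), so the runtime is torn down and
        // SnapStart restores a fresh copy, extracting the shared object once more.
        if (Math.random() < 0.1) {
            try {
                Thread.sleep(context.getRemainingTimeInMillis() + 1_000L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return "ok";
    }
}
```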

Possible Solution

In our Lambdas we have already reduced the likelihood of this happening by increasing timeouts, as Lambda timeouts are one source of the leak, but we can still see the ephemeral disk not being cleaned up.

Possible solutions might be:

  • Make the AWS SnapStart environment honor deleteOnExit() for timeouts, updates, or any other situation that triggers an INIT_REPORT.
  • Update the aws-crt-java library to clean up old libraries, like it already does for Windows.
  • Update the aws-crt-java library to reuse a previously unpacked shared object.
  • Update the aws-crt-java library to delete the shared object on Linux/Unix directly after it is loaded. This removes the filename, but the shared object stays accessible to the Java process through its open file descriptor (see the sketch after this list).
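A minimal sketch of that last option. It is a simplified stand-in under the assumption that the library is shipped as a classpath resource; the class and method names are made up for illustration and are not the actual ExtractLib implementation:

```java
// Sketch of "delete immediately after load" on Linux/Unix: once System.load()
// has mapped the shared object, the directory entry can be unlinked; the process
// keeps the mapping alive, and nothing is left in /tmp for SnapStart restores
// to accumulate.
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ExtractAndLoadSketch {

    static void extractAndLoad(String resourceName) throws IOException {
        Path tmp = Files.createTempFile("aws-crt-", ".so");
        try (InputStream in = ExtractAndLoadSketch.class
                .getResourceAsStream("/" + resourceName)) {
            if (in == null) {
                throw new IOException("Native library not found on classpath: " + resourceName);
            }
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        // Load first: the JVM now holds the mapping via its own file descriptor.
        System.load(tmp.toAbsolutePath().toString());
        // Unlink right away instead of relying on deleteOnExit(), which SnapStart
        // re-initialization never runs. On Linux/Unix the loaded library stays usable.
        Files.deleteIfExists(tmp);
    }

    private ExtractAndLoadSketch() {}
}
```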

Additional Information/Context

More detailed traces and logs can also be found in AWS support case 174136266000566.

aws-crt-java version used

0.33.9 (AWS SDK v2 2.30.15)

Java version used

java21

Operating System and version

latest lambda arm64 java21 runtime
