Describe the bug
AWS SnapStart has a fixed ephemeral storage size of 512 MB in /tmp. We see long-running Lambdas using the aws-crt-java library running out of ephemeral storage space:
Unable to unpack AWS CRT lib: java.io.IOException: No space left on device
java.io.IOException: No space left on device
at java.base/java.io.FileOutputStream.writeBytes(Native Method)
at java.base/java.io.FileOutputStream.write(Unknown Source)
at software.amazon.awssdk.crt.internal.ExtractLib.extractLibrary(ExtractLib.java:63)
at software.amazon.awssdk.crt.CRT.extractAndLoadLibrary(CRT.java:310)
at software.amazon.awssdk.crt.CRT.loadLibraryFromJar(CRT.java:330)
at software.amazon.awssdk.crt.CRT.<clinit>(CRT.java:50)
at software.amazon.awssdk.crt.CrtResource.<clinit>(CrtResource.java:104)
at software.amazon.awssdk.http.crt.AwsCrtHttpClientBase.<init>(AwsCrtHttpClientBase.java:77)
The reason is that SnapStart does not honor the deleteOnExit() call that is needed to clean up the shared objects the aws-crt-java library extracts into ephemeral storage.
Every time AWS SnapStart re-initializes the Lambda, another copy of the shared library leaks, chipping away roughly 2 MB of ephemeral storage.
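For context, here is a minimal sketch of the usual extract-and-load pattern (illustrative only, not the actual aws-crt-java code; class and resource names are made up). The key point is that `deleteOnExit()` only runs on a normal JVM shutdown, which a snapshotted and re-initialized SnapStart environment never reaches:

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative sketch, not the actual aws-crt-java implementation.
public class NativeLoaderSketch {
    static void extractAndLoad(String resourceName) throws IOException {
        // Extract the bundled .so into ephemeral storage (/tmp on Lambda).
        Path extracted = Files.createTempFile("aws-crt-", ".so");
        try (InputStream in = NativeLoaderSketch.class.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException("native library not found on classpath: " + resourceName);
            }
            Files.copy(in, extracted, StandardCopyOption.REPLACE_EXISTING);
        }

        File file = extracted.toFile();
        // Only runs on a normal JVM exit; a SnapStart re-init never triggers it,
        // so every INIT_REPORT leaves another ~2 MB copy behind in /tmp.
        file.deleteOnExit();
        System.load(file.getAbsolutePath());
    }
}
```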
Regression Issue
- [ ] Select this option if this issue appears to be a regression.
Expected Behavior
Previously extracted copies of the shared library should not leak when AWS SnapStart re-initializes the function.
Current Behavior
A typical example run showing how INIT_REPORT re-initializations and timeouts gradually fill the ephemeral storage:
Step 1: A typical request in SnapStart takes 12 ms:
01:25:01.151 | START RequestId: 0axxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx9e Version: 61
01:25:01.163 | END RequestId: 0axxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx9e
01:25:01.163 | REPORT RequestId: 0axxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx9e Duration: 12.18 ms Billed Duration: 13 ms Memory Size: 1024 MB Max Memory Used: 288 MB
Step 2: After the request finishes, nothing happens for 96 seconds. We are then greeted with a trace coming from our Java constructor, telling us that the Lambda has been restarted, something normally not seen for a Lambda.
AWS Support told us that `INIT_REPORT` points to a software update or runtime optimization step re-initializing/re-snapshotting the SnapStart environment.
01:26:53.004 | [main] INFO c.t.w.c.j.Lambda - Discovered region for bucket xxxxxxxxx: us-east-1
Step 3: The INIT_REPORT step takes 6 seconds, which is the timeout period of this Lambda:
01:26:48.441 | INIT_REPORT Init Duration: 6006.06 ms
01:26:48.441 | START RequestId: 6dxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxd9 Version: 61
Step 4: This gives the Lambda 34 ms to wake up before it times out again (the INIT_REPORT, by the way, takes about 2 s longer than the billed timeout):
01:26:48.475 | 01:26:48.474Z 6dxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxd9 Task timed out after 6.04 seconds
01:26:48.475 | END RequestId: 6dxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxd9
01:26:48.475 | REPORT RequestId: 6dxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxd9 Duration: 6039.30 ms Billed Duration: 6000 ms Memory Size: 1024 MB Max Memory Used: 150 MB
Step 5: The above failure is repeated two times before we hit the final error condition:
01:38:11.064 | REPORT RequestId: 24xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx73 Duration: 6025.82 ms Billed Duration: 6000 ms Memory Size: 1024 MB Max Memory Used: 147 MB
01:38:13.781 | Unable to unpack AWS CRT lib: java.io.IOException: No space left on device
01:38:13.903 | INIT_REPORT Init Duration: 2809.79 ms Phase: invoke Status: error Error Type: Runtime.BadFunctionCode
01:38:13.904 | START RequestId: 69xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx9e Version: 61
01:38:13.904 | Unknown application error occurred Runtime.BadFunctionCode
Reproduction Steps
We identified 2 scenarios where AWS SnapStart might re-initialize a lambda (detectable with the INIT_REPORT
trace in the lambda logs):
- the lambda execution times out (visible in Step 4 in the traces above)
- AWS SnapStart deciding to restart the lambda on itself for an upgrade scenario or even an optimization scenario (taking a new snapshot) (visible in Step 2 in the traces)
To reproduce the scenario, you can implement a SnapStart-enabled Lambda with a sporadic timeout, which emulates this behavior more rapidly (see the sketch below).
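A hypothetical repro handler, assuming a SnapStart-enabled function with a short timeout (e.g. 6 s); the class name and the 10% sleep probability are made up for illustration. Each timeout forces a re-init, and each re-init extracts another copy of the CRT shared object into /tmp:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.crt.AwsCrtHttpClient;

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical reproduction: deploy with SnapStart enabled and a short
// function timeout, then watch /tmp usage grow after every INIT_REPORT.
public class SporadicTimeoutHandler implements RequestHandler<Map<String, Object>, String> {

    // Initializing the CRT-based HTTP client causes the native library
    // to be extracted into /tmp during (re-)initialization.
    private static final SdkHttpClient CRT_CLIENT = AwsCrtHttpClient.builder().build();

    @Override
    public String handleRequest(Map<String, Object> event, Context context) {
        // Sleep past the function timeout on roughly 10% of invocations
        // to trigger the timeout -> re-init -> re-extract cycle.
        if (ThreadLocalRandom.current().nextInt(10) == 0) {
            try {
                Thread.sleep(30_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return "ok";
    }
}
```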
Possible Solution
In our Lambdas we have already reduced the likelihood of this happening by increasing timeouts, since Lambda timeouts are one source of the leak, but we still see the ephemeral disk not being cleaned up.
Possible solutions might be:
- Make the AWS SnapStart environment honor `deleteOnExit()` for timeout operations, updates, or any other situation triggering an `INIT_REPORT`.
- Update the `aws-crt-java` library to clean up old libraries, like it does for Windows.
- Update the `aws-crt-java` library to reuse a previously unpacked shared object.
- Update the `aws-crt-java` library to delete the shared object on Linux/Unix directly after it is loaded. This removes the file name, but the shared object stays accessible to the Java process through its open file descriptor (see the sketch below).
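A minimal sketch of that last option, assuming a generic extract-and-load helper (names are illustrative, this is not the current aws-crt-java code). On Linux/Unix the dynamic loader keeps the mapping alive through the open file descriptor, so the file can be unlinked right after `System.load()`:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative sketch of the "delete right after load" proposal.
public final class LoadThenUnlink {
    static void loadNative(String resourceName) throws IOException {
        Path extracted = Files.createTempFile("aws-crt-", ".so");
        try (InputStream in = LoadThenUnlink.class.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException("native library not found on classpath: " + resourceName);
            }
            Files.copy(in, extracted, StandardCopyOption.REPLACE_EXISTING);
        }
        System.load(extracted.toAbsolutePath().toString());
        // Unlinking only removes the directory entry; the already-mapped
        // library stays usable for the lifetime of the JVM process, and
        // /tmp no longer accumulates copies across SnapStart re-inits.
        Files.deleteIfExists(extracted);
    }
}
```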
Additional Information/Context
More detailed traces and logs can be found in AWS support case 174136266000566.
aws-crt-java version used
0.33.9 (aws sdk v2 2.30.15)
Java version used
java21
Operating System and version
latest lambda arm64 java21 runtime