Open
Labels
ops: Improves our operations and release process
Description
Problem
GitHub Actions runners intermittently fail when pulling images from ECR with the following error:
------STDERR---------
failed to register layer: max depth exceeded
-------------------
docker pull status: 1
An unknown failure happened during Docker pull. This needs to be examined.
Root Cause
Inspecting affected images in ECR revealed that a single layer is being repeatedly appended, with the number of duplicates increasing over time:
% docker inspect <image ID>
...
"RootFS": {
"Type": "layers",
"Layers": [
...
"sha256:482f0923b37807c84cd98d679c1b38013ca7eb252bd12f66d5e0f6fdd910a3b9",
"sha256:ca1f2a2bc13c53495801f5b30e5bb510923f0c9c3ee851f6d87f15a86f15ab08",
"sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
"sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
"sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
"sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
"sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
"sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
...
Inspecting the image with dive shows that the repeated layer contains:
Command:
RUN |1 USER_ID=1001 /bin/sh -c useradd -l -u ${USER_ID} -G build -o -s /bin/bash localbuild || { exit_code=$?; [ $exit_code -eq 9 ] && echo "User localbuild already exists, continuing..." || { echo "Failed to create user with error code: $exit_code"; exit $exit_code; }; } # buildkit
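The repetition visible in the RootFS.Layers list above can also be checked programmatically. The following is a minimal sketch, assuming the JSON printed by docker inspect is available as a string; the function name repeated_layer_runs and the min_repeats parameter are illustrative and not part of any existing tooling.

```python
import json
from itertools import groupby


def repeated_layer_runs(inspect_json, min_repeats=2):
    """Return (digest, count) for every run of consecutive identical layers.

    `inspect_json` is the text printed by `docker inspect <image ID>`.
    `min_repeats` is a hypothetical threshold chosen for this sketch.
    """
    data = json.loads(inspect_json)
    # `docker inspect` prints a JSON array with one object per inspected image.
    image = data[0] if isinstance(data, list) else data
    layers = image["RootFS"]["Layers"]

    runs = []
    # groupby only groups adjacent equal items, which matches the pattern
    # seen in the issue: the same digest appended back to back.
    for digest, group in groupby(layers):
        count = sum(1 for _ in group)
        if count >= min_repeats:
            runs.append((digest, count))
    return runs
```

One way to use it is to save the output of docker inspect to a file, read the file into a string, and pass it to repeated_layer_runs. A count that grows between two inspections of the same tag would be consistent with the accumulation described in this issue.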
Why This Happens
The layer is added here. The accumulation appears to occur in the following sequence:
- GitHub Actions runs in an ephemeral environment with no local images
- The workflow pulls smithy-build-image:ci-XXX from ECR (where ci-XXX is a hash of the tools directory)
- It adds the user creation layer on top
- It pushes the modified image back to ECR with the same tag ci-XXX
- Each CI run appends another duplicate layer until Docker's max depth is exceeded
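The sequence above can be modelled with a few lines of arithmetic to estimate how many CI runs a freshly built image survives. This is a rough sketch under stated assumptions: each run adds exactly one layer, and the pull fails once the image has more layers than max_depth. The default of 125 is an assumption about Docker's layer store limit and should be checked against the Docker version on the runners; base_layers is a made-up input.

```python
def successful_runs_before_failure(base_layers, max_depth=125):
    """Count CI runs that can still pull the image before the depth limit hits.

    base_layers: layer count of a freshly built image (hypothetical input).
    max_depth:   assumed maximum number of layers Docker will register.
    """
    layers = base_layers
    runs = 0
    while layers <= max_depth:  # the pull still succeeds
        layers += 1             # the run appends the user creation layer
        runs += 1               # and pushes the result under the same tag
    return runs
```

For example, under these assumptions an image that starts with 20 layers would go through 106 successful runs, and the next run would fail on pull. This fits the intermittent nature of the failure: it only shows up once a given ci-XXX tag has been reused often enough.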
Workarounds
There are a couple of workarounds until this is fixed:
- Force a rebuild of the image to eliminate the duplicates (example).
- Pull the last known good, earlier image of ci-XXX from the tools account (its tag must have become -), re-tag it as smithy-build-image:ci-XXX, and push it back to ECR to replace the ci-XXX image in question that exceeds the maximum layer depth.
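The second workaround can be sketched as a short sequence of Docker commands. Because the earlier image no longer carries the ci-XXX tag, this sketch assumes it is addressed by digest. The function name retag_commands and all of its parameters are hypothetical; the repository URIs and the digest have to be looked up in ECR and are not filled in here.

```python
import subprocess


def retag_commands(source_repo, good_digest, target_repo, tag):
    """Build the docker commands that restore a known good image under `tag`.

    source_repo / target_repo: full repository URIs (placeholders here).
    good_digest: digest of the last known good image, e.g. "sha256:...".
    """
    source_ref = f"{source_repo}@{good_digest}"
    target_ref = f"{target_repo}:{tag}"
    return [
        ["docker", "pull", source_ref],
        ["docker", "tag", source_ref, target_ref],
        ["docker", "push", target_ref],
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the command list separately from running it makes it possible to print and review the commands before anything is pushed. Logging in to ECR is assumed to have happened beforehand.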
Solution
Potential solutions include:
- Ensure the add-local-user layer is added only once per ci-XXX tag. This could be as simple as making sure that an image with a given tag is uploaded to ECR only once (ex.)
- Uniquify the image tag ci-XXX per CI run by appending a unique identifier (e.g., PR number, run ID, or commit SHA). When ci-main.yml uploads the final image to ECR, strip the unique identifier to restore the canonical ci-XXX tag format.
- Squash all layers before uploading the image to ECR
- Shorten the lifecycle policy rules for ci-XXX in ECR so that those images expire more frequently, which forces an image rebuild and eliminates the repeated layers. We could even let the daily scheduled dry run at night be the first to build an image, so that we don't have to wait for the image to be built in the daytime.
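The first two options above come down to small decisions about tags, which can be sketched as pure functions. This is an illustration only: the "--" separator, the function names, and the idea of passing in the set of tags that already exist in ECR are all assumptions made for this sketch, and it assumes canonical ci-XXX tags never contain "--" themselves.

```python
SEPARATOR = "--"  # hypothetical separator between canonical tag and run identifier


def unique_tag(canonical, run_id):
    """Option 2: derive a per-run tag, e.g. from a PR number, run ID, or commit SHA."""
    return f"{canonical}{SEPARATOR}{run_id}"


def canonical_tag(tag):
    """Option 2: strip the per-run suffix to restore the canonical ci-XXX form."""
    return tag.split(SEPARATOR, 1)[0]


def should_push(canonical, existing_tags):
    """Option 1: upload only if no image with this canonical tag exists yet.

    `existing_tags` is whatever the workflow learned from ECR beforehand;
    how that lookup is done is outside the scope of this sketch.
    """
    return canonical not in set(existing_tags)
```

With these helpers, a CI run would work on unique_tag("ci-XXX", run_id), and only the workflow that publishes the final image would call canonical_tag and push, guarded by should_push so that an existing ci-XXX image is never overwritten with a deeper one.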