add-local-user layer keeps accumulating in Docker image #4436

@ysaito1001

Problem

GitHub Actions runners intermittently fail when pulling images from ECR with the following error:

------STDERR---------
failed to register layer: max depth exceeded
-------------------
docker pull status: 1
An unknown failure happened during Docker pull. This needs to be examined.

Root Cause

Inspecting affected images in ECR revealed that a single layer is being repeatedly appended, with the number of duplicates increasing over time:

% docker inspect <image ID>
...
        "RootFS": {
            "Type": "layers",
            "Layers": [
                ...
                "sha256:482f0923b37807c84cd98d679c1b38013ca7eb252bd12f66d5e0f6fdd910a3b9",
                "sha256:ca1f2a2bc13c53495801f5b30e5bb510923f0c9c3ee851f6d87f15a86f15ab08",
                "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
                "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
                "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
                "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
                "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
                "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
                ...
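
To check whether an image is affected, the repeated entries in RootFS.Layers can be counted programmatically. The following is a minimal Python sketch, assuming the JSON printed by docker inspect has been captured as a string; the function name and the shortened sample digests are hypothetical and only for illustration.

```python
import json
from itertools import groupby


def repeated_layer_runs(inspect_json: str):
    """Return ([(digest, run_length), ...] for runs longer than 1, total layer count)."""
    images = json.loads(inspect_json)        # `docker inspect` prints a JSON array
    layers = images[0]["RootFS"]["Layers"]   # ordered from base layer to top layer
    # groupby collapses *consecutive* identical digests into one run
    runs = [(digest, len(list(group))) for digest, group in groupby(layers)]
    return [(digest, n) for digest, n in runs if n > 1], len(layers)


# Hypothetical, shortened input; real digests are full sha256 values.
sample = json.dumps([{
    "RootFS": {
        "Type": "layers",
        "Layers": ["sha256:aaa", "sha256:bbb", "sha256:ccc", "sha256:ccc", "sha256:ccc"],
    }
}])

repeats, depth = repeated_layer_runs(sample)
print(repeats, depth)
```

A run length that grows between two inspections of the same tag is the symptom described above.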

Inspecting the image with dive shows that the repeated layer was created by the following command:

Command:
RUN |1 USER_ID=1001 /bin/sh -c useradd -l -u ${USER_ID} -G build -o -s /bin/bash localbuild ||     { exit_code=$?; [ $exit_code -eq 9 ] && echo "User localbuild already exists, continuing..." ||     { echo "Failed to create user with error code: $exit_code"; exit $exit_code; }; } # buildkit

Why This Happens

The layer is added here. The accumulation seems to occur with the following sequence:

  1. GitHub Actions runs in an ephemeral environment with no local images
  2. The workflow pulls smithy-build-image:ci-XXX from ECR (where ci-XXX is a hash of the tools directory)
  3. It adds the user creation layer on top
  4. It pushes the modified image back to ECR with the same tag ci-XXX
  5. Each CI run appends another duplicate layer until Docker's max depth is exceeded
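
The sequence above can be modeled with a small simulation. This is only a sketch: the registry is a plain dict standing in for ECR, the function names are hypothetical, and the limit of 125 is an assumption about Docker's maximum layer depth that should be verified against the Docker version in use.

```python
MAX_LAYER_DEPTH = 125  # assumption: Docker's layer-chain limit; verify for your version


def ci_run(registry: dict, tag: str, user_layer: str) -> None:
    """One CI run: pull the tag, add the user layer on top, push under the same tag."""
    pulled = list(registry[tag])   # step 2: pull ci-XXX from the registry
    pulled.append(user_layer)      # step 3: add the user creation layer
    registry[tag] = pulled         # step 4: push back with the same tag


def runs_until_unpullable(base_layers, user_layer, max_depth=MAX_LAYER_DEPTH) -> int:
    """Count CI runs until the stored image has more layers than max_depth."""
    registry = {"ci-XXX": list(base_layers)}
    runs = 0
    while len(registry["ci-XXX"]) <= max_depth:
        ci_run(registry, "ci-XXX", user_layer)
        runs += 1
    return runs


# Hypothetical base image with 20 layers.
base = [f"layer-{i}" for i in range(20)]
print(runs_until_unpullable(base, "add-local-user"))
```

Because each run starts from the previously pushed image rather than from a clean base, the depth only ever increases, which matches the intermittent, eventually permanent pull failures.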

Workarounds

Until this is fixed, there are a couple of workarounds:

  • Force rebuild the image to eliminate duplicates (example).
  • Pull the last known-good, earlier image for ci-XXX from the tools account (its tag should now appear as -), re-tag it as smithy-build-image:ci-XXX, and push it back to ECR, replacing the ci-XXX image that exceeds the maximum layer depth.
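
The second workaround amounts to a pull, tag, and push sequence. The sketch below only builds the command lines and does not run them; it assumes the known-good image no longer has a tag and is therefore referenced by digest. The repository URI and the digest shown are hypothetical placeholders, not values from this issue.

```python
def retag_commands(repo: str, good_digest: str, tag: str) -> list[list[str]]:
    """Build the docker commands that restore a known-good image under `tag`."""
    source = f"{repo}@{good_digest}"   # untagged image, addressed by digest
    target = f"{repo}:{tag}"           # tag that currently points at the broken image
    return [
        ["docker", "pull", source],
        ["docker", "tag", source, target],
        ["docker", "push", target],
    ]


# Hypothetical placeholders; substitute the real registry URI and digest.
commands = retag_commands(
    repo="<account>.dkr.ecr.<region>.amazonaws.com/smithy-build-image",
    good_digest="sha256:<digest-of-known-good-image>",
    tag="ci-XXX",
)
for cmd in commands:
    print(" ".join(cmd))
```

Printing the commands first makes it easy to review them before running them by hand or passing each list to subprocess.run.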

Solution

Potential solutions include:

  • Ensure the add-local-user layer is added only once per ci-XXX tag. This could be as simple as making sure that an image with a given tag is uploaded to ECR only once (ex.)
  • Make the image tag ci-XXX unique per CI run by appending a unique identifier (e.g., PR number, run ID, or commit SHA). When ci-main.yml uploads the final image to ECR, strip the unique identifier to restore the canonical ci-XXX tag format.
  • Squash all layers before uploading the image to ECR.
  • Shorten the ECR lifecycle policy rules for ci-XXX so that those images expire more frequently; the resulting rebuild eliminates the repeated layers. We could even let the nightly scheduled dry run be the first to build the image, so nobody has to wait for an image build during the day.
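
The first two options can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the ".run-" separator, the function names, and the dict standing in for ECR are all hypothetical.

```python
SEPARATOR = ".run-"  # hypothetical separator between canonical tag and run identifier


def unique_tag(canonical: str, run_id: str) -> str:
    """Per-run tag, e.g. ci-abc123 plus run 42 gives ci-abc123.run-42."""
    return f"{canonical}{SEPARATOR}{run_id}"


def canonical_tag(tag: str) -> str:
    """Strip the per-run suffix, if any, to recover the canonical ci-XXX tag."""
    return tag.partition(SEPARATOR)[0]


def push_once(registry: dict, tag: str, layers: list) -> bool:
    """Upload only if the tag is not already present; return True if pushed."""
    if tag in registry:
        return False               # tag exists: leave the stored image untouched
    registry[tag] = list(layers)
    return True


registry = {}  # stand-in for ECR
first = push_once(registry, "ci-abc123", ["base", "add-local-user"])
second = push_once(registry, "ci-abc123", ["base", "add-local-user", "add-local-user"])
print(first, second, len(registry["ci-abc123"]))
```

Either approach breaks the pull, append, push cycle on a shared tag: push-once keeps the stored image immutable, while per-run tags keep the modified images away from the canonical tag.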

Labels: ops (Improves our operations and release process)
