DM-54833-2: Remove DIND in favor of buildkit by roceb · Pull Request #1174 · lsst-dm/jenkins-dm-jobs

roceb · 2026-05-12T17:15:56Z

No description provided.

…deDockerWrap

…tion, style fixes

…ontainer

…deK8sContainer Generated with AI Co-Authored-By: SLAC AI

…ontainer

…Container

All callers have been migrated to rootless BuildKit / insideK8sContainer. Remove the three dind-era helpers that are no longer referenced anywhere. Generated with AI Co-Authored-By: SLAC AI

…ildKit cache Generated with AI Co-Authored-By: SLAC AI

…o digests collected

…ocker Hub with GHCR Generated with AI Co-Authored-By: SLAC AI

…uildKit cache Generated with AI Co-Authored-By: SLAC AI

…t pod specs Replaces docker:27.1.1-dind + docker-gc sidecars with moby/buildkit:v0.15.0-rootless in idf-agent-ldfc, idf-agent-ldfc-arch, and snowflake pod templates. Updates DOCKER_HOST → BUILDKIT_HOST env var and docker-graph-storage → buildkit-socket emptyDir volume in all three specs. Generated with AI Co-Authored-By: SLAC AI

…ners Add two critical flags to all three buildkitd containers:
1. --oci-worker-no-process-sandbox: Required because GKE nodes don't have host-level user-namespace support. Without it, builds fail when runc tries to create a user namespace for process sandboxing.
2. seccompProfile with type Unconfined: Kubernetes RuntimeDefault seccomp profile blocks mount/umount and FUSE syscalls that rootless BuildKit needs for its overlay filesystem.
Applied to all three buildkitd containers (jenkins-workers-c4d, idf-agent-ldfc-arch, and snowflake agents).
Generated with AI Co-Authored-By: SLAC AI
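The two settings described above can be pictured as a container spec. The following is a minimal Python sketch that builds such a spec as a dict, not the actual values.yaml from this PR. The function name `buildkitd_container` is hypothetical; the image tag, container name, flag, and seccomp type are taken from the commit messages.

```python
import json


def buildkitd_container(image="moby/buildkit:v0.15.0-rootless"):
    """Return a Kubernetes container spec (as a dict) for a rootless buildkitd sidecar.

    Hypothetical helper for illustration; the real pod specs live in YAML.
    """
    return {
        "name": "buildkitd",
        "image": image,
        # Disable runc's per-process sandbox, which needs host user-namespace support.
        "args": ["--oci-worker-no-process-sandbox"],
        "securityContext": {
            # RuntimeDefault would block the mount/FUSE syscalls rootless BuildKit uses.
            "seccompProfile": {"type": "Unconfined"},
        },
    }


if __name__ == "__main__":
    # JSON is also valid YAML, so this output could be pasted into a pod spec.
    print(json.dumps(buildkitd_container(), indent=2))
```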

Replace docker:27.1.1-dind + docker-gc sidecar containers with moby/buildkit:v0.15.0-rootless in all three pod specs (idf-agent-ldfc-dev, idf-agent-ldfc-arch, snowflake-dev). Update jnlp containers to use BUILDKIT_HOST and buildkit-socket volume mount. Mirrors the production values.yaml migration. Generated with AI Co-Authored-By: SLAC AI

…ication plugin Generated with AI Co-Authored-By: SLAC AI

The docker-scipipe image does not add UID 1000 to /etc/passwd (that was done at runtime by the old wrapDockerImage/useradd pattern). Without a /etc/passwd entry for UID 1000 and no HOME in the container environment, Python's Path.home() raises RuntimeError. scons may also spawn pytest without inheriting the full shell environment, so withEnv() alone is insufficient. Setting HOME=/home/jenkins in the K8s container env spec ensures it is present in os.environ from container startup, reachable by any subprocess. jenkinsWrapper still overrides it to ${cwd}/home via withEnv for the actual build.

getpwuid(1000) fails (UID 1000 not in /etc/passwd), so git and other tools fall back to LOGNAME/USER. Without either, they warn and assume an unknown user. Setting USER=jenkins matches the jenkins UID 1000 identity used by LSST images.

…sstswBuild podTemplate { node() } creates a new Jenkins executor whose working directory starts at the workspace root, not at the outer dir(buildDirHash) subdirectory. Without dir(slug) inside the pod, jenkinsWrapper runs in the workspace root: artifacts end up at lsstsw/build/... instead of linux-9-x86/lsstsw/build/..., and jenkinsWrapperPost cannot find them.

SCons builds subprocess environments from its own ENV dict rather than os.environ, so Jenkins withEnv HOME never reaches pytest subprocesses. Python's Path.home() raises RuntimeError only when HOME is absent AND pwd.getpwuid(uid) fails; the old wrapDockerImage called useradd to provide the getpwuid fallback, which insideK8sContainer never did. Add a setup-passwd initContainer (same image, UID 1000, no root needed) that copies /etc/passwd and appends a jenkins:1000 entry if absent, then mount the result over /etc/passwd in the runner container. Also add LOGNAME=jenkins alongside USER/HOME to suppress git getpwuid warnings. Generated with AI Co-Authored-By: SLAC AI
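The failure mode described above follows from the lookup order Python uses on POSIX: HOME first, then the passwd database. The sketch below mimics that order with an injectable lookup so both outcomes can be exercised without changing the real environment. `resolve_home` and `no_such_user` are hypothetical names; this is an approximation of the standard library's behavior, not a copy of it.

```python
import pwd


def resolve_home(environ, uid, getpwuid=pwd.getpwuid):
    """Approximate the lookup behind Path.home(): HOME, then the passwd entry."""
    if "HOME" in environ:
        return environ["HOME"]
    try:
        # Fallback that the old useradd step used to make possible.
        return getpwuid(uid).pw_dir
    except KeyError:
        # No HOME and no passwd entry: nothing left to try.
        raise RuntimeError("Could not determine home directory.") from None


def no_such_user(uid):
    """Stand-in for getpwuid() in a container whose /etc/passwd lacks the UID."""
    raise KeyError(f"getpwuid(): uid not found: {uid}")


if __name__ == "__main__":
    print(resolve_home({"HOME": "/home/jenkins"}, 1000, no_such_user))
    try:
        resolve_home({}, 1000, no_such_user)
    except RuntimeError as exc:
        print(f"RuntimeError: {exc}")
```

Either a HOME value that actually reaches the subprocess or a passwd entry for UID 1000 is enough to avoid the error, which is why the commits try both routes.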

insideK8sContainer allocates a new pod via podTemplate { node() } which gets its own workspace separate from the outer nodeWrap agent. Build artifacts are therefore in the inner pod's workspace, not the outer agent's workspace, so the previous outer finally { jenkinsWrapperPost() } never found them (error: 'linux-9-x86' doesn't exist). Move jenkinsWrapperPost(slug) into a finally block inside runDocker so it runs on the same pod that produced the artifacts. For non-image builds (e.g. macOS), the build runs directly on the outer agent so the existing outer finally path is kept. Generated with AI Co-Authored-By: SLAC AI

…sContainer The subPath mount over /etc/passwd caused all pods created by insideK8sContainer to fail to start, breaking all Jenkins workers. Revert to the simple pod spec while keeping the HOME/USER/LOGNAME env vars. The scarlet_lite getpwuid issue needs a different fix. Generated with AI Co-Authored-By: SLAC AI

…insideK8sContainer Cluster default sets readOnlyRootFilesystem:true; /j does not exist in LSST builder images, so Jenkins cannot create /j/workspace/... without an explicit writable volume. All existing working agent pods in values.yaml use the same pattern: readOnlyRootFilesystem:false + emptyDir at /j. Generated with AI Co-Authored-By: SLAC AI

…uppress getpwuid warning UID 1000 is not in /etc/passwd in LSST base images, so git warns "getpwuid failed, guessing username from LOGNAME or USER variable" on every operation. By mounting an emptyDir at /home/jenkins and using an initContainer to write a .gitconfig there, git finds user.name/user.email without calling getpwuid at all. Generated with AI Co-Authored-By: SLAC AI

The old loadCache created a separate gcloud-cli pod and used a hostPath mount to share the workspace. This breaks with emptyDir workspaces because the workspace path only exists inside the agent container's overlay, not on the node's real filesystem where hostPath looks. The new approach adds gcloud-cli as an optional sidecar to the builder pod via insideK8sContainer(cacheImage: ...). Both containers mount the same j-workspace emptyDir so any files downloaded by the gcloud-cli container are immediately visible to the runner. loadCache now uses container('gcloud-cli') instead of spawning a new pod. Generated with AI Co-Authored-By: SLAC AI
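The sidecar arrangement above can be sketched as a pod spec in which both containers mount the same emptyDir. This is an illustrative Python sketch, not the YAML that insideK8sContainer renders: the function name `builder_pod` and its parameters are hypothetical, while the `runner` and `gcloud-cli` container names, the `j-workspace` volume, and the `/j` mount path come from the commit messages.

```python
def builder_pod(runner_image, cache_image=None):
    """Build a pod spec dict with an optional gcloud-cli sidecar sharing /j."""
    shared_mount = {"name": "j-workspace", "mountPath": "/j"}
    containers = [
        {"name": "runner", "image": runner_image, "volumeMounts": [shared_mount]},
    ]
    if cache_image:
        # Same volume, same path: files the sidecar downloads are visible to the runner.
        containers.append(
            {"name": "gcloud-cli", "image": cache_image, "volumeMounts": [shared_mount]}
        )
    return {
        "containers": containers,
        "volumes": [{"name": "j-workspace", "emptyDir": {}}],
    }


if __name__ == "__main__":
    # Image names here are placeholders, not the images used by the PR.
    pod = builder_pod("example/builder:tag", cache_image="example/gcloud-cli:tag")
    print([c["name"] for c in pod["containers"]])
```

Because an emptyDir lives for the lifetime of the pod, no hostPath is needed and the workspace path does not have to exist on the node's filesystem.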

…paths The lsstsw cache tarball was built with workspace rooted at /j/workspace/... so conda bakes those absolute paths into its activation scripts. Without an explicit jnlp container, Jenkins Kubernetes plugin defaults workingDir to /home/jenkins/agent, placing the workspace at /home/jenkins/agent/workspace/... which causes conda.sh to reference a non-existent /j/workspace/... path. Adding a jnlp container stub with workingDir:/j causes the plugin to merge it with its auto-injected jnlp config, rooting the workspace at /j/workspace/... to match what the cache was built with. Generated with AI Co-Authored-By: SLAC AI

saveCache: remove conda install google-cloud-sdk (slow, unreliable in the LSST builder image); clone ci-scripts then delegate to the gcloud-cli sidecar container, matching the pattern used by loadCache.
loadCache: patch stale conda prefix after extraction. If the cache tarball was built in a workspace with a different slug, the absolute paths baked into conda's activation scripts break; detect and replace them in miniconda/etc, bin, and condabin so conda activates correctly regardless of which slug the cache was originally created under.
runDocker: pass cacheImage when cachelsstsw is true so the gcloud-cli sidecar is present for save-cache builds too.
Generated with AI Co-Authored-By: SLAC AI

…li sidecar The old implementation created a separate pod with a hostPath mount to share test data with the builder. With emptyDir workspaces the hostPath never resolves (the workspace path lives only in the agent container overlay), so the pod fails with CreateContainerError. There was also a pre-existing silent data-loss bug: dir() context does not cross node() boundaries, so rclone was downloading into the inner pod's ephemeral workspace rather than into the outer agent's testdata directory. Fix: remove the inner pod entirely. loadLSSTCamTestData now calls container('gcloud-cli') — the sidecar already added to the builder pod when CI_LSSTCAM is set — so rclone writes directly into the shared j-workspace emptyDir, making the test data visible to the runner container without any inter-pod data transfer. Generated with AI Co-Authored-By: SLAC AI

Two production failures on the emptyDir migration:
aarch64 segfault: grep -rl without -I matched binary files (compiled extensions, the conda executable itself) under miniconda/bin that happened to contain the stale workspace path. sed -i then corrupted those binaries, causing CONDA_EXE to segfault on activation. Fix: grep -rIl skips binary files, limiting the path-fixup to text (activation scripts, shebangs).
x86 "RuntimeError: Could not determine home directory": static agents had UID 1000 in /etc/passwd so Python's getpwuid(1000) fallback always worked, even when something in ci-scripts/lsstsw unset HOME. The emptyDir pods run UID 1000 with no passwd entry, so the fallback raises KeyError and Python raises RuntimeError. Fix: runner container startup writes a jenkins:1000 entry to /etc/passwd before exec-ing sleep, restoring the getpwuid fallback for any code (astropy, git, etc.) that needs a home directory independently of the HOME env var.
Generated with AI Co-Authored-By: SLAC AI

printf 'string\n' inside a Groovy triple-double-quoted string interpolates \n as a real newline, splitting the YAML block scalar across lines. The line ' >> /etc/passwd then lands at column 1 outside the block's indentation, causing SnakeYAML to fail with "could not find expected ':'". echo adds the trailing newline itself, so no escape sequence is needed. Generated with AI Co-Authored-By: SLAC AI
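The escape-sequence problem above can be reproduced outside Jenkins. In the sketch below Python stands in for Groovy, since both expand `\n` inside a non-raw triple-quoted string before any YAML parser sees the text. The YAML fragment, the passwd entry, and the helper `lines_outside_block` are hypothetical illustrations, not the pod template from the PR.

```python
# "\n" below is expanded to a real newline when the string literal is evaluated,
# so the shell command is split across two lines of the YAML block scalar.
BROKEN = """\
command:
  - sh
  - -c
  - |
    printf 'jenkins:x:1000:1000::/home/jenkins:/bin/sh\n' >> /etc/passwd
"""

# echo appends the trailing newline itself, so no escape sequence is needed.
FIXED = """\
command:
  - sh
  - -c
  - |
    echo 'jenkins:x:1000:1000::/home/jenkins:/bin/sh' >> /etc/passwd
"""


def lines_outside_block(text, indent="    "):
    """Return non-empty lines after the '- |' marker that lost the block's indentation."""
    lines = text.splitlines()
    start = lines.index("  - |") + 1
    return [ln for ln in lines[start:] if ln and not ln.startswith(indent)]


if __name__ == "__main__":
    print(lines_outside_block(BROKEN))
    print(lines_outside_block(FIXED))
```

The stray line in the broken variant starts at the left margin, which is where a YAML parser stops treating the text as part of the block scalar.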

The previous fixup only patched miniconda/etc, bin, and condabin. The stale x86 workspace path was also baked into conda-env helpers such as miniconda/envs/lsst-scipipe-13.0.0/eups/bin/setups.sh, causing eups to reference /j/workspace/stack-os-matrix/linux-9-x86/... on an aarch64 pod. Widen the grep to the entire miniconda directory so any file in any subdirectory (envs, pkgs, lib, etc.) gets patched. The -I flag already ensures binary files are skipped. Generated with AI Co-Authored-By: SLAC AI
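The text-only path fixup described above can be sketched in Python. This is an illustration of the idea, not the shell pipeline from the PR: `patch_stale_prefix` is a hypothetical name, and the binary check (any NUL byte means binary) is an assumption that only approximates what grep -I does.

```python
import os


def patch_stale_prefix(root, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix in every text file under root.

    Files containing a NUL byte are treated as binary and left untouched,
    so compiled extensions and executables are never rewritten.
    """
    old, new = old_prefix.encode(), new_prefix.encode()
    patched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as fh:
                data = fh.read()
            if b"\0" in data or old not in data:
                continue  # binary file, or nothing to patch
            with open(path, "wb") as fh:
                fh.write(data.replace(old, new))
            patched.append(path)
    return sorted(patched)
```

Walking the whole tree rather than a fixed list of subdirectories mirrors the widened grep: any text file under any subdirectory is eligible, and the binary check is what keeps that safe.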

Move the pod-template YAML construction into a standalone @NonCPS renderPodYaml(Map) so it can be unit-tested without a live Jenkins. The generated YAML is unchanged; insideK8sContainer now computes pullPolicy and delegates rendering. Generated with AI Co-Authored-By: SLAC AI

…e-to on push buildkitCacheArgs gains a pushCache flag so --cache-to (which needs write auth) is omitted on NO_PUSH builds while --cache-from still accelerates them. Move GCP Artifact Registry auth out of the !noPush gate in build_stack, and add it to build_docker_newinstall (which previously had none), so the registry cache is always authenticated. GHCR image-push login stays gated on push. Generated with AI Co-Authored-By: SLAC AI
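The pushCache gating above amounts to always reading from the registry cache and writing to it only when pushing is allowed. The following Python sketch shows one way to build such arguments for `docker buildx build`. It is not the Groovy helper from the PR: the function name, the `cache_ref` parameter, and the `mode=max` option are assumptions made for illustration.

```python
def buildkit_cache_args(cache_ref, push_cache):
    """Return docker buildx cache flags for a registry-backed cache.

    cache_ref is a hypothetical image reference used to store the cache.
    """
    # Reading the cache speeds up the build even when nothing will be pushed.
    args = [f"--cache-from=type=registry,ref={cache_ref}"]
    if push_cache:
        # Exporting the cache needs write access, so skip it on no-push builds.
        args.append(f"--cache-to=type=registry,ref={cache_ref},mode=max")
    return args


if __name__ == "__main__":
    ref = "registry.example.invalid/buildcache/stack:cache"  # placeholder reference
    print(buildkit_cache_args(ref, push_cache=False))
    print(buildkit_cache_args(ref, push_cache=True))
```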

Generated with AI Co-Authored-By: SLAC AI

The insideK8sContainer pod rendered by renderPodYaml carried neither an arch nodeSelector nor a toleration for the arm taint (kubernetes.io/arch=arm64:NoSchedule), so both stack-os-matrix instances scheduled on x86 regardless of the matrix entry. Thread an optional arch through insideK8sContainer -> renderPodYaml and emit the arm nodeSelector + toleration when arch=arm64; lsstswBuild derives it from the config label. Also point the matrix at docker-scipipe:pr-14-tickets-DM-54833-2 (which bakes in a uid-1000 jenkins user) to validate the getpwuid/home fix on dev. Revert this image tag to :9-latest once docker-scipipe#14 merges. Generated with AI Co-Authored-By: SLAC AI
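The scheduling fields described above can be sketched as a small function that emits them only for arm64. This is a Python illustration, not the Groovy renderPodYaml: `arch_scheduling` is a hypothetical name, and the taint key, value, and effect are taken from the commit message.

```python
def arch_scheduling(arch=None):
    """Return the pod-spec fields needed to land a pod on an arm64 node.

    For any other arch (or none) nothing is added, so the pod keeps the
    cluster's default placement.
    """
    if arch != "arm64":
        return {}
    return {
        # Restrict the pod to arm64 nodes.
        "nodeSelector": {"kubernetes.io/arch": "arm64"},
        # Tolerate the kubernetes.io/arch=arm64:NoSchedule taint on those nodes.
        "tolerations": [
            {
                "key": "kubernetes.io/arch",
                "operator": "Equal",
                "value": "arm64",
                "effect": "NoSchedule",
            }
        ],
    }
```

Both fields are needed together: the toleration only permits scheduling on the tainted nodes, while the nodeSelector is what actually forces the pod onto them.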

Name the inner pod <job>-<build>-<arch> so the two stack-os-matrix instances are distinguishable at a glance (e.g. ...-arm64 vs ...-amd64) instead of both showing opaque random suffixes. Generated with AI Co-Authored-By: SLAC AI

…inspect Replace daemon-dependent docker.image().pull() + docker inspect in ap_verify and verify_drp_metrics with util.imageLabels(), which queries the registry via docker buildx imagetools inspect (no daemon, no pull). Add a pure parseImageLabels parser robust to skopeo/crane/imagetools JSON shapes. Generated with AI Co-Authored-By: SLAC AI
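A parser that tolerates several JSON layouts can be sketched by probing a few known locations for the labels map. The Python below is an illustration, not the parseImageLabels from the PR. The three shapes it accepts (top-level `Labels`, `config.Labels`, and `Config.Labels`) are assumptions about what such tools commonly emit.

```python
import json


def _dict_at(doc, *path):
    """Follow path through nested dicts; return the final dict or None."""
    node = doc
    for key in path:
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    return node if isinstance(node, dict) else None


def parse_image_labels(raw):
    """Extract image labels from inspect-style JSON, whatever its layout.

    Accepts a JSON string or an already-decoded object and returns a
    str-to-str dict, empty when no labels are present.
    """
    doc = json.loads(raw) if isinstance(raw, (str, bytes)) else raw
    for path in (("Labels",), ("config", "Labels"), ("Config", "Labels")):
        labels = _dict_at(doc, *path)
        if labels is not None:
            return {str(k): str(v) for k, v in labels.items()}
    return {}
```

Keeping the parser free of subprocess or registry calls is what makes it easy to unit-test with canned JSON.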

roceb force-pushed the tickets/DM-54833-2 branch from ff1933c to 470452a Compare May 14, 2026 14:22

roceb added 29 commits May 15, 2026 15:01

feat(util): add setupBuildkitBuilder and buildkitCacheArgs helpers

5b6438a

feat(util): add insideK8sContainer as K8s-native replacement for insi…

b09279a

…deDockerWrap

fix(util): fix insideK8sContainer empty mounts YAML, add entry valida…

03bd34c

…tion, style fixes

refactor(util): migrate internal insideDockerWrap calls to insideK8sC…

d63a8e3

…ontainer

refactor(util): migrate remaining insideDockerWrap call sites to insi…

176db6f

…deK8sContainer Generated with AI Co-Authored-By: SLAC AI

refactor(run_rebuild): replace insideDockerWrap with insideK8sContainer

8467120

refactor(run_publish): replace insideDockerWrap with insideK8sContainer

0b01b3d

refactor(tarball): replace wrapDockerImage+docker run with insideK8sC…

6952248

…ontainer

refactor(verify_drp_metrics): replace insideDockerWrap with insideK8s…

c99826b

…Container

refactor(ap_verify): replace insideDockerWrap with insideK8sContainer

c7da1b9

refactor(util): delete buildImage, wrapDockerImage, insideDockerWrap

a3dd583

All callers have been migrated to rootless BuildKit / insideK8sContainer. Remove the three dind-era helpers that are no longer referenced anywhere. Generated with AI Co-Authored-By: SLAC AI

feat(build_docker_newinstall): migrate to docker buildx build with Bu…

a77c4c2

…ildKit cache Generated with AI Co-Authored-By: SLAC AI

fix(build_docker_newinstall): guard merge block when noPush=true or n…

d93b45f

…o digests collected

feat(build_jenkins_swarm_client): migrate to docker buildx, replace D…

9af4fb3

…ocker Hub with GHCR Generated with AI Co-Authored-By: SLAC AI

feat(build_stack): migrate to docker buildx, remove Docker Hub, add B…

d2dc4a9

…uildKit cache Generated with AI Co-Authored-By: SLAC AI

fix(values): rename BuildKit container name buildkit→buildkitd

33b9c1a

chore: remove Docker Hub references, credentials, and dockerhub-notif…

60d227a

…ication plugin Generated with AI Co-Authored-By: SLAC AI

roceb added 4 commits May 15, 2026 15:01

roceb force-pushed the tickets/DM-54833-2 branch from e4951f8 to 57f2a63 Compare May 15, 2026 22:02

roceb added 9 commits May 15, 2026 16:38

fix(util): make setupBuildkitBuilder idempotent and fail loudly

548a4b9

Generated with AI Co-Authored-By: SLAC AI

refactor: move buildcache repos, gcloud-cli image, and eups SA to config

da76df5

Generated with AI Co-Authored-By: SLAC AI

feat(util): include arch in insideK8sContainer pod name

01e73af

Name the inner pod <job>-<build>-<arch> so the two stack-os-matrix instances are distinguishable at a glance (e.g. ...-arm64 vs ...-amd64) instead of both showing opaque random suffixes. Generated with AI Co-Authored-By: SLAC AI

roceb wants to merge 42 commits into main from tickets/DM-54833-2
