DM-54833-2: Remove DIND in favor of buildkit #1174
Draft
roceb wants to merge 42 commits into
…tion, style fixes
…deK8sContainer Generated with AI Co-Authored-By: SLAC AI
All callers have been migrated to rootless BuildKit / insideK8sContainer. Remove the three dind-era helpers that are no longer referenced anywhere. Generated with AI Co-Authored-By: SLAC AI
…ildKit cache Generated with AI Co-Authored-By: SLAC AI
…o digests collected
…ocker Hub with GHCR Generated with AI Co-Authored-By: SLAC AI
…uildKit cache Generated with AI Co-Authored-By: SLAC AI
…t pod specs Replaces docker:27.1.1-dind + docker-gc sidecars with moby/buildkit:v0.15.0-rootless in idf-agent-ldfc, idf-agent-ldfc-arch, and snowflake pod templates. Updates DOCKER_HOST → BUILDKIT_HOST env var and docker-graph-storage → buildkit-socket emptyDir volume in all three specs. Generated with AI Co-Authored-By: SLAC AI
…ners
Add two critical flags to all three buildkitd containers:
1. --oci-worker-no-process-sandbox: Required because GKE nodes don't have host-level user-namespace support. Without it, builds fail when runc tries to create a user namespace for process sandboxing.
2. seccompProfile with type Unconfined: The Kubernetes RuntimeDefault seccomp profile blocks the mount/umount and FUSE syscalls that rootless BuildKit needs for its overlay filesystem.
Applied to all three buildkitd containers (jenkins-workers-c4d, idf-agent-ldfc-arch, and snowflake agents).
Generated with AI Co-Authored-By: SLAC AI
Replace docker:27.1.1-dind + docker-gc sidecar containers with moby/buildkit:v0.15.0-rootless in all three pod specs (idf-agent-ldfc-dev, idf-agent-ldfc-arch, snowflake-dev). Update jnlp containers to use BUILDKIT_HOST and buildkit-socket volume mount. Mirrors the production values.yaml migration. Generated with AI Co-Authored-By: SLAC AI
…ication plugin Generated with AI Co-Authored-By: SLAC AI
The docker-scipipe image does not add UID 1000 to /etc/passwd (that was
done at runtime by the old wrapDockerImage/useradd pattern). Without a
/etc/passwd entry for UID 1000 and no HOME in the container environment,
Python's Path.home() raises RuntimeError. scons may also spawn pytest
without inheriting the full shell environment, so withEnv() alone is
insufficient.
Setting HOME=/home/jenkins in the K8s container env spec ensures it is
present in os.environ from container startup, reachable by any subprocess.
jenkinsWrapper still overrides it to ${cwd}/home via withEnv for the
actual build.
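The lookup order described above can be sketched in Python. This is a simplified model of what `Path.home()` relies on for POSIX systems, not the standard-library source; `resolve_home` and `no_passwd_entry` are hypothetical names introduced only for illustration.

```python
from pathlib import Path


def resolve_home(env, uid, getpwuid=None):
    """Model the fallback chain: HOME first, then the passwd database."""
    if "HOME" in env:
        return Path(env["HOME"])
    if getpwuid is None:
        import pwd  # POSIX only; imported lazily so the sketch loads anywhere
        getpwuid = pwd.getpwuid
    try:
        return Path(getpwuid(uid).pw_dir)
    except KeyError:
        # Neither source is available: the failure mode seen in the pod.
        raise RuntimeError("Could not determine home directory.")


def no_passwd_entry(uid):
    """Stand-in for a container image whose /etc/passwd lacks the UID."""
    raise KeyError(uid)


# With HOME in the container env spec, the passwd lookup is never reached.
print(resolve_home({"HOME": "/home/jenkins"}, 1000, no_passwd_entry))
```

Calling `resolve_home({}, 1000, no_passwd_entry)` raises `RuntimeError`, which is the case the container-level `HOME` is meant to rule out.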
getpwuid(1000) fails (UID 1000 not in /etc/passwd), so git and other tools fall back to LOGNAME/USER. Without either, they warn and assume an unknown user. Setting USER=jenkins matches the jenkins UID 1000 identity used by LSST images.
…sstswBuild
podTemplate { node() } creates a new Jenkins executor whose working
directory starts at the workspace root, not at the outer dir(buildDirHash)
subdirectory. Without dir(slug) inside the pod, jenkinsWrapper runs in
the workspace root: artifacts end up at lsstsw/build/... instead of
linux-9-x86/lsstsw/build/..., and jenkinsWrapperPost cannot find them.
SCons builds subprocess environments from its own ENV dict rather than os.environ, so Jenkins withEnv HOME never reaches pytest subprocesses. Python's Path.home() raises RuntimeError only when HOME is absent AND pwd.getpwuid(uid) fails; the old wrapDockerImage called useradd to provide the getpwuid fallback, which insideK8sContainer never did.
Add a setup-passwd initContainer (same image, UID 1000, no root needed) that copies /etc/passwd and appends a jenkins:1000 entry if absent, then mount the result over /etc/passwd in the runner container. Also add LOGNAME=jenkins alongside USER/HOME to suppress git getpwuid warnings.
Generated with AI Co-Authored-By: SLAC AI
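A minimal sketch of why a variable set in the parent does not reach a child that is launched with an explicitly constructed environment. It uses plain `subprocess` rather than the SCons API; `tool_env` is a hypothetical stand-in for a build tool's own ENV dict, and the sketch assumes a Linux host where the interpreter starts with only `PATH` set.

```python
import os
import subprocess
import sys


def child_home(env):
    """Run a child Python and return the HOME it sees ('' when unset)."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ.get('HOME', ''))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The parent has HOME, as it would under a withEnv-style wrapper.
parent_env = dict(os.environ, HOME="/home/jenkins")

# The tool builds its own environment and copies only what it chooses.
tool_env = {"PATH": parent_env.get("PATH", "")}

print(repr(child_home(tool_env)))                              # HOME is missing
print(repr(child_home(dict(tool_env, HOME="/home/jenkins"))))  # HOME passed explicitly
```

Because the child only ever sees the dict it is handed, the fix has to work without `HOME`, which is what restoring the passwd entry achieves.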
insideK8sContainer allocates a new pod via podTemplate { node() } which
gets its own workspace separate from the outer nodeWrap agent. Build
artifacts are therefore in the inner pod's workspace, not the outer
agent's workspace, so the previous outer finally { jenkinsWrapperPost() }
never found them (error: 'linux-9-x86' doesn't exist).
Move jenkinsWrapperPost(slug) into a finally block inside runDocker so
it runs on the same pod that produced the artifacts. For non-image
builds (e.g. macOS), the build runs directly on the outer agent so the
existing outer finally path is kept.
Generated with AI
Co-Authored-By: SLAC AI
…sContainer The subPath mount over /etc/passwd caused all pods created by insideK8sContainer to fail to start, breaking all Jenkins workers. Revert to the simple pod spec while keeping the HOME/USER/LOGNAME env vars. The scarlet_lite getpwuid issue needs a different fix. Generated with AI Co-Authored-By: SLAC AI
…insideK8sContainer Cluster default sets readOnlyRootFilesystem:true; /j does not exist in LSST builder images, so Jenkins cannot create /j/workspace/... without an explicit writable volume. All existing working agent pods in values.yaml use the same pattern: readOnlyRootFilesystem:false + emptyDir at /j. Generated with AI Co-Authored-By: SLAC AI
…uppress getpwuid warning UID 1000 is not in /etc/passwd in LSST base images, so git warns "getpwuid failed, guessing username from LOGNAME or USER variable" on every operation. By mounting an emptyDir at /home/jenkins and using an initContainer to write a .gitconfig there, git finds user.name/user.email without calling getpwuid at all. Generated with AI Co-Authored-By: SLAC AI
The old loadCache created a separate gcloud-cli pod and used a hostPath
mount to share the workspace. This breaks with emptyDir workspaces because
the workspace path only exists inside the agent container's overlay, not
on the node's real filesystem where hostPath looks.
The new approach adds gcloud-cli as an optional sidecar to the builder pod
via insideK8sContainer(cacheImage: ...). Both containers mount the same
j-workspace emptyDir so any files downloaded by the gcloud-cli container
are immediately visible to the runner. loadCache now uses
container('gcloud-cli') instead of spawning a new pod.
Generated with AI
Co-Authored-By: SLAC AI
…paths The lsstsw cache tarball was built with workspace rooted at /j/workspace/... so conda bakes those absolute paths into its activation scripts. Without an explicit jnlp container, Jenkins Kubernetes plugin defaults workingDir to /home/jenkins/agent, placing the workspace at /home/jenkins/agent/workspace/... which causes conda.sh to reference a non-existent /j/workspace/... path. Adding a jnlp container stub with workingDir:/j causes the plugin to merge it with its auto-injected jnlp config, rooting the workspace at /j/workspace/... to match what the cache was built with. Generated with AI Co-Authored-By: SLAC AI
saveCache: remove conda install google-cloud-sdk (slow, unreliable in the LSST builder image); clone ci-scripts then delegate to the gcloud-cli sidecar container, matching the pattern used by loadCache.
loadCache: patch stale conda prefix after extraction. If the cache tarball was built in a workspace with a different slug, the absolute paths baked into conda's activation scripts break; detect and replace them in miniconda/etc, bin, and condabin so conda activates correctly regardless of which slug the cache was originally created under.
runDocker: pass cacheImage when cachelsstsw is true so the gcloud-cli sidecar is present for save-cache builds too.
Generated with AI Co-Authored-By: SLAC AI
…li sidecar
The old implementation created a separate pod with a hostPath mount to share
test data with the builder. With emptyDir workspaces the hostPath never
resolves (the workspace path lives only in the agent container overlay), so
the pod fails with CreateContainerError. There was also a pre-existing silent
data-loss bug: dir() context does not cross node() boundaries, so rclone was
downloading into the inner pod's ephemeral workspace rather than into the
outer agent's testdata directory.
Fix: remove the inner pod entirely. loadLSSTCamTestData now calls
container('gcloud-cli') — the sidecar already added to the builder pod when
CI_LSSTCAM is set — so rclone writes directly into the shared j-workspace
emptyDir, making the test data visible to the runner container without any
inter-pod data transfer.
Generated with AI
Co-Authored-By: SLAC AI
Two production failures on the emptyDir migration:
aarch64 segfault: grep -rl without -I matched binary files (compiled extensions, the conda executable itself) under miniconda/bin that happened to contain the stale workspace path. sed -i then corrupted those binaries, causing CONDA_EXE to segfault on activation. Fix: grep -rIl skips binary files, limiting the path fixup to text files (activation scripts, scripts with shebangs).
x86 "RuntimeError: Could not determine home directory": static agents had UID 1000 in /etc/passwd, so Python's getpwuid(1000) fallback always worked, even when something in ci-scripts/lsstsw unset HOME. The emptyDir pods run as UID 1000 with no passwd entry, so the fallback raises KeyError and Python raises RuntimeError. Fix: runner container startup writes a jenkins:1000 entry to /etc/passwd before exec-ing sleep, restoring the getpwuid fallback for any code (astropy, git, etc.) that needs a home directory independently of the HOME env var.
Generated with AI Co-Authored-By: SLAC AI
printf 'string\n' inside a Groovy triple-double-quoted string has its \n escape turned into a real newline by Groovy before the YAML is parsed, splitting the YAML block scalar across two lines. The remainder, ' >> /etc/passwd, then lands at column 1, outside the block's indentation, causing SnakeYAML to fail with "could not find expected ':'". echo adds the trailing newline itself, so no escape sequence is needed. Generated with AI Co-Authored-By: SLAC AI
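The escape-processing problem can be reproduced outside Jenkins. The sketch below uses Python as a stand-in for Groovy, on the assumption that both process backslash escapes in triple-double-quoted literals; `render` and `unindented_lines` are hypothetical helpers, and the YAML fragment is illustrative rather than the real pod template.

```python
def render(command):
    """Place a shell command inside an indented YAML block scalar."""
    return "args:\n  - |\n    " + command + "\n"


def unindented_lines(yaml_text):
    """Return lines after the first key that fell out of the block's indentation."""
    return [
        line
        for line in yaml_text.splitlines()[1:]
        if line and not line.startswith(" ")
    ]


# The \n is processed when the literal is evaluated, long before YAML parsing.
with_printf = """printf 'string\n' >> /etc/passwd"""
# echo needs no escape sequence, so the command stays on one line.
with_echo = """echo 'string' >> /etc/passwd"""

print(unindented_lines(render(with_printf)))  # the stray tail of the command
print(unindented_lines(render(with_echo)))    # nothing escapes the block
```

The first call reports the fragment that a YAML parser would try to read as a new top-level key, which matches the "could not find expected ':'" failure described above.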
The previous fixup only patched miniconda/etc, bin, and condabin. The stale x86 workspace path was also baked into conda-env helpers such as miniconda/envs/lsst-scipipe-13.0.0/eups/bin/setups.sh, causing eups to reference /j/workspace/stack-os-matrix/linux-9-x86/... on an aarch64 pod. Widen the grep to the entire miniconda directory so any file in any subdirectory (envs, pkgs, lib, etc.) gets patched. The -I flag already ensures binary files are skipped. Generated with AI Co-Authored-By: SLAC AI
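The text-only fixup can be sketched in Python as an equivalent of `grep -rIl ... | xargs sed -i`. This is an illustration, not the shell code in the commit: `patch_prefix` is a hypothetical helper, and treating any file that contains a NUL byte as binary is an approximation of what `grep -I` does.

```python
from pathlib import Path


def patch_prefix(root, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix in every text file under root."""
    old, new = old_prefix.encode(), new_prefix.encode()
    patched = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        data = path.read_bytes()
        if b"\0" in data:
            # Looks binary (compiled extension, executable): never rewrite it.
            continue
        if old in data:
            path.write_bytes(data.replace(old, new))
            patched.append(path)
    return patched
```

Walking the whole tree mirrors the widened search, while the NUL check keeps the safety property that prevented the earlier binary corruption.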
Move the pod-template YAML construction into a standalone @NonCPS renderPodYaml(Map) so it can be unit-tested without a live Jenkins. The generated YAML is unchanged; insideK8sContainer now computes pullPolicy and delegates rendering. Generated with AI Co-Authored-By: SLAC AI
…e-to on push buildkitCacheArgs gains a pushCache flag so --cache-to (which needs write auth) is omitted on NO_PUSH builds while --cache-from still accelerates them. Move GCP Artifact Registry auth out of the !noPush gate in build_stack, and add it to build_docker_newinstall (which previously had none), so the registry cache is always authenticated. GHCR image-push login stays gated on push. Generated with AI Co-Authored-By: SLAC AI
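A minimal sketch of the flag logic, written in Python rather than the pipeline's Groovy. `buildkit_cache_args` is a hypothetical name, the cache reference is a placeholder, and the `type=registry` and `mode=max` options are assumptions about how the cache is configured.

```python
def buildkit_cache_args(cache_ref, push_cache):
    """Build cache flags: always read the cache, write it only when allowed."""
    # Reading the registry cache needs only pull access.
    args = ["--cache-from", f"type=registry,ref={cache_ref}"]
    if push_cache:
        # Writing needs push credentials, so skip it on NO_PUSH builds.
        args += ["--cache-to", f"type=registry,ref={cache_ref},mode=max"]
    return args


# Placeholder reference; the real registry path is not given in this thread.
ref = "registry.example.invalid/cache/stack:buildcache"
print(" ".join(buildkit_cache_args(ref, push_cache=False)))
```

Keeping the read side unconditional is what lets NO_PUSH builds still benefit from earlier cached layers.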
The insideK8sContainer pod rendered by renderPodYaml carried neither an arch nodeSelector nor a toleration for the arm taint (kubernetes.io/arch=arm64:NoSchedule), so both stack-os-matrix instances scheduled on x86 regardless of the matrix entry. Thread an optional arch through insideK8sContainer -> renderPodYaml and emit the arm nodeSelector + toleration when arch=arm64; lsstswBuild derives it from the config label. Also point the matrix at docker-scipipe:pr-14-tickets-DM-54833-2 (which bakes in a uid-1000 jenkins user) to validate the getpwuid/home fix on dev. Revert this image tag to :9-latest once docker-scipipe#14 merges. Generated with AI Co-Authored-By: SLAC AI
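The scheduling fields described above can be sketched as follows. This is a Python illustration, not the Groovy in renderPodYaml: `arch_scheduling` is a hypothetical helper, and the `Equal` operator is an assumption, while the selector and taint values are taken from the commit message.

```python
def arch_scheduling(arch=None):
    """Return the pod-spec fields needed to land on nodes of the given arch."""
    if arch != "arm64":
        # Default behaviour: no selector, no toleration, schedule as before.
        return {}
    return {
        "nodeSelector": {"kubernetes.io/arch": "arm64"},
        "tolerations": [
            {
                "key": "kubernetes.io/arch",
                "operator": "Equal",
                "value": "arm64",
                "effect": "NoSchedule",
            }
        ],
    }


pod_spec = {"containers": []}
pod_spec.update(arch_scheduling("arm64"))
```

Both fields are needed together: the toleration only permits scheduling on the tainted arm nodes, while the nodeSelector is what requires it.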
Name the inner pod <job>-<build>-<arch> so the two stack-os-matrix instances are distinguishable at a glance (e.g. ...-arm64 vs ...-amd64) instead of both showing opaque random suffixes. Generated with AI Co-Authored-By: SLAC AI
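One way to build such a name is sketched below. `pod_name` is a hypothetical helper; the lower-casing, character cleanup, and 63-character limit are assumptions chosen to keep the result a valid DNS label, and the actual change may handle this differently.

```python
import re


def _clean(part):
    """Lower-case and collapse anything outside [a-z0-9] into single dashes."""
    return re.sub(r"[^a-z0-9]+", "-", str(part).lower()).strip("-")


def pod_name(job, build, arch, limit=63):
    """Compose job-build-arch, trimming the job so the arch suffix survives."""
    suffix = f"-{_clean(build)}-{_clean(arch)}"
    head = _clean(job)[: limit - len(suffix)].rstrip("-")
    return head + suffix


print(pod_name("stack-os-matrix", 42, "arm64"))
```

Trimming the job portion rather than the tail keeps the architecture visible even for long job names, which is the point of the rename.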
…inspect Replace daemon-dependent docker.image().pull() + docker inspect in ap_verify and verify_drp_metrics with util.imageLabels(), which queries the registry via docker buildx imagetools inspect (no daemon, no pull). Add a pure parseImageLabels parser robust to skopeo/crane/imagetools JSON shapes. Generated with AI Co-Authored-By: SLAC AI
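A sketch of a shape-tolerant label parser, in Python rather than the pipeline's Groovy. `parse_image_labels` here is illustrative only; the three JSON layouts it handles (labels at the top level, labels under `config`, and a map keyed by platform) are assumptions about what the inspection tools may emit, not confirmed output formats.

```python
import json


def _candidates(doc):
    """Yield dicts that might carry labels, outermost first."""
    if not isinstance(doc, dict):
        return
    yield doc
    config = doc.get("config")
    if isinstance(config, dict):
        yield config
    for value in doc.values():
        # Descend into nested objects, e.g. one entry per platform.
        if isinstance(value, dict) and value is not config:
            yield from _candidates(value)


def parse_image_labels(text):
    """Return the first non-empty label map found in inspect JSON, else {}."""
    for candidate in _candidates(json.loads(text)):
        labels = candidate.get("Labels") or candidate.get("labels")
        if isinstance(labels, dict):
            return dict(labels)
    return {}
```

Returning an empty dict when no labels are found keeps callers free of tool-specific branches; whether the real parser does the same is not stated here.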
No description provided.