DM-54833-2: Remove DIND in favor of buildkit #1174

Draft: roceb wants to merge 42 commits into main from tickets/DM-54833-2

roceb commented May 12, 2026

No description provided.

roceb force-pushed the tickets/DM-54833-2 branch from ff1933c to 470452a, May 14, 2026 14:22
roceb added 29 commits May 15, 2026 15:01
…deK8sContainer

Generated with AI

Co-Authored-By: SLAC AI
All callers have been migrated to rootless BuildKit / insideK8sContainer.
Remove the three dind-era helpers that are no longer referenced anywhere.

Generated with AI

Co-Authored-By: SLAC AI
…ildKit cache

Generated with AI

Co-Authored-By: SLAC AI
…ocker Hub with GHCR

Generated with AI

Co-Authored-By: SLAC AI
…uildKit cache

Generated with AI

Co-Authored-By: SLAC AI
…t pod specs

Replaces docker:27.1.1-dind + docker-gc sidecars with moby/buildkit:v0.15.0-rootless
in idf-agent-ldfc, idf-agent-ldfc-arch, and snowflake pod templates. Updates DOCKER_HOST
→ BUILDKIT_HOST env var and docker-graph-storage → buildkit-socket emptyDir volume in
all three specs.

Generated with AI

Co-Authored-By: SLAC AI
…ners

Add two critical settings to all three buildkitd containers:
1. --oci-worker-no-process-sandbox: Required because GKE nodes don't have
   host-level user-namespace support. Without it, builds fail when runc tries
   to create a user namespace for process sandboxing.
2. seccompProfile with type Unconfined: Kubernetes RuntimeDefault seccomp
   profile blocks mount/umount and FUSE syscalls that rootless BuildKit needs
   for its overlay filesystem.

Applied to all three buildkitd containers (jenkins-workers-c4d,
idf-agent-ldfc-arch, and snowflake agents).
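
The two settings above might look like this in a container spec. This is a minimal sketch that builds the spec as a Python dict and prints it as JSON (which Kubernetes also accepts); the container name, the `buildkitd_container` helper, and the UID/GID values are assumptions for illustration, not taken from the actual values.yaml.

```python
import json


def buildkitd_container(image="moby/buildkit:v0.15.0-rootless"):
    """Return a Kubernetes container spec (as a dict) for rootless buildkitd."""
    return {
        "name": "buildkitd",  # hypothetical container name
        "image": image,
        "args": [
            # Item 1: disable the per-process sandbox.
            "--oci-worker-no-process-sandbox",
        ],
        "securityContext": {
            # Item 2: opt out of the RuntimeDefault seccomp profile.
            "seccompProfile": {"type": "Unconfined"},
            "runAsUser": 1000,   # assumed non-root UID
            "runAsGroup": 1000,  # assumed GID
        },
    }


if __name__ == "__main__":
    print(json.dumps(buildkitd_container(), indent=2))
```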

Generated with AI

Co-Authored-By: SLAC AI
Replace docker:27.1.1-dind + docker-gc sidecar containers with
moby/buildkit:v0.15.0-rootless in all three pod specs
(idf-agent-ldfc-dev, idf-agent-ldfc-arch, snowflake-dev).
Update jnlp containers to use BUILDKIT_HOST and buildkit-socket
volume mount. Mirrors the production values.yaml migration.

Generated with AI

Co-Authored-By: SLAC AI
…ication plugin

Generated with AI

Co-Authored-By: SLAC AI
The docker-scipipe image does not add UID 1000 to /etc/passwd (that was
done at runtime by the old wrapDockerImage/useradd pattern). With no
/etc/passwd entry for UID 1000 and no HOME in the container environment,
Python's Path.home() raises RuntimeError. scons may also spawn pytest
without inheriting the full shell environment, so withEnv() alone is
insufficient.

Setting HOME=/home/jenkins in the K8s container env spec ensures it is
present in os.environ from container startup, reachable by any subprocess.
jenkinsWrapper still overrides it to ${cwd}/home via withEnv for the
actual build.

getpwuid(1000) fails (UID 1000 not in /etc/passwd), so git and other
tools fall back to LOGNAME/USER. Without either, they warn and assume
an unknown user. Setting USER=jenkins matches the jenkins UID 1000
identity used by LSST images.
…sstswBuild

podTemplate { node() } creates a new Jenkins executor whose working
directory starts at the workspace root, not at the outer dir(buildDirHash)
subdirectory. Without dir(slug) inside the pod, jenkinsWrapper runs in
the workspace root: artifacts end up at lsstsw/build/... instead of
linux-9-x86/lsstsw/build/..., and jenkinsWrapperPost cannot find them.

SCons builds subprocess environments from its own ENV dict rather than
os.environ, so Jenkins withEnv HOME never reaches pytest subprocesses.
Python's Path.home() raises RuntimeError only when HOME is absent AND
pwd.getpwuid(uid) fails; the old wrapDockerImage called useradd to
provide the getpwuid fallback, which insideK8sContainer never did.

Add a setup-passwd initContainer (same image, UID 1000, no root needed)
that copies /etc/passwd and appends a jenkins:1000 entry if absent, then
mount the result over /etc/passwd in the runner container.  Also add
LOGNAME=jenkins alongside USER/HOME to suppress git getpwuid warnings.
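
The failure condition for Path.home() can be modeled in a few lines. The sketch below is a simplified stand-in for the POSIX lookup order (HOME first, then the passwd database), not the standard-library code itself; `resolve_home`, `empty_passwd`, and `patched_passwd` are hypothetical names, and the passwd lookup is injected so the example runs without touching the real /etc/passwd.

```python
from collections import namedtuple

# Only the fields the example needs; real passwd entries carry more.
PwEntry = namedtuple("PwEntry", ["pw_name", "pw_dir"])


def resolve_home(environ, uid, getpwuid):
    """Return the home directory: HOME if set, else the passwd entry for uid."""
    if "HOME" in environ:
        return environ["HOME"]
    try:
        return getpwuid(uid).pw_dir
    except KeyError:
        # Both sources are missing: this is the case that fails in the pod.
        raise RuntimeError("Could not determine home directory.") from None


def empty_passwd(uid):
    """Model an image whose /etc/passwd has no entry for the UID."""
    raise KeyError(uid)


def patched_passwd(uid):
    """Model /etc/passwd after a jenkins:1000 entry has been added."""
    return {1000: PwEntry("jenkins", "/home/jenkins")}[uid]
```

With HOME stripped from the environment (as in the SCons case), only the passwd entry keeps the lookup from raising.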

Generated with AI

Co-Authored-By: SLAC AI
insideK8sContainer allocates a new pod via podTemplate { node() } which
gets its own workspace separate from the outer nodeWrap agent.  Build
artifacts are therefore in the inner pod's workspace, not the outer
agent's workspace, so the previous outer finally { jenkinsWrapperPost() }
never found them (error: 'linux-9-x86' doesn't exist).

Move jenkinsWrapperPost(slug) into a finally block inside runDocker so
it runs on the same pod that produced the artifacts.  For non-image
builds (e.g. macOS), the build runs directly on the outer agent so the
existing outer finally path is kept.

Generated with AI

Co-Authored-By: SLAC AI
…sContainer

The subPath mount over /etc/passwd caused all pods created by
insideK8sContainer to fail to start, breaking all Jenkins workers.
Revert to the simple pod spec while keeping the HOME/USER/LOGNAME env
vars.  The scarlet_lite getpwuid issue needs a different fix.

Generated with AI

Co-Authored-By: SLAC AI
…insideK8sContainer

Cluster default sets readOnlyRootFilesystem:true; /j does not exist in LSST
builder images, so Jenkins cannot create /j/workspace/... without an explicit
writable volume. All existing working agent pods in values.yaml use the same
pattern: readOnlyRootFilesystem:false + emptyDir at /j.

Generated with AI

Co-Authored-By: SLAC AI
…uppress getpwuid warning

UID 1000 is not in /etc/passwd in LSST base images, so git warns
"getpwuid failed, guessing username from LOGNAME or USER variable"
on every operation. By mounting an emptyDir at /home/jenkins and
using an initContainer to write a .gitconfig there, git finds
user.name/user.email without calling getpwuid at all.

Generated with AI

Co-Authored-By: SLAC AI
The old loadCache created a separate gcloud-cli pod and used a hostPath
mount to share the workspace. This breaks with emptyDir workspaces because
the workspace path only exists inside the agent container's overlay, not
on the node's real filesystem where hostPath looks.

The new approach adds gcloud-cli as an optional sidecar to the builder pod
via insideK8sContainer(cacheImage: ...). Both containers mount the same
j-workspace emptyDir so any files downloaded by the gcloud-cli container
are immediately visible to the runner. loadCache now uses
container('gcloud-cli') instead of spawning a new pod.

Generated with AI

Co-Authored-By: SLAC AI
roceb added 4 commits May 15, 2026 15:01
…paths

The lsstsw cache tarball was built with workspace rooted at /j/workspace/...
so conda bakes those absolute paths into its activation scripts.  Without an
explicit jnlp container, Jenkins Kubernetes plugin defaults workingDir to
/home/jenkins/agent, placing the workspace at /home/jenkins/agent/workspace/...
which causes conda.sh to reference a non-existent /j/workspace/... path.

Adding a jnlp container stub with workingDir:/j causes the plugin to merge
it with its auto-injected jnlp config, rooting the workspace at /j/workspace/...
to match what the cache was built with.

Generated with AI

Co-Authored-By: SLAC AI
saveCache: remove conda install google-cloud-sdk (slow, unreliable in the
LSST builder image); clone ci-scripts then delegate to the gcloud-cli
sidecar container, matching the pattern used by loadCache.

loadCache: patch stale conda prefix after extraction — if the cache tarball
was built in a workspace with a different slug the absolute paths baked into
conda's activation scripts break; detect and replace them in miniconda/etc,
bin, and condabin so conda activates correctly regardless of which slug the
cache was originally created under.

runDocker: pass cacheImage when cachelsstsw is true so the gcloud-cli
sidecar is present for save-cache builds too.

Generated with AI

Co-Authored-By: SLAC AI
…li sidecar

The old implementation created a separate pod with a hostPath mount to share
test data with the builder.  With emptyDir workspaces the hostPath never
resolves (the workspace path lives only in the agent container overlay), so
the pod fails with CreateContainerError.  There was also a pre-existing silent
data-loss bug: dir() context does not cross node() boundaries, so rclone was
downloading into the inner pod's ephemeral workspace rather than into the
outer agent's testdata directory.

Fix: remove the inner pod entirely.  loadLSSTCamTestData now calls
container('gcloud-cli') — the sidecar already added to the builder pod when
CI_LSSTCAM is set — so rclone writes directly into the shared j-workspace
emptyDir, making the test data visible to the runner container without any
inter-pod data transfer.

Generated with AI

Co-Authored-By: SLAC AI
Two production failures on the emptyDir migration:

aarch64 segfault — grep -rl without -I matched binary files (compiled
extensions, the conda executable itself) under miniconda/bin that happened
to contain the stale workspace path.  sed -i then corrupted those binaries,
causing CONDA_EXE to segfault on activation.  Fix: grep -rIl skips binary
files, limiting the path-fixup to text (activation scripts, shebangs).

x86 RuntimeError: Could not determine home directory — static agents had
UID 1000 in /etc/passwd so Python's getpwuid(1000) fallback always worked,
even when something in ci-scripts/lsstsw unset HOME.  The emptyDir pods run
UID 1000 with no passwd entry, so the fallback raises KeyError and Python
raises RuntimeError.  Fix: runner container startup writes a jenkins:1000
entry to /etc/passwd before exec-ing sleep, restoring the getpwuid fallback
for any code (astropy, git, etc.) that needs a home directory independently
of the HOME env var.
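
The "append an entry if absent" step could be sketched as follows. The commit does this in shell at container startup; this Python version only illustrates the idempotency check. The `ensure_passwd_entry` helper, the default shell, and the empty GECOS field are assumptions, and it presumes the passwd file is writable by the calling user.

```python
import os
import tempfile


def ensure_passwd_entry(passwd_path, name, uid, home, shell="/bin/sh"):
    """Append a passwd line for uid unless one exists. Return True if appended."""
    with open(passwd_path, "r", encoding="utf-8") as fh:
        content = fh.read()
    for line in content.splitlines():
        fields = line.split(":")
        # Field 3 of a passwd line is the numeric UID.
        if len(fields) >= 3 and fields[2] == str(uid):
            return False
    # name:password:UID:GID:GECOS:home:shell
    entry = f"{name}:x:{uid}:{uid}::{home}:{shell}\n"
    with open(passwd_path, "a", encoding="utf-8") as fh:
        if content and not content.endswith("\n"):
            fh.write("\n")  # keep the new entry on its own line
        fh.write(entry)
    return True


if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "passwd")
    with open(demo, "w", encoding="utf-8") as fh:
        fh.write("root:x:0:0:root:/root:/bin/bash\n")
    print(ensure_passwd_entry(demo, "jenkins", 1000, "/home/jenkins"))
```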

Generated with AI

Co-Authored-By: SLAC AI
roceb force-pushed the tickets/DM-54833-2 branch from e4951f8 to 57f2a63, May 15, 2026 22:02
roceb added 9 commits May 15, 2026 16:38
printf 'string\n' inside a Groovy triple-double-quoted string has its \n
escape turned into a real newline, splitting the YAML block scalar across
lines.  The tail of the command (' >> /etc/passwd) then lands at column 1,
outside the block's indentation, causing SnakeYAML to fail with
"could not find expected ':'".

echo adds the trailing newline itself, so no escape sequence is needed.
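
Python's non-raw triple-quoted strings process escapes the same way, so the failure can be reproduced as an analogue. The YAML fragment and passwd line below are illustrative assumptions, not the actual pod template; the helper only reports which lines start at column 1.

```python
# "\n" inside a non-raw triple-quoted string becomes a real newline,
# so the printf command is split in two inside the YAML block scalar.
broken = """\
command:
  - |
    printf 'jenkins:x:1000:1000::/home/jenkins:/bin/sh\n' >> /etc/passwd
"""

# echo appends the newline itself, so no escape sequence is needed.
fixed = """\
command:
  - |
    echo 'jenkins:x:1000:1000::/home/jenkins:/bin/sh' >> /etc/passwd
"""


def lines_at_column_one(text):
    """Return the non-empty lines that start without indentation."""
    return [ln for ln in text.splitlines() if ln and not ln.startswith(" ")]


if __name__ == "__main__":
    print(lines_at_column_one(broken))
    print(lines_at_column_one(fixed))
```

In the broken template the redirect tail shows up as an unindented line after the block scalar, which is where a YAML parser would expect a new mapping key.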

Generated with AI

Co-Authored-By: SLAC AI
The previous fixup only patched miniconda/etc, bin, and condabin.  The
stale x86 workspace path was also baked into conda-env helpers such as
miniconda/envs/lsst-scipipe-13.0.0/eups/bin/setups.sh, causing eups to
reference /j/workspace/stack-os-matrix/linux-9-x86/... on an aarch64 pod.

Widen the grep to the entire miniconda directory so any file in any
subdirectory (envs, pkgs, lib, etc.) gets patched.  The -I flag already
ensures binary files are skipped.
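
The widened fixup amounts to walking the whole tree, skipping binaries, and rewriting the prefix in text files. The commit does this with grep -rIl and sed -i; the sketch below is a Python equivalent under one assumption: a file is treated as binary if it contains a NUL byte, which only approximates grep's -I heuristic. `patch_prefix` is a hypothetical name.

```python
import os
import tempfile


def patch_prefix(root, stale, current):
    """Replace stale with current in every text file under root.

    Returns the sorted list of files that were rewritten.
    """
    stale_b, current_b = stale.encode(), current.encode()
    patched = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # leave symlinks alone
            with open(path, "rb") as fh:
                data = fh.read()
            # Skip binaries (NUL byte present) and files without the prefix.
            if b"\0" in data or stale_b not in data:
                continue
            with open(path, "wb") as fh:
                fh.write(data.replace(stale_b, current_b))
            patched.append(path)
    return sorted(patched)


if __name__ == "__main__":
    demo = tempfile.mkdtemp()
    with open(os.path.join(demo, "setups.sh"), "w") as fh:
        fh.write("export EUPS_DIR=/old/prefix/eups\n")
    print(patch_prefix(demo, "/old/prefix", "/new/prefix"))
```

Skipping any file with a NUL byte is what protects compiled extensions and the conda executable from the corruption described in the earlier aarch64 failure.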

Generated with AI

Co-Authored-By: SLAC AI
Move the pod-template YAML construction into a standalone @NonCPS
renderPodYaml(Map) so it can be unit-tested without a live Jenkins. The
generated YAML is unchanged; insideK8sContainer now computes pullPolicy and
delegates rendering.

Generated with AI

Co-Authored-By: SLAC AI
…e-to on push

buildkitCacheArgs gains a pushCache flag so --cache-to (which needs write auth)
is omitted on NO_PUSH builds while --cache-from still accelerates them. Move GCP
Artifact Registry auth out of the !noPush gate in build_stack, and add it to
build_docker_newinstall (which previously had none), so the registry cache is
always authenticated. GHCR image-push login stays gated on push.
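
The pushCache behavior can be sketched as a small argument builder. The real buildkitCacheArgs is Groovy and its signature is not shown here, so the function name, parameters, the example cache reference, and the mode=max option are all assumptions; only --cache-from and --cache-to come from the commit.

```python
def buildkit_cache_args(cache_ref, push_cache):
    """Return buildx cache flags; export the cache only when push_cache is true."""
    # Reading the cache is always attempted so NO_PUSH builds stay fast.
    args = ["--cache-from", f"type=registry,ref={cache_ref}"]
    if push_cache:
        # Writing needs registry write auth; mode=max is an assumed option.
        args += ["--cache-to", f"type=registry,ref={cache_ref},mode=max"]
    return args


if __name__ == "__main__":
    ref = "registry.example/cache/stack:buildcache"  # hypothetical reference
    print(" ".join(buildkit_cache_args(ref, push_cache=False)))
    print(" ".join(buildkit_cache_args(ref, push_cache=True)))
```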

Generated with AI

Co-Authored-By: SLAC AI
Generated with AI

Co-Authored-By: SLAC AI
The insideK8sContainer pod rendered by renderPodYaml carried neither an
arch nodeSelector nor a toleration for the arm taint
(kubernetes.io/arch=arm64:NoSchedule), so both stack-os-matrix instances
scheduled on x86 regardless of the matrix entry. Thread an optional arch
through insideK8sContainer -> renderPodYaml and emit the arm nodeSelector
+ toleration when arch=arm64; lsstswBuild derives it from the config label.
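
The scheduling fields emitted for arm64 might look like the following. This is a sketch of the pod-spec fragment as a Python dict, using the taint quoted above; `scheduling_fields` is a hypothetical helper, and the real renderPodYaml output may be laid out differently.

```python
def scheduling_fields(arch=None):
    """Return the pod-spec fields that pin a pod to arm64 nodes, else nothing."""
    if arch != "arm64":
        return {}  # other architectures keep the default scheduling
    return {
        "nodeSelector": {"kubernetes.io/arch": "arm64"},
        "tolerations": [
            {
                # Matches the taint kubernetes.io/arch=arm64:NoSchedule.
                "key": "kubernetes.io/arch",
                "operator": "Equal",
                "value": "arm64",
                "effect": "NoSchedule",
            }
        ],
    }


if __name__ == "__main__":
    pod_spec = {"containers": []}
    pod_spec.update(scheduling_fields("arm64"))
    print(sorted(pod_spec))
```

Both pieces are needed: the toleration only permits scheduling on the tainted nodes, while the nodeSelector is what actually forces the pod onto them.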

Also point the matrix at docker-scipipe:pr-14-tickets-DM-54833-2 (which
bakes in a uid-1000 jenkins user) to validate the getpwuid/home fix on
dev. Revert this image tag to :9-latest once docker-scipipe#14 merges.

Generated with AI

Co-Authored-By: SLAC AI
Name the inner pod <job>-<build>-<arch> so the two stack-os-matrix
instances are distinguishable at a glance (e.g. ...-arm64 vs ...-amd64)
instead of both showing opaque random suffixes.
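
A naming helper along these lines would produce the <job>-<build>-<arch> form. This is an assumption-laden sketch: `pod_name` is hypothetical, and both the character normalization and the conservative 63-character cap are guesses at what a Kubernetes-safe name needs, not the PR's implementation.

```python
import re


def pod_name(job, build, arch, max_len=63):
    """Build a lowercase, hyphen-separated pod name from job, build, and arch."""
    raw = f"{job}-{build}-{arch}".lower()
    # Replace anything outside [a-z0-9-] (folder slashes, underscores) with "-".
    cleaned = re.sub(r"[^a-z0-9-]+", "-", raw).strip("-")
    if len(cleaned) > max_len:
        # Keep the tail so the build number and arch stay visible.
        cleaned = cleaned[-max_len:].lstrip("-")
    return cleaned


if __name__ == "__main__":
    print(pod_name("stack-os-matrix", 42, "arm64"))
```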

Generated with AI

Co-Authored-By: SLAC AI
…inspect

Replace daemon-dependent docker.image().pull() + docker inspect in ap_verify
and verify_drp_metrics with util.imageLabels(), which queries the registry via
docker buildx imagetools inspect (no daemon, no pull). Add a pure
parseImageLabels parser robust to skopeo/crane/imagetools JSON shapes.
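
A tolerant label parser could be sketched like this. The JSON shapes handled (labels at the top level, under a config object, or inside a per-platform map) are assumptions about what skopeo, crane, and imagetools emit; `parse_image_labels` and its `platform` parameter are hypothetical and do not reproduce the PR's parseImageLabels.

```python
import json


def _labels_of(node):
    """Return a labels dict found directly on node or under its config, else None."""
    if not isinstance(node, dict):
        return None
    for holder in (node, node.get("config"), node.get("Config")):
        if isinstance(holder, dict) and isinstance(holder.get("Labels"), dict):
            return dict(holder["Labels"])
    return None


def parse_image_labels(raw_json, platform=None):
    """Extract image labels from several assumed JSON shapes; {} if none found."""
    doc = json.loads(raw_json)
    found = _labels_of(doc)
    if found is not None:
        return found
    if isinstance(doc, dict):
        # Assumed multi-platform shape: {"linux/amd64": {...}, "linux/arm64": {...}}
        keys = [platform] if platform in doc else sorted(doc)
        for key in keys:
            found = _labels_of(doc[key])
            if found is not None:
                return found
    return {}


if __name__ == "__main__":
    print(parse_image_labels('{"config": {"Labels": {"version": "w_2026_19"}}}'))
```

Returning an empty dict instead of raising keeps callers simple when an image carries no labels at all.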

Generated with AI

Co-Authored-By: SLAC AI