
ci(docker-new): split base-cuda layer and restructure CI pipelines #7033

Merged
xmfcx merged 1 commit into main from
feat/split-05-docker-new-restructure
Apr 20, 2026

Conversation

@xmfcx
Contributor

@xmfcx xmfcx commented Apr 17, 2026

  • Parent Issue: Simplify the docker images and workflows #6852
  • Image graph. Add base-cuda-runtime / base-cuda-devel stages in docker-new/base-cuda.Dockerfile and rebase universe-cuda off them (previously stacked on universe-runtime-dependencies). Drop universe-runtime-dependencies; universe now copies directly from universe-devel. New bake group ci-base-cuda.
  • Ansible split. New ansible/playbooks/rmw.yaml and nvidia.yaml; Dockerfiles bind-mount only the role + playbook files they actually invoke (ansible/roles/rmw_implementation, ansible/roles/nvidia_*). Shrinks the bind-mount tree and the cache-invalidation surface.
  • Named BuildKit mounts. Every RUN --mount=type=cache,... now carries an explicit id= (apt-cache-${ROS_DISTRO}, apt-lists-${ROS_DISTRO}, ccache-${ROS_DISTRO}, pip-cache, pipx-cache). Required for buildkit-cache-dance to round-trip mount state between runs.
  • Per-arch CI. docker-build-pipeline-new.yaml takes a platform input; docker-build-and-push-new.yaml fans out to humble-amd64 / humble-arm64 / jazzy-amd64 / jazzy-arm64. Multi-arch manifest stitching extracted to a dedicated docker-manifest-new.yaml run after both arches succeed.
  • Mount-cache persistence. docker-build-new.yaml saves/restores BuildKit mount tarballs via actions/cache + buildkit-cache-dance, with a size guard and lineage pruning (largest-per-lineage kept on main pushes; stale entries discarded). This is the SAVE side that ci(health-check): rewire cache to match docker-new restructure #7032 reads from.
  • Entrypoint. Bake RMW_IMPLEMENTATION=rmw_cyclonedds_cpp into base via ENV so every downstream stage inherits it without relying on a runtime-env override.
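
The named-mount convention described above can be sketched as a Dockerfile fragment. This is illustrative only: the base image, stage name, and package choice are placeholders, and only the id= naming pattern reflects this PR.

```dockerfile
# Sketch of the named cache-mount convention; explicit id= values are what
# let buildkit-cache-dance match each mount when exporting and re-importing
# cache state between CI runs. Base image and packages are placeholders.
FROM ubuntu:22.04 AS base
ARG ROS_DISTRO=humble

RUN --mount=type=cache,id=apt-cache-${ROS_DISTRO},target=/var/cache/apt,sharing=locked \
    --mount=type=cache,id=apt-lists-${ROS_DISTRO},target=/var/lib/apt/lists,sharing=locked \
    apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates
```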

Why

Previously universe-cuda stacked on universe-runtime-dependencies, so a change to either branch forced the other to rebuild, and both branches shared one apt cache key, meaning the huge CUDA / cuDNN / TensorRT payload poisoned non-CUDA tarballs. Splitting CUDA into its own base plus per-arch CI jobs lets each branch cache independently, enables parallel amd64/arm64 builds, and matches the docker-new "Ansible-first" convention (version pins live in role defaults/main.yaml). Persisting BuildKit mount caches across runs turns cold builds into near-hot ones and is what lets health-check (#7032) reuse apt/ccache/pip/pipx state without spending compute.
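
The persistence mechanics can be sketched as a workflow fragment. Everything here is hypothetical (step ids, cache keys, and directory names are made up); only the actions/cache + buildkit-cache-dance pairing reflects the PR, so consult the action's own documentation for exact inputs.

```yaml
# Hypothetical workflow excerpt; keys and paths are placeholders.
- name: Restore BuildKit mount caches
  id: mount-cache
  uses: actions/cache@v4
  with:
    path: buildkit-mounts
    key: buildkit-mounts-ci-universe-${{ matrix.distro }}-${{ matrix.platform }}-${{ github.sha }}
    restore-keys: |
      buildkit-mounts-ci-universe-${{ matrix.distro }}-${{ matrix.platform }}-

# Injects the restored directories into the named RUN --mount caches before
# the build and extracts them again afterwards.
- name: Cache dance
  uses: reproducible-containers/buildkit-cache-dance@v3
  with:
    cache-map: |
      {
        "buildkit-mounts/apt-cache": "/var/cache/apt"
      }
    skip-extraction: ${{ steps.mount-cache.outputs.cache-hit }}
```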


Test plan

  • Dispatched docker-build-and-push-new against this branch: https://github.com/autowarefoundation/autoware/actions/runs/24580553216 — expect all four per-arch pipelines and both manifest jobs to succeed.
  • Inspect the rewired bake graph:
    docker buildx bake -f docker-new/docker-bake.hcl --print default | jq '{groups: .group | keys, targets: .target | keys}'
    Expect groups default, ci-base, ci-core, ci-base-cuda, ci-universe, ci-universe-cuda. Expect new targets base-cuda-runtime, base-cuda-devel. Expect universe-runtime-dependencies absent.
  • Verify the CI matrix on this PR: four per-arch pipeline runs (humble-amd64, humble-arm64, jazzy-amd64, jazzy-arm64) each executing build-base → build-core → build-universe plus build-base-cuda → build-universe-cuda; then humble-manifest / jazzy-manifest stitching the multi-arch tags after both arches succeed.
  • Build the new CUDA base locally end-to-end:
    ROS_DISTRO=jazzy REGISTRY=autoware PLATFORM=amd64 TAG_DATE=$(date +%Y%m%d) TAG_VERSION= TAG_REF= docker buildx bake -f docker-new/docker-bake.hcl ci-base-cuda
    Expect autoware:base-cuda-runtime-jazzy-amd64-<date> and autoware:base-cuda-devel-jazzy-amd64-<date> visible in docker images.
  • After this lands on main, confirm the persisted mount cache: the next ci-universe run's Restore BuildKit cache mounts step reports Cache restored from key: buildkit-mounts-ci-universe-<distro>-<platform>-<sha>, and the prune step logs keep: buildkit-mounts-... for exactly one entry per lineage.
  • On a subsequent health-check run (ci(health-check): rewire cache to match docker-new restructure #7032's pipeline), confirm its read-only mount restore now hits instead of missing, via the Cache restored from key: buildkit-mounts-ci-universe-humble-<platform>-... log line.
  • Review the rewritten docker-new/README.md against the new graph in docker-new/docker-bake.hcl.
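
The bake-graph shape that the inspection step above should report can be sketched as an HCL fragment. Only the group and target names are quoted from this PR; the attribute values are placeholders, not the actual docker-bake.hcl contents.

```hcl
# Hypothetical excerpt of docker-new/docker-bake.hcl; attribute values
# are illustrative placeholders.
group "ci-base-cuda" {
  targets = ["base-cuda-runtime", "base-cuda-devel"]
}

target "base-cuda-runtime" {
  dockerfile = "docker-new/base-cuda.Dockerfile"
  target     = "base-cuda-runtime"
}

target "base-cuda-devel" {
  inherits = ["base-cuda-runtime"]
  target   = "base-cuda-devel"
}
```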

Restructures the docker-new image graph and CI topology:

- Add base-cuda-runtime / base-cuda-devel stages as a dedicated CUDA
  base, and rebase universe-cuda off them instead of universe.
- Drop universe-runtime-dependencies; fold into universe.
- Split rmw and nvidia ansible roles into standalone playbooks so
  Dockerfiles bind-mount only what they need.
- Switch every BuildKit RUN mount to named IDs
  (apt/ccache/pip/pipx, ROS_DISTRO-scoped where relevant).
- Split per-distro CI into per-arch jobs; extract multi-arch manifest
  stitching into docker-manifest-new.yaml.
- Persist BuildKit mount caches across runs via actions/cache +
  buildkit-cache-dance, with a size guard and lineage pruning.

Signed-off-by: Mete Fatih Cırıt <[email protected]>
@xmfcx
Contributor Author

xmfcx commented Apr 17, 2026

Docker layer graph update

Before

graph TD
    base(["base"])
    base --> core-dependencies(["core-dependencies"])
    core-dependencies --> core-devel(["core-devel"])
    core-devel --> universe-dependencies(["universe-dependencies"])
    universe-dependencies --> universe-dependencies-cuda(["universe-dependencies-cuda"])
    universe-dependencies --> universe-devel(["universe-devel"])
    universe-dependencies-cuda --> universe-devel-cuda(["universe-devel-cuda"])
    base --> core(["core"])
    core-devel -- " COPY /opt/autoware " --> core
    core --> universe-runtime-dependencies(["universe-runtime-dependencies"])
    universe-runtime-dependencies --> universe(["universe"])
    universe-runtime-dependencies --> universe-cuda(["universe-cuda"])
    universe-devel -- " COPY /opt/autoware " --> universe
    universe-devel-cuda -- " COPY /opt/autoware " --> universe-cuda
    classDef base fill: #e8e8e8, color: #333
    classDef devel fill: #bbdefb, color: #333
    classDef runtime fill: #c8e6c9, color: #333
    classDef cuda fill: #e1bee7, color: #333
    class base base
    class core-dependencies,core-devel,universe-dependencies,universe-devel devel
    class core,universe-runtime-dependencies,universe runtime
    class universe-dependencies-cuda,universe-devel-cuda,universe-cuda cuda

After

graph TB
    base(["base"]) --> core-dependencies(["core-dependencies"]) & core(["core"]) & base-cuda-runtime(["base-cuda-runtime"])
    core-dependencies --> core-devel(["core-devel"])
    core-devel --> universe-dependencies(["universe-dependencies"])
    universe-dependencies --> universe-devel(["universe-devel"])
    universe-dependencies-cuda(["universe-dependencies-cuda"]) --> universe-devel-cuda(["universe-devel-cuda"])
    core-devel -- " COPY /opt/autoware " --> core
    core --> universe(["universe"])
    universe-devel -- " COPY /opt/autoware " --> universe
    universe-devel-cuda -- " COPY /opt/autoware " --> universe-cuda(["universe-cuda"])
    core-devel -- " COPY /opt/autoware " --> universe-dependencies-cuda
    base-cuda-devel(["base-cuda-devel"]) --> universe-dependencies-cuda
    base-cuda-runtime --> universe-cuda & base-cuda-devel
    classDef base fill: #e8e8e8, color: #333
    classDef devel fill: #bbdefb, color: #333
    classDef runtime fill: #c8e6c9, color: #333
    classDef cuda fill: #e1bee7, color: #333
    class base,base-cuda-runtime,base-cuda-devel base
    class core-dependencies,core-devel,universe-dependencies,universe-devel devel
    class core,universe runtime
    class universe-dependencies-cuda,universe-devel-cuda,universe-cuda cuda

@autowarefoundation autowarefoundation deleted a comment from github-actions Bot Apr 17, 2026
@xmfcx
Contributor Author

xmfcx commented Apr 17, 2026


I rewired the graph so that:

  • The large, rarely-changing CUDA payload sits in the upper layers of the stack, close to base
  • CUDA is first installed without devel packages into base-cuda-runtime
  • CUDA is then installed with devel packages into base-cuda-devel (inheriting from runtime) so the diff stays small
  • Everything else is rewired accordingly

This way:

  • As long as CUDA is not updated in Ansible, those layers never change
  • Smaller incremental downloads
  • Smaller footprint on the device thanks to more efficient layer sharing
  • Easier and faster for CI to process and cache

The vibe-coded visualizer script and these HTML files can be found here.

@xmfcx
Contributor Author

xmfcx commented Apr 17, 2026


It is tested under https://github.com/xmfcx/autoware/actions/runs/24570637486 on my fork's main branch.

The workflows are building, the cache is functional!

All the images are created: https://github.com/xmfcx/autoware/pkgs/container/autoware-new/versions


@xmfcx xmfcx merged commit e0baba8 into main Apr 20, 2026
46 of 49 checks passed
@xmfcx xmfcx deleted the feat/split-05-docker-new-restructure branch April 20, 2026 07:10

Labels

run:health-check Run health-check
