
ci(docker-new): split base-cuda layer and restructure CI pipelines #7033

Merged
xmfcx merged 1 commit into main from
feat/split-05-docker-new-restructure
Apr 20, 2026

Conversation

@xmfcx
Contributor

@xmfcx xmfcx commented Apr 17, 2026

  • Parent Issue: Simplify the docker images and workflows #6852
  • Image graph. Add base-cuda-runtime / base-cuda-devel stages in docker-new/base-cuda.Dockerfile and rebase universe-cuda off them (previously stacked on universe-runtime-dependencies). Drop universe-runtime-dependencies; universe now copies directly from universe-devel. New bake group ci-base-cuda.
  • Ansible split. New ansible/playbooks/rmw.yaml and nvidia.yaml; Dockerfiles bind-mount only the role + playbook files they actually invoke (ansible/roles/rmw_implementation, ansible/roles/nvidia_*). Shrinks the bind-mount tree and the cache-invalidation surface.
  • Named BuildKit mounts. Every RUN --mount=type=cache,... now carries an explicit id= (apt-cache-${ROS_DISTRO}, apt-lists-${ROS_DISTRO}, ccache-${ROS_DISTRO}, pip-cache, pipx-cache). Required for buildkit-cache-dance to round-trip mount state between runs.
  • Per-arch CI. docker-build-pipeline-new.yaml takes a platform input; docker-build-and-push-new.yaml fans out to humble-amd64 / humble-arm64 / jazzy-amd64 / jazzy-arm64. Multi-arch manifest stitching extracted to a dedicated docker-manifest-new.yaml run after both arches succeed.
  • Mount-cache persistence. docker-build-new.yaml saves/restores BuildKit mount tarballs via actions/cache + buildkit-cache-dance, with a size guard and lineage pruning (largest-per-lineage kept on main pushes; stale entries discarded). This is the SAVE side that ci(health-check): rewire cache to match docker-new restructure #7032 reads from.
  • Entrypoint. Bake RMW_IMPLEMENTATION=rmw_cyclonedds_cpp into base via ENV so every downstream stage inherits it without relying on a runtime-env override.
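
The named-mount convention described above can be sketched as a Dockerfile fragment. This is illustrative only: the base image, stage name, and package choice are placeholders, and only the id= naming pattern reflects this PR.

```dockerfile
# Sketch of the named cache-mount convention; explicit id= values are what
# let buildkit-cache-dance match each mount when exporting and re-importing
# cache state between CI runs. Base image and packages are placeholders.
FROM ubuntu:22.04 AS base
ARG ROS_DISTRO=humble

RUN --mount=type=cache,id=apt-cache-${ROS_DISTRO},target=/var/cache/apt,sharing=locked \
    --mount=type=cache,id=apt-lists-${ROS_DISTRO},target=/var/lib/apt/lists,sharing=locked \
    apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates
```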

Why

Previously universe-cuda stacked on universe-runtime-dependencies, so a change to either branch forced the other to rebuild, and both branches shared one apt cache key, meaning the huge CUDA / cuDNN / TensorRT payload poisoned non-CUDA tarballs. Splitting CUDA into its own base plus per-arch CI jobs lets each branch cache independently, enables parallel amd64/arm64 builds, and matches the docker-new "Ansible-first" convention (version pins live in role defaults/main.yaml). Persisting BuildKit mount caches across runs turns cold builds into near-hot ones and is what lets health-check (#7032) reuse apt/ccache/pip/pipx state without spending compute.
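
The persistence mechanics can be sketched as a workflow fragment. Everything here is hypothetical (step ids, cache keys, and directory names are made up); only the actions/cache + buildkit-cache-dance pairing reflects the PR, so consult the action's own documentation for exact inputs.

```yaml
# Hypothetical workflow excerpt; keys and paths are placeholders.
- name: Restore BuildKit mount caches
  id: mount-cache
  uses: actions/cache@v4
  with:
    path: buildkit-mounts
    key: buildkit-mounts-ci-universe-${{ matrix.distro }}-${{ matrix.platform }}-${{ github.sha }}
    restore-keys: |
      buildkit-mounts-ci-universe-${{ matrix.distro }}-${{ matrix.platform }}-

# Injects the restored directories into the named RUN --mount caches before
# the build and extracts them again afterwards.
- name: Cache dance
  uses: reproducible-containers/buildkit-cache-dance@v3
  with:
    cache-map: |
      {
        "buildkit-mounts/apt-cache": "/var/cache/apt"
      }
    skip-extraction: ${{ steps.mount-cache.outputs.cache-hit }}
```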


Test plan

  • Dispatched docker-build-and-push-new against this branch: https://github.com/autowarefoundation/autoware/actions/runs/24580553216 — expect all four per-arch pipelines and both manifest jobs to succeed.
  • Inspect the rewired bake graph:
    docker buildx bake -f docker-new/docker-bake.hcl --print default | jq '{groups: .group | keys, targets: .target | keys}'
    Expect groups default, ci-base, ci-core, ci-base-cuda, ci-universe, ci-universe-cuda. Expect new targets base-cuda-runtime, base-cuda-devel. Expect universe-runtime-dependencies absent.
  • Verify the CI matrix on this PR: four per-arch pipeline runs (humble-amd64, humble-arm64, jazzy-amd64, jazzy-arm64) each executing build-base → build-core → build-universe plus build-base-cuda → build-universe-cuda; then humble-manifest / jazzy-manifest stitching the multi-arch tags after both arches succeed.
  • Build the new CUDA base locally end-to-end:
    ROS_DISTRO=jazzy REGISTRY=autoware PLATFORM=amd64 TAG_DATE=$(date +%Y%m%d) TAG_VERSION= TAG_REF= docker buildx bake -f docker-new/docker-bake.hcl ci-base-cuda
    Expect autoware:base-cuda-runtime-jazzy-amd64-<date> and autoware:base-cuda-devel-jazzy-amd64-<date> visible in docker images.
  • After this lands on main, confirm the persisted mount cache: the next ci-universe run's Restore BuildKit cache mounts step reports Cache restored from key: buildkit-mounts-ci-universe-<distro>-<platform>-<sha>, and the prune step logs keep: buildkit-mounts-... for exactly one entry per lineage.
  • On a subsequent health-check run (ci(health-check): rewire cache to match docker-new restructure #7032's pipeline), confirm its read-only mount restore now hits instead of missing, via the Cache restored from key: buildkit-mounts-ci-universe-humble-<platform>-... log line.
  • Review the rewritten docker-new/README.md against the new graph in docker-new/docker-bake.hcl.
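
The bake-graph shape that the inspection step above should report can be sketched as an HCL fragment. Only the group and target names are quoted from this PR; the attribute values are placeholders, not the actual docker-bake.hcl contents.

```hcl
# Hypothetical excerpt of docker-new/docker-bake.hcl; attribute values
# are illustrative placeholders.
group "ci-base-cuda" {
  targets = ["base-cuda-runtime", "base-cuda-devel"]
}

target "base-cuda-runtime" {
  dockerfile = "docker-new/base-cuda.Dockerfile"
  target     = "base-cuda-runtime"
}

target "base-cuda-devel" {
  inherits = ["base-cuda-runtime"]
  target   = "base-cuda-devel"
}
```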

Restructures the docker-new image graph and CI topology:

- Add base-cuda-runtime / base-cuda-devel stages as a dedicated CUDA
  base, and rebase universe-cuda off them instead of universe.
- Drop universe-runtime-dependencies; fold into universe.
- Split rmw and nvidia ansible roles into standalone playbooks so
  Dockerfiles bind-mount only what they need.
- Switch every BuildKit RUN mount to named IDs
  (apt/ccache/pip/pipx, ROS_DISTRO-scoped where relevant).
- Split per-distro CI into per-arch jobs; extract multi-arch manifest
  stitching into docker-manifest-new.yaml.
- Persist BuildKit mount caches across runs via actions/cache +
  buildkit-cache-dance, with a size guard and lineage pruning.

Signed-off-by: Mete Fatih Cırıt <[email protected]>
@xmfcx
Contributor Author

xmfcx commented Apr 17, 2026

Docker layer graph update

Before

graph TD
    base(["base"])
    base --> core-dependencies(["core-dependencies"])
    core-dependencies --> core-devel(["core-devel"])
    core-devel --> universe-dependencies(["universe-dependencies"])
    universe-dependencies --> universe-dependencies-cuda(["universe-dependencies-cuda"])
    universe-dependencies --> universe-devel(["universe-devel"])
    universe-dependencies-cuda --> universe-devel-cuda(["universe-devel-cuda"])
    base --> core(["core"])
    core-devel -- " COPY /opt/autoware " --> core
    core --> universe-runtime-dependencies(["universe-runtime-dependencies"])
    universe-runtime-dependencies --> universe(["universe"])
    universe-runtime-dependencies --> universe-cuda(["universe-cuda"])
    universe-devel -- " COPY /opt/autoware " --> universe
    universe-devel-cuda -- " COPY /opt/autoware " --> universe-cuda
    classDef base fill: #e8e8e8, color: #333
    classDef devel fill: #bbdefb, color: #333
    classDef runtime fill: #c8e6c9, color: #333
    classDef cuda fill: #e1bee7, color: #333
    class base base
    class core-dependencies,core-devel,universe-dependencies,universe-devel devel
    class core,universe-runtime-dependencies,universe runtime
    class universe-dependencies-cuda,universe-devel-cuda,universe-cuda cuda

After

graph TB
    base(["base"]) --> core-dependencies(["core-dependencies"]) & core(["core"]) & base-cuda-runtime(["base-cuda-runtime"])
    core-dependencies --> core-devel(["core-devel"])
    core-devel --> universe-dependencies(["universe-dependencies"])
    universe-dependencies --> universe-devel(["universe-devel"])
    universe-dependencies-cuda(["universe-dependencies-cuda"]) --> universe-devel-cuda(["universe-devel-cuda"])
    core-devel -- " COPY /opt/autoware " --> core
    core --> universe(["universe"])
    universe-devel -- " COPY /opt/autoware " --> universe
    universe-devel-cuda -- " COPY /opt/autoware " --> universe-cuda(["universe-cuda"])
    core-devel -- " COPY /opt/autoware " --> universe-dependencies-cuda
    base-cuda-devel(["base-cuda-devel"]) --> universe-dependencies-cuda
    base-cuda-runtime --> universe-cuda & base-cuda-devel
    classDef base fill: #e8e8e8, color: #333
    classDef devel fill: #bbdefb, color: #333
    classDef runtime fill: #c8e6c9, color: #333
    classDef cuda fill: #e1bee7, color: #333
    class base,base-cuda-runtime,base-cuda-devel base
    class core-dependencies,core-devel,universe-dependencies,universe-devel devel
    class core,universe runtime
    class universe-dependencies-cuda,universe-devel-cuda,universe-cuda cuda

@autowarefoundation autowarefoundation deleted a comment from github-actions Bot Apr 17, 2026
@xmfcx
Contributor Author

xmfcx commented Apr 17, 2026


I rewired the graph so that:

  • The large, rarely-changing CUDA payload sits in the upper layers of the stack, close to base
  • CUDA is first installed without devel packages into base-cuda-runtime
  • CUDA is then installed with devel packages into base-cuda-devel (inheriting from runtime) so the diff stays small
  • Everything else is rewired accordingly

This way:

  • As long as CUDA is not updated in Ansible, those layers never change
  • Smaller incremental downloads
  • Smaller footprint on the device thanks to more efficient layer sharing
  • Easier and faster for CI to process and cache

The vibe-coded visualizer script and these HTML files can be found here.

@xmfcx
Contributor Author

xmfcx commented Apr 17, 2026


It is tested under https://github.com/xmfcx/autoware/actions/runs/24570637486 on my fork's main branch.

The workflows are building, the cache is functional!

All the images are created: https://github.com/xmfcx/autoware/pkgs/container/autoware-new/versions


@xmfcx xmfcx merged commit e0baba8 into main Apr 20, 2026
46 of 49 checks passed
@xmfcx xmfcx deleted the feat/split-05-docker-new-restructure branch April 20, 2026 07:10

Labels

run:health-check Run health-check
