
Retry dependency fetches and fail fast in entrypoint [CI 6/9]#1597

Merged
auphelia merged 1 commit into Xilinx:dev from merkelmarrow:6-docker-build-hardening-pr on Jun 5, 2026

Conversation

@merkelmarrow

Contributor

This is PR 6 of 9 of a series intended to make CI faster and more robust.

This patch hardens FINN CI against several observed failure modes. The idea is to fail early when the necessary dependencies cannot be fetched, and to retry on transient network errors that would otherwise break a long test pipeline. These transient errors become more frequent when CI is sharded and several shards fetch dependencies simultaneously (as will happen later in this PR series).

Changes

  • add set -e fail-fast behavior to both the dependency fetch script and the container entrypoint.
  • add a retry helper with exponential backoff around the network steps.
  • replace git pull with git fetch --tags --force followed by git checkout <pinned-commit>, which is safer on a detached HEAD and necessary under the stricter set -e policy.
  • detect an unusable clone by testing for a resolvable HEAD rather than just the existence of the directory (valuable when a clone is retried).
  • fail loudly when a dependency cannot be checked out at its pinned commit.
  • download board-file archives with wget -qO <file> so that a retried download overwrites a partial file instead of writing "pynq-z1.zip.1", which unzip would ignore.
  • wrap the qonnx pyproject.toml workaround in an EXIT trap that restores the file even if the editable install fails, so a pip failure cannot leave the mounted qonnx checkout half-renamed (necessary with the set -e policy on pip installs).
  • quote shell variables as a drive-by improvement (makes commands robust to spaces, empty values, and globbing).
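
For readers unfamiliar with the pattern, a retry helper with exponential backoff could look roughly like the sketch below. The function name, argument order, and log wording are illustrative assumptions, not necessarily what fetch-repos.sh uses.

```shell
# Hypothetical helper (name and arguments are assumptions for illustration).
# Usage: retry <max_attempts> <initial_delay_seconds> <command> [args...]
retry() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    # A command used as an `until` condition does not trigger set -e,
    # so a failed attempt does not abort the calling script.
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry: '$*' failed after $attempt attempts" >&2
            return 1
        fi
        echo "retry: attempt $attempt failed, sleeping ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))   # double the wait before the next attempt
    done
}
```

A network step would then be wrapped as, for example, `retry 5 2 git fetch --tags --force origin`, where the attempt count and initial delay are arbitrary example values.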

Testing

Full functional Jenkins validation runs pass. No transient network-error build failures have occurred since this was deployed on live CI (20+ full runs).

fetch-repos.sh cloned and checked out dependencies with no retry.
A transient GitHub 5xx error or DNS blip would sometimes abort
a CI build, especially when multiple machines are checking out
the dependencies simultaneously. Add set -eo pipefail, an
exponential-backoff retry around the network steps, and a fetch
plus checkout of the pinned commit that fails loudly on a
commit mismatch. A clone interrupted mid-fetch leaves a tree
with no resolvable HEAD, so re-clone whenever HEAD is missing
and drop any leftover before cloning rather than reusing a
half-clone. run-docker.sh now treats a fetch-repos.sh failure
as fatal rather than building on a partial deps tree.
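
The re-clone and pinned-checkout logic described above might be sketched as follows. The function name and argument layout are assumptions made for illustration; the actual script may organise these steps differently.

```shell
# Hypothetical function (name and arguments are assumptions for illustration).
# Usage: ensure_checkout <repo_url> <pinned_commit> <target_dir>
ensure_checkout() {
    local url="$1" commit="$2" dir="$3"
    # An existing directory is not proof of a usable clone, so require a
    # .git entry and a HEAD that resolves. The .git test also stops git
    # from silently resolving HEAD of an enclosing repository.
    if ! { [ -e "$dir/.git" ] &&
           git -C "$dir" rev-parse --verify --quiet HEAD >/dev/null 2>&1; }; then
        rm -rf "$dir"   # drop any leftover half-clone before cloning
        git clone --quiet "$url" "$dir" || return 1
    fi
    git -C "$dir" fetch --quiet --tags --force origin || return 1
    git -C "$dir" checkout --quiet "$commit" || return 1
    # Fail loudly if the tree did not end up at the pinned commit.
    local want have
    want="$(git -C "$dir" rev-parse --verify "${commit}^{commit}")" || return 1
    have="$(git -C "$dir" rev-parse HEAD)" || return 1
    if [ "$have" != "$want" ]; then
        echo "ERROR: $dir is at $have, expected $want" >&2
        return 1
    fi
}
```

In this sketch the caller decides whether a non-zero return is fatal; under set -e an unguarded call aborts the script.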

finn_entrypoint.sh gains set -e plus a trap that restores the
qonnx pyproject.toml even if the editable install fails, so
a partial deps tree is caught when the container starts
rather than much later on in the tests.
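
As an illustration of the trap-based restore, the sketch below hides a file while a command runs and puts it back on exit regardless of the outcome. The function name and the .bak suffix are assumptions; the entrypoint's actual workaround may rename the file differently.

```shell
# Hypothetical function (name and backup suffix are assumptions).
# Usage: with_file_hidden <file> <command> [args...]
# The body is a subshell, so the EXIT trap is scoped to this call only.
with_file_hidden() (
    file="$1"
    shift
    backup="${file}.bak"
    mv "$file" "$backup" || exit 1
    # Runs when the subshell exits, whether the command succeeded or failed.
    trap 'mv "$backup" "$file"' EXIT
    "$@"
)
```

A call might look like `with_file_hidden "$QONNX_DIR/pyproject.toml" pip install -e "$QONNX_DIR"`, where QONNX_DIR is a placeholder variable for the mounted checkout. The command's exit status is passed through, so set -e in the caller still sees a failed install.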

Signed-off-by: Marco Blackwell <mblackwe@amd.com>
@auphelia

auphelia commented Jun 5, 2026

Collaborator

Thanks @merkelmarrow!

@auphelia auphelia merged commit 7ea2491 into Xilinx:dev Jun 5, 2026
3 checks passed
