
Commit eb4334b

CI: Retry build upon failure
In Jan-Feb 2026: NuttX CI hit a [record high usage of GitHub Runners](#17914), exceeding the limit enforced by ASF Infrastructure Team. We analysed the PRs and discovered that most GitHub Runners were wasted on __(1) Failure to Download the Build Dependencies__ for DTC Device Tree, OpenAMP Messaging, MicroADB Debugger, MCUBoot Bootloader, NimBLE Bluetooth, etc __(2) Resubmitting PR Commits__:

- [Video: Analysing the Most Expensive PR](https://youtu.be/swFaxaTCEQg)
- [Video: Second Most Expensive PR](https://youtu.be/uSpQkzBogEw)
- [Video: Third Most Expensive PR](https://youtu.be/J7w1gyjwZ1w)
- [Video: Most Expensive Apps PR](https://youtu.be/182h8cRpfvI)
- [Spreadsheet: Most Expensive PRs](https://docs.google.com/spreadsheets/d/1HY7fIZzd_fs3QPyA0TX7vsYOjL86m1fNOf1Wls93luI/edit?gid=70515654#gid=70515654)

Why would __Download Failures__ waste GitHub Runners? That's because Download Failures will terminate the Entire CI Build (across All CI Jobs), requiring a restart of the CI Build. And the CI Build isn't terminated immediately upon failure: NuttX CI waits for the CI Job to complete (e.g. arm-01), before terminating the CI Build. Which means that CI Builds can get terminated 2.5 hours into the CI Build, wasting 2.5 elapsed hours x [7.4 parallel processes](https://lupyuen.org/articles/ci3#live-metric-for-full-time-runners) of GitHub Runners.

This PR proposes to __Retry the Build for Each CI Target__. NuttX CI shall rebuild each CI Target (e.g. `sim:nsh`), upon failure, up to 3 times (total 4 builds). Each rebuild will be attempted after a Randomised Delay with Exponential Backoff, initially set to 60 seconds, then 120 seconds, 240 seconds. The rebuilds will mitigate the effects of Intermittent Download Failures that occur in GitHub Actions. (And eliminate developer frustration)

If the build fails after 3 retries: Subsequent CI Targets will __not be allowed to rebuild__ upon failure. This is to prevent cascading build failures from overloading GitHub Actions, and consuming too many GitHub Runners.

Note that NuttX CI shall retry the build for __Any Kind of Build Failure__, including Download Failures, Compile Errors and Config Errors. We designed it simplistically due to our current constraints: (1) Lack of CI Expertise (2) NuttX CI is Mission Critical (3) Legacy CI Scripts are Highly Complex. To prevent Compile Errors and Config Errors: We expect NuttX Devs to [Build and Test PRs in Our Own Repos](#18568), before submitting to NuttX.

What about __Resubmitting PR Commits__ and its wastage of GitHub Runners? We also require NuttX Devs to [Build and Test PRs in Our Own Repos](#18568), before resubmitting to NuttX. GitHub Runners will then be charged to the developer's quota, without affecting the GitHub Runners quota for Apache NuttX Project. We plan to [Kill All CI Jobs](https://youtu.be/182h8cRpfvI?si=MmAuwLISZPPMoqDq&t=1479) for PRs that have been switched to Draft Mode. We'll monitor this through the [NuttX Build Monitor](#18659).

Modified Files:

`tools/testbuild.sh`: We introduce a New Wrapper Function `retrytest` that will call the Existing Function `dotest`, to build the CI Target and retry on error.

`Documentation/components/tools/testbuild.rst`: Updated the `testbuild.sh` doc with the Retry Logic.

Signed-off-by: Lup Yuen Lee <luppy@appkaki.com>
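As a rough illustration of the wastage described in the commit message, the two quoted figures (2.5 elapsed hours, 7.4 parallel runners) can be multiplied out. The helper name `wasted_runner_hours` is hypothetical and not part of NuttX CI. Both inputs are passed in tenths only because Bash has no floating-point arithmetic.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate the runner-hours lost when one CI Build
# is terminated late. Inputs are in tenths (25 means 2.5, 74 means 7.4).
wasted_runner_hours() {
  local elapsed_tenths=$1    # Elapsed hours before termination, in tenths
  local parallel_tenths=$2   # Average parallel runners, in tenths
  local hundredths=$(( elapsed_tenths * parallel_tenths ))
  printf '%d.%02d\n' $(( hundredths / 100 )) $(( hundredths % 100 ))
}

# Figures quoted in the commit message: 2.5 hours x 7.4 runners
wasted_runner_hours 25 74
```

With those inputs the script prints `18.50`, which is about 18.5 runner-hours for a single terminated CI Build, before counting the restarted build.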
1 parent 3f16c4a commit eb4334b

2 files changed

Lines changed: 56 additions & 3 deletions


Documentation/components/tools/testbuild.rst

Lines changed: 10 additions & 1 deletion
@@ -23,7 +23,7 @@ option shows the usage:
   -a <appsdir> provides the relative path to the apps/ directory. Default ../apps
   -t <topdir> provides the absolute path to top nuttx/ directory. Default ../nuttx
   -p only print the list of configs without running any builds
-  -A store the build executable artifact in ARTIFACTDIR (defaults to ../buildartifacts
+  -A store the build executable artifact in ARTIFACTDIR (defaults to ../buildartifacts)
   -C Skip tree cleanness check.
   -G Use "git clean -xfdq" instead of "make distclean" to clean the tree.
      This option may speed up the builds. However, note that:
@@ -73,3 +73,12 @@ The prefix ``-`` can be used to skip a configuration::
 or skip a configuration on a specific host(e.g. Darwin)::
 
   -Darwin,sim:rpserver
+
+This script will rebuild each configuration, upon failure, up to 3 times.
+Each rebuild will be attempted after a randomised delay with exponential
+backoff, initially set to 60 seconds. The rebuilds will mitigate the
+effects of intermittent download failures that occur in GitHub Actions.
+
+If the build fails after 3 retries, subsequent configurations will not
+be allowed to rebuild upon failure. This is to prevent cascading build
+failures from overloading GitHub Actions.
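
To make the documented backoff concrete: in the `retrytest` function of this commit, each delay is drawn as `(RANDOM % backoff) + 1`, so the quoted 60, 120 and 240 seconds act as upper bounds rather than fixed waits. The sketch below prints those windows. The function name `backoff_windows` is hypothetical and is not part of `testbuild.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the delay window before each retry, given the
# number of retries and the initial backoff in seconds. The window doubles
# after every retry, mirroring backoff=$(($backoff * 2)) in retrytest.
backoff_windows() {
  local retries=$1 backoff=$2 total=0 i
  for ((i = 1; i <= retries; i++)); do
    echo "Retry $i: wait 1 to $backoff seconds"
    total=$(( total + backoff ))
    backoff=$(( backoff * 2 ))
  done
  echo "Worst-case total wait: $total seconds"
}

# Defaults from this commit: 3 retries, initial backoff of 60 seconds
backoff_windows 3 60
```

Under these defaults, a target that fails all 4 builds adds at most 420 seconds (7 minutes) of waiting on top of the build time itself.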

tools/testbuild.sh

Lines changed: 46 additions & 2 deletions
@@ -24,6 +24,7 @@ nuttx=$WD/../nuttx
 
 progname=$0
 fail=0
+maxbuilds=4 # Retry 3 times on failure
 APPSDIR=$WD/../apps
 if [ -z $ARTIFACTDIR ]; then
   ARTIFACTDIR=$WD/../buildartifacts
@@ -580,6 +581,49 @@ function dotest {
   fi
 }
 
+# Build one entry from the test list file. Retry on failure.
+function retrytest {
+  # Remember the Fail Status and clear it for each build
+  local line=$1
+  local prevfail=$fail
+  local backoff=60 # Initial Exponential Backoff, in seconds
+
+  # Build and retry on failure, with Random Exponential Backoff
+  for ((i = 1; i <= $maxbuilds; i++)); do
+    echo "Build Attempt $i of $maxbuilds"
+    fail=0
+    dotest $line
+
+    # Don't retry if the build succeeded
+    if [ ${fail} -eq 0 ]; then
+      break
+    else
+      # Build Failed: Clean up any corrupted downloads, don't reuse
+      git -C $nuttx clean -fd
+      git -C $APPSDIR clean -fd
+      pushd $nuttx ; git status ; popd
+      pushd $APPSDIR ; git status ; popd
+    fi
+
+    # If this is Final Retry: Don't retry subsequent builds
+    if [ $i -eq $maxbuilds ]; then
+      maxbuilds=1
+      break
+    fi
+
+    # Wait for Random Exponential Backoff, then retry
+    delay=$(( (RANDOM % $backoff) + 1 ))
+    echo "Wait $delay seconds ($backoff backoff)"
+    backoff=$(($backoff * 2))
+    sleep $delay
+  done
+
+  # Return the Previous Fail Status, unless this build failed
+  if [ ${fail} -eq 0 ]; then
+    fail=$prevfail
+  fi
+}
+
 # Perform the build test for each entry in the test list file
 
 for line in $testlist; do
@@ -588,10 +632,10 @@ for line in $testlist; do
     dir=`echo $line | cut -d',' -f1`
     list=`find boards$dir -name defconfig | cut -d'/' -f4,6`
     for i in ${list}; do
-      dotest $i${line/"$dir"/}
+      retrytest $i${line/"$dir"/}
     done
   else
-    dotest $line
+    retrytest $line
   fi
 done
 
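
The retry budget in `retrytest` can be hard to follow from the diff alone, so here is a minimal simulation of it, under stated assumptions. The stub `dotest`, the `attempts` log and the target names (`good:nsh`, `bad:one`, etc) are hypothetical. The `sleep`, `git clean` and `git status` steps are left out so that the sketch runs instantly and touches no repository. Unlike the diff, the sketch declares `i` as `local` and quotes its variables.

```shell
#!/usr/bin/env bash
# Simulation of the retry budget in retrytest (not the real testbuild.sh)
fail=0
maxbuilds=4   # Same default as the commit: 1 build + 3 retries
attempts=""   # Hypothetical log of every build attempt

# Hypothetical stub: targets whose name starts with "bad" always fail
function dotest {
  attempts="$attempts $1"
  case $1 in
    bad*) fail=1 ;;
  esac
}

function retrytest {
  local line=$1
  local prevfail=$fail
  local i
  for ((i = 1; i <= maxbuilds; i++)); do
    fail=0
    dotest "$line"

    # Don't retry if the build succeeded
    if [ "$fail" -eq 0 ]; then
      break
    fi

    # Final attempt failed: later targets get a single build only
    if [ "$i" -eq "$maxbuilds" ]; then
      maxbuilds=1
      break
    fi
  done

  # An earlier failure stays recorded, even if this target passed
  if [ "$fail" -eq 0 ]; then
    fail=$prevfail
  fi
}

for target in good:nsh bad:one bad:two good:last; do
  retrytest "$target"
done
echo "attempts:$attempts"
echo "maxbuilds=$maxbuilds fail=$fail"
```

In this run `bad:one` is built 4 times and exhausts the budget, so `bad:two` is built only once. The script ends with `maxbuilds=1 fail=1`, showing that a later success (`good:last`) does not clear the earlier failure status.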
