[BACKPORT 2025.1][build] Improve error reporting and retry for archive downloads (#31364) #31554

Open
hari90 wants to merge 2 commits into yugabyte:2025.1 from hari90:backport-3ce79c6e6-2025.1

hari90 commented May 11, 2026

Summary

When the third-party archive checksum download returned an HTML error page
instead of the expected .sha256 file, download_and_extract_archive.py only
reported the file size, which made the failure hard to diagnose. There was
also no retry, so a single transient failure (e.g. a 5xx from GitHub) would
fail the build.

Example failure:

Checksum file size is too big: 55118 bytes

(failing job)

Changes

  • download_url now passes -f to curl so HTTP error responses no longer
    get written to disk as the requested artifact, and uses --retry /
    --retry-delay to retry transient failures (5xx, connection errors).
  • The "checksum file size is too big" error now includes the first 1024
    bytes of the file so the underlying error (e.g. an HTML page) is visible
    in build logs.
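
As a rough illustration of the first change, the curl command line could be assembled along these lines. This is a minimal sketch, not the code from the PR: the helper name `build_curl_command`, the two constants and their values, and the use of `-L` to follow redirects are all assumptions made for the example. Only `-f`, `--retry` and `--retry-delay` come from the description above.

```python
from typing import List

# Hypothetical values: the PR introduces constants for attempts and delay,
# but their actual names and values are not shown in this conversation.
CURL_RETRY_COUNT = 3
CURL_RETRY_DELAY_SEC = 5


def build_curl_command(url: str, dest_path: str) -> List[str]:
    """Build a curl command that fails on HTTP errors and retries transient ones."""
    return [
        'curl',
        '-f',  # exit non-zero on HTTP errors instead of saving the error page
        '-L',  # follow redirects (assumption, not stated in the PR)
        '--retry', str(CURL_RETRY_COUNT),
        '--retry-delay', str(CURL_RETRY_DELAY_SEC),
        '-o', dest_path,
        url,
    ]
```

With `-f`, an HTML error page is no longer saved under the name of the requested `.sha256` file, so a failed download shows up as a curl error rather than as an oversized checksum file.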

Co-authored-by: Claude noreply@anthropic.com

Original commit: 3ce79c6 / #31364, 9afef01 / #31427


CSI

…e downloads (yugabyte#31364)

## Summary

When the third-party archive checksum download returned an HTML error page
instead of the expected `.sha256` file, `download_and_extract_archive.py` only
reported the file size, which made the failure hard to diagnose. There was
also no retry, so a single transient failure (e.g. a 5xx from GitHub) would
fail the build.

Example failure:

Checksum file size is too big: 55118 bytes

([failing job](https://github.com/yugabyte/yugabyte-db/actions/runs/25144698357/job/73701866705?pr=31359))

## Changes

- `download_url` now passes `-f` to curl so HTTP error responses no longer
  get written to disk as the requested artifact, and uses `--retry` /
  `--retry-delay` to retry transient failures (5xx, connection errors).
- The "checksum file size is too big" error now includes the first 1024
  bytes of the file so the underlying error (e.g. an HTML page) is visible
  in build logs.
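
The improved checksum error might look roughly like the sketch below. It is an illustration under stated assumptions, not the actual change: the function name `check_checksum_file_size`, the size limit of 4096 bytes, and the exception type are hypothetical. Only the 1024-byte preview and the wording "Checksum file size is too big" come from this commit message.

```python
import os

# Hypothetical limit; the real threshold in download_and_extract_archive.py
# is not shown in this conversation.
MAX_CHECKSUM_FILE_SIZE_BYTES = 4096
PREVIEW_BYTES = 1024  # from the commit message


def check_checksum_file_size(path: str) -> None:
    """Raise an error with a content preview if the checksum file is too big."""
    size = os.path.getsize(path)
    if size <= MAX_CHECKSUM_FILE_SIZE_BYTES:
        return
    with open(path, 'rb') as checksum_file:
        # Decode leniently: the file may be HTML or arbitrary binary data.
        preview = checksum_file.read(PREVIEW_BYTES).decode('utf-8', errors='replace')
    raise RuntimeError(
        "Checksum file size is too big: %d bytes. First %d bytes of the file:\n%s" % (
            size, PREVIEW_BYTES, preview))
```

Printing the start of the file means a build log shows, for example, the opening of an HTML error page instead of only a byte count.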

---------

Co-authored-by: Claude <noreply@anthropic.com>

Original commit: 3ce79c6 / yugabyte#31364

hari90 requested review from es1024 and svarnau on May 11, 2026 23:57

gemini-code-assist Bot left a comment

Code Review

This pull request enhances the download process by adding retry logic to the curl command and improving error diagnostics for invalid checksum files. It introduces constants for maximum download attempts and retry delays, and updates the checksum verification to include a content preview when file size limits are exceeded. I have no feedback to provide.

hari90 commented May 11, 2026

Trigger Jenkins

hari90 commented May 11, 2026

Jenkins build has been triggered. Results will be posted once it completes. CSI


JenkinsBot

…t the Python level on any curl failure (yugabyte#31427)

## Summary

Thirdparty archive downloads occasionally fail with curl exit status 22 on
GitHub Actions because curl's `--retry` only retries timeouts and HTTP
408/429/5xx, not transient 403s on GitHub's signed release-asset redirects.
`--retry-all-errors` would cover this but requires curl >= 7.71.0, which is
unavailable on AlmaLinux 8 / RHEL 8 (curl 7.61.1) and similar runners.

Wrap the curl invocation in a Python retry loop instead, so any curl failure
is retried regardless of curl version.
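
A Python-level retry loop of the kind described above could be sketched as follows. The function name `run_with_retries`, the constant names and their values, and the injectable `sleep` parameter are assumptions made for illustration. The actual loop in the commit may differ.

```python
import subprocess
import sys
import time
from typing import Callable, List

# Hypothetical values; the real constants are not shown in this conversation.
MAX_DOWNLOAD_ATTEMPTS = 3
RETRY_DELAY_SEC = 5


def run_with_retries(
        cmd: List[str],
        max_attempts: int = MAX_DOWNLOAD_ATTEMPTS,
        delay_sec: float = RETRY_DELAY_SEC,
        sleep: Callable[[float], None] = time.sleep) -> None:
    """Run cmd, retrying on any non-zero exit status up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            return
        print("Attempt %d/%d failed with exit status %d: %s" % (
            attempt, max_attempts, returncode, ' '.join(cmd)), file=sys.stderr)
        if attempt < max_attempts:
            sleep(delay_sec)  # wait before the next attempt
    raise RuntimeError(
        "Command failed after %d attempts: %s" % (max_attempts, ' '.join(cmd)))
```

Because the loop only looks at the exit status, it retries a curl exit status 22 caused by a transient 403 just like any other failure, and it does not depend on the curl version installed on the runner.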

Fixes yugabyte#31426.

## Test Plan
Jenkins: compile only

Original commit: 9afef01 / yugabyte#31427

hari90 commented May 12, 2026

Trigger Jenkins

hari90 commented May 12, 2026

Jenkins build has been triggered. Results will be posted once it completes. CSI


JenkinsBot

hari90 commented May 12, 2026

Jenkins build for commit a90e08c2: Fail
CSI
Reason: CSI status: FAIL

Errors:

Checking test failure count per build versus limit of 20 (0 on mac).

| Build | Failures | Status |
|---|---|---|
| PR31554-ubuntu22.04-clang19-debug #2 | 0 | Okay |
| PR31554-mac14-clang-release #2 | 0 | Okay |
| PR31554-alma8-clang19-release #2 | 2 | Okay |
| PR31554-alma8-clang19-asan #2 | 27 | FAILURE |
| PR31554-arm-mac14-clang-release #2 | 0 | Okay |
| PR31554-alma8-gcc11-fastdebug #2 | 10 | Okay |
| PR31554-arm-alma8-clang19-release #2 | 4 | Okay |
| PR31554-alma8-clang19-tsan #2 | 11 | Okay |

🔨 DB Build/Test Job Summary

| Build | Total | Passed | Failed | Failed After Retries |
|---|---|---|---|---|
| PR31554-ubuntu22.04-clang19-debug | 2 | 2 | 0 | 0 |
| PR31554-mac14-clang-release | 2 | 2 | 0 | 0 |
| PR31554-alma8-clang19-release | 9865 | 9447 | 2 | 2 |
| PR31554-alma8-clang19-asan | 10483 | 9702 | 27 | 27 |
| PR31554-arm-mac14-clang-release | 16 | 16 | 0 | 0 |
| PR31554-alma8-gcc11-fastdebug | 10603 | 10174 | 10 | 10 |
| PR31554-arm-alma8-clang19-release | 9863 | 9442 | 4 | 4 |
| PR31554-alma8-clang19-tsan | 10404 | 8842 | 11 | 11 |

JenkinsBot
