Skip to content

fix: add retry with backoff to get-version calls in version resolution#1

Open
mykaul wants to merge 1 commit intomasterfrom
retry-get-version
Open

fix: add retry with backoff to get-version calls in version resolution#1
mykaul wants to merge 1 commit intomasterfrom
retry-get-version

Conversation

@mykaul
Copy link
Copy Markdown
Owner

@mykaul mykaul commented Mar 11, 2026

Summary

  • Add retry logic (5 attempts, linear 10s backoff) to all get-version invocations in resolve-cassandra-version and resolve-scylla-version Makefile targets
  • Fixes transient CI failures caused by GitHub API rate limiting (HTTP 403) during version resolution

Problem

The get-version binary queries the GitHub Tags API (for Cassandra) and DockerHub API (for Scylla) to resolve version aliases like 4-LATEST. These APIs can return transient errors — notably HTTP 403 rate limit exceeded on unauthenticated GitHub API requests. With set -eo pipefail, any such failure is immediately fatal, causing the entire CI job to fail.

Example failure:

failed to execute query to https://api.github.com/repos/apache/cassandra/tags?per_page=100,
last error: rate limit exceeded: server replied with 403 rate limit exceeded

Solution

A reusable retry() bash function defined via a Make define block, injected into both version resolution targets. On failure, it retries with 10s/20s/30s/40s/50s delays (5 attempts, ~2.5 min max wait). Key properties:

  • stdout/stderr separation: retry progress messages go to stderr, so the version string captured by backtick substitution is never corrupted
  • Fail-fast preserved: after all attempts are exhausted, the last error is printed and the function returns non-zero
  • Cache bypass: the existing version file cache means successful runs never enter the retry path
  • No new dependencies: pure bash, uses the existing .ONESHELL Makefile setup

Testing

Verified with 5 test scenarios:

  1. Command succeeds on first try — output captured correctly
  2. Command always fails — returns non-zero after max attempts
  3. Transient failure (fails 2x, succeeds on 3rd) — the rate-limit scenario
  4. stdout/stderr separation — version string not corrupted
  5. Pipe compatibility — retry cmd | tr -d '"' works (matches actual Makefile usage)

Also verified via make --dry-run that both resolve-cassandra-version and resolve-scylla-version expand correctly.

The get-version binary queries external APIs (GitHub for Cassandra,
DockerHub for Scylla) that can return transient errors such as HTTP 403
rate limit exceeded. Previously any such failure was immediately fatal
due to set -eo pipefail.

Add a reusable retry() shell function (5 attempts, 10s linear backoff,
~2.5 min max wait) that wraps all get-version invocations in both
resolve-cassandra-version and resolve-scylla-version targets. Retry
progress messages go to stderr so they don't corrupt the version string
captured via command substitution.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant