Improve Tower telemetry error handling on transient gateway failures #7190
Open
pditommaso wants to merge 4 commits into
When the Seqera Platform progress/telemetry endpoint returns a non-JSON error body (e.g. an HTML `502 Bad Gateway` page from a gateway/proxy), TowerClient previously surfaced the whole HTML payload as the failure cause. The resulting AbortRunException message was an unreadable wall of markup, obscuring the real reason. Reduce HTML error bodies to a concise reason by extracting the `<title>` (which gateways/proxies use to carry the status reason, e.g. "502 Bad Gateway"); the match tolerates attributes and collapses internal whitespace. Non-HTML bodies fall back to the plain text as-is, and JSON error objects are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
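The title extraction described in this commit could look roughly like the sketch below. It is an illustration under assumptions, not the code in `TowerClient`: the class name `HtmlErrorReason`, the method name `reduce`, and the exact regular expression are all hypothetical, and the JSON branch is left out entirely.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nextflow: reduces an HTML error page to its <title> text.
final class HtmlErrorReason {

    // Matches <title ...attributes>text</title>, ignoring case; DOTALL lets the text span lines.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String reduce(String body) {
        if (body == null) {
            return null;
        }
        Matcher m = TITLE.matcher(body);
        if (m.find()) {
            // Collapse newlines and runs of spaces inside the title into single spaces.
            String title = m.group(1).replaceAll("\\s+", " ").trim();
            if (!title.isEmpty()) {
                return title;
            }
        }
        // No usable title: fall back to the body as plain text.
        return body.trim();
    }

    public static void main(String[] args) {
        String html = "<html>\n<head><title lang=\"en\">502\n   Bad Gateway</title></head>\n"
                + "<body><h1>502 Bad Gateway</h1></body></html>";
        System.out.println(reduce(html));
        System.out.println(reduce("Service unavailable"));
    }
}
```

A regular expression is enough here because only one well-known element is needed; a full HTML parser would be heavier than the problem warrants for a best-effort error message.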
A transient gateway error (e.g. HTTP 502) on the Tower telemetry endpoint exhausted the retry policy in ~5 seconds (5 attempts) and aborted the run, even though such errors are usually short-lived. Raise the default maxAttempts for the Tower retry policy from 5 to 10. Combined with the existing delay (350ms), multiplier (2.0) and maxDelay (90s), the 9 exponential backoff gaps span a retry window of about 3 minutes, allowing transient gateway errors to be ridden out before failing the run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
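The retry-window figures quoted in this commit can be checked with a few lines of arithmetic. The sketch below assumes the window is simply the sum of the gaps between attempts, each gap being `delay * multiplier^i` capped at `maxDelay`; it ignores jitter and the time spent on the requests themselves, and the class and method names are made up for illustration.

```java
// Hypothetical calculator, not part of Nextflow: sums capped exponential backoff gaps.
final class BackoffWindow {

    // Total wait in seconds across the (maxAttempts - 1) gaps between attempts.
    static double totalSeconds(int maxAttempts, double delay, double multiplier, double maxDelay) {
        double total = 0;
        double gap = delay;
        for (int i = 0; i < maxAttempts - 1; i++) {
            total += Math.min(gap, maxDelay); // each gap is capped at maxDelay
            gap *= multiplier;
        }
        return total;
    }

    public static void main(String[] args) {
        // Old default: 5 attempts -> 4 gaps: 0.35 + 0.7 + 1.4 + 2.8
        System.out.printf("5 attempts:  %.2f s%n", totalSeconds(5, 0.35, 2.0, 90));
        // New default: 10 attempts -> 9 gaps, the largest being 0.35 * 2^8 = 89.6 s
        System.out.printf("10 attempts: %.2f s%n", totalSeconds(10, 0.35, 2.0, 90));
    }
}
```

With the stated defaults the 90s cap is never reached (the largest gap is 89.6s), so the sum reduces to the closed form 0.35·(2^9−1), about 178.9s. An eleventh attempt would be the first to hit the cap.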
Several collectStatus test cases stubbed checkApiConnection but did not isolate SysEnv nor stub createTowerClient. Since collectStatus calls createTowerClient(endpoint, accessToken).getUserInfo() when an access token is present, these tests picked up a real TOWER_ACCESS_TOKEN from the developer environment and made live network calls -- one of them against https://unreachable.example.com, which (after extending the retry window) hung for ~3 minutes. The whole test class could exceed 6 minutes. Isolate the environment with SysEnv.push([:]) / SysEnv.pop() (the pattern already used elsewhere in this class) in the affected cases, so no test contacts the network. Runtime drops from >6.5 min to ~8s. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
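The push/pop isolation pattern mentioned in this commit can be illustrated with a small stand-in. The `EnvStack` class below is hypothetical and is not Nextflow's `SysEnv`; it only shows the general idea of code reading variables through an overridable holder so that a test can swap in an empty environment and restore the previous one afterwards.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Hypothetical stand-in for an overridable environment holder (not Nextflow's SysEnv).
final class EnvStack {

    private static final Deque<Map<String, String>> STACK = new ArrayDeque<>();

    static {
        // The bottom of the stack is the real process environment.
        STACK.push(System.getenv());
    }

    // Production code reads variables through this method instead of System.getenv().
    static String get(String name) {
        return STACK.peek().get(name);
    }

    // Tests push a replacement environment, e.g. an empty map.
    static void push(Map<String, String> env) {
        STACK.push(Map.copyOf(env));
    }

    // Restore the previous environment; the real one is never popped.
    static void pop() {
        if (STACK.size() > 1) {
            STACK.pop();
        }
    }

    public static void main(String[] args) {
        EnvStack.push(Map.of());
        try {
            // Inside the isolated scope no access token is visible, whatever the developer has exported.
            System.out.println(EnvStack.get("TOWER_ACCESS_TOKEN"));
        } finally {
            EnvStack.pop();
        }
    }
}
```

Wrapping the pop in `finally` (or a test framework's cleanup hook) matters: a failed assertion must not leave the overridden environment in place for later tests.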
bentsherman
reviewed
May 29, 2026
Comment on lines +45 to +51
/**
 * Default max number of attempts. Combined with the default {@code delay} (350ms),
 * {@code multiplier} (2.0) and {@code maxDelay} (90s), the 9 exponential backoff gaps
 * span a retry window of about 3 minutes, so that transient gateway errors (e.g. a
 * `502 Bad Gateway`) can be ridden out before aborting the run.
 */
static final int DEFAULT_MAX_ATTEMPTS = 10
Member
The description says that the 5 attempts took only 5 seconds... this seems like a really short window for 5 attempts. I wonder if it would be better to increase the initial delay instead
Member
Author
That's good for recovering quickly from a temporary glitch. Network errors can be very different in nature.
Document and lock the behaviour that a background session abort (e.g. a Tower telemetry 502) is captured by `workflow.errorReport` and `workflow.errorMessage` via the `session.error` branch of `WorkflowMetadata.setErrorAttributes()` when no task fault is present. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Problem
A pipeline run aborted because the Seqera Platform progress/telemetry endpoint returned a transient HTTP 502 Bad Gateway. From the user's perspective the run "stopped quietly": the failure reason was hard to find and the run gave up after only a few seconds of retries.
Investigation of the run log surfaced three distinct issues:
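- `TowerClient` could not parse the gateway's HTML error page as JSON, so it surfaced the entire `<html>…502 Bad Gateway…</html>` markup as the `AbortRunException` message.
- The Tower retry policy was exhausted after 5 attempts in ~5 seconds, too short a window to ride out a transient gateway error.
- `AuthCommandImplTest` was hitting the real network and could run for >6 minutes.
Changes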
1. Surface a concise reason for HTTP gateway errors
`TowerClient.parseCause` now reduces an HTML error body to a concise reason by extracting the `<title>` text (gateways/proxies put the status there, e.g. `502 Bad Gateway`); the match tolerates attributes and collapses internal whitespace. Non-HTML bodies fall back to the plain text as-is, and JSON error objects are unchanged.
Resulting abort message:
2. Extend the retry window to ~3 minutes
Raise the Tower retry policy default `maxAttempts` from 5 to 10. Combined with the existing `delay` (350ms), `multiplier` (2.0) and `maxDelay` (90s), the 9 exponential backoff gaps span a retry window of ~3 minutes (0.35·(2^9−1) ≈ 178.9s), so transient gateway errors are ridden out before the run fails.
3. Make `AuthCommandImplTest` hermetic
Several `collectStatus` test cases stubbed `checkApiConnection` but did not isolate `SysEnv` nor stub `createTowerClient`. Because `collectStatus` calls `createTowerClient(endpoint, accessToken).getUserInfo()` when a token is present, these tests picked up a real `TOWER_ACCESS_TOKEN` from the developer's environment and made live network calls, one of them against `https://unreachable.example.com`, which (after extending the retry window) hung for ~3 minutes. Isolating the environment with `SysEnv.push([:])` / `pop()` drops the class runtime from >6.5 min to ~8s.
Testing
- `TowerClientTest`: 30 tests, incl. new cases for HTML→reason extraction (attributes, multiline title, plain-text fallback) and the `traceProgress` 502 abort path.
- `TowerRetryPolicyTest`: asserts the new default and that the computed backoff window is ~3 min.
- `AuthCommandImplTest`: 58 tests, now ~8s.
🤖 Generated with Claude Code