Improve Tower telemetry error handling on transient gateway failures #7190

Open
pditommaso wants to merge 4 commits into master from fix-tower-502-error-reason

@pditommaso
Member

Problem

A pipeline run aborted because the Seqera Platform progress/telemetry endpoint returned a transient HTTP 502 Bad Gateway. From the user's perspective the run "stopped quietly": the failure reason was hard to find and the run gave up after only a few seconds of retries.

Investigation of the run log surfaced three distinct issues:

  1. The reason was an unreadable HTML dump. TowerClient could not parse the gateway's HTML error page as JSON, so it surfaced the entire <html>…502 Bad Gateway…</html> markup as the AbortRunException message.
  2. The retry window was too short. The retry policy gave up in ~5 seconds (5 attempts) on a transient 502 that is usually short-lived.
  3. (Discovered while testing) AuthCommandImplTest was hitting the real network and could run for >6 minutes.

Changes

1. Surface a concise reason for HTTP gateway errors

TowerClient.parseCause now reduces an HTML error body to a concise reason by extracting the <title> text (gateways/proxies put the status there, e.g. 502 Bad Gateway); the match tolerates attributes and collapses internal whitespace. Non-HTML bodies fall back to the plain text as-is, and JSON error objects are unchanged.

Resulting abort message:

Unexpected HTTP response
- endpoint    : https://cloud.seqera.io/api/trace/<id>/progress?workspaceId=<id>
- status code : 502
- response msg: 502 Bad Gateway
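
The title extraction described above can be sketched roughly as follows. This is a minimal illustration, not the actual `TowerClient.parseCause` code: the class name `HtmlReason`, the method `reduce`, and the exact regular expression are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nextflow: reduces an HTTP error body to a short reason.
class HtmlReason {
    // Matches <title ...>text</title>; tolerates attributes, letter case and line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String reduce(String body) {
        if (body == null) return null;
        Matcher m = TITLE.matcher(body);
        if (m.find()) {
            // Collapse internal whitespace so a multiline title becomes one line.
            String title = m.group(1).trim().replaceAll("\\s+", " ");
            if (!title.isEmpty()) return title;
        }
        // No usable title found: keep the plain-text body as-is.
        return body;
    }

    public static void main(String[] args) {
        String html = "<html><head><title lang=\"en\">\n  502   Bad\n  Gateway </title></head>"
                + "<body>...</body></html>";
        System.out.println(reduce(html));                  // prints "502 Bad Gateway"
        System.out.println(reduce("Service unavailable")); // prints "Service unavailable"
    }
}
```

JSON error objects are assumed to be handled before this step, so the sketch only covers the HTML and plain-text cases.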

2. Extend the retry window to ~3 minutes

Raise the Tower retry policy default maxAttempts from 5 to 10. Combined with the existing delay (350ms), multiplier (2.0) and maxDelay (90s), the 9 exponential backoff gaps span a retry window of ~3 minutes (0.35·(2^9−1) ≈ 178.9s), so transient gateway errors are ridden out before the run fails.
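
The window arithmetic can be checked with a small sketch. It assumes the policy waits only between attempts, that each gap is `delay * multiplier^i` capped at `maxDelay`, and that jitter and request time are ignored; the class `BackoffWindow` is hypothetical and not the Tower retry policy itself.

```java
import java.time.Duration;

// Hypothetical calculator for the total wait time of an exponential backoff policy.
class BackoffWindow {
    // Sums the (maxAttempts - 1) gaps between attempts, each capped at maxDelay.
    static Duration total(int maxAttempts, Duration delay, double multiplier, Duration maxDelay) {
        double totalMs = 0;
        double gapMs = delay.toMillis();
        for (int i = 1; i < maxAttempts; i++) {
            totalMs += Math.min(gapMs, maxDelay.toMillis());
            gapMs *= multiplier;
        }
        return Duration.ofMillis(Math.round(totalMs));
    }

    public static void main(String[] args) {
        Duration delay = Duration.ofMillis(350);
        Duration maxDelay = Duration.ofSeconds(90);
        System.out.println(total(5, delay, 2.0, maxDelay).toMillis());  // prints 5250
        System.out.println(total(10, delay, 2.0, maxDelay).toMillis()); // prints 178850
    }
}
```

With these defaults the largest gap is 0.35 · 2^8 = 89.6s, just under the 90s `maxDelay`, so the cap never applies and the closed form 0.35·(2^9−1) holds exactly.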

3. Make AuthCommandImplTest hermetic

Several collectStatus test cases stubbed checkApiConnection but neither isolated SysEnv nor stubbed createTowerClient. Because collectStatus calls createTowerClient(endpoint, accessToken).getUserInfo() when a token is present, these tests picked up a real TOWER_ACCESS_TOKEN from the developer's environment and made live network calls. One of them targeted https://unreachable.example.com, which (after extending the retry window) hung for ~3 minutes. Isolating the environment with SysEnv.push([:])/pop() drops the class runtime from >6.5 min to ~8s.
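
The push/pop isolation idea can be illustrated with a simplified stand-in. The `EnvStack` class below is hypothetical and is not Nextflow's `SysEnv`; it only shows why pushing an empty map hides a developer's real `TOWER_ACCESS_TOKEN` from the code under test.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Hypothetical stand-in for an environment holder with push/pop semantics.
class EnvStack {
    // The top of the stack is the environment visible to callers of get().
    private static final Deque<Map<String, String>> STACK = new ArrayDeque<>();
    static { STACK.push(System.getenv()); }

    static String get(String name) { return STACK.peek().get(name); }

    static void push(Map<String, String> env) { STACK.push(env); }

    // Never pops the base (real) environment.
    static void pop() { if (STACK.size() > 1) STACK.pop(); }

    public static void main(String[] args) {
        push(Map.of()); // isolate: the code under test sees an empty environment
        try {
            System.out.println(get("TOWER_ACCESS_TOKEN")); // prints "null"
        } finally {
            pop(); // always restore, even if the test body throws
        }
    }
}
```

Restoring in a `finally` block (or a test cleanup hook) keeps one test's override from leaking into the next.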

Testing

  • TowerClientTest — 30 tests, incl. new cases for HTML→reason extraction (attributes, multiline title, plain-text fallback) and the traceProgress 502 abort path.
  • TowerRetryPolicyTest — asserts the new default and that the computed backoff window is ~3 min.
  • AuthCommandImplTest — 58 tests, now ~8s.

🤖 Generated with Claude Code

pditommaso and others added 3 commits May 29, 2026 17:23
When the Seqera Platform progress/telemetry endpoint returns a non-JSON
error body (e.g. an HTML `502 Bad Gateway` page from a gateway/proxy),
TowerClient previously surfaced the whole HTML payload as the failure
cause. The resulting AbortRunException message was an unreadable wall of
markup, obscuring the real reason.

Reduce HTML error bodies to a concise reason by extracting the `<title>`
(which gateways/proxies use to carry the status reason, e.g. "502 Bad
Gateway"); the match tolerates attributes and collapses internal
whitespace. Non-HTML bodies fall back to the plain text as-is, and JSON
error objects are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
A transient gateway error (e.g. HTTP 502) on the Tower telemetry endpoint
exhausted the retry policy in ~5 seconds (5 attempts) and aborted the run,
even though such errors are usually short-lived.

Raise the default maxAttempts for the Tower retry policy from 5 to 10.
Combined with the existing delay (350ms), multiplier (2.0) and maxDelay
(90s), the 9 exponential backoff gaps span a retry window of about 3
minutes, allowing transient gateway errors to be ridden out before
failing the run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Several collectStatus test cases stubbed checkApiConnection but neither
isolated SysEnv nor stubbed createTowerClient. Since collectStatus calls
createTowerClient(endpoint, accessToken).getUserInfo() when an access
token is present, these tests picked up a real TOWER_ACCESS_TOKEN from the
developer environment and made live network calls -- one of them against
https://unreachable.example.com, which (after extending the retry window)
hung for ~3 minutes. The whole test class could exceed 6 minutes.

Isolate the environment with SysEnv.push([:]) / SysEnv.pop() (the pattern
already used elsewhere in this class) in the affected cases, so no test
contacts the network. Runtime drops from >6.5 min to ~8s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@netlify

netlify Bot commented May 29, 2026

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 53af6b0
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/6a19b6e7579467000867fc16

@pditommaso pditommaso requested a review from bentsherman May 29, 2026 15:26
Comment on lines +45 to +51
/**
 * Default max number of attempts. Combined with the default {@code delay} (350ms),
 * {@code multiplier} (2.0) and {@code maxDelay} (90s), the 9 exponential backoff gaps
 * span a retry window of about 3 minutes, so that transient gateway errors (e.g. a
 * `502 Bad Gateway`) can be ridden out before aborting the run.
 */
static final int DEFAULT_MAX_ATTEMPTS = 10
Member

The description says that the 5 attempts took only about 5 seconds, which seems like a really short window for 5 attempts. I wonder if it would be better to increase the initial delay instead.

Member Author

That's good for recovering quickly from a temporary glitch. Network errors can differ widely in nature.

Document and lock the behaviour that a background session abort (e.g. a
Tower telemetry 502) is captured by `workflow.errorReport` and
`workflow.errorMessage` via the `session.error` branch of
`WorkflowMetadata.setErrorAttributes()` when no task fault is present.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>