[TRTLLMINF-43][feat] Extend infrastructure-failure retry to K8s test stages #13530
dpitman-nvda wants to merge 3 commits into NVIDIA:main
Conversation
The branch feat/restart-on-node-crashes added a retry loop for transient infrastructure failures around runLLMTestlistOnSlurm, but the K8s-only test path (runLLMTestlistOnPlatform) did not get equivalent protection. Pod evictions, image-pull backoffs, OOMKilled events, JNLP-channel disconnects, and node-NotReady transitions immediately failed those stages with no retry, even though cacheErrorAndUploadResult is already postTag-aware and ensureStageResultNotUploaded is therefore retry-safe.

Changes (jenkins/L0_Test.groovy only):
- Add K8S_INFRA_FAILURE_PATTERNS, K8S_INFRA_SINGLE_RETRY_PATTERNS, and K8S_INFRA_RETRY_MAX next to the SLURM lists. K8s-specific symptoms covered: ImagePullBackOff, ErrImagePull, OCI runtime exec failed, OOMKilled, node status is not ready, "Cannot contact " (JNLP disconnect), and "Connection failed" (JNLP/HTTP-handshake transient). OOMKilled and "Connection failed" are capped to a single retry to bound the cost of any false-positive match.
- Extend classifyInfraFailure with two optional list parameters (extraInfraPatterns, extraSingleRetryPatterns) so the K8s path can pass its extra symptoms in without disturbing the SLURM defaults. The existing SLURM call site continues to call with no extras and has zero behaviour change.
- Wrap runLLMTestlistOnPlatform's body in a while(true) retry loop modelled on the SLURM loop in runLLMTestlistOnSlurm, with the same [INFRA-RETRY] log prefix, FlowInterruptedException/exit-code-143 rethrow guards, and 60s cooldown. Each attempt composes its tag as effectivePostTag = postTag + attemptTag, so the canonical artifact name is unchanged on attempt 1 and unique on retries (this handles callers that already pass a postTag such as -SubJob-RunTest / -SubJob-TestImage).
- Skip the retry loop entirely when testFilter[(DEBUG_MODE)] is set, preserving the existing 2-hour input prompt for human inspection of a failed pod.

Out of scope (follow-ups): the outer wrapping K8s pods used by launchTestJobsForImagesSanityCheck (failure of the dispatch shell before delegating to runLLMTestlistOnPlatform); build / docker-image / cross-job-dispatch stages; the unified PatternCatalog + FailureClassifier refactor that would dedupe the SLURM and K8s pattern lists.

Signed-off-by: Derek Pitman <dpitman@nvidia.com>
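A minimal Groovy sketch of the classification pieces described above, not the literal jenkins/L0_Test.groovy code: the pattern-list values, the retry cap, the SLURM list names (SLURM_INFRA_FAILURE_PATTERNS / SLURM_INFRA_SINGLE_RETRY_PATTERNS), and the string-based classifier signature are assumptions; only the identifiers named in the description come from this PR.

```groovy
import groovy.transform.Field

// Stand-ins for the pre-existing SLURM lists (their real names and contents are not shown in this PR text).
@Field def SLURM_INFRA_FAILURE_PATTERNS = []
@Field def SLURM_INFRA_SINGLE_RETRY_PATTERNS = []

// New K8s-specific lists and retry cap; the cap value here is illustrative.
@Field def K8S_INFRA_FAILURE_PATTERNS = [
    "ImagePullBackOff", "ErrImagePull", "OCI runtime exec failed",
    "node status is not ready", "Cannot contact ",
]
@Field def K8S_INFRA_SINGLE_RETRY_PATTERNS = ["OOMKilled", "Connection failed"]
@Field def K8S_INFRA_RETRY_MAX = 2

// Two optional list parameters keep the existing SLURM call site untouched:
// classifyInfraFailure(text) behaves exactly as before when called with no extras.
def classifyInfraFailure(String errorText, List extraInfraPatterns = [], List extraSingleRetryPatterns = []) {
    if ((SLURM_INFRA_SINGLE_RETRY_PATTERNS + extraSingleRetryPatterns).any { errorText.contains(it) }) {
        return "single-retry"   // symptoms capped to one retry, e.g. OOMKilled / "Connection failed"
    }
    if ((SLURM_INFRA_FAILURE_PATTERNS + extraInfraPatterns).any { errorText.contains(it) }) {
        return "retryable"
    }
    return null                 // not an infrastructure failure -> no retry
}
```

The K8s path would then call it with the extra lists, e.g. classifyInfraFailure(err.toString(), K8S_INFRA_FAILURE_PATTERNS, K8S_INFRA_SINGLE_RETRY_PATTERNS), while the SLURM call site keeps its single-argument call unchanged.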
📝 Walkthrough
Refactors infrastructure failure classification in the test runner to accept optional custom error patterns. Adds K8s-specific infrastructure failure patterns and a K8s retry limit. Modifies the K8s test runner to implement a retry loop that catches exceptions, classifies them using the extended patterns, and retries with a cooldown, while deduplicating artifacts across attempts via per-attempt postTag suffixes.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes. 🚥 Pre-merge checks: ✅ 5 passed.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 3123-3175: The retry loop currently retries only the test body
(runOnce) inside the same Kubernetes pod, so pod-level failures
(ImagePullBackOff, eviction, OOM, JNLP disconnect) are not recovered; move the
retry to wrap the pod-launching call instead of runOnce: modify the logic so
that the loop (and attempt/effectivePostTag handling) encloses the call to
trtllm_utils.launchKubernetesPod(...) which invokes runLLMTestlistOnPlatform(),
rather than retrying only runOnce; keep classification via
classifyInfraFailure/K8S_INFRA_FAILURE_PATTERNS/K8S_INFRA_SINGLE_RETRY_PATTERNS
and the same backoff/attempt-count logic (including K8S_INFRA_RETRY_MAX) but
ensure each retry performs a fresh pod creation before running the test body.
- Around line 3072-3111: runOnce currently always calls
cacheErrorAndUploadResult which unconditionally emits the synthetic "Stage
Failed" XML and calls junit(), causing transient infra failures to be
permanently recorded; modify the flow so cacheErrorAndUploadResult (or the
finally block that emits "Stage Failed" and calls junit()) accepts a
parameter/flag or consults an isFinalAttempt boolean to skip publishing JUnit on
retryable infra failures, and only emit the synthetic XML and call junit() when
the attempt is final or the exception is non-retryable; locate the runOnce
closure and the cacheErrorAndUploadResult implementation and wire an
attempt/finality check (or an isRetryableInfraFailure helper) so junit() is
suppressed for retryable infra exceptions and only invoked after the last
attempt.
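A minimal Groovy sketch of the gating the second comment asks for, assuming a simplified cacheErrorAndUploadResult shape; uploadResultTarball and syntheticStageFailedXml are illustrative stand-ins, not functions from jenkins/L0_Test.groovy, and the real function takes more parameters than shown here.

```groovy
// Hypothetical shape only: isFinalAttempt decides whether the synthetic result is published.
def cacheErrorAndUploadResult(String stageName, String postTag, Throwable err, boolean isFinalAttempt) {
    // The tarball upload stays unconditional so every attempt keeps its forensics.
    uploadResultTarball(stageName, postTag)

    boolean retryableInfra = classifyInfraFailure(err.toString(),
        K8S_INFRA_FAILURE_PATTERNS, K8S_INFRA_SINGLE_RETRY_PATTERNS) != null

    if (isFinalAttempt || !retryableInfra) {
        // Only a final or non-retryable failure becomes a recorded test result.
        writeFile file: "results-stage${postTag}.xml", text: syntheticStageFailedXml(stageName, err)
        junit testResults: "results-stage${postTag}.xml", allowEmptyResults: true
    } else {
        echo "[INFRA-RETRY] ${stageName}: suppressing synthetic JUnit result on a non-final, retryable attempt"
    }
}
```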
📒 Files selected for processing (1)
jenkins/L0_Test.groovy
…t on retry

Addresses three review comments on the prior K8s retry commit:

1. Pod-level retry. The previous loop sat inside runLLMTestlistOnPlatform, which runs inside an already-launched pod, so transient pod-level failures (ImagePullBackOff, eviction, OOMKilled, JNLP disconnect, node NotReady) were unrecoverable. New helper runKubernetesPodWithInfraRetry wraps trtllm_utils.launchKubernetesPod with the existing classification (classifyInfraFailure / K8S_INFRA_FAILURE_PATTERNS / K8S_INFRA_SINGLE_RETRY_PATTERNS / K8S_INFRA_RETRY_MAX) and 60s backoff, so each retry gets a fresh pod creation. Used at the main parallelJobsFiltered consumer, the pip-install sanity outer pod, and the NGC image-sanity test pod. SLURM/multi-node/doc-build closures accept the new (attemptTag, isFinalAttempt) args even when they don't need them so the consumer's call shape is uniform; SLURM-internal retries (runLLMTestlistOnSlurm) remain unchanged.

2. junit() / synthetic stage-fail XML on intermediate retries. cacheErrorAndUploadResult now takes isFinalAttempt; when false, it captures the exception, runs classifyInfraFailure, and -- for retryable infra failures -- skips both the synthetic results-stage.xml emission and the junit() call. The tar still uploads to Artifactory for forensics. The new helper passes isFinalAttempt = (attempt > K8S_INFRA_RETRY_MAX); single-retry-only patterns terminate at attempt 2, in which case the synthetic XML is suppressed (acceptable -- the Jenkins build still surfaces as failed, and the tar carries the detail).

3. ls -all typo. Fixed in both the K8s path I introduced and the pre-existing SLURM finallyRunner -- the surrounding code uses ls -al, so this was a typo that ls happens to silently tolerate (it parses as -a -l -l, with -l double-applied).

The K8s retry loop and DEBUG_MODE guard previously inside runLLMTestlistOnPlatform are removed; the function returns to a single-attempt body that threads postTag and isFinalAttempt into cacheErrorAndUploadResult. DEBUG_MODE handling moves to the new helper, which falls back to a single launchKubernetesPod call without retry, so the existing 2-hour input prompt for human inspection of a failed pod is preserved verbatim.

Signed-off-by: Derek Pitman <dpitman@nvidia.com>
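A minimal Groovy sketch of the pod-level helper this commit describes; the argument shapes for launchKubernetesPod and the test-body closure contract are assumptions, and only runKubernetesPodWithInfraRetry, trtllm_utils.launchKubernetesPod, DEBUG_MODE, the pattern lists, and the isFinalAttempt computation come from the commit text.

```groovy
// Hypothetical sketch -- argument shapes and the closure contract are assumed.
def runKubernetesPodWithInfraRetry(def pipeline, def podSpec, Closure testBody) {
    if (testFilter[(DEBUG_MODE)]) {
        // Single attempt so the existing 2-hour inspection prompt is preserved verbatim.
        return trtllm_utils.launchKubernetesPod(pipeline, podSpec) { testBody("", true) }
    }
    int attempt = 1
    while (true) {
        def attemptTag = (attempt == 1) ? "" : "-attempt${attempt}"   // attempt 1 keeps the canonical artifact name
        boolean isFinalAttempt = (attempt > K8S_INFRA_RETRY_MAX)
        try {
            // A fresh pod is created per attempt, so ImagePullBackOff, eviction, OOMKilled,
            // node NotReady, and JNLP disconnects are recoverable, not just in-pod test errors.
            return trtllm_utils.launchKubernetesPod(pipeline, podSpec) { testBody(attemptTag, isFinalAttempt) }
        } catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e) {
            throw e   // aborts and timeouts are never retried (the real guard also rethrows exit-code-143 failures)
        } catch (Exception e) {
            def kind = classifyInfraFailure(e.toString(),
                K8S_INFRA_FAILURE_PATTERNS, K8S_INFRA_SINGLE_RETRY_PATTERNS)
            boolean outOfRetries = isFinalAttempt || (kind == "single-retry" && attempt >= 2)
            if (kind == null || outOfRetries) {
                throw e   // not an infrastructure failure, or the retry budget is spent
            }
            echo "[INFRA-RETRY] pod attempt ${attempt} hit a ${kind} infrastructure failure; recreating pod: ${e}"
            sleep(time: 60, unit: "SECONDS")   // cooldown before the next pod launch
            attempt++
        }
    }
}
```

A consumer would then pass a closure along the lines of { attemptTag, isFinalAttempt -> runLLMTestlistOnPlatform(..., postTag + attemptTag, isFinalAttempt) }, matching the commit's note that postTag and isFinalAttempt are threaded into cacheErrorAndUploadResult.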
@coderabbitai Comments have been addressed. Please review and resolve if satisfied.
Both review threads are now resolved. ✅

/bot run

PR_Github #45951 [ run ] triggered by Bot. Commit:

PR_Github #45951 [ run ] completed with state

/bot run

PR_Github #46144 [ run ] triggered by Bot. Commit:

PR_Github #46144 [ run ] completed with state

/bot run

PR_Github #46179 [ run ] triggered by Bot. Commit:

PR_Github #46179 [ run ] completed with state
Test Coverage
N/A, this is a CI change
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.