
[TRTLLMINF-43][feat] Extend infrastructure-failure retry to K8s test stages#13530

Open
dpitman-nvda wants to merge 3 commits into NVIDIA:main from dpitman-nvda:feat/k8s-test-infra-retry

Conversation

@dpitman-nvda
Collaborator

@dpitman-nvda dpitman-nvda commented Apr 27, 2026

Summary by CodeRabbit

  • Tests
    • Enhanced Kubernetes test infrastructure with improved retry logic for handling transient failures.
    • Expanded infrastructure error pattern detection for more accurate failure classification.
    • Implemented refined retry attempt tracking to prevent test artifact collisions during reruns.

Description

The branch feat/restart-on-node-crashes added a retry loop for transient infrastructure failures around runLLMTestlistOnSlurm, but the K8s-only test path (runLLMTestlistOnPlatform) did not get equivalent protection. Pod evictions, image-pull backoffs, OOMKilled events, JNLP-channel disconnects, and node-NotReady transitions immediately failed those stages with no retry, even though cacheErrorAndUploadResult is already postTag-aware and ensureStageResultNotUploaded is therefore retry-safe.

Out of scope (follow-ups): outer wrapping K8s pods used by launchTestJobsForImagesSanityCheck (failure of the dispatch shell before delegating to runLLMTestlistOnPlatform); build / docker-image / cross-job-dispatch stages; the unified PatternCatalog + FailureClassifier refactor that would dedupe the SLURM and K8s pattern lists.

Test Coverage

N/A; this is a CI change.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

The branch feat/restart-on-node-crashes added a retry loop for transient
infrastructure failures around runLLMTestlistOnSlurm, but the K8s-only
test path (runLLMTestlistOnPlatform) did not get equivalent protection.
Pod evictions, image-pull backoffs, OOMKilled events, JNLP-channel
disconnects, and node-NotReady transitions immediately failed those
stages with no retry, even though cacheErrorAndUploadResult is already
postTag-aware and ensureStageResultNotUploaded is therefore retry-safe.

Changes (jenkins/L0_Test.groovy only):

- Add K8S_INFRA_FAILURE_PATTERNS, K8S_INFRA_SINGLE_RETRY_PATTERNS, and
  K8S_INFRA_RETRY_MAX next to the SLURM lists. K8s-specific symptoms
  covered: ImagePullBackOff, ErrImagePull, OCI runtime exec failed,
  OOMKilled, node status is not ready, "Cannot contact " (JNLP
  disconnect), and "Connection failed" (JNLP/HTTP-handshake transient).
  OOMKilled and "Connection failed" are capped to a single retry to
  bound the cost of any false-positive match.

- Extend classifyInfraFailure with two optional list parameters
  (extraInfraPatterns, extraSingleRetryPatterns) so the K8s path can
  pass its extra symptoms in without disturbing the SLURM defaults.
  The existing SLURM call site continues to call with no extras and
  has zero behaviour change.

- Wrap runLLMTestlistOnPlatform's body in a while(true) retry loop
  modelled on the SLURM loop in runLLMTestlistOnSlurm, with the same
  [INFRA-RETRY] log prefix, FlowInterruptedException/exit-code-143
  rethrow guards, and 60s cooldown. Each attempt composes its tag as
  effectivePostTag = postTag + attemptTag so the canonical artifact
  name is unchanged on attempt 1 and unique on retries (handles
  callers that already pass postTag like -SubJob-RunTest /
  -SubJob-TestImage).

- Skip the retry loop entirely when testFilter[(DEBUG_MODE)] is set,
  preserving the existing 2-hour input prompt for human inspection of
  a failed pod.
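
A minimal sketch of the loop's shape (illustrative only: helper names such as runWithInfraRetry, runTestBody, and isAbort, plus the retry-max value, are assumptions; only classifyInfraFailure, the pattern-list names, the [INFRA-RETRY] prefix, and the 60s cooldown come from the bullets above):

```groovy
// Sketch only -- not the exact diff.
def K8S_INFRA_FAILURE_PATTERNS = [
    "ImagePullBackOff", "ErrImagePull", "OCI runtime exec failed",
    "node status is not ready", "Cannot contact ",  // JNLP disconnect
]
// Capped to a single retry to bound the cost of a false-positive match.
def K8S_INFRA_SINGLE_RETRY_PATTERNS = ["OOMKilled", "Connection failed"]
def K8S_INFRA_RETRY_MAX = 2  // assumed value for illustration

def runWithInfraRetry(String postTag, Closure runTestBody) {
    int attempt = 1
    while (true) {
        // Canonical artifact name on attempt 1, unique suffix on retries.
        String attemptTag = (attempt == 1) ? "" : "-attempt-${attempt}"
        String effectivePostTag = postTag + attemptTag
        try {
            runTestBody(effectivePostTag)
            return
        } catch (Exception e) {
            // Rethrow user aborts (FlowInterruptedException) and SIGTERM (143).
            if (isAbort(e)) { throw e }
            def kind = classifyInfraFailure(e,
                K8S_INFRA_FAILURE_PATTERNS, K8S_INFRA_SINGLE_RETRY_PATTERNS)
            int budget = (kind == "single-retry") ? 1 : K8S_INFRA_RETRY_MAX
            if (kind == null || attempt > budget) { throw e }
            echo "[INFRA-RETRY] attempt ${attempt} hit a transient infra failure; retrying"
            sleep(time: 60, unit: "SECONDS")  // cooldown between attempts
            attempt++
        }
    }
}
```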

Out of scope (follow-ups): outer wrapping K8s pods used by
launchTestJobsForImagesSanityCheck (failure of the dispatch shell
before delegating to runLLMTestlistOnPlatform); build / docker-image
/ cross-job-dispatch stages; the unified PatternCatalog +
FailureClassifier refactor that would dedupe the SLURM and K8s pattern
lists.

Signed-off-by: Derek Pitman <dpitman@nvidia.com>
@coderabbitai
Contributor

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

Walkthrough

Refactors infrastructure failure classification in the test runner to accept optional custom error patterns. Adds K8s-specific infrastructure failure patterns and a K8s retry limit. Modifies the K8s test runner to implement a retry loop that catches exceptions, classifies them using extended patterns, and retries with cooldown, while deduplicating artifacts via -attempt-N suffix appending.

Changes

jenkins/L0_Test.groovy (K8s Test Runner Enhancement): Added K8s infrastructure error patterns and a retry-limit configuration. Extended the classifyInfraFailure function with optional parameters (extraInfraPatterns, extraSingleRetryPatterns) to allow custom pattern injection. Refactored runLLMTestlistOnPlatform to implement a retry loop: catches non-interruption exceptions, classifies failures, selects a retry budget, and retries with cooldown. Artifacts are differentiated via an -attempt-N suffix on postTag to prevent collisions across attempts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: ✅ Passed. The title clearly and specifically describes the main change: extending infrastructure-failure retry handling to K8s test stages.
  • Description check: ✅ Passed. The PR description includes a detailed problem statement, solution overview, and test coverage justification; all critical sections from the template are addressed.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 3123-3175: The retry loop currently retries only the test body
(runOnce) inside the same Kubernetes pod, so pod-level failures
(ImagePullBackOff, eviction, OOM, JNLP disconnect) are not recovered; move the
retry to wrap the pod-launching call instead of runOnce: modify the logic so
that the loop (and attempt/effectivePostTag handling) encloses the call to
trtllm_utils.launchKubernetesPod(...) which invokes runLLMTestlistOnPlatform(),
rather than retrying only runOnce; keep classification via
classifyInfraFailure/K8S_INFRA_FAILURE_PATTERNS/K8S_INFRA_SINGLE_RETRY_PATTERNS
and the same backoff/attempt-count logic (including K8S_INFRA_RETRY_MAX) but
ensure each retry performs a fresh pod creation before running the test body.
- Around line 3072-3111: runOnce currently always calls
cacheErrorAndUploadResult which unconditionally emits the synthetic "Stage
Failed" XML and calls junit(), causing transient infra failures to be
permanently recorded; modify the flow so cacheErrorAndUploadResult (or the
finally block that emits "Stage Failed" and calls junit()) accepts a
parameter/flag or consults an isFinalAttempt boolean to skip publishing JUnit on
retryable infra failures, and only emit the synthetic XML and call junit() when
the attempt is final or the exception is non-retryable; locate the runOnce
closure and the cacheErrorAndUploadResult implementation and wire an
attempt/finality check (or an isRetryableInfraFailure helper) so junit() is
suppressed for retryable infra exceptions and only invoked after the last
attempt.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4f7a31a0-01e4-47ff-8ef5-95be2119ee59

📥 Commits

Reviewing files that changed from the base of the PR and between e7f1062 and 52f9f4f.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy

…t on retry

Addresses three review comments on the prior K8s retry commit:

1. Pod-level retry. The previous loop sat inside runLLMTestlistOnPlatform,
   which runs inside an already-launched pod, so transient pod-level
   failures (ImagePullBackOff, eviction, OOMKilled, JNLP disconnect, node
   NotReady) were unrecoverable. New helper runKubernetesPodWithInfraRetry
   wraps trtllm_utils.launchKubernetesPod with the existing classification
   (classifyInfraFailure / K8S_INFRA_FAILURE_PATTERNS /
   K8S_INFRA_SINGLE_RETRY_PATTERNS / K8S_INFRA_RETRY_MAX) and 60s backoff,
   so each retry gets a fresh pod creation. Used at the main
   parallelJobsFiltered consumer, the pip-install sanity outer pod, and
   the NGC image-sanity test pod. SLURM/multi-node/doc-build closures
   accept the new (attemptTag, isFinalAttempt) args even when they don't
   need them so the consumer's call shape is uniform; SLURM-internal
   retries (runLLMTestlistOnSlurm) remain unchanged.

2. junit() / synthetic stage-fail XML on intermediate retries.
   cacheErrorAndUploadResult now takes isFinalAttempt; when false, it
   captures the exception, runs classifyInfraFailure, and -- for
   retryable infra failures -- skips both the synthetic results-stage.xml
   emission and the junit() call. The tar still uploads to Artifactory
   for forensics. The new helper passes isFinalAttempt = (attempt >
   K8S_INFRA_RETRY_MAX); single-retry-only patterns terminate at attempt
   2 in which case the synthetic XML is suppressed (acceptable -- the
   Jenkins build still surfaces as failed, and the tar carries detail).

3. ls -all typo. Fixed in both the K8s path I introduced and the
   pre-existing SLURM finallyRunner -- the surrounding code uses ls -al,
   so this was a typo that ls happens to silently tolerate (parses as
   -a -l -l, with -l double-applied).
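
The guard in item 2 can be sketched as follows (the parameter list and the helpers uploadResultTar and writeSyntheticStageFailedXml are assumptions; only cacheErrorAndUploadResult, classifyInfraFailure, junit(), and isFinalAttempt come from the commit message):

```groovy
// Sketch of the isFinalAttempt guard; not the exact diff.
def cacheErrorAndUploadResult(String stageName, Exception err, boolean isFinalAttempt) {
    uploadResultTar(stageName)  // the tar always uploads to Artifactory for forensics
    boolean retryableInfra = classifyInfraFailure(err,
        K8S_INFRA_FAILURE_PATTERNS, K8S_INFRA_SINGLE_RETRY_PATTERNS) != null
    if (!isFinalAttempt && retryableInfra) {
        // Intermediate attempt on a retryable infra failure: skip the synthetic
        // "Stage Failed" XML and junit() so the rerun's results are the record.
        return
    }
    writeSyntheticStageFailedXml(stageName)
    junit "results-${stageName}.xml"
}
```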

The K8s retry loop and DEBUG_MODE guard previously inside
runLLMTestlistOnPlatform are removed; the function returns to a
single-attempt body that threads postTag and isFinalAttempt into
cacheErrorAndUploadResult. DEBUG_MODE handling moves to the new helper,
which falls back to a single launchKubernetesPod call without retry so
the existing 2-hour input prompt for human inspection of a failed pod
is preserved verbatim.
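
The pod-level helper from item 1 might look roughly like this (argument shapes, the pipeline/podSpec parameters, and the DEBUG_MODE lookup form are assumptions; launchKubernetesPod, the pattern lists, and the (attemptTag, isFinalAttempt) call shape come from the commit message):

```groovy
// Sketch of runKubernetesPodWithInfraRetry; not the exact diff.
def runKubernetesPodWithInfraRetry(def pipeline, def podSpec, Closure body) {
    if (testFilter[(DEBUG_MODE)]) {
        // Single launch, no retry: preserves the 2-hour input prompt for
        // human inspection of a failed pod.
        return trtllm_utils.launchKubernetesPod(pipeline, podSpec) { body("", true) }
    }
    int attempt = 1
    while (true) {
        String attemptTag = (attempt == 1) ? "" : "-attempt-${attempt}"
        boolean isFinalAttempt = (attempt > K8S_INFRA_RETRY_MAX)
        try {
            // A fresh pod per attempt, so ImagePullBackOff, eviction,
            // OOMKilled, and JNLP disconnects are actually recoverable.
            return trtllm_utils.launchKubernetesPod(pipeline, podSpec) {
                body(attemptTag, isFinalAttempt)
            }
        } catch (Exception e) {
            def kind = classifyInfraFailure(e,
                K8S_INFRA_FAILURE_PATTERNS, K8S_INFRA_SINGLE_RETRY_PATTERNS)
            int budget = (kind == "single-retry") ? 1 : K8S_INFRA_RETRY_MAX
            if (kind == null || attempt > budget) { throw e }
            echo "[INFRA-RETRY] pod attempt ${attempt} failed on infra; relaunching"
            sleep(time: 60, unit: "SECONDS")
            attempt++
        }
    }
}
```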

Signed-off-by: Derek Pitman <dpitman@nvidia.com>
@dpitman-nvda
Collaborator Author

@coderabbitai Comments have been addressed. Please review and resolve if satisfied.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 28, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get all review comments made by coderabbitai on this PR
gh pr view 13530 --repo NVIDIA/TensorRT-LLM --json reviews,comments

# Also get review comments (inline comments)
gh api repos/NVIDIA/TensorRT-LLM/pulls/13530/comments --jq '.[] | select(.user.login == "coderabbitai") | {id: .id, body: .body, path: .path, line: .line, resolved: .resolved}'

Length of output: 12939


🏁 Script executed:

#!/bin/bash
# Get inline PR review comments by coderabbitai
gh api repos/NVIDIA/TensorRT-LLM/pulls/13530/comments \
  --jq '.[] | select(.user.login == "coderabbitai") | {id: .id, body: .body, path: .path, line: .line, original_line: .original_line, resolved: .resolved}'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Get all review threads on this PR to find coderabbitai comments
gh api graphql -f query='
{
  repository(owner: "NVIDIA", name: "TensorRT-LLM") {
    pullRequest(number: 13530) {
      reviewThreads(first: 50) {
        nodes {
          id
          isResolved
          isOutdated
          comments(first: 10) {
            nodes {
              author { login }
              body
              path
              line
              originalLine
              id
            }
          }
        }
      }
    }
  }
}'

Length of output: 4600


🏁 Script executed:

#!/bin/bash
# Get the relevant sections of the current file to check if comments were addressed
gh api repos/NVIDIA/TensorRT-LLM/contents/jenkins/L0_Test.groovy?ref=feat/k8s-test-infra-retry \
  --jq '.content' | base64 -d | grep -n "launchKubernetesPod\|runLLMTestlistOnPlatform\|isFinalAttempt\|isRetryable\|classifyInfraFailure\|K8S_INFRA\|while.*true\|INFRA-RETRY\|cacheErrorAndUploadResult\|junit\|attempt" | head -100

Length of output: 91


🏁 Script executed:

#!/bin/bash
# Get PR details to find current head SHA
gh pr view 13530 --repo NVIDIA/TensorRT-LLM --json headRefName,headRefOid,commits

# Also list recent commits on the PR
gh api repos/NVIDIA/TensorRT-LLM/pulls/13530 --jq '{head_sha: .head.sha, head_ref: .head.ref, head_repo: .head.repo.full_name}'

Length of output: 5657


🏁 Script executed:

#!/bin/bash
# Fetch and inspect the current file at the new HEAD SHA
SHA="0b1a3da6fe589447dcecec6831682fa1d31934b2"
gh api "repos/dpitman-nvda/TensorRT-LLM/contents/jenkins/L0_Test.groovy?ref=$SHA" \
  --jq '.content' | base64 -d > /tmp/L0_Test.groovy

# Check runKubernetesPodWithInfraRetry helper
grep -n "runKubernetesPodWithInfraRetry\|isFinalAttempt\|K8S_INFRA\|launchKubernetesPod\|cacheErrorAndUploadResult\|while.*true\|INFRA-RETRY" /tmp/L0_Test.groovy | head -80

Length of output: 6840


🏁 Script executed:

#!/bin/bash
# Get the thread IDs for the two coderabbitai comments and resolve them
# Thread IDs from the GraphQL query:
# PRRT_kwDOKH5No859-rmw  (JUnit suppression)
# PRRT_kwDOKH5No859-rmy  (Pod-level retry)

gh api graphql -f query='
mutation {
  resolveReviewThread(input: {threadId: "PRRT_kwDOKH5No859-rmw"}) {
    thread { isResolved }
  }
}'

gh api graphql -f query='
mutation {
  resolveReviewThread(input: {threadId: "PRRT_kwDOKH5No859-rmy"}) {
    thread { isResolved }
  }
}'

Length of output: 126


@dpitman-nvda: I've verified both comments were properly addressed in the follow-up commit and have resolved both threads:

  1. Pod-level retry — The new runKubernetesPodWithInfraRetry helper (line 3374) correctly wraps trtllm_utils.launchKubernetesPod in the retry loop, so each attempt spawns a fresh pod. It is wired at all three affected sites (main parallelJobsFiltered, pip-install sanity, and NGC image-sanity).

  2. JUnit suppression on intermediate retries: cacheErrorAndUploadResult now accepts a boolean isFinalAttempt flag. When false and the failure is classified as a retryable infra failure, both the synthetic Stage Failed XML emission and the junit() call are suppressed (lines 1766–1770), keeping intermediate attempt failures out of the permanent Jenkins report.

Both review threads are now resolved. ✅


@dpitman-nvda
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #45951 [ run ] triggered by Bot. Commit: 0b1a3da Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45951 [ run ] completed with state SUCCESS. Commit: 0b1a3da
/LLM/main/L0_MergeRequest_PR pipeline #36106 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dpitman-nvda
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46144 [ run ] triggered by Bot. Commit: 0b1a3da Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46144 [ run ] completed with state FAILURE. Commit: 0b1a3da
/LLM/main/L0_MergeRequest_PR pipeline #36270 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dpitman-nvda
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46179 [ run ] triggered by Bot. Commit: ed9d0e5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46179 [ run ] completed with state ABORTED. Commit: ed9d0e5
/LLM/main/L0_MergeRequest_PR pipeline #36298 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation
