[TRTLLMINF-43][feat] Extend infrastructure-failure retry to K8s test stages #13530
dpitman-nvda wants to merge 3 commits into NVIDIA:main
Conversation
The branch feat/restart-on-node-crashes added a retry loop for transient infrastructure failures around runLLMTestlistOnSlurm, but the K8s-only test path (runLLMTestlistOnPlatform) did not get equivalent protection. Pod evictions, image-pull backoffs, OOMKilled events, JNLP-channel disconnects, and node-NotReady transitions immediately failed those stages with no retry, even though cacheErrorAndUploadResult is already postTag-aware and ensureStageResultNotUploaded is therefore retry-safe.

Changes (jenkins/L0_Test.groovy only):
- Add K8S_INFRA_FAILURE_PATTERNS, K8S_INFRA_SINGLE_RETRY_PATTERNS, and K8S_INFRA_RETRY_MAX next to the SLURM lists. K8s-specific symptoms covered: ImagePullBackOff, ErrImagePull, OCI runtime exec failed, OOMKilled, node status is not ready, "Cannot contact " (JNLP disconnect), and "Connection failed" (JNLP/HTTP-handshake transient). OOMKilled and "Connection failed" are capped to a single retry to bound the cost of any false-positive match.
- Extend classifyInfraFailure with two optional list parameters (extraInfraPatterns, extraSingleRetryPatterns) so the K8s path can pass its extra symptoms in without disturbing the SLURM defaults. The existing SLURM call site continues to call with no extras and has zero behaviour change.
- Wrap runLLMTestlistOnPlatform's body in a while(true) retry loop modelled on the SLURM loop in runLLMTestlistOnSlurm, with the same [INFRA-RETRY] log prefix, FlowInterruptedException/exit-code-143 rethrow guards, and 60s cooldown. Each attempt composes its tag as effectivePostTag = postTag + attemptTag, so the canonical artifact name is unchanged on attempt 1 and unique on retries (this handles callers that already pass a postTag such as -SubJob-RunTest / -SubJob-TestImage).
- Skip the retry loop entirely when testFilter[(DEBUG_MODE)] is set, preserving the existing 2-hour input prompt for human inspection of a failed pod.

Out of scope (follow-ups): the outer wrapping K8s pods used by launchTestJobsForImagesSanityCheck (failure of the dispatch shell before delegating to runLLMTestlistOnPlatform); build / docker-image / cross-job-dispatch stages; the unified PatternCatalog + FailureClassifier refactor that would dedupe the SLURM and K8s pattern lists.

Signed-off-by: Derek Pitman <dpitman@nvidia.com>
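A minimal Groovy sketch of the classification pieces described above, not the literal jenkins/L0_Test.groovy code: the pattern-list values, the retry cap, the SLURM list names (SLURM_INFRA_FAILURE_PATTERNS / SLURM_INFRA_SINGLE_RETRY_PATTERNS), and the string-based classifier signature are assumptions; only the identifiers named in the description come from this PR.

```groovy
import groovy.transform.Field

// Stand-ins for the pre-existing SLURM lists (their real names and contents are not shown in this PR text).
@Field def SLURM_INFRA_FAILURE_PATTERNS = []
@Field def SLURM_INFRA_SINGLE_RETRY_PATTERNS = []

// New K8s-specific lists and retry cap; the cap value here is illustrative.
@Field def K8S_INFRA_FAILURE_PATTERNS = [
    "ImagePullBackOff", "ErrImagePull", "OCI runtime exec failed",
    "node status is not ready", "Cannot contact ",
]
@Field def K8S_INFRA_SINGLE_RETRY_PATTERNS = ["OOMKilled", "Connection failed"]
@Field def K8S_INFRA_RETRY_MAX = 2

// Two optional list parameters keep the existing SLURM call site untouched:
// classifyInfraFailure(text) behaves exactly as before when called with no extras.
def classifyInfraFailure(String errorText, List extraInfraPatterns = [], List extraSingleRetryPatterns = []) {
    if ((SLURM_INFRA_SINGLE_RETRY_PATTERNS + extraSingleRetryPatterns).any { errorText.contains(it) }) {
        return "single-retry"   // symptoms capped to one retry, e.g. OOMKilled / "Connection failed"
    }
    if ((SLURM_INFRA_FAILURE_PATTERNS + extraInfraPatterns).any { errorText.contains(it) }) {
        return "retryable"
    }
    return null                 // not an infrastructure failure -> no retry
}
```

The K8s path would then call it with the extra lists, e.g. classifyInfraFailure(err.toString(), K8S_INFRA_FAILURE_PATTERNS, K8S_INFRA_SINGLE_RETRY_PATTERNS), while the SLURM call site keeps its single-argument call unchanged.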
📝 Walkthrough
Refactors infrastructure failure classification in the test runner to accept optional custom error patterns. Adds K8s-specific infrastructure failure patterns and a K8s retry limit. Modifies the K8s test runner to implement a retry loop that catches exceptions, classifies them using the extended patterns, and retries with a cooldown, while deduplicating artifacts across attempts via per-attempt postTag suffixes.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes. 🚥 Pre-merge checks: ✅ 5 passed.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 3123-3175: The retry loop currently retries only the test body
(runOnce) inside the same Kubernetes pod, so pod-level failures
(ImagePullBackOff, eviction, OOM, JNLP disconnect) are not recovered; move the
retry to wrap the pod-launching call instead of runOnce: modify the logic so
that the loop (and attempt/effectivePostTag handling) encloses the call to
trtllm_utils.launchKubernetesPod(...) which invokes runLLMTestlistOnPlatform(),
rather than retrying only runOnce; keep classification via
classifyInfraFailure/K8S_INFRA_FAILURE_PATTERNS/K8S_INFRA_SINGLE_RETRY_PATTERNS
and the same backoff/attempt-count logic (including K8S_INFRA_RETRY_MAX) but
ensure each retry performs a fresh pod creation before running the test body.
- Around line 3072-3111: runOnce currently always calls
cacheErrorAndUploadResult which unconditionally emits the synthetic "Stage
Failed" XML and calls junit(), causing transient infra failures to be
permanently recorded; modify the flow so cacheErrorAndUploadResult (or the
finally block that emits "Stage Failed" and calls junit()) accepts a
parameter/flag or consults an isFinalAttempt boolean to skip publishing JUnit on
retryable infra failures, and only emit the synthetic XML and call junit() when
the attempt is final or the exception is non-retryable; locate the runOnce
closure and the cacheErrorAndUploadResult implementation and wire an
attempt/finality check (or an isRetryableInfraFailure helper) so junit() is
suppressed for retryable infra exceptions and only invoked after the last
attempt.
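A minimal Groovy sketch of the gating the second comment asks for, assuming a simplified cacheErrorAndUploadResult shape; uploadResultTarball and syntheticStageFailedXml are illustrative stand-ins, not functions from jenkins/L0_Test.groovy, and the real function takes more parameters than shown here.

```groovy
// Hypothetical shape only: isFinalAttempt decides whether the synthetic result is published.
def cacheErrorAndUploadResult(String stageName, String postTag, Throwable err, boolean isFinalAttempt) {
    // The tarball upload stays unconditional so every attempt keeps its forensics.
    uploadResultTarball(stageName, postTag)

    boolean retryableInfra = classifyInfraFailure(err.toString(),
        K8S_INFRA_FAILURE_PATTERNS, K8S_INFRA_SINGLE_RETRY_PATTERNS) != null

    if (isFinalAttempt || !retryableInfra) {
        // Only a final or non-retryable failure becomes a recorded test result.
        writeFile file: "results-stage${postTag}.xml", text: syntheticStageFailedXml(stageName, err)
        junit testResults: "results-stage${postTag}.xml", allowEmptyResults: true
    } else {
        echo "[INFRA-RETRY] ${stageName}: suppressing synthetic JUnit result on a non-final, retryable attempt"
    }
}
```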
📒 Files selected for processing (1)
jenkins/L0_Test.groovy
…t on retry

Addresses three review comments on the prior K8s retry commit:

1. Pod-level retry. The previous loop sat inside runLLMTestlistOnPlatform, which runs inside an already-launched pod, so transient pod-level failures (ImagePullBackOff, eviction, OOMKilled, JNLP disconnect, node NotReady) were unrecoverable. New helper runKubernetesPodWithInfraRetry wraps trtllm_utils.launchKubernetesPod with the existing classification (classifyInfraFailure / K8S_INFRA_FAILURE_PATTERNS / K8S_INFRA_SINGLE_RETRY_PATTERNS / K8S_INFRA_RETRY_MAX) and 60s backoff, so each retry gets a fresh pod creation. Used at the main parallelJobsFiltered consumer, the pip-install sanity outer pod, and the NGC image-sanity test pod. SLURM/multi-node/doc-build closures accept the new (attemptTag, isFinalAttempt) args even when they don't need them so the consumer's call shape is uniform; SLURM-internal retries (runLLMTestlistOnSlurm) remain unchanged.

2. junit() / synthetic stage-fail XML on intermediate retries. cacheErrorAndUploadResult now takes isFinalAttempt; when false, it captures the exception, runs classifyInfraFailure, and -- for retryable infra failures -- skips both the synthetic results-stage.xml emission and the junit() call. The tar still uploads to Artifactory for forensics. The new helper passes isFinalAttempt = (attempt > K8S_INFRA_RETRY_MAX); single-retry-only patterns terminate at attempt 2, in which case the synthetic XML is suppressed (acceptable -- the Jenkins build still surfaces as failed, and the tar carries the detail).

3. ls -all typo. Fixed in both the K8s path I introduced and the pre-existing SLURM finallyRunner -- the surrounding code uses ls -al, so this was a typo that ls happens to silently tolerate (it parses as -a -l -l, with -l double-applied).

The K8s retry loop and DEBUG_MODE guard previously inside runLLMTestlistOnPlatform are removed; the function returns to a single-attempt body that threads postTag and isFinalAttempt into cacheErrorAndUploadResult. DEBUG_MODE handling moves to the new helper, which falls back to a single launchKubernetesPod call without retry, so the existing 2-hour input prompt for human inspection of a failed pod is preserved verbatim.

Signed-off-by: Derek Pitman <dpitman@nvidia.com>
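A minimal Groovy sketch of the pod-level helper this commit describes; the argument shapes for launchKubernetesPod and the test-body closure contract are assumptions, and only runKubernetesPodWithInfraRetry, trtllm_utils.launchKubernetesPod, DEBUG_MODE, the pattern lists, and the isFinalAttempt computation come from the commit text.

```groovy
// Hypothetical sketch -- argument shapes and the closure contract are assumed.
def runKubernetesPodWithInfraRetry(def pipeline, def podSpec, Closure testBody) {
    if (testFilter[(DEBUG_MODE)]) {
        // Single attempt so the existing 2-hour inspection prompt is preserved verbatim.
        return trtllm_utils.launchKubernetesPod(pipeline, podSpec) { testBody("", true) }
    }
    int attempt = 1
    while (true) {
        def attemptTag = (attempt == 1) ? "" : "-attempt${attempt}"   // attempt 1 keeps the canonical artifact name
        boolean isFinalAttempt = (attempt > K8S_INFRA_RETRY_MAX)
        try {
            // A fresh pod is created per attempt, so ImagePullBackOff, eviction, OOMKilled,
            // node NotReady, and JNLP disconnects are recoverable, not just in-pod test errors.
            return trtllm_utils.launchKubernetesPod(pipeline, podSpec) { testBody(attemptTag, isFinalAttempt) }
        } catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e) {
            throw e   // aborts and timeouts are never retried (the real guard also rethrows exit-code-143 failures)
        } catch (Exception e) {
            def kind = classifyInfraFailure(e.toString(),
                K8S_INFRA_FAILURE_PATTERNS, K8S_INFRA_SINGLE_RETRY_PATTERNS)
            boolean outOfRetries = isFinalAttempt || (kind == "single-retry" && attempt >= 2)
            if (kind == null || outOfRetries) {
                throw e   // not an infrastructure failure, or the retry budget is spent
            }
            echo "[INFRA-RETRY] pod attempt ${attempt} hit a ${kind} infrastructure failure; recreating pod: ${e}"
            sleep(time: 60, unit: "SECONDS")   // cooldown before the next pod launch
            attempt++
        }
    }
}
```

A consumer would then pass a closure along the lines of { attemptTag, isFinalAttempt -> runLLMTestlistOnPlatform(..., postTag + attemptTag, isFinalAttempt) }, matching the commit's note that postTag and isFinalAttempt are threaded into cacheErrorAndUploadResult.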
@coderabbitai Comments have been addressed. Please review and resolve if satisfied.
Both review threads are now resolved. ✅

/bot run

PR_Github #45951 [ run ] triggered by Bot. Commit:

PR_Github #45951 [ run ] completed with state

/bot run

PR_Github #46144 [ run ] triggered by Bot. Commit:

PR_Github #46144 [ run ] completed with state

/bot run

PR_Github #46179 [ run ] triggered by Bot. Commit:

PR_Github #46179 [ run ] completed with state
Test Coverage
N/A, this is a CI change
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.