- 
                Notifications
    
You must be signed in to change notification settings  - Fork 1.8k
 
[TRTLLM-7458][infra] Retry test stage #7088
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
          
📝 WalkthroughWalkthroughSwitches the TRTL Jenkins shared-library reference to a user branch, wraps per-stage execution with  Changes
 Sequence Diagram(s)sequenceDiagram
    autonumber
    actor Jenkins as Jenkins Pipeline
    participant Utils as trtllm_utils.llmStageWithRetry
    participant Closure as Stage Closure (tests)
    participant Shell as Shell / PyTest
    participant Catcher as AbortException inspector
    Jenkins->>Utils: llmStageWithRetry(pipeline, key, closure, [sleepTimeInSecs:120])
    Utils->>Closure: invoke closure
    Closure->>Shell: run test command(s)
    Shell-->>Closure: success / throws Exception
    alt success
        Closure-->>Utils: return success
        Utils-->>Jenkins: stage success
    else failure
        Closure-->>Utils: throw
        Utils->>Catcher: catch hudson.AbortException? / inspect message
        alt infra/machine detected
            Catcher-->>Jenkins: propagate infra error (no rerun)
        else normal failure
            Utils->>Utils: apply retry logic (sleep/attempts)
            alt retries exhausted
                Utils-->>Jenkins: propagate final failure (includes e.message)
            end
        end
    end
    Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
 Suggested reviewers
 Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches🧪 Generate unit tests
 Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit: 
 SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type  Other keywords and placeholders
 Status, Documentation and Community
  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
jenkins/L0_Test.groovy (4)
1-1: Pinning shared lib to a user branch — revert before merge.Good for experimentation, but please switch
trtllm-jenkins-shared-libback to a stable ref (e.g.,main) before merging to avoid pipeline drift or accidental dependency on a personal branch.Apply this diff when ready to merge:
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ +@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
1414-1414: Remove redundant mutation of testList.The reassignment isn’t used after
testDBListis computed and can confuse future readers.Apply this diff:
- testList = "${testList}_${splitId}"
1492-1495: Declare local variable for clarity and scope.Use a local binding to avoid leaking into the script binding.
Apply this diff:
- isRerunFailed = rerunFailedTests(stageName, llmSrc, testCmdLine) + def isRerunFailed = rerunFailedTests(stageName, llmSrc, testCmdLine)
1500-1519: Avoid re-running perf/report steps on retry.As written, if the perf check/report fails, the retry wrapper will re-run the entire test suite. That’s expensive and unnecessary. Limit retries to the Pytest execution block and leave perf post-processing outside the retry closure.
Apply this diff to close the retry block before perf and add explicit locals:
} - if (perfMode) { - basePerfFilename = stageName.contains("PyTorch") ? "base_perf_pytorch.csv" : "base_perf.csv" - basePerfPath = "${llmSrc}/tests/integration/defs/perf/${basePerfFilename}" + }) + + if (perfMode) { + def basePerfFilename = stageName.contains("PyTorch") ? "base_perf_pytorch.csv" : "base_perf.csv" + def basePerfPath = "${llmSrc}/tests/integration/defs/perf/${basePerfFilename}" stage("Check perf result") { sh """ python3 ${llmSrc}/tests/integration/defs/perf/sanity_perf_check.py \ ${stageName}/perf_script_test_results.csv \ ${basePerfPath} """ } stage("Create perf report") { sh """ python3 ${llmSrc}/tests/integration/defs/perf/create_perf_comparison_report.py \ --output_path ${stageName}/report.pdf \ --files ${stageName}/perf_script_test_results.csv \ ${basePerfPath} """ } - }) + }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy(3 hunks)
🔇 Additional comments (2)
jenkins/L0_Test.groovy (2)
1400-1411: Retry wrapper around Pytest stage — looks good.Wrapping the Pytest stage with
llmExecStepWithRetryis a sensible containment of flakiness and should improve stability without altering the test commands.
1400-1400: Confirm wrapper signature supports a closure.Ensure
trtllm_utils.llmExecStepWithRetry(pipeline, String, Closure)is available in the pinned shared-lib branch. If not, this will fail at runtime with a signature mismatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)
1-1: Avoid pinning shared lib to a personal branch when targeting mainUsing
trtllm-jenkins-shared-lib@user/yiqingy/retry_stagein a top-level@Libraryannotation will make all downstream jobs depend on a non-main, personal branch. This is risky to merge and can break unrelated pipelines if that branch changes or is deleted.Please revert to
@mainbefore merging this PR (or validate on a temporary job). If you need dynamic versioning in production, prefer thelibrary()step with a parameterized version inside a guarded path rather than the annotation.Apply this diff before merging:
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ +@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)
1400-1512: Stage-level retry wrapper is a good additionWrapping the Pytest stage with
trtllm_utils.llmStageWithRetry(...)makes sense for infra flakiness. The closure body keeps the original logic, and the outercacheErrorAndUploadResult(...)still handles results packaging/publishing, so the flow remains consistent.Two asks:
- Confirm
 llmStageWithRetryexists on the referenced shared-lib branch and that it only retries infra-flaky failures (not genuine test failures).- Ensure the wrapper does not perform its own
 junit/artifact publishing to avoid duplicates withcacheErrorAndUploadResult.
| 
           /bot run --stage-list "A10-PyTorch-1"  | 
    
| 
           PR_Github #15899 [ run ] triggered by Bot  | 
    
| 
           /bot run --stage-list "A10-PyTorch-1"  | 
    
| 
           PR_Github #15901 [ run ] triggered by Bot  | 
    
| 
           PR_Github #15899 [ run ] completed with state   | 
    
| 
           PR_Github #15901 [ run ] completed with state   | 
    
| 
           /bot run --stage-list "A10-PyTorch-1"  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)
1485-1493: Unconditional throw guarantees stage failure; remove or guard behind a flag and preserve root causeRight after pytest completes you unconditionally
throw new Exception("Infrastructure error"). The catch then replaces any failure with the same generic exception. This forces every run to fail and defeats the per-stage retry goal.Fix by:
- Removing the unconditional throw in the try block.
 - Preserving
 InterruptedException.- Only mapping to “Infrastructure error” when explicitly requested (e.g., to exercise retry), otherwise rethrow the original exception.
 Patch:
sh """ rm -rf ${stageName}/ && \ cd ${llmSrc}/tests/integration/defs && \ ${testCmdLine.join(" ")} """ - throw new Exception("Infrastructure error") - } catch (Exception e) { - throw new Exception("Infrastructure error") + } catch (InterruptedException e) { + // Preserve interruption semantics (user abort/timeout) + throw e + } catch (Exception e) { + // Optionally mark infra to exercise retry path + if (env.FORCE_INFRA_RETRY?.toBoolean()) { + throw new Exception("Infrastructure error", e) + } + // Keep the original failure for accurate diagnosis and reporting + throw e }This aligns with the retry-wrapper intent without turning every attempt into a hard failure.
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)
2281-2301: Stage-level retry wrapper adoption: confirm it creates a Jenkins stage and avoids duplicate JUnit on retriesWrapping each job with
trtllm_utils.llmStageWithRetry(pipeline, key, { ... })is the right direction. Two verification points:
Ensure the wrapper itself opens a
stage(key)so Blue Ocean and log grouping remain intact per attempt. If it does not, consider nesting astage(key)inside the wrapper or adjusting the wrapper to do so.On retries,
cacheErrorAndUploadResultcurrently always archives and runsjunit(...). Multiple attempts may re-parse/merge JUnit results and re-upload artifacts. Verify the wrapper either:
- cleans
 ${stageName}before each attempt (so previous results aren’t re-ingested), and/or- exposes an “is final attempt” signal (env var like
 TRTLLM_RETRY_FINAL=true) so we can conditionally runjunit(...)only once.If the wrapper already guarantees these properties, we’re good. Otherwise, I can propose a small change to gate
junit(...)on a final-attempt signal.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
| @@ -1,4 +1,4 @@ | |||
| @Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _ | |||
| @Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Do not merge with shared-lib pinned to a personal branch; pin to main or a commit SHA
Using trtllm-jenkins-shared-lib@user/yiqingy/retry_stage couples mainline CI to a mutable personal branch and can break/rewrite CI behavior without a code change. Please either:
- revert to 
@main, or - pin to a reviewed tag/commit SHA for the shared lib.
 
Suggested change:
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _If you need the retry wrapper during iteration, gate the dev lib behind a clearly named env/param (e.g., USE_DEV_SHARED_LIB=true) and default it off.
🏁 Script executed:
#!/bin/bash
# List all @Library declarations in the repository to verify usage of personal branches
rg -n "@Library\\(" -C2Length of output: 855
Pin shared library to a stable branch or SHA
The trtllm-jenkins-shared-lib@user/yiqingy/retry_stage reference in your CI pipeline ties mainline tests to an unreviewed, mutable personal branch. To prevent unintended breakages or behavior changes:
• In jenkins/L0_Test.groovy (line 1), replace:
- @Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
+ @Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _—or better—pin trtllm-jenkins-shared-lib to a reviewed tag or commit SHA.
• If you still need the retry_stage logic for development, wrap it behind an explicit flag (e.g. USE_DEV_SHARED_LIB=true) that defaults to off, so CI remains stable by default.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ | |
| @Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _ | 
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around line 1, the shared library reference uses a
personal mutable branch (`trtllm-jenkins-shared-lib@user/yiqingy/retry_stage`),
which can introduce unreviewed changes into CI; replace that reference with a
stable reviewed tag or a specific commit SHA (for example
`[email protected]` or `@<commit-sha>`) to pin behavior, or
alternatively implement a simple boolean flag (e.g. USE_DEV_SHARED_LIB=false by
default) and conditionally include the dev branch only when that flag is
explicitly set to true so CI uses the stable pinned library by default.
| 
           PR_Github #16004 [ run ] triggered by Bot  | 
    
| 
           PR_Github #16004 [ run ] completed with state   | 
    
| 
           /bot run --stage-list "A10-PyTorch-1"  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
jenkins/L0_Test.groovy (1)
802-803: Implement a redacted environment dump helper to replace allenv | sortcallsDirectly dumping the full environment can leak secrets that Jenkins’s own masking won’t catch (especially third-party or custom variables). We located 10 occurrences of
sh "env | sort"injenkins/L0_Test.groovyand found no existingsafeEnvDumputility to centralize redaction:
- Lines 555, 802, 834, 1284, 1483, 1579, 2128–2129, 2137–2138, 2374, 2383
 Please refactor as follows:
Add a new helper in
trtllm_utils, e.g.:// File: src/trtllm_utils.groovy (or appropriate shared library) class trtllm_utils { static void safeEnvDump(steps, File outputFile = null) { def cmd = ''' set -e env | sort \ | sed -E 's/(?i)(PASS|PASSWORD|TOKEN|SECRET|KEY|AUTH|COOKIE)=.*/\\1=***REDACTED***/' ''' if (outputFile) { steps.sh "${cmd} > ${outputFile}" } else { steps.sh cmd } } // ... }Replace each
sh "env | sort"(and variants that redirect to files) with:- sh "env | sort${redirect}" + trtllm_utils.safeEnvDump(this${redirect ? ", file('${redirectFile}')" : ""})For example, at lines 802–803:
- sh "env | sort" + trtllm_utils.safeEnvDump(this)Review credential usage to ensure Jenkins masking is correctly configured—but remember that any non-standard environment variables (e.g., those injected by plugins or external tools) will still be redacted by this helper.
This change both unifies environment dumping and ensures consistent redaction across all 10 sites in
jenkins/L0_Test.groovy.
♻️ Duplicate comments (2)
jenkins/L0_Test.groovy (2)
1-1: Do not pin shared library to a mutable personal branchUsing
trtllm-jenkins-shared-lib@user/yiqingy/retry_stagemakes mainline CI depend on an unreviewed, mutable branch. Pin to a stable branch (e.g.,@main) or a reviewed tag/commit SHA before merge.Apply one of the following:
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ +@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _—or pin to a release tag/commit:
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ +@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@<reviewed-tag-or-sha>']) _Confirm there is no other file referencing
@user/yiqingy/retry_stage. If needed, I can provide a repo scan script.
1490-1490: Unconditional failure forces every run to fail
throw new Exception("Infrastructure error")guarantees this stage fails, defeats rerun logic, and matches the observed downstream failures. Remove it or guard it behind an explicit, disabled-by-default flag.Minimal removal:
- throw new Exception("Infrastructure error")Safer: feature-flag for local testing only (defaults off):
- throw new Exception("Infrastructure error") + if (env.FORCE_INFRA_ERROR?.toBoolean()) { + error "Forced infrastructure error for retry-path validation (FORCE_INFRA_ERROR=true)" + }
🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)
1496-1496: Sanitize exception text in user-visible failure messageIncluding raw
${e.message}can leak sensitive data (tokens/URLs) and can be overly verbose. Emit a bounded, sanitized summary and rely on the already-uploaded rerun report for details.- error "The tests still failed after rerun attempt. ${e.message}" + def msg = (e?.message ?: e?.toString() ?: "Unknown error") + // Bound length to avoid log spam; basic hygiene against accidental secrets in messages. + msg = msg.take(512) + error "The tests still failed after rerun attempt. ${msg}"Please confirm no secrets are ever embedded in exception messages from the test harness. If unsure, prefer not logging the raw message here and point to the rerun HTML report instead.
2286-2305: Validate llmStageWithRetry behavior: stage UI, retries, and result publication semanticsGreat to add a retry wrapper. Two things to verify:
- That
 llmStageWithRetry(pipeline, key, { ... })still produces a proper Jenkins Stage in the UI and propagatescurrentBuild.result/failFast correctly underparallel.- That retries don’t double-publish JUnit results or artifacts. Given
 cacheErrorAndUploadResultruns inside the retried closure, the first attempt may upload results before a retry. Consider teaching the wrapper to expose “attempt index/last attempt” so we can skip JUnit/artifact upload on non-final attempts.If the helper supports it, pass an “on-final-attempt” hook; otherwise, we can add a lightweight guard around upload steps.
Could you share the
llmStageWithRetrysignature/semantics or point to its implementation? I can tailor a small refactor to publish results only on the last attempt.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
| 
           PR_Github #16108 [ run ] triggered by Bot  | 
    
| 
           PR_Github #16108 [ run ] completed with state   | 
    
| 
           /bot run --stage-list "A10-PyTorch-1"  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
jenkins/L0_Test.groovy (2)
1485-1499: Remove the unconditional infrastructure exception; it forces every stage to fail
throw new Exception("Infrastructure error")guarantees failure and defeats the retry-on-infra-flakiness goal, explaining the repeated A10-PyTorch failures after triggering reruns.Safe options:
- Remove it entirely
 - """ - rm -rf ${stageName}/ && \ - cd ${llmSrc}/tests/integration/defs && \ - ${testCmdLine.join(" ")} - """ - throw new Exception("Infrastructure error") + """ + rm -rf ${stageName}/ && \ + cd ${llmSrc}/tests/integration/defs && \ + ${testCmdLine.join(" ")} + """
- Or guard behind an explicit, default-off flag to exercise the retry path only when intended:
 """ rm -rf ${stageName}/ && \ cd ${llmSrc}/tests/integration/defs && \ ${testCmdLine.join(" ")} """ - throw new Exception("Infrastructure error") + if (env.FORCE_INFRA_FAILURE?.toBoolean()) { + error "Forced infrastructure failure to exercise retry path" + }
1-1: Do not pin CI to a personal/mutable shared-lib branchReferencing
trtllm-jenkins-shared-lib@user/yiqingy/retry_stagemakes mainline CI behavior mutable and unreviewed. Pin to a stable branch or commit SHA, and use a guarded flag if you must test a dev lib.Apply this diff (or pin to a reviewed tag/SHA):
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ +@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _Optional gating approach (outside this line): add a boolean param/env like
USE_DEV_SHARED_LIB=falseand include the dev lib only when explicitly enabled.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)
1494-1498: Good: include original exception text in rerun failure messageAppending
e.messagemakes the rerun failure actionable in build logs. Keep this.
        
          
                jenkins/L0_Test.groovy
              
                Outdated
          
        
      | trtllm_utils.llmStageWithRetry(pipeline, key, { | ||
| if (key in testFilter[REUSE_STAGE_LIST]) { | ||
| stage("Skip - reused") { | ||
| echo "Skip - Passed in the last pipeline." | ||
| } | ||
| } else if (values instanceof List && dockerNode == null) { | ||
| trtllm_utils.launchKubernetesPod(pipeline, values[0], "trt-llm", { | ||
| values[1]() | ||
| }) | ||
| } else if (values instanceof List && dockerNode != null) { | ||
| node(dockerNode) { | ||
| deleteDir() | ||
| docker.image(LLM_DOCKER_IMAGE).inside(dockerArgs) { | ||
| values[1]() | ||
| } | ||
| } | ||
| } else { | ||
| values() | ||
| } | ||
| } else { | ||
| values() | ||
| } | ||
| }) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Fix NPE risk in REUSE_STAGE_LIST check and streamline wrapper logic
key in testFilter[REUSE_STAGE_LIST]will throw ifREUSE_STAGE_LISTis null (the default in this file). Use a null-safe contains check.- Optional: avoid nested 
stage("Skip - reused")inside an outerstage(key)to keep Stage View tidy; a simpleechosuffices. 
Apply this minimal fix:
-                if (key in testFilter[REUSE_STAGE_LIST]) {
+                if (testFilter[REUSE_STAGE_LIST]?.contains(key)) {
                     stage("Skip - reused") {
                         echo "Skip - Passed in the last pipeline."
                     }Optional clean-up (outside this hunk): default REUSE_STAGE_LIST to [] instead of null where testFilter is declared to eliminate null checks.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| trtllm_utils.llmStageWithRetry(pipeline, key, { | |
| if (key in testFilter[REUSE_STAGE_LIST]) { | |
| stage("Skip - reused") { | |
| echo "Skip - Passed in the last pipeline." | |
| } | |
| } else if (values instanceof List && dockerNode == null) { | |
| trtllm_utils.launchKubernetesPod(pipeline, values[0], "trt-llm", { | |
| values[1]() | |
| }) | |
| } else if (values instanceof List && dockerNode != null) { | |
| node(dockerNode) { | |
| deleteDir() | |
| docker.image(LLM_DOCKER_IMAGE).inside(dockerArgs) { | |
| values[1]() | |
| } | |
| } | |
| } else { | |
| values() | |
| } | |
| } else { | |
| values() | |
| } | |
| }) | |
| trtllm_utils.llmStageWithRetry(pipeline, key, { | |
| if (testFilter[REUSE_STAGE_LIST]?.contains(key)) { | |
| stage("Skip - reused") { | |
| echo "Skip - Passed in the last pipeline." | |
| } | |
| } else if (values instanceof List && dockerNode == null) { | |
| trtllm_utils.launchKubernetesPod(pipeline, values[0], "trt-llm", { | |
| values[1]() | |
| }) | |
| } else if (values instanceof List && dockerNode != null) { | |
| node(dockerNode) { | |
| deleteDir() | |
| docker.image(LLM_DOCKER_IMAGE).inside(dockerArgs) { | |
| values[1]() | |
| } | |
| } | |
| } else { | |
| values() | |
| } | |
| }) | 
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 2287 to 2306, the check `key in
testFilter[REUSE_STAGE_LIST]` can NPE when REUSE_STAGE_LIST is null and the
inner `stage("Skip - reused")` creates nesting; change the condition to a
null-safe contains check (e.g. use testFilter[REUSE_STAGE_LIST] != null &&
testFilter[REUSE_STAGE_LIST].contains(key) or the Groovy safe navigation
variant) and replace the nested stage block with a simple echo/skip message (no
inner stage) so the outer stage remains tidy; optionally set REUSE_STAGE_LIST
default to [] where testFilter is declared to avoid future null checks.
| 
           PR_Github #16132 [ run ] triggered by Bot  | 
    
| 
           /bot run --stage-list "A10-PyTorch-1"  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/integration/test_lists/test-db/l0_a10.yml (3)
17-20: Pre-merge PyTorch coverage heavily reduced — prefer isolating to a temporary stage instead of editing the default matrix.Commenting out most PyTorch entries in the pre_merge block will materially reduce signal and can mask regressions if this lands as-is. If the goal is to validate the new rerun flow quickly, create a dedicated lightweight “retry” stage (or use a stage overlay) and keep the default pre_merge matrix intact.
This also aligns with prior learnings that CLI flow tests and PyTorch API tests should coexist to cover both paths.
Example approach (add a targeted block instead of editing pre_merge):
# New block to keep experiments isolated from default pre_merge - condition: ranges: system_gpu_count: gte: 1 lte: 1 wildcards: gpu: - '*a10*' linux_distribution_name: ubuntu* terms: stage: retry_stage # trigger with /bot run --stage-list "A10-PyTorch-1" + mapping backend: pytorch tests: - test_e2e.py::test_openai_misc_example[pytorch] - test_e2e.py::test_openai_completions_example[pytorch] - disaggregated/test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0]If introducing a new stage isn’t feasible, please at least add a loud TODO on this block to ensure it’s reverted before merge. I can open a tracking issue if helpful.
21-21: Avoid relying on cuda_graph as the sole PyTorch smoke — it can be flaky and environment-sensitive.Keeping only
test_disaggregated_cuda_graphas the lone active pre_merge PyTorch test is brittle for A10 CI; CUDA graph tests are more sensitive to driver/runtime details and may fail for infra reasons unrelated to the rerun logic.Proposed minimal smoke swap: comment out the cuda_graph test and enable a stable E2E miscellany test instead.
- - disaggregated/test_disaggregated.py::test_disaggregated_cuda_graph[TinyLlama-1.1B-Chat-v1.0] + # Prefer a simpler, faster PyTorch smoke to validate rerun infra + - test_e2e.py::test_openai_misc_example[pytorch]
22-36: Add representative PyTorch smoke tests back into the pre-merge matrixVerification of
l0_a10.ymlshows only one activepre_merge/pytorchtest remains:
- disaggregated/test_disaggregated.py::test_disaggregated_cuda_graph[TinyLlama-1.1B-Chat-v1.0]
 To preserve baseline signal without significantly increasing runtime, please re-enable:
One disaggregated variant:
• disaggregated/test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0]Two fast E2E PyTorch tests:
• test_e2e.py::test_openai_misc_example[pytorch]
• test_e2e.py::test_openai_completions_example[pytorch]Optionally, you may also enable test_openai_reasoning[pytorch] if time allows.
Please apply a diff like:
@@ tests/integration/test_lists/test-db/l0_a10.yml - # - disaggregated/test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0] + - disaggregated/test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0] @@ - # - test_e2e.py::test_openai_misc_example[pytorch] + - test_e2e.py::test_openai_misc_example[pytorch] - # - test_e2e.py::test_openai_completions_example[pytorch] + - test_e2e.py::test_openai_completions_example[pytorch]This will bring the active count up to at least three, exercising both the rerun wrapper and core PyTorch paths without a large runtime impact.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tests/integration/test_lists/test-db/l0_a10.yml(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/test-db/l0_a10.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
| 
           PR_Github #16148 [ run ] triggered by Bot  | 
    
| 
           PR_Github #16132 [ run ] completed with state   | 
    
| 
           PR_Github #16148 [ run ] completed with state   | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
jenkins/L0_Test.groovy (3)
1515-1529: Unconditional failure makes every test stage fail; rerun path is unreachable
throw new Exception("Infrastructure error")at Line 1520 forces a failure even when tests pass. The second throw at Line 1524 aborts beforererunFailedTests(...), so reruns and the improved failure message never execute. This likely explains repeated A10-PyTorch-1 stage failures observed in this PR thread.Apply this diff to let reruns happen and keep the enhanced error message:
sh """ rm -rf ${stageName}/ && \ cd ${llmSrc}/tests/integration/defs && \ ${testCmdLine.join(" ")} """ - throw new Exception("Infrastructure error") } catch (InterruptedException e) { throw e } catch (Exception e) { - throw new Exception("Infrastructure error") - isRerunFailed = rerunFailedTests(stageName, llmSrc, testCmdLine) - if (isRerunFailed) { - error "The tests still failed after rerun attempt. ${e.message}" - } + echo "Test execution failed, attempting rerun. Reason: ${e.message}" + def isRerunFailed = rerunFailedTests(stageName, llmSrc, testCmdLine) + if (isRerunFailed) { + error "The tests still failed after rerun attempt. ${e.message}" + } else { + echo "Rerun succeeded; proceeding." + } }If you need to exercise the retry path locally, guard the artificial failure behind an env flag (e.g., only throw when
env.FORCE_INFRA_FAIL == 'true') and default it off.
1-1: Pin shared library to a stable ref (not a personal branch)Referencing
trtllm-jenkins-shared-lib@user/yiqingy/retry_stagecouples CI in main to a mutable personal branch. Pin to@main, a reviewed tag, or a commit SHA. If you must iterate on a dev branch, gate it behind an explicit flag and default to the stable ref.Apply this diff:
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ +@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _Optional pattern (if you really need a dev lib temporarily, but keep default stable):
- Keep the annotation pinned to main.
 - Load the dev lib only when explicitly requested via the imperative
 library 'trtllm-jenkins-shared-lib@<sha-or-dev-branch>'inside a guarded block (note: be mindful of CPS compilation order and symbol resolution).
2320-2343: Retry wrapper: avoid nested stage and fix null-safe reuse check
if (key in testFilter[REUSE_STAGE_LIST])will NPE whenREUSE_STAGE_LISTis null (default is null in this file).- Nesting
 stage("Skip - reused")insidestage(key)clutters Stage View.Apply this minimal, behavior-preserving diff:
- stage(key) { - trtllm_utils.llmStageWithRetry(pipeline, key, { - if (key in testFilter[REUSE_STAGE_LIST]) { - stage("Skip - reused") { - echo "Skip - Passed in the last pipeline." - } - } else if (values instanceof List && dockerNode == null) { + stage(key) { + trtllm_utils.llmStageWithRetry(pipeline, key, { + if (testFilter[REUSE_STAGE_LIST]?.contains(key)) { + echo "Skip - Passed in the last pipeline." + return + } else if (values instanceof List && dockerNode == null) { trtllm_utils.launchKubernetesPod(pipeline, values[0], "trt-llm", { values[1]() }) } else if (values instanceof List && dockerNode != null) { node(dockerNode) { deleteDir() docker.image(LLM_DOCKER_IMAGE).inside(dockerArgs) { values[1]() } } } else { values() } }) }Optional hardening outside this hunk: default
REUSE_STAGE_LISTto[]to remove the need for null checks:- (REUSE_STAGE_LIST): null, + (REUSE_STAGE_LIST): [],
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)
2320-2323: Confirm llmStageWithRetry does not create its own stage to avoid double-wrapping
stage(key) { llmStageWithRetry(pipeline, key, { ... }) }may produce nested stages if the retry helper also callsstage(key)internally. If it does, prefer a single stage wrapper to keep UI clean and timing accurate.Two options:
- If
 llmStageWithRetryalready manages the stage, drop the outerstage(key)and let the helper own rendering.- If it doesn’t, keep the outer
 stage(key)and ensure the helper focuses solely on retries without creating nested Jenkins stages.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
jenkins/L0_Test.groovy(3 hunks)tests/integration/test_lists/test-db/l0_a10.yml(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/test_lists/test-db/l0_a10.yml
 
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
| 
           /bot run --stage-list "A10-PyTorch-1"  | 
    
| 
           /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast  | 
    
| 
           PR_Github #21194 [ run ] triggered by Bot  | 
    
| 
           PR_Github #21194 [ run ] completed with state   | 
    
| 
           /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast  | 
    
| 
           PR_Github #21458 [ run ] triggered by Bot  | 
    
| 
           PR_Github #21458 [ run ] completed with state   | 
    
| 
           /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast  | 
    
| 
           PR_Github #21837 [ run ] triggered by Bot. Commit:   | 
    
| 
           PR_Github #21837 [ run ] completed with state   | 
    
c2ad3d7    to
    fcd412d      
    Compare
  
    | 
           /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast  | 
    
    
      
        1 similar comment
      
    
  
    | 
           /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast  | 
    
| 
           /bot run  | 
    
| 
           PR_Github #23063 [ run ] triggered by Bot. Commit:   | 
    
| 
           PR_Github #23063 [ run ] completed with state   | 
    
Signed-off-by: Yiqing Yan <[email protected]>
| 
           /bot run  | 
    
| 
           PR_Github #23449 [ run ] triggered by Bot. Commit:   | 
    
Signed-off-by: Yiqing Yan <[email protected]>
| 
           /bot run  | 
    
| 
           PR_Github #23451 [ run ] triggered by Bot. Commit:   | 
    
| 
           PR_Github #23449 [ run ] completed with state   | 
    
| 
           PR_Github #23451 [ run ] completed with state   | 
    
Summary by CodeRabbit
Chores
Tests
Improvements
Bug Fixes
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]to print this help message.See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-listparameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.mdand the
scripts/test_to_stage_mapping.pyhelper.kill
killKill all running builds associated with pull request.
skip
skip --comment COMMENTSkip testing for latest commit on pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipelineReuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.