| name | ci-failure-retrieval |
|---|---|
| description | Retrieve and diagnose CI test failures from TensorRT-LLM pull requests using the GitHub API and Jenkins testReport API. Use when the user asks about CI failures on a PR, wants to see failed test details, or needs stdout/stderr from a CI run. |
| license | Apache-2.0 |
| metadata | |
|
- Input: a PR number or a request to check CI failures.
- Auth: requires corporate network access to resolve the Jenkins base URL.
- Output: a summary of failed tests with error details, and optionally full stdout/stderr for specific failures.
Jenkins requires HTTPS with certificate verification disabled. Always use an `ssl` context bypass in Python or the `-sk` flags in curl:

```python
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
```

The `curl -s` approach often returns HTML login pages; prefer the Python `urllib` approach with the SSL bypass.
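Since every phase below repeats this boilerplate, it can be wrapped once in a small helper. This is a sketch; `insecure_context` and `fetch_json` are hypothetical names, not part of any existing tooling:

```python
import json
import ssl
import urllib.request

def insecure_context() -> ssl.SSLContext:
    """Build an SSL context with certificate verification disabled,
    as the Jenkins endpoints require."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx

def fetch_json(url: str, timeout: int = 30):
    """GET a Jenkins API endpoint and decode the JSON body."""
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req, context=insecure_context(), timeout=timeout) as resp:
        return json.loads(resp.read())
```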
First, determine the latest CI run commit, build number, and high-level pass/fail counts:
```bash
source ~/utils/github/set_github_token.sh
PR_NUM=<pr_number>

# Get the latest CI bot comment (contains build number and commit)
gh api "repos/NVIDIA/TensorRT-LLM/issues/${PR_NUM}/comments" --paginate --jq \
  '[.[] | select(.user.login == "tensorrt-cicd") | select(.body | test("L0_MergeRequest_PR"))] | last | .body'

# Get the PR HEAD commit and its blossom-ci status (high-level pass/fail counts)
HEAD_SHA=$(gh api "repos/NVIDIA/TensorRT-LLM/pulls/${PR_NUM}" --jq '.head.sha')
gh api "repos/NVIDIA/TensorRT-LLM/commits/${HEAD_SHA}/statuses" --jq \
  '[.[] | select(.context == "blossom-ci")] | first | {state, description}'
```

The `description` field shows aggregate counts like "23969 passed, 1 failed, 8962 skipped".
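If the aggregate counts are needed as numbers rather than a string, a small parser works, assuming the "N passed, N failed, N skipped" format shown above; `parse_blossom_counts` is a hypothetical helper:

```python
import re

def parse_blossom_counts(description: str) -> dict:
    """Parse "N passed, N failed, N skipped" out of a blossom-ci status
    description (assumed format; returns {} when the counts are absent)."""
    m = re.search(r'(\d+) passed, (\d+) failed, (\d+) skipped', description)
    if not m:
        return {}
    passed, failed, skipped = (int(g) for g in m.groups())
    return {'passed': passed, 'failed': failed, 'skipped': skipped}
```

Note that a non-zero `failed` count here can still be a stage-level failure rather than a test failure; the pipeline-stage check disambiguates.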
Extract the L0_MergeRequest_PR build number from the CI bot comment:
```bash
BUILD_NUM=$(gh api "repos/NVIDIA/TensorRT-LLM/issues/${PR_NUM}/comments" --paginate --jq \
  '[.[] | select(.user.login == "tensorrt-cicd") | select(.body | test("L0_MergeRequest_PR"))] | last | .body' \
  | grep -oP 'L0_MergeRequest_PR/\K\d+')
```

Many CI failures are infrastructure-level (Slurm node issues, pipeline aborts, resource exhaustion) where no test code executes at all. Always check the pipeline stages first:
```python
import json, ssl, urllib.request

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

JENKINS_BASE = "https://prod.blsm.nvidia.com/sw-tensorrt-top-1/job/LLM/job/main/job/L0_MergeRequest_PR"
BUILD_NUM = <build_number>

# Get pipeline stage overview
url = f"{JENKINS_BASE}/{BUILD_NUM}/wfapi/describe"
resp = urllib.request.urlopen(urllib.request.Request(url), context=ctx, timeout=30)
data = json.loads(resp.read())

print(f"Pipeline status: {data.get('status')}")
for stage in data.get('stages', []):
    status = stage.get('status', '')
    if status not in ('SUCCESS', 'SKIPPED', 'NOT_EXECUTED'):
        name = stage.get('name', '')
        print(f"  [{status}] {name}")
        if 'error' in stage:
            print(f"    Error: {stage['error']}")
```

The Jenkins console log contains a CI failure analysis summary with sections like `## Recommended Actions` and `## Infrastructure Notes`. This is the single most valuable source for understanding infrastructure failures:
```python
url = f"{JENKINS_BASE}/{BUILD_NUM}/consoleText"
resp = urllib.request.urlopen(urllib.request.Request(url), context=ctx, timeout=30)
text = resp.read().decode('utf-8', errors='replace')

# Extract failure-related lines from the end of the log
for line in text[-8000:].split('\n'):
    lo = line.lower()
    if any(kw in lo for kw in ['fail', 'error', 'abort', 'likely cause',
                               'recommended action', 'infrastructure',
                               'no test code', 'stage result']):
        print(line.strip()[:300])
```

Key sections to look for in the console log:
- `Failing job` / `Failed stage`: which Jenkins sub-job and stage failed
- `Likely cause`: automated root cause analysis (Slurm issues, pipeline timeouts, etc.)
- `No test code was executed`: confirms infrastructure-only failure (no code fix needed)
- `Recommended Actions`: whether to re-trigger CI or investigate code changes
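To pull one of those sections out programmatically, a line-based scan is enough. This is a sketch assuming the summary uses plain `## ` markdown headers as described above; `extract_section` is a hypothetical name:

```python
def extract_section(console_text: str, header: str) -> str:
    """Return the body of one "## <header>" section of the CI failure
    analysis summary, stopping at the next "## " header."""
    out, capturing = [], False
    for line in console_text.split('\n'):
        if line.strip() == f'## {header}':
            capturing = True
            continue
        if capturing and line.startswith('## '):
            break  # next section begins
        if capturing:
            out.append(line)
    return '\n'.join(out).strip()
```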
Only proceed here if Phase 1.5/1.6 indicate actual test failures (not infrastructure issues):
```python
url = f"{JENKINS_BASE}/{BUILD_NUM}/testReport/api/json"
resp = urllib.request.urlopen(urllib.request.Request(url), context=ctx, timeout=30)
data = json.loads(resp.read())
print(f'Summary: {data["passCount"]} passed, {data["failCount"]} failed, {data["skipCount"]} skipped')

failed = []
for suite in data.get('suites', []):
    for case in suite.get('cases', []):
        if case.get('status') in ('FAILED', 'REGRESSION'):
            failed.append(case)

if not failed:
    print('No test failures in testReport!')
else:
    print(f'Failed tests ({len(failed)}):')
    for f in failed:
        print(f'  - {f["className"]}.{f["name"]}')
        err = (f.get('errorDetails') or '')[:200]
        if err:
            print(f'    Error: {err}')
```

The `errorStackTrace` can be incomplete when errors originate from subprocesses. Fetch stdout and stderr for the specific test case to find the real error:
```python
for suite in data.get('suites', []):
    for case in suite.get('cases', []):
        if case.get('status') in ('FAILED', 'REGRESSION'):
            name = f'{case["className"]}.{case["name"]}'
            if '<search_term>' in name:
                print(f'=== {name} ===')
                print('--- Error ---')
                print(case.get('errorDetails', ''))
                print('--- Stack Trace ---')
                print(case.get('errorStackTrace', ''))
                print('--- Stdout (last 3000 chars) ---')
                print((case.get('stdout') or '')[-3000:])
                print('--- Stderr (last 3000 chars) ---')
                print((case.get('stderr') or '')[-3000:])
                break
```

Fields available on each test case:

- `className`, `name`: test identifier
- `status`: `FAILED` or `REGRESSION`
- `errorDetails`: error message
- `errorStackTrace`: full stack trace (may be incomplete for subprocess errors)
- `stdout`, `stderr`: full test output (can be large; check these when the stack trace is insufficient)
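The fallback logic (use the stack trace unless it is just a generic wrapper failure, then fall back to the stderr/stdout tails) can be sketched as a heuristic. `best_error_text` and its 40-character threshold are assumptions, not part of the Jenkins API:

```python
def best_error_text(case: dict, tail: int = 3000) -> str:
    """Pick the most informative error text for a failed test case:
    prefer the stack trace unless it is empty or only a generic wrapper
    failure, then fall back to the stderr/stdout tails."""
    trace = case.get('errorStackTrace') or ''
    generic = 'Process exited with status' in trace or len(trace.strip()) < 40
    if trace and not generic:
        return trace
    for key in ('stderr', 'stdout'):
        text = (case.get(key) or '')[-tail:]
        if text.strip():
            return text
    return trace or case.get('errorDetails') or ''
```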
| Pattern | Diagnosis | Action |
|---|---|---|
| `No test code was executed` + Slurm errors | Infrastructure: Slurm node resource exhaustion | Re-trigger CI |
| `ABORTED` stage + `Downstream job did not succeed` | Cascading failure from fail-fast policy | Fix root cause stage, re-trigger |
| `newosproc` / `errno=11` / `fork/exec` | Kernel process table exhaustion on login node | Wait and re-trigger |
| testReport: 0 failed but blossom-ci: N failed | Stage-level failures, not test failures | Check Phase 1.5/1.6 |
| testReport: N failed with real test names | Actual test code failures | Investigate test errors in Phase 3 |
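The table above can be collapsed into a rough triage helper; `classify_failure` is a hypothetical sketch, and the console-log summary remains the authoritative signal:

```python
def classify_failure(console_tail: str, test_fail_count: int) -> str:
    """Map the diagnosis patterns from the table above to a suggested
    action (heuristic; substring matching on the console log tail)."""
    lo = console_tail.lower()
    if test_fail_count > 0:
        return 'investigate test errors (Phase 3)'
    if 'no test code was executed' in lo and 'slurm' in lo:
        return 're-trigger CI (Slurm resource exhaustion)'
    if any(p in lo for p in ('newosproc', 'errno=11', 'fork/exec')):
        return 'wait and re-trigger (process table exhaustion)'
    if 'downstream job did not succeed' in lo:
        return 'fix root cause stage, re-trigger'
    return 'check pipeline stages (Phase 1.5/1.6)'
```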
- Do not guess Jenkins URLs; always use the known base `https://prod.blsm.nvidia.com/sw-tensorrt-top-1/job/LLM/job/main/job/L0_MergeRequest_PR`.
- Do not use `curl -s` for the Jenkins API; it returns HTML login pages. Use Python `urllib` with the SSL bypass.
- Do not jump to testReport (Phase 2) before checking pipeline stages (Phase 1.5); many failures are infrastructure-only with zero test failures.
- Do not stop at `errorStackTrace` if it mentions generic wrapper failures like `Process exited with status 1`; check `stdout` and `stderr` for the real error.
- Do not fetch all test cases when looking for a specific failure; use the `<search_term>` filter in Phase 3.