| name | ad-pipeline-failure-pr |
|---|---|
| description | Analyze the latest AutoDeploy pipeline or a user-specified pipeline ID, inspect failed job logs, group similar failures into actionable root-cause buckets, and create at most one PR per bucket. Use when the user mentions pipeline IDs, failed jobs, GitLab logs, failure buckets, or opening PRs from CI failures. |
Input: latest AutoDeploy model-coverage GitLab pipeline, or a specific upstream/downstream pipeline ID / pipeline URL.

Auth requirement: the user must export a GitLab token in `GITLAB_TOKEN` before this skill can query pipelines, jobs, or traces.

Output: first ask the user which output format is preferred. Default to reporting in chat. Alternative outputs are a Markdown report (`md`) and a per-failure CSV (`csv`). The skill still produces a bucketed failure report plus at most one PR per actionable root-cause bucket. When a PR is not justified but the bucket is still worth tracking, create one issue for that bucket.
This skill must be standalone. Resolve pipelines, failed jobs, and raw logs directly from GitLab APIs and job traces. Do not depend on autodeploy-dashboard code, scripts, CSVs, or its legacy categorization logic. This skill owns the bucketing rules, skip rules, repo ownership decision, and one-PR-per-bucket behavior.
Before any GitLab API call, require GITLAB_TOKEN to be set in the environment. If it is missing, stop immediately and tell the user: Set GITLAB_TOKEN to a GitLab personal access token and rerun this skill.
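The token precondition above can be sketched as a small guard that runs before any request is made. This is a minimal sketch, not part of the skill itself: the helper names `require_gitlab_token` and `auth_headers` are hypothetical, and it assumes the token is a personal access token sent in GitLab's `PRIVATE-TOKEN` header.

```python
import os

MISSING_TOKEN_MESSAGE = (
    "Set GITLAB_TOKEN to a GitLab personal access token and rerun this skill."
)


def require_gitlab_token(env=None):
    """Return the token from GITLAB_TOKEN, or raise with the fix-it message."""
    env = os.environ if env is None else env
    token = (env.get("GITLAB_TOKEN") or "").strip()
    if not token:
        # Stop before any GitLab call is attempted.
        raise RuntimeError(MISSING_TOKEN_MESSAGE)
    return token


def auth_headers(token):
    """Build request headers for a personal access token (assumed auth style)."""
    return {"PRIVATE-TOKEN": token}
```

Passing `env` explicitly keeps the guard easy to exercise without touching the real environment.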
Before doing the main analysis, ask the user which output is preferred:
- `chat` (default)
- `md`
- `csv`
If the user does not specify, default to chat.
- Default scope is `model-coverage`. Do not silently switch to benchmark pipelines.
- If the user explicitly asks to analyze a benchmark pipeline, stop and tell them this skill does not support benchmark pipelines.
- If the user gives a pipeline ID or GitLab pipeline URL, use it.
- Treat a user-provided pipeline as potentially either:
  - an upstream AutoDeploy pipeline in `ftp/infra/autodeploy-dashboard`
  - a downstream triggered pipeline in `dl/jet/ci`
- If the starting pipeline is upstream, follow the failed bridge chain until you reach the first downstream pipeline with terminal `model-coverage` jobs.
- Otherwise resolve the latest upstream AutoDeploy pipeline that ran `model-coverage`, then follow the same bridge chain to the terminal pipeline.
- If `GITLAB_TOKEN` is missing, stop immediately and tell the user exactly how to fix it: `Set GITLAB_TOKEN to a GitLab personal access token and rerun this skill.`
Use this resolution order:
- Identify whether the provided pipeline belongs to the upstream dashboard project or the downstream `dl/jet/ci` project.
- If it is upstream, inspect its bridge jobs and select the failed `model-coverage` trigger path.
- If the next pipeline contains only bridge jobs, keep following the failed trigger chain.
- Stop at the first downstream pipeline that contains terminal failed `model-coverage` jobs with traces.
- Report both:
  - the user-facing starting pipeline
  - the terminal pipeline that contains the actual failing jobs
Do not analyze only the bridge failure if a deeper downstream pipeline contains the real job traces.
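The resolution order above can be sketched as a loop that keeps following failed bridges until it finds a pipeline with failed terminal jobs. This is a sketch under stated assumptions: `list_bridges` and `list_failed_jobs` are hypothetical caller-supplied fetchers that wrap the GitLab API, and the normalized bridge shape (`name`, `status`, `downstream`) is invented here for illustration rather than taken from GitLab's response format.

```python
def resolve_terminal_pipeline(start, list_bridges, list_failed_jobs,
                              scope="model-coverage", max_hops=10):
    """Follow failed bridges from `start` to the first pipeline with failed jobs.

    `start` and every `downstream` value are (project, pipeline_id) tuples.
    `list_failed_jobs` is assumed to return only non-bridge failed jobs.
    """
    current, jobs = start, []
    for _ in range(max_hops):  # guard against an unexpectedly long chain
        jobs = list_failed_jobs(*current)
        if jobs:
            break  # terminal pipeline: it holds the real failing jobs
        failed = [b for b in list_bridges(*current)
                  if b["status"] == "failed" and b.get("downstream")]
        # Prefer the bridge named after the scope; deeper hops may not carry it.
        in_scope = [b for b in failed if scope in b["name"]]
        chosen = in_scope or failed
        if not chosen:
            break  # nothing deeper to follow; report what was found
        current = chosen[0]["downstream"]
    return {
        "starting": start,
        "terminal": current,
        "bridge_followed": current != start,
        "failed_jobs": jobs,
    }
```

Returning both `starting` and `terminal` keeps the two pipelines that the report must name together in one place. An empty `failed_jobs` list signals the case where the bridge itself may need to be reported.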
All GitLab API and trace-fetching steps in this skill must authenticate with the token from GITLAB_TOKEN.
For each failed job, collect:
- pipeline ID
- job ID and job URL
- raw log URL
- workload name
- model or benchmark configuration
- first causal error snippet from the raw trace
Also collect:
- starting pipeline ID
- terminal pipeline ID
- whether the job came from a bridge-followed downstream path
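One way to hold the fields listed above is a small record type per failed job. This is an illustrative sketch: the class name `FailedJobRecord` and its field names are hypothetical, and the raw log URL is derived on the assumption that GitLab serves the plain-text trace at the job URL with `/raw` appended.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FailedJobRecord:
    starting_pipeline_id: int   # pipeline the user pointed at
    terminal_pipeline_id: int   # pipeline that holds the failing job
    job_id: int
    job_url: str
    workload: str
    model_config: str           # model or benchmark configuration
    first_causal_error: str     # snippet taken from the raw trace

    @property
    def raw_log_url(self):
        # Assumption: the raw trace lives at "<job_url>/raw".
        return self.job_url.rstrip("/") + "/raw"

    @property
    def bridge_followed(self):
        # True when the job was reached by following bridges downstream.
        return self.starting_pipeline_id != self.terminal_pipeline_id
```

Calling `asdict(record)` gives a plain dictionary that can feed the later report or CSV steps.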
Before proposing a fix, read at least one representative raw log for every tentative bucket. Do not rely on legacy labels alone.
Trace-reading rules:
- In `model-coverage` terminal pipelines, jobs often come in triplets like `[1 logs_before]`, `[2 <runner/stage>]`, `[3 logs_after]`. The primary failing workload is usually the `[2 ...]` job. Use `[1]` and `[3]` only as supplemental evidence when needed.
- If the trace ends with generic wrapper failures such as `RuntimeError: Executor worker returned error`, `RuntimeError: Executor worker died during initialization`, or `ERROR: Job failed: Process exited with status 1`, keep scanning upward and record the earlier model-, export-, tokenizer-, or environment-specific exception instead.
- Prefer the first specific exception that explains the failure over later fallout from worker teardown, Slurm cleanup, or proxy startup.
- When the workload dumps its config in the trace, capture the resolved `model:` value and relevant `yaml_extra` / runtime hints. They are often useful for explaining why a bucket is multimodal, world-size-specific, or using a special mode.
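The wrapper-skipping rule can be sketched as a scan that ignores the generic wrapper messages and returns the first line that looks like a specific exception. This is a rough first pass, not a replacement for reading the log: the helper name `first_causal_error` is hypothetical, and the regular expression assumes causal lines resemble Python exceptions such as `SomeError: message`.

```python
import re

# Generic wrapper failures named in the trace-reading rules.
GENERIC_WRAPPERS = (
    "RuntimeError: Executor worker returned error",
    "RuntimeError: Executor worker died during initialization",
    "ERROR: Job failed: Process exited with status 1",
)

# Assumption: a causal line looks like "<Name>Error: ..." or "<Name>Exception: ...".
EXCEPTION_LINE = re.compile(r"\b[\w.]*(?:Error|Exception):\s*\S")


def first_causal_error(trace):
    """Return the first specific exception line in the trace, or None."""
    for line in trace.splitlines():
        text = line.strip()
        if any(wrapper in text for wrapper in GENERIC_WRAPPERS):
            continue  # later fallout, not the cause
        if EXCEPTION_LINE.search(text):
            return text
    return None
```

A `None` result means the trace needs manual reading. It should not be taken as evidence that the failure is infrastructure noise.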
Every analyzed failed job must end up in exactly one bucket. Do not leave failures in an implicit catch-all like other, misc, or untriaged in the final report.
This includes infra and external cases. They still need explicit buckets, for example:
- `infra/resource/oom`
- `infra/runtime/timeout-or-freeze`
- `infra/runtime/cancelled`
- `infra/filesystem/hf-lock-permission`
- `external/huggingface/access-forbidden`
- `external/huggingface/missing-revision`
- `external/huggingface/invalid-tokenizer-or-processor`
- `external/env/missing-python-package`
- `external/transformers/api-mismatch`
Do not assume oom or timeout-or-freeze are infra-only. In AutoDeploy pipelines they often reflect real TensorRT-LLM / AutoDeploy bugs. Classify them as infra/... only when the evidence points to cluster noise or a non-code resource problem. Otherwise bucket them under the real owning repo/component.
Group failures together only when all of these are true:
- they point to the same likely code owner and target repo
- they share the same causal failure signature, such as the same failing symbol, op, assertion, stack frame, or config path
- they appear fixable by one coherent code change
- one PR can reasonably explain why the same fix covers every matched job
Split failures into different buckets when any of these are true:
- the first causal error differs even if the legacy category matches
- the same symptom comes from different repos or subsystems
- one failure is infrastructure noise and the other is a code bug
- the likely fixes would touch unrelated files or require different validation
- the evidence is mixed or contradictory
When uncertain, split instead of merge.
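The merge and split rules can be sketched as grouping on a strict key, so that jobs are merged only when owner, component, and causal signature all match. This is a minimal sketch: the `repo`, `component`, and `signature` fields are hypothetical inputs that the log review would have to fill in, and the placeholder singleton name is meant to be replaced by a descriptive bucket name once the log has been read.

```python
from collections import defaultdict


def bucket_jobs(jobs):
    """Group job IDs by (repo, component, signature); uncertain jobs stay alone."""
    buckets = defaultdict(list)
    for job in jobs:
        key = (job.get("repo"), job.get("component"), job.get("signature"))
        if None in key:
            # Missing evidence: split into a one-job bucket instead of merging.
            key = ("unknown", "singleton", f"job-{job['id']}")
        buckets[key].append(job["id"])
    return dict(buckets)
```

Because the key includes the signature, two jobs with the same legacy category but different first causal errors land in different buckets.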
If a failed job does not fit any existing bucket, put it in its own one-job bucket. Do not leave it uncategorized.
That one-job bucket must still be labeled as exactly one of:
- `actionable`: likely fixable with a PR
- `issue-only`: worth tracking, but not ready for a PR
Do not use a skip PR label. If a bucket should not produce a PR, mark it issue-only when it is still worth tracking.
Buckets such as OOM, timeout/freeze, cancelled, or Hugging Face access failures must still appear explicitly in the report. If the shared failure mode is clear enough to track, prefer issue-only.
The final report must account for all failed jobs:
- include the total failed job count
- include bucket counts
- ensure the sum of all bucket sizes equals the total failed job count
- make unmatched or low-confidence cases explicit as singleton buckets instead of hiding them
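The accounting requirement can be checked mechanically before the report is printed. The following is a sketch with a hypothetical helper name, assuming buckets are a mapping from bucket name to a list of job IDs.

```python
def check_bucket_accounting(failed_job_ids, buckets):
    """Return a list of problems; an empty list means every job is accounted for."""
    problems = []
    seen = {}
    for name, job_ids in buckets.items():
        for job_id in job_ids:
            if job_id in seen:
                problems.append(
                    f"job {job_id} is in both {seen[job_id]} and {name}")
            seen[job_id] = name
    missing = sorted(set(failed_job_ids) - set(seen))
    if missing:
        problems.append(f"unbucketed jobs: {missing}")
    unknown = sorted(set(seen) - set(failed_job_ids))
    if unknown:
        problems.append(f"bucketed jobs that are not failed jobs: {unknown}")
    total = sum(len(job_ids) for job_ids in buckets.values())
    if total != len(failed_job_ids):
        problems.append(
            f"sum of bucket sizes = {total}, "
            f"total failed jobs = {len(failed_job_ids)}")
    return problems
```

Running this check last makes a job that fell through the bucketing step visible instead of silently missing from the report.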
Use this evidence priority order when bucketing:
- first causal stack frame or assertion
- explicit failing symbol, op, layer, config key, or script
- repeated error snippet near the first failure
- repeated failure wording across matched traces
- job naming and workload metadata only as a weak tie-breaker
Each bucket must have:
- a short bucket name in the form `repo/component/failure-mode`
- one representative job
- a list of all matching jobs
- one root-cause hypothesis tied to code
Do not create a PR for a bucket when any of these are true:
- the failures are pure infrastructure noise such as timeout, preemption, cluster cancellation, or log-access failure without code evidence
- the jobs do not share one plausible code fix
- the evidence is too weak to point at a concrete code path
- the issue belongs to external infrastructure or an external dependency outside the checked-out repos
- an open PR already appears to address the same bucket
- the only commonality is a broad status label or superficial wording
If the starting pipeline failed only because a bridge failed, do not treat the bridge as its own actionable bucket unless the downstream terminal pipeline has no failing jobs or no accessible traces.
Infrastructure and external buckets must still be reported as explicit buckets. They should usually be issue-only rather than promoted to a PR unless the evidence clearly points to a repo-owned fix.
Common issue-only patterns seen in AutoDeploy model-coverage pipelines:
- gated or forbidden Hugging Face repos (`403`)
- missing or renamed Hugging Face revisions/models (`404`)
- missing optional Python packages such as `timm`, `num2words`, `mamba_ssm`, `causal_conv1d`, or similar runtime dependencies
- filesystem permission problems on Hugging Face cache lock files
- only clearly non-code resource failures after log review; do not auto-classify CUDA OOM or timeout/freeze as infra without checking for an AutoDeploy root cause
Prefer TensorRT-LLM when the root cause is in:
- AutoDeploy model code
- AutoDeploy runtime or transforms
- tests, configs, or execution paths owned by `TensorRT-LLM`
- code paths surfaced by `ad-debug-agent`
Prefer autodeploy-dashboard when the root cause is in:
- failure-analysis scripts
- workload generation
- job URL or raw-log resolution
- pipeline orchestration or reporting gaps in the AutoDeploy pipeline repo
Do not open a PR when the bucket belongs to cluster infrastructure, GitLab service behavior, or another external system that is not owned by the checked-out repos.
For every bucket:
- Read the representative job log and isolate the first causal failure, not the downstream fallout.
- Read the relevant code, config, or script that the failure points to.
- Confirm that the same hypothesis explains the other jobs in the bucket.
- If deeper AutoDeploy tracing is needed, use the `ad-debug-agent` workflow to inspect the failing code path before editing.
- If the representative log does not actually support the bucket hypothesis, split or discard the bucket.
Do not start coding until the bucket has both:
- one representative log snippet
- one code-level hypothesis
Work one bucket at a time.
For an actionable bucket:
- Choose the smallest code change that plausibly fixes the shared root cause.
- Prefer a targeted fix over a broad cleanup.
- Verify with the smallest relevant test or validation step.
- If the validation suggests the bucket actually contains multiple root causes, split it before opening any PRs.
- Create one branch and one PR for the full bucket.
Never open one PR per failed job when the jobs share the same fix.
If a bucket is worth tracking, but you do not have enough confidence for a PR, create one issue for that bucket instead of silently stopping.
Create an issue when all of these are true:
- the bucket has a clear shared failure mode
- the representative logs provide enough evidence to explain the bucket
- one issue can clearly describe the shared failure mode
- a PR is not justified yet because the fix is uncertain, risky, mixed, under-validated, external, or infra-related
Do not create an issue when any of these are true:
- the evidence is too weak to explain the failure mode at all
- an open issue or PR already appears to cover the same bucket
- the bucket is just a duplicate restatement of another bucket
Issues for infra or external buckets are valid. Examples include:
- `infra/resource/oom`
- `infra/runtime/timeout-or-freeze`
- `infra/runtime/cancelled`
- `external/huggingface/access-forbidden`
- `external/huggingface/missing-revision`
- `external/env/missing-python-package`
For oom and timeout-or-freeze, prefer a repo-owned bucket instead when the traces suggest a reproducible AutoDeploy issue rather than infrastructure noise.
When creating an issue in TensorRT-LLM, use the repository templates in .github/ISSUE_TEMPLATE/ instead of inventing a custom issue body.
- For failure buckets from this skill, use `.github/ISSUE_TEMPLATE/06-bug-report.yml` by default.
- Only use another template if the bucket is clearly a feature request or another non-bug category.
Fill the selected issue template with the triage evidence from this skill. At minimum, include:
- pipeline ID and workload scope
- representative job URL
- first causal failure snippet
- matching jobs or affected model families
- likely owner or subsystem when known
- code-level hypothesis when applicable
- why a PR was not created yet
Respect the template's required structure and security guidance. Do not paste sensitive tokens, private credentials, or other secrets into the issue body.
Prefer one issue per bucket, not one issue per job.
Before opening a PR:
- verify there is no existing open PR for the same bucket or failure signature
- confirm the PR target repo matches the bucket owner
- ensure the proposed fix is backed by evidence from logs and code
- make sure the PR description explains why one change covers all jobs in the bucket
For TensorRT-LLM PRs, follow the repo workflow:
- use the local PR title format: `[JIRA/NVBUG/None][type] description`
- keep the PR focused on one concern
- validate only the smallest relevant tests or commands
Before opening an issue:
- verify there is no existing open issue or PR for the same bucket or failure signature
- confirm the issue target repo is the best available home for the bucket
- make sure the issue explains why no PR was created
- include enough evidence that another engineer can pick it up without redoing the initial triage
- use the appropriate file from `.github/ISSUE_TEMPLATE/`, usually `06-bug-report.yml` for failure buckets from this skill
Use this structure:
## Summary
- Fixes root-cause bucket: `<repo/component/failure-mode>`
- Resolves failures from pipeline `<pipeline_id>`
- One change covers `<N>` matching jobs because `<shared-cause>`
## Evidence
- Representative job: `<job_url>`
- Representative log snippet: `<first causal failure>`
- Matching jobs: `<count>` across `<models/workloads>`
- Bucket rule: `<why these failures belong together>`
## Validation
- `<focused test or verification step>`
## Not Included
- `<skipped infra-only or mixed-evidence buckets>`

Print a concise final report with:
- target pipeline, terminal pipeline, and workload scope
- all buckets with status such as `actionable` or `issue-only`
- representative evidence for each actionable bucket
- PRs created, issues created, or why no PR was created for an `issue-only` bucket
- remaining risks or follow-up validation
The final report must also include a bucketization checksum:
- `total failed jobs = <N>`
- `sum of bucket sizes = <N>`
If no PRs or issues were created, say that explicitly and explain whether the blocker was:
- duplicate-checks not yet performed
- evidence too weak for a concrete code owner
- no coherent single fix
- external or infra ownership
Honor the user's selected output format:
- `chat`: print the final report directly in chat
- `md`: also write the final report to a Markdown file
- `csv`: also write a per-failure CSV with one row per failed job, including at least job ID, job URL, workload/model, first causal error, bucket, likely owner, and outcome
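The `csv` output can be sketched with the standard `csv` module, one row per failed job. The column names below are hypothetical spellings of the minimum fields listed for the CSV format, and the helper name `write_failure_csv` is likewise illustrative.

```python
import csv
import io

# Minimum columns for the per-failure CSV (names are illustrative).
CSV_COLUMNS = [
    "job_id", "job_url", "workload", "first_causal_error",
    "bucket", "likely_owner", "outcome",
]


def write_failure_csv(rows, stream):
    """Write one CSV row per failed job to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=CSV_COLUMNS)
    writer.writeheader()
    for row in rows:
        # Keep only the known columns and fill any gaps with empty strings.
        writer.writerow({col: row.get(col, "") for col in CSV_COLUMNS})


def failure_csv_text(rows):
    """Convenience wrapper that returns the CSV as a string."""
    buffer = io.StringIO()
    write_failure_csv(rows, buffer)
    return buffer.getvalue()
```

When writing to a real file, opening it with `newline=""` lets the `csv` module control line endings itself.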
- Do not trust a legacy category without reading logs.
- Do not depend on `autodeploy-dashboard` code to resolve pipelines or classify failures.
- Do not stop at the first failed bridge if the real `model-coverage` failures are deeper in the downstream trigger chain.
- Do not merge failures just because they mention the same model.
- Do not create a PR for a bucket that maps to multiple unrelated fixes.
- Do not open PRs for infra-only buckets.
- Do not hide uncertainty; if evidence is mixed, split the bucket or mark it `issue-only` instead of opening a PR.