[CI] Show only ROCm failures in parity summary and add cross-arch column #3153

Merged
jithunnair-amd merged 8 commits into develop from parity-summary-improvements on Apr 23, 2026



@ethanwee1 ethanwee1 commented Apr 14, 2026

Summary

  • Display only tests whose ROCm status is FAILED in the summary, with the CUDA status shown alongside as a context column. Previously both ROCm and CUDA failures were shown.
  • Add an "Also Failing In" column showing which other architectures have the same test tuple (test_file, test_class, test_name) failing, making it easy to distinguish issues that affect all ROCm architectures from architecture-specific ones.
  • Include the count of failed tests in the section header.
  • Add job-level and test-level shard info to the "LOG-BASED FAILURES (not in XML)" and "FAILED TESTS" sections.
  • Include flaky tests in the "LOG-BASED FAILURES (not in XML)" section, i.e. any test that passes when rerun in a new process.

Test plan

Only display tests where ROCm status is FAILED (CUDA status shown
as context column). Add cross-architecture lookup so each failure
shows which other architectures have the same test failing.
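
The cross-architecture lookup described above could be sketched roughly as follows. This is a minimal illustration, not the code in generate_summary.py: the function names, the input shape (a dict of arch to a set of test tuples), and the test names are all assumptions made for the example.

```python
from collections import defaultdict


def build_failure_index(failures_by_arch):
    """Map each (test_file, test_class, test_name) tuple to the archs where it fails."""
    index = defaultdict(set)
    for arch, test_keys in failures_by_arch.items():
        for key in test_keys:
            index[key].add(arch)
    return index


def also_failing_in(index, arch, key):
    """Other architectures (sorted) on which the same test tuple also fails."""
    return sorted(index.get(key, set()) - {arch})


# Hypothetical data: these test names are placeholders, not real failures.
failures_by_arch = {
    "mi200": {("inductor.test_example", "TestExample", "test_a")},
    "mi355": {
        ("inductor.test_example", "TestExample", "test_a"),
        ("inductor.test_example", "TestExample", "test_b"),
    },
}
index = build_failure_index(failures_by_arch)
print(also_failing_in(index, "mi200", ("inductor.test_example", "TestExample", "test_a")))
print(also_failing_in(index, "mi355", ("inductor.test_example", "TestExample", "test_b")))
```

Building the index once and subtracting the current arch per row keeps the per-row lookup cheap, and an empty result marks a failure as architecture-specific.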

rocm-repo-management-api Bot commented Apr 14, 2026

Jenkins build for fad16aed2f09fcc9366270a28d73d228b5629220 commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

Remove cuda, cuda_dist, cuda_inductor, and baseline entries
from LOG_FILE_MAP since only ROCm failures are relevant to
the parity report.

rocm-repo-management-api Bot commented Apr 14, 2026

Jenkins build for fad16aed2f09fcc9366270a28d73d228b5629220 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Comment thread .automation_scripts/pytorch-unit-test-scripts/generate_summary.py Outdated
- Restore CUDA and baseline log parsing so their failures can be cross-
  referenced, but keep the LOG-BASED FAILURES table's Arch column limited
  to ROCm entries (CUDA rows are hidden from the table itself).
- Add "Also Failing In" column to LOG-BASED FAILURES and include "cuda"
  in the FAILED TESTS "Also Failing In" column when a CUDA log failure
  exists for the same test tuple. This lets us spot tests failing on
  both platforms so we can revert upstream changes instead of filing a
  ROCm DISABLED issue.
- Split the single Shard column in FAILED TESTS into Shard (rocm) and
  Shard (cuda) so each failure can be looked up in either CI job.
- Propagate the active test-file shard to CONSISTENT_FAILURE log entries
  so shard info is no longer blank in the log-based failures table.
@rocm-repo-management-api

Jenkins build for 902d7cfa0b3c35044138c88548e5991bf5c43049 commit finished as ABORTED
Links: Pipeline Overview / Build artifacts / Test Results

- detect_log_failures.py now computes job-level shard totals by counting
  log files per (platform, test_config) and emits both job_shard
  (e.g. 3/6, derived from filename + file count) and test_shard
  (e.g. 10/15, the intra-file pytest "Running ... N/M" value) for each
  failure, including CONSISTENT_FAILURE entries.
- generate_summary.py LOG-BASED FAILURES table now has separate
  "Job-Level Shard" and "Test-Level Shard" columns so reviewers can
  jump directly to the CI job and any intra-file shard.
- FAILED TESTS table columns renamed from "Shard (rocm/cuda)" to
  "Job-Level Shard (rocm/cuda)" for consistency with the log-based
  table (these values are already derived from the XML report dir
  name, e.g. test-default-3-6).
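
As a rough sketch of the two derivations mentioned above: the job-level shard can be read from a report directory name such as test-default-3-6, and a shard total can be obtained by counting log files per (platform, test_config). The helper names and the tuple layout of `log_entries` are assumptions for illustration; the real scripts may parse filenames differently.

```python
import re
from collections import Counter

# Assumed directory naming: test-<config>-<shard index>-<shard total>
SHARD_DIR_RE = re.compile(r"^test-(?P<config>.+)-(?P<index>\d+)-(?P<total>\d+)$")


def parse_job_shard(report_dir_name):
    """Return (test_config, 'index/total') or None if the name does not match."""
    match = SHARD_DIR_RE.match(report_dir_name)
    if match is None:
        return None
    return match["config"], f"{match['index']}/{match['total']}"


def job_shard_totals(log_entries):
    """Count log files per (platform, test_config).

    log_entries is an iterable of (platform, test_config, shard_index),
    one entry per log file (a hypothetical shape for this sketch).
    """
    return Counter((platform, config) for platform, config, _ in log_entries)


print(parse_job_shard("test-default-3-6"))
```

Counting files assumes every shard produced exactly one log; a missing log would make the derived total too small, which is worth keeping in mind when reading the column.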

rocm-repo-management-api Bot commented Apr 21, 2026

Jenkins build for 81c66e65c6790db005efda1c3918f359684ffd5b commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

When a test failure is already reported in the XML-based FAILED TESTS
table, it would also appear in LOG-BASED FAILURES whenever the same
shard's log contained a "failed!" or "FAILED CONSISTENTLY" line. That
made the summary look like two separate failures when there was only
one. The LOG-BASED section is meant for failures *not* captured by
XML (timeouts, crashes, process kills), so skip any entry whose
(arch, test_file, test_class, test_name) tuple already appears in the
FAILED TESTS table.

Also normalize test_file before comparing, since XML uses dotted paths
(e.g. distributed.test_symmetric_memory) while logs use slash paths
(distributed/test_symmetric_memory, sometimes with a trailing .py).
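
A minimal sketch of the normalization and dedup step, assuming each failure is a dict with `arch`, `test_file`, `test_class`, and `test_name` keys (the field and function names here are illustrative, not taken from generate_summary.py):

```python
def normalize_test_file(test_file):
    """Convert slash paths (optionally ending in .py) to the dotted XML form."""
    name = test_file.strip()
    if name.endswith(".py"):
        name = name[: -len(".py")]
    return name.replace("/", ".")


def failure_key(row):
    """(arch, test_file, test_class, test_name) with a normalized test_file."""
    return (
        row["arch"],
        normalize_test_file(row["test_file"]),
        row["test_class"],
        row["test_name"],
    )


def drop_xml_duplicates(log_failures, xml_failures):
    """Keep only log failures whose key is absent from the XML-based table."""
    seen = {failure_key(row) for row in xml_failures}
    return [row for row in log_failures if failure_key(row) not in seen]


print(normalize_test_file("distributed/test_symmetric_memory.py"))
# distributed.test_symmetric_memory
```

Normalizing both sides through the same function means the comparison does not depend on which source happened to use which path style.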

On run 24735028060 this drops the LOG-BASED section from 21 rows to
6 truly XML-missing timeouts.

rocm-repo-management-api Bot commented Apr 22, 2026

Jenkins build for 8d063809e99a31079baa98d915540d6f88df8a1b commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

…shards inventory

detect_log_failures.py now emits a sibling log_shards_<arch>.csv alongside
log_failures_<arch>.csv, capturing every (platform, test_config, job_shard,
test_file) -> observed test-level shards combination seen in the raw CI logs.

generate_summary.py consumes the inventory to back-fill a "Test-Level Shard
(rocm)" and "Test-Level Shard (cuda)" column in the XML-based FAILED TESTS
table (XML artifacts don't contain test-level shard metadata, so we recover
it by matching the job-level shard + test file to the log inventory). For
intra-file-sharded test files (e.g. test_torchinductor_opinfo_properties
split into 14 pytest shards), the value is rendered compactly as "1,6,12/14".
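
The compact rendering could look something like the sketch below. It assumes the observed test-level shards arrive as (index, total) pairs; the function name and the "; " separator used when a file shows more than one total are assumptions, not behavior confirmed by the PR.

```python
def format_test_shards(shards):
    """Render (index, total) pairs compactly, e.g. '1,6,12/14'."""
    by_total = {}
    for index, total in shards:
        by_total.setdefault(total, set()).add(index)
    return "; ".join(
        ",".join(str(i) for i in sorted(indices)) + f"/{total}"
        for total, indices in sorted(by_total.items())
    )


print(format_test_shards([(6, 14), (1, 14), (12, 14)]))  # 1,6,12/14
```

Collecting indices into a set first means a shard seen in several log lines is listed only once.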

The LOG-BASED FAILURES table already displayed test-level shard per entry;
no change there beyond the existing column.

parity.yml: exclude log_shards_*.csv from the CSV discovery glob in the
summarize step so the new inventory file isn't mistaken for a parity CSV.

rocm-repo-management-api Bot commented Apr 22, 2026

Jenkins build for 8d063809e99a31079baa98d915540d6f88df8a1b commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

… in LOG-BASED FAILURES

detect_log_failures.py:
- parse_log_file now also returns a flaky_tests list. When CI's
  "Test succeeded in new process, continuing with the rest of the tests"
  marker follows an individual-test PASSED line, the corresponding test
  is recorded as flaky (the preceding normal-process run failed, hence
  the rerun).
- scan_logs emits these as structured records with platform, test_config,
  test_file, test_class, test_name, job_shard, and test_shard.
- A sibling flaky_tests_<arch>.csv is written next to
  log_failures_<arch>.csv, via the generalized _derive_sibling_path().
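
One plausible reading of the flaky detection described above, sketched for illustration only: remember the last test whose id line also reports PASSED, and record it as flaky when the rerun marker follows. The regex for the test id and the function name are assumptions; the actual parse_log_file may work differently.

```python
import re

# Marker text quoted in the commit message above.
RERUN_PASSED_MARKER = "Test succeeded in new process, continuing with the rest of the tests"

# Assumed shape of a pytest verbose test id: path/file.py::Class::test_name
TEST_ID_RE = re.compile(r"(?P<file>[\w/]+\.py)::(?P<cls>\w+)::(?P<name>\w+)")


def find_flaky_tests(lines):
    """Return (test_file, test_class, test_name) tuples that passed on rerun."""
    flaky = []
    last_passed = None
    for line in lines:
        match = TEST_ID_RE.search(line)
        if match:
            # Remember the test only if this same line reports PASSED.
            last_passed = match.group("file", "cls", "name") if "PASSED" in line else None
        elif RERUN_PASSED_MARKER in line and last_passed is not None:
            flaky.append(last_passed)
            last_passed = None
    return flaky
```

Note that this version only recognizes PASSED when it sits on the same line as the test id, so output interleaved between the two would cause a miss.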

generate_summary.py:
- load_flaky_tests_as_log_failures() reads the flaky CSV and shapes it
  like log-failure rows with category='FLAKY'. main() appends these to
  the log_failures list.
- FLAKY entries are exempted from the XML-vs-log dedup filter in the
  LOG-BASED FAILURES table, since a rerun-passed signal is orthogonal to
  any hard failure recorded in XML.
- Cross-arch "Also Failing In" now naturally links matching flaky tests
  across architectures.

Verified locally on run 24735028060 artifacts: 20 flaky entries for
mi200 and 9 for mi355 (exact 1:1 with "Test succeeded in new process"
log lines), including tests like
test_flex_attention_with_dynamic_max_autotune_graph_partition_cuda and
test_template_epilogue_fusion_static_analysis_...use_async_compile_True
that the dashboard owner flagged from run 24796654604.

rocm-repo-management-api Bot commented Apr 22, 2026

Jenkins build for 2051d20f0c6d70d2d7b9ca0644c3a2f1a6f5d9ab commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

The summarize job picks the first matching *.csv in the per-arch
artifact dir, filtering out auxiliary files. Now that
detect_log_failures.py also emits a sibling flaky_tests_<arch>.csv,
it can be mistakenly picked up as the parity CSV (e.g. when
ordering puts it first), causing generate_summary.py to crash with
KeyError: 'status_set1'. Add it to the exclusion list alongside
log_failures_*.csv and log_shards_*.csv.
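
The selection logic can be illustrated in Python as below. This is only a sketch of the idea: the actual step lives in parity.yml and may be written as shell, and the function name and the parity_mi355.csv filename are hypothetical.

```python
from fnmatch import fnmatchcase

# Auxiliary files named in this PR that must not be treated as the parity CSV.
AUXILIARY_PATTERNS = ("log_failures_*.csv", "log_shards_*.csv", "flaky_tests_*.csv")


def pick_parity_csv(filenames):
    """Return the first non-auxiliary *.csv in sorted order, or None."""
    for name in sorted(filenames):
        if not name.endswith(".csv"):
            continue
        if any(fnmatchcase(name, pattern) for pattern in AUXILIARY_PATTERNS):
            continue
        return name
    return None


files = ["flaky_tests_mi355.csv", "log_failures_mi355.csv", "log_shards_mi355.csv", "parity_mi355.csv"]
print(pick_parity_csv(files))  # parity_mi355.csv
```

Without the flaky_tests_*.csv pattern, the sorted order in this example would select flaky_tests_mi355.csv first, which is the failure mode the commit describes.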

rocm-repo-management-api Bot commented Apr 22, 2026

Jenkins build for 2051d20f0c6d70d2d7b9ca0644c3a2f1a6f5d9ab commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@jithunnair-amd jithunnair-amd changed the title from "Show only ROCm failures in parity summary and add cross-arch column" to "[CI] Show only ROCm failures in parity summary and add cross-arch column" Apr 23, 2026
Collaborator

@jithunnair-amd jithunnair-amd left a comment


These are some awesome improvements! While cross-checking, I found that the flaky-test detection missed some tests, such as the following one in shard 5 for mi355:

2026-04-22T10:26:36.9749161Z inductor/test_max_autotune.py::TestMaxAutotuneAsyncPipelined::test_triton_error_precompilation_and_autotuning E0422 10:25:44.014000 1154211 site-packages/torch/_inductor/select_algorithm.py:3854] [0/0] Runtime error for autotuning triton choices, defaulting to extern kernels.
2026-04-22T10:26:36.9750048Z W0422 10:25:46.007000 1155334 site-packages/torch/_native/cutedsl_utils.py:55] CuTeDSL operators require optional Python packages `nvidia-cutlass-dsl` and `apache-tvm-ffi`; missing optional dependency `nvidia_cutlass_dsl` (importlib.util.find_spec(nvidia_cutlass_dsl) failed)
2026-04-22T10:26:36.9750731Z /var/lib/jenkins/pytorch/test/inductor/test_max_autotune.py:123: FutureWarning: torch.cuda._set_allocator_settings is deprecated. Use torch._C._accelerator_setAllocatorSettings instead.
2026-04-22T10:26:36.9751116Z   torch.cuda.memory._set_allocator_settings("expandable_segments:False")
2026-04-22T10:26:36.9751479Z E0422 10:26:02.003000 1154211 site-packages/torch/_inductor/select_algorithm.py:3854] [0/0] Runtime error for autotuning triton choices, defaulting to extern kernels.
2026-04-22T10:26:36.9751791Z PASSED [19.3006s] [100%]
2026-04-22T10:26:36.9751858Z 
2026-04-22T10:26:36.9752076Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_max_autotune/inductor.test_max_autotune-93dbff3468a90d16.xml -
2026-04-22T10:26:36.9752409Z ====================== 1 passed, 276 deselected in 19.34s ======================
2026-04-22T10:26:36.9753715Z Got exit code 0
2026-04-22T10:26:36.9753859Z Test succeeded in new process, continuing with the rest of the tests

We can refine the flaky tests logic to be more robust.
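
One possible direction for such a refinement, offered as a sketch rather than the fix the authors adopted: in the excerpt above the test id and its PASSED result sit on different lines with warnings in between, so the parser could track the most recent test id and accept a PASSED that arrives later. The regexes and function name are assumptions for illustration.

```python
import re

RERUN_PASSED_MARKER = "Test succeeded in new process"

# Assumed shape of a pytest verbose test id: path/file.py::Class::test_name
TEST_ID_RE = re.compile(r"(?P<file>[\w/]+\.py)::(?P<cls>\w+)::(?P<name>\w+)")

# Leading CI timestamp, e.g. "2026-04-22T10:26:36.9751791Z "
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T[\d:.]+Z ?")


def find_flaky_tests_multiline(lines):
    """Detect rerun-passed tests even when PASSED is on a later line."""
    flaky = []
    pending = None   # most recent test id seen
    passed = False   # whether that test has reported PASSED
    for raw in lines:
        line = TIMESTAMP_RE.sub("", raw)
        match = TEST_ID_RE.search(line)
        if match:
            pending = match.group("file", "cls", "name")
            passed = "PASSED" in line[match.end():]
        elif line.startswith("PASSED"):
            passed = pending is not None
        elif line.startswith(("FAILED", "ERROR")):
            passed = False
        elif RERUN_PASSED_MARKER in line:
            if pending is not None and passed:
                flaky.append(pending)
            pending, passed = None, False
    return flaky
```

Stripping the timestamp first lets the result keywords be matched at the start of the line, which avoids mistaking a warning that merely mentions PASSED for a test result.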

@jithunnair-amd jithunnair-amd merged commit 9f8ad3e into develop Apr 23, 2026
4 of 7 checks passed
@jithunnair-amd jithunnair-amd deleted the parity-summary-improvements branch April 23, 2026 14:35

2 participants