Releases: pytorch/test-infra
v20250812-181827
Replace lazyproperty with cached_property (#6996)
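The release note gives only the PR title, so the following is a minimal sketch of what such a swap usually looks like, assuming the code in question is Python and the replacement is the standard library's `functools.cached_property`. The `RunnerConfig` class and its `settings` attribute are hypothetical names used only for illustration.

```python
from functools import cached_property


class RunnerConfig:
    """Hypothetical class; not taken from the test-infra code base."""

    def __init__(self) -> None:
        self.load_count = 0

    @cached_property
    def settings(self) -> dict:
        # The body runs once per instance; the result is then stored on the
        # instance and returned directly on later accesses.
        self.load_count += 1
        return {"region": "us-east-1"}


config = RunnerConfig()
first = config.settings
second = config.settings
```

Because the value is cached on the instance, `first` and `second` refer to the same object and the loader body executes only once.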
v20250812-121831
[autorevert] implement autorevert and fix detection logic (#6983)
### Summary
- Implemented revert detection/recording
- Implemented failure-only rule matching in the autorevert detector to prevent “success” jobs with a classification label from contaminating pattern detection
- Added a unit test
### Bug Fixed
- Cause: The detector previously matched on `classification_rule` regardless of job `conclusion`. Baseline commit `33ec6e3` had multiple “success” shards labeled with `rule='pytest failure'`, which the detector misread as “older commit already has the same failure,” suppressing the pattern for `bbc0df1`/`4fd5fab`.
- Fix: Require `conclusion == 'failure'` wherever the detector compares rules (both for newer commit confirmation and older baseline exclusion). This prevents noise from success+rule rows and correctly flags commit-caused failures like the ROCm case.
### Testing
<details>
<summary>python -m pytorch_auto_revert autorevert-checker rocm --hours 82 --do-restart --dry-run</summary>
```
python -m pytorch_auto_revert autorevert-checker rocm --hours 82 --do-restart --dry-run
Fetching workflow data for 1 workflows since 2025-08-04T08:56:25.851470...
Found 161 commits with job data for workflow 'rocm'
✓ 3 AUTOREVERT PATTERNS DETECTED
Pattern #1:
Failure rule: 'pytest failure'
Recent commits with failure: bdb07a2b 8085edc8
Older commit without failure: 41081276
✗ NOT REVERTED: 8085edc8f9c98f670f585586b4286a942927537a was not reverted
⟳ DRY RUN: Would restart rocm for 8085edc8
⟳ DRY RUN: Would restart rocm for 41081276
Pattern #2:
Failure rule: 'pytest failure'
Recent commits with failure: 908c5cc4 b6c53383
Older commit without failure: 33ec6e3e
✗ NOT REVERTED: b6c53383fe2f29e6ed35430e90867dbeb8980d42 was not reverted
⟳ DRY RUN: Would restart rocm for b6c53383
⟳ DRY RUN: Would restart rocm for 33ec6e3e
Pattern #3:
Failure rule: 'pytest failure'
Recent commits with failure: 4fd5fabe bbc0df10
Older commit without failure: efc4b460
✓ REVERTED (nosignal): bbc0df1094b5a4dcd2cce83f8402127b07913231 was reverted by 41081276 after 18.5 hours
==================================================
SUMMARY STATISTICS
==================================================
Workflow(s): rocm
Timeframe: 82 hours
Commits checked: 161
Auto revert patterns detected: 3
Actual reverts inside auto revert patterns detected (precision): 1 (33.3%)
Total revert commits in period: 9
Revert categories:
nosignal: 5 (55.6%)
ignoredsignal: 2 (22.2%)
ghfirst: 2 (22.2%)
Total reverts excluding ghfirst: 7
Reverts (excluding ghfirst) that dont match any auto revert pattern detected (recall): 6 (85.7%)
Per workflow precision:
rocm: 1 reverts out of 3 patterns (33.3%) [excluding ghfirst: 1 (33.3%)]
Reverted patterns:
- pytest failure: bbc0df10 (nosignal)
Restarted workflows: 4
- rocm for 8085edc8
- rocm for 41081276
- rocm for b6c53383
- rocm for 33ec6e3e
```
</details>
The actual culprit was correctly identified:
```
Pattern #7:
Failure rule: 'pytest failure'
Recent commits with failure: 4fd5fabe bbc0df10
Older commit without failure: efc4b460
✓ REVERTED (nosignal): bbc0df1094b5a4dcd2cce83f8402127b07913231 was reverted by 41081276 after 18.5 hours
```
Multiple patterns are detected because the failure was jumping across **workflows**: rocm and rocm-mi300
---------
Co-authored-by: Jean Schmidt <[email protected]>
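The failure-only rule matching described for #6983 can be sketched roughly as follows. This is not the detector's actual code: the function name `has_failure_with_rule` and the dict-based job rows are assumptions, and only the `conclusion` and `classification_rule` field names come from the PR description.

```python
from typing import Iterable, Mapping


def has_failure_with_rule(jobs: Iterable[Mapping[str, str]], rule: str) -> bool:
    """Return True only if some job both failed and carries the given rule.

    Hypothetical helper: rows that have a classification rule but a
    'success' conclusion are ignored, which is the behavior the fix describes.
    """
    return any(
        job.get("conclusion") == "failure"
        and job.get("classification_rule") == rule
        for job in jobs
    )


# A baseline commit whose shards succeeded but still carry a rule label.
baseline_jobs = [
    {"conclusion": "success", "classification_rule": "pytest failure"},
    {"conclusion": "success", "classification_rule": "pytest failure"},
]
# A newer commit with a genuinely failing shard.
newer_jobs = [
    {"conclusion": "failure", "classification_rule": "pytest failure"},
    {"conclusion": "success", "classification_rule": ""},
]

baseline_has_failure = has_failure_with_rule(baseline_jobs, "pytest failure")
newer_has_failure = has_failure_with_rule(newer_jobs, "pytest failure")
```

With the conclusion check in place, the labeled success rows on the baseline no longer count as an earlier occurrence of the failure, so the newer commit can still be flagged.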
v20250811-123656
[autorevert] fix pending job check (#6982)
Pending job detection was incorrect.
- Before: `has_pending_jobs` checked `status == "pending"`, which GitHub Actions does not emit; statuses are typically `queued`, `in_progress`, `completed`.
- Now: Treat any job with `status != "completed"` as pending.
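A minimal sketch of the corrected check, assuming each job is a mapping with a `status` key. The function signature is an assumption; only the name `has_pending_jobs` and the status values appear in the PR description.

```python
from typing import Iterable, Mapping


def has_pending_jobs(jobs: Iterable[Mapping[str, str]]) -> bool:
    """Treat every job that has not reached 'completed' as pending.

    Sketch only: the real method's signature and data model may differ.
    """
    return any(job.get("status") != "completed" for job in jobs)


all_done = [{"status": "completed"}, {"status": "completed"}]
still_running = [{"status": "completed"}, {"status": "in_progress"}]
still_queued = [{"status": "queued"}]
```

Comparing against the single terminal state avoids having to enumerate every non-terminal status the API may report.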
v20250807-215637
[autorevert] Fix bug on job retrieval for autorevert lambda (#6979)
After fixing the bug, the new statistics are:
```
==================================================
SUMMARY STATISTICS
==================================================
Workflow(s): Lint, trunk, pull, inductor, linux-binary-manywheel
Timeframe: 4320 hours
Commits checked: 33519
Auto revert patterns detected: 1345
Actual reverts inside auto revert patterns detected (precision): 219 (16.3%)
Total revert commits in period: 585
Revert categories:
nosignal: 202 (34.5%)
ghfirst: 156 (26.7%)
uncategorized: 104 (17.8%)
ignoredsignal: 68 (11.6%)
weird: 45 (7.7%)
landrace: 10 (1.7%)
Total reverts excluding ghfirst: 429
Reverts (excluding ghfirst) that dont match any auto revert pattern detected (recall): 250 (58.3%)
Per workflow precision:
Lint: 45 reverts out of 75 patterns (60.0%) [excluding ghfirst: 41 (54.7%)]
trunk: 30 reverts out of 136 patterns (22.1%) [excluding ghfirst: 28 (20.6%)]
pull: 104 reverts out of 859 patterns (12.1%) [excluding ghfirst: 92 (10.7%)]
inductor: 39 reverts out of 269 patterns (14.5%) [excluding ghfirst: 38 (14.1%)]
linux-binary-manywheel: 1 reverts out of 6 patterns (16.7%) [excluding ghfirst: 0 (0.0%)]
```
The main bug was that, when looking up the commits before and after a given commit that ran the same job, the code concatenated all commits and jobs before and after instead of returning only the next one.
I also added ergonomics for lambda invocation.
---------
Co-authored-by: Ivan Zaitsev <[email protected]>
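The lookup behavior described for #6979 (return only the next commit that ran the same job, rather than everything before or after) might look like the sketch below. The function name `next_commit_with_job`, the newest-first list layout, and the `sha`/`jobs` keys are all assumptions made for illustration.

```python
from typing import List, Mapping, Optional


def next_commit_with_job(
    commits: List[Mapping], start: int, job_name: str, step: int
) -> Optional[str]:
    """Walk from `start` in direction `step` (+1 = older, -1 = newer, given a
    newest-first list) and return the SHA of the first commit that ran
    `job_name`, or None if there is none. Hypothetical helper."""
    index = start + step
    while 0 <= index < len(commits):
        if job_name in commits[index]["jobs"]:
            return commits[index]["sha"]  # stop at the first match only
        index += step
    return None


# Newest first; not every commit ran every job.
history = [
    {"sha": "c3", "jobs": {"test-a", "test-b"}},
    {"sha": "c2", "jobs": {"test-b"}},
    {"sha": "c1", "jobs": {"test-a"}},
    {"sha": "c0", "jobs": {"test-a"}},
]

older = next_commit_with_job(history, 0, "test-a", +1)
newer = next_commit_with_job(history, 2, "test-a", -1)
```

Returning at the first match is the point of the sketch: commits further away that also ran the job are never considered.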
v20250806-142042
Base64 decode the app secret for pytorch-auto-revert lambda (#6967)
Because of the way the lambdas are structured and the Terraform is set up, it is not possible to pass newlines in environment variables. The GitHub app key, which is a private key, therefore needs to be base64 encoded.
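Reading such a key back on the lambda side could be sketched as below. The environment variable name `GITHUB_APP_SECRET_B64` and the helper `load_app_private_key` are hypothetical; the release note does not say what the real variable is called.

```python
import base64
import os


def load_app_private_key(env_var: str = "GITHUB_APP_SECRET_B64") -> str:
    """Decode a base64-encoded, multi-line private key from the environment.

    The variable name is an assumption for this sketch.
    """
    encoded = os.environ[env_var]
    return base64.b64decode(encoded).decode("utf-8")


# Simulate what the deployment side would do: encode a multi-line key so it
# fits in a single-line environment variable. The key body is a placeholder.
fake_pem = "-----BEGIN PRIVATE KEY-----\nplaceholder\n-----END PRIVATE KEY-----\n"
os.environ["GITHUB_APP_SECRET_B64"] = base64.b64encode(
    fake_pem.encode("utf-8")
).decode("ascii")

decoded_key = load_app_private_key()
```

The encoded value contains no newlines, and decoding restores the original line breaks that the key format depends on.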
v20250805-233507
[autorevert] Add workflow restarts (#6962)
### Summary
Adds workflow restart capability to the PyTorch auto-revert tool,
enabling automatic re-running of workflows for commits that match
autorevert patterns but haven't been reverted yet.
### Changes
- Added restart methods to WorkflowRestartChecker:
  - restart_workflow(): Restarts a workflow for a specific commit with duplicate prevention
  - Checks ClickHouse for existing restarts before attempting
- Enhanced autorevert-checker command:
  - Added --do-restart flag to enable automatic workflow restarts
  - Added --dry-run flag to preview restart actions without execution
  - Restarts workflows only for non-reverted commits matching autorevert patterns
- Fixed workflow naming consistency:
  - Normalized workflow names by removing .yml extension for ClickHouse queries
  - Added .yml extension only for GitHub API calls
- Updated do-restart command:
  - Now requires commit SHA (removed unused restart_latest_workflow)
  - Leverages same restart logic with duplicate prevention
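The workflow naming fix in the list above can be illustrated with a small sketch. Both helper names are hypothetical; the only behavior taken from the PR description is that the `.yml` suffix is dropped for ClickHouse queries and added back only for GitHub API calls.

```python
def normalize_workflow_name(name: str) -> str:
    """Name as used in ClickHouse queries: without the .yml suffix."""
    return name[: -len(".yml")] if name.endswith(".yml") else name


def github_workflow_file(name: str) -> str:
    """File name as used for GitHub API calls: always with the .yml suffix."""
    return normalize_workflow_name(name) + ".yml"


clickhouse_name = normalize_workflow_name("trunk.yml")
api_name = github_workflow_file("trunk")
```

Normalizing first and appending the suffix in exactly one place means either spelling of the name produces the same result on both sides.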
### Usage
```
# Check patterns and restart workflows
python -m pytorch_auto_revert autorevert-checker pull trunk --do-restart
# Dry run to preview restarts
python -m pytorch_auto_revert autorevert-checker pull trunk --do-restart --dry-run
# Manual restart
python -m pytorch_auto_revert do-restart trunk abc123def
```
### Testing
```
python -m pytorch_auto_revert autorevert-checker inductor --hours 12 --do-restart --dry-run
Fetching workflow data for 1 workflows since 2025-07-30T22:47:17.357776...
Found 19 commits with job data for workflow 'inductor'
✓ 1 AUTOREVERT PATTERN DETECTED
Pattern #1:
Failure rule: 'GHA error'
Recent commits with failure: f89c28cc 5b2ad927
Older commit without failure: 7a4167a1
✗ NOT REVERTED: 5b2ad9279cb2e440d45253d28f2101a75fd42344 was not reverted
⟳ ALREADY RESTARTED: inductor for 5b2ad927
==================================================
SUMMARY STATISTICS
==================================================
Workflow(s): inductor
Timeframe: 12 hours
Commits checked: 19
Auto revert patterns detected: 1
Actual reverts inside auto revert patterns detected (precision): 0 (0.0%)
Total revert commits in period: 0
Total reverts excluding ghfirst: 0
No non-ghfirst reverts found in the period
Per workflow precision:
inductor: 0 reverts out of 1 patterns (0.0%) [excluding ghfirst: 0 (0.0%)]
```
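The duplicate prevention behind the `ALREADY RESTARTED` line in the output above can be sketched with an in-memory stand-in. In the tool itself the existing restarts are looked up in ClickHouse; the `InMemoryRestartTracker` class, its `restart_once` method, and the returned status strings are assumptions for illustration and do not mirror `WorkflowRestartChecker`.

```python
from typing import Set, Tuple


class InMemoryRestartTracker:
    """Hypothetical stand-in: a set replaces the ClickHouse lookup."""

    def __init__(self) -> None:
        self._restarted: Set[Tuple[str, str]] = set()

    def restart_once(self, workflow: str, sha: str, dry_run: bool = False) -> str:
        key = (workflow, sha)
        if key in self._restarted:
            return "already_restarted"  # skip: this pair was restarted before
        if dry_run:
            return "dry_run"  # report the action without recording it
        self._restarted.add(key)  # the real tool would call the GitHub API here
        return "restarted"


tracker = InMemoryRestartTracker()
preview = tracker.restart_once("inductor", "5b2ad927", dry_run=True)
first_attempt = tracker.restart_once("inductor", "5b2ad927")
second_attempt = tracker.restart_once("inductor", "5b2ad927")
```

A dry run leaves no record, so a later real invocation still performs the restart, while a repeated real invocation is skipped.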
v20250730-214538
[BE] Use info logs for normal operations (#6955)
It is normal for no runners to be available for reuse, so those events should not be logged as errors.
v20250730-141611
Filter out restarted jobs from CH queries (#6954)
As a prerequisite for running autorevert in shadow mode, we need to filter out restarted jobs from the existing queries, where they might skew the results.
v20250716-153909
runners: Remove SSM deletion from terminateRunner (#6934)
This is superseded by our addition of an expiration policy to our SSM parameters. I also think that this was contributing to us being slow.
See:
* https://github.com/pytorch/test-infra/pull/6885
---------
Signed-off-by: Eli Uriegas <[email protected]>
v20250715-232140
runners: Add expiration policy to SSM parameters (#6885)
This adds an expiration policy to the SSM parameters for the runners, ensuring that the parameters are deleted after 30 minutes.
GitHub runner tokens typically have a 1 hour expiration time, but our runners are expected to be up much sooner than that, so 30 minutes is a good balance. If a runner has not connected to GitHub within 30 minutes, we will more than likely have spun it down, and it will be deleted.
This is an attempted re-land of 2 commits:
* #6855
* #6858
---------
Signed-off-by: Eli Uriegas <[email protected]>
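For readers unfamiliar with SSM parameter policies, the sketch below builds the JSON document for a 30-minute `Expiration` policy. It is written in Python for illustration only and does not show the runner code; the helper name `expiration_policy` is hypothetical, and the policy shape follows the AWS Systems Manager parameter policy format as I understand it, so it is worth checking against the AWS documentation before relying on it.

```python
import json
from datetime import datetime, timedelta, timezone


def expiration_policy(now: datetime, ttl_minutes: int = 30) -> str:
    """Return a JSON array with one Expiration policy `ttl_minutes` from `now`.

    Hypothetical helper; the resulting string is the kind of value passed as
    the policies field when a parameter is created.
    """
    expires_at = now.astimezone(timezone.utc) + timedelta(minutes=ttl_minutes)
    policy = {
        "Type": "Expiration",
        "Version": "1.0",
        "Attributes": {
            # ISO 8601 timestamp in UTC
            "Timestamp": expires_at.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
        },
    }
    return json.dumps([policy])


created_at = datetime(2025, 7, 15, 23, 0, 0, tzinfo=timezone.utc)
policies_json = expiration_policy(created_at)
parsed = json.loads(policies_json)
```

Letting the parameter store expire the entry removes the need for an explicit delete call when a runner is terminated.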