Releases: pytorch/test-infra

v20250812-181827

12 Aug 18:20
dc202fb

Replace lazyproperty with cached_property (#6996)

v20250812-121831

12 Aug 12:20
e9bc36e

[autorevert] implement autorevert and fix detection logic (#6983)

### Summary

- Implemented revert detection/recording
- Implemented failure-only rule matching in the autorevert detector to
prevent “success” jobs with a classification label from contaminating
pattern detection
- Added a unit test


### Bug Fixed

- Cause: The detector previously matched on `classification_rule`
regardless of job `conclusion`. Baseline commit `33ec6e3` had multiple
“success” shards labeled with `rule='pytest failure'`, which the
detector misread as “older commit already has the same failure,”
suppressing the pattern for `bbc0df1`/`4fd5fab`.
- Fix: Require `conclusion == 'failure'` wherever the detector compares
rules (both for newer commit confirmation and older baseline
exclusion). This prevents noise from success+rule rows and correctly
flags commit-caused failures like the ROCm case.
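
The fix described above can be sketched roughly as follows. This is a minimal illustration, not the detector's actual code: the `JobRow` shape and the `has_failure_with_rule` helper are hypothetical names chosen for the example, and only the `conclusion` / `classification_rule` fields come from the description.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class JobRow:
    """Hypothetical row shape; the real detector's fields may differ."""

    sha: str
    conclusion: str           # e.g. "success" or "failure"
    classification_rule: str  # e.g. "pytest failure"


def has_failure_with_rule(jobs: Iterable[JobRow], rule: str) -> bool:
    """Return True only if some job both failed and carries the given rule."""
    return any(
        job.conclusion == "failure" and job.classification_rule == rule
        for job in jobs
    )


# Baseline shards that succeeded but still carry a rule label.
baseline = [
    JobRow("33ec6e3", "success", "pytest failure"),
    JobRow("33ec6e3", "success", "pytest failure"),
]
# A newer commit with a real failure.
newer = [JobRow("bbc0df1", "failure", "pytest failure")]

print(has_failure_with_rule(baseline, "pytest failure"))  # False
print(has_failure_with_rule(newer, "pytest failure"))     # True
```

With the conclusion check in place, the labeled "success" shards on the baseline no longer count as an earlier occurrence of the failure.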


### Testing

<details>
<summary>python -m pytorch_auto_revert autorevert-checker rocm --hours 82 --do-restart --dry-run</summary>

```
python -m pytorch_auto_revert autorevert-checker rocm --hours 82 --do-restart --dry-run
Fetching workflow data for 1 workflows since 2025-08-04T08:56:25.851470...
Found 161 commits with job data for workflow 'rocm'
✓ 3 AUTOREVERT PATTERNS DETECTED

Pattern #1:
Failure rule: 'pytest failure'
Recent commits with failure: bdb07a2b 8085edc8
Older commit without failure: 41081276
✗ NOT REVERTED: 8085edc8f9c98f670f585586b4286a942927537a was not reverted
  ⟳ DRY RUN: Would restart rocm for 8085edc8
  ⟳ DRY RUN: Would restart rocm for 41081276

Pattern #2:
Failure rule: 'pytest failure'
Recent commits with failure: 908c5cc4 b6c53383
Older commit without failure: 33ec6e3e
✗ NOT REVERTED: b6c53383fe2f29e6ed35430e90867dbeb8980d42 was not reverted
  ⟳ DRY RUN: Would restart rocm for b6c53383
  ⟳ DRY RUN: Would restart rocm for 33ec6e3e

Pattern #3:
Failure rule: 'pytest failure'
Recent commits with failure: 4fd5fabe bbc0df10
Older commit without failure: efc4b460
✓ REVERTED (nosignal): bbc0df1094b5a4dcd2cce83f8402127b07913231 was reverted by 41081276 after 18.5 hours

==================================================
SUMMARY STATISTICS
==================================================
Workflow(s): rocm
Timeframe: 82 hours
Commits checked: 161
Auto revert patterns detected: 3
Actual reverts inside auto revert patterns detected (precision): 1 (33.3%)
Total revert commits in period: 9

Revert categories:
  nosignal: 5 (55.6%)
  ignoredsignal: 2 (22.2%)
  ghfirst: 2 (22.2%)

Total reverts excluding ghfirst: 7
Reverts (excluding ghfirst) that dont match any auto revert pattern detected (recall): 6 (85.7%)
Per workflow precision:
  rocm: 1 reverts out of 3 patterns (33.3%) [excluding ghfirst: 1 (33.3%)]

Reverted patterns:
  - pytest failure: bbc0df10 (nosignal)

Restarted workflows: 4
  - rocm for 8085edc8
  - rocm for 41081276
  - rocm for b6c53383
  - rocm for 33ec6e3e
  ```
</details>

The actual culprit was correctly identified:
```
Pattern #7:
Failure rule: 'pytest failure'
Recent commits with failure: 4fd5fabe bbc0df10
Older commit without failure: efc4b460
✓ REVERTED (nosignal): bbc0df1094b5a4dcd2cce83f8402127b07913231 was reverted by 41081276 after 18.5 hours
```

Multiple patterns are detected because the failure appeared across **workflows**: rocm and rocm-mi300.

---------

Co-authored-by: Jean Schmidt <[email protected]>

v20250811-123656

11 Aug 12:38
29f1bcc

[autorevert] fix pending job check (#6982)

Pending job detection was incorrect.
- Before: `has_pending_jobs` checked `status == "pending"`, which GitHub
Actions does not emit; statuses are typically `queued`, `in_progress`,
`completed`.
- Now: Treat any job with `status != "completed"` as pending.
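
A minimal sketch of the corrected check, assuming each job is a mapping with a `status` key. The name `has_pending_jobs` comes from the description above; its signature and the job shape here are assumptions made for illustration.

```python
from typing import Iterable, Mapping


def has_pending_jobs(jobs: Iterable[Mapping[str, str]]) -> bool:
    """Treat any job whose status is not "completed" as still pending."""
    return any(job.get("status") != "completed" for job in jobs)


running = [{"status": "completed"}, {"status": "queued"}]
finished = [{"status": "completed"}, {"status": "completed"}]

print(has_pending_jobs(running))   # True
print(has_pending_jobs(finished))  # False
```

Comparing against `"completed"` rather than listing the in-flight statuses means any other status value is also treated as pending.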

v20250807-215637

07 Aug 21:58
4eb9d0c

[autorevert] Fix bug on job retrieval for autorevert lambda (#6979)

After fixing the bug, the new statistics are:

```
==================================================
SUMMARY STATISTICS
==================================================
Workflow(s): Lint, trunk, pull, inductor, linux-binary-manywheel
Timeframe: 4320 hours
Commits checked: 33519
Auto revert patterns detected: 1345
Actual reverts inside auto revert patterns detected (precision): 219 (16.3%)
Total revert commits in period: 585

Revert categories:
  nosignal: 202 (34.5%)
  ghfirst: 156 (26.7%)
  uncategorized: 104 (17.8%)
  ignoredsignal: 68 (11.6%)
  weird: 45 (7.7%)
  landrace: 10 (1.7%)

Total reverts excluding ghfirst: 429
Reverts (excluding ghfirst) that dont match any auto revert pattern detected (recall): 250 (58.3%)
Per workflow precision:
  Lint: 45 reverts out of 75 patterns (60.0%) [excluding ghfirst: 41 (54.7%)]
  trunk: 30 reverts out of 136 patterns (22.1%) [excluding ghfirst: 28 (20.6%)]
  pull: 104 reverts out of 859 patterns (12.1%) [excluding ghfirst: 92 (10.7%)]
  inductor: 39 reverts out of 269 patterns (14.5%) [excluding ghfirst: 38 (14.1%)]
  linux-binary-manywheel: 1 reverts out of 6 patterns (16.7%) [excluding ghfirst: 0 (0.0%)]
```

The main bug is that when checking for commits before/after with the
same job, it actually concatenated all commits+jobs before and after,
instead of only returning the next one.
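
The intended lookup, returning only the nearest neighboring commit that ran the same job, might look like the sketch below. The commit shape (`sha` plus a set of job names), the list ordering, and the `next_commit_with_job` helper are all hypothetical; the actual data structures are not shown in this description.

```python
from typing import Optional, Sequence


def next_commit_with_job(
    commits: Sequence[dict], start: int, job_name: str, step: int
) -> Optional[dict]:
    """Walk from `start` in direction `step` (+1 or -1) and return the
    first commit that ran `job_name`, or None if there is none."""
    i = start + step
    while 0 <= i < len(commits):
        if job_name in commits[i]["jobs"]:
            return commits[i]  # stop at the first match; do not collect the rest
        i += step
    return None


commits = [
    {"sha": "a", "jobs": {"lint"}},
    {"sha": "b", "jobs": {"test"}},
    {"sha": "c", "jobs": {"lint", "test"}},
    {"sha": "d", "jobs": {"test"}},
]

print(next_commit_with_job(commits, 0, "test", +1)["sha"])  # b
print(next_commit_with_job(commits, 3, "lint", -1)["sha"])  # c
print(next_commit_with_job(commits, 0, "lint", -1))         # None
```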

I also added ergonomics for lambda invocation.

---------

Co-authored-by: Ivan Zaitsev <[email protected]>

v20250806-142042

06 Aug 14:22
a445ab6

Base64 decode the app secret for pytorch-auto-revert lambda (#6967)

Because of the way the lambdas are structured and the Terraform is set
up, it is not possible to pass newlines in environment variables.

So the GitHub App key, which is a private key, needs to be base64 encoded.
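
As a rough illustration of the decoding side, the sketch below reads a base64-encoded key from the environment and restores the multi-line PEM text. The environment variable name `GITHUB_APP_SECRET`, the `load_app_private_key` helper, and the placeholder key are assumptions for the example, not the lambda's actual names.

```python
import base64
import os


def load_app_private_key(env_var: str = "GITHUB_APP_SECRET") -> str:
    """Decode a base64-encoded key stored in an environment variable.

    The variable name is hypothetical.
    """
    encoded = os.environ[env_var]
    return base64.b64decode(encoded).decode("utf-8")


# Round trip with a placeholder (not a real key).
pem = "-----BEGIN EXAMPLE KEY-----\nplaceholder\n-----END EXAMPLE KEY-----\n"
os.environ["GITHUB_APP_SECRET"] = base64.b64encode(pem.encode("utf-8")).decode("ascii")

print("\n" in os.environ["GITHUB_APP_SECRET"])  # False
print(load_app_private_key() == pem)            # True
```

`base64.b64encode` produces a single line with no newlines, which is what makes the value safe to pass through an environment variable.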

v20250805-233507

05 Aug 23:37
bc704b6

[autorevert] Add workflow restarts (#6962)

###  Summary

Adds workflow restart capability to the PyTorch auto-revert tool,
enabling automatic re-running of workflows for commits that match
autorevert patterns but haven't been reverted yet.

###  Changes

  - Added restart methods to WorkflowRestartChecker:
    - restart_workflow(): Restarts a workflow for a specific commit with
      duplicate prevention
    - Checks ClickHouse for existing restarts before attempting

  - Enhanced autorevert-checker command:
    - Added --do-restart flag to enable automatic workflow restarts
    - Added --dry-run flag to preview restart actions without execution
    - Restarts workflows only for non-reverted commits matching
      autorevert patterns

  - Fixed workflow naming consistency:
    - Normalized workflow names by removing .yml extension for
      ClickHouse queries
    - Added .yml extension only for GitHub API calls

  - Updated do-restart command:
    - Now requires commit SHA (removed unused restart_latest_workflow)
    - Leverages same restart logic with duplicate prevention
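
The workflow naming rule in the list above could be expressed as two small helpers. This is a sketch under assumptions: the function names `clickhouse_workflow_name` and `github_workflow_file` are hypothetical and do not come from the tool itself.

```python
def clickhouse_workflow_name(name: str) -> str:
    """Normalize a workflow name for ClickHouse queries (no .yml suffix)."""
    suffix = ".yml"
    return name[: -len(suffix)] if name.endswith(suffix) else name


def github_workflow_file(name: str) -> str:
    """Produce the workflow file name for GitHub API calls (with .yml suffix)."""
    return name if name.endswith(".yml") else f"{name}.yml"


print(clickhouse_workflow_name("trunk.yml"))  # trunk
print(clickhouse_workflow_name("trunk"))      # trunk
print(github_workflow_file("trunk"))          # trunk.yml
print(github_workflow_file("trunk.yml"))      # trunk.yml
```

Both helpers are idempotent, so callers can pass either form without checking which one they hold.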

###  Usage

```
  # Check patterns and restart workflows
  python -m pytorch_auto_revert autorevert-checker pull trunk --do-restart

  # Dry run to preview restarts
  python -m pytorch_auto_revert autorevert-checker pull trunk --do-restart --dry-run

  # Manual restart
  python -m pytorch_auto_revert do-restart trunk abc123def
  ```
  
  ### Testing
  
  ```
  python -m pytorch_auto_revert autorevert-checker inductor --hours 12 --do-restart --dry-run
Fetching workflow data for 1 workflows since 2025-07-30T22:47:17.357776...
Found 19 commits with job data for workflow 'inductor'
✓ 1 AUTOREVERT PATTERN DETECTED

Pattern #1:
Failure rule: 'GHA error'
Recent commits with failure: f89c28cc 5b2ad927
Older commit without failure: 7a4167a1
✗ NOT REVERTED: 5b2ad9279cb2e440d45253d28f2101a75fd42344 was not reverted
  ⟳ ALREADY RESTARTED: inductor for 5b2ad927

==================================================
SUMMARY STATISTICS
==================================================
Workflow(s): inductor
Timeframe: 12 hours
Commits checked: 19
Auto revert patterns detected: 1
Actual reverts inside auto revert patterns detected (precision): 0 (0.0%)
Total revert commits in period: 0

Total reverts excluding ghfirst: 0
No non-ghfirst reverts found in the period
Per workflow precision:
  inductor: 0 reverts out of 1 patterns (0.0%) [excluding ghfirst: 0 (0.0%)]
  ```

v20250730-214538

30 Jul 21:47
e085fb7

[BE] Use info logs for normal operations (#6955)

It's normal for no runners to be available for reuse, so those events
should not be logged as errors.

v20250730-141611

30 Jul 14:18
a9b4e21

Filter out restarted jobs from CH queries (#6954)

As the prerequisite for running autorevert in shadow mode, we need to
filter out restarted jobs from the existing queries where they might
skew the results.

v20250716-153909

16 Jul 15:41
456e399

runners: Remove SSM deletion from terminateRunner (#6934)

This is superseded by our addition of an expiration policy to our SSM
parameters. I also think that the deletion was somewhat slowing us down.

See:
* https://github.com/pytorch/test-infra/pull/6885

---------

Signed-off-by: Eli Uriegas <[email protected]>

v20250715-232140

15 Jul 23:23
e22fe6e

runners: Add expiration policy to SSM parameters (#6885)

This adds an expiration policy to the SSM parameters for the runners.
This is to ensure that the parameters are deleted after 30 minutes.

Github Runner Tokens typically have a 1 hour expiration time, but our
runners are typically expected to be up way quicker than that so 30
minutes is a good balance for when we expect the runners to be up.

If a runner hasn't connected to GitHub within 30 minutes, we will
more than likely have spun it down and it will be deleted.
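
For illustration, a 30-minute expiration can be expressed as an SSM parameter policy document like the one built below. This sketch only constructs the JSON string and is written in Python for consistency with the other examples on this page; the `expiration_policy` helper is hypothetical, and how the runner code actually attaches the policy is not shown in this description. As far as I know, parameter policies also require the advanced parameter tier.

```python
import json
from datetime import datetime, timedelta, timezone


def expiration_policy(now: datetime, ttl_minutes: int = 30) -> str:
    """Build a JSON array with one SSM "Expiration" policy that takes
    effect `ttl_minutes` after `now` (expects a timezone-aware datetime)."""
    expires_at = now.astimezone(timezone.utc) + timedelta(minutes=ttl_minutes)
    policy = {
        "Type": "Expiration",
        "Version": "1.0",
        "Attributes": {
            "Timestamp": expires_at.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
        },
    }
    return json.dumps([policy])


created = datetime(2025, 7, 15, 23, 0, tzinfo=timezone.utc)
print(json.loads(expiration_policy(created))[0]["Attributes"]["Timestamp"])
# 2025-07-15T23:30:00.000Z
```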

This is an attempted re-land of 2 commits:
* #6855
* #6858

---------

Signed-off-by: Eli Uriegas <[email protected]>