
Releases: pytorch/test-infra

v20250317-134413

17 Mar 13:46
9f66a9e
Adds scaleUpHealing chron (#6412)

# TLDR 

This change introduces a new lambda, `${var.environment}-scale-up-chron`,
along with all the TypeScript code and required Terraform changes.

# What is changing?

This PR introduces the TypeScript code for the new lambda and the related
Terraform changes to run it every 30 minutes. The lambda should time out
after 15 minutes. Its permissions and access should be the same as those of
scaleUp.

It queries HUD at a URL specified by the user configuration
`retry_scale_up_chron_hud_query_url`, retrieves a list of instance types
and the number of jobs enqueued for each, and then synchronously tries to
deploy those runners.
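
As a rough illustration of that flow, the sketch below fetches the HUD
endpoint and provisions runners for each stale queue entry. The response
field names, the 30-minute filter, and the `provisionRunners` helper are
assumptions for illustration, not the actual lambda code.

```typescript
// Illustrative sketch only: field names and the provisioning helper are
// hypothetical, not the real HUD schema or scaleUp internals.
interface QueuedJobCount {
  runnerType: string;          // hypothetical field name
  queuedJobs: number;          // hypothetical field name
  minQueueTimeMinutes: number; // hypothetical field name
}

// Stand-in for the existing runner provisioning path used by scaleUp.
async function provisionRunners(runnerType: string, count: number): Promise<void> {
  console.log(`would provision ${count} runner(s) of type ${runnerType}`);
}

export async function scaleUpChron(hudQueryUrl: string): Promise<void> {
  const response = await fetch(hudQueryUrl);
  const queued = (await response.json()) as QueuedJobCount[];

  // Only recover jobs that have waited longer than the 30-minute threshold.
  const stale = queued.filter((q) => q.minQueueTimeMinutes >= 30);

  // Deploy synchronously, one instance type at a time.
  for (const entry of stale) {
    await provisionRunners(entry.runnerType, entry.queuedJobs);
  }
}
```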

It introduces 2 new parameters in the main module:
* `retry_scale_up_chron_hud_query_url`, which for now should point to
https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D
only in the installations that will benefit from it (both the Meta and Linux
Foundation PROD clusters, NOT canary); when this variable is set to the
empty string (the default), this cron is not installed.
* `scale_config_org`, which should point to the org where the scale-config
files are defined. In our case it is `pytorch`.

[example of the
change](https://github.com/pytorch-labs/pytorch-gha-infra/pull/622/files)
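
For a sense of how the lambda might consume these two parameters, here is a
minimal sketch assuming the Terraform module passes them through as
environment variables; the variable names below are hypothetical.

```typescript
// Hypothetical environment variable names; the real wiring is defined in the
// Terraform module, not here.
interface ChronConfig {
  hudQueryUrl: string;    // from retry_scale_up_chron_hud_query_url
  scaleConfigOrg: string; // from scale_config_org
}

export function loadChronConfig(): ChronConfig | undefined {
  const hudQueryUrl = process.env.RETRY_SCALE_UP_CHRON_HUD_QUERY_URL ?? '';
  const scaleConfigOrg = process.env.SCALE_CONFIG_ORG ?? '';

  // An empty URL (the default) means the cron is not installed and should no-op.
  if (hudQueryUrl === '') {
    return undefined;
  }
  return { hudQueryUrl, scaleConfigOrg };
}
```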

# Why are we changing this?

We're introducing this change to provide a way to recover lost
infra-scaling requests. It has been clear for a while that during GitHub
API outages we fail to receive new-job webhooks or fail to provision new
runners. Most of the time our retry mechanism is capable of dealing with
the situation, but when we are not receiving webhooks, or when other more
esoteric problems occur, there is no way to recover.

With this change, every 30 minutes, jobs that have been enqueued for longer
than 30 minutes for one of the autoscaled instance types trigger the
creation of those instances.

A few design decisions:
1 - Why rely on HUD?
HUD already has this information, so it is simple to just get it from
there;

2 - Why not send a scale message and let scaleUp handle it?
We want isolation, so that we can easily circuit-break the creation of
enqueued instances. This isolation also guarantees that if the scaler is
failing to deploy a given instance type, this mechanism won't risk flooding
or overflowing the main scaler, which has to deal with all the other types.

3 - Why randomize the instance creation order?
So that if some instance type is problematic, we are not completely
preventing the recovery of other instance types (just interfering with it).
We also gain some time between creations of instances of the same type,
allowing for smoother operation (see the sketch after this list).

4 - Why a new lambda?
See number 2.
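
Regarding decision 3, the randomization can be as simple as shuffling the
list of queue entries before provisioning; the sketch below is illustrative,
not the lambda's exact code.

```typescript
// Minimal Fisher-Yates shuffle used to randomize the processing order,
// so one failing instance type does not always block the others.
function shuffleInPlace<T>(items: T[]): T[] {
  for (let i = items.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [items[i], items[j]] = [items[j], items[i]];
  }
  return items;
}

// e.g. shuffleInPlace(stale) before the provisioning loop shown earlier.
```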

# What if something goes wrong?
Given the work we put into maximizing the isolation between the regular
scaler and the cron recovery scaler introduced here, we're not foreseeing
any gaps that could break the main scaler and, as a consequence, cause
system breakage.

Having said that, if you need to revert these changes from production,
just follow the steps in:
https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.jwflgevrww4j

---------

Co-authored-by: Zain Rizvi <[email protected]>
Co-authored-by: Camyll Harajli <[email protected]>

v20250313-185750

13 Mar 19:00
b3adb27
Adds additional tests to getRunnerTypes, simplifies code a bit, adds …

v20250310-124810

10 Mar 12:50
fc07220
Reuse Ephemeral runners (#6315)

# About

With the goal of eventually moving all instances to being ephemeral, we
need to fix the major limitation we have with ephemeral instances:
stockouts.

This is a problem because we currently release the instances as soon as
they finish a job.

The goal is to reuse instances before returning them to AWS by:

* Tagging ephemeral instances that finished a job with
`EphemeralRunnerFinished=finish_timestamp`, hinting to scaleUp that they
can be reused;
* scaleUp finds instances that have the `EphemeralRunnerFinished` tag and
tries to use them to run a new job;
* scaleUp acquires a lock on the instance name to avoid concurrent reuse;
* scaleUp marks re-deployed instances with an
`EBSVolumeReplacementRequestTm` tag recording when the instance was marked
for reuse;
* scaleUp removes `EphemeralRunnerFinished` so others won't find the same
instance for reuse;
* scaleUp creates the necessary SSM parameters and returns the instance to
a fresh state by restoring its EBS volume (a sketch of this tag handshake
follows the ScaleDown note below).

ScaleDown then:
* Avoids removing ephemeral instances before `minRunningTime`, using either
the creation time, `EphemeralRunnerFinished`, or
`EBSVolumeReplacementRequestTm`, depending on the instance status.
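
The sketch below shows one way the tag handshake above could look with the
AWS SDK v3 EC2 client. The locking, SSM parameter creation, and EBS volume
restore steps are omitted, and the code is an illustrative assumption
rather than the actual scaleUp implementation.

```typescript
import {
  EC2Client,
  DescribeInstancesCommand,
  CreateTagsCommand,
  DeleteTagsCommand,
} from '@aws-sdk/client-ec2';

const ec2 = new EC2Client({});

export async function findAndMarkReusableInstance(): Promise<string | undefined> {
  // Find ephemeral instances that finished a job and were tagged as reusable.
  const described = await ec2.send(
    new DescribeInstancesCommand({
      Filters: [
        { Name: 'tag-key', Values: ['EphemeralRunnerFinished'] },
        { Name: 'instance-state-name', Values: ['running'] },
      ],
    }),
  );
  const instanceId = described.Reservations?.[0]?.Instances?.[0]?.InstanceId;
  if (!instanceId) {
    return undefined;
  }

  // Record when the instance was marked for reuse...
  await ec2.send(
    new CreateTagsCommand({
      Resources: [instanceId],
      Tags: [{ Key: 'EBSVolumeReplacementRequestTm', Value: `${Date.now()}` }],
    }),
  );

  // ...and drop the reuse hint so no other caller picks the same instance.
  await ec2.send(
    new DeleteTagsCommand({
      Resources: [instanceId],
      Tags: [{ Key: 'EphemeralRunnerFinished' }],
    }),
  );

  return instanceId;
}
```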

# Disaster recovery plan:

If this PR introduces breakages, they will most certainly be related to the
capacity to deploy new instances/runners rather than to any different
behaviour in the runner itself.

So, after reverting this change, it is important to make sure the runner
queue is under control. This can be accomplished by checking the queue size
on [hud metrics](https://hud.pytorch.org/metrics) and running the
[send_scale_message.py](https://github.com/pytorch-labs/pytorch-gha-infra/blob/main/scale_tools/send_scale_message.py)
script to make sure those instances are properly deployed by the stable
version of the scaler.

## Step by step to revert this change from **META**

1 - Identify whether this PR is causing the observed problem: [look at the
queue size](https://hud.pytorch.org/metrics) and check whether it is
related to the impacted runners (the ephemeral ones); it can also help to
investigate the [metrics on
unidash](https://www.internalfb.com/intern/unidash/dashboard/aws_infra_monitoring_for_github_actions/lambda_scaleup)
and the
[logs](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/gh-ci-scale-up?tab=monitoring)
related to the scaleUp lambda;

2 - If you confirm that this PR is the source of the problem, revert it
from main to make sure it cannot cause impact again if someone working on
other changes accidentally releases a version of test-infra that contains
it.

3 - To restore the infrastructure to its state before this change:

A) Find the commit (or, unlikely, more than one) on pytorch-gha-infra that
points to a release version of test-infra containing this change (it will
most likely be the latest). It will be a change updating the Terrafile to
point to a newer version of test-infra
([example](https://github.com/pytorch-labs/pytorch-gha-infra/commit/c4e888f58441b18a0fd6e19a1b935667750c6ba2)).
By convention such commits are named `Release vDATE-TIME`, for example
`Release v20250204-163312`.

B) Revert that commit from
https://github.com/pytorch-labs/pytorch-gha-infra

C) Follow [the
steps](https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.vj4fvy46wzwk)
outlined in the PyTorch GHA Infra runbook;

D) That document contains pointers for monitoring and verifying recovery in
the metrics / queue / logs you identified, and for confirming that the
system has recovered;

4 - Restore the user experience:

A) If you have access, follow the [instructions on how to recover ephemeral
queued
jobs](https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.ba0nyrda8jch)
in the above-mentioned document;

B) Another option is to cancel the queued jobs and trigger them again.

v20250306-173054

06 Mar 17:33
52e4e56
Adding tooling and documentation for locally run tflint (#6370)

Created a Makefile in `./terraform-aws-github-runner` to perform tflint
actions, and replaced the tflint calls in CI (`tflint.yml`) with this
Makefile.

This makes it much easier to test locally and get green signals on CI,
reducing the loop time for fixing small syntax bugs.

v20250305-171119

05 Mar 17:13
ed8eab9
[Bugfix] wait for ssm parameter to be created (#6359)

Sometimes the SSM parameter is not properly created. After investigation I
identified that the promise was not being properly awaited, which could
cause some operations to be canceled.
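
As a minimal sketch of the kind of fix described (assuming the AWS SDK v3
SSM client; the parameter name and value are illustrative), the call that
creates the parameter must be awaited so the lambda cannot return before
the parameter exists:

```typescript
import { SSMClient, PutParameterCommand } from '@aws-sdk/client-ssm';

const ssm = new SSMClient({});

export async function createRunnerParameter(name: string, value: string): Promise<void> {
  // Before the fix, a call like this could be issued without `await`,
  // letting the lambda be suspended or finish before the request completed.
  await ssm.send(
    new PutParameterCommand({
      Name: name,
      Value: value,
      Type: 'SecureString',
      Overwrite: true,
    }),
  );
}
```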

v20250205-165758

05 Feb 17:00
20250205175711

v20250205-163646

05 Feb 16:37
20250205173601

v20250205-163308

05 Feb 16:34
20250205173224

v20250205-161117

05 Feb 16:13
Adds ci-queue-pct lambda code to aws/lambdas and include it to the re…

v20250204-171328

04 Feb 17:15
88e4f1e
Adding documentation to help develop ALI lambdas and some useful scri…