
Releases: pytorch/test-infra

v20250703-021349

03 Jul 02:16
7d5d073
Add revert category extraction and exclude `ghfirst` reverts from stat…

v20250630-183403

30 Jun 18:36
5f86d76
[Pytorch AutoRevert] - Improves autorevert check heuristics  (#6853)

Improves the back-analysis for the revert logic, with the goal of improving
precision and recall and validating it as a viable strategy.

Checked against the workflows: pull, trunk, inductor, linux-binary-manywheel

Old code:
```
Timeframe: 720 hours
Commits checked: 6177
Auto revert patterns detected: 188
Actual reverts inside auto revert patterns detected: 24 (12.8%)
Total revert commits in period: 115
Reverts that dont match any auto revert pattern detected: 91
```

Newer code:
```
Workflow(s): pull, trunk, inductor, linux-binary-manywheel
Timeframe: 720 hours
Commits checked: 5403
Auto revert patterns detected: 442
Actual reverts inside auto revert patterns detected (precision): 48 (10.9%)
Total revert commits in period: 115
Reverts that dont match any auto revert pattern detected (recall): 67 (58.3%)
Per workflow precision:
  pull: 45 reverts out of 411 patterns (10.9%)
  trunk: 1 reverts out of 8 patterns (12.5%)
  inductor: 2 reverts out of 20 patterns (10.0%)
  linux-binary-manywheel: 0 reverts out of 3 patterns (0.0%)
```
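For reference, the percentages in the output above can be reproduced from the raw counts (a quick sketch; the variable names are mine, the numbers are taken verbatim from the output):

```python
# Counts copied from the report above.
patterns_detected = 442
reverts_inside_patterns = 48
total_reverts = 115
reverts_not_matching = 67

precision = reverts_inside_patterns / patterns_detected   # 48 / 442
printed_recall = reverts_not_matching / total_reverts     # 67 / 115

print(f"precision: {precision:.1%}")        # 10.9%
print(f"printed recall: {printed_recall:.1%}")  # 58.3%

# Per-workflow precision, matching the per-workflow lines:
per_workflow = {"pull": (45, 411), "trunk": (1, 8),
                "inductor": (2, 20), "linux-binary-manywheel": (0, 3)}
for wf, (reverts, patterns) in per_workflow.items():
    print(f"{wf}: {reverts} reverts out of {patterns} patterns "
          f"({reverts / patterns:.1%})")
```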

Critical implemented changes:
* Look forward and back for the first commit that ran the failed job,
instead of assuming it is always the one immediately before or after.
* Job names contain parts we don't care about, such as shard indices. Since a
failure could happen in any shard, we match any shard with the same
failure.
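The second change can be sketched as a job-name normalization step. The exact job-name format and the `normalize_job_name` helper below are hypothetical, not the actual implementation:

```python
import re

def normalize_job_name(name: str) -> str:
    """Strip shard indices (e.g. ", 1, 3" in "test (default, 1, 3, ...)") so
    that failures in different shards of the same job compare equal.
    Hypothetical helper; the real job-name format may differ."""
    return re.sub(r",\s*\d+,\s*\d+", "", name)

# Two shards of the same job normalize to the same key:
a = normalize_job_name("linux-jammy / test (default, 1, 3, linux.2xlarge)")
b = normalize_job_name("linux-jammy / test (default, 2, 3, linux.2xlarge)")
print(a == b)  # True
```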

Things I tried that didn't lead to great results:
* Ignoring error classification: precision too low, no significant
increase in recall.
* Not requiring error repetition: precision too low, no significant
increase in recall.

My take:
With a precision of 10%, the cost of re-running jobs to confirm redness
status is justified. Even though it is not possible to test directly, I
suspect that requiring the same outcome twice for all 3 signals should
elevate the precision to a very high standard. Unfortunately, the only way
to confirm is to run this in shadow mode.

With a recall of 55%, this should capture **most** of the introduced trunk
redness errors. Many reverts might not be caused by CI redness at all,
especially not in the workflows we are analyzing (they could be due to
performance degradation, GHF/internal reasons, and many others). This
number seems comfortable enough to provide a substantial gain in CI
quality.

v20250630-164255

30 Jun 16:45
3d3500e
runners: Revert things related to batch termination (#6868)

This reverts the following PRs:
* #6859 
* #6858 
* #6855 
* #6854
* #6852

These were causing issues where scale-down was scaling down instances too
aggressively, leading to runners not being refreshed by scale-up.

I do think the SSM expiration work is worth a re-do, but there were merge
conflicts, so I had to revert the entire thing.

v20250627-203612

27 Jun 20:38
9665a59
runners: Fix lint (#6859)

There were some outstanding lint issues from previous PRs.

Fixes the lint and formatting.

Signed-off-by: Eli Uriegas <[email protected]>

v20250627-202541

27 Jun 20:27
fd736eb
[ez][docs] Add wiki maintenance magic strings to aws/lambda/readme (#…

v20250627-200622

27 Jun 20:08
0758ff2
runners: make ssm policy an array (#6858)

Fixes an issue where the SSM parameter policies were not being set
correctly.

This resulted in errors like:

```
ValidationException: Invalid policies input:
{"Type":"Expiration","Version":"1.0","Attributes":{"Timestamp":"2025-06-27T19:11:55.437Z"}}.
```
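As the PR title says, the fix is to pass the policies as a JSON *array* of policy objects rather than a single object. A minimal boto3 sketch, where the parameter name is illustrative (note that parameter policies require the Advanced tier):

```python
import json

def build_expiration_policies(timestamp: str) -> str:
    # SSM expects Policies to be a JSON-encoded *array*, even for one policy.
    return json.dumps([
        {"Type": "Expiration", "Version": "1.0",
         "Attributes": {"Timestamp": timestamp}}
    ])

policies = build_expiration_policies("2025-06-27T19:11:55.437Z")
print(policies)

# Applying it (not executed here; requires AWS credentials):
# import boto3
# boto3.client("ssm").put_parameter(
#     Name="/runners/example-runner-config",  # illustrative name
#     Value="...",
#     Type="SecureString",
#     Tier="Advanced",       # parameter policies need the Advanced tier
#     Policies=policies,
#     Overwrite=True,
# )
```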

Signed-off-by: Eli Uriegas <[email protected]>

v20250627-185904

27 Jun 19:01
66a282f
[log classifier] Rule for graph break registry check (#6837)

For failures like this [GH job
link](https://github.com/pytorch/pytorch/actions/runs/15859789097/job/44714997710)
and [HUD commit
link](https://hud.pytorch.org/pytorch/pytorch/commit/c1ad4b8e7a16f54c35a3908b56ed7d9f95eef586).

The classifier currently matches `##[error]Process completed with exit code 1.`,
but there is a better line:
`Found the unimplemented_v2 or unimplemented_v2_with_warning calls below
that don't match the registry in graph_break_registry.json.`
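A rule targeting the more specific line could look roughly like this. The rule structure below is illustrative, not the actual log-classifier rule format:

```python
import re

# Illustrative: a specific rule that should win over the generic
# "Process completed with exit code" match.
GRAPH_BREAK_RULE = re.compile(
    r"Found the unimplemented_v2 or unimplemented_v2_with_warning calls below"
)
GENERIC_RULE = re.compile(r"##\[error\]Process completed with exit code \d+")

line = ("Found the unimplemented_v2 or unimplemented_v2_with_warning calls "
        "below that don't match the registry in graph_break_registry.json.")
print(bool(GRAPH_BREAK_RULE.search(line)))  # True
```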

v20250627-183532

27 Jun 18:37
99c977d
runners: Add expiration policy to SSM parameters (#6855)

Instead of doing expensive cleanups we can rely on SSM parameter
policies to do the cleanup for us!

This is a workaround to avoid the need to do expensive cleanup of SSM
parameters.

Signed-off-by: Eli Uriegas <[email protected]>

v20250627-181625

27 Jun 18:18
4556a13
runners: More scale-down perf improvements (#6854)

Does the following:
* Removes SSM parameter cleanup from terminateInstances (a follow-up PR
will add a termination policy to the parameters).
* Removes the double check for ghRunner calls (it was causing performance
bottlenecks).
* NOTE: we will need to monitor removeGHRunnerOrg calls to see if they
introduce another performance bottleneck and more job cancellations (if
cancellations rise, we revert; dashboard:
https://hud.pytorch.org/job_cancellation_dashboard)

Signed-off-by: Eli Uriegas <[email protected]>


v20250627-162018

27 Jun 16:22
003bee0
runners: Add batching for terminateRunner (#6852)

I noticed that we were terminating instances one by one in the original
code, so this adds batching for terminateRunner calls in order to fix that
performance bottleneck.

During termination we were also deleting SSM parameters one by one, so this
adds batching for SSM parameter deletion as well.

The goal here was to implement the performance improvements with minimal
changes.
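The batching can be sketched as chunked API calls; SSM's `DeleteParameters` in particular accepts at most 10 names per call. Function and variable names here are illustrative, not the actual implementation:

```python
from itertools import islice

def chunks(items, size):
    """Yield successive batches of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

# Illustrative IDs and parameter names.
instance_ids = [f"i-{n:017x}" for n in range(25)]
param_names = [f"/runners/{i}" for i in instance_ids]

# The boto3 calls are left unexecuted here (they require AWS credentials):
# ec2 = boto3.client("ec2"); ssm = boto3.client("ssm")
for batch in chunks(instance_ids, 100):
    pass  # ec2.terminate_instances(InstanceIds=batch)
for batch in chunks(param_names, 10):  # DeleteParameters max is 10 names
    pass  # ssm.delete_parameters(Names=batch)

print(len(list(chunks(param_names, 10))))  # 3 batches of <=10
```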

This PR supersedes #6725

---------

Signed-off-by: Eli Uriegas <[email protected]>