Releases: pytorch/test-infra
v20250703-021349
Add revert category extraction and exclude `ghfirst` reverts from stat…
v20250630-183403
[Pytorch AutoRevert] - Improves autorevert check heuristics (#6853)

Improves the back-analysis for the revert logic, with the goal of raising precision and recall and validating it as a viable strategy. Checked against the workflows: pull, trunk, inductor, linux-binary-manywheel.

Old code:
```
Timeframe: 720 hours
Commits checked: 6177
Auto revert patterns detected: 188
Actual reverts inside auto revert patterns detected: 24 (12.8%)
Total revert commits in period: 115
Reverts that dont match any auto revert pattern detected: 91
```

Newer code:
```
Workflow(s): pull, trunk, inductor, linux-binary-manywheel
Timeframe: 720 hours
Commits checked: 5403
Auto revert patterns detected: 442
Actual reverts inside auto revert patterns detected (precision): 48 (10.9%)
Total revert commits in period: 115
Reverts that dont match any auto revert pattern detected (recall): 67 (58.3%)
Per workflow precision:
  pull: 45 reverts out of 411 patterns (10.9%)
  trunk: 1 reverts out of 8 patterns (12.5%)
  inductor: 2 reverts out of 20 patterns (10.0%)
  linux-binary-manywheel: 0 reverts out of 3 patterns (0.0%)
```

Critical implemented changes:
* Look forward and back for the first commit that actually ran the failed job, instead of assuming it is always the commit immediately before or after.
* Job names contain parts we don't care about, like shard indices. Since a failure can land on any shard, we want to match any shard with the same failure (see the sketch after this note).

Things I tried that did not lead to great results:
* Ignoring error classification - precision too low, no significant increase in recall.
* Not requiring error repetition - precision too low, no significant increase in recall.

My take: a precision of ~10% already justifies the cost of re-running jobs to confirm redness status. Even though it cannot be tested directly, I suspect that requiring the same outcome twice for all three signals should raise the precision to a very high standard; unfortunately the only way to verify this is to run it in shadow mode. A recall of ~55% suggests we can capture **most** of the introduced trunk redness errors. Many reverts are not caused by CI redness, especially not in the workflows we are analyzing (performance degradation, GHF/internal reasons, and many others), so this number looks good enough to provide a substantial gain in CI quality.
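A minimal sketch of the shard-collapsing idea described above, assuming shard indices appear as a `, <shard>, <total>` pair inside the parenthesized job suffix; the job names and regex are illustrative, not the tool's actual implementation:

```python
import re

def normalize_job_name(job_name: str) -> str:
    """Collapse shard indices so the same failure on any shard compares equal.

    Hypothetical example: "pull / test (default, 2, 5, linux.4xlarge)" and
    "pull / test (default, 3, 5, linux.4xlarge)" both normalize to
    "pull / test (default, linux.4xlarge)".
    """
    # Remove ", <shard>, <num_shards>" sequences from the job name.
    return re.sub(r",\s*\d+\s*,\s*\d+", "", job_name)
```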
v20250630-164255
runners: Revert things related to batch termination (#6868)

This reverts the following PRs:
* #6859
* #6858
* #6855
* #6854
* #6852

These were causing issues where scale-down was scaling down instances too aggressively, leading to runners not being refreshed by scale-up. I do think the SSM expiration work is worth a re-do, but there were merge conflicts, so I had to revert the entire thing.
v20250627-203612
runners: Fix lint (#6859) There were some outstanding lint issues from previous PRs. This fixes the lint and formatting. Signed-off-by: Eli Uriegas <[email protected]>
v20250627-202541
[ez][docs] Add wiki maintenance magic strings to aws/lambda/readme (#…
v20250627-200622
runners: make ssm policy an array (#6858) Fixes an issue where the SSM parameter policies were not being set correctly, which resulted in errors like: `ValidationException: Invalid policies input: {"Type":"Expiration","Version":"1.0","Attributes":{"Timestamp":"2025-06-27T19:11:55.437Z"}}`. Signed-off-by: Eli Uriegas <[email protected]>
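For illustration, a hedged boto3 sketch of what a correct call looks like once the expiration policy is wrapped in an array; the real runner code lives in the TypeScript lambdas, and the parameter name and value here are hypothetical:

```python
import json
from datetime import datetime, timedelta, timezone

import boto3

ssm = boto3.client("ssm")

# Expiration policies require the Advanced parameter tier.
expires_at = datetime.now(timezone.utc) + timedelta(hours=1)
policies = [  # note: an array of policy objects, not a bare object
    {
        "Type": "Expiration",
        "Version": "1.0",
        "Attributes": {"Timestamp": expires_at.strftime("%Y-%m-%dT%H:%M:%S.000Z")},
    }
]

ssm.put_parameter(
    Name="/runners/i-0123456789abcdef0/config",  # hypothetical parameter name
    Value="<runner registration config>",        # placeholder value
    Type="SecureString",
    Tier="Advanced",
    Overwrite=True,
    Policies=json.dumps(policies),
)
```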
v20250627-185904
[log classifier] Rule for graph break registry check (#6837) For failures like this [GH job link](https://github.com/pytorch/pytorch/actions/runs/15859789097/job/44714997710) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/c1ad4b8e7a16f54c35a3908b56ed7d9f95eef586). The classifier currently matches ` ##[error]Process completed with exit code 1.`, but there is a more informative line: `Found the unimplemented_v2 or unimplemented_v2_with_warning calls below that don't match the registry in graph_break_registry.json.`
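As a rough illustration of why the more specific line is preferable, here is a hedged Python sketch (not the classifier's actual rule format) that prefers the registry-mismatch pattern over the generic exit-code pattern:

```python
import re

# Generic pattern the classifier currently matches, and the more specific
# graph-break-registry line quoted above.
GENERIC_FAILURE = re.compile(r"##\[error\]Process completed with exit code \d+\.")
GRAPH_BREAK_REGISTRY = re.compile(
    r"Found the unimplemented_v2 or unimplemented_v2_with_warning calls below "
    r"that don't match the registry in graph_break_registry\.json\."
)

def classify(log_lines: list[str]) -> str | None:
    """Return a failure label, preferring the specific rule over the generic one."""
    for line in log_lines:
        if GRAPH_BREAK_REGISTRY.search(line):
            return "graph break registry check failed"
    for line in log_lines:
        if GENERIC_FAILURE.search(line):
            return "process exited with non-zero code"
    return None
```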
v20250627-183532
runners: Add expiration policy to SSM parameters (#6855) Instead of doing expensive cleanup of SSM parameters ourselves, we can rely on SSM parameter policies to do the cleanup for us! Signed-off-by: Eli Uriegas <[email protected]>
v20250627-181625
runners: More scale-down perf improvements (#6854)

Does the following:
* Removes SSM parameter cleanup from terminateInstances (a follow-up PR will add a termination policy to the parameters).
* Removes the double check for ghRunner calls (it was causing performance bottlenecks).
* NOTE: We will need to monitor removeGHRunnerOrg calls and job cancellations to see if they introduce another performance bottleneck (if they rise, we revert; dashboard: https://hud.pytorch.org/job_cancellation_dashboard).

Signed-off-by: Eli Uriegas <[email protected]>
---------
Signed-off-by: Eli Uriegas <[email protected]>
v20250627-162018
runners: Add batching for terminateRunner (#6852)

The original code terminated instances one by one, so this adds batching to the terminateRunner calls to fix that performance bottleneck. During termination we were also deleting SSM parameters one by one, so this adds batching to the SSM parameter deletion as well (see the sketch below). The goal was to implement the performance improvements with minimal changes.

This PR supersedes #6725

---------
Signed-off-by: Eli Uriegas <[email protected]>
Signed-off-by: Eli Uriegas <[email protected]>
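A hedged boto3 sketch of the batching idea; the actual implementation is in the TypeScript runners lambda, and the batch size used for instance termination here is an arbitrary assumption (SSM DeleteParameters accepts at most 10 names per call):

```python
import boto3

ec2 = boto3.client("ec2")
ssm = boto3.client("ssm")

def chunks(items: list[str], size: int):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i : i + size]

def terminate_runners(instance_ids: list[str]) -> None:
    """Terminate runner instances in batches instead of one API call per instance."""
    for batch in chunks(instance_ids, 100):  # batch size chosen arbitrarily for the sketch
        ec2.terminate_instances(InstanceIds=batch)

def delete_runner_parameters(parameter_names: list[str]) -> None:
    """Delete the runners' SSM parameters in batches of up to 10 names."""
    for batch in chunks(parameter_names, 10):
        ssm.delete_parameters(Names=batch)
```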