v20250630-183403

568 commits to main since this release
[Pytorch AutoRevert] - Improves autorevert check heuristics (#6853)

Improves the back-analysis for the revert logic, with the goal of raising precision and recall and validating auto-revert as a viable strategy. Checked against the workflows: pull, trunk, inductor, linux-binary-manywheel.

Old code:

```
Timeframe: 720 hours
Commits checked: 6177
Auto revert patterns detected: 188
Actual reverts inside auto revert patterns detected: 24 (12.8%)
Total revert commits in period: 115
Reverts that dont match any auto revert pattern detected: 91
```

Newer code:

```
Workflow(s): pull, trunk, inductor, linux-binary-manywheel
Timeframe: 720 hours
Commits checked: 5403
Auto revert patterns detected: 442
Actual reverts inside auto revert patterns detected (precision): 48 (10.9%)
Total revert commits in period: 115
Reverts that dont match any auto revert pattern detected (recall): 67 (58.3%)
Per workflow precision:
 pull: 45 reverts out of 411 patterns (10.9%)
 trunk: 1 reverts out of 8 patterns (12.5%)
 inductor: 2 reverts out of 20 patterns (10.0%)
 linux-binary-manywheel: 0 reverts out of 3 patterns (0.0%)
```

Critical changes implemented:

* Look both forward and back for the first commit that actually ran the failed job, instead of trusting that it is always the one immediately before or after.
* Job names contain parts we don't care about, such as shard indices. Since a failure can happen in any shard, we want to match any shard with the same failure.

Things I tried that did not lead to great results:

* Ignoring error classification: too low precision, no significant increase in recall.
* Not requiring error repetition: too low precision, no significant increase in recall.

My take: a precision of ~10% already justifies the cost of re-running jobs in order to confirm redness status. Even though it is not possible to test this directly, I suspect that requiring the same outcome twice for all 3 signals should elevate the precision to a very high standard.
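The two critical changes can be sketched roughly as follows. This is a minimal illustration, not the actual PyTorch CI code: the job-name format (shard index and shard count inside parentheses, e.g. `test (default, 2, 5)`) and the function names are assumptions.

```python
import re
from typing import Callable, Optional, Sequence

# Assumed shard suffix format: "test (default, 2, 5)" -> shard 2 of 5.
# Any shard can carry the same failure, so collapse to one canonical key.
_SHARD_RE = re.compile(r",\s*\d+\s*,\s*\d+\s*\)")

def normalize_job_name(name: str) -> str:
    """Strip shard index/count so the same failure matches across shards."""
    return _SHARD_RE.sub(")", name)

def nearest_commit_with_job(
    commits: Sequence[str],          # SHAs ordered newest -> oldest
    start: int,                      # index of the suspected commit
    ran_job: Callable[[str], bool],  # did this commit run the failed job?
    direction: int,                  # +1 walks toward older, -1 toward newer
    max_steps: int = 10,
) -> Optional[str]:
    """Walk from `start` until a commit that actually ran the job is found,
    instead of blindly trusting the commit right before or right after."""
    i = start + direction
    taken = 0
    while 0 <= i < len(commits) and taken < max_steps:
        if ran_job(commits[i]):
            return commits[i]
        i += direction
        taken += 1
    return None
```

For example, `normalize_job_name("linux-jammy / test (default, 2, 5)")` yields `"linux-jammy / test (default)"`, and the search helper skips over commits where the job never ran (e.g. because the workflow was not triggered) in either direction.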
Unfortunately, the only way to test this is to run it in shadow mode. With a recall around 55%, it points to being able to capture **most** of the introduced trunk redness errors. Many reverts are likely not caused by CI redness at all, especially not in the workflows we are analyzing (they could be performance degradations, GHF/internal reasons, and many others). This number seems comfortable enough to provide a substantial gain in CI quality.
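As a sanity check, the percentages in the report above follow directly from the raw counts; note that the figure printed next to "(recall)" corresponds to the fraction of reverts that did *not* match any pattern (67 of 115):

```python
# Raw counts from the "Newer code" run above.
patterns_detected = 442   # auto revert patterns flagged
reverts_matched = 48      # actual reverts inside those patterns
total_reverts = 115       # all revert commits in the 720h window

reverts_unmatched = total_reverts - reverts_matched  # 67

precision = reverts_matched / patterns_detected          # -> 10.9%
unmatched_fraction = reverts_unmatched / total_reverts   # -> 58.3%

print(f"precision: {precision:.1%}")  # precision: 10.9%
print(f"unmatched: {unmatched_fraction:.1%}")  # unmatched: 58.3%
```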