Releases: pytorch/test-infra
v20250715-232104
runners: Elevate some debug logs to info logs (#6933) This was annoying me when reading through the logs: some of these messages were getting swallowed when I was filtering out debug logs. Hopefully with these as info logs we'll be able to debug things more easily. Signed-off-by: Eli Uriegas <[email protected]>
v20250715-231438
runners: Remove debug logging for listRunners (#6932) This log message was driving me insane and causing a lot of useless noise in the logs. Removing so I can preserve my sanity. Signed-off-by: Eli Uriegas <[email protected]>
v20250715-210301
runners: Move runner removal logic up (#6930) This refactors scale-down to move the runner removal logic up to the top of the loop, to avoid long wait times between determining that a runner should be removed and actually removing it. In practice we were observing wait times of 7 to 10 minutes. This might only be testable with production traffic / rate limits. --------- Signed-off-by: Eli Uriegas <[email protected]>
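For illustration, a minimal Python sketch of the restructuring described in #6930, with hypothetical `is_removable` and `terminate` helpers (the actual scale-down lambda is TypeScript and works differently in detail): the termination call happens right where the decision is made, rather than after the rest of the loop.

```python
# Hypothetical sketch, not the actual scale-down lambda.
def scale_down(runners, is_removable, terminate):
    """Decide and act in the same pass so no runner waits minutes between the
    'should remove' decision and the actual termination call."""
    for runner in runners:
        if not is_removable(runner):
            continue
        # Removal happens immediately where the decision is made, keeping the
        # gap between the check and the termination as small as possible.
        terminate(runner)
```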
v20250709-181311
[ez][CH] Fix infra_metrics.cloudwatch_metrics schema: use DateTime64…
v20250708-173352
[ghinfra] Set up ingestion from s3 -> clickhouse for cloudwatch (#6898) Path: cloudwatch metrics -> firehose -> s3 (new bucket fbossci-cloudwatch-metrics) -> clickhouse. This PR covers the s3 -> clickhouse part. I think ClickHouse has some built-in ingestion for Kinesis, but I'm lazy... Requires https://github.com/pytorch-labs/pytorch-gha-infra/pull/751 Testing: ran the python code via `python tools/rockset_migration/s32ch.py --clickhouse-table "infra_metrics.cloudwatch_metrics" --stored-data t.json --s3-bucket fbossci-cloudwatch-metrics --s3-prefix ghci-related`
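A minimal sketch of the s3 -> clickhouse leg, assuming boto3 and clickhouse-connect and that Firehose writes plain newline-delimited JSON; the real tool is `tools/rockset_migration/s32ch.py`, which this does not reproduce.

```python
# Hedged sketch of S3 -> ClickHouse ingestion; not the actual s32ch.py tool.
import json
import boto3
import clickhouse_connect

def ingest(bucket: str, prefix: str, table: str) -> None:
    s3 = boto3.client("s3")
    # In practice the ClickHouse host and credentials would be passed here.
    ch = clickhouse_connect.get_client()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            # Assumes uncompressed newline-delimited JSON records.
            rows = [json.loads(line) for line in body.splitlines() if line]
            if rows:
                ch.insert(table,
                          [list(r.values()) for r in rows],
                          column_names=list(rows[0].keys()))

if __name__ == "__main__":
    ingest("fbossci-cloudwatch-metrics", "ghci-related",
           "infra_metrics.cloudwatch_metrics")
```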
v20250703-021349
Add revert category extraction and exclude `ghfirst` reverts from stat…
v20250630-183403
[Pytorch AutoRevert] - Improves autorevert check heuristics (#6853) Improves the back-analysis for the revert logic, with the goal of improving precision and recall and validating this as a viable strategy. Checked against the workflows: pull, trunk, inductor, linux-binary-manywheel.

Old code:
```
Timeframe: 720 hours
Commits checked: 6177
Auto revert patterns detected: 188
Actual reverts inside auto revert patterns detected: 24 (12.8%)
Total revert commits in period: 115
Reverts that don't match any auto revert pattern detected: 91
```

Newer code:
```
Workflow(s): pull, trunk, inductor, linux-binary-manywheel
Timeframe: 720 hours
Commits checked: 5403
Auto revert patterns detected: 442
Actual reverts inside auto revert patterns detected (precision): 48 (10.9%)
Total revert commits in period: 115
Reverts that don't match any auto revert pattern detected (recall): 67 (58.3%)
Per workflow precision:
  pull: 45 reverts out of 411 patterns (10.9%)
  trunk: 1 reverts out of 8 patterns (12.5%)
  inductor: 2 reverts out of 20 patterns (10.0%)
  linux-binary-manywheel: 0 reverts out of 3 patterns (0.0%)
```

Critical implemented changes:
* Look both forward and back for the first commit that actually ran the failed job, instead of always relying on the commit right before or right after.
* Job names contain parts we don't care about, like shard indices. Since a failure could happen on any shard, we want to match any shard with the same failure (see the sketch after this entry).

Things I tried that didn't lead to great results:
* Ignoring error classification - precision too low, no significant increase in recall.
* Not requiring error repetition - precision too low, no significant increase in recall.

My take: a precision of 10% justifies the cost of re-running jobs to confirm redness status. Even though it is not possible to test this directly, I suspect that requiring the same outcome twice for all three signals should raise precision to a very high standard; unfortunately the only way to verify is to run this in shadow mode. With a recall of 55%, we should be able to capture **most** of the introduced trunk redness errors. Many reverts are not caused by CI redness, especially not in the workflows we are analyzing (performance degradation, GHF/internal reasons, and many others). This number seems good enough to provide a substantial gain in CI quality.
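A hedged sketch of the job-name normalization mentioned above: shard indices are stripped so that the same failure on any shard compares equal. The regex here is an illustrative assumption, not the exact rule used by the autorevert checker.

```python
# Illustrative normalization; the real heuristic may differ.
import re

def normalize_job_name(name: str) -> str:
    # e.g. "linux-jammy / test (default, 2, 5, linux.4xlarge)"
    #   -> "linux-jammy / test (default, linux.4xlarge)"
    return re.sub(r",\s*\d+,\s*\d+", "", name)

# The same failure on different shards now maps to one job identity.
assert normalize_job_name("test (default, 2, 5, linux.4xlarge)") == \
       normalize_job_name("test (default, 1, 5, linux.4xlarge)")
```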
v20250630-164255
runners: Revert things related to batch termination (#6868) This reverts the following PRs:
* #6859
* #6858
* #6855
* #6854
* #6852

These were causing issues where scale-down was scaling instances down too aggressively, leading to runners not being refreshed by scale-up. I do think the SSM expiration work is worth redoing, but there were merge conflicts, so I had to revert the entire thing.
v20250627-203612
runners: Fix lint (#6859) There were some outstanding lint issues from previous PRs. This fixes the lint and formatting. Signed-off-by: Eli Uriegas <[email protected]>
v20250627-202541
[ez][docs] Add wiki maintenance magic strings to aws/lambda/readme (#…