Description
Describe the bug
TF-via-PR is not resilient to failures of the GitHub API. For example, I have seen multiple timeouts when the TF-via-PR code tries to download artifacts or post results to the PR.
The lack of retry logic inside TF-via-PR is especially problematic when using plan-parity checks during the Terraform apply workflow: by the time a call fails, the plan is stale, so you can't simply re-run the whole workflow.
To Reproduce
This is hard to reproduce deterministically. A timeout can happen anywhere gh api is called.
Expected behavior
All gh api calls, and calls to any other external dependency, should be wrapped in retry logic with backoff, giving TF-via-PR better resilience.
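To illustrate what I mean, here is a minimal sketch of such a wrapper. It is not taken from the TF-via-PR code base: the function name retry_with_backoff and the variables RETRY_MAX_ATTEMPTS and RETRY_BASE_DELAY are hypothetical, and the defaults (5 attempts, 2 second initial delay, doubling each time) are assumptions rather than tuned values.

```shell
# Hypothetical helper, NOT part of TF-via-PR: run a command, retrying with
# exponential backoff when it exits non-zero.
# RETRY_MAX_ATTEMPTS and RETRY_BASE_DELAY are assumed names; the delay must be
# a whole number of seconds because it is doubled with integer arithmetic.
retry_with_backoff() {
  local max_attempts="${RETRY_MAX_ATTEMPTS:-5}"
  local delay="${RETRY_BASE_DELAY:-2}"
  local attempt=1
  local status=0

  while true; do
    # Running the command as an `if` condition keeps `set -e` from aborting
    # the script on a failed attempt.
    if "$@"; then
      return 0
    else
      status=$?
    fi

    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Command failed after ${attempt} attempts: $*" >&2
      return "$status"
    fi

    echo "Attempt ${attempt} failed (exit ${status}); retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

A call like the one that timed out for me could then be written as `retry_with_backoff gh api "repos/${GITHUB_REPOSITORY}/pulls?per_page=100"`. This sketch retries on any non-zero exit code, so it would also retry non-transient errors such as a 404; a real implementation would probably want to retry only on network timeouts and similar transient failures.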
Additional context
So far, I have seen the timeout happen in multiple locations whenever TF-via-PR calls gh api. Here are some example logs:
Run op5dev/tf-via-pr@v13
Run # Check for required tools.
Run # Populate variables.
Run # Unique identifier.
Get "https://api.github.com/repos/XXXXXX/XXXXXX/pulls?per_page=100": dial tcp 140.82.116.5:443: i/o timeout
Error: Process completed with exit code 1.
and
Run # Post output.
Patch "https://api.github.com/repos/XXXX/XXXXXXXX/check-runs/XXXXXXXX": dial tcp 140.82.116.6:443: i/o timeout
Error: Process completed with exit code 1.