I've been keeping an eye on intermittent failures in various CI workflows that run sigstore tools (root-signing, root-signing-staging, sigstore-probers), and my gut feeling is that sigstore-python fails a little more often than the other clients.
Looking at the client implementations, at least cosign, sigstore-java and sigstore-js seem to have some built-in retries for the requests they make to rekor and fulcio. I wonder if we should have that too?
It's not an obvious decision:
- Interactive use and CI use have different expectations WRT responsiveness -- maybe we should only retry on CI?
- It's not entirely trivial to recognize which responses should lead to retries -- potential ones could be 5xx, 429
- By far the most failures seem to be on rekor
- Some error responses have a `Retry-After` header, but most do not seem to (I can't see the actual responses in the load balancer logs, so this is partly guesswork, but I believe 503 and 429 include the header). 429 specifically only makes sense to retry with the `Retry-After` value
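
To make the trade-offs above concrete, here is a rough sketch of what the retry decision could look like. It is not existing sigstore-python code: the names `parse_retry_after`, `retry_delay` and `max_attempts`, the backoff numbers, and the "retry only when `CI` is set" rule are all assumptions for discussion. It only handles the delta-seconds form of `Retry-After` and ignores the HTTP-date form.

```python
from typing import Mapping, Optional


def parse_retry_after(headers: Mapping[str, str]) -> Optional[float]:
    """Return the Retry-After delay in seconds, or None if absent or unparsable.

    Assumption: only the integer delta-seconds form is handled; the HTTP-date
    form is treated as "no usable value" in this sketch.
    """
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lowercase.
        if name.lower() == "retry-after":
            value = value.strip()
            return float(value) if value.isdigit() else None
    return None


def retry_delay(
    status: int,
    headers: Mapping[str, str],
    attempt: int,
    base: float = 0.5,  # hypothetical first backoff step, in seconds
    cap: float = 30.0,  # hypothetical upper bound for computed backoff
) -> Optional[float]:
    """Return how long to wait before retrying, or None for "do not retry"."""
    retry_after = parse_retry_after(headers)
    if status == 429:
        # Only retry a 429 when the server says how long to wait.
        return retry_after
    if 500 <= status <= 599:
        if retry_after is not None:
            return retry_after
        # No hint from the server: capped exponential backoff.
        return min(cap, base * 2**attempt)
    return None


def max_attempts(env: Mapping[str, str]) -> int:
    """Hypothetical policy: retry only when the CI variable is set, else fail fast."""
    return 4 if env.get("CI") else 1
```

A caller would loop up to `max_attempts(os.environ)` times, sleeping for `retry_delay(...)` seconds between attempts and giving up as soon as it returns `None`. Whether interactive use should really get no retries at all is exactly the open question from the first bullet.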