fix(nvd): retry transient body-read errors and 524 responses #459
The NVD updater could fail an entire run on transient errors that were not covered by the retry loop:
- The response body was read outside `fetchURL`'s retry loop, so an HTTP/2 `INTERNAL_ERROR` (stream aborted mid-body after a 200 OK) was fatal instead of retried.
- Cloudflare's non-standard 524 (origin timeout) fell into the default branch and returned an "unexpected status code" error.

Read the body inside the retry loop and add 524 to the retryable statuses. A single request attempt is extracted into `doRequest`, which closes the response body via `defer`.
// doRequest performs a single NVD request attempt and closes the response body
// before returning. It returns the response body on success. A nil body with a
// nil error means the request should be retried after waiting for `wait`.
func (u Updater) doRequest(c *http.Client, url string, attempt int) (body []byte, wait time.Duration, err error) {
nit: passing `attempt` into `doRequest` leaks the backoff responsibility into a per-attempt function. `attempt` is loop state owned by `fetchURL`, and it's only used here to compute the linear backoff `time.Duration(attempt) * time.Second`.
The only thing `doRequest` really needs to hand back to the loop is the server-mandated wait (the rate-limit `Retry-After`); the backoff itself is the loop's policy. One way to express that is to drop `attempt` and signal "retryable" via a wrapped error that also carries the optional retry-after, so `fetchURL` owns the backoff:
	// errRetry signals fetchURL to retry. retryAfter is the server-mandated
	// minimum wait (rate limit); zero means apply the caller's backoff.
	type errRetry struct{ retryAfter time.Duration }

	func (errRetry) Error() string { return "retryable" }

	func (u Updater) fetchURL(url string) ([]byte, error) {
		var c http.Client
		for i := 0; i <= u.retry; i++ {
			body, err := u.doRequest(&c, url)
			var re errRetry
			switch {
			case err == nil:
				return body, nil
			case errors.As(err, &re):
				wait := re.retryAfter
				if wait == 0 {
					wait = time.Duration(i) * time.Second
				}
				time.Sleep(wait)
			default:
				return nil, err
			}
		}
		return nil, xerrors.Errorf("unable to fetch url. Retry limit exceeded.")
	}

`doRequest` then returns `([]byte, error)`: `errRetry{retryAfter: ra}` for 403/429, `errRetry{}` for 503/524/timeout/network/body-read errors, the body for 200, and a plain error for the unexpected-status case. As a bonus this also removes the `body != nil` retry sentinel, since success is now distinguished by `err == nil`.
That said, adding a dedicated error type might be overkill for this single spot, so I'll leave the call to you. Not blocking.
Good idea — that's more accurate.
Updated in 38049fd.
New test run from fork - https://github.com/DmitriyLewen/vuln-list-update/actions/runs/27869235304/job/82478367450
Replace the implicit "nil body + nil error means retry" contract and the attempt parameter passed into doRequest with a dedicated errRetry type. doRequest now returns ([]byte, error): the body on success, errRetry on a retryable condition (with retryAfter for rate limits), and a plain error otherwise. fetchURL owns the backoff policy.
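For illustration, the `doRequest` side of this contract could look roughly like the sketch below. It is not the code from the commit: it uses a free function instead of a method on `Updater`, `fmt.Errorf` instead of `xerrors`, and it assumes `Retry-After` carries whole seconds. The `statusOutcome` helper is hypothetical and exists only to exercise the status grouping against a local test server.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strconv"
	"time"
)

// errRetry signals the caller to retry. retryAfter is the server-mandated
// minimum wait; zero means the caller applies its own backoff.
type errRetry struct{ retryAfter time.Duration }

func (errRetry) Error() string { return "retryable" }

// doRequest performs a single attempt and closes the response body before
// returning. The status grouping follows the commit message above; parsing
// Retry-After as whole seconds is an assumption of this sketch.
func doRequest(c *http.Client, url string) ([]byte, error) {
	resp, err := c.Get(url)
	if err != nil {
		return nil, errRetry{} // network error or timeout
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK:
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			return nil, errRetry{} // stream aborted mid-body
		}
		return body, nil
	case http.StatusForbidden, http.StatusTooManyRequests:
		// A missing or non-numeric header yields zero, i.e. caller backoff.
		secs, _ := strconv.Atoi(resp.Header.Get("Retry-After"))
		return nil, errRetry{retryAfter: time.Duration(secs) * time.Second}
	case http.StatusServiceUnavailable, 524: // 524: Cloudflare origin timeout
		return nil, errRetry{}
	default:
		return nil, fmt.Errorf("unexpected status code: %d", resp.StatusCode)
	}
}

// statusOutcome (hypothetical helper) serves one response with the given
// status and reports whether doRequest classified it as retryable.
func statusOutcome(status int) (retry bool, err error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}))
	defer srv.Close()

	_, err = doRequest(srv.Client(), srv.URL)
	var re errRetry
	return errors.As(err, &re), err
}

func main() {
	for _, s := range []int{200, 429, 524, 404} {
		retry, err := statusOutcome(s)
		fmt.Printf("status=%d retry=%v err=%v\n", s, retry, err)
	}
}
```

With this shape the caller only needs `errors.As` to tell a retryable failure from a fatal one, and a zero `retryAfter` leaves the backoff policy in the loop.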
Description
The NVD updater could fail an entire run on transient errors that were not covered by the retry loop:
- The response body was read outside `fetchURL`'s retry loop, so an HTTP/2 `INTERNAL_ERROR` (the stream aborted mid-body after a `200 OK`) was fatal instead of retried.
- Cloudflare's non-standard `524` (origin timeout) fell into the `default` branch and returned an "unexpected status code" error.

This PR reads the body inside the retry loop and adds `524` to the retryable statuses.
A single request attempt is extracted into a `doRequest` helper that closes the response body via `defer`.

Context
On 2026-06-17 NVD rolled out a schema/metric expansion (SSVC data from CISA-ADP in addition to CVSS, plus an "affected" block per the CVE Record Format).
Every updated record received a new `Last Modified` date, which means a large share of the database was re-stamped in a single day.
As a result, the updater's per-day window started returning almost the whole database at once.
Compare a normal day to the affected day.
Successful run (run 27657694905):
Failing run (run 27744143884):
Total number of records currently stored:
So a single day's window covered `261624 / 357760 ≈ 73%` of all records, or roughly 131 pages of 2000 entries.
Fetching that many pages takes 40–60 minutes, runs into the NVD rate limit, and eventually hits a transient server error (`503`, `524`, or an HTTP/2 `INTERNAL_ERROR`).
Before this change such a transient error failed the whole run and reverted it, so `lastUpdatedDate` never advanced and the next run pulled an even larger window: a self-sustaining loop.
This is not a regression on our side; it is the fallout of the NVD update.
Making the transient errors retryable lets a run survive the spike and finish, after which the window shrinks back to normal.
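The share and page count quoted above can be reproduced with a few lines of arithmetic. This is only a worked check of the figures in this description; the helper names `sharePercent` and `pagesNeeded` are made up for the example.

```go
package main

import "fmt"

// sharePercent returns part as a percentage of total.
func sharePercent(part, total int) float64 {
	return 100 * float64(part) / float64(total)
}

// pagesNeeded returns how many pages of pageSize entries are needed to
// cover n records (ceiling division).
func pagesNeeded(n, pageSize int) int {
	return (n + pageSize - 1) / pageSize
}

func main() {
	const (
		modified = 261624 // records in the affected day's window
		stored   = 357760 // total records currently stored
		pageSize = 2000   // entries per page
	)
	fmt.Printf("share: %.1f%%\n", sharePercent(modified, stored)) // share: 73.1%
	fmt.Println("pages:", pagesNeeded(modified, pageSize))        // pages: 131
}
```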
Tests
- `TestUpdate`: adds two `524` cases, a happy path that succeeds after one reconnect (`retry: 1`) and a sad path that exhausts retries (confirming `524` is now retried instead of returning "unexpected status code").
- `TestUpdate_RetryOnBodyReadError`: hijacks the connection to send a `200 OK` with a truncated body (declared `Content-Length` larger than the bytes sent, then closes the connection), forcing an `io.ReadAll` error after the status code, and verifies the next attempt succeeds.
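The truncated-body technique can be sketched in isolation as follows. This is not the test from the PR: `truncatedBodyHandler` and `readTruncated` are hypothetical names, the sketch speaks plain HTTP/1.1 through `httptest`, and it covers only the failing read, not the successful second attempt.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// truncatedBodyHandler hijacks the connection and writes a 200 OK whose
// declared Content-Length (10) is larger than the bytes actually sent (4),
// then closes the connection.
func truncatedBodyHandler(w http.ResponseWriter, _ *http.Request) {
	hj, ok := w.(http.Hijacker)
	if !ok {
		http.Error(w, "hijacking not supported", http.StatusInternalServerError)
		return
	}
	conn, buf, err := hj.Hijack()
	if err != nil {
		return
	}
	defer conn.Close()
	buf.WriteString("HTTP/1.1 200 OK\r\nContent-Length: 10\r\n\r\nabcd")
	buf.Flush()
}

// readTruncated requests the truncated response and returns the status code
// together with the error produced while reading the body.
func readTruncated() (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(truncatedBodyHandler))
	defer srv.Close()

	resp, err := srv.Client().Get(srv.URL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	_, err = io.ReadAll(resp.Body)
	return resp.StatusCode, err
}

func main() {
	status, err := readTruncated()
	fmt.Println("status:", status)
	fmt.Println("body read error:", err)
	fmt.Println("is unexpected EOF:", errors.Is(err, io.ErrUnexpectedEOF))
}
```

The point of the setup is that the status line and headers arrive intact, so the failure surfaces only in the body read. That is the case a status-code check alone cannot catch.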