Skip to content

fix(nvd): retry transient body-read errors and 524 responses#459

Merged
DmitriyLewen merged 2 commits into
aquasecurity:mainfrom
DmitriyLewen:fix/nvd-retry-body-read-and-524
Jun 22, 2026
Merged

fix(nvd): retry transient body-read errors and 524 responses#459
DmitriyLewen merged 2 commits into
aquasecurity:mainfrom
DmitriyLewen:fix/nvd-retry-body-read-and-524

Conversation

@DmitriyLewen

Copy link
Copy Markdown
Contributor

Description

The NVD updater could fail an entire run on transient errors that were not covered by the retry loop:

  • The response body was read outside fetchURL's retry loop, so an HTTP/2 INTERNAL_ERROR (the stream aborted mid-body after a 200 OK) was fatal instead of retried.
  • Cloudflare's non-standard 524 (origin timeout) fell into the default branch and returned an "unexpected status code" error.

This PR reads the body inside the retry loop and adds 524 to the retryable statuses.
A single request attempt is extracted into a doRequest helper that closes the response body via defer.

Context

On 2026-06-17 NVD rolled out a schema/metric expansion (SSVC data from CISA-ADP in addition to CVSS, plus an "affected" block per the CVE Record Format).
Every updated record received a new Last Modified date, which means a large share of the database was re-stamped in a single day.

As a result, the updater's per-day window started returning almost the whole database at once.
Compare a normal day to the affected day.

Successful run (run 27657694905):

2026/06/17 00:39:15 INFO Fetched NVD entries total=169 start_index=0

Failing run (run 27744143884):

2026/06/18 07:38:54 INFO Fetched NVD entries total=261624 start_index=0

Total number of records currently stored:

➜  api find . -type f | wc -l
  357760

So a single day's window covered 261624 / 357760 ≈ 73% of all records — roughly 131 pages of 2000 entries.
Fetching that many pages takes 40–60 minutes, runs into the NVD rate limit, and eventually hits a transient server error (503, 524, or an HTTP/2 INTERNAL_ERROR).
Before this change such a transient error failed the whole run and reverted it, so lastUpdatedDate never advanced and the next run pulled an even larger window — a self-sustaining loop.

This is not a regression on our side; it is the fallout of the NVD update.
Making the transient errors retryable lets a run survive the spike and finish, after which the window shrinks back to normal.

Tests

  • TestUpdate — added two 524 cases: a happy path that succeeds after one reconnect (retry: 1), and a sad path that exhausts retries (confirming 524 is now retried instead of returning "unexpected status code").
  • TestUpdate_RetryOnBodyReadError — hijacks the connection to send a 200 OK with a truncated body (declared Content-Length larger than the bytes sent, then closes the connection), forcing an io.ReadAll error after the status code, and verifies the next attempt succeeds.

The NVD updater could fail an entire run on transient errors that were
not covered by the retry loop:

- The response body was read outside `fetchURL`'s retry loop, so an
  HTTP/2 `INTERNAL_ERROR` (stream aborted mid-body after a 200 OK) was
  fatal instead of retried.
- Cloudflare's non-standard 524 (origin timeout) fell into the default
  branch and returned an "unexpected status code" error.

Read the body inside the retry loop and add 524 to the retryable
statuses. A single request attempt is extracted into `doRequest`, which
closes the response body via `defer`.
@DmitriyLewen

Copy link
Copy Markdown
Contributor Author

@DmitriyLewen DmitriyLewen marked this pull request as ready for review June 18, 2026 15:04
knqyf263
knqyf263 previously approved these changes Jun 19, 2026
Comment thread nvd/nvd.go Outdated
// doRequest performs a single NVD request attempt and closes the response body
// before returning. It returns the response body on success. A nil body with a
// nil error means the request should be retried after waiting for `wait`.
func (u Updater) doRequest(c *http.Client, url string, attempt int) (body []byte, wait time.Duration, err error) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: passing attempt into doRequest leaks the backoff responsibility into a per-attempt function. attempt is loop state owned by fetchURL, and it's only used here to compute the linear backoff time.Duration(attempt) * time.Second.

The only thing doRequest really needs to hand back to the loop is the server-mandated wait (the rate-limit Retry-After); the backoff itself is the loop's policy. One way to express that is to drop attempt and signal "retryable" via a wrapped error that also carries the optional retry-after, so fetchURL owns the backoff:

// errRetry signals fetchURL to retry. retryAfter is the server-mandated
// minimum wait (rate limit); zero means apply the caller's backoff.
type errRetry struct{ retryAfter time.Duration }

func (errRetry) Error() string { return "retryable" }

func (u Updater) fetchURL(url string) ([]byte, error) {
	var c http.Client
	for i := 0; i <= u.retry; i++ {
		body, err := u.doRequest(&c, url)
		var re errRetry
		switch {
		case err == nil:
			return body, nil
		case errors.As(err, &re):
			wait := re.retryAfter
			if wait == 0 {
				wait = time.Duration(i) * time.Second
			}
			time.Sleep(wait)
		default:
			return nil, err
		}
	}
	return nil, xerrors.Errorf("unable to fetch url. Retry limit exceeded.")
}

doRequest then returns ([]byte, error): errRetry{retryAfter: ra} for 403/429, errRetry{} for 503/524/timeout/network/body-read errors, the body for 200, and a plain error for the unexpected-status case. As a bonus this also removes the body != nil retry sentinel, since success is now distinguished by err == nil.

That said, adding a dedicated error type might be overkill for this single spot, so I'll leave the call to you. Not blocking.

@DmitriyLewen DmitriyLewen Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea — that's more accurate.

Updated in 38049fd.

New test run from fork - https://github.com/DmitriyLewen/vuln-list-update/actions/runs/27869235304/job/82478367450

Replace the implicit "nil body + nil error means retry" contract and the
attempt parameter passed into doRequest with a dedicated errRetry type.
doRequest now returns ([]byte, error): the body on success, errRetry on a
retryable condition (with retryAfter for rate limits), and a plain error
otherwise. fetchURL owns the backoff policy.
@DmitriyLewen DmitriyLewen merged commit 0839fbb into aquasecurity:main Jun 22, 2026
3 checks passed
@DmitriyLewen DmitriyLewen deleted the fix/nvd-retry-body-read-and-524 branch June 22, 2026 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants