Skip to content

fix(nimbus): isolate and stream warm_api_caches so v6/v7/v8 actually warm#15622

Merged
jaredlockhart merged 3 commits into
mainfrom
15621
May 14, 2026
Merged

fix(nimbus): isolate and stream warm_api_caches so v6/v7/v8 actually warm#15622
jaredlockhart merged 3 commits into
mainfrom
15621

Conversation

@jaredlockhart
Copy link
Copy Markdown
Collaborator

@jaredlockhart jaredlockhart commented May 13, 2026

Because

  • The hourly warm_api_caches Celery task OOM-kills the worker (~3.5 GB) seven seconds after warming v5:csv, so v6/v7/v8 are never pre-warmed
  • Cache-miss requests on the cold v6/v7/v8 caches risk timing out at the gunicorn worker limit (SystemExit:1 on /api/v{6,7,8}/experiments/ in Sentry)

This commit

  • Splits warm_api_caches into a dispatcher that fires one sub-task per endpoint, so one failure can't tear down the others
  • Streams warm_api_cache via qs.iterator() + JSONEncoder.iterencode so neither the full dict tree nor the full JSON string co-exist in memory. Output byte-identical to the non-streamed render, pinned by a new test
  • Adds per-endpoint metrics so a silently-not-warming endpoint is alertable

Local profile (500 factory experiments, v8 viewset): peak RSS 71.75 → 49.62 MB (−31%).

Fixes #15621

…warm

Because

* The hourly warm_api_caches Celery task OOM-kills the worker pod
  (~3.5 GB anon-rss) about 7 seconds after logging "Warmed v5:csv",
  meaning v6/v7/v8 are never pre-warmed and only get populated by
  ad-hoc cache-miss requests
* When the cache is cold, real requests time out at gunicorn's worker
  limit and surface in Sentry as SystemExit:1 on
  /api/v{6,7,8}/experiments/
* The single-task design propagates one endpoint's failure to all
  later endpoints in the loop, and materializing the full Python data
  list plus the full JSON bytestring at once was the source of the
  memory blowup for the heavier endpoints

This commit

* Splits warm_api_caches into a thin dispatcher that fires one
  warm_api_cache_endpoint sub-task per endpoint via .delay(), so a
  single endpoint failing (OOM, timeout, etc.) no longer prevents the
  others from warming
* Streams the warm path: warm_api_cache now iterates the queryset
  via .iterator(chunk_size=N) and writes per-experiment rendered JSON
  bytes to a BytesIO buffer, eliminating the materialized list of
  serialized dicts that previously co-existed with the final bytes
* Adds per-endpoint markus metrics
  (warm_api_cache_endpoint.<key>.started/completed/failed) so an
  endpoint silently not warming is alertable
* Updates tests to cover the dispatcher pattern, the failure-isolation
  semantics, and the unknown-key path; enables eager Celery mode in
  the affected test classes so existing assertions still hold

Fixes #15621
…amArray

Because

* The previous BytesIO + manual `[`/`,`/`]` framing in stream_render_queryset
  was hand-rolled rather than the idiomatic stdlib pattern for the
  "stream a large JSON array" problem
* The well-documented stdlib idiom — wrap a generator-of-dicts in a
  list-subclass that lies about __len__ and pipe it through
  json.JSONEncoder.iterencode — produces the same byte output with the
  same memory characteristics in a more recognizable shape

This commit

* Replaces the BytesIO buffer in stream_render_queryset with a
  _StreamArray(list) wrapper around the per-item serializer-data
  generator, fed to json.JSONEncoder.iterencode so the encoder emits the
  array framing itself
* Peeks the underlying iterator at construction so the empty case
  reports __len__ == 0, which makes iterencode short-circuit to `[]`
  instead of dropping the opening bracket (the standard pure-Python
  _iterencode_list emits the opening `[` only inside its for-body)
* Builds the encoder with DRFJSONEncoder + SHORT_SEPARATORS pulled from
  rest_framework so the streamed bytes are byte-identical to
  JSONRenderer().render(serializer(qs, many=True).data) — pinned by the
  new test_stream_render_queryset_matches_drf_json_renderer test
* Local tracemalloc profile (500 factory recipes against the v8 viewset
  queryset+serializer) shows peak RSS drops from 71.75 MB on the
  pre-PR full-materialize path to 49.62 MB on the streamed path (-31%),
  with byte-identical output across all variants tested
@jaredlockhart jaredlockhart force-pushed the 15621 branch 2 times, most recently from b183278 to 79ce9b0 Compare May 14, 2026 17:35
Because

* The streaming refactor stripped the pre-existing docstrings on
  get_api_cache_key, warm_api_cache, and CachedListMixin, which weren't
  part of the actual change
* _StreamArray, _drf_compatible_encoder, and stream_render_queryset are
  not obvious from their bodies alone - the StreamArray __len__ trick in
  particular relies on a documented stdlib quirk that's worth calling out

This commit

* Restores the three pre-existing docstrings unchanged
* Adds brief docstrings on the three new symbols explaining what they do
  and why the shape is the way it is
Comment on lines 82 to +88
if sort_key is not None:
qs = sorted(qs, key=sort_key, reverse=True)
data = serializer_class(qs, many=True).data
rendered = renderer.render(data)
qs = sorted(queryset.all(), key=sort_key, reverse=True)
data = serializer_class(qs, many=True).data
rendered = renderer.render(data)
else:
rendered = stream_render_queryset(queryset, serializer_class)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So will the v5:csv still fail since it has a sort_key defined? But it just won't kill the tasks for the other endpoints now?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No v5:csv wasn't failing, it was just succeeding before proceeding to the next ones which were then OOMing. It should still proceed healthily.

@jaredlockhart jaredlockhart added this pull request to the merge queue May 14, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 14, 2026
@jaredlockhart jaredlockhart added this pull request to the merge queue May 14, 2026
Merged via the queue into main with commit 6f1c49c May 14, 2026
26 checks passed
@jaredlockhart jaredlockhart deleted the 15621 branch May 14, 2026 21:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

warm_api_caches OOM-killed before v6/v7/v8 caches ever warm

2 participants