fix(nimbus): isolate and stream warm_api_caches so v6/v7/v8 actually warm by jaredlockhart · Pull Request #15622 · mozilla/experimenter

jaredlockhart · 2026-05-13T18:58:52Z

Because

The hourly warm_api_caches Celery task OOM-kills the worker (~3.5 GB) seven seconds after warming v5:csv, so v6/v7/v8 are never pre-warmed
Cache-miss requests on the cold v6/v7/v8 caches risk timing out at the gunicorn worker limit (SystemExit:1 on /api/v{6,7,8}/experiments/ in Sentry)

This commit

Splits warm_api_caches into a dispatcher that fires one sub-task per endpoint, so one failure can't tear down the others
Streams warm_api_cache via qs.iterator() + JSONEncoder.iterencode so neither the full dict tree nor the full JSON string co-exist in memory. Output byte-identical to the non-streamed render, pinned by a new test
Adds per-endpoint metrics so a silently-not-warming endpoint is alertable

Local profile (500 factory experiments, v8 viewset): peak RSS 71.75 → 49.62 MB (−31%).

…warm Because * The hourly warm_api_caches Celery task OOM-kills the worker pod (~3.5 GB anon-rss) about 7 seconds after logging "Warmed v5:csv", meaning v6/v7/v8 are never pre-warmed and only get populated by ad-hoc cache-miss requests * When the cache is cold, real requests time out at gunicorn's worker limit and surface in Sentry as SystemExit:1 on /api/v{6,7,8}/experiments/ * The single-task design propagates one endpoint's failure to all later endpoints in the loop, and materializing the full Python data list plus the full JSON bytestring at once was the source of the memory blowup for the heavier endpoints This commit * Splits warm_api_caches into a thin dispatcher that fires one warm_api_cache_endpoint sub-task per endpoint via .delay(), so a single endpoint failing (OOM, timeout, etc.) no longer prevents the others from warming * Streams the warm path: warm_api_cache now iterates the queryset via .iterator(chunk_size=N) and writes per-experiment rendered JSON bytes to a BytesIO buffer, eliminating the materialized list of serialized dicts that previously co-existed with the final bytes * Adds per-endpoint markus metrics (warm_api_cache_endpoint.<key>.started/completed/failed) so an endpoint silently not warming is alertable * Updates tests to cover the dispatcher pattern, the failure-isolation semantics, and the unknown-key path; enables eager Celery mode in the affected test classes so existing assertions still hold Fixes #15621

…amArray Because * The previous BytesIO + manual `[`/`,`/`]` framing in stream_render_queryset was hand-rolled rather than the idiomatic stdlib pattern for the "stream a large JSON array" problem * The well-documented stdlib idiom — wrap a generator-of-dicts in a list-subclass that lies about __len__ and pipe it through json.JSONEncoder.iterencode — produces the same byte output with the same memory characteristics in a more recognizable shape This commit * Replaces the BytesIO buffer in stream_render_queryset with a _StreamArray(list) wrapper around the per-item serializer-data generator, fed to json.JSONEncoder.iterencode so the encoder emits the array framing itself * Peeks the underlying iterator at construction so the empty case reports __len__ == 0, which makes iterencode short-circuit to `[]` instead of dropping the opening bracket (the standard pure-Python _iterencode_list emits the opening `[` only inside its for-body) * Builds the encoder with DRFJSONEncoder + SHORT_SEPARATORS pulled from rest_framework so the streamed bytes are byte-identical to JSONRenderer().render(serializer(qs, many=True).data) — pinned by the new test_stream_render_queryset_matches_drf_json_renderer test * Local tracemalloc profile (500 factory recipes against the v8 viewset queryset+serializer) shows peak RSS drops from 71.75 MB on the pre-PR full-materialize path to 49.62 MB on the streamed path (-31%), with byte-identical output across all variants tested

Because * The streaming refactor stripped the pre-existing docstrings on get_api_cache_key, warm_api_cache, and CachedListMixin, which weren't part of the actual change * _StreamArray, _drf_compatible_encoder, and stream_render_queryset are not obvious from their bodies alone - the StreamArray __len__ trick in particular relies on a documented stdlib quirk that's worth calling out This commit * Restores the three pre-existing docstrings unchanged * Adds brief docstrings on the three new symbols explaining what they do and why the shape is the way it is

mikewilli · 2026-05-14T19:43:39Z

    if sort_key is not None:
-        qs = sorted(qs, key=sort_key, reverse=True)
-    data = serializer_class(qs, many=True).data
-    rendered = renderer.render(data)
+        qs = sorted(queryset.all(), key=sort_key, reverse=True)
+        data = serializer_class(qs, many=True).data
+        rendered = renderer.render(data)
+    else:
+        rendered = stream_render_queryset(queryset, serializer_class)
+


So will the v5:csv still fail since it has a sort_key defined? But it just won't kill the tasks for the other endpoints now?

No v5:csv wasn't failing, it was just succeeding before proceeding to the next ones which were then OOMing. It should still proceed healthily.

jaredlockhart added 2 commits May 13, 2026 14:58

jaredlockhart force-pushed the 15621 branch 2 times, most recently from b183278 to 79ce9b0 Compare May 14, 2026 17:35

jaredlockhart force-pushed the 15621 branch from 79ce9b0 to 1286795 Compare May 14, 2026 17:44

jaredlockhart marked this pull request as ready for review May 14, 2026 17:47

jaredlockhart requested review from freshstrangemusic, mikewilli and yashikakhurana as code owners May 14, 2026 17:47

jaredlockhart enabled auto-merge May 14, 2026 17:47

mikewilli approved these changes May 14, 2026

View reviewed changes

jaredlockhart added this pull request to the merge queue May 14, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 14, 2026

jaredlockhart added this pull request to the merge queue May 14, 2026

Merged via the queue into main with commit 6f1c49c May 14, 2026
26 checks passed

jaredlockhart deleted the 15621 branch May 14, 2026 21:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(nimbus): isolate and stream warm_api_caches so v6/v7/v8 actually warm#15622

fix(nimbus): isolate and stream warm_api_caches so v6/v7/v8 actually warm#15622
jaredlockhart merged 3 commits into
mainfrom
15621

jaredlockhart commented May 13, 2026 •

edited

Loading

Uh oh!

mikewilli May 14, 2026

Uh oh!

jaredlockhart May 14, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jaredlockhart commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mikewilli May 14, 2026

Choose a reason for hiding this comment

Uh oh!

jaredlockhart May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jaredlockhart commented May 13, 2026 •

edited

Loading