Implement in-memory caching and checkpointing by DanielRyanSmith · Pull Request #4951 · web-platform-tests/wpt.fyi

DanielRyanSmith · 2026-06-15T21:42:29Z

Overview

This PR implements in-memory caching for recent statuses and checkpointing to minimize GCS and Datastore I/O.

Root Cause / Motivation

Even with parallelization, downloading and uploading large JSON status files from GCS for every single revision creates significant network overhead. Caching them in memory and only writing checkpoints to GCS/Datastore periodically (every 20 revisions) dramatically speeds up the catch-up process.
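The checkpointing idea described above can be illustrated with a minimal sketch. It is not the PR's code: `process_with_checkpoints`, `process_revision`, and the single-argument `commit_checkpoint` callback are hypothetical names used only for illustration, and only the interval of 20 comes from the description.

```python
from typing import Callable, Iterable, Optional

CHECKPOINT_INTERVAL = 20  # interval quoted in the PR description


def process_with_checkpoints(
    revisions: Iterable[str],
    process_revision: Callable[[str], None],
    commit_checkpoint: Callable[[str], None],
    interval: int = CHECKPOINT_INTERVAL,
) -> int:
    """Process revisions in order and commit once per `interval` revisions.

    Both callbacks are hypothetical stand-ins. Returns the number of
    checkpoints committed.
    """
    commits = 0
    pending = 0
    last: Optional[str] = None
    for rev in revisions:
        process_revision(rev)  # assumed to touch in-memory state only
        last = rev
        pending += 1
        if pending >= interval:
            commit_checkpoint(last)  # the only step that does remote I/O
            commits += 1
            pending = 0
    if pending and last is not None:
        # Commit the final, partially filled window as well.
        commit_checkpoint(last)
        commits += 1
    return commits
```

With 45 revisions this sketch commits three times (after the 20th, 40th, and 45th revision) instead of 45 times.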

Detailed Changelog

process_test_history.py:
- Implemented global in-memory cache _prev_test_statuses_cache for recent test statuses.
- Implemented gzip compression/decompression for GCS status files.
- Removed GCS status upload from process_single_run.
- Implemented flush_previous_statuses_to_gcs to upload cached statuses.
- Implemented commit_checkpoint to update date in Datastore and flush GCS cache.
- Refactored main loop to use checkpointing (commit every 20 revisions).
- Refactored get_aligned_run_info and process_runs to return dates instead of directly updating Datastore.
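As a rough illustration of the gzip item in the changelog, the following sketch round-trips a statuses dict through gzip-compressed JSON. The helper names `compress_statuses` and `decompress_statuses` are assumptions made for this example; the PR's actual helpers and the GCS upload call are not shown in this thread.

```python
import gzip
import json


def compress_statuses(statuses: dict) -> bytes:
    """Serialize a statuses dict to JSON and gzip the UTF-8 bytes."""
    return gzip.compress(json.dumps(statuses).encode("utf-8"))


def decompress_statuses(blob: bytes) -> dict:
    """Inverse of compress_statuses: gunzip, then parse the JSON."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    # Made-up sample data: repetitive paths and statuses compress well.
    sample = {f"/dir/test-{i}.html": "PASS" for i in range(1000)}
    blob = compress_statuses(sample)
    print(len(json.dumps(sample)), "bytes raw,", len(blob), "bytes gzipped")
```

Status files are highly repetitive (similar test paths, a handful of status strings), which is the kind of input on which gzip tends to help most.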

jcscottiii

One bug and one question

jcscottiii · 2026-06-17T14:50:49Z

 def _populate_previous_statuses(browser_name: str) -> dict:
    """Create a dict with the most recent test statuses seen for browser."""
    verboseprint('Populating the most recently seen statuses...')
    if parsed_args.generate_new_statuses_json:


Critical Bug: generate_new_statuses_json is broken

Issue: When running with --generate-new-statuses-json (to create baseline statuses), _populate_previous_statuses returns an empty dict {} directly:
    if parsed_args.generate_new_statuses_json:
        verboseprint('Generating new statuses, so returning empty dict.')
        return {}
This returns a new dict that is not stored in _prev_test_statuses_cache.
Later, commit_checkpoint calls flush_previous_statuses_to_gcs, which checks the cache:
    if browser_name not in _prev_test_statuses_cache:
        verboseprint(f'No cached statuses to flush for {browser_name}')
        return
Since the browser is not in the cache, it exits early, and the baseline statuses are never uploaded to GCS.

Fix: Initialize the cache when generating new statuses:
    if parsed_args.generate_new_statuses_json:
        verboseprint('Generating new statuses, so returning empty dict.')
        _prev_test_statuses_cache[browser_name] = {}
        return _prev_test_statuses_cache[browser_name]

jcscottiii · 2026-06-17T15:21:01Z



+# default parameters used for cloud functions.
+def commit_checkpoint(date_entity: MostRecentHistoryProcessed, new_date: str) -> None:


Question: Checkpoint Commit Order

Thought: commit_checkpoint updates the Datastore date before it flushes the GCS caches:

    # 1. Update date in Datastore
    update_recent_processed_date(date_entity, new_date)
    # 2. Flush GCS caches
    for browser in ...:
        flush_previous_statuses_to_gcs(browser)

If GCS flush fails (e.g., due to transient network error), the Datastore date has already been advanced. On the next run, the script will start from new_date but GCS will still have old statuses, potentially leading to incorrect diffs (duplicate or missed deltas) in subsequent runs because we skipped updating GCS for the current window.

Should we reverse the order?

    # 1. Flush GCS caches
    for browser in ...:
        flush_previous_statuses_to_gcs(browser)
    # 2. Update date in Datastore (only if flush succeeded)
    update_recent_processed_date(date_entity, new_date)

If GCS flush fails with this reversed order, we crash and the Datastore date is not advanced. On the next run, we start from the last successful checkpoint and re-process the runs. Because of deterministic keys (PR 4949), re-processing is safe (idempotent).

But are there other implications of reversing this order that we should consider?

For the record, Gemini initially thought we should reverse it. But I'm wondering whether there are implications it may have missed, so I had it phrase the feedback here as a question instead.
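The flush-first ordering raised in the question above can be sketched as follows. This is an illustration under assumptions, not the merged implementation: `commit_checkpoint_flush_first`, `FlushError`, and the `state` dict standing in for the Datastore entity are all hypothetical, and the `flush` callback is assumed to raise on failure.

```python
from typing import Callable, Iterable


class FlushError(Exception):
    """Hypothetical error representing a failed GCS upload."""


def commit_checkpoint_flush_first(
    state: dict,
    new_date: str,
    browsers: Iterable[str],
    flush: Callable[[str], None],
) -> None:
    """Flush every browser's cache, then advance the processed date.

    `state` stands in for the Datastore entity. If any flush raises,
    the date is left untouched, so the next run restarts from the last
    successful checkpoint.
    """
    for browser in browsers:
        flush(browser)  # assumed to raise on failure
    state["processed_date"] = new_date
```

If the first flush raises, `state["processed_date"]` keeps its old value and the window is processed again on the next run. Note that a failure on a later browser leaves earlier browsers already flushed, so this ordering relies on re-processing being idempotent, as the comment states.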

- Implement global in-memory cache _prev_test_statuses_cache for recent test statuses to avoid redundant GCS downloads.
- Implement gzip compression/decompression for GCS status files to reduce network bandwidth.
- Remove GCS status upload from process_single_run.
- Implement flush_previous_statuses_to_gcs to upload cached statuses to GCS.
- Implement commit_checkpoint to update processed date in Datastore and flush GCS cache.
- Refactor main loop to only commit checkpoints every 20 revisions (or on timeout/completion), reducing GCS/Datastore write frequency by ~90%.
- Refactor get_aligned_run_info and process_runs to return dates instead of directly updating Datastore.
- Add unit tests for caching, gzip compression, and checkpointing.

TAG=agy CONV=9096c6ce-d7f3-4a97-aa8d-31e76c7337c5

…n check

Removing should_process_run check ensures that we reprocess runs since the last checkpoint on recovery, reconstructing the in-memory cache correctly. Deterministic keys ensure this reprocessing is idempotent and safe. Also restored docstring in _get_entry_key_name.

TAG=agy CONV=b279eb05-86f7-480a-aa31-ce39ad9d364b

DanielRyanSmith · 2026-06-18T23:45:16Z

I should have waited to merge the previous PR, because this one still needs to be reviewed, and the previous PR mentioned that it should be merged together with this one.

It should be okay, since these changes need to be added to the cloud function anyway to be finalized. 🙂

DanielRyanSmith force-pushed the pr4-parallel-processing branch from be1643c to 10be9a2 Compare June 15, 2026 21:43

DanielRyanSmith force-pushed the pr5-checkpointing branch from d49c3a6 to 9ecaa2e Compare June 15, 2026 21:43

DanielRyanSmith force-pushed the pr4-parallel-processing branch from 10be9a2 to d3cd53c Compare June 15, 2026 21:57

DanielRyanSmith force-pushed the pr5-checkpointing branch from 9ecaa2e to b56aac4 Compare June 15, 2026 21:58

DanielRyanSmith force-pushed the pr4-parallel-processing branch from d3cd53c to a758f41 Compare June 15, 2026 22:09

DanielRyanSmith force-pushed the pr5-checkpointing branch from b56aac4 to 3e6a3a0 Compare June 15, 2026 22:09

DanielRyanSmith force-pushed the pr4-parallel-processing branch from a758f41 to e7acbc0 Compare June 16, 2026 18:58

DanielRyanSmith force-pushed the pr5-checkpointing branch 2 times, most recently from ddfb889 to 403569d Compare June 16, 2026 19:28

DanielRyanSmith force-pushed the pr4-parallel-processing branch from e7acbc0 to 8a6d5cc Compare June 16, 2026 19:28

jcscottiii reviewed Jun 17, 2026


Base automatically changed from pr4-parallel-processing to main June 18, 2026 21:29

DanielRyanSmith added 2 commits June 18, 2026 21:33

DanielRyanSmith force-pushed the pr5-checkpointing branch from 403569d to 59106b2 Compare June 18, 2026 21:57

jcscottiii approved these changes Jun 18, 2026

DanielRyanSmith wants to merge 2 commits into main from pr5-checkpointing
