Implement in-memory caching and checkpointing#4951

Open
DanielRyanSmith wants to merge 2 commits into main from pr5-checkpointing

@DanielRyanSmith

Contributor

Overview

This PR implements in-memory caching for recent statuses and checkpointing to minimize GCS and Datastore I/O.

Root Cause / Motivation

Even with parallelization, downloading large JSON status files from GCS and uploading them again for every single revision creates significant network overhead. Caching the statuses in memory and writing checkpoints to GCS/Datastore only periodically (every 20 revisions) dramatically speeds up the catch-up process.
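A minimal sketch of the "commit every 20 revisions" idea, for readers skimming the thread. The function name `process_with_checkpoints` and its callback parameters are hypothetical and not taken from this PR; the real main loop also commits on timeout, which is not modeled here.

```python
from typing import Callable, Iterable, Optional

CHECKPOINT_INTERVAL = 20  # revisions between checkpoints, per the PR description


def process_with_checkpoints(
    revisions: Iterable[str],
    process_revision: Callable[[str], None],
    commit_checkpoint: Callable[[str], None],
    interval: int = CHECKPOINT_INTERVAL,
) -> int:
    """Process revisions, committing once per `interval` revisions.

    Returns the number of checkpoints committed. All names here are
    illustrative stand-ins, not the script's actual API.
    """
    commits = 0
    pending = 0
    last_revision: Optional[str] = None
    for revision in revisions:
        process_revision(revision)  # only touches in-memory state
        last_revision = revision
        pending += 1
        if pending >= interval:
            commit_checkpoint(last_revision)
            commits += 1
            pending = 0
    # Commit the final partial window so completed work is not lost.
    if pending and last_revision is not None:
        commit_checkpoint(last_revision)
        commits += 1
    return commits
```

With 45 revisions and the default interval, this commits three times: after the 20th, after the 40th, and once more for the remaining five.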

Detailed Changelog

  • process_test_history.py:
    • Implemented global in-memory cache _prev_test_statuses_cache for recent test statuses.
    • Implemented gzip compression/decompression for GCS status files.
    • Removed GCS status upload from process_single_run.
    • Implemented flush_previous_statuses_to_gcs to upload cached statuses.
    • Implemented commit_checkpoint to update date in Datastore and flush GCS cache.
    • Refactored main loop to use checkpointing (commit every 20 revisions).
    • Refactored get_aligned_run_info and process_runs to return dates instead of directly updating Datastore.

@DanielRyanSmith DanielRyanSmith force-pushed the pr4-parallel-processing branch from be1643c to 10be9a2 Compare June 15, 2026 21:43
@DanielRyanSmith DanielRyanSmith force-pushed the pr4-parallel-processing branch from 10be9a2 to d3cd53c Compare June 15, 2026 21:57
@DanielRyanSmith DanielRyanSmith force-pushed the pr4-parallel-processing branch from d3cd53c to a758f41 Compare June 15, 2026 22:09
@DanielRyanSmith DanielRyanSmith force-pushed the pr4-parallel-processing branch from a758f41 to e7acbc0 Compare June 16, 2026 18:58
@DanielRyanSmith DanielRyanSmith force-pushed the pr5-checkpointing branch 2 times, most recently from ddfb889 to 403569d Compare June 16, 2026 19:28
@DanielRyanSmith DanielRyanSmith force-pushed the pr4-parallel-processing branch from e7acbc0 to 8a6d5cc Compare June 16, 2026 19:28

@jcscottiii jcscottiii left a comment

Collaborator

One bug and one question

def _populate_previous_statuses(browser_name: str) -> dict:
    """Create a dict with the most recent test statuses seen for browser."""
    verboseprint('Populating the most recently seen statuses...')
    if parsed_args.generate_new_statuses_json:

  • Critical Bug: generate_new_statuses_json is broken
    • Issue: When running with --generate-new-statuses-json (to create baseline statuses), _populate_previous_statuses returns an empty dict {} directly:
      if parsed_args.generate_new_statuses_json:
          verboseprint('Generating new statuses, so returning empty dict.')
          return {}
      This returns a new dict that is not stored in _prev_test_statuses_cache.
      Later, commit_checkpoint calls flush_previous_statuses_to_gcs, which checks the cache:
      if browser_name not in _prev_test_statuses_cache:
          verboseprint(f'No cached statuses to flush for {browser_name}')
          return
      Since the browser is not in the cache, it exits early, and the baseline statuses are never uploaded to GCS.
    • Fix: Initialize the cache when generating new statuses:
      if parsed_args.generate_new_statuses_json:
          verboseprint('Generating new statuses, so returning empty dict.')
          _prev_test_statuses_cache[browser_name] = {}
          return _prev_test_statuses_cache[browser_name]
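The failure mode described above can be reproduced in isolation with a small sketch. `populate_buggy`, `populate_fixed`, and the `uploaded` dict are hypothetical stand-ins, and the flush function returns a bool here purely so the skipped upload is observable; the real function is assumed to return nothing.

```python
from typing import Dict

_prev_test_statuses_cache: Dict[str, dict] = {}
uploaded: Dict[str, dict] = {}  # hypothetical stand-in for GCS


def populate_buggy(browser_name: str) -> dict:
    # Returns a fresh dict that the cache never learns about.
    return {}


def populate_fixed(browser_name: str) -> dict:
    # Registers the dict in the cache before handing it out.
    _prev_test_statuses_cache[browser_name] = {}
    return _prev_test_statuses_cache[browser_name]


def flush_previous_statuses_to_gcs(browser_name: str) -> bool:
    """Upload cached statuses; report whether anything was uploaded."""
    if browser_name not in _prev_test_statuses_cache:
        return False
    uploaded[browser_name] = dict(_prev_test_statuses_cache[browser_name])
    return True
```

Statuses written into the dict from `populate_buggy` are invisible to the flush, which returns `False`. With `populate_fixed`, the caller and the cache share one dict, so the same writes are uploaded.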



# default parameters used for cloud functions.
def commit_checkpoint(date_entity: MostRecentHistoryProcessed, new_date: str) -> None:

  • Question: Checkpoint Commit Order
    • Thought: In commit_checkpoint, it updates the Datastore date before flushing GCS caches:

      # 1. Update date in Datastore
      update_recent_processed_date(date_entity, new_date)
      # 2. Flush GCS caches
      for browser in ...: flush_previous_statuses_to_gcs(browser)

      If the GCS flush fails (e.g., due to a transient network error), the Datastore date has already been advanced. On the next run, the script will start from new_date while GCS still holds the old statuses. That could produce incorrect diffs (duplicate or missed deltas) in subsequent runs, because GCS was never updated for the current window.

      Should we reverse the order?

      # 1. Flush GCS caches
      for browser in ...: flush_previous_statuses_to_gcs(browser)
      # 2. Update date in Datastore (only if flush succeeded)
      update_recent_processed_date(date_entity, new_date)

      If GCS flush fails with this reversed order, we crash and the Datastore date is not advanced. On the next run, we start from the last successful checkpoint and re-process the runs. Because of deterministic keys (PR 4949), re-processing is safe (idempotent).

      But are there other implications of reversing this order that we should consider?
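To make the proposed order concrete, here is a hedged sketch with the storage calls injected as callbacks. `commit_checkpoint_flush_first` and its parameters are hypothetical; the real `commit_checkpoint` takes a `MostRecentHistoryProcessed` entity and is not reproduced here.

```python
from typing import Callable, Iterable


def commit_checkpoint_flush_first(
    browsers: Iterable[str],
    flush: Callable[[str], None],
    update_date: Callable[[str], None],
    new_date: str,
) -> None:
    """Flush every browser's statuses, then advance the processed date.

    Any exception raised by `flush` propagates before `update_date` runs,
    so the stored date cannot get ahead of the stored statuses.
    """
    for browser in browsers:
        flush(browser)
    # Reached only if every flush above succeeded.
    update_date(new_date)
```

One case this sketch leaves open: if some flushes succeed and a later one fails, or the date update itself fails, GCS may hold statuses newer than the stored date. Whether re-processing from the old date then yields correct diffs depends on how the cache is rebuilt on recovery.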

For the record, Gemini initially thought we should reverse it. But I'm wondering whether there are implications it may have missed, so I had it phrase the feedback here as a question instead.

Base automatically changed from pr4-parallel-processing to main June 18, 2026 21:29
- Implement global in-memory cache _prev_test_statuses_cache for recent test statuses to avoid redundant GCS downloads.
- Implement gzip compression/decompression for GCS status files to reduce network bandwidth.
- Remove GCS status upload from process_single_run.
- Implement flush_previous_statuses_to_gcs to upload cached statuses to GCS.
- Implement commit_checkpoint to update processed date in Datastore and flush GCS cache.
- Refactor main loop to only commit checkpoints every 20 revisions (or on timeout/completion), reducing GCS/Datastore write frequency by ~90%.
- Refactor get_aligned_run_info and process_runs to return dates instead of directly updating Datastore.
- Add unit tests for caching, gzip compression, and checkpointing.

TAG=agy
CONV=9096c6ce-d7f3-4a97-aa8d-31e76c7337c5
…n check

Removing should_process_run check ensures that we reprocess runs since the last checkpoint on recovery, reconstructing the in-memory cache correctly. Deterministic keys ensure this reprocessing is idempotent and safe.
Also restored docstring in _get_entry_key_name.

TAG=agy
CONV=b279eb05-86f7-480a-aa31-ce39ad9d364b
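Why deterministic keys make re-processing safe can be shown with a small sketch. The key recipe below is an assumption for illustration only; the actual `_get_entry_key_name` is not shown in this thread and may differ. `datastore` is a plain dict standing in for Datastore, and `put_delta` is a hypothetical helper.

```python
import hashlib
from typing import Dict

datastore: Dict[str, dict] = {}  # hypothetical stand-in, keyed by entity name


def entry_key_name(test_name: str, run_id: int) -> str:
    # Assumed recipe: the same inputs always map to the same key.
    return hashlib.sha256(f"{run_id}:{test_name}".encode("utf-8")).hexdigest()


def put_delta(test_name: str, run_id: int, status: str) -> None:
    """Write a status delta; repeating the call overwrites the same entity."""
    key = entry_key_name(test_name, run_id)
    datastore[key] = {"test": test_name, "run_id": run_id, "status": status}
```

Because the key depends only on the run and the test, replaying a run after a crash overwrites the earlier entity instead of creating a duplicate.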
@DanielRyanSmith

Contributor Author

I should have waited to merge the previous PR, because this one still needs to be reviewed, and the previous PR mentioned that it should be merged together with this one.

It should be okay, since these changes need to be added to the cloud function anyway to be finalized. 🙂
