Implement in-memory caching and checkpointing#4951
Conversation
be1643c to
10be9a2
Compare
d49c3a6 to
9ecaa2e
Compare
10be9a2 to
d3cd53c
Compare
9ecaa2e to
b56aac4
Compare
d3cd53c to
a758f41
Compare
b56aac4 to
3e6a3a0
Compare
a758f41 to
e7acbc0
Compare
ddfb889 to
403569d
Compare
e7acbc0 to
8a6d5cc
Compare
jcscottiii
left a comment
There was a problem hiding this comment.
One bug and one question
| def _populate_previous_statuses(browser_name: str) -> dict: | ||
| """Create a dict with the most recent test statuses seen for browser.""" | ||
| verboseprint('Populating the most recently seen statuses...') | ||
| if parsed_args.generate_new_statuses_json: |
There was a problem hiding this comment.
- Critical Bug:
generate_new_statuses_jsonis broken- Issue: When running with
--generate-new-statuses-json(to create baseline statuses),_populate_previous_statusesreturns an empty dict{}directly:This returns a new dict that is not stored inif parsed_args.generate_new_statuses_json: verboseprint('Generating new statuses, so returning empty dict.') return {}
_prev_test_statuses_cache.
Later,commit_checkpointcallsflush_previous_statuses_to_gcs, which checks the cache:Since the browser is not in the cache, it exits early, and the baseline statuses are never uploaded to GCS.if browser_name not in _prev_test_statuses_cache: verboseprint(f'No cached statuses to flush for {browser_name}') return
- Fix: Initialize the cache when generating new statuses:
if parsed_args.generate_new_statuses_json: verboseprint('Generating new statuses, so returning empty dict.') _prev_test_statuses_cache[browser_name] = {} return _prev_test_statuses_cache[browser_name]
- Issue: When running with
|
|
||
|
|
||
| # default parameters used for cloud functions. | ||
| def commit_checkpoint(date_entity: MostRecentHistoryProcessed, new_date: str) -> None: |
There was a problem hiding this comment.
- Question: Checkpoint Commit Order
-
Thought: In
commit_checkpoint, it updates the Datastore date before flushing GCS caches:# 1. Update date in Datastore update_recent_processed_date(date_entity, new_date) # 2. Flush GCS caches for browser in ...: flush_previous_statuses_to_gcs(browser)
If GCS flush fails (e.g., due to transient network error), the Datastore date has already been advanced. On the next run, the script will start from
new_datebut GCS will still have old statuses, potentially leading to incorrect diffs (duplicate or missed deltas) in subsequent runs because we skipped updating GCS for the current window.Should we reverse the order?
# 1. Flush GCS caches for browser in ...: flush_previous_statuses_to_gcs(browser) # 2. Update date in Datastore (only if flush succeeded) update_recent_processed_date(date_entity, new_date)
If GCS flush fails with this reversed order, we crash and the Datastore date is not advanced. On the next run, we start from the last successful checkpoint and re-process the runs. Because of deterministic keys (PR 4949), re-processing is safe (idempotent).
But are there other implications of reversing this order that we should consider?
-
There was a problem hiding this comment.
For the record, Gemini thought, we should reverse it initially. But I'm wondering if there are some implications it may have missed so I had it phrase the feedback here as a question instead.
- Implement global in-memory cache _prev_test_statuses_cache for recent test statuses to avoid redundant GCS downloads. - Implement gzip compression/decompression for GCS status files to reduce network bandwidth. - Remove GCS status upload from process_single_run. - Implement flush_previous_statuses_to_gcs to upload cached statuses to GCS. - Implement commit_checkpoint to update processed date in Datastore and flush GCS cache. - Refactor main loop to only commit checkpoints every 20 revisions (or on timeout/completion), reducing GCS/Datastore write frequency by ~90%. - Refactor get_aligned_run_info and process_runs to return dates instead of directly updating Datastore. - Add unit tests for caching, gzip compression, and checkpointing. TAG=agy CONV=9096c6ce-d7f3-4a97-aa8d-31e76c7337c5
…n check Removing should_process_run check ensures that we reprocess runs since the last checkpoint on recovery, reconstructing the in-memory cache correctly. Deterministic keys ensure this reprocessing is idempotent and safe. Also restored docstring in _get_entry_key_name. TAG=agy CONV=b279eb05-86f7-480a-aa31-ce39ad9d364b
403569d to
59106b2
Compare
|
I should have waited to merge the previous PR, because this one still needs to be reviewed, and the previous PR mentioned that it should be merged together with this one. It should be okay, since these changes need to be added to the cloud function anyway to be finalized. 🙂 |
Overview
This PR implements in-memory caching for recent statuses and checkpointing to minimize GCS and Datastore I/O.
Root Cause / Motivation
Even with parallelization, downloading and uploading large JSON status files from GCS for every single revision creates significant network overhead. Caching them in memory and only writing checkpoints to GCS/Datastore periodically (every 20 revisions) dramatically speeds up the catch-up process.
Detailed Changelog
process_test_history.py:_prev_test_statuses_cachefor recent test statuses.process_single_run.flush_previous_statuses_to_gcsto upload cached statuses.commit_checkpointto update date in Datastore and flush GCS cache.mainloop to use checkpointing (commit every 20 revisions).get_aligned_run_infoandprocess_runsto return dates instead of directly updating Datastore.