Pipeline Run Has No Way to Bypass Cache and Re-Query BigQuery
### Context
The pipeline run command caches query results locally to avoid hitting BigQuery on every invocation. This saves both time and money. However, once data is cached, there is currently no way to force a re-query from the CLI — even when upstream data has changed or the existing cache is known to be stale.
TODO(bassosimone): add support for -f/--force to bypass cache

### Where the Cache Skips Happen
There are two separate locations where the pipeline short-circuits on cached data.
- `sync_mlab()` in iqb_pipeline.py

```python
with entry.lock():
    if not entry.exists():
        entry.sync()
```
If data.parquet and stats.json exist on disk, `entry.exists()` returns True and `sync()` is never called at all: no syncer runs, no BigQuery query, nothing.
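One way the first check could honor a force flag is sketched below. The `force` parameter on `sync_mlab()` is the proposed addition, and `FakeEntry` is an illustrative stand-in for the real cache entry class, not code from the repository:

```python
from contextlib import contextmanager

class FakeEntry:
    """Minimal stand-in for the real cache entry, for illustration only."""

    def __init__(self, cached: bool):
        self.cached = cached
        self.synced = False

    @contextmanager
    def lock(self):
        # The real implementation takes a filesystem lock here.
        yield

    def exists(self) -> bool:
        return self.cached

    def sync(self) -> None:
        self.synced = True

def sync_mlab(entry, force: bool = False) -> None:
    with entry.lock():
        # With --force, skip the existence check and always re-sync,
        # even if data.parquet and stats.json are already on disk.
        if force or not entry.exists():
            entry.sync()

cached = FakeEntry(cached=True)
sync_mlab(cached)               # cache hit: sync() never runs
sync_mlab(cached, force=True)   # force bypasses the cache check
```

Threading the flag through as a parameter keeps the default behavior (and all existing call sites) unchanged.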
- `_bq_syncer()` in pipeline.py:126

```python
def _bq_syncer(self, entry: PipelineCacheEntry) -> bool:
    if entry.exists():
        log.info("querying for %s... skipped (cached)", entry)
        return True
```

Even if `sync()` is somehow called, the BigQuery syncer itself bails out early when the files exist. This is a second layer of caching that independently prevents re-querying. Both checks are necessary for normal operation, but both need to be bypassed when force is used.
### Existing Precedent
The `iqb cache pull` command already implements a `-f/--force` flag (in cache_pull.py) that re-downloads files when their hashes mismatch.
Adding the same flag to `pipeline run` would maintain CLI consistency.
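A rough sketch of the flag wiring, using argparse for illustration (the actual iqb CLI framework and subcommand layout may differ):

```python
import argparse

# Hypothetical wiring: mirrors the existing -f/--force on `cache pull`.
parser = argparse.ArgumentParser(prog="iqb pipeline")
sub = parser.add_subparsers(dest="command")

run = sub.add_parser("run", help="run the pipeline")
run.add_argument(
    "-f", "--force",
    action="store_true",
    help="bypass cached query results and re-query BigQuery",
)

# Both spellings enable the bypass; omitting the flag keeps the cache.
args = parser.parse_args(["run", "--force"])
```

The parsed `args.force` would then be passed down to both cache checks described above.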