
pipeline run ignores --force, no way to bypass cache and re-query BigQuery #168

@sanskar0627

Description


Pipeline Run Has No Way to Bypass Cache and Re-Query BigQuery

Context

The pipeline run command caches query results locally to avoid hitting BigQuery on every invocation. This saves both time and money. However, once data is cached, there is currently no way to force a re-query from the CLI — even when upstream data has changed or the existing cache is known to be stale.

TODO(bassosimone): add support for -f/--force to bypass cache

Where the Cache Skips Happen

There are two separate locations where the pipeline short-circuits on cached data.

1. `sync_mlab()` in `iqb_pipeline.py`

    with entry.lock():
        if not entry.exists():
            entry.sync()

If `data.parquet` and `stats.json` exist on disk, `entry.exists()` returns True and `sync()` is never called: no syncer runs, no BigQuery query, nothing.

2. `_bq_syncer()` in `pipeline.py:126`

    def _bq_syncer(self, entry: PipelineCacheEntry) -> bool:
        if entry.exists():
            log.info("querying for %s... skipped (cached)", entry)
            return True

Even if `sync()` is somehow called, the BigQuery syncer itself bails out early when the files exist. This is a second, independent layer of caching that prevents re-querying. Both checks are necessary for normal operation, but both must be bypassed when `--force` is used.
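A minimal sketch of threading a `force` flag through both layers. The class and function names below mirror the snippets above, but they are stand-ins; the real iqb types and call sites may differ.

```python
import logging

log = logging.getLogger("pipeline")


class PipelineCacheEntry:
    """Minimal stand-in for the real cache entry."""

    def __init__(self, name: str, cached: bool) -> None:
        self.name = name
        self.cached = cached
        self.synced = False

    def exists(self) -> bool:
        return self.cached

    def sync(self) -> None:
        # In the real pipeline this would run the BigQuery syncer.
        self.synced = True


def sync_entry(entry: PipelineCacheEntry, force: bool = False) -> None:
    # Layer 1: skip the sync only when the cache is present AND
    # the caller did not ask to bypass it.
    if force or not entry.exists():
        entry.sync()


def bq_syncer(entry: PipelineCacheEntry, force: bool = False) -> bool:
    # Layer 2: the syncer's own short-circuit must honor `force` too,
    # otherwise fixing layer 1 alone is not enough.
    if entry.exists() and not force:
        log.info("querying for %s... skipped (cached)", entry.name)
        return True
    entry.sync()  # re-query BigQuery
    return True
```

The key point is that `force` must reach both checks: bypassing only the outer `exists()` guard still leaves the syncer's internal skip in place.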

Existing Precedent

The `iqb cache pull` command already implements a `-f`/`--force` flag (see `cache_pull.py`) that re-downloads files when hashes mismatch. Adding the same flag to `pipeline run` would maintain CLI consistency.
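For illustration, here is one way the flag could be wired up with `argparse`. This is a hypothetical sketch: the real iqb CLI may use a different framework, but the flag shape matches the existing `cache pull -f/--force` precedent.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    # Hypothetical subcommand layout; only the flag shape is the point.
    parser = argparse.ArgumentParser(prog="iqb")
    sub = parser.add_subparsers(dest="command")

    run = sub.add_parser("run", help="run the pipeline")
    run.add_argument(
        "-f",
        "--force",
        action="store_true",
        help="bypass the local cache and re-query BigQuery",
    )
    return parser
```

The resulting `args.force` would then be passed down to both cache checks described above.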
