This directory holds cache artifacts, pipeline configuration, and the remote-cache manifest produced by the IQB data pipeline.
- `pipeline.yaml`: configuration consumed by `iqb pipeline run`.
- `cache/`: local cache written by `iqb pipeline run`.
- `state/ghremote/manifest.json`: manifest used by `iqb cache`.
- Python 3.13 using `uv`, as documented in the top-level README.md.
- Google Cloud SDK (`gcloud`) installed.
- `gcloud auth login` with an account subscribed to the M-Lab Discuss mailing list.
- `gcloud auth application-default login` using the same account.
The `state/ghremote/manifest.json` file lists all the query results already cached in GCS. To sync files from GCS to the local copy, run:

```bash
uv run iqb cache pull -d .
```

Omit `-d .` if running from the top-level directory. Run `uv run iqb cache pull --help` for more help.
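Because the bucket grants public read access (see the bucket setup below), individual cached objects can also be fetched over plain HTTPS without the CLI. A minimal sketch; the assumption that object names mirror the local cache layout is ours, not documented behavior:

```python
# Build the unauthenticated HTTPS URL for a cached object.
# ASSUMPTION: object names in the bucket mirror the local cache layout,
# e.g. "v1/{start_date}/{end_date}/{query_type}/data.parquet".
BUCKET = "mlab-sandbox-iqb-us-central1"

def public_object_url(object_name: str) -> str:
    """Return the storage.googleapis.com URL for a public object."""
    return f"https://storage.googleapis.com/{BUCKET}/{object_name}"
```

The resulting URL can then be passed to any HTTP client (e.g. `urllib.request.urlretrieve`).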
Run the pipeline to query BigQuery and populate the local cache:

```bash
uv run iqb pipeline run -d .
```

This command loads `pipeline.yaml` to determine the query matrix and executes BigQuery queries to generate the data. If the cache already contains the data, BigQuery is not queried, which avoids burning cloud credits.

Omit `-d .` if running from the top-level directory. Run `uv run iqb pipeline run --help` for more help.
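The cache-first behavior described above can be sketched as follows; `run_query_if_missing` and its callback are hypothetical names for illustration, not the pipeline's actual API:

```python
import os

def run_query_if_missing(cache_dir, start, end, query_type, run_bigquery):
    """Return the cached parquet path, executing BigQuery only on a cache miss."""
    entry = os.path.join(cache_dir, "v1", start, end, query_type, "data.parquet")
    if os.path.exists(entry):
        return entry  # cache hit: no BigQuery job, no credits spent
    run_bigquery(entry)  # hypothetical callback that writes the parquet file
    return entry
```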
Show which entries are local, remote, or missing:

```bash
uv run iqb cache status -d .
```

Omit `-d .` if running from the top-level directory. Run `uv run iqb cache status --help` for more help.
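Conceptually, the status report compares the manifest against local files. A rough sketch, assuming the manifest yields relative file paths (the real schema may differ):

```python
import os

def classify_entries(manifest_paths, cache_root):
    """Label each manifest entry as present locally or only on GCS."""
    return {
        rel: ("local" if os.path.exists(os.path.join(cache_root, rel)) else "remote")
        for rel in manifest_paths
    }
```

Entries the pipeline configuration expects but that exist in neither place would correspond to the "missing" case.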
Show per-period cache statistics, including parquet file sizes, cumulative BigQuery bytes billed, and query durations:

```bash
uv run iqb cache usage -d .
```

Omit `-d .` if running from the top-level directory. Run `uv run iqb cache usage --help` for more help.
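A rough sketch of how such statistics could be aggregated from the cache layout; this is not the actual `iqb cache usage` implementation, and the `stats.json` field names used here (`bytes_billed`, `duration`) are assumptions:

```python
import json
from pathlib import Path

def summarize_cache(cache_root):
    """Aggregate per-entry stats.json files into simple totals."""
    totals = {"entries": 0, "bytes_billed": 0, "duration_s": 0.0}
    # Layout: {cache_root}/v1/{start_date}/{end_date}/{query_type}/stats.json
    for stats_file in Path(cache_root).glob("v1/*/*/*/stats.json"):
        stats = json.loads(stats_file.read_text())
        totals["entries"] += 1
        totals["bytes_billed"] += stats.get("bytes_billed", 0)  # assumed key
        totals["duration_s"] += stats.get("duration", 0.0)      # assumed key
    return totals
```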
After generating new cache files locally using `iqb pipeline run`, push them to GCS and update the manifest:

```bash
uv run iqb cache push -d .
```

Then commit the updated `state/ghremote/manifest.json`. Omit `-d .` if running from the top-level directory. Run `uv run iqb cache push --help` for more help.
The GCS bucket we use was created in the `mlab-sandbox` project with:

```bash
gcloud storage buckets create gs://mlab-sandbox-iqb-us-central1 \
    --project=mlab-sandbox \
    --location=us-central1 \
    --uniform-bucket-level-access
```

Public read access was granted so that the library can download cache files without authentication:

```bash
gcloud storage buckets add-iam-policy-binding gs://mlab-sandbox-iqb-us-central1 \
    --member=allUsers \
    --role=roles/storage.objectViewer
```

Raw query results are stored efficiently as Parquet files for flexible analysis:
- Location: `./cache/v1/{start_date}/{end_date}/{query_type}/`
- Files:
  - `data.parquet`: query results (~1-60 MiB, streamable, chunked row groups)
  - `stats.json`: query metadata (start time, duration, bytes processed/billed, template hash)
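The layout above can be resolved programmatically. A small helper for illustration (hypothetical, not part of the `iqb` API):

```python
from pathlib import Path

def cache_entry_paths(root, start_date, end_date, query_type):
    """Return the (data.parquet, stats.json) paths for one cache entry."""
    base = Path(root) / "cache" / "v1" / start_date / end_date / query_type
    return base / "data.parquet", base / "stats.json"
```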