Date: 2026-01-20
Status: Implemented (GCS bucket mlab-sandbox-iqb-us-central1, PR #131)
Builds on: 2025-12-21-remote.md (layered cache lookup)
The distribution system must satisfy:
- Individual file addressability. Users download specific parquet files without fetching the entire dataset.
- Incremental downloads. A user interested in country-level data for October 2025 should not download city-level data or other months.
- Incremental uploads. New months are appendable without re-uploading existing data.
- Data updates with versioning. Methodology changes produce new
version namespaces (
cache/v1/,cache/v2/). - Data integrity. SHA256 checksums for every file.
- Discovery. A programmatic manifest listing available files, checksums, and sizes.
- Reasonable performance. Individual file downloads in 1--10 seconds (files range from 58 KB to 47 MB).
- Standard tooling. Accessible via HTTP clients and cloud SDKs without custom software.
The IQB cache contains ~1.4 GB across 156 parquet files (plus 156
stats.json metadata files), covering 13 months at 6 aggregation levels:
cache/v1/{start_date}/{end_date}/{metric}_by_{aggregation}/
├─ data.parquet (58 KB to 47 MB)
└─ stats.json (~260 bytes, query metadata)
| Aggregation Level | Typical Size |
|---|---|
by_country |
58 KB |
by_country_subdivision1 |
558 KB |
by_country_asn |
8.8 MB |
by_country_subdivision1_asn |
11 MB |
by_country_city |
15 MB |
by_country_city_asn |
42--47 MB |
GCS replaced the original GitHub Releases backend (see
2025-12-21-remote.md) that served as the
first remote cache implementation. The ghremote package was
repurposed from "GitHub Remote" to "globally-hosted remote", keeping
its name and Protocol interface unchanged.
GCS object storage satisfies all eight requirements:
- Individual files and incremental access (R1--R3): Objects map 1:1 to cache files. New months are uploaded independently.
- Versioning (R4): The
cache/v1/path prefix provides version namespaces. GCS object versioning preserves history if needed. - Integrity (R5): GCS stores MD5/CRC32C checksums; the manifest additionally provides SHA256.
- Discovery (R6): Bucket listing API and manifest file.
- Performance (R7): Optimized for this workload. The IQB prototype
runs on Cloud Run in the same GCS region, so
iqb cache pullfrom the prototype incurs no cross-region egress. - Standard tooling (R8):
gsutil,gcloud, HTTP, Python cloud SDKs.
The cache/v1/ directory structure is preserved as-is in the bucket,
so local paths and GCS paths are identical.
File discovery uses a centralized manifest:
{
"files": {
"cache/v1/{period}/{aggregation}/data.parquet": {
"sha256": "3a421c62...",
"url": "https://storage.googleapis.com/mlab-sandbox-iqb-us-central1/..."
}
},
"v": 0
}The manifest provides a point-in-time snapshot of available files. Updating it atomically ensures users never see a partial state. SHA256 checksums in the manifest enable integrity verification independently of the storage backend.