Date: 2025-11-24 (design crystallized; exploration started 2025-11-20)
Status: Implemented (PRs #33--69, v0.1.0 and v0.2.0)
IQB needs to cache BigQuery results to avoid expensive re-queries during
parameter exploration. Cache keys are simple: a time range and an
aggregation level. The cache must support overlapping time ranges
(e.g., [Oct, Nov) and [Oct 15, Nov 15) simultaneously).
The first design (2025-11-20) hashed query parameters into a Git-like
ab/cdef1234.../ directory structure. This works for general caching
but is a poor fit here: the filesystem becomes opaque (you need metadata
files to know what's cached), and the hash treats each parameter
combination independently with no way to reason about temporal overlap.
IQB's cache keys are too simple and too structured to benefit from hashing.
IQB queries are parameterized by a time range and an aggregation level. Use them directly as the directory path:
cache/v1/{start_date}T00:00:00Z/{end_date}T00:00:00Z/{aggregation}/
downloads.parquet
uploads.parquet
The directory path IS the cache key. No hash computation, no manifest files, no metadata files. Start is inclusive, end is exclusive. ISO 8601 timestamps make boundaries unambiguous. Different time ranges produce different paths, so overlapping ranges coexist without collision.
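The path convention above is simple enough to sketch directly. This is a minimal illustration of the layout (the helper name `cache_dir` is hypothetical, not the real `cache.py` API):

```python
from datetime import date

def cache_dir(start: date, end: date, aggregation: str,
              root: str = "cache/v1") -> str:
    """Build the cache directory path for a query.

    Start is inclusive, end is exclusive; midnight-UTC ISO 8601
    timestamps make the range boundaries unambiguous.
    """
    return (f"{root}/{start.isoformat()}T00:00:00Z/"
            f"{end.isoformat()}T00:00:00Z/{aggregation}")

# Overlapping ranges map to distinct directories, so they coexist.
oct_path = cache_dir(date(2024, 10, 1), date(2024, 11, 1), "country")
mid_path = cache_dir(date(2024, 10, 15), date(2024, 11, 15), "country")
```

Because the path is a pure function of the query parameters, a cache hit is just a directory-existence check; no lookup table can drift out of sync with the files on disk.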
Measured with real IQB data (October 2024):
| Dataset | Rows | Parquet Size |
|---|---|---|
| Country-level (236 countries) | 236 | 0.09 MB |
| Country+city+ASN (full month) | 232K | 70 MB |
Parquet is 4--10x smaller than JSON for percentile data. Predicate pushdown allows filtering by country/city/ASN without loading the full file: 39.54 ms filtered vs 183.33 ms full load (4.6x speedup), 207 MB vs 368 MB peak memory. JSON and CSV were rejected (larger, no predicate pushdown, must parse entire file).
The cache supports six aggregation levels, enabled by the geodata fill rates documented in 2025-11-14-geodata.md:
| Aggregation | GROUP BY columns |
|---|---|
| country | country_code |
| country_asn | country_code, asn |
| country_province | country_code, subdivision1 |
| country_province_asn | country_code, subdivision1, asn |
| country_city | country_code, city |
| country_city_asn | country_code, city, asn |
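Since the aggregation level is also the last path component of the cache key, a small lookup table is enough to turn it into the query's GROUP BY clause. A sketch (the `GROUP_BY` mapping reflects the table above; the helper is hypothetical):

```python
# GROUP BY columns per aggregation level, as documented above.
GROUP_BY = {
    "country": ["country_code"],
    "country_asn": ["country_code", "asn"],
    "country_province": ["country_code", "subdivision1"],
    "country_province_asn": ["country_code", "subdivision1", "asn"],
    "country_city": ["country_code", "city"],
    "country_city_asn": ["country_code", "city", "asn"],
}

def group_by_clause(aggregation: str) -> str:
    """Render the GROUP BY clause for a cached aggregation level.

    Raises KeyError for unknown levels, so a typo'd cache path
    fails loudly instead of silently querying the wrong grouping.
    """
    return "GROUP BY " + ", ".join(GROUP_BY[aggregation])
```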
Percentile labels are inverted for "lower is better" metrics so that p95 uniformly means "typical best performance":
- Download throughput (higher is better): OFFSET(95) → download_p95
- Latency (lower is better): OFFSET(5) → latency_p95 (inverted)
- Packet loss (lower is better): OFFSET(5) → loss_p95 (inverted)
This allows uniform comparison logic: checking a p95 threshold always asks "can the top ~5% of users perform this use case?"
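The inversion is easiest to see with a concrete percentile computation. Using `numpy.percentile` as a stand-in for BigQuery's `APPROX_QUANTILES(x, 100)[OFFSET(n)]` (the helper name is illustrative):

```python
import numpy as np

def labeled_p95(values, lower_is_better: bool = False) -> float:
    """'Typical best performance' percentile.

    Higher-is-better metrics (throughput) take the 95th percentile;
    lower-is-better metrics (latency, loss) take the 5th percentile
    but are still labeled p95, so threshold checks stay uniform.
    """
    q = 5 if lower_is_better else 95
    return float(np.percentile(values, q))

latencies_ms = [10, 12, 15, 20, 80]
# The best ~5% of users see latency at or below this value.
latency_p95 = labeled_p95(latencies_ms, lower_is_better=True)
```

With this labeling, every use-case check reads the same way: compare the metric's p95 column against a threshold, regardless of the metric's direction.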
Golden file tests (v0 JSON vs v1 Parquet, 3 countries × 2 periods)
revealed that all differences were due to BigQuery APPROX_QUANTILES
non-determinism, not Parquet precision loss. Observed variance:
| Percentile | Typical Variance | Max Observed |
|---|---|---|
| p50 | 0.1--0.3% | 0.5% |
| p95 (used by IQB) | 0.1--0.5% | 1.1% |
| p99 | 0.5--1.5% | 2.9% |
Golden file tests use a 3% tolerance to accommodate this. The p95 percentile, which is the one critical for IQB scoring, is stable.
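The tolerance check itself is a one-liner; a sketch of how the golden-file comparison might be expressed (the helper name is illustrative, not the actual test code):

```python
import math

TOLERANCE = 0.03  # 3% relative tolerance for APPROX_QUANTILES jitter

def within_tolerance(golden: float, observed: float) -> bool:
    """Golden-file comparison that absorbs BigQuery's
    non-deterministic quantile estimation."""
    return math.isclose(golden, observed, rel_tol=TOLERANCE)

within_tolerance(100.0, 101.1)  # max observed p95 drift: passes
within_tolerance(100.0, 104.0)  # beyond 3%: fails
```

The 3% bound sits comfortably above the worst observed p99 variance (2.9%), so the tests flag real regressions (e.g., Parquet precision loss) without tripping on quantile jitter.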
- Pipeline (pipeline.py): Executes BigQuery queries and writes Parquet files to disk.
- Cache (cache.py): Uses the directory naming convention to construct paths and read data. Operates on loaded DataFrames to produce per-project output (e.g., separate download/upload tables for M-Lab).