
# Timestamp-Based Content-Addressable Cache with Parquet

Date: 2025-11-24 (design crystallized; exploration started 2025-11-20)

Status: Implemented (PRs #33--69, v0.1.0 and v0.2.0)

## Problem

IQB needs to cache BigQuery results to avoid expensive re-queries during parameter exploration. Cache keys are simple: a time range and an aggregation level. The cache must support overlapping time ranges (e.g., [Oct, Nov) and [Oct 15, Nov 15) simultaneously).

## Rejected Alternative: SHA256-Based Cache

The first design (2025-11-20) hashed query parameters into a Git-like ab/cdef1234.../ directory structure. This works for general caching but is a poor fit here: the filesystem becomes opaque (you need metadata files to know what's cached), and the hash treats each parameter combination independently with no way to reason about temporal overlap. IQB's cache keys are too simple and too structured to benefit from hashing.

## Design: Path-as-Key

IQB queries are parameterized by a time range and an aggregation level. Use them directly as the directory path:

```
cache/v1/{start_date}T00:00:00Z/{end_date}T00:00:00Z/{aggregation}/
    downloads.parquet
    uploads.parquet
```

The directory path IS the cache key. No hash computation, no manifest files, no metadata files. Start is inclusive, end is exclusive. ISO 8601 timestamps make boundaries unambiguous. Different time ranges produce different paths, so overlapping ranges coexist without collision.
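The scheme is simple enough to sketch in a few lines of Python (the function name `cache_path` is illustrative, not the actual `cache.py` API):

```python
from datetime import date
from pathlib import Path

def cache_path(root: Path, start: date, end: date, aggregation: str) -> Path:
    """Build the cache directory path directly from the query parameters.

    Start is inclusive, end is exclusive; ISO 8601 timestamps keep the
    boundaries unambiguous. No hashing, no manifest lookup.
    """
    return (
        root
        / "v1"
        / f"{start.isoformat()}T00:00:00Z"
        / f"{end.isoformat()}T00:00:00Z"
        / aggregation
    )

# Overlapping ranges map to distinct directories, so they coexist.
p1 = cache_path(Path("cache"), date(2024, 10, 1), date(2024, 11, 1), "country")
p2 = cache_path(Path("cache"), date(2024, 10, 15), date(2024, 11, 15), "country")
```

Because the path is a pure function of the parameters, a cache hit is just an existence check on the directory.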

## Why Parquet

Measured with real IQB data (October 2024):

| Dataset | Rows | Parquet Size |
| --- | --- | --- |
| Country-level (236 countries) | 236 | 0.09 MB |
| Country+city+ASN (full month) | 232K | 70 MB |

Parquet is 4--10x smaller than JSON for percentile data. Predicate pushdown allows filtering by country/city/ASN without loading the full file: 39.54 ms filtered vs 183.33 ms full load (4.6x speedup), 207 MB vs 368 MB peak memory. JSON and CSV were rejected (larger, no predicate pushdown, must parse entire file).

## Aggregation Levels

The cache supports six aggregation levels, enabled by the geodata fill rates documented in 2025-11-14-geodata.md:

| Aggregation | GROUP BY columns |
| --- | --- |
| country | country_code |
| country_asn | country_code, asn |
| country_province | country_code, subdivision1 |
| country_province_asn | country_code, subdivision1, asn |
| country_city | country_code, city |
| country_city_asn | country_code, city, asn |
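The table above is just a static mapping from aggregation level to column list, which could be expressed as (the constant and function names are assumptions, not the actual `pipeline.py` API):

```python
# Six aggregation levels, each a list of GROUP BY columns.
AGGREGATION_COLUMNS = {
    "country": ["country_code"],
    "country_asn": ["country_code", "asn"],
    "country_province": ["country_code", "subdivision1"],
    "country_province_asn": ["country_code", "subdivision1", "asn"],
    "country_city": ["country_code", "city"],
    "country_city_asn": ["country_code", "city", "asn"],
}

def group_by_clause(aggregation: str) -> str:
    """Render the GROUP BY fragment of the BigQuery query for a level."""
    return "GROUP BY " + ", ".join(AGGREGATION_COLUMNS[aggregation])
```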

## Percentile Polarity Normalization

Percentile labels are inverted for "lower is better" metrics so that p95 uniformly means "typical best performance":

- Download throughput (higher is better): `OFFSET(95)` → `download_p95`
- Latency (lower is better): `OFFSET(5)` → `latency_p95` (inverted)
- Packet loss (lower is better): `OFFSET(5)` → `loss_p95` (inverted)

This allows uniform comparison logic: checking a p95 threshold always asks "can the top ~5% of users perform this use case?"
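The inversion rule can be captured in one small function, a sketch assuming a polarity flag per metric (names here are illustrative, not the actual query-builder code):

```python
# Which metrics improve as the value increases.
HIGHER_IS_BETTER = {"download": True, "latency": False, "loss": False}

def quantile_offset(metric: str, labeled_percentile: int) -> int:
    """Return the APPROX_QUANTILES(..., 100)[OFFSET(n)] index for a labeled
    percentile, inverting for lower-is-better metrics so that p95 always
    means "typical best performance"."""
    if HIGHER_IS_BETTER[metric]:
        return labeled_percentile
    return 100 - labeled_percentile
```

With this normalization, a threshold check against any `*_p95` column asks the same question regardless of metric polarity.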

## APPROX_QUANTILES Non-Determinism

Golden file tests (v0 JSON vs v1 Parquet, 3 countries × 2 periods) revealed that all differences were due to BigQuery APPROX_QUANTILES non-determinism, not Parquet precision loss. Observed variance:

| Percentile | Typical Variance | Max Observed |
| --- | --- | --- |
| p50 | 0.1--0.3% | 0.5% |
| p95 (IQB uses) | 0.1--0.5% | 1.1% |
| p99 | 0.5--1.5% | 2.9% |

Golden file tests use a 3% tolerance to accommodate this variance. The p95 percentile, which is the one critical for IQB scoring, is stable well within that bound.
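A relative-tolerance comparison of this kind is a one-liner with the standard library (`within_golden_tolerance` is an illustrative name, not the actual test helper):

```python
import math

def within_golden_tolerance(v0_value: float, v1_value: float,
                            rel_tol: float = 0.03) -> bool:
    """Compare a v0 JSON value against a v1 Parquet value with the 3%
    relative tolerance that absorbs APPROX_QUANTILES non-determinism."""
    return math.isclose(v0_value, v1_value, rel_tol=rel_tol)
```

The max observed p95 variance of 1.1% sits comfortably inside the 3% budget, while a genuine precision regression would exceed it.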

## Pipeline / Cache Separation

- Pipeline (`pipeline.py`): Executes BigQuery queries and writes Parquet files to disk.
- Cache (`cache.py`): Uses the directory naming convention to construct paths and read data. Operates on loaded DataFrames to produce per-project output (e.g., separate download/upload tables for M-Lab).