
# Timestamp-Based Content-Addressable Cache with Parquet

Date: 2025-11-24 (design crystallized; exploration started 2025-11-20)

Status: Implemented (PRs #33--69, v0.1.0 and v0.2.0)

## Problem

IQB needs to cache BigQuery results to avoid expensive re-queries during parameter exploration. Cache keys are simple: a time range and an aggregation level. The cache must support overlapping time ranges (e.g., [Oct, Nov) and [Oct 15, Nov 15) simultaneously).

## Rejected Alternative: SHA256-Based Cache

The first design (2025-11-20) hashed query parameters into a Git-like ab/cdef1234.../ directory structure. This works for general caching but is a poor fit here: the filesystem becomes opaque (you need metadata files to know what's cached), and the hash treats each parameter combination independently with no way to reason about temporal overlap. IQB's cache keys are too simple and too structured to benefit from hashing.

## Design: Path-as-Key

IQB queries are parameterized by a time range and an aggregation level. Use them directly as the directory path:

```
cache/v1/{start_date}T00:00:00Z/{end_date}T00:00:00Z/{aggregation}/
    downloads.parquet
    uploads.parquet
```

The directory path IS the cache key. No hash computation, no manifest files, no metadata files. Start is inclusive, end is exclusive. ISO 8601 timestamps make boundaries unambiguous. Different time ranges produce different paths, so overlapping ranges coexist without collision.
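The scheme is simple enough to sketch in a few lines of Python (the function name `cache_path` is illustrative, not the actual `cache.py` API):

```python
from datetime import date
from pathlib import Path

def cache_path(root: Path, start: date, end: date, aggregation: str) -> Path:
    """Build the cache directory path directly from the query parameters.

    Start is inclusive, end is exclusive; ISO 8601 timestamps keep the
    boundaries unambiguous. No hashing, no manifest lookup.
    """
    return (
        root
        / "v1"
        / f"{start.isoformat()}T00:00:00Z"
        / f"{end.isoformat()}T00:00:00Z"
        / aggregation
    )

# Overlapping ranges map to distinct directories, so they coexist.
p1 = cache_path(Path("cache"), date(2024, 10, 1), date(2024, 11, 1), "country")
p2 = cache_path(Path("cache"), date(2024, 10, 15), date(2024, 11, 15), "country")
```

Because the path is a pure function of the parameters, a cache hit is just an existence check on the directory.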

## Why Parquet

Measured with real IQB data (October 2024):

| Dataset | Rows | Parquet Size |
| --- | --- | --- |
| Country-level (236 countries) | 236 | 0.09 MB |
| Country+city+ASN (full month) | 232K | 70 MB |

Parquet is 4--10x smaller than JSON for percentile data. Predicate pushdown allows filtering by country/city/ASN without loading the full file: 39.54 ms filtered vs 183.33 ms full load (4.6x speedup), 207 MB vs 368 MB peak memory. JSON and CSV were rejected (larger, no predicate pushdown, must parse entire file).

## Aggregation Levels

The cache supports six aggregation levels, enabled by the geodata fill rates documented in 2025-11-14-geodata.md:

| Aggregation | GROUP BY columns |
| --- | --- |
| country | country_code |
| country_asn | country_code, asn |
| country_province | country_code, subdivision1 |
| country_province_asn | country_code, subdivision1, asn |
| country_city | country_code, city |
| country_city_asn | country_code, city, asn |
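The table above is just a static mapping from aggregation level to column list, which could be expressed as (the constant and function names are assumptions, not the actual `pipeline.py` API):

```python
# Six aggregation levels, each a list of GROUP BY columns.
AGGREGATION_COLUMNS = {
    "country": ["country_code"],
    "country_asn": ["country_code", "asn"],
    "country_province": ["country_code", "subdivision1"],
    "country_province_asn": ["country_code", "subdivision1", "asn"],
    "country_city": ["country_code", "city"],
    "country_city_asn": ["country_code", "city", "asn"],
}

def group_by_clause(aggregation: str) -> str:
    """Render the GROUP BY fragment of the BigQuery query for a level."""
    return "GROUP BY " + ", ".join(AGGREGATION_COLUMNS[aggregation])
```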

## Percentile Polarity Normalization

Percentile labels are inverted for "lower is better" metrics so that p95 uniformly means "typical best performance":

- Download throughput (higher is better): `OFFSET(95)` → `download_p95`
- Latency (lower is better): `OFFSET(5)` → `latency_p95` (inverted)
- Packet loss (lower is better): `OFFSET(5)` → `loss_p95` (inverted)

This allows uniform comparison logic: checking a p95 threshold always asks "can the top ~5% of users perform this use case?"
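The inversion rule can be captured in one small function, a sketch assuming a polarity flag per metric (names here are illustrative, not the actual query-builder code):

```python
# Which metrics improve as the value increases.
HIGHER_IS_BETTER = {"download": True, "latency": False, "loss": False}

def quantile_offset(metric: str, labeled_percentile: int) -> int:
    """Return the APPROX_QUANTILES(..., 100)[OFFSET(n)] index for a labeled
    percentile, inverting for lower-is-better metrics so that p95 always
    means "typical best performance"."""
    if HIGHER_IS_BETTER[metric]:
        return labeled_percentile
    return 100 - labeled_percentile
```

With this normalization, a threshold check against any `*_p95` column asks the same question regardless of metric polarity.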

## APPROX_QUANTILES Non-Determinism

Golden file tests (v0 JSON vs v1 Parquet, 3 countries × 2 periods) revealed that all differences were due to BigQuery APPROX_QUANTILES non-determinism, not Parquet precision loss. Observed variance:

| Percentile | Typical Variance | Max Observed |
| --- | --- | --- |
| p50 | 0.1--0.3% | 0.5% |
| p95 (IQB uses) | 0.1--0.5% | 1.1% |
| p99 | 0.5--1.5% | 2.9% |

Golden file tests use a 3% tolerance to accommodate this variance. The p95 percentile, which is the one critical for IQB scoring, is stable well within that bound.
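A relative-tolerance comparison of this kind is a one-liner with the standard library (`within_golden_tolerance` is an illustrative name, not the actual test helper):

```python
import math

def within_golden_tolerance(v0_value: float, v1_value: float,
                            rel_tol: float = 0.03) -> bool:
    """Compare a v0 JSON value against a v1 Parquet value with the 3%
    relative tolerance that absorbs APPROX_QUANTILES non-determinism."""
    return math.isclose(v0_value, v1_value, rel_tol=rel_tol)
```

The max observed p95 variance of 1.1% sits comfortably inside the 3% budget, while a genuine precision regression would exceed it.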

## Pipeline / Cache Separation

- Pipeline (`pipeline.py`): Executes BigQuery queries and writes Parquet files to disk.
- Cache (`cache.py`): Uses the directory naming convention to construct paths and read data. Operates on loaded DataFrames to produce per-project output (e.g., separate download/upload tables for M-Lab).