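Normalizes municipio_id in the R forecast collector and adds a coverage audit. A shared normalize_municipio_id() helper trims whitespace and left-pads IDs to five characters wherever they are loaded, combined, or persisted, so values are formatted consistently across datasets. A new Python audit script, run from SLURM array task 1, compares the collected municipalities against the reference list and fails the job if any non-excluded ID is missing.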
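
For readers more familiar with Python than R, the normalization rule can be sketched as below. This is an illustrative rendering and not part of the commit: the Python function is hypothetical and only mirrors the behavior of the R helper `normalize_municipio_id()` in the diff (trim, treat blanks as missing, left-pad with zeros to width 5).

```python
from typing import Optional


def normalize_municipio_id(value: object, width: int = 5) -> Optional[str]:
    """Trim whitespace and left-pad with zeros; blank or missing values become None."""
    if value is None:
        return None
    text = str(value).strip()  # removes spaces, tabs, and stray newlines
    if not text:
        return None
    # rjust only pads; values longer than `width` pass through unchanged,
    # which matches stringr::str_pad (it never truncates).
    return text.rjust(width, "0")


print(normalize_municipio_id("8001"))      # 08001
print(normalize_municipio_id(" 8001\n"))   # 08001
print(normalize_municipio_id("   "))       # None
```

`rjust` is used rather than `zfill` because `zfill` gives special treatment to a leading sign character, which `str_pad` does not.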

JohnPalmer · JohnPalmer · commit 50132178a568 · 2025-11-20T17:04:19.000+01:00
diff --git a/docs/pending_municipal_id_fix.md b/docs/pending_municipal_id_fix.md
@@ -1,20 +1,50 @@
 # Municipal Forecast ID Normalization (Paused Task)
 
-_Last updated: 2025-11-19_
+_Last updated: 2025-11-20_
 
 ## Status
-- Nationwide municipal forecast run produced 65,737 rows for `2025-11-19`, but `municipio_id` values are stored without left padding and some contain stray whitespace/newlines.
-- Direct comparisons against `data/input/municipalities.csv.gz` therefore flag entire provinces (e.g., Barcelona, Badajoz, Burgos) as missing even though data exists.
+- `scripts/r/get_forecast_data_hybrid.R` pads and trims `municipio_id` values everywhere via `normalize_municipio_id()` (live since 2025-11-20).
+- `update_municipal_forecasts_only.sh` (job 27127, shards 1–5) re-ran with the fix at 11:32 CET; cumulative file now holds 664,489 rows with correctly padded IDs.
+- Coverage audit installed: shard 1 now runs `python3 scripts/python/audit_municipal_forecast_coverage.py` after each array completion. The audit exits non-zero if any non-excluded IDs are missing.
+- Latest audit (2025-11-20): 8,129 reference municipios, 8,037 collected; shortfall limited to the known excluded sets below.
 
-## Outstanding Work
-1. Patch the municipal forecast collector(s) so `municipio_id` values are `str_trim`\+`str_pad(width = 5, pad = "0")` prior to persistence.
-2. Regenerate today’s municipal forecasts after deploying the fix to validate that all ~8k municipalities collect successfully.
-3. Re-run the audit script to confirm the differential drops to the expected handful of communal territories (53xxx codes, North African islets, etc.).
+## Expected Gaps (excluded from coverage metrics)
 
-## Notes
-- Example bad value: `municipio_id = "8001"` (should be `08001`).
-- CSV file to reprocess: `data/output/daily_municipal_forecast.csv.gz` (plain-text CSV despite `.gz` extension).
-- Reference mapping: `data/input/municipalities.csv.gz`.
-- Prior Python helper lives in shell history: `python - <<'PY' ...` extracting outstanding IDs.
+### New municipios without AEMET forecasts (monitor if they appear)
+- `11903` — San Martín del Tesorillo
+- `14901` — Fuente Carreteros
+- `14902` — La Guijarrosa
+- `18077` — Fornes
+- `21902` — La Zarza-Perrunal
+- `41904` — El Palmar de Troya
+
+### Communal / parzonería / ledanía territories
+- `53000`–`53083` (except `53030`, `53079`, and `53082`, which are absent from the ignored list)
+- `54001`–`54005`
+
+The audit script ignores the IDs above for coverage calculations but will print a warning if any of them begin to appear in the AEMET output so we can revisit downstream handling.
+
+## Coverage Snapshot — 2025-11-20
+- Reference municipalities: 8,129
+- Output municipalities (after de-duplication): 8,037
+- Ignored IDs: 92
+- Unexpected extras: none
 
-Resume from **Step 1** once Barcelona pipeline issues are resolved.
+Full ignored list:
+
+```
+11903, 14901, 14902, 18077, 21902, 41904, 53000, 53001, 53002, 53003, 53004,
+53005, 53006, 53007, 53008, 53009, 53010, 53011, 53012, 53013, 53014, 53015,
+53016, 53017, 53018, 53019, 53020, 53021, 53022, 53023, 53024, 53025, 53026,
+53027, 53028, 53029, 53031, 53032, 53033, 53034, 53035, 53036, 53037, 53038,
+53039, 53040, 53041, 53042, 53043, 53044, 53045, 53046, 53047, 53048, 53049,
+53050, 53051, 53052, 53053, 53054, 53055, 53056, 53057, 53058, 53059, 53060,
+53061, 53062, 53063, 53064, 53065, 53066, 53067, 53068, 53069, 53070, 53071,
+53072, 53073, 53074, 53075, 53076, 53077, 53078, 53080, 53081, 53083, 54001,
+54002, 54003, 54004, 54005
+```
+
+## Notes
+- Forecast cumulative file is plain CSV at `data/output/daily_municipal_forecast.csv.gz` despite the `.gz` suffix.
+- Municipal coverage audit runs automatically for SLURM array task 1; rerun manually with `python3 scripts/python/audit_municipal_forecast_coverage.py` if needed.
+- Manual rewrite script (2025-11-20) remains in shell history should another one-off normalization ever be required.
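
The snapshot figures in the document above can be cross-checked with a few lines of Python. This is a sketch for review purposes, not part of the commit. It rebuilds the ignored set from the listed ranges, leaving out the three codes (`53030`, `53079`, `53082`) that do not appear in the full ignored list, and it assumes every ignored ID is present in the reference file.

```python
# New municipios without AEMET forecasts, as listed in the document.
new_municipios = {"11903", "14901", "14902", "18077", "21902", "41904"}

# Codes inside 53000-53083 that are absent from the full ignored list.
gaps = {"53030", "53079", "53082"}

communal = {f"{code:05d}" for code in range(53000, 53084)} - gaps
communal |= {f"{code:05d}" for code in range(54001, 54006)}

ignored = new_municipios | communal

reference_total = 8129  # from the coverage snapshot
expected_collected = reference_total - len(ignored)

print(len(ignored))         # 92
print(expected_collected)   # 8037
```

Both results agree with the snapshot: 92 ignored IDs and 8,037 collected municipalities.
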
diff --git a/scripts/python/audit_municipal_forecast_coverage.py b/scripts/python/audit_municipal_forecast_coverage.py
@@ -0,0 +1,109 @@
+#!/usr/bin/env python3
+"""Audit municipal forecast coverage against the reference list."""
+
+from __future__ import annotations
+
+import csv
+import gzip
+import sys
+from pathlib import Path
+
+PROJECT_ROOT = Path(__file__).resolve().parents[2]
+REF_PATH = PROJECT_ROOT / "data/input/municipalities.csv.gz"
+FORECAST_PATH = PROJECT_ROOT / "data/output/daily_municipal_forecast.csv.gz"
+
+NEW_MUNICIPIOS = {
+    "11903",  # San Martín del Tesorillo
+    "14901",  # Fuente Carreteros
+    "14902",  # La Guijarrosa
+    "18077",  # Fornes
+    "21902",  # La Zarza-Perrunal
+    "41904",  # El Palmar de Troya
+}
+
+COMMUNAL_CODES = {
+    "53000", "53001", "53002", "53003", "53004", "53005", "53006", "53007",
+    "53008", "53009", "53010", "53011", "53012", "53013", "53014", "53015",
+    "53016", "53017", "53018", "53019", "53020", "53021", "53022", "53023",
+    "53024", "53025", "53026", "53027", "53028", "53029", "53031", "53032",
+    "53033", "53034", "53035", "53036", "53037", "53038", "53039", "53040",
+    "53041", "53042", "53043", "53044", "53045", "53046", "53047", "53048",
+    "53049", "53050", "53051", "53052", "53053", "53054", "53055", "53056",
+    "53057", "53058", "53059", "53060", "53061", "53062", "53063", "53064",
+    "53065", "53066", "53067", "53068", "53069", "53070", "53071", "53072",
+    "53073", "53074", "53075", "53076", "53077", "53078", "53080", "53081",
+    "53083", "54001", "54002", "54003", "54004", "54005",
+}
+
+EXPECTED_ABSENT = NEW_MUNICIPIOS | COMMUNAL_CODES
+
+
+def load_reference_ids(path: Path) -> set[str]:
+    if not path.exists():
+        print(f"ERROR: reference file not found: {path}", file=sys.stderr)
+        sys.exit(2)
+    with gzip.open(path, "rt", encoding="utf-8") as handle:
+        reader = csv.DictReader(handle)
+        if reader.fieldnames is None or "CUMUN" not in reader.fieldnames:
+            print("ERROR: reference file missing CUMUN header", file=sys.stderr)
+            sys.exit(2)
+        ids = set()
+        for row in reader:
+            raw = row.get("CUMUN")
+            if raw is None:
+                continue
+            stripped = raw.strip()
+            if not stripped:
+                continue
+            ids.add(stripped.zfill(5))
+    return ids
+
+
+def load_forecast_ids(path: Path) -> set[str]:
+    if not path.exists():
+        print(f"ERROR: forecast file not found: {path}", file=sys.stderr)
+        sys.exit(2)
+    with open(path, newline="", encoding="utf-8") as handle:
+        reader = csv.DictReader(handle)
+        if reader.fieldnames is None or "municipio_id" not in reader.fieldnames:
+            print("ERROR: forecast file missing municipio_id header", file=sys.stderr)
+            sys.exit(2)
+        ids = set()
+        for row in reader:
+            mid = row.get("municipio_id")
+            if mid:
+                ids.add(mid)
+    return ids
+
+
+def main() -> int:
+    ref_ids = load_reference_ids(REF_PATH)
+    forecast_ids = load_forecast_ids(FORECAST_PATH)
+
+    missing = sorted(ref_ids - forecast_ids - EXPECTED_ABSENT)
+    unexpected_present = sorted(forecast_ids & EXPECTED_ABSENT)
+
+    print(f"Reference municipalities: {len(ref_ids)}")
+    print(f"Forecast municipalities: {len(forecast_ids)}")
+    print(f"Ignored IDs (expected absent): {len(EXPECTED_ABSENT)}")
+
+    if unexpected_present:
+        print(
+            "WARNING: expected-absent IDs present in forecast data: "
+            + ", ".join(unexpected_present)
+        )
+
+    if missing:
+        print(f"ERROR: {len(missing)} reference municipios missing from forecasts.")
+        print(
+            "Sample missing IDs: " + ", ".join(missing[:20])
+            + ("..." if len(missing) > 20 else "")
+        )
+        return 1
+
+    print("Municipal forecast coverage OK (excluding expected gaps).")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
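
A small worked example of the set arithmetic in `main()` may help show why unpadded IDs made municipalities look missing. The sketch below is illustrative and not part of the commit: the `audit` helper and the toy ID sets are hypothetical, and only the two set expressions are taken from the script. Note that `load_forecast_ids()` does not pad or strip forecast IDs, so an unpadded value such as `8001` never matches the reference `08001`.

```python
from __future__ import annotations


def audit(
    ref_ids: set[str], forecast_ids: set[str], expected_absent: set[str]
) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected_present) using the same set logic as main()."""
    missing = sorted(ref_ids - forecast_ids - expected_absent)
    unexpected_present = sorted(forecast_ids & expected_absent)
    return missing, unexpected_present


# Toy data for illustration only.
ref = {"08001", "06015", "53001"}
expected_absent = {"53001"}

# Before the fix: one ID lost its leading zero.
raw_forecast = {"8001", "06015"}
missing_raw, _ = audit(ref, raw_forecast, expected_absent)
print(missing_raw)  # ['08001']

# After normalization the same data passes the audit.
padded_forecast = {mid.strip().rjust(5, "0") for mid in raw_forecast}
missing_padded, unexpected = audit(ref, padded_forecast, expected_absent)
print(missing_padded, unexpected)  # [] []
```

In the real script a non-empty `missing` list makes `main()` return 1, which the shell wrapper turns into a failed job.
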
diff --git a/scripts/r/get_forecast_data_hybrid.R b/scripts/r/get_forecast_data_hybrid.R
@@ -18,6 +18,20 @@ library(stringr)
   x
 }
 
+normalize_municipio_id <- function(x) {
+  if (is.null(x)) return(character())
+  if (!length(x)) return(character())
+  na_mask <- is.na(x)
+  x_chr <- as.character(x)
+  x_chr[na_mask] <- NA_character_
+  trimmed <- str_trim(x_chr)
+  if (!length(trimmed)) return(trimmed)
+  trimmed[!is.na(trimmed) & trimmed == ""] <- NA_character_
+  padded <- str_pad(trimmed, width = 5, pad = "0")
+  padded[is.na(trimmed)] <- NA_character_
+  padded
+}
+
 parse_cli_args <- function(args) {
   if (!length(args)) return(list())
   parsed <- list()
@@ -98,6 +112,7 @@ load_cumulative_data <- function(path) {
   if (!"municipio_id" %in% names(dt)) {
     dt[, municipio_id := NA_character_]
   }
+  dt[, municipio_id := normalize_municipio_id(municipio_id)]
   if (!inherits(dt$fecha, "Date")) {
     dt[, fecha := as.Date(fecha)]
   }
@@ -117,6 +132,7 @@ if (nrow(cumulative_data)) {
     !is.na(municipio_id) & as.Date(collected_at, tz = "UTC") == RUN_DATE,
     unique(municipio_id)
   ]
+  completed_today <- completed_today[!is.na(completed_today)]
   if (length(completed_today)) {
     cat("Already collected", length(completed_today), "municipalities for", RUN_DATE, "\n")
   }
@@ -152,19 +168,23 @@ release_file_lock <- function(path) {
 persist_batch <- function(batch_dt) {
   if (!nrow(batch_dt)) return()
   batch_dt[, collected_at := as.POSIXct(collected_at, tz = "UTC")]
+  batch_dt[, municipio_id := normalize_municipio_id(municipio_id)]
   acquire_file_lock(lock_path)
   on.exit(release_file_lock(lock_path), add = TRUE)
   latest_disk <- load_cumulative_data(cumulative_path)
   combined <- rbind(latest_disk, batch_dt, fill = TRUE)
+  if (nrow(combined)) {
+    combined[, municipio_id := normalize_municipio_id(municipio_id)]
+  }
   setorderv(combined, c("municipio_id", "fecha", "elaborado", "collected_at"))
   combined <- unique(combined, by = c("municipio_id", "fecha", "elaborado"), fromLast = TRUE)
   cumulative_data <<- combined
   save_cumulative_data(cumulative_path, cumulative_data)
 }
 
-# Load municipality data 
+# Load municipality data
 cat("Loading municipality codes...\n")
-municipalities_data = fread(
+municipalities_data <- fread(
   "data/input/municipalities.csv.gz",
   colClasses = list(character = "CUMUN")
 )
@@ -173,7 +193,8 @@ if(!"CUMUN" %in% names(municipalities_data)){
   stop("CUMUN column not found in municipalities.csv.gz")
 }
 
-all_municipios = str_pad(trimws(municipalities_data$CUMUN), width = 5, pad = "0")
+all_municipios = normalize_municipio_id(municipalities_data$CUMUN)
+all_municipios = all_municipios[!is.na(all_municipios)]
 cat("Loaded", length(all_municipios), "municipalities\n")
 
 if(TESTING_MODE) {
@@ -240,7 +261,12 @@ while (length(remaining_municipios) > 0 && pass_number <= MAX_COLLECTION_PASSES)
     
     # Function to attempt forecast collection with key rotation on failure
     collect_with_retry <- function(municipios, max_retries = MAX_BATCH_RETRIES) {
-      municipios <- str_pad(trimws(municipios), width = 5, pad = "0")
+      municipios <- normalize_municipio_id(municipios)
+      municipios <- municipios[!is.na(municipios)]
+      if (!length(municipios)) {
+        cat("No valid municipality IDs remain after normalization; skipping batch.\n")
+        return(data.frame())
+      }
       for (attempt in seq_len(max_retries)) {
         result <- tryCatch({
           aemet_api_key(get_current_api_key(), install = TRUE, overwrite = TRUE)
@@ -340,7 +366,7 @@ while (length(remaining_municipios) > 0 && pass_number <= MAX_COLLECTION_PASSES)
           temp_min = temperatura_minima
         ) %>%
         mutate(
-          municipio_id = str_pad(as.character(municipio_id), width = 5, pad = "0")
+          municipio_id = normalize_municipio_id(municipio_id)
         ) %>%
         mutate(
           temp_avg = rowMeans(cbind(temp_max, temp_min), na.rm = TRUE),
@@ -356,7 +382,7 @@ while (length(remaining_municipios) > 0 && pass_number <= MAX_COLLECTION_PASSES)
           humid_min = humedadRelativa_minima
         ) %>%
         mutate(
-          municipio = str_pad(as.character(municipio), width = 5, pad = "0")
+          municipio = normalize_municipio_id(municipio)
         )
       
       # Get wind data
@@ -367,7 +393,7 @@ while (length(remaining_municipios) > 0 && pass_number <= MAX_COLLECTION_PASSES)
           wind_speed = viento_velocidad
         ) %>%
         mutate(
-          municipio = str_pad(as.character(municipio), width = 5, pad = "0")
+          municipio = normalize_municipio_id(municipio)
         )
       
       # Combine all data
@@ -387,6 +413,9 @@ while (length(remaining_municipios) > 0 && pass_number <= MAX_COLLECTION_PASSES)
       })
       
       batch_final_dt <- as.data.table(batch_final)
+      if (nrow(batch_final_dt)) {
+        batch_final_dt[, municipio_id := normalize_municipio_id(municipio_id)]
+      }
       if (!nrow(batch_final_dt)) {
         cat("No records produced after processing batch", batch_idx, "- skipping persistence.\n\n")
         next
@@ -440,6 +469,7 @@ cat("=== FINAL PROCESSING ===\n")
 if(length(all_forecasts) > 0) {
   final_data <- rbindlist(all_forecasts, use.names = TRUE, fill = TRUE)
   if (nrow(final_data)) {
+    final_data[, municipio_id := normalize_municipio_id(municipio_id)]
     final_data[, fecha := as.Date(fecha)]
     final_data[, collected_at := as.POSIXct(collected_at, tz = "UTC")]
   }
@@ -455,6 +485,8 @@ if(length(all_forecasts) > 0) {
   cat("No new forecast data collected in this run (municipalities may already be up to date or all API calls failed).\n")
 }
 
+cumulative_data <- load_cumulative_data(cumulative_path)
+
 if (nrow(cumulative_data)) {
   cat("\n=== SUMMARY STATISTICS ===\n")
   cat("Total municipalities in cumulative file:", length(unique(cumulative_data$municipio_id)), "\n")
diff --git a/update_municipal_forecasts_only.sh b/update_municipal_forecasts_only.sh
@@ -35,6 +35,7 @@ fi
 KEY_POOL=${KEY_POOLS[$((SHARD_INDEX-1))]}
 
 # Dataset 4: Municipal forecasts
+EXIT_CODE=0
 echo "Dataset 4: Municipal forecasts shard ${SHARD_INDEX}/${SHARD_COUNT} using key pool '${KEY_POOL}'..."
 srun Rscript scripts/r/get_forecast_data_hybrid.R \
     --shard-index=${SHARD_INDEX} \
@@ -44,11 +45,24 @@ if [ $? -eq 0 ]; then
     echo "✅ Forecast collection completed"
 else
     echo "❌ Forecast collection failed"
+    EXIT_CODE=1
 fi
 
 echo "=== Collection Summary ==="
 echo "Completed: $(date)"
 ls -la data/output/*.csv.gz
 
+if [ "${SLURM_ARRAY_TASK_ID:-1}" -eq 1 ]; then
+    echo "Running municipal forecast coverage audit..."
+    if python3 scripts/python/audit_municipal_forecast_coverage.py; then
+        echo "✅ Municipal forecast coverage audit passed"
+    else
+        echo "❌ Municipal forecast coverage audit failed"
+        EXIT_CODE=1
+    fi
+fi
+
+exit ${EXIT_CODE}
+
 # Run with
 # sbatch ~/research/weather-data-collector-spain/update_municipal_forecasts_only.sh