HDS5106/bym2_multi_year_country.Rmd at main · SAHagen/HDS5106 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
---
title: "dhs_multiyear_country_data"
author: ""
date: ""
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

# loading all require libraries
library(tidyverse)
library(haven)
library(sf)
library(geodata)
library(survey)
library(spdep)
library(Matrix)
library(INLA)
library(cmdstanr)
library(terra)
library(exactextractr)
library(geodata)

ghana_tt <- st_read("dhs_gh_14/GHGE71FL/GHGE71FL.shp")
```

# About the Data

# Demographic and Health Surveys (DHS) Data

The data for this project is sourced from the Demographic and Health Surveys (DHS) program, which provides comprehensive datasets on health and population indicators across various countries. The specific datasets used in this analysis include two different DHS rounds for the selected countries namely, Ghana, Burkina Faso and Cote d'Ivoire. The are two main files for each round: the household recode (HR) file and GPS data file. The HR files contain detailed information on household characteristics which includes Malaria Results, the outcome we want to estimate, while the GPS data files provide geospatial coordinates for the survey clusters.

The datasets were obtained from the DHS Program's official website, which requires registration and approval for access. The data is typically provided in Stata (.dta) format, which can be easily imported into R using the `haven` package. The specific files used in this analysis are as follows:

- For Ghana:
  - DHS 2014: `GHPR72DT.dta` (Household Recode), `GHGE71FL.shp` (GPS data)
  - DHS 2022: 'GHPR8CDT.dta' (Household Recode), `GHGE8AFL.shp` (GPS data)

- For Burkina Faso:
 - DHS 2010: `BFPR62DT.dta` (Household Recode), `BFGE61FL.shp` (GPS data)
 - DHS 2021: `BFPR81DT.dta` (Household Recode), `BFGE81FL.shp` (GPS data)

-   For Cote d'Ivoire:
  -   DHS 2012: `CIPR62DT.dta` (Household Recode), `CIGE61FL.shp` (GPS data)
  -   DHS 2021: `CIPR81DT.dta` (Household Recode), `CIGE81FL.shp` (GPS data)

# Auxiliary Data

# Data loading and cleaning

```{r data-loading, echo=TRUE}
#Survey Index
survey_index <- tribble(
    ~country, ~iso, ~year, ~round_id, ~pr_path, ~gps_path,
  "Ghana",  "GHA", 2014,  1L,
  "dhs_gh_14/GHPR72DT/GHPR72FL.DTA", "dhs_gh_14/GHGE71FL/GHGE71FL.shp",
  "Ghana",  "GHA", 2022,  2L,
  "dhs_gh_22/GHPR8CDT/GHPR8CFL.DTA", "dhs_gh_22/GHGE8AFL/GHGE8AFL.shp",
  "Burkina","BFA", 2010,  1L,
    "dhs_bf_10/BFPR62DT/BFPR62FL.DTA", "dhs_bf_10/BFGE61FL/BFGE61FL.shp",
    "Burkina","BFA", 2021,  2L,
    "dhs_bf_21/BFPR81DT/BFPR81FL.DTA", "dhs_bf_21/BFGE81FL/BFGE81FL.shp",
    "Cote d'Ivoire", "CIV", 2012, 1L,
    "dhs_ci_12/CIPR62DT/CIPR62FL.DTA", "dhs_ci_12/CIGE61FL/CIGE61FL.shp",
    "Cote d'Ivoire", "CIV", 2021, 2L,
    "dhs_ci_21/CIPR81DT/CIPR81FL.DTA", "dhs_ci_21/CIGE81FL/CIGE81FL.shp"
)
```

# District Boundaries for each country

The district boundaries are obtained form the GADM database, which provides high-resolution administrative boundary data for all countries. The specific administrative level used in this analysis is level 2 for Ghana and Level 3 for Burkina Faso and Cote d'Ivoire, which corresponds to districts. The GADM data is typically provided in shapefile format, which can be easily imported into R using the `sf` package.

```{r district-boundaries, echo=TRUE}
# Load district boundaries for each country
# we import the boundaries and transform them to the same coordinate reference system (CRS) as the GPS data (WGS 84, EPSG:4326). We also rename the relevant columns for consistency and add an ISO code column for easier merging with the survey data later on.
load_districts <- function(iso) {
  if (iso == "GHA") {
    gadm(country = iso, level = 2, path = "boundaries/") %>%
      st_as_sf() %>%
      st_transform(crs = 4326) %>%
      rename(district_name = NAME_2, district_code = GID_2) %>%
      mutate(iso = iso)
  } else {
    gadm(country = iso, level = 3, path = "boundaries/") %>%
      st_as_sf() %>%
      st_transform(crs = 4326) %>%
      rename(district_name = NAME_3, district_code = GID_3) %>%
      mutate(iso = iso)
  }
}

districts_raw <- map(c("GHA", "BFA", "CIV"), load_districts)
names(districts_raw) <- c("GHA", "BFA", "CIV")

# Since we are using Stan to fit our models, Stan does not understand individual district names or codes rather it requires numeric identifiers for each district. To address this, we create a lookup table for each country that assigns a unique numeric identifier to each district. This is done by dropping the geometry from the spatial data frame, selecting the relevant columns (district name, district code, and ISO code), and then creating a new column `district_id` that assigns a unique number to each district based on its position in the ordered list of districts. We then merge this lookup table back with the original spatial data frame to ensure that the districts are ordered according to their numeric identifiers. THis will help in building the adjacency matrix for the spatial model and in mapping the posterior results back.

district_lookups <- map(districts_raw, function(sf) {
  sf %>%
    st_drop_geometry() %>%
    select(district_name, district_code, iso) %>%
    distinct() %>%
    arrange(district_name) %>%
    mutate(district_id = row_number())
})

districts_ordered <- map2(districts_raw, district_lookups, function(sf, lkp) {
  sf %>%
    left_join(lkp, by = c("district_name", "district_code", "iso")) %>%
    arrange(district_id)
})

# quick district count for each country
map(district_lookups, nrow)
# as expected, Ghana has 260 districts, Burkina Faso has 351 and Cote d'Ivoire has 113 districts. Because we now know these numbers, we can then know how many districts had clusters sampled and how many had no clusters sampled, which will be important for the spatial modeling step later on.
```

# The Household Member Record (PR) files

The PR contain detailed information on household characteristics, including the malaria test results for children under five years old. It is also the one that has the malaria test results at household level from all eligilbe children present at the time of the survey, regardless of weather their mother was interview or not.

Malaria infection was measured using microscopic examination of thick blood smears from children aged 6–59 months (hml32 in the DHS Household Member Recode). Microscopy was selected over rapid diagnostic testing (RDT, hml35) because it provides a cleaner measure of point prevalence, the proportion of children currently infected rather than recent infection including post-treatment antigenic persistence. Sample sizes for the two tests were nearly identical across surveys. The distinction between the two measurement approaches is discussed in detail by Florey (2014).

Florey, L. (2014). Measures of Malaria Parasitemia Prevalence in National Surveys: Agreement between Rapid Diagnostic Testing and Microscopy. DHS Analytical Studies No. 43. Rockville, MD: ICF International.

```{r pr-files, echo=TRUE}
# a function that takes the path to a pr file, the gps path, iso , year , round id, distirct_ordered a and district look up as in the trebble defined above. loads both files at the same time and joins them based on cluster_id.

load_pr <- function(pr_path, gps_path, iso, year, round_id,
districts_ordered, district_lookup) {

    # loading GPS and joining to district boundaries from GADM
    gps <- read_sf(gps_path) %>%
        rename(cluster_id = DHSCLUST) %>%
        st_transform(crs = 4326)

    gps_joined <- st_join(gps, districts_ordered, join = st_nearest_feature) %>%
        select(cluster_id, district_name, district_code) %>%
        st_drop_geometry() %>%
        left_join(district_lookup, by = c("district_name", "district_code")) %>%
        select(cluster_id, district_id, district_name)

    # loading PR file and joining to GPS data
    pr <- read_dta(pr_path) %>%
        rename_with(tolower) %>%
        select(cluster_id = hv001, hh_id = hv002,
        line_id = hvidx, weight_raw = hv005, stratum = hv023,
        age_months = hc1, malaria_microscopy = hml32) %>%
        filter(age_months < 60, malaria_microscopy %in% c(0, 1)) %>%
        mutate(
            weight= weight_raw / 1e6,
            malaria = as.integer(malaria_microscopy),
            cluster_id = as.integer(cluster_id),
            stratum = as.integer(stratum),
            country = iso,
            year = year,
            round_id = round_id
        ) %>%
            select(-weight_raw, -malaria_microscopy) %>%
        left_join(gps_joined, by = "cluster_id")
}

pr_list <- pmap(
  survey_index,
  function(country, iso, year, round_id, pr_path, gps_path) {
    load_pr(
      pr_path = pr_path,
      gps_path = gps_path,
      iso = iso,
      year = year,
      round_id  = round_id,
      districts_ordered = districts_ordered[[iso]],
      district_lookup = district_lookups[[iso]]
    )
  }
)

# The abve produces a list of 6 data frames one for each survey round. Each data frame contains the malaria test results for children under five years old, along with the corresponding cluster and district information.
```

# Using Svy Desing to quickly get the national prevalence for each country

Direct Weighted Esimates of National Prevalence

National under-5 malaria prevalence is estimated using thedesign-based methods that account fo the DHS complex sampling design. The DHS uses a stratified two-stage sample in which clusters (enumeration areas) are first selected with probability proportional to size within the rural/urban strata, and then households are randomly selected within clusters [Croft et al., 2018]. Valid estimators then must account for the clustering and statification. For each survey the population-weighted prevalence is estimated using Hajek (1971) ratio estimator:

$\hat{p}_{s}^{w} = \frac{\sum_{k\in S_{s}} d_{k}y_{k}}{\sum_{k\in S_{s}} d_{k}}$
where:
- $S_{s}$ is the set of sampled children aged 6-59 months in survey $s$, with a valid test result
- $d_{k}$ is the sampling weight for child $k$, computed as $hv005/1000000$ following DHS convention [Rutstein & Rojas, 2006] and
- $y_{k}$ is the malaria test result for child $k$ (1 if positive, 0 if negative).


Croft, T. N., Marshall, A. M. J., & Allen, C. K. (2018). Guide to DHS Statistics.
  Rockville, Maryland, USA: ICF. https://dhsprogram.com/Data/Guide-to-DHS-Statistics/

Rutstein, S. O., & Rojas, G. (2006). Guide to DHS Statistics. Calverton, Maryland: ORC Macro.

```{r national-prevalence, echo=TRUE}
national_prevalence <- map(pr_list, function(pr) {
  design_national <- svydesign(
    ids = ~cluster_id,
    strata = ~stratum,
    weights = ~weight,
    data = pr,
    nest = TRUE
  )
  est <- svymean(~malaria, design = design_national, na.rm = TRUE)
  data.frame(
    country = pr$country[1],
    year = pr$year[1],
    round_id = pr$round_id[1],
    prev = as.numeric(est),
    se = as.numeric(SE(est)))
})

print(national_prevalence)

```

The above code calcuates on the national prevelance and for model fitting we will need to calculate the prevelance at the district level, which we will do in the next step. We will also need to calculate the number of children tested and the number of positive cases at the district level, which will be used as the response variable in our spatial model.

```{r district-level-prevalence, echo=TRUE}
# creating a function
get_direct_estimates <- function(pr) {
  # Setting up the survey design for this country's PR file as above
  design_district <- svydesign(
    ids = ~cluster_id, # clusters (enumeration areas)
    strata = ~stratum, # urban/rural × region
    weights = ~weight, # hv005 / 1,000,000
    data = pr,
    nest= TRUE # clusters nested within strata
  )
    # Computing weighted prevalence PER DISTRICT
  svyby(
    formula = ~malaria,  # the outcome
    by = ~district_id,  # stratify by district
    design= design_district, # object created above
    FUN = svymean,  # use weighted mean
    na.rm  = TRUE,
    keep.var = TRUE   # also return SEs
  ) %>%
    # Renaming for clarity
    rename(p_hat = malaria, se_p = se) %>%
    # Drop districts with pathological estimates
    # (p = 0, p = 1, or zero SE as these would break the logit transform)
    filter(!is.na(p_hat), !is.na(se_p),
           se_p > 0, p_hat > 0, p_hat < 1) %>%
    # Add metadata and compute logit-scale quantities for Fay-Herriot
    mutate(
      country = pr$country[1],
      iso = pr$country[1],
      year = pr$year[1],
      round_id  = pr$round_id[1],

      # Bounding p away from 0 and 1 so the logit transform is finite
      p_bounded = pmax(0.02, pmin(0.98, p_hat)),
      # Logit-transformed direct estimate this is y_i in our Fay-Herriot
      y_i = log(p_bounded / (1 - p_bounded)),

      # Delta-method variance on the logit scale
      # From Var(logit(p)) ≈ Var(p) / [p(1-p)]²
      logit_var = se_p^2 / (pmax(0.02, p_hat)^2 * pmax(0.02, 1 - p_hat)^2),

      # SE on logit scale  this is psi_i in Fay-Herriot
      # Floored at 0.1 to prevent degenerate likelihood
      # Capped at 2.0 to prevent explosive uncertainty
      psi_i = pmax(0.1, pmin(sqrt(logit_var), 2.0))
    ) %>%

    filter(is.finite(y_i), is.finite(psi_i)) %>%
    select(country, iso, year, round_id, district_id, y_i, psi_i)
}

direct_est_list <- map(pr_list, get_direct_estimates)
direct_est_all  <- bind_rows(direct_est_list)

```

For numerical stability, we apply two bounds to the direct estimates. Prevalences are bounded away from 0 and 1 using the interval [0.02, 0.98] before the logit transformation, preventing infinite values in districts with very sparse data. Logit-scale standard errors $\psi_{i}$ are floored at 0.1 and capped at 2.0, preventing districts with anomalously small or large sampling variance from dominating or breaking the Fay-Herriot likelihood. These bounds follow the convention used in prior DHS-based small area estimation work [Wakefield et al., 2020; Wu et al., 2021].

Wakefield, J., Okonek, T., & Pedersen, J. (2020). Small Area Estimation for Disease
  Prevalence Mapping. International Statistical Review, 88(2), 398–418.
  https://doi.org/10.1111/insr.12400

Wu, Y., Li, Z. R., Mayala, B. K., Wang, H., Gao, P. A., Paige, J., Fuglstad, G-A.,
  Moe, C., Godwin, J., Donohue, R. E., Janocha, B., Croft, T. N., & Wakefield, J.
  (2021). Spatial Modeling for Subnational Administrative Level 2 Small-Area
  Estimation. DHS Spatial Analysis Reports No. 21. Rockville, Maryland, USA: ICF.

Full District by Round Panel Per COuntry

```{r full-district-panel, echo=TRUE}
build_panel <- function(country_iso, district_lookup, direct_est_all) {

  # Use country_iso throughout — avoids collision with iso column in joins
  rounds <- direct_est_all %>%
    filter(country == country_iso) %>%
    distinct(year) %>%
    arrange(year) %>%
    mutate(round_id = row_number())

  cat(country_iso, "years found:", rounds$year, "\n")  # verify

  est_clean <- direct_est_all %>%
    filter(country == country_iso) %>%
    select(district_id, year, y_i, psi_i)

  panel <- expand_grid(
    district_id = district_lookup$district_id,
    year        = rounds$year
  ) %>%
    left_join(
      district_lookup %>% select(district_id, district_name),
      by = "district_id"
    ) %>%
    left_join(est_clean, by = c("district_id", "year")) %>%
    left_join(rounds,    by = "year") %>%
    arrange(year, district_id) %>%
    mutate(
      country  = country_iso,
      observed = as.integer(!is.na(y_i))
    )

  cat(country_iso, "— rows:", nrow(panel),
      "| expected:", nrow(district_lookup) * nrow(rounds), "\n")

  panel %>%
    group_by(year) %>%
    summarise(
      total        = n(),
      n_observed   = sum(observed == 1),
      n_unobserved = sum(observed == 0),
      .groups      = "drop"
    ) %>%
    print()

  panel
}

# Update the map call to match new parameter name
panels <- map(
  c("GHA", "BFA", "CIV"),
  ~build_panel(.x, district_lookups[[.x]], direct_est_all)
)

names(panels) <- c("GHA", "BFA", "CIV")

```

The above code, a balanced panel is created by taking the cartesian product of all districts and all survey years for each country. The direct estimates are then merged onto this panel, resulting in a data frame where each row corresponds to a specific district-year combination. An `observed` indicator is created to flag which rows have direct estimates (i.e., where `y_i` is not NA) and which do not. Coverage varied by country and round: 153/260 (59%) and 124/260 (48%) districts in Ghana (2014, 2022); 232/351 (66%) and 153/351 (44%) departments in Burkina Faso (2010, 2021); and 80/113 (71%) and 103/113 (91%) departments in Côte d'Ivoire (2012, 2021). The balanced panel is structural to the model: observed district-years contribute to the Fay-Herriot likelihood while unobserved ones receive posterior estimates through the BYM2 spatial prior and the shared temporal component [Rao & Molina, 2015; Wu et al., 2021].

# Adjacency Matrix and Scaling factor

For each country, district adjacency was defined using queen contiguity, where two districts are neighbours if they share any portion of their boundary [Bivand et al., 2013]. This produces a symmetric binary adjacency matrix $\mathbf{W}$ W from which the ICAR precision matrix is constructed as $Q_{ICAR}= D − W$ D is a diagonal matrix of neighbour counts [Besag et al., 1991].

Because $Q_{ICAR}$ is rank-deficient by one, the constant vector lies in its null space, the marginal variances of the ICAR depend on the graph structure and are not unit-scale. Following Riebler et al. (2016), we computed a scaling factor $s$ as the geometric mean of the ICAR marginal variances:

$s = \exp\!\left(\frac{1}{N}\sum_{i=1}^{N} \log [\mathbf{Q}_{\text{ICAR}}^{-}]_{ii}\right)$

where  $Q_{ICAR}^−$ denotes the generalised inverse, approximated here by $(Q_{ICAR}+0.001I)^{−1}$ for numerical stability. Dividing the ICAR component by $\sqrt{s}$ in the BYM2 parametrisation ensures that the spatial component has geometric-mean marginal variance of 1, matching the IID component and making the spatial fraction $\phi$ interpretable and comparable across countries with different graph structures. The resulting scaling factors were 4.36 for Ghana, 3.36 for Burkina Faso, and 9.30 for Côte d'Ivoire, reflecting differences in the number and connectivity of administrative units.


```{r adjacency-matrix, echo=TRUE}
compute_scaling_factor <- function(nb, D) {

  adj_mat <- nb2mat(nb, style = "B", zero.policy = TRUE)
  Q_icar <- Diagonal(x = rowSums(adj_mat)) - as(adj_mat, "sparseMatrix")

  # Regularise to break rank deficiency
  eps <- 0.001
  Q_reg <- Q_icar + eps * Diagonal(D)

  # Diagonal of Q_reg^{-1} = marginal variances
  Q_inv_diag <- diag(solve(Q_reg))

  # Geometric mean of marginal variances (Riebler et al. 2016)
  exp(mean(log(Q_inv_diag)))
}

build_spatial <- function(iso, districts_ordered) {

  D <- nrow(districts_ordered)
  nb <- poly2nb(districts_ordered, queen = TRUE)

  # Fix islands using small snap tolerance
  if (sum(sapply(nb, length) == 0) > 0) {
    cat(iso, "— islands detected; applying snap fix\n")
    nb <- poly2nb(districts_ordered, queen = TRUE, snap = 0.1)
  }

  adj_mat <- nb2mat(nb, style = "B", zero.policy = TRUE)
  edges <- which(adj_mat == 1, arr.ind = TRUE) %>%
    as.data.frame() %>%
    filter(row < col)

  scaling_factor <- compute_scaling_factor(nb, D)

  cat(iso, "— districts:", D,
      "| edges:", nrow(edges),
      "| scaling factor:", round(scaling_factor, 4), "\n")

  list(
    D  = D,
    nb = nb,
    node1 = edges$row,
    node2 = edges$col,
    N_edges = nrow(edges),
    scaling_factor = scaling_factor
  )
}

spatial <- map(
  c("GHA", "BFA", "CIV"),
  ~build_spatial(.x, districts_ordered[[.x]])
)
names(spatial) <- c("GHA", "BFA", "CIV")


```


```{r covariates, echo=TRUE}
# COVARIATE PREPARATION — GHANA
# match district names to GADM boundaries, standardise for modellin
covariates_raw <- readxl::read_xlsx("covariates_ghana.xlsx") %>%
  rename(
    country = Country,
    date = Date,
    district_code = District,
    district_name = `...4`,
    elevation = Elevation,
    rainfall = Rainfall,
    pop_density  = PopDensity,
    pop_density_u5 = PopDensityUnder5
  ) %>%
  select(district_code, district_name, elevation, rainfall, pop_density_u5)

print(map_int(covariates_raw, ~sum(is.na(.x))))

# HANDLE MISSING RAINFALL REGIONAL MEAN IMPUTATION
covariates_clean <- covariates_raw %>%
  mutate(
    rainfall  = as.numeric(rainfall),
    imputed_flag = is.na(rainfall),
    region_code  = str_sub(district_code, 3, 4)
  ) %>%
  group_by(region_code) %>%
  mutate(
    rainfall = ifelse(is.na(rainfall),
                      mean(rainfall, na.rm = TRUE),
                      rainfall)
  ) %>%
  ungroup()

# Regional breakdown of imputation
covariates_clean %>%
  group_by(region_code) %>%
  summarise(
    total  = n(),
    imputed  = sum(imputed_flag),
    pct = round(100 * imputed / total, 1),
    .groups  = "drop"
  ) %>%
  print()

# 3. NAME CLEANING — STRIP ADMIN SUFFIXES AND PUNCTUATION
clean_name_strict <- function(x) {
  x %>%
    str_to_lower() %>%
    str_replace_all("[[:punct:]]+\\s*$", "") %>%
    str_replace_all("[-/]", " ") %>%
    str_replace_all("[[:punct:]]", "") %>%
    str_replace_all(
      "\\s+(municipal|municipality|metropolitan|metropolis|assembly|district)\\s*$",
      ""
    ) %>%
    str_replace_all("\\s+", " ") %>%
    str_squish()
}

covariates_clean <- covariates_clean %>%
  mutate(name_clean = clean_name_strict(district_name))

gadm_lookup <- district_lookups[["GHA"]] %>%
  mutate(name_clean = clean_name_strict(district_name))

# These districts have different names in the covariates file vs GADM due to
# historical renaming, splitting, or word-order variation:
#
#   Covariate name        GADM name              Reasoning
#   --------------------  --------------------   ------------------
#   Assin Fosu            Assin Central          Renamed (2017)
#   Awutu Senya           Awutu Senya West       Original district; GADM kept full name
#   Akyem Mansa           Akyemansa              Spelling variation
#   Asene Akroso Manso    Asene-Manso-Akroso     Word reordering
#   Lower Manya           Lower Manya-Krobo      Abbreviation

manual_crosswalk <- tribble(
  ~cov_clean,              ~gadm_clean,
  "assin fosu",            "assin central",
  "awutu senya",           "awutu senya west",
  "akyem mansa",           "akyemansa",
  "asene akroso manso",    "asene manso akroso",
  "lower manya",           "lower manya krobo"
)

covariates_clean <- covariates_clean %>%
  left_join(manual_crosswalk, by = c("name_clean" = "cov_clean")) %>%
  mutate(name_clean = coalesce(gadm_clean, name_clean)) %>%
  select(-gadm_clean)

# MATCH COVARIATES TO GADM DISTRICTS
matches_final <- covariates_clean %>%
  inner_join(gadm_lookup, by = "name_clean", suffix = c("_cov", "_gadm"))

cat("\nMatched districts:", nrow(matches_final), "of 260\n")

# Verify no remaining mismatches
cov_unmatched  <- anti_join(covariates_clean, gadm_lookup, by = "name_clean")
gadm_unmatched <- anti_join(gadm_lookup, covariates_clean, by = "name_clean")
#cat("Unmatched covariate rows:", nrow(cov_unmatched), "\n")
#cat("Unmatched GADM districts:", nrow(gadm_unmatched), "\n")

# BUILDING FINAL COVARIATE TABLE INDEXED BY DISTRICT_ID
covariates_gha <- matches_final %>%
  select(
    district_id,
    elevation,
    rainfall,
    pop_density_u5,
    imputed_flag
  )

cat("\nFinal covariate table:\n")
summary(covariates_gha)

#STANDARDISING FOR MODELLING
covariates_gha_std <- covariates_gha %>%
  mutate(across(
    c(elevation, rainfall, pop_density_u5),
    ~as.numeric(scale(.x)),
    .names = "{.col}_std"
  )) %>%
  select(district_id, ends_with("_std"), imputed_flag)

# Joining the covariates to the panel data for Ghana
panels[["GHA"]] <- panels[["GHA"]] %>%
  left_join(covariates_gha_std, by = "district_id")


```

#Building the model data list for Stan

```{r model-data-list, echo=TRUE}
build_stan_data <- function(iso, panels, spatial,
                            covar_cols = NULL) {

  panel <- panels[[iso]]
  sp <- spatial[[iso]]
  obs_idx <- which(panel$observed == 1)

  # Covariate matrix — only use columns that exist in the panel
  if (!is.null(covar_cols)) {
    available <- intersect(covar_cols, names(panel))
    if (length(available) > 0) {
      X <- as.matrix(panel[, available])
      K <- ncol(X)
    } else {
      X <- matrix(0, nrow = nrow(panel), ncol = 0)
      K <- 0L
    }
  } else {
    X <- matrix(0, nrow = nrow(panel), ncol = 0)
    K <- 0L
  }

  list(
    N = sp$D,
    T = 2L,
    NT = nrow(panel),
    N_obs = length(obs_idx),
    obs_idx = obs_idx,
    district = panel$district_id,
    time = panel$round_id,
    y = panel$y_i[obs_idx],
    psi = panel$psi_i[obs_idx],
    N_edges = sp$N_edges,
    node1 = sp$node1,
    node2 = sp$node2,
    scaling_factor = sp$scaling_factor,
    K = K,
    X = X
  )
}

#intialisng an empty list to hold the data for each country, which will be passed to Stan for model fitting. For Ghana, we include the covariates (elevation, rainfall, and under-5 population density) as these are available and have been processed. For Burkina Faso and Cote d'Ivoire, we currently set K = 0 and use an empty covariate matrix, as we do not have covariate data for those countries at this time. This structure allows us to easily extend the model to include covariates for the other countries in the future if such data becomes available.
stan_data <- list()

stan_data[["GHA"]] <- build_stan_data(
  "GHA", panels, spatial,
  covar_cols = c("elevation_std", "rainfall_std", "pop_density_u5_std")
)

# Burkina and CIV remain covariate-free for now (K = 0)
stan_data[["BFA"]] <- build_stan_data("BFA", panels, spatial)
stan_data[["CIV"]] <- build_stan_data("CIV", panels, spatial)

# checking the structure of the data list for Ghana to ensure it has been built correctly and contains all the necessary components for model fitting. This includes verifying the number of districts, the number of observed district-year combinations, the dimensions of the covariate matrix, and the scaling factor for the spatial component.
cat("Ghana Stan data:\n")
+ cat("  N (districts):     ", stan_data[["GHA"]]$N, "\n")
+ cat("  NT (rows):         ", stan_data[["GHA"]]$NT, "\n")
+ cat("  N_obs:             ", stan_data[["GHA"]]$N_obs, "\n")
+ cat("  N_edges:           ", stan_data[["GHA"]]$N_edges, "\n")
+ cat("  K (covariates):    ", stan_data[["GHA"]]$K, "\n")
+ cat("  X dimensions:      ", paste(dim(stan_data[["GHA"]]$X), collapse = " × "), "\n")
+ cat("  Scaling factor:    ", round(stan_data[["GHA"]]$scaling_factor, 3), "\n")

```

# Model Fitting

## Model specification

Following the Bayesian Fay-Herriot framework established by @wakefield2020 and extended by @wu2021 for DHS-based subnational estimation, we fit separate multilevel models for each country. The decision to fit country-specific models — rather than a single pooled model across Ghana, Burkina Faso, and Côte d'Ivoire — follows the established convention in DHS-based small area estimation [@wakefield2020; @wu2021] and is justified by structural differences in graph topology, transmission dynamics, and survey design across the three countries.

For each country, logit-transformed direct estimates $y_{it}$ and their design-based sampling standard errors $\psi_{it}$ entered a Gaussian likelihood $y_{it} \mid \theta_{it} \sim \mathcal{N}(\theta_{it}, \psi_{it}^2)$ for observed district-years. The latent logit prevalence decomposed as

$$
\theta_{it} = \mu + b_i + \delta_t + \mathbf{x}_i^\top {\beta},
$$

where $b_i$ is a district-level random effect, $\delta_t$ captures national temporal change, and $\mathbf{x}_i$ contains standardised static covariates.

The district effect used the BYM2 reparameterisation of @riebler2016,

$$
b_i = \sigma_b\!\left(\sqrt{1-\phi}\, v_i + \sqrt{\phi/s}\, u_i\right),
$$

which combines an IID component ($v_i \sim \mathcal{N}(0, 1)$) with a scaled intrinsic conditional autoregressive (ICAR) component ($\mathbf{u}$ with sum-to-zero constraint) through the mixing parameter $\phi \in (0, 1)$. The scaling factor $s$ was computed as the geometric mean of marginal variances under the unscaled ICAR prior [@riebler2016], rendering $\phi$ interpretable as the fraction of district variance that is spatially structured.

For the temporal component with $T = 2$ survey rounds, a first-order random walk prior is rank-deficient by one (analogous to the ICAR) and equivalent, after reparameterisation, to a sum-to-zero constraint [@goicoa2018; @knorrheld2000]. We therefore used $\delta_1 = \alpha$, $\delta_2 = -\alpha$, which automatically satisfies the constraint and makes $2\alpha$ directly interpretable as the national logit-scale change between rounds.

Priors followed PC-style recommendations: $\sigma_b \sim \text{Exp}(4.6)$ (approximating $P(\sigma_b > 1) = 0.01$), $\phi \sim \text{Beta}(3, 2)$, and $\mu \sim \mathcal{N}(-2, 1)$. For the three standardised covariates (elevation, rainfall, population density under 5) we used weakly informative priors $\beta_k \sim \mathcal{N}(0, 1)$. All models were fit in Stan [@stan2024] via the `cmdstanr` interface with 4 chains, 1500 warmup and 2000 sampling iterations, `adapt_delta = 0.99`, and `max_treedepth = 14`.


```{r model-fitting, echo=TRUE}
mod_st_cov <- cmdstan_model("fh_bym2_st_covar.stan")

fit_gha_cov <- mod_st_cov$sample(
  data = stan_data[["GHA"]],
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1500,
  iter_sampling = 2000,
  adapt_delta = 0.99,
  max_treedepth = 14,
  seed = 123,
  refresh = 500
)

# Diagnostics
fit_gha_cov$diagnostic_summary()

# Main parameters
fit_gha_cov$print(c("mu", "alpha", "sigma_b", "phi",
                    "delta_change", "p_round1", "p_round2",
                    "beta[1]", "beta[2]", "beta[3]"))

```