initial decreasing forecasters rmd

dsweber2 · dsweber2 · commit 55ddeee824e2 · 2025-04-10T17:25:13.000-05:00
diff --git a/scripts/reports/decreasing_forecasters.Rmd b/scripts/reports/decreasing_forecasters.Rmd
@@ -0,0 +1,176 @@
+---
+title: "Decreasing Forecasters"
+author: Delphi Forecast Team
+date: "`r Sys.Date()`"
+output:
+  html_document:
+    code_folding: show
+    toc: True
+    # self_contained: False
+    # lib_dir: libs
+params:
+  disease: "covid"
+  forecast_res: !r ""
+  forecast_date: !r ""
+  truth_data: !r ""
+---
+
+$$\\[.4in]$$
+
+```{r echo=FALSE}
+knitr::opts_chunk$set(
+  fig.align = "center",
+  message = FALSE,
+  warning = FALSE,
+  cache = FALSE
+)
+knitr::opts_knit$set(root.dir = here::here())
+ggplot2::theme_set(ggplot2::theme_bw())
+source(here::here("R/load_all.R"))
+```
+
+This notebook is, in part, a piece of this year's retrospective.
+For many of the direct forecasters, the forecast is strictly decreasing, even in the middle of the surge.
+This effect is most prominent in flu, but also occurs to a lesser extent in covid.
+We need to identify the source of this.
+It is some combination of the data and the models used.
+
+This notebook depends on having successfully run the `flu_hosp_explore` targets pipeline to handle the creation of the basic dataset.
+Accordingly, you need `.Renviron` to include `TAR_PROJECT=flu_hosp_prod`.
+```{r}
+tar_make(joined_archive_data)
+joined_archive <- tar_read(joined_archive_data)
+hhs_archive <- tar_read(hhs_archive) %>% as_epi_archive()
+```
+
+To avoid running too many forecasts, we'll limit to a single forecast date just after the peak of the rate of growth, so that nearly every location is still increasing.
+
+```{r}
+hhs_archive %>% epix_as_of_current() %>% filter(time_value > "2023-10-01") %>% autoplot(hhs)
+hhs_gr <- hhs_archive %>%
+  epix_as_of_current() %>%
+  group_by(geo_value) %>%
+  mutate(gr_hhs = growth_rate(hhs)) %>%
+  filter(time_value > "2023-10-01")
+hhs_gr %>%
+  arrange(gr_hhs) %>%
+  drop_na() %>%
+  slice_max(gr_hhs) %>%
+  ungroup() %>%
+  group_by(time_value) %>%
+  summarize(nn = n())
+```
+
+So the growth rate peaks around November 15th.
+
+```{r}
+forecast_date <- as.Date("2023-11-29")
+hhs_gr %>% autoplot(gr_hhs) +
+  geom_vline(aes(xintercept = forecast_date), lty = 2) +
+  labs(title = "growth rates")
+```
+
+Most locations are still increasing 2 weeks later on the 29th, so we'll use that as the forecast date.
+
+# Some utility functions
+
+Since we don't really need to run the full pipeline to get forecasts from a single day and forecaster, we build a couple of functions for inspecting forecasts.
+```{r}
+forecast_aheads <- function(forecaster, epi_data = hhs_forecast, aheads = 0:4 * 7) {
+  all_forecasts <- map(aheads, \(ahead) forecaster(epi_data, ahead)) %>% list_rbind()
+  all_forecasts
+}
+```
+
+Here's a way to easily plot a subset of the forecasts, with bands at the 80% and 50% intervals (.1-.9 and .25-.75) against the finalized data.
+```{r}
+plot_forecasts <- function(all_forecasts,
+                           geo_values,
+                           data_archive = hhs_archive,
+                           earliest_truth_data = NULL) {
+  if (is.null(earliest_truth_data)) {
+    earliest_truth_data <- all_forecasts$forecast_date[[1]] - as.difftime(365, units = "days")
+  }
+  # transform the archive to something useful for comparison
+  finalized_plotting <- data_archive %>%
+    epix_as_of_current() %>%
+    filter(time_value <= max(all_forecasts$target_end_date), geo_value %in% geo_values) %>%
+    as_tibble() %>%
+    mutate(forecast_date = time_value) %>%
+    filter(time_value >= earliest_truth_data)
+  all_forecasts %>% filter(geo_value %in% geo_values) %>%
+    pivot_wider(names_from = quantile, values_from = value) %>%
+    ggplot(aes(x = target_end_date, group = geo_value, fill = forecast_date)) +
+    geom_ribbon(aes(ymin = `0.1`, ymax = `0.9`), alpha = 0.4) +
+    geom_ribbon(aes(ymin = `0.25`, ymax = `0.75`), alpha = 0.6) +
+    geom_line(aes(y = `0.5`, color = forecast_date)) +
+    geom_line(
+      data = finalized_plotting, aes(x = time_value, y = hhs)) +
+    facet_wrap(~geo_value, scales = "free") +
+    theme(legend.position = "none")
+}
+```
+
+And a method to inspect whether things are increasing that isn't just the eyeball norm on a few of them.
+This calculates growth rates for each quantile and each location.
+```{r}
+get_growth_rates <- function(forecasts, quantiles = NULL, outlier_bound = 1e2, ...) {
+  if (is.null(quantiles)) {
+    quantiles <- forecasts$quantile %>% unique()
+  }
+  forecasts %>%
+    group_by(geo_value, quantile) %>%
+    filter(min(value) != max(value), quantile %in% quantiles) %>%
+    mutate(growth = growth_rate(value, ...)) %>%
+    filter(abs(growth) < outlier_bound)
+}
+```
+
+# Establishing the problem
+
+```{r}
+hhs_forecast <- hhs_archive %>% epix_as_of(forecast_date)
+all_forecasts <- forecast_aheads(\(x, ahead) scaled_pop(x, "hhs", ahead = ahead))
+default_geos <- c("ca", "fl", "ny", "pa", "tx")
+plot_forecasts(all_forecasts, default_geos)
+```
+
+All the forecasts are going down rather than up, even though they have multiple weeks of increasing data!
+More quantitatively, across all geos:
+```{r}
+basic_gr <- get_growth_rates(all_forecasts, quantiles = 0.5, method = "smooth_spline")
+basic_gr %>% arrange(desc(growth))
+```
+The only places where the growth rate is positive are American Samoa (`as`) and the US overall, both of which have unusual data trends (American Samoa because its values are ~0, and the US because its values are unusually large).
+As a histogram (each state is included 5 times, once per ahead):
+```{r}
+basic_gr %>% ggplot(aes(x = growth)) + geom_histogram(bins = 300)
+```
+## It goes away if we use very short windows
+If we limit to the last 3 weeks of data (so effectively just a linear extrapolation shared across geos), it goes away:
+```{r}
+hhs_forecast <- hhs_archive %>% epix_as_of(forecast_date)
+all_short_forecasts <- forecast_aheads(\(x, ahead) scaled_pop(x, "hhs", ahead = ahead, n_training = 3))
+plot_forecasts(all_short_forecasts, default_geos)
+```
+
+They're pretty jittery, but strictly decreasing they are not.
+And the corresponding growth rates:
+
+```{r}
+short_gr <- get_growth_rates(all_short_forecasts, quantiles = 0.5, method = "smooth_spline")
+short_gr %>% arrange(growth) %>% ggplot(aes(x = growth)) + geom_histogram(bins = 300)
+```
+So on a day-over-day basis the growth rate is mostly positive, with some strong positive outliers and some negative values.
+
+# Is it geo pooling?
+Let's see what happens if we restrict ourselves to training each geo separately.
+```{r}
+hhs_forecast <- hhs_archive %>% epix_as_of(forecast_date)
+all_geos <- hhs_forecast %>% distinct(geo_value) %>% pull(geo_value)
+hhs_forecast %>% filter(!is.na(hhs)) %>% group_by(geo_value) %>% summarize(n_points = n()) %>% arrange(n_points)
+all_geos_forecasts <- map(all_geos, \(geo) forecast_aheads(\(x, ahead) scaled_pop(x, "hhs", ahead = ahead), epi_data = hhs_forecast %>% filter(geo_value == geo)))
+all_geos_forecasts %>% list_rbind() %>% plot_forecasts(default_geos)
+```
+
+And the phenomenon is still happening.