Commit c2ef046

doc: season summary lint and covid updates

1 parent 49f8921

1 file changed: scripts/reports/season_summary_2025.Rmd (+36 −38)
````diff
@@ -86,15 +86,15 @@ forecast_weeks_to_plot %in% (covid_scores$forecast_date %>% unique())

 # Models used
 One thing to note: all of these models filter out the 2020/21 and 2021/22 seasons.
-For both flu and covid they are either unusually large or unusually small, and don't warrant inclusion.
+For both flu and covid these seasons are either unusually large or unusually small, and don't warrant inclusion.
 We can split the models and ensembles into 3 categories: the ad-hoc models that we created in response to the actual data that we saw, the AR models that we had been backtesting, and the ensembles.

 ### The "ad-hoc" models

-- `climate_base` uses a 7 week window around the target and forecast date to establish quantiles.
-`climate_base` does this separately for each geo
+- `climate_base` uses a 7 week window around the target and forecast date to establish quantiles.
+`climate_base` does this separately for each geo.
 - `climate_geo_agged` on the other hand converts to rates, pools all geos, computes quantiles using similar time windows, and then converts back to counts.
-There is effectively only one prediction, scaled to fit each geo.
+There is effectively only one prediction, scaled to fit each geo.
 - `linear` does a linear extrapolation of the last 4 weeks of data on a rates scale.
 Initially it had an intercept, but this was removed when it caused the model to not reproduce the -1 ahead data exactly.
 This change was made on Jan 8th, in the commit with hash 5f7892b.
````
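As a rough illustration of the `climate_base` idea described in this hunk, a minimal sketch, assuming a `weekly` data frame with `geo_value`, `epiweek`, and `value` columns (the helper and all names here are illustrative assumptions, not the repo's actual implementation):

```r
library(dplyr)

# Sketch of a climatological quantile forecast: for each geo, pool values
# from a +/-3 week window around the target week across past seasons, then
# take empirical quantiles. Week 52/1 wrap-around is ignored for brevity.
climate_quantiles <- function(weekly, target_week,
                              probs = c(0.1, 0.25, 0.5, 0.75, 0.9)) {
  weekly %>%
    filter(abs(epiweek - target_week) <= 3) %>% # the 7 week window
    group_by(geo_value) %>%
    reframe(quantile_level = probs,
            value = quantile(value, probs, na.rm = TRUE))
}
```

`climate_geo_agged` would differ mainly in converting to rates and pooling all geos before the `quantile()` call, then rescaling back to counts per geo.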
````diff
@@ -127,9 +127,9 @@ weights %>% filter(forecast_family == "climate") %>% ggplot(aes(x = factor(ahead

 - `windowed_seasonal` is an AR forecaster using lags 0 and 7 that uses training data from an 8 week window from each year.
 It does quartic root scaling along with quantile and median whitening.
+In addition to dropping the first 2 seasons, the windowed models drop the summers for the purposes of determining whitening behavior.
 For flu, this augments with ili and flusurv (so they are added as additional rows, with their own scaling/centering).
 Covid doesn't have a comparable dataset.
-In addition to dropping the first 2 seasons, the windowed models drop the summers for the purposes of determining whitening behavior.
 - `windowed_seasonal_nssp` is like `windowed_seasonal`, but also has `nssp` as an exogenous component.
 Note that for flu, this effectively means throwing out the ili and flusurv data, since `nssp` is only defined recently.
 For covid, `windowed_seasonal_nssp` is effectively the same model, but with auxiliary data.
````
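For intuition, a hedged sketch of the quartic root scaling and median/IQR-style whitening mentioned above (the function names, and the choice of IQR for the scale, are assumptions; the actual transform lives in the repo's forecaster code):

```r
# Sketch: quartic root transform plus median/IQR whitening, and its inverse.
whiten <- function(x) {
  z <- x^(1 / 4) # the quartic root tames the heavy right tail of counts
  m <- median(z, na.rm = TRUE)
  s <- IQR(z, na.rm = TRUE)
  list(z = (z - m) / s, m = m, s = s) # keep the params to invert forecasts
}

unwhiten <- function(z, m, s) {
  pmax(z * s + m, 0)^4 # undo the whitening and the root, clamping negatives
}
```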
````diff
@@ -141,9 +141,9 @@ weights %>% filter(forecast_family == "climate") %>% ggplot(aes(x = factor(ahead
 - `retro_submission` is a retroactive recreation of `CMU-TimeSeries` using updated methods (`linear` always matching the most recent value, for example).
 The weights for the various models can be found in [`flu_geo_exclusions`](https://github.com/cmu-delphi/exploration-tooling/blob/main/flu_geo_exclusions.csv) or [`covid_geo_exclusions`](https://github.com/cmu-delphi/exploration-tooling/blob/main/covid_geo_exclusions.csv).
 These can vary on a state by state basis.
-- `CMU-TimeSeries` is what we actually submitted.
+- `CMU-TimeSeries` is what we actually submitted.
 This is a moving target that has changed a number of times. For a detailed list of the weights used, see [`flu_geo_exclusions`](https://github.com/cmu-delphi/exploration-tooling/blob/main/flu_geo_exclusions.csv) or [`covid_geo_exclusions`](https://github.com/cmu-delphi/exploration-tooling/blob/main/covid_geo_exclusions.csv).
-
+
 <details>
 <summary> A timeline of the changes to `CMU-timeseries` </summary>
 ```{r cmu_timeseries_timeline, echo=FALSE}
````
````diff
@@ -192,7 +192,7 @@ The best wis-scoring model is actually just the ensemble at 35.2, with the next-
 Coverage in covid is somewhat better, though a larger fraction of teams are within +/-10% of 95% coverage; we specifically got within 1%.
 Like with flu, though, there was systematic under-coverage, so the models are also biased towards intervals that are too narrow for the 95% band.
 The 50% coverage is likewise more accurate than for flu, with most forecasts within +/-10%.
-`CMU-TimeSeries` is at 52.7%, so slightly over.
+`CMU-TimeSeries` is at 52.7%, so slightly over.
 Generally, more teams were under 50% coverage than over, so there is also a systematic bias towards under-coverage in covid.

 ## Flu Scores
````
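The coverage numbers discussed in that hunk boil down to a simple empirical check; a minimal sketch, assuming a long table `forecasts` with `forecaster`, `quantile`, `value`, and `truth` columns (the schema is an assumption, not the report's actual one):

```r
library(dplyr)
library(tidyr)

# Sketch: empirical 50% and 95% interval coverage from long-format
# quantile forecasts.
forecasts %>%
  filter(quantile %in% c(0.025, 0.25, 0.75, 0.975)) %>%
  pivot_wider(names_from = quantile, values_from = value, names_prefix = "q") %>%
  group_by(forecaster) %>%
  summarize(
    coverage_50 = mean(truth >= q0.25 & truth <= q0.75, na.rm = TRUE),
    coverage_95 = mean(truth >= q0.025 & truth <= q0.975, na.rm = TRUE)
  )
```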
````diff
@@ -266,7 +266,7 @@ flu_current %>%
 There is a wide variety of peak lengths by this definition, but it does seem to naturally reflect the difference in dynamics.
 `ok` is quite short, for example, because it has a simple clean peak, whereas `or` has literally 2 peaks with the same height, so the entire interval between them is classified as peak.

-Boiling down these plots somewhat, let's look at the averages for the start of the peak and the end of the peak.
+Boiling down these plots somewhat, let's look at the averages for the start of the peak and the end of the peak.
 First, for the start:

 ```{r flu_peak_start}
````
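For reference, a sketch of the peak-interval classification being summarized in this hunk, assuming the same within-50%-of-max rule that the covid section uses and a `flu_current` frame with `geo_value`, `time_value`, and `value` columns (the `flu_within_max` helper and its naming are illustrative assumptions):

```r
library(dplyr)

# Sketch: mark every week whose value is within 50% of the geo's maximum
# as part of the peak; the first and last such weeks bound the interval.
flu_within_max <- flu_current %>%
  group_by(geo_value) %>%
  mutate(near_peak = value >= 0.5 * max(value, na.rm = TRUE)) %>%
  summarize(
    first_above = min(time_value[near_peak]),
    last_above = max(time_value[near_peak])
  )

flu_within_max$first_above %>% summary()
```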
````diff
@@ -561,9 +561,8 @@ It is worth noting that phase doesn't correspond to just grouping the dates, bec

 #### Ahead

-Factoring by ahead, the models that include an AR component generally degrade with ahead less badly.
-Interestingly, the pure `climate` models having a mostly consistent (and bad) score, but remains much more consistent as aheads increase.
-Most of the advantage of `PSI-PROF` and `FluSight-lop_norm` comes from having more accurate 2 and 3 week aheads.
+Factoring by ahead, the models that include an AR component generally degrade less badly as the ahead increases.
+Interestingly, the pure `climate` models have a mostly consistent (and bad) score, which holds steady as aheads increase (after the -1 ahead, where they typically have exact data).

 #### Sample forecasts

````
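The by-ahead comparison above is a straightforward aggregation; a sketch, assuming a long scores table `flu_scores` with `forecaster`, `ahead`, and `wis` columns (names are assumptions about the scoring schema):

```r
library(dplyr)

# Sketch: mean WIS by forecaster and ahead, to see how quickly each model
# degrades as the horizon grows.
flu_scores %>%
  group_by(forecaster, ahead) %>%
  summarize(mean_wis = mean(wis, na.rm = TRUE), .groups = "drop") %>%
  arrange(forecaster, ahead)
```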
````diff
@@ -575,13 +574,6 @@ The well performing models from other teams also had this behavior this year.

 ## Covid Scores

-Overall, the best covid forecaster is `windowed_seasonal_extra_sources`, which uses a window of data around the given time period
-
-One peculiar thing about Covid scoring: the first day has *much* worse scores than almost any of the subsequent days (you can see this in the Scores Aggregated By Forecast Date tab below).
-This mostly comes from the first week having larger revisions than normal.
-This is discussed in more detail in [this notebook](first_day_wrong.html).
-
-
 Before we get into the actual scores, we need to define how we go about creating 4 different phases.
 They are `increasing`, `peak`, `decreasing`, and `flat`.
 The last phase, `flat`, covers geos which didn't have an appreciable season for the year, which was relatively common for covid.
````
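A minimal sketch of how such a four-phase labeling might look, using the within-50%-of-peak rule described just below (the `label_phases` helper, the thresholds, and the column names are illustrative assumptions, not the report's actual code):

```r
library(dplyr)

# Illustrative four-phase labeling: "flat" if most of the season sits
# within 50% of a (low) maximum, otherwise "peak" inside the interval and
# "increasing"/"decreasing" on either side of it.
label_phases <- function(df) {
  df %>%
    group_by(geo_value) %>%
    mutate(
      near_peak = value >= 0.5 * max(value, na.rm = TRUE),
      phase = case_when(
        mean(near_peak) > 0.5 ~ "flat",
        near_peak ~ "peak",
        time_value < min(time_value[near_peak]) ~ "increasing",
        TRUE ~ "decreasing"
      )
    ) %>%
    ungroup()
}
```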
````diff
@@ -630,7 +622,7 @@ covid_current %>%
 Then we can see a very muted season in many locations, such as `ar` or `co`, and no season at all in some locations, such as `ak`.
 Others, such as `az`, `in`, or `mn`, have a season that is on par with historical ones.

-How to handle this?
+How to handle this?
 One option is to include a separate phase for no season that applies to the entire `geo_value` if more than half of the `time_value`s are within 50% of the peak:

 ```{r}
````
````diff
@@ -661,23 +653,20 @@ Possible exceptions:
 There are several locations such as `al` and `ar` which don't have a peak so much as an elevated level for approximately the entire period.
 This is awkward to handle for this classification.

-Finally, like for Flu we should examine a summary of the start/end dates for the peak of the season.
-Boiling down these plots somewhat, let's look at the averages for the start of the peak and the end of the peak.
-First, for the start:
+Finally, like for Flu, we should examine a summary of the start/end dates for the peak of the covid season.
+Boiling down these plots somewhat, let's look at the averages for the start of the peak and the end of the peak.
+First, for the start of the peak:

 ```{r}
 covid_within_max$first_above %>% summary()
 ```

-So the `increasing` phase ends at earliest on December 28st, on average on January 18th, and at the latest on April 19th.
-Which suggests
+Second, for the end of the peak:

 ```{r}
 covid_within_max$last_above %>% summary()
 ```

-Similarly, the `peak` phase ends at the earliest on the 11th of December, on average on the first of March, and at the latest on March 22nd.
-
 </details>

 ### Forecaster Scores for Covid: {.tabset}
````
````diff
@@ -704,12 +693,10 @@ covid_score_summary <- covid_scores %>%
     mean_coverage_50 = round(Mean(interval_coverage_50), 2),
     mean_coverage_90 = round(Mean(interval_coverage_90), 2),
     n = n()
-  ) %>%
-  arrange(mean_wis)
-
-wis_score_order <- covid_score_summary %>% pull(forecaster)
-pop_score_order <- covid_score_summary %>% arrange(pop_norm_wis) %>% pull(forecaster)
-datatable(covid_score_summary)
+  ) %>%
+  arrange(mean_wis) %>%
+  rename(id = forecaster) %>%
+  datatable()
 ```

 #### Scores Aggregated By Phase
````
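For readers unfamiliar with the `mean_wis` column in the chunk above: WIS (the weighted interval score) can be computed as twice the mean pinball loss across the submitted quantile levels. A sketch, assuming a long `quantile_forecasts` table with `truth`, `value`, and `quantile` columns (the schema is an assumption; the report's scores come from its own scoring pipeline):

```r
library(dplyr)

# Pinball (quantile) loss at level tau for predicted quantile q and truth y.
pinball <- function(y, q, tau) {
  ifelse(y >= q, tau * (y - q), (1 - tau) * (q - y))
}

# WIS for each forecast, then averaged per forecaster.
quantile_forecasts %>%
  group_by(forecaster, geo_value, target_end_date) %>%
  summarize(wis = 2 * mean(pinball(truth, value, quantile)), .groups = "drop") %>%
  group_by(forecaster) %>%
  summarize(mean_wis = mean(wis))
```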
````diff
@@ -937,10 +924,20 @@ covid_forecast_plt <- filtered_covid_forecasts %>%

 ggplotly(covid_forecast_plt)
 ```
+
 ### Results

-`windowed_seasonal_nssp` is a clear winner regardless of the metric used.
-`ensemble_windowed` is nearly as good, but since it is effectively averaging `windowed_seasonal_nssp` with `windowed_seasonal` and losing accuracy as a result, it is hardly worth it.
+One peculiar thing about Covid scoring: on the first forecast date, `CMU-TimeSeries` has *much* worse scores than on almost any of the subsequent dates (you can see this in the Scores Aggregated By Forecast Date tab below).
+There are two related issues here:
+- first, our initial model combined `climate_base` and `linear`, and the `climate_base` component was unusually bad early in the season, because this season started later than previous seasons;
+- second, the data had substantial revisions (discussed in detail in [this notebook](first_day_wrong.html)), though this effect is much smaller, since other forecasters had access to the same data.
+
+This mishap dragged the `CMU-TimeSeries` score down overall by quite a lot, and its better performance later in the season is not enough to make up for it.
+
+Overall, the best covid forecaster is `windowed_seasonal_nssp`, outperforming `CovidHub-ensemble`, regardless of the metric used.
+This forecaster uses a window of data around the given time period, along with the NSSP exogenous features.
+`ensemble_windowed` is nearly as good, but since it is effectively averaging `windowed_seasonal_nssp` with `windowed_seasonal` and losing accuracy as a result, it is hardly worth it.
+Given its simplicity, the `climate_linear` forecaster does quite well, though it's not as good as `windowed_seasonal_nssp`.

 The pure climate models were substantially worse for covid than for flu, at ~4.6x the best model, rather than ~2x.
 Given the unusual nature of the season, this is somewhat unsurprising.
````
````diff
@@ -975,11 +972,12 @@ The always decreasing problem is definitely not present in these forecasts.
 If anything, our best forecasts are *too* eager to predict an increasing value, e.g. in `tx` and `ca`.
 Several of our worse forecasts are clearly caused by revision behavior.

+
 # Revision behavior and data substitution

-This is covered in more detail in [revision_summary_report_2025](revision_summary_report_2025.html).
+This is covered in more detail in [revision_summary_report_2025](revision_summary_2025.html).
 NHSN has substantial under-reporting behavior that is fairly consistent for any single geo, though there are a number of aberrant revisions, some of which change the entire trajectory for a couple of weeks.
-This is even more true for NSSP than NHSN, though the size of the revisions is much smaller, and they occur more quickly.
+This is even more true for NSSP than NHSN, though the size of the revisions is much smaller, and they occur more quickly.
 Because of the speed in revision behavior, it matters only for prediction, rather than for correcting data for fitting the forecaster.
 We can probably improve our forecasts by incorporating revision behavior for both nhsn and nssp.

````
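A hedged sketch of how one might quantify that revision behavior with `epiprocess` archives, assuming `nhsn_archive` is an `epi_archive` of NHSN snapshots and the signal column is named `value` (both assumptions; `epix_as_of()` is the epiprocess snapshot accessor):

```r
library(epiprocess)
library(dplyr)

# Sketch: compare the first report of each week against the latest snapshot.
first_snapshot <- nhsn_archive %>% epix_as_of(as.Date("2024-12-07"))
latest_snapshot <- nhsn_archive %>% epix_as_of(nhsn_archive$versions_end)

inner_join(first_snapshot, latest_snapshot,
  by = c("geo_value", "time_value"), suffix = c("_first", "_final")
) %>%
  mutate(rel_revision = (value_final - value_first) / value_final)
```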
````diff
@@ -1109,5 +1107,5 @@ covid_gr %>%
 It's scored on N=4160 vs the local 3692, which probably comes down to negative aheads.
 Note that both "bests" in this paragraph are ignoring models which have far fewer submission values, since they're likely to be unrepresentative.

-[^2]: this is further off both in absolute and further yet in relative terms from our local scoring, which has `CMU-TimeSeries` at 46.32 rather than 44.8.
+[^2]: this is further off in absolute terms, and further yet in relative terms, from our local scoring, which has `CMU-TimeSeries` at 46.32 rather than 44.8.
 It's unclear why; there are 3952 samples scored on the remote vs 3692 locally, so there are ~300 forecasts scored there that we don't score locally, on which we apparently did better.
````