One thing to note: all of these models filter out the 2020/21 and 2021/22 seasons.
For both flu and covid these seasons are either unusually large or unusually small, and don't warrant inclusion.
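For concreteness, this filtering step amounts to something like the following (purely illustrative; `full_data` and its `season` column are assumed names, not the pipeline's actual objects):

```{r season_filter_sketch, eval=FALSE}
# Illustrative only: drop the atypical pandemic seasons before fitting or
# backtesting. `full_data` and its `season` column are assumed names.
library(dplyr)

train_data <- full_data %>%
  filter(!season %in% c("2020/21", "2021/22"))
```
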
We can split the models and ensembles into 3 categories: the ad-hoc models that we created in response to the actual data that we saw, the AR models that we had been backtesting, and the ensembles.

### The "ad-hoc" models

- `climate_base` uses a 7 week window around the target and forecast date to establish quantiles.
`climate_base` does this separately for each geo (a rough sketch of this idea, and of `linear` below, appears further down).
- `climate_geo_agged` on the other hand converts to rates, pools all geos, computes quantiles using similar time windows, and then converts back to counts.
There is effectively only one prediction, scaled to fit each geo.
- `linear` does a linear extrapolation of the last 4 weeks of data on a rates scale.
Initially it had an intercept, but this was removed when it caused the model to not reproduce the -1 ahead data exactly.
This change was made on Jan 8th, in the commit with hash 5f7892b.
- `retro_submission` is a retroactive recreation of `CMU-TimeSeries` using updated methods (`linear` always matching the most recent value, for example).
The weights for the various models can be found in [`flu_geo_exclusions`](https://github.com/cmu-delphi/exploration-tooling/blob/main/flu_geo_exclusions.csv) or [`covid_geo_exclusions`](https://github.com/cmu-delphi/exploration-tooling/blob/main/covid_geo_exclusions.csv).
These can vary on a state-by-state basis.
- `CMU-TimeSeries` is what we actually submitted.
This is a moving target that has changed a number of times; for the specific weights used, see [`flu_geo_exclusions`](https://github.com/cmu-delphi/exploration-tooling/blob/main/flu_geo_exclusions.csv) or [`covid_geo_exclusions`](https://github.com/cmu-delphi/exploration-tooling/blob/main/covid_geo_exclusions.csv).
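For concreteness, here is a minimal sketch of the two simplest ideas above: the climatological window behind `climate_base` and the anchored extrapolation behind `linear`. This is illustrative only, not the production code; `dat` and its `time_value`/`rate` columns are assumed names.

```{r adhoc_model_sketches, eval=FALSE}
# Illustrative sketches only; not the production implementations.
library(dplyr)
library(lubridate)

# `climate_base`-style quantiles for a single geo: pool historical values whose
# week of the year falls within +/-3 weeks of the target week, then take
# quantiles. (The real model also windows around the forecast date; year
# wrap-around is ignored here for simplicity.)
climate_window_quantiles <- function(dat, target_week,
                                     probs = c(0.1, 0.25, 0.5, 0.75, 0.9)) {
  windowed <- dat %>% filter(abs(epiweek(time_value) - target_week) <= 3)
  tibble(quantile_level = probs,
         value = unname(quantile(windowed$rate, probs, na.rm = TRUE)))
}

# `linear`-style point prediction: a slope fit to the last 4 weeks, anchored at
# the most recent value so the -1/0 ahead reproduces the data exactly.
linear_extrapolate <- function(dat, n_weeks = 4, aheads = 0:3) {
  recent <- dat %>% arrange(time_value) %>% slice_tail(n = n_weeks)
  latest <- recent$rate[nrow(recent)]
  weeks_back <- as.numeric(recent$time_value - max(recent$time_value)) / 7
  slope <- sum(weeks_back * (recent$rate - latest)) / sum(weeks_back^2)
  tibble(ahead = aheads, point = latest + slope * aheads)
}
```
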
<details>
<summary> A timeline of the changes to `CMU-TimeSeries` </summary>
```{r cmu_timeseries_timeline, echo=FALSE}
Coverage in covid is somewhat better, with a larger fraction of teams within +/-10% of 95% coverage; we specifically got within 1%.
Like with flu, though, there was systematic under-coverage, so the models are also biased towards intervals that are too narrow for the 95% band.
The 50% coverage is likewise more accurate than for flu, with most forecasts within +/-10%.
`CMU-TimeSeries` is at 52.7%, so slightly over.
Generally, more teams were under 50% coverage than over, so there is also a systematic bias towards under-coverage in covid.
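For reference, a coverage number like the ones above is just the fraction of observed values falling inside the corresponding prediction interval; a minimal sketch (the `scores` data frame and its columns are assumed names, not our scoring pipeline's):

```{r coverage_sketch, eval=FALSE}
# Illustrative only: empirical coverage is the fraction of observed values that
# fall inside the corresponding prediction interval. Column names are assumed.
library(dplyr)

scores %>%
  summarize(
    coverage_95 = mean(actual >= q_lower_95 & actual <= q_upper_95),
    coverage_50 = mean(actual >= q_lower_50 & actual <= q_upper_50)
  )
# A well-calibrated forecaster lands near 0.95 and 0.50; values below that
# indicate intervals that are too narrow (under-coverage).
```
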
## Flu Scores
The length of the peak varies widely under this definition, but it does seem to naturally reflect the differences in dynamics.
For example, `ok` is quite short because it has a simple, clean peak, whereas `or` has two peaks of exactly the same height, so the entire interval between them is classified as peak.
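A rough sketch of this peak definition, assuming a 50% threshold and a `value` column (both are illustrative assumptions, not necessarily what the plots above use):

```{r peak_interval_sketch, eval=FALSE}
# Illustrative only: for each geo, take the "peak" to run from the first to the
# last week whose value is within some fraction of the seasonal maximum.
# The 0.5 threshold and the `value` column name are assumptions.
library(dplyr)

flu_current %>%
  group_by(geo_value) %>%
  summarize(
    first_above = min(time_value[value >= 0.5 * max(value, na.rm = TRUE)]),
    last_above  = max(time_value[value >= 0.5 * max(value, na.rm = TRUE)]),
    peak_length = as.numeric(last_above - first_above),
    .groups = "drop"
  )
```
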
Boiling down these plots somewhat, let's look at the averages for the start of the peak and the end of the peak.
First, for the start:
```{r flu_peak_start}
#### Ahead
Factoring by ahead, the models that include an AR component generally degrade less badly as the ahead increases.
Interestingly, the pure `climate` models have a mostly consistent (and bad) score, which remains much more stable as aheads increase (after the -1 ahead, where they typically have exact data).
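For reference, the "degradation by ahead" comparison is just the per-forecaster mean score at each ahead; a minimal sketch with assumed names (`scores`, `wis`, `ahead`):

```{r ahead_degradation_sketch, eval=FALSE}
# Illustrative only: mean WIS per forecaster and ahead, which is what the
# "degrades with ahead" comparison summarizes. Column names are assumed.
library(dplyr)

scores %>%
  group_by(forecaster, ahead) %>%
  summarize(mean_wis = mean(wis, na.rm = TRUE), .groups = "drop") %>%
  arrange(forecaster, ahead)
```
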
#### Sample forecasts
## Covid Scores
Before we get into the actual scores, we need to define how we go about creating 4 different phases.
They are `increasing`, `peak`, `decreasing`, and `flat`.
The last phase, `flat`, covers geos which didn't have an appreciable season for the year, which was relatively common for covid.
Then we can see a very muted season in many locations, such as `ar` or `co`, and no season at all in some locations, such as `ak`.
Others, such as `az`, `in`, or `mn`, have a season that is on par with historical ones.
How to handle this?
One option is to include a separate phase for no season that applies to the entire `geo_value` if more than half of the `time_value`s are within 50% of the peak:
```{r}
There are several locations such as `al` and `ar` which don't have a peak so much as an elevated level for approximately the entire period.
This is awkward to handle for this classification.
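For illustration, one way such phase labels could be assigned per geo is sketched below; the thresholds and the `value` column are assumptions, and the report's actual classification is the one built in the chunks above.

```{r phase_label_sketch, eval=FALSE}
# Illustrative only: one way to label each week of each geo. Thresholds and the
# `value` column name are assumptions, not the report's actual choices.
library(dplyr)

covid_current %>%
  group_by(geo_value) %>%
  mutate(
    near_peak = value >= 0.5 * max(value, na.rm = TRUE),
    # "flat" geos: most of the season looks like the peak, i.e. no real season
    flat_geo = mean(near_peak) > 0.5,
    phase = case_when(
      flat_geo ~ "flat",
      time_value < min(time_value[near_peak]) ~ "increasing",
      time_value > max(time_value[near_peak]) ~ "decreasing",
      TRUE ~ "peak"
    )
  ) %>%
  ungroup()
```
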
Finally, like for Flu, we should examine a summary of the start/end dates for the peak of the covid season.
Boiling down these plots somewhat, let's look at the averages for the start of the peak and the end of the peak.
First, for the start of the peak:
```{r}
covid_within_max$first_above %>% summary()
```
Second, for the end of the peak:
```{r}
covid_within_max$last_above %>% summary()
```
One peculiar thing about Covid scoring: on the first forecast date, `CMU-TimeSeries` has *much* worse scores than on almost any of the subsequent dates (you can see this in the Scores Aggregated By Forecast Date tab below).
There are two related issues here:
- first, our initial model combined `climate_base` and `linear`, and the `climate_base` component was unusually bad early in the season because this season started later than previous seasons;
- second, the data had substantial revisions (discussed in detail in [this notebook](first_day_wrong.html)); however, this effect is much smaller, since other forecasters had access to the same data.
This mishap dragged the `CMU-TimeSeries` score down quite a lot overall, and its better performance later in the season is not enough to make up for it.
Overall, the best covid forecaster is `windowed_seasonal_nssp`, outperforming `CovidHub-ensemble`, regardless of the metric used.
This forecaster uses a window of data around the given time period, along with the NSSP exogenous features.
`ensemble_windowed` is nearly as good, but since it is effectively averaging `windowed_seasonal_nssp` with `windowed_seasonal` and losing accuracy as a result, it is hardly worth it.
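For reference, "averaging" two quantile forecasters can be as simple as a quantile-by-quantile mean of their predictions; the sketch below shows the general idea only (it is not necessarily how `ensemble_windowed` is actually combined, and the objects and column names are assumptions):

```{r ensemble_average_sketch, eval=FALSE}
# Illustrative only: combine two quantile forecasters by averaging their
# predicted values at each quantile level. Object and column names are assumed.
library(dplyr)

bind_rows(windowed_seasonal_nssp_preds, windowed_seasonal_preds) %>%
  group_by(geo_value, forecast_date, target_end_date, quantile_level) %>%
  summarize(value = mean(value), .groups = "drop")
```
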
Given its simplicity, the `climate_linear` forecaster does quite well, though it's not as good as `windowed_seasonal_nssp`.
The pure climate models were substantially worse for covid than for flu, at ~4.6x the best model, rather than ~2x.
Given the unusual nature of the season, this is somewhat unsurprising.
The always decreasing problem is definitely not present in these forecasts.
If anything, our best forecasts are *too* eager to predict an increasing value, e.g. in `tx` and `ca`.
Several of our worse forecasts are clearly caused by revision behavior.
# Revision behavior and data substitution
This is covered in more detail in [revision_summary_report_2025](revision_summary_2025.html).
NHSN has substantial under-reporting behavior that is fairly consistent for any single geo, though there are a number of aberrant revisions, some of which change the entire trajectory for a couple of weeks.
This is even more true for NSSP than NHSN, though the size of the revisions is much smaller, and they occur more quickly.
Because the revisions happen so quickly, revision behavior matters only for prediction, rather than for correcting the data used to fit the forecaster.
We can probably improve our forecasts by incorporating revision behavior for both NHSN and NSSP.
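As a rough sketch of how such revision behavior could be quantified (the archive data frame, with one row per `geo_value`, `time_value`, and report `version`, is an assumed layout rather than one of our actual objects):

```{r revision_behavior_sketch, eval=FALSE}
# Illustrative only: compare the first-reported value with the latest available
# value for each geo/week; consistent under-reporting shows up as mostly
# positive relative revisions within a geo. Object and column names are assumed.
library(dplyr)

nhsn_archive_df %>%
  group_by(geo_value, time_value) %>%
  summarize(
    first_report = value[which.min(version)],
    final_value = value[which.max(version)],
    .groups = "drop"
  ) %>%
  mutate(relative_revision = (final_value - first_report) / final_value) %>%
  group_by(geo_value) %>%
  summarize(median_relative_revision = median(relative_revision, na.rm = TRUE))
```
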
It's scored on N=4160 vs the local 3692, which probably comes down to negative aheads.
Note that both "bests" in this paragraph are ignoring models which have far fewer submission values, since they're likely to be unrepresentative.
[^2]: This is further off in absolute terms, and further still in relative terms, from our local scoring, which has `CMU-TimeSeries` at 46.32 rather than 44.8.
It's unclear why; there are 3952 samples scored on the remote vs 3692 locally, so there are ~300 forecasts scored there but not locally, on which we apparently did better.