Commit 6056fae

Set up vignette using vh55-3he6 data (#220)
Make the vignette run "out of the box":

- Update documentation
- Update and use the repo-tracked NIS data
- Add `data/get_nis.py`
- Remove old NIS data in `data/`
- Add new raw data there
- Simplify the config
- Add a `scripts/describe_data.py` (which will later supersede the `figs_*.py` scripts)
- Update Makefile to ensure the dependencies get propagated (i.e., don't use *directories* as targets)
- Small updates to `iup/` files for better type hinting and bug tracking

1 parent 5ef92e2 commit 6056fae

20 files changed

Lines changed: 369 additions & 8335 deletions

.github/workflows/mkdocs.yaml

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ jobs:
       - uses: actions/setup-python@v6
         with:
           python-version-file: ".python-version"
-      - run: uv sync --locked --only-group mkdocs
+      - run: uv sync --frozen --only-group mkdocs
       - run: uv run mkdocs build --strict
       - uses: actions/upload-pages-artifact@v4
         with:

Makefile

Lines changed: 22 additions & 27 deletions
@@ -1,52 +1,47 @@
 RUN_ID = test
-TOKEN_PATH = scripts/socrata_app_token.txt
-TOKEN = $(shell cat $(TOKEN_PATH))
 CONFIG = scripts/config.yaml
 SETTINGS = output/settings/$(RUN_ID)/
-RAW_DATA = output/data/$(RUN_ID)/
-MODEL_FITS = output/fits/$(RUN_ID)/
+RAW_DATA = data/raw.parquet
+DATA = output/data/$(RUN_ID)/nis.parquet
+FITS = output/fits/$(RUN_ID)/
 DIAGNOSTICS = output/diagnostics/$(RUN_ID)/
-PREDICTIONS = output/forecasts/$(RUN_ID)/
+FORECASTS = output/forecasts/$(RUN_ID)/
 SCORES = output/scores/$(RUN_ID)/
+DATA_PLOT = output/diagnostics/$(RUN_ID)/data_national.png
 
 
-.PHONY: clean nis delete_nis viz
+.PHONY: clean viz
 
-all: $(SETTINGS) $(RAW_DATA) $(MODEL_FITS) $(DIAGNOSTICS) $(PREDICTIONS) $(SCORES)
+all: $(SETTINGS) $(DATA) $(FITS) $(DIAGNOSTICS) $(FORECASTS) $(SCORES) $(DATA_PLOT)
 
 viz:
 	streamlit run scripts/viz.py -- \
-		--obs=$(RAW_DATA) --pred=$(PREDICTIONS) --score=$(SCORES) --config=$(CONFIG)
+		--obs=$(DATA) --pred=$(FORECASTS) --score=$(SCORES) --config=$(CONFIG)
 
-$(SCORES): scripts/eval.py $(PREDICTIONS) $(RAW_DATA)
+$(SCORES): scripts/eval.py $(FORECASTS) $(DATA)
 	python $< \
-		--pred=$(PREDICTIONS) --obs=$(RAW_DATA) --config=$(CONFIG) \
+		--pred=$(FORECASTS) --obs=$(DATA) --config=$(CONFIG) \
 		--output=$@
 
-$(PREDICTIONS): scripts/forecast.py $(RAW_DATA) $(MODEL_FITS) $(CONFIG)
-	python $< --input=$(RAW_DATA) --models=$(MODEL_FITS) --config=$(CONFIG) \
+$(FORECASTS): scripts/forecast.py $(DATA) $(FITS) $(CONFIG)
+	python $< --data=$(DATA) --models=$(FITS) --config=$(CONFIG) \
 		--output=$@
 
-$(DIAGNOSTICS): scripts/diagnostics.py $(MODEL_FITS) $(CONFIG)
-	python $< --input=$(MODEL_FITS) --config=$(CONFIG) --output=$@
+$(DIAGNOSTICS): scripts/diagnostics.py $(FITS) $(CONFIG)
+	python $< --input=$(FITS) --config=$(CONFIG) --output=$@
 
-$(MODEL_FITS): scripts/fit.py $(RAW_DATA) $(CONFIG)
-	python $< --input=$(RAW_DATA) --config=$(CONFIG) --output=$@
+$(FITS): scripts/fit.py $(DATA) $(CONFIG)
+	python $< --data=$(DATA) --config=$(CONFIG) --output=$@
 
-$(RAW_DATA): scripts/preprocess.py $(CONFIG)
-	python $< --config=$(CONFIG) --output=$@
+$(DATA_PLOT): scripts/describe_data.py $(DATA)
+	python $< --input=$(DATA) --output_dir=output/diagnostics/$(RUN_ID)/
+
+$(DATA): scripts/preprocess.py $(RAW_DATA) $(CONFIG)
+	python $< --config=$(CONFIG) --input=$(RAW_DATA) --output=$@
 
 $(SETTINGS): $(CONFIG)
 	mkdir -p $(SETTINGS)
 	cp $(CONFIG) $(SETTINGS)
 
 clean:
-	rm -r $(SETTINGS) $(RAW_DATA) $(MODEL_FITS) $(DIAGNOSTICS) $(PREDICTIONS) $(SCORES)
-
-nis:
-	python -c "import nisapi"
-	python -m nisapi cache --app-token=$(TOKEN)
-
-delete_nis:
-	python -c "import nisapi"
-	python -m nisapi delete
+	rm -r $(SETTINGS) $(DATA) $(FITS) $(DIAGNOSTICS) $(FORECASTS) $(SCORES)
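
The commit message's point about not using directories as targets can be illustrated with a simplified model of how `make` decides whether to rebuild: a target is rebuilt when it is missing or when any prerequisite is newer than it. The sketch below is not GNU Make's implementation; the function name `needs_rebuild` and all timestamps are invented for illustration. It relies on the common filesystem behavior that overwriting a file in place does not update the modification time of its parent directory, so a rule that depends on the directory may miss the change, while a rule that depends on the file itself (as `$(DATA)` now does) would catch it.

```python
from typing import Optional, Sequence


def needs_rebuild(target_mtime: Optional[float], prereq_mtimes: Sequence[float]) -> bool:
    """Simplified model of make's rule: rebuild if the target is missing
    or if any prerequisite is strictly newer than the target."""
    if target_mtime is None:  # target does not exist yet
        return True
    return any(m > target_mtime for m in prereq_mtimes)


# Hypothetical timestamps in seconds; none of these come from the repository.

# File target: the processed data was written at t=100, raw data changed at t=150.
file_target_stale = needs_rebuild(100.0, [150.0])

# Directory prerequisite: the directory was created at t=100, a downstream
# output was built at t=120, and a file inside the directory was overwritten
# in place at t=200. The directory's own mtime stays at t=100.
downstream_built_at = 120.0
dir_prereq_mtime = 100.0   # what make sees when the prerequisite is the directory
file_prereq_mtime = 200.0  # what make sees when the prerequisite is the file

missed_update = not needs_rebuild(downstream_built_at, [dir_prereq_mtime])
caught_update = needs_rebuild(downstream_built_at, [file_prereq_mtime])
```

Under these assumptions, `missed_update` and `caught_update` are both true: the directory-based rule skips a rebuild that the file-based rule performs.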

README.md

Lines changed: 55 additions & 74 deletions
@@ -1,44 +1,50 @@
 # Immunization uptake projections
 
-This repo contains statistical tools to predict the uptake of immunizations (primarily vaccines and boosters). The three primary steps are:
+This repo contains statistical tools to predict the uptake of immunizations (primarily vaccines and boosters).
 
-1. Import data sets on past uptake and cast them into a standardize format
-2. Fit a variety of models that both capture past uptake as well as project future uptake, and
-3. Evaluate model projections against realized uptake.
+## Getting started
 
-All three steps are currently under development.
+1. Read the docs at <https://cdcgov.github.io/cfa-immunization-uptake-projection>, or build them locally with `mkdocs serve`
+1. This project uses [`uv`](https://docs.astral.sh/uv/) for environment and dependency management. Ensure you can `uv sync`. Use the uv-managed virtual environment (e.g., by prepending `uv run`).
+1. Run the [vignette](#vignette).
 
-This approach is applicable to seasonal adult immunizations. Each year, the uptake process starts afresh, and individuals' transitions across age groups are not relevant.
+## Vignette
 
-## Data sources
+The vignette demonstrates a workflow using this package:
 
-Use <https://github.com/CDCgov/nis-py-api> for access to the NIS data.
+1. Fit a model to uptake data from past seasons
+1. Use it to forecast future uptake data in the latest season
+1. Evaluate forecasts against observed values
 
-## Getting started
+### Data source
+
+For convenience, the raw data are tracked in this repo under `data/`, which includes the script `get_nis.py`, used to collect that data with [`nis-py-api`](https://github.com/CDCgov/nis-py-api). These are estimates of season flu vaccine coverage, tracked monthly from the 2009/2010 to 2022/2023 seasons, from the [National Immunization Survey](https://www.cdc.gov/nis/about/index.html).
+
+### Running the vignette
+
+1. Copy `scripts/config_template.yaml` to `scripts/config.yaml`. This config can be modified; see the [file structure](#config-file-structure) below.
+1. Run `make` to run the model fitting and forecasting pipeline. Each run of the pipeline is assigned a `RUN_ID`. When a new `RUN_ID` is given, a new subfolder will be created inside each of the above six folders to store the corresponding outputs. When an existing `RUN_ID` is given, the contents of that `RUN_ID`'s existing subfolders will be overwritten, assuming the pipeline inputs have changed since the last run. `RUN_ID` can be assigned in line 1 of the Makefile or directly in the command line `make RUN_ID=name_of_run`.
+1. Inspect the `output/` subfolders:
+    - `settings`: a copy of the config.
+    - `data`: the pre-processed data.
+    - `fits`: the fit model object(s).
+    - `diagnostics`: diagnostic plots and tables for the desired model(s) and forecast date(s).
+    - `forecasts`: posterior predictions and forecasts.
+    - `scores`: evaluation scores comparing model structures and/or forecast dates.
+1. Run `make viz` to open a streamlit app in web browser, which shows the individual forecast trajectories, credible intervals, and evaluation scores, with options of dimensions and filters to customize the visualization.
+1. Optionally, `make clean` to remove all outputs for a particular `RUN_ID` .
+
+### Config file structure
 
-1. Either set up a virtual environment and install all dependencies with `uv sync` and then enter the virtual environment (with `.venv/Scripts/activate`, `.venv/bin/activate`, or similar), or else remember to prepend each of your command-line entries with `uv run` (e.g. `uv run make nis`).
-2. Get a [Socrata app token](https://github.com/CDCgov/nis-py-api?tab=readme-ov-file#getting-started) and save it in `scripts/socrata_app_token.txt`.
-3. Cache NIS data with `make nis`.
-4. Copy the config template in `scripts/config_template.yaml` to `scripts/config.yaml` and fill in the necessary fields.
-    - data: specify the vaccination uptake data to use, including a de facto annual start of the disease season, filters for rows and columns to keep, and grouping factors by which to partition forecasts.
-    - forecast_timeframe: specify the start and the end of the forecast period and the interval between reference dates in the forecast (using the [polars string language](https://docs.pola.rs/api/python/dev/reference/expressions/api/polars.date_range.html), e.g., `7d`).
-    - evaluation_timeframe: specify the interval between forecast dates if multiple forecasts are desired (sharing the same end of the forecast period). This will create different forecast horizons, which can be compared with evaluation scores. If blank, no evaluation score will not be computed.
-    - models: specify the name of the model (refer to `iup.models`), random seed, initial values of parameters, and parameters to use NUTS kernel in MCMC run.
-    - scores: specify the quantile of the posterior forecasts to use for evaluation, the date(s) on which to compute absolute difference, and any additional evaluation metrics (e.g. mean squared prediction error as `mspe`).
-    - forecast_plots: specify the credible interval (in fractional terms) and number of randomly chosen trajectories to show on forecast plots.
-    - diagnostics: specify the model (refer to `iup.models`) and the range of forecast dates (i.e. a list of earliest and latest) on which to perform diagnostics, as well as the types of plots and tables to create (refer to `iup.diagnostics`).
-5. Run `make all` to run the model fitting and forecasting pipeline. This will create six `output/` subfolders:
-    - `settings`: a copy of the config.
-    - `data`: the pre-processed data.
-    - `fits`: the fit model object(s).
-    - `diagnostics`: diagnostic plots and tables for the desired model(s) and forecast date(s).
-    - `forecasts`: posterior predictions and forecasts.
-    - `scores`: evaluation scores comparing model structures and/or forecast dates.
-    Each run of the pipeline is assigned a `RUN_ID`. When a new `RUN_ID` is given, a new subfolder will be created inside each of the above six folders to store the corresponding outputs. When an existing `RUN_ID` is given, the contents of that `RUN_ID`'s existing subfolders will be overwritten, assuming the pipeline inputs have changed since the last run. `RUN_ID` can be assigned in line 1 of the Makefile or directly in the command line `make all RUN_ID=name_of_run`.
-6. Run `make viz` to open a streamlit app in web browser, which shows the individual forecast trajectories, credible intervals, and evaluation scores, with options of dimensions and filters to customize the visualization.
-7. Run `make clean` to remove all outputs for a particular `RUN_ID` and `make delete_nis` to delete the NIS data from the cache.
-
-#### Package workflow:
+- data: specify the vaccination uptake data to use, including a de facto annual start of the disease season, filters for rows and columns to keep, and grouping factors by which to partition forecasts.
+- forecast_timeframe: specify the start and the end of the forecast period and the interval between reference dates in the forecast (using the [polars string language](https://docs.pola.rs/api/python/dev/reference/expressions/api/polars.date_range.html), e.g., `7d`).
+- evaluation_timeframe: specify the interval between forecast dates if multiple forecasts are desired (sharing the same end of the forecast period). This will create different forecast horizons, which can be compared with evaluation scores. If blank, no evaluation score will not be computed.
+- models: specify the name of the model (refer to `iup.models`), random seed, initial values of parameters, and parameters to use NUTS kernel in MCMC run.
+- scores: specify the quantile of the posterior forecasts to use for evaluation, the date(s) on which to compute absolute difference, and any additional evaluation metrics (e.g. mean squared prediction error as `mspe`).
+- forecast_plots: specify the credible interval (in fractional terms) and number of randomly chosen trajectories to show on forecast plots.
+- diagnostics: specify the model (refer to `iup.models`) and the range of forecast dates (i.e. a list of earliest and latest) on which to perform diagnostics, as well as the types of plots and tables to create (refer to `iup.diagnostics`).
+
+### Vignette workflow
 
 ```mermaid
 
@@ -103,7 +109,6 @@ config --> diagnostics.py
 config --> forecast.py
 config --> eval.py
 
-
 style nis_data fill: #8451b5
 style forecast fill: #8451b5
 style scores fill: #8451b5
@@ -118,68 +123,44 @@ style diagnostic_plot fill: #b46060
 style proj_plot fill: #b46060
 style pred_summary fill: #b46060
 style score_plot fill: #b46060
-
-
 ```
 
 ## Project admins
 
-- Edward Schrom (CDC/CFA/Predict) <tec0@cdc.gov>
+- Scott Olesen (CDC/CFA/Predict) <ulp7@cdc.gov>
+
+## Disclaimers
 
-## General Disclaimer
+### General Disclaimer
 
 This repository was created for use by CDC programs to collaborate on public health related projects in support of the [CDC mission](https://www.cdc.gov/about/organization/mission.htm). GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.
 
-## Public Domain Standard Notice
+### Public Domain Standard Notice
 
-This repository constitutes a work of the United States Government and is not
-subject to domestic copyright protection under 17 USC § 105. This repository is in
-the public domain within the United States, and copyright and related rights in
-the work worldwide are waived through the [CC0 1.0 Universal public domain dedication](https://creativecommons.org/publicdomain/zero/1.0/).
-All contributions to this repository will be released under the CC0 dedication. By
-submitting a pull request you are agreeing to comply with this waiver of
-copyright interest.
+This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the [CC0 1.0 Universal public domain dedication](https://creativecommons.org/publicdomain/zero/1.0/). All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.
 
-## License Standard Notice
+### License Standard Notice
 
 This repository is licensed under ASL v2 or later.
 
-This source code in this repository is free: you can redistribute it and/or modify it under
-the terms of the Apache Software License version 2, or (at your option) any
-later version.
+This source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.
 
-This source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY
-WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
-PARTICULAR PURPOSE. See the Apache Software License for more details.
+This source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.
 
-You should have received a copy of the Apache Software License along with this
-program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html
+You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html
 
 The source code forked from other open source projects will inherit its license.
 
-## Privacy Standard Notice
+### Privacy Standard Notice
 
-This repository contains only non-sensitive, publicly available data and
-information. All material and community participation is covered by the
-[Disclaimer](https://github.com/CDCgov/template/blob/master/DISCLAIMER.md)
-and [Code of Conduct](https://github.com/CDCgov/template/blob/master/code-of-conduct.md).
-For more information about CDC's privacy policy, please visit [http://www.cdc.gov/other/privacy.html](https://www.cdc.gov/other/privacy.html).
+This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the [Disclaimer](https://github.com/CDCgov/template/blob/master/DISCLAIMER.md) and [Code of Conduct](https://github.com/CDCgov/template/blob/master/code-of-conduct.md). For more information about CDC's privacy policy, please visit [http://www.cdc.gov/other/privacy.html](https://www.cdc.gov/other/privacy.html).
 
-## Contributing Standard Notice
+### Contributing Standard Notice
 
-Anyone is encouraged to contribute to the repository by [forking](https://help.github.com/articles/fork-a-repo)
-and submitting a pull request. (If you are new to GitHub, you might start with a
-[basic tutorial](https://help.github.com/articles/set-up-git).) By contributing
-to this project, you grant a world-wide, royalty-free, perpetual, irrevocable,
-non-exclusive, transferable license to all users under the terms of the
-[Apache Software License v2](http://www.apache.org/licenses/LICENSE-2.0.html) or
-later.
+Anyone is encouraged to contribute to the repository by [forking](https://help.github.com/articles/fork-a-repo) and submitting a pull request. (If you are new to GitHub, you might start with a [basic tutorial](https://help.github.com/articles/set-up-git).) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the [Apache Software License v2](http://www.apache.org/licenses/LICENSE-2.0.html) or later.
 
-All comments, messages, pull requests, and other submissions received through
-CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at [http://www.cdc.gov/other/privacy.html](http://www.cdc.gov/other/privacy.html).
+All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at [http://www.cdc.gov/other/privacy.html](http://www.cdc.gov/other/privacy.html).
 
-## Records Management Standard Notice
+### Records Management Standard Notice
 
-This repository is not a source of government records but is a copy to increase
-collaboration and collaborative potential. All government records will be
-published through the [CDC web site](http://www.cdc.gov).
+This repository is not a source of government records but is a copy to increase collaboration and collaborative potential. All government records will be published through the [CDC web site](http://www.cdc.gov).

data/.placeholder

Whitespace-only changes.

data/get_nis.py

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+"""
+Download data from https://data.cdc.gov/Flu-Vaccinations/Influenza-Vaccination-Coverage-for-All-Ages-6-Mont/vh55-3he6/about_data
+"""
+
+import nisapi
+import polars as pl
+
+data = (
+    nisapi.get_nis()
+    .filter(
+        pl.col("vaccine") == pl.lit("flu"),
+        pl.col("geography_type").is_in(["nation", "admin1"]),
+        pl.col("domain_type") == pl.lit("age & possible risk"),
+        pl.col("domain") == pl.lit(">=18 years"),
+        pl.col("time_type") == pl.lit("month"),
+        pl.col("indicator_type") == pl.lit("received a vaccination"),
+        pl.col("indicator") == pl.lit("yes"),
+        pl.col("id") == pl.lit("vh55-3he6"),
+    )
+    .select(
+        [
+            "geography_type",
+            "geography",
+            "time_end",
+            "estimate",
+            "lci",
+            "uci",
+            "sample_size",
+        ]
+    )
+    .sort(["geography_type", "geography", "time_end"])
+    .collect()
+)
+
+data.write_parquet("data/raw.parquet")
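
When several predicates are passed to a single polars `filter` call, as in `get_nis.py` above, they are combined with logical AND. The plain-Python sketch below mimics the same filter, select, and sort steps on a few invented records, using only a subset of the script's conditions and columns. The rows, geography names, and estimate values are placeholders for illustration, not NIS data, and the sketch is not a substitute for the polars query.

```python
# Invented records for illustration only; real rows come from nisapi.get_nis().
rows = [
    {"vaccine": "flu", "geography_type": "nation", "geography": "nation",
     "time_type": "month", "time_end": "2020-10-31", "estimate": 0.2},
    {"vaccine": "flu", "geography_type": "admin1", "geography": "Alaska",
     "time_type": "month", "time_end": "2020-09-30", "estimate": 0.1},
    {"vaccine": "covid", "geography_type": "nation", "geography": "nation",
     "time_type": "month", "time_end": "2020-10-31", "estimate": 0.3},
    {"vaccine": "flu", "geography_type": "admin2", "geography": "somewhere",
     "time_type": "week", "time_end": "2020-10-03", "estimate": 0.4},
]

# A subset of the conditions in get_nis.py; a row is kept only if ALL hold.
conditions = [
    lambda r: r["vaccine"] == "flu",
    lambda r: r["geography_type"] in ("nation", "admin1"),
    lambda r: r["time_type"] == "month",
]

# A subset of the selected columns.
keep = ["geography_type", "geography", "time_end", "estimate"]

filtered = [r for r in rows if all(cond(r) for cond in conditions)]
selected = [{k: r[k] for k in keep} for r in filtered]

# Sort by the same keys, in the same order, as the script's .sort() call.
result = sorted(
    selected,
    key=lambda r: (r["geography_type"], r["geography"], r["time_end"]),
)
```

Here `result` holds the two flu rows at the `admin1` and `nation` levels, with `admin1` first because of the sort on `geography_type`.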
