Commit 021d8be

Add initial implementation for combining malaria draws and uploading data
- Created a new script `01_combine_as_draws_parallel.py` for parallel processing of malaria draws.
- Added a Jupyter notebook `combine_as_draws.ipynb` for data processing and upload preparation.
- Implemented the main logic in `combine_as_draws.py` to handle data extraction, processing, and saving.
- Introduced a new Jupyter notebook `todo.ipynb` for tracking code checks and specific tasks.
- Ensured compatibility with existing constants and helper functions from the `idd_forecast_mbp` package.
- Added argument parsing for flexible input parameters in the draw processing script.
1 parent fb7aa62 commit 021d8be

74 files changed: +22251 -443 lines changed

function_summaries.md

Lines changed: 341 additions & 0 deletions
@@ -0,0 +1,341 @@
# Function Documentation Summary

## YAML and Configuration Functions

### `load_yaml_dictionary(yaml_path: str) -> dict`
**Purpose**: Loads a YAML file and extracts the `COVARIATE_DICT` section.

**Inputs**:
- `yaml_path`: String path to the YAML file

**Outputs**:
- Dictionary containing the `COVARIATE_DICT` from the YAML file

**Functions it uses**: None (uses PyYAML's `yaml.safe_load`)

**Functions that use it**: `parse_yaml_dictionary()`

---

### `parse_yaml_dictionary(covariate: str) -> dict`
**Purpose**: Parses covariate-specific configuration from the YAML dictionary and calculates derived values.

**Inputs**:
- `covariate`: String name of the covariate to extract configuration for

**Outputs**:
- Dictionary with parsed covariate configuration including:
  - `covariate_name`: Name of the covariate
  - `covariate_resolution`: Calculated resolution (numerator/denominator)
  - `years`: List of years from start to end
  - `synoptic`: Synoptic flag
  - `cc_sensitive`: Climate change sensitivity flag
  - `summary_statistic`: Summary statistic method
  - `path`: File path

**Functions it uses**: `load_yaml_dictionary()`

**Functions that use it**: Not directly called by other functions in this module

---
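As a rough illustration of the derived values listed for `parse_yaml_dictionary()`, the sketch below works on an already-loaded dictionary. The function name and the input keys (`covariate_resolution_numerator`, `covariate_resolution_denominator`, `year_start`, `year_end`) are assumptions made for this example, since the YAML schema is not shown in this summary. It also assumes the year range is inclusive, and the `precip` entry is made-up data.

```python
def parse_covariate_entry(covariate: str, covariate_dict: dict) -> dict:
    """Sketch: derive the resolution and year list for one covariate entry."""
    entry = covariate_dict[covariate]
    # Hypothetical key names; the real YAML schema may differ.
    resolution = (
        entry["covariate_resolution_numerator"]
        / entry["covariate_resolution_denominator"]
    )
    # Assumes an inclusive range from the start year to the end year.
    years = list(range(entry["year_start"], entry["year_end"] + 1))
    return {
        "covariate_name": entry["covariate_name"],
        "covariate_resolution": resolution,
        "years": years,
        "synoptic": entry["synoptic"],
        "cc_sensitive": entry["cc_sensitive"],
        "summary_statistic": entry["summary_statistic"],
        "path": entry["path"],
    }


# Made-up entry, for illustration only.
example_dict = {
    "precip": {
        "covariate_name": "precip",
        "covariate_resolution_numerator": 1,
        "covariate_resolution_denominator": 4,
        "year_start": 2000,
        "year_end": 2003,
        "synoptic": False,
        "cc_sensitive": True,
        "summary_statistic": "mean",
        "path": "/path/to/precip",
    }
}
parsed = parse_covariate_entry("precip", example_dict)
```

Passing the dictionary in explicitly keeps the sketch independent of any file on disk; the documented function takes only `covariate` and obtains the dictionary through `load_yaml_dictionary()`.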
41+
42+
## Data Merging and Reading Functions
43+
44+
### `merge_dataframes(model_df, dfs)`
45+
**Purpose**: Merges multiple DataFrames with a base model DataFrame on location_id and year_id.
46+
47+
**Inputs**:
48+
- `model_df`: Base pandas DataFrame
49+
- `dfs`: Dictionary of DataFrames to merge
50+
51+
**Outputs**:
52+
- Merged pandas DataFrame with suffixes added for duplicate columns
53+
54+
**Functions it uses**: None (uses pandas merge)
55+
56+
**Functions that use it**: Not directly called by other functions in this module
57+
58+
---
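The merge described for `merge_dataframes()` could look roughly like the following. This is a sketch under two assumptions that the summary does not confirm: a left join is used so that every row of the base DataFrame is kept, and the dictionary key is used as the suffix for clashing column names.

```python
import pandas as pd


def merge_dataframes_sketch(model_df: pd.DataFrame, dfs: dict) -> pd.DataFrame:
    """Sketch: merge each DataFrame in `dfs` onto `model_df`."""
    merged = model_df.copy()
    for key, df in dfs.items():
        # Assumption: left join, with the dictionary key as the suffix
        # for columns that already exist in the merged frame.
        merged = merged.merge(
            df,
            on=["location_id", "year_id"],
            how="left",
            suffixes=("", f"_{key}"),
        )
    return merged


# Made-up data, for illustration only.
model_df = pd.DataFrame(
    {"location_id": [1, 2], "year_id": [2020, 2020], "value": [0.1, 0.2]}
)
dfs = {
    "income": pd.DataFrame(
        {"location_id": [1, 2], "year_id": [2020, 2020], "value": [10.0, 20.0]}
    )
}
merged = merge_dataframes_sketch(model_df, dfs)
```

Here the clashing `value` column from the `income` frame ends up as `value_income`, while the base column keeps its name.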
59+
60+
### `read_income_paths(income_paths, rcp_scenario, VARIABLE_DATA_PATH)`
61+
**Purpose**: Reads multiple income data files, filters by RCP scenario, and processes them.
62+
63+
**Inputs**:
64+
- `income_paths`: Dictionary of file paths
65+
- `rcp_scenario`: RCP scenario to filter by
66+
- `VARIABLE_DATA_PATH`: Base path for variable data
67+
68+
**Outputs**:
69+
- Dictionary of filtered pandas DataFrames (scenario column dropped)
70+
71+
**Functions it uses**: `read_parquet_with_integer_ids()`
72+
73+
**Functions that use it**: Not directly called by other functions in this module
74+
75+
---
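The filtering step of `read_income_paths()` can be sketched on in-memory DataFrames, leaving the file reading aside. The helper name, the column name `scenario`, and the example values are assumptions; the summary only states that rows are filtered by RCP scenario and that the scenario column is dropped.

```python
import pandas as pd


def filter_by_scenario(dfs: dict, rcp_scenario) -> dict:
    """Sketch: keep rows for one scenario and drop the scenario column."""
    filtered = {}
    for key, df in dfs.items():
        # Assumption: the scenario is stored in a column named "scenario".
        subset = df[df["scenario"] == rcp_scenario]
        filtered[key] = subset.drop(columns="scenario").reset_index(drop=True)
    return filtered


# Made-up data, for illustration only.
dfs = {
    "gdppc": pd.DataFrame(
        {
            "location_id": [1, 1],
            "year_id": [2030, 2030],
            "scenario": [4.5, 8.5],
            "gdppc": [1200.0, 1100.0],
        }
    )
}
filtered = filter_by_scenario(dfs, rcp_scenario=4.5)
```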
76+
77+
### `read_urban_paths(urban_paths, VARIABLE_DATA_PATH)`
78+
**Purpose**: Reads multiple urban data files and standardizes column names.
79+
80+
**Inputs**:
81+
- `urban_paths`: Dictionary of file paths
82+
- `VARIABLE_DATA_PATH`: Base path for variable data
83+
84+
**Outputs**:
85+
- Dictionary of processed pandas DataFrames with standardized column names
86+
87+
**Functions it uses**: None (uses pandas read_parquet)
88+
89+
**Functions that use it**: Not directly called by other functions in this module
90+
91+
---
92+
93+
## Data Type and I/O Utility Functions
94+
95+
### `ensure_id_columns_are_integers(df)`
96+
**Purpose**: Converts columns ending with '_id' to integer type.
97+
98+
**Inputs**:
99+
- `df`: pandas DataFrame
100+
101+
**Outputs**:
102+
- DataFrame with ID columns converted to integers
103+
104+
**Functions it uses**: None (uses pandas type operations)
105+
106+
**Functions that use it**: `read_parquet_with_integer_ids()`
107+
108+
---
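A conversion like the one described for `ensure_id_columns_are_integers()` is useful because pandas stores an integer column as float once it has held missing values. The sketch below is one plausible way to do it; it assumes the ID columns contain no missing values at the time of the cast, and the function name is chosen for this example.

```python
import pandas as pd


def ensure_id_columns_are_integers_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: cast every column whose name ends in '_id' to int."""
    out = df.copy()
    for col in out.columns:
        if col.endswith("_id"):
            # Assumption: no missing values; a plain int cast raises on NaN.
            out[col] = out[col].astype(int)
    return out


# Made-up data with float-typed ID columns, for illustration only.
df = pd.DataFrame(
    {"location_id": [1.0, 2.0], "year_id": [2020.0, 2021.0], "rate": [0.5, 0.7]}
)
fixed = ensure_id_columns_are_integers_sketch(df)
```

Only the `_id` columns change type; `rate` stays a float column.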
109+
110+
### `read_parquet_with_integer_ids(path, **kwargs)`
111+
**Purpose**: Reads a parquet file and ensures ID columns are integers.
112+
113+
**Inputs**:
114+
- `path`: File path to parquet file
115+
- `**kwargs`: Additional arguments for pd.read_parquet
116+
117+
**Outputs**:
118+
- pandas DataFrame with integer ID columns
119+
120+
**Functions it uses**: `ensure_id_columns_are_integers()`
121+
122+
**Functions that use it**: `read_income_paths()`
123+
124+
---
125+
126+
### `write_parquet(df, filepath, max_retries=3, compression='snappy', index=False, **kwargs)`
127+
**Purpose**: Writes parquet files with validation and retry logic for robustness.
128+
129+
**Inputs**:
130+
- `df`: pandas DataFrame to write
131+
- `filepath`: Destination file path
132+
- `max_retries`: Number of retry attempts (default: 3)
133+
- `compression`: Compression method (default: 'snappy')
134+
- `index`: Whether to include index (default: False)
135+
- `**kwargs`: Additional arguments for to_parquet
136+
137+
**Outputs**:
138+
- Boolean indicating success/failure
139+
140+
**Functions it uses**: None (uses pandas and os operations)
141+
142+
**Functions that use it**:
143+
- `rake_aa_count_lsae_to_gbd()`
144+
- `make_aa_rate_variable()`
145+
- `aggregate_aa_count_lsae_to_gbd()`
146+
- `make_full_aa_rate_df_from_aa_count_df()`
147+
148+
---
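The summary does not say how `write_parquet()` validates a write or spaces its retries. One common pattern, shown here purely as a sketch, is to write to a temporary file, read it back, compare row counts, and only then move it into place. The `.tmp` suffix, the row-count check, and the pause between attempts are all assumptions.

```python
import os
import time

import pandas as pd


def write_parquet_sketch(
    df, filepath, max_retries=3, compression="snappy", index=False, **kwargs
):
    """Sketch: write to a temp file, validate by reading back, then rename."""
    temp_path = f"{filepath}.tmp"  # hypothetical naming scheme
    for attempt in range(1, max_retries + 1):
        try:
            df.to_parquet(temp_path, compression=compression, index=index, **kwargs)
            # Assumed validation: the file is readable and keeps its row count.
            check = pd.read_parquet(temp_path)
            if len(check) != len(df):
                raise ValueError("row count mismatch after write")
            os.replace(temp_path, filepath)  # move the validated file into place
            return True
        except Exception as exc:
            print(f"write attempt {attempt}/{max_retries} failed: {exc}")
            if os.path.exists(temp_path):
                os.remove(temp_path)
            time.sleep(0.1 * attempt)  # short, growing pause before retrying
    return False
```

Returning a boolean instead of raising matches the documented output and lets a caller decide whether a failed write should stop the pipeline.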
149+
150+
## Raking Functions
151+
152+
### `prep_df(df, hierarchy_df)`
153+
**Purpose**: Prepares DataFrame by adding level column and removing parent_id if present.
154+
155+
**Inputs**:
156+
- `df`: pandas DataFrame to prepare
157+
- `hierarchy_df`: Hierarchy DataFrame containing location_id and level mappings
158+
159+
**Outputs**:
160+
- Prepared DataFrame with level column added and parent_id removed
161+
162+
**Functions it uses**: None (uses pandas merge and drop)
163+
164+
**Functions that use it**:
165+
- `rake_aa_count_lsae_to_gbd()`
166+
- `aggregate_aa_count_lsae_to_gbd()`
167+
- `aggregate_aa_rate_lsae_to_gbd()`
168+
169+
---
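A minimal sketch of the preparation described for `prep_df()`, assuming the hierarchy DataFrame has columns named `location_id` and `level` and that a left join is used. The function name and the example data are made up.

```python
import pandas as pd


def prep_df_sketch(df: pd.DataFrame, hierarchy_df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: attach each location's hierarchy level and drop parent_id."""
    out = df.merge(
        hierarchy_df[["location_id", "level"]], on="location_id", how="left"
    )
    if "parent_id" in out.columns:
        out = out.drop(columns="parent_id")
    return out


# Made-up hierarchy: location 1 is the parent of locations 10 and 11.
hierarchy_df = pd.DataFrame(
    {"location_id": [1, 10, 11], "parent_id": [0, 1, 1], "level": [3, 4, 4]}
)
df = pd.DataFrame(
    {
        "location_id": [10, 11],
        "parent_id": [1, 1],
        "year_id": [2020, 2020],
        "cases": [30, 10],
    }
)
prepped = prep_df_sketch(df, hierarchy_df)
```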
170+
171+
### `rake_level(count_variable, level_df, level_m1_df, hierarchy_df)`
172+
**Purpose**: Rakes (adjusts) data at one level to match aggregated totals from the next higher level.
173+
174+
**Inputs**:
175+
- `count_variable`: Name of the count variable to rake
176+
- `level_df`: DataFrame for current level
177+
- `level_m1_df`: DataFrame for the level above (level minus 1)
178+
- `hierarchy_df`: Hierarchy DataFrame
179+
180+
**Outputs**:
181+
- DataFrame with raked values that sum to the higher level totals
182+
183+
**Functions it uses**: None (uses pandas operations)
184+
185+
**Functions that use it**: `rake_aa_count_lsae_to_gbd()`
186+
187+
---
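The raking step described for `rake_level()` can be illustrated with proportional scaling: each child count is multiplied by the parent's count divided by the sum of the child counts, so the children add up to the parent. The sketch below assumes this proportional form, a `parent_id` column in the hierarchy, and non-zero child totals; none of these details are confirmed by the summary.

```python
import pandas as pd


def rake_level_sketch(count_variable, level_df, level_m1_df, hierarchy_df):
    """Sketch: scale child counts so they sum to their parent's count."""
    # Attach each child's parent_id from the hierarchy.
    children = level_df.merge(
        hierarchy_df[["location_id", "parent_id"]], on="location_id", how="left"
    )
    # Sum of unraked child counts per parent and year.
    children["child_total"] = children.groupby(["parent_id", "year_id"])[
        count_variable
    ].transform("sum")
    # Bring in the parent's count as the target total.
    parents = level_m1_df.rename(
        columns={"location_id": "parent_id", count_variable: "parent_total"}
    )[["parent_id", "year_id", "parent_total"]]
    children = children.merge(parents, on=["parent_id", "year_id"], how="left")
    # Raking factor; assumes child_total is non-zero for every parent.
    factor = children["parent_total"] / children["child_total"]
    children[count_variable] = children[count_variable] * factor
    return children.drop(columns=["parent_id", "child_total", "parent_total"])


# Made-up data: children 10 and 11 sum to 40, the parent total is 80.
hierarchy_df = pd.DataFrame({"location_id": [1, 10, 11], "parent_id": [0, 1, 1]})
level_df = pd.DataFrame(
    {"location_id": [10, 11], "year_id": [2020, 2020], "cases": [30, 10]}
)
level_m1_df = pd.DataFrame({"location_id": [1], "year_id": [2020], "cases": [80]})
raked = rake_level_sketch("cases", level_df, level_m1_df, hierarchy_df)
```

In this example the raking factor is 80 / 40 = 2, so the child counts become 60 and 20.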
188+
189+
### `rake_aa_count_lsae_to_gbd(count_variable, hierarchy_df, gbd_aa_count_df, lsae_aa_count_df, full_aa_count_df_path, return_full_df=False)`
190+
**Purpose**: Rakes LSAE age-aggregated count data to match GBD totals across hierarchy levels.
191+
192+
**Inputs**:
193+
- `count_variable`: Name of count variable
194+
- `hierarchy_df`: Hierarchy DataFrame
195+
- `gbd_aa_count_df`: GBD age-aggregated count data
196+
- `lsae_aa_count_df`: LSAE age-aggregated count data
197+
- `full_aa_count_df_path`: Output file path
198+
- `return_full_df`: Whether to return the DataFrame (default: False)
199+
200+
**Outputs**:
201+
- Optionally returns full raked DataFrame if return_full_df=True
202+
203+
**Functions it uses**:
204+
- `prep_df()`
205+
- `rake_level()`
206+
- `write_parquet()`
207+
208+
**Functions that use it**: Not directly called by other functions in this module
209+
210+
---
211+
212+
### `make_aa_rate_variable(count_variable, full_aa_count_df, aa_population_df, full_lsae_aa_rate_df_path, return_full_df=False)`
213+
**Purpose**: Converts age-aggregated count data to rate data using population denominators.
214+
215+
**Inputs**:
216+
- `count_variable`: Name of count variable
217+
- `full_aa_count_df`: Full age-aggregated count DataFrame
218+
- `aa_population_df`: Age-aggregated population DataFrame
219+
- `full_lsae_aa_rate_df_path`: Output file path
220+
- `return_full_df`: Whether to return DataFrame (default: False)
221+
222+
**Outputs**:
223+
- Optionally returns rate DataFrame if return_full_df=True
224+
225+
**Functions it uses**: `write_parquet()`
226+
227+
**Functions that use it**: Not directly called by other functions in this module
228+
229+
---
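The count-to-rate conversion described for `make_aa_rate_variable()` amounts to dividing each count by the matching population. In the sketch below, the population column name (`population`), the explicit `rate_variable` argument, and the function name are assumptions; the documented function takes only `count_variable`, and how it names the rate column is not stated.

```python
import pandas as pd


def counts_to_rates(count_variable, rate_variable, count_df, population_df):
    """Sketch: divide counts by population, matched on location and year."""
    merged = count_df.merge(
        population_df[["location_id", "year_id", "population"]],
        on=["location_id", "year_id"],
        how="left",
    )
    # Assumption: the population column is named "population" and is non-zero.
    merged[rate_variable] = merged[count_variable] / merged["population"]
    return merged.drop(columns=[count_variable, "population"])


# Made-up data, for illustration only.
count_df = pd.DataFrame(
    {"location_id": [10, 11], "year_id": [2020, 2020], "cases": [50, 20]}
)
population_df = pd.DataFrame(
    {"location_id": [10, 11], "year_id": [2020, 2020], "population": [1000, 400]}
)
rate_df = counts_to_rates("cases", "case_rate", count_df, population_df)
```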
230+
231+
## Aggregation Functions
232+
233+
### `aggregate_level(count_variable, level_df, hierarchy_df)`
234+
**Purpose**: Aggregates count data from one hierarchy level to the next higher level.
235+
236+
**Inputs**:
237+
- `count_variable`: Name of count variable to aggregate
238+
- `level_df`: DataFrame for current level
239+
- `hierarchy_df`: Hierarchy DataFrame
240+
241+
**Outputs**:
242+
- DataFrame with aggregated counts at the parent level
243+
244+
**Functions it uses**: None (uses pandas operations)
245+
246+
**Functions that use it**: `aggregate_aa_count_lsae_to_gbd()`
247+
248+
---
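One level of the aggregation described for `aggregate_level()` can be sketched as a group-by over each location's parent. The sketch assumes the hierarchy DataFrame carries a `parent_id` column and that counts are summed per parent and year; the function name and data are made up.

```python
import pandas as pd


def aggregate_level_sketch(count_variable, level_df, hierarchy_df):
    """Sketch: sum child counts into their parent location, per year."""
    children = level_df.merge(
        hierarchy_df[["location_id", "parent_id"]], on="location_id", how="left"
    )
    parents = (
        children.groupby(["parent_id", "year_id"], as_index=False)[count_variable]
        .sum()
        .rename(columns={"parent_id": "location_id"})
    )
    return parents


# Made-up data: locations 10 and 11 roll up to 1, location 12 rolls up to 2.
hierarchy_df = pd.DataFrame(
    {"location_id": [10, 11, 12], "parent_id": [1, 1, 2]}
)
level_df = pd.DataFrame(
    {"location_id": [10, 11, 12], "year_id": [2020, 2020, 2020], "cases": [30, 10, 5]}
)
parents = aggregate_level_sketch("cases", level_df, hierarchy_df)
```

Applying a step like this repeatedly, from the most detailed level upward, is one way to build totals for every level of the hierarchy.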
249+
250+
### `aggregate_aa_count_lsae_to_gbd(count_variable, hierarchy_df, lsae_aa_count_df, full_aa_count_df_path, return_full_df=False)`
251+
**Purpose**: Aggregates LSAE age-aggregated count data up through all hierarchy levels (5 to 0).
252+
253+
**Inputs**:
254+
- `count_variable`: Name of count variable
255+
- `hierarchy_df`: Hierarchy DataFrame
256+
- `lsae_aa_count_df`: LSAE age-aggregated count data
257+
- `full_aa_count_df_path`: Output file path
258+
- `return_full_df`: Whether to return DataFrame (default: False)
259+
260+
**Outputs**:
261+
- Optionally returns full aggregated DataFrame if return_full_df=True
262+
263+
**Functions it uses**:
264+
- `prep_df()`
265+
- `aggregate_level()`
266+
- `write_parquet()`
267+
268+
**Functions that use it**: `aggregate_aa_rate_lsae_to_gbd()`
269+
270+
---
271+
272+
### `make_full_aa_rate_df_from_aa_count_df(rate_variable, count_variable, full_aa_count_df, aa_population_df, full_aa_rate_df_path=None, return_full_df=False)`
273+
**Purpose**: Converts aggregated count data to rate data using population.
274+
275+
**Inputs**:
276+
- `rate_variable`: Name of rate variable to create
277+
- `count_variable`: Name of count variable
278+
- `full_aa_count_df`: Full age-aggregated count DataFrame
279+
- `aa_population_df`: Age-aggregated population DataFrame
280+
- `full_aa_rate_df_path`: Optional output file path
281+
- `return_full_df`: Whether to return DataFrame (default: False)
282+
283+
**Outputs**:
284+
- Optionally returns rate DataFrame if return_full_df=True
285+
286+
**Functions it uses**: `write_parquet()`
287+
288+
**Functions that use it**: `aggregate_aa_rate_lsae_to_gbd()`
289+
290+
---
291+
292+
### `aggregate_aa_rate_lsae_to_gbd(rate_variable, hierarchy_df, lsae_aa_rate_df, aa_population_df, full_aa_rate_df_path=None, return_full_df=False)`
293+
**Purpose**: Aggregates LSAE age-aggregated rate data by first converting to counts, aggregating, then converting back to rates.
294+
295+
**Inputs**:
296+
- `rate_variable`: Name of rate variable
297+
- `hierarchy_df`: Hierarchy DataFrame
298+
- `lsae_aa_rate_df`: LSAE age-aggregated rate data
299+
- `aa_population_df`: Age-aggregated population DataFrame
300+
- `full_aa_rate_df_path`: Optional output file path
301+
- `return_full_df`: Whether to return DataFrame (default: False)
302+
303+
**Outputs**:
304+
- Optionally returns full aggregated rate DataFrame if return_full_df=True
305+
306+
**Functions it uses**:
307+
- `prep_df()`
308+
- `aggregate_aa_count_lsae_to_gbd()`
309+
- `make_full_aa_rate_df_from_aa_count_df()`
310+
311+
**Functions that use it**: Not directly called by other functions in this module
312+
313+
---
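The rate-to-count-to-rate round trip described for `aggregate_aa_rate_lsae_to_gbd()` matters because rates cannot simply be averaged across locations of different size. The worked example below uses made-up numbers and assumed column names, and it sums child populations to get the parent population, which may differ from how the actual function obtains parent populations from `aa_population_df`.

```python
import pandas as pd

# Two child locations under one parent, with made-up numbers.
children = pd.DataFrame(
    {
        "location_id": [10, 11],
        "parent_id": [1, 1],
        "year_id": [2020, 2020],
        "rate": [0.10, 0.02],
        "population": [1000, 9000],
    }
)

# Step 1: rate -> count.
children["count"] = children["rate"] * children["population"]

# Step 2: aggregate counts (and, by assumption, population) to the parent.
parent = children.groupby(["parent_id", "year_id"], as_index=False)[
    ["count", "population"]
].sum()

# Step 3: count -> rate at the parent level.
parent["rate"] = parent["count"] / parent["population"]

parent_rate = parent["rate"].iloc[0]  # population-weighted rate
naive_mean = children["rate"].mean()  # unweighted mean, for comparison
```

The parent rate works out to 280 / 10000 = 0.028, whereas the unweighted mean of the two child rates would be 0.06, more than twice as high.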
314+
315+
## Function Dependency Tree
316+
317+
```
318+
Configuration Functions:
319+
├── load_yaml_dictionary()
320+
└── parse_yaml_dictionary() → uses load_yaml_dictionary()
321+
322+
Data I/O Functions:
323+
├── ensure_id_columns_are_integers()
324+
├── read_parquet_with_integer_ids() → uses ensure_id_columns_are_integers()
325+
├── read_income_paths() → uses read_parquet_with_integer_ids()
326+
├── read_urban_paths()
327+
├── merge_dataframes()
328+
└── write_parquet() → used by multiple functions
329+
330+
Raking Functions:
331+
├── prep_df() → used by multiple raking/aggregation functions
332+
├── rake_level() → used by rake_aa_count_lsae_to_gbd()
333+
├── rake_aa_count_lsae_to_gbd() → uses prep_df(), rake_level(), write_parquet()
334+
└── make_aa_rate_variable() → uses write_parquet()
335+
336+
Aggregation Functions:
337+
├── aggregate_level() → used by aggregate_aa_count_lsae_to_gbd()
338+
├── aggregate_aa_count_lsae_to_gbd() → uses prep_df(), aggregate_level(), write_parquet()
339+
├── make_full_aa_rate_df_from_aa_count_df() → uses write_parquet()
340+
└── aggregate_aa_rate_lsae_to_gbd() → uses prep_df(), aggregate_aa_count_lsae_to_gbd(), make_full_aa_rate_df_from_aa_count_df()
341+
```
