Skip to content

Small bug fixes for getting LBWSG metadata#9

Open
zmbc wants to merge 2 commits intoihmeuw:mainfrom
zmbc:zeb_bug_fixes
Open

Small bug fixes for getting LBWSG metadata#9
zmbc wants to merge 2 commits intoihmeuw:mainfrom
zmbc:zeb_bug_fixes

Conversation

@zmbc
Copy link

@zmbc zmbc commented Oct 24, 2025

  • Make the file work if get_draws is not installed
  • Don't require MEIDs when using gbd_mapping source

elif source == 'get_ids':
cats = get_category_to_meid_map(lbwsg_exposure_df, validate=validate)
else:
cats = lbwsg_exposure_df.reset_index()[['parameter']].drop_duplicates().set_index('parameter')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is the gbd_mapping case (hmm, should I have made that more explicit...?)? Maybe add a comment here to explain what this is doing and why it works? Will the result of this be an empty DataFrame with the categories in the index?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now that I spent a little more time looking over this function, I think I'm remembering what I intended here (I wish I had documented it better...).

I think my intention was that the cat_df dataframe would always contain the MEID, in the spirit of recording as much information as possible about the category so you could easily look it up later with the GBD shared functions. Since gbd_mapping does not include the MEID, the MEID needs to come from somewhere else, either the hard-coded values in the module, or an exposure dataframe from GBD (which is where the hard-coded values came from in the first place). The source parameter was then a way to map MEIDs or category names to category descriptions, namely get_ids, gbd_mapping, or a user-provided dictionary.

In particular, if I'm understanding correctly, I think the new logic is confusing things a bit (again, my fault for not documenting anything). Namely, I don't think it's exactly correct to change the else clause to elif source == 'get_ids', because the code in that block doesn't rely on get_ids at all, it just needs the exposure dataframe.

The PR description says the goal is "Don't require MEIDs when using gbd_mapping source," but I think MEIDs are already not "required" in this case, in the sense that the user doesn't need to pass them in: If the user passes 'gbd_mapping' for source and does not provide an exposure dataframe, then the MEIDs will be read from the hard-coded dictionary instead of a user-supplied source. Were you running into some kind of error with the MEIDs when using the gbd_mapping source?

If the intention is to simply omit the MEIDs when using the gbd_mapping source instead of reading the hard-coded values, then it seems like maybe it would be better to add another boolean flag, include_meids, and adjust the logic accordingly. If you don't want to include MEIDs, and you want to use the GBD mapping source, then you shouldn't need the cats Series/DataFrame at all, since you get all the information you want from gbd_mapping directly. So you could pass in None for lbwsg_exposure_df, and the code could skip reading the meid_map from either the dataframe or the hard-coded dictionary (the final construction of the returned DataFrame would have to be changed, since the point of this code in the gbd_mapping case was to merge in the MEIDs).

Does that all make sense? I haven't tried running this code recently, so I could be misremembering or misunderstanding things and missing the mark.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zmbc Oh, I realized: I bet you tried passing Artifact data in for the LBWSG exposure distribution. That won't work because all the functions in this module were designed to work with get_draws output, not Artifact data, so if you don't have get_draws output available, the correct way to call the function is to pass None. The ability to even pass an exposure dataframe here was a recent addition I made when updating to GBD 2021, so that I could verify the MEIDs were the same. Previously, the only parameter to this function was source, and it always just read the hard-coded MEID dictionary.

So perhaps the new functions signature is a bit confusing, and I've never liked the hard-coding aspect of it. An alternative would be:

get_category_descriptions(
    source='gbd_mapping',
    cat_to_meid_map=None, 
    validate=None,
)

Then calling with the default values would use gbd_mapping to get the category descriptions, bypassing the MEIDs, which would not be included in the result. If you want the MEIDs, you could pass in either a dictionary (e.g., explicitly pass in the hard-coded dict in the module) or Series for cat_to_meid_map, or you could pass in a dataframe that contains both categories and MEIDs as columns, like an exposure dataframe from get_draws. If you switched the source to get_ids, then the cat_to_meid_map would have to be provided since in that case you need to go through MEIDs to get the description. And validate would only be applicable if cat_to_meid_map is not None.

Does that seem like a reasonable approach? Maybe it's a bit complicated, but the goal was to make things more explicit...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NathanielBlairStahn I've updated this PR with something along the lines of what you said, but a bit simpler. I separated out the validation of the LBWSG exposure dataframe from the getting of the metadata. I didn't make the cat_to_meid_map configurable but I made the default gbd_mapping.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copying over my comment from Slack:

Nathaniel Blair-Stahn
Today at 10:14 AM
Hey @Zeb, I reviewed your PR after you made the update and was going to request a change, but I never got around to it due to Alzheimer's deadlines. I forget exactly what I was thinking, but I think the basic idea was that I was going to request restoring the ability to pass in a dataframe in case source='get_ids', to avoid relying on either the hard-coded values or on gbd_mapping, both of which could get out of sync with the latest round of GBD. I may have had some idea how to do this that was simpler than my original proposal, but I forget...
10:18
Oh, maybe my idea was not to pass in a dataframe, but to allow passing in the Series returned by get_category_to_meid_map, which could default to the hardcoded dictionary if we want. (edited)

- Makes gbd_mapping the default source of LBWSG categories
- Separates loading from validation (new, separate validation function)
- Adds some rudimentary docstrings
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments