Hjafari/feature/mic 5517 guardian test#498

Merged
hussain-jafari merged 14 commits into epic/full_scale_testing from hjafari/feature/MIC-5517_guardian_test
Mar 25, 2025
@hussain-jafari (Contributor) commented Mar 17, 2025

Full-scale guardian duplication test

Description

Expand duplicate guardian test to full scale.

Testing

Ran the test on the census dataset with RI and US data.


        self.data = data

    def keep_schema_columns(self, data, dataset_schema) -> pd.DataFrame:
Contributor Author

Rename and change to static method
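
A minimal sketch of what the static-method version might look like. The new name `drop_non_schema_columns` and the plain list of column names are hypothetical; the real method takes a `dataset_schema` object whose shape is not shown in this thread.

```python
from typing import Sequence

import pandas as pd


class Dataset:
    """Toy stand-in; the real class holds more state."""

    @staticmethod
    def drop_non_schema_columns(
        data: pd.DataFrame, schema_columns: Sequence[str]
    ) -> pd.DataFrame:
        # Hypothetical name and signature: keep only columns named in the schema.
        return data[[col for col in data.columns if col in schema_columns]]


df = pd.DataFrame({"age": [5], "guardian_1": ["g1"], "scratch": [0]})
trimmed = Dataset.drop_non_schema_columns(df, ["age", "guardian_1"])
```

Because the method no longer needs `self`, it can be called on the class directly, as in the last line.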

) -> None:
    if dataset_name != DatasetNames.CENSUS:
        return

Contributor Author

Add comments about why you're patching
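
As a rough illustration of a patch that carries its own explanation, the sketch below uses `unittest.mock` on a toy class. The stated reason for patching (keeping columns the test needs) is an assumption, not something confirmed in this thread, and the toy `Dataset` and `run_with_all_columns` are hypothetical stand-ins.

```python
from unittest import mock


class Dataset:
    """Toy stand-in for the real class; only the patched method matters."""

    def keep_schema_columns(self, data, dataset_schema):
        return {k: v for k, v in data.items() if k in dataset_schema}


def run_with_all_columns(data, schema):
    # Why patch (assumed): the test needs columns that the schema filter
    # would normally drop, so the filter is replaced with an identity
    # function. The lambda mirrors the one used in the diff.
    with mock.patch.object(
        Dataset, "keep_schema_columns", side_effect=lambda df, _: df
    ):
        return Dataset().keep_schema_columns(data, schema)


unpatched = Dataset().keep_schema_columns({"a": 1, "b": 2}, ["a"])
patched = run_with_all_columns({"a": 1, "b": 2}, ["a"])
```

With pytest-mock, the same comment would sit directly above the `mocker.patch(...)` call.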

    "pseudopeople.dataset.Dataset.keep_schema_columns", side_effect=lambda df, _: df
)
mocker.patch(
    "pseudopeople.configuration.generator.validate_overrides",
Contributor Author

Check whether these validations are being tested

Contributor

Why do you need to patch over these?

Contributor

Does validate_overrides not allow for 0 probabilities?

Contributor Author

For age differences, by default we can't have a non-zero probability of keeping the ages the same.

Contributor

I'd say you should set that noise type to not noise then

# return dataset


# Helper function to format group dataframe and merging with their dependents
Contributor Author

This whole function was copied without changes

@stevebachmeier (Contributor) left a comment

Looks good! I'd like clarification on a few things before I approve

    new_probability = [0.0 for x in probability]
elif isinstance(probability, dict):
    new_probability = {key: 0.0 for key in probability.keys()}
    # NOTE: this will fail default config validations
Contributor

Can you explain this? Why would the key be an integer? I don't really understand your note, either.

Contributor Author

This structure is for "possible age differences", where each key (e.g. -2 or 1) is the integer to add to a simulant's actual age and each value is the probability of picking that age difference.
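
To make the shape concrete, here is a small sketch of zeroing out a probability that may be a scalar, a list, or an integer-keyed dict. The helper name `zero_out` and the example values are hypothetical, not taken from the pseudopeople config.

```python
def zero_out(probability):
    """Return a copy of `probability` with every probability set to 0.0."""
    if isinstance(probability, (int, float)):
        return 0.0
    if isinstance(probability, list):
        return [0.0 for _ in probability]
    if isinstance(probability, dict):
        # Keys such as -2 or 1 are age offsets; only the values change.
        return {key: 0.0 for key in probability}
    raise TypeError(f"Unsupported probability type: {type(probability)!r}")


# Illustrative values only: offset -> probability of choosing that offset.
possible_age_differences = {-2: 0.1, -1: 0.4, 1: 0.4, 2: 0.1}
zeroed = zero_out(possible_age_differences)
```

An all-zero dict no longer sums to 1, which is presumably why the note in the diff says it fails the default config validations.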

group_data = unnoised.loc[
    (unnoised["age"].astype(int) < age)
    & (unnoised["housing_type"] == housing_type)
    & (unnoised["guardian_1"].notna())
Contributor

because guardian_1 is never nan, right? But guardian_2 might be?

Contributor

Yes, if "guardian_1" is not NA then there is at least one guardian, but not necessarily a guardian_2.
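
A toy example of the filter, to show that requiring `guardian_1` to be present still lets rows with a missing `guardian_2` through. The data, the age cutoff, and the housing type value are made up for illustration.

```python
import pandas as pd

# Made-up data: ages are strings, as the astype(int) in the diff suggests.
unnoised = pd.DataFrame(
    {
        "age": ["5", "12", "30", "7"],
        "housing_type": ["Household", "Household", "Household", "College"],
        "guardian_1": ["g1", None, None, "g3"],
        "guardian_2": [None, None, None, "g4"],
    }
)
age, housing_type = 18, "Household"

group_data = unnoised.loc[
    (unnoised["age"].astype(int) < age)
    & (unnoised["housing_type"] == housing_type)
    & (unnoised["guardian_1"].notna())
]
```

Only the first row survives: it is under the age cutoff, in the right housing type, and has a `guardian_1`, even though its `guardian_2` is missing.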

if index_to_copy.empty:
    continue
noised_group_df = group_df.loc[index_to_copy]
noised_group_df["old_housing_type"] = noised_group_df["housing_type"]
Contributor

I don't love adding this column in the "real" data when it's only used for testing. But it also seems too big of a pain to refactor this somehow so that you can make a test fixture of the call and add the col there.

Instead, couldn't you get the "old_housing_type" from the unnoised_data in the tests?

Contributor

+1. If this is just for testing, you could save a copy of the old_housing_type series and then map it to verify your test.
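
One way the suggestion could look inside a test, sketched with toy frames. It assumes the duplicated rows keep the index labels of the rows they were copied from, which this thread does not confirm; the frame names and values are illustrative.

```python
import pandas as pd

# Toy unnoised data, indexed by hypothetical row labels.
unnoised = pd.DataFrame(
    {"housing_type": ["Household", "College", "Household"]},
    index=[10, 11, 12],
)

# Toy duplicated rows: same labels as their source rows (assumption),
# with the housing type changed by the duplication step.
duplicated = pd.DataFrame(
    {"housing_type": ["Household", "College"]},
    index=[11, 12],
)

# Look up the original housing type in the test instead of storing it
# as an extra column on the real data.
old_housing_type = unnoised.loc[duplicated.index, "housing_type"]
changed = old_housing_type != duplicated["housing_type"]
```

This keeps `old_housing_type` out of the production dataframe, at the cost of relying on the index to link duplicated rows back to their source.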

@hussain-jafari hussain-jafari merged commit d3faa25 into epic/full_scale_testing Mar 25, 2025
8 checks passed
@hussain-jafari hussain-jafari deleted the hjafari/feature/MIC-5517_guardian_test branch March 25, 2025 16:59
hussain-jafari added a commit that referenced this pull request May 7, 2025
Category: test
JIRA issue: MIC-5517
Expand duplicate guardian test to full scale.

Testing
Ran test on census on RI and US data.
hussain-jafari added a commit that referenced this pull request Jul 24, 2025
Category: test
JIRA issue: MIC-5517
Expand duplicate guardian test to full scale.

Testing
Ran test on census on RI and US data.