Hjafari/feature/mic 5517 guardian test#498
Hjafari/feature/mic 5517 guardian test#498hussain-jafari merged 14 commits intoepic/full_scale_testingfrom
Conversation
src/pseudopeople/dataset.py
Outdated
|
|
||
| self.data = data | ||
|
|
||
| def keep_schema_columns(self, data, dataset_schema) -> pd.DataFrame: |
There was a problem hiding this comment.
Rename and change to static method
| ) -> None: | ||
| if dataset_name != DatasetNames.CENSUS: | ||
| return | ||
|
|
There was a problem hiding this comment.
Add comments about why you're patching
| "pseudopeople.dataset.Dataset.keep_schema_columns", side_effect=lambda df, _: df | ||
| ) | ||
| mocker.patch( | ||
| "pseudopeople.configuration.generator.validate_overrides", |
There was a problem hiding this comment.
Check whether these validations are being tested
There was a problem hiding this comment.
Why do you need to patch over these?
There was a problem hiding this comment.
Does validate_overrides not allow for 0 probabilities?
There was a problem hiding this comment.
For age differences, by default we can't have a non-zero probability of keeping the ages the same.
There was a problem hiding this comment.
I'd say you should set that noise type to not noise then
| # return dataset | ||
|
|
||
|
|
||
| # Helper function to format group dataframe and merging with their dependents |
There was a problem hiding this comment.
This whole function was copied without changes
stevebachmeier
left a comment
There was a problem hiding this comment.
Looks good! I'd like clarification on a few things before I approve
| new_probability = [0.0 for x in probability] | ||
| elif isinstance(probability, dict): | ||
| new_probability = {key: 0.0 for key in probability.keys()} | ||
| # NOTE: this will fail default config validations |
There was a problem hiding this comment.
Can you explain this? Why would the key be an integer? I don't really undwerstand your note, either.
There was a problem hiding this comment.
This structure is for "possible age differences" where the keys can be -2 or 1 to indicate what int to add to a simulant's actual age and the value is the probability of picking each age difference.
| "pseudopeople.dataset.Dataset.keep_schema_columns", side_effect=lambda df, _: df | ||
| ) | ||
| mocker.patch( | ||
| "pseudopeople.configuration.generator.validate_overrides", |
There was a problem hiding this comment.
Does validate_overrides not allow for 0 probabilities?
| group_data = unnoised.loc[ | ||
| (unnoised["age"].astype(int) < age) | ||
| & (unnoised["housing_type"] == housing_type) | ||
| & (unnoised["guardian_1"].notna()) |
There was a problem hiding this comment.
because guardian_1 is never nan, right? But guardian_2 might be?
There was a problem hiding this comment.
Yes if "guardian_1" is notna then there is at least one guardian but not necessarily guardian 2
src/pseudopeople/noise_functions.py
Outdated
| if index_to_copy.empty: | ||
| continue | ||
| noised_group_df = group_df.loc[index_to_copy] | ||
| noised_group_df["old_housing_type"] = noised_group_df["housing_type"] |
There was a problem hiding this comment.
I don't love adding this column in the "real" data when it's only used for testing. But it also seems too big of a pain to refactor this somehow so that you can make a test fixture of the call and add the col there.
Instead, couldn't you get the "old_housing_type" from the unnoised_data in the tests?
There was a problem hiding this comment.
+1 if this is just for testing you could save a copy of the old_housing_type series and then map that to verify your test or something.
Category: test JIRA issue: MIC-5517 Expand duplicate guardian test to full scale. Testing Ran test on census on RI and US data.
Category: test JIRA issue: MIC-5517 Expand duplicate guardian test to full scale. Testing Ran test on census on RI and US data.
Category: test JIRA issue: MIC-5517 Expand duplicate guardian test to full scale. Testing Ran test on census on RI and US data.
full scale guardian duplication test
Description
Expand duplicate guardian test to full scale.
Testing
Ran test on census on RI and US data.