full scale test missingness #501

Merged
hussain-jafari merged 6 commits into epic/full_scale_testing from hjafari/testing/MIC-5515_test_missingness
Mar 31, 2025

@hussain-jafari
Contributor

full scale test missingness

Description

  • Category: feature
  • JIRA issue: MIC-5515

Add full scale test_dataset_missingness.
Typing.
Add default values to overload functions for generating data with dask.

Testing

Ran tests on acs, cps, and wic for RI and USA.

Collaborator

From https://mypy.readthedocs.io/en/stable/more_types.html#function-overloading:

The default values of a function’s arguments don’t affect its signature – only the absence or presence of a default value does. So in order to reduce redundancy, it’s possible to replace default values in overload definitions with ... as a placeholder

Collaborator

Good to know. I guess we should replace the defaults in all of our overloads with ... then
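
As a rough illustration of the mypy guidance quoted above, here is a minimal sketch of overloads that use `...` as the default placeholder. The function name `generate_rows`, its parameters, and its return types are hypothetical stand-ins, not the project's actual data-generation functions.

```python
from typing import List, Literal, Tuple, Union, overload


# Overload stubs: `...` only says "this parameter has a default";
# the actual default value is not repeated here.
@overload
def generate_rows(engine: Literal["pandas"] = ..., verbose: bool = ...) -> List[int]: ...
@overload
def generate_rows(engine: Literal["dask"], verbose: bool = ...) -> Tuple[List[int], ...]: ...


# The implementation is the only place where the real defaults live.
def generate_rows(
    engine: str = "pandas", verbose: bool = False
) -> Union[List[int], Tuple[List[int], ...]]:
    rows = [1, 2, 3, 4]
    if verbose:
        print(f"generating with engine={engine}")
    if engine == "dask":
        # Hypothetical stand-in for a partitioned (dask-like) result
        return (rows[:2], rows[2:])
    return rows
```

With this layout, changing a default only requires touching the implementation signature, since the stubs never restate the value.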

kwargs["state"] = state
unnoised_data = dataset_func(**kwargs)

# We must manually clean the data for noising since we are recreating our main noising loop

Contributor Author

Longer explanation in one comment at top

config = get_configuration()

# NOTE: This is recreating Dataset._noise_dataset but adding assertions for missingness
for noise_type in NOISE_TYPES:

Contributor Author

Do we need to test missingness for ALL noise types?

Contributor Author

Will be resolved by putting missingness test in refactor loop

if isinstance(noise_type, RowNoiseType):
    if config.has_noise_type(dataset.dataset_schema.name, noise_type.name):
        noise_type(dataset, config)
        # Check missingness is synced with data

Contributor Author

"Check that dataset.missingness was updated correctly by noising function to match noised data"
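
One way to picture the check described in that comment is the sketch below. It assumes, purely for illustration, that the missingness record is a boolean DataFrame with the same shape as the data and that "missing" means NaN. The helper `assert_missingness_synced` and the toy noising step are hypothetical and do not reflect the actual `Dataset` internals.

```python
import numpy as np
import pandas as pd


def assert_missingness_synced(data: pd.DataFrame, missingness: pd.DataFrame) -> None:
    """Hypothetical helper: fail if the tracked mask disagrees with the data."""
    # Assumption: a cell counts as missing exactly when it is NaN.
    pd.testing.assert_frame_equal(missingness, data.isna())


data = pd.DataFrame({"age": ["5", "17"], "first_name": ["Ann", "Bo"]})
missingness = data.isna()

# Toy noising step: blank out one cell and update the mask alongside it.
data.loc[0, "first_name"] = np.nan
missingness.loc[0, "first_name"] = True

# Passes because the mask was updated together with the data.
assert_missingness_synced(data, missingness)
```

If the noising step had changed the data without updating the mask, the helper would raise an `AssertionError`.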

  # Get dataframe for each dependent group to merge with guardians
  in_households_under_18 = dataset.data.loc[
-     (dataset.data["age"] < 18)
+     (dataset.data["age"].astype(int) < 18)

Contributor

So cols are all strings, right? How was this ever working?

Contributor Author

This column's dtype is int when it's first read in and while noising during data generation, but str/object in our tests, which noise post-processed unnoised data.
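
A small sketch of the dtype issue, using made-up ages rather than real dataset values: comparing an integer column with `18` works directly, while an object column of strings needs the `.astype(int)` cast first. The variable names here are illustrative only.

```python
import pandas as pd

# Integer ages, as described for data generation.
ages_as_int = pd.Series([5, 17, 30])
# String ages stored as object dtype, as described for the tests.
ages_as_str = pd.Series(["5", "17", "30"], dtype=object)

under_18_int = ages_as_int < 18

# Comparing strings with an int is expected to fail rather than coerce.
try:
    ages_as_str < 18
    str_comparison_failed = False
except TypeError:
    str_comparison_failed = True

# Casting first makes the same mask work for both representations.
under_18_cast = ages_as_str.astype(int) < 18
```

Note that `.astype(int)` would itself raise if the column held non-numeric strings or missing values, so this cast assumes the age column is clean at that point.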

@hussain-jafari hussain-jafari merged commit fabc791 into epic/full_scale_testing Mar 31, 2025
8 checks passed
@hussain-jafari hussain-jafari deleted the hjafari/testing/MIC-5515_test_missingness branch March 31, 2025 18:52
hussain-jafari added a commit that referenced this pull request May 7, 2025
hussain-jafari added a commit that referenced this pull request May 7, 2025
hussain-jafari added a commit that referenced this pull request Jul 24, 2025