sequential row noising #509
Conversation

zmbc left a comment

This is really cool stuff! I had a few things I wanted to check about target_proportions, and then a few more nitpicky comments.
src/pseudopeople/noise_functions.py (Outdated)

    from typing import TYPE_CHECKING

    import dask.dataframe as dd

We don't want to import this here without a try/except or an if TYPE_CHECKING, as it will make dask a required dependency.
    if str(source) == RI_FILEPATH and has_small_shards:
        dataset_data = [pd.concat(dataset_data).reset_index()]

Are we doing this when actually noising using generate_X? If not, I'm concerned that we are no longer testing the right thing.
    assert set(noised_data.columns) == set(original_data.columns)
    assert (noised_data.dtypes == original_data.dtypes).all()
    for noise_type in NOISE_TYPES:
        prenoised_dataframes = [dataset.data.copy() for dataset in datasets]

Suggested change:

    - prenoised_dataframes = [dataset.data.copy() for dataset in datasets]
    + pre_noise_dataframes = [dataset.data.copy() for dataset in datasets]

Perhaps?

Interesting... my gut reaction was that I definitely want there to be a "d" at the end. But the more I sit with it, the less I care.
    prenoised_dataframes = [dataset.data.copy() for dataset in datasets]
    if isinstance(noise_type, RowNoiseType):
        if config.has_noise_type(dataset_schema.name, noise_type.name):
            [noise_type(dataset, config) for dataset in datasets]

It's a matter of taste, but I'd prefer a one-line for loop to this unassigned list comprehension expression.

I agree; I got really confused as to why this list wasn't being saved as anything...
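The difference being discussed, as a small illustrative sketch (the noise function here is a hypothetical stand-in for noise_type):

```python
def add_noise(dataset: dict, config: dict) -> None:
    # Hypothetical stand-in for noise_type(dataset, config): mutates in place.
    dataset["noised"] = True


def noise_all(datasets: list[dict], config: dict) -> None:
    # Instead of the unassigned comprehension
    #     [add_noise(d, config) for d in datasets]
    # a plain for loop makes the side effect explicit and avoids
    # building a throwaway list of None values.
    for dataset in datasets:
        add_noise(dataset, config)


datasets = [{"name": "acs"}, {"name": "cps"}]
noise_all(datasets, {})
```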
    name="test_do_not_respond",
    observed_numerator=numerator,
    observed_denominator=denominator,
    # 3% uncertainty on either side

I can't find where this was written right now, but I have a hunch this was meant to be percentage points, i.e. expected_noise - 0.03, expected_noise + 0.03.
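The two readings of "3% uncertainty" give quite different bounds, e.g. for an expected noise level of 0.10 (toy numbers, not from the test suite):

```python
expected_noise = 0.10

# Relative: 3% of the expected value on either side (what the code does)
relative_bounds = (expected_noise * 0.97, expected_noise * 1.03)  # ~(0.097, 0.103)

# Absolute: 3 percentage points on either side (what the comment may have meant)
absolute_bounds = (expected_noise - 0.03, expected_noise + 0.03)  # ~(0.07, 0.13)
```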
    name="test_omit_row",
    observed_numerator=numerator,
    observed_denominator=denominator,
    # 3% uncertainty on either side

I wouldn't expect this to have uncertainty.
    observed_numerator=numerators[probability_name],
    observed_denominator=denominators[probability_name],
    target_proportion=expected_noise,
    name_additional=f"noised_data",

Nit: don't need the f-string.
tests/unit/test_interface.py (Outdated)

    available_memory = psutil.virtual_memory().total / (1024**3)
    assert np.isclose(
        sum(worker["memory_limit"] / 1024**3 for worker in workers.values()),
        available_memory,
        rtol=0.01,
    )

Is this checking we were close to an expected memory that was profiled or something?

It's checking that the memory of the dask cluster is close to what we expected when setting up the dask cluster.
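With rtol=0.01, np.isclose accepts a cluster total within roughly 1% of the machine's available memory; a toy check with made-up numbers:

```python
import numpy as np

available_memory = 16.0  # GiB, e.g. psutil.virtual_memory().total / 1024**3
worker_limits = [3.98, 3.99, 4.01, 4.0]  # per-worker memory limits in GiB

cluster_total = sum(worker_limits)  # ~15.98 GiB
# Passes: 15.98 is within 1% of 16.0
assert np.isclose(cluster_total, available_memory, rtol=0.01)
```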
src/pseudopeople/filter.py (Outdated)

        return self.column_name, self.operator, self.value


    def get_generate_data_filters(

Nit: this name sounds awkward. Maybe get_data_filters()?
src/pseudopeople/filter.py (Outdated)

    filters = []

    # add year filter for SSA
    if dataset_schema.name == DatasetNames.SSA:

This is much cleaner than we had before, and has removed a lot of duplicated code, but I think we can do even better!

If we add these attributes to DatasetSchema, we can simplify this code:

- has_state_filter
- has_year_lower_filter
- has_year_upper_filter
- has_exact_year_filter

Then this could be:

    if dataset_schema.has_state_filter and state is not None:
        state_column = cast(str, dataset_schema.state_column_name)
        filters.append(DataFilter(state_column, "==", get_state_abbreviation(state)))
    if year is not None:
        try:
            if dataset_schema.has_year_lower_filter:
                date_lower_filter = DataFilter(
                    dataset_schema.date_column_name,
                    ">=",
                    pd.Timestamp(year=year, month=1, day=1),
                )
                filters.append(date_lower_filter)
            if dataset_schema.has_year_upper_filter:
                date_upper_filter = DataFilter(
                    dataset_schema.date_column_name,
                    "<=",
                    pd.Timestamp(year=year, month=12, day=31),
                )
                filters.append(date_upper_filter)
        except (pd.errors.OutOfBoundsDatetime, ValueError):
            raise ValueError(f"Invalid year provided: '{year}'")
        if dataset_schema.has_exact_year_filter:
            filters.append(DataFilter(dataset_schema.date_column_name, "==", year))
src/pseudopeople/filter.py (Outdated)

        return self.column_name, self.operator, self.value


    def get_generate_data_filters(

Not sure what conversation led to this encapsulation, but I really like it!

Should we add a unit test against this function? I believe we're testing all of this filtering at an integration level, so it's not precisely necessary, but it would be nice. @rmudambi ?
src/pseudopeople/interface.py (Outdated)

    - if state is not None:
    -     state_column_name = cast(str, DATASET_SCHEMAS.census.state_column_name)
    -     filters.append(DataFilter(state_column_name, "==", get_state_abbreviation(state)))
    + filters: list[DataFilter] = get_generate_data_filters(DATASET_SCHEMAS.census, year, state)

Now that there's a discrete filter-generator function, how reasonable would it be to just generate the filters inside of _generate_dataset? That would be slightly DRYer, and then the filter stuff could be even more encapsulated.
    from pathlib import Path
    from typing import Any, Literal

    import dask.dataframe as dd

Same here as Zeb said elsewhere: let's not make dask a required dependency.
    source = paths.SAMPLE_DATA_ROOT
    elif isinstance(source, str) or isinstance(source, Path):
        source = Path(source)
    validate_source_compatibility(source, dataset_schema)

Nit: can you add a docstring to validate_source_compatibility? I know it's not a new thing in this PR, but it would have been helpful when reviewing.
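For instance, a docstring along these lines (entirely hypothetical; the real function's behavior should be described from its actual implementation):

```python
from pathlib import Path


def validate_source_compatibility(source: Path, dataset_schema: object) -> None:
    """Validate that the data at ``source`` is usable with ``dataset_schema``.

    Raises
    ------
    FileNotFoundError
        If ``source`` does not exist.
    """
    # Toy body for illustration only.
    if not source.exists():
        raise FileNotFoundError(f"No such source directory: {source}")
```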
    dataset = Dataset(dataset_schema, original_data, SEED)
    NOISE_TYPES.omit_row(dataset, config)
    noised_data = dataset.data
    has_small_shards = dataset_name == "acs" or dataset_name == "cps" or dataset_name == "wic"

Add a comment explaining what you're doing here; I wasn't expecting this.
    # 3% uncertainty on either side
    target_proportion=(expected_noise * 0.97, expected_noise * 1.03),
    name_additional=f"noised_data",
    run_column_noising_tests(

Why is there column noising stuff in this row-noising PR? Should I be reviewing all this?

No, you can ignore this.
    @@ -0,0 +1,222 @@
    from __future__ import annotations

    import dask.dataframe as dd

Same here; don't import dask globally.
    from pseudopeople.schema_entities import DATASET_SCHEMAS


    def run_do_not_respond_tests(

Are these all exact copy/pastes from the previous tests?
tests/unit/test_interface.py (Outdated)

    client = get_client()
    client.shutdown()  # type: ignore [no-untyped-call]
    time.sleep(30)

Remove this; it's not the source of the bug from before (and we'd never want to sleep 30 seconds in a unit test, anyway).

Actually, this will all change when you rebase on the epic branch.
tests/unit/test_interface.py
Outdated
| assert np.isclose(sum(worker["memory_limit"] / 1024**3 for worker in workers.values()), available_memory, rtol=0.01) | ||
| available_memory = psutil.virtual_memory().total / (1024**3) | ||
| assert np.isclose( | ||
| sum(worker["memory_limit"] / 1024**3 for worker in workers.values()), |
There was a problem hiding this comment.
It's checking that the memory of the dask cluster is close to what we expected when setting up the dask cluster
    datasets = [Dataset(dataset_schema, data, SEED) for data in dataset_data]
    seed = SEED
    if dataset_schema.name != DatasetNames.CENSUS and year is not None:

Why not do this for census?

I don't know why, but the generate_decennial_census function doesn't update the seed (unlike every other generate_data function).
    datasets = [Dataset(dataset_schema, data, SEED) for data in dataset_data]
    seed = SEED
    if dataset_schema.name != DatasetNames.CENSUS and year is not None:
        seed = seed * 10_000 + year

I wouldn't change it, but can't you also just do seed += year?

That would work for this particular case, but I wanted to make it obvious that this code was taken from the generate_data functions.

I'm not sure it matters in these tests, but this 10_000 is actually important, quite hacky, and needs (at least!) a comment to explain it.

It's there because we don't want different years of the same dataset to get "the same" noise when using the same (e.g. default) seed. By "the same" noise I mean you'd get typos in the same place in the same column, even though the underlying data is different. This is similar to what we also do with different seeds per shard:

pseudopeople/src/pseudopeople/interface.py, lines 97 to 102 in a4cdfcf

So how does 10_000 get involved? If we simply did seed += year, there could be collisions. For example, the user passing in seed=1234, year=2019 would get the same result as if they had passed seed=1233, year=2020. We'd lose the property that any difference in inputs gives a totally different seed. With the * 10_000, as long as years are always less than 10,000, such collisions cannot occur, and the desired property is restored.

seed = hash(f"{seed}_{year}") might be a clearer way to achieve the same.
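The collision argument above is easy to verify directly (a toy sketch of the two seed-combining schemes, not the library's code):

```python
def combine_additive(seed: int, year: int) -> int:
    # seed += year: different (seed, year) pairs can collide.
    return seed + year


def combine_scaled(seed: int, year: int) -> int:
    # seed * 10_000 + year: injective as long as 0 <= year < 10_000,
    # so distinct (seed, year) pairs always yield distinct seeds.
    return seed * 10_000 + year


# The collision described above:
assert combine_additive(1234, 2019) == combine_additive(1233, 2020)

# No collision under the scaled scheme:
assert combine_scaled(1234, 2019) != combine_scaled(1233, 2020)
```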
    - [noise_type(dataset, config) for dataset in datasets]
    + for dataset in datasets:
    +     # noise datasets in place
    +     noise_type(dataset, config)
Category: feature
JIRA issue: MIC-5982
sequential row noising
Description
Add tests for sequential row noising.
Read in data using load_standard_dataset and noise by looping through noise types, all shard-wise except RI with ACS, CPS, and WIC.
Testing
Ran the new tests with ACS, CPS, and WIC on USA and RI, and SSA on RI.