sequential row noising #509

Merged
hussain-jafari merged 13 commits into epic/full_scale_testing from hjafari/feature/MIC-5885_sequential_row_noising
May 7, 2025

Conversation

@hussain-jafari
Contributor

sequential row noising

Description

  • Category: feature
  • JIRA issue: MIC-5982

Add tests for sequential row noising.
Read in data using load_standard_dataset and noise it by looping through the noise types, processing shard-wise except for RI with ACS, CPS, and WIC.

Testing

Ran the new tests with ACS, CPS, and WIC on USA and RI, and with SSA on RI.

Collaborator

@zmbc zmbc left a comment

This is really cool stuff! I had a few things I wanted to check about target_proportions, and then a few more nitpicky comments.


from typing import TYPE_CHECKING

import dask.dataframe as dd
Collaborator

We don't want to import this here without a try/except or an if TYPE_CHECKING, as it will make dask a required dependency.
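
For reference, a minimal sketch of the guarded-import pattern being asked for here, assuming dask is only needed for type annotations in this module:

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers, so dask stays an optional runtime dependency.
    import dask.dataframe as dd


def load_shards(ddf: dd.DataFrame) -> None:  # hypothetical function, for illustration only
    ...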

Comment on lines 95 to 96
if str(source) == RI_FILEPATH and has_small_shards:
    dataset_data = [pd.concat(dataset_data).reset_index()]
Collaborator

Curious about this!

Collaborator

Are we doing this when actually noising using generate_X? If not, I'm concerned that we are no longer testing the right thing.

assert set(noised_data.columns) == set(original_data.columns)
assert (noised_data.dtypes == original_data.dtypes).all()
for noise_type in NOISE_TYPES:
    prenoised_dataframes = [dataset.data.copy() for dataset in datasets]
Collaborator

Suggested change
prenoised_dataframes = [dataset.data.copy() for dataset in datasets]
pre_noise_dataframes = [dataset.data.copy() for dataset in datasets]

Perhaps?

Contributor

Interesting... my gut reaction was that I definitely want there to be a "d" at the end, but the more I sit with it, the less I care.

prenoised_dataframes = [dataset.data.copy() for dataset in datasets]
if isinstance(noise_type, RowNoiseType):
    if config.has_noise_type(dataset_schema.name, noise_type.name):
        [noise_type(dataset, config) for dataset in datasets]
Collaborator

It's a matter of taste, but I'd prefer a one-line for loop to this unassigned list comprehension expression.

Collaborator

+1

Contributor

I agree; I got really confused as to why this list wasn't being assigned to anything...

name="test_do_not_respond",
observed_numerator=numerator,
observed_denominator=denominator,
# 3% uncertainty on either side
Collaborator

I can't find where this was written right now, but I have a hunch this was meant to be percentage points, i.e. expected_noise - 0.03 and expected_noise + 0.03.
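
A small sketch of the distinction, reusing the names from the quoted test rather than the PR's actual code:

# Relative tolerance: a 3% multiplicative band around the expected value
target_proportion = (expected_noise * 0.97, expected_noise * 1.03)

# Absolute tolerance: 3 percentage points on either side, as suggested above
target_proportion = (expected_noise - 0.03, expected_noise + 0.03)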

name="test_omit_row",
observed_numerator=numerator,
observed_denominator=denominator,
# 3% uncertainty on either side
Collaborator

I wouldn't expect this to have uncertainty

observed_numerator=numerators[probability_name],
observed_denominator=denominators[probability_name],
target_proportion=expected_noise,
name_additional=f"noised_data",
Contributor

Nit: Don't need the f-string.

assert np.isclose(sum(worker["memory_limit"] / 1024**3 for worker in workers.values()), available_memory, rtol=0.01)
available_memory = psutil.virtual_memory().total / (1024**3)
assert np.isclose(
sum(worker["memory_limit"] / 1024**3 for worker in workers.values()),
Contributor

Is this checking we were close to an expected memory that was profiled or something?

Contributor

It's checking that the memory of the dask cluster is close to what we expected when setting up the dask cluster.
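
For context, a minimal sketch of the kind of setup this assertion guards; the cluster arguments below are illustrative assumptions, not the PR's actual values:

import psutil
from dask.distributed import Client, LocalCluster

# By default LocalCluster splits the machine's total memory across its workers,
# so the summed memory_limit over all workers should roughly equal system memory.
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

workers = client.scheduler_info()["workers"]
total_worker_gb = sum(worker["memory_limit"] / 1024**3 for worker in workers.values())
available_gb = psutil.virtual_memory().total / (1024**3)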

return self.column_name, self.operator, self.value


def get_generate_data_filters(
Collaborator

Nit: this name sounds awkward. Maybe get_data_filters()?

filters = []

# add year filter for SSA
if dataset_schema.name == DatasetNames.SSA:
Collaborator

This is much cleaner than what we had before and removes a lot of duplicated code, but I think we can do even better!

If we add these attributes to DatasetSchema, we can simplify this code:

  • has_state_filter
  • has_year_lower_filter
  • has_year_upper_filter
  • has_exact_year_filter

Then this could be

if dataset_schema.has_state_filter and state is not None:
    state_column = cast(str, dataset_schema.state_column_name)
    filters.append(DataFilter(state_column, "==", get_state_abbreviation(state)))

if year is not None:
    try:
        if dataset_schema.has_year_lower_filter:
            date_lower_filter = DataFilter(
                dataset_schema.date_column_name,
                ">=",
                pd.Timestamp(year=year, month=1, day=1),
            )
            filters.append(date_lower_filter)

        if dataset_schema.has_year_upper_filter:
            date_upper_filter = DataFilter(
                dataset_schema.date_column_name,
                "<=",
                pd.Timestamp(year=year, month=12, day=31),
            )
            filters.append(date_upper_filter)
    except (pd.errors.OutOfBoundsDatetime, ValueError):
        raise ValueError(f"Invalid year provided: '{year}'")

    if dataset_schema.has_exact_year_filter:
        filters.append(DataFilter(dataset_schema.date_column_name, "==", year))
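
As a rough illustration, the proposed attributes could live on DatasetSchema along these lines; the dataclass layout is an assumption for the sketch, not the library's actual definition:

from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSchema:  # hypothetical, simplified layout
    name: str
    date_column_name: str
    state_column_name: str | None = None
    has_state_filter: bool = False
    has_year_lower_filter: bool = False
    has_year_upper_filter: bool = False
    has_exact_year_filter: bool = False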

return self.column_name, self.operator, self.value


def get_generate_data_filters(
Contributor

Not sure what conversation led to this encapsulation, but I really like it!

Should we add a unit test against this function? I believe we're testing all of this filtering at an integration level, so I'm not sure it's strictly necessary, but it would be nice. @rmudambi?
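
If we do want one, a minimal sketch of such a unit test; the expected filter contents and the "RI" argument are assumptions, not verified behavior:

from pseudopeople.schema_entities import DATASET_SCHEMAS


def test_get_generate_data_filters_builds_state_filter():
    # Hypothetical test sketch: a state argument should yield a filter on the state column.
    filters = get_generate_data_filters(DATASET_SCHEMAS.census, None, "RI")
    state_column = DATASET_SCHEMAS.census.state_column_name
    assert any(data_filter.column_name == state_column for data_filter in filters)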

if state is not None:
state_column_name = cast(str, DATASET_SCHEMAS.census.state_column_name)
filters.append(DataFilter(state_column_name, "==", get_state_abbreviation(state)))
filters: list[DataFilter] = get_generate_data_filters(DATASET_SCHEMAS.census, year, state)
Contributor

Now that there's a discrete filter-generator function, how reasonable would it be to just generate the filters inside of _generate_dataset? That would be slightly DRYer, and then the filter stuff could be even more encapsulated.
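
A rough sketch of the idea; the signature of _generate_dataset below is an assumption about the surrounding code, not the actual API:

def _generate_dataset(dataset_schema, source, seed, config, year=None, state=None):
    # Build the filters internally so callers no longer construct and pass them in.
    filters = get_generate_data_filters(dataset_schema, year, state)
    ...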

from pathlib import Path
from typing import Any, Literal

import dask.dataframe as dd
Contributor

Same here, as Zeb said elsewhere - let's not make dask a required dependency.

    source = paths.SAMPLE_DATA_ROOT
elif isinstance(source, str) or isinstance(source, Path):
    source = Path(source)
validate_source_compatibility(source, dataset_schema)
Contributor

Nit: Can you add a docstring to validate_source_compatibility? I know it's not a new thing in this PR, but it would have been helpful when reviewing.

dataset = Dataset(dataset_schema, original_data, SEED)
NOISE_TYPES.omit_row(dataset, config)
noised_data = dataset.data
has_small_shards = dataset_name == "acs" or dataset_name == "cps" or dataset_name == "wic"
Contributor

Add a comment explaining what you're doing here - I wasn't expecting this.

# 3% uncertainty on either side
target_proportion=(expected_noise * 0.97, expected_noise * 1.03),
name_additional=f"noised_data",
run_column_noising_tests(
Contributor

Why is there column noising stuff in this row-noising PR? Should I be reviewing all this?

Contributor Author

No, you can ignore this.

@@ -0,0 +1,222 @@
from __future__ import annotations

import dask.dataframe as dd
Contributor

same here - don't import dask globally

from pseudopeople.schema_entities import DATASET_SCHEMAS


def run_do_not_respond_tests(
Contributor

Are these all exact copy/pastes from the previous tests?

client = get_client()
client.shutdown()
client.shutdown() # type: ignore [no-untyped-call]
time.sleep(30)
Contributor

Remove this; it's not the source of the bug from before (and we'd never want to sleep 30 seconds in a unit test, anyway).

Contributor

Actually - this will all change when you rebase on the epic branch


datasets = [Dataset(dataset_schema, data, SEED) for data in dataset_data]
seed = SEED
if dataset_schema.name != DatasetNames.CENSUS and year is not None:
Contributor

@stevebachmeier stevebachmeier May 6, 2025

Why not do this for census?

Contributor Author

I don't know why, but the generate_decennial_census function doesn't update the seed (unlike every other generate_data function).

Collaborator

This seems like a bug to me!

datasets = [Dataset(dataset_schema, data, SEED) for data in dataset_data]
seed = SEED
if dataset_schema.name != DatasetNames.CENSUS and year is not None:
    seed = seed * 10_000 + year
Contributor

I wouldn't change it, but can't you also just do seed += year?

Contributor Author

That would work for this particular case, but I wanted to make it obvious that this code was taken from the generate_data functions.

Collaborator

I'm not sure it matters in these tests, but this 10_000 is actually important, quite hacky, and needs (at least!) a comment to explain it.

It's there because we don't want different years of the same dataset to get "the same" noise using the same (e.g. default) seed. What I mean by the same noise is that you'd get typos in the same place in the same column, even though the underlying data is different. This is similar to what we also do with different seeds per shard:

# Use a different seed for each data file/shard, otherwise the randomness will duplicate
# and the Nth row in each shard will get the same noise
data_path_seed = f"{seed}_{data_file_index}"
noised_data = _prep_and_noise_dataset(
    data, dataset, configuration_tree, data_path_seed
)

So how does 10_000 get involved? If we simply did seed += year, there could be collisions. For example, the user passing in seed=1234, year=2019 would get the same result as if they had passed seed=1233, year=2020, so we would no longer have the property that any two different seeds give totally different noise. With the * 10_000, as long as years are always less than 10,000, such collisions cannot occur, and the desired property is restored.
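
A small worked example of the collision described above, using just the numbers from this comment:

# With seed += year, two different (seed, year) pairs can collide:
assert 1234 + 2019 == 1233 + 2020  # both give 3253

# With seed * 10_000 + year they stay distinct as long as year < 10_000:
assert 1234 * 10_000 + 2019 != 1233 * 10_000 + 2020  # 12342019 vs 12332020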

Collaborator

seed = hash(f"{seed}_{year}") might be a clearer way to achieve the same.

[noise_type(dataset, config) for dataset in datasets]
for dataset in datasets:
    # noise datasets in place
    noise_type(dataset, config)
Contributor

😌

@hussain-jafari hussain-jafari merged commit 71cab55 into epic/full_scale_testing May 7, 2025
8 checks passed
@hussain-jafari hussain-jafari deleted the hjafari/feature/MIC-5885_sequential_row_noising branch May 7, 2025 18:02
hussain-jafari added a commit that referenced this pull request May 7, 2025
hussain-jafari added a commit that referenced this pull request May 7, 2025
hussain-jafari added a commit that referenced this pull request Jul 24, 2025