change dask default client #499
Conversation
Just to be really pedantic, it isn't that the threaded scheduler isn't helpful on our cluster, but that it isn't helpful for the workload we run on it! Most of the runtime hotspots in pseudopeople can only effectively use one thread.
```python
# Generate a new (non-fixture) dataset for a single year but mocked such
# that no noise actually happens (otherwise the years would get noised and
# we couldn't tell if the filter was working properly)
mocker.patch("pseudopeople.dataset.Dataset._noise_dataset")
```
These mocks were added prior to implementing the config = psp.NO_NOISE option, but that option doesn't work on a distributed cluster since the no-noise call is split among different processes.
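For context, the test pattern being described looks roughly like this (a hedged sketch using pytest-mock; the test name and the column assertion are illustrative, not the actual pseudopeople test):

```python
import pseudopeople as psp


def test_year_filter(mocker):
    # Patch out the noising step so any surviving rows reflect the year
    # filter alone, not noise that may have altered year values.
    mocker.patch("pseudopeople.dataset.Dataset._noise_dataset")
    data = psp.generate_decennial_census(year=2020)
    # Hypothetical assertion: every remaining row is from the requested year.
    assert (data["year"] == 2020).all()
```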
```python
####################
# HELPER FUNCTIONS #
####################
def _get_column_noise_level(
```
This just wasn't being used anymore
```python
    available_memory = float(os.environ["SLURM_MEM_PER_NODE"]) / 1024
except KeyError:
    raise RuntimeError(
        "You are on Slurm but SLURM_MEM_PER_NODE is not set. "
```
Happy to work on this message. It only shows up if you run pytest while SSHed into a cluster node (the obvious example being that you're using VS Code on the cluster).
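For reference, a reconstruction of the surrounding logic (a sketch; the function name is an assumption and the error message is truncated in the diff above):

```python
import os


def _get_available_memory_gb() -> float:  # hypothetical name
    try:
        # Slurm exports SLURM_MEM_PER_NODE in MB; convert to GB.
        return float(os.environ["SLURM_MEM_PER_NODE"]) / 1024
    except KeyError:
        # The full message is elided in the diff above.
        raise RuntimeError("You are on Slurm but SLURM_MEM_PER_NODE is not set. ...")
```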
```python
from dask.system import CPU_COUNT

# extract the memory limit from the environment variable
cluster = LocalCluster(  # type: ignore [no-untyped-call]
```
I couldn't get mypy to be happy without just ignoring these dask untyped calls. It does mean that the @overload blocks I added throughout this file aren't strictly required for mypy, but they are more correct regardless.
```diff
@@ -126,6 +155,8 @@ def _generate_dataset(
+    import dask
+    import dask.dataframe as dd
```
@zmbc Remind me why you prefer having dask imports in local scope? It leads to a weird thing where we import dask.dataframe as dd if type-checking and then import it here at runtime if the engine is dask.
It's so that a user doesn't have to have dask in their environment to run pseudopeople if they are using the pandas engine.
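The pattern under discussion, as a minimal self-contained sketch (simplified signatures, not the actual pseudopeople source):

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Literal

import pandas as pd

if TYPE_CHECKING:
    # Seen only by type checkers, so dask need not be installed at runtime.
    import dask.dataframe as dd


def _generate_dataset(
    engine: Literal["pandas", "dask"] = "pandas",
) -> pd.DataFrame | dd.DataFrame:
    if engine == "dask":
        # Runtime import, reached only when the user asked for the dask engine.
        import dask.dataframe as dd

        return dd.from_pandas(pd.DataFrame({"a": [1]}), npartitions=1)
    return pd.DataFrame({"a": [1]})
```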
```python
@overload
def _generate_dataset(
```
This is for type-hinting to work?
Well, kinda. Technically mypy didn't care anyway b/c I am ignoring the dask_data.map_partitions() call ([no-untyped-call]) and so mypy has no way of actually knowing what type it is. But this seems to be the correct way to handle return types that are argument-specific.
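In the abstract, the pattern is (a toy sketch, unrelated to pseudopeople's actual signatures):

```python
from typing import Literal, overload


@overload
def load(fmt: Literal["text"]) -> str: ...
@overload
def load(fmt: Literal["bytes"]) -> bytes: ...


def load(fmt: Literal["text", "bytes"]) -> str | bytes:
    # One runtime implementation; the overloads above tell type checkers
    # which return type corresponds to which literal argument.
    data = b"payload"
    return data.decode() if fmt == "text" else data
```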
zmbc left a comment
Looks great, thank you for all the tricky debugging you did to get this working @stevebachmeier!
```diff
 @overload
 def generate_decennial_census(
     source: Path | str | None = None,
     seed: int = 0,
     config: Path | str | dict[str, Any] | None = None,
     year: int | None = 2020,
     state: str | None = None,
     verbose: bool = False,
-    engine: Literal["pandas", "dask"] = "pandas",
+    engine: Literal["pandas"] = "pandas",
 ) -> pd.DataFrame:
     ...


 @overload
 def generate_decennial_census(
     source: Path | str | None,
     seed: int,
     config: Path | str | dict[str, Any] | None,
     year: int | None,
     state: str | None,
     verbose: bool,
     engine: Literal["dask"],
 ) -> dd.DataFrame:
     ...
```
Note: I'm curious how overloads look in the docs, will take a look at this after I finish reviewing the code.
The overloads don't show up 😞 Oh well, at least they could be helpful for autocomplete etc
Hmm, that's too bad. I didn't think to check myself, but Sphinx def worked on it.
```python
    n_workers=CPU_COUNT,
    threads_per_worker=1,
)
cluster.get_client()  # type: ignore [no-untyped-call]
```
I have a vague memory of needing to have this client object in scope for it to be used. But I presume your testing here has ensured this client is used? I wonder if .get_client is even necessary?
Oh, I should probably add to the test that the client is actually of type LocalCluster. Then I think I'd be happy b/c that's def not just the default.
Shoot, the default dask client happens to be a LocalCluster as well. I'll add a name to our new default (name = "pseudopeople_dask_cluster" unless you disagree) and assert that it's correct.
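Something along these lines would make that assertion unambiguous (a sketch; the test name is hypothetical and the cluster name is the one proposed above):

```python
from distributed import get_client


def test_uses_pseudopeople_default_cluster() -> None:
    # dask's own default client is also backed by a LocalCluster, so checking
    # the type alone proves nothing; check the name we set instead.
    client = get_client()
    assert client.cluster.name == "pseudopeople_dask_cluster"
```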
Co-authored-by: Zeb Burke-Conte <zmbc@users.noreply.github.com>
Change default Dask client
Description
Dask by default uses a threaded scheduler, which isn't helpful for these workloads. This fixes it so that it first checks to see if a dask cluster is already set up and, if so, just uses that. If one isn't, it uses a LocalCluster with threads_per_worker = 1 (see the sketch after the testing notes below).
Testing
pytest --runslow
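For concreteness, the client-selection logic described above amounts to something like this (a hedged sketch: the function name is an assumption, and only the use-an-existing-client-or-build-a-named-LocalCluster shape comes from this PR):

```python
from dask.system import CPU_COUNT
from distributed import Client, LocalCluster


def _ensure_dask_client() -> None:  # hypothetical name
    try:
        # If the user already set up their own cluster/client, use it as-is.
        Client.current()
        return
    except ValueError:
        pass
    # Otherwise fall back to one single-threaded worker per CPU, since
    # pseudopeople's hotspots can each only use one thread effectively.
    cluster = LocalCluster(
        name="pseudopeople_dask_cluster",
        n_workers=CPU_COUNT,
        threads_per_worker=1,
    )
    cluster.get_client()
```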