change dask default client #499

Merged
stevebachmeier merged 7 commits into epic/full_scale_testing from sbachmei/mic-5523/change-dask-scheduler-defaults on Mar 20, 2025
Conversation

@stevebachmeier (Contributor) commented Mar 17, 2025

Change default Dask client

Description

Dask by default uses a threaded scheduler, which isn't helpful for these workloads.
This change makes it so that pseudopeople first checks whether a Dask cluster is
already set up and, if so, uses it. If one isn't, it creates a LocalCluster with
one thread per worker.
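A minimal sketch of that logic might look like the following. This is illustrative, not the actual pseudopeople implementation: the function name is invented, and the cluster name is the one proposed later in this review.

```python
def get_or_create_dask_client():
    """Reuse an already-active Dask client if one exists; otherwise start a
    LocalCluster with one thread per worker (hypothetical sketch)."""
    import distributed  # deferred so dask is only required on this code path

    try:
        # distributed.get_client() raises ValueError when no client is active.
        return distributed.get_client()
    except ValueError:
        from dask.system import CPU_COUNT
        from distributed import LocalCluster

        cluster = LocalCluster(
            name="pseudopeople_dask_cluster",  # name proposed later in review
            n_workers=CPU_COUNT,
            threads_per_worker=1,  # the hotspots are single-threaded anyway
        )
        return cluster.get_client()
```

The key design point is the order of the checks: an externally configured cluster always wins, and the process-heavy LocalCluster is only a fallback default.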

Testing

  • all tests pass (pytest --runslow)

stevebachmeier requested a review from zmbc on March 17, 2025 22:49
@zmbc (Collaborator) commented Mar 18, 2025

Just to be really pedantic: it isn't that the threaded scheduler isn't helpful on our cluster, but that it isn't helpful for the workload we run on it, because most of the runtime hotspots in pseudopeople can only effectively use one thread.

stevebachmeier force-pushed the sbachmei/mic-5523/change-dask-scheduler-defaults branch from be9103b to 1709f35 on March 18, 2025 20:01
stevebachmeier force-pushed the sbachmei/mic-5523/change-dask-scheduler-defaults branch from 93ec3ec to d465017 on March 18, 2025 22:20
# Generate a new (non-fixture) dataset for a single year but mocked such
# that no noise actually happens (otherwise the years would get noised and
# we couldn't tell if the filter was working properly)
mocker.patch("pseudopeople.dataset.Dataset._noise_dataset")
stevebachmeier (Contributor, Author) commented:

These mocks were added prior to implementing the config = psp.NO_NOISE option - but it doesn't work for a distributed cluster since the no_noise call is split among different processes

####################
# HELPER FUNCTIONS #
####################
def _get_column_noise_level(
stevebachmeier (Contributor, Author) commented:

This just wasn't being used anymore

available_memory = float(os.environ["SLURM_MEM_PER_NODE"]) / 1024
except KeyError:
raise RuntimeError(
"You are on Slurm but SLURM_MEM_PER_NODE is not set. "
stevebachmeier (Contributor, Author) commented:

Happy to work on this message. It only shows up if you run pytest while SSHed into a cluster node (the obvious example being you're using vscode on the cluster)

from dask.system import CPU_COUNT

# extract the memory limit from the environment variable
cluster = LocalCluster( # type: ignore [no-untyped-call]
stevebachmeier (Contributor, Author) commented:

I couldn't get mypy to be happy without just ignoring these dask untyped calls. It does mean that the @overload blocks I added throughout this file aren't strictly required for mypy, but they are more correct regardless.

@@ -126,6 +155,8 @@ def _generate_dataset(
import dask
import dask.dataframe as dd
stevebachmeier (Contributor, Author) commented:

@zmbc Remind me why you prefer having dask imports in local scope? It leads to a weird thing where we import dask.dataframe as dd if typechecking and then import it here during runtime if the engine is dask.

zmbc (Collaborator) replied:

It's so that a user doesn't have to have dask in their environment to run pseudopeople if they are using the pandas engine.
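The pattern being described, a type-checker-only import at module scope plus a deferred runtime import, might look roughly like this skeleton (function name and return values are illustrative, not pseudopeople's actual code):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; adds no runtime dependency on dask.
    import dask.dataframe as dd


def generate_dataset(engine: str = "pandas") -> str:
    """Skeleton of the engine dispatch; returns a label instead of data."""
    if engine == "dask":
        # Deferred import: evaluated only when the dask engine is requested,
        # so pandas-only users never need dask in their environment.
        import dask.dataframe as dd  # noqa: F811

        return dd.__name__
    return "pandas"
```

Calling `generate_dataset("pandas")` never evaluates the dask import, which is exactly why the optional dependency stays optional.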

stevebachmeier marked this pull request as ready for review on March 19, 2025 17:42


@overload
def _generate_dataset(
zmbc (Collaborator) commented:

This is for type-hinting to work?

stevebachmeier (Contributor, Author) commented Mar 19, 2025:

Well, kinda. Technically mypy didn't care anyway, because I am ignoring the dask_data.map_partitions() call ([no-untyped-arg]), so mypy has no way of actually knowing what type it is. But this seems to be the correct way to handle return types that are argument-specific.
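For reference, the pattern under discussion, a return type that depends on a Literal argument value, in a self-contained toy form (not pseudopeople's API; the real overloads appear in the diff below):

```python
from typing import Literal, overload


@overload
def generate(engine: Literal["list"] = ...) -> list[int]: ...
@overload
def generate(engine: Literal["tuple"]) -> tuple[int, ...]: ...
def generate(engine: str = "list"):
    # One runtime implementation; the overload stubs above exist purely so
    # type checkers know the return type follows the literal `engine` value.
    data = [1, 2, 3]
    return list(data) if engine == "list" else tuple(data)
```

At runtime the @overload-decorated stubs are discarded; only the final definition executes.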


zmbc (Collaborator) left a review:

Looks great, thank you for all the tricky debugging you did to get this working @stevebachmeier!

Comment on lines +237 to +260
@overload
def generate_decennial_census(
source: Path | str | None = None,
seed: int = 0,
config: Path | str | dict[str, Any] | None = None,
year: int | None = 2020,
state: str | None = None,
verbose: bool = False,
engine: Literal["pandas"] = "pandas",
) -> pd.DataFrame:
...


@overload
def generate_decennial_census(
source: Path | str | None,
seed: int,
config: Path | str | dict[str, Any] | None,
year: int | None,
state: str | None,
verbose: bool,
engine: Literal["dask"],
) -> dd.DataFrame:
...
zmbc (Collaborator) commented:

Note: I'm curious how overloads look in the docs, will take a look at this after I finish reviewing the code.

zmbc (Collaborator) commented:

The overloads don't show up 😞 Oh well, at least they could be helpful for autocomplete etc

stevebachmeier (Contributor, Author) commented:

Hmm, that's too bad. I didn't think to check myself, but Sphinx definitely worked on it.

n_workers=CPU_COUNT,
threads_per_worker=1,
)
cluster.get_client() # type: ignore [no-untyped-call]
zmbc (Collaborator) commented:

I have a vague memory of needing to have this client object in scope for it to be used. But I presume your testing here has ensured this client is used? I wonder if .get_client is even necessary?

stevebachmeier (Contributor, Author) commented:

Oh, I should probably add to the test that the client is actually of type LocalCluster. Then I think I'd be happy, because that's definitely not just the default.

stevebachmeier (Contributor, Author) commented Mar 20, 2025:

Shoot, the dask-default client happens to be a LocalCluster as well. I'll add a name to our new default (name = "pseudopeople_dask_cluster" unless you disagree) and assert that it's correct
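A test along those lines might look roughly like this (a hypothetical sketch; the assertion actually added in the PR may be structured differently):

```python
def assert_pseudopeople_cluster_active() -> None:
    """Verify the active Dask client belongs to the cluster pseudopeople
    created, distinguished by name from dask's own anonymous LocalCluster."""
    import distributed  # deferred import, matching the rest of this PR

    client = distributed.get_client()
    assert client.cluster.name == "pseudopeople_dask_cluster"
```

Asserting on the name rather than the type sidesteps the problem above: dask's fallback default is also a LocalCluster, but it won't carry this name.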

stevebachmeier and others added 2 commits March 20, 2025 10:43
Co-authored-by: Zeb Burke-Conte <zmbc@users.noreply.github.com>
stevebachmeier merged commit b1952c3 into epic/full_scale_testing on Mar 20, 2025
6 of 8 checks passed
stevebachmeier deleted the sbachmei/mic-5523/change-dask-scheduler-defaults branch on March 20, 2025 17:32
hussain-jafari pushed a commit that referenced this pull request May 7, 2025
hussain-jafari pushed a commit that referenced this pull request May 7, 2025
hussain-jafari pushed a commit that referenced this pull request Jul 24, 2025
5 participants