Add Dask engine to dataset generation functions #404

Merged: zmbc merged 7 commits into release-candidate/dtypes-distributed-noising from feature/dask on Apr 25, 2024

Conversation

@zmbc (Collaborator) commented Apr 22, 2024

Add Dask engine to dataset generation functions

Description

  • Category: feature
  • JIRA issue: none

Successor to #349 -- using Dask instead of Modin. This simplifies things and hopefully gives us a shorter path to releasing distributed noising.

Testing

  • all tests pass (pytest --runslow)

Added some tests that use Dask.
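
As a quick illustration of the new parameter (a sketch based on the call shown later in this thread; the import alias and path are assumptions, not taken from the PR itself):

    import pseudopeople as psp  # assumption: the package behind the "psp" alias used below

    full_data_path = "/path/to/full/data"  # placeholder

    # Default engine (pandas)
    df = psp.generate_decennial_census(source=full_data_path)

    # Same call routed through the new Dask engine added in this PR
    df = psp.generate_decennial_census(source=full_data_path, engine="dask")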

@aflaxman (Member) left a comment

Can you also add something to the docs about how to install and use this, and a cautionary note that this is not important for the small data distributed with psp, but can be a big time saver for people working with the full dataset?

Also, I think I'm doing it wrong... it's slower with Dask than without.
E.g. without this I loaded Alabama in 1.29 hours, but with it, on a 32-core cluster node, it took 1.57 hours.

@zmbc (Collaborator, Author) commented Apr 23, 2024

@aflaxman How did you start your Dask cluster? And which dataset did you load?

@aflaxman (Member) commented

I didn't start one! I just used

        df = psp.generate_decennial_census(source=full_data_path, state=name,
                                           config=my_config, verbose=False, engine='dask')

with a path to the full data that you sent me yesterday.

@zmbc (Collaborator, Author) commented Apr 23, 2024

I have a feeling you might be starting Dask with as many workers as there are physical CPUs on your cluster node, and then it is thrashing.

I'm in the process of testing this branch on the case study.

@zmbc (Collaborator, Author) commented Apr 23, 2024

Note: running into a fiddly bug with dtypes on a few columns. I don't think it will be overly complicated to fix.

@zmbc mentioned this pull request Apr 24, 2024
@zmbc (Collaborator, Author) commented Apr 24, 2024

Blocked by #405 -- Dask currently can't handle writing our strange dtypes to Parquet files.
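
For anyone following along, the blocker is essentially a dtype round trip through Parquet. A minimal sketch of the kind of check involved, with made-up stand-in columns (the actual offending dtypes are the ones tracked in #405):

    import pandas as pd
    import dask.dataframe as dd

    # Hypothetical repro: write pandas extension dtypes to Parquet via Dask
    # and verify they survive the round trip.
    pdf = pd.DataFrame({
        "zipcode": pd.Series(["02138", None], dtype="string"),  # stand-in column
        "state": pd.Series(["MA", "AL"], dtype="category"),     # stand-in column
    })
    ddf = dd.from_pandas(pdf, npartitions=1)
    ddf.to_parquet("roundtrip.parquet")  # the write step is where Dask trips on unusual dtypes
    print(dd.read_parquet("roundtrip.parquet").compute().dtypes)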

@aflaxman (Member) commented

> I didn't start one! I just used
>
>     df = psp.generate_decennial_census(source=full_data_path, state=name,
>                                        config=my_config, verbose=False, engine='dask')
>
> with a path to the full data that you sent me yesterday.

When I added the code block below (running on a node obtained with the srun command in the comments, after installing the packages from the other commented lines), I was able to load the full USA data in 40 minutes:

# pip install --upgrade "dask[distributed]"
# pip install --upgrade pyarrow

# srun -t 14-00:00:00 --mem=100G -c 32 -A proj_simscience -p long.q --pty bash

from dask.distributed import Client, LocalCluster

# Keep the cluster well below the node's 32 cores so workers don't thrash.
cluster = LocalCluster(n_workers=10, threads_per_worker=1)
client = Client(cluster)  # registering a Client makes this cluster the default scheduler
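
For what it's worth, that sizing is consistent with the thrashing hypothesis above: 10 single-threaded workers on a 32-core, 100G node leaves CPU and memory headroom per worker instead of oversubscribing the machine.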

@rmudambi (Collaborator) commented

@zmbc can you target a (new) release branch rather than main? We probably want to release all of your PRs together as one release, and merging to main would require us to cut a release.

@zmbc changed the base branch from main to release-candidate/dtypes-distributed-noising on April 25, 2024 at 20:06
@zmbc requested review from a team and pletale as code owners on April 25, 2024 at 20:25
@zmbc merged commit a96d337 into release-candidate/dtypes-distributed-noising on Apr 25, 2024
@zmbc deleted the feature/dask branch on April 25, 2024 at 21:24