Add Dask engine to dataset generation functions #404

Merged: zmbc merged 7 commits into release-candidate/dtypes-distributed-noising from feature/dask on Apr 25, 2024

Conversation

@zmbc (Collaborator) commented Apr 22, 2024

Add Dask engine to dataset generation functions

Description

  • Category: feature
  • JIRA issue: none

Successor to #349 -- using Dask instead of Modin. This simplifies things and hopefully gives us a shorter path to releasing distributed noising.

Testing

  • all tests pass (pytest --runslow)

Added some tests that use Dask.
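
As a quick illustration of the new parameter (a sketch based on the call shown later in this thread; the import alias and path are assumptions, not taken from the PR itself):

    import pseudopeople as psp  # assumption: the package behind the "psp" alias used below

    full_data_path = "/path/to/full/data"  # placeholder

    # Default engine (pandas)
    df = psp.generate_decennial_census(source=full_data_path)

    # Same call routed through the new Dask engine added in this PR
    df = psp.generate_decennial_census(source=full_data_path, engine="dask")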

@aflaxman (Member) left a comment

Can you also add something to the docs about how to install and use this, and a cautionary note that this is not important for the small data distributed with psp, but can be a big time saver for people working with the full dataset?

Also, I think I'm doing it wrong... it's slower with Dask than without.
E.g. without this I loaded Alabama in 1.29 hours, but with it, on a 32-core cluster node, it took 1.57 hours.

@zmbc (Collaborator, Author) commented Apr 23, 2024

@aflaxman How did you start your Dask cluster? And which dataset did you load?

@aflaxman (Member) commented

I didn't start one! I just used

        df = psp.generate_decennial_census(source=full_data_path, state=name,
                                           config=my_config, verbose=False, engine='dask')

with a path to the full data that you sent me yesterday.

@zmbc (Collaborator, Author) commented Apr 23, 2024

I have a feeling you might be starting Dask with as many workers as there are physical CPUs on your cluster node, and then it is thrashing.

I'm in the process of testing this branch on the case study.

@zmbc (Collaborator, Author) commented Apr 23, 2024

Note: running into a fiddly bug with dtypes on a few columns. I don't think it will be overly complicated to fix.

@zmbc mentioned this pull request Apr 24, 2024
@zmbc (Collaborator, Author) commented Apr 24, 2024

Blocked by #405 -- Dask currently can't handle writing our strange dtypes to Parquet files.
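
For anyone following along, the blocker is essentially a dtype round trip through Parquet. A minimal sketch of the kind of check involved, with made-up stand-in columns (the actual offending dtypes are the ones tracked in #405):

    import pandas as pd
    import dask.dataframe as dd

    # Hypothetical repro: write pandas extension dtypes to Parquet via Dask
    # and verify they survive the round trip.
    pdf = pd.DataFrame({
        "zipcode": pd.Series(["02138", None], dtype="string"),  # stand-in column
        "state": pd.Series(["MA", "AL"], dtype="category"),     # stand-in column
    })
    ddf = dd.from_pandas(pdf, npartitions=1)
    ddf.to_parquet("roundtrip.parquet")  # the write step is where Dask trips on unusual dtypes
    print(dd.read_parquet("roundtrip.parquet").compute().dtypes)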

@aflaxman (Member) commented

> I didn't start one! I just used
>
>     df = psp.generate_decennial_census(source=full_data_path, state=name,
>                                        config=my_config, verbose=False, engine='dask')
>
> with a path to the full data that you sent me yesterday.

When I added the code block below (running on a node obtained with the srun command in the comments, after installing the packages from the other commented lines), I was able to load the full USA data in 40 minutes:

# pip install --upgrade "dask[distributed]"
# pip install --upgrade pyarrow

# srun -t 14-00:00:00 --mem=100G -c 32 -A proj_simscience -p long.q --pty bash

from dask.distributed import Client, LocalCluster

# Keep the cluster well below the node's 32 cores so workers don't thrash.
cluster = LocalCluster(n_workers=10, threads_per_worker=1)
client = Client(cluster)  # registering a Client makes this cluster the default scheduler
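
For what it's worth, that sizing is consistent with the thrashing hypothesis above: 10 single-threaded workers on a 32-core, 100G node leaves CPU and memory headroom per worker instead of oversubscribing the machine.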

@rmudambi (Collaborator) commented

@zmbc can you target a (new) release branch rather than main? We probably want to release all of your PRs together as one release, and merging to main would require us to cut a release.

@zmbc changed the base branch from main to release-candidate/dtypes-distributed-noising on April 25, 2024 at 20:06
@zmbc requested review from a team and pletale as code owners on April 25, 2024 at 20:25
@zmbc merged commit a96d337 into release-candidate/dtypes-distributed-noising on Apr 25, 2024
@zmbc deleted the feature/dask branch on April 25, 2024 at 21:24