Add Dask engine to dataset generation functions #404

zmbc merged 7 commits into release-candidate/dtypes-distributed-noising from feature/dask
Conversation
aflaxman
left a comment
Can you also add something to the docs about how to install and use this, and a cautionary note that this is not important for the small data distributed with psp, but can be a big time saver for people working with the full dataset?
Also, I think I'm doing it wrong... it's slower with Dask than without it.
E.g. without this I loaded Alabama in 1.29 hours, but with it, on a 32-core cluster node, it took 1.57 hours.
@aflaxman How did you start your Dask cluster? And which dataset did you load?
I didn't start one! I just used it with a path to the full data that you sent me yesterday.
I have a feeling you might be starting Dask with as many workers as there are physical CPUs on your cluster node, and then it is thrashing. I'm in the process of testing this branch on the case study.
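To avoid that oversubscription, one option is to create the cluster explicitly before calling the generation functions. The sketch below is a generic `dask.distributed` setup, not something prescribed by this PR; the worker/thread/memory numbers are illustrative assumptions, not recommended values.

```python
# Sketch: start a right-sized local Dask cluster instead of relying on
# defaults. Without an explicit cluster, Dask may spin up one worker per
# physical CPU, which can thrash on a shared 32-core cluster node.
# The counts below are placeholders -- tune them for your node.
from dask.distributed import Client, LocalCluster
import dask.array as da

cluster = LocalCluster(
    n_workers=4,            # fewer workers than physical cores on a shared node
    threads_per_worker=1,   # one thread each avoids oversubscribing CPUs
    memory_limit="2GiB",    # per-worker cap so workers don't start swapping
)
client = Client(cluster)    # Dask work submitted now uses this cluster

# Quick sanity check that the cluster is doing the computing.
total = da.arange(1_000).sum().compute()
print(total)  # 499500

client.close()
cluster.close()
```

Because `Client` registers itself as the default scheduler, any Dask-backed code run while it is open (including Dask-aware generation functions) will use these workers rather than spawning its own.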
Note: I'm running into a fiddly bug with dtypes on a few columns. I don't think it will be overly complicated to fix.
Blocked by #405 -- Dask currently can't handle writing our strange dtypes to Parquet files.
When I added this code block (on a node with the
@zmbc can you target a (new) release branch rather than |
Merge branch 'release-candidate/dtypes-distributed-noising' into feature/dask
Add Dask engine to dataset generation functions
Description
Successor to #349 -- using Dask instead of Modin. This simplifies things and hopefully gives us a shorter path to releasing distributed noising.
Testing
Added some tests that use Dask (run via `pytest --runslow`).