Description
Note: I'm moving the list of proposed workflows over from the roadmap to this repo. I'll continue to iterate a bit on this issue.
Data loading and cleaning
Dask is often used to schlep data from one format to another, cleaning or manipulating it along the way. This occurs in both dataframe and array use cases. There are lots of possible configurations here, but we’ll focus on just a few to start.
Exploratory Analysis
This is where most of our demos live today: load a dataset, fool around, make some pretty charts.
- Uber/Lyft, perform various simple dataframe computations and find novel results
- Pangeo / earth science workflows #770
Punting during our first pass over workflows
Embarrassingly parallel ✅
The matplotlib-arXiv notebook is a good existing example of an embarrassingly parallel workflow. This is "Dask as a big for loop". It also shows cloud data access and processes 3 TB of real data.
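The "big for loop" shape, independent of the arXiv specifics, can be sketched with `dask.delayed`: one task per input, no dependencies between tasks. The `process` function here is a hypothetical placeholder; in the real notebook it would render a matplotlib figure from one arXiv file.

```python
import dask

# Hypothetical per-item task; in the arXiv example this would process one file.
def process(item):
    return item ** 2

# "Dask as a big for loop": build one independent delayed task per input,
# then execute them all in parallel.
tasks = [dask.delayed(process)(i) for i in range(10)]
results = dask.compute(*tasks)
print(sum(results))  # 285
```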
Imaging
There is a surprisingly large community of people using Dask for bio-medical imaging. This includes applications like fMRI brain scans and very high-resolution microscopy (3D movies of cells at micron resolution). These folks often want to load data, apply image-processing filters across it using map_overlap, and then visually explore the result. They want this processing done in human-in-the-loop systems.
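A small sketch of the `map_overlap` pattern: the filter below is a toy 3x3 mean filter standing in for a real image-processing kernel, and the random array stands in for microscopy data. The important part is `depth=1`, which gives each chunk a one-pixel halo so the filter is seamless across chunk boundaries.

```python
import numpy as np
import dask.array as da

# Stand-in for a large microscopy image; chunked so filtering runs in parallel.
x = da.random.random((1024, 1024), chunks=(256, 256))

# Toy 3x3 mean filter as a placeholder for a real image-processing kernel.
def mean_filter(block):
    out = np.zeros_like(block)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            out += np.roll(np.roll(block, di, axis=0), dj, axis=1)
    return out / 9

# depth=1 adds a one-pixel halo to each block so results match at chunk edges.
smoothed = x.map_overlap(mean_filter, depth=1, boundary="reflect")
result = smoothed.compute()
print(result.shape)  # (1024, 1024)
```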
XGBoost
Probably our most common ML application: folks want to load data into a Dask DataFrame and then hand it off to XGBoost's Dask integration, possibly with GPUs. They also want to do this with hyperparameter optimization (HPO).
We already have Guido’s work here at https://github.com/coiled/dask-xgboost-nyctaxi . Maybe we want to extend it with GPUs or cost analysis.
- Train on a large dataset
- Train on a large dataset with HPO with Optuna
- Add GPUs
PyTorch + HyperParameter Optimization
We have Optuna. We use it above for XGBoost, but we should also show how to use it in a more vanilla setting with a model that can be trained on a single machine, presumably a single GPU. Let's use PyTorch for this.
- Train some PyTorch GPU model that fits on a single GPU, with Optuna for HPO on a cluster