Workflow ideas #725

Open

@jrbourbeau
Note: I'm moving the list of proposed workflows over from the roadmap to this repo. I'll continue to iterate on this issue.

Data loading and cleaning

Dask is often used to schlep data from one format to another, cleaning or manipulating it along the way. This occurs in both dataframe and array use cases. There are lots of possible configurations here, but we’ll focus on just a few to start.

Exploratory Analysis

This is where most of our demos live today. Load a dataset, fool around, and make some pretty charts.

Embarrassingly parallel ✅

The matplotlib-arXiv notebook is a good existing example of an embarrassingly parallel workflow. This is “Dask as a big for loop”. It also demonstrates cloud data access and processes 3 TB of real data.
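The “big for loop” shape might look like the following sketch, where the per-item work is a trivial stand-in for downloading and parsing one arXiv file:

```python
import dask
from dask import delayed

@delayed
def process(item):
    # stand-in for downloading and parsing one file
    return item ** 2

# Dask as a big for loop: one lazy task per input item
tasks = [process(i) for i in range(8)]

# run all tasks in parallel, then reduce on the client
results = dask.compute(*tasks)
total = sum(results)
print(total)
```

Each `process(i)` call is independent, which is exactly what makes the workload embarrassingly parallel; Dask just schedules the calls across whatever workers are available.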

Imaging

There is a surprisingly large community of people using Dask for bio-medical imaging. This includes applications like fMRI brain scans and very high resolution microscopy (3D movies of cells at micron-scale resolution). These folks often want to load in data, apply image processing filters across that data using map_overlap, and then visually explore the result, with a human in the loop.
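The `map_overlap` pattern might look like this sketch, with a 1-D three-point moving average standing in for a real image-processing kernel:

```python
import numpy as np
import dask.array as da

def smooth(block):
    # toy 3-point moving average standing in for a real image filter;
    # map_overlap hands us the block already padded by `depth` elements
    return np.convolve(block, np.ones(3) / 3, mode="same")

arr = da.arange(8, chunks=4)

# depth=1 shares one element of overlap between neighboring chunks so the
# filter sees correct values at chunk boundaries; the overlap is trimmed
# off the result automatically (trim=True is the default)
smoothed = arr.map_overlap(smooth, depth=1, boundary="reflect")
result = smoothed.compute()
print(result)
```

For microscopy data the same call works on 2-D or 3-D `dask.array` chunks with a multi-dimensional filter, and `depth` is set to the filter's footprint so stitched chunk edges match a single-machine result.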

XGBoost

This is probably our most common ML application: folks want to load data into a Dask DataFrame and then hand off to XGBoost’s Dask integration, possibly with GPUs. They also want to combine this with hyperparameter optimization (HPO).

We already have Guido’s work here at https://github.com/coiled/dask-xgboost-nyctaxi . Maybe we want to extend it with GPUs or cost analysis.

  • Train on a large dataset
  • Train on a large dataset with HPO via Optuna
  • Add GPUs

PyTorch + HyperParameter Optimization

We have Optuna. We use it above for XGBoost, but we should also show how to use it in more vanilla settings with a model that can be trained on a single machine, presumably with a GPU. Let’s use PyTorch for this.

Train a PyTorch model that fits on a single GPU, using Optuna for HPO across a cluster.
