Description
Note: I'm moving the list of proposed workflows over from the roadmap to this repo. I'll continue to iterate a bit on this issue.
Data loading and cleaning
Dask is often used to schlep data from one format to another, cleaning or manipulating it along the way. This occurs in both dataframe and array use cases. There are lots of possible configurations here, but we’ll focus on just a few to start.
Exploratory Analysis
This is where most of our demos live today: load a dataset, fool around, make some pretty charts.
- Uber/Lyft, perform various simple dataframe computations and find novel results
- Pangeo / earth science workflows #770
Punting during our first pass over workflows
Embarrassingly parallel ✅
The matplotlib-arXiv notebook is a good existing example of an embarrassingly parallel workflow. This is "Dask as a big for loop". It also shows cloud data access and processes 3 TB of real data.
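The "big for loop" shape, independent of the arXiv specifics, can be sketched with `dask.delayed`: one task per input, no dependencies between tasks. The `process` function here is a hypothetical placeholder; in the real notebook it would render a matplotlib figure from one arXiv file.

```python
import dask

# Hypothetical per-item task; in the arXiv example this would process one file.
def process(item):
    return item ** 2

# "Dask as a big for loop": build one independent delayed task per input,
# then execute them all in parallel.
tasks = [dask.delayed(process)(i) for i in range(10)]
results = dask.compute(*tasks)
print(sum(results))  # 285
```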
Imaging
There is a surprisingly large community of people using Dask for bio-medical imaging. This includes applications like fMRI brain scans and very high-resolution microscopy (3D movies of cells at micron resolution). These folks often want to load data, apply image-processing filters across it using map_overlap, and then visually explore the result. They want this processing done in human-in-the-loop systems.
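A small sketch of the `map_overlap` pattern: the filter below is a toy 3x3 mean filter standing in for a real image-processing kernel, and the random array stands in for microscopy data. The important part is `depth=1`, which gives each chunk a one-pixel halo so the filter is seamless across chunk boundaries.

```python
import numpy as np
import dask.array as da

# Stand-in for a large microscopy image; chunked so filtering runs in parallel.
x = da.random.random((1024, 1024), chunks=(256, 256))

# Toy 3x3 mean filter as a placeholder for a real image-processing kernel.
def mean_filter(block):
    out = np.zeros_like(block)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            out += np.roll(np.roll(block, di, axis=0), dj, axis=1)
    return out / 9

# depth=1 adds a one-pixel halo to each block so results match at chunk edges.
smoothed = x.map_overlap(mean_filter, depth=1, boundary="reflect")
result = smoothed.compute()
print(result.shape)  # (1024, 1024)
```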
XGBoost
Probably our most common ML application: folks want to load data into a Dask DataFrame and then hand it off to XGBoost's Dask integration, possibly with GPUs. They also want to do this with hyperparameter optimization (HPO).
We already have Guido’s work here at https://github.com/coiled/dask-xgboost-nyctaxi . Maybe we want to extend it with GPUs or cost analysis.
- Train on a large dataset
- Train on a large dataset with HPO with Optuna
- Add GPUs
PyTorch + HyperParameter Optimization
We have Optuna. We use it above for XGBoost, but we should also show how to use it in a more vanilla setting with a model that can be trained on a single machine, presumably a single GPU. Let's use PyTorch for this.
- Train some PyTorch GPU model that fits on a single GPU, with Optuna for HPO on a cluster