PTDS Benchmarks #517

Now that https://github.com/cupy/cupy/pull/4322 is merged, we can begin profiling and benchmarking the performance of more workers per GPU with per-thread default streams (PTDS) enabled.


## Setup 
- Build the latest CuPy from source: `python -m pip install .`
- Start the workers with PTDS enabled, e.g. `CUPY_CUDA_PER_THREAD_DEFAULT_STREAM=1 dask-cuda-worker --nthreads=2`
- Use at least 4 GPUs
- Test `nthreads` values: 1, 2, 4, 8, 32
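The sweep over `nthreads` could be scripted along the following lines. This is only a sketch: the helper names `worker_command` and `worker_env` and the scheduler address `tcp://127.0.0.1:8786` are assumptions made for illustration and are not part of this issue. It only builds the command line and environment for each run; launching and tearing down the workers is left out.

```python
import os
import shlex

# Thread counts listed in the setup above.
NTHREADS_VALUES = [1, 2, 4, 8, 32]


def worker_command(scheduler_address, nthreads):
    """Build the dask-cuda-worker command line for one benchmark run."""
    return ["dask-cuda-worker", scheduler_address, f"--nthreads={nthreads}"]


def worker_env(base_env=None):
    """Copy the environment and switch on CuPy's per-thread default stream."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUPY_CUDA_PER_THREAD_DEFAULT_STREAM"] = "1"
    return env


if __name__ == "__main__":
    # Hypothetical scheduler address; replace with the real one.
    address = "tcp://127.0.0.1:8786"
    for n in NTHREADS_VALUES:
        cmd = shlex.join(worker_command(address, n))
        print(f"CUPY_CUDA_PER_THREAD_DEFAULT_STREAM=1 {cmd}")
```

The list returned by `worker_command` and the dictionary returned by `worker_env` can be passed to `subprocess.Popen(cmd, env=env)` if the runs should be started from Python rather than from a shell.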


## Test Operations

- sum
- mean
- svd
- matrix multiply
- array slicing
- transpose + sum: `(x + x.T).sum()`

Note: @pentschev previously tested these operations on a single GPU:
- https://github.com/pentschev/pybench/blob/89d65a6c418a1fee39d447bd11b8a999835b74a9/pybench/benchmarks/benchmark_array.py
- https://github.com/pentschev/pybench/blob/master/examples/plot_array_example.ipynb
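One possible way to organise these operations for timing is a small table of callables plus a timing helper, as sketched below. The names `OPERATIONS` and `time_call` are hypothetical, and the exact forms chosen for matrix multiply (`x @ x.T`) and slicing (`x[::3]`) are illustrative assumptions, not taken from this issue or from pybench. `svd` is left out of the sketch and would need its own entry.

```python
import time

# Each entry builds one test operation on an array-like x
# (for example a CuPy-backed Dask array).
OPERATIONS = {
    "sum": lambda x: x.sum(),
    "mean": lambda x: x.mean(),
    "matmul": lambda x: x @ x.T,      # illustrative choice of operands
    "slicing": lambda x: x[::3],      # illustrative choice of slice
    "transpose_sum": lambda x: (x + x.T).sum(),
}


def time_call(func, repeats=3):
    """Call func() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)
```

Because Dask arrays are lazy, the timed callable should include the execution step, for example `time_call(lambda: OPERATIONS["sum"](x).compute())`.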


## Example of CuPy with Dask

```python
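import cupy
import dask.array as da

# Back the Dask random state with CuPy so each chunk is generated on the GPU.
rs = da.random.RandomState(RandomState=cupy.random.RandomState)
# 500000 x 500000 array split into 10000 x 10000 chunks.
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))
x.sum().compute()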
```
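When choosing `nthreads`, it may help to know how large the example is. The sketch below works out the chunk count and memory footprint for the shape and chunking used above, assuming 8-byte `float64` elements (the usual default for `normal`). The helper name `chunk_stats` is hypothetical.

```python
import math


def chunk_stats(shape, chunks, itemsize=8):
    """Return (number of chunks, bytes per full chunk, total bytes) for a chunked array."""
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_bytes = math.prod(chunks) * itemsize
    total_bytes = math.prod(shape) * itemsize
    return n_chunks, chunk_bytes, total_bytes


n_chunks, chunk_bytes, total_bytes = chunk_stats((500000, 500000), (10000, 10000))
print(n_chunks)            # 2500 chunks
print(chunk_bytes / 1e6)   # 800.0 MB per chunk
print(total_bytes / 1e12)  # 2.0 TB in total
```

Under that assumption each chunk is about 800 MB, so a worker running several threads may hold several such chunks on one GPU at the same time, plus any temporaries.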


cc @charlesbluca 
