Currently `series.values`, and especially `series.to_cupy()`, are substantially slower than `cupy.asarray(series)`:
```python
In [2]: s = cudf.Series(range(10000))

In [3]: %timeit s.values
81.4 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [4]: %timeit cp.asarray(s)
19.1 µs ± 168 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [5]: %timeit s.to_cupy()
349 µs ± 75.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
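A likely reason `cp.asarray` is so much faster is that it can consume the Series' `__cuda_array_interface__` directly, without the extra work done in `to_cupy`. As a host-side illustration of that protocol pattern (using NumPy's analogous `__array_interface__`; this is a sketch, not cudf code):

```python
import numpy as np


class Wrapped:
    """Toy object exposing NumPy's __array_interface__ protocol.

    cudf Series analogously expose __cuda_array_interface__, which is
    what lets cp.asarray wrap the device buffer without a copy.
    """

    def __init__(self, arr):
        self._arr = arr

    @property
    def __array_interface__(self):
        return self._arr.__array_interface__


backing = np.arange(10_000)
out = np.asarray(Wrapped(backing))  # wraps the exposed buffer, no copy
assert np.shares_memory(out, backing)
```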
There are at least two obvious potential culprits in `Frame._to_array` (the underlying method for `to_cupy`):

- It always performs an extra allocation, even when `copy=False`.
- It performs dtype inference using `find_common_type`, which is slow (and slower for `DataFrame`s with many columns):
```python
In [11]: df = cudf.DataFrame({'a': [1], 'b': [3.], 'c': ['a']})

In [12]: %timeit cudf.utils.dtypes.find_common_type([col.dtype for col in df._data.values()])
53.6 µs ± 530 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [13]: df = cudf.DataFrame({'a': [1], 'b': [3.]})

In [14]: %timeit cudf.utils.dtypes.find_common_type([col.dtype for col in df._data.values()])
39.8 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
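One possible mitigation for the dtype-inference cost is to short-circuit when all columns already share a dtype, falling back to full promotion only when needed. A sketch using NumPy's `np.result_type` as a stand-in for the promotion logic (`common_dtype` here is hypothetical, not cudf's API):

```python
import numpy as np


def common_dtype(dtypes):
    # Hypothetical fast path: skip promotion entirely when all dtypes
    # already match, which is the common case for to_cupy().
    first = dtypes[0]
    if all(dt == first for dt in dtypes):
        return first
    # Fall back to full promotion (stand-in for find_common_type).
    return np.result_type(*dtypes)


assert common_dtype([np.dtype("int64")] * 3) == np.dtype("int64")
assert common_dtype([np.dtype("int64"), np.dtype("float64")]) == np.dtype("float64")
```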
The implementation of `values` drops down to `ColumnBase.values` and requires some deeper consideration. However, since we use `.values` frequently internally (and we occasionally use `to_cupy`), we are likely giving up a lot of performance. We should profile these functions to determine the bottlenecks, and if there are valid reasons for them, we should establish some policies on how to select the right function to use when performing these conversions to arrays internally. While this exact analogy does not hold for `DataFrame` (because that doesn't support the conversion to an array), any optimization that we make for `Series` will likely also help speed up `DataFrame` operations.
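A generic way to start the profiling suggested above is `cProfile` over repeated calls; the `convert` target below is a placeholder, and in practice one would point this at `s.to_cupy`, `s.values`, and `cp.asarray(s)`:

```python
import cProfile
import io
import pstats


def profile_calls(fn, *args, repeat=1000):
    """Run fn repeatedly under cProfile and return the top entries
    sorted by cumulative time."""
    prof = cProfile.Profile()
    prof.enable()
    for _ in range(repeat):
        fn(*args)
    prof.disable()
    buf = io.StringIO()
    pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(10)
    return buf.getvalue()


def convert(values):
    # Placeholder conversion; substitute the cudf call being profiled.
    return list(values)


report = profile_calls(convert, range(100))
assert "convert" in report
```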