Optimize to_cupy and values #11648

Open
@vyasr

Description

Currently series.values and especially series.to_cupy() are substantially slower than cupy.asarray(series).

In [2]: s = cudf.Series(range(10000))

In [3]: %timeit s.values
81.4 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [4]: %timeit cp.asarray(s)
19.1 µs ± 168 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [5]: %timeit s.to_cupy()
349 µs ± 75.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
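The comparison above can also be run outside IPython. The following is a minimal sketch, not part of cuDF: `time_conversions` and `run_gpu_comparison` are hypothetical helper names, and the GPU part assumes `cudf` and `cupy` are installed with a working device. It measures host-side call overhead only and does not synchronize the device.

```python
import timeit
from typing import Callable, Dict


def time_conversions(candidates: Dict[str, Callable[[], object]],
                     number: int = 1000, repeat: int = 5) -> Dict[str, float]:
    """Return the best per-call time in microseconds for each candidate."""
    results = {}
    for name, fn in candidates.items():
        # timeit.repeat returns one total (in seconds) per repeat,
        # each covering `number` calls; keep the fastest repeat.
        totals = timeit.repeat(fn, number=number, repeat=repeat)
        results[name] = min(totals) / number * 1e6
    return results


def run_gpu_comparison() -> None:
    """Time the three conversion paths from the issue (requires cudf and cupy)."""
    import cudf
    import cupy as cp

    s = cudf.Series(range(10000))
    timings = time_conversions({
        "s.values": lambda: s.values,
        "cp.asarray(s)": lambda: cp.asarray(s),
        "s.to_cupy()": lambda: s.to_cupy(),
    })
    # Print fastest first.
    for name, usec in sorted(timings.items(), key=lambda kv: kv[1]):
        print(f"{name:>15}: {usec:8.1f} us per call")
```

Calling `run_gpu_comparison()` on a machine with a GPU prints one line per conversion path; absolute numbers will differ from the ones quoted above depending on hardware and cuDF version.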

There are at least two obvious potential culprits in Frame._to_array (the underlying method for to_cupy):

In [11]: df = cudf.DataFrame({'a': [1], 'b': [3.], 'c': ['a']})

In [12]: %timeit cudf.utils.dtypes.find_common_type([col.dtype for col in df._data.values()])
53.6 µs ± 530 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [13]: df = cudf.DataFrame({'a': [1], 'b': [3.]})

In [14]: %timeit cudf.utils.dtypes.find_common_type([col.dtype for col in df._data.values()])
39.8 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
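One possible mitigation, offered only as a sketch, is to skip the common-type lookup when every column already has the same dtype, which is always the case for a Series. The helper below is hypothetical and not part of cuDF. It takes the resolver as a parameter so that it stays independent of `find_common_type`, and it assumes that resolving a list of identical dtypes would return that same dtype, which would need to be checked against cuDF's actual promotion rules.

```python
from typing import Callable, Hashable, Sequence, TypeVar

D = TypeVar("D", bound=Hashable)


def common_dtype_with_fast_path(dtypes: Sequence[D],
                                resolve: Callable[[Sequence[D]], D]) -> D:
    """Return the shared dtype directly when all inputs match.

    `resolve` stands in for the expensive lookup (for example
    cudf.utils.dtypes.find_common_type) and is called only when
    the dtypes actually differ.
    """
    if not dtypes:
        raise ValueError("need at least one dtype")
    first = dtypes[0]
    # Assumption: identical dtypes promote to themselves.
    if all(dt == first for dt in dtypes[1:]):
        return first
    return resolve(dtypes)
```

With this shape, a single-column frame never pays for the lookup, while mixed-dtype frames behave exactly as the resolver dictates.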

The implementation of values drops down to ColumnBase.values and requires some deeper consideration. However, since we use .values frequently internally (and occasionally use to_cupy), we are likely giving up a lot of performance. We should profile these functions to determine the bottlenecks. If the slower paths turn out to exist for valid reasons, we should establish a policy for choosing the right function when converting to arrays internally. The exact comparison above does not carry over to DataFrame (because it does not support the conversion to an array), but any optimization that we make for Series will likely also speed up DataFrame operations.
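As a starting point for the profiling suggested above, a small wrapper around the standard library's `cProfile` may be enough to show where the time goes. This is a sketch under stated assumptions: `profile_calls` is a hypothetical helper name, and the callable is invoked repeatedly so that per-call Python overhead dominates the report.

```python
import cProfile
import io
import pstats
from typing import Callable


def profile_calls(fn: Callable[[], object], calls: int = 1000, top: int = 15) -> str:
    """Profile `calls` invocations of `fn` and return a text report
    of the `top` entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(calls):
        fn()
    profiler.disable()

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


# Hypothetical usage, assuming `s` is a cudf.Series:
#     print(profile_calls(s.to_cupy))
#     print(profile_calls(lambda: s.values))
```

Sorting by cumulative time puts entry points such as Frame._to_array near the top, with the helpers they call listed beneath them, which makes it easier to see which step accounts for most of the overhead.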

Metadata

Assignees

Labels

Performance: Performance related issue
Python: Affects Python cuDF API.
improvement: Improvement / enhancement to an existing function

Type

No type

Projects

Status

In Progress

Relationships

None yet

Development

No branches or pull requests