Description
Is your feature request related to a problem? Please describe.
DataFrame.dtypes is used in many places in the code. For pandas compatibility, this method constructs a pd.Series from the column dtypes. This construction introduces unnecessary overhead that could be avoided, especially because in many cases the output is immediately converted back to a list or a {colname: dtype} dict.
Describe the solution you'd like
DataFrame.dtypes should be decorated with _external_only_api. All internal usage should instead use Frame._dtypes, which simply constructs a dict and avoids the unnecessary overhead. Here's a quick indication of the benefits:
In [1]: import cudf
In [2]: df = cudf.DataFrame({f"{i}": [i] for i in range(10)})
In [3]: %timeit df._dtypes
2.31 µs ± 15.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [4]: %timeit df.dtypes
165 µs ± 314 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [5]: df = cudf.DataFrame({f"{i}": [i] for i in range(100)})
In [6]: %timeit df._dtypes
13.9 µs ± 47.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [7]: %timeit df.dtypes
316 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Describe alternatives you've considered
None
Additional context
If there is any internal functionality that is actually relying on the output of dtypes being a Series, we should carefully consider whether that method should be reimplemented. There is almost no reason that a Series should be preferable to a dict internally.
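As a rough illustration of why a dict usually suffices, the sketch below shows dict-based equivalents for a few things internal code might do with a dtypes Series. The helper names are hypothetical, and plain strings stand in for real dtype objects; the pandas idioms appear only in comments for comparison.

```python
def columns_with_dtype(dtypes, wanted):
    # Series idiom: dtypes[dtypes == wanted].index
    return [name for name, dtype in dtypes.items() if dtype == wanted]


def is_homogeneous(dtypes):
    # Series idiom: dtypes.nunique() == 1
    return len(set(dtypes.values())) == 1


def dtype_list(dtypes):
    # Series idiom: dtypes.tolist()
    return list(dtypes.values())


# Hypothetical stand-in for a {colname: dtype} mapping.
example = {"id": "int64", "price": "float64", "name": "object"}
float_columns = columns_with_dtype(example, "float64")
```

Because dicts preserve insertion order, the column order seen by these helpers matches the order in which the mapping was built.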