[FEA] Mark DataFrame.dtypes as _external_only_api #11458

Open
@vyasr

Is your feature request related to a problem? Please describe.
DataFrame.dtypes is used in many places in the code. For pandas compatibility, this property constructs a pd.Series from the column dtypes. That construction adds overhead that internal callers do not need, especially because in many cases the output is immediately converted back to a list or a {colname: dtype} dict.

Describe the solution you'd like
DataFrame.dtypes should be decorated with _external_only_api. All internal usage should be switched to Frame._dtypes, which simply constructs a dict and avoids the overhead. Here's a quick indication of the benefits:

In [1]: import cudf

In [2]: df = cudf.DataFrame({f"{i}": [i] for i in range(10)})

In [3]: %timeit df._dtypes
2.31 µs ± 15.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [4]: %timeit df.dtypes
165 µs ± 314 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [5]: df = cudf.DataFrame({f"{i}": [i] for i in range(100)})

In [6]: %timeit df._dtypes
13.9 µs ± 47.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [7]: %timeit df.dtypes
316 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
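
To make the proposal concrete, here is a minimal, self-contained sketch of how an "external only" guard could work. It is not cudf's actual `_external_only_api` implementation: the `external_only_api` decorator, the `ToyFrame` class, and the `mylib` package name are all hypothetical, and a list of pairs stands in for the `pd.Series` that the public property would build.

```python
import functools
import sys

# Hypothetical name of the package whose own code must not call the public API.
INTERNAL_PACKAGE = "mylib"


def external_only_api(alternative):
    """Raise if the decorated function is called from inside INTERNAL_PACKAGE."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Look at the module that the calling Python frame belongs to.
            caller = sys._getframe(1).f_globals.get("__name__", "")
            if caller == INTERNAL_PACKAGE or caller.startswith(INTERNAL_PACKAGE + "."):
                raise RuntimeError(
                    f"{func.__qualname__} is external-only; "
                    f"use {alternative} internally."
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


class ToyFrame:
    """Toy stand-in for a frame; stores only {column name: dtype name}."""

    def __init__(self, column_dtypes):
        self._column_dtypes = dict(column_dtypes)

    @property
    def _dtypes(self):
        # Cheap internal accessor: just a dict.
        return dict(self._column_dtypes)

    @property
    @external_only_api(alternative="ToyFrame._dtypes")
    def dtypes(self):
        # Public accessor: a list of pairs stands in for the more
        # expensive pandas-compatible Series construction.
        return list(self._column_dtypes.items())
```

With a guard like this, user code can keep calling `dtypes`, while any call originating from a module inside the guarded package fails loudly and names the cheaper alternative.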

Describe alternatives you've considered
None

Additional context
If any internal functionality actually relies on the output of dtypes being a Series, we should carefully consider whether that code should be reimplemented. There is almost no reason for a Series to be preferable to a dict internally.
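
As a rough illustration of why a dict is usually enough, the sketch below rewrites a few Series-style dtype queries against a plain `{colname: dtype}` mapping. The helper names are hypothetical, and dtype names are plain strings here so the example runs without cudf or pandas.

```python
def all_columns_have_dtype(dtypes, expected):
    # Series-style equivalent: (df.dtypes == expected).all()
    return all(dt == expected for dt in dtypes.values())


def columns_with_dtype(dtypes, expected):
    # Series-style equivalent: df.dtypes[df.dtypes == expected].index
    return [name for name, dt in dtypes.items() if dt == expected]


def distinct_dtypes(dtypes):
    # Series-style equivalent: set(df.dtypes.unique())
    return set(dtypes.values())


# Stand-in for what Frame._dtypes would return.
example_dtypes = {"a": "int64", "b": "float64", "c": "int64"}

int_columns = columns_with_dtype(example_dtypes, "int64")
is_homogeneous = len(distinct_dtypes(example_dtypes)) == 1
```

Per-column lookup (`dtypes[name]`) and iteration (`dtypes.items()`) already read the same on a dict as on a Series, so many call sites may need no change beyond swapping the accessor.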
