Skip to content

[FEA] Reduce implicit conversions to minimize unnecessary column materialization #9593

Open
@vyasr

Description

@vyasr

Is your feature request related to a problem? Please describe.
A number of deceptively harmless functions in cuDF Python actually involve insidious memory allocations that in aggregate have substantial deleterious impacts on performance. The two most prominent examples of this are ColumnBase.astype calls and all RangeIndex logic that involves materializing an Int64Index, although others may exist as well. While these conversions are sometimes unavoidable, in many cases they are not. RangeIndex conversions are particularly harmful because the class is designed precisely to avoid such overheads, so many internal code paths are likely creating unnecessary and unexpected memory pressure.

Describe the solution you'd like
Type conversions that are currently performed deep in the call stack should be moved to the highest possible level. This change would prevent APIs from performing the cast except where absolutely necessary, and at minimum would reduce the number of calls to validation functions like is_categorical_dtype. Internal functions should accept a relatively narrow set of parameter types, and only user-facing functions should perform casts to handle the near-arbitrary user data that pandas APIs require us to accept.

On the Index front, there are a number of tasks that should be tackled separately. Roughly, the following two tasks should be possible to address in parallel:

  • Currently RangeIndex implicitly defers all methods that aren't implemented to Int64Index via the __getattr__ override. We should remove this override and actually implement the relevant methods in RangeIndex. In cases where materializing an Int64Index is necessary, we should exploit the sort of monkey-patching that we do for defining unary and binary operations for RangeIndex to add those methods explicitly. In other words, we should use an allowlist rather than a blocklist to choose methods where the conversion happens.
  • BaseIndex should be reduced as closely as possible to an abstract class. While there are a subset of APIs that truly make sense for all types of index objects, in almost all cases the optimal implementation for RangeIndex (and MultiIndex, for that matter) is very different from the implementation for GenericIndex. In addition, this change reduces cognitive load for developers by simplifying the inheritance hierarchy.

Once those two tasks are accomplished to whatever degree makes sense, we should revisit removing two attributes that contribute to this problem:

  • BaseIndex._values: This property doesn't really make sense since it's internal-only and there isn't a useful implementation for MultiIndex. It's a trivial alias for _column in GenericIndex, but it leads to automatic Int64Index creation when called on a RangeIndex. Removing the property would encourage developers to prevent that. Addressing the first two issues above should reduce the need for this property substantially and make it easier to assess where we're really using it.
  • BaseIndex._data: We promise that this property will exist so that BaseIndex looks like a Frame in places that it needs to, but this again causes unnecessary memory allocations when used for RangeIndex. Any code path that is relying on this attribute existing for RangeIndex should be rewritten, again only explicitly creating a column when necessary. Only Frame methods should really expect _data to work.

Finally, we should also be cognizant of where we copy indexes. Since indexes are immutable, in many cases we can avoid making copies altogether. It is fine for multiple frames etc to refer to the same index since any operation that "mutates" an index should just be creating a new index object and assigning it under the hood.

Additional context
This issue was originally raised in the cudf refactoring roadmap, but after being brought up in #9558 (comment) I've documented it as a Github issue for greater visibility.

Metadata

Metadata

Assignees

No one assigned

    Labels

    PerformancePerformance related issuePythonAffects Python cuDF API.feature requestNew feature or request

    Type

    No type

    Projects

    Status

    Todo

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions