[FEA] Reduce implicit conversions to minimize unnecessary column materialization

**Is your feature request related to a problem? Please describe.**
A number of deceptively harmless functions in cuDF Python actually involve insidious memory allocations that in aggregate have substantial deleterious impacts on performance. The two most prominent examples of this are `ColumnBase.astype` calls and all `RangeIndex` logic that involves materializing an `Int64Index`, although others may exist as well. While these conversions are sometimes unavoidable, in many cases they are not. `RangeIndex` conversions are particularly harmful because the class is designed precisely to avoid such overheads, so many internal code paths are likely creating unnecessary and unexpected memory pressure.

**Describe the solution you'd like**
Type conversions that are currently performed deep in the call stack should be moved to the highest possible level. This change would prevent APIs from performing the cast except where absolutely necessary, and at minimum would reduce the number of calls to validation functions like `is_categorical_dtype`. Internal functions should accept a relatively narrow set of parameter types, and only user-facing functions should perform casts to handle the near-arbitrary user data that pandas APIs require us to accept.

On the Index front, there are a number of tasks that should be tackled separately. Roughly, the following two tasks should be possible to address in parallel:
- Currently `RangeIndex` implicitly defers all methods that aren't implemented to `Int64Index` via the `__getattr__` override. We should remove this override and actually implement the relevant methods in `RangeIndex`. In cases where materializing an `Int64Index` is necessary, we should exploit the sort of monkey-patching that we do for defining unary and binary operations for `RangeIndex` to add those methods explicitly. In other words, we should use an allowlist rather than a blocklist to choose methods where the conversion happens. 
  - Done in #10538
- `BaseIndex` should be reduced as closely as possible to an abstract class. While there are a subset of APIs that truly make sense for all types of index objects, in almost all cases the optimal implementation for `RangeIndex` (and `MultiIndex`, for that matter) is very different from the implementation for `GenericIndex`. In addition, this change reduces cognitive load for developers by simplifying the inheritance hierarchy.
  - Done in #10389

Once those two tasks are accomplished to whatever degree makes sense, we should revisit removing two attributes that contribute to this problem:
- `BaseIndex._values`: This property doesn't really make sense since it's internal-only and there isn't a useful implementation for MultiIndex. It's a trivial alias for `_column` in `GenericIndex`, but it leads to automatic Int64Index creation when called on a `RangeIndex`. Removing the property would encourage developers to prevent that. Addressing the first two issues above should reduce the need for this property substantially and make it easier to assess where we're really using it.
- `BaseIndex._data`: We promise that this property will exist so that `BaseIndex` looks like a `Frame` in places that it needs to, but this again causes unnecessary memory allocations when used for `RangeIndex`. Any code path that is relying on this attribute existing for `RangeIndex` should be rewritten, again only explicitly creating a column when necessary. Only `Frame` methods should really expect `_data` to work.

Finally, we should also be cognizant of where we copy indexes. Since indexes are immutable, in many cases we can avoid making copies altogether. It is fine for multiple frames etc to refer to the same index since any operation that "mutates" an index should just be creating a new index object and assigning it under the hood.

**Additional context**
This issue was originally raised in the cudf refactoring roadmap, but after being brought up in https://github.com/rapidsai/cudf/pull/9558#discussion_r740584876 I've documented it as a Github issue for greater visibility.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[FEA] Reduce implicit conversions to minimize unnecessary column materialization #9593

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[FEA] Reduce implicit conversions to minimize unnecessary column materialization #9593

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions