Description
Is your feature request related to a problem? Please describe.
A number of deceptively harmless functions in cuDF Python actually involve insidious memory allocations that in aggregate have substantial deleterious impacts on performance. The two most prominent examples of this are ColumnBase.astype
calls and all RangeIndex
logic that involves materializing an Int64Index
, although others may exist as well. While these conversions are sometimes unavoidable, in many cases they are not. RangeIndex
conversions are particularly harmful because the class is designed precisely to avoid such overheads, so many internal code paths are likely creating unnecessary and unexpected memory pressure.
Describe the solution you'd like
Type conversions that are currently performed deep in the call stack should be moved to the highest possible level. This change would prevent APIs from performing the cast except where absolutely necessary, and at minimum would reduce the number of calls to validation functions like is_categorical_dtype
. Internal functions should accept a relatively narrow set of parameter types, and only user-facing functions should perform casts to handle the near-arbitrary user data that pandas APIs require us to accept.
On the Index front, there are a number of tasks that should be tackled separately. Roughly, the following two tasks should be possible to address in parallel:
- Currently
RangeIndex
implicitly defers all methods that aren't implemented toInt64Index
via the__getattr__
override. We should remove this override and actually implement the relevant methods inRangeIndex
. In cases where materializing anInt64Index
is necessary, we should exploit the sort of monkey-patching that we do for defining unary and binary operations forRangeIndex
to add those methods explicitly. In other words, we should use an allowlist rather than a blocklist to choose methods where the conversion happens. BaseIndex
should be reduced as closely as possible to an abstract class. While there are a subset of APIs that truly make sense for all types of index objects, in almost all cases the optimal implementation forRangeIndex
(andMultiIndex
, for that matter) is very different from the implementation forGenericIndex
. In addition, this change reduces cognitive load for developers by simplifying the inheritance hierarchy.
Once those two tasks are accomplished to whatever degree makes sense, we should revisit removing two attributes that contribute to this problem:
BaseIndex._values
: This property doesn't really make sense since it's internal-only and there isn't a useful implementation for MultiIndex. It's a trivial alias for_column
inGenericIndex
, but it leads to automatic Int64Index creation when called on aRangeIndex
. Removing the property would encourage developers to prevent that. Addressing the first two issues above should reduce the need for this property substantially and make it easier to assess where we're really using it.BaseIndex._data
: We promise that this property will exist so thatBaseIndex
looks like aFrame
in places that it needs to, but this again causes unnecessary memory allocations when used forRangeIndex
. Any code path that is relying on this attribute existing forRangeIndex
should be rewritten, again only explicitly creating a column when necessary. OnlyFrame
methods should really expect_data
to work.
Finally, we should also be cognizant of where we copy indexes. Since indexes are immutable, in many cases we can avoid making copies altogether. It is fine for multiple frames etc to refer to the same index since any operation that "mutates" an index should just be creating a new index object and assigning it under the hood.
Additional context
This issue was originally raised in the cudf refactoring roadmap, but after being brought up in #9558 (comment) I've documented it as a Github issue for greater visibility.
Metadata
Metadata
Assignees
Type
Projects
Status