Skip to content

feat: ArrowDataFrame.explode #1644

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 27 commits into
base: main
Choose a base branch
from
Draft

feat: ArrowDataFrame.explode #1644

wants to merge 27 commits into from

Conversation

FBruzzesi
Copy link
Member

@FBruzzesi FBruzzesi commented Dec 22, 2024

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

I will leave this as draft until we decide how to move forward.

To summarize the discussion(s) in #1542 :

  • pyarrow native methods ignore nulls and empty list in explode
  • the workaround here is to have a "fast_path" for when nulls or empty lists are not present, and a dedicated path for when they are
  • the issue is that we enter python world to create the index via one .to_pylist() call
  • pandas seems to enter cython anyway to explode a list

@dangotbanned dangotbanned mentioned this pull request Mar 17, 2025
12 tasks
@dangotbanned dangotbanned added enhancement New feature or request pyarrow Issue is related to pyarrow backend labels Mar 25, 2025
@dangotbanned
Copy link
Member

dangotbanned commented Mar 25, 2025

@FBruzzesi I feel like this shouldn't have got lost!

ArrowDataFrame.explode is 1 of 3 remaining implementations we need

explode = not_implemented()

join_asof = not_implemented()

I might add a PR for ArrowDataFrame.clone - since it can just utilize arrow data being immutable

clone = not_implemented()

@FBruzzesi
Copy link
Member Author

FBruzzesi commented Mar 25, 2025

I feel like this shouldn't have got lost!

Thanks @dangotbanned β™₯️ The main concern was a conversion to python object: filled_counts.to_pylist() in:

    if fast_path:
        indices = pc.list_parent_indices(native_frame[to_explode[0]])
        flatten_func = pc.list_flatten

    else:
        filled_counts = pc.max_element_wise(counts, 1, skip_nulls=True)
        indices = pa.array(
            [
                i
                for i, count in enumerate(filled_counts.to_pylist())
                for _ in range(count)
            ]
        )

@dangotbanned
Copy link
Member

#1644 (comment)

Maybe we can figure out another path hidden somewhere in the stubs? πŸ€”

dangotbanned added a commit that referenced this pull request Mar 25, 2025
`.to_pylist` being called on a scalar is all that is left
@dangotbanned
Copy link
Member

dangotbanned commented Mar 25, 2025

@FBruzzesi @MarcoGorelli

It seems like polars wants to make a breaking change in the next major version - resulting in the same behavior as pyarrow.

If we had that behavior as the goal - I think pc.list_flatten(..., recursive=True) would get us most of the way there.
Just something to keep in mind for the future πŸ™‚

dangotbanned and others added 3 commits March 25, 2025 13:41
Just leaving as-is, since this'll probably change in the future
#1644 (comment)
> error: Incompatible redefinition (redefinition with type "Callable[[ChunkedArray[ListScalar[Any]]], ChunkedArray[Any]]", original type overloaded function)  [misc]

https://github.com/narwhals-dev/narwhals/actions/runs/14060304329/job/39369169923?pr=1644
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request pyarrow Issue is related to pyarrow backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants