Skip to content

GH-46606: [Python] Do not require numpy when normalizing slice #46732

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jun 10, 2025

Conversation

shu-kitamura
Copy link
Contributor

@shu-kitamura shu-kitamura commented Jun 6, 2025

Rationale for this change

Slicing an array in non-trivial steps raises an exception when Numpy is not installed.
#46606

What changes are included in this PR?

I changed np.arange(...) to list(range(...)) In python/pyarrow/array.pxi

Are these changes tested?

Yes

Are there any user-facing changes?

No

Copy link

github-actions bot commented Jun 6, 2025

⚠️ GitHub issue #46606 has been automatically assigned in GitHub to PR creator.

@AlenkaF
Copy link
Member

AlenkaF commented Jun 6, 2025

Thanks for the contribution @shu-kitamura !
It would be good to add a test case with the example from the issue to test_array.py. Conda Python 3.11 without NumPy CI job will show if this is working correctly.

@shu-kitamura
Copy link
Contributor Author

@AlenkaF
Thank you for your quick review.

I added the test test_slicing_with_non_trivial_step() to test_array.py.

I ran the test in an environment without numpy and confirmed that it passes.

~/py_projects/arrow/python/pyarrow$ pip3 list | grep numpy
~/py_projects/arrow/python/pyarrow$ pytest tests/test_array.py::test_slicing_with_non_trivial_step
============================================== test session starts ==============================================
platform linux -- Python 3.8.10, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/shusei/py_projects/arrow/python
configfile: setup.cfg
plugins: hypothesis-6.113.0
collected 1 item                                                                                                

tests/test_array.py .                                                                                     [100%]

=============================================== 1 passed in 0.38s ===============================================

@AlenkaF
Copy link
Member

AlenkaF commented Jun 7, 2025

Thanks!

Looking at test_array.py I see the test case should fit under test_array_slice_negative_step(). I guess this test has not been failing due to the numpy mark.

It is failing now for a different reason (see AppVeyor error) and the failure is connected. The issue is when the indices that feed into arrow_obj.take(indices) are an empty list.

@shu-kitamura
Copy link
Contributor Author

shu-kitamura commented Jun 7, 2025

Thanks!

I moved test test_slicing_with_non_trivial_step() under test_array_slice_negative_step().

The failure of the test test_array_slice_negative_step() is not yet resolved.
You're right, it seems that the test fails when arrow_obj.take(indices) is fed an empty list.

When using np.arange(start, stop, step), an empty ndarray was fed to arrow_obj.take(indices).
But when using list(range(start, stop, step)), an empty list was fed to arrow_obj.take(indices), which seems to cause test to fail.

@shu-kitamura
Copy link
Contributor Author

I added handling for the case where indices is an empty list.
The tests passed and AppVeyor built successfully.

@AlenkaF
Copy link
Member

AlenkaF commented Jun 9, 2025

Thanks for the updates!

I moved test test_slicing_with_non_trivial_step() under test_array_slice_negative_step().

What I meant earlier is that the test case using arr[::-1] is already effectively covered in test_array_slice_negative_step() via slice(None, None, -1), so there's no need to add an additional test as I originally suggested — sorry for that. We can add a comment in this specific slice case (# GH-46606) as it is done here.

With the change introduced in this PR, test_array_slice_negative_step() should now pass without requiring NumPy, which means the NumPy mark can (and should) be removed.

@shu-kitamura
Copy link
Contributor Author

shu-kitamura commented Jun 9, 2025

Thank you for reviewing it so many times.

What I meant earlier is that the test case using arr[::-1] is already effectively covered in test_array_slice_negative_step() via slice(None, None, -1), so there's no need to add an additional test as I originally suggested — sorry for that.

Sorry too. I misunderstood.
I removed test_slicing_with_non_trivial_step().

We can add a comment in this specific slice case (# GH-46606) as it is done here.

I added the comment # GH-46606 to the line slice(None, None, -1).

With the change introduced in this PR, test_array_slice_negative_step() should now pass without requiring NumPy, which means the NumPy mark can (and should) be removed.

I removed @pytest.mark.numpy from test_array_slice_negative_step()

@AlenkaF
Copy link
Member

AlenkaF commented Jun 9, 2025

Thanks! I have run the full CI, let's see how it goes =)

@shu-kitamura
Copy link
Contributor Author

Three CIs have failed.😭

I think the following failure is caused by using np.arrange in an environment without Numpy.
AMD64 Conda Python 3.11 without NumPy

I don't know about the other two yet, I'll look at the logs.

@AlenkaF
Copy link
Member

AlenkaF commented Jun 9, 2025

Three CIs have failed.😭

All good, that is why they are set - to make sure we do not miss anything (or as little as possible 😉 )

I think the following failure is caused by using np.arrange in an environment without Numpy. AMD64 Conda Python 3.11 without NumPy

Correct. Similar to what you have done in this PR, the test data needs to be updated to use list(range(..)) too.

I don't know about the other two yet, I'll look at the logs.

Other two are not connected.

@shu-kitamura
Copy link
Contributor Author

Correct. Similar to what you have done in this PR, the test data needs to be updated to use list(range(..)) too.

I fixed to not use np.arange in test_array_slice_negative_step()

Other two are not connected.

I'm sorry, but I don't understand what it means to "not connected."

@AlenkaF
Copy link
Member

AlenkaF commented Jun 9, 2025

I'm sorry, but I don't understand what it means to "not connected."

No problem. One other CI build that is failing has a known issue (#46516) and so is not connected to the changes in this PR and we can ignore it. Similar for the lint one, I can't find an open issue for it though.

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for the contribution @shu-kitamura !
@raulcd mind giving a sanity check before I merge?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 9, 2025
@raulcd
Copy link
Member

raulcd commented Jun 10, 2025

@github-actions crossbow submit -g python

Copy link

Revision: 9b3cb60

Submitted crossbow builds: ursacomputing/crossbow @ actions-f955378e43

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

@raulcd raulcd changed the title GH-46606: [Python] Weird exception when slicing an array with non-trivial step GH-46606: [Python] Do not require numpy when normalizing slice Jun 10, 2025
Copy link

⚠️ GitHub issue #46606 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I am running extended CI to double check and have updated the title of the PR to describe what we are doing. Will merge once CI run finishes if successful.
Thanks @AlenkaF for the reviews and @shu-kitamura for the PR!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Jun 10, 2025
@AlenkaF
Copy link
Member

AlenkaF commented Jun 10, 2025

The failing builds do not look related, though example-python-minimal-* should probably already be green? (they are not failing on one of my PRs from today), cc @raulcd

@raulcd
Copy link
Member

raulcd commented Jun 10, 2025

Yes, CI failures are unrelated, the example-python-minimal* are related to:

And the test-conda-python-emscripten was successful on retry.

@raulcd raulcd merged commit 494d0e3 into apache:main Jun 10, 2025
14 of 16 checks passed
@raulcd raulcd removed the awaiting merge Awaiting merge label Jun 10, 2025
@shu-kitamura shu-kitamura deleted the fix_normalize_slice branch June 10, 2025 12:53
@pitrou
Copy link
Member

pitrou commented Jun 10, 2025

Were any benchmarks run on this change? Calling list(range(...)) and converting it to a Arrow array afterwards is going to be significantly more costly (and memory-consuming) than np.arange.

@raulcd
Copy link
Member

raulcd commented Jun 10, 2025

Thanks @pitrou for taking a look, I should have pinged you on this one before merging.

Were any benchmarks run on this change?

I haven't run benchmarks, maybe we should validate the performance changes and if significant use numpy if available otherwise use the new code path?

Calling list(range(...)) and converting it to a Arrow array afterwards is going to be significantly more costly (and memory-consuming) than np.arange.

when you say converting it to a Arrow array afterwards you mean on the case of no indices being returned?

        if len(indices) == 0:
            return arrow_obj.slice(0, 0)

or when using the indices list on take? this would still have to convert from the Numpy array to an Arrow array on the previous case, right?
Is Pylist to Arrow array that much slow than from Numpy array to Arrow array?

@AlenkaF
Copy link
Member

AlenkaF commented Jun 10, 2025

Calling list(range(...)) and converting it to a Arrow array afterwards is going to be significantly more costly (and memory-consuming) than np.arange

We are not really converting numpy array or list to PyArrow array. We are only using a different path to construct indices to pass to pa.Array.take(indices). I would think that saving indices as a list and not a numpy array would not yield hight performance loss?

@pitrou
Copy link
Member

pitrou commented Jun 10, 2025

or when using the indices list on take?

Yes, this one.

this would still have to convert from the Numpy array to an Arrow array on the previous case, right?

np.arange is quick and Numpy to Arrow is zero-copy.

Is Pylist to Arrow array that much slow than from Numpy array to Arrow array?

Extremely slower as you have to convert generic Python objects to a contiguous native array.

>>> start, stop, step = 1, 1_000_000, 2

>>> %timeit np.arange(start, stop, step)
115 μs ± 741 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit pa.array(np.arange(start, stop, step))
120 μs ± 479 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

>>> %timeit list(range(start, stop, step))
13.1 ms ± 84.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit pa.array(list(range(start, stop, step)))
32.9 ms ± 56.9 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

And then:

>>> a = pa.array(np.arange(0, 2_000_000))
>>> %timeit a.take(np.arange(start, stop, step))
818 μs ± 1.86 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit a.take(list(range(start, stop, step)))
33 ms ± 101 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@raulcd
Copy link
Member

raulcd commented Jun 10, 2025

@pitrou @AlenkaF I've created this issue to follow up, let me know what you think:

Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 494d0e3.

There were 69 benchmark results with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants