GH-46606: [Python] Do not require numpy when normalizing slice #46732

shu-kitamura · 2025-06-06T17:23:26Z

Rationale for this change

Slicing an array with a non-trivial step raises an exception when NumPy is not installed.
#46606

What changes are included in this PR?

I changed np.arange(...) to list(range(...)) in python/pyarrow/array.pxi.
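
As a rough illustration of the idea (not the actual code in python/pyarrow/array.pxi): a slice whose step is not 1 cannot be expressed as an offset plus a length, so it is turned into an explicit list of indices. The sketch below uses only the standard library, and the helper name `indices_for_slice` is hypothetical.

```python
def indices_for_slice(key, length):
    """Return the positions selected by slice `key` in a sequence of `length` items.

    Hypothetical helper for illustration; not a pyarrow function.
    """
    # slice.indices() resolves None and negative values and clamps
    # start/stop to the sequence bounds, as built-in sequences do.
    start, stop, step = key.indices(length)
    # Plain-Python replacement for np.arange(start, stop, step).
    return list(range(start, stop, step))


print(indices_for_slice(slice(None, None, -1), 5))  # [4, 3, 2, 1, 0]
print(indices_for_slice(slice(1, None, 2), 5))      # [1, 3]
```

The resulting list can then be passed to take(), which is the only place NumPy was previously needed for this code path.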

Are these changes tested?

Yes

Are there any user-facing changes?

No

GitHub Issue: [Python] Weird exception when slicing an array with non-trivial step #46606

github-actions · 2025-06-06T17:23:51Z

⚠️ GitHub issue #46606 has been automatically assigned in GitHub to PR creator.

AlenkaF · 2025-06-06T18:39:24Z

Thanks for the contribution @shu-kitamura !
It would be good to add a test case with the example from the issue to test_array.py. The "Conda Python 3.11 without NumPy" CI job will show whether this works correctly.

shu-kitamura · 2025-06-07T01:44:51Z

@AlenkaF
Thank you for your quick review.

I added the test test_slicing_with_non_trivial_step() to test_array.py.

I ran the test in an environment without numpy and confirmed that it passes.

~/py_projects/arrow/python/pyarrow$ pip3 list | grep numpy
~/py_projects/arrow/python/pyarrow$ pytest tests/test_array.py::test_slicing_with_non_trivial_step
============================================== test session starts ==============================================
platform linux -- Python 3.8.10, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/shusei/py_projects/arrow/python
configfile: setup.cfg
plugins: hypothesis-6.113.0
collected 1 item                                                                                                

tests/test_array.py .                                                                                     [100%]

=============================================== 1 passed in 0.38s ===============================================

AlenkaF · 2025-06-07T10:40:07Z

Thanks!

Looking at test_array.py I see the test case should fit under test_array_slice_negative_step(). I guess this test has not been failing due to the numpy mark.

It is failing now for a different reason (see the AppVeyor error), and the failure is related to this change. The issue arises when the indices fed into arrow_obj.take(indices) are an empty list.

shu-kitamura · 2025-06-07T15:36:29Z

Thanks!

I moved the test test_slicing_with_non_trivial_step() under test_array_slice_negative_step().

The failure of test_array_slice_negative_step() is not yet resolved.
You're right, it seems that the test fails when arrow_obj.take(indices) is fed an empty list.

With np.arange(start, stop, step), an empty ndarray was fed to arrow_obj.take(indices).
With list(range(start, stop, step)), an empty list is fed to arrow_obj.take(indices) instead, which seems to cause the test to fail.

shu-kitamura · 2025-06-08T02:58:39Z

I added handling for the case where indices is an empty list.
The tests passed and AppVeyor built successfully.
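
A minimal sketch of that guard, assuming a stand-in class in place of a real Arrow array. `FakeArray`, `take_with_step`, and the error raised by `FakeArray.take` are all hypothetical and exist only so the example runs without pyarrow. Only the `len(indices) == 0` check with `slice(0, 0)` mirrors the change made here.

```python
class FakeArray:
    """Stand-in for an Arrow array, for illustration only (not pyarrow)."""

    def __init__(self, values):
        self.values = list(values)

    def slice(self, offset, length):
        return FakeArray(self.values[offset:offset + length])

    def take(self, indices):
        # Hypothetical failure, imitating take() rejecting an empty list.
        if len(indices) == 0:
            raise ValueError("empty index list")
        return FakeArray(self.values[i] for i in indices)


def take_with_step(arrow_obj, start, stop, step):
    indices = list(range(start, stop, step))
    # Nothing is selected, so return an empty slice instead of
    # calling take() with an empty list.
    if len(indices) == 0:
        return arrow_obj.slice(0, 0)
    return arrow_obj.take(indices)


arr = FakeArray(range(5))
print(take_with_step(arr, 4, -1, -1).values)  # [4, 3, 2, 1, 0]
print(take_with_step(arr, 3, 0, 2).values)    # []
```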

AlenkaF · 2025-06-09T07:15:21Z

Thanks for the updates!

I moved the test test_slicing_with_non_trivial_step() under test_array_slice_negative_step().

What I meant earlier is that the test case using arr[::-1] is already effectively covered in test_array_slice_negative_step() via slice(None, None, -1), so there's no need to add an additional test as I originally suggested — sorry for that. We can add a comment in this specific slice case (# GH-46606) as it is done here.

With the change introduced in this PR, test_array_slice_negative_step() should now pass without requiring NumPy, which means the NumPy mark can (and should) be removed.

shu-kitamura · 2025-06-09T10:47:40Z

Thank you for reviewing it so many times.

What I meant earlier is that the test case using arr[::-1] is already effectively covered in test_array_slice_negative_step() via slice(None, None, -1), so there's no need to add an additional test as I originally suggested — sorry for that.

Sorry too. I misunderstood.
I removed test_slicing_with_non_trivial_step().

We can add a comment in this specific slice case (# GH-46606) as it is done here.

I added the comment # GH-46606 to the line slice(None, None, -1).

With the change introduced in this PR, test_array_slice_negative_step() should now pass without requiring NumPy, which means the NumPy mark can (and should) be removed.

I removed @pytest.mark.numpy from test_array_slice_negative_step()

AlenkaF · 2025-06-09T10:49:19Z

Thanks! I have run the full CI, let's see how it goes =)

shu-kitamura · 2025-06-09T11:25:00Z

Three CIs have failed.😭

I think the following failure is caused by using np.arange in an environment without NumPy.
AMD64 Conda Python 3.11 without NumPy

I don't know about the other two yet, I'll look at the logs.

AlenkaF · 2025-06-09T11:33:06Z

Three CIs have failed.😭

All good, that is why they are set - to make sure we do not miss anything (or as little as possible 😉 )

I think the following failure is caused by using np.arange in an environment without NumPy. AMD64 Conda Python 3.11 without NumPy

Correct. Similar to what you have done in this PR, the test data needs to be updated to use list(range(..)) too.

I don't know about the other two yet, I'll look at the logs.

Other two are not connected.

shu-kitamura · 2025-06-09T11:55:16Z

Correct. Similar to what you have done in this PR, the test data needs to be updated to use list(range(..)) too.

I changed test_array_slice_negative_step() so that it no longer uses np.arange.

Other two are not connected.

I'm sorry, but I don't understand what "not connected" means.

AlenkaF · 2025-06-09T12:04:16Z

I'm sorry, but I don't understand what "not connected" means.

No problem. One of the other failing CI builds has a known issue (#46516), so it is not connected to the changes in this PR and we can ignore it. The same goes for the lint one, though I can't find an open issue for it.

AlenkaF

Thanks again for the contribution @shu-kitamura !
@raulcd mind giving a sanity check before I merge?

raulcd · 2025-06-10T10:00:26Z

@github-actions crossbow submit -g python

github-actions · 2025-06-10T10:03:04Z

Revision: 9b3cb60

Submitted crossbow builds: ursacomputing/crossbow @ actions-f955378e43

Task	Status
example-python-minimal-build-fedora-conda
example-python-minimal-build-ubuntu-venv
test-conda-python-3.10
test-conda-python-3.10-hdfs-2.9.2
test-conda-python-3.10-hdfs-3.2.1
test-conda-python-3.10-pandas-latest-numpy-latest
test-conda-python-3.11
test-conda-python-3.11-dask-latest
test-conda-python-3.11-dask-upstream_devel
test-conda-python-3.11-hypothesis
test-conda-python-3.11-pandas-latest-numpy-1.26
test-conda-python-3.11-pandas-latest-numpy-latest
test-conda-python-3.11-pandas-nightly-numpy-nightly
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly
test-conda-python-3.11-spark-master
test-conda-python-3.12
test-conda-python-3.12-cpython-debug
test-conda-python-3.13
test-conda-python-3.9
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5
test-conda-python-emscripten
test-cuda-python-ubuntu-22.04-cuda-11.7.1
test-debian-12-python-3-amd64
test-debian-12-python-3-i386
test-fedora-39-python-3
test-ubuntu-22.04-python-3
test-ubuntu-22.04-python-313-freethreading
test-ubuntu-24.04-python-3

github-actions · 2025-06-10T10:15:59Z

⚠️ GitHub issue #46606 has been automatically assigned in GitHub to PR creator.

raulcd

LGTM, I am running extended CI to double check and have updated the title of the PR to describe what we are doing. Will merge once CI run finishes if successful.
Thanks @AlenkaF for the reviews and @shu-kitamura for the PR!

AlenkaF · 2025-06-10T11:41:28Z

The failing builds do not look related, though example-python-minimal-* should probably already be green? (they are not failing on one of my PRs from today), cc @raulcd

raulcd · 2025-06-10T12:51:29Z

Yes, CI failures are unrelated, the example-python-minimal* are related to:

[Python] Jobs fail if Pyarrow version is not correctly generated due to missing remote dev tags #44803

And the test-conda-python-emscripten was successful on retry.

pitrou · 2025-06-10T13:50:32Z

Were any benchmarks run on this change? Calling list(range(...)) and converting it to an Arrow array afterwards is going to be significantly more costly (and memory-consuming) than np.arange.

raulcd · 2025-06-10T14:06:56Z

Thanks @pitrou for taking a look, I should have pinged you on this one before merging.

Were any benchmarks run on this change?

I haven't run benchmarks. Maybe we should measure the performance change and, if it is significant, use NumPy when it is available and fall back to the new code path otherwise?

Calling list(range(...)) and converting it to an Arrow array afterwards is going to be significantly more costly (and memory-consuming) than np.arange.

When you say "converting it to an Arrow array afterwards", do you mean the case where no indices are returned?

        if len(indices) == 0:
            return arrow_obj.slice(0, 0)

Or when using the indices list in take? In the previous case, this would still have to convert from the NumPy array to an Arrow array, right?
Is converting a Python list to an Arrow array that much slower than converting a NumPy array?

AlenkaF · 2025-06-10T14:18:27Z

Calling list(range(...)) and converting it to an Arrow array afterwards is going to be significantly more costly (and memory-consuming) than np.arange.

We are not really converting a NumPy array or a list to a PyArrow array. We are only using a different path to construct the indices passed to pa.Array.take(indices). I would think that storing the indices as a list rather than a NumPy array would not cause a large performance loss?

pitrou · 2025-06-10T14:26:06Z

Or when using the indices list in take?

Yes, this one.

In the previous case, this would still have to convert from the NumPy array to an Arrow array, right?

np.arange is quick, and NumPy to Arrow is zero-copy.

Is converting a Python list to an Arrow array that much slower than converting a NumPy array?

Much slower, as you have to convert generic Python objects to a contiguous native array.

>>> start, stop, step = 1, 1_000_000, 2

>>> %timeit np.arange(start, stop, step)
115 μs ± 741 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit pa.array(np.arange(start, stop, step))
120 μs ± 479 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

>>> %timeit list(range(start, stop, step))
13.1 ms ± 84.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit pa.array(list(range(start, stop, step)))
32.9 ms ± 56.9 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

And then:

>>> a = pa.array(np.arange(0, 2_000_000))
>>> %timeit a.take(np.arange(start, stop, step))
818 μs ± 1.86 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit a.take(list(range(start, stop, step)))
33 ms ± 101 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
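
The memory side of the same argument can be illustrated with the standard library alone. This is a sketch under assumptions: `array.array` stands in for a contiguous native buffer (it is neither NumPy nor Arrow), and the byte counts are what CPython reports via `sys.getsizeof`. No timings or sizes are claimed here; run it to see the numbers on a given machine.

```python
import sys
from array import array

start, stop, step = 1, 1_000_000, 2

boxed = list(range(start, stop, step))         # list of Python int objects
native = array("q", range(start, stop, step))  # contiguous signed 64-bit ints

# The list holds one pointer per element, plus a separate int object each.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(i) for i in boxed)
# The native buffer holds itemsize bytes per element and nothing else.
native_bytes = native.itemsize * len(native)

print(f"list of ints : {boxed_bytes / 1e6:.1f} MB")
print(f"native int64 : {native_bytes / 1e6:.1f} MB")
```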

raulcd · 2025-06-10T16:02:40Z

@pitrou @AlenkaF I've created this issue to follow up, let me know what you think:

[Python] If numpy is available use it for normalizing slice #46771
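
One possible shape for that follow-up, as a sketch only: the helper name `make_indices` is hypothetical, and this is not the code proposed in #46771.

```python
try:
    import numpy as np
except ImportError:  # NumPy is treated as an optional dependency
    np = None


def make_indices(start, stop, step):
    """Build indices for take(), preferring NumPy when it is installed."""
    if np is not None:
        # Fast path: a contiguous integer buffer.
        return np.arange(start, stop, step)
    # Fallback: works without NumPy, but builds Python int objects.
    return list(range(start, stop, step))


print(list(make_indices(1, 10, 2)))
```

Both branches produce the same index values, so callers only see a difference in speed and memory use.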

conbench-apache-arrow · 2025-06-10T21:37:41Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 494d0e3.

There were 69 benchmark results with an error:

Commit Run on arm64-t4g-2xlarge-linux at 2025-06-10 14:40:43Z
- tpch (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-07, scale_factor=1
- tpch (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-09, scale_factor=1
and 67 more (see the report linked below)

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

Changed from numpy Array to list(range(...))

50f9ec1

shu-kitamura requested review from AlenkaF, raulcd and rok as code owners June 6, 2025 17:23

github-actions bot added the Component: Python and awaiting review labels Jun 6, 2025

shu-kitamura added 2 commits June 7, 2025 10:34

add test

0758cb0

fix test case

70c876e

move test case

acda29b

Add handling for empty list

44260b8

delete test_slicing_with_non_trivial_step()

25904e9

Fixed test cases to not use np.arange

9b3cb60

AlenkaF approved these changes Jun 9, 2025


github-actions bot added the awaiting committer review label and removed the awaiting review label Jun 9, 2025

raulcd changed the title ~~GH-46606: [Python] Weird exception when slicing an array with non-trivial step~~ GH-46606: [Python] Do not require numpy when normalizing slice Jun 10, 2025

raulcd approved these changes Jun 10, 2025


github-actions bot added the awaiting merge label and removed the awaiting committer review label Jun 10, 2025

raulcd merged commit 494d0e3 into apache:main Jun 10, 2025
14 of 16 checks passed

raulcd removed the awaiting merge label Jun 10, 2025

raulcd mentioned this pull request Jun 10, 2025

[Python] Weird exception when slicing an array with non-trivial step #46606

Closed

shu-kitamura deleted the fix_normalize_slice branch June 10, 2025 12:53

raulcd mentioned this pull request Jun 10, 2025

[Python] If numpy is available use it for normalizing slice #46771

Open
