Running Awkward test suite as part of Numba-CUDA CI #3587

gmarkall · 2025-07-23T10:42:30Z

gmarkall
Jul 23, 2025

I'd like to ensure that Numba-CUDA changes don't break Awkward Array by having a CI job that tests Awkward with Numba-CUDA PR branches (and release tags). I'm not sure about the best way to go about doing this - I'd like to avoid building Awkward in CI if I can, and use the latest release, but I think I might need the source repo as well for the tests.

So far I have a flow that works locally:

mamba create -n awkward-test python=3.13 awkward cupy numba-cuda pyarrow pandas
git clone --recursive git@github.com:scikit-hep/awkward.git
cd awkward
git checkout <version of awkward that was installed by mamba>
cd tests-cuda
pytest -n auto .

For a CI flow I think it'd be a bit fiddly but I could get it to check out the right tag for the installed awkward version, but I wonder if there's a better way? Are there some improvements that can be made on the above steps?
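One way to take the fiddliness out of the "check out the right tag" step is to read the installed version from package metadata and derive the tag name from it. The sketch below assumes that Awkward release tags are the version string with a "v" prefix (for example v2.8.5). That naming is an assumption to verify against the repository's tag list, and the helper names are made up for illustration.

```python
from importlib.metadata import version


def awkward_tag(installed_version: str, prefix: str = "v") -> str:
    """Map an installed version string to a Git tag name.

    The "v" prefix is an assumption about how release tags are named.
    """
    return f"{prefix}{installed_version}"


def checkout_command(package: str = "awkward") -> list:
    """Build the git command that checks out the tag matching the installed package.

    Raises importlib.metadata.PackageNotFoundError if the package is not installed.
    """
    return ["git", "checkout", awkward_tag(version(package))]
```

In a CI step, the list returned by checkout_command() could be passed to subprocess.run(..., check=True) from inside the cloned repository.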

Many thanks in advance!

Answered by ariostas


ariostas · 2025-07-23T18:10:16Z

ariostas
Jul 23, 2025
Maintainer

Hi Graham, that proposal sounds good to me.

There are also some auto-generated tests that you can use. If you run nox -s prepare -- --tests, it will generate tests in the tests-cuda-kernels and tests-cuda-kernels-explicit directories.

So it could be something like this:

mamba create -n awkward-test python=3.13 awkward cupy numba-cuda pyarrow pandas nox
git clone --recursive git@github.com:scikit-hep/awkward.git
cd awkward
git checkout <version of awkward that was installed by mamba>
nox -s prepare -- --tests
pytest -n auto tests-cuda/ tests-cuda-kernels/ tests-cuda-kernels-explicit/
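For a CI job, the clone, prepare, and test steps above could be driven from a small script. This is a minimal sketch, not a tested pipeline. It swaps the SSH URL for the HTTPS form of the same repository (CI runners often have no SSH key), and it uses git clone --branch, which also accepts a tag, so that the submodules are initialised at the tagged commit. The tag string is passed in by the caller, and the function names are hypothetical.

```python
import subprocess
from typing import List, Optional, Tuple

# HTTPS form of the repository cloned in the steps above.
AWKWARD_REPO = "https://github.com/scikit-hep/awkward.git"
TEST_DIRS = ["tests-cuda/", "tests-cuda-kernels/", "tests-cuda-kernels-explicit/"]

Step = Tuple[List[str], Optional[str]]  # (command, working directory)


def ci_commands(tag: str, checkout_dir: str = "awkward") -> List[Step]:
    """Return the commands for the clone / generate / test flow, in order."""
    return [
        # --branch accepts a tag name and leaves a detached HEAD at that tag.
        (["git", "clone", "--recursive", "--branch", tag, AWKWARD_REPO, checkout_dir], None),
        (["nox", "-s", "prepare", "--", "--tests"], checkout_dir),
        (["pytest", "-n", "auto", *TEST_DIRS], checkout_dir),
    ]


def run_all(steps: List[Step], dry_run: bool = True) -> None:
    """Print each step and, unless dry_run is set, execute it and stop on failure."""
    for argv, cwd in steps:
        print(f"[{cwd or '.'}] {' '.join(argv)}")
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=True)
```

Calling run_all(ci_commands(tag)) only prints the commands; passing dry_run=False executes them.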

0 replies

gmarkall · 2025-11-24T17:01:57Z

gmarkall
Nov 24, 2025
Author

Thanks for the help here! I have this set up now, but I'm finding that one test seems to sometimes (mostly) fail:

______ test_unit_cudaawkward_RecordArray_reduce_nonlocal_outoffsets_64_3 _______
[gw1] linux -- Python 3.12.12 /pyenv/versions/3.12.12/bin/python

    def test_unit_cudaawkward_RecordArray_reduce_nonlocal_outoffsets_64_3():
        outoffsets = cupy.array([123, 123, 123], dtype=cupy.int64)
        outcarry = cupy.array([123, 123], dtype=cupy.int64)
        outlength = 2
        parents = cupy.array([1, 1], dtype=cupy.int64)
        lenparents = 2
        funcC = cupy_backend['awkward_RecordArray_reduce_nonlocal_outoffsets_64', cupy.int64, cupy.int64, cupy.int64]
        funcC(outoffsets, outcarry, parents, lenparents, outlength)
    
        try:
            ak_cu.synchronize_cuda()
        except Exception as e:
            if "not implemented for given n" in str(e):
                print("Not implemented for given n in compiled CUDA code (awkward_ListArray_combinations)")
            else:
                pytest.fail(f"Unexpected error raised: {e}: This test case shouldn't have raised an error")
        pytest_outoffsets = [0, 2, 2]
        cpt.assert_allclose(outoffsets[:len(pytest_outoffsets)], cupy.array(pytest_outoffsets))
        pytest_outcarry = [1, 0]
>       cpt.assert_allclose(outcarry[:len(pytest_outcarry)], cupy.array(pytest_outcarry))

funcC      = <CupyKernel awkward_RecordArray_reduce_nonlocal_outoffsets_64, int64, int64, int64>
lenparents = 2
outcarry   = array([1, 2])
outlength  = 2
outoffsets = array([0, 2, 2])
parents    = array([1, 1])
pytest_outcarry = [1, 0]
pytest_outoffsets = [0, 2, 2]

tests-cuda-kernels-explicit/test_unit_cudaawkward_RecordArray_reduce_nonlocal_outoffsets_64.py:83: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

actual = array([1, 2]), desired = array([1, 0]), rtol = 1e-07, atol = 0
err_msg = '', verbose = True

    def assert_allclose(actual, desired, rtol=1e-7, atol=0, err_msg='',
                        verbose=True):
        """Raises an AssertionError if objects are not equal up to desired tolerance.
    
        Args:
             actual(numpy.ndarray or cupy.ndarray): The actual object to check.
             desired(numpy.ndarray or cupy.ndarray): The desired, expected object.
             rtol(float): Relative tolerance.
             atol(float): Absolute tolerance.
             err_msg(str): The error message to be printed in case of failure.
             verbose(bool): If ``True``, the conflicting
                 values are appended to the error message.
    
        .. seealso:: :func:`numpy.testing.assert_allclose`
    
        """  # NOQA
>       numpy.testing.assert_allclose(
            cupy.asnumpy(actual), cupy.asnumpy(desired),
            rtol=rtol, atol=atol, err_msg=err_msg, verbose=verbose)
E       AssertionError: 
E       Not equal to tolerance rtol=1e-07, atol=0
E       
E       Mismatched elements: 1 / 2 (50%)
E       Max absolute difference among violations: 2
E       Max relative difference among violations: inf
E        ACTUAL: array([1, 2])
E        DESIRED: array([1, 0])

actual     = array([1, 2])
atol       = 0
desired    = array([1, 0])
err_msg    = ''
rtol       = 1e-07
verbose    = True

/pyenv/versions/3.12.12/lib/python3.12/site-packages/cupy/testing/_array.py:24: AssertionError

In the runs where this test passed, it was skipped with the message "Unable to generate any tests for kernel".

Can you suggest what I should look at to track down why this is happening please? Or, can this test just be skipped? I think the test is not related to Numba-CUDA so I think it might be OK to skip anyway, but I'm not sure how to control that.

If it adds any useful context:

The logs for a "good" run are: https://github.com/NVIDIA/numba-cuda/actions/runs/19530606137/job/55913157717
Logs for a "bad" run are: https://github.com/NVIDIA/numba-cuda/actions/runs/19640880237/job/56244556563?pr=607

(In the "bad" run, I was also attempting to fix some other failures by downgrading to CuPy 13.4, which seems to have helped with every other failure. But I note the "good" run also passed with 13.6, so I'm not sure what's going on there. But I think that may be tangential to the specific test above)
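On the question of how to skip a single test: independently of how the tests are generated, pytest's --deselect option drops an individual test by node ID at collection time. The sketch below builds the argument list. The node ID is assembled from the file path and test name shown in the traceback above, and the helper name is made up for illustration.

```python
# Node ID built from the file path and test name in the failure report above.
FLAKY_TEST = (
    "tests-cuda-kernels-explicit/"
    "test_unit_cudaawkward_RecordArray_reduce_nonlocal_outoffsets_64.py"
    "::test_unit_cudaawkward_RecordArray_reduce_nonlocal_outoffsets_64_3"
)


def pytest_args(test_dirs, deselect=()):
    """Build pytest arguments, adding one --deselect option per node ID."""
    args = ["-n", "auto", *test_dirs]  # -n auto needs pytest-xdist, as in the flow above
    for node_id in deselect:
        args += ["--deselect", node_id]
    return args
```

The resulting list can be handed to pytest.main(...) or appended to a pytest command line. Deselected tests are left out of the run rather than reported as skipped.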

0 replies

ianna · 2025-11-24T17:43:51Z

ianna
Nov 24, 2025
Maintainer

@gmarkall - Thanks for setting it up! I think the failing test is due to our (most probably buggy) kernel code in src/awkward/_connect/cuda/cuda_kernels/awkward_RecordArray_reduce_nonlocal_outoffsets_64.cu

0 replies

gmarkall · 2025-11-24T18:16:55Z

gmarkall
Nov 24, 2025
Author

Many thanks for clarifying this! I'll see if I can get it to skip that test. Is there a way I should be invoking nox to avoid generating this test?

2 replies

ianna Nov 24, 2025
Maintainer

Unfortunately, there is no way to exclude a generated test without editing the following

awkward/dev/generate-tests.py

Line 973 in c75037e

"awkward_RecordArray_reduce_nonlocal_outoffsets_64",

I see there are a few issues with the kernel. Let me try to fix it and we'll make a release asap. Thanks.

gmarkall Nov 24, 2025
Author

Many thanks for the quick reply - in our CI setup it would be quite straightforward to apply a patch there so I think there's no hurry to fix this issue and cut a release - I'll give this a try and follow up shortly.
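A patch step of this kind might look like the following sketch. It assumes that the quoted kernel name sits on its own line as a list entry in dev/generate-tests.py, as in the line quoted above, and that removing that entry is what stops the test from being generated. The thread implies this but does not spell it out. The function names and the default path are illustrative.

```python
from pathlib import Path

KERNEL = "awkward_RecordArray_reduce_nonlocal_outoffsets_64"


def drop_kernel_entry(source: str, kernel: str = KERNEL) -> str:
    """Remove lines that consist only of the quoted kernel name and a trailing comma."""
    target = f'"{kernel}",'
    kept = [line for line in source.splitlines(keepends=True) if line.strip() != target]
    return "".join(kept)


def patch_file(path: str = "dev/generate-tests.py") -> bool:
    """Rewrite the file in place; return True if anything was removed."""
    file = Path(path)
    original = file.read_text()
    patched = drop_kernel_entry(original)
    if patched == original:
        return False  # entry not found, e.g. after an upstream fix
    file.write_text(patched)
    return True
```

Matching the whole stripped line, rather than a substring, avoids touching other kernels whose names merely start with the same prefix. The False return lets the CI job notice when the patch no longer applies.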

gmarkall · 2025-11-24T22:46:58Z

gmarkall
Nov 24, 2025
Author

It looks like patching out the generation of that particular test worked. Can I check - does tests-cuda/test_3459_virtualarray_with_cuda.py have any known or likely issues? This one also seems to fail intermittently, e.g.:

FAILED tests-cuda/test_3459_virtualarray_with_cuda.py::test_listarray_nanargmin - AssertionError: assert False
 +  where False = <function array_equal at 0x741070f9e020>(<Array [] type='0 * ?int64'>, <Array [0, 0] type='2 * ?int64'>)
 +    where <function array_equal at 0x741070f9e020> = ak.array_equal

(from https://github.com/NVIDIA/numba-cuda/actions/runs/19648199281/job/56269593457?pr=607)

5 replies

ikrommyd Nov 24, 2025
Maintainer

Is there any pattern to when it fails? Or is it just random?

ikrommyd Nov 24, 2025
Maintainer

Asking because, as far as I remember, I have not seen such things in the Awkward CI GPU tests.

ikrommyd Nov 24, 2025
Maintainer

Oh yeah, the CI failure that you sent looks like a total CUDA crashout. For example:

FAILED tests-cuda/test_3459_virtualarray_with_cuda.py::test_listoffsetarray_drop_none - ValueError: Negative dimensions are not allowed

FAILED tests-cuda/test_3459_virtualarray_with_cuda.py::test_listoffsetarray_pad_none - OverflowError: can't convert negative value to size_t

ianna Nov 25, 2025
Maintainer

@gmarkall - please feel free to exclude the virtual array tests. This feature is still experimental and is not used on GPUs. Thanks!

gmarkall Nov 25, 2025
Author

Many thanks! It seems to fail randomly; I couldn't spot a pattern (though I haven't been through that many runs to find out).
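Excluding the virtual array tests, as suggested above, can be done at collection time with pytest's --ignore option, which skips a whole file. A minimal sketch follows. The file path is the one named in this thread, and the helper name is hypothetical.

```python
# Test module named earlier in this thread.
VIRTUAL_ARRAY_TESTS = "tests-cuda/test_3459_virtualarray_with_cuda.py"


def with_ignored_files(base_args, ignored):
    """Append one --ignore=<path> option per file that should not be collected."""
    return [*base_args, *(f"--ignore={path}" for path in ignored)]


# Example: the test directories from the flow above, minus the virtual array tests.
ARGS = with_ignored_files(
    ["-n", "auto", "tests-cuda/", "tests-cuda-kernels/", "tests-cuda-kernels-explicit/"],
    [VIRTUAL_ARRAY_TESTS],
)
```

ARGS can then be appended to a pytest invocation. Since the file is never collected, crashes inside it cannot affect the rest of the run's report.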
