[BUG] tests/test_umap.py::test_umap_precomputed_knn[dense] failed cudaErrorInvalidValue: invalid argument #893

@pxLi

Description

spark-rapids-ml_nightly, run: 690

FAILED tests/test_umap.py::test_umap_precomputed_knn[dense]
...
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidValue: invalid argument
[2025-04-17T04:01:53.900Z] Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 830, in main
[2025-04-17T04:01:53.900Z]     process()
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 822, in process
[2025-04-17T04:01:53.900Z]     serializer.dump_stream(out_iter, outfile)
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 345, in dump_stream
[2025-04-17T04:01:53.900Z]     return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 86, in dump_stream
[2025-04-17T04:01:53.900Z]     for batch in iterator:
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 338, in init_stream_yield_batches
[2025-04-17T04:01:53.900Z]     for series in iterator:
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 519, in func
[2025-04-17T04:01:53.900Z]     for result_batch, result_type in result_iter:
[2025-04-17T04:01:53.900Z]   File "/home/jenkins/agent/workspace/jenkins-spark-rapids-ml_nightly-690/python/src/spark_rapids_ml/umap.py", line 1155, in _train_udf
[2025-04-17T04:01:53.900Z]     embedding, raw_data = cuml_fit_func(inputs, params).values()
[2025-04-17T04:01:53.900Z]   File "/home/jenkins/agent/workspace/jenkins-spark-rapids-ml_nightly-690/python/src/spark_rapids_ml/umap.py", line 1008, in _cuml_fit
[2025-04-17T04:01:53.900Z]     umap_object = CumlUMAP(
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/cuml/internals/api_decorators.py", line 354, in processor
[2025-04-17T04:01:53.900Z]     return init_func(self, *args, **filtered_kwargs)
[2025-04-17T04:01:53.900Z]   File "umap.pyx", line 444, in cuml.manifold.umap.UMAP.__init__
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
[2025-04-17T04:01:53.900Z]     return func(*args, **kwargs)
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/cuml/common/sparsefuncs.py", line 300, in extract_knn_infos
[2025-04-17T04:01:53.900Z]     results = extract_pairwise_dists(knn_info, n_neighbors)
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
[2025-04-17T04:01:53.900Z]     return func(*args, **kwargs)
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/cuml/common/sparsefuncs.py", line 246, in extract_pairwise_dists
[2025-04-17T04:01:53.900Z]     pw_dists, _, _, _ = input_to_cupy_array(pw_dists)
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/nvtx/nvtx.py", line 122, in inner
[2025-04-17T04:01:53.900Z]     result = func(*args, **kwargs)
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/cuml/internals/input_utils.py", line 486, in input_to_cupy_array
[2025-04-17T04:01:53.900Z]     out_data = input_to_cuml_array(
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/nvtx/nvtx.py", line 122, in inner
[2025-04-17T04:01:53.900Z]     result = func(*args, **kwargs)
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/cuml/internals/input_utils.py", line 427, in input_to_cuml_array
[2025-04-17T04:01:53.900Z]     arr = CumlArray.from_input(
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
[2025-04-17T04:01:53.900Z]     return func(*args, **kwargs)
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/nvtx/nvtx.py", line 122, in inner
[2025-04-17T04:01:53.900Z]     result = func(*args, **kwargs)
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/cuml/internals/array.py", line 1185, in from_input
[2025-04-17T04:01:53.900Z]     arr.to_output(
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
[2025-04-17T04:01:53.900Z]     return func(*args, **kwargs)
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/nvtx/nvtx.py", line 122, in inner
[2025-04-17T04:01:53.900Z]     result = func(*args, **kwargs)
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/cuml/internals/array.py", line 655, in to_output
[2025-04-17T04:01:53.900Z]     return output_mem_type.xpy.asarray(
[2025-04-17T04:01:53.900Z]   File "/root/miniconda3/lib/python3.10/site-packages/cupy/_creation/from_data.py", line 88, in asarray
[2025-04-17T04:01:53.900Z]     return _core.array(a, dtype, False, order, blocking=blocking)
[2025-04-17T04:01:53.900Z]   File "cupy/_core/core.pyx", line 2455, in cupy._core.core.array
[2025-04-17T04:01:53.900Z]   File "cupy/_core/core.pyx", line 2482, in cupy._core.core.array
[2025-04-17T04:01:53.900Z]   File "cupy/_core/core.pyx", line 2643, in cupy._core.core._array_default
[2025-04-17T04:01:53.900Z]   File "cupy/_core/core.pyx", line 2738, in cupy._core.core._alloc_async_transfer_buffer
[2025-04-17T04:01:53.900Z]   File "cupy/_core/core.pyx", line 2735, in cupy._core.core._alloc_async_transfer_buffer
[2025-04-17T04:01:53.900Z]   File "cupy/cuda/pinned_memory.pyx", line 215, in cupy.cuda.pinned_memory.alloc_pinned_memory
[2025-04-17T04:01:53.900Z]   File "cupy/cuda/pinned_memory.pyx", line 289, in cupy.cuda.pinned_memory.PinnedMemoryPool.malloc
[2025-04-17T04:01:53.900Z]   File "cupy/cuda/pinned_memory.pyx", line 309, in cupy.cuda.pinned_memory.PinnedMemoryPool.malloc
[2025-04-17T04:01:53.901Z]   File "cupy/cuda/pinned_memory.pyx", line 306, in cupy.cuda.pinned_memory.PinnedMemoryPool.malloc
[2025-04-17T04:01:53.901Z]   File "cupy/cuda/pinned_memory.pyx", line 180, in cupy.cuda.pinned_memory._malloc
[2025-04-17T04:01:53.901Z]   File "cupy/cuda/pinned_memory.pyx", line 181, in cupy.cuda.pinned_memory._malloc
[2025-04-17T04:01:53.901Z]   File "cupy/cuda/pinned_memory.pyx", line 30, in cupy.cuda.pinned_memory.PinnedMemory.__init__
[2025-04-17T04:01:53.901Z]   File "cupy_backends/cuda/api/runtime.pyx", line 555, in cupy_backends.cuda.api.runtime.hostAlloc
[2025-04-17T04:01:53.901Z]   File "cupy_backends/cuda/api/runtime.pyx", line 146, in cupy_backends.cuda.api.runtime.check_status
[2025-04-17T04:01:53.901Z] cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidValue: invalid argument
[2025-04-17T04:01:53.901Z] 
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:561)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:118)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:514)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
[2025-04-17T04:01:53.901Z] 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
[2025-04-17T04:01:53.901Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchIterator.hasNext(ArrowConverters.scala:95)
[2025-04-17T04:01:53.901Z] 	at scala.collection.Iterator.foreach(Iterator.scala:943)
[2025-04-17T04:01:53.901Z] 	at scala.collection.Iterator.foreach$(Iterator.scala:943)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchIterator.foreach(ArrowConverters.scala:75)
[2025-04-17T04:01:53.901Z] 	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
[2025-04-17T04:01:53.901Z] 	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
[2025-04-17T04:01:53.901Z] 	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
[2025-04-17T04:01:53.901Z] 	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
[2025-04-17T04:01:53.901Z] 	at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
[2025-04-17T04:01:53.901Z] 	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchIterator.to(ArrowConverters.scala:75)
[2025-04-17T04:01:53.901Z] 	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
[2025-04-17T04:01:53.901Z] 	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchIterator.toBuffer(ArrowConverters.scala:75)
[2025-04-17T04:01:53.901Z] 	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
[2025-04-17T04:01:53.901Z] 	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchIterator.toArray(ArrowConverters.scala:75)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$6(Dataset.scala:4150)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.SparkContext.$anonfun$runJob$6(SparkContext.scala:2352)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:139)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
[2025-04-17T04:01:53.901Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
[2025-04-17T04:01:53.901Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2025-04-17T04:01:53.901Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2025-04-17T04:01:53.901Z] 	at java.lang.Thread.run(Thread.java:750)
[2025-04-17T04:01:53.901Z] = 1 failed, 685 passed, 2 skipped, 12 xpassed, 108615 warnings in 4000.60s (1:06:40) =
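Reading the Python traceback above from the bottom up, the error is raised in `cupy_backends.cuda.api.runtime.hostAlloc` while CuPy allocates a pinned host transfer buffer for `cupy.asarray`, which cuML reaches through `extract_pairwise_dists` during `CumlUMAP` construction. For triage, a small stdlib-only helper along these lines may help strip the Jenkins timestamps and pull out the Python frames from a log like this one. The function names (`strip_timestamp`, `python_frames`, `innermost_frame`) are hypothetical and not part of spark-rapids-ml, cuML, or CuPy; the regular expressions assume the exact timestamp and frame layout shown in this log.

```python
import re

# Jenkins console prefix such as "[2025-04-17T04:01:53.900Z] " (layout assumed from the log above).
TIMESTAMP = re.compile(r"^\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z\] ?")
# A Python traceback frame line: File "<path>", line <n>, in <function>
FRAME = re.compile(r'^\s*File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>.+)$')


def strip_timestamp(line):
    """Remove the leading Jenkins timestamp, if present."""
    return TIMESTAMP.sub("", line)


def python_frames(log_text):
    """Return (path, line_number, function) for every Python frame, outermost first."""
    frames = []
    for raw in log_text.splitlines():
        match = FRAME.match(strip_timestamp(raw))
        if match:
            frames.append((match["path"], int(match["line"]), match["func"].strip()))
    return frames


def innermost_frame(log_text):
    """Return the last (deepest) Python frame, or None if the log has no frames."""
    frames = python_frames(log_text)
    return frames[-1] if frames else None


# Four lines copied from the end of the Python traceback in this issue.
sample = """\
[2025-04-17T04:01:53.901Z]   File "cupy/cuda/pinned_memory.pyx", line 30, in cupy.cuda.pinned_memory.PinnedMemory.__init__
[2025-04-17T04:01:53.901Z]   File "cupy_backends/cuda/api/runtime.pyx", line 555, in cupy_backends.cuda.api.runtime.hostAlloc
[2025-04-17T04:01:53.901Z]   File "cupy_backends/cuda/api/runtime.pyx", line 146, in cupy_backends.cuda.api.runtime.check_status
[2025-04-17T04:01:53.901Z] cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidValue: invalid argument
"""

print(innermost_frame(sample))
```

The helper only summarizes the log; it does not explain why `hostAlloc` received an invalid argument, which still needs investigation on the nightly environment.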
