
spark-rapids-ml and spark-rapids Accelerator CUDARuntimeError: cudaErrorMemoryAllocation #552


Description

@nsaman

I'm using the RAPIDS Accelerator and spark-rapids-ml together and am hitting the error below. If the RAPIDS Accelerator is disabled, the job runs successfully. The documentation implies the two should be able to work together: https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/ml-integration.html#existing-ml-libraries.

Is there something I'm missing?
Setting spark.rapids.memory.gpu.pool=NONE seems to be the only documented suggestion for avoiding memory conflicts.
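
For reference, here is what I understand that guidance to look like in session-builder form — a minimal sketch, assuming the rapids-4-spark jar is already on the classpath (the commented allocFraction line is an alternative that only applies if pooling is left enabled):

from pyspark.sql import SparkSession

# Minimal sketch: run the RAPIDS Accelerator alongside spark-rapids-ml.
spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # Disable the plugin's pooled allocator so cupy/cuML inside the Python
    # workers can allocate GPU memory directly (the docs' interop suggestion).
    .config("spark.rapids.memory.gpu.pool", "NONE")
    # Alternative: keep pooling but cap the plugin's up-front allocation.
    # .config("spark.rapids.memory.gpu.allocFraction", "0.5")
    .getOrCreate()
)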

Environment:
Docker running on AWS SageMaker (ml.p3.2xlarge) (base: nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10)

Stacktrace

2024-01-23 01:29:21,462 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 76.0 (TID 382) (algo-2 executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/spark_rapids_ml/core.py", line 698, in _train_udf
    if cuda_managed_mem_enabled:
  File "/opt/conda/lib/python3.10/site-packages/spark_rapids_ml/core.py", line 383, in _set_gpu_device
    cupy.cuda.Device(gpu_id).use()
  File "cupy/cuda/device.pyx", line 192, in cupy.cuda.device.Device.use
  File "cupy/cuda/device.pyx", line 198, in cupy.cuda.device.Device.use
  File "cupy_backends/cuda/api/runtime.pyx", line 375, in cupy_backends.cuda.api.runtime.setDevice
  File "cupy_backends/cuda/api/runtime.pyx", line 144, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:545)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:101)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:50)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:498)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:86)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:80)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
    at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:670)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:424)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:259)
2024-01-23 01:29:21.480 INFO clientserver - close: Closing down clientserver connection
2024-01-23 01:29:21.487 INFO ModelService - process: Exception during processing: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
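
Side note: since the failure is inside cupy's device initialization, a quick check of free device memory from the same Python environment can show whether something has already claimed the card — a diagnostic sketch only, not part of the job:

import cupy

# Report free vs. total device memory before any other cupy work;
# cudaErrorMemoryAllocation on Device().use() typically means another
# allocator (e.g. the plugin's pool) already holds most of the GPU.
free_b, total_b = cupy.cuda.runtime.memGetInfo()
print(f"GPU free: {free_b / 2**30:.2f} GiB of {total_b / 2**30:.2f} GiB")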

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P0    23W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Dockerfile

FROM rapidsai/base:23.12-cuda11.2-py3.10

USER root

RUN apt-get update
RUN apt-get install -y openjdk-8-jdk curl zip unzip
# Fix certificate issues
RUN apt-get update && \
    apt-get install ca-certificates-java && \
    apt-get clean && \
    update-ca-certificates -f;
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/

# Install Hadoop
ENV HADOOP_VERSION 3.0.0
ENV HADOOP_HOME /usr/hadoop-$HADOOP_VERSION
ENV HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
ENV PATH $PATH:$HADOOP_HOME/bin
RUN curl -sL --retry 3 \
  "http://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" \
  | gunzip \
  | tar -x -C /usr/ \
 && rm -rf $HADOOP_HOME/share/doc \
 && chown -R root:root $HADOOP_HOME

# Install Spark
ENV SPARK_VERSION 3.2.0
ENV SPARK_PACKAGE spark-${SPARK_VERSION}-bin-without-hadoop
ENV SPARK_HOME /usr/spark-${SPARK_VERSION}
ENV SPARK_DIST_CLASSPATH="$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*"
ENV PATH $PATH:${SPARK_HOME}/bin
RUN curl -sL --retry 3 \
  "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}.tgz" \
  | gunzip \
  | tar x -C /usr/ \
 && mv /usr/$SPARK_PACKAGE $SPARK_HOME \
 && chown -R root:root $SPARK_HOME

# http://blog.stuart.axelbrooke.com/python-3-on-spark-return-of-the-pythonhashseed
ENV PYTHONHASHSEED 0
ENV PYTHONIOENCODING UTF-8
ENV PIP_DISABLE_PIP_VERSION_CHECK 1

# Point Spark at proper python binary
ENV PYSPARK_PYTHON=/opt/conda/bin/python
ENV PYSPARK_DRIVER_PYTHON=/opt/conda/bin/python
# Setup Spark/Yarn/HDFS user as root
ENV PATH="/usr/bin:/opt/program:${PATH}"
ENV YARN_RESOURCEMANAGER_USER="root"
ENV YARN_NODEMANAGER_USER="root"
ENV HDFS_NAMENODE_USER="root"
ENV HDFS_DATANODE_USER="root"
ENV HDFS_SECONDARYNAMENODE_USER="root"

RUN curl -s https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.12.1/rapids-4-spark_2.12-23.12.1.jar -o ${SPARK_HOME}/jars/rapids-4-spark_2.12-23.12.1.jar
COPY requirements.txt .
RUN pip3 install -r requirements.txt
RUN apt-get clean
RUN rm -rf /var/lib/apt/lists/*

# set up source code
COPY src /opt/program
RUN mkdir -p /opt/module/py_files
WORKDIR /opt/program
RUN zip -r /opt/module/py_files/short-term-model.zip .

# Set up bootstrapping program and Spark configuration
COPY configuration/program /opt/program
RUN chmod +x /opt/program/submit
COPY configuration/hadoop-config /opt/hadoop-config

# make output folder for spark history logs
RUN mkdir -p /opt/ml/processing/output

RUN pip3 install psutil

WORKDIR $SPARK_HOME

ENV CUDA_VISIBLE_DEVICES 0

ENV LD_LIBRARY_PATH=/usr/local/cuda-11.2/compat/:/usr/local/cuda-11.2/lib64:${LD_LIBRARY_PATH}

ENTRYPOINT ["/opt/program/submit", "/opt/program/processor.py"]

requirements.txt

findspark
pyspark==3.2.0
statsmodels
scikit-learn>=1.2.1
spark_rapids_ml==23.12.0
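
The job drives spark_rapids_ml through its pyspark.ml-style API. An illustrative call — the actual estimator and columns in processor.py may differ:

from spark_rapids_ml.clustering import KMeans

# Illustrative only: spark-rapids-ml estimators mirror pyspark.ml, and the
# _train_udf/_set_gpu_device frames in the stacktrace above run inside fit().
kmeans = KMeans(k=8, featuresCol="features")
model = kmeans.fit(df)  # df: Spark DataFrame with an array/vector feature column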

spark-defaults.conf (note: some values are placeholder variables substituted at launch)

spark.driver.host=sd_host
spark.driver.memory=driver_mem
spark.yarn.am.cores=driver_cores
spark.executor.memory=exec_mem
spark.executor.cores=exec_cores
spark.task.cpus=task_cores
spark.executor.instances=exec_instances
spark.driver.maxResultSize=max_result_size
spark.executor.memoryOverhead=exec_overhead
spark.sql.adaptive.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
spark.default.parallelism=shuffle_partitions
spark.sql.shuffle.partitions=shuffle_partitions
spark.sql.files.maxPartitionBytes=256m
spark.sql.execution.arrow.pyspark.enabled=true
spark.sql.execution.arrow.maxRecordsPerBatch=3000
spark.sql.execution.arrow.pyspark.fallback.enabled=true
spark.network.timeout=900s
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=8g
spark.executor.pyspark.memory=22g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=1g
spark.eventLog.enabled=true
spark.eventLog.dir=/opt/ml/processing/output
spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.driver.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps

# https://nvidia.github.io/spark-rapids/docs/configs.html
spark.jars=/usr/spark-3.5.0/jars/rapids-4-spark_2.12-23.12.1.jar
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.concurrentGpuTasks=2
spark.rapids.sql.explain=NOT_ON_GPU
# https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/on-premise.html#running-on-yarn
spark.executor.resource.gpu.discoveryScript=/opt/program/discover_gpus.sh
spark.driver.resource.gpu.discoveryScript=/opt/program/discover_gpus.sh
spark.executor.resource.gpu.amount=1
spark.driver.resource.gpu.amount=1
spark.task.resource.gpu.amount=.5
spark.rapids.memory.gpu.pool=NONE
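
/opt/program/discover_gpus.sh isn't shown above; any discovery script just has to print the resource JSON Spark expects on stdout. A hypothetical equivalent, for reference:

#!/usr/bin/env python3
# Hypothetical stand-in for /opt/program/discover_gpus.sh: Spark only needs
# the script to print {"name": "gpu", "addresses": [...]} on stdout.
import json
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
addresses = [line.strip() for line in out.stdout.splitlines() if line.strip()]
print(json.dumps({"name": "gpu", "addresses": addresses}))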
