I'm using the RAPIDS Accelerator for Apache Spark and spark-rapids-ml together and am hitting the error below. If the RAPIDS Accelerator is disabled, the job runs successfully. The documentation implies the two should be able to work together: https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/ml-integration.html#existing-ml-libraries.
Is there something I'm missing?
spark.rapids.memory.gpu.pool=NONE seems to be the only suggestion for avoiding memory conflicts between the two.
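For reference, here's roughly how I'd expect the combined setup to look when building the session programmatically; this is only a sketch with placeholder values (the jar path and GPU amounts mirror my spark-defaults.conf further down), not the exact code from my job:
from pyspark.sql import SparkSession

# Sketch only: RAPIDS Accelerator (SQL plugin) plus the GPU scheduling
# settings spark-rapids-ml relies on. The jar path is a placeholder.
spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.jars", "/usr/spark-3.2.0/jars/rapids-4-spark_2.12-23.12.1.jar")  # placeholder path
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.5")
    # Per the ml-integration docs, disable the plugin's pooled GPU allocator so
    # cuML/cupy (used by spark_rapids_ml) can allocate memory alongside it.
    .config("spark.rapids.memory.gpu.pool", "NONE")
    .getOrCreate()
)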
Environment:
Docker running on AWS SageMaker (ml.p3.2xlarge) (base: nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10)
Stacktrace
2024-01-23 01:29:21,462 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 76.0 (TID 382) (algo-2 executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/spark_rapids_ml/core.py", line 698, in _train_udf
if cuda_managed_mem_enabled:
File "/opt/conda/lib/python3.10/site-packages/spark_rapids_ml/core.py", line 383, in _set_gpu_device
cupy.cuda.Device(gpu_id).use()
File "cupy/cuda/device.pyx", line 192, in cupy.cuda.device.Device.use
File "cupy/cuda/device.pyx", line 198, in cupy.cuda.device.Device.use
File "cupy_backends/cuda/api/runtime.pyx", line 375, in cupy_backends.cuda.api.runtime.setDevice
File "cupy_backends/cuda/api/runtime.pyx", line 144, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:545)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:101)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:50)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:498)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:86)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:80)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
    at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:670)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:424)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:259)
2024-01-23 01:29:21.480 INFO clientserver - close: Closing down clientserver connection
2024-01-23 01:29:21.487 INFO ModelService - process: Exception during processing: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
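The failing call in _set_gpu_device is just cupy selecting the task's GPU, so as a sanity check I can run something like the following on an executor (a sketch, assuming cupy is importable there and the task was assigned GPU 0) to see how much memory is actually free when the UDF starts:
import cupy

# Mirrors what spark_rapids_ml does before training: pick the assigned GPU,
# then report free vs. total device memory.
gpu_id = 0  # placeholder; the real value comes from the task's GPU resource
cupy.cuda.Device(gpu_id).use()
free_bytes, total_bytes = cupy.cuda.runtime.memGetInfo()
print(f"GPU {gpu_id}: {free_bytes / 2**20:.0f} MiB free of {total_bytes / 2**20:.0f} MiB")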
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P0    23W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Dockerfile
FROM rapidsai/base:23.12-cuda11.2-py3.10
USER root
RUN apt-get update
RUN apt-get install -y openjdk-8-jdk curl zip unzip
# Fix certificate issues
RUN apt-get update && \
apt-get install -y ca-certificates-java && \
apt-get clean && \
update-ca-certificates -f;
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
RUN export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
# Install Hadoop
ENV HADOOP_VERSION 3.0.0
ENV HADOOP_HOME /usr/hadoop-$HADOOP_VERSION
ENV HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
ENV PATH $PATH:$HADOOP_HOME/bin
RUN curl -sL --retry 3 \
"http://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" \
| gunzip \
| tar -x -C /usr/ \
&& rm -rf $HADOOP_HOME/share/doc \
&& chown -R root:root $HADOOP_HOME
# Install Spark
ENV SPARK_VERSION 3.2.0
ENV SPARK_PACKAGE spark-${SPARK_VERSION}-bin-without-hadoop
ENV SPARK_HOME /usr/spark-${SPARK_VERSION}
ENV SPARK_DIST_CLASSPATH="$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*"
ENV PATH $PATH:${SPARK_HOME}/bin
RUN curl -sL --retry 3 \
"https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}.tgz" \
| gunzip \
| tar -x -C /usr/ \
&& mv /usr/$SPARK_PACKAGE $SPARK_HOME \
&& chown -R root:root $SPARK_HOME
# http://blog.stuart.axelbrooke.com/python-3-on-spark-return-of-the-pythonhashseed
ENV PYTHONHASHSEED 0
ENV PYTHONIOENCODING UTF-8
ENV PIP_DISABLE_PIP_VERSION_CHECK 1
# Point Spark at proper python binary
ENV PYSPARK_PYTHON=/opt/conda/bin/python
ENV PYSPARK_DRIVER_PYTHON=/opt/conda/bin/python
# Setup Spark/Yarn/HDFS user as root
ENV PATH="/usr/bin:/opt/program:${PATH}"
ENV YARN_RESOURCEMANAGER_USER="root"
ENV YARN_NODEMANAGER_USER="root"
ENV HDFS_NAMENODE_USER="root"
ENV HDFS_DATANODE_USER="root"
ENV HDFS_SECONDARYNAMENODE_USER="root"
RUN curl -s https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.12.1/rapids-4-spark_2.12-23.12.1.jar -o ${SPARK_HOME}/jars/rapids-4-spark_2.12-23.12.1.jar
COPY requirements.txt .
RUN pip3 install -r requirements.txt
RUN apt-get clean
RUN rm -rf /var/lib/apt/lists/*
# set up source code
COPY src /opt/program
RUN mkdir -p /opt/module/py_files
WORKDIR /opt/program
RUN zip -r /opt/module/py_files/short-term-model.zip .
# Set up bootstrapping program and Spark configuration
COPY configuration/program /opt/program
RUN chmod +x /opt/program/submit
COPY configuration/hadoop-config /opt/hadoop-config
# make output folder for spark history logs
RUN mkdir -p /opt/ml/processing/output
RUN pip3 install psutil
WORKDIR $SPARK_HOME
ENV CUDA_VISIBLE_DEVICES 0
RUN export CUDA_VISIBLE_DEVICES=0
ENV LD_LIBRARY_PATH=/usr/local/cuda-11.2/compat/:/usr/local/cuda-11.2/lib64:${LD_LIBRARY_PATH}
RUN export LD_LIBRARY_PATH=/usr/local/cuda-11.2/compat/:/usr/local/cuda-11.2/lib64:${LD_LIBRARY_PATH}
ENTRYPOINT ["/opt/program/submit", "/opt/program/processor.py"]
requirements.txt
findspark
pyspark==3.2.0
statsmodels
scikit-learn>=1.2.1
spark_rapids_ml==23.12.0
spark-defaults.conf (note: some values are placeholder variables that get substituted at runtime)
spark.driver.host=sd_host
spark.driver.memory=driver_mem
spark.yarn.am.cores=driver_cores
spark.executor.memory=exec_mem
spark.executor.cores=exec_cores
spark.task.cpus=task_cores
spark.executor.instances=exec_instances
spark.driver.maxResultSize=max_result_size
spark.executor.memoryOverhead=exec_overhead
spark.sql.adaptive.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
spark.default.parallelism=shuffle_partitions
spark.sql.shuffle.partitions=shuffle_partitions
spark.sql.files.maxPartitionBytes=256m
spark.sql.execution.arrow.pyspark.enabled=true
spark.sql.execution.arrow.maxRecordsPerBatch=3000
spark.sql.execution.arrow.pyspark.fallback.enabled=true
spark.network.timeout=900s
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=8g
spark.executor.pyspark.memory=22g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=1g
spark.eventLog.enabled=true
spark.eventLog.dir=/opt/ml/processing/output
spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.driver.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
# https://nvidia.github.io/spark-rapids/docs/configs.html
spark.jars=/usr/spark-3.5.0/jars/rapids-4-spark_2.12-23.12.1.jar
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.concurrentGpuTasks=2
spark.rapids.sql.explain=NOT_ON_GPU
# https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/on-premise.html#running-on-yarn
spark.executor.resource.gpu.discoveryScript=/opt/program/discover_gpus.sh
spark.driver.resource.gpu.discoveryScript=/opt/program/discover_gpus.sh
spark.executor.resource.gpu.amount=1
spark.driver.resource.gpu.amount=1
spark.task.resource.gpu.amount=.5
spark.rapids.memory.gpu.pool=NONE
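For context, the stage that triggers the error is a spark-rapids-ml fit. I haven't included the full workload, but it's shaped roughly like this (a hypothetical minimal form using spark_rapids_ml.feature.PCA, with made-up column names and sizes):
from spark_rapids_ml.feature import PCA

# Hypothetical minimal shape of the failing job: the fit() launches the
# _train_udf shown in the stack trace above.
df = spark.createDataFrame(
    [([float(i + j) for j in range(8)],) for i in range(1000)],
    ["features"],
)
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)  # fails with cudaErrorMemoryAllocation when the SQL plugin is enabled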