
Commit 9fe871f

Merge branch-25.02 into main [skip ci] (#857)
Merge branch-25.02 into main. Release notes as follows:

- Adds the pyspark-rapids CLI for zero-code-change accelerated pyspark shell and Jupyter notebooks.
- Adds example init scripts for setting up zero-code-change accelerated notebook environments in cloud Spark clusters.
- Updates RAPIDS dependencies to 25.02.
- Fixes the UMAP precomputed KNN error message.

Note: merge this PR with **Create a merge commit to merge**.
2 parents 5ec5020 + d330714 commit 9fe871f

40 files changed: +423, -198 lines changed

.github/workflows/blossom-ci.yml

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-# Copyright (c) 2023-2024, NVIDIA CORPORATION.
+# Copyright (c) 2023-2025, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -60,7 +60,7 @@ jobs:
   Vulnerability-scan:
     name: Vulnerability scan
     needs: [Authorization]
-    runs-on: ubuntu-latest
+    runs-on: vulnerability-scan
     steps:
       - name: Checkout code
         uses: actions/checkout@v4

ci/Dockerfile

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 #
-# Copyright (c) 2024, NVIDIA CORPORATION.
+# Copyright (c) 2025, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -37,6 +37,6 @@ RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86
     && conda config --set solver libmamba

 # install cuML
-ARG CUML_VER=24.12
+ARG CUML_VER=25.02
 RUN conda install -y -c rapidsai -c conda-forge -c nvidia cuml=$CUML_VER cuvs=$CUML_VER python=3.10 cuda-version=11.8 numpy~=1.0 \
     && conda clean --all -f -y

docker/Dockerfile.pip

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 #
-# Copyright (c) 2024, NVIDIA CORPORATION.
+# Copyright (c) 2025, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -18,7 +18,7 @@ ARG CUDA_VERSION=11.8.0
 FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04

 ARG PYSPARK_VERSION=3.3.1
-ARG RAPIDS_VERSION=24.12.0
+ARG RAPIDS_VERSION=25.2.0
 ARG ARCH=amd64
 #ARG ARCH=arm64
 # Install packages to build spark-rapids-ml

docker/Dockerfile.python

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 #
-# Copyright (c) 2024, NVIDIA CORPORATION.
+# Copyright (c) 2025, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -17,7 +17,7 @@
 ARG CUDA_VERSION=11.8.0
 FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04

-ARG CUML_VERSION=24.12
+ARG CUML_VERSION=25.02

 # Install packages to build spark-rapids-ml
 RUN apt update -y \

docs/source/conf.py

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-# Copyright (c) 2024, NVIDIA CORPORATION.
+# Copyright (c) 2025, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -23,7 +23,7 @@
 project = 'spark-rapids-ml'
 copyright = '2024, NVIDIA'
 author = 'NVIDIA'
-release = '24.12.0'
+release = '25.02.0'

 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

notebooks/README.md

Lines changed: 17 additions & 3 deletions
@@ -34,13 +34,27 @@ To run notebooks using Spark local mode on a server with one or more NVIDIA GPUs
 8. **OPTIONAL**: If you have multiple GPUs in your server, replace the `CUDA_VISIBLE_DEVICES` setting in step 4 with a comma-separated list of the corresponding indices. For example, for two GPUs use `CUDA_VISIBLE_DEVICES=0,1`.

 ## No import change
-In these notebooks, the GPU accelerated implementations of algorithms in Spark MLlib are enabled via import statements from the `spark_rapids_ml` package. Alternatively, acceleration can also be enabled by executing the following import statement at the start of a notebook:
+In the default notebooks, the GPU accelerated implementations of algorithms in Spark MLlib are enabled via import statements from the `spark_rapids_ml` package.
+
+Alternatively, acceleration can also be enabled by executing the following import statement at the start of a notebook:
 ```
 import spark_rapids_ml.install
 ```
-After executing a cell with this command, all subsequent imports and accesses of supported accelerated classes from `pyspark.ml` will automatically redirect and return their counterparts in `spark_rapids_ml`. Unaccelerated classes will import from `pyspark.ml` as usual. Thus, with the above single import statement, all supported acceleration in an existing `pyspark` notebook is enabled with no additional import statement or code changes. Directly importing from `spark_rapids_ml` also still works (needed for non-MLlib algorithms like UMAP).
+or by modifying the PySpark/Jupyter launch command above to use the `pyspark-rapids` CLI installed by our `pip` package, which starts Jupyter with pyspark, as follows:
+```bash
+cd spark-rapids-ml/notebooks
+
+PYSPARK_DRIVER_PYTHON=jupyter \
+PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=0.0.0.0' \
+CUDA_VISIBLE_DEVICES=0 \
+pyspark-rapids --master local[12] \
+--driver-memory 128g \
+--conf spark.sql.execution.arrow.pyspark.enabled=true
+```
+
+After executing either of the above, all subsequent imports and accesses of supported accelerated classes from `pyspark.ml` will automatically redirect and return their counterparts in `spark_rapids_ml`. Unaccelerated classes will import from `pyspark.ml` as usual. Thus, all supported acceleration in an existing `pyspark` notebook is enabled with no additional import statement or code changes. Directly importing from `spark_rapids_ml` also still works (needed for non-MLlib algorithms like UMAP).

-For an example, see the notebook [kmeans-no-import-change.ipynb](kmeans-no-import-change.ipynb).
+For an example notebook, see [kmeans-no-import-change.ipynb](kmeans-no-import-change.ipynb).

 *Note*: As of this release, in this mode, the remaining unsupported methods and attributes on accelerated classes and objects will still raise exceptions.

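The no-import-change behavior added above can be exercised in a single notebook cell. The following is a minimal sketch, assuming `spark_rapids_ml` is installed and a GPU is visible to the Spark session; the toy data and parameter values are illustrative and are not taken from the repo:

```python
from pyspark.sql import SparkSession

import spark_rapids_ml.install  # noqa: F401 -- enables redirection of supported pyspark.ml classes

# After the install import, this resolves to the accelerated counterpart
# of KMeans provided by spark_rapids_ml rather than stock pyspark.ml.
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),)],
    ["features"],
)
kmeans = KMeans(k=2, featuresCol="features")  # unchanged MLlib-style API
model = kmeans.fit(df)  # fit is expected to run on GPU via spark_rapids_ml
```
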
notebooks/aws-emr/README.md

Lines changed: 11 additions & 24 deletions
@@ -15,25 +15,22 @@ If you already have a AWS EMR account, you can run the example notebooks on an E
   export S3_BUCKET=<your_s3_bucket_name>
   aws s3 mb s3://${S3_BUCKET}
   ```
-- Create a zip file for the `spark-rapids-ml` package.
+- Upload the initialization script to S3.
   ```
-  cd spark-rapids-ml/python/src
-  zip -r spark_rapids_ml.zip spark_rapids_ml
-  ```
-- Upload the zip file and the initialization script to S3.
-  ```
-  aws s3 cp spark_rapids_ml.zip s3://${S3_BUCKET}/spark_rapids_ml.zip
-  cd ../../notebooks/aws-emr
-  aws s3 cp init-bootstrap-action.sh s3://${S3_BUCKET}/init-bootstrap-action.sh
+  aws s3 cp init-bootstrap-action.sh s3://${S3_BUCKET}/
   ```
 - Print out available subnets in CLI then pick a SubnetId (e.g. subnet-0744566f of AvailabilityZone us-east-2a).

   ```
   aws ec2 describe-subnets
   export SUBNET_ID=<your_SubnetId>
   ```
+
+  If this is your first time using EMR notebooks via EMR Studio and EMR Workspaces, we recommend creating a fresh VPC and subnets with internet access (the initialization script downloads artifacts) that meet the EMR requirements, per the EMR documentation, and then specifying one of the new subnets in the above.

 - Create a cluster with at least two single-gpu workers. You will obtain a ClusterId in the terminal. Note that three GPU nodes are requested here, because EMR cherry-picks one node (either CORE or TASK) to run the JupyterLab service for notebooks and will not use that node for compute.
+
+  If you wish to also enable the [no-import-change](../README.md#no-import-change) UX for the cluster, change the init script argument `Args=[--no-import-enabled,0]` to `Args=[--no-import-enabled,1]` below. The init script `init-bootstrap-action.sh` checks this argument and modifies the runtime accordingly.

   ```
   export CLUSTER_NAME="spark_rapids_ml"
@@ -50,24 +47,14 @@ If you already have a AWS EMR account, you can run the example notebooks on an E
   --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.2xlarge \
                     InstanceGroupType=CORE,InstanceCount=3,InstanceType=g4dn.2xlarge \
   --configurations file://${CUR_DIR}/init-configurations.json \
-  --bootstrap-actions Name='Spark Rapids ML Bootstrap action',Path=s3://${S3_BUCKET}/init-bootstrap-action.sh
+  --bootstrap-actions Name='Spark Rapids ML Bootstrap action',Path=s3://${S3_BUCKET}/init-bootstrap-action.sh,Args=[--no-import-enabled,0]
   ```
-- In the [AWS EMR console](https://console.aws.amazon.com/emr/), click "Clusters", you can find the cluster id of the created cluster. Wait until all the instances have the Status turned to "Running".
-- In the [AWS EMR console](https://console.aws.amazon.com/emr/), click "Workspace(Notebooks)", then create a workspace. Wait until the status becomes ready and a JupyterLab webpage will pop up.
+- In the [AWS EMR console](https://console.aws.amazon.com/emr/), click "Clusters" to find the cluster id of the created cluster. Wait until the cluster has the "Waiting" status.
+- To use notebooks on EMR you will need an EMR Studio and an associated Workspace. If you don't already have these, in the [AWS EMR console](https://console.aws.amazon.com/emr/), on the left, in the "EMR Studio" section, click the respective "Studio" and "Workspace (Notebooks)" links and follow the instructions to create them. When creating a Studio, select the `Custom` setup option to allow configuring a VPC and a Subnet. These should match the VPC and Subnet used for the cluster. Select "\*Default\*" for all security group prompts and drop-downs in the Studio and Workspace settings. Please check/search the EMR documentation for further instructions.

-- Enter the created workspace. Click the "Cluster" button (usually the top second button of the left navigation bar). Attach the workspace to the newly created cluster through cluster id.
+- In the "Workspace (Notebooks)" list of workspaces, select the created workspace, make sure it has the "Idle" status (otherwise select "Stop" in the "Actions" drop-down), and click "Attach" to attach the newly created cluster to the workspace through its cluster id.

-- Use the default notebook or create/upload a new notebook. Set the notebook kernel to "PySpark".
+- Use the default notebook or create/upload a new notebook. Set the notebook kernel to "PySpark". For the no-import-change UX, you can try the example [kmeans-no-import-change.ipynb](../kmeans-no-import-change.ipynb).

-- Add the following to a new cell at the beginning of the notebook. Replace "s3://path/to/spark\_rapids\_ml.zip" with the actual s3 path.
-  ```
-  %%configure -f
-  {
-    "conf":{
-       "spark.submit.pyFiles": "s3://path/to/spark_rapids_ml.zip"
-    }
-  }
-
-  ```
 - Run the notebook cells.
   **Note**: these settings are for demonstration purposes only. Additional tuning may be required for optimal performance.

notebooks/aws-emr/init-bootstrap-action.sh

Lines changed: 35 additions & 5 deletions
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2024, NVIDIA CORPORATION.
+# Copyright (c) 2025, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -27,16 +27,46 @@ sudo bash -c "wget https://www.python.org/ftp/python/3.10.9/Python-3.10.9.tgz &&
     tar xzf Python-3.10.9.tgz && cd Python-3.10.9 && \
     ./configure --enable-optimizations && make altinstall"

-RAPIDS_VERSION=24.12.0
+RAPIDS_VERSION=25.2.0

 sudo /usr/local/bin/pip3.10 install --upgrade pip

 # install scikit-learn
 sudo /usr/local/bin/pip3.10 install scikit-learn

 # install cudf and cuml
-sudo /usr/local/bin/pip3.10 install --no-cache-dir cudf-cu12 --extra-index-url=https://pypi.nvidia.com --verbose
-sudo /usr/local/bin/pip3.10 install --no-cache-dir cuml-cu12 cuvs-cu12 --extra-index-url=https://pypi.nvidia.com --verbose
-
+sudo /usr/local/bin/pip3.10 install --no-cache-dir cudf-cu12~=${RAPIDS_VERSION} \
+    cuml-cu12~=${RAPIDS_VERSION} \
+    cuvs-cu12~=${RAPIDS_VERSION} \
+    --extra-index-url=https://pypi.nvidia.com --verbose
+sudo /usr/local/bin/pip3.10 install spark-rapids-ml
 sudo /usr/local/bin/pip3.10 list

+# set up no-import-change for cluster if enabled
+if [[ $1 == "--no-import-enabled" && $2 == 1 ]]; then
+    echo "enabling no import change in cluster" 1>&2
+    cd /usr/lib/livy/repl_2.12-jars
+    sudo jar xf livy-repl_2.12*.jar fake_shell.py
+    sudo sed -i fake_shell.py -e '/from __future__/ s/\(.*\)/\1\ntry:\n import spark_rapids_ml.install\nexcept:\n pass\n/g'
+    sudo jar uf livy-repl_2.12*.jar fake_shell.py
+    sudo rm fake_shell.py
+fi
+
+# ensure notebook comes up in python 3.10 by using a background script that waits for an
+# application file to be installed before modifying.
+cat <<EOF >/tmp/mod_start_kernel.sh
+#!/bin/bash
+set -ex
+while [ ! -f /mnt/notebook-env/bin/start_kernel_as_emr_notebook.sh ]; do
+    echo "waiting for /mnt/notebook-env/bin/start_kernel_as_emr_notebook.sh"
+    sleep 10
+done
+echo "done waiting"
+sleep 10
+sudo sed -i /mnt/notebook-env/bin/start_kernel_as_emr_notebook.sh -e 's#"spark.pyspark.python": "python3"#"spark.pyspark.python": "/usr/local/bin/python3.10"#g'
+sudo sed -i /mnt/notebook-env/bin/start_kernel_as_emr_notebook.sh -e 's#"spark.pyspark.virtualenv.enabled": "true"#"spark.pyspark.virtualenv.enabled": "false"#g'
+exit 0
+EOF
+sudo bash /tmp/mod_start_kernel.sh &
+exit 0
+

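For reference, the `sed` edit in the bootstrap script above injects a guarded import immediately after the `from __future__` line of Livy's `fake_shell.py`. A reconstruction of the resulting prologue is sketched below; the exact `__future__` import is Livy's own (a placeholder is shown here), and indentation is widened for readability:

```python
from __future__ import print_function  # placeholder for whatever __future__ line the sed pattern matches
try:
    # Injected when the bootstrap action runs with `--no-import-enabled 1`:
    # importing spark_rapids_ml.install redirects supported pyspark.ml classes
    # to their spark_rapids_ml counterparts in every Livy-backed notebook session.
    import spark_rapids_ml.install
except:
    # If spark-rapids-ml is unavailable, sessions fall back to stock pyspark.ml.
    pass
```
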
notebooks/aws-emr/init-configurations.json

Lines changed: 3 additions & 2 deletions
@@ -50,7 +50,7 @@
       "spark.executor.resource.gpu.amount":"1",
       "spark.executor.cores":"8",
       "spark.task.cpus":"1",
-      "spark.task.resource.gpu.amount":"1",
+      "spark.task.resource.gpu.amount":"0.125",
       "spark.rapids.memory.pinnedPool.size":"2G",
       "spark.executor.memoryOverhead":"2G",
       "spark.sql.files.maxPartitionBytes":"256m",
@@ -61,7 +61,6 @@
       "spark.rapids.sql.explain":"ALL",
       "spark.rapids.memory.gpu.reserve":"20",
       "spark.rapids.sql.python.gpu.enabled":"true",
-      "spark.rapids.memory.pinnedPool.size":"2G",
       "spark.rapids.sql.batchSizeBytes":"512m",
       "spark.locality.wait":"0",
       "spark.sql.execution.sortBeforeRepartition":"false",
@@ -70,6 +69,8 @@
       "spark.sql.cache.serializer":"com.nvidia.spark.ParquetCachedBatchSerializer",
       "spark.pyspark.python":"/usr/local/bin/python3.10",
       "spark.pyspark.driver.python":"/usr/local/bin/python3.10",
+      "spark.pyspark.virtualenv.enabled":"false",
+      "spark.yarn.appMasterEnv.PYSPARK_PYTHON":"/usr/local/bin/python3.10",
       "spark.dynamicAllocation.enabled":"false",
       "spark.driver.memory":"20g",
       "spark.rpc.message.maxSize":"512",

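The `spark.task.resource.gpu.amount` change above is consistent with the other values already in this file: each executor has `spark.executor.cores=8` and one GPU, so a per-task share of 1/8 lets all eight task slots run concurrently against the shared GPU. A quick check of the arithmetic:

```python
executor_cores = 8        # spark.executor.cores
gpus_per_executor = 1     # spark.executor.resource.gpu.amount
print(gpus_per_executor / executor_cores)  # 0.125 == spark.task.resource.gpu.amount
```
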
notebooks/databricks/README.md

Lines changed: 11 additions & 33 deletions
@@ -7,48 +7,24 @@ If you already have a Databricks account, you can run the example notebooks on a
   export PROFILE=spark-rapids-ml
   databricks configure --token --profile ${PROFILE}
   ```
-- Create a zip file for the `spark-rapids-ml` package.
+- Copy the init script to your *workspace* (not DBFS) (ex. workspace directory: /Users/< databricks-user-name >/init_scripts).
   ```bash
-  cd spark-rapids-ml/python/src
-  zip -r spark_rapids_ml.zip spark_rapids_ml
-  ```
-- Copy the zip file to DBFS, setting `SAVE_DIR` to the directory of your choice.
-  ```bash
-  export SAVE_DIR="/path/to/save/artifacts"
-  databricks fs cp spark_rapids_ml.zip dbfs:${SAVE_DIR}/spark_rapids_ml.zip --profile ${PROFILE}
-  ```
-- Edit the [init-pip-cuda-11.8.sh](init-pip-cuda-11.8.sh) init script to set the `SPARK_RAPIDS_ML_ZIP` variable to the DBFS location used above.
-  ```bash
-  cd spark-rapids-ml/notebooks/databricks
-  sed -i"" -e "s;/path/to/zip/file;${SAVE_DIR}/spark_rapids_ml.zip;" init-pip-cuda-11.8.sh
+  export WS_SAVE_DIR="/path/to/directory/in/workspace"
+  databricks workspace mkdirs ${WS_SAVE_DIR} --profile ${PROFILE}
+  databricks workspace import --format AUTO --file init-pip-cuda-11.8.sh ${WS_SAVE_DIR}/init-pip-cuda-11.8.sh --profile ${PROFILE}
   ```
-  **Note**: the `databricks` CLI requires the `dbfs:` prefix for all DBFS paths, but inside the spark nodes, DBFS will be mounted to a local `/dbfs` volume, so the path prefixes will be slightly different depending on the context.
-
-  **Note**: this init script does the following on each Spark node:
+  **Note**: the init script does the following on each Spark node:
   - updates the CUDA runtime to 11.8 (required for Spark Rapids ML dependencies).
   - downloads and installs the [Spark-Rapids](https://github.com/NVIDIA/spark-rapids) plugin for accelerating data loading and Spark SQL.
   - installs various `cuXX` dependencies via pip.
-
-- Copy the modified `init-pip-cuda-11.8.sh` init script to your *workspace* (not DBFS) (ex. workspace directory: /Users/< databricks-user-name >/init_scripts).
-  ```bash
-  export WS_SAVE_DIR="/path/to/directory/in/workspace"
-  databricks workspace mkdirs ${WS_SAVE_DIR} --profile ${PROFILE}
-  ```
-  For Mac
-  ```bash
-  databricks workspace import --format AUTO --content $(base64 -i init-pip-cuda-11.8.sh) ${WS_SAVE_DIR}/init-pip-cuda-11.8.sh --profile ${PROFILE}
-  ```
-  For Linux
-  ```bash
-  databricks workspace import --format AUTO --content $(base64 -w 0 init-pip-cuda-11.8.sh) ${WS_SAVE_DIR}/init-pip-cuda-11.8.sh --profile ${PROFILE}
-  ```
+  - if the cluster environment variable `SPARK_RAPIDS_ML_NO_IMPORT_ENABLED=1` is defined (see below), the init script also modifies a Databricks notebook kernel startup script to enable the no-import-change UX for the cluster. See [no-import-change](../README.md#no-import-change).
 - Create a cluster using **Databricks 13.3 LTS ML GPU Runtime** using at least two single-gpu workers and add the following configurations to the **Advanced options**.
   - **Init Scripts**
-    - add the workspace path to the uploaded init script, e.g. `${WS_SAVE_DIR}/init-pip-cuda-11.8.sh`.
+    - add the workspace path to the uploaded init script `${WS_SAVE_DIR}/init-pip-cuda-11.8.sh` as set above (but substitute the variables manually in the form).
   - **Spark**
     - **Spark config**
      ```
-      spark.task.resource.gpu.amount 1
+      spark.task.resource.gpu.amount 0.125
       spark.databricks.delta.preview.enabled true
       spark.python.worker.reuse true
       spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-24.10.1.jar:/databricks/spark/python
@@ -74,6 +50,8 @@ If you already have a Databricks account, you can run the example notebooks on a
      ```
       LIBCUDF_CUFILE_POLICY=OFF
       NCCL_DEBUG=INFO
+      SPARK_RAPIDS_ML_NO_IMPORT_ENABLED=0
      ```
+  If you wish to enable the [no-import-change](../README.md#no-import-change) UX for the cluster, set `SPARK_RAPIDS_ML_NO_IMPORT_ENABLED=1` instead. The init script checks this cluster environment variable and modifies the runtime accordingly.
 - Start the configured cluster.
-- Select your workspace and upload the desired [notebook](../) via `Import` in the drop-down menu for your workspace.
+- Select your workspace and upload the desired [notebook](../) via `Import` in the drop-down menu for your workspace. For the no-import-change UX, you can try the example [kmeans-no-import-change.ipynb](../kmeans-no-import-change.ipynb).

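Once a Databricks cluster configured as above starts with `SPARK_RAPIDS_ML_NO_IMPORT_ENABLED=1`, a notebook cell can confirm that the redirect is active. A minimal check, with the expected module prefix stated as an assumption based on the no-import-change description rather than on this diff:

```python
# Run in a notebook cell on the configured cluster.
from pyspark.ml.clustering import KMeans

# With the redirect active this is expected to report a spark_rapids_ml module;
# without it, the stock pyspark.ml.clustering module.
print(KMeans.__module__)
```
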