Merge branch-25.02 into main
Release notes as follows:
- Adds a `pyspark-rapids` CLI for a zero-code-change accelerated PySpark shell
  and Jupyter notebooks.
- Adds example init scripts for setting up zero-code-change accelerated
  notebook environments in cloud Spark clusters.
- Updates RAPIDS dependencies to 25.02.
- Fixes UMAP precomputed KNN error message.
Note: merge this PR with **Create a merge commit to merge**
Changes to `notebooks/README.md` (17 additions, 3 deletions):
To run notebooks using Spark local mode on a server with one or more NVIDIA GPUs:
8. **OPTIONAL**: If you have multiple GPUs in your server, replace the `CUDA_VISIBLE_DEVICES` setting in step 4 with a comma-separated list of the corresponding indices. For example, for two GPUs use `CUDA_VISIBLE_DEVICES=0,1`.
## No import change
In the default notebooks, the GPU accelerated implementations of algorithms in Spark MLlib are enabled via import statements from the `spark_rapids_ml` package.
Alternatively, acceleration can also be enabled by executing the following import statement at the start of a notebook:
```
import spark_rapids_ml.install
```
or by modifying the PySpark/Jupyter launch command above to use the `pyspark-rapids` CLI installed by our `pip` package to start Jupyter with PySpark.
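A minimal sketch of such a launch command (illustrative only: the exact options are not reproduced here; `PYSPARK_DRIVER_PYTHON` and `PYSPARK_DRIVER_PYTHON_OPTS` are standard PySpark environment variables, and the master URL is a placeholder):

```
# hypothetical launch command -- adjust the notebook options and master URL to your setup
PYSPARK_DRIVER_PYTHON=jupyter \
PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=0.0.0.0' \
pyspark-rapids --master local[*]
```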
After executing either of the above, all subsequent imports and accesses of supported accelerated classes from `pyspark.ml` will automatically redirect and return their counterparts in `spark_rapids_ml`. Unaccelerated classes will import from `pyspark.ml` as usual. Thus, all supported acceleration in an existing `pyspark` notebook is enabled with no additional import statement or code changes. Directly importing from `spark_rapids_ml` also still works (needed for non-MLlib algorithms like UMAP).
For an example, see the notebook [kmeans-no-import-change.ipynb](kmeans-no-import-change.ipynb).
*Note*: As of this release, in this mode, the remaining unsupported methods and attributes on accelerated classes and objects will still raise exceptions.
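As a rough sanity check of the redirect behavior described above (assuming `spark_rapids_ml` and `pyspark` are installed; `KMeans` is used here since it is covered by the example notebook):

```
# illustrative check only: after the install import, a supported pyspark.ml class
# is expected to resolve to its spark_rapids_ml counterpart, per the text above
python - <<'EOF'
import spark_rapids_ml.install            # enables the no-import-change redirect
from pyspark.ml.clustering import KMeans  # should return the accelerated counterpart
print(KMeans.__module__)                  # expected to name a spark_rapids_ml module
EOF
```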
Changes to the AWS EMR notebooks README:

- Print out the available subnets in the CLI, then pick a SubnetId (e.g. subnet-0744566f in AvailabilityZone us-east-2a).
```
aws ec2 describe-subnets
export SUBNET_ID=<your_SubnetId>
```
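Optionally, if your account has many subnets, the listing can be narrowed; a sketch assuming a configured AWS CLI (the availability zone and output fields below are illustrative):

```
# illustrative filtered listing -- adjust the availability zone to your region
aws ec2 describe-subnets \
  --filters Name=availability-zone,Values=us-east-2a \
  --query 'Subnets[].{SubnetId:SubnetId,AZ:AvailabilityZone,VpcId:VpcId}' \
  --output table
```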
If this is your first time using EMR notebooks via EMR Studio and EMR Workspaces, we recommend creating a fresh VPC and subnets with internet access (the initialization script downloads artifacts) that meet the EMR requirements per the EMR documentation, and then specifying one of the new subnets above.
- Create a cluster with at least two single-GPU workers. You will obtain a ClusterId in the terminal. Note that three GPU nodes are requested here because EMR picks one node (either CORE or TASK) to run the JupyterLab service for notebooks and will not use that node for compute.
If you wish to also enable [no-import-change](../README.md#no-import-change) UX for the cluster, change the init script argument `Args=[--no-import-enabled,0]` to `Args=[--no-import-enabled,1]` below. The init script `init-bootstrap-action.sh` checks this argument and modifies the runtime accordingly.
```
export CLUSTER_NAME="spark_rapids_ml"
  ...
  --bootstrap-actions Name='Spark Rapids ML Bootstrap action',Path=s3://${S3_BUCKET}/init-bootstrap-action.sh,Args=[--no-import-enabled,0]
```
- In the [AWS EMR console](https://console.aws.amazon.com/emr/), click "Clusters" to find the cluster id of the created cluster. Wait until the cluster reaches the "Waiting" status.
- To use notebooks on EMR you will need an EMR Studio and an associated Workspace. If you don't already have these, in the [AWS EMR console](https://console.aws.amazon.com/emr/), on the left, in the "EMR Studio" section, click the respective "Studio" and "Workspace (Notebooks)" links and follow the instructions to create them. When creating a Studio, select the `Custom` setup option to allow configuring a VPC and a Subnet; these should match the VPC and Subnet used for the cluster. Select "\*Default\*" for all security group prompts and drop-downs in the Studio and Workspace settings. See the EMR documentation for further instructions.
-In the "Workspace (Notebooks)" list of workspaces, select the created workspace, make sure it has the "Idle" status (select "Stop" otherwise in the "Actions" drop down) and click "Attach" to attach the newly created cluster through cluster id to the workspace.
- Use the default notebook or create/upload a new notebook. Set the notebook kernel to "PySpark". For the no-import-change UX, you can try the example [kmeans-no-import-change.ipynb](../kmeans-no-import-change.ipynb).
Changes to the Databricks notebooks README:

```
databricks workspace import --format AUTO --file init-pip-cuda-11.8.sh ${WS_SAVE_DIR}/init-pip-cuda-11.8.sh --profile ${PROFILE}
```
**Note**: the `databricks` CLI requires the `dbfs:` prefix for all DBFS paths, but inside the spark nodes, DBFS will be mounted to a local `/dbfs` volume, so the path prefixes will be slightly different depending on the context.
**Note**: the init script does the following on each Spark node:
- updates the CUDA runtime to 11.8 (required for Spark Rapids ML dependencies).
- downloads and installs the [Spark-Rapids](https://github.com/NVIDIA/spark-rapids) plugin for accelerating data loading and Spark SQL.
- installs various `cuXX` dependencies via pip.
- if the cluster environment variable `SPARK_RAPIDS_ML_NO_IMPORT_ENABLED=1` is defined (see below), the init script also modifies a Databricks notebook kernel startup script to enable the no-import-change UX for the cluster. See [no-import-change](../README.md#no-import-change).
- Copy the modified `init-pip-cuda-11.8.sh` init script to your *workspace* (not DBFS) (e.g. workspace directory: /Users/<databricks-user-name>/init_scripts).
- Create a cluster using the **Databricks 13.3 LTS ML GPU Runtime** with at least two single-GPU workers, and add the following configurations to the **Advanced options**.
- **Init Scripts**
- add the workspace path to the uploaded init script, `${WS_SAVE_DIR}/init-pip-cuda-11.8.sh`, as set above (but substitute the variables manually in the form).
- **Environment variables**
```
LIBCUDF_CUFILE_POLICY=OFF
NCCL_DEBUG=INFO
SPARK_RAPIDS_ML_NO_IMPORT_ENABLED=0
```
If you wish to enable [no-import-change](../README.md#no-import-change) UX for the cluster, set `SPARK_RAPIDS_ML_NO_IMPORT_ENABLED=1` instead. The init script checks this cluster environment variable and modifies the runtime accordingly.
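For illustration only (this is a hypothetical sketch, not the contents of the actual init script): a cluster init script can branch on such an environment variable roughly like so:

```
# hypothetical sketch -- the real init-pip-cuda-11.8.sh may differ
if [[ "${SPARK_RAPIDS_ML_NO_IMPORT_ENABLED:-0}" == "1" ]]; then
  echo "enabling spark_rapids_ml no-import-change mode"
  # e.g. arrange for `import spark_rapids_ml.install` to run before user code,
  # as described in the no-import-change section
fi
```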
- Start the configured cluster.
- Select your workspace and upload the desired [notebook](../) via `Import` in the drop down menu for your workspace. For the no-import-change UX, you can try the example [kmeans-no-import-change.ipynb](../kmeans-no-import-change.ipynb).