# Spark Rapids ML (Scala)

**NOTE**: The Scala algorithms are deprecated as of v25.04.

### PCA

Compared with the original Spark ML PCA training API:

```scala
val pca = new org.apache.spark.ml.feature.PCA()
  .setInputCol("feature_vector_type")
  .setOutputCol("feature_value_3d")
  .setK(3)
  .fit(vectorDf)
```

We provide a customized class, and users need essentially no code change to enjoy the GPU acceleration:

```scala
val pca = new com.nvidia.spark.ml.feature.PCA()
  .setInputCol("feature_array_type") // accepts an ArrayType column; no need to convert it to Vector type
  .setOutputCol("feature_value_3d")
  .setK(3)
  .fit(vectorDf)
...
```

Note: In the `CPU` version, `setInputCol` targets an input column of `Vector` type for the training
process. In the `GPU` version, users don't need the extra preprocessing step of converting an
`ArrayType` column to `Vector` type: `setInputCol` accepts the raw `ArrayType` column directly.
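
For illustration, here is a minimal sketch of the difference, written for `spark-shell` (where the
`spark` session is predefined); the data and column names are illustrative, not part of the API:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

// Toy data: a raw ArrayType column.
val arrayDf = Seq(
  Array(1.0, 2.0, 3.0, 4.0, 5.0),
  Array(2.0, 3.0, 4.0, 5.0, 6.0),
  Array(3.0, 4.0, 5.0, 6.0, 7.0),
  Array(4.0, 5.0, 6.0, 7.0, 8.0)
).toDF("feature_array_type")

// CPU version: the ArrayType column must first be converted to Vector
// (array_to_vector is available in Spark 3.1+).
import org.apache.spark.ml.functions.array_to_vector
val vectorDf = arrayDf.withColumn("feature_vector_type", array_to_vector(col("feature_array_type")))

// GPU version: fit on the ArrayType column as-is.
val gpuPca = new com.nvidia.spark.ml.feature.PCA()
  .setInputCol("feature_array_type")
  .setOutputCol("feature_value_3d")
  .setK(3)
  .fit(arrayDf)
```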

## Build

### Build in Docker:

We provide a Dockerfile to build the project in a container. See [docker](../docker/README.md) for more instructions.

### Prerequisites:

1. Essential build tools:
   - [cmake (>= 3.23.1)](https://cmake.org/download/)
   - [ninja (>= 1.10)](https://github.com/ninja-build/ninja/releases)
   - [gcc (>= 9.3)](https://gcc.gnu.org/releases.html)
2. [CUDA Toolkit (>= 11.5)](https://developer.nvidia.com/cuda-toolkit)
3. conda: use [miniconda](https://docs.conda.io/en/latest/miniconda.html) to maintain the header files
   and cmake dependencies
4. [cuDF](https://github.com/rapidsai/cudf):
   - install the cuDF shared library via conda:
     ```bash
     conda install -c rapidsai -c conda-forge cudf=22.04 python=3.8 -y
     ```
5. [RAFT (22.12)](https://github.com/rapidsai/raft):
   - RAFT provides only header files, so there is nothing to build. Note that we pin the version to
     22.12 to avoid potential API compatibility issues in the future.
     ```bash
     git clone -b branch-22.12 https://github.com/rapidsai/raft.git
     ```
6. Export `RAFT_PATH`:
   ```bash
   export RAFT_PATH=ABSOLUTE_PATH_TO_YOUR_RAFT_FOLDER
   ```

Note: For GPUs that do not have CUDA forward compatibility (for example, GeForce), CUDA 11.5 or later is required.

### Build target jar

Spark-rapids-ml uses the [spark-rapids](https://github.com/NVIDIA/spark-rapids) plugin as a dependency.
To build the _SNAPSHOT_ jar, users need to build and install the dependency jar _rapids-4-spark_ first,
because there is no snapshot jar for the spark-rapids plugin in public maven repositories.
See the [build instructions](https://github.com/NVIDIA/spark-rapids/blob/branch-23.04/CONTRIBUTING.md#building-a-distribution-for-multiple-versions-of-spark) to get the dependency jar installed.

Make sure _rapids-4-spark_ is installed in your local maven repository; then you can build the jar
directly in the _project root path_ with:
```bash
cd jvm
mvn clean package
```
Then `rapids-4-spark-ml_2.12-24.04.1-SNAPSHOT.jar` will be generated under the `target` folder.

Users can also use the _release_ version of the spark-rapids plugin as the dependency if it has already
been released in public maven repositories; see the [rapids-4-spark maven repository](https://mvnrepository.com/artifact/com.nvidia/rapids-4-spark)
for release versions. In this case, users don't need to manually build and install the spark-rapids
plugin jar themselves. Remember to replace the [dependency](https://github.com/NVIDIA/spark-rapids-ml/blob/branch-23.04/pom.xml#L94-L96)
in the pom file.

_Note_: This module contains both native and Java/Scala code. The native library build instructions
have been added to the pom.xml file, so the maven build command builds the native library along the
way. Make sure the prerequisites are all met, or the build will fail with error messages such as
"cmake not found" or "ninja not found".

## How to use

After the build steps above, the spark-rapids plugin jar will have been installed to your local maven
repository, usually under `~/.m2/repository`.

Add the artifact jars to Spark, for example:
```bash
ML_JAR="target/rapids-4-spark-ml_2.12-24.04.1-SNAPSHOT.jar"
PLUGIN_JAR="~/.m2/repository/com/nvidia/rapids-4-spark_2.12/24.04.1/rapids-4-spark_2.12-24.04.1.jar"

$SPARK_HOME/bin/spark-shell --master $SPARK_MASTER \
  --driver-memory 20G \
  --executor-memory 30G \
  --conf spark.driver.maxResultSize=8G \
  --jars ${ML_JAR},${PLUGIN_JAR} \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.task.resource.gpu.amount=0.08 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
  --files ${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh
```
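
Once the shell is up, you can exercise the GPU PCA directly. A minimal sketch, assuming the jars
above are on the classpath; the data and column names are illustrative:

```scala
import spark.implicits._

// Toy data: an ArrayType column, which the GPU PCA accepts directly.
val df = (0 until 100)
  .map(i => Array(i.toDouble, i * 2.0, math.sin(i.toDouble)))
  .toDF("features")

val model = new com.nvidia.spark.ml.feature.PCA()
  .setInputCol("features")
  .setOutputCol("pca_features")
  .setK(2)
  .fit(df)

// Project the input onto the top-2 principal components.
model.transform(df).select("pca_features").show(5, truncate = false)
```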

### PCA examples

Please refer to the
[PCA examples](https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.04/examples/ML+DL-Examples/Spark-cuML/pca/) for
more details about the example code. We provide both
[notebook](https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.04/examples/ML+DL-Examples/Spark-cuML/pca/notebooks/Spark_PCA_End_to_End.ipynb)
and [jar](https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.04/examples/ML+DL-Examples/Spark-cuML/pca/scala/src/com/nvidia/spark/examples/pca/Main.scala)
versions there. Instructions to run these examples are described in the
[README](https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.04/examples/ML+DL-Examples/Spark-cuML/pca/README.md).